CSE 251A Project 2: Coordinate Descent for Logistic

Regression

Yi Rong

∗

Computer Science & Engineering

University of California, San Diego

rongyi@ucsd.edu

hi@rongyi.ai

Abstract

In this report, we discuss our attempt to use a better strat-

egy to pick the direction of coordinate descent at each step

rather than random selection. At each iteration, we pick the

gradient direction with the largest absolute value, and shows

that our method outperforms the random coordinate descent

in terms of convergence speed.

Keywords: Coordinate Descent, Logistic Regression

1 Introduction

Logistic Regression [

] is a regression analysis model widely

used to estimate the probablility of multiple target classes.

Dierent from linear regression, Logistic Regression esti-

mated the likelihood of each label, and further to that proba-

bility to predict the discrete label class. In this project, we

mainly use logistic regression as a class of optimization prob-

lems with the convex objective function to be minimized by

applying coordinate descent.

In the broad area of optimizations, the problem being

convex often makes solving for the global minimum or max-

imum much easier. The target function being convex means

nding a local minimum would yield the global minimum,

without the need to randomly escape a local optimum in

search of a better solution.

A good case for convex optimization is when the objective

function

𝑓

itself is twice dierentiable, then we can check

for its Hessian to be positive semidenite to make sure the

function is convex, and solve for zero gradient

∇𝑓 =

0 to

get to the global minimum. And similarly for functions that

are dierentiable but not twice dierentiable, and we know

it’s convex, solving for the global minimum is the same.

In practice, we can use (stocastic) gradient descent for an

iterative approach to reach minimum.

However, not all the convex functions are dierentiable

everywhere. And to minimize the convex non-dierentiable

function, setting gradients to zero will not result in a local

∗

Student ID: A14113185

Author’s address: Yi Rong, Computer Science & Engineering, University of

California, San Diego, rongyi@ucsd.edu, hi@rongyi.ai.

2018. 2475-1421/2018/1-ART1 $15.00

hps://doi.org/

minimum solution or not a solution at all. A typical exam-

ple is the

ℓ

𝑝

norm cone, which is convex but not dieren-

tiable somewhere, the gradient is either positive or negative

across the domain of the function. In some cases of a non-

dierentiable function, we can use some subgradient tricks to

still get to the global minimum we want, but for norm cone

which reaches its minimum at origin, coordinate descent

works reasonably well.

The Coordinate Descent [

] approach is a very simple

technique that is surprisingly ecient and scalable by doing

just coordinate-wise minimization. The high-level idea is

that for a multivariable function

𝑓

, we minimize it along one

direction each time, and over multiple iterations, we would

gradually reach the a local minimum. Since the algorithm

iterates through one direction each time, picking the opti-

mal one at a specic iteration would greatly improve the

minimization / convergence speed.

In this project report, we mainly discuss our attempt to

pick a best direction for coordinate descent at each step,

and how it applies to solve logistic regression problem. One

simple way for this problem is to just pick a random direc-

tion every time, or use round-robin strategy to loop over

each direction. We designed a better approach which picks

the direction in which the gradient is largest, and evaluate

the performance of our approach to a naive baseline of ran-

domly picking a direction each time, and shows that our

approach converges way faster than the baseline under the

same condition.

The report will be structured as follows: we rst formu-

late the Logistic Regression problem and Coordinate Descent

method in Section 2, and then present two approaches of

picking one direction at a time in Section 3, one being choos-

ing uniformly randomly, the other being our approach which

picks the direction of gradient with maximum absolute value.

After that we evaluate our result in Section 4. Finally we con-

clude our report in Section 5.

2 Problem Formulation

Logistic Regression can be described as follows: The train-

ing dataset consisted of the features

𝑋 = {𝑥

, 𝑥

, ..., 𝑥

𝑛

}

and

the corresponding categorical labels

𝑌 = {𝑦

, 𝑦

, ...,𝑦

𝑛

}

with a mapping

𝑓 (𝑥

𝑖

) = 𝑦

𝑖

. In our case the label is binary,

Conference’17, July 2017, Washington, DC, USA Yi Rong

only consisted of 0 and 1. We would like to train a classi-

er that returns the probablity of the label class given input

features, i.e.:

𝑃𝑟 (𝑦 = 1|𝑥)

We use the log-loss as metric to evaluate the regression

classier, maximizing the maximum-likelihood for the pre-

diction:

𝐿(𝑤) = −

𝑛



𝑖=1

ln 𝑃𝑟

𝑤

(𝑦

(𝑖)

|𝑥

(𝑖)

) =

𝑛



𝑖=1

𝑙𝑛(1 + 𝑒

−𝑦

(𝑖 )

𝑤 ·𝑥

(𝑖 )

)

Coordinate Descent method can be formulated as an opti-

mization problem:

𝑚𝑖𝑛

𝑥

𝑓 (𝑥)

where

𝑓 (𝑥) = 𝑔(𝑥) +

𝑖

ℎ

𝑖

(𝑥

𝑖

)

, with

𝑔

being convex and

dierentiable, and each

ℎ

𝑖

convex. The procedure starts with

rst sample 𝑥

(0)

and iteratively update 𝑥

𝑖

at step 𝑘:

𝑥

(𝑘)

𝑖

= 𝑎𝑟𝑔 min

𝑥

𝑖

𝑓 (𝑥

(𝑘)

, ..., 𝑥

(𝑘)

𝑖−1

, 𝑥

𝑖

, 𝑥

(𝑘−1)

𝑖+1

, ..., 𝑥

(𝑘−1)

𝑛

)

where 𝑖 = 1, ..., |𝑥 |.

In our case it’s

𝑚𝑖𝑛

𝑤

𝐿(𝑤)

, with

𝐿(𝑤)

being the objective

function mentioned above. And in this project, we want

to carefully pick a better

𝑖

(direction) for each

𝑘

(sample /

iteration).

3 Coordinate Descent and Coordinate

Selection

The idea behind my attempt for better coordinate selection is

that for each training iteration, we can compute the gradient

of our objective function

𝐿(𝑤)

with respected to the weight

𝑤

, which we denote

∇

𝑤

𝐿(𝑤)

. Instead of randomly picking

one coordinate to apply the update, we would like to fully

utilize the additional information in the gradient

∇𝐿

by com-

paring the norm of partial derivatives along each direction.

Since the gradient can be written as:

∇

𝑤

𝐿(𝑤) = [

𝜕𝐿

𝜕𝑤

𝜕𝐿

𝜕𝑤

, ...,

𝜕𝐿

𝜕𝑤

𝑑

]

⊤

where

𝑤

𝑖

is the

𝑖

-th component of the weight vector, with

a total of

𝑑

dimensions. At each step

𝑘

, we denote the weight

vector as

𝑤

(𝑘)

𝑖

. We now pick the coordinate in which the

gradient has the maximum absolute value, the intuition of

which is that updating along that direction would give us

the largest elementwise weight update:

𝑖

𝑘

= 𝑎𝑟𝑔

𝑑

max

𝑗=1

𝜕𝐿

𝜕𝑤

(𝑘−1)

𝑗

We then perform a weight update using the partial deriv-

ative (gradient along 𝑖

𝑘

direction):

𝑤

(𝑘)

𝑖

𝑘

= 𝑤

(𝑘)

𝑖

𝑘

− 𝜂 ·

𝜕𝐿

𝜕𝑤

(𝑘−1)

𝑖

𝑘

We describe the full Coordinate Descent training process

as follows:

Algorithm 1 Coordinate Descent with Direction Selection

Input X, Y

Output W

1: X: Input Dataset Features

2: Y: Input Dataset Labels

3: W ← weight_init() ⊲ Randonly initialize weights

4: 𝑘 ← 0 ⊲ 𝑘 is the iteration index

5: for 𝑘 ≤ MAX_ITER do

6: 𝑌 ← 𝑙𝑜𝑔𝑖𝑡 (WX) ⊲ logit is 𝜎(𝑥) = 1/(1 + 𝑒

−𝑥

)

7: ∇

𝑊

← (𝑌 − Y)X

8: 𝑖 ← pick_coordinate(∇

𝑊

) ⊲ Random / Max

9: W[𝑖] ← W [𝑖] − 𝜂

𝜕𝐿

𝜕W

𝑖

10: 𝑙𝑜𝑠𝑠 ← log_loss(W, X, 𝑌 ) ⊲ Record Loss

11: end for

12: Return W

In the above algorithm, the

pick_coordinate

function is

the main function we tweak for our experiment. We run the

above training procedure for both functions and evaluate

their result. The

log_loss

function is used to compute the

loss, i.e. the objective function, and record the value for

visualization.

4 Evaluation

4.1 Experiment Results

During our experiments, we ran the exact same classica-

tion task of logistic regression on the Wine dataset. We ran

the standard logistic regression from

scikit-learn

to serve

as the accuracy baseline and record its nal log-loss

𝐿∗

our target loss. We then ran two experiments using Coordi-

nate Descent, one with random direction (we call it Random

Coordinate Descent), the other picking the direction with

maximum gradient component (denoted Max Coordinate

Descent). The loss curve for each approach is presented in

Figure 1.

Before running Logistic Regression and Coordinate De-

scent, we preprocessed our input data by normalizing the fea-

tures using

scikit-learn

’s

StandardScaler()

. We found

that normalizing the data allows larger learning rate as well

as faster and more stable convergence. Turning o the nor-

malization does not aect the conclusion: our coordinate

descent logistic regression still converges closed to the opti-

mal loss 𝐿∗, just way more slowly.

We ran the Max Coordinate Descent until convergence,

i.e. the loss value reaches the source of truth

𝐿∗

within

a margin of error

𝜖 = ±

𝑒 −

6, assume it’s

𝑆

𝑚𝑎𝑥

steps. But

we did not ran Random Coordinate Descent to convergence,

CSE 251A Project 2: Coordinate Descent for Logistic Regression Conference’17, July 2017, Washington, DC, USA

instead we ran the same number of steps

𝑆

𝑚𝑎𝑥

and plot the

loss dierence.

The

𝐿∗

in my case is 3

𝑒 −

5, that is with

liblinear

solver,

and the regularization strength set to 1

𝑒 −

10 (cannot be

entirely disabled for that solver) to minimize the regularizer’s

eect. The learning rate for both coordinate descent methods

are set to 1

𝑒 −

2, and the

𝑆

𝑚𝑎𝑥

under this setting is 88868

steps.

0 20000 40000 60000 80000

# of iteration

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

loss

Loss Curve Comparison

Random Coordinate Descent

Max Coordinate Descent

Logistic Regression

Figure 1. Performance comparison between Random Selec-

tion and Max Value

From the Loss curve 1, we can see that Max Coordinate

Descent outperforms Randon Coordinate Descent in that

it converges much faster and make more valuable weight

updates on each steps, as well as reaches the target optimal

loss rst.

We present more data, including Prediction Accuracy,

Logrithmic-Scale Loss Curves in the Appendix section.

4.2 Future Work

Currently the Max Coordinate Descent is able to converge

to the optimal loss in a reasonable time, but there’s plenty

of room for improvement.

For example, the current objective function

𝐿(𝑊 )

does not

have regularization, but we can add L1 or L2 regularization

to the problem, and I believe it will result in even faster

convergence.

𝐿(𝑤) =

𝑛



𝑖=1

𝑙𝑛(1 + 𝑒

−𝑦

(𝑖 )

𝑤 ·𝑥

(𝑖 )

) + 𝜆||𝑤 ||

The coordinate descent process still works as is, only with

an additional 2

𝜆𝑤

added to the end of the gradient calcula-

tion.

5 Conclusion

In conclusion, we presented an alternative approach to select

a better coordinate at each step of Coordinate Descent by

picking the partial derivative with largest norm, and we

showed that our approach outperforms the random one in

every experiment, and reaches the optimal loss established

by the standard Logistic Regression library function.

References

[1]

Zhi-Quan Luo and Paul Tseng. 1992. On the convergence of the coordi-

nate descent method for convex dierentiable minimization. Journal of

Optimization Theory and Applications 72, 1 (1992), 7–35.

[2] Raymond E Wright. 1995. Logistic regression. (1995).

A Appendix

A.1 Misc

•

I did not nish the optional Sparse Coordinate Descent

part.

•

I did not attach the full jupyter notebook due to exces-

sive logging, but it’s always available upon request.

A.2 Additional Graphs

0 20000 40000 60000 80000

# of iteration

0.0

0.2

0.4

0.6

0.8

1.0

accuracy

Accuracy Curve

Random Coordinate Descent

Max Coordinate Descent

Figure 2. Accuracy Curve for both approaches

0 20000 40000 60000 80000

# of iteration

loss

Loss Curve Comparison (Log Scale)

Random Coordinate Descent

Max Coordinate Descent

Logistic Regression

Figure 3. Loss Curve, Log Scale

Conference’17, July 2017, Washington, DC, USA Yi Rong

A.3 Full Source Code for Coordinate Descent

class CoordDescentLR:

def __init__(self, X, y, lr=1e-2,

iter=10000, random=False):↩→

self.X = np.insert(X, 0, 1, axis=1)

self.y = y

self.lr = lr

self.w = np.zeros((len(self.X[0]), 1))

self.w += np.random.normal(0, 0.01, (14,

1))↩→

self.iter = iter

self.random = random

self.loss = []

self.acc = []

def logit(self, x):

return 1.0 / (1 + np.exp(-x))

def pick_coord(self, grad):

if self.random:

return np.random.randint(0, 13)

else:

return np.argmax(np.abs(grad))

def loss_acc(self):

Y = self.logit(self.X.dot(self.w))

loss = 0

for i in range(len(self.y)):

if self.y[i] == 0:

loss -= np.log(1 - Y[i])

else:

loss -= np.log(Y[i])

loss = 1.0 * loss / len(self.y)

Y[Y > 0.5] = 1

Y[Y <= 0.5] = 0

acc = 1.0 * np.sum([a == b for (a, b) in

zip(Y, self.y)]) / len(self.y)↩→

return loss, acc

def fit(self):

it = 0

loss = 1

while it < self.iter:

Y = self.logit(self.X.dot(self.w))

gradient = np.sum((Y - self.y) *

self.X, axis=0) / len(self.X)↩→

idx = self.pick_coord(gradient)

self.w[idx] = self.w[idx] - self.lr *

gradient[idx]↩→

loss, acc = self.loss_acc()

if it % 1000 == 0:

print(it, "loss = %0.6f" % loss,

"acc = %0.2f" % acc)↩→

self.loss.append(loss)

self.acc.append(acc)

it += 1

return loss

import numpy as np

from sklearn.preprocessing import StandardScaler

from sklearn.datasets import load_wine

data, labels = load_wine(return_X_y=True)

data = data[:(59 + 71)]

labels = np.expand_dims(labels[:(59 + 71)], axis

= 1)↩→

data = StandardScaler().fit_transform(data)

cdlr = CoordDescentLR(data, labels, lr = 1e-2)

loss = cdlr.fit()

A.4 Raw Experiment Data

A.4.1 Loss and Accuracy for Max Coordinate Descent.

Printed every 1000 step

0 loss = 0.687646 acc = 0.65

1000 loss = 0.218549 acc = 0.95

2000 loss = 0.151939 acc = 0.98

3000 loss = 0.121732 acc = 0.98

4000 loss = 0.101711 acc = 0.99

5000 loss = 0.086859 acc = 0.99

6000 loss = 0.075294 acc = 0.99

7000 loss = 0.066020 acc = 1.00

8000 loss = 0.058410 acc = 1.00

9000 loss = 0.052049 acc = 1.00

10000 loss = 0.046651 acc = 1.00

11000 loss = 0.042014 acc = 1.00

12000 loss = 0.037987 acc = 1.00

13000 loss = 0.034461 acc = 1.00

14000 loss = 0.031349 acc = 1.00

15000 loss = 0.028585 acc = 1.00

16000 loss = 0.026117 acc = 1.00

17000 loss = 0.023903 acc = 1.00

18000 loss = 0.021908 acc = 1.00

19000 loss = 0.020103 acc = 1.00

20000 loss = 0.018464 acc = 1.00

21000 loss = 0.016972 acc = 1.00

22000 loss = 0.015610 acc = 1.00

23000 loss = 0.014366 acc = 1.00

24000 loss = 0.013227 acc = 1.00

25000 loss = 0.012183 acc = 1.00

26000 loss = 0.011225 acc = 1.00

27000 loss = 0.010344 acc = 1.00

28000 loss = 0.009533 acc = 1.00

29000 loss = 0.008784 acc = 1.00

30000 loss = 0.008092 acc = 1.00

31000 loss = 0.007454 acc = 1.00

32000 loss = 0.006864 acc = 1.00

CSE 251A Project 2: Coordinate Descent for Logistic Regression Conference’17, July 2017, Washington, DC, USA

33000 loss = 0.006319 acc = 1.00

34000 loss = 0.005816 acc = 1.00

35000 loss = 0.005351 acc = 1.00

36000 loss = 0.004921 acc = 1.00

37000 loss = 0.004524 acc = 1.00

38000 loss = 0.004158 acc = 1.00

39000 loss = 0.003819 acc = 1.00

40000 loss = 0.003507 acc = 1.00

41000 loss = 0.003218 acc = 1.00

42000 loss = 0.002953 acc = 1.00

43000 loss = 0.002707 acc = 1.00

44000 loss = 0.002482 acc = 1.00

45000 loss = 0.002274 acc = 1.00

46000 loss = 0.002082 acc = 1.00

47000 loss = 0.001906 acc = 1.00

48000 loss = 0.001744 acc = 1.00

49000 loss = 0.001595 acc = 1.00

50000 loss = 0.001458 acc = 1.00

51000 loss = 0.001333 acc = 1.00

52000 loss = 0.001218 acc = 1.00

53000 loss = 0.001112 acc = 1.00

54000 loss = 0.001015 acc = 1.00

55000 loss = 0.000926 acc = 1.00

56000 loss = 0.000845 acc = 1.00

57000 loss = 0.000771 acc = 1.00

58000 loss = 0.000703 acc = 1.00

59000 loss = 0.000641 acc = 1.00

60000 loss = 0.000584 acc = 1.00

61000 loss = 0.000532 acc = 1.00

62000 loss = 0.000484 acc = 1.00

63000 loss = 0.000441 acc = 1.00

64000 loss = 0.000401 acc = 1.00

65000 loss = 0.000365 acc = 1.00

66000 loss = 0.000332 acc = 1.00

67000 loss = 0.000302 acc = 1.00

68000 loss = 0.000275 acc = 1.00

69000 loss = 0.000250 acc = 1.00

70000 loss = 0.000227 acc = 1.00

71000 loss = 0.000206 acc = 1.00

72000 loss = 0.000188 acc = 1.00

73000 loss = 0.000171 acc = 1.00

74000 loss = 0.000155 acc = 1.00

75000 loss = 0.000141 acc = 1.00

76000 loss = 0.000128 acc = 1.00

77000 loss = 0.000116 acc = 1.00

78000 loss = 0.000105 acc = 1.00

79000 loss = 0.000096 acc = 1.00

80000 loss = 0.000087 acc = 1.00

81000 loss = 0.000079 acc = 1.00

82000 loss = 0.000072 acc = 1.00

83000 loss = 0.000065 acc = 1.00

84000 loss = 0.000059 acc = 1.00

85000 loss = 0.000053 acc = 1.00

86000 loss = 0.000048 acc = 1.00

87000 loss = 0.000044 acc = 1.00

88000 loss = 0.000040 acc = 1.00

A.4.2 Loss and Accuracy for Max Coordinate Descent.

Printed every 1000 step

0 loss = 0.714214 acc = 0.05

1000 loss = 0.418063 acc = 0.92

2000 loss = 0.319870 acc = 0.92

3000 loss = 0.268120 acc = 0.93

4000 loss = 0.233155 acc = 0.93

5000 loss = 0.211913 acc = 0.95

6000 loss = 0.195072 acc = 0.95

7000 loss = 0.182140 acc = 0.96

8000 loss = 0.170792 acc = 0.96

9000 loss = 0.161120 acc = 0.96

10000 loss = 0.153495 acc = 0.96

11000 loss = 0.146775 acc = 0.96

12000 loss = 0.140248 acc = 0.96

13000 loss = 0.134625 acc = 0.96

14000 loss = 0.129832 acc = 0.96

15000 loss = 0.126017 acc = 0.96

16000 loss = 0.122225 acc = 0.97

17000 loss = 0.118584 acc = 0.97

18000 loss = 0.115092 acc = 0.97

19000 loss = 0.111811 acc = 0.97

20000 loss = 0.108983 acc = 0.97

21000 loss = 0.106358 acc = 0.97

22000 loss = 0.103915 acc = 0.97

23000 loss = 0.101491 acc = 0.97

24000 loss = 0.099183 acc = 0.97

25000 loss = 0.096969 acc = 0.98

26000 loss = 0.095136 acc = 0.98

27000 loss = 0.093320 acc = 0.98

28000 loss = 0.091567 acc = 0.98

29000 loss = 0.089839 acc = 0.98

30000 loss = 0.088243 acc = 0.98

31000 loss = 0.086767 acc = 0.98

32000 loss = 0.085416 acc = 0.98

33000 loss = 0.084090 acc = 0.98

34000 loss = 0.082839 acc = 0.98

35000 loss = 0.081557 acc = 0.98

36000 loss = 0.080332 acc = 0.98

37000 loss = 0.079133 acc = 0.98

38000 loss = 0.078124 acc = 0.98

39000 loss = 0.077072 acc = 0.98

40000 loss = 0.076118 acc = 0.98

41000 loss = 0.075163 acc = 0.98

42000 loss = 0.074254 acc = 0.99

43000 loss = 0.073262 acc = 0.99

44000 loss = 0.072341 acc = 0.99

45000 loss = 0.071435 acc = 0.99

46000 loss = 0.070682 acc = 0.99

47000 loss = 0.069840 acc = 0.99

48000 loss = 0.069012 acc = 0.99

Conference’17, July 2017, Washington, DC, USA Yi Rong

49000 loss = 0.068217 acc = 0.99

50000 loss = 0.067529 acc = 0.99

51000 loss = 0.066867 acc = 0.99

52000 loss = 0.066161 acc = 0.99

53000 loss = 0.065503 acc = 0.99

54000 loss = 0.064885 acc = 0.99

55000 loss = 0.064272 acc = 0.99

56000 loss = 0.063670 acc = 0.99

57000 loss = 0.063058 acc = 0.99

58000 loss = 0.062456 acc = 0.99

59000 loss = 0.061888 acc = 0.99

60000 loss = 0.061358 acc = 0.99

61000 loss = 0.060799 acc = 0.99

62000 loss = 0.060313 acc = 0.99

63000 loss = 0.059789 acc = 0.99

64000 loss = 0.059300 acc = 0.99

65000 loss = 0.058776 acc = 0.99

66000 loss = 0.058285 acc = 0.99

67000 loss = 0.057842 acc = 0.99

68000 loss = 0.057375 acc = 0.99

69000 loss = 0.056929 acc = 0.99

70000 loss = 0.056517 acc = 0.99

71000 loss = 0.056103 acc = 0.99

72000 loss = 0.055707 acc = 0.99

73000 loss = 0.055290 acc = 0.99

74000 loss = 0.054882 acc = 0.99

75000 loss = 0.054494 acc = 0.99

76000 loss = 0.054123 acc = 0.99

77000 loss = 0.053780 acc = 0.99

78000 loss = 0.053420 acc = 0.99

79000 loss = 0.053045 acc = 0.99

80000 loss = 0.052699 acc = 0.99

81000 loss = 0.052360 acc = 0.99

82000 loss = 0.052037 acc = 0.99

83000 loss = 0.051714 acc = 0.99

84000 loss = 0.051380 acc = 0.99

85000 loss = 0.051060 acc = 1.00

86000 loss = 0.050747 acc = 1.00

87000 loss = 0.050445 acc = 1.00

88000 loss = 0.050143 acc = 1.00