CSE 251A Project 2: Coordinate Descent for Logistic
Regression
Yi Rong
βˆ—
Computer Science & Engineering
University of California, San Diego
rongyi@ucsd.edu
hi@rongyi.ai
Abstract
In this report, we discuss our attempt to use a better strat-
egy to pick the direction of coordinate descent at each step
rather than random selection. At each iteration, we pick the
gradient direction with the largest absolute value, and shows
that our method outperforms the random coordinate descent
in terms of convergence speed.
Keywords: Coordinate Descent, Logistic Regression
1 Introduction
Logistic Regression [
2
] is a regression analysis model widely
used to estimate the probablility of multiple target classes.
Dierent from linear regression, Logistic Regression esti-
mated the likelihood of each label, and further to that proba-
bility to predict the discrete label class. In this project, we
mainly use logistic regression as a class of optimization prob-
lems with the convex objective function to be minimized by
applying coordinate descent.
In the broad area of optimizations, the problem being
convex often makes solving for the global minimum or max-
imum much easier. The target function being convex means
ξ˜›nding a local minimum would yield the global minimum,
without the need to randomly escape a local optimum in
search of a better solution.
A good case for convex optimization is when the objective
function
𝑓
itself is twice dierentiable, then we can check
for its Hessian to be positive semideξ˜›nite to make sure the
function is convex, and solve for zero gradient
βˆ‡π‘“ =
0 to
get to the global minimum. And similarly for functions that
are dierentiable but not twice dierentiable, and we know
it’s convex, solving for the global minimum is the same.
In practice, we can use (stocastic) gradient descent for an
iterative approach to reach minimum.
However, not all the convex functions are dierentiable
everywhere. And to minimize the convex non-dierentiable
function, setting gradients to zero will not result in a local
βˆ—
Student ID: A14113185
Author’s address: Yi Rong, Computer Science & Engineering, University of
California, San Diego, rongyi@ucsd.edu, hi@rongyi.ai.
2018. 2475-1421/2018/1-ART1 $15.00
hps://doi.org/
minimum solution or not a solution at all. A typical exam-
ple is the
β„“
𝑝
norm cone, which is convex but not dieren-
tiable somewhere, the gradient is either positive or negative
across the domain of the function. In some cases of a non-
dierentiable function, we can use some subgradient tricks to
still get to the global minimum we want, but for norm cone
which reaches its minimum at origin, coordinate descent
works reasonably well.
The Coordinate Descent [
1
] approach is a very simple
technique that is surprisingly ecient and scalable by doing
just coordinate-wise minimization. The high-level idea is
that for a multivariable function
𝑓
, we minimize it along one
direction each time, and over multiple iterations, we would
gradually reach the a local minimum. Since the algorithm
iterates through one direction each time, picking the opti-
mal one at a speciξ˜›c iteration would greatly improve the
minimization / convergence speed.
In this project report, we mainly discuss our attempt to
pick a best direction for coordinate descent at each step,
and how it applies to solve logistic regression problem. One
simple way for this problem is to just pick a random direc-
tion every time, or use round-robin strategy to loop over
each direction. We designed a better approach which picks
the direction in which the gradient is largest, and evaluate
the performance of our approach to a naive baseline of ran-
domly picking a direction each time, and shows that our
approach converges way faster than the baseline under the
same condition.
The report will be structured as follows: we ξ˜›rst formu-
late the Logistic Regression problem and Coordinate Descent
method in Section 2, and then present two approaches of
picking one direction at a time in Section 3, one being choos-
ing uniformly randomly, the other being our approach which
picks the direction of gradient with maximum absolute value.
After that we evaluate our result in Section 4. Finally we con-
clude our report in Section 5.
2 Problem Formulation
Logistic Regression can be described as follows: The train-
ing dataset consisted of the features
𝑋 = {π‘₯
1
, π‘₯
2
, ..., π‘₯
𝑛
}
and
the corresponding categorical labels
π‘Œ = {𝑦
1
, 𝑦
2
, ...,𝑦
𝑛
}
,
with a mapping
𝑓 (π‘₯
𝑖
) = 𝑦
𝑖
. In our case the label is binary,
1
Conference’17, July 2017, Washington, DC, USA Yi Rong
only consisted of 0 and 1. We would like to train a classi-
ξ˜›er that returns the probablity of the label class given input
features, i.e.:
π‘ƒπ‘Ÿ (𝑦 = 1|π‘₯)
We use the log-loss as metric to evaluate the regression
classiξ˜›er, maximizing the maximum-likelihood for the pre-
diction:
𝐿(𝑀) = βˆ’
𝑛
ξ›•
𝑖=1
ln π‘ƒπ‘Ÿ
𝑀
(𝑦
(𝑖)
|π‘₯
(𝑖)
) =
𝑛
ξ›•
𝑖=1
𝑙𝑛(1 + 𝑒
βˆ’π‘¦
(𝑖 )
𝑀 Β·π‘₯
(𝑖 )
)
Coordinate Descent method can be formulated as an opti-
mization problem:
π‘šπ‘–π‘›
π‘₯
𝑓 (π‘₯)
where
𝑓 (π‘₯) = 𝑔(π‘₯) +
Í
𝑖
β„Ž
𝑖
(π‘₯
𝑖
)
, with
𝑔
being convex and
dierentiable, and each
β„Ž
𝑖
convex. The procedure starts with
ξ˜›rst sample π‘₯
(0)
and iteratively update π‘₯
𝑖
at step π‘˜:
π‘₯
(π‘˜)
𝑖
= π‘Žπ‘Ÿπ‘” min
π‘₯
𝑖
𝑓 (π‘₯
(π‘˜)
1
, ..., π‘₯
(π‘˜)
π‘–βˆ’1
, π‘₯
𝑖
, π‘₯
(π‘˜βˆ’1)
𝑖+1
, ..., π‘₯
(π‘˜βˆ’1)
𝑛
)
where 𝑖 = 1, ..., |π‘₯ |.
In our case it’s
π‘šπ‘–π‘›
𝑀
𝐿(𝑀)
, with
𝐿(𝑀)
being the objective
function mentioned above. And in this project, we want
to carefully pick a better
𝑖
(direction) for each
π‘˜
(sample /
iteration).
3 Coordinate Descent and Coordinate
Selection
The idea behind my attempt for better coordinate selection is
that for each training iteration, we can compute the gradient
of our objective function
𝐿(𝑀)
with respected to the weight
𝑀
, which we denote
βˆ‡
𝑀
𝐿(𝑀)
. Instead of randomly picking
one coordinate to apply the update, we would like to fully
utilize the additional information in the gradient
βˆ‡πΏ
by com-
paring the norm of partial derivatives along each direction.
Since the gradient can be written as:
βˆ‡
𝑀
𝐿(𝑀) = [
πœ•πΏ
πœ•π‘€
1
,
πœ•πΏ
πœ•π‘€
1
, ...,
πœ•πΏ
πœ•π‘€
𝑑
]
⊀
where
𝑀
𝑖
is the
𝑖
-th component of the weight vector, with
a total of
𝑑
dimensions. At each step
π‘˜
, we denote the weight
vector as
𝑀
(π‘˜)
𝑖
. We now pick the coordinate in which the
gradient has the maximum absolute value, the intuition of
which is that updating along that direction would give us
the largest elementwise weight update:
𝑖
π‘˜
= π‘Žπ‘Ÿπ‘”
𝑑
max
𝑗=1
|
πœ•πΏ
πœ•π‘€
(π‘˜βˆ’1)
𝑗
|
We then perform a weight update using the partial deriv-
ative (gradient along 𝑖
π‘˜
direction):
𝑀
(π‘˜)
𝑖
π‘˜
= 𝑀
(π‘˜)
𝑖
π‘˜
βˆ’ πœ‚ Β·
πœ•πΏ
πœ•π‘€
(π‘˜βˆ’1)
𝑖
π‘˜
We describe the full Coordinate Descent training process
as follows:
Algorithm 1 Coordinate Descent with Direction Selection
Input X, Y
Output W
1: X: Input Dataset Features
2: Y: Input Dataset Labels
3: W ← weight_init() ⊲ Randonly initialize weights
4: π‘˜ ← 0 ⊲ π‘˜ is the iteration index
5: for π‘˜ ≀ MAX_ITER do
6: π‘Œ ← π‘™π‘œπ‘”π‘–π‘‘ (WX) ⊲ logit is 𝜎(π‘₯) = 1/(1 + 𝑒
βˆ’π‘₯
)
7: βˆ‡
π‘Š
← (π‘Œ βˆ’ Y)X
8: 𝑖 ← pick_coordinate(βˆ‡
π‘Š
) ⊲ Random / Max
9: W[𝑖] ← W [𝑖] βˆ’ πœ‚
πœ•πΏ
πœ•W
𝑖
10: π‘™π‘œπ‘ π‘  ← log_loss(W, X, π‘Œ ) ⊲ Record Loss
11: end for
12: Return W
In the above algorithm, the
pick_coordinate
function is
the main function we tweak for our experiment. We run the
above training procedure for both functions and evaluate
their result. The
log_loss
function is used to compute the
loss, i.e. the objective function, and record the value for
visualization.
4 Evaluation
4.1 Experiment Results
During our experiments, we ran the exact same classiξ˜›ca-
tion task of logistic regression on the Wine dataset. We ran
the standard logistic regression from
scikit-learn
to serve
as the accuracy baseline and record its ξ˜›nal log-loss
πΏβˆ—
as
our target loss. We then ran two experiments using Coordi-
nate Descent, one with random direction (we call it Random
Coordinate Descent), the other picking the direction with
maximum gradient component (denoted Max Coordinate
Descent). The loss curve for each approach is presented in
Figure 1.
Before running Logistic Regression and Coordinate De-
scent, we preprocessed our input data by normalizing the fea-
tures using
scikit-learn
’s
StandardScaler()
. We found
that normalizing the data allows larger learning rate as well
as faster and more stable convergence. Turning o the nor-
malization does not aect the conclusion: our coordinate
descent logistic regression still converges closed to the opti-
mal loss πΏβˆ—, just way more slowly.
We ran the Max Coordinate Descent until convergence,
i.e. the loss value reaches the source of truth
πΏβˆ—
within
a margin of error
πœ– = Β±
1
𝑒 βˆ’
6, assume it’s
𝑆
π‘šπ‘Žπ‘₯
steps. But
we did not ran Random Coordinate Descent to convergence,
2
CSE 251A Project 2: Coordinate Descent for Logistic Regression Conference’17, July 2017, Washington, DC, USA
instead we ran the same number of steps
𝑆
π‘šπ‘Žπ‘₯
and plot the
loss dierence.
The
πΏβˆ—
in my case is 3
.
3
𝑒 βˆ’
5, that is with
liblinear
solver,
and the regularization strength set to 1
𝑒 βˆ’
10 (cannot be
entirely disabled for that solver) to minimize the regularizer’s
eect. The learning rate for both coordinate descent methods
are set to 1
𝑒 βˆ’
2, and the
𝑆
π‘šπ‘Žπ‘₯
under this setting is 88868
steps.
0 20000 40000 60000 80000
# of iteration
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
loss
Loss Curve Comparison
Random Coordinate Descent
Max Coordinate Descent
Logistic Regression
Figure 1. Performance comparison between Random Selec-
tion and Max Value
From the Loss curve 1, we can see that Max Coordinate
Descent outperforms Randon Coordinate Descent in that
it converges much faster and make more valuable weight
updates on each steps, as well as reaches the target optimal
loss ξ˜›rst.
We present more data, including Prediction Accuracy,
Logrithmic-Scale Loss Curves in the Appendix section.
4.2 Future Work
Currently the Max Coordinate Descent is able to converge
to the optimal loss in a reasonable time, but there’s plenty
of room for improvement.
For example, the current objective function
𝐿(π‘Š )
does not
have regularization, but we can add L1 or L2 regularization
to the problem, and I believe it will result in even faster
convergence.
𝐿(𝑀) =
𝑛
ξ›•
𝑖=1
𝑙𝑛(1 + 𝑒
βˆ’π‘¦
(𝑖 )
𝑀 Β·π‘₯
(𝑖 )
) + πœ†||𝑀 ||
2
The coordinate descent process still works as is, only with
an additional 2
πœ†π‘€
added to the end of the gradient calcula-
tion.
5 Conclusion
In conclusion, we presented an alternative approach to select
a better coordinate at each step of Coordinate Descent by
picking the partial derivative with largest norm, and we
showed that our approach outperforms the random one in
every experiment, and reaches the optimal loss established
by the standard Logistic Regression library function.
References
[1]
Zhi-Quan Luo and Paul Tseng. 1992. On the convergence of the coordi-
nate descent method for convex dierentiable minimization. Journal of
Optimization Theory and Applications 72, 1 (1992), 7–35.
[2] Raymond E Wright. 1995. Logistic regression. (1995).
A Appendix
A.1 Misc
β€’
I did not ξ˜›nish the optional Sparse Coordinate Descent
part.
β€’
I did not attach the full jupyter notebook due to exces-
sive logging, but it’s always available upon request.
A.2 Additional Graphs
0 20000 40000 60000 80000
# of iteration
0.0
0.2
0.4
0.6
0.8
1.0
accuracy
Accuracy Curve
Random Coordinate Descent
Max Coordinate Descent
Figure 2. Accuracy Curve for both approaches
0 20000 40000 60000 80000
# of iteration
10
4
10
3
10
2
10
1
10
0
loss
Loss Curve Comparison (Log Scale)
Random Coordinate Descent
Max Coordinate Descent
Logistic Regression
Figure 3. Loss Curve, Log Scale
3
Conference’17, July 2017, Washington, DC, USA Yi Rong
A.3 Full Source Code for Coordinate Descent
class CoordDescentLR:
def __init__(self, X, y, lr=1e-2,
iter=10000, random=False):↩→
self.X = np.insert(X, 0, 1, axis=1)
self.y = y
self.lr = lr
self.w = np.zeros((len(self.X[0]), 1))
self.w += np.random.normal(0, 0.01, (14,
1))↩→
self.iter = iter
self.random = random
self.loss = []
self.acc = []
def logit(self, x):
return 1.0 / (1 + np.exp(-x))
def pick_coord(self, grad):
if self.random:
return np.random.randint(0, 13)
else:
return np.argmax(np.abs(grad))
def loss_acc(self):
Y = self.logit(self.X.dot(self.w))
loss = 0
for i in range(len(self.y)):
if self.y[i] == 0:
loss -= np.log(1 - Y[i])
else:
loss -= np.log(Y[i])
loss = 1.0 * loss / len(self.y)
Y[Y > 0.5] = 1
Y[Y <= 0.5] = 0
acc = 1.0 * np.sum([a == b for (a, b) in
zip(Y, self.y)]) / len(self.y)↩→
return loss, acc
def fit(self):
it = 0
loss = 1
while it < self.iter:
Y = self.logit(self.X.dot(self.w))
gradient = np.sum((Y - self.y) *
self.X, axis=0) / len(self.X)↩→
idx = self.pick_coord(gradient)
self.w[idx] = self.w[idx] - self.lr *
gradient[idx]↩→
loss, acc = self.loss_acc()
if it % 1000 == 0:
print(it, "loss = %0.6f" % loss,
"acc = %0.2f" % acc)↩→
self.loss.append(loss)
self.acc.append(acc)
it += 1
return loss
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine
data, labels = load_wine(return_X_y=True)
data = data[:(59 + 71)]
labels = np.expand_dims(labels[:(59 + 71)], axis
= 1)↩→
data = StandardScaler().fit_transform(data)
cdlr = CoordDescentLR(data, labels, lr = 1e-2)
loss = cdlr.fit()
A.4 Raw Experiment Data
A.4.1 Loss and Accuracy for Max Coordinate Descent.
Printed every 1000 step
0 loss = 0.687646 acc = 0.65
1000 loss = 0.218549 acc = 0.95
2000 loss = 0.151939 acc = 0.98
3000 loss = 0.121732 acc = 0.98
4000 loss = 0.101711 acc = 0.99
5000 loss = 0.086859 acc = 0.99
6000 loss = 0.075294 acc = 0.99
7000 loss = 0.066020 acc = 1.00
8000 loss = 0.058410 acc = 1.00
9000 loss = 0.052049 acc = 1.00
10000 loss = 0.046651 acc = 1.00
11000 loss = 0.042014 acc = 1.00
12000 loss = 0.037987 acc = 1.00
13000 loss = 0.034461 acc = 1.00
14000 loss = 0.031349 acc = 1.00
15000 loss = 0.028585 acc = 1.00
16000 loss = 0.026117 acc = 1.00
17000 loss = 0.023903 acc = 1.00
18000 loss = 0.021908 acc = 1.00
19000 loss = 0.020103 acc = 1.00
20000 loss = 0.018464 acc = 1.00
21000 loss = 0.016972 acc = 1.00
22000 loss = 0.015610 acc = 1.00
23000 loss = 0.014366 acc = 1.00
24000 loss = 0.013227 acc = 1.00
25000 loss = 0.012183 acc = 1.00
26000 loss = 0.011225 acc = 1.00
27000 loss = 0.010344 acc = 1.00
28000 loss = 0.009533 acc = 1.00
29000 loss = 0.008784 acc = 1.00
30000 loss = 0.008092 acc = 1.00
31000 loss = 0.007454 acc = 1.00
32000 loss = 0.006864 acc = 1.00
4
CSE 251A Project 2: Coordinate Descent for Logistic Regression Conference’17, July 2017, Washington, DC, USA
33000 loss = 0.006319 acc = 1.00
34000 loss = 0.005816 acc = 1.00
35000 loss = 0.005351 acc = 1.00
36000 loss = 0.004921 acc = 1.00
37000 loss = 0.004524 acc = 1.00
38000 loss = 0.004158 acc = 1.00
39000 loss = 0.003819 acc = 1.00
40000 loss = 0.003507 acc = 1.00
41000 loss = 0.003218 acc = 1.00
42000 loss = 0.002953 acc = 1.00
43000 loss = 0.002707 acc = 1.00
44000 loss = 0.002482 acc = 1.00
45000 loss = 0.002274 acc = 1.00
46000 loss = 0.002082 acc = 1.00
47000 loss = 0.001906 acc = 1.00
48000 loss = 0.001744 acc = 1.00
49000 loss = 0.001595 acc = 1.00
50000 loss = 0.001458 acc = 1.00
51000 loss = 0.001333 acc = 1.00
52000 loss = 0.001218 acc = 1.00
53000 loss = 0.001112 acc = 1.00
54000 loss = 0.001015 acc = 1.00
55000 loss = 0.000926 acc = 1.00
56000 loss = 0.000845 acc = 1.00
57000 loss = 0.000771 acc = 1.00
58000 loss = 0.000703 acc = 1.00
59000 loss = 0.000641 acc = 1.00
60000 loss = 0.000584 acc = 1.00
61000 loss = 0.000532 acc = 1.00
62000 loss = 0.000484 acc = 1.00
63000 loss = 0.000441 acc = 1.00
64000 loss = 0.000401 acc = 1.00
65000 loss = 0.000365 acc = 1.00
66000 loss = 0.000332 acc = 1.00
67000 loss = 0.000302 acc = 1.00
68000 loss = 0.000275 acc = 1.00
69000 loss = 0.000250 acc = 1.00
70000 loss = 0.000227 acc = 1.00
71000 loss = 0.000206 acc = 1.00
72000 loss = 0.000188 acc = 1.00
73000 loss = 0.000171 acc = 1.00
74000 loss = 0.000155 acc = 1.00
75000 loss = 0.000141 acc = 1.00
76000 loss = 0.000128 acc = 1.00
77000 loss = 0.000116 acc = 1.00
78000 loss = 0.000105 acc = 1.00
79000 loss = 0.000096 acc = 1.00
80000 loss = 0.000087 acc = 1.00
81000 loss = 0.000079 acc = 1.00
82000 loss = 0.000072 acc = 1.00
83000 loss = 0.000065 acc = 1.00
84000 loss = 0.000059 acc = 1.00
85000 loss = 0.000053 acc = 1.00
86000 loss = 0.000048 acc = 1.00
87000 loss = 0.000044 acc = 1.00
88000 loss = 0.000040 acc = 1.00
A.4.2 Loss and Accuracy for Max Coordinate Descent.
Printed every 1000 step
0 loss = 0.714214 acc = 0.05
1000 loss = 0.418063 acc = 0.92
2000 loss = 0.319870 acc = 0.92
3000 loss = 0.268120 acc = 0.93
4000 loss = 0.233155 acc = 0.93
5000 loss = 0.211913 acc = 0.95
6000 loss = 0.195072 acc = 0.95
7000 loss = 0.182140 acc = 0.96
8000 loss = 0.170792 acc = 0.96
9000 loss = 0.161120 acc = 0.96
10000 loss = 0.153495 acc = 0.96
11000 loss = 0.146775 acc = 0.96
12000 loss = 0.140248 acc = 0.96
13000 loss = 0.134625 acc = 0.96
14000 loss = 0.129832 acc = 0.96
15000 loss = 0.126017 acc = 0.96
16000 loss = 0.122225 acc = 0.97
17000 loss = 0.118584 acc = 0.97
18000 loss = 0.115092 acc = 0.97
19000 loss = 0.111811 acc = 0.97
20000 loss = 0.108983 acc = 0.97
21000 loss = 0.106358 acc = 0.97
22000 loss = 0.103915 acc = 0.97
23000 loss = 0.101491 acc = 0.97
24000 loss = 0.099183 acc = 0.97
25000 loss = 0.096969 acc = 0.98
26000 loss = 0.095136 acc = 0.98
27000 loss = 0.093320 acc = 0.98
28000 loss = 0.091567 acc = 0.98
29000 loss = 0.089839 acc = 0.98
30000 loss = 0.088243 acc = 0.98
31000 loss = 0.086767 acc = 0.98
32000 loss = 0.085416 acc = 0.98
33000 loss = 0.084090 acc = 0.98
34000 loss = 0.082839 acc = 0.98
35000 loss = 0.081557 acc = 0.98
36000 loss = 0.080332 acc = 0.98
37000 loss = 0.079133 acc = 0.98
38000 loss = 0.078124 acc = 0.98
39000 loss = 0.077072 acc = 0.98
40000 loss = 0.076118 acc = 0.98
41000 loss = 0.075163 acc = 0.98
42000 loss = 0.074254 acc = 0.99
43000 loss = 0.073262 acc = 0.99
44000 loss = 0.072341 acc = 0.99
45000 loss = 0.071435 acc = 0.99
46000 loss = 0.070682 acc = 0.99
47000 loss = 0.069840 acc = 0.99
48000 loss = 0.069012 acc = 0.99
5
Conference’17, July 2017, Washington, DC, USA Yi Rong
49000 loss = 0.068217 acc = 0.99
50000 loss = 0.067529 acc = 0.99
51000 loss = 0.066867 acc = 0.99
52000 loss = 0.066161 acc = 0.99
53000 loss = 0.065503 acc = 0.99
54000 loss = 0.064885 acc = 0.99
55000 loss = 0.064272 acc = 0.99
56000 loss = 0.063670 acc = 0.99
57000 loss = 0.063058 acc = 0.99
58000 loss = 0.062456 acc = 0.99
59000 loss = 0.061888 acc = 0.99
60000 loss = 0.061358 acc = 0.99
61000 loss = 0.060799 acc = 0.99
62000 loss = 0.060313 acc = 0.99
63000 loss = 0.059789 acc = 0.99
64000 loss = 0.059300 acc = 0.99
65000 loss = 0.058776 acc = 0.99
66000 loss = 0.058285 acc = 0.99
67000 loss = 0.057842 acc = 0.99
68000 loss = 0.057375 acc = 0.99
69000 loss = 0.056929 acc = 0.99
70000 loss = 0.056517 acc = 0.99
71000 loss = 0.056103 acc = 0.99
72000 loss = 0.055707 acc = 0.99
73000 loss = 0.055290 acc = 0.99
74000 loss = 0.054882 acc = 0.99
75000 loss = 0.054494 acc = 0.99
76000 loss = 0.054123 acc = 0.99
77000 loss = 0.053780 acc = 0.99
78000 loss = 0.053420 acc = 0.99
79000 loss = 0.053045 acc = 0.99
80000 loss = 0.052699 acc = 0.99
81000 loss = 0.052360 acc = 0.99
82000 loss = 0.052037 acc = 0.99
83000 loss = 0.051714 acc = 0.99
84000 loss = 0.051380 acc = 0.99
85000 loss = 0.051060 acc = 1.00
86000 loss = 0.050747 acc = 1.00
87000 loss = 0.050445 acc = 1.00
88000 loss = 0.050143 acc = 1.00
6