Linear Regression under Interval Truncation
Huiwen Lu, Kanlin Wang, Yi Rong, Sihan Liu
March 15, 2022
1 Introduction
In traditional linear regression, we try to recover a hidden model parameter $w^*$ from samples $(x, y)$ of the form $y = w^{*T} x + \epsilon$, where $\epsilon$ is sampled from some noise distribution. Classical results show that $w^*$ can be recovered within $\ell_2$-reconstruction error $O(\sqrt{k/n})$, where $n$ is the number of observable samples and $k$ is the dimension of $w^*$. However, this
kind of classical technique does not apply to partially observable data, namely, the truncated setting. Learning from truncated samples is a long-standing challenge: truncation arises whenever samples falling outside some observation window are never recorded. Such situations are common in biological research, the social sciences, business, and economics, whether due to limitations of the data-collecting devices or inherent flaws of the sampling process.
If we ignore the truncation, as shown in various experiments, the resulting regression generalizes poorly in regions where data is missing. Recently, a series of works (see [DGTZ19], [DRZ20], [DSYZ21]) has led to theoretically sound truncated linear regression with optimal sample complexity, addressing a challenge first posed in 1958. While these are all polynomial-time algorithms, their run-time is far from practical due to complicated projection steps that rely heavily on the ellipsoid method. On the other hand, these works assume that the data truncation is arbitrary and that only oracle access to the truncation set is provided, while in practice the censoring setting, where the truncation set is convex, is more ubiquitous.
Our Contributions. In this work, we give an efficient truncated linear-regression algorithm tailored to the censoring setting. The run-time of the algorithm is in the worst case bounded by $O(n^3)$, where $n$ is the input data size, and is better on average when the data follows certain structured distributions, showing that truncated linear regression can be practical.
Distribution of Work. We collaboratively wrote an outline and a full draft of the final project report, and shared our implementation codebase for the necessary experiments. The workload was divided roughly equally.
1.1 Related Work
1. Truncated Linear Regression. [DRZ20] provided an efficient algorithm for the high-dimensional truncated linear regression problem, in which $w^*$ can be recovered within $\ell_2$-reconstruction error $O(\sqrt{(s \log k)/n})$, where $s$ is the sparsity of $w^*$. Their sparsity assumption is one we do not adopt, since we do not focus on the sparse setting.
2. Computationally and Statistically Efficient Truncated Regression. [DGTZ19] used Projected Stochastic Gradient Descent (PSGD) to recover $w^*$ with $\ell_2$-reconstruction error $O(\sqrt{k/n})$. They make two main assumptions about the observable samples, which we also use. However, with their definition of the projection set, each projection step has polynomial dependency on the number of data points, which increases the overall time complexity.
3. Efficient Truncated Linear Regression with Unknown Noise Variance. [DSYZ21] revisit the classical challenge of truncated linear regression with a projection step that does not need to go through all the data points. They use three assumptions to obtain a very simple projection set, which can be too restrictive for real-life problems, and their guarantee depends exponentially on the survival probability. Our projection set is more involved, fits a wider range of real problems, and has only polynomial dependency on the survival probability.
4. Computing the Approximate Convex Hull in High Dimensions. [SV16] give an effective method, running in time $O(T^{3/2} n^2 \log(T/\epsilon_0))$, for finding an approximation of the convex hull, where $T$ is close to the number of vertices of the approximation.
2 Problem Formulation and Result
We study the following restricted version of truncated linear regression where the truncation set is assumed to be an interval $S = [a, b]$. Fix an optimal model parameter $w^* \in \mathbb{R}^k$; the truncated samples are generated through the following process:
1. Take samples $x \in \mathbb{R}^k$ either arbitrarily or from a distribution $\mathcal{D}$.
2. Generate the corresponding $y$ label as
\[ y = x^T w^* + \epsilon, \tag{1} \]
where $\epsilon \sim N(0, 1)$.
3. Return $(x, y)$ if $y \in [a, b]$.
Let $\left(x^{(1)}, y^{(1)}\right), \ldots, \left(x^{(n)}, y^{(n)}\right)$ be $n$ samples obtained from the above sampling procedure. Our goal is to find $\hat{w}$ which minimizes the loss $\|\hat{w} - w^*\|_X$, where $X = \sum_{i=1}^{n} x^{(i)} x^{(i)T}$. The following concept will be important to our discussion.
Definition 2.1 (Survival Probability). Let $S$ be a measurable subset of $\mathbb{R}$. Given $x \in \mathbb{R}^k$ and $w \in \mathbb{R}^k$, we define the survival probability $\alpha(w, x; S)$ of the sample with feature vector $x$ and parameters $w$ as
\[ \alpha(w, x; S) = N\left(w^T x, 1; S\right), \]
i.e., the probability mass that the Gaussian $N(w^T x, 1)$ assigns to $S$.
When S is clear from the context we may refer to α(w, x; S) simply as α(w, x).
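For concreteness, when $S = [a, b]$ the survival probability is just the Gaussian mass $\Phi(b - w^T x) - \Phi(a - w^T x)$. The minimal sketch below (the helper name and argument names are ours, not from the references) computes it with SciPy.

```python
import numpy as np
from scipy.stats import norm

def survival_prob(w, x, s_lo, s_hi):
    """alpha(w, x; S): mass that N(w^T x, 1) places on the interval S = [s_lo, s_hi]."""
    mean = float(np.dot(w, x))
    return norm.cdf(s_hi - mean) - norm.cdf(s_lo - mean)
```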
We will readily make the following assumptions which are standard in the literature of
truncated statistics and linear regression (See [DRZ20]).
Assumption 2.1 (Non-trivial Survival Probability). Let $\left(x^{(1)}, y^{(1)}\right), \ldots, \left(x^{(n)}, y^{(n)}\right)$ be samples from the regression model of Section 2. There exists a constant $a > 0$ such that
\[
\sum_{i=1}^{n} \log\frac{1}{\alpha\left(w^*, x^{(i)}\right)}\, x^{(i)} x^{(i)T} \preceq \log\frac{1}{a} \sum_{i=1}^{n} x^{(i)} x^{(i)T}.
\]
Assumption 2.2 (Thickness of Covariance Matrix of Covariates). Let $X$ be the $k \times k$ matrix defined as $X = \frac{1}{n}\sum_{i=1}^{n} x^{(i)} x^{(i)T}$, where $x^{(i)} \in \mathbb{R}^k$. Then for every $i \in [n]$, it holds that
\[
X \succeq \frac{\log(k)}{n}\, x^{(i)} x^{(i)T}, \qquad \text{and} \qquad X \succeq b^2 \cdot I
\]
for some value $b \in \mathbb{R}_+$.
Assumption 2.3 (Normalization/Data Generation). Assume that $\|w^*\|_2^2 \le \beta$, and $\|x\|_2^2 \le 1$ for all data points.
We provide some justification for Assumption 2.3, which may seem a bit unnatural at first glance. To solve our problem in the more general setting where we have only assumed that $\|x^{(i)}\| \le B$, we make sure, before running the algorithm, to divide all the $x^{(i)}$ by $B\sqrt{k}$. This way the norm of $w^*$ is multiplied by $B\sqrt{k}$. So for the rest of the proof we may safely assume $\|x^{(i)}\|_2^2 \le 1$ and $\|w^*\|_2^2 \le \beta$, where $\beta = B^2 \cdot k \cdot \tilde{\beta}$ for $\tilde{\beta}$ being the assumed bound on $\|w^*\|_2^2$ before the normalization transformation.
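In code, this normalization is a single rescaling of the covariates, followed by an inverse rescaling of the recovered parameter; a minimal sketch (assuming the bound $B$ is known) is given below.

```python
import numpy as np

def normalize_covariates(X, B):
    """Divide every x^(i) by B * sqrt(k) so that ||x^(i)||_2 <= 1.
    Any parameter estimated on the rescaled data must be divided by the
    same factor to recover an estimate for the original problem."""
    n, k = X.shape
    scale = B * np.sqrt(k)
    return X / scale, scale

# Usage sketch: X_norm, scale = normalize_covariates(X, B)
#               ... run the algorithm on X_norm to obtain w_hat_norm ...
#               w_hat = w_hat_norm / scale
```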
Besides these, we also make the following assumption about the spatial distribution of the data points. Its motivation will become clear after we walk through our approach in Section 3.1.
Assumption 2.4. We assume that the convex hull of the data points $x^{(1)}, x^{(2)}, \ldots, x^{(n)}$ has at most $\gamma$ vertices.
Now, we are ready to present our main theorem.
Theorem 2.2. Let $\left(x^{(1)}, y^{(1)}\right), \ldots, \left(x^{(n)}, y^{(n)}\right)$ be $n$ samples from the linear regression model defined in Section 2 with parameter $w^*$. If Assumptions 2.1, 2.2, 2.3 hold, then there exists an algorithm with success probability at least $2/3$ that outputs $\hat{w} \in \mathbb{R}^k$ such that
\[
\|\hat{w} - w^*\|_2 \le \frac{1}{a^2} \cdot \sqrt{\frac{k \cdot \beta}{n}}\, \log(n).
\]
Moreover, if Assumption 2.4 holds, the algorithm runs in time
\[
\tilde{O}\left(n^2 \cdot \gamma^{3/2} \cdot k + n \cdot \gamma \cdot k^{3/2}\right).
\]
3 Approaches
We follow the approach in [DRZ20]. Namely, we try to optimize the loss function corresponding to the maximum log-likelihood of the underlying model. Let
\[
\ell(w; x, y) = \frac{1}{2}\left(y - w^T x\right)^2 + \log \int_{z \in S} \exp\left(-\frac{1}{2}\left(z - w^T x\right)^2\right) dz.
\]
Then, we try to optimize the following convex function:
\[
\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{y \sim N\left(w^{*T} x^{(i)},\, 1,\, S\right)}\left[\ell\left(w; x^{(i)}, y\right)\right]. \tag{2}
\]
This is an unconstrained optimization task, so we omit the derivation of its dual. Note that we do have a constrained optimization task within the scope of our project, and its dual is derived in the appendix. In [DRZ20], it has been shown that the objective function is convex and that $w = w^*$ is its unique minimizer. Hence, given that we are able to efficiently solve this optimization task, we can safely recover the hidden model parameter $w^*$ in $\ell_2$ distance.
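For intuition, the gradient of $\ell(w; x, y)$ with respect to $w$ is $\left(\mathbb{E}_{z \sim N(w^T x, 1, S)}[z] - y\right) x$, so an unbiased stochastic gradient is obtained by replacing the expectation with a single truncated-normal sample. A minimal sketch using rejection sampling is given below (function names are ours; rejection sampling is only practical when the survival probability is not too small).

```python
import numpy as np

def sample_truncated_normal(mean, s_lo, s_hi, rng, max_tries=10000):
    """Rejection-sample z ~ N(mean, 1) conditioned on z in [s_lo, s_hi]."""
    for _ in range(max_tries):
        z = rng.normal(mean, 1.0)
        if s_lo <= z <= s_hi:
            return z
    raise RuntimeError("survival probability too small for rejection sampling")

def stochastic_gradient(w, x, y, s_lo, s_hi, rng):
    """Unbiased estimate of the gradient of l(w; x, y): (z - y) * x,
    where z ~ N(w^T x, 1, S) and (x, y) is an observed (truncated) sample."""
    z = sample_truncated_normal(float(np.dot(w, x)), s_lo, s_hi, rng)
    return (z - y) * x
```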
3.1 Improvement on Projection Set
In [DRZ20], the optimization is performed through PSGD (Projected Stochastic Gradient Descent) over the following convex set (see Algorithm 1 in [DRZ20] for details):
\[
D_r = \left\{ w \in \mathbb{R}^k \;\middle|\; \sum_{i=1}^{n} \left(y^{(i)} - w^T x^{(i)}\right)^2 x^{(i)} x^{(i)T} \preceq r \cdot \sum_{i=1}^{n} x^{(i)} x^{(i)T} \right\}. \tag{3}
\]
The rationale behind $D_r$ is that SGD works best when $\ell(w)$ is strongly convex. $D_r$ is one possible choice of projection set on which $\ell(w)$ is strongly convex, namely $\mathrm{poly}(a)$-strongly convex, where $a$ is the lower bound on the survival probability of an individual data point. However, in practice it may be prohibitively expensive to compute projections onto $D_r$, or even to perform membership queries for $D_r$. More specifically, in order to verify that $w \in D_r$, one needs to iterate through all data points, compute the covariance-like expression, and test its semi-definiteness. The routine outlined in [DRZ20] to perform the projection consists of a binary search process that uses the ellipsoid algorithm as a sub-routine. Though the run-time remains polynomial, the approach quickly becomes impractical as the number of data points, which is at least $\Omega(d^2)$, grows.
In the work of [DSYZ21], a simpler projection set is proposed:
\[
D = \left\{ (v, \lambda) = \left(\frac{w}{\sigma^2}, \frac{1}{\sigma^2}\right) \in \mathbb{R}^k \times \mathbb{R} \;\middle|\; \frac{1}{8\left(5 + 2\log(1/a)\right)} \le \frac{\sigma^2}{\sigma_0^2} \le \frac{96}{a^2},\; \|w\|_2^2 \le \beta \right\}.
\]
A key merit of this projection set is that a closed-form solution exists for projecting any point $x \in \mathbb{R}^k$ onto the set, hence allowing a very efficient implementation of PSGD. However, the primary bottleneck is that the convexity structure of the objective function $\ell(w)$ may become exponentially weak with respect to the survival probability $a$. In particular, the only stated theoretical guarantee is that $\ell(w)$ is $\exp(-\mathrm{poly}(1/a))$-strongly convex. If a data point $x$ happens to fall in a region where $w^T x$ is near the boundary of the interval, the convergence rate of the proposed algorithm may deteriorate significantly.
As our main contribution, leveraging the underlying convexity properties of the truncation set, we propose the following projection set (and variants, discussed in detail in Section 3.2), which ensures strong convexity of $\ell(w)$ as well as an efficient projection procedure.
Definition 3.1 (Modified Projection Set). We define the projection set as
\[
D = \left\{ w \in \mathbb{R}^k \;\middle|\; \forall i \in [n],\; \alpha\left(w, x^{(i)}\right) \ge a \right\}. \tag{4}
\]
One nice property of the truncation set is that the projection task can be reduced to projection onto a properly defined approximate convex hull of the data points $x^{(i)}$ (see [SV16] for efficient algorithms to compute it). If we compute the extreme points of the approximate convex hull in advance, the projection task reduces to a quadratic program whose number of variables is much smaller than the number of data points. For such a task, solvers that are much more efficient than the ellipsoid method exist. For example, if an interior-point method is used, the number of iterations is of order $\tilde{O}(\sqrt{r})$, where $r$ is the number of constraints of the program. It is easy to see that the number of constraints is exactly the number of vertices of this "approximate" convex hull of the data points, which makes Assumption 2.4 play a particularly important role in characterizing the computational complexity of our algorithm.
Next, we explain why we expect the number of extreme points of the convex hull to be much smaller than the number of data points in practice. In the worst case, $\gamma$ can be as large as $n$. However, when the data points are i.i.d. samples from a specific distribution, $\gamma$ is usually much smaller than $n$. In particular, for any distribution having a density with respect to the Lebesgue measure on $\mathbb{R}^d$, [Dwy91] shows that $\gamma = o(n)$. Furthermore, the following results are known for the family of spherically symmetric distributions.
Theorem 3.2. Given a probability measure $p$ on $\mathbb{R}^d$ that is spherically symmetric, i.e. $p(x) = p(y)$ whenever $\|x\|_2 = \|y\|_2$, let $F(x) := \Pr_{v \sim p}\left[\|v\|_2 \le x\right]$.
- If $F$ is a power law, i.e. $F(x) = x^{k} L(x)$ with $k > 0$ and $L(x)$ slowly varying, Assumption 2.4 holds with $\gamma = \Theta(1)$.
- If $F$ has an exponential tail, Assumption 2.4 holds with $\gamma = \Theta\left(\log^{(d-1)/2} n\right)$.
- For distributions on the unit $d$-ball satisfying $F(1 - x) \ge c x^{k}$ for positive $k$, Assumption 2.4 holds with $\gamma = \Theta\left(n^{(d-1)/(2k+d-1)}\right)$.
Another crucial observation about the projection set in Definition 3.1 is that the objective function defined in Equation (2) enjoys good properties which ensure fast convergence of PSGD.
Lemma 3.3. Let $\left(x^{(1)}, y^{(1)}\right), \ldots, \left(x^{(n)}, y^{(n)}\right)$ be $n$ samples from the linear regression model of Section 2 with parameter $w^*$. If Assumptions 2.1, 2.2 and 2.3 hold, then
\[
\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left\|\left(y^{(i)} - z^{(i)}\right) x^{(i)}\right\|_2^2\right] \le O(\beta + 1/a),
\]
\[
\bar{H}(w) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left(z^{(i)} - \mathbb{E}\left[z^{(i)}\right]\right)^2 x^{(i)} x^{(i)T}\right] \succeq \Omega(a^2)\, I,
\]
where $y^{(i)} \sim N\left(w^{*T} x^{(i)}, 1, S\right)$, $z^{(i)} \sim N\left(w^{T} x^{(i)}, 1, S\right)$, and $w \in D$.
Combining this result with a standard result about PSGD gives our main theorem. In Section 3.2, we discuss the technical details of performing the projection step; in Section 3.3, we prove Lemma 3.3 and conclude the proof of Theorem 2.2.
3.2 Efficient Projection
First, to verify that some $w \in \mathbb{R}^k$ lies in $D$, we claim that we do not have to check every data point $x^{(i)}$. Instead, let $v^{(1)}, \ldots, v^{(\gamma)}$ be the extreme points of the convex hull formed by the data points. Then it suffices to check that the inequality holds for every $v^{(i)}$. This is due to the fact that $w^T x^{(i)} = w^T \sum_{j=1}^{\gamma} \alpha_j \cdot v^{(j)} = \sum_j \alpha_j\, w^T v^{(j)}$ for some convex-combination weights $\alpha_j$. Since the constraint is convex when the truncation set $S$ is an interval, we automatically have $\mathrm{dist}\left(\sum_j \alpha_j\, w^T v^{(j)}, S\right) \le \delta$.
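In other words, membership in $D$ can be tested using only the $\gamma$ extreme points. A small sketch is shown below, reusing the hypothetical `survival_prob` helper from Section 2.

```python
def in_projection_set(w, vertices, s_lo, s_hi, a):
    """Check alpha(w, v) >= a for every extreme point v of the (approximate) convex hull.
    vertices: iterable of the extreme points v^(1), ..., v^(gamma)."""
    return all(survival_prob(w, v, s_lo, s_hi) >= a for v in vertices)
```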
In general, finding the exact convex hull of points in $k$ dimensions has exponential dependency on $k$. Fortunately, by the work of [SV16], we are able to find an approximate convex hull efficiently. Next, we give the formal definition of an approximate convex hull.
Definition 3.4. Let $S$ be a convex hull whose extreme points are $\{x^{(i)}\}$. Then we define the distance from a point $z$ to the convex hull $S$ as
\[
d(z, S)^2 = \min_{\alpha_i} \left\| z - \sum_{i=1}^{|S|} \alpha_i x^{(i)} \right\|^2 \quad \text{s.t.} \quad \alpha_i \ge 0, \;\; \sum_{i=1}^{|S|} \alpha_i = 1. \tag{5}
\]
The $\delta$-approximate convex hull of $S = \{x^{(1)}, \ldots, x^{(n)}\}$ is then the convex hull of a minimal subset $E \subseteq S$ such that for all $z \in S$, it holds that $d(z, E) \le \delta$.
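The distance in Equation (5) is itself a small quadratic program over the convex-combination weights; a sketch using CVXPY (which our experiments already depend on) is given below, with all names ours.

```python
import cvxpy as cp
import numpy as np

def dist_to_hull(z, extreme_points):
    """d(z, S) from Equation (5): Euclidean distance from z to the convex hull
    of the rows of extreme_points (an (m, k) array)."""
    m, _ = extreme_points.shape
    alpha = cp.Variable(m, nonneg=True)
    objective = cp.Minimize(cp.norm(z - extreme_points.T @ alpha, 2))
    problem = cp.Problem(objective, [cp.sum(alpha) == 1])
    problem.solve()
    return problem.value
```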
In [SV16], an efficient algorithm for computing approximate convex hulls is given.
Theorem 3.5. Given $n$ data points in $k$ dimensions, assume that the $\delta$-approximate convex hull has $\gamma$ extreme points. Then there exists an efficient algorithm which computes the $\delta$-approximate convex hull in time $O\left(\gamma^{3/2} \cdot n^2 \cdot k \cdot \log(\gamma)\right)$.
Since we only compute the approximate convex hull of the data points, the projection set that is actually used by the algorithm will not be exactly the one defined in Definition 3.1, but rather an "approximate" one given below.
Definition 3.6 (Approximate Projection Set). Let $v^{(1)}, \ldots, v^{(\gamma)}$ be the extreme points of a $\delta$-approximate convex hull of the data points. We define the projection set as
\[
D_\delta = \left\{ w \in \mathbb{R}^k \;\middle|\; \forall i \in [\gamma],\; \alpha\left(w, v^{(i)}\right) \ge a \right\}. \tag{6}
\]
Now we are ready to present the algorithm for the projection step, given that the extreme points $v^{(1)}, \ldots, v^{(\gamma)}$ have been computed beforehand. In particular, we need to solve the following optimization problem:
\[
\min_{w} \; \|w - w_0\|_2^2 \quad \text{subject to} \quad q \le w^T v^{(i)} \le p \quad \text{for all } i \in [\gamma], \tag{7}
\]
where $[q, p]$ is the appropriately chosen interval such that, for the truncation set $S$, $q = \arg\min_z \left\{ N(z, 1; S) \ge a \right\}$ and $p = \arg\max_z \left\{ N(z, 1; S) \ge a \right\}$. We provide the dual of this QP in the appendix. The QP can be solved using an interior-point method (see [BBV04]). In [CWZ13], it has been shown that the number of iterations for solving a QP with $r$ independent variables up to accuracy $\epsilon$ is of order $O\left(\sqrt{r} \cdot \log(r)\right)$. In our setting, this implies that the projection step takes time at most $\tilde{O}\left(k^{3/2} \cdot \gamma\right)$.
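For illustration, the projection step (7) can be written directly as a small quadratic program; the CVXPY sketch below is one straightforward (though not the fastest) way to solve it, where `vertices` stacks the extreme points $v^{(1)}, \ldots, v^{(\gamma)}$ as rows and $q, p$ are the interval endpoints defined above. The names are ours, not part of any reference implementation.

```python
import cvxpy as cp

def project_onto_D(w0, vertices, q, p):
    """Solve problem (7): the point of D_delta closest to w0, where D_delta
    is described by q <= w^T v^(i) <= p for every extreme point v^(i)."""
    k = w0.shape[0]
    w = cp.Variable(k)
    constraints = [vertices @ w <= p, vertices @ w >= q]
    problem = cp.Problem(cp.Minimize(cp.sum_squares(w - w0)), constraints)
    problem.solve()
    return w.value
```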
Lemma 3.7. Let $D_\delta$ be defined as in Definition 3.6. Then there exists an efficient algorithm which projects any point $w \in \mathbb{R}^k$ onto $D_\delta$ in time at most $\tilde{O}\left(k^{3/2} \cdot \gamma\right)$.
3.3 Bounded Gradient Variance and Strong Convexity
In this section, we prove Lemma 3.3, which asserts that, if $w$ lies within our projection set, the gradient estimator has bounded variance and the objective function is strongly convex. To do so, we first prove a structural lemma about the survival probability of any data point under $w \in D_\delta$.
Lemma 3.8. Suppose $w \in D_\delta$ with $\delta < \frac{a}{10 \cdot \beta}$. Then for any $x \in \left\{x^{(1)}, \ldots, x^{(n)}\right\}$, it holds that $\alpha(w, x) \ge \Omega(a)$.
Proof. Let $H$ be the $\delta$-approximate convex hull of the data points. By the definition of $D_\delta$, we have that $\alpha(w, \tilde{x}) \ge a$ for all $\tilde{x} \in H$ and $w \in D_\delta$. By the definition of the approximate convex hull, for every $x \in \left\{x^{(1)}, \ldots, x^{(n)}\right\}$ there exists $\tilde{x} \in H$ such that $\mathrm{dist}(\tilde{x}, x) \le \delta$. It then follows that $\left|w^T x - w^T \tilde{x}\right| \le \sqrt{\beta} \cdot \delta \le a/10$. Notice that the survival probabilities $\alpha(w, \tilde{x})$ and $\alpha(w, x)$ are exactly the masses that the normal distributions $N\left(w^T \tilde{x}, 1\right)$ and $N\left(w^T x, 1\right)$ assign to the truncation interval $S$. Since the total variation distance between these two normal distributions is at most $O\left(\left|w^T x - w^T \tilde{x}\right|\right)$, the result follows.
Next, we give the proof of Lemma 3.3. The proof is similar to the one in [DGTZ19], and arguably simpler thanks to the structural property of $D_\delta$ from Lemma 3.8.
Proof of Lemma 3.3. Using the triangle inequality, we have that
\[
\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left\|\left(y^{(i)} - z^{(i)}\right) x^{(i)}\right\|_2^2\right]
\le \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left\|\left(y^{(i)} - w^{*T} x^{(i)}\right) x^{(i)}\right\|_2^2\right]
+ \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left\|\left(z^{(i)} - w^{T} x^{(i)}\right) x^{(i)}\right\|_2^2\right]
+ \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left\|x^{(i)T}\left(w^* - w\right) x^{(i)}\right\|_2^2\right].
\]
Since $\left\|x^{(i)}\right\|_2 \le 1$ and $\|w^*\|_2 \le \sqrt{\beta}$, we have that
\[
\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left\|x^{(i)T}\left(w^* - w\right) x^{(i)}\right\|_2^2\right] \le 2 \cdot \beta.
\]
Using Lemma 3 in [DRZ20], we have that
\[
\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left\|\left(y^{(i)} - w^{*T} x^{(i)}\right) x^{(i)}\right\|_2^2\right]
\le \frac{1}{n} \sum_{i=1}^{n} \left(2 \log \frac{1}{\alpha\left(w^*, x^{(i)}\right)} + 4\right) \left\|x^{(i)}\right\|_2^2
\le 2 \log \frac{1}{a} + 4,
\]
where the last inequality follows from Lemma 3.8. Similarly, we have that
\[
\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left\|\left(z^{(i)} - w^{T} x^{(i)}\right) x^{(i)}\right\|_2^2\right] \le 2 \log \frac{1}{a} + 4.
\]
Adding all the inequalities together, we have that
\[
\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left\|\left(y^{(i)} - z^{(i)}\right) x^{(i)}\right\|_2^2\right] \le O(\beta + 1/a).
\]
Next, we show that the objective function is strongly convex. Using Lemma 9 in [DRZ20], Lemma 3.8, and Assumption 2.2, we have that
\[
\sum_{i=1}^{n} \mathbb{E}\left[\left(z^{(i)} - \mathbb{E}\left[z^{(i)}\right]\right)^2 x^{(i)} x^{(i)T}\right]
\succeq \sum_{i=1}^{n} \frac{a^2}{12}\, x^{(i)} x^{(i)T}
= \frac{a^2}{12} \sum_{i=1}^{n} x^{(i)} x^{(i)T}
\succeq \frac{a^2}{12}\, b^2\, I.
\]
Overall, we then have
\[
\bar{H}(w) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left(z^{(i)} - \mathbb{E}\left[z^{(i)}\right]\right)^2 x^{(i)} x^{(i)T}\right] \succeq \Omega(a^2) \cdot I.
\]
Lastly, we will use the following standard result about the convergence rate of PSGD. Combining it with Lemmas 3.3 and 3.7 gives Theorem 2.2.
Theorem 3.9 (Theorem 3 of [Sha16]). Let $f : \mathbb{R}^k \to \mathbb{R}$ be a convex function such that $f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$, where $f_i(w) = c_i \cdot w^T x^{(i)} + q(w)$, $c_i \in \mathbb{R}$, and $x^{(i)} \in \mathbb{R}^k$ with $\left\|x^{(i)}\right\|_2 \le 1$. Let also $w^{(1)}, \ldots, w^{(n)}$ be the sequence produced by the PSGD algorithm, where $v^{(1)}, \ldots, v^{(n)}$ is a sequence of random vectors such that $\mathbb{E}\left[v^{(i)} \mid w^{(i-1)}\right] = \nabla f_i\left(w^{(i-1)}\right)$ for all $i \in [n]$, and let $w^* = \arg\min_{w \in D} f(w)$ be a minimizer of $f$. If we assume the following:
1. bounded variance step: $\mathbb{E}\left[\left\|v^{(i)}\right\|_2^2\right] \le \rho^2$,
2. strong convexity: $q(w)$ is $\lambda$-strongly convex,
3. bounded parameters: the diameter of $D$ is at most $\rho$ and also $\max_i |c_i| \le \rho$,
then $\mathbb{E}[f(\bar{w})] - f(w^*) \le c \cdot \frac{\rho^2}{\lambda n} \cdot \log(n)$, where $\bar{w}$ is the output of the PSGD algorithm and $c \in \mathbb{R}_+$.
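Putting the pieces together, the overall algorithm is PSGD over $D_\delta$ with the gradient estimator from Section 3. The sketch below reuses the hypothetical helpers `stochastic_gradient` and `project_onto_D` from the earlier snippets and fixes a simple $1/(\lambda t)$ step size, which is one common choice rather than the exact schedule of [Sha16].

```python
import numpy as np

def truncated_regression_psgd(X, y, vertices, s_lo, s_hi, q, p, lam, n_steps, seed=0):
    """PSGD sketch for objective (2). X: (n, k) covariates, y: (n,) truncated labels."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    w = np.zeros(k)
    iterates = []
    for t in range(1, n_steps + 1):
        i = rng.integers(n)                                      # pick a random sample
        g = stochastic_gradient(w, X[i], y[i], s_lo, s_hi, rng)  # unbiased gradient
        w = w - g / (lam * t)                                    # step size 1/(lambda * t)
        w = project_onto_D(w, vertices, q, p)                    # project back into D_delta
        iterates.append(w)
    return np.mean(iterates, axis=0)  # averaged iterate, as in Theorem 3.9
```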
4 Evaluation and Future Work
4.1 Experiments
We use a synthetic dataset to evaluate our approach and compare it against the others. We pick an arbitrary model parameter $w^* \in \mathbb{R}^k$ and generate $n$ random data points. The $y$ labels are then computed following Equation (1) with Gaussian noise $N(0, 1)$, and the truncated set is obtained by eliminating the portion of the data above a certain threshold with respect to $y$ (a minimal data-generation sketch appears after the list below). The plots are obtained through the following procedure:
1. Sort all the data points $x^{(i)}$ in increasing order of $y^{(i)}$.
2. Plot the points $(i, y^{(i)})$ as cyan points.
3. Plot the points $(i, x^{(i)T} w)$ as red points, where $w$ is the output of the corresponding algorithm.
4. Add a horizontal blue line representing the truncation threshold; all the points above that line are not visible to the algorithms.
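As referenced above, the synthetic data can be generated along the following lines (a minimal sketch; the exact seeds and scaling used in our experiments differ).

```python
import numpy as np

def make_truncated_dataset(n=200, k=30, visible_fraction=0.25, seed=0):
    """Generate y = x^T w* + N(0, 1) noise and keep only the smallest labels."""
    rng = np.random.default_rng(seed)
    w_star = rng.normal(size=k)
    X = rng.normal(size=(n, k)) / np.sqrt(k)        # keep covariate norms bounded
    y = X @ w_star + rng.normal(size=n)
    threshold = np.quantile(y, visible_fraction)    # truncation line
    visible = y <= threshold                        # bottom 25% is observable
    return X, y, visible, w_star
```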
We designed four algorithmic approaches to compare performance. We use linear regression on the full (untruncated) dataset as the ground truth. Since we generate the data in a linear fashion with random noise and this approach has access to all the data, it should represent the best possible outcome among the four, serving as an upper bound on algorithmic performance. Secondly, we run linear regression on the truncated dataset and evaluate it on the full dataset; we expect this to be the lower bound, since it does not take the data distribution into consideration and heavily overfits the truncated dataset.
Then we propose two convex-optimization-based approaches. The first one is constrained linear regression: rather than completely ignoring the invisible dataset, we use it as a constraint in our optimization:
\[
\begin{aligned}
\min_{c} \quad & \sum_{i=1}^{M} \left(y^{(i)} - c^{T} x^{(i)}\right)^2 \\
\text{subject to} \quad & c^{T} x^{(i)} \notin D \quad \text{for } i = M+1, \ldots, K.
\end{aligned}
\]
This serves as a naive way to improve linear regression on the truncated set. The intuition is that, for every $x^{(i)}$ not in the truncated dataset (i.e., invisible in the training set), we also enforce the corresponding prediction $c^T x^{(i)}$ to meet the truncation criterion, i.e., to fall outside the observable label range $D$.
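A sketch of this constrained least-squares fit in CVXPY is shown below; in our experimental setting the invisible labels lie above the truncation threshold, so the constraint reduces to forcing those predictions above the threshold (variable names are ours).

```python
import cvxpy as cp

def constrained_linear_regression(X_vis, y_vis, X_invis, threshold):
    """Fit the visible data by least squares while forcing predictions on the
    invisible covariates to lie beyond the truncation threshold."""
    k = X_vis.shape[1]
    c = cp.Variable(k)
    objective = cp.Minimize(cp.sum_squares(y_vis - X_vis @ c))
    constraints = [X_invis @ c >= threshold]
    problem = cp.Problem(objective, constraints)
    problem.solve()
    return c.value
```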
Finally, we compare all three of the above with our approach described in Section 3.
4.2 Environment Setup
We implemented the linear regression algorithm for approaches 1 and 2 with NumPy [HMvdW+20], using np.linalg.lstsq. The third, naive convex-optimization approach was implemented using CVXPY [DB16]. Our own approach was implemented using PyTorch [PGM+19] with a slightly modified PSGD procedure (full source code and a Jupyter notebook to reproduce the experiments are available upon request).
All of the algorithm implementations use the same pre-generated dataset for training, evaluation, and visualization, and we ran the experiments multiple times to minimize the effects of random number generation and Gaussian noise. The dataset is set to truncate 75% of the data, i.e., the algorithms (other than approach 1) can only see the bottom 25% of the data. The dataset contains 200 data points, each $x^{(i)}$ having 30 float features (i.e., 30-dimensional data). The label $y$ is a scalar, and the data points with the 50 smallest $y$ values are visible to the algorithms.
4.3 Performance
Table 1: Error rates for the different approaches

Algorithm                              | Error Rate ($\|y - \hat{y}\|$) | $\|w^* - \hat{w}\|$
Linear Regression, Full Dataset        | 2.516                          | 0.74
Linear Regression, Truncated Dataset   | 6.386                          | 2.23
Naive CVX Approach, Truncated Dataset  | 4.401                          | 1.49
Our Approach, Truncated Dataset        | 3.137                          | 0.94
In Figure 1, we show the result of linear regression when there is no truncation, as a benchmark for comparison; in Figure 2, we show the result of raw linear regression on the truncated dataset; in Figure 3, we show the result of the constrained linear regression implemented with the CVXPY package on the truncated dataset; finally, we show the result of our truncated linear regression algorithm in Figure 4.
From Table 1, we can see that our algorithm outperforms both the constrained linear regression and the raw linear regression on the truncated dataset.
[Figure 1: Linear Regression on Original Dataset (plot of $y$ against sorted observation index; legend: original data, LR on Full Dataset, truncation line).]
Using our novel approach on only 25% of the data, we managed to achieve an error rate very close to that of running linear regression on the entire dataset.
4.4 Future Work
In the future, we would like to (i) test the computational and statistical limits of the algorithm in practical settings, and (ii) extend the result to multi-dimensional linear regression, where the output also lies in a high-dimensional space. The approach can be easily adapted to work with an arbitrary truncation set with convex structure, and we are interested in the performance of our algorithm in this more general setting.
[Figure 2: Linear Regression on Truncated Dataset (plot of $y$ against sorted observation index; legend: original data, LR on truncated data, truncation line).]
[Figure 3: Alternative Approach with Truncation Filtering (plot of $y$ against sorted observation index; legend: original data, CVX Fit, truncation line).]
[Figure 4: Our Approach on Truncated Dataset (plot of $y$ against sorted observation index; legend: original data, Our Approach, truncation line).]
References
[BBV04] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
[CWZ13] Xinzhong Cai, Guoqiang Wang, and Zihou Zhang. Complexity analysis and numerical implementation of primal-dual interior-point methods for convex quadratic optimization based on a finite barrier. Numerical Algorithms, 62(2):289-306, 2013.
[DB16] Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1-5, 2016.
[DGTZ19] Constantinos Daskalakis, Themis Gouleakis, Christos Tzamos, and Manolis Zampetakis. Computationally and statistically efficient truncated regression. In Conference on Learning Theory, pages 955-960. PMLR, 2019.
[DRZ20] Constantinos Daskalakis, Dhruv Rohatgi, and Emmanouil Zampetakis. Truncated linear regression in high dimensions. Advances in Neural Information Processing Systems, 33:10338-10347, 2020.
[DSYZ21] Constantinos Daskalakis, Patroklos Stefanou, Rui Yao, and Emmanouil Zampetakis. Efficient truncated linear regression with unknown noise variance. Advances in Neural Information Processing Systems, 34, 2021.
[Dwy91] Rex A. Dwyer. Convex hulls of samples from spherically symmetric distributions. Discrete Applied Mathematics, 31(2):113-132, 1991.
[HMvdW+20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357-362, September 2020.
[PGM+19] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024-8035. Curran Associates, Inc., 2019.
[Sha16] Ohad Shamir. Without-replacement sampling for stochastic gradient methods. Advances in Neural Information Processing Systems, 29, 2016.
[SV16] Hossein Sartipizadeh and Tyrone L. Vincent. Computing the approximate convex hull in high dimensions. arXiv preprint arXiv:1603.04422, 2016.
A Dual of Equation (7)
The primal problem is
\[
\min_{w} \; \|w - w_0\|^2 \quad \text{subject to} \quad w^T v^{(i)} - p \le 0 \;\; \text{and} \;\; -w^T v^{(i)} + q \le 0 \quad \text{for all } i.
\]
Therefore, the Lagrangian is
\[
L(w, \lambda) = (w - w_0)^T (w - w_0) + \sum_{i=1}^{\gamma} \lambda_{0i}\left(w^T v^{(i)} - p\right) + \sum_{i=1}^{\gamma} \lambda_{1i}\left(-w^T v^{(i)} + q\right)
\]
\[
= w^T w + w_0^T w_0 - 2 w^T w_0 + \sum_{i=1}^{\gamma} \lambda_{0i}\left(w^T v^{(i)} - p\right) + \sum_{i=1}^{\gamma} \lambda_{1i}\left(-w^T v^{(i)} + q\right).
\]
Taking the derivative with respect to $w$, we have
\[
\frac{\partial L}{\partial w} = 2(w - w_0) + \sum_{i=1}^{\gamma} \left(\lambda_{0i} - \lambda_{1i}\right) v^{(i)} = 0
\quad \Longrightarrow \quad
w = w_0 + \sum_{i=1}^{\gamma} \frac{\lambda_{1i} - \lambda_{0i}}{2}\, v^{(i)}.
\]
After plugging this back into the Lagrangian, we have
\[
\sum_{i=1}^{\gamma} \frac{\lambda_{1i} - \lambda_{0i}}{2}\, v^{(i)T} \sum_{j=1}^{\gamma} \frac{\lambda_{1j} - \lambda_{0j}}{2}\, v^{(j)}
+ \sum_{i=1}^{\gamma} \left(\lambda_{0i}(-p) + \lambda_{1i} q\right)
+ \sum_{i=1}^{\gamma} \left(\lambda_{0i} - \lambda_{1i}\right) w_0^T v^{(i)}
+ \sum_{i=1}^{\gamma} \lambda_{0i}\, v^{(i)T} \sum_{j=1}^{\gamma} \frac{\lambda_{1j} - \lambda_{0j}}{2}\, v^{(j)}
- \sum_{i=1}^{\gamma} \lambda_{1i}\, v^{(i)T} \sum_{j=1}^{\gamma} \frac{\lambda_{1j} - \lambda_{0j}}{2}\, v^{(j)} \tag{8}
\]
\[
= \sum_{i=1}^{\gamma} \left(\lambda_{0i} - \lambda_{1i}\right) v^{(i)T} \sum_{j=1}^{\gamma} \frac{\lambda_{1j} - \lambda_{0j}}{2}\, v^{(j)}
+ \sum_{i=1}^{\gamma} \frac{\lambda_{1i} - \lambda_{0i}}{2}\, v^{(i)T} \sum_{j=1}^{\gamma} \frac{\lambda_{1j} - \lambda_{0j}}{2}\, v^{(j)}
+ \sum_{i=1}^{\gamma} \left(\lambda_{0i}(-p) + \lambda_{1i} q\right)
+ \sum_{i=1}^{\gamma} \left(\lambda_{0i} - \lambda_{1i}\right) w_0^T v^{(i)} \tag{9}
\]
\[
= -\frac{1}{4} \sum_{i=1}^{\gamma} \left(\lambda_{0i} - \lambda_{1i}\right) v^{(i)T} \sum_{j=1}^{\gamma} \left(\lambda_{0j} - \lambda_{1j}\right) v^{(j)}
+ \sum_{i=1}^{\gamma} \left(\lambda_{0i}(-p) + \lambda_{1i} q\right)
+ \sum_{i=1}^{\gamma} \left(\lambda_{0i} - \lambda_{1i}\right) w_0^T v^{(i)}. \tag{10}
\]
Therefore, the dual problem is
\[
\max_{\lambda} \; -\frac{1}{4} \sum_{i=1}^{\gamma} \left(\lambda_{0i} - \lambda_{1i}\right) v^{(i)T} \sum_{j=1}^{\gamma} \left(\lambda_{0j} - \lambda_{1j}\right) v^{(j)}
+ \sum_{i=1}^{\gamma} \left(\lambda_{0i}(-p) + \lambda_{1i} q\right)
+ \sum_{i=1}^{\gamma} \left(\lambda_{0i} - \lambda_{1i}\right) w_0^T v^{(i)} \tag{11}
\]
subject to $\lambda_{0i} \ge 0$ and $\lambda_{1i} \ge 0$ for all $i$.