Linear Regression under Interval Truncation
Huiwen Lu, Kanlin Wang, Yi Rong, Sihan Liu
March 15, 2022
1 Introduction
In traditional linear regression, we try to recover a hidden model parameter $w^*$ from samples $(x, y)$ of the form $y = w^{*T} x + \epsilon$, where $\epsilon$ is sampled from some noise distribution. Classical results show that $w^*$ can be recovered within $\ell_2$-reconstruction error $O(\sqrt{k/n})$, where $n$ is the number of observable samples and $k$ is the dimension of $w^*$. However, this
kind of classical technique does not apply to partially observable data, namely, the truncated setting. Learning from truncated samples is a long-standing challenge: truncation arises whenever samples falling outside some observation window are never recorded. Such situations are common in biological research, the social sciences, business, and economics, whether due to limitations of the data-collecting devices or inherent flaws of the sampling process.
If we ignore the truncation, as shown in various experiments, the resulting regression generalizes poorly in regions where data is missing. Recently, a series of works (see [DGTZ19], [DRZ20], [DSYZ21]) has led to theoretically sound truncated linear regression with optimal sample complexity, addressing a challenge first posed in 1958. While these are all polynomial-time algorithms, their run-time is far from practical due to complicated projection steps that rely heavily on the ellipsoid method. On the other hand, these works assume that the data truncation is arbitrary and that only oracle access to the truncation set is provided, while in practice the censoring setting, where the truncation set is convex, is more ubiquitous.
Our Contributions. In this work, we give an efficient truncated linear-regression algorithm tailored to the censoring setting. The run-time of the algorithm is in the worst case bounded by $O(n^3)$, where $n$ is the input data size, and is better on average when the data follows certain structured distributions, showing that truncated linear regression can be practical.
Distribution of Work. We collaboratively wrote an outline and a full draft of the final project report, and shared our implementation codebase for the necessary experiments. The workload was divided roughly equally.
1.1 Related Work
1. Truncated Linear Regression. [DRZ20] provided an efficient algorithm for the high-dimensional truncated linear regression problem, in which $w^*$ can be recovered within $\ell_2$-reconstruction error $O(\sqrt{(s \log k)/n})$, where $s$ is the sparsity of $w^*$. Their sparsity assumption is one we do not adopt, since we do not focus on the sparse setting.
2. Computationally and Statistically Efficient Truncated Regression. [DGTZ19] used Projected Stochastic Gradient Descent (PSGD) to recover $w^*$ with $\ell_2$-reconstruction error $O(\sqrt{k/n})$. They make two main assumptions about the observable samples, which we also use. However, with their definition of the projection set, each projection step has polynomial dependency on the number of data points, which increases the overall time complexity.
3. Efficient Truncated Linear Regression with Unknown Noise Variance. [DSYZ21] revisit the classical challenge of truncated linear regression with a projection step that does not need to go through all the data points. They use three assumptions to obtain a very simple projection set, which can be too restrictive for real-life problems, and their guarantee depends exponentially on the survival probability. Our projection set is more involved, fits a wider range of real problems, and has only polynomial dependency on the survival probability.
4. Computing the Approximate Convex Hull in High Dimensions. [SV16] give an effective method, running in time $O(T^{3/2} n^2 \log(T/\epsilon_0))$, for finding an approximation of the convex hull, where $T$ is close to the number of vertices of the approximation.
2 Problem Formulation and Result
We study the following restricted version of truncated linear regression where the truncation set is assumed to be an interval $S = [a, b]$. Fix an optimal model parameter $w^* \in \mathbb{R}^k$; the truncated samples are generated through the following process:
1. Take samples $x \in \mathbb{R}^k$ either arbitrarily or from a distribution $\mathcal{D}$.
2. Generate the corresponding $y$ label as
\[ y = x^T w^* + \epsilon, \tag{1} \]
where $\epsilon \sim N(0, 1)$.
3. Return $(x, y)$ if $y \in [a, b]$.
Let $\left(x^{(1)}, y^{(1)}\right), \ldots, \left(x^{(n)}, y^{(n)}\right)$ be $n$ samples obtained from the above sampling procedure. Our goal is to find $\hat{w}$ which minimizes the loss $\|\hat{w} - w^*\|_X$, where $X = \sum_{i=1}^{n} x^{(i)} x^{(i)T}$. The following concept will be important to our discussion.
Definition 2.1 (Survival Probability). Let $S$ be a measurable subset of $\mathbb{R}$. Given $x \in \mathbb{R}^k$ and $w \in \mathbb{R}^k$, we define the survival probability $\alpha(w, x; S)$ of the sample with feature vector $x$ and parameters $w$ as
\[ \alpha(w, x; S) = N\left(w^T x, 1; S\right), \]
i.e., the probability mass that the Gaussian $N(w^T x, 1)$ assigns to $S$.
When S is clear from the context we may refer to α(w, x; S) simply as α(w, x).
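For concreteness, when $S = [a, b]$ the survival probability is just the Gaussian mass $\Phi(b - w^T x) - \Phi(a - w^T x)$. The minimal sketch below (the helper name and argument names are ours, not from the references) computes it with SciPy.

```python
import numpy as np
from scipy.stats import norm

def survival_prob(w, x, s_lo, s_hi):
    """alpha(w, x; S): mass that N(w^T x, 1) places on the interval S = [s_lo, s_hi]."""
    mean = float(np.dot(w, x))
    return norm.cdf(s_hi - mean) - norm.cdf(s_lo - mean)
```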
We will readily make the following assumptions which are standard in the literature of
truncated statistics and linear regression (See [DRZ20]).
Assumption 2.1 (Non-trivial Survival Probability). Let $\left(x^{(1)}, y^{(1)}\right), \ldots, \left(x^{(n)}, y^{(n)}\right)$ be samples from the regression model of Section 2. There exists a constant $a > 0$ such that
\[
\sum_{i=1}^{n} \log\frac{1}{\alpha\left(w^*, x^{(i)}\right)}\, x^{(i)} x^{(i)T} \preceq \log\frac{1}{a} \sum_{i=1}^{n} x^{(i)} x^{(i)T}.
\]
Assumption 2.2 (Thickness of Covariance Matrix of Covariates). Let $X$ be the $k \times k$ matrix defined as $X = \frac{1}{n}\sum_{i=1}^{n} x^{(i)} x^{(i)T}$, where $x^{(i)} \in \mathbb{R}^k$. Then for every $i \in [n]$, it holds that
\[
X \succeq \frac{\log(k)}{n}\, x^{(i)} x^{(i)T}, \qquad \text{and} \qquad X \succeq b^2 \cdot I
\]
for some value $b \in \mathbb{R}_+$.
Assumption 2.3 (Normalization/Data Generation). Assume that $\|w^*\|_2^2 \le \beta$, and $\|x\|_2^2 \le 1$ for all data points.
We provide some justification for Assumption 2.3, which may seem a bit unnatural at first glance. To solve our problem in the more general setting where we have only assumed that $\|x^{(i)}\| \le B$, we make sure, before running the algorithm, to divide all the $x^{(i)}$ by $B\sqrt{k}$. This way the norm of $w^*$ is multiplied by $B\sqrt{k}$. So for the rest of the proof we may safely assume $\|x^{(i)}\|_2^2 \le 1$ and $\|w^*\|_2^2 \le \beta$, where $\beta = B^2 \cdot k \cdot \tilde{\beta}$ for $\tilde{\beta}$ being the assumed bound on $\|w^*\|_2^2$ before the normalization transformation.
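In code, this normalization is a single rescaling of the covariates, followed by an inverse rescaling of the recovered parameter; a minimal sketch (assuming the bound $B$ is known) is given below.

```python
import numpy as np

def normalize_covariates(X, B):
    """Divide every x^(i) by B * sqrt(k) so that ||x^(i)||_2 <= 1.
    Any parameter estimated on the rescaled data must be divided by the
    same factor to recover an estimate for the original problem."""
    n, k = X.shape
    scale = B * np.sqrt(k)
    return X / scale, scale

# Usage sketch: X_norm, scale = normalize_covariates(X, B)
#               ... run the algorithm on X_norm to obtain w_hat_norm ...
#               w_hat = w_hat_norm / scale
```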
Besides these, we also make the following assumption about the spatial distribution of the data points. Its motivation will become clear after we walk through our approach in Section 3.1.
Assumption 2.4. We assume that the convex hull of the data points $x^{(1)}, x^{(2)}, \ldots, x^{(n)}$ has at most $\gamma$ vertices.
Now, we are ready to present our main theorem.
Theorem 2.2. Let $\left(x^{(1)}, y^{(1)}\right), \ldots, \left(x^{(n)}, y^{(n)}\right)$ be $n$ samples from the linear regression model defined in Section 2 with parameter $w^*$. If Assumptions 2.1, 2.2, 2.3 hold, then there exists an algorithm with success probability at least $2/3$ that outputs $\hat{w} \in \mathbb{R}^k$ such that
\[
\|\hat{w} - w^*\|_2 \le \frac{1}{a^2} \cdot \sqrt{\frac{k \cdot \beta}{n}}\, \log(n).
\]
Moreover, if Assumption 2.4 holds, the algorithm runs in time
\[
\tilde{O}\left(n^2 \cdot \gamma^{3/2} \cdot k + n \cdot \gamma \cdot k^{3/2}\right).
\]
3 Approaches
We follow the approach in [DRZ20]. Namely, we try to optimize the loss function corresponding to the maximum log-likelihood of the underlying model. Let
\[
\ell(w; x, y) = \frac{1}{2}\left(y - w^T x\right)^2 + \log \int_{z \in S} \exp\left(-\frac{1}{2}\left(z - w^T x\right)^2\right) dz.
\]
Then, we try to optimize the following convex function:
\[
\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{y \sim N\left(w^{*T} x^{(i)},\, 1,\, S\right)}\left[\ell\left(w; x^{(i)}, y\right)\right]. \tag{2}
\]
This is an unconstrained optimization task, so we omit the derivation of its dual. Note that we do have a constrained optimization task within the scope of our project, and its dual is derived in the appendix. In [DRZ20], it has been shown that the objective function is convex and that $w = w^*$ is its unique minimizer. Hence, given that we are able to efficiently solve this optimization task, we can safely recover the hidden model parameter $w^*$ in $\ell_2$ distance.
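For intuition, the gradient of $\ell(w; x, y)$ with respect to $w$ is $\left(\mathbb{E}_{z \sim N(w^T x, 1, S)}[z] - y\right) x$, so an unbiased stochastic gradient is obtained by replacing the expectation with a single truncated-normal sample. A minimal sketch using rejection sampling is given below (function names are ours; rejection sampling is only practical when the survival probability is not too small).

```python
import numpy as np

def sample_truncated_normal(mean, s_lo, s_hi, rng, max_tries=10000):
    """Rejection-sample z ~ N(mean, 1) conditioned on z in [s_lo, s_hi]."""
    for _ in range(max_tries):
        z = rng.normal(mean, 1.0)
        if s_lo <= z <= s_hi:
            return z
    raise RuntimeError("survival probability too small for rejection sampling")

def stochastic_gradient(w, x, y, s_lo, s_hi, rng):
    """Unbiased estimate of the gradient of l(w; x, y): (z - y) * x,
    where z ~ N(w^T x, 1, S) and (x, y) is an observed (truncated) sample."""
    z = sample_truncated_normal(float(np.dot(w, x)), s_lo, s_hi, rng)
    return (z - y) * x
```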
3.1 Improvement on Projection Set
In [DRZ20], the optimization is performed through PSGD (Projected Stochastic Gradient Descent) over the following convex set (see Algorithm 1 in [DRZ20] for details):
\[
D_r = \left\{ w \in \mathbb{R}^k \;\middle|\; \sum_{i=1}^{n} \left(y^{(i)} - w^T x^{(i)}\right)^2 x^{(i)} x^{(i)T} \preceq r \cdot \sum_{i=1}^{n} x^{(i)} x^{(i)T} \right\}. \tag{3}
\]
The rationale behind $D_r$ is that SGD works best when $\ell(w)$ is strongly convex. $D_r$ is one possible choice of projection set on which $\ell(w)$ is strongly convex, namely $\mathrm{poly}(a)$-strongly convex, where $a$ is the lower bound on the survival probability of an individual data point. However, in practice it may be prohibitively expensive to compute projections onto $D_r$, or even to perform membership queries for $D_r$. More specifically, in order to verify that $w \in D_r$, one needs to iterate through all data points, compute the covariance-like expression, and test its semi-definiteness. The routine outlined in [DRZ20] to perform the projection consists of a binary search process that uses the ellipsoid algorithm as a sub-routine. Though the run-time remains polynomial, the approach quickly becomes impractical as the number of data points, which is at least $\Omega(d^2)$, grows.
In the work of [DSYZ21], a simpler projection set is proposed:
\[
D = \left\{ (v, \lambda) = \left(\frac{w}{\sigma^2}, \frac{1}{\sigma^2}\right) \in \mathbb{R}^k \times \mathbb{R} \;\middle|\; \frac{1}{8\left(5 + 2\log(1/a)\right)} \le \frac{\sigma^2}{\sigma_0^2} \le \frac{96}{a^2},\; \|w\|_2^2 \le \beta \right\}.
\]
A key merit of this projection set is that a closed-form solution exists for projecting any point $x \in \mathbb{R}^k$ onto the set, hence allowing a very efficient implementation of PSGD. However, the primary bottleneck is that the convexity structure of the objective function $\ell(w)$ may become exponentially weak with respect to the survival probability $a$. In particular, the only stated theoretical guarantee is that $\ell(w)$ is $\exp(-\mathrm{poly}(1/a))$-strongly convex. If a data point $x$ happens to fall in a region where $w^T x$ is near the boundary of the interval, the convergence rate of the proposed algorithm may deteriorate significantly.
As our main contribution, leveraging the underlying convexity properties of the truncation set, we propose the following projection set (and variants, discussed in detail in Section 3.2), which ensures strong convexity of $\ell(w)$ as well as an efficient projection procedure.
Definition 3.1 (Modified Projection Set). We define the projection set as
\[
D = \left\{ w \in \mathbb{R}^k \;\middle|\; \forall i \in [n],\; \alpha\left(w, x^{(i)}\right) \ge a \right\}. \tag{4}
\]
One nice property of the truncation set is that the projection task can be reduced to projection onto a properly defined approximate convex hull of the data points $x^{(i)}$ (see [SV16] for efficient algorithms to compute it). If we compute the extreme points of the approximate convex hull in advance, the projection task reduces to a quadratic program whose number of variables is much smaller than the number of data points. For such a task, solvers that are much more efficient than the ellipsoid method exist. For example, if an interior-point method is used, the number of iterations is of order $\tilde{O}(\sqrt{r})$, where $r$ is the number of constraints of the program. It is easy to see that the number of constraints is exactly the number of vertices of this "approximate" convex hull of the data points, which makes Assumption 2.4 play a particularly important role in characterizing the computational complexity of our algorithm.
Next, we explain why we expect the number of extreme points of the convex hull to be much smaller than the number of data points in practice. In the worst case, $\gamma$ can be as large as $n$. However, when the data points are i.i.d. samples from a specific distribution, $\gamma$ is usually much smaller than $n$. In particular, for any distribution having a density with respect to the Lebesgue measure on $\mathbb{R}^d$, [Dwy91] shows that $\gamma = o(n)$. Furthermore, the following results are known for the family of spherically symmetric distributions.
Theorem 3.2. Given a probability measure $p$ on $\mathbb{R}^d$ that is spherically symmetric, i.e. $p(x) = p(y)$ whenever $\|x\|_2 = \|y\|_2$, let $F(x) := \Pr_{v \sim p}\left[\|v\|_2 \le x\right]$.
- If $F$ is a power law, i.e. $F(x) = x^{k} L(x)$ with $k > 0$ and $L(x)$ slowly varying, Assumption 2.4 holds with $\gamma = \Theta(1)$.
- If $F$ has an exponential tail, Assumption 2.4 holds with $\gamma = \Theta\left(\log^{(d-1)/2} n\right)$.
- For distributions on the unit $d$-ball satisfying $F(1 - x) \ge c x^{k}$ for positive $k$, Assumption 2.4 holds with $\gamma = \Theta\left(n^{(d-1)/(2k+d-1)}\right)$.
Another crucial observation about the projection set in Definition 3.1 is that the objective function defined in Equation (2) enjoys good properties which ensure fast convergence of PSGD.
Lemma 3.3. Let $\left(x^{(1)}, y^{(1)}\right), \ldots, \left(x^{(n)}, y^{(n)}\right)$ be $n$ samples from the linear regression model of Section 2 with parameter $w^*$. If Assumptions 2.1, 2.2 and 2.3 hold, then
\[
\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left\|\left(y^{(i)} - z^{(i)}\right) x^{(i)}\right\|_2^2\right] \le O(\beta + 1/a),
\]
\[
\bar{H}(w) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left(z^{(i)} - \mathbb{E}\left[z^{(i)}\right]\right)^2 x^{(i)} x^{(i)T}\right] \succeq \Omega(a^2)\, I,
\]
where $y^{(i)} \sim N\left(w^{*T} x^{(i)}, 1, S\right)$, $z^{(i)} \sim N\left(w^{T} x^{(i)}, 1, S\right)$, and $w \in D$.
Combining this result with a standard result about PSGD gives our main theorem. In Section 3.2, we discuss the technical details of performing the projection step; in Section 3.3, we prove Lemma 3.3 and conclude the proof of Theorem 2.2.
3.2 Efficient Projection
First, to verify that some $w \in \mathbb{R}^k$ lies in $D$, we claim that we do not have to check every data point $x^{(i)}$. Instead, let $v^{(1)}, \ldots, v^{(\gamma)}$ be the extreme points of the convex hull formed by the data points. Then it suffices to check that the inequality holds for every $v^{(i)}$. This is due to the fact that $w^T x^{(i)} = w^T \sum_{j=1}^{\gamma} \alpha_j \cdot v^{(j)} = \sum_j \alpha_j\, w^T v^{(j)}$ for some convex-combination weights $\alpha_j$. Since the constraint is convex when the truncation set $S$ is an interval, we automatically have $\mathrm{dist}\left(\sum_j \alpha_j\, w^T v^{(j)}, S\right) \le \delta$.
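In other words, membership in $D$ can be tested using only the $\gamma$ extreme points. A small sketch is shown below, reusing the hypothetical `survival_prob` helper from Section 2.

```python
def in_projection_set(w, vertices, s_lo, s_hi, a):
    """Check alpha(w, v) >= a for every extreme point v of the (approximate) convex hull.
    vertices: iterable of the extreme points v^(1), ..., v^(gamma)."""
    return all(survival_prob(w, v, s_lo, s_hi) >= a for v in vertices)
```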
In general, finding the exact convex hull of points in $k$ dimensions has exponential dependency on $k$. Fortunately, by the work of [SV16], we are able to find an approximate convex hull efficiently. Next, we give the formal definition of an approximate convex hull.
Definition 3.4. Let $S$ be a convex hull whose extreme points are $\{x^{(i)}\}$. Then we define the distance from a point $z$ to the convex hull $S$ as
\[
d(z, S)^2 = \min_{\alpha_i} \left\| z - \sum_{i=1}^{|S|} \alpha_i x^{(i)} \right\|^2 \quad \text{s.t.} \quad \alpha_i \ge 0, \;\; \sum_{i=1}^{|S|} \alpha_i = 1. \tag{5}
\]
The $\delta$-approximate convex hull of $S = \{x^{(1)}, \ldots, x^{(n)}\}$ is then the convex hull of a minimal subset $E \subseteq S$ such that for all $z \in S$, it holds that $d(z, E) \le \delta$.
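The distance in Equation (5) is itself a small quadratic program over the convex-combination weights; a sketch using CVXPY (which our experiments already depend on) is given below, with all names ours.

```python
import cvxpy as cp
import numpy as np

def dist_to_hull(z, extreme_points):
    """d(z, S) from Equation (5): Euclidean distance from z to the convex hull
    of the rows of extreme_points (an (m, k) array)."""
    m, _ = extreme_points.shape
    alpha = cp.Variable(m, nonneg=True)
    objective = cp.Minimize(cp.norm(z - extreme_points.T @ alpha, 2))
    problem = cp.Problem(objective, [cp.sum(alpha) == 1])
    problem.solve()
    return problem.value
```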
In [SV16], an efficient algorithm for computing approximate convex hulls is given.
Theorem 3.5. Given $n$ data points in $k$ dimensions, assume that the $\delta$-approximate convex hull has $\gamma$ extreme points. Then there exists an efficient algorithm which computes the $\delta$-approximate convex hull in time $O\left(\gamma^{3/2} \cdot n^2 \cdot k \cdot \log(\gamma)\right)$.
Since we only compute the approximate convex hull of the data points, the projection set that is actually used by the algorithm will not be exactly the one defined in Definition 3.1, but rather an "approximate" one given below.
Definition 3.6 (Approximate Projection Set). Let $v^{(1)}, \ldots, v^{(\gamma)}$ be the extreme points of a $\delta$-approximate convex hull of the data points. We define the projection set as
\[
D_\delta = \left\{ w \in \mathbb{R}^k \;\middle|\; \forall i \in [\gamma],\; \alpha\left(w, v^{(i)}\right) \ge a \right\}. \tag{6}
\]
Now we are ready to present the algorithm for the projection step, given that the extreme points $v^{(1)}, \ldots, v^{(\gamma)}$ have been computed beforehand. In particular, we need to solve the following optimization problem:
\[
\min_{w} \; \|w - w_0\|_2^2 \quad \text{subject to} \quad q \le w^T v^{(i)} \le p \quad \text{for all } i \in [\gamma], \tag{7}
\]
where $[q, p]$ is the appropriately chosen interval such that, for the truncation set $S$, $q = \arg\min_z \left\{ N(z, 1; S) \ge a \right\}$ and $p = \arg\max_z \left\{ N(z, 1; S) \ge a \right\}$. We provide the dual of this QP in the appendix. The QP can be solved using an interior-point method (see [BBV04]). In [CWZ13], it has been shown that the number of iterations for solving a QP with $r$ independent variables up to accuracy $\epsilon$ is of order $O\left(\sqrt{r} \cdot \log(r)\right)$. In our setting, this implies that the projection step takes time at most $\tilde{O}\left(k^{3/2} \cdot \gamma\right)$.
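For illustration, the projection step (7) can be written directly as a small quadratic program; the CVXPY sketch below is one straightforward (though not the fastest) way to solve it, where `vertices` stacks the extreme points $v^{(1)}, \ldots, v^{(\gamma)}$ as rows and $q, p$ are the interval endpoints defined above. The names are ours, not part of any reference implementation.

```python
import cvxpy as cp

def project_onto_D(w0, vertices, q, p):
    """Solve problem (7): the point of D_delta closest to w0, where D_delta
    is described by q <= w^T v^(i) <= p for every extreme point v^(i)."""
    k = w0.shape[0]
    w = cp.Variable(k)
    constraints = [vertices @ w <= p, vertices @ w >= q]
    problem = cp.Problem(cp.Minimize(cp.sum_squares(w - w0)), constraints)
    problem.solve()
    return w.value
```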
Lemma 3.7. Let $D_\delta$ be defined as in Definition 3.6. Then there exists an efficient algorithm which projects any point $w \in \mathbb{R}^k$ onto $D_\delta$ in time at most $\tilde{O}\left(k^{3/2} \cdot \gamma\right)$.
3.3 Bounded Gradient Variance and Strong Convexity
In this section, we prove Lemma 3.3, which asserts that, if $w$ lies within our projection set, the gradient estimator has bounded variance and the objective function is strongly convex. To do so, we first prove a structural lemma about the survival probability of any data point under $w \in D_\delta$.
Lemma 3.8. Suppose $w \in D_\delta$ with $\delta < \frac{a}{10 \cdot \beta}$. Then for any $x \in \left\{x^{(1)}, \ldots, x^{(n)}\right\}$, it holds that $\alpha(w, x) \ge \Omega(a)$.
Proof. Let $H$ be the $\delta$-approximate convex hull of the data points. By the definition of $D_\delta$, we have that $\alpha(w, \tilde{x}) \ge a$ for all $\tilde{x} \in H$ and $w \in D_\delta$. By the definition of the approximate convex hull, for every $x \in \left\{x^{(1)}, \ldots, x^{(n)}\right\}$ there exists $\tilde{x} \in H$ such that $\mathrm{dist}(\tilde{x}, x) \le \delta$. It then follows that $\left|w^T x - w^T \tilde{x}\right| \le \sqrt{\beta} \cdot \delta \le a/10$. Notice that the survival probabilities $\alpha(w, \tilde{x})$ and $\alpha(w, x)$ are exactly the masses that the normal distributions $N\left(w^T \tilde{x}, 1\right)$ and $N\left(w^T x, 1\right)$ assign to the truncation interval $S$. Since the total variation distance between these two normal distributions is at most $O\left(\left|w^T x - w^T \tilde{x}\right|\right)$, the result follows.
Next, we give the proof of Lemma 3.3. The proof is similar to the one in [DGTZ19], and arguably simpler thanks to the structural property of $D_\delta$ from Lemma 3.8.
Proof of Lemma 3.3. Using the triangle inequality, we have that
\[
\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left\|\left(y^{(i)} - z^{(i)}\right) x^{(i)}\right\|_2^2\right]
\le \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left\|\left(y^{(i)} - w^{*T} x^{(i)}\right) x^{(i)}\right\|_2^2\right]
+ \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left\|\left(z^{(i)} - w^{T} x^{(i)}\right) x^{(i)}\right\|_2^2\right]
+ \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left\|x^{(i)T}\left(w^* - w\right) x^{(i)}\right\|_2^2\right].
\]
Since $\left\|x^{(i)}\right\|_2 \le 1$ and $\|w^*\|_2 \le \sqrt{\beta}$, we have that
\[
\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left\|x^{(i)T}\left(w^* - w\right) x^{(i)}\right\|_2^2\right] \le 2 \cdot \beta.
\]
Using Lemma 3 in [DRZ20], we have that
\[
\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left\|\left(y^{(i)} - w^{*T} x^{(i)}\right) x^{(i)}\right\|_2^2\right]
\le \frac{1}{n} \sum_{i=1}^{n} \left(2 \log \frac{1}{\alpha\left(w^*, x^{(i)}\right)} + 4\right) \left\|x^{(i)}\right\|_2^2
\le 2 \log \frac{1}{a} + 4,
\]
where the last inequality follows from Lemma 3.8. Similarly, we have that
\[
\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left\|\left(z^{(i)} - w^{T} x^{(i)}\right) x^{(i)}\right\|_2^2\right] \le 2 \log \frac{1}{a} + 4.
\]
Adding all the inequalities together, we have that
\[
\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left\|\left(y^{(i)} - z^{(i)}\right) x^{(i)}\right\|_2^2\right] \le O(\beta + 1/a).
\]
Next, we show that the objective function is strongly convex. Using Lemma 9 in [DRZ20], Lemma 3.8, and Assumption 2.2, we have that
\[
\sum_{i=1}^{n} \mathbb{E}\left[\left(z^{(i)} - \mathbb{E}\left[z^{(i)}\right]\right)^2 x^{(i)} x^{(i)T}\right]
\succeq \sum_{i=1}^{n} \frac{a^2}{12}\, x^{(i)} x^{(i)T}
= \frac{a^2}{12} \sum_{i=1}^{n} x^{(i)} x^{(i)T}
\succeq \frac{a^2}{12}\, b^2\, I.
\]
Overall, we then have
\[
\bar{H}(w) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[\left(z^{(i)} - \mathbb{E}\left[z^{(i)}\right]\right)^2 x^{(i)} x^{(i)T}\right] \succeq \Omega(a^2) \cdot I.
\]
Lastly, we will use the following standard result about the convergence rate of PSGD. Combining it with Lemmas 3.3 and 3.7 gives Theorem 2.2.
Theorem 3.9 (Theorem 3 of [Sha16]). Let $f : \mathbb{R}^k \to \mathbb{R}$ be a convex function such that $f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)$, where $f_i(w) = c_i \cdot w^T x^{(i)} + q(w)$, $c_i \in \mathbb{R}$, and $x^{(i)} \in \mathbb{R}^k$ with $\left\|x^{(i)}\right\|_2 \le 1$. Let also $w^{(1)}, \ldots, w^{(n)}$ be the sequence produced by the PSGD algorithm, where $v^{(1)}, \ldots, v^{(n)}$ is a sequence of random vectors such that $\mathbb{E}\left[v^{(i)} \mid w^{(i-1)}\right] = \nabla f_i\left(w^{(i-1)}\right)$ for all $i \in [n]$, and let $w^* = \arg\min_{w \in D} f(w)$ be a minimizer of $f$. If we assume the following:
1. bounded variance step: $\mathbb{E}\left[\left\|v^{(i)}\right\|_2^2\right] \le \rho^2$,
2. strong convexity: $q(w)$ is $\lambda$-strongly convex,
3. bounded parameters: the diameter of $D$ is at most $\rho$ and also $\max_i |c_i| \le \rho$,
then $\mathbb{E}[f(\bar{w})] - f(w^*) \le c \cdot \frac{\rho^2}{\lambda n} \cdot \log(n)$, where $\bar{w}$ is the output of the PSGD algorithm and $c \in \mathbb{R}_+$.
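Putting the pieces together, the overall algorithm is PSGD over $D_\delta$ with the gradient estimator from Section 3. The sketch below reuses the hypothetical helpers `stochastic_gradient` and `project_onto_D` from the earlier snippets and fixes a simple $1/(\lambda t)$ step size, which is one common choice rather than the exact schedule of [Sha16].

```python
import numpy as np

def truncated_regression_psgd(X, y, vertices, s_lo, s_hi, q, p, lam, n_steps, seed=0):
    """PSGD sketch for objective (2). X: (n, k) covariates, y: (n,) truncated labels."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    w = np.zeros(k)
    iterates = []
    for t in range(1, n_steps + 1):
        i = rng.integers(n)                                      # pick a random sample
        g = stochastic_gradient(w, X[i], y[i], s_lo, s_hi, rng)  # unbiased gradient
        w = w - g / (lam * t)                                    # step size 1/(lambda * t)
        w = project_onto_D(w, vertices, q, p)                    # project back into D_delta
        iterates.append(w)
    return np.mean(iterates, axis=0)  # averaged iterate, as in Theorem 3.9
```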
4 Evaluation and Future Work
4.1 Experiments
We use a synthetic dataset to evaluate our approach and compare it against the others. We pick an arbitrary model parameter $w^* \in \mathbb{R}^k$ and generate $n$ random data points. The $y$ labels are then computed following Equation (1) with Gaussian noise $N(0, 1)$, and the truncated set is obtained by eliminating the portion of the data above a certain threshold with respect to $y$ (a minimal data-generation sketch appears after the list below). The plots are obtained through the following procedure:
1. Sort all the data points $x^{(i)}$ in increasing order of $y^{(i)}$.
2. Plot the points $(i, y^{(i)})$ as cyan points.
3. Plot the points $(i, x^{(i)T} w)$ as red points, where $w$ is the output of the corresponding algorithm.
4. Add a horizontal blue line representing the truncation threshold; all the points above that line are not visible to the algorithms.
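As referenced above, the synthetic data can be generated along the following lines (a minimal sketch; the exact seeds and scaling used in our experiments differ).

```python
import numpy as np

def make_truncated_dataset(n=200, k=30, visible_fraction=0.25, seed=0):
    """Generate y = x^T w* + N(0, 1) noise and keep only the smallest labels."""
    rng = np.random.default_rng(seed)
    w_star = rng.normal(size=k)
    X = rng.normal(size=(n, k)) / np.sqrt(k)        # keep covariate norms bounded
    y = X @ w_star + rng.normal(size=n)
    threshold = np.quantile(y, visible_fraction)    # truncation line
    visible = y <= threshold                        # bottom 25% is observable
    return X, y, visible, w_star
```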
We designed four algorithmic approaches to compare performance. We use linear regression on the full (untruncated) dataset as the ground truth. Since we generate the data in a linear fashion with random noise and this approach has access to all the data, it should represent the best possible outcome among the four, serving as an upper bound on algorithmic performance. Secondly, we run linear regression on the truncated dataset and evaluate it on the full dataset; we expect this to be the lower bound, since it does not take the data distribution into consideration and heavily overfits the truncated dataset.
Then we propose two convex-optimization-based approaches. The first one is constrained linear regression: rather than completely ignoring the invisible dataset, we use it as a constraint in our optimization:
\[
\begin{aligned}
\min_{c} \quad & \sum_{i=1}^{M} \left(y^{(i)} - c^{T} x^{(i)}\right)^2 \\
\text{subject to} \quad & c^{T} x^{(i)} \notin D \quad \text{for } i = M+1, \ldots, K.
\end{aligned}
\]
This serves as a naive way to improve linear regression on the truncated set. The intuition is that, for every $x^{(i)}$ not in the truncated dataset (i.e., invisible in the training set), we also enforce the corresponding prediction $c^T x^{(i)}$ to meet the truncation criterion, i.e., to fall outside the observable label range $D$.
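A sketch of this constrained least-squares fit in CVXPY is shown below; in our experimental setting the invisible labels lie above the truncation threshold, so the constraint reduces to forcing those predictions above the threshold (variable names are ours).

```python
import cvxpy as cp

def constrained_linear_regression(X_vis, y_vis, X_invis, threshold):
    """Fit the visible data by least squares while forcing predictions on the
    invisible covariates to lie beyond the truncation threshold."""
    k = X_vis.shape[1]
    c = cp.Variable(k)
    objective = cp.Minimize(cp.sum_squares(y_vis - X_vis @ c))
    constraints = [X_invis @ c >= threshold]
    problem = cp.Problem(objective, constraints)
    problem.solve()
    return c.value
```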
Finally, we compare all three of the above with our approach described in Section 3.
4.2 Environment Setup
We implemented the linear regression algorithm for approaches 1 and 2 with NumPy [HMvdW+20], using np.linalg.lstsq. The third, naive convex-optimization approach was implemented using CVXPY [DB16]. Our own approach was implemented using PyTorch [PGM+19] with a slightly modified PSGD procedure (full source code and a Jupyter notebook to reproduce the experiments are available upon request).
All of the algorithm implementations use the same pre-generated dataset for training, evaluation, and visualization, and we ran the experiments multiple times to minimize the effects of random number generation and Gaussian noise. The dataset is set to truncate 75% of the data, i.e., the algorithms (other than approach 1) can only see the bottom 25% of the data. The dataset contains 200 data points, each $x^{(i)}$ having 30 float features (i.e., 30-dimensional data). The label $y$ is a scalar, and the data points with the 50 smallest $y$ values are visible to the algorithms.
4.3 Performance
Table 1: Error rates for the different approaches

Algorithm                              | Error Rate ($\|y - \hat{y}\|$) | $\|w^* - \hat{w}\|$
Linear Regression, Full Dataset        | 2.516                          | 0.74
Linear Regression, Truncated Dataset   | 6.386                          | 2.23
Naive CVX Approach, Truncated Dataset  | 4.401                          | 1.49
Our Approach, Truncated Dataset        | 3.137                          | 0.94
In Figure 1, we show the result of linear regression when there is no truncation, as a benchmark for comparison; in Figure 2, we show the result of raw linear regression on the truncated dataset; in Figure 3, we show the result of the constrained linear regression implemented with the CVXPY package on the truncated dataset; finally, we show the result of our truncated linear regression algorithm in Figure 4.
From Table 1, we can see that our algorithm outperforms both the constrained linear regression and the raw linear regression on the truncated dataset.
[Figure 1: Linear Regression on Original Dataset (plot of $y$ against sorted observation index; legend: original data, LR on Full Dataset, truncation line).]
Using our novel approach on only 25% of the data, we managed to achieve an error rate very close to that of running linear regression on the entire dataset.
4.4 Future Work
In the future, we would like to (i) test the computational and statistical limits of the algorithm in practical settings, and (ii) extend the result to multi-dimensional linear regression, where the output also lies in a high-dimensional space. The approach can be easily adapted to work with an arbitrary truncation set with convex structure, and we are interested in the performance of our algorithm in this more general setting.
[Figure 2: Linear Regression on Truncated Dataset (plot of $y$ against sorted observation index; legend: original data, LR on truncated data, truncation line).]
[Figure 3: Alternative Approach with Truncation Filtering (plot of $y$ against sorted observation index; legend: original data, CVX Fit, truncation line).]
[Figure 4: Our Approach on Truncated Dataset (plot of $y$ against sorted observation index; legend: original data, Our Approach, truncation line).]
References
[BBV04] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
[CWZ13] Xinzhong Cai, Guoqiang Wang, and Zihou Zhang. Complexity analysis and numerical implementation of primal-dual interior-point methods for convex quadratic optimization based on a finite barrier. Numerical Algorithms, 62(2):289-306, 2013.
[DB16] Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1-5, 2016.
[DGTZ19] Constantinos Daskalakis, Themis Gouleakis, Christos Tzamos, and Manolis Zampetakis. Computationally and statistically efficient truncated regression. In Conference on Learning Theory, pages 955-960. PMLR, 2019.
[DRZ20] Constantinos Daskalakis, Dhruv Rohatgi, and Emmanouil Zampetakis. Truncated linear regression in high dimensions. Advances in Neural Information Processing Systems, 33:10338-10347, 2020.
[DSYZ21] Constantinos Daskalakis, Patroklos Stefanou, Rui Yao, and Emmanouil Zampetakis. Efficient truncated linear regression with unknown noise variance. Advances in Neural Information Processing Systems, 34, 2021.
[Dwy91] Rex A. Dwyer. Convex hulls of samples from spherically symmetric distributions. Discrete Applied Mathematics, 31(2):113-132, 1991.
[HMvdW+20] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. Array programming with NumPy. Nature, 585(7825):357-362, September 2020.
[PGM+19] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024-8035. Curran Associates, Inc., 2019.
[Sha16] Ohad Shamir. Without-replacement sampling for stochastic gradient methods. Advances in Neural Information Processing Systems, 29, 2016.
[SV16] Hossein Sartipizadeh and Tyrone L. Vincent. Computing the approximate convex hull in high dimensions. arXiv preprint arXiv:1603.04422, 2016.
A Dual of Equation (7)
The primal problem is
\[
\min_{w} \; \|w - w_0\|^2 \quad \text{subject to} \quad w^T v^{(i)} - p \le 0 \;\; \text{and} \;\; -w^T v^{(i)} + q \le 0 \quad \text{for all } i.
\]
Therefore, the Lagrangian is
\[
L(w, \lambda) = (w - w_0)^T (w - w_0) + \sum_{i=1}^{\gamma} \lambda_{0i}\left(w^T v^{(i)} - p\right) + \sum_{i=1}^{\gamma} \lambda_{1i}\left(-w^T v^{(i)} + q\right)
\]
\[
= w^T w + w_0^T w_0 - 2 w^T w_0 + \sum_{i=1}^{\gamma} \lambda_{0i}\left(w^T v^{(i)} - p\right) + \sum_{i=1}^{\gamma} \lambda_{1i}\left(-w^T v^{(i)} + q\right).
\]
Taking the derivative with respect to $w$, we have
\[
\frac{\partial L}{\partial w} = 2(w - w_0) + \sum_{i=1}^{\gamma} \left(\lambda_{0i} - \lambda_{1i}\right) v^{(i)} = 0
\quad \Longrightarrow \quad
w = w_0 + \sum_{i=1}^{\gamma} \frac{\lambda_{1i} - \lambda_{0i}}{2}\, v^{(i)}.
\]
After plugging this back into the Lagrangian, we have
\[
\sum_{i=1}^{\gamma} \frac{\lambda_{1i} - \lambda_{0i}}{2}\, v^{(i)T} \sum_{j=1}^{\gamma} \frac{\lambda_{1j} - \lambda_{0j}}{2}\, v^{(j)}
+ \sum_{i=1}^{\gamma} \left(\lambda_{0i}(-p) + \lambda_{1i} q\right)
+ \sum_{i=1}^{\gamma} \left(\lambda_{0i} - \lambda_{1i}\right) w_0^T v^{(i)}
+ \sum_{i=1}^{\gamma} \lambda_{0i}\, v^{(i)T} \sum_{j=1}^{\gamma} \frac{\lambda_{1j} - \lambda_{0j}}{2}\, v^{(j)}
- \sum_{i=1}^{\gamma} \lambda_{1i}\, v^{(i)T} \sum_{j=1}^{\gamma} \frac{\lambda_{1j} - \lambda_{0j}}{2}\, v^{(j)} \tag{8}
\]
\[
= \sum_{i=1}^{\gamma} \left(\lambda_{0i} - \lambda_{1i}\right) v^{(i)T} \sum_{j=1}^{\gamma} \frac{\lambda_{1j} - \lambda_{0j}}{2}\, v^{(j)}
+ \sum_{i=1}^{\gamma} \frac{\lambda_{1i} - \lambda_{0i}}{2}\, v^{(i)T} \sum_{j=1}^{\gamma} \frac{\lambda_{1j} - \lambda_{0j}}{2}\, v^{(j)}
+ \sum_{i=1}^{\gamma} \left(\lambda_{0i}(-p) + \lambda_{1i} q\right)
+ \sum_{i=1}^{\gamma} \left(\lambda_{0i} - \lambda_{1i}\right) w_0^T v^{(i)} \tag{9}
\]
\[
= -\frac{1}{4} \sum_{i=1}^{\gamma} \left(\lambda_{0i} - \lambda_{1i}\right) v^{(i)T} \sum_{j=1}^{\gamma} \left(\lambda_{0j} - \lambda_{1j}\right) v^{(j)}
+ \sum_{i=1}^{\gamma} \left(\lambda_{0i}(-p) + \lambda_{1i} q\right)
+ \sum_{i=1}^{\gamma} \left(\lambda_{0i} - \lambda_{1i}\right) w_0^T v^{(i)}. \tag{10}
\]
Therefore, the dual problem is
\[
\max_{\lambda} \; -\frac{1}{4} \sum_{i=1}^{\gamma} \left(\lambda_{0i} - \lambda_{1i}\right) v^{(i)T} \sum_{j=1}^{\gamma} \left(\lambda_{0j} - \lambda_{1j}\right) v^{(j)}
+ \sum_{i=1}^{\gamma} \left(\lambda_{0i}(-p) + \lambda_{1i} q\right)
+ \sum_{i=1}^{\gamma} \left(\lambda_{0i} - \lambda_{1i}\right) w_0^T v^{(i)} \tag{11}
\]
subject to $\lambda_{0i} \ge 0$ and $\lambda_{1i} \ge 0$ for all $i$.