CSE203B: Linear Regression under Interval Truncation

In traditional linear regression, we try to recover a hidden model parameter $\vec w*$ with samples $(\vec x, y)$ of the form $y = \vec {w}^{*T} \vec x + \epsilon$, where $\epsilon$ is sampled from some noise distribution. Classical results show that $\vec w*$ can be recovered within the $\ell_2$-reconstruction error $O(\sqrt{k/n})$, where $n$ is the number of truncated/observable samples, and $k$ is the dimension of $\vec w^*$. However, this kind of classic technique does not apply to partially observable data, namely, the truncated setting. Analysis from truncated samples is one of the biggest challenge in today’s life cause truncation always happened whenever samples not in the bound are not observed. This kind of case is very common in biological search, social science, business and economics field either due to the limitation of data collecting device or inherit flaws of sampling processes. If we ignore the truncation, as shown in various experiments, the regression result shows very limited generalization property under regions where data is missing. Recently, a series of work (See) has lead to theoretically sound truncated linear regression with optimal sample complexity since the challenge was introduced in 1958. While these are all polynomial-time algorithms, their run-time is far-from being practical due to some complicated projection steps that rely heavily on the ellipsoid methods. On the other hand, these works often assume that the data truncation is arbitrary and only oracle access to the truncation set is provided while, in practice, the censoring setting, where the truncation set is convex, is more ubiquitous.

Our Contributions. In this work, we give an efficient truncated linear-regression algorithm tailored for the censoring setting. The run-time of the algorithm is in the worst case bounded by $O(n^3)$, where $n$ is the input data-size, and is on-average better when the data follows certain structured distributions, showing that truncated linear regression can be practical.

荣懿
荣懿
计算机科学&工程

My research interests include Machine Learning, Compilers and PL Design