Statistics Seminar: Ken Clarkson

Event Date Thu, Mar 5, 2020 @ 4:10pm - 5:30pm

Location
Mathematical Sciences Building 1147 (Colloquium Room)

Seminar starts at 4:10pm, preceded by UCD4IDS Roundtable Discussion at 3:15pm.

SPEAKER: Ken Clarkson, IBM Research

TITLE: "Dimensionality Reduction for Tukey Regression"

ABSTRACT: We give the first dimensionality reduction methods for the overconstrained Tukey regression problem. The Tukey loss function ||y||_M = sum_i M(y_i) has M(y_i) approximately |y_i|^p for residual errors y_i smaller than a prescribed threshold tau, where M(y_i) becomes constant for errors |y_i| > tau.

Our results depend on a new structural result, proven constructively, showing that for any d-dimensional subspace L subset R^n, there is a fixed bounded-size subset of coordinates containing, for every y in L, all the large coordinates of y *with respect to the Tukey loss function*. We think of these as "residual leverage scores", since two coordinates in y itself may have very different magnitude and yet both contribute the same value tau to the M-function.

Our methods reduce a given Tukey regression problem to a smaller weighted version, whose solution is a provably good approximate solution to the original problem. Our reductions are simple and easy to implement, and we give empirical results demonstrating their practicality, using existing heuristic solvers for the small versions.

One of our reductions uses row sampling, for an instance min_{x \in R^d} ||Ax-b||_M, where A in R^{n\times d} and b in R^n, with n >> d. The algorithm takes O~(nnz(A) + poly(d log n / epsilon)) time to return a weight vector with poly(d log n / epsilon) non-zero entries, such that the solution of the resulting weighted Tukey regression problem is a (1 + epsilon)-approximate solution. Here nnz(A) is the number of non-zero entries of A. Another reduction uses a sketching matrix S, chosen independently of A and b, so that the solution for a weighted version with inputs SA and Sb is an O(log n)-approximate solution. Here S has poly(d log n) rows and SA and Sb are computable in O(nnz(A)) time.

We also give exponential-time algorithms giving provably good solutions, and hardness results suggesting that a significant speedup in the worst case is unlikely.

This is joint work with Ruosong Wang and David Woodruff, CMU.

DATE: Thursday, March 5th,

SEMINAR TIME: 4:10pm

LOCATION: MSB 1147, Colloquium Room

REFRESHMENTS at 3:00pm in MSB 1147 followed by a ROUNTABLE DISCUSSION at 3:15pm (with UCD4IDS)