SCAIM Seminar: Hans De Sterck
- Date: 07/12/2016
- Time: 12:30
University of British Columbia
Accelerated Parallel Optimization Algorithms for Distributed Data Analytics in Apache Spark
Scalable parallel optimization methods are gaining importance for a wide range of machine learning applications, for example, as implemented in the machine learning library of the Apache Spark distributed data processing environment. I will discuss our work on accelerating parallel algorithms for two common applications in this area: matrix factorization for recommendation systems, and line search methods for problems such as logistic regression.
For the recommendation application, we accelerate the standard Alternating Least Squares (ALS) optimization algorithm using a nonlinear conjugate gradient (NCG) wrapper around the ALS iterations. In parallel numerical experiments on a 16-node cluster with 256 computing cores, we demonstrate that the combined ALS-NCG method requires far fewer iterations and less time (with acceleration factors of 4 or more) than standalone ALS to reach high-accuracy movie rankings on the MovieLens 20M dataset and on synthetic datasets with up to nearly 1 billion ratings (http://arxiv.org/abs/1508.03110).
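The NCG-wrapped ALS iteration described above can be illustrated with a minimal single-machine sketch. This is not the distributed Spark implementation from the paper: the dense, fully observed rating matrix, the Polak-Ribiere-style beta formula, and the simple backtracking line search are all simplifying assumptions made here for clarity. The ALS update plays the role of a nonlinear preconditioner, with its step standing in for the negative gradient in the conjugate gradient recurrence.

```python
import numpy as np

def loss(R, U, V, lam):
    """Regularized Frobenius-norm factorization loss."""
    return 0.5 * np.sum((R - U @ V.T) ** 2) + 0.5 * lam * (np.sum(U * U) + np.sum(V * V))

def als_step(R, U, V, lam):
    """One Alternating Least Squares sweep: exact solve for U, then for V."""
    k = U.shape[1]
    U = R @ V @ np.linalg.inv(V.T @ V + lam * np.eye(k))
    V = R.T @ U @ np.linalg.inv(U.T @ U + lam * np.eye(k))
    return U, V

def als_ncg(R, U, V, lam=0.1, iters=30):
    """ALS iteration accelerated by a nonlinear conjugate gradient wrapper."""
    m, k = U.shape

    def unpack(x):
        return x[:m * k].reshape(m, k), x[m * k:].reshape(-1, k)

    x = np.concatenate([U.ravel(), V.ravel()])
    d = g_prev = None
    for _ in range(iters):
        U, V = unpack(x)
        Ub, Vb = als_step(R, U, V, lam)
        # the ALS update acts as a nonlinear preconditioner: its step is
        # used in place of the negative gradient in the NCG recurrence
        g = np.concatenate([Ub.ravel(), Vb.ravel()]) - x
        if d is None:
            d = g
        else:
            denom = g_prev @ g_prev
            beta = max(0.0, g @ (g - g_prev) / denom) if denom > 1e-12 else 0.0
            d = g + beta * d
        # backtracking line search; fall back to the plain ALS step
        # (which never increases the loss) if no decrease is found
        f0, alpha = loss(R, U, V, lam), 1.0
        while alpha > 1e-4 and loss(R, *unpack(x + alpha * d), lam) >= f0:
            alpha *= 0.5
        x = x + alpha * d if alpha > 1e-4 else x + g
        g_prev = g
    return unpack(x)

rng = np.random.default_rng(0)
R = rng.normal(size=(20, 3)) @ rng.normal(size=(15, 3)).T  # rank-3 "ratings"
U0, V0 = rng.normal(size=(20, 3)), rng.normal(size=(15, 3))
U1, V1 = als_ncg(R, U0, V0, lam=0.1)
```

Because each accepted step either passes the backtracking decrease test or falls back to a plain ALS sweep, the loss is non-increasing across iterations, while the conjugate direction reuses information from previous ALS steps to take longer, better-aimed steps.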
The second part of the talk discusses a new type of parallel line search for large-scale unconstrained minimization of smooth loss functions such as logistic regression. The technique computes more accurate minima by evaluating a Taylor polynomial approximation to the loss function, which also reduces parallel communication costs, resulting in overall efficiency gains of a factor of 2 or more in parallel compared to existing approaches (http://arxiv.org/abs/1510.08345).
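The communication-saving idea behind such a line search can be sketched on a single machine. The version below is an illustrative assumption, not the paper's exact method: it uses a second-order Taylor model of the logistic loss along the search direction, built from per-sample margins gathered in one pass over the data, after which the one-dimensional minimization needs no further data access (and hence, in a distributed setting, no further communication rounds).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, X, y):
    """Logistic loss for labels y in {-1, +1}."""
    return np.sum(np.log1p(np.exp(-y * (X @ w))))

def polynomial_line_search(w, d, X, y, newton_iters=5):
    """Minimize phi(alpha) = logistic_loss(w + alpha*d) using quantities
    gathered in a single pass over the data; the subsequent scalar Newton
    iterations touch only the cached margins."""
    t = y * (X @ w)   # current margins, one pass over the data
    u = y * (X @ d)   # directional margin changes, same pass
    alpha = 0.0
    for _ in range(newton_iters):
        p = sigmoid(-(t + alpha * u))        # per-sample probabilities
        g = -np.sum(u * p)                   # phi'(alpha)
        h = np.sum(u * u * p * (1.0 - p))    # phi''(alpha) >= 0 (phi is convex)
        if h < 1e-12:
            break
        alpha -= g / h                       # scalar Newton update
    return alpha

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.0, -1.0, 0.5]) + 0.3 * rng.normal(size=200))
w = np.zeros(3)
grad = -X.T @ (y * sigmoid(-y * (X @ w)))    # gradient of the logistic loss
d = -grad                                    # steepest-descent direction
alpha = polynomial_line_search(w, d, X, y)
```

A conventional backtracking search would instead evaluate the full loss at several trial step lengths, each requiring another pass over the distributed data; here the extra evaluations are replaced by cheap scalar operations on the cached `t` and `u` vectors.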
This is joint work with Mike Hynes.
Location: ESB 4133
Lunch will be provided