A large part of a data scientist’s work involves tuning hyperparameters: making decisions about what happens to the data at each step of the process. Such tuning takes significant time and effort, and small differences in hyperparameters can substantially affect a pipeline’s performance. With our system, “Deep Mining,” we seek to automatically tune the entire data processing pipeline, not just the classification algorithms as in our previous project, ATM. This involves standardizing the pipeline abstractions and building and testing several hyperparameter selection and optimization methods. Since the earlier stages of the pipeline can be computationally expensive, we also focus on determining the most efficient distribution strategy and on developing a sampling-based performance estimator.
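The core idea behind the sampling-based estimator can be sketched with a toy example: score candidate pipeline hyperparameters on a cheap random subsample of the data, and confirm the winner on the full dataset. The names and the single-threshold "pipeline" below are purely illustrative, not the Deep Mining API or its actual optimization method.

```python
import random

# Toy "pipeline": one preprocessing hyperparameter (a threshold) followed
# by a fixed classification rule. Illustrative only, not Deep Mining code.
def pipeline_score(threshold, data):
    """Fraction of (x, label) pairs the rule `x > threshold` gets right."""
    correct = sum((x > threshold) == label for x, label in data)
    return correct / len(data)

def sampled_score(threshold, data, frac, rng):
    """Cheap estimate: score on a random subsample instead of all the data."""
    k = max(1, int(frac * len(data)))
    return pipeline_score(threshold, rng.sample(data, k))

rng = random.Random(0)
# Synthetic dataset: the true label is (x > 0.5), flipped with 10% noise.
data = [(x, (x > 0.5) != (rng.random() < 0.1))
        for x in (rng.random() for _ in range(2000))]

# Tune the threshold by random search, ranking candidates with the
# cheap subsample estimate, then evaluate the best on the full data.
candidates = [rng.random() for _ in range(50)]
best = max(candidates, key=lambda t: sampled_score(t, data, 0.1, rng))
print(f"best threshold = {best:.2f}, full-data score = {pipeline_score(best, data):.3f}")
```

In practice, Deep Mining replaces the random search here with Bayesian optimization and applies the subsampling idea to expensive early pipeline stages; the sketch only shows why a subsample estimate is enough to rank candidates cheaply.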

Publications

Sample, Estimate, Tune: Scaling Bayesian Auto-Tuning of Data Science Pipelines (PDF)
Alec Anderson, Sebastien Dubois, Alfredo Cuesta-Infante and Kalyan Veeramachaneni. IEEE International Conference on Data Science and Advanced Analytics, Tokyo, Japan. October 2017.

Deep Mining: Scaling Bayesian Auto-tuning of Data Science Pipelines (PDF)
Alec W. Anderson, M.E. thesis, MIT Dept. of EECS, August 2017. Advisor: Kalyan Veeramachaneni.

Contributors

William Xue
Laura Gustafson
Akshay Ravikumar
Alfredo Cuesta-Infante
Alec Anderson (2016-17)
Sebastien Dubois (2015)