Companies increasingly generate large amounts of valuable customer data but cannot use it to its full potential because of privacy concerns. The Synthetic Data Vault (SDV) enables data scientists to sidestep data-sharing concerns and expand the pool of possible participants on a problem by generating synthetic data. By learning a generative model that accounts for dependencies and relationships in the data, the SDV creates new data that resembles the original set statistically, formally, and structurally, and can therefore easily be used in its place. In our tests, data scientists using SDV data performed as well as or better than data scientists using the original data in more than 70% of cases.
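
To make the idea of learning a generative model and sampling from it concrete, here is a minimal sketch. It is not the SDV's implementation or API: it assumes a single table with two numeric columns, uses a plain bivariate Gaussian as a simplified stand-in model, and all names and data (`fit_gaussian_pair`, `sample_rows`, the age and spending columns) are hypothetical.

```python
import math
import random
import statistics


def fit_gaussian_pair(xs, ys):
    """Learn means, standard deviations, and the correlation of two columns."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample_rows(model, n, rng):
    """Draw n synthetic (x, y) rows that preserve the learned correlation."""
    rho = model["rho"]
    # Clamp guards against tiny negative values from floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


rng = random.Random(0)

# Hypothetical "original" data: customer age and a spending figure tied to it.
ages = [rng.gauss(40, 10) for _ in range(500)]
spend = [2 * a + rng.gauss(0, 10) for a in ages]

model = fit_gaussian_pair(ages, spend)
synthetic = sample_rows(model, 500, rng)

# Refit on the synthetic rows to compare their statistics with the original.
refit = fit_gaussian_pair([r[0] for r in synthetic], [r[1] for r in synthetic])
print("original correlation: ", round(model["rho"], 3))
print("synthetic correlation:", round(refit["rho"], 3))
```

None of the synthetic rows corresponds to a real customer, yet the means and the dependence between the two columns stay close to those of the original table. A full system would also need to handle categorical columns, missing values, and relationships across multiple tables, which this sketch leaves out.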

Publications

The Synthetic Data Vault (PDF)
Neha Patki, Roy Wedge, Kalyan Veeramachaneni. IEEE International Conference on Data Science and Advanced Analytics, Montreal, Canada, 2016.

The Synthetic Data Vault: Generative Modeling for Relational Databases (PDF)
Neha Patki, MEng. Thesis, MIT EECS, 2016. Advisor: Kalyan Veeramachaneni.

The Synthetic Student: A Machine Learning Model to Simulate MOOC Data (PDF)
Michael Wu, MEng. Thesis, MIT EECS, 2015. Advisor: Kalyan Veeramachaneni.

Contributors

Andrew Montanez

Katharine Xiao (2016-17)
Neha Patki (2015-16)
Roy Wedge (2015-16)

Michael Wu (2014-15)