Synthetic Data Vault

Companies are generating ever larger amounts of valuable customer data, but privacy considerations often prevent them from using it to its full potential. The Synthetic Data Vault (SDV) enables data scientists to sidestep data-sharing concerns and expand the pool of people who can work on a given problem by generating synthetic data. By learning a generative model that accounts for dependence and relationships, the SDV creates new data that resembles the original set statistically, formally, and structurally, and can therefore easily be used in its place. In our tests, data scientists using SDV data performed as well as or better than data scientists using the original data in more than 70% of cases.
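
To make the idea of "learning a generative model that accounts for dependence" concrete, here is a minimal Python sketch. It is not the SDV's implementation: it assumes a single table with two numeric columns and fits a plain bivariate Gaussian as a simplified stand-in for the SDV's model. The column names (age, yearly spend), the toy data, and the helper functions `fit_bivariate_gaussian` and `sample_rows` are all hypothetical and exist only for illustration.

```python
import math
import random
import statistics


def fit_bivariate_gaussian(rows):
    """Estimate means, standard deviations, and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    std_x, std_y = statistics.stdev(xs), statistics.stdev(ys)
    # Sample covariance, normalized to a correlation coefficient.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    rho = cov / (std_x * std_y)
    return {"mean": (mean_x, mean_y), "std": (std_x, std_y), "rho": rho}


def sample_rows(model, n, rng):
    """Draw n synthetic rows that preserve the fitted correlation."""
    (mean_x, mean_y), (std_x, std_y) = model["mean"], model["std"]
    rho = model["rho"]
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        # Mix two independent draws so the second column correlates with the first.
        x = mean_x + std_x * z1
        y = mean_y + std_y * (rho * z1 + residual * z2)
        out.append((x, y))
    return out


# Hypothetical "original" table of (age, yearly_spend) pairs, invented for this example.
rng = random.Random(0)
original = []
for _ in range(500):
    age = rng.gauss(40, 10)
    spend = 50 * age + rng.gauss(0, 300)
    original.append((age, spend))

model = fit_bivariate_gaussian(original)   # learn from the original rows
synthetic = sample_rows(model, 500, rng)   # generate new rows from the model
refit = fit_bivariate_gaussian(synthetic)  # check that the dependence carried over
```

Comparing `model["rho"]` with `refit["rho"]` shows whether the synthetic rows keep roughly the same dependence between the two columns as the original rows, even though no original row is copied. A real system would also need to handle non-Gaussian columns, categorical values, and relationships across multiple tables, which this sketch leaves out.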

Publications

The Synthetic Data Vault (PDF)

Neha Patki, Roy Wedge, Kalyan Veeramachaneni. IEEE International Conference on Data Science and Advanced Analytics (DSAA), Montreal, Canada, 2016.

The Synthetic Data Vault: Generative Modeling for Relational Databases (PDF)

Neha Patki, MEng. Thesis, MIT EECS, 2016. Advisor: Kalyan Veeramachaneni.

The Synthetic Student: A Machine Learning Model to Simulate MOOC Data (PDF)

Michael Wu, MEng. Thesis, MIT EECS, 2015. Advisor: Kalyan Veeramachaneni.

Contributors

Andrew Montanez
Katharine Xiao (2016-17)
Neha Patki (2015-16)
Roy Wedge (2015-16)
Michael Wu (2014-15)

Press

March 3, 2017 — Artificial data gives the same results as real data — without compromising privacy — MIT News

March 6, 2017 — Scientists Can Now Get Real Results from Fake Data — Popular Mechanics

April 10, 2017 — Artificial data reduces privacy concerns and helps with big data analysis — Tech Republic

March 7, 2017 — How Synthetic Data Can Overcome Privacy Concerns — The Huffington Post