Companies are increasingly generating large amounts of valuable customer data but cannot use it to its full potential because of privacy concerns. The Synthetic Data Vault (SDV) enables data scientists to sidestep data-sharing concerns and expand the pool of people who can work on a problem by generating synthetic data. By learning a generative model that accounts for dependence and relationships, the SDV creates new data that resembles the original set statistically, formally, and structurally, and can therefore easily be used in its place. In our tests, data scientists using SDV data performed as well as or better than data scientists using the original data in more than 70% of cases.

Neha Patki, Roy Wedge, Kalyan Veeramachaneni. IEEE International Conference on Data Science and Advanced Analytics, Montreal, Canada, 2016.

Neha Patki, MEng. Thesis, MIT EECS, 2016. Advisor: Kalyan Veeramachaneni.

Michael Wu, MEng. Thesis, MIT EECS, 2015. Advisor: Kalyan Veeramachaneni.

Andrew Montanez
Katharine Xiao (2016-17)
Neha Patki (2015-16)
Roy Wedge (2015-16)
Michael Wu (2014-15)

March 3, 2017 — Artificial data gives the same results as real data — without compromising privacy — MIT News

March 6, 2017 — Scientists Can Now Get Real Results from Fake Data — Popular Mechanics

April 10, 2017 — Artificial data reduces privacy concerns and helps with big data analysis — TechRepublic

March 7, 2017 — How Synthetic Data Can Overcome Privacy Concerns — The Huffington Post