When presented with a new dataset, human data scientists explore it in order to identify salient properties of the data elements, identify relationships between entities, and write processing software that makes use of those relationships accordingly. While there has been progress made on automatically processing the data to generate features or models, most automation systems rely on receiving a data model that has all the meta information about the data, including salient properties and relationships. In this project, we present a first version of our system, called ADEL- Automatic Data Elements Linking. Given a collection of files, this system generates a relational data schema and identifies other salient properties. It detects the type of each data field, which describes not only the programmatic data type but also the context in which the data originated, through a method called Type Detection. For each file, it identifies the field that uniquely describes each row in it, also known as a Primary Key. Then, it discovers relationships between different data entities with Relationship Discovery, and discovers any implicit constraints in the data through Hard Constraint Discovery. We posit two out of these four problems as learning problems.

Publications

Towards Automatic Data Element Linking (PDF)
Katharine Xiao, MEng. Thesis, MIT EECS, June 2017. Advisor: Kalyan Veeramachaneni.

Contributors

Andrew Montanez
Katharine Xiao (2016-17)