Exclusive: Tamr at the New Frontier of Big Data Curation

Our exclusive profile of Tamr (formerly Data Tamer), the latest startup from the legendary Michael Stonebraker, which emerged from stealth mode to address the new field of Big Data Curation.

Data is becoming the most valuable asset for many companies, but studies show that most organizations use only a small part of the data they collect. With data volumes growing exponentially, and the increasing variety and heterogeneity of data sources, how can enterprises leverage more of their data?

Data Curation

Help comes from the new and hot field of Data Curation. Unlike Data Integration, which uses a top-down approach to bring diverse data sources into one model (and does not scale to Big Data), Data Curation uses a bottom-up, data-driven approach. As Michael Brodie (an advisor to Tamr) put it in a recent KDnuggets Interview:

Data Curation is for Big Data what Data Integration is for small data.

Tamr

Tamr is an exciting new startup that wants to solve the data curation problem. It was co-founded in Fall 2012 as Data Tamer by two serial entrepreneurs: Michael Stonebraker, a legendary database researcher for whom it is the ninth startup, and Andy Palmer, who has been involved in founding and/or funding over 50 innovative companies. With such founders, the company has attracted a lot of financing, over $16 million from investors including Google Ventures and New Enterprise Associates (NEA), and a lot of attention, including the KDnuggets post "Data Tamer startup from Michael Stonebraker, Still in Stealth Mode".

On May 19th, Data Tamer emerged from stealth mode and renamed itself Tamr.

Last week, I stopped by their offices in the heart of Harvard Square, Cambridge, and received a briefing from Andy Palmer, Tamr CEO, and his young team, including Alan Wagner and Nidhi Aggarwal.

Tamr's approach to solving the Data Curation problem is designed to scale and to improve with more data. The key ideas are:

1. Scalability through automation: The size of the integration problems precludes a human-centric solution. Machine Learning methods are needed.

2. Data Cleaning: Enterprise data sources are inevitably quite dirty.

3. Non-programmer orientation: Current Extract, Transform and Load (ETL) systems have scripting languages that are appropriate for professional programmers. The scale of next generation problems requires that less skilled employees be able to perform integration tasks.

4. Incremental: New data sources must be integrated incrementally as they are uncovered. Data Curation is never finished!

Tamr also smartly combines automation and human expertise.

It starts by using Machine Learning and Data Analysis algorithms to find relationships between data elements, and tries to automate most data curation tasks. Where machine learning is not enough, it has well-defined processes and a UI for asking human experts for help, and uses a smart rewards structure to encourage the experts.
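One way to picture this hand-off between automation and experts is a confidence-based router. The sketch below is purely illustrative and is not Tamr's implementation: the `Candidate` class, the `route` function, and both threshold values are hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical cut-offs, chosen only for illustration.
AUTO_ACCEPT = 0.9   # at or above this score, merge automatically
AUTO_REJECT = 0.2   # at or below this score, discard automatically


@dataclass
class Candidate:
    """A proposed match between two attributes, scored by some model."""
    source_attr: str
    target_attr: str
    score: float  # assumed model confidence in [0, 1]


def route(candidate: Candidate) -> str:
    """Decide whether a candidate match is handled by machine or by a person."""
    if candidate.score >= AUTO_ACCEPT:
        return "accept"
    if candidate.score <= AUTO_REJECT:
        return "reject"
    # The model is unsure: queue the question for a domain expert.
    return "ask_expert"
```

Only the uncertain middle band reaches a person, which is what keeps the expert workload small as the number of sources grows.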

The Core of Tamr: Machine Learning with Human Insight

Tamr uses a continuous learning approach, and it learns more with usage, helping build an institutional memory and inventory of enterprise data.

Tamr supports RESTful APIs to allow companies to use their existing visualization and data science toolkits on top of Tamr. Postgres is used as a back-end database.

Some examples of the machine learning methods it uses are:
  • Perform fuzzy string comparisons over attribute names using trigram cosine similarity.
  • Treat a column of data as a document and tokenize its values with a standard full-text parser. Then measure TF-IDF cosine similarity between columns. This method is suitable for text fields.
  • Use minimum description length (MDL), a measure similar to Jaccard similarity, to compare the values of two attributes: compute the ratio of the size of the intersection of two columns' data to the size of their union. This method is well suited for categorical fields with a small number of values.
  • Compute Welch's t-test for a pair of columns that contain numeric values to get the probability that the columns were drawn from the same distribution.
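The four measures above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not Tamr's code: all function names are made up for the example, whitespace splitting stands in for a "standard full-text parser", the smoothed IDF formula is one common choice among several, and the intersection-over-union ratio is computed as a plain Jaccard ratio. The Welch function returns only the t statistic and the Welch-Satterthwaite degrees of freedom; turning those into a probability needs the t-distribution CDF, which a library call such as `scipy.stats.ttest_ind(a, b, equal_var=False)` provides.

```python
import math
from collections import Counter
from statistics import mean, variance


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def trigrams(name):
    """Character trigram counts of a lower-cased, space-padded name."""
    padded = f"  {name.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def name_similarity(a, b):
    """Fuzzy comparison of two attribute names via trigram cosine."""
    return cosine(trigrams(a), trigrams(b))


def tfidf_similarity(columns, x, y):
    """Treat each column as one document; compare x and y by TF-IDF cosine."""
    docs = {name: Counter(tok for v in values for tok in v.lower().split())
            for name, values in columns.items()}
    n = len(docs)
    doc_freq = Counter(tok for d in docs.values() for tok in d)

    def weights(doc):
        # Smoothed IDF (an assumption; other variants are equally valid).
        return {t: tf * (math.log((1 + n) / (1 + doc_freq[t])) + 1)
                for t, tf in doc.items()}

    return cosine(weights(docs[x]), weights(docs[y]))


def value_overlap(a, b):
    """Size of the intersection of two columns' values over their union."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two numeric columns."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

For example, `name_similarity("zip_code", "zipcode")` scores well above `name_similarity("zip_code", "phone")`, and `value_overlap(["a", "b", "c"], ["b", "c", "d"])` is 0.5 because two of the four distinct values are shared.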

In the screenshot below, Tamr identifies several fields which contain similar information for address and are candidates for merging.

Tamr Address Match

The Tamr administrator has a nice interface for selecting domain experts who can help with particular questions, including a ranking of experts by their expertise and current load.

Tamr interface for finding experts to help

Tamr has already worked with several large clients. In one case, it helped a major drug company integrate data from 15,000 spreadsheets from its biologists and chemists doing lab experiments. These spreadsheets had about 1M rows and 100K attribute names.

In another case, Tamr helped Verisk Health integrate the claims records for a collection of 300 insurance carriers, with 20 million records overall. The goal was to consolidate claims data by medical provider, and Tamr's solution was a significant improvement over previous approaches, which required a lot of manual intervention.

For more technical details about Data Tamer, see "Data Curation at Scale: The Data Tamer System", by Stonebraker et al., Proceedings of the CIDR conference, 2013.

KDnuggets has also covered other companies in the Data Curation space - see