Trifacta – Tackling Data Wrangling with Automation and Machine Learning

Trifacta wants to solve the important problem of data cleaning and transformation by building better interfaces that use machine learning. Its aim is to help enterprises make sense of their disparate data sources and cut the time needed to prepare data for data science.

By Ajay Ohri, Mar 17, 2014.

Trifacta aims to tackle the problem of wrangling huge amounts of disparate data into the shape and format needed for summaries, insights and analysis. It does so by building better interfaces that use machine learning for data transformation, and by adding data visualization.

Trifacta wants to shorten the data preparation and cleaning phase, which customarily takes up the bulk of a data mining project, and give the analyst more time for the more interesting discovery part of the project.

The possible data transformations are presented to the user, ranked by the score that the machine learning algorithms assign to them.
This makes sense because most errors within a dataset tend to repeat, and data cleaning largely consists of finding erroneous values and replacing them with values the analyst imputes using statistical methods and training.
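
To make the idea of ranked suggestions concrete, here is a minimal sketch in Python. It is not Trifacta's algorithm: the candidate transformations, the `rank_transformations` helper and the scoring rule (the fraction of analyst-supplied examples that a transformation reproduces exactly) are all assumptions made for illustration, standing in for whatever the machine learning actually does.

```python
import re

# Hypothetical pool of candidate transformations (names and functions are
# illustrative assumptions, not part of any real product).
CANDIDATES = {
    "strip whitespace": lambda s: s.strip(),
    "lowercase": lambda s: s.lower(),
    "strip and lowercase": lambda s: s.strip().lower(),
    "remove non-digits": lambda s: re.sub(r"\D", "", s),
}


def rank_transformations(examples, candidates=CANDIDATES):
    """Score each candidate by the share of (raw, desired) examples it
    reproduces exactly, and return (name, score) pairs, best first."""
    scored = []
    for name, transform in candidates.items():
        hits = sum(1 for raw, desired in examples if transform(raw) == desired)
        scored.append((name, hits / len(examples)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# A few corrections the analyst has already made by hand.
examples = [(" NY ", "ny"), ("Ca", "ca"), ("  tx", "tx")]
ranking = rank_transformations(examples)

for name, score in ranking:
    print(f"{score:.2f}  {name}")
```

The transformation that explains every example rises to the top of the list, and the analyst remains free to accept or reject it.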

Co-founded by the trio of Heer, Kandel and Hellerstein, Trifacta is backed by an impressive array of people, including investors (Greylock and Accel Partners), technical advisers from industry (Cloudera Chief Data Scientist Jeff Hammerbacher, Mike Bostock of D3.js, Data Scientist DJ Patil and Publisher Tim O'Reilly) and professors (from Berkeley, University of Washington and Stanford). Trifacta has also started lining up key partners: Cloudera, Tableau Software and the PaaS provider Pivotal.

Trifacta's approach to Big Data is to strike the right balance among humans, data, and algorithms. This is a key difference from Paxata, the other data cleaning automation company we reviewed recently.

Think of it this way: suppose you had a very big, spreadsheet-like interactive tool, like Excel on steroids, that not only lets you work with and edit huge datasets but also captures every edit you make in its memory, and then applies machine learning to those edits so that the next time similar data needs to be transformed, it suggests possible edits to you. That, in short, is the power of Trifacta.
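
The capture-then-suggest loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `EditLog` class is hypothetical, and plain frequency counting stands in for the machine learning that the real product applies to past edits.

```python
from collections import Counter, defaultdict


class EditLog:
    """Hypothetical memory of past edits: remembers which replacement the
    analyst chose for each raw value and how often."""

    def __init__(self):
        self._history = defaultdict(Counter)

    def record(self, old, new):
        """Capture one manual edit (old value -> new value)."""
        self._history[old][new] += 1

    def suggest(self, value):
        """Return previously used replacements for value, most frequent first."""
        seen = self._history.get(value)
        if not seen:
            return []
        return [new for new, _count in seen.most_common()]


log = EditLog()
log.record("calif.", "CA")
log.record("calif.", "California")
log.record("calif.", "CA")

suggestions = log.suggest("calif.")   # replacements seen before, ranked
unknown = log.suggest("texas")        # nothing recorded yet
print(suggestions, unknown)
```

A real system would generalize from past edits to values it has never seen; this sketch only replays exact matches.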

Trifacta screen shot

An additional innovation is Trifacta's creation of Vega, which builds on D3 to provide better data visualization for data scientists. Vega is a visualization grammar, a declarative format for creating and saving visualization designs. With Vega you can describe data visualizations in a JSON format, and generate interactive views using either HTML5 Canvas or SVG.
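
As a rough illustration of what "describing a visualization in JSON" looks like, the Python sketch below builds a small bar chart description as a dictionary and serializes it. The sample data and field names (`category`, `amount`) are made up, and the property layout follows later Vega releases, so it may differ from the version available when this article was written; check it against the Vega documentation before relying on it.

```python
import json

# Made-up sample data for the chart.
rows = [
    {"category": "A", "amount": 28},
    {"category": "B", "amount": 55},
    {"category": "C", "amount": 43},
]

# A declarative description of a bar chart: data, scales, axes and marks.
spec = {
    "width": 400,
    "height": 200,
    "data": [{"name": "table", "values": rows}],
    "scales": [
        {"name": "xscale", "type": "band", "range": "width", "padding": 0.05,
         "domain": {"data": "table", "field": "category"}},
        {"name": "yscale", "type": "linear", "range": "height", "nice": True,
         "domain": {"data": "table", "field": "amount"}},
    ],
    "axes": [
        {"orient": "bottom", "scale": "xscale"},
        {"orient": "left", "scale": "yscale"},
    ],
    "marks": [
        {"type": "rect",
         "from": {"data": "table"},
         "encode": {"enter": {
             "x": {"scale": "xscale", "field": "category"},
             "width": {"scale": "xscale", "band": 1},
             "y": {"scale": "yscale", "field": "amount"},
             "y2": {"scale": "yscale", "value": 0},
         }}},
    ],
}

# The JSON text is what a Vega runtime would render with Canvas or SVG.
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Nothing in the description says how to draw the chart; it only states what the chart contains, which is the point of a declarative grammar.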

With both Trifacta and Paxata trying to make life easier for data scientists by automating the data cleaning part of a Big Data analytics project, we expect this space to see further rounds of innovation.