Paxata automates Data Preparation for Big Data Analytics

Paxata wants to shorten and automate the data cleaning process, by augmenting data from a huge number of sources and by using machine learning to see statistical similarities between the data imported.

By Ajay Ohri, Mar 7, 2014.

Paxata intends to take the munging out of the data science process by shortening and semi-automating data cleaning. It does so by augmenting data from a huge number of sources as well as from its data enrichment library, and by using machine learning to find statistical similarities in the imported data.

In this case, the machine learning combines text mining and association analysis with graph analysis.

The solution runs on the Rackspace cloud. By applying its own algorithms to the data, Paxata creates a data model in the form of a graph, with associations among the data objects. These associations are then used to resolve data quality issues.

For example, if you import one dataset in which the customer ID is named Customer_ID and another in which it is named Account_Number, Paxata can prompt the user that the two are one and the same. This is incredibly useful given the enormous share of time data scientists spend in the data preparation phase of a project. The time saved can then go to better visualizations or higher-level analytics.
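To make the Customer_ID and Account_Number example concrete, here is a minimal Python sketch of one way two differently named columns could be flagged as the same field. It is not Paxata's algorithm, which the company has not published here; it simply compares the values in each pair of columns using Jaccard similarity. The function names, the sample data and the 0.5 threshold are all assumptions made for illustration.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity of two collections of values, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def suggest_column_matches(left, right, threshold=0.5):
    """Return (left_column, right_column, score) triples whose value
    overlap meets the threshold, highest score first."""
    matches = []
    for (lname, lvals), (rname, rvals) in product(left.items(), right.items()):
        score = jaccard(lvals, rvals)
        if score >= threshold:
            matches.append((lname, rname, score))
    return sorted(matches, key=lambda m: m[2], reverse=True)


# Hypothetical sample data: each dataset maps a column name to its values.
orders = {
    "Customer_ID": [101, 102, 103, 104],
    "Amount": [25.0, 40.0, 12.5, 99.0],
}
accounts = {
    "Account_Number": [101, 102, 103, 105],
    "Region": ["East", "West", "East", "North"],
}

for left_col, right_col, score in suggest_column_matches(orders, accounts):
    print(f"{left_col} may be the same field as {right_col} (overlap {score:.2f})")
```

With this sample data, only the Customer_ID and Account_Number pair clears the threshold, since three of the five distinct values appear in both columns. A real tool would likely also weigh column names, data types and value distributions rather than relying on exact value overlap alone.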

Once prepared, the data is available as a single dataset ready for analysis in software such as Tableau, QlikView, Excel and any ODBC-compliant tool. You can also export the prepared data to Hadoop clusters.

The founding team is led by CEO Prakash Nanduri, a seasoned MDM entrepreneur. With $10 million in funding from Accel Partners, this is a startup making waves that will impact enterprise software for data science.

Pricing is $3,500 a year for Pax Personal and $10,000 for Pax Share. Cloud storage is 1 GB with Pax Personal and 5 GB with Pax Share, which seems a bit low. Pax Share also adds API access through the command line. The third tier is custom pricing for enterprises, based on assessed needs.

You can watch a demo here


With successful use cases including better business analytics, fraud analysis, demand forecasting and resource optimization, this solution can help many businesses struggling with the data deluge of spreadsheets and data marts. I do hope the Paxata team puts up a more hands-on demo (such as one that lets you upload your own dummy data) to show the solution at work, as I think this would further enhance its credibility and ease the adoption of automated data preparation. A trial period or demo license could help spread the word even further.

Data preparation automation has been the dream of many data scientists, and this space will only heat up given the huge amounts of data now being processed in the Big Data Analytics era.