Dataiku Data Science Studio
Data Science Studio (DSS) from Dataiku is a complete Data Science software tool for developers and analysts, which significantly shortens the time-consuming load-clean-train-test-deploy cycles of building predictive applications. A community edition and a free trial available.
Load and Prepare
First of all, DSS enables direct and fast connection to the most common sources (Hadoop, SQL, Cassandra, MongoDB, S3, …) and formats (CSV, Excel, SAS, JSON, Avro, …) for data today. After connecting to a data source, the first step in any serious modeling job is cleaning the data. As you know, this process typically takes up as much as 80% of data scientists’ time. To speed this up, our studio comes with strong data integration capabilities. First, DSS automatically infers smart data types over data (such as Gender, Country, IP Address, URL, Date, …). Analysts can leverage these smart data types to validate and transform the data in an automated way (like getting the country out the IP, the day of week out of the date, and so on). They can also perform more mundane tasks such as replacement, grouping, splitting, calculating, and others on DSS’s interface which shows them an instant visual feedback of any operation.
Modeling: from columns to predictions
Once data is clean and rich, users can train supervised or unsupervised models. It automatically creates a tunable features engineering pipeline with missing value imputation, dummy variable encoding, impact coding, feature generation, and reduction.
It then trains different models using algorithms from the popular Scikit-Learn and H2O frameworks. Users can compare models, analyze their performance, and tune all the parameters from the web UI. Committed to a White Box approach, DSS lets advanced users generate the full Python source level code for the whole pipeline (in Python), so that they can modify it at will.
From model to predictive application
Once models are built, users can create robust and repeatable workflows that enable retraining or scoring. These workflows include the full data flow from raw data to predictions. The workflow engine is incremental (for instance, it can work incrementally on hourly updated data) and robust (it can intelligently recover from situations were part of the data was temporarily available). These workflows can be extended with custom code in Python, R, SQL or Pig, enabling full Hadoop based workloads. Predicted values can be accessed through a REST API or published directly by DSS to a variety of destinations (e.g. ElasticSearch, FTP servers, internal Data warehouse).
We have built Data Science Studio in order to shorten the load-clean-train-test cycles without dumbing them down; it is a tool for qualified data scientists while remaining accessible for less technical business intelligence or marketing profiles.
Free community edition is available (limited to 100,000 rows and one user). A free trial version of the tool is also available at http://www.dataiku.com/products/trynow/
Brief author bio:
Florian is Dataiku CEO. He began his tech career at Exalead, a search engine technology company based in Paris. There, he led the R&D team made of 50 brilliant data geeks, until the company’s $150 M exit in 2010. Florian then went on to become CTO at IsCool, a European leader in social gaming. Florian also served as freelance Lead Data Scientist in various companies, such as Criteo, the European Advertising leader. Florian speaks regularly at technical groups such as the Open World Forum or the Paris Java User Group.