Interview: Florian Douetteau, Dataiku Founder, on Empowering Data Scientists
Here is an interview with Florian Douetteau, founder of Dataiku, on how their tools empower data scientists, and how data science itself is evolving.
By Ajay Ohri, Author.
Dataiku develops a collaborative end-to-end software platform called Data Science Studio (DSS) that companies use to accelerate the development of in-house business and predictive solutions. It promises a significant increase in efficiency and productivity for the company’s data scientists, business analysts, and product managers.
Florian Douetteau is Dataiku’s Chief Executive Officer. Florian started his career at Exalead, an innovative search engine technology company. There, he led an R&D team of 50 brilliant data geeks, until the company was bought by Dassault Systemes in 2010 for $150 million. Florian was then CTO at IsCool, a European leader in social gaming, where he managed game analytics and one of the biggest European cloud setups. Florian also served as a freelance Lead Data Scientist at various companies, such as Criteo, the European advertising leader.
Here is my interview with him:
Ajay Ohri: Describe your journey as a data science startup. What made you decide to build DSS?
Florian Douetteau: In 2012, my partners and I saw an opportunity: the data science market was (and still is) extremely fragmented. We’re living in a very interesting technological universe where lots of tools and options for working on and with data are available. Today, the challenge is more about applying the right tool at the right location and then tackling the complexity of having multiple storage systems and languages. For instance, you could prefer using Pig for some data munging, Hive for computations, Python or R for advanced modeling, ElasticSearch for search, Hadoop for large-scale processing, and so on.
So we took a step back and looked at the big picture: what were we trying to solve and why? What existed and who else was trying to solve it? Then, we focused on the users. How could we solve the fragmentation problem of the data science ecosystem (for both proprietary and open source solutions) for them better than the rest? What weren’t those users getting from available solutions and how could we bring it to them intelligently? For us, that meant enabling our users, no matter their skill set or level of expertise, to collaborate while maintaining the freedom to use the tools and languages they know best.
Ajay Ohri: Describe your product: how does it help experienced as well as aspiring data scientists?
Florian Douetteau: Dataiku is guided by the belief that to succeed in the world’s rapidly evolving data ecosystem, companies - no matter their industry or size - must continuously re-invent & deliver innovative data products. With this in mind, our mission is to provide all organizations with the technological environment that will enable their teams to effectively dispense the data innovations of tomorrow. Dataiku’s approach to collaborative data science and machine learning enables these organizations to compete with the digital giants that have blossomed in the past decade.
Thanks to a collaborative and team-based user interface for data scientists and beginner analysts, to a unified framework for both development and deployment of data projects, and to immediate access to all the features and tools required to design data products from scratch, users can easily apply machine learning and data science techniques to all types, sizes, and formats of raw data to build and deploy predictive data flows.
Finally, without the hassle of connecting and drubbing tools, users of all experience levels can quickly learn and excel in languages like R or Python and discover what machine learning is really all about.
Ajay Ohri: What feedback have you received from your customers? Can you describe some case studies where DSS use led to much better results than, say, alternative data science editors?
Florian Douetteau: Thanks to Dataiku DSS’s collaborative features and advanced analytics capabilities, customers such as AXA, L’Oreal, Bechtel, Webbmason, Urban Insights, and many more easily apply machine learning and data science techniques to all types, sizes, and formats of raw data to build and deploy predictive data flows. Use cases include churn prediction, fraud detection, dynamic customer segmentation, cost and logistic optimization, predictive maintenance, trend forecasting, and much more. And so far, feedback is excellent: increased team productivity (“Within a few months, we’d established 30% increase in productivity”), new business opportunities (“With DSS, we have internalized the design & deployment of our data solutions”), development of solutions that actually deliver additional revenue and savings (“DSS has paid for itself”), quick onboarding (“I’ve even had our marketing and business teams try it out”), easy and secure deployment (“We no longer have to re-code everything”), etc. And we plan on keeping it that way!
Ajay Ohri: What do you think about using Python and R together in a data science workflow? What advantages does it give over a single-language approach?
Florian Douetteau: The ability to use different languages in one project (from SQL, R or Python to Hive, Pig, or all things Spark) is great for two main reasons:
- Different languages are better suited to different parts of the data science workflow – for example, R may be better for statistical computations, Python for algorithms, and Hive for all things Hadoop.
- People are more or less comfortable with all of the different languages and technologies that are available. If team managers enable them to use the tools they know best, they’re ensuring optimized productivity and individual freedom.
Ajay Ohri: A lot of data input increasingly comes through APIs or from parsing text on the web. How does DSS handle the often time-consuming task of crafting API requests and parsing the responses into a data-frame-like structure?
Florian Douetteau: With DSS we provide a set of plugins that help integrate with various text analytics services and APIs. For instance, we provide a free plugin for the popular import.io service, a plugin for the IMDB APIs, and plugins for various rich open data sources such as the US patent database, OpenStreetMap, and Project Gutenberg.
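To make the task raised in the question concrete, here is a minimal, generic sketch of turning a nested JSON API response into a table of rows and columns. It is not how the DSS plugins are implemented; the payload, the field names, and the helper functions `flatten` and `to_table` are all hypothetical and exist only for illustration.

```python
import json

# Hypothetical API payload, invented for illustration only.
SAMPLE_RESPONSE = """
{"results": [
  {"id": 1, "title": "First", "author": {"name": "Ann", "country": "FR"}, "tags": ["a", "b"]},
  {"id": 2, "title": "Second", "author": {"name": "Bob"}, "tags": []}
]}
"""


def flatten(record, prefix=""):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        elif isinstance(value, list):
            # Lists are joined into one cell here; a real pipeline
            # might instead explode them into separate rows.
            flat[name] = ",".join(str(v) for v in value)
        else:
            flat[name] = value
    return flat


def to_table(records):
    """Return (columns, rows); fields missing from a record become None."""
    flat_records = [flatten(r) for r in records]
    columns = []
    for rec in flat_records:
        for key in rec:
            if key not in columns:
                columns.append(key)
    rows = [[rec.get(col) for col in columns] for rec in flat_records]
    return columns, rows


payload = json.loads(SAMPLE_RESPONSE)
columns, rows = to_table(payload["results"])
print(columns)
for row in rows:
    print(row)
```

Most of the manual effort sits in decisions like the ones made here: how to name nested fields, what to do with lists, and how to handle records that lack a field. That is the kind of work a ready-made plugin spares the user.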
Ajay Ohri: We have Jupyter and RStudio as established data science interfaces. How does DSS score over them? In which areas won’t DSS be a better option?
Florian Douetteau: We’ve integrated Jupyter as part of our offering. The core features of our product are: Visual Data Preparation, Visual Machine Learning, Visualisation, Workflow, SQL Notebook, Code Notebooks. Code Notebooks are actually implemented with the Jupyter framework.
Ajay Ohri: What are your future plans and ideas? Can we expand the pool of data scientists by using easier-to-use tools?
Florian Douetteau: I like to think about two new interesting trends. One is "Live Data": it’s soon going to be all about building products that handle data coming from moving, living systems. This means real-time processing. It means deep learning techniques. It means having the tools and technologies that can handle the complexity of live data structures.
The other one is “Thinking Apps”. A large corporation today has applications that fall into two main categories:
- transactional applications that enforce a business process,
- reporting applications that provide insights on the data.
There is an increasing demand for applications that need to follow a rather simple business process, but involve a vast amount of data that is reduced and analysed by an algorithm with a human making the final interactions. That’s the case of modern fraud detection applications, where an algorithm reduces all weak signals from the data, and where the human analyzes the resulting alerts. That’s also the case for modern marketing campaign management applications, where an algorithm analyses past campaigns, makes attributions, forecasts the current campaign performance, and a human takes possible new resource allocation decisions. And there’s a demand for similar “thinking apps” in process quality control, predictive maintenance, operational support, human resources, and so on.
What’s new is that today there’s a need for business applications that are not just passive displayers of information or controllers of a process. People expect apps that can “think” with them; and at Dataiku we can help people actually implement them. That’s exciting!
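The fraud detection pattern described above, where an algorithm condenses weak signals and a human reviews the resulting alerts, can be sketched in a few lines. This is only an illustration under stated assumptions, not a Dataiku implementation: the signal names, the point weights, the threshold, and the sample transactions are all hypothetical, and a real system would use a trained model rather than fixed weights.

```python
from dataclasses import dataclass

# Hypothetical weak signals and point weights, invented for illustration.
WEIGHTS = {"new_device": 3, "foreign_ip": 3, "large_amount": 4}
ALERT_THRESHOLD = 5  # assumed cut-off; scores at or above it go to a human


@dataclass
class Alert:
    transaction_id: str
    score: int
    reasons: list


def score_transaction(signals):
    """Combine boolean weak signals into one score, keeping the reasons."""
    reasons = [name for name, active in signals.items()
               if active and name in WEIGHTS]
    score = sum(WEIGHTS[name] for name in reasons)
    return score, reasons


def build_review_queue(transactions):
    """Algorithmic step: reduce all transactions to a ranked list of alerts."""
    alerts = []
    for tx_id, signals in transactions.items():
        score, reasons = score_transaction(signals)
        if score >= ALERT_THRESHOLD:
            alerts.append(Alert(tx_id, score, reasons))
    # Highest score first, so the analyst sees the riskiest cases first.
    return sorted(alerts, key=lambda a: a.score, reverse=True)


# Hypothetical sample data.
TRANSACTIONS = {
    "t1": {"new_device": True, "foreign_ip": False, "large_amount": False},
    "t2": {"new_device": True, "foreign_ip": True, "large_amount": True},
    "t3": {"new_device": False, "foreign_ip": True, "large_amount": True},
}

queue = build_review_queue(TRANSACTIONS)
for alert in queue:
    # Human step: an analyst reviews each alert and makes the final call.
    print(alert.transaction_id, alert.score, alert.reasons)
```

The point of the design is the division of labour: the code only narrows a large volume of data down to a short, explained queue, and the decision itself stays with the person reading it.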
Bio: Ajay Ohri is the author of two books on R (R for Business Analytics and R for Cloud Computing) and a forthcoming book on Python (Python for R Users).