Interview: Stefan Groschupf, Datameer on Balancing Accuracy and Simplicity in Analytics
We discuss common pain points in Big Data projects, the evolution of Datameer technology, the department-specific solution Datameer Professional, Smart Execution in Datameer 5.0, tackling over-simplicity and more.

Stefan is currently CEO and Chairman of Datameer, the company he co-founded in 2009 after several years of architecting and implementing distributed big data analytic systems for companies like Apple, EMI Music, Hoffmann La Roche, AT&T, the European Union, and others. Stefan is a frequent conference speaker and contributor to industry publications and books, holds patents, and advises a set of startups on product, scale and operations.
Here is my interview with him:
Anmol Rajpurohit: Q1. A lot of companies across industry verticals are investing today in Big Data. What are the most common pain points that you hear from such companies? How do you assess most of these companies on Big Data maturity?
Stefan Groschupf: Data maturity varies by vertical.
Obviously, emerging tech companies like Twitter, Facebook, LinkedIn, etc. adopted technologies like Hadoop, MongoDB or DataStax, very, very early. The second industry to quickly adopt big data has been financial services because they see data as their competitive edge. Now, we are seeing industries like healthcare and manufacturing adopting.

Pain points can also depend on how companies initially planned their big data approach. Early on, a lot of companies thought they could implement big data architecture themselves. Now, those kinds of homegrown, duct-taped projects are falling apart.
A lot of companies have challenges around integration, monitoring and management -- the standard enterprise challenges you have with emerging technologies. Products like Datameer and other applications built on top of Hadoop can be extremely helpful because they’re standard – teams don’t have to piece together what their predecessors did.
AR: Q2. The Datameer Product Stack has evolved significantly since the company's launch in 2009. How would you describe the current state? What components are you focusing on right now?

We still have a few things to build out, but we’re on the same path we’ve always been on. That's pretty cool compared to other companies that had to pivot many times.
AR: Q3. The recently released Datameer Professional, a Big Data analytics cloud platform utilizing "Hadoop-as-a-Service (HaaS)", is targeted toward department-specific deployments. What is the unique value in having department-specific Big Data solutions? How does it integrate with enterprise-wide deployments?

The bottom line is that department heads don’t have the resources to get started right now, but they’re asking for it. With that in mind, we released Datameer Professional which easily integrates with enterprise-wide deployments.
It provides a stepping stone so that individual business units can start realizing the value of big data analytics without waiting until their company is fully up and running on Hadoop.
AR: Q4. Datameer's latest product release (v 5.0) introduced a new feature called "Smart Execution", which dynamically selects the computation framework for the underlying data. Which computation frameworks does it choose from, and what data attributes are used to make this decision?
SG: There are so many new technologies in the Hadoop space right now that it can be confusing to figure out the right execution engine for a job. This does the work for you.

AR: Q5. As the number of business users interested in analyzing data has skyrocketed, there has been an unprecedented quest for making Analytics tools "super simple to use". Where do you draw the line between helpful simplicity and counterproductive over-simplicity? How can business users examine whether their Analytics tools are over-simplifying the analytical insights?
SG: I don’t think we should oversimplify the analytical insight, but we should absolutely simplify the user experience. It’s amazing if you can get to the point where the user experience is so simplified that the technology is running in the background and it’s no longer noticeable. For example, elevators that are running predictive algorithms to move people even faster up and down when the riders have no idea that predictive analytics are even running, or what happens with banks behind the scenes when they catch fraudulent transactions on your credit card.
Around the analytics itself, I think it absolutely needs to have a lower barrier to entry. If it takes me two months to wait for IT to get something done or I can do it in Excel right now and make my boss happy, what will I do? I will use Excel. I think there's a huge opportunity to really lower the barrier to data analytics as low as Excel.
But, it’s really important to get the analytical insights accurate. We can't oversimplify that.
AR: Q6. In one of your blog posts earlier this year, you mentioned that, "in 2015, Hadoop will move from a special purpose platform to a general purpose platform". Can you please elaborate on that? While it is intuitive from a business user perspective to have all data on Hadoop regardless of the data characteristics (to make it easier to access), would it be efficient and effective from a technical perspective?
SG: Yes, it's so interesting. Companies have Cassandra, Oracle and Teradata, and then they also have Hadoop. They have everything. Hadoop, in most organizations that we talk with, is only used for data preparation. They're using Hadoop to clean data or to organize log files. Then they pull the data into Oracle and run Tableau on top of that.
Moving data is stupid, and having five different systems doesn't make sense. There needs to be a general-purpose data cluster, and Hadoop has a good chance to be that.

There's a unique opportunity here to make the Hadoop ecosystem a general-purpose data warehouse where you can have five people working on five petabytes or you have five thousand people each working on five megabytes and it doesn't really matter.
Second part of the interview
