Interview: Stefan Groschupf, Datameer on Balancing Accuracy and Simplicity in Analytics
We discuss common pain points in Big Data projects, the evolution of Datameer technology, the department-specific solution Datameer Professional, Datameer 5.0 Smart Execution, tackling over-simplicity, and more.
Stefan Groschupf is a big data veteran and serial entrepreneur with strong roots in the open source community. He was one of the very few early contributors to Nutch, the open source project that spun off Hadoop, which, 10 years later, is considered a 20 billion dollar business.
Stefan is currently CEO and Chairman of Datameer, the company he co-founded in 2009 after several years of architecting and implementing distributed big data analytic systems for companies like Apple, EMI Music, Hoffmann La Roche, AT&T, the European Union, and others. Stefan is a frequent conference speaker and contributor to industry publications and books, holds patents, and is advising a set of startups on product, scale and operations.
Here is my interview with him:
Anmol Rajpurohit: Q1. A lot of companies across industry verticals are investing today in Big Data. What are the most common pain points that you hear from such companies? How do you assess most of these companies on Big Data maturity?
Stefan Groschupf: Obviously, emerging tech companies like Twitter, Facebook, LinkedIn, etc. adopted technologies like Hadoop, MongoDB or DataStax very, very early. The second industry to quickly adopt big data has been financial services because they see data as their competitive edge. Now, we are seeing industries like healthcare and manufacturing adopting it.
As for pain points, again, different industries experience different challenges. The complexity of data is actually a bigger problem than the data size. I think that's important because you have Moore's Law catching up with everything, but the data complexity problem remains.
Pain points can also depend on how companies initially planned their big data approach. Early on, a lot of companies thought they could implement big data architecture themselves. Now, those kinds of homegrown, duct-taped projects are falling apart.
A lot of companies have challenges around integration, monitoring, and management -- the standard enterprise challenges you have with emerging technologies. Products like Datameer and other applications built on top of Hadoop can be extremely helpful because they’re standard – teams don’t have to piece together what their predecessors did.
AR: Q2. The Datameer Product Stack has evolved significantly since the company's launch in 2009. How would you describe the current state? What components are you focusing on right now?
SG: Actually, I just looked at the original drawing of the Datameer plans and they are 100 percent the same. It was three boxes: data in, analytics and visualization. In the boxes, we had database connectors, spreadsheet user interface and point-and-click visualization. We always envisioned adding more functionality that we now are able to implement because of emerging improvements in the Hadoop space around Flink and Spark and more real-time execution environments.
We still have a few things to build out, but we’re on the same path we’ve always been on. That's pretty cool compared to other companies that had to pivot many times.
AR: Q3. The recently released Datameer Professional, a Big Data analytics cloud platform utilizing "Hadoop-as-a-Service (HaaS)", is targeted toward department-specific deployments. What is the unique value in having department-specific Big Data solutions? How does it integrate with enterprise-wide deployments?
SG: We’re finding that many department executives, especially from Fortune 1000 companies, want to get their feet wet with big data analytics. This includes marketing, HR, finance, sales – you name it. However, their IT departments have not yet rolled out a big data solution or are still in the beginning stages.
The bottom line is that department heads don’t have the resources to get started right now, but they’re asking for it. With that in mind, we released Datameer Professional, which easily integrates with enterprise-wide deployments.
It provides a stepping stone so that individual business units can start realizing the value of big data analytics without waiting until their company is fully up and running on Hadoop.
AR: Q4. Datameer's latest product release (v 5.0) introduced a new feature called "Smart Execution", which dynamically selects the computation framework for the underlying data. Which computation frameworks does it select from, and what data attributes are used to make this decision?
SG: There are so many new technologies in the Hadoop space right now that it can be confusing to figure out the right execution engine for a job. This does the work for you.
Smart Execution selects from three execution frameworks: in-memory, Tez, and optimized MapReduce. We plan to add Spark, Flink, and other technologies as they become enterprise-ready. It uses a combination of factors such as data set size, complexity of analytics, and capacity of the Hadoop environment to do cost-based optimization, then chooses the optimal framework for each step of your analytics pipeline, switching automatically as circumstances change.
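Editor's illustration: the kind of per-step, cost-based selection described in this answer can be sketched roughly as follows. This is not Datameer's implementation. The class names, fields, cost formulas, and constants are all hypothetical, chosen only to show how data set size, analytic complexity, and cluster capacity might feed a cost estimate for each of the three named frameworks.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class StepProfile:
    """One step of an analytics pipeline (hypothetical description)."""
    input_bytes: int      # data set size for this step
    operator_count: int   # crude proxy for the complexity of the analytics


@dataclass(frozen=True)
class ClusterState:
    """Snapshot of Hadoop cluster capacity (hypothetical fields)."""
    node_free_memory_bytes: int   # memory free on the node used for in-memory runs
    free_containers: int          # containers available for distributed runs
    container_memory_bytes: int   # memory per container


def estimate_costs(step: StepProfile, cluster: ClusterState) -> Dict[str, float]:
    """Return an illustrative cost per framework; all constants are made up."""
    work = (step.input_bytes / 1e9) * step.operator_count
    parallelism = max(cluster.free_containers, 1)

    # In-memory: no startup cost, but only feasible if the data fits on one node.
    fits_on_node = step.input_bytes <= cluster.node_free_memory_bytes
    in_memory = work if fits_on_node else float("inf")

    # Tez: small startup cost, penalised when the data exceeds aggregate container memory.
    cluster_memory = parallelism * cluster.container_memory_bytes
    spill_penalty = 1.0 if step.input_bytes <= cluster_memory else 4.0
    tez = 5.0 + work * 2.0 * spill_penalty / parallelism

    # MapReduce: high per-operator startup cost, steady throughput at any size.
    mapreduce = 20.0 * step.operator_count + work * 3.0 / parallelism

    return {"in_memory": in_memory, "tez": tez, "mapreduce": mapreduce}


def choose_framework(step: StepProfile, cluster: ClusterState) -> str:
    """Pick the framework with the lowest estimated cost for one step."""
    costs = estimate_costs(step, cluster)
    return min(costs, key=costs.get)


def plan_pipeline(steps: List[StepProfile], cluster: ClusterState) -> List[str]:
    """Choose a framework independently for each step of the pipeline."""
    return [choose_framework(step, cluster) for step in steps]


if __name__ == "__main__":
    cluster = ClusterState(8 * 10**9, 20, 4 * 10**9)
    steps = [StepProfile(10**8, 3), StepProfile(5 * 10**10, 3), StepProfile(5 * 10**12, 3)]
    print(plan_pipeline(steps, cluster))
```

Because the choice is re-evaluated per step against the current cluster snapshot, a pipeline in this sketch can start on one framework and move to another as data volumes or available capacity change, which mirrors the "switching automatically" behavior described above.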
AR: Q5. As the number of business users interested in analyzing data has skyrocketed, there has been an unprecedented quest for making Analytics tools "super simple to use". Where do you draw the line between helpful simplicity and counterproductive over-simplicity? How can business users examine whether their Analytics tools are over-simplifying the analytical insights?
SG: I don’t think we should oversimplify the analytical insight, but we should absolutely simplify the user experience.
It’s amazing if you can get to the point where the user experience is so simplified that the technology is running in the background and it’s no longer noticeable. For example, elevators that are running predictive algorithms to move people even faster up and down when the riders have no idea that predictive analytics are even running, or what happens with banks behind the scenes when they catch fraudulent transactions on your credit card.
Around the analytics itself, I think it absolutely needs to have a lower barrier to entry. If it takes me two months to wait for IT to get something done or I can do it in Excel right now and make my boss happy, what will I do? I will use Excel. I think there's a huge opportunity to really bring the barrier to data analytics down to the level of Excel.
But, it’s really important to get the analytical insights accurate. We can't oversimplify that.
AR: Q6. In one of your blog posts earlier this year, you mentioned that, "in 2015, Hadoop will move from a special purpose platform to a general purpose platform". Can you please elaborate on that? While it is intuitive from a business user perspective to have all data on Hadoop regardless of the data characteristics (to make it easier to access), would it be efficient and effective from a technical perspective?
SG: Yes, it's so interesting. Companies have Cassandra, Oracle and Teradata, and then they also have Hadoop. They have everything. Hadoop, in most organizations that we talk with, is only used for data preparation. They're using Hadoop to clean data or to organize log files. Then they pull the data into Oracle and run Tableau on top of that.
Moving data is stupid, and having five different systems doesn't make sense. There needs to be a general-purpose data cluster, and Hadoop has a good chance to be that.
We need different execution environments to work with our different workloads. I think Hadoop will evolve to work with both small and large data. The addition of Spark is certainly very interesting for small data. Having multiple execution environments in one cluster is really exciting as it allows users to work with different sizes of data, either structured or unstructured.
There's a unique opportunity here to make the Hadoop ecosystem a general-purpose data warehouse where you can have five people working on five petabytes or you have five thousand people each working on five megabytes and it doesn't really matter.
Anmol Rajpurohit is a software development intern at Salesforce. He is a former MDP Fellow and a graduate mentor for IoT-SURF at UCI-Calit2. He has presented his research work at various conferences including IEEE Big Data 2013. He is currently a graduate student (MS, Computer Science) at UC Irvine.