2016 Silver Blog

The Big Data Ecosystem is Too Damn Big

The Big Data ecosystem is just too damn big! It's complex, redundant, and confusing. There are too many layers in the technology stack, too many standards, and too many engines. Vendors? Too many. What is the user to do?



By Andrew Brust, Datameer.

As it stands today, the big data ecosystem is just too large, complex and redundant. It’s a confusing market for companies that have bought into the idea of big data, but then stumble when they are faced with too many decisions, at too many layers in the technology stack. The big data ecosystem has too many standards. It has too many engines. It has too many vendors. The ecosystem, as it exists right now, alienates customers, inhibits funding of customer projects, and discourages political support for them within organizations. So what are you, the user, to do?

I’ll be presenting on this in detail at Hadoop Summit San Jose next week, so be sure to join me Wednesday, June 29th at 4:10pm in Ballroom B if you’ll be at the show.


You’re Faced With Too Many Choices

  • BI/Analytics: At the top of the stack, there are seemingly endless choices. Whether Enterprise BI stalwarts, BI 2.0 challengers, or big data analytics players, the number of vendors and their similar positioning makes it really hard for customers. It’s difficult to distinguish between solutions – even significantly different ones – when the messaging and imagery are so similar.
  • Distributions: Move down in the stack and there’s plenty to choose from at the Hadoop and Spark distribution layer. It’s difficult enough that the “big three” (Cloudera, Hortonworks and MapR) each offer their own distributions of Hadoop, with Spark integrated. But add in other offerings from IBM, and the cloud players, large and small, and things get a little crazy. What’s difficult for the customer here is that the cores of these stacks differ in their makeup and/or have different versions of the very same components.
  • Execution Engines: And speaking of components, we have too many execution engines, too. Hadoop shifted from MapReduce to Tez. Then Spark established itself. And now, it seems, Apache Flink is waiting in the wings. On the streaming side, Apache Storm, NiFi, Spark and Kafka, in various combinations, vie for mindshare. And while big data machine learning started with Apache Mahout, it seems to be shifting to Spark MLlib and elsewhere. Then there are the permutations. For example, Spark can run on YARN, Hadoop 2.0’s resource manager. But it doesn’t have to. And when you use the cloud-based Spark offering from Databricks (the company founded by Spark’s creators), it doesn’t.
  • SQL, Datasets and Streams: And while SQL made its way into the big data conversation to make it all “easier” to use by leveraging existing skillsets, there are too many SQL-on-big-data solutions, too. Should you use Hive, or Spark SQL? If you do use Hive, should you run it on MapReduce, or Tez? Plus, don’t forget Impala. Or HAWQ, Apache Drill, Presto and all the SQL-on-Hadoop bridges from the big database vendors, including Teradata, HP, Microsoft, Oracle and IBM. Let’s not even get into the fact that using SQL can be antithetical to Hadoop and its unique benefits. Yet another layer of confusion. Even within a well-defined stack with a small number of components, fragmentation can be rampant. In the Spark world, you can use Resilient Distributed Datasets (RDDs), DataFrames or Datasets. And Spark developers can use the new Structured Streaming API for data in motion. But what about Kafka Streams? Those are shiny and new too.
  • To Code or Not to Code: When it comes to programming languages, should you code in R or Python? What about Scala? And for that matter, why not throw enterprise developers a bone and let them use Java and even C# to write their big data code? There’s control to be had here but at the cost of self-service and enabling more people within your organization.

How to Move Forward in a Confusing Ecosystem

Yes, things are in some disarray, but they are far from hopeless. We can clean up this mess, and we can let the significant value that the big data ecosystem has created stand out. Next week, at Hadoop Summit San Jose, I’ll be presenting some ideas for how we, as vendors, analysts, venture capitalists, and everyone else who makes up this big data ecosystem, can make the situation better. But more importantly, I’ll outline some tips and tricks for customers who are currently attempting to navigate these murky waters. A sneak peek, of sorts, for you now:

  1. Always Start With a Use Case
    Don’t get sold by shiny tech. In a recent Gartner survey, by far the top big data challenge cited by respondents was “determining how to get value from big data” (58% of respondents). How do you remedy that? Always start with defining your use case, then work your way toward finding the technology that will support it.
  2. Consider Control vs. Democratization
    As hinted at above, it may be tempting to give yourself/your team fine-grained control with tools that allow you to code. But be wary of how much control you actually need – is the greater good better served by getting data into the hands of more people in the organization with self-service tooling? Search for the right balance.
  3. Think Future-Ready
    We’ve already seen it: the industry expands, then contracts, then expands again. That’s why it’s incredibly important that, as you evaluate your technology purchase, you look for signs the technology itself is “future-proof” or “future-ready” through a modular, “pluggable” architecture. Because, while you may not want to leap on the next shiny new project or standard, you’ll want the option to migrate to it when it becomes prudent to do so.

Join me at the show next week, Wednesday, June 29th at 4:10pm in Ballroom B and/or check back for a follow up post with a recording of the session and some additional thoughts on our collective path forward.

Bio: Andrew Brust is Sr. Director of Market Strategy & Intelligence at Datameer and writes a blog for ZDNet called "Big on Data.” Andrew is co-author of "Programming Microsoft SQL Server 2012" (Microsoft Press); an advisor to NYTECH, the New York Technology Council and writes the Redmond Review column for VisualStudioMagazine.com.

Original. Reposted with permission.
