The Post-Hadoop World: New Kid On The Block Technologies
Big Data technology has evolved rapidly, and although Hadoop and Hive are still its core components, a new breed of technologies has emerged and is changing how we work with data, enabling more fluid ways to process, store, and manage it.
Over the past six years, we’ve seen a rapid evolution in data processing platforms and technologies. Hadoop and Hive remain core components of the data processing toolkit, but a new breed of emerging technologies is changing the way we work with and use data. Where we started with data in siloed pipelines, today we can process, store, and manage data in a more fluid way.
The early days
As any early Hadoop user (version 0.17, circa 2009) will tell you, those days were painful compared to now. In the beginning, using Hadoop was time-consuming, excessively hands-on, and, perhaps most frustratingly, unstable. I recall several instances of single points of failure, and of colleagues losing entire sets of data and having to start from scratch.
Today, an ecosystem of post-Hadoop technologies has emerged, including Mesos, Presto, YARN, and Docker, to name a few. It’s an evolution driven by a real need: by one widely cited estimate, 90% of the world’s data was created in the last two years. With this expanded set of data at our fingertips, there has been an increased demand for aggregating, manipulating, and deriving benefit from data faster and more securely. The post-Hadoop technologies have allowed us to tackle three key problems: containerization, scheduling, and experimentation. Solving them has revolutionized how we work with data.
I. Containerization
Every developer wants his or her own set of data tools, which means DevOps has the nightmare task of managing a large cluster. For any particular job, the tool used, along with all its dependencies, must be distributed to each machine in the cluster. Get enough developers together sharing the same cluster and it doesn't take too long before the requirements of one tool break another. For instance, Tool A requires version 1.2.3 of a specific library but Tool B needs version 1.4.8 or it breaks. This is colloquially known as dependency hell.
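To make the Tool A / Tool B conflict concrete, here is a minimal Python sketch. The tool and library names (`tool_a`, `tool_b`, `somelib`, `otherlib`) are hypothetical placeholders, and the check is deliberately naive: it only looks for a library pinned to more than one exact version on a shared machine, which is a far simpler rule than what a real package manager applies.

```python
from collections import defaultdict


def find_conflicts(requirements):
    """Return {library: {version: [tools]}} for libraries pinned to 2+ versions."""
    pins = defaultdict(lambda: defaultdict(list))
    for tool, deps in requirements.items():
        for library, version in deps.items():
            pins[library][version].append(tool)
    # Keep only the libraries that different tools pin to different versions.
    return {
        library: dict(versions)
        for library, versions in pins.items()
        if len(versions) > 1
    }


# Hypothetical tools sharing one cluster, mirroring the example in the text.
shared_cluster = {
    "tool_a": {"somelib": "1.2.3", "otherlib": "2.0.0"},
    "tool_b": {"somelib": "1.4.8", "otherlib": "2.0.0"},
}

conflicts = find_conflicts(shared_cluster)
print(conflicts)
```

Only `somelib` is reported, because both tools agree on `otherlib`. On a shared machine, one of the two tools has to lose.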
Enter Docker, an open platform for distributed applications. Docker enables containerization of a tool, along with its dependencies, so it can be rapidly deployed as a "black box" on every machine in the cluster. Because each container is self-contained, different jobs can use different versions of the same tool without any conflict.
So, developers can now use the best tools for the job without the mess: they no longer have to use a specific tool chosen by the DevOps team, or set up an entirely new system to integrate data in the pipeline. By containerizing code, we’ve been able to break open the silos created by different versions of tools, which lets us grow and scale without those conflicts.
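As a rough illustration of two jobs using different versions of the same tool side by side, the Python sketch below builds the command line for each job. The image name `example/etl-tool`, its tags, and the job arguments are hypothetical. The only real interface assumed is the standard `docker run --rm IMAGE[:TAG] [COMMAND] [ARG...]` form of the Docker CLI. The commands are built but not executed, so no Docker installation is needed to follow along.

```python
def docker_run_command(image, tag, job_args):
    """Build the argv for running one job inside its own container."""
    # --rm removes the container once the job exits.
    return ["docker", "run", "--rm", f"{image}:{tag}", *job_args]


# Hypothetical image and tags: the same tool at two different versions.
job_one = docker_run_command(
    "example/etl-tool", "1.2.3", ["process", "--date", "2015-01-06"]
)
job_two = docker_run_command(
    "example/etl-tool", "1.4.8", ["process", "--date", "2015-01-06"]
)

for argv in (job_one, job_two):
    print(" ".join(argv))
```

Each job names the exact image version it needs, so neither depends on what happens to be installed on the host machine.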
II. Continuous scheduling
Data doesn’t sleep. People, machines, and systems are constantly producing new data. Enter Chronos, a distributed, fault-tolerant scheduler. Chronos is a key post-Hadoop technology that has allowed us to tackle the challenge of running queries on a schedule, enabling continuous data processing. Chronos, along with most of the technologies we’re seeing in the post-Hadoop ecosystem, plays nicely with other technologies, including Mesos (resource management) and Docker. The powerful combination has allowed us to process customer data separately, all the while running containers continuously and scaling horizontally without writing new code.
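A scheduled, containerized job might be described along the following lines. This is a sketch under stated assumptions, not a verified Chronos configuration: Chronos expresses schedules as ISO 8601 repeating intervals of the form `R[n]/<start>/<period>`, but the exact set of job fields should be checked against the documentation for your Chronos version. The job name, command, and image below are hypothetical.

```python
import json
from datetime import datetime, timezone


def repeating_schedule(start, period, repetitions=None):
    """Format an ISO 8601 repeating interval: R[n]/<start>/<period>.

    Leaving out the repetition count means "repeat indefinitely".
    """
    count = "" if repetitions is None else str(repetitions)
    start_utc = start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"R{count}/{start_utc}/{period}"


def hourly_job(name, command, image, start):
    """Describe an hourly job that runs inside a Docker container.

    Field names are assumptions modeled on Chronos job definitions.
    """
    return {
        "name": name,
        "command": command,
        "schedule": repeating_schedule(start, "PT1H"),
        "container": {"type": "DOCKER", "image": image},
    }


# Hypothetical job: names, command, and image are placeholders.
start = datetime(2015, 1, 6, 0, 0, 0, tzinfo=timezone.utc)
job = hourly_job(
    "hourly-rollup", "run-rollup --window 1h", "example/etl-tool:1.4.8", start
)
print(json.dumps(job, indent=2))
```

The schedule string here is `R/2015-01-06T00:00:00Z/PT1H`: start at midnight UTC and repeat every hour with no end.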
III. Experimentation
With scheduling and containerization issues tackled, we are entering a new phase of effective data manipulation. In the past, what held developers back from experimenting with their data was a fear of losing that data in the process. Presto and Flink allow engineers to conduct fast analytics on top of an existing cluster, thereby simplifying experimentation, including combining different data sets and writing new machine-learning frameworks: essentially everything that has set the stage for today’s predictive analytics revolution. Indeed, the more our tools allow us to experiment with data, the further and quicker we’re going to go. It’s the same as a science lab; a cancer research center with full access to tools and funding can and will find those life-altering solutions faster.
IV. Using Memory More Effectively
If there's one big trend emerging in the big data space today that's rapidly pushing Hadoop and MapReduce to the bottom of the tool chest, it's solutions that use memory more effectively. You see this technique being used by our personal ad hoc querying tool of choice, Presto. But another contender we're watching is Apache Spark. Similar to how Hadoop helps developers avoid taxing the network (by moving computation to the data), which is arguably the slowest information bus in your computer, tools like Spark give developers the ability to express jobs that avoid taxing both the network and the hard disk. The approach often reaps a 10x-100x performance improvement over a traditional MapReduce job. The most intriguing thing about Spark is that it gives you the primitives to express the most complex of computations while remaining memory efficient. Interestingly enough, we are also seeing more familiar ways of data munging being built through Spark SQL, which could end up being the killer combo to end the current reigning big data champion. Time will tell.
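The idea of keeping the working set in memory across several passes can be shown without any cluster at all. The following plain-Python sketch is not Spark's API and makes no claim about Spark's internals. It simply counts how often a small file is read from disk when every pass goes back to storage, versus when the data is loaded once and reused. The class and function names are illustrative.

```python
import os
import tempfile


class CountingSource:
    """Reads integers from a file and counts how often the file is opened."""

    def __init__(self, path):
        self.path = path
        self.disk_reads = 0

    def read(self):
        self.disk_reads += 1
        with open(self.path) as handle:
            return [int(line) for line in handle]


def stats_rereading(source):
    """Three passes, each one going back to disk for its input."""
    total = sum(source.read())
    count = len(source.read())
    largest = max(source.read())
    return total, count, largest


def stats_in_memory(source):
    """Load once, then run every pass against the in-memory copy."""
    cached = source.read()
    return sum(cached), len(cached), max(cached)


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "events.txt")
    with open(path, "w") as handle:
        handle.write("\n".join(str(n) for n in range(1, 101)))

    rereading = CountingSource(path)
    in_memory = CountingSource(path)
    result_rereading = stats_rereading(rereading)
    result_in_memory = stats_in_memory(in_memory)

print(result_rereading == result_in_memory)
print(rereading.disk_reads, in_memory.disk_reads)
```

Both approaches compute the same answer, but the second touches the disk once instead of three times. The more passes a job makes over the same data, the larger that gap becomes.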
In my line of business, predictive analytics, we often talk about a new layer of intelligence that can add real value to inherently “dumb” CRM and marketing automation systems. I see this same concept for data scientists and engineers: Using Hadoop and Hive as a foundation, new data processing technologies are enhancing data security and data manipulation, and opening the doors to new possibilities for big data. And just like I was with Hadoop, I aim to be on the front lines of pushing those possibilities.
Viral Bajaria is CTO and co-founder of 6sense, where he leads the development of 6sense’s innovative analytics and predictive platform. Prior to 6sense, Viral built the big data platform at Hulu that processed over 2.5 billion events per day. He was an early adopter of Hadoop (late 2008), and built and managed a cluster that stored and processed over a PB of data. Viral was instrumental in building the infrastructure that powered reporting, financial, and recommendation systems across web and mobile devices. In his spare time, Viral enjoys contributing to open-source projects.