Spark with Tungsten Burns Brighter

Apache Spark is one of the hottest technologies in data science and analytics. A project called Tungsten represents a huge leap forward for Spark, particularly in the area of performance. Understand how it works, and why it improves Spark performance so much.

By Paige Roberts, Syncsort.

Tungsten is Shiny

Project Tungsten is a recent addition to the Spark world. As we all know, Spark is taking over the big data landscape. As always happens in the big data space, what Spark could do a year ago is radically different from what Spark can do today. It broke the record in the big data sort benchmark (the Daytona GraySort) last year, and it just keeps getting better. Project Tungsten represents a huge leap forward for Spark, particularly in the area of performance. That much was clear, but if you’re like me, you can’t help but wonder what Tungsten actually is, how it works, and why it improves Spark performance so much.

Spark has gotten better and better over time at optimizing workloads and steering clear of I/O bottlenecks. The developers even moved to a BitTorrent-style broadcast protocol to speed up network transfer rates. Now the problem areas are shuffle, serialization, and hashing, all of them CPU-intensive operations. I/O bandwidth isn’t generally where Spark jobs slow down anymore; CPU efficiency and memory constraints are the choke points.
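Serialization is a good example of where those CPU cycles go. As a quick illustration (my own example, not part of Tungsten itself), switching a job from default Java serialization to the faster Kryo serializer is just a configuration change:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: swap default Java serialization for Kryo, which spends far
// fewer CPU cycles per record. The app name and master are made up for the demo.
val conf = new SparkConf()
  .setAppName("serialization-demo")
  .setMaster("local[*]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
```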

Tungsten and Spark, a Bright Idea

Tungsten is a shiny metal that gets used a lot in light bulbs. I’m assuming that’s why the Spark folks had the bright idea to name this new project Tungsten. The project improves the efficiency of memory and CPU usage for Spark applications. I had some crazy hope that this new project might start taking advantage of chip cache, and after some research, I was delighted to find out I was right. (Buffing fingernails on shirt and smugly saying, “I told you so.”)

As I said in my post from a year ago “In-Memory Analytics Databases are So Last Century,” in-chip data processing is the wave of the future. Tungsten is surfing that wave like a champ.

In the gaming software industry, using GPU chip cache for data processing is a necessary fact of life. Gaming systems and video cards use GPU cache to do intense video data crunching. Ordinary hardware has CPUs with high-speed cache memory available for data processing; it’s just that no one has been building software to take proper advantage of it. Outside of the gaming industry, Actian and Sisense were the only companies or projects I knew of that exploited in-chip data processing before now, because vectorizing data so that it fits in the cache is tricky. The Tungsten project is tackling the challenge of storing and processing data as vectors admirably. Tungsten’s version of sort, which uses chip cache, is three times as fast as the standard RAM-based in-memory sort. The new functionality hasn’t made it into all of the base Spark algorithms yet, but it’s coming.
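If you want a feel for the trick, here’s a toy sketch I put together in Scala. It is not Spark’s actual implementation, just an illustration of the idea: pack a compact key prefix next to each record’s index, so the hot comparison loop scans sequential, cache-sized memory instead of chasing full objects all over the heap.

```scala
// Toy illustration of cache-friendly sorting (not Spark's real code): keep an
// array of (key-prefix, record-index) pairs so most comparisons read compact,
// sequential memory that fits in CPU cache.
case class Record(key: String, payload: Array[Byte])

def cacheFriendlySort(records: Array[Record]): Array[Record] = {
  // Pack the first 8 bytes of each key into a Long, next to the record's index.
  // (Assumes ASCII keys, so signed Long comparison matches byte order.)
  val prefixed: Array[(Long, Int)] = records.zipWithIndex.map { case (r, i) =>
    val bytes  = r.key.getBytes("UTF-8").padTo(8, 0.toByte)
    val prefix = bytes.take(8).foldLeft(0L)((acc, b) => (acc << 8) | (b & 0xffL))
    (prefix, i)
  }
  // Most comparisons resolve on the prefix alone; only ties dereference the
  // full record, so the hot loop stays inside the cache-resident array.
  val sorted = prefixed.sortWith { case ((p1, i1), (p2, i2)) =>
    if (p1 != p2) p1 < p2 else records(i1).key < records(i2).key
  }
  sorted.map { case (_, i) => records(i) }
}
```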

As an aside, thanks to a helpful comment on my old In-Memory vs In-Chip post, I even discovered there’s a Spark project called UCores that takes advantage of the cache in GPUs and other less common types of chips. Go Spark. Someone is finally exploiting modern hardware strengths that should have been exploited ages ago, but that only gaming companies seemed to know existed. (Stepping off of soapbox.)

Back to Tungsten. The project also dynamically optimizes Spark operations to use memory far more efficiently than the JVM does on its own. Tungsten ditches a lot of JVM-imposed overhead and takes over some of the memory management and cleanup itself. I would worry about memory leaks, but with the ridiculous number of people working on Spark, around 1,000 now, those will probably get plugged as fast as people can find them.
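For a taste of what “taking over memory from the JVM” means, here’s a little sketch of the off-heap technique involved. sun.misc.Unsafe is the real JDK facility Tungsten builds on, but the snippet itself is my own illustration, not Spark code:

```scala
// Sketch of the off-heap idea: raw memory with no object headers and no GC.
// Grabbing Unsafe this way is the usual back-door trick; production code
// wraps it much more carefully.
val unsafeField = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
unsafeField.setAccessible(true)
val unsafe = unsafeField.get(null).asInstanceOf[sun.misc.Unsafe]

val address = unsafe.allocateMemory(8 * 4)       // room for four 8-byte longs
try {
  (0 until 4).foreach(i => unsafe.putLong(address + 8 * i, i * 100L))
  println(unsafe.getLong(address + 8 * 2))       // prints 200
} finally {
  unsafe.freeMemory(address)                     // manual cleanup, no GC involved
}
```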

Mind-Blowing Idea Behind Tungsten with Spark

The other thing Tungsten does is generate code dynamically at runtime with an optimizer. Most code in the world gets written ahead of time and simply executed at runtime. That works, but it can mean the job gets done less efficiently than if the process were turned around a little. Dynamic runtime code generation may sound like a mind-blowing idea, but it really works.

Essentially, the idea is for a developer to define what he or she wants done, then let an optimizing engine generate the ideal code for that job based on runtime conditions like available memory, CPU cycles, data layout, etc. That approach provides a big performance boost because the code is perfectly fitted to the task and to the available resources, which a developer couldn’t know about ahead of time. A good optimizer gives you better, more performant jobs in the same way a good query optimizer gives you faster queries in a database. Optimized code generation at runtime is also one of those things that I’ve been espousing from the rooftops for a few years now.
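Here’s a small Scala sketch of the general technique, using the compiler’s ToolBox. It’s my own illustration rather than Spark’s internals (Spark’s SQL engine generates Java source and compiles it with the Janino compiler), but it shows the core move: build the source for exactly the code you need, compile it on the fly, and call it like any other function.

```scala
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

// Hypothetical sketch of runtime code generation (requires scala-compiler on
// the classpath). An optimizer would build this source string from runtime
// information; here we just hard-code the expression it might have chosen.
val toolbox = currentMirror.mkToolBox()
val source  = "(x: Long) => x * 3 + 7"

// Compile the string into real bytecode and get back a callable function.
val compiled = toolbox.eval(toolbox.parse(source)).asInstanceOf[Long => Long]

println(compiled(5L)) // 22, executed as compiled code, not interpreted
```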

So, to sum it all up, Tungsten is a project to manage memory and processing for Spark to make it perform even better than it already does. And it uses strategies like chip-cache utilization and optimized code creation at runtime that I’ve been telling you were awesome for ages.

So, that brings me to two clear predictions. First, my head is going to swell a bit from feeling all smug and brilliant, and second, Spark is going to continue to get more capable and performant, and to dominate the big data market for the next few years.

Bio: Paige Roberts started out in the ETL software business 19 years ago doing technical support and documentation. Since then, she has been an integration consultant, integration software trainer, and spent 5 years as an ETL software engineer. In the last four years, she has focused on the big data integration and Hadoop space as a product marketer, freelance analyst, and technology evangelist. Today, she is Product Manager for Syncsort’s big data integration software.
