8 Myths about Virtualizing Hadoop on vSphere Explained

This article examines some common misconceptions about virtualizing Hadoop and explains why they are mistaken.

Myth #5: Cannot get the latest versions of Hadoop products on a virtual platform

This is a timing question. The Hadoop vendors' products really depend on the operating system and on tools such as the Java runtime. Provided the latest Hadoop technology is tested and supported by the Hadoop vendor on newer versions of these, and VMware supports the particular guest OS, it will work on vSphere. If the installation automation implemented in BDE lags a short time behind in its ability to handle the latest Hadoop version, that does not mean the latest version of the Hadoop technology is incapable of being virtualized.

Myth #6: No one else is doing virtualization of Hadoop, so why should I?

The first statement here is not true. VMware is aware of many organizations that are in various stages of testing/deploying Hadoop on vSphere. A number of these customers have given talks on the subject in public and others have had their deployments documented in case studies. Some of these deployments are in the hundreds of physical servers with multiple virtual machines on those servers.

Myth #7: Hadoop and vSphere need a specialized version to run together

This is incorrect. The vSphere environment and the Hadoop software can be combined and run out of the box. VMware has donated a set of features, called the Hadoop Virtualization Extensions, that make the Hadoop topology aware of virtualization, and those are now built into the distro vendors' products. These ensure, for example, that all replicas of an HDFS data block do not live on a group of virtual machines that reside on the same host server. This is a parameter that is expressed at Hadoop cluster creation time and is now part of the standard Hadoop distribution code. No special versions of vSphere or the Hadoop software are required for them to run well together.
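As an illustrative sketch of how that topology awareness is switched on (property names are those documented for the Hadoop Virtualization Extensions; verify the exact keys against your distribution's documentation), the node-group layer and the matching block placement policy are typically enabled in the cluster's site files:

```xml
<!-- core-site.xml: make the topology resolver aware of the
     "node group" layer, i.e. VMs that share one physical host -->
<property>
  <name>net.topology.impl</name>
  <value>org.apache.hadoop.net.NetworkTopologyWithNodeGroup</value>
</property>
<property>
  <name>net.topology.nodegroup.aware</name>
  <value>true</value>
</property>

<!-- hdfs-site.xml: use the HVE placement policy so that replicas of
     an HDFS block never all land on VMs sharing the same host server -->
<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyWithNodeGroup</value>
</property>
```

With these set, the NameNode treats VMs on one host as a single failure domain when placing replicas, which is exactly the guarantee described above.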

Myth #8: I can deploy Big Data using containers as a better option to virtualizing Hadoop

We understand that containers are a big area of interest to many users today; indeed, they represent virtualization at a different level in the stack. However, containers are not that suitable for encompassing all facets of Big Data, though they can play a part. It would be a mighty container indeed that could hold 30-40TB of data. I would not want to fire up such a container, tear it down, or package it and move it around from one developer to another.

Containers are really for a different purpose than holding the actual HDFS data. They are useful for wrapping the compute-oriented components of Hadoop, such as hosting the NodeManager and the executable containers/JVMs that run in the same (virtual) machine's operating system. Ideally these would be stateless components in a container landscape. This sort of container may be run inside a virtual machine. But that bulky data store that makes up the core of Big Data does not belong in a container or set of containers, quite yet. That big data store may be linked to one or more containers, using any number of mechanisms, but it is not wrapped by one. So think carefully about the usage of containers for big data!
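A minimal sketch of that separation, assuming a hypothetical image name and paths (this is an illustration of the pattern, not a supported product layout): the compute-oriented NodeManager is containerized and effectively stateless, while the big data store is only linked to it over the network rather than wrapped by it.

```yaml
# docker-compose.yml (illustrative sketch): stateless compute in a
# container; the bulky HDFS data stays on DataNodes outside containers.
services:
  nodemanager:
    image: my-hadoop-nodemanager:latest   # hypothetical image name
    environment:
      # reach the big data store over the network instead of
      # packaging tens of terabytes inside the container image
      - HDFS_NAMENODE_URI=hdfs://namenode.example:8020
    volumes:
      # only transient shuffle/scratch space travels with the container;
      # destroying the container loses no HDFS data
      - /scratch/yarn-local:/hadoop/yarn/local
```

Tearing down or redeploying this container is cheap precisely because nothing durable lives inside it, which is the point of the paragraph above.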

Bio: Justin Murray is a Technical Marketing Manager at VMware and has been at the company for over six years. Justin creates technical material and gives guidance to customers and the VMware field organization to promote the virtualization of big data workloads on VMware's vSphere platform. Justin has worked closely with VMware's partner ISVs (Independent Software Vendors) to ensure their products work well on vSphere, and he continues to bring best practices to the field as the customer base for big data expands.