8 Myths about Virtualizing Hadoop on vSphere Explained

This article takes some common misperceptions about virtualizing Hadoop and explains why they are mistaken.

By Justin Murray, VMware.

The short explanations below should clear up misunderstandings about these important topics.

Myth #1: Virtualization may add significant performance overhead to a Hadoop cluster.

This is a common concern from users who are in the early stages of considering virtualizing their Hadoop clusters. Engineers at VMware (and some of its customers) have conducted several iterations of performance testing of Hadoop on vSphere over multiple years, with various hardware configurations. These tests have consistently shown that virtualized Hadoop performance is comparable to, and in some cases better than, that of a native equivalent.

In 2015, a lengthy set of tests conducted on vSphere 6 with 32 host servers and 128 virtual machines (four virtual machines per host server) showed that a MapReduce task finished in 12% less time on vSphere than on the equivalent non-virtualized, or native, system. Higher numbers of virtual machines per server are also viable and produce results that likewise beat the native system, but four was a good starting point for many situations.

As with any platform on which Hadoop runs, the details of the setup matter. The disk storage, virtual machine placement and networking in particular need to be organized in keeping with the known best practices in order to get the highest performance from the system. The same principles apply to the native world. VMware has documented those best practices and built them into the Hadoop cluster-provisioning tool, vSphere Big Data Extensions. You can read more on this here.

Myth #2: Virtualization requires the use of shared storage

This is a misunderstanding of the features of virtualization. VMware vSphere works very well with non-shared direct-attached storage (DAS) and hosts HDFS data in virtual machines that rely on that storage. The Hadoop distribution vendors frequently recommend DAS-type storage for cost and performance reasons. With vSphere, each physical disk/spindle in DAS may be presented as a unique datastore to the hypervisor. Virtual disk files (VMDKs) are then placed onto those datastores. This is a well-understood and tried-and-trusted mechanism. There are large virtualized Hadoop clusters running today that are entirely DAS-based and have no shared storage present at all.

Myth #3: Hadoop cannot work with shared storage

This is not true; in fact, a number of Hadoop users are now requesting shared storage to back their clusters. Shared storage comes in many forms, such as SANs, virtual SAN or software-defined storage, NFS devices and HDFS-aware NAS storage mechanisms. SANs and NFS were deployed in many VMware installations before Hadoop became popular, so they have become associated with vSphere generally, but they are not a prerequisite, as seen in Myth #2.

The important factor to bear in mind with your choice of storage is the effective bandwidth that is available, in MB/second. One can measure the effective bandwidth by using a loading tool such as IOmeter to mimic the traffic seen in Hadoop (long sequential I/Os of 64 MB, 128 MB or higher). This is a different measurement from the classic IOPS (I/Os per second) that is used for measuring suitability for an RDBMS or an older style of data storage. Provided the required bandwidth is available to be shared across the number of servers that will be attached to the SAN or NAS, then Hadoop will be deployable there. In general, we see adopters of virtualized Hadoop placing smaller clusters (around 10 physical servers) on their SAN-based storage for trial purposes, if they intend to place significant performance load on those clusters.
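To make the bandwidth-versus-IOPS distinction concrete, the rough sketch below times large sequential reads of a file and reports effective MB/s. It is only a hedged stand-in for a proper loading tool such as IOmeter: the file path, 128 MB block size and scratch-file approach are assumptions for illustration, and reading a freshly written small file will largely measure the OS page cache rather than the underlying SAN, NAS or DAS device. A realistic run would target a file on the datastore under test that is far larger than available RAM.

```python
import os
import tempfile
import time


def measure_sequential_read(path, block_size=128 * 1024 * 1024):
    """Read a file sequentially in large blocks and return effective MB/s.

    Large block sizes mimic Hadoop's long sequential I/O pattern, as
    opposed to the small random I/Os behind a classic IOPS figure.
    """
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        return 0.0
    return (total_bytes / (1024 * 1024)) / elapsed


if __name__ == "__main__":
    # Demo only: a small scratch file so the example runs quickly.
    # A real measurement needs a much larger file on the target datastore.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(64 * 1024 * 1024))
        path = tmp.name
    try:
        print(f"Effective sequential read bandwidth: "
              f"{measure_sequential_read(path):.1f} MB/s")
    finally:
        os.remove(path)
```

Dividing the measured figure by the number of Hadoop servers sharing the array gives a first approximation of the per-node bandwidth available, which is the "required bandwidth" check described above.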

With HDFS-aware NAS-type storage, we have already seen several deployments of virtualized Hadoop clusters where the HDFS data is contained solely on the NAS device. The virtual machines in that case contain the compute nodes of Hadoop, such as the ResourceManager, NodeManager and Container processes. This has also been shown to scale up to over 100 connected servers running the compute-side virtual machines.

Myth #4: The Hadoop distribution vendors do not support virtualized Hadoop

This is not true. The major vendors of Hadoop software have engaged with VMware to test and validate the behavior of their products on vSphere. These results are documented in solution briefs, reference architectures and validation guides that are available from the vendors. VMware's policy is to work with the distribution vendor to solve the customer's problem should any issue arise.
