The Big ‘Big Data’ Question: Hadoop or Spark?

Because they share a considerable number of similarities, Hadoop and Spark are often wrongly treated as the same thing. Bernard carefully explains the differences between the two and how to choose the right one (or both) for your business needs.

Machine learning, the creation of algorithms which can “think” for themselves and improve through a process of statistical modelling and simulation until an ideal solution to a proposed problem is found, is an area of analytics which is well suited to the Spark platform, thanks to its speed and ability to handle streaming data. This sort of technology lies at the heart of the latest advanced manufacturing systems used in industry, which can predict when parts will fail and when to order replacements, and it will also lie at the heart of the driverless cars and ships of the near future. Spark includes its own machine learning library, called MLlib, whereas Hadoop systems must be interfaced with a third-party machine learning library, for example Apache Mahout.

The reality is, although the existence of the two Big Data frameworks is often pitched as a battle for dominance, that isn’t really the case. There is some crossover of function, but both are non-commercial products so it isn’t really “competition” as such, and the corporate entities which do make money from providing support and installation of these free-to-use systems will often offer both services, allowing the buyer to pick and choose which functionality they require from each framework.

Many of the big vendors (e.g. Cloudera) now offer Spark as well as Hadoop, so they will be in a good position to advise companies on which they will find most suitable, on a job-by-job basis. For example, if your Big Data simply consists of a huge amount of very structured data (e.g. customer names and addresses), you may have no need for the advanced streaming analytics and machine learning functionality provided by Spark. This means you would be wasting time, and probably money, having it installed as a separate layer over your Hadoop storage. Spark, although developing very quickly, is still in its infancy, and its security and support infrastructure is not as advanced.

The increasing amount of Spark activity taking place (when compared to Hadoop activity) in the open source community is, in my opinion, a further sign that everyday business users are finding increasingly innovative uses for their stored data. The open source principle is a great thing in many ways, and one of them is how it enables seemingly similar products to exist alongside each other: vendors can sell both (or rather, provide installation and support services for both), based on what their customers actually need in order to extract maximum value from their data.

Original Post

Bernard Marr is a globally recognized expert in big data, analytics and enterprise performance. He helps companies improve decision-making and performance using data. He has written a number of seminal books and over 200 high-profile reports. Bernard is a regular contributor to the World Economic Forum, and is acknowledged by the CEO Journal as one of today's leading business brains and by LinkedIn as one of the world's top 100 business influencers.