3 Generations of Machine Learning and Data Mining Tools

Three different paradigms are available for implementing Machine Learning (ML) algorithms, both from the literature and from the open source community.



Impetus Blog, by Dr. Vijay Srinivas Agneeswaran, Feb 05, 2013

Three Generations of Tools for Realizing Machine Learning Algorithms

... I give my view of the three generations of Machine Learning tools available to us today:

1. The traditional ML tools for machine learning and statistical analysis, including SAS, IBM SPSS, Weka and the R language - allow deep analysis of smaller data sets ...

2. Second generation ML tools such as Mahout, Pentaho or RapidMiner - allow what I call shallow analysis of big data. ...

3. The third generation tools such as Spark, Twister, HaLoop, Apache Hama and GraphLab - facilitate deeper analysis of big data - but how deep and how reliable are these?

The first generation ML tools can facilitate deep analytics as they have a wide set of ML algorithms. However, not all of them can work on large data sets (terabytes to petabytes of data) because of scalability limitations: they are limited by the non-distributed nature of the tool. In other words, they are vertically scalable (you can increase the processing power of the node on which the tool runs), but not horizontally scalable (not all of them can run on a cluster). No doubt they are addressing those limits by building Hadoop connectors. I am quite sure the traditionalists are up in arms against me on this...

The second generation tools ... provide the ability to scale to large data sets by implementing the algorithms over Hadoop, the open source Map-Reduce implementation. These tools are maturing fast and are open source. ... Mahout has a set of algorithms for clustering and classification, as well as a very good recommendation algorithm. ... Mahout implements only a small subset of ML algorithms over Hadoop - only 25 algorithms are production quality, with only 8-9 usable over Hadoop.
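
To make "implementing the algorithms over Map-Reduce" concrete, here is a minimal single-machine sketch of one k-means clustering pass written as a map step and a reduce step. It is an illustration only, not Mahout's or Hadoop's actual code: the function names (nearest, map_phase, reduce_phase, kmeans_iteration) and the sample data are assumptions made for this example. On a real cluster the map step would run in parallel over data splits and the framework would group the emitted pairs by key before the reduce step.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map: emit a (cluster_id, point) pair for every input record."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_phase(pairs, centroids):
    """Reduce: average the points that were emitted under each cluster id."""
    groups = defaultdict(list)
    for cluster_id, point in pairs:
        groups[cluster_id].append(point)

    # Hypothetical policy for this sketch: an empty cluster keeps its old centroid.
    new_centroids = list(centroids)
    for cluster_id, members in groups.items():
        new_centroids[cluster_id] = tuple(
            sum(coords) / len(members) for coords in zip(*members)
        )
    return new_centroids


def kmeans_iteration(points, centroids):
    """One full map + reduce pass; k-means repeats this until centroids settle."""
    return reduce_phase(map_phase(points, centroids), centroids)


if __name__ == "__main__":
    sample_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    start = [(0.0, 0.0), (10.0, 10.0)]
    print(kmeans_iteration(sample_points, start))
```

Each call to kmeans_iteration corresponds to one Map-Reduce job, so an algorithm that needs many passes over the data has to launch many jobs in sequence.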
