Opinion: RapidMiner is not a version of Weka (KDnuggets News 07:24, item 5, Features)

KDnuggets : News : 2007 : n24 : item5

Features

From: Ingo Mierswa
Date: 05 Dec 2007
Subject: Opinion: RapidMiner is not a version of Weka

"RapidMiner is not a version of Weka, but a full-grown data mining suite independent of Weka"

In the last issue of the KDnuggets News, Dr. Alexander Seewald (one of the contributors to the Weka machine learning library) stated that in future polls on Data Mining suites RapidMiner (YALE) should be combined with Weka, similar to the products of Salford Systems.

I do not share his opinion. Although I would actually prefer not to start a discussion about the pros and cons of two great open-source products, I have to comment on the claims of Dr. Seewald's post since most of them are simply not true.

1. "The free data mining software tool Yale (now named RapidMiner) is heavily based on Weka."

No, it is not. You can even remove the file weka.jar (containing the Weka learners as a library) and RapidMiner will work as usual as a complete Data Mining application and engine. The only difference after removing Weka from RapidMiner is that the learners implemented in the Weka toolkit are then no longer available as operators in RapidMiner. The total number of RapidMiner operators is about 500, and only about 100 of them are derived from Weka learners. The other 400 operators cover many data sources, preprocessing methods, validation and visualization techniques which are not available within Weka. Since we also provide our own implementations of all important and widely used learning schemes in addition to those of Weka, nothing is really missing if Weka is removed.

You can try that yourself: download the latest release and remove the file "weka.jar" from the lib directory of RapidMiner. Here is some information about the latest release:

http://rapid-i.com/content/view/26/82/

2. "I would classify it as WEKA with a nifty interface"

You are partly right. I indeed really like the user interface of RapidMiner more than that of Weka - although I am of course a bit biased (as you probably are, too). However, RapidMiner of course is not just Weka with another interface. As written above, RapidMiner provides an additional set of about 400 operators for many aspects of Data Mining not covered by Weka. In my opinion, there are also some even more important differences than just the user interface or the number of different techniques:

Usability: the data flow in RapidMiner always follows the same tree-based structure. This is a totally different concept compared to a graph-based layout like the KnowledgeFlow of Weka. We can ensure automatic validations as well as automatic process optimizations for large-scale data mining, which is hardly possible for interactive graph layouts. In addition, we provide long parameter names which are easier to understand, and we created many components of the graphical user interface just to ease the definition of steps and parameters. Our tree-based layout together with a stronger modular concept (a nice example is the cross validation operator in RapidMiner, which allows arbitrary processing of intermediate results) also allows for breakpoints and the easy definition of re-usable building blocks.

Efficiency: people often notice that RapidMiner can handle larger data sets than Weka. And here is an interesting fact: in our support forum, you will find people who compared the memory usage for learners which are originally part of the Weka toolkit. Even then the memory consumption of RapidMiner is lower. The reason for that is a more efficient internal data representation together with a layered data view concept. Many data transformations are performed "on the fly" instead of transforming the data and storing the transformed data set.
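
To illustrate the general idea of layered views with on-the-fly transformations, here is a minimal sketch in Python. It is not RapidMiner code and does not reflect its actual classes: the names BaseTable and View are hypothetical and chosen only for this illustration. The raw rows are stored once, and each view applies its transformation only when a row is read, so no transformed copy of the data set is kept.

```python
from typing import Callable, Iterator, Sequence, Tuple

Row = Tuple[float, ...]


class BaseTable:
    """Hypothetical base layer: holds the raw rows exactly once."""

    def __init__(self, rows: Sequence[Row]):
        self._rows = rows

    def __iter__(self) -> Iterator[Row]:
        return iter(self._rows)


class View:
    """Hypothetical view layer: transforms each row lazily when it is read."""

    def __init__(self, parent, transform: Callable[[Row], Row]):
        self._parent = parent
        self._transform = transform

    def __iter__(self) -> Iterator[Row]:
        # Nothing is stored here; rows are pulled from the parent layer
        # and transformed one at a time.
        for row in self._parent:
            yield self._transform(row)


table = BaseTable([(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)])

# Layer 1: rescale the second attribute.
scaled = View(table, lambda r: (r[0], r[1] / 10.0))

# Layer 2: keep only the first attribute, stacked on top of layer 1.
first_only = View(scaled, lambda r: (r[0],))

for row in first_only:
    print(row)
```

Two branches of an analysis could each stack their own views on the same base table in this way without copying it.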

Scalability: the internal data handling of RapidMiner allows the application of a large number of data mining and learning methods directly on an external database. This is especially important for linear learning methods like the Perceptron or Naïve Bayes, which can be learned directly on the database without loading the data into memory. This allows for large-scale data mining with RapidMiner.
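
As a rough illustration of why such learners suit this setting, the following Python sketch trains a Perceptron by reading rows one at a time from a database cursor. It is only a sketch under stated assumptions and not the RapidMiner implementation: the function train_perceptron, the table "examples" and its columns are hypothetical, and an in-memory SQLite database merely stands in for an external database so that the example is runnable.

```python
import sqlite3


def train_perceptron(conn, query, n_features, epochs=10, lr=1.0):
    """Train a perceptron on rows of the form (x1, ..., xn, label).

    Labels are assumed to be -1 or +1. Each epoch re-runs the query and
    consumes the cursor row by row, so the full data set is never held
    in a Python list.
    """
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for row in conn.execute(query):
            *x, y = row
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or on the boundary): update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:  # every row classified correctly: stop early
            break
    return w, b


# Stand-in for an external database (hypothetical table and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE examples (x1 REAL, x2 REAL, label INTEGER)")
conn.executemany(
    "INSERT INTO examples VALUES (?, ?, ?)",
    [(2.0, 1.0, 1), (3.0, 2.0, 1), (-1.0, -2.0, -1), (-2.0, -1.0, -1)],
)

w, b = train_perceptron(conn, "SELECT x1, x2, label FROM examples", 2)
print(w, b)
```

The update rule only ever needs the current row and the weight vector, which is what makes this kind of learner a natural fit for learning directly on a database.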

Integrability: the Weka toolkit is definitely easy to integrate into other software products and the developers have done a great job here. However, if you want to integrate different data mining processes into the same product based on Weka, you often have to re-transform the data over and over again (with all the drawbacks mentioned above). At least for the latest versions of RapidMiner, the layered data views allow the integration of different branches of analysis into a single product without copying and re-transforming the data every time. This makes integration easier even if the necessary code is slightly more complex in the case of RapidMiner.

This is only a small excerpt of differences. If you are interested in a more detailed discussion, I would suggest the paper by Mierswa, et al.: "YALE: Rapid Prototyping for Complex Data Mining Tasks", KDD 2006.

3. "...even the code tree is quite similar at first glance."

Sure. Weka started back in 1993 and it definitely was inspiring for many users and developers. For that reason, we decided to find a package structure which is similar to that of Weka since developers were already familiar with this structure. Weka of course did not invent the concepts behind Machine Learning and Data Mining - the developers of Weka were probably influenced by the same literature, people, thoughts, and ideas which also influenced us. It is therefore no wonder that structures in code (which are represented by the code tree) are quite similar. If programs are considered as a model of the world, they actually should have similar structures for the same domain.

However, this similarity only exists for the structure of the learning operators themselves and not for all the other operators and functions available in RapidMiner. Nor does it exist on the code level. The implementation styles of Weka and RapidMiner are quite different, as can easily be seen. The whole implementation concept is also totally different (especially the Operator concept of RapidMiner).

4. "This does not seem to be widely known."

The main reason probably is that there is not anything to know. It is just your opinion and the facts do not support it.

5. "On a more positive note, this means that WEKA is on first place in usage (103+48 = 151 :-) among free tools in KDnuggets 2007 Poll: Data Mining / Analytic Software Tools"

Since RapidMiner is not just another interface, as you claimed, this conclusion is of course not valid. Even if it were true, you would still have to live with the fact that other people do not prefer the Weka user interface and have chosen another one. I am pretty sure that there are some people who use RapidMiner only because of its interface: they would actually like to use Weka but do not like the Weka interface. I am also quite certain that there are many more who prefer RapidMiner for other reasons (like the ones stated above). If you are interested, you could have a look at the testimonials and references on the web site of Rapid-I (the company behind RapidMiner):

http://rapid-i.com/content/view/8/56/

6. "This should be accounted for in next years poll."

Again: it should not. I would be the first person to admit that RapidMiner is "only" a nice user interface for Weka, if this were the case (and I also really should know since I know the code base of both products probably better than anyone else). Actually, creating a new user interface alone would already have been a great result. But this is, as I have tried to explain above, simply not true for RapidMiner. Don't get me wrong: Weka is a great toolkit and especially in the beginning of our development we depended on Weka much more than we do now. Today, we can simply remove Weka and still have a complete and fully functional Data Mining solution - together with a nice user interface for data mining and data analysis. Therefore, there is no reason for combining the two quite different products in the next poll.

7. Legal point of view:

Let me close this discussion by stating the following point. Weka is a product of and owned by the University of Waikato, New Zealand. RapidMiner is a product of and owned by Rapid-I, Dortmund, Germany. They are two different products by two different, independent providers and owners. Hence, it does not make sense to combine them in the next poll.

Saying "RapidMiner is a version of Weka" is like saying "SPSS is a version of SAS" or "BMW is a version of Ford". It's nonsense and simply not true (even if they look similar and have a similar structure with four wheels).

RapidMiner is a complete data mining suite of its own that optionally can make use of the Weka learners but does not depend on them at all. The first versions of RapidMiner (YALE) did not even integrate Weka. This option was added later as a service for the Weka users and to ease their transition to RapidMiner.

Thanks for your interest,
Ingo Mierswa
Managing Director of Rapid-I
