KDnuggets : News : 2007 : n24 : item6

Features

From: Dr. Alexander K. Seewald
Date: 14 Dec 2007
Subject: Opinion: RapidMiner is heavily based on Weka

First, I would like to thank Ingo Mierswa from Rapid-I for giving me such a detailed response to discuss, and the editor of KDnuggets for the opportunity to reply.

1. "The free data mining software tool Yale (now named RapidMiner) is heavily based on Weka."

Comparing counts of fuzzily defined concepts such as operators can be misleading. A time-honored and well-established way to compare code contributions is by lines of code. I've therefore downloaded the most recent RapidMiner version 4, which has 213,177 lines of Java code. WEKA in a reasonably recent version (a bit dated, I guess) has 438,486 lines of Java code. So RapidMiner, counted together with the WEKA code it bundles, consists of roughly two thirds WEKA.
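
The arithmetic behind that ratio can be checked directly. The sketch below assumes that the 213,177 lines are RapidMiner's own code and that the WEKA sources are counted on top of them; that reading, like the constant and function names, is an assumption made for illustration.

```python
# Line counts quoted in the text above.
RAPIDMINER_OWN_LOC = 213_177
WEKA_LOC = 438_486


def weka_share(own_loc: int, weka_loc: int) -> float:
    """Fraction of the combined code base contributed by WEKA.

    Assumes own_loc does not already include the bundled WEKA sources.
    """
    return weka_loc / (own_loc + weka_loc)


combined_loc = RAPIDMINER_OWN_LOC + WEKA_LOC
share = weka_share(RAPIDMINER_OWN_LOC, WEKA_LOC)

print(f"combined lines: {combined_loc}")
print(f"WEKA share: {share:.1%}")
```

Under that assumption the share comes out slightly above two thirds.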

I would be happy to do a more detailed comparison on the closed-source version if RapidMiner supplies the source code under an appropriate NDA.

It is not clear to me why a high number of operators should be preferred. When the same goals can be achieved with a smaller number of operators, this is IMHO an achievement over needing five times as many. The best programming languages are those with the smallest number of elementary operations, not those with the highest number.

2. "I would classify it as WEKA with a nifty interface"

We can agree that RapidMiner has the niftier interface, as I've already said. But I don't agree with your other conclusions.

Concerning efficiency: It is well known that the WEKA user interface is a memory hog. This initially prompted me to create the command-line guide to WEKA, which is still quite popular. Optimizations such as doing on-the-fly transformations instead of block transformations are trivial and cannot be construed as significant intellectual property. And you are of course free to contribute these changes to WEKA. I can run my large ham/spam dataset with 500,000 samples and 1.3 million attributes through NaiveBayesNominal on the command line and get results in a few minutes, and I would be surprised if RapidMiner had something of its own to offer that was as efficient and as accurate.
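
To illustrate the two strategies: a block transformation builds the whole transformed dataset before the next step runs, while an on-the-fly transformation hands over one transformed sample at a time. The following is a minimal Python sketch of that difference only; the function names and the toy scaling step are hypothetical and are not taken from WEKA or RapidMiner.

```python
from typing import Iterable, Iterator, List


def scale(sample: List[float], factor: float) -> List[float]:
    """Toy per-sample transformation (hypothetical stand-in for a filter)."""
    return [value * factor for value in sample]


def transform_block(samples: Iterable[List[float]], factor: float) -> List[List[float]]:
    """Block style: build the full transformed copy in memory first."""
    return [scale(sample, factor) for sample in samples]


def transform_on_the_fly(samples: Iterable[List[float]], factor: float) -> Iterator[List[float]]:
    """On-the-fly style: transform each sample only when it is requested."""
    for sample in samples:
        yield scale(sample, factor)


data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Both strategies produce the same values; they differ in how much
# transformed data has to be held in memory at once.
block_result = transform_block(data, 2.0)
streamed_result = list(transform_on_the_fly(data, 2.0))
```

The generator version never holds more than one transformed sample unless the consumer chooses to collect them, as the `list(...)` call does here for comparison.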

Concerning scalability: It is well known that NaiveBayes can be learned directly on a database, and I have personally pointed this out to a lot of people over the years. If you want, I can even write out the SQL statements which would get you the counts needed for NaiveBayes learning. Any iterative learning algorithm can be made to learn sample-by-sample for additional scalability; that is nothing new. You mention an iterative version of NaiveBayes and Perceptron. This is child's play: WEKA offers at least ten incremental learning algorithms, and most of these are much harder to make incremental. With a few dozen lines of code, I could turn WEKA's SMO into an incremental learning algorithm based on simple published ideas that work reasonably well.
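
As a rough illustration of the database point, the counts that NaiveBayes needs for nominal attributes can be obtained with plain GROUP BY queries. The sketch below uses Python's built-in sqlite3 module on a tiny in-memory table; the table name, column names and sample rows are all hypothetical, and these are not the SQL statements referred to above.

```python
import sqlite3

# Hypothetical toy table: two nominal attributes plus a class label.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mail (has_link TEXT, has_attachment TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO mail VALUES (?, ?, ?)",
    [
        ("yes", "no", "spam"),
        ("yes", "yes", "spam"),
        ("no", "no", "ham"),
        ("no", "yes", "ham"),
        ("yes", "no", "ham"),
    ],
)

# Class counts: the basis for the priors P(label).
class_counts = dict(
    conn.execute("SELECT label, COUNT(*) FROM mail GROUP BY label").fetchall()
)

ALLOWED_ATTRIBUTES = ("has_link", "has_attachment")


def value_counts(attribute: str) -> dict:
    """Per-class value counts: the basis for P(attribute = value | label)."""
    # Column names cannot be bound as parameters, so check a whitelist.
    if attribute not in ALLOWED_ATTRIBUTES:
        raise ValueError(f"unknown attribute: {attribute}")
    rows = conn.execute(
        f"SELECT label, {attribute}, COUNT(*) FROM mail GROUP BY label, {attribute}"
    ).fetchall()
    return {(label, value): count for label, value, count in rows}


link_counts = value_counts("has_link")

# Unsmoothed estimate of P(has_link = yes | spam) from the counts.
p_link_given_spam = link_counts[("spam", "yes")] / class_counts["spam"]
```

The database does all the counting, so the learner only ever sees one small result set per attribute. A practical version would typically add smoothing for value and class combinations that never occur.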

Concerning integrability: I personally think that it is an advantage to have all the intermediate transformation steps stored as static ARFF. This makes the analysis reproducible years later, which a dynamic approach does not allow. In any case, with the integration of WEKA into Pentaho there will be a general business intelligence framework including even a proper OLAP server, making this issue moot.

3. "...even the code tree is quite similar at first glance."

I believe that we have an agreement here. This may be just a case of convergent evolution.

4. "This does not seem to be widely known."

See most other points. I think that not everything I have said here is generally known.

5. "On a more positive note, this means that WEKA is on first place in usage (103+48 = 151 :-) among free tools in KDnuggets 2007 Poll: Data Mining / Analytic Software Tools"

Based on point 1, I am willing to concede that WEKA gets only 2/3 of YALE's votes (68.7 + 48 = 116.7, still first place, but just barely). But the original statement was meant more as a jest.
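
The adjusted figure can be reproduced as follows, using the 103 YALE votes and 48 WEKA votes from the quoted poll statement and the two-thirds share from point 1. The variable names are only illustrative.

```python
# Vote counts from the quoted poll statement; share taken from point 1.
YALE_VOTES = 103
WEKA_VOTES = 48
WEKA_SHARE = 2 / 3

# Credit WEKA with two thirds of the YALE votes, then add its own votes.
credited = YALE_VOTES * WEKA_SHARE
adjusted_total = credited + WEKA_VOTES

print(f"{credited:.1f} + {WEKA_VOTES} = {adjusted_total:.1f}")
```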

Still, I cannot resist offering a compromise: Count the number of people using the commercial version of RapidMiner (which does not include WEKA) separately from those who use the open-source version (which does include WEKA), and adjust votes accordingly. This might give us a good overview of how important WEKA is to RapidMiner's users.

6. "This should be accounted for in next years poll."

See point 5. I think the open-source and the closed-source version (without WEKA) of RapidMiner should be separate entries in next year's poll, perhaps even under different trade names, so there is less confusion (e.g. YALE for the OSS and RapidMiner for the closed-source version). You explained that the code base of the closed-source version is completely different, so this would make a lot of sense.

7. Legal point of view:

Your argument comparing RapidMiner/WEKA to SPSS/SAS is misleading. SPSS and SAS clearly do not have such a large overlap in their code. And if they did, it would obviously be a legal issue.

Things are different for open source software, and rightly so, but I think RapidMiner should tread more carefully and point out their association with WEKA more clearly and more prominently.

Best,
Dr. Alexander K. Seewald


Copyright © 2007 KDnuggets.