RapidMiner Response to the post by Dr. Seewald (KDnuggets News 08:01, item 6, Features)

KDnuggets : News : 2008 : n01 : item6

Features

From: Ingo Mierswa
Date: 04 Jan 2008
Subject: RapidMiner Response to the post by Dr. Seewald

First of all, I would like to wish a "Happy New Year" to everybody. I would also like to thank Dr. Seewald for his thoughts, since they show us where some people still have trouble seeing the conceptual differences between our data mining solution RapidMiner and other data mining products like Weka.

In issue 23 of the KDnuggets newsletter, Dr. Seewald claimed that "RapidMiner is a version of Weka" and should be combined with Weka in next year's KDnuggets data mining tools poll. I wrote a detailed answer to Dr. Seewald, in which I tried to explain the conceptual differences between the two software tools and, far more importantly, that RapidMiner does not depend on Weka at all. Actually, the most commonly used version of RapidMiner at present, the free version 4.0.1, does not even contain Weka: neither any Weka-specific functionality nor a single line of Weka code.

Basically, this single piece of information should be enough to decide whether there is any reason to combine the two tools in future polls. I elaborated on these facts extensively in my last answer. Therefore, I only want to comment briefly on some of Dr. Seewald's new claims.

1. "The free data mining software tool Yale (now named RapidMiner) is heavily based on Weka."

Dr. Seewald still did not get the point: RapidMiner is a full-fledged, complete data mining suite that is fully independent of Weka. As I have said before, just remove the file "weka.jar", which has not been part of the stable free version of RapidMiner since August 2007 (version 4.0.1) anyway, and you will see that not very much happens. All relevant operators and functionality are still present. The functionality of Weka is just an optional addition intended to ease the transition to RapidMiner for Weka users.

I am not sure if Dr. Seewald even tried to use RapidMiner. We deliver hundreds of sample data mining processes together with the software. Only two contain a Weka learner (clearly marked with a preceding "W-") -- they are called "WekaLearner.xml" and "WekaEMClustering.xml" and they are examples of how easily one can use a Weka learner from within RapidMiner. None of the other hundreds of data mining processes depend on Weka at all. That is probably all that needs to be said about "heavily based".

"Comparing numbers of fuzzily defined concepts such as operators might be misleading."

We consider the same type and granularity of concept for both systems when comparing Weka vs. RapidMiner in this dimension. We derive about 100 operators from Weka vs. about 400 which are part of RapidMiner without Weka. A good part of these 400 operators fully replace the most important of the 100 operators derived from Weka. This proportion is the reason why you can simply remove Weka and still have a full data mining suite that can perform more kinds of data mining processes than Weka. In my opinion, for this type of comparison a user-centric view is certainly more helpful to users than a developer-centric measure such as lines of code.

"A time-honored and well established way to compare code contributions is based on lines-of-code."

As all software engineers can certainly confirm: it is not. Besides that, just combining the lines of code of both products into one single set does not have anything to do with the reality of this concrete software integration. RapidMiner also uses functionality of Microsoft Excel (and in contrast to the Weka parts, the Excel parts cannot even be removed!). Excel has several million lines of code -- maybe Dr. Seewald wants to suggest that the remaining votes (after moving some away to Weka) should be shared with Excel according to his method of calculation.

"The best programming languages are those with the smallest number of elementary operations, not those with the highest number."

This is why Scheme/Lisp is more widely used nowadays than Java? And the best data mining tools are those with the least functionality? Those that can connect to fewer data sources? Provide fewer data preprocessing options? Provide fewer visualization and evaluation schemes? This argument is of course complete nonsense.

Sorry, but I cannot resist: following your own arguments, what do they say about the code quality of Weka? Needing more than twice the amount of code for far less functionality? Quantity does not imply quality. As a data mining environment (supporting the complete data mining process, not only a collection of data mining algorithms), RapidMiner without Weka provides 4x the functionality with less than half the code.

2. "I would classify it as WEKA with a nifty interface"

I do not want to comment here since all necessary arguments were already stated in my last post.

Just let me add that Dr. Seewald does not seem to get the point about the conceptual differences at all when he says that he can "...write out the SQL statements which would get you the counts...". I am sure he can. The important thing is: with RapidMiner there is no need for this type of optimization or database specialization. Everything is done directly by the RapidMiner data core. This is one of the most important differences in the development and usage of RapidMiner compared to that of Weka.

5. "Based on point 1., I am willing to concede that WEKA gets only 2/3 of YALE's votes (=68.7+48=116.7 - still first place, but just barely). But the original statement was more meant as a jest."

Even if Dr. Seewald's calculations for votes are meant as a joke, I explicitly do not want to comment on the new calculation. Readers can easily infer my opinion from my arguments above.

"Count the number of people using the commercial version of RapidMiner (which does not include WEKA) separately from those which use the open-source version (which does include WEKA), and adjust votes accordingly. This might give us a good overview of how important WEKA is to RapidMiner's users."

This would mainly give insight into how many more people use software when it is free.

6. "This should be accounted for in next years poll."

"See point 5. I think the open source and the closed source version (without WEKA) of RapidMiner should be separate points in next years poll."

I don't get why you are so concerned about dividing our software product into two different products while, for example, all of Salford Systems' products are counted as one. Both versions of RapidMiner share the same code base and basically the same functionality; they mainly differ in the license. The closed-source version (OEM version) is a developer license and hence has significantly fewer users than the free end-user license versions of RapidMiner. I have no idea why it would be interesting to see the number of developers using the closed-source version of RapidMiner for their products in a "data mining tool usage" poll.

"Perhaps even under different trade names, so there is less confusion (e.g. YALE for the OSS and RapidMiner for the closed-source version)."

Sorry, but we will of course not change our trade names only because somebody might be afraid that there could be some confusion during a KDnuggets poll. YALE was renamed to RapidMiner. For all license versions of it. And Rapid-I holds the rights to its sources and hence has the right to name RapidMiner as it pleases.

As a final conclusion from this discussion, I would like to suggest that all readers try (better: use) both tools. Both data mining applications have different strengths and weaknesses, and in one situation or another, one of the tools might be more appropriate. The great thing about both tools being open-source and freely available is that you are free to test them yourself.

Thanks again for your interest.
Best regards,
Ingo Mierswa
Managing Director of Rapid-I
