

RapidMiner and Big Data – In-Memory, In-Database, and In-Hadoop


RapidMiner offers flexible approaches to remove any limitations on data set size. This paper compares three RapidMiner engines: In-Memory, In-Database, and In-Hadoop.



RapidMiner & Big Data – How Big is Big? by Ingo Mierswa, July 31, 2013

… The RapidMiner platform is an excellent solution for handling unstructured data like text files, web traffic logs, and even images. Given this, the variety aspect of big data does not pose new challenges to the platform. But we will discuss how the volume of big data can be easily handled, without writing a single line of code.

Analytical Engines in RapidMiner
RapidMiner offers flexible approaches to remove any limitations on data set size. RapidMiner's most commonly used engine is the In-Memory engine, where data is loaded completely into memory and analyzed there. This and other engines are outlined below.

In-Memory: The natural storage mechanism of RapidMiner is in-memory data storage, highly optimized for data access usually performed for analytical tasks.

  • In-memory analytics is always the fastest way to build analytical models.
  • Data set size is restricted by hardware (memory): the more memory available, the larger the data sets that can be analyzed.
  • Data set size: On decent hardware, up to approximately 100 million data points.
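As a rough illustration of that hardware bound, here is a back-of-envelope memory estimate in Python (a sketch; the per-value size and overhead factor are assumptions, not figures from the article, and "data points" counts individual cells rather than rows):

```python
def in_memory_gb(rows, columns, bytes_per_value=8, overhead=1.5):
    """Approximate RAM needed to hold a table fully in memory.

    bytes_per_value: assumes double-precision values (8 bytes each).
    overhead: assumed factor for object headers, indices, etc.
    """
    return rows * columns * bytes_per_value * overhead / 1024**3

# 100 million data points, e.g. 10 million rows x 10 columns:
print(round(in_memory_gb(10_000_000, 10), 2))  # roughly 1.12 GB
```

Under these assumptions, 100 million data points fit comfortably in a few gigabytes of RAM, which matches the "decent hardware" bound above; real memory usage depends on data types and the platform's internal representation.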

In-Database: The enterprise edition of RapidMiner offers a set of operators where the data stays in the database and the analysis is performed there. This allows for essentially unlimited data set sizes since the data is not extracted from the database.

  • Not applicable for all analysis tasks.
  • Runtime depends on the power of the database server.
  • Data set size: Unlimited (limit is the external storage capacity).
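The in-database idea can be sketched in plain SQL: push the heavy aggregation into the database and transfer only the small result back to the client. A minimal illustration using Python's built-in sqlite3 module (the table and column names are made up for the example; in practice the database would be a large production server, not an in-memory SQLite instance):

```python
import sqlite3

# Hypothetical transactions table standing in for a large production database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (customer TEXT, amount REAL)")
con.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("a", 10.0), ("a", 30.0), ("b", 5.0), ("b", 15.0), ("b", 40.0)],
)

# The full table never leaves the database; only the per-customer
# aggregates (a handful of rows) are transferred to the client.
rows = con.execute(
    "SELECT customer, COUNT(*), AVG(amount) "
    "FROM transactions GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('a', 2, 20.0), ('b', 3, 20.0)]
```

This is why the approach scales to essentially unlimited data set sizes: the network and client memory only ever see the aggregated result, and runtime is governed by the database server doing the work.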

In-Hadoop: The advantage of Hadoop is that it offers both distributed storage and the possibility of using the Hadoop cluster as a distributed analytical engine, distributing certain analytical and preprocessing tasks across the cluster.

  • Not applicable for all analysis tasks.
  • Runtime depends on the power of the Hadoop cluster.
  • Due to the overhead Hadoop introduces, it is not recommended for smaller data sets.
  • Data set size: Unlimited (limit is the external storage capacity).
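The map/reduce pattern behind Hadoop's distributed engine can be sketched in a few lines: map each record to key/value pairs, shuffle them by key, then reduce each group. Below is a toy, single-machine simulation in Python (an illustration only; real Hadoop distributes the three phases across the cluster, which is where the overhead mentioned above comes from):

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (label, 1) pair for each record -- e.g. the class
    # counts a Naive Bayes model needs.
    for label, _features in records:
        yield label, 1

def reduce_phase(pairs):
    # Shuffle: group the emitted pairs by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: sum each group's values.
    return {key: sum(values) for key, values in groups.items()}

records = [("spam", "..."), ("ham", "..."), ("spam", "...")]
print(reduce_phase(map_phase(records)))  # {'spam': 2, 'ham': 1}
```

Because each phase operates on independent chunks, the same program scales out as data grows, at the price of job-setup and shuffle overhead that only pays off for large data sets.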

Below, you can find a runtime comparison for the creation of a Naive Bayes model with these 3 engines:

[Figure: RapidMiner computation engine runtimes for building a Naive Bayes model]


