Too slow or out of memory problems in Machine Learning/Data Mining?

[This question from mvarshney was posted on the KDnuggets Data Mining Open Forum, and I thought it was interesting enough to share in KDnuggets News. Please comment below. Gregory Piatetsky, Editor.]

What are some of the problems in machine learning, data mining and related fields that you have difficulties with because they are too slow or need excessively large memory?

As a hobby research project, we built an out-of-core programming model that handles data larger than system memory and natively supports parallel/distributed execution. It showed good performance on some problems (see below), and we would like to expand this technology (hopefully community-driven) to real-life problems.
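To make the idea concrete, here is a minimal Python sketch of the basic out-of-core pattern: stream the data in fixed-size chunks so that only one chunk is resident in memory at a time. The file name and chunk size are made-up placeholders; the actual programming model additionally parallelizes the per-chunk work across cores and machines.

# Minimal sketch of the out-of-core idea: process a file in fixed-size
# chunks so only one chunk is held in memory at a time.
# "ratings.csv" and CHUNK_ROWS are hypothetical placeholders; a
# headerless, numeric CSV is assumed.
import csv

CHUNK_ROWS = 100_000

def chunked_rows(path, chunk_rows=CHUNK_ROWS):
    # Yield lists of parsed rows without materializing the whole file.
    with open(path, newline="") as f:
        chunk = []
        for row in csv.reader(f):
            chunk.append(row)
            if len(chunk) == chunk_rows:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

def column_mean(path, col):
    # Memory use is O(chunk size), not O(file size).
    total, count = 0.0, 0
    for chunk in chunked_rows(path):
        for row in chunk:
            total += float(row[col])
            count += 1
    return total / count if count else float("nan")

print(column_mean("ratings.csv", col=2))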

Some benchmarks (against Weka, Mahout and R):

a) Apriori Algorithm for frequent itemset mining [CPU-bound, moderate memory usage]

The Webdocs dataset has 1.7M transactions over 5.2M unique items (1.4GB). The algorithm finds sets of items that frequently appear together in transactions. At a 10% support level, Weka 3 could not complete this job in 3 days. Our version completed it in 4 hours 24 minutes (although, to be fair, we used tries instead of hash tables as Weka does). More importantly, on a single 8-core machine it took 39 minutes, and on 8 machines it took 6 minutes 30 seconds (roughly a 40x speedup).
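For readers who have not used Apriori, the toy in-memory Python sketch below shows the core candidate-generation and support-counting loop on made-up transactions. The benchmarked version differs in that it uses tries and keeps its data structures out of core so that Webdocs fits.

# Toy in-memory Apriori-style frequent itemset miner, for illustration
# only; the benchmarked version uses tries and out-of-core storage.
def frequent_itemsets(transactions, min_support):
    n = len(transactions)
    min_count = min_support * n

    # Frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_count}
    all_frequent = set(frequent)

    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        counts = {c: 0 for c in candidates}
        for t in transactions:
            t = set(t)
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {c for c, cnt in counts.items() if cnt >= min_count}
        all_frequent |= frequent
        k += 1
    return all_frequent

# Example: at 50% support, {bread, milk} is frequent in this toy data.
txns = [{"bread", "milk"}, {"bread", "milk", "beer"},
        {"milk", "beer"}, {"bread", "milk", "eggs"}]
print(frequent_itemsets(txns, min_support=0.5))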

b) SlopeOne recommendation engine [High memory usage]

The MovieLens dataset has 10M ratings from 70K users for 10K movies. SlopeOne recommends new movies using collaborative filtering. Apache Mahout's "Taste" non-distributed recommender fails with less than 6GB of memory. To benchmark out-of-core performance, we restricted our version to 1/10th of that limit (600MB), and it completed with an 11% overhead in execution time (due to out-of-core access).
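The heart of SlopeOne is simply averaging per-item-pair rating differences, as in the toy Python sketch below (with made-up ratings). The pairwise difference table grows quadratically with the number of items, which is exactly what pushes the benchmarked version out of core.

# Toy in-memory SlopeOne sketch (made-up ratings); the benchmarked
# version keeps the pairwise difference tables out of core.
from collections import defaultdict

def slope_one_predict(ratings, user, target_item):
    # Predict `user`'s rating of `target_item` from average item-item
    # rating differences observed across all users.
    diff_sum = defaultdict(float)   # (i, j) -> sum of (r_i - r_j)
    diff_cnt = defaultdict(int)     # (i, j) -> number of co-raters
    for r in ratings.values():
        for i in r:
            for j in r:
                if i != j:
                    diff_sum[(i, j)] += r[i] - r[j]
                    diff_cnt[(i, j)] += 1

    num, den = 0.0, 0
    for j, rj in ratings[user].items():
        if j != target_item and diff_cnt[(target_item, j)]:
            avg_diff = diff_sum[(target_item, j)] / diff_cnt[(target_item, j)]
            num += (rj + avg_diff) * diff_cnt[(target_item, j)]
            den += diff_cnt[(target_item, j)]
    return num / den if den else None

ratings = {
    "alice": {"Matrix": 5, "Titanic": 3, "Up": 4},
    "bob":   {"Matrix": 4, "Titanic": 2},
    "carol": {"Titanic": 4, "Up": 5},
}
print(slope_one_predict(ratings, "bob", "Up"))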

c) Dimensionality Reduction with Principal Component Analysis (PCA) [both CPU- and memory-bound]

Mutants "p53" protein dataset of 32K samples with 5400 attributes each (1.1GB). PCA is used to reduce the dimension of dataset by dropping variables with very small variances. Although our version could process data larger than system virtual memory, we benchmarked this dataset since the R software can process it. R completed the job in 86 min. Our out-of-core version had no additional overhead; in fact, it completed in 67min on single-core and 14min on 8-core machine.

The excellent software available today either works on data in the megabytes range by loading it into memory (R, Weka, NumPy) or on tera/petabyte-scale data in data centers (Mahout, SPSS, SAS). There seems to be a gap in the gigabytes range: data larger than virtual memory but smaller than "big data". Admittedly, projects such as NumPy's Blaze, R's bigmemory, and ScaLAPACK are addressing this need.

From your experience, can you share examples where such a fast, out-of-core tool could benefit the data mining/machine learning community?

This question was also cross-posted on StackOverflow if you prefer to comment there.