OpenML: Share, Discover and Do Machine Learning

OpenML is a platform for sharing, organizing and reusing data, code and experiments, so that scientists can make discoveries more efficiently. In effect, it aims to build a network for machine learning research.


A recent paper introduced OpenML, which may offer an alternative way to mine data. Don’t confuse it with the other OpenML, a standard programming environment for digital media. The OpenML I introduce here is, as its name implies, an open science platform where machine learning researchers can share their datasets, algorithms and experiments. Its logo consists of four colors, each representing an important part of OpenML.

Here are the four key numbers of OpenML (as of Aug 11, 2014).

230 Data Sets

These are the input data for machine learning: the mushroom database, spam e-mail and letter image recognition, to name a few. For each data set, OpenML provides a brief description of its properties, such as default accuracy, number of classes and number of features.
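To make these properties concrete, here is a minimal Python sketch that computes them for a toy data set. It assumes that "default accuracy" means the accuracy of always predicting the most frequent class. The function name, the dictionary keys and the data are made up for illustration and are not taken from OpenML itself.

```python
from collections import Counter


def describe_dataset(rows, labels):
    """Compute a few simple properties of a labelled data set.

    Assumption: default accuracy is the share of the most frequent class,
    i.e. the accuracy of a model that always predicts that class.
    """
    if not labels:
        raise ValueError("need at least one labelled instance")
    class_counts = Counter(labels)
    majority_count = class_counts.most_common(1)[0][1]
    return {
        "number_of_instances": len(labels),
        "number_of_features": len(rows[0]),
        "number_of_classes": len(class_counts),
        "default_accuracy": majority_count / len(labels),
    }


# Toy data, invented for this example (not a real OpenML data set).
rows = [[5.1, 0], [4.9, 1], [6.2, 1], [5.9, 0]]
labels = ["edible", "edible", "poisonous", "edible"]

props = describe_dataset(rows, labels)
print(props)
```

A property like default accuracy is a useful baseline: a run on this data set is only interesting if it beats the score of always guessing the majority class.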

1172 Tasks

Tasks are created when researchers want to work with a data set. They fall into four types, Supervised Classification, Learning Curve, Supervised Data Stream Classification and Supervised Regression, depending on what kind of results are expected to be shared. Any user can download and solve a task.

364 Flows

Flows are implementations of algorithms, workflows or scripts that solve OpenML tasks, often through a plugin. Scientists can also upload the actual code or reference it by URL if the code is hosted on GitHub or another open source platform. Each flow page compares the results of all tasks the flow has been run on.

24990 Runs

A run is an attempt to solve a task and obtain the required output. Take Run 24980 as an example: it applies the flow weka.Bagging_SMO_PolyKernel(1) to task 36, a supervised classification task on the data set segment. The run also provides evaluation results such as AUC, the confusion matrix and predictive accuracy. People can easily compare the results of all runs on the same task.
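The relationship between runs, flows and tasks can be sketched in a few lines of Python. This is only an illustration of the idea, not OpenML's actual data model: the Run class, the rank_runs helper, the flow names and all scores below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One attempt to solve a task with a flow (hypothetical structure)."""
    run_id: int
    task_id: int
    flow_name: str
    evaluations: dict = field(default_factory=dict)


def rank_runs(runs, task_id, measure="predictive_accuracy"):
    """Return the runs on one task that report `measure`, best score first."""
    on_task = [
        r for r in runs
        if r.task_id == task_id and measure in r.evaluations
    ]
    return sorted(on_task, key=lambda r: r.evaluations[measure], reverse=True)


# Invented runs and scores, for illustration only.
runs = [
    Run(1, 36, "flow_a", {"predictive_accuracy": 0.91}),
    Run(2, 36, "flow_b", {"predictive_accuracy": 0.95}),
    Run(3, 7, "flow_a", {"predictive_accuracy": 0.99}),
]

best = rank_runs(runs, task_id=36)
print([r.flow_name for r in best])
```

Because every run records which task it solved, comparing flows comes down to filtering by task and sorting by a shared evaluation measure.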

It is worth mentioning that OpenML can be integrated into other machine learning tools, such as Weka and R, so that people can upload data and code automatically. For example, in Weka we can add a number of tasks and Weka algorithms to run. The plugin will download all the data, run every algorithm on every task, and then automatically upload the results to OpenML. Manual run upload is still under development, so for now runs can only be uploaded through the plugins or the API (Java/R).
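The plugin behaviour described above, running every algorithm on every task and uploading each result, amounts to a loop over all task and algorithm pairs. The Python sketch below shows that pattern only. It does not use the Weka plugin or the OpenML API: run_batch, the task dictionaries and the upload callback are hypothetical stand-ins, and the "algorithms" are trivial built-in functions.

```python
from itertools import product


def run_batch(tasks, algorithms, upload):
    """Run every algorithm on every task and pass each result to `upload`.

    `upload` is a stand-in for whatever sends a run to the server.
    Returns the number of results handed to it.
    """
    uploaded = 0
    for task, (name, algorithm) in product(tasks, algorithms.items()):
        result = algorithm(task["data"])
        upload({"task_id": task["id"], "flow": name, "result": result})
        uploaded += 1
    return uploaded


# Toy tasks and "algorithms", invented for this example.
tasks = [{"id": 1, "data": [1, 2, 3]}, {"id": 2, "data": [4, 5]}]
algorithms = {"sum": sum, "max": max}

collected = []  # stands in for the remote server
count = run_batch(tasks, algorithms, collected.append)
print(count, "results uploaded")
```

With two tasks and two algorithms, four results are produced, one for each pair. Passing the upload step in as a function keeps the loop independent of how results actually reach the server.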

OpenML or Kaggle?

The benefit of an open source platform grows faster as more people use it. As people become familiar with OpenML, the numbers above will surely increase, and may already have increased by the time you read this post. One obvious benefit of OpenML is that researchers can define their own tasks and also build algorithms to solve other people's tasks. All shared results are stored and organized online for easy access, reuse and discussion.

You may think of Kaggle, where people can also download data sets and evaluate different algorithms. However, OpenML is designed for sharing and comparing research results. It focuses on collaboration instead of competition. Winning a Kaggle competition is a good way to showcase machine learning skills, but OpenML will be a place to make discoveries, as long as enough runs are performed.

Ran Bi is a master's student in the Data Science program at New York University. She has done several projects in machine learning, deep learning and big data analytics during her studies at NYU. With an undergraduate background in Financial Engineering, she is also interested in business analytics.