ioModel Machine Learning Research Platform – Open Source
This article introduces ioModel, an open source research platform that ingests data and automatically generates descriptive statistics on that data.
By Matt Hogan, twintechlabs.
In the past, scientific researchers who strove to innovate have had to either learn the discipline of writing code or rely on computer or data scientists for complex model development and the integration of the models that were developed. The ioModel Research Platform challenges this traditional approach by putting the power of machine learning directly into the hands of subject matter experts, unlocking the potential for more rapid innovation at a significantly reduced cost with higher reliability.
The ioModel Research Platform is developed entirely using open source technology and is itself available (without support or warranty) under the GPL License on GitHub. We invite the scientific community to collaborate with us on the roadmap, development, and governance of the Platform. We’re committed to working openly and transparently to drive forward scientific research and innovation.
The software as it exists today (approaching a 1.0 release) supports the ingestion of CSV files into data frames, statistical exploration of the data, transformation of the data, and the training and evaluation of predictor and classifier models – all without writing any code. Our product roadmap includes support for features such as cluster analysis and a research “notebook” that keeps a log of all the processing steps taken during each project, for easier paper writing and reproducibility.
ioModel relies on a local installation of the PostgreSQL database (though others, like MySQL and SQLite, will also work) and Python 2.7 (Python 3 is not supported at this time, but future support for it is planned). The project is available on GitHub (https://github.com/twintechlabs/iomodel) and includes instructions for getting up and running quickly.
Once installed, we can take it for a quick test-drive. First, let’s head over to the UCI archive and grab a data set where we can predict CPU performance based on hardware attributes: https://archive.ics.uci.edu/ml/machine-learning-databases/cpu-performance/machine.names
A copy of the data file is also available in our GitHub repo.
Once you have the data file, fire up your local copy of ioModel (python manage.py runserver), log in, and create a project to host your new data file (and the associated models you’ll create). Click the New Project button, give your new project a name and description, and then select Create. Once you are back on your project page, click into your newly created project to start working with it.
From your newly created project, select the Import Data button to start the process of importing your data set. The file will need to be in CSV format and the first row should include column header/names. Give your new data set a name, description, and select the CSV file that you downloaded from the GitHub repo. Once imported, you’ll notice that you immediately get a bunch of descriptive statistics generated along with the ability to analyze the data a number of different ways:
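To make the import step concrete, here is a minimal pandas sketch of what happens conceptually when a CSV is ingested and summarized. The column names follow the UCI machine.names description; the sample rows below are hard-coded for illustration, and this is not ioModel's actual implementation.

```python
import pandas as pd

# Columns per the UCI CPU performance data set (machine.names);
# these few rows are inlined here purely for illustration.
columns = ["VENDOR", "MODEL", "MYCT", "MMIN", "MMAX",
           "CACH", "CHMIN", "CHMAX", "PRP", "ERP"]
rows = [
    ["adviser", "32/60",   125,  256,  6000, 256, 16, 128, 198, 199],
    ["amdahl",  "470v/7",   29, 8000, 32000,  32,  8,  32, 269, 253],
    ["amdahl",  "470v/7a",  29, 8000, 32000,  32,  8,  32, 220, 253],
]
df = pd.DataFrame(rows, columns=columns)

# Descriptive statistics comparable to what ioModel displays after import
stats = df.describe()
print(stats)
```

In practice you would load the full file with `pd.read_csv(...)`; the point is simply that each numeric column immediately gets count, mean, standard deviation, quartiles, and min/max.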
It’s often helpful to start by understanding coefficients of correlation. This can be done by selecting the Correlation button and then the PRP field. In this data set, PRP is the actual measured machine performance, while ERP is the estimated performance that the paper’s authors generated using multiple linear regression. Selecting PRP and then Analyze will get you a chart showing the strength of the positive and negative relationships between each numeric variable and the feature that you selected.
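The same correlation view can be sketched in a few lines of pandas. The numeric values below are a small made-up sample in the shape of the CPU data, used only to show the calculation; ioModel's own rendering is a chart rather than a printed series.

```python
import pandas as pd

# Hypothetical numeric slice of the CPU data (column names from the UCI set;
# the values are a small invented sample for this sketch).
df = pd.DataFrame({
    "MYCT": [125, 29, 29, 26, 23],
    "MMAX": [6000, 32000, 32000, 32000, 32000],
    "CACH": [256, 32, 32, 32, 64],
    "PRP":  [198, 269, 220, 172, 132],
})

# Pearson correlation of every other numeric column with PRP,
# analogous to selecting PRP in ioModel's Correlation view
corr_with_prp = df.corr()["PRP"].drop("PRP").sort_values(ascending=False)
print(corr_with_prp)
```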
Now it’s time to train our first model. Use the breadcrumbs to navigate back to your data set and select the Train Model button. Give your model a name, select Multiple Linear Regression as the model, PRP as the feature target, and select the following fields (using CTRL-click) as features for training:
Then select the Train button. Once training is complete, you’ll see a screen showing the parameters you used to train your model, the RMSE, and the max error. These values have been calculated automatically using an 80:20 train:test split on your data. You can further evaluate your model by selecting the Validate Model button. Doing so generates a leave-one-out cross-validation that shows you the explained variance, the confidence interval, and a host of visualizations to help you understand variance and how your model performed. You also get access to a cross-validation file that includes the predicted values for each row along with the residuals. Every model you create is available for future access from your project page. Furthermore, every model you create is automatically accessible as a web service from within your application, and you can view usage metrics to understand how it has been performing in the wild.
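For readers who want to see what these evaluation numbers mean, here is a scikit-learn sketch of the same workflow: an 80:20 split with RMSE and max error, followed by leave-one-out cross-validation with explained variance and residuals. The data is synthetic and this is an assumption about comparable steps, not ioModel's actual internals.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, LeaveOneOut, cross_val_predict
from sklearn.metrics import mean_squared_error, explained_variance_score

# Synthetic stand-in for the hardware attributes (X) and the PRP target (y)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 4))
y = X @ np.array([3.0, 1.5, -2.0, 0.5]) + rng.normal(0, 0.1, size=100)

# 80:20 train/test split, as in ioModel's initial evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
rmse = mean_squared_error(y_test, pred) ** 0.5
max_err = np.max(np.abs(y_test - pred))

# Leave-one-out cross-validation, as in the Validate Model step
loo_pred = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())
explained = explained_variance_score(y, loo_pred)
residuals = y - loo_pred   # one residual per row, as in the CV file

print(f"RMSE: {rmse:.3f}  max error: {max_err:.3f}  "
      f"explained variance: {explained:.3f}")
```

The point of the leave-one-out pass is that every row is predicted by a model that never saw it, so the residuals give an honest picture of out-of-sample behavior on small data sets like this one.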
We’re excited by ioModel and its promise as a visual platform for machine learning, and we’re happy to engage with the community in an open, constructive forum. Feel free to contact us with any questions or contributions!
Bio: Matt is a passionate technologist and futurist who believes that we stand to build the greatest products by studying the intersection of people and technology, both when products are being built and in how they are used. He is dedicated to advancing analytics, data science, and the development of advanced cognitive systems to drive quality-of-life improvements for all and to facilitate the creation of smart, connected cities, devices, and people.