Data Workflows for Machine Learning

Paco Nathan compares several open source frameworks for Machine Learning workflows, including KNIME, IPython Notebook and related libraries, Cascading, Cascalog, and Spark/MLbase, and proposes 9 criteria to evaluate the best alternatives.

By Gregory Piatetsky, Apr 20, 2014.

This presentation was given at the SF Bay Area Machine Learning Meetup in April 2014.

In "Data Workflows for Machine Learning," Paco Nathan compares and contrasts several open source frameworks that have emerged for Machine Learning workflows, including KNIME, IPython Notebook and related Python libraries, Cascading, Cascalog, Scalding, Summingbird, and Spark/MLbase.

The analysis develops several "best of breed" points and identifies features that would be valuable to see across many frameworks, leading up to a "scorecard" to help evaluate the different alternatives.

The 9 criteria proposed for evaluating Machine Learning Data Workflows are:
  • includes people, defines oversight for exceptional data
  • separation of concerns, allows for literate programming
  • multiple abstraction layers for metadata, feedback, and optimization
  • testing: model evaluation, TDD, app deployment
  • future-proof system integration, scale-out, ops
  • visualizing allows people to collaborate via code
  • abstract algebra and functional programming containerize business process
  • blend results from different time-scales: batch and low latency
  • optimize learners in context, to make model selection potentially a compiler problem
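
As a rough illustration of how the abstract-algebra and time-scale criteria can fit together, the sketch below blends a batch result with a low-latency result through an associative merge that has an identity element (a monoid). It is not taken from the talk: the names `merge_counts`, `EMPTY`, and `blend` and the sample counts are hypothetical, and the frameworks named above implement this idea in their own ways.

```python
from collections import Counter
from functools import reduce

# Hypothetical example: partial word counts form a monoid under addition.
# Counter addition is associative, and the empty Counter is the identity,
# so partial results can be combined in any grouping.
# Note: Counter's "+" drops non-positive counts, so this assumes counts > 0.

EMPTY = Counter()  # identity element


def merge_counts(left, right):
    """Associative merge: add counts key by key."""
    return left + right


def blend(partials):
    """Fold any number of partial results into one, starting from EMPTY."""
    return reduce(merge_counts, partials, EMPTY)


# Made-up partial results: one from a batch job, one from a stream.
batch_result = Counter({"spark": 3, "cascading": 2})
stream_result = Counter({"spark": 1, "knime": 1})

blended = blend([batch_result, stream_result])
print(dict(blended))
```

Because the merge is associative, it makes no difference whether a partial result came from a nightly batch job or a streaming job, which is what lets a single definition serve both time-scales in this sketch.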

Paco Nathan also reviews the PMML standard for migrating predictive models, e.g., from SAS to Hadoop.

Watch at

Slides from an earlier meeting in Seattle (Jan 2014) are also available on SlideShare: Data Workflows for Machine Learning.

Paco Nathan is a "player/coach" who has led innovative data teams building large-scale apps for 10+ years and has worked as an OSS evangelist for the past 2+ years. He is an expert in distributed systems, machine learning, cloud computing, and functional programming, with a focus on enterprise data workflows. Paco received his BS in Math and MS in CS from Stanford, and has 30+ years of technology industry experience ranging from Bell Labs to early-stage start-ups.

See also an interview and opinion by Paco Nathan in KDnuggets: