Portable Format for Analytics: moving models to production

There are many ways to compute the best solution to a problem, but not all of them can be put into production. The Portable Format for Analytics (PFA) provides a way of formalizing and moving models.

By Jim Pivarski, DMG.

As a data scientist today, you have a lot of tools to choose from. Thousands of R packages are available on CRAN, libraries like Mahout and MLlib bring machine learning to the Hadoop/Spark ecosystem, and Python has a growing set of analysis tools based on Numpy, Scipy, and Scikit-Learn. However, that blessing is also a curse: while a proliferation of tools is conducive to tinkering, it's a nightmare to productionalize.

Data Science Tools & Environments

Imagine the following scenario: after struggling with conventional data mining algorithms, you find an enormous lift using Professor Hess's Gradient-Boosted Deep Learning Monte Carlo Adaptive Regularization Toolkit (GraBooDeLearnMicArt) and you're ready to put your amazingly accurate model to work. But there's a catch. Your company's servers only run Java, and GraBooDeLearnMicArt is an R package. An R-Java bridge would not only be an installation and maintenance burden, but perhaps it goes against your company's security policy. Perhaps GraBooDeLearnMicArt has hundreds of seemingly unnecessary dependencies. Furthermore, the prospect of reimplementing it in Java is grim.

On the other hand, you don't need to reimplement the whole GraBooDeLearnMicArt package for production, only its "predict" method, which scores new data against a fixed model. The "fit" method may require state-of-the-art algorithms to produce a model from training data, but suppose that "predict" is only a mild variation on matrix multiplication and decision trees. You work with a Java developer to implement the "predict" method, but later realize that you forgot a pre-processing step and have to change the code. A few weeks later, you decide to smooth the outputs. Of course, you'll be refreshing the models every month. The Java developer gets annoyed and stops answering your calls.

There is a better way.

The Portable Format for Analytics (PFA) is a common language for representing statistical scoring engines, the "predict" method of a model. A PFA scoring engine is a JSON file containing model parameters and a scoring procedure. The scoring procedure transforms inputs to outputs by composing functions that range in complexity from addition to neural nets. If your "predict" method can be expressed in terms of common data science primitives (arithmetic, special functions, matrices, list/map manipulations, decision trees, nearest cluster/neighbor, and "lapply"-like functional programming), then it can be written in a few lines of PFA "code" (actually JSON). For example, a random forest is scored like this:

      {"a.map": [{"cell": "forest"},
         {"params": [{"tree": "TreeNode"}],
          "ret": "string",
          "do": {"model.tree.simpleTree":
             ["input", "tree"]}}]}}

Starting from the innermost function call, "model.tree.simpleTree" scores "input" against a "tree", which is part of an inline user-defined function that transforms a "TreeNode" named "tree" into a "string" by scoring it, which is applied to a list of "TreeNodes" from a data cell named "forest", and the most common ("a.mode") score is reported. It could be generated automatically from an R expression like this:

   a.mode(a.map(forest, function(tree) {
      model.tree.simpleTree(input, tree)

PFA has an open specification developed by the not-for-profit Data Mining Group with implementations for Java, Python, and R. In the scenario above, the data scientist would only have to express the model in PFA for the backend engineer to plug it into a PFA implementation running in production. Thanks to a detailed conformance suite, everyone can be confident that a scoring engine that tests well in R or Python will work in Java.

PFA can be compared to the Predictive Model Markup Language (PMML), which also provides a language-neutral way to encode models. However, PFA adds the flexibility of arbitrary function composition, rather than choosing from a set of established models. Want to partition a space with clusters and associate a different SVM to each cluster? Sure. Want to augment decision trees so that a neural network is performed at each node to decide which branch to follow? Why not? The only constraint is that input data types, model parameter types, and function signature types fit together, so that a scoring engine can be compiled down to fast bytecode in high-performance environments.

I'd like to close by comparing PFA to OpenGL. In the early 1990's, there was no shortage of graphics toolkits, but OpenGL was a cross-platform standard that focused on simple primitives, rather than being a complete window-widget toolkit. OpenGL code written in one language on one system could easily move to another. The standard only described the API, so software implementations could be optimized and ultimately replaced by hardware for blazingly fast processing. Someday, the same may be true for PFA.

Author bio: Jim Pivarski is a data analyst and software engineer at Open Data Group.