PMML FAQ: Predictive Model Markup Language
An update on PMML (Predictive Model Markup Language), the de facto standard for representing predictive solutions. With PMML 4.1, all the capabilities available for data pre-processing were also made available for post-processing.
Alex Guazzelli kindly updated the KDnuggets FAQ entry on PMML, and his update was so good that I want to share it with KDnuggets readers.
Alex Guazzelli (VP, Analytics at Zementis) answers:
PMML stands for "Predictive Model Markup Language". It is the de facto standard to represent predictive solutions. A PMML file may contain a myriad of data transformations (pre- and post-processing) as well as one or more predictive models.
Because it is a standard, PMML allows different statistical and data mining tools to speak the same language. In this way, a predictive solution can be easily moved among different tools and applications without the need for custom coding. For example, it may be developed in one application and directly deployed on another.
Traditionally, the deployment of a predictive solution could take months: after building it, the data science team had to write a document describing the entire solution. This document was then passed to the IT engineering team, which would recode the solution in the production environment to make it operational. With PMML, that double effort is no longer required, since the predictive solution as a whole (data transformations + predictive model) is simply represented as a PMML file, which is then used as is for production deployment. What used to take months now takes hours or minutes with PMML.
PMML is developed by the Data Mining Group (DMG), a consortium of commercial and open-source data mining companies. The latest version of PMML, version 4.1, was released by the DMG in December 2011.
Since PMML is XML-based, it is not rocket science. Its structure follows a set of pre-defined elements and attributes which reflect the inner structure of a predictive workflow: data manipulations followed by one or more predictive models.
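To make that structure concrete, here is a minimal sketch in Python that parses a small, hand-written PMML document and lists its top-level sections. The field names (`age`, `risk`) and the coefficients are made up for illustration, and the element names reflect my reading of the PMML 4.1 schema, so check them against the DMG specification before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical PMML 4.1 document: field names and coefficients are invented.
PMML_DOC = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="Hypothetical example"/>
  <DataDictionary numberOfFields="2">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="risk" optype="continuous" dataType="double"/>
  </DataDictionary>
  <TransformationDictionary/>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="age"/>
      <MiningField name="risk" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="age" coefficient="0.01"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(PMML_DOC)
sections = [local_name(child.tag) for child in root]
print(sections)
# ['Header', 'DataDictionary', 'TransformationDictionary', 'RegressionModel']
```

The order mirrors the workflow described above: the data dictionary declares the raw fields, the transformation dictionary holds data manipulations, and the model element comes last.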
What are the benefits of PMML?
PMML makes it extremely easy for any predictive solution to be moved from one data mining system to another. For example, once represented as a PMML file, a predictive solution can be operationally deployed right away, without the need for custom code. In this way, PMML transforms predictive analytic solutions into dynamic assets that can be put to work immediately.
For big companies with many in-house statistical and data mining tools, PMML works as the common denominator, since wherever a solution is built, it can immediately be represented as a PMML file. This allows companies to use "best of breed" tools to build the best possible solutions.
Since PMML is a standard, it also fosters transparency and best practices. Transparency comes from the fact that the predictive solution is no longer a black box. By opening the box and understanding what is inside, the analytics team can easily recognize past decisions and establish practices that work.
What kind of predictive techniques are supported by PMML?
PMML defines specific elements for several predictive techniques, including neural networks, decision trees, and clustering models, to name just a few. Recently added techniques include k-Nearest Neighbors and Scorecards, the latter with support for reason codes.
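As an illustration of one such element, the sketch below scores a tiny decision tree written as a PMML `TreeModel`. The tree, its field names (`age`, `buys`), and the helper functions are hypothetical. The evaluator handles only `True` and numeric `SimplePredicate` nodes, so it is a simplified reading of the specification and not a complete scoring engine.

```python
import operator
import xml.etree.ElementTree as ET

NS = {"p": "http://www.dmg.org/PMML-4_1"}

# Hypothetical tree: predicts "yes" for age >= 30, otherwise "no".
TREE = """<TreeModel xmlns="http://www.dmg.org/PMML-4_1"
    functionName="classification" splitCharacteristic="binarySplit">
  <MiningSchema>
    <MiningField name="age"/>
    <MiningField name="buys" usageType="predicted"/>
  </MiningSchema>
  <Node score="no">
    <True/>
    <Node score="no">
      <SimplePredicate field="age" operator="lessThan" value="30"/>
    </Node>
    <Node score="yes">
      <SimplePredicate field="age" operator="greaterOrEqual" value="30"/>
    </Node>
  </Node>
</TreeModel>"""

# Subset of the SimplePredicate operators, numeric comparisons only.
OPERATORS = {
    "lessThan": operator.lt,
    "lessOrEqual": operator.le,
    "greaterThan": operator.gt,
    "greaterOrEqual": operator.ge,
    "equal": operator.eq,
}


def predicate_holds(node, record):
    """Evaluate the predicate attached directly to a Node."""
    if node.find("p:True", NS) is not None:
        return True
    predicate = node.find("p:SimplePredicate", NS)
    if predicate is None:
        raise NotImplementedError("only True and SimplePredicate are handled")
    compare = OPERATORS[predicate.get("operator")]
    return compare(record[predicate.get("field")], float(predicate.get("value")))


def score_tree(node, record):
    """Descend into the first child whose predicate holds; return the leaf score."""
    for child in node.findall("p:Node", NS):
        if predicate_holds(child, record):
            return score_tree(child, record)
    return node.get("score")


# The root node's own predicate is <True/>, so scoring starts there.
root_node = ET.fromstring(TREE).find("p:Node", NS)
print(score_tree(root_node, {"age": 25}))  # no
print(score_tree(root_node, {"age": 42}))  # yes
```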
PMML also defines an element for representing multiple models. That is, PMML can be used to represent model segmentation, composition, chaining, cascading, and ensembles, including Random Forest models.
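A rough sketch of how an ensemble might look: a `MiningModel` whose `Segmentation` element holds two regression models that are combined by averaging. The models and the input field `x` are invented. The code assumes every segment predicate is `True` and supports only the `average` combination method, so treat it as an illustration and not as a conforming consumer.

```python
import xml.etree.ElementTree as ET

NS = {"p": "http://www.dmg.org/PMML-4_1"}

# Hypothetical ensemble of two linear models over a single input "x".
ENSEMBLE = """<MiningModel xmlns="http://www.dmg.org/PMML-4_1" functionName="regression">
  <MiningSchema>
    <MiningField name="x"/>
  </MiningSchema>
  <Segmentation multipleModelMethod="average">
    <Segment id="1">
      <True/>
      <RegressionModel functionName="regression">
        <MiningSchema><MiningField name="x"/></MiningSchema>
        <RegressionTable intercept="1.0">
          <NumericPredictor name="x" coefficient="2.0"/>
        </RegressionTable>
      </RegressionModel>
    </Segment>
    <Segment id="2">
      <True/>
      <RegressionModel functionName="regression">
        <MiningSchema><MiningField name="x"/></MiningSchema>
        <RegressionTable intercept="3.0">
          <NumericPredictor name="x" coefficient="4.0"/>
        </RegressionTable>
      </RegressionModel>
    </Segment>
  </Segmentation>
</MiningModel>"""


def score_regression(model, record):
    """Intercept plus the sum of coefficient * input for each numeric predictor."""
    table = model.find("p:RegressionTable", NS)
    total = float(table.get("intercept"))
    for predictor in table.findall("p:NumericPredictor", NS):
        total += float(predictor.get("coefficient")) * record[predictor.get("name")]
    return total


def score_ensemble(mining_model, record):
    """Average the segment scores; segment predicates are assumed to be True."""
    segmentation = mining_model.find("p:Segmentation", NS)
    method = segmentation.get("multipleModelMethod")
    if method != "average":
        raise NotImplementedError(f"unsupported method: {method}")
    scores = [
        score_regression(segment.find("p:RegressionModel", NS), record)
        for segment in segmentation.findall("p:Segment", NS)
    ]
    return sum(scores) / len(scores)


ensemble_score = score_ensemble(ET.fromstring(ENSEMBLE), {"x": 1.0})
print(ensemble_score)  # 5.0  (segment scores are 3.0 and 7.0)
```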
To review all the elements supported by PMML, take a look at the language specification at the DMG website (see Resources below).
Can PMML represent data pre- and post-processing?
PMML has several built-in functions, such as IF-THEN-ELSE and arithmetic functions, that allow for extensive data manipulation. It also defines specific elements for the most common pre-processing tasks such as normalization, discretization, and value mapping. To review all the pre-processing capabilities PMML has to offer, refer to the PMML pre-processing primer.
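The sketch below shows what two such transformations could look like and how a consumer might evaluate them: a `NormContinuous` element that rescales `age`, and an `Apply` element that uses the built-in `if` and `greaterThan` functions to band `income`. The field names, thresholds, and the small evaluator are all hypothetical. The evaluator covers only a handful of expression types and assumes exactly two `LinearNorm` points.

```python
import operator
import xml.etree.ElementTree as ET

# Hypothetical transformations: rescale age to [0, 1] and band income.
TRANSFORMS = """<TransformationDictionary xmlns="http://www.dmg.org/PMML-4_1">
  <DerivedField name="age_norm" optype="continuous" dataType="double">
    <NormContinuous field="age">
      <LinearNorm orig="20" norm="0"/>
      <LinearNorm orig="70" norm="1"/>
    </NormContinuous>
  </DerivedField>
  <DerivedField name="income_band" optype="categorical" dataType="string">
    <Apply function="if">
      <Apply function="greaterThan">
        <FieldRef field="income"/>
        <Constant dataType="double">50000</Constant>
      </Apply>
      <Constant dataType="string">high</Constant>
      <Constant dataType="string">low</Constant>
    </Apply>
  </DerivedField>
</TransformationDictionary>"""

# Small subset of the PMML built-in functions.
FUNCTIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
    "greaterThan": operator.gt,
    "lessThan": operator.lt,
}


def local_name(element):
    return element.tag.rsplit("}", 1)[-1]


def evaluate(expr, record):
    """Evaluate a FieldRef, Constant, NormContinuous or Apply expression."""
    kind = local_name(expr)
    if kind == "FieldRef":
        return record[expr.get("field")]
    if kind == "Constant":
        return expr.text if expr.get("dataType") == "string" else float(expr.text)
    if kind == "NormContinuous":
        low, high = list(expr)  # assumes exactly two LinearNorm points
        x = record[expr.get("field")]
        x0, y0 = float(low.get("orig")), float(low.get("norm"))
        x1, y1 = float(high.get("orig")), float(high.get("norm"))
        return y0 + (x - x0) * (y1 - y0) / (x1 - x0)
    if kind == "Apply":
        name, args = expr.get("function"), list(expr)
        if name == "if":  # IF condition THEN args[1] ELSE args[2]
            branch = args[1] if evaluate(args[0], record) else args[2]
            return evaluate(branch, record)
        return FUNCTIONS[name](*(evaluate(arg, record) for arg in args))
    raise NotImplementedError(kind)


def derive(record):
    """Return a copy of the record extended with every derived field."""
    result = dict(record)
    for field in ET.fromstring(TRANSFORMS):
        result[field.get("name")] = evaluate(field[0], result)
    return result


derived = derive({"age": 45.0, "income": 60000.0})
print(derived["age_norm"], derived["income_band"])  # 0.5 high
```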
With PMML 4.1, all the capabilities available for data pre-processing were also made available for post-processing. As a result, a PMML file can now also contain a set of business rules that define actions or decisions to be taken based on the outcome of the predictive model. A PMML file can thus represent the entire predictive solution, from raw data and model to business decisions.
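As a sketch of such a business rule, the `Output` element below turns a model score into an action. The field names (`churn_score`, `action`), the 0.8 threshold, and the evaluator are hypothetical. The use of `feature="decision"` with an embedded expression follows my understanding of the PMML 4.1 output features, so verify it against the DMG specification.

```python
import operator
import xml.etree.ElementTree as ET

# Hypothetical post-processing: map a churn score to a business action.
OUTPUT = """<Output xmlns="http://www.dmg.org/PMML-4_1">
  <OutputField name="churn_score" feature="predictedValue"
               optype="continuous" dataType="double"/>
  <OutputField name="action" feature="decision"
               optype="categorical" dataType="string">
    <Apply function="if">
      <Apply function="greaterThan">
        <FieldRef field="churn_score"/>
        <Constant dataType="double">0.8</Constant>
      </Apply>
      <Constant dataType="string">offer_discount</Constant>
      <Constant dataType="string">no_action</Constant>
    </Apply>
  </OutputField>
</Output>"""

COMPARISONS = {"greaterThan": operator.gt, "lessThan": operator.lt}


def evaluate(expr, values):
    """Evaluate the small expression subset used in the output rule above."""
    kind = expr.tag.rsplit("}", 1)[-1]
    if kind == "FieldRef":
        return values[expr.get("field")]
    if kind == "Constant":
        return expr.text if expr.get("dataType") == "string" else float(expr.text)
    if kind == "Apply":
        name, args = expr.get("function"), list(expr)
        if name == "if":
            branch = args[1] if evaluate(args[0], values) else args[2]
            return evaluate(branch, values)
        return COMPARISONS[name](*(evaluate(arg, values) for arg in args))
    raise NotImplementedError(kind)


def apply_output(predicted_value):
    """Fill in the output fields in document order for one model prediction."""
    results = {}
    for field in ET.fromstring(OUTPUT):
        if field.get("feature") == "predictedValue":
            results[field.get("name")] = predicted_value
        else:
            # Later fields may refer to earlier output fields by name.
            results[field.get("name")] = evaluate(field[0], results)
    return results


print(apply_output(0.9))  # {'churn_score': 0.9, 'action': 'offer_discount'}
print(apply_output(0.3))  # {'churn_score': 0.3, 'action': 'no_action'}
```

Keeping the rule in the same file as the model means the decision logic travels with the model when it is moved between tools.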
Resources
Websites
- DMG website: Complete PMML specification
- Zementis PMML resources page: Links to PMML tools and examples
- PMML Wikipedia page
Book: PMML in Action (2nd Edition) - Available on Amazon.com
Talks/Presentations
- PMML Talk - Presentation on PMML and predictive analytics to the ACM Data Mining Bay Area/SF group.
- When Big Data and Predictive Analytics Collide - Presentation on PMML, Big Data, and Predictive Analytics given at Intellifest 2012.
Articles
- PMML: An Open Standard for Sharing Models - Paper in The R Journal
- What is PMML? - Article in IBM developerWorks
- Representing predictive solutions in PMML: Move from raw data to predictions - Article in IBM developerWorks
- Predicting the future, Part 4: Put a predictive solution to work - Article published on IBM developerWorks