New Standard Methodology for Analytical Models
Traditional methods for the analytical modelling like CRISP-DM have several shortcomings. Here we describe these friction points in CRISP-DM and introduce a new approach of Standard Methodology for Analytics Models which overcomes them.
Standard Methodology for Analytical Models
In the previous paragraphs, the short comings of the CRISP-DM were discussed, whereby the issues raised are the reversed view of the improvements made in the Standard Methodology for Analytical Models (SMAM). The phases of the SMAM are as follows:
The use-case identification phase describes the brainstorming/discovery process of looking at the different areas where models may be applicable. It involves education of the business parties involved on what analytical modeling is, what realistic expectations are from the various approaches and how models can be leveraged in the business. Discussions on the use-case identification involve topics around data availability, model integration complexity, analytical model complexity and model impact on the business. From a list of identified use-cases in an area, the one with the best ranking on above mentioned criteria should be considered for implementation. Note that, businesses often are not familiar with this very first data/fact based funneling step, and as a result, they will have to let go of their religiously held initial –and much to complex- analytic modeling idea. Parties involved in this phase are (higher) management, to ensure the right goal setting, IT for data availability, the involved department for model relevance checking and the data scientists to infuse the analytical knowledge and creative analytical idea provisioning. The result of this phase is a chosen use-case, and potentially a roadmap with the other considered initiatives on a timeline.
Model requirements gathering
The use-case requirements gathering phase describes the process where for the chosen use-case, the set of conditions are explored that need to hold for the model to be viable in the business process. A not exhaustive list of topics of discussion are conditions for the cases/customers/entities considered for scoring, side-conditions that need to hold, consistency checks that need to hold, handling of unexpected predictions, or unexpected input data, requirements about the availability of the scores, the timing of scores (and the data), the frequency of refresh of the scores; initial ideas around model reporting can be explored and finally, ways that the end-users would like to consume the results of the analytical models. Parties involved in this phase are people from the involved department(s), the end-users and the data scientists. The result of this phase is a requirements document.
In the data preparation phase, the discussions revolve around data access, data location, data understanding, data validation, and creation of the modeling data. It is needed to create an understanding of the operational data required for scoring, both from an availability (cost) and timing perspective. This is a phase where IT/data administrators/DBA’s closely work together with the data scientist to help prepare the data in a format consumable by the data scientist. The process is agile; the data scientist tries out various approaches on smaller sets and then may ask IT to perform the required transformation in large. As with the CRISP-DM, the previous phase, this phase and the next happen in that order, but often jump back and forth. The involved parties are IT/data administrators/DBA/data modelers and data scientists. The end of this phase is not so clearly defined. One could argue that the results of this phase should be the data scientist being convinced that with the data available, a model is viable, as well as the scoring of the model in the operational environment.
In the modeling experiment phase, the core data scientist is at his/her element. This is where they can play with the data; crack the nut; trying to come up with the solution that is both cool, elegant and working. Results are not immediate; progress is obtained by evolution and by patiently collecting insights to put them together in an ever evolving model. At times, the solution at hand may not look viable anymore, and an entire different angle needs to be explored, seemingly starting from scratch. It is important to set the right expectations for this phase. There is no free lunch for the data scientist, although the business always seems to think so. The term data science5 does honor to what is being done here: it is scientific research, with all its struggles, its Eureka’s, and its need for meticulous experimentation. The topic of the research: data, and hence the fitting term: data science. The data scientist may need to connect to end-user to validate initial results, or to have discussion to get ideas which can be translated into testable hypotheses/model features. The result of this phase is an analytical model that is evaluated in the best possible way with the (historic) data available as well as a reporting of these facts.
Dashboards and visualization are critically important for the acceptance of the model by the business. In analytical aspiring companies, analytical models often are reported on by a very technical model report, at the birth of the model in a non-repeatable format. In more mature analytical practice, the modeling data is used for insight creation is a repeatable way. Topics of discussion in this phase are analytic reporting and operational reporting. Analytical reporting refers to any reporting on data where the outcome (of the analytical model) has already been observed. This data can then be used to understand the performance of the model and the evolution of performance over time. Creating structural analytic performance reports also pave the way for structural proper testing using control groups. Operational reporting refers to any reporting on the data where the outcome has not yet been observed. This data can be used to understand what the model predicts for the future in an aggregated sense and is used for monitoring purposes. For both types of reporting, insights are typically created by looking at behavior of subgroups as qualified by the model. By creating a structural reporting facility for the insights, it allows deeper insight in changing patterns that can be used by business users, as a ‘free’ addition to the repeated scoring of the analytical model. The involved parties are the end-users, the involved business department, potentially a reporting department and the data scientists. The result of this phase is a set of visualizations and dashboards that provide a clear view on the model effectiveness and provide business usable insights.
Proof of Value: ROI
Analytical models typically start as an experiment where, at the start of the project, the results cannot be guaranteed. Results depend on the quality of the data and the (unobservable) knowledge that the data contains about the phenomenon to be modelled, as well the quality of the data scientist, the time spent on the creation of the solution and the current state of the art of analytical models. As stated earlier, the business is not educated to think about the quality of analytical models in a technical way, nor should they necessarily get there. However, as the model impacts many business targets, the involved parties in the business need to be sure that they can trust the model (very concrete: their bonuses depend on their business performance, and hence the performance of the analytical model may determine their bonus). An accuracy of 90% seems to be a good target for an analytical model from business perspective, irrespective of the understanding of the measure of accuracy involved. Yet, the criteria influencing the quality of an analytical model are discussed above and cannot be commanded by the business. To jump out of this back-and-forth discussion, a proper experiment needs to be set up: in a limited fashion, the analytical model is applied to new data and the outcomes are measured in such a way that the result can be made financial. If the ROI is positive enough, the business will be convinced that they can trust the model; the model is proven to generalize well once more, and a decision can be made if the model should be deployed or not. Topics of discussion are around the setup of the experiment, control groups, measuring the model effectiveness, computation of the ROI and the success criteria. The people involved are the end-users, potentially the finance department, the IT department in order to provide the new data for the experiment and the data scientists. The result of this phase is a report on the experimental setup, the criteria around the measurements and the outcome.
The operationalization phase is not applicable to all models, although, the models that are most valuable are not one-time executions, but are embedded, repeatable scoring generators that the business can act upon. The operationalization is a phase where the data scientist closely works with the IT department. The model development took place in a relatively unstructured environment that gave the possibility to play with data and experiment with modeling approaches. Embedding an analytical model in the business means it migrates from this loosely defined environment to a location of rigor and structure. The discussions that the data scientist and the IT operator need to have, revolve around a hand-over process of the model. In addition, the IT operator needs to understand the data requirement of the model and needs to prepare the operational environment for this. The hand-over of a model to an operational team needs to come with an audit structure. If integration in end-user systems is required, programmers are involved, guided by the data scientist on the workings of the analytical model. Moreover, for the integration itself, an IT change process such as Agile6 may be defined. The result of the initial part of this phase is a hand-over document where all parties involved agree on the coming process. The final result of this phase is a functional analytical model, that is, repeatable scores of the model are available to the business process in order help makes better decisions.
An analytical model in production will not be fit forever. Depending on how fast the business changes, the model performance degrades over time. The insight creation phase took care of the monitoring of this performance; the model life cycle phase defines what needs to happen. Generally, two types of model changes can happen: refresh and upgrade. In a model refresh, the model is trained with more recent data, leaving the model structurally untouched. The model upgrade is typically initiated by the availability of new data sources and the request from the business to improve model performance by the inclusion of the new sources. The involved parties are the end-users, the operational team that handles the model execution, the IT/data administrators/DBA for the new data and the data scientist. The result of this phase is, during the construction of the phase, a document describing the governance and agreement on the change processes around the model refresh/upgrades. On execution, the result is a model that is once more effective for the duration it lasts.