Date: Thu, 01 Apr 1999 16:12:24 +0200
From: Gregory Piatetsky-Shapiro, gps
Subject: Report on CRISP-DM -- proposed global standard for a Data Mining Process

CRISP-DM (CRoss-Industry Standard Process for Data Mining) is a project developing an industry-neutral and tool-neutral Data Mining process model. CRISP-DM is partially funded by the European Commission under the ESPRIT Program, and is sponsored by NCR, ISL (now part of SPSS), DaimlerChrysler, and OHRA, a Dutch insurance company. There are currently approximately 180 members of SIG CRISP worldwide. See www.crisp-dm.org for full information.

CRISP-DM held three workshops (Amsterdam, November 1997; London, May 1998; New York, September 1, 1998) and recently concluded the fourth workshop in Brussels, on March 18, 1999. All of the workshops received much attention, and each had more than 30 participants from diverse industry sectors and research institutes, representing the whole range from tool vendors to end users. The purpose of the workshops was to inform participants about progress in developing the standard process and to get feedback and suggestions for improving each draft, which was made publicly available before the workshop.

In Brussels, which I attended, there was overall acceptance of the CRISP-DM process, and all participants expressed their interest in pushing forward these efforts to define a standard process for data mining. The consortium members have developed a very impressive model and methodology for the data mining process.

At a high level, the process model has 6 phases (see http://www.ncr.dk/CRISP/process2.htm):

1) Business (or Problem) Understanding
This phase focuses on understanding the project objectives and requirements from a business perspective, and developing an initial technical problem definition and a project plan.

2) Data Understanding
This phase starts with initial data collection and initial data analysis, identifies data quality problems, and discovers first insights into the data.

3) Data Preparation
This phase builds a ready-to-model dataset. Subtasks include table, record, and attribute selection as well as transformation and cleaning of data.

4) Modeling
Various modeling techniques are selected, applied, and fine-tuned. Stepping back to the data preparation phase is often needed.

5) Evaluation
At this stage there are good models (from a technical point of view). Here we thoroughly evaluate the model and review the steps executed to construct it, to check that we did not miss an important business issue and that the model achieves the desired business objectives. At the end of this phase, a go/no-go decision on deployment is made.

6) Deployment
The deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. In many cases it will be the customer, not the data analyst, who carries out the deployment steps.

For each of these phases there is a list of tasks, e.g. Data Understanding consists of

2.1 Collect Initial Data
2.2 Describe Data
2.3 Explore Data
2.4 Verify Data Quality

with each task having a specific output (such as a report or another dataset).

This process model is both generic and designed to be customizable, e.g. it can naturally be specialized for specific tasks, such as CRISP-DM for classification and CRISP-DM for clustering; for specific business problems, e.g. CRISP-DM for attrition modeling; and even for specific business tasks, e.g. CRISP-DM for attrition modeling in telecommunications. Two of the project sponsors, OHRA (Netherlands) and DaimlerChrysler, successfully used the CRISP-DM model in developing specialized process models for their practice.

While the current model may still need some fine-tuning (e.g.
some suggested a separate phase for monitoring after deployment, and a privacy impact statement as part of phase 1), in my opinion it meets industry needs.

The advantages of having a standard industry model are many. It will make large data mining projects faster, cheaper, more reliable and more manageable. Even small-scale data mining investigations will benefit from using CRISP-DM. SPSS is planning to make CRISP-DM a part of Clementine, and some consulting companies (OpenMIND, Two Crows) have already started using the CRISP-DM model.

The funding for CRISP from the European Commission ends by April 30, 1999. Next steps include writing a book about it and promoting its use in the industry. There are also plans under consideration to form an industry consortium to promote CRISP-DM standards worldwide.

One specific suggestion I made is to develop standards for meta-data and CRISP-DM reports using XML. This would help to improve interoperability of different tools and stimulate innovation.

In the project meeting after the Brussels workshop it was decided to go forward in building a more "global" consortium, and both SPSS and NCR confirmed their intention to be part of the consortium (if they are not the only two members).

The next CRISP meeting is tentatively planned for San Diego, in conjunction with KDD-99. If you are interested in participating in setting the global standards for the data mining process, please contact

Thomas Reinartz, DaimlerChrysler, 89081 Ulm, Germany
email: thomas.reinartz@daimlerchrysler.com

You can also email the list crisp.sig@dbag.ulm.daimlerbenz.com for discussions.
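
As an illustration of the XML suggestion above, here is a minimal sketch in Python of how the six phases and the Data Understanding tasks named in this report might be written out as XML. The element and attribute names (process, phase, task, id, name) and the function names are hypothetical, chosen only for this example; the report does not define any CRISP-DM XML schema. Tasks for the other phases are left empty because the report does not list them.

```python
import xml.etree.ElementTree as ET

# Phase and task names come from the report. Only the Data Understanding
# tasks are listed there, so the other phases carry no tasks in this sketch.
PHASES = [
    ("Business Understanding", []),
    ("Data Understanding", [
        "Collect Initial Data",
        "Describe Data",
        "Explore Data",
        "Verify Data Quality",
    ]),
    ("Data Preparation", []),
    ("Modeling", []),
    ("Evaluation", []),
    ("Deployment", []),
]


def build_process_xml(phases):
    """Build an XML element tree for the phases and their tasks.

    Element and attribute names are hypothetical, for illustration only.
    """
    root = ET.Element("process", name="CRISP-DM")
    for phase_no, (phase_name, tasks) in enumerate(phases, start=1):
        phase = ET.SubElement(root, "phase", id=str(phase_no), name=phase_name)
        for task_no, task_name in enumerate(tasks, start=1):
            # Task ids follow the "2.1", "2.2", ... numbering used in the report.
            ET.SubElement(phase, "task", id=f"{phase_no}.{task_no}", name=task_name)
    return root


def to_xml_string(phases):
    """Serialize the process description to an XML string."""
    return ET.tostring(build_process_xml(phases), encoding="unicode")


if __name__ == "__main__":
    print(to_xml_string(PHASES))
```

A tool that reads such a file could recover the phase and task structure without knowing which product wrote it, which is the kind of interoperability the suggestion aims at.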
Copyright © 1999 KDnuggets