Nine Laws of Data Mining, part 1

Tom Khabaza, one of the authors of the Clementine data mining workbench and of the CRISP-DM methodology for the data mining process, proposes and explains nine laws of data mining.

4th Law of Data Mining – “NFL-DM”:

 The right model for a given application can only be discovered by experiment, or “There is No Free Lunch for the Data Miner”

It is an axiom of machine learning that, if we knew enough about a problem space, we could choose or design an algorithm to find optimal solutions in that problem space with maximal efficiency.  Arguments for the superiority of one algorithm over others in data mining rest on the idea that data mining problem spaces have one particular set of properties, or that these properties can be discovered by analysis and built into the algorithm.  However, these views arise from the erroneous idea that, in data mining, the data miner formulates the problem and the algorithm finds the solution.  In fact, the data miner both formulates the problem and finds the solution – the algorithm is merely a tool which the data miner uses to assist with certain steps in this process.

There are 5 factors which contribute to the necessity for experiment in finding data mining solutions:

  1. If the problem space were well-understood, the data mining process would not be needed – data mining is the process of searching for as yet unknown connections.
  2. For a given application, there is not only one problem space; different models may be used to solve different parts of the problem, and the way in which the problem is decomposed is itself often the result of data mining and not known before the process begins.
  3. The data miner manipulates, or “shapes”, the problem space by data preparation, so that the grounds for evaluating a model are constantly shifting.
  4. There is no technical measure of value for a predictive model (see 8th law).
  5. The business objective itself undergoes revision and development during the data mining process, so that the appropriate data mining goals may change completely.

This last point, the ongoing development of business objectives during data mining, is implied by CRISP-DM but is often missed.  It is widely known that CRISP-DM is not a “waterfall” process in which each phase is completed before the next begins.  In fact, any CRISP-DM phase can continue throughout the project, and this is as true for Business Understanding as it is for any other phase.  The business objective is not simply given at the start; it evolves throughout the process.  This may be why some data miners are willing to start projects without a clear business objective: they know that business objectives are also a result of the process, and not a static given.
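Factor 3 in the list above, the shaping of the problem space by data preparation, can be made concrete with a small experiment. The sketch below is only an illustration under stated assumptions: the toy data, the two preparations ("raw" and "abs") and the two candidate models (a one-threshold rule and a majority-class baseline) are all hypothetical and do not come from the article. It scores every combination of preparation and model on held-out rows, which is the kind of experiment the law says cannot be skipped.

```python
# Hypothetical toy data: the label is 1 when x is far from zero.
ROWS = [(x, 1 if abs(x) >= 6 else 0) for x in range(-8, 9)]
TRAIN_ROWS = ROWS[0::2]      # every other row, used for fitting
HELD_OUT_ROWS = ROWS[1::2]   # the remaining rows, used for evaluation


def fit_stump(pairs):
    """One-threshold rule: predict 1 when sign * feature >= threshold."""
    best = None
    for sign in (1, -1):
        for threshold in sorted({sign * f for f, _ in pairs}):
            hits = sum((sign * f >= threshold) == bool(y) for f, y in pairs)
            if best is None or hits > best[0]:
                best = (hits, sign, threshold)
    _, sign, threshold = best
    return lambda f: int(sign * f >= threshold)


def fit_majority(pairs):
    """Baseline: always predict the most common training label."""
    ones = sum(y for _, y in pairs)
    guess = int(2 * ones >= len(pairs))
    return lambda f: guess


def accuracy(model, pairs):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model(f) == y for f, y in pairs) / len(pairs)


# Illustrative data preparations and candidate models (assumptions).
PREPARATIONS = {"raw": lambda x: x, "abs": abs}
MODELS = {"stump": fit_stump, "majority": fit_majority}


def run_experiment():
    """Score every (preparation, model) pair on the held-out rows."""
    results = {}
    for prep_name, prep in PREPARATIONS.items():
        prepared_train = [(prep(x), y) for x, y in TRAIN_ROWS]
        prepared_test = [(prep(x), y) for x, y in HELD_OUT_ROWS]
        for model_name, fit in MODELS.items():
            model = fit(prepared_train)
            results[(prep_name, model_name)] = accuracy(model, prepared_test)
    return results


if __name__ == "__main__":
    for key, score in sorted(run_experiment().items()):
        print(key, score)
```

On this toy data the threshold rule scores 0.875 on the raw feature and 1.0 once the feature is replaced by its absolute value, while the baseline stays at 0.75 under both preparations. The same algorithm is therefore being judged on different ground each time the preparation changes.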

Wolpert’s “No Free Lunch” (NFL) theorem, as applied to machine learning, states that no one bias (as embodied in an algorithm) will be better than any other when averaged across all possible problems (datasets).  This is because, if we consider all possible problems, their solutions are evenly distributed, so that an algorithm (or bias) which is advantageous for one subset will be disadvantageous for another.  This is strikingly similar to what all data miners know, that no one algorithm is the right choice for every problem.  Yet the problems or datasets tackled by data mining are anything but random, and most unlikely to be evenly distributed across the space of all possible problems – they represent a very biased sample, so why should the conclusions of NFL apply?  The answer relates to the factors given above: because problem spaces are initially unknown, because multiple problem spaces may relate to each data mining goal, because problem spaces may be manipulated by data preparation, because models cannot be evaluated by technical means, and because the business problem itself may evolve.  For all these reasons, data mining problem spaces are developed by the data mining process, and subject to constant change during the process, so that the conditions under which the algorithms operate mimic a random selection of datasets and Wolpert’s NFL theorem therefore applies. There is no free lunch for the data miner.
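The averaging argument behind NFL can be checked by brute force on a very small problem. The following Python sketch rests on assumptions that are not in the article: three binary input features, a fixed split into five training points and three held-out points, and two illustrative learners (a majority-class rule and a 1-nearest-neighbour rule). It enumerates every possible labelling of the eight inputs and averages each learner's accuracy on the held-out points only.

```python
from itertools import product

# All 8 possible inputs built from three binary features.
INPUTS = list(product([0, 1], repeat=3))
TRAIN = INPUTS[:5]   # points the learner is allowed to see
TEST = INPUTS[5:]    # held-out points (off-training-set)


def majority_learner(train_pairs):
    """Predict the most common training label everywhere (ties -> 1)."""
    ones = sum(label for _, label in train_pairs)
    guess = 1 if 2 * ones >= len(train_pairs) else 0
    return lambda x: guess


def nearest_neighbour_learner(train_pairs):
    """Predict the label of the closest training point (Hamming distance)."""
    def predict(x):
        def distance(pair):
            return sum(a != b for a, b in zip(pair[0], x))
        return min(train_pairs, key=distance)[1]
    return predict


def mean_off_training_accuracy(learner):
    """Average held-out accuracy over every possible labelling of INPUTS."""
    correct = 0
    trials = 0
    for labels in product([0, 1], repeat=len(INPUTS)):
        target = dict(zip(INPUTS, labels))
        model = learner([(x, target[x]) for x in TRAIN])
        correct += sum(model(x) == target[x] for x in TEST)
        trials += len(TEST)
    return correct / trials


if __name__ == "__main__":
    print(mean_off_training_accuracy(majority_learner))
    print(mean_off_training_accuracy(nearest_neighbour_learner))
```

Both learners average exactly 0.5 in this setting. For any fixed training labelling the learner's predictions are fixed, while the held-out labels range over every combination, so each prediction is right exactly half the time. Any difference between learners can only appear when the labellings are not weighted evenly, which is the situation the rest of the paragraph discusses.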

This describes the data mining process in general.  However, there may well be cases where the ground is already “well-trodden” – the business goals are stable, the data and its pre-processing are stable, an acceptable algorithm or algorithms and their role(s) in the solution have been discovered and settled upon.  In these situations, some of the properties of the generic data mining process are lessened.  Such stability is temporary, because both the relation of the data to the business (see 2nd law) and our understanding of the problem (see 9th law) will change.  However, as long as this stability lasts, the data miner’s lunch may be free, or at least relatively inexpensive.

Many thanks to Chris Thornton of Sussex University for his help in formulating NFL-DM.

Bio: Tom Khabaza helps organisations improve their marketing and customer processes, to improve their efficiency, risk analysis and fraud detection, and to improve their strategic decision-making, through new knowledge and predictive capabilities extracted from data. Tom has worked in the field of data mining for over 20 years, and is one of the authors of the world-leading Clementine data mining workbench, and of the CRISP-DM industry standard data mining methodology.

Original. Reposted by permission.

Here is a continuation: Nine Laws of Data Mining, part 2.