KDnuggets : News : 2004 : n08 : item18

Publications


Subject: Dorian Pyle's nine rules not to follow

This Way Failure Lies -- By Dorian Pyle -- DB2 Magazine

Nine simple rules you won't want to follow.

Not all mining projects are successful. This pronouncement may come as a surprise (but it probably won't). Some mining projects are successful, sometimes spectacularly so. The fact that this second pronouncement may come as a surprise to many readers is unfortunate.

Although there are many paths to data mining success, the paths to failure are followed all too often. Data miners heading for failure seem to follow rules, or worst practices, just as those seeking success try to follow best practices.

Certain Disaster

Here is a selection of worst practices I've encountered much too often over the past year. These rules are in no particular order. To court failure, simply make a selection from this list.

Rule 1. Jump right in. To guarantee failure from the beginning, simply take the data on hand and start applying whatever tools are available. Don't stop to consider how answers to this sort of business problem could best be discovered. Most particularly, don't consider what the business users need from the process. Instead, focus on the data that's most immediately available and the tools you're most familiar with. That way, you'll produce results of the sort that you are most comfortable with producing. And don't stop to consider how the result has to be applied in the real world. A successful project produces results that the business can actually apply. If you're aiming for failure, you don't need to consider how the results will be applied until your results have already been generated and delivered.

Rule 2. Frame the problem in terms of the data. Don't consider alternative solutions to the business situation. Because the problem has been passed to a data miner, the best way to solve the problem is through data analysis. Don't consider whether other decision-making or data-gathering efforts would be appropriate, or whether some other method of dealing with the business problem would help. Don't consider the other challenges faced by the company or the specifics of the industry the company competes in. Consider only what the data set as it currently exists has to reveal. Whatever the data can be persuaded to reveal, recast the business objective in terms that the data addresses.

Rule 3. Focus only on the most obvious way to frame the problem. Don't waste time trying to explore or reconfigure the data. The best results are to be had by concentrating on the statistical and technical criteria provided by the tool of choice. Those criteria are technically exact measures that can easily justify the quality of the final mined models that are discovered. Concentrate on improving the technical merits of the model until it reaches the highest degree of technical perfection.

Rule 4. Rely on your own judgment. An experienced miner has the best judgment as to what should and should not be included in the model. Because the data contains all the necessary information, mining's job is only to extract and reveal the relationships that it contains. Input from others, particularly from the business managers, is likely to be distracting and should be ignored, discounted, or at least recast into terms that the data actually does address. Remember that the miner probably does know best, and that, if properly and correctly applied, the tools will reveal all.

Rule 5. Find the best algorithms. Mining data is fundamentally about algorithms. For any data set, some particular algorithm will produce the best model. In order to discover the best model, it's very important to use the appropriate algorithm. In fact, fitting an appropriate algorithm (mining) to the data is what data mining is all about.

Rule 6. Rely on memory. Most data mining projects are simple enough that you can hold most important details in your head. There's no need to waste time in documenting the steps you take. By far, the best approach is to keep pressing the investigation forward as fast as possible. Should it be necessary to duplicate the investigation or, in the unlikely event, to justify the results at some future time, recreating the original investigation and the line of reasoning you used will be easy and straightforward.

Rule 7. Intuition is more important than standard practice. All data is different, and all data sets are unique, so each should be approached as an individual and unique entity. Standard practices are really only guides for those who aren't experienced and, therefore, haven't developed the intuition necessary to work with each data set as a unique experience.

Rule 8. Minimize interaction between miners and business managers. A skilled miner can easily explore a data set and discover all the interesting, insightful, and useful relationships without any interaction, guidance, or involvement from the business managers. In fact, it's the primary business task for a data miner to take a data set, analyze it, and return a comprehensive report that is insightful and relevant, relying exclusively on the information contained in the data. Data mining tools are so powerful at discovering relationships that a miner's main job is simply to report the deep insights the tools discover.

Rule 9. Minimize data preparation. The most important and interesting part of data mining is creating models using state-of-the-art mining algorithms. Tools are quite capable of preparing any data set automatically, so that it will reveal all the significant, interesting, and relevant relationships contained therein. Preparing data is not only boring and tedious, it also takes a long time. It slows the mining process to a crawl. It's best by far to do as little preparation as possible and get right to the modeling part.

Here is the rest of the story.



Copyright © 2004 KDnuggets.