KDnuggets : News : 2001 : n26 : item3

## News

From: Bob Nisbet
Date: Tue, 11 Dec 2001 19:12:42 -0800
Subject: RE: Statisticians vs Data Miners

I would like to throw in my two cents' worth on this subject.

I am a scientist and a data miner. My thesis here is that parametric (and non-parametric) statistical tools are needed IN ADDITION to machine-learning data mining tools to find the faint patterns of relationship in large, noisy databases. Ninety percent of the data mining job may be analyzed best with statistical tools. However, the final modeling (in my opinion) can be done far better with machine-learning tools. This conclusion is drawn from my seven years of experience in data mining analyses of business databases, and there are good reasons why this is so.

The discipline of parametric statistical analysis was "invented" largely by Sir R. A. Fisher following his landmark paper in 1921. It is interesting to note that Fisher started out in the paper with the definition of probability as the ratio of the intrinsic probability of an event's occurrence to the total probability of all events' occurrences. This is the classical definition that flowed out of the probability theory of the 19th century. However, by the end of his paper, he defines "probability" as just the intrinsic probability of an event's occurrence, period. He referred to this redefined entity as "likelihood", to distinguish it from classical probability. However, in later papers, he and his followers reverted to the term "probability". So, there are actually two general theories of probability in the statistical literature. Fisher's statistical methods proceed from that point. Unfortunately, in order to make his methods work in a world of noisy non-linearity and factor interactions, Fisher and his followers had to make a number of assumptions and add several compensating terms to their analysis (to account somewhat for non-linearity and total factor interaction).

Among the assumptions he made were:
1. The data distribution is known, and most analyses assume a normal distribution.
2. The X variables have a linear, additive effect on the predicted variable (Y).
3. All the variables are independent of each other in their effect on the predicted (dependent) variable. This means that there is no significant collinearity among the variables (no two predictors strongly related to each other).
4. There is no significant heteroscedasticity (i.e., the variance is roughly constant throughout the range of a variable).
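These assumptions can be probed directly before any modeling is attempted. A minimal sketch (synthetic data, invented purely for illustration) using NumPy and SciPy to flag violations of assumptions 1, 3, and 4:

```python
import numpy as np
from scipy import stats

# Synthetic data constructed to violate the assumptions above (illustration only).
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)               # collinear with x1
y = np.exp(0.5 * x1) * (1 + 0.2 * rng.normal(size=n))  # non-linear, noisy response

# Assumption 1 (normality): Shapiro-Wilk rejects a normal distribution for y.
_, p_normal = stats.shapiro(y)
print("normality p-value:", p_normal)

# Assumption 3 (no collinearity): the two predictors are strongly correlated.
r, _ = stats.pearsonr(x1, x2)
print("corr(x1, x2):", r)

# Assumption 4 (homoscedasticity): Levene's test compares the response
# variance between the low-x1 and high-x1 halves and rejects equality.
_, p_var = stats.levene(y[x1 < 0], y[x1 >= 0])
print("equal-variance p-value:", p_var)
```

When any of these checks fails, the parametric machinery is being applied outside its stated assumptions, which is precisely the situation in most large business databases.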

But what if the variables are highly non-linear? What if they are multiplicative rather than additive? They are in forest modeling analyses, for example (more about that below).
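When the effects really are multiplicative, the classical remedy is a log transform, which turns the multiplicative model back into the additive, linear form that ordinary least squares assumes. A sketch with invented coefficients:

```python
import numpy as np

# Multiplicative model (coefficients invented for illustration):
#   y = a * x1^b1 * x2^b2 * noise
# Taking logs makes it linear and additive:
#   log y = log a + b1*log x1 + b2*log x2 + log noise
rng = np.random.default_rng(1)
n = 300
x1 = rng.uniform(1.0, 10.0, n)
x2 = rng.uniform(1.0, 10.0, n)
y = 2.0 * x1**1.5 * x2**0.5 * np.exp(0.05 * rng.normal(size=n))

# Ordinary least squares on the log-transformed data recovers the exponents.
X = np.column_stack([np.ones(n), np.log(x1), np.log(x2)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
print("intercept, b1, b2:", coef)  # close to log 2, 1.5, 0.5
```

The transform rescues the linear machinery only when the analyst already knows the relationship is multiplicative; with unknown functional forms, that knowledge is exactly what is missing.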

Fisher designed his statistical tools for use in the medical world to permit different researchers to analyze the same data and get the same results. Previous (Bayesian) statistical methods with their subjective "priors" did not lend themselves well to that end. To make these methods work, scientists had to do controlled experiments, holding all variables constant and varying the treatment of one variable at a time. Results were compared to a "control" group with no treatments. Laboratory conditions of temperature, light, moisture, etc. often had to be held constant, because the physics of variable response might be affected by the environment. These highly-controlled conditions are almost never found outside of a laboratory, but business analysts began to use these methods anyway.

This approach to "truth" (epistemology) in science is a direct application of Aristotelian logic. Aristotle taught that nature could be understood by breaking down natural systems into pieces, analyzing the pieces, then putting them back together again to understand the whole of the system. This approach flowed out of the Enlightenment that ended over 2,000 years of dominance of Platonic philosophy in the Western world. Newton's physics and the Industrial Revolution that followed picked up on Aristotle again and promulgated the view of the "world-as-machine". The paragon of industrial efficiency became defined in terms of
- a well-oiled machine
- firing on all cylinders
- etc.

Scientists also picked up on Aristotle in their study of nature. This view of the world worked well enough within the range of Newton's instruments, but then we learned so much about these systems that our expression of them became extremely complex. Then Newton's world began to fall to pieces. Einstein's Theory of Relativity and quantum physics showed us that the "real" world is vastly more complex than we thought. Not only that, but we learned that it is more complex than we CAN think! In the science of ecology (my academic discipline), we learned that a tropical rainforest may not regenerate after clearing on the periphery of its range. This is because the very factors necessary for its survival in these areas of relative stress are maintained by the forest itself! These are systems-level processes that can only be seen when the system is complete and functioning as a whole. Plato would have said, "See, I told you so."

Scientists and business data analysts have learned during the last 20 years that we must return to Plato and begin to view the world also from the top-down perspective, alongside the tangible facts that occupied so much of Aristotle's attention. For Plato, the "whole" was greater than the sum of its parts. Only by using the approaches of Plato and Aristotle together can we begin to understand how complex systems work. Instead of viewing the world-as-machine, this is a view of the world-as-organism!

This is what is done today in ecosystem modeling (my academic specialty), and this is what is done in data mining for CRM customer behavior analyses, my current specialty at Torrent Systems (now part of Ascential Software). Machine-learning algorithms are ideally suited for the analysis of complex business data sets with highly non-linear, interacting variables that follow very non-normal distributions. Statistical tools are very valuable in the data discovery, preliminary analysis, and data preparation phases of data mining projects, but true predictive data mining is done (and even is defined today) in terms of machine-learning methods.
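The point can be made concrete with a toy example (synthetic data, invented for illustration; assumes scikit-learn is available): a response driven purely by the *interaction* of two variables is nearly invisible to a linear, additive model, while a shallow decision tree captures it almost exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Pure two-way interaction: the response depends on the sign of the
# PRODUCT of the inputs, not on either input alone.
rng = np.random.default_rng(2)
n = 1000
x1 = rng.uniform(-1.0, 1.0, n)
x2 = rng.uniform(-0.5, 1.0, n)
X = np.column_stack([x1, x2])
y = np.sign(x1 * x2)

# A linear, additive model explains little of the variance ...
lin_r2 = LinearRegression().fit(X, y).score(X, y)
# ... while a shallow decision tree carves the plane into the four
# quadrants and recovers the interaction almost exactly.
tree_r2 = DecisionTreeRegressor(max_depth=3).fit(X, y).score(X, y)
print("linear R^2:", round(lin_r2, 3))
print("tree R^2:  ", round(tree_r2, 3))
```

No assumption of normality, linearity, or additivity is needed by the tree; it learns the structure of the interaction directly from the data.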

I rest my case...

Bob Nisbet
Analytical Scientist
Torrent Systems (now Ascential Software)
