I started looking at ways of automatically discovering such patterns and attended Gio Wiederhold's tutorial in 1987 at the Int. Conf. on Database Engineering in LA, entitled "Extracting Knowledge From Data". Gio and his student R. Blum [Blum 1981] had developed Rx, the first program that analyzed historical data from about 50,000 Stanford patients and looked for unexpected side-effects of drugs. The program did discover some side-effects that were unknown to its authors, and the approach looked very promising.
However, I could not quite convince GTE management that discovery in data was a good idea. One senior manager told me that he thought that data mining was a solved problem -- I could apply a decision tree (and building a decision tree was a solved problem, wasn't it?) to the database and presto -- I would have all the results I needed.
Later, I attended an AAAI-88 workshop in Minneapolis on "Databases and Expert Systems". The workshop had interesting presentations and was a good way to get researchers to interact. Putting together a workshop seemed relatively easy (little did I know), and I decided to organize a workshop on discovery in data at the next year's IJCAI-89. That would be a great way to stimulate more research in the field and to convince my management at GTE Laboratories that discovery in data was a good idea.
What should I call this workshop? The name "data mining", which was already used in the database community, seemed unsexy, and besides, statisticians used "data mining" as a pejorative term to criticize the activity. "Mining" is unglamorous, and there is no indication of what we are mining for.
"Knowledge mining" and "knowledge extraction" did not seem much better, and "database mining™" was trademarked by HNC for their Database Mining Workstation™. So, I came up with "Knowledge Discovery in Databases", which emphasized the "discovery" aspect and the focus of discovery on "knowledge".
With encouragement and help from Jaime Carbonell (CMU), Bud Frawley (GTE), Kamran Parsaye (IntelligenceWare), Ross Quinlan (U. of Sydney), Michael Siegel (BU), and Sam Uthurusamy (GM Research), I put together a Knowledge Discovery in Databases (KDD-89) workshop at IJCAI-89 in Detroit.
The term "Knowledge Discovery in Databases" (KDD for short) became popular in the AI and Machine Learning community. However, the database researchers were on better speaking terms with the business folks and the press, and the term "data mining" became much more popular in the press. As of Nov 1999, a search on www.altavista.com gives about 100,000 pages for "data mining", compared to 18,000 for "knowledge discovery". Currently, both terms are used essentially as synonyms, as in the name of the main journal for the field -- "Data Mining and Knowledge Discovery" (Kluwer). Sometimes "knowledge discovery process" is used to describe the overall process, including all the data preparation and postprocessing, while "data mining" refers to the step of applying the algorithms to the clean data (Fayyad, Piatetsky-Shapiro, and Smyth, 1996).
KDD-89 had 9 papers presented in 3 sessions, on Data-Driven Discovery, Knowledge-Based Approaches, and Systems and Applications, and concluded with a summary panel discussion by Larry Kerschberg, Ross Quinlan, and Pat Langley.
The main topics discussed at the KDD-89 workshop included:
Some important areas turned out to be much harder than we thought in 1989.
Learning from structured data is still very difficult, and the current best methods from the Inductive Logic Programming community (http://www.cs.bris.ac.uk/~ILPnet2/) are still too slow to be used on large databases.
Interestingness of discovered patterns is still a hard problem, and it still requires a significant amount of domain knowledge. CYC (Lenat 1995), which held a lot of promise in 1989, did not produce the expected results. On the other hand, we now have the web, which is the largest repository of general knowledge, although still with a very imperfect query system.
Privacy, which was a concern 10 years ago, especially the proper balance between companies' desire to use personal information and individuals' desire to protect it, remains a thorny issue. Recent initiatives like CPEX, which is an XML-based standard to enable a simultaneous customer view within multiple enterprise applications, may solve the technical issues of exchanging personal information. However, they do not solve the fundamental privacy concerns of individuals. In my opinion, the way to solve these concerns is to
Another major advance was a holistic understanding of the entire Knowledge Discovery Process (Brachman & Anand, 1996), which encompasses many steps, from data acquisition, cleaning, and preprocessing, to the discovery step, to postprocessing of the results and their integration into operational systems.
Good progress was also achieved in Ensemble Classifiers (Boosting, Bagging); Association Rules; OLAP; and Data Visualization.
The second-generation data mining systems, called suites, were developed by data mining vendors starting around 1995. These tools were driven by the realization that the knowledge discovery process requires multiple types of data analysis, and that most of the effort is spent in data cleaning and preprocessing. Suites such as SPSS Clementine, SGI Mineset, IBM Intelligent Miner, or SAS Enterprise Miner allowed the user to perform several discovery tasks (usually classification, clustering, and visualization) and also supported data transformation and visualization. An important advance, pioneered by Clementine, was a GUI which allowed users to build their knowledge discovery process visually.
By 1999, there are over 200 tools available for many different tasks (see
However, even the best data mining tools addressed only a part of the overall business problem. Data still had to be extracted from legacy databases, cleaned, and preprocessed, and model results had to be delivered to the right channels and, most importantly, integrated with the specific application or business logic. Successful development of such applications in areas like direct marketing, telecom, and fraud detection led to the emergence of data-mining-based "vertical solutions".
Examples of such systems include HNC Falcon for credit card fraud detection, IBM Advanced Scout for basketball game analysis, and NASD KDD Detection system (Kirkland 1999).
In 1989, a really large database was 1 MB. Today, we have multi-terabyte databases that are being mined.
In 1989 there were a handful of companies providing data mining tools. In 1999 there are over 100 companies.
Other interesting trends can be observed by analyzing the subscribers to the KDnuggets newsletter (see www.kdnuggets.com/news/), a popular moderated newsletter on Data Mining and Knowledge Discovery topics that I publish. The subscriber base grew from 50 people who received the first issue in 1993 to about 8000 as of October 1999. The KDnuggets subscriber list is still growing at 7% a quarter, but that is much slower than the 40% a quarter of 1994 or the 20% a quarter of 1997. While researchers made up the majority of the subscribers in the first few years, now the majority is from commercial domains. About half of the subscribers are from .com and .net domains, but there are significant groups of data miners in Western Europe (especially UK, Germany, and France) and the Pacific Rim (especially Australia, Japan, and Singapore). Surprisingly, there are pockets of subscribers in over 80 countries, and on all continents except Antarctica.
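To put the quarterly growth figures above on a yearly footing, here is a small sketch in Python. It assumes, purely for illustration, that each quoted quarterly rate held steady for four consecutive quarters and compounded; the function name `annualized` is hypothetical and not part of any KDnuggets data or tool.

```python
def annualized(quarterly_rate: float) -> float:
    """Compound a per-quarter growth rate over four quarters.

    Assumption: the rate is constant for the whole year.
    """
    return (1 + quarterly_rate) ** 4 - 1


# Quarterly growth rates quoted in the text for each year.
quoted_rates = [(1994, 0.40), (1997, 0.20), (1999, 0.07)]

for year, rate in quoted_rates:
    print(f"{year}: {rate:.0%} per quarter is about {annualized(rate):.0%} per year")
```

Under that assumption, 40% a quarter compounds to roughly 284% a year, 20% a quarter to roughly 107%, and 7% a quarter to roughly 31%, so the list is still growing by about a third each year.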
I expect standards to appear for different parts of the knowledge discovery process and to greatly facilitate industry growth. Already we have proposed standards like CRISP for the data mining process, PMML for predictive model exchange, and Microsoft OLE DB.
Significant applications will appear in E-commerce, especially with real-time personalization. There will be significant use of intelligent agents.
I also expect great progress in pharmaceuticals and new drugs enabled by knowledge discovery and bioinformatics.
I think there will be tighter integration of knowledge discovery modules with a database system, and most database systems will include a set of discovery operations.
I also expect that the data mining industry will overcome the hype stage and will merge with the database industry.
Brachman, R. and T. Anand. The Process of Knowledge Discovery in Databases: A Human-Centered Approach. In "Advances in Knowledge Discovery and Data Mining", ed. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, AAAI/MIT Press 1996.
Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth. From Data Mining to Knowledge Discovery in Databases (a survey), AI Magazine, 17(3): Fall 1996, 37-54
Inselberg, A. (1985). The plane with parallel coordinates. The Visual Computer, 1, 69-91.
Kirkland, J. et al. The NASD Regulation Advanced-Detection System (ADS), AI Magazine 20(1): Spring 1999, 55-67.
Lenat, D. B. "Cyc: A Large-Scale Investment in Knowledge Infrastructure." Communications of the ACM 38, no. 11 (November 1995).
J. Ross Quinlan: Induction of Decision Trees. Machine Learning, Volume 1, 1986.
Copyright © 2002 KDnuggets.