KDD Nuggets Vol 1, No. 7 -- November 23, 1993

Contents:
* Gregory Piatetsky-Shapiro, Chris Matheus, Padhraic Smyth, Ramasamy Uthurusamy: KDD-93 workshop report (for AI Magazine)

coming soon -- KDD-94 Call for Papers

The KDD Nuggets is an informal list for the dissemination of information relevant to Knowledge Discovery in Databases (KDD), such as announcements of conferences/workshops, tool reviews, publication reviews, application descriptions, and other relevant ideas. E-mail contributions to kdd@gte.com. Requests to be added/deleted to kdd-request@gte.com.

-- Gregory Piatetsky-Shapiro

--------------------------------------------------

\documentstyle {article}
\oddsidemargin 0.125in
\textwidth 6.5in
\evensidemargin 0pt
\topmargin -0.5in
\headheight 11pt
\headsep 16pt
\textheight 9.0in
\parskip 5pt

\begin{document}

\title{KDD-93: Progress and Challenges in \\ Knowledge Discovery in Databases}

\author{Gregory Piatetsky-Shapiro (GTE Laboratories)\thanks{The authors can be reached at gps@gte.com, matheus@gte.com, pjs@galway.jpl.nasa.gov, and samy@gmr.com, respectively}, Christopher Matheus (GTE Laboratories), \\ Padhraic Smyth (JPL), Ramasamy Uthurusamy (GM Research Laboratories)}

\date{November 16, 1993}

\maketitle

Interest in Knowledge Discovery in Databases (KDD) continues to increase, driven by the rapid growth in the number and size of large databases and the applications-driven demand to make sense of them. The research side of KDD is of growing interest to researchers in machine learning, statistics, intelligent databases, and knowledge acquisition, as evidenced by the number of recent workshops (Piatetsky-Shapiro 1991a, 1991b, Zytkow 1992, Ziarko 1993) and special journal issues (Piatetsky-Shapiro 1992, Zytkow 1993, Cercone 1993) devoted to or closely related to discovery in databases. The application side is of interest to any business or organization with large databases. KDD applications have been reported in many areas of business, government, and science (Piatetsky-Shapiro and Frawley 1991, Inmon and Osterfelt 1991, Parsaye and Chignell 1993).

The notion of discovery in databases has been given various names, including knowledge extraction, data mining, database exploration, data pattern processing, data archaeology, information harvesting, siftware, and even (when done poorly) data dredging. Whatever the name, the essence of KDD is the {\em nontrivial extraction of implicit, previously unknown, and potentially useful information from data} (Frawley et al. 1992). KDD encompasses a number of different technical approaches, such as clustering, data summarization, learning classification rules, finding dependency networks, analyzing changes, and detecting anomalies (see Matheus et al. 1993).

Over 60 researchers from 10 countries took part in the third KDD workshop (Piatetsky-Shapiro 1993), held during AAAI-93 in Washington, D.C., in the simmering July heat. A major trend evident at the workshop was the transition to applications in the core KDD area of discovery of relatively simple patterns in relational databases; the most successful applications are appearing in the areas of greatest need, where the databases are so large that manual analysis is impossible. Progress was also facilitated by the availability of commercial KDD tools, both for generic discovery and for domain-specific applications such as marketing. At the same time, progress is slowed by problems such as lack of statistical rigor, overabundance of patterns, and poor integration.
Besides applications, the main themes of this workshop were the Discovery of Dependencies and Models and Integrated and Interactive KDD Systems.

\ \\ {\Large Real-World Applications}

The applications presented at the workshop fell into three broad application areas: scientific, financial, and manufacturing. Most of the systems performed some form of classification, while two systems dealt with detecting and describing changes. In addition to talks and poster presentations, several demonstrations of research and commercial discovery systems were given at the workshop.

\ \\ {\bf Scientific applications}: Two applications were presented in the area of astronomy. Usama Fayyad (JPL) started the workshop with a talk on the Sky Image Cataloging and Analysis Tool (SKICAT), an automated system for analyzing large-scale sky surveys. The multi-terabyte size of the database ruled out a manual approach to image classification. Using a number of innovative machine learning methods, Usama and his colleagues were able to recognize objects at least one magnitude fainter than was previously possible, while achieving an accuracy of about 94\%. This work is noteworthy as a real application of machine learning to a difficult problem, with results that are being used by scientists on a daily basis.

Padhraic Smyth, also from JPL, gave a related talk on the problem of Image Database Exploration, describing collaborative work with Usama Fayyad. Padhraic described challenging issues in image analysis, such as how to measure the right attributes, the role of prior knowledge, incremental learning, and the use of multi-sensor data. He also examined how these issues are handled in current JPL tasks such as the analysis of Venus images obtained by the Magellan spacecraft.

\ \\ {\bf Financial applications}: Two systems were presented for detecting and describing changes in large business databases. Tej Anand described A.C. Nielsen's recent work on a commercial product called Opportunity Explorer. This system is a redesign and extension of their Spotlight (Anand and Kahn 1992) product for identifying and reporting on trends and exceptional events in extremely large supermarket sales databases. An innovative feature of Spotlight is the automatic explanation of relationships between key events. Opportunity Explorer is a more general tool for developing interactive, hypertextual reports using knowledge discovery templates that convert a large data space into concise, inter-linked information frames. It is marketed to help sales analysts and product managers of consumer packaged goods companies develop better sales strategies.

Christopher Matheus and Gregory Piatetsky-Shapiro of GTE Laboratories presented their Key Findings Reporter (KEFIR), a system for discovering and explaining ``key findings'' in large relational databases. While the system's design is domain independent, the current focus is on trend and normative analysis of health-care information. KEFIR performs an automatic drill-down on the data along multiple dimensions to determine the most interesting deviations of specific quantitative measures relative to norms or previous values. It then identifies explanatory relationships between findings and generates a report using natural-language templates and graphics. A prototype of KEFIR has been implemented in C and tcl with an embedded SQL interface.

Three other financial applications used classification methods in the areas of insurance, marketing, and stock market analysis.
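The following minimal sketch illustrates the kind of deviation-based drill-down described above for KEFIR. It is an illustration of the idea only, not the KEFIR implementation; the dimensions, norms, and records are hypothetical, and Python is used purely for concreteness.

\begin{verbatim}
# Illustrative drill-down deviation analysis: for each value of a dimension,
# compare the observed average of a measure against a norm and report the
# largest relative deviations.  All field names and data are hypothetical.
from statistics import mean

claims = [  # hypothetical health-care records: (region, age_group, cost)
    ("east", "young", 110), ("east", "old", 340), ("east", "old", 420),
    ("west", "young", 100), ("west", "old", 260), ("west", "young", 95),
]
norms = {"young": 100.0, "old": 300.0}     # expected cost per age group

def deviations(records, dimension):
    """Relative deviation of the average cost from the norm, per value
    of the given dimension (0 = region, 1 = age group)."""
    result = []
    for value in sorted({r[dimension] for r in records}):
        subset = [r for r in records if r[dimension] == value]
        observed = mean(cost for (_, age, cost) in subset)
        expected = mean(norms[age] for (_, age, _) in subset)
        result.append((value, (observed - expected) / expected))
    return result

# Drill down: rank regions by deviation, then drill into the most
# deviant region along the age-group dimension.
by_region = sorted(deviations(claims, 0), key=lambda d: abs(d[1]), reverse=True)
top_region, dev = by_region[0]
print("most deviant region:", top_region, round(dev, 2))
for value, d in deviations([r for r in claims if r[0] == top_region], 1):
    print("  within", top_region, "-- age group", value, round(d, 2))
\end{verbatim}

The sketch covers only the deviation-detection step; KEFIR additionally explains and ranks such findings and renders them with natural-language templates and graphics.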
John Major (Travelers) analyzed the important problem of selecting the most interesting rules among those discovered in data. He presented a rule refinement strategy that defined rule ``interestingness'' via rule accuracy, coverage, simplicity, novelty, and significance. His method gave preference to rules not dominated in these measures by other rules, and removed those that were potentially redundant. In an application of the method to a tropical storm database, the system reduced 161 rules generated by IXL (a product of IntelligenceWare, Inc.) to the 10 most interesting ones, which were meaningful to a meteorologist.

Wojtek Ziarko (U. of Regina, Canada) presented an application of Reduct Systems' Datalogic/R discovery tool to identify strong predictive rules in stock market data. Monthly data collected over a ten-year period was analyzed to identify dominant relationships among fluctuations of market indicators and stock prices. Evaluation of the results (including both precise and imprecise, strong and weak rules) by a domain expert revealed that the strong rules confirmed the expert's experience, while the weak rules were difficult to interpret. Datalogic/R, a commercially available tool which derives rules using the variable precision rough sets approach, was also demonstrated at the workshop.

Piero Bonissone and Lisa Rau of GE R\&D presented preliminary results of applying decision trees and logistic regression to a database of accounting, customer, and sales information. Initial results, which suggest emerging markets and provide feedback on sales performance, are encouraging enough to warrant further pursuit of this work.

\ \\ {\bf Manufacturing applications}: Two applications dealt with semiconductor manufacturing and software engineering. Sharad Saxena of Texas Instruments presented his approach to fault isolation during semiconductor manufacturing using automated discovery from wafer tracking databases. These databases contain the history of semiconductor wafers as they undergo various processing steps. A generate-and-test approach is taken to using such databases for automated diagnosis. Based on prior manual analysis of such databases, classes of database queries, as well as patterns in their responses that are useful for fault isolation, are identified. Diagnosis is accomplished by automating the query generation and the detection of potentially useful patterns. A prototype system was implemented and tested on real data, finding both known and previously unknown faults.

Inderpal Bhandari of IBM presented Attribute Focusing, a method for exploratory analysis of attribute-valued data intended for use by domain experts who do not have a background in data analysis. The approach uses a model of interestingness, based on the magnitude of data values, the association of data values, and basic knowledge of the limits of human information-processing capabilities, as well as a model of interpretation, to guide the domain specialist in discovering knowledge from attribute-valued data. This approach has been used successfully by software managers, developers, and testers at IBM to make real-time improvements to their products, as well as to their production process. The Attribute Focusing approach is being used in several IBM laboratories, with reported net savings of hundreds of person-days. A PC-based implementation of Attribute Focusing was demonstrated by Bhandari and Michael Herman of IBM.
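To make the dominance criterion in Major's rule refinement strategy concrete, the following minimal sketch keeps only those rules that no other rule beats on every interestingness measure (Pareto filtering). The measures and scores are hypothetical, and the code is an illustration in Python rather than the system presented at the workshop.

\begin{verbatim}
# Illustrative Pareto filtering of rules by interestingness measures.
# A rule is kept unless some other rule is at least as good on every
# measure and strictly better on at least one.

def dominates(a, b, measures):
    """True if rule a dominates rule b on the given measures."""
    return (all(a[m] >= b[m] for m in measures) and
            any(a[m] > b[m] for m in measures))

def undominated(rules, measures):
    """Return the rules not dominated by any other rule."""
    return [r for r in rules
            if not any(dominates(s, r, measures) for s in rules if s is not r)]

if __name__ == "__main__":
    measures = ["accuracy", "coverage", "simplicity"]
    rules = [
        {"id": 1, "accuracy": 0.90, "coverage": 0.20, "simplicity": 0.8},
        {"id": 2, "accuracy": 0.85, "coverage": 0.15, "simplicity": 0.7},
        {"id": 3, "accuracy": 0.70, "coverage": 0.60, "simplicity": 0.9},
    ]
    for r in undominated(rules, measures):
        print(r["id"])   # prints 1 and 3; rule 2 is dominated by rule 1
\end{verbatim}

Major's refinement strategy applies this kind of non-dominance test over the full set of measures before removing potentially redundant rules.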
\ \\ {\Large Discovery of Dependencies and Models}

The second major theme of the workshop was discovery of dependencies and models. The workshop provided clear evidence of the diversity of technical approaches being applied to the general KDD problem. The focus was on the use of particular mathematical and statistical methods for the induction of qualitative relationships directly from data.

Jan Zytkow (Wichita State U.) outlined the latest developments in his joint research with Robert Zembowicz on deriving equations from data. He proposed a computationally simple test for the absence of a functional dependency, which can eliminate the much more expensive search to determine the form of the dependency. The test relies on a search for a discretization of the data into optimal intervals. Initial experiments with their 49er system showed that the test significantly reduces the computation time, while losing only a few actual equations, typically those with a particularly poor fit to the data.

Dependency networks are an important form of discovered knowledge, and recent progress in this field (Pearl 1992, Spirtes et al. 1993) is very encouraging for KDD. Probabilistic networks are a powerful knowledge representation medium, providing a bridge between the power of explicit knowledge representation in graphical form and more subtle (but robust) quantitative statistical methods. Greg Cooper (U. of Pittsburgh) presented the latest results in his research on the use of Bayesian statistical methods for learning causal probabilistic network models that contain hidden variables. In earlier work, Cooper had demonstrated that networks with hidden variables can be directly inferred from data. In this talk, he showed how to structure the calculations to dramatically speed up the computation. Cooper also summarized recent research progress relevant to the discovery of directed probabilistic networks from data: there is a greater understanding of what relationships can be captured from data by directed acyclic graphs (DAGs) and which DAGs are indistinguishable based only on data; new methods have been developed for the discovery of probabilistic networks with measured and possibly unmeasured (latent) variables; and these methods have been applied to real data with promising results. The major improvements needed for applications to real databases are computational (search) efficiency; integration of different methods, especially those dealing with discrete and continuous variables; and estimation of the confidence and stability of the output.

Sa\v so D\v zeroski (Jo\v zef Stefan Institute, Ljubljana, Slovenia) gave an invited overview of Inductive Logic Programming (ILP) methods for KDD. ILP is an important paradigm that goes beyond the typical attribute-value relations (which are the limit of what can be learned by most current machine learning methods) to the more general language of first-order relations. The field has developed rapidly in recent years (Muggleton 1992), and now boasts relatively sophisticated algorithms and methods for handling a variety of problems, with great potential for KDD applications (Lavra\v c and D\v zeroski 1993). D\v zeroski outlined the motivation for ILP and proceeded in his talk from early work through more recent extensions and up to successful applications.
He described a particularly successful experiment in the prediction of protein secondary structure, where the ILP method was not only better in predictive accuracy than alternative published methods but, perhaps more significantly, also yielded new domain knowledge. Still, much work remains to be done in handling noisy probabilistic concepts and especially in dealing with very large databases.

The workshop revealed that much work is afoot in the knowledge discovery area that promises to take us beyond the discovery of relatively simple representations such as conjunctive probabilistic rules or linear models. However, as one broadens the search space to allow for more expressive languages of knowledge representation, there comes an inevitable computational penalty in terms of the scaling complexity of the algorithms. Each of the three talks showed that while the underlying models may be very different, for each class of models steady progress is being made on whittling down impractical algorithms to practical ones by taking advantage of particular structural characteristics of the methods and the representation being used. We hope to see some of the presented techniques showing up at future workshops as standard workhorses of successful applications.

\ \\ {\Large Integrated and Interactive Systems}

The third theme of the workshop dealt with Integrated and Interactive Systems. The two are closely related, since multi-method, integrated discovery systems frequently rely on human expertise to select the next discovery method, and interactive systems frequently offer a choice of multiple discovery algorithms.

Ron Brachman (AT\&T Bell Laboratories) started the session with a talk about ``Integrated Support for Data Archaeology'', a skilled human task of interactive and iterative data segmentation and analysis. He presented a system, called IMACS, that supports a data archaeologist with a natural, object-oriented description of an application domain, a powerful query language, and a friendly user interface that supports interactive exploration. IMACS is built on CLASSIC, a formal knowledge representation system.

Willi Kloesgen (GMD, Germany) described rule refinement and optimization strategies in Explora, an interactive system for discovery of interesting patterns in databases. The number of patterns presented to the user is reduced by organizing the search hierarchically, beginning with the strongest, most general hypotheses. An additional refinement strategy selects the most interesting statements and eliminates overlapping findings. The efficiency of discovery is improved by inverting the record-oriented data structure and storing all values of the same variable together, which allows efficient computation of aggregate measures. Different data subsets are represented as bit-vectors, making computation of logical combinations of conditions very efficient. Explora, a publicly available system,\footnote{anonymous ftp to ftp.gmd.de or 129.26.8.90, cd gmd/explora} which runs on a Mac, was demonstrated at the workshop.

Philip Chan (Columbia U.) proposed meta-learning as a general technique to integrate a number of distinct learning processes. He examined several techniques for learning arbiters that select among independently learned classifiers. Such strategies are especially suitable for massive amounts of data that main-memory-based learning algorithms cannot efficiently handle.
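The arbiter idea can be sketched as follows. This is a minimal illustration of the general scheme, not Chan's system: base classifiers are trained on disjoint partitions of the data, an arbiter is trained on the training examples on which the base classifiers disagree, and the arbiter is consulted whenever the base classifiers fail to produce a clear majority. The use of Python and scikit-learn decision trees as base learners is an assumption made only for illustration.

\begin{verbatim}
# Illustrative arbiter-style meta-learning over disjoint data partitions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_with_arbiter(X, y, n_partitions=3):
    parts = np.array_split(np.arange(len(X)), n_partitions)
    bases = [DecisionTreeClassifier().fit(X[idx], y[idx]) for idx in parts]
    votes = np.array([clf.predict(X) for clf in bases])
    disagree = np.array([len(set(col)) > 1 for col in votes.T])
    if disagree.any():   # the arbiter learns from the contested examples
        arbiter = DecisionTreeClassifier().fit(X[disagree], y[disagree])
    else:
        arbiter = None
    return bases, arbiter

def predict_with_arbiter(bases, arbiter, X):
    votes = np.array([clf.predict(X) for clf in bases])
    predictions = []
    for i in range(X.shape[0]):
        values, counts = np.unique(votes[:, i], return_counts=True)
        if arbiter is None or counts.max() > len(bases) // 2:
            predictions.append(values[counts.argmax()])       # clear majority
        else:
            predictions.append(arbiter.predict(X[i:i+1])[0])  # defer to arbiter
    return np.array(predictions)
\end{verbatim}

Because each base classifier sees only one partition, the partitions can in principle be learned in parallel and never need to be held in memory together.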
Preliminary results are encouraging, showing that parallel learning by meta-learning can achieve comparable prediction accuracy in less time and space than purely serial learning.

An important design issue discussed at the workshop was the use of an internal vs. an external database. Both IMACS and Explora use an {\em internal database} approach of pre-loading relevant parts of the data and transforming them into an efficient internal format. This approach generally speeds up discovery for small or medium-size databases. However, it limits the system's ability to work with large external databases. An {\em external database} approach, taken in discovery systems such as SKICAT, Spotlight, and KEFIR, is to build an interface, usually based on SQL, to a DBMS. This approach has its difficulties, such as dealing with communication problems and having to fit the discovery requests into the Procrustean bed of SQL. Retrieval from an external database may take longer, since, in addition to the communication delay, the physical database organization may be sub-optimal for the discovery system's requests. However, this approach allows handling of large external databases that would not fit in memory and avoids duplicating the code for DBMS operations like joins or aggregations. We expect the coming advances in database technology, such as faster hardware, SQL servers, and the forthcoming, more powerful SQL 2 and SQL 3 standards, to make the external database approach more attractive.

Other related issues were discussed at the summary session. Larry Kerschberg (George Mason U.) observed that analysts frequently need to track hypotheses in multiple databases and proposed a mediator agent between the analyst and different discovery algorithms on the one hand, and multiple data and knowledge sources on the other. Meta-learning may offer a way to develop such mediator agents.

Jan Zytkow proposed an agenda for integration. The first part is integration of different forms of knowledge: contingency tables, rules, decision trees, and equations. Each form has different strengths, and a limited conversion is possible from one form to another. The second part is integration of search in different spaces of new terms, equations, and rules. Such integration is required in a machine discovery system to match human flexibility in detecting patterns of different types. The multiple searches should be globally controlled and guided by a combination of data conditions, background knowledge, and user preferences.

\ \\ {\Large Advances and Difficulties}

The workshop and the following discussion on the KDD Nuggets e-mail list\footnote{to subscribe, send e-mail to kdd@gte.com} highlighted several difficulties in application development.

{\bf Insufficient statistical awareness}: Some KDD experiments are performed without sufficient awareness of statistical theory. The classical example of this problem is testing $N$ independent patterns for deviation from the norm, each test having a significance level of $\alpha$. Then about $N\alpha$ patterns are likely to pass the test purely by chance. Eliminating such ``random'' discoveries requires statistical controls, such as Bonferroni adjustments, which in the above example mean reducing the significance level of each test to $\alpha/N$ in order to keep the overall significance of the final discoveries at $\alpha$. Other ways to eliminate chance discoveries include randomized testing procedures (Jensen 1991).
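The following short simulation makes the point concrete. The numbers and the test are purely illustrative (Python is used for concreteness): $N$ patterns are ``tested'' on pure noise, so every apparent discovery is spurious; the unadjusted threshold $\alpha$ lets through roughly $N\alpha$ of them, while the Bonferroni-adjusted threshold $\alpha/N$ lets through almost none.

\begin{verbatim}
# Illustrative simulation of the multiple-testing problem and the
# Bonferroni adjustment.  All numbers are arbitrary.
import random
from math import erfc, sqrt

random.seed(0)
N, alpha, samples = 1000, 0.05, 50

def p_value_of_noise():
    """Two-sided p-value for the mean of `samples` Uniform(-1,1) draws
    under the (true) null hypothesis of zero mean, normal approximation."""
    xs = [random.uniform(-1.0, 1.0) for _ in range(samples)]
    m = sum(xs) / samples
    se = sqrt((1.0 / 3.0) / samples)   # variance of Uniform(-1,1) is 1/3
    return erfc(abs(m) / se / sqrt(2.0))

p_values = [p_value_of_noise() for _ in range(N)]
print("spurious discoveries at alpha:  ", sum(p < alpha for p in p_values))
print("spurious discoveries at alpha/N:", sum(p < alpha / N for p in p_values))
# The first count is close to N*alpha = 50; the second is almost always 0.
\end{verbatim}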
At the summary session, John Major estimated that only about half of the work presented at the workshop dealt adequately with this problem. We hope that raising this issue will increase statistical awareness.

{\bf Overabundance of patterns}: As many pioneers of KDD have found, even with proper statistics it is all too easy to find many statistically significant patterns that are obvious, redundant, or useless. A common approach to reducing the number of obvious ``discoveries'' (such as ``only women have pregnancies'') is to focus on changes, since ``obvious'' patterns will not change. Redundant discoveries can be eliminated by rule refinement methods such as those presented by Major or Kloesgen, or by using some findings to explain others. The more difficult task of separating the important patterns from the useless ones requires domain knowledge. A general heuristic here is that rules and patterns are important to the degree that they can lead to a useful action. This suggests a decision-theoretic framing of the problem of evaluating the usefulness of discovered patterns: the {\it utility} of a particular pattern should not be measured in isolation, but instead evaluated in the context of a set of possible actions.

{\bf Integration}: Even if a perfect discovery system is built, it needs to be integrated with other existing hardware/software systems to be useful. As expert system developers discovered years earlier, usually only a small part of the deployed system is new technology -- the rest is interfacing and system integration, mundane but critical steps in moving from the prototype stage to deployment.

{\bf Privacy vs. Discovery}: Discovery in social or business data may raise a number of legal, ethical, and privacy issues. In 1990, Lotus was planning to introduce a CD-ROM with data on more than 100 million American households. The stormy protest that followed led to the withdrawal of the product (Rosenberg 1992). Recent conferences on Computers, Freedom, and Privacy have also increased awareness of issues of privacy and data ownership.

These difficulties are offset by a number of important advances in areas relevant to KDD. Here we list only a few.

{\bf Multistrategy systems}: Several recent comparisons of different learning and discovery algorithms have shown that different methods are superior for different types of problems (Brodley 1993) -- no single method is best across a range of problems. As a result, there is a movement toward multistrategy learning methods, especially for classification, which apply a number of different methods to the {\em same} task and select rules from the best method. This is an area of very active research interest, with recent progress reported in Michalski \& Tecuci (1993a, 1993b).

{\bf Handling large-scale databases}: Since most learning algorithms cannot handle very large datasets, it is usually necessary to reduce the size of the data on which learning is performed. One way is to eliminate irrelevant data using data dependencies; this has been shown to increase the performance of classification methods (Almuallim and Dietterich 1991). Other methods rely on various forms of data sampling. Catlett (1991) used an intelligent sampling approach to obtain a sublinear algorithm for decision tree induction. His method has been used to learn decision trees efficiently from databases with hundreds of thousands of records.

Overall, the workshop reflected measurable progress in developing and deploying KDD applications.
{\bf Acknowledgments}: The authors thank all the workshop participants, and especially Inderpal Bhandari, Saso Dzeroski, Usama Fayyad, Willi Kloesgen, Ryszard Michalski, Evangelos Simoudis, and Jan Zytkow, for their comments and contributions to this report. We thank Patel-Schneider for his editorial guidance and rapid processing of this paper.

\ \\ {\Large References}

\parindent 0pt

H. Almuallim and T. Dietterich, 1991. Learning with Many Irrelevant Features. In Proceedings of AAAI-91, 547--552. Menlo Park, Calif.: AAAI.

T. Anand and G. Kahn, 1992. SPOTLIGHT: A Data Explanation System. In Proceedings of CAIA-92. Washington, D.C.: IEEE Computer Society.

C. Brodley, 1993. Addressing the Selective Superiority Problem: Automatic Algorithm/Model Class Selection. In Proc. of 10th Machine Learning Conference, 17--24. Morgan Kaufmann.

J. Catlett, 1991. Megainduction: A Test Flight. In Proc. of 8th Machine Learning Conference, 596--599. Morgan Kaufmann.

N. Cercone, 1993. Guest editor, Special Issue on Learning and Discovery in Databases, {\em IEEE Trans. on Knowledge and Data Engineering}, Dec.

W. Frawley, G. Piatetsky-Shapiro, and C. Matheus, 1992. Knowledge Discovery in Databases: An Overview. AI Magazine, Fall 1992. Reprint of the introductory chapter of the {\em Knowledge Discovery in Databases} collection, AAAI/MIT Press, 1991.

W. H. Inmon and S. Osterfelt, 1991. {\em Understanding Data Pattern Processing: The Key to Competitive Advantage}. QED Technical Publishing Group, Wellesley, MA.

D. Jensen, 1991. Knowledge Discovery through Induction with Randomization Testing. In {\it Proceedings of the 1991 AAAI KDD Workshop}, G. Piatetsky-Shapiro (ed.), 148--159. Anaheim, CA: AAAI.

N. Lavra\v c and S. D\v zeroski, 1993. {\em Inductive Logic Programming: Techniques and Applications.} Ellis Horwood, Chichester.

C. Matheus, P. Chan, and G. Piatetsky-Shapiro, 1993. Systems for Knowledge Discovery in Databases, {\em IEEE Trans. on Knowledge and Data Engineering}, Dec.

R. S. Michalski and G. Tecuci, 1993a. Editors, Proceedings of the 2nd International Workshop on Multistrategy Learning, George Mason University, Harpers Ferry, 1993.

R. S. Michalski and G. Tecuci, 1993b. Editors, {\em Machine Learning: A Multistrategy Approach, Volume IV}. Morgan Kaufmann.

S. Muggleton, 1992. {\em Inductive Logic Programming.} Academic Press, London.

K. Parsaye and M. Chignell, 1993. {\em Intelligent Database Tools \& Applications}. NY: John Wiley.

J. Pearl, 1992. {\em Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference}, 2nd ed. San Mateo, Calif.: Morgan Kaufmann.

G. Piatetsky-Shapiro, 1991a. Knowledge Discovery in Real Databases: A Workshop Report on KDD-89, AI Magazine, vol. 11, no. 5, January 1991.

G. Piatetsky-Shapiro, 1991b. Report on AAAI-91 Workshop on Knowledge Discovery in Databases, IEEE Expert, October.

G. Piatetsky-Shapiro and W. Frawley, 1991. Editors, {\em Knowledge Discovery in Databases}, Cambridge, Mass.: AAAI/MIT Press.

G. Piatetsky-Shapiro, 1992. Editor, Special Issue on Knowledge Discovery in Databases and Knowledge Bases, {\em Int. J. of Intelligent Systems} 7:7, Sep.

G. Piatetsky-Shapiro, 1993. Editor, Proceedings of KDD-93: the AAAI-93 Workshop on Knowledge Discovery in Databases, AAAI Press report TR-20.

M. Rosenberg, 1992. Protecting Privacy, Inside Risks column, {\em Communications of the ACM}, 35(4), p. 164.

P. Spirtes, C. Glymour, and R. Scheines, 1993. {\em Causation, Prediction, and Search}, Lecture Notes in Statistics, Springer-Verlag.

W. Ziarko, 1993.
Proceedings of the Rough Sets and Knowledge Discovery Workshop, Banff, Canada.

J. Zytkow, 1992. Editor, Proceedings of the Machine Discovery Workshop, Aberdeen, Scotland, July.

J. Zytkow, 1993. Guest editor, Special Issue on Machine Discovery, {\em Machine Learning}, 12(1-3).

\end{document}