KDD Nugget 95:8, e-mailed 95-04-01

Contents:
Replies to Tim Clark Query on Matching Records:
 * L. Breiman, Matching Records
 * S. Stolfo, Matching Records: Merge/Purge Tool Soon Available
 * A. Ketterlin, Analyzing Foster Children's Home Payments Database
 * S. Rondeau, An algorithm for matching similar records
Publications:
 * D. Heckerman, CACM special issue on Bayesian networks
 * G. Grandbase du Donnee, Knowledge Burial in Databases: An Underview
Question:
 * A. Pang, Data mining in retail sector?
CFPs: Calls for Papers and Participation:
 * D. Georgakopoulos, ICDE-96, Int. Conf. on Data Engineering

The KDD Nuggets is a moderated mailing list for news and information
relevant to Knowledge Discovery in Databases (KDD), also known as Data
Mining, Knowledge Extraction, etc. Relevant items include tool
announcements and reviews, summaries of publications, information
requests, interesting ideas, clever opinions, etc. Please include a
descriptive subject line in your submission. Nuggets frequency is
approximately bi-weekly.

Back issues of Nuggets, a catalog of S*i*ftware (data mining tools),
references, FAQ, and other KDD-related information are available at the
Knowledge Discovery Mine, URL http://info.gte.com/~kdd/
or by anonymous ftp to ftp.gte.com, cd /pub/kdd, get README

E-mail add/delete requests to kdd-request@gte.com
E-mail contributions to kdd@gte.com

-- Gregory Piatetsky-Shapiro (moderator)

********************* Official disclaimer ***********************************
* All opinions expressed herein are those of the writers (or the moderator) *
* and not necessarily of their respective employers (or GTE Laboratories)   *
*****************************************************************************

~~~~~~~~~~~~ Quotable Quote ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"How is it possible to find meaning in a finite world, given my waist
and shirt size?"
        Woody Allen
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Note: I felt that the replies to Timothy Clark's query on analyzing the
Foster Children's Foster Home Payments Database and on methods for
matching similar records were of sufficient general interest to be
summarized here. The first two replies were e-mailed to me, and the
next two were e-mailed to Timothy Clark.

Gregory Piatetsky-Shapiro
KDD Nuggets Moderator

~~~~~~~~~~~~~~~~~~~~~~~~
From: Leo Breiman
Date: Fri, 24 Mar 1995 15:32:26 -0800

Your letter was posted in KDD Nuggets. The Census Bureau, with whose
work I'm more or less familiar, has worked on matching records for at
least 15 years. They have some computer algorithms which work pretty
well. A person who is knowledgeable about the current state of the art
is Howard Hogan, Statistical Research Division, Bureau of the Census.
There have also been a number of articles on matching procedures; the
Journal of the American Statistical Association is a good place to
start.

Leo Breiman

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Sat, 25 Mar 95 9:37:42 EST
From: Sal Stolfo
Subject: Merge/Purge Tool Soon Available

Timothy Clark's request for information in the previous KDD Nuggets
prompted this reply. He seeks information about solutions or systems
able to match multiple records that exhibit great variability in
describing a single domain entity.

Merging multiple databases, or matching "similar" records within one
database, is a ubiquitous problem, solved today by a variety of
methods, some better than others. (Scan your junk mail over a few
months to see how poorly some commercial organizations do the job.)
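(Note: as a toy illustration of the matching problem Prof. Stolfo
describes -- several records naming the same entity in different ways --
one very simple rule is to normalize fields before comparing them. The
Python sketch below is ours, not part of the Merge/Purge system; the
real system applies a much richer "equational theory" of such rules,
which Prof. Stolfo mentions below. GPS)

    import re

    def normalize(name):
        # Fold case, drop punctuation, collapse runs of whitespace.
        return " ".join(re.sub(r"[^a-z ]", "", name.lower()).split())

    # Two renderings of one person that fail an exact comparison:
    print("Smith, John A." == "smith  john a")                        # False
    print(normalize("Smith, John A.") == normalize("smith  john a"))  # True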
With the coming age of network computing, the problem of merging
multiple records from diverse sources will become increasingly
difficult and important. Scale becomes the crucial issue.

A prototype system implementing several alternative scalable solutions
will soon be available for others to experiment with. In the meantime,
please see "The Merge/Purge Problem for Large Databases," to appear in
SIGMOD 95, or directly accessible from
http://www.cs.columbia.edu/pdis_lab/ .

The software system implementing the solutions described in the paper
is usable here at Columbia. If you send me your data, we'll process it
and return the results of the analysis! However, we'll need your
"equational theory" if our prototype has insufficient knowledge to
process your matching task.

Sal Stolfo
Department of Computer Science
Columbia University
New York, NY 10027
sal@cs.columbia.edu

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>From: alain@dpt-info.u-strasbg.fr
>Date: Sat, 25 Mar 1995 08:13:53 +0100
>To: timclark@whidbey.whidbey.com
>Subject: Re: Analyzing Foster Children's Foster Home Payments Database
>
>Dear Mr Clark,
>
>I've just read your posting in the KDD-Nuggets mailing list. I'm not
>sure I've fully understood the details of your problem, but it seems to
>me that you want to cluster records according to how they "globally"
>match. It may be necessary to give more details on the encoding of the
>records so that any of us could be of some help.
>
>I'm a PhD student in Strasbourg (France), working on incremental
>hierarchical multivariate clustering techniques. I've done some work on
>extending a machine-learning clustering algorithm (whose originator is
>Doug Fisher, from Vanderbilt University in Nashville). Maybe such an
>algorithm could help you in your search for coherent groupings in your
>data. There is also a wide variety of multivariate clustering
>techniques developed in the Statistics community. I don't know if you
>are aware of common multivariate clustering techniques. If you are,
>maybe you should tell us why they fail (if they do). If you are not, I
>(as well as many other researchers) can send you pointers to the
>relevant literature. You may even find ready-to-run programs with which
>to make preliminary experiments.
>
>Your original post also raises the issue of very large databases. Even
>though this is a primary criterion for KDD techniques, I think it is
>not completely dealt with yet. Thus, it would be better if you could
>prepare some samples on which the first experiments could be conducted
>(a thousand records seems to me a reasonable size).
>
>In any case, if you give me details about the structure of your
>database (i.e., encoding of values, what you consider to be a
>"reasonable match" between records, how the information is stored ...),
>I could perhaps make more relevant remarks.
>
>I hope you will find some help.
>
>Best regards,
>
>Alain Ketterlin - PhD student
>Departement d'Informatique - Universite Louis Pasteur
>Strasbourg --- France.
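(Note: for readers unfamiliar with the clustering approach Mr. Ketterlin
suggests, here is a deliberately crude Python sketch of one flat
variant -- grouping records whose pairwise dissimilarity falls below a
threshold, in the spirit of single-linkage clustering. The distance
function is a stand-in; a real multivariate method would compare all
record fields, and Mr. Ketterlin's own work is incremental and
hierarchical. GPS)

    def distance(a, b):
        # Toy dissimilarity: fraction of character positions that disagree.
        n = max(len(a), len(b))
        return sum(x != y for x, y in zip(a.ljust(n), b.ljust(n))) / n

    def single_linkage_groups(items, threshold=0.2):
        # Two items share a group whenever a chain of pairwise-close
        # items connects them.
        groups = []
        for item in items:
            near = [g for g in groups
                    if any(distance(item, m) <= threshold for m in g)]
            merged = [item] + [m for g in near for m in g]
            groups = [g for g in groups if not any(g is n for n in near)]
            groups.append(merged)
        return groups

    print(single_linkage_groups(["009463788", "009563788", "123456789"]))
    # [['009563788', '009463788'], ['123456789']]  -- order may vary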
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Wed, 29 Mar 1995 18:40:42 +0800
To: kdd%eureka@gte.com
From: timclark@whidbey.whidbey.com (Timothy K. Clark)
Subject: Response to Foster Children Record Matching Problem

>Date: Sun, 26 Mar 1995 17:47:53 -0800
>From: Stephen Rondeau
>To: timclark@whidbey.whidbey.com
>Subject: Drastic Revision of Previous Note
>
>Mr. Clark:
>
>Here is basically a rework of my previous note concerning your query
>for an algorithm or mechanism to handle noisy data.
>
>I wrote that algorithm hastily and did not test it. I've spent the last
>day and a half testing a variation of it, which handled 98% of my noisy
>test data. Since I thought that 2% of 6,000,000 was too big a number, I
>kept searching for a better solution.
>
>It finally dawned on me that I was going about it all wrong.
>
>The basic problem is that exact matching of the fields will never work,
>since there is no guarantee that the error will appear in the same
>place or with the same value. I remembered the "exclusive-or" function:
>it takes two operands and combines them bit by bit according to this
>table:
>
>    operand1   operand2   result
>    --------   --------   ------
>       0          0         0
>       0          1         1
>       1          0         1
>       1          1         0
>
>Now, if you take two identical character strings, such as names or
>social security numbers, and exclusive-or them together, the result
>will be a string of bytes that are all zero ('00'X, or the ASCII NUL
>character). Of course, we want to compare two strings which may differ
>greatly -- a complete mismatch -- or just a little bit (pardon the pun)
>-- a "fuzzy" match.
>
>The first question, then, is what should be used for matching? You
>could use all four identifiers, but that gets messy quickly -- too many
>possibilities for typographical errors. The ideal situation for
>exclusive-oring strings is when the strings align, i.e., each nonblank
>character in string1 matches up to its counterpart in string2. We want
>to ignore blanks because extra ones are common typographical errors.
>For example:
>
>John Jones
>  John Jones
>
>These would not match using an exclusive-or operation. In addition,
>embedded blanks and capitalization cause problems in names. In short,
>let's avoid the name field. Birthdate might be okay, except that with
>what I suspect is about 50,000 individuals in your database, there must
>be quite a few with the same birthdate... it's not unique. But both
>social security number (ssnum) and case number are supposed to be
>unique. If the case number is not just digits, it can fall prey to
>errors similar to names. Consequently, I feel the ssnum is the best
>field to use.
>
>The next question is, how many errors do you tolerate? Presuming that
>you don't have the luxury of an error-free record for each individual,
>what you will be comparing against (the "base" record for the
>individual) may contain errors. So, you may be comparing an erroneous
>field with another erroneous field. Yet we humans can match such
>fields... why can't a computer? Let's look at two aligned ssnums, each
>containing one error (indicated by '?'):
>
>Base:    009463?88
>Current: 009?63788
>
>We would feel confident these are very likely the same numbers. What
>would the consequence of an exclusive-or operation on them be? Let a
>'.' represent the NUL character ('00'X), and '!' represent the nonzero
>bytes:
>
>Base:    009463?88
>Current: 009?63788
>         ---------
>XOR:     ...!..!..
>
>Now you can see that the only thing we need to do is count the number
>of nonzero bytes -- that will tell us how many errors there are. And
>you can choose the threshold you think is acceptable. For my program, I
>used two. That is, if there are more than two errors found after
>exclusive-oring the base ssnum and the current one, the current ssnum
>does not belong to the same individual.
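(Note: the byte-counting step Mr. Rondeau describes translates directly
into most languages. The sketch below, in Python, is ours -- an
illustration, not his code. GPS)

    def count_mismatches(base, current):
        # XOR corresponding bytes; a zero byte means the characters
        # matched, a nonzero byte marks one error position.
        a, b = base.encode("ascii"), current.encode("ascii")
        if len(a) != len(b):      # unaligned ssnums are handled separately
            return max(len(a), len(b))
        return sum(1 for x, y in zip(a, b) if x ^ y)

    print(count_mismatches("009463188", "009963188"))  # 1 error
    print(count_mismatches("009463188", "009963788"))  # 2 errors: still a match
    print(count_mismatches("009463188", "123456789"))  # 7 errors: no match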
>If one of the ssnums is missing a digit and that's the only thing
>wrong, this will still work. If there is an extra digit inserted, it
>won't (unless it is the rightmost digit). Perhaps you could clean up
>the ssnum field so that all of them are aligned, have the same number
>of characters, and are in the same format. The database query language
>can probably help you find the bad ones -- with luck, there will be
>fewer than 1000 that need to be formatted by hand. Then you'd only have
>to worry about the content, which the algorithm below addresses.
>
>A base ssnum represents the entire record. Its "class" is automatically
>assigned to 1 -- "identifying" the first individual (that is, a class
>is an individual). This ssnum will be the one against which all others
>will be compared. A new base ssnum is created for each current ssnum
>that fails to match all of the previous base ssnums.
>
>The unique record number corresponding to the record is always
>remembered so we can determine which record belongs to which class. In
>the algorithm, the record number is added to the list (concatenated to
>a string, actually) corresponding to the correct individual. Therefore,
>when the algorithm is finished, all you need to do is index through the
>recs_in_class array -- the index is the individual or class, and the
>element is the list of record numbers corresponding to the individual.
>
>start---------------------------------------------------------------------
>
>base_recs = 1
>individual = 1
>
>/*---- Choose any record in the database to be the first base record. ----*/
>
>ri = 1
>
>rec = NEXT_RECORD()
>
>base_rec[base_recs] = rec.ssnum
>base_rec_class[base_recs] = individual
>
>recs_in_class[individual] = ri
>
>error_threshold = 2
>NUL = '00'X
>
>/*---- For all remaining records in the database ----*/
>
>Do Forever
>  rec = NEXT_RECORD()
>  If ERROR
>    then Leave
>
>  ri = ri + 1                  /* Unique number of the current record */
>
>  match_found = false
>
>  /*--------------------
>  Compare all base ssnums to the current record's ssnum.
>
>  If none match, create a new class (individual).
>  If one matches, use the class (individual) associated with the base ssnum
>  ---------------------*/
>
>  Do bi=1 to base_recs
>
>    /*---- Exclusive-or ssnums between the records as follows: ----*/
>
>    xor_bytes = XOR(base_rec[bi], rec.ssnum)
>
>    /*---- Count the number of mismatched bytes ----*/
>
>    count = 0
>    Do i=1 to LENGTH(xor_bytes)
>      If SUBSTR(xor_bytes,i,1) \= NUL  /* If current byte is not NUL */
>        then count = count + 1
>    End
>
>    If count > error_threshold   /* If no match within tolerance */
>      then Iterate               /* then try the next base ssnum */
>
>    match_found = true
>
>    /*---- Remember which records matched (append new to old list). ----*/
>
>    recs_in_class[base_rec_class[bi]] =
>      recs_in_class[base_rec_class[bi]] ||' '|| ri
>
>    Leave
>  End  /* For all base records */
>
>  If match_found = false             /* If no match */
>    then do
>      individual = individual + 1    /* Assume it's another individual */
>
>      base_recs = base_recs + 1      /* New base record for comparison */
>
>      base_rec[base_recs] = rec.ssnum   /* New base for the individual */
>      base_rec_class[base_recs] = individual
>
>      recs_in_class[individual] = ri
>    end
>End
>
>/* The following may help to illustrate the desired results. */
>
>/*---- Show the classes and the record numbers associated with them -----*/
>Do i=1 to individual
>  WRITE_TO_SCREEN(i||": "||recs_in_class[i])
>End
>
>end-------------------------------------------------------------------------
>
>The above "algorithm" is reformatted and "genericized" from REXX source
>code. I believe everything essential is there. The REXX program was
>able to find 100 out of 100 individuals in 540 noisy records. Sometimes
>the test data contained error-free ssnums for all records belonging to
>an individual, sometimes it didn't.
>
>I generated the test records using many calls to a random number
>generator. The identifier fields which contain the errors, and the
>positions of the errors within a particular field, vary randomly, as do
>the number of records per individual and their locations in the test
>data (they are not grouped together there). I think it is fairly
>representative of your data, once the ssnum field is cleaned up.
>However, the only errors are up to one incorrect digit per ssnum.
>
>Once again, I hope this works for you. If not, feel free to call
>(206-246-6077) or send email (sbr@halcyon.com). Of course, if it does
>work I'd also like to hear about it!
>
>Stephen Rondeau, AugmenTek
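(Note: the pseudocode above translates quite directly into a modern
scripting language. The following Python version is ours and untested
against Mr. Clark's data; the names and sample records are invented for
illustration. GPS)

    ERROR_THRESHOLD = 2

    def mismatches(a, b):
        # Count nonzero bytes of the exclusive-or, as in the pseudocode;
        # a length difference counts as extra mismatches.
        return sum(1 for x, y in zip(a.encode(), b.encode()) if x ^ y) \
               + abs(len(a) - len(b))

    def group_records(records):
        # records: list of (record_number, ssnum) pairs.
        # Returns {class number: [record numbers]}, one class per individual.
        base_ssnum = {}              # class number -> base ssnum
        recs_in_class = {}
        for ri, ssnum in records:
            for cls, base in base_ssnum.items():
                if mismatches(base, ssnum) <= ERROR_THRESHOLD:
                    recs_in_class[cls].append(ri)  # matched an existing base
                    break
            else:                    # no base matched: new individual
                cls = len(base_ssnum) + 1
                base_ssnum[cls] = ssnum
                recs_in_class[cls] = [ri]
        return recs_in_class

    data = [(1, "009463188"), (2, "009963188"), (3, "123456789")]
    for cls, recs in group_records(data).items():
        print(cls, recs)             # 1 [1, 2]  then  2 [3]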
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Thu, 23 Mar 95 15:56:34 SST
To: kdd@gte.com
From: sinnguk@iti.gov.sg (Angeline Pang)
Subject: Data mining in retail sector

Hello Gregory,

Are you aware of any data mining work done in the retail/wholesale
sector? Have you got any papers or references (besides Spotlight by AC
Nielsen)? Any information will be of great help. Thank you.

Angeline

Angeline S. N. PANG                  email: sinnguk@iti.gov.sg
Information Technology Institute     tel: +65 - 7720 465
National Computer Board              fax: +65 - 7795 966
71 Science Park                      WWW: http://www.iti.gov.sg
Singapore 0511

(Note: There is a lot of activity in the business sector on database
marketing -- see, e.g., the Business Week cover article, Sep 5, 1994,
also in http://info.gte.com/~kdd/nuggets/94/n7.txt -- but I do not have
a good list of recent publications. If you have relevant information or
references, please reply to kdd@gte.com, and I will summarize to the
list. GPS)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From: heckerma@microsoft.com (David Heckerman)
Date: Fri, 3 Mar 1995 12:18:29
Subject: CACM special issue on Bayesian networks

This month's Communications of the ACM is a special issue on Real-World
Applications of Bayesian networks. The issue was inspired by papers and
panel discussions from the UAI-93 conference. The special issue
includes three articles (see below) that nicely illustrate the
practical uses of Bayesian networks, along with some introductory
material. A second special issue, containing seven additional articles
on Real-World Applications of Uncertain Reasoning, will soon appear in
the International Journal of Human-Computer Studies.

David Heckerman
Abe Mamdani
Michael Wellman
Guest Editors

============

Structure and Chance: Melding Logic and Probability for Software
Debugging

Lisa Burnell, University of Texas at Arlington
Eric Horvitz, Microsoft Research

Abstract: To date, software engineers charged with debugging complex
software packages have had few automated reasoning tools to assist them
with identifying the sources of error and with prioritizing their
effort.
We describe methods, based on a synthesis of logical and probabilistic
reasoning, that can be employed to identify the likely source and
location of problems in complex software. The methods have been applied
to diagnosing run-time errors in the Sabre system, the largest
timeshared reservation system in the world. The results from our
validation suggest that such methods can be of value in directing the
attention of software engineers to program execution paths and program
instructions that have the highest likelihood of harboring a
programming error.

Applying Bayesian Networks to Information Retrieval

Robert Fung, Prevision Inc.
Brendan Del Favero, Stanford University

Abstract: The need for effective information-retrieval systems is
increasing with the exponential growth of available information. In
this article, we describe an application of Bayesian networks, a
relatively new technology for probabilistic representation and
inference, to information retrieval. Our research has been directed at
developing a probabilistic information-retrieval architecture that
assists users who have stable information needs in routing (i.e.,
sorting through) large amounts of time-sensitive material, gives users
an intuitive language with which to specify their information needs,
and integrates relevance feedback and training data with the user's
judgments to incrementally improve retrieval performance. Towards these
goals, we have developed and implemented an information retrieval
system based on Bayesian networks. We tested our implementation's
performance by participating in the 1993 Text Retrieval Conference
(TREC-2), sponsored by the National Institute of Standards and
Technology (NIST).

Decision-Theoretic Troubleshooting

David Heckerman, John S. Breese, and Koos Rommelse
Microsoft Research

Abstract: We develop a decision-theoretic method for troubleshooting a
nonfunctioning device. Rather than passively observe device behavior
and attempt to infer what faults are present, we allow for the repair
or replacement of device components, and use the resulting changes in
device behavior to infer likely causes of failure. The method uses a
Bayesian network to model the problem domain. We have applied our
technique successfully to troubleshooting problems with printing over
computer networks, photocopier feeders, automobiles, and gas turbines.
We report empirical findings demonstrating the high quality of
troubleshooting plans produced by this method when compared to plans
generated by simple heuristics.
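(Note: a key result in this line of work is that, under a single-fault
assumption, a troubleshooter should repair components in decreasing
order of fault probability divided by repair cost. The numbers in the
Python fragment below are invented; it only illustrates the ordering
rule, not the full Bayesian-network machinery of the paper. GPS)

    # Hypothetical components: (name, fault probability, repair cost).
    components = [("print driver", 0.30, 5.0),
                  ("network link", 0.15, 1.0),
                  ("printer head", 0.05, 25.0)]

    # Repairing in decreasing order of p/C minimizes expected cost
    # under a single-fault assumption.
    for name, p, cost in sorted(components, key=lambda c: c[1] / c[2],
                                reverse=True):
        print(f"{name}: p/C = {p / cost:.3f}")
    # network link: p/C = 0.150
    # print driver: p/C = 0.060
    # printer head: p/C = 0.002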
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From: gregoire@ful1.com (Gregoire Grandbase du Donnee)
Subject: Knowledge Burial in Databases: An Underview

This is an abstract of the introductory chapter of the forthcoming book
on the hot new topic of Knowledge Burial in Databases.

**** Knowledge Burial in Databases: An Underview ****

Gregoire Grandbase du Donnee
SoCal University

    Computers have promised us a fountain of data
    but delivered a flood of wisdom
        A happy MIS executive

Abstract

This chapter presents an overview of the state of the art in research
on knowledge burial in databases. We analyze {\em Knowledge Burial} and
define it as the trivial storage of explicit, previously known, and
potentially useless information in data. We then compare and contrast
database, machine learning, and other approaches to burial in data. We
present a framework for knowledge burial and examine problems in
dealing with large, noisy databases, the use of domain knowledge, the
role of the user in the burial process, burial methods, and the form
and uses of buried knowledge. We also discuss application issues,
including the variety of existing applications and the propriety of
burial in social databases. We present criteria for selecting an
application in a corporate environment. In conclusion, we argue that
burial in databases is both feasible and practical, and we outline
directions for future research, which include better use of domain
knowledge, efficient and incremental algorithms, interactive systems,
and integration on multiple levels.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~CFPs: Calls for Papers and Participation:~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

From: dimitris@gte.com (Dimitrios Georgakopoulos)
To: Multiple recipients of list
Subject: ICDE96 - preliminary call for papers

\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
 The ICDE96 home page http://info.gte.com/ftp/doc/ICDE96/flyer.html
 will be kept up to date as information becomes available
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\

                  PRELIMINARY CALL FOR PAPERS
        The 12th International Conference on Data Engineering
                   Feb. 26 - March 1, 1996
                 New Orleans, Louisiana, USA
             Sponsored by the IEEE Computer Society

SCOPE

Data Engineering deals with the use of engineering disciplines in the
design, development, and evaluation of information systems for
different computing platforms and system architectures. The 12th Data
Engineering Conference will provide a forum for the sharing of original
research results and practical engineering experiences among
researchers and practitioners interested in all aspects of data and
knowledge management. The purpose of the conference is to share
solutions to problems that face our information-oriented society and to
identify new challenges and directions for future research.

TOPICS OF INTEREST

The topics of interest include but are not limited to:

o Data Engineering and Information Superhighways
o Mobile Computing
o Network Databases and Security
o Knowledge Mining and Discovery
o Interoperability of Heterogeneous Information Systems
o Distributed Transaction Management
o Query Processing and Optimization
o Virtual Enterprise Management
o Work Management and Simulation
o Multimedia Systems and Applications
o Open Architecture and Extensible Systems
o Visualization and User Interfaces
o Active Database/Knowledge Base Systems
o Intelligent and Deductive Systems
o Parallel and Distributed Database Systems
o Object-oriented Databases and Software Engineering
o Real-time, Temporal and Spatial Systems
o Industrial Challenges in Data Engineering
o Scientific and Statistical Databases
o Information Systems for Engineering Applications

PAPER SUBMISSION

Six copies of original papers not exceeding 6000 words (25
double-spaced pages) should be submitted by May 31, 1995 to the program
chair:

Stanley Y. W. Su
Database Systems R&D Center
470 CSE Building
University of Florida
P. O. Box 116125
Gainesville, FL 32611-6125
E-mail: icde96@cis.ufl.edu
Tel. (904) 392-2680, FAX: (904) 392-1220

Authors should explicitly specify on the cover page one or two topics
of interest most relevant to the submission. Submissions for the
industrial program should also be identified on the cover page.
INDUSTRIAL, PANEL & TUTORIAL PROGRAMS

There will be a series of sessions focused on issues relevant to
practitioners of information technology. Proposals for the industrial
program, panels, and the tutorial program are being sought. Proposers
should submit a short write-up to the persons in charge of the
Industrial, Panel, and Tutorial programs.

PUBLICATIONS & AWARDS

All accepted papers will appear in the Proceedings published by the
IEEE Computer Society. The authors of selected papers will be invited
to submit extended versions for possible publication in the IEEE
Transactions on Knowledge and Data Engineering and in the Journal of
Distributed and Parallel Databases. An award will be given to the best
paper. A separate award honoring K. S. Fu will be given to the best
student paper (authored solely by students).

IMPORTANT DATES

o Paper, Panel, and Tutorial submissions: May 31, 1995
o Notification of acceptance: October 2, 1995
o Tutorials: February 26-27, 1996
o Conference: February 28 - March 1, 1996

ORGANIZING COMMITTEE

General Chair: Marek Rusinkiewicz, University of Houston
Program Chair: Stanley Y. W. Su, University of Florida
Industrial Program Chair: Umesh Dayal, Hewlett-Packard Laboratories
Panel Program Chair: Arie Segev, University of California-Berkeley
Tutorial Program Chair: Witold Litwin, University of Paris

PROGRAM VICE CHAIRS

Data Engineering and Information Superhighways
  Gunter Schlageter, Univ. of Hagen
Interoperability of Heterogeneous Information Systems
  Dennis McLeod, Univ. of Southern California
Scientific and Statistical Databases and Applications
  Arie Shoshani, LBL
Multimedia Systems and Applications
  William Grosky, Wayne State University
Extensible and Active Database/Knowledge Base Systems
  Sharma Chakravarthy, Univ. of Florida
Parallel and Distributed Database Systems
  Chaitanya K. Baru, IBM Toronto Labs
Object-oriented Databases and Software Engineering
  Luqi, Naval Postgraduate School
Real-time, Temporal and/or Spatial Systems
  Alex Buchmann, Technical Univ. Darmstadt

EUROPEAN COORDINATOR
o Keith G. Jeffery, Rutherford Appleton Lab

FAR EAST COORDINATOR
o Ron Sacks-Davis, CITRI

EXHIBITS PROGRAM
o to be named

PUBLICITY CHAIR
o Dimitrios Georgakopoulos, GTE Laboratories

FINANCIAL CHAIR
o Stephen Huang, University of Houston

REGISTRATION
o to be named

LOCAL ARRANGEMENTS
o William Buckles, Tulane University