KDD Nuggets #5 -- October 28, 1993

Contents:
* Interestingness: Robert Demolombe, Darrell Conklin, GPS
* Douglas H. Fisher: new AI/Stats list created
* Michael Brodie: Economist: AI is revolutionizing the Credit Business

Requests:
* Richard Forsyth: machine learning in geographic/spatial databases?
* Douglas H. Fisher: hierarchical clustering?

The KDD Nuggets is an informal list for the dissemination of information
relevant to Knowledge Discovery in Databases (KDD), such as announcements
of conferences/workshops, tool reviews, application success/failure
stories, interesting ideas, outrageous opinions, etc. If you have such a
contribution, please email it to kdd%eureka@gte.com

Mail requests to be added/deleted also to kdd%eureka@gte.com.

-- Gregory Piatetsky-Shapiro

--------------------------------------------------
From demolombe@tls-cs.cert.fr Wed Oct 20 13:19:46 1993
Subject: Interestingness

There is a large body of work in logic on the concepts of "relevance",
"topic", "subject matter", "aboutness", and "relatedness". We are working
on the definition of a logic for reasoning about links between a sentence
and topics. The initial motivation was to help users retrieve
information. In that context, "interesting" topics are defined as those
topics that are related to the query, and an "interesting" additional
answer is defined as additional information related to these interesting
topics.

Is this related to "interestingness"? Are there possible applications of
this work to KDD, in the sense that interesting topics may be used to
focus the search on a restricted set of rules or patterns? I am
definitely not a specialist in KDD; who could give me an answer?

If anyone is interested, we have a preliminary version of a paper
entitled "Reasoning about 'is about'", available on request.

Robert Demolombe

---------------
From: conklin@qucis.queensu.ca (Darrell Conklin)
Date: Wed, 27 Oct 93 10:18:51 EDT
Subject: On "interestingness"

Techniques for knowledge discovery include conceptual clustering and its
incremental counterpart, concept formation. These techniques typically
describe objects by sets of attribute/value pairs (features) and group
similar objects together into a hierarchy of concepts. Concepts are
represented by intensional definitions, which capture recurrent patterns
of features. There are many (sometimes conflicting) evaluation methods
for the "interestingness" of a concept, including "rediscovery",
predictiveness of features given concept membership, ability to compress
data, and so on.

Here is another idea. Consider a concept C: we can ask, is there some
subset S of the features in C that is highly predictive of the others
(the features in C-S)? That is, the concept C could be useful for
inference if

    P(C-S | S) = P((C-S) & S) / P(S) = P(C) / P(S)

is sufficiently high, and the regularity or concept C has been observed
a sufficient number of times. (I leave the definition of "sufficient" to
a KDD theorist.) The expression above can easily be evaluated --- without
a scan through the whole database --- if both C and S are discovered
concepts with an attached frequency-of-occurrence field. A similar
technique has been used by Rooman and Wodak (Nature 335, 1988) and in my
own research to evaluate discovered associations between protein sequence
and structure.

Concept formation systems offer a reasonable technique for uncovering
regularities in the data, while constraining the search space over
possible regularities.
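
To make the arithmetic concrete, here is a minimal sketch in Python
(illustrative only; the stored-count representation and the names are
assumptions, not from any particular concept formation system):

    # P(C)/P(S) reduces to count_C/count_S, since the database size N
    # cancels: (count_C/N) / (count_S/N) = count_C/count_S.
    def predictiveness(count_C, count_S, min_support=5):
        """Estimate P(C-S | S) from stored concept frequencies.

        count_C: times the full concept C has been observed
        count_S: times the feature subset S has been observed
        min_support: a placeholder notion of "sufficient" evidence
        """
        if count_C < min_support or count_S == 0:
            return None                 # too rare to trust, or S unseen
        return count_C / count_S        # no scan of the database needed

    # Example: C observed 40 times, S observed 50 times, so S predicts
    # the remaining features of C with estimated probability 0.8:
    print(predictiveness(40, 50))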
---------------
From: gps@gte.com (Gregory Piatetsky-Shapiro)
Subject: Interestingness is subjective

In practical applications, the interestingness of a piece of knowledge
is usually related to whether that knowledge can lead to some useful
action. Thus, objective (syntactic, statistical, information-theoretic,
logical, etc.) measures of interestingness are an important but not the
only component of overall interestingness. Another, subjective,
component is necessary.

In our current system, which looks for key changes in health care data,
that subjective component is the degree of *discretion* over a
particular finding. Thus, an increase in costs due to normal pregnancies
is not very interesting, since there is no action item. An increase in
costs due to premature babies is very interesting, since there are
well-known prevention techniques.

We get from experts the values of discretion for the basic elements in
the knowledge base. Then the interest of any combination of these
elements can be computed as a simple, but domain-dependent, function of
the interest of the basic elements.
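
A minimal sketch of one way such a combination could be computed (the
discretion values and the combining function below are illustrative
assumptions; the actual function is domain-dependent):

    # Hypothetical discretion scores elicited from experts for basic
    # knowledge-base elements (values invented for illustration).
    DISCRETION = {
        "normal pregnancy": 0.1,   # no real action item
        "premature birth":  0.9,   # known prevention programs
    }

    def interest(elements, objective_score):
        """Weight an objective measure (e.g., the size of a cost
        increase) by the highest discretion among the finding's basic
        elements; one simple, domain-dependent choice."""
        subjective = max(DISCRETION.get(e, 0.0) for e in elements)
        return objective_score * subjective

    # The same cost increase scores very differently depending on
    # whether anything can be done about it:
    print(interest(["normal pregnancy"], 100.0))   # 10.0
    print(interest(["premature birth"], 100.0))    # 90.0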
--------------------------------------------------
From: Michael Brodie
Subject: AI is revolutionizing the Credit Business

An article in The Economist, Sept. 25, 1993, describes how AI is used
to identify cardholder preferences, interests, and qualifications. The
system is claimed to use a 19,000-item rulebook and will eventually be
used to process all transactions.

--------------------------------------------------
From: dfisher@vuse.vanderbilt.edu (Douglas H. Fisher)
Subject: AI/Stats list

A new mailing list for those interested in AI and Statistics has been
created. Requests to be added to this mailing list should be directed
to ai-stats-request@watstat.uwaterloo.ca

Organization of the 1995 International AI and Statistics Workshop has
begun. Look for announcements in this and other newsgroups.

Doug Fisher, General Chair

------------------------------------------------------------
----------------- Requests ---------------------------------

From: RS_FORSYTH@cv.uwe.ac.uk
Subject: machine learning in geographic/spatial databases

hello out there,

i have a student looking into applications of
learning/induction/discovery algorithms to remote-sensing & geographic
databases. we have only tracked down 3 or 4 refs so far (all but one in
the KDD-93 proceedings). is anyone out there well up on the state of
that particular art? if so i'd greatly appreciate info on where to
look, whom to contact & so forth.

thanks, richard forsyth. (UWE Bristol, UK)

--------------------------------------------------
From: dfisher@vuse.vanderbilt.edu (Douglas H. Fisher)
Subject: query

The forms of iterative optimization in clustering that I am familiar
with begin with some initial clustering, and then iteratively move
single objects around in search of a better clustering according to
some objective measure. I have built a system that forms an initial
hierarchical clustering, and then moves top-down through the hierarchy,
at each level `reclassifying' entire clusters (subtrees) in search of a
better partition. This top-down pass terminates at the leaves, where
single objects are reclassified in the global hierarchical structure.
In general, several top-down passes may be necessary before the
hierarchical clustering `stabilizes'.

If you know of published work along similar lines, either similar
systems or work related to the more general issue of reclassifying
object sets (versus single objects), then please send me citations at
dfisher@vuse.vanderbilt.edu

P.S. I already know of one piece of related work by Nevins at Georgia
State.

Thank you,
Doug Fisher
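
For concreteness, here is a rough sketch of the control structure
described in the query above (an illustrative reconstruction, not
Fisher's actual system; the two-level structure, the variance
objective, and all names are placeholder assumptions):

    # Reclassify entire clusters (object sets) rather than single
    # objects. For brevity the "hierarchy" is only two levels deep:
    # top-level groups, each holding clusters of 1-D points. The
    # objective is total within-group variance (lower is better).

    def within_variance(groups):
        total = 0.0
        for group in groups:
            points = [x for cluster in group for x in cluster]
            if points:
                m = sum(points) / len(points)
                total += sum((x - m) ** 2 for x in points)
        return total

    def reclassify_clusters(groups):
        """Make repeated passes, tentatively moving each whole cluster
        to every other group and keeping a move only if the objective
        improves; stop when a full pass changes nothing (the clustering
        has `stabilized')."""
        changed = True
        while changed:
            changed = False
            for i, group in enumerate(groups):
                for cluster in list(group):
                    best_j, best = i, within_variance(groups)
                    for j in range(len(groups)):
                        if j == i:
                            continue
                        group.remove(cluster)        # tentative move
                        groups[j].append(cluster)
                        score = within_variance(groups)
                        if score < best:
                            best_j, best = j, score
                        groups[j].remove(cluster)    # undo the move
                        group.append(cluster)
                    if best_j != i:                  # commit best move
                        group.remove(cluster)
                        groups[best_j].append(cluster)
                        changed = True
        return groups

    # The cluster [11, 12] starts in the wrong group and is moved as a
    # unit, not one object at a time:
    print(reclassify_clusters([[[1, 2], [11, 12]], [[10, 13]]]))
    # -> [[[1, 2]], [[10, 13], [11, 12]]]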