KDnuggets News 02:09, item 16, Software


Software

From: Jim Morgan
Date: May 3, 2002
Subject: Draft Announcement on Availability of the Enhanced Search Program

In 1963, John Sonquist and I published an article in the Journal of the American Statistical Association proposing a new approach to the analysis of rich bodies of data. When one has more than 1000 cases, it is no longer necessary to make restrictive assumptions of additivity or linearity of effects, as in regression. Taking advantage of the fact that a few divisions on any one explanatory classification would largely exhaust its explanatory power, one could use sequential binary segmentation to find the best set of subgroups to account for some criterion variable. The emphasis was on selecting among many competing explanations to reduce error, rather than on testing a single model.
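As a rough illustration of a single step of binary segmentation for a means criterion, the sketch below picks, among several competing categorical predictors, the one whose best two-way grouping of categories most reduces the error sum of squares. It is only a minimal sketch under stated assumptions, not the SEARCH or AID code: the function names are hypothetical, and it omits stopping rules such as minimum group sizes. Applying the same step again to each resulting subgroup gives the sequential procedure.

```python
from collections import defaultdict

def best_binary_split(categories, y):
    """Return (left_categories, bss) for the two-way grouping of category
    codes that maximizes the between-group sum of squares (BSS) of y.
    Hypothetical helper for illustration; not taken from SEARCH."""
    # Accumulate [count, sum of y] for each category code.
    stats = defaultdict(lambda: [0, 0.0])
    for c, v in zip(categories, y):
        stats[c][0] += 1
        stats[c][1] += v
    # Order categories by their mean of y; for a means criterion the
    # best two-way grouping is contiguous in this ordering.
    ordered = sorted(stats, key=lambda c: stats[c][1] / stats[c][0])
    n_total, sum_total = len(y), sum(y)
    best_left, best_bss = None, 0.0
    n_left, sum_left = 0, 0.0
    for i, c in enumerate(ordered[:-1]):
        n_left += stats[c][0]
        sum_left += stats[c][1]
        n_right = n_total - n_left
        sum_right = sum_total - sum_left
        # BSS = reduction in error sum of squares achieved by the split.
        bss = (sum_left ** 2 / n_left
               + sum_right ** 2 / n_right
               - sum_total ** 2 / n_total)
        if bss > best_bss:
            best_left, best_bss = set(ordered[:i + 1]), bss
    return best_left, best_bss

def choose_predictor(predictors, y):
    """predictors maps a name to a list of category codes, one per case.
    Return (name, left_categories, bss) for the predictor whose best
    binary split explains the most variation in y."""
    best = (None, None, 0.0)
    for name, cats in predictors.items():
        left, bss = best_binary_split(cats, y)
        if left is not None and bss > best[2]:
            best = (name, left, bss)
    return best

# Made-up data: six cases, two competing explanatory classifications.
y = [1, 1, 5, 5, 6, 6]
predictors = {
    "region": ["a", "a", "b", "b", "c", "c"],
    "sex": ["m", "f", "m", "f", "m", "f"],
}
name, left, bss = choose_predictor(predictors, y)
print(name, left, bss)
```

In this made-up example the split of "region" into {a} versus {b, c} is chosen, because "sex" explains none of the variation in the criterion.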

With funds from the National Science Foundation, such a program, called the Automatic Interaction Detector, was developed, put into the Institute for Social Research's OSIRIS software, and documented in a monograph, Searching for Structure, by Sonquist, Baker and Morgan in 1971. The program has been used widely, but it was written for the mainframe.

A version of most of the OSIRIS software was developed for the PC by Neal VanEck and called MICROSIRIS. We have recently created an enhanced version of SEARCH that can handle four kinds of criterion variables: means, simple covariances, classifications, or rankings. It has other improvements as well, most notably a hierarchical summary table that, with a little editing, can be made into a publishable table. Tree diagrams look nice, but are difficult to use, particularly where the criterion is a set of classes or a simple regression. It is also easy to generate expected values for further analysis or for assignment of missing information. This new version is incorporated into MICROSIRIS and is available from Neal at vaneckn@erols.com for a small charge. (Or see http://vaneckn.ws) A stand-alone version of SEARCH is available free from the ISR at www.isr.umich.edu/src/search, along with a fuller explanation and justification and the documentation. It is easily usable with any data management software, though the instructions assume SAS is available.

In the meantime, many similar programs have appeared, ranging from free academic versions to very expensive commercial versions. A whole industry calling itself data mining, and often quite vague about just what it is doing, is offering alternatives. We released our source code to SAS in the hope that they would put a version into their research software. Instead, they offer only an extremely expensive version in their commercial software. Since the original development was funded by the National Science Foundation, it is our view that a version of the program should be freely available, with charges only for consultation or for programs with substantial improvements that required research and development. Actually, there do not seem to be such improvements, except in our enhanced SEARCH program, which we funded ourselves. Comparisons show that many programs provide similar results.

jnmorgan@umich.edu
