KDD Nugget 94:13, e-mailed 94-07-14

Contents:
* G. Piatetsky-Shapiro, new release of KD Mine at http://info.gte.com/~kdd/
* D. Lim, Query: Fractal Database product by Cross/Z ?
* D. Lim, article on DATA MINING: TAPPING INTO THE MOTHER LODE

The KDD Nuggets is a moderated list for the exchange of information relevant to Knowledge Discovery in Databases (KDD, also known as Data Mining), e.g. application descriptions, conference announcements, tool reviews, information requests, interesting ideas, clever opinions, etc. It has been coming out about every two to three weeks, depending on the quantity and urgency of submissions.

Back issues, FAQ, and other KDD-related information are now available via Mosaic, URL http://info.gte.com/~kdd/ or by anonymous ftp to ftp.gte.com, cd /pub/kdd, get README

E-mail contributions to kdd@gte.com
Add/delete requests to kdd-request@gte.com

-- Gregory Piatetsky-Shapiro (moderator)

********************* Official disclaimer ***********************************
* All opinions expressed herein are those of the writers (or the moderator) *
* and not necessarily of their respective employers (or GTE Laboratories)   *
*****************************************************************************

~~~~~~~~~~~~ Quotable Quote ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The old saying: statistics are like a bikini--what they reveal is tantalizing, what they cover up is vital.
from usenet
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-----------------------------
From: G. Piatetsky-Shapiro (gps@gte.com)
Subject: New Release of Knowledge Discovery Mine
Date: Thu, 14 July 1994

Knowledge Discovery Mine at http://info.gte.com/~kdd/ has been redesigned and extended. The new release, besides better graphics (many thanks to Chris Matheus), also has a hypertext catalog of many commercial and public-domain KDD-related tools, a larger list of publications and references, and pointers to other information servers and KDD-related homepages.
Comments and contributions to Knowledge Discovery Mine are welcome -- please e-mail them to kdd@gte.com.

-----------------------------
From: "Desmond Lim"
Date: Thu Jun 23 19:00:32 1994
Subject: Fractal Database product by Cross/Z ?

In the February 1994 issue of the Database Programming & Design magazine there was an article on Data Mining, and it featured a Fractal Database by a company called Cross/Z International Inc. of Great Neck, New York. I was wondering if anyone knew how this was being performed?

Desmond Lim
Senior Engineer - Technical Support
Seagate Technology International

-----------------------------
Date: Thu Jun 23 19:00:32 1994
From: "Desmond Lim"
Subject: Data Mining: Tapping into the Mother Lode

Here is the article from the Database Programming & Design, February 1994 issue by Lisa Lewinson:

Data: It's growing, stretching our ability to store it. A new wave of "mining" tools just might lead us to data's real value - as information

DATA MINING: TAPPING INTO THE MOTHER LODE

For over three decades, companies have been collecting massive amounts of data on a myriad of topics, from customers to inventory to invoices. This glut of data often lands in the nether reaches of the data world - on tape, in vaults of memory on mainframes, and in specially commissioned storage shops. The data typically sits there, unused, until an auditor or government regulation deems it safe to throw out.

This scenario is changing radically as corporations around the world take a second look at the power hidden in their databases. Instead of files figuratively gathering dust, historical data is now being viewed as an invaluable, proprietary resource that can uncover patterns and hidden meanings that can actually predict the future. Businesses place a high value on forecasting.
These organizations want to use their data to make accurate predictions about several critical issues: inventory amounts, response to mailings or bonus offers, fraudulent credit card usage, the cost of insurance claims, product demand, which loans will go bad, or which customers will "churn" (that is, leave one vendor for another). Armed with accurate forecasts, businesses can save millions of dollars.

Analyzing historical data to find patterns that shed light on the present is loosely described as "data mining." Data mining not only answers predictive business questions; it can also reveal the most important attributes influencing the predicted answer. In many cases, being able to reach this level of understanding proves at least as important as the prediction itself. This fact is why, says Casey Klimasauskas, president of NeuralWare Inc., "Data mining is a powerful levee to help hold back the Mississippi flood of information."

THE POWER TO THINK

The increasing power of PCs has already begun to usher in new ways of presenting and accessing data, text, multimedia, and other types of information. Powerful PCs have also made data mining commercially viable. Calculations that would take days on mainframes (assuming time was available to run the calculations) or would be impossible on less-powerful PCs or workstations can often be done within several hours on a typical 80486 PC.

While some companies have already employed statisticians to build "models" that project trends in the business, traditionally, only two to four models are completed in a year. This figure hardly keeps up with the demand to understand business problems. Data mining technology can roll out predictive models in a day or a week.

One of the best-known ways to mine data is with a neural network. Neural network software is a complex mathematical computer model of the way a collection of brain cells, called neurons, operates - that is, learns from experience, develops rules, and recognizes patterns.
Neural "nets" are designed for pattern recognition among complex data elements. Using various types of algorithms, neural nets are typically applied to between 300 and 6,000 rows of data (although considerably more or less data is possible) and use selected attributes (typically from two to 200) as "inputs." The neural network user must experiment to find the best set of inputs to "train" the net.

To operate efficiently, a neural network requires clean data and considerable data preparation. Neural nets also work only on numeric data. If, for example, symbolic data such as "type" columns are included in the input attributes, each type must be converted into numeric format.

Both new and experienced users find that building an effective neural network model usually requires a number of tries, with the attributes used as inputs being continually honed and refined. Neural networks are not learned overnight, but those who have worked through the process report that the effort is well worth it.

NEURAL NETWORKS APPLIED

A leading vendor of neural net software is NeuralWare of Pittsburgh, Pennsylvania. Founded in 1987 by the husband and wife team of Jane and Casey Klimasauskas, the company has designed its business to make knowledge about neural nets as accessible as possible.

"In many companies," points out Casey Klimasauskas, "neural networks are a natural for data analysts, business users, and statisticians who work with the data to understand how it relates to the company's business. One issue that arises among those who have read about neural networks and still don't understand them is this: If I give you a book on COBOL programming and then ask you, 'how do you balance a checkbook with it,' the answer is not obvious. Similarly, MIS people may read a book on neural networks, look at the problem they're trying to solve, and not see a connection."
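[Illustration of the point above that symbolic "type" columns must be converted into numeric format before a net can use them. The Python sketch below uses one-hot encoding, which is only one common choice; the article does not say which encoding NeuralWare or any other vendor applies. The function name and the sample account types are hypothetical.]

```python
def one_hot_encode(values):
    """Turn a column of symbols into 0/1 indicator vectors.

    Assumption for illustration: one indicator per distinct symbol,
    with the symbols ordered alphabetically.
    """
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1  # mark the slot for this symbol
        encoded.append(row)
    return categories, encoded


# Hypothetical "type" column from a customer table.
account_types = ["checking", "savings", "loan", "savings"]
categories, encoded = one_hot_encode(account_types)

for symbol, row in zip(account_types, encoded):
    print(symbol, row)
```

[Each symbol gets its own input, so the net is not handed an arbitrary ordering such as checking=1, loan=2, savings=3. The cost is one extra input attribute per distinct symbol.]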
Klimasauskas related that in developing NeuralWare's training course, they knew they would have to bring potential users through a systematic, start-to-finish methodology, showing building blocks for reaching a solution.

Klimasauskas points out that one of the keys to success in using neural networks is having access to corporate data. "The data organization can facilitate this process," he states, "or they can massively hinder it. The MIS department should realize that with neural networks, they are using a mathematical technique for clustering things together. That means that a lot of fields in databases that are begrudgingly kept in customer databases, for example, really become important. MIS has a big role in validating and enhancing the quality of data. Data mining technology puts more emphasis on the importance of clean data." In other words, to use sophisticated tools of prognostication, end-user analysts are dependent on MIS making all of a corporation's data accessible, usable, and as clean as possible.

NeuralWare starts its training with a general, four-day course titled "Applying Neural Computing in Business, Industry, and Government." Most of the time is spent understanding the types of problems to use in predicting and building a methodology for using a neural net. The trainees are then told to spend a month using the software, then to return for training in their specific application.

"A neural network is math, not magic," adds Jane Klimasauskas. Yet, she claims that a degree in statistics and modelling is not required to master neural network software. However, the user should have a good knowledge of the domain, a willingness to learn the new technology, and the time to spend experimenting.

The financial industry has been a fertile ground for neural nets. Traders and asset managers have been using neural nets for trend analysis and pattern recognition.
Susan Garavaglia, a director in the analytical services department at Dun and Bradstreet Information Services, N.A., has been using NeuralWare products for more than a year for credit evaluation and marketing. "Our customers asked us about neural nets," says Garavaglia. "We use the software to deliver a model to them. It has given us the resources and capability to do case studies and work more closely with our customers."

CUSTOMIZE SOLUTIONS

Another leading vendor of neural network technology is HNC of San Diego, California. HNC produces the DataBase Mining Workstation (DMW). HNC gears its efforts toward providing customized applications for customers, and primarily works through value added resellers (VARs) who tailor the DMW to specific applications.

Randy Richardson, president of Customer Insight Co. Inc. of Englewood, Colorado, is a VAR for HNC. Richardson was already in the business of selling sophisticated, stand-alone customer databases for large corporations when he encountered HNC in 1992. Richardson dedicated his best programmer for six months to writing an interface between his proprietary database and the DMW in the belief that the DMW would provide a valuable service to his customers.

Richardson is actively selling the combined package at prices in the six- to seven-figure range. He explains to chief financial officers that while a neural network may not produce better results than an in-house statistician, a high "opportunity cost" is accrued to the business from not having necessary models built. At one large cellular phone company, Richardson convinced the CFO that the DMW could save $450 million in one year simply by accurately predicting customers who would "churn." Richardson sells the customized package, the DMW, and 20 days of consulting for $80,000. "If I only sell the workstation," Richardson says, "I haven't solved their problem."
This is because, adds Allen Jost, vice president of HNC's Decision Systems division, "In many cases, customers spend far more time organizing their data than modeling it. Getting the data organized is where [Richardson's] Customer Insight Co. really makes a difference."

Carol Klenke, micro marketing manager for First Commerce Corp., a large banking concern in Louisiana, is another DMW customer. Introduced to the DMW through Richardson's Customer Insight Co., she attended a three- and a five-day training session with HNC; afterward, she felt empowered to predict business problems. One of Klenke's first challenges was to determine the best customers for a marketing campaign involving auto loans. The attempt was so successful that the news travelled throughout the bank's 90 branches.

Now, an associate of Klenke's works full-time building predictive models. "We are exploring how we can predict customer retention: who will stay with the bank and who will not. We're using the DMW for tracking direct mail. We experiment with different offers we create. When the results come in, we analyze them on the DMW to see if we had targeted the right group and if we should use that group again." Klenke says that she uses data samples of 1,000 records to come up with the results.

"At First Commerce, we believe in investing in technology and we definitely plan to illustrate the payback we've achieved with DMW," Klenke says. "We're happy about the buy-in throughout the bank. The bank card area, mortgage area, and the branches are all enthused. All the branches would like the retention analysis."

SWORDS TO PLOWSHARES

Another technology that performs predictive data modelling is fractal geometry. Fractal geometry is based on work originally applied to compression of terrain images for cruise missile projects. It is a mathematical means of compressing data. The compression occurs with no data loss, so an entire set of records, rather than a sample, can be analyzed.
Since this technology can work on many gigabytes of data at once, it offers intriguing possibilities for companies that, for instance, may want to locate the three customers out of 30 million who responded in a specified manner. This type of query could take days to process, even on hardware designed for gigabytes of data.

One vendor of fractal technology is Cross/Z International Inc. of Great Neck, New York. Cross/Z takes a selected portion of a client's entire database and, using its own IBM MVS-based mainframe, transforms the database into a fractal database that can be used as a PC-based file view of the data. Clients can then access the file via a DOS-based front-end tool called Private Eye, which designs and examines views of the fractalized data. For example, using Private Eye, you can determine that the largest response to a massive mailing came from a certain ZIP code and age bracket. A market analyst could then determine how customer response related to occupation for this subgroup. Analysts would have the security of knowing that they are dealing with full counts, rather than samples.

Do neural nets and fractal geometry products work on the same genre of problems? According to William Gillet, Cross/Z's vice president of business development, "Neural networks typically work on a subset of the data. We work on mission-critical problems with millions of rows of data. A client who wants to optimize a mailing may send us five million records, including who they mailed to and who responded. We build a fractal database on the entire file, splitting out a validation group, if required. A fractal model is then built, which is integrated with a front-end tool to display the results: a ranked file on the most likely persons to respond to the next mailing."

Jane Blume, a senior marketing manager at American Express, confirms that Cross/Z was used to help identify customers who would upgrade their Executive Corporate card from green to gold.
American Express's original mailing had received a four-percent response. This group, plus two million additional customer records, was turned over to Cross/Z to determine who should receive the next mailing.

According to American Express's Blume, "Cross/Z broke them out into 10 categories, with a number one as the most likely to respond. We sent out another mailing using the Cross/Z model, and it beat our plan. We had hoped for a 4.84 percent response; [the Cross/Z engineered plan] came in at 5.3 percent, exceeding our plan by 11 percent. Now the model will be updated again, looking at the 5.3 percent who did respond and those who didn't. Our goal is to upgrade constantly."

Stuart Spencer, marketing manager at American Express, adds, "The people at Cross/Z recognize that what they put together is complex. They bring it down to user-friendly levels. Without them, we would still be waffling in the vagaries of direct mail."

Cross/Z's Gillet says the company charges $13,500 to build a single model and offers a volume-based discount to build a series of models. Gillet says that the company builds more than 200 models per year. In addition to American Express, current customers include Allstate and Federal Express. In 1994, Cross/Z plans to introduce a software product that will enable companies to build their own fractal models onsite.

CASE-BASED REASONING

Cognitive Systems of Boston, Massachusetts takes yet another approach to data mining. The firm produces ReMind, a case-based reasoning tool. Case-based reasoning uses past experiences (as reflected in textual data) to solve current problems. In case-based reasoning, past cases are represented, indexed, and stored in a computer so they can be retrieved in the best possible manner. Case-based reasoning software builds up its own set of historic examples; each case added to its list of examples helps the computer learn.
Using past examples, a case-based reasoning system is able to justify and explain how it arrived at a result and lets a user look at real instances of past occurrences.

Steve Mott, president of Cognitive Systems of Stamford, Connecticut, points out that companies have spent hundreds of billions of dollars over the past 10 years to create relational databases in the hopes of capturing a range of data. "Only 10 percent," he asserts, "are really delivering benefits of the original investment. Tools for current relational databases do not lend themselves to mining data. SQL is not accessible to the average person unless you know the structure and format of the data, and SQL packages don't do complex queries. When people want to know trends and patterns, you can't get that from SQL."

In addition, says Mott, "Data as it is collected today has a strong textual component. But neural nets convert text into numbers. Textual codes are given numeric equivalents. This translation can cause problems because the text's subtle distinctions and context are often important. Categories might have inherited relationships. A neural net has no prayer, for example, of trying to process input text in a help field, including all the problem reports and calls that go into the database."

In order to resolve these problems, Cognitive Systems gave its ReMind tool a natural-language component. A number of ReMind's customers use the software for powerful help desk applications. "One customer has a 50,000 query case library," says Mott, "meaning that new queries can be matched against 50,000 samples in the library. It has a 95- to 96-percent level of accuracy." Cognitive Systems is now building case-based reasoning templates for the banking industry in the areas of investment selection, bankruptcy prediction, and credit risk.
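[Illustration: the matching of new queries against a case library that Mott describes can be sketched, very loosely, as nearest-neighbor retrieval. The Python below is a minimal sketch under stated assumptions and is not ReMind's actual method. The similarity measure (fraction of matching attributes), the function names, and the help-desk cases are all hypothetical.]

```python
def similarity(query, case):
    """Fraction of the query's attributes that the stored case matches.

    Assumption for illustration: every attribute counts equally.
    """
    features = case["features"]
    matches = sum(1 for key, value in query.items() if features.get(key) == value)
    return matches / len(query)


def retrieve(query, library, k=1):
    """Return the k stored cases most similar to the query."""
    ranked = sorted(library, key=lambda case: similarity(query, case), reverse=True)
    return ranked[:k]


# Hypothetical help-desk case library.
library = [
    {"features": {"device": "printer", "symptom": "no power"},
     "solution": "check the power cable"},
    {"features": {"device": "printer", "symptom": "paper jam"},
     "solution": "clear the paper tray"},
    {"features": {"device": "modem", "symptom": "no dial tone"},
     "solution": "check the phone line"},
]

query = {"device": "printer", "symptom": "paper jam"}
best = retrieve(query, library)[0]
print(best["solution"])

# Retain the solved query as a new case, so the library grows with use.
library.append({"features": dict(query), "solution": best["solution"]})
```

[Because the answer is a stored case rather than a set of weights, the system can show the user the past instance it matched, which is the explanation ability the article attributes to case-based reasoning.]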
At a major food manufacturing company, Peter Ducksbury Smith, a principal senior scientist, is using ReMind in a process control application, where a new, high-tech system installed in a manufacturing plant is logging 100 analogs every 30 seconds for seven machines. Before Smith began work on the project, the company was throwing the data away every three days because it was impossible to interpret or store it. No more. Before the data is scrapped, Smith uses neural nets and statistical induction to come up with a model of what is actually taking place in the massive data-gathering mechanisms. He then uses ReMind to interpret the findings of the neural nets, thus overcoming the "black box" (limited explanation capabilities) of neural networks.

"ReMind makes the data more useful and finds things out that people didn't know about," says Smith. "The results are very promising. I'm not seen as a crazy scientist sitting off in the corner. I'm now seen as making the data useful to the manufacturing plant managers."

MODERN GOLD RUSH

Neural nets, fractal geometry, and case-based reasoning are by no means the only technologies available for data mining. Three other examples of software vendors now donning a data miner's helmet include: Abtech Corp. of Charlottesville, Virginia, which is using an abductive network modeling approach; Teranet of Nanaimo, B.C., Canada, which is marketing ModelWare, the Universal Process Modeling Algorithm; and Reduct Systems Inc. of Regina, Saskatchewan, Canada, which has released Datalogic/R, based on rough sets.

Many companies are finding that a suite of data mining tools is most helpful in generating predictive models for business problems. The vendors, although excited by the wide application of data mining technologies, realize that some problems are better solved by other technologies, and are trying to advise their clients wisely.
Whatever approach is taken to data mining, however, more business users and CFOs are waking to the real, bottom-line gold in "them thar" databases. In the rush to test and use data mining techniques, DBAs will find the spotlight on them. The years of effort already spent in building a richer set of attributes in relational databases, enforcing ranges and type codes, and analyzing and codifying data's meaning and content will pay off more handsomely than ever before.

Lisa Lewinson is president of Northstar Consulting Inc., a Chicago-based firm that develops data mining applications and knowledge-based systems. She can be reached at (708)-786-3922.

TARGETS FOR DATA MINING

The potential applications for data mining technology are many. Here are a few currently being addressed by data mining:

Marketing: predicting which customers will respond to a mailing or buy a particular product; classifying customer demographics.
Banking: forecasting levels of bad loans and fraudulent credit card usage, credit card spending by new customers, and which kinds of customers will respond to (and qualify for) new loan offers.
Manufacturing, sales, and retail: predicting sales; determining correct inventory levels and distribution schedules among outlets.
Manufacturing and production: predicting when to expect machinery failures; finding key factors that control optimization of manufacturing capacity; predicting excessive vibrations in a steel mill when rolling; determining value for circuit trim resistors.
Brokerage and securities trading: predicting when bond prices will change; forecasting the range of stock fluctuation for particular issues and the overall market; determining when to trade stocks.
Insurance: forecasting amount of claims and cost of medical coverage; classifying most important elements that affect medical coverage; predicting which customers will buy new policies.
Computer hardware and software: predicting disk-drive failure; forecasting how long it will take to create new chips; predicting potential security violations.
Government and defense: forecasting the cost of moving military equipment; testing strategies for potential military engagements; predicting consumption of resources.
Medicine: predicting a drug's mechanism of action; classifying anti-cancer agents tested in a drug screening program; allocating testing resources for emergency rooms.

-----------------------------