KDnuggets : News : 2000 : n09 : item4



Date: Mon, 01 May 2000 13:59:59 -0700
From: Boris Kovalerchuk borisk@tahoma.cwu.edu
Subject: Comments on proposed MS OLE DB standard

Comments on the Microsoft draft standard (specification) for Data Mining

Microsoft, with support from data mining companies (ANGOSS Software, Appsource, Comshare, DB Miner Technology, Knosys, Magnify, Megaputer Intelligence, Maximal Innovative Intelligence, NCR, PolyVista and SPSS), has developed a draft standard for data mining (OLE DB for Data Mining, DRAFT Specification):

http://www.microsoft.com/presspass/press/2000/Mar00/DataMiningPR.asp
http://www.microsoft.com/data/oledb/

This draft (Version 0.9) is open for public discussion until May 15, 2000.

From our viewpoint, the main goals of these specifications are:

1) to unify terminology,

2) to unify, simplify and speed up communications between databases, data mining tools (called data mining services), mined knowledge (in the form of data mining models) and data mining final output (in the form of forecasts, rankings, distributions, associations, correlations and so on for a particular data set),

3) to help select (automatically) the most appropriate DM services/algorithms for a specific data set.

To solve these tasks, Microsoft specified metadata. These metadata describe each data column (target column, columns used for forecasting the target, numeric data formats, contents of the column, type of the possible DM model and so on).

Similarly, metadata are specified for DM services, characterizing an algorithm's capabilities.

Some flexibility is permitted. DM services can add provider-specific metadata.

Potentially, these two sets of metadata (database metadata and DM service metadata) can be MATCHED AUTOMATICALLY to select an appropriate Data Mining service. This productive idea of matching was probably most clearly articulated by Dhar and Stein in the concept of problem ID [Intelligent Decision Support Methods, Prentice Hall, 1997].
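To make the matching idea concrete, here is a minimal Python sketch. It is not taken from the Microsoft draft: the class names `ColumnMeta` and `ServiceMeta`, the function `matching_services`, the sample columns and the two sample services are all hypothetical, and the rule used (a service matches only if it accepts the content type of every non-key column) is just one simple assumption about how matching could work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMeta:
    """Hypothetical database-side metadata for one column."""
    name: str
    content: str             # content flag, e.g. "KEY", "DISCRETE", "CONTINUOUS"
    is_target: bool = False  # True for the column to be predicted


@dataclass(frozen=True)
class ServiceMeta:
    """Hypothetical service-side metadata describing an algorithm's capabilities."""
    name: str
    input_contents: frozenset   # content flags accepted for input columns
    target_contents: frozenset  # content flags the service can predict


def matching_services(columns, services):
    """Return the names of services whose capabilities cover every column."""
    matches = []
    for svc in services:
        covered = all(
            col.content in (svc.target_contents if col.is_target else svc.input_contents)
            for col in columns
            if col.content != "KEY"  # assumption: key columns only identify rows
        )
        if covered:
            matches.append(svc.name)
    return matches


columns = [
    ColumnMeta("customer_id", "KEY"),
    ColumnMeta("age", "CONTINUOUS"),
    ColumnMeta("gender", "DISCRETE"),
    ColumnMeta("churned", "DISCRETE", is_target=True),
]
services = [
    ServiceMeta("tree_classifier",
                frozenset({"DISCRETE", "CONTINUOUS"}), frozenset({"DISCRETE"})),
    ServiceMeta("linear_regression",
                frozenset({"CONTINUOUS"}), frozenset({"CONTINUOUS"})),
]

print(matching_services(columns, services))  # prints ['tree_classifier']
```

The second service is rejected because it declares no support for the discrete `gender` column, so the quality of the match depends entirely on how well the declared flags describe the data.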

For this matching to work, it is critical that the specifications capture the REALLY IMPORTANT FEATURES OF BOTH DATA AND ALGORITHMS and are flexible enough to incorporate future algorithm developments and improvements. From our viewpoint, the most sensitive component for matching is the TYPE OF CONTENTS OF COLUMNS. Microsoft suggests the following flags for types of data contents: key, discrete, continuous, cyclical, ordering, probability distribution and so on. For instance, the probability flag permits matching services that work with probability distributions to databases that contain probability data. However, matching for discrete, ordered, continuous and some other content data types is not so obvious.
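One way to see why this matching is not obvious is the following small Python sketch. The flag spellings follow the list above, but the `IMPLIES` table and both functions are our own illustrative assumptions, not part of the draft. A service that only needs ordered values is rejected by exact flag comparison when the column is flagged continuous, although continuous numeric values are naturally ordered. A column flagged discrete may or may not be ordered, so the flag alone cannot settle the match.

```python
# Hypothetical implication table: each column flag maps to the set of
# properties it is assumed to guarantee. These entries are an assumption
# made for illustration; the draft does not define such a table.
IMPLIES = {
    "CONTINUOUS": {"CONTINUOUS", "ORDERED"},
    "ORDERED": {"ORDERED"},
    "CYCLICAL": {"CYCLICAL"},
    "DISCRETE": {"DISCRETE"},  # may be nominal or ordinal; order is unknown
}


def exact_match(column_flag, required_flag):
    """Match only when the two flags are literally the same."""
    return column_flag == required_flag


def implied_match(column_flag, required_flag):
    """Match when the column flag guarantees the required property."""
    return required_flag in IMPLIES.get(column_flag, {column_flag})


print(exact_match("CONTINUOUS", "ORDERED"))    # False
print(implied_match("CONTINUOUS", "ORDERED"))  # True
print(implied_match("DISCRETE", "ORDERED"))    # False
```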

There are two DIFFICULTIES:

1) Terminology (the same terms should have the same meaning for DM consumers and providers)

2) OLE DB for DM Grammar (Microsoft draft, p.80) should permit adequate matching.

The analysis of these difficulties and some suggestions are presented in the full document, http://www.cwu.edu/~borisk/data_mining/ms-draft-comments.html.

SUMMARY OF SUGGESTIONS:

1. Extend the Grammar with more content data types and their combinations, such as NOMINAL and some others, and permit REFERENCES for data types presented as APIs. Make the terms describing content consistent with the terms that have been used in MEASUREMENT THEORY for many years [D. Krantz, R. Luce, P. Suppes and A. Tversky, Foundations of Measurement, Academic Press, v. 1-3, 1971, 1989, 1990].

2. Develop APIs that describe content data types as C++/Java classes (OOP approach). For more about this OOP approach, see [Kovalerchuk B., Vityaev E., Data Mining in Finance: Advances in Relational and Hybrid Methods, Kluwer, 2000, pp. 164, 169-186].
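The two suggestions above can be illustrated with a rough sketch of content data types as classes. The suggestion names C++/Java; Python is used here only for brevity. The scale names (nominal, ordinal, interval) are standard measurement-theory terms, while the class names, the `operations` attribute and the `usable_by` helper are hypothetical and are not the API described in the cited book. Each class declares which operations are meaningful for its values, so a DM service can state the operations it needs instead of relying on a flag name.

```python
class NominalType:
    """Nominal scale: only equality of values is meaningful."""
    operations = frozenset({"equal"})

    def equal(self, a, b):
        return a == b

    def supports(self, operation):
        return operation in self.operations


class OrdinalType(NominalType):
    """Ordinal scale: values can also be ranked; differences are undefined."""
    operations = NominalType.operations | {"less"}

    def __init__(self, ordered_values):
        # The position in the list defines the rank of each value.
        self._rank = {value: i for i, value in enumerate(ordered_values)}

    def less(self, a, b):
        return self._rank[a] < self._rank[b]


class IntervalType(NominalType):
    """Interval scale: numeric values with meaningful order and differences."""
    operations = NominalType.operations | {"less", "difference"}

    def less(self, a, b):
        return a < b

    def difference(self, a, b):
        return a - b


def usable_by(content_type, required_operations):
    """True if the content type supports every operation a service needs."""
    return all(content_type.supports(op) for op in required_operations)


colour = NominalType()
rating = OrdinalType(["low", "medium", "high"])
temperature = IntervalType()

print(rating.less("low", "high"))                        # True
print(usable_by(colour, {"less"}))                       # False
print(usable_by(rating, {"equal", "less"}))              # True
print(usable_by(temperature, {"less", "difference"}))    # True
```

With such classes, a column declared as `OrdinalType` can be matched to any service that needs only equality and ordering, which a single flag such as discrete cannot express.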

Discussion of the Microsoft specification on data mining in KDnuggets can be an important input for further development of DM applications.

(Full text of this document is at http://www.cwu.edu/~borisk/data_mining/msdraftcom.pdf)

Boris Kovalerchuk, Ph.D.
Dept. of Computer Science, Central Washington University
Ellensburg, WA 98926-7520
ph. (509) 963-1438, fax (509) 963-1449
borisk@tahoma.cwu.edu
http://www.cwu.edu/~borisk/finance


Copyright © 2000 KDnuggets.