Subject: Interview: Usama Fayyad on NASA and Microsoft

3) At JPL/NASA, you developed the SKICAT project, which made out-of-this-world discoveries by analyzing star data. What was the most innovative part of your SKICAT work?

I believe the most innovative part was that we used data sets that had been collected for a different purpose to solve a fundamental problem in the field. The sky survey is done at a certain image resolution and is intended to cover the entire visible sky with a few thousand plates. However, very high-resolution images of a tiny part of the sky covered by each photographic plate are taken by a separate telescope, to help calibrate images and pixel intensities across multiple parts of the sky and multiple times of the year. The big insight was realizing that an astronomer can readily recognize a sky object in the high-resolution image but sees nothing but a few unintelligible pixels in the regular survey plate image. The challenge, then, was to find out whether those few low-resolution pixels contained enough latent information to allow a machine to match the human recognition performed on the corresponding high-resolution image.

The mining algorithms succeeded in extracting accurate classification capability from measurements taken over the few low-resolution pixels, which allowed us to outperform highly trained astronomers on images for which no high-resolution views were available. This looked like magic to astronomers. We wound up solving problems that astronomers had struggled with for decades, all thanks to algorithms that could take advantage of a large number of measured variables per sky object. It turned out that the classification was inherently high-dimensional, and hence a human-driven analysis approach, using traditional statistical techniques, was not going to yield the required accuracy.

That is an excellent example of a classic data mining application: lots of data, a difficult analysis task, and users who are very motivated to work with you to solve the problem. This kind of setting lets data mining algorithms shine at what they excel at: searching for useful dimensions among a large set of possibilities.

4) After JPL, you joined Microsoft Research, where you led a data mining group that developed a number of data mining components for Microsoft products. What are some of the difficulties and successes of translating research into product development?

I started out at Microsoft Research and grew the Data Mining & Exploration group there. The charter was to continue to do basic science. At one point, we also decided to form a parallel product group to ship the data mining components and a new API as part of SQL Server 2000. Forming the product group and focusing it on development was a crucial transition. It required me to give up many of the freedoms that are the perks of research life and to take on the responsibilities that come with shipping a product: living with the product team, relocating to their building, and so on.

The major difficulties in getting the fruits of research into a real industrial product all revolve around changing the perspective and the definition of the technology so that it becomes something a development group can live with and work with. It was key to understand that for a platform company like Microsoft, pushing data mining algorithms and techniques was the WRONG thing to advocate. I had to come up with a new view of data mining as a reasonable and natural extension of the platform. Hence, we could not build tools intended for specialists and Ph.D.s. We had to figure out how to package the technology so that a developer building on the database platform could use it in a natural manner. This also meant integrating the requirements into the platform and letting the system manage the scary details.

For example, in data mining you create a model, you train a model, you update a model, you apply a model, and you delete a model. The magic contribution of working with a product group was getting them to work out the natural analogies. The CREATE statement in SQL does exactly what you need: it defines the structure of a model and makes it a first-class citizen of the DBMS, just like a table. Training was akin to INSERT INTO, where you insert data into a table. Applying the model, say for prediction, was akin to the JOIN operator. This was a whole new way of thinking about data mining operations, but in the world of SQL these were familiar and well-understood notions. These are the kinds of considerations you never face in the advanced research world, yet they are crucial if the technology is to be adopted by product groups.
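To make the analogy concrete, here is a minimal sketch of the three operations in the SQL-style data mining syntax (OLE DB for Data Mining, later known as DMX) that shipped with SQL Server 2000 Analysis Services. The model name, columns, data source, and queries are hypothetical, and the statements are meant to illustrate the analogy rather than serve as a syntax reference:

    -- Define a mining model as a first-class schema object, just like CREATE TABLE
    CREATE MINING MODEL CreditRisk
    (
        CustomerId  LONG  KEY,
        Age         LONG  CONTINUOUS,
        Income      LONG  CONTINUOUS,
        Risk        TEXT  DISCRETE  PREDICT   -- the attribute the model learns to predict
    )
    USING Microsoft_Decision_Trees

    -- Train the model by inserting cases into it, just as INSERT INTO fills a table
    INSERT INTO CreditRisk (CustomerId, Age, Income, Risk)
    OPENQUERY(CustomerDataSource,
        'SELECT CustomerId, Age, Income, Risk FROM Customers')

    -- Apply the trained model to new cases: prediction expressed as a JOIN
    SELECT t.CustomerId, CreditRisk.Risk
    FROM CreditRisk
    PREDICTION JOIN
        OPENQUERY(CustomerDataSource,
            'SELECT CustomerId, Age, Income FROM NewApplicants') AS t
    ON CreditRisk.Age = t.Age AND CreditRisk.Income = t.Income

The payoff of this design is that a mining model behaves like any other database object: a developer building on the platform can create, populate, query, and drop it with familiar statement forms, with no specialist tooling required.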

