KDnuggets : News : 2005 : n20 : item4

Features

From: Gregory Piatetsky-Shapiro
Date: 24 Oct 2005
Subject: Usama Fayyad Interview, 2: on data mining challenges

Q2. GPS: What are some of the biggest data mining challenges you face now at Yahoo?

Usama Fayyad: Yahoo!'s users, through their use of our network of products, generate over 10 terabytes of data per day. This is the equivalent of the entire text contents of the Library of Congress. This is data that describes product usage; it does not include content, email, images, etc.

The first and largest challenge is the ability to capture all of this data reliably, process it, reduce it, and use it to feed the many reports, applications, data warehouses, data marts, dashboards, and scorecards across the company and its businesses. This is a game of reliability and scalability. You cannot fall behind, because you can never catch up if you do. Because this data stream is always growing (Yahoo! now serves over 410 million unique users a month!), you cannot just plan for the existing data load; you must always be building ahead of the game. There simply is not enough time in the day to play catch-up or reprocess the data. Also, this data comes from thousands of servers around the world, with new servers being added and old ones replaced all the time...

A second challenge is defining metrics that are central to the business and understandable by the business units. This is a very tricky area. Yahoo! is in a wide range of businesses and verticals. Figuring out how to process the data and present the results in ways that are actionable by the businesses is not easy, especially in the Internet space where things change fast and on an ongoing basis. This also includes keeping up with new pages and new products being launched almost on a daily basis -- this is an environment that is very far from static! In addition, it is a poorly understood area: no one knows how to measure the health of an Interactive Business in a robust way, so we actually have to build many of these advanced metrics that transcend the very primitive state of metrics in the industry today. This is research and innovation in a live business environment!

The other challenges can be summarized by scale: both in the data mining algorithms and in the management of data mining models. When you are responsible for generating thousands of predictive and classification data mining models, updating them daily, and then using them to produce predictions in real time, a huge challenge is to make sure that all these models are updated and reading the correct information, and that their outputs are checked, given the notorious sensitivity of data mining models to changes in the data or outliers. Many companies find it challenging to run a handful of models; we have to run and maintain thousands. This is a scale that is unfamiliar to most practitioners in our field, and it requires systematic and product-like thinking -- not just analysis-oriented thinking.



Copyright © 2005 KDnuggets.