KDnuggets Home » News » 2010 » Apr » Publications » Google large scale ML lessons (10:n08)

Google large scale machine learning system


 
  
If data is abundant, then often a more fruitful approach is to design a highly scalable learning system and use several orders of magnitude more training data.


Lessons learned developing a practical large scale machine learning system

Google Research Blog, April 06, 2010, Simon Tong

When faced with a hard prediction problem, one possible approach is to attempt to perform statistical miracles on a small training set. If data is abundant, then often a more fruitful approach is to design a highly scalable learning system and use several orders of magnitude more training data.

This general notion recurs in many other fields as well. For example, processing large quantities of data helps immensely for information retrieval and machine translation.

Several years ago we began developing a large scale machine learning system, and have been refining it over time. We gave it the codename "Seti" because it searches for signals in a large space. It scales to massive data sets and has become one of the most broadly used classification systems at Google.

After building a few initial prototypes, we quickly settled on a system with the following properties:

  • Binary classification (produces a probability estimate of the class label)
  • Parallelized
  • Scales to process hundreds of billions of instances and beyond
  • Scales to billions of features and beyond
  • Automatically identifies useful combinations of features
  • Accuracy is competitive with state-of-the-art classifiers
  • Reacts to new data within minutes

Seti's accuracy appears to be pretty decent. For example, tests on standard smaller datasets indicate that it is comparable with modern classifiers.
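
The post does not say which algorithm Seti uses, so the following is only a minimal sketch of how a few of the listed properties can fit together, not Seti's actual design. It assumes online logistic regression (a probability estimate per instance, updated one example at a time), the hashing trick to keep a very large feature space in a fixed-size index range, and brute-force pairwise feature crosses as a crude stand-in for automatically identified feature combinations. Every name in it (`NUM_BUCKETS`, `bucket`, `featurize`, `OnlineLogisticRegression`) and the toy data are hypothetical.

```python
import math
import zlib
from itertools import combinations

NUM_BUCKETS = 2 ** 18  # hypothetical size of the hashed feature space


def bucket(name):
    """Map a feature name to a weight index using a stable hash."""
    return zlib.crc32(name.encode("utf-8")) % NUM_BUCKETS


def featurize(tokens):
    """Hashed indices for each feature and for every pair of features."""
    names = sorted(set(tokens))
    crossed = [a + "&" + b for a, b in combinations(names, 2)]
    return [bucket(n) for n in names + crossed]


class OnlineLogisticRegression:
    """Binary classifier trained one instance at a time (assumed, not Seti's)."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate
        self.weights = {}  # sparse storage: only buckets that were updated

    def predict_proba(self, tokens):
        """Estimated probability that the label is 1."""
        score = sum(self.weights.get(i, 0.0) for i in featurize(tokens))
        score = max(-35.0, min(35.0, score))  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, tokens, label):
        """One stochastic gradient step on the log loss for one instance."""
        error = self.predict_proba(tokens) - label
        for i in featurize(tokens):
            self.weights[i] = self.weights.get(i, 0.0) - self.learning_rate * error


# Toy data (made up): only the combination "free" + "money" is positive.
data = [
    (["free", "money"], 1),
    (["free", "lunch"], 0),
    (["money", "transfer"], 0),
    (["team", "lunch"], 0),
]

model = OnlineLogisticRegression()
for _ in range(200):
    for tokens, label in data:
        model.update(tokens, label)

print(model.predict_proba(["free", "money"]))
print(model.predict_proba(["free", "lunch"]))
```

Because each update touches only the features present in one instance, a model of this shape can absorb new data as it arrives, and the weight table never needs more than `NUM_BUCKETS` entries however many distinct feature names appear. Crossing every pair of features is affordable only for short feature lists; a production system would need a more selective way to choose combinations.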

Seti has the flexibility to be used on a broad range of training set sizes and feature sets. These sizes are substantially larger than those typically used in academia (e.g., the largest UCI dataset has 4 million instances). A sample of the data sets used with Seti gives the following statistics:

A good machine learning system is all about accuracy, right?

In the process of designing Seti we made plenty of mistakes. However, we made some good key decisions as well. Here are a few of the practical lessons that we learned. Some are obvious in hindsight, but we did not necessarily realize their importance at the time.

Lesson: Keep it simple (even at the expense of a little accuracy).

Having good accuracy across a variety of domains is very important, and we were tempted to focus exclusively on this aspect of the algorithm. However, in a practical system there are several other aspects of an algorithm that are equally critical:

Lesson: Know when to say "no".

Read more.

