Fundamental Methods of Data Science: Classification, Regression and Similarity Matching

Classification, regression, and similarity matching underpin many of the fundamental algorithms that data science uses to solve business problems such as consumer response prediction and product recommendation.



By Manu Jeevan, Jan 2015.

In this post I will discuss three fundamental methods in data science. These methods are the basis for extracting useful knowledge from data, and they also serve as a foundation for many well-known algorithms in data science. I won’t get into the mathematical details of these methods; rather, I am going to focus on how they are used to solve data-centric business problems.

So let’s get started.

1. Classification and class probability estimation

Classification and class probability estimation attempt to predict, for each individual in a population, which class that individual belongs to. Generally the classes are mutually exclusive. An example of a classification problem would be:

"Among all the customers of Dish, which are likely to respond to a new offer?"

In this example the two classes can be called "will respond" and "will not respond". Given a new individual, the goal of a classification task is to determine which class that individual belongs to. A closely related concept is scoring, or class probability estimation.

A scoring model, when applied to an individual, produces a score representing the probability that the individual belongs to each class. In our customer response example, a scoring model can evaluate each customer and produce a score of how likely that customer is to respond to the offer.
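To make the link between scoring and classification concrete, here is a minimal Python sketch for the customer response example. Everything in it is an assumption made for illustration: the feature names (`past_purchases`, `months_as_customer`), the weights, and the 0.5 threshold are hand-picked, whereas a real scoring model would learn its weights from historical response data. The logistic function is just one common way to turn a weighted sum into a probability-like score.

```python
import math

# Hypothetical, hand-picked weights. A real model would learn these from data.
WEIGHTS = {"past_purchases": 0.8, "months_as_customer": 0.05}
BIAS = -2.0


def response_score(customer):
    """Return a score between 0 and 1: the estimated chance the customer responds."""
    z = BIAS + sum(WEIGHTS[name] * customer[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))  # logistic function


def classify(customer, threshold=0.5):
    """Turn the score into one of the two class labels."""
    if response_score(customer) >= threshold:
        return "will respond"
    return "will not respond"


# Two made-up customers.
customers = [
    {"id": "A", "past_purchases": 5, "months_as_customer": 24},
    {"id": "B", "past_purchases": 0, "months_as_customer": 3},
]

for customer in customers:
    print(customer["id"], round(response_score(customer), 3), classify(customer))
```

The score is what lets you rank customers from most to least likely to respond; the class label is simply that score compared against a threshold you choose.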

2. Regression

Regression is one of the most commonly used methods in forecasting. It tries to predict the value of some real-valued (numerical) variable for each individual. An example regression problem would be: “What will be the cost of a given house?” The variable to be predicted here is the house price, and a model could be produced by looking at other, similar houses in the population and their historical prices. A regression procedure produces a model that, given a house, estimates its price.
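As a rough illustration of the house price example, the sketch below fits a straight line to a handful of historical sales using ordinary least squares and then uses it to estimate the price of a new house. The data is made up, and using a single feature (house size) is an assumption to keep the example short; a realistic model would draw on many more characteristics.

```python
# Made-up historical sales: (size in square metres, sale price).
sales = [(50, 150_000), (70, 200_000), (90, 250_000), (110, 300_000)]


def fit_line(points):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(sales)


def predict_price(size):
    """Estimate the price of a house of the given size."""
    return slope * size + intercept


print(predict_price(100))
```

Notice that the output is a number on a continuous scale rather than a class label.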

Regression is related to classification, but the two are different. In simple terms, classification forecasts whether something will happen, while regression forecasts how much something will happen.

Commit this concept to memory: “Scoring is a classification problem, not a regression problem, because the underlying target (the value you are attempting to predict) is categorical.”

3. Similarity matching

Similarity matching tries to recognize similar individuals based on the information known about them. The underlying assumption is that if two entities (products, services, companies) are similar in some ways, they often share other characteristics as well.

For example, Accenture will be interested in finding customers who are similar to its existing profitable customers, so that it can launch a well-targeted marketing campaign. Accenture can use similarity matching based on the characteristics that define its existing profitable customers (such as company turnover, industry, location, etc.).

Similarity is the underlying principle for making product recommendations (identifying people who are alike in terms of the products they have purchased or liked). Online retailers such as Amazon and Flipkart use similarity to recommend similar products to you. Whenever you see expressions like “People who like A also like B” or “people with your browsing history have also looked at …”, the concept of similarity is being applied.
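Here is a small Python sketch of the “people who like A also like B” idea. It is only an illustration under my own assumptions: the shoppers and their purchases are made up, and Jaccard similarity (the share of products two people have in common) is one of several possible similarity measures, not necessarily what any particular retailer uses.

```python
def jaccard(a, b):
    """Share of items two sets have in common (0 = nothing, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up purchase histories.
purchases = {
    "you": {"laptop", "mouse", "keyboard"},
    "ann": {"laptop", "mouse", "monitor"},
    "bob": {"novel", "cookbook"},
}


def most_similar(target, all_purchases):
    """Find the other shopper whose purchases look most like the target's."""
    others = [name for name in all_purchases if name != target]
    return max(
        others,
        key=lambda name: jaccard(all_purchases[target], all_purchases[name]),
    )


def recommend(target, all_purchases):
    """Suggest what the most similar shopper bought that the target has not."""
    neighbour = most_similar(target, all_purchases)
    return sorted(all_purchases[neighbour] - all_purchases[target])


print(most_similar("you", purchases))  # ann
print(recommend("you", purchases))     # ['monitor']
```

The same pattern applies to the customer targeting example: swap the purchase sets for customer characteristics and pick a similarity measure suited to them.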

Conclusion

I talked about classification, regression and similarity matching in this post. I strongly believe the application of these fundamental methods to business problems is far more important than their algorithmic details. Important things to keep in mind are:

  • Scoring is a classification technique, not a regression technique.
  • The difference between classification and regression.
  • How similarity matching is used to find similar customers.

Manu Jeevan is a self-taught data scientist and blogger at BigDataExaminer, where he writes about Data Science, Statistics, Python and Machine Learning, to help others learn data science.
