What questions can data science answer?
There are only five questions machine learning can answer: Is this A or B? Is this weird? How much/how many? How is it organized? What should I do next? We examine these questions in detail and what they imply for data science.
As amazing as it is, there are only five questions machine learning can answer:
- Is this A or B?
- Is this weird?
- How much/how many?
- How is it organized?
- What should I do next?
Machine learning (ML) is the motor that drives data science. Each ML method (also called an algorithm) takes in data, turns it over, and spits out an answer. ML algorithms do the part of data science that is the trickiest to explain and the most fun to work with. That’s where the mathematical magic happens.
ML algorithms can be grouped into families based on the type of question they answer. These families can help guide your thinking as you formulate your razor-sharp question.
Is this A or B?
This family is formally known as two-class classification. It’s useful for any question that has just two possible answers: yes or no, on or off, smoking or non-smoking, purchased or not. Lots of data science questions sound like this or can be rephrased to fit this form. It’s the simplest and most commonly asked data science question. Here are a few typical examples.
- Will this customer renew their subscription?
- Is this an image of a cat or a dog?
- Will this customer click on the top link?
- Will this tire fail in the next thousand miles?
- Does the $5 coupon or the 25% off coupon result in more return customers?
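To make the two-answer idea concrete, here is a minimal sketch of a two-class classifier for the subscription question. It is a one-feature decision stump, about the simplest classifier there is, and is only one of many possible two-class methods. The login counts, the renewal labels, and the function names `fit_stump` and `predict` are all made up for illustration.

```python
# Toy two-class classifier: a one-feature decision stump.
# All data and names below are hypothetical, for illustration only.

def fit_stump(values, labels):
    """Find the threshold that best separates label 0 from label 1.

    The resulting rule predicts 1 when value >= threshold, otherwise 0.
    """
    best_threshold, best_correct = None, -1
    for threshold in sorted(set(values)):
        # Count how many training examples this threshold gets right.
        correct = sum(
            (v >= threshold) == bool(y) for v, y in zip(values, labels)
        )
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def predict(threshold, value):
    """Answer the A-or-B question: 1 means renew, 0 means not renew."""
    return 1 if value >= threshold else 0


# Logins last month (feature) and whether the customer renewed (label).
logins = [1, 2, 3, 4, 8, 9, 12, 15]
renewed = [0, 0, 0, 0, 1, 1, 1, 1]

threshold = fit_stump(logins, renewed)
print(threshold)               # learned cut-off on login count
print(predict(threshold, 10))  # a frequent visitor
print(predict(threshold, 2))   # a rare visitor
```

Whatever the algorithm, the shape is the same: examples of both answers go in, and a rule that picks one of the two answers comes out.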
Is this A or B or C or D?
This algorithm family is called multi-class classification. As its name implies, it answers a question that has several (or even many) possible answers: which flavor, which person, which part, which company, which candidate. Most multi-class classification algorithms are just extensions of two-class classification algorithms. Here are a few typical examples.
- Which animal is in this image?
- Which aircraft is causing this radar signature?
- What is the topic of this news article?
- What is the mood of this tweet?
- Who is the speaker in this recording?
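As a rough illustration of a question with more than two answers, the sketch below uses a nearest-centroid classifier: it averages the examples of each class and assigns a new item to the class whose average is closest. This is just one simple multi-class method, chosen for brevity. The animal measurements and the names `fit_centroids` and `predict_class` are invented for the example.

```python
import math
from collections import defaultdict

# Toy multi-class classifier: nearest centroid.
# All data and names below are hypothetical, for illustration only.

def fit_centroids(points, labels):
    """Return the average (centroid) of the points in each class."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for (x, y), label in zip(points, labels):
        sums[label][0] += x
        sums[label][1] += y
        counts[label] += 1
    return {
        label: (s[0] / counts[label], s[1] / counts[label])
        for label, s in sums.items()
    }


def predict_class(centroids, point):
    """Pick the class whose centroid is closest to the point."""
    return min(centroids, key=lambda label: math.dist(centroids[label], point))


# Each animal is described by (weight in kg, height in cm).
animals = [(4, 25), (5, 23), (20, 55), (30, 60), (500, 160), (450, 150)]
species = ["cat", "cat", "dog", "dog", "horse", "horse"]

centroids = fit_centroids(animals, species)
print(predict_class(centroids, (25, 58)))
print(predict_class(centroids, (480, 155)))
```

Adding a fourth or fifth kind of animal only means adding more labeled examples; the question keeps the same form, just with a longer list of possible answers.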
Is this weird?
This family of algorithms performs anomaly detection. They identify data points that are not normal. If you are paying close attention, you may have noticed that this looks like a two-class (binary) classification question. It can be answered yes or no. The difference is that two-class classification assumes you have a collection of examples of both yes and no cases. Anomaly detection doesn’t. This is particularly useful when what you are looking for occurs so rarely that you haven’t had a chance to collect many examples of it, like equipment failures. It’s also very helpful when there is a lot of variety in what constitutes “not normal,” as there is in credit card fraud detection. Here are some typical anomaly detection questions.
- Is this pressure reading unusual?
- Is this internet message typical?
- Is this combination of purchases very different from what this customer has made in the past?
- Are these voltages normal for this season and time of day?
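The key difference from classification, learning from normal examples only, can be sketched in a few lines. The example below assumes a simple "mean plus or minus three standard deviations" rule, which is only one of many anomaly detection approaches. The pressure readings, the cut-off `k=3.0`, and the names `fit_normal_range` and `is_weird` are assumptions made for illustration.

```python
import statistics

# Toy anomaly detector: flag readings far from the normal average.
# All data and names below are hypothetical, for illustration only.

def fit_normal_range(readings, k=3.0):
    """Learn a normal range from examples of normal behavior only.

    The range is the mean plus or minus k sample standard deviations.
    """
    mean = statistics.mean(readings)
    std = statistics.stdev(readings)
    return mean - k * std, mean + k * std


def is_weird(reading, low, high):
    """Answer the question: is this reading outside the normal range?"""
    return not (low <= reading <= high)


# Only normal pressure readings are available; no failures were recorded.
normal_pressures = [100.2, 99.8, 100.5, 99.9, 100.1, 100.0, 99.7, 100.3]

low, high = fit_normal_range(normal_pressures)
print(is_weird(100.4, low, high))  # close to the usual readings
print(is_weird(104.0, low, high))  # far from anything seen before
```

Notice that no example of a failure was ever needed. The detector only learns what normal looks like and flags whatever falls outside it.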