What Types of Questions Can Data Science Answer
Data science has enabled us to solve complex and diverse problems by using machine learning and statistic algorithms. Here we have enumerated the common applications of supervised, unsupervised and reinforcement learning techniques
Machine learning (ML) is the motor that drives data science. Each ML method (also called an algorithm) takes in data, turns it over, and spits out an answer. ML algorithms do the part of data science that is the trickiest to explain and the most fun to work with. That’s where the mathematical magic happens.
ML algorithms can be grouped into families based on the type of question they answer. These can help guide your thinking as you are formulating your razor sharp question.
Is this A or B?
This family is formally known as two-class classification. It’s useful for any question that has just two possible answers: yes or no, on or off, smoking or non-smoking, purchased or not. Lots of data science questions sound like this or can be re-phrased to fit this form. It’s the simplest and most commonly asked data science question. Here are few typical examples.
- Will this customer renew their subscription?
- Is this an image of a cat or a dog?
- Will this customer click on the top link?
- Will this tire fail in the next thousand miles?
- Does the $5 coupon or the 25% off coupon result in more return customers?
Is this A or B or C or D?
This algorithm family is called multi-class classification. Like its name implies, it answers a question that has several (or even many) possible answers: which flavor, which person, which part, which company, which candidate. Most multi-class classification algorithms are just extensions of two-class classification algorithms. Here are a few typical examples.
- Which animal is in this image?
- Which aircraft is causing this radar signature?
- What is the topic of this news article?
- What is the mood of this tweet?
- Who is the speaker in this recording?
Is this Weird?
This family of algorithms performs anomaly detection. They identify data points that are not normal. If you are paying close attention, you noticed that this looks like a binary classification question. It can be answered yes or no. The difference is that binary classification assumes you have a collection of examples of both yes and no cases. Anomaly detection doesn’t. This is particularly useful when what you are looking for occurs so rarely that you haven’t had a chance to collect many examples of it, like equipment failures. It’s also very helpful when there is a lot of variety in what constitutes “not normal,” as there is in credit card fraud detection. Here are some typical anomaly detection questions.
- Is this pressure reading unusual?
- Is this internet message typical?
- Is this combination of purchases very different from what this customer has made in the past?
- Are these voltages normal for this season and time of day?
How Much / How Many?
When you are looking for a number instead of a class or category, the algorithm family to use is regression.
- What will the temperature be next Tuesday?
- What will my fourth quarter sales in Portugal be?
- How many kilowatts will be demanded from my wind farm 30 minutes from now?
- How many new followers will I get next week?
- Out of a thousand units, how many of this model of bearings will survive 10,000 hours of use?
Usually, regression algorithms give a real-valued answer; the answers can have lots of decimal places or even be negative. For some questions, especially questions beginning “How many…”, negative answers may have to be re-interpreted as zero and fractional values re-interpreted as the nearest whole number.