21 Must-Know Data Science Interview Questions and Answers, part 2

Second part of the answers to 20 Questions to Detect Fake Data Scientists, including controlling overfitting, experimental design, tall and wide data, understanding the validity of statistics in the media, and more.

Q17. How would you use either the extreme value theory, Monte Carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?

Answer by Matthew Mayo.

Extreme value theory (EVT) focuses on rare events and extremes, as opposed to classical approaches to statistics which concentrate on average behaviors. EVT states that there are 3 types of distributions needed to model the extreme data points of a collection of random observations from some distribution: the Gumbel, Frechet, and Weibull distributions, also known as the Extreme Value Distributions (EVD) 1, 2, and 3, respectively.

EVT states that, if you were to generate N data sets from a given distribution, and then create a new dataset containing only the maximum values of these N data sets, this new dataset would only be accurately described by one of the EVD distributions: Gumbel, Frechet, or Weibull. The Generalized Extreme Value Distribution (GEV) is, then, a single family that combines these 3 EVD types, with a shape parameter that selects among them.

Knowing which models to use, we can fit them to our data and evaluate the fit. Once the best-fitting model is found, analysis can be performed, including calculating the probability of the rare event.
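As a rough illustration of this workflow, here is a minimal sketch that fits a GEV to block maxima and reads off a tail probability. The simulated data, the block size, and the threshold are all assumptions made for the example; scipy.stats.genextreme is SciPy's GEV implementation.

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(0)

# Hypothetical data: 200 blocks (e.g. years) of 365 daily observations each.
daily = rng.normal(loc=0.0, scale=1.0, size=(200, 365))

# Block maxima: keep only the largest value in each block.
block_maxima = daily.max(axis=1)

# Fit the GEV by maximum likelihood. Note that SciPy's shape parameter c
# has the opposite sign to the xi used in much of the EVT literature.
c, loc, scale = genextreme.fit(block_maxima)

# Estimated probability that a block maximum exceeds a rare threshold
# (the threshold value is an assumption for this example).
threshold = 4.0
p_exceed = genextreme.sf(threshold, c, loc=loc, scale=scale)

print(f"shape={c:.3f}, loc={loc:.3f}, scale={scale:.3f}")
print(f"P(block maximum > {threshold}) is roughly {p_exceed:.4f}")
```

In practice the fit should be checked (for example with a quantile plot) before trusting the extrapolated probability, since the estimate is sensitive to the fitted shape parameter.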

18. What is a recommendation engine? How does it work?

Answer by Gregory Piatetsky:

We are all familiar now with recommendations from Netflix - "Other Movies you might enjoy" or from Amazon - "Customers who bought X also bought Y".

Such systems are called recommendation engines or more broadly recommender systems.

They typically produce recommendations in one of two ways: using collaborative or content-based filtering.

Collaborative filtering methods build a model based on a user's past behavior (items previously purchased, movies viewed and rated, etc.) and on similar decisions made by other users. This model is then used to predict items (or ratings for items) that the user may be interested in.
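A minimal sketch of the idea, assuming a tiny made-up rating matrix and a simple user-based scheme with cosine similarity (one of several possible collaborative filtering variants). The function name predict_rating is hypothetical.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 5, 1],
    [1, 1, 2, 5],
    [5, 5, 4, 0],
], dtype=float)

def predict_rating(ratings, user, item):
    """Predict a rating as the similarity-weighted mean of other users' ratings."""
    norms = np.linalg.norm(ratings, axis=1)
    # Cosine similarity between the target user and every user.
    # Simplification: unrated items count as 0 in the similarity.
    sims = ratings @ ratings[user] / (norms * norms[user])
    sims[user] = 0.0                  # ignore the user's own row
    rated = ratings[:, item] > 0      # only users who rated this item
    weights = sims * rated
    if weights.sum() == 0:
        return 0.0
    return float(weights @ ratings[:, item] / weights.sum())

# User 0 has not rated item 2; users 1 and 3 have similar tastes and liked it.
prediction = predict_rating(ratings, user=0, item=2)
print(round(prediction, 2))
```

Because the two most similar users rated item 2 highly, the prediction for user 0 comes out high as well, which is the behavior collaborative filtering relies on.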

Content-based filtering methods use features of an item to recommend additional items with similar properties. These approaches are often combined in Hybrid Recommender Systems.
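Content-based filtering can be sketched in a similar way. The example below assumes made-up item features and recommends the item whose feature vector is closest to one the user liked; the song names, feature columns, and the helper most_similar are all hypothetical.

```python
import numpy as np

# Hypothetical item features (columns: tempo, acoustic, vocal), one row per song.
songs = ["song_a", "song_b", "song_c", "song_d"]
features = np.array([
    [0.9, 0.1, 0.8],
    [0.8, 0.2, 0.9],
    [0.1, 0.9, 0.2],
    [0.2, 0.8, 0.1],
])

def most_similar(features, liked, k=1):
    """Return indices of the k items closest to the liked item by cosine similarity."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = unit @ unit[liked]
    sims[liked] = -np.inf             # never recommend the seed item itself
    return np.argsort(sims)[::-1][:k]

for idx in most_similar(features, liked=0, k=1):
    print(songs[idx])
```

No other users' behavior is needed here: the recommendation depends only on item properties, which is what distinguishes this approach from collaborative filtering.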

Here is a comparison of these 2 approaches used in two popular music recommender systems - Last.fm and Pandora Radio. (example from Recommender System entry)
  • Last.fm creates a "station" of recommended songs by observing what bands and individual tracks the user has listened to on a regular basis and comparing those against the listening behavior of other users. Last.fm will play tracks that do not appear in the user's library, but are often played by other users with similar interests. As this approach leverages the behavior of users, it is an example of a collaborative filtering technique.
  • Pandora uses the properties of a song or artist (a subset of the 400 attributes provided by the Music Genome Project) in order to seed a "station" that plays music with similar properties. User feedback is used to refine the station's results, deemphasizing certain attributes when a user "dislikes" a particular song and emphasizing other attributes when a user "likes" a song. This is an example of a content-based approach.

Here is a good Introduction to Recommendation Engines by Dataconomy and an overview of building a Collaborative Filtering Recommendation Engine by Toptal. For the latest research on recommender systems, check the ACM RecSys conference.

19. Explain what a false positive and a false negative are. Why is it important to differentiate these from each other?

Answer by Gregory Piatetsky:

In binary classification (or medical testing), a false positive is when an algorithm (or test) indicates presence of a condition, when in reality it is absent. A false negative is when an algorithm (or test) indicates absence of a condition, when in reality it is present.

In statistical hypothesis testing, a false positive is also called a type I error and a false negative a type II error.

It is obviously very important to distinguish and treat false positives and false negatives differently because the costs of such errors can be hugely different.

For example, if a test for a serious disease is a false positive (test says disease, but person is healthy), then an extra test will be made that will determine the correct diagnosis. However, if a test is a false negative (test says healthy, but person has the disease), then no treatment will be given and the person may die as a result.
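To make the asymmetry concrete, here is a small sketch that counts both error types and weights them by cost. The labels and the cost values are invented for illustration; real costs depend entirely on the application.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = condition present)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p and a:
            tp += 1
        elif p and not a:
            fp += 1      # test says present, condition absent (type I error)
        elif not p and a:
            fn += 1      # test says absent, condition present (type II error)
        else:
            tn += 1
    return tp, fp, tn, fn

# Hypothetical ground truth and test results.
actual    = [1, 0, 0, 1, 0, 1, 0, 0]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
tp, fp, tn, fn = confusion_counts(actual, predicted)

# Assumed costs: a missed disease is taken to be far worse than a follow-up test.
COST_FP = 1.0
COST_FN = 50.0
total_cost = fp * COST_FP + fn * COST_FN

print(f"TP={tp} FP={fp} TN={tn} FN={fn} total cost={total_cost}")
```

With these assumed costs, the single false negative dominates the total even though there are twice as many false positives.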

20. Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs). How to efficiently represent 5 dimensions in a chart (or in a video)?

Answer by Gregory Piatetsky:

There are many good tools for Data Visualization. R, Python, Tableau and Excel are among the most commonly used by Data Scientists.

There are many ways to represent more than 2 dimensions in a chart. A 3rd dimension can be shown with a 3D scatter plot which can be rotated. You can use color, shading, shape, size. Animation can be used effectively to show the time dimension (change over time).

Here is a good example.

Fig 20a: 5-dimensional scatter plot of Iris data, with size: sepal length; color: sepal width; shape: class; x-column: petal length; y-column: petal width, from here.
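A plot of this kind can be sketched with matplotlib, using the same mapping of columns to position, size, color, and shape. The data below is random and only stands in for Iris; the output file name is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to view interactively
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
n = 60
# Hypothetical stand-in for Iris: four numeric columns plus a class label.
petal_length = rng.uniform(1.0, 7.0, n)
petal_width = rng.uniform(0.1, 2.5, n)
sepal_length = rng.uniform(4.0, 8.0, n)
sepal_width = rng.uniform(2.0, 4.5, n)
label = rng.integers(0, 3, n)

fig, ax = plt.subplots()
markers = ["o", "s", "^"]                     # shape encodes class
for k, marker in enumerate(markers):
    m = label == k
    sc = ax.scatter(
        petal_length[m], petal_width[m],      # x and y positions
        s=20 * sepal_length[m],               # size encodes sepal length
        c=sepal_width[m], cmap="viridis",     # color encodes sepal width
        vmin=2.0, vmax=4.5,                   # shared color scale across classes
        marker=marker, label=f"class {k}",
    )
fig.colorbar(sc, ax=ax, label="sepal width")
ax.set_xlabel("petal length")
ax.set_ylabel("petal width")
ax.legend()
fig.savefig("five_dim_scatter.png")
```

Fixing vmin and vmax keeps the color scale identical for all three marker groups, so a single colorbar is valid for the whole chart.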

For more than 5 dimensions, one approach is Parallel Coordinates, pioneered by Alfred Inselberg.

Fig 20b: Iris data in parallel coordinates
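A parallel coordinates chart can be drawn by rescaling every column to a common range and plotting each row as a line across the columns. The sketch below uses plain matplotlib and synthetic two-group data as an assumption, not the actual Iris measurements.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line to view interactively
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
columns = ["sepal length", "sepal width", "petal length", "petal width"]
# Hypothetical data: two groups of 20 rows with different means per column.
data = np.vstack([
    rng.normal([5.0, 3.4, 1.5, 0.2], 0.2, size=(20, 4)),
    rng.normal([6.5, 3.0, 5.5, 2.0], 0.3, size=(20, 4)),
])
group = np.repeat([0, 1], 20)

# Rescale every column to [0, 1] so axes with different ranges are comparable.
lo, hi = data.min(axis=0), data.max(axis=0)
scaled = (data - lo) / (hi - lo)

fig, ax = plt.subplots()
x = np.arange(len(columns))
colors = ["tab:blue", "tab:orange"]           # color encodes group
for row, g in zip(scaled, group):
    ax.plot(x, row, color=colors[g], alpha=0.5)   # one polyline per observation
ax.set_xticks(x)
ax.set_xticklabels(columns)
ax.set_ylabel("value rescaled to [0, 1]")
fig.savefig("parallel_coordinates.png")
```

The approach extends to any number of columns, since each extra dimension only adds one more vertical axis.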

Of course, when you have a lot of dimensions, it is best to reduce the number of dimensions or features first.
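One common way to do this is principal component analysis (PCA), shown here as a minimal NumPy sketch. PCA is chosen only as an example of dimensionality reduction, and the 6-column data set is synthetic, built from 2 hidden factors plus noise.

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components using the SVD."""
    Xc = X - X.mean(axis=0)                       # center each column
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)               # variance share per component
    return Xc @ Vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
# Hypothetical data: 6 observed columns driven by 2 latent factors plus small noise.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(100, 6))

projected, explained = pca(X, n_components=2)
print(projected.shape)
print(f"variance explained by 2 components: {explained.sum():.3f}")
```

The two projected columns can then be plotted directly, with color, size, or shape reserved for any remaining variables of interest.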