2016 Gold Blog
21 Must-Know Data Science Interview Questions and Answers

KDnuggets Editors bring you the answers to 20 Questions to Detect Fake Data Scientists, including what is regularization, Data Scientists we admire, model validation, and more.

Q4. Explain what precision and recall are. How do they relate to the ROC curve?

Answer by Gregory Piatetsky:

Here is the answer from KDnuggets FAQ: Precision and Recall:

Calculating precision and recall is actually quite easy. Imagine there are 100 positive cases among 10,000 cases. You want to predict which ones are positive, and you pick 200 to have a better chance of catching many of the 100 positive cases.  You record the IDs of your predictions, and when you get the actual results you sum up how many times you were right or wrong. There are four ways of being right or wrong:

  1. TN / True Negative: case was negative and predicted negative
  2. TP / True Positive: case was positive and predicted positive
  3. FN / False Negative: case was positive but predicted negative
  4. FP / False Positive: case was negative but predicted positive

Makes sense so far? Now you count how many of the 10,000 cases fall in each bucket, say:

                  Predicted Negative    Predicted Positive
Negative Cases    TN: 9,760             FP: 140
Positive Cases    FN: 40                TP: 60

Now, your boss asks you three questions:

  1. What percent of your predictions were correct?
    You answer: the "accuracy" was (9,760+60) out of 10,000 = 98.2%
  2. What percent of the positive cases did you catch?
    You answer: the "recall" was 60 out of 100 = 60%
  3. What percent of positive predictions were correct?
    You answer: the "precision" was 60 out of 200 = 30%
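The three answers above can be reproduced directly from the four counts. This is a minimal sketch in plain Python; the function names are illustrative and not part of the original FAQ.

```python
# Confusion-matrix counts from the example: 10,000 cases, 100 of them positive,
# 200 predicted positive.
tn, fp, fn, tp = 9760, 140, 40, 60


def accuracy(tn, fp, fn, tp):
    """Share of all predictions (positive and negative) that were correct."""
    return (tp + tn) / (tn + fp + fn + tp)


def recall(fn, tp):
    """Share of the actual positive cases that were caught."""
    return tp / (tp + fn)


def precision(fp, tp):
    """Share of the positive predictions that were correct."""
    return tp / (tp + fp)


print(f"accuracy  = {accuracy(tn, fp, fn, tp):.1%}")  # 98.2%
print(f"recall    = {recall(fn, tp):.1%}")            # 60.0%
print(f"precision = {precision(fp, tp):.1%}")         # 30.0%
```

Note that the accuracy is high mostly because negatives dominate the data; recall and precision describe how well the positives are handled.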

See also a very good explanation of Precision and recall in Wikipedia.

Fig 4: Precision and Recall.

The ROC curve plots sensitivity (recall, also called the true positive rate) against the false positive rate (1 - specificity), not against precision, and is commonly used to measure the performance of binary classifiers. However, when dealing with highly skewed datasets, Precision-Recall (PR) curves give a more representative picture of performance. See also this Quora answer: What is the difference between a ROC curve and a precision-recall curve?
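To see why skew matters, the counts from the Q4 example can be turned into one point on each curve. This is a small sketch with illustrative function names; it uses the usual ROC convention of plotting the true positive rate against the false positive rate, where the false positive rate equals 1 - specificity.

```python
def roc_point(tn, fp, fn, tp):
    """Return (false positive rate, true positive rate) for one threshold."""
    fpr = fp / (fp + tn)  # 1 - specificity
    tpr = tp / (tp + fn)  # sensitivity, i.e. recall
    return fpr, tpr


def pr_point(tn, fp, fn, tp):
    """Return (recall, precision) for the same threshold."""
    return tp / (tp + fn), tp / (tp + fp)


counts = dict(tn=9760, fp=140, fn=40, tp=60)
fpr, tpr = roc_point(**counts)
rec, prec = pr_point(**counts)

print(f"ROC point: FPR = {fpr:.4f}, TPR = {tpr:.2f}")
print(f"PR point:  recall = {rec:.2f}, precision = {prec:.2f}")
```

With only 1% positives, the false positive rate is about 1.4%, which looks excellent on a ROC plot, while the precision of 30% shows that most positive predictions are wrong.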

Q5. How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything?

Answer by Anmol Rajpurohit.

In the pursuit of rapid innovation (aka "quick fame"), the principles of scientific methodology are often violated, leading to misleading innovations, i.e. appealing insights that are accepted without rigorous validation. One such scenario: given the task of improving an algorithm to yield better results, you might come up with several ideas with potential for improvement.

An obvious human urge is to announce these ideas ASAP and ask for their implementation. When supporting data is requested, often only limited results are shared, and these are very likely to be affected by selection bias (known or unknown) or by a misleading global minimum (due to a lack of appropriate variety in the test data).

Data scientists do not let their emotions override their logical reasoning. While the exact approach to proving that a change to an algorithm is really an improvement over doing nothing depends on the case at hand, there are a few common guidelines:
  • Ensure that there is no selection bias in the test data used for performance comparison
  • Ensure that the test data has sufficient variety to be representative of real-life data (helps avoid overfitting)
  • Ensure that "controlled experiment" principles are followed, i.e. while comparing performance, the test environment (hardware, etc.) must be exactly the same when running the original algorithm and the new algorithm
  • Ensure that the results are repeatable, giving nearly identical results on each run
  • Examine whether the results reflect local maxima/minima or global maxima/minima

One common way to follow the above guidelines is A/B testing, where both versions of the algorithm are kept running in similar environments for a considerably long time and real-life input data is randomly split between the two. This approach is particularly common in Web Analytics.
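As a rough illustration of the A/B idea, the sketch below randomly splits inputs between the two versions and then compares their success rates with a two-proportion z-test. The choice of test, the function names, and the example counts are assumptions made for illustration; the right metric and test depend on the case at hand.

```python
import math
import random


def split_traffic(inputs, seed=0):
    """Randomly assign each input to the original (A) or the new (B) version."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    group_a, group_b = [], []
    for item in inputs:
        (group_a if rng.random() < 0.5 else group_b).append(item)
    return group_a, group_b


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for the difference in success rates."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so no evidence of a difference
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Hypothetical outcome: A succeeds on 100 of 1,000 inputs, B on 150 of 1,000.
z, p = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

A small p-value suggests the difference is unlikely to be due to chance alone, but it does not replace the other guidelines: the split must be random and the test data representative.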

Q6. What is root cause analysis?

Answer by Gregory Piatetsky:

According to Wikipedia,
Root cause analysis (RCA) is a method of problem solving used for identifying the root causes of faults or problems. A factor is considered a root cause if removal thereof from the problem-fault-sequence prevents the final undesirable event from recurring; whereas a causal factor is one that affects an event's outcome, but is not a root cause.

Root cause analysis was initially developed to analyze industrial accidents, but is now widely used in other areas, such as healthcare, project management, or software testing.

Here is a useful Root Cause Analysis Toolkit from the state of Minnesota.

Essentially, you can find the root cause of a problem and show the relationship of causes by repeatedly asking the question, "Why?", until you find the root of the problem. This technique is commonly called "5 Whys", although it can involve more or fewer than 5 questions.

Fig. 5 Whys Analysis Example, from The Art of Root Cause Analysis.

Q7. Are you familiar with price optimization, price elasticity, inventory management, competitive intelligence? Give examples.

Answer by Gregory Piatetsky:

These are economics terms that Data Scientists are not often asked about, but they are useful to know.

Price optimization is the use of mathematical tools to determine how customers will respond to different prices for a company's products and services through different channels.

Big Data and data mining enable the use of personalization for price optimization. Companies like Amazon can now take optimization even further and show different prices to different visitors, based on their history, although there is a strong debate about whether this is fair.

Price elasticity in common usage typically refers to
  • Price elasticity of demand, a measure of price sensitivity. It is computed as:
    Price Elasticity of Demand = % Change in Quantity Demanded / % Change in Price.

Similarly, Price elasticity of supply is an economics measure that shows how the quantity supplied of a good or service responds to a change in its price.
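The demand-side formula above can be sketched in a few lines. The function names and the example prices and quantities are hypothetical, chosen only to show the arithmetic.

```python
def percent_change(old, new):
    """Percentage change from old to new, relative to old."""
    return (new - old) / old * 100


def price_elasticity_of_demand(q_old, q_new, p_old, p_new):
    """% change in quantity demanded divided by % change in price."""
    price_change = percent_change(p_old, p_new)
    if price_change == 0:
        raise ValueError("price did not change, elasticity is undefined")
    return percent_change(q_old, q_new) / price_change


# Hypothetical example: price rises from 10 to 11 (+10%),
# quantity demanded falls from 100 to 85 units (-15%).
ped = price_elasticity_of_demand(q_old=100, q_new=85, p_old=10, p_new=11)
print(f"price elasticity of demand is about {ped:.2f}")
```

In this example the magnitude is greater than 1, so demand is considered elastic: quantity demanded changes proportionally more than price.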

Inventory management is the overseeing and controlling of the ordering, storage and use of components that a company will use in the production of the items it will sell as well as the overseeing and controlling of quantities of finished products for sale.

Wikipedia defines competitive intelligence as the action of defining, gathering, analyzing, and distributing intelligence about products, customers, competitors, and any aspect of the environment needed to support executives and managers making strategic decisions for an organization.

Tools like Google Trends, Alexa, and Compete can be used to determine general trends and analyze your competitors on the web.

Here are useful resources: