Data Scientist Interviews Demystified

We look at typical questions in a data science interview, examine the rationale for such questions, and hope to demystify the interview process for recent graduates and aspiring data scientists.

By Colleen M. Farrelly, Quantopo, LLC on August 2, 2018 in Data Science Skills, Hiring, Interview Questions, P-value, random forests algorithm, XGBoost


Interviews can be nerve-wracking, even for those seasoned in tech job searches. I am often asked what data scientist interviews are like and how one prepares for them. With such a wide range of position types, it’s difficult to answer those questions exactly. Tech companies and production teams may focus more on software skills, while larger businesses outside of tech may focus more on the statistical testing and data mining skills.

However, a few skills and pieces of knowledge show up in many types of interviews for many types of positions, and they are a good starting point for interview preparation. Brushing up on basic statistical concepts and the coding language(s) listed in the job posting is always a good idea, as is reviewing any of your peer-reviewed papers (even if they are from several years ago!). Relevant industry experience and courses from college or graduate school are usually discussed, even if recent positions have been in other industries. Most of these questions relate directly to the position applied for and the tools and expertise needed for that specific role.

However, there are some common tasks and questions beyond those directly related to the position that seem to be universal to data science interviews, and some of these questions/tasks assess more than what appears on the surface. Here are a few common ones, along with the questions beyond the obvious.

1) Explain a p-value. On the surface, it’s a pretty easy question for those who have studied statistics; you’re comparing a control group (business as usual) with some change to see if the change improves or detracts from current results. A model or statistical test yields a value called a p-value that indicates how often one would expect a result at least as extreme as the one observed if the null hypothesis were true. The smaller the p-value, the harder it is to attribute the observed difference to chance alone, and the stronger the evidence against the null hypothesis.

However, lurking behind this number are sample size and power. Given a large enough sample, even a tiny deviation from the null value will come up as very unlikely under the null. But the CEO is not going to care if there’s a 0.01% difference between condition A and condition B in the testing; statistical significance is not the same as practical significance.
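To make the sample-size point concrete, here is a minimal sketch using a pooled two-proportion z-test. The conversion rates (10.0% vs. 10.1%), the group sizes, and the function name are hypothetical choices for illustration, and the normal approximation is an assumption that is reasonable only for large samples.

```python
import math

def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical conversion rates: 10.0% for condition A, 10.1% for condition B.
RATE_A, RATE_B = 0.100, 0.101

for n in (10_000, 1_000_000, 100_000_000):
    p = two_proportion_p_value(round(RATE_A * n), n, round(RATE_B * n), n)
    print(f"n per group = {n:>11,}  p-value = {p:.3g}")
```

The observed difference is identical in every row; only the sample size changes, and the p-value shrinks with it. That is why an effect size (and whether it matters to the business) belongs next to any p-value.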

2) What is MapReduce (or parallel computing), and why is it important? Especially in production environments, efficient code is important, and big data usually involves computing quantities in parallel at some point, whether through Hadoop or Spark or something in-house. Even in R&D teams focused on designing the prototypes, a prototype that is easier to productionalize can be worth more than a prototype that will require a lot of time and resources to implement, even if the latter is a bit more accurate.

That means that the 1% gain which helps win a Kaggle competition but requires a relatively large amount of computational power is often not practical in a business. Parallel computing and implementation considerations play a large role in operational efficiency.
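For readers who have not seen the MapReduce pattern, the following is a toy, single-process sketch of its three phases (map, shuffle, reduce) applied to a word count. The function names and sample text are made up for illustration, and nothing here uses Hadoop or Spark; in a real cluster the map and reduce calls would run on separate workers.

```python
from collections import defaultdict
from functools import reduce

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle_phase(mapped_chunks):
    """Shuffle: group all emitted values by key across chunks."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key independently of other keys."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in grouped.items()}

def word_count(chunks):
    # Each chunk is mapped without reference to any other chunk, which is
    # what lets a framework hand chunks to different machines. The built-in
    # map() used here simply processes them one after another.
    mapped = map(map_phase, chunks)
    return reduce_phase(shuffle_phase(mapped))

chunks = ["big data needs parallel code", "parallel code scales", "big data"]
counts = word_count(chunks)
print(counts)
```

The key property to mention in an interview is independence: map calls share no state, and each key can be reduced on its own, so both phases can be spread across machines.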

3) How do random forests and boosted regression models differ? Simple enough question about bagging vs. boosting and possibly some production considerations. However, assessing the machine learning knowledge of a candidate reveals the depth to which they understand the algorithm—and can potentially learn other algorithms or create their own algorithms as needed. A bootcamp answer suggests someone at least knows the basics. An in-depth discussion of the mathematics or mention of more advanced algorithms in these frameworks suggests expertise in machine learning. This question also assesses current knowledge in the field, such as the rise of boosting as a computationally-viable production algorithm with the introduction of XGBoost and other hardware-interactive boosting algorithms.
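The bagging vs. boosting distinction can be sketched in a few dozen lines of plain Python. This is a deliberately simplified illustration, not a random forest or XGBoost: it uses one-split regression stumps on a single feature, omits the random feature selection that random forests add, and all names, data, and settings are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression stump by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: fall back to a constant
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_bagging(xs, ys, n_trees, rng):
    """Bagging: fit independent stumps on bootstrap resamples of the data."""
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def predict_bagging(stumps, x):
    # The ensemble prediction is a plain average of the independent models.
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)

def fit_boosting(xs, ys, n_rounds, learning_rate):
    """Boosting: fit each stump to the residuals left by the earlier ones."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return (base, learning_rate, stumps)

def predict_boosting(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

def mse(predictions, ys):
    return sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(ys)

rng = random.Random(0)
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]  # toy target, chosen only for illustration

bagged = fit_bagging(xs, ys, n_trees=25, rng=rng)
boosted = fit_boosting(xs, ys, n_rounds=25, learning_rate=0.5)

print("bagging MSE: ", mse([predict_bagging(bagged, x) for x in xs], ys))
print("boosting MSE:", mse([predict_boosting(boosted, x) for x in xs], ys))
```

The structural difference is visible in the two fit functions: the bagging loop never looks at earlier models, so its trees can be trained in parallel, while each boosting round depends on the predictions accumulated so far and is therefore sequential.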

4) Something related to conditional probability and Bayes' theorem. It’s a favorite topic, and it’s a way of assessing not only statistical knowledge but also the ability to reason through problems. Much of business reasoning involves hypothetical scenarios along the lines of “if we do X to this population, what can we expect to happen to revenue…?” and understanding how to subset data to specific instances that match a condition is a foundation of many business analysis requests.
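As a small worked example of both ideas, the sketch below computes a posterior with Bayes' theorem and then estimates a conditional probability by subsetting records to those matching a condition. The churn and complaint figures, the customer records, and the function names are all invented for illustration.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """P(H | E) from P(H), P(E | H), and P(E | not H) via Bayes' theorem."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

def conditional_rate(records, outcome, condition):
    """Estimate P(outcome | condition) by subsetting to matching records."""
    subset = [r for r in records if condition(r)]
    if not subset:
        raise ValueError("no records match the condition")
    return sum(1 for r in subset if outcome(r)) / len(subset)

# Hypothetical inputs: 10% of customers churn; 60% of churners filed a
# complaint first, versus 10% of customers who stayed.
posterior = bayes_posterior(prior=0.10, likelihood=0.60, false_positive_rate=0.10)
print(f"P(churn | complaint) = {posterior:.2f}")

# Hypothetical records; the same question answered directly from data.
customers = [
    {"complained": True, "churned": True},
    {"complained": True, "churned": False},
    {"complained": False, "churned": False},
    {"complained": False, "churned": False},
    {"complained": True, "churned": True},
]
rate = conditional_rate(
    customers,
    outcome=lambda r: r["churned"],
    condition=lambda r: r["complained"],
)
print(f"Observed churn rate among complainers = {rate:.2f}")
```

With these made-up inputs, a complaint raises the churn probability from the 10% prior to 40%, which is the kind of step-by-step reasoning an interviewer is usually looking for.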

5) Give the candidate a hypothetical dataset and ask for potential ways to analyze it (or give them the data and time to analyze). Seems to be a simple application question to make sure a candidate can code and apply some statistical or machine learning model. However, many times, this problem is actually one that the company is facing or has faced recently, and it might be something you are hired to fix or solve. This question also assesses problem-solving style (and whether it matches well with the team’s style), communication skills related to technical material (which will be needed when presenting analysis results to the team’s boss or a suite of executives), and quality of work (correct solutions, quickness of coming up with a good answer…).

Data science interviews don’t have to be intimidating or mysterious to aspiring applicants, and understanding some of the common questions can make a difference between netting the first job and needing to apply somewhere else. Practice. Prepare. And trust all you have learned along the way!
