2 Questions for a Junior Data Scientist

Academic credentials and experience with previous machine learning projects are important for kicking off a data science career. However, landing your first job out of school will require you to extend your thinking about projects and problems. Learn how one interviewer homed in on desired skills by considering these two questions.

By Sowmya Vajjala, Blogging on NLP Research and teaching, R, Python, etc.

Hiring a data scientist is, in my opinion, a difficult process in general. Candidates come from widely different backgrounds, with varying academic degrees and levels of experience. The profile requirements for a “data scientist” differ a lot from company to company. On top of this, more and more people add the tagline “data scientist” to their LinkedIn profiles every day. It is perhaps relatively easy to assess whether an experienced data scientist’s profile suits a job, but how do we evaluate a junior or entry-level profile?

I spent some time thinking about this issue after interviewing a couple of fresh graduates, certificate program graduates, and intern candidates over the past months. I felt there are two important questions to ask during the interview process. This post is about those two questions and my rationale for asking them.

Let me start by giving them headings.

  1. Different ways of solving the same problem
  2. Generalized understanding or specific knowledge


Different ways of solving the same problem

Let me take an anonymized real interview experience. I had to interview a candidate with very good academic credentials and an impressive array of projects already as a student. One of the projects involved using reinforcement learning to solve a problem X in the Natural Language Processing domain. This was some Kaggle competition (as far as I remember) and the candidate was among the top XXX participants. I felt they did a reasonably good job of explaining what they did.

At that point, I asked — “what will you do if you are told to not use reinforcement learning?” The candidate seemed surprised at the question. After some contemplation, they said: “deep learning.” I asked — “what in deep learning?”. Them: “Maybe RNNs.” Me: “Okay, let us say it was decided that we cannot use deep learning. Can you think of any other solution?” Them: (clueless look).

In this case, a popular algorithm for the problem we were discussing (this is found in textbooks too) uses regular expressions and simple heuristics! Now, one may ask, what is the point of asking this question in this era? Three reasons:

  • In industry projects, it often makes sense to build a quick MVP or a simple solution, get some feedback, and iterate over it. So, it is useful to think about a couple of different options to solve a problem and evaluate which one can be built quickly.
  • Reinforcement learning, deep learning, and regular expressions are just “methods” for solving problems. None of them should be treated as the solution regardless of the problem.
  • This is also going to tell me whether the candidate just followed the description of a mandatory project in their course, or actually understood and thought about solving the problem.
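To make the “regular expressions and simple heuristics” idea concrete, here is a minimal sketch. The problem from the interview is anonymized, so the task below (splitting text into sentences) is a hypothetical stand-in chosen only for illustration, and the abbreviation list and function name are my own assumptions.

```python
import re

# Hypothetical abbreviation list; a real project would need a longer one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e.", "etc."}

# Candidate boundary: whitespace that follows ".", "!" or "?"
# and is followed by an uppercase letter.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Split text into sentences with one regex and one heuristic."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.start()]
        last_token = candidate.split()[-1].lower()
        # Heuristic: a known abbreviation does not end a sentence.
        if last_token in ABBREVIATIONS:
            continue
        sentences.append(candidate.strip())
        start = match.end()
    # Whatever remains after the last boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    example = "Dr. Smith reviewed the order. It shipped late! Was a refund issued? Yes."
    for sentence in split_sentences(example):
        print(sentence)
```

A baseline like this can be written in minutes, which is what makes it a reasonable first version to get feedback on before reaching for a heavier method.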

So, my point is: when we are starting out, it is useful not to stop at one solution, but to ask “what are the other ways of solving this?” This is what we normally do in a real-world job too: we look for optimal solutions under given constraints. While it is difficult for someone just starting out to know everything, part of this is also common sense. When someone suddenly asks you about an alternative solution in an interview, you should have an answer. This habit is going to be very useful as one grows as a data scientist.


Generalized understanding or specific knowledge

Let me again take an anonymized interview experience. A candidate fresh out of college had a bunch of typical projects on their resume: spam classification, MNIST digit recognition, sentiment analysis, etc. On one of these, the candidate also claimed to be among the top 10 performers on the Kaggle leaderboard. While that is impressive, these projects are also far removed from real-world project scenarios. So, what should I do?

Instead of asking questions on the specifics of these well-known datasets and projects, I modified my “problem descriptions” slightly. I asked the candidate problem-solving questions such as: “Let us say I run an online business, and I frequently get customer emails complaining about something. I only have three customer support departments: orders and billing, returns and refunds, others. I want a machine learning solution which routes customer emails to one of these three departments.” If someone understood the projects listed above, they would map this to a classification problem, potentially similar to spam or sentiment classification. Even at an entry level, not seeing that connection is a red flag for me.
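As a rough illustration of that mapping, the email-routing question can be treated as three-class text classification, exactly like spam filtering with one more label. The sketch below is a tiny Naive Bayes classifier in plain Python. The training emails are invented for illustration, and the label names and function names are my own assumptions, not part of any real system.

```python
import math
import re
from collections import Counter, defaultdict

# Invented training emails, for illustration only.
TRAINING_EMAILS = [
    ("I was charged twice for my order", "orders_billing"),
    ("My invoice shows the wrong billing amount", "orders_billing"),
    ("I want to return this item and get a refund", "returns_refunds"),
    ("The refund for my returned shoes has not arrived", "returns_refunds"),
    ("Your website keeps logging me out", "others"),
    ("How do I change my account password", "others"),
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count words per department and emails per department."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for text, label in examples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, label_counts, vocabulary


def route(text, model):
    """Return the department with the highest Naive Bayes log score."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    # Words never seen in training carry no evidence, so drop them.
    tokens = [t for t in tokenize(text) if t in vocabulary]
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Add-one smoothing avoids a zero probability for a word
            # that never appeared under this label.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train(TRAINING_EMAILS)
    print(route("Please refund my returned jacket", model))
```

The same code would do spam detection if the labels were “spam” and “not spam”, which is the connection the question is probing for.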

My point for this question is two-fold:

  • It is completely all right if all the data science projects you can show are standard datasets and Kaggle competitions (a few months ago, I wondered whether these are useful, but I have since changed my opinion). But one needs to know how to generalize the knowledge from these to new problems. For example, if you previously worked on a text classification problem, you need to be able to identify another text classification problem and walk through some steps in solving it.
  • Here too, my second point is similar to Question 1. This tells me whether the candidate really understood what they did, or just followed instructions, or followed online tutorials.

To summarize, when applying for data science jobs, entry-level candidates need to think about their projects a bit beyond the exact things they did — looking for other possible solutions, and for examples of similar problems in real-world applications.

Of course, all this is my personal opinion, and not necessarily found in each and every data science interview under the sun!

Original. Reposted with permission.

Bio: Sowmya Vajjala is an NLP researcher based in Toronto, Canada who works on problems related to processing and understanding text by using machine learning methods. Previously, Sowmya taught at Iowa State University and worked as a data scientist in Toronto.