Advice For New and Junior Data Scientists

This article is for people who are already in the field but are just starting out. My goal is not only to use this post as a reminder to myself of the important things I have learned, but also to inspire others as they embark on their data science careers!

Partnering With Experienced Data Scientists

One of the key ingredients of deliberate practice is to receive timely and actionable feedback. No great athletes, musicians, or mathematicians are able to achieve greatness without coaching or targeted feedback.

One common trait I have observed from people who have a strong growth mindset is that they are generally not ashamed of acknowledging what they don’t know and they constantly ask for feedback.

Looking back at my own academic and professional career so far, many times in the past I self-censored my questions because I did not want to appear incapable. However, over time I realized that this attitude was rather detrimental — in the long run, most instances of self-censorship are missed opportunities for learning rather than shame.

Image source: edutopia — It’s important to have a growth mindset!

Before this project, I had very little experience putting machine learning models into production. Of the many decisions I made for the project, one of the best was to declare early and shamelessly to my collaborators that I knew very little about ML infrastructure, but that I wanted to learn. I promised them, however, that as I became more knowledgeable, I would make myself useful to the team.

This turned out to be a pretty good strategy, because people generally love to share their knowledge, especially when they know their mentorship will eventually benefit them. Below are a few examples of things I would not have learned so quickly without the guidance of my partners:

  • Scikit-Learn Pipelines: My collaborator suggested that I could make my code more modular by adopting Sklearn’s pipeline construct. Essentially, pipelines define a series of data transformations that are applied consistently across training and scoring. This tool made my code cleaner, more reusable, and more easily compatible with production models.
  • Model Diagnostics: Given that our prediction problem involves time, my collaborator taught me that typical cross validation will not work, as we could run into the risk of predicting the past using future data. Instead, a better method would be to use time series cross validation. I also learned different diagnostic techniques such as lift charts, and evaluation metrics such as SMAPE.
  • Machine Learning Infrastructure: With help from ML infra engineers, I learned about managing package dependencies via virtualenvs, how to serialize models using pickling, and how to make the model available at scoring time using Python UDFs. All these are data engineering skills that I didn’t know before.
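
To make the second and third bullets a little more concrete, here is a minimal sketch in plain Python. It is not the code from my project: the series, the toy "last value" model, and the helper names (expanding_window_splits, smape, fit_last_value, predict) are all made up for illustration, and the SMAPE formula shown is one common definition among several. The idea is simply that each test fold comes strictly after its training data, and that a fitted model can be round-tripped through pickle.

```python
import pickle


def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs in which training always precedes testing."""
    fold = n_samples // (n_splits + 1)
    if fold < 1:
        raise ValueError("not enough samples for the requested number of splits")
    for k in range(1, n_splits + 1):
        train_end = fold * k
        # The last fold absorbs any leftover observations.
        test_end = n_samples if k == n_splits else train_end + fold
        yield list(range(train_end)), list(range(train_end, test_end))


def smape(actual, forecast):
    """Symmetric MAPE in percent, using the (|A| + |F|) / 2 denominator."""
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2
        total += 0.0 if denom == 0 else abs(f - a) / denom
    return 100.0 * total / len(actual)


def fit_last_value(train):
    """Toy 'model' (hypothetical): remember the most recent training value."""
    return {"last_value": train[-1]}


def predict(model, horizon):
    """Repeat the remembered value for each step of the forecast horizon."""
    return [model["last_value"]] * horizon


series = [10, 12, 13, 15, 18, 20, 21, 25]  # made-up data, ordered oldest to newest
scores = []
for train_idx, test_idx in expanding_window_splits(len(series), n_splits=3):
    model = fit_last_value([series[i] for i in train_idx])
    preds = predict(model, len(test_idx))
    scores.append(smape([series[i] for i in test_idx], preds))

# Round-trip the last fitted model through pickle, as one might before scoring.
restored = pickle.loads(pickle.dumps(model))
```

In practice, scikit-learn's TimeSeriesSplit provides a similar expanding-window split, so there is usually no need to hand-roll one; the version above only spells out the logic.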

As I learned more new concepts, not only was I able to apply them to my own project, but I was also able to drive engaging discussions with the machine learning infrastructure team so that they could build better ML tools for data scientists. This created a virtuous cycle, because the knowledge that was shared with me made me a better partner and collaborator.

Principle I learned: In the long run, most instances of self-censorship are missed opportunities for learning rather than shame. Declare early and shamelessly your desire to learn, and make yourself useful as you become better.


Teaching And Evangelizing

As I got closer to putting my model into production, I noticed that a lot of the skills that I picked up could be very valuable for other data scientists on our team. Having been a graduate student instructor for years, I always knew I had a passion for teaching, and I always learned more about the subject when I became the teacher. Richard Feynman, the late Nobel Laureate in Physics and a phenomenal teacher, spoke about his view on teaching:

Richard Feynman was once asked by a Caltech faculty member to explain why spin one-half particles obey Fermi Dirac statistics. Rising to the challenge, he said, “I’ll prepare a freshman lecture on it.” But a few days later he told the faculty member, “You know, I couldn’t do it. I couldn’t reduce it to the freshman level. That means we really don’t understand it.”

This was really inspiring: if you can’t reduce the subject to its core and make it accessible for others, that means you don’t really understand it. Knowing that teaching these skills could improve my understanding, I sought opportunities to carefully document my model implementations, give learning lunches, and encourage others to try out the tools. This was a win-win because evangelization raises awareness, which in turn helps to drive tool adoption across the team.

As of late September, I have started collaborating with our internal Data University team to prepare a series of classes on our internal ML tools. I am not exactly sure where this will go, but I am very excited about driving more ML education at Airbnb.

Finally, I would end this section with a tweet from Hadley Wickham:

Principle I learned: Teaching is the best way to test your understanding of the subject and the best way to improve your skills. When you learn something valuable, share it with others. You don’t always have to create new software; explaining how existing tools work can also be super valuable.


At Step K, Think About Your Step K+1

From focusing on my own deliverables, to partnering with the ML infrastructure team, to finally teaching and enabling other data scientists to learn more about ML tools, I am really happy that the scope of my project is now much larger than it was a few months ago. Admittedly, I never anticipated this at the outset.

As I reflected on the evolution of this project, one thing that was different from my previous projects was that I always had a slight dissatisfaction with the current state of things, and I always wanted to make it a little bit better. The most eloquent way to characterize this is from Claude Shannon’s essay:

Image source: Book cover from “A Mind at Play: How Claude Shannon Invented the Information Age” by Jimmy Soni, Rob Goodman

“There’s the idea of dissatisfaction. By this I don’t mean a pessimistic dissatisfaction of the world — we don’t like the way things are — I mean a constructive dissatisfaction. The idea could be expressed in the words, This is OK, but I think things could be done better. I think there is a neater way to do this. I think things could be improved a little. In other words, there is continually a slight irritation when things don’t look quite right; and I think that dissatisfaction in present days is a key driving force in good scientists.”

By no means am I a qualified scientist (even though that is somehow in my job title), but I do think the characterization of slight dissatisfaction is quite telling for whether you will be able to extend the impact of your project. Throughout my project, whenever I was at step K, I would naturally start thinking about what to do for step K+1 and beyond:

  • From “I don’t know how to build a production model, let me figure out how” to “I think the tools can be improved, here are my pain points, suggestions, and feedback for how to make the tools better”, I reframed myself from a customer to a partner of the ML infrastructure team.
  • From “let me learn the tools so I can be good at them” to “let’s make these tools more accessible for all the other Data Scientists interested in ML”, I reframed myself from a partner to an evangelizer.

I think this mindset is extremely helpful: use your good taste and slight dissatisfaction to fuel your progress with persistence. That said, I do think that this dissatisfaction cannot be manufactured, and can only come from working on a problem you care about, which brings me to my last point.

Principle I learned: Pay attention to your inner dissatisfaction when working on a project. It offers clues to how you can improve and scale your project to the next level.


Parting Thoughts: You And Your Work

Recently, I came across a lecture by Richard Hamming, an American mathematician well known for his many scientific contributions, including the Hamming code and Hamming distance. The lecture was titled “You And Your Research”, though Dr. Hamming said it could just as well be called “You And Your Career”.

As he shared his stories, a few important points stood out for me.

If what you are doing is not important, not likely to be important, why are you doing it? You must work on important problems. I spent Friday afternoon for years thinking about the important problems in my field [that’s 10% of my working time].

Let me warn you about important problems, importance is not the consequence, some problems are not important because you haven’t gotten an attack. The importance of problem, to a great extent, depends on if you got a way of attacking the problem.

This whole course, I am trying to teach you something about style and taste, so you’ll be able to have some hunch on when the problem is right, what problem is right, how to go about it. The right problem at the right time at the right way counts, and nothing else counts. Nothing.

When Dr. Hamming speaks about importance, he means problems that are important to you. For him, it was scientific problems, and for many of us, it might be something different. He also talked about the importance of having a plan of attack. If you don’t have a plan, the problem does not matter, however big the consequences. Lastly, he mentioned doing it with your own unique style and taste.

His bar for doing great work is extremely high, but it’s one worth pursuing. When you find your important problem, you will naturally try to make it better and make it more impactful; you will find ways to teach others about its significance; you will spend time learning from other great people and building your craft.

What’s a problem that is important to you that is on your critical path?

Bio: Robert Chang is a data scientist at Airbnb working on Machine Learning, Machine Learning Infrastructure, and Host Growth. Prior to Airbnb, he was a data scientist at Twitter. He has a degree in Statistics from Stanford University and a degree in Operations Research from UC Berkeley.

Original. Reposted with permission.