Advice For New and Junior Data Scientists
By Robert Chang, Airbnb.
Image credit: Alice Truong
Two years ago, I shared my experience of doing data science in industry. The piece was originally meant to be a private reflection to celebrate my two-year twitterversary at Twitter, but I published it on Medium instead because I believed it could be useful for many aspiring data scientists.
Fast forward to 2017: I have been working at Airbnb for a little less than two years and have recently become a senior data scientist, an industry title used to signal that one has acquired a certain level of technical expertise. As I reflect on my journey so far and imagine what comes next, I have once again written down a few lessons that I wish I had known earlier in my career.
If the intended audience of my previous post was aspiring data scientists and people who are completely new to the field, then this article is for people who are already in the field but are just starting out. My goal is not only to use this post as a reminder to myself of the important things I have learned, but also to inspire others as they embark on their DS careers!
Whose Critical Path Are You On?
Philip Guo, an outstanding academic and prolific blogger, reflected on his experience interacting with various mentors throughout his years as a student, intern, and researcher. In his blog post “Whose Critical Path Are You On?”, he made the following observation:
If I was on my mentor’s critical path [for career advancement or fulfillment], then they would fight hard to make sure I got the help that I needed to succeed. Conversely, if I wasn’t on my mentor’s critical path, then I was usually left to fend for myself. […] If you get on someone’s critical path, then you force them to tie your success to theirs, which will motivate them to lift you up as hard as they can.
Image credit: The Icefields Parkway // Daniel Han
This work dynamic is pretty intuitive, and I wish I had internalized it earlier in my career when choosing projects, selecting teams, or even evaluating which mentors or companies to work for.
As an example, while at Twitter, I had always wanted to learn more about machine learning, but my team, despite being very data-driven, largely needed data scientists to focus on experiment design and product analytics. Despite my best efforts, I often found it difficult to marry this intellectual desire with my team’s critical projects.
As a result, when I arrived at Airbnb, I made a conscious decision to join a project and team where ML was critical to success. I worked with my manager to identify a few promising opportunities, one of which was to model the lifetime value (LTV) of listings on Airbnb.
This project was critical not only to the success of our business, but also to the development of my career. I learned a great deal about the workflow of building machine learning models at scale, and there was no better way to learn than in the context of solving a concrete business problem.
Undoubtedly, I was very lucky to find a project that aligned with my aspirations and with the skills I wanted to build. I believe the framework of picking projects on our mentors’ critical paths can make us increasingly “lucky” over time at matching our aspirations with the right projects at work.
Principle I learned: We all have skills that we would like to develop and intellectual interests that we would love to pursue. It’s important to evaluate how well our aspirations align with the critical path of the environment we are in. Find projects, teams, and companies whose critical paths best align with yours.
Picking the Right Tools For The Problem
Before Airbnb, I had been coding in R and dplyr for most of my professional life. After starting on the LTV project, I soon realized the deliverable was not a piece of analysis code, but rather a production machine learning pipeline. Given that it is much easier to build complex pipelines in Airflow using Python, I was faced with a dilemma — should I switch from R to Python?
Image source: quickmeme.com (besides R or Python, Excel is also a serious contender 👊)
This turns out to be a very common question among data scientists, since many struggle to decide which language to choose. For me, there is clearly a switching cost once you have committed to one or the other. I went through the pros and cons to understand the tradeoffs, but the more I thought about it, the more I fell into the trap of decision paralysis. (Here is an entertaining talk that demonstrates this concept). Eventually, I escaped from this paralysis after reading this response on Reddit:
Instead of thinking about which programming language to learn, think about which language offers you the right set of Domain Specific Languages (DSL) that fit your problems.
The appropriateness of a tool is always context dependent and problem specific. The question is not whether I should learn Python; it is whether Python is the right tool for the job. To elaborate on this point, here are a few examples:
- If your goal is to apply the most current, cutting-edge statistical methods, R is likely to be the better choice. Why? Because R is built by statisticians and for statisticians. Nowadays, academics publish their research not only in papers but also in R packages. Each week, there are many interesting new R packages made available on CRAN, like this one.
- On the other hand, Python is great for building production data pipelines, since it is a general-purpose programming language. For example, one can easily wrap a scikit-learn model in a Python UDF to do distributed scoring in Hive, orchestrate Airflow DAGs with complex logic, or write a Flask web app to showcase the output of the model in a browser.
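To make the Python bullet a little more concrete, here is a minimal sketch of the "wrap a model for distributed scoring" idea. It assumes the streaming style used by Hive's TRANSFORM clause, where rows arrive as tab-separated text and scored rows are written back the same way. The names `toy_predict` and `score_rows`, the row layout, and the weights are all hypothetical stand-ins; in a real pipeline the stand-in function would be replaced by a trained scikit-learn model's `predict` method.

```python
from typing import Callable, Iterable, Iterator, List


def toy_predict(features: List[float]) -> float:
    """Stand-in for a trained model: a fixed linear score (hypothetical weights)."""
    weights = [0.5, 2.0]
    return sum(w * x for w, x in zip(weights, features))


def score_rows(lines: Iterable[str],
               predict: Callable[[List[float]], float]) -> Iterator[str]:
    """Score tab-separated rows shaped like: listing_id, feature_1, feature_2, ..."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        listing_id, *raw_features = line.split("\t")
        features = [float(value) for value in raw_features]
        # Emit the id and its score, tab-separated, one row per line.
        yield f"{listing_id}\t{predict(features):.4f}"


# In a streaming job `lines` would be sys.stdin; a small in-memory
# sample is used here so the sketch runs on its own.
sample_rows = ["listing_a\t1.0\t2.0\n", "listing_b\t0\t0.5\n"]
for scored in score_rows(sample_rows, toy_predict):
    print(scored)
```

Keeping the scoring logic in a plain function that takes any iterable of lines means the same code can be exercised locally on a list and then pointed at standard input when it runs inside the pipeline.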
For my particular project, I needed to build a production machine learning pipeline, and my life would be a lot easier if I did it in Python. Eventually, I rolled up my sleeves and embraced this new challenge!
Principle I learned: Instead of fixating on a single technique or programming language, ask yourself, what is the best set of tools or techniques that will help you to solve your problem? Focus on problem solving, and the tools will come naturally.
Building A Learning Project
Even though I had not used Python for data science work before, I had played with the language in a different capacity. However, I had never really learned Python fundamentals properly. As a result, I got scared when code was organized into classes, and I always wondered what __init__.py was used for.
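For readers with the same two questions, here is a small, self-contained sketch rather than anything from the original project. It builds a throwaway package in a temporary directory to show that `__init__.py` is the file Python runs when a package is first imported, while `__init__` is the method that runs when an instance of a class is created. The package name `ltv_demo` and the class `MeanModel` are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk (all names are hypothetical):
#   ltv_demo/
#       __init__.py   <- marks the directory as a regular package
#       models.py     <- a module holding one small class
root = Path(tempfile.mkdtemp())
package_dir = root / "ltv_demo"
package_dir.mkdir()

(package_dir / "models.py").write_text(
    "class MeanModel:\n"
    "    def __init__(self, values):\n"
    "        # __init__ (the method) runs when an instance is created\n"
    "        self.mean = sum(values) / len(values)\n"
    "\n"
    "    def predict(self):\n"
    "        return self.mean\n"
)

# __init__.py (the file) may be empty; here it re-exports the class so
# callers can write `from ltv_demo import MeanModel`.
(package_dir / "__init__.py").write_text("from .models import MeanModel\n")

sys.path.insert(0, str(root))      # make the temporary directory importable
importlib.invalidate_caches()
ltv_demo = importlib.import_module("ltv_demo")

model = ltv_demo.MeanModel([100.0, 200.0, 300.0])
print(model.predict())
```

The class bundles some state (`mean`) with the behavior that uses it (`predict`), and the package's `__init__.py` decides which names are exposed at the top level.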
To really learn the fundamentals properly this time, I took inspiration from Anders Ericsson’s research on Deliberate Practice:
Deliberate Practice is activities designed, typically by a teacher, for the sole purpose of effectively improving specific aspects of an individual’s performance.
Given that I was my own teacher, insights from Dr. Ericsson were very helpful. For example, I kicked off my “learning project” by curating a set of materials that were most relevant for doing ML in Python. It took me a few weeks to settle on a personalized curriculum. I stress-tested this curriculum by asking experienced Pythonistas to review my plan. All of this pre-work was meant to ensure I would be on the right learning path.
Here is a glimpse of my personalized curriculum
Once I had a clearly defined curriculum, I used the following strategies to deliberately practice on the job:
- Practice Repeatedly: I forced myself to carry out mundane, non-mission-critical analyses in Python instead of in R. This dragged down my productivity initially, but it forced me to get familiar with the basic API of pandas without the burden of an urgent deadline.
- Create a Feedback Loop: I found opportunities to review other people’s code and fix small bugs when appropriate. For example, I tried to understand how our internal Python libraries were designed before using them. When writing my own code, I also tried to refactor it several times to make it more readable for everyone.
- Learn By Chunking and Recalling: At the end of each week, I wrote down my progress, including the important resources I had studied, the concepts I had learned, and any major takeaways. By recalling the material I had learned, I was able to internalize the concepts better.
Slowly and gradually, I got better each week. It certainly wasn’t easy, though: there were times when I had to look up basic syntax in both R and Python because I was switching back and forth between the two languages. That said, I kept in mind that this was a long-term investment that would pay dividends once I dove into the ML project.
Principle I learned: Planning ahead before diving into a project helps you practice more deliberately, a point supported by many field experiments. Repeating, chunking, recalling, and getting feedback are among the most useful activities to reinforce learning.