Data Science is Boring (Part 1)

Read about how one data scientist copes with his boring days of deploying machine learning.



By Ian Xiao, Engagement Lead at Dessa

TLDR: Many people cherry-pick the exciting parts of doing Data Science (or ML, Machine Learning) to motivate themselves and others. But we must face reality: the real work is often “boring”, at least compared to what people romanticize. Feeling bored creates tension, which ultimately leads to a high turnover rate in data science. I want to share what I actually do and how I cope with the “boringness in data science”. I hope to help you, the aspiring Data Scientist, set the right expectations, so that once you decide to pursue a Data Science career, you are in it for the long game. Enjoy.

Like What You Read? Follow me on Medium or LinkedIn. Also, subscribe to B.A.B, my education channel that aims to help aspiring data scientists and professionals to develop critical business and technical skills and create top-tier portfolios to impress recruiters, bosses, peers, and customers.

A Personal Invitation. My teams at work have been working hard to release Atlas, a tool we use internally to build better ML applications faster. We will launch a free Community Edition on September 20, 2019. We are looking for 1000 early users who can share their feedback and help us improve. See more details and sign up here. We would love to have you be part of our journey!

 

1. Story Time

 
My young and handsome cousin, Shawn, came to Canada recently. He’s here to pursue a master’s degree in Computer Science. Like many students, Shawn is very passionate about Machine Learning. He wants to become a Data Scientist (or anything that has to do with ML) when he graduates in two years.

As an older brother with a genuine interest in Shawn’s success, I decided to share the most guarded lesson from my data science career: it is not “the Sexiest Job of the 21st Century” as HBR portrayed it; it is boring; it is draining; it is frustrating, just like any other career.

It is my obligation to tell Shawn the truth, even if it is disappointing. It will help him make an informed decision about his career choice (more importantly, I will avoid the 3-AM phone calls from my mother and uncle, who will definitely give me lessons on family, responsibility, mentorship, and honesty).

Being a smart, driven, and inquisitive young man, Shawn asked me to elaborate on what “boring” looks like. This is what this post is about.

In addition, we also touched on key trends in ML and how to be relevant and stand out. I will share this in a few follow-up posts. Please follow me on Medium if you are interested.

 

2. Setting Some Context

 
It is important to know how I got here (my LinkedIn) so you can put things into perspective. I am offering my observations and opinions as a Data Science manager who leads teams to deploy ML systems at Fortune-100 enterprises, manages client relationships, and does some technical work.

A few more important definitions. An ML system is a solution that solves a business domain problem, has an ML component, and has all the other non-ML components required to work with humans or machines.

Deploying means getting the solution to drive real business operations. For example, setting up experiments to train and validate an ML model is not a deployment; setting up a recommendation engine that sends out monthly product offers via email is a deployment. Deploying ML systems poses very different problems from just building a good ML model. Read more here if you are interested.

That said, I do not represent people who joined, say, Google or other high-tech firms as junior developers and became technical managers. These firms do really good work, but I argue that they represent only “the top 1%”. Other Fortune-100 type enterprises often lag in technological sophistication, speed of adoption, and investment in tools and engineering talent.

 

3. Let’s Get into It

 
In short, when I say data science is boring, I mean the deflating feeling when one realizes the gap between the romanticized expectation and reality.

Most young Data Scientists expect to spend most of their time tinkering with and building fancy ML models or presenting ground-breaking business insights with colorful visualizations. Sure, these are still part of the job.

But as enterprises become more educated, they focus more on real operational value. This means enterprises want to deploy more ML systems; they care less about how many new models or fancy dashboards they have. As a result, Data Scientists are asked to do the non-ML work. This drives boringness.

Let’s further qualify what “boring” looks like in data science. It would be very boring if I showed you my typical day from Monday to Friday, so I am going to group my work into major buckets, highlight the expectation vs. the reality, and share my coping mechanisms.
I will use “we” in the narrative because the examples are drawn from a collection of experiences and teams. The examples may not be exhaustive, but I think they will make the point.

3.1 Designing (5–10% of the time)

This is when we collectively get “high” intellectually to problem-solve and propose brilliant ideas. These ideas can include new model architectures, data features, system designs, and so on. Very soon, we hit a low because we need to go with the simplest (and often the most boring) solution due to time constraints and other priorities.

Expected: We implement ideas that can be featured in famous ML venues, such as NIPS or Google’s AI Research blog. Maybe even win the next Nobel Prize.

Reality: We implement things that do the job well. We take photos of some nice whiteboard drawings that are worth framing.

Coping Mechanism: 1) Keep talking about the crazy ideas over drinks with friends outside of my domain; they can be brutally honest (and rude) in shutting down the crazy but stupid ideas. 2) Do the crazy and smart ideas as side projects. 3) It turns out that most of the crazy ideas don’t really work or are just marginally better than the simple ideas, so validating and reinforcing the KISS principle (Keep-It-Simple-Stupid) always gives me comfort and closure.

3.2 Coding (20–70% of the time, depending on the role)

Nothing much to say here. This is when we put on headphones, sip some coffee, stretch our fingers, lock on to the screens, type out beautiful lines of code, and let the magic happen.

Our code generally fits into four categories (% of total lines of code): data pipeline (50–70%), system and integration things (10–20%), ML model (5–10%), and analysis to support debugging and presentations (5–10%). This is roughly in line with other people’s observations. Here is a bigger picture.

Figure: The proportion of model code (illustrative), by Sergey Karayev in his Full Stack Deep Learning course

As you can see, we spend most of the time working on the boring non-ML stuff. Although the ML component is critical, modern frameworks and libraries (e.g. Keras, XGBoost, Python’s sklearn, etc.) have abstracted a lot of the complexity away. This means that achieving the results we need does not require a heavy codebase; the workflow is already well-standardized and optimized (doing low-level optimization is different, but that is probably 1% of cases).

Expected: You spend most of the time developing and refining the ML component; someone else will take care of the rest.

Reality: No one wants 1) to do the things you do not want to do, 2) you to keep all the goodies to yourself, or 3) you to spend a disproportionate amount of time on an already well-optimized workflow.

Coping Mechanism: We each take the lead on design decisions within our domain expertise and act as the prime developer for our own part, while playing a support role on the other parts (e.g. contributing ideas, doing some hands-on development, or QA). Doing so allows us to play to our strengths while learning from others. More importantly, it helps to avoid tension from fighting for the “sexy work”.

3.3 QA, Debug, and Fixing Sh*t (at least 65% of the time)

In my opinion, this is the most boring and painful part of any technical development work. Developing ML systems is no exception.

In the ML context, there are two types of “bugs”: bad results and traditional software issues. Bad results refer to low model scores (e.g. accuracy or precision) or nonsensical predictions (e.g. probabilities that look very skewed relative to business experience). Nothing is wrong with the code; the results simply do not make sense or are not good enough. Traditional software issues include things like broken code or system configuration issues.

Expected: We just need to deal with the bad results and think of smarter ways to build better models. This is still somewhat intellectually engaging; it is also rewarding to see the performance go up due to some good ideas.

Reality: Of the time we spend on QA, debugging, and applying fixes, ~70–90% goes to traditional software issues. Usually, we can achieve a good enough result fairly quickly after we build the end-to-end model training and validation pipeline. Then we often de-prioritize modeling to focus on system issues.
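To make the “bad results” kind of bug concrete, here is a minimal sketch in plain Python. The function names, the toy data, and the 10-point tolerance are all my own assumptions for illustration; in practice we would probably lean on a library’s metrics instead of hand-rolling them.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def precision(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Of the cases predicted positive (1), the fraction that are truly positive."""
    predicted_pos = sum(1 for p in y_pred if p == 1)
    if predicted_pos == 0:
        return 0.0  # no positive predictions at all; report 0 by convention
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_pos / predicted_pos


def looks_skewed(probs: Sequence[float], expected_rate: float,
                 tolerance: float = 0.10) -> bool:
    """Flag predictions whose mean probability is far from the rate the
    business expects. The tolerance is an arbitrary, hypothetical threshold."""
    mean_prob = sum(probs) / len(probs)
    return abs(mean_prob - expected_rate) > tolerance


# Toy data, made up for this example.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
probs = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

print(accuracy(y_true, y_pred))
print(precision(y_true, y_pred))
print(looks_skewed(probs, expected_rate=0.05))
```

Notice that the code runs without errors either way. A low score or a skewed average tells us the results are off, not that the software is broken.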

Coping Mechanism: I gamify the work and keep a “Trophy Board” using GitHub’s Issues feature. I get an instant dopamine rush when I close issue tickets. I feel proud to see the issues we “conquered”; the more, the prouder. Of course, I would be even prouder if everything just worked magically when I hit “go”. That has happened only once, in a programming assignment back in university. I will remember that feeling for the rest of my life. If it happens again in real life, something is probably wrong.

Figure: A snapshot of the GitHub issue board

3.4 Fire-Fighting (10–50% of the time)

This is a nightmare for any delivery team manager, and it is not specific to data science. It does not matter how well thought out the timeline is. Things always come up and throw you off track. To be concrete, surprises can be grouped into three categories: a) external issues like scope changes, upstream system dependencies, and client complaints, b) internal team matters like annoying bugs that take much longer to solve than expected, people getting new jobs and not transitioning properly, under-staffing, and personality conflicts, and c) my own ignorance, which is a miscellaneous bucket for “others”.

Expected: Cruising from the beginning to the end. High-fives and hugs from the clients, your boss, and your team.

Reality: Unexpected things generally happen at the most inconvenient time. There are general patterns, but no catch-all formula, which makes it frustrating.

Coping Mechanism: 1) multiply the timeline by 2–2.5x to leave enough buffer room if it relates to deep technical things or cross-team activities, 2) be aggressive when setting milestones internally, 3) I swear loudly in my mind, well, sometimes verbally when appropriate, 4) breathe, smile, and listen, 5) explore all possible options with the team and prioritize by feasibility, effort, and resistance, 6) if none of these work, don’t wait, ask for help! 7) just execute. Many of these are not coping mechanisms per se, but they are good practices and have been working well.
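The buffer rule in point 1 is simple arithmetic, and a rough sketch looks like the snippet below. The function name `buffered_estimate` and the `risky` flag are hypothetical names I made up for this illustration; only the 2–2.5x multiplier comes from the rule above.

```python
from typing import Tuple


def buffered_estimate(raw_days: float, risky: bool) -> Tuple[float, float]:
    """Return a (low, high) planning range in days.

    `risky` marks work that involves deep technical things or cross-team
    activities; only that kind of work gets the 2x to 2.5x buffer.
    """
    if raw_days < 0:
        raise ValueError("raw_days must be non-negative")
    if risky:
        return (raw_days * 2.0, raw_days * 2.5)
    return (float(raw_days), float(raw_days))


# A 10-day gut-feel estimate for a cross-team integration task.
low, high = buffered_estimate(10, risky=True)
print(f"Plan for {low:.0f} to {high:.0f} days")
```

The buffered range is what goes into the plan shared outside the team; internally, we can still chase the aggressive milestones from point 2.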

 

4. To Sum Up

 
All of this is to say that real-world data science is difficult. People who aspire to a career in ML should recognize that there is a lot more to it than just building models. You will eventually get bored and frustrated, just like you would with any career. It is okay and normal. Most importantly, you should develop a coping mechanism so you can stay in the game for the long run and enjoy the small rewards along the way and the final victory.

This is only one part of my conversation with Shawn. In two follow-up posts, I will share my thoughts on what the field may look like in two years and how to stay relevant and stand out. Stay tuned!

Until Next Time.

Ian Xiao

 
Bio: Ian Xiao is Engagement Lead at Dessa, deploying machine learning at enterprises. He leads business and technical teams to deploy Machine Learning solutions and improve Marketing & Sales for F100 enterprises.

Original. Reposted with permission.
