How To Become a 10x Data Scientist, part 1
A 10x developer is someone who is 10 times more productive than average. We adapt tips and tricks from the developer community to help you become a more proficient data scientist loved by team members and stakeholders.
By Stephanie Kim, Algorithmia.
Recently I gave a talk at PyData Seattle about how to ramp up your data science skills by borrowing tips and tricks from the developer community. These suggestions will help you become a more proficient data scientist who is loved by your team members and stakeholders.
This post is broken up into five parts:
- History and controversy of the 10x developer.
- Project design.
- Code design.
- Tools for the job.
- Productionizing models.
Also, if you want to watch the original talk in its entirety, check it out here.
A 10x developer is someone who is literally 10 times more productive than the average developer.
A 10x developer not only produces more code per hour than the average developer, but also debugs like a boss, introduces fewer bugs because they test their code, mentors junior developers, writes their own documentation, and has many other broad skills that go beyond knowing how to code.
A 1968 experiment by H. Sackman, W. J. Erikson, and E. E. Grant, “Exploratory experimental studies comparing online and offline programming performance”, found a wide range in the time it took programmers to accomplish coding tasks.
The study focused on programmers who had an average of 7 years of experience and found a 20:1 difference between programmers.
Although flaws were found in the experiment, such as combining programmers writing in low-level languages with others writing in high-level languages, later studies found similar results.
While there has been extensive debate regarding whether or not 10x developers exist, this article will instead focus on what you can do to be a more productive data scientist by taking tips and tricks from seasoned developers who others consider remarkably faster than their counterparts.
Get to Know the Business
Whether you work for an education, biotech, or finance company, you should have at least a high-level understanding of the business you’re solving problems for.
In order to effectively communicate the story behind your data analysis, you should find out what drives the business and understand the business’s goals.
For instance, if you were optimizing food truck locations, you would want to understand foot traffic, competition, events happening in the area, and even weather. You’d also want to understand why the business is optimizing for locations. It might be to increase sales for current trucks, or maybe they are looking to add trucks.
Even though you might be a data scientist at a job search site today and at a financial firm tomorrow, you should know what makes the business tick in order to make your analysis relevant to stakeholders.
You should also understand the business processes for your project, such as who needs to sign off on the end results, who the data model will be passed on to once your part is complete, and what the expected timeframe is.
And finally, you should make sure you know who the stakeholders are and introduce realistic expectations to non-technical stakeholders. Expect to be an educator and teach non-technical stakeholders why reaching their goals might take more time or resources than they thought.
When you understand the stakeholders’ goals and communicate the technology, expertise, and time it would take to build out their solution, you will become an even more valuable asset to your company.
Know the Data
While it’s important to understand the business, it’s more important to understand the data. You’ll need to know how the data was extracted, when it was extracted, who is responsible for quality control, why there might be gaps in the data (for instance a change in vendors or a change in extraction methods), what might be missing and what other data sources could potentially be added to create a more accurate model.
It really comes down to talking to different teams and asking a lot of questions. Don’t be afraid to ask what people are working on and to discuss what you’re working on too, because you never know when people are doing duplicate work or have a cleaner version of the data you need. Being able to query a database instead of making multiple API calls to SiteCatalyst, for example, will save you a ton of time.
Why does taking time and care during project design make you a 10x data scientist?
- You’ll only do the work that needs to be done (think before you code) so you’re faster at getting a project done because you’ll do less work!
- By finding gaps between what customers/users think they need and what they really need, you’ll position yourself as the expert and a consensus builder.
- You’ll solidify your own understanding of what the ask is so you won’t make costly errors.
While there are many best practices when designing your code, there are a few that stand out which will increase your x-value considerably.
The first time I heard the idea that clarity beats cleverness was in my writing classes in college. It’s easy to get caught up in your own cleverness and use the latest word of the day to get your ideas across, but in writing, as in programming, you’ll likely confuse not only yourself but others too.
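As a minimal sketch of the idea, here are three ways to write the same sort in Scala. The data is hypothetical: a sequence named x of (truck name, daily sales, city) tuples invented for illustration, so the element being sorted on is the next to last one in each tuple.

```scala
object ClarityExample {
  // Hypothetical data, for illustration only: (truck name, daily sales, city).
  val x: Seq[(String, Int, String)] = Seq(
    ("Taco Truck", 120, "Seattle"),
    ("Pho Cart", 85, "Seattle"),
    ("Waffle Wagon", 150, "Tacoma")
  )

  // 1. Shorthand: the underscore stands for each element of x.
  val sortedShorthand = x.sortBy(_._2)

  // 2. Named argument: each element e is mapped to its second field.
  val sortedNamed = x.sortBy(e => e._2)

  // 3. Explicit name that says what each element represents.
  val sortedExplicit = x.sortBy(truck => truck._2)

  def main(args: Array[String]): Unit = {
    // All three variants produce the same ordering (ascending daily sales).
    println(sortedShorthand == sortedNamed && sortedNamed == sortedExplicit)
  }
}
```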
In the above Scala example, the first line shows the sortBy method using a shorthand syntax. While it’s concise, it’s hard to think through what the underscore stands for. Even though this is a common pattern that many people use as an argument name in an anonymous function, for less advanced developers (or when you haven’t looked at your code for a while), it becomes tedious to figure out what the code does.
In the second example we at least use an argument name, plus it’s showing assignment and we can see that it’s sorting by the next to last element in the sequence x.
When code is less abstracted, it’s easier to debug later, so in the third example I explicitly name my argument so that it represents the data.
When your brain has to go through each step and either look up or recall what the shorthand does, it takes longer to debug and longer to add a new feature. So even though shorthand like the examples above is concise and faster to type initially, in the long run it will benefit both you and others to avoid being too clever.
While we won’t go over caching, we will cover the importance of naming things. Imagine you’re looking through some old code and you see a sequence being sorted like in the Scala example:
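A minimal sketch of what such old code might look like. Here fetch is a hypothetical stand-in for wherever the data comes from, and the tuple contents are invented for illustration:

```scala
object OldCode {
  // Hypothetical stand-in for data pulled from an API, database, or stream.
  def fetch(): Seq[(String, Int)] =
    Seq(("b", 2), ("a", 3), ("c", 1))

  // Neither name says what the data is or what it is sorted by.
  val x = fetch()
  val s = x.sortBy(_._2)
}
```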
Using a single letter to name a sequence doesn’t provide useful information at all because likely you are pulling your data from somewhere like an API, a database or a data stream in Spark where you’d have to run your code to see what “x” is.
So keeping with the Scala example from before:
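One way this could look, as a sketch only. The FoodTruck case class, its fields, and the sample values are assumptions made for illustration:

```scala
object NamedCode {
  // Hypothetical record type; the field names are illustrative.
  final case class FoodTruck(name: String, dailySales: Int)

  val foodTrucks: Seq[FoodTruck] = Seq(
    FoodTruck("Taco Truck", 120),
    FoodTruck("Pho Cart", 85),
    FoodTruck("Waffle Wagon", 150)
  )

  // The names state what is being sorted and by which field.
  val foodTrucksBySales: Seq[FoodTruck] =
    foodTrucks.sortBy(truck => truck.dailySales)
}
```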
You can tell what we are sorting without even running the code.
However, sometimes there are perfectly good reasons to use X as a variable name. For example, X is often used in machine learning libraries, where X is known to mean the observed data while y is the variable you are trying to predict. In this case it’s preferable to use the conventions of your field, where “model”, “fit”, “predicted”, and “X” and “y” mean the same thing to everyone in that field.
Outside of data science, you would be expected to follow the conventions of the programming language you are using. For example, I recommend checking out docs such as the PEPs for Python to learn best practices.
Being careful with your naming conventions and being clear instead of clever with your code makes refactoring and debugging both easier and faster. By following these two tenets of code design, you’ll be well on your way to becoming a 10x data scientist.
Original. Reposted with permission.
Bio: Stephanie Kim is Developer Evangelist at Algorithmia.