How To Become a 10x Data Scientist, part 2
A 10x developer is someone who is 10 times more productive than average. We adapt tips and tricks from the developer community, including code design and selecting the right tools for the job, to help you become a more proficient data scientist loved by team members.
By Stephanie Kim, Algorithmia.
Continued from How To Become a 10x Data Scientist, part 1
Being consistent with your code style is just as important as following naming conventions. To gain some basic style points, stick to the same case; don’t mix camel case and snake case in the same script, or your code quickly becomes hard to read and navigate. You should also stick with the same method of accomplishing a task. For instance, if you need to remove duplicates from a dictionary in a couple of spots in your code, don’t get creative and use a different way each time just because you saw it on Stack Overflow. Use the clearest and least clever method, and use it consistently across your code and across scripts. Again, the purpose of consistency is to avoid confusing yourself and others, which will allow you to debug faster! (Notice a debugging theme here?)
Remember how we just talked about having to remove duplicates in a dictionary in multiple places in your code? Use functions so you don’t have to rewrite that code multiple times. Even if you aren’t reusing the code, it’s a crucial best practice to wrap your code in functions. Your functions should be small and only do one thing so they can be reused.
When you don’t use functions, then you’ll have global variables that result in name collisions, non-testable code, and code that repeats itself often.
By utilizing functions, your code is composable and easy to write unit tests for.
But don’t stop at writing small functions that only do one thing. Make sure to abstract your functions so you can reuse them. This lends itself to the DRY (Don’t Repeat Yourself) mentality and will speed up your development time, which will at least make you a 2xer.
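As a rough sketch of what that looks like in Python, here is one small function that removes duplicate values from a dictionary. The function name and the choice to keep the first key seen for each value are assumptions made for illustration; your own definition of a duplicate may differ.

```python
def remove_duplicate_values(mapping):
    """Return a new dict that keeps only the first key for each value.

    Assumes the values are hashable (strings, numbers, tuples, ...).
    """
    seen = set()
    result = {}
    for key, value in mapping.items():
        if value not in seen:
            seen.add(value)
            result[key] = value
    return result


scores = {"a": 1, "b": 2, "c": 1}
print(remove_duplicate_values(scores))  # {'a': 1, 'b': 2}
```

Because the function does one thing and touches no global variables, you can call it from every spot that needs deduplication instead of rewriting the loop.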
Less common, but important to code design, is using code stubs. Code stubs are simply mock classes and functions that show inputs, outputs, and comments, providing an outline for your code. Writing code stubs before you start on the real meat of your code will force you to think through your design first and will help you avoid monstrous spaghetti code. You’ll notice where you would be repeating code before you write it, and you’ll think through which data structures are most appropriate.
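To make the idea concrete, here is a minimal sketch of what stubs for a small tweet-sentiment script might look like in Python. Every function name and data shape here is hypothetical; the point is only that inputs and outputs are written down before any logic exists.

```python
def load_tweets(path):
    """Read raw tweets from disk.

    Input: path (str) to a text file with one tweet per line.
    Output: list of str, one entry per tweet.
    """
    raise NotImplementedError


def clean_tweets(tweets):
    """Normalize tweet text before scoring.

    Input: list of str (raw tweets).
    Output: list of str (lowercased, URLs removed).
    """
    raise NotImplementedError


def score_sentiment(tweets):
    """Assign a sentiment score to each tweet.

    Input: list of str (cleaned tweets).
    Output: list of float, one score per tweet.
    """
    raise NotImplementedError
```

Raising NotImplementedError keeps the outline importable while making it obvious which pieces still need real code.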
Code stubs bring us to writing both comments and documentation. To truly become beloved by your coworkers and increase your own efficiency as a data scientist, write helpful, concise comments. Not only should you include comments about what the piece of code does, but about its inputs and outputs as well.
Also, probably the coolest thing about docstrings is that they can be turned into documentation via libraries in most languages. For instance, Python has a tool called Sphinx that allows you to turn your docstrings into full-blown documentation.
You might know what your code does now, but down the line when you are trying to debug or add a feature you and others will be glad for the comments.
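Here is a small sketch of a docstring that documents inputs and outputs using the reStructuredText field style that Sphinx understands. The function itself is a made-up example, not something from a particular library.

```python
def min_max_scale(values):
    """Scale numbers linearly into the range [0, 1].

    :param values: non-empty list of ints or floats.
    :returns: list of floats in the same order as the input.
    :raises ValueError: if ``values`` is empty.
    """
    if not values:
        raise ValueError("values must not be empty")
    low, high = min(values), max(values)
    if high == low:
        # All values are identical, so there is no range to scale by.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


print(min_max_scale([2, 4, 6]))  # [0.0, 0.5, 1.0]
```

Anyone can read the same text later with help(min_max_scale), without opening the source file.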
No matter what language you’re coding in, please use exception handling and leave a helpful error message for yourself, your coworkers, and end users. In R, for example, you can use the stop function to pass along the error message from the API that’s being called.
If the data isn’t what the API expects, then it throws a helpful error message. In your own code you could write a message within the stop function that helps the user such as:
stop(paste0("Make sure all your inputs are strings: ", e))
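The same pattern carries over to Python. The sketch below was written for illustration only: hypothetical_api stands in for whatever service you are calling, and both function names are made up. It catches the error and re-raises it with a message that tells the user what to fix.

```python
def hypothetical_api(text):
    """Stand-in for an external API that only accepts strings."""
    if not isinstance(text, str):
        raise TypeError(f"expected str, got {type(text).__name__}")
    return text.upper()


def shout_all(inputs):
    """Send every input through the API, adding a helpful hint on failure."""
    try:
        return [hypothetical_api(item) for item in inputs]
    except TypeError as e:
        # Keep the original error text and chain the original exception.
        raise TypeError(f"Make sure all your inputs are strings: {e}") from e
```

Using `raise ... from e` keeps the original traceback attached, so whoever debugs the failure sees both your hint and the underlying cause.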
For unit testing, The Hitchhiker’s Guide to Python has an example that uses the Python testing library pytest.
While writing unit tests is fairly common for developers, they are rarely used in the data science world. Sure, you are validating your model using cross validation, a confusion matrix, and other methods. But are you testing the queries that are getting your data for you? How about the various methods you are using to clean and transform the data the way you need it for your model? These areas are crucial in safeguarding against “Garbage In, Garbage Out”. When you test your code, you are future-proofing it against changes that might introduce bugs, and when you are your own QA, everyone will think you’re a rockstar due to the lack of bugs in your code once it goes to production.
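As a minimal sketch of testing a cleaning step, the functions below follow pytest’s convention of plain functions whose names start with test_ and that use bare assert statements. The clean_ages function and its 0 to 120 range are hypothetical; substitute whatever rules your own data needs.

```python
def clean_ages(ages):
    """Drop missing (None) and out-of-range ages from a list."""
    return [age for age in ages if age is not None and 0 <= age <= 120]


def test_clean_ages_drops_missing_values():
    assert clean_ages([34, None, 51]) == [34, 51]


def test_clean_ages_drops_out_of_range_values():
    assert clean_ages([-1, 18, 400]) == [18]


def test_clean_ages_keeps_boundary_values():
    assert clean_ages([0, 120]) == [0, 120]
```

Saved in a file whose name starts with test_, these should be picked up when you run pytest in that directory, and a failing assert tells you exactly which cleaning rule broke.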
Using version control for your projects is an important step in becoming a 10x data scientist. Obvious benefits are saving different versions of your model and easily working across teams, but by using version control with a backup in a repository you also safeguard against losing work in case of a stolen laptop or crashed hard drive.
In beta, there is an open source data version control project called Data Version Control which looks promising for data science workflows. It relies on Git and allows projects to be reproducible across teams by building a data dependency graph. Your data is saved separately from your model and it works like other version control allowing you to roll back to previously saved snapshots.
10x developers know to use the right tool for the job, whether it’s using a library to save time, switching languages for performance, or using an API instead of building out the solution themselves.
Say you have Twitter or other social data and need a sentiment analysis. One option is to label that data yourself and train your own model, or you could use a pre-trained model. It’s OK to not reinvent the wheel by building every data model yourself. Use the tools that are best suited for the job, even if that means using ones that you didn’t build.
We’ve all written a Bash script paired with a Cron job to automate some reports, right? But after you spend some time trying to debug a report written by someone else that’s automated by a Cron job, without even knowing where it was running from, you realize there has to be a better way. With an automation tool like Puppet, Chef, Ansible, or any of the other popular options, you can run your jobs from a centralized location, so debugging someone else’s (or your own) job is a lot faster.
Sometimes you’re not going to have a team to hand your pickled model to, so you might need to know how to deploy your model yourself.
While there are many differences between these providers, they range from incredibly easy-to-use to requiring much more setup and knowledge. This section could be a talk in itself. If you want more details about model hosting, check out a couple of different talks we covered about intro to deploying your model and deploying and scaling your deep learning model.
Things that could be deal breakers:
- Ease of use
- Cost (including add-ons and hidden costs such as hosting data)
- Vendor lock-in
- Languages available
How does it make you a 10x data scientist:
By knowing how to deploy your model, you go from being able to tell a story with your data to easily sharing it with team members (no matter what language they write) or deploying it to a production environment to share with thousands of users. This will help you become a 10x-er because once you understand this you can create more performant models that will make users happy. And when users are happy, business owners are happy.
To round out this post, here are some favorite tips on how to become a 10x data scientist:
- Pattern Matching. This comes from hard-earned experience of running into a similar problem before and realizing that you could reuse or modify a solution to your current problem.
- Learn how to explain your code – to yourself and others. This means whiteboarding, doing/getting code reviews and even pair programming. Get used to talking about your code and your thought process.
- Learn how/when to quit and start over. Don’t be afraid to start over if you realize there is a better way to solve the problem. It’s better to start over and do it a better way than to stick with something that isn’t optimal or performant.
- Create your own stock of Gists or organize code snippets through a repository on GitHub or other hosting service.
Lastly, the same theme has cropped up all throughout this post on becoming a 10x data scientist, and that one tenet is debugging. Every 10x developer is a master debugger because the rule is that however long you code for, you can multiply that by 10 and get the time it will take you to debug it. A few tips for becoming a great debugger: use exception handling, use the debugger in your IDE, talk through your code looking for errors in your logic, and check the source code of the library involved in the error to make sure you are passing in what the code expects.
Even if you only take a few points away from this post, you’ll be on the path to becoming a 10x data scientist. Good luck on your journey, and feel free to share your tips and tricks for being a 10x data scientist with us @Algorithmia.
Original. Reposted with permission.
Bio: Stephanie Kim is Developer Evangelist at Algorithmia.