Data Scientists, You Need to Know How to Code

You need to know how to code — and not just code, but write good code.

By Tyler Folkman, Head of AI at Branded Entertainment Network

I know what you’re thinking — “Of course I know how to code, are you crazy?”

You write tons of code in Jupyter notebooks, hundreds of lines, every day. Clearly, you can code. It's not as if you are training machine learning models by hand or in Excel (though that is possible).

So what could I possibly mean?

I hate to break this to you, but I wouldn't consider most of the coding data scientists do to be real programming. You use programming languages as a tool to explore data and build models, but the program you create isn't something you think much about as long as it gets the job done.

Your code is usually messy and may not even run sequentially (thanks to notebooks). You likely have never written any unit tests and have little knowledge of how to write good, reusable functions.

But as data science becomes more and more embedded in real products, that type of code isn't going to cut it. You can't trust bad code, and putting code you can't trust into products leads to tremendous amounts of technical debt and bad user experiences.

“Okay, okay, but I am a data scientist, not a software engineer,” you say. “I build the models, and it is someone else's problem to clean up the code.” While that may work at some companies for now, I deeply believe that a much better pattern is for data scientists to learn how to write better code. You may never become an elite-level software engineer, but with some work you can write code that can be trusted and put into production.

Start With Your Functions

When learning how to level up your code, start with how you write functions. Most code is just a series of functions (or potentially classes), and if you can learn to write pretty good functions, that will go a long way toward improving your code quality.

Your functions at a minimum should:

Do only one thing
Have documentation
Use good variable names

While there are entire books written on how to write clean functions, these 3 items are a great place to start.

You should never have a function that feels like it is trying to do more than a single thing. Some signs that your function might be doing too much:

It is longer than a single screen length or roughly 30 lines of code (in my experience)
It is hard to name your function clearly because of how many things it does
It contains a lot of code within if/else blocks that should actually be broken into separate functions

Functions that only do one thing are important because they make your code easier to understand, manage, and test (more on testing later).
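As a minimal sketch of what "one thing per function" can look like, the example below splits a small load, clean, and summarize task into three functions. The function names, the "price" column, and the sample CSV text are all made up for illustration.

```python
import csv
import io
from statistics import mean


def parse_rows(csv_text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def drop_missing_prices(rows):
    """Keep only the rows that have a non-empty 'price' value."""
    return [row for row in rows if row.get("price")]


def average_price(rows):
    """Return the mean of the 'price' column as a float."""
    return mean(float(row["price"]) for row in rows)


# Hypothetical sample data: the pear row has no price.
raw_csv = "item,price\napple,1.50\npear,\nplum,2.50\n"
clean_rows = drop_missing_prices(parse_rows(raw_csv))
print(average_price(clean_rows))
```

Each piece is easy to name, and each can be checked on its own without running the others.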

Any function being released to production should have a docstring that describes what the function does, gives information on the input parameters, and potentially provides some simple examples of how to use the function. You will thank yourself in the future when you have well-documented functions, and others will have a significantly easier time understanding your code.
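Here is one way such a documented function might look. The function itself is a hypothetical example, and the section layout (Args, Returns, Raises, Example) is just one common docstring convention, not the only acceptable one.

```python
def min_max_scale(values, lower=0.0, upper=1.0):
    """Rescale numbers linearly so they span [lower, upper].

    Args:
        values: Non-empty list of numbers that are not all equal.
        lower: Value the smallest input maps to.
        upper: Value the largest input maps to.

    Returns:
        A new list of floats in the same order as ``values``.

    Raises:
        ValueError: If ``values`` is empty or all elements are equal.

    Example:
        >>> min_max_scale([10, 20, 30])
        [0.0, 0.5, 1.0]
    """
    if not values:
        raise ValueError("values must not be empty")
    smallest, largest = min(values), max(values)
    if smallest == largest:
        raise ValueError("values must contain at least two distinct numbers")
    span = largest - smallest
    return [lower + (value - smallest) / span * (upper - lower) for value in values]
```

Someone reading only the docstring can tell what goes in, what comes out, and what can go wrong.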

Lastly, please use understandable and helpful variable names. Too many data scientists get into a bad habit of using variable names such as “a”, “a1”, and “a2”. Short, unhelpful variable names are faster to type when experimenting, but when putting code into production, make sure your variable names will help others understand your code.
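To make the difference concrete, here is a small before-and-after sketch. Both functions are invented for illustration and behave identically; only the names change.

```python
# Hard to follow: the names say nothing about what the values mean.
def f(a, a1, a2):
    return [x for x in a if a1 <= x <= a2]


# Same logic, but the names explain the intent.
def filter_within_range(values, lower_bound, upper_bound):
    return [value for value in values if lower_bound <= value <= upper_bound]
```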

Remove Print Statements

Data scientists often use print statements to display information on what is happening. However, in production, these print statements should either be removed if they are no longer needed or be converted to log statements.

Logging should be how you communicate information and errors from your code. A good Python library to take a look at to make logging even simpler is Loguru. It automatically takes care of most of the annoying parts about logging and it feels much more like just using print statements.
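As a rough sketch of converting print statements to log statements, the example below uses Python's built-in logging module rather than Loguru, so it runs without installing anything. The logger name and the function are hypothetical.

```python
import logging

logger = logging.getLogger("training")  # hypothetical logger name


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None if the denominator is zero."""
    if denominator == 0:
        # Previously something like: print("denominator is zero!")
        logger.warning("Denominator is zero; returning None")
        return None
    ratio = numerator / denominator
    logger.info("Computed ratio %.3f", ratio)
    return ratio


if __name__ == "__main__":
    # Configure output once, at the entry point of the program.
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    safe_ratio(1, 4)
    safe_ratio(1, 0)
```

Unlike print statements, these messages carry a severity level and can be silenced or redirected through configuration without touching the function.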

Use a Style Guide

Style guides in programming make it easier for many people to work on the same code while that code still looks, for the most part, as if a single person wrote it.

Why does this matter?

When you have a consistent style, it is much easier to navigate and understand the code. It's amazing how much easier it can be to spot a bug when using a style guide. Conforming to a standard way of writing code makes it easier for both you and others to navigate that code. That means you don't have to spend much time unpacking how the code was formatted to understand it and can instead focus on what the code does and whether it does it correctly and well.

PEP 8 is probably the most widely used style guide for Python, though there are many out there. Another popular source of style guides is Google, which has made its internal style guides public.

What matters is that you pick one and try to stick to it. One way to make that easier is to enable your IDE to check for style errors and to set up style checks that stop code from being pushed if the style guide isn't followed. You could also commit even further by using an auto-formatter. These tools let you write code however you want and then, when run, reformat it to conform to the standard. A popular one for Python is Black.
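To give a feel for what following a style guide changes, here is a small before-and-after sketch. The function is made up, and the second version was laid out by hand to follow PEP 8 conventions (snake_case names, spaces around operators, one statement per line); it is not the captured output of any particular tool.

```python
# Before: runs fine, but ignores PEP 8 spacing and naming conventions.
def rootMeanSquare( xs ):
    total=0
    for x in xs: total+=x*x
    return (total/len(xs))**.5


# After: same behavior, written to follow PEP 8.
def root_mean_square(values):
    total = 0
    for value in values:
        total += value * value
    return (total / len(values)) ** 0.5
```

An auto-formatter can take care of spacing and line layout for you, but renaming things well is still up to you.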

Write Tests

I’ve found most data scientists fear tests because they don’t really know how to get started with tests.

In fact, many data scientists run what I would call temporary tests already. I find it common for a data scientist to quickly run a few “sanity checks” of a new function in their notebook. You pass through some simple test cases and make sure that the function runs as expected.

Software engineers call that process unit testing.

The only difference, though, is that often data scientists will just delete those temporary tests and move on. Instead, you need to save them and make sure they are run every time before code is pushed to make sure nothing has broken.

To get started in Python, I would go with pytest. Using pytest, you can easily create tests and run them all at once to make sure they pass. A simple way to get started is to have a directory called “tests” and within that directory have Python files whose names start with “test_”. For example, you could have “test_addition.py”:

# content of test_addition.py
def add(x, y):
    return x + y


def test_add():
    assert add(3, 2) == 5

Typically, you would have your actual function in another Python file and would import it into your test module. You also would never need to test Python addition, but this is just a very simple example.

Within these test modules, you can save all the “sanity checks” of your functions. It is generally good practice to not only test common cases, but also edge cases and potential error cases.
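Here is a slightly fuller sketch covering a common case, a couple of edge cases, and an error case. The file name and the mean function are hypothetical, and the function is defined inline only to keep the example self-contained; pytest.raises is pytest's context manager for asserting that an exception is raised.

```python
# content of test_mean.py (hypothetical file name)
import pytest


def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


def test_mean_common_case():
    assert mean([1, 2, 3]) == 2


def test_mean_single_value():
    # Edge case: a list with only one element.
    assert mean([5]) == 5


def test_mean_negative_values():
    # Edge case: values that cancel out.
    assert mean([-1, 1]) == 0


def test_mean_empty_list_raises():
    # Error case: the function should fail loudly on empty input.
    with pytest.raises(ValueError):
        mean([])
```

Running the pytest command from the project directory should collect and run all four tests.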

Note: There are many different types of tests. I think unit tests are the best tests for data scientists to get started with testing.

Do Code Reviews

Last, but not least, on our list of the top things to do to write better code is code reviews.

A code review is when another person who is adept at writing code in your domain reviews your code before you commit it to the main branch. This step ensures that best practices are being followed and hopefully catches any bad code or bugs.

The person who reviews your code should preferably be at least as good as you at writing code, but even having someone more junior review your code can still be incredibly beneficial.

It is pretty human to be lazy and it can be easy to let that laziness creep into our code. Knowing that someone will be reviewing your code is a great incentive to take the time to write good code. It also is the best way I’ve found to improve. Having a more experienced colleague review your code and give you tips for ways to improve is priceless.

To make it easier for those reviewing your code, try to keep the amount of new code small. Small, frequent code reviews work well. Infrequent, huge code reviews are terrible. No one wants to be sent thousands of lines of code to review. These reviews tend to produce worse feedback because the reviewer can't take the time necessary to really understand that much code at once.

Level up Your Coding

I hope this article has inspired you to take the time to learn how to write better code. It isn't necessarily hard, but it does take time and effort to improve.

If you follow these 5 suggestions, I am sure you will notice a large improvement in your code quality.

Your future self and your colleagues will thank you.

Check out my free course on how to deploy machine learning models.

Bio: Tyler Folkman is the Head of AI at Branded Entertainment Network. Get your FREE copy of Tyler's 5-Step Process For Creating Amazing Data Science Projects.

Original. Reposted with permission.
