Software engineering fundamentals for Data Scientists

As a data scientist writing code for your models, it's quite possible that your work will make its way into a production environment to be used by the masses. But writing code that will be deployed as software is very different from writing code for exploratory data analysis. Learn the key approaches to making your code production-ready that will save you time and future headaches.



Source: Chris Ried @ unsplash.

As a field, Data Science has stirred controversy with other disciplines ever since it started to grow in popularity. Statisticians complain about the lack of fundamental statistics knowledge often displayed by practitioners, mathematicians argue against applying tools without a solid understanding of the principles behind them, and software engineers point at data scientists’ ignorance of basic programming principles. And to be honest, they all have a point. In terms of statistics and maths, it is true that you need a solid understanding of concepts such as probability, algebra, and calculus. How deep does that knowledge need to be? Well, that depends a lot on your role, but the basics are not negotiable. A similar thing happens when it comes to programming: if your role involves writing production code, then you need to know at least the fundamentals of software engineering. Why? The reasons are many, but I reckon they can be summarised in five principles:

  • Integrity of the code: how well it is written, how resilient it is to errors, whether it catches exceptions, and whether it is tested and reviewed by others.
  • Explainability of the code, with proper documentation.
  • Velocity of the code, so it can run in live environments.
  • Modularity of your scripts and objects, so they can be reused, avoiding repetition and gaining efficiency across the classes of your code.
  • Generosity with your team, so they can review your code as quickly as possible and, in the future, understand any piece of code written by you.

In line with these points, in this story we’ll look at some of the fundamentals I have found most useful, not being a programmer by trade and coming into the field from a completely different background. They have helped me write better production code, saving me time and making my workmates’ lives easier when implementing my scripts.

 

The importance of writing clean code

 

Source: Oliver Hale @ unsplash.

In theory, almost everything we’ll cover in this story could be considered tools or tips for writing cleaner code. However, in this specific section, we’ll focus on the strict definition of the word clean. As Robert Martin says in his book Clean Code, even bad code can function, but if code isn’t clean, it can bring a development organization to its knees. How? Well, to be honest, the possibilities are many, but just imagine the time that can be wasted reviewing badly written code, or starting a new role only to find out that you’ll be dealing with illegible code written ages ago. Or even worse, imagine that something breaks, causing a product feature to stop working, and the person who wrote that dirty code before you is no longer part of the company.

These are all relatively common situations, but let’s be much less dramatic: who has never written some code, left it hanging for a while to work on something more urgent, and then, when coming back to it, couldn’t remember how it actually worked? I know it has happened to me.

These are all valid reasons for making an extra effort to write better code. So let’s start from the basics and go through some tips for writing cleaner scripts:

  • Be descriptive with the names within your code. I have never forgotten a concept I learned ages ago in some Java lectures at uni: aim for your code to be mnemonic. A mnemonic is a system such as a pattern of letters, ideas, or associations that assists in remembering something. I.e., it means writing self-explanatory names.
  • Whenever possible, try to imply the type. For example, for a function returning a boolean, you can prefix its name with is_ or has_ (see the sketch after this list).
  • Avoid abbreviations and especially single letters.
  • On the other hand, avoid long names and lines. Writing long names doesn’t mean being more descriptive, and in terms of line length, the PEP 8 style guide for Python code recommends a maximum of 79 characters.
  • Don’t sacrifice clarity for consistency. If, for example, you have objects representing employees and a list containing them all, employee_list and employee_1 is clearer than employees and employee_1.
  • In regards to blank lines and indentation, make your code easier to read by separating sections with blank lines and using consistent indentation.
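
To illustrate a few of these tips, here is a minimal sketch (the function and variable names are made up for the example):

# hard to read: single letters, abbreviations, return type unclear
def chk(e, l):
    return e in l

# clearer: mnemonic, self-explanatory names; the is_ prefix
# signals that the function returns a boolean
def is_active_employee(employee, employee_list):
    return employee in employee_list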

 

The importance of writing modular code

 

Source: Sharon McCutcheon @ pexels.

This one, I reckon, is one of the most important points for data scientists and data analysts, and a very common source of discussion with software engineers, given that we’re very used to coding in tools such as Jupyter Notebooks. These tools are amazing for Exploratory Data Analysis, but not meant for writing production code. In fact, Python is by nature an object-oriented programming language; it is not within the scope of this story to cover in depth what that means. But, in short, unlike procedural programming, where you code a list of instructions for a script to execute, object-oriented programming is about building modules with their own characteristics and actions. Take the following example:

Source: image created by the author.

In practice, these characteristics are known as attributes, and the actions as methods. In the example above, Computer and Printer would be independent classes. A class is a blueprint containing the attributes and methods of all the objects of that specific type; i.e., all the Computers and Printers we create would share the same attributes and methods. The concept behind this idea is called encapsulation. Encapsulation means combining functions and data into a single entity or module. And when you break a program into modules, different modules don’t need to know how something is accomplished if they are not responsible for doing it. And why is this useful? Well, not only does it make the code reusable, avoiding repetition and gaining efficiency across classes as mentioned before, it also makes it easier to debug when needed.
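
As a rough sketch of what the example above might look like in Python (the specific attributes and methods are assumptions for illustration):

class Computer:
    def __init__(self, brand, memory_gb):
        # attributes: the characteristics of the object
        self.brand = brand
        self.memory_gb = memory_gb

    def run_program(self, program_name):
        # method: an action the object can perform
        print(f"Running {program_name} on {self.brand}")

class Printer:
    def __init__(self, brand, pages_per_minute):
        self.brand = brand
        self.pages_per_minute = pages_per_minute

    def print_document(self, document):
        print(f"Printing {document} at {self.pages_per_minute} ppm")

# every Computer we create shares the same attributes and methods
laptop = Computer("Lenovo", 16)
laptop.run_program("model_training.py")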

Again, this might not be relevant if all you’re doing is Exploratory Data Analysis in a Jupyter Notebook, but if you’re writing a script that will be part of a live environment, and especially as applications grow in size, it makes sense to split your code into separate modules. By perfecting each section of the program before putting all the sections together, you not only make it easier to reuse individual modules in other programs, you also make it easier to fix problems by being able to pinpoint the source of an error.

Some further tips for writing modular code:

  • DRY: Don’t Repeat Yourself
  • Using functions not only makes your code less repetitive, but with descriptive names it also improves readability, making it easier to understand what each module does
  • Minimize the number of entities (functions, classes, modules, etc.)
  • Single Responsibility Principle: The idea that a class should have one-and-only-one responsibility. Harder than one might expect.
  • Follow the Open/Closed Principle; i.e., objects should be open for extension but closed for modification. The idea is to write your code so that you can add new functionality without changing the existing code, preventing situations in which a change to one of your classes requires you to adapt all the classes that depend on it. There are different ways of facing this challenge, though in Python it’s very common to use inheritance (see the sketch after this list).
  • Try to use fewer than three arguments per function. If it has more, maybe split it. A similar criterion applies to the length of the function: ideally, a function should have between 20 and 50 lines. If it has more, then you might want to break it into separate functions
  • Mind the length of your classes as well. If a class has more than 300 lines, then it should probably be split into smaller classes.
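
As a minimal sketch of the Open/Closed Principle via inheritance (the class names and logic are made up for the example), new behaviour is added in a subclass instead of modifying the existing class:

class Model:
    def preprocess(self, data):
        # shared preprocessing logic, closed for modification
        return [x / max(data) for x in data]

    def predict(self, data):
        return sum(self.preprocess(data)) / len(data)

class ClippedModel(Model):
    # open for extension: the new behaviour lives here,
    # without touching the existing Model class
    def predict(self, data):
        prediction = super().predict(data)
        return min(prediction, 0.5)

print(Model().predict([1, 2, 3]))         # 0.666...
print(ClippedModel().predict([1, 2, 3]))  # 0.5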

If you’re already using Python but have little or no knowledge of object-oriented programming, I strongly recommend these two free courses:

 

The importance of refactoring

 

Source: RyanMcGuire @ pixabay.

Wikipedia states the definition of refactoring as follows:

In computer programming and software design, code refactoring is the process of restructuring existing computer code without changing its external behaviour. Refactoring is intended to improve the design, structure, and/or implementation of the software while preserving its functionality. Potential advantages of refactoring may include improved code readability and reduced complexity; these can improve the source code’s maintainability and create a simpler, cleaner, or more expressive internal architecture or object model to improve extensibility.

I reckon the definition speaks for itself, but apart from that, we could add that refactoring gives us a chance to clean and modularise our code after we've got it working. It also gives us a chance to improve the efficiency of our code. And what I’ve learned so far is that when a software engineer talks about efficient code, they usually mean one of these two things:

  1. Reducing run time
  2. Reducing space in memory

Let’s briefly cover both points…

In my experience, reducing the run time of your code is something you learn over time as you write more and more production code. When you’re doing some analysis in your Jupyter Notebook, it doesn’t matter if calculating those pairwise distances takes you two, five, or ten minutes. You can leave it running, answer some Slack messages, go to the bathroom, fill up a cup of coffee, and come back to find your code done. However, what happens when there’s a user waiting on the other side? You can’t just leave them hanging while your code runs, right?

In Python, there are several ways to improve run time. Let’s quickly cover some of them:

Use vector operations to make your computations faster. For example, when checking whether the elements of one array are within another, instead of writing for loops you can use NumPy's intersect1d. You can also use vectorised operations to search for elements matching a condition in order to perform an addition or a similar operation. Let’s see a quick example where we have to iterate through a list of numbers and perform an operation given a condition:

Instead of using something like this:

import time
import numpy as np

# random array with 10 million points
a = np.random.normal(500, 30, 10000000)

# iterating and checking for values < 500
t0 = time.time()
total = 0
for each in a:
    if each < 500:
        total += each
t1 = time.time()
print(t1 - t0)

 

Time: 3.6942789554595947 seconds

 

# same operation only using numpy
t0 = time.time()
total = a[a<500].sum()
t1 = time.time()
print(t1-t0) 

 

Time: 0.06348109245300293 seconds

More than 58 times faster!

I know that Pandas DataFrames are super easy to use, and we Pythonistas all love them. However, when writing production code, it is better to simply avoid them. Many of the operations we perform with Pandas can be done with NumPy as well. Let’s see some other examples:

  • Sum matrix across rows according to condition
# random 2d array with 1m rows and 20 columns
# we’ll use the same one in the following examples
a = np.random.random(size=(1000000, 20))

# sum, for each row, the values greater than 0.30
(a * (a > 0.30)).sum(axis=1)

 

In the code above, multiplying by a boolean array works because True corresponds to 1 and False to 0.

  • Add a column according to some condition
# the number of columns in the matrix, used as the index
# of the new column to be placed at the end
new_index = a.shape[1]

# set the new column using the array created in the previous example
a = np.insert(a, new_index, (a * (a > 0.30)).sum(axis=1), axis=1)

# check the new shape of a
a.shape

 

Prints: (1000000, 21) | The new column has been added.

  • Filter the table according to several conditions
# keep rows where the first column is less than 0.30
# and the new last column is greater than 10
b = a[(a[:, 0] < 0.30) & (a[:, -1] > 10)]

b.shape

 

Prints: (55183, 21) | 55183 rows match the condition.

  • Replace elements if a condition is met
# change to 100 all values less than 0.30
a[a<0.3] = 100

 

Apart from the code above, another great option for reducing run time is parallelization. Parallelization means writing a script that processes data in parallel, using several or all of the available processors in the machine. Why can this lead to a massive improvement in speed? Because most of the time our scripts compute data serially: they solve one problem, then the next, then the next, and so on. When we write code in Python, that’s usually what happens, and if we want to take advantage of parallelization, we have to be explicit about it. I’ll be writing a separate story about this soon; however, if you’re eager to learn more, among all the libraries available for parallelization, my favourites so far are:
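
As a minimal sketch of the idea using the standard library’s multiprocessing module (the worker function and the number of processes here are arbitrary placeholders):

from multiprocessing import Pool

def expensive_computation(x):
    # placeholder for real per-item work
    return x ** 2

if __name__ == '__main__':
    # spread the work across 4 worker processes (an arbitrary choice)
    with Pool(processes=4) as pool:
        results = pool.map(expensive_computation, range(1000))
    print(sum(results))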

In regards to reducing space in memory, reducing memory usage in Python is difficult, because Python does not actually release memory back to the operating system. If you delete objects, the memory becomes available to new Python objects, but it is not free()'d back to the system. Moreover, as mentioned before, Pandas is a great tool for Exploratory Data Analysis, but apart from being slower for production code, it is also quite expensive in terms of memory. However, there are a few things we can do to keep memory use under control:

  • First things first: if possible, use NumPy arrays instead of Pandas DataFrames. Even a dictionary, where possible, will take much less memory than a DataFrame.
  • Reduce the number of Pandas DataFrames: when modifying a dataframe, instead of creating a new object, try to modify the dataframe itself using the parameter inplace=True, so you don’t create copies.
  • Clear your history: each time you make a change to a dataframe (e.g., df + 2), Python holds copies of that object in memory. You can clear that history using %reset Out
  • Be aware of your data types: object and string dtypes are much more expensive in terms of memory than numbers. That’s why it’s always useful to examine the data types of your dataframe using df.info() and cast where possible using df['column'] = df['column'].astype(type)
  • Use sparse matrices: if you have a matrix with lots of null values or empty cells, it is convenient to use a sparse matrix instead, which usually takes much less space in memory (see the sketch below).

You can do that using scipy.sparse.csr_matrix(df.values)
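
For instance, here is a rough sketch of the difference in memory footprint (the matrix shape is an arbitrary assumption):

import numpy as np
from scipy import sparse

# a dense matrix that is almost entirely zeros
dense = np.zeros((10000, 1000))
dense[0, 0] = 1.0

sparse_matrix = sparse.csr_matrix(dense)
print(dense.nbytes)               # ~80 MB for the dense array
print(sparse_matrix.data.nbytes)  # only the non-zero values are stored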

  • Use generators instead of objects: generators allow you to declare a function that behaves like an iterator, using the keyword yield rather than return. Instead of creating a new object (e.g., a list or a NumPy array) with all the calculations, a generator holds a single value in memory and computes the next one only when you ask for it. This is known as lazy evaluation. You can find more about generators in this great story by Abhinav Sagar in Towards Data Science.
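
A minimal sketch of the contrast (the functions are made up for illustration):

def squares_list(n):
    # eager: builds the whole list in memory at once
    return [i ** 2 for i in range(n)]

def squares_generator(n):
    # lazy: yields one value at a time, only when asked for
    for i in range(n):
        yield i ** 2

# no 10-million-element list is ever held in memory here
total = sum(squares_generator(10000000))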

 

The importance of testing

 

Source: Pixabay @ pexels.

Tests in Data Science are needed. People in other software-related areas usually complain about the lack of testing in data scientists’ code. While in other kinds of algorithms or scripts the program might simply stop working if there’s a mistake, in Data Science this is even more dangerous, because the program might actually run but end up producing wrong insights and recommendations, due to values encoded incorrectly, features being used inappropriately, or data breaking the assumptions the models are actually based on.

There are two main concepts worth talking about when we refer to testing:

  • Unit tests
  • Test driven development

Let’s start with the former. Unit tests are called that because they cover a small unit of code, and their goal is to validate that each individual part of our code performs as designed. In object-oriented programming languages such as Python, a unit can be an individual method or function, but it can also be designed to evaluate an entire class.

Unit tests can be written from scratch. In fact, let’s do that so we can get a better understanding of how unit tests actually work:

Suppose I have the following function:

def my_func(a, b):
    c = (a + b) / 2 * 1.5
    return c

 

And I want to test that the following inputs return the expected output:

  • 4 and 2 return 4.5
  • 5 and 5 return 7.5
  • 4 and 8 return 9.0

We could simply write something like this:

def test_func(function, output):
    out = function
    if output == out:
        print('Worked as expected!')
    else:
        print('Error! Expected {} output was {}'.format(output, out))

 

And then simply test our function:

test_func(my_func(4,2),4.5)

 

Prints: Worked as expected!

However, this can get trickier with more complex functions, and when we want to test several functions at once, or even a class. A great tool for unit testing without all this faff is the pytest library. Pytest requires you to create a Python script containing the function or functions to be tested, along with a set of functions asserting the output. The file name needs to start with the prefix test_, and then it can be run like any other Python script. Pytest was originally designed to be used from the command line, but there’s a hacky way of using it from a Jupyter Notebook if you’re still at the early stages of a project: you can create and save a .py file using the magic command %%writefile and then run the script using command-line statements directly from the notebook. Let’s see an example:

%%writefile test_function.py
# note: %%writefile must be the first line of the notebook cell

def my_func(a, b):
    c = (a + b) / 2 * 1.5
    return c

def test_func_4_2():
    assert my_func(4, 2) == 4.5

def test_func_5_5():
    assert my_func(5, 5) == 7.5

def test_func_4_8():
    assert my_func(4, 8) == 9.0

 

Then simply run the script:

!pytest test_function.py

 

If everything ran as expected, you’ll see an output reporting that the three tests passed.

In the future, I’ll write another story about more complex examples of unit tests, and how to test an entire class if that’s what you need. But meanwhile, this should be more than enough to get you started testing some of your functions. Mind that in the examples above, I’m testing the exact number being returned, but you could also test the shape of a dataframe, the length of a NumPy array, the type of the object returned, etc. (see the sketch below).
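
For instance, here is a hypothetical test asserting properties of the returned object rather than an exact value (the function and its names are made up):

import numpy as np

def make_features(n_rows):
    # placeholder for a real feature-engineering function
    return np.zeros((n_rows, 3))

def test_make_features():
    result = make_features(10)
    assert isinstance(result, np.ndarray)
    assert result.shape == (10, 3)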

The other point mentioned at the beginning of this chapter was Test-Driven Development, or TDD. This approach to testing consists of writing the unit tests for a piece of code even before you start developing it. Next, you write the simplest and/or quickest piece of code you can that passes the tests you wrote initially. This helps ensure quality by focusing on the requirements before writing your code. It also forces you to keep the code simple, clean, and testable by breaking it down into small chunks, in accordance with the tests written initially. Once you have a piece of code that actually passes those tests, and only then, you focus on refactoring to improve its quality or add further functionality.
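
A minimal TDD-flavoured sketch (the function here is hypothetical): the test is written first, then the simplest implementation that passes it:

# step 1: write the test before any implementation exists
def test_average():
    assert average([2, 4, 6]) == 4.0

# step 2: write the simplest code that makes the test pass
def average(numbers):
    return sum(numbers) / len(numbers)

# step 3: only now, refactor while keeping the test green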


A major benefit of TDD is that if a change to the code is needed in the future and you’re no longer working on that project, because you moved to another company or are simply on holiday, the tests that were written originally will help whoever takes over the code to be sure that the change won’t break anything.

Some further points worth considering:

  • Notebooks: ideal for exploration, not ideal for TDD.
  • Ping pong TDD: one person writes the tests, another the code.
  • Set both performance and output metrics for your tests.

 

The importance of code reviewing

 

Source: Charles Deluvio @ unsplash.

Code reviews benefit everyone on a team: they promote best programming practices and prepare code for production. The main goal of code reviews is to catch errors. However, they are also helpful for improving readability and checking that standards are met across the team, so that no dirty or slow code makes it into production. Beyond these points, code reviews are also great for sharing knowledge, as members of the team get to read pieces of code from people with different backgrounds and styles.

Nowadays, an excellent tool for code reviewing is GitHub’s pull request feature. A pull request is a request to integrate a change to a piece of code, or an entirely new script, into a certain code environment. It is called a pull request because submitting one is precisely a request for someone to pull the code you wrote into the repository.

From GitHub’s documentation, we can see their definition of a pull request:

Pull requests let you tell others about changes you’ve pushed to a branch in a repository on GitHub. Once a pull request is opened, you can discuss and review the potential changes with collaborators and add follow-up commits before your changes are merged into the base branch.

Pull requests are an art in themselves, and if you’re interested in learning more about them, the story “The anatomy of a perfect pull request” by Hugo Dias will definitely come in handy. However, here are a few questions you can ask yourself when reviewing code:

  • Is the code clean and modular? Look for duplication, whitespaces, readability, and modularity.
  • Is the code efficient? Look at loops, objects, and function structure; could it use multiprocessing?
  • Is documentation effective? Look for in-line comments, docstrings and readme files.
  • Has the code been tested? Look for unit tests.
  • Is the logging good enough? Look for clarity and the right frequency of logging messages.

Original. Reposted with permission.

 
