Essential data science skills that no one talks about
Old-fashioned engineering skills are what you need to boost your data science career.
By Michael Kolomenkin, AI Researcher
Google “the essential skills of a data scientist”. The top results are long lists of technical terms, called hard skills. Python, algebra, statistics, and SQL are some of the most popular ones. Then come the soft skills: communication, business acumen, teamwork, etc.
Let’s pretend that you are a superhuman possessing all the above abilities. You have been coding since the age of five, you are a Kaggle grandmaster, and your conference papers are guaranteed to get a best-paper award. And you know what? There is still a very high chance that your projects will struggle to reach maturity and become full-fledged commercial products.
Recent studies estimate that more than 85% of data science projects fail to reach production. The studies provide numerous reasons for the failures. And I have not seen the so-called essential skills mentioned even once as a potential reason.
Am I saying that the above skills are not important? Of course not. Both hard and soft skills are vital. The point is that they are necessary, but not sufficient. Moreover, they are popular and appear in every Google search. So chances are that you already know whether you need to improve your math proficiency or your teamwork.
I want to talk about the skills that complement popular hard and soft skills. I call them engineering skills. They are especially useful for building real products with real customers. Regrettably, engineering skills are seldom taught to data scientists. They come with experience. Most junior data scientists lack them.
Engineering skills have nothing to do with the area of data engineering. I use the term engineering skills to distinguish them from purely scientific or research skills. According to the Cambridge dictionary, engineering is the use of scientific principles to design and build machines, structures, and other items. In this article, engineering is the enabling component that transforms science into products. Without proper engineering, models will keep performing well on predefined datasets, but they will never reach real customers.
The important and often neglected skills are:
- Simplicity. Make sure your code and your models are simple, but not simplistic.
- Robustness. Your assumptions are wrong. Take a breath and continue to code.
- Modularity. Divide and conquer. Dig down to the smallest problem and then find an open-source tool to solve it.
- Fruit picking. Don’t focus only on low-hanging fruit. But make sure you always have something to pick.
“Entities should not be multiplied without necessity” (William of Ockham). “Simplicity is the ultimate sophistication” (Leonardo da Vinci). “Everything should be made as simple as possible, but not simpler” (Albert Einstein). “That’s been one of my mantras: focus and simplicity” (Steve Jobs).
I could have filled the whole page with quotations devoted to simplicity. Researchers, designers, engineers, philosophers, and authors have praised simplicity and stated that it has a value all of its own. Their reasons differed, but the conclusion was the same. You reach perfection not when there is nothing to add, but when there is nothing to remove.
Software engineers are well aware of the value of simplicity. There are numerous books and articles on how to make software simpler. I remember that the KISS principle (Keep It Simple, Stupid) was even taught in one of my undergraduate courses. Simple software is cheaper to maintain, easier to change, and less prone to bugs. There is a wide consensus on this.
In data science, the situation is very different. There are a number of articles on the subject, for example, “The virtue of simplicity: on ML models in algorithmic trading” by Kristian Bondo Hansen or “The role of simplicity in data science revolution” by Alfredo Gemma. But they are the exception and not the rule. Most data scientists are indifferent to simplicity at best and prefer complex solutions at worst.
Before going on to the reasons why data scientists usually don’t care, why they should, and what to do about it, let’s see what simplicity means. According to the Cambridge dictionary, it is the quality of being easy to understand or do, and the quality of being plain, without unnecessary or extra things or decorations.
I find that the most intuitive way to define simplicity is via negativa, as the opposite of complexity. According to the same dictionary, something complex is something consisting of many interconnecting parts or elements; intricate. While we can’t always say that something is simple, we can usually say that something is complex. And we can aim not to be complex and not to create complex solutions.
The reason to seek simplicity in data science is the same reason as in all engineering disciplines. Simpler solutions are much, much cheaper. Real-life products are not Kaggle competitions. Requirements are constantly modified. A complex solution quickly becomes a maintenance nightmare when it needs to be adapted to new conditions.
It is easy to understand why data scientists, especially fresh graduates, prefer complex solutions. They have just arrived from academia. They have finished a thesis and probably even published a paper. An academic publication is judged by accuracy, mathematical elegance, novelty, and methodology, but seldom by practicality and simplicity.
A complicated idea that increases accuracy by 0.5% is a great success for any student. The same idea is a failure for a data scientist. Even if its theory is sound, it may hide underlying assumptions that will prove false. In any case, an incremental improvement is hardly worth the cost of complexity.
So what should you do if you, your boss, your colleagues, or your subordinates are fond of complex and “optimal” solutions? If it is your boss, you are probably doomed and you’d better start looking for a new job. In other cases, keep it simple, stupid.
Russian culture has a concept of avos’. Wikipedia describes it as “blind trust in divine providence and counting on pure luck”. Avos’ is what lies behind a truck driver’s decision to overload the truck. And it hides behind any non-robust solution.
What is robustness? Or specifically, what is robustness in data science? The definition that is most relevant to our discussion is “the robustness of an algorithm is its sensitivity to discrepancies between the assumed model and reality” from Mariano Scain’s thesis. Incorrect assumptions about reality are the main source of problems for data scientists. They are also the source of problems for the truck driver above.
Careful readers may say that robustness is also the ability of an algorithm to deal with errors during execution. They would be right. But it is less relevant to our discussion. It is a technical topic with well-defined solutions.
The necessity of building robust systems was obvious in the pre-big-data and pre-deep-learning world. Feature and algorithm design were manual. Testing was commonly performed on hundreds, maybe thousands, of examples. Even the smartest algorithm creators never assumed that they could think of all possible use cases.
Did the era of big data change the nature of robustness? Why should we care if we can design, train, and test our models using millions of data samples representing all imaginable scenarios?
It turns out that robustness is still an important and unsolved issue. Each year top journals prove it by publishing papers dealing with algorithm robustness, for instance, “Improving the Robustness of Deep Neural Networks” and “Model-Based Robust Deep Learning”. The quantity of data has not translated into quality. The sheer amount of information used for training does not mean we can cover all use cases.
And if people are involved, reality will always be unexpected and unimaginable. Most of us have difficulty saying what we will have for lunch today, let alone tomorrow. Data can hardly help with predicting human behavior.
So what can you do to make your models more robust? The first option is to read the appropriate papers and implement their ideas. This is fine. But the papers do not always generalize. Often, you can’t copy an idea from one area to another.
I want to present three general practices. Following the practices does not guarantee robust models, but it significantly decreases the chance of fragile solutions.
Performance safety margin. Safety margins are the basis of any engineering. It is a common practice to take requirements and add 20–30% just to be on the safe side. An elevator rated for 1000 kg will easily hold 1300 kg. Moreover, it is tested to hold 1300 kg and not 1000 kg. Engineers prepare for unexpected conditions.
What is the equivalent of a safety margin in data science? I think it applies to the KPI or success criteria: aim to exceed the required threshold by a comfortable margin. Then, even if something unexpected happens, you will still be above the threshold.
The important consequence of this practice is that you will stop chasing incremental improvements. You cannot be robust if your model increases a KPI by 1%. Even with all the statistical significance tests, any small change in the environment will kill your effort.
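As a minimal sketch of the idea, the check below accepts a model only if it beats the requirement by a margin. The function name, the 25% default, and the numbers are my own illustrative assumptions, and the sketch assumes a positive, higher-is-better KPI.

```python
def clears_safety_margin(measured_kpi, required_kpi, margin=0.25):
    """Return True if the measured KPI beats the requirement by the margin.

    Assumes a positive, higher-is-better KPI. The 25% default is an
    illustrative choice within the 20-30% range, not a universal constant.
    """
    target = required_kpi * (1.0 + margin)
    return measured_kpi >= target


# Hypothetical requirement: the business asks for a 10% uplift.
required_uplift = 0.10

# 11% is above the bare requirement but leaves almost no room for surprises.
barely_above = clears_safety_margin(0.11, required_uplift)

# 14% is above the requirement plus the margin.
comfortably_above = clears_safety_margin(0.14, required_uplift)

print(barely_above, comfortably_above)
```

A model that only passes the bare requirement is rejected here on purpose: it is the data science counterpart of an elevator tested at exactly its rated load.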
Excessive testing. Forget the single train / validation / test split. Cross-validate your model over as many different splits as you can. Do you have different users? Divide according to the user ID and do it dozens of times. Does your data change over time? Divide according to timestamp and make sure that each day appears once in the validation group. “Spam” your data with random values or swap the values of some features between your data points. Then test on the dirty data.
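Here is a rough sketch of two of these ideas using only the Python standard library. The helper names `group_folds` and `swap_feature_values` are hypothetical, invented for this illustration. The first keeps every user (or every day, if you pass dates as groups) entirely on one side of each split and puts each group into validation exactly once. The second dirties a copy of the data by shuffling one feature among a random subset of rows. Libraries such as scikit-learn offer ready-made group-aware splitters (for example `GroupKFold`) that you would probably prefer in practice.

```python
import random


def group_folds(groups, n_folds, seed=0):
    """Yield (train_idx, valid_idx) pairs where no group spans both sides.

    Across the n_folds splits, every group lands in validation exactly once.
    Change the seed to repeat the procedure with a different grouping.
    """
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    for fold in range(n_folds):
        held_out = set(unique[fold::n_folds])
        train = [i for i, g in enumerate(groups) if g not in held_out]
        valid = [i for i, g in enumerate(groups) if g in held_out]
        yield train, valid


def swap_feature_values(rows, column, fraction, seed=0):
    """Return a copy of rows with `column` shuffled among a random fraction of rows."""
    rng = random.Random(seed)
    dirty = [list(row) for row in rows]
    chosen = rng.sample(range(len(dirty)), int(len(dirty) * fraction))
    values = [dirty[i][column] for i in chosen]
    rng.shuffle(values)
    for i, value in zip(chosen, values):
        dirty[i][column] = value
    return dirty


# Hypothetical data: one user ID per row, and two numeric features per row.
user_ids = ["a", "a", "b", "c", "c", "d"]
rows = [[i, i * 10] for i in range(len(user_ids))]

for seed in range(3):  # in a real project, dozens of repetitions
    for train_idx, valid_idx in group_folds(user_ids, n_folds=2, seed=seed):
        pass  # fit on train_idx, evaluate on valid_idx, record the score

dirty_rows = swap_feature_values(rows, column=1, fraction=0.5, seed=42)
```

If the scores vary wildly between seeds, or collapse on `dirty_rows`, that is a sign that the model relies on something that will not hold in production.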
I find it very useful to assume that my models have bugs until proven otherwise.
Two interesting sources on data science and ML testing are Alex Gude’s blog and “Machine Learning with Python, A Test-Driven Approach”.
Don’t build castles on sand. Decrease the dependence on other untested components. And never build your model on top of another high-risk, unvalidated component, even if the developers of that component swear that nothing can go wrong.
Modular design is an underlying principle of all modern science. It is the direct consequence of the analytical approach. The analytical approach is a process where you break down a big problem into smaller pieces. The analytical approach was a cornerstone of the scientific revolution.
The smaller your problem is, the better. And “the better” here is not nice to have. It is a must. It will save a lot of time, effort, and money. When a problem is small, well defined, and not accompanied by tons of assumptions, the solution is accurate and easy to test.
Most data scientists are familiar with modularity in the context of software design. But even the best programmers, whose Python code is crystal clear, often fail to apply modularity to data science itself.
The failure is easy to justify. Modular design requires a method to combine several smaller models into a big one. No such general method exists for machine learning.
But there are practical guidelines that I find useful:
- Transfer learning. Transfer learning simplifies employing existing solutions. You can think of it as dividing your problem into two parts. The first part creates a low dimensional feature representation. The second part directly optimizes the relevant KPI.
- Open-source. Use out-of-the-box open-source solutions whenever possible. It makes your code modular by definition.
- Forget being optimal. It is tempting to build from scratch a system optimized for your needs instead of adapting an existing solution. But it is justified only when you can prove that your system significantly outperforms the existing one.
- Model ensembles. Don’t be afraid to take several different approaches and throw them into a single pot. This is how most Kaggle competitions are won.
- Divide your data. Don’t try to create “one great model”, even though it may be theoretically possible. For example, if you deal with predicting customer behavior, don’t build the same model for a completely new customer and for someone who has been using your service for a year.
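The last two guidelines can be sketched in a few lines. Everything below is a toy illustration: the "models" are hypothetical hand-written functions standing in for trained models, and the feature names and the 30-day tenure cut-off are assumptions I made up for the example.

```python
from statistics import mean


def ensemble_predict(models, features):
    """Average the predictions of several independently built models."""
    return mean(model(features) for model in models)


def make_router(models_by_segment, segment_of):
    """Build a predictor that sends each input to the model of its segment."""
    def predict(features):
        return models_by_segment[segment_of(features)](features)
    return predict


# Hypothetical stand-ins for trained models: each maps a customer to a churn score.
def prior_only_model(customer):
    return 0.30  # a new customer has no history, so fall back to a base rate


def recency_model(customer):
    return 0.9 if customer["days_since_last_visit"] > 30 else 0.1


def frequency_model(customer):
    return 0.8 if customer["visits_last_month"] == 0 else 0.2


def established_model(customer):
    # Several simple approaches thrown into a single pot.
    return ensemble_predict([recency_model, frequency_model], customer)


def tenure_segment(customer):
    return "new" if customer["tenure_days"] < 30 else "established"


predict_churn = make_router(
    {"new": prior_only_model, "established": established_model},
    tenure_segment,
)

new_customer = {"tenure_days": 3, "days_since_last_visit": 1, "visits_last_month": 2}
old_customer = {"tenure_days": 400, "days_since_last_visit": 45, "visits_last_month": 0}
print(predict_churn(new_customer), predict_churn(old_customer))
```

Each piece is small enough to test and replace on its own, which is the whole point: the router does not care how a segment's model was built, and the ensemble does not care how many members it has.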
Check Compositional Deep Learning for more details about deep learning building blocks. And read Pruned Neural Networks Are Surprisingly Modular for scientific evidence.
There is a constant tension between product managers and data scientists. Product managers want data scientists to focus on low-hanging fruit. Their logic is clear. They say that the business cares only about the number of fruits, not about where they grow. The more fruits we have, the better we do. They throw in all sorts of buzzwords: Pareto, MVP, the best is the enemy of the good, etc.
On the other hand, data scientists state that low-hanging fruits spoil fast and taste bad. In other words, solving the easy problems has a limited impact and deals with symptoms and not the cause. Sometimes this is just an excuse to learn new technologies, but often they are right.
Personally, I have moved between both viewpoints. After reading Peter Thiel’s Zero to One, I was convinced that low-hanging fruits are a waste of time. After spending almost seven years in start-ups, I was sure that creating a low-hanging MVP is the right first step.
Recently, I developed my own approach that unifies the two extremes. The typical environment of a data scientist is a dynamic and weird world where trees grow in all directions. And the trees switch directions all the time. They can grow upside down or sideways.
The best fruits are indeed at the top. But if we spend too much time building the ladder, the tree will move. Therefore, the best approach is to aim at the top while constantly monitoring where the top is.
Moving from metaphors to practice, there is always a chance that during a long development things will change. The original problem will become irrelevant, new data sources will appear, the original assumptions will prove false, the KPI will be replaced, etc.
It is great to aim at the top, but remember to do it while rolling out a working product every few months. The product may not bring the best fruit, but you will get a better sense of how the fruits grow.
Bio: Michael Kolomenkin is a father of three, AI researcher, kayak instructor, adventure seeker, reader and writer.
Original. Reposted with permission.