Why Implement Machine Learning Algorithms From Scratch?

Even with machine learning libraries covering almost any algorithm implementation you could imagine, there are often still good reasons to write your own. Read on to find out what these reasons are.

There are several different reasons why implementing algorithms from scratch can be useful:

  1. it can help us to understand the inner workings of an algorithm
  2. we could try to implement an algorithm more efficiently
  3. we can add new features to an algorithm or experiment with different variations of the core idea
  4. we circumvent licensing issues (e.g., Linux vs. Unix) or platform restrictions
  5. we want to invent new algorithms or implement algorithms no one has implemented/shared yet
  6. we are not satisfied with the API and/or we want to integrate it more "naturally" into an existing software library

[Figure: Machine learning taxonomy]

Let us narrow down the phrase "implementing from scratch" a bit further in the context of the six points I mentioned above, so that the question becomes tangible. Let's talk about a particular algorithm, simple logistic regression, to address the different points using concrete examples. I'd claim that logistic regression has been implemented more than a thousand times.

One reason why we'd still want to implement logistic regression from scratch could be that we don't have the impression that we fully understand how it works; we read a bunch of papers and only kind of understood the core concept. Using a programming language for prototyping (e.g., Python, MATLAB, R, and so forth), we could take the ideas from a paper and try to express them in code, step by step. An established library, such as scikit-learn, can then help us double-check the results and see whether our implementation (our idea of how the algorithm is supposed to work) is correct. Here, we don't really care about efficiency. Even though we spent so much time implementing the algorithm, we would probably still use an established library if we wanted to perform some serious analysis in our research lab and/or company. Established libraries are typically more trustworthy: they have been battle-tested by many people, people who may have already encountered certain edge cases and made sure that there are no weird surprises. Furthermore, it is also more likely that this code has been highly optimized for computational efficiency over time. Here, implementing from scratch simply serves the purpose of self-assessment. Reading about a concept is one thing, but putting it into action is a whole other level of understanding, and being able to explain it to others is the icing on the cake.
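To make the "express it in code, step by step" idea concrete, here is a minimal sketch of binary logistic regression trained with batch gradient descent, using only the Python standard library. The function names (`fit_logistic_regression`, `predict_proba`), the toy data, the learning rate, and the number of iterations are all assumptions chosen for illustration; this is one way to write it, not a reference implementation.

```python
import math


def sigmoid(z):
    # Logistic function, written so that math.exp never overflows
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic_regression(X, y, lr=0.1, n_iter=1000):
    """Minimize the mean log-loss with batch gradient descent.

    X is a list of feature lists, y a list of 0/1 labels.
    lr and n_iter are arbitrary illustrative defaults.
    """
    n_samples, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(n_iter):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the log-loss w.r.t. the net input
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Step against the averaged gradient
        w = [wj - lr * gj / n_samples for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n_samples
    return w, b


def predict_proba(X, w, b):
    # Estimated probability of class 1 for each sample
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


# Toy, linearly separable data (made up for this example)
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic_regression(X, y)
probs = predict_proba(X, w, b)
preds = [int(p >= 0.5) for p in probs]
```

To double-check such a sketch, we could fit scikit-learn's `LogisticRegression` on the same data and compare the predictions. Note that scikit-learn applies L2 regularization by default, so the learned coefficients should not be expected to match exactly unless the regularization is made very weak.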

Another reason why we may want to re-implement logistic regression from scratch is that we are not satisfied with the "features" of other implementations. Let us naively assume that the other implementations don't have regularization parameters, or that they don't support multi-class settings (i.e., via One-vs-All, One-vs-One, or softmax). Or, if computational (or predictive) efficiency is an issue, maybe we want to implement it with another solver (e.g., Newton's method vs. gradient descent vs. stochastic gradient descent, etc.). Improvements in computational efficiency do not necessarily have to come from modifying the algorithm, though; we could also use lower-level programming languages, for example, Scala instead of Python, or Fortran instead of Scala. This can go all the way down to assembly or machine code, or to designing a chip that is optimized for running this kind of analysis. However, if you are a machine learning (or "data science") practitioner or researcher, this is probably something you should delegate to the software engineering team.
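As an illustration of adding a missing "feature" ourselves, the following sketch adds an L2 regularization parameter to a plain gradient descent solver for logistic regression. The function name `fit`, the parameter name `l2`, the toy data, and the hyperparameter values are assumptions made for this example; penalizing only the weights and not the bias is a common convention, not the only option.

```python
import math


def sigmoid(z):
    # Logistic function, written so that math.exp never overflows
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit(X, y, l2=0.0, lr=0.1, n_iter=1000):
    """Batch gradient descent on mean log-loss + (l2 / 2) * ||w||^2."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        # Prediction error p - y for every sample
        errs = [
            sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for xi, yi in zip(X, y)
        ]
        # Data gradient plus the L2 penalty term on the weights only
        grad_w = [
            sum(e * xi[j] for e, xi in zip(errs, X)) / n + l2 * w[j]
            for j in range(d)
        ]
        grad_b = sum(errs) / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy, linearly separable data (made up for this example)
X = [[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 1, 1]

w_plain, b_plain = fit(X, y, l2=0.0)  # no penalty
w_reg, b_reg = fit(X, y, l2=1.0)      # penalty shrinks the weight
```

On separable data like this, the unregularized weight keeps growing with more iterations, while the penalized weight settles at a smaller value. Swapping in a different solver (for example, updating after each sample instead of after each full pass) would be a change to the loop inside `fit`, leaving the rest as is.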

[Figure: Decision tree pseudocode]

To come back to the main question: Different people implement algorithms from scratch for various reasons. Personally, when I implement algorithms from scratch, I do it because of the learning experience.

Bio: Sebastian Raschka is a 'Data Scientist' and Machine Learning enthusiast with a big passion for Python & open source. Author of 'Python Machine Learning'. Michigan State University.

Original. Reposted with permission.