Gold Mine or Blind Alley? Functional Programming for Big Data & Machine Learning

Functional programming is touted as a solution for big data problems. Why is it advantageous? Why might it not be? And who is using it now?


These questions strike at the heart of why machine learning and big data practitioners might consider tossing aside their familiar imperative languages for a purely functional approach. Absent assignment and iteration, parallelism becomes far easier to achieve. Two function calls f(x) and g(x) written in sequence can be executed separately, in any order, without any fear that g(x) alters x in some way that would change the result of f(x). In a language with mutable state, this reordering is not safe in general, because either call might modify data the other depends on. As parallelism is increasingly a driving force behind many advances in machine learning, including deep learning, it is clearly advantageous to achieve it practically for free.
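The order-independence argument can be illustrated with a minimal sketch. It is written in Python only because Python is the scripting language this article mentions; the functions f, g, and g_impure are hypothetical stand-ins, not taken from any library. The pure versions give the same answers in either order or when run concurrently, while the in-place version makes the result of f depend on when it runs.

```python
from concurrent.futures import ThreadPoolExecutor


def f(xs):
    """Pure: returns the sum of xs without modifying it."""
    return sum(xs)


def g(xs):
    """Pure: returns a new tuple of squares; xs is left untouched."""
    return tuple(v * v for v in xs)


def g_impure(xs):
    """Impure: squares the list in place, so later readers see changed data."""
    for i, v in enumerate(xs):
        xs[i] = v * v
    return xs


x = (1, 2, 3)  # immutable input

# Sequential evaluation in both orders.
f_then_g = (f(x), g(x))
g_first = g(x)
g_then_f = (f(x), g_first)

# Concurrent evaluation: neither call can affect the other's input.
with ThreadPoolExecutor(max_workers=2) as pool:
    future_f = pool.submit(f, x)
    future_g = pool.submit(g, x)
    concurrent = (future_f.result(), future_g.result())

# With the impure version, the result of f depends on call order.
y = [1, 2, 3]
f_before = f(y)   # computed before the mutation
g_impure(y)       # y is now [1, 4, 9]
f_after = f(y)    # computed after the mutation
```

Threads are used here only to show that the pure calls are safe to schedule independently. In standard CPython they do not speed up CPU-bound work, so this sketch demonstrates correctness under reordering rather than a performance gain.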

So who is using functional programming? It seems that so far Haskell and OCaml have not penetrated deeply into the machine learning and data science community. A survey of Google hits for the top machine learning tools shows that nearly all widely used toolboxes, such as scikit-learn, R, NumPy, Theano, Caffe, and Weka, are written in imperative languages. Functional ML libraries like HLearn for Haskell exist but are relative newcomers.

Several explanations come to mind. First, many data and machine learning scientists come from non-CS backgrounds. The community includes mathematicians, statisticians, bioinformaticians, and others. Among our ranks are many great programmers, but also many people who do not identify strongly as programmers. The functional programming community, on the other hand, appears to be composed mostly of mature programmers: people who have used imperative languages long enough to think that there are fundamental problems with them.

Another explanation is that machine learning tasks fall into two categories. Either they are transient tasks, which must be fast to write and for which performance is not an issue, or they are massively resource-bound and require maximally efficient implementations. By transient, I mean the code is temporary, performing some data transformation or processing task a single time before it is cast aside forever. Such programs are arguably easiest to write in a language like Python. On the other hand, high-performance applications, such as large-scale deep learning systems, require every shred of possible performance. While most functional languages are much faster than Python, well-written C is still generally faster than Haskell, Scala, or OCaml.

An equally likely explanation is that the slow adoption of functional languages in the machine learning community is simply due to inertia. Most of the tools in the data science toolbox are relatively old. Implementing complicated algorithms and fast linear algebra libraries is difficult and time-consuming, so developing competing tools in new languages will take time. And while web development is dominated by hackers fresh out of college, the leaders of the machine learning and data science communities tend to be considerably older. As a result, adoption may also require a modicum of patience.

Zachary Chase Lipton is a PhD student in the Computer Science and Engineering department at the University of California, San Diego. Funded by the Division of Biomedical Informatics, he is interested in both theoretical foundations and applications of machine learning. In addition to his work at UCSD, he has interned at Microsoft Research Labs.