Supervised Learning: Model Popularity from Past to Present
An extensive look at the history of machine learning models, using historical data from the number of publications of each type to attempt to answer the question: what is the most popular model?
The field of machine learning has gone through enormous changes in the last decades. Admittedly, there are some methods that have been around for a long time and are still a staple of the field. For example, the concept of least squares was already proposed in the early 19th century by Legendre and Gauss. Other approaches such as neural networks, whose most basic form was introduced in 1958, were substantially advanced in the last decades, while other methods such as support vector machines (SVMs) are even more recent.
Due to the large number of available approaches for supervised learning, the following question is often posed: What is the best model? As we all know, this question is hard to answer because, as George Box famously stated, [a]ll models are wrong but some are useful. In particular, the usefulness of the model critically depends on the data at hand. Thus, there is no general answer to this question. A question that is easier to answer is the following: What is the most popular model?. This will be the concern of this article.
Measuring the popularity of machine learning models
For the purposes of this article, I will define popularity using a frequentist approach. More precisely, I will use the number of scientific publications mentioning individual supervised learning models as a surrogate for popularity. Of course, there are some limitations to this approach:
- There are probably more accurate notions of popularity than the number of publications. For example, publications criticizing a certain model do not necessarily imply that the model is popular.
- The analysis is influenced by the search terms that are used. To ensure high specificities, I did not consider model abbreviations, which is why I likely did not retrieve all potential hits. Moreover, sensitivity may be low for models that are also referenced by search terms that were not considered in the analysis.
- Literature databases are not perfect: sometimes, publications are stored with incorrect metadata (e.g. incorrect years) or there may be duplicate publications. Thus, some noise in the publication frequencies is to be expected.
For this piece, I performed two analyses. The first analysis is a longitudinal analysis of publication frequencies, while the second analysis compares the overall number of publication associated with machine learning models across different fields.
For the first analysis, I determined the number of publications by scraping data from Google Scholar, which considers the titles and abstracts of scientific publications. To identify the number of publications associated with individual supervised learning approaches, I determined the number of Google Scholar hits between 1950 and 2017. Since data scraping from Google Scholar is notoriously hard, I relied on helpful advice from ScrapeHero to gather the data.
I included the following 13 supervised approaches in the analysis: neural networks, deep learning, SVMs, random forests, decision trees, linear regression, logistic regression, Poisson regression, ridge regression, lasso regression, k-nearest neighbors, linear discriminant analysis, and log-linear models. Note that for lasso regression, the terms lasso regression and lasso model were considered. For nearest neighbors, the terms k-nearest neighbor and k-nearest neighbour were considered. The resulting data set indicates the number of publications associated with each supervised model from 1950 until now.
Use of supervised models from 1950 until now
To analyze the longitudinal data, I will differentiate two periods: the early days of machine learning (1950 to 1980), in which few models were available, and the formative years (1980 until now), in which interest in machine learning surged and many new models were developed. Note that in the following visualizations only the most relevant methods are shown.
Early days: dominance of linear regression
As we can see from Figure 1, linear regression was the dominating method between 1950 and 1980. In comparison, other machine learning models were mentioned extremely rarely in the scientific literature. Starting from the 1960s, however, we can see that the popularity of neural networks and decision trees began to grow. We can also see that logistic regression was not widely available yet, with only slight increases in the number of mentions at the end of the 1970s.
Formative years: diversification and rise of neural networks
Figure 2 demonstrates that the supervised models that were mentioned in scientific publications became considerably more diverse starting from the late 1980s. Importantly, the rate at which machine learning models were mentioned in the scientific literature has steadily increased until 2013. The plot particularly demonstrates the popularity of linear regression, logistic regression, and neural networks. As we have seen before, linear regression has already been popular way before 1980. In 1980, however, the popularity of neural networks and logistic regression started growing rapidly. While the popularity of logistic regression reached its peak in 2010 where the method nearly became as popular as linear regression, the pooled popularity of neural networks and deep learning (curve Neural Network/Deep Learning in Figure 2) even surpassed the popularity of linear regression in 2015.
Neural networks have become immensely popular because they have led to breakthroughs in machine learning applications such as image recognition (ImageNet, 2012), face recognition (DeepFace, 2014), and gaming (AlphaGo, 2016). The data from Google Scholar suggest that the frequency at which neural networks have been mentioned in scientific articles has slightly decreased in the last years (not shown in Figure 2). This is likely since the term deep learning (multilayered neural networks) has supplanted the use of the term neural network to some extent. The same finding can be made using Google Trends.
The remaining, slightly less popular supervised approaches are decision trees and SVMs. In contrast to the top-3 methods, the rates at which these approaches are mentioned are decidedly smaller. On the other hand, there also seems to be less fluctuation in the frequency at which these methods are mentioned in the literature. Notably, neither the popularity of decision trees nor SVMs has yet declined. This is in contrast to other methods such as linear and logistic regression whose number of mentions has decreased considerably in the last years. Between decision trees and SVMs, SVM mentions seem to exhibit the more favorable growth trend as SVMs managed to overtake decision trees merely 15 years after their invention.
The number of mentions of the considered machine learning models peaked in 2013 (589,803 publications) and has slightly decreased since then (462,045 publications in 2017).
Popularity of supervised learning models across different fields
In the second analysis, I wanted to investigate whether different communities rely on different machine learning techniques. For this purpose, I relied on three repositories for scientific publications: Google Scholar for general publications, dblp for computer science publications, and PubMed for publications in the biomedical sciences. In each of the three repositories, I determined the frequency at which hits for the 13 machine learning models were found. The results are shown in Figure 3.
Figure 3 demonstrates that many methods are quite specific to individual fields. In the following, let us analyze the most popular models in each field.
Overall use of supervised learning models
According to Google Scholar, these are the five most frequently used supervised models:
- Linear regression: 3,580,000 (34.3%) papers
- Logistic regression: with 2,330,000 (22.3%) papers
- Neural networks: 1,750,000 (16.8%) papers
- Decision trees: 875,000 (8.4%) papers
- Support vector machines: with 684,000 (6.6%) papers
Overall, linear models are clearly dominating, making up more than 50% of hits for supervised models. Non-linear methods are not far behind though: neural networks make third place with 16.8% of all papers, followed by decision trees (8.4% of papers) and SVMs (6.6% of papers).
Use of models in the biomedical sciences
According to PubMed, the five most popular machine learning models in the biomedical sciences are:
- Logistic regression: 229,956 (54.5%) papers
- Linear regression: 84,850 (20.1%) papers
- Cox regression: 38,801 (9.2%) papers
- Neural networks: 23,883 (5.7%) papers
- Poisson regression: 12,978 (3.1%) papers
In the biomedical sciences, we see an over-representation in the number of mentions relating to linear models: four of of the five most popular methods are linear. This is probably due to two reasons. First, in medical settings, the number of samples is often too small to allow for fitting complex non-linear models. Second, the ability to interpret the results is critical for medical applications. Since non-linear methods are typically harder to interpret, they are less suited for medical applications because high predictive performance alone is typically not sufficient.
The popularity of logistic regression in the PubMed data is likely due to the large number of publications presenting clinical studies. In these studies, the categorical outcome (i.e. treatment success) is often analyzed using logistic regression because it is well-suited for interpreting the impact of the features on the outcome. Note that that Cox regression is so popular in the PubMed data because it is frequently used for analyzing Kaplan-Meier survival data.
Use of models in computer science
The five most popular models in the computer science bibliography retrieved from dblp are:
- Neural networks: 63,695 (68.3%) papers
- Deep learning: 10,157 (10.9%) papers
- Support vector machines: 7,750 (8.1%) papers
- Decision trees: 4,074 (4.4%) papers
- Nearest neighbors: 3,839 (2.1%) papers
The distribution of machine learning models mentioned in computer science publications is quite distinct: the majority of publications seems to deal with recent, non-linear methods (e.g. neural networks, deep learning, and support vector machines). If we include deep learning, more than three quarters of the retrieved computer science publications deal with neural networks.
A cleavage between communities
Figure 4 summarizes the percentage of parametric (including semi-parametric) and non-parametric models that are mentioned in the literature. The bar plot demonstrates that there is a big difference between the models investigated in machine learning research (as evidenced by computer science publications) and the types of models that are applied (as evidenced by biomedical and overall publications). While more than 90% of computer science publications deal with non-parametric models, roughly 90% of biomedical publications deal with parametric models. This showcases that machine learning research is heavily focused on state-of-the-art methods such as deep neural networks, while users of machine learning often rely on more interpretable, parametric models.
This analysis of mentions of individual supervised learning models in the scientific literature showcases the high popularity of artificial neural networks. However, we also see that different types of machine learning models are used in different fields. Particularly researchers in the biomedical sciences still heavily rely on parametric models. It will be interesting to see whether more complex models will find widespread use in the biomedical field or whether these models are simply not appropriate for typical applications in this sector (e.g. due to insufficient interpretability of these models and low generalizability when the sample size is small).
Bio: Matthias Döring graduated as a Master of Science (Bioinformatics) from Saarland University. He is currently a PhD candidate at the Max Planck Institute for Informatics where he develops computational approaches for improving the treatment and prevention of viral infections. Besides supervised learning, he is interested in deploying machine learning models through web services and blogging about data science applications with R at datascienceblog.net.
- On-line and web-based: Analytics, Data Mining, Data Science, Machine Learning education
- Software for Analytics, Data Science, Data Mining, and Machine Learning
- Multi-Class Text Classification with Doc2Vec & Logistic Regression
- 5 Reasons Logistic Regression should be the first thing you learn when becoming a Data Scientist
- A Tour of The Top 10 Algorithms for Machine Learning Newbies