Learning and Teaching Machine Learning: A Personal Journey

Joseph Barr examines the history and origins of Machine Learning and Artificial Intelligence and recounts his personal journey from statistics to industry to teaching machine learning and running R on Unix clusters.

By Joseph R. Barr, barr.jr@gmail.com, April 2014.
San Diego State University and True Bearing Analytics

As a very rough and brief background, machine learning (ML) grew out of several not-necessarily-disjoint mathematical subjects; notable among these are mathematical statistics, computing & algorithms, information theory and mathematical optimization.

The chronology goes something like this. Mathematical statistics was taught from the early part of the 20th century by R.A. Fisher and K. Pearson; however, according to the Mathematics Genealogy Project, J. Neyman’s thesis (Warsaw University, 1924), entitled (translated from Polish) “Justification of Applications of Calculus of Probability to the Solution of Certain Questions of Agricultural Experimentation,” seems to be the first in “pure” statistics.

Information theory was first taught by Robert Fano at MIT in 1950, with the first PhD on the subject awarded to David Huffman in 1953 for his thesis entitled “The Synthesis of Sequential Switching Circuits.” (Claude Shannon’s 1940 doctoral thesis “An Algebra for Theoretical Genetics” doesn’t strike me as information theory.) Alan Turing deserves to be named the founder of the field of artificial intelligence (AI) for his work in the late 1930s and throughout the 1940s, but arguably it was Marvin Minsky who earned the first PhD in ML, awarded in 1954 for his thesis “Neural Nets and the Brain Model Problem,” and this even before AI was an acknowledged discipline. Shortly afterwards (1959), Minsky and John McCarthy went on to form the MIT AI lab.

In those ancient times machine learning was bundled with AI, so circa 1970, as George Luger (Prof Emeritus, University of New Mexico) told me, he first learned of the perceptron learning algorithm when it was taught in a graduate AI class at the University of Pennsylvania. Although opinions vary, from what I hear, machine learning didn’t make an appearance as a distinct academic subject taught as a dedicated course until sometime in the late 1970s or early 1980s, but perhaps someone can correct me and supply details.

However you look at this, ML is not an entirely new thing, but because of recent technological advances and a corresponding business appetite for data analysis, interest in the subject in academia has exploded to the point that everybody seems to be, or wants to be, playing the game. I don’t have exact figures, but today scores of universities around the world offer doctoral degrees in machine learning, mostly in computer science departments, and the number of those offering regular academic classes in ML is likely in the hundreds. This happened in tandem with demand in the marketplace; Figure 1 demonstrates job trends.
Figure 1: Jobs in ML

From Figure 1, of all jobs posted on the Indeed.com job board, in June 2013 over 0.04 percent were in ML + Python, 0.025 percent were in ML + R and about 0.015 percent were in ML + SAS.

Other terms which describe more-or-less the same thing are data mining (DM) and predictive/advanced analytics (PA/AA); the term data mining is a bit out of vogue, partly because of its occasional pejorative use to describe ‘data snooping’. However, it’s safe to regard ML, DM, PA and AA as more-or-less synonymous. Figure 2 is taken from Google Books and illustrates trends of book titles.
Figure 2: Trends of book titles (Google Ngrams for AI, Data Mining, and Machine Learning). From Google Books

My personal journey hasn’t been entirely unique. Although I taught statistics and probability theory for some years early in my academic career, save for a handful of classical methods like regression, permutation and rank tests, no ML concepts made an explicit appearance in my classes. To see how times have changed, back then statisticians preferred the term ‘method’ over ‘algorithm’, hence the phrase ‘Fisher’s Method of Scoring,’ rather than ‘Fisher’s Scoring Algorithm,’ was (and still is) common.

I left academia in 1995 to pursue other, more ‘practical’ interests. In fact, I must confess that I learned ML much later in life, and this only after the chatter around me had grown to a fortissimo. Admittedly, I’m self-taught; although I had substantial coursework in probability and statistical theory, my doctoral thesis was in combinatorics. But conditioned to the life of an autodidact (mainly because of my shortcomings as a college student), I wasn’t completely unprepared for the experience of self-teaching. However, having been trained in the pedantic mathematical manner, I wasn’t quite prepared for the pursuit of a discipline with a heuristic core. (I suppose classically trained musicians encounter a similar difficulty when trying their hands at jazz.)

As strange as it may seem, my main struggle was the dearth of the definition-theorem-proof paradigm in the various texts I perused. I simply didn’t know how to untangle the proverbial knot, and lacking a teacher dampened my learning rate. But after several topsy-turvy years things began to coalesce. In retrospect, I’d say that the subject’s complexity stems from its dimensionality, and although I’m sure some (or most) will disagree with the particulars, I’d venture to put a rough estimate on the dimensionality of ML at 9, or to put it mathematically, dim(ML) = Math + 3, with Math = 6.

What I mean is that most topics in ML lie in the convex hull of (the theories of) probability, combinatorics, convexity & optimization, statistics, information and computing. To this list I would add three extra dimensions: heuristics, empirics and applications. So it’s not at all surprising that this was hard stuff to learn: I had to simultaneously comprehend the theoretical underpinnings, accept the discipline’s philosophy, and learn to compute (to program & understand output), all this while working on practical problems (mostly those assigned to me by a customer or employer). The lucky ones, i.e., doctoral students in statistics or computer science, learn ML over several years (four or five) of coursework along with a substantial amount of lab work and supervised research, but I wasn’t among those lucky ones.

Fast forward at least a decade, and here I am, an adjunct professor of statistics at San Diego State University. (San Diego is in California, USA, on the coast, just north of the Mexican town of Tijuana.) I was hired by the university to beef up the statistics program, primarily to teach courses in machine learning, and after nearly two decades of industry work, this has been a most rewarding experience. In fact, I was fortunate to have been given the opportunity to create a one-semester graduate machine learning class, and this is the second year I’m teaching it.

San Diego State University (SDSU) has a respectable statistics program where we offer master’s and doctoral degrees in computational and applied statistics. It’s worth mentioning that SDSU’s computational statistics program is one of the few in the country. Since the California State University charter does not allow for the awarding of doctoral degrees, SDSU has partnered with Claremont Graduate University to offer a joint doctoral program.

In support of this program, we offer a graduate-level class in ML, with a student pool drawn from computational sciences, computational linguistics, and computer science. Class prerequisites are kept fairly minimal; they consist of the standard upper-division undergraduate coursework in probability, statistics and linear algebra. We don’t require coursework in more advanced subjects like measure-theoretic probability, optimization & combinatorics, although we do assume that students have programming experience, especially in R.

The challenge is to make the lectures intuitive and compelling, with theoretical details filled in only when necessary to enhance comprehension. The syllabus is fairly standard for a one-semester course and includes the philosophical foundations, ML models, algorithms & their computational complexity, the bias-variance tradeoff, model generalization, as well as the foundational framework.

The latter includes hypothesis spaces, concept learning and their limitations. I demonstrate some of those ideas using perceptron learning, the Vapnik-Chervonenkis (VC) dimension of the linear perceptron, etc. To make a point (I suppose), we go through the details of calculating the VC dimension of the linear perceptron in R^n:

VC_L(R^n) = n + 1, where L is the space of linear hypotheses.

We do mention the probably-approximately-correct (PAC) framework, but because of time constraints we don’t go into great detail, and we certainly omit proofs. We cover regression (OLS and logistic), neural networks, support-vector machines with kernels, boosting and various enhancements to regression (PCA, PLS, regularization, etc.), and unsupervised learning (clustering).
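As a rough illustration of the VC-dimension statement above for n = 2, the following sketch uses the perceptron learning rule to check whether a small point set is shattered, that is, whether every +/-1 labeling can be separated by a line. It is written in Python only for illustration (the course itself uses R), and the function names, the point sets and the epoch budget are all assumptions of this sketch. It checks one particular set of three points and one particular set of four; the general claim that no four points in the plane can be shattered needs a separate argument and is not proved here.

```python
from itertools import product


def train_perceptron(points, labels, max_epochs=1000):
    """Perceptron learning rule with a bias term; labels are +1 or -1.

    Returns (weights, bias) once a full pass makes no mistakes,
    or None if no separator is found within max_epochs passes.
    """
    n = len(points[0])
    w = [0.0] * n
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified, or exactly on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return w, b
    return None


def is_shattered(points, max_epochs=1000):
    """True if the perceptron separates every +/-1 labeling of the points.

    A failure within the epoch budget is treated as "not separable",
    which is an assumption of this sketch rather than a proof.
    """
    return all(
        train_perceptron(points, labels, max_epochs) is not None
        for labels in product([-1, 1], repeat=len(points))
    )


# n + 1 = 3 points in general position in R^2: all 8 labelings are separable.
three_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Corners of the unit square: the XOR labeling has no linear separator.
four_points = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]

print(is_shattered(three_points))  # True
print(is_shattered(four_points))   # False
```

The three-point set is shattered, which shows the VC dimension in the plane is at least 3; the four corners of the square fail on the XOR labeling, which is consistent with the value n + 1 = 3.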

We currently don’t have a dedicated lab (or lab assistant), so I encourage the students to form study groups (of two or three) to help each other with assignments and comprehension. All assignments are hands-on data analysis with the R system. Data sets are taken from various sources including the UCI data repository, but occasionally students will work on data from another research project.

R is the primary tool, but because R is an ‘in-memory’ system, it is not suitable for the analysis of ‘large’ data sets, so it doesn’t take long for students to encounter a ‘capacity problem’ with R. About halfway through the class, I assign a problem which involves analyzing a large & high-dimensional data set, and at that point they have no choice but to use a more powerful computer.

This introduces them to running R in batch on SDSU’s Unix/Linux cluster (in our case a Solaris cluster). Not surprisingly, the students are comfortable working on their personal machines (a 50-50 Windows/Apple split), but some cajoling is necessary to get them accustomed to working on a server. The transition from working locally on one’s PC or Mac to working on the server requires a bit of tutoring, and so we briefly discuss things like the command line, text editors, directories & I/O management, and all that is necessary for working in a server environment.

One of our main objectives is to get students into the habit & discipline of writing professional reports, each containing a data dictionary, data quality summaries, and exploratory data analysis (EDA) (including univariate & bivariate analysis, graphics, etc.).

I encourage them to add as much detail as necessary, including how the data is partitioned into training and test sets, (details about) cross-validation, diagnostics, lift charts, goodness-of-fit, etc. I emphasize that we must spare no effort to ascertain that a model generalizes: I find myself repeating the remark (a cliché, of course): “Just because we’re happy with the results of training, it doesn’t mean that the model is predictive.”
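The partitioning and cross-validation mentioned above can be sketched as follows. This is a minimal, hypothetical illustration in Python rather than the R code students actually write; the function names, the 25 percent hold-out fraction, the five folds and the fixed seed are all assumptions chosen for the example. It only produces row indices, so it applies to any data set with a known number of rows.

```python
import random


def train_test_split(n_rows, test_fraction=0.25, seed=0):
    """Shuffle the row indices and hold out a fraction of them as the test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so a report can be reproduced
    n_test = int(round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Hold out 25 of 100 rows for the final test; cross-validate on the other 75.
train_idx, test_idx = train_test_split(100, test_fraction=0.25, seed=0)
folds = list(k_fold_indices(train_idx, k=5))

print(len(train_idx), len(test_idx))  # 75 25
print([len(v) for _, v in folds])     # [15, 15, 15, 15, 15]
```

The test indices are set aside before any cross-validation takes place, so the held-out rows play no part in model selection and are touched only once, when the final model is assessed.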

At the end of the course, students will possess a portfolio of five or six reports on which a grade is based. My goal is for students to gain solid familiarity with, and a good working knowledge of, a handful of ML algorithms; to become unafraid of heuristics; to test a model carefully because they have learned to be skeptical about the results of training; to be curious to investigate this vast subject further; and to be well-equipped to tackle related problems.

I’m well aware that top schools offer more by way of better resources (labs, TAs, equipment, and often big-name researchers). We don’t presume to belong in the prestigious club of top universities, and for obvious reasons we can’t compete with them. But we nevertheless educate data scientists capable of performing at the highest professional levels. We also know that we’re doing a reasonably good job of producing curious minds, if not research-minded students.

A case in point: in 2013, partially as a result of taking the class, four of our students published papers in a respectable refereed journal, and more are certainly to come.

Apropos, there’s a somewhat apocryphal story about George David Birkhoff, who once was overheard telling a well-known Syracuse University math professor (something like) “But sir, you’re no Harvard!” True, we’re no Harvard, but the Aztecs made it to the “Sweet Sixteen” round in the 2014 March Madness(*), and yes, we do a decent job preparing our students for careers in data science.

(*) For our non-American readers, “March Madness” is a term coined by the media for the immensely popular college basketball tournament which is held during the month of March, ending with the crowning of the NCAA national champion.

Dr. Joseph Barr is an Adjunct Professor of Statistics at San Diego State University. He’s currently working on unstructured data (NLP) and on various other problems resulting from his consulting business.