All Machine Learning Models Have Flaws
Tags: Bayesian, Decision Trees, Gradient Descent, John Langford, Machine Learning, Statistical Learning
This classic post examines what is right and wrong with different models of machine learning, including Bayesian learning, Graphical Models, Convex Loss Optimization, Statistical Learning, and more.
By John Langford (Microsoft, Hunch.net)
(Gregory Piatetsky: I recently came across another classic and valuable post by John Langford, and he kindly agreed to repost it in KDnuggets).
Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I've created a table (below) outlining the major flaws in some common models of machine learning.
The point here is not simply "woe unto us". There are several implications which seem important.
Methodology: You specify a prior probability distribution over datamakers, P(datamaker) then use Bayes law to find a posterior P(datamakerx). True Bayesians integrate over the posterior to make predictions while many simply use the world with largest posterior directly.
What is right: Handles the small data limit. Very flexible. Interpolates to engineering.
What is wrong:
Graphical/generative Models
Methodology: Sometimes Bayesian and sometimes not. Datamakers are typically assumed to be IID samples of fixed or varying length data. Datamakers are represented graphically with conditional independencies encoded in the graph. For some graphs, fast algorithms for making (or approximately making) predictions exist.
What is right: Relative to pure Bayesian systems, this approach is sometimes computationally tractable. More importantly, the graph language is natural, which aids prior elicitation.
What is wrong:
Convex Loss Optimization
Methodology: Specify a loss function related to the worldimposed loss fucntion which is convex on some parametric predictive system. Optimize the parametric predictive system to find the global optima.
What is right: Mathematically clean solutions where computational tractability is partly taken into account. Relatively automatable.
What is wrong:
Gradient Descent
Methodology: Specify an architecture with free parameters and use gradient descent with respect to data to tune the parameters.
What is right: Relatively computationally tractable due to (a) modularity of gradient descent (b) directly optimizing the quantity you want to predict.
What is wrong:
Kernelbased learning
Methodology: You chose a kernel K(x,x') between datapoints that satisfies certain conditions, and then use it as a measure of similarity when learning.
What is right: People often find the specification of a similarity function between objects a natural way to incorporate prior information for machine learning problems. Algorithms (like SVMs) for training are reasonably practical  O(n^{2}) for instance.
What is wrong: Specification of the kernel is not easy for some applications (this is another example of prior elicitation). O(n^{2}) is not efficient enough when there is much data.
Boosting
Methodology: You create a learning algorithm that may be imperfect but which has some predictive edge, then apply it repeatedly in various ways to make a final predictor.
What is right: A focus on getting something that works quickly is natural. This approach is relatively automated and (hence) easy to apply for beginners.
What is wrong: The boosting framework tells you nothing about how to build that initial algorithm. The weak learning assumption becomes violated at some point in the iterative process.
Online Learning with Experts
Methodology: You make many base predictors and then a master algorithm automatically switches between the use of these predictors so as to minimize regret.
What is right: This is an effective automated method to extract performance from a pool of predictors.
What is wrong: Computational intractability can be a problem. This approach lives and dies on the effectiveness of the experts, but it provides little or no guidance in their construction.
Learning Reductions
Methodology: You solve complex machine learning problems by reducing them to wellstudied base problems in a robust manner.
What is right: The reductions approach can yield highly automated learning algorithms.
What is wrong: The existence of an algorithm satisfying reduction guarantees is not sufficient to guarantee success. Reductions tell you little or nothing about the design of the base learning algorithm.
PAC Learning
Methodology: You assume that samples are drawn IID from an unknown distribution D. You think of learning as finding a nearbest hypothesis amongst a given set of hypotheses in a computationally tractable manner.
What is right: The focus on computation is pretty rightheaded, because we are ultimately limited by what we can compute.
What is wrong: There are not many substantial positive results, particularly when D is noisy. Data isn’t IID in practice anyways.
Statistical Learning Theory
Methodology: You assume that samples are drawn IID from an unknown distribution D. You think of learning as figuring out the number of samples required to distinguish a nearbest hypothesis from a set of hypotheses.
What is right: There are substantially more positive results than for PAC Learning, and there are a few examples of practical algorithms directly motivated by this analysis.
What is wrong: The data is not IID. Ignorance of computational difficulties often results in difficulty of application. More importantly, the bounds are often loose (sometimes to the point of vacuousness).
Decision tree learning
Methodology: Learning is a process of cutting up the input space and assigning predictions to pieces of the space.
What is right: Decision tree algorithms are well automated and can be quite fast.
What is wrong: There are learning problems which can not be solved by decision trees, but which are solvable. It’s common to find that other approaches give you a bit more performance. A theoretical grounding for many choices in these algorithms is lacking.
Algorithmic complexity
Methodology: Learning is about finding a program which correctly predicts the outputs given the inputs.
What is right: Any reasonable problem is learnable with a number of samples related to the description length of the program.
What is wrong: The theory literally suggests solving halting problems to solve machine learning.
RL, MDP learning
Methodology: Learning is about finding and acting according to a near optimal policy in an unknown Markov Decision Process.
What is right: We can learn and act with an amount of summed regret related to O(SA) where S is the number of states and A is the number of actions per state
What is wrong: Has anyone counted the number of states in real world problems? We can’t afford to wait that long. Discretizing the states creates a POMDP (see below). In the real world, we often have to deal with a POMDP anyways.
RL, POMDP learning
Methodology: Learning is about finding and acting according to a near optimaly policy in a Partially Observed Markov Decision Process
What is right: In a sense, we’ve made no assumptions, so algorithms have wide applicability.
What is wrong: All known algorithms scale badly with the number of hidden states.
This set is incomplete of course, but it forms a starting point for understanding what’s out there. (Please fill in the what/pro/con of anything I missed.)
Original: hunch.net/?p=224
Bio: John Langford is a machine learning research scientist, a field which he says "is shifting from an academic discipline to an industrial tool". He is the author of the weblog hunch.net and the principal developer of Vowpal Wabbit. Related:
(Gregory Piatetsky: I recently came across another classic and valuable post by John Langford, and he kindly agreed to repost it in KDnuggets).
Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I've created a table (below) outlining the major flaws in some common models of machine learning.
The point here is not simply "woe unto us". There are several implications which seem important.
 The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes there "home framework" through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students.
 Algorithms which conform to multiple approaches can have substantial value. "I don't really understand it yet, because I only understand it one way". Reinterpretation alone is not the goal  we want algorithmic guidance.
 We need to remain constantly open to new mathematical models of machine learning. It's common to forget the flaws of the model that you are most familiar with in evaluating other models while the flaws of new models get exaggerated. The best way to avoid this is simply education.
 The value of theory alone is more limited than many theoreticians may be aware. Theories need to be tested to see if they correctly predict the underlying phenomena.
Here is a summary what is wrong with various frameworks for learning. To avoid being entirely negative, I added a column about what's right as well.
Bayesian LearningMethodology: You specify a prior probability distribution over datamakers, P(datamaker) then use Bayes law to find a posterior P(datamakerx). True Bayesians integrate over the posterior to make predictions while many simply use the world with largest posterior directly.
What is right: Handles the small data limit. Very flexible. Interpolates to engineering.
What is wrong:
 Information theoretically problematic. Explicitly specifying a reasonable prior is often hard.
 Computationally difficult problems are commonly encountered.
 Human intensive. Partly due to the difficulties above and partly because "first specify a prior" is built into framework this approach is not very automatable.
Graphical/generative Models
Methodology: Sometimes Bayesian and sometimes not. Datamakers are typically assumed to be IID samples of fixed or varying length data. Datamakers are represented graphically with conditional independencies encoded in the graph. For some graphs, fast algorithms for making (or approximately making) predictions exist.
What is right: Relative to pure Bayesian systems, this approach is sometimes computationally tractable. More importantly, the graph language is natural, which aids prior elicitation.
What is wrong:
 Often (still) fails to fix problems with the Bayesian approach.
 In real world applications, true conditional independence is rare, and results degrade rapidly with systematic misspecification of conditional independence.
Convex Loss Optimization
Methodology: Specify a loss function related to the worldimposed loss fucntion which is convex on some parametric predictive system. Optimize the parametric predictive system to find the global optima.
What is right: Mathematically clean solutions where computational tractability is partly taken into account. Relatively automatable.
What is wrong:
 The temptation to forget that the world imposes nonconvex loss functions is sometimes overwhelming, and the mismatch is always dangerous.
 Limited models. Although switching to a convex loss means that some optimizations become convex, optimization on representations which aren't single layer linear combinations is often difficult.
Gradient Descent
Methodology: Specify an architecture with free parameters and use gradient descent with respect to data to tune the parameters.
What is right: Relatively computationally tractable due to (a) modularity of gradient descent (b) directly optimizing the quantity you want to predict.
What is wrong:
 Finicky. There are issues with paremeter initialization, step size, and representation. It helps a great deal to have accumulated experience using this sort of system and there is little theoretical guidance.
 Overfitting is a significant issue.
Kernelbased learning
Methodology: You chose a kernel K(x,x') between datapoints that satisfies certain conditions, and then use it as a measure of similarity when learning.
What is right: People often find the specification of a similarity function between objects a natural way to incorporate prior information for machine learning problems. Algorithms (like SVMs) for training are reasonably practical  O(n^{2}) for instance.
What is wrong: Specification of the kernel is not easy for some applications (this is another example of prior elicitation). O(n^{2}) is not efficient enough when there is much data.
Boosting
Methodology: You create a learning algorithm that may be imperfect but which has some predictive edge, then apply it repeatedly in various ways to make a final predictor.
What is right: A focus on getting something that works quickly is natural. This approach is relatively automated and (hence) easy to apply for beginners.
What is wrong: The boosting framework tells you nothing about how to build that initial algorithm. The weak learning assumption becomes violated at some point in the iterative process.
Online Learning with Experts
Methodology: You make many base predictors and then a master algorithm automatically switches between the use of these predictors so as to minimize regret.
What is right: This is an effective automated method to extract performance from a pool of predictors.
What is wrong: Computational intractability can be a problem. This approach lives and dies on the effectiveness of the experts, but it provides little or no guidance in their construction.
Learning Reductions
Methodology: You solve complex machine learning problems by reducing them to wellstudied base problems in a robust manner.
What is right: The reductions approach can yield highly automated learning algorithms.
What is wrong: The existence of an algorithm satisfying reduction guarantees is not sufficient to guarantee success. Reductions tell you little or nothing about the design of the base learning algorithm.
PAC Learning
Methodology: You assume that samples are drawn IID from an unknown distribution D. You think of learning as finding a nearbest hypothesis amongst a given set of hypotheses in a computationally tractable manner.
What is right: The focus on computation is pretty rightheaded, because we are ultimately limited by what we can compute.
What is wrong: There are not many substantial positive results, particularly when D is noisy. Data isn’t IID in practice anyways.
Statistical Learning Theory
Methodology: You assume that samples are drawn IID from an unknown distribution D. You think of learning as figuring out the number of samples required to distinguish a nearbest hypothesis from a set of hypotheses.
What is right: There are substantially more positive results than for PAC Learning, and there are a few examples of practical algorithms directly motivated by this analysis.
What is wrong: The data is not IID. Ignorance of computational difficulties often results in difficulty of application. More importantly, the bounds are often loose (sometimes to the point of vacuousness).
Decision tree learning
Methodology: Learning is a process of cutting up the input space and assigning predictions to pieces of the space.
What is right: Decision tree algorithms are well automated and can be quite fast.
What is wrong: There are learning problems which can not be solved by decision trees, but which are solvable. It’s common to find that other approaches give you a bit more performance. A theoretical grounding for many choices in these algorithms is lacking.
Algorithmic complexity
Methodology: Learning is about finding a program which correctly predicts the outputs given the inputs.
What is right: Any reasonable problem is learnable with a number of samples related to the description length of the program.
What is wrong: The theory literally suggests solving halting problems to solve machine learning.
RL, MDP learning
Methodology: Learning is about finding and acting according to a near optimal policy in an unknown Markov Decision Process.
What is right: We can learn and act with an amount of summed regret related to O(SA) where S is the number of states and A is the number of actions per state
What is wrong: Has anyone counted the number of states in real world problems? We can’t afford to wait that long. Discretizing the states creates a POMDP (see below). In the real world, we often have to deal with a POMDP anyways.
RL, POMDP learning
Methodology: Learning is about finding and acting according to a near optimaly policy in a Partially Observed Markov Decision Process
What is right: In a sense, we’ve made no assumptions, so algorithms have wide applicability.
What is wrong: All known algorithms scale badly with the number of hidden states.
This set is incomplete of course, but it forms a starting point for understanding what’s out there. (Please fill in the what/pro/con of anything I missed.)
Original: hunch.net/?p=224
Bio: John Langford is a machine learning research scientist, a field which he says "is shifting from an academic discipline to an industrial tool". He is the author of the weblog hunch.net and the principal developer of Vowpal Wabbit. Related:
 Machine Learning in 7 Pictures
 11 Clever Methods of Overfitting and how to avoid them
 Vowpal Wabbit: Fast Learning on Big Data
Top Stories Past 30 Days

