Towards a Quantitative Measure of Intelligence: Breaking Down One of the Most Important AI Papers of 2019, Part II

AI scientist Francois Chollet proposes a better framework for measuring the intelligence of AI systems.




 

This is the second part of an article discussing a recently published paper by Keras creator Francois Chollet that proposes a new method to evaluate the intelligence of artificial intelligence (AI) systems. On the Measure of Intelligence challenges some of the traditional methods that equate intelligence with the ability to perform an atomic task and outlines a framework for defining intelligence using quantitative and comparable methods. In the first part of this article, we discussed the philosophical definitions of intelligence pioneered by Charles Darwin and Alan Turing, as well as the notion of generalization in deep learning models that is often used as the most visible measure of intelligence. Today, we will focus on Chollet’s proposed framework for evaluating intelligence and its core foundations.

To start, we should go into a field of psychology that is foreign to most AI practitioners.

 

A Psychometrics Perspective of Intelligence

 
The field of psychometrics focuses on studying the development of skills and knowledge in humans. A fundamental notion in psychometrics is that intelligence tests evaluate broad cognitive abilities as opposed to task-specific skills. Importantly, an ability is an abstract construct (based on theory and statistical phenomena) rather than a directly measurable, objective property of an individual mind, such as a score on a specific test. Broad abilities in AI, which are also constructs, run into exactly the same evaluation problems as cognitive abilities in psychometrics. Psychometrics approaches the quantification of abilities by using broad batteries of test tasks rather than any single task, and by analyzing test results via probabilistic models.

Some of the concepts from the theory of psychometrics can be used to evaluate the intelligence of AI systems in a more quantifiable manner. Chollet’s paper outlines a few key ideas:

  • Measuring abilities (representative of broad generalization and skill-acquisition efficiency), not skills. Abilities are distinct from skills in that they induce broad generalization.
  • Evaluating abilities via batteries of tasks rather than any single task; these tasks should be previously unknown to both the test-taking system and the system developers.
  • Having explicit standards regarding reliability, validity, standardization, and freedom from bias. In this context, reliability implies that the test results for a given system should be reproducible over time and across research groups. Validity refers to establishing a clear understanding of the objectives of a given test. Standardization implies adopting shared benchmarks across a subset of the research community. Finally, freedom from bias implies that the test should not be biased against groups of test-takers in ways that are orthogonal to the abilities being assessed.

The idea that solving individual tasks is not an effective measure of intelligence was brilliantly captured by computer science pioneer Allen Newell in the 1970s using an analogy from chess, which has become one of the canonical examples in AI:

“we know already from existing work [psychological studies on humans] that the task [chess] involves forms of reasoning and search and complex perceptual and memorial processes. For more general considerations we know that it also involves planning, evaluation, means-ends analysis and redefinition of the situation, as well as several varieties of learning — short-term, post-hoc analysis, preparatory analysis, study from books, etc.”

What this statement tells us is that solving chess does not, in itself, require these general cognitive abilities. However, possessing these general abilities makes it possible to solve chess (and many more problems) by going from the general to the specific; inversely, there is no clear path from the specific to the general. Absolutely brilliant!

 

A Quantifiable Measure of Intelligence

 
Using some of the ideas from psychometrics, Chollet arrives at the following definition of intelligence:

The intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty.

This definition of intelligence includes concepts such as meta-learning priors, memory, and fluid intelligence. From an AI perspective, if we take two systems that start from a similar set of knowledge priors and that go through a similar amount of experience (e.g., practice time) with respect to a set of tasks not known in advance, the system with higher intelligence is the one that ends up with greater skills. Another way to think about it is that “higher intelligence” systems “cover more ground” in future situation space using the same information.
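To make this a bit more concrete, below is a toy Python sketch of that comparison. It is my own illustration, not the paper’s formal definition (which is framed in terms of Algorithmic Information Theory): two hypothetical systems start from the same priors and receive the same amount of experience, and the one that reaches higher skill on the unseen tasks comes out with the higher skill-acquisition efficiency.

# Toy, hypothetical illustration of "skill-acquisition efficiency".
# Two systems start from the same priors and get the same experience;
# the one reaching higher skill is deemed more intelligent.
# This is NOT Chollet's formal definition, just a back-of-the-envelope sketch.

from dataclasses import dataclass

@dataclass
class EvaluationRecord:
    system_name: str
    priors: float        # amount of built-in knowledge (equal for both systems here)
    experience: float    # e.g. practice time or number of training episodes
    final_skill: float   # skill achieved on the unseen task set, in [0, 1]

def skill_acquisition_efficiency(record: EvaluationRecord) -> float:
    """Skill achieved per unit of priors-plus-experience spent (a crude proxy)."""
    return record.final_skill / (record.priors + record.experience)

system_a = EvaluationRecord("A", priors=1.0, experience=10.0, final_skill=0.80)
system_b = EvaluationRecord("B", priors=1.0, experience=10.0, final_skill=0.60)

for rec in (system_a, system_b):
    print(rec.system_name, round(skill_acquisition_efficiency(rec), 3))

# With equal priors and experience, system A "covers more ground",
# so it scores higher on this toy efficiency measure.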

The previous definition of intelligence looks amazing from a theoretical standpoint, but how can it be reflected in the architecture of AI systems?

An intelligent system would be an AI program that generates a specific skill to interact with a task. For instance, a neural network generation and training algorithm for games would be an “intelligent system”, and the inference-mode game-specific network it would output at the end of a training run on one game would be a “skill program”. A program synthesis engine capable of looking at a task and outputting a solution program would be an “intelligent system”, and the resulting solution program capable of handling future input grids for this task would be a “skill program”.
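As a rough illustration of that split, here is a minimal Python sketch; the class and method names are my own, not from the paper. The intelligent system is the skill-generating process, and the skill program is the task-specific artifact it outputs.

# Minimal sketch (hypothetical naming) of the two-level split described above:
# an "intelligent system" consumes a task and produces a "skill program";
# the skill program is what actually handles task inputs afterwards.

from abc import ABC, abstractmethod
from typing import Any, List, Tuple

class SkillProgram(ABC):
    """A task-specific artifact, e.g. a trained game-specific network."""
    @abstractmethod
    def run(self, task_input: Any) -> Any:
        ...

class IntelligentSystem(ABC):
    """The skill-generating process, e.g. a network generation and training algorithm."""
    @abstractmethod
    def acquire_skill(self, demonstrations: List[Tuple[Any, Any]]) -> SkillProgram:
        ...

# Under this framing, intelligence is judged by how efficiently acquire_skill()
# turns limited demonstrations of a previously unseen task into a competent
# SkillProgram, not by the raw performance of any single SkillProgram.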

Now that we have a canonical definition of intelligence for AI systems, we need a way to measure it.

 

ARC

 
The Abstraction and Reasoning Corpus (ARC) is a dataset proposed by Chollet intended to serve as a benchmark for the kind of intelligence defined in the previous sections. Conceptually, ARC can be seen as a psychometric test for AI systems that tries to evaluate a qualitatively different form of generalization rather than effectiveness on a specific task.

ARC comprises a training set and an evaluation set. The training set features 400 tasks, while the evaluation set features 600 tasks. The evaluation set is further split into a public evaluation set (400 tasks) and a private evaluation set (200 tasks). All tasks are unique, and the set of test tasks and the set of training tasks are disjoint. Given a specific task, the ARC test interface presents a handful of demonstration input/output grid pairs together with a test input grid whose output must be constructed.

The initial release of ARC is available on GitHub.
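For readers who want to poke at the data, here is a small Python sketch of loading a single task, assuming the JSON layout used in the public GitHub release (fchollet/ARC), where each task file holds "train" and "test" lists of input/output grid pairs and each grid is a 2-D list of integers from 0 to 9. The file path below is illustrative.

# Sketch of reading one ARC task, assuming the public release's JSON layout.
# Each task file contains "train" and "test" lists of {"input": grid, "output": grid}
# pairs, where a grid is a 2-D list of integers (colors 0-9). Paths are illustrative.

import json

def load_arc_task(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

task = load_arc_task("ARC/data/training/0a938d79.json")  # example file name

# The demonstration pairs show the transformation to infer...
for pair in task["train"]:
    print("input grid size :", len(pair["input"]), "x", len(pair["input"][0]))
    print("output grid size:", len(pair["output"]), "x", len(pair["output"][0]))

# ...and the "test" entries hold the input grids the solver must produce outputs for.
print("number of test inputs:", len(task["test"]))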

I started the previous article by saying that Chollet’s On the Measure of Intelligence could be considered one of the most important AI papers of this year. Some of the ideas included in the paper, or variations of them, can influence the design of AI systems so that they achieve measurable and comparable levels of intelligence. Implementing Chollet’s paradigm is not an easy task, but some of its ideas are definitely worth exploring.

 
Original. Reposted with permission.
