Not So Fast: Questioning Deep Learning IQ Results

Did deep learning just leap towards human intelligence? Not so fast.



On May 29th, a group of researchers from the University of Science and Technology of China and Microsoft Research Beijing posted a paper to arXiv.org describing a machine learning system that outperforms some humans on verbal comprehension questions from IQ tests. Understandably, the claim caught media interest, notably in a June 12th article from MIT Technology Review. The system consists of two layers. An initial classifier identifies the precise type of question, i.e., whether it is an analogy, classification, synonym, or antonym question. IQ (intelligence quotient) questions are highly formulaic; each category of question follows a nearly identical pattern. Given the type of question, a second algorithm chooses the best solution from the candidates.

According to the authors, led by Huazheng Wang, "the results are highly encouraging, indicating that with appropriate uses of the deep learning technologies, we could be a further step closer to the true human intelligence". This is a dramatic claim and invites scrutiny. Surprisingly, Wang et al. make little effort to qualify the relationship between IQ and intelligence. While they offer the caveat that "a high IQ might not be a necessary condition for a successful life", they insist that "people would still tag the individual with high IQ by labels of smart or clever."


In this post I will briefly describe the methods employed in this paper. Next, I'll take a look at their evaluation methodology, which uses crowdsourced question answering via Mechanical Turk as a stand-in for "human intelligence". I'll also compare the capabilities of this system to those of other question-answering bots. Finally, I'll take a critical look at the implicit assumptions about the nature of the IQ exam and of intelligence. Namely, I'll ask whether a multiple choice synonym/antonym answerer is fundamentally any closer to human intelligence than a calculator. After all, the ability to add large numbers mentally is likely associated with academic achievement!

Problem Formulation and Algorithmic Strategy

First, the authors treat each question as a bag of words. In the bag of words representation, a document is a vector with one component corresponding to each word in the vocabulary. They perform TF-IDF preprocessing, meaning that the value of each component scales proportionally with the number of times a word occurs in a document and inversely with the document frequency of that word. In other words, a single occurrence of a rare word has greater impact than a single occurrence of a common word.
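To make the weighting concrete, here is a minimal sketch (not the authors' code) using scikit-learn's TfidfVectorizer on a few invented toy documents; words that appear in many documents receive a lower inverse document frequency, so a single occurrence of a rare word counts for more.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents invented for illustration; "the" appears in every one,
# while "isobar" appears only once.
docs = [
    "the isotherm and the isobar",
    "the temperature rose",
    "the wind and the pressure",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse bag-of-words matrix with TF-IDF weights

# Higher IDF means the word is rarer across documents, so one occurrence
# of it contributes more to the document vector than a common word does.
for word, idx in sorted(vectorizer.vocabulary_.items()):
    print(word, round(vectorizer.idf_[idx], 2))
```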

Using this representation, the authors train a support vector machine (SVM) to predict which kind of question is being answered. The task is multi-class classification and the five labels are "Analogy-I, Analogy-II, Classification, Synonym, and Antonym".
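As a rough sketch of such a classifier (the training examples and pipeline below are placeholders for illustration, not the paper's data or code), one could feed TF-IDF bag-of-words features into a linear SVM:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Placeholder training data; the real system is trained on a labeled
# bank of IQ questions covering the five types.
train_questions = [
    "isotherm is to temperature as isobar is to",
    "safe is to secure as protect is to",
    "which word does not belong: apple, banana, carrot, plum",
    "which word means the same as rapid",
    "which word is the opposite of scarce",
]
train_labels = ["Analogy-I", "Analogy-II", "Classification", "Synonym", "Antonym"]

# TF-IDF bag-of-words features feeding a linear support vector machine.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_questions, train_labels)

print(clf.predict(["which word means the same as quick"]))
```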

An example of an Analogy-I question is given below:

Example 1. Isotherm is to temperature as isobar is to? (i) atmosphere, (ii) wind, (iii) pressure, (iv) latitude, (v) current

Then, once the type of the question is known (with high probability), they use the word2vec method of Mikolov et al. to obtain embeddings (fixed-length distributed vector representations) of each word. The authors extend the notion of a single representation, building a model which attempts to capture a separate representation for each sense or meaning of a word. This is where the paper is most technically interesting. For each word, they learn a separate representation for each context in which the word occurs, and they then cluster these representations using spherical k-means clustering.
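A rough sketch of the clustering step follows, assuming context vectors for each occurrence of a word have already been computed (the vectors below are random placeholders). Spherical k-means is approximated here by running ordinary k-means on length-normalized vectors, so distances behave like cosine distances.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Placeholder: one 50-dimensional context vector per occurrence of a word,
# e.g. an average of the embeddings of its neighboring words.
context_vectors = rng.normal(size=(200, 50))

# Normalize to unit length so that Euclidean k-means approximates
# spherical (cosine-based) k-means.
norms = np.linalg.norm(context_vectors, axis=1, keepdims=True)
normalized = context_vectors / norms

# Each cluster is taken to correspond to one sense of the word;
# the number of senses (two here) is chosen purely for illustration.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(normalized)
sense_ids = kmeans.labels_  # sense assignment for each occurrence
print(np.bincount(sense_ids))
```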

They match each cluster to the appropriate corresponding dictionary meaning. Then, for each of the five types of questions, they introduce a distinct solver to return the best of the candidate answers, using a distance function on word vectors as a basic operation.
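To give a flavor of what such a solver might look like, here is a hypothetical Analogy-I scorer built on the standard word2vec vector-offset trick; the embedding dictionary is a random stand-in (so the printed answer is meaningless), and the paper's actual solvers operate on the multi-sense representations described above.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve_analogy(emb, a, b, c, candidates):
    """Pick the candidate d maximizing similarity to (b - a + c),
    i.e. 'a is to b as c is to d'."""
    target = emb[b] - emb[a] + emb[c]
    return max(candidates, key=lambda d: cosine(target, emb[d]))

# Placeholder embeddings; a real system would load trained word2vec vectors.
rng = np.random.default_rng(0)
words = ["isotherm", "temperature", "isobar",
         "atmosphere", "wind", "pressure", "latitude", "current"]
emb = {w: rng.normal(size=100) for w in words}

print(solve_analogy(emb, "isotherm", "temperature", "isobar",
                    ["atmosphere", "wind", "pressure", "latitude", "current"]))
```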

To compare the results of their system to human performance, the authors conducted a Mechanical Turk study. They collected information on the participants' ages and educational backgrounds and asked them several questions. This is a strange choice, and it is surprising that they report these results without hedging. As far as I know, there is no way to conclusively verify the age or educational background of a Mechanical Turk respondent.

Further, it seems extremely unlikely that the set of people with master's degrees performing Mechanical Turk jobs for a few cents per survey is representative of master's degree graduates. The economic incentive for Turk workers is to complete as many questions as possible, not to answer them as accurately as possible, especially if they know that their performance (and thus compensation) is not contingent upon answering correctly. The same goes for doctoral graduates. A cursory look at the stats they report (a sample size of 8 for doctoral candidates) reveals peculiar irregularities. Why are doctoral grads far better than master's grads at 4 of the 5 tasks, including Analogy-I, but far worse at Analogy-II?

Further, the reported human accuracy is less than 50% on most of these questions (random guessing would yield 20% accuracy on Analogy-I), even for the "Masters graduates". Is this possible? I've taken IQ tests, and it seems absolutely shocking to me that a cadre of randomly selected master's students would get the majority of questions wrong. I confess I do not have hard data on what the true results should be, but this number aroused my skepticism.

A Step Closer to Human Intelligence?

If the work truly shows that computers can now pass the written IQ exam with stronger scores than humans, it is definitely interesting. But I don't see how one could reasonably claim that this methodology brings us "a further step closer to the true human intelligence". This system is hand-engineered to identify the specific patterns in formulaic standardized tests. It's hard-wired to know the types of questions that exist. This system may be a powerful demonstration of word2vec style distributed representations for words, but it is hardly a display of true human intelligence. If the format of the question were changed significantly, or were not formulaic, it would seem that this system couldn't cope. As with many standardized tests, the verbal reasoning section tests the breadth of a participant's vocabulary more than anything else, and it would hardly come as a surprise that the computer can maintain a larger vocabulary than a human.

Human intelligence is notable largely for the ability to generalize. In contrast, this system is hyper-specialized. While calculators can multiply far larger numbers to far greater precision than any human, nearly anyone would agree that they are not intelligent. A human can encounter a new task and quickly adapt to it; a human need not be pre-programmed to accomplish any specific task. Question answering systems, such as IBM's Jeopardy-playing Watson, have accomplished superhuman feats of question answering for some time. The case of Watson would seem to be considerably more human-like, because Watson must answer open-ended questions (not multiple choice) and must contend with a far more diverse set of clues.

The recent work by Sutskever et al., training recurrent neural networks to translate from English to French, is also considerably more human-like, as it has no hard-wired assumptions about either English or French. It has no hard-programmed notion of a synonym, antonym, bagel, or antelope. It has no pre-programmed notion of grammar, and yet it can produce grammatically well-formed translations.

Really, this should be a wake-up call to people who've put too much stock in IQ tests: while these tests may be correlated with attributes of interest, they do not intrinsically represent a direct measure of intelligence any more than the ability to add numbers does. Keep in mind that many other aspects of IQ tests (besides verbal comprehension) have long been trivial for computers to perform. Ultimately, the IQ test is a fairly arbitrary set of benchmarks developed in 1912 by a psychologist, roughly contemporary with an era when psychologists attributed all manner of female health problems to "hysteria".

Generally, the data science and machine learning communities look for convenient objectives to optimize. This is reasonable. But we ought to be more critical when portraying the results. We ought to be more reserved in making claims about what they represent in the real world.

Zachary Chase Lipton Zachary Chase Lipton is a PhD student in the Computer Science Engineering department at the University of California, San Diego. Funded by the Division of Biomedical Informatics, he is interested in both theoretical foundations and applications of machine learning. In addition to his work at UCSD, he has interned at Microsoft Research Labs.
