Beyond the Turing Test

With continued advancements in AI, it might be time to replace the age-old Turing Test with something better to determine if a machine is thinking. Specifically, a more modern approach might pose standard questions designed to probe various facets of intelligence, and compare the computer to a spectrum of human respondents of different ages, sexes, backgrounds, and abilities.



By Charles Simon, nationally recognized entrepreneur and software developer.

Image by Juan Alberto Sánchez Margallo, CC BY 2.5.

With artificial intelligence (AI) seemingly touching every aspect of our lives, most experts agree that it’s only a matter of time before today’s AI evolves into Artificial General Intelligence (AGI), a point at which computers meet or even exceed human intelligence. The question that remains, though, is how will we know when that happens?

In 1950, Alan Turing introduced his famous test as a method for determining whether or not a machine was actually thinking. While his test has gone through some evolution since his original paper, a common explanation goes like this:

A person, the interrogator (C), communicates via a computer terminal (these days, we might say by instant messaging, emailing, or texting). At the other end of the computer link is either a human (B) or a computer (A). After 20 minutes of keyboard communication, the interrogator states whether a person or a computer was at the other end. If the interrogator believes he was conversing with a human, but it was actually a computer, the conclusion is that the computer must be thinking like a human. The experiment is carried out multiple times, and the computer "passes" the test if more than half of the interrogators take it for a human.

A more recent adaptation of the Turing Test reduces the conversation to five minutes and considers the test passed if the computer fools the interrogator more than 30% of the time. In 2014, a program called Cleverbot (http://www.cleverbot.com/) was claimed to have passed the Turing Test by fooling 33% of interrogators. While Cleverbot has some sophisticated responses, my interaction with it quickly exposed its limitations. Rather than quibble with Cleverbot’s claims, though, I would rather quibble with Turing’s test. While it represented a great leap at the time of its publication, I have two primary concerns:

  • The renown of the Turing Test drives the development of programs such as Cleverbot or Watson, which have astounding language abilities, at the expense of resources targeted at real AGI.
  • In order to pass the test, a computer must be programmed to lie. Any personal question such as “How old are you?”, “What color are your eyes?”, or even “Are you a computer?” is a giveaway if the computer answers truthfully. To the extent a system is programmed with the equivalent of goals and emotions in order to pass the test, these must be human goals and emotions rather than ones that might be effective for the machine. That is a lot of development effort expended just to play what is essentially a party game!

I also have concerns about the accuracy of the test:

  • The quality of the test result relies on the sophistication/gullibility of the interrogator.
  • The test allows for feigned deficiencies on the part of the computer to cover its limitations (for example, claiming to be a child in order to cover gaps in its understanding).
  • It imposes human-level constraints. If we could build a machine with superhuman intellect, would it fail the test because it seemed too smart?

Suppose we had true AGI systems, and the positions were reversed. Suppose it’s an AGI deciding whether you are a computer or a human. How good a job would you do?

 

Proposed adjustments

 

At the recent AGI-20 conference, one attendee commented that a test for true intelligence would be the ability to design a test for true intelligence. Since we don’t have such a test, are none of us truly intelligent?

To get around these issues, I propose adjusting the Turing Test. Instead of individual interrogators making up more-or-less random questions, we could create sets of standard types of questions designed to probe various facets of intelligence. Instead of comparing the computer’s responses to an individual human responder, compare the computer to a spectrum of human respondents of different ages, sexes, backgrounds, and abilities.

Now, recast the interrogators as judges who individually score the test results indicating whether or not each answer is a “reasonable” response to the question. The questions and answers should be mixed randomly to prevent spotting and scoring trends. For example, if a respondent gives one low-scoring answer, it should not color the perceived quality of other responses from that respondent.
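The random mixing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Answer` and `build_judging_sheet` are hypothetical, and the idea of keeping a private key that maps each shuffled item back to its respondent is one possible way to hide identities from the judges, not a prescribed procedure.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    respondent_id: str  # human or machine; never shown to judges
    question: str
    text: str


def build_judging_sheet(answers, seed=None):
    """Shuffle answers and strip respondent identities.

    Returns (sheet, key): the sheet is what judges see; the key is kept
    private and maps each item id back to the respondent who wrote it.
    """
    rng = random.Random(seed)
    shuffled = list(answers)
    rng.shuffle(shuffled)

    sheet, key = [], {}
    for item_id, ans in enumerate(shuffled):
        sheet.append(
            {"item_id": item_id, "question": ans.question, "answer": ans.text}
        )
        key[item_id] = ans.respondent_id
    return sheet, key
```

Because each judge sees only an item number, a question, and an answer, one weak response cannot color the scoring of the same respondent's other answers.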

Sample questions which target specific component areas of intelligence potentially could include the following:

  • Can you describe what you see (or hear) around you right now? (perception)
  • Describe what you see in this picture. (pattern-recognition/knowledge)
  • If I [action, such as sing a song, fall down, drop my pencil, or tell a joke], what will your reaction be? (prediction)
  • If you [action, such as tell a joke, steal my wallet, or pass this test], what will my reaction be? (prediction/comprehension of human behavior)
  • Name three things which are like [an object, such as a tree, a flower, a car, or a computer]. (internal object representation, common-sense relationships)
  • Name your favorite [object, such as food, drink, movie star, book, or scientist]. (goal orientation)
  • Let me explain a code (like Morse code). Using that code, encode this message.
  • What’s wrong with this picture?

“What’s wrong with this picture?” requires not only object recognition within the image but real-world understanding of the use and relationship of objects. From: Koch, Christof and Giulio Tononi. “A Test for Consciousness: How will we know when we’ve built a sentient computer? By making it solve a simple puzzle.” (2011).

While these questions could be posed equally to a thinking machine and a human, we would presume that the two could give significantly different answers, which would make it easy to distinguish the computer from the person. So instead of asking judges to tell them apart, the response to each question is graded by several judges as meaningful or not meaningful. We can then conclude that the computer is thinking if the number of meaningful answers it gives is similar to that of the human respondents.
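As a rough sketch of how that grading might be tallied, the Python below combines several judges' votes per answer and compares the machine's share of meaningful answers with the human respondents'. Several choices here are assumptions rather than part of the proposal: an answer counts as meaningful when a strict majority of its judges say so, "similar" is read as "no more than a `tolerance` below the human average" (so a machine that outperforms humans is not penalized), and the function names and the `tolerance` parameter are hypothetical.

```python
from collections import defaultdict


def meaningful_rates(votes, key):
    """Fraction of each respondent's answers judged meaningful.

    votes: iterable of (judge_id, item_id, is_meaningful) tuples.
    key:   dict mapping item_id -> respondent_id.
    An answer is meaningful if a strict majority of its judges say so
    (an assumed rule, chosen for illustration).
    """
    flags_by_item = defaultdict(list)
    for _judge_id, item_id, is_meaningful in votes:
        flags_by_item[item_id].append(bool(is_meaningful))

    results_by_respondent = defaultdict(list)
    for item_id, flags in flags_by_item.items():
        majority = sum(flags) * 2 > len(flags)
        results_by_respondent[key[item_id]].append(majority)

    return {r: sum(v) / len(v) for r, v in results_by_respondent.items()}


def is_comparable_to_humans(rates, machine_id, tolerance=0.1):
    """True if the machine's rate is within `tolerance` of the human mean
    or above it. The tolerance value is a hypothetical parameter."""
    human = [rate for r, rate in rates.items() if r != machine_id]
    human_mean = sum(human) / len(human)
    return rates[machine_id] >= human_mean - tolerance
```

A real study would likely want a proper statistical comparison across many respondents and judges; the fixed tolerance is simply the shortest way to show the idea.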

The key point is that questions need to be open-ended so that respondents can demonstrate that they really understand. The types of questions given can be varied in order to create a limitless collection. This prevents the computer from being primed with specific answers, so the questions would require actual thought. Likewise, any single judge may not be great at determining the reasonableness of an individual answer, but with multiple judges rating multiple respondents, we should get a good assessment. How about allowing the AGI to be one of the judges?
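One way to picture how question types can be varied is to treat the bracketed sample questions as templates with interchangeable fillers. The sketch below uses only the templates and fillers from the sample list; the names `TEMPLATES`, `generate_questions`, and `sample_questions` are hypothetical, and a real question bank would need far larger and regularly refreshed filler lists to approach a "limitless" collection.

```python
import random

# Templates and fillers taken from the sample questions; {x} marks the slot.
TEMPLATES = {
    "If I {x}, what will your reaction be?": [
        "sing a song", "fall down", "drop my pencil", "tell a joke",
    ],
    "If you {x}, what will my reaction be?": [
        "tell a joke", "steal my wallet", "pass this test",
    ],
    "Name three things which are like {x}.": [
        "a tree", "a flower", "a car", "a computer",
    ],
    "Name your favorite {x}.": [
        "food", "drink", "movie star", "book", "scientist",
    ],
}


def generate_questions(templates=TEMPLATES):
    """Expand every template with every one of its fillers."""
    return [
        template.format(x=filler)
        for template, fillers in templates.items()
        for filler in fillers
    ]


def sample_questions(n, seed=None):
    """Draw n distinct questions so no two test sessions need to match."""
    rng = random.Random(seed)
    return rng.sample(generate_questions(), n)
```

Drawing a fresh random subset for each session makes it harder to prime a program with canned answers, since it cannot know in advance which combinations will appear.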

Bottom line: It’s time to replace the Turing Test with something better. We have already reached a level of AI development where we can see that continued efforts targeted solely at fooling humans on a Turing Test are not the correct direction for AGI creation.

 

Bio: Charles Simon, BSEE, MSCs is a nationally recognized entrepreneur and software developer who has many years of computer experience in industry, including pioneering work in AI. Mr. Simon's technical experience includes the creation of two unique Artificial Intelligence systems along with software for successful neurological test equipment. Combining AI development with biomedical nerve signal testing gives him singular insight. He is also the author of Will Computers Revolt?: Preparing for the Future of Artificial Intelligence, and the developer of Brain Simulator II, an AGI research software platform that combines a neural network model with the ability to write code for any neuron cluster to easily mix neural and symbolic AI code.
