Can AI Learn Human Values?

OpenAI believes that the path to safe AI requires social sciences.





Ensuring fairness and safety in artificial intelligence (AI) applications is considered by many to be the biggest challenge in the space. As AI systems match or surpass human intelligence in many areas, it is essential that we establish guidelines to align this new form of intelligence with human values. The challenge is that, as humans, we understand very little about how our values are represented in the brain, and often we can’t even formulate specific rules to describe a given value. While AI operates in a data universe, human values are a byproduct of our evolution as social beings. We don’t describe human values like fairness or justice in neuroscientific terms but with arguments from social sciences like psychology, ethics or sociology. Last year, researchers from OpenAI published a paper describing the importance of social sciences for improving the safety and fairness of AI algorithms in processes that require human intervention.

We often hear that we need to avoid bias in AI algorithms by using fair and balanced training datasets. While that’s true in many scenarios, there are many instances in which fairness can’t be described using simple data rules. A simple question such as “do you prefer A to B” can have many answers depending on the specific context, human rationality or emotion. Imagine the task of inferring a pattern of “happiness”, “responsibility” or “loyalty” given a specific dataset. Can we describe those values simply using data? Extrapolating that lesson to AI systems tells us that in order to align with human values we need help from the disciplines that better understand human behavior.


AI Value Alignment: Learning by Asking the Right Questions

In their research paper, the OpenAI team introduced the notion of AI value alignment as “the task of ensuring that artificial intelligence systems reliably do what humans want”. AI value alignment requires a level of understanding of human values in a given context. However, we often can’t express the reasoning behind a specific value judgment as a data rule. In those scenarios, the OpenAI team believes that the best way to understand human values is by simply asking questions.

Imagine a scenario in which we are trying to train a machine learning classifier to decide whether the outcome of a specific event is “better” or “worse”. “Is an increase in taxes better or worse?” Maybe it is better for government social programs and worse for your economic plans. “Would it be better or worse if it rains today?” Maybe it would be better for the farmers and worse for the folks who were planning a biking trip. Questions about human values can have different subjective answers depending on the specific context. From that perspective, if we can get AI systems to ask specific questions, maybe they can learn to imitate human judgment in specific scenarios.
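
The context dependence described above can be made concrete with a small sketch. This is only an illustration under stated assumptions: the answers are made up, and the `fit` and `predict` helpers are hypothetical names, not anything from the OpenAI paper. It shows that a model that ignores who was asked either flattens the disagreement into a majority vote or cannot decide at all, while a model that keeps the context recovers both points of view.

```python
from collections import Counter, defaultdict

# Hypothetical answers gathered by asking "is this outcome better or worse?"
# Each record is (event, context of the person asked, answer).
answers = [
    ("rain today", "farmer", "better"),
    ("rain today", "farmer", "better"),
    ("rain today", "cyclist", "worse"),
    ("tax increase", "social program", "better"),
    ("tax increase", "household budget", "worse"),
]

def fit(records, use_context):
    """Count answers per key; the key includes the context only if use_context is True."""
    counts = defaultdict(Counter)
    for event, context, answer in records:
        key = (event, context) if use_context else (event,)
        counts[key][answer] += 1
    return counts

def predict(counts, key):
    """Return the majority answer for a key, or None if the key is unseen or tied."""
    ranked = counts.get(key, Counter()).most_common(2)
    if not ranked:
        return None
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # people disagree evenly; there is no single "right" label
    return ranked[0][0]

with_ctx = fit(answers, use_context=True)
no_ctx = fit(answers, use_context=False)

print(predict(with_ctx, ("rain today", "farmer")))   # judged from the farmer's side
print(predict(with_ctx, ("rain today", "cyclist")))  # judged from the cyclist's side
print(predict(no_ctx, ("tax increase",)))            # without context the vote is tied
```

The design choice worth noticing is the key: the same event produces different labels once the context is part of what the system asks about.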

Asking the right question is an effective method for achieving AI value alignment. Unfortunately, this type of learning method is vulnerable to three well-known limitations of human value judgment:

  1. Reflective Equilibrium: In many cases, humans can’t arrive at the right answer to a question related to value judgment. Cognitive or ethical biases, lack of domain knowledge or a fuzzy definition of “correctness” are factors that might introduce ambiguity in the answers. However, if we remove many of the contextual limitations of the question, a person might arrive at the “right answer”. In philosophy this is known as “reflective equilibrium”, and it is one of the mechanisms that any AI algorithm that tries to learn about human values should try to imitate.
  2. Uncertainty: Even if we can achieve a reflective equilibrium for a given question, there might be many circumstances in which uncertainty or disagreement prevents humans from arriving at the right answer. Activities related to future planning, for example, often entail uncertainty.
  3. Deception: Humans have a unique ability to provide answers that sound plausible but might be wrong in some non-obvious way. Intentionally or unintentionally, deceptive or misleading behavior often results in a misalignment between the outcome of a given event and the values of the parties involved. Recognizing deceptive behavior is a non-trivial challenge that needs to be solved to achieve AI value alignment.


Learning Human Values by Debating

So far we have two main arguments behind the thesis of AI value alignment:

  1. AI systems can learn human values by asking questions.
  2. Questions are often vulnerable to challenges like uncertainty, deception or the absence of a reflective equilibrium.

Bringing these two ideas together, the OpenAI team decided to induce AI agents to learn human values by relying on one of the purest question-answering dynamics: debates. Conceptually, debating is a form of discussion that breaks down a complex argument into an iterative set of simpler questions in order to formulate a reasoning path towards a specific answer. In applying debate techniques to achieve AI value alignment, the OpenAI team relied on an operating hypothesis:

“Optimal play in the debate game (giving the argument most convincing to a human) results in true, useful answers to questions.”

With that hypothesis as the foundation, OpenAI created a game in which two AI agents engage in a debate, each trying to convince a human judge. The debaters are trained only to win the game, and are not motivated by truth separate from the human’s judgments. On the human side, the objective is to understand whether people are strong enough as judges in debate to make this scheme work, or how to modify debate to fix it if it doesn’t. Using AI debaters is the ideal setting for the OpenAI debate game, but the technology hasn’t caught up to that point yet. Most real debates leverage sophisticated natural language patterns that are beyond the capabilities of AI systems today. That said, efforts like IBM Project Debater are closing this gap.

To avoid the limitations of AI debaters, OpenAI uses a scheme with two human debaters and a human judge. The outcome of this debate game is used to train the AI-AI-human setting.


To test the idea of training AI systems using this debate model, the OpenAI team created a prototype website where two debaters argue over the contents of an image. The game chooses an image of a cat or a dog and shows it to the two debaters but not to the judge. One debater is honest and argues for the true contents of the image; the other debater lies. The debaters can talk to the judge and illustrate their points by drawing rectangles on the image, but the judge sees only the rectangles. At the end of the debate, each debater is allowed to reveal a single pixel to the judge, which is the only part of the debate that cannot be a lie. The outputs of the debate are used to train sophisticated image classifiers.
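
The information flow of this game can be sketched in a few lines of Python. This is a toy model under stated assumptions, not OpenAI's implementation: the `DebateGame`, `JudgeView` and `Rectangle` names, the 3x3 "image" and the debaters' claims are all hypothetical. The point it illustrates is the asymmetry in the rules: claims attached to rectangles are never checked, while a revealed pixel always comes from the real image.

```python
from dataclasses import dataclass, field

@dataclass
class Rectangle:
    top: int
    left: int
    bottom: int
    right: int
    claim: str  # what the debater says about this region; may be a lie

@dataclass
class JudgeView:
    """Everything the judge sees: rectangles with claims, plus revealed pixels."""
    rectangles: list = field(default_factory=list)  # (debater, Rectangle) pairs
    revealed: dict = field(default_factory=dict)    # (row, col) -> true pixel value

class DebateGame:
    def __init__(self, image, true_label):
        self._image = image            # hidden from the judge
        self._true_label = true_label  # hidden from the judge
        self._used_reveal = set()      # debaters who already spent their pixel
        self.view = JudgeView()

    def draw(self, debater, rect):
        """Record a rectangle and its claim; nothing here is checked for truth."""
        self.view.rectangles.append((debater, rect))

    def reveal_pixel(self, debater, row, col):
        """Show the judge one true pixel; each debater may do this only once."""
        if debater in self._used_reveal:
            raise ValueError(f"{debater} has already revealed a pixel")
        self._used_reveal.add(debater)
        self.view.revealed[(row, col)] = self._image[row][col]

    def score(self, judge_choice):
        """After the judge votes, report whether the vote matched the hidden label."""
        return judge_choice == self._true_label

# Toy 3x3 grayscale "image" (made-up values) whose hidden label is "cat".
image = [[10, 200, 10],
         [10, 10, 10],
         [90, 90, 90]]
game = DebateGame(image, true_label="cat")
game.draw("honest", Rectangle(0, 1, 0, 1, claim="bright spot: a cat's eye"))
game.draw("liar", Rectangle(0, 1, 0, 1, claim="this area is dark, no eye here"))
game.reveal_pixel("honest", 0, 1)  # the judge now knows the true value at (0, 1)
```

In this sketch the judge would decide using only `game.view`; the single verified pixel contradicts the liar's claim about the same region, which is the kind of leverage the honest debater is expected to have.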


Using debates as the underlying technique can help answer important questions about the relationship between humans and AI agents.

The idea of applying social sciences to AI is not a new one, but the OpenAI efforts are some of the first pragmatic steps in this area. While social sciences focus on understanding human behavior in the real world, AI sort of takes the best version of human behavior as a starting point. From that perspective, the intersection of social sciences and AI can lead to a fairer and safer machine intelligence.

Original. Reposted with permission.