# An Introduction to Monte Carlo Tree Search

A great explanation of the concept behind Monte Carlo Tree Search algorithm and a brief example of how it was used at the European Space Agency for planning interplanetary flights.

By Mateusz Wyszyński, Appsilon Data Science.

Introduction

We recently witnessed one of the biggest game AI events in history – Alpha Go became the first computer program to beat the world champion in a game of Go. The publication can be found here. Different techniques from machine learning and tree search have been combined by developers from DeepMind to achieve this result. One of them is the Monte Carlo Tree Search (MCTS) algorithm. This algorithm is fairly simple to understand and, interestingly, has applications outside of game AI. Below, I will explain the concept behind MCTS algorithm and briefly tell you about how it was used at the European Space Agency for planning interplanetary flights.

Perfect Information Games

Monte Carlo Tree Search is an algorithm used when playing a so-called perfect information game. In short, perfect information games are games in which, at any point in time, each player has perfect information about all event actions that have previously taken place. Examples of such games are Chess, Go or Tic-Tac-Toe. But just because every move is known, doesn’t mean that every possible outcome can be calculated and extrapolated. For example, the number of possible legal game positions in Go is over.

Every perfect information game can be represented in the form of a tree data structure in the following way. At first, you have a root which encapsulates the beginning state of the game. For Chess that would be 16 white figures and 16 black figures placed in the proper places on the chessboard. For Tic-Tac-Toe it would be simply 3x3 empty matrix. The first player has some number n1 of possible choices to make. In the case of Tic-Tac-Toe this would be 9 possible places to draw a circle. Each such move changes the state of the game. These outcome states are the children of the root node. Then, for each of n1 children, the next player has n2 possible moves to consider, each of them generating another state of the game - generating a child node. Note that n2 might differ for each of n1 nodes. For instance in chess you might make a move which forces your enemy to make a move with their king or consider another move which leaves your opponent with many other options.

An outcome of a play is a path from the root node to one of the leaves. Each leaf consist a definite information which player (or players) have won and which have lost the game.

Making a decision based on a tree

There are two main problems we face when making a decision in perfect information game. The first, and main one is the size of the tree.

This doesn’t bother us with very limited games such as Tic-Tac-Toe. We have at most 9 children nodes (at the beginning) and this number gets smaller and smaller as we continue playing. It’s a completely different story with Chess or Go. Here the corresponding tree is so huge that you cannot hope to search the entire tree. The way to approach this is to do a random walk on the tree for some time and get a subtree of the original decision tree.

This, however, creates a second problem. If every time we play we just walk randomly down the tree, we don’t care at all about the efficiency of our move and do not learn from our previous games. Whoever played Chess during his or her life knows that making random moves on a chessboard won’t get him too far. It might be good for a beginner to get an understanding of how the pieces move. But game after game it’s better to learn how to distinguish good moves from bad ones.

So, is there a way to somehow use the facts contained in the previously built decision trees to reason about our next move? As it turns out, there is.

Multi-Armed Bandit Problem

Imagine that you are at a casino and would like to play a slot machine. You can choose one randomly and enjoy your game. Later that night, another gambler sits next to you and wins more in 10 minutes than you have during the last few hours. You shouldn’t compare yourself to the other guy, it’s just luck. But still, it’s normal to ask whether the next time you can do better. Which slot machine should you choose to win the most? Maybe you should play more than one machine at a time?

The problem you are facing is the Multi-Armed Bandit Problem. It was already known during II World War, but the most commonly known version today was formulated by Herbert Robbins in 1952. There are N slot machines, each one with a different expected return value (what you expect to net from a given machine). You don’t know the expected return values for any machine. You are allowed to change machines at any time and play on each machine as many times as you’d like. What is the optimal strategy for playing?

What does “optimal” mean in this scenario? Clearly your best option would be to play only on the machine with highest return value. An optimal strategy is a strategy for which you do as well as possible compared to the best machine.

It was actually proven that you cannot do better than O(ln) on average. So that’s the best you can hope for. Luckily, it was also proven that you can achieve this bound (again - on average). One way to do this is to do the following.

For each machine i we keep track of two things: how many times we have tried this machine (ni) and what the mean return value (xi) was. We also keep track of how many times (n) we have played in general. Then for each i we compute the confidence interval around xi:

All the time we choose to play on the machine with the highest upper bound for xi (so “+” in the formula above).

This is a solution to Multi-Armed Bandit Problem. Now note that we can use it for our perfect information game. Just treat each possible next move (child node) as a slot machine. Each time we choose to play a move we end up winning, losing or drawing. This is our pay-out. For simplicity, I will assume that we are only interested in winning, so pay-out is 1 if we have won and 0 otherwise.

Real world application example

MAB algorithms have multiple practical implementations in the real world, for example, price engine optimization or finding the best online campaign. Let’s focus on the first one and see how we can implement this in R. Imagine you are selling your products online and want to introduce a new one, but are not sure how to price it. You came up with 4 price candidates based on our expert knowledge and experience: 99\$, 100\$, 115\$ and 120\$. Now you want to test how those prices will perform and which to choose eventually. During first day of your experiment 4000 people visited your shop when the first price (99\$) was tested and 368 bought the product, for the rest of the prices we have the following outcome:

• `100\$` 4060 visits and 355 purchases,
• `115\$` 4011 visits and 373 purchases,
• `120\$` 4007 visits and 230 purchases.

Now let’s look at the calculations in R and check which price was performing best during the first day of our experiment.

```library(bandit)

set.seed(123)

visits_day1 <- c(4000, 4060, 4011, 4007)
purchase_day1 <- c(368, 355, 373, 230)
prices <- c(99, 100, 115, 120)

post_distribution = sim_post(purchase_day1, visits_day1, ndraws = 10000)
probability_winning <- prob_winner(post_distribution)
names(probability_winning) <- prices
probability_winning

##     99    100    115    120
## 0.3960 0.0936 0.5104 0.0000
```

We calculated the Bayesian probability that the price performed the best and can see that the price `115\$` has the highest probability (0.5). On the other hand `120\$` seems bit too much for the customers.

The experiment continues for a few more days.

Day 2 results:

```visits_day2 <- c(8030, 8060, 8027, 8037)
purchase_day2 <- c(769, 735, 786, 420)

post_distribution = sim_post(purchase_day2, visits_day2, ndraws = 1000000)
probability_winning <- prob_winner(post_distribution)
names(probability_winning) <- prices

probability_winning

##       99      100      115      120
## 0.308623 0.034632 0.656745 0.000000
```

After the second day price `115\$` still shows the best results, with `99\$`and `100\$` performing very similar.

Using `bandit` package we can also perform significant analysis, which is handy for overall proportion comparison using `prop.test`.

```significance_analysis(purchase_day2, visits_day2)

##   successes totals estimated_proportion        lower      upper
## 1       769   8030           0.09576588 -0.004545319 0.01369494
## 2       735   8060           0.09119107  0.030860453 0.04700507
## 3       786   8027           0.09791952 -0.007119595 0.01142688
## 4       420   8037           0.05225831           NA         NA
##   significance rank best       p_best
## 1 3.322143e-01    2    1 3.086709e-01
## 2 1.437347e-21    3    1 2.340515e-06
## 3 6.637812e-01    1    1 6.564434e-01
## 4           NA    4    0 1.548068e-39
```

At this point we can see that `120\$` is still performing badly, so we drop it from the experiment and continue for the next day. Chances that this alternative is the best according to the `p_best` are very small (p_best has negligible value).

Day 3 results:

```visits_day3 <- c(15684, 15690, 15672, 8037)
purchase_day3 <- c(1433, 1440, 1495, 420)
post_distribution = sim_post(purchase_day3, visits_day3, ndraws = 1000000)

probability_winning <- prob_winner(post_distribution)
names(probability_winning) <- prices

probability_winning
## 99 100 115 120
## 0.087200 0.115522 0.797278 0.000000
```

```value_remaining = value_remaining(purchase_day3, visits_day3)
potential_value = quantile(value_remaining, 0.95)
potential_value
##        95%
## 0.02670002
```

Day 3 results led us to conclude that `115\$` will generate the highest conversion rate and revenue. We are still unsure about the conversion probability for the best price `115\$`, but whatever it is, one of the other prices might beat it by as much as 2.67% (the 95% quantile of value remaining).

The histograms below show what happens to the value-remaining distribution, the distribution of improvement amounts that another price might have over the current best price, as the experiment continues. With the larger sample we are much more confident about conversion rate. Over time other prices have lower chances to beat price `\$115`.

If this example was interesting to you, checkout our another post about dynamic pricing.

We are ready to learn how the Monte Carlo Tree Search algorithm works - next page.