The Surprising Complexity of Randomness

By Brett Romero, Open Data Kosovo.

Previously, in a walkthrough on building a simple application without a database, I touched on randomness. Randomness and generating random numbers is a surprisingly deep and important area of computer science, and also one that few outside of computer science know much about. As such, for my own benefit as much as yours, I thought I would take a deeper look at the surprising complexity of randomness.

Why do we need randomness?

There can be a number of uses for randomness. But firstly, one thing to note is that when it comes to computers and computer science, randomness is typically represented by random numbers – seemingly random sequences of numbers that can then be used for different purposes. These purposes can range from randomly generating words in a flashcard app or shuffling songs in a playlist, to significantly more high-stakes uses, such as generating random keys for secure logins, data encryption, or randomly shuffling a deck of cards in an online game where large amounts of money are at stake.

How are random numbers created at the moment?

Random numbers come in two types, pseudorandom numbers and true random numbers.

Pseudorandom numbers are numbers that are generated to appear random, but are not truly random. Typically, pseudorandom numbers will be generated using a seed value provided by a user or programmer, which is then passed to an algorithm that uses that value to generate a new number. These algorithms often work by taking the remainder of an equation with includes the seed value and several large numbers.

For example, let’s say we use the following very simple equation to generate a series of random numbers:

R = (387 x S + 217) // 954

Where:
R is the random number to be produced
S is the seed value for R
// represents modular division, where the result will be the remainder of the division

Starting with a seed value (S) of 43, the first random number produced by the equation will be:

R = (387 x 43 + 217) // 953

R = 657

To produce the second random number, we then insert 657 as S, back into the equation:

R = (387 x 657 + 217) // 953

R = 25

This process can be repeated as many times as needed, generating an apparently random series of numbers.

While this example is a very simple one, this process of feeding the last random number into the same equation to generate a new random number is common to almost all pseudorandom number generators, and will result in two common attributes, regardless of the complexity.

The first is that if the seed value (S) is the same, the sequence of ‘random’ numbers produced by the algorithm will be exactly the same every time. This means that if you know the equation and the seed value, you can predict the entire sequence of ‘random’ numbers.

The second issue is that, eventually, the pattern will repeat. That is, eventually the formula will generate the same number twice, meaning the whole sequence will start again. And depending on the equation and large values chosen, this could be surprisingly soon.

Creating true random numbers

The reason we have pseudorandom numbers is because generating true random numbers using a computer is difficult. Computers, by design, are excellent at taking a set of instructions and carrying them out in the exact same way, every single time. It is this predictability which makes them so powerful. However, this predictability also makes it complicated to generate true random numbers.

As such, for a computer to create a truly random number, it has to take in some external input from something that is truly random. This external input can be something like key presses and movements of the mouse by a human operator, or network activity on a busy network in an office setting. But it can also be something far more complex such as the effect of atmospheric turbulence on a laser, or measuring the decay of a radioactive isotope.

Generating random numbers using mouse and keyboard inputs

Why does it matter?

This difference between pseudorandom and true random numbers is important, but only in certain settings.

For uses like selecting a random sample when working with data, shuffling a playlist, or triggering events in a video game, it is less important if pseudorandom or true random numbers are used. How true the randomness is, in these cases, will not impact the quality of the outcomes.

In some cases, using pseudorandom numbers may be advantageous. Take for example the process of selecting a random sample for a scientific study. In this case, using pseudorandom numbers allows others to replicate your results by using the same seed value. In video games, being able to trigger the same ‘random’ events is very useful when the game is being tested.

In other cases, using true random numbers is much more important. In applications such as encryption, using true random numbers is particularly important as it helps to ensure that data remains protected. Similarly, for online gambling, gaming companies need to have a very high level of confidence that the way results are being produced in everything from blackjack (how the cards are shuffled), to roulette (where the ball lands) and poker machines (which position the reels stop in) is a truly random process, or they risk someone reverse engineering the algorithm and making a significant profit as a result.

True randomness is not what most people expect

When it comes to true randomness, one of its stranger aspects is that it often behaves differently to people’s expectations. Take the two diagrams below – which one do you think is a random distribution, and which has been deliberately created/adjusted?

Only one of these panels shows a random distribution of dots | Source: Bully for Brontosaurus – Stephen Jay Gould

If you said the right panel, you are in good company, as this is most people’s expectation of what randomness looks like. However, this relatively uniform distribution has been adjusted to ensure the dots are evenly spread. In fact, it is the left panel, with its clumps and voids, that reflects a true random distribution. It is also this tendency for randomness to produce clumps and voids that leads to some unintuitive outcomes.

Take Spotify, the digital music service for example. For years, Spotify listeners have complained about the quality of the playlist shuffle. In fact, the quality of Spotify’s shuffle has been such a topic of discussion, that if you type “Spotify shuffle” into Google, one of the first autocomplete options that will come up is “sucks”. When Spotify looked into these complaints, the most common theme centered on songs from the same artist frequently playing one after the other. In short, people’s expectations of randomness were not matching reality. As Spotify explain in this interesting article, their shuffle was actually random, but they have now adjusted it to better align with what people think of as random – by reducing the randomness and ensuring that songs from a given artist will be spread throughout the playlist.

The gambler’s fallacy

As is also covered in the Spotify article, a great example of this misalignment of people’s expectations with the true nature of randomness is the so-called gambler’s fallacy. What the gambler’s fallacy boils down to is two things:

A belief that independent random events (a flip of a coin, a roll of a dice) have some sort of inherent tendency to revert to the mean. For example, when flipping a coin, a streak of heads makes the likelihood that the next flip will be tails increase so that the eventual distribution will move back towards 50-50.
As a result of belief 1, people tend to underestimate the likelihood of streaks (or clumps) of outcomes. The classic example of this is the person at the roulette table who looks at the list of previous results and sees a run of five black numbers, and believes that the likelihood of the next number being red is now higher as a result. By the way, this is exactly why casinos show the history, to tempt people into betting when they think the odds are in their favor.

To test your own beliefs on the likelihood of streaks, consider a roulette wheel in a casino. Let’s say the casino is open 12 hours a day, and that on average, it gets spun once per minute, giving us 720 spins in a day. Assuming there is a 50% chance of a red number and a 50% chance of a black number (i.e. we are ignoring the green 0 and 00 tiles for simplicity), what do you think the probability is of a streak of 8 or more black or red numbers in a row on a given day?

The answer is over 75%. In other words, on three out of four days, you should expect to see at least one streak of 8 or more black or red numbers during the day. Extending this, there is a 30% chance of a streak of 10 or more and around an 8% chance of a streak of twelve. You can test this and other scenarios using this handy calculator.

What does any of this mean?

In the course of your daily life, not too much. If you are a gambler, you should probably stop, but I am sure I am not the first person to tell you that. If you follow stock pickers, hopefully you will reconsider how much of their ‘skill’ is pure chance, especially when you factor in survivorship bias [1]. Perhaps something here will help you impress your friends at a trivia night.

If none of the above apply however, hopefully this article has introduced you to an interesting and little known area of knowledge with some important and fascinating applications.

[1] Survivorship bias in this context exists because the stock pickers that were not picking the right stocks did not keep writing articles. Over time, this leaves only the people who have been picking the winners (the ‘survivors’) to continue writing, even if their picks were correct purely by chance.

Bio: Brett Romero is a data analyst with experience working in a number of countries and industries, including government, management consulting and finance. Currently based in Pristina, Kosovo, he is working as a data consultant with development agencies such as UNDP and Open Data Kosovo.

Original. Reposted with permission.

Related: