This is part two of a blog post on probabilistic programming. The first part of the blog can be found here.
Markov chains are mathematical constructs with a wide range of applications in physics, mathematical biology, speech recognition, statistics and many other fields. The simplest way to think about them is to consider the animation above. A person (the circle) is trying to find out where their friend lives on a neighbourhood block. Unfortunately all the houses (the squares) look the same and have no numbers. Each time they get to a house they knock on the door, but then immediately forget where they are. They then randomly choose to go left or right before trying another house. We could ask how long on average it would take for them to find their friend’s house, or what the probability is that they’d find the house after a certain number of steps. This is straightforward to compute precisely because the person forgets where they are each time they visit a house. This is the Markov property, and it is crucial for keeping the computations of the system tractable: the person only has knowledge of their current state (house) and has no memory of their previous states. It turns out that lots of systems have this property.
One simple system to think about is the coupon collector’s problem. Let’s think about it in terms of toys given away in packets of cereal. Imagine a cereal company is running a promotion and giving away four different toys with their cereal. When you buy one cereal packet you’ll receive one toy, but you don’t know which toy you’ll receive until you buy the packet.
Let’s simulate this to think about the problem. Below is an overview of our system, where all the toys we have collected so far (not including duplicates) are in the top row; the toy we have just received is below, along with a counter for how many cereal boxes we’ve bought. Below that is our abstract way of thinking about this problem. If all we care about is how long it takes to complete the collection, then instead of keeping track of every toy we’ve received, or of which specific toys we’ve collected, we can simply track how many unique toys we’ve collected so far. This forms a Markov chain, with the number of unique toys collected so far as our state. The probability of transitioning to a new state (receiving a toy we don’t already have) depends only on the current number of unique toys we’ve collected, so the Markov property holds.
Try simulating cereal purchases a few times. You should be able to estimate on average how much cereal you’d need to buy to complete a collection. Sometimes we’re lucky and can do this in just four purchases. Other times we’re unlucky and have to buy many more. We can quantify how much our purchases might vary by calculating the variance of the number of purchases required to complete a collection.
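If the interactive simulation isn’t to hand, a minimal Python sketch of the same experiment might look like the following (the function name and the choice of four toys are just for illustration):

```python
import random

def boxes_to_complete(n_toys=4):
    """Simulate buying cereal boxes until all n_toys unique toys are collected."""
    collected = set()
    purchases = 0
    while len(collected) < n_toys:
        purchases += 1
        collected.add(random.randrange(n_toys))  # each toy is equally likely
    return purchases

# Estimate the mean and variance over many simulated collections.
runs = [boxes_to_complete() for _ in range(100_000)]
mean = sum(runs) / len(runs)
variance = sum((r - mean) ** 2 for r in runs) / len(runs)
print(f"estimated mean: {mean:.2f}, estimated variance: {variance:.2f}")
```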
We can actually work out the expected number of boxes we need to purchase to complete the collection by hand. First let’s think about the probability of moving from 0 unique toys to 1 unique toy. This is one, as the first box always gives us a toy we haven’t collected before. Moving from 1 to 2, the probability of gaining a new unique toy is $3/4$. Similarly, the probability of transitioning from 2 to 3 is $1/2$, and from 3 to 4 it is $1/4$.
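Putting these together, one way to write the whole system down is as a transition matrix over the states 0 to 4 (the number of unique toys collected), where the entry in row $i$ and column $j$ is the probability of moving from $i$ unique toys to $j$:

\[P = \begin{pmatrix} 0 & 1 & 0 & 0 & 0\\ 0 & \tfrac{1}{4} & \tfrac{3}{4} & 0 & 0\\ 0 & 0 & \tfrac{1}{2} & \tfrac{1}{2} & 0\\ 0 & 0 & 0 & \tfrac{3}{4} & \tfrac{1}{4}\\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}\]

The final state is absorbing: once the collection is complete, we stay there.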
What is the probability of gaining a new unique toy on exactly the $x$th purchase, if the probability of a new unique toy in one purchase is $p$? This is the probability of not getting a new unique toy $x-1$ times in a row, followed by getting one, or $(1-p)^{x-1}p$. This is the geometric distribution, and we can quickly calculate its expectation and variance as,
\[\begin{align} \mathbb{E}[X] &= \frac{1}{p}\\ \text{Var}[X] &= \frac{1-p}{p^2} \end{align}\]As all these events are independent of one another (the number of boxes you have to buy to get the second toy is independent of the number you have to buy to get the third toy, for example), we can calculate the expectation and variance of the total number of boxes needed to complete the collection by summing the individual expectations and variances,
\[\begin{align} \mathbb{E}[Y] &= \sum_{i=1}^{4}\mathbb{E}[X_i] \\ \text{Var}[Y] &= \sum_{i=1}^{4}\text{Var}[X_i] \end{align}\]So plugging in our values for $p$ ($1$, $3/4$, $1/2$, $1/4$), we find an expected number of purchases of around $8.3$ with variance $14.4$. We can check these values by simulating the process below multiple times and recording the mean and variance.
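As a quick sanity check, we can also evaluate those two sums directly. A small Python sketch, assuming four toys:

```python
# Expected purchases and variance for the coupon collector with 4 toys.
# For each new toy, the success probability p is (toys remaining) / (total toys).
n_toys = 4
ps = [(n_toys - i) / n_toys for i in range(n_toys)]  # 1, 3/4, 1/2, 1/4

expected = sum(1 / p for p in ps)            # sum of geometric expectations, 1/p
variance = sum((1 - p) / p**2 for p in ps)   # sum of geometric variances, (1-p)/p^2

print(f"expected purchases: {expected:.2f}")  # ~8.33
print(f"variance: {variance:.2f}")            # ~14.44
```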
Markov chains don’t need to have a finite number of states. Consider our random walker from the beginning. Instead of just visiting four houses, imagine that there are an infinite number of houses on the block. Another way to think about this is a gambler’s winnings (as long as the gambler can go into an infinite amount of debt). Let’s imagine the gambler plays a simple game where they flip a coin: if it’s heads they gain one dollar, and if it’s tails they lose a dollar. Let’s imagine that this is a fair coin, so the probability of a win is a half. We could ask how long it might take for the gambler to lose all their money and end up at zero. It turns out this has a surprising answer.
First let’s simulate this process a few times below.
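A bare-bones version of that simulation in Python might be (the number of games and win probability here are just example values):

```python
import random

def gamble(n_games=100, p_win=0.5):
    """Simulate a gambler's winnings over n_games coin flips."""
    winnings = 0
    history = [winnings]
    for _ in range(n_games):
        winnings += 1 if random.random() < p_win else -1
        history.append(winnings)
    return history

# A few independent runs show how much the paths vary.
for _ in range(3):
    print(gamble()[-1])  # final winnings after 100 games
```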
We can actually figure out some properties of this without needing to run multiple simulations. The first trick is to work out the gambler’s expected winnings after one game. We know that with probability $p$ their winnings go up by one, and with probability $1-p$ they go down by one. So the expected change is $1\times p + (-1)\times(1-p) = 2p -1$. Note that when the probability of gaining a dollar is a half, the expected step is zero: half the time they lose a dollar and half the time they gain a dollar, so the two cancel.
Does this mean that after a hundred steps we would expect the winnings to be 0? Clearly, if we experiment by running a few simulations, this doesn’t appear to be the case: there is quite a bit of variation between simulations. What’s the variance in one time-step? Starting at 0, if we’re just as likely to go up one as down one, we could guess the variance is 1. We can work this out using the formal definition of the variance, \(\text{Var}[X] = \mathbb{E}[X^2]-\mathbb{E}[X]^2\). The first term is $1^2\times p + (-1)^2 \times (1-p) = 1$ and the second term is $(1\times p + (-1) \times (1-p))^2 = (2p-1)^2$. This gives the variance as $1-(2p-1)^2$.
The great thing about the Markov property is that the probability of moving to a given state depends only on the previous state. For this simple model the probability of moving up or down is actually independent of the state, and each step is independent of the others. So to calculate the variance after $t$ games we just sum the variance of each step. In other words, the formula turns out to be \((1-(2p-1)^2)\,t\).
The variance is at its maximum when $p$ is a half, i.e. when there’s an equal chance of moving up or down. After one hundred time steps the variance is one hundred, and the probability of being exactly at position 0 after one hundred steps is only about $8\%$. In fact, the probability of being back at the origin at any particular step diminishes over time. It turns out that the expected time to return to $0$ is actually infinite: the gambler can take arbitrarily long excursions, as there’s nothing bounding their winnings (or losses).
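Both numbers are easy to check: the $8\%$ figure comes from the binomial distribution (being back at 0 after 100 steps requires exactly 50 wins), and the variance formula can be checked by simulation. A small sketch:

```python
import random
from math import comb

# Probability of being at position 0 after 100 steps of a fair walk: exactly 50 wins.
p_at_zero = comb(100, 50) / 2**100
print(f"P(position 0 after 100 steps) = {p_at_zero:.3f}")  # ~0.080

# Empirical variance of the final position after t games with win probability p.
def final_position(t=100, p=0.5):
    return sum(1 if random.random() < p else -1 for _ in range(t))

finals = [final_position() for _ in range(100_000)]
mean = sum(finals) / len(finals)
variance = sum((x - mean) ** 2 for x in finals) / len(finals)
print(f"empirical variance: {variance:.1f}")  # close to (1 - (2p - 1)**2) * t = 100
```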
Imagine instead that the coin has a slight bias, i.e. $p\neq 1/2$. We can see from our variance formula above that the more biased the coin, the lower the variance of the gambler’s winnings. We can see the impact of a slight bias by varying the probability of a win in the simulation below: the further the probability moves from a half, the smaller the variance becomes.
Finally let’s look at a real application of a Markov chain: the PageRank algorithm. This is the original algorithm used by Google to rank web pages. A more in-depth look can be found here.
We can think about the algorithm in terms of a random walker “walking” through web pages by clicking on links. When it arrives at a web page, it finds all the links on that page and picks one at random to click on. For simplicity, let’s imagine that every page links back to the page that linked to it. Each time the walker passes through a page we record this by adding a point to that page. After a long time the pages can be ranked by how many times the walker has visited them, which gives us a way of quantifying how “central” a page is on the web. One issue is that the raw count for each page keeps increasing, so we can add a damping factor by making the points the walker awards smaller the later they come in its walk.
We can simulate the process on a small random network below. The colour denotes how recently the walker has visited that page and the size denotes its PageRank.
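The interactive version isn’t reproduced here, but a minimal sketch of the same idea, a random walker counting visits on a small made-up link graph, might look like this (the graph and function name are purely illustrative):

```python
import random
from collections import Counter

# A small link graph: each page lists its neighbours, and every link is
# reciprocated, mirroring the simplification above.
links = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B"],
    "D": ["B"],
}

def random_walk_rank(links, steps=100_000):
    """Count how often a random walker visits each page."""
    visits = Counter()
    page = random.choice(list(links))
    for _ in range(steps):
        visits[page] += 1
        page = random.choice(links[page])  # follow a random link
    return visits

visits = random_walk_rank(links)
total = sum(visits.values())
for page, count in visits.most_common():
    print(page, count / total)  # visit frequency as a crude rank
```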
In reality there are far quicker ways to calculate the PageRank than to just simulate it. By setting the problem up as a Markov chain though, we’re able to use a lot of mathematical machinery to calculate it efficiently.