This is the first post in a series on Markov Chain Monte Carlo (MCMC), a powerful technique used in performing inference on probabilistic models. We’ll unpack what each of these terms mean: what a Markov Chain is, what Monte Carlo simulation is and then finally how it all fits together to in the framework of MCMC.
The main idea behind the Monte Carlo method is to use simulate randomly from a probability distribution where it is difficult or not possible to have a direct numerical solution of the probability.
Imagine we have a complex deterministic model such as those used in hydrodynamic flow or climate change. We might have many inputs into this model and these inputs are all likely to have some uncertainty around them. A way to understand how this uncertainty impacts the model prediction is to just simulate from it many times using inputs taken from the uncertainty distributions. Notice that this automatically penalizes against rare scenarios. If the probability of an input is low then it is unlikely to be selected and therefore unlikely to contribute to the model prediction.
The method came about from the Manhattan project. Nicholas Metropolis and his team developed the technique and needed some code name for it. They decided to name it after the Monte Carlo casino where the uncle of Stanislaw Ulam (another member of the team) often gambled. The intuition is that if you want to understand say, what the probability of winning at roulette given a certain strategy is, then one solution is to just play it many many times and recorded how often you win and lose. Then after going through your entire savings you divide the number of times you had a win by the total number of times you played and that’s your estimate of your probability of a win. If you don’t want to burn through all your money to understand this then you can create a computer simulation of a roulette wheel and use that to perform your experiment.
This is a bit abstract so let’s look at a simple example of performing a series of coin tosses. Each coin toss can be heads with probability $p$ (normally 0.5 is chosen for a fair coin, although there’s some evidence that isn’t always the case ). In a series of $N$ coin tosses the probability of $k$ heads follows a binomial distribution like this
Let’s imagine we didn’t have a formula for the probability. We could instead repeatedly simulate coin tosses, recording the number of heads in order to build up an empirical distribution that asymptotically converges to the binomial distribution. You can use the tool below to simulate this, changing the probability and speed of the simulation.
Another example of Monte Carlo simulation is in the calculation of $\pi$. One example of this is Buffon’s needle, which can be done by throwing down matchsticks and observing how often they cross a series of parallel lines.
It’s maybe easier to think instead about random points falling onto a square of length one and seeing how often the points fall inside a circle of diameter one centered in the middle of the square. The probability of the point landing in the circle is the same as the area of the circle divided by the area of the square. From basic geometry, the circle area is \(\pi r^2 = \pi (1/2)^2 = \pi/4\) and the area of the square is one. Therefore the probability that a random point lands in the circle is $\pi/4$. Note that in order to check if a point is contained in the circle, you just need to check if the sum of squares of its coordinates are less than one, so this method doesn’t implicitly include $\pi$ anywhere.
You can simulate this process in the tool below.
Finally, let’s look at a less trivial example. Suppose that we conducted a public health intervention and we want to compare the outcome between a group who did not receive the intervention and those who did. We could look at an outcome measure (blood pressure, BMI etc.) and compare the mean in both groups. How do we know if the difference of the means is significant? We might use something like a t-test, but this introduces a few assumptions about the data such as normality.
Let’s assume we don’t know where the data came from, we could instead simulate by dividing the data we have into two groups many times and comparing the difference of means in the two groups. Dividing into two groups is the same as random sampling without replacement, hence why this technique is known as the permutation test.
Below we have a tool that randomly samples a set of data from two normal distributions, with one mean centered at zero and another that you can vary. You can also vary the population size of each group. When the simulation starts a random permutation occurs and then the new difference in the means of both groups is recorded. If it’s greater than the original difference, then this simulation is counted. These counts are then used to estimate the probability that a randomly selected permutation has a difference of means at least as large as the original data. You can see after many iterations this value converges and you can use a pre-determined p-value to see whether this is significant or not.
As always there’s some caveats with this technique. Try playing around with small population sizes when the mean of the two distributions are the same to see if you can still end up with something significant. The moral is to always be suspicious when the size of the data is small.
For the calculation of pi example, I adapted code from here.
All the interactive examples were coded in d3, which has many fantastic examples for data visualization.