Probabilistic programming 4: Markov Chain Monte Carlo

Introduction

At the heart of practical Bayesian inference is Monte Carlo sampling. For a background on this see the previous blog posts: Monte Carlo Method, Markov Chains and Bayesian inference. It can often be mysterious to understand exactly what a sampler is doing, which makes it difficult to diagnose problems. The above tool uses several example two-dimensional posterior distributions to demonstrate how issues such as strong dependency between two parameters or multi-modality can lead to poor performance of certain samplers and how to counteract this.

The walker (black circle) moves around the probability landscape producing random samples that are recorded by the two histograms along the axis. The contours and colours represent the probability surface. At each step, the walker will propose a new location (red circle) either independent of the landscape (Metropolis-Hastings) or with some dependency on it (Slice and Hamiltonian). The walker may then accept the proposed location and the new position is recorded. Further descriptions for each of the samplers can be found below.

Metropolis-Hastings

Metropolis-Hastings is one of the more intuitive sampling algorithms. The idea is to perform a random walk and selectively move to a new position dependent on whether the new position has a higher probability compared to the current position. If you only moved when the probability was higher the random walker would quickly end up in a local maxima and would no longer move. This means in order to sample the region properly you would occasionally want to move to an area of lower probability. At each step the walker draws a potential new position with some distribution around the current position (two-dimensional normal in this case). The probability of the potential new position is then compared to the probability of the current position, if it is higher then the walker moves, otherwise the walker moves with a probability dependent on the ratio of the new position to the old. So if the new position has only a slightly lower probability than the current it is more likely to accept than if the new position has a far lower probability.

There can be many aspects that need fine-tuning for Metropolis-Hastings. One of the key aspects is how far on average the random walker steps at each iteration (step-size). You can adjust the step-size in the above tool. Notice that when the step-size is too small, the walker poorly explores landscape and it would take many iterations to build up sufficient samples. However, if the step-size is too large then the walker's proposed positions are rarely accepted and it can become stuck. The skewed distribution shows an extreme case of this, where there is only one direction for the walker to move in with comparable probability. There exists a sweet spot for the step-size, however this can be problem-dependent and it may not be clear what is optimal.

Slice Sampling

Slice sampling tries to take a more global approach to sampling than with Metropolis-Hastings. There are two parts to generating a new sample, an expansion and contraction phase.

In the expansion phase, units of a given step-size are taken around the current position of the walker and added to an interval (you can control the step-size with the slider above). These step-sizes are added until the interval contains points with a smaller probability than the current position.

In the contraction phase points are uniformly sampled from the constructed interval and the interval is cut at the point if its probability is less than the current position of the walker. Finally a point is randomly sampled from the interval and the walker is updated.

You'll notice that slice sampling is much more efficient when the surface is multimodal. It still struggles with the bimodal surface as parameters are being updated independently. For the skew distribution slice sampling is also fairly inefficient as it has to take many steps to traverse the surface.

Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) uses the gradient of the probability surface to provide a more efficient sampling scheme. The idea is to imagine the walker as being a massive object in a potential landscape (imagine a ball rolling down a hill). In the beginning the walker is given a "kick" with random strength and in a random direction (the step-size slider controls how large the kick is). The walker then uses this momentum to travel through the potential landscape for a set number of time-steps. The walker is then updated with the same update rule as for Metropolis-Hastings i.e. if the new probability is higher then accept otherwise accept randomly with a probability dependent on the ratio of the old probability to the new.

You'll observe that HMC is far more efficient than the other methods for sampling the skew distribution and works well with the other unimodal distributions. For the multimodal distributions you still need to provide a sufficient kick in order for the walker to jump between the regions of high probability.

Conclusion

It can be tempting to try and find a single sampler that suits all purposes. Whilst some are able to perform sampling well in most cases (e.g. NUTS) there will still be cases where they may fall down (becoming stuck, or inefficiently sampling the posterior). It is important to note that the examples given are for two-dimensional cases only and typically modern Bayesian inference will use many more dimensions, where other problems can start to creep in. Hopefully the above tool sheds some light on the issues around sampling schemes used in Bayesian inference.

Acknowledgements

All the interactive examples were coded in d3, which has many fantastic examples for data visualization.

This post was broadly inspired by this blog post on numerical optimization and this tool on exploring training of neural nets.