<h1>Challenges in modelling interventions for COVID-19</h1>
<p>Mike Irvine, 2020-03-15, <a href="https://sempwn.github.io/blog/2020/03/15/covid19">https://sempwn.github.io/blog/2020/03/15/covid19</a></p>
<link rel="stylesheet" type="text/css" href="https://sempwn.github.io/css/covid19/main.css" />
<figure class="figure">
<img class="center-block img-responsive" src="https://sempwn.github.io/img/covid19/virus.jpg" alt="covid-19 virus" />
<figcaption class="figure-caption text-center">
Photo by <a href="https://unsplash.com/@fusion_medical_animation?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Fusion Medical Animation on Unsplash.</a>
</figcaption>
</figure>
<h3 id="introduction">Introduction</h3>
<p class="lead">
To begin with a disclaimer, all simulations within this post are for educational
purposes only. Although I have tried to use parameters consistent with the current
covid-19 outbreak where possible, there are other factors such as incubation period,
heterogeneity of the population, or importation of cases that I haven't explicitly
included.
</p>
<p>Among the many things the current covid-19 pandemic has shown is how difficult it is
to predict whether an outbreak of an infectious disease will grow into an
epidemic, and what the potential impact of subsequent interventions might be. Using models we
can build up a picture of this uncertainty and factor in elements
we don’t know about the disease, such as whether individuals can be asymptomatic carriers.</p>
<p>Another recent point of debate, exemplified particularly in the <a href="https://www.theguardian.com/commentisfree/2020/mar/15/observer-view-on-the-government-coronavirus-strategy-must-face-scrutiny">UK</a>, is
whether it is best to build up herd immunity or to impose strict
controls on movement early in the epidemic, as has been done in <a href="https://time.com/5802293/coronavirus-covid19-singapore-hong-kong-taiwan/">Singapore, Hong Kong, and Taiwan</a>. The apps below
will allow you to explore the impact of intervention both early and later
in the epidemic.</p>
<h3 id="outbreak-control">Outbreak control</h3>
<p>We have to begin by defining the most important number in infectious disease
epidemiology, the $R_0$. Its full definition is:</p>
<blockquote class="lead text-center">
<p class="mb-0">The average number of secondary cases for every
primary case in a completely susceptible population
</p>
</blockquote>
<p>This is a little technical, so let’s break down each part of the definition.
A <strong>completely susceptible population</strong> is one in which all individuals are able to be
infected by the virus and no one has any prior immunity. The <strong>secondary cases</strong>
following a <strong>primary case</strong> are the individuals infected by that single primary
individual. The <strong>average number</strong> is also important here. For example, if an infection
has an $R_0$ of 2, an infected individual would on average infect 2 others, but any
particular individual could infect more or fewer.</p>
<p>At the start of an epidemic, many random infection events can make it incredibly
difficult to predict how many cases we would expect even a week later. To explore this,
the simulation below shows a series of infection events following one infected
individual (the technical name for this type of model is a branching process).
Using the slider you can change the reproduction number $R_0$ and simulate three generations
ahead (for this purpose we can assume a generation is 7 days, so we are simulating three
weeks ahead).</p>
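The branching-process simulation described above can be sketched in a few lines of Python. This is an illustrative sketch only: the app itself is written in d3, and the Poisson offspring distribution here is an assumption not stated in the post.

```python
import numpy as np

def simulate_generations(r0, generations=3, rng=None):
    """Total cases after `generations` generations, starting from a single
    infected individual, assuming Poisson-distributed secondary cases."""
    rng = rng or np.random.default_rng()
    infected, total = 1, 1
    for _ in range(generations):
        # each currently infected individual infects a Poisson(r0) number of people
        infected = int(rng.poisson(r0, size=infected).sum()) if infected else 0
        total += infected
    return total
```

Running this repeatedly for a fixed $R_0$ reproduces the spread of outcomes that the slider and simulate button let you explore.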
<hr />
<div class="form-inline">
<div class="form-group col-md-6">
<label id="f-inputr0-label" for="f-inputr0" class="ol-form-label">R0: 2.50</label>
<i class="fa fa-question-circle" href="#" data-toggle="tooltip" data-placement="bottom" title="
Basic reproduction number. Average number
of secondary cases following a primary case.
"></i>
<input id="f-inputr0" type="range" min="0" max="400" value="250" class="slider" />
</div>
<div class="form-group col-md-3">
<button id="r0DiagramSimulate" class="btn btn-primary">Simulate!</button>
</div>
</div>
<div id="r0Diagram"></div>
<hr />
<p>Early estimates of the $R_0$ for covid-19 are <a href="https://www.thelancet.com/journals/laninf/article/PIIS1473-3099%2820%2930144-4/fulltext">around 2.5</a>, although there is a large
amount of uncertainty around this. Even if we knew the $R_0$ exactly, because of
the randomness in how infection events transpire we would still see some scenarios
where there are a large number of cases after three generations and others where
there aren’t any.</p>
<p>Now let’s consider this process happening several times, where we keep simulating
what would transpire if one individual is infected. This builds up the
probability of an outbreak occurring and of how large it would be.
Using the app below, you can change the initial $R_0$ and observe the final
number of cases after three generations. As more simulations are run, a pattern
begins to build up that describes the distribution of all possible infection
scenarios for that particular $R_0$.</p>
<hr />
<div class="form-group">
<label id="input-outbreakr0-label" for="input-outbreakr0" class="ol-form-label">R0: 2.50</label>
<i class="fa fa-question-circle" href="#" data-toggle="tooltip" data-placement="bottom" title="
Basic reproduction number. Average number
of secondary cases following a primary case.
"></i>
<input id="input-outbreakr0" type="range" min="0" max="400" value="250" class="slider" />
</div>
<div id="outbreak-simulation"></div>
<hr />
<p>Although I mentioned above that current estimates of the $R_0$ for covid-19 are
around 2.5, this assumes no intervention against its spread. Many
factors may help to limit the spread and reduce the $R_0$, including social distancing,
self-quarantining, and contact tracing. Try reducing the $R_0$ above and see
how it impacts the probability of an outbreak occurring. You’ll notice that if the
$R_0$ is below 1 then the probability of the epidemic taking off drops to zero.</p>
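For a branching process the probability of an outbreak taking off can also be computed directly. Assuming a Poisson offspring distribution (an assumption, not stated in the post), the extinction probability $q$ satisfies $q = e^{R_0(q-1)}$, which can be solved by fixed-point iteration:

```python
import math

def outbreak_probability(r0, iters=500):
    """Probability a single case sparks a large outbreak, assuming a
    Poisson offspring distribution: the extinction probability q is the
    smallest solution of q = exp(r0 * (q - 1))."""
    q = 0.0
    for _ in range(iters):
        q = math.exp(r0 * (q - 1.0))
    return 1.0 - q
```

For $R_0$ below 1 this returns essentially zero, matching the observation above; for $R_0 = 2.5$ the outbreak probability is roughly 0.89.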
<h3 id="later-epidemic-case-management">Later epidemic case management</h3>
<p>Many countries are now observing sustained community-based transmission where the
majority of new cases are from individuals becoming infected in their own community
and not individuals who had recently travelled abroad. In this situation outbreak
control is no longer feasible and so other measures must be used including encouraging
social distancing and hand washing. In more extreme cases, countries <a href="https://en.wikipedia.org/wiki/2020_Italy_coronavirus_lockdown">such as Italy</a> have begun to impose national quarantines
for a given period.</p>
<p>Both the timing and duration of these interventions can be incredibly important,
not only for controlling the total number of individuals who become infected but also the
location and height of the epidemic peak, when the most individuals are infected in a given week. The idea is
to <a href="https://www.washingtonpost.com/graphics/2020/world/corona-simulator/">“flatten the curve”</a>
so as not to overwhelm a nation’s healthcare services and give them more time to
respond.</p>
<p>The simulator below lets you explore the consequences of an intervention event where
the risk of transmission is reduced. The top-left graph shows the curve of the
epidemic where there is intervention and the counterfactual scenario where no intervention occurs.
The top-right shows the total number of infected individuals at the end of
an epidemic and the number that remained susceptible. The bottom graph shows
the effective $R_0$ at each point in time; this is the average number of individuals
infected by a case at that moment. Try moving the $R_0$ slider below to see how
this impacts the size of the epidemic and its peak.</p>
<!--R0 slider -->
<div class="form-group form-control-lg">
<label id="inputr0-label" for="inputr0" class="col-6 col-form-label">R0: 2.50</label><i class="fa fa-question-circle" href="#" data-toggle="tooltip" data-placement="bottom" title="
Basic reproduction number. Average number
of secondary cases following a primary case.
"></i>
<div class="col-12">
<input id="range-inputr0" type="range" min="0" max="400" value="250" class="slider" />
</div>
</div>
<!-- Graph -->
<div class="row justify-content-center">
<div class="col-lg-8">
<div id="SIRGraphDiv"></div>
</div>
<div class="col-lg-4">
<div id="totalGraphDiv"></div>
</div>
</div>
<div class="row justify-content-center">
<div id="reffGraphDiv"></div>
</div>
<!-- Parameters -->
<div class="row justify-content-center">
<div class="col-6">
<div class="form-group form-control-lg">
<label id="input-rt-label" for="input-rt" class="col-9 col-form-label">Reduction in transmission: 0%</label>
<i class="fa fa-question-circle" href="#" data-toggle="tooltip" data-placement="bottom" title="
Percentage reduction of R0 during the intervention period.
">
</i>
<div class="col-12">
<input id="range-input-rt" type="range" min="0" max="100" value="0" class="slider" />
</div>
</div>
</div>
<div class="col-6">
<div class="form-group form-control-lg">
<label id="input-start-label" for="input-start" class="col-9 col-form-label">Start: day 0</label>
<i class="fa fa-question-circle" href="#" data-toggle="tooltip" data-placement="bottom" title="
Start of intervention
">
</i>
<input id="range-input-start" type="range" min="0" max="100" value="0" class="slider" />
</div>
<div class="form-group form-control-lg">
<label id="input-duration-label" for="input-duration" class="col-9 col-form-label">Intervention duration: 0 days</label>
<i class="fa fa-question-circle" href="#" data-toggle="tooltip" data-placement="bottom" title="
Duration of social isolation event
">
</i>
<input id="range-input-duration" type="range" min="0" max="365" value="0" class="slider" />
</div>
</div>
</div>
<p>The sliders above control how much the intervention reduces the spread of the infection,
when the intervention starts, and its total duration. In this simulation,
as soon as an intervention stops the $R_0$ returns to its initial value, which,
depending on when the intervention starts, can lead to the epidemic being delayed
or to a double-peaked epidemic. If the intervention begins too long after the
peak then it has little impact on the overall epidemic.</p>
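A minimal version of this intervention experiment can be sketched with a standard SIR model in which the transmission rate is reduced during an intervention window. This is an illustrative sketch, not the post's actual simulator (which is written in JavaScript); the 7-day infectious period matches the generation time assumed earlier, and the simple Euler integration is my choice.

```python
import numpy as np

def run_sir(r0=2.5, reduction=0.0, start=0, duration=0,
            gamma=1 / 7, n_days=365, dt=0.1):
    """SIR model (Euler steps) where transmission is reduced by the
    fraction `reduction` during the window [start, start + duration)."""
    s, i = 0.999, 0.001          # susceptible and infected fractions
    infected = []
    for step in range(int(n_days / dt)):
        t = step * dt
        beta = r0 * gamma
        if start <= t < start + duration:
            beta *= 1.0 - reduction
        ds = -beta * s * i
        di = beta * s * i - gamma * i
        s, i = s + ds * dt, i + di * dt
        infected.append(i)
    return s, max(infected)      # final susceptible fraction, epidemic peak

# counterfactual (no intervention) vs. a 50% reduction for 60 days from day 30
s_base, peak_base = run_sir()
s_int, peak_int = run_sir(reduction=0.5, start=30, duration=60)
```

Comparing `peak_base` and `peak_int` shows the "flattening" effect: the intervention lowers the peak even though the epidemic may continue once it is lifted.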
<h2 id="conclusions">Conclusions</h2>
<p>As <a href="https://personalpages.manchester.ac.uk/staff/thomas.house/blog/blog.html">Thomas House from Manchester University</a> has also commented, initial interventions that lower the effective
reproduction number may only be delaying those individuals from becoming infected;
however, they do reduce the peak of the epidemic. Many factors impact the overall epidemiology
of a virus and how that translates into the total number of cases. This is especially
problematic in the current pandemic, where estimates of infectivity, incubation period,
and recovery time all have large uncertainty. Moreover, it is not clear how much
current or future interventions will impact the ability of covid-19 to spread.</p>
<script src="https://cdnjs.cloudflare.com/ajax/libs/raphael/2.2.7/raphael.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/flowchart/1.7.0/flowchart.min.js"></script>
<!-- D3 script -->
<script type="text/javascript" src="https://d3js.org/d3.v3.min.js"></script>
<!-- Bootstrap -->
<script src="https://code.jquery.com/jquery-3.2.1.slim.min.js" integrity="sha384-KJ3o2DKtIkvYIK3UENzmM7KCkRr/rE9/Qpg6aAZGJwFDMVNA/GpGFF93hXpG5KkN" crossorigin="anonymous"></script>
<!-- bootstrap spinner -->
<script src="https://sempwn.github.io/js/spinner.js"></script>
<!-- Plot.ly js -->
<script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.11.0/umd/popper.min.js" integrity="sha384-b/U6ypiBEHpOf/4+1nzFpr53nxSS+GLCkfwBdFNTxtclqqenISfwAzpKaMNFNmj4" crossorigin="anonymous"></script>
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0-beta/js/bootstrap.min.js" integrity="sha384-h0AbiXch4ZDo7tp9hKZ4TsHbi047NrKGLO3SEJAg45jXxnGIfYzk4Si90RDIqNm1" crossorigin="anonymous"></script>
<script src="https://sempwn.github.io/js/covid19/flowcharts.js"></script>
<script src="https://sempwn.github.io/js/covid19/histogrammer.js"></script>
<script src="https://sempwn.github.io/js/covid19/SIR.js"></script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}
});
</script>
<script type="text/javascript" async="" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML">
</script>
<h1>Testing binary classifiers</h1>
<p>2018-10-24, <a href="https://sempwn.github.io/blog/2018/10/24/roc">https://sempwn.github.io/blog/2018/10/24/roc</a></p>
<style>
.bar rect {
fill: steelblue;
}
.bar text {
fill: #fff;
font: 10px sans-serif;
}
body {
font: 10px sans-serif;
}
.line {
stroke: #000;
stroke-width: 1.5px;
}
.axis path,
.axis line {
fill: none;
stroke: #000;
shape-rendering: crispEdges;
}
.overlay {
fill: none;
pointer-events: all;
}
.focus circle {
fill: none;
}
</style>
<h2 id="introduction-to-binary-classifiers">Introduction to binary classifiers</h2>
<p>Classification is a classic problem in machine learning and statistics, where we
have some data and wish to choose a single category for each data point. The simplest
form of this is binary classification, where each data point can represent one of
two states. For example, is an email spam or not spam? Does a medical test mean a patient
does or doesn’t have a particular disease? Or does a picture contain a <a href="https://www.kaggle.com/c/dogs-vs-cats">cat or a dog</a>? Consider a series of emails that have been hand-labelled as
either spam or not spam. A simple (and probably pretty poor) binary classifier
would be to check whether the email contains the word “spam” itself. In this way we could automatically
classify each email and then compare against the hand-labelled category.</p>
<p>No classifier is perfect. Sometimes spam is missed or a medical test gives a wrong result.
In order to determine how well a classifier can perform a number of different performance statistics
have been developed. This article briefly goes through some of the main ones with interactive graphs to
demonstrate how they’re applied and some of their consequences.</p>
<h2 id="the-statistics">The statistics</h2>
<p>For a population of data-points (e.g. emails, patients, or images) and a binary classifier (e.g. checking the email contains the word spam), each data-point can be divided into one of four categories.
These are given by whether the data-point is positive or negative (e.g. actually is spam or has a disease)
against whether the point tested positive or negative (e.g. the email contains the word “spam”). The four groupings are then</p>
<ul>
<li><strong>True Positive (TP).</strong> A point that tested positive and is actually positive.</li>
<li><strong>True Negative (TN).</strong> A point that tested negative and is actually negative.</li>
<li><strong>False Positive (FP).</strong> A point that tested positive, but is actually negative.</li>
<li><strong>False Negative (FN).</strong> A point that tested negative, but is actually positive.</li>
</ul>
<p>The diagram below splits the population up into positive (in orange) and negative (in blue),
with those that tested positive darker than those that tested negative. Each area represents the
total number for each category.</p>
<hr />
<div id="positive-negative-explorer" class="row"></div>
<hr />
<p>The above diagram shows some of the main classifier statistics used:</p>
<ul>
<li><strong>Sensitivity (or recall).</strong> The proportion of actually positive points that tested positive.</li>
<li><strong>Specificity.</strong> The proportion of actually negative points that tested negative.</li>
<li><strong>Precision</strong> (or positive predictive value). The probability that a point that tested positive is actually positive.</li>
</ul>
<p>Each button above shows how to calculate these in practice.</p>
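As a sketch, these statistics can be computed directly from the four confusion-matrix counts. The helper below is hypothetical and not part of the post's interactive code:

```python
def classifier_stats(tp, fp, tn, fn):
    """Binary-classifier statistics from the four confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # recall: positives that tested positive
        "specificity": tn / (tn + fp),   # negatives that tested negative
        "precision": tp / (tp + fp),     # positive predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }
```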
<h2 id="putting-it-into-practice">Putting it into practice</h2>
<p>Thinking about these statistics is important, especially when the prevalence of
positive cases is low. If only one in a thousand emails is spam, then even a
test with fairly high specificity will mean most emails that test positive actually aren’t spam.</p>
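This low-prevalence effect can be made concrete with Bayes' rule: precision depends on the prevalence as well as on the test itself. The sketch below assumes nothing beyond the definitions given above:

```python
def precision_from_rates(prevalence, sensitivity, specificity):
    """Precision (positive predictive value) via Bayes' rule."""
    true_pos = sensitivity * prevalence                 # rate of true positives
    false_pos = (1 - specificity) * (1 - prevalence)    # rate of false positives
    return true_pos / (true_pos + false_pos)
```

For example, with one in a thousand emails being spam and a test that is 99% sensitive and 99% specific, `precision_from_rates(0.001, 0.99, 0.99)` is only about 0.09: roughly nine out of ten flagged emails are not spam.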
<p>The interactive diagram below simulates the consequences of varying <strong>prevalence</strong>, <strong>sensitivity</strong>,
and <strong>specificity</strong> for a population of points (shown as circles). Use the sliders first to determine how prevalent a positive case is along with the properties of the test. Then simulate the positives using the <strong>set positives</strong> button. Next apply the <strong>test</strong> with the specified <strong>sensitivity</strong> and <strong>specificity</strong>. Finally <strong>sort</strong> the data points to calculate the statistics.</p>
<hr />
<div id="test-treat" class="row"></div>
<hr />
<p>The data points are sorted into a <a href="https://en.wikipedia.org/wiki/Confusion_matrix">confusion matrix</a>, with testing negative and positive
arranged in columns and the points actually being negative and positive arranged into rows. Once sorted the true positive, false positive, true negative and false negatives can be calculated as entries in each part of the matrix. A number of statistics can then be calculated including <strong>sensitivity</strong>, <strong>specificity</strong> (coming from the sample as opposed to the underlying test), and the <strong>precision</strong>. Others include the <a href="https://en.wikipedia.org/wiki/Positive_and_negative_predictive_values">negative predictive value</a>, the accuracy (proportion of points correctly classified) and the <a href="https://en.wikipedia.org/wiki/F1_score">F1 score</a>.</p>
<h2 id="the-receiver-operator-characteristic">The receiver operator characteristic</h2>
<p>For some purposes a higher number of false positives or false negatives can be tolerated. Often a binary classifier will report a continuous value instead of true positive or negative (such as a probability or a concentration of a species in a blood test). In order to classify each point as being negative or positive, we are free to set a threshold on these values wherever we like. For example, if there is a high cost with missing a positive point then we might set this threshold low, so anything above it is classified as positive. In general, we want to be able to determine how well a classifier is performing over a range of these thresholds. This can be done using the <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">Receiver Operating Characteristic</a> (ROC) curve.</p>
<p>The diagram below demonstrates the ROC for varying classifiers. We can imagine that a classifier assigns a value to each point. If the point is positive this value is determined by a distribution (we’re using a normal distribution here, but the exact shape of the distribution doesn’t matter) and if the point is negative, the classifier value is determined by a different distribution. The sliders below can change the mean and variance of the positive distribution.</p>
<hr />
<div id="main-roc" class="row"></div>
<hr />
<p>The diagram on the left shows how the classifier translates each point into their corresponding values depending on whether they’re positive or negative. The diagram on the right maps out for each possible threshold, the corresponding false positive rate (1 - specificity) and true positive rate (sensitivity). By experimenting, you can see that a good classifier can maximize the true positive rate, whilst minimizing the false positive rate. The area underneath the whole curve (AUC) then summarizes how well the classifier performs over the whole range of possible thresholds. It turns out that, if you pick at random a positive point and a negative point, then the probability the value of the positive point is higher than the negative point is equal to the AUC (see below for more of an explanation).</p>
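The interpretation of the AUC as P(Y &gt; X) suggests a direct way to estimate it: compare every positive point's value against every negative point's. This is a sketch; libraries such as scikit-learn compute the same quantity more efficiently from the sorted ROC.

```python
import numpy as np

def empirical_auc(pos, neg):
    """AUC as the probability that a randomly chosen positive point's
    value exceeds a randomly chosen negative point's (ties count half)."""
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```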
<p>You can also use the diagram above to explore how the shape of the ROC relates to
where the classifier may be failing. For example, if the ROC curve dips below the
diagonal at any point, this indicates that around that threshold negative cases
are being assigned higher values than positive cases.</p>
<h2 id="wrapping-up">Wrapping up</h2>
<p>There is a plethora of statistics for binary classifiers. Each has its subtleties, and they can be used in conjunction to diagnose issues and assess how well a classifier is performing.</p>
<h3 id="understanding-the-math-behind-the-auc">Understanding the math behind the AUC</h3>
<p>I mentioned briefly above that the probability of a positive case’s value being greater than a negative case’s is equal to the AUC. To see this, we can integrate over the ROC curve, writing the true positive rate $tpr$ and the false positive rate $fpr$ at threshold value $x$ in terms of the probability densities $f_0$ for a negative point and $f_1$ for a positive point,</p>
<p>\(\begin{align} AUC &= \int_0^1 tpr \, d(fpr), \\ &= \int_{-\infty}^\infty tpr(x) f_0(x) \, dx, \\ &= \int_{-\infty}^\infty \int_{x}^\infty f_1(y) \, dy \, f_0(x) \, dx, \\ &= \int_{-\infty}^\infty \int_{-\infty}^\infty I(y > x) f_1(y) f_0(x) \, dx \, dy, \\ &= P(Y > X). \end{align}\)</p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>All the interactive examples were coded in <a href="https://d3js.org">d3</a>, which has many fantastic <a href="https://bl.ocks.org">examples</a> for
data visualization.</p>
<!-- jQuery -->
<script src="https://sempwn.github.io/js/jquery.min.js"></script>
<!-- Plugin JavaScript -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery-easing/1.3/jquery.easing.min.js"></script>
<!-- d3 js v4 -->
<script src="https://d3js.org/d3.v4.min.js"></script>
<!-- numeric -->
<script src="https://sempwn.github.io/js/numeric.js"></script>
<!-- Plotly.js -->
<script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
<!-- slider -->
<script src="https://sempwn.github.io/js/bootstrap-slider.js"></script>
<!-- main js -->
<script src="https://sempwn.github.io/js/roc.js"></script>
<script src="https://sempwn.github.io/js/test_treat.js"></script>
<script src="https://sempwn.github.io/js/true_false_chart.js"></script>
<h1>Probabilistic programming 4: Markov Chain Monte Carlo</h1>
<p>2018-01-11, <a href="https://sempwn.github.io/blog/2018/01/11/mcmc-4">https://sempwn.github.io/blog/2018/01/11/mcmc-4</a></p>
<div id="main-div">
</div>
<div id="button-group"></div>
<h2 id="introduction">Introduction</h2>
<p>At the heart of practical Bayesian inference is Monte Carlo sampling. For a background
on this see the previous blog posts: <a href="https://sempwn.github.io/">Monte Carlo Method</a>, <a href="https://sempwn.github.io/blog/2017/06/04/mcmc-2">Markov Chains</a> and <a href="https://sempwn.github.io/blog/2017/07/04/mcmc-3">Bayesian inference</a>. It can often be hard to see
exactly what a sampler is doing, which makes it difficult to diagnose problems.
The above tool uses several example two-dimensional posterior distributions to demonstrate
how issues such as strong dependency between two parameters or multi-modality can
lead to poor performance of certain samplers and how to counteract this.</p>
<p>The walker (black circle) moves around the probability landscape producing random samples that are recorded
by the two histograms along the axes. The contours and colours represent the probability
surface. At each step, the walker will propose a new location (red circle) either independent
of the landscape (Metropolis-Hastings) or with some dependency on it (Slice and Hamiltonian).
The walker may then accept the proposed location and the new position is recorded.
Further descriptions for each of the samplers can be found below.</p>
<h3 data-toggle="collapse" data-target="#metropolis-hastings" class="clickable panel panel-default panel-heading" style="cursor: pointer;">Metropolis-Hastings</h3>
<div id="metropolis-hastings" class="accordion collapse">
<p>
<a href="https://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm">Metropolis-Hastings</a> is one of the more intuitive sampling algorithms.
The idea is to perform a random walk and selectively move to a new position
dependent on whether the new position has a higher probability compared to the
current position. If you only moved when the probability was higher, the random
walker would quickly end up in a local maximum and would no longer move. This means
in order to sample the region properly you would occasionally want to move to an area
of lower probability. At each step the walker draws a potential new position
with some distribution around the current position (two-dimensional normal in this case).
The probability of the potential new position is then compared to the probability of the
current position: if it is higher then the walker moves; otherwise the walker moves
with a probability dependent on the ratio of the new position's probability to the old's. So if the new
position has only a slightly lower probability than the current one it is more likely to
be accepted than if it has a far lower probability.
</p>
<p>
There can be many aspects that need fine-tuning for Metropolis-Hastings. One of
the key aspects is how far on average the random walker steps at each iteration
(step-size). You can adjust the step-size in the above tool. Notice that when the
step-size is too small, the walker poorly explores landscape and it would take
many iterations to build up sufficient samples. However, if the step-size is too
large then the walker's proposed positions are rarely accepted and it can become stuck.
The skewed distribution shows an extreme case of this, where there is only one
direction for the walker to move in with comparable probability. There exists a
sweet spot for the step-size, however this can be problem-dependent and it may not
be clear what is optimal.
</p>
</div>
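A random-walk Metropolis-Hastings sampler of the kind described above fits in a few lines of Python. This is a sketch, not the interactive tool's d3 code; `log_p` stands for whatever log-probability surface you want to sample.

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples=20000, step=1.0, rng=None):
    """Random-walk Metropolis sampler with a Gaussian proposal."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_samples):
        proposal = x + step * rng.standard_normal(x.shape)
        # accept with probability min(1, p(proposal) / p(x))
        if np.log(rng.uniform()) < log_p(proposal) - log_p(x):
            x = proposal
        samples.append(x.copy())
    return np.array(samples)
```

Try varying `step` against a simple target (such as a standard normal) to see the small-step and large-step failure modes described above.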
<h3 data-toggle="collapse" data-target="#slice-sampling" class="clickable panel panel-default panel-heading" style="cursor: pointer;">Slice Sampling</h3>
<div id="slice-sampling" class="accordion collapse">
<p>
<a href="https://en.wikipedia.org/wiki/Slice_sampling">Slice sampling</a> tries to take a more global approach to
sampling than with Metropolis-Hastings. There are two parts to generating a new
sample, an expansion and contraction phase.
</p>
<p>
In the expansion phase, units of a given
step-size are taken around the current position of the walker and added to an interval (you can control the step-size with the slider above). These step-sizes
are added until the interval contains points with a smaller probability than the current position.
</p>
<p>
In the contraction phase points are uniformly sampled from the constructed interval and the interval is cut
at the point if its probability is less than the current position of the walker.
Finally a point is randomly sampled from the interval and the walker is updated.
</p>
<p>
You'll notice that slice sampling is much more efficient when the surface is multimodal.
It still struggles with the bimodal surface as parameters are being updated independently.
For the skew distribution slice sampling is also fairly inefficient as it has to take many steps
to traverse the surface.
</p>
</div>
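The expansion ("step out") and contraction ("shrink") phases can be sketched in one dimension as follows. This is a sketch of the standard scheme, not the tool's code; as noted above, the tool applies such updates to each parameter of the two-dimensional surface in turn.

```python
import numpy as np

def slice_sample_1d(log_p, x0, n_samples=4000, w=2.0, rng=None):
    """One-dimensional slice sampler with step-out and shrinkage."""
    rng = rng or np.random.default_rng(0)
    x = float(x0)
    samples = []
    for _ in range(n_samples):
        log_y = log_p(x) + np.log(rng.uniform())      # height of the slice
        # expansion: widen the interval until its ends fall below the slice
        left = x - w * rng.uniform()
        right = left + w
        while log_p(left) > log_y:
            left -= w
        while log_p(right) > log_y:
            right += w
        # contraction: sample from the interval, cutting it at rejected points
        while True:
            x_new = rng.uniform(left, right)
            if log_p(x_new) > log_y:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        samples.append(x)
    return np.array(samples)
```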
<h3 data-toggle="collapse" data-target="#HMC" class="clickable panel panel-default panel-heading" style="cursor: pointer;">Hamiltonian Monte Carlo</h3>
<div id="HMC" class="accordion collapse">
<p>
Hamiltonian Monte Carlo (HMC) uses the gradient of the probability surface
to provide a more efficient sampling scheme. The idea is to imagine the walker
as being a massive object in a potential landscape (imagine a ball rolling down a hill).
In the beginning the walker is given a "kick" with random strength and in a random direction (the step-size slider controls how large the kick is).
The walker then uses this momentum to travel through the potential landscape for a set number of time-steps. The walker is then updated with the same rule as for Metropolis-Hastings, i.e. if the new probability is higher then accept; otherwise accept randomly with a probability dependent on the ratio of the new probability to the old.
</p>
<p>
You'll observe that HMC is far more efficient than the other methods for sampling the skew distribution and
works well with the other unimodal distributions. For the multimodal distributions you still need to provide a sufficient kick in order for the walker to jump between the regions of high probability.
</p>
</div>
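The kick-then-travel scheme described above can be sketched as follows, using leapfrog integration for the travel phase and the Metropolis rule for the final accept step. This is a sketch under the assumption that `log_p` and its gradient `grad_log_p` are supplied by the caller.

```python
import numpy as np

def hmc(log_p, grad_log_p, x0, n_samples=3000, step=0.2, n_leapfrog=10, rng=None):
    """Hamiltonian Monte Carlo with leapfrog integration."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_samples):
        p = rng.standard_normal(x.shape)              # the random "kick"
        x_new, p_new = x.copy(), p.copy()
        p_new += 0.5 * step * grad_log_p(x_new)       # leapfrog: half momentum step,
        for _ in range(n_leapfrog):                   # then alternating full steps
            x_new += step * p_new
            p_new += step * grad_log_p(x_new)
        p_new -= 0.5 * step * grad_log_p(x_new)       # turn the last step into a half step
        # Metropolis accept/reject on the change in total "energy"
        h_old = -log_p(x) + 0.5 * (p @ p)
        h_new = -log_p(x_new) + 0.5 * (p_new @ p_new)
        if np.log(rng.uniform()) < h_old - h_new:
            x = x_new
        samples.append(x.copy())
    return np.array(samples)
```

The product `step * n_leapfrog` controls how far each "kick" carries the walker, which is what the step-size slider varies in the tool above.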
<h3 id="conclusion">Conclusion</h3>
<p>It can be tempting to try and find a single sampler that suits all purposes.
Whilst some are able to perform sampling well in most cases (e.g. <a href="https://arxiv.org/abs/1111.4246">NUTS</a>) there
will still be cases where they may fall down (becoming stuck, or inefficiently
sampling the posterior). It is important to note that the examples given are
for two-dimensional cases only and typically modern Bayesian inference will
use many more dimensions, where other problems can start to creep in. Hopefully
the above tool sheds some light on the issues around sampling schemes used in
Bayesian inference.</p>
<h3 id="acknowledgements">Acknowledgements</h3>
<p>All the interactive examples were coded in <a href="https://d3js.org">d3</a>, which has many fantastic <a href="https://bl.ocks.org">examples</a> for
data visualization.</p>
<p>This post was broadly inspired by this <a href="http://www.benfrederickson.com/numerical-optimization/">blog post</a> on numerical optimization and
this <a href="http://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.42620&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false">tool</a> on exploring training of neural nets.</p>
<script src="https://d3js.org/d3.v4.min.js"></script>
<script src="https://d3js.org/d3-contour.v1.min.js"></script>
<script src="https://d3js.org/d3-scale-chromatic.v1.min.js"></script>
<!-- mcmc4 -->
<script src="https://sempwn.github.io/js/mcmc4.js"></script>
<h1>When zombies attack: The infectious disease modelling app</h1>
<p>2017-11-02, <a href="https://sempwn.github.io/blog/2017/11/02/when-zombies-attack">https://sempwn.github.io/blog/2017/11/02/when-zombies-attack</a></p>
<style>
.hovereffect {
width: 100%;
height: 100%;
float: left;
overflow: hidden;
position: relative;
text-align: center;
cursor: default;
}
.hovereffect .overlay {
position: absolute;
overflow: hidden;
width: 80%;
height: 80%;
left: 10%;
top: 10%;
border-bottom: 1px solid #FFF;
border-top: 1px solid #FFF;
-webkit-transition: opacity 0.35s, -webkit-transform 0.35s;
transition: opacity 0.35s, transform 0.35s;
-webkit-transform: scale(0,1);
-ms-transform: scale(0,1);
transform: scale(0,1);
}
.hovereffect:hover .overlay {
opacity: 1;
filter: alpha(opacity=100);
-webkit-transform: scale(1);
-ms-transform: scale(1);
transform: scale(1);
}
.hovereffect img {
display: block;
position: relative;
-webkit-transition: all 0.35s;
transition: all 0.35s;
}
.hovereffect:hover img {
filter: url('data:image/svg+xml;charset=utf-8,<svg xmlns="http://www.w3.org/2000/svg"><filter id="filter"><feComponentTransfer color-interpolation-filters="sRGB"><feFuncR type="linear" slope="0.6" /><feFuncG type="linear" slope="0.6" /><feFuncB type="linear" slope="0.6" /></feComponentTransfer></filter></svg>#filter');
filter: brightness(0.6);
-webkit-filter: brightness(0.6);
}
.hovereffect h2 {
text-transform: uppercase;
text-align: center;
position: relative;
font-size: 17px;
background-color: transparent;
color: #FFF;
padding: 1em 0;
opacity: 0;
filter: alpha(opacity=0);
-webkit-transition: opacity 0.35s, -webkit-transform 0.35s;
transition: opacity 0.35s, transform 0.35s;
-webkit-transform: translate3d(0,-100%,0);
transform: translate3d(0,-100%,0);
}
.hovereffect a, .hovereffect p {
color: #FFF;
padding: 1em 0;
opacity: 0;
filter: alpha(opacity=0);
-webkit-transition: opacity 0.35s, -webkit-transform 0.35s;
transition: opacity 0.35s, transform 0.35s;
-webkit-transform: translate3d(0,100%,0);
transform: translate3d(0,100%,0);
}
.hovereffect:hover a, .hovereffect:hover p, .hovereffect:hover h2 {
opacity: 1;
filter: alpha(opacity=100);
-webkit-transform: translate3d(0,0,0);
transform: translate3d(0,0,0);
}
.hovereffect {
cursor: pointer;
}
</style>
<p>I recently participated in a science outreach event where I developed a
web app to introduce infectious disease modelling to the general public. The
main idea is that there has been a recent zombie outbreak and, given some data
and models, the participant is asked to investigate various hypotheses on
its transmission, estimate the transmissibility, and finally simulate what types
of interventions would be required for its control.</p>
<p><a href="https://sempwn.github.io/zombie-game/">Here</a> is the link to the app, or you can
navigate to a section of the app from the images below.</p>
<div class="col-lg-4 col-md-4 col-sm-4 col-xs-6 clickableDiv" data-href="https://sempwn.github.io/zombie-game/">
<div class="hovereffect">
<img class="img-responsive" src="https://sempwn.github.io/img/zombie_outbreak/branch.png" alt="instructions" />
<div class="overlay">
<h2>Introduction</h2>
<p>
<a href="https://sempwn.github.io/zombie-game/">Click here</a>
</p>
</div>
</div>
</div>
<div class="col-lg-4 col-md-4 col-sm-4 col-xs-6 clickableDiv" data-href="https://sempwn.github.io/zombie-game/fitting.html">
<div class="hovereffect">
<img class="img-responsive" src="https://sempwn.github.io/img/zombie_outbreak/curve.png" alt="estimation" />
<div class="overlay">
<h2>Model fitting</h2>
<p>
<a href="https://sempwn.github.io/zombie-game/fitting.html">Click here</a>
</p>
</div>
</div>
</div>
<div class="col-lg-4 col-md-4 col-sm-4 col-xs-6 clickableDiv" data-href="https://sempwn.github.io/zombie-game/simulation.html">
<div class="hovereffect">
<img class="img-responsive" src="https://sempwn.github.io/img/zombie_outbreak/map.png" alt="simulation" />
<div class="overlay">
<h2>Simulation</h2>
<p>
<a href="https://sempwn.github.io/zombie-game/simulation.html">Click here</a>
</p>
</div>
</div>
</div>
<!-- jQuery -->
<script src="https://sempwn.github.io/js/jquery.min.js"></script>
<script>
$(".clickableDiv").click(function() {
window.location = $(this).data("href");
return false;
});
</script>
<h2><a href="https://sempwn.github.io/blog/2017/09/05/hidden-populations">Measuring hidden populations</a> (2017-09-05)</h2>
<style>
.bar rect {
fill: steelblue;
}
.bar text {
fill: #fff;
font: 10px sans-serif;
}
</style>
<figure class="figure">
<img class="center-block img-responsive" src="https://sempwn.github.io/img/hidden_pop/cat.jpg" alt="complex models" />
<figcaption class="figure-caption text-center">
<a href="https://www.flickr.com/photos/83613432@N02/8257427196/in/photolist-dzFtqq-9TJEtN-S9iPAL-fBxxcz-4dr7ba-VF1XSV-2wabB-8yhtUw-rqfwx-5LWXVp-dXMfvk-huuB91-qZD7X9-dEFtkj-8uYCTR-6cdS9r-9BJArW-obqCLX-nRvggo-gWefq-aohS4v-9GSk2N-bmEZdU-9Mt6Yw-A8Ush-jZwHdv-5T3vWi-rwgu8V-qHYvyr-bZMUa1-SovhJw-frbsN-4Fy9rU-gWeey-QJp1sp-PT4pw-rdr7va-6Xgfjr-fc6Bg7-5pe6hY-scNmY-4TzqwM-cg2brA-6Csnwt-5m47cs-nwZRvp-4XZL1G-6oof9c-h6T5GF-hUNpb">Cat hiding.</a>
</figcaption>
</figure>
<p>Often the data that comes to us is only partially observed: website users out of
all potential customers, patients displaying symptoms out of a total population
of infected individuals, the number of animals observed in a study etc. One of
the challenges with this type of data is to be able to determine the size and
characteristics of the total population based on our (<a href="https://en.wikipedia.org/wiki/Sampling_bias#Historical_examples">hopefully unbiased</a>)
sample.</p>
<p>How can we determine the total population size from the type of data we observe? We’ll
look at two examples here: zero-truncated count data and mark-recapture
data. Zero-truncated data comes about when we observe the same individuals multiple
times (but obviously don’t measure the individuals we don’t see). For example
from a <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4011782/">health registry</a>,
<a href="http://onlinelibrary.wiley.com/doi/10.1111/1467-9574.00232/abstract">police arrest records</a>
or <a href="http://www.personal.soton.ac.uk/dab1f10/jrssc.pdf">number of non-clonal tomato plants</a>.</p>
<p>Mark-recapture data is slightly different. Here there are two or more distinct phases
or surveys in the observations and the number detected in each survey plus the
overlap is used to estimate the whole population. Traditionally these techniques
are used in ecology, where an animal is marked or tagged and released and then
recaptured at a later date (e.g. measuring a population of <a href="http://onlinelibrary.wiley.com/doi/10.1890/0012-9658%282006%2987%5B2925:ATPDUP%5D2.0.CO;2/full">tigers</a>).
These techniques are also used where individuals may be captured by different data
sources, such as in estimating prevalence of <a href="http://jech.bmj.com/content/58/9/766.short">injection drug use</a>.</p>
<p>Let’s explore these concepts with a simulation. Below is our “field” with
individuals of a species (circles) moving around randomly. In the centre is
a larger circle denoting our field of view. If an individual wanders into our
field of view then we measure it (denoted by a change of colour). However, we
can’t directly observe the individuals outside the field of view who haven’t been
measured already. You can see that it would take a long time for all individuals
to wander into the field of view making it impractical to view all individuals
this way.</p>
<div id="demo"></div>
<h3 id="zero-truncated-poisson">Zero-truncated Poisson</h3>
<p>First we can imagine that we record the number of times each individual wanders
into our field of view. We can use these statistics to build up an estimate of
the total population by making some assumptions around how these counts are distributed.
Assuming there’s a small, but constant probability of any individual wandering into
the field of view at any time leads to a Poisson distribution of counts i.e. the probability
of a randomly sampled (from the entire population) individual’s count being $x$ is</p>
<h3 id="px--x--frace-lambdalambdaxx">\(P(X = x) = \frac{e^{-\lambda}\lambda^x}{x!}.\)</h3>
<p>However, we don’t observe the individuals with zero counts so our data is truncated.
The probability of not being observed ($p_0$) is $e^{-\lambda}$.
Therefore the probability of a random variate $X$ being observed $x$ times is</p>
<h3 id="px--x--x--0--fracpxxpx0---fraclambdaxx-frace-lambda1-e-lambda">\(P(X = x | X > 0) = \frac{P(X=x)}{P(X>0)} = \frac{\lambda^x}{x!} \frac{e^{-\lambda}}{1-e^{-\lambda}}\)</h3>
<p>For our data ${x_0,\ldots,x_{n-1} }$, the associated log-likelihood is</p>
<h3 id="-nlambda--nlog1-e-lambda---lambda-sum_i0n-1logx_i--sum_i0n-1logx_i">\(-n\lambda -n\log(1-e^{-\lambda}) + \log(\lambda) \sum_{i=0}^{n-1}x_i -\sum_{i=0}^{n-1}\log(x_i!)\)</h3>
<p>Using the zero-truncated Poisson model, we can estimate the odds of not being
observed against being observed, $p_0/(1-p_0)$. Empirically, this is the number
of individuals who weren’t observed ($f_0$) divided by the number
that were observed ($N_{obs}$). Combining these together we get the following,
<h3 id="hatf_0--fracp_01-p_0n_obs">\(\hat{f_0} = \frac{p_0}{1-p_0}N_{obs}.\)</h3>
<p>So now all we need is an estimate of $\lambda$, which we can obtain by maximizing the
likelihood. The tool below shows this in action; note that for this estimator
to work you need to have observed at least one individual twice or more.
Note also that the estimator doesn’t depend on the size of the capture circle, as
this is incorporated into the rate $\lambda$.</p>
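As an illustration (separate from the d3 tool below), here is a minimal Python sketch of this estimator: fit $\lambda$ by maximizing the zero-truncated log-likelihood over a simple grid, then inflate the observed count by the estimated odds $p_0/(1-p_0)$. The simulated population size (500) and observation rate (0.8) are arbitrary choices for the example.

```python
import math
import random

def zt_loglik(lam, counts):
    """Zero-truncated Poisson log-likelihood for rate lam."""
    n = len(counts)
    return (-n * lam - n * math.log(1.0 - math.exp(-lam))
            + math.log(lam) * sum(counts)
            - sum(math.lgamma(x + 1) for x in counts))

def fit_lambda(counts):
    """Maximize the likelihood over a simple grid of candidate rates."""
    grid = [0.01 * i for i in range(1, 1000)]
    return max(grid, key=lambda lam: zt_loglik(lam, counts))

def estimate_total(counts):
    """Observed individuals plus the estimated number unobserved, f0."""
    lam = fit_lambda(counts)
    p0 = math.exp(-lam)
    n_obs = len(counts)
    return n_obs + (p0 / (1.0 - p0)) * n_obs

def poisson_draw(lam, rng):
    """Poisson sample via Knuth's method (fine for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

# Simulate 500 individuals each observed Poisson(0.8) times; we only
# record those seen at least once (the zero counts are truncated).
rng = random.Random(1)
counts = [c for c in (poisson_draw(0.8, rng) for _ in range(500)) if c > 0]
print(round(estimate_total(counts)))  # should land reasonably close to 500
```

A grid search keeps the sketch dependency-free; in practice you would use a numerical optimizer or a package that implements this estimator directly.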
<div id="sim"></div>
<h3 id="mark-recapture">Mark recapture</h3>
<p>We now look at a slightly different type of study where there are two distinct phases
of measuring a population. In an ecological study this could be where an initial
survey is conducted to measure the population size of a particular species and
each individual is tagged. At a later stage the population is measured again and the
number of individuals as well as the number of tagged individuals are recorded.</p>
<p>Let’s set-up some notation for the problem: $N$ is the unknown total population size,
$n$ is the number of individuals marked in the first survey, $K$ is the number of individuals
observed in the second survey and $k$ is the number of individuals observed in the
second survey who were tagged in the first survey.</p>
<p>A simple estimator of the population size is known as the <a href="https://en.wikipedia.org/wiki/Lincoln_index">Lincoln estimator</a>
(although why it’s called this is a slight <a href="http://bit-player.org/2010/the-thrill-of-the-chase">mystery</a> and
is probably an example of <a href="https://en.wikipedia.org/wiki/Stigler%27s_law_of_eponymy">Stigler’s law of eponymy</a>).
This estimator assumes that the probability of being observed in the second
survey is the same as in the first survey. The proportion of the population marked
in the first survey is $n/N$, and the proportion of the second survey found to be marked is
$k/K$. If the second survey is a representative sample, we can equate these and get the
following estimator,
<h3 id="hatn--fracnkk">\(\hat{N} = \frac{nK}{k}.\)</h3>
<p>It turns out this only performs well for large sample sizes. For smaller sample
sizes we can use the Chapman estimator:</p>
<h3 id="hatn--fracn1k1k1---1">\(\hat{N} = \frac{(n+1)(K+1)}{k+1} - 1.\)</h3>
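Both estimators are one-liners; a quick sketch with hypothetical survey numbers:

```python
def lincoln(n, K, k):
    """Lincoln estimator N = nK/k; requires at least one recapture (k > 0)."""
    return n * K / k

def chapman(n, K, k):
    """Chapman estimator; less biased for small samples, defined even when k = 0."""
    return (n + 1) * (K + 1) / (k + 1) - 1

# Hypothetical surveys: 40 marked, 50 caught the second time, 10 of them marked.
print(lincoln(40, 50, 10))  # 200.0
print(chapman(40, 50, 10))  # ≈ 189.09
```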
<p>Use the tool below to explore these estimators with our simulation.</p>
<div id="sim2"></div>
<p>You can see under what conditions the Chapman estimator outperforms the Lincoln
estimator. Note, however, that neither gives us an estimate of its own uncertainty.
If we wanted to actually put this into practice then we might consider something
like the R package <a href="https://cran.r-project.org/web/packages/multimark/index.html">multimark</a>
or in Python using <a href="http://docs.pymc.io">PyMC</a> (e.g. this <a href="https://github.com/pymc-devs/pymc/wiki/Mt">example</a>).</p>
<h3 id="acknowledgements">Acknowledgements</h3>
<p>All the interactive examples were coded in <a href="https://d3js.org">d3</a>, which has many fantastic <a href="https://bl.ocks.org">examples</a> for
data visualization.</p>
<!-- jQuery -->
<script src="https://sempwn.github.io/js/jquery.min.js"></script>
<!-- Plugin JavaScript -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery-easing/1.3/jquery.easing.min.js"></script>
<!-- d3 js v4 -->
<script src="https://d3js.org/d3.v4.min.js"></script>
<!-- numeric -->
<script src="https://sempwn.github.io/js/numeric.js"></script>
<!-- Plotly.js -->
<script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
<!-- slider -->
<script src="https://sempwn.github.io/js/bootstrap-slider.js"></script>
<!-- venn -->
<script src="https://sempwn.github.io/js/venn.js"></script>
<!-- main js -->
<script src="https://sempwn.github.io/js/capture.js"></script>
<h2><a href="https://sempwn.github.io/blog/2017/07/04/mcmc-3">Probabilistic programming 3: Bayesian probability</a> (2017-07-04)</h2>
<h3 id="introduction">Introduction</h3>
<div class="text-center" id="venn"></div>
<p>This is part three in a series on probabilistic programming. <a href="https://sempwn.github.io/blog/2017/04/20/mcmc">Part one</a> introduces
Monte Carlo simulation and <a href="https://sempwn.github.io/blog/2017/06/04/mcmc-2">part two</a> introduces the concept of the Markov chain.
In this post I’ll introduce the concept of Bayes rule, which
is the main machinery at the heart of Bayesian inference.</p>
<p>The diagram above represents the probabilities of two events: A and B. A could be testing
positive for an infection and B could be actually having the infection. These events are
clearly not independent, so we need some way of establishing their relationship, as
the probability of observing one depends on observing the other.
How can we quantify this? One way is to look at the intersection of both events
or in set notation $A \cap B$. We call the corresponding probability of both events
happening $P(A,B)$. What if we already know that an event has occurred? Say we know event
$A$ already, what would the probability of $B$ be given we know $A$ is true? Looking at the above
diagram we see this would be the area of the intersection divided by the total area of $A$.
We can represent this formula as $P(B | A) = P(A,B)/P(A)$, which can be read as “the probability
of B given A is the probability of A and B divided by the probability of A”.</p>
<p>What if we now wanted to know what the probability of A is given B? Well, we can just swap
the symbols around in the previous formula to get $P(A | B) = P(A,B)/P(B)$. You’ll notice
that we can now define the probability of A and B based on conditional probabilities,
$P(A|B)P(B)$ or $P(B|A)P(A)$. If we equate these two and re-arrange we get the following:</p>
<h2 id="pab--fracpbapapb">\(P(A|B) = \frac{P(B|A)P(A)}{P(B)}\)</h2>
<p>This is known as Bayes’ rule and is the entire basis of Bayesian statistics. Its
power comes from the ability to take the probability of B given A
and invert the relationship to give the probability of A given B.</p>
<h3 id="sensitivity-and-specificity">Sensitivity and specificity</h3>
<p>One application of this rule is in DNA or disease testing. We can discover the
probability of testing positive given that an individual actually has the disease
through repeated measurements of a given test. However, what we’d really want to
know is whether someone actually has a disease given that they tested positive. Test accuracy
can be described in terms of sensitivity and specificity. Sensitivity is the probability
of testing positive given a positive case and specificity is the probability of testing negative
given a negative case. We can invert this relationship using Bayes rule:</p>
<h3 id="pve--ve-test--fracptextve-test--textve-ptextveptextve-test">\(P(+ve | +ve test) = \frac{P(\text{+ve test} | \text{+ve}) P(\text{+ve})}{P(\text{+ve test})}.\)</h3>
<p>There’s a trick here where we can define the probability of a positive test by considering all the
associated conditional probabilities (known as the <a href="https://en.wikipedia.org/wiki/Law_of_total_probability">law of total probability</a>),</p>
<h3 id="ptextve-test--ptextve-test--textveptextve--ptextve-test--text-veptext-ve">\(P(\text{+ve test}) = P(\text{+ve test} | \text{+ve})P(\text{+ve}) + P(\text{+ve test} | \text{-ve})P(\text{-ve}).\)</h3>
<p>So it turns out in order to find the probability of being positive given a positive test we need to know what the underlying
probability of actually being positive is (known as the base rate). We can see what the consequences of this are in the interactive diagram below where we imagine that 1000 people have been tested and we also know their disease status. Try playing with the base rate, sensitivity and specificity.</p>
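The calculation behind the diagram can be sketched directly in Python (the 99%/1% numbers are the illustrative values discussed below):

```python
def prob_pos_given_pos_test(sensitivity, specificity, base_rate):
    """P(+ve | +ve test) by Bayes' rule, with the denominator expanded
    using the law of total probability."""
    p_pos_test = (sensitivity * base_rate
                  + (1.0 - specificity) * (1.0 - base_rate))
    return sensitivity * base_rate / p_pos_test

# 99% sensitivity and specificity, but a 1% base rate:
print(round(prob_pos_given_pos_test(0.99, 0.99, 0.01), 3))  # 0.5
```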
<div class="form-group text-center">
<label for="sensitivity">sensitivity</label>
<input id="sensitivity" data-slider-id="sensitivity" type="text" data-slider-min="0.01" data-slider-max="0.99" data-slider-step="0.01" data-slider-value="0.5" />
<div id="sensitivity-value" data-value="0.5"></div>
<label for="specificity">specificity</label>
<input id="specificity" data-slider-id="specificity" type="text" data-slider-min="0.01" data-slider-max="0.99" data-slider-step="0.01" data-slider-value="0.5" />
<div id="specificity-value" data-value="0.5"></div>
<label for="base-rate">base rate</label>
<input id="base-rate" data-slider-id="base-rate" type="text" data-slider-min="0.01" data-slider-max="0.99" data-slider-step="0.01" data-slider-value="0.1" />
<div id="base-rate-value" data-value="0.1"></div>
</div>
<div id="sankey"></div>
<p>If the base rate is low (1%), then even with a 99% sensitivity and specificity
the probability of actually being positive if you tested positive is approximately
50%. This is an example of the <a href="https://en.wikipedia.org/wiki/Prosecutor%27s_fallacy">prosecutor’s fallacy</a> and shows
how important it is to consider the base rate of whatever you’re testing for.</p>
<p>This type of reasoning can be applied to a diverse set of problems. If you know the probability that someone has a
certain gene given they have developed cancer, you can compute the probability that someone will develop cancer given they have that gene. In a spam filter, you can calculate the probability that an email is spam given it contains a set of keywords, in terms of the probability of an email containing those keywords given it’s spam.</p>
<h2 id="causal-belief">Causal belief</h2>
<figure class="figure">
<img class="center-block img-responsive" src="https://sempwn.github.io/img/mcmc3/umbrella.jpg" alt="complex models" />
<figcaption class="figure-caption text-center">
<a href="https://www.flickr.com/photos/kurotango/19272216640/in/photolist-vn29zf-ToVFYW-d1n8Zj-CPZ4oQ-BgWzTH-p1rMAp-fPty97-P7fWzg-pehe6o-vWgFGN-b7ins6-owxUw2-61W5d-dS5daw-2ApBw-diMjmG-spo8Yk-9eKrdT-dRgRA4-5NSwAh-eXR1wH-8WTuiQ-6ou6jp-qffzKr-pg845T-jz2te8-aPFfjD-dgDk1m-oDpMpZ-8DfaZa-HCSiNr-JYErZD-e7T7Ap-aay1w3-6u3sML-5PCfi-zyPoRy-8Vtbyt-6bCsrJ-eaFCJv-dZ84H1-ecrVVi-9qjn36-8WF7YL-bTBxTp-4xB5aU-xjdmG-2JT8v-7TsRtA-74TUgA">Umbrella.</a>
</figcaption>
</figure>
<p>We don’t have to stop with one event dependent on another. We could also consider
the impact of multiple events on one another in terms of their probabilities. This
generalization is called <a href="https://en.wikipedia.org/wiki/Bayesian_network">Bayesian network theory</a>.</p>
<p>Let’s consider the example where we’re deciding whether to take an umbrella with us or not based
on the weather forecast. Causally, this looks like the following diagram:</p>
<figure class="figure">
<img class="center-block img-responsive" src="https://sempwn.github.io/img/mcmc3/diagram.png" alt="diagram" />
<figcaption class="figure-caption text-center">
Causal Bayesian diagram. The probability of rain is dependent on it being
forecast, and the probability of using an umbrella is dependent on it
raining.
</figcaption>
</figure>
<p>Here we’re modelling the conditional dependencies of three events: whether
rain is forecast, whether it’s actually raining and whether you end up using
an umbrella. Bayes rule gives us a way of inverting this dependency, by for example
considering the probability of using an umbrella given rain was forecast. If
A = rain forecast, B = raining and C = using an umbrella then the joint probability
(the probability of all three events occurring) can be written as</p>
<h3 id="pabc--papbapcb">\(P(A,B,C) = P(A)P(B|A)P(C|B)\)</h3>
<p>You can see how changes in these probabilities impact the conditional dependencies
in the tool below.</p>
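The factorization above makes the whole network easy to compute by brute force. Here is a small sketch with hypothetical conditional probabilities: we build the joint $P(A,B,C) = P(A)P(B|A)P(C|B)$ and then answer a query like $P(C|A)$ by marginalizing over $B$.

```python
# Hypothetical probabilities: A = rain forecast, B = raining, C = take umbrella.
p_forecast = 0.5                      # P(A)
p_rain = {True: 0.8, False: 0.2}      # P(B = raining | A = forecast?)
p_umbrella = {True: 0.9, False: 0.1}  # P(C = umbrella | B = raining?)

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(A) P(B|A) P(C|B)."""
    pa = p_forecast if a else 1.0 - p_forecast
    pb = p_rain[a] if b else 1.0 - p_rain[a]
    pc = p_umbrella[b] if c else 1.0 - p_umbrella[b]
    return pa * pb * pc

# P(umbrella | forecast) by marginalizing over whether it actually rains:
num = sum(joint(True, b, True) for b in (True, False))
den = sum(joint(True, b, c) for b in (True, False) for c in (True, False))
print(round(num / den, 2))  # 0.8*0.9 + 0.2*0.1 = 0.74
```

Brute-force enumeration is exponential in the number of events; real Bayesian network libraries use smarter inference, but for three events this is all we need.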
<div class="form-group text-center">
<label for="prob-rain">Forecasted probability of rain</label>
<input id="prob-rain" data-slider-id="prob-rain" type="text" data-slider-min="0.01" data-slider-max="0.99" data-slider-step="0.01" data-slider-value="0.5" />
<div id="prob-rain-value" data-value="0.5"></div>
<label for="rtos">Probability no rain given not forecasted</label>
<input id="rtos" data-slider-id="rtos" type="text" data-slider-min="0.01" data-slider-max="0.99" data-slider-step="0.01" data-slider-value="0.5" />
<div id="rtos-value" data-value="0.5"></div>
<label for="rtow">Probability use umbrella given it's raining</label>
<input id="rtow" data-slider-id="rtow" type="text" data-slider-min="0.01" data-slider-max="0.99" data-slider-step="0.01" data-slider-value="0.1" />
<div id="rtow-value" data-value="0.1"></div>
<label for="stow">Probability of rain given rain forecasted</label>
<input id="stow" data-slider-id="stow" type="text" data-slider-min="0.01" data-slider-max="0.99" data-slider-step="0.01" data-slider-value="0.1" />
<div id="stow-value" data-value="0.1"></div>
</div>
<div class="text-center" id="sankey-two"></div>
<div class="text-center" id="venn-two"></div>
<p>This may seem like a slightly trivial example, but this type of model has a
huge amount of application. See for example, <a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">Latent Dirichlet Allocation</a> in
natural language processing, <a href="https://en.wikipedia.org/wiki/Bayesian_hierarchical_modeling">Bayesian Hierarchical Models</a>
or <a href="https://en.wikipedia.org/wiki/Restricted_Boltzmann_machine">Restricted Boltzmann Machines</a>.</p>
<h2 id="inference">Inference</h2>
<p>One of the biggest applications of Bayes rule is in Bayesian inference. This
is where we have some data $D$ (e.g. the number of heads in 100 coin flips, or the heights of individuals in a population) and a model parameterised by a set of parameters
$\theta$ (e.g. probability of heads for a certain coin or mean and variance of population height) and we wish to calculate the probability $P(\theta|D)$. That is the
probability of a certain set of parameters given the data we have observed (e.g. probability coin produces heads is 50% given we’ve observed 100 coin flips that produced 75 heads).
Applying Bayes rule, we see that</p>
<h3 id="ptheta--d--fracpd--theta-pthetapd">\(P(\theta | D) = \frac{P(D | \theta) P(\theta)}{P(D)}\).</h3>
<p>We can therefore write down our desired probability in terms of the <em>probability
of observing some data given some model parameters</em> (known as the likelihood) and
the <em>probability of a given set of parameters</em> (known as the prior). We also have
a tricky <em>probability of observing the data $D$</em> to deal with; let’s ignore this
for now.</p>
<p>What if we observed some new data $D_2$? Say we decided to flip the coin again
100 times or we get more data on a population. We can apply Bayes’ rule again in
a similar fashion as before to produce,</p>
<h3 id="ptheta--d_2d--fracpd_2--theta--pd--theta-pthetapd-pd_2">\(P(\theta | D_2,D) = \frac{P(D_2 | \theta ) P(D | \theta) P(\theta)}{P(D_2, D)}\).</h3>
<p>We can then update our posterior each time we observe new data.</p>
<p>This is all a bit abstract, so let’s look at an actual example. Imagine we have
discovered a new cure for a disease and we want to estimate its efficacy, but we
only had a limited amount of data to date. We can assume each patient is independent
of one another (is this reasonable? Could how a trial has been set-up change this?)
and say that someone is cured with the treatment with a probability $p$. Each
individual then has a probability of being cured $p$ and not being cured $1-p$.
Now what if our trial had $N$ patients, what is the probability of seeing $x$
people cured? We can calculate this using the binomial distribution (this is
just a distribution that counts up all the ways $x$ people can be cured out
of $N$ people and sums up the probability for each event). We now have a
likelihood, but we also need a prior in order to perform inference. There’s
a trick where we can take a prior that has a special shape, so that we
can write down the posterior analytically. This trick is called <a href="https://en.wikipedia.org/wiki/Conjugate_prior">conjugate priors</a>. For this example
we can interpret the prior as having observed a previous trial where $y$ people
were cured out of $M$ individuals.</p>
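A sketch of this conjugate update with hypothetical trial sizes. Starting from a uniform Beta(1, 1), a prior trial with $y$ cured out of $M$ gives a Beta$(y+1,\,M-y+1)$ prior, and a new trial with $x$ cured out of $N$ updates it to a Beta$(y+x+1,\,M-y+N-x+1)$ posterior:

```python
# Hypothetical trials: prior trial cured 6 of 10; new trial cured 9 of 10.
y, M = 6, 10
x, N = 9, 10

# Beta-binomial conjugacy: posterior Beta(alpha, beta) parameters.
alpha = y + x + 1
beta = (M - y) + (N - x) + 1

posterior_mean = alpha / (alpha + beta)
print(round(posterior_mean, 3))  # 16 / 22 ≈ 0.727
```

Note how the posterior mean sits between the two trials' raw cure rates (0.6 and 0.9), weighted by how much data each contributed.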
<p>The tool below allows us to explore the consequences of this. We can change
how many patients were in the prior dataset and also in the current dataset
as well as change how many individuals were cured or not (by clicking on them).
Below that is a plot of the likelihood for the new dataset as well as the
posterior representing the probability of the cure rate incorporating both
datasets.</p>
<h3 class="text-center">Prior</h3>
<div id="prior-circles"></div>
<h3 class="text-center">Data</h3>
<div id="data-circles"></div>
<div id="prob-graph"></div>
<p>Some interesting things spring out of this. First, with only a small amount
of data, even if all patients are cured we don’t jump to the conclusion that
the cure rate is 100%. Similarly, if we have little new information then this
doesn’t change our posterior beliefs all that much.</p>
<p>This works nicely for such a simple example, but what if patients came from
different populations where we know the cure rate is different? This leads on to
<a href="https://en.wikipedia.org/wiki/Bayesian_hierarchical_modeling">Bayesian Hierarchical Modelling</a>, but makes the inference calculation a lot
more involved.</p>
<p>The power of Bayesian probability comes from its ability to deal with
combining new information with other information or beliefs and its ability to
deal with small or missing data.</p>
<h3 id="acknowledgements">Acknowledgements</h3>
<p>The code for the Venn diagram can be found <a href="https://github.com/benfred/venn.js">here</a>.</p>
<p>All the interactive examples were coded in <a href="https://d3js.org">d3</a>, which has many fantastic <a href="https://bl.ocks.org">examples</a> for
data visualization.</p>
<style>
.node rect {
cursor: move;
fill-opacity: .9;
shape-rendering: crispEdges;
}
.node text {
pointer-events: none;
text-shadow: 0 1px 0 #fff;
}
.link {
fill: none;
stroke: #000;
stroke-opacity: .2;
}
.link:hover {
stroke-opacity: .5;
}
</style>
<!-- jQuery -->
<script src="https://sempwn.github.io/js/jquery.min.js"></script>
<!-- Plugin JavaScript -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery-easing/1.3/jquery.easing.min.js"></script>
<!-- d3 js v4 -->
<script src="https://d3js.org/d3.v4.min.js"></script>
<!-- sankey -->
<script src="https://sempwn.github.io/js/sankey.js"></script>
<!-- Plotly.js -->
<script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
<!-- slider -->
<script src="https://sempwn.github.io/js/bootstrap-slider.js"></script>
<!-- venn -->
<script src="https://sempwn.github.io/js/venn.js"></script>
<!-- main js -->
<script src="https://sempwn.github.io/js/mcmc3.js"></script>
<h2><a href="https://sempwn.github.io/blog/2017/06/04/mcmc-2">Probabilistic programming 2: Markov Chains</a> (2017-06-04)</h2>
<h3 id="introduction">Introduction</h3>
<div class="row">
<div class="col-md-8 col-md-offset-2 col-xs-10 col-xs-offset-1">
<div class="text-center">
<div id="simple-walker-example"></div>
</div>
</div>
</div>
<p>This is part two of a blog post on probabilistic programming. The first part of
the blog can be found <a href="https://sempwn.github.io/blog/2017/04/20/mcmc">here</a>.</p>
<p>Markov chains are mathematical constructs with a wide range of applications in
physics, mathematical biology, speech recognition, statistics and many others.
The simplest way to think about them is considering the above animation. A person (the circle)
is trying to find out where their friend lives in a neighbourhood block. Unfortunately
all the houses (the squares) look the same and have no numbers. Each time they get to
a house they knock on the door, but then immediately forget where they are. They can
then randomly choose to go left or right before trying another house. We could ask how long on
average it would take for them to find their friend’s house, or what the probability is that they’d
find the house after a certain number of steps. This can be easily computed as long as the person
forgets where they are each time they visit a house. This is the <a href="https://en.wikipedia.org/wiki/Markov_property">Markov property</a>: the person
only has knowledge of their current state (house) and no memory of their previous states.
This property is crucial for keeping the computations of the system tractable, and
it turns out that lots of systems have it.</p>
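To make this concrete, here is a small Python simulation (separate from the animation above) of the forgetful walker on a ring of houses; the ring size, target, and trial count are arbitrary choices for the example:

```python
import random

def knocks_until_found(n_houses, target, rng):
    """Walk left or right at random on a ring of houses, knocking each time,
    until the friend's house is reached. Only the current position matters
    (the Markov property): the walker keeps no memory of past houses."""
    pos, knocks = 0, 0
    while pos != target:
        pos = (pos + rng.choice((-1, 1))) % n_houses
        knocks += 1
    return knocks

rng = random.Random(42)
trials = [knocks_until_found(5, 2, rng) for _ in range(5000)]
print(sum(trials) / len(trials))  # averages around 6 for this configuration
```

For a symmetric walk on a ring of $n$ houses starting distance $d$ from the target, the expected hitting time is $d(n-d)$, which for $n=5$, $d=2$ gives 6, matching the simulation.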
<h3 id="cereal-toy-collector">Cereal toy collector</h3>
<figure class="figure">
<img class="center-block img-responsive" src="https://sempwn.github.io/img/mcmc2/cereal.jpg" alt="complex models" />
<figcaption class="figure-caption text-center">
<a href="https://www.flickr.com/photos/j0annie/15535885191/in/photolist-pERtpv-31eME-gx746-zajaP-q2x7DD-gUMcE-a4icZ-onbTxb-7Nnc-byFmWe-zH1rW-p9HQ93-p9uhrv-ph39UU-qe2np6-rud3Q1-pWtnjw-qzUq8F-pCUMxd-qzQwim-pD99sc-qiuj3v-qikHhW-pCUNTj-qiumHv-qisGzk-qxCd1u-qikEVG-qimsrJ-jJYgZ-8ML189-4CvrPN-6vXBPn-9ga8DM-p7HFiy-5cWLPV-oYm1Rt-p7HN8W-qzQxG3-qimpAd-qzQvwG-pD99Zz-nDJxTK-pD98AH-qxCfCU-qzUo7r-qxCg8w-qzJCBK-dgot1w-p9KDMk">Bowl of cereal.</a>
</figcaption>
</figure>
<p>One simple system to think about is the <a href="https://en.wikipedia.org/wiki/Coupon_collector%27s_problem">coupon collector’s problem</a>. Let’s think
about this in terms of toys that are given away in <a href="https://en.wikipedia.org/wiki/Cereal_box_prize">packets of cereal</a>.
Imagine a cereal company has a promotion on their cereal and are giving away four
toys with their cereal. When you buy one cereal packet you’ll receive one toy, but you don’t
know what toy you’ll receive until you buy the packet.</p>
<p>Let’s simulate this to think about the problem. Below is an overview of our system:
the toys we have collected so far (not including duplicates) are in the top row; the toy we
have just received is below, along with a counter for how many cereal boxes we’ve bought. Below that is our abstract way
of thinking about this problem. If all we care about is how long it takes to complete the collection, then instead of keeping
track of all the specific toys we have collected, we can simply track
how many unique toys we’ve collected so far. This forms a Markov chain, with the number of unique toys collected so far as our state. The probability of transitioning to a new state (receiving a toy we haven’t already got) depends only on the current number of unique toys, so the Markov property holds.</p>
<div class="row">
<div class="col-md-8 col-md-offset-2 col-xs-10 col-xs-offset-1">
<div class="text-center">
<div class="btn-group">
<button id="buy-cereal" type="button" class="btn btn-default btn-lg"><span class="glyphicon glyphicon-shopping-cart" aria-hidden="true"></span>Buy Cereal</button>
<button id="reset-cereal" type="button" class="btn btn-primary btn-lg"><span class="glyphicon glyphicon-step-backward" aria-hidden="true"></span>Reset</button>
</div>
<div id="cereal-toy-example"></div>
</div>
</div>
</div>
<p>Try simulating cereal purchases a few times. You should be able to estimate on average how much cereal you’d need to buy to
complete a collection. Sometimes we’re lucky and can do this in just four purchases. Other times we’re unlucky and have to buy many more. We can work out how much our purchases might vary by calculating the variation in the number of purchases required to complete a collection.</p>
<h3 id="analysing-the-coupon-collector-problem">Analysing the coupon collector problem</h3>
<p>We can actually work out what the expected number of boxes we need to purchase to
complete the collection by hand. First let’s think about what the probability
is of moving from 0 unique toys to 1 unique toy. This is one, as we’ll always
gain a toy we haven’t collected before. Moving from 1 to 2, the probability is $3/4$
of gaining a new unique toy. Similarly, transitioning from 2 to 3 is $1/2$ and from
3 to 4 is a $1/4$.</p>
<p>What is the probability that it takes exactly $x$ purchases to gain a new unique toy, if the
probability of a new unique toy in one purchase is $p$?
This is the probability of not purchasing a new unique toy $x-1$ times followed
by purchasing a new unique toy, or $(1-p)^{x-1}p$. This is the geometric distribution,
and we can quickly calculate its expectation and variance as,</p>
\[\begin{align}
\mathbb{E}[X] &= \frac{1}{p}\\
\text{Var}[X] &= \frac{1-p}{p^2}
\end{align}\]
<p>As all these events are independent of one another (the amount of cereal you
have to buy to get the second toy is independent of the amount you have to
buy to get the third toy, for example), we can calculate the expected number and
variance of the total amount of cereal we have to buy to complete the collection
using the sums of the individual expectations and variances:</p>
\[\begin{align}
\mathbb{E}[Y] &= \sum_{i=1}^{4}\mathbb{E}[X_i] \\
\text{Var}[Y] &= \sum_{i=1}^{4}\text{Var}[X_i]
\end{align}\]
<p>So inputting our values for $p$ ($1$, $3/4$, $1/2$, $1/4$), we find an expected number
of purchases of around $8.3$ with variance $14.4$. We can check these values by
simulating the process below multiple times and recording the mean and variance.</p>
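<p>As a cross-check on these numbers, here is a small Monte Carlo sketch in Python. The helper name and the number of simulated collections are my own illustrative choices, not part of the interactive demo above:</p>

```python
import random

def purchases_to_complete(n_toys, rng):
    """Buy cereal boxes until all n_toys unique toys are collected."""
    seen, boxes = set(), 0
    while len(seen) < n_toys:
        seen.add(rng.randrange(n_toys))  # each box holds one uniform random toy
        boxes += 1
    return boxes

rng = random.Random(42)
samples = [purchases_to_complete(4, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
# theory: E[Y] = 1 + 4/3 + 2 + 4 = 25/3 ≈ 8.33 and Var[Y] ≈ 14.4
```

<p>With twenty thousand simulated collections the sample mean and variance should land close to the theoretical $8.3$ and $14.4$.</p>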
<div class="row">
<div class="col-md-8 col-md-offset-2 col-xs-10 col-xs-offset-1">
<div class="text-center">
<div class="btn-group">
<button id="buy-cereals" type="button" class="btn btn-default btn-lg"><span class="glyphicon glyphicon-shopping-cart" aria-hidden="true"></span>Buy Cereals</button>
<button id="reset-cereals" type="button" class="btn btn-primary btn-lg"><span class="glyphicon glyphicon-step-backward" aria-hidden="true"></span>Reset</button>
</div>
<div id="cereal-multiple"></div>
</div>
</div>
</div>
<h3 id="walker-on-a-graph">Walker on a graph</h3>
<p>Markov chains don’t need to have a finite number of states. Consider our random
walker from the beginning. Instead of just visiting four houses, imagine that there
are an infinite number of houses on the block. Another way to think about this is as a
gambler’s winnings (as long as the gambler can go into an infinite amount of debt).
Let’s imagine the gambler plays a simple game where they flip a coin: if it’s heads
they gain one dollar and if it’s tails they lose a dollar. Let’s imagine that this is a fair
coin, so the probability of a win is a half. We could ask how long it might take for the gambler
to lose all their money and end up at zero. It turns out this has a surprising answer.</p>
<p>First let’s simulate this process a few times below.</p>
<div class="row">
<div class="col-md-10 col-md-offset-1 col-xs-10 col-xs-offset-1">
<div class="text-center">
<div class="btn-group">
<button id="walker-run" type="button" class="btn btn-default btn-lg"><span class="glyphicon glyphicon-play" aria-hidden="true"></span>Run</button>
<button id="walker-reset" type="button" class="btn btn-primary btn-lg"><span class="glyphicon glyphicon-step-backward" aria-hidden="true"></span>Reset</button>
</div>
<div id="graphing-walker-example"></div>
</div>
</div>
</div>
<p>We can actually figure out some properties of this without the need to do
multiple simulations. The first trick is to figure out the expected winnings
for the gambler after one game. We know that with probability $p$ the winnings go
up one and with probability $1-p$ they go down one. So the expectation is
$1\times p + (-1)\times(1-p)$, which is $2p-1$. Note that when the probability of gaining a dollar
is a half, the expected step is zero. This is because half the time
they lose a dollar and half the time they gain a dollar, so the two cancel.</p>
<p>Does this mean that after a hundred steps we would expect the winnings to be 0?
Clearly, if we experiment by running a few simulations this doesn’t appear to
be the case. There is in fact quite a bit of variation between each simulation.
What’s the variation in one time-step? Starting at 0 if we’re just as likely to
go up one or down one then we could guess the variance is 1. We can work this
out, using the formal definition of the variance
\(\text{Var}[X] = \mathbb{E}[X^2]-\mathbb{E}[X]^2\).
The first expectation is $1^2\times p + (-1)^2 \times (1-p) = 1$ and the second term is
$(1\times p + (-1) \times (1-p))^2 = (2p-1)^2$. This gives a formula for the variance as
$1-(2p-1)^2$.</p>
<p>The great thing about the Markov property is that the probability of moving to a given state
depends only on the previous state. For this simple model the probability of moving up or down
is actually independent of the state. So to calculate the variance after $t$ games
we just need to sum up the variance for each step. In other words, the formula
turns out to be
\((1-(2p-1)^2) t\)</p>
<p>The variance is then at its maximum when $p$ is a half, i.e. when there’s equal
chance of moving up or down. After one hundred time steps the variance is one
hundred. This gives a probability of being at position 0 after one hundred
steps of only $8\%$. In fact, over time, the probability of being at the origin
at any given step diminishes. It turns out that the expected time to return to $0$ is actually infinite:
the gambler can take arbitrarily long excursions, as there’s nothing bounding their winnings
(or losses).</p>
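<p>These claims are easy to check numerically. Below is a rough Python sketch (the function name, seed, and sample counts are illustrative assumptions) that simulates many hundred-step walks and compares the sample mean, variance, and fraction ending at zero against the theory:</p>

```python
import random

def final_position(t, p, rng):
    """Winnings after t coin flips: +1 with probability p, otherwise -1."""
    return sum(1 if rng.random() < p else -1 for _ in range(t))

rng = random.Random(1)
t, n = 100, 5000
finals = [final_position(t, 0.5, rng) for _ in range(n)]
mean = sum(finals) / n                                  # theory: (2p-1)*t = 0
var = sum((x - mean) ** 2 for x in finals) / n          # theory: (1-(2p-1)**2)*t = 100
at_zero = sum(1 for x in finals if x == 0) / n          # theory: about 8%
```

<p>The empirical variance should come out near one hundred and the fraction of walks sitting at zero after one hundred steps near $8\%$.</p>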
<h3 id="walker-with-drift">Walker with drift</h3>
<p>Imagine instead that the coin has a slight bias, i.e. $p\neq 1/2$. We can see from
our variance formula above that the more biased the coin, the lower the variance of
the gambler’s winnings. We can see the impact of a slight bias by varying the
probability of a win in the simulation below: the further the probability moves
from a half, the smaller the variance becomes.</p>
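<p>The variance formula itself makes this quantitative. A tiny sketch (helper name and sample probabilities are my own choices):</p>

```python
# Variance of the gambler's winnings after t games, (1 - (2p-1)**2) * t.
def winnings_variance(p, t):
    return (1 - (2 * p - 1) ** 2) * t

# the spread after 100 games shrinks as the coin becomes more biased
spreads = {p: winnings_variance(p, 100) for p in (0.5, 0.6, 0.8, 0.99)}
```

<p>At $p=1/2$ the variance is at its maximum of $100$; by $p=0.99$ it has fallen below $4$.</p>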
<div class="row">
<div class="col-md-10 col-md-offset-1 col-xs-10 col-xs-offset-1">
<div class="text-center">
<div class="form-group">
<label for="prob-walker">Probability</label>
<input id="prob-walker" data-slider-id="prob-walker" type="text" data-slider-min="0.01" data-slider-max="0.99" data-slider-step="0.01" data-slider-value="0.5" />
<div id="prob-walker-value" data-value="0.5"></div>
</div>
<div class="btn-group">
<button id="drift-walker-run" type="button" class="btn btn-default btn-lg"><span class="glyphicon glyphicon-play" aria-hidden="true"></span>Run</button>
<button id="drift-walker-reset" type="button" class="btn btn-primary btn-lg"><span class="glyphicon glyphicon-step-backward" aria-hidden="true"></span>Reset</button>
</div>
<div id="graphing-drift-walker-example"></div>
</div>
</div>
</div>
<h3 id="application-pagerank">Application: PageRank</h3>
<p>Finally let’s look at a real application of a Markov chain with the
<a href="https://en.wikipedia.org/wiki/PageRank">PageRank</a> algorithm. This is the original
algorithm used by Google to rank web pages. A more in depth look can be found
<a href="http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm">here</a> .</p>
<p>We can think about the algorithm in terms of a random walker “walking” through
web pages by clicking on links. When it arrives at a web page, it finds all the
links and then picks one at random to click on. For simplicity let’s imagine that
every page has a link back to the page that linked to it. Each time the walker
passes through a page we record this by adding a point to the page. After a long
time the pages can be ranked by how many times the walker has visited them. This
gives a way of quantifying how “central” a page is on the web. One issue is
that the rank number for each page keeps increasing. We can add a damping factor
by making a page’s point smaller the later in the walk the walker visits it.</p>
<p>We can simulate the process on a small random network below. The colour denotes
how recently the walker had visited that page and the size denotes its PageRank.</p>
<div class="row">
<div class="col-md-10 col-md-offset-1 col-xs-10 col-xs-offset-1">
<div class="text-center">
<div class="btn-group">
<button id="network-play" type="button" class="btn btn-default btn-lg"><span class="glyphicon glyphicon-play" aria-hidden="true"></span>Run</button>
<button id="network-reset" type="button" class="btn btn-primary btn-lg"><span class="glyphicon glyphicon-step-backward" aria-hidden="true"></span>Reset</button>
</div>
<div id="network-example"></div>
</div>
</div>
</div>
<p>In reality there are far quicker ways to calculate the PageRank than
to just simulate it. By setting the problem up as a Markov chain, though, we’re
able to use a lot of mathematical machinery to calculate it efficiently.</p>
<!-- jQuery -->
<script src="https://sempwn.github.io/js/jquery.min.js"></script>
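<p>One of those quicker approaches is power iteration on the chain’s transition probabilities. Here is a minimal Python sketch on a made-up four-page web; the link structure, damping factor of $0.85$, and iteration count are illustrative assumptions, not the demo’s actual code:</p>

```python
# Power iteration for PageRank on a hypothetical four-page web.
# links[i] lists the pages that page i links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n, d = 4, 0.85  # d is the damping factor commonly used with PageRank

rank = [1.0 / n] * n
for _ in range(100):
    new = [(1 - d) / n] * n
    for page, outs in links.items():
        for target in outs:
            # each page shares its rank equally among its outgoing links
            new[target] += d * rank[page] / len(outs)
    rank = new
# the ranks sum to one; page 2 receives the most links, so it ranks highest
```

<p>Because every page here has outgoing links, the total rank is conserved at each step, and the iteration converges to the chain’s stationary distribution.</p>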
<!-- Plugin JavaScript -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery-easing/1.3/jquery.easing.min.js"></script>
<!-- Bootstrap -->
<script src="https://sempwn.github.io/js/bootstrap-slider.js"></script>
<!-- D3 js -->
<script src="https://d3js.org/d3.v3.js"></script>
<!-- main js -->
<script src="https://sempwn.github.io/js/mcmc2.js"></script>IntroductionProbabilistic programming 1: Monte Carlo Method2017-04-20T00:00:00+00:002017-04-20T00:00:00+00:00https://sempwn.github.io/blog/2017/04/20/mcmc<h3 id="introduction">Introduction</h3>
<p>This is the first post in a series on <a href="https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo">Markov Chain Monte Carlo</a> (MCMC), a powerful technique used in performing inference on probabilistic models. We’ll unpack
what each of these terms means: what a <a href="https://en.wikipedia.org/wiki/Markov_chain">Markov Chain</a>
is, what <a href="https://en.wikipedia.org/wiki/Monte_Carlo_algorithm">Monte Carlo simulation</a>
is, and finally how it all fits together in the framework of MCMC.</p>
<h3 id="background-monte-carlo-method">Background: Monte Carlo method</h3>
<p>The main idea behind the Monte Carlo method is to simulate randomly from a
probability distribution when it is difficult or impossible to compute the
probability directly.</p>
<p>Imagine we have a complex deterministic model such as those used in hydrodynamic
flow or climate modelling. We might have many inputs into this model, all of which
are likely to have some uncertainty around them. A way to understand how this
uncertainty impacts the model prediction is to simulate from the model many times,
using inputs drawn from their uncertainty distributions. Notice that this automatically
down-weights rare scenarios: if the probability of an input is low then it is
unlikely to be selected, and therefore unlikely to contribute much to the model prediction.</p>
<p>The method came about from the Manhattan project. Nicholas Metropolis and his
team developed the technique and needed a code name for it. They decided to
name it after the Monte Carlo casino where the uncle of Stanislaw Ulam
(another member of the team) often gambled. The intuition is that if you want to
understand, say, the probability of winning at roulette given a certain
strategy, then one solution is to just play it many, many times, recording
how often you win and lose. Then, after going through your entire savings, you
divide the number of times you won by the total number of times you
played, and that’s your estimate of the probability of a win. If you don’t want
to burn through all your money to understand this, you can create a computer
simulation of a roulette wheel and use that to perform your experiment.</p>
<h3 id="simple-example-the-binomial-process">Simple example: The binomial process</h3>
<p>This is a bit abstract so let’s look at a simple example of performing a series
of coin tosses. Each coin toss can be heads with probability $p$ (normally
0.5 is chosen for a fair coin, although there’s some <a href="http://statweb.stanford.edu/~susan/papers/headswithJ.pdf">evidence that isn’t always the case</a> ).
In a series of $N$ coin tosses the probability of $k$ heads follows a binomial distribution
like this</p>
\[P(k|N,p) = \binom{N}{k} p^k(1-p)^{N-k}.\]
<p>Let’s imagine we didn’t have a formula for the probability. We could instead
repeatedly simulate coin tosses, recording the number of heads in order to build
up an empirical distribution that asymptotically converges to the binomial
distribution. You can use the tool below to simulate this, changing the probability and speed
of the simulation.</p>
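<p>The same experiment is easy to script. This Python sketch (function name and trial counts are my own illustrative choices) simulates batches of ten tosses and compares the empirical frequency of five heads against the binomial formula:</p>

```python
import random
from math import comb

def count_heads(N, p, rng):
    """Number of heads in N simulated coin tosses."""
    return sum(1 for _ in range(N) if rng.random() < p)

rng = random.Random(7)
N, p, trials = 10, 0.5, 20000
counts = [0] * (N + 1)
for _ in range(trials):
    counts[count_heads(N, p, rng)] += 1

empirical = counts[5] / trials
exact = comb(N, 5) * p**5 * (1 - p) ** (N - 5)  # binomial P(k=5|N=10,p=0.5)
```

<p>With twenty thousand trials the empirical estimate should sit within a percentage point or so of the exact value of about $0.246$.</p>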
<div class="row">
<div class="col-md-8 col-md-offset-2 col-xs-10 col-xs-offset-1">
<div class="text-center">
<div class="btn-group">
<button id="start-binom" type="button" class="btn btn-default btn-lg"><span class="glyphicon glyphicon-play" aria-hidden="true"></span>Start</button>
<button id="reset-binom" type="button" class="btn btn-primary btn-lg"><span class="glyphicon glyphicon-step-backward" aria-hidden="true"></span>Reset</button>
</div>
<div class="form-group">
<label for="speed-binom">Speed</label>
<input id="speed-binom" data-slider-id="speed-binom" type="text" data-slider-min="0.1" data-slider-max="1" data-slider-step="0.1" data-slider-value="0.5" data-tooltip="hide" />
<div id="speed-binom-value" data-value="0.5"></div>
<label for="prob-binom">Probability</label>
<input id="prob-binom" data-slider-id="prob-binom" type="text" data-slider-min="0.1" data-slider-max="0.9" data-slider-step="0.1" data-slider-value="0.5" />
<div id="prob-binom-value" data-value="0.5"></div>
</div>
</div>
<div class="text-center">
<div id="bin-proc"></div>
</div>
</div>
</div>
<h3 id="calculating-pi">Calculating pi</h3>
<p>Another example of Monte Carlo simulation is in the calculation of $\pi$. One example
of this is <a href="https://en.wikipedia.org/wiki/Buffon%27s_needle">Buffon’s needle</a>, which can be done
by throwing down matchsticks and observing how often they cross a series of parallel lines.</p>
<p>It’s maybe easier to think instead about random points falling onto a square of length
one and seeing how often the points fall inside a circle of diameter one centered
in the middle of the square. The probability of the point landing in the circle is the
same as the area of the circle divided by the area of the square. From basic geometry, the circle area
is
\(\pi r^2 = \pi (1/2)^2 = \pi/4\)
and the area of the square is one. Therefore the probability that a random point
lands in the circle is $\pi/4$. Note that to check whether a point is contained
in the circle, you just need to check whether the sum of the squares of its
coordinates, measured from the circle’s centre, is less than the squared radius,
so this method doesn’t explicitly use $\pi$ anywhere.</p>
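<p>In code the whole estimator is only a few lines. A Python sketch (the seed and sample size are arbitrary choices of mine):</p>

```python
import random

rng = random.Random(0)
n, inside = 100_000, 0
for _ in range(n):
    x, y = rng.random(), rng.random()  # uniform point in the unit square
    # inside the circle of diameter one when the squared distance from the
    # centre (0.5, 0.5) is below the squared radius (1/2)**2
    if (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25:
        inside += 1
pi_estimate = 4 * inside / n  # converges to pi as n grows
```

<p>The error shrinks like $1/\sqrt{n}$, so a hundred thousand points typically gives two to three correct digits of $\pi$.</p>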
<p>You can simulate this process in the tool below.</p>
<div class="row">
<div class="col-md-8 col-md-offset-2 col-xs-10 col-xs-offset-1">
<div class="text-center">
<div class="btn-group">
<button id="start-pi" type="button" class="btn btn-default btn-lg"><span class="glyphicon glyphicon-play" aria-hidden="true"></span>Start</button>
<button id="reset-pi" type="button" class="btn btn-primary btn-lg"><span class="glyphicon glyphicon-step-backward" aria-hidden="true"></span>Reset</button>
</div>
</div>
<div id="pimc"></div>
</div>
</div>
<h3 id="the-permutation-method">The permutation method</h3>
<p>Finally, let’s look at a less trivial example. Suppose that we conducted a
public health intervention and we want to compare the outcome between a group
who did not receive the intervention and those who did. We could look at an outcome
measure (blood pressure, BMI etc.) and compare the mean in both groups. How do
we know if the difference of the means is significant? We might use something
like a <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">t-test</a>, but this
introduces a few assumptions about the data such as <a href="https://en.wikipedia.org/wiki/Normality_test">normality</a>.</p>
<p>Let’s assume we don’t know where the data came from. We could instead repeatedly
divide the data we have into two groups at random and compare the difference
of means between the two groups. Dividing into two groups is the same as random sampling
without replacement, which is why this technique is known as the <a href="https://en.wikipedia.org/wiki/Resampling_(statistics)#Permutation_tests">permutation test</a>.</p>
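<p>The procedure fits in a short function. Below is a hedged Python sketch: the function name, the permutation count, and the two made-up data sets are all illustrative assumptions, not real study data:</p>

```python
import random

def permutation_test(a, b, n_perm, rng):
    """Fraction of random relabellings whose absolute difference of means is
    at least as large as the observed difference (an approximate p-value)."""
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # shuffling then splitting = sampling w/o replacement
        left, right = pooled[: len(a)], pooled[len(a):]
        if abs(sum(left) / len(left) - sum(right) / len(right)) >= observed:
            count += 1
    return count / n_perm

rng = random.Random(3)
group_a = [0.1, -0.3, 0.2, 0.05, -0.1, 0.15, 0.0, -0.2]  # made-up data near 0
group_b = [2.1, 1.8, 2.3, 1.9, 2.2, 2.0, 1.7, 2.4]       # made-up data near 2
p_value = permutation_test(group_a, group_b, 2000, rng)
```

<p>For groups this clearly separated, almost no random relabelling matches the observed gap, so the estimated p-value is very small.</p>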
<p>Below we have a tool that randomly samples a set of data from two normal distributions, with
one mean centered at zero and another that you can vary. You can also vary the population size
of each group. When the simulation starts a random permutation occurs and then the new difference
in the means of both groups is recorded. If it’s greater than the original difference, then this simulation is counted.
These counts are then used to estimate the probability that a randomly selected permutation has a difference
of means at least as large as the original data. You can see after many iterations this value converges and
you can use a pre-determined p-value to see whether this is significant or not.</p>
<div class="row" id="lattice-epidemic-tool">
<div class="col-md-8 col-md-offset-2 col-xs-10 col-xs-offset-1">
<div class="text-center">
<div class="btn-group">
<button id="start-bootstrap" type="button" class="btn btn-default btn-lg"><span class="glyphicon glyphicon-play" aria-hidden="true"></span>Start</button>
<button id="reset-bootstrap" type="button" class="btn btn-primary btn-lg"><span class="glyphicon glyphicon-step-backward" aria-hidden="true"></span>Reset</button>
</div>
<div class="form-group">
<label for="mean-bootstrap">Group Mean</label>
<input id="mean-bootstrap" data-slider-id="mean-bootstrap" type="text" data-slider-min="0.0" data-slider-max="5" data-slider-step="0.1" data-slider-value="0.5" data-tooltip="hide" />
</div>
<div class="form-group">
<label for="sample-bootstrap">Sample Number</label>
<input id="sample-bootstrap" data-slider-id="sample-bootstrap" type="text" data-slider-min="5" data-slider-max="20" data-slider-step="1" data-slider-value="10" />
</div>
</div>
<div id="bootstrap-example"></div>
</div>
</div>
<p>As always, there are some caveats with this technique. Try playing around with small population sizes when the means of the
two distributions are the same, to see if you can still end up with something significant. The moral is to always be suspicious
when the size of the data is small.</p>
<h4 id="acknowledgements">Acknowledgements</h4>
<p>For the calculation of pi example, I adapted code from <a href="https://bl.ocks.org/bricedev/33de9b3c78b442938d52">here</a>.</p>
<p>All the interactive examples were coded in <a href="https://d3js.org">d3</a>, which has many fantastic <a href="https://bl.ocks.org">examples</a> for
data visualization.</p>
<!-- jQuery -->
<script src="https://sempwn.github.io/js/jquery.min.js"></script>
<!-- Plugin JavaScript -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery-easing/1.3/jquery.easing.min.js"></script>
<!-- Bootstrap -->
<script src="https://sempwn.github.io/js/bootstrap-slider.js"></script>
<!-- D3 js -->
<script src="https://d3js.org/d3.v3.js"></script>
<!-- main js -->
<script src="https://sempwn.github.io/js/mcmc.js"></script>IntroductionPresenting & Communicating models: Creating online web applications2017-04-12T00:00:00+00:002017-04-12T00:00:00+00:00https://sempwn.github.io/blog/2017/04/12/online_web_tools<p>This blog came about from a recently published <a href="http://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0005206">article</a> I had in
<a href="http://journals.plos.org/plosntds/">PLoS NTD</a> that I also recently gave as a talk.
The main idea is that it’s becoming increasingly easy to create front-ends or
dashboards for an epidemic model, or models in general, and so I want to lay out some
of the tools that we could potentially utilise. Creating a dashboard for a model comes with
its own set of challenges and questions: How much can a user play around with the model?
What should be outputted, and in what format? How can you easily show what’s going into
the model?</p>
<p>These are difficult questions to answer, and would certainly be answered differently
on a case-by-case basis. In the article, and here, we tried to lay out some of the things that
need to be considered on a conceptual level, along with some of the advantages and disadvantages
of this approach. Finally, I go through a few different technologies for generating a
modelling tool and then talk through some examples.</p>
<h3 id="challenges-in-communicating-models">Challenges in communicating models</h3>
<p>When developing user-friendly interfaces, we probably want to consider the following:</p>
<ul>
<li>Access—for users with limited modelling expertise.</li>
<li>Speed—analyses produced quickly without expensive computer resources.</li>
<li>Characterisation of uncertainty—usually through repeated runs of the model, resulting in a higher processing burden.</li>
<li>Ease of use—requires design choices, including instructive inputs.</li>
<li>Clarity of presentation—limiting misunderstanding of the model and its outputs.</li>
<li>Responsiveness to needs—flexibility to iteratively update the interface through a consultation with intended end-users.</li>
<li>Range of users—different users have different needs, and it is challenging to survey and understand all of these needs.</li>
</ul>
<h3 id="advantages--disadvantages-for-a-web-interface">Advantages & disadvantages for a web interface</h3>
<p>The main advantage is that it can give access to the model for non-expert users.
They are also able to generate results from the model in real-time using their
own processing power. Interactive input and output can be tailored directly to
disease-specific goals, such as seeing what impact increasing the coverage of a
vaccine has.</p>
<p>There are some things to consider, however. There could be potential misinterpretation
or misuse of results due to lack of expert guidance; for example, the dynamics
of breaking transmission are likely to be highly locally specific, and the
modelling results should be considered in this context.
Partly for this reason, only a limited set of parameters can be changed in the model,
and end-users don’t have full access to the model through the interface. This also means
the tool is either limited in how far it can be tailored to local settings, or can only
deal with very generic and perhaps unrealistic scenarios.</p>
<h3 id="can-a-browser-really-run-complex-model-simulations">Can a browser really run complex model simulations?</h3>
<p><img class="center-block img-responsive" src="https://sempwn.github.io/img/online_tools/image1.png" alt="complex models" /></p>
<p>The short answer is yes! The <a href="https://en.wikipedia.org/wiki/Browser_wars#Second_Browser_War">Second Browser War</a> led to the development of Google’s <a href="https://developers.google.com/v8/">V8</a> engine.
Other browsers followed suit, and now most browsers can run standard ECMAScript efficiently.
There is also an increasing trend of single-page applications shifting computation
from server to client, meaning more libraries and tools are available with this
in mind.</p>
<p>Some examples include</p>
<ul>
<li>Neural networks: <a href="http://playground.tensorflow.org/">Tensorflow</a>, <a href="http://cs.stanford.edu/people/karpathy/convnetjs/">ConvNetjs</a></li>
<li>Particle simulator: <a href="http://google.github.io/liquidfun/">LiquidFun</a></li>
<li>PDE, ODE solvers: <a href="http://www.numericjs.com">numericjs</a></li>
<li>3d simulations</li>
<li>Many more…</li>
</ul>
<h3 id="potential-web-interface-technologies">Potential web interface technologies</h3>
<p>Now we’ve established what we can do in a browser we can look at some of the
technologies currently available.</p>
<h4 id="shiny">Shiny</h4>
<p><img class="center-block img-responsive" src="https://sempwn.github.io/img/online_tools/image2.png" alt="complex models" /></p>
<p><a href="https://shiny.rstudio.com">Shiny</a> is a way of writing apps in the statistical
language <a href="https://www.r-project.org">R</a>. It’s becoming an increasingly popular
way of creating a modelling dashboard and is under active development so there’ll
be lots of new features being added. If the model’s already written in <code class="language-plaintext highlighter-rouge">R</code> then
it should be easy to implement something quickly.</p>
<h4 id="python-jupyter-notebook">Python jupyter notebook</h4>
<p><img class="center-block img-responsive" src="https://sempwn.github.io/img/online_tools/image3.png" alt="complex models" /></p>
<p><a href="http://jupyter.org">jupyter</a> notebooks are another viable option for
developing a simple dashboard. Again, this doesn’t require explicit knowledge
of web technologies, although that becomes helpful for more complex tasks. They’re
extremely easy to set up and host somewhere like <a href="https://github.com">Github</a>,
and are great for quickly displaying concepts and creating tutorials/blog posts.
As with Shiny, however, there comes a limit where you won’t be able to accomplish everything
you’d want when trying to create tools for more complex models.</p>
<h4 id="native-javascript">Native JavaScript</h4>
<p><img class="center-block img-responsive" src="https://sempwn.github.io/img/online_tools/image4.png" alt="complex models" /></p>
<p>As with all other libraries or APIs that provide a way of generating a website
or tool, at some point you’re going to want to do something that hasn’t been
explicitly accounted for by the framework. It’s therefore worth considering creating something
in more native JavaScript, although this is going to require some
degree of knowledge of web development (CSS, HTML etc.). What you get from this is
access to very powerful libraries for visualisation such as <a href="https://d3js.org">d3.js</a> and
<a href="https://plot.ly">Plotly.js</a>. I’d recommend browsing the <code class="language-plaintext highlighter-rouge">d3</code> gallery to see the range
of interactive graphs and figures it can achieve. Another advantage of going down
the JavaScript route is an easier way of linking to external databases and
other website APIs. As mentioned before, browser js engines are powerful, and there are
plenty of libraries out there for solving ODEs and PDEs, as well as plenty of scope
for writing your own solver libraries.</p>
<p>Although there’s a bit more overhead than with something like <code class="language-plaintext highlighter-rouge">Shiny</code>, you can generate
dashboards using an HTML/CSS library such as <a href="http://getbootstrap.com">Bootstrap</a>.
There also exist lots of example dashboard templates that are often free to use.</p>
<h3 id="some-examples">Some examples</h3>
<p>Let’s look at some specific examples written in <code class="language-plaintext highlighter-rouge">js</code>, <code class="language-plaintext highlighter-rouge">plotly</code> and <code class="language-plaintext highlighter-rouge">d3</code>. For these
examples we’ll use the SIR model. This is one of the simplest epidemic models, where
individuals progress through three disease stages: susceptible (\(S\)), infected (\(I\))
and recovered (\(R\)) as in the diagram below</p>
<p><img class="center-block img-responsive" src="https://sempwn.github.io/img/online_tools/SIR_model.png" alt="complex models" /></p>
<p>We’ll assume a fixed population where everyone is susceptible except for a couple of
infections at the start.</p>
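<p>For concreteness, the deterministic version of the SIR model can be integrated in a few lines. This Python sketch uses a simple Euler scheme with parameter values of my own choosing (it is not the code behind the demos below):</p>

```python
def simulate_sir(beta, gamma, i0=0.01, dt=0.1, days=100):
    """Euler integration of the SIR equations for proportions s, i, r.
    beta is the transmission rate; gamma is the recovery rate, i.e. one
    over the infectious period, so R0 = beta / gamma."""
    s, i, r = 1.0 - i0, i0, 0.0
    history = [(s, i, r)]
    for _ in range(int(days / dt)):
        ds = -beta * s * i          # susceptibles become infected
        di = beta * s * i - gamma * i
        dr = gamma * i              # infected recover
        s, i, r = s + ds * dt, i + di * dt, r + dr * dt
        history.append((s, i, r))
    return history

epidemic = simulate_sir(beta=0.4, gamma=0.2)  # R0 = 2: epidemic takes off
fizzle = simulate_sir(beta=0.1, gamma=0.2)    # R0 = 0.5: infection dies out
peak_epidemic = max(i for _, i, _ in epidemic)
peak_fizzle = max(i for _, i, _ in fizzle)
```

<p>With $R_0 = 2$ the infected proportion peaks at roughly fifteen percent of the population, while with $R_0 = 0.5$ it never rises above its starting value.</p>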
<p>The first example is an individual simulation, with individuals represented as coloured
circles with blue denoting susceptible, red denoting infected and gray denoting recovered.
We can interact with the simulation as it progresses by clicking on individuals to
“vaccinate” them. All other parameters such as the rate of infectivity and the infectious period
can’t be adjusted.</p>
<div class="row" id="lattice-epidemic-tool">
<div class="col-md-8 col-md-offset-2 col-xs-10 col-xs-offset-1">
<div class="text-center">
<div class="btn-group">
<button id="pause" type="button" class="btn btn-default btn-lg"><span class="glyphicon glyphicon-pause" aria-hidden="true"></span>Pause</button>
<button id="reset" type="button" class="btn btn-primary btn-lg"><span class="glyphicon glyphicon-step-backward" aria-hidden="true"></span>Reset</button>
</div>
</div>
<div class="text-center">
<div id="latticeEpidemic">
</div>
</div>
</div>
</div>
<p>Going further into the concept of the SIR model, we might want to explore what impact
the different parameters have on the dynamics of the infection. Below we simulate
a deterministic epidemic (an ODE, or equivalently a very large population) and explore what impact
the basic reproduction number \(R_0\) and the infectious period have. \(R_0\) can be a slightly
tricky concept to understand: it’s defined as the average number of secondary cases arising from
one primary case in a completely susceptible population. Notice that if it’s less than one
then there’s no chance of an epidemic taking off.</p>
<div id="SIRGraphDiv">
</div>
<form class="form">
<div class="form-group">
<label for="inputr0">Basic reproduction number</label>
<input id="inputr0" data-slider-id="ex1Slider" type="text" data-slider-min="0" data-slider-max="10" data-slider-step="0.1" data-slider-value="2" />
<small class="form-text text-muted">Above one leads to epidemic.</small>
</div>
<div class="form-group">
<label for="inputgamma">Infectious period</label>
<input id="inputgamma" data-slider-id="ex1Slider" type="text" data-slider-min="0.1" data-slider-max="30" data-slider-step="1" data-slider-value="3" />
<small class="form-text text-muted">.</small>
</div>
</form>
<p>We can also explore this concept using a stochastic as opposed to a deterministic
model. Here we simulate the epidemic starting with one infected individual and
repeat the simulation multiple times to create a distribution of the epidemic curves.</p>
<div id="StochSIRGraphDiv">
</div>
<form class="form">
<div class="form-group">
<label for="inputr0s">Basic reproduction number</label>
<input id="inputr0s" data-slider-id="ex1Slider" type="text" data-slider-min="0" data-slider-max="10" data-slider-step="0.1" data-slider-value="2" />
<small class="form-text text-muted">Above one leads to epidemic.</small>
</div>
<div class="form-group">
<label for="inputgammas">Infectious period</label>
<input id="inputgammas" data-slider-id="ex1Slider" type="text" data-slider-min="0.1" data-slider-max="30" data-slider-step="1" data-slider-value="3" />
<small class="form-text text-muted">.</small>
</div>
</form>
<p>For a more sophisticated example, although one that uses the same basic principles,
go to <a href="http://www.ntdmodelling.org/transfil/">ntdmodelling.org/transfil</a> for a dashboard to model
intervention strategies for lymphatic filariasis (a <a href="http://www.who.int/neglected_diseases/diseases/en/">neglected tropical disease</a>) that
we developed.</p>
<h3 id="conclusion">Conclusion</h3>
<p>There are many potential technologies out there to build single-page applications and dashboards
and it’s becoming easier to produce some really powerful, user-friendly tools.
One of the big advantages for having interactive plots is when it comes to geographic data.
This is again where a library like <code class="language-plaintext highlighter-rouge">d3</code> really shines and there are some fantastic examples
out there.</p>
<p>All of these examples can run in the browser, which offers a lot of advantages. Users won’t
need to install any software, they can store data locally, the tool can interact with other website APIs to pull
in other data sources, and if the application is updated then the changes are pushed to users immediately.</p>
<p>Even if the model can’t be coded up in <code class="language-plaintext highlighter-rouge">js</code>, having a few interactive plots goes a long way to
explaining complex results and can provide a more succinct way of conveying geographic data.</p>
<p>The original article that inspired this blog is open-access and can be found below.</p>
<p><a href="http://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0005206">Irvine, Michael A., and T. Deirdre Hollingsworth. “Making transmission models accessible to end-users: the example of TRANSFIL.” <em>PLoS Neglected Tropical Diseases</em> 11.2 <strong>(2017)</strong></a></p>
<!-- jQuery -->
<script src="https://sempwn.github.io/js/jquery.min.js"></script>
<!-- Plot.ly js -->
<script src="https://cdn.plot.ly/plotly-latest.min.js"></script>
<!-- Plugin JavaScript -->
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery-easing/1.3/jquery.easing.min.js"></script>
<!-- Bootstrap -->
<script src="https://sempwn.github.io/js/bootstrap-slider.js"></script>
<!-- D3 js -->
<script src="https://sempwn.github.io/js/d3.js"></script>
<script src="https://sempwn.github.io/js/d3.layout.js"></script>
<script src="https://sempwn.github.io/js/d3.geom.js"></script>
<script src="https://sempwn.github.io/js/d3.grid.js"></script>
<script src="https://sempwn.github.io/js/SIR.js"></script>
<p>This blog came about from a recently published article I had in PLoS NTD that I also recently gave as a talk. The main idea is that it’s becoming increasingly easy to create front-ends or dashboards for an epidemic model, or for models in general, and so to lay out some of the tools that we could potentially utilise. Creating a dashboard for a model comes with its own set of challenges and questions: how much can a user play around with the model? What should be output, and in what format? How can you easily show what’s going into the model?</p>
<h2>Introduction to convolutional neural networks</h2>
<p>Published 2017-04-06 at <a href="https://sempwn.github.io/blog/2017/04/06/conv_net_intro">sempwn.github.io/blog/2017/04/06/conv_net_intro</a></p>
<h3 id="preamble">Preamble</h3>
<p>Python notebook for this post can be found at <a href="https://github.com/sempwn/keras-intro">https://github.com/sempwn/keras-intro</a></p>
<p>Before starting we’ll need to make sure tensorflow and keras are installed. Open a terminal and type the following commands:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install</span> <span class="nt">--user</span> tensorflow
pip <span class="nb">install</span> <span class="nt">--user</span> keras <span class="nt">--upgrade</span>
</code></pre></div></div>
<p>The back-end of keras can be either theano or tensorflow. Make sure keras uses tensorflow by switching the back-end with the following command:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sed</span> <span class="nt">-i</span> <span class="s1">'s/theano/tensorflow/g'</span> <span class="nv">$HOME</span>/.keras/keras.json
</code></pre></div></div>
<p>Note that this post was written in keras 2.0, where there have been a number of changes from version 1.0. We begin by loading in the libraries we’ll be using in the notebook:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">pylab</span> <span class="n">inline</span>
<span class="kn">import</span> <span class="nn">keras</span>
<span class="kn">from</span> <span class="nn">keras.datasets</span> <span class="kn">import</span> <span class="n">mnist</span>
<span class="kn">from</span> <span class="nn">keras.models</span> <span class="kn">import</span> <span class="n">Sequential</span>
<span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Dense</span><span class="p">,</span> <span class="n">Dropout</span><span class="p">,</span> <span class="n">Activation</span><span class="p">,</span> <span class="n">Flatten</span>
<span class="kn">from</span> <span class="nn">keras.layers</span> <span class="kn">import</span> <span class="n">Conv2D</span><span class="p">,</span> <span class="n">MaxPooling2D</span>
<span class="kn">from</span> <span class="nn">keras.utils</span> <span class="kn">import</span> <span class="n">np_utils</span>
<span class="kn">from</span> <span class="nn">keras</span> <span class="kn">import</span> <span class="n">backend</span> <span class="k">as</span> <span class="n">K</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Populating the interactive namespace from numpy and matplotlib
</code></pre></div></div>
<h2 id="convolutional-neural-networks--a-very-brief-introduction">Convolutional neural networks : A very brief introduction</h2>
<p>To quote wikipedia:</p>
<blockquote>
<p>Convolutional neural networks are biologically inspired variants of multilayer perceptrons, designed to emulate the behaviour of a visual cortex. These models mitigate the challenges posed by the MLP architecture by exploiting the strong spatially local correlation present in natural images.</p>
</blockquote>
<p>One principle in machine learning is to create a feature map for the data and then use your favourite classifier on those features. For image data this might be the presence of straight lines, curved lines, the placement of holes, etc. This strategy can be very problem-dependent. Instead of having to engineer features for each specific problem, it would be better to generate the features automatically and combine them with the classifier. CNNs are a way to achieve this.</p>
<h3 id="automatic-feature-engineering">Automatic feature engineering</h3>
<p>Filters, or convolution kernels, can be treated as automatic feature detectors. A number of filters is set beforehand. Each filter is convolved with every local patch of the input image, and the weights of each filter are shared across locations, which reduces location dependence and the number of parameters. The end result is a multi-dimensional array containing a copy of the original data with each filter applied to it.</p>
<p><img class="center-block img-responsive" src="http://cs231n.github.io/assets/cnn/depthcol.jpeg" alt="complex models" /></p>
<p>For a classification task, one or more fully connected layers can be added after the convolutional layers. The final layer has one output per class.</p>
<h3 id="pooling">Pooling</h3>
<p>Once convolutions have been performed across the whole image, we need some way of down-sampling. The easiest and
most common way is to perform max pooling: for a given pool size, the maximum value within each subset of the filtered image is returned as the output. A diagram of this is shown below</p>
<p><img class="center-block img-responsive" src="https://upload.wikimedia.org/wikipedia/commons/e/e9/Max_pooling.png" alt="complex models" /></p>
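<p>To make this concrete, here is a minimal numpy sketch of 2x2 max pooling (the <code>max_pool</code> helper below is illustrative only, not part of the original notebook; keras provides this as the <code>MaxPooling2D</code> layer used later):</p>

```python
import numpy as np

def max_pool(x, pool=(2, 2)):
    """Down-sample by taking the maximum over non-overlapping blocks."""
    ph, pw = pool
    h, w = x.shape
    # trim any ragged edge, then reshape into blocks and reduce each block
    x = x[: h - h % ph, : w - w % pw]
    return x.reshape(x.shape[0] // ph, ph, x.shape[1] // pw, pw).max(axis=(1, 3))

a = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool(a))
# [[6 8]
#  [3 4]]
```

<p>Each 2x2 block collapses to its maximum, halving both spatial dimensions.</p>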
<h3 id="mnist-data-set">MNIST data set</h3>
<p>We’ll begin by loading in the MNIST data set, which is a standard set of 28x28 grayscale images of handwritten numerical digits. Keras comes with it built in and automatically splits the data into a training and validation set.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># the data, shuffled and split between train and test sets
</span><span class="p">(</span><span class="n">x_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">),</span> <span class="p">(</span><span class="n">x_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">)</span> <span class="o">=</span> <span class="n">mnist</span><span class="p">.</span><span class="n">load_data</span><span class="p">()</span>
</code></pre></div></div>
<h3 id="convolutions-on-image">Convolutions on image</h3>
<p>Let’s get some insight into what a random filter applied to a test image does. We’ll compare this to the trained filters at the end.</p>
<p>Each filtered pixel in the image is defined by \(C_i = \sum_j{I_{i+j-k} W_j}\), where \(W\) is the filter (sometimes known as a kernel), \(j\) is the 2D spatial index over \(W\), \(I\) is the input and \(k\) is the coordinate of the center of \(W\), specified by origin in the input parameters.</p>
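<p>As a sanity check on this formula, a direct double-loop version can be written in a few lines (a hedged sketch: CNN layers actually compute this cross-correlation, whereas a true convolution first flips the kernel; for symmetric kernels such as the edge-detection filter used later, the two agree):</p>

```python
import numpy as np

def cross_correlate_valid(image, kernel):
    """Slide the kernel over the image with no padding ('valid' mode)
    and take the weighted sum at each position."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

img = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
w = np.array([[1, 0],
              [0, 1]])
print(cross_correlate_valid(img, w))
# [[ 6.  8.]
#  [12. 14.]]
```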
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">scipy</span> <span class="kn">import</span> <span class="n">signal</span>
<span class="n">i</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="n">x_train</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">x_train</span><span class="p">[</span><span class="n">i</span><span class="p">,:,:]</span>
<span class="n">plt</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">c</span><span class="p">,</span><span class="n">cmap</span><span class="o">=</span><span class="s">'gray'</span><span class="p">);</span> <span class="n">plt</span><span class="p">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'original image'</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">18</span><span class="p">,</span><span class="mi">8</span><span class="p">))</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">):</span>
<span class="n">k</span> <span class="o">=</span> <span class="o">-</span><span class="mf">1.0</span> <span class="o">+</span> <span class="mf">1.0</span><span class="o">*</span><span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="p">)</span>
<span class="n">c_digit</span> <span class="o">=</span> <span class="n">signal</span><span class="p">.</span><span class="n">convolve2d</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">boundary</span><span class="o">=</span><span class="s">'symm'</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="s">'same'</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">c_digit</span><span class="p">,</span><span class="n">cmap</span><span class="o">=</span><span class="s">'gray'</span><span class="p">);</span> <span class="n">plt</span><span class="p">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">);</span>
</code></pre></div></div>
<p><img class="center-block img-responsive" src="https://sempwn.github.io/img/conv_intro/output_8_0.png" alt="complex models" /></p>
<p><img class="center-block img-responsive" src="https://sempwn.github.io/img/conv_intro/output_8_1.png" alt="complex models" /></p>
<p>As you can see, the random filters aren’t capable of differentiating different parts or features of the image. We do know, however, that non-random filters are very good at things like edge detection. Let’s compare the random filters above to a standard <a href="https://en.wikipedia.org/wiki/Kernel_%28image_processing%29">edge-detection filter</a>. One such filter is used below</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#define edge-detection filter
</span><span class="n">k</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span>
<span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="o">-</span><span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span>
<span class="p">]</span>
<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">();</span>
<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">c</span><span class="p">,</span><span class="n">cmap</span><span class="o">=</span><span class="s">'gray'</span><span class="p">);</span> <span class="n">plt</span><span class="p">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'original image'</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">);</span>
<span class="n">c_digit</span> <span class="o">=</span> <span class="n">signal</span><span class="p">.</span><span class="n">convolve2d</span><span class="p">(</span><span class="n">c</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">boundary</span><span class="o">=</span><span class="s">'symm'</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="s">'same'</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">c_digit</span><span class="p">,</span><span class="n">cmap</span><span class="o">=</span><span class="s">'gray'</span><span class="p">);</span> <span class="n">plt</span><span class="p">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'edge-detection image'</span><span class="p">);</span>
</code></pre></div></div>
<p><img class="center-block img-responsive" src="https://sempwn.github.io/img/conv_intro/output_10_0.png" alt="png" /></p>
<h3 id="keras-introduction">Keras introduction</h3>
<p>We’re using <a href="https://keras.io">keras</a> to construct and fit the convolutional neural network. Quoting their website:</p>
<blockquote>
<p>Keras is a high-level neural networks API, written in Python and capable of running on top of either <a href="https://www.tensorflow.org">TensorFlow</a> or <a href="http://deeplearning.net/software/theano/">Theano</a>. It was developed with a focus on enabling fast experimentation.
Being able to go from idea to result with the least possible delay is key to doing good research.</p>
</blockquote>
<p>We can rapidly develop a convolutional neural network in order to experiment with our image classification task. The first step will be to pre-process the data into a form that can be fed into a keras model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">batch_size</span> <span class="o">=</span> <span class="mi">128</span>
<span class="n">nb_classes</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">nb_epoch</span> <span class="o">=</span> <span class="mi">6</span>
<span class="c1"># input image dimensions
</span><span class="n">img_rows</span><span class="p">,</span> <span class="n">img_cols</span> <span class="o">=</span> <span class="mi">28</span><span class="p">,</span> <span class="mi">28</span>
<span class="c1"># number of convolutional filters to use
</span><span class="n">nb_filters</span> <span class="o">=</span> <span class="mi">32</span>
<span class="c1"># size of pooling area for max pooling
</span><span class="n">pool_size</span> <span class="o">=</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="c1"># convolution kernel size
</span><span class="n">kernel_size</span> <span class="o">=</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="k">if</span> <span class="n">K</span><span class="p">.</span><span class="n">image_data_format</span><span class="p">()</span> <span class="o">==</span> <span class="s">'channels_first'</span><span class="p">:</span>
<span class="n">x_train</span> <span class="o">=</span> <span class="n">x_train</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">x_train</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">1</span><span class="p">,</span> <span class="n">img_rows</span><span class="p">,</span> <span class="n">img_cols</span><span class="p">)</span>
<span class="n">x_test</span> <span class="o">=</span> <span class="n">x_test</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">x_test</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">1</span><span class="p">,</span> <span class="n">img_rows</span><span class="p">,</span> <span class="n">img_cols</span><span class="p">)</span>
<span class="n">input_shape</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">img_rows</span><span class="p">,</span> <span class="n">img_cols</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">x_train</span> <span class="o">=</span> <span class="n">x_train</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">x_train</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">img_rows</span><span class="p">,</span> <span class="n">img_cols</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">x_test</span> <span class="o">=</span> <span class="n">x_test</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">x_test</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">img_rows</span><span class="p">,</span> <span class="n">img_cols</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">input_shape</span> <span class="o">=</span> <span class="p">(</span><span class="n">img_rows</span><span class="p">,</span> <span class="n">img_cols</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="c1">#sub-sample of test data to improve training speed. Comment out
#if you want to train on full dataset.
</span><span class="n">x_train</span> <span class="o">=</span> <span class="n">x_train</span><span class="p">[:</span><span class="mi">20000</span><span class="p">,:,:,:]</span>
<span class="n">y_train</span> <span class="o">=</span> <span class="n">y_train</span><span class="p">[:</span><span class="mi">20000</span><span class="p">]</span>
<span class="c1">#normalise the images and double check the shape and size of the image data
</span><span class="n">x_train</span> <span class="o">=</span> <span class="n">x_train</span><span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="s">'float32'</span><span class="p">)</span>
<span class="n">x_test</span> <span class="o">=</span> <span class="n">x_test</span><span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="s">'float32'</span><span class="p">)</span>
<span class="n">x_train</span> <span class="o">/=</span> <span class="mi">255</span>
<span class="n">x_test</span> <span class="o">/=</span> <span class="mi">255</span>
<span class="k">print</span><span class="p">(</span><span class="s">'x_train shape:'</span><span class="p">,</span> <span class="n">x_train</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">x_train</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="s">'train samples'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">x_test</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="s">'test samples'</span><span class="p">)</span>
<span class="c1"># convert class vectors to binary class matrices
</span><span class="n">y_test_inds</span> <span class="o">=</span> <span class="n">y_test</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
<span class="n">y_train_inds</span> <span class="o">=</span> <span class="n">y_train</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
<span class="n">y_train</span> <span class="o">=</span> <span class="n">keras</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">to_categorical</span><span class="p">(</span><span class="n">y_train</span><span class="p">,</span> <span class="n">nb_classes</span><span class="p">)</span>
<span class="n">y_test</span> <span class="o">=</span> <span class="n">keras</span><span class="p">.</span><span class="n">utils</span><span class="p">.</span><span class="n">to_categorical</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">nb_classes</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>('x_train shape:', (20000, 28, 28, 1))
(20000, 'train samples')
(10000, 'test samples')
</code></pre></div></div>
<h3 id="tricks-to-avoid-overfitting">Tricks to avoid overfitting</h3>
<p>20000 data-points isn’t a huge amount for the size of the models we’re considering.</p>
<ul>
<li>One trick to avoid overfitting is to use <a href="http://jmlr.org/papers/v15/srivastava14a.html">drop-out</a>. This is where activations are randomly set to zero with a given probability during training, to avoid the model becoming too dependent on a small number of weights.</li>
<li>We can also consider <a href="https://en.wikipedia.org/wiki/Tikhonov_regularization">ridge</a> or <a href="https://en.wikipedia.org/wiki/Lasso_%28statistics%29">LASSO</a> regularisation as a way of trimming down the dependency and effective number of parameters.</li>
<li><a href="https://en.wikipedia.org/wiki/Early_stopping">Early stopping</a> and <a href="https://arxiv.org/abs/1502.03167">Batch Normalisation</a> are other strategies to help control over-fitting.</li>
</ul>
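<p>To illustrate the first trick, inverted drop-out can be sketched in plain numpy (illustrative only; the <code>Dropout</code> layer in keras handles all of this for us):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.5):
    """Inverted drop-out: zero each unit with probability `rate` during
    training, rescaling survivors by 1/(1 - rate) so the expected
    activation is unchanged and no rescaling is needed at test time."""
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

x = np.ones((4, 5))
y = dropout(x, rate=0.5)  # surviving units become 2.0, the rest 0.0
```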
<h3 id="constructing-the-model">Constructing the model</h3>
<p>The model we’ll be using for classification is a convolutional neural network with a single convolutional layer followed by a single fully connected layer. This is probably the simplest convolutional neural network that could be constructed, so it’ll be interesting to see how it performs. We also introduce dropout between the two layers as our preferred method of avoiding overfitting.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#Create sequential convolutional multi-layer perceptron with max pooling and dropout
#uncomment layers below to produce a more accurate score (in the interest of time we use a shallower model)
</span><span class="n">model</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">()</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Conv2D</span><span class="p">(</span><span class="n">nb_filters</span><span class="p">,</span> <span class="n">kernel_size</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span>
<span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">,</span>
<span class="n">input_shape</span><span class="o">=</span><span class="n">input_shape</span><span class="p">))</span>
<span class="c1">#model.add(Conv2D(64, (3, 3), activation='relu'))
</span><span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">MaxPooling2D</span><span class="p">(</span><span class="n">pool_size</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.25</span><span class="p">))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Flatten</span><span class="p">())</span>
<span class="c1">#model.add(Dense(128, activation='relu'))
</span><span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.5</span><span class="p">))</span>
<span class="n">model</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Dense</span><span class="p">(</span><span class="n">nb_classes</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'softmax'</span><span class="p">))</span>
<span class="n">model</span><span class="p">.</span><span class="nb">compile</span><span class="p">(</span><span class="n">loss</span><span class="o">=</span><span class="n">keras</span><span class="p">.</span><span class="n">losses</span><span class="p">.</span><span class="n">categorical_crossentropy</span><span class="p">,</span>
<span class="n">optimizer</span><span class="o">=</span><span class="n">keras</span><span class="p">.</span><span class="n">optimizers</span><span class="p">.</span><span class="n">Adam</span><span class="p">(),</span>
<span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="s">'accuracy'</span><span class="p">])</span>
</code></pre></div></div>
<p>Let’s see what we’ve constructed layer by layer. This is useful for checking that the shapes of each layer are what you expect.
Note that in the first layer the images are now 26x26. This is because the convolution avoids going over the edge of the
image, chopping off the outer border of the image.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="p">.</span><span class="n">summary</span><span class="p">()</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 26, 26, 32) 320
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 13, 13, 32) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 13, 13, 32) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 5408) 0
_________________________________________________________________
dropout_2 (Dropout) (None, 5408) 0
_________________________________________________________________
dense_1 (Dense) (None, 10) 54090
=================================================================
Total params: 54,410.0
Trainable params: 54,410.0
Non-trainable params: 0.0
_________________________________________________________________
</code></pre></div></div>
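<p>The shapes and parameter counts in this summary can be reproduced with some quick arithmetic:</p>

```python
# Arithmetic check of the model summary shown above.
img, k, n_filters, n_classes = 28, 3, 32, 10

conv_out = img - k + 1                     # 'valid' convolution: 26
pool_out = conv_out // 2                   # 2x2 max pooling: 13
conv_params = (k * k * 1 + 1) * n_filters  # 9 weights + 1 bias per filter
flat = pool_out * pool_out * n_filters     # length after Flatten
dense_params = (flat + 1) * n_classes      # weights + biases of final layer

print(conv_params, flat, dense_params, conv_params + dense_params)
# 320 5408 54090 54410
```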
<h3 id="model-fitting">Model fitting</h3>
<p>We can now fit the model to the data. We provide the batch size, number of epochs as well as the validation data. We also want the output to be verbose so we’re able to see how the log-loss and accuracy in both the test and validation set changes at the end of each epoch.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">x_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">epochs</span><span class="o">=</span><span class="n">nb_epoch</span><span class="p">,</span>
<span class="n">verbose</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">validation_data</span><span class="o">=</span><span class="p">(</span><span class="n">x_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">))</span>
<span class="n">score</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">evaluate</span><span class="p">(</span><span class="n">x_test</span><span class="p">,</span> <span class="n">y_test</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Test loss:'</span><span class="p">,</span> <span class="n">score</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Test accuracy:'</span><span class="p">,</span> <span class="n">score</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Train on 20000 samples, validate on 10000 samples
Epoch 1/6
20000/20000 [==============================] - 16s - loss: 0.7113 - acc: 0.8046 - val_loss: 0.2958 - val_acc: 0.9171
Epoch 2/6
20000/20000 [==============================] - 15s - loss: 0.3009 - acc: 0.9114 - val_loss: 0.2093 - val_acc: 0.9425
Epoch 3/6
20000/20000 [==============================] - 15s - loss: 0.2325 - acc: 0.9317 - val_loss: 0.1689 - val_acc: 0.9548
Epoch 4/6
20000/20000 [==============================] - 16s - loss: 0.1853 - acc: 0.9460 - val_loss: 0.1385 - val_acc: 0.9620
Epoch 5/6
20000/20000 [==============================] - 15s - loss: 0.1610 - acc: 0.9524 - val_loss: 0.1216 - val_acc: 0.9660
Epoch 6/6
20000/20000 [==============================] - 15s - loss: 0.1451 - acc: 0.9571 - val_loss: 0.1103 - val_acc: 0.9685
('Test loss:', 0.11029711169451475)
('Test accuracy:', 0.96850000000000003)
</code></pre></div></div>
<h2 id="results">Results</h2>
<p>Let’s take a random digit example to find out how confident the model is at classifying the correct category</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#choose a random digit from the test set and show probabilities for each class.
</span><span class="n">i</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="nb">len</span><span class="p">(</span><span class="n">x_test</span><span class="p">))</span>
<span class="n">digit</span> <span class="o">=</span> <span class="n">x_test</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">reshape</span><span class="p">(</span><span class="mi">28</span><span class="p">,</span><span class="mi">28</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">();</span>
<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Example of digit: {}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">y_test_inds</span><span class="p">[</span><span class="n">i</span><span class="p">]));</span>
<span class="n">plt</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">digit</span><span class="p">,</span><span class="n">cmap</span><span class="o">=</span><span class="s">'gray'</span><span class="p">);</span> <span class="n">plt</span><span class="p">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">);</span>
<span class="n">probs</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">predict_proba</span><span class="p">(</span><span class="n">digit</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">28</span><span class="p">,</span><span class="mi">28</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span><span class="n">batch_size</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Probabilities for each digit class'</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="n">bar</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">10</span><span class="p">),</span><span class="n">probs</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">10</span><span class="p">),</span><span class="n">align</span><span class="o">=</span><span class="s">'center'</span><span class="p">);</span> <span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">10</span><span class="p">),</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">10</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">));</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1/1 [==============================] - 0s
</code></pre></div></div>
<p><img class="center-block img-responsive" src="https://sempwn.github.io/img/conv_intro/output_21_1.png" alt="png" /></p>
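The per-class probabilities returned by <code>predict_proba</code> come from the softmax applied to the final dense layer: each class score is exponentiated and normalised so the ten values sum to one. A minimal numpy sketch (the logits below are invented for illustration, not taken from the trained model):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax: exponentiate shifted logits, then normalise."""
    shifted = logits - np.max(logits)   # shift so exp() cannot overflow
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical class scores for the 10 digit classes (class 3 dominates)
logits = np.array([1.2, -0.5, 0.3, 5.1, 0.0, -1.0, 0.8, 0.2, 2.0, -0.3])
probs = softmax(logits)
```

The bar chart above is simply this vector plotted per class; a confident prediction concentrates almost all mass on one digit.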
<h3 id="wrong-predictions">Wrong predictions</h3>
<p>Let’s look more closely at the predictions on the test data that weren’t correct</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">predictions</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">predict_classes</span><span class="p">(</span><span class="n">x_test</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 9792/10000 [============================>.] - ETA: 0s
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">inds</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">predictions</span><span class="p">))</span>
<span class="n">wrong_results</span> <span class="o">=</span> <span class="n">inds</span><span class="p">[</span><span class="n">y_test_inds</span><span class="o">!=</span><span class="n">predictions</span><span class="p">]</span>
</code></pre></div></div>
<h3 id="example-of-an-incorrectly-labelled-digit">Example of an incorrectly labelled digit</h3>
<p>We’ll randomly choose an incorrectly labelled digit from the test set and plot the predicted probability
for each class. For an incorrectly labelled digit, the probabilities are generally lower and more spread across
classes than for a correctly labelled one.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#choose a random wrong result from the test set
</span><span class="n">i</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="nb">len</span><span class="p">(</span><span class="n">wrong_results</span><span class="p">))</span>
<span class="n">i</span> <span class="o">=</span> <span class="n">wrong_results</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="n">digit</span> <span class="o">=</span> <span class="n">x_test</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">reshape</span><span class="p">(</span><span class="mi">28</span><span class="p">,</span><span class="mi">28</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">();</span>
<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Digit {}'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">y_test_inds</span><span class="p">[</span><span class="n">i</span><span class="p">]));</span>
<span class="n">plt</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">digit</span><span class="p">,</span><span class="n">cmap</span><span class="o">=</span><span class="s">'gray'</span><span class="p">);</span> <span class="n">plt</span><span class="p">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">);</span>
<span class="n">probs</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">predict_proba</span><span class="p">(</span><span class="n">digit</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">28</span><span class="p">,</span><span class="mi">28</span><span class="p">,</span><span class="mi">1</span><span class="p">),</span><span class="n">batch_size</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Digit classification probability'</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="n">bar</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">10</span><span class="p">),</span><span class="n">probs</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">10</span><span class="p">),</span><span class="n">align</span><span class="o">=</span><span class="s">'center'</span><span class="p">);</span> <span class="n">plt</span><span class="p">.</span><span class="n">xticks</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">10</span><span class="p">),</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">10</span><span class="p">).</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">));</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1/1 [==============================] - 0s
</code></pre></div></div>
<p><img class="center-block img-responsive" src="https://sempwn.github.io/img/conv_intro/output_26_1.png" alt="png" /></p>
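One way to make “more spread between classes” precise is the entropy of the predicted probability vector: a confident prediction has low entropy, a uniform one has the maximum. A short numpy sketch using made-up probability vectors (not outputs of the trained model):

```python
import numpy as np

def entropy(probs):
    """Shannon entropy in bits; higher means mass is spread across classes."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                      # drop zeros: 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

# Hypothetical probability vectors for a confident and an uncertain prediction
confident = np.array([0.96, 0.01, 0.01, 0.005, 0.005, 0.0, 0.0, 0.0, 0.005, 0.005])
spread = np.full(10, 0.1)             # uniform over the 10 digit classes
```

For ten classes the entropy tops out at log2(10) ≈ 3.32 bits for the uniform vector, so incorrectly labelled digits like the one above tend to sit higher on this scale than correctly labelled ones.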
<h3 id="comparison-between-incorrectly-labelled-digits-and-all-digits">Comparison between incorrectly labelled digits and all digits</h3>
<p>For this example digit, the prediction is much less confident when it’s wrong. Is that always the case? Let’s
check by examining the maximum probability in any category for all digits that are incorrectly labelled.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">prediction_probs</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">predict_proba</span><span class="p">(</span><span class="n">x_test</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">32</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 9856/10000 [============================>.] - ETA: 0s
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">wrong_probs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">prediction_probs</span><span class="p">[</span><span class="n">ind</span><span class="p">][</span><span class="n">digit</span><span class="p">]</span> <span class="k">for</span> <span class="n">ind</span><span class="p">,</span><span class="n">digit</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">wrong_results</span><span class="p">,</span><span class="n">predictions</span><span class="p">[</span><span class="n">wrong_results</span><span class="p">])])</span>
<span class="n">all_probs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">prediction_probs</span><span class="p">[</span><span class="n">ind</span><span class="p">][</span><span class="n">digit</span><span class="p">]</span> <span class="k">for</span> <span class="n">ind</span><span class="p">,</span><span class="n">digit</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">predictions</span><span class="p">)),</span><span class="n">predictions</span><span class="p">)])</span>
<span class="c1">#plot as histogram
</span><span class="n">plt</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">wrong_probs</span><span class="p">,</span><span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span><span class="n">normed</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span><span class="n">label</span><span class="o">=</span><span class="s">'wrongly-labeled'</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="n">hist</span><span class="p">(</span><span class="n">all_probs</span><span class="p">,</span><span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span><span class="n">normed</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span><span class="n">label</span><span class="o">=</span><span class="s">'all labels'</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">();</span>
<span class="n">plt</span><span class="p">.</span><span class="n">title</span><span class="p">(</span><span class="s">'Comparison between wrong and correctly classified labels'</span><span class="p">);</span>
<span class="n">plt</span><span class="p">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'highest probability'</span><span class="p">);</span>
</code></pre></div></div>
<p><img class="center-block img-responsive" src="https://sempwn.github.io/img/conv_intro/output_29_0.png" alt="png" /></p>
<p>It appears in general that when a digit is wrongly labelled, the model provides it with a lower probability than when it’s correctly labelled. We would expect these two groups to become more separate as the model accuracy increases.</p>
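That expectation can be quantified with a crude separation measure: the gap between the mean maximum probability of the two groups, which should grow as the model improves. A sketch using synthetic stand-ins for the two distributions plotted above (the ranges are assumptions, not the real model outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins: correctly classified digits cluster near 1.0,
# wrongly classified ones sit lower and vary more (assumed ranges).
correct_max_probs = rng.uniform(0.90, 1.00, size=1000)
wrong_max_probs = rng.uniform(0.30, 0.90, size=50)

# Gap between group means: larger gap = the histograms overlap less
gap = correct_max_probs.mean() - wrong_max_probs.mean()
```

A gap near zero would mean maximum probability carries no information about correctness; the histogram above suggests the real gap is already substantial.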
<h3 id="whats-been-fitted-">What’s been fitted?</h3>
<p>Let’s look at the convolutional layer of the model and the kernels that have been learnt. First we’ll check the dimensions of the first layer to see what we need to extract.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span> <span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">layers</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">get_weights</span><span class="p">()[</span><span class="mi">0</span><span class="p">].</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(3, 3, 1, 32)
</code></pre></div></div>
<p>Now let’s visualise the learnt filters. Remember that each of these filters is convolved with the image to produce a set of filtered images that can be used for classification.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">weights</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">layers</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">get_weights</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">nb_filters</span><span class="p">):</span>
<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">6</span><span class="p">,</span><span class="mi">6</span><span class="p">,</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">weights</span><span class="p">[:,:,</span><span class="mi">0</span><span class="p">,</span><span class="n">i</span><span class="p">],</span><span class="n">cmap</span><span class="o">=</span><span class="s">'gray'</span><span class="p">,</span><span class="n">interpolation</span><span class="o">=</span><span class="s">'none'</span><span class="p">);</span> <span class="n">plt</span><span class="p">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">);</span>
</code></pre></div></div>
<p><img class="center-block img-responsive" src="https://sempwn.github.io/img/conv_intro/output_34_0.png" alt="png" /></p>
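To make “convolved with the image” concrete, here is a minimal numpy sketch of the ‘valid’ sliding-window operation that <code>Conv2D</code> applies with each 3×3 kernel. A 28×28 input and a 3×3 kernel give a 26×26 output (28 − 3 + 1), which is why the intermediate activations later have spatial size 26×26. The kernel below is a hand-picked vertical-edge filter, not one of the learnt weights:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation: slide the kernel over every position
    where it fully overlaps the image and take the elementwise sum."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.random.rand(28, 28)        # stand-in for an MNIST digit
kernel = np.array([[-1., 0., 1.],
                   [-1., 0., 1.],
                   [-1., 0., 1.]])    # hand-written vertical-edge filter
filtered = conv2d_valid(image, kernel)  # shape (26, 26)
```

The learnt layer does exactly this 32 times, once per filter, before applying the ReLU activation.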
<h3 id="visualising-intermediate-layers-in-the-cnn">Visualising intermediate layers in the CNN</h3>
<p>To visualise the activations half-way through the CNN, and get some sense of what these convolutional kernels do to the input, we create a new model with the same structure as before but with the final layers removed. We then give it the previously trained weights and predict on a given input. This new model outputs the convolved input passed through the activation for each of the 32 learnt filters.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#Create new sequential model, same as before but just keep the convolutional layer.
</span><span class="n">model_new</span> <span class="o">=</span> <span class="n">Sequential</span><span class="p">()</span>
<span class="n">model_new</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">Conv2D</span><span class="p">(</span><span class="n">nb_filters</span><span class="p">,</span> <span class="n">kernel_size</span><span class="o">=</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span>
<span class="n">activation</span><span class="o">=</span><span class="s">'relu'</span><span class="p">,</span>
<span class="n">input_shape</span><span class="o">=</span><span class="n">input_shape</span><span class="p">))</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#set weights for new model from weights trained on MNIST.
</span><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">):</span>
<span class="n">model_new</span><span class="p">.</span><span class="n">layers</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">set_weights</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">layers</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">get_weights</span><span class="p">())</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#pick a random digit and "predict" on this digit (output will be first layer of CNN)
</span><span class="n">i</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="nb">len</span><span class="p">(</span><span class="n">x_test</span><span class="p">))</span>
<span class="n">digit</span> <span class="o">=</span> <span class="n">x_test</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">reshape</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">28</span><span class="p">,</span><span class="mi">28</span><span class="p">,</span><span class="mi">1</span><span class="p">)</span>
<span class="n">pred</span> <span class="o">=</span> <span class="n">model_new</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">digit</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#check shape of prediction
</span><span class="k">print</span><span class="p">(</span><span class="n">pred</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(1, 26, 26, 32)
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#For each filter, plot its activation on the input digit
</span><span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">18</span><span class="p">,</span><span class="mi">18</span><span class="p">))</span>
<span class="n">filts</span> <span class="o">=</span> <span class="n">pred</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">nb_filters</span><span class="p">):</span>
<span class="n">filter_digit</span> <span class="o">=</span> <span class="n">filts</span><span class="p">[:,:,</span><span class="n">i</span><span class="p">]</span>
<span class="n">plt</span><span class="p">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">6</span><span class="p">,</span><span class="mi">6</span><span class="p">,</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">imshow</span><span class="p">(</span><span class="n">filter_digit</span><span class="p">,</span><span class="n">cmap</span><span class="o">=</span><span class="s">'gray'</span><span class="p">);</span> <span class="n">plt</span><span class="p">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'off'</span><span class="p">);</span>
</code></pre></div></div>
<p><img class="center-block img-responsive" src="https://sempwn.github.io/img/conv_intro/output_40_0.png" alt="png" /></p>
<p>The filters pick out a lot of details from the image including horizontal and vertical lines as well as edges and potentially the terminal points of lines. We’ve only created one convolutional layer in our model. The real power comes when these convolutional layers are stacked together, creating a mechanism by which more general filters can be learnt.</p>
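The payoff of stacking can be read off the shapes alone: two successive ‘valid’ 3×3 convolutions take a 28×28 input to 24×24, and each output pixel then depends on a 5×5 patch of the original image, so two small kernels compose into one larger effective filter. A sketch of the arithmetic:

```python
# Each 'valid' 3x3 convolution shrinks the feature map by 2 and widens
# each output pixel's receptive field by 2.
size, rf = 28, 1                # input width/height, receptive field of a pixel
for layer in range(2):          # two stacked 3x3 convolutional layers
    size = size - 3 + 1         # valid convolution: out = in - kernel + 1
    rf = rf + 2                 # each layer adds (kernel - 1) to the field
# size is now 24, rf is now 5
```

This is the mechanism by which deeper stacks learn more general filters without ever using large kernels.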