Often the data that comes to us is only partially observed: website users out of all potential customers, patients displaying symptoms out of a total population of infected individuals, the number of animals spotted in a field study, and so on. One of the challenges with this type of data is determining the size and characteristics of the total population based on our (hopefully unbiased) sample.
How can we determine the population size based on the type of data we observe? We’ll look at two examples here: zero-truncated count data and mark-recapture data. Zero-truncated data comes about when we can observe the same individuals multiple times (but obviously never measure the individuals we don’t see at all), for example counts from a health registry, police arrest records, or the number of non-clonal tomato plants.
Mark-recapture data is slightly different. Here there are two or more distinct phases or surveys, and the number detected in each survey plus the overlap between them is used to estimate the whole population. Traditionally these techniques are used in ecology, where an animal is marked or tagged, released, and then recaptured at a later date (e.g. when measuring a population of tigers). They are also used where individuals may be captured by different data sources, such as in estimating the prevalence of injection drug use.
Let’s explore these concepts with a simulation. Below is our “field” with individuals of a species (circles) moving around randomly. In the centre is a larger circle denoting our field of view. If an individual wanders into our field of view then we measure it (denoted by a change of colour). However, we can’t directly observe the individuals outside the field of view who haven’t already been measured. You can see that it would take a long time for every individual to wander into the field of view, making it impractical to measure the whole population this way.
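If you’d like to play with this outside the browser, here’s a rough Python sketch of the same idea (the figure itself is driven by d3, so the names and parameters below are my own assumptions, not the original code): individuals take random steps, and each time one enters the central field of view we record a sighting.

```python
import numpy as np

rng = np.random.default_rng(42)

N = 200          # true population size (unknown to the observer)
field = 100.0    # side length of the square field
radius = 10.0    # radius of the central field of view
steps = 500      # number of time steps to simulate

# Random starting positions and a per-individual count of sightings
pos = rng.uniform(0, field, size=(N, 2))
counts = np.zeros(N, dtype=int)
prev_inside = np.zeros(N, dtype=bool)

for _ in range(steps):
    pos += rng.normal(0, 1.0, size=(N, 2))        # random walk
    pos = np.clip(pos, 0, field)                   # stay inside the field
    inside = np.hypot(pos[:, 0] - field / 2,
                      pos[:, 1] - field / 2) < radius
    counts += inside & ~prev_inside                # count each new entry into view
    prev_inside = inside

observed = counts[counts > 0]
print(f"Observed {len(observed)} of {N} individuals")
```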
First we can imagine that we record the number of times each individual wanders into our field of view. We can use these statistics to build up an estimate of the total population by making some assumptions about how the counts are distributed. Assuming there’s a small but constant probability of any individual wandering into the field of view at any time leads to a Poisson distribution of counts, i.e. the probability of a randomly sampled (from the entire population) individual’s count being $x$ is

$$P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}$$
However, we don’t observe the individuals with zero counts, so our data is truncated. The probability of not being observed ($p_0$) is $e^{-\lambda}$. Therefore the probability of a random variate $X$ being observed $x$ times, given that it is observed at all, is

$$P(X = x \mid X > 0) = \frac{\lambda^x e^{-\lambda}}{x!\,(1 - e^{-\lambda})}, \qquad x = 1, 2, \ldots$$
For our data $\{x_0, \ldots, x_{n-1}\}$, the associated log-likelihood is

$$\ell(\lambda) = \sum_{i=0}^{n-1} \left( x_i \log \lambda - \log x_i! \right) - n\lambda - n \log\left(1 - e^{-\lambda}\right)$$
Using the zero-truncated Poisson model, we can estimate the odds of not being observed against being observed, $p_0/(1-p_0)$. Empirically, these odds equal the number of individuals who weren’t observed ($f_0$) divided by the number that were ($N_{obs}$). Combining these together we get the following:

$$\frac{p_0}{1 - p_0} = \frac{f_0}{N_{obs}} \quad \Rightarrow \quad \hat{N} = N_{obs} + f_0 = \frac{N_{obs}}{1 - e^{-\hat{\lambda}}}$$
So now all we need is an estimate of $\lambda$, which we can get by maximizing the likelihood. The tool below shows this in action; notice that for this estimator to work you need to have observed at least one individual twice or more. Notice also that the estimator doesn’t depend on the size of the capture circle, as this is absorbed into the rate $\lambda$.
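To make this concrete, here’s a minimal sketch in Python (the interactive tool itself is written in d3, so this is an assumed re-implementation, not the code behind the figure). Setting the derivative of the log-likelihood above to zero shows the MLE satisfies $\bar{x} = \hat{\lambda} / (1 - e^{-\hat{\lambda}})$, which we can solve numerically:

```python
import numpy as np
from scipy.optimize import brentq

def estimate_population(counts):
    """Estimate total population size from zero-truncated Poisson counts.

    counts: observation counts (all >= 1) for the individuals we saw.
    Returns (lambda_hat, N_hat).
    """
    counts = np.asarray(counts)
    n_obs = len(counts)
    x_bar = counts.mean()
    # The MLE solves x_bar = lambda / (1 - exp(-lambda)); the left-hand
    # side only exceeds 1 when some individual was seen more than once.
    if x_bar <= 1:
        raise ValueError("Need at least one individual observed twice or more")
    score = lambda lam: lam / (1 - np.exp(-lam)) - x_bar
    lam_hat = brentq(score, 1e-9, 10 * x_bar)
    # Inflate the observed count by the estimated probability of detection
    n_hat = n_obs / (1 - np.exp(-lam_hat))
    return lam_hat, n_hat

# Example: 50 individuals seen once, 12 seen twice, 3 seen three times
counts = [1] * 50 + [2] * 12 + [3] * 3
lam_hat, n_hat = estimate_population(counts)
print(f"lambda = {lam_hat:.3f}, estimated population = {n_hat:.1f}")
```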
We now look at a slightly different type of study, where there are two distinct phases of measuring a population. In an ecological study this could be where an initial survey is conducted to measure the population size of a particular species and each individual is tagged. At a later stage the population is measured again, and both the number of individuals and the number of tagged individuals are recorded.
Let’s set up some notation for the problem: $N$ is the unknown total population size, $n$ is the number of individuals marked in the first survey, $K$ is the number of individuals observed in the second survey, and $k$ is the number of individuals observed in the second survey who were tagged in the first.
A simple estimator of the population size is known as the Lincoln estimator (although why it’s called this is a slight mystery and is probably an example of Stigler’s law of eponymy). This estimator assumes that the probability of being observed is the same in both surveys. The fraction of the population marked in the first survey is $n/N$, and the fraction of the second survey’s catch that carries a mark is $k/K$. If capture in the second survey is independent of being marked in the first, we can equate these two fractions and get the following estimator:

$$\hat{N} = \frac{nK}{k}$$
It turns out this only performs well for large sample sizes (and it’s undefined when there are no recaptures, $k = 0$). For smaller sample sizes we can use the Chapman estimator:

$$\hat{N} = \frac{(n+1)(K+1)}{k+1} - 1$$
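Both estimators are one-liners; here’s a small Python sketch with an illustrative (made-up) survey:

```python
def lincoln(n, K, k):
    """Lincoln estimator: equates the marked fraction of the second
    survey's catch (k/K) with the marked fraction of the population (n/N)."""
    return n * K / k

def chapman(n, K, k):
    """Chapman estimator: a bias-corrected variant that also remains
    finite when there are no recaptures (k = 0)."""
    return (n + 1) * (K + 1) / (k + 1) - 1

# Example: 40 tagged in the first survey; the second survey catches 35
# individuals, 8 of which carry tags.
n, K, k = 40, 35, 8
print(f"Lincoln: {lincoln(n, K, k):.1f}")   # 175.0
print(f"Chapman: {chapman(n, K, k):.1f}")   # 163.0
```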
Use the tool below to explore these estimators with our simulation.
You can see under what conditions the Chapman estimator outperforms the Lincoln one, although neither gives us an estimate of the uncertainty around the point estimate. If we wanted to put this into practice, we might consider something like the R package multimark, or PyMC in Python (e.g. this example).
All the interactive examples were coded in d3, which has many fantastic examples for data visualization.