I recently gave a short introduction to non-parametric models in machine learning. Searching around for a good data set to use, I ended up settling on polling data for the 2016 US presidential election. Luckily, FiveThirtyEight makes the data available to the public, nicely formatted, which made it a lot easier to hit the ground running.
Non-parametric models are used when you don’t want to make strong assumptions about the underlying data. Imagine that we’re in a forest and we’re collecting the locations of a certain type of tree. We could maybe use a known distribution to describe this, but the relationship might be so complex that none really fits the bill. Non-parametric approaches such as kernel density estimation can then be used.
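Formally, given samples \(x_1, \dots, x_n\), a kernel density estimate with kernel \(K\) and bandwidth \(h\) is \[\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),\] so the shape of the estimate is driven entirely by the data, with the bandwidth controlling how smooth it is.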
Similarly, we could have data such as traffic to a particular website over time. We could try something like a polynomial regression model here, but again it might not be flexible enough to describe what's going on. Instead, we can use the non-parametric technique known as a Gaussian process.
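A Gaussian process is specified by a mean function \(m(x)\) and a covariance kernel \(k(x, x')\): it says that the function values at any finite set of inputs are jointly Gaussian, with the kernel controlling how quickly the function can vary, without fixing a parametric form like a polynomial.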
Here we're going to use kernel density estimation to estimate the distribution of polls over time, and then use Gaussian processes to capture the polling trends and estimate the expected popular vote share, with uncertainty, at the time of the election.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv("http://projects.fivethirtyeight.com/general-model/president_general_polls_2016.csv")
data.head()
| | cycle | branch | type | matchup | forecastdate | state | startdate | enddate | pollster | grade | ... | adjpoll_clinton | adjpoll_trump | adjpoll_johnson | adjpoll_mcmullin | multiversions | url | poll_id | question_id | createddate | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016 | President | polls-plus | Clinton vs. Trump vs. Johnson | 11/6/16 | U.S. | 11/1/2016 | 11/4/2016 | ABC News/Washington Post | A+ | ... | 46.10899 | 41.63457 | 4.555536 | NaN | NaN | http://abcnews.go.com/Politics/qualifications-... | 48472 | 75960 | 11/6/16 | 23:31:10 6 Nov 2016 |
| 1 | 2016 | President | polls-plus | Clinton vs. Trump vs. Johnson | 11/6/16 | U.S. | 10/31/2016 | 11/4/2016 | Ipsos | A- | ... | 42.71292 | 38.80288 | 6.624687 | NaN | NaN | http://www.realclearpolitics.com/docs/2016/201... | 48484 | 75984 | 11/6/16 | 23:31:10 6 Nov 2016 |
| 2 | 2016 | President | polls-plus | Clinton vs. Trump vs. Johnson | 11/6/16 | U.S. | 11/3/2016 | 11/5/2016 | NBC News/Wall Street Journal | A- | ... | 43.54423 | 40.91608 | 5.148813 | NaN | NaN | http://www.nbcnews.com/storyline/2016-election... | 48480 | 75974 | 11/6/16 | 23:31:10 6 Nov 2016 |
| 3 | 2016 | President | polls-plus | Clinton vs. Trump vs. Johnson | 11/6/16 | U.S. | 11/1/2016 | 11/3/2016 | Fox News/Anderson Robbins Research/Shaw & Comp... | A | ... | 46.33723 | 44.03186 | 4.789080 | NaN | * | http://www.foxnews.com/politics/interactive/20... | 48361 | 75811 | 11/4/16 | 23:31:10 6 Nov 2016 |
| 4 | 2016 | President | polls-plus | Clinton vs. Trump vs. Johnson | 11/6/16 | Wisconsin | 10/26/2016 | 10/31/2016 | Marquette University | A | ... | 45.46722 | 40.99278 | 2.582904 | NaN | NaN | https://twitter.com/MULawPoll | 48095 | 75264 | 11/2/16 | 23:31:10 6 Nov 2016 |
5 rows × 27 columns
The table has a lot of columns, so let's list them all.
data.columns.values
array(['cycle', 'branch', 'type', 'matchup', 'forecastdate', 'state',
'startdate', 'enddate', 'pollster', 'grade', 'samplesize',
'population', 'poll_wt', 'rawpoll_clinton', 'rawpoll_trump',
'rawpoll_johnson', 'rawpoll_mcmullin', 'adjpoll_clinton',
'adjpoll_trump', 'adjpoll_johnson', 'adjpoll_mcmullin',
'multiversions', 'url', 'poll_id', 'question_id', 'createddate',
'timestamp'], dtype=object)
Let’s first look at the pollster column and see how many polls each pollster has performed.
data.pollster.value_counts()
Ipsos 2670
Google Consumer Surveys 2073
SurveyMonkey 1671
USC Dornsife/LA Times 357
Rasmussen Reports/Pulse Opinion Research 354
CVOTER International 342
The Times-Picayune/Lucid 312
Public Policy Polling 264
YouGov 222
Emerson College 204
Quinnipiac University 174
Morning Consult 171
Gravis Marketing 171
Marist College 141
SurveyUSA 117
Monmouth University 114
Remington 99
CNN/Opinion Research Corp. 84
Fox News/Anderson Robbins Research/Shaw & Company Research 81
ABC News/Washington Post 78
IBD/TIPP 75
Suffolk University 66
Greenberg Quinlan Rosner (Democracy Corps) 57
University of New Hampshire 48
Dan Jones & Associates 45
Maine People's Resource Center 45
NBC News/Wall Street Journal 45
Siena College 42
Mason-Dixon Polling & Research, Inc. 39
Mitchell Research & Communications 39
...
Political Marketing International, Inc./Red Racing Horses 3
Basswood Research 3
University of Akron 3
Feldman Group 3
Cygnal Political 3
Middle Tennessee State University 3
Braun Research 3
Cole Hargrave Snodgrass & Associates 3
Dartmouth College 3
Target Insyght 3
Expedition Strategies 3
Centre College 3
Hoffman Research Group 3
National Journal 3
Utah Valley University 3
R.L. Repass & Partners 3
Behavior Research Center (Rocky Mountain) 3
OnMessage Inc. 3
Riley Research Associates 3
Southern Illinois University 3
University of Houston 3
Florida Southern College 3
Baruch College 3
First Tuesday Strategies 3
University of Wyoming 3
Meredith College 3
BK Strategies 3
University of Mary Washington 3
Merrill Poll 3
University of Arkansas 3
Name: pollster, dtype: int64
So there's quite a spread: a few pollsters have run many polls, and many have run only a few. Let's plot a histogram of the poll sample sizes.
plt.hist(data.samplesize.dropna().values, bins=40);
plt.yscale('log');
plt.ylabel('frequency'); plt.xlabel('poll sample size');
In this next section we'll clean the data to get it into a format suitable for analysis.
data.forecastdate = pd.to_datetime(data.forecastdate)
data.createddate = pd.to_datetime(data.createddate)
# Days until the election (negative before election day);
# pd.Timestamp replaces the long-deprecated pd.datetime
data['time2election'] = data.createddate - pd.Timestamp(2016, 11, 8)
data['time2election'] = data['time2election'] / np.timedelta64(1, 'D')
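As a quick sanity check on the conversion (a hypothetical example, not from the data), a poll created a week before election day should come out as -7.0:

# One week before 8 Nov 2016 should give -7.0 days
print((pd.Timestamp(2016, 11, 1) - pd.Timestamp(2016, 11, 8)) / np.timedelta64(1, 'D'))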
We're only considering national polls, so remove the state-level ones.
national_data = data[data.state=='U.S.']
Next, drop any rows with a missing sample size.
national_data = national_data[np.isfinite(national_data.samplesize)]
First, let's try to understand how the polls are distributed in time. We'll do this by fitting a kernel density estimate to the time-to-election data.
from sklearn.neighbors import KernelDensity
# Plot a 1D density example
np.random.seed(1)
# Times to election as a column vector, the shape scikit-learn expects
X = np.atleast_2d(data.time2election).T
X_plot = np.atleast_2d(np.linspace(X.min(), 0, 1000)).T
plt.figure(figsize=(18, 8))
kde = KernelDensity(kernel='gaussian', bandwidth=5.0).fit(X)
log_dens = kde.score_samples(X_plot)  # score_samples returns the log-density
plt.plot(X_plot[:, 0], np.exp(log_dens), '-', label="kde estimate")
plt.legend(loc='upper left')
# Jitter the raw poll times below the axis so overlapping points are visible
plt.plot(X[:, 0], -0.005 - 0.01 * np.random.random(X.shape[0]), '+k')
plt.xlim(X.min(), 0)
plt.ylim(-0.015, 0.02)
plt.grid()
plt.xlabel('time to election (days)')
plt.ylabel('density')
plt.show()
I've plotted the raw poll times below the curve, jittered along the y-axis so they're easier to see. The KDE does a good job of capturing the density of the polls. As expected, the number of polls increases towards the election, with a few interesting peaks that would be worth exploring.
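The bandwidth of 5.0 days above was chosen by eye. As a minimal sketch of choosing it more systematically, we could use scikit-learn's GridSearchCV to maximise the cross-validated log-likelihood over the same X as above (the 1-20 day candidate grid here is an arbitrary choice for illustration):

from sklearn.model_selection import GridSearchCV

# Pick the KDE bandwidth by 5-fold cross-validated log-likelihood
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': np.linspace(1.0, 20.0, 20)},
                    cv=5)
grid.fit(X)
print(grid.best_params_)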
Now we'll restrict to a single pollster to reduce the data size, since exact Gaussian process regression scales cubically with the number of points and can take a while to fit for many hundreds of them. The process can be repeated for each pollster to produce a prediction with confidence intervals for each of them. I decided on YouGov as it had a fairly large number of polls.
national_data = national_data[national_data.pollster == 'YouGov']
Now we get to the main machine learning procedure. Scikit-learn recently updated its Gaussian process library, so make sure you're using version 0.18.1 or higher.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
With the library loaded, let's get the data into the standard scikit-learn format. I've also standardised the polls by rescaling them from percentages to fractions and subtracting the mean, which matters for fitting since we won't be explicitly modelling the mean over time.
mpoll = np.mean(national_data.rawpoll_clinton)
sigpoll = 100.  # rescale polls from percentages to fractions
p_clinton = np.array(national_data.rawpoll_clinton/sigpoll)
X = (np.atleast_2d(national_data.time2election).T).astype(float)
y = p_clinton - mpoll/sigpoll  # centre on the mean poll value
#dy = 0.1*np.ones(len(y))/sigpoll  # alternative: a fixed one-point error
dy = p_clinton*(1 - p_clinton)/national_data.samplesize.values  # binomial variance p(1-p)/n
Scikit-learn gives a lot of flexibility in how we define the kernels. I settled on a constant kernel multiplied by a radial basis function (RBF), with length-scale bounds ranging from \(10^{-2}\) to \(10^2\) days.
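For reference, this product kernel is the standard constant-times-squared-exponential form, \[k(t, t') = \sigma^2 \exp\!\left(-\frac{(t - t')^2}{2\ell^2}\right),\] where the constant \(\sigma^2\) sets the signal variance and the length-scale \(\ell\) (in days) controls how far apart two polls can be in time and still be strongly correlated.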
#kernel = C(1.0, (1e-3, 1e3)) * RBF(7., (1e-2, 30.))
#kernel = 1.0 * RBF(length_scale=7.0, length_scale_bounds=(1e-2, 30.0))
kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
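Before fitting, it can be instructive to see what functions this kernel implies. A minimal sketch, drawing a few samples from the (as-yet-unfitted) GP prior; the 200-point grid of days is an arbitrary choice:

# An unfitted GaussianProcessRegressor samples from the prior,
# giving a feel for the smoothness the RBF length-scale implies
prior_gp = GaussianProcessRegressor(kernel=kernel)
t = np.atleast_2d(np.linspace(-250., 0., 200)).T
prior_samples = prior_gp.sample_y(t, n_samples=3, random_state=0)
plt.plot(t, prior_samples)
plt.xlabel('days until election')
plt.ylabel('prior samples');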
Now define the regressor. We also pass a nugget, or per-point noise term, through the alpha parameter, based on the binomial sampling error \(p(1-p)/n\) of each poll.
gp = GaussianProcessRegressor(kernel=kernel, alpha=(dy / y) ** 2,
n_restarts_optimizer=10)
gp.fit(X, y)
GaussianProcessRegressor(alpha=array([ 2.91595e-05, 3.03287e-05, ..., 3.15727e-05, 4.14705e-04]),
copy_X_train=True, kernel=1**2 * RBF(length_scale=10),
n_restarts_optimizer=10, normalize_y=False,
optimizer='fmin_l_bfgs_b', random_state=None)
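After fitting, we can inspect the optimised hyperparameters. In scikit-learn, kernel_ holds the fitted kernel (as opposed to kernel, the prior one), and we can also ask for the log marginal likelihood the optimiser reached:

# The kernel after hyperparameter optimisation, and its marginal likelihood
print(gp.kernel_)
print(gp.log_marginal_likelihood(gp.kernel_.theta))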
# Make the prediction on the meshed x-axis (ask for the predictive standard deviation as well)
x = np.atleast_2d(np.linspace(-250., 0, 1000)).T
y_pred, sigma = gp.predict(x, return_std=True)
# Plot the observations, the prediction and the 95% confidence interval
# based on the predictive standard deviation
fig = plt.figure()
# Undo the standardisation to get back to percentage scale
py = sigpoll*y + mpoll
pdy = sigpoll*dy
py_pred = sigpoll*y_pred + mpoll
psigma = sigpoll*sigma
plt.errorbar(X.ravel(), py, pdy, fmt='r.', markersize=10, label=u'Observations')
plt.plot(x, py_pred, 'b-', label=u'Prediction')
plt.fill(np.concatenate([x, x[::-1]]),
np.concatenate([py_pred - 1.9600 * psigma,
(py_pred + 1.9600 * psigma)[::-1]]),
alpha=.5, fc='b', ec='None', label='95% confidence interval')
plt.xlabel('days until election')
plt.ylabel('Clinton (%)')
plt.ylim(30., 60.)
plt.xlim(-200.,0.)
plt.legend(loc='upper left')
plt.show()
With one nominee predicted, we’ll now predict the other in order to understand the uncertainty in the outcome of the election.
mpoll = np.mean(national_data.rawpoll_trump)
sigpoll = 100.
p_trump = np.array(national_data.rawpoll_trump/sigpoll)
Ty = p_trump - (mpoll)/sigpoll
#Tdy = 0.1*np.ones(len(Ty))/sigpoll  # alternative: a fixed one-point error
Tdy = p_trump*(1 - p_trump)/national_data.samplesize.values  # binomial variance p(1-p)/n
Tgp = GaussianProcessRegressor(kernel=kernel, alpha=(Tdy / Ty) ** 2,
n_restarts_optimizer=10)
Tgp.fit(X, Ty)
GaussianProcessRegressor(alpha=array([ 2.16801e-05, 7.77431e-05, ..., 4.36763e-05, 2.04680e-05]),
copy_X_train=True, kernel=1**2 * RBF(length_scale=10),
n_restarts_optimizer=10, normalize_y=False,
optimizer='fmin_l_bfgs_b', random_state=None)
#Predict for Trump
Ty_pred, Tsigma = Tgp.predict(x, return_std=True)
# Plot both sets of observations, predictions and 95% confidence intervals
# based on the predictive standard deviations
fig = plt.figure(figsize=(18,8))
Tpy = sigpoll*Ty+mpoll
Tpdy = sigpoll*Tdy
Tpy_pred = sigpoll*Ty_pred+mpoll
Tpsigma = sigpoll*Tsigma
plt.errorbar(X.ravel(), py, pdy, fmt='r.', markersize=10, label=u'Clinton Observations')
plt.plot(x, py_pred, 'b-', label=u'Clinton Prediction')
plt.fill(np.concatenate([x, x[::-1]]),
np.concatenate([py_pred - 1.9600 * psigma,
(py_pred + 1.9600 * psigma)[::-1]]),
alpha=.5, fc='b', ec='None', label=None)
plt.errorbar(X.ravel(), Tpy, Tpdy, fmt='k.', markersize=10, label=u'Trump Observations')
plt.plot(x, Tpy_pred, 'r-', label=u'Trump Prediction')
plt.fill(np.concatenate([x, x[::-1]]),
np.concatenate([Tpy_pred - 1.9600 * Tpsigma,
(Tpy_pred + 1.9600 * Tpsigma)[::-1]]),
alpha=.5, fc='r', ec='None', label=None)
plt.xlabel('days until election')
plt.ylabel('Popular Vote (%)')
plt.ylim(30., 50.)
plt.xlim(-200.,0.)
plt.legend(loc='upper left');
#plt.savefig('pres_pred.png',bbox_inches='tight')
Although the model predicts a Clinton victory in the popular vote, it looks remarkably close, and the 95% confidence intervals have some overlap at the time of the election. For any given time point we have a normal distribution for each prediction, which we can use to calculate the probability of Clinton winning over Trump, or vice versa. Of course, in reality many other sources of data contribute towards a prediction. Gaussian processes, however, provide a lot of flexibility for these types of data and, importantly, can quantify the uncertainty of the prediction.
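As a rough sketch of that last point, if we (simplistically) treat the two election-day predictions as independent normals, their difference is also normal and the win probability falls out of its CDF. The independence assumption is a deliberate simplification; in reality the two vote shares are correlated.

from scipy.stats import norm

# P(Clinton > Trump) at t = 0: the gap C - T is N(mu_C - mu_T, sigma_C^2 + sigma_T^2)
# under the independence assumption above
mu_diff = py_pred[-1] - Tpy_pred[-1]                # predicted gap on election day
sd_diff = np.sqrt(psigma[-1]**2 + Tpsigma[-1]**2)   # std of the gap
print(norm.cdf(mu_diff / sd_diff))                  # P(gap > 0)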