James P Houghton

James Houghton - Penny Jar Inference 1 - Prediction

Penny Jar Inference 1 - Prediction

15 Aug 2013

In the last post, we used data from the Philadelphia Mint and from pennies in my penny jar to parameterize a model of pennies in circulation. In this post, we'll see what we can learn from that model.

Before we try to predict the distribution of Denver pennies (as we mentioned in the last post) lets try to get an estimate for the number of Philadelphia pennies our model predicts to be in circulation at any given time. We do this by summing the predicted values for each mint year cohort at each simulation time step. To get a distribution of possible values, we run a standard Monte Carlo simulation, using the traces from our Markov Chain Monte Carlo runs (samples from the distribution of parameters) as the randomized inputs.

First, some small changes to our model to sum the coins in circulation each year. Instead of returning the probability of the data, for use in Bayesian inference, we return the number of pennies in circulation each year:

   def incirculationestimates(coin_counts, production, entry_rate, loss_rate):
      post_production = pd.Series(data = 0, index=coin_counts.Year)
      in_circulation = pd.Series(data = 0, index=coin_counts.Year)
      collection_distribution = pd.Series(data = .000000000001, index=coin_counts.Year)
      total_estimates = pd.Series(data = 0, index=range(1930,2014)) 
      for year in range(1930,2014):
            in_circulation += post_production*entry_rate - in_circulation*loss_rate 
            post_production -= post_production*entry_rate
            post_production[year] = production.Philadelphia[production.Year==year].values
         total_estimates[year] = in_circulation.sum()
      return total_estimates

Now we cycle though all the values in the MCMC output, using them as Monte Carlo input to our model, and accumulate the result in bins that we can plot. We use a logarithmic binning scale because the number of coins in circulation grows so much over the timeframe in question.

  bins = np.logspace(7.6, 11.4, num=100)
  results = pd.DataFrame(data=0, columns=range(1930,2014), index=range(0,100))
  for entry, loss in zip(mcmc.trace('entry_rate')[:],mcmc.trace('loss_rate')[:]): 
    answer = incirculationestimates(coin_counts=coin_counts, production=production, entry_rate=entry, loss_rate=loss)
    for year, bucket in zip(answer.index, np.digitize(answer, bins)):
      results.loc[bucket, year] +=1

What we see is that even though there is uncertainty in the number of coins in circulation, the uncertainty is small compared to the magnitudes themselves:

We predict about 250 billion Philadelphia pennies in circulation. It's hard to find estimates of the actual value, but at least someone on the internet agrees with our order of magnitude.

Now lets try to predict how many pennies from a given year will be in our collection of Denver pennies. As before we'll make some small changes to the model - in this case to simulate potential ways that the coins could come up in that collection, taking advantage of numpy's multinomial distribution.

   def DenverPrediction(production, entry_rate, loss_rate, numcoins):
      post_production = pd.Series(data = 0, index=range(1930,2014))
      in_circulation = pd.Series(data = 0, index=range(1930,2014))
      collection_distribution = pd.Series(data = .000000000001, index=coin_counts.Year)
      for year in range(1930,2014):
            in_circulation += post_production*entry_rate - in_circulation*loss_rate 
            post_production -= post_production*entry_rate
            post_production[year] = production.Denver[production.Year==year].values
            collection_distribution += in_circulation
    #normalize the counts to yield a probability of drawing each type of coin
    collection_distribution /= collection_distribution.sum()
    #draw numcoins from the distribution
    return np.random.multinomial(n=numcoins, pvals=collection_distribution)
Now we'll perform a Monte Carlo simulation using the distributions of input parameters derived in the Philadelphia case, and store each result as a line in our 'predictions' dataframe.

   numcoins = coin_counts.Denver.sum()
   bins = np.arange(0, .07, .001)
   predictions = pd.DataFrame(data=0., columns=range(1930,2014), index=range(0,len(mcmc.trace('entry_rate')[:])))
   for index, [entry, loss] in enumerate(zip(mcmc.trace('entry_rate')[:],mcmc.trace('loss_rate')[:])): 
      predictions.ix[index] = DenverPrediction(production=production, entry_rate=entry, loss_rate=loss, numcoins=numcoins)

We can now make a histogram for each of the columns, and plot these as a heat map, for the 209 penny testing data set:

bins = range(0,20)
hist = pd.DataFrame([pd.Series(np.histogram(predictions[x], bins)[0], index=bins[:-1], name=x) for x in range(1930,2014)])

From these distributions we can make predictions based upon central 'credible intervals' - by finding the central interval that contains 50% of simulation runs, we should have at least 50% confidence that the actual measured value (if our models are correct) will fall into this range. We see this in the image below for Denver Mint Year 1982 pennies. 50% of the simulation runs fall between the two red lines, and the interval itself is defined as between 3 and 6 pennies.  

We can do this with an arbitrary size credible intervals, to see how well our predictions map to reality  - ie, a 50% interval should contain the true value at least half the time, a 25% interval should contain the true value at least one quarter of the time, etc. If we make a credible interval for each mint year, we can then get a score of our accuracy over a number of predictions (one per year) and compare that with our confidence.

central_credible = pd.Series(data=0., index=range(10, 100, 10))

for credible in central_credible.index:
   interval = np.percentile(predictions.sort(axis=1, ascending=False), q=[(100.-credible)/2., (100+credible)/2.], axis=0)
   central_credible[credible] = 100*np.mean([coin_counts.Denver>=interval[0]] and [coin_counts.Denver<=interval[1]])

As our results are discrete and in the single-digit range, our n% confidence interval will necessarily include more than n% of the simulation runs, as we saw in the figure above - the interval 3 to 6 (inclusive) includes a range wider than the 50% confidence interval strictly suggests. This means that we can expect predictions made with the 50% interval will likely be correct more than 50% of the time. This would be preferable to shrinking the credible interval by 1 on either side, which would make us less than 50% accurate.

This is what we indeed see. If instead of working on a discrete data set, we worked on a continuous one (or even a discrete one with more granularity) we would expect to see the dots closer to the red line for a well calibrated prediction.

© 2016 James P. Houghton