

Penny Jar Inference 0 - Model Parameterization

13 Aug 2013

Everybody has a penny jar* - something you toss loose change into when you come home and never do anything with.
Well, this weekend, a few friends and I did something with ours - namely, we made a histogram:
Thanks Zach, Tim, and Adrianne!
Here's the final product:
On the x-axis, you see years from 1937 to 2012, and on the y-axis (clearly) the number of pennies from that year that happened to be in my jar. There are three behaviors I find interesting. First, it's pretty reasonable to expect that earlier years would have fewer coins, as these have fallen out of circulation. Second (surprising to me at the time, but it makes sense when you think about it), there is some time between when a coin is minted and when it passes to the banks, then to businesses, then to my wallet, and finally to the coin jar; this shows up as a lag in the chart, with the most likely coins dating from the late 90's to early 2000's. Lastly, there are spikes and dips from year to year, which reflect both noise and variance in the number of coins produced each year.

Being the sort of person who builds histograms out of their penny jar, I thought it would be interesting to model these effects and see if they held any predictive capacity. We should start by splitting the pennies into a training and a testing data set - luckily, the US Mint did this for us by minting them at two different facilities and stamping one set with a 'D' (for Denver; the others are from Philadelphia). Here's what we get:
The Philadelphia set dominates, as these are the coins that are introduced in Boston, where I live; Denver coins have to make their way east in the pocket of some traveler. This being the case, it may be that our datasets are not truly independent, as older coins have had more time to move east and might make up an outsized portion of the population. We'll save that question for another post - for now, we'll just go with it. We'll train our models on the Philadelphia data, and then use the Denver set to see how well we did.
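
Here's a rough sketch of what that split might look like in pandas (the file name and the 'Denver' column name are my assumptions; the real counts live in the spreadsheet linked in the references):

    import pandas as pd

    #load coin counts per mint year; assumed columns: Year, Philadelphia, Denver
    coin_counts = pd.read_csv('penny_jar_counts.csv')

    #train on the Philadelphia coins, hold out the Denver coins for testing
    training_set = coin_counts[['Year', 'Philadelphia']]
    testing_set = coin_counts[['Year', 'Denver']]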

We'll start by building a System Dynamics model of the process by which coins enter a post-production phase (at the Mint, in the Banks, etc.) and then general circulation, and how coins accumulate in the penny jar, proportional to their relative presence in the general population the year they were collected:
We simplify things by modeling the collection process as sampling 'with replacement', since the number of coins in circulation vastly dwarfs the number of coins going into the penny jar, and we model the distribution of pennies drawn as the average distribution over all years of the collection. (This is equivalent to saying we collect approximately the same number of pennies each year of the collection.) I've written the equations for the model in Python, to make our data integration and analysis task simpler. Here is what it looks like:

    for year in range(1930,2014):
        if(year>1930):
            #coins move from the post-production stock into circulation,
            #and a fraction of circulating coins are lost each year
            in_circulation += post_production*entry_rate - in_circulation*loss_rate
            post_production -= post_production*entry_rate

        if(year<2013): #this must come after the decays are computed
            #add this year's production run to the post-production stock
            post_production[year] = production.Philadelphia[production.Year==year].values

        if(year>=collection_start_year):
            #during the collection window, the jar samples in proportion to circulation
            collection_distribution += in_circulation
  
It's a very simple model with only a few equations, which makes it easy to translate. The first two if-statements deal with edge cases where we run out of data in one direction or the other, and the last tells the model when to start collecting coins.

From the US Mint we have information about how many coins were produced at each mint in a given year:

With this information, we can simulate the number of coins in each mint year cohort in circulation over time. This simulation uses average values of entry rate and loss rate, which we'll calculate in a little bit. In the animation below, we can see how individual mint year cohorts grow as coins enter circulation, and decay with time as they get lost.
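
For reference, here's a minimal sketch of how that simulation might be wrapped up as a function. The function and variable names are mine, and the parameter values are just illustrative stand-ins in the neighborhood of the averages we'll infer later:

    import pandas as pd

    def simulate_circulation(entry_rate, loss_rate, production):
        #track each mint year cohort over time, just as in the model above
        post_production = pd.Series(data=0.0, index=production.Year)
        in_circulation = pd.Series(data=0.0, index=production.Year)
        history = {}
        for year in range(1930, 2014):
            if(year>1930):
                in_circulation += post_production*entry_rate - in_circulation*loss_rate
                post_production -= post_production*entry_rate
            if(year<2013):
                post_production[year] = production.Philadelphia[production.Year==year].values
            history[year] = in_circulation.copy()
        #rows: simulation year, columns: mint year cohort
        return pd.DataFrame(history).T

    #illustrative parameter values, close to the averages inferred below
    cohorts = simulate_circulation(entry_rate=.49, loss_rate=.023, production=production)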
Another way to view this would be to take our animated histogram and map it as a 2d histogram/heat map, with time on the vertical axis, mint year cohort on the horizontal axis, and the number of coins as the color shading.
In this plot, we can see that the largest single cohort in circulation would most likely have been the 1982 mint year coins (the year in which pennies switched their metal content and a large volume of pennies was produced), peaking in the 1984-1990 timeframe.
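
The heat map itself is straightforward to draw from the simulated cohorts - a sketch, reusing the cohorts DataFrame from the snippet above:

    import matplotlib.pyplot as plt

    #color shows the number of coins of each mint year in circulation each year
    plt.imshow(cohorts.values, aspect='auto', origin='lower',
               extent=[cohorts.columns.min(), cohorts.columns.max(),
                       cohorts.index.min(), cohorts.index.max()])
    plt.xlabel('Mint Year Cohort')
    plt.ylabel('Simulation Year')
    plt.colorbar()
    plt.show()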

To see how the simulation responds to changes in the entry rate and loss rate parameters, we can look at the coins in circulation in a given year (we'll use 2012) when the parameters are varied. In the plot below, we vary the Entry Rate horizontally and Loss Rate vertically. Clearly if entry rate is high and loss rate low, then the plot will look a lot like the production figures themselves. If entry rate is low, we expect to see the right side of each plot have a more gradual slope, and if loss rate is high, we expect to see fewer pennies in circulation, and the ones that remain to be weighted to most recent mint years.
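
Here's a sketch of how that small-multiples sweep might be generated, again reusing the simulate_circulation sketch from above (the particular grid values are illustrative, not necessarily the ones used in the figure):

    import numpy as np
    import matplotlib.pyplot as plt

    entry_rates = np.linspace(.1, .9, 9)   #varied horizontally
    loss_rates = np.linspace(.05, .45, 9)  #varied vertically

    fig, axes = plt.subplots(9, 9, sharex=True, sharey=True, figsize=(12, 12))
    for i, loss in enumerate(loss_rates):
        for j, entry in enumerate(entry_rates):
            cohorts = simulate_circulation(entry_rate=entry, loss_rate=loss,
                                           production=production)
            #coins in circulation in 2012, by mint year cohort
            axes[i, j].plot(cohorts.columns, cohorts.loc[2012])
    plt.show()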
Now that we have a sense of the model and its behavior, let's use PyMC, the Python Monte Carlo package, to infer distributions for the input parameters.

   import numpy as np
   import pandas as pd
   import pymc as mc

We'll start by setting 'priors' - our advance estimates for the parameters. Both entry rate and loss rate must physically be between zero and one, otherwise we'd lose a negative number of pennies, or have more coins in circulation than we minted. We're also pretty sure that less than half of the pennies in circulation get lost each year. (Otherwise the Mint would be doing some explaining to Congress.)

   entry_rate = mc.Uniform('entry_rate', lower=0, upper=.99, value=.08)
   loss_rate = mc.Uniform('loss_rate', lower=0, upper=.5, value=.025)

We'll also set the collection start year to 2011.

   collection_start_year = 2011

We now have to define a stochastic function which contains our model and assesses how likely the data is, given the input parameters. (Remember Bayes' Law?) As these probabilities are teechy small (that's the technical term), and all Markov Chain Monte Carlo cares about is their relative values, we'll use the log probability, which is easier for computers to handle.

   @mc.stochastic(trace=True, observed=True) #Don't forget the 'observed' flag!
   def circulation(coin_counts=coin_counts, production=production, entry_rate=entry_rate, loss_rate=loss_rate, value=coin_list):
       post_production = pd.Series(data = 0, index=coin_counts.Year)
       in_circulation = pd.Series(data = 0, index=coin_counts.Year)
       collection_distribution = pd.Series(data = 1e-12, index=coin_counts.Year) #tiny epsilon keeps the log from blowing up on empty cohorts
     
       for year in range(1930,2014):
           if(year>1930):
               in_circulation += post_production*entry_rate - in_circulation*loss_rate
               post_production -= post_production*entry_rate
    
           if(year<2013): #this must come after the decays are computed
               post_production[year] = production.Philadelphia[production.Year==year].values
            
           if(year>=collection_start_year):
               collection_distribution += in_circulation
    
       #normalize the counts to yield a probability of drawing each type of coin
       collection_distribution /= collection_distribution.sum()
    
       #transform to log probability
       log_dist = np.log(collection_distribution)
    
       #find the probability of the data from the distribution, and sum to combine
       log_prob = log_dist[value].sum()
    
       return log_prob

That's essentially the whole model; all that remains is a bit of data munging to turn the coin counts into a list of all observed coin years (2012, 2011, 2011, 2011, 2010, 2010, etc.) - a sketch of that step follows.
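
A minimal sketch of that munging step (assuming coin_counts has the Year and Philadelphia columns used above, with no missing counts):

    #repeat each mint year once for every coin of that year observed in the jar
    coin_list = []
    for _, row in coin_counts.iterrows():
        coin_list += [row['Year']] * int(row['Philadelphia'])

With coin_list in hand, we can now run an MCMC simulation: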

   model = mc.Model([entry_rate, loss_rate, circulation])
   mcmc = mc.MCMC(model)
   mcmc.sample(20000, 5000, 3) #20,000 iterations, 5,000 discarded as burn-in, thinning by 3

And plot our results. Our penny jar data gives us good confidence that the loss rate is between 1.5% and 3% per year, with an average of 2.29%. To me, this is a strikingly precise result, given that it's based on information we literally had lying around.
There is more ambiguity as to the true value of the Entry Rate, which has a mean of 48.7%. We could improve this with more data - specifically, data from coin collections begun in the last year or so. A narrower collection window means that the average distribution used to collect coins more closely represents the final distribution, giving us more resolution on the last few years. If anyone has $10 worth of pennies they collected in the last year, let me know!
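
For reference, the summary numbers and histograms can be pulled straight from the sampler's traces - a sketch using PyMC 2's trace interface:

    import numpy as np
    import matplotlib.pyplot as plt

    loss_trace = mcmc.trace('loss_rate')[:]
    entry_trace = mcmc.trace('entry_rate')[:]

    #posterior means for each parameter
    loss_mean = np.mean(loss_trace)
    entry_mean = np.mean(entry_trace)

    #posterior histogram for the loss rate (same idea for the entry rate)
    plt.hist(loss_trace, bins=50, histtype='stepfilled')
    plt.xlabel('Loss Rate')
    plt.show()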
Lastly, we'll look at the correlation between the two variables. We expect to see a little bit, as a high entry rate with a low loss rate could look qualitatively similar to a high loss rate with a low entry rate (although with very different magnitudes, as we saw in the 9x9 small multiples plot above).

When we plot the valid values of Entry Rate and Loss Rate calculated by our MCMC, we see that there is indeed some correlation. This suggests that when we use the model for prediction, we should base our Monte Carlo runs on the paired traces generated by the MCMC inference, rather than on the two marginal distributions independently, to make sure we capture this behavior.
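A sketch of that check: scatter the paired samples from the two traces, and keep them paired when sampling for prediction.

    import numpy as np
    import matplotlib.pyplot as plt

    entry_trace = mcmc.trace('entry_rate')[:]
    loss_trace = mcmc.trace('loss_rate')[:]

    #each point is one (entry rate, loss rate) pair visited by the sampler
    plt.scatter(entry_trace, loss_trace, alpha=.1)
    plt.xlabel('Entry Rate')
    plt.ylabel('Loss Rate')
    plt.show()

    #sampling rows from the paired traces preserves this correlation downstream
    correlation = np.corrcoef(entry_trace, loss_trace)[0, 1]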


*Not everybody has a penny jar. Not many of those who do make histograms.
References:
US Mint: Mint Marks
Coin counts and production spreadsheet

