# Puttin’a Prior on it

This is a post based on Julia Silge’s post on estimating a beta distribution using Billboard’s Year-End Hot 100 song lyrics from 1958 to 2014. As you see from my lame title, my post will not be nearly as witty or well-designed as hers, but you may find this extension of her work interesting. I conducted this analysis of the Bayesian Blues methods/data for a few reasons:  1) I had picked up a few ideas on using Stan for density estimation and wanted to try them out, 3) I am always on the lookout to practice with dplyr/tidverse; 3) I really enjoy the beta distribution and its intuition (see @drob’s SE post); and 4) Silge’s analysis provided a really cool method and interesting data, of which she was kind to share both (Silge on GH). The data, code, and a commented full markdown of my analysis are posted on GitHub.

## “Singing the Bayesian Beginner Blues“

As noted, the analysis goals and data in this post are directly based on Julia Silge’s post entitled Singing the Bayesian Beginner Blues.  That post used the Empirical Bayes method for beta parameter point estimation and establishing the uncertainty around measurements of U.S. state names mentioned within a large database of song lyrics (courtesy of Kaylin Walker).  Silge’s post first approximates the parameters of a beta distribution from her data (the Empirical part) and then use those parameters and Bayes Theorem  to approximate a rate for each state if it were derived from that beta distribution. Finally, Silge derived a posterior distribution for each state’s rate; from which that uncertainty around the rate was visualized as the 95% Credible Interval.  This is a really cool approached and is demonstrated with great visualizations.  While reading it, I was inspired to figure this out in Stan and started messing with some code.

The primary differences in this analysis include:

1. Is zero inflated to include states not mentioned in lyrics
2. Incorporates mentions of cities with >= 100k population aggregated to their state and compares to analysis without city counts
3. Utilizes Hamiltonian Monte Carlo via rstan to estimate parameters, propagated uncertainty, and predict state mention values given priors, data, and likelihood.

## Estimating Beta Parameters with Stan

The approach in this post uses Stan, a probabilistic modeling language, to achieve the beta parameter estimation, propagate uncertainty, and predict a posterior distribution for each state, as well as the entire population of song-lyric-mention-rates.  Stan is a modernized version of BUGS and JAGS that is cross-platform and endlessly flexible for estimating a huge range of models.  The Stan website is very informative and there are a number of videos of presentations that helped me get my head around the concepts.  The very basics of Stan are an No U-Turns (NUTS) Adaptive Hamiltonian Monte Carlo (HMC) sampling engine and a language with which to declare your model.  The code for the model gets compiled into C++, the model is sampled with HMC, and the posterior is computed with samples returned to the user as a convenient R object.

That above image is from utilized in one of Benjamin Goodrich’s presentations (and also used Bob Carpenter in one of these videos) [Edit: Bob Carpenter created this image and it has been used by the Stan team in various presentations] to illustrate the efficiency in the HMC NUTS algorithm. Compared to Metropolis and Gibbs, HMC NUTS explores the true posterior (right-most panel) way more effectively and efficiently.  There is no contest there.

On a conceptual level, two primary differences between the Stan approach and the Empirical Bayes (EB) approach is that Stan (and other MCMC style samplers) integrates over distributions where the EB uses point estimates. With distributions, we can propagate uncertainty through all of the steps of the modeling sequence. Secondly, the full Bayesian approach uses priors to regularize the parameter estimates given the data and model.  The end result is control over the amount of regularization based on choice of priors and the ability to understand uncertainty because every parameter has a distribution of its plausible values.

In my previous post on estimating Pokemon CP, I did a very similar task by estimating a beta distribution density with JAGS.  The approach here with Stan is more sound than the JAGS approach as HMC is a better sampler than the Gibbs sampler of JAGS, and that I incorporate the prediction of new values within the model. In the previous example, I used a range of beta parameters extracted from the posterior to re-model the predictions.  Doing this as a generated quantities in Stan more accurately propagates the uncertainty.

## The Model: State Mentions in Song Lyrics as a Beta

The beta distribution is thought of as a succession of binomial trials of successes and failures. As such, it covers the domain of a distribution of probabilities of success. To think of it as a rate of success is to conceptualize the problem here as: How many song mentions per unit of population?

• successes = how many songs successfully mention a state by name
• attempts = a state’s population in units of 100k people (or 1,000,000 in Silge)

There is a bit of a dis-junction in intuition here as the success and attempts are on two completely different scales and it is not limited to the domain of [0,1]; there is no upper bound on the number of songs that mention a state relative to population.  An example of a more intuitively sound beta is Robinsons’s batting average whereas counter to the song mentions example, you cannot have more hits than at bats so it remains between [0,1].  However despite this disjunction, there is still ground for the state based interpretation; it is a very useful and convenient distribution, and Silge showed that it fits the observed data pretty darn well. So we go with it! [note: I modeled these data a few other potential distributions, and beta ‘won’ in terms of log likelihood.]

The models follows as such:

$x\sim beta(\alpha,\beta)\\ \alpha \sim \text{half-normal}(1)\\ \beta \sim \text{half-normal}(10)$

predictions for states are drawn from:

$\tilde{x} \sim beta((successes + \alpha), (attempts - n + \beta))$

# Pokémon GO post-evolution CP: A model

### TL:DR

The punchline: It appears that firstly, the metric of Pokémon CP is correlated directly to HP with few, if any, other related metrics.  Secondly and perhaps more interestingly, that the post_evolution CP increase is a random draw from a parameterized beta distribution centered on a CP percent increase specific to each Pokémon species. Let me explain…

Ok, this is a bit odd.  Before last month, all I knew of Pokémon was a yellow thing called Pikachu and something about a card game.  Then Pokémon Go came out a few weeks ago, and I decided to check it out.  Now I am an avid player and find the intricacies quite interesting; in addition to the psychology of the game play.  I don’t spend too much time reading the internet about the game, but it is fun to merge the game with an analysis.  The following post is akin to an Exploratory Data Analysis (EDA) of the data; not a full inference or comprehensive study.  The full code for this analysis can be found at this Gist.

### Why bother?

The ever enjoyable Data Machina email newsletter included a link to a Pokémon dataset on OpenIntro in the last edition and I followed it down the rabbit hole. This data set contains the observations of a number of measurements of a specific Pokémon species before and after a series of evolutions. For non-Poké people, one goal of the game is to collect Pokémon and then evolve them into more powerful forms.  The metric for “more powerful” is the Combat Power (CP).  Like many things in this game, the mechanism behind how the CP is calculated and effected by evolution is unknown.  However, I am sure a quick Google search will turn up plenty of ideas.  The purpose of the data set on OpenIntro is to look at these variables and try to discover how the CP is modified by an evolution.  The end game to that quandary is to more efficiently select the best Pokémon to spend your limited resources on to evolve.  The final chapter is this narrative is that a more efficiently evolved Pokémon will allow you to be more competitive in gym battles and therefore represent your team (one of Blue, Yellow, or Red), gain Pokémon coins, and attract more rare Pokémons.  If a little insight on how the underlying mechanism of CP works gets you there faster, all the better.

The intent of the posting of this data set on OpenIntro is to see if the provided variables contribute meaningfully to the data generating mechanism of the CP relative to evolutions.  As stated on the site:

A key part of Pokémon Go is using evolutions to get stronger Pokémon, and a deeper understanding of evolutions is key to being the greatest Pokémon Go player of all time. This data set covers 75 Pokémon evolutions spread across four species. A wide set of variables are provided, allowing a deeper dive into what characteristics are important in predicting a Pokémon’s final combat power (CP).

Example research questions: (1) What characteristics correspond to an evolved Pokémon with a high combat power? (2) How predictable is CP from an evolution?

Good questions.  The analysis that follows takes a quick look at Q1, concluding that most of the action is in the HP and few other variables.  Question 2 is the more direct focus of this post, but I formalized the question a bit: What accounts for the variation in CP gain per evolution within a given species?

### The Data

The data from OpenIntro is pretty limited; no Big Poké Data here.  There is a total of 75 observations from pre and post-evolutions of Pokés from four different species.  The species are an important distinction.  This analysis will recognize the variation between species, but try to find a universal model that fits all species.  Note the difference in sample size between species.  The uncertainty of small sample size has an important effect here.

> colnames(pokedat)
[1] "name" "species" "cp"
[4] "hp" "weight" "height"
[7] "power_up_stardust" "power_up_candy" "attack_weak"
[10] "attack_weak_type" "attack_weak_value" "attack_strong"
[13] "attack_strong_type" "attack_strong_value" "cp_new"
[16] "hp_new" "weight_new" "height_new"
[19] "power_up_stardust_new" "power_up_candy_new" "attack_weak_new"
[22] "attack_weak_type_new" "attack_weak_value_new" "attack_strong_new"
[25] "attack_strong_type_new" "attack_strong_value_new" "notes"
>

Species Count
Caterpie 10
Eevee 6
Pidgey 39
Weedle 20

Thinking about the data and the two questions above, my interest turned towards univariate correlation between the variables offered figuring out that the real metric is. Diagnosing correlation would be a visual approach that offers an understating of what is moving in-step with what.  Of course “correlation is not… ” and all that, but it sure as heck doesn’t hurt when you are trying to develop a data generating model.  The metric of interest is not only the pre and post-evolution CP (cp and cp_new respectively), but some permutation of that incorporated the fact that species seems to have a big effect on the output.

pokedat <- as_tibble(read.csv(&quot;pokemon.csv&quot;))

dat1 <- mutate(pokedat, delta_cp = cp_new - cp) %>% # net change in cp
# the % change in cp from old to new
mutate(pcnt_d_cp = delta_cp / cp) %>%
# group by species to calculate additonal varaibles
group_by(species) %>%
# species grouped mean percent change
mutate(spec_mean_pdcp = mean(pcnt_d_cp)) %>%
# species grouped std dev of % changes from old to new cp
mutate(spec_mean_pdcp_sd = sd(pcnt_d_cp)) %>%
# difference in % delta cp (pdcp) change from species group mean
mutate(cent_spec_mean_pdcp = pcnt_d_cp - spec_mean_pdcp) %>%
# z score for pdcp change within species group
mutate(z_spec_mean_pdcp = cent_spec_mean_pdcp / spec_mean_pdcp_sd) %>%
data.frame()


the modification of the data in the code above spells out the development of the desired metric.  Since with are talking about an evolutionary process, the desired metric has to do with chance in states.  In this case the change from pre to post-evolution CP; referred to at the delta CP.  Further, since we have multiple species (and there are many more than four in the Pokémon universe), we consider delta CP relative to the mean of each species.  Finally, as each species has a different range of CP as some species are more powerful than other, it is good to look at the percent change per species, as opposed to an absolute CP increase.  Based on these observations, the desired metric is based on the Percent Delta CP (PDCP) per species.  The metric is the PDCP and it is considered relative to each species; or spec_pdcp.  This is the metric that we seek to identify a data generating mechanism for.

# ggplot2 step by step

## Building a ggplot2 Step by Step

I have included this viz on my blog before; as an afterthought to a more complex viz of the same data.  However, I was splitting out the steps to the plot for another purpose and though it would be worth while to post this as a step-by-step how to.  The post below will step through the making of this plot using R and ggplot2 (the dev version from Github).  Each code chunk and accompanying image adds a little bit to the plot on its way to the final plot; depicted here.  Hopefully this can help or inspire someone to take their plot beyond the basics.

Step 1 is to load the proper libraries which include ggplot2 (of course), ggalt for the handy ggsave function, and extrafont for the use of a range of fonts in your plot.  I did this on a mac book pro and using the extra fonts was pretty easy.  I have done the same on a windows machine and it was more of a pain.  Google around and it should work out.

library("ggplot2") # dev version
library("ggalt") # dev version, for ggsave()
library("extrafont")


Create the data… I included R code at the end of the post to recreate the data.frame that is in the plots below.  You will have to run that code block to make an object called dat so that the rest of this code will work.

# run the code block for making the 'dat' data.frame.
# code is located at bottom of this post


So that whole point to why I was making this plot was to show a trend in population change over time and over geography.  The hypothesis put forth by the author of this study (Zubrow, 1974) is that over time population decreased in the east and migrated to the west.  The other blog post I did covers all of this in more details.

If one were starting out in R, they may go straight to the base plot function to plot population over time as shown below.  [to be fair, there are many expert R users that will go straight to the base plot function and make fantastic plots.  I am just not one of them.  That is a battle I will leave to the pros.]

# base R line plot of population over year
plot(dat$Population ~ dat$Year, type = "l")


That result is underwhelming and not very informative.  A jumble of tightly spaced lines don’t tell us much.  Further, this is plotting data across all of the 19 sites we have data on. Instead of pursuing this further in base plot and move to ggplot2.

ggplot2 is built on the Grammar of Graphics model, check out a quick intro here.  The basic idea is to plot by layers using data and geometries intuitively. To me, plotting in ggplot2 is much like building a GIS map.  Data, layers, and geometries; it all sounds pretty cartographic to me.

# One Tree to Rule Them All! InTrees and Rule Based Learing

The basis of this post is the inTrees (interpretable trees) package in R [CRAN].  The inTrees package summarizes the typically complex result of tree based ensemble learners/algorithms into rule-based learners referred to as Simplified Tree Ensemble Learner (STEL). This approach solves the following problem: Tree based ensemble learners, such as RandomForest (RF) and Gradient Boosting Machines (GBM), often deliver superior predictive results compare to more simple decision tree or other low complexity models, but are difficult to interpret or offer insight. I will use an archaeological example here (because my blog title mandates it), but this package/method can be applied to any domain.

### Oh Archaeology… We need to talk:

Ok, let’s get real here.  Archaeology has an illogical and unhealthy repugnance to predictive modeling. IMHO, illogical because the vast majority of archaeological thought process are based on empirically derived patterns based on from previous observation.  All we do is predict based on what data we know (a.k.a. predict from models, a.k.a. predictive modeling). Unhealthy, because it stifles creative thought, advancement of knowledge, and relevance of the field.  Why is this condition so?  Again, IMHO, because archaeologists are stuck in the past (no pun intended), feel the need to cling to images of self-sufficient site-based discovery, have little impetus to change , and would rather stick to our area of research then coalesce to help solve real-world problems. But, what the hell do I know.

What the heck does this have to do with tree based statistical learning?  Nothing and everything.  In no direct sense do these two things overlap, but indirectly, the methods I am discussing here have (once again) IMHO, the potential to bridge the gap between complex multidimensional modeling and practical archaeological decision making.  Further, not only do I think this is true for archaeology, but likely for other fields that have significant stalwart factions.  That is a bold claim for sure, but having navigated this boundary for 20 years, I am really excited about the potential for this method; more so than most any other.

Bottom lines is that the method discussed below, interpretable trees from rule-based learners in the inTrees package for R, allows for the derivation of simple-to-interpret rule based decision making based on empirically derived complex and difficult-to-interpret tree based statistical learning methods.  Or more generally, use fancy-pants models to make simple rules that even the most skeptical archaeologist will have to reason with.  Please follow me…

# Gaussian Process Hyperparameter Estimation

Quick Way longer then expected post and some code for looking into the estimation of kernel hyperparameters using STAN HMC/MCMC and R.  I wanted to drop this work here for safe keeping. Partially for the exercise of thinking it through and writing it down, but also because it my be useful to someone.  I wrote a little about GP in a previous post, but my understanding is rather pedestrian, so these explorations help.  In general GPs are non-linear regression machines that utilize a kernel to reproject your data into a larger dimensional space in order to represent and better approximate the function we are targeting.  Then using a covariance matrix calculated from that kernel, a multivariate Gaussian posterior is derived.  The posterior can then be used for all of the great things that Bayesian analysis can do with a posterior.

Read lots more about GP here…. Big thanks to James Keirstead for blogging a bunch of the code that I used under the hood here and thanks to Bob Carpenter (github code) and the Stan team for great software with top-notch documentation.

#### code:

The R code for all analysis and plots can be found in a gist here, as well as the three Stan model codes, here gp-sim_SE.stangp-predict_SE.stan,and GP_estimate_eta_rho_SE.stan

The hyperparameters of topic here are parameters of the kernel within the GP algorithm.  As with other algorithms that use kernels, a number of functions can be used based on the type of generative function you are approximating.  The most commonly used kernel function for GP (and seemingly Support Vector Machines) is the Squared Exponential (SE), also known as the Radial Basis Function (RBF), Gaussian, or Exponentiated Quadratic function.

#### The Squared Exponential Kernel

The SE kernel is a negative length scale factor rho ($\rho^2$) times the square distance between data points ($(x -x^\prime)^2$) all multiplied by a scale factor eta ($\eta$).  Rho is a shorthand for the length scale which is often written as a denominator as shown below. Eta is a scale factor that determines how far the function varies from the mean. Finally sigma squared ($sigma^2_{noise}$) at the end is the value for the diagonal elements of the matrix where ($i = j$). This last term is not necessarily part of the kernel, but is instead a jitter term to set zero to near zero for numeric reasons.  The matrix created by this function is positive semi-definite and composed of the distance between observations scaled by rho and eta. Many other kernels (linear, periodic, linear * periodic, etc…) can be used here; see the kernel cookbook for examples.

$k_{x,x^\prime}=\eta^2 \exp(-\rho^2 (x -x^\prime)^2)+\delta_{ij}\sigma^2_{noise} \\ \rho^2 = 1/(2\ell^2) \\ \delta_{ij}\sigma^2_{noise}=0.0001$

#### To Fix or to Estimate?

In this post, models are created where  $sigma^2_{noise}$, $\rho^2$ , and $\eta^2$ are all fixed, as well as a model where  $sigma^2_{noise}$ is fixed and $\rho^2$ and $\eta^2$ are free.  In the MCMC probabilistic framework, we can fix $\rho^2$ and $\eta^2$ or any parameter for the most part, or estimate them. To this point, there was a very informative and interesting discussion on stan-users mailing list about why you might want to estimate the SE kernel hyperparameters.  The discussion generally broke across the lines of A) you don’t need to estimate these, just use relatively informative priors based on your domain knowledge, and B) of course you want to estimate these because you may be missing a large chunk of function space and uncertainty if you do not. The conclusion to the thread is a hedge to try it both ways, but there are great bits of info it there regardless.

So while the greatest minds in Hamiltonian Monte Carlo chat about it, I am going to just naively work on the Stan code to do these estimations and see where it takes me.  Even if fixed with informative priors is the way to go, I at least want to know how to write/execute the model that estimates them. So here we go.

# Gaussian Process in Feature Space

[In the last post, I talked about showing various learning algorithms from the perspective of predictions within the “feature space”. Feature space being a perspective of the model that looks at the predictions relative to the variables used in the model, as opposed geographic distribution or individual performance metrics.

I am adding to that post the Gaussian Process (GP) model and a fun animation Or two.

#### So what’s a GP?

The explanation will differ depending on the domain of the person you ask but essentially a GP is a mathematical construction that uses a multivariate Gaussian distribution to represent an infinite number of functions the describe your data, priors, and covariance. Each data point gets Gaussian distribution and these are all jointly represented as the multivariate Gaussian. The choice of the Gaussian is key to this as it makes the problem tractable based on some very convenient properties of the multivariate Gaussian. When referring to a GP one may be talking about regression, classification, a fully Bayesian method, some form or arbitrary functions, or some other cool stuff I am too dense to understand. That is a broad and overly-simplistic explanation, but I am overly-simplistic and trying to figure it out myself.  There is a lot of great material on the web and videos on youtube to learn about GP.

#### So why use it for archaeology?

GP’s allow for the the estimation of stationary nonlinear continuous functions across an arbitrary dimensional space; so it has that going for it… which is nice. But more to the point, it allows for the selection of a specific likelihood, priors, and a covariance structure (kernel) to create a highly tunable probabilisitic classifier.  The examples below use the kernlab::gausspr() function to compute the GP using Laplace approximation with a Gaussian noise kernel.  The $\sigma$ hyperparameter of this Gaussian kernel is what is being optimized in the models below.  There are many kernels that can be used in modeling a GP.  One of the ideal aspects of a GP for archaeological data is that the continuous domain of the GP can over geographic space taking into consideration natural correlation in out samples. This has strong connections to Kriging as a form of spatial regression in the geostatistical realm, as well as relation to the SVM kernel methods of the previous post.

# To Feature Space and Beyond!

There are many of ways to look at predictive models.  Assumptions, data, predictions, features, etc… This post is about looking at predictions from the feature space perspective.  To me, at least for lower dimensional models, this is very instructive and illuminates aspects of the model (aka learning algorithm) that are not apparent in the fit or error metrics. Code for this post is in this gist.

Problem statement: I have a mental model, data, and a learning algorithm from which I derive predictions.  How well do the predictions fit my mental model? I can use error metrics from training and test sets to asses fit. I can use regularization and information criteria to be more confident in my fit, but these approaches only get me so far. [A Bayesian perspective offers a different perceptive on these issues, but I am not covering that here.] What I really want is a way to view how my model and predictions respond across the feature space to assess whether it fits my intuition. How do I visualize this?

### Space: Geographic Vs. Feature

When I say feature space, I mean the way our data points align when measured by the variables/features/covariates that are used to define the target/response/dependent variable we are shooting for.  The easy counter to this is to think about geographic space.  Simply, geographic space is our observations measured by X and Y.  The {X,Y} pair is the horizontal and vertical coordinate pair (latitude, longitude) that put a site one the map.

So simply, geographic space is our site sample locations on a map.  The coordinates of the 199 sites (n=13,700 total measurements) are manipulated to protect site locations.  These points are colored by site.

Feature space?  Feature space is how the sites map out based not on {X,Y}, but based on the measure of some variables at the {X,Y} locations. In this case, I measure two variables at each coordinate, $x_1$ = cd_conf, anf $x_2$ = tpi_250c.  cd_conf is a cost distance to the nearest stream confluence with respect to slope and tpi_250c is the topographic position index over a 250 cell neighborhood.   If we map ${x_1, x_2}$ as our {X,Y} coordinates, we end up with…

FEATURE SPACE!!!!!!!!! If you are familiar with this stuff, you may look at this graphic and say “Holy co-corellation Batman!”.  You would be correct in thinking this.  As each site is uniquely colored, it is apparent that measurements on most sites have a positively correlated stretch to them.  This is because the environment writ large has a correlation between these two variables; sites get caught up in the mix.  This bias is not fully addressed here, but is a big concern that should be addressed in real modeling scenarios.

Either way, feature space is just remapping the same {X,Y} points into the dimension of cd_conf and tpi_250c. Whereas geographic space shows a relatively dispersed site distribution, the feature space shows us that sites are quite clusters around low cd_conf and low tpi_250c.  Most sites are more proximal to a stream confluence and at lower topographic positions that the general environmental background.  Sure, that makes some sense.  So what… Continue reading “To Feature Space and Beyond!”