Archaeological Predictive Modeling Points of Discussion


In the past few weeks I was involved in two discussions with archaeological colleagues on topics specific to the process and issues associated with Archaeological Predictive Modeling (APM).  One conversation was via Twitter with Dr. Colin Wren of the University of Colorado and the other with PhD student Claire Boardman. The two conversations covered different aspects of the APM topic, but each seemed complementary to the other.  I am posting excerpts from these discussions here 1) for posterity and 2) because I think they are really interesting questions which rarely get much attention outside of peer-to-peer conversations.  Also, truth be told, locational modeling of archaeological sites is still at the toddler stage (at best) in archaeology, and simple questions like these are very valid discussion points. Anyway, perhaps you will find this interesting if you don’t know much about APM, useful if you are looking for a basis to instruct from, or useful if you want to frame your own thoughts and build from there.

The discussion with Wren was more specific to the grand challenges of APM; the why and how of a variety of approaches.  Colin had some great probing questions about theoretical issues, but I think I only answered from the methodological perspective.  I am often stuck there because I think we need applicable methods to explore the consequences of our data and theories before we know what is a useful approach and what is not.  I present a brief summary of what Colin and I discussed.  The conversation with Boardman was more about how APM is used in an applied setting, specifically Cultural Resources Management (CRM).  My answers to the questions she posed are pretty generalized and could be further sharpened with a more in-depth discussion.

[Figure: model output in ASCII]

Wren Discussion Summary

Big issues: representing uncertainty, overcoming poor samples, adjusting for sample and variable correlations, establishing model thresholds based on model purpose.

Solutions? Developing a framework of model techniques/approaches that serve specific purposes/conditions (e.g. what works, what doesn’t, and why), common data sets to serve as benchmarks so new methods are comparable, and convincing other archaeologists this is a worthwhile thing to teach their students!

Some background on the “Issues” from the discussion:

  • Uncertainty – documenting, propagating, and visualizing uncertainty all around our models.  Starts with acknowledging and accepting uncertainty, then moves to modeling it (bootstrapping, credible intervals, Bayesian approaches, modeling different scenarios, etc…). Visualizing via multiple maps, varying thresholds, bivariate color legends…
  • Poor samples & correlation – sample location bias, non-systematic sampling, and imbalanced samples are all pretty serious issues. How to fix??? Let me know if you find out. Some components may include random sampling, explicit spatial variance/covariance, rebalancing, over/under sampling, SMOTE, hierarchical models, and what I am currently exploring, distribution regression [good explanation (PDF)].
  • Thresholds – I think this is the single most overlooked part of current modeling practice.  Anytime we classify into present/absent or high/medium/low, we need to draw lines in the continuous distribution of probabilities/potential/sensitivity.  This should be done very cautiously and with respect to the weight/risk assigned to false-positives vs. false-negatives (see the sketch after this list).
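As a concrete illustration of drawing those lines, here is a minimal R sketch (not from the original discussion) that scans candidate thresholds and reports the trade-off; the vectors pred (predicted site probability) and obs (observed 0/1 presence) are hypothetical placeholders:

# Sketch: for each candidate threshold, report sensitivity (site-present
# observations correctly classified) and the proportion of area flagged site-likely.
# `pred` and `obs` are hypothetical inputs, not from any real model.
threshold_scan <- function(pred, obs, thresholds = seq(0.05, 0.95, by = 0.05)) {
  do.call(rbind, lapply(thresholds, function(t) {
    pred_class <- as.integer(pred >= t)
    data.frame(threshold       = t,
               sensitivity     = sum(pred_class == 1 & obs == 1) / sum(obs == 1),
               pct_site_likely = mean(pred_class == 1))
  }))
}

# e.g. take the highest threshold that still captures >= 85% of site-present observations
# scan <- threshold_scan(pred, obs)
# tail(scan[scan$sensitivity >= 0.85, ], 1)

How the false-positive vs. false-negative weighting enters is simply in which of those columns you constrain and which you minimize.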


Boardman Q&A

Development projects:

What is the main reason for creating predictive models?

Assuming we are considering a “predictive model” to be the forecasting of archaeological site locations based on correlated co-variates (environmental or otherwise), the main reasons are to scope the field effort and to help in budgeting.  They allow us to estimate the number of holes we will have to dig and where we will likely target. They also give us a visual vehicle to show the State Historic Preservation Office (SHPO) that we have considered sensitivity and what we think about it across the project area.  On a larger scale, they can be used to estimate potential impacts of various project alternatives.  In this sense, the % impact of site-likely area (or whatever loss metric you choose) can be compared relatively between alternatives.  Models can also be made for research purposes, since a model that seeks to understand the inference of model parameters will also generate predictions and is typically validated by them.  There are plenty of other reasons that one may want to generate predictions from models that deal with site locations, but these are the most common.

What project types/locations would you normally create a predictive model for?

For almost any project, but they are more effective on larger and linear projects like pipelines and highways.  The Pennsylvania model (discussed more below) is now serving as a base model for any work done in that state.  Firms are free to make their own for a project, but they need to reference it against the state model and compare findings.

At what stage in the development planning approval process?

Sometimes as early as the proposal stage, to get a sense of cost.  Otherwise, at the “Phase IA” stage, which is the initial assessment of sensitivity and potential project impacts prior to doing field work.

When you use predictive modelling:

Do you use a standard model design or create a bespoke model for each project?

A little of both.  There is a standard set of parameters we use for the very basic weighted overlay model, but they can be adjusted based on physiographic setting or prior knowledge.  If there are a number of known sites available to fit better parameters to, we do that.  Overall, the design of the model is based on its purpose.  If the purpose is to generate a model that can be easily discussed with the SHPO or client, has variables that all archaeologists will be familiar with, and has easily discussed assumptions, the simple weighted overlay is the most common.  If, in addition to these goals, I want to get a better understanding of site distribution in the study area and the marginal effect of variables on site locations, I may make an additional model via logistic regression, random forests, or a random-effects model depending on my goals.  If a Principal Investigator has a certain research question or prior information on the area, we can make a model that is based on these ideas as well.
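To make the weighted overlay concrete, here is a minimal sketch of what one might look like in R with the terra package (my choice for illustration); the three reclassified layers and their weights are invented, not a standard:

library(terra)

# hypothetical reclassified rasters (1 = low, 2 = medium, 3 = high suitability);
# in practice these come from project data, not random values
r <- rast(nrows = 100, ncols = 100)
slope_class <- setValues(r, sample(1:3, ncell(r), replace = TRUE))
water_class <- setValues(r, sample(1:3, ncell(r), replace = TRUE))
soils_class <- setValues(r, sample(1:3, ncell(r), replace = TRUE))

# simple weighted overlay; the weights here are illustrative only
sensitivity <- 0.5 * slope_class + 0.3 * water_class + 0.2 * soils_class

# cut the continuous score into low/medium/high sensitivity classes
sens_class <- classify(sensitivity,
                       rcl = matrix(c(0.00, 1.50, 1,
                                      1.50, 2.25, 2,
                                      2.25, 3.00, 3), ncol = 3, byrow = TRUE))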

Do you create your models using a GIS application and if so, which one?

A lot of it is done in ArcGIS, but a lot is also done in the statistical programming language R. R can function as a GIS, as well as being a full-fledged programming language. We use QGIS at times as well. In ArcGIS, the raster calculator and other raster tools within the Model Builder framework are helpful, but it can be buggy, slow, and is not very open or reproducible.  R has a steeper learning curve, but offers nearly limitless ability to develop, visualize, and understand your models. QGIS is a great alternative (and sometimes the go-to) to ArcGIS; it is fully capable.

On average how many models do you create in a year?

Speaking strictly of sensitivity models for archaeological locational data (aka “predictive models”), perhaps a dozen final models, but dozens of sub-models and candidate models.  The year and a half of building the PA model involved thousands, if not tens of thousands, of models.  However, we also create many archaeological models that do not serve the basic “predictive model” goals.

In your predictive models:

What datasets do you commonly use in your models?

1/3 arc-second (~10-meter) DEM, National Hydrography Dataset (NHD), National Wetlands Inventory (NWI), surficial and bedrock geology, many derivations of these base variables (e.g. variation in slope over a 3 km neighborhood, cost-distance, or elevation below spring head), and known archaeological site locations. Viewsheds, cost-paths, and proximity to cultural locations are also possible choices.  Things that are often avoided are elevation, density of known sites, and proximity to known sites, unless they are explicitly required for your purpose.
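As an example of deriving one of those variables, a minimal sketch of the “variation in slope over a neighborhood” idea using terra; the DEM path and window size are placeholders:

library(terra)

dem <- rast("dem_10m.tif")   # placeholder path to a 10-m DEM tile

# slope in degrees, then its standard deviation over a moving window
slope    <- terrain(dem, v = "slope", unit = "degrees")
slope_sd <- focal(slope, w = 301, fun = sd, na.rm = TRUE)   # ~3 km window on a 10-m grid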

What method(s) do you use to validate your model?

Entirely depends on the model. For simple weighted overlay models with no known sites, validation is based on the archaeologist’s intuition.  For weighted models with sites available, the presence/absence of sites is the validation. For statistical models (regression, random forest, GLMM, etc…) the models have metrics such as AIC, BIC, R^2, sensitivity/specificity, AUC, and any number of other things.  A test set can be held out to test on “independent” data that was not used in the model building, or better yet, k-fold cross-validation is used to create many hold-out samples.  The point of validation is not simply to put a number to accuracy, but to balance accuracy vs. precision and bias vs. variance.
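For the statistical-model case, here is a minimal sketch of k-fold cross-validation with AUC; the data frame, the predictor names, and the choice of the pROC package are all just assumptions for illustration:

library(pROC)

# invented example data: a 0/1 presence column plus two predictors
set.seed(1)
dat <- data.frame(presence   = rbinom(500, 1, 0.3),
                  slope      = rnorm(500),
                  dist_water = rnorm(500))

k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))
auc_k <- sapply(1:k, function(i) {
  fit  <- glm(presence ~ slope + dist_water, family = binomial,
              data = dat[folds != i, ])
  pred <- predict(fit, newdata = dat[folds == i, ], type = "response")
  as.numeric(auc(roc(dat$presence[folds == i], pred, quiet = TRUE)))
})
mean(auc_k)   # average held-out AUC across the k folds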

What level of accuracy/precision do you expect/accept?

Completely depends on the purpose of the model. Any model is capable of 100% accuracy if the entire study area is considered site-likely, or 0% accuracy if it is entirely site-unlikely; the answer is somewhere in between. [The basis of every model should be the purpose it is intended for, and all modeling methods grow from there.  However, the most common approach in archaeology is to build a model, then find a purpose for it.]  If it were a model designed to find the most highly sensitive locations because you wanted to find a specific type of site for testing, the overall accuracy may be low, but with high precision, as you are targeting a part of the distribution but don’t care if you miss some outliers.  If your model is to broadly protect all potential site-likely areas regardless of survey time or funding, you may choose high accuracy and low precision, as you don’t mind a model that considers vast areas as site-likely as long as 100% of sites are accounted for. (I write about this topic a bunch in the 7 volumes of the PA model report.)

Far too often, people just make a general model, apply an ad-hoc threshold that achieves an accuracy/precision that “looks good”, and then present it as capable of fitting every purpose.  In actuality it is generalized out of fitting any purpose.  The key is to balance accuracy/precision on top of bias and variance.  A model with a great-looking accuracy/precision balance (even considering model purpose) can result from an over-fit model with high variance, so the model looks great on the known data but does not generalize well to new data.  On the other hand, a model with a pretty bad apparent accuracy/precision balance can come from a model that balances bias and variance, so it generalizes to new data much better than the over-fit model that had much better-looking accuracy/precision.  In the end, you hope to maximize the accuracy that is possible given your dataset, variables, signal-to-noise ratio, and model (assumptions), while minimizing your risk of overfitting relative to the desired purpose of the model.

It’s turtles all the way down… For the PA model, sensitivity class boundary thresholds for individual models were optimized so that no less than 85% of known site-present cells (10×10-meter) were correctly included within the combined high and moderate sensitivity area, which covered no less than 33% of the study area.  This equates to a target sensitivity of 0.85 and specificity of 0.67.  The reason it is “no less than 33%” is that we need to leave plenty of room for false-positives, as they are the locations where we hope to look for sites (they are false-positives until a site is found, then they are true-positives).  Maximizing sensitivity (sites correctly identified) is great, but maximizing specificity (the area the model considers site-unlikely) is dangerous because it will lead to picking an over-fit model; bad news.  It is good to pick a target specificity based on the project’s goals.

*I used the term accuracy here as a stand-in for any measure of model correctness, but in archaeological modeling, where the site-present and background samples are frequently severely imbalanced, the use of accuracy (the percent of observations correctly classified) will be very misleading.  This is because the large number of correctly identified site-unlikely background observations will swamp the fewer incorrect site-likely observations and suggest a very high accuracy, when the class we care about (site-likely) may have very low accuracy.  A metric that weights for class is preferable.
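A small invented example shows why:

# invented: 950 background (0) observations vs. 50 site-present (1) observations,
# with a model that misses most of the sites
obs  <- c(rep(0, 950), rep(1, 50))
pred <- c(rep(0, 940), rep(1, 10),    # background: 940 right, 10 wrong
          rep(0, 40),  rep(1, 10))    # sites: only 10 of 50 right

accuracy    <- mean(pred == obs)                           # ~0.95, looks great
sensitivity <- sum(pred == 1 & obs == 1) / sum(obs == 1)   # 0.20 for the class we care about
specificity <- sum(pred == 0 & obs == 0) / sum(obs == 0)   # ~0.99 for the background class
balanced    <- (sensitivity + specificity) / 2             # ~0.59, far less flattering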

Do you ever use predictive modelling for non-commercial / research projects?

All the time.  It has been my main area of study for the past 20 years.  My blog has some examples.


data.table v. dplyr execution time

Execution Time of data.table vs dplyr

This post shows the difference in execution time, and the rate of execution-time increase, for an analysis function coded in two different approaches: 1) data.table, and 2) dplyr. This analysis is the result of working with this function and using profvis to optimize it. The end result is that the data.table version is much faster across all simulations and scales much better than the dplyr version. Please read through and check out the profvis results linked in the list below.

Notes:

  • The full code of this comparison can be found here
  • I use profvis to profile the execution time of each function; the full results are here for dplyr and here for data.table
  • I love dplyr and use it for everything. This is not intended as a critique, just some information on execution speed
  • My use of dplyr may not be optimum and may be contributing to the speed issues
  • The overall approach could also be optimized, but it represents a typical use-case balancing code clarity and function
  • If you see a flaw or speed trap in any of the code, please let me know. Thanks.

Finding

The data.table approach is 0.124 seconds faster at 10 sites/events, 0.922 seconds faster at 25,000 sites/events, and 1.735 seconds faster at 50,000 sites/events. For every 1,000 sites/events added, the data.table approach’s execution time increases by 0.0277 seconds; dplyr’s increases at a rate of 0.3765 seconds.

Functions

Below are the two versions of the same function.

The function being tested here computes the aoristic weight for a series of simulated data. What an aoristic weight is really doesn’t have much to do with this post, but if you are interested, more information can be found here. This somewhat obscure function is useful here because it is an example of the sort of function that an analyst would have to code themselves and therefore run into the decision of using dplyr or data.table. Obviously, you could write this in C++ or do something else to optimize it if that was your priority. However, this is a more realistic comparison (IMHO), as much of an analyst’s time is spent balancing optimization versus getting-stuff-done.

note: there is a little weirdness with negative dates and start/end dates. This is because I started out with BCE dates and then switched to years BP to match another analysis. So, it is a little bit of a hodge-podge in dealing with BCE and BP.

data.table version

my_aorist_DT <- function(events, start.date = 0, end.date = 2000, bin.width = 100) {
  require(data.table)
  setDT(events)
  # table of time-step bins spanning start.date to end.date
  time_steps <- data.table(
    bin.no = seq(1:(abs(end.date)/bin.width)),
    Start  = seq(start.date, (end.date - bin.width), by = bin.width),
    End    = seq((start.date + bin.width), end.date, by = bin.width))
  setkey(time_steps, Start, End)
  # overlap join: which bins does each event's date range intersect?
  overlap_names <- data.table::foverlaps(events, time_steps,
                                         type = "any", which = FALSE)
  # drop bins that only touch an event's range at a single endpoint
  overlap_names <- overlap_names[i.Start != End & i.End != Start]
  # aoristic weight = bin width / event duration
  overlap_names[, duration := (i.End - i.Start)]
  overlap_names[, W := (bin.width / duration)]
  # sum the weights (plus other summaries) per time-step bin
  ov_sum <- overlap_names[, .(aorist = sum(W),
                              median_step_W = median(W),
                              site_count = .N), keyby = .(bin.no)]
  setkey(ov_sum, bin.no)
  setkey(time_steps, bin.no)
  # join back to the full set of bins and zero-fill bins with no events
  ov_sum2 <- ov_sum[time_steps, nomatch = NA]
  ov_sum2[is.na(ov_sum2)] <- 0
  ov_sum2[, bin := paste0(Start, "-", End)]
  return(ov_sum2)
}

dplyr version

my_aorist_dplyr <- function(events, weight = 1, start.date = 0, end.date = 2000,
                            bin.width = 100, round_int = 4) {
  require(dplyr)
  require(tidyr)       # for replace_na()
  require(data.table)  # for setDT() and foverlaps()
  # table of time-step bins spanning start.date to end.date
  time_steps <- data.frame(
    bin.no = seq(1:(abs(end.date)/bin.width)),
    Start  = seq(start.date, (end.date - bin.width), by = bin.width),
    End    = seq((start.date + bin.width), end.date, by = bin.width))
  setDT(time_steps)
  setDT(events)
  setkey(time_steps, Start, End)
  # overlap join, then drop bins that only touch an event's range at an endpoint
  overlap_names <- data.table::foverlaps(events, time_steps,
                                         type = "any", which = FALSE) %>%
    filter(i.Start != End & i.End != Start)
  aorist <- overlap_names %>%
    data.frame() %>%
    group_by(name) %>%
    mutate(site_count = n(),
           W = (bin.width / (i.End - i.Start)),  # aoristic weight = bin width / duration
           W = round(W, round_int)) %>%
    arrange(name, bin.no) %>%
    group_by(bin.no) %>%
    summarise(aorist = sum(W),
              median_step_W = median(W),
              site_count = n()) %>%
    right_join(., time_steps, by = "bin.no") %>%   # keep bins with no events
    replace_na(list(aorist = 0, site_count = 0, median_step_W = 0)) %>%
    mutate(bin = paste0(Start, "-", End)) %>%
    dplyr::select(bin, bin.no, aorist)
  return(aorist)
}

Simulation Loop

The simulation loop simply loops over the 1,001 site/event counts in sim_site_count_seq. For each iteration, both functions are executed and the time it takes to run each is recorded; the results are summarized in the table below.

To see the full code for this list, check it out here.
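For orientation, here is a minimal sketch of what such a loop can look like; the simulated events table and the short vector of site counts are invented stand-ins for the real sim_site_count_seq and data:

library(data.table)
library(dplyr)
library(tidyr)   # replace_na() used in the dplyr version

set.seed(42)

# simulate n events named site_1..site_n with Start/End dates between 0 and 2000
sim_events <- function(n) {
  Start <- sample(seq(0, 1800, by = 100), n, replace = TRUE)
  data.table(name  = paste0("site_", seq_len(n)),
             Start = Start,
             End   = Start + sample(c(100, 200), n, replace = TRUE))
}

site_counts <- c(10, 1000, 10000, 50000)   # stand-in for sim_site_count_seq
timings <- rbindlist(lapply(site_counts, function(n) {
  events <- sim_events(n)
  data.table(site_count = n,
             DT_sec     = system.time(my_aorist_DT(copy(events)))["elapsed"],
             dplyr_sec  = system.time(my_aorist_dplyr(copy(events)))["elapsed"])
}))
timings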

measure                    value
site_count_min             10.000
site_count_max             50000.000
dplyr mean (seconds)       0.995
DT mean (seconds)          0.095
max difference (seconds)   1.981
min difference (seconds)   -0.002
mean difference (seconds)  0.900
mean RMSE                  0.006

Plot results

[Figure: data.table vs. dplyr execution time by number of sites/events]

Estimating a Beta Distribution with Stan HMC


Puttin’ a Prior on it

This is a post based on Julia Silge’s post on estimating a beta distribution using Billboard’s Year-End Hot 100 song lyrics from 1958 to 2014. As you can see from my lame title, my post will not be nearly as witty or well-designed as hers, but you may find this extension of her work interesting. I conducted this analysis of the Bayesian Blues methods/data for a few reasons:  1) I had picked up a few ideas on using Stan for density estimation and wanted to try them out; 2) I am always on the lookout to practice with dplyr/tidyverse; 3) I really enjoy the beta distribution and its intuition (see @drob’s SE post); and 4) Silge’s analysis provided a really cool method and interesting data, both of which she was kind enough to share (Silge on GH). The data, code, and a commented full markdown of my analysis are posted on GitHub.

Singing the Bayesian Beginner Blues

As noted, the analysis goals and data in this post are directly based on Julia Silge’s post entitled Singing the Bayesian Beginner Blues.  That post used the Empirical Bayes method for beta parameter point estimation and for establishing the uncertainty around measurements of U.S. state names mentioned within a large database of song lyrics (courtesy of Kaylin Walker).  Silge’s post first approximates the parameters of a beta distribution from her data (the Empirical part) and then uses those parameters and Bayes’ Theorem to approximate a rate for each state as if it were derived from that beta distribution. Finally, Silge derived a posterior distribution for each state’s rate, from which the uncertainty around the rate was visualized as the 95% credible interval.  This is a really cool approach and is demonstrated with great visualizations.  While reading it, I was inspired to figure this out in Stan and started messing with some code.

The primary differences in this analysis include:

  1. The data are zero-inflated to include states not mentioned in any lyrics
  2. Mentions of cities with >= 100k population are aggregated to their state, and the results are compared to an analysis without city counts
  3. Hamiltonian Monte Carlo via rstan is used to estimate parameters, propagate uncertainty, and predict state mention values given priors, data, and likelihood.

Estimating Beta Parameters with Stan

The approach in this post uses Stan, a probabilistic modeling language, to achieve the beta parameter estimation, propagate uncertainty, and predict a posterior distribution for each state, as well as for the entire population of song-lyric mention rates.  Stan is a modernized relative of BUGS and JAGS that is cross-platform and endlessly flexible for estimating a huge range of models.  The Stan website is very informative, and there are a number of videos of presentations that helped me get my head around the concepts.  The very basics of Stan are a No-U-Turn Sampler (NUTS) adaptive Hamiltonian Monte Carlo (HMC) sampling engine and a language with which to declare your model.  The code for the model gets compiled into C++, the model is sampled with HMC, and the posterior is computed, with samples returned to the user as a convenient R object.

[Figure: Metropolis vs. Gibbs vs. NUTS HMC posterior exploration]

The image above is from one of Benjamin Goodrich’s presentations (and is also used by Bob Carpenter in one of these videos) to illustrate the efficiency of the HMC NUTS algorithm. Compared to Metropolis and Gibbs, HMC NUTS explores the true posterior (right-most panel) far more effectively and efficiently.  There is no contest there.


On a conceptual level, two primary differences between the Stan approach and the Empirical Bayes (EB) approach are that Stan (and other MCMC-style samplers) integrates over distributions where EB uses point estimates; with distributions, we can propagate uncertainty through all of the steps of the modeling sequence. Secondly, the full Bayesian approach uses priors to regularize the parameter estimates given the data and model.  The end result is control over the amount of regularization based on the choice of priors, and the ability to understand uncertainty because every parameter has a distribution of its plausible values.

In my previous post on estimating Pokémon CP, I did a very similar task by estimating a beta distribution density with JAGS.  The approach here with Stan is more sound than the JAGS approach, both because HMC is a better sampler than the Gibbs sampler of JAGS and because I incorporate the prediction of new values within the model. In the previous example, I used a range of beta parameters extracted from the posterior to re-model the predictions.  Doing this in a generated quantities block in Stan more accurately propagates the uncertainty.

The Model: State Mentions in Song Lyrics as a Beta

The beta distribution can be thought of as modeling a succession of binomial trials of successes and failures. As such, it covers the domain of a distribution of probabilities of success. To think of it as a rate of success is to conceptualize the problem here as: how many song mentions per unit of population?

  • successes = how many songs successfully mention a state by name
  • attempts = a state’s population in units of 100k people (or 1,000,000 in Silge)

There is a bit of a disjunction in intuition here, as the successes and attempts are on two completely different scales and the rate is not limited to the domain of [0, 1]; there is no upper bound on the number of songs that mention a state relative to population.  An example of a more intuitively sound beta is Robinson’s batting-average example where, counter to the song-mentions example, you cannot have more hits than at-bats, so the rate remains within [0, 1].  However, despite this disjunction, there is still ground for the state-based interpretation; it is a very useful and convenient distribution, and Silge showed that it fits the observed data pretty darn well. So we go with it! [note: I modeled these data with a few other potential distributions, and beta ‘won’ in terms of log-likelihood.]

The model follows as such:

x\sim beta(\alpha,\beta)\\ \alpha \sim \text{half-normal}(1)\\ \beta \sim \text{half-normal}(10)

predictions for states are drawn from:

\tilde{x} \sim beta((successes + \alpha), (attempts - n + \beta))
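The exact Stan program is in the GitHub repo linked above; as a rough sketch of the idea only, the estimation plus the generated-quantities prediction could be written along these lines in rstan (this assumes observed rates strictly inside (0, 1) and attempts larger than successes):

library(rstan)

stan_code <- "
data {
  int<lower=1> N;                    // number of states
  vector<lower=0, upper=1>[N] x;     // observed mention rate per state
  vector<lower=0>[N] successes;      // song mentions per state
  vector<lower=0>[N] attempts;       // population in units of 100k
}
parameters {
  real<lower=0> a;                   // beta shape alpha
  real<lower=0> b;                   // beta shape beta
}
model {
  a ~ normal(0, 1);                  // half-normal via the lower bound
  b ~ normal(0, 10);
  x ~ beta(a, b);
}
generated quantities {
  vector[N] x_tilde;                 // posterior prediction per state
  for (n in 1:N)
    x_tilde[n] = beta_rng(successes[n] + a, attempts[n] - successes[n] + b);
}
"

# fit <- stan(model_code = stan_code,
#             data = list(N = N, x = x, successes = successes, attempts = attempts),
#             chains = 4, iter = 2000)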

Continue reading “Estimating a Beta Distribution with Stan HMC”


Pokémon GO post-evolution CP: A model

TL;DR

The punchline: it appears, first, that the metric of Pokémon CP is correlated directly with HP, with few, if any, other related metrics.  Secondly, and perhaps more interestingly, the post-evolution CP increase is a random draw from a parameterized beta distribution centered on a CP percent increase specific to each Pokémon species. Let me explain…


Ok, this is a bit odd.  Before last month, all I knew of Pokémon was a yellow thing called Pikachu and something about a card game.  Then Pokémon Go came out a few weeks ago, and I decided to check it out.  Now I am an avid player and find the intricacies, as well as the psychology of the game play, quite interesting.  I don’t spend too much time reading the internet about the game, but it is fun to merge the game with an analysis.  The following post is akin to an Exploratory Data Analysis (EDA) of the data; not a full inference or comprehensive study.  The full code for this analysis can be found at this Gist.

Why bother?

The ever-enjoyable Data Machina email newsletter included a link to a Pokémon dataset on OpenIntro in the last edition, and I followed it down the rabbit hole. This data set contains observations of a number of measurements of specific Pokémon species before and after a series of evolutions. For non-Poké people, one goal of the game is to collect Pokémon and then evolve them into more powerful forms.  The metric for “more powerful” is the Combat Power (CP).  Like many things in this game, the mechanism behind how the CP is calculated and affected by evolution is unknown.  However, I am sure a quick Google search will turn up plenty of ideas.  The purpose of the data set on OpenIntro is to look at these variables and try to discover how the CP is modified by an evolution.  The end game to that quandary is to more efficiently select the best Pokémon on which to spend your limited resources to evolve.  The final chapter in this narrative is that a more efficiently evolved Pokémon will allow you to be more competitive in gym battles and therefore represent your team (one of Blue, Yellow, or Red), gain Pokémon coins, and attract rarer Pokémon.  If a little insight into how the underlying mechanism of CP works gets you there faster, all the better.


The intent of posting this data set on OpenIntro is to see if the provided variables contribute meaningfully to the data generating mechanism of CP relative to evolutions.  As stated on the site:

A key part of Pokémon Go is using evolutions to get stronger Pokémon, and a deeper understanding of evolutions is key to being the greatest Pokémon Go player of all time. This data set covers 75 Pokémon evolutions spread across four species. A wide set of variables are provided, allowing a deeper dive into what characteristics are important in predicting a Pokémon’s final combat power (CP).

Example research questions: (1) What characteristics correspond to an evolved Pokémon with a high combat power? (2) How predictable is CP from an evolution?

Good questions.  The analysis that follows takes a quick look at Q1, concluding that most of the action is in HP, with few other variables contributing.  Question 2 is the more direct focus of this post, but I formalized the question a bit: what accounts for the variation in CP gain per evolution within a given species?

The Data

The data from OpenIntro are pretty limited; no Big Poké Data here.  There is a total of 75 observations from pre- and post-evolutions of Pokés from four different species.  The species are an important distinction.  This analysis will recognize the variation between species, but try to find a universal model that fits all species.  Note the difference in sample size between species.  The uncertainty of small sample sizes has an important effect here.

> colnames(pokedat)
[1] "name" "species" "cp"
[4] "hp" "weight" "height"
[7] "power_up_stardust" "power_up_candy" "attack_weak"
[10] "attack_weak_type" "attack_weak_value" "attack_strong"
[13] "attack_strong_type" "attack_strong_value" "cp_new"
[16] "hp_new" "weight_new" "height_new"
[19] "power_up_stardust_new" "power_up_candy_new" "attack_weak_new"
[22] "attack_weak_type_new" "attack_weak_value_new" "attack_strong_new"
[25] "attack_strong_type_new" "attack_strong_value_new" "notes"
>
 Species Count
Caterpie 10
Eevee 6
Pidgey 39
Weedle 20

Thinking about the data and the two questions above, my interest turned towards univariate correlation between the variables offered, to figure out what the real metric is. Diagnosing correlation is a visual approach that offers an understanding of what is moving in step with what.  Of course “correlation is not… ” and all that, but it sure as heck doesn’t hurt when you are trying to develop a data generating model.  The metric of interest is not only the pre- and post-evolution CP (cp and cp_new respectively), but some permutation of these that incorporates the fact that species seems to have a big effect on the output.

library(dplyr)   # provides as_tibble() and the verbs used below

pokedat <- as_tibble(read.csv("pokemon.csv"))

dat1 <- mutate(pokedat, delta_cp = cp_new - cp) %>%  # net change in cp
  # the % change in cp from old to new
  mutate(pcnt_d_cp = delta_cp / cp) %>%
  # group by species to calculate additional variables
  group_by(species) %>%
  # species-grouped mean percent change
  mutate(spec_mean_pdcp = mean(pcnt_d_cp)) %>%
  # species-grouped std dev of % changes from old to new cp
  mutate(spec_mean_pdcp_sd = sd(pcnt_d_cp)) %>%
  # difference in % delta cp (pdcp) change from the species group mean
  mutate(cent_spec_mean_pdcp = pcnt_d_cp - spec_mean_pdcp) %>%
  # z score for pdcp change within the species group
  mutate(z_spec_mean_pdcp = cent_spec_mean_pdcp / spec_mean_pdcp_sd) %>%
  data.frame()


The modification of the data in the code above spells out the development of the desired metric.  Since we are talking about an evolutionary process, the desired metric has to do with change in states; in this case, the change from pre- to post-evolution CP, referred to as the delta CP.  Further, since we have multiple species (and there are many more than four in the Pokémon universe), we consider delta CP relative to the mean of each species.  Finally, as each species has a different range of CP, with some species more powerful than others, it is good to look at the percent change per species, as opposed to an absolute CP increase.  Based on these observations, the desired metric is the Percent Delta CP (PDCP), considered relative to each species; or spec_pdcp.  This is the metric that we seek to identify a data generating mechanism for.

Correlation = Where’s the party?

[Figure: correlation plot of the Pokémon variables]
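As a minimal sketch of this kind of correlation view, assuming the dat1 data frame created above (corrplot is just one convenient option here, not necessarily what produced the figure):

library(corrplot)

# correlation matrix over the numeric columns of dat1
num_vars <- dat1[sapply(dat1, is.numeric)]
M <- cor(num_vars, use = "pairwise.complete.obs")
corrplot(M, method = "circle", type = "lower", tl.cex = 0.7)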

Continue reading “Pokémon GO post-evolution CP: A model”


ggplot2 step by step

Building a ggplot2 Plot Step by Step

I have included this viz on my blog before, as an afterthought to a more complex viz of the same data.  However, I was splitting out the steps to the plot for another purpose and thought it would be worthwhile to post this as a step-by-step how-to.  The post below will step through the making of this plot using R and ggplot2 (the dev version from GitHub).  Each code chunk and accompanying image adds a little bit to the plot on its way to the final plot, depicted here.  Hopefully this can help or inspire someone to take their plot beyond the basics.

[Figure: the final plot]

 

Step 1 is to load the proper libraries, which include ggplot2 (of course), ggalt for the handy ggsave function, and extrafont for the use of a range of fonts in your plot.  I did this on a MacBook Pro and using the extra fonts was pretty easy.  I have done the same on a Windows machine and it was more of a pain.  Google around and it should work out.

library("ggplot2") # dev version
library("ggalt") # dev version, for ggsave()
library("extrafont")

Create the data… I included R code at the end of the post to recreate the data.frame that is used in the plots below.  You will have to run that code block to make an object called dat so that the rest of this code will work.

# run the code block for making the 'dat' data.frame.
# code is located at bottom of this post

So, the whole point of making this plot was to show a trend in population change over time and over geography.  The hypothesis put forth by the author of this study (Zubrow, 1974) is that over time population decreased in the east and migrated to the west.  The other blog post I did covers all of this in more detail.

If one were starting out in R, they might go straight to the base plot function to plot population over time, as shown below.  [To be fair, there are many expert R users who will go straight to the base plot function and make fantastic plots.  I am just not one of them.  That is a battle I will leave to the pros.]

# base R line plot of population over year
plot(dat$Population ~ dat$Year, type = "l")

[Figure: base R line plot of population by year]

That result is underwhelming and not very informative.  A jumble of tightly spaced lines doesn’t tell us much.  Further, this is plotting data across all of the 19 sites we have data on. Instead of pursuing this further in base plot, we move to ggplot2.

ggplot2 is built on the Grammar of Graphics model; check out a quick intro here.  The basic idea is to plot by layers using data and geometries intuitively. To me, plotting in ggplot2 is much like building a GIS map.  Data, layers, and geometries; it all sounds pretty cartographic to me.
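As a first pass in that grammar (assuming dat has Year and Population columns, as in the base plot above, plus a site identifier that I will call Site here as a placeholder for whatever the data block at the end of the post names it):

# first ggplot2 layer: data, aesthetics, and one line geometry per site
ggplot(dat, aes(x = Year, y = Population, group = Site)) +
  geom_line()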

Continue reading “ggplot2 step by step”


One Tree to Rule Them All! inTrees and Rule-Based Learning

The basis of this post is the inTrees (interpretable trees) package in R [CRAN].  The inTrees package summarizes the typically complex results of tree-based ensemble learners/algorithms into rule-based learners referred to as a Simplified Tree Ensemble Learner (STEL). This approach addresses the following problem: tree-based ensemble learners, such as Random Forest (RF) and Gradient Boosting Machines (GBM), often deliver superior predictive results compared to simpler decision trees or other low-complexity models, but are difficult to interpret or draw insight from. I will use an archaeological example here (because my blog title mandates it), but this package/method can be applied to any domain.

Oh Archaeology… We need to talk:

Ok, let’s get real here.  Archaeology has an illogical and unhealthy repugnance toward predictive modeling. IMHO, illogical because the vast majority of archaeological thought processes are based on empirically derived patterns from previous observation.  All we do is predict based on what data we know (a.k.a. predict from models, a.k.a. predictive modeling). Unhealthy, because it stifles creative thought, advancement of knowledge, and the relevance of the field.  Why is this condition so?  Again, IMHO, because archaeologists are stuck in the past (no pun intended), feel the need to cling to images of self-sufficient site-based discovery, have little impetus to change, and would rather stick to our own areas of research than coalesce to help solve real-world problems. But, what the hell do I know.

What the heck does this have to do with tree-based statistical learning?  Nothing and everything.  In no direct sense do these two things overlap, but indirectly, the methods I am discussing here have (once again, IMHO) the potential to bridge the gap between complex multidimensional modeling and practical archaeological decision making.  Further, not only do I think this is true for archaeology, but likely for other fields that have significant stalwart factions.  That is a bold claim for sure, but having navigated this boundary for 20 years, I am really excited about the potential for this method; more so than for most any other.

The bottom line is that the method discussed below, interpretable trees from rule-based learners in the inTrees package for R, allows for the derivation of simple-to-interpret, rule-based decision making from empirically derived, complex, and difficult-to-interpret tree-based statistical learning methods.  Or more generally, use fancy-pants models to make simple rules that even the most skeptical archaeologist will have to reason with.  Please follow me…
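Before the archaeology-specific example, here is a minimal sketch of the inTrees workflow on invented site-present/site-absent data (the data and column names are made up; the function calls follow the package’s documented sequence):

library(randomForest)
library(inTrees)

# invented site-present (1) / site-absent (0) data with two predictors
set.seed(1)
dat <- data.frame(slope      = rnorm(300),
                  dist_water = rnorm(300))
dat$presence <- factor(ifelse(dat$slope + dat$dist_water + rnorm(300) > 0, 1, 0))

rf <- randomForest(presence ~ ., data = dat, ntree = 100)

X      <- dat[, c("slope", "dist_water")]
target <- dat$presence

treeList   <- RF2List(rf)                         # pull the individual trees from the forest
ruleExec   <- extractRules(treeList, X)           # raw rules from those trees
ruleMetric <- getRuleMetric(ruleExec, X, target)  # frequency, error, and length per rule
ruleMetric <- pruneRule(ruleMetric, X, target)    # trim uninformative conditions
learner    <- buildLearner(ruleMetric, X, target) # the simplified tree ensemble learner (STEL)
presentRules(learner, colnames(X))                # readable if-then rules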

Continue reading “One Tree to Rule Them All! inTrees and Rule-Based Learning”


Gaussian Process Hyperparameter Estimation

A quick, way-longer-than-expected post and some code for looking into the estimation of kernel hyperparameters using Stan HMC/MCMC and R.  I wanted to drop this work here for safekeeping; partially for the exercise of thinking it through and writing it down, but also because it may be useful to someone.  I wrote a little about GP in a previous post, but my understanding is rather pedestrian, so these explorations help.  In general, GPs are non-linear regression machines that utilize a kernel to reproject your data into a larger dimensional space in order to represent and better approximate the function we are targeting.  Then, using a covariance matrix calculated from that kernel, a multivariate Gaussian posterior is derived.  The posterior can then be used for all of the great things that Bayesian analysis can do with a posterior.

Read lots more about GPs here…. Big thanks to James Keirstead for blogging a bunch of the code that I used under the hood here, and thanks to Bob Carpenter (GitHub code) and the Stan team for great software with top-notch documentation.

code:

The R code for all analysis and plots can be found in a gist here, as well as the three Stan model codes: gp-sim_SE.stan, gp-predict_SE.stan, and GP_estimate_eta_rho_SE.stan.

 


The hyperparameters in question here are parameters of the kernel within the GP algorithm.  As with other algorithms that use kernels, a number of functions can be used based on the type of generative function you are approximating.  The most commonly used kernel function for GPs (and seemingly Support Vector Machines) is the Squared Exponential (SE), also known as the Radial Basis Function (RBF), Gaussian, or Exponentiated Quadratic function.

The Squared Exponential Kernel

The SE kernel is a negative length-scale factor rho (\rho^2) times the squared distance between data points ((x -x^\prime)^2), all multiplied by a scale factor eta (\eta^2).  Rho is shorthand for the length scale, which is often written as a denominator as shown below. Eta is a scale factor that determines how far the function varies from the mean. Finally, sigma squared (\sigma^2_{noise}) at the end is the value for the diagonal elements of the matrix where (i = j). This last term is not necessarily part of the kernel, but is instead a jitter term that keeps the diagonal slightly above zero for numerical reasons.  The matrix created by this function is positive semi-definite and composed of the distances between observations scaled by rho and eta. Many other kernels (linear, periodic, linear * periodic, etc…) can be used here; see the kernel cookbook for examples.

k_{x,x^\prime}=\eta^2 \exp(-\rho^2 (x -x^\prime)^2)+\delta_{ij}\sigma^2_{noise} \\ \rho^2 = 1/(2\ell^2) \\ \delta_{ij}\sigma^2_{noise}=0.0001
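As a plain R translation of that kernel (a sketch for intuition only; the Stan versions are in the gist), assuming 1-D inputs:

# SE covariance: eta^2 * exp(-rho^2 * (x - x')^2) + jitter on the diagonal (i = j)
cov_SE <- function(x, eta_sq = 1, rho_sq = 1, sigma_sq_noise = 0.0001) {
  d2 <- outer(x, x, function(a, b) (a - b)^2)   # (x - x')^2 for every pair of points
  eta_sq * exp(-rho_sq * d2) + diag(sigma_sq_noise, length(x))
}

# e.g. draw one realization from a zero-mean GP with this covariance
x <- seq(-5, 5, length.out = 50)
K <- cov_SE(x, eta_sq = 1, rho_sq = 1 / (2 * 1^2))   # rho^2 = 1 / (2 * l^2) with l = 1
y <- as.vector(MASS::mvrnorm(1, mu = rep(0, length(x)), Sigma = K))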

To Fix or to Estimate?

In this post, models are created where \sigma^2_{noise}, \rho^2, and \eta^2 are all fixed, as well as a model where \sigma^2_{noise} is fixed and \rho^2 and \eta^2 are free.  In the MCMC probabilistic framework, we can fix \rho^2 and \eta^2, or most any other parameter, or estimate them. To this point, there was a very informative and interesting discussion on the stan-users mailing list about why you might want to estimate the SE kernel hyperparameters.  The discussion generally broke across the lines of A) you don’t need to estimate these, just use relatively informative priors based on your domain knowledge, and B) of course you want to estimate these because you may be missing a large chunk of function space and uncertainty if you do not. The conclusion of the thread is a hedge to try it both ways, but there are great bits of info in there regardless.


So while the greatest minds in Hamiltonian Monte Carlo chat about it, I am going to just naively work on the Stan code to do these estimations and see where it takes me.  Even if fixed with informative priors is the way to go, I at least want to know how to write/execute the model that estimates them. So here we go.

Continue reading “Gaussian Process Hyperparameter Estimation”
