Boxplot or not to Boxplot: WoE-ful Example


Edited 04/01: Added Tufte boxplots and Violin plots to bottom of post (Gist code updated)!

How to visualize numerous distributions where quantiles are important and relative ranking matters?

This post is inspired by two things

  1. I am working with the R packaged called Information to write some Weights of a Evidence (WoE) based models and needed to visualize some results;
  2. While thinking about this, I listed to the¬†@NSSDeviations podcast and heard Hilary Parker (@hspter)¬†say¬†¬†that she had little use for boxplots anymore, but Rodger Peng (@rdpeng)¬†spoke in support for the boxplots. A new ggplot2 vs. base, did not ensue…

Having heard the brief but serendipitous exchange by Rodger and Hilary, I got to thinking about what other approach I could use instead of my go-to¬†boxplot. ¬†Below I will go through a few different ways to visualize the same results and maybe a better¬†approach will be evident. The R code for all that follows [except raw data… sorry] is at this Gist. Let me know what you think!


I go back to WoE pretty frequently as an exploratory technique to evaluate univariate variable contribution by way of discrimination. I have a bit of a complicated relationship with WoE as I love the utility and simplicity, but I don’t see it as the final destination. Archaeology has been dabbling in WoE for a while (e.g. Strahl, 2007)¬†as it is a comfortable middle ground between being blindly¬†adherent to¬†a narrow range of traditional variables¬†¬†and quantifying discrimination ability of these variables in¬†the binary (presence/absence) case. ¬†There is also the push-button access to a spatial WoE tool in ArcGIS that has added to its accessibility.¬†But this post isn’t about that; I will finish that rant post another time.

Continue reading “Boxplot or not to Boxplot: WoE-ful Example”

Boxplot or not to Boxplot: WoE-ful Example

R-chaeology @ SAA2016 – Beer and Data meetup

r-chaeology_oc_final_cropArchaeology’s two most favorite things; Beer and Data. ¬†Put them together with the world’s most useful scientific data modeling and analysis language (of course I mean R) and it¬†could have loads of fun! ¬†To that end,¬†I am putting this out there to propose a time and place for an unofficial R + Beers get together at the the Society for American Archaeology annual meeting in Orlando, FL.

Update 4/20: Zachary Cofran and David Pappano put on a great R-stats work shop at the American Physical Anthropologists meeting in Atlanta the week following #Rchaeology.  Here is a link to a post about their code/instruction

Update 4/8: Location!  There is a indoor/outdoor court yard at the north end of the Swan hotel.  If is just in from the boardwalk that separates the Swan and the Dolphin.  If it is too dark or noisy (fountain) we will move inside where there are couches & chairs.  See map below

Update 3/24: Open Context has graciously stepped in to support Rchaeology and provide refreshments (and some credibility!) to this little event.  A little bit about this organization:

Open Context reviews, edits, annotates, publishes and archives research data and digital documentation.

We publish your data and preserve it with leading digital libraries. We take steps beyond archiving to richly annotate and integrate your analyses, maps and media. This links your data to the wider world and broadens the impact of your ideas.

Keep an eye here for details as they emerge.  I will tweet with #Rchaeology2016  #Rchaeo2016

The Knows

  • Sponsored¬†by Open Context!
  • In or about the indoor/outdoor courtyard in Swan hotel (see map below)
  • Saturday evening around 8pm to 10pm (early flight Sunday)
  • No R/Data/Stats experience necessary!
  • Unstructured, BYOData, hack & nerd-out
  • Will certainly conflict with many other happening; sorry.
  • Great show of interest so far!
  • Liquid refreshements courtesy of Open Context!
  • Guest appearances
    • Jesse Wolfhagen¬†et al. to show off their¬†Zooarchaeology package and talk Bayes Mixed Models.
    • Amy Fox¬†may swing by to demonstrates some morphometrics


Be Prepared!

Ok, there is actually nothing for anyone to prepare; it’s just the Boy Scout motto. ¬†But, if you are interested in preparing for things, here are some things to try.

  • Bring whatever data you are currently working on and we can take it for a spin.
  • If you want to bring you laptop and play along, the following are all¬†you need to get started:


This is a bit of a crazy and impromptu idea that may fail go in any number of directions… by which I mean me sitting with a beer and a laptop running code; so pretty much a normal night. ¬†If it does work and a few people can get together to talk R and data that it will be well worth it. ¬†Collaboration, serendipity, and learning are guaranteed. I’ll update this post as new details come to light. ¬†Feel free to comment, tweet, or DM if you are interested or have ideas!

R-chaeology @ SAA2016 – Beer and Data meetup

Visualizing Zubrow: ggplot2 & cowplot

Important note: This only works with the Development version of ggplot2! (as of 3/18/2016). Find this at Github

I was planning on making a whole post about my fascination for Ezra B.¬†Zubrow’s work, but in the interim I put together some fun plots of the data I plan to write about. ¬†I am putting this post together to get that code out there and show a few cool new things that ggplot2 v2.0+ allow for. ¬†Everything that follows is 110% thanks to the hard work of @hadleywickham¬†for creating the incredible ggplot2, @hrbrmstr¬†for adding subtitles/captions, @ClausWilke¬†for the wonderful¬†cowplot package.

The plot/graph/data viz/pretty picture that I tweeted out last night was my first draft of presenting population migration data from Zubrow’s 1974¬†Population Contact Climate New Mexican Pueblos. Below I will present the R code and go through a few of the steps that pulled it together. [link to code in Gist]


The process to make this was actually very easy thanks to the strides of those mentioned above. It is all really just a straightforward use of their code.

  1. Data Preparation (Zubrow 1974, R and dplyr)
  2. Decide how to organize and present to show intentions (brain)
  3. Plot each panel individually (ggplot2)
  4. Stick plots together (cowplot)
  5. Get some more ice cream (spoon & bowl)

The data prep started with manually lifting the data from the publication (Zubrow 1974: pg 68-69, Table 8) and structuring it in such a way to minimize repetition and get it ready for plotting. ¬†I am not a very efficient coder, so other will likely find ways to do this better. ¬†This follows the “wide data” format, which is basically adding columns of variables to a rows which each contain a single observation (population). ¬†See here and here ¬†and here¬†for discussion of various “tidy” data formats.

library("ggplot2") # Must use Dev version as of 03/18/16
library("extrafont") # for font selection
library("dplyr") # for data preperation
library("cowplot") # for combining plots

# Prepare data for plotting
# data from Zubrow, E.B.W. (1974), Population, Contact,and Climate in the New Mexican Pueblos
# prepared as a long format to facilitate plotting
year <- c(1760, 1790, 1797, 1850, 1860, 1889, 1900, 1910, 1950)
sites <- c("Isleta", "Acoma", "Laguna", "Zuni", "Sandia", "San Felipe",
"Santa Ana", "Zia", "Santo Domingo", "Jemez", "Cochiti",
"Tesuque", "Nambe", "San Ildefonso", "Pojoaque", "Santa Clara",
"San Juan", "Picuris", "Toas")
# [sites] are arranged in South to North geographic order
south <- c(seq_along(sites))
# add index to rearrange geographically from East to West
east <- c(14, 18, 17, 19, 12, 11, 13, 15, 10, 16, 9, 4, 3, 7, 5, 8, 6, 2, 1)
# population figures
pop <- c(
c(304, 410, 603, 751, 440, 1037, 1035, 956, 1051),
c(1052, 820, 757, 367, 523, 582, 492, 691, 1376),
c(600, 668, 802, 749, 929, 970, 1077, 1472, 1655),
c(664, 1935, 2716, 1294, 1300, 1547, 1525, 1667, 2564),
c(291, 304, 116, 241, 217, 150, 81, 73, 150),
c(458, 532, 282, 800, 360, 501, 515, 502, 721),
c(404, 356, 634, 339, 316, 264, 228, 219, 285),
c(568, 275, 262, 124, 115, 113, 115, 109, 145),
c(424, 650, 483, 666, 262, 930, 771, 817, 978),
c(373, 485, 272, 365, 650, 474, 452, 449, 789),
c(450, 720, 505, 254, 172, 300, 247, 237, 289),
c(232, 138, 155, 119, 97, 94, 80, 80, 145),
c(204, 155, 178, 107, 107, 80, 81, 88, 96),
c(484, 240, 251, 319, 166, 189, 137, 114, 152),
c(99, 53, 79, 48, 37, 18, 12, 16, 2),
c(257, 134, 193, 279, 179, 187, 222, 243, 511),
c(316, 260, 202, 568, 343, 373, 422, 388, 152),
c(328, 254, 251, 222, 143, 120, 95, 104, 99),
c(505, 518, 531, 361, 363, 324, 462, 517, 842)
# combine above in a data.frame where each population is the key observation
dat <- data.frame(Year = rep(year, length(sites)),
Site = rep(sites, each = length(year)),
Population = pop,
South = rep(south, each = length(year)),
East = rep(east, each = length(year)))
# add a factor that arranges the levels of site by their East to West geographic location
dat$East_label <- rep(factor(sites, levels = (sites[order(east)])), each = length(year))

# Calculate quantities of interest
# this ibe-liner calculates the date to date change in population
# ave() groups and runs the diff() function that is concatenated with a 0 for the first observation in each group
# probably can make this dplyr, but it works...
dat$Change <- ave(dat$Population, dat$Site, FUN = function(x) c(0, diff(x)))
# using dplyr to group the observations by site/Pueblo and calcualte stuff
dat <- group_by(dat, Site) %>%
# change from inception date of 1760 (first data point)
mutate(Relative_Change = Population - Population[1]) %>%
# percent change of log population (not used in this graphic)
mutate(Percent_Change = (Population - lag(Population))/lag(Population)) %>%
# change from inception scaled to % change from inception
mutate(Scaled_Change = (Relative_Change / Population[1])) %>%
# make into a data.frame

# select only the rows that are the last time period observation of 1950
# this is for the second plot
net_change <- dat[which(dat$Year == 1950),]

Plot Intentions (a.k.a skip down if you just want code)

Now that the data is entered, formatted, and the quantities of interest are calculated, its time to start plotting. ¬†I imagine everyone does it this way, but my simple method for building a ggplot is simply to start simple and then add in calls for ever finer detail. ¬†Here is an amazing tutorial on building a ggplot that really covers all the bases. ¬†I could add little to that link as far as approach and basics, however I will add that going from the basics to the plot you see in your head can be quite difficult! ¬†Most of the “issues” I run into making a plot that is remotely complicated are almost always because I didn’t think through what I really want to see and what the technical implications of that will be. ¬† All you smart people probably won’t have that problem, but for me there has been a proportional shift to increased¬†time spent thinking about a plot versus a decrease in time spent coding ¬†a plot as¬†I get more experience.

The intention of this plot is to support an argument made by Zubrow that there was a 18th to 20th century population shift in Pueblo society based on cultural contact and climate change.  Of course it will take more that a single graphic to make this point, but a single graphic would go a long way to help a reader see it.  To that end, I first decided on a graphic that would be focused on the percent change from the date of the first data point at 1760.  I wanted to use percent change to scale the y axis a bit more evenly since some of Pueblos had total populations that dwarfed the smaller pueblos and decreased the visual impact of the trend. Further, period over period change was pretty variable and did not tell the story. Secondly, since there is a geographic component to this population shift, I need to arrange the Pueblo data groups geographically East to West.  Ideally this would be done a s single row, but there are too many sites for that to work.

Additionally, there is the issue about site specific population change and overall population change and how that interacts over time.  This is why using two plots made sense here. The top plot to show that site specific percent change over time and the lower plot to show the net change across geography form the inception date of 1760. But enough about that, lets see some code!

Plot #1: Site specific percent population change over time

# plotting with ggplot2 (dev version from github)
# I started doing the [p1 <- p1 +] way to build these b/c others did and
# I liked how I could add selective lines while testing without changing anything
# so, it is more things to type, but whatever. It can be done in one ggplot() call if you like
# also, I code to get stuff done then optimize later. You may likey find ways to do this better

# Plot #1: Percent change from inception for each site
# set up data and axis
p1 <- ggplot(dat, aes(x = as.factor(Year), y = Scaled_Change, group = East))
# this is the bar plot part
p1 <- p1 + geom_bar(stat="identity", fill = "skyblue", color = "white")
# this is the redline outlining the bars
p1 <- p1 + geom_line(color = "firebrick1", alpha = 0.95)
# wrap it by the reordered factor so each individual site plots
p1 <- p1 + facet_wrap( ~ East_label, nrow = 2)
# add a reference line at zero
p1 <- p1 + geom_hline(yintercept = 0, linetype = 1, color = "gray35", size = 0.5)
# scale the y axis and calculate it as a percentage
p1 <- p1 + scale_y_continuous(breaks = c(seq(-1,3,1)), labels = scales::percent)
# thin out the dates in the x axis
p1 <- p1 + scale_x_discrete(breaks = c(1760, 1797, 1860, 1900, 1950))
# trusty old theme_bw() to prepare the look of the plot
p1 <- p1 + theme_bw()
# fine suggestion by @ClausWilke to make plot less busy
p1 <- p1 + background_grid(major = "xy", minor = "none")
# add the title, subtitle, and axis labels
# this is a new feature only available in the dev version of ggplot2
p1 <- p1 + labs(title="Percent Population Change from in New Mexico Pueblos from 1760 to 1950",
subtitle="Arranged from East to West",
x = "Year",
y = "Percent Population Change")
# all the things in the theme() section that stylize the plot and fonts
p1 <- p1 + theme(
strip.background = element_rect(colour = "white", fill = "white"),
strip.text.x = element_text(colour = "black", size = 7, face = "bold",
family = "Trebuchet MS"),
panel.margin = unit(0.3, "lines"),
panel.border = element_rect(colour = "gray90"),
axis.text.x = element_text(angle = 90, size = 6, family = "Trebuchet MS"),
axis.text.y = element_text(size = 6, family = "Trebuchet MS"),
axis.title = element_text(size = 8, family = "Trebuchet MS"),
# the three elements below are all new to the dev ersion
plot.caption = element_text(size = 8, hjust=0, margin=margin(t=5),
family = "Trebuchet MS"),

Aside from a pretty straightforward ggplot, some interesting notes are the use of labs() to set up the title, X, Y, sub-title, and caption text. ¬†This new feature that is currently in the development version is thanks to @hrbmstr and detailed in his recent post. ¬†Basically, just add your text at the right tag and style accordingly; instant badass plotting! ¬†The other thing is the use of custom fonts. ¬†This takes a small bit of setup and can be a pain on a Windows machine, but is explained well by @zevross on his site here. ¬†Everything else is a pretty basic design decision based on the data and intention. I used the line to make the trend jump out and used the bars to emphasize the space. ¬†I almost always start with theme_bw()¬†as a base. Oh, and I love skyblue; I can’t make a plot without it.

The second plot took the net percent change for just the 1950 data point for each pueblo.  This was an easy way to show net change over that time period and assign a single number to each site.  Plus I would not have to facet and it would fit very nicely below the previous plot.

Plot #2: Net percent population change per Pueblo

# Plot #2: Net change in population by site
# this uses a different data.frame from plot #1
# most of this is all the same as Plot #1, but not faceted.
# one trick here is to have two geom_bar calls, but the second is just data below the 0 line
p2 <- ggplot(data = net_change, aes(x = East_label, y = Scaled_Change, group = Year))
# first call to geom_bar for all data
p2 <- p2 + geom_bar(stat="identity", fill = "skyblue")
# second call to geom_bar for just negative data with a different color
p2 <- p2 + geom_bar(data = net_change[which(net_change$Scaled_Change < 0),],
stat="identity", fill = "skyblue4")
# a spline smooth to add the fit that shows the trend of population flow
p2 <- p2 + geom_smooth(data = transform(net_change),
method = "lm", formula = y ~ splines::bs(x, 3),
se = FALSE, color = "firebrick1", size = 0.5)
p2 <- p2 + geom_hline(yintercept = 0, linetype = 1, color = "gray35", size = 0.5)
p2 <- p2 + scale_y_continuous(breaks = c(seq(-1,3,0.5)), labels = scales::percent)
p2 <- p2 + theme_bw()
# fine suggestion by @ClausWilke to make plot less busy
p2 <- p2 + background_grid(major = "xy", minor = "none")
# same new features as described in plot #1
p2 <- p2 + labs(title="Net Percent Population Change from 1760 to 1950",
subtitle="Arranged from East to West",
x = NULL,
y = "Percent Population Change",
caption="Data: Zubrow, E.B.W. (1974), Population, Contact,and Climate in the New Mexican Pueblos")
p2 <- p2 + theme(
axis.text.x = element_text(angle = 90, size = 8, hjust = 1, family = "Trebuchet MS"),
axis.text.y = element_text(size = 7, family = "Trebuchet MS"),
axis.title = element_text(size = 8, family = "Trebuchet MS"),
# same new features as described in plot #1
plot.caption = element_text(size = 8, hjust=0, margin=margin(t=5),
family = "Trebuchet MS"),


This plot simply used that same bits as above for subtitle and the added caption.  The caption is only used once since this is the bottom graphic and they come from the same data.  Fonts are the same. Now to the cool stuff.

Final plot: Putting it together with cowplot

This is where the magic happens thanks to the cowplot package. Cowplot is a ggplot2 add-on that allows for the creation of well formatted and publication ready graphics straight out of R. It achieves this though offering some basic themes with standard and appealing colors, line weights, font sizes, etc… and the ability to arrange multiple plots in one graphic. ¬†I only used the plot_grid()¬†function to arrange the plots we made above, but the themes provided by cowplot are quite nice and you can also add annotation directly on top of your plots (e.g. DRAFT). This is really a great tool and I look forward to making it part of my practice.

<pre># now the magic of cowplot to combine the plots
# plot_grid does the arranging of p1 and p2
# [labels] gives identification to each plot for your figure caption
# [nrow] arrnges them in stacked rows
# [rel_heights] is really important. This allows for the top plot to be made larger (1.75 times)
# than the lower plot. If your plots are not realtively equal in size, you may need this
# I adjusted by trial and error
cplot1 <- plot_grid(p1, p2, labels = c("A", "B"), nrow = 2, rel_heights = c(1.75,1))
# get an idea of what it looks like
# save cowplot as an image!
# [base_width] and [base_height] allow of sizing the fina image Use in conjunction with [rel_heights]
save_plot("cowplot.png", cplot1, base_width = 8, base_height = 7.5)
# Done!

Final plot at top of post
The plot_grid() and save_plot() functions have some really key arguments that were invaluable here, specifically rel_width, rel_height for plot_grid() and base_width, base_height for save_plot(). Combining these, you can scale each graphic as a ration relative to the others and then adjust the overall size on save.  This was important here because the bottom graphic was only intended to be a narrow strip below the faceted graphic that needs more headroom.

And that is about it.  I wish I had come across some deep hack or new approach, but really I just took the tools that became available and put them together.  I hope some of the code and discussion above help you see how to use these tools together and create interesting and vivid depictions of your data.  If you have any  questions or comments about this code, please find me @md_harris


  • The code in once place at this Gist
  • There was a first pass at this data which resulted in the plot below. ¬†This was my first use of the new caption/subtitle feature. ¬†I know it was being worked on, but had to dig into github to find out that the pull request was accepted 4 hours before I set out to make the plot. ¬†Only after this plot did I decide to use a different approach to the data and then get into cowplot.
  • population
  • Final¬†note, if you are interested in these data and their interpretation, there is a lot more discussion to be had. ¬†This post is just about the visualization of it and it would take a textual context to justify the argument that is being conveyed here.


Visualizing Zubrow: ggplot2 & cowplot

Thoughts on Archaeological Predictive Modeling in the Northeast symposium (2016)

As described in my last post, I gave a talk at a symposium entitled Archaeological Predictive Modeling in the Northeast last Friday (3/11/16).  That post has a few notes, my slides, and abstract.  It was a really interesting symposium and I really enjoyed meeting new people from the Northeast archaeological community.  I hope they continue these sessions and continue to discuss these matters.

I have had a few days to reflect on the symposium and wanted to jot down a few of my thoughts.   These may come off as pessimistic, but that is only because I am optimistic in what quantitative methods such as predictive modeling can do for us in archaeology.  Perhaps a little frustrated  that we have such a long road before we get there.  I am not sure these thoughts will win me any new friends, but that is ok.  We need frank and informed discussion if we hope to maintain relevance and advance as a field.

TL;DR: skepticism + no metrics + exaggerated expectations = Na√Źve Falsifiability

Post Symposium Thoughts (03.13.2016):

  1. Skepticism.  The bookend talks and general tone in this symposium were skeptical of archaeological predictive models (APM) that venture beyond a basic weighted sum model and the VT paper checklist.   We should no be skeptical of trying to do better, but we are. I advocated being very self-skeptical of our models and that simple models are great when they fit the intended purpose.
  2. A lot of what was touched on or wished for as predictive modeling is really just basic descriptive analysis of archaeological data. It is hard to have a discussion about predictive modeling specifically if we are really talking about many different kinds of analysis.  Simple descriptive and low-dimensional analysis of the data we currently have would go very far in helping us understand what is data-supported and what is archaeological tradition.  Perhaps seeing the reality of what our site information tells us will help people feel more comfortable in making inference from it.
  3. The keynote was based on a 30 year old simple statistical hypothesis test about regional site locations; it looked quite interesting.  This analysis served as the basis of studies and policy for the next 30 years, however it was framed by its author as whizz-bang New Archaeology null hypothesis testing magic of the days gone by.  Why do we poke fun at the methods of an analysis that clearly had a major impact on how settlement patterns are interpreted in the region? It was a well done study and should be treated that way. Would it not have been productive to build off this over the past 20 years with additional quantitative studies in that region?
  4. While there was some discussion of model testing with field survey, there is no discussion of testing models with independent data.  This is not too shocking (perhaps it should be), but with arbitrarily weighted summed models, you are not really assessing fit or tuning parameters, so CV error or AIC or similar can be totally side-stepped. You certainly can and should test models of the basic summed type, but it is not frequently done.
    1. While simple summed models are fine and useful (I wrote a bit about the basic version here), there are simple ways to make them better; including accounting for a variables discrimination against the background (a WOE model was briefly mentioned in the symposium), proper thresholds, and hold-out sample testing.
  5. There seems to be a ton of cross-talk about what models should be capable of. ¬†My point was that they are only capable of what we can formalize as expectations, what our data can reasonably tell us, and what uncertainty can be incorporated into our methods. ¬†The general vibe was that simple predictive models are supposed to solve problems, find all sites, or result in understanding of causality. ¬†None of these things should be expected! ¬†What should be expected is that maybe, just maybe, we can find a weak signal in the covariates that we can measure and project to other areas and achieve a reasonable error rate. ¬†Alternatively, we can make strong assumptions, study our data, and make educated model based inference to gain a better understanding of data generating processes. ¬†Either way it is HARD FREAKIN’ WORK! ¬†This will not be solved with button-mashing GIS techniques.
  6. My fear is that the combination of these general feelings¬†results in a situation where APM and inferential archaeological modeling are doomed to be no more than a niche pursuit. ¬†Perhaps “nice pursuit” is a little strong since some field of archaeology are well tuned to quantitative methods and don’t have these issues to the same degree. ¬†Perhaps it is better to say that quantitative methods are no likely to be “mainstream” under these conditions. Was it ever mainstream or am I a Processual apologist? My point is that the combination of overblown/unreal expectations + lack of testing/metrics + general skepticism = a standard of na√Źve¬†falsificationism under which¬†no model is capable of succeeding. APM or really any archaeological model is not likely these days to be posited as a universal generalization from observed data. Therefore, there is not need for discarding these models based on examples that do not fit. A model should be presented as a set of assumptions, an estimate of how well the observed data fit those assumptions, what we can reasonably generalize¬†based on that fit.
    1. Simply, there are archaeological sites everywhere and we continue to prove this. ¬†That is about as much as a universal statement about archaeological data as I can muster. ¬†Because of this, the only model that is not falsifiable is a model where everything is site-likely and everywhere should be tested because we may miss a site. ¬†Funny enough, that was pretty much the agreed upon sentiment of this symposium; sites are everywhere, we can¬†not know why, and therefore everywhere needs testing. ¬†Frankly, ¬†I totally agree with this too. ¬†But, if we are so comfortable with that understanding, then why hold this symposium on APM? ¬†The short answer is because we cannot test everywhere, that there are some sites that we will never find, and that we intuitively know that structuring our observations though models can help us gain a new understanding; even if we don’t know how to get there. ¬†The reality is that we can either keep going around the same pony ride or we can accept the limitations of our data, understanding, and models and move forward with that to develop methods that work within those constraints.¬†The sooner we come to terms with that, the sooner we can begin to agree on how wrong a model can be and still be useful (to paraphrase Box 1976). ¬†Then we can add models as a productive part of our archaeological tool kit and make better informed decisions. Or we can just keep talking about it…






Thoughts on Archaeological Predictive Modeling in the Northeast symposium (2016)

A Statewide Archaeological Predictive Model of Pennsylvania: Lessons Learned

Tomorrow (March 11th, 2016)  I am giving a 20 minute presentation at the Northeast Archaeological Predictive Modeling Symposium, hosted by the New Hampshire Department of Transportation and the New Hampshire Division of Historical Resources.  My talk covers some of the broad lessons learned from creating the statewide Pennsylvania Archaeological Predictive Model Set.  This post is a place to share the presentation, link to download the presentation, link to download the reports (Volume 7 has the most concise wrap-up), and associated information.

As with most presentations, 20 minutes can barely scratch the surface of the advances that the PA Model made or the implications it has on the broader practice of archaeological predictive modeling, but it is a start.  Further, unfortunately the slides are presented as only graphics without the context of the symposium or my talk.  However, this should give some information to dwell on.  Finally, the animated gif of fits based on background samples (slide 14) is not animated in the PDF, but I posted about that very topic here.


A recently completed archaeological predictive model (APM) of the state of Pennsylvania provided an unprecedented opportunity to explore the current status of APM methods and extend them based on current methods derived from related scientific fields, medicine, and statistical computing.  Through this process many different types of models were created and tested for validity, predictive performance, and adherence to archaeological theory.  One result of this project is a comprehensive view of the problems that beset existing APM methodologies, solutions to some of these problems, and the nature of challenges that we will face going forward with new techniques.  Most, if not all of the findings of this project are applicable to the eastern deciduous United States, and much of the methodological scope is useful to APMs in any geography.   This paper will discuss the primary lessons learned from this project in regards to archaeological data, modeling methods, and theory, as well as touch on best-practices for future APM efforts.

~ Matthew D. Harris, AECOM – Burlington, NJ



  • I will do a series of posts on the PA Model to get some more details out into the wild.
  • I am not crazy about using LinkedIn’s, but it conveniently integrated with WordPress; so be it.
  • There are lots of hand-wavy treatments in this talk (and slides). ¬†I would like¬†to spend another 20 minutes talking about spatially explicit clustered K-folds CV strategies or Bayes inference, but we archaeologists have a lot of ground to cover and the basics of a more rigorous approach for identifying potential archaeological site locations needs to be discussed and agreed upon more widely across the field. ¬†Until then, the in-the-weeds technical stuff only adds to a rather¬†small¬†echo chamber.
  • If you have any questions about this topic, please ask!


A Statewide Archaeological Predictive Model of Pennsylvania: Lessons Learned

Background Sampling & Variance in Logistic Regression Models


In preparing for a talk on Archaeological Predictive Modeling (APM) next week in Concord, NH I made some graphics to illustrate a point.  A blog post is born.

Cutting to it… If you want to use logistic regression to build an APM you are using site locations as the positive class (site present) and some form of the environmental background as the negative class (site absent). ¬†How you conceptualize site absence is a bigger fish that I have time to fry here, but for the sake of conversation these models are taking random selections of 10 meter x 10 meter non-site cells to simulate the general environmental background. It is important to note that using the locations of surveys that failed to identify archaeology sites is often problematic because of location/selection bias, imperfect detection, or just lack of surveys. ¬†For these reasons, we need to make assumptions about how the background is represented. So making the assumptions that we often do raises one big problem:

How many samples are needed to adequately characterize the background?

Good question.  The answer depends on many things, but chiefly it depends on how many site samples you have and how diverse your environment is.  Use as many as possible?  Problem is given that the occurrence of archaeological sites are typically low prevalence events on the landscape, the more background samples we use per site sample negatively effects our sample imbalance.  Sites are already in the minority, so adding more background to get a better characterization creates a larger imbalance and wreaks havoc as we will see.  Ok, so use a balanced number of background samples, 1:1?  That would be nice for the imbalance issue, but the fewer samples, the less we are to accurately characterize any relatively diverse environment. Fine then, how about 3:1 background to site samples? Sounds like a pretty good choice, lets start with a simple model:


Simple logistic regression with site presence/absence as the response, distance to water (ed_h6) and a measure of slope variation (slp_32c) as the dependent variables, and epsilon is assumed to be normally distributed and iid (but its not, lets not kid ourselves). The plot below shows the response curves varying across each variable as the 0.05, 0.5, and 0.95 quantiles of the variable named in the upper panel.  Pretty weak signal, but there is something here. Basically, the probability decreases mainly with increased slope variation and more subtly with distance from water.  That fits ok with pre-contact settlement in this area; give or take.   The second image shows how they effect probability together.predict_by_quantile


Cool, models done.  Nope.  This outcome is based on a single draw of 597 background samples from a 3:1 balance for the 199 site location samples.  What happens if we try again with another random sample of 597 background samples?  All hell breaks loose!  The gif below shows the response curve for the 0.05 quantile for distance to water over the range of slope variability for 100 different draws of 597 random background points.  It is clear that taking a different sample may lead to a very different model outcome. Logically, this should not be shocking, but relying on a single relatively small random draw for a background sample is a rather defacto approach to APM.  Unfortunately, simply taking a larger sample does not solve all our problems.


The gif shows essentially the same as above, but varying across different ratios of background to site samples. Starting with 3:1 on the left and increasing to 20:1 on the right.  As the ratio of background to sites increases, the variance between models decreases (less erratic fit lines), but the overall increased imbalance of background to sites leads to reduced overall probability. The lower variance is a good thing, but the overall lower probability is not.

Background samples by mutliplier

Ways to address these competing issues include finding the right ratio balance, modeling techniques that address highly-imbalanced data, bootstrap sampling of the background, or model averaging; in addition to different algorithms and model specifications.  Hopefully this illustrates the danger of relying on a single random draw to represent site-absence locations and the utility of criticizing you model results against multiple draws.


  • The top two graphics are not actually the simple logistic regression example. ¬†These are drawn from a Bayesian HMC GLM with priors \mathcal{N}(0, 1), but the response is pretty much the same. ¬†The uncertainty of 1000 posterior draws of a single 3:1 background sample is presented here:
  • Bayes_median_ribbon_samples.png
  • Clearly, the effect of ed_h6 and std_32c on presence/absence given these assumptions and mode is not great. ¬†However, these two variables are well ingrained in temperate area settlement theory and recognizable to any archaeologists. ¬†They make a good example for discussion.
Background Sampling & Variance in Logistic Regression Models