Gaussian Process Hyperparameter Estimation

Quick Way longer then expected post and some code for looking into the estimation of kernel hyperparameters using STAN HMC/MCMC and R.  I wanted to drop this work here for safe keeping. Partially for the exercise of thinking it through and writing it down, but also because it my be useful to someone.  I wrote a little about GP in a previous post, but my understanding is rather pedestrian, so these explorations help.  In general GPs are non-linear regression machines that utilize a kernel to reproject your data into a larger dimensional space in order to represent and better approximate the function we are targeting.  Then using a covariance matrix calculated from that kernel, a multivariate Gaussian posterior is derived.  The posterior can then be used for all of the great things that Bayesian analysis can do with a posterior.

Read lots more about GP here…. Big thanks to James Keirstead for blogging a bunch of the code that I used under the hood here and thanks to Bob Carpenter (github code) and the Stan team for great software with top-notch documentation.

code:

The R code for all analysis and plots can be found in a gist here, as well as the three Stan model codes, here gp-sim_SE.stangp-predict_SE.stan,and GP_estimate_eta_rho_SE.stan

 

sin2_header

The hyperparameters of topic here are parameters of the kernel within the GP algorithm.  As with other algorithms that use kernels, a number of functions can be used based on the type of generative function you are approximating.  The most commonly used kernel function for GP (and seemingly Support Vector Machines) is the Squared Exponential (SE), also known as the Radial Basis Function (RBF), Gaussian, or Exponentiated Quadratic function.

The Squared Exponential Kernel

The SE kernel is a negative length scale factor rho (\rho^2) times the square distance between data points ((x -x^\prime)^2) all multiplied by a scale factor eta (\eta).  Rho is a shorthand for the length scale which is often written as a denominator as shown below. Eta is a scale factor that determines how far the function varies from the mean. Finally sigma squared (sigma^2_{noise}) at the end is the value for the diagonal elements of the matrix where (i = j). This last term is not necessarily part of the kernel, but is instead a jitter term to set zero to near zero for numeric reasons.  The matrix created by this function is positive semi-definite and composed of the distance between observations scaled by rho and eta. Many other kernels (linear, periodic, linear * periodic, etc…) can be used here; see the kernel cookbook for examples.

k_{x,x^\prime}=\eta^2 \exp(-\rho^2 (x -x^\prime)^2)+\delta_{ij}\sigma^2_{noise} \\ \rho^2 = 1/(2\ell^2) \\ \delta_{ij}\sigma^2_{noise}=0.0001

To Fix or to Estimate?

In this post, models are created where  sigma^2_{noise}, \rho^2 , and \eta^2 are all fixed, as well as a model where  sigma^2_{noise} is fixed and \rho^2 and \eta^2 are free.  In the MCMC probabilistic framework, we can fix \rho^2 and \eta^2 or any parameter for the most part, or estimate them. To this point, there was a very informative and interesting discussion on stan-users mailing list about why you might want to estimate the SE kernel hyperparameters.  The discussion generally broke across the lines of A) you don’t need to estimate these, just use relatively informative priors based on your domain knowledge, and B) of course you want to estimate these because you may be missing a large chunk of function space and uncertainty if you do not. The conclusion to the thread is a hedge to try it both ways, but there are great bits of info it there regardless.

mc

So while the greatest minds in Hamiltonian Monte Carlo chat about it, I am going to just naively work on the Stan code to do these estimations and see where it takes me.  Even if fixed with informative priors is the way to go, I at least want to know how to write/execute the model that estimates them. So here we go.

Continue reading “Gaussian Process Hyperparameter Estimation”

Gaussian Process Hyperparameter Estimation

Gaussian Process in Feature Space

[In the last post, I talked about showing various learning algorithms from the perspective of predictions within the “feature space”. Feature space being a perspective of the model that looks at the predictions relative to the variables used in the model, as opposed geographic distribution or individual performance metrics.

I am adding to that post the Gaussian Process (GP) model and a fun animation Or two.

sin4_header

So what’s a GP?

The explanation will differ depending on the domain of the person you ask but essentially a GP is a mathematical construction that uses a multivariate Gaussian distribution to represent an infinite number of functions the describe your data, priors, and covariance. Each data point gets Gaussian distribution and these are all jointly represented as the multivariate Gaussian. The choice of the Gaussian is key to this as it makes the problem tractable based on some very convenient properties of the multivariate Gaussian. When referring to a GP one may be talking about regression, classification, a fully Bayesian method, some form or arbitrary functions, or some other cool stuff I am too dense to understand. That is a broad and overly-simplistic explanation, but I am overly-simplistic and trying to figure it out myself.  There is a lot of great material on the web and videos on youtube to learn about GP.

So why use it for archaeology?

GP’s allow for the the estimation of stationary nonlinear continuous functions across an arbitrary dimensional space; so it has that going for it… which is nice. But more to the point, it allows for the selection of a specific likelihood, priors, and a covariance structure (kernel) to create a highly tunable probabilisitic classifier.  The examples below use the kernlab::gausspr() function to compute the GP using Laplace approximation with a Gaussian noise kernel.  The \sigma hyperparameter of this Gaussian kernel is what is being optimized in the models below.  There are many kernels that can be used in modeling a GP.  One of the ideal aspects of a GP for archaeological data is that the continuous domain of the GP can over geographic space taking into consideration natural correlation in out samples. This has strong connections to Kriging as a form of spatial regression in the geostatistical realm, as well as relation to the SVM kernel methods of the previous post.

gopher-48499_640.png

On to the plots…

Continue reading “Gaussian Process in Feature Space”

Gaussian Process in Feature Space

To Feature Space and Beyond!

There are many of ways to look at predictive models.  Assumptions, data, predictions, features, etc… This post is about looking at predictions from the feature space perspective.  To me, at least for lower dimensional models, this is very instructive and illuminates aspects of the model (aka learning algorithm) that are not apparent in the fit or error metrics. Code for this post is in this gist.

IV_header2

Problem statement: I have a mental model, data, and a learning algorithm from which I derive predictions.  How well do the predictions fit my mental model? I can use error metrics from training and test sets to asses fit. I can use regularization and information criteria to be more confident in my fit, but these approaches only get me so far. [A Bayesian perspective offers a different perceptive on these issues, but I am not covering that here.] What I really want is a way to view how my model and predictions respond across the feature space to assess whether it fits my intuition. How do I visualize this?

Space: Geographic Vs. Feature

When I say feature space, I mean the way our data points align when measured by the variables/features/covariates that are used to define the target/response/dependent variable we are shooting for.  The easy counter to this is to think about geographic space.  Simply, geographic space is our observations measured by X and Y.  The {X,Y} pair is the horizontal and vertical coordinate pair (latitude, longitude) that put a site one the map.

site_location.png

So simply, geographic space is our site sample locations on a map.  The coordinates of the 199 sites (n=13,700 total measurements) are manipulated to protect site locations.  These points are colored by site.

Feature space?  Feature space is how the sites map out based not on {X,Y}, but based on the measure of some variables at the {X,Y} locations. In this case, I measure two variables at each coordinate, x_1 = cd_conf, anf x_2 = tpi_250c.  cd_conf is a cost distance to the nearest stream confluence with respect to slope and tpi_250c is the topographic position index over a 250 cell neighborhood.   If we map {x_1, x_2} as our {X,Y} coordinates, we end up with…

site_feature_space

FEATURE SPACE!!!!!!!!! If you are familiar with this stuff, you may look at this graphic and say “Holy co-corellation Batman!”.  You would be correct in thinking this.  As each site is uniquely colored, it is apparent that measurements on most sites have a positively correlated stretch to them.  This is because the environment writ large has a correlation between these two variables; sites get caught up in the mix.  This bias is not fully addressed here, but is a big concern that should be addressed in real modeling scenarios.

Either way, feature space is just remapping the same {X,Y} points into the dimension of cd_conf and tpi_250c. Whereas geographic space shows a relatively dispersed site distribution, the feature space shows us that sites are quite clusters around low cd_conf and low tpi_250c.  Most sites are more proximal to a stream confluence and at lower topographic positions that the general environmental background.  Sure, that makes some sense.  So what… Continue reading “To Feature Space and Beyond!”

To Feature Space and Beyond!

SAA 2015 – Pennsylvania Predictive Model Set – Realigning Old Expectations with New Techniques in the Creation of a Statewide Archaeological Sensitivity Model

Quick post to archive my 2015 Society for American Archaeology (SAA) meeting presentation (San Francisco, CA).  Slide Share is at bottom of post or download here:  MattHarris_SAA2015_final

This presentation was all about the completion of the Pennsylvania Predictive model and some post-project expansion with a new testing scheme and the Extreme Gradient Boosting (XGB) classifier.

The presentation starts with a bit about the context for Archaeological Predictive Modeling (APM) and the basics of the machine learning approach. I call it Machine Learning (ML) here (because  I was in San Francisco after all), but I generally think of it as Statistical Learning (SL).  The slight shift in perspective is based on the focus of SL on the assumptions and statistical models, where as ML is more oriented to applied maths, comp sci., and DEEP LEARNING OF BIG DATA!  Just depending on how you want to look at it.

The presentation moves to understanding the characteristics of archaeological data that make it unique and challenging.  I think this is a critical area that gets so glossed over and offers so many excuses for us to not pursue model based approaches.  Okay, yes, our data kind of stink most of the time.  Let’s accept that, plan for it, and move along.accept_data

After my typical lecture on how the Bias/Variance trade-off should keep you up at night, I go into schematic descriptions of the learning algorithms: Logistic Regression, Multivariate Adaptive Regression Splines (MARS), Random Forest, and XGB. Then try to show how each algorithm, regardless of how “fancy”, can be conceptualized in a
“simple” way. The remainder of the presentation is a tour of prediction metrics for the four models applied to a portion of the state.

Unfortunately, this portion was only developed after the project had completed.  This is partially because of the timing of the contract, but also because some of these methods were not developed until later in the project, and by that time, I needed to follow the same general methods that the project started with for consistency.

The two big take aways from this part of the presentation are that 1) XGB “won” the model bake-off as it led to the lowest error on independent site sample across most sub-regions and sub-samples.  It was the most consistent and accurate (to the positive class) learner of the four; and 2) error can be viewed in two important ways, a) percent of observations within sites that are correctly classified and b) the percent of sites that are correctly classified.  Since each site is recorded as a measurement of each ~10×10-meter cell in a site, our error measurement can go either way.  If I say there is a 20% error rate, does that mean the 20% of each site is misclassified or that 20% of all sites are misclassified.  That is a subtle, but important distinction. The methods here calculate both aspect and then combine  both measures into a (poorly named) measure called Gain or Balance.  The penultimate slide gives a bunch of views of these metrics across the entire study area.

All in all, I am relatively  proud of this presentation in that it is the culmination of 2 years of intensive work that addressed many issues in APM that existed for 20 years.  It got over that hump and found a bunch of new issues that are a bit more contemporaneous with SL/ML/general modeling approaches.  Some interesting ways to view prediction error were developed, and they were visualized in a way that (at least to me) is pretty satisfying.  Let me know what you think!

pred

The Presentation:

 

SAA 2015 – Pennsylvania Predictive Model Set – Realigning Old Expectations with New Techniques in the Creation of a Statewide Archaeological Sensitivity Model

Boxplot or not to Boxplot: WoE-ful Example

IV_header

Edited 04/01: Added Tufte boxplots and Violin plots to bottom of post (Gist code updated)!

How to visualize numerous distributions where quantiles are important and relative ranking matters?

This post is inspired by two things

  1. I am working with the R packaged called Information to write some Weights of a Evidence (WoE) based models and needed to visualize some results;
  2. While thinking about this, I listed to the @NSSDeviations podcast and heard Hilary Parker (@hspter) say  that she had little use for boxplots anymore, but Rodger Peng (@rdpeng) spoke in support for the boxplots. A new ggplot2 vs. base, did not ensue…

Having heard the brief but serendipitous exchange by Rodger and Hilary, I got to thinking about what other approach I could use instead of my go-to boxplot.  Below I will go through a few different ways to visualize the same results and maybe a better approach will be evident. The R code for all that follows [except raw data… sorry] is at this Gist. Let me know what you think!

Data

I go back to WoE pretty frequently as an exploratory technique to evaluate univariate variable contribution by way of discrimination. I have a bit of a complicated relationship with WoE as I love the utility and simplicity, but I don’t see it as the final destination. Archaeology has been dabbling in WoE for a while (e.g. Strahl, 2007) as it is a comfortable middle ground between being blindly adherent to a narrow range of traditional variables  and quantifying discrimination ability of these variables in the binary (presence/absence) case.  There is also the push-button access to a spatial WoE tool in ArcGIS that has added to its accessibility. But this post isn’t about that; I will finish that rant post another time.

Continue reading “Boxplot or not to Boxplot: WoE-ful Example”

Boxplot or not to Boxplot: WoE-ful Example

R-chaeology @ SAA2016 – Beer and Data meetup

r-chaeology_oc_final_cropArchaeology’s two most favorite things; Beer and Data.  Put them together with the world’s most useful scientific data modeling and analysis language (of course I mean R) and it could have loads of fun!  To that end, I am putting this out there to propose a time and place for an unofficial R + Beers get together at the the Society for American Archaeology annual meeting in Orlando, FL.

Update 4/20: Zachary Cofran and David Pappano put on a great R-stats work shop at the American Physical Anthropologists meeting in Atlanta the week following #Rchaeology.  Here is a link to a post about their code/instruction

Update 4/8: Location!  There is a indoor/outdoor court yard at the north end of the Swan hotel.  If is just in from the boardwalk that separates the Swan and the Dolphin.  If it is too dark or noisy (fountain) we will move inside where there are couches & chairs.  See map below

Update 3/24: Open Context has graciously stepped in to support Rchaeology and provide refreshments (and some credibility!) to this little event.  A little bit about this organization:

Open Context reviews, edits, annotates, publishes and archives research data and digital documentation.

We publish your data and preserve it with leading digital libraries. We take steps beyond archiving to richly annotate and integrate your analyses, maps and media. This links your data to the wider world and broadens the impact of your ideas.

Keep an eye here for details as they emerge.  I will tweet with #Rchaeology2016  #Rchaeo2016

The Knows

  • Sponsored by Open Context!
  • In or about the indoor/outdoor courtyard in Swan hotel (see map below)
  • Saturday evening around 8pm to 10pm (early flight Sunday)
  • No R/Data/Stats experience necessary!
  • Unstructured, BYOData, hack & nerd-out
  • Will certainly conflict with many other happening; sorry.
  • Great show of interest so far!
  • Liquid refreshements courtesy of Open Context!
  • Guest appearances
    • Jesse Wolfhagen et al. to show off their Zooarchaeology package and talk Bayes Mixed Models.
    • Amy Fox may swing by to demonstrates some morphometrics

 

Be Prepared!

Ok, there is actually nothing for anyone to prepare; it’s just the Boy Scout motto.  But, if you are interested in preparing for things, here are some things to try.

  • Bring whatever data you are currently working on and we can take it for a spin.
  • If you want to bring you laptop and play along, the following are all you need to get started:

Disclaimer:

This is a bit of a crazy and impromptu idea that may fail go in any number of directions… by which I mean me sitting with a beer and a laptop running code; so pretty much a normal night.  If it does work and a few people can get together to talk R and data that it will be well worth it.  Collaboration, serendipity, and learning are guaranteed. I’ll update this post as new details come to light.  Feel free to comment, tweet, or DM if you are interested or have ideas!

R-chaeology @ SAA2016 – Beer and Data meetup

Visualizing Zubrow: ggplot2 & cowplot

Important note: This only works with the Development version of ggplot2! (as of 3/18/2016). Find this at Github

I was planning on making a whole post about my fascination for Ezra B. Zubrow’s work, but in the interim I put together some fun plots of the data I plan to write about.  I am putting this post together to get that code out there and show a few cool new things that ggplot2 v2.0+ allow for.  Everything that follows is 110% thanks to the hard work of @hadleywickham for creating the incredible ggplot2, @hrbrmstr for adding subtitles/captions, @ClausWilke for the wonderful cowplot package.

The plot/graph/data viz/pretty picture that I tweeted out last night was my first draft of presenting population migration data from Zubrow’s 1974 Population Contact Climate New Mexican Pueblos. Below I will present the R code and go through a few of the steps that pulled it together. [link to code in Gist]

cowplot.png

The process to make this was actually very easy thanks to the strides of those mentioned above. It is all really just a straightforward use of their code.

  1. Data Preparation (Zubrow 1974, R and dplyr)
  2. Decide how to organize and present to show intentions (brain)
  3. Plot each panel individually (ggplot2)
  4. Stick plots together (cowplot)
  5. Get some more ice cream (spoon & bowl)

The data prep started with manually lifting the data from the publication (Zubrow 1974: pg 68-69, Table 8) and structuring it in such a way to minimize repetition and get it ready for plotting.  I am not a very efficient coder, so other will likely find ways to do this better.  This follows the “wide data” format, which is basically adding columns of variables to a rows which each contain a single observation (population).  See here and here  and here for discussion of various “tidy” data formats.

library("ggplot2") # Must use Dev version as of 03/18/16
library("gridExtra")
library("extrafont") # for font selection
library("dplyr") # for data preperation
library("cowplot") # for combining plots

# Prepare data for plotting
# data from Zubrow, E.B.W. (1974), Population, Contact,and Climate in the New Mexican Pueblos
# prepared as a long format to facilitate plotting
year <- c(1760, 1790, 1797, 1850, 1860, 1889, 1900, 1910, 1950)
sites <- c("Isleta", "Acoma", "Laguna", "Zuni", "Sandia", "San Felipe",
"Santa Ana", "Zia", "Santo Domingo", "Jemez", "Cochiti",
"Tesuque", "Nambe", "San Ildefonso", "Pojoaque", "Santa Clara",
"San Juan", "Picuris", "Toas")
# [sites] are arranged in South to North geographic order
south <- c(seq_along(sites))
# add index to rearrange geographically from East to West
east <- c(14, 18, 17, 19, 12, 11, 13, 15, 10, 16, 9, 4, 3, 7, 5, 8, 6, 2, 1)
# population figures
pop <- c(
c(304, 410, 603, 751, 440, 1037, 1035, 956, 1051),
c(1052, 820, 757, 367, 523, 582, 492, 691, 1376),
c(600, 668, 802, 749, 929, 970, 1077, 1472, 1655),
c(664, 1935, 2716, 1294, 1300, 1547, 1525, 1667, 2564),
c(291, 304, 116, 241, 217, 150, 81, 73, 150),
c(458, 532, 282, 800, 360, 501, 515, 502, 721),
c(404, 356, 634, 339, 316, 264, 228, 219, 285),
c(568, 275, 262, 124, 115, 113, 115, 109, 145),
c(424, 650, 483, 666, 262, 930, 771, 817, 978),
c(373, 485, 272, 365, 650, 474, 452, 449, 789),
c(450, 720, 505, 254, 172, 300, 247, 237, 289),
c(232, 138, 155, 119, 97, 94, 80, 80, 145),
c(204, 155, 178, 107, 107, 80, 81, 88, 96),
c(484, 240, 251, 319, 166, 189, 137, 114, 152),
c(99, 53, 79, 48, 37, 18, 12, 16, 2),
c(257, 134, 193, 279, 179, 187, 222, 243, 511),
c(316, 260, 202, 568, 343, 373, 422, 388, 152),
c(328, 254, 251, 222, 143, 120, 95, 104, 99),
c(505, 518, 531, 361, 363, 324, 462, 517, 842)
)
# combine above in a data.frame where each population is the key observation
dat <- data.frame(Year = rep(year, length(sites)),
Site = rep(sites, each = length(year)),
Population = pop,
South = rep(south, each = length(year)),
East = rep(east, each = length(year)))
# add a factor that arranges the levels of site by their East to West geographic location
dat$East_label <- rep(factor(sites, levels = (sites[order(east)])), each = length(year))

# Calculate quantities of interest
# this ibe-liner calculates the date to date change in population
# ave() groups and runs the diff() function that is concatenated with a 0 for the first observation in each group
# probably can make this dplyr, but it works...
dat$Change <- ave(dat$Population, dat$Site, FUN = function(x) c(0, diff(x)))
# using dplyr to group the observations by site/Pueblo and calcualte stuff
dat <- group_by(dat, Site) %>%
# change from inception date of 1760 (first data point)
mutate(Relative_Change = Population - Population[1]) %>%
# percent change of log population (not used in this graphic)
mutate(Percent_Change = (Population - lag(Population))/lag(Population)) %>%
# change from inception scaled to % change from inception
mutate(Scaled_Change = (Relative_Change / Population[1])) %>%
# make into a data.frame
data.frame()

# select only the rows that are the last time period observation of 1950
# this is for the second plot
net_change <- dat[which(dat$Year == 1950),]

Plot Intentions (a.k.a skip down if you just want code)

Now that the data is entered, formatted, and the quantities of interest are calculated, its time to start plotting.  I imagine everyone does it this way, but my simple method for building a ggplot is simply to start simple and then add in calls for ever finer detail.  Here is an amazing tutorial on building a ggplot that really covers all the bases.  I could add little to that link as far as approach and basics, however I will add that going from the basics to the plot you see in your head can be quite difficult!  Most of the “issues” I run into making a plot that is remotely complicated are almost always because I didn’t think through what I really want to see and what the technical implications of that will be.   All you smart people probably won’t have that problem, but for me there has been a proportional shift to increased time spent thinking about a plot versus a decrease in time spent coding  a plot as I get more experience.

The intention of this plot is to support an argument made by Zubrow that there was a 18th to 20th century population shift in Pueblo society based on cultural contact and climate change.  Of course it will take more that a single graphic to make this point, but a single graphic would go a long way to help a reader see it.  To that end, I first decided on a graphic that would be focused on the percent change from the date of the first data point at 1760.  I wanted to use percent change to scale the y axis a bit more evenly since some of Pueblos had total populations that dwarfed the smaller pueblos and decreased the visual impact of the trend. Further, period over period change was pretty variable and did not tell the story. Secondly, since there is a geographic component to this population shift, I need to arrange the Pueblo data groups geographically East to West.  Ideally this would be done a s single row, but there are too many sites for that to work.

Additionally, there is the issue about site specific population change and overall population change and how that interacts over time.  This is why using two plots made sense here. The top plot to show that site specific percent change over time and the lower plot to show the net change across geography form the inception date of 1760. But enough about that, lets see some code!

Plot #1: Site specific percent population change over time

# plotting with ggplot2 (dev version from github)
# I started doing the [p1 <- p1 +] way to build these b/c others did and
# I liked how I could add selective lines while testing without changing anything
# so, it is more things to type, but whatever. It can be done in one ggplot() call if you like
# also, I code to get stuff done then optimize later. You may likey find ways to do this better

# Plot #1: Percent change from inception for each site
# set up data and axis
p1 <- ggplot(dat, aes(x = as.factor(Year), y = Scaled_Change, group = East))
# this is the bar plot part
p1 <- p1 + geom_bar(stat="identity", fill = "skyblue", color = "white")
# this is the redline outlining the bars
p1 <- p1 + geom_line(color = "firebrick1", alpha = 0.95)
# wrap it by the reordered factor so each individual site plots
p1 <- p1 + facet_wrap( ~ East_label, nrow = 2)
# add a reference line at zero
p1 <- p1 + geom_hline(yintercept = 0, linetype = 1, color = "gray35", size = 0.5)
# scale the y axis and calculate it as a percentage
p1 <- p1 + scale_y_continuous(breaks = c(seq(-1,3,1)), labels = scales::percent)
# thin out the dates in the x axis
p1 <- p1 + scale_x_discrete(breaks = c(1760, 1797, 1860, 1900, 1950))
# trusty old theme_bw() to prepare the look of the plot
p1 <- p1 + theme_bw()
# fine suggestion by @ClausWilke to make plot less busy
p1 <- p1 + background_grid(major = "xy", minor = "none")
# add the title, subtitle, and axis labels
# this is a new feature only available in the dev version of ggplot2
p1 <- p1 + labs(title="Percent Population Change from in New Mexico Pueblos from 1760 to 1950",
subtitle="Arranged from East to West",
x = "Year",
y = "Percent Population Change")
# all the things in the theme() section that stylize the plot and fonts
p1 <- p1 + theme(
strip.background = element_rect(colour = "white", fill = "white"),
strip.text.x = element_text(colour = "black", size = 7, face = "bold",
family = "Trebuchet MS"),
panel.margin = unit(0.3, "lines"),
panel.border = element_rect(colour = "gray90"),
axis.text.x = element_text(angle = 90, size = 6, family = "Trebuchet MS"),
axis.text.y = element_text(size = 6, family = "Trebuchet MS"),
axis.title = element_text(size = 8, family = "Trebuchet MS"),
# the three elements below are all new to the dev ersion
plot.caption = element_text(size = 8, hjust=0, margin=margin(t=5),
family = "Trebuchet MS"),
plot.title=element_text(family="TrebuchetMS-Bold"),
plot.subtitle=element_text(family="TrebuchetMS-Italic")
)

zubrow_cowplot_upper
Aside from a pretty straightforward ggplot, some interesting notes are the use of labs() to set up the title, X, Y, sub-title, and caption text.  This new feature that is currently in the development version is thanks to @hrbmstr and detailed in his recent post.  Basically, just add your text at the right tag and style accordingly; instant badass plotting!  The other thing is the use of custom fonts.  This takes a small bit of setup and can be a pain on a Windows machine, but is explained well by @zevross on his site here.  Everything else is a pretty basic design decision based on the data and intention. I used the line to make the trend jump out and used the bars to emphasize the space.  I almost always start with theme_bw() as a base. Oh, and I love skyblue; I can’t make a plot without it.

The second plot took the net percent change for just the 1950 data point for each pueblo.  This was an easy way to show net change over that time period and assign a single number to each site.  Plus I would not have to facet and it would fit very nicely below the previous plot.

Plot #2: Net percent population change per Pueblo

# Plot #2: Net change in population by site
# this uses a different data.frame from plot #1
# most of this is all the same as Plot #1, but not faceted.
# one trick here is to have two geom_bar calls, but the second is just data below the 0 line
p2 <- ggplot(data = net_change, aes(x = East_label, y = Scaled_Change, group = Year))
# first call to geom_bar for all data
p2 <- p2 + geom_bar(stat="identity", fill = "skyblue")
# second call to geom_bar for just negative data with a different color
p2 <- p2 + geom_bar(data = net_change[which(net_change$Scaled_Change < 0),],
stat="identity", fill = "skyblue4")
# a spline smooth to add the fit that shows the trend of population flow
p2 <- p2 + geom_smooth(data = transform(net_change),
method = "lm", formula = y ~ splines::bs(x, 3),
se = FALSE, color = "firebrick1", size = 0.5)
p2 <- p2 + geom_hline(yintercept = 0, linetype = 1, color = "gray35", size = 0.5)
p2 <- p2 + scale_y_continuous(breaks = c(seq(-1,3,0.5)), labels = scales::percent)
p2 <- p2 + theme_bw()
# fine suggestion by @ClausWilke to make plot less busy
p2 <- p2 + background_grid(major = "xy", minor = "none")
# same new features as described in plot #1
p2 <- p2 + labs(title="Net Percent Population Change from 1760 to 1950",
subtitle="Arranged from East to West",
x = NULL,
y = "Percent Population Change",
caption="Data: Zubrow, E.B.W. (1974), Population, Contact,and Climate in the New Mexican Pueblos")
p2 <- p2 + theme(
axis.text.x = element_text(angle = 90, size = 8, hjust = 1, family = "Trebuchet MS"),
axis.text.y = element_text(size = 7, family = "Trebuchet MS"),
axis.title = element_text(size = 8, family = "Trebuchet MS"),
# same new features as described in plot #1
plot.caption = element_text(size = 8, hjust=0, margin=margin(t=5),
family = "Trebuchet MS"),
plot.title=element_text(family="TrebuchetMS-Bold"),
plot.subtitle=element_text(family="TrebuchetMS-Italic")
)

zubrow_cowplot_lower

This plot simply used that same bits as above for subtitle and the added caption.  The caption is only used once since this is the bottom graphic and they come from the same data.  Fonts are the same. Now to the cool stuff.

Final plot: Putting it together with cowplot

This is where the magic happens thanks to the cowplot package. Cowplot is a ggplot2 add-on that allows for the creation of well formatted and publication ready graphics straight out of R. It achieves this though offering some basic themes with standard and appealing colors, line weights, font sizes, etc… and the ability to arrange multiple plots in one graphic.  I only used the plot_grid() function to arrange the plots we made above, but the themes provided by cowplot are quite nice and you can also add annotation directly on top of your plots (e.g. DRAFT). This is really a great tool and I look forward to making it part of my practice.

</pre>
<pre># now the magic of cowplot to combine the plots
# plot_grid does the arranging of p1 and p2
# [labels] gives identification to each plot for your figure caption
# [nrow] arrnges them in stacked rows
# [rel_heights] is really important. This allows for the top plot to be made larger (1.75 times)
# than the lower plot. If your plots are not realtively equal in size, you may need this
# I adjusted by trial and error
cplot1 <- plot_grid(p1, p2, labels = c("A", "B"), nrow = 2, rel_heights = c(1.75,1))
# get an idea of what it looks like
plot(cplot1)
# save cowplot as an image!
# [base_width] and [base_height] allow of sizing the fina image Use in conjunction with [rel_heights]
save_plot("cowplot.png", cplot1, base_width = 8, base_height = 7.5)
# Done!

Final plot at top of post
The plot_grid() and save_plot() functions have some really key arguments that were invaluable here, specifically rel_width, rel_height for plot_grid() and base_width, base_height for save_plot(). Combining these, you can scale each graphic as a ration relative to the others and then adjust the overall size on save.  This was important here because the bottom graphic was only intended to be a narrow strip below the faceted graphic that needs more headroom.

And that is about it.  I wish I had come across some deep hack or new approach, but really I just took the tools that became available and put them together.  I hope some of the code and discussion above help you see how to use these tools together and create interesting and vivid depictions of your data.  If you have any  questions or comments about this code, please find me @md_harris

Notes:

  • The code in once place at this Gist
  • There was a first pass at this data which resulted in the plot below.  This was my first use of the new caption/subtitle feature.  I know it was being worked on, but had to dig into github to find out that the pull request was accepted 4 hours before I set out to make the plot.  Only after this plot did I decide to use a different approach to the data and then get into cowplot.
  • population
  • Final note, if you are interested in these data and their interpretation, there is a lot more discussion to be had.  This post is just about the visualization of it and it would take a textual context to justify the argument that is being conveyed here.

 

Visualizing Zubrow: ggplot2 & cowplot