Background Sampling & Variance in Logistic Regression Models


In preparing for a talk on Archaeological Predictive Modeling (APM) next week in Concord, NH, I made some graphics to illustrate a point.  A blog post is born.

Cutting to it… If you want to use logistic regression to build an APM, you are using site locations as the positive class (site present) and some form of the environmental background as the negative class (site absent).  How you conceptualize site absence is a bigger fish than I have time to fry here, but for the sake of conversation these models take random selections of 10 meter x 10 meter non-site cells to simulate the general environmental background. It is important to note that using the locations of surveys that failed to identify archaeological sites is often problematic because of location/selection bias, imperfect detection, or just a lack of surveys.  For these reasons, we need to make assumptions about how the background is represented. Making the assumptions that we often do raises one big problem:

How many samples are needed to adequately characterize the background?

Good question.  The answer depends on many things, but chiefly it depends on how many site samples you have and how diverse your environment is.  Use as many as possible?  The problem is that, since archaeological sites are typically low prevalence events on the landscape, the more background samples we use per site sample, the worse our sample imbalance becomes.  Sites are already in the minority, so adding more background to get a better characterization creates a larger imbalance and wreaks havoc, as we will see.  Ok, so use a balanced number of background samples, 1:1?  That would be nice for the imbalance issue, but the fewer samples, the less able we are to accurately characterize any relatively diverse environment. Fine then, how about 3:1 background to site samples? Sounds like a pretty good choice; let's start with a simple model:

\log\left(\frac{p}{1-p}\right)=\alpha+\beta_1*water+\beta_2*slope+\epsilon

Simple logistic regression with site presence/absence as the response, distance to water (ed_h6) and a measure of slope variation (slp_32c) as the independent variables, and epsilon assumed to be normally distributed and iid (but it's not, let's not kid ourselves). The plot below shows the response curves varying across each variable at the 0.05, 0.5, and 0.95 quantiles of the variable named in the upper panel.  Pretty weak signal, but there is something here. Basically, the probability decreases mainly with increased slope variation and more subtly with distance from water.  That fits ok with pre-contact settlement in this area; give or take.   The second image shows how they affect probability together.

predict_by_quantile

probability_contour

Cool, model's done.  Nope.  This outcome is based on a single draw of 597 background samples at a 3:1 balance for the 199 site location samples.  What happens if we try again with another random sample of 597 background samples?  All hell breaks loose!  The gif below shows the response curve at the 0.05 quantile of distance to water over the range of slope variability for 100 different draws of 597 random background points.  It is clear that taking a different sample may lead to a very different model outcome. Logically, this should not be shocking, but relying on a single, relatively small random draw for a background sample is the de facto approach to APM.  Unfortunately, simply taking a larger sample does not solve all our problems.

background_samples

The next gif shows essentially the same thing as above, but varying across different ratios of background to site samples, starting with 3:1 on the left and increasing to 20:1 on the right.  As the ratio of background to sites increases, the variance between models decreases (less erratic fit lines), but the increased imbalance of background to sites leads to reduced overall probability. The lower variance is a good thing, but the lower overall probability is not.

Background samples by multiplier

Ways to address these competing issues include finding the right ratio, modeling techniques that handle highly imbalanced data, bootstrap sampling of the background, or model averaging, in addition to different algorithms and model specifications.  Hopefully this illustrates the danger of relying on a single random draw to represent site-absence locations and the utility of criticizing your model results against multiple draws.
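As a rough sketch of what checking a model against multiple background draws might look like, the snippet below refits the same glm over many random draws and summarizes how much the coefficients move. The object names are assumptions for illustration: `sites` and `background` are taken to be data.frames of sampled cell attributes containing the two covariates; neither comes from the post itself.

n_draws <- 100
ratio   <- 3   # background-to-site ratio

coefs <- t(replicate(n_draws, {
  # a fresh random draw of background cells for every refit
  bkg <- background[sample(nrow(background), ratio * nrow(sites)), ]
  dat <- rbind(data.frame(presence = 1, sites[, c("ed_h6", "slp_32c")]),
               data.frame(presence = 0, bkg[,   c("ed_h6", "slp_32c")]))
  coef(glm(presence ~ ed_h6 + slp_32c, family = binomial, data = dat))
}))

apply(coefs, 2, sd)   # how much each fitted coefficient varies between draws
colMeans(coefs)       # a crude "model averaged" set of coefficients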


 

Notes:
  • The top two graphics are not actually from the simple logistic regression example.  They are drawn from a Bayesian HMC GLM with \mathcal{N}(0, 1) priors, but the response is pretty much the same.  The uncertainty across 1000 posterior draws for a single 3:1 background sample is presented here:
  • Bayes_median_ribbon_samples.png
  • Clearly, the effect of ed_h6 and std_32c on presence/absence given these assumptions and model is not great.  However, these two variables are well ingrained in temperate-area settlement theory and recognizable to any archaeologist.  They make a good example for discussion.

Conway’s Game of Life + Zombies in gganimate


I was reading a new paper entitled Rethinking the Role of Agent Based Modeling in Archaeology by Cegielski (@Whcegielski) and Rogers [paywall to paper] and I was struck reminiscing about my infatuation with the topic.  I used to do a lot of research on ABM and particularly the philosophy of Emergence, but I have drifted away from it in recent years in favor of more quantitative/statistical/inferential methods (I am not sure if that is a suitable differentiation…).  This is not to say that I didn't enjoy every paper/presentation I have seen by the likes of @CDWren, @er_crema, @Iza_Romanowska, and Kohler & Co. Whatever the case, Cegielski and Rogers was a great way to get back into some of the happenings in ABM and I really agree with a lot of what they say.  More on that in a future post.

Reading this paper reminded me of my love for Cellular Automata (CA). I remember waiting with bated breath as Stephen Wolfram's A New Kind of Science hit the shelves in 2002 (you often had to physically go to a book store back then).  I pored over those 1,200 pages and tried to figure out what the hell it meant for understanding past human experience (a work in progress).  Cellular Automata, L-Systems, and Complexity were all that mattered; oh the good old days… Thinking of this, my mind blended my old interest with my new obsession with all things gganimate, and this post was born! I want to implement CA in ggplot and make a gif with gganimate. Enter…

Conway’s Game of Life

 

Conway's Game of Life (GOL) is the canonical example of cellular automata and of the idea of the complex being born from simple rules, with high-variability outcomes arising from low-variability initial conditions. As stated on Wikipedia, the rules of the game are (a minimal R sketch of a single update step follows the list):

  1. Any live cell with fewer than two live neighbours dies, as if caused by under-population.
  2. Any live cell with two or three live neighbours lives on to the next generation.
  3. Any live cell with more than three live neighbours dies, as if by over-population.
  4. Any dead cell with exactly three live neighbours becomes a live cell, as if by reproduction.
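For a sense of how little code those four rules take, here is a minimal base-R sketch of a single update step. This is my own illustration, not John Ramey's foreach implementation discussed below; `grid` is assumed to be a logical matrix with TRUE marking live cells and everything beyond the edge treated as dead.

gol_step <- function(grid) {
  nr <- nrow(grid); nc <- ncol(grid)
  # pad the grid with a dead border so every cell has eight neighbours to inspect
  padded <- matrix(FALSE, nr + 2, nc + 2)
  padded[2:(nr + 1), 2:(nc + 1)] <- grid
  # count live neighbours by summing the eight shifted copies of the padded grid
  neighbours <- matrix(0, nr, nc)
  for (dr in -1:1) for (dc in -1:1) {
    if (dr == 0 && dc == 0) next
    neighbours <- neighbours + padded[2:(nr + 1) + dr, 2:(nc + 1) + dc]
  }
  # rules 1-3: live cells survive with 2 or 3 neighbours; rule 4: dead cells with exactly 3 are born
  (grid & (neighbours == 2 | neighbours == 3)) | (!grid & neighbours == 3)
}

# example: a random 50 x 50 starting grid advanced one generation
grid <- matrix(runif(2500) < 0.3, nrow = 50)
grid <- gol_step(grid)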

The GOL is a pretty standard test for coders and comp. sci. folks who want a little challenge. It has been implemented in R many times and, therefore, I will steal it to make my life easier.  What follows is based on John Ramey's (@ramhiser) post Conway's Game of Life in R with ggplot2 and animation.  In that post John goes over some GOL basics and then offers his code to do what the title suggests.  Thanks to John's great work, the guts of the code are two functions for computing the GOL grids using the foreach parallel back-end; cool stuff.  John then uses ggplot2 and the animation package, but I reworked most of that to make it work with gganimate.

GOL1b

Better yet!!! I watch too much TV, including The Walking Dead, so I want to make a Game of Zombies!  GOL, but with an afterlife…

GOL1a

With options for various afterlife spans and ColorBrewer colors!

GOL1c.gif

Sizes, densities, intervals…

GOL1d.gif

 

 

R Code is here


Model Overfitting – Some basics

“With four parameters I can fit an elephant and with five I can make him wiggle his trunk.” – John von Neumann

This post came about for two reasons:

  1. I’ve been playing with @drob‘s gganimate package for ggplot2 and it is really fun; and hopefully useful in demonstrating my point.
  2. Overfitting is arguably the single most important concept to understand in prediction. It should wake you out of a deep sleep at 3am asking yourself, “Is my model overfit?”

The first point is fun and games, the second is very serious. Fortunately, using gganimate is simple, but unfortunately dealing with overfitting can be very difficult.

line1a

Overfitting is a pretty easy concept: your model fits your data very well, but performs poorly when predicting new data. This happens because your model fits the noise of your data, likely because it is too flexible or complex. If a model employs many degrees of freedom, it is very flexible and can seek out every data point, leading to a wiggly line that misses the underlying signal you are attempting to model. Note that this concept is part and parcel of the bias and variance trade-off, but I'll sidestep that and save it for another time.

An overfit model will appear to be very accurate when compared to the data used to build (aka "train") the model. This is because with every additional degree of freedom your error (e.g. Root Mean Square Error [RMSE]) will decrease. The more flexibility, the lower the error; yeah, pack it up, we can go home now. Complex models… FTW!

Hold on, tiger… the problem is that as flexibility increases the model fits more and more of the noise that is unique to the particular data sample you are training the model on. This particular noise will not likely be present in another sample drawn from the same overall population. Therefore, a model that has low error because it strongly fits the noise of the training data will have a high rate of error on new data that has different noise. However, not all is lost: the trick is to balance model complexity to achieve optimal error rates considering both data sets and your modeling objective. Let's see some plots.

train vs test

The plot above demonstrates this concept by showing an increasingly complex model (red line) fit to the training data sample in the top pane. The bottom pane shows a larger sample of testing data that is more representative of the total target population. The red line in the lower pane is the prediction for the testing data based on the model fit to the training data above.

The key thing to note is the increased flexibility as the model complexity increases (i.e. polynomial degree) and how much more tightly the model fits the variation in the training data. At the same time, the model's predictions for the testing data are much more complex than necessary because the model is anticipating a data set just like the one it trained on. The RMSE of the training set continues to drop as the model becomes more complex, but the testing RMSE only drops to a point and then rises as the model becomes more overfit.  An overfit model is a one trick pony. Don't be a one trick pony.
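Here is a rough sketch of that pattern on simulated data (loosely following the generating model given in the notes at the end; this is not the exact code behind the figures, which lives in the linked Gist):

## simulated stand-in data: y = 50 + 30 * x^-0.3 + Normal(0, 2.5) noise
set.seed(1)
x_train <- runif(30, 1, 10)
y_train <- 50 + 30 * x_train^-0.3 + rnorm(30, 0, 2.5)
x_test  <- runif(200, 1, 10)
y_test  <- 50 + 30 * x_test^-0.3 + rnorm(200, 0, 2.5)

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# fit polynomials of increasing degree to the small training set,
# then score both the training data and the larger test set
errors <- t(sapply(1:10, function(d) {
  fit <- lm(y_train ~ poly(x_train, d))
  c(degree     = d,
    train_rmse = rmse(y_train, fitted(fit)),
    test_rmse  = rmse(y_test, predict(fit, newdata = data.frame(x_train = x_test))))
}))
errors  # train RMSE keeps falling; test RMSE bottoms out and then climbs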

I prefer a little bit of analogy, so for an archaeological example we can pretend that this data represents something like y = count of rhyolite flakes and x = count of chert flakes on sites within a drainage basin.  We observe in our small sample that there tends to be a high amount of rhyolite where there is little chert, but also that even for high amounts of chert there is a pretty consistently moderate amount of rhyolite.  Framing it that way, we can see how our small sample holds the general trend of all the unobserved sites in the basin, but if we fit too flexible of a model, we will estimate quantities that are simply the result of our limited sample size.

This is only for one train/test set, so maybe it is just a bad random draw, right?  Not really. Simulating this over 100 permutations of this data population shows the textbook test/train error curves when aggregated to their means. The red line for training error will continue to decrease as model complexity increases. It will level out, but it will not increase. On the other hand, the test data error decreases until the population parameters are approximated and then increases as the model becomes overfit. This clearly depicts the risk of being overconfident in a model that has a low training error.

Error over Complexity

In the graphic above, the degree of model complexity at which both errors are, on average, closest is around a linear fit (1st order) or second degree polynomial, but the point at which the model test error is lowest is around a 5 or 6 degree polynomial.  This data was generated from a low order polynomial model, so the lower error of the 6 degree model is likely more adaptation to noise.

“Among competing hypotheses, the one with the fewest assumptions should be selected.” ~ Occam's Razor

While we addressed the issue of indefinitely sliding off into high-complexity hell by using a test data set to get into the neighborhood of the test/train error balance, this is only part of the problem. At this point, we hit the crux of the model complexity problem: how much model simplicity and interpretive capability do we want to give up for improved prediction? The linear fit of the first order polynomial is simply the wrong model, but it has a very easy interpretation regarding the relationship of y|x; maybe it's good enough?  On the other hand, a 6th degree polynomial is a pretty insanely complex model for most responses and will give you very little in the way of inference.  Pick, say, a 3rd degree polynomial and you give up some predictive accuracy, but you gain the ability to have some insight into what the coefficients mean; even better for a second degree model. The model you choose depends entirely on why you are modeling the data in the first place.

From here, this discussion can drift off into any number of related directions about prediction vs. inference, Bayesian vs. Frequentist, regularization, AIC/BIC/Cp, cross-validation, bias vs. variance, loss functions, etc… but that will be for another time. In the meantime, make sure your model is behaving.

Notes:

  1. The model used to generate the data used above follows as (a small simulation sketch follows these notes):

y\sim\mathcal{N}(\mu, \sigma) \\ \mu = \alpha+\beta_1*x^{-0.3} \\ \alpha\sim\mathcal{N}(50, 1) \\ \beta_1\sim\mathcal{N}(30, 1) \\ \sigma\sim\mathcal{N}(2.5, 0.1)

  2. After writing this, I happened to run across another blog discussing overfitting that also used that von Neumann quote to open.  I never said I was original.
  3. This is an overly simplified view of model overfitting, but it is intended for a reader at an introductory level of modeling.
  4. R code for this post is in this Gist
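A quick sketch of drawing one data set from the model in note 1 (the sample size and the range of x are my own assumptions, not values from the post):

set.seed(123)
n     <- 30
x     <- runif(n, 1, 10)
alpha <- rnorm(1, mean = 50,  sd = 1)     # intercept drawn from its prior
beta1 <- rnorm(1, mean = 30,  sd = 1)     # slope drawn from its prior
sigma <- rnorm(1, mean = 2.5, sd = 0.1)   # residual scale drawn from its prior
y     <- rnorm(n, mean = alpha + beta1 * x^-0.3, sd = sigma)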


Simpson’s Paradox in Action

“Simpson's paradox, or the Yule–Simpson effect, is a paradox in probability and statistics, in which a trend appears in different groups of data but disappears or reverses when these groups are combined.” ~ Wikipedia

I just got Simpson’s Paradox’ed and it feels so good!  This is often a bad thing, as your analysis may come to a wrong conclusion if the data is viewed at the wrong level or you have latent grouping.  But just now it helped me understand a problem and I figured I would tell the internet about it (like it or not).

This is not the best example of Simpson's Paradox, but I think it is illustrative. I have a set of 150 points taken with a GPS, most taken with a very high accuracy RTK network and others with less accurate SBAS.  However, I do not know which shots were taken by which method. My end calculations are showing a bias towards my elevations being about a half foot too high when compared to a highly accurate LiDAR-derived DEM. I began to suspect a vertical datum shift, but metadata and RTK information said they matched; then again, metadata and documentation can lie.  I figured I could use the Vertical Dilution of Precision (VDOP) to detect the datum shift, so I set about plotting my data and estimating the linear trend.

RTK_grouped

The slope of the linear fit is 0.44, suggesting a positive relationship between VDOP and the difference in elevation (increasing VDOP by 1 increases the delta elevation by 0.44 feet). However, this is counterintuitive given that the increase in error (VDOP) is getting me closer to my target (delta elevation of zero).  Counterproductive as it sounds, this suggests that the higher VDOP points are closer to target, and therefore may be the points in the correct vertical datum.  Based on a visual inspection of the data, I drew an arbitrary class boundary at VDOP = 0.25 to establish possible RTK vs. non-RTK groups.  Giving these groups the same treatment illustrates the point of this post: Simpson's Paradox!

RTK_seperated

When the points are grouped and the linear trend is refit, the slope changes direction!  Instead of the positive 0.44 of the aggregated data, the individual groups have negative slopes of -0.62 for non-RTK and -1.3 for RTK (the direction of the slope is the important part here).  Firstly, the negative slope for each group gives the more intuitive result that higher VDOP error moves us further from the target elevation. Secondly, the possible RTK group is close to the 0.5 ft bias I had detected, while the non-RTK group is closer to zero (with much more variability (see below) because of the lack of real-time post-processing).  The fact that the signs switched (the paradox part) and that my group means are close to the biased and true targets (see below) gives me confidence that I can identify the points with the incorrect vertical datum and correct for it!  Thank you, Simpson's Paradox.

RTK_seperated_mean

PS – These are high accuracy GPS measurements, but they are used only for planning purposes, not construction or regulatory use.  If that point at VDOP 0.2 should actually be in the non-RTK cluster and it gets corrected as if it were RTK, no one will get hurt and no buildings will fall over.

*** Below is the R code for the ggplot2 examples, but because it is a client project I cannot share the data.
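Since the actual measurements cannot be posted, here is a purely hypothetical simulation (made-up parameters, not the client data) that reproduces the same kind of sign flip; it carries the same columns the plots below use, so it can stand in for the prepared pdat data if you want to experiment.

set.seed(42)
n <- 75
vdop_rtk <- runif(n, 0.05, 0.25)   # low-error, RTK-style shots
vdop_non <- runif(n, 0.25, 1.20)   # higher-error, SBAS-style shots
sim <- rbind(
  data.frame(GPS_Status = "RTK", VDOP = vdop_rtk,
             Elevation_Difference = -0.35 - 1.0 * vdop_rtk + rnorm(n, 0, 0.10)),
  data.frame(GPS_Status = "non-RTK", VDOP = vdop_non,
             Elevation_Difference =  0.30 - 0.4 * vdop_non + rnorm(n, 0, 0.25))
)

# aggregated fit: the slope comes out positive...
coef(lm(Elevation_Difference ~ VDOP, data = sim))["VDOP"]

# ...while each group on its own has a negative slope: the paradox
sapply(split(sim, sim$GPS_Status),
       function(d) coef(lm(Elevation_Difference ~ VDOP, data = d))["VDOP"])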

Data prep and first plot

## Prepare data for plotting
pdat <- data.frame(elev_diff2 = sdat1$elev_diff, vdop = sdat1$vdop,
                   hdop = sdat1$hdop, pdop = sdat1$pdop,
                   Status = ifelse(sdat1$vdop > 0.25, "non-RTK", "RTK"))
colnames(pdat) <- c("Elevation_Difference", "VDOP", "HDOP", "PDOP", "GPS_Status")

## first plot
ggplot(data = pdat, aes(x = VDOP, y = Elevation_Difference)) +
  geom_point(size = 5, shape = 20, color = "gray20") +
  geom_abline(size = 1, linetype = "dashed") +
  # geom_smooth(method="lm", se=FALSE) +
  # stat_smooth(method="lm", formula=y~1, se=FALSE) +
  # scale_color_manual(values = c("orange", "skyblue3")) +
  scale_x_continuous(breaks = seq(0, 1.5, 0.2), limits = c(0, 1.5)) +
  scale_y_continuous(breaks = seq(-2.5, 1.75, 0.5), limits = c(-2.5, 1.75)) +
  ylab("Difference in Elevation (feet)") +
  xlab("Vertical Dilution of Precision (VDOP)") +
  ggtitle("GPS Versus LiDAR DEM Point Elevations") +
  theme_bw() +
  theme(
    axis.title.y = element_text(hjust = 0.5, vjust = 1.5, angle = 90, size=rel(1), face="bold"),
    axis.title.x = element_text(vjust = -0.5, size=rel(1), face="bold"),
    title = element_text(size=rel(1.5)),
    axis.ticks.x = element_blank(),
    axis.ticks.y = element_blank(),
    axis.text.x = element_text(size=rel(1.5)),
    axis.text.y = element_text(size=rel(1.5)))

Second Plot

ggplot(data = pdat, aes(x = VDOP, y = Elevation_Difference,
                        group = GPS_Status, color = GPS_Status)) +
  geom_point(size = 5, shape = 20) +
  geom_abline(size = 1, linetype = "dashed") +
  geom_smooth(method="lm", se=FALSE, color = "black", size = 1) +
  # stat_smooth(method="lm", formula=y~1, se=FALSE) +
  scale_color_manual(values = c("orange", "skyblue3")) +
  scale_x_continuous(breaks = seq(0, 1.5, 0.2), limits = c(0, 1.5)) +
  scale_y_continuous(breaks = seq(-2.5, 1.75, 0.5), limits = c(-2.5, 1.75)) +
  ylab("Difference in Elevation (feet)") +
  xlab("Vertical Dilution of Precision (VDOP)") +
  ggtitle("GPS Versus LiDAR DEM Point Elevations") +
  theme_bw() +
  theme(
    axis.title.y = element_text(hjust = 0.5, vjust = 1.5, angle = 90, size=rel(1), face="bold"),
    axis.title.x = element_text(vjust = -0.5, size=rel(1), face="bold"),
    title = element_text(size=rel(1.5)),
    axis.ticks.x = element_blank(),
    axis.ticks.y = element_blank(),
    legend.position = "bottom",
    axis.text.x = element_text(size=rel(1.5)),
    axis.text.y = element_text(size=rel(1.5)))

Third plot

ggplot(data = pdat, aes(x = VDOP, y = Elevation_Difference,
                        group = GPS_Status, color = GPS_Status)) +
  geom_point(size = 5, shape = 20) +
  geom_abline(size = 1, linetype = "dashed") +
  geom_smooth(method="lm", se=FALSE, color = "black", size = 1) +
  stat_smooth(method="lm", formula=y~1, se=TRUE, size = 1) +
  scale_color_manual(values = c("orange", "skyblue3")) +
  scale_x_continuous(breaks = seq(0, 1.5, 0.2), limits = c(0, 1.5)) +
  scale_y_continuous(breaks = seq(-2.5, 1.75, 0.5), limits = c(-2.5, 1.75)) +
  ylab("Difference in Elevation (feet)") +
  xlab("Vertical Dilution of Precision (VDOP)") +
  ggtitle("GPS Versus LiDAR DEM Point Elevations") +
  theme_bw() +
  theme(
    axis.title.y = element_text(hjust = 0.5, vjust = 1.5, angle = 90, size=rel(1), face="bold"),
    axis.title.x = element_text(vjust = -0.5, size=rel(1), face="bold"),
    title = element_text(size=rel(1.5)),
    axis.ticks.x = element_blank(),
    axis.ticks.y = element_blank(),
    legend.position = "bottom",
    axis.text.x = element_text(size=rel(1.5)),
    axis.text.y = element_text(size=rel(1.5)))

Linear fits

m1 <- lm(elev_diff ~ vdop, data = sdat1)                              # aggregated fit
summary(m1)

m2 <- lm(elev_diff ~ vdop, data = sdat1[which(sdat1$vdop > 0.25),])   # non-RTK group
summary(m2)

m3 <- lm(elev_diff ~ vdop, data = sdat1[which(sdat1$vdop < 0.25),])   # RTK group
summary(m3)

 


Grand Challenges for Archaeology: a model based point of view

TL;DR – Current American archaeology lacks a cohesive framework for the construction, evaluation, and implementation of quantitative models specific to our domain…
and it's my generation's fault.

This post is in response to Doug Rocks-Macqueen's call for a Blog Carnival to explore the Grand Challenges for Archaeology; an amazing idea! The call for blogs can be found at his blog, Doug's Archaeology.  The idea is to collect a series of blog posts (or any web-based medium) based on this theme.

The source of this topic is a survey of archaeologists to determine the 25 grand challenges that we and our field face.  The responses from this survey were synthesized into an American Antiquity article, Kintigh et al. 2014 (article available here and responses available here [login required]). I read this article upon publication and had a range of responses, from inspired to angered to critical to amazed.  Surely the signs of a good, thought-provoking article.

“In God we trust. All others bring data.” ~ W. Edwards Deming

While the statement at the top of this post elicits a similarly wide-ranging response in my mind, the thesis remains the same. This contention is developed from 20 years of focusing on GIS, spatial, and quantitative analysis in a predominantly CRM context. I am guilty of every shortcoming, failure, and n∅∅b mistake one could endure while pursuing this knowledge over the years, and it is through these setbacks that this point of view was developed.  At every inflection point I was sure that there must be a body of literature just out of reach that would illuminate my questions; surely somebody had addressed these issues already.  The truth became evident in time that a quantitative approach to archaeology was not contained within the neatly wrapped package that I had hoped for.  Instead it is found evolving in a mosaic of niches, feeding off the remains of that which came before it. Unconformably underlying this landscape is a rich stratigraphy of quantitative intentions gone by; some as diamonds and others as a dark smudge.

To unify and develop the quantitative archaeological landscape to be accessible to all interested archaeologists, regardless of their training, is to solve a grand challenge facing archaeology today.  The tools exist, the data exists, and the desire exists to do this; we just need the structural support on which to build it.

\pi(\theta\mid y)\propto f(y\mid\theta) p(\theta)

By “model based” archaeology I am referring to an approach that fosters the formalization of the logical implications and consequences of observed data and theory.  Furthermore, formalization is carried out through quantitative methods that make explicit assumptions, and results are reported in the context of the model.  Finally, this approach is supported by modern computational statistical methods and data visualization techniques that allow for effective communication to non-specialists and the public.

The basis of this approach is not remotely revolutionary and it seems to me to be at the core of archaeological thought.  Taking noisy, fragmentary, and incomplete observations and combining them with previous (mental) models of a phenomenon to derive a new understanding as a weighted average of all of our knowledge on the topic is pretty much what we do.  However, I believe we can evolve our collective instincts for building knowledge from data into a more unified, accessible, and robust approach.  Again, not necessarily a revolution (e.g. New Archaeology and The Model-based Archaeology of Socionatural Systems), but a call-to-arms to unify and bring our collective resources together to establish this perspective as a more prominent feature of American archaeological discourse.  The tools, ability, and smart folks are all present, as evidenced by a number of people doing great work in this direction.  Let's transform archaeology into a field where data is shared between sub-fields, models are developed based on problems specific to archaeology, findings are communicated with narrow and broad implications, and all archaeologists can access the tools to participate.

turtle

Now on to the assertion of blame… Truthfully, I am not into blame; what is is and what's done is done. However, this is one situation where understanding the point of dissidence may be helpful in recognizing the problem more fully.  IMHO, it appears that my "Gen-X" cohort of archaeologists (myself fully included) failed to hold up our end of the bargain. Duly noting that it would be an unfair and gross overgeneralization to deny that many great 30- to 40-something archaeologists have made significant contributions over the past two decades; that is undeniable.  The vague point I am attempting to make is that the field itself has been so fragmented and impartial (dare I say aloof) that many previous quantitative advancements have not been built upon and risen to their potential.  The reasons, real or imaginary, for this circumstance are many.

A striking case in point lies directly in front of us in the list of authors autopsying archaeology's grand challenges in the pages of American Antiquity.  Kintigh, PhD class of '82; Beaudry, PhD class of '80; Drennan, PhD class of '75; Limp, Maschner, and Wright are the scholars of my formative years; Altschul founded SRI in '83; and Kohler was developing archaeological ABM when I was pushing turtles in LOGO… [right 90]. These names are the pantheon of our intellectual predecessors.  If I may be so bold, they could be enjoying their achievements rather than allotting their publications to worrying about flames that others should be tending.  While the challenges in the article cover a wide variety of topics, many are framed and resolved on quantitative grounds, and these grounds are fertile.

** I’ll spare the armchair quarterbacking on why we dropped the ball, but this article states my opinion much better than I can.

Epilogue

I know I am being a bit dramatic in making my points, as there are a number of bright spots producing cohesive model based archaeological literature.  First in mind are the evolutionary and biological anthro/archaeo enclaves; the best use of Bayesian methods is in bettering our understanding of temporal models; digital archaeology/humanities make frequent use of topic models and clustering; and Agent Based Modeling has been a great source of complexity literature.  Specialization of sub-fields is great and necessary, but if a wider audience could take the logic of a seemingly unrelated model, appraise that model in terms of the problem it addresses, and apply the same model to their own data, we can begin to build a unified framework of cross-discipline knowledge.  Knowledge that crosses scales, time periods, and cultures.

Solutions? Learning to code; sharing data; sharing findings; publishing negative results; identifying the unique nature of your archaeological data; dedicated open-access journals; introducing logic and quantitative methods in universities; talking to people outside of our bubbles; studying other fields/sciences; reading old literature; never being satisfied; being more skeptical than the skeptics; being creative; and adapting {stealing} ideas from the smartest people in the room…


Quantifying the present and predicting the past : theory, method, and application of archaeological predictive modeling


The volume entitled "Quantifying the present and predicting the past: theory, method, and application of archaeological predictive modeling," published in 1989 by the U.S. Bureau of Land Management (BLM), is the foundation for all approaches to Archaeological Predictive Modeling (APM).  Past or present, no single volume goes into greater depth on the true underlying issues associated with APM or attempts to synthesize the variety of basic approaches to the data.  I received my first copy from the back of a BLM warehouse around 1999 or 2000, but I have recently run across scanned copies online.  A PDF is available at this link from the BLM, Quantifying the Present and Predicting the Past [PDF], and also an Alternative Version on Archive.org that allows for download in a variety of formats and is searchable. (The Archive.org link is part of the FEDLINK library and BLM collection, which has a ton of great, hard-to-find publications from days gone by.)

This most ambitious undertaking was compiled and edited by a large team of archaeological All-Stars (Sebastian, Kvamme, Kohler, Altschul, et al.) and sought to identify the fundamental questions that APM should address, the issues that are specific to our data, and methods to implement these models.  The volume is a synthesis of the previous decade's worth of model based archaeology that had developed out of "New Archaeology" and was directed by the needs of large-area land surveys for the newly formed Cultural Resources Management (CRM) industry. Editorially, I will suggest that this was the peak of APM study as a central theme that such leaders in the field would devote themselves to.  In the years following, easy access to computer core cycles and push-button GIS made glossing over the details of these methods all too easy.  (I am a guilty contributor to this cycle, so no finger pointing here.)

From the methodological side, the technical details in this volume are not particularly sexy or exciting.  Logistic regression and PCA (Kvamme, Chapter 8) are not nearly as cool to talk about as Machine Learning and Deep Convolutional Neural Nets, but they are incredibly important to understanding how we attempt to model site locations and are still appropriate models in many situations.  That is not to say that modern statistical learning algorithms are unimportant or perform poorly, just that you cannot ignore the past (bad pun intended). It has to be recognized that this book was released when personal computers were only relatively available and desktop GIS was just getting off the ground. As such, model output that uses ASCII characters as indicators of raster quantities was very cool for its day.  Still pretty cool in a retro sort of way.  While the modeling methods may seem outdated, the content in every chapter is well worth the read; if not only for learning the information the first time, then for understanding the basis for most of the APM literature of the decades to follow.

For example, many people recite Kvamme's (pp. 327-328) basic archaeological assumptions for location based models, but wrap them in odd alternative meanings that were likely never intended.  Read this and you will get the original context and a better understanding. Speaking of Kvamme, you also get to read his introduction of the "Kvamme Gain," the most widely applied (and mis-applied) measure of an archaeological model's predictive ability. (Having studied many of the modern day metrics for classification models, I'll say that for as simple as it is, it is quite a good measure and holds up very well over time.)

Kvamme Gain
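If you have not run across it, the gain statistic is usually presented along the following lines (my paraphrase of the common formulation, not a quotation from the volume):

Gain = 1 - \frac{\text{proportion of total area predicted as site-likely}}{\text{proportion of observed sites falling within that predicted area}}

A value near 1 indicates a model that captures most sites within a small fraction of the landscape, while a value near 0 indicates no improvement over chance.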

I recommend reading this whole damn thing cover to cover.  But, say you don't have a few extra weeks of spare time and are looking for a place to start: I recommend Sebastian's discussion of correlative and explanatory models (I have MUCH more to say about this, but will hold my tongue for now); dip into Kohler's contextual discussion in Chapter 2; the intro and model types of Altschul (Chapter 3); hit some high points in Ebert and Kohler; become very cozy with Chapter 5 by Rose and Altschul; read every word of Chapter 7 by Kvamme; and give some serious thought to what Judge and Martin say in Chapter 12.  Chapters 8 through 11 are very interesting, but not as critical as the others.


Arbitrary Combined Weights (ACW) Archaeological Sensitivity Model – The Basics


In archaeology, there is a strong need and desire to project our understanding of where we know sites are located onto areas where no one has yet looked for sites.  The practical extension of this desire is to use this technique to better protect and manage archaeological sites by identifying areas that have a heightened sensitivity to the potential for archaeological material.  Such an approach is necessary because time and funding will seldom allow archaeologists to survey 100% of a study area.  The manner in which observational data is projected onto the broader landscape can take on many forms, qualitative and quantitative.  The universe of approaches for doing so is typically considered under the umbrella of "Archaeological Predictive Modeling" (APM).

This blog will contain many different takes on APM and discuss many different techniques.  As a first post, what follows is a description and code for the most basic and most commonly employed form of quantitative APM: the Arbitrary Combined Weights (ACW) model. Simply, the ACW model assigns quantitative weights to values of one or more explanatory variables, which are then combined in some mathematical form in order to signify a relative level of archaeological sensitivity.  Typically, a continuous variable is binned based on its characteristics (e.g. quantiles, natural breaks) or based on the investigator's experience, and a weight of sensitivity is assigned to each bin. Discrete variables may have weights assigned to each class or group of classes.  The assigned weights may be on any scale, such as 0 to 1 or 0 to 100, but almost always a greater weight value indicates a higher sensitivity for archaeological material.  Weights may be assigned to indicate the sensitivity of each bin individually or distributed across the bins relative to each other.

Once weighted, the combination can take any arithmetic form that fits the model's intention.  The most common approaches are summing weights or averaging weights.  Many other linear combinations are possible, including adding a higher level of weights to each variable; in that case, the weights within a variable are scaled by the overall importance the investigator places on each variable as a whole.  The overall goal of this method is to identify the intersection of variables that contains the highest combination of weights and, ostensibly, the highest sensitivity for archaeological material.  The method described here is essentially the very basic Weighted Sum Model used in the Multi-Criteria Decision Analysis field.  That is a decent place to explore.
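As a small illustration of that variable-level weighting, the numbers below are hypothetical and are not the weights used later in this post:

variable_importance <- c(h20 = 0.5, slope = 0.3, soil = 0.2)    # investigator-assigned, sums to 1
bin_weights         <- c(h20 = 0.85, slope = 0.95, soil = 0.65) # bin weights for a single cell

# variable-level weighted sum: stays on the 0-1 scale because the importances sum to 1
sensitivity <- sum(variable_importance * bin_weights)
sensitivity
## [1] 0.84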

Clearly, the ACW method is very unstructured and leaves a great deal of room for error and interpretation.  Despite this, it is commonly used because it is easy to describe, easy to implement with or without GIS, and coherent with our basic conceptual models of site distribution.  With this ease come many reasons why ACW is a poor representation of archaeological sensitivity, highlighted by the arbitrary nature of bins and weights, ad hoc arithmetic, lack of a measure for variable contribution, and no inferential ability, to name a few. However, despite a laundry list of theoretical and quantitative incompatibilities, the ACW model will continue to be extensively used throughout archaeology and deserves analysis. This post implements a schematic version of the ACW model in R code and explores some of the output.  It will be followed up by deeper exploration and extensions of this basic model.

This code does most of the work with base R functions, but uses ggplot2, reshape2, and scales in order to format data for plotting.

require(ggplot2)
require(reshape2)
require(scales)
## Loading required package: ggplot2
## Loading required package: reshape2
## Loading required package: scales

The first step here is to create a table of the bins and weights for each of three variables: h20_cls for distance to a water source in meters, slp_cls for percent topographic slope, and soil_cls_lbl for soil drainage class.  The objects h20_wgt, slp_wgt, and soil_wgt are the weights associated with each bin. For example, a weight of 0.85 is assigned to distances of 0 to 100 meters from a source of water.  A weight of 0.95 is assigned to slopes between 0% and 3%, followed by a weight of 0.85 for slopes of 3% to 5%.

# weights as probability within bin
h20_wgt <- c(0.85, 0.75, 0.45, 0.25, 0.1)
h20_cls <- c(0,100,300,500,900,9999)
slp_wgt <- c(0.95, 0.85, 0.65, 0.2, 0.1)
slp_cls <- c(0,3,5,8,15,999)
soil_wgt <- c(0.30, 0.65, 0.85)
soil_cls <- c(-1,0,1,2)
soil_cls_lbl <- c("poor", "moderate", "well")

As described in the model methods above, the bins and weights are based on the judgement of the investigator; while they likely have grounding in real-world observation, they are not directly derived from an analysis of a set of known site locations.  Models that explicitly use known sites to derive bin widths or weights are considered proportionally weighted models.

Example of Weights: Distance to Water Source

  from (m)   to (m)   weight
         0      100     0.85
       100      300     0.75
       300      500     0.45
       500      900     0.25
       900     9999     0.10

Next, I simulate some environmental data to apply my weights to.  Typically, this would be done by extracting raster cell values or by using the raster calculator in a GIS program.  The variables below draw values of slope and h20 distance with exponential probabilities to simulate the high correlation that is often found in nature. In R, the raster package can be used to extract values from rasters and perform the following analysis.

# simulate a landscape. this would typically be enviro rasters
slope_sim <- sample(seq(0:100),1000, replace = TRUE, prob = dexp(seq(0:100),0.1))
h20_sim  <- sample(seq(0:5000),1000, replace = TRUE, prob = dexp(seq(0:5000),0.005))
soil_sim <- sample(c(0, 1, 2), 1000, replace = TRUE, prob = c(0.25, 0.5, 0.25))

The plot below shows the distribution and correlation of simulated data points for log slope and log water, faceted by soil drainage.

plot1

(code for above plot)

sim_landscape <- data.frame(slope = slope_sim, h20 = h20_sim, 
                            soil = factor(soil_sim, labels = soil_cls_lbl))
ggplot(data = sim_landscape, aes(x = log(slope), y = log(h20), color = soil)) +
  geom_point() +
  facet_grid(soil~.) +
  theme_bw() +
  ylab("Log Distance to H20 (meters)") +
  xlab("Log Slope (degrees)") +
  theme(legend.title = element_blank(),
        legend.title = element_blank(),
        axis.title = element_text(face = "bold"),
        strip.text.y = element_text(face = "bold", size = 12),
        strip.background = element_rect(colour = "gray10", fill = "gray90"),
        legend.position = "none") +
  scale_color_brewer(palette="Paired")

The distribution of values extracted from the explanatory rasters is then recoded to the weight values based on the predetermined bins.  The cut function is used to recode, and as.numeric(as.character(.)) is used as a bit of a hacky way to convert the factors resulting from cut back to numeric values.

# reclass vector
h20_reclass   <- cut(h20_sim, breaks = h20_cls, labels = h20_wgt)
slope_reclass <- cut(slope_sim, breaks = slp_cls, labels = slp_wgt)
soil_reclass  <- cut(soil_sim, breaks = soil_cls, labels = soil_wgt)
h20_reclass   <- as.numeric(as.character(h20_reclass))
slope_reclass <- as.numeric(as.character(slope_reclass))
soil_reclass  <- as.numeric(as.character(soil_reclass))

var_matrix <- data.frame(slope_wght = slope_reclass, 
                         h20_wght = h20_reclass,
                         soil_wght = soil_reclass)

The plot below shows the density of weight values for each variable across the simulated landscape.  Notice the peaked density at high sensitivity values for distance to water compared to the generally low and even densities for slope.  This results from the interaction of the arbitrary weights, the bins, and the nature of the landscape being modeled.

plot2

(code for above plot)

var_matrix_melt <- melt(var_matrix)
ggplot(data = var_matrix_melt, aes(x = value, fill = as.factor(variable))) +
  geom_density(binwidth = 0.05) +
  xlab("Sensitivity Weight") +
  facet_grid(variable~.) +
  theme_bw() +
  theme(
    legend.title = element_blank(),
    axis.title = element_text(face = "bold"),
    strip.text.y = element_text(face = "bold", size = 12),
    strip.background = element_rect(colour = "gray10", fill = "gray90"),
    legend.position = "none")+
  scale_fill_brewer(palette="Paired") +
  scale_x_continuous(breaks=seq(0,1,0.1))

The final step in the model is to combine the weights across variables to form an overall sensitivity value.  As described above, this can be done in any number of ways, but it is typically a simple sum or mean of the weights.  There is little guiding theory to help inform what should be done here, but the intention is to show locations where more desirable attributes intersect.  The mean and median have some attraction because the final sensitivity stays on the same [0,1] scale. However, these are essentially sample means/medians and show a Central Limit Theorem effect where all values trend toward the center; there are many more combinations of weights that average to 0.5 than combinations that average to 0.9. Summing the values has a similar issue and also changes the scale of the combined sensitivity based on the number of variables in the model. These issues are some of the most important to understand when working with ACW. The plots below show how the final sensitivity distribution changes based on the method used to construct it: sum, mean, or median.

# calculate aggregations of sensitivity values
var_matrix$sum <- apply(var_matrix[ , 1:3], 1, sum)
var_matrix$mean <- apply(var_matrix[ , 1:3], 1, mean)
var_matrix$median <- apply(var_matrix[ , 1:3], 1, median)

plot3

(code for above plots)

results_melt <- melt(var_matrix[,c("sum","mean","median")])
ggplot(results_melt, aes(x = value, fill = as.factor(variable))) +
  geom_density() +
  xlab("Sensitivity Weight") +
  facet_grid(variable~.) +
  theme_bw() +
  theme(
    legend.title = element_blank(),
    axis.title = element_text(face = "bold"),
    strip.text.y = element_text(face = "bold", size = 12),
    strip.background = element_rect(colour = "gray10", fill = "gray90"),
    legend.position = "none")+
  scale_fill_brewer(palette="Paired") +
  scale_x_continuous(breaks=seq(0,3,0.1))

These final plots illustrate the densities of final weights for the various combination methods relative to the weights of each variable. Notice that the sum method produces a more diverse range of values that peaks towards 0.5, which is a region of low density for all three variables. The mean of the sensitivities results in a similar pattern, less dispersed, yet still aggregated towards the center. Finally, the median aggregation has a higher fidelity to the peak densities of weights of each variable.  Which of these methods is the best?  Well, there is probably not a good answer to that beyond either your theory or cross-validated prediction error.  However, if you had the former you would probably not use the ACW model because there are better approaches to modeling theories, and if you attempted the latter you would have a larger data set and could use statistical models with a better mathematical foundation.  That being said, there are times when small amounts of data, limited amounts of time, and the need to convince a skeptical audience require a model that is easy to create and explain, such as the ACW model.  Much more on this at a later date.

plot4

plot5

plot6

(data munge for above plots)

compare_weights_plot <- rbind(results_melt, var_matrix_melt)
compare_weights_plot$variable <- as.character(compare_weights_plot$variable)
compare_weights_sum_plot <- compare_weights_plot[which(compare_weights_plot$variable %in%
                              c("slope_wght","h20_wght","soil_wght")),]
compare_weights_sum <- compare_weights_plot[which(compare_weights_plot$variable %in%
                              c("sum")),]
compare_weights_sum$value <- scales::rescale(compare_weights_sum$value,to = c(0, 1))
# function(x){(x-min(x))/(max(x)-min(x))}
compare_weights_sum_plot <- rbind(compare_weights_sum_plot, compare_weights_sum)
compare_weights_mean_plot <- compare_weights_plot[which(compare_weights_plot$variable %in%
                              c("mean","slope_wght","h20_wght","soil_wght" )),]
compare_weights_median_plot <- compare_weights_plot[which(compare_weights_plot$variable %in%
                              c("median","slope_wght","h20_wght","soil_wght")),]

compare_weights_sum_plot$variable <- factor(compare_weights_sum_plot$variable,
                        levels =  c("slope_wght","h20_wght","soil_wght","sum"))
compare_weights_mean_plot$variable <- factor(compare_weights_mean_plot$variable,
                        levels =  c("slope_wght","h20_wght","soil_wght","mean"))
compare_weights_median_plot$variable <- factor(compare_weights_median_plot$variable,
                        levels =  c("slope_wght","h20_wght","soil_wght","median"))

(code for above plots)

ggplot(data = compare_weights_sum_plot, aes(x = value, color = variable,
                                             fill = variable, alpha = variable)) +
  geom_density() +
  scale_alpha_manual(values = c(0, 0, 0, 0.2)) +
  xlab("Sensitivity Weight") +
  theme_bw() +
  theme(
    legend.title = element_blank(),
    axis.title = element_text(face = "bold"))+
  scale_fill_brewer(palette="Paired") +
  scale_color_brewer(palette="Paired") +
  scale_x_continuous(limits=c(0, 1), breaks=seq(0,1,0.1))
ggplot(data = compare_weights_mean_plot, aes(x = value, color = variable,
                                             fill = variable, alpha = variable)) +
  geom_density() +
  scale_alpha_manual(values = c(0, 0, 0, 0.2)) +
  xlab("Sensitivity Weight") +
  theme_bw() +
  theme(
    legend.title = element_blank(),
    axis.title = element_text(face = "bold"))+
  scale_fill_brewer(palette="Paired") +
  scale_color_brewer(palette="Paired") +
  scale_x_continuous(limits=c(0, 1), breaks=seq(0,1,0.1))
ggplot(data = compare_weights_median_plot, aes(x = value, color = variable,
                                             fill = variable, alpha = variable)) +
  geom_density() +
  scale_alpha_manual(values = c(0, 0, 0, 0.2)) +
  xlab("Sensitivity Weight") +
  theme_bw() +
  theme(
    legend.title = element_blank(),
    axis.title = element_text(face = "bold"))+
  scale_fill_brewer(palette="Paired") +
  scale_color_brewer(palette="Paired") +
  scale_x_continuous(limits=c(0, 1), breaks=seq(0,1,0.1))

Plotting Shovel Test Unit Profiles in R with the aqp package

Examples of STU profiles

The code in the following post was first generated to find a quick way to create a large number of Shovel Test Unit (STU) soil profiles with some consistency and a few added features. This methodology uses the aqp package in R as the base for generating the profiles from data in the form of a *.csv file. The package was originally developed to work with geology and soil science data, but we adapted it a bit for our purpose. Subsequent to developing this initial code, others in the office further developed the function to fit their work flow and profiles. The basics are here, and adapting it is pretty straightforward if you dive into the guts of the aqp package and read the docs.

Features shown here include converting Munsell colors to hex, artifact densities, exporting as PDF and a few other tweaks. Hopefully, you will find it useful!

aqp package http://aqp.r-forge.r-project.org/

The first thing is to load the packages. The first time you do this you will need to install the packages. After that, you can just import them.

# this is the main package for profiles
install.packages("aqp") 
# these may come in handy
install.packages("RColorBrewer")
install.packages("latticeExtra")
install.packages("plyr")
install.packages("reshape")

Once the packages are installed, we need to import them into the R environment as such:

require(aqp)
require(RColorBrewer)
require(latticeExtra)
require(plyr)
require(reshape)

Now these packages are imported. That means we can use all the functions and data that come with these packages. Google searches will return documentation on these packages and tell you what functions they contain, with some examples.

Data

time to load the data

# this sets the working directory to a folder that contains your data and will
# contain your output.  Setting this is not mandatory, but makes it easier to get
# and put data with less typing [Notice that the slashes in the path are forward (/),
# not the standard back slash \]
setwd("[Your File Location or Working Directory]")

# in this working directory, I made a folder called data to hold the 
# CSV of STU profiles
# I read this using the read.csv function and assign it to the variable "dat"
dat <- read.csv(paste(c(getwd(), "/data/Soil_Profiles.csv"), collapse=''), 
                stringsAsFactors = FALSE)[,-1]

# the "paste" function concatenates the working directory and the 
# file location and name
# setting "stringsAsFactors = FALSE" tells the function to leave text fields as text 
# in the data
# the "[,-1]" at the end drops the index column that is automatically added

Let's see our data

# this will print the whole thing.  use head(dat,#) where # is replaced with the
# number of rows you want to print
print(dat)
##         id top bottom name    p1 soil_color
## 1  Pole 13   0     19   Ap 6.963    #5E4423
## 2  Pole 13  19     55    B 5.951    #815A21
## 3  Pole 14   0     24   Ap 5.245    #3D2F21
## 4  Pole 14  24     45   Ab 5.069    #5E4423
## 5  Pole 14  45     50    E 5.069    #BEAD97
## 6  Pole 14  50     60    B 5.338    #B88E50
## 7  Pole 22   0     19   Ap 6.454    #7A5C36
## 8  Pole 22  19     30   Ab 5.844    #554636
## 9  Pole 22  30     38    B 5.437    #94764F
## 10 Pole 22  38     50    C 5.181    #9D7337
## 11 Pole 37   0     31   Ap 5.019    #7A5C36
## 12 Pole 37  31     63  Ap1 5.283    #554636
## 13 Pole 37  63     90  Ap2 6.440    #5E4423
## 14 Pole 37  90    117    B 6.670    #94764F
# the "str" function reports the structure of the data
str(dat)
## 'data.frame':    14 obs. of  6 variables:
##  $ id        : chr  "Pole 13" "Pole 13" "Pole 14" "Pole 14" ...
##  $ top       : int  0 19 0 24 45 50 0 19 30 38 ...
##  $ bottom    : int  19 55 24 45 50 60 19 30 38 50 ...
##  $ name      : chr  "Ap" "B" "Ap" "Ab" ...
##  $ p1        : num  6.96 5.95 5.24 5.07 5.07 ...
##  $ soil_color: chr  "#5E4423" "#815A21" "#3D2F21" "#5E4423" ...
# class tells us what class the object "dat" is
class(dat)
## [1] "data.frame"
# summary gives us a look into the range of the data
summary(dat)
##       id                 top            bottom          name          
##  Length:14          Min.   : 0.00   Min.   : 19.0   Length:14         
##  Class :character   1st Qu.: 4.75   1st Qu.: 30.2   Class :character  
##  Mode  :character   Median :27.00   Median : 47.5   Mode  :character  
##                     Mean   :29.21   Mean   : 49.4                     
##                     3rd Qu.:43.25   3rd Qu.: 58.8                     
##                     Max.   :90.00   Max.   :117.0                     
##        p1        soil_color       
##  Min.   :5.02   Length:14         
##  1st Qu.:5.20   Class :character  
##  Median :5.39   Mode  :character  
##  Mean   :5.71                     
##  3rd Qu.:6.32                     
##  Max.   :6.96

Label

Add a label with horizon and Munsell color to the data.frame

dat <- within(dat,  label <- paste(name, soil_color, sep=" - "))

print(dat)
##         id top bottom name    p1 soil_color         label
## 1  Pole 13   0     19   Ap 6.963    #5E4423  Ap - #5E4423
## 2  Pole 13  19     55    B 5.951    #815A21   B - #815A21
## 3  Pole 14   0     24   Ap 5.245    #3D2F21  Ap - #3D2F21
## 4  Pole 14  24     45   Ab 5.069    #5E4423  Ab - #5E4423
## 5  Pole 14  45     50    E 5.069    #BEAD97   E - #BEAD97
## 6  Pole 14  50     60    B 5.338    #B88E50   B - #B88E50
## 7  Pole 22   0     19   Ap 6.454    #7A5C36  Ap - #7A5C36
## 8  Pole 22  19     30   Ab 5.844    #554636  Ab - #554636
## 9  Pole 22  30     38    B 5.437    #94764F   B - #94764F
## 10 Pole 22  38     50    C 5.181    #9D7337   C - #9D7337
## 11 Pole 37   0     31   Ap 5.019    #7A5C36  Ap - #7A5C36
## 12 Pole 37  31     63  Ap1 5.283    #554636 Ap1 - #554636
## 13 Pole 37  63     90  Ap2 6.440    #5E4423 Ap2 - #5E4423
## 14 Pole 37  90    117    B 6.670    #94764F   B - #94764F

Conversion

Now we need to put these data in the format that is used by the aqp package to do the cool stuff. The "depths" function takes the csv data and converts it into an object of class "SoilProfileCollection" for future use. By doing so, the aqp package conforms the data to a type it knows well and can use in the functions the package provides.

# the depth function uses the id field in the data to set the STU number
# it uses the top to set the top of the STU and bottom for the... well the bottom.
depths(dat) <- id ~ top + bottom

# now we see that the class of dat has been changed from data.frame to "SoilProfileCollection"
str(dat) # you can see that the original data is now within this object with other attributes
## Formal class 'SoilProfileCollection' [package "aqp"] with 7 slots
##   ..@ idcol     : chr "id"
##   ..@ depthcols : chr [1:2] "top" "bottom"
##   ..@ metadata  :'data.frame':   1 obs. of  1 variable:
##   .. ..$ depth_units: chr "cm"
##   ..@ horizons  :'data.frame':   14 obs. of  7 variables:
##   .. ..$ id        : chr [1:14] "Pole 13" "Pole 13" "Pole 14" "Pole 14" ...
##   .. ..$ top       : int [1:14] 0 19 0 24 45 50 0 19 30 38 ...
##   .. ..$ bottom    : int [1:14] 19 55 24 45 50 60 19 30 38 50 ...
##   .. ..$ name      : chr [1:14] "Ap" "B" "Ap" "Ab" ...
##   .. ..$ p1        : num [1:14] 6.96 5.95 5.24 5.07 5.07 ...
##   .. ..$ soil_color: chr [1:14] "#5E4423" "#815A21" "#3D2F21" "#5E4423" ...
##   .. ..$ label     : chr [1:14] "Ap - #5E4423" "B - #815A21" "Ap - #3D2F21" "Ab - #5E4423" ...
##   ..@ site      :'data.frame':   4 obs. of  1 variable:
##   .. ..$ id: chr [1:4] "Pole 13" "Pole 14" "Pole 22" "Pole 37"
##   ..@ sp        :Formal class 'SpatialPoints' [package "sp"] with 3 slots
##   .. .. ..@ coords     : num [1, 1] 0
##   .. .. ..@ bbox       : logi [1, 1] NA
##   .. .. ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slots
##   .. .. .. .. ..@ projargs: chr NA
##   ..@ diagnostic:'data.frame':   0 obs. of  0 variables
class(dat)
## [1] "SoilProfileCollection"
## attr(,"package")
## [1] "aqp"
summary(dat)
##                Length                 Class                  Mode 
##                     4 SoilProfileCollection                    S4

Plotting

What's left to do is to plot! The plot function takes the "SoilProfileCollection" object called dat and uses the "name" field in the data to assign soil horizons… simple as that!

plot(dat, name = 'label')
Basic profile with correct Munsell colors

In RStudio this can be saved to a PDF from the plot window, but in base R, you can use the following:

# this will save a pdf called "profiles" to your working directory 
# after you run "dev.off()"
pdf("profiles.pdf")
plot(dat, name = 'label')
dev.off()
## pdf 
##   2

Adding Munsell colors from field notation (e.g. “10YR3/2”)

# let's make up some example data.  Put the Hue, Value, and Chroma into 
# separate columns
# this should probably be included in the original CSV
hue <- c("10YR", "10YR", "5YR", "2.5YR", "2.5YR")
valu <- c(3, 5, 4, 6, 2)
chroma <- c(2, 4, 4, 2, 3)
# We join these together into a data.frame called munsell 
# note "stringsAsFactors = FALSE"
munsell <- data.frame(hue, valu, chroma, stringsAsFactors = FALSE)
print(munsell)
##     hue valu chroma
## 1  10YR    3      2
## 2  10YR    5      4
## 3   5YR    4      4
## 4 2.5YR    6      2
## 5 2.5YR    2      3
# now use the "munsell2rgb" function to convert by giving it the three column 
# of hue, value, chroma.  This is added as a field called "color" to the munsell data.frame
munsell$color <- munsell2rgb(munsell$hue, munsell$valu, munsell$chroma)

#see!
print(munsell)
##     hue valu chroma     color
## 1  10YR    3      2 #554636FF
## 2  10YR    5      4 #947650FF
## 3   5YR    4      4 #805840FF
## 4 2.5YR    6      2 #A79086FF
## 5 2.5YR    2      3 #482A22FF

Artifact Counts, fractions, and depth of cultural material

Reloading the original csv data here to add some fake artifact counts to it

# reload data
dat <- read.csv(paste(c(getwd(), "/data/Soil_Profiles.csv"), collapse=''), 
                stringsAsFactors = FALSE)[,-1]
# add field for count of jasper flakes
dat$jasper <- sample(1:20, nrow(dat), replace=TRUE)
# a little function to compute the proportion of flakes in each horizon 
# (e.g 50% in Ap)
dat$jasperpcnt <- unlist(sapply(split(dat, f = dat$id), 
                  function(x) (x$jasper/sum(x$jasper)*100), simplify=TRUE))
# add a label field that has the horizon and jasper flake count
dat <- within(dat, jasperlabel <- paste(name, jasper, sep=" - "))

# see the data
print(dat)
##         id top bottom name    p1 soil_color jasper jasperpcnt jasperlabel
## 1  Pole 13   0     19   Ap 6.963    #5E4423      3     42.857      Ap - 3
## 2  Pole 13  19     55    B 5.951    #815A21      4     57.143       B - 4
## 3  Pole 14   0     24   Ap 5.245    #3D2F21     18     42.857     Ap - 18
## 4  Pole 14  24     45   Ab 5.069    #5E4423      5     11.905      Ab - 5
## 5  Pole 14  45     50    E 5.069    #BEAD97      7     16.667       E - 7
## 6  Pole 14  50     60    B 5.338    #B88E50     12     28.571      B - 12
## 7  Pole 22   0     19   Ap 6.454    #7A5C36      3      6.977      Ap - 3
## 8  Pole 22  19     30   Ab 5.844    #554636     20     46.512     Ab - 20
## 9  Pole 22  30     38    B 5.437    #94764F     13     30.233      B - 13
## 10 Pole 22  38     50    C 5.181    #9D7337      7     16.279       C - 7
## 11 Pole 37   0     31   Ap 5.019    #7A5C36     20     38.462     Ap - 20
## 12 Pole 37  31     63  Ap1 5.283    #554636      2      3.846     Ap1 - 2
## 13 Pole 37  63     90  Ap2 6.440    #5E4423     18     34.615    Ap2 - 18
## 14 Pole 37  90    117    B 6.670    #94764F     12     23.077      B - 12
# convert to apq object 
depths(dat) <- id ~ top + bottom

# plot with jasper quantity as color and lable with counts
plot(dat, name = "jasperlabel", color = "jasper")

# add overlay symbolizing the percent of jasper in each horizon
addVolumeFraction(dat, "jasperpcnt")
Color by artifact densities
# add in brackets to show depth of cultural material (fake in this example case)
fake.tops <- rep(0, times=length(dat))
fake.bottoms <- runif(n=length(dat), min=0, max=20)
par(mar=c(0,0,3,0)) # tighter figure margins
plot(dat, name = "jasperlabel", color = "soil_color")
addVolumeFraction(dat, "jasperpcnt")
addBracket(fake.tops, fake.bottoms, col='red') # add depth brackets

Showing the inclusion of densities.