The alluvial plot and Sankey diagram are both forms of the more general flow diagram. These plot types are designed to show the change in magnitude of some quantity as it flows between states. The states (often on the x-axis) can be over space, time, or any other discrete index. While the alluvial and Sankey diagrams are differentiated by the way that the magnitude of flow is represented, they share a lot in common and the two terms are often used interchangeably for this type of plot. The packages discussed here typically invoke (correctly IMHO) the term “alluvial”, but be aware that “Sankey” will also come up in discussions of the same plot format. Also, check out @timelyportfolio’s new post on interactive JS Sankey-like plots.
This post is motivated by a few recent events: 1) I required a visualization of the “flow” of decisions between two models; 2) my tweet asking for ways to achieve this in R; 3) the ensuing discussion within that thread; and 4) a connected thread on GitHub covering some of the same ground. I figured I would put this post together to address two issues that arose in the events above:
- A very basic introduction to the landscape of non-JS/static alluvial plot packages in R.
- Whether, within these packages, it is possible to use ordered factors to change the vertical (y-axis) order of strata.
To those ends, below is a surficial and brief look at how to get something out of each of these packages, along with approaches to reordering. These packages have variable levels of documentation, but suffice to say that for each one you can learn a lot more from its documentation than you can here. However, this is a corner of data vis in R that is not as well known as some others, a view supported by the Twitter and GitHub threads on the matter. Of course, as these packages develop, this code may fall into disrepair and new options will be added.
R packages used in this post
library("dplyr")
library("ggplot2")
library("alluvial")
# devtools::install_github('thomasp85/ggforce')
library("ggforce")
# devtools::install_github("corybrunson/ggalluvial")
library("ggalluvial")
library("ggparallel")
To make life a little easier, I put a few common parameters up here. If you plan to cut-n-paste code from the plot examples, note that you will need these values as well to make them work as-is.
Part of the issue in this post is to look at what ordering factors does to change (or more likely not change) the vertical order of strata. Here I set factor levels to “A”, “C”, “B” and will use this in the examples below to see what using factors does to each plotting function. This order was used because it does not follow either ascending or descending alphabetical order.
A_col <- "darkorchid1"
B_col <- "darkorange1"
C_col <- "skyblue1"
alpha <- 0.7 # transparency value
fct_levels <- c("A","C","B")
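As a quick illustration of what setting factor levels does in base R (before any plotting package gets involved), here is a minimal sketch. The vector `x` and the names `default_fct` and `custom_fct` are made up for this example; only `fct_levels` comes from the parameters above.

```r
fct_levels <- c("A", "C", "B")   # same non-alphabetical order used in this post
x <- c("A", "B", "C", "B")       # hypothetical decisions, for illustration only

default_fct <- factor(x)                       # levels are sorted alphabetically
custom_fct  <- factor(x, levels = fct_levels)  # levels follow the order supplied

levels(default_fct)     # "A" "B" "C"
levels(custom_fct)      # "A" "C" "B"
as.integer(custom_fct)  # 1 3 2 3 (the underlying codes follow the custom order)
```

Whether a given plotting function honors the level order or falls back to alphabetical order of the labels is exactly what the examples below are meant to probe.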
Simulate data for plotting
The initial thread I started on Twitter was in response to my need to visualize the “flow” of decisions between two models. One model is the result of a predictive model I made a few years back; these values are labelled as “Modeled” in the data below. The other model is an implementation of the statistical model where individual decisions were made to change predictions or leave them the same; here the implementation is labelled “Tested”. The predictions that were decided upon are simply called “A”, “B”, and “C” here, but these could be any exhaustive set of non-overlapping values. Other examples could be “Buy”, “Sell”, and “Hold” or the like. The point of the graphic is to visually represent the mass and flow of values between the statistical model and the implemented model. Sure, this can be done in a table (such as the one down below), but it also lends itself to a visual approach. Is it the best way to visualize these data? I have no idea, but it is the one that I like.
dat_raw is the full universe of decisions shared by the two models (all the data). The dat data frame is the full data aggregated by decision pairs, with the addition of a count of observations per pair. As you will see in the code for the plots below, each package ingests the data in a slightly different manner. As such, dat and dat_raw serve as the base data for the rest of the post.
dat_raw <- data.frame(Tested = sample(c("A","B","C"), 100, replace = TRUE, prob = c(0.2,0.6,0.25)),
                      Modeled = sample(c("A","B","C"), 100, replace = TRUE, prob = c(0.56,0.22,0.85)),
                      stringsAsFactors = FALSE)

dat <- dat_raw %>%
  group_by(Tested, Modeled) %>%
  summarise(freq = n()) %>%
  ungroup()
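To make the shape of the aggregated data concrete, here is a rough base R sketch of the same pair-count aggregation. The `set.seed()` call and the name `dat_base` are my additions for this illustration and are not part of the original code; the seed only makes the random draw repeatable.

```r
set.seed(42)  # assumption: any seed works; it only makes the simulation repeatable

# Same simulation as above; prob values are weights and need not sum to 1
dat_raw <- data.frame(
  Tested  = sample(c("A", "B", "C"), 100, replace = TRUE, prob = c(0.2, 0.6, 0.25)),
  Modeled = sample(c("A", "B", "C"), 100, replace = TRUE, prob = c(0.56, 0.22, 0.85)),
  stringsAsFactors = FALSE
)

# Cross-tabulate the decision pairs, then reshape to one row per pair
dat_base <- as.data.frame(
  table(Tested = dat_raw$Tested, Modeled = dat_raw$Modeled),
  responseName = "freq",
  stringsAsFactors = FALSE
)

# group_by() + summarise() never produces rows for empty pairs, so drop them here too
dat_base <- dat_base[dat_base$freq > 0, ]

# Every raw observation lands in exactly one pair
sum(dat_base$freq) == nrow(dat_raw)
```

The result has one row per observed (Tested, Modeled) pair and a freq column, which is the layout the plotting code in this post expects from dat.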
ggforce is Thomas Lin Pedersen’s package for adding a range of new geoms and functionality to ggplot2. The geom_parallel_sets geom used to make the alluvial chart here is only available in the development version installed from GitHub. ggforce provides a helper function, gather_set_data(), to get dat into the correct format for ggplot; see below.
Ordering – By default, ggforce orders the strata (i.e. “A”, “B”, “C”) alphabetically from the bottom up. There is no obvious way to change this default, but I am looking into it.
dat_ggforce <- dat %>%
  gather_set_data(1:2) %>% # <- ggforce helper function
  arrange(x, Tested, desc(Modeled))

ggplot(dat_ggforce, aes(x = x, id = id, split = y, value = freq)) +
  geom_parallel_sets(aes(fill = Tested), alpha = alpha, axis.width = 0.2,
                     n = 100, strength = 0.5) +
  geom_parallel_sets_axes(axis.width = 0.25, fill = "gray95",
                          color = "gray80", size = 0.15) +
  geom_parallel_sets_labels(colour = 'gray35', size = 4.5, angle = 0, fontface = "bold") +
  scale_fill_manual(values = c(A_col, B_col, C_col)) +
  scale_color_manual(values = c(A_col, B_col, C_col)) +
  theme_minimal() +
  theme(
    legend.position = "none",
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.text.y = element_blank(),
    axis.text.x = element_text(size = 20, face = "bold"),
    axis.title.x = element_blank()
  )
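For anyone curious about what the reshaping step does to the data, below is a base R approximation of it. To be clear, `gather_sets_sketch()` is a hypothetical stand-in written for this post, not ggforce's own code, and the `toy` data frame is invented. It assumes the helper stacks one copy of the data per axis column and adds an `id` (row of the original data), an `x` (axis name), and a `y` (the value on that axis), which are the columns mapped in the `aes()` call above.

```r
# Hypothetical stand-in for the reshaping step; NOT the ggforce implementation
gather_sets_sketch <- function(data, cols, id_name = "id") {
  if (is.numeric(cols)) cols <- names(data)[cols]  # allow column positions
  data[[id_name]] <- seq_len(nrow(data))           # one id per original row
  pieces <- lapply(cols, function(n) {
    data$x <- n           # which axis this copy belongs to
    data$y <- data[[n]]   # the stratum on that axis
    data
  })
  do.call(rbind, pieces)  # stack the copies: one block of rows per axis
}

# Invented toy data with the same columns as dat
toy <- data.frame(
  Tested  = c("A", "A", "B"),
  Modeled = c("A", "C", "B"),
  freq    = c(10, 5, 20),
  stringsAsFactors = FALSE
)

long <- gather_sets_sketch(toy, 1:2)
long
```

Each original row appears twice in `long`, once per axis, and the shared `id` is what lets the geom connect the two ends of a flow.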