Pokémon GO post-evolution CP: A model


The punchline: It appears that firstly, the metric of Pokémon CP is correlated directly to HP with few, if any, other related metrics.  Secondly and perhaps more interestingly, that the post_evolution CP increase is a random draw from a parameterized beta distribution centered on a CP percent increase specific to each Pokémon species. Let me explain…

Black and White Pokeball Outline

Ok, this is a bit odd.  Before last month, all I knew of Pokémon was a yellow thing called Pikachu and something about a card game.  Then Pokémon Go came out a few weeks ago, and I decided to check it out.  Now I am an avid player and find the intricacies quite interesting; in addition to the psychology of the game play.  I don’t spend too much time reading the internet about the game, but it is fun to merge the game with an analysis.  The following post is akin to an Exploratory Data Analysis (EDA) of the data; not a full inference or comprehensive study.  The full code for this analysis can be found at this Gist.

Why bother?

The ever enjoyable Data Machina email newsletter included a link to a Pokémon dataset on OpenIntro in the last edition and I followed it down the rabbit hole. This data set contains the observations of a number of measurements of a specific Pokémon species before and after a series of evolutions. For non-Poké people, one goal of the game is to collect Pokémon and then evolve them into more powerful forms.  The metric for “more powerful” is the Combat Power (CP).  Like many things in this game, the mechanism behind how the CP is calculated and effected by evolution is unknown.  However, I am sure a quick Google search will turn up plenty of ideas.  The purpose of the data set on OpenIntro is to look at these variables and try to discover how the CP is modified by an evolution.  The end game to that quandary is to more efficiently select the best Pokémon to spend your limited resources on to evolve.  The final chapter is this narrative is that a more efficiently evolved Pokémon will allow you to be more competitive in gym battles and therefore represent your team (one of Blue, Yellow, or Red), gain Pokémon coins, and attract more rare Pokémons.  If a little insight on how the underlying mechanism of CP works gets you there faster, all the better.


The intent of the posting of this data set on OpenIntro is to see if the provided variables contribute meaningfully to the data generating mechanism of the CP relative to evolutions.  As stated on the site:

A key part of Pokémon Go is using evolutions to get stronger Pokémon, and a deeper understanding of evolutions is key to being the greatest Pokémon Go player of all time. This data set covers 75 Pokémon evolutions spread across four species. A wide set of variables are provided, allowing a deeper dive into what characteristics are important in predicting a Pokémon’s final combat power (CP).

Example research questions: (1) What characteristics correspond to an evolved Pokémon with a high combat power? (2) How predictable is CP from an evolution?

Good questions.  The analysis that follows takes a quick look at Q1, concluding that most of the action is in the HP and few other variables.  Question 2 is the more direct focus of this post, but I formalized the question a bit: What accounts for the variation in CP gain per evolution within a given species?

The Data

The data from OpenIntro is pretty limited; no Big Poké Data here.  There is a total of 75 observations from pre and post-evolutions of Pokés from four different species.  The species are an important distinction.  This analysis will recognize the variation between species, but try to find a universal model that fits all species.  Note the difference in sample size between species.  The uncertainty of small sample size has an important effect here.

> colnames(pokedat)
[1] "name" "species" "cp"
[4] "hp" "weight" "height"
[7] "power_up_stardust" "power_up_candy" "attack_weak"
[10] "attack_weak_type" "attack_weak_value" "attack_strong"
[13] "attack_strong_type" "attack_strong_value" "cp_new"
[16] "hp_new" "weight_new" "height_new"
[19] "power_up_stardust_new" "power_up_candy_new" "attack_weak_new"
[22] "attack_weak_type_new" "attack_weak_value_new" "attack_strong_new"
[25] "attack_strong_type_new" "attack_strong_value_new" "notes"
 Species Count
Caterpie 10
Eevee 6
Pidgey 39
Weedle 20

Thinking about the data and the two questions above, my interest turned towards univariate correlation between the variables offered figuring out that the real metric is. Diagnosing correlation would be a visual approach that offers an understating of what is moving in-step with what.  Of course “correlation is not… ” and all that, but it sure as heck doesn’t hurt when you are trying to develop a data generating model.  The metric of interest is not only the pre and post-evolution CP (cp and cp_new respectively), but some permutation of that incorporated the fact that species seems to have a big effect on the output.

pokedat <- as_tibble(read.csv(&quot;pokemon.csv&quot;))

dat1 <- mutate(pokedat, delta_cp = cp_new - cp) %>% # net change in cp
# the % change in cp from old to new
mutate(pcnt_d_cp = delta_cp / cp) %>%
# group by species to calculate additonal varaibles
group_by(species) %>%
# species grouped mean percent change
mutate(spec_mean_pdcp = mean(pcnt_d_cp)) %>%
# species grouped std dev of % changes from old to new cp
mutate(spec_mean_pdcp_sd = sd(pcnt_d_cp)) %>%
# difference in % delta cp (pdcp) change from species group mean
mutate(cent_spec_mean_pdcp = pcnt_d_cp - spec_mean_pdcp) %>%
# z score for pdcp change within species group
mutate(z_spec_mean_pdcp = cent_spec_mean_pdcp / spec_mean_pdcp_sd) %>%


the modification of the data in the code above spells out the development of the desired metric.  Since with are talking about an evolutionary process, the desired metric has to do with chance in states.  In this case the change from pre to post-evolution CP; referred to at the delta CP.  Further, since we have multiple species (and there are many more than four in the Pokémon universe), we consider delta CP relative to the mean of each species.  Finally, as each species has a different range of CP as some species are more powerful than other, it is good to look at the percent change per species, as opposed to an absolute CP increase.  Based on these observations, the desired metric is based on the Percent Delta CP (PDCP) per species.  The metric is the PDCP and it is considered relative to each species; or spec_pdcp.  This is the metric that we seek to identify a data generating mechanism for.

Correlation = Where’s the party?


Continue reading “Pokémon GO post-evolution CP: A model”

Pokémon GO post-evolution CP: A model

ggplot2 step by step

Building a ggplot2 Step by Step

I have included this viz on my blog before; as an afterthought to a more complex viz of the same data.  However, I was splitting out the steps to the plot for another purpose and though it would be worth while to post this as a step-by-step how to.  The post below will step through the making of this plot using R and ggplot2 (the dev version from Github).  Each code chunk and accompanying image adds a little bit to the plot on its way to the final plot; depicted here.  Hopefully this can help or inspire someone to take their plot beyond the basics.



Step 1 is to load the proper libraries which include ggplot2 (of course), ggalt for the handy ggsave function, and extrafont for the use of a range of fonts in your plot.  I did this on a mac book pro and using the extra fonts was pretty easy.  I have done the same on a windows machine and it was more of a pain.  Google around and it should work out.

library("ggplot2") # dev version
library("ggalt") # dev version, for ggsave()

Create the data… I included R code at the end of the post to recreate the data.frame that is in the plots below.  You will have to run that code block to make an object called dat so that the rest of this code will work.

# run the code block for making the 'dat' data.frame.
# code is located at bottom of this post

So that whole point to why I was making this plot was to show a trend in population change over time and over geography.  The hypothesis put forth by the author of this study (Zubrow, 1974) is that over time population decreased in the east and migrated to the west.  The other blog post I did covers all of this in more details.

If one were starting out in R, they may go straight to the base plot function to plot population over time as shown below.  [to be fair, there are many expert R users that will go straight to the base plot function and make fantastic plots.  I am just not one of them.  That is a battle I will leave to the pros.]

# base R line plot of population over year
plot(dat$Population ~ dat$Year, type = "l")


That result is underwhelming and not very informative.  A jumble of tightly spaced lines don’t tell us much.  Further, this is plotting data across all of the 19 sites we have data on. Instead of pursuing this further in base plot and move to ggplot2.

ggplot2 is built on the Grammar of Graphics model, check out a quick intro here.  The basic idea is to plot by layers using data and geometries intuitively. To me, plotting in ggplot2 is much like building a GIS map.  Data, layers, and geometries; it all sounds pretty cartographic to me.

Continue reading “ggplot2 step by step”

ggplot2 step by step