TL;DR
The punchline: first, it appears that a Pokémon's CP is correlated directly with HP, with few, if any, other related metrics. Second, and perhaps more interestingly, the post-evolution CP increase appears to be a random draw from a parameterized beta distribution centered on a percent CP increase specific to each Pokémon species. Let me explain…
Ok, this is a bit odd. Before last month, all I knew of Pokémon was a yellow thing called Pikachu and something about a card game. Then Pokémon Go came out a few weeks ago, and I decided to check it out. Now I am an avid player and find the intricacies quite interesting, along with the psychology of the gameplay. I don’t spend much time reading about the game on the internet, but it is fun to merge the game with an analysis. The following post is akin to an Exploratory Data Analysis (EDA) of the data, not a full inference or comprehensive study. The full code for this analysis can be found at this Gist.
Why bother?
The ever enjoyable Data Machina email newsletter included a link to a Pokémon dataset on OpenIntro in the last edition, and I followed it down the rabbit hole. This data set contains observations of a number of measurements of specific Pokémon species before and after a series of evolutions. For non-Poké people, one goal of the game is to collect Pokémon and then evolve them into more powerful forms. The metric for “more powerful” is the Combat Power (CP). Like many things in this game, the mechanism behind how the CP is calculated and affected by evolution is unknown. However, I am sure a quick Google search will turn up plenty of ideas. The purpose of the data set on OpenIntro is to look at these variables and try to discover how the CP is modified by an evolution. The end game of that quandary is to more efficiently select the best Pokémon to spend your limited resources on evolving. The final chapter in this narrative is that a more efficiently evolved Pokémon will allow you to be more competitive in gym battles and therefore represent your team (one of Blue, Yellow, or Red), gain Pokémon coins, and attract rarer Pokémon. If a little insight into how the underlying mechanism of CP works gets you there faster, all the better.
The intent of the posting of this data set on OpenIntro is to see if the provided variables contribute meaningfully to the data generating mechanism of the CP relative to evolutions. As stated on the site:
A key part of Pokémon Go is using evolutions to get stronger Pokémon, and a deeper understanding of evolutions is key to being the greatest Pokémon Go player of all time. This data set covers 75 Pokémon evolutions spread across four species. A wide set of variables are provided, allowing a deeper dive into what characteristics are important in predicting a Pokémon’s final combat power (CP).
Example research questions: (1) What characteristics correspond to an evolved Pokémon with a high combat power? (2) How predictable is CP from an evolution?
Good questions. The analysis that follows takes a quick look at Q1, concluding that most of the action is in the HP and few other variables. Question 2 is the more direct focus of this post, but I formalized the question a bit: What accounts for the variation in CP gain per evolution within a given species?
The Data
The data from OpenIntro is pretty limited; no Big Poké Data here. There is a total of 75 observations from pre and post-evolutions of Pokés from four different species. The species are an important distinction. This analysis will recognize the variation between species, but try to find a universal model that fits all species. Note the difference in sample size between species. The uncertainty of small sample size has an important effect here.

    > colnames(pokedat)
     [1] "name"                   "species"                 "cp"
     [4] "hp"                     "weight"                  "height"
     [7] "power_up_stardust"      "power_up_candy"          "attack_weak"
    [10] "attack_weak_type"       "attack_weak_value"       "attack_strong"
    [13] "attack_strong_type"     "attack_strong_value"     "cp_new"
    [16] "hp_new"                 "weight_new"              "height_new"
    [19] "power_up_stardust_new"  "power_up_candy_new"      "attack_weak_new"
    [22] "attack_weak_type_new"   "attack_weak_value_new"   "attack_strong_new"
    [25] "attack_strong_type_new" "attack_strong_value_new" "notes"

Species | Count |
---|---|
Caterpie | 10 |
Eevee | 6 |
Pidgey | 39 |
Weedle | 20 |
Thinking about the data and the two questions above, my interest turned towards univariate correlation between the variables offered, to figure out what the real metric is. Diagnosing correlation is a visual approach that offers an understanding of what is moving in step with what. Of course “correlation is not… ” and all that, but it sure as heck doesn’t hurt when you are trying to develop a data generating model. The metric of interest is not only the pre and post-evolution CP (cp and cp_new respectively), but some permutation of them that incorporates the fact that species seems to have a big effect on the output.
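As a rough illustration of that kind of correlation check, here is a minimal base R sketch. It runs on simulated stand-in data, not the OpenIntro file: the column names follow the post, but every value is made up, and the `toy` data frame is purely hypothetical.

```r
# Simulated stand-in data (NOT the OpenIntro values); column names follow the post.
set.seed(1)
n <- 30
toy <- data.frame(
  cp     = round(runif(n, 50, 400)),
  hp     = round(runif(n, 20, 80)),
  weight = runif(n, 1, 5),
  height = runif(n, 0.2, 0.6)
)
# Assumption for the toy data only: post-evolution CP is roughly 1.7x to 2.1x the original.
toy$cp_new <- round(toy$cp * runif(n, 1.7, 2.1))

# Keep the numeric columns and compute pairwise Pearson correlations.
num_cols <- toy[sapply(toy, is.numeric)]
cor_mat  <- cor(num_cols, use = "pairwise.complete.obs")

# Correlation of every numeric variable with post-evolution CP, strongest first.
print(round(sort(cor_mat["cp_new", ], decreasing = TRUE), 2))
```

On the real data, the same call on the numeric columns of pokedat would give a first look at which variables move with cp_new; a plot such as pairs(num_cols) is the visual counterpart.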

    library(dplyr)   # mutate(), group_by(), %>%
    library(tibble)  # as_tibble()

    pokedat <- as_tibble(read.csv("pokemon.csv"))
    dat1 <- mutate(pokedat, delta_cp = cp_new - cp) %>% # net change in cp
      # the % change in cp from old to new
      mutate(pcnt_d_cp = delta_cp / cp) %>%
      # group by species to calculate additional variables
      group_by(species) %>%
      # species grouped mean percent change
      mutate(spec_mean_pdcp = mean(pcnt_d_cp)) %>%
      # species grouped std dev of % changes from old to new cp
      mutate(spec_mean_pdcp_sd = sd(pcnt_d_cp)) %>%
      # difference in % delta cp (pdcp) change from species group mean
      mutate(cent_spec_mean_pdcp = pcnt_d_cp - spec_mean_pdcp) %>%
      # z score for pdcp change within species group
      mutate(z_spec_mean_pdcp = cent_spec_mean_pdcp / spec_mean_pdcp_sd) %>%
      data.frame()

The modification of the data in the code above spells out the development of the desired metric. Since we are talking about an evolutionary process, the desired metric has to do with a change in state; in this case, the change from pre to post-evolution CP, referred to as the delta CP. Further, since we have multiple species (and there are many more than four in the Pokémon universe), we consider delta CP relative to the mean of each species. Finally, since each species has a different range of CP (some species are more powerful than others), it is better to look at the percent change per species than at an absolute CP increase. Based on these observations, the desired metric is the Percent Delta CP (PDCP), considered relative to each species, or spec_pdcp. This is the metric for which we seek to identify a data generating mechanism.
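To make the metric concrete, here is a small worked example in base R. It is only a sketch: the species labels "A" and "B" and all CP values are invented for illustration, and it uses ave() as a stand-in for the grouped dplyr mutate() calls so that it runs without extra packages. The derived column names match the ones used in the post.

```r
# Toy values invented for illustration; NOT taken from the OpenIntro data.
toy <- data.frame(
  species = c("A", "A", "A", "B", "B", "B"),
  cp      = c(100, 200, 400, 50, 100, 200),
  cp_new  = c(190, 400, 840, 55, 120, 260)
)

# Percent delta CP: change in CP as a fraction of the pre-evolution CP.
toy$pcnt_d_cp <- (toy$cp_new - toy$cp) / toy$cp

# Species-level mean and standard deviation, repeated on each row of that species.
toy$spec_mean_pdcp    <- ave(toy$pcnt_d_cp, toy$species, FUN = mean)
toy$spec_mean_pdcp_sd <- ave(toy$pcnt_d_cp, toy$species, FUN = sd)

# Centre on the species mean, then scale to a within-species z score.
toy$cent_spec_mean_pdcp <- toy$pcnt_d_cp - toy$spec_mean_pdcp
toy$z_spec_mean_pdcp    <- toy$cent_spec_mean_pdcp / toy$spec_mean_pdcp_sd

print(toy)
```

The two toy species have very different average gains (about 100% versus 20%), yet after centring and scaling their rows land on the same within-species scale, which is what lets one model be compared across species.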