Extracting JSON data from websites and public APIs with R

JSON
web scraping
NFL
espnscrapeR

tidyr + jsonlite are magical.

Author

Thomas Mock

Published

12-13-2020

Finding JSON Sources

I’ve covered some strategies for parsing JSON with a few methods in base R and/or tidyverse in a previous blog post. I’d like to go one step up in the chain, and talk about pulling raw data/JSON from sites. While having a direct link to JSON is common, in some situations where you’re scraping JavaScript fed by APIs the raw data source is not always as easy to find.

I have three examples for today:
- FiveThirtyEight 2020 NFL Predictions
- ESPN Win Percentage/play-by-play (embedded JSON)
- ESPN Public API

Web vs Analysis

Most of these JSON data sources are intended to be used with JavaScript methods, and have not been oriented to a “flat” data style. This means the JSON has lots of separations of the data for a specific use/purpose inside the site, and efficient singular representations of each data in JSON storage as opposed to normalized data with repeats in a dataframe. While extreme detail is out of scope for this blogpost, JSON is structured as a “collection of name/value pairs” or a “an ordered list of values”. This means it is typically represented in R as repeated lists of list elements, where the list elements can be named lists, vectors, dataframes, or character strings.

Alternatively typically data for analysis is usually most useful as a normalized rectangle eg a dataframe/tibble. “Under the hood a data frame is a list of equal length vectors” per Advanced R.

One step further is tidy data which is essentially “3rd normal form”. Hadley goes into more detail in his “Tidy Data” publication. The takeaway here is that web designers are optimizing for their extremely focused interactive JavaScript apps and websites, as opposed to novel analyses that we often want to work with. This is often why there are quite a few steps to “rectangle” a JSON.

An aside on Subsetting

Subsetting in R is done many different ways, and Hadley Wickham has an entire chapter dedicated to this in Advanced R. It’s worth reading through that chapter to better understand the nuance, but I’ll provide a very brief summary of the options.

# a VERY basic list of named elements
car_list <- list(manufacturer = "Honda", vehicle = "Civic", year = 2020)

“Subsetting a list works in the same way as subsetting an atomic vector. Using [ always returns a list; [[ and $, … let you pull out elements of a list.”

When working with lists, you can typically use $ and [[ interchangeably to extract single list elements by name. [[ requires exact matching whereas $ allows for partial matching, so I typically prefer to use [[. To extract by location from base R you need to use [[.

purrr functions pluck() and chuck() implement a generalised form of [[ that allow you to index deeply and flexibly into data structures. pluck() consistently returns NULL when an element does not exist, chuck() always throws an error in that case.”

So in short, you can use $, [[ and pluck/chuck in many of the same ways. I’ll compare all the base R and purrr versions below (all should return “Honda”).

# $ subsets by name
car_list$manufacturer
[1] "Honda"
# notice partial match
car_list$man
[1] "Honda"
# [[ requires exact match or position
car_list[["manufacturer"]]
[1] "Honda"
car_list[[1]]
[1] "Honda"
# pluck and chuck provide a more strict version of [[
# and can subset by exact name or position
purrr::pluck(car_list, "manufacturer")
[1] "Honda"
purrr::pluck(car_list, 1)
[1] "Honda"
purrr::chuck(car_list, "manufacturer")
[1] "Honda"
purrr::chuck(car_list, 1)
[1] "Honda"

For one more warning of partial name matching with $, where we now have a case of two elements with similar names see below:

car_list2 <- list(manufacturer = "Honda", vehicle = "Civic", manufactured_year = 2020)

# partial match throws a null
car_list2$man
NULL
# exact name returns actual elements
car_list2$manufacturer
[1] "Honda"

An aside on JavaScript

If we dramatically oversimplify JavaScript or their R-based counterparts htmlwidgets, they are a combination of some type of JSON data and then functions to display or interact with that data.

We can quickly show a htmlwidget example via the fantastic reactable R package.

library(reactable)

table_ex <- mtcars %>% 
  select(cyl, mpg, disp) %>% 
  reactable()

table_ex

That gives us the power of JavaScript in R! However, what’s going on with this function behind the scenes? We can extract the dataframe that has now been represented as a JSON file from the htmlwidget!

table_data <- table_ex[["x"]][["tag"]][["attribs"]][["data"]]

table_data %>% class()
[1] "json"

This basic idea, that the data is embedded as JSON to fill the JavaScript app can be further applied to web-based apps! We can use a similar idea to scrape raw JSON or query a web API that returns JSON from a site.

FiveThirtyEight

FiveThirtyEight publishes their ELO ratings and playoff predictions for the NFL via a table at projects.fivethirtyeight.com/2020-nfl-predictions/. They are also kind enough to post this data as a download publicly! However, let’s see if we can “find” the data source feeding the JavaScript table.

rvest

We can try our classical rvest based approach to scrape the HTML content and get back a table. However, the side effect of this is we’re returning the literal data with units, some combined columns, and other formatting. You’ll notice that all the columns show up as character and this introduces a lot of other work we’d have to do to “clean” the data.

library(xml2)
library(rvest)

url_538 <- "https://projects.fivethirtyeight.com/2020-nfl-predictions/"

raw_538_html <- read_html(url_538)

raw_538_table <- raw_538_html %>% 
  html_node("#standings-table") %>% 
  html_table(fill = TRUE) %>% 
  janitor::clean_names() %>% 
  tibble()

raw_538_table %>% glimpse()
Rows: 36
Columns: 12
$ x                 <chr> "", "", "", "elo with top qbelo rating", "1734", "17…
$ x_2               <chr> "", "", "", "1-week change", "+34", "", "", "", "", …
$ x_3               <chr> "", "", "", "current qb adj.", "", "", "", "", "", "…
$ x_4               <chr> "", "playoff chances", "", "", "", "", "", "", "", "…
$ x_5               <chr> "playoff chances", "playoff chances", "", "team", "B…
$ x_6               <chr> "playoff chances", "playoff chances", "", "division"…
$ playoff_chances   <chr> "playoff chances", "playoff chances", "", "make div.…
$ playoff_chances_2 <chr> "playoff chances", NA, "", "make conf. champ", "✓", …
$ playoff_chances_3 <chr> NA, NA, "", "make super bowl", "✓", "✓", "—", "—", "…
$ playoff_chances_4 <chr> NA, NA, "", "win super bowl", "✓", "—", "—", "—", "—…
$ x_7               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ x_8               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

Inspect + Network

Alternatively, we can Right Click + inspect the site, go to the Network tab, reload the site and see what sources are loaded. Again, FiveThirtyEight is very kind and essentially just loads the JSON as data.json.

I have screenshots below of each item, and the below is a short video of the entire process.

We can click over to the Network Tab after inspecting the site

We need to reload the web page to find sources

We can examine specific elements by clicking on them, which then shows us JSON!



In our browser inspect, we can see the structure, and that it has some info about games, QBs, and forecasts. This looks like the right dataset! You can right click on data.json and open it in a new page. The url is https://projects.fivethirtyeight.com/2020-nfl-predictions/data.json, and note that we can adjust the year to get older or current data. So https://projects.fivethirtyeight.com/2019-nfl-predictions/data.json returns the data for 2019, and you can go all the way back to 2016! 2015 also exists, but with a different JSON structure, and AFAIK they don’t have data before 2015.

Read the JSON

Now that we have a JSON source, we can read it into R with jsonlite. By using the RStudio viewer or listviewer::jsonedit() we can take a look at what the overall structure of the JSON.

library(jsonlite)

raw_538_json <- fromJSON("https://projects.fivethirtyeight.com/2020-nfl-predictions/data.json", simplifyVector = FALSE)

raw_538_json %>% str(max.level = 1)
List of 9
 $ archie                :List of 24
 $ clinches              :List of 114
 $ distances             :List of 32
 $ games                 :List of 269
 $ pageconfig            :List of 20
 $ playoff_qb_adjustments:List of 32
 $ qbs                   :List of 87
 $ urls                  :List of 2
 $ weekly_forecasts      :List of 2

Don’t forget that the RStudio Viewer also gives you the ability to export the base R code to access a specific component of the JSON!

Which gives us the following code:

raw_538_json[["weekly_forecasts"]][["forecasts"]][[1]][["types"]][["elo"]][[1]]

ex_538_data <- raw_538_json[["weekly_forecasts"]][["forecasts"]][[1]][["types"]][["elo"]][[1]]

ex_538_data %>% str()
List of 29
 $ conference           : chr "NFC"
 $ current_losses       : int 9
 $ current_ties         : int 0
 $ current_wins         : int 7
 $ division             : chr "NFC North"
 $ elo                  : num 1489
 $ losses               : int 9
 $ make_conference_champ: int 0
 $ make_divisional_round: int 0
 $ make_playoffs        : int 0
 $ make_superbowl       : int 0
 $ name                 : chr "MIN"
 $ point_diff           : int -45
 $ points_allowed       : int 475
 $ points_scored        : int 430
 $ rating               : num 1481
 $ rating_current       : num 1502
 $ rating_top           : num 1502
 $ seed_1               : int 0
 $ seed_2               : int 0
 $ seed_3               : int 0
 $ seed_4               : int 0
 $ seed_5               : int 0
 $ seed_6               : int 0
 $ seed_7               : int 0
 $ ties                 : int 0
 $ win_division         : int 0
 $ win_superbowl        : int 0
 $ wins                 : int 7

We can also play around with listviewer.

raw_538_json %>% 
  listviewer::jsonedit()