5  Data Collection with APIs and Web Scraping

5.0.1 Key Topics

  • Introduction to data collection through APIs using packages like httr and jsonlite.
  • Basics of web scraping with rvest for gathering data from websites.
  • Hands-on exercises to retrieve data from a public API and scrape a website to collect and prepare a dataset for analysis.

5.0.2 Outcome

Participants will be able to collect data via APIs and web scraping to use in further analyses.

5.1 Introduction to APIs

APIs (Application Programming Interfaces) allow you to access data from web services in a structured format, typically JSON. In R, the httr package is commonly used for making HTTP requests, and jsonlite parses the JSON responses into R objects.

5.1.1 Example: Fetching Data from an API

Show the code
library(httr)
library(jsonlite)

# Example API request
response <- GET("https://api.exchangerate-api.com/v4/latest/USD")

# Check the status of the request
status_code(response)
[1] 200
Show the code
# Parse the JSON content (specifying the encoding avoids the
# "No encoding supplied: defaulting to UTF-8" warning)
data <- content(response, as = "text", encoding = "UTF-8")
parsed_data <- fromJSON(data)

# Display some exchange rates
head(parsed_data$rates)
$USD
[1] 1

$AED
[1] 3.67

$AFN
[1] 68.12

$ALL
[1] 91.22

$AMD
[1] 387.16

$ANG
[1] 1.79
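The rates come back as a named list, which is awkward to filter or plot directly. One way to reshape it into a tidy data frame is sketched below; the `rates` list here is a small stand-in for `parsed_data$rates`, so the block runs without a network connection.

```r
# Stand-in for parsed_data$rates (a named list of exchange rates)
rates <- list(USD = 1, AED = 3.67, AFN = 68.12, ALL = 91.22)

# Reshape the named list into a two-column data frame
rates_df <- data.frame(
  currency = names(rates),
  rate = unlist(rates, use.names = FALSE)
)
head(rates_df)
```

With the live API response, the same two lines applied to `parsed_data$rates` produce one row per currency.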

5.2 Introduction to Web Scraping with rvest

Web scraping involves extracting data from websites. The rvest package is designed for this purpose, allowing you to navigate HTML content and extract useful information.

5.2.1 Example 1: Scraping a Table from Wikipedia

Show the code
library(rvest)

# URL of the page to scrape
url <- "https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes"

# Read the HTML content
page <- read_html(url)

# Extract tables from the page
tables <- html_table(page, fill = TRUE)

# Display the first table
head(tables[[1]])
# A tibble: 6 × 8
  `ISO 3166[1]name[5]` `Official state name[6][a]`        `Sovereignty[6][7][8]`
  <chr>                <chr>                              <chr>                 
1 ISO 3166[1]name[5]   Official state name[6][a]          Sovereignty[6][7][8]  
2 Afghanistan          the Islamic Republic of Afghanist… UN member             
3 Åland Islands        Åland[c][d]                        Finland               
4 Albania              the Republic of Albania            UN member             
5 Algeria              the People's Democratic Republic … UN member             
6 American Samoa       American Samoa[c]                  United States         
# ℹ 5 more variables: `ISO 3166-1[2]` <chr>, `ISO 3166-1[2]` <chr>,
#   `ISO 3166-1[2]` <chr>, `ISO 3166-2[3]subdivision codes link` <chr>,
#   `TLD[9]` <chr>

5.2.2 Example 2: Scraping Full Data from a Table

Show the code
# install.packages("tidyverse")
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()         masks stats::filter()
✖ purrr::flatten()        masks jsonlite::flatten()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag()            masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Show the code
# install.packages("rvest")
library(rvest)

url <- 'https://en.wikipedia.org/wiki/List_of_countries_by_foreign-exchange_reserves'
# Read the HTML from the Wikipedia page
wikiforreserve <- read_html(url)
class(wikiforreserve)
[1] "xml_document" "xml_node"    
Show the code
## Find the table's XPath with the Inspect Element feature in Safari,
## Chrome, or Firefox: in the Inspect tab, locate the <table class=...> tag,
## right-click it, choose Copy --> XPath, and paste the result into
## html_nodes(xpath = ...)

foreignreserve <- wikiforreserve %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/div[1]/table[1]') %>%
  html_table()
class(foreignreserve) # html_table() returns a list of tables
[1] "list"
Show the code
fores <- foreignreserve[[1]][, c(1, 4, 5, 6, 7, 8, 9)] # [[ ]] extracts the single table from the list

# Assign clearer column names
names(fores) <- c("Country", "Forexreswithgold", "Date1", "Change1",
                  "Forexreswithoutgold", "Date2", "Change2")
colnames(fores)
[1] "Country"             "Forexreswithgold"    "Date1"              
[4] "Change1"             "Forexreswithoutgold" "Date2"              
[7] "Change2"            
Show the code
head(fores$Country, n=10)
 [1] "Countries"    "China"        "Japan"        "Switzerland"  "India"       
 [6] "Russia"       "Taiwan"       "Saudi Arabia" "South Korea"  "Hong Kong"   
Show the code
## Clean up variables: the date columns are read in as character strings

# Convert Date1 to the Date class
fores$Date1 <- as.Date(fores$Date1, format = "%d %b %Y")
class(fores$Date1)
[1] "Date"
Show the code
head(fores)
# A tibble: 6 × 7
  Country  Forexreswithgold Date1      Change1 Forexreswithoutgold Date2 Change2
  <chr>    <chr>            <date>     <chr>   <chr>               <chr> <chr>  
1 Countri… Millions americ… NA         Change  Millions american … Last… Change 
2 China    3,571,803        2024-10-31 21,957  3,380,334           31 O… 415    
3 Japan    1,238,950        2024-11-01 15,948  1,164,583           1 No… 18,782 
4 Switzer… 952,687          2024-09-30 1,127   864,519             30 S… 604    
5 India    682,130          2024-11-01 2,675   612,379             1 No… 3,899  
6 Russia   628,500          2024-11-01 3,400   596,100             1 No… 720    
Show the code
# Save the cleaned table (data.table::fwrite() is a faster alternative for large tables)
# write.csv(fores, "fores.csv", row.names = FALSE)
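Before saving, the table usually needs two further cleanup steps: the first row repeats the column headers (note "Countries" at the top of the `Country` column above), and the reserve figures are character strings with thousands separators. A sketch of both steps, using a small stand-in data frame that mimics the scraped `fores`:

```r
# Stand-in mimicking the first rows of the scraped table
fores_demo <- data.frame(
  Country = c("Countries", "China", "Japan"),
  Forexreswithgold = c("Millions american dollars", "3,571,803", "1,238,950"),
  stringsAsFactors = FALSE
)

# Drop the repeated header row
fores_clean <- fores_demo[-1, ]

# Strip thousands separators and convert the figures to numeric
fores_clean$Forexreswithgold <- as.numeric(gsub(",", "", fores_clean$Forexreswithgold))

fores_clean
```

Applied to the real `fores`, the same `gsub()` + `as.numeric()` pattern works for each of the reserve and change columns.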

5.3 Hands-On Exercise

5.3.1 Exercise 1: Using the REST Countries API

Exercise Title: Exploring Country Data with REST Countries API

Objective: Learn how to retrieve and analyze data about countries using the REST Countries API.

5.3.1.1 1. Overview of the REST Countries API

The REST Countries API provides detailed information about countries, including names, populations, area sizes, languages, currencies, and more. It is completely free to use and does not require authentication.

  • API URL: https://restcountries.com
  • Use Cases: Analyze global metrics, visualize country statistics, or create a simple application that displays country information.

5.3.1.2 2. Example Code to Fetch Country Data

Here’s how you can use R to fetch data from the REST Countries API:

Show the code
# Load necessary libraries
library(httr)
library(jsonlite)

# Define the API endpoint
url <- "https://restcountries.com/v3.1/all"

# Make a GET request to the API
response <- GET(url)

# Check if the request was successful
if (status_code(response) == 200) {
  # Parse the JSON response (specifying the encoding avoids a warning)
  countries_data <- fromJSON(content(response, as = "text", encoding = "UTF-8"))
  
  # Display the first few rows of the data
  head(countries_data$population)
} else {
  print("Failed to retrieve data")
}
[1]       30   112519  8654622  7976985  9749763 23503349

5.3.1.3 3. Analyze the Data

Once you have retrieved the country data, you can perform various analyses. Here are a few ideas:

  • Visualize Population Distribution: Create a histogram or bar chart showing the population distribution of countries.
  • Filter by Region: Extract countries from a specific region (e.g., Europe) and analyze their statistics.
  • Compare Area Sizes: Find out which countries have the largest and smallest areas.
Show the code
library(ggplot2)

# Example: Visualize population distribution
ggplot(data = countries_data, aes(x = population)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  scale_x_log10() + # Log scale for better visualization
  labs(title = "Population Distribution of Countries",
       x = "Population (log scale)",
       y = "Number of Countries") +
  theme_bw()
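The other two ideas can be sketched the same way. The snippet below assumes the API's `region` and `area` fields and substitutes a small stand-in data frame for `countries_data`, so it runs offline; with the live data, replace `demo` with `countries_data`.

```r
library(dplyr)

# Stand-in with the region and area fields the API returns
demo <- data.frame(
  country = c("France", "Brazil", "Germany", "Vatican City"),
  region  = c("Europe", "Americas", "Europe", "Europe"),
  area    = c(551695, 8515767, 357114, 0.44)
)

# Filter by region: keep European countries only
europe <- demo %>% filter(region == "Europe")

# Compare area sizes: sort from largest to smallest
europe %>% arrange(desc(area))
```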

5.3.2 Additional Free APIs for Practice

If you want more options for exercises, consider these additional free APIs:

  1. JSONPlaceholder:
    • URL: https://jsonplaceholder.typicode.com
    • Use: A fake online REST API for testing and prototyping. You can simulate CRUD operations with posts, comments, and user data.
  2. PokeAPI:
    • URL: https://pokeapi.co
    • Use: Access Pokémon data for projects on data relationships and stats visualization. Rate-limited to 100 requests per minute.
  3. Open Library Books API:
    • URL: https://openlibrary.org/dev/docs/api/books
    • Use: Access data about millions of books, including titles, authors, and publication dates. Unlimited access without authentication.
  4. Dog CEO’s Dog API:
    • URL: https://dog.ceo/dog-api
    • Use: Get random dog images and breed information. Free access with no authentication required.
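As a quick taste of JSONPlaceholder, a live request would look like `fromJSON("https://jsonplaceholder.typicode.com/posts/1")`. The sketch below performs the same parse offline on a hard-coded string with the field structure JSONPlaceholder uses for a post (the title and body here are placeholders, not the real post's content):

```r
library(jsonlite)

# Hard-coded JSON mirroring the shape of a JSONPlaceholder /posts item
json <- '{"userId": 1, "id": 1, "title": "sample title", "body": "sample body"}'

# fromJSON() turns the JSON object into a named list
post <- fromJSON(json)
post$title
```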

5.3.3 Exercise 2: Scrape Data from a Web Page

  1. Choose a webpage that contains tabular data.
  2. Use rvest to scrape the table and convert it into a data frame.
Show the code
# install.packages("tidyverse")
library(tidyverse)
# install.packages("rvest")
library(rvest)

url <- 'https://en.wikipedia.org/wiki/List_of_countries_by_foreign-exchange_reserves'
# Read the HTML from the Wikipedia page
wikiforreserve <- read_html(url)
class(wikiforreserve)
[1] "xml_document" "xml_node"    
Show the code
## Find the table's XPath with the Inspect Element feature in Safari,
## Chrome, or Firefox: in the Inspect tab, locate the <table class=...> tag,
## right-click it, choose Copy --> XPath, and paste the result into
## html_nodes(xpath = ...)

foreignreserve <- wikiforreserve %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/div[1]/table[1]') %>%
  html_table()
class(foreignreserve) # html_table() returns a list of tables
[1] "list"
Show the code
fores <- foreignreserve[[1]][, c(1, 2, 3, 4, 5, 6, 7, 8)] # [[ ]] extracts the single table from the list

# Assign clearer column names
names(fores) <- c("Country", "Forexreswithgold", "Date1", "Change1",
                  "Forexreswithoutgold", "Date2", "Change2", "Sources")
colnames(fores)
[1] "Country"             "Forexreswithgold"    "Date1"              
[4] "Change1"             "Forexreswithoutgold" "Date2"              
[7] "Change2"             "Sources"            
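As in Section 5.2, the date columns arrive as character strings and can be converted with `as.Date()`. A sketch on two sample values in the format the Wikipedia table uses:

```r
# Sample date strings (day, abbreviated month name, four-digit year)
dates <- c("1 Nov 2024", "31 Oct 2024")

# "%d %b %Y" matches that layout; the result has class Date
parsed <- as.Date(dates, format = "%d %b %Y")
parsed
```

On the scraped table this becomes `fores$Date1 <- as.Date(fores$Date1, format = "%d %b %Y")`, and likewise for `Date2`. Note that `%b` is locale-dependent: it expects English month abbreviations in an English or C locale.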

By following these examples and exercises, participants will gain practical experience in collecting data through APIs and web scraping using R. This session will enhance their ability to gather and prepare data for analysis in real-world scenarios.

5.4 Summary

In this chapter, we introduced the concepts of data collection through APIs and web scraping. Participants learned how to fetch data from APIs using the httr and jsonlite packages and scrape websites with rvest. By working through hands-on exercises, they practiced retrieving data from a public API and scraping a webpage to prepare a dataset for further analysis. These skills are essential for data scientists and analysts who need to collect and process data from various sources for their projects.