This notebook demonstrates how to collect Twitter data using API methods (the academictwitteR and rtweet packages) and non-API methods (the Python-based twint and tweeds tools).
Twitter provides an API for Academic Research developers. Read Preparing for the application of Academic Research access and apply for access. Once approved, Twitter provides API keys and tokens that allow downloading up to 10 million tweets a month.
Sample setup:
The API credentials consist of four keys/secrets/tokens. Open an R session and store them:
## Required package: rtweet
# Create token for direct authentication to access Twitter data
require(rtweet)
token <- rtweet::create_token(
app = "Your App name",
  consumer_key = "YOURCONSUMERKEY",
  consumer_secret = "YOURCONSUMERSECRET",
  access_token = "YOURACCESSTOKEN",
  access_secret = "YOURACCESSSECRET")
## Check token
rtweet::get_token()
With API access, there are plenty of R packages for collecting Twitter data, including twitteR, vosonSML and rtweet. The following illustration uses rtweet, which returns the most detailed set of tweet variables (almost 90).
## Install packages needed for Twitter data download
##install.packages(c("rtweet","igraph","tidyverse","ggraph","data.table"), repos = "https://cran.r-project.org")
## Load packages
library(rtweet)
library(igraph)
library(tidyverse)
library(ggraph)
library(data.table)
## search for 100 tweets in English
# Not run:
blm <- rtweet::search_tweets(q = "Black lives matter", n = 100, lang = "en")
# End(Not run)
## preview users data
users_data(blm)
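To see what the (almost 90) tweet variables look like, the result can be inspected before any analysis; a minimal sketch, assuming the blm search above returned a non-empty data frame:
## Preview the tweet variables returned by search_tweets()
dplyr::glimpse(blm)   # column names, types and first values
ncol(blm)             # number of variables returned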
## Boolean search for a large quantity of tweets (which could take a while)
# Not run:
blm1 <- rtweet::search_tweets("blacklivesmatter OR Blacklivesmatter", n = 100,
retryonratelimit = TRUE)
# End(Not run)
## plot time series of tweets frequency
ts_plot(blm1, by = "mins") + theme_bw()
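Because a large Boolean search can take a while, it is worth caching the result so the collection does not need to be repeated; a minimal sketch using base R (the file name is illustrative):
## Save the collected tweets for reuse in later sessions (file name is illustrative)
saveRDS(blm1, "blm1.rds")
# blm1 <- readRDS("blm1.rds")   # reload without re-running the search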
To explore the network structure of the Twitter data, the igraph and ggraph packages are recommended for network plots.
## Create igraph object from Twitter data using screen names and mentioned screen names.
## ggraph draws the network graph in one of several layouts (here, "kk").
# Not run:
filter(blm1, retweet_count > 0 ) %>%
select(screen_name, mentions_screen_name) %>%
unnest(mentions_screen_name) %>%
filter(!is.na(mentions_screen_name)) %>%
graph_from_data_frame() -> blm_g
V(blm_g)$node_label <- unname(ifelse(degree(blm_g)[V(blm_g)] > 20, names(V(blm_g)), ""))
V(blm_g)$node_size <- unname(ifelse(degree(blm_g)[V(blm_g)] > 20, degree(blm_g), 0))
ggraph(blm_g, layout = 'kk') +
geom_edge_arc(edge_width=0.1, aes(alpha=..index..)) +
geom_node_label(aes(label=node_label, size=node_size),
label.size=0, fill="#ffffff66", segment.colour="light blue",
color="red", repel=TRUE, family="Apple Garamond") +
coord_fixed() +
scale_size_area(trans="sqrt") +
labs(title="Title", subtitle="Edges=volume of retweets. Screenname size=influence") +
theme_graph(base_family="Apple Garamond") +
theme(legend.position="none")
# End(Not run)
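To keep the network plot, ggplot2's ggsave() writes the most recently rendered plot to disk; a minimal sketch (file name and dimensions are assumptions):
## Save the last rendered ggraph plot (file name and size are illustrative)
ggsave("blm_network.png", width = 10, height = 8, dpi = 300)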
academictwitteR is one of the newest packages for downloading substantial amounts of Twitter data. The authors, Christopher Barrie and Justin Chun-ting Ho, provide a detailed description page with instructions for setting up the package and the bearer token in RStudio.
Sample program for downloading tweet data:
# install.packages("academictwitteR")
# installation instruction: https://github.com/cjbarrie/academictwitteR
# Setup
require(academictwitteR)
# set_bearer()
get_bearer() # Check bearer token
# Keyword search
# Avoid querying a long time period at once; Twitter limits how much data a search can return
# Recommended: start with three to six months
get_all_tweets(
query = "COVID",
start_tweets = "2022-01-01T00:00:00Z",
end_tweets = "2022-03-31T00:00:00Z",
data_path = "data/", # save json data by page in this directory, bind data afterward
file = "covidtweets", # save rds format, could be skipped if file size becomes too large
n = 100000
)
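After the download finishes, the JSON pages saved under data_path can be combined into one data frame with bind_tweets(), as the comment above notes; a minimal sketch, assuming the data/ directory from the call above:
## Bind the stored json pages into a single data frame ("tidy" format)
covidtweets <- bind_tweets(data_path = "data/", output_format = "tidy")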
## Get user data
# Step 1. Get author id by username
get_user_id("JoeBiden", get_bearer())
# Step 2. Get tweet data by author id
get_user_timeline("939091",
start_tweets = "2021-01-01T00:00:00Z",
end_tweets = "2021-12-31T00:00:00Z",
bearer_token = get_bearer(),
data_path = "biden21/",
file ="twt_biden21",
n = 10000)
# Step 3. Bind json data into data frame
bidentwt21 = bind_tweets(data_path = "biden21/", user = TRUE, output_format = "tidy")
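A quick check that the bind produced data; a minimal sketch, with the column structure depending on the "tidy" output format:
## Inspect the bound timeline data
dplyr::glimpse(bidentwt21)   # variables in the tidy output
nrow(bidentwt21)             # number of tweets collected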
The Twitter API is not without limits. These limits change over time; the standard search endpoint currently returns only about the past week of tweets, and some packages reach even shorter periods because of data-size caps. Other methods have been developed to collect historical Twitter data; the twint and tweeds packages are illustrated here. These non-API methods scrape Twitter search results by parsing the result page with a scroll loader and then calling a JSON provider. While in theory they can search back to the oldest tweets, the variables returned are limited to what appears in the search-result layout.
Illustration using twint
Prerequisites: Python 3.x (e.g. Anaconda3). Run the following preparation steps (create a virtual environment, then install the twint package with pip):
python3 -m venv env
source ./env/bin/activate
pip3 install --user --upgrade git+https://github.com/twintproject/twint.git@origin/master#egg=twint
Alternatively,
pipenv install git+https://github.com/twintproject/twint.git#egg=twint
twint can be run either as a Python module or from the command line. The command-line method is recommended because it makes batch processing easier (data collection can be time-consuming!).
Examples:
## Keyword search
twint -s "black lives matter" --since 2020-01-01 --until 2021-07-18 --output blm.csv
## username (screen name) search with time period and size limit
twint -u "Blklivesmatter" --since 2020-01-01 --until 2021-07-18 --limit 20000 --csv --output blmaccount.csv
Illustration using tweeds
Prerequisites are the same as for twint, except Python 3.9 or higher is required.
Installation:
pip install tweeds
Examples:
## Keyword search
tweeds -s "build back better" --since 2022-01-01 --until 2023-02-22 --csv bbb.csv
## username search (all tweets)
tweeds -u "JoeBiden" --csv jb_2023.csv
(updated 22 February, 2023)