This notebook demonstrates how to collect Twitter data using API methods (academictwitteR, rtweet) and non-API methods (the Python-based twint and tweeds).

I. Using API Methods to collect Twitter data

Twitter provides an API for Academic Research developers. Read Preparing for the application of Academic Research access and apply for access. Once approved, Twitter will provide API keys and tokens for data access of up to 10 million tweets per month.

  1. Log in to Twitter’s developer page using your Twitter account (create one if you do not have one)
  2. Click on Apply for access to apply for the academic research developer account.
  3. If you only need Essential (basic) access, click on developer portal and create a project. Here are some suggestions for the project description (WARNING: DO NOT COPY THE FOLLOWING TEXT AND PASTE IT VERBATIM. TWITTER WILL VERY LIKELY REJECT THE APPLICATION!)

Sample description:

  1. Using the API to conduct public opinion research.
  2. Analyze tweet contents, trends and transactional data in social networks.
  3. The focus will be on tweeting, favorites/likes, following and retweeting.
  4. Aggregate data will be presented to the public and the reviewing agency, targeting publications in academic journals and presentations at academic conferences.

Once approved, Twitter will provide the API credentials as four keys/secrets/tokens.

1. rtweet

Open an R session and store the API data:

## Required package: rtweet
# Create token for direct authentication to access Twitter data

require(rtweet)
token <- rtweet::create_token(
  app = "Your App name",
  consumer_key = "YOURCONSUMERKEY",
  consumer_secret = "YOURCONSUMERSECRET",
  access_token = "YOURACCESSTOKEN",
  access_secret = "YOURACCESSSECRET")

## Check token

rtweet::get_token()
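
To confirm that the stored credentials actually work, the current rate limits can also be queried with the token. A minimal sketch using rtweet's rate_limit():

## Optional sanity check: list the API rate limits available under the stored token
rtweet::rate_limit()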

With API access, there are plenty of R packages for collecting Twitter data. Examples include twitteR, vosonSML and rtweet. The following illustration uses rtweet, which returns the most detailed set of tweet variables (almost 90).

## Install packages needed for Twitter data download

##install.packages(c("rtweet","igraph","tidyverse","ggraph","data.table"), repos = "https://cran.r-project.org")

## Load packages

library(rtweet)
library(igraph)
library(tidyverse)
library(ggraph)
library(data.table)
## search for 100 tweets in English
# Not run: 
blm <- rtweet::search_tweets(q = "Black lives matter", n = 100, lang = "en")
# End(Not run)

## preview users data
users_data(blm)
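
The collected tweets can be saved to disk for later sessions before moving on. A brief sketch with placeholder file names, using base R's saveRDS() and rtweet's write_as_csv():

## Save the data for reuse (file names are placeholders)
saveRDS(blm, "blm_tweets.rds")               # full object, preserves list-columns
rtweet::write_as_csv(blm, "blm_tweets.csv")  # flattened CSV export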

## Boolean search for a large quantity of tweets (which could take a while)
# Not run: 
blm1 <- rtweet::search_tweets("blacklivesmatter OR Blacklivesmatter", n = 100,
  retryonratelimit = TRUE)
# End(Not run)

## plot time series of tweet frequency
ts_plot(blm1, by = "mins") + theme_bw()
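
Before turning to the network, a quick frequency summary helps identify the most active accounts in the sample. A short sketch using dplyr's count() on the screen_name column returned by rtweet:

## Ten most active accounts in the sample
blm1 %>% count(screen_name, sort = TRUE) %>% head(10)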

To explore the network structure of the Twitter data, the igraph and ggraph packages are recommended for drawing network plots.

## Create an igraph object from the Twitter data using user screen names and mentioned screen names.
## ggraph draws the network graph in one of several layouts (12 are available). 
# Not run:
filter(blm1, retweet_count > 0 ) %>% 
  select(screen_name, mentions_screen_name) %>%
  unnest(mentions_screen_name) %>% 
  filter(!is.na(mentions_screen_name)) %>% 
  graph_from_data_frame() -> blm_g
V(blm_g)$node_label <- unname(ifelse(degree(blm_g)[V(blm_g)] > 20, names(V(blm_g)), "")) 
V(blm_g)$node_size <- unname(ifelse(degree(blm_g)[V(blm_g)] > 20, degree(blm_g), 0)) 
ggraph(blm_g, layout = 'kk') + 
  geom_edge_arc(edge_width=0.1, aes(alpha=after_stat(index))) +
  geom_node_label(aes(label=node_label, size=node_size),
                  label.size=0, fill="#ffffff66", segment.colour="lightblue",
                  color="red", repel=TRUE, family="Apple Garamond") +
  coord_fixed() +
  scale_size_area(trans="sqrt") +
  labs(title="Title", subtitle="Edges=volume of retweets. Screenname size=influence") +
  theme_graph(base_family="Apple Garamond") +
  theme(legend.position="none") 
# End(Not run)
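
The finished graph can be exported with ggplot2's ggsave(), which saves the last plot displayed; the file name and dimensions below are placeholders:

## Save the network plot to disk
ggsave("blm_network.png", width = 10, height = 8, dpi = 300)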

2. academictwitteR

The academictwitteR package is one of the newest packages for downloading substantial amounts of Twitter data. The authors, Christopher Barrie and Justin Chun-ting Ho, provide a detailed description page with instructions for setting up the package and the bearer token in RStudio.

Sample program for downloading tweet data:

# install.packages("academictwitteR")
# installation instruction: https://github.com/cjbarrie/academictwitteR

# Setup
require(academictwitteR)
# set_bearer()
get_bearer() # Check bearer token

# Keyword search
# Avoid requesting too long a period; Twitter limits searches with large data returns
# Recommended: start with three to six months 
get_all_tweets(
    query = "COVID",
    start_tweets = "2022-01-01T00:00:00Z",
    end_tweets = "2022-03-31T00:00:00Z",
    data_path = "data/",  # save json data by page in this directory, bind data afterward
    file = "covidtweets", # save rds format, could be skipped if file size becomes too large
    n = 100000
  )
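
As noted in the data_path comment above, the page-by-page json files can be combined into a single data frame afterward with bind_tweets(). A minimal sketch assuming the data/ directory used in the call above:

# Bind the stored json pages into one data frame ("tidy" returns one row per tweet)
covidtweets <- bind_tweets(data_path = "data/", output_format = "tidy")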


## Get user data
# Step 1. Get author id by username (returns "939091" for @JoeBiden)
biden_id <- get_user_id("JoeBiden", get_bearer())

# Step 2. Get tweet data by author id
get_user_timeline(biden_id,
                  start_tweets = "2021-01-01T00:00:00Z", 
                  end_tweets = "2021-12-31T00:00:00Z",
                  bearer_token = get_bearer(),
                  data_path = "biden21/",
                  file ="twt_biden21",
                  n = 10000)

# Step 3. Bind json data into a data frame
bidentwt21 <- bind_tweets(data_path = "biden21/", user = TRUE, output_format = "tidy")
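
Once bound, the tidy data frame can be summarized directly, for example as tweet counts per month. A sketch assuming the created_at timestamp column included in the tidy output:

# Count tweets per month from the ISO timestamp (first 7 characters = "YYYY-MM")
require(dplyr)
bidentwt21 %>%
  mutate(month = substr(created_at, 1, 7)) %>%
  count(month)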

II. Using non-API Methods to collect Twitter data

The Twitter API is not without limits. These limits vary over time; the standard search currently reaches back only about one week, and some packages return even shorter periods because of data size. Other methods have been developed to collect historical Twitter data. The twint and tweeds packages are illustrated here. These non-API methods scrape Twitter data from the search results page by parsing it with a scroll loader and then calling a JSON provider. While in theory they can search all the way back to the oldest tweets, the variables collected are limited to what the search-result layout exposes.

1. twint

Prerequisites:

  1. Python 3.8 or newer
  2. Bash/terminal command line tool
  3. Python pip package installer

Illustration using twint: install Python 3.x (e.g. Anaconda3) and run the following preparation steps (create a virtual environment, then install the twint package using pip):

python3 -m venv env
source ./env/bin/activate
pip3 install --upgrade git+https://github.com/twintproject/twint.git@origin/master#egg=twint

Alternatively,

pipenv install git+https://github.com/twintproject/twint.git#egg=twint

twint can be run either as a Python module or from the command line. The command-line method is recommended for easier batch processing of the data collection (which can be time-consuming!)

Examples:

## Keyword search
twint -s "black lives matter" --since 2020-01-01 --until 2021-07-18 --output blm.csv --csv

## username search with time period and size limit (-u takes the account handle, with no spaces)
twint -u "USERNAME" --since 2020-01-01 --until 2021-07-18 --limit 20000 --output blmaccount.csv --csv

2. tweeds

Prerequisites are the same as for twint, except that Python 3.9 or higher is required.

Installation:

pip install tweeds

Examples:

## Keyword search
tweeds -s "build back better" --since 2022-01-01 --until 2023-02-22 --csv bbb.csv

## username search (all tweets)
tweeds -u "JoeBiden" --csv jb_2023.csv

(updated 22 February, 2023)