A Little R Toolbox Workshop

Guangzhi Shang
Florida State University

04/07/2017

University of Maryland

About Me


Popularity


Console and IDE

Basic structure

  • Base R
  • Packages from CRAN
  • Packages from GitHub
  • More than 10,000 packages now
  • Most have nothing to do with running models, including the ones we will use today.

Top Downloads Jan-May 2015


  • Most of these have nothing to do with running models.
  • Step back and think: how much time do you really spend running models?
  • My case: 50% (looking for data) + 20% (organizing data) + 10% (running models) + 20% (presenting results: tables, figures, and reports)

Let's Talk about the 90% Today

Research design and data collection phase

  1. shiny + rdrop2 for data collection (or use shiny to illustrate results)
  2. Web crawling within R using rvest
  3. Matching string-based spatial data using geocoding

Data analysis phase (exploratory and post hoc)

  1. Simple and more advanced use of R graphics with ggplot2
  2. Batch regression tabulation using broom and stargazer

Crawling Outline

What and Why

A Basic Crawler

Extensions

What is web crawling?

There are many tutorials on Google; this is the normal version.


This is the light version.


What is web crawling?

The way I tend to understand a crawler: copy-and-paste (Ctrl+C / Ctrl+V) in batch mode.

When is crawling better than Ctrl+C / Ctrl+V?

Below are two movies I watched recently (imdb.com). Note the striking similarity in how the content is laid out.


Why do it ourselves?

A classic outsourcing versus in-house production problem.

Don't worry, I have a theoretical lens: Transaction Cost Economics.

Three main arguments:

  • Crawling tools today have a comfortable learning curve (low setup cost for in-house production).
  • Many data problems/requirements are uncovered “on the go” (inflexibility of outsourcing).
  • Crawling can easily be done within data analysis software such as R (synergy for in-house production).

Crawling Outline

What and Why

A Basic Crawler

Extensions

Three key components

Assumption: All we need is copy-paste in batch mode.

Then, there are three key elements:

  • (link) URLs of the web pages that contain the needed information.
  • (template) A block of code to load a page, locate and collect the information, and store it in a data set.
  • (loop) A simple loop that applies the “template” to all the “links”.

Links from index page

Where are the links coming from? Most likely, they are collected from an index page (example: the top rated 250 movies on IMDb).

Method 1: the more visual way. Export the links with a browser extension (here, Link Klipper), then read them into R.

library(dplyr)
link_klipper <- read.table("linkklipper.txt", header = FALSE)[[1]]
length(link_klipper)
[1] 250

Links from index page

Method 2: use an html page parser such as rvest package.

  • Read the source page. What does it look like? Right-click, then choose View Page Source.
library(rvest)
index_page <- read_html("http://www.imdb.com/chart/top?ref_=nv_ch_250_4")

Links from index page

  • Locate the URLs for the top 250 movies. Each is stored under the CSS class titleColumn. Then, go one step deeper and extract the (partial) link from the href attribute.
partial_link <- html_attr(html_node(html_nodes(index_page, ".titleColumn"), "a"), "href")
  • Attach the prefix to get the full link.
link_rvest <- paste0("http://www.imdb.com", partial_link)
length(link_rvest)
[1] 250

Links from index page

Chain things together in an easy-to-read way with magrittr.

Each %>% pipes the result of the previous step into the first argument of the next function.

library(magrittr)
link_rvest <- 
  read_html("http://www.imdb.com/chart/top?ref_=nv_ch_250_4") %>% 
  html_nodes(".titleColumn") %>% 
  html_node("a") %>% 
  html_attr("href") %>% 
  paste0("http://www.imdb.com",.)
length(link_rvest)
[1] 250

Construct template

Time to construct a crawling template to feed into the loop that will iterate through the links.

When extracting links from the index page, we saw this:

html_attr(html_node(html_nodes(index_page, ".titleColumn"), "a"), "href")
  • rvest parses a downloaded web page using its HTML tags.
  • Is there an easier way to come up with the “right” tag?
  • Check out the CSS Selector Gadget Chrome add-on.

Construct template

The top 250 movies all share the same page format, so just pick one to develop the template.

Let's say we want to collect title, rating, and director from each movie.

CSS Selector Gadget + rvest make this easy.


Construct template

  • Movie title -> CSS Selector -> html tag h1
library(stringi);library(stringr)
title <- link_rvest[14] %>% read_html() %>% 
  html_nodes("h1")
title[[2]] %>% html_text() 
[1] "Inception (2010)            "
  • Some text trimming
title <- title[[2]] %>% html_text() %>% 
  stri_trim_both() %>% substr(.,1,nchar(.)-7)
title
[1] "Inception"

Construct template

Similar work for rating:

rating <- 
  link_rvest[14] %>% 
  read_html() %>%
  html_node(".ratingValue") %>%
  html_text() %>%
  stri_replace_all_fixed("/10","") %>%
  stri_trim_both() %>%
  as.numeric()
rating
[1] 8.8

Construct template

  • A little more work is needed to uniquely locate the director.

  • Parse out the director text.

director <- link_rvest[14] %>% read_html() %>%
  html_node(".summary_text+ .credit_summary_item") %>%
  html_text() 
director
[1] "\n        Director:\n            \nChristopher Nolan            \n    "

Construct template

director %<>% str_split(":|,")
director <- director[[1]][2] %>% str_trim
director
[1] "Christopher Nolan"
  • Put title, rating, and director into a data set.
one_movie <- data.frame(title,rating,director)
one_movie
      title rating          director
1 Inception    8.8 Christopher Nolan

Construct template

Wrap the template into a function for a cleaner loop view.
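The template() function itself isn't printed on the slides; below is a minimal sketch, assembled from the steps above (the CSS selectors are the ones found with CSS Selector Gadget earlier).

template <- function(url) {
  page <- read_html(url)

  # Title: the second h1 node; trim, then drop the trailing " (year)" (7 characters)
  title <- page %>% html_nodes("h1") %>% .[[2]] %>% html_text() %>%
    stri_trim_both() %>% substr(., 1, nchar(.) - 7)

  # Rating: strip the "/10" suffix and convert to numeric
  rating <- page %>% html_node(".ratingValue") %>% html_text() %>%
    stri_replace_all_fixed("/10", "") %>% stri_trim_both() %>% as.numeric()

  # Director: split on ":" or "," and keep the name after "Director:"
  director <- page %>% html_node(".summary_text+ .credit_summary_item") %>%
    html_text() %>% str_split(":|,") %>% .[[1]] %>% .[2] %>% str_trim()

  data.frame(title, rating, director)
}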

Test it out with No. 7, Pulp Fiction.

template(link_rvest[7])
         title rating          director
1 Pulp Fiction    8.9 Quentin Tarantino

Everything works fine!

Looping

# An empty object to store data
movie_data <- NULL

# Loop through the 250 movies
for (i in 1:250) {
  movie <- template(link_rvest[i])

  # Extend the length of data set after each iteration
  movie_data %<>% rbind(movie)
}
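Growing movie_data with rbind() at each iteration is fine for 250 pages. For larger jobs, a common alternative (a sketch, not from the slides) is to collect the rows in a list and bind once at the end:

# Build a list of one-row data frames, then do a single rbind
movie_list <- lapply(link_rvest, template)
movie_data <- do.call(rbind, movie_list)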

The final product

The result is a nice data set ready for analysis!

title                                          rating director
The Shawshank Redemption                          9.3 Frank Darabont
The Godfather                                     9.2 Francis Ford Coppola
The Godfather: Part II                            9.0 Francis Ford Coppola
The Dark Knight                                   9.0 Christopher Nolan
12 Angry Men                                      8.9 Sidney Lumet
Schindler's List                                  8.9 Steven Spielberg
Pulp Fiction                                      8.9 Quentin Tarantino
The Lord of the Rings: The Return of the King     8.9 Peter Jackson
The Good, the Bad and the Ugly                    8.9 Sergio Leone
Fight Club                                        8.8 David Fincher

Visualize rating and title


Top directors

director             freq
Christopher Nolan       8
Steven Spielberg        8
Alfred Hitchcock        7
Martin Scorsese         7
Stanley Kubrick         7
Quentin Tarantino       6
Akira Kurosawa          5
Billy Wilder            5
Hayao Miyazaki          5
Sergio Leone            5
David Fincher           4
Francis Ford Coppola    4
Peter Jackson           4
Sidney Lumet            4
Charles Chaplin         3
Clint Eastwood          3
Frank Capra             3
Frank Darabont          3
Ingmar Bergman          3
James Cameron           3

Crawling Outline

What and Why

A Basic Crawler

Extensions

Page flipping

Back to the link gathering. What if the links are stored in multiple index pages?

Example: top rated action movies on IMDb

  • Open a few pages (usually 3 is enough), and compare the addresses to see what's changing.
  • Identify the moving parts.
  • Construct URLs for all pages.
  • Loop through them to collect the links (to the movie pages).

Page flipping

Basic structure of a web address:

  • The page layout is defined by http://www.imdb.com/search/title?.

  • Everything after that specifies what is to be filled into the page and in what format.

  • Each specific request is separated by &.

Page flipping

  • Page 1 requests: genres=action& sort=user_rating,desc& title_type=feature& num_votes=25000,& pf_rd_m=A2FGELUUNOQJNL& pf_rd_p=2406822102& pf_rd_r=1HYX8S8DAZ1MCQ1CMYQE& pf_rd_s=right-6& pf_rd_t=15506& pf_rd_i=top& ref_=chttp_gnr_1

  • Page 2 requests: genres=action& num_votes=25000,& pf_rd_i=top& pf_rd_m=A2FGELUUNOQJNL& pf_rd_p=2406822102& pf_rd_r=1HYX8S8DAZ1MCQ1CMYQE& pf_rd_s=right-6& pf_rd_t=15506& ref_=chttp_gnr_1& sort=user_rating,desc& start=51& title_type=feature

  • Hard to compare by eye; some string parsing is useful here (a sketch follows).
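One way to do that parsing; parse_query() is a hypothetical helper, and matching its output by name across pages yields the comparison table on the next slide.

# Split a URL's query string into name-value pairs
parse_query <- function(url) {
  query <- strsplit(url, "\\?")[[1]][2]        # keep the part after "?"
  pairs <- strsplit(query, "&")[[1]]           # one element per request
  kv <- do.call(rbind, strsplit(pairs, "="))   # name/value columns
  setNames(kv[, 2], kv[, 1])
}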

Page flipping

LHS        RHS.P1               RHS.P2               RHS.P3
genres     action               action               action
sort       user_rating,desc     user_rating,desc     user_rating,desc
title_type feature              feature              feature
num_votes  25000,               25000,               25000,
pf_rd_m    A2FGELUUNOQJNL       A2FGELUUNOQJNL       A2FGELUUNOQJNL
pf_rd_p    2406822102           2406822102           2406822102
pf_rd_r    1HYX8S8DAZ1MCQ1CMYQE 1HYX8S8DAZ1MCQ1CMYQE 1HYX8S8DAZ1MCQ1CMYQE
pf_rd_s    right-6              right-6              right-6
pf_rd_t    15506                15506                15506
pf_rd_i    top                  top                  top
ref_       chttp_gnr_1          chttp_gnr_1          chttp_gnr_1
start      NA                   51                   101

Page flipping

  • Only start= is changing. Construct URLs for the 22 pages.
start_url <- paste0("http://www.imdb.com/search/title?genres=action&sort=user_rating,desc&title_type=feature&num_votes=25000,&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2406822102&pf_rd_r=1HYX8S8DAZ1MCQ1CMYQE&pf_rd_s=right-6&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_gnr_1&start=",
                   seq(from=1,to=1051,by=50))
length(start_url)
[1] 22

Page flipping

  • Loop through the multiple index pages to collect the movie URLs.
link_rvest <- NULL
for (i in 1:22) {
  links <- read_html(start_url[i]) %>% 
    html_nodes(".lister-item-header a") %>%  
    html_attr("href") %>% 
    paste0("http://www.imdb.com",.)
  link_rvest %<>% c(links)
}
length(link_rvest)
[1] 1083

Get around captcha box

  • At least several pages are accessed every second.
  • That can overload the website.
  • More importantly, this is obviously not human behavior.
  • It's one thing to attract attention.
  • It's another to attract flooding traffic generated by “headless” crawling.
  • Many websites will show a captcha box.
  • Oops…

Get around captcha box


  • The traffic-control rule behind it is hard to know directly.
  • But some trial and error will tell us the “safe zone”.

Get around captcha box

  • Idea: access once every X seconds.
  • Usually it's enough to start with X=5.
movie_data <- NULL

for (i in 1:250) {
  movie <- template(link_rvest[i])
  movie_data %<>% rbind(movie)

  # After each iteration, pause for 5 sec
  Sys.sleep(5)
}
  • If the crawler runs fine, reduce X, and so on.

Get around captcha box

  • Sometimes, low frequency alone does not solve the problem.
  • After all, who can produce “robotic” clicks exactly every X seconds?
  • A robot can…
  • So after a while, “captcha” again.
  • In this case, add some randomness to the pause, just like a human would.
  • If the lowest workable X is 3, try this: Sys.sleep(runif(1,3,5))

Crawl a table

In the IMDb example, we collected pieces of information scattered across a webpage.

Sometimes, we might be better off collecting a whole table.

Possibility 1: the information we need is well formatted in a table.

Possibility 2: some information available on one page is not available on another page.

Crawl a table

Let's collect the recent feedback rating table.


Crawl a table

An HTML table starts with the tag table.

Use the html_table function to extract the table into an R data set.

# page_url holds the address of the page containing the table
fb_data <- page_url %>% read_html() %>% 
  html_node(".frp") %>% html_node("table") %>% html_table()
fb_data
  X1       X2   X3    X4    X5
1 NA Positive 1626 16725 28407
2 NA  Neutral    1    32    55
3 NA Negative    4    34    49

Crawl a table

Some simple text processing stores these feedback ratings as individual variables:

fb_data %<>% `[`(3:5) %>% 
  unlist() %>% as.data.frame() %>% 
  t() %>% as.data.frame() %>% tbl_df()
names(fb_data) <- c("p1","m1","n1","p6","m6","n6","p12","m12","n12")
fb_data %>% kable()
  p1 m1 n1    p6 m6 n6   p12 m12 n12
1626  1  4 16725 32 34 28407  55  49

Crawl a table

Amazon financial metrics from Mergent Online.


Crawl a table

The difference in information availability between Amazon and CyberLink is nontrivial.

The position of the data also shifts from page to page.

It's better to collect all tables and then do the text processing.
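A minimal sketch, reusing page_url from before: grab every table on the page, then pick out the one needed during post-processing.

all_tables <- page_url %>% read_html() %>% 
  html_nodes("table") %>%           # every table on the page
  html_table(fill = TRUE)           # fill = TRUE tolerates ragged rows
length(all_tables)                  # how many tables did we get?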

JavaScript-generated content

Many “fancy-looking” pages have JavaScript-generated content. Example: the English Premier League table.

rvest still works fine as a parser, but it won't execute the JavaScript. Such content is typically loaded with a delay.

Something more sophisticated is needed to load the page. A headless browser such as PhantomJS is great for this task.

“A headless browser is a web browser without a graphical user interface. Headless browsers provide automated control of a web page in an environment similar to popular web browsers” – Wikipedia

JavaScript-generated content

  • Put the PhantomJS binary in the R working directory.
  • Write a simple JavaScript file that merely loads the page.
var url = 'http://www.somepage.com'; // this line gets replaced from R
var page = new WebPage(); var fs = require('fs');
// open the page, then wait for a while
page.open(url, function (status) {
  just_wait();
});
// the wait is 2500 milliseconds; adjust as needed
function just_wait() {
  setTimeout(function() {
    fs.write('myfile.html', page.content, 'w');
    phantom.exit();
  }, 2500);
}

JavaScript-generated content

  • Save the JavaScript as “scrape.js” in the R working directory.
  • Call “scrape.js” from within R. The rest is the same as before.
# Change the first line of scrape.js:
# replace it with the URL of the to-be-scraped page
lines <- readLines("scrape.js") 
lines[1] <- paste0("var url ='", url, "';")
writeLines(lines, "scrape.js")

## let phantomjs download the website
system("phantomjs scrape.js")
  • The loaded page is saved as “myfile.html”.
  • Then, standard web crawling with rvest (a sketch follows).
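A minimal sketch of that last step, assuming the league table is the first table tag in the saved page:

library(rvest)

# Parse the page PhantomJS saved, exactly as with a live page
epl_table <- read_html("myfile.html") %>% 
  html_node("table") %>% 
  html_table()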

JavaScript-generated content

2015-2016 season

(epic win by Leicester)

pos club               pts
  1 Leicester City      81
  2 Arsenal             71
  3 Tottenham Hotspur   70
  4 Manchester City     66
  5 Manchester United   66
  6 Southampton         63
  7 West Ham United     62
  8 Liverpool           60
  9 Stoke City          51
 10 Chelsea             50

2016-2017 season

(so far)

pos club                  pts
  1 Chelsea                69
  2 Tottenham Hotspur      59
  3 Manchester City        57
  4 Liverpool              56
  5 Manchester United      52
  6 Arsenal                50
  7 Everton                50
  8 West Bromwich Albion   43
  9 Stoke City             36
 10 Southampton            33

JavaScript-generated content

More realistic websites to scrape might be:

  • OTAs such as Expedia and Travelocity
  • Airlines
  • Hotels
  • Newspapers such as the Washington Post and the New York Times
  • Proprietary company-level data

Non R-based, GUI options

Google Sheets has some basic scraping features.

Subscription-based online services.

Local software with a one-time price.

Geocoding Outline

What and Why

A Basic Geocoder

Extensions

Common Uses for Address Data

Within a single data set

  • Calculate distances between people, companies, etc.
  • Standardize addresses to form groups

Between multiple data sets

  • Matching observations from one data set to another: e.g., organization data and worker data (who works for which org?)

Spatial Econometrics

  • Distance between two people/companies
  • Distance moved from time A to time B: e.g., an Uber driver

So Many Problems with Addresses

Many variants of “100 N Main Ave Tallahassee FL”:

  • 100 = 101 = 102 …
  • N Main = north main = North main
  • Ave = avenue
  • , between Ave and Tallahassee?
  • with or without zip code?
  • String matching simply doesn't work well!
  • IDEA: no matter which format you give Google Maps, you always get the lovely red pin on the same spot.

Geocoding: ABC I


Geocoding: ABC II


  • Sometimes you don't have exact addresses.
  • And location names aren't standardized…
  • Using coordinates to calculate distances seems to be the only option.

Geocoding: ABC III


  • Popular geocoding APIs: Google, Nokia HERE, Yahoo, etc.
  • Notice their daily limits.
  • Here is an unlimited one.
  • There are many more APIs.

Geocoding Outline

What and Why

Geocoding Packages in R

Extensions

ggmap - Google Maps API

library(ggmap);library(RDSTK)
geocode("florida state university",source="google")
        lon      lat
1 -84.29849 30.44188
geocode("600 W College Ave, Tallahassee, FL 32306",source="google")
        lon      lat
1 -84.29067 30.44096

Geocoding Packages in R II

RDSTK - Data Science Toolkit API

street2coordinates("600 W College Ave, Tallahassee, FL 32306")[c(5,3)]
  longitude latitude
1 -84.28844 30.44073

Distance between Two Points on a Sphere


Great circle distance is what you need.
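For reference, distCosine() in the geosphere package (used below) implements the spherical law of cosines. For two points with longitudes \( \lambda_1, \lambda_2 \), latitudes \( \phi_1, \phi_2 \), and sphere radius \( R \):

\[ d = R \arccos\big( \sin\phi_1 \sin\phi_2 + \cos\phi_1 \cos\phi_2 \cos(\lambda_2 - \lambda_1) \big) \]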

Great Circle Distance in R

library(geosphere)

umd <- geocode("university+of+maryland+college+park",source="google")
fsu <- geocode("florida+state+university",source="google")

# distCosine() returns meters; 1,609.34 meters per mile
distCosine(fsu,umd)/1609.34
[1] 723.4794

Geocoding Outline

What and Why

Geocoding Packages in R

Extensions

Distance by Transportation Mode

  • One nice aspect of the Google API is that it simplifies the calculation of distance/time between A and B by travel mode.

  • Use my little wrapper function travel.dist.

travel.dist(origin='university+of+maryland+college+park',
destination='florida+state+university')
[1] 879.3297
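travel.dist is the author's own wrapper and isn't printed on the slides. A minimal sketch of what such a wrapper might look like, built on ggmap's mapdist() (which calls the Google Distance Matrix API):

library(ggmap)

# Hypothetical wrapper: road distance in miles between two places
travel.dist <- function(origin, destination, mode = "driving") {
  res <- mapdist(from = origin, to = destination, mode = mode)
  res$miles
}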

Graphics Outline

ggplot2 Examples

Evolution of Interaction Effect Plots

Fancy ggplot2 Examples I

  • Creating good-looking plots is not as easy as running regressions.
  • R stands out here with ggplot2.
  • How many variables are there?


Fancy ggplot2 Examples II


How many variables are there?

Fancy ggplot2 Examples III


gg = Grammar of Graphics I


gg = Grammar of Graphics II


Rich ggplot2 Resources

Graphics Outline

ggplot2 Examples

Evolution of Interaction Effect Plots

Interaction Effect Plots I

\[ y=\beta_0+\beta_1 x + \beta_2 f + \beta_3 x*f + \epsilon \]

  • x is a continuous variable
  • f is a two-level factor variable
  • Problem: plot the effect of x
  • What does the conventional interaction plot look like in R? (a sketch follows)
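A minimal ggplot2 sketch of the conventional version, using simulated (hypothetical) data:

library(ggplot2)

# Simulated data: continuous x, two-level factor f
set.seed(1)
df <- data.frame(x = runif(200), f = factor(rbinom(200, 1, 0.5)))
df$y <- 1 + 2*df$x + 0.5*(df$f == "1") + 1.5*df$x*(df$f == "1") + rnorm(200)

# One fitted line per level of f; non-parallel lines signal an interaction
ggplot(df, aes(x, y, color = f)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = TRUE)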

Interaction Effect Plots II

Created using effect()

Created using ggplot2

Interaction Effect Plots III

\[ y=\beta_0+\beta_1 x + \beta_2 f + \beta_3 x*f + \epsilon \]

  • Problem: plot the effect of f
  • The effect size is \( \beta_2 + \beta_3 x \)
  • This is where the confidence-band plot comes in (a sketch follows)
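A minimal sketch of such a band, computed from a fitted lm() on simulated (hypothetical) data; the standard error of \( \beta_2 + \beta_3 x \) comes from vcov():

library(ggplot2)

# Simulated data and fit
set.seed(1)
n <- 200
x <- runif(n); f <- rbinom(n, 1, 0.5)
y <- 1 + 2*x + 0.5*f + 1.5*x*f + rnorm(n)
fit <- lm(y ~ x * f)

# Effect of f at each x: beta2 + beta3*x, with a delta-method SE
b <- coef(fit); V <- vcov(fit)
xg <- seq(0, 1, length.out = 100)
eff <- b["f"] + b["x:f"] * xg
se  <- sqrt(V["f", "f"] + xg^2 * V["x:f", "x:f"] + 2 * xg * V["f", "x:f"])

band <- data.frame(xg, eff, lo = eff - 1.96 * se, hi = eff + 1.96 * se)
ggplot(band, aes(xg, eff)) +
  geom_ribbon(aes(ymin = lo, ymax = hi), alpha = 0.3) +
  geom_line()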

Interaction Effect Plots IV

Often-seen version

Perhaps more informative

Interaction Effect Plots V

Even more informative

My favorite!

Interaction Effect Plots VI

\[ y=\beta_0+\beta_1 x + \beta_2 z + \beta_3 m + \beta_4 x*z + \beta_5 x*m + \epsilon \]

  • What if there are two “moderators”?
  • Problem: plot the effect of x
  • The effect size is \( \beta_1 + \beta_4 z + \beta_5 m \)
  • One way: fix m and draw a confidence-band plot against z.
  • But that means too many plots!
  • Here's a solution with one plot (sketched below)
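A minimal sketch of the one-plot idea, with hypothetical coefficient values: color encodes the effect size over the (z, m) plane, and contour lines trace the “indifference” curves shown on the next slide.

library(ggplot2)

# Hypothetical coefficients, just for illustration
b1 <- 1; b4 <- 0.8; b5 <- -0.5

# Effect of x over the (z, m) plane: beta1 + beta4*z + beta5*m
grid <- expand.grid(z = seq(0, 1, 0.01), m = seq(0, 1, 0.01))
grid$effect <- b1 + b4 * grid$z + b5 * grid$m

ggplot(grid, aes(z, m, fill = effect)) +
  geom_tile() +                                   # color for effect size
  geom_contour(aes(z = effect), color = "white")  # indifference curves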

Interaction Effect Plots VII

Use color for effect size

Three “indifference” curves

Thank you for listening!

Guangzhi Shang

Florida State University

gshang@business.fsu.edu