A Starter's Guide for Web Crawling in R

Guangzhi Shang
Florida State University

05/05/2016

POMS 2016 Sustainable Operations Mini Conference

Outline

Crawling: what and why

Facts about R

A Basic Crawler

Extensions

What is web crawling?

There are many tutorials on Google; this is the normal version.


This is the light version.


What is web crawling?

The way I tend to understand a crawler: copy-and-paste, but in batch mode.

When is crawling better than Ctrl C + Ctrl V?

Below are two movies I watched recently (imdb.com). Note the striking similarity in how the content is laid out.

(Screenshots of the two IMDb movie pages, showing the same layout.)

Why do it ourselves?

A classic outsourcing versus in-house production problem.

Don't worry, I have a theoretical lens: Transaction Cost Economics.

Three main arguments:

  • Crawling tools today have a comfortable learning curve (low setup cost for in-house production).
  • Many data problems/requirements are uncovered “on the go” (inflexibility of outsourcing).
  • Crawling can be easily done within data analysis software such as R (synergy for in-house production).

Outline

Crawling: what and why

Facts about R

A Basic Crawler

Extensions

Popularity

(Figure illustrating R's popularity.)

Console and IDE

Basic structure

  • Base R
  • Packages from CRAN
  • Packages from GitHub
  • More than 10,000 packages now
  • Most have nothing to do with running models, including the ones we will use here.

Outline

Crawling: what and why

Facts about R

A Basic Crawler

Extensions

Three key components

Assumption: All we need is copy-paste in batch mode.

Then, there are three key elements:

  • (link) URLs of the web pages that contain needed information.
  • (template) A chunk of code to load a page, locate and collect the information, and store it in a data set.
  • (loop) A simple loop to apply the “template” to all the “links” (a bare-bones skeleton follows).
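All names in this skeleton are illustrative; it only shows how the three pieces fit together.

# (link): a vector of page URLs to visit
links <- c("http://example.com/page1", "http://example.com/page2")

# (template): load one page, locate the information, return a one-row data frame
template <- function(url) {
  data.frame(url = url)  # placeholder; the real extraction code goes here
}

# (loop): apply the template to every link and stack the results
crawled <- NULL
for (l in links) crawled <- rbind(crawled, template(l))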

Links from index page

Where do the links come from? Most likely they are collected from an index page (example: the top rated 250 movies on IMDb).

Method 1: the more visual way, exporting all links on the page to a text file with a browser extension (here, linkklipper.txt).

library(dplyr)
link_klipper <- read.table("linkklipper.txt", header = FALSE)[[1]]
length(link_klipper)
[1] 250

Links from index page

Method 2: use an HTML page parser such as the rvest package.

  • Read the source page. What does it look like? Right click, then view page source.
library(rvest)
index_page <- read_html("http://www.imdb.com/chart/top?ref_=nv_ch_250_4")

Links from index page

  • Locate the URLs for the top 250 movies. Each is stored under the .titleColumn node. Then, go one step deeper and extract the (partial) link from the href attribute.
partial_link <- html_attr(html_node(html_nodes(index_page, ".titleColumn"), "a"), "href")
  • Attach the prefix to get the full link.
link_rvest <- paste0("http://www.imdb.com", partial_link)
length(link_rvest)
[1] 250

Links from index page

Chain things together in an easy-to-read way with magrittr.

Each %>% passes the result of the previous step into the first argument of the next function.

library(magrittr)
link_rvest <- 
  read_html("http://www.imdb.com/chart/top?ref_=nv_ch_250_4") %>% 
  html_nodes(".titleColumn") %>% 
  html_node("a") %>% 
  html_attr("href") %>% 
  paste0("http://www.imdb.com",.)
length(link_rvest)
[1] 250

Construct template

Time to construct a crawling template to feed into the loop that will iterate through the links.

When extracting links from the index page, we already saw nested calls like this:

html_attr(html_node(html_nodes(index_page, ".titleColumn"), "a"), "href")
  • rvest parses a downloaded web page using its HTML tags.
  • Is there an easier way to come up with the “right” tag?
  • Check out the CSS Selector Gadget Chrome add-on.

Construct template

The top 250 movies all have the same page format, so just pick one to develop the template.

Let's say we want to collect title, rating, and director from each movie.

CSS Selector Gadget + rvest make this easy.

(Screenshot of the CSS Selector Gadget in use on an IMDb movie page.)

Construct template

  • Movie title -> CSS Selector -> html tag h1
library(stringi);library(stringr)
title <- link_rvest[14] %>% read_html() %>% 
  html_node("h1") %>% html_text() 
title
[1] "Inception (2010)            "
  • Some text trimming
title %<>% stri_trim_both() %>% 
  substr(.,1,nchar(.)-8)
title
[1] "Inception"

Construct template

Similar work for rating:

rating <- 
  link_rvest[14] %>% 
  read_html() %>%
  html_node(".ratingValue") %>%
  html_text() %>%
  stri_replace_all_fixed("/10","") %>%
  stri_trim_both() %>%
  as.numeric()
rating
[1] 8.8

Construct template

  • A little more work is needed to uniquely locate the director.

  • Parse out the director text.

director <- link_rvest[14] %>% read_html() %>%
  html_node(".summary_text+ .credit_summary_item") %>%
  html_text() 
director
[1] "\n        Director:\n            \nChristopher Nolan            \n    "

Construct template

director %<>% str_split(":|,")
director <- director[[1]][2] %>% str_trim
director
[1] "Christopher Nolan"
  • Put title, rating, and director into a data set.
one_movie <- data.frame(title,rating,director)
one_movie
      title rating          director
1 Inception    8.8 Christopher Nolan

Construct template

Wrap the template into a function for a cleaner loop; a minimal sketch follows.
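Assuming the function simply bundles the title, rating, and director steps from the previous slides, it could look like this (a sketch, not the original code):

template <- function(movie_url) {
  page <- read_html(movie_url)

  # title: the h1 text, trimmed, with the year suffix dropped (as on the earlier slide)
  title <- page %>% html_node("h1") %>% html_text() %>%
    stri_trim_both() %>% substr(., 1, nchar(.) - 8)

  # rating: the ".ratingValue" text with "/10" removed
  rating <- page %>% html_node(".ratingValue") %>% html_text() %>%
    stri_replace_all_fixed("/10", "") %>% stri_trim_both() %>% as.numeric()

  # director: split the credit summary on ":" and ","; the name is the second piece
  director <- page %>% html_node(".summary_text+ .credit_summary_item") %>%
    html_text() %>% str_split(":|,") %>% `[[`(1) %>% `[`(2) %>% str_trim()

  data.frame(title, rating, director)
}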

Test it out with No. 12, Star Wars V.

template(link_rvest[12])
                                            title rating       director
1 Star Wars: Episode V - The Empire Strikes Back    8.8 Irvin Kershner

Everything works fine!

Looping

# An empty object to store data
movie_data <- NULL

# Loop through the 250 movies
for (i in 1:250) {
  movie <- template(link_rvest[i])

  # Extend the length of data set after each iteration
  movie_data %<>% rbind(movie)
}

The final product

The result is a nice data set ready for analysis!

title rating director
The Shawshank Redemption 9.3 Frank Darabont
The Godfather 9.2 Francis Ford Coppola
The Godfather: Part II 9.0 Francis Ford Coppola
The Dark Knight 9.0 Christopher Nolan
Schindler's List 8.9 Steven Spielberg
12 Angry Men 8.9 Sidney Lumet
Pulp Fiction 8.9 Quentin Tarantino
The Lord of the Rings: The Return of the King 8.9 Peter Jackson
The Good, the Bad and the Ugly 8.9 Sergio Leone
Fight Club 8.9 David Fincher

Visualize rating and title

(Two plots visualizing the ratings and titles of the top 250 movies.)

Top directors

director freq
Alfred Hitchcock 8
Christopher Nolan 7
Martin Scorsese 7
Stanley Kubrick 7
Steven Spielberg 7
Akira Kurosawa 6
Hayao Miyazaki 6
Billy Wilder 5
Quentin Tarantino 5
Sergio Leone 5
Ingmar Bergman 4
Ridley Scott 4
Charles Chaplin 3
Clint Eastwood 3
David Fincher 3
Frank Capra 3
James Cameron 3
Pete Docter 3
Peter Jackson 3
Sidney Lumet 3

Outline

Crawling: what and why

Facts about R

A Basic Crawler

Extensions

Page flipping

Back to the link gathering. What if the links are stored in multiple index pages?

Example: top rated action movies on IMDb

  • Open a few pages (usually 3 will be enough), compare addresses to see what's changing.
  • Identify moving parts
  • Construct URLs for all pages
  • Loop through them to collect links (for the movie page)

Page flipping

Basic structure of a web address:

  • Page layout defined by http://www.imdb.com/search/title?.

  • Everything after that specifies what should be filled into the page and in what format.

  • Each specific request is separated by &.

Page flipping

  • Page 1 requests: genres=action& sort=user_rating,desc& title_type=feature& num_votes=25000,& pf_rd_m=A2FGELUUNOQJNL& pf_rd_p=2406822102& pf_rd_r=1HYX8S8DAZ1MCQ1CMYQE& pf_rd_s=right-6& pf_rd_t=15506& pf_rd_i=top& ref_=chttp_gnr_1

  • Page 2 requests: genres=action& num_votes=25000,& pf_rd_i=top& pf_rd_m=A2FGELUUNOQJNL& pf_rd_p=2406822102& pf_rd_r=1HYX8S8DAZ1MCQ1CMYQE& pf_rd_s=right-6& pf_rd_t=15506& ref_=chttp_gnr_1& sort=user_rating,desc& start=51& title_type=feature

  • Hard to compare; some string parsing is useful here (see the sketch below).
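A hedged sketch of one way to do this (the query strings below are shortened for readability; not the original code):

library(magrittr); library(stringr)

# split a query string on "&", then on the first "=", into a two-column data frame
parse_query <- function(query) {
  pairs <- str_split(query, "&")[[1]] %>% str_split_fixed("=", 2)
  data.frame(LHS = pairs[, 1], RHS = pairs[, 2], stringsAsFactors = FALSE)
}

p1 <- parse_query("genres=action&sort=user_rating,desc&title_type=feature")
p2 <- parse_query("genres=action&start=51&sort=user_rating,desc&title_type=feature")

# merging on LHS lines the values up side by side and exposes the moving part (start=)
merge(p1, p2, by = "LHS", all = TRUE, suffixes = c(".P1", ".P2"))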

Page flipping

LHS RHS.P1 RHS.P2 RHS.P3
genres action action action
sort user_rating,desc user_rating,desc user_rating,desc
title_type feature feature feature
num_votes 25000, 25000, 25000,
pf_rd_m A2FGELUUNOQJNL A2FGELUUNOQJNL A2FGELUUNOQJNL
pf_rd_p 2406822102 2406822102 2406822102
pf_rd_r 1HYX8S8DAZ1MCQ1CMYQE 1HYX8S8DAZ1MCQ1CMYQE 1HYX8S8DAZ1MCQ1CMYQE
pf_rd_s right-6 right-6 right-6
pf_rd_t 15506 15506 15506
pf_rd_i top top top
ref_ chttp_gnr_1 chttp_gnr_1 chttp_gnr_1
start NA 51 101

Page flipping

  • Only start= is changing. Construct URLs for the 20 pages.
start_url <- paste0("http://www.imdb.com/search/title?genres=action&sort=user_rating,desc&title_type=feature&num_votes=25000,&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2406822102&pf_rd_r=1HYX8S8DAZ1MCQ1CMYQE&pf_rd_s=right-6&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_gnr_1&start=",
                   seq(from=1,to=951,by=50))
length(start_url)
[1] 20

Page flipping

  • Loop through the multiple index pages to collect the movie URLs.
link_rvest <- NULL
for (i in 1:20) {
  links <- read_html(start_url[i]) %>% 
    html_nodes(".title") %>% html_node("a") %>% 
    html_attr("href") %>% 
    paste0("http://www.imdb.com",.)
  link_rvest %<>% c(links)
}
length(link_rvest)
[1] 965

Get around captcha box

  • Several pages are accessed every second.
  • This can overload the website.
  • More importantly, this is obviously not human behavior.
  • It's one thing to attract attention.
  • It's another to attract flooding traffic generated by “headless” crawling.
  • Many websites will then show a captcha box.
  • Oops…

Get around captcha box

(Screenshot of a captcha page.)

  • The traffic control rule behind it is hard to know directly.
  • But some trial-and-error will tell us the “safe zone”.

Get around captcha box

  • Idea: access once every X seconds.
  • Usually it's enough to start with X=5.
movie_data <- NULL

for (i in 1:250) {
  movie <- template(link_rvest[i])
  movie_data %<>% rbind(movie)

  # After each iteration, pause for 5 sec
  Sys.sleep(5)
}
  • If the crawler works fine, reduce X and so on.

Get around captcha box

  • Sometimes, low frequency alone does not solve the problem.
  • After all, who can do “robotic” clicks every X seconds?
  • A robot can…
  • So after a while, “captcha” again.
  • In this case, add some randomness to the pause, just like a human does.
  • If the lowest workable X is 3, try something like Sys.sleep(runif(1, 3, 5)), as in the sketch below.
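For example, the same loop as before with the fixed pause swapped for a random one:

movie_data <- NULL

for (i in 1:250) {
  movie <- template(link_rvest[i])
  movie_data %<>% rbind(movie)

  # pause a random 3 to 5 seconds so the request pattern looks less robotic
  Sys.sleep(runif(1, 3, 5))
}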

Crawl a table

In the IMDb example, we have collected information scattered on a webpage.

Sometimes, we might be better off collecting a whole table.

Possibility 1: information we need is well formatted in a table.

Possibility 2: some information available on one page is not available on another page.

Crawl a table

Let's collect the recent feedback rating table.

(Screenshot of the recent feedback rating table.)

Crawl a table

An HTML table starts with the table tag.

Use the html_table function to extract the table into an R data set.

fb_data <- page_url %>% read_html %>% 
  html_node(".frp") %>% html_node("table") %>% html_table()
fb_data
  X1       X2   X3    X4    X5
1 NA Positive 1631 16283 35993
2 NA  Neutral    3    29    63
3 NA Negative    1    35    92

Crawl a table

Some simple text processing stores these feedback ratings as individual variables

fb_data %<>% `[`(3:5) %>% 
  unlist() %>% as.data.frame() %>% 
  t() %>% as.data.frame() %>% tbl_df()
names(fb_data) <- c("p1","m1","n1","p6","m6","n6","p12","m12","n12")
fb_data %>% kable()
p1 m1 n1 p6 m6 n6 p12 m12 n12
1631 3 1 16283 29 35 35993 63 92

Crawl a table

Amazon financial metrics from Mergent Online.

(Screenshot of Amazon's financial metrics on Mergent Online.)

Crawl a table

The difference in information availability between Amazon and Cyberlink is nontrivial.

Data positions also shift from page to page.

It's better to collect all tables and then do text processing, as sketched below.
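A minimal sketch of the all-tables approach, assuming page_url holds the address of the Mergent Online page:

# grab every table on the page into a list of data frames,
# then search that list for the metrics of interest
all_tables <- page_url %>% read_html() %>%
  html_nodes("table") %>% html_table(fill = TRUE)

length(all_tables)            # how many tables does the page contain?
sapply(all_tables, nrow)      # row counts help locate the right one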

JavaScript-generated content

Many “fancy-looking” pages have JavaScript-generated content. Example: the English Premier League table.

rvest still works fine as a parser, but it won't see JavaScript-generated content, which is typically loaded with a delay.

Something more sophisticated is needed to load the page. A headless browser such as PhantomJS is great for this task.

“A headless browser is a web browser without a graphical user interface. Headless browsers provide automated control of a web page in an environment similar to popular web browsers” – Wikipedia

JavaScript-generated content

  • Put the PhantomJS binary in the R working directory.
  • Write a simple JavaScript file that merely loads the page.
var url = 'http://www.somepage.com'; // this line is replaced from within R
var page = new WebPage(); var fs = require('fs');
// this opens the page and waits for a while
page.open(url, function (status) {
  just_wait();
});
// the waiting is 2500 milliseconds; adjust as needed
function just_wait() {
  setTimeout(function() {
    fs.write('myfile.html', page.content, 'w');
    phantom.exit();
  }, 2500);
}

JavaScript-generated content

  • Save the JavaScript file as “scrape.js” in the R working directory.
  • Call “scrape.js” within R. The rest is the same as before.
# change the first line of scrape.js
# replace with the URL of the to-be-scraped page
lines <- readLines("scrape.js") 
lines[1] <- paste0("var url ='", url ,"';")
writeLines(lines, "scrape.js")

## let phantomjs download the website
system("phantomjs scrape.js")
  • Loaded page is saved as “myfile.html”.

JavaScript-generated content

Then, standard web crawling with rvest.

epl_data <- 
  read_html("myfile.html") %>% 
  html_node(".leagueTable") %>% 
  html_table() %>% 
  as.data.frame() %>% 
  `[`(c(1,4:12)) %>%
  tbl_df() %>% 
  mutate(POS=as.numeric(POS)) %>% 
  arrange(POS) %>% 
  `[`(1:20,)

JavaScript-generated content

2015-2016 season (so far)

POS CLUB PTS
1 Leicester City 77
2 Tottenham Hotspur 70
3 Arsenal 67
4 Manchester City 64
5 Manchester United 60
6 West Ham United 59
7 Southampton 57
8 Liverpool 55
9 Chelsea 48
10 Stoke City 48

2014-2015 season

POS CLUB PTS
1 Chelsea 87
2 Manchester City 79
3 Arsenal 75
4 Manchester United 70
5 Tottenham Hotspur 64
6 Liverpool 62
7 Southampton 60
8 Swansea City 56
9 Stoke City 54
10 Crystal Palace 48

JavaScript-generated content

More realistic websites to scrape might be:

  • OTAs such as Expedia and Travelocity
  • Airlines
  • Hotels
  • Newspapers such as Washington Post and New York Times
  • Proprietary company-level data

Non R-based, GUI options

Google Sheets has some basic scraping features.

There are also subscription-based online services.

And locally installed, one-time-purchase software tools.

Thank you for listening!

Guangzhi Shang

Florida State University

[email protected]

Slides and code are uploaded to http://gshang.weebly.com/crawler-workshop.html