A Starter's Guide for Web Crawling in R

Guangzhi Shang
Florida State University

05/05/2016

POMS 2016 Sustainable Operations Mini Conference

Outline

Crawling: what and why

Facts about R

A Basic Crawler

Extensions

What is web crawling?

There are many tutorials on Google; this is the normal version.


This is the light version.


What is web crawling?

The way I tend to understand a crawler: copy-and-paste, but in batch mode.

When is crawling better than Ctrl C + Ctrl V?

Below are two movies I watched recently (imdb.com). Note the striking similarity in how the content is laid out.

(Screenshots of the two IMDb movie pages, showing the same layout.)

Why do it ourselves?

A classic outsourcing versus in-house production problem.

Don't worry, I have a theoretical lens: Transaction Cost Economics.

Three main arguments:

  • Crawling tools today have a comfortable learning curve (low setup cost for in-house production).
  • Many data problems/requirements are uncovered “on the go” (inflexibility of outsourcing).
  • Crawling can be easily done within data analysis software such as R (synergy for in-house production).

Outline

Crawling: what and why

Facts about R

A Basic Crawler

Extensions

Popularity

(Figure illustrating R's popularity.)

Console and IDE

Basic structure

  • Base R
  • Packages from CRAN
  • Packages from GitHub
  • More than 10,000 packages now
  • Most have nothing to do with running models, including the ones we will use here.

Outline

Crawling: what and why

Facts about R

A Basic Crawler

Extensions

Three key components

Assumption: All we need is copy-paste in batch mode.

Then, there are three key elements:

  • (link) URLs of the web pages that contain needed information.
  • (template) A chunk of code to load a page, locate and collect the information, and store it in a data set.
  • (loop) A simple loop to apply the “template” to all the “links” (a bare-bones skeleton follows).
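All names in this skeleton are illustrative; it only shows how the three pieces fit together.

# (link): a vector of page URLs to visit
links <- c("http://example.com/page1", "http://example.com/page2")

# (template): load one page, locate the information, return a one-row data frame
template <- function(url) {
  data.frame(url = url)  # placeholder; the real extraction code goes here
}

# (loop): apply the template to every link and stack the results
crawled <- NULL
for (l in links) crawled <- rbind(crawled, template(l))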

Links from index page

Where do the links come from? Most likely they are collected from an index page (example: the top rated 250 movies on IMDb).

Method 1: the more visual way, exporting all links on the page to a text file with a browser extension (here, linkklipper.txt).

library(dplyr)
link_klipper <- read.table("linkklipper.txt", header = FALSE)[[1]]
length(link_klipper)
[1] 250

Links from index page

Method 2: use an HTML page parser such as the rvest package.

  • Read the source page. What does it look like? Right click, then view page source.
library(rvest)
index_page <- read_html("http://www.imdb.com/chart/top?ref_=nv_ch_250_4")

Links from index page

  • Locate the URLs for the top 250 movies. Each is stored under the .titleColumn node. Then, go one step deeper and extract the (partial) link from the href attribute.
partial_link <- html_attr(html_node(html_nodes(index_page, ".titleColumn"), "a"), "href")
  • Attach the prefix to get the full link.
link_rvest <- paste0("http://www.imdb.com", partial_link)
length(link_rvest)
[1] 250

Links from index page

Chain things together in an easy-to-read way with magrittr.

Each %>% passes the result of the previous step into the first argument of the next function.

library(magrittr)
link_rvest <- 
  read_html("http://www.imdb.com/chart/top?ref_=nv_ch_250_4") %>% 
  html_nodes(".titleColumn") %>% 
  html_node("a") %>% 
  html_attr("href") %>% 
  paste0("http://www.imdb.com",.)
length(link_rvest)
[1] 250

Construct template

Time to construct a crawling template to feed into the loop that will iterate through the links.

When extracting links from the index page, we already saw nested calls like this:

html_attr(html_node(html_nodes(index_page, ".titleColumn"), "a"), "href")
  • rvest parses a downloaded web page using its HTML tags.
  • Is there an easier way to come up with the “right” tag?
  • Check out the CSS Selector Gadget Chrome add-on.

Construct template

The top 250 movies all have the same page format, so just pick one to develop the template.

Let's say we want to collect title, rating, and director from each movie.

CSS Selector Gadget + rvest make this easy.

(Screenshot of the CSS Selector Gadget in use on an IMDb movie page.)

Construct template

  • Movie title -> CSS Selector -> html tag h1
library(stringi);library(stringr)
title <- link_rvest[14] %>% read_html() %>% 
  html_node("h1") %>% html_text() 
title
[1] "Inception (2010)            "
  • Some text trimming
title %<>% stri_trim_both() %>% 
  substr(.,1,nchar(.)-8)
title
[1] "Inception"

Construct template

Similar work for rating:

rating <- 
  link_rvest[14] %>% 
  read_html() %>%
  html_node(".ratingValue") %>%
  html_text() %>%
  stri_replace_all_fixed("/10","") %>%
  stri_trim_both() %>%
  as.numeric()
rating
[1] 8.8

Construct template

  • A little more work is needed to uniquely locate the director.

  • Parse out the director text.

director <- link_rvest[14] %>% read_html() %>%
  html_node(".summary_text+ .credit_summary_item") %>%
  html_text() 
director
[1] "\n        Director:\n            \nChristopher Nolan            \n    "

Construct template

director %<>% str_split(":|,")
director <- director[[1]][2] %>% str_trim
director
[1] "Christopher Nolan"
  • Put title, rating, and director into a data set.
one_movie <- data.frame(title,rating,director)
one_movie
      title rating          director
1 Inception    8.8 Christopher Nolan

Construct template

Wrap the template into a function for a cleaner loop; a minimal sketch follows.
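Assuming the function simply bundles the title, rating, and director steps from the previous slides, it could look like this (a sketch, not the original code):

template <- function(movie_url) {
  page <- read_html(movie_url)

  # title: the h1 text, trimmed, with the year suffix dropped (as on the earlier slide)
  title <- page %>% html_node("h1") %>% html_text() %>%
    stri_trim_both() %>% substr(., 1, nchar(.) - 8)

  # rating: the ".ratingValue" text with "/10" removed
  rating <- page %>% html_node(".ratingValue") %>% html_text() %>%
    stri_replace_all_fixed("/10", "") %>% stri_trim_both() %>% as.numeric()

  # director: split the credit summary on ":" and ","; the name is the second piece
  director <- page %>% html_node(".summary_text+ .credit_summary_item") %>%
    html_text() %>% str_split(":|,") %>% `[[`(1) %>% `[`(2) %>% str_trim()

  data.frame(title, rating, director)
}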

Test it out with No. 12, Star Wars V.

template(link_rvest[12])
                                            title rating       director
1 Star Wars: Episode V - The Empire Strikes Back    8.8 Irvin Kershner

Everything works fine!

Looping

# An empty object to store data
movie_data <- NULL

# Loop through the 250 movies
for (i in 1:250) {
  movie <- template(link_rvest[i])

  # Extend the length of data set after each iteration
  movie_data %<>% rbind(movie)
}

The final product

The result is a nice data set ready for analysis!

title rating director
The Shawshank Redemption 9.3 Frank Darabont
The Godfather 9.2 Francis Ford Coppola
The Godfather: Part II 9.0 Francis Ford Coppola
The Dark Knight 9.0 Christopher Nolan
Schindler's List 8.9 Steven Spielberg
12 Angry Men 8.9 Sidney Lumet
Pulp Fiction 8.9 Quentin Tarantino
The Lord of the Rings: The Return of the King 8.9 Peter Jackson
The Good, the Bad and the Ugly 8.9 Sergio Leone
Fight Club 8.9 David Fincher

Visualize rating and title

(Two plots visualizing the ratings and titles of the top 250 movies.)

Top directors

director freq
Alfred Hitchcock 8
Christopher Nolan 7
Martin Scorsese 7
Stanley Kubrick 7
Steven Spielberg 7
Akira Kurosawa 6
Hayao Miyazaki 6
Billy Wilder 5
Quentin Tarantino 5
Sergio Leone 5
Ingmar Bergman 4
Ridley Scott 4
Charles Chaplin 3
Clint Eastwood 3
David Fincher 3
Frank Capra 3
James Cameron 3
Pete Docter 3
Peter Jackson 3
Sidney Lumet 3

Outline

Crawling: what and why

Facts about R

A Basic Crawler

Extensions

Page flipping

Back to the link gathering. What if the links are stored in multiple index pages?

Example: top rated action movies on IMDb

  • Open a few pages (usually 3 will be enough), compare addresses to see what's changing.
  • Identify moving parts
  • Construct URLs for all pages
  • Loop through them to collect links (for the movie page)

Page flipping

Basic structure of a web address:

  • Page layout defined by http://www.imdb.com/search/title?.

  • Everything after that specifies what should be filled into the page and in what format.

  • Each specific request is separated by &.

Page flipping

  • Page 1 requests: genres=action& sort=user_rating,desc& title_type=feature& num_votes=25000,& pf_rd_m=A2FGELUUNOQJNL& pf_rd_p=2406822102& pf_rd_r=1HYX8S8DAZ1MCQ1CMYQE& pf_rd_s=right-6& pf_rd_t=15506& pf_rd_i=top& ref_=chttp_gnr_1

  • Page 2 requests: genres=action& num_votes=25000,& pf_rd_i=top& pf_rd_m=A2FGELUUNOQJNL& pf_rd_p=2406822102& pf_rd_r=1HYX8S8DAZ1MCQ1CMYQE& pf_rd_s=right-6& pf_rd_t=15506& ref_=chttp_gnr_1& sort=user_rating,desc& start=51& title_type=feature

  • Hard to compare; some string parsing is useful here (see the sketch below).
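A hedged sketch of one way to do this (the query strings below are shortened for readability; not the original code):

library(magrittr); library(stringr)

# split a query string on "&", then on the first "=", into a two-column data frame
parse_query <- function(query) {
  pairs <- str_split(query, "&")[[1]] %>% str_split_fixed("=", 2)
  data.frame(LHS = pairs[, 1], RHS = pairs[, 2], stringsAsFactors = FALSE)
}

p1 <- parse_query("genres=action&sort=user_rating,desc&title_type=feature")
p2 <- parse_query("genres=action&start=51&sort=user_rating,desc&title_type=feature")

# merging on LHS lines the values up side by side and exposes the moving part (start=)
merge(p1, p2, by = "LHS", all = TRUE, suffixes = c(".P1", ".P2"))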

Page flipping

LHS RHS.P1 RHS.P2 RHS.P3
genres action action action
sort user_rating,desc user_rating,desc user_rating,desc
title_type feature feature feature
num_votes 25000, 25000, 25000,
pf_rd_m A2FGELUUNOQJNL A2FGELUUNOQJNL A2FGELUUNOQJNL
pf_rd_p 2406822102 2406822102 2406822102
pf_rd_r 1HYX8S8DAZ1MCQ1CMYQE 1HYX8S8DAZ1MCQ1CMYQE 1HYX8S8DAZ1MCQ1CMYQE
pf_rd_s right-6 right-6 right-6
pf_rd_t 15506 15506 15506
pf_rd_i top top top
ref_ chttp_gnr_1 chttp_gnr_1 chttp_gnr_1
start NA 51 101

Page flipping

  • Only start= is changing. Construct URLs for the 20 pages.
start_url <- paste0("http://www.imdb.com/search/title?genres=action&sort=user_rating,desc&title_type=feature&num_votes=25000,&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2406822102&pf_rd_r=1HYX8S8DAZ1MCQ1CMYQE&pf_rd_s=right-6&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_gnr_1&start=",
                   seq(from=1,to=951,by=50))
length(start_url)
[1] 20

Page flipping

  • Loop through the multiple index pages to collect the movie URLs.
link_rvest <- NULL
for (i in 1:20) {
  links <- read_html(start_url[i]) %>% 
    html_nodes(".title") %>% html_node("a") %>% 
    html_attr("href") %>% 
    paste0("http://www.imdb.com",.)
  link_rvest %<>% c(links)
}
length(link_rvest)
[1] 965

Get around captcha box

  • Several pages are accessed every second.
  • This can overload the website.
  • More importantly, this is obviously not human behavior.
  • It's one thing to attract attention.
  • It's another to attract flooding traffic generated by “headless” crawling.
  • Many websites will then show a captcha box.
  • Oops…

Get around captcha box

(Screenshot of a captcha page.)

  • The traffic control rule behind it is hard to know directly.
  • But some trial-and-error will tell us the “safe zone”.

Get around captcha box

  • Idea: access once every X seconds.
  • Usually it's enough to start with X=5.
movie_data <- NULL

for (i in 1:250) {
  movie <- template(link_rvest[i])
  movie_data %<>% rbind(movie)

  # After each iteration, pause for 5 sec
  Sys.sleep(5)
}
  • If the crawler works fine, reduce X and so on.

Get around captcha box

  • Sometimes, low frequency alone does not solve the problem.
  • After all, who can do “robotic” clicks every X seconds?
  • A robot can…
  • So after a while, “captcha” again.
  • In this case, add some randomness to the pause, just like a human does.
  • If the lowest workable X is 3, try something like Sys.sleep(runif(1, 3, 5)), as in the sketch below.
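For example, the same loop as before with the fixed pause swapped for a random one:

movie_data <- NULL

for (i in 1:250) {
  movie <- template(link_rvest[i])
  movie_data %<>% rbind(movie)

  # pause a random 3 to 5 seconds so the request pattern looks less robotic
  Sys.sleep(runif(1, 3, 5))
}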

Crawl a table

In the IMDb example, we have collected information scattered on a webpage.

Sometimes, we might be better off collecting a whole table.

Possibility 1: information we need is well formatted in a table.

Possibility 2: some information available on one page is not available on another page.

Crawl a table

Let's collect the recent feedback rating table.

(Screenshot of the recent feedback rating table.)

Crawl a table

An HTML table starts with the table tag.

Use the html_table function to extract the table into an R data set.

fb_data <- page_url %>% read_html %>% 
  html_node(".frp") %>% html_node("table") %>% html_table()
fb_data
  X1       X2   X3    X4    X5
1 NA Positive 1631 16283 35993
2 NA  Neutral    3    29    63
3 NA Negative    1    35    92

Crawl a table

Some simple text processing stores these feedback ratings as individual variables

fb_data %<>% `[`(3:5) %>% 
  unlist() %>% as.data.frame() %>% 
  t() %>% as.data.frame() %>% tbl_df()
names(fb_data) <- c("p1","m1","n1","p6","m6","n6","p12","m12","n12")
fb_data %>% kable()
p1 m1 n1 p6 m6 n6 p12 m12 n12
1631 3 1 16283 29 35 35993 63 92

Crawl a table

Amazon financial metrics from Mergent Online.

(Screenshot of Amazon's financial metrics on Mergent Online.)

Crawl a table

The difference in information availability between Amazon and Cyberlink is nontrivial.

Data positions also shift from page to page.

It's better to collect all tables and then do text processing, as sketched below.
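A minimal sketch of the all-tables approach, assuming page_url holds the address of the Mergent Online page:

# grab every table on the page into a list of data frames,
# then search that list for the metrics of interest
all_tables <- page_url %>% read_html() %>%
  html_nodes("table") %>% html_table(fill = TRUE)

length(all_tables)            # how many tables does the page contain?
sapply(all_tables, nrow)      # row counts help locate the right one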

JavaScript-generated content

Many “fancy-looking” pages have JavaScript-generated content. Example: the English Premier League table.

rvest still works fine as a parser, but it won't see JavaScript-generated content, which is typically loaded with a delay.

Something more sophisticated is needed to load the page. A headless browser such as PhantomJS is great for this task.

“A headless browser is a web browser without a graphical user interface. Headless browsers provide automated control of a web page in an environment similar to popular web browsers” – Wikipedia

JavaScript-generated content

  • Put the PhantomJS binary in the R working directory.
  • Write a simple JavaScript file that merely loads the page.
var url = 'http://www.somepage.com'; // this line is replaced from within R
var page = new WebPage(); var fs = require('fs');
// this opens the page and waits for a while
page.open(url, function (status) {
  just_wait();
});
// the waiting is 2500 milliseconds; adjust as needed
function just_wait() {
  setTimeout(function() {
    fs.write('myfile.html', page.content, 'w');
    phantom.exit();
  }, 2500);
}

JavaScript-generated content

  • Save the JavaScript file as “scrape.js” in the R working directory.
  • Call “scrape.js” within R. The rest is the same as before.
# change the first line of scrape.js
# replace with the URL of the to-be-scraped page
lines <- readLines("scrape.js") 
lines[1] <- paste0("var url ='", url ,"';")
writeLines(lines, "scrape.js")

## let phantomjs download the website
system("phantomjs scrape.js")
  • Loaded page is saved as “myfile.html”.

JavaScript-generated content

Then, standard web crawling with rvest.

epl_data <- 
  read_html("myfile.html") %>% 
  html_node(".leagueTable") %>% 
  html_table() %>% 
  as.data.frame() %>% 
  `[`(c(1,4:12)) %>%
  tbl_df() %>% 
  mutate(POS=as.numeric(POS)) %>% 
  arrange(POS) %>% 
  `[`(1:20,)

JavaScript-generated content

2015-2016 season (so far)

POS CLUB PTS
1 Leicester City 77
2 Tottenham Hotspur 70
3 Arsenal 67
4 Manchester City 64
5 Manchester United 60
6 West Ham United 59
7 Southampton 57
8 Liverpool 55
9 Chelsea 48
10 Stoke City 48

2014-2015 season

POS CLUB PTS
1 Chelsea 87
2 Manchester City 79
3 Arsenal 75
4 Manchester United 70
5 Tottenham Hotspur 64
6 Liverpool 62
7 Southampton 60
8 Swansea City 56
9 Stoke City 54
10 Crystal Palace 48

JavaScript-generated content

More realistic websites to scrape might be:

  • OTAs such as Expedia and Travelocity
  • Airlines
  • Hotels
  • Newspapers such as Washington Post and New York Times
  • Proprietary company-level data

Non R-based, GUI options

Google Sheets has some basic scraping features.

There are also subscription-based online services.

And locally installed, one-time-purchase software tools.

Thank you for listening!

Guangzhi Shang

Florida State University

[email protected]

Slides and code are uploaded to http://gshang.weebly.com/crawler-workshop.html