Guangzhi Shang
Florida State University
05/05/2016
POMS 2016 Sustainable Operations Mini Conference
Crawling: what and why
Facts about R
A Basic Crawler
Extensions
Many tutorials can be found on Google; those are the normal version. This is the light version.
The way I tend to understand a crawler is: copy-and-paste in batch mode.
Below are two movies I watched recently (imdb.com). There is a striking similarity in how their content is laid out.
A classic outsourcing versus in-house production problem.
Don't worry, I have a theoretical lens: Transaction Cost Economics.
Three main arguments:
Assumption: All we need is copy-paste in batch mode.
Then, there are three key elements:
Where do the links come from? Most likely we collect them from an index page (example: IMDb's Top Rated 250 movies)
Method 1: the more visual way.
library(dplyr)
link_klipper <- read.table("linkklipper.txt", header = FALSE)[[1]]
length(link_klipper)
[1] 250
Method 2: use an HTML page parser such as the rvest package.
library(rvest)
index_page <- read_html("http://www.imdb.com/chart/top?ref_=nv_ch_250_4")
First, locate the title nodes with the CSS selector .titleColumn. Then, go one step deeper and extract the (partial) link from the href tag.
partial_link <- html_attr(html_node(html_nodes(index_page, ".titleColumn"), "a"), "href")
link_rvest <- paste0("http://www.imdb.com", partial_link)
length(link_rvest)
[1] 250
Chain things together in an easy-to-read way with the magrittr package.
Each %>% passes the result of the previous step as the first argument of the next function.
library(magrittr)
link_rvest <-
read_html("http://www.imdb.com/chart/top?ref_=nv_ch_250_4") %>%
html_nodes(".titleColumn") %>%
html_node("a") %>%
html_attr("href") %>%
paste0("http://www.imdb.com",.)
length(link_rvest)
[1] 250
Time to construct a crawling template to feed into the loop that will iterate through the links.
When extracting links from the index page, we have already seen this pattern:
html_attr(html_node(html_nodes(index_page, ".titleColumn"), "a"), "href")
rvest parses a downloaded web page using its HTML tags. The top 250 movies all share the same page format, so pick any one to develop the template.
Let's say we want to collect title, rating, and director from each movie.
CSS Selector Gadget + rvest make this easy. The title, for example, sits in the h1 node:
library(stringi);library(stringr)
title <- link_rvest[14] %>% read_html() %>%
html_node("h1") %>% html_text()
title
[1] "Inception (2010) "
title %<>% stri_trim_both() %>%
substr(.,1,nchar(.)-8)
title
[1] "Inception"
Similar work for rating:
rating <-
link_rvest[14] %>%
read_html() %>%
html_node(".ratingValue") %>%
html_text() %>%
stri_replace_all_fixed("/10","") %>%
stri_trim_both() %>%
as.numeric()
rating
[1] 8.8
A little more work to uniquely locate director.
Parse out the director text.
director <- link_rvest[14] %>% read_html() %>%
html_node(".summary_text+ .credit_summary_item") %>%
html_text()
director
[1] "\n Director:\n \nChristopher Nolan \n "
director %<>% str_split(":|,")
director <- director[[1]][2] %>% str_trim
director
[1] "Christopher Nolan"
one_movie <- data.frame(title,rating,director)
one_movie
title rating director
1 Inception 8.8 Christopher Nolan
Wrap the template into a function for a cleaner loop view.
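The slides do not show the function body, but a plausible sketch simply assembles the three extraction steps above (title, rating, director) into one function:

```r
# Sketch of the template() function (body not shown in the slides);
# selectors are the ones used earlier in the deck, as of the 2016 IMDb layout.
template <- function(url) {
  page <- read_html(url)
  # Title: h1 node, then strip the trailing " (YYYY)" part
  title <- page %>% html_node("h1") %>% html_text() %>%
    stri_trim_both() %>% substr(., 1, nchar(.) - 8)
  # Rating: .ratingValue node, drop "/10", convert to numeric
  rating <- page %>% html_node(".ratingValue") %>% html_text() %>%
    stri_replace_all_fixed("/10", "") %>% stri_trim_both() %>% as.numeric()
  # Director: split on ":" and ",", keep the second piece
  director <- page %>% html_node(".summary_text+ .credit_summary_item") %>%
    html_text() %>% str_split(":|,") %>% `[[`(1) %>% `[`(2) %>% str_trim()
  data.frame(title, rating, director, stringsAsFactors = FALSE)
}
```

Each page yields a one-row data frame, so the loop only needs to rbind the rows together.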
Test it out with No. 12, Star Wars V.
template(link_rvest[12])
title rating director
1 Star Wars: Episode V - The Empire Strikes Back 8.8 Irvin Kershner
Everything works fine!
# An empty object to store data
movie_data <- NULL
# Loop through the 250 movies
for (i in 1:250) {
  movie <- template(link_rvest[i])
  # Extend the data set after each iteration
  movie_data %<>% rbind(movie)
}
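In practice an occasional link fails to download; a hedged variant (not in the original slides) wraps the call in tryCatch so one bad page does not abort the whole loop:

```r
# Sketch (not from the slides): skip pages that fail to download.
movie_data <- NULL
for (i in 1:250) {
  movie <- tryCatch(template(link_rvest[i]),
                    error = function(e) NULL)  # NULL on failure
  movie_data %<>% rbind(movie)                 # rbind(x, NULL) is a no-op
}
```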
The result is a nice data set ready for analysis!
title | rating | director |
---|---|---|
The Shawshank Redemption | 9.3 | Frank Darabont |
The Godfather | 9.2 | Francis Ford Coppola |
The Godfather: Part II | 9.0 | Francis Ford Coppola |
The Dark Knight | 9.0 | Christopher Nolan |
Schindler's List | 8.9 | Steven Spielberg |
12 Angry Men | 8.9 | Sidney Lumet |
Pulp Fiction | 8.9 | Quentin Tarantino |
The Lord of the Rings: The Return of the King | 8.9 | Peter Jackson |
The Good, the Bad and the Ugly | 8.9 | Sergio Leone |
Fight Club | 8.9 | David Fincher |
Which directors appear most often in the Top 250?
director | freq |
---|---|
Alfred Hitchcock | 8 |
Christopher Nolan | 7 |
Martin Scorsese | 7 |
Stanley Kubrick | 7 |
Steven Spielberg | 7 |
Akira Kurosawa | 6 |
Hayao Miyazaki | 6 |
Billy Wilder | 5 |
Quentin Tarantino | 5 |
Sergio Leone | 5 |
director | freq |
---|---|
Ingmar Bergman | 4 |
Ridley Scott | 4 |
Charles Chaplin | 3 |
Clint Eastwood | 3 |
David Fincher | 3 |
Frank Capra | 3 |
James Cameron | 3 |
Pete Docter | 3 |
Peter Jackson | 3 |
Sidney Lumet | 3 |
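Frequency tables like these can be produced from movie_data with a short dplyr chain; a sketch (dplyr's count() names the column n rather than freq):

```r
library(dplyr)
# Count Top 250 titles per director, most frequent first
movie_data %>%
  count(director) %>%
  arrange(desc(n)) %>%
  head(10)
```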
Back to the link gathering. What if the links are stored in multiple index pages?
Example: top rated action movies on IMDb
Basic structure of a web address:
Page layout is defined by http://www.imdb.com/search/title? .
Everything after that specifies what is filled into the page and in what format.
Each specific request parameter is separated by &
Page 1 requests: genres=action& sort=user_rating,desc& title_type=feature& num_votes=25000,& pf_rd_m=A2FGELUUNOQJNL& pf_rd_p=2406822102& pf_rd_r=1HYX8S8DAZ1MCQ1CMYQE& pf_rd_s=right-6& pf_rd_t=15506& pf_rd_i=top& ref_=chttp_gnr_1
Page 2 requests: genres=action& num_votes=25000,& pf_rd_i=top& pf_rd_m=A2FGELUUNOQJNL& pf_rd_p=2406822102& pf_rd_r=1HYX8S8DAZ1MCQ1CMYQE& pf_rd_s=right-6& pf_rd_t=15506& ref_=chttp_gnr_1& sort=user_rating,desc& start=51& title_type=feature
Hard to compare directly; some string parsing is useful here.
LHS | RHS.P1 | RHS.P2 | RHS.P3 |
---|---|---|---|
genres | action | action | action |
sort | user_rating,desc | user_rating,desc | user_rating,desc |
title_type | feature | feature | feature |
num_votes | 25000, | 25000, | 25000, |
pf_rd_m | A2FGELUUNOQJNL | A2FGELUUNOQJNL | A2FGELUUNOQJNL |
pf_rd_p | 2406822102 | 2406822102 | 2406822102 |
pf_rd_r | 1HYX8S8DAZ1MCQ1CMYQE | 1HYX8S8DAZ1MCQ1CMYQE | 1HYX8S8DAZ1MCQ1CMYQE |
pf_rd_s | right-6 | right-6 | right-6 |
pf_rd_t | 15506 | 15506 | 15506 |
pf_rd_i | top | top | top |
ref_ | chttp_gnr_1 | chttp_gnr_1 | chttp_gnr_1 |
start | NA | 51 | 101 |
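The comparison table can be assembled with a little string parsing; a minimal sketch, using abbreviated query strings for readability (the full strings are on the previous slide):

```r
library(stringr)
# Split a query string on "&", then each piece on "=",
# returning a named vector of parameter values.
parse_query <- function(q) {
  kv <- str_split_fixed(str_split(q, "&")[[1]], "=", 2)
  setNames(kv[, 2], kv[, 1])
}
p1 <- parse_query("genres=action&sort=user_rating,desc&start=1")   # abbreviated
p2 <- parse_query("genres=action&sort=user_rating,desc&start=51")  # abbreviated
# Merge by parameter name to compare page 1 and page 2 side by side
merge(data.frame(LHS = names(p1), RHS.P1 = p1, stringsAsFactors = FALSE),
      data.frame(LHS = names(p2), RHS.P2 = p2, stringsAsFactors = FALSE),
      all = TRUE)
```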
Only start= is changing across pages. Construct URLs for the 20 pages:
start_url <- paste0("http://www.imdb.com/search/title?genres=action&sort=user_rating,desc&title_type=feature&num_votes=25000,&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2406822102&pf_rd_r=1HYX8S8DAZ1MCQ1CMYQE&pf_rd_s=right-6&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_gnr_1&start=",
                    seq(from = 1, to = 951, by = 50))
length(start_url)
[1] 20
link_rvest <- NULL
for (i in 1:20) {
  links <- read_html(start_url[i]) %>%
    html_nodes(".title") %>% html_node("a") %>%
    html_attr("href") %>%
    paste0("http://www.imdb.com", .)
  link_rvest %<>% c(links)
}
length(link_rvest)
[1] 965
movie_data <- NULL
for (i in 1:250) {
  movie <- template(link_rvest[i])
  movie_data %<>% rbind(movie)
  # After each iteration, pause for 5 seconds to be polite to the server
  Sys.sleep(5)
}
Or, randomize the pause length:
Sys.sleep(runif(1, 3, 5))
In the IMDb example, we have collected information scattered on a webpage.
Sometimes, we might be better off collecting a whole table.
Possibility 1: the information we need is well formatted in a table.
Possibility 2: some information available on one page is not available on another page.
Let's collect the recent feedback rating table.
An HTML table starts with the table tag. Use the html_table function to extract a table into an R data set.
fb_data <- page_url %>% read_html %>%
html_node(".frp") %>% html_node("table") %>% html_table()
fb_data
X1 X2 X3 X4 X5
1 NA Positive 1631 16283 35993
2 NA Neutral 3 29 63
3 NA Negative 1 35 92
Some simple text processing stores these feedback ratings as individual variables:
fb_data %<>% `[`(3:5) %>%
unlist() %>% as.data.frame() %>%
t() %>% as.data.frame() %>% tbl_df()
names(fb_data) <- c("p1","m1","n1","p6","m6","n6","p12","m12","n12")
fb_data %>% kable()
p1 | m1 | n1 | p6 | m6 | n6 | p12 | m12 | n12 |
---|---|---|---|---|---|---|---|---|
1631 | 3 | 1 | 16283 | 29 | 35 | 35993 | 63 | 92 |
Amazon financial metrics from Mergent Online.
The difference in information availability between Amazon and Cyberlink is nontrivial.
Data positions also shift from page to page.
It's better to collect all tables and then do the text processing.
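A sketch of that approach, assuming the same page_url object as before: pull every table node, convert them all, and then pick out the one needed by inspecting its contents rather than its position.

```r
# Grab every <table> on the page at once; robust when table
# positions shift between companies.
all_tables <- page_url %>% read_html() %>%
  html_nodes("table") %>%        # every table node on the page
  html_table(fill = TRUE)        # returns a list of data frames
length(all_tables)               # how many tables the page contains
```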
Many “fancy-looking” pages have JavaScript-generated content. Example: the English Premier League table.
rvest still works fine as a parser, but it won't read the JavaScript content, which is typically loaded with a delay.
Something more sophisticated is needed to load the page. A headless browser such as PhantomJS is great for this task.
“A headless browser is a web browser without a graphical user interface. Headless browsers provide automated control of a web page in an environment similar to popular web browsers” – Wikipedia
var url = 'http://www.somepage.com'; // this line gets replaced from R
var page = new WebPage(); var fs = require('fs');
// this opens the page and waits for a while
page.open(url, function (status) {
  just_wait();
});
// the waiting is 2500 milliseconds; adjust as fit
function just_wait() {
  setTimeout(function() {
    fs.write('myfile.html', page.content, 'w');
    phantom.exit();
  }, 2500);
}
# change the first line of scrape.js:
# replace it with the URL of the to-be-scraped page
lines <- readLines("scrape.js")
lines[1] <- paste0("var url = '", url, "';")
writeLines(lines, "scrape.js")
# let phantomjs download the website
system("phantomjs scrape.js")
Then, standard web crawling with rvest.
epl_data <-
read_html("myfile.html") %>%
html_node(".leagueTable") %>%
html_table() %>%
as.data.frame() %>%
`[`(c(1,4:12)) %>%
tbl_df() %>%
mutate(POS=as.numeric(POS)) %>%
arrange(POS) %>%
`[`(1:20,)
2015-2016 season (so far)
POS | CLUB | PTS |
---|---|---|
1 | Leicester City | 77 |
2 | Tottenham Hotspur | 70 |
3 | Arsenal | 67 |
4 | Manchester City | 64 |
5 | Manchester United | 60 |
6 | West Ham United | 59 |
7 | Southampton | 57 |
8 | Liverpool | 55 |
9 | Chelsea | 48 |
10 | Stoke City | 48 |
2014-2015 season
POS | CLUB | PTS |
---|---|---|
1 | Chelsea | 87 |
2 | Manchester City | 79 |
3 | Arsenal | 75 |
4 | Manchester United | 70 |
5 | Tottenham Hotspur | 64 |
6 | Liverpool | 62 |
7 | Southampton | 60 |
8 | Swansea City | 56 |
9 | Stoke City | 54 |
10 | Crystal Palace | 48 |
The more realistic websites to scrape might be:
Google Spreadsheets has some basic scraping features.
Subscription-based online services.
Locally installed, one-time-price software.
Guangzhi Shang
Florida State University
Slides and code are uploaded to http://gshang.weebly.com/crawler-workshop.html