Guangzhi Shang
Florida State University
04/07/2017
University of Maryland
Research design and data collection phase
Data analysis phase (exploratory and post hoc)
What and Why
A Basic Crawler
Extensions
There are many tutorials on Google; those are the normal version. This is the light version.
The way I tend to understand a crawler: copy-paste in batch mode.
Below are two movies I watched recently (imdb.com). There is a striking similarity in their content exposition.
A classic outsourcing versus in-house production problem.
Don't worry, I have a theoretical lens: Transaction Cost Economics.
Three main arguments: asset specificity, uncertainty, and transaction frequency.
What and Why
A Basic Crawler
Extensions
Assumption: All we need is copy-paste in batch mode.
Then, there are three key elements: a set of links, a per-page template, and a loop.
Where do the links come from? Most likely collected from an index page (example: the Top Rated 250 movies on IMDb).
Method 1: the more visual way. Use a browser extension such as Link Klipper to export all links on the page to a text file.
library(dplyr)
# Read in the links exported by the browser extension
link_klipper <- read.table("linkklipper.txt", header = FALSE)[[1]]
length(link_klipper)
[1] 250
Method 2: use an HTML page parser such as the rvest package.
library(rvest)
index_page <- read_html("http://www.imdb.com/chart/top?ref_=nv_ch_250_4")
Each movie entry lives in a node with CSS class titleColumn. Go one step deeper and extract the (partial) link from the href attribute.
partial_link <- html_attr(html_node(html_nodes(index_page, ".titleColumn"), "a"), "href")
link_rvest <- paste0("http://www.imdb.com", partial_link)
length(link_rvest)
[1] 250
Chain things together in an easy-to-read way with magrittr. Each %>% feeds the result of the previous step into the first argument of the next function.
library(magrittr)
link_rvest <-
read_html("http://www.imdb.com/chart/top?ref_=nv_ch_250_4") %>%
html_nodes(".titleColumn") %>%
html_node("a") %>%
html_attr("href") %>%
paste0("http://www.imdb.com",.)
length(link_rvest)
[1] 250
Time to construct a crawling template to feed into the loop that will iterate through the links.
When extracting links from the index page, we have already seen nested calls like this:
html_attr(html_node(html_nodes(index_page ,".titleColumn"),"a"),"href")
rvest parses a downloaded web page using its HTML tags. The top 250 movies all share the same page format, so just pick one to develop the template.
Let's say we want to collect title, rating, and director from each movie.
SelectorGadget (a CSS selector tool) + rvest make this easy.
The title lives in an h1 node:
library(stringi);library(stringr)
title <- link_rvest[14] %>% read_html() %>%
html_nodes("h1")
title[[2]] %>% html_text()
[1] "Inception (2010) "
title <- title[[2]] %>% html_text() %>%
stri_trim_both() %>% substr(.,1,nchar(.)-7)  # drop the trailing " (2010)"
title
[1] "Inception"
Similar work for rating:
rating <-
link_rvest[14] %>%
read_html() %>%
html_node(".ratingValue") %>%
html_text() %>%
stri_replace_all_fixed("/10","") %>%
stri_trim_both() %>%
as.numeric()
rating
[1] 8.8
A little more work is needed to uniquely locate the director.
Parse out the director text.
director <- link_rvest[14] %>% read_html() %>%
html_node(".summary_text+ .credit_summary_item") %>%
html_text()
director
[1] "\n Director:\n \nChristopher Nolan \n "
director %<>% str_split(":|,")
director <- director[[1]][2] %>% str_trim
director
[1] "Christopher Nolan"
one_movie <- data.frame(title,rating,director)
one_movie
title rating director
1 Inception 8.8 Christopher Nolan
Wrap the template into a function for a cleaner loop view.
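The function itself is not printed on the slide; below is a minimal sketch of what template might look like, simply bundling the three extraction steps above (same selectors, same cleanup; rvest, stringi, and stringr are loaded earlier):
template <- function(movie_url) {
  page <- read_html(movie_url)
  # Title: second h1 node, trimmed, with the trailing " (year)" removed
  title <- page %>% html_nodes("h1") %>% .[[2]] %>% html_text() %>%
    stri_trim_both() %>% substr(., 1, nchar(.) - 7)
  # Rating: strip the "/10" suffix and convert to numeric
  rating <- page %>% html_node(".ratingValue") %>% html_text() %>%
    stri_replace_all_fixed("/10", "") %>% stri_trim_both() %>% as.numeric()
  # Director: split "Director: Name" on ":" or "," and keep the name
  director <- page %>% html_node(".summary_text+ .credit_summary_item") %>%
    html_text() %>% str_split(":|,") %>% .[[1]] %>% .[2] %>% str_trim()
  data.frame(title, rating, director, stringsAsFactors = FALSE)
}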
Test it out with No. 7, Pulp Fiction.
template(link_rvest[7])
title rating director
1 Pulp Fiction 8.9 Quentin Tarantino
Everything works fine!
# An empty object to store data
movie_data <- NULL
# Loop through the 250 movies
for (i in 1:250) {
movie <- template(link_rvest[i])
# Extend the length of data set after each iteration
movie_data %<>% rbind(movie)
}
The result is a nice data set ready for analysis!
title | rating | director |
---|---|---|
The Shawshank Redemption | 9.3 | Frank Darabont |
The Godfather | 9.2 | Francis Ford Coppola |
The Godfather: Part II | 9.0 | Francis Ford Coppola |
The Dark Knight | 9.0 | Christopher Nolan |
12 Angry Men | 8.9 | Sidney Lumet |
Schindler's List | 8.9 | Steven Spielberg |
Pulp Fiction | 8.9 | Quentin Tarantino |
The Lord of the Rings: The Return of the King | 8.9 | Peter Jackson |
The Good, the Bad and the Ugly | 8.9 | Sergio Leone |
Fight Club | 8.8 | David Fincher |
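The director frequency tables that follow can be produced with a quick tally; a minimal dplyr sketch (counts depend on how co-directed movies were parsed above):
movie_data %>%
  count(director, sort = TRUE) %>%  # one row per director, most frequent first
  rename(freq = n) %>%
  head(20)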
director | freq |
---|---|
Christopher Nolan | 8 |
Steven Spielberg | 8 |
Alfred Hitchcock | 7 |
Martin Scorsese | 7 |
Stanley Kubrick | 7 |
Quentin Tarantino | 6 |
Akira Kurosawa | 5 |
Billy Wilder | 5 |
Hayao Miyazaki | 5 |
Sergio Leone | 5 |
director | freq |
---|---|
David Fincher | 4 |
Francis Ford Coppola | 4 |
Peter Jackson | 4 |
Sidney Lumet | 4 |
Charles Chaplin | 3 |
Clint Eastwood | 3 |
Frank Capra | 3 |
Frank Darabont | 3 |
Ingmar Bergman | 3 |
James Cameron | 3 |
What and Why
A Basic Crawler
Extensions
Back to the link gathering. What if the links are stored in multiple index pages?
Example: top rated action movies on IMDb
Basic structure of a web address:
The page layout is defined by http://www.imdb.com/search/title?
Everything after that specifies what is to be filled into the page and in what format.
Each specific request parameter is separated by &
Page 1 requests: genres=action& sort=user_rating,desc& title_type=feature& num_votes=25000,& pf_rd_m=A2FGELUUNOQJNL& pf_rd_p=2406822102& pf_rd_r=1HYX8S8DAZ1MCQ1CMYQE& pf_rd_s=right-6& pf_rd_t=15506& pf_rd_i=top& ref_=chttp_gnr_1
Page 2 requests: genres=action& num_votes=25000,& pf_rd_i=top& pf_rd_m=A2FGELUUNOQJNL& pf_rd_p=2406822102& pf_rd_r=1HYX8S8DAZ1MCQ1CMYQE& pf_rd_s=right-6& pf_rd_t=15506& ref_=chttp_gnr_1& sort=user_rating,desc& start=51& title_type=feature
Hard to compare by eye; some string parsing is useful here.
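A minimal sketch of that parsing (parse_query is my own helper name, not from the slides):
library(stringr)
# Split a "name=value&name=value" query string into a named vector
# so the three pages can be lined up side by side
parse_query <- function(q) {
  pairs <- str_split(q, fixed("&"))[[1]]
  kv <- str_split_fixed(pairs, fixed("="), 2)
  setNames(kv[, 2], kv[, 1])  # values named by their parameter
}
parse_query("genres=action&sort=user_rating,desc&start=51")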
parameter | page 1 | page 2 | page 3 |
---|---|---|---|
genres | action | action | action |
sort | user_rating,desc | user_rating,desc | user_rating,desc |
title_type | feature | feature | feature |
num_votes | 25000, | 25000, | 25000, |
pf_rd_m | A2FGELUUNOQJNL | A2FGELUUNOQJNL | A2FGELUUNOQJNL |
pf_rd_p | 2406822102 | 2406822102 | 2406822102 |
pf_rd_r | 1HYX8S8DAZ1MCQ1CMYQE | 1HYX8S8DAZ1MCQ1CMYQE | 1HYX8S8DAZ1MCQ1CMYQE |
pf_rd_s | right-6 | right-6 | right-6 |
pf_rd_t | 15506 | 15506 | 15506 |
pf_rd_i | top | top | top |
ref_ | chttp_gnr_1 | chttp_gnr_1 | chttp_gnr_1 |
start | NA | 51 | 101 |
Only start= is changing. Construct URLs for all 22 index pages.
start_url <- paste0("http://www.imdb.com/search/title?genres=action&sort=user_rating,desc&title_type=feature&num_votes=25000,&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=2406822102&pf_rd_r=1HYX8S8DAZ1MCQ1CMYQE&pf_rd_s=right-6&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_gnr_1&start=",
seq(from=1,to=1051,by=50))
length(start_url)
[1] 22
link_rvest <- NULL
for (i in 1:22) {
links <- read_html(start_url[i]) %>%
html_nodes(".lister-item-header a") %>%
html_attr("href") %>%
paste0("http://www.imdb.com",.)
link_rvest %<>% c(links)
}
length(link_rvest)
[1] 1083
movie_data <- NULL
for (i in 1:250) {
movie <- template(link_rvest[i])
movie_data %<>% rbind(movie)
# After each iteration, pause for 5 seconds to be polite to the server
Sys.sleep(5)
}
Better yet, pause for a random length of time:
Sys.sleep(runif(1,3,5))
In the IMDb example, we have collected information scattered on a webpage.
Sometimes, we might be better off collecting a whole table.
Possibility 1: the information we need is well formatted in a table.
Possibility 2: some information available on one page is not available on another.
Let's collect the recent feedback rating table; page_url below holds the address of the page.
An HTML table starts with the table tag. Use the html_table function to extract a table into an R data frame.
fb_data <- page_url %>% read_html %>%
html_node(".frp") %>% html_node("table") %>% html_table()
fb_data
X1 X2 X3 X4 X5
1 NA Positive 1626 16725 28407
2 NA Neutral 1 32 55
3 NA Negative 4 34 49
Some simple text processing stores these feedback ratings as individual variables:
fb_data %<>% `[`(3:5) %>%
unlist() %>% as.data.frame() %>%
t() %>% as.data.frame() %>% tbl_df()
names(fb_data) <- c("p1","m1","n1","p6","m6","n6","p12","m12","n12")
fb_data %>% kable()
p1 | m1 | n1 | p6 | m6 | n6 | p12 | m12 | n12 |
---|---|---|---|---|---|---|---|---|
1626 | 1 | 4 | 16725 | 32 | 34 | 28407 | 55 | 49 |
Amazon financial metrics from Mergent Online.
The difference in information availability between Amazon and Cyberlink is nontrivial.
Data positions also shift from page to page.
It's better to collect all of the tables first and then do the text processing.
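A minimal sketch of that collect-everything approach (assuming page_url points at the company's page, as before):
# Pull every table on the page into a list of data frames,
# then sort out which one is which via text processing
all_tables <- page_url %>% read_html() %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
length(all_tables)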
Many “fancy-looking” pages have JavaScript-generated content. Example: the English Premier League table.
rvest still works fine as a parser, but it won't see the JavaScript-generated content, which is typically loaded with a delay.
Something more sophisticated is needed to load the page. A headless browser such as PhantomJS is great for such a task.
“A headless browser is a web browser without a graphical user interface. Headless browsers provide automated control of a web page in an environment similar to popular web browsers” – Wikipedia
var url = 'http://www.somepage.com'; // replace this line from R
var page = new WebPage();
var fs = require('fs');
// this opens the page and waits for a while
page.open(url, function (status) {
  just_wait();
});
// the wait is 2500 milliseconds; adjust as fit
function just_wait() {
  setTimeout(function() {
    fs.write('myfile.html', page.content, 'w');
    phantom.exit();
  }, 2500);
}
# Change the first line of scrape.js:
# replace it with the URL of the to-be-scraped page
lines <- readLines("scrape.js")
lines[1] <- paste0("var url ='", url ,"';")
writeLines(lines, "scrape.js")
## let phantomjs download the website
system("phantomjs scrape.js")
Then feed the saved myfile.html into rvest just like a live page.
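For example, assuming the standings are the first table on the saved page (my assumption about the page layout):
# Parse the file PhantomJS wrote, exactly like a live page
epl_table <- read_html("myfile.html") %>%
  html_node("table") %>%
  html_table()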
2015-2016 season
(epic win by Leicester)
pos | club | pts |
---|---|---|
1 | Leicester City | 81 |
2 | Arsenal | 71 |
3 | Tottenham Hotspur | 70 |
4 | Manchester City | 66 |
5 | Manchester United | 66 |
6 | Southampton | 63 |
7 | West Ham United | 62 |
8 | Liverpool | 60 |
9 | Stoke City | 51 |
10 | Chelsea | 50 |
2016-2017 season
(so far)
pos | club | pts |
---|---|---|
1 | Chelsea | 69 |
2 | Tottenham Hotspur | 59 |
3 | Manchester City | 57 |
4 | Liverpool | 56 |
5 | Manchester United | 52 |
6 | Arsenal | 50 |
7 | Everton | 50 |
8 | West Bromwich Albion | 43 |
9 | Stoke City | 36 |
10 | Southampton | 33 |
The more realistic websites you need to scrape may be messier than these examples.
If building a crawler in-house is too costly, there are outsourcing options:
Google Spreadsheets has some basic scraping features (e.g., the IMPORTHTML function).
Subscription-based online services.
Locally installed, one-time-price software.
What and Why
A Basic Geocoder
Extensions
Within a single data set
Between multiple data sets
Spatial Econometrics
Many variants exist for an address like “100 N Main Ave Tallahassee FL”.
What and Why
Geocoding Packages in R
Extensions
library(ggmap);library(RDSTK)
geocode("florida state university",source="google")
lon lat
1 -84.29849 30.44188
geocode("600 W College Ave, Tallahassee, FL 32306",source="google")
lon lat
1 -84.29067 30.44096
RDSTK: the Data Science Toolkit API
street2coordinates("600 W College Ave, Tallahassee, FL 32306")[c(5,3)]
longitude latitude
1 -84.28844 30.44073
To measure how far apart two coordinates are, great-circle distance is what you need.
library(geosphere)
umd <- geocode("university+of+maryland+college+park",source="google")
fsu <- geocode("florida+state+university",source="google")
distCosine(fsu,umd)/1609.34  # meters to miles
[1] 723.4794
What and Why
Geocoding Packages in R
Extensions
One good aspect of the Google API is that it simplifies the calculation of distance/time between A and B by travel mode.
Use my little wrapper function travel.dist.
travel.dist(origin='university+of+maryland+college+park',
destination='florida+state+university')
[1] 879.3297
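The wrapper itself is not shown; here is a rough sketch of what it might look like, assuming the Google Distance Matrix API (my reconstruction, not the author's code; newer versions of the API also require a key parameter):
library(jsonlite)
travel.dist <- function(origin, destination, mode = "driving") {
  # Build the Distance Matrix API request
  url <- paste0("https://maps.googleapis.com/maps/api/distancematrix/json",
                "?origins=", origin,
                "&destinations=", destination,
                "&mode=", mode)
  res <- fromJSON(url)
  # The API reports distance in meters; convert to miles
  res$rows$elements[[1]]$distance$value / 1609.34
}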
ggplot2 Examples
Evolution of Interaction Effect Plots
ggplot2
How many variables are there?
ggplot2 Examples
Evolution of Interaction Effect Plots
\[ y = \beta_0 + \beta_1 x + \beta_2 f + \beta_3 x \cdot f + \epsilon \]
Created using effect
Created using ggplot2
\[ y = \beta_0 + \beta_1 x + \beta_2 f + \beta_3 x \cdot f + \epsilon \]
Often seen version
Perhaps more informative
Even more informative
My favorite!
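The plots themselves are not reproduced in this text version; as a substitute, here is a minimal ggplot2 sketch of the basic interaction plot, using simulated (hypothetical) data:
library(ggplot2)
set.seed(1)
# Simulate y = b0 + b1*x + b2*f + b3*x*f + e with a two-level factor f
n <- 200
x <- runif(n)
f <- factor(sample(c("low", "high"), n, replace = TRUE))
y <- 1 + 2 * x + 0.5 * (f == "high") + 1.5 * x * (f == "high") + rnorm(n, sd = 0.5)
# One fitted line per level of f; diverging slopes reveal the interaction
ggplot(data.frame(x, y, f), aes(x = x, y = y, color = f)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = TRUE)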
\[ y = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 m + \beta_4 x \cdot z + \beta_5 x \cdot m + \epsilon \]
Use color for effect size
Three “indifference” curves