scraping with rvest

We’re working on a second edition of The Historian’s Macroscope. We’re pruning dead links, updating bits and bobs, and making sure things still work the way we imagined they’d work.

But we relied rather heavily on a couple of commercial pieces of software, and while there’s nothing wrong with doing that, I really don’t want to be shilling for various companies, or trying to explain in print how to click this, then that, then look for this menu…

So, I figured, what the hell, let’s take the new-to-digital-history person by the hand and push them into the R and RStudio pool.

What shall we scrape? Perhaps we’re interested in the diaries of the second American President, John Adams. The diaries have been transcribed and put online by the Massachusetts Historical Society. The diaries are sorted by date on this page. Each diary has its own webpage, and is linked to on that index page. We would like to collect all of these links into a list, and then iterate through the list, grabbing all of the text of the diaries (without all of the surrounding html!) and copying them into both a series of text files on our machine, and into a variable so that we can do further analysis (eventually).

If you look at any of the webpages containing the diary entries and study the source (right-click, ‘view source’), you’ll see that the text of the diary is wrapped by an opening

<div class="entry">

and a closing

</div>

That’s what we’re after. If you look at the source code for the main index page listing all of the diaries, you’ll see that the links are all relative links rather than absolute ones – they just give the next bit of the url relative to a base url. Every webpage will be different; you will get used to right-clicking and ‘viewing source’, or using the ‘inspector’.
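To make that concrete, here is a toy illustration of how a class like entry becomes a scraping target; the snippet of html is invented (not copied from the MHS site), and it uses the rvest package that we’ll set up in a moment, so feel free to come back to it after the next few steps.

library(rvest)

# a made-up page with the same kind of markup the diary pages use
toy_page <- read_html('<div class="entry"><p>At home all day, reading.</p></div>')

# the CSS selector ".entry" finds that div; html_text() strips away the tags
toy_page %>% html_nodes(".entry") %>% html_text()
#> [1] "At home all day, reading."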

For the purposes of this exercise, it isn’t necessary to install R and RStudio on your own machine, although you are welcome to do so and you will want to do so eventually. For now we can run a version of RStudio in your browser courtesy of the Binder service – if you click the link here, a version of RStudio already preconfigured with many useful packages (including rvest and dplyr, which we will be using shortly) will (eventually) fire up in your browser.
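If you do take the plunge on your own machine, the packages themselves need installing once before they can be loaded (the Binder image already has them); something along these lines, typed into the console, should do it:

# one-time installation, only needed on your own machine
install.packages(c("rvest", "dplyr"))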

With RStudio loaded up, select file > new file > r script (or, click on the green plus sign beside the R icon).

The panel that opens is where we’re going to write our code. We’re not going to write our code from first principles though. We’re going to take advantage of an existing package called ‘rvest’ (pronounce it as if you’re a pirate….) and we are going to reuse but gently modify code that Jerid Francom first wrote to scrape State of the Union Addresses. By writing scripts or code to do our work (from data gathering all the way through to visualization) we enable other scholars to build on our work, to replicate our work, and to critique our work.

In the code snippets below, any line that starts with a # is a comment. Anything else is a line we run.


library(rvest)
library(dplyr)

These first two lines tell R that we want to use the rvest and dplyr packages to make things a bit easier. Put your cursor at the end of each line, and hit the ‘run’ button. R will pass the code into the console window below; if all goes well, it will just show a new prompt down there. Error messages will appear if things go wrong, of course. The cursor will move down to the next line; hit ‘run’ again. Now let’s tell R the base url and the main page that we want to scrape. Type:


base_url <- "https://www.masshist.org"
# Load the page
main.page <- read_html(x = "https://www.masshist.org/digitaladams/archive/browse/diaries_by_date.php")


We give a variable a name, and then use the <- arrow to tell R what goes into that variable. In the code above, we are also using rvest’s read_html function to tell R that, well, we want it to fill the variable ‘main.page’ with the html from that location. Now let’s get some data:

# Get link URLs
urls <- main.page %>% # feed `main.page` to the next step
    html_nodes("a") %>% # find the <a> (link) nodes, using a CSS selector
    html_attr("href") # extract each link's URL from its href attribute
# Get link text
links <- main.page %>% # feed `main.page` to the next step
    html_nodes("a") %>% # find the <a> (link) nodes again
    html_text() # extract the link text

In the code above, we first create a variable called ‘urls’. We feed it the html from main.page; the %>% then passes the data on the left to the next function on the right, in this case ‘html_nodes’, a function that travels through the html looking for the ‘a’ (link) nodes, and then passes those on to ‘html_attr’, which pulls out each link’s ‘href’ attribute. The url is thus extracted. Then we do it again, but this time pass the text of the link to our ‘links’ variable. You’ve scraped some data!
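If you want to see what the pipe actually produced, you can peek at the results in the console; the exact output will depend on the state of the website when you scrape it:

head(urls)   # the first few href values – relative paths, note, not full urls
head(links)  # the first few bits of link text
length(urls) # how many links we grabbed in total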

But it’s not very usable yet. We’re going to make a ‘data frame’, or a table, of these results, creating a column for ‘links’ and a column for ‘urls’. Remember how we said earlier that the links were all relative? We’re also going to paste the base url into those links so that we get the complete path, the complete url, to each diary’s webpage.
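If paste is new to you, here is a quick illustration of what sep = "" does (the path here is just an example, not a real diary url):

# paste glues strings together; sep = "" means 'no space between them'
paste("https://www.masshist.org", "/digitaladams/archive/doc.php", sep = "")
#> [1] "https://www.masshist.org/digitaladams/archive/doc.php"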


# Combine `links` and `urls` into a data.frame
# because the links are all relative, let's add the base url with paste
diaries <- data.frame(links = links, urls = paste(base_url, urls, sep = ""), stringsAsFactors = FALSE)

Here, we have created a ‘diaries’ variable, and we’ve told R that it’s actually a dataframe. We’re telling R: ‘make a links column and put links into it; and make an urls column, but paste the base_url and the link url together, with no space between them’. The ‘stringsAsFactors’ bit isn’t germane to us right now (but you can read about it here). Want to see what you’ve got so far?


View(diaries)

The uppercase ‘V’ is important; a lowercase view() doesn’t exist in R. Your dataframe will open in a new tab beside your script, and you can see what you have. But there are a couple of rows there where we’ve grabbed links like ‘home’, ‘search’, and ‘browse’, which we do not want. Every row that we want begins with ‘John Adams’ (and in fact, if we don’t get rid of the rows we don’t want, the next bit of code won’t work!).


# but we have a few links to 'home' etc that we don't want
# so we'll filter those out with grepl and a regular
# expression that looks for 'John' at the start of
# the links field.
diaries <- diaries %>% filter(grepl("^John", links))

We are telling R to overwrite ‘diaries’ with ‘diaries’ that has been passed through a filter. The filter command has also been told how to filter: use ‘grepl’ and the regular expression (or search pattern) ^John. In English: keep only the rows whose links column begins with ‘John’. Try View(diaries) again. All the extra stuff should be gone now!
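If the ^ looks mysterious, it is the regular-expression anchor for ‘start of the string’. A quick illustration with some made-up link text:

# grepl returns TRUE where the pattern matches, FALSE where it doesn't
grepl("^John", c("John Adams diary 1", "Home", "Browse"))
#> [1]  TRUE FALSE FALSE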

We still haven’t grabbed the diary entries themselves yet. We’ll do that in a moment, while at the same time writing those entries into their own folder in individual text files. Let’s create a directory to put them in:


#create a directory to keep our materials in

dir.create("diaries")
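One small optional refinement: if you run the script more than once, dir.create will grumble that the folder already exists. A guard like this keeps it quiet:

# optional: only create the folder if it isn't already there
if (!dir.exists("diaries")) dir.create("diaries")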

and now, we’re going to systematically move through our list of diaries, one row at a time, extracting the diary entry which, when we examined the webpage source code earlier, we saw was marked by an ‘entry’ div. Here we go!


# Loop over each row in `diaries`, extracting the entry
# from each page and then writing it to file
for(i in seq(nrow(diaries))) {
    text <- read_html(diaries$urls[i]) %>% # load the page
        html_nodes(".entry") %>% # isolate the text of the entry div
        html_text() # get the text

    # Create the file name from the relevant link text
    filename <- paste0("diaries/", diaries$links[i], ".txt")
    sink(file = filename) # open the file to write to
    cat(text) # write the text into the file
    sink() # close the file
}

The first line sets up a loop – ‘i’ is used to keep track of which row in ‘diaries’ we are currently on. The code between the { and } is the code that we loop through for each row. So, we start with the first row. We create a variable called ‘text’, into which we get the read_html function from rvest to read the html for the webpage address in the urls column of ‘diaries’ at row i. We pass that html to the html_nodes function, which looks for the div that embraces the diary entry. We pass what we found there to the html_text function, which extracts the actual text.

That was part one of the loop. In part two of the loop we create a filename variable, building a name from the link text for the webpage by pasting together the folder name diaries + the link text from this row + .txt. We use the ‘sink’ command to tell R we want to drain the data into a file. ‘cat’, which is short for ‘concatenate’, does the writing, putting the contents of the text variable into the file. Then we close the sink. We get to the closing bracket } and we start the loop over again, moving on to the next row.
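Before running the whole loop, it can be reassuring to try the scraping part of it on a single row – say, the first one – and peek at what comes back (what you see will depend on the live site):

# a sanity check: run the scraping part of the loop on the first row only
test_text <- read_html(diaries$urls[1]) %>% # load the first diary page
    html_nodes(".entry") %>% # isolate the entry div
    html_text() # extract the text
substr(test_text, 1, 200) # peek at the first 200 characters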

Cool, eh?

You now have a folder filled with text files that we can analyze with a variety of tools or approaches, and a text variable (holding the last entry we scraped) ready for more analysis right now in R.
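A quick way to confirm that the files really did get written (the names come from the link text, so yours should look like the rows of the dataframe):

# list what ended up in the diaries folder
list.files("diaries")
length(list.files("diaries")) # should match the number of rows in `diaries`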

The full code is in this github gist (diary-scraper.r):


#after https://francojc.github.io/2015/03/01/web-scraping-with-rvest-in-r/
library(rvest)
library(dplyr)

base_url <- "https://www.masshist.org"

# Load the page
main.page <- read_html(x = "https://www.masshist.org/digitaladams/archive/browse/diaries_by_date.php")

# Get link URLs
urls <- main.page %>% # feed `main.page` to the next step
    html_nodes("a") %>% # find the <a> (link) nodes, using a CSS selector
    html_attr("href") # extract each link's URL from its href attribute

# Get link text
links <- main.page %>% # feed `main.page` to the next step
    html_nodes("a") %>% # find the <a> (link) nodes again
    html_text() # extract the link text

# Combine `links` and `urls` into a data.frame;
# because the links are all relative, add the base url with paste
diaries <- data.frame(links = links, urls = paste(base_url, urls, sep = ""), stringsAsFactors = FALSE)

# but we have a few links to 'home' etc that we don't want,
# so we'll filter those out with grepl and a regular
# expression that looks for 'John' at the start of
# the links field.
diaries <- diaries %>% filter(grepl("^John", links))

#update nov 9 – I find that the filter line above doesn't work in some versions of r via binder that
#i have running. I think it's a versioning thing. Anyway, another way of achieving the same
#effect if you get an error there is to slice away the bits you don't want (thus keeping
#the range of stuff you *do* want):
# diaries <- diaries %>% slice(9:59)

# create a directory to keep our materials in
dir.create("diaries")

# Loop over each row in `diaries`, extracting the entry
# from each page and then writing it to file
for(i in seq(nrow(diaries))) {
    text <- read_html(diaries$urls[i]) %>% # load the page
        html_nodes(".entry") %>% # isolate the text of the entry div
        html_text() # get the text

    # Create the file name from the relevant link text
    filename <- paste0("diaries/", diaries$links[i], ".txt")
    sink(file = filename) # open the file to write to
    cat(text) # write the text into the file
    sink() # close the file
}
