Grabbing data from Open Context

This morning, on Twitter, there was a conversation about site diaries and the possibilities of topic modeling for extracting insight from them. Open Context has 2618 diaries – here’s one of them. Eric, who runs Open Context, has an excellent API for all that kind of data. Append .json on the end of a file name, and *poof*, lots of data. Here’s the json version of that same diary.  So, I wanted all of those diaries – this URL (click & then note where the .json lives; delete the .json to see the regular html) has ’em all.

I copied and pasted that list of urls into a .txt file, and fed it to wget

wget -i urlstograb.txt -O output.txt

and now my computer is merrily pinging Eric’s, putting all of the info into a single txt file. And sometimes crashing it, too.

(Sorry Eric).
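Since the API trick is just "append .json", the list of URLs to grab can itself be generated rather than edited by hand. A minimal sketch, with hypothetical URLs standing in for the real diary links:

```shell
# hypothetical stand-ins for the real Open Context diary URLs
printf '%s\n' \
  'https://opencontext.org/subjects/diary-1' \
  'https://opencontext.org/subjects/diary-2' \
  > urls.txt
# appending .json turns each record page into its API version
sed 's/$/.json/' urls.txt > urlstograb.txt
cat urlstograb.txt
```

Adding -w 2 and --limit-rate=200k to the wget call would space out the requests and be rather kinder to Eric's server.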

When it’s done, I’ll rename it .json and then use the rio package to get it into usable form for R. The data has geographic coordinates too, so with much futzing I expect I could *probably* represent topics over space (maybe by exporting to Gephi & using its geolayout).

Futz: that’s the operative word, here.

A quick note on visualizing topic models as a self-organizing map

I wanted to visualize topic models as a self-organizing map. This code snippet was helpful. (Here’s its blog post).

In my standard topic modeling script in R, I added this:

library(kohonen)  # provides som() and somgrid()

doc.topics.scaled <- scale(doc.topics)
doc.topics.som <- som(doc.topics.scaled, grid = somgrid(20, 16, "hexagonal"))
plot(doc.topics.som, main = "Self Organizing Map of Topics in Documents")

which gives something like this:

Screen Shot 2015-05-05 at 3.02.53 PM

Things to be desired: I don’t know which circle represents which document. Each pie slice represents a topic. If you have more than around 10 topics, you get a graph in the circle instead of a pie chart. I was colouring in areas by the dominant pie-slice colour in Inkscape, but then the whole thing crashed on me. Still, it’s a move in the right direction for getting a sense of the landscape of your entire corpus. What I’m eventually hoping for is to end up with something like this (from this page):


I found some code which seems to work. In my topic model script, I need to save the doc.topics output as RData:

save(doc.topics, file = "doctopics.RData")

and then the following:


## Code for Plots
### source("Map_COUNTY_BMU.R") <- not necessary for SG

## the helper functions (plotCplane etc.) have to be sourced
## from your working directory; see note below
library(kohonen)
library(fields)        # image.plot, designer.colors
library(RColorBrewer)  # brewer.pal

# Load Data
## data is from a topic model of student writing in Eric's class
load("doctopics.RData")

# Build SOM
aGrid <- somgrid(xdim = 20, ydim = 16, topo = "hexagonal")

## rlen is arbitrarily low
aSom <- som(data = as.matrix(scale(doc.topics)), grid = aGrid, rlen = 1, alpha = c(0.05, 0.01))

par(mar = rep(1, 4))
cplanelay <- layout(matrix(1:8, nrow = 4))
vars <- colnames(aSom$data)
for (p in vars) {
  plotCplane(som_obj = aSom, variable = p, legend = FALSE, type = "Quantile")
}
plot(0, 0, type = "n", axes = FALSE, xlim = c(0, 1),
     ylim = c(0, 1), xlab = "", ylab = "")
par(mar = c(0, 0, 0, 6))
image.plot(legend.only = TRUE, col = rev(designer.colors(n = 10, col = brewer.pal(9, "Spectral"))), zlim = c(-1.5, 1.5))



…does the trick. Notice ‘doc.topics’ makes another appearance there – I’ve got the topic model loaded into memory. Also, in ‘aGrid’, xdim multiplied by ydim must not exceed the number of observations you’ve got. Fewer cells than observations: no problem. More than what you’ve got: you’ll get error messages. So, here’s what I ended up with:

Screen Shot 2015-05-05 at 4.50.29 PM

Now I just need to figure out how to put labels on each hexagonal bin. By the way, the helper functions have to be in your working directory for ‘source’ to find them.
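Incidentally, that grid-size constraint is easy to sanity-check before training; a throwaway sketch (the document count here is hypothetical):

```shell
xdim=20
ydim=16
ndocs=2618  # hypothetical number of documents in the topic model
if [ $((xdim * ydim)) -le "$ndocs" ]; then
  echo "grid ok: $((xdim * ydim)) cells for $ndocs observations"
else
  echo "grid too big: shrink xdim or ydim"
fi
```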

What Careers Need History?

We have a new website at CU today; one of the interesting things on it is a page under the ‘admissions’ section that describes many careers and the departments whose program might fit you for such a career.

I was interested to know what careers were listed as needing a history degree. Updated Oct 17: I have since learned that these career listings were generated by members of the department some time ago; I initially believed that the list was generated solely by admissions, and I apologize for the confusion. This paragraph has been edited to reflect that correction. See also the conclusion to this piece at bottom.

I used wget to download all of the career pages:

wget -r --no-parent -w 2 -l 2 --limit-rate=20k

I then copied all of the index.html files using the Mac finder (searched within the subdirectory for all index.html; copied them into a new folder).
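That Finder step could equally be scripted. A sketch, with toy directories standing in for the real wget mirror (all names hypothetical):

```shell
# toy stand-in for the wget mirror
mkdir -p site/careers/teaching site/careers/law
echo '<h3>History</h3>' > site/careers/teaching/index.html
echo '<h3>Law</h3>' > site/careers/law/index.html
# gather every index.html into one flat folder, numbering to avoid collisions
mkdir -p pages
n=1
for f in $(find site -name 'index.html'); do
  cp "$f" "pages/index$n.html"
  n=$((n+1))
done
ls pages
```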

Then, I used grep to figure out how many instances of capital-h History (thus, the discipline, rather than the generic noun) could be found on those career pages:

grep -c '<h3>History' *.html > history-results.tsv

I did the same again for a couple of other keywords. The command counts all instances of History in the html files, and writes the results to a tab-separated file, which I open in Excel. But – I don’t know what index1.html is about, or index45.html, and so on. So in TextWrangler, I searched across all the files for the text between the tags, using a simple regex:

Screen Shot 2014-10-15 at 2.28.53 PM

Screen Shot 2014-10-15 at 2.29.35 PM

Copy and paste those results into a new sheet in the Excel file, search-and-replace away the extraneous bits (index, .html, and so on), sort by the file numbers, then copy and paste the names (now in the correct order) into a new column beside the original counts.
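To make the counting step concrete, here's a tiny reproduction with toy files standing in for the real career pages (all names hypothetical):

```shell
# two toy career pages
printf '<h3>History</h3>\n<h3>History</h3>\n' > index1.html
printf '<h3>English</h3>\n' > index2.html
# -c counts matching lines per file (close enough, when each heading sits on its own line);
# with multiple files, grep prefixes each count with the filename, colon-separated
grep -c '<h3>History' *.html > history-results.tsv
cat history-results.tsv
```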

Which gives us this for History as a degree option leading to particular careers (where the numbers indicate not the absolute importance of history, but more of an emphasis than anything else):

Career count of History
TeachingPage 2 of 3 4
Museums and Historical SitesPage 2 of 2 4
Heritage Conservation 4
TourismPage 2 of 2 2
ResearchPage 2 of 3 2
JournalismPage 2 of 3 2
Foreign ServicePage 2 of 3 2
EducationPage 2 of 3 2
Library and Information Science 2
Design 2
Archival Work 2
Architectural History 2
Archaeology 2

And here’s ‘Global’ (we have some new ‘Globalization’ programmes):

Career count of Global
Tourism 12
Teaching 12
Research 12
Public Service 12
Polling 12
Politics 12
Policy Analysis 12
Non-Profit Sector 12
Non-Governmental Organizations 12
Museums and Historical Sites 12
Media 12
Lobbying 12
Law 12
Journalism 12
International Relations 12
International Development 12
Government 12
Foreign Service 12
Finance 12
Education 12
Diplomacy 12
Consulting 12
Conservation 12
Civil Service 12
Business 12
Advocacy 12
Administration 12
Foreign ServicePage 2 of 3 6
FinancePage 2 of 3 6
TeachingPage 2 of 3 4
Museums and Historical SitesPage 2 of 2 4
TourismPage 2 of 2 4
ResearchPage 2 of 3 4
JournalismPage 2 of 3 4
EducationPage 2 of 3 4
Public ServicePage 2 of 3 4
PollingPage 2 of 2 4
PoliticsPage 2 of 3 4
Policy AnalysisPage 2 of 3 4
Non-Profit SectorPage 2 of 2 4
Non-Governmental OrganizationsPage 2 of 3 4
MediaPage 2 of 2 4
LobbyingPage 2 of 3 4
LawPage 2 of 2 4
International RelationsPage 2 of 2 4
International DevelopmentPage 2 of 3 4
GovernmentPage 2 of 4 4
DiplomacyPage 2 of 3 4
ConsultingPage 2 of 3 4
ConservationPage 2 of 2 4
Civil ServicePage 2 of 3 4
BusinessPage 2 of 5 4
AdvocacyPage 2 of 2 4
AdministrationPage 2 of 4 4
Management 2
International Trade 2
International Business 2
Humanitarian Aid 2
Human Resources 2
Broker 2
Banking 2

Interesting, non? 

Update October 17 – we shared these results with Admissions. There appears to have been a glitch in the system. See those ‘page 2 of 3’ or ‘page 3 of 5’ notes in the tables above? The entire lists were visible to wget, but not to the user of the site, leaving ‘history’ off the page of careers under ‘museums and historical sites’, for instance. The code was corrected, and now the invisible parts are visible. Also, in my correspondence with the folks at Admissions, they write “[we believe that] Global appears more than History because careers were listed under each of its 12 specializations. We will reconfigure the way the careers are listed for global and international studies so that it will reduce the number of times that it comes up.”

So all’s well that ends well. Thank you to Admissions for clearing up the confusion, fixing the glitch, and for pointing out my error which I am pleased to correct.


Web Seer and the Zeitgeist

I’ve been playing all evening with Web Seer, a toy that lets you contrast pairs of Google autocomplete suggestions. As is well known, Google autocomplete suggests completions based on what others have been searching for, given the pattern of text you are entering. This is sparking some thoughts on how I might use this to think about things like public archaeology or public history.

As Alan Liu put it,

But for now, enjoy the pairings that I’ve been feeding it….

Screen Shot 2014-08-28 at 8.58.44 PM

In ancient/modern


Screen Shot 2014-08-28 at 8.57.17 PM

Greek versus Roman


Screen Shot 2014-08-28 at 8.52.25 PM

What School Should I Go To?


Screen Shot 2014-08-28 at 8.46.18 PM

Games and Literature


Screen Shot 2014-08-28 at 8.42.20 PM

Getting Down to Brass Tacks


Screen Shot 2014-08-28 at 8.35.24 PM

Drunkards and Teetotallers, never the twain shall meet


Screen Shot 2014-08-28 at 8.09.24 PM

Historians v Archaeologists, a Google Cage Match


Screen Shot 2014-08-28 at 7.55.55 PM

The DH Dilemma


Screen Shot 2014-08-28 at 7.55.08 PM



Screen Shot 2014-08-28 at 7.52.58 PM

Two Solitudes Redux

SAA 2015: Macroscopic approaches to archaeological histories: Insights into archaeological practice from digital methods

Ben Marwick and I are organizing a session for the SAA2015 (the 80th edition, this year in San Francisco) on “Macroscopic approaches to archaeological histories: Insights into archaeological practice from digital methods”. It’s a pretty big tent. Below is the session ID and the abstract. If this sounds like something you’d be interested in, why don’t you get in touch?

Session ID 743.

The history of archaeology, like most disciplines, is often presented as a sequence of influential individuals and a discussion of their greatest hits in the literature.  Two problems with this traditional approach are that it sidelines the majority of participants in the archaeological literature who are excluded from these discussions, and it does not capture the conversations outside of the canonical literature.  Recently developed computationally intensive methods as well as creative uses of existing digital tools can address these problems by efficiently enabling quantitative analyses of large volumes of text and other digital objects, and enabling large scale analysis of non-traditional research products such as blogs, images and other media. This session explores these methods, their potentials, and their perils, as we employ so-called ‘big data’ approaches to our own discipline.


Like I said, if that sounds like something you’d be curious to know more about, ping me.

Topic Modeling Greek Consumerism

I’m experimenting. Here’s what I did today.

1. Justin Walsh published the data on which his book, ‘Consumerism in the Ancient World’, rests.

2. I downloaded it, and decided I would topic model it. The table, ‘Greek Vases’, has one row = one vase. Let’s start with that, though I think it might be more useful/illuminating to decide that ‘document’ might mean ‘site’ or ‘context’. But first things first; let’s sort out the workflow.

3. I deleted all columns with ‘true’ or ‘false’ values – they struck me as not useful. I concatenated all the remaining columns into a single ‘text’ column. Then, per the description on the Mallet package page for R, I added a new column ‘class’, which I left blank. So I have ‘id’, ‘class’, ‘text’. All of Walsh’s information is in the ‘text’ field.
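That reshaping (join the descriptive columns into one ‘text’ field, with an empty ‘class’) could be done many ways. Here's one sketch in awk, with toy rows since I don't have Walsh's actual column layout in front of me:

```shell
# toy rows standing in for the 'Greek Vases' table: id + descriptive columns
printf 'v1\tfrance\tcup\nv2\tspain\tkrater\n' > vases.tsv
# emit: id <tab> (empty class) <tab> remaining columns joined into one 'text' field
awk -F'\t' '{
  printf "%s\t\t", $1
  for (i = 2; i <= NF; i++) printf "%s%s", $i, (i < NF ? " " : "\n")
}' vases.tsv > modified-vases2.txt
cat modified-vases2.txt
```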

4. I ran this code in R, using R studio:

## from Mimno's example code for the 'mallet' R package
library(mallet)

## Create a wrapper for the data with three elements, one for each column.
## R does some type inference, and will guess wrong, so give it hints with "colClasses".
## Note that "id" and "text" are special fields -- mallet will look there for input.
## "class" is arbitrary. We will only use that field on the R side.
documents <- read.table("modified-vases2.txt", col.names=c("id", "class", "text"),
                        colClasses=rep("character", 3), sep="\t", quote="")
## Create a mallet instance list object. Right now I have to specify the stoplist
## as a file, I can't pass in a list from R.
## This function has a few hidden options (whether to lowercase, how we
## define a token). See ?mallet.import for details.
mallet.instances <- mallet.import(documents$id, documents$text, "/Users/shawngraham/Desktop/data mining and tools/stoplist.csv",
                                  token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")
## Create a topic trainer object.
num.topics <- 20
topic.model <- MalletLDA(num.topics)

## Load our documents. We could also pass in the filename of a
## saved instance list file that we build from the command-line tools.
topic.model$loadDocuments(mallet.instances)

## Get the vocabulary, and some statistics about word frequencies.
## These may be useful in further curating the stopword list.
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)

## Optimize hyperparameters every 20 iterations,
## after 50 burn-in iterations.
topic.model$setAlphaOptimization(20, 50)

## Now train a model. Note that hyperparameter optimization is on, by default.
## We can specify the number of iterations. Here we'll use a large-ish round number.
topic.model$train(200)

## NEW: run through a few iterations where we pick the best topic for each token,
## rather than sampling from the posterior distribution.
topic.model$maximize(10)

## Get the probability of topics in documents and the probability of words in topics.
## By default, these functions return raw word counts. Here we want probabilities,
## so we normalize, and add "smoothing" so that nothing has exactly 0 probability.
doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)
topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)

## What are the top words in topic 7?
## Notice that R indexes from 1, so this will be the topic that mallet called topic 6.
mallet.top.words(topic.model, topic.words[7,])

## Show the first few documents with at least 5% of both topic 7 and topic 10
head(documents[ doc.topics[,7] > 0.05 & doc.topics[,10] > 0.05, ])

## End of Mimno's sample script (not run)

### from my other script; above was Mimno's example script
topic.docs <- t(doc.topics)
topic.docs <- topic.docs / rowSums(topic.docs)
write.csv(topic.docs, "vases-topics-docs.csv")

## Get a vector containing short names for the topics
topics.labels <- rep("", num.topics)
for (topic in 1:num.topics) topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=5)$words, collapse=" ")

# have a look at keywords for each topic
write.csv(topics.labels, "vases-topics-labels.csv") ## "C:\\Mallet-2.0.7\\topics-labels.csv")

### do word clouds of the topics
library(wordcloud)
for(i in 1:num.topics){
  topic.top.words <- mallet.top.words(topic.model, topic.words[i,], 25)
  wordcloud(topic.top.words$words, topic.top.words$weights,
            c(4,.8), rot.per=0, random.order=F)
}

And this is what I get:
Topic # Label
1 france greek west eating grey
2 spain ampurias neapolis girona arf
3 france rune herault colline nissan-lez-ens
4 spain huelva east greek drinking
5 france aude drinking montlaures cup
6 spain malaga settlement cup drinking
7 france drinking bouches-du-rhone settlement cup
8 france cup stemmed herault bessan
9 france marseille massalia bouches-du-rhone storage
10 spain ullastret settlement girona puig
11 france settlement mailhac drinking switzerland
12 spain badajoz cup stemless castulo
13 spain ampurias settlement girona neapolis
14 france beziers drinking cup pyrenees
15 spain krater bell arf drinking
16 transport amphora france gard massaliote
17 france settlement saint-blaise bouches-du-rhone greek
18 france marseille massalia west bouches-du-rhone
19 spain jaen drinking cemetery castulo
20 spain settlement abg eating alicante

The three-letter acronyms are ware types. The original data had location, context, ware, purpose, and dates. I still need to figure out how to get Mallet (either on the command line or in R) to treat numerals as words, but that’s something I can ignore for the moment. So what next? Map this, I guess, in physical and/or temporal space, and resolve the problem of what a ‘document’ really is, for archaeological topic modeling. Here, look at the word clouds generated at the end of the script whilst I ruminate. And also a flow diagram. What it shows, I know not. Exploration, eh?



Extracting Text from PDFs; Doing OCR; all within R

I am a huge fan of Ben Marwick. He has so many useful pieces of code for the programming archaeologist or historian!

Edit July 17 1.20 pm: Mea culpa: I originally titled this post, ‘Doing OCR within R’. But, what I’m describing below – that’s not OCR. That’s extracting text from pdfs. It’s very fast and efficient, but it’s not OCR. So, brain fart. But I leave the remainder of the post as it was. For command line OCR (really, actual OCR) on a Mac, see the link to Ben Schmidt’s piece at the bottom. Sorry.

Edit July 17 10 pm: I am now an even bigger fan of Ben’s. He’s updated his script to either a) perform OCR by calling Tesseract from within R or b) grab the text layer from a pdf image. So this post no longer misleads. Thank you Ben!

Optical Character Recognition, or OCR, is something that most historians will need to use at some point when working with digital documents. That is, you will often encounter pdf files of texts that you wish to work with in more detail (digitized newspapers, for instance). Often, there is a layer within the pdf image containing the text already: if you can highlight text by clicking and dragging over the image, you can copy and paste the text from the image. But this is often not the case; or worse, you have tens or hundreds or even thousands of documents to examine. There is commercial software that can do this for you, but it can be quite expensive.

One way of doing OCR on your own machine with free tools is to use Ben Marwick’s pdf-2-text-or-csv.r script for the R programming language. Marwick’s script uses R as a wrapper for the Xpdf programme from Foolabs. Xpdf is a pdf viewer, much like Adobe Acrobat. Using Xpdf on its own can be quite tricky, so Marwick’s script will feed your pdf files to Xpdf, and have Xpdf perform the text extraction. For OCR, the script acts as a wrapper for Tesseract, which is not an easy piece of software to work with. There’s a final part to Marwick’s script that will pre-process the resulting text files for various kinds of text analysis, but you can ignore that part for now.

  1. Make sure you have R downloaded and installed on your machine (available from
  2. Make sure you have Xpdf downloaded and installed (available from ). Make a note of where you unzipped it. In particular, you are looking for the location of the file ‘pdftotext.exe’. Also, make sure you know where ‘pdftoppm’ is located too (it’s in that download).
  3. Download and install Tesseract 
  4. Download and install Imagemagick
  5. Have a folder with the pdfs you wish to extract text from.
  6. Open R, and paste Marwick’s script into the script editor window.
  7. Make sure you adjust the path for “dest” and the path to “pdftotext.exe” to the correct location
  8. Run the script! But read the script carefully and make sure you run the bits you need. Ben has commented out the code very well, so it should be fairly straightforward.

Obviously, the above is framed for Windows users. For Mac users, the steps are all the same, except that you use the versions of Xpdf, Tesseract, and Imagemagick built for OS X, and your paths to the other software are going to be different. And of course you’re using R for Mac, which means the ‘shell’ commands have to be swapped to ‘system’! (As of July 2014, the Xpdf file for Mac that you want is at ) I’m not 100% certain of any other Mac/PC differences in the R script – these should only exist at those points where R is calling on other resources (rather than on R packages). Caveat lector, eh?

The full R script may be found in Ben’s gist. So here is the section that does the text extraction from pdf images (ie, where you can already copy and highlight text in the pdf):

###Note: there's some preprocessing that I (sg) haven't shown here: go see the original gist

################# Wait! ####################################
# Before proceeding, make sure you have a copy of pdf2text
# on your computer! Details:
# Download:

# Tell R what folder contains your 1000s of PDFs
dest <- "G:/somehere/with/many/PDFs"

# make a vector of PDF file names
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

# now there are a few options...

############### PDF to TXT #################################
# convert each PDF file that is named in the vector into a text file
# text file is created in the same directory as the PDFs
# note that my pdftotext.exe is in a different location to yours
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"', paste0('"', i, '"')), wait = FALSE) )

# where are the txt files you just made?
dest # in this folder

And here’s the bit that does the OCR

                     ##### Wait! #####
# Before proceeding, make sure you have a copy of Tesseract
# on your computer! Details & download:
# and a copy of ImageMagick:
# and a copy of pdftoppm on your computer!
# Download:
# And then after installing those three, restart to
# ensure R can find them on your path.
# And note that this process can be quite slow...

# PDF filenames can't have spaces in them for these operations
# so let's get rid of the spaces in the filenames

sapply(myfiles, FUN = function(i){
  file.rename(from = i, to =  paste0(dirname(i), "/", gsub(" ", "", basename(i))))
})
# get the PDF file names without spaces
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

# Now we can do the OCR to the renamed PDF files. Don't worry
# if you get messages like 'Config Error: No display
# font for...' it's nothing to worry about

lapply(myfiles, function(i){
  # convert pdf to ppm (an image format), using pdftoppm
  shell(shQuote(paste0("pdftoppm ", i, " -f 1 -l 10 -r 600 ocrbook")))
  # convert ppm to tif ready for tesseract
  shell(shQuote(paste0("convert *.ppm ", i, ".tif")))
  # convert tif to text file
  shell(shQuote(paste0("tesseract ", i, ".tif ", i, " -l eng")))
  # delete tif file
  file.remove(paste0(i, ".tif" ))
})

# where are the txt files you just made?
dest # in this folder

Besides showing how to do your own OCR, Marwick’s script shows some of the power of R for doing more than statistics. Mac users might be interested in Ben Schmidt’s tutorial ‘Command-line OCR on a Mac’ from his digital history graduate seminar at Northeastern University.

Government of Canada Edits

In recent days, a number of twitterbots have been set up to monitor changes to Wikipedia emerging from government IP address blocks. Seems to me that here is a window for data mining the mindset of government. Of course, there’s nothing to indicate that anything untoward is being done by the Government itself; I live in Ottawa, and I know what civil servants can get up to on their lunch break.

But let’s look at the recent changes documented by @gccaedits; I’ve taken screenshots below, but you can just scroll through the @gccaedits feed. Actually, come to think of it, someone should be archiving those tweets, too. It’s only been operational for something like 3 days, but already we see an interesting grammar/football fanatic; someone with opinions on Amanda Knox; someone setting military history right; and someone fixing the German version of René Lévesque’s page.

Hmmm. Keep your eyes on this, especially as next year is an election year…

[Nine screenshots of @gccaedits tweets, 14 July 2014]

Setting up your own Data Refinery

Refinery at Oxymoron, by Wyatt Wellman, cc by-sa 2.0. Flickr.


I’ve been playing with a Mac. I’ve been a windows person for a long time, so bear with me.

I’m setting up a number of platforms locally for data mining. But since what I’m *really* doing is smelting the ore of data scraped using things like Outwit Hub or the like (the ‘mining operation’, in this tortured analogy), what I’m setting up is a data refinery. Web-based services are awesome, but if you’re dealing with sensitive data (like oral history interviews, for example) you need something local – this will also help with your ethics board review too. Onwards!

Voyant Tools
You can now set Voyant-Tools up locally, keeping your data safe and sound. The documentation and downloads are all on this page. This was an incredibly easy setup on Mac. Unzip, double-click voyant-tools.jar, and boom, you’ve got Voyant-Tools puttering away in your browser at a localhost address. You can also hit the cogwheel icon in the top right to run your corpus through all sorts of other tools that come with Voyant but aren’t there on the main layout. You’ll want ‘export corpus with other tool’. You’ll end up with a URL pointing at that tool; you can then swap the name of any other tool into that URL (to save time). RezoViz, by the way, uses named entity extraction to construct a network of entities mentioned in the same documents. So if you upload your corpus in small-ish chunks (paragraphs; pages; every 1000 words, whatever) you can see how it all ties together this way. From the cogwheel icon on the RezoViz layout, you can get a .net file which you can then import into Gephi. How frickin’ cool is that?

Overview Project

Topic modeling is all the rage, and yes, you should have MALLET or the Stanford TMT or R on your machine. But sometimes it’s nice to just see something rather like a dendrogram of folders with progressively finer levels of self-similarity. Overview uses term frequency–inverse document frequency (tf-idf) weightings to figure out the similarity of documents. The instructions (for all platforms) are here. It’s not quite as painless as Voyant, but it’s pretty darn close. You’ll need to have Postgres – download, install, run it once, then download Overview. You need to have Java 7. (At some point, you’ll probably need to look into running multiple versions of Java, if you continue to add elements to your refinery). Then:

  1. Ctrl-Click or Right-click on Overview for Mac OS X.command and select Open. When the dialog box asks if you are sure you want to open the application, click on the Open button. From then on, you can start Overview by double-clicking on Overview for Mac OS X.command.
  2. Browse to http://localhost:9000 and log in with the default credentials given in the instructions.

And you now have Overview running. You can do many many things with Overview – it’ll read pdfs, for instance, which you can then export within a csv file. You can tag folders and export those tags, to do some fun visualizations with the next part of your refinery, RAW.

Tags exported from Overview

Tags visualized. This wasn’t done with Raw (but rather, a commercial piece of software), but you get the idea.


Flow diagram in Raw, using sample movie data


Raw does wonderful things with CSV formatted data, all in your browser. You can use the webapp version; nothing gets communicated to the server. But, still, it’s nice to keep it close to home. So, you can get Raw source code here. It’s a little trickier to install than the others. First thing: you’ll need Bower. But you can’t install Bower without Node.js and npm. So, go to Node.js and hit install. Then, download Raw. Unzip Raw and go to that folder. To install Bower, type

$ sudo npm install -g bower

Once the dust settles, there’s a bunch of dependencies to install. Remember, you’re in the Raw folder. Type:

$ bower install

When the dust clears again, and assuming you have Python installed on your machine, fire Raw up in a server:

$ python -m SimpleHTTPServer 4000

(If you don’t have python, well, go get python. I’ll wait. And if your python is version 3, the equivalent server command is python3 -m http.server 4000.) Then, in your browser, go to

http://localhost:4000
And you can now do some funky visualizations of your data. There are a number of chart types packaged with Raw, but you can also develop your own – here’s the documentation. Michelle Moravec has been doing some lovely work visualizing her historical research using Raw. You should check it out.


Your Open Source Data Refinery

With these three pieces of data refinery infrastructure installed on your machine, or in your local digital history computer lab, you’ll have no excuse not to start adding some distant reading perspective to your method. Go. Do it now.

The Web of Authors for Wikipedia’s Archaeology Page

I’m playing with a new toy, WikiImporter, which allows me to download the network of authorship on media-wiki powered sites. I fired it up, set it to grab the user-article network and “The Hyperlink Coauthorship network will analyze all the links found in the seed article and create an edge between each user that edited the article found in that link and the article”.

Naturally, I pointed it at ‘archaeology’ on Wikipedia.  I’ve posted the resulting two mode network on figshare for all and sundry to analyze.

I also asked it to download the article to article links (which is slightly different than my spidering results, as my spiders also included the wiki pages themselves, like the ‘this page is a stub’ or ‘this page needs citations’, which gives me an interesting perspective on the quality of the articles. More on that another day). This file is also on figshare here.

Just remember to cite the files. Enjoy!