Web Seer and the Zeitgeist

I’ve been playing all evening with Web Seer, a toy that lets you contrast pairs of Google autocomplete suggestions. As is well known, Google autocomplete suggests completions based on what others have searched for, given the pattern of text you are entering. It has me thinking about how I might use it to explore things like public archaeology or public history.

As Alan Liu put it,

But for now, enjoy the pairings that I’ve been feeding it….

In ancient/modern

Greek versus Roman

What School Should I Go To?

Games and Literature

Getting Down to Brass Tacks

Drunkards and Teetotallers, never the twain shall meet

Historians v Archaeologists, a Google Cage Match

The DH Dilemma

Future/Perfect

Two Solitudes Redux

SAA 2015: Macroscopic approaches to archaeological histories: Insights into archaeological practice from digital methods

Ben Marwick and I are organizing a session for the SAA2015 (the 80th edition, this year in San Francisco) on “Macroscopic approaches to archaeological histories: Insights into archaeological practice from digital methods”. It’s a pretty big tent. Below is the session ID and the abstract. If this sounds like something you’d be interested in, why don’t you get in touch?

Session ID 743.

The history of archaeology, like most disciplines, is often presented as a sequence of influential individuals and a discussion of their greatest hits in the literature.  Two problems with this traditional approach are that it sidelines the majority of participants in the archaeological literature who are excluded from these discussions, and it does not capture the conversations outside of the canonical literature.  Recently developed computationally intensive methods as well as creative uses of existing digital tools can address these problems by efficiently enabling quantitative analyses of large volumes of text and other digital objects, and enabling large scale analysis of non-traditional research products such as blogs, images and other media. This session explores these methods, their potentials, and their perils, as we employ so-called ‘big data’ approaches to our own discipline.

—-

Like I said, if that sounds like something you’d be curious to know more about, ping me.

Topic Modeling Greek Consumerism

I’m experimenting. Here’s what I did today.

1. Justin Walsh published the data on which his book, ‘Consumerism in the Ancient World’, rests.

2. I downloaded it, and decided I would topic model it. The table, ‘Greek Vases’, has one row = one vase. Let’s start with that, though I think it might be more useful/illuminating to decide that ‘document’ might mean ‘site’ or ‘context’. But first things first; let’s sort out the workflow.

3. I deleted all columns with ‘true’ or ‘false’ values; they struck me as not useful. I concatenated all the columns into a single ‘text’ column. Then, per the description on the Mallet package page for R, I added a new column ‘class’, which I left blank. So I have ‘id’, ‘class’, ‘text’. All of Walsh’s information is in the ‘text’ field.
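
If you’d rather script that step than do it by hand, a minimal sketch in R would look something like the following (the input filename, and the assumption that the vase id sits in the first column, are mine, not Walsh’s):

## sketch: build the three-column file that mallet.import expects
vases <- read.csv("greek-vases.csv", stringsAsFactors = FALSE)  # hypothetical export of the 'Greek Vases' table
## drop the true/false columns
keep <- !sapply(vases, function(x) is.logical(x) || all(tolower(as.character(x)) %in% c("true", "false")))
vases <- vases[, keep]
## concatenate everything except the id into one 'text' field, with an empty 'class'
out <- data.frame(id = vases[[1]],
                  class = "",
                  text = apply(vases[, -1, drop = FALSE], 1, paste, collapse = " "),
                  stringsAsFactors = FALSE)
write.table(out, "modified-vases2.txt", sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)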

4. I ran this code in R, using R studio:

## from http://cran.r-project.org/web/packages/mallet/mallet.pdf
library(mallet)
## Create a wrapper for the data with three elements, one for each column.
## R does some type inference, and will guess wrong, so give it hints with "colClasses".
## Note that "id" and "text" are special fields -- mallet will look there for input.
## "class" is arbitrary. We will only use that field on the R side.
documents <- read.table("modified-vases2.txt", col.names=c("id", "class", "text"),
                        colClasses=rep("character", 3), sep="\t", quote="")
## Create a mallet instance list object. Right now I have to specify the stoplist
## as a file, I can't pass in a list from R.
## This function has a few hidden options (whether to lowercase, how we
## define a token). See ?mallet.import for details.
mallet.instances <- mallet.import(documents$id, documents$text, "/Users/shawngraham/Desktop/data mining and tools/stoplist.csv",
                                  token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")
## Create a topic trainer object.
num.topics <- 20
topic.model <- MalletLDA(num.topics)

## Load our documents. We could also pass in the filename of a
## saved instance list file that we build from the command-line tools.
topic.model$loadDocuments(mallet.instances)

## Get the vocabulary, and some statistics about word frequencies.
## These may be useful in further curating the stopword list.
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)

## Optimize hyperparameters every 20 iterations,
## after 50 burn-in iterations.
topic.model$setAlphaOptimization(20, 50)

## Now train a model. Note that hyperparameter optimization is on, by default.
## We can specify the number of iterations. Here we'll use a large-ish round number.
topic.model$train(200)

## NEW: run through a few iterations where we pick the best topic for each token,
## rather than sampling from the posterior distribution.
topic.model$maximize(10)

## Get the probability of topics in documents and the probability of words in topics.
## By default, these functions return raw word counts. Here we want probabilities,
## so we normalize, and add "smoothing" so that nothing has exactly 0 probability.
doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)
topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)

## What are the top words in topic 7?
## Notice that R indexes from 1, so this will be the topic that mallet called topic 6.
mallet.top.words(topic.model, topic.words[7,])

## Show the first few documents with at least 5% probability for both topic 7 and topic 10.
head(documents[ doc.topics[,7] > 0.05 & doc.topics[,10] > 0.05, ])

## End of Mimno's sample script

### From here on, my own script; above was Mimno's example
topic.docs <- t(doc.topics)
topic.docs <- topic.docs / rowSums(topic.docs)
write.csv(topic.docs, "vases-topics-docs.csv" ) 

## Get a vector containing short names for the topics
topics.labels <- rep("", num.topics)
for (topic in 1:num.topics) topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=5)$words, collapse=" ")

# have a look at keywords for each topic
topics.labels
write.csv(topics.labels, "vases-topics-labels.csv")

### do word clouds of the topics
library(wordcloud)
for(i in 1:num.topics){
  topic.top.words <- mallet.top.words(topic.model,
                                      topic.words[i,], 25)
  print(wordcloud(topic.top.words$words,
                  topic.top.words$weights,
                  c(4,.8), rot.per=0,
                  random.order=F))
}

And this is what I get:
Topic # Label
1 france greek west eating grey
2 spain ampurias neapolis girona arf
3 france rune herault colline nissan-lez-ens
4 spain huelva east greek drinking
5 france aude drinking montlaures cup
6 spain malaga settlement cup drinking
7 france drinking bouches-du-rhone settlement cup
8 france cup stemmed herault bessan
9 france marseille massalia bouches-du-rhone storage
10 spain ullastret settlement girona puig
11 france settlement mailhac drinking switzerland
12 spain badajoz cup stemless castulo
13 spain ampurias settlement girona neapolis
14 france beziers drinking cup pyrenees
15 spain krater bell arf drinking
16 transport amphora france gard massaliote
17 france settlement saint-blaise bouches-du-rhone greek
18 france marseille massalia west bouches-du-rhone
19 spain jaen drinking cemetery castulo
20 spain settlement abg eating alicante

The three-letter acronyms are ware types. The original data had location, context, ware, purpose, and dates. I still need to figure out how to get Mallet (either on the command line or in R) to treat numerals as words, but that’s something I can ignore for the moment. So what next? Map this, I guess, in physical and/or temporal space, and resolve the problem of what a ‘document’ really is for archaeological topic modeling. Here, look at the word clouds generated at the end of the script, and the flow diagram, whilst I ruminate. What the flow diagram shows, I know not. Exploration, eh?
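
On the numerals problem, one thing I might try (untested, just a sketch) is widening the token definition that mallet.import uses, so that digits (\p{N}) count as word characters:

mallet.instances <- mallet.import(documents$id, documents$text,
                                  "/Users/shawngraham/Desktop/data mining and tools/stoplist.csv",
                                  token.regexp = "[\\p{L}\\p{N}][\\p{L}\\p{N}\\p{P}]*[\\p{L}\\p{N}]")

That would keep dates and measurements as tokens, at the cost of a bit more noise to stopword away.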


Extracting Text from PDFs; Doing OCR; all within R

I am a huge fan of Ben Marwick. He has so many useful pieces of code for the programming archaeologist or historian!

Edit July 17 1.20 pm: Mea culpa: I originally titled this post, ‘Doing OCR within R’. But, what I’m describing below – that’s not OCR. That’s extracting text from pdfs. It’s very fast and efficient, but it’s not OCR. So, brain fart. But I leave the remainder of the post as it was. For command line OCR (really, actual OCR) on a Mac, see the link to Ben Schmidt’s piece at the bottom. Sorry.

Edit July 17 10 pm: I am now an even bigger fan of Ben’s. He’s updated his script to either a) perform OCR by calling Tesseract from within R or b) grab the text layer from a pdf image. So this post no longer misleads. Thank you Ben!

Optical Character Recognition, or OCR, is something that most historians will need to use at some point when working with digital documents. That is, you will often encounter pdf files of texts that you wish to work with in more detail (digitized newspapers, for instance). Often, there is a layer within the pdf image containing the text already: if you can highlight text by clicking and dragging over the image, you can copy and paste the text from the image. But this is often not the case, or worse, you have tens or hundreds or even thousands of documents to examine. There is commercial software that can do this for you, but it can be quite expensive.

One way of doing OCR on your own machine with free tools is to use Ben Marwick’s pdf-2-text-or-csv.r script for the R programming language. Marwick’s script uses R as a wrapper for the Xpdf programme from Foolabs. Xpdf is a pdf viewer, much like Adobe Acrobat. Using Xpdf on its own can be quite tricky, so Marwick’s script will feed your pdf files to Xpdf, and have Xpdf perform the text extraction. For OCR, the script acts as a wrapper for Tesseract, which is not an easy piece of software to work with. There’s a final part to Marwick’s script that will pre-process the resulting text files for various kinds of text analysis, but you can ignore that part for now.

  1. Make sure you have R downloaded and installed on your machine (available from http://www.r-project.org/)
  2. Make sure you have Xpdf downloaded and installed (available from ftp://ftp.foolabs.com/pub/xpdf/xpdfbin-win-3.04.zip ). Make a note of where you unzipped it. In particular, you are looking for the location of the file ‘pdftotext.exe’. Also, make sure you know where ‘pdftoppm’ is located too (it’s in that download).
  3. Download and install Tesseract https://code.google.com/p/tesseract-ocr/ 
  4. Download and install Imagemagick http://www.imagemagick.org/
  5. Have a folder with the pdfs you wish to extract text from.
  6. Open R, and paste Marwick’s script into the script editor window.
  7. Make sure you adjust the path for “dest” and the path to “pdftotext.exe” to the correct location
  8. Run the script! But read the script carefully and make sure you run the bits you need. Ben has commented out the code very well, so it should be fairly straightforward.

Obviously, the above is framed for Windows users. For Mac users, the steps are all the same, except that you use the versions of Xpdf, Tesseract, and Imagemagick built for OS X, and your paths to the other software are going to be different. And of course you’re using R for Mac, which means the ‘shell’ commands have to be swapped to ‘system’! (As of July 2014, the Xpdf file for Mac that you want is at ftp://ftp.foolabs.com/pub/xpdf/xpdfbin-mac-3.04.tar.gz ) I’m not 100% certain of any other Mac/PC differences in the R script – these should only exist at those points where R is calling on other resources (rather than on R packages). Caveat lector, eh?
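
To make that concrete, here is the kind of swap I mean, using the pdftoppm line from inside the lapply loop further below (a sketch; I haven’t run the Mac variant):

# Windows, as in Ben's script:
shell(shQuote(paste0("pdftoppm ", i, " -f 1 -l 10 -r 600 ocrbook")))
# Mac: the same command string, handed to system() instead:
system(paste0("pdftoppm ", i, " -f 1 -l 10 -r 600 ocrbook"))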

The full R script may be found at https://gist.github.com/benmarwick/11333467. So here is the section that does the text extraction from pdf images (ie, you can copy and highlight text in the pdf):

###Note: there's some preprocessing that I (sg) haven't shown here: go see the original gist

################# Wait! ####################################
# Before proceeding, make sure you have a copy of pdftotext
# on your computer! Details: https://en.wikipedia.org/wiki/Pdftotext
# Download: http://www.foolabs.com/xpdf/download.html

# Tell R what folder contains your 1000s of PDFs
dest <- "G:/somehere/with/many/PDFs"

# make a vector of PDF file names
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

# now there are a few options...

############### PDF to TXT #################################
# convert each PDF file that is named in the vector into a text file
# text file is created in the same directory as the PDFs
# note that my pdftotext.exe is in a different location to yours
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"', paste0('"', i, '"')), wait = FALSE) )

# where are the txt files you just made?
dest # in this folder

And here’s the bit that does the OCR

                     ##### Wait! #####
# Before proceeding, make sure you have a copy of Tesseract
# on your computer! Details & download:
# https://code.google.com/p/tesseract-ocr/
# and a copy of ImageMagick: http://www.imagemagick.org/
# and a copy of pdftoppm on your computer!
# Download: http://www.foolabs.com/xpdf/download.html
# And then after installing those three, restart to
# ensure R can find them on your path.
# And note that this process can be quite slow...

# PDF filenames can't have spaces in them for these operations
# so let's get rid of the spaces in the filenames

sapply(myfiles, FUN = function(i){
  file.rename(from = i, to =  paste0(dirname(i), "/", gsub(" ", "", basename(i))))
})

# get the PDF file names without spaces
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

# Now we can do the OCR to the renamed PDF files. Don't worry
# if you get messages like 'Config Error: No display
# font for...' it's nothing to worry about

lapply(myfiles, function(i){
  # convert pdf to ppm (an image format), using
  shell(shQuote(paste0("pdftoppm ", i, " -f 1 -l 10 -r 600 ocrbook")))
  # convert ppm to tif ready for tesseract
  shell(shQuote(paste0("convert *.ppm ", i, ".tif")))
  # convert tif to text file
  shell(shQuote(paste0("tesseract ", i, ".tif ", i, " -l eng")))
  # delete tif file
  file.remove(paste0(i, ".tif" ))
  })

# where are the txt files you just made?
dest # in this folder

Besides showing how to do your own OCR, Marwick’s script shows some of the power of R for doing more than statistics. Mac users might be interested in Ben Schmidt’s tutorial ‘Command-line OCR on a Mac’ from his digital history graduate seminar at Northeastern University, online at http://benschmidt.org/dighist13/?page_id=129.

Government of Canada Edits

In recent days, a number of twitterbots have been set up to monitor changes to Wikipedia emerging from government IP address blocks. Seems to me that here is a window for data mining the mindset of government. Of course, there’s nothing to indicate that anything untoward is being done by the Government itself; I live in Ottawa, and I know what civil servants can get up to on their lunch break.

But let’s look at the recent changes documented by https://twitter.com/gccaedits; I’ve taken screenshots below, but you can just scroll through gccaedits’ feed. Actually, come to think of it, someone should be archiving those tweets, too. It’s only been operational for something like 3 days, but already, we see an interesting grammar/football fanatic; someone with opinions on Amanda Knox; someone setting military history right, and someone fixing the German version of Rene Levesque’s page.
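
If you did want to archive those tweets yourself, a rough sketch with the twitteR package in R might look like this (you would need your own Twitter API keys for the placeholder values; I haven’t actually run this against @gccaedits):

library(twitteR)
## authenticate with your own credentials (placeholders here)
setup_twitter_oauth("your_consumer_key", "your_consumer_secret",
                    "your_access_token", "your_access_secret")
## pull as much of the timeline as the API allows and dump it to csv
gcc <- userTimeline("gccaedits", n = 3200, includeRts = TRUE)
write.csv(twListToDF(gcc), "gccaedits-archive.csv", row.names = FALSE)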

Hmmm. Keep your eyes on this, especially as next year is an election year…

Screenshots of the @gccaedits tweets, captured 14 July 2014.

Setting up your own Data Refinery

Refinery at Oxymoron, by Wyatt Wellman, cc by-sa 2.0. Flickr.

I’ve been playing with a Mac. I’ve been a windows person for a long time, so bear with me.

I’m setting up a number of platforms locally for data mining. But since what I’m *really* doing is smelting the ore of data scraped using things like Outwit Hub or Import.io (the ‘mining operation’, in this tortured analogy), what I’m setting up is a data refinery. Web-based services are awesome, but if you’re dealing with sensitive data (like oral history interviews, for example) you need something local – this will also help with your ethics board review. Onwards!

Voyant-Tools

You can now set Voyant-Tools up locally, keeping your data safe and sound. The documentation and downloads are all on this page. This was an incredibly easy setup on Mac. Unzip, double-click voyant-tools.jar, and boom, you’ve got Voyant-Tools puttering away in your browser. It’ll be at http://127.0.0.1:8888. You can also hit the cogwheel icon in the top right to run your corpus through all sorts of other tools that come with Voyant but aren’t there on the main layout. You’ll want ‘export corpus with other tool’. You’ll end up with a url something like http://127.0.0.1:8888/tool/RezoViz/?corpus=1404405786475.8384. You can then swap the name of any other tool into that URL (to save time). RezoViz, by the way, uses named entity extraction to construct a network of entities mentioned in the same documents. So if you upload your corpus in small-ish chunks (paragraphs; pages; every 1000 words, whatever) you can see how it all ties together this way. From the cogwheel icon on the RezoViz layout, you can get a .net file which you can then import into Gephi. How frickin’ cool is that?

Overview Project


Topic modeling is all the rage, and yes, you should have MALLET or the Stanford TMT or R on your machine. But sometimes, it’s nice to just see something rather like a dendrogram of folders with progressively finer levels of self-similarity. Overview does term frequency – inverse document frequency (tf-idf) weighting to figure out the similarity of documents (a toy illustration of the idea follows below). The instructions (for all platforms) are here. It’s not quite as painless as Voyant, but it’s pretty darn close. You’ll need to have Postgres – download, install, run it once, then download Overview. You need to have Java 7. (At some point, you’ll probably need to look into running multiple versions of Java, if you continue to add elements to your refinery). Then:

  1. Ctrl-Click or Right-click on Overview for Mac OS X.command and select Open. When the dialog box asks if you are sure you want to open the application, click on the Open button. From then on, you can start Overview by double-clicking on Overview for Mac OS X.command.
  2. Browse to http://localhost:9000 and log in as admin@overviewproject.org with password admin@overviewproject.org.

And you now have Overview running. You can do many many things with Overview – it’ll read pdfs, for instance, which you can then export within a csv file. You can tag folders and export those tags, to do some fun visualizations with the next part of your refinery, RAW.
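
As for that tf-idf step: here is a toy illustration of the idea in R with the tm package (nothing to do with Overview’s own internals, and the three ‘documents’ are made up):

library(tm)
docs <- VCorpus(VectorSource(c("pots and pans",
                               "pots and amphorae",
                               "amphorae for wine")))
## weight terms by tf-idf: words common to every document sink, distinctive words rise
dtm <- DocumentTermMatrix(docs, control = list(weighting = weightTfIdf))
as.matrix(dtm)  # rows = documents, columns = terms, cells = tf-idf weights

Documents that share high-weight terms end up near each other, which is the kind of similarity Overview uses to build its folders.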

Tags exported from Overview

Tags visualized. This wasn’t done with Raw (but rather, a commercial piece of software), but you get the idea.

RAW

Flow diagram in Raw, using sample movie data

Raw does wonderful things with CSV formatted data, all in your browser. You can use the webapp version; nothing gets communicated to the server. But, still, it’s nice to keep it close to home. So, you can get Raw source code here. It’s a little trickier to install than the others. First thing: you’ll need Bower. But you can’t install Bower without Node.js and npm. So, go to Node.js and hit install. Then, download Raw. Unzip Raw and go to that folder. To install Bower, type

$ sudo npm install -g bower

Once the dust settles, there’s a bunch of dependencies to install. Remember, you’re in the Raw folder. Type:

$ bower install

When the dust clears again, and assuming you have Python installed on your machine, fire Raw up in a server:

$ python -m SimpleHTTPServer 4000

(If you don’t have python, well, go get python. I’ll wait). Then in your browser go to

localhost:4000
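
A small aside: SimpleHTTPServer is the Python 2 name. If your machine has Python 3 instead, the equivalent command is

$ python3 -m http.server 4000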

And you can now do some funky visualizations of your data. There are a number of chart types packaged with Raw, but you can also develop your own – here’s the documentation. Michelle Moravec has been doing some lovely work visualizing her historical research using Raw. You should check it out.

 

Your Open Source Data Refinery

With these three pieces of data refinery infrastructure installed on your machine, or in your local digital history computer lab, you’ll have no excuse not to start adding some distant reading perspective to your method. Go. Do it now.

The Web of Authors for Wikipedia’s Archaeology Page

I’m playing with a new toy, WikiImporter, which allows me to download the network of authorship on MediaWiki-powered sites. I fired it up and set it to grab the user-article network, where (per the tool’s description) “The Hyperlink Coauthorship network will analyze all the links found in the seed article and create an edge between each user that edited the article found in that link and the article”.

Naturally, I pointed it at ‘archaeology’ on Wikipedia.  I’ve posted the resulting two mode network on figshare for all and sundry to analyze.

I also asked it to download the article-to-article links (which is slightly different from my spidering results, as my spiders also included the wiki’s own pages, like the ‘this page is a stub’ or ‘this page needs citations’ notices, which gives me an interesting perspective on the quality of the articles. More on that another day). This file is also on figshare here.

Just remember to cite the files. Enjoy!
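
If you’d rather poke at the two-mode network in R than in Gephi, a minimal igraph sketch might look like this (it assumes you download or convert the file as GraphML; the filename here is made up):

library(igraph)
g <- read_graph("archaeology-user-article.graphml", format = "graphml")  # hypothetical filename
V(g)$type <- bipartite_mapping(g)$type   # flag the two modes (users vs articles)
proj <- bipartite_projection(g)          # user-user and article-article projections
summary(proj$proj1)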

 

Archeology versus Archaeology versus #Blogarch

I’m working on a paper that maps the archaeological blogosphere. I thought this morning it might be good to take a quick detour into the Twitterverse.

Behold!

‘archaeology’ on twitter, april 7 2014

‘archaeology’ on twitter

Here we have every twitter username, connected by referring to each other in a tweet. There’s a seriously strong spine of tweeting, but it doesn’t make for a unified graph. The folks keeping this clump all together, measured by betweenness centrality (a quick sketch for computing this yourself follows the list):

pompeiiapp
arqueologiabcn
herculaneumapp
romanheritage
openaccessarch
cmount1
groovyhistorian
lornarichardson
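
A minimal sketch of computing betweenness yourself in R with igraph, assuming you had exported an edge list of who-mentioned-whom (the filename and columns are made up for illustration):

library(igraph)
mentions <- read.csv("archaeology-mentions.csv", stringsAsFactors = FALSE)  # hypothetical edge list: from, to
g <- graph_from_data_frame(mentions, directed = TRUE)
## the accounts that sit on the most shortest paths are the ones holding the clumps together
sort(betweenness(g, directed = TRUE), decreasing = TRUE)[1:10]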

top replied-to
hotrodngold
raymondsnoddy
colesprouse
1014retold
janell_elise
yorksarch
holleyalex
bonesbehaviours
uclu
illustreets

Top URLS:

http://bit.ly/1husSFB

http://phy.so/316076983

http://bit.ly/1sqHFu0

http://beasiswaindo.com/1796

https://www.dur.ac.uk/archaeology/conferences/current/babao2014/

http://wanderinggypsyvoyager.blogspot.com/2014/04/archaeology-two-day-search.html?spref=tw

http://www.thisiscolossal.com/2014/04/aerial-archaeology/

http://news.sciencemag.org/archaeology/2014/04/did-europeans-get-fat-neandertals

http://www.smartsurvey.co.uk/s/HadriansWall

http://ift.tt/PWRYrf

Top hashtags:
archaeology 325
Pompeii 90
fresco 90
Archaeology 77
Herculaneum 40
Israel 24
nowplaying 20
roman 18
newslocker 16
Roman 14

Archeology

Let’s look at American archeology – as signified by the dropped ‘a’.

‘archeology’ on twitter, april 7

An awful lot more fragmented – less popular consciousness of archaeology-as-a-community?
Top by betweenness centrality – the ones who hold this together:
illumynous
archeologynow
youtube
heritagedaily
algenpfleger
riosallier
david328124
ogurek3
gold248131
leafenthusiast

Top urls:

http://ift.tt/1hN75Lp

http://wp.me/p4jAM9-1cZ

http://fav.me/d7d95kp

http://bit.ly/1qdaHLD

http://newszap.com

http://www.valencia953fm.com.ve

http://bit.ly/PS6hg4

http://goo.gl/fb/MfmNZ

http://goo.gl/fb/IfRnh

Top hashtags:
archeology
history
rome
ancient
easterisland
mystery
easter
slave
esoteric
egypt

Top replied-to
atheistlauren
nofaith313
faraishah
sebpatrick
swbts
thebiblestrue
animal
christofpierson
simba_83
andystacey

#Blogarch on twitter

twitter search ‘#blogarch’ april 7 2014

And now, the archaeologists themselves, as indicated by #blogarch

We talk to ourselves – but with the nature of the hashtag, I suppose that’s to be expected?

Top by betweenness centrality
openaccessarch
drspacejunk
bonesdonotlie
archeowebby
drkillgrove
fieldofwork
archaeo_girl
brennawalks
ejarchaeology
yagumboya

top urls

http://zoharesque.blogspot.com/2014/03/space-age-archaeology-and-future-do-i.html?spref=tw

http://bit.ly/1gBkNin

http://campusarch.msu.edu/?p=2782

http://wp.me/p36umf-cW

http://www.poweredbyosteons.org/2014/03/blogging-bioarchaeology-where-do-we-go.html#.Uzm7zM8kJUw.twitter

http://ow.ly/3iVK4f

http://wp.me/p3Kfwu-cb

http://bit.ly/PCdEIE

http://wp.me/p1rKjz-V2

http://diggin-it-archaeology.blogspot.com/2014/04/my-future-in-blogging-archaeology.html

Top hashtags
blogarch
BlogArch
archaeology
saa2014
SAA2014
blogging
CRMArch
newslocker
crmarch

Top replied to
electricarchaeo (yay me!)

Top mentioned:
drspacejunk
bonesdonotlie
fieldofwork
openaccessarch
archeowebby
jsatgra
cmount1
archaeo_girl
capmsu
drkillgrove

Put them all together now…

And now, we put them all together to get ‘archaeology’ on the twitterverse today:

‘archaeology, archeology, and #blogarch’ on twitter, april 7

Visually, it’s apparent that the #blogarch crew are the ones tying together the wider twitter worlds of archaeology & archeology, though it’s still pretty fragmented. There’re 460 folks in this graph.

Top by betweenness centrality:

openaccessarch
drspacejunk
bonesdonotlie
archeowebby
drkillgrove
fieldofwork
jamvallmitjana
archaeo_girl
brennawalks
ejarchaeology

Top urls

http://zoharesque.blogspot.com/2014/03/space-age-archaeology-and-future-do-i.html?spref=tw
http://bit.ly/1gBkNin
http://www.poweredbyosteons.org/2014/03/blogging-bioarchaeology-where-do-we-go.html#.Uzm7zM8kJUw.twitter
http://campusarch.msu.edu/?p=2782
http://wp.me/p4jAM9-1cZ
http://fav.me/d7d95kp
http://wp.me/p1rKjz-V2
http://diggin-it-archaeology.blogspot.com/2014/04/my-future-in-blogging-archaeology.html
http://bonesdontlie.wordpress.com/2014/04/01/the-future-of-blogging-for-bones-dont-lie/
http://soundcloud.com/vrecordings/l-side-andrezz-archeology-v

top hashtags (not useful, given the nature of the search, right? But anyway)

blogarch
archeology
archaeology
BlogArch
history
ancient
easterisland
mystery
easter
slave

Top word pairs in those largest groups:

archeology,professor 30
started,yesterday 21
yesterday,battle 21
battle,towton 21
towton,weapon 21
weapon,tests 21
tests,forensic 21
forensic,archeology 21
museum,archeology 19
blogging,archaeology 17

second group:
blogging,archaeology 13
future,blogging 12
archaeology,go 7
archaeology,future 7
archaeology,final 6
final,review 6
review,blogarch 6
hopes,dreams 6
dreams,fears 6
fears,blogging 6

third group:
space,age 6
age,archaeology 6
archaeology,future 6
future,know 6
know,going 6
saa2014,blogarch 6
going,blogarch 5
blogarch,post 3
post,future 3
future,blogging 3

fourth group:
easterisland,ancient 10
ancient,mystery 10
mystery,easter 10
easter,slave 10
slave,history 10
history,esoteric 10
esoteric,archeology 10
archeology,egypt 10
rt,illumynous 9
illumynous,easterisland 9

fifth group:
costa,rica 8
rt,archeologynow 7
archeologynow,modern 4
modern,archeology 4
archeology,researching 4
researching,dive 4
dive,bars 4
bars,costa 4
rica,costa 4
rica,star 4

(once I saw ‘bars’, I stopped. Archaeological stereotypes, maybe).

Top mentioned in the entire graph

illumynous 9 bonesdonotlie 8
drspacejunk 8 drkillgrove 4
bonesdonotlie 8 capmsu 4
archeologynow 7 yagumboya 3
openaccessarch 7 drspacejunk 3
macbrunson 6 archeowebby 3
swbts 6 allarchaeology 3
archeowebby 6 openaccessarch 3
algenpfleger 5 cmount1 3
youtube 5 brennawalks 2

So what does this all mean? Answers on a postcard, please…

(My network files will be on figshare.com eventually).

Quickly Extracting Data from PDFs

By ‘data’, I mean the tables. There are lots of archaeological articles out there that you’d love to compile together to do some sort of meta-study. Or perhaps you’ve gotten your hands on pdfs with tables and tables of census data. Wouldn’t it be great if you could just grab that data cleanly? Jonathan Stray has written a great synopsis of the various things you might try and has sketched out a workflow you might use. Having read that, I wanted to try ‘Tabula’, one of the options that he mentioned. Tabula is open source and runs on all the major platforms. You simply download it and double-click on the icon; it runs within your browser. You load your pdf into it, and then draw bounding boxes around the tables that you want to grab. Tabula will then extract that table cleanly, allowing you to download it as a csv or tab-separated file, or paste it directly into something else.

For instance, say you’re interested in the data that Gill and Chippindale compiled on Cycladic Figures. You can grab the pdf from JSTOR:

Material and Intellectual Consequences of Esteem for Cycladic Figures
David W. J. Gill and Christopher Chippindale
American Journal of Archaeology, Vol. 97, No. 4 (Oct. 1993), pp. 601-659
Article DOI: 10.2307/506716

Download it, and then feed it into Tabula. Let’s look at table 2.

gillchippendaletable2
You could just highlight this table in your pdf reader and hit ctrl+c to copy it; when you paste that into your browser, you’d get:
gillchippendaletable2cutnpaste
Everything in a single column. For a small table, maybe that’s not such a big deal. But let’s look at what you get with Tabula. You drag the square over that same table; when you release the mouse button you get:
tabula1
Much, much cleaner & faster! I say ‘faster’, because you can quickly drag the selection box around every table and hit download just the one time. Open the resulting csv file, and you have all of your tables in a useful format:
tabula2
But wait, there’s more! Since you can copy directly to the clipboard, you can paste directly into a google drive spreadsheet (thus taking advantage of all the visualization options that Google offers) or into something like Raw from Density Design.
Tabula is a nifty little tool that you’ll probably want to keep handy.

Mapping the Web in Real Time

I don’t think I’ve shared my workflow before for mapping the structure of a webcrawl. After listening to Sebastian Heath speak at #dapw, it occurred to me that it might be useful for, inter alia, linked open data type resources. So, here’s what you do (and my example draws from this year’s SAA 2014 blogging archaeology session blog-o-sphere):

1. install the http graph generator from the gephi plugin marketplace.

2. download the navicrawler + firefox portable zip file at the top of this page.

3. make sure no other instance of firefox is open. Open firefox portable. DO NOT click the ‘update firefox’ button, as this will make navicrawler unusable.

4. Navicrawler can be used to download or scrape the web. In the navicrawler window, click on the (+) to select the ‘crawl’ pane. This will let you set how deep and how far to crawl. Under the ‘file’ tab, you can save all of what you crawl in various file formats. With the httpgraph plugin for Gephi however, we will simply ‘listen’ to the browser and render the graph in real time.

5. The first time you run firefox portable, you will need to configure a manual proxy. Do this by going to tools >> options >> network >> settings. Set the manual proxy configuration for http to 127.0.0.1 and the port to 8088. Click ‘ok’.

If you tried loading a webpage at this point, you’d get an error. To resolve this, you need to tell Gephi to connect to that port as well, and then web traffic will be routed correctly.

6. Open Gephi. Select new project. Under ‘generate’, select ‘http graph’. This will open a dialogue box asking for the port number. Enter 8088.

7. Over in Firefox portable, you can now start a websearch or go to the page from which you wish to crawl. For instance, you could put in the address bar, http://dougsarchaeology.wordpress.com/2013/11/05/blogging-archaeology/. Over in Gephi, you will start to see a number of nodes and edges appearing. In the ‘crawl’ window in Navicrawler, set ‘max depth’ to 1, ‘crawl distance’ to 2, and ‘tabs count’ to 25. Then hit the ‘start’ button. Your Gephi window will now begin to fill with the structure of the internet. There are 4 types of nodes: client, uri, host, and domain. For our purposes here, we will want to filter the resulting graph to hide most of the architecture of the web and just show the URIs. (This by the way could be very useful for visualizing archaeological resources organized via Linked Open Data principles).

Your crawl can run for quite some time. I was running the crawl described above for around 10 minutes when it crashed on me. The resulting Gephi file (which has 5374 nodes and 14993 edges) can be downloaded from my space on figshare. For the illustration below, I filtered the ‘content-type’ for ‘text/html’, to present the structure of the human-readable archaeo-blog-o-sphere as represented by Doug’s Blogging Archaeology Carnival.

The view from Doug’s place