Quickly Extracting Data from PDFs

By ‘data’, I mean the tables. There are lots of archaeological articles out there that you’d love to compile together to do some sort of meta-study. Or perhaps you’ve gotten your hands on pdfs with tables and tables of census data. Wouldn’t it be great if you could just grab that data cleanly? Jonathan Stray has written a great synopsis of the various things you might try and has sketched out a workflow you might use. Having read that, I wanted to try ‘Tabula’, one of the options that he mentioned. Tabula is open source and runs on all the major platforms. You simply download it and double-click on the icon; it runs within your browser. You load your pdf into it, and then draw bounding boxes around the tables that you want to grab. Tabula will then extract that table cleanly, allowing you to download it as a csv or tab-separated file, or paste it directly into something else.

For instance, say you’re interested in the data that Gill and Chippindale compiled on Cycladic Figures. You can grab the pdf from JSTOR:

Material and Intellectual Consequences of Esteem for Cycladic Figures
David W. J. Gill and Christopher Chippindale
American Journal of Archaeology , Vol. 97, No. 4 (Oct., 1993) , pp. 601-659
Article DOI: 10.2307/506716

Download it, and then feed it into Tabula. Let’s look at table 2.

You could just highlight this table in your pdf reader and hit ctrl+c to copy it; when you paste that into your browser, you’d get:
Everything in a single column. For a small table, maybe that’s not such a big deal. But let’s look at what you get with Tabula. You drag the square over that same table; when you release the mouse button you get:
Much, much cleaner & faster! I say ‘faster’, because you can quickly drag the selection box around every table and hit download just the one time. Open the resulting csv file, and you have all of your tables in a useful format:
But wait, there’s more! Since you can copy directly to the clipboard, you can paste directly into a google drive spreadsheet (thus taking advantage of all the visualization options that Google offers) or into something like Raw from Density Design.
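And because the export is just plain text, you can pull it straight into a script. A minimal sketch (the table contents below are hypothetical stand-ins for a Tabula export, not Gill and Chippindale’s actual figures):

```python
import csv
import io

# Simulated Tabula CSV export (hypothetical values, for illustration only)
tabula_export = """Figure type,Provenanced,Unprovenanced
Seated,5,12
Standing,40,120
"""

rows = list(csv.reader(io.StringIO(tabula_export)))
header, data = rows[0], rows[1:]

# Unlike a straight copy-paste from the pdf, each row keeps its columns
for row in data:
    print(dict(zip(header, row)))
```

In real use you’d point `csv.reader` (or a spreadsheet) at the downloaded file itself; the point is that the column structure survives the trip out of the pdf.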
Tabula is a nifty little tool that you’ll probably want to keep handy.

Briefly Noted: Lytro, Light-Field Photography

  In the latest MIT Technology Review, there’s a short piece on the ‘Lytro‘, a camera that captures not just the light that falls on its sensor, but also the angle of that light. This feature allows different information, different kinds of shots, to be extracted computationally after the button is pressed.

I want one. They sell for $500.

Think of the archaeological uses! I’m no photographer, but as I understand things, a lot of archaeological photography comes down to the creative use of oblique angles, whether to see crop marks or to pick out very fine details of artefacts. If the Lytro captures the angles of the light hitting its sensors, then presumably one could take a shot, post the database of information associated with that shot, then allow other [digital] archaeologists to comb through that data extracting information/pictures of relevance? Perhaps a single photo of the soil could be combed through highlighting different textures, colours, etc…  Try out their gallery here.

The future of this camera is in the software apps developed to take advantage of the massive database of information that it will generate:

Refocusing images after they are shot is just the beginning of what Lytro’s cameras will be able to do. A downloadable software update will soon enable them to capture everything in a photo in sharp focus regardless of its distance from the lens, which is practically impossible with a conventional camera. Another update scheduled for this year will use the data in a Lytro snapshot to create a 3-D image. Ng is also exploring a video camera that could be focused after shots were taken, potentially giving home movies a much-needed boost in production values.

Getting Started with MALLET and Topic Modeling

UPDATE! September 19th 2012: Scott Weingart, Ian Milligan, and I have written an expanded ‘how to get started with Topic Modeling and MALLET’ for the Programming Historian 2. Please do consult that piece for detailed step-by-step instructions for getting the software installed, getting your data into it, and thinking through what the results might mean.

Original Post that Inspired It All:

I’m very interested in topic modeling at the moment. It has not been easy however to get started – I owe a debt of thanks to Rob Nelson for helping me to get going. In the interests of giving other folks a boost, of paying it forward, I’ll share my recipe. I’m also doing this for the benefit of some of my students. Let’s get cracking!

First, some background reading:

  1. Clay Templeton, “Topic Modeling in the Humanities: An Overview | Maryland Institute for Technology in the Humanities”, n.d., http://mith.umd.edu/topic-modeling-in-the-humanities-an-overview/.
  2. Rob Nelson, Mining the Dispatch http://dsl.richmond.edu/dispatch/
  3. Cameron Blevins, “Topic Modeling Martha Ballard’s Diary” Historying, April 1, 2010, http://historying.org/2010/04/01/topic-modeling-martha-ballards-diary/
  4. David J Newman and Sharon Block, “Probabilistic topic decomposition of an eighteenth‐century American newspaper,” Journal of the American Society for Information Science and Technology 57, no. 6 (April 1, 2006): 753-767.
  5. David Blei, Andrew Ng, and Michael Jordan, “Latent dirichlet allocation,” The Journal of Machine Learning Research 3 (2003), http://dl.acm.org/citation.cfm?id=944937.

Now you’ll need the software. Go to the MALLET project page, and download Mallet. (Mallet was developed by Andrew McCallum at U Massachusetts, Amherst).

Then, you’ll need the Java Development Kit (JDK) – nb, not the regular Java runtime that’s on every computer, but the version that lets you develop programs. Install this.

Unzip Mallet into your C:\ directory. This is important; it can’t be anywhere else. You’ll then have a folder called C:\mallet-2.0.6 or similar.

Next, you’ll need to create an environment variable called MALLET_HOME. You do this by clicking on control panel >> system >> advanced system settings (in Windows 7; for XP, see this article), ‘environment variables’. In the pop-up, click ‘new’ and type MALLET_HOME in the variable name box; type c:\mallet-2.0.6 (ie, the exact location where you unzipped Mallet) in the variable value box.
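If you’d rather skip the clicking, the same variable can be set from the command prompt with Windows’ built-in setx command (a sketch, assuming you unzipped to C:\mallet-2.0.6 as above; note that setx only takes effect in command prompt windows opened afterwards):

```shell
:: Set MALLET_HOME persistently for the current user
setx MALLET_HOME "C:\mallet-2.0.6"

:: Open a NEW command prompt, then verify it stuck:
echo %MALLET_HOME%
```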

To run mallet, click on your start menu >> all programs >> accessories >> command prompt. You’ll get the command prompt window, which will have a cursor at c:\user\user> (or similar). Type cd .. (two periods; that ain’t a typo) to go up a level; keep doing this until you’re at C:\. Then type cd mallet-2.0.6 and you’re in the Mallet directory. You can now type Mallet commands directly. If you type bin\mallet at this point, you should be presented with a list of Mallet commands – congratulations!

At this point, you’ll want some data. Using the regular Windows Explorer, I create a folder within mallet where I put all of the data I want to study (let’s call it ‘data’). If I were to study someone’s diary, I’d create a unique text file for each entry, naming the text file with the entry’s date. Then, following the topic modeling instructions on the mallet page, I’d import that folder, and see what happens next. I’ve got some workflow for scraping data from websites and other repositories, but I’ll leave that for another day (or skip ahead to The Programming Historian for one way of going about it).
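That one-file-per-entry layout is easy to script, too. A minimal sketch (the diary entries are hypothetical, and it writes into a temporary folder rather than your real Mallet directory):

```python
import pathlib
import tempfile

# Hypothetical diary entries keyed by date (illustration only)
entries = {
    "1785-01-01": "Clear and cold. Mr Ballard to town.",
    "1785-01-02": "Snow in the night. Called to a birth.",
}

# Stand-in for C:\mallet-2.0.6\data\johndoediary
data_dir = pathlib.Path(tempfile.mkdtemp()) / "data" / "johndoediary"
data_dir.mkdir(parents=True)

# One text file per entry, named by the entry's date, ready for import-dir
for date, text in entries.items():
    (data_dir / f"{date}.txt").write_text(text, encoding="utf-8")

print(sorted(p.name for p in data_dir.iterdir()))
```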

Once you’ve imported your documents, Mallet creates a single ‘mallet’ file that you then manipulate to determine topics.

bin\mallet import-dir --input data\johndoediary --output johndoediary.mallet --keep-sequence --remove-stopwords

(modified from the Mallet topic modeling page)

This sequence of commands tells mallet to import a directory located in the subfolder ‘data’ called ‘johndoediary’ (which contains a sequence of txt files). It then outputs that data into a file we’re calling ‘johndoediary.mallet’. Removing stopwords strips out ‘and’, ‘of’, ‘the’, etc.

Then we’re ready to find some topics:

bin\mallet train-topics --input johndoediary.mallet --num-topics 100 --output-state topic-state.gz --output-topic-keys johndoediary_keys.txt --output-doc-topics johndoediary_composition.txt

(modified from the Mallet topic modeling page)

Now, there are more complicated things you can do with this – take a look at the documentation on the Mallet page. Is there a ‘natural’ number of topics? I do not know. What I have found is that I have to run the train-topics with varying numbers of topics to see how the composition file breaks down. If I end up with the majority of my original texts all in a very limited number of topics, then I need to increase the number of topics; my settings were too coarse.
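That “how does the composition file break down?” check can itself be scripted. A sketch, assuming the older Mallet pair-style doc-topics format (doc index, source file, then alternating topic-number/proportion columns sorted by proportion; the values below are invented for illustration):

```python
import io

# Simulated --output-doc-topics file (hypothetical values)
composition = """#doc name topic proportion ...
0 file:/data/1785-01-01.txt 3 0.61 7 0.22 1 0.17
1 file:/data/1785-01-02.txt 3 0.74 2 0.15 9 0.11
"""

top_topics = {}
for line in io.StringIO(composition):
    if line.startswith("#"):
        continue
    cols = line.split()
    # The first topic/proportion pair is the document's dominant topic
    top_topics[cols[1]] = (int(cols[2]), float(cols[3]))

# If most documents share the same few dominant topics, the model is too
# coarse: re-run train-topics with a larger --num-topics
dominant = {topic for topic, _ in top_topics.values()}
print(f"{len(dominant)} distinct dominant topic(s) across {len(top_topics)} docs")
```

Here both documents pile into topic 3, which on a real corpus would be a hint to increase the topic count and train again.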

More on interpreting the output of Mallet to follow.

Again, I owe an enormous debt of gratitude to Rob Nelson for talking me through the intricacies of getting Mallet to work, and for the record, I think the work he is doing is tremendously important and fascinating!

Thoughts on the Shadow Scholar

In the Chronicle of Higher Education, there is a troubling piece written by a fellow who writes and sells papers for/to students. Which got me to thinking: shouldn’t text analysis be able to solve this?

Here’s my thinking: I’m willing to bet every author produces unique combinations of words and phrases – a concept that Amazon, for instance, uses to improve its search functions (“statistically improbable phrases“). As the ‘ghost writer’ points out, most of the emails he gets from students are nearly illegible or otherwise atrocious. So – what if, at the start of a school year, you sat all of your students down to handwrite a couple thousand words on any topic? Writing by hand is important, so that you get that student’s actual genuine writing. Scan it all in. Perform text analysis on it. Obtain a ‘signature’ for that student’s style. Then, when students submit their papers, analyze them again and compare the signatures. Where the signatures don’t match within a certain range, bring the student in to talk about their work. Chances are, if they didn’t write it, they probably haven’t read it either…. Repeat each year to account for developing skill and ability.
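One crude way such a ‘signature’ could work is a character n-gram profile compared by cosine similarity – this is a toy sketch of the general stylometry idea, not a vetted authorship-attribution method, and the writing samples are invented:

```python
import math
from collections import Counter

def signature(text, n=3):
    """Character n-gram frequency profile: a crude stylistic 'signature'."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarity(sig_a, sig_b):
    """Cosine similarity between two profiles (1.0 = identical style)."""
    dot = sum(sig_a[g] * sig_b[g] for g in set(sig_a) & set(sig_b))
    norm = math.sqrt(sum(v * v for v in sig_a.values()))
    norm *= math.sqrt(sum(v * v for v in sig_b.values()))
    return dot / norm if norm else 0.0

# Hypothetical writing samples (illustration only)
known = signature("I done got to get some of them essays wrote for my classes.")
submitted = signature("The epistemological implications of praxis remain contested.")
print(round(similarity(known, submitted), 2))
```

A real system would need much longer samples and more robust features, but a submission scoring far below the student’s self-similarity range is exactly the ‘bring them in for a chat’ signal I have in mind.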

Perhaps I’m naive, and text analysis isn’t at that level yet (but I’m willing to bet it could be…). If the problem is a student submits someone else’s work as his own, then maybe if we had a clear signal of his own true work, all this latent computer power sitting around could be brought into the equation…?

Just a thought.


7Scenes: Augmented Reality Authoring for Digital Storytelling

I’m very interested in augmented reality for interpreting/experiencing landscapes (archaeological or historical). I’ve explored things like Wikitude and Layar. There’s a great deal of flexibility and possibility with those two, if you’ve got the ability and resources to do a bit of programming. Skidmore College has used Layar with success to produce a campus map layer (follow that link for excellent pointers on how they did it). But what if you’d like to explore the potential of AR, but don’t have the programming skills?

One platform that I’ve come across recently which can help there is called ‘7Scenes‘. It explicitly bills itself as a ‘mobile storytelling platform’. The free account allows you to tell a basic ‘tour’ kind of story; presumably if you purchase another kind of account, different genres become available to you.

I signed up for the free account, and began playing around with it (I’m ‘DoctorG’ if you’re looking). Even with this level of functionality, some playful elements are available – you can set quizzes by location, for instance, and keep score. A tour of your campus for first year students as part of orientation could include quizzes at crucial points.

In the editor window, you first select the genre. Then you fill in the details (backstory, introduction, etc).

The real work begins in the map window. When you add a location, you can make it trigger information or photos when the player encounters it. You can also build in simple quizzes, as in the screenshot.

Once the ‘scene’ is published, anyone with 7scenes on their smartphone can access it. The app knows where you are, and pulls in the closest scene. In about 15 minutes I created a scene with 3 locations, one photo, one info panel, and one quiz, around the main quad here at Carleton. Then, I fired up the app on my iPhone and went outside. Even though it was quite simple, it was really rather engaging, wandering about the quad trying to get close enough to the point to trigger the interaction (note to scene makers: zoom into the map interface so that your location is precisely where you want it. I put my first point actually outside my intended target, Paterson Hall, so I was wandering about the parking lot.)

I will be playing with this some more; but fired up after only a short investment in time, I wanted to share. The authoring environment makes sense, it’s easy to use, and the results are immediately apparent. When you log back into the 7scenes site, you also get use metrics and reviews of your scene. If only my digital history students had more smartphones!

More on 7scenes from their own press page

I know that I know nothing

Commuting in Ottawa is an interesting experience. It seems the entire city disappears in the summer, beguiling one into thinking that a commute that takes 30 – 40 minutes in August will continue to be 30 – 40 minutes in September.

This morning, I was pushing 1 hr and 40 minutes. On the plus side, this gives me the opportunity to listen to the podcasts from Scholars’ Lab, from the University of Virginia (available via iTunes U). As I listen to this excellent series of talks (one talk per commute…) I realize just how profoundly shallow my knowledge is of the latest happenings in Digital Humanities – and that’s a good thing! For instance, I learned about Intrasis, a system from Sweden for recording archaeological sites (or indeed, any kind of knowledge) that focuses on generating relationships from the data, rather than specifying beforehand a relationships table (and it melds very well with GIS). This is cool. I learned also about Heurist, a tool for managing research. Also ‘Heml’ – the Historical Event Markup and Linking Project, led by Bruce Robertson. As I listened to this last talk, as Bruce described the problems of marking up events/places/persons using non-Gregorian calendars and so on, it struck me that this problem was rather similar to the one of defining sites in a GIS – what do you do when the boundaries are fuzzy? How do you avoid the in-built precision of dots-on-a-map, or URLs that lead to one specific location? Time is Space, as Einstein taught us….

The upshot is, I feel very humbled when I listen to these in-depth and fascinating talks – I feel rather out of my depth. At the same time, I am excited to be able to participate in such a fast moving field.  Hopefully, my small contributions to agent modeling for history generate the same kind of excitement for others!

Publish your excavation in minutes

…provided you blogged the whole thing in the first place.

How, you say?

With Anthologize, the outcome of the one-week-one-tool experiment.

Anthologize is a free, open-source, plugin that transforms WordPress 3.0 into a platform for publishing electronic texts. Grab posts from your WordPress blog, import feeds from external sites, or create new content directly within Anthologize. Then outline, order, and edit your work, crafting it into a single volume for export in several formats, including—in this release—PDF, ePUB, TEI.

How Anthologize came to be is remarkable in itself (see Dan Cohen’s blog) and is a model for what we as digitally-minded archaeology folks could be doing. Which puts me in mind of excavation reports, catalogues, and other materials produced in the day to day work of archaeology.

What if, in the course of doing your fieldwork/archive work/catalogue work/small finds work, you used WordPress as your content management system? There are plugins a-plenty for keeping things private, if that’s a concern. But once the work is complete, run Anthologize and voila: a publication fit for the 21st century.

And, since the constraints of paper publishing no longer apply, David Wilkinson’s thoughts on the fuller experience of archaeology could also now find easier expression – in 2007 I wrote the following:

But he asks, ‘what of characters in archaeological writing?’ Wilkinson’s paper is really making a plea for archaeologists to remember that they themselves are characters in the story of the site or landscape that they are studying, and that they should put themselves into it:

“We all sit in portacabins, in offices, in vans, in pubs or round fires, and we tell stories… we have a great time and drink too much and what do we do the next morning? We get up and go to our offices and we write, ‘In Phase 1 ditch 761 was recut (794) along part of its length.’ Surely, we can do better”.

A similar argument was made in the SAA Archaeological Record last May, by Cornelius Holtorf, in an article called “Learning from Las Vegas: Archaeology in the Experience Economy”. Holtorf argued:

“Learning from Las Vegas means learning to embrace and build upon the amazing fact that archaeologists can connect so well with some of the most widespread fantasies, dreams, and desires that people have today.[…] I am suggesting that the greatest value of archaeology in society lies in providing people with what they most desire from archaeology: great stories both about the past and about archaeological research.”

Archaeology – the doing of archaeology! – is a fantastic experience. You learn so much more about the past when you are at the coal-face itself, when you stand in 35 degree C heat, with the dust on your face so thick you almost choke, debating with the site supervisor the meaning of a complicated series of walls, or sitting at the bar afterwards with a cool beer, still debating the situation, laughing, chatting. Reading ‘Three shards of Vernice-Nera ware found in-situ below 342 indicate…’ sucks the fun out of archaeology. It certainly has no romance, which puts the practice of archaeology – as published to the public – far down the list of priorities in this modern Experience Economy. The serious face of archaeology we present to the public is so lifeless: how can we expect government and the public to be excited about our work if we ourselves give every indication of not being excited either?

I’m not arguing that we turn every site monograph into a graphic novel (though that’s an interesting idea, and has been done for teaching archaeology). But with the internet being the way it is these days: couldn’t a project website contain blogs and twitters (‘tweets’, actually) from the people working on it? Can’t we make the stories of the excavation at least as important as the story of the site?

Congratulations to the folks who participated in the creation of Anthologize; there’ll be great things ahead for this tool!

Zotero Maps: Visualizing Archaeology?

You can now map your Zotero Library:

Potential Use Cases:
Map Your Collection By Key Places:
Many records from library catalogs and journal databases come pre-loaded with geographic keywords. Zotero Maps lets you quickly see the relationships between the terms catalogers, authors, and publishers have assigned to the items in your collection. Similarly, as you apply your own geographic tags to items you can then explore those geographic relationships. Whether you’re looking at key locations in studies of avian flu, ethnographic work in the American southwest, or the history of the transatlantic slave trade, the tags associated with your items provide valuable geographic information.

Map Places of Publication:
In many cases places of publication include crucial information about your items. If you’re working on a project involving the history of the book, how different media outlets cover an issue, or how different journals present distinct scientific points of view, the places in which those items are published can provide valuable insight.

In 2007, I was trying something along these lines using Platial (now deceased). Now – since you can add objects from things like Opencontext.org into your Zotero library, and describe these using tags, you could begin to build a map of not only ‘things’ but also the relevant reports etc, all from your browser, without doing any of the fancy coding stuff…

From my library:

Twitter Times: The Electric Archaeology Edition

Inspired by Dan Cohen’s ‘Digital Humanities Now‘ implementation of Twitter Times, I’ve done the same thing with my own twitter feed, its lists, and everyone I follow.

Heard of Twitter Times?

Dan writes,

More recently, social media such as Twitter has provided a surprisingly good set of pointers toward worthy materials I should be reading or exploring. (And as happened with blogs five years ago, the critics are now dismissing Twitter as unscholarly, missing the filtering function it somehow generates among so many unfiltered tweets.) I follow as many digital humanists as I can on Twitter, and created a comprehensive list of people in digital humanities. (You can follow me @dancohen.)


Digital Humanities Now is a new web publication that is the experimental result of this thought. It aggregates thousands of tweets and the hundreds of articles and projects those tweets point to, and boils everything down to the most-discussed items, with commentary from Twitter. A slightly longer discussion of how the publication was created can be found on the DHN “About” page.

I’m following mostly folks in elearning, archaeology, and digital humanities; you can see my edition here.