Quickly Extracting Data from PDFs

By ‘data’, I mean the tables. There are lots of archaeological articles out there that you’d love to compile together to do some sort of meta-study. Or perhaps you’ve gotten your hands on pdfs with tables and tables of census data. Wouldn’t it be great if you could just grab that data cleanly? Jonathan Stray has written a great synopsis of the various things you might try and has sketched out a workflow you might use. Having read that, I wanted to try ‘Tabula‘, one of the options that he mentioned. Tabula is open source and runs on all the major platforms. You simply download it an double-click on the icon; it runs within your browser. You load your pdf into it, and then draw bounding boxes around the tables that you want to grab. Tabula will then extract that table cleanly, allowing you to download it as a csv or tab separated file, or paste it directly into something else.

For instance, say you’re interested in the data that Gill and Chippindale compiled on Cycladic Figures. You can grab the pdf from JSTOR:

Material and Intellectual Consequences of Esteem for Cycladic Figures
David W. J. Gill and Christopher Chippindale
American Journal of Archaeology , Vol. 97, No. 4 (Oct., 1993) , pp. 601-659
Article DOI: 10.2307/506716

Download it, and then feed it into Tabula. Let’s look at table 2.

You could just highlight this table in your pdf reader and hit ctrl+c to copy it; when you paste that into your browser, you’d get:
Everything in a single column. For a small table, maybe that’s not such a big deal. But let’s look at what you get with Tabula. You drag the square over that same table; when you release the mouse button you get:
Much, much cleaner & faster! I say ‘faster’, because you can quickly drag the selection box around every table and hit download just the one time. Open the resulting csv file, and you have all of your tables in a useful format:
But wait, there’s more! Since you can copy directly to the clipboard, you can paste directly into a google drive spreadsheet (thus taking advantage of all the visualization options that Google offers) or into something like Raw from Density Design.
Tabula is a nifty little tool that you’ll probably want to keep handy.

One thought on “Quickly Extracting Data from PDFs

Comments are closed.