Quickly Extracting Data from PDFs

By ‘data’, I mean the tables. There are lots of archaeological articles out there that you’d love to compile together to do some sort of meta-study. Or perhaps you’ve gotten your hands on pdfs with tables and tables of census data. Wouldn’t it be great if you could just grab that data cleanly? Jonathan Stray has written a great synopsis of the various things you might try and has sketched out a workflow you might use. Having read that, I wanted to try ‘Tabula’, one of the options that he mentioned. Tabula is open source and runs on all the major platforms. You simply download it and double-click on the icon; it runs within your browser. You load your pdf into it, and then draw bounding boxes around the tables that you want to grab. Tabula will then extract that table cleanly, allowing you to download it as a csv or tab separated file, or paste it directly into something else.

For instance, say you’re interested in the data that Gill and Chippindale compiled on Cycladic Figures. You can grab the pdf from JSTOR:

Material and Intellectual Consequences of Esteem for Cycladic Figures
David W. J. Gill and Christopher Chippindale
American Journal of Archaeology , Vol. 97, No. 4 (Oct., 1993) , pp. 601-659
Article DOI: 10.2307/506716

Download it, and then feed it into Tabula. Let’s look at table 2.

[Image: Gill and Chippindale’s Table 2, as it appears in the pdf]
You could just highlight this table in your pdf reader and hit ctrl+c to copy it; when you paste that into your browser, you’d get:
[Image: Table 2, cut-and-pasted]
Everything in a single column. For a small table, maybe that’s not such a big deal. But let’s look at what you get with Tabula. You drag the square over that same table; when you release the mouse button you get:
[Image: the same table selected in Tabula]
Much, much cleaner & faster! I say ‘faster’, because you can quickly drag the selection box around every table and hit download just the one time. Open the resulting csv file, and you have all of your tables in a useful format:
[Image: the extracted tables in the resulting csv]
But wait, there’s more! Since you can copy directly to the clipboard, you can paste directly into a google drive spreadsheet (thus taking advantage of all the visualization options that Google offers) or into something like Raw from Density Design.
Tabula is a nifty little tool that you’ll probably want to keep handy.
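
(A scriptable aside: the same extraction engine is wrapped for Python as tabula-py, if you’d rather loop over a pile of pdfs than draw boxes by hand. A minimal sketch, assuming tabula-py and a Java runtime are installed; the filename is just a stand-in for the JSTOR download:)

    import tabula  # pip install tabula-py; needs a Java runtime installed

    # Pull every table the engine can detect into a list of pandas DataFrames.
    tables = tabula.read_pdf("gill-chippindale-1993.pdf", pages="all",
                             multiple_tables=True)

    # Write each table out as its own csv, like Tabula's download button does.
    for i, df in enumerate(tables):
        df.to_csv(f"table-{i}.csv", index=False)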

Getting Historical Network Data into Gephi

I’m running a workshop next week on getting started with networks & gephi. Below, please find my first pass at a largely self-directed tutorial. This may eventually get incorporated into the Macroscope.

Data files for this tutorial may be found here. There’s a pdf/pptx with the images below, too.

The data for this exercise comes from Peter Holdsworth’s MA dissertation research, which Peter shared on Figshare here. Peter was interested in the social networks surrounding ideas of commemoration of the centenary of the War of 1812, in 1912. He studied the membership rolls for women’s service organizations in Ontario both before and after that centenary. By making his data public, Peter enables others to build upon his research in a way not commonly done in history. (Peter can be followed on Twitter at https://twitter.com/P_W_Holdsworth).

On with the show!

Download and install Gephi. (What follows assumes Gephi 0.8.2.) You will need the MultiMode Projection plugin installed.

To install the plugin, select Tools >> Plugins. (Across the top of Gephi you’ll see ‘File Workspace View Tools Window Plugins Help’. Don’t click on that ‘Plugins’; you need to hit ‘Tools’ first. Some images would be helpful, eh?)

In the popup, under ‘available plugins’, look for ‘MultimodeNetworksTransformation’. Tick this box, then click on Install. Follow the instructions, ignore any warnings, and click on ‘finish’. You may or may not need to restart Gephi to get the plugin running. If you suddenly see, on the far right of the Gephi window, a new tab beside ‘statistics’ and ‘filters’ called ‘Multimode Network’, then you’re ok.

[Slide 1: Getting the Plugin]

Assuming you’ve now got that sorted out,

1. Under ‘file’, select New Project.
2. On the Data Laboratory tab, select Import Spreadsheet, and in the pop-up, under ‘as table’, make sure to select EDGES table. Select women-orgs.csv. Click ‘next’, click ‘finish’.

(On the data table, have ‘edges’ selected. This is showing you the source and the target for each link (aka ‘edge’). This implies a directionality to the relationship that we just don’t know – so down below, when we get to statistics, we will always have to make sure to tell Gephi that we want the network treated as ‘undirected’. More on that below.)

[Slide 2: Loading your CSV file, step 1]

[Slide 3: Loading your CSV file, step 2]

3. Click on ‘copy data to other column’. Select ‘Id’. In the pop-up, select ‘Label’.
4. Just as you did in step 2, now import NODES (Women-names.csv).

(nb. You can always add more attribute data to your network this way, as long as you always use a column called Id so that Gephi knows where to slot the new information. Make sure to never tick off the box labeled ‘force nodes to be created as new ones’.)

[Slide: Adding new columns]

5. Copy ID to Label.
6. Add a new column, make it boolean. Call it ‘organization’.

[Slide: Filtering & ticking off the boxes]

7. In the Filter box, type [a-z], and select Id – this filters out all the women.
8. Tick off the check boxes in the ‘organization’ column.

Save this as ‘women-organizations-2-mode.gephi’.

Now, we want to explore how women are connected to other women via shared membership.

[Slide: Setting up the transformation]

Make sure you have the Multimode networks projection plugin installed.

On the multimode networks projection tab,
1. Click ‘load attributes’.
2. In ‘attribute type’, select ‘organization’.
3. In left matrix, select ‘false – true’ (or ‘null – true’).
4. In right matrix, select ‘true – false’ (or ‘true – null’).
(do you see why this is the case? what would selecting the inverse accomplish?)

5. Select ‘remove edges’ and ‘remove nodes’.

6. Hit ‘run’: organizations will be removed from your bipartite network, leaving you with a single-mode, woman-to-woman network.

7. Save as ‘women to women network.csv’.
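
(An aside for the scripting-inclined: you can sanity-check what the plugin just did with a few lines of Python and networkx. This is my own sketch, not part of the Gephi workflow, and the column names for women-orgs.csv are an assumption, so adjust them to match the file.)

    import csv
    import networkx as nx
    from networkx.algorithms import bipartite

    # Build the two-mode (woman <-> organization) network from the edge list.
    B = nx.Graph()
    with open('women-orgs.csv', newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            woman, org = row['Source'], row['Target']  # assumed column names
            B.add_node(woman, bipartite=0)
            B.add_node(org, bipartite=1)
            B.add_edge(woman, org)

    # Project onto the women: two women are tied if they share an organization,
    # with the tie weighted by how many organizations they share.
    women = {n for n, d in B.nodes(data=True) if d['bipartite'] == 0}
    G = bipartite.weighted_projected_graph(B, women)

    nx.write_gexf(G, 'women-to-women.gexf')  # gexf opens directly in Gephi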

…you can reload your ‘women-organizations-2-mode.gephi’ file and re-run the multimode networks projection so that you are left with an organization to organization network.

! If your data table is blank, your filter might still be active. Make sure the filter box is clear. You should be left with a list of women.

8. You can add the ‘women-years.csv’ table to your gephi file, to add the number of organizations each woman was active in, by year, as an attribute. You can then begin to filter your graph’s attributes…

9. Let’s filter by the year 1902. Under filters, select ‘attributes – equal’ and then drag ‘1902’ to the queries box.
10. In ‘pattern’ enter [0-9] and tick the ‘use regex’ box.
11. Click ok, click ‘filter’.

You should now have a network with 188 nodes and 8728 edges, showing the women who were active in 1902.

Let’s learn something about this network. On statistics,
12. Run ‘avg. path length’ by clicking on ‘run’.
13. In the pop-up that opens, select ‘undirected’ (as we know nothing about directionality in this network).
14. Click ok.

15. Run ‘modularity’ to look for subgroups. Make sure ‘randomize’ and ‘use weights’ are selected. Leave ‘resolution’ at 1.0.
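
(Again for the curious: the same measures can be computed outside Gephi. A minimal networkx sketch, reading the projected network from my earlier aside; note that networkx’s greedy modularity routine is not the same algorithm as Gephi’s, so the partitions will differ in detail.)

    import networkx as nx
    from networkx.algorithms import community

    # Treat the network as undirected, as we know nothing about directionality.
    G = nx.read_gexf('women-to-women.gexf').to_undirected()

    # Average path length (assumes a connected graph; otherwise compute it
    # per component).
    print('avg. path length:', nx.average_shortest_path_length(G))

    # Modularity, to look for subgroups.
    parts = community.greedy_modularity_communities(G, weight='weight')
    print('modularity:', community.modularity(G, parts, weight='weight'))

    # Betweenness centrality: who sits on the most shortest paths?
    bc = nx.betweenness_centrality(G)
    print('most between:', sorted(bc, key=bc.get, reverse=True)[:5])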

Let’s visualize what we’ve just learned.

16. On the ‘partition’ tab, over on the left-hand side of the ‘overview’ screen, click on nodes, then click the green arrows beside ‘choose a partition parameter’.
17. Click on ‘choose a partition parameter’. Scroll down to modularity class. The different groups will be listed, with their colours and their % composition of the network.
18. Hit ‘apply’ to recolour your network graph.

19. Let’s resize the nodes to show off betweenness centrality (to figure out which woman was in the greatest position to influence flows of information in this network). Click ‘ranking’.
20. Click ‘nodes’.
21. Click the down arrow on ‘choose a rank parameter’. Select ‘betweenness centrality’.
22. Click the red diamond. This will resize the nodes according to their betweenness centrality.
23. Click ‘apply’.

Now, down at the bottom of the middle panel, you can click the large black ‘T’ to display labels. Do so. Click the black letter ‘A’ and select ‘node size’.

Mrs. Mary Elliot-Murray-Kynynmound and Mrs. John Henry Wilson should now dominate your network. Who were they? What organizations were they members of? Who were they connected to? To the archives!

Congratulations! You’ve imported historical network data into Gephi, manipulated it, and run some analyses. Play with the settings on ‘preview’ in order to share your visualization as svg, pdf, or png.

Now go back to your original gephi file, and recast it as organizations to organizations via shared members, to figure out which organizations were key in early 20th century Ontario…

Historian’s Macroscope- how we’re organizing things

‘One of the sideshows was wrestling’ from National Library of Scotland on Flickr Commons; found by running this post through http://serendipomatic.org

How do you coordinate something as massive as a book project, between three authors across two countries?

Writing is a bit like sausage making. I write this, thinking of Otto von Bismarck, but Wikipedia tells me:

  • Laws, like sausages, cease to inspire respect in proportion as we know how they are made.
    • As quoted in University Chronicle. University of Michigan (27 March 1869) books.google.de, Daily Cleveland Herald (29 March 1869), McKean Miner (22 April 1869), and “Quote… Misquote” by Fred R. Shapiro in The New York Times (21 July 2008); similar remarks have long been attributed to Otto von Bismarck, but this is the earliest known quote regarding laws and sausages, and according to Shapiro’s research, such remarks only began to be attributed to Bismarck in the 1930s.

I was thinking just about the messiness rather than inspiring respect; but we think there is a lot to gain when we reveal the messiness of writing. Nevertheless, there are some messy first-first-first drafts that really ought not to see the light of day. We want to do a bit of writing ‘behind the curtain’, before we make the bits and pieces visible on our Commentpress site, themacroscope.org. We are all fans of Scrivener, too, for the way it allows the bits and pieces to be moved around, annotated, rejected, resurrected and so on. Two of us are Windows folks, the other a Mac person. We initially tried using Scrivener and Github, as a way of managing version control over time and to provide access to the latest version simultaneously. This worked fine, for about three days, until I detached the HEAD.

Who knew that decapitation was possible? Then, we started getting weird line breaks and dropped index cards. So we changed tack and moved our project into a shared Dropbox folder. We know that with Dropbox we absolutely can’t have more than one of us in the project at the same time. We started emailing each other to say, ‘hey, I’m in the project….now. It’s 2.05 pm’, but that got very messy. We installed yshout and set it up to log our chats. Now, we can just check to see who’s in, and leave quick memos about what we were up to.

Once we’ve got a bit of the mess cleaned up, we’ll push bits and pieces to our Commentpress site for comments. Then, we’ll incorporate that feedback back in our Scrivener, and perhaps re-push it out for further thoughts.

One promising avenue that we are not going down, at least for now, is to use Draft.  Draft has many attractive features, such as multiple authors, side-by-side comparisons, and automatic pushing to places such as WordPress. It even does footnotes! I’m cooking up an assignment for one of my classes that will require students to collaboratively write something, using Draft. More on that some other day.

Announcing a live-writing project: the Historian’s Macroscope, an approach to big digital history

Robert Hooke’s Microscope http://www.history-of-the-microscope.org

I’ve just signed a book contract today with Imperial College Press; it’s winging its way to London as I type. I’m writing the book with the fantastically talented Ian Milligan and Scott Weingart. (Indeed, I sometimes feel the weakest link – goodbye!).

It seems strangely appropriate, given the twitter/blog furor over the AHA’s statement recommending that graduate students embargo the online publication of their dissertations, for fear of harming their eventual monograph-from-dissertation chances. We were approached by ICP to write this book largely on the strength of our blog posts, social media presence, and key articles, many of which come from our respective dissertations. The book will be targeted at senior/advanced undergrads for the most part, as a way of unpeeling the tacit knowledge around the practice of digital history. In essence, we can’t all be part of, or initiate, fantastic multi-investigator projects like ChartEx or Old Bailey Online; in which case, what can the individual achieve in the realm of fairly-big data? Our book will show you.

One could reasonably ask, ‘why a book? why not a website? why not just continue adding to things like the Programming Historian?’. We wanted to write more than tutorials (although we owe an enormous debt to the Programming Historian team, whose example and project continue to inspire us). We wanted to make the case for why as much as explore the how, and we wanted to reach a broader audience than the digitally technosavvy. In our teaching, we’ve all experienced the pushback from students who are exposed to digital tools & media all the time; a book-length treatment normalizes these kinds of approaches so that students (and lay-people) can say, ‘oh, right, yes, these are the kinds of things that historians do’ – and then they’ll seek out the Programming Historian, Stack Overflow, and myriad other sites to develop their nascent skills. Another attraction of doing a book is that we recognize that editors add value to the finished product. Indeed, our commissioning editor sent our first attempt at a proposal out to five single-blind reviewers! This project is all the stronger for it, and I wish to thank those reviewers for their generous reviews.

One thing that we insisted upon from the start was that we were going to live-write the book, openly, via a Commentpress installation. I submitted a piece to the Writing History in the Digital Age project a few years ago. That project exposed the entire process of writing an edited volume. The number and quality of responses was fantastic, and we knew we wanted to try for that here. We argued in our proposal that this process would make the book stronger, save us from ourselves, and build a potential readership long before the book ever hit store shelves. We were astonished and pleased that ICP thought it was a great idea! They had no hesitation at all – thank you Alice! We’ve had long discussions about the relationship of the online materials to the eventual finished book, and wording to that effect is in the final contract. Does that mean that the final typeset manuscript will appear on the Commentpress site? No, but nor will the book’s materials be embargoed. None of us, including the Press, has tried something at this scale before. No doubt there will be hiccups along the way, but there’s a lot of goodwill built up and I trust that we will be able to work out any issues that may (will) arise.

We’re going to write this book over the course of one academic year. In all truthfulness, I’m a bit nervous about this, but the rationale is that digital tools and approaches can change rapidly. We want to be as up-to-date as possible, but we also have to be careful in our writing not to date ourselves either. That’s where all of you come in. As we put bits and parts up on The Historian’s Macroscope – Big Digital History, please do read and offer comments. Consider this an open invitation. We’d love to hear from undergraduate students. Some of these pieces I’m going to road test on my ‘HIST2809 Historian’s Craft’ students this autumn and winter. Ian, Scott, and I will be reflecting on the writing process itself (and my students’ experiences) on the blog portion of the live-writing website.

I’m excited, but nervous as hell, about doing this. Nervous, because this is a tall order. Excited, because it seems to me that the real transformative power of the digital humanities is not in the technology, but in a mindset that peels back the layers, to reveal the process underneath, that says it’s ok to tinker with the ways things have been done before.

Won’t you join us?

Shawn

The George Garth Graham Undergraduate Digital History Research Fellowship

My grandfather, George Garth Graham, in the 1930s.

At Carleton University, we have a number of essay awards for undergraduate history students. We do not have any awards geared towards writing history in new media, or doing historical research using digital tools, or any of the various permutations that would broadly fall within big-tent digital humanities.

So I decided to create an award, using the University’s micro-giving (crowdfunding) platform, Futurefunder.

I’m establishing this fellowship in tribute to my grandfather and the values he represented. George Garth Graham did not have any formal education after Grade 8. He educated himself through constant reading. One of my fondest memories is going through his stack of Popular Science and Popular Mechanics magazines, and making things with him in his basement workshop. Digital History is often about making, tinkering, and exploring, and this was something that my grandfather exemplified. He had a great love of history, showing my brothers and me around the area, telling us the stories of the land. He was generous with his time and would also quietly help those in need, never asking for nor expecting recognition for his contribution.

I’m calling this a ‘research fellowship’ rather than a scholarship because I want it to encourage future work, rather than reward past work. I intend this fellowship to be available to any History student who has taken the second year, required, HIST2809 Historian’s Craft course (a methods course). The student would have to have a certain GPA (appropriate to their year), and have a potential faculty member and project in mind (and I would help facilitate that). A committee of the department would adjudicate applications.

One of the conditions of the fellowship would be for the student to maintain an active research blog, where she or he would detail their work, their reflections, their explorations and experiments. It would become the locus for managing their digital online identity as a scholar. I imagine that holders of this fellowship would be well set-up to pursue further work in graduate programs in the digital humanities or in the digital media sector. I imagine opportunities for the students to publish with faculty (as did the students who worked on my 2011 ‘HeritageCrowd’ project, writinghistory.trincoll.edu). I know of no other undergraduate fellowship like this, in this field. Students who held such a post would not just be assistants, but potential leaders in the field.

For more details about the Fellowship, and how you can contribute to it, please see the Fellowship’s page on Futurefunder.

Historical Friction

edit June 6 – following on from collaboration with Stu Eve, we’ve got a version of this at http://graeworks.net/historicalfriction/

I want to develop an app that makes it difficult to move through the historically ‘thick’ places – think Zombies, Run!, but with a lot of noise when you are in a place that is historically dense with information. I want to ‘visualize’ history, but not bother with the usual ‘augmented reality’ malarkey where we hold up a screen in front of our face. I want to hear the thickness, the discords, of history. I want to be arrested by the noise, and to stop still in my tracks, be forced to take my headphones off, and to really pay attention to my surroundings.

So here’s how that might work.

1. Find wikipedia articles about the place where you’re at. Happily, inkdroid.org has some code that does that, called ‘Ici’. Here’s the output from that for my office (on the Carleton campus):

http://inkdroid.org/ici/#lat=45.382&lon=-75.6984

2. I copied that page (so not the full wikipedia articles, just the opening bits displayed by Ici). Then convert these wikipedia snippets into numbers: let A=1, B=2, and so on. This site will do that:

http://rumkin.com/tools/cipher/numbers.php

3. Replace dashes with commas. Convert those numbers into music. Musical Algorithms is your friend for that. I used the default settings, though I sped it up to 220 beats per minute. Listen for yourself here. There are a lot of wikipedia articles about the places around here; presumably if I did this on, say, my home village, the resulting music would be much less complex: sparse, quiet, slow. So if we increased the granularity, you’d start to get an acoustic soundscape of quiet/loud, pleasant/harsh sounds as you moved through space – a cost surface, a slope. Would it push you from the noisy areas to the quiet? Would you discover places you hadn’t known about? Would the quiet places begin to fill up as people discovered them?
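
(The letter-to-number step doesn’t need the cipher site at all; it’s a one-liner. A minimal sketch, with a made-up snippet standing in for whatever Ici returns:)

    # A=1, B=2, ... with commas between, ready for Musical Algorithms.
    # Anything that isn't a letter gets dropped.
    def text_to_numbers(text):
        return ','.join(str(ord(c) - ord('a') + 1)
                        for c in text.lower() if 'a' <= c <= 'z')

    snippet = "Carleton University is in Ottawa"  # stand-in for an Ici snippet
    print(text_to_numbers(snippet))  # 3,1,18,12,5,20,15,14,...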

Right now, each wikipedia article is played in succession. What I really need to do is feed the entirety of each article through the musical algorithm, and play them all at once. And I need a way to do all this automatically, and feed it to my smartphone. Maybe by building upon this tutorial from MIT’s App Inventor. Perhaps there’s someone out there who’d enjoy the challenge?

I mooted all this at the NCPH THATCamp last week – which prompted a great discussion about haptics, other ways of engaging the senses, for communicating public history. I hope to play at this over the summer, but it’s looking to be a very long summer of writing new courses, applying for tenure, y’know, stuff like that.

Edit April 26th – Stuart and I have been playing around with this idea this morning, and have been making some headway per his idea in the comments. Here’s a quick screengrab of it in action: http://www.screencast.com/t/DyN91yZ0

Practical Necromancy talk @Scholarslab – part I

Below is a draft of the first part of my talk for Scholarslab this week, at the University of Virginia. It needs to be whittled down, but I thought that those of you who can’t drop by on Thursday might enjoy this sneak peek.

Thursday, March 21 at 2:00pm
in Scholars’ Lab, 4th floor Alderman Library.

When I go to parties, people will ask me, ‘what do you do?’. I’ll say, I’m in the history department at Carleton. If they don’t walk away, sometimes they’ll follow that up with, ‘I love history! I always wanted to be an archaeologist!’, to which I’ll say, ‘So did I!’

My background is in Roman archaeology. Somewhere along the line, I became a ‘digital humanist’, so I am honoured to be here to speak with you today, here at the epicentre, where the digital humanities movement all began.

If the digital humanities were a zombie flick, somewhere in this room would be patient zero.

Somewhere along the line, I became interested in the fossilized traces of social networks that I could find in the archaeology. I became deeply interested – I’m still interested – in exploring those networks with social network analysis. But I became disenchanted with the whole affair, because all I could develop were static snapshots of the networks at different times. I couldn’t fill in the gaps. Worse, I couldn’t really explore what flowed over those networks, or how those networks intersected with broader social & physical environments.

It was this problem that got me interested in agent based modeling. At the time, I had just won a postdoc in Roman Archaeology at the University of Manitoba with Lea Stirling. When pressed about what I was actually doing, I would glibly respond, ‘Oh, just a bit of practical necromancy, raising the dead, you know how it is’. Lea would just laugh, and once said to me, ‘I have no idea what it is you’re doing, but it seems cool, so let’s see what happens next!’

How amazing to meet someone with the confidence to dance out on a limb like that!

But there was truth in that glib response. It really is a form of practical necromancy, and the connection with actual necromancy and technologies of death is a bit more profound than I first considered.

So today, let me take you through a bit of the deep history of divination, necromancy, and talking with the dead; then we’ll consider modern simulation technologies as a form of divination in the same mold; and then I’ll discuss how we can use this power for good instead of evil, how it fits into the oft-quoted digital humanities ethos of ‘hacking as a way of knowing’ (which is rather like experimental archaeology, when you think about it), and how I’m able to generate a probabilistic historiography through this technique.

And like all good necromancers, it’s important to test things out on unwilling victims, so I would also like to thank the students of HIST3812 who’ve had all of the ideas road-tested on them earlier this term.

Zombies clearly fill a niche in modern western culture. The president of the University of Toronto recently spoke about ‘zombie ideas’ that despite our best efforts, persist, infect administrators, politicians, and students alike, trying to eat the brains of university education.

Zombies emerge in popular culture in times of angst, fear, and uncertainty. If Hollywood has taught us anything, it’s that Zombies are bad news. Sometimes the zombies are formerly dead humans; sometimes they are humans who have been transformed. Sometimes we deliberately create a zombie. The zombie can be controlled, and made to do useful work; zombie as a kind of slavery. More often, the zombies break loose, or are the result of meddling with things humanity ought not to meddle with; apocalypse beckons. But sometimes, like ‘Fido’, a zombie can be useful, can be harnessed, and somehow, be more human than the humans. [Fido]

If you’d like to raise the dead yourself, the answer is always just a click away [ehow].

There are other uses for the restless dead. Before our current fixation with apocalypse, the restless dead could be useful for keeping the world from ending.

In video games, we call this ‘the problem space’ – what is it that a particular simulation or interaction is trying to achieve? For humanity, at a cosmological level, the response to that problem is through necromancy and divination.

I’m generalizing horribly, of course, and the anthropologists in the audience are probably gritting their teeth. Nevertheless, when we look at the deep history and archaeology of many peoples, a lot can be tied to this problem of keeping the world from ending. A solution to the problem was to converse with those who had gone before, those who were currently inhabiting another realm. Shamanism was one such response. The agony of shamanism ties well into subsequent elaborations such as the ball games of Mesoamerica, or other ‘game’-like experiences. The ritualized agony of the athlete was one portal into recreating the cosmogonies and cosmologies of a people, thus keeping the world going.

The bull-leaping game at Knossos is perhaps one example of this, according to some commentators. Some have seen in the plan of the middle Minoan phase of this palace (towards the end of the 2nd millennium BC) a replication in architecture of a broader cosmology, that its very layout reflects the way the Minoans saw the world (this is partly also because this plan seems to be replicated in other Minoan centres around the Aegean). Jeffrey Soles, pointing to the architectural play of light and shadow throughout the various levels of Knossos, argues that this maze-like structure was all part of the ecstatic journey, and ties shamanism directly to the agonies of sport & game in this location. We don’t have the Minoans’ own stories, of course, but we do have these frescoes of bull-leaping, and other paraphernalia which tie in nicely with the later dark-age myths of Greece.

So I’m making a connection here between the way a people see the world working, and their games & rituals. I’m arguing that the deep history of games  is a simulation of how the world works.

This carries through to more recent periods as well. Herodotus wrote about the coming of the Etruscans to Italy: “In the reign of Atys son of Manes there was a great scarcity of food in all Lydia. For a while the Lydians bore this with patience; but soon, when the famine continued, they looked for remedies, and various plans were suggested. It was then that they invented the games of dice, knucklebones, and ball, and all the other games of pastime, except for checkers, which the Lydians do not claim to have invented. Then, using their discovery to forget all about the famine, they would play every other day, all day, so that they would not have to eat… This was their way of life for eighteen years. Since the famine still did not end, however, but grew worse, the king at last divided the people into two groups and made them draw lots, so that one should stay and the other leave the country.”

Here I think Herodotus misses the import of the games: not as a pastime, but as a way of trying to control, predict, solve, or otherwise intercede with the divine, to resolve the famine. In later Etruscan and Roman society, gladiatorial games for instance were not about entertainment but rather about cleansing society of disruptive elements, about bringing everything into balance again, hence the elaborate theatre of death that developed.

The specialist never disappears though, the one who has that special connection with the other side and intercedes for broader society as it navigates that original problem space. These were the magicians and priests. But there is an important distinction here. The priest is passive in reading signs, portents, and omens. Religion is revealed, at its proper time and place, through proper observation of the rituals. The magician is active – he (and she) compels the numinous to reveal itself, the spirits are dragged into this realm; it is the magician’s skill and knowledge which causes the future to unfurl before her eye.

The priest was holy, the magician was unholy.

Straddling this divide is the Oracle. The oracle has both elements of revelation and compulsion. Any oracle worth its salt would not give a straight-up answer, either, but rather required layers of revelation and interpretation. At Delphi, the God spoke to the Pythia, the priestess, who sat on the stool over the crack in the earth. When the god spoke, the fumes from below would overcome her, causing her to babble and writhe uncontrollably. Priests would then ‘interpret’ the prophecy, in the form of a riddle.

Why riddles? Riddles are ancient. They appear on cuneiform texts. Even Gollum knew what a true riddle should look like – a kind of lyric poem asking a question that guards the right answer in hints and wordplay.

‘I tremble at each breath of air / And yet can heaviest burdens bear.’ [the implicit question being asked is ‘who am I?’ – water]

Bilbo cheated.

We could not get away from a discussion of riddles in the digital humanities without of course mentioning the I Ching. It’s a collection of texts that, depending on dice throws, get combined and read in particular ways. Because this is essentially a number of yes-or-no answers, the book can be easily coded onto a computer or represented mechanically. In which case, it’s not really a ‘book’ at all, but a machine for producing riddles.
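
(To make the point concrete: six yes-or-no throws are all it takes, which is why the ‘book’ mechanizes so easily. A toy sketch, with a plain coin toss per line rather than the traditional yarrow-stalk odds:)

    import random

    # Six yes/no throws, read bottom to top, pick one of 2**6 = 64 hexagrams.
    throws = [random.choice((0, 1)) for _ in range(6)]
    index = sum(bit << i for i, bit in enumerate(throws))  # 0..63

    print('lines (bottom to top):', throws)
    print('consult reading number:', index + 1)  # an index into the 64 texts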

Ruth Wehlau writes, “Riddlers, like poets, imitate God by creating their own cosmos; they recreate through words, making familiar objects into something completely new, rearranging the parts of pieces of things to produce creatures with strange combinations of arms, legs, eyes and mouths. In this transformed world, a distorted mirror of the real world, the riddler is in control, but the reader has the ability to break the code and solve the mystery” (Wehlau 1997).

Riddles & divination are related, and are dangerous. But they also create a simulation, of how the world can come to be, of how it can be controlled.

One can almost see the impetus for necromancy, when living in a world described by riddles. Saul visits the Witch of Endor; Odysseus goes straight to the source.

…and Professor Hix prefers the term ‘post mortem communications’. However you spin it, though, the element of compulsion, of speaking with the dead, marks it out as a transgression; necromancers and those who seek their aid never end well.

It remains true today that those who practice simulation are similarly held in dubious regard. If that were not the case, tongue-in-cheek article titles such as this would not be necessary.

I am making the argument that modern computational simulation, especially in the humanities, is more akin to necromancy than it is to divination, for all of these reasons.

But it’s also the fact that we do our simulation through computation itself that marks this out as a kind of necromancy.

The history of the modern digital computer is tied up with the need to accurately simulate the yields of atomic bombs,  of blast zones, and potential fallout, of death and war. Modern technoculture has its roots in the need to accurately model the outcome of nuclear war, an inversion of the age old problem space, ‘how can we keep the world from ending’ through the doctrines of mutually assured destruction.

The playfulness of those scientists, and the acceleration of hardware technology, led to video games, but that’s a talk for another day (and indeed, has been recently well treated by Rob MacDougall of Western University).

‘But wait! Are you implying that you can simulate humans just as you could individual bits of uranium and atoms, and so on, like the nuclear physicists?’ No, I’m not saying that, but it’s not for nothing that Isaac Asimov gave the world Hari Seldon & the idea of ‘psychohistory’ in the 1950s. As Wikipedia so ably puts it, “Psychohistory is a fictional science in Isaac Asimov’s Foundation universe which combines history, sociology, etc., and mathematical statistics to make general predictions about the future behavior of very large groups of people, such as the Galactic Empire.”

Even if you could do Seldon’s psychohistorical approach, it’s predicated on a population of an entire galaxy. One planetfull, or one empire-full, or one region-full, of people just isn’t enough. Remember, this is a talk on ‘practical’ necromancy, not science-fiction.

Well what about so-called ‘cliodynamics’? Cliodynamics looks for recurring patterns in aggregate statistics of human culture. It may well find such patterns, but it doesn’t really have anything to say about ‘why’ such patterns might emerge. Both psychohistory and cliodynamics are concerned with large aggregates of people. As an archaeologist, all I ever find are the traces of individuals, of individual decisions in the past. It always requires some sort of leap to jump from these individual traces to something larger like ‘the group’ or ‘the state’. A Roman aqueduct is, at base, still the result of many individual actions.

A practical necromancy therefore is a simulation of the individual.

There are many objections to simulation of human beings, rather than things like atoms, nuclear bombs, or the weather. Our simulations can only do what we program them to do. So they are only simulations of how we believe the world works (ah! Cosmology!). In some cases, like weather, our beliefs and reality match quite well, at least for a few days, and we know much about how the variables intersect. But, as complexity theory tells us, starting conditions strongly affect how things transpire. Therefore we forecast from multiple runs with slightly different starting conditions. That’s what a 10% chance of rain really means: We ran the simulation 100 times, and in 10 of them, rain emerged.
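
(That forecasting logic fits in a dozen lines. A toy sketch of my own, with no pretence to real meteorology: one trivial model, run 100 times from jittered starting conditions.)

    import random

    def one_run(start_humidity):
        """One simulated day: a random walk on a made-up 'humidity' value."""
        h = start_humidity
        for _ in range(24):                   # 24 hourly steps
            h += random.uniform(-0.05, 0.05)
        return h > 0.7                        # call it 'rain' if we end high

    # 100 runs, each from a slightly different starting condition.
    rainy = sum(one_run(random.uniform(0.45, 0.55)) for _ in range(100))
    print(f'{rainy}% chance of rain')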

And humans are a whole lot more complex than the water cycle. In the case of humans, we don’t know all the variables; we don’t know how free will works; we don’t know how a given individual will react; we don’t understand how individuals and society influence each other. We do have theories though.

This isn’t a bug, it’s a feature. The direction of simulation is misplaced. We cannot really simulate the future, except in extremely circumscribed situations, such as pedestrian flow. So let us not simulate the future, as humanists. Let us create some zombies, and see how they interact. Let our zombies represent individuals in the past. Give these zombies rules for interacting that represent our best beliefs, our best stories, of how some aspect of the past worked. Let them interact. The resulting range of possible outcomes becomes a kind of probabilistic historiography. We end up with not just a story about the past, but also about other possible pasts that could have happened if our initial story we are telling about how individuals in the past acted is true, for a given value of true.

 We create simulacra, zombies, empty husks representing past actors. We give them rules to be interpreted given local conditions. We set them in motion from various starting positions. We watch what emerges, and thus can sweep the entire behavior space, the entire realm of possible outcomes given this understanding. We map what did occur (as best as we understand it) against the predictions of the model. For the archaeologist, for the historian, the strength of agent based modeling is that it allows us to explore the unintended consequences inherent in the stories we tell about the past. This isn’t easy. But it can be done. And compared to actually raising the dead, it is indeed practical.
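
(Stripped to its bones, the approach looks something like the following. This is a deliberately minimal sketch of my own, not one of my published models: fifty ‘zombie’ agents, one imitation rule, one hundred runs to sweep the space of outcomes.)

    import random

    def one_history(num_agents=50, steps=5000, seed=None):
        """One possible past: agents randomly imitate each other's practice."""
        rng = random.Random(seed)
        practice = [rng.choice('AB') for _ in range(num_agents)]
        for _ in range(steps):
            i, j = rng.sample(range(num_agents), 2)
            practice[i] = practice[j]         # agent i adopts agent j's way
        return practice.count('A') / num_agents

    # One hundred runs, one hundred possible pasts: a behaviour space,
    # not a single prediction.
    outcomes = [one_history(seed=s) for s in range(100)]
    print('share of A at the end, across runs:',
          min(outcomes), 'to', max(outcomes))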

[and here begins part II, which runs through some of my published ABMS, what they do, why they do it. All of this has to fit within an hour, so I need to do some trimming.]

[Postscriptum, March 23: the image of the book of random digits came from Mark Sample’s ‘An Account of Randomness in Literary Computing’, & was meant to remind me to talk about some of the things Mark brought up. As it happens, I didn’t do that when I presented the other day, but you really should go read his post.]

Introducing Voyant in a History Tutorial

This week my HIST2809 students are encountering digital history, as part of their ‘Historian’s Craft’ class (an introduction to various tools & methods). As part of the upcoming assignment, I’m having them run some history websites through Voyant, as a way of sussing out how these websites craft a particular historical consciousness. Each week, there’s a two-hour lecture and one hour of tutorial where the students lead discussions given the lecture & assigned readings. For this week, I want the students to explore different flavours of Digital History – here are the readings:

“Possible discussion questions: How is digital history different? In ten years, will there still be something called ‘digital history’ or will all history be digital? Is there space for writing history through games or simulations? How should historians cope with that? What kind of logical fallacies would such approaches be open to?”

To help the TAs bring the students up to speed with using Voyant, I’ve suggested to them that they might find it fun/interesting/useful/annoying to run one of those papers through Voyant. Here’s a link to the ‘Interchange’ article, loaded into Voyant:

http://voyant-tools.org/?corpus=1363622350848.367&stopList=stop.en.taporware.txt

The TAs could put that up on the screen, click on various words in the word cloud, to see how the word is used over the course of a single article (though in this case, there are several academics speaking, so the patterns are in part author-related). Click on ‘scholarship’ in the word cloud, and you get a graph of its usage on the right – the highest point is clickable (‘segment six’). Click on that, and the relevant bit of text appears in the middle, as Bill Turkel talks about the extent to which historical scholarship should be free. On the bottom left, if you click on ‘words in the entire corpus’, you can select ‘access’ and ‘scholarship’, which will put both of them on the graph

( http://voyant-tools.org/tool/TypeFrequenciesChart/?corpus=1363622350848.367&docIdType=d1363579550728.b646f3e3-65d1-2347-c580-5e5c0985e6d0%3Ascholarship&docIdType=d1363579550728.b646f3e3-65d1-2347-c580-5e5c0985e6d0%3Aaccess&stopList=stop.en.taporware.txt&mode=document&limit=2 )

and you’ll see that the two words move in perfect tandem, so the discussion in here is all about digital tools opening access to scholarship – except in segment 8. The question would then become, why?
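
(Behind that trend graph sits a very simple computation: chop the text into equal segments and count the term in each. A rough sketch, with a hypothetical filename standing in for the Interchange article:)

    # Term frequency per segment: roughly what Voyant's trend line plots.
    words = open('interchange.txt', encoding='utf-8').read().lower().split()

    def trend(term, segments=10):
        size = len(words) // segments
        return [words[k * size:(k + 1) * size].count(term)
                for k in range(segments)]

    print('scholarship:', trend('scholarship'))
    print('access:     ', trend('access'))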

….so by doing this exercise, the students should get a sense of how looking at macroscopic patterns involves jumping back to the close reading we’re normally familiar with, then back out again, in an iterative process, generating new questions all along the way. An hour is a short period of time, really, but I think this would be a valuable exercise.

(I have of course made screen capture videos walking the students through the various knobs and dials of Voyant. This is a required course here at Carleton. 95 students are enrolled. 35 come to every lecture. Approximately 50 come to the tutorials. Roughly half the class never comes…. in protest that it’s a requirement? apathy? thinking they know how to write an essay so what could I possibly teach them? That’s a question for another day, but I’m fairly certain that the next assignment, as it requires careful use of Voyant, is going to be a helluva surprise for that fraction.)

Thinking out loud: language re tenure guidelines for the Digital Historian

I’ve not actually seen this.

At my university, we’ve been asked to consider discipline-specific language for new tenure & promotion guidelines. I’ve been writing a response to our chair, and I thought, in keeping with how I regard this problem, it would be a good idea to share these thoughts.

Onwards.

The 1.4 edition of the Journal of Digital Humanities wrestles with the problem of evaluating digital scholarship for tenure http://journalofdigitalhumanities.org/volumes/  (or download as pdf: http://journalofdigitalhumanities.org/files/jdh_1_4.pdf )

Moving Goalposts & Scholarship as Processes
As far as discipline-specific guidelines are concerned, the problem, from my perspective, is that the goalposts are always going to be shifting. What was fairly technically demanding becomes easier with time, and so the focus shifts from ‘can we do x’ to ‘what are the implications of x for y’, or, as Bethany Nowviskie put it, a shift from the 18th century ‘Lunaticks‘ who laid the groundwork to the 19th century science and industrialization that followed. Another problem is that in digital work, the lone scholar is very much the outlier. To achieve anything worthwhile takes a team – and who gets to be first author does not necessarily reflect the way the work was divvied up or undertaken. We should resist trying to shoehorn digital work into boxes meant for a different medium. Nowviskie writes,

“The danger here … is that T&P committees faced with the work of a digital humanities scholar will instigate a search for print equivalencies — aiming to map every project that is presented to them, to some other completed, unary and generally privately-created object (like an article, an edition, or a monograph). That mapping would be hard enough in cases where it is actually appropriate…”

She goes on to say,

“…the new responsibility of tenure and promotion committees [is] to assess quality in digital humanities work — not in terms of product or output — but as embodied in an evolving and continuous series of transformative processes.”
(http://journalofdigitalhumanities.org/1-4/evaluating-collaborative-digital-scholarship-by-bethany-nowviskie/)

This was the gist of Bill Turkel’s address to the Underhill Graduate Students Colloquium on ‘doing history in real time’ – that the unique value, in an increasingly digital world, of formal academic knowledge is not about things per se, but rather about method. You can look up any fact in the world in seconds. But learning how to think, how to query, how to judge between competing stories – that’s what we bring.  That then is the problem for assessing digital work as part of tenure and promotion: how does this work change the process?

That suggests a hierarchy of importance, too. Merely putting things online, while important, is not necessarily transformative unless that kind of material has never been digitized before. Then the conversation also becomes about how that work was done, the decisions made, the relationship between the digital object and the physical one. I have a student working on a project, for instance, to put together an online exhibition related to Black History in Canada. This is important, but the exhibition itself is not transformative. The real scholarship, the real transformation, happens when she starts exploring those materials through text analysis, putting a macroscopic lens on the whole corpus of materials that she has collected.

Digital Work is Public Work
The other important point about process is that digital work almost always (99.9 times out of 100; my early agent modeling work had no internet presence, for instance) has a public, outward-looking face. Platforms like blogs allow for public engagement with our work – so digital work is a kind of public humanities. The structure of the internet, of how its algorithms find and construct knowledge and serve that up to us via Google, is such that work that is valuable and of interest creates a bigger noise in a positive feedback loop. The best digital work is done in public. ‘Public’ should not be a dirty word along the lines of ‘popular’. The internet looks different to each person who goes online (and our algorithms make sure that each person sees a personalized internet, because that’s how one makes money online), so hits on a blog post are not random meaningless clicks but rather an engagement with a broader community. As far as academic blogging goes, that broader community is other academics and students. Print journals & peer reviewed articles are just one way of engaging with our chosen communities. With post-publication models of peer review like Digital Humanities Now and the Journal of Digital Humanities (models that are making inroads in other disciplines), we should treat these on an equal footing with the more familiar models. I’d argue that post-publication peer review is a greater indicator of significance and value than the regular two-blind-reviewers-into-print model.

I’d like to see language, then, that regards digital work, or work in media other than print, as being on an equal footing with the more familiar forms. That is, as things that do not have equivalencies to what we traditionally expect and thus must be taken on their own terms. I appreciate that I’m pretty much the only person in this department that any of this might apply to, for the time being. I would hate to see my work on topic modeling, though, get considered as ‘service’. Figuring out how to apply natural language processing to vast corpora of historical materials, figuring out the ways the code forces particular worldviews and hides others, and writing all of this up as a ‘how-to’ guide is indeed research. It’s akin to figuring out how gene-sequencing works, its limitations, etc., which needs to be well understood before a biologist can use it to link modern humans to Neanderthals. We understand both of those activities as research, in biology; in history, we’d only understand the second as research, if the example were instead the limits and potentials of topic modeling for exploring discourses in the political thought of the 18th century. I bring this up because of Sean Takats’ experience at George Mason:
http://quintessenceofham.org/2013/02/07/a-digital-humanities-tenure-case-part-2-letters-and-committees/

Project Management & Project Outputs

In that particular case, Takats was also managing major development projects to develop various tools and approaches. He writes,

“I want to focus on the committee’s disregard for project management, because it’s here I think that we find evidence of a much broader communication breakdown between DH and just-H, despite the best efforts to develop reasonable standards for evaluating digital scholarship. Although the committee’s letter effectively excludes “project management” from consideration as research, I would argue that it’s actually the cornerstone of all successful research. It’s project management that transforms a dissertation prospectus into a thesis, and it’s certainly project management that shepherds a monograph from proposal to published book. Fellow humanists, I have some news for you: you’re all project managers, even if you only direct a staff of one.”

Which leads me to my next point. Digital work creates all sorts of outputs that are of use at many different stages to other researchers. These outputs should be considered as valuable publications in their own right. An agent based simulation of emergent social structures in the early Iron Age makes an argument in code about how the Roman world worked. If I publish a discussion of the results of such a model, that is fine; but if I don’t make that code available for someone else to critique, extend, or transform, I am being academically dishonest. The time and process it takes to build a model that works, that is valid, that simulates something important, is considerable. The data that such a model produces is valuable for others looking to re-build a model of the same phenomena in another platform (which is crucial to validating the truth-content of models). All of these sorts of outputs can be made available online in digital archives built for the purpose of long-term storage. The number of times such models are downloaded or discussed online can often be measured; these measures should also be taken into account as a kind of citation (see http://figshare.com/authors/Shawn_Graham/97736 ).

Experimentation and Risk Taking
Finally, I think that work that is experimental, that discusses what didn’t work, should be recognized and celebrated. Todd Presner writes, (http://journalofdigitalhumanities.org/1-4/how-to-evaluate-digital-scholarship-by-todd-presner/ )

“Digital projects in the Humanities, Social Sciences, and Arts share with experimental practices in the Sciences a willingness to be open about iteration and negative results. As such, experimentation and trial-and-error are inherent parts of digital research and must be recognized to carry risk. The processes of experimentation can be documented and prove to be essential in the long-term development process of an idea or project. White papers, sets of best practices, new design environments, and publications can result from such projects and these should be considered in the review process. Experimentation and risk-taking in scholarship represent the best of what the university, in all its many disciplines, has to offer society. To treat scholarship that takes on risk and the challenge of experimentation as an activity of secondary (or no) value for promotion and advancement, can only serve to reduce innovation, reward mediocrity, and retard the development of research.”

One of my blog posts, ‘How I Lost the Crowd‘, discusses how one of my projects got hacked. That piece was read by some 400 people shortly after it was posted – and it later found its way into various digital history syllabi (for instance, here). This post has been read over 700 times in the past 10 months. Failing in public is where research and teaching are the same side of the same coin (he said, to mangle a metaphor).

So what should one look for?

Work that is transformative; where multi-authored work is valued as much as the single-author opus; work that is outward facing and is recognized by others through linking, reposting, sharing (and other so-called ‘alt-metrics’; cf http://impactstory.org/ for one attempt to pull these all together); data-as-publication; code-as-publication; experiments and risk-taking and open discussion of what does and what does not work; software development & project management recognized as research; and any work that lays the groundwork for others to see further – the humble ‘how to’ (our lunatick moment; see for instance http://programminghistorian.org ).

For explicit guidelines on how to evaluate digital work, see Rockwell, http://journalofdigitalhumanities.org/1-4/short-guide-to-evaluation-of-digital-work-by-geoffrey-rockwell/

Considering any digital work, Rockwell suggests the following questions:

  • Is it accessible to the community of study?
  • Did the creator get competitive funding? Have they tried to apply?
  • Have there been any expert consultations? Has this been shown to others for expert opinion?
  • Has the work been reviewed? Can it be submitted for peer review?  (things like Digital Humanities Now, & JDH are crucial here)
  • Has the work been presented at conferences?
  • Have papers or reports about the project been published?  (whether online or print, born-digital or otherwise is not the issue here)
  • Do others link to it? Does it link out well?
  • If it is an instructional project, has it been assessed appropriately?
  • Is there a deposit plan? Will it be accessible over the longer term? Will the library take it?

I’m not saying that we should build this checklist into any tenure and promotion language; rather I’m offering it here to suggest that any such language, if it broadly considers such things, will probably be ok, in the hopes of finding an acceptable middle ground between the box-tickers and the non-boxtickers. Rockwell offers some best practices for carrying out digital work, that speak to these questions:

  • Appropriate content (What was digitized?)
  • Digitization to archival standards (Are images saved to museum or archival standards?)
  • Encoding (Does it use appropriate markup like XML or follow TEI guidelines?)
  • Enrichment (Has the data been annotated, linked, and structured appropriately?)
  • Technical Design (Is the delivery system robust, appropriate, and documented?)
  • Interface Design and Usability (Is it designed to take advantage of the medium? Has the interface been assessed? Has it been tested? Is it accessible to its intended audience?)
  • Online Publishing (Is it published from a reliable provider? Is it published under a digital imprint?)
  • Demonstration (Has it been shown to others?)
  • Linking (Does it connect well with other projects?)
  • Learning (Is it used in a course? Does it support pedagogical objectives? Has it been assessed?)

****

This is of course a thinking-out-loud exercise, and will no doubt change. Thoughts?