Text Analysis of 2012 Digital Humanities Job Adverts part 2

If we look at simple word frequencies in the 2012 job advertisement documents for Digital Humanities, we find these top words and raw frequency counts:

research        650
university      577
experience      499
library         393
work            334
information     303
position        299
project         269
applications    257

(I’ve deleted ‘digital’ and ‘humanities’ from this list).
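A minimal sketch of how a count like this could be produced, assuming the adverts are available as plain strings. The two sample sentences and the function name are made up for illustration; the real corpus is not reproduced here. Note that this does no stop-word filtering beyond the two deleted words, so the numbers it gives would not match the table above.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the job advert texts (not the real corpus).
adverts = [
    "Digital Humanities research position at the university library.",
    "The university seeks research experience in digital projects.",
]

# The two words dropped from the list in the post.
EXCLUDED = {"digital", "humanities"}

def word_frequencies(texts, excluded=EXCLUDED):
    """Count lowercase word tokens across all texts, skipping excluded words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in excluded)
    return counts

freqs = word_frequencies(adverts)
for word, n in freqs.most_common(3):
    print(word, n)
```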

If job advertisements are a way of signalling what an institution hopes the future will hold, one gets the sense that the focus of digital humanities work will be on projects, on research, in conjunction with libraries. But we can extract more nuance, using network analysis. You can feed the texts into Voyant’s ‘RezoViz’ tool, which extracts paired nouns in each document.

This can be exported as a .net file and then imported into Gephi. The resulting graph has 1461 nodes and 20649 edges. Of course, there are some duplicates (like ‘US’ and ‘United States’), but this is only meant to be rough and ready, ‘generative’, as it were. Note also that a network visualization is not necessary for the analysis. So no spaghetti balls. What’s important are the metrics. What I’d like to find out is which concepts are doing the heavy lifting in these job advertisements. What is the hidden structure of the future of digital humanities, as evidenced by job advertisements in the English-speaking world?
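To make the pairing step concrete, here is a rough sketch that ties together terms appearing in the same document and writes the result in Pajek .net form, which Gephi can read. The entity lists are hypothetical, and weighting an edge by the number of documents a pair shares is my assumption, not a description of what RezoViz actually does internally.

```python
from itertools import combinations
from collections import Counter

# Hypothetical entity lists, one per advert (not RezoViz's real output).
doc_entities = [
    ["New York", "METS", "United States"],
    ["METS", "United States"],
]

def cooccurrence_edges(docs):
    """Weight each unordered pair by the number of documents containing both."""
    weights = Counter()
    for entities in docs:
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += 1
    return weights

def to_pajek(weights):
    """Serialise weighted pairs as Pajek .net text (vertex ids start at 1)."""
    labels = sorted({name for pair in weights for name in pair})
    ids = {name: i for i, name in enumerate(labels, start=1)}
    lines = [f"*Vertices {len(labels)}"]
    lines += [f'{ids[name]} "{name}"' for name in labels]
    lines.append("*Edges")
    lines += [f"{ids[a]} {ids[b]} {w}" for (a, b), w in sorted(weights.items())]
    return "\n".join(lines) + "\n"

edges = cooccurrence_edges(doc_entities)
net_text = to_pajek(edges)
print(net_text)
```

Merging duplicates such as ‘US’ and ‘United States’ would have to happen before the pairing step; this sketch leaves them alone, as the analysis here does.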

My suspicion is that ‘modularity’ (aka ‘community detection’) and ‘betweenness centrality’ are going to be the key metrics for figuring this out. Modularity groups nodes on the basis of shared similar local patternings of ties (or, to put it another way, it decomposes the global network into maximal subnetworks). Seth Long recently did some network analysis on the Unabomber’s manifesto, and lucidly explains why betweenness centrality is a useful metric for understanding semantic meaning: “A word with high betweenness centrality is a word through which many meanings in a text circulate.” In other words, the heavy lifters.
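For readers who want to see what the two metrics compute, here is a small pure-Python sketch on a toy graph: two triangles joined by a single bridge. The node names are placeholders. The betweenness function follows Brandes’ algorithm for unweighted, undirected graphs. The modularity function only scores a partition you hand it; it does not search for the best partition the way Gephi’s modularity routine does, so it is a simplified stand-in rather than the method used in this post.

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm; adj maps each node to a set of neighbours."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}   # -1 means not yet reached
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}

def modularity(adj, communities):
    """Newman's Q for a given partition of an unweighted, undirected graph."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    q = 0.0
    for group in communities:
        inside = sum(1 for v in group for w in adj[v] if w in group) / 2
        degree = sum(len(adj[v]) for v in group)
        q += inside / m - (degree / (2 * m)) ** 2
    return q

# Toy graph: triangle a-b-c, triangle d-e-f, bridge c-d.
pairs = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
adj = {}
for u, v in pairs:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

bc = betweenness(adj)
q = modularity(adj, [{"a", "b", "c"}, {"d", "e", "f"}])
```

In the toy graph, the two bridge nodes carry every shortest path between the triangles, so they are the only nodes with a non-zero score. That is the sense in which a high-betweenness word is a ‘heavy lifter’.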

So let’s peer into the future.

I ended up with about 15 groups. The first three groups by modularity account for 75% of the nodes, and 80% of the ties. These are the groups where the action lies. So let’s look at words with the highest betweenness centrality scores for those first three groups.
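The bookkeeping in this step can be sketched as follows, assuming the node table has been exported from Gephi with a modularity group and a betweenness score per label. The labels, scores, and function names below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical node table: label -> (modularity group, betweenness score).
nodes = {
    "alpha": (0, 12.0),
    "beta": (0, 3.5),
    "gamma": (1, 8.0),
    "delta": (1, 0.0),
    "epsilon": (2, 1.0),
}

def top_by_group(node_table, k=10):
    """Return the k labels with the highest betweenness in each group."""
    groups = defaultdict(list)
    for label, (group, score) in node_table.items():
        groups[group].append((score, label))
    return {g: [label for _, label in sorted(members, reverse=True)[:k]]
            for g, members in groups.items()}

def node_share(node_table, group_ids):
    """Fraction of all nodes that fall in the given groups."""
    wanted = sum(1 for group, _ in node_table.values() if group in group_ids)
    return wanted / len(node_table)

top = top_by_group(nodes, k=1)
share = node_share(nodes, {0, 1})
```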

The First Group

METS (Metadata encoding and transmission standard)
United States
New York

‘University’ is not surprising, and not useful. So let us discard it and bring in the next highest word:


This one group by modularity also has all of the highest betweenness centrality scores – and it reads like a laundry list of the skills a budding DH practitioner must hold. The US and New York would seem to be the centre of the world, too.

If we take the next ten words, we get:

MODS (Metadata Object Description Schema)
University Libraries
CLIR (Council on Library and Information Resources)
University of Alberta
North America
Duke University

Again, skills and places figure – in Canada, U of A appears. So far, the impression is that DH is all about text, markup, and metadata. Our favorite programming languages are Python and Ruby. We use PHP, XHTML, XML, and Drupal (plain-jane vanilla HTML eventually turns up in the list, but it’s buried very, very deep).

So that’s an impression of the first group (remembering that groups are defined by patterns of similarity in their linkages).

The Second Group

The next group looks like this:
Digital Humanities
Department of Digital Humanities
Department of History

“digital humanities” is probably not helpful, so let’s eliminate that and go one more down: “US”. Indeed, let’s take a look at the next ten, too:

Human Resources
Computer Science
Head of School
Faculty of Humanities
University of Amsterdam

Here, we’re dealing very much with a UK, Ireland, and European focus. The ‘BCE’ is telling, for it suggests an archaeological focus in there, somewhere (unless this is some new DH acronym of which I’m not aware; I’m assuming ‘before the common era’).

The Third Group

In the final group we’ll consider here, we find a strong Canadian focus:

CRC (Canada Research Chair)
TEI (Text Encoding Initiative)
Canada Research Chair
Digital Humanities Summer Institute
University of Victoria

Since we’ve got some duplication in here, let’s look at the next ten:

ETCL (Electronic Textual Cultures Laboratory, U Victoria)
University of Waterloo
DHSI (Digital Humanities Summer Institute)
Faculty of Arts
Stratford Campus

‘Canada Research Chairs’ are well-funded government appointments, and so give an indication of where the state would like to see some research. Victoria continually punches above its weight, with look-ins from Waterloo and Concordia.

So what have we learned? Well, despite the efforts of the digital history community, ‘digital humanities’ is still largely a literary endeavor – although it’s quite possible that a lot of the marking up that these job advertisements envision could be of historical documents. Invest in some Python skills (see Programming Historian). My friend in government tells me that if you can data mine, you’ll be set for life, as the government is looking for those skills. (Alright, that didn’t come out in this analysis at all, but he’s looking over my shoulder right now.)

Finally – London, Dublin, New York, Edmonton, Victoria, Waterloo, Montreal – these seem to be the geographic hotspots. Speaking of temperature, Victoria has the nicest weather. Go there, young student!

Or come to Carleton and study with me.  We’ve got tunnels.

Update March 4th: [Figure: jobs-topics-dh as a network graph] In the analysis above, I generated a network using Voyant’s RezoViz tool. Today, I topic modelled all of the texts, looking for 10 topics, so a slightly different approach. I turned the resulting document composition (i.e. doc 1 is 44% topic 1, 22% topic 4, 10% topic 3, etc.) into a two-mode graph, job advert to top two constituent topics. I then turned this into a one-mode graph where job adverts are tied to other job adverts based on topic composition. Then I ran modularity, and found 3 groups; edges are percent composition by topics discerned through topic modeling. Nodes reflect ‘betweenness centrality’. Most between? George Mason University. I’m not yet sure what ‘betweenness centrality’ means in this context, though.

Makes for interesting clusters of job adverts. Topic model results to be discussed tomorrow.
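The two-mode to one-mode step in the update can be sketched like this. The proportions are invented, and the edge weight (the smaller of the two adverts’ proportions for each shared topic, summed) is only one plausible choice that I am assuming for illustration; the update does not spell out the exact weighting.

```python
from itertools import combinations

# Hypothetical document-topic proportions (topic id -> share of the advert).
doc_topics = {
    "advert_1": {1: 0.44, 4: 0.22, 3: 0.10},
    "advert_2": {1: 0.50, 3: 0.30, 4: 0.05},
    "advert_3": {2: 0.60, 4: 0.25, 1: 0.05},
}

def top_two(doc_topics):
    """Two-mode edges: each advert linked to its two heaviest topics."""
    return {
        doc: dict(sorted(topics.items(), key=lambda kv: kv[1], reverse=True)[:2])
        for doc, topics in doc_topics.items()
    }

def project(two_mode):
    """One-mode edges: adverts are tied when they share a top topic.
    The weighting scheme here is an assumed choice, not the post's."""
    edges = {}
    for a, b in combinations(sorted(two_mode), 2):
        shared = two_mode[a].keys() & two_mode[b].keys()
        if shared:
            edges[(a, b)] = sum(min(two_mode[a][t], two_mode[b][t]) for t in shared)
    return edges

two_mode = top_two(doc_topics)
edges = project(two_mode)
```

With these made-up numbers, advert_1 ends up tied to both of the others while advert_2 and advert_3 share no top topic, which makes advert_1 the only ‘between’ node in the projected graph.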
