The Day of Archaeology is modeled after the Day of Digital Humanities. Archaeologists from around the world take a few moments to blog about what they’re doing, right now. This year, it was on June 29th. It’s a fascinating window into a fascinating profession. As an archaeologist-cum-digital-humanities person, the obvious thing to do with all of this info (over 700 individual archaeologists; over 300 individual posts of some 250-500 words each, at least) is to mine it, to analyze it, to topic model it. What are the discourses of practicing archaeologists?
- The first thing is to collect all of the information. I’m using OutWit Hub to scrape every post. Now scraping can be morally dubious, but happily the organizers of DoA and all of its contributors agree to a Creative Commons attribution licence. Designing a scraper involves looking at the source code for the page, figuring out the page structure, and identifying the tags that enclose the information that one wants to collect. Then, OutWit can be sent forth to work through each page in succession. The resulting information can then be exported into Excel; I save it as a CSV file so that I can then do further work with it. (The CSV file may be downloaded here.)
- I then use a macro in Excel to save each individual row as an individual text file, into a separate folder. This folder can be zipped and uploaded into Voyant Tools.
- Finally, I can point the Mallet Java GUI at the original csv file and topic model all of the posts (400 iterations, with 40 topic words and topic proportion threshold 0.05, for 20 topics); I then visualize the interrelationships of the topics using Gephi.
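If you would rather not use an Excel macro for the second step, the row-to-file split can be sketched in a few lines of Python. This is only a rough equivalent, not the macro I used: the column name `post` and the `post_0001.txt` naming scheme are assumptions, so change them to match whatever your scraper exported.

```python
import csv
from pathlib import Path


def split_csv_to_text_files(csv_path, out_dir, text_column="post", encoding="utf-8"):
    """Write the text_column of every CSV row to its own .txt file.

    text_column is a hypothetical column name; use the header your CSV has.
    Returns the list of files written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with open(csv_path, newline="", encoding=encoding) as handle:
        for index, row in enumerate(csv.DictReader(handle), start=1):
            text = (row.get(text_column) or "").strip()
            if not text:
                continue  # skip rows that have no post text
            target = out / f"post_{index:04d}.txt"
            target.write_text(text, encoding=encoding)
            written.append(target)
    return written
```

Calling `split_csv_to_text_files("posts.csv", "posts_txt")` would leave one text file per post in `posts_txt`, which can then be zipped for Voyant Tools.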
In this post, I’m going to provide you with links to the data in various tools, and give a first pass over the data. Then, why not play with this info for yourself, and see what you find? Perhaps we can all crowdsource an article out of this; comments and findings in the comments please!
Let’s begin, shall we?
This is what I find. First, the topics (related files here):
List of Topics
Archaeologists love the communities they work in and the people that they work with. This I think is evident from the number of topics that could be labelled ‘public archaeology’, like #20, 14, and 11. You can click through the topics above to read the DoA posts that are composed of these topics; each listing indicates what percentage of a given post is made up of the various topics. From this, you can begin to choose your own adventure through the Day of Archaeology.
I can also take that information, and represent it as a network where each document is linked to its highest percentage topic. Keeping in mind all of the caveats that such an approach entails (see Scott Weingart’s salutary warnings), we end up with a map of the mental geography of topics to posts; a mental geography of archaeological discourse. Interestingly, the top three topics holding it all together are 13, 17, and 10. The first two would seem to be topics related to the mundane everyday tasks that archaeologists do; topic 10 seems to relate to how we teach the discipline.
The Gephi file may be downloaded here, so that you can explore this data for yourself. I ran the modularity routine to detect any ‘communities’ of thought in the topics/posts. The colours in the image below correspond to community; the size of the node relates to betweenness. In the Gephi file, you can filter the data table by ‘modularity’ to see which posts and topics are in which community. According to this routine, there are roughly 13 communities of thought across 335 posts. Where does your post fit in?
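For anyone who wants to rebuild that network, here is a minimal sketch of the linking step. It assumes the topic proportions have already been read into a dictionary of the form `{document: {topic: proportion}}`; that structure, and the function names, are my own illustration rather than anything MALLET produces directly. The edge list uses `Source`, `Target`, and `Weight` headers, which Gephi's CSV importer understands.

```python
import csv


def top_topic_edges(doc_topics):
    """Link each document to the single topic with its highest proportion.

    doc_topics is a hypothetical mapping: {document: {topic: proportion}}.
    Returns a list of (document, topic, proportion) tuples.
    """
    edges = []
    for doc, proportions in doc_topics.items():
        if not proportions:
            continue  # a document with no topic scores gets no edge
        topic, weight = max(proportions.items(), key=lambda item: item[1])
        edges.append((doc, topic, weight))
    return edges


def write_edge_csv(edges, path):
    """Save the edges as a CSV edge list for import into Gephi."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Source", "Target", "Weight"])
        for doc, topic, weight in edges:
            writer.writerow([doc, f"topic_{topic}", weight])
```

Keeping only the top topic per post throws away a lot of information, which is one of the caveats mentioned above; lowering that bar to every topic over some threshold would give a denser graph.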
Voyant Tools
I’ve also uploaded all of the posts into Voyant Tools for text analysis. Obviously, ‘archaeology’ and its derivatives will skew things a bit. But let’s see what we find. “There are 335 documents in this corpus with a total of 156,396 words and 15,100 unique words” says Voyant. We’ve got a wide vocabulary, folks! But in the spirit of Stephen Ramsay’s algorithmic reading, what are the surprises? What do we see when we deform an entire corpus of text in this manner? It’s worth pointing out that you ought to open the corpus in Voyant in Chrome, as sometimes Firefox trips, crashes, and burns.
Let’s extract named entities from the text, and stitch them together based on appearance in the same post. You get the following (remember, open in Chrome for best results: http://voyant-tools.org/tool/RezoViz/?corpus=1341853693115.3474 ). If you mouse over an entity, it highlights all others to which it is connected. You can also fiddle with the settings to show more or fewer connections.
You can also do a principal components analysis on the word frequencies. In the image below, all instances of ‘day’ and derivatives of ‘archaeology’ have been excised, to make the patterns clearer (try for yourself here).
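The stitching idea itself is simple enough to sketch. The snippet below is not how RezoViz works internally (I don't know its internals); it only illustrates the principle, and it assumes the entities have already been extracted into a hypothetical `{post: [entities]}` dictionary. Two entities get an edge each time they turn up in the same post, and the count becomes the edge weight.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(entities_by_post):
    """Count how many posts each pair of entities appears in together.

    entities_by_post is a hypothetical mapping: {post: [entity, ...]}.
    Returns a Counter keyed by alphabetically ordered (entity, entity) pairs.
    """
    counts = Counter()
    for entities in entities_by_post.values():
        # set() so an entity repeated within one post is counted once
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts
```

Filtering the result by a minimum count is roughly what the settings that show more or fewer connections would do.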
So – what patterns do you notice? What strikes you as odd and in need of explanation?
I was wondering if there was a way to scrape the images associated with the articles as well. Atlas.ti has functionality for coding images and video in discourse analysis… I’m not familiar with the programs you’re using here though.
Hi Nicolas,
Outwit can grab images or other files (as could something like downthemall, firefox plugin), but for subsequent analysis, I wouldn’t know where to begin. I haven’t played with Atlas.ti.
I am playing with Atlas.ti as a way of analyzing discourse in historic documents associated with the distilling industry. I’m not an expert in it (just tinkering at the moment) but I think that photos can be tagged (the same way you tag people on Facebook) with codes you have assigned for the analysis. Tags like “paperwork, artifact, archives, computer, office” might be a place to start. Then ATLAS.ti can take those tags, correlate them with codes you’ve assigned to text, and make a network map with nodes (similar in appearance to what you posted above from Voyant), although what you can do with that info is probably different.
I’ve got a big project I’m trying to finish in the next week or so, but after that I might try to integrate the data into ATLAS if the baby lets me (we are getting teeth right now, so my spare time is non-existent).