Mining a Day of Archaeology

The Day of Archaeology is modeled after the Day of Digital Humanities. Archaeologists from around the world take a few moments to blog about what they’re doing, right now. This year, it was on June 29th. It’s a fascinating window into a fascinating profession. As an archaeologist-cum-digital-humanities person, the obvious thing to do with all of this info (over 700 individual archaeologists; over 300 individual posts of some 250-500 words each, at least) is to mine it, to analyze it, to topic model it. What are the discourses of practicing archaeologists?

  1. The first thing is to collect all of the information. I’m using OutWit Hub to scrape every post. Now, scraping can be morally dubious, but happily the organizers of DoA and all of its contributors agree to a Creative Commons attribution licence. Designing a scraper involves looking at the source code for the page, figuring out the page structure, and identifying the tags that enclose the information that one wants to collect. Then, OutWit can be sent forth to work through each page in succession. The resulting information can then be exported into Excel; I send it over as a CSV file so that I can then do further work with it. (The CSV file may be downloaded here.)
  2. I then use a macro in Excel to save each individual row as an individual text file in a separate folder. This folder can be zipped and uploaded into Voyant Tools.
  3. Finally, I can point the Mallet Java GUI at the original CSV file and topic model all of the posts (20 topics, 400 iterations, 40 topic words, and a topic proportion threshold of 0.05); I then visualize the interrelationships of the topics using Gephi.
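
If you would rather not go through an Excel macro for step 2, the same thing can be sketched in a few lines of Python. This is only a rough sketch under assumptions: the column name `text` and the `post_0001.txt` naming pattern are hypothetical (they are not taken from the OutWit export), so adjust them to whatever your CSV actually contains.

```python
import csv
import zipfile
from pathlib import Path


def split_csv_to_texts(csv_path, out_dir, text_column="text"):
    """Write each CSV row's text to its own numbered .txt file.

    text_column is an assumed column name; change it to match the export.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for index, row in enumerate(csv.DictReader(handle), start=1):
            body = (row.get(text_column) or "").strip()
            if not body:
                continue  # skip rows that have no post text
            target = out / f"post_{index:04d}.txt"
            target.write_text(body, encoding="utf-8")
            written.append(target)
    return written


def zip_texts(paths, zip_path):
    """Bundle the text files into one archive, ready for upload."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            archive.write(path, arcname=Path(path).name)
    return zip_path
```

Calling `split_csv_to_texts("posts.csv", "texts")` and then `zip_texts(files, "texts.zip")` on the result would leave you with one folder of text files and one zip archive; both file names here are placeholders.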

In this post, I’m going to provide you with links to the data in various tools, and give a first pass over the data. Then, why not play with this info for yourself, and see what you find? Perhaps we can all crowdsource an article out of this; comments and findings in the comments please!

Let’s begin, shall we?

This is what I find. First, the topics (related files here):

List of Topics

1. archaeological information artifacts museum years university research cultural collection important collections sites including county center materials management include projects includes resource curator work archaeologists focus identify individual ago assistant provide recovered thousands analysis history artifact southern tasks cooper idaho patterns
2. war shropshire number uk british castle whilst military county royalist local london young market royal active base money significant nice july command clear australia rescue country hard shrewsbury memorial army news march garrison britain considered bristol charles highly leaving support
3. de la en el los una por del es las para se lo con arqueolog madrid al amtta hist como ya os museo english patrimonio civil arqueol historia spanish su crisis campus cruz dayofarch ser pero part ver punto law
4. find http www conservation blog shelf images months object follow number post image create org photos completely leather task finish major preserved produced department stop add conditions content set watch possibly laarc helping arts tiny pictures rob detail excitement examining
5. finds roman interesting iron medieval glass building post excavated age pottery found large years london final today bones bone metal discovered part working date houses archive fragments pieces based period animal vessel assemblage scheme specialist general evidence parts range amazing
6. working long stone field part half survey results end high understand equipment tools weather water notes modern walk located techniques hand sherds system structures hours reports valley lost process variety prehistoric complete tool personal level looked shell bag takes findings
7. small city early site record century ve recording big soil present analysis recorded common excavated back inside middle stage study context picture wood piece beer colleague deposits simple written suggest special air map shoe put ll size upper wild generally
8. site found excavation today june pm twitter test excavations areas maps vitaemilia trench worked dayofarch pit digging tea section surface find samples construction block hard clay emily natural place wright wall dug pits fill left storage paperwork excavating feature opened
9. archaeological excavation archaeologists community local past house team excavations professional town members human groups volunteers main information late ireland interest early st children northern medieval involved programme friends role association similar enjoy revealed result scientific gardens city give structure irish
10. field day students year summer school reading week continue lab university learn photo program dig questions process called student season graduate page river campus digging unit crew experience class undergraduate veterans features larger dirt director learning sense order indiana artifacts
11. archaeology public historic historical archaeological national state park project day house today cemetery society archaeologist american history philadelphia pennsylvania june media meeting photographs york university spent states usa work resources teaching anthropology arkansas native ideas united preservation document college region
12. world ancient art material open write people recently make website plan starting institute centre studies related idea web street current yesterday july monday series lovely issues individuals exciting article received wrote giving italy topic term lives church earlier week left
13. archaeology project day heritage work report team week activities staff month busy time working emails blog wessex today planning archaeological friday check due fieldwork visit company environment officer previous design exciting office manager event based current reports development side table
14. time group people archaeological west make virginia garden show culture mound complex evening grave activity room volunteer discovery kids central ready needed future important helped visitors box museum creek tour daily brought hands end volunteers light order check care change
15. area sites site remains landscape survey ground fort buildings local land find field east view built scotland features visit great south prehistoric buried building trust historic places story place large geophysical occupied green north trip loch brick camp surrounding map
16. research data year ve time part post phd life colleagues projects project form access student don interested digital list conference writing database place desk full book taking funding dissertation online paper didn department library publication books position break point email
17. day work time back things job office today days good home made morning working start ll lunch making started spend call long pretty couple finally short writing moment meeting finished leave lots read afternoon involved feel happy rest top don
18. museum objects material artefacts museums collection age hoard display history coins social century exhibition records archaeology collections events case record public found council north items type late british pot period service store saxon treasure early beautiful bronze space person workshops
19. di gis il farm che layer real completed scanned normal washington al digital rome playing del ad ma end difficult george ferry scanning laser tree change works della camera pi whale occupation da computer spoon victoria hut maria nice data
20. archaeology archaeologist people lot bit great years spent project work archaeologists working love thing past ve dig weeks coming understanding make share hope university wanted makes posts fact hour fun talk involves night bring means kind showing stories thinking jobs

Archaeologists love the communities they work in and the people that they work with. This, I think, is evident from the number of topics that could be labelled ‘public archaeology’, like #20, #14, and #11. You can click through the topics above to read the DoA posts that are composed of these topics; it will indicate to what percentage a given post is composed of the various topics. From this, you can begin to choose your own adventure through the Day of Archaeology.

I can also take that information, and represent it as a network where each document is linked to its highest percentage topic. Keeping in mind all of the caveats that such an approach entails (see Scott Weingart’s salutary warnings), we end up with a map of the mental geography of topics to posts; a mental geography of archaeological discourse. Interestingly, the top three topics holding it all together are 13, 17, and 10. The first two would seem to be topics related to the mundane everyday tasks that archaeologists do; topic 10 seems to relate to how we teach the discipline.
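
For anyone who wants to rebuild that network from scratch, here is a minimal sketch of the "link each document to its highest percentage topic" step. It assumes the per-document topic proportions have already been read into a dictionary (parsing the topic model's output is not shown), and the function names are my own inventions for illustration.

```python
from collections import Counter


def dominant_topic_edges(doc_topics, threshold=0.05):
    """Return (document, topic, weight) edges, one per document.

    doc_topics maps a document id to {topic_id: proportion}.
    Documents whose best topic falls below the threshold are dropped.
    """
    edges = []
    for doc, proportions in doc_topics.items():
        if not proportions:
            continue  # nothing to link
        topic, weight = max(proportions.items(), key=lambda item: item[1])
        if weight >= threshold:
            edges.append((doc, topic, weight))
    return edges


def topic_sizes(edges):
    """Count how many documents attach to each topic."""
    return Counter(topic for _, topic, _ in edges)
```

The resulting edge list can be written out as a CSV and loaded into a network tool; `topic_sizes` gives a quick sense of which topics attract the most posts before you draw anything.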

The Gephi file may be downloaded here, so that you can explore this data for yourself. I ran the modularity routine to detect any ‘communities’ of thought in the topics/posts. The colours in the image below correspond to community; the size of the node relates to betweenness. In the Gephi file, you can filter the data table by ‘modularity’ to see which posts and topics are in what community. According to this routine, there are roughly 13 communities of thought across 335 posts. Where does your post fit in?
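
If 'betweenness' is unfamiliar: it counts how often a node sits on the shortest paths between other pairs of nodes. Gephi computes this for you; the sketch below is only meant to show the idea on a small undirected, unweighted graph, using Brandes' algorithm. The function names are hypothetical, the scores are unnormalised, and nothing here reproduces Gephi's modularity routine.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Turn (left, right) pairs into an undirected adjacency mapping."""
    adjacency = defaultdict(set)
    for left, right in edges:
        adjacency[left].add(right)
        adjacency[right].add(left)
    return dict(adjacency)


def betweenness(adjacency):
    """Unnormalised betweenness centrality (Brandes' algorithm)."""
    scores = dict.fromkeys(adjacency, 0.0)
    for source in adjacency:
        stack = []
        predecessors = defaultdict(list)
        paths = dict.fromkeys(adjacency, 0)  # number of shortest paths
        paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source
            node = queue.popleft()
            stack.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    paths[neighbour] += paths[node]
                    predecessors[neighbour].append(node)
        dependency = dict.fromkeys(adjacency, 0.0)
        while stack:  # accumulate back from the farthest nodes
            node = stack.pop()
            for pred in predecessors[node]:
                share = paths[pred] / paths[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                scores[node] += dependency[node]
    # every undirected path was counted once from each end
    return {node: score / 2 for node, score in scores.items()}
```

In a graph where a topic is the only bridge between several posts, the topic collects all of the betweenness and the posts get none, which is why the 'hub' topics show up as the largest nodes.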

Voyant Tools

I’ve also uploaded all of the posts into Voyant Tools for text analysis. Obviously, ‘archaeology’ and its derivatives will skew things a bit. But let’s see what we find. “There are 335 documents in this corpus with a total of 156,396 words and 15,100 unique words” says Voyant. We’ve got a wide vocabulary, folks! But in the spirit of Stephen Ramsay’s algorithmic reading, what are the surprises? What do we see when we deform an entire corpus of text in this manner? It’s worth pointing out that you ought to open the corpus in Voyant in Chrome, as sometimes Firefox trips, crashes, and burns.
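
If you want to sanity-check those totals outside of Voyant, a rough count is easy to sketch. The tokenizer below is my own assumption (runs of letters, with internal apostrophes kept), so do not expect the numbers to match Voyant's exactly; the optional stop list is there to drop the words that skew things.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of letters, keeping internal apostrophes.
TOKEN = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def corpus_stats(documents, stopwords=frozenset()):
    """Return (total words, unique words, per-word counts)."""
    stop = {word.lower() for word in stopwords}
    counts = Counter()
    for text in documents:
        for token in TOKEN.findall(text.lower()):
            if token not in stop:
                counts[token] += 1
    return sum(counts.values()), len(counts), counts
```

Passing something like `stopwords={"archaeology", "archaeological", "archaeologist"}` gives a view of the vocabulary with the obvious terms set aside, and `counts.most_common(20)` lists the most frequent words that remain.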

Let’s extract named entities from the text, and stitch them together based on appearance in the same post. You get the following (remember, open in Chrome for best results: http://voyant-tools.org/tool/RezoViz/?corpus=1341853693115.3474 ). If you mouse over an entity, it highlights all the others to which it is connected. You can also fiddle with the settings to show more or fewer connections.
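
The 'stitching together' is just co-occurrence counting. Here is a minimal sketch, assuming the entities for each post have already been extracted by some other tool (the extraction itself is not shown, and the function name is hypothetical):

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(entities_by_post):
    """Count how many posts each pair of entities shares.

    entities_by_post maps a post id to the list of entities found in it.
    Pairs are stored in alphabetical order so (A, B) and (B, A) merge.
    """
    counts = Counter()
    for entities in entities_by_post.values():
        unique = sorted(set(entities))  # one vote per post per entity
        for pair in combinations(unique, 2):
            counts[pair] += 1
    return counts
```

Raising the minimum count you are willing to draw (say, only pairs that share two or more posts) is one way to get the 'fewer connections' view.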

You can also do a principal components analysis on the word frequencies. In the image below, all instances of ‘day’ and derivatives of ‘archaeology’ have been excised, to make the patterns clearer (try for yourself here).
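
For the curious, here is a rough sketch of what a principal components analysis does with a table of word frequencies (one row per post, one column per word). It only recovers the first component, by power iteration, and it is not a description of how Voyant does it; the function name and defaults are my own assumptions.

```python
import math
import random


def first_principal_component(rows, iterations=200, seed=0):
    """Approximate the first principal component by power iteration.

    rows is a list of equal-length lists: one row per document,
    one column per word frequency.
    Returns (direction, scores), where scores are the documents'
    positions along that direction.
    """
    count, width = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / count for j in range(width)]
    centred = [[row[j] - means[j] for j in range(width)] for row in rows]

    rng = random.Random(seed)  # seeded so that runs are repeatable
    vector = [rng.uniform(-1.0, 1.0) for _ in range(width)]
    for _ in range(iterations):
        # project documents onto the current direction, then map back
        scores = [sum(x * v for x, v in zip(row, vector)) for row in centred]
        updated = [
            sum(scores[i] * centred[i][j] for i in range(count))
            for j in range(width)
        ]
        norm = math.sqrt(sum(value * value for value in updated))
        if norm == 0.0:
            break  # no variance along this direction
        vector = [value / norm for value in updated]
    scores = [sum(x * v for x, v in zip(row, vector)) for row in centred]
    return vector, scores
```

The words with the largest absolute weights in the returned direction are the ones pulling posts apart along that axis; the sign of the direction is arbitrary.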

So – what patterns do you notice? What strikes you as odd and in need of explanation?

5 thoughts on “Mining a Day of Archaeology”

  1. I was wondering if there was a way to scrape the images associated with the articles as well. Atlas.ti has functionality for coding images and video in discourse analysis… I’m not familiar with the programs you’re using here though.

  2. Hi Nicolas,
    OutWit can grab images or other files (as could something like DownThemAll, a Firefox plugin), but for subsequent analysis, I wouldn’t know where to begin. I haven’t played with Atlas.ti.

    1. I am playing with Atlas.ti as a way of analyzing discourse in historic documents associated with the distilling industry. I’m not an expert in it (just tinkering at the moment) but I think that photos can be tagged (the same way you tag people on Facebook) with codes you have assigned for the analysis. Tags like “paperwork, artifact, archives, computer, office” might be a place to start. Then ATLAS.ti can take those tags, correlate them with codes you’ve assigned to text, and make a network map with nodes (similar in appearance to what you posted above from Voyant), although what you can do with that info is probably different.

      I’ve got a big project I’m trying to finish in the next week or so, but after that I might try to integrate the data into ATLAS if the baby lets me (we are getting teeth right now, so my spare time is non-existent).
