I was interviewed by Ben Meredith on procedurally generated game worlds and their affinities with archaeology, for Kill Screen Magazine. The piece was published this morning. It’s a good read, and an interesting take on one of the more interesting recent developments in gaming. I asked Ben if I could post the unedited communication we had, which he drew on for his article. He said ‘yes!’, so here it is.

It seems to me that archaeology and video games share a number of affinities, not least because both are procedurally generated. There is a method for field archaeology; follow the method, and you will have correctly excavated the site/surveyed the landscape/recorded the standing remains/etc. These procedures contain within them various ways of looking at the world, and emphasize certain kinds of values over others, which is why it is possible to have a Marxist archaeology, or a gendered archaeology, and so on. Thus, it also seems obvious to me that you can have an archaeology within video games (not to be confused with media archaeology, or an archaeology of video games). A great example of this kind of work is Andrew Reinhard’s exploration of the beta of Elder Scrolls Online – you should touch base with him, too.

What motivated you to become an archaeologist?

Romance, mystery, allure, the ‘other’, the desire to travel… my initial impetus for getting into archaeology comes from the fact that I’m ‘from the bush’ in rural Canada and as a teenager I wanted so much more from the world. I now recognize that there’s some amazing archaeology in my own backyard (as it were) but I was too young and immature to recognize it then. The Greek Bronze Age, the Mycenaean heroes, the Minoans, Thera… all these captured my imagination. And there was no snow!

Personally, what single facet of archaeology captures the spirit of the field most effectively?

Check out the work of Colleen Morgan and Sophie Hay and Lorna Richardson. If there is a ‘spirit of the field’, I think these three scholars capture it admirably. They are curious, reflective, aware of the impact that the doing of archaeology has in the wider world. Archaeology produces powerful narratives, powerful ways of framing our current situation regarding the past and the present. I aspire to be more like these three remarkable women.

Which game do you think, so far, best achieves this?

A hard question to answer. But I think I’d go with Minecraft, for its community and especially its ability to be adopted in educational circles, for the way it requires the player to build and engage with the environments created. The world is what you make it, in Minecraft. So too in archaeology.

If a game attempted to procedurally generate ancient civilizations, what do you think would be the three most important elements that had to be generated?

I’ve done a lot of agent-based simulation. Such a game would have to be built on an agent-based framework for the NPCs. Each NPC would have to be unique. The rules of behaviour that describe how the NPCs interact with each other, the environment, and the player would have to accurately capture the target ancient civilization. You can’t just have an ‘ancient civilization’; you’ll have to consider one very particular culture in one very particular time and place. That’s what a procedural rhetoric is all about: an argument in code about how this aspect of the world worked/is/existed.

Would investigation play an integral part in a video game interpretation?

I’m not sure I follow. Procedural generation on its own is still meaningless; it would have to be interpreted. The act of playing the game (and see the work of Roger Travis on practicomimetics) sings it into existence.

Conversely, for you, would stumbling blindly upon a ruin diminish the effect?

If the world is procedurally generated, then there would be clues in the landscape that would attune the attentive player to the presence of the past in that location. If there is no rhyme or reason – we stumble blindly – then the procedures do not describe an ancient (or any) civilization.

Do you think an archaeology simulator would be best implemented in first person (e.g. Minecraft) or third person (e.g. Terraria)? Would it be more important to convey an intimate atmosphere or impressive scale?

I like first person, but on a screen, first person can just induce nausea in the player. Maybe with an Oculus Rift that’s not a concern, in which case I’d say go first person! On a screen, I think third is better. Why not go AR and put your procedurally generated civilization into the local landscape?

Archeology versus Archaeology versus #Blogarch

I’m working on a paper that maps the archaeological blogosphere. I thought this morning it might be good to take a quick detour into the Twitterverse.


‘archaeology’ on twitter

Here we have every twitter username, connected by referring to each other in a tweet. There’s a seriously strong spine of tweeting, but it doesn’t make for a unified graph. The folks keeping this clump all together, measured by betweenness centrality:


top replied-to


Top hashtags:
archaeology 325
Pompeii 90
fresco 90
Archaeology 77
Herculaneum 40
Israel 24
nowplaying 20
roman 18
newslocker 16
Roman 14


Let’s look at American archeology – as signified by the dropped ‘a’.

An awful lot more fragmented – less popular consciousness of archaeology-as-a-community?
Top by betweenness centrality – the ones who hold this together:

Top urls:

Top hashtags:

Top replied-to

#Blogarch on twitter

And now, the archaeologists themselves, as indicated by #blogarch

We talk to ourselves – but with the nature of the hashtag, I suppose that’s to be expected?

Top by betweenness centrality

top urls

Top hashtags

Top replied to
electricarchaeo (yay me!)

Top mentioned:

Put them all together now…

And now, we put them all together to get ‘archaeology’ on the twitterverse today:

Visually, it’s apparent that the #blogarch crew are the ones tying together the wider twitter worlds of archaeology & archeology, though it’s still pretty fragmented. There’re 460 folks in this graph.
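For readers who want to try this on their own tweets, here is a minimal sketch of the two steps the graphs in this post rest on: linking usernames that refer to each other in a tweet, and ranking them by betweenness centrality. It is not the tool used to make the figures here. The function names, the `(author, text)` input shape and the sample usernames are all assumptions made for illustration, and the centrality is computed with Brandes' algorithm on an unweighted, undirected graph.

```python
import re
from collections import defaultdict, deque

MENTION = re.compile(r"@(\w+)")


def mention_graph(tweets):
    """Build an undirected graph joining each author to the accounts they mention.

    `tweets` is an iterable of (author, text) pairs (an assumed input shape).
    """
    graph = defaultdict(set)
    for author, text in tweets:
        author = author.lower()
        graph[author]  # touch the key so authors who mention nobody are still nodes
        for name in MENTION.findall(text):
            name = name.lower()
            if name != author:
                graph[author].add(name)
                graph[name].add(author)
    return graph


def betweenness(graph):
    """Unnormalised betweenness centrality (Brandes' algorithm, unweighted, undirected)."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        stack = []
        preds = {v: [] for v in graph}
        paths = dict.fromkeys(graph, 0)   # number of shortest paths from source
        paths[source] = 1
        dist = dict.fromkeys(graph, -1)   # -1 marks "not reached yet"
        dist[source] = 0
        queue = deque([source])
        while queue:                      # breadth-first search from source
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        while stack:                      # accumulate dependencies, farthest node first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # every undirected shortest path was counted once from each end
    return {v: s / 2 for v, s in score.items()}


# Invented tweets, purely to show the call pattern.
sample = [("A", "hi @b"), ("b", "hello @C"), ("d", "no mentions here")]
ranking = sorted(betweenness(mention_graph(sample)).items(), key=lambda kv: -kv[1])
```

In the invented sample, the account in the middle of the chain is the only one that lies between two others, so it is the only one with a non-zero score.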

Top by betweenness centrality:


Top urls

top hashtags (not useful, given the nature of the search, right? But anyway)


Top word pairs in those largest groups:

archeology,professor 30
started,yesterday 21
yesterday,battle 21
battle,towton 21
towton,weapon 21
weapon,tests 21
tests,forensic 21
forensic,archeology 21
museum,archeology 19
blogging,archaeology 17

second group:
blogging,archaeology 13
future,blogging 12
archaeology,go 7
archaeology,future 7
archaeology,final 6
final,review 6
review,blogarch 6
hopes,dreams 6
dreams,fears 6
fears,blogging 6

third group:
space,age 6
age,archaeology 6
archaeology,future 6
future,know 6
know,going 6
saa2014,blogarch 6
going,blogarch 5
blogarch,post 3
post,future 3
future,blogging 3

fourth group:
easterisland,ancient 10
ancient,mystery 10
mystery,easter 10
easter,slave 10
slave,history 10
history,esoteric 10
esoteric,archeology 10
archeology,egypt 10
rt,illumynous 9
illumynous,easterisland 9

fifth group:
costa,rica 8
rt,archeologynow 7
archeologynow,modern 4
modern,archeology 4
archeology,researching 4
researching,dive 4
dive,bars 4
bars,costa 4
rica,costa 4
rica,star 4

(once I saw ‘bars’, I stopped. Archaeological stereotypes, maybe).
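The word-pair lists above read like counts of adjacent words once common stopwords have been dropped (hence ‘battle,towton’ rather than ‘battle,of’). A minimal sketch of that kind of count follows. The stopword list, the tokenising rule and the two sample tweets are assumptions of mine, not the settings that produced the figures above.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real analysis would use a much longer one.
STOPWORDS = {"a", "an", "and", "at", "in", "of", "on", "the", "to"}


def word_pairs(tweets, stopwords=STOPWORDS):
    """Count adjacent word pairs within each tweet, after lowercasing and dropping stopwords."""
    counts = Counter()
    for text in tweets:
        words = [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in stopwords]
        counts.update(zip(words, words[1:]))  # pairs never span two tweets
    return counts


# Invented tweets, for illustration only.
pairs = word_pairs([
    "Weapon tests at the Battle of Towton",
    "Battle of Towton weapon tests",
])
top = pairs.most_common(2)
```

Because retweets repeat the same text, a single much-shared tweet produces a run of pairs with identical counts, which is the pattern visible in several of the groups above.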

Top mentioned in the entire graph

illumynous 9 bonesdonotlie 8
drspacejunk 8 drkillgrove 4
bonesdonotlie 8 capmsu 4
archeologynow 7 yagumboya 3
openaccessarch 7 drspacejunk 3
macbrunson 6 archeowebby 3
swbts 6 allarchaeology 3
archeowebby 6 openaccessarch 3
algenpfleger 5 cmount1 3
youtube 5 brennawalks 2

So what does this all mean? Answers on a postcard, please…

(My network files will be on eventually).

HIST4805b Looted Heritage: The Illicit Antiquities Trade

I’m teaching a fourth year seminar next year dealing with issues surrounding the illicit antiquities trade. This seminar will be in conjunction with a larger project spearheaded by the investigative reporter and author Jason Felch, of Chasing Aphrodite. I’m quite excited about this; as an undergraduate, I once had the opportunity to work on a term project that looked at the antiquities market. That was twenty years ago; I’ve never really had the opportunity to scratch that itch since. So, when I was asked to suggest a seminar topic, I jumped at the chance to plumb the depths of my own ignorance together with my students. What better way to teach than to be learning right along with your students?

As ever, I turned to twitter, to see what folks there had to say.

Many folks chimed in with suggestions, including:

I’m keeping all of these in a zotero library for eventual sharing with my students (wider world too), but for now, this is the kind of stuff that’s come in:

Legal & Academic Frameworks

And from Donna Yates, the exciting news that she and her collaborators at Trafficking Culture are going to write a textbook on the subject:


In terms of assessment, I want to avoid long research essays based on secondary sources. Instead, I’d rather have the students build something, analyze something, visualize something… so this will be a heavily digital humanities inflected course. I want my students at the coalface. My little looted heritage social media observatory will be pulled out of the mothballs and will become an active part of the course. We’ll be mining eBay, looking at the auction sites, exploring museum archives… probably. Stay tuned!

If you have suggestions for things the students should be reading/looking at/exploring, please do drop me a line or leave a comment.

Shared Authority & the Return of the Human Curated Web

A few years ago, I wrote a piece on Why Academic Blogging Matters: A structural argument. This was the text for a presentation as part of the SAA in Sacramento that year. In the years since, the web has changed (again). It is no longer enough for us to create strong signals in the noise, trusting in the algorithms to connect us with our desired publics. (That’s the short version. The long version is rather more nuanced and sophisticated, trust me).

The war between the botnets and the SEO specialists has outstripped us.

In recent months, I have noticed an upsurge of new ‘followers’ on this blog with emails and handles that really do not seem to be those of actual humans. Similarly, on Twitter, I find odd tweets directed at me filled with gibberish web addresses (which I dare not touch). Digital Humanities Now highlighted an interesting post in recent days that explains what’s going on, discusses this ‘war’, and, in the way it came to my attention, points the way forward for the humanistic use of the web.

In ‘Crowd-Frauding: Why the Internet is Fake’, Eric Hellman discusses a new avenue for power (assuming that power ‘derives from the ability to get people to act together’): in this case, ‘cooperative traffic generation’, or software-organized crime. Hellman was finding a surge of fake users on his site, and he began to investigate why. It turns out that if you want to promote your website and jack up its traffic, you can install a program that manufactures fake visitors to your site, who click around, click on adverts, register… and in turn does the same for other users of the software. Money is involved.

“In short, your computer has become part of a botnet. You get paid for your participation with web traffic. What you thought was something innocuous to increase your Alexa- ranking has turned you into a foot-soldier in a software-organized crime syndicate. If you forgot to run it in a sandbox, you might be running other programs as well. And who knows what else.

The thing that makes cooperative traffic generation so difficult to detect is that the advertising is really being advertised. The only problem for advertisers is that they’re paying to be advertised to robots, and robots do everything except buy stuff. The internet ad networks work hard to battle this sort of click fraud, but they have incentives to do a middling job of it. Ad networks get a cut of those ad dollars, after all.

The crowd wants to make money and organizes via the internet to shake down the merchants who think they’re sponsoring content. Turns out, content isn’t king, content is cattle.”

Hellman goes on to describe how the arms race, the red queen effect, between these botnets and advertising models that depend on clickrates etc will push those of us without the computing resources to fight in these battles into the arms of the Googles, the Amazons, the Facebooks: and their power will increase correspondingly.

“So with the crowd-frauders attacking advertising, the small advertiser will shy away from most publishers except for the least evil ones- Google or maybe Facebook. Ad networks will become less and less efficient because of the expense of dealing with click-fraud. The rest of the internet will become fake as collateral damage. Do you think you know how many users you have? Think again, because half of them are already robots, soon it will be 90%. Do you think you know how much visitors you have? Sorry, 60% of it is already robots.”

I sometimes try explaining around the department here that when we use the internet, we’re not using a tool, we’re sharing authority with countless engineers, companies, criminals, folks-in-their-parents-basement, ordinary folks, students, algorithms whose interactions with other algorithms can lead to rather unintended outcomes. We can’t naively rely on the goodwill of the search engine to help us get our stuff out there. This I think is an opportunity for a return of the human curated web. No, I don’t mean building directories and indices. I mean, a kind of supervised learning algorithm (as it were).

Digital Humanities Now provides one such model (and there are of course others, such as Reddit, etc). A combination of algorithm and human editorial oversight, DHNow is a cybernetic attempt to bring to the surface the best in the week’s digital humanities work, wherever on the net it may reside. We should have the same in archaeology. An Archaeology Now! The infrastructure is already there. PressForward, the outfit from the RRCHNM, has developed a workflow for folding volunteer editors into the weekly task of separating the wheat from the chaff, using a custom-built plugin for WordPress. Ages ago we talked about a quarterly journal where people would nominate their own posts and we would spider the web looking for these nominations, but the technology wasn’t really there at that time (and perhaps the idea came too soon). With the example of DHNow, and the emergence of this new front in botnets/SEO/clickfraud and the dangers that it poses, perhaps it’s time to revisit the idea of the human-computer curated archaeoweb?

Exploring Trends in Archaeology: Professional, Public, and Media Discourses

The following is a piece by Joe Aitken, a student in my CLCV3202a Roman Archaeology for Historians class at Carleton University. His slides may be found here. I asked Joe if I could share his work with the wider world, because I thought it an interesting example of using simple text analysis to explore broader trends in public archaeology. Happily, he said yes.

Exploring Trends in Archaeology: Professional, Public, and Media Discourses

An immense shift in content and terminology emerges when analysing the text of several documents relating to the archaeology of Colchester, as information grows from its genesis as an archaeological report, through the stage of public archaeology, and finally to mass media. Many inconsistencies emerge as the form in which archaeological information is presented changes.

This analysis was done with the help of Voyant Tools, “a web-based text analysis environment.”[1] Z-score, representing the number of standard deviations above the mean at which each term appears, will be used as the basic marker of frequency. Skew, “A measure of the asymmetry of relative frequency values for each document in the corpus,”[2] will also be used. Having a skew close to zero suggests that the term appears with relative consistency throughout the documents. This means that in comparison to, for example, “piggery,” with a skew of 11, terms with a low skew are not only frequent in the corpus as a whole, but are prevalent in many of the documents that make up the corpus.
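To make the two measures concrete, here is a small sketch of how they could be computed by hand. It rests on assumptions: the z-score is taken as a term's corpus-wide count measured against the mean and population standard deviation of all term counts, and skew is taken as the third standardised moment of the term's relative frequency in each document. Voyant Tools may define or normalise these differently, and the toy documents are invented.

```python
from collections import Counter
from statistics import mean, pstdev


def term_z_scores(documents):
    """z-score of each term's corpus-wide count against the counts of all terms.

    `documents` is a list of token lists (an assumed input shape).
    """
    totals = Counter(word for doc in documents for word in doc)
    mu, sigma = mean(totals.values()), pstdev(totals.values())
    if sigma == 0:
        return {term: 0.0 for term in totals}
    return {term: (n - mu) / sigma for term, n in totals.items()}


def term_skew(documents, term):
    """Skewness of a term's relative frequency across documents (0.0 when it never varies)."""
    rel = [doc.count(term) / len(doc) for doc in documents if doc]
    mu = mean(rel)
    m2 = mean((x - mu) ** 2 for x in rel)  # second central moment
    m3 = mean((x - mu) ** 3 for x in rel)  # third central moment
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


# Invented toy corpus: "site" turns up in every document, "piggery" in only one.
docs = [["site", "finds"], ["site", "finds"], ["site", "finds"], ["site", "piggery"]]
z = term_z_scores(docs)
steady = term_skew(docs, "site")
spiky = term_skew(docs, "piggery")
```

In the toy corpus the evenly spread term has a skew of zero and the highest z-score, while the term confined to a single document has a positive skew and a below-average z-score, which is the contrast the analysis relies on.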

A text analysis of Colchester Archaeological Trust Reports 585-743 (February 2011 to 22nd October 2013)[3] is the basis of this comparison. Dominant in this corpus are terms related to archaeological excavations. The term “report” has a z-score of 8.69, “finds” has a z-score of 6.43, and “site” has a z-score of 8.81. The same terms, respectively, have skews of 0.93, 0, and 0.88. Another relatively consistent term is “pottery,” which has a skew of 1 and a z-score of 5.26. “Brick”, with a skew of 2.17 and a z-score of 3.1, is similarly consistent.

The relevance of these figures becomes clearer upon a comparison with the public archaeological writings as they appear on the Colchester Archaeologist blog. The blog exists on the public-facing website of the Colchester Archaeological Trust, and has been blogging about its archaeological discoveries since 2011. This analysis will use the Voyant-Tools difference function, which returns a value based on a comparison between the z-scores of two corpora,[4] as well as a direct comparison of the z-score and skew of each term between the two corpora.

Some of the most consistent terms from the archaeological corpus appear very infrequently in the public archaeology. “Pottery” has a skew of 9.49 and a z-score of 0.25, and appears at about 1/5 of the frequency it has in the reports. “Brick” similarly disappears: in the public archaeology, it has a skew of 9.56 and a z-score of -0.02, compared to a skew of 2.17 and a z-score of 3.1 in the archaeological reports.

Terms relating to the excavation also disappear. “Finds,” which in the archaeological reports has a skew of 0 and a z-score of 6.43, has a skew of 4.94 and a z-score of 0.42 in the public archaeology. “Report” similarly changes from a skew of 0.93 to 9.87, with its z-score dropping from 8.69 to -0.06. “Site” follows this trend to a lesser extent, although this is likely due to it appearing in the public archaeology in the context of “website,” rather than as an archaeological term. Still, the shifts in z-score and skew are significant, and in the same direction: an archaeological z-score of 8.81 to a public z-score of 3.83, and an archaeological skew of 0.88 to a public skew of 1.28. In each case, these commonly used terms from the archaeological reports appeared less frequently and less consistently in the blog.

On the other hand, some terms are much more common in the public archaeology. Compared to the corpus of archaeological reports, the public archaeology texts contain the term “circus” at 5 times the frequency. In the blog, “circus” has a z-score of 5.77 and a relatively stable skew of 1.79, compared to a minimal z-score of 0.69 and a volatile skew of 6.3 in the archaeological reports. A similar change occurs to the term “burial,” although to a lesser extent: from report to blog, the z-score rises from 0.25 to 0.86, and the skew drops from 3.84 to 3.65.

Terms with a high skew and a non-insignificant z-score in the archaeological reports seem to be the most prevalent terms altogether in the public archaeology, while terms with a skew closer to zero in the reports disappear in the public archaeology: that is, the terms that appear infrequently but in large numbers in the reports are the ones selected for representation in the blog. This emphasises rare and exciting discoveries, such as the circus and large burials, while ignoring the more regular and consistent discoveries of pottery and bricks. For terms with high skew, there is a consistent rise in z-score and drop in skew in the incidences of the term between the archaeological and public corpora. For terms with a skew closer to zero, there is a consistent decline in z-score. The two trends that terms follow with regards to their relative frequencies between the two corpora can be defined as follows: low-skew terms, which tend to disappear, and significant-z-score/high skew terms, which tend to be emphasised in the public archaeology.

In most respects, archaeology in the media seems to follow the public archaeology rather than the archaeological reports. The media corpus contains articles about the archaeology of Colchester from sources ranging from local to national media, including the BBC, the Colchester Daily Gazette, the Essex County Standard, and the Independent, in addition to international archaeological publications. In these articles, “circus” has a low skew of 1.51, although its z-score, at 1.64, isn’t as overwhelmingly high as it is in the public archaeology. Still, it is much greater than the z-score of 0.69 for “circus” in the reports, and this z-score most likely reflects a greater lexical variety rather than a focus on other aspects of the archaeology, as this is the fifth-highest z-score in the entire media corpus. Still, there is less emphasis on the circus here than in the blog.

Common to the public and media corpora is their near-complete removal of non-Roman archaeological terminology. The term “medieval” appears 1555 times in the archaeological corpus, with a z-score of 3.42 and a skew of 2.64. In the public corpus, the same term appears twice, with a z-score of -0.09 and a skew of 10.30. In the selection of news about the archaeology of Colchester, the term never appears. This follows the same trends of selection as the public archaeology: “medieval,” a low-skew term in the archaeological corpus, is ignored in favour of high-skew terms.

Although the media and public corpora contain writings about the same discoveries and use similar language, the frequency at which they do so differs. The media, unlike the blog, are unlikely to write repeatedly about the circus when no new information is available. Rather, each media outlet seems to be inspired by the archaeological reports, but takes its information from the public archaeology. That is, instead of repeating the public archaeology, the media take inspiration from the actual archaeological discovery, but draw their information about it from the blog rather than directly from the report.

Altogether, archaeological writing about Colchester appears to become much narrower over time. While the archaeological reports presumably reflect accurately what is found, the public archaeology and, in turn, the media do not. Instead, they focus on more marketable and exciting aspects of the archaeology: these can be recognized as the high-skew/high-z-score terms in the analysis. As a result, the particulars of the excavation, as well as the majority of findings, are de-emphasised; these are the low-skew terms. By the stage of public presentation, only a very narrow view of the archaeology of Colchester has been presented. It is almost exclusively monumental and Roman, and is at odds with the multiplicity of archaeological findings that are seen in the reports.


Patterns in Roman Inscriptions

Update August 22 I’ve now analyzed all 1385 inscriptions. I’ve put an interactive browser of the visualized topic model at

See how nicely the Latin clusters?

I’ve played with topic modeling inscriptions before. I’ve now got a very effective script in R that runs the topic model and produces various kinds of output (I’ll be sharing the script once the relevant bit from our book project goes live). For instance, I’ve grabbed 220 inscriptions from Miko Flohr’s database of inscriptions regarding various occupations in the Roman world (there are many more; like everything else I do, this is a work in progress).

Above is the dendrogram of the resulting topics. Remember, those aren’t phrases, and I’ve made no accounting for case endings. (Now, it’s worth pointing out that I didn’t include any of the metadata for these inscriptions; just the text of the inscription itself, with the diacritical marks removed.) Nevertheless, you get a sense of both the structure and content of the inscriptions, reading from left to right, top to bottom.

We can also look at which inscriptions group together based on the similarity matrix of their topics, and graph the result.


Inscriptions, linked based on similarity of the language of the inscription, via topics. If the image appears wonky, just click through.
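For anyone wanting to reproduce this kind of graph without the R script, the general idea can be sketched in a few lines: take each inscription's topic proportions, compare every pair, and keep an edge wherever the pair is similar enough. This is an illustration only. The cosine measure, the 0.8 threshold, the identifiers and the proportions below are all assumptions of mine, not the settings or data behind the figure.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_edges(doc_topics, threshold=0.8):
    """Return (id_a, id_b, similarity) for every pair at or above the threshold.

    `doc_topics` maps an inscription id to its list of topic proportions.
    """
    edges = []
    for (a, u), (b, v) in combinations(sorted(doc_topics.items()), 2):
        s = cosine(u, v)
        if s >= threshold:
            edges.append((a, b, round(s, 3)))
    return edges


# Invented ids and proportions over three topics, for illustration only.
example = {
    "insc1": [0.9, 0.1, 0.0],
    "insc2": [0.8, 0.2, 0.0],
    "insc3": [0.0, 0.1, 0.9],
}
edges = similarity_edges(example)
```

The resulting list of weighted pairs is the sort of thing that can be handed to a network package for the exploratory statistics discussed next. Raising or lowering the threshold changes how fragmented the graph looks, so it is worth trying more than one value.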

So let’s look at these groups in a bit more depth. I can take the graph exported by R and import it into Gephi (or another package) to do some exploratory statistical analysis.

I’ve often put a lot of stock in ‘betweenness centrality’, reckoning that if a document is highly between in a network representation of the patterns of similarity of topics, then that document is representative of the kinds of discourses that run through it. What do we get, then?

We get this (here’s the page in the database):

aurifices Roma CIL 6, 9207 Inscription Occupation
M(arcus) Caedicius Iucundus / aurifex de / sacra via vix(it) a(nnos) XXX // Clodia …

But there are a lot of subgroupings in this graph. Something like ‘closeness’ might indicate more locally important inscriptions. In this case, the two with the highest ‘closeness’ measures are

aurifices Roma CIL 6, 9203 Inscription Occupation
Protogeni / aurfici / vix(it) an(nos) LXXX / et Claudiae / Pyrallidi con(iugi) …


aurifices Roma CIL 6, 3950 Inscription Occupation
Lucifer v(ixit) a(nnum) I et d(ies) XLV / Hesper v(ixit) a(nnos) II / Callistus …

If we look for subgroupings based on the patterning of connections, the biggest subgroup has 22 inscriptions:
Dis Manibus Felix publicus Brundisinorum servus aquarius vixit…
Dis Manibus Laetus publicus populi Romani 3 aquarius aquae An{n}ionis…
Dis Manibus sacrum Euporo servo vilico Caesaris aquario fecit Vestoria Olympias…
Nymphis Sanctis sacrum Epictetus aquarius Augusti nostri
Dis Manibus Agathemero Augusti liberto fecerunt Asia coniugi suo bene…
Agatho Aquarius Caesaris sibi et Anniae Myrine et suis ex parte parietis mediani…
Dis Manibus Sacrum Doiae Palladi coniugi dignissimae Caius Octavius…
Dis Manibus Tito Aelio Martiali architecto equitum singularium …
Dis Manibus Aureliae Fortunatae feminae incomparabili et de se bene merenti..
Dis Manibus Auliae Laodices filiae dulcissimae Rusticus Augusti libertus…
Dis Manibus Tychico Imperatoris Domitiani servo architecto Crispinilliano.
Dis Manibus Caio Iulio 3 architecto equitum singularium…
Dis Manibus Marco Claudio Tryphoni Augustali dupliciario negotiatori…
Dis Manibus Bromius argentarius
Faustus 3ae argentari
Dis Manibus sacrum Tiberius Claudius Hymeneus aurarius argentarius…
Dis Manibus Silio Victori filio et Naebiae Amoebae coniugi et Siliae…
Dis Manibus 3C3 argentari Allia coniugi? bene merenti fecit…
Dis Manibus Marco Ulpio Augusti liberto Martiali coactori argentario…
Suavis 3 aurarius
Dis Manibus sacrum Tiberius Claudius Hymeneus aurarius argentarius…
Dis Manibus Tito Aurelio Aniceto Augusti liberto aurifici Aurelia…

What ties these together? Well, ‘dis manibus’ is good, but it’s pretty common. The occupations in this group are all argentarii, architecti, or aquarii. So that’s a bit tighter. Many of these folks are mentioned in conjunction with their spouses.

In the next largest group, we get what must be a family (or familia, extended slave family) grouping:
Caius Flaminius Cai libertus Atticus argentarius Reatinus
Caius Octavius Parthenio Cai Octavi Chresti libertus argentarius
Musaeus argentarius
Caius Caicius Cai libertus Heracla argentarius de foro Esquilino sibi…
Caius Iunius Cai libertus Salvius Caius Iunius Cai libertus Aprodisi…
Caius Vedennius Cai filius Quirina Moderatus Antio militavit in legione…
Aurifex brattarius
Caius Acilius Luci filius Trebonia natus architectus
Caius Postumius Pollio architectus
Caius Camonius Cai libertus Gratus faber anularius
Caius Antistius Isochrysus architectus
Elegans architectus
Caius Cuppienus Cai filius Pollia Terminalis praefectus cohortis…
Cresces architectus
Cresces architectus
Caius Vedennius Cai filius Quirina Moderatus Antio militavit in legione…
Pompeia Memphis fecit sibi et Cnaeo Pompeio Iucundo coniugi suo aurifici…
Caius Papius Cai libertus Salvius Caius Papius Cai libertus Apelles…
Caius Flaminius Cai libertus Atticus argentarius Reatinus

The outliers here are graffiti, or must be being picked up by the algorithm due to the formation of the words; the inclusion of Pompeia in here is interesting, and must be due to the overall structure of that inscription. Perhaps a stretch too far to wonder why these would be similar…?

This small experiment demonstrates I think the potential of topic modeling for digging out patterns in archaeological/epigraphic materials. In due time I will do Flohr’s entire database. Here are my files to play with yourself.

Prescot Street as Topic Model, or, reading an excavation distantly

I tried a new tack in my quest to data mine archaeological records. Stuart Eve sent me the csv from the Prescot Street excavations, where each record was a unique context. I fed this into the vanilla Java GUI for MALLET (so no tuning, just the basic settings, looking for 25 topics) to see what – if anything – might result. The output seems very promising. I deliberately did not look up any information on the excavation until after I’d run this analysis. Can reading site records algorithmically tell us anything useful, that we did not otherwise know?
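If you want to try something similar with your own context sheets, the first job is getting a csv with one context per row into a one-document-per-line shape. The sketch below writes each row as a name, a label and the text, separated by tabs, which is the line layout MALLET's command-line `import-file` reads by default. How the csv was actually prepared for the GUI in this experiment isn't described here, so treat this as one possible route. The column names and sample rows are hypothetical.

```python
import csv
import io


def contexts_to_instances(csv_text, id_column, text_columns):
    """Turn each csv row into one 'name<TAB>label<TAB>text' line, one line per context.

    The id is used as the instance name, so it should contain no whitespace.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        text = " ".join(row[c] for c in text_columns if row.get(c))
        text = " ".join(text.split())  # collapse line breaks and runs of spaces
        lines.append(f"{row[id_column]}\tcontext\t{text}")
    return lines


# Hypothetical column names and records, for illustration only.
sample_csv = (
    "context,excavator,description\n"
    '101,A. Digger,"Dark grey fill,  with brick rubble"\n'
    "102,B. Trowel,Linear cut truncated by modern drain\n"
)
instances = contexts_to_instances(sample_csv, "context", ["excavator", "description"])
```

Leaving the excavator column in or out of `text_columns` is a one-word change, which makes it easy to run the model both ways and compare.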

As I often do, I posted my initial reaction to twitter:

How to visualize this? I’m growing cold towards network visualizations of this kind of data, but in this case a two-mode representation might be appropriate, since the topic modeling algorithm is functioning as a kind of unsupervised clustering routine, pulling words out of the records that seem to go together. Here’s a two-mode network of the results, contexts tied to their constituent topics:

It seems promising. In that image, I took the excavators’ names out. But upon reflection, I shouldn’t do that:

I asked Gephi to look for modules (communities; groups; based on similarity of ties) within this two mode network. Below are a series of images that focus on the individual modules. Two items jump out immediately – one, particular excavators are associated with particular word choice, patterning of word usages; two, particular kinds of materials clump together quite nicely.
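The two-mode network itself is simple to build once the topic model has reported how much of each context belongs to each topic: every context gets an edge to each topic that makes up a large enough share of it. Here is a rough sketch that writes such an edge list as a csv with Source, Target and Weight columns, which Gephi's spreadsheet import can read. The 0.2 cut-off, the node naming and the sample proportions are my assumptions, not the values used for the images in this post.

```python
import csv
import io


def two_mode_edges(doc_topics, min_weight=0.2):
    """Link each context to every topic whose share of that context is at least min_weight.

    `doc_topics` maps a context number to its list of topic proportions (topic 1 first).
    """
    edges = []
    for context, weights in sorted(doc_topics.items()):
        for topic, weight in enumerate(weights, start=1):
            if weight >= min_weight:
                edges.append((f"context_{context}", f"topic_{topic}", weight))
    return edges


def to_edge_csv(edges):
    """Serialise the edges as csv text with Source, Target and Weight columns."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return out.getvalue()


# Invented proportions over three topics, for illustration only.
example = {101: [0.7, 0.25, 0.05], 102: [0.1, 0.1, 0.8]}
edge_list = two_mode_edges(example)
edge_csv = to_edge_csv(edge_list)
```

Prefixing node names with `context_` and `topic_` keeps the two kinds of node from colliding when a context number and a topic number happen to match.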

Do particular excavators ‘see’ particular kinds of info that others don’t? Do they ‘specialize’ in certain kinds of info? As a newbie on the Forum Novum project for BSR many years ago, I was never allowed on any of the ‘interesting’ stuff, being consigned to digging through layers of fill to find the depth of the natural soil level. There’s only so many ways to describe dirt. This kind of thing happens often. You want your most experienced excavators to handle the most delicate/intricate/complicated situations, but… I wonder.

Topic modeling this material, whilst including the names of the excavators attached to each context, seems to shed interesting light on the ways we see things archaeologically. In my other experiments with the PAS database, because of extraneous commas creeping in and shifting the fields, I often ended up with an inconsistent inclusion of the finds officers’ names, so I tended to just exclude them completely. That might be an error. I think we need to know whose voice is most tied to the ‘topics’/‘discourses’ that make up our record (after all, once it’s excavated, this is all we have left, right?). This experiment suggests that perhaps one of the more valuable outcomes of topic modeling archaeological material is the re-introduction of subjectivity into our records, the idea that many voices (modern and ancient) make up the ‘record’ – and we should listen to them.

In due course I’ll put the html up somewhere so that the interested reader can jump through the contexts along the topic – context – topic pathways suggested by the topic modeling. We use Harris matrices (a kind of network) to understand the three dimensional relationships amongst contexts (which imply their chronological ordering); what kinds of insights can deforming our reading of an excavation along the network paths suggested by the topic modeling produce?

Below are the visualizations of the modules.

And the topics with their top words:
Topic ID    Words

1 schager elisabet pottery area part remains found bone similar poss fills bones appears burnt located human pieces waste grey activity main animal clear cremations broken cbm fragments truncates domestic skull high underneath mid shells bit edge sort chalk vessels deposits charcoal nw sherds disarticulated lost oyster sterile specific includes thrown

2 pit roman ii po ossuary irregular large latest including probable mixed pictured truncating inside planned sealed appears cut continuation surviving soakaways remained intercutting step pitting results topped width relates infilling partial include moved northwards steven ashley contexts adult perpendicular offset remain aesthetically loaced disturb sprial mentioned compass fed skeletons connections

3 fill floor basement rubble concrete slab fl evidence bedding ce larger glass abutting represent demolition room darker suggesting repair boundaries situe remaining unclear feature continues samian cessy eval packed facade john photo subrectangular reused actual ws lay inclusion noted lie teh constrcution looked crees brick lots archaeology flexed state

4 soakaway late water sump collection su pm brick soak masonry structure horn core back lined bricks lining drainage masonary materials face smell fit red held system courses time functioned sloping putrid cores aid headers lain knocked pipes mottled lies bands buried rotten real lying tirtiary simple earthernware exterior acrivity respective

5 pm pooley ashley late backfill century brick lucas tom cellar made garden line deliberate material walls cistern places sitting leveling thc proximity shallow backfilling based lerza rivets lifting limestone rebuild characteristic general redep suggested potential campion signs putrid map shown phase bits occurance structure element disintegrated ash southwards act crumble

6 truncated linear modern clark william heavily south west east truncation due north shape rectangular foundation cist machine stone cutting running aligned relationship pre composition tiles ne note observed worked sides deeper manhole intrusion define identical machining unknown depression tile mod axis bagged tegula limit channel erosional forming sample loe uneven

7 cut construction structural back slightly ring recut ephemeral completely realised doesn left partly heading heavy fragment contents analogous suggests comprises properties limestone short wells thc intervening association reflecting pictures clarify count sotnes terminus browny vertically bar unarticulated highest repdeposited things redeposit crmated tank approx ifthe lessnes forming explaining inclination plan

8 fill top base finds contained clancy sara clay organic context section date level horncore excavated original sheet shallow sketch nature suggest silting pipe suggests depth sampled trap lined dumped fully put reverse cemetery hearth beam deliberately frequent removal orientation orange paper backfilled horncores lain discussion sealed cultural appeared thick tenon

9 pit roman cut howell paula oval tip big ground difficult exact probable vertical pocket reflect shows pretty phase work means times duffy region alignment man matches nail wasn sequence build silting clinker fl brittle abuts tentar db sewer quadrant disarticulation implying characteristics revised constuction bottomed pressed unimpressive smc extending

10 gary webster filled fill possibly surface metalling gravelly related underlying unclear laid em difficult compact cobbling overlying represent dark modern mix undetermined series metalled place yellow gaps se stratigraphically extend dumps intentionally missing size charnal foundations spilled lack unsure things areas barrell blue metal yard variety respects ploughsoil anphora

11 pm post early fill med large medieval cess lot contemporary light pc inclusions latrine observed single mortar collapse character leather recovered ceramic suggests extent glacial lense hand event green interpretation resting case demo roughly curved household apparently assist inflow setting render cores varying determining belongs tenuously derivation mixture unlike consistant

12 refuse pr greg crees pit rubbish kind determine paula representing previously rounded bs discovered full gradual probing based enclosing struck housing similarities fronted coursings characterisation excavate sharp valley abse meeting people compare chronologically indication hypocaust blurring subjected distinctive amost grain remaining forms patchy interred colours including similar time midden

13 fill lower natural secondary sand greyish shaped statter claire surveyed black yorkstone proper ashley loose return horizontally mmx rest slots tbm patches largest acidic order distinct interface terrace drainage seperated ark rubberley hit spiralled rebuild destruction coming eastwards sharply hold candidate smells distorted field air powdered stains overview vacant dated

14 roman carreton adrian upper pit colour excavation makes funerary alignment southern preservation fireplace cover collapsed extending scattered adhering pinkish comprised ns nw bag smelly find soakaway whoel gs meant belonged disused regard ditches meters quarries huge making corresponds ritual existing cemented dimensions starts dimension marked paired excavtion staining shipton

15 wall fergal donoghue late pm wa sill building georgian st internal tenter butting wooden victorian present buttress support extension long prescot barrel house rear street dividing facing immediately platform front rising moisture prevent slate thinks medium slabs beneath seemingly fl counstruction plot wider lienar knees erosional lies cu trample photographed

16 roman fill matt ceri shipton law nails williams earlier black form find situ obvious fe uncertain complete amounts objects culvert smae skulls notably addition wood stoney domed truncations rectilinear pyres quality moderate working bonding earliest dark gis ark failry compost peeled functional rows ended properly remnants buildings accounted variation

17 cremation burial cr urn disturbed pot plan tile dobosz ukasz votive diffuse recorded dug built sample cm represents bone chest cremated position surrounding box analysis nb lifted coin regular offering vessel concentration occasional deposits suggesting block intact urned sw notable lid samples deep stones western broad higher plate cms relate

18 make levelling mu layer gravel dump material brickearth redeposited ed sandy deposited earth dumped spread dirty ground silty slumped capping clayey charcoal derived quarrying stoney extraction layers thin square sorted period exposed occupation sands soft parts provide lines stuff didn true partly significantly basal white tom mixey cluster test central

19 void external soil deposit hole cultivation posthole ec soils sp fra features lerza site fairly agricultural number brown debris dep evaluation reworked dumping result dates horticultural environmental run unurned plough residue deposition manuring representing upright storage exit family connected cleaning difference squared linked geophoto amorphous gravely concentrations poo defined

20 pit david unspecified ross edge roman brenna lowest shallow expect final basal presume dimensions marcus pebbles angular appeared covering diffuse processing stage stuart lens stored missed thickness const irregularities souther button funcation limits uncear oblong wider poshole suggested works fil metaling patella jaws grounds greater major purposes elisabet derive pegtile

21 cut small ruth rolfe side cuts end pits grid sq eastern circular originally western piece hard mm fact partially edges removed orangey wood half northern thought directly separate nearby degraded marcus initial urns period solid straight slope inwash graves limit rough wide occured occasionally centre good concave leading survives undertermined

22 drain ditch gully feature possibly shallow trench bottom boundary hassett visible southern runs burials sides postmed aspects slot cemetary presence point robber quarrying footing essentially direction formed doesn land homogenous number indicating section constructed thc terraced gulley parallel holes assoc overflow longbone debitage arising pressure fragment mark glazed wash sealing

23 pit quarry pq hassan anies primary silt dark prob middle zone skull filling mausoleum planned machined edges tiled tanked evident reused stain northeast ts corner sit redepoisted doubt terminate pillow overleaf shale fits standard means dateable existant redundant easts dropping quarried usage gc report truncate trampling compositions marcus bag

24 morse chaz deposit mixed roman dumped brown gravels rich function silty forms lenses subsided narrow assume robbed rest past discernable pitcut con sitly barren bucket cesspit shot beneath late unfrogged sister occupancy flure terminates consister retrieved resolved parallel joining ideas give millefiore burrial cd assumption regularity uppermost imbrex deposite

25 grave roman skeleton cut sk moskal tomasz coffin dug inhumation preserved poorly erroded goods body left legs head articulated skeletal nos poor events juvenile severly feet condition fragmentary holding bed ends stain strongly info spaced cu deposited shaped assigned disturbance cleaned chalk disatriculated femur hands soakawy showing overhangs hom cen

In which I topic model the entire PAS database by individual rows

Previously, I was trying to consider the geography of Roman Britain as a corpus of documents – individual geographic (modern) areas – where the records in the Portable Antiquities Scheme database formed the words of the document.

Today, I inverted that process. I treated each individual row in the entire PAS database as an individual document, with the data within that record its words. It took about two hours of processing time, looking for 100 topics. I now have a series of outputs that neither Excel nor Notepad++ can open, as they are too big. I’ll have to break the files up before I can dig too much deeper into them. However, what I can examine seems promising – topics that seem to indicate various regions; topics that indicate particular finds officers; topics that indicate particular kinds of artefacts; topics that indicate the status of the object (whether it was returned to the finder). Here’s a sampling:

Topic Weight Words
94 0.01654 mm thick wide weighs long diameter measures grams length width weight thickness high weighing fragment edge section maximum measuring
3 0.01442 suffolk east metal detector minter faye finder returned alloy plouviez judith copper mid st jane carr coastal edmundsbury geake
22 0.01409 green patina surface dark colour mid brown alloy copper corrosion light worn grey slightly condition corroded object pitted original
45 0.01374 mm weight width thickness length diameter atherton rachel maximum thick derbyshire dimensions wt height fragment midlands max including complete
1 0.01165 yorkshire humber riding metal detector east finder north returned alloy copper holmes simon paynton ceinwen hambleton selby david illegible
56 0.01076 east lincolnshire adam daubney midlands detector metal alloy lindsey copper finder returned kesteven north west elwes marina nottinghamshire rushcliffe
49 0.01028 lines decorated incised side decoration line central edge raised ring centre border grooves dot cross rectangular end upper punched
69 0.01003 ae nummus constantine house gloria exercitvs bust soldiers standards copper standard victory prow ii left illegible constantinopolis helmeted ad
27 0.00985 frame buckle pin bar alloy copper medieval loop edge oval outer missing cast strap double narrowed shaped looped section
92 0.00976 sherd pottery rim fabric sherds ware vessel grey chance find body medieval ceramic roman detecting inclusions colour surface orange

[Images: topic 3, topic 1, topic 94, topic 3, topic 22, topic 45]
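For anyone wanting to try the 'one row, one document' approach, here is a rough sketch of the preparation step. It assumes a CSV export with an identifier column, and writes the one-instance-per-line layout (name, label, text) that MALLET's import-file command reads by default, as far as I understand it. The field names, the `pas` label and the sample rows are all made up. The `chunks` helper is one way to break up output too big for Excel or Notepad++.

```python
import csv
import io

def rows_to_mallet_lines(csv_text, id_field="id", label="pas"):
    """Turn each CSV row into one 'name<TAB>label<TAB>text' line."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Every non-empty field except the identifier becomes the document's words.
        parts = [v for k, v in row.items()
                 if k != id_field and isinstance(v, str) and v]
        words = " ".join(" ".join(parts).split())  # collapse stray whitespace
        lines.append(f"{row[id_field]}\t{label}\t{words}")
    return lines

def chunks(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Invented rows, for illustration only.
sample = (
    "id,object,description,county\n"
    "1,buckle,copper alloy frame,Suffolk\n"
    "2,sherd,grey ware rim,Kent\n"
)

docs = rows_to_mallet_lines(sample)
parts = list(chunks(docs, 1))
print(docs[0])
print(len(parts))
```

In practice the chunk size would be whatever the spreadsheet or editor can comfortably open, and each chunk would be written to its own file.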

Topics as Word Clouds

Elijah Meeks and Mat Jockers have both used word clouds to visualize topics from topic models. Colour, orientation, relative placement of the words – all of these could be used to convey different dimensions of the data. Below, you’ll find clouds for each of my initial 50 topics generated from the Roman materials in the Portable Antiquities Scheme database (some 100 000 rows, or nearly 1/5 of the database, collected together into ‘documents’ where each unitary district authority is the ‘document’ and the text is the descriptions of things found there). The word clouds are generated from the word weights file that MALLET can output. There are 8100 unique tokens when I convert the database into a MALLET file; each one of those is present in each ‘bag of words’ or topic that MALLET generates, but to differing degrees. Thus, word clouds (here generated with Wordle) pull out important information that the word keys document does not. However, given that I optimized the interval whilst generating the topic models, the keys document provides an indication of the strength of the topic in the corpus. I’ve arranged the word clouds top-bottom, left-right, scaling them against the size of the strongest topic (topic 22). I’ll be damned if I can get WordPress to just display each image under the other one. Even stripped my table out, it did!
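A sketch of how the word weights file might be turned into per-topic lists for a word cloud. It assumes each line of the file is 'topic, word, weight' separated by tabs, which is my understanding of what MALLET writes; the weights below are invented.

```python
from collections import defaultdict

def top_words(lines, n=3):
    """Group 'topic<TAB>word<TAB>weight' lines; keep the n heaviest words per topic."""
    by_topic = defaultdict(list)
    for line in lines:
        topic, word, weight = line.rstrip("\n").split("\t")
        by_topic[int(topic)].append((word, float(weight)))
    return {
        topic: sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]
        for topic, pairs in by_topic.items()
    }

# Made-up weights: every token appears in every topic, to differing degrees.
sample = [
    "0\tbrooch\t12.01", "0\tcoin\t3.01", "0\tsherd\t0.01",
    "1\tbrooch\t0.01", "1\tcoin\t40.01", "1\tsherd\t7.01",
]

result = top_words(sample, n=2)
print(result)
```

Keeping the weights alongside the words, rather than only the ranked keys, is what lets a cloud show how much heavier one word is than the next.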

At any rate, as one churns through the 50 topics, after about the first 11 (depicted below), the topics get progressively more noisy as MALLET attempts to deal with incomplete transcriptions of the epigraphy of the coins, and the frequent notes about the source for the identification of the coins (the work of Guest & Wells). The final topic depicted here, topic 20, directly references a note often left in the database concerning the quality of an individual record; these frequently are in connection with materials that entered the British Museum collection before the Portable Antiquities Scheme got going and hence the information is not up to usual standards.

This exercise then suggests to me that 50 topics is just too many. I’m rerunning everything with 10 topics this time.

Where Roman Roads and Topic Models Intersect

Previously, I ended up with a map of UK districts, coloured by the five groups that Gephi’s modularity routine suggested were present, in the network of districts to districts based on shared patterns in the underlying topics (the topic model generated from the total dump of the Portable Antiquities Scheme database).

I asked on Twitter if the patterns seemed evocative of anything; Phil Mills suggested that they perhaps matched civitas boundaries. He provided me with an image of those boundaries (thanks Phil!) as well as some kmz files. Below are two images, one with civitas capitals (hand-drawn in by me) and Roman roads. Together, they are evocative. Blocks of colour seem to go very well with civitas boundaries; where blocks of colour overlap those boundaries, they seem to march along the routes of the roads. And all this from looking at topic models! I think it is getting progressively safer to say that the patterns found in an archaeological database through topic modelling are indeed meaningful on the ground. The factors of government, of identity, of mobility, seem to emerge in the topic model.
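For readers wondering how a district-to-district network can come out of a topic model at all, here is one possible sketch. It is an assumption on my part, not a record of the exact steps: each district is a vector of topic proportions, and two districts are tied when the cosine similarity of their vectors passes a threshold. The district names, proportions and the 0.9 cut-off are all invented.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_edges(doc_topics, threshold=0.9):
    """Return (district, district, similarity) for every pair at or above the threshold."""
    edges = []
    for (d1, v1), (d2, v2) in combinations(sorted(doc_topics.items()), 2):
        score = cosine(v1, v2)
        if score >= threshold:
            edges.append((d1, d2, round(score, 3)))
    return edges

# Invented topic proportions for three hypothetical districts.
sample = {
    "district_a": [0.8, 0.1, 0.1],
    "district_b": [0.7, 0.2, 0.1],
    "district_c": [0.1, 0.1, 0.8],
}

edges = similarity_edges(sample)
print(edges)
```

An edge list like this, saved with source, target and weight columns, is the sort of thing Gephi can import before running its modularity routine. Where the threshold sits will change how many groups appear, so it is worth trying a few values.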

