In July, I’m presenting work on data mining an archaeological database: in this case, that of the Portable Antiquities Scheme (PAS).
I wondered, if I treated each district in the UK as a ‘document’, and the items recovered in its territory as the words, would I see any interesting or useful patterns if I ran some topic models?
To give you a sense of the scale of this data, there are over 160 000 individual records in the material I obtained from PAS. An individual record might describe a ‘hoard’, so there are *well* over 160 000 individual objects. When you sort this material into broad chronological periods, you find:
Paleolithic: 305 records
Bronze Age: 2620
Early Medieval: 8421
Post Medieval: 27879
Blank cells: 1278
Quite a lot of material. So, after massaging the data and cleaning things up, I began to work with a very small subset: the records tagged ‘bronze age’ from 14 districts (104 records). This was merely an exploration, to see if there’s any meat to my intuitive belief that there should be some sort of latent structure. The 14 districts I selected (the first 14 when I sorted ‘Bronze Age’) are:
I put every record from Wokingham District into a single txt file, then every one from Winchester, and so on until I was done (I really need to automate that). Then I fed the text files through MALLET, using the Java GUI with its default settings for this initial exploration. In a more robust exploration, I would work directly from the command line, tweaking until I found the best number of topics, and so on.
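The one-file-per-district step could be automated along these lines. This is a minimal sketch, not the workflow I actually used: it assumes the PAS export is a CSV file, and the column names `district` and `description` are hypothetical placeholders for whatever the real export calls them.

```python
import csv
import re
from collections import defaultdict
from pathlib import Path


def write_district_documents(csv_path, out_dir,
                             district_col="district", text_col="description"):
    """Write one plain-text file per district holding all of its record descriptions.

    The column names are assumptions; adjust them to match the actual export.
    """
    by_district = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            district = (row.get(district_col) or "").strip()
            text = (row.get(text_col) or "").strip()
            if district and text:  # skip rows missing either field
                by_district[district].append(text)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = {}
    for district, texts in by_district.items():
        # Build a filesystem-safe name, e.g. "Test Valley" -> "Test_Valley.txt"
        safe = re.sub(r"[^A-Za-z0-9]+", "_", district).strip("_")
        path = out / f"{safe}.txt"
        path.write_text("\n".join(texts) + "\n", encoding="utf-8")
        written[district] = path
    return written
```

Each district then becomes one ‘document’ in a single folder, which is the shape of input a topic-modelling tool expects when it imports a directory of text files.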
So here’s what I found.
List of Topics
1. alloy palstave mm copper green surface slight cast dark penannular
2. mouth sides loop dims looped corners armorican axeheads core cast
3. blade axehead prominent casting iron hoard intact uneven single narrow
4. age fragment late surfaces alan spear body faces head flanged
5. age socket collar sectioned alloy slightly ridge seams front square
6. record flint grey scraper antiquities dorsal tool angle black visible
7. bronze patina end stop made remains flat decoration found corroded
8. database central rectify working recording standards usual fall aware began
9. bronze copper flashes part side edge large ridges shallow top
10. socketed straight axe rounded complete horizontal moulded rectangular expanding upper
What do those topics mean? To a human, they are all variations on the description of the artefacts. Given that multiple humans described these artefacts in the first place, perhaps (and it depends too on the kind of guidance and rigour that the PAS uses in its data entry) these topics gather some of the blurriness of categorization, a way of bypassing the clumpers and the splitters amongst us. Obviously, some more thought about what these may mean is necessary. But onwards!
I brought the resultant ‘documents: topics, % contribution’ list into Gephi for some visualization. Since it was a small dataset, I did no pruning. Topic 4 does the most lifting in this network. In its ‘module’, you find topics 9, 10, 3, 5 (coloured purple) and the districts of Gravesham, Bromley, Dover, Canterbury, Test Valley, and New Forest. But how much weight does this visualization carry? Since the graph is two-mode, and metrics such as modularity and betweenness are really only appropriate for a one-mode graph, probably not much. So I collapsed it into a one-mode graph of district to district, with ties weighted by the topics the districts share.
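To make the collapse from two-mode to one-mode concrete, here is a minimal sketch of one common way to do it. The weighting is an assumption on my part for illustration: the tie between two districts is the sum, over all topics, of the product of their topic proportions (equivalent to multiplying the district-by-topic matrix by its transpose). The proportions in the example are made up.

```python
from itertools import combinations


def project_districts(doc_topics, min_weight=0.0):
    """Collapse a district-by-topic table into weighted district-to-district ties.

    doc_topics maps a district name to its list of topic proportions.
    The tie weight is the sum over topics of the product of the two
    districts' proportions (an assumed weighting; others are possible).
    Ties with weight <= min_weight are dropped.
    """
    edges = {}
    for a, b in combinations(sorted(doc_topics), 2):
        weight = sum(x * y for x, y in zip(doc_topics[a], doc_topics[b]))
        if weight > min_weight:
            edges[(a, b)] = weight
    return edges


# Made-up proportions for three districts and three topics.
example = {
    "Dover": [0.5, 0.5, 0.0],
    "Canterbury": [0.5, 0.0, 0.5],
    "Wokingham": [0.0, 0.0, 1.0],
}
edges = project_districts(example)
```

In this toy case Dover and Wokingham share no topic, so no tie is drawn between them. With real topic-model output nearly every district has some small share of every topic, so raising `min_weight` is the pruning step I skipped here.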
The resultant graph is probably more useful for archaeology, for it ties areas together based on all of the material culture recorded in the database. At the recent SAA in Honolulu, in the Connected Past session, folks were constructing networks from artefacts using Brainerd-Robinson coefficients. The methodology I’m trying ought to be compared with those studies (see for instance Barbara Mills et al.’s recent article). I then ran modularity and betweenness statistics again. Why betweenness? If the ‘topics’ that emerge in this database reflect something within the underlying material culture, then interconnections between sites constructed from topics show some kind of flow (of ideas? culture? economics?). The ‘between’ sites straddle the most important of those flows, in which case the most ‘between’ districts might be rather more important.
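For comparison, the Brainerd-Robinson coefficient scores the similarity of two assemblages from their percentage composition: 200 minus the summed absolute differences between the percentages in each category, so 200 means identical proportions and 0 means nothing in common. A minimal sketch follows; the artefact categories and counts in the usage note are invented for illustration.

```python
def brainerd_robinson(counts_a, counts_b):
    """Brainerd-Robinson similarity between two assemblages.

    Each argument maps a category name to a count (or any non-negative
    amount). Returns a value from 0 (no overlap) to 200 (identical
    proportions).
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("each assemblage needs at least one object")
    categories = set(counts_a) | set(counts_b)
    # Sum the absolute differences between percentage shares.
    difference = sum(
        abs(100.0 * counts_a.get(c, 0) / total_a
            - 100.0 * counts_b.get(c, 0) / total_b)
        for c in categories
    )
    return 200.0 - difference
```

For example, `brainerd_robinson({"axe": 3, "spear": 1}, {"axe": 1, "spear": 1})` compares a 75/25 split with a 50/50 split. Because the function normalizes by the totals, a district’s topic proportions could be passed in the same way as artefact counts, which is one route to comparing the two approaches.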
Remarkably (and this could be an artefact of the method, rather than the underlying data), I get next to no variation in betweenness: every district except for East Hampshire, Ashford, and New Forest has the same score (and these three all have the same score too). Modularity finds two groups. Perhaps it’s an east/west dichotomy? I laid the network out with the nodes at their geographic locations (typically, the district council office). No east-west dichotomy. (Incidentally, you can now export to Google Earth, overlaying your network against pretty satellite pictures).
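One possible reason for the flat betweenness scores, which I have not verified against this data: if nearly every pair of districts shares at least a little topic weight, the unpruned one-mode graph is close to complete, and in a complete graph every shortest path is a direct tie, so no node sits ‘between’ any others. The sketch below implements unweighted betweenness (Brandes’ algorithm) in plain Python to illustrate the point; it ignores tie weights, and the toy graphs are invented.

```python
from collections import deque


def betweenness(adjacency):
    """Unweighted betweenness centrality for an undirected graph.

    adjacency maps each node to a list of its neighbours.
    """
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search, counting shortest paths from source.
        order = []
        predecessors = {node: [] for node in adjacency}
        paths = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:  # first visit
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:  # v lies on a shortest path to w
                    paths[w] += paths[v]
                    predecessors[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = {node: 0.0 for node in adjacency}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += paths[v] / paths[w] * (1.0 + dependency[w])
            if w != source:
                scores[w] += dependency[w]
    # Each undirected pair was counted once from each end.
    return {node: score / 2.0 for node, score in scores.items()}


# A fully connected toy graph versus a simple chain.
complete = {n: [m for m in "ABCD" if m != n] for n in "ABCD"}
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
flat_scores = betweenness(complete)
chain_scores = betweenness(chain)
```

In the complete graph every node gets the same score, while in the chain only the middle node scores above zero. If this is what is happening, pruning weak ties before running the statistic should bring out more variation.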
So… there seems to be something to it. The thing to do now is to do every record, every district, and every period, mapping out changes over time. In the interests of being able to assess this, though, I should perhaps stick to my knitting and just do the Roman period.