Topic modeling is very popular at the moment in the digital humanities. Ian, Scott and I described topic models as tools for extracting topics or injecting semantic meaning into vocabularies: “Topic models represent a family of computer programs that extract topics from texts. A topic to the computer is a list of words that occur in statistically meaningful ways. A text can be an email, a blog post, a book chapter, a journal article, a diary entry – that is, any kind of unstructured text” (Graham, Weingart, and Milligan 2012). In that tutorial, ‘unstructured’ means that there is no encoding in the text by which a computer can model any of its semantic meaning.
Archaeological datasets are rich, largely unstructured bodies of text. While there are examples of archaeological datasets that are coded with semantic meaning through XML and Text Encoding Initiative practices, many of these are done after the fact of excavation or collection. Day to day, things can be rather different, and this material can be considered to be ‘largely unstructured’ despite the use of databases, controlled vocabulary, and other means to maintain standardized descriptions of what is excavated, collected, and analyzed. This is because of the human factor. Not all archaeologists are equally skilled. Not all data gets recorded according to the standards. Where some see few differences in a particular clay fabric type, others might see many, and vice versa. Archaeological custom might call a particular vessel type a ‘casserole’, thus suggesting a particular use, only because in the 19th century when that vessel type was first encountered it reminded the archaeologist of what was in his kitchen – there is no necessary correlation between what we as archaeologists call things and what those things were originally used for. Further, once data is recorded (and the site has been destroyed through the excavation process), we tend to analyze these materials in isolation. That is, we write our analyses based on all of the examples of a particular type, rather than considering the interrelationships amongst the data found in the same context or locus.
David Mimno in 2009 turned the tools of data analysis on the databases of household materials recovered and recorded room by room at Pompeii. He considered each room as a ‘document’ and the artefacts therein as the ‘tokens’ or ‘words’ within that document, for the purposes of topic modeling. The resulting ‘topics’ of this analysis are what he calls ‘vocabularies’ of object types which when taken together can suggest the mixture of functions particular rooms may have had in Pompeii.
He writes, ‘the purpose of this tool is not to show that topic modeling is the best tool for archaeological investigation, but that it is an appropriate tool that can provide a complement to human analysis….mathematically concrete in its biases’. The ‘casseroles’ of Pompeii turn out to have nothing to do with food preparation, in Mimno’s analysis. To date, I believe this is the only example of topic modeling applied to archaeological data.
Directly inspired by that example, I’ve been exploring the use of topic models on another rich archaeological dataset, the Portable Antiquities Scheme database in the UK. The Portable Antiquities Scheme is a project “to encourage the voluntary recording of archaeological objects found by members of the public in England and Wales”. To date, there are over half a million unique records in the Scheme’s database. These are small things, things that fell out of pockets, things that often get found via metal-detecting.
Here’s what I’ve been doing.
1. I downloaded a nightly dump of the PAS data back in April; it came as a CSV file. I opened the file and discovered over a million lines of records. On closer examination, I think the problem has something to do with the encoding: there are line breaks, carriage returns, and other non-printing characters (as well as commas used within fields), so that when I open the file a single record (say, a coin hoard) ends up occupying tens of lines, or fields shift at the extraneous commas.
2. I cleaned this data up using Notepad++ and the liberal use of regular expressions to put everything back together again. The entire file is something like 385 MB.
3. I imported it into MS Access so that I could begin to filter it. I’ve been playing with Palaeolithic, Mesolithic, and Neolithic records; Bronze Age records; and Roman records. The Roman material itself occupies somewhere around 100 000 unique records.
4. I exported my queries so that I would have a simpler table with dates, descriptions, and measurements.
5. I filtered this table in Excel so that I could copy and paste out all of the records found within a particular district (which left me with a folder with 275 files, totaling something like 25 MB of text).
6. Meanwhile, I began topic modeling the unfiltered total PAS database (just after #2 above). Each run takes about 3 hours, as I’ve been running diagnostics to explore the patterns. The problem I have here though is what, precisely, am I finding? What does a cluster of records that share a topic actually mean, archaeologically? Do topics sort themselves out by period, by place, by material, by finds officer…?
7. As that’s been going on, I’ve been topic modeling the folders that contain the districts of England and Wales for a given period. Let’s look at the Roman period.
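Steps 1 to 5 above could also be scripted rather than done by hand in Notepad++, Access and Excel. What follows is only a minimal sketch, not what I actually ran. It assumes the dump is a well-formed CSV in which multi-line fields are quoted (which may not hold for the real PAS export), and the column names `district`, `broadperiod` and `description` are hypothetical stand-ins for whatever the real header row uses.

```python
import csv
import io
from collections import defaultdict


def group_descriptions(csv_file, period="ROMAN",
                       district_col="district",
                       period_col="broadperiod",
                       text_col="description"):
    """Return {district: [description, ...]} for one period.

    The column names are hypothetical; adjust them to the real header row.
    csv.DictReader keeps a quoted field intact even when it contains
    commas or line breaks, so a hoard record stays a single row.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        if (row.get(period_col) or "").strip().upper() != period.upper():
            continue
        district = (row.get(district_col) or "").strip().lower()
        if not district:
            continue  # skip records with no district recorded
        # Collapse internal line breaks and runs of whitespace.
        text = " ".join((row.get(text_col) or "").split())
        groups[district].append(text)
    return dict(groups)


# A tiny invented sample with a line break and a comma inside one field.
sample = io.StringIO(
    'district,broadperiod,description\r\n'
    'Adur,ROMAN,"A copper-alloy nummus,\nworn and corroded"\r\n'
    'Adur,ROMAN,A lead weight\r\n'
    'Gwynedd,BRONZE AGE,A socketed axe\r\n',
    newline="",
)
docs = group_descriptions(sample)
print(docs)
```

Each list could then be written out as one text file per district, which is the shape of folder that MALLET's import-dir command expects.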
There are 275 files, where a handful have *a lot* of data (> 1000 kb), while the vast majority are fairly small (< 100 kb). Perhaps that replicates patterns of metal detecting – see Bevan on biases in the PAS. The remaining districts (those not among the 275) seem to have no records in the database, so I’ve got 80% coverage for all of England and Wales. I’ve been iterating over all of this data, so I’ll just describe the most recent run, as it seems to be a typical result. Using MALLET 2.0.7, I made a topic model with 50 topics (with the optimize-interval option turned on, to shake out the useful from the not-so-useful topics). Last night, as I did this, the topic diagnostics package just wouldn’t work for me (you run it from the MALLET directory, but it lives at the MALLET site; perhaps they were working on it). So I’ll probably want to run all these again.
If I sort the topic keys by their prominence (see ‘optimize interval’), the top 14 all seem to describe different kinds of objects: brooches, denarii, nummus, sherds, lead weights, radiate, coin dates, and the ‘heads’ sides of coins (which Emperor). Then we get to the next topic, which reads: “record central database recording usual standards fall created scheme aware portable began antiquities rectify working corroded ae worn century”. This meta-note about data quality appears throughout the database, and refers to materials collected before the Scheme got going.
After that, the remaining topics all seem to deal with the epigraphy of coins, and the various inscriptions, figurative devices, their weights & materials. A number of these topics also include allusions to the work of Guest and Wells, whose work on Iron Age Coins is frequently cited in the database.
Let’s look at the individual districts now, and how these topics play over geographic space. Given that these are modern districts, it’d be better – perhaps – to do this over again with the materials sorted into geographic entities which make sense from a Roman perspective. Perhaps do it by major Roman roads (sorting the records so that districts through which Watling Street passes are gathered into a single document). Often what people do when they want to visualize the patterns of topic interconnections in a corpus is to trim the composition document so that only topics greater than a certain threshold are imported to a package like Gephi.
My suspicion is that that would throw out a lot of useful data. It may be that it’s the very weak connections that matter. A very strong topic-document relationship might just mean that a coin hoard found in the area is blocking the other signals.
In which case, let’s bring the whole composition document into Gephi. Start with this:
and delete out the edge weights. (I’m trying to figure out how to do what follows without deleting those edge weights, but bear with me.)
You end up with something like this:
adur 4 15 22 […etc…]
Save the file with a new name, as csv.
Open in Notepad++ (or similar) and replace the commas with ;
Go to Gephi. Under ‘open graph file’, select your csv file. This is not the same as ‘import spreadsheet’ under the data table tab. You can import a comma separated file where the first item on a line is a node, and each subsequent item is another node to which it is attached. If you tried to open that file under the ‘import spreadsheet’ button, you’d get an error message – in that dialogue, you have to have two columns, source and target, where each row describes a single relationship. See the difference?
This is why if you left the edge weights in the csv file – let’s call it an adjacency file – you’d end up with weights becoming nodes, which is a mess. If you want to keep the weights, you have to do the second option.
I’ve tried it both ways. Ultimately, while the first option is much, much faster, the second option is the one to go for, because the edge weights (the proportion of a document that a topic accounts for) are extremely important. So I created a single list that included seven pairs of topic-weight combinations. (This doesn’t create a graph where k=7, because not every document had that many topics. But why 7? In truth, after that point, the topics all seemed to be well under 1% of each document’s composition.)
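Building that source-target-weight list by hand is tedious, so here is a rough sketch of how it might be scripted. It assumes the composition file has the layout I see from MALLET 2.0.7 (a header line beginning with #, then on each line a document index, the file name, and alternating topic and proportion pairs); check your own file before trusting it. The function name, the `topic_` prefix and the sample rows are all invented for illustration.

```python
import csv
import io
import os


def composition_to_edges(lines, top_n=7, min_weight=0.0):
    """Yield (district, topic, weight) rows from a doc-topics file.

    Assumes each data line is: doc index, source name, then alternating
    topic id / proportion pairs, separated by whitespace. Lines whose
    first item starts with '#' are treated as the header and skipped.
    """
    for line in lines:
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue
        # Reduce 'file:/.../adur.txt' to the bare district name 'adur'.
        name = os.path.splitext(os.path.basename(parts[1]))[0]
        pairs = [(parts[i], float(parts[i + 1]))
                 for i in range(2, len(parts) - 1, 2)]
        pairs.sort(key=lambda p: p[1], reverse=True)
        for topic, weight in pairs[:top_n]:
            if weight >= min_weight:
                # The prefix keeps topic ids visibly distinct from districts.
                yield name, "topic_" + topic, weight


# Invented sample rows, not real PAS results.
sample = [
    "#doc name topic proportion ...",
    "0\tfile:/data/roman/adur.txt\t4\t0.41\t15\t0.22\t22\t0.004",
    "1\tfile:/data/roman/gwynedd.txt\t15\t0.63\t4\t0.05",
]
edges = list(composition_to_edges(sample, top_n=2))

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["Source", "Target", "Weight"])  # one relationship per row
writer.writerows(edges)
print(out.getvalue())
```

Setting top_n=7 would mirror the seven pairs described above; raising min_weight would reproduce the usual thresholding instead.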
With me so far? Great.
Now that I have a two-mode network in Gephi, I can begin to analyze the pattern of topics in the documents. Using the multi-mode plugin, I separate this network into two one-mode networks: topic to topic (based on appearing in the same district) and district to district (based on having the same topics, in different strengths).
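To make the idea of splitting a two-mode network concrete, here is a small sketch of one way to do the projection outside Gephi. The weighting rule (summing, over shared neighbours, the product of the two original weights) is an assumption chosen for illustration; I am not claiming it is exactly what the plugin computes. The edge values are invented.

```python
from collections import defaultdict
from itertools import combinations


def project(edges, onto="district"):
    """Collapse (district, topic, weight) edges into a one-mode network.

    Two nodes of the kept mode are linked when they share a neighbour in
    the other mode. The link weight is the sum, over shared neighbours,
    of the product of the two original weights (an illustrative choice).
    """
    neighbours = defaultdict(dict)  # shared node -> {kept node: weight}
    for district, topic, weight in edges:
        if onto == "district":
            kept, shared = district, topic
        else:
            kept, shared = topic, district
        neighbours[shared][kept] = weight
    projected = defaultdict(float)
    for members in neighbours.values():
        for a, b in combinations(sorted(members), 2):
            projected[(a, b)] += members[a] * members[b]
    return dict(projected)


# Invented weights, not real results.
edges = [
    ("adur", "t4", 0.5), ("adur", "t15", 0.2),
    ("gwynedd", "t15", 0.6), ("gwynedd", "t4", 0.1),
    ("arun", "t4", 0.4),
]
district_net = project(edges, onto="district")
topic_net = project(edges, onto="topic")
print(district_net)
print(topic_net)
```

In this toy case adur and gwynedd are tied together through both topics, so their link is stronger than either topic alone would suggest.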
Network visualization doesn’t offer anything useful here (although Wales is always quite distinctly apparent when you do; it’s because of the coin hoards). Instead, I simply compute useful network metrics. For instance, ‘betweenness’ literally counts the number of times a node lies in between all pairs of other nodes, given the shortest paths connecting them. In a piece of text, words with high betweenness do the heavy semantic lifting. So identifying topics that are most in between in the topic – topic network should be a useful thing to do. But what does ‘betweenness’ imply for the district – district network? I’m not sure yet. Pivotal areas in the formation of material culture?
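For anyone curious what that metric is doing under the hood, here is a compact sketch of betweenness for a small undirected, unweighted network, following Brandes' algorithm. It is for illustration only: it ignores edge weights, it assumes every neighbour also appears as a key in the dictionary, and the toy node names are invented.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalised betweenness for an undirected, unweighted graph.

    adjacency maps each node to the set of its neighbours. One
    breadth-first search is run per source node, then credit is passed
    back along the shortest paths that were found.
    """
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        order = []                            # nodes in order of discovery
        preds = {v: [] for v in adjacency}    # predecessors on shortest paths
        paths = dict.fromkeys(adjacency, 0)   # number of shortest paths
        dist = dict.fromkeys(adjacency, -1)   # -1 means not yet reached
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        credit = dict.fromkeys(adjacency, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                credit[v] += paths[v] / paths[w] * (1 + credit[w])
            if w != source:
                score[w] += credit[w]
    # Every undirected pair was counted once from each end.
    return {v: s / 2 for v, s in score.items()}


# A toy star: 'coins' sits between every other pair of topics.
toy = {
    "coins": {"brooches", "sherds", "weights"},
    "brooches": {"coins"},
    "sherds": {"coins"},
    "weights": {"coins"},
}
scores = betweenness(toy)
print(scores)
```

In the toy star, the hub lies on the only path between each of the three outer pairs, so it scores 3 and everything else scores 0.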
What is perhaps more useful is the ‘modularity’. It’s just one of a number of algorithms one could use to try to find structural sub-groups in a network (NodeXL has many more). Perhaps there are interesting geographical patterns if we examine the pattern of links. So I ran modularity, and uploaded the results to OpenHeatMap to visualize them geographically. Network analysis doesn’t need to produce network visualizations, by the way.
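It may help to see what the modularity score itself measures. The sketch below does not search for groups the way Gephi's routine does (as far as I know, Gephi uses the Louvain method to look for a partition with a high score); it only computes Newman's modularity Q for a partition you already have. The toy network and group labels are invented.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman's modularity Q for a given partition of an undirected graph.

    edges is a list of (node_a, node_b, weight) with no self-loops;
    community maps each node to a group label. With m the total edge
    weight, Q is the sum over groups of
    (weight inside the group / m) - (group degree / 2m) ** 2.
    """
    m = sum(w for _, _, w in edges)
    internal = defaultdict(float)  # edge weight falling inside each group
    degree = defaultdict(float)    # summed degree of each group's nodes
    for a, b, w in edges:
        degree[community[a]] += w
        degree[community[b]] += w
        if community[a] == community[b]:
            internal[community[a]] += w
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles of districts joined by a single link.
edges = [
    ("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
    ("x", "y", 1), ("y", "z", 1), ("x", "z", 1),
    ("c", "x", 1),
]
two_groups = {"a": 1, "b": 1, "c": 1, "x": 2, "y": 2, "z": 2}
one_group = dict.fromkeys("abcxyz", 1)
q_split = modularity(edges, two_groups)
q_lumped = modularity(edges, one_group)
print(q_split, q_lumped)
```

Splitting the toy network into its two triangles gives Q = 5/14 (about 0.357), while lumping everything into one group gives 0: a higher score means more links fall inside groups than chance alone would predict.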
See the result for yourself here: http://www.openheatmap.com/embed.html?map=AnteriorsFrijolsHermetists
It colours each district based on the group that it belongs to. If you mouse-over a district, it’ll give you that group’s number – those numbers shouldn’t be confused with anything else. I’d do this in QGIS, but this was quicker for getting a sense of what’s going on.
I asked on Twitter (referencing a slightly earlier version) if these patterns suggested anything to any of the Romano-British crowd.
@electricarchaeo would be interesting to overlay civitas boundries/ Creightons IA coin core boundaries
— Phil Mills (@Tileman_and_son) May 29, 2013
Modularity for topic-topic also implies some interesting groupings, but these seem to mirror what one would expect by looking at their prominence in the keys.txt file. So that’s where I am now, soon to try out Phil’s suggestion.
As Paul Harvey was wont to say, ‘…and now you know… the REST of the story’. At DH2013 I hope to be able to tell you what all of this may mean.