Visualizing texts using Overview

I’ve come across an interesting tool called ‘Overview’. It’s meant for journalists, but I see no reason why it can’t serve historical/archaeological ends as well. It does recursive adaptive k-means clustering rather than topic modeling, as I’d initially assumed (more on process here). You can upload texts as pdfs or within a table. One of the columns in your table could be a ‘tags’ column, whereby – for example – you indicate the year in which the entry was made (if you’re working with a diary). Then, Overview sorts your documents or entries into nested folders of similarity. You can then see how your tags – decades, say – play out across similar documents. In the screenshot below, I’ve fed the text of ca. 600 historical plaques into Overview:

At the broadest level of similarity, Overview divides the historical plaques into the following groups:

‘church, school, building, toronto, canada, street, first, house, canadian, college’ (545 plaques)

‘road, john_graves, humber, graves_simcoe, lake, river, trail, plant’ (41 plaques)

‘community’ with ‘italian, north_york, lansing, store, shepard, dempsey, sheppard_avenue’, 13 documents

‘: years’ with ‘years_ago, glacier, ice, temperance, transported, found, clay, excavation’, 11 documents.

That’s interesting information to know. In terms of getting the info back out, you can export a spreadsheet with tags attached. Within Overview, you might want to tag all documents together that sort into similar groupings, which you could then visualize with some other program. You can also search documents, and tag them manually. I wondered how plaques concerned with ‘children’, ‘women’, ‘agriculture’, ‘industry’, etc might play out, so I started using Overview’s automatic tagger (search for a word or phrase, apply that word or phrase as a tag to everything that is found). One could then visually explore the way various tags correspond with particular folders of similar documents (as in this example). That first broad group of ‘church school building canada toronto first york house street canadian’ just is too darned big, and so my tagging is hidden (see the image)- but it does give you a sense that the historical plaques in Toronto really are concerned with the first church, school, building, house, etc in Toronto (formerly, York). Architectural history trumps all. It would be interesting to know if these plaques are older than the other ones: has the interest in spaces/places of history shifted over time from buildings to people? Hmmm. I’d better check my topic models, and do some close reading.

Anyway, leaving that aside for now, I exported my tagged texts, and did a quick and dirty network visualization of tags connected to other tags by virtue of shared plaques. I only did this for 200 of the plaques, because, frankly, it’s Friday evening and I’d like to go home.
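For anyone who wants to try the same thing outside Overview, here is a minimal sketch of the two steps just described: tag every document that contains a search word, then link tags that share documents. The plaque texts and the function names (`auto_tag`, `tag_cooccurrence`) are invented for illustration. This is not Overview's own code, and the simple substring match is an assumption about how its search behaves.

```python
from collections import Counter
from itertools import combinations


def auto_tag(documents, keywords):
    """Tag each document with every keyword found in its text (case-insensitive)."""
    tagged = {}
    for doc_id, text in documents.items():
        lowered = text.lower()
        tagged[doc_id] = sorted(k for k in keywords if k.lower() in lowered)
    return tagged


def tag_cooccurrence(tagged):
    """Count, for each pair of tags, how many documents carry both."""
    edges = Counter()
    for tags in tagged.values():
        for a, b in combinations(sorted(set(tags)), 2):
            edges[(a, b)] += 1
    return edges


# Invented plaque texts, for illustration only.
plaques = {
    "p1": "The first school for children in the township stood here.",
    "p2": "Women and children worked in this industry.",
    "p3": "An early centre of industry on the river.",
}

tagged = auto_tag(plaques, ["children", "women", "industry"])
edges = tag_cooccurrence(tagged)

# Print a weighted edge list (tag, tag, number of shared plaques).
for (a, b), weight in sorted(edges.items()):
    print(f"{a},{b},{weight}")
```

The printed lines form a simple weighted edge list, which is the kind of thing a network program can read once a header row is added.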

Here’s what I saw [pdf version]:


So a cluster with ‘elderly’, ‘industry’, ‘doctor’, ‘medical’, ‘woman’…. I don’t think this visualization that I did was particularly useful.

Probably, it would be better to generate tags that collect everything together in the groups that the tree visualization in Overview generates, export that, and visualize as some kind of dendrogram. It would be good if the groupings could be exported without having to do that though.

Getting Historical Network Data into Gephi

I’m running a workshop next week on getting started with networks & gephi. Below, please find my first pass at a largely self-directed tutorial. This may eventually get incorporated into the Macroscope.

Data files for this tutorial may be found here. There’s a pdf/pptx with the images below, too.

The data for this exercise comes from Peter Holdsworth’s MA dissertation research, which Peter shared on Figshare here. Peter was interested in the social networks surrounding ideas of commemoration of the centenary of the War of 1812, in 1912. He studied the membership rolls for women’s service organizations in Ontario both before and after that centenary. By making his data public, Peter enables others to build upon his own research in a way not commonly done in history. (Peter can be followed on Twitter at

On with the show!

Download and install Gephi. (What follows assumes Gephi 0.8.2). You will need the MultiMode Projection plugin installed.

To install the plugin – select Tools >> Plugins (across the top of Gephi you’ll see ‘File Workspace View Tools Window Plugins Help’. Don’t click on this ‘plugins’. You need to hit ‘tools’ first. Some images would be helpful, eh?).

In the popup, under ‘available plugins’ look for ‘MultimodeNetworksTransformation’. Tick this box, then click on Install. Follow the instructions, ignore any warnings, click on ‘finish’. You may or may not need to restart Gephi to get the plugin running. If you suddenly see on the far right of the Gephi window a new tab beside ‘statistics’ and ‘filters’, called ‘Multimode Network’, then you’re ok.


Getting the Plugin

Assuming you’ve now got that sorted out,

1. Under ‘file’, select -> New project.
2. On the data laboratory tab, select Import-spreadsheet, and in the pop-up, make sure to select, under ‘as table’, ‘EDGES table’. Select women-orgs.csv. Click ‘next’, click ‘finish’.

(On the data table, have ‘edges’ selected. This is showing you the source and the target for each link (aka ‘edge’). This implies a directionality to the relationship that we just don’t know – so down below, when we get to statistics, we will always have to make sure to tell Gephi that we want the network treated as ‘undirected’. More on that below.)


Loading your csv file, step 1.


Loading your CSV file, step 2

3. Click on ‘copy data to other column’. Select ‘Id’. In the pop-up, select ‘Label’
4. Just as you did in step 2, now import NODES (Women-names.csv)

(nb. You can always add more attribute data to your network this way, as long as you always use a column called Id so that Gephi knows where to slot the new information. Make sure to never tick off the box labeled ‘force nodes to be created as new ones’.)

Adding new columns

5. Copy ID to Label
6. Add new column, make it boolean. Call it ‘organization’

Filtering & ticking off the boxes

7. In the Filter box, type [a-z], and select Id – this filters out all the women.
8. Tick off the check boxes in the ‘organization’ columns.

Save this as ‘women-organizations-2-mode.gephi’.
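If ticking boxes by hand gets tedious, the nodes table can also be prepared before import. What follows is a rough sketch only: it assumes the edge list has `Source` and `Target` columns with women as sources and organizations as targets, which is my assumption about the layout and not something checked against women-orgs.csv. The sample rows are invented, and I'm also assuming Gephi reads `true`/`false` text into a boolean column.

```python
import csv
import io

# Invented rows standing in for an edge list. The Source = woman,
# Target = organization layout is an assumption.
edges_csv = """Source,Target
Mrs. A,org-one
Mrs. B,org-one
Mrs. B,org-two
"""


def build_nodes(edge_rows):
    """Return one row per node with Id, Label and an 'organization' flag."""
    nodes = {}
    for row in edge_rows:
        nodes.setdefault(row["Source"], False)  # treated as a woman
        nodes[row["Target"]] = True             # treated as an organization
    return [
        {"Id": node_id, "Label": node_id, "organization": str(is_org).lower()}
        for node_id, is_org in sorted(nodes.items())
    ]


rows = list(csv.DictReader(io.StringIO(edges_csv)))
nodes = build_nodes(rows)

# Write the nodes table as CSV text (swap io.StringIO for a real file to save it).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["Id", "Label", "organization"])
writer.writeheader()
writer.writerows(nodes)
print(out.getvalue())
```

Importing a table like this as a NODES table would stand in for steps 5 to 8, since Label and the ‘organization’ column arrive already filled in.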

Now, we want to explore how women are connected to other women via shared membership.

Setting up the transformation.

Make sure you have the Multimode networks projection plugin installed.

On the multimode networks projection tab,
1. click load attributes.
2. in ‘attribute type’, select organization
4. in left matrix, select ‘false – true’ (or ‘null – true’)
5. in right matrix, select ‘true – false’. (or ‘true – null’)
(do you see why this is the case? what would selecting the inverse accomplish?)

6. select ‘remove edges’ and ‘remove nodes’.

7. Once you hit ‘run’, organizations will be removed from your bipartite network, leaving you with a single-mode network. Hit ‘run’.

8. save as ‘women to women network.csv’

…you can reload your ‘women-organizations-2-mode.gephi’ file and re-run the multimode networks projection so that you are left with an organization to organization network.

! if your data table is blank, your filter might still be active. make sure the filter box is clear. You should be left with a list of women.
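To see what the projection is doing, here is a small sketch of the same idea in plain Python: two women are linked once for every organization they share. It assumes the plugin does something equivalent to multiplying the women-by-organizations matrix by its transpose; the memberships and the function name `project_onto_women` are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def project_onto_women(memberships):
    """Link two women once for every organization they both belong to."""
    members = defaultdict(set)
    for woman, org in memberships:
        members[org].add(woman)

    weights = defaultdict(int)
    for org_members in members.values():
        for a, b in combinations(sorted(org_members), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Invented memberships, for illustration only.
memberships = [
    ("Mrs. A", "org-one"),
    ("Mrs. B", "org-one"),
    ("Mrs. B", "org-two"),
    ("Mrs. C", "org-one"),
    ("Mrs. C", "org-two"),
]

women_net = project_onto_women(memberships)
for (a, b), shared in sorted(women_net.items()):
    print(a, "--", b, "shared organizations:", shared)
```

Swapping the two elements of each pair before calling the function gives the inverse projection, organizations linked by shared members.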

9. You can add the ‘women-years.csv’ table to your gephi file, to add the number of organizations the woman was active in, by year, as an attribute. You can then begin to filter your graph’s attributes…

10. Let’s filter by the year 1902. Under filters, select ‘attributes – equal’ and then drag ‘1902’ to the queries box.
11. in ‘pattern’ enter [0-9] and tick the ‘use regex’ box.
12. click ok, click ‘filter’.

You should now have a network with 188 nodes and 8728 edges, showing the women who were active in 1902.

Let’s learn something about this network. On statistics,
13. Run ‘avg. path length’ by clicking on ‘run’
14. In the pop up that opens, select ‘undirected’ (as we know nothing about directionality in this network).
15. click ok.

16. run ‘modularity’ to look for subgroups. make sure ‘randomize’ and ‘use weights’ are selected. Leave ‘resolution’ at 1.0
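Why does ‘undirected’ matter? A rough sketch of an average path length calculation shows it: every edge is stored in both directions before any distances are measured, so which column a name sat in no longer matters. This is an illustration with an invented four-node chain, not Gephi's implementation, and averaging over reachable pairs only is my assumption.

```python
from collections import defaultdict, deque


def average_path_length(edges):
    """Mean shortest-path length over all reachable ordered pairs, undirected."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)  # store both directions: the edge is undirected

    total = 0
    pairs = 0
    for source in graph:
        # Breadth-first search gives shortest distances from this source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# An invented chain A - B - C - D.
chain = [("A", "B"), ("B", "C"), ("C", "D")]
avg = average_path_length(chain)
print(round(avg, 3))
```

Reversing every pair in `chain` gives the same answer, which is the point of telling Gephi the network is undirected.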

Let’s visualize what we’ve just learned.

17. On the ‘partition’ tab, over on the left hand side of the ‘overview’ screen, click on nodes, then click the green arrows beside ‘choose a partition parameter’.
18. Click on ‘choose a partition parameter’. Scroll down to modularity class. The different groups will be listed, with their colours and their % composition of the network.
19. Hit ‘apply’ to recolour your network graph.

20. Let’s resize the nodes to show off betweenness centrality (to figure out which woman was in the greatest position to influence flows of information in this network). Click ‘ranking’.
21. Click ‘nodes’.
22. Click the down arrow on ‘choose a rank parameter’. Select ‘betweenness centrality’.
23. Click the red diamond. This will resize the nodes according to their ‘betweenness centrality’.
24. Click ‘apply’.
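If you are wondering what the number behind the node sizes means, betweenness centrality counts how often a node sits on the shortest paths between other pairs of nodes. Below is a compact sketch of Brandes' algorithm for an unweighted, undirected network. It is an illustration on an invented star-shaped graph, not Gephi's code, and the values are left unnormalized.

```python
from collections import defaultdict, deque


def betweenness(edges):
    """Unnormalized betweenness centrality for an unweighted, undirected graph."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)

    score = dict.fromkeys(graph, 0.0)
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in graph}
        sigma = dict.fromkeys(graph, 0)
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)

        # Walk back from the farthest nodes, accumulating dependencies.
        delta = dict.fromkeys(graph, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]

    # Each undirected pair was counted once from each end.
    return {v: c / 2 for v, c in score.items()}


# An invented star: one hub joined to three otherwise unconnected nodes.
star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
centrality = betweenness(star)
print(centrality)
```

In the star, every path between two outer nodes runs through the hub, so the hub scores one for each of the three pairs and the outer nodes score nothing.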

Now, down at the bottom of the middle panel, you can click the large black ‘T’ to display labels. Do so. Click the black letter ‘A’ and select ‘node size’.

Mrs. Mary Elliot-Murray-Kynynmound and Mrs. John Henry Wilson should now dominate your network. Who were they? What organizations were they members of? Who were they connected to? To the archives!

Congratulations! You’ve imported historical network data into Gephi, manipulated it, and run some analyses. Play with the settings on ‘preview’ in order to share your visualization as svg, pdf, or png.

Now go back to your original gephi file, and recast it as organizations to organizations via shared members, to figure out which organizations were key in early 20th century Ontario…

Patterns in Roman Inscriptions

Update August 22 I’ve now analyzed all 1385 inscriptions. I’ve put an interactive browser of the visualized topic model at

See how nicely the Latin clusters?

I’ve played with topic modeling inscriptions before. I’ve now got a very effective script in R that runs the topic model and produces various kinds of output (I’ll be sharing the script once the relevant bit from our book project goes live). For instance, I’ve grabbed 220 inscriptions from Miko Flohr’s database of inscriptions regarding various occupations in the Roman world (there are many more; like everything else I do, this is a work in progress).

Above is the dendrogram of the resulting topics. Remember, those aren’t phrases, and I’ve made no accounting for case endings. (Now, it’s worth pointing out that I didn’t include any of the meta data for these inscriptions; just the text of the inscription itself, with the diacritical marks removed.) Nevertheless, you get a sense of both the structure and content of the inscriptions, reading from left to right, top to bottom.

We can also look at which inscriptions group together based on the similarity matrix of their topics, and graph the result.


Inscriptions, linked based on similarity of the language of the inscription, via topics. If the image appears wonky, just click through.

So let’s look at these groups in a bit more depth. I can take the graph exported by R and import it into Gephi (or another package) to do some exploratory statistical analysis.
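For readers without the R script to hand, the general idea of linking documents by the similarity of their topics can be sketched in a few lines. This version uses cosine similarity and a cut-off of 0.8; both are my choices for illustration and not necessarily what the R script does. The topic proportions are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length lists of topic weights."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_edges(doc_topics, threshold=0.8):
    """Return (doc, doc, similarity) for every pair at or above the threshold."""
    ids = sorted(doc_topics)
    edges = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            sim = cosine(doc_topics[a], doc_topics[b])
            if sim >= threshold:
                edges.append((a, b, round(sim, 3)))
    return edges


# Invented topic proportions for three documents.
doc_topics = {
    "inscription-1": [0.9, 0.1, 0.0],
    "inscription-2": [0.8, 0.2, 0.0],
    "inscription-3": [0.0, 0.1, 0.9],
}

sim_edges = similarity_edges(doc_topics)
print("Source,Target,Weight")
for a, b, sim in sim_edges:
    print(f"{a},{b},{sim}")
```

Raising or lowering the threshold changes how many links survive, and so how fragmented the resulting graph looks.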

I’ve often put a lot of stock in ‘betweenness centrality’, reckoning that if a document is highly between in a network representation of the patterns of similarity of topics, then that document is representative of the kinds of discourses that run through it. What do we get, then?

We get this (here’s the page in the database):

aurifices Roma CIL 6, 9207 Inscription Occupation
M(arcus) Caedicius Iucundus / aurifex de / sacra via vix(it) a(nnos) XXX // Clodia …

But there are a lot of subgroupings in this graph. Something like ‘closeness’ might indicate more locally important inscriptions. In this case, the two with the highest ‘closeness’ measures are

aurifices Roma CIL 6, 9203 Inscription Occupation
Protogeni / aurfici / vix(it) an(nos) LXXX / et Claudiae / Pyrallidi con(iugi) …


aurifices Roma CIL 6, 3950 Inscription Occupation
Lucifer v(ixit) a(nnum) I et d(ies) XLV / Hesper v(ixit) a(nnos) II / Callistus …

If we look for subgroupings based on the patterning of connections, the biggest subgroup has 22 inscriptions:
Dis Manibus Felix publicus Brundisinorum servus aquarius vixit…
Dis Manibus Laetus publicus populi Romani 3 aquarius aquae An{n}ionis…
Dis Manibus sacrum Euporo servo vilico Caesaris aquario fecit Vestoria Olympias…
Nymphis Sanctis sacrum Epictetus aquarius Augusti nostri
Dis Manibus Agathemero Augusti liberto fecerunt Asia coniugi suo bene…
Agatho Aquarius Caesaris sibi et Anniae Myrine et suis ex parte parietis mediani…
Dis Manibus Sacrum Doiae Palladi coniugi dignissimae Caius Octavius…
Dis Manibus Tito Aelio Martiali architecto equitum singularium …
Dis Manibus Aureliae Fortunatae feminae incomparabili et de se bene merenti..
Dis Manibus Auliae Laodices filiae dulcissimae Rusticus Augusti libertus…
Dis Manibus Tychico Imperatoris Domitiani servo architecto Crispinilliano.
Dis Manibus Caio Iulio 3 architecto equitum singularium…
Dis Manibus Marco Claudio Tryphoni Augustali dupliciario negotiatori…
Dis Manibus Bromius argentarius
Faustus 3ae argentari
Dis Manibus sacrum Tiberius Claudius Hymeneus aurarius argentarius…
Dis Manibus Silio Victori filio et Naebiae Amoebae coniugi et Siliae…
Dis Manibus 3C3 argentari Allia coniugi? bene merenti fecit…
Dis Manibus Marco Ulpio Augusti liberto Martiali coactori argentario…
Suavis 3 aurarius
Dis Manibus sacrum Tiberius Claudius Hymeneus aurarius argentarius…
Dis Manibus Tito Aurelio Aniceto Augusti liberto aurifici Aurelia…

What ties these together? Well, ‘dis manibus’ is good, but it’s pretty common. The occupations in this group are all argentarii, architecti, or aquarii. So that’s a bit tighter. Many of these folks are mentioned in conjunction with their spouses.

In the next largest group, we get what must be a family (or familia, extended slave family) grouping:
Caius Flaminius Cai libertus Atticus argentarius Reatinus
Caius Octavius Parthenio Cai Octavi Chresti libertus argentarius
Musaeus argentarius
Caius Caicius Cai libertus Heracla argentarius de foro Esquilino sibi…
Caius Iunius Cai libertus Salvius Caius Iunius Cai libertus Aprodisi…
Caius Vedennius Cai filius Quirina Moderatus Antio militavit in legione…
Aurifex brattarius
Caius Acilius Luci filius Trebonia natus architectus
Caius Postumius Pollio architectus
Caius Camonius Cai libertus Gratus faber anularius
Caius Antistius Isochrysus architectus
Elegans architectus
Caius Cuppienus Cai filius Pollia Terminalis praefectus cohortis…
Cresces architectus
Cresces architectus
Caius Vedennius Cai filius Quirina Moderatus Antio militavit in legione…
Pompeia Memphis fecit sibi et Cnaeo Pompeio Iucundo coniugi suo aurifici…
Caius Papius Cai libertus Salvius Caius Papius Cai libertus Apelles…
Caius Flaminius Cai libertus Atticus argentarius Reatinus

The outliers here are graffiti, or must be being picked up by the algorithm due to the formation of the words; the inclusion of Pompeia in here is interesting, and must be due to the overall structure of that inscription. Perhaps a stretch too far to wonder why these would be similar…?

This small experiment demonstrates I think the potential of topic modeling for digging out patterns in archaeological/epigraphic materials. In due time I will do Flohr’s entire database. Here are my files to play with yourself.

Giant component at the centre of these 220 inscriptions.

Topic Modeling #dh2013 with Paper Machines

I discovered the pdf with all of the abstracts from #dh2013 on a memory-stick-cum-swag this AM. What can I do with these? I know! I’ll topic model them using Paper Machines for Zotero.

Iteration 1.
1. Drop the pdf into a zotero collection.
2. Create a parent item from it.
3. Add a date (July 2013) to the date field on the parent item.
4. Right click on the collection, extract text for paper machines.
5. Right click on the collection, topic model –> by date.
6. Result: blank screen.

Right-click the collection, ‘reset papermachines output’.

Iteration 2.
1. Split the pdfs for the abstracts themselves into separate pages. (pg 9 – 546).
2. Drop the pdfs into a zotero collection.
3. Create parent items for it. (Firefox hangs badly at this stage. And keeps redirecting through for reasons I don’t know why).
4. Add dates to the date field; grab these by hand from the dh schedule page. God, there’s gotta be an easier way of doing this. Actually, I’ll just skip this for now and hope that the sequential page numbers/multiple documents will suffice.
5. Right click on the collection, extract text for paper machines.
6. Right click on the collection, topic model –> by date.
7. Result: IndexError: index out of range: -1.

Right-click the collection, ‘reset papermachines output’.

Iteration 3.
Jump directly to #4, add dates to date field. In the interests of getting something done this morning, I will give them all the same date – a range from July 16 – July 19. If I gave them all their correct dates, you’d get a much more granular view. But I’m adding these by hand. (Though there probably exists some sort of batch edit for Zotero fields? Hang on, I right click on ‘change fields for items’ type ‘date’ for field, put in my range, hey presto! Thanks, Zotero)
5. Right click on the collection, extract text for paper machines.
6. Right click on the collection, topic model –> by date.
7. Result:


Chased down the folder where all of this was being stored. Ahha. Each extracted text file is blank. Nice.

Blow this for a lark. Sometimes, folks, the secret is to go away, and come back later.

Update: I tweeted:

And then walked away for a while. Came back, and went to the TEI file. I used Notepad++ to strip everything else out but the abstracts. I saved it as a csv. Then, in Excel, I used a custom script I found lying about on teh webs to turn each line into its own txt file. Then I copied the directory into Zotero. I gave each txt file its own parent. I mass edited those items so that they all carried the date July 16 – 19 2013. Then I extracted texts (which seems redundant, but you can’t jump ahead).
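I didn't keep the Excel script, but the line-to-file step is easy enough to sketch in Python instead. This is a stand-in for what I did, not the same script: it writes each non-empty line to its own numbered txt file. The file prefix and the sample lines are invented, and it writes to a temporary folder so that nothing is left behind.

```python
import tempfile
from pathlib import Path


def split_lines_to_files(lines, out_dir, prefix="abstract"):
    """Write each non-empty line to its own numbered .txt file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    kept = [line.strip() for line in lines if line.strip()]
    width = len(str(len(kept)))  # zero-pad so the files sort in order
    written = []
    for number, text in enumerate(kept, start=1):
        path = out_dir / f"{prefix}-{number:0{width}d}.txt"
        path.write_text(text + "\n", encoding="utf-8")
        written.append(path)
    return written


# Invented lines standing in for the abstracts, one per line.
sample = ["First abstract text.", "", "Second abstract text."]

with tempfile.TemporaryDirectory() as tmp:
    files = split_lines_to_files(sample, tmp)
    names = [p.name for p in files]
    contents = [p.read_text(encoding="utf-8") for p in files]

print(names)
```

With a real file, you would pass the lines of the csv and a folder of your choosing in place of the sample list and the temporary directory.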

And then I selected topic modeling by time.

Which at least created a topic model, but it didn’t make the stream graph. The heat map worked, but all it showed was the US, UK, and Germany. And Florida, for reasons unexplained.

So I went back to Gephi for my topic model visualization. I used Ben Marwick’s Mallet-in-R script to do the modeling and to transform the output so I could easily visualize the correlations. Behold, I give you the network of strongly correlated #dh2013 abstracts by virtue of their shared topics:


It’s coloured by modularity and sized by betweenness, which gives us groups of abstracts and the identification of the abstract whose topics/discourse/text do all of the heavy lifting. A brief glance at the titles suggests that these papers are all concerned with issues of data management of text.

I’ve put all of this data up on Figshare, and will provide some further reflections. Currently, this machine is hanging up on me frequently, and I want to get this out before it crashes. Here are the topics; you can add labels if you’d like, but the top three seem to be ‘publishing & scholarly communication’; ‘visualization’; ‘teaching’:

Correlated topics at #dh2013

0.35142 humanities digital social scholarly http research history accessed work community scholarship www access dh journal publication citation communication publishing
0.28061 literary reading analysis visualization text texts digital literature century studies media topic humanities corpus mining modeling press textual paper
0.21684 digital humanities students university teaching research dh participants workshop projects education pedagogy program tools academic arts graduate project resources
0.18993 digital collections research collection content researchers users access library user resources image images libraries archives metadata cultural information tools
0.14539 tei text document documents encoding markup xml texts index london indexing http uk html encoded links search version modern
0.11833 data historical map time gis information temporal maps university spatial geographic locations texts geographical place names mapping date dates
0.11792 crowdsourcing digital project public states united archaeological america archaeology projects poster university virginia web community social civil media users
0.11289 systems model modeling system narrative media theory elements classification type features user markup ic gesture expression representation press character
0.09601 editions edition text scholarly digital women editing collation print textual texts tools http image manuscript electronic editorial versions environment
0.08569 authorship author words texts corpus attribution characters frequency plays fig classification results number novels genre authors analysis character delta
0.08016 semantic annotation web linked open ontology data rdf scholarly http ontologies research annotations information review project metadata knowledge org
0.07777 social network networks analysis graph relationships characters group graphs jazz science family de interaction publication relationship nodes discussion cultural
0.06328 language corpus text txm http german de web lexicon platform corpora tools analysis unicode research annotation encoding languages lexus
0.05286 digital knowledge community fabrication migration book open feminist learning field knitting desktop world practices cultural experience work lab academic
0.04856 text analysis programming voyant tools ca poster interface alberta live rank sinclair http latent environments ualberta touch screen environment
0.04131 words poetry word text poem texts poetic ford english author segments conrad analysis language poems zeta newton mining chapters
0.0364 simulation information time content model vsim environment narrative abm distribution feature light embedded study narratives virtual japan plot resources
0.03538 query search google alloy xml language words typesetting algorithm de detection cf engine speech mql algorithms body searches paris
0.01131 de la el homer movement uncertainty en se clock catalogue del astronomical una movements para los dance las imprecision

A quick run with Serendip-o-matic

I just ran my announcement of our book through the #owot Serendip-o-matic serendipity engine.

It took the text of my post, and extracted these key words:

book, digital, writing, online, process, project, students, things, us, wanted, going, historian, nervous, one, programming.

I wondered if the selected keywords changed each time, if there was a bit of fuzziness to the extraction routine.  The image results this second time looked different than the first (more digitally than booky the second time, more bookish the first time than digital), but the results from the ‘save’ button were the same:

So, for pass one:

  1. Writing 2.0: Using Google Docs as a Collaborative Writing Tool in the Elementary Classroom: From DPLA.
  2. Effectiveness of an Improvement Writing Program According to Students’ Reflexivity Levels: From Europeana.
  3. Students in the incubation room at the Woodbine Agricultural School, New Jersey: From Flickr Commons.
  4. Impossible things [book review]: From DPLA.
  5. Let The Feeling Flow: From Europeana.
  6. Student reading to two little girls. Photographed for 1920 home economics catalog by Troy.: From Flickr Commons.

For pass two: – well, lots of different stuff, some overlaps, but a glitch meant that my results didn’t get saved.

Pass three: these words extracted- book, digital, writing, online, process, project, students, things, us, wanted, going, historian, nervous, one, programming. Same words, different order; but there were many different images from passes 1 and 2, while some images stayed the same. The ‘save’ page brought up the list above.  If I was serious about saving, I’d try to push from the results page into Zotero; in any event, after five workdays, this is a hell of a neat piece of work!  For contrast, let me take those keywords that serendipomatic extracted, and run them through google. Three results:

So serendipomatic is the winner, hands down! Putting the keywords* extracted via natural language processing into google really highlights how google works: it exactly points to the post with which we began. And there, ladies and gentlemen, is the reason why Google, for all its power, is not the friend to research that you might have thought. Google is for generating needles; Serendipomatic is for generating haystacks, and it does it well. Well done #owot team!

*putting the whole text generated an error: Error 414 (Request URI too large!) Sorry google, didn’t mean to break you.

Announcing a live-writing project: the Historian’s Macroscope, an approach to big digital history

Robert Hooke’s Microscope

I’ve just signed a book contract today with Imperial College Press; it’s winging its way to London as I type. I’m writing the book with the fantastically talented Ian Milligan and Scott Weingart. (Indeed, I sometimes feel the weakest link – goodbye!).

It seems strangely appropriate, given the twitter/blog furor over the AHA’s statement recommending that graduate students embargo their dissertations online, for fear of harming their eventual monograph-from-dissertation chances. We were approached by ICP to write this book largely on the strength of our blog posts, social media presence, and key articles, many of which come from our respective dissertations. The book will be targeted at senior/advanced undergrads for the most part, as a way of unpeeling the tacit knowledge around the practice of digital history. In essence, we can’t all be part of, or initiate, fantastic multi-investigator projects like ChartEx or Old Bailey Online; in which case, what can the individual achieve in the realm of fairly-big data? Our book will show you.

One could reasonably ask, ‘why a book? why not a website? why not just continue adding to things like the Programming Historian?’. We wanted to write more than tutorials (although we owe an enormous debt to the Programming Historian team whose example and project continue to inspire us). We wanted to make the case for why as much as explore the how, and we wanted to reach a broader audience than the digital technosavvy. In our teaching, we’ve all experienced the pushback from students who are exposed to digital tools & media all the time; a book-length treatment normalizes these kinds of approaches so that students (and lay-people) can say, ‘oh, right, yes, these are the kinds of things that historians do’ – and then they’ll seek out Programming Historian, Stack Overflow, and myriad other sites to develop their nascent skills. Another attraction of doing a book is that we recognize that editors add value to the finished product. Indeed, our commissioning editor sent our first attempt at a proposal out to five single-blind reviewers! This project is all the stronger for it, and I wish to thank those reviewers for their generous reviews.

One thing that we insisted upon from the start was that we were going to live-write the book, openly, via a comment-press installation. I submitted a piece to the Writing History in the Digital Age project a few years ago. That project exposed the entire process of writing an edited volume. The number and quality of responses was fantastic, and we knew we wanted to try for that here. We argued in our proposal that this process would make the book stronger, save us from ourselves, and build a potential readership long before the book ever hit store shelves. We were astonished and pleased that ICP thought it was a great idea! They had no hesitation at all – thank you Alice! We’ve had long discussions about the relationship of the online materials to the eventual finished book, and wording to that effect is in the final contract. Does that mean that the final type-set manuscript will appear on the commentpress online? No, but nor will the book’s materials be embargoed.  None of us, including the Press, have tried this scale of things before. No doubt there will be hiccups along the way, but there’s a lot of goodwill built up and I trust that we will be able to work out any issues that may (will) arise.

We’re going to write this book over the course of one academic year. In all truthfulness, I’m a bit nervous about this, but the rationale is that digital tools and approaches can change rapidly. We want to be as up-to-date as possible, but we also have to be aware in our writing not to date ourselves either. That’s where all of you come in. As we put bits and parts up on The Historian’s Macroscope – Big Digital History, please do read and offer comments. Consider this an open invitation. We’d love to hear from undergraduate students. Some of these pieces I’m going to road test on my ‘HIST2809 Historian’s Craft’ students this autumn and winter. Ian, Scott, and I will be reflecting on the writing process itself (and my students’ experiences) on the blog portion of the live-writing website.

I’m excited, but nervous as hell, about doing this. Nervous, because this is a tall order. Excited, because it seems to me that the real transformative power of the digital humanities is not in the technology, but in a mindset that peels back the layers, to reveal the process underneath, that says it’s ok to tinker with the ways things have been done before.

Won’t you join us?


Prescot Street as Topic Model, or, reading an excavation distantly

I tried a new tack in my quest to data mine archaeological records. Stuart Eve sent me the csv from the Prescot Street excavations, where each record was a unique context. I fed this into the vanilla java gui for MALLET (so no tuning, just the basic settings, looking for 25 topics) to see what – if anything – might result. The output seems very promising. I deliberately did not look up any information on the excavation until after I’d run this analysis. Can reading site records algorithmically tell us anything useful, that we did not otherwise know?

As I often do, I posted my initial reaction to twitter:

How to visualize this? I’m growing cold towards network visualizations of this kind of data, but in this case a two-mode representation might be appropriate, since the topic modeling algorithm is functioning as a kind of unsupervised clustering routine, pulling words out of the records that seem to go together. Here’s a two-mode network of the results, contexts tied to their constituent topics:

Prescot Street as Topic Model.
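A two-mode network of this kind can be put together from a table of topic proportions per context. Here is a minimal sketch, assuming each context comes with a list of weights, one per topic, and that only topics above some cut-off count as ‘constituent’. The 0.2 cut-off, the context names and the weights are all invented for illustration.

```python
def two_mode_edges(composition, min_weight=0.2):
    """Link each context to every topic whose weight meets the cut-off."""
    edges = []
    for context, weights in sorted(composition.items()):
        for topic_id, weight in enumerate(weights):
            if weight >= min_weight:
                edges.append((context, f"topic-{topic_id}", weight))
    return edges


# Invented topic proportions for two contexts and three topics.
composition = {
    "context-101": [0.70, 0.25, 0.05],
    "context-102": [0.10, 0.10, 0.80],
}

bipartite = two_mode_edges(composition)
print("Source,Target,Weight")
for context, topic, weight in bipartite:
    print(f"{context},{topic},{weight}")
```

Contexts only ever link to topics and never to one another, which is what makes the result two-mode.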

It seems promising. In that image, I took the excavators’ names out. But upon reflection, I shouldn’t do that:

I asked Gephi to look for modules (communities; groups; based on similarity of ties) within this two mode network. Below are a series of images that focus on the individual modules. Two items jump out immediately – one, particular excavators are associated with particular word choice, patterning of word usages; two, particular kinds of materials clump together quite nicely.

Do particular excavators ‘see’ particular kinds of info that others don’t? Do they ‘specialize’ in certain kinds of info? As a newbie on the Forum Novum project for BSR many years ago, I was never allowed on any of the ‘interesting’ stuff, being consigned to digging through layers of fill to find the depth of the natural soil level. There’s only so many ways to describe dirt. This kind of thing happens often. You want your most experienced excavators to handle the most delicate/intricate/complicated situations, but… I wonder.

Topic modeling this material, whilst including the names of the excavators attached to each context, seems to shed interesting light on the ways we see things archaeologically. In my other experiments with the PAS database, because of extraneous commas creeping in and shifting the fields, I often ended up with an inconsistent inclusion of the finds officers’ names, so I tended to just exclude them completely. That might be an error. I think we need to know whose voice is most tied to the ‘topics’/’discourses’ that make up our record (after all, once it’s excavated, this is all we have left, right?) This experiment here suggests that perhaps one of the more valuable outcomes of topic modeling archaeological material is the re-introduction of subjectivity into our records, the idea that many voices (modern and ancient) make up the ‘record’ – and we should listen to them.

In due course I’ll put the html up somewhere so that the interested reader can jump through the contexts along the topic – context – topic pathways suggested by the topic modeling. We use Harris matrices (a kind of network) to understand the three dimensional relationships amongst contexts (which imply their chronological ordering); what kinds of insights can deforming our reading of an excavation along the network paths suggested by the topic modeling produce?

Below are the visualizations of the modules.

pits and burials

roman pits, fills, structures

cellars and latrines

graves and cemeteries

roman fill

modern ditches

And the topics with their top words:
topicId words..

1 schager elisabet pottery area part remains found bone similar poss fills bones appears burnt located human pieces waste grey activity main animal clear cremations broken cbm fragments truncates domestic skull high underneath mid shells bit edge sort chalk vessels deposits charcoal nw sherds disarticulated lost oyster sterile specific includes thrown

2 pit roman ii po ossuary irregular large latest including probable mixed pictured truncating inside planned sealed appears cut continuation surviving soakaways remained intercutting step pitting results topped width relates infilling partial include moved northwards steven ashley contexts adult perpendicular offset remain aesthetically loaced disturb sprial mentioned compass fed skeletons connections

3 fill floor basement rubble concrete slab fl evidence bedding ce larger glass abutting represent demolition room darker suggesting repair boundaries situe remaining unclear feature continues samian cessy eval packed facade john photo subrectangular reused actual ws lay inclusion noted lie teh constrcution looked crees brick lots archaeology flexed state

4 soakaway late water sump collection su pm brick soak masonry structure horn core back lined bricks lining drainage masonary materials face smell fit red held system courses time functioned sloping putrid cores aid headers lain knocked pipes mottled lies bands buried rotten real lying tirtiary simple earthernware exterior acrivity respective

5 pm pooley ashley late backfill century brick lucas tom cellar made garden line deliberate material walls cistern places sitting leveling thc proximity shallow backfilling based lerza rivets lifting limestone rebuild characteristic general redep suggested potential campion signs putrid map shown phase bits occurance structure element disintegrated ash southwards act crumble

6 truncated linear modern clark william heavily south west east truncation due north shape rectangular foundation cist machine stone cutting running aligned relationship pre composition tiles ne note observed worked sides deeper manhole intrusion define identical machining unknown depression tile mod axis bagged tegula limit channel erosional forming sample loe uneven

7 cut construction structural back slightly ring recut ephemeral completely realised doesn left partly heading heavy fragment contents analogous suggests comprises properties limestone short wells thc intervening association reflecting pictures clarify count sotnes terminus browny vertically bar unarticulated highest repdeposited things redeposit crmated tank approx ifthe lessnes forming explaining inclination plan

8 fill top base finds contained clancy sara clay organic context section date level horncore excavated original sheet shallow sketch nature suggest silting pipe suggests depth sampled trap lined dumped fully put reverse cemetery hearth beam deliberately frequent removal orientation orange paper backfilled horncores lain discussion sealed cultural appeared thick tenon

9 pit roman cut howell paula oval tip big ground difficult exact probable vertical pocket reflect shows pretty phase work means times duffy region alignment man matches nail wasn sequence build silting clinker fl brittle abuts tentar db sewer quadrant disarticulation implying characteristics revised constuction bottomed pressed unimpressive smc extending

10 gary webster filled fill possibly surface metalling gravelly related underlying unclear laid em difficult compact cobbling overlying represent dark modern mix undetermined series metalled place yellow gaps se stratigraphically extend dumps intentionally missing size charnal foundations spilled lack unsure things areas barrell blue metal yard variety respects ploughsoil anphora

11 pm post early fill med large medieval cess lot contemporary light pc inclusions latrine observed single mortar collapse character leather recovered ceramic suggests extent glacial lense hand event green interpretation resting case demo roughly curved household apparently assist inflow setting render cores varying determining belongs tenuously derivation mixture unlike consistant

12 refuse pr greg crees pit rubbish kind determine paula representing previously rounded bs discovered full gradual probing based enclosing struck housing similarities fronted coursings characterisation excavate sharp valley abse meeting people compare chronologically indication hypocaust blurring subjected distinctive amost grain remaining forms patchy interred colours including similar time midden

13 fill lower natural secondary sand greyish shaped statter claire surveyed black yorkstone proper ashley loose return horizontally mmx rest slots tbm patches largest acidic order distinct interface terrace drainage seperated ark rubberley hit spiralled rebuild destruction coming eastwards sharply hold candidate smells distorted field air powdered stains overview vacant dated

14 roman carreton adrian upper pit colour excavation makes funerary alignment southern preservation fireplace cover collapsed extending scattered adhering pinkish comprised ns nw bag smelly find soakaway whoel gs meant belonged disused regard ditches meters quarries huge making corresponds ritual existing cemented dimensions starts dimension marked paired excavtion staining shipton

15 wall fergal donoghue late pm wa sill building georgian st internal tenter butting wooden victorian present buttress support extension long prescot barrel house rear street dividing facing immediately platform front rising moisture prevent slate thinks medium slabs beneath seemingly fl counstruction plot wider lienar knees erosional lies cu trample photographed

16 roman fill matt ceri shipton law nails williams earlier black form find situ obvious fe uncertain complete amounts objects culvert smae skulls notably addition wood stoney domed truncations rectilinear pyres quality moderate working bonding earliest dark gis ark failry compost peeled functional rows ended properly remnants buildings accounted variation

17 cremation burial cr urn disturbed pot plan tile dobosz ukasz votive diffuse recorded dug built sample cm represents bone chest cremated position surrounding box analysis nb lifted coin regular offering vessel concentration occasional deposits suggesting block intact urned sw notable lid samples deep stones western broad higher plate cms relate

18 make levelling mu layer gravel dump material brickearth redeposited ed sandy deposited earth dumped spread dirty ground silty slumped capping clayey charcoal derived quarrying stoney extraction layers thin square sorted period exposed occupation sands soft parts provide lines stuff didn true partly significantly basal white tom mixey cluster test central

19 void external soil deposit hole cultivation posthole ec soils sp fra features lerza site fairly agricultural number brown debris dep evaluation reworked dumping result dates horticultural environmental run unurned plough residue deposition manuring representing upright storage exit family connected cleaning difference squared linked geophoto amorphous gravely concentrations poo defined

20 pit david unspecified ross edge roman brenna lowest shallow expect final basal presume dimensions marcus pebbles angular appeared covering diffuse processing stage stuart lens stored missed thickness const irregularities souther button funcation limits uncear oblong wider poshole suggested works fil metaling patella jaws grounds greater major purposes elisabet derive pegtile

21 cut small ruth rolfe side cuts end pits grid sq eastern circular originally western piece hard mm fact partially edges removed orangey wood half northern thought directly separate nearby degraded marcus initial urns period solid straight slope inwash graves limit rough wide occured occasionally centre good concave leading survives undertermined

22 drain ditch gully feature possibly shallow trench bottom boundary hassett visible southern runs burials sides postmed aspects slot cemetary presence point robber quarrying footing essentially direction formed doesn land homogenous number indicating section constructed thc terraced gulley parallel holes assoc overflow longbone debitage arising pressure fragment mark glazed wash sealing

23 pit quarry pq hassan anies primary silt dark prob middle zone skull filling mausoleum planned machined edges tiled tanked evident reused stain northeast ts corner sit redepoisted doubt terminate pillow overleaf shale fits standard means dateable existant redundant easts dropping quarried usage gc report truncate trampling compositions marcus bag

24 morse chaz deposit mixed roman dumped brown gravels rich function silty forms lenses subsided narrow assume robbed rest past discernable pitcut con sitly barren bucket cesspit shot beneath late unfrogged sister occupancy flure terminates consister retrieved resolved parallel joining ideas give millefiore burrial cd assumption regularity uppermost imbrex deposite

25 grave roman skeleton cut sk moskal tomasz coffin dug inhumation preserved poorly erroded goods body left legs head articulated skeletal nos poor events juvenile severly feet condition fragmentary holding bed ends stain strongly info spaced cu deposited shaped assigned disturbance cleaned chalk disatriculated femur hands soakawy showing overhangs hom cen

In which I topic model the entire PAS database by individual rows

Previously, I was trying to consider the geography of Roman Britain as a corpus of documents – individual geographic (modern) areas – where the records in the Portable Antiquities Scheme database formed the words of the document.

Today, I inverted that process. I treated each individual row in the entire PAS database as an individual document, with the data within that record its words. It took about two hours of processing time, looking for 100 topics. I now have a series of outputs that neither Excel nor Notepad++ can open, as they are too big. I’ll have to break the files up before I can dig too much deeper into them. However, what I can examine seems promising – topics that seem to indicate various regions; topics that indicate particular finds officers; topics that indicate particular kinds of artefacts; topics that indicate the status of the object (whether it was returned to the finder). Here’s a sampling:

Topic Weight Words
94 0.01654 mm thick wide weighs long diameter measures grams length width weight thickness high weighing fragment edge section maximum measuring
3 0.01442 suffolk east metal detector minter faye finder returned alloy plouviez judith copper mid st jane carr coastal edmundsbury geake
22 0.01409 green patina surface dark colour mid brown alloy copper corrosion light worn grey slightly condition corroded object pitted original
45 0.01374 mm weight width thickness length diameter atherton rachel maximum thick derbyshire dimensions wt height fragment midlands max including complete
1 0.01165 yorkshire humber riding metal detector east finder north returned alloy copper holmes simon paynton ceinwen hambleton selby david illegible
56 0.01076 east lincolnshire adam daubney midlands detector metal alloy lindsey copper finder returned kesteven north west elwes marina nottinghamshire rushcliffe
49 0.01028 lines decorated incised side decoration line central edge raised ring centre border grooves dot cross rectangular end upper punched
69 0.01003 ae nummus constantine house gloria exercitvs bust soldiers standards copper standard victory prow ii left illegible constantinopolis helmeted ad
27 0.00985 frame buckle pin bar alloy copper medieval loop edge oval outer missing cast strap double narrowed shaped looped section
92 0.00976 sherd pottery rim fabric sherds ware vessel grey chance find body medieval ceramic roman detecting inclusions colour surface orange

[Images: topic 3, topic 1, topic 94, topic 22, topic 45]
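As for the outputs that neither Excel nor Notepad++ can open, one way of breaking them up is sketched below. This is a minimal sketch, not what I actually ran: the function name `split_file` and the chunk size are my own hypothetical choices, and it assumes the output is plain text with one record per line and (optionally) a single header line worth repeating in every piece.

```python
from pathlib import Path


def split_file(path, lines_per_chunk=50_000, keep_header=True):
    """Split a large line-oriented text file into numbered chunks.

    Hypothetical helper: writes <stem>_part001<suffix>, <stem>_part002<suffix>, ...
    next to the original and returns the list of chunk paths.
    """
    path = Path(path)
    chunks = []
    out = None
    header = None
    with path.open(encoding="utf-8", errors="replace") as src:
        for i, line in enumerate(src):
            if i == 0 and keep_header:
                header = line  # remember the header, repeat it in each chunk
                continue
            body_index = i - 1 if keep_header else i
            if body_index % lines_per_chunk == 0:
                # Time to start a new chunk file.
                if out:
                    out.close()
                name = f"{path.stem}_part{len(chunks) + 1:03d}{path.suffix}"
                chunk_path = path.with_name(name)
                out = chunk_path.open("w", encoding="utf-8")
                if header is not None:
                    out.write(header)
                chunks.append(chunk_path)
            out.write(line)
    if out:
        out.close()
    return chunks
```

Each resulting piece should then be small enough to open in a spreadsheet or text editor; the right chunk size depends on the width of the rows, so the default here is only a guess.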

Topics as Word Clouds

Elijah Meeks and Matt Jockers both have used word clouds to visualize topics from topic models. Colour, orientation, relative placement of the words – all of these could be used to convey different dimensions of the data. Below, you’ll find clouds for each of my initial 50 topics generated from the Roman materials in the Portable Antiquities Scheme database (some 100 000 rows, or nearly 1/5 the database, collected together into ‘documents’ where each unitary district authority is the ‘document’ and the text is the descriptions of things found there). The word clouds are generated from the word weights file that MALLET can output. There are 8100 unique tokens when I convert the database into a MALLET file; each one of those is present in each ‘bag of words’ or topic that MALLET generates, but to differing degrees. Thus, word clouds (here generated with Wordle) pull out important information that the word keys document does not. However, given that I optimized the interval whilst generating the topic models, the keys document provides an indication of the strength of the topic in the corpus. I’ve arranged the word clouds scaling them against the size of the strongest topic (topic 22), top-bottom, left-right. I’ll be damned if I can get WordPress to just display each image under the other one. Even stripped my table out, it did!
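To give a sense of how the word weights file can feed a word cloud, here is a rough sketch that pulls the heaviest words for each topic. It assumes the file has one tab-separated `topic`, `word`, `weight` triple per line, which is how I understand MALLET's word weights output to be laid out; the function name `top_words_by_topic` is hypothetical, and the clouds above were made with Wordle, not with this code.

```python
import heapq
from collections import defaultdict


def top_words_by_topic(lines, n=25):
    """Return {topic: [(word, weight), ...]} with the n heaviest words per topic.

    `lines` is any iterable of strings, e.g. an open word weights file.
    Assumed layout (an assumption, check your own file): topic<TAB>word<TAB>weight
    """
    weights = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        topic, word, weight = fields
        weights[int(topic)].append((float(weight), word))
    return {
        topic: [(word, weight) for weight, word in heapq.nlargest(n, pairs)]
        for topic, pairs in weights.items()
    }
```

Because every token appears in every topic with some weight, cutting each topic down to its top few dozen words is what makes a cloud readable; the weights can then be used to scale the words.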

At any rate, as one churns through the 50 topics, after about the first 11 (depicted below), the topics get progressively more noisy as MALLET attempts to deal with incomplete transcriptions of the epigraphy of the coins, and the frequent notes about the source for the identification of the coins (the work of Guest & Wells). The final topic depicted here, topic 20, directly references a note often left in the database concerning the quality of an individual record; these frequently are in connection with materials that entered the British Museum collection before the Portable Antiquities Scheme got going and hence the information is not up to usual standards.

This exercise then suggests to me that 50 topics is just too many. I’m rerunning everything with 10 topics this time.

Topic 22

Topic 48

Topic 43

Topic 32

Topic 7

Topic 33

Topic 13

Topic 47

Topic 46

Topic 35

Topic 20

Where Roman Roads and Topic Models Intersect

Previously, I ended up with a map of UK districts, coloured by the five groups that Gephi’s modularity routine suggested were present, in the network of districts to districts based on shared patterns in the underlying topics (the topic model generated from the total dump of the Portable Antiquities Scheme database).

I asked on twitter if the patterns seemed evocative of anything; Phil Mills suggested that they seemed to match perhaps civitas boundaries. He provided me with an image of those boundaries (thanks Phil!) as well as some kmz files. Below are two images, one with civitas boundaries (hand-drawn in by me) and one with Roman roads. Together, they are evocative. Blocks of colour seem to go very well with civitas boundaries; where blocks of colour overlap those boundaries, they seem to march along the routes of the roads. And all this from looking at topic models! I think it is getting progressively safer to say that the patterns found in an archaeological database through topic modelling are indeed meaningful on the ground. The factors of government, of identity, of mobility, seem to emerge in the topic model.

UK Districts by Modularity, overlain with hand-drawn civitas boundaries

Roman roads overlain on same.

Reading Inscriptions Algorithmically

Inscriptions are complicated beasts. Although they are frequently quite small and incomplete, epigraphers are able to extract an enormous amount of information from them – especially when they have other inscriptions with which to contrast and compare. Let us look at the inscriptions from Aphrodisias, which are published online following EpiDoc conventions. Because of this, we are able to do some data-mining on them with a minimum of pre-processing.

(Joyce Reynolds, Charlotte Roueché, Gabriel Bodard, Inscriptions of Aphrodisias (2007), available <>, ISBN 978-1-897747-19-3.)

The first one looks like this, when the xml tags are stripped away:

Creative Commons licence Attribution 2.5 (
All reuse or distribution of this work must contain somewhere a link back to the URL
Originally published in Reynolds (1982).
English French German Ancient Greek Transliterated Greek Modern Greek Italian Latin Spanish Turkish 2007-07-04cmrDONE2007-04-02Charlotte Tupmanhand tidiedGBhand tidied 2007-03-15Elliott HallBatch converted Word2XML

Description of MonumentUpper right corner of a white marble block (0.36 x 0.24 x 0.34).
Description of TextInscribed on one face.
LettersLate Republican or Augustan; ave 0.02. rho in ll. 1, 3, 6 has a very small stroke slanting rightwards from the junction of the bowl with the vertical.
Date Late Republican or Augustan (lettering, content)
Edition οὗτος ὁ τόπος ἱερὸς ἄσυλος ὡς ἔκριναν ὁ μέγας Καῖσαρ ὁ δικτάτωρ καὶ ὁ υἱὸς αὐτοῦ αὐτοκράτωρ Καῖσαρ καὶ ἡ σύνκλητος καὶ ὁ δῆμος ὁ Ῥωμαίων καθὼς καὶ τὰ φιλάνθρωπα καὶ δελτογραφήματα καὶ ἐπικρίματα περιέχει ἀνέστησεν δὲ τοὺς ὅρους Γάϊος Ἰούλιος Ζωΐλος ὁ ἱερεὺς τῆς Ἀφροδείτης
Apparatus For the supplements, compare the partner inscription 1.38.
Translation [?This area is] the sacred asylum [?as defined by] the great [?Caesar, the] Dictator, and [?his son] Imperator [Caesar and the ] Senate [and People] of Rome, [as is also contained in the] grants of privilege, the public documents [and decrees. C. Iulius Zoilos priest of Aphrodite set up the boundary stones.]
Commentary See , 159-160.
Locations Stray find. Temple/Church temenos. Museum (1977)
Text Constituted From Transcription (Reynolds)
History of Recording Recorded by the NYU expedition in 1963 (63.596)
Bibliography Published by Reynolds, , doc. 35, whence SEG 1982.1097, BE 1983.388, 1984.878, McCabe 379, R. R. R. Smith, (Mainz, 1993) T5.


Face (1977)

There’s a lot of meta information that goes along with a single inscription above and beyond its transcription and translation, all of which is necessary to understand the possible significance. I don’t think there’s a better illustration of what ‘close reading’ might mean in archaeology than the epigrapher’s art.

What might we spot if we look at a corpus of inscriptions from a macro level? What patterns might exist? Is there something going on related to geography? Researcher? Language of the inscription? Publication history? Dating? This is where the algorithms of topic modeling might be useful. My go-to tool for this is MALLET. MALLET allows one to strip out all of the xml tags (see MALLET’s help file from the command line for import-dir), so I can download the xml files as zip from the Inscriptions of Aphrodisias site, and begin exploring for patterns. I optimize the interval too when I train the topic model, to shake out the utility of the resulting ‘topics’. I began by modeling 50 topics.
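If you would rather see what tag stripping does before handing the files to MALLET, the same idea can be sketched in a few lines of Python. This is an illustration only, not how I produced the text above (I let MALLET do it at import): the function name `xml_to_text` is hypothetical, and it assumes each file is well-formed XML that the standard library parser can read without extra entity definitions.

```python
import re
import xml.etree.ElementTree as ET


def xml_to_text(xml_source):
    """Return the text content of an XML document with all tags removed.

    `xml_source` is the document as a str or bytes. Text from neighbouring
    elements is joined with a space and runs of whitespace are collapsed.
    """
    root = ET.fromstring(xml_source)
    text = " ".join(root.itertext())
    return re.sub(r"\s+", " ", text).strip()
```

Joining with a space is a design choice. It avoids run-together output such as ‘MonumentUpper’, but where the markup wraps only part of a word (restored letters, for instance) it will split that word in two, which matters if you care about the Greek text itself and not just the metadata.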

You can download the MALLET file and results here, to play with and explore for yourself.

When I look at the results (inscriptionkeys.txt), the ‘strongest’ topics all relate to metadata regarding their online publication (the top 3). The next few clearly relate to the researchers who are behind the inscriptions of Aphrodisias website, so not overly useful for me here. The next couple seem to be a mixture of findspot information and publishing history:

topic6 0.34603 unpublished fragment reynolds face museum version lettering inscription born digital joyce unknown marble expedition white centuries nyu inscribed stray
topic32 0.23776 face upper moulding left side lettering part lower expedition museum white marble nyu broken aphrodisias asia corner inscribed front
topic39 0.15711 south walls east face west block part wall gate expedition findspot city stretch tupman mama lettering depth measurable marble
topic7 0.14493 mama gaudin published reinach mccabe cormack kubitschek squeeze notebook phi expedition records originally aphrodisias reichel publications recorded charlotte representations
topic43 0.13213 mccabe published originally bodard gabriel rouech aphrodisias phi description findspot reported subsequently charlotte unknown preliminary inscription tidied publication funerary

The remaining topics all deal explicitly with the inscriptions themselves, their texts and their findspots (it seems).

topic47 0.06106 son honoured honours people council claudius priest diogenes family tiberius man high cl public gerousia lived virtue life zenon
topic8 0.05807 roman family wife names father aphrodisian case daughter suggests citizenship early century reference possibly menodotos clear named civic late
topic38 0.05137 son zenon adrastos attalos dionysios athenagoras artemidoros apollonios hypsikles aphrodite diogenes daughter early tupman menestheus cf goddess grandson sons

Groupings in Inscriptions of Aphrodisias

Every file is composed of all these different topics, to differing degrees. I would like to visualize the paths of these discourses through the corpus, so I translate the inscriptioncomp.txt file so that I end up with at least 9/10s of each document’s composition (in practice, this means cutting and pasting the inscriptioncomp.txt file so that I end up with a single list with source-document, target-topic, and weight). I also filtered out those strongest topics described above (5,6,7,9,16,17,29,39,43).
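The cutting and pasting can be sketched in code. This is a hedged reconstruction, not the procedure I actually followed: it assumes the composition file uses the MALLET 2.0.7 layout as I understand it (a `#` header line, then tab-separated rows of document number, document name, and alternating topic and proportion pairs), and the names `composition_to_edges` and `write_gephi_csv` are hypothetical. Coverage is counted over all topics before the unwanted ones are dropped, which is one possible reading of the filtering step.

```python
import csv


def composition_to_edges(lines, threshold=0.9, skip_topics=()):
    """Turn document-topic rows into (document, topic, weight) edges.

    For each document, topics are taken from strongest to weakest until
    their proportions add up to at least `threshold`. Topics listed in
    `skip_topics` count towards the threshold but produce no edge.
    """
    edges = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # header or blank line
        fields = line.rstrip("\n").split("\t")
        doc_name = fields[1]
        pairs = [
            (int(topic), float(prop))
            for topic, prop in zip(fields[2::2], fields[3::2])
            if topic
        ]
        pairs.sort(key=lambda pair: pair[1], reverse=True)
        covered = 0.0
        for topic, prop in pairs:
            if covered >= threshold:
                break
            covered += prop
            if topic not in skip_topics:
                edges.append((doc_name, f"topic{topic}", prop))
    return edges


def write_gephi_csv(edges, path):
    """Write edges as a Source,Target,Weight table for a spreadsheet import."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Source", "Target", "Weight"])
        writer.writerows(edges)
```

Something like `composition_to_edges(open("inscriptioncomp.txt"), skip_topics={5, 6, 7, 9, 16, 17, 29, 39, 43})` would then give the single source, target, weight list described above.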

I imported this list into Gephi, and set about trying to find groupings of topics and inscriptions, based on the shared patterns (and the weighting) of relationships. I coloured it by group (modularity) and resized nodes based on ‘betweenness’. What does betweenness mean here? I think it means the principal ideas (the discourse) that tie this entire collection together. In this case, topic 0:

statue base honours shaft ll moulding set city sbi council feature honoured capital aurelius prosopography top moulded ligatures antonius

followed closely by 1 and 37:

topic1 sarcophagus funerary inscription front lid standard necropolis aurelius forms buried tomb east formula elements aur rim burial end line

topic37 city face village house inscribed recording edition wall block text transliterated unknown large line greek area lettering viii marble

Topics - topics, Inscriptions of Aphrodisias

It might be that these most ‘between’ topics are not the ones that are archaeologically interesting. This is of course a 2-mode network (inscriptions-topics) so it might be desirable to consider this data as two 1-mode networks, inscriptions – inscriptions by virtue of shared topics, and topics – topics by virtue of shared inscriptions. When we take topics – topics, running our familiar grouping and betweenness metrics, topic 37 comes out on top, followed by 10 and 33:

topic10 building reynolds blocks block son architrave published theatre decoration papers fasciae dedication end aphrodite people aphrodisias fascia found demos

topic33 ii iii iv cut text left cross fortune monogram mccabe letters end triumphs broken vi acclamation texts drawing vii
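Turning a 2-mode network into a 1-mode one can be sketched as follows. This is an illustrative sketch under my own assumptions, not a record of the tool I used: the function name `project` is hypothetical, the input is a list of (inscription, topic, weight) edges, and the weight of each projected tie is simply the number of neighbours the two nodes share.

```python
from collections import defaultdict
from itertools import combinations


def project(edges, onto="topic"):
    """Project a 2-mode (inscription, topic, weight) edge list onto one mode.

    onto="topic": ties between topics that share inscriptions.
    onto="inscription": ties between inscriptions that share topics.
    Returns {(node_a, node_b): number_of_shared_neighbours}.
    """
    groups = defaultdict(set)
    for inscription, topic, _weight in edges:
        if onto == "topic":
            groups[inscription].add(topic)
        else:
            groups[topic].add(inscription)
    shared = defaultdict(int)
    for members in groups.values():
        # Every pair of nodes attached to the same neighbour gains one tie.
        for a, b in combinations(sorted(members), 2):
            shared[(a, b)] += 1
    return dict(shared)
```

The inscription side grows quickly because every popular topic ties together all the inscriptions attached to it. With 1505 inscriptions there are 1,131,760 possible pairs, so a graph with a couple of hundred thousand relationships is not surprising.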

When we turn the two-mode network into an inscriptions – inscriptions network by virtue of shared topics, we end up with a monster of a graph: 1505 nodes (inscriptions), with 241,002 relationships! The most between inscription is iAph050118:

Building inscription of Helladios
Charlotte M. Roueché2007
Creative Commons licence Attribution 2.5 (
All reuse or distribution of this work must contain somewhere a link back to the URL
Originally published in Roueché (2004).
English French Ancient Greek Transliterated Greek Latin AsiaTurkey
2004-06-08Gabriel BodardChecked and fixed all image divs and refs 2004-03-16 Gabriel Bodard Completed lemmatisation, checked figure ids, tagged keywords 2003-11-04John LavagninoConverted beta code to Unicode 2003-05-27 Gabriel Bodard tidied and corrected 2003-04-30 Juan Garcés tidied and corrected 2003-06-22CMRtagged, tidied and corrected2003-07-14JLGLemmatised2003-08-20CMRname tags reduced2004-01-16CMRtidied; image refs2003-05-27Gabriel BodardTyped and marked-up Greek Description of Monument

A rectangular white marble block, perhaps from a lintel (0.285 × 0.665 × 0.50) with simple moulding above and below on one face. Chipped to the right, but complete.
Description of Text

Inscribed on the moulded face, in one line on the surface between the mouldings, which is slightly concave. The text must have continued onto an adjacent block.

Description of Letters
Flowing style, similar to 5.302, 5.119 and 4.120; 0.05-0.06.

First half of the fourth century (lettering, prosopography).
ὁ ἁγνός

Me also Helladios the pure

For Helladios see also 1.131, 4.120 and discussion at II.35.

Hadrianic Baths: central chamber. Unknown. Findspot (1972)..

History of Recording
Excavated by the NYU expedition.
Bibliography Published by Roueché, Aphrodisias in Late Antiquity, no. 18, whence PHI 605.
Text Constituted From Transcription (Roueché).
Photographs Face (1972)

Seems a bit underwhelming, no? But look at what is in this inscription – a personal name, the central chamber of the Baths, links outward to other inscriptions… reading the inscriptions algorithmically doesn’t absolve us from having to jump back in to do the close reading. Instead, we have to bounce back and forth between the micro and the macro. The modularity routine suspects that there are around 52 distinct subgroups in this material. That’s probably where the most interest will lie, for scholars of this material. Are these groups related to context of discovery, or named individuals appearing in multiple inscriptions or…? Five groups account for 1456 inscriptions. (It’s easier to load the ‘inscriptions-inscriptions-inscriptions-of-aphrodisias [Nodes].csv’ file to examine all of these). What might be causing the ‘big five’ to group together? I will leave it up to the epigraphers to examine them…

Those 47 inscriptions which the modularity routine found so odd that they each were put into their own group are curious indeed. The first of these uniques is Inscription iAph080906:

ὁ μεγαλοπρεπέστατος πολιτευόμενος σὺν θεῷ πατὴρ τῆς πόλεως

Up with Theopompos, magnificentissimus, member of the council and, with God’s help, pater civitatis

…which seems to be a good place to draw this note to a close. Up with Theopompos indeed! One wonders if he won the election. The remainder (checked at random) seem to have no translations associated with them. So perhaps what really sets these apart is simply that they haven’t been translated. If so, it’s rather astonishing that that should be visible from a topic-model & graph viz combination.

Topic modeling the things that fell out of pockets

Modern Districts by Modularity, overlain with hand-drawn 1st century civitas boundaries

Topic modeling is very popular at the moment in the digital humanities. Ian, Scott and I described them as tools for extracting topics or injecting semantic meaning into vocabularies: “Topic models represent a family of computer programs that extract topics from texts. A topic to the computer is a list of words that occur in statistically meaningful ways. A text can be an email, a blog post, a book chapter, a journal article, a diary entry – that is, any kind of unstructured text” (Graham, Weingart, and Milligan 2012). In that tutorial, ‘unstructured’ means that there is no encoding in the text by which a computer can model any of its semantic meaning.

But there are topic models of ships’ logs, of computer code. So why not archaeological databases?

Archaeological datasets are rich, largely unstructured bodies of text. While there are examples of archaeological datasets that are coded with semantic meaning through xml and Text Encoding Initiative practices, many of these are done after the fact of excavation or collection. Day to day, things can be rather different, and this material can be considered to be ‘largely unstructured’ despite the use of databases, controlled vocabulary, and other means to maintain standardized descriptions of what is excavated, collected, and analyzed. This is because of the human factor. Not all archaeologists are equally skilled. Not all data gets recorded according to the standards. Where some see few differences in a particular clay fabric type, others might see many, and vice versa. Archaeological custom might call a particular vessel type a ‘casserole’, thus suggesting a particular use, only because in the 19th century when that vessel type was first encountered it reminded the archaeologist of what was in his kitchen – there is no necessary correlation between what we as archaeologists call things and what those things were originally used for. Further, once data is recorded (and the site has been destroyed through the excavation process), we tend to analyze these materials in isolation. That is, we write our analyses based on all of the examples of a particular type, rather than considering the interrelationships amongst the data found in the same context or locus.

David Mimno in 2009 turned the tools of data analysis on the databases of household materials recovered and recorded room by room at Pompeii. He considered each room as a ‘document’ and the artefacts therein as the ‘tokens’ or ‘words’ within that document, for the purposes of topic modeling. The resulting ‘topics’ of this analysis are what he calls ‘vocabularies’ of object types which when taken together can suggest the mixture of functions particular rooms may have had in Pompeii. He writes, ‘the purpose of this tool is not to show that topic modeling is the best tool for archaeological investigation, but that it is an appropriate tool that can provide a complement to human analysis….mathematically concrete in its biases’. The ‘casseroles’ of Pompeii turn out to have nothing to do with food preparation, in Mimno’s analysis. To date, I believe this is the only example of topic modeling applied to archaeological data.
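The ‘room as document, artefact as word’ idea is easy to sketch. What follows is my own illustration, not Mimno's code: the function names are hypothetical, the input is assumed to be simple (context, object type) pairs, and the one-line-per-document output assumes MALLET's default ‘name, label, text’ layout for importing a single file, with a dummy label `X`.

```python
from collections import defaultdict


def records_to_documents(records):
    """Group (context, object_type) pairs into one 'document' per context.

    Multi-word object types are joined with underscores so that each
    artefact stays a single token.
    """
    docs = defaultdict(list)
    for context, item in records:
        docs[context].append(item.strip().lower().replace(" ", "_"))
    return {context: " ".join(tokens) for context, tokens in docs.items()}


def to_mallet_lines(documents, label="X"):
    """Format documents as tab-separated 'name label text' lines."""
    return [
        f"{name.replace(' ', '_')}\t{label}\t{text}"
        for name, text in documents.items()
    ]
```

The same grouping works for any unit of analysis: swap rooms for excavation contexts or modern districts, and artefacts for the words of a finds description.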

Directly inspired by that example, I’ve been exploring the use of topic models on another rich archaeological dataset, the Portable Antiquities Scheme database in the UK. The Portable Antiquities Scheme is a project “to encourage the voluntary recording of archaeological objects found by members of the public in England and Wales”. To date, there are over half a million unique records in the Scheme’s database. These are small things, things that fell out of pockets, things that often get found via metal-detecting.

Here’s what I’ve been doing.

1. I downloaded a nightly dump of the PAS data back in April; it came as a csv file. I opened the file, and discovered over a million lines of records. Upon closer examination, I think the problem has something to do with the encoding: there are line breaks, carriage returns, and other non-printing characters (as well as commas used within fields), so that when I open the file a single record (say, a coin hoard) ends up occupying tens of lines, or its fields shift at the extraneous commas.

2. I cleaned this data up using Notepad++ and the liberal use of regular expressions to put everything back together again. The entire file is something like 385 MB.

3. I imported it into MS Access so that I could begin to filter it. I’ve been playing with Palaeolithic, Mesolithic, and Neolithic records; Bronze Age records; and Roman records. The Roman material itself occupies somewhere around 100,000 unique records.

4. I exported my queries so that I would have a simpler table with dates, descriptions, and measurements.

5. I filtered this table in Excel so that I could copy and paste out all of the records found within a particular district (which left me with a folder with 275 files, totaling something like 25 MB of text).

6. Meanwhile, I began topic modeling the unfiltered total PAS database (just after #2 above). Each run takes about 3 hours, as I’ve been running diagnostics to explore the patterns. The problem I have here, though, is this: what, precisely, am I finding? What does a cluster of records that share a topic actually mean, archaeologically? Do topics sort themselves out by period, by place, by material, by finds officer…?

7. As that’s been going on, I’ve been topic modeling the folders that contain the districts of England and Wales for a given period. Let’s look at the Roman period.
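Steps 1 to 5 above could, in principle, be collapsed into a few lines of Python instead of Notepad++, Access, and Excel. This is only a minimal sketch under loud assumptions: it supposes the dump is a well-formed csv in which fields containing line breaks or commas are quoted (if they are not, the regular-expression repair is still needed first), and the column names `broadperiod`, `district`, and `description` are hypothetical stand-ins for whatever the real headers are. The sample records are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def split_by_district(csv_text, period="ROMAN",
                      period_col="broadperiod",
                      district_col="district",
                      text_col="description"):
    """Group description text by district for one period.

    The column names are assumptions; change them to match the real file.
    """
    # newline="" leaves line breaks alone so the csv module can keep
    # a quoted multi-line field together as one record
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    by_district = defaultdict(list)
    for row in reader:
        if (row.get(period_col) or "").strip().upper() != period:
            continue
        district = (row.get(district_col) or "").strip().lower()
        if not district:
            continue  # skip records with no district recorded
        # collapse embedded line breaks so each record is a single line
        text = " ".join((row.get(text_col) or "").split())
        by_district[district].append(text)
    return dict(by_district)


# Invented sample: the first record has a comma and a line break in a field
sample = (
    'id,broadperiod,district,description\n'
    '1,ROMAN,Adur,"A copper-alloy brooch,\nworn and corroded"\n'
    '2,ROMAN,Adur,Silver denarius\n'
    '3,BRONZE AGE,Adur,Socketed axe\n'
)
groups = split_by_district(sample)
```

Each key in `groups` could then be written out as its own text file, one file per district, which is the shape of corpus the topic modeling below expects.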

There are 275 files, where a handful have *a lot* of data (> 1000 KB), while the vast majority are fairly small (< 100 KB). Perhaps that replicates patterns of metal detecting – see Bevan on biases in the PAS. The districts beyond these 275 seem to have no records in the database, so I’ve got 80% coverage for all of England and Wales. I’ve been iterating over all of this data, so I’ll just describe the most recent run, as it seems to be a typical result. Using MALLET 2.0.7, I made a topic model with 50 topics (and turned on hyperparameter optimization with the optimize-interval option, to shake out the useful from the not-so-useful topics). Last night, as I did this, the topic diagnostics package just wouldn’t work for me (you run it from the MALLET directory, but it lives at the MALLET site; perhaps they were working on it). So I’ll probably want to run all these again.
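For anyone wanting to repeat a run like this, the two MALLET steps might be assembled along the following lines. It is a sketch, not a record of the exact commands: the folder and file names and the interval of 20 are assumptions for illustration, the stopword flag is optional, and on Windows the launcher is `bin\mallet` rather than `bin/mallet`. The function only builds the argument lists; nothing is executed.

```python
def mallet_commands(corpus_dir, num_topics=50, optimize_interval=20):
    """Build (but do not run) the MALLET import and training commands.

    File names and the optimize interval are illustrative assumptions.
    """
    import_cmd = [
        "bin/mallet", "import-dir",
        "--input", corpus_dir,          # folder of one-file-per-district texts
        "--output", "roman.mallet",
        "--keep-sequence",              # required for topic modeling
        "--remove-stopwords",
    ]
    train_cmd = [
        "bin/mallet", "train-topics",
        "--input", "roman.mallet",
        "--num-topics", str(num_topics),
        "--optimize-interval", str(optimize_interval),
        "--output-topic-keys", "keys.txt",
        "--output-doc-topics", "composition.txt",
    ]
    return import_cmd, train_cmd


import_cmd, train_cmd = mallet_commands("roman_districts")
```

Each list could be handed to `subprocess.run(cmd, check=True)` from inside the MALLET directory.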

If I sort the topic keys by their prominence (see ‘optimize interval’), the top 14 all seem to describe different kinds of objects – brooches, denarii, nummus, sherds, lead weights, radiate, coin dates, the ‘heads’ sides of coins (which Emperor). Then we get to the next topic, which reads: “record central database recording usual standards fall created scheme aware portable began antiquities rectify working corroded ae worn century”. This meta-note about data quality appears throughout the database, and refers to materials collected before the Scheme got going.
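Sorting the keys by prominence can be done by hand in a spreadsheet, but a short script does the same job. The sketch below assumes the usual tab-separated layout of MALLET’s topic keys file (topic number, weight, then the key words); the three sample lines and their weights are invented for illustration and are not results from the PAS run.

```python
def rank_topics(keys_text):
    """Return (topic_id, weight, words) tuples, most prominent topic first."""
    topics = []
    for line in keys_text.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        topic_id = int(parts[0])
        weight = float(parts[1])
        words = parts[2].split()
        topics.append((topic_id, weight, words))
    # a larger weight means the topic is more prominent across the corpus
    return sorted(topics, key=lambda t: t[1], reverse=True)


# Invented sample in the assumed layout: id <tab> weight <tab> words
keys_sample = (
    "0\t0.02\trecord central database\n"
    "1\t0.31\tbrooch bow pin\n"
    "2\t0.11\tdenarius silver reverse\n"
)
ranked = rank_topics(keys_sample)
ranked_ids = [topic_id for topic_id, _, _ in ranked]
```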

After that, the remaining topics all seem to deal with the epigraphy of coins, and the various inscriptions, figurative devices, their weights and materials. A number of these topics also include allusions to the work of Guest and Wells, whose work on Iron Age coins is frequently cited in the database.

Let’s look at the individual districts now, and how these topics play over geographic space. Given that these are modern districts, it’d be better, perhaps, to do this over again with the materials sorted into geographic entities which make sense from a Roman perspective. Perhaps do it by major Roman roads (sorting the records so that the districts through which Watling Street passes are gathered into a single document). Often what people do when they want to visualize the patterns of topic interconnections in a corpus is to trim the composition document so that only topics greater than a certain threshold are imported to a package like Gephi.

My suspicion is that that would throw out a lot of useful data. It may be that it’s the very weak connections that matter. A very strong topic-document relationship might just mean that a coin hoard found in the area is blocking the other signals.

In which case, let’s bring the whole composition document into Gephi. Start with this:

adur 4 0.238806 15 0.19403 22 0.179104 13 0.119403 17 0.089552

and delete the edge weights. (I’m trying to figure out how to do what follows without deleting those edge weights, but bear with me.)

You end up with something like this:

adur 4 15 22  […etc…]

Save the file with a new name, as csv.

Open in Notepad++ (or similar) and replace the commas with ;
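The deleting and replacing just described can also be scripted. A minimal sketch, assuming every line of the composition document has the shape shown above: a district name followed by alternating topic numbers and weights, separated by whitespace. The function name is my own invention.

```python
def to_adjacency_line(composition_line, sep=";"):
    """Drop the weights and join the district and its topics with `sep`."""
    fields = composition_line.split()
    name, rest = fields[0], fields[1:]
    # fields alternate topic, weight, topic, weight...; keep only the topics
    topics = rest[0::2]
    return sep.join([name] + topics)


line = "adur 4 0.238806 15 0.19403 22 0.179104 13 0.119403 17 0.089552"
adjacency = to_adjacency_line(line)
```

Applied to every line of the composition document and saved as a csv file, this gives the weightless file used in the next step.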

Go to Gephi. Under ‘open graph file’, select your csv file. This is not the same as ‘import spreadsheet’ under the data table tab. Through ‘open graph file’ you can import a delimited file where the first item on a line is a node, and each subsequent item is another node to which it is attached. If you tried to open that file under the ‘import spreadsheet’ button, you’d get an error message: in that dialogue, you have to have two columns, source and target, where each row describes a single relationship. See the difference?

This is why, if you left the edge weights in the csv file (let’s call it an adjacency file), you’d end up with weights becoming nodes, which is a mess. If you want to keep the weights, you have to take the second option, ‘import spreadsheet’.

I’ve tried it both ways. Ultimately, while the first option is much, much faster, the second option is the one to go for, because the edge weights (the proportion of a document that a topic accounts for) are extremely important. So I created a single list that included up to seven topic-weight pairs for each document. (This doesn’t create a graph where k=7, because not every document had that many topics. But why 7? In truth, after that point, the topics all seemed to be well under 1% of each document’s composition.)
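The weighted list can be produced without hand editing. The sketch below assumes the same line shape as the ‘adur’ example (district name, then alternating topic numbers and weights) and writes rows with Source, Target, and Weight headers, which Gephi’s ‘import spreadsheet’ dialogue recognizes for an edge table. The `topic_` prefix is my own addition so that topic numbers cannot be confused with anything else; the function names are hypothetical.

```python
import csv
import io


def to_edge_rows(composition_line, max_pairs=7):
    """Turn one composition line into (district, topic, weight) rows."""
    fields = composition_line.split()
    name, rest = fields[0], fields[1:]
    pairs = [(topic, float(weight))
             for topic, weight in zip(rest[0::2], rest[1::2])]
    # keep only the strongest pairs, in case the line is not already sorted
    pairs.sort(key=lambda p: p[1], reverse=True)
    return [(name, "topic_" + topic, weight)
            for topic, weight in pairs[:max_pairs]]


def edges_to_csv(lines, max_pairs=7):
    """Build the text of a Source,Target,Weight edge table."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for line in lines:
        if line.strip():  # ignore blank lines
            writer.writerows(to_edge_rows(line, max_pairs))
    return out.getvalue()


line = "adur 4 0.238806 15 0.19403 22 0.179104 13 0.119403 17 0.089552"
edge_rows = to_edge_rows(line)
edge_table = edges_to_csv([line])
```

A district with fewer than seven topics simply contributes fewer rows, which matches the point above about the graph not being strictly k=7.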

With me so far? Great.

Now that I have a two mode network in Gephi, I can begin to analyze the pattern of topics in the documents. Using the multi-mode plugin, I separate this network into two one-mode networks: topics to topics (based on appearing in the same district) and district – district based on having the same topics, in different strengths.
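To make the projection concrete, here is one common way of collapsing a two mode network into a district – district network: two districts are linked whenever they share a topic, and the link weight is the sum, over shared topics, of the product of the two topic weights. I have not checked that this is exactly the weighting the multi-mode plugin uses, so treat it as an illustrative sketch; the district names other than ‘adur’ and all the weights are toy values.

```python
from collections import defaultdict
from itertools import combinations


def project_districts(edges):
    """Project (district, topic, weight) triples onto district pairs.

    Returns {(district_a, district_b): weight}, with each pair in
    alphabetical order.
    """
    by_topic = defaultdict(dict)
    for district, topic, weight in edges:
        by_topic[topic][district] = weight

    projected = defaultdict(float)
    for members in by_topic.values():
        # every pair of districts sharing this topic gets some weight
        for a, b in combinations(sorted(members), 2):
            projected[(a, b)] += members[a] * members[b]
    return dict(projected)


# Toy values, invented for illustration
toy_edges = [
    ("adur", "topic_4", 0.5),
    ("arun", "topic_4", 0.2),
    ("adur", "topic_15", 0.1),
    ("arun", "topic_15", 0.3),
    ("bath", "topic_15", 0.4),
]
district_net = project_districts(toy_edges)
```

Swapping the roles of district and topic in the triples gives the topic – topic network in the same way.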

Network visualization doesn’t offer anything useful here (although Wales is always quite distinct when you do visualize it, because of the coin hoards). Instead, I simply compute useful network metrics. For instance, ‘betweenness’ counts the number of times a node lies on the shortest paths between all pairs of nodes. In a network built from the words of a text, words with high betweenness do the heavy semantic lifting. So identifying the topics that are most in between in the topic – topic network should be a useful thing to do. But what does ‘betweenness’ imply for the district – district network? I’m not sure yet. Pivotal areas in the formation of material culture?
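For readers who want to see what the betweenness calculation involves, here is a small sketch of Brandes’ algorithm for an undirected network that ignores edge weights. It is meant to illustrate the idea, not to reproduce Gephi’s implementation, and it assumes every neighbour also appears as a key in the adjacency dictionary.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalised shortest-path betweenness (undirected, unweighted).

    adjacency: {node: set of neighbouring nodes}
    """
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # breadth-first search, counting shortest paths from the source
        order = []
        preds = {v: [] for v in adjacency}
        n_paths = {v: 0 for v in adjacency}
        dist = {v: -1 for v in adjacency}
        n_paths[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:  # first time we reach w
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # walk back from the farthest nodes, sharing out the credit
        credit = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                credit[v] += n_paths[v] / n_paths[w] * (1 + credit[w])
            if w != source:
                score[w] += credit[w]
    # each undirected pair was counted once from each end
    return {v: s / 2 for v, s in score.items()}


# A three-node chain: only the middle node sits between the other two
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
chain_scores = betweenness(chain)
```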

What is perhaps more useful is the ‘modularity’. It’s just one of a number of algorithms one could use to try to find structural sub-groups in a network (NodeXL has many more). But perhaps there are interesting geographical patterns if we examine the pattern of links. So I ran modularity, and uploaded the results to OpenHeatMap to visualize them geographically. Network analysis doesn’t need to produce network visualizations, by the way.
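Gephi’s modularity routine searches for a good grouping (it uses the Louvain method); the sketch below does something much simpler and only scores a grouping you already have, using Newman’s modularity Q. It may help in seeing what the number means: Q compares the share of link weight that falls inside groups with what you would expect if links were placed at random. The toy network is invented.

```python
def modularity(edges, community):
    """Newman's Q for a weighted, undirected network and a given grouping.

    edges: (a, b, weight) triples, each undirected edge listed once
    community: {node: group label}
    """
    total = sum(w for _, _, w in edges)  # total edge weight, m
    strength = {}   # summed weight of edges touching each node
    inside = {}     # summed weight of edges wholly inside each group
    for a, b, w in edges:
        strength[a] = strength.get(a, 0.0) + w
        strength[b] = strength.get(b, 0.0) + w
        if community[a] == community[b]:
            group = community[a]
            inside[group] = inside.get(group, 0.0) + w

    group_strength = {}
    for node, s in strength.items():
        group = community[node]
        group_strength[group] = group_strength.get(group, 0.0) + s

    # Q = sum over groups of (inside / m) - (group strength / 2m)^2
    return sum(
        inside.get(group, 0.0) / total - (s / (2 * total)) ** 2
        for group, s in group_strength.items()
    )


# Toy network: two triangles joined by a single link
toy = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
       ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),
       ("c", "d", 1)]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
q_two = modularity(toy, two_groups)
```

Putting every node in a single group gives Q of zero, which is why a clearly positive score suggests the grouping is picking up real structure.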

See the result for yourself here:

It colours each district based on the group that it belongs to. If you mouse-over a district, it’ll give you that group’s number – those numbers shouldn’t be confused with anything else. I’d do this in QGIS, but this was quicker for getting a sense of what’s going on.

I asked on Twitter (referencing a slightly earlier version) if these patterns suggested anything to any of the Romano-Britain crowd.


Modularity for topic-topic also implies some interesting groupings, but these seem to mirror what one would expect by looking at their prominence in the keys.txt file. So that’s where I am now, soon to try out Phil’s suggestion.

As Paul Harvey was wont to say, ‘…and now you know… the REST of the story’. At DH2013 I hope to be able to tell you what all of this may mean.