Using Goldstone’s Topic Modeling R package with JSTOR’s Data for Research

Andrew Goldstone and Ted Underwood have an article on ‘the quiet transformation of literary studies’ (preprint), where they topic model a literary history journal and discuss the implications of that model for their discipline. Andrew has a blog post discussing their process and the coding that went into that process.

I’m currently playing with their code, and thought I’d share some of the things you’ll need to know if you want to try it out for yourself – get it on github. I’m assuming you’re using a Windows machine.

1. Get the right version of R. You need 3.0.3 for this. Both the 32-bit and 64-bit versions of R download in a single installer; when you install it, you can choose the 32-bit or the 64-bit version, depending on your machine. Choose wisely.

2. Make sure you're using the right version of Java. If you are on a 64-bit machine, install 64-bit Java; on a 32-bit machine, 32-bit Java.

3. Dependencies. You need to have the rJava package and the Mallet wrapper installed in R. You'll also need 'devtools'. In the basic R GUI, you can do this by clicking on Packages >> Install Packages. Select rJava; do the same again for mallet, and again for devtools. Now you can install Goldstone's dfrtopics by typing, at the R command prompt:

library(devtools)
install_github("dfrtopics","agoldst")

Now. Assuming that you've downloaded and unzipped a dataset from JSTOR (go to dfr.jstor.org to get one), here's what you're going to need to do. You'll need to increase the memory available in Java for rJava to play with, and you have to do this before you tell R to load the rJava library. I find it best to just close R and reload it. Then type the following, one line at a time:

options(java.parameters="-Xmx2048m")
library(rJava)
library(mallet)
library(dfrtopics)

The first line in that list increases the Java memory heap size. If all goes well, there'll be a little message telling you that your heap size is 2048 MB and that you should really increase it to 2 GB. As these are the same thing, no worries. Now to topic model your stuff!
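(First, though, if you want to confirm that the bigger heap actually took effect, rJava can ask the JVM directly – a quick sketch, assuming the libraries above loaded cleanly:)

# ask the running JVM for its maximum heap size
rt <- J("java.lang.Runtime")$getRuntime()
rt$maxMemory() / 1024^2   # in MB; should report roughly 2048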

m <- model_documents(citations_files = "[path to your]\\citations.CSV",
                     dirs = "[path to your]\\wordcounts\\",
                     stoplist_file = "[path to your]\\stoplist.txt",
                     n_topics = 60)

Change n_topics to whatever you want. In the paths to your files, remember to escape each backslash by doubling it (\\).
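(Alternatively, R on Windows is happy with forward slashes, which saves the escaping headache entirely – a sketch with made-up paths:)

m <- model_documents(citations_files = "C:/dfr/citations.CSV",
                     dirs = "C:/dfr/wordcounts/",
                     stoplist_file = "C:/dfr/stoplist.txt",
                     n_topics = 60)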

Now to export your materials.

output_model(m, "data")

This will create a ‘data’ folder with all of your outputs. But where? In your working directory! If you don’t know where this is, wait until the smoke clears (the prompt returns) and type

getwd()

You can use setwd() to set that to whatever you want:

setwd("c:\\path-to-your-preferred-work-directory")

You can also export all of this to work with Goldstone’s topic model browser, but that’ll be a post for another day. Open up your data folder and explore your results.

 

Patterns in Roman Inscriptions

Update, August 22: I've now analyzed all 1385 inscriptions. I've put an interactive browser of the visualized topic model at http://graeworks.net/roman-occupations/.

[Figure: dendrogram of the topics – see how nicely the Latin clusters?]

I've played with topic modeling inscriptions before. I've now got a very effective script in R that runs the topic model and produces various kinds of output (I'll be sharing the script once the relevant bit from our book project goes live). For instance, I've grabbed 220 inscriptions from Miko Flohr's database of inscriptions regarding various occupations in the Roman world (there are many more; like everything else I do, this is a work in progress).

Above is the dendrogram of the resulting topics. Remember, those aren't phrases, and I've made no accounting for case endings. (It's worth pointing out that I didn't include any of the metadata for these inscriptions; just the text of the inscription itself, with the diacritical marks removed.) Nevertheless, you get a sense of both the structure and content of the inscriptions, reading from left to right, top to bottom.

We can also look at which inscriptions group together based on the similarity matrix of their topics, and graph the result.

[Figure: inscriptions, linked based on similarity of the language of the inscription, via topics.]
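(For the curious, here is roughly how such a graph can be built. This is a minimal sketch, not the script I used, and it assumes you've pulled the document-topic proportions into a matrix called doc_topics, one row per inscription; the 0.5 cutoff is arbitrary.)

library(igraph)

# cosine similarity between every pair of documents, from their topic mixes
norms <- sqrt(rowSums(doc_topics^2))
sim <- (doc_topics %*% t(doc_topics)) / (norms %o% norms)

# keep only the strong links, and drop self-loops
sim[sim < 0.5] <- 0
diag(sim) <- 0

# build an undirected weighted graph and export it for Gephi
g <- graph_from_adjacency_matrix(sim, mode = "undirected", weighted = TRUE)
write_graph(g, "inscriptions.graphml", format = "graphml")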

So let’s look at these groups in a bit more depth. I can take the graph exported by R and import it into Gephi (or another package) to do some exploratory statistical analysis.
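(The same measures can also be computed directly in igraph, for what it's worth – a sketch using the hypothetical g from the snippet above. One wrinkle: igraph treats edge weights as distances rather than strengths, so similarity weights need inverting first.)

d <- 1 / E(g)$weight                # turn similarities into distances
bet <- betweenness(g, weights = d)
clo <- closeness(g, weights = d)
head(sort(bet, decreasing = TRUE))  # the most 'between' inscriptions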

I've often put a lot of stock in 'betweenness centrality', reckoning that if a document is highly between in a network representation of the patterns of similarity of topics, then that document is representative of the kinds of discourses that run through the corpus. What do we get, then?

We get this (here’s the page in the database):

aurifices Roma CIL 6, 9207 Inscription Occupation
M(arcus) Caedicius Iucundus / aurifex de / sacra via vix(it) a(nnos) XXX // Clodia …

But there are a lot of subgroupings in this graph. Something like ‘closeness’ might indicate more locally important inscriptions. In this case, the two with the highest ‘closeness’ measures are

aurifices Roma CIL 6, 9203 Inscription Occupation
Protogeni / aurfici / vix(it) an(nos) LXXX / et Claudiae / Pyrallidi con(iugi) …

and

aurifices Roma CIL 6, 3950 Inscription Occupation
Lucifer v(ixit) a(nnum) I et d(ies) XLV / Hesper v(ixit) a(nnos) II / Callistus …

If we look for subgroupings based on the patterning of connections, the biggest subgroup has 22 inscriptions:
Dis Manibus Felix publicus Brundisinorum servus aquarius vixit…
Dis Manibus Laetus publicus populi Romani 3 aquarius aquae An{n}ionis…
Dis Manibus sacrum Euporo servo vilico Caesaris aquario fecit Vestoria Olympias…
Nymphis Sanctis sacrum Epictetus aquarius Augusti nostri
Dis Manibus Agathemero Augusti liberto fecerunt Asia coniugi suo bene…
Agatho Aquarius Caesaris sibi et Anniae Myrine et suis ex parte parietis mediani…
Dis Manibus Sacrum Doiae Palladi coniugi dignissimae Caius Octavius…
Dis Manibus Tito Aelio Martiali architecto equitum singularium …
Dis Manibus Aureliae Fortunatae feminae incomparabili et de se bene merenti..
Dis Manibus Auliae Laodices filiae dulcissimae Rusticus Augusti libertus…
Dis Manibus Tychico Imperatoris Domitiani servo architecto Crispinilliano.
Dis Manibus Caio Iulio 3 architecto equitum singularium…
Dis Manibus Marco Claudio Tryphoni Augustali dupliciario negotiatori…
Dis Manibus Bromius argentarius
Faustus 3ae argentari
Dis Manibus sacrum Tiberius Claudius Hymeneus aurarius argentarius…
Dis Manibus Silio Victori filio et Naebiae Amoebae coniugi et Siliae…
Dis Manibus 3C3 argentari Allia coniugi? bene merenti fecit…
Dis Manibus Marco Ulpio Augusti liberto Martiali coactori argentario…
Suavis 3 aurarius
Dis Manibus sacrum Tiberius Claudius Hymeneus aurarius argentarius…
Dis Manibus Tito Aurelio Aniceto Augusti liberto aurifici Aurelia…

What ties these together? Well, 'dis manibus' is good, but it's pretty common. The occupations in this group are all argentarii, architecti, or aquarii. So that's a bit tighter. Many of these folks are mentioned in conjunction with their spouses.

In the next largest group, we get what must be a family (or familia, extended slave family) grouping:
Caius Flaminius Cai libertus Atticus argentarius Reatinus
Caius Octavius Parthenio Cai Octavi Chresti libertus argentarius
Musaeus argentarius
Caius Caicius Cai libertus Heracla argentarius de foro Esquilino sibi…
Caius Iunius Cai libertus Salvius Caius Iunius Cai libertus Aprodisi…
Caius Vedennius Cai filius Quirina Moderatus Antio militavit in legione…
Aurifex brattarius
Caius Acilius Luci filius Trebonia natus architectus
Caius Postumius Pollio architectus
Caius Camonius Cai libertus Gratus faber anularius
Caius Antistius Isochrysus architectus
Elegans architectus
Caius Cuppienus Cai filius Pollia Terminalis praefectus cohortis…
Cresces architectus
Cresces architectus
Caius Vedennius Cai filius Quirina Moderatus Antio militavit in legione…
Pompeia Memphis fecit sibi et Cnaeo Pompeio Iucundo coniugi suo aurifici…
Caius Papius Cai libertus Salvius Caius Papius Cai libertus Apelles…
Caius Flaminius Cai libertus Atticus argentarius Reatinus

The outliers here are graffiti, or must be getting picked up by the algorithm due to the formation of the words; the inclusion of Pompeia in here is interesting, which must be due to the overall structure of that inscription. Perhaps it's a stretch too far to wonder why these would be similar…?

This small experiment demonstrates, I think, the potential of topic modeling for digging out patterns in archaeological/epigraphic materials. In due time I will do Flohr's entire database. Here are my files, so you can play with them yourself.

[Figure: giant component at the centre of these 220 inscriptions.]

Topic Modeling an archaeological database: today’s adventures

If you follow me on Twitter and saw a number of bizarre/cryptic tweets today, that was me live-tweeting my work stream. This is what I did today – think of this as stream of consciousness over the last five hours.

  • imported portable antiquities scheme database into access so I could work with it.
  • queried it, selecting just those columns I was interested in
  • exported back to csv
  • cleaned up the data by removing '=' signs (circular reference error in Excel), names of liaison officers, meta-notes from PAS on the quality of the record, and indications that the record was sourced from the work of Guest and Wells (nb, not any citations to them); also the Celtic Coins Index note.
  • ran a simple default topic model to get a sense of what words I need to add to a custom stopwords list.
  • 552438 rows (id numbers run to 548561, so I must have lost some).

It occurs to me that I should have left the names of the liaison officers in, in case they get associated with a particular topic. D'oh.

bin\mallet import-file --input pasnebraska/everything.csv --output paseverything.mallet --keep-sequence --token-regex '[\p{L}\p{M}\p{N}]+' --remove-stopwords

bin\mallet train-topics --input paseverything.mallet --num-topics 50 --optimize-interval 20 --output-state topic-state.gz --output-topic-keys everything_keys.txt --output-doc-topics everything_composition.txt

  • I think these results will be more useful than the previous ones. Although I believe I forgot --optimize-interval. Yes, I did.

so, running this:

bin\mallet run cc.mallet.topics.tui.TopicTrainer --input paseverything.mallet --num-topics 50 --optimize-interval 20 --diagnostics-file everythingdiagnostics.xml --output-topic-keys everythingdiag_keys.txt --output-doc-topics everythingdiag_topics.txt --xml-topic-phrase-report everythingdiag_phrase.txt --xml-topic-report everythingdiag_topicreport.xml --topic-word-weights-file everythingdiag_word_weights.txt --word-topic-counts-file everythingdiag_word_counts.txt --output-state output-state.gz

Looking at the results, it looks like the first two columns – first three? – were taken to be labels. Shite.

  • reformat csv so that I have an id, and a text, per row.
  • found formula to combine all columns into a single column. but blank rows are buggering things up:

=stoneage!B1&" "&stoneage!C1&stoneage!D1&" "&stoneage!E1&" "&stoneage!F1&" "&stoneage!G1&stoneage!H1&" "&stoneage!I1&" "&stoneage!J1&" "&stoneage!K1&stoneage!L1&" "&stoneage!M1&" "&stoneage!N1&" "&stoneage!O1&stoneage!P1&" "&stoneage!Q1&" "&stoneage!R1&" "&stoneage!S1&stoneage!T1&" "&stoneage!U1&" "&stoneage!V1&" "&stoneage!W1&stoneage!X1&" "&stoneage!Y1&" "&stoneage!Z1&" "&stoneage!AA1&stoneage!AB1&" "&stoneage!AC1
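(In hindsight, this combining step is less painful outside Excel. A minimal R sketch, with a hypothetical file name and assuming the ids sit in the first column:)

stoneage <- read.csv("stoneage.csv", stringsAsFactors = FALSE)
stoneage[is.na(stoneage)] <- ""    # neutralize the blank cells first
# mash every column except the id into one text string per row
text <- apply(stoneage[, -1], 1, paste, collapse = " ")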

  • returned to the Access database. gone with the April PAS database (csv download). importing selected columns, ignoring column shift. filtering out blank rows (and/or rows where everything's column-shifted all over the place)
  • exporting by period. leaving liaison officers in. too bloody awkward to deal with the entire database at once.
  • put all the columns into a single column, so now I have just two: an id number, and a 'text'.
  • imported with regex, and ran the diagnostics topic model.

weird errors when running the model.

  • reimporting without regex.
  • rerunning with diagnostics. looks much better.

topic composition file is crazy talk.

  • ok, screw diagnostics. run normal. optimization 20, topics 50, for stoneage (9680 records).

ok, still the same problem with the composition file. What the hang?

  • re-running without optimization.

nope, still getting this kind of thing:

#doc name topic proportion …
“0 51” “FLAKE 10 0.480592529670674 44 0.3208674000538096 32 0.10756601869328378
“15 422” BLADE NEOLITHIC -4000 -2200 “Black/grey 22 0.8415325393137797 30 0.12728676320083662

So, that says to me that something weird happened in the initial import. Yet topic keys seem to make sense.

Sigh…

Wait, over on the MALLET page it says,

… the first token of each line (whitespace delimited, with optional comma) becomes the instance name, the second token becomes the label, and all additional text on the line is interpreted as a sequence of word tokens.

Simple as that? So I just need a bloody extra column in there. For the love of god…
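(In R, the fix is quick – a sketch with hypothetical file and column names, writing instance name, label, then text, tab-separated:)

pas <- read.csv("stoneage.csv", stringsAsFactors = FALSE)
# MALLET reads token 1 as the name, token 2 as the label, the rest as text
out <- data.frame(name = pas$id, label = pas$id, text = pas$text)
write.table(out, "stoneage-for-mallet.csv", sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)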

  • add column. filled it with document id (again).
  • reimporting. no regex.
  • running the topic model with diagnostics. 50 topics, optimized interval of 20.

By god, that was it.

Almost. Something still weird with the document names. Ah, found it. A blank in the first few rows.

  • reimporting. no regex.
  • running the topic model with diagnostics. 50 topics, optimized interval of 20.

SUCCESS!

Now to repeat with other periods (I combined palaeo, meso, and neolithic into a separate csv file). Then to interpret what all this means.

And I think I really need to reformulate my idea of ‘document’ to not be the individual rows, but rather the districts. I could pull all that out by hand, but I really want to figure out how to make the computer do that.
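(The regrouping itself is simple enough in R – a sketch with hypothetical file and column names, mashing every record in a district into one 'document':)

pas <- read.csv("pas_roman.csv", stringsAsFactors = FALSE)
# paste all the 'text' fields together, one row per district
districts <- aggregate(text ~ district, data = pas,
                       FUN = paste, collapse = " ")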

Anyway, some of the 'topics' from today's adventures (what are 'topics' anyway? what might it mean, archaeologically, to think of these as 'discourses'? – these are some questions I need to ask):

implement lithic mesolithic neolithic flint flake bc blade dating date waste possibly period atherton rachel worked core flakes circa

grey flint colour brown dark neolithic light flake mottled cortex mid white cream flakes patina pale coloured knapped translucent

neolithic age bronze early flint date late scraper bc flake probable tool dating complete retouched knife made part tertiary

scraper end neolithic tool flint retouch semi edge circular thumb distal dorsal face abrupt side cortex nail thumbnail plan

core platform flint mesolithic single flakes removed cortex worked scars platforms tl pebble neolithic striking multi removals small blades

flake flint cortex rface dorsal grey retouch neolithic edge small damage face remaining scars edges brown secondary made white

bulb flake percussion platform striking end ventral rface flint proximal dorsal face neolithic scars small grey mid distal patina

side plan end profile margin left distal retouched dorsal proximal shaped flake flint convex scraper ventral snapped edge plano

mm mea length res width weighs flake flint neolithic ring wide long weighing adams kurt thick thickness west blade

section implement plan neolithic lithic triangular shaped date cross shape object oval rectangular rface sides roughly worked edge knapped

hill graham south flake west cornwall paul penwith flint sw margin cream end grey brown distal fig dorsal translucent

arrowhead neolithic leaf shaped flint tanged worked tip barbed point triangular broken early faces transverse missing oblique invasive tang

blade end distal mesolithic proximal snapped broken retouch flint ends parallel dorsal edges break incomplete missing sides damage long

Topic Modeling an Archaeological Database 2

Some things I have learned in recent days:

  • data must be cleaned. Really. It's probably still too noisy, even when you think it isn't. Eliminate frequently occurring meta-notes (as it were). All citations to Guest & Wells on Coins in the UK, for instance, really muck things up.
  • you can enter a single csv file as an input for MALLET. I knew this; but I had forgotten it, faced with a few hundred thousand rows of material (as I type this, the thought also occurs that I could run MALLET on the entire single file download I got from PAS, all ~500 000 rows. Presumably, locations and periods would sort themselves out into different topics?)
  • MALLET considers letter characters to make up words. If you've got other stuff in there – numerals, for instance – that are significant, you'll need to become familiar with --token-regex, which you'd use during that initial file import. It was suggested to me to try these:

--token-regex \s\d+\s

--token-regex '[\p{L}\p{M}\p{N}]+'

What else? Oh, that's about all, for now. Oh, wait: custom stopwords. Instead of --remove-stopwords, you'll want --extra-stopwords yourlist.txt. And your list has to be formatted so that there is whitespace between the words. I wasn't sure at first whether that meant 'white space' like how you and I would figure it, or 'white space' in some kind of crazy hidden-code way (like this in regex: \s (see this)). It's the latter, broadly: \s matches newlines too, which is why the default stopword lists – one word per line, with no hit-the-space-bar white space in sight – still count as whitespace-separated.
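(One way to keep the formatting honest is to build the list in R: writeLines puts one word per line, and those newlines are whitespace as far as MALLET is concerned. The names here are liaison officers' names that surfaced in the topic keys above – candidates for the list, not a definitive one:)

# hypothetical extra stopwords; one per line in the output file
extra <- c("atherton", "rachel", "adams", "kurt", "graham", "paul")
writeLines(extra, "extra-stopwords.txt")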

Onwards!

Topic Modeling the Portable Antiquities Scheme

[Image: the Frome Hoard of Roman Coins]

I got my hands on the latest build of the Portable Antiquities Scheme database. I want to topic model the items in this database, to look for patterns in the small material culture of Britain, across time and space.

The data comes in a single CSV, with approximately 500 000 individual rows. The data's a bit messy, as a result of extra commas slipping in here and there. The names of the Finds Liaison Officers, for instance, sometimes slip into a column meant to record epigraphic info from coins. Not a big deal, across 500 000 records.

The first issue I had was that after opening the CSV file in Excel, Excel would regard all of those epigraphic conventions (the use of =, +, or [ ] and so on) as formulae. This would generate ‘circular reference’ errors. I could sort that out by inserting a ‘ at the beginning of that column. But as you can imagine, sorting through, filtering, or any kind of manipulation of a single table that large would slow things considerably – and frequently crashed this poor ol’ desktop. I tried using Open Refine to clean up the data. I suspect with a bit of time and effort I’d be able to use that product well, but yesterday all I achieved, once I imported my csv file and clicked ‘make project’, was an ‘undefined error’ (after several minutes of chugging). This morning, I turned to Access and was able to import the csv, and begin querying it, cleaning things up a bit, and so on.

So I decided to focus on the Roman records for the time being. There are some 66 000 unique records, coming from over 80 unique districts of the UK. This leaves me with a table with the chronological range for the object, a description of the object, and some measurements. I have a script that takes each individual row and turns it into a txt file which I can then import into MALLET. Each individual row can also include the district name.
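(A minimal sketch of that row-to-file step – not the actual script, and with hypothetical file and column names:)

pas <- read.csv("pas_roman.csv", stringsAsFactors = FALSE)
dir.create("mallet-input", showWarnings = FALSE)
# one txt file per record, district name included in the text
for (i in seq_len(nrow(pas))) {
  writeLines(paste(pas$district[i], pas$description[i]),
             file.path("mallet-input", paste0(pas$id[i], ".txt")))
}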

So I'm wondering now: should I just cut and paste all of the rows for a single district into a single txt file (so that the routine won't have the place name in the analyzed text)? Or should I preserve the granularity, and topic model over every record, keeping the place name? That is: a collection of 80 txt files where there are no place names, or a collection of 66 000 txt files where every file has the place name – will the place names swamp the signals?

It’s too early in the morning for this kind of thinking.