Topics as Word Clouds

Elijah Meeks and Mat Jockers both have used word clouds to visualize topics from topic models. Colour, orientation, relative placement of the words – all of these could be used to convey different dimensions of the data. Below, you’ll find clouds for each of my initial 50 topics generated from the Roman materials in the Portable Antiquities Scheme database (some 100 000 rows, or nearly 1/5 the database, collected together into ‘documents’ where each unitary district authority is the ‘document’ and the text are the descriptions of things found there). The word clouds are generated from the word weights file that MALLET can output. There are 8100 unique tokens when I convert the database into a MALLET file; each one of those is present in each ‘bag of words’ or topic that MALLET generates, but to differing degrees. Thus, word clouds (here generated with Wordle) pull out important information that the word keys document does not. However, given that I optimized the interval whilst generating the topic models, the keys document provides an indication of the strength of the topic in the corpus. I’ve arranged the word clouds scaling them against the size of the strongest topic (topic 22), top-bottom, left-right. I’ll be damned if I can get wordpress to just display each image under the other one. Even stripped my table out, it did!

At any rate, as one churns through the 50 topics, after about the first 11 (depicted below), the topics get progressively more noisy as MALLET attempts to deal with incomplete transcriptions of the epigraphy of the coins, and the frequent notes about the source for the identification of the coins (the work of Guest & Wells). The final topic depicted here, topic 20, directly references a note often left in the database concerning the quality of an individual record; these frequently are in connection with materials that entered the British Museum collection before the Portable Antiquities Scheme got going and hence the information is not up to usual standards.

This exercise then suggests to me that 50 topics is just too much. I’m rerunning everything with 10 topics this time.

Topic 22
Topic 22
Topic 48
Topic 48
Topic 43
Topic 43
Topic 32
Topic 32
Topic 7
Topic 7
Topic 33
Topic 33
Topic 13
Topic 13
Topic 47
Topic 47
Topic 46
Topic 46
Topic 35
Topic 35
Topic 20
Topic 20
Advertisements

Topic Modeling an archaeological database: today’s adventures

If you follow me on twitter, and saw a number of bizarre/cryptic tweets today, I was live tweeting my work stream. This is what I did today – think of this as stream of consciousness over the last five hours.

  • imported portable antiquities scheme database into access so I could work with it.
  • queried it, selecting just those columns I was interested in
  • exported back to csv
  • cleaned up the data by removing ‘=’ signs (circular reference error in excel), names of liason officers, meta notes from PAS on the quality of the record, and indications that the record was sourced from the work of Guest and Wells (nb, not any citations to them). also celtic coins index note.
  • run a simple defaults topic model to get a sense of what words I need to add to a custom stopwords list.
  • 552438 rows (id numbers run to 548561, so I must have lost some).

it occurs to me that I should have left the names of the liason officers in, in case they get associated with a particular topic. d’oh.

bin\mallet import-file –input pasnebraska/everything.csv –output paseverything.mallet –keep-sequence –token-regex ‘[\p{L}\p{M}\p{N}]+’ –remove-stopwords

bin\mallet train-topics –input paseverything.mallet –num-topics 50 –optimize-interval 20 –output-state topic-state.gz –output-topic-keys everything_keys.txt –output-doc-topics everything_composition.txt

  • I think these results will be more useful than the previous ones. Although I believe I forgot to optimize-interval. Yes, I did.

so, running this:

bin\mallet run cc.mallet.topics.tui.TopicTrainer –input paseverything.mallet –num-topics 50 –optimize-interval 20 –diagnostics-file everythingdiagnostics.xml –output-topic-keys everythingdiag_keys.txt –output-doc-topics everythingdiag_topics.txt –xml-topic-pharse-report everythingdiag_phrase.txt –xml-topic-report everythingdiag_topicreport.xml –topic-word-weights-file everythingdiag_word_weights.txt –word-topic-counts-file everythingdiag_word_counts.txt –output-state output-state.gz

looking at the results, it looks like the first two columns, first three? were taken to be labels. shite.

  • reformat csv so that I have an id, and a text, per row.
  • found formula to combine all columns into a single column. but blank rows are buggering things up:

=stoneage!B1&” “&stoneage!C1&stoneage!D1&” “&stoneage!E1&” “&stoneage!F1&” “&stoneage!G1&stoneage!H1&” “&stoneage!I1&” “&stoneage!J1&” “&stoneage!K1&stoneage!L1&” “&stoneage!M1&” “&stoneage!N1&” “&stoneage!O1&stoneage!P1&” “&stoneage!Q1&” “&stoneage!R1&” “&stoneage!S1&stoneage!T1&” “&stoneage!U1&” “&stoneage!V1&” “&stoneage!W1&stoneage!X1&” “&stoneage!Y1&” “&stoneage!Z1&” “&stoneage!AA1&stoneage!AB1&” “&stoneage!AC1

  • returned to access database. gone with the april pas database (csv, download). importing selected columns, ignoring column shift. filtering out blank rows (and/or rows where everything’s colunm shifted all over the place)
  • exporting by period. leaving liason officers in. too bloody awkward to deal with the entire database at once.
  • put all the columns into a single column, so now I have just two: an id number, and a ‘text’.
  • imported, with regex, and diagnostics topic model,

wierd errors when running the model.

  • reimporting without regex.
  • rerunning with diagnostics. looks much better.

topic composition file is crazy talk.

  • ok, screw diagnostics. run normal. optimization 20, topics 50, for stoneage (9680 records).

ok, still the same problem with the composition file. What the hang?

  • re-running without optimization.

nope, still getting this kind of thing:

#doc name topic proportion …
“0 51” “FLAKE 10 0.480592529670674 44 0.3208674000538096 32 0.10756601869328378
“15 422” BLADE NEOLITHIC -4000 -2200 “Black/grey 22 0.8415325393137797 30 0.12728676320083662

So, that says to me that something weird happened in the initial import. Yet topic keys seem to make sense.

Sigh…

Wait, over on Mallet page it says,

… the first token of each line (whitespace delimited, with optional comma) becomes the instance name, the second token becomes the label, and all additional text on the line is interpreted as a sequence of word tokens.

Simple as that? So I just need a bloody extra column in there. For the love of god…

  • add column. filled it with document id (again).
  • reimporting. no regex.
  • running the topic model with diagnostics. 50 topics, optimized interval of 20.

By god, that was it.

Almost. Something still weird with the document names. Ah, found it. A blank in the first few rows.

  • reimporting. no regex.
  • running the topic model with diagnostics. 50 topics, optimized interval of 20.

SUCCESS!

Now to repeat with other periods (I elided palaeo, meso, and neolithic into a separate csv file). Then to interpret what all this means.

And I think I really need to reformulate my idea of ‘document’ to not be the individual rows, but rather the districts. I could pull all that out by hand, but I really want to figure out how to make the computer do that.

Anyway, some of the ‘topics’ from today’s adventures (what are ‘topics’ anyway? what might it mean, archaeologically, to think of these as ‘discourses’?, are some questions I need to ask):

implement lithic mesolithic neolithic flint flake bc blade dating date waste possibly period atherton rachel worked core flakes circa
grey flint colour brown dark neolithic light flake mottled cortex mid white cream flakes patina pale coloured knapped translucent

neolithic age bronze early flint date late scraper bc flake probable tool dating complete retouched knife made part tertiary

scraper end neolithic tool flint retouch semi edge circular thumb distal dorsal face abrupt side cortex nail thumbnail plan

core platform flint mesolithic single flakes removed cortex worked scars platforms tl pebble neolithic striking multi removals small blades

flake flint cortex rface dorsal grey retouch neolithic edge small damage face remaining scars edges brown secondary made white

bulb flake percussion platform striking end ventral rface flint proximal dorsal face neolithic scars small grey mid distal patina

side plan end profile margin left distal retouched dorsal proximal shaped flake flint convex scraper ventral snapped edge plano

mm mea length res width weighs flake flint neolithic ring wide long weighing adams kurt thick thickness west blade

section implement plan neolithic lithic triangular shaped date cross shape object oval rectangular rface sides roughly worked edge knapped

hill graham south flake west cornwall paul penwith flint sw margin cream end grey brown distal fig dorsal translucent

arrowhead neolithic leaf shaped flint tanged worked tip barbed point triangular broken early faces transverse missing oblique invasive tang

blade end distal mesolithic proximal snapped broken retouch flint ends parallel dorsal edges break incomplete missing sides damage long

Topic Modeling the Portable Antiquities Scheme

The Frome Hoard of Roman Coins

I got my hands on the latest build of the Portable Antiquities Scheme database. I want to topic model the items in this database, to look for patterns in the small material culture of Britain, across time and space.

The data comes in a single CSV, with approximately 500 000 individual rows. The data’s a bit messy, as a result of extra commas slipping in here and there. The names of the Finds Liaison Officers slip into a column meant to record epigraphic info from coins, for instance, from time to time. Not a big deal, over 500 000 records.

The first issue I had was that after opening the CSV file in Excel, Excel would regard all of those epigraphic conventions (the use of =, +, or [ ] and so on) as formulae. This would generate ‘circular reference’ errors. I could sort that out by inserting a ‘ at the beginning of that column. But as you can imagine, sorting through, filtering, or any kind of manipulation of a single table that large would slow things considerably – and frequently crashed this poor ol’ desktop. I tried using Open Refine to clean up the data. I suspect with a bit of time and effort I’d be able to use that product well, but yesterday all I achieved, once I imported my csv file and clicked ‘make project’, was an ‘undefined error’ (after several minutes of chugging). This morning, I turned to Access and was able to import the csv, and begin querying it, cleaning things up a bit, and so on.

So I decided to focus on the Roman records, for the time being. There are some 66 000 unique records, coming from over 80 unique districts of the UK. This leaves me with a table with the chronological range for the object, a description of the object, and some measurements. I have a script that can take each individual row, and turn it into a txt file which I can then import into MALLET. Each individual row can also include the district name.

So I’m wondering now: should I just cut and paste all of the rows for a single district into a single txt file (and thus the routine will not have the place-name in the analyzed text)? Or should I preserve the granularity, and just topic model over every record, preserving the place name?  Ie, a collection of 80 txt files where there are no place names, or a collection of 66 000 txt files where every file has the place name – will they swamp the signals?

It’s too early in the morning for this kind of thinking.