Where Roman Roads and Topic Models Intersect
Previously, I ended up with a map of UK districts, coloured by the five groups that Gephi’s modularity routine suggested were present, in the network of districts to districts based on shared patterns in the underlying topics (the topic model generated from the total dump of the Portable Antiquities Scheme database).
I asked on twitter if the patterns seemed evocative of anything; Phil Mills suggested that they perhaps matched civitas boundaries. He provided me with an image of those boundaries (thanks Phil!) as well as some kmz files. Below are two images, one with civitas capitals (hand-drawn in by me) and Roman roads. Together, they are evocative. Blocks of colour seem to go very well with civitas boundaries; where blocks of colour overlap those boundaries, they seem to follow the routes of the roads. And all this from looking at topic models! I think it is getting progressively safer to say that the patterns found in an archaeological database through topic modelling are indeed meaningful on the ground. The factors of government, of identity, of mobility, seem to emerge in the topic model.
github.com/shawngraham
I’ve got a Github account. My first repository may be viewed at https://github.com/shawngraham/historicalfriction .
Took a bit of tinkering, but I think I’ve got the idea. Other humanities github type projects can be found via a simple search.
Reading Inscriptions Algorithmically
Inscriptions are complicated beasts. Although they are frequently quite small and incomplete, epigraphers are able to extract an enormous amount of information from them – especially when they have other inscriptions with which to contrast and compare. Let us look at the inscriptions from Aphrodisias, which are published online following Epidoc conventions. Because of this, we are able to do some data-mining on them with a minimum of pre-processing.
(Joyce Reynolds, Charlotte Roueché, Gabriel Bodard, Inscriptions of Aphrodisias (2007), available <http://insaph.kcl.ac.uk/iaph2007>, ISBN 978-1-897747-19-3.)
The first one looks like this, when the xml tags are stripped away:
Reynolds1982
Creative Commons licence Attribution 2.5 (http://creativecommons.org/licenses/by/2.5/)
All reuse or distribution of this work must contain somewhere a link back to the URL http://insaph.kcl.ac.uk/
Originally published in Reynolds (1982).
English French German Ancient Greek Transliterated Greek Modern Greek Italian Latin Spanish Turkish 2007-07-04cmrDONE2007-04-02Charlotte Tupmanhand tidiedGBhand tidied 2007-03-15Elliott HallBatch converted Word2XMLDescription of MonumentUpper right corner of a white marble block (0.36 x 0.24 x 0.34).Description of TextInscribed on one face.LettersLate Republican or Augustan; ave 0.02. rho in ll. 1, 3, 6 has a very small stroke slanting rightwards from the junction of the bowl with the vertical.Date Late Republican or Augustan (lettering, content)Edition οὗτος ὁ τόπος ἱερὸς ἄσυλος ὡς ἔκριναν ὁ μέγας Καῖσαρ ὁ δικτάτωρ καὶ ὁ υἱὸς αὐτοῦ αὐτοκράτωρ Καῖσαρ καὶ ἡ σύνκλητος καὶ ὁ δῆμος ὁ Ῥωμαίων καθὼς καὶ τὰ φιλάνθρωπα καὶ δελτογραφήματα καὶ ἐπικρίματα περιέχει ἀνέστησεν δὲ τοὺς ὅρους Γάϊος Ἰούλιος Ζωΐλος ὁ ἱερεὺς τῆς ἈφροδείτηςApparatus For the supplements, compare the partner inscription 1.38.Translation [?This area is] the sacred asylum [?as defined by] the great [?Caesar, the] Dictator, and [?his son] Imperator [Caesar and the ] Senate [and People] of Rome, [as is also contained in the] grants of privilege, the public documents [and decrees. C. Iulius Zoilos priest of Aphrodite set up the boundary stones.]Commentary See , 159-160.Locations Stray find. Temple/Church temenos. Museum (1977)Text Constituted From Transcription (Reynolds)History of Recording Recorded by the NYU expedition in 1963 (63.596)Bibliography Published by Reynolds, , doc. 35, whence SEG 1982.1097, BE 1983.388, 1984.878, McCabe 379, R. R. R. Smith, (Mainz, 1993) T5.Photographs
Face (1977)
There’s a lot of meta information that goes along with a single inscription above and beyond its transcription and translation, all of which is necessary to understand its possible significance. I don’t think there’s a better illustration of what ‘close reading’ might mean in archaeology than the epigrapher’s art.
What might we spot if we look at a corpus of inscriptions from a macro level? What patterns might exist? Is there something going on related to geography? Researcher? Language of the inscription? Publication history? Dating? This is where the algorithms of topic modeling might be useful. My go-to tool for this is MALLET. MALLET allows one to strip out all of the xml tags (see MALLET’s help file from the command line for import-dir), so I can download the xml files as a zip from the Inscriptions of Aphrodisias site, and begin exploring for patterns. I optimize the interval too when I train the topic model, to shake out the utility of the resulting ‘topics’. I began by modeling 50 topics.
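If you would rather see what the tag-stripping step does before handing everything to MALLET, here is a minimal sketch in Python using only the standard library. The sample markup is invented for illustration (it is not real Inscriptions of Aphrodisias XML), and `strip_tags` is my own hypothetical helper, not part of MALLET.

```python
import xml.etree.ElementTree as ET

def strip_tags(xml_text):
    """Return the text content of an XML document with every tag removed."""
    root = ET.fromstring(xml_text)
    # itertext() walks the tree in document order and yields each text node
    pieces = (piece.strip() for piece in root.itertext())
    # join with spaces so that words from adjacent elements do not run together
    return " ".join(piece for piece in pieces if piece)

# Invented sample markup, loosely shaped like an encoded inscription record
sample = """<TEI><text><body>
<div type="description"><head>Description of Monument</head>
<p>Upper right corner of a white marble block</p></div>
<div type="translation"><p>set up the boundary stones</p></div>
</body></text></TEI>"""

print(strip_tags(sample))
```

Joining the pieces with a space avoids the run-together words you can see in the stripped example above (things like ‘DONE2007-04-02Charlotte’), which otherwise end up as odd tokens in the model.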
You can download the MALLET file and results here, to play with and explore for yourself.
When I look at the results (inscriptionkeys.txt), the ‘strongest’ topics all relate to metadata regarding their online publication (the top 3). The next few clearly relate to the researchers who are behind the inscriptions of Aphrodisias website, so not overly useful for me here. The next couple seem to be a mixture of findspot information and publishing history:
topic6 0.34603 unpublished fragment reynolds face museum version lettering inscription born digital joyce unknown marble expedition white centuries nyu inscribed stray
topic32 0.23776 face upper moulding left side lettering part lower expedition museum white marble nyu broken aphrodisias asia corner inscribed front
topic39 0.15711 south walls east face west block part wall gate expedition findspot city stretch tupman mama lettering depth measurable marble
topic7 0.14493 mama gaudin published reinach mccabe cormack kubitschek squeeze notebook phi expedition records originally aphrodisias reichel publications recorded charlotte representations
topic43 0.13213 mccabe published originally bodard gabriel rouech aphrodisias phi description findspot reported subsequently charlotte unknown preliminary inscription tidied publication funerary
The remaining topics all deal explicitly with the inscriptions themselves, their texts and their findspots (it seems).
topic47 0.06106 son honoured honours people council claudius priest diogenes family tiberius man high cl public gerousia lived virtue life zenon
topic8 0.05807 roman family wife names father aphrodisian case daughter suggests citizenship early century reference possibly menodotos clear named civic late
topic38 0.05137 son zenon adrastos attalos dionysios athenagoras artemidoros apollonios hypsikles aphrodite diogenes daughter early tupman menestheus cf goddess grandson sons
Every file is composed of all these different topics, to differing degrees. I would like to visualize the paths of these discourses through the corpus, so I translate the inscriptioncomp.txt file so that I end up with at least 9/10s of each document’s composition (in practice, this means cutting and pasting the inscriptioncomp.txt file so that I end up with a single list with source-document, target-topic, and weight). I also filtered out those strongest topics described above (5,6,7,9,16,17,29,39,43).
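The cutting and pasting described here could probably be scripted. What follows is a rough sketch, not what I actually did: it assumes the composition file has one document per line, with a document number, a document name containing no spaces, and then topic/proportion pairs sorted from strongest to weakest. The function name and the sample file names are made up for the example.

```python
def composition_to_edges(lines, coverage=0.9, skip_topics=frozenset()):
    """Turn composition lines into (document, topic, weight) edges.

    Topics are read strongest-first until `coverage` of the document is
    accounted for; topics in `skip_topics` count towards the coverage
    but are left out of the edge list.
    """
    edges = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        fields = line.split()
        doc_name, pairs = fields[1], fields[2:]
        total = 0.0
        for i in range(0, len(pairs) - 1, 2):
            if total >= coverage:
                break
            topic, weight = int(pairs[i]), float(pairs[i + 1])
            total += weight
            if topic not in skip_topics:
                edges.append((doc_name, f"topic{topic}", weight))
    return edges

# Invented sample lines; the file names are placeholders
sample_lines = [
    "#doc name topic proportion ...",
    "0 iAph010001.xml 6 0.5 47 0.3 8 0.15 38 0.05",
    "1 iAph010002.xml 47 0.95 8 0.05",
]
metadata_topics = {5, 6, 7, 9, 16, 17, 29, 39, 43}
edges = composition_to_edges(sample_lines, 0.9, metadata_topics)
for edge in edges:
    print(*edge, sep=",")
```

Each printed line is one source, target, weight row, which is the shape of list Gephi’s spreadsheet import expects.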
I imported this list into Gephi, and set about trying to find groupings of topics and inscriptions, based on the shared patterns (and the weighting) of relationships. I coloured it by group (modularity) and resized nodes based on ‘betweenness’. What does betweenness mean here? I think it means the principal ideas (the discourse) that tie this entire collection together. In this case, topic 0:
statue base honours shaft ll moulding set city sbi council feature honoured capital aurelius prosopography top moulded ligatures antonius
followed closely by 1 and 37:
topic1 sarcophagus funerary inscription front lid standard necropolis aurelius forms buried tomb east formula elements aur rim burial end line
topic37 city face village house inscribed recording edition wall block text transliterated unknown large line greek area lettering viii marble
It might be that these most ‘between’ topics are not the ones that are archaeologically interesting. This is of course a 2-mode network (inscriptions-topics) so it might be desirable to consider this data as two 1-mode networks, inscriptions – inscriptions by virtue of shared topics, and topics – topics by virtue of shared inscriptions. When we take topics – topics, running our familiar grouping and betweenness metrics, topic 37 comes out on top, followed by 10 and 33:
topic10 building reynolds blocks block son architrave published theatre decoration papers fasciae dedication end aphrodite people aphrodisias fascia found demos
topic33 ii iii iv cut text left cross fortune monogram mccabe letters end triumphs broken vi acclamation texts drawing vii
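For anyone wondering what ‘turning a 2-mode network into 1-mode networks’ involves, here is a small sketch. It links two topics whenever they share an inscription (or two inscriptions whenever they share a topic) and counts how many neighbours they share. That counting rule is my assumption for the illustration; Gephi’s plugin may weight the projected edges differently, and the node names are invented.

```python
from collections import defaultdict
from itertools import combinations

def project(edges, onto="target"):
    """Project (source, target, weight) edges onto one of the two modes.

    Returns {(a, b): n}, where a and b are nodes of the chosen mode and
    n is the number of nodes of the other mode that they share.
    """
    neighbours = defaultdict(set)
    for source, target, _weight in edges:
        if onto == "target":
            neighbours[source].add(target)   # group topics by inscription
        else:
            neighbours[target].add(source)   # group inscriptions by topic
    shared = defaultdict(int)
    for members in neighbours.values():
        for a, b in combinations(sorted(members), 2):
            shared[(a, b)] += 1
    return dict(shared)

# Invented two-mode edge list: inscription, topic, weight
two_mode = [
    ("insA", "t0", 0.5), ("insA", "t1", 0.4),
    ("insB", "t0", 0.7),
    ("insC", "t1", 0.6), ("insC", "t0", 0.3),
]
topic_topic = project(two_mode, onto="target")
inscription_inscription = project(two_mode, onto="source")
print(topic_topic)
print(inscription_inscription)
```

The inscription side grows quickly because every pair of inscriptions sharing even one topic gets an edge, which is why that projection turns into such a monster.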
When we turn the two mode network into an inscriptions – inscriptions by virtue of shared topics, we end up with a monster of a graph: 1505 nodes (inscriptions), with 241,002 relationships! The most between inscription is iAph050118:
Building inscription of Helladios
Charlotte M. Roueché2007
Creative Commons licence Attribution 2.5 (http://creativecommons.org/licenses/by/2.5/)
All reuse or distribution of this work must contain somewhere a link back to the URL http://insaph.kcl.ac.uk/
Originally published in Roueché (2004).
English French Ancient Greek Transliterated Greek Latin AsiaTurkey
Aphrodisias
Geyre
2004-06-08Gabriel BodardChecked and fixed all image divs and refs 2004-03-16 Gabriel Bodard Completed lemmatisation, checked figure ids, tagged keywords 2003-11-04John LavagninoConverted beta code to Unicode 2003-05-27 Gabriel Bodard tidied and corrected 2003-04-30 Juan Garcés tidied and corrected 2003-06-22CMRtagged, tidied and corrected2003-07-14JLGLemmatised2003-08-20CMRname tags reduced2004-01-16CMRtidied; image refs2003-05-27Gabriel BodardTyped and marked-up Greek Description of MonumentA rectangular white marble block, perhaps from a lintel (0.285 × 0.665 × 0.50) with simple moulding above and below on one face. Chipped to the right, but complete.
Description of TextInscribed on the moulded face, in one line on the surface between the mouldings, which is slightly concave. The text must have continued onto an adjacent block.
Description of Letters
Flowing style, similar to 5.302, 5.119 and 4.120; 0.05-0.06.Date
First half of the fourth century (lettering, prosopography).
Edition
κἀμὲ
Ἑλλάδιος
ὁ ἁγνόςTranslation
Me also Helladios the pureCommentary
For Helladios see also 1.131, 4.120 and discussion at II.35.Locations
Hadrianic Baths: central chamber. Unknown. Findspot (1972)..History of Recording
Excavated by the NYU expedition.
Bibliography Published by Roueché, Aphrodisias in Late Antiquity, no. 18, whence PHI 605.
Text Constituted From Transcription (Roueché).
Photographs Face (1972)
Seems a bit underwhelming, no? But look at what is in this inscription – a personal name, the central chamber of the Baths, links outward to other inscriptions… reading the inscriptions algorithmically doesn’t absolve us from having to jump back in to do the close reading. Instead, we have to bounce back and forth between the micro and the macro. The modularity routine suspects that there are around 52 distinct subgroups in this material. That’s probably where the most interest will lie, for scholars of this material. Are these groups related to context of discovery, or named individuals appearing in multiple inscriptions or…? Five groups account for 1456 inscriptions. (It’s easier to load the ‘inscriptions-inscriptions-inscriptions-of-aphrodisias [Nodes].csv’ file to examine all of these). What might be causing the ‘big five’ to group together? I will leave it up to the epigraphers to examine them…
Those 47 inscriptions which the modularity routine found so odd that they each were put into their own group are curious indeed. The first of these uniques is Inscription iAph080906:
αὔξι
Θεόπομπος
ὁ μεγαλοπρεπέστατος πολιτευόμενος σὺν θεῷ πατὴρ τῆς πόλεωςTranslation
Up with Theopompos, magnificentissimus, member of the council and, with God’s help, pater civitatis
…which seems to be a good place to draw this note to a close. Up with Theopompos indeed! One wonders if he won the election. The remainder (checked at random) seem to have no translations associated with them. So perhaps what really sets these apart is simply that they haven’t been translated. If so, it’s rather astonishing that this should be visible from a topic-model & graph viz combination.
Topic modeling the things that fell out of pockets
Topic modeling is very popular at the moment in the digital humanities. Ian, Scott and I described them as tools for extracting topics or injecting semantic meaning into vocabularies: “Topic models represent a family of computer programs that extract topics from texts. A topic to the computer is a list of words that occur in statistically meaningful ways. A text can be an email, a blog post, a book chapter, a journal article, a diary entry – that is, any kind of unstructured text” (Graham, Weingart, and Milligan 2012). In that tutorial, ‘unstructured’ means that there is no encoding in the text by which a computer can model any of its semantic meaning.
But there are topic models of ships’ logs, of computer code. So why not archaeological databases?
Archaeological datasets are rich, largely unstructured bodies of text. While there are examples of archaeological datasets that are coded with semantic meaning through xml and Text Encoding Initiative practices, many of these are done after the fact of excavation or collection. Day to day, things can be rather different, and this material can be considered to be ‘largely unstructured’ despite the use of databases, controlled vocabulary, and other means to maintain standardized descriptions of what is excavated, collected, and analyzed. This is because of the human factor. Not all archaeologists are equally skilled. Not all data gets recorded according to the standards. Where some see few differences in a particular clay fabric type, others might see many, and vice versa. Archaeological custom might call a particular vessel type a ‘casserole’, thus suggesting a particular use, only because in the 19th century when that vessel type was first encountered it reminded the archaeologist of what was in his kitchen – there is no necessary correlation between what we as archaeologists call things and what those things were originally used for. Further, once data is recorded (and the site has been destroyed through the excavation process), we tend to analyze these materials in isolation. That is, we write our analyses based on all of the examples of a particular type, rather than considering the interrelationships amongst the data found in the same context or locus. David Mimno in 2009 turned the tools of data analysis on the databases of household materials recovered and recorded room by room at Pompeii. He considered each room as a ‘document’ and the artefacts therein as the ‘tokens’ or ‘words’ within that document, for the purposes of topic modeling. The resulting ‘topics’ of this analysis are what he calls ‘vocabularies’ of object types which when taken together can suggest the mixture of functions particular rooms may have had in Pompeii. 
He writes, ‘the purpose of this tool is not to show that topic modeling is the best tool for archaeological investigation, but that it is an appropriate tool that can provide a complement to human analysis….mathematically concrete in its biases’. The ‘casseroles’ of Pompeii turn out to have nothing to do with food preparation, in Mimno’s analysis. To date, I believe this is the only example of topic modeling applied to archaeological data.
Directly inspired by that example, I’ve been exploring the use of topic models on another rich archaeological dataset, the Portable Antiquities Scheme database in the UK. The Portable Antiquities Scheme is a project “to encourage the voluntary recording of archaeological objects found by members of the public in England and Wales”. To date, there are over half a million unique records in the Scheme’s database. These are small things, things that fell out of pockets, things that often get found via metal-detecting.
Here’s what I’ve been doing.
1. I downloaded a nightly dump of the PAS data back in April; it came as a csv file. I opened the file, and discovered over a million lines of records. Upon closer examination, I think the problem has something to do with the encoding: there are line breaks, carriage returns, and other non-printing characters (as well as commas used within fields), so that when I open the file I end up with a single record (say a coin hoard) occupying tens of lines, or with fields shifting at the extraneous commas.
2. I cleaned this data up using Notepad++ and the liberal use of regular expressions to put everything back together again. The entire file is something like 385 mb.
3. I imported it into MS Access so that I could begin to filter it. I’ve been playing with paleo – meso – and neolithic records; bronze age records; and Roman records. The Roman material itself occupies somewhere around 100 000 unique records.
4. I exported my queries so that I would have a simpler table with dates, descriptions, and measurements.
5. I filtered this table in Excel so that I could copy and paste out all of the records found within a particular district (which left me with a folder with 275 files, totaling something like 25 mb of text).
6. Meanwhile, I began topic modeling the unfiltered total PAS database (just after #2 above). Each run takes about 3 hours, as I’ve been running diagnostics to explore the patterns. The problem I have here though is what, precisely, am I finding? What does a cluster of records who share a topic actually mean, archaeologically? Do topics sort themselves out by period, by place, by material, by finds officer…?
7. As that’s been going on, I’ve been topic modeling the folders that contain the districts of England and Wales for a given period. Let’s look at the Roman period.
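A note on steps 1 and 2: if the dump quotes the fields that contain commas and line breaks, a csv parser can put such records back together without any regular expressions. The sketch below shows the idea with Python’s standard csv module. The sample data and column names are invented, and I have not checked that the PAS dump is quoted this way; if it isn’t, this will not rescue it.

```python
import csv
import io

def read_records(handle):
    """Read csv rows, collapsing line breaks and extra spaces inside fields."""
    reader = csv.reader(handle)
    header = next(reader)
    rows = []
    for row in reader:
        # a quoted field may span several physical lines; flatten it
        rows.append([" ".join(field.split()) for field in row])
    return header, rows

# Invented sample: the first record spans two physical lines
raw = (
    "id,description,district\r\n"
    '1,"Coin hoard,\r\nfound in a pot",Adur\r\n'
    "2,Brooch,Arun\r\n"
)
# newline="" stops Python from translating the embedded line breaks itself;
# use open(path, newline="") in the same way for a real file
header, rows = read_records(io.StringIO(raw, newline=""))
print(header)
print(rows)
```

Three physical lines of data come back as two records, with the comma inside the description left where it belongs.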
There are 275 files, where a handful have *a lot* of data (> 1000 kb), while the vast majority are fairly small (< 100 kb). Perhaps that replicates patterns of metal detecting - see Bevan on biases in the PAS. The remaining districts seem to have no records in the database. So I’ve got 80% coverage for all of England and Wales. I’ve been iterating over all of this data, so I’ll just describe the most recent, as it seems to be a typical result. Using MALLET 2.0.7, I made a topic model with 50 topics (and optimized the interval, to shake out the useful from the not-so-useful topics). Last night, as I did this, the topic diagnostics package just wouldn’t work for me (you run it from the MALLET directory, but it lives at the MALLET site; perhaps they were working on it). So I’ll probably want to run all these again.
If I sort the topic keys by their prominence (see ‘optimize interval’) the top 14 all seem to describe different kinds of objects – brooches, denarii, nummus, sherds, lead weights, radiate, coin dates, the ‘heads’ sides of coins – which Emperor. Then we get to the next topic, which reads: “record central database recording usual standards fall created scheme aware portable began antiquities rectify working corroded ae worn century”. This meta-note about data quality appears throughout the database, and refers to materials collected before the Scheme got going.
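Sorting the keys by prominence is easy to do by hand in a spreadsheet, but here is a sketch of the same thing in code. It assumes each line of the keys file holds a topic number, a weight, and then the key words, separated by whitespace. The three sample lines are invented toy values, not output from my model.

```python
def sort_topic_keys(lines):
    """Return (weight, topic_id, words) tuples, most prominent topic first."""
    topics = []
    for line in lines:
        if not line.strip():
            continue
        # split off the first two columns; the rest of the line is the key words
        topic_id, weight, words = line.split(None, 2)
        topics.append((float(weight), int(topic_id), words.split()))
    topics.sort(key=lambda topic: topic[0], reverse=True)
    return topics

# Invented toy lines in the assumed layout
toy_keys = [
    "0\t0.021\tbrooch bow pin catchplate",
    "1\t0.310\tcoin nummus radiate emperor",
    "2\t0.087\tsherd fabric rim ware",
]
ranked = sort_topic_keys(toy_keys)
for weight, topic_id, words in ranked:
    print(topic_id, weight, " ".join(words))
```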
After that, the remaining topics all seem to deal with the epigraphy of coins, and the various inscriptions, figurative devices, their weights & materials. A number of these topics also include allusions to the work of Guest and Wells, whose work on Iron Age Coins is frequently cited in the database.
Let’s look at the individual districts now, and how these topics play over geographic space. Given that these are modern districts, it’d be better – perhaps – to do this over again with the materials sorted into geographic entities which make sense from a Roman perspective. Perhaps do it by major Roman roads (sorting the records so that districts through which Watling Street traverses are gathered into a single document). Often what people do when they want to visualize the patterns of topic interconnections in a corpus is to trim the composition document so that only topics greater than a certain threshold are imported to a package like Gephi.
My suspicion is that that would throw out a lot of useful data. It may be that it’s the very weak connections that matter. A very strong topic-document relationship might just mean that a coin hoard found in the area is blocking the other signals.
In which case, let’s bring the whole composition document into Gephi. Start with this:
| adur | 4 | 0.238806 | 15 | 0.19403 | 22 | 0.179104 | 13 | 0.119403 | 17 | 0.089552 |
and delete out the edge weights. (I’m trying to figure out how to do what follows without deleting those edge weights, but bear with me.)
You end up with something like this:
adur 4 15 22 [...etc...]
Save the file with a new name, as csv.
Open in Notepad++ (or similar) and replace the commas with ;
Go to gephi. Under ‘open graph file’, select your csv file. This is not the same as ‘import spreadsheet’ under the data table tab. You can import a comma separated file where the first item on a line is a node, and each subsequent item is another node to which it is attached. If you tried to open that file under the ‘import spreadsheet’ button, you’d get an error message – in that dialogue, you have to have two columns source and target where each row describes a single relationship. See the difference?
This is why if you left the edge weights in the csv file – let’s call it an adjacency file – you’d end up with weights becoming nodes, which is a mess. If you want to keep the weights, you have to do the second option.
I’ve tried it both ways. Ultimately, while the first option is much much faster, the second option is the one to go for because the edge weights (the proportion of a document that a topic accounts for) are extremely important. So I created a single list that included seven pairs of topic-weight combinations. (This doesn’t create a graph where k=7, because not every document had that many topics. But why 7? In truth, after that point, the topics all seemed to be well under 1% of each document’s composition).
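To make the difference between the two file shapes concrete, here is a sketch that takes the ‘adur’ row shown earlier and produces both: the adjacency line (node followed by its neighbours, weights thrown away) and the source/target/weight rows that keep the weights. The helper names are mine, and treating the row as pipe-separated is just a convenience for the example.

```python
def parse_row(row):
    """Split a '| name | topic | weight | ...' row into a name and (topic, weight) pairs."""
    cells = [cell.strip() for cell in row.strip().strip("|").split("|")]
    district, rest = cells[0], cells[1:]
    pairs = [(rest[i], float(rest[i + 1])) for i in range(0, len(rest) - 1, 2)]
    return district, pairs

def adjacency_line(district, pairs, sep=";"):
    """First option: the node, then every node it attaches to, no weights."""
    return sep.join([district] + [topic for topic, _weight in pairs])

def edge_rows(district, pairs, top_k=7):
    """Second option: one (source, target, weight) row per relationship."""
    return [(district, topic, weight) for topic, weight in pairs[:top_k]]

row = "| adur | 4 | 0.238806 | 15 | 0.19403 | 22 | 0.179104 | 13 | 0.119403 | 17 | 0.089552 |"
district, pairs = parse_row(row)
print(adjacency_line(district, pairs))
for source, target, weight in edge_rows(district, pairs):
    print(source, target, weight, sep=",")
```

The first print gives the one-line adjacency form; the loop gives five rows, one per topic, ready for a spreadsheet with source, target and weight columns.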
With me so far? Great.
Now that I have a two mode network in Gephi, I can begin to analyze the pattern of topics in the documents. Using the multi-mode plugin, I separate this network into two one-mode networks: topics to topics (based on appearing in the same district) and district – district based on having the same topics, in different strengths.
Network visualization doesn’t offer anything useful here (although Wales is always quite distinctly apparent when you do; it’s because of the coin hoards). Instead, I simply compute useful network metrics. For instance, ‘betweenness’ literally counts the number of times a node lies in between all pairs of nodes, given the shortest paths connecting them. In a network of the words in a piece of text, the words with high betweenness do the heavy semantic lifting. So identifying topics that are most in between in the topic – topic network should be a useful thing to do. But what does ‘betweenness’ imply for the district – district network? I’m not sure yet. Pivotal areas in the formation of material culture?
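If you want to see the counting behind betweenness for yourself, here is a compact sketch for a small unweighted, undirected network. It follows the standard approach of running a breadth-first search from every node and then accumulating each node’s share of the shortest paths (usually credited to Brandes). I am not claiming this is exactly what Gephi runs, and the toy network is invented.

```python
from collections import deque

def betweenness(graph):
    """Betweenness for an undirected, unweighted graph.

    graph maps every node to a list of its neighbours; each neighbour
    must itself appear as a key.
    """
    score = {node: 0.0 for node in graph}
    for source in graph:
        # breadth-first search, counting the shortest paths from source
        order = []
        preds = {node: [] for node in graph}
        n_paths = {node: 0 for node in graph}
        dist = {node: -1 for node in graph}
        n_paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:               # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:    # v lies on a shortest path to w
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # walk back from the farthest nodes, passing credit to predecessors
        dependency = {node: 0.0 for node in graph}
        for w in reversed(order):
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # every pair was counted once from each end, so halve the totals
    return {node: value / 2 for node, value in score.items()}

# Invented toy network: t0 is a hub, t4 hangs off t3
toy = {
    "t0": ["t1", "t2", "t3"],
    "t1": ["t0"],
    "t2": ["t0"],
    "t3": ["t0", "t4"],
    "t4": ["t3"],
}
scores = betweenness(toy)
print(scores)
```

In the toy network t0 sits between five pairs of nodes and t3 between three, while the nodes on the edge score zero.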
What is perhaps more useful is the ‘modularity’. It’s just one of a number of algorithms one could use to try to find structural sub-groups in a network (nodexl has many more). But perhaps there are interesting geographical patterns if we examined the pattern of links. So I ran modularity, and uploaded the results to openheatmap to visualize them geographically. Network analysis doesn’t need to produce network visualizations, by the way.
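For the curious, the modularity score itself is simple to compute once you have a grouping in hand: for each group, take the fraction of edges that fall inside it and subtract the fraction you would expect if edges were placed at random given the nodes’ degrees. The sketch below only scores a grouping you give it on an unweighted network; it does not search for the best grouping, which is the hard part that Gephi’s routine does. The toy network is invented.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges is a list of (a, b) pairs; community maps each node to a group label.
    """
    m = len(edges)
    internal = {}   # number of edges with both ends in the group
    degree = {}     # sum of node degrees in the group
    for a, b in edges:
        group_a, group_b = community[a], community[b]
        degree[group_a] = degree.get(group_a, 0) + 1
        degree[group_b] = degree.get(group_b, 0) + 1
        if group_a == group_b:
            internal[group_a] = internal.get(group_a, 0) + 1
    return sum(
        internal.get(group, 0) / m - (total / (2 * m)) ** 2
        for group, total in degree.items()
    )

# Invented toy network: two triangles joined by a single edge
toy_edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in "abcdef"}
q_two = modularity(toy_edges, two_groups)
q_one = modularity(toy_edges, one_group)
print(round(q_two, 3), round(q_one, 3))
```

Splitting the two triangles apart scores well above lumping everything together, which scores zero; that gap is what the routine is hunting for when it proposes groups of districts.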
See the result for yourself here: http://www.openheatmap.com/embed.html?map=AnteriorsFrijolsHermetists
It colours each district based on the group that it belongs to. If you mouse-over a district, it’ll give you that group’s number – those numbers shouldn’t be confused with anything else. I’d do this in QGIS, but this was quicker for getting a sense of what’s going on.
I asked on Twitter (referencing a slightly earlier version) if these patterns suggested anything to any of the Romano-Britain crowd.
@electricarchaeo would be interesting to overlay civitas boundries/ Creightons IA coin core boundaries
— Phil Mills (@Tileman_and_son) May 29, 2013
Modularity for topic-topic also implies some interesting groupings, but these seem to mirror what one would expect by looking at their prominence in the keys.txt file. So that’s where I am now, soon to try out Phil’s suggestion.
As Paul Harvey was wont to say, ‘…and now you know… the REST of the story’. At DH2013 I hope to be able to tell you what all of this may mean.
Graeworks – my tenure and promotion online portfolio
My online tenure & promotion portfolio may be viewed at graeworks.net. It is a work in progress, so I would welcome comments and suggestions. I will be applying for t&p this coming autumn.
The department is currently in the midst of setting its own discipline specific language for what counts for tenure, and what counts for promotion. There’s been a lot of hard work on it, and I’m glad to see that there is specific recognition for digital work on its own merits (and not by drawing false equivalencies with print media). I have the option of going up under the earlier non-specific language, but I think I’ll swing for the bleachers here.
Keynote, Some Assembly Required, now on Youtube
‘Some Assembly Required’, my keynote at the Canadian Network for Innovation in Education is now available on youtube.
(If you don’t see me, I’m the second person on the playlist).
http://www.youtube.com/watch?v=K58DOSeQ0N4&feature=share&list=PLL1ugrJCNbcq8v_iIHqh8mCja6YFOuCEe
Topic Modeling an archaeological database: today’s adventures
If you follow me on twitter, and saw a number of bizarre/cryptic tweets today, I was live tweeting my work stream. This is what I did today – think of this as stream of consciousness over the last five hours.
- imported portable antiquities scheme database into access so I could work with it.
- queried it, selecting just those columns I was interested in
- exported back to csv
- cleaned up the data by removing ‘=’ signs (circular reference error in excel), names of liaison officers, meta notes from PAS on the quality of the record, and indications that the record was sourced from the work of Guest and Wells (nb, not any citations to them). also celtic coins index note.
- run a simple defaults topic model to get a sense of what words I need to add to a custom stopwords list.
- 552438 rows (id numbers run to 548561, so I must have lost some).
it occurs to me that I should have left the names of the liaison officers in, in case they get associated with a particular topic. d’oh.
bin\mallet import-file --input pasnebraska/everything.csv --output paseverything.mallet --keep-sequence --token-regex '[\p{L}\p{M}\p{N}]+' --remove-stopwords
bin\mallet train-topics --input paseverything.mallet --num-topics 50 --optimize-interval 20 --output-state topic-state.gz --output-topic-keys everything_keys.txt --output-doc-topics everything_composition.txt
- then run this? http://article.gmane.org/gmane.comp.ai.mallet.devel/1483/ yes, as per options suggested by @mwidner on twitter on may 14.
- I think these results will be more useful than the previous ones. Although I believe I forgot to optimize-interval. Yes, I did.
so, running this:
bin\mallet run cc.mallet.topics.tui.TopicTrainer --input paseverything.mallet --num-topics 50 --optimize-interval 20 --diagnostics-file everythingdiagnostics.xml --output-topic-keys everythingdiag_keys.txt --output-doc-topics everythingdiag_topics.txt --xml-topic-phrase-report everythingdiag_phrase.txt --xml-topic-report everythingdiag_topicreport.xml --topic-word-weights-file everythingdiag_word_weights.txt --word-topic-counts-file everythingdiag_word_counts.txt --output-state output-state.gz
looking at the results, it looks like the first two columns, first three? were taken to be labels. shite.
- reformat csv so that I have an id, and a text, per row.
- found formula to combine all columns into a single column. but blank rows are buggering things up:
=stoneage!B1&" "&stoneage!C1&stoneage!D1&" "&stoneage!E1&" "&stoneage!F1&" "&stoneage!G1&stoneage!H1&" "&stoneage!I1&" "&stoneage!J1&" "&stoneage!K1&stoneage!L1&" "&stoneage!M1&" "&stoneage!N1&" "&stoneage!O1&stoneage!P1&" "&stoneage!Q1&" "&stoneage!R1&" "&stoneage!S1&stoneage!T1&" "&stoneage!U1&" "&stoneage!V1&" "&stoneage!W1&stoneage!X1&" "&stoneage!Y1&" "&stoneage!Z1&" "&stoneage!AA1&stoneage!AB1&" "&stoneage!AC1
- returned to access database. gone with the april pas database (csv, download). importing selected columns, ignoring column shift. filtering out blank rows (and/or rows where everything’s column shifted all over the place)
- exporting by period. leaving liaison officers in. too bloody awkward to deal with the entire database at once.
- put all the columns into a single column, so now I have just two: an id number, and a ‘text’.
- imported, with regex, and diagnostics topic model,
weird errors when running the model.
- reimporting without regex.
- rerunning with diagnostics. looks much better.
topic composition file is crazy talk.
- ok, screw diagnostics. run normal. optimization 20, topics 50, for stoneage (9680 records).
ok, still the same problem with the composition file. What the hang?
- re-running without optimization.
nope, still getting this kind of thing:
#doc name topic proportion …
“0 51″ “FLAKE 10 0.480592529670674 44 0.3208674000538096 32 0.10756601869328378
“15 422″ BLADE NEOLITHIC -4000 -2200 “Black/grey 22 0.8415325393137797 30 0.12728676320083662
So, that says to me that something weird happened in the initial import. Yet topic keys seem to make sense.
Sigh…
Wait, over on Mallet page it says,
… the first token of each line (whitespace delimited, with optional comma) becomes the instance name, the second token becomes the label, and all additional text on the line is interpreted as a sequence of word tokens.
Simple as that? So I just need a bloody extra column in there. For the love of god…
- add column. filled it with document id (again).
- reimporting. no regex.
- running the topic model with diagnostics. 50 topics, optimized interval of 20.
By god, that was it.
Almost. Something still weird with the document names. Ah, found it. A blank in the first few rows.
- reimporting. no regex.
- running the topic model with diagnostics. 50 topics, optimized interval of 20.
SUCCESS!
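For next time, here is a sketch of how the ‘one id, one label, one text per row’ layout could be produced in code rather than with a spreadsheet formula. It glues a record’s columns into a single text, skipping the blanks that were causing trouble, and puts the id and a label in front, which is the order the MALLET page describes. The label ‘pas’ is an arbitrary placeholder I made up, the function name is mine, and the id must not contain spaces.

```python
def to_mallet_line(record_id, fields, label="pas"):
    """Build one import line: instance name, label, then the text."""
    words = []
    for field in fields:
        if field is None:
            continue
        # split() drops blank cells and collapses line breaks and extra spaces
        words.extend(str(field).split())
    return f"{record_id} {label} {' '.join(words)}"

# Invented sample records: (id, list of column values)
records = [
    (51, ["FLAKE", "", "NEOLITHIC", "  grey\nflint "]),
    (422, ["BLADE", None, "NEOLITHIC", -4000, -2200]),
]
lines = [to_mallet_line(record_id, fields) for record_id, fields in records]
print("\n".join(lines))
```

With the id in the first position and a label in the second, none of the description gets swallowed as a document name or label.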
Now to repeat with other periods (I elided palaeo, meso, and neolithic into a separate csv file). Then to interpret what all this means.
And I think I really need to reformulate my idea of ‘document’ to not be the individual rows, but rather the districts. I could pull all that out by hand, but I really want to figure out how to make the computer do that.
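One way to make the computer do the pulling-out is to group the rows by district and glue each group’s text together, so that every district becomes a single ‘document’. This is a sketch under assumptions: the rows are dictionaries (for instance from a csv reader that uses the header row), and the column names ‘district’ and ‘text’ are hypothetical, so swap in whatever the export actually calls them.

```python
from collections import defaultdict

def documents_by_district(rows, district_key="district", text_key="text"):
    """Merge the text of all rows that share a district into one document."""
    grouped = defaultdict(list)
    for row in rows:
        # normalise the name so 'Adur' and 'adur ' land in the same document
        district = (row.get(district_key) or "").strip().lower()
        if not district:
            continue  # records without a district cannot be placed
        grouped[district].append(" ".join(row[text_key].split()))
    return {district: " ".join(texts) for district, texts in grouped.items()}

# Invented sample rows
sample_rows = [
    {"district": "Adur", "text": "copper alloy brooch"},
    {"district": "adur ", "text": "silver   denarius"},
    {"district": "", "text": "unlocated find"},
    {"district": "Arun", "text": "nummus"},
]
docs = documents_by_district(sample_rows)
for name, text in docs.items():
    print(name, "->", text)
```

Each entry could then be written out as its own text file, giving a folder of district documents to import in one go.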
Anyway, some of the ‘topics’ from today’s adventures (what are ‘topics’ anyway? what might it mean, archaeologically, to think of these as ‘discourses’?, are some questions I need to ask):
implement lithic mesolithic neolithic flint flake bc blade dating date waste possibly period atherton rachel worked core flakes circa
grey flint colour brown dark neolithic light flake mottled cortex mid white cream flakes patina pale coloured knapped translucent
neolithic age bronze early flint date late scraper bc flake probable tool dating complete retouched knife made part tertiary
scraper end neolithic tool flint retouch semi edge circular thumb distal dorsal face abrupt side cortex nail thumbnail plan
core platform flint mesolithic single flakes removed cortex worked scars platforms tl pebble neolithic striking multi removals small blades
flake flint cortex rface dorsal grey retouch neolithic edge small damage face remaining scars edges brown secondary made white
bulb flake percussion platform striking end ventral rface flint proximal dorsal face neolithic scars small grey mid distal patina
side plan end profile margin left distal retouched dorsal proximal shaped flake flint convex scraper ventral snapped edge plano
mm mea length res width weighs flake flint neolithic ring wide long weighing adams kurt thick thickness west blade
section implement plan neolithic lithic triangular shaped date cross shape object oval rectangular rface sides roughly worked edge knapped
hill graham south flake west cornwall paul penwith flint sw margin cream end grey brown distal fig dorsal translucent
arrowhead neolithic leaf shaped flint tanged worked tip barbed point triangular broken early faces transverse missing oblique invasive tang
blade end distal mesolithic proximal snapped broken retouch flint ends parallel dorsal edges break incomplete missing sides damage long
Topic Modeling an Archaeological Database 2
Some things I have learned in recent days:
- data must be cleaned. Really. It’s probably still too noisy, even when you think it isn’t. Eliminate frequently occurring meta-notes (as it were). All citations to Guest & Wells on Coins in the UK, for instance, really muck things up.
- you can enter a single csv file as an input for MALLET. I knew this; but I had forgotten it, faced with a few hundred thousand rows of material (as I type this, the thought also occurs that I could run MALLET on the entire single file download I got from PAS, all ~500 000 rows. Presumably, locations and periods would sort themselves out into different topics?)
- MALLET considers letter characters to make up words. If you’ve got other stuff in there – numerals, for instance – that is significant, you’ll need to become familiar with --token-regex , which you’d use during that initial file-import. It was suggested to me to try these
--token-regex \s\d+\s
--token-regex '[\p{L}\p{M}\p{N}]+'
What else? Oh, that’s about all, for now. Oh, wait: custom stopwords. Instead of --remove-stopwords, you’ll want --extra-stopwords yourlist.txt . And your list has to be formatted so that there is whitespace between the words. I’m not sure if that means ‘white space’ like how you and I would figure it, or if that means ‘white space’ in some kind of crazy hidden code kind of way (like this in regex: \s ). If you open one of the default stopword lists, there doesn’t seem to be any hit-the-space-bar kind of white space of the sort I’d normally assume.
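To see what the letters-only default costs you, and what 'whitespace' can mean for a stopword list, here is a small Python sketch. It is an approximation, not MALLET itself: as far as I know Python's standard re module does not accept \p{L}-style classes, so the two patterns below are stand-ins (and they ignore combining marks). The sample description and stopwords are made up.

```python
import re

# Stand-ins for the Unicode property classes:
LETTERS_ONLY = re.compile(r"[^\W\d_]+")      # runs of letters, roughly \p{L}+
LETTERS_AND_DIGITS = re.compile(r"[^\W_]+")  # letters and numerals, roughly [\p{L}\p{N}]+

def tokenize(text, pattern):
    """Lower-cased tokens, where a token is whatever the pattern matches."""
    return [token.lower() for token in pattern.findall(text)]

def parse_stopwords(raw):
    """Split a stopword list on any whitespace: spaces, tabs or newlines."""
    # str.split() with no argument treats every run of whitespace as one separator.
    return set(raw.split())

description = "Denarius of AD 69, 18mm wide"
stop = parse_stopwords("of\nwide\tthe and")

letters = [t for t in tokenize(description, LETTERS_ONLY) if t not in stop]
with_numbers = [t for t in tokenize(description, LETTERS_AND_DIGITS) if t not in stop]

print(letters)       # the numerals are gone, and '18mm' is reduced to 'mm'
print(with_numbers)  # '69' and '18mm' survive
```

In this sketch a newline counts as whitespace just as a space does, which would explain a list with one word per line and no visible spaces. I am assuming, not asserting, that MALLET reads its lists the same way.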
Onwards!
Topic Modeling the Portable Antiquities Scheme
I got my hands on the latest build of the Portable Antiquities Scheme database. I want to topic model the items in this database, to look for patterns in the small material culture of Britain, across time and space.
The data comes in a single CSV, with approximately 500 000 individual rows. The data’s a bit messy, as a result of extra commas slipping in here and there. The names of the Finds Liaison Officers slip into a column meant to record epigraphic info from coins, for instance, from time to time. Not a big deal, over 500 000 records.
The first issue I had was that after opening the CSV file in Excel, Excel would regard all of those epigraphic conventions (the use of =, +, or [ ] and so on) as formulae. This would generate ‘circular reference’ errors. I could sort that out by inserting a ‘ at the beginning of that column. But as you can imagine, sorting through, filtering, or any kind of manipulation of a single table that large would slow things considerably – and frequently crashed this poor ol’ desktop. I tried using Open Refine to clean up the data. I suspect with a bit of time and effort I’d be able to use that product well, but yesterday all I achieved, once I imported my csv file and clicked ‘make project’, was an ‘undefined error’ (after several minutes of chugging). This morning, I turned to Access and was able to import the csv, and begin querying it, cleaning things up a bit, and so on.
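The apostrophe trick could also be applied before the file ever reaches a spreadsheet. A minimal Python sketch, assuming the trouble comes from cells that start with =, +, - or @ (my guess at the trigger characters) and using a made-up column name, inscription:

```python
# Characters assumed to make a spreadsheet read a cell as a formula.
FORMULA_STARTS = ("=", "+", "-", "@")

def guard_cell(value):
    """Prefix an apostrophe so the cell is read as text, not as a formula."""
    if value.startswith(FORMULA_STARTS):
        return "'" + value
    return value

def guard_column(rows, field):
    """Return copies of the rows with one column made spreadsheet-safe."""
    return [{**row, field: guard_cell(row.get(field) or "")} for row in rows]

# Made-up rows standing in for the epigraphic column.
rows = [
    {"id": "1", "inscription": "=IMP CAES"},
    {"id": "2", "inscription": "[..]VESP"},
    {"id": "3", "inscription": "+REX"},
]
guarded = guard_column(rows, "inscription")
for row in guarded:
    print(row["id"], row["inscription"])
```

Only the cells that would be misread get the prefix, so the apostrophes can be stripped again later without touching the rest of the column.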
So I decided to focus on the Roman records, for the time being. There are some 66 000 unique records, coming from over 80 unique districts of the UK. This leaves me with a table with the chronological range for the object, a description of the object, and some measurements. I have a script that can take each individual row, and turn it into a txt file which I can then import into MALLET. Each individual row can also include the district name.
So I’m wondering now: should I just cut and paste all of the rows for a single district into a single txt file (and thus the routine will not have the place-name in the analyzed text)? Or should I preserve the granularity, and just topic model over every record, preserving the place name? I.e., a collection of 80 txt files where there are no place names, or a collection of 66 000 txt files where every file has the place name. In the second case, will the place names swamp the signals?
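The fine-grained option can be tried both ways cheaply. Here is a Python sketch (not the script I actually have) that writes one txt file per record, with a switch for whether the district name goes into the text. The record fields and file names are invented for the example, and the files are written to a temporary folder.

```python
import tempfile
from pathlib import Path

def write_record_files(records, out_dir, include_place=True):
    """Write one txt file per record, optionally appending the district name."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for rec in records:
        parts = [rec["description"]]
        if include_place:
            # The place name becomes a word in the text that gets modelled.
            parts.append(rec["district"])
        path = out_dir / f"{rec['id']}.txt"
        path.write_text(" ".join(parts), encoding="utf-8")
        paths.append(path)
    return paths

# Made-up records standing in for rows of the Roman table.
records = [
    {"id": "r1", "district": "Penwith", "description": "flint flake"},
    {"id": "r2", "district": "Cotswold", "description": "copper alloy brooch"},
]

with tempfile.TemporaryDirectory() as tmp:
    with_place = [
        p.read_text(encoding="utf-8")
        for p in write_record_files(records, Path(tmp) / "with", include_place=True)
    ]
    without_place = [
        p.read_text(encoding="utf-8")
        for p in write_record_files(records, Path(tmp) / "without", include_place=False)
    ]

print(with_place)
print(without_place)
```

Running the topic model over both folders would show directly whether the place names take over the topics.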
It’s too early in the morning for this kind of thinking.
Some Assembly Required: teaching through/with/about/by/because of, the Digital Humanities (slides & notes)
I’m giving a keynote address to the Canadian Network for Innovation in Education conference, at Carleton on Thursday (10.30, River Building). I’ve never done a keynote before, so I’ll confess to being a bit nervous. ‘Provoke!’ I’ve been told. ‘Inspire! Challenge!’ Well, here goes….
These are the slides and the more-or-less complete speaker’s notes. I often write things out, and then completely adlib on the day, but this is more or less the flavour I’m going for.
[Title]
I never appreciated how scary those three words were until I had kids. ‘Some assembly required’. That first Christmas was all, slide Tab A into Slot B. Where’s the 5/8ths gripley? Is that an Allen key? Why are there so many screws left over? The toys, with time, get broken, get fixed, get recombined with different play sets, are the main characters and the exotic locales for epic stories. I get a lot of mileage out of the stories my kids tell and act out with these toys.
My job is the DH guy in the history department. DH, as I see it, is a bit like the way my kids play with the imperfectly built things – it’s about making things, about breaking things, about being playful with those things. This talk is about what that kind of perspective might imply for our teaching and learning.
[2]
I don’t know what persuaded my parents that it’d be a good idea to spend $300 in 1983 dollars on a Vic-20, but I’m glad they did. You turn on your iPad, it all just happens magically, whoosh! In those days, if you had a computer, you had to figure out how to make it do stuff, the hard way. A bit disappointing, that first ‘Ready’ prompt. Ready to do what? My brothers and I wanted to play games. So, we sat down to learn how to program them. If you had a Vic-20, do you remember how exciting it was when that ball first bounced off the corners of your screen? A bit like the apes in the opening scene of ‘2001’. At least, in our house.
[3]
‘WarGames’, the film with Matthew Broderick. This scared me; but I loved the idea of being able to reach out to someone else, someone far from where I lived in Western Quebec. So we settled for occasional trips to the Commodore store in Ottawa, bootleg copies of Compute! Magazine, and my most treasured book, a ‘how to make adventure games’ manual for kids, that my Aunt purchased for me at the Ontario Science Centre.
[4]
Do you remember old-school text adventures? They’re games! They promote reading! Literacy! They are a Good Thing. Let’s play a bit of this game, ‘Action Castle’, to remind us how they worked.
To play an interactive fiction is to foreground how the rules work; it’s easy to see, with IF. But that same interrogation needs to happen whenever we encounter digital media.
[5]
Games like Bioshock – a criticism of Randian philosophy. Here, the interplay between the rules and the illusion of agency is critical to making the argument work.
When you play any kind of game, or interact with any kind of medium, you generally achieve success once you begin to think like the machine. What do games teach us? How to play the game: how to think like a computer. This is a ‘cyborg’ consciousness. The ‘cyb’ in ‘Cyborg’ comes from the Greek for ‘governor’ or ‘ship’s captain’. Who is doing the governing? The code. This is why the humanities NEED to consider the digital. It’s too important to leave to the folks who are already good at thinking like machines. This is the first strand of what ‘digital humanities’ might mean.
[6]
A second strand comes from that same impulse that my brothers and I had – let’s make something! Trying to make something on the computer inevitably leads to deformation. This deformation can be on purpose, like an artist; or it can be accidental, a result of either the user’s skill or the way that the underlying code imagines the world to work.
[7]
‘Historical Friction’ is my attempt to realize a day-dream: what if the history of a place was thick enough to impede movement through it? I knew that a) I could find enough information about virtually everywhere on Wikipedia; b) I could access this through mobile computing; and c) what often stops me in my tracks is not primarily visual but rather auditory. But I don’t have the coding chops to build something like that from scratch.
What I can do, though, is mash things together, sometimes. But when I do that, I’m beholden to design choices others have made. ‘Historical Friction’ is my first stab at this, welding someone else’s Wikipedia tool to someone else’s voice synthesizer. Let’s take a listen.
…So this second strand of DH is to deform (with its connotations of a kind of performance) different ways of knowing.
[8]
A third strand of DH comes from the reflexive use of technology. My training is in archaeology. As an archaeologist, I became Eastern Canada’s only expert in Roman Brick Stamps. Not a lot of call for that.
But I recognized that I could use this material to extract fossilized social networks, that the information in the stamps was all about connections. Once I had this social network, I began to wonder how I could reanimate it, and so I turned to simulation modeling. After much exploration, I’ve realized that what I resurrect on these social networks is NOT the past, but rather the story I am telling about the past. I simulate historiography. I create a population of zombie Romans (individual computing objects) and I give them rules of behavior that describe some phenomenon in the past that I am interested in. These rules are formulated at the level of the individual. I let the zombies go, and watch how they interact. In this way, I develop a way to interrogate the unintended or emergent consequences of the story I tell about the past: a kind of probabilistic historiography.
So DH allows me to deform my own understandings of the world; it allows me to put the stories I tell to the test.
[9]recap
There’s an awful lot of work that goes under the rubric of ‘digital humanities’. But these three strands are I think the critical ones for understanding what university teaching informed by DH might look like.
[10]
Did I mention my background was in archaeology? There’s a lot that goes under the rubric of ‘experimental’ archaeology that ties in to or is congruent with the digital humanities as well. Fundamentally, you might file it under the caption of ‘making as a way of knowing’.
[11]
Experimental archaeology has been around for decades. So too has DH (and its earlier incarnation as ‘humanities computing’) which goes back to at least the 1940s and Father Busa, who famously persuaded IBM to give him a research lab and computer scientists to help him create his concordance of the word ‘praesens’ in the writings of Thomas Aquinas.
So despite the current buzz, DH is not just a fad, but rather has (comparatively) deep antecedents. The ‘Humanities’ as an organizing concept in universities has scarcely been around for much longer.
[12]
So let’s consider then what DH implies for university teaching.
[13]salt
But I feel I should warn you. My abilities to forecast the future are entirely suspect. As an undergrad, in 1994, I was asked to go on the ‘world wide web’, this new thing, and create an annotated bibliography concerning as many websites as I could that dealt with the Etruscans. The first site I found (before the days of content filters) was headlined, ‘the Sex Communist Manifesto’. Unimpressed, I wrote a screed that began, “The so-called ‘world wide web’ will never be useful for academics.”
Please do take everything I say then with a grain or two of salt.
[14]
Let me tell you about some of the things I have tried, built on these ideas of recognizing our increasingly cyborg consciousness, deformation of our materials, and of our perspectives. I’m pretty much a one-man band, so I’ve not done a lot with a lot of bells and whistles, but I have tried to foster a kind of playfulness, whether that’s role-playing, game playing, or just screwing around.
[15]epic fails
Some of this has failed horribly; and partly the failure emerged because I didn’t understand that, just like digital media, our institutions have rule sets that students are aware of; sometimes, our ‘best’ students are ‘best’ not because they have a deep understanding of the materials but rather because they have learned to play the games that our rules have created. In the game of being a student, the rules are well understood – especially in history (which is where I currently have my departmental home). Write an essay; follow certain rhetorical devices; write a midterm; write a final. Rinse. Repeat. Woe betide the prof who messes with that formula!
I once taught in a distance ed program, teaching an introduction to Roman culture class. The materials were already developed; I was little more than a glorified scantron machine. I was getting essay after essay that contained clangers along the lines of, ‘Vespasian won the civil war of AD 69, because Vespasian was later the Emperor.’ I played a lot of Civilization IV at the time, so I thought, I bet if I could get students to play out the scenario of AD 69, students would understand a lot more of the contingency of the period, that Vespasian’s win was not foreordained. I crafted the scenario, built an alternative essay around it (‘play the scenario, contrast the game’s history with “real” history’), found students who had the game. Though many played it, they all opted to just write the original essay prompt. My failure was two-fold. One, ‘playing a game for credit’ did not mesh with ‘the game of being a student’; there was no space there. Two, I created a ‘creepy treehouse’, a transgression into the students’ world where I did not belong. Profs do not play games. It’d be like inviting all my students to friend me on Facebook. It was creepy.
I tried again, in a history course last winter. The first assessment exercise – an icebreaker, really – was to play an interactive fiction that recreated some of the social aspects of moving through Roman space. The player had to find her way from Beneventum to Pompeii, without recourse to maps. What panic! What chaos! I lost a third of the class that week. Again, the concern was, ‘how does playing a game fit into the game of being a student’. Learning from the previous fiasco, I thought I’d laid a better foundation this time. Nope. The thing I neglected: there is safety in the herd. No one was willing to play as an individual and submit an individual response – ‘who wants to be a guinea pig?’ might have been the name of THIS game, as far as the students were concerned. I changed course, and we played it as a group, in class. Suddenly, it was safe.
[16]epic wins
But from failure, we learn, and we sometimes have epic wins (failures almost always are more interesting than wins). Imagine if we had a system that short-circuited the game of being a student, to allow students the freedom to fail, to try things out, and to grow! One of the major fails of my Year of the Four Emperors experiment was that it was I who did all the building. It should’ve been the students. When I built my scenario, I was doing it in public on one of the game’s community forums. I’ve since started crafting courses (or at least, trying to) where the students are continually building upwards from zero, where they do it in public, and where all of their writing and crafting is done in the open, in the context of a special group. This changes the game considerably.
[17]
To many of you, this is no doubt a coals-to-Newcastle, preaching-to-the-choir kind of moment.
[18]
And again, I hear you say, what would an entire university look like, if all this was our foundation? Well, it’s starting to look a little better than it did when we first asked the question…
[19]dh will save us
…but DH has been pushed an awful lot lately. DH will save us! It’ll make the humanities ‘relevant’: to funding bodies, to government, to parents! Just sprinkle DH fairy dust, and all will be safe, right?
[20]memes & dark side
You’ve probably heard that. It’s happened enough that there are even memes about it.
Yep. No doubt – a lot of folks are sick of hearing about ‘the digital humanities’. At the most recent MLA, there was a good deal of pushback, including a session called ‘the dark side of DH’. Wendy Chun wrote,
“For today, I want to propose that the dark side of the digital humanities is its bright side, its alleged promise: its alleged promise to save the humanities by making them and their graduates relevant, by giving their graduates technical skills that will allow them to thrive in a difficult and precarious job market. Speaking partly as a former engineer, this promise strikes me as bull: knowing GIS or basic statistics or basic scripting (or even server side scripting) is not going to make English majors competitive with engineers or CS geeks trained here or increasingly abroad […] It allows us to believe that the problem facing our students and our profession is a lack of technical savvy rather than an economic system that undermines the future of our students.”
(That’s not a DH that I recognize, by the way, as I hope you’ll have noticed given my three strands).
Now, I wasn’t at that meeting, but I saw a lot of chatter flutter by that day, as in that same session MOOCs were conflated with the digital humanities; that somehow the embrace of DH enables the proliferation of MOOCs. As Amanda French, who has coordinated an extraordinary number of digital humanities ‘THATCamp’ conferences, has said, “I don’t know a single digital humanist who likes MOOCs.”
We’ve heard a lot about MOOCs today, and I’m certainly in no position to critique them as I’ve never offered nor successfully finished one. But as I’ve identified the strands of DH today, there *is* an affinity though with the so-called ‘cMOOC’.
[21]Know Your MOOCs
Before there was Coursera, Udacity, and glorified talking heads over the internet, there was the cMOOC. The Canadian MOOC. The personal learning environment. Isn’t it interesting that Pearson, a textbook publisher, is a heavy investor in the MOOC scene? Frankly, as xMOOCs are currently designed, they seem to me to be a challenge to publishers of textbooks rather than to teaching. We can do better, and I think DH ties well with the idea of personal learning environments. ‘Massive’ is not, in and of itself, a virtue, and we’d do well to remember that.
[22]Rainbow Castle
So, following my three strands, we’d:
[23]
-identify the ways our institutions and our uses of technology force particular ways of thinking
-we’d deform the content we teach
-we’d set up our institutions and our uses of technology to deform the way our students think: including the ways our institutions are set up.
[24]
So let’s turn the university inside out. It’s been about silos for so long (also known as ivory towers). I grew up on a farm: do you know what gets put into a silo, what comes out? It’s silage, chopped up, often a bit fermented, cattle food: pre-processed cud. Let’s not do that anymore.
[25]Walled Gardens, online dating
For all their massiveness, MOOCs and Universities are still walled gardens. And what’s the unit of connection? It’s the course. It’s the container. I used to work with a guy who often said, ‘once we get the contract, we’ll just get monkeys to do the work’. That guy is no longer in business. I used to work for a for-profit university in the States that had a similar approach to hiring online faculty.
MOOCs are not disruptive in that sense. Want to be really disruptive? Let’s turn to a model that massively connects people together who have a shared interest. I hereby banish the use of any metaphor that frames the relationship at a university in terms of clients, or customers. Instead, what if the metaphor used was more in line with a dating service?
In online dating, the site brings together two kinds of people, both looking for the same thing. Typically, the men pay a fee to be on the site; women are wooed to the site by all sorts of free promos etc. No point having a dating site that does not have any available ‘others’ on it. In which case, the university could be in the business of bringing together students [the ‘men’] with faculty [the ‘women’]. If a university had that metaphor in its mind, it would be thinking, ‘what can we do to make our site – the university – an attractive place for faculty to be?’ Imagine that!
Students would not be signing up for classes, but rather, to follow and learn from particular profs. Typically on something like eBay or a dating site, there are reputation systems embedded in the site. You do not buy from the person with the bad rep in eBay; you do not contact the person whose profile has gotten many negative reviews. Since the university knows the grades of the students and has teaching evaluations and other indicators of faculty interests and reputations, it has the ability to put together faculty and students in a dynamic way. “Others who have enjoyed learning about Roman civilization with Dr. Graham have loved learning about Bronze Age Greece with…”. Wouldn’t it be something to allow students to select their areas of interest knowing the reputation of the profs who work in a particular area; and for profs to select their students based on their demonstrated interests and aptitudes? Let faculty and students have ‘tokens’ – this is my first choice, this is my second choice, this is my third choice prof/student to work with for the session. Facilitate the matching of students and faculty. Let the student craft their way through university following individuals, and crafting a ‘masterpiece’ for their final demonstration of making as a way of knowing, for their BA? Hmmm. Kinda sounds like a return to the Guild, as it were.
You might not like that, which is fine; there are probably better ideas out there. We’ve got all this damned information around! Maybe there are earlier models that could work better with our new technologies, maybe there are new models for our new techs. But surely we can do better than merely replicate processes that were designed for the late 19th and early 20th century? Whatever metaphor we use to frame what the university does, it goes a long way to framing the ways learning can happen. That’s what DH and its exploration of a cyborg consciousness should make us at least explore.
[26]domain of one’s own
And once we’ve done that, let’s have some real openness. Let the world see that faculty-student, and student-student, relationship develop. Invite the rest of the world in. Folks like Ethan Watrall at MSU already do that for their on-campus courses putting all course materials and assessment activities on open websites, inviting the wider world to participate and to interact with the students.
Give every student, at the time of registration, a domain of their own, like Mary Washington is starting to do. Pay for it, help the student maintain it, for their time at university. At graduation, the student could archive it, or take over its maintenance. Let the learning community continue after formal assessment ends. The robots that construct our knowledge from the world wide web – Google and the content aggregators – depend on strong signals, on a creative class. If each and every student at your institution (and your alumni!) is using a domain of their own as a repository for their own IP, a personal learning environment, a node in a frequently re-configuring network of learners, your university would generate real gravity on the web, become the well out of which the wider world draws its knowledge. Use the structure and logic of the web to embed the learning life of the university so deeply into the wider world that it cannot be extricated!
[27]
Because right now, that’s not happening. If you study the structure of the web for different kinds of academic knowledge (here, Roman archaeology), there’s a huge disconnect between where the people are, and where the academics are. If we allow that to continue, it becomes increasingly easy for outsiders to frame ‘academic’ knowledge as a synonym for ‘pointless’. With the embedded university, the university inside out, there are no outsiders. If we embed our teaching through the personal learning environments of our students, our research production will become similarly embedded.
[28]
If the university is inside out, and not in splendid isolation, then it is embedded.
Forget massively ‘open’.
Think massively embedded.
Think massively accessible.
(Not the best image I could find, but hey! that boulder, part of a structure, is embedded in a massively accessible landscape.)
[29]Check mark list
So what’s tuition for, then? Well, it’s an opportunity to have my one-on-one undivided attention; it’s icetime, an opportunity to skate. But we need to have more opportunities for sideways access to that attention too, for people who have benefited from participating in our openness, our embeddedness, to demonstrate what they’ve learned. There’s much to recommend in Western Governors University’s approach to the evaluation of non-traditional learners.
[30]
The digital humanities, as a perspective, has changed the way I’ve come to teach. I didn’t set out to be a digital humanist; I wanted to be an archaeologist. But the multiple ways in which archaeological knowledge is constructed, its pan-disciplinary need to draw from different wells, pushed me into DH. There are many different strands to DH work; I’ve identified here what I think are three major ones that could become the framework, the weave and the weft, for something truly disruptive.






