Category Archives: archaeology

Where Roman Roads and Topic Models Intersect

Previously, I ended up with a map of UK districts, coloured by the five groups that Gephi’s modularity routine suggested were present, in the network of districts to districts based on shared patterns in the underlying topics (the topic model generated from the total dump of the Portable Antiquities Scheme database).

I asked on twitter if the patterns seemed evocative of anything; Phil Mills suggested that they perhaps matched civitas boundaries. He provided me with an image of those boundaries (thanks Phil!) as well as some kmz files. Below are two images, one with civitas capitals (hand-drawn in by me) and one with Roman roads. Together, they are evocative. Blocks of colour seem to go very well with civitas boundaries; where blocks of colour overlap those boundaries, they seem to follow the routes of the roads. And all this from looking at topic models! I think it is getting progressively safer to say that the patterns found in an archaeological database through topic modelling are indeed meaningful on the ground. The factors of government, of identity, of mobility, seem to emerge in the topic model.

UK Districts by Modularity, overlain with hand-drawn civitas boundaries

Roman roads overlain on same.

Reading Inscriptions Algorithmically

Inscriptions are complicated beasts. Although they are frequently quite small and incomplete, epigraphers are able to extract an enormous amount of information from them – especially when they have other inscriptions with which to contrast and compare. Let us look at the inscriptions from Aphrodisias, which are published online following EpiDoc conventions. Because of this, we are able to do some data-mining on them with a minimum of pre-processing.

(Joyce Reynolds, Charlotte Roueché, Gabriel Bodard, Inscriptions of Aphrodisias (2007), available <http://insaph.kcl.ac.uk/iaph2007>, ISBN 978-1-897747-19-3.)

The first one looks like this, when the xml tags are stripped away:

Reynolds1982
Creative Commons licence Attribution 2.5 (http://creativecommons.org/licenses/by/2.5/)
All reuse or distribution of this work must contain somewhere a link back to the URL http://insaph.kcl.ac.uk/
Originally published in Reynolds (1982).
English French German Ancient Greek Transliterated Greek Modern Greek Italian Latin Spanish Turkish 2007-07-04cmrDONE2007-04-02Charlotte Tupmanhand tidiedGBhand tidied 2007-03-15Elliott HallBatch converted Word2XML

Description of MonumentUpper right corner of a white marble block (0.36 x 0.24 x 0.34).
Description of TextInscribed on one face.
LettersLate Republican or Augustan; ave 0.02. rho in ll. 1, 3, 6 has a very small stroke slanting rightwards from the junction of the bowl with the vertical.
Date Late Republican or Augustan (lettering, content)
Edition οὗτος ὁ τόπος ἱερὸς ἄσυλος ὡς ἔκριναν ὁ μέγας Καῖσαρ ὁ δικτάτωρ καὶ ὁ υἱὸς αὐτοῦ αὐτοκράτωρ Καῖσαρ καὶ ἡ σύνκλητος καὶ ὁ δῆμος ὁ Ῥωμαίων καθὼς καὶ τὰ φιλάνθρωπα καὶ δελτογραφήματα καὶ ἐπικρίματα περιέχει ἀνέστησεν δὲ τοὺς ὅρους Γάϊος Ἰούλιος Ζωΐλος ὁ ἱερεὺς τῆς Ἀφροδείτης
Apparatus For the supplements, compare the partner inscription 1.38.
Translation [?This area is] the sacred asylum [?as defined by] the great [?Caesar, the] Dictator, and [?his son] Imperator [Caesar and the ] Senate [and People] of Rome, [as is also contained in the] grants of privilege, the public documents [and decrees. C. Iulius Zoilos priest of Aphrodite set up the boundary stones.]
Commentary See , 159-160.
Locations Stray find. Temple/Church temenos. Museum (1977)
Text Constituted From Transcription (Reynolds)
History of Recording Recorded by the NYU expedition in 1963 (63.596)
Bibliography Published by Reynolds, , doc. 35, whence SEG 1982.1097, BE 1983.388, 1984.878, McCabe 379, R. R. R. Smith, (Mainz, 1993) T5.

Photographs

Face (1977)

There’s a lot of meta information that goes along with a single inscription above and beyond its transcription and translation, all of which is necessary to understand its possible significance. I don’t think there’s a better illustration of what ‘close reading’ might mean in archaeology than the epigrapher’s art.

What might we spot if we look at a corpus of inscriptions from a macro level? What patterns might exist? Is there something going on related to geography? Researcher? Language of the inscription? Publication history? Dating? This is where the algorithms of topic modeling might be useful. My go-to tool for this is MALLET. MALLET allows one to strip out all of the xml tags (see MALLET’s help from the command line for import-dir), so I can download the xml files as a zip from the Inscriptions of Aphrodisias site, and begin exploring for patterns. I optimize the interval too when I train the topic model, to shake out the utility of the resulting ‘topics’. I began by modeling 50 topics.
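MALLET does the tag-stripping itself, but it can help to see what is left of a record once the markup is gone. The following is a minimal sketch in Python, not MALLET's own method: the function name and the sample fragment are invented for illustration, and it assumes a well-formed file with no undefined entities (real EpiDoc files may need more careful parsing).

```python
import xml.etree.ElementTree as ET


def strip_tags(xml_text: str) -> str:
    """Return the text content of an XML document with all tags removed."""
    root = ET.fromstring(xml_text)
    # itertext() walks every element in document order and yields its text.
    pieces = (piece.strip() for piece in root.itertext())
    return " ".join(piece for piece in pieces if piece)


# A made-up fragment in the spirit of an EpiDoc file (not a real record).
sample = (
    "<TEI><teiHeader><title>Building inscription</title></teiHeader>"
    '<text><div type="translation"><p>Me also '
    "<persName>Helladios</persName> the pure</p></div></text></TEI>"
)

plain = strip_tags(sample)
print(plain)
```

Note that joining the pieces with spaces will split a word wherever a tag sits in the middle of it, which may matter for heavily marked-up Greek editions.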

You can download the MALLET file and results here, to play with and explore for yourself.

When I look at the results (inscriptionkeys.txt), the ‘strongest’ topics all relate to metadata regarding their online publication (the top 3). The next few clearly relate to the researchers who are behind the inscriptions of Aphrodisias website, so not overly useful for me here. The next couple seem to be a mixture of findspot information and publishing history:

topic6 0.34603 unpublished fragment reynolds face museum version lettering inscription born digital joyce unknown marble expedition white centuries nyu inscribed stray
topic32 0.23776 face upper moulding left side lettering part lower expedition museum white marble nyu broken aphrodisias asia corner inscribed front
topic39 0.15711 south walls east face west block part wall gate expedition findspot city stretch tupman mama lettering depth measurable marble
topic7 0.14493 mama gaudin published reinach mccabe cormack kubitschek squeeze notebook phi expedition records originally aphrodisias reichel publications recorded charlotte representations
topic43 0.13213 mccabe published originally bodard gabriel rouech aphrodisias phi description findspot reported subsequently charlotte unknown preliminary inscription tidied publication funerary

The remaining topics all deal explicitly with the inscriptions themselves, their texts and their findspots (it seems).

topic47 0.06106 son honoured honours people council claudius priest diogenes family tiberius man high cl public gerousia lived virtue life zenon
topic8 0.05807 roman family wife names father aphrodisian case daughter suggests citizenship early century reference possibly menodotos clear named civic late
topic38 0.05137 son zenon adrastos attalos dionysios athenagoras artemidoros apollonios hypsikles aphrodite diogenes daughter early tupman menestheus cf goddess grandson sons

Groupings in Inscriptions of Aphrodisias

Every file is composed of all these different topics, to differing degrees. I would like to visualize the paths of these discourses through the corpus, so I translate the inscriptioncomp.txt file so that I end up with at least 9/10s of each document’s composition (in practice, this means cutting and pasting the inscriptioncomp.txt file so that I end up with a single list with source-document, target-topic, and weight). I also filtered out those strongest topics described above (5,6,7,9,16,17,29,39,43).
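The cut-and-paste step could be scripted along these lines. This is only a sketch under stated assumptions: the function names and the sample document are hypothetical, and it assumes that the excluded topics still count towards the nine-tenths before being dropped, which the text above does not specify.

```python
# Topics filtered out in the text above.
EXCLUDED = frozenset({5, 6, 7, 9, 16, 17, 29, 39, 43})


def strongest_topics(pairs, coverage=0.9, excluded=EXCLUDED):
    """Keep a document's heaviest topics until `coverage` of it is explained."""
    kept, total = [], 0.0
    for topic, weight in sorted(pairs, key=lambda p: p[1], reverse=True):
        if total >= coverage:
            break
        total += weight  # assumption: excluded topics still count here
        if topic not in excluded:
            kept.append((topic, weight))
    return kept


def edges_for(doc_name, pairs):
    """Rows of (source-document, target-topic, weight) for one document."""
    return [(doc_name, f"topic{t}", w) for t, w in strongest_topics(pairs)]


# Invented composition for one document: (topic number, proportion).
edges = edges_for("doc-a", [(0, 0.5), (6, 0.3), (37, 0.15), (1, 0.05)])
print(edges)
```

Here topic 6 is dropped because it is on the excluded list, and topic 1 is dropped because the first three topics already cover more than nine-tenths of the document.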

I imported this list into Gephi, and set about trying to find groupings of topics and inscriptions, based on the shared patterns (and the weighting) of relationships. I coloured it by group (modularity) and resized nodes based on ‘betweenness’. What does betweenness mean here? I think it means the principal ideas (the discourse) that tie this entire collection together. In this case, topic 0:

statue base honours shaft ll moulding set city sbi council feature honoured capital aurelius prosopography top moulded ligatures antonius

followed closely by 1 and 37:

topic1 sarcophagus funerary inscription front lid standard necropolis aurelius forms buried tomb east formula elements aur rim burial end line

topic37 city face village house inscribed recording edition wall block text transliterated unknown large line greek area lettering viii marble

Topics - topics, Inscriptions of Aphrodisias

It might be that these most ‘between’ topics are not the ones that are archaeologically interesting. This is of course a 2-mode network (inscriptions-topics), so it might be desirable to consider this data as two 1-mode networks: inscriptions – inscriptions by virtue of shared topics, and topics – topics by virtue of shared inscriptions. When we take topics – topics, running our familiar grouping and betweenness metrics, topic 37 comes out on top, followed by 10 and 33:

topic10 building reynolds blocks block son architrave published theatre decoration papers fasciae dedication end aphrodite people aphrodisias fascia found demos

topic33 ii iii iv cut text left cross fortune monogram mccabe letters end triumphs broken vi acclamation texts drawing vii
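Splitting a 2-mode network into two 1-mode networks can be sketched as follows. This is a simplified illustration rather than what Gephi does internally: it only counts how many 'bridges' two nodes share and ignores the topic weights, and the function name and toy data are invented.

```python
from collections import defaultdict
from itertools import combinations


def project(links):
    """links: iterable of (node, bridge) pairs.

    Returns {(node_a, node_b): number of bridges the two nodes share}.
    """
    nodes_by_bridge = defaultdict(set)
    for node, bridge in links:
        nodes_by_bridge[bridge].add(node)
    shared = defaultdict(int)
    for nodes in nodes_by_bridge.values():
        for a, b in combinations(sorted(nodes), 2):
            shared[(a, b)] += 1
    return dict(shared)


# Toy 2-mode edge list: (inscription, topic, weight).
edges = [
    ("i1", "t0", 0.6), ("i1", "t1", 0.3),
    ("i2", "t0", 0.5), ("i2", "t1", 0.4), ("i2", "t37", 0.1),
]

# Topics linked by shared inscriptions, and inscriptions linked by shared topics.
topic_topic = project((topic, doc) for doc, topic, _ in edges)
inscription_inscription = project((doc, topic) for doc, topic, _ in edges)
print(topic_topic)
print(inscription_inscription)
```

Because every pair of inscriptions sharing even one topic gets an edge, the inscriptions – inscriptions projection grows very quickly with the size of the corpus.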

When we turn the two mode network into an inscriptions – inscriptions by virtue of shared topics, we end up with a monster of a graph: 1505 nodes (inscriptions), with 241,002 relationships! The most between inscription is iAph050118:

Building inscription of Helladios
Charlotte M. Roueché2007
Creative Commons licence Attribution 2.5 (http://creativecommons.org/licenses/by/2.5/)
All reuse or distribution of this work must contain somewhere a link back to the URL http://insaph.kcl.ac.uk/
Originally published in Roueché (2004).
English French Ancient Greek Transliterated Greek Latin AsiaTurkey
Aphrodisias
Geyre
2004-06-08Gabriel BodardChecked and fixed all image divs and refs 2004-03-16 Gabriel Bodard Completed lemmatisation, checked figure ids, tagged keywords 2003-11-04John LavagninoConverted beta code to Unicode 2003-05-27 Gabriel Bodard tidied and corrected 2003-04-30 Juan Garcés tidied and corrected 2003-06-22CMRtagged, tidied and corrected2003-07-14JLGLemmatised2003-08-20CMRname tags reduced2004-01-16CMRtidied; image refs2003-05-27Gabriel BodardTyped and marked-up Greek Description of Monument

A rectangular white marble block, perhaps from a lintel (0.285 × 0.665 × 0.50) with simple moulding above and below on one face. Chipped to the right, but complete.
Description of Text

Inscribed on the moulded face, in one line on the surface between the mouldings, which is slightly concave. The text must have continued onto an adjacent block.

Description of Letters
Flowing style, similar to 5.302, 5.119 and 4.120; 0.05-0.06.

Date
First half of the fourth century (lettering, prosopography).
Edition
κἀμὲ
Ἑλλάδιος
ὁ ἁγνός

Translation
Me also Helladios the pure

Commentary
For Helladios see also 1.131, 4.120 and discussion at II.35.

Locations
Hadrianic Baths: central chamber. Unknown. Findspot (1972)..

History of Recording
Excavated by the NYU expedition.
Bibliography Published by Roueché, Aphrodisias in Late Antiquity, no. 18, whence PHI 605.
Text Constituted From Transcription (Roueché).
Photographs Face (1972)

Seems a bit underwhelming, no? But look at what is in this inscription – a personal name, the central chamber of the Baths, links outward to other inscriptions… reading the inscriptions algorithmically doesn’t absolve us from having to jump back in to do the close reading. Instead, we have to bounce back and forth between the micro and the macro. The modularity routine suspects that there are around 52 distinct subgroups in this material. That’s probably where the most interest will lie, for scholars of this material. Are these groups related to context of discovery, or named individuals appearing in multiple inscriptions or…? Five groups account for 1456 inscriptions. (It’s easier to load the ‘inscriptions-inscriptions-inscriptions-of-aphrodisias [Nodes].csv’ file to examine all of these). What might be causing the ‘big five’ to group together? I will leave it up to the epigraphers to examine them…

Those 47 inscriptions which the modularity routine found so odd that they each were put into their own group are curious indeed. The first of these uniques is Inscription iAph080906:

αὔξι
Θεόπομπος
ὁ μεγαλοπρεπέστατος πολιτευόμενος σὺν θεῷ πατὴρ τῆς πόλεως

Translation
Up with Theopompos, magnificentissimus, member of the council and, with God’s help, pater civitatis

…which seems to be a good place to draw this note to a close. Up with Theopompos indeed! One wonders if he won the election. The remainder (checked at random) seem to have no translations associated with them. So perhaps what really sets these apart is simply that they haven’t been translated. If so, it’s rather astonishing that this should be visible from a topic-model & graph viz combination.

Topic modeling the things that fell out of pockets

Modern Districts by Modularity, overlain with hand-drawn 1st century civitas boundaries

Topic modeling is very popular at the moment in the digital humanities. Ian, Scott and I described them as tools for extracting topics or injecting semantic meaning into vocabularies: “Topic models represent a family of computer programs that extract topics from texts. A topic to the computer is a list of words that occur in statistically meaningful ways. A text can be an email, a blog post, a book chapter, a journal article, a diary entry – that is, any kind of unstructured text” (Graham, Weingart, and Milligan 2012). In that tutorial, ‘unstructured’ means that there is no encoding in the text by which a computer can model any of its semantic meaning.

But there are topic models of ships’ logs, of computer code. So why not archaeological databases?

Archaeological datasets are rich, largely unstructured bodies of text. While there are examples of archaeological datasets that are coded with semantic meaning through xml and Text Encoding Initiative practices, many of these are done after the fact of excavation or collection. Day to day, things can be rather different, and this material can be considered to be ‘largely unstructured’ despite the use of databases, controlled vocabulary, and other means to maintain standardized descriptions of what is excavated, collected, and analyzed. This is because of the human factor. Not all archaeologists are equally skilled. Not all data gets recorded according to the standards. Where some see few differences in a particular clay fabric type, others might see many, and vice versa. Archaeological custom might call a particular vessel type a ‘casserole’, thus suggesting a particular use, only because in the 19th century when that vessel type was first encountered it reminded the archaeologist of what was in his kitchen – there is no necessary correlation between what we as archaeologists call things and what those things were originally used for. Further, once data is recorded (and the site has been destroyed through the excavation process), we tend to analyze these materials in isolation. That is, we write our analyses based on all of the examples of a particular type, rather than considering the interrelationships amongst the data found in the same context or locus.

David Mimno in 2009 turned the tools of data analysis on the databases of household materials recovered and recorded room by room at Pompeii. He considered each room as a ‘document’ and the artefacts therein as the ‘tokens’ or ‘words’ within that document, for the purposes of topic modeling. The resulting ‘topics’ of this analysis are what he calls ‘vocabularies’ of object types which when taken together can suggest the mixture of functions particular rooms may have had in Pompeii. He writes, ‘the purpose of this tool is not to show that topic modeling is the best tool for archaeological investigation, but that it is an appropriate tool that can provide a complement to human analysis….mathematically concrete in its biases’. The ‘casseroles’ of Pompeii turn out to have nothing to do with food preparation, in Mimno’s analysis. To date, I believe this is the only example of topic modeling applied to archaeological data.

Directly inspired by that example, I’ve been exploring the use of topic models on another rich archaeological dataset, the Portable Antiquities Scheme database in the UK. The Portable Antiquities Scheme is a project “to encourage the voluntary recording of archaeological objects found by members of the public in England and Wales”. To date, there are over half a million unique records in the Scheme’s database. These are small things, things that fell out of pockets, things that often get found via metal-detecting.

Here’s what I’ve been doing.

1. I downloaded a nightly dump of the PAS data back in April; it came as a csv file. I opened the file, and discovered over a million lines of records. Upon closer examination, I think the problem has something to do with the encoding: there are line breaks, carriage returns, and other non-printing characters (as well as commas being used within fields), so that when I open the file a single record (say a coin hoard) ends up occupying tens of lines, or fields shift at the extraneous commas.

2. I cleaned this data up using Notepad++ and the liberal use of regular expressions to put everything back together again. The entire file is something like 385 mb.

3. I imported it into MS Access so that I could begin to filter it. I’ve been playing with paleo – meso – and neolithic records; bronze age records; and Roman records. The Roman material itself occupies somewhere around 100 000 unique records.

4. I exported my queries so that I would have a simpler table with dates, descriptions, and measurements.

5. I filtered this table in Excel so that I could copy and paste out all of the records found within a particular district (which left me with a folder with 275 files, totaling something like 25 mb of text).

6. Meanwhile, I began topic modeling the unfiltered total PAS database (just after #2 above). Each run takes about 3 hours, as I’ve been running diagnostics to explore the patterns. The problem I have here though is what, precisely, am I finding? What does a cluster of records that share a topic actually mean, archaeologically? Do topics sort themselves out by period, by place, by material, by finds officer…?

7. As that’s been going on, I’ve been topic modeling the folders that contain the districts of England and Wales for a given period. Let’s look at the Roman period.
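A note on steps 1 and 2 before going on. If the fields that contain line breaks and commas are wrapped in quotation marks in the dump (an assumption; the file described above may not have been), a csv-aware reader can keep each record together without regular expressions. A minimal sketch with invented sample data and column names:

```python
import csv
import io

# Invented sample: the second line is one record whose description
# contains both a comma and a line break inside quotation marks.
raw = (
    "id,description,district\r\n"
    '1,"Coin hoard,\nfound in two parts",Adur\r\n'
    "2,Brooch,Arun\r\n"
)

# newline="" leaves line endings alone so the csv module can decide
# which line breaks end a record and which sit inside a quoted field.
reader = csv.DictReader(io.StringIO(raw, newline=""))

# Collapse stray line breaks and runs of spaces inside every field.
rows = [
    {key: " ".join(value.split()) for key, value in row.items()}
    for row in reader
]
for row in rows:
    print(row)
```

With a real file the same reader would be given `open(path, newline="", encoding=...)` in place of the `io.StringIO` object.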

There are 275 files, where a handful have *a lot* of data (> 1000 kb), while the vast majority are fairly small (< 100 kb). Perhaps that replicates patterns of metal detecting - see Bevan on biases in the PAS.  The remaining districts seem to have no records in the database. So I’ve got 80% coverage for all of England and Wales. I’ve been iterating over all of this data, so I’ll just describe the most recent, as it seems to be a typical result. Using MALLET 2.0.7, I made a topic model with 50 topics (and optimized the interval, to shake out the useful from the not-so-useful topics). Last night, as I did this, the topic diagnostics package just wouldn’t work for me (you run it from the MALLET directory, but it lives at the MALLET site; perhaps they were working on it). So I’ll probably want to run all these again.

If I sort the topic keys by their prominence (see ‘optimize interval’) the top 14 all seem to describe different kinds of objects – brooches, denarii, nummus, sherds, lead weights, radiate, coin dates, the ‘heads’ sides of coins (which Emperor). Then we get to the next topic, which reads: “record central database recording usual standards fall created scheme aware portable began antiquities rectify working corroded ae worn century”. This meta-note about data quality appears throughout the database, and refers to materials collected before the Scheme got going.

After that, the remaining topics all seem to deal with the epigraphy of coins, and the various inscriptions, figurative devices, their weights & materials. A number of these topics also include allusions to the work of Guest and Wells, whose work on Iron Age Coins is frequently cited in the database.

Let’s look at the individual districts now, and how these topics play over geographic space. Given that these are modern districts, it’d be better – perhaps – to do this over again with the materials sorted into geographic entities which make sense from a Roman perspective. Perhaps do it by major Roman roads (sorting the records so that districts through which Watling Street passes are gathered into a single document). Often what people do when they want to visualize the patterns of topic interconnections in a corpus is to trim the composition document so that only topics greater than a certain threshold are imported to a package like Gephi.

My suspicion is that that would throw out a lot of useful data. It may be that it’s the very weak connections that matter. A very strong topic-document relationship might just mean that a coin hoard found in the area is blocking the other signals.

In which case, let’s bring the whole composition document into Gephi. Start with this:

adur 4 0.238806 15 0.19403 22 0.179104 13 0.119403 17 0.089552

and delete out the edge weights. (I’m trying to figure out how to do what follows without deleting those edge weights, but bear with me.)

You end up with something like this:

adur 4 15 22  [...etc...]

Save the file with a new name, as csv.

Open in Notepad++ (or similar) and replace the commas with ;

Go to Gephi. Under ‘open graph file’, select your csv file. This is not the same as ‘import spreadsheet’ under the data table tab. You can import a comma separated file where the first item on a line is a node, and each subsequent item is another node to which it is attached. If you tried to open that file under the ‘import spreadsheet’ button, you’d get an error message – in that dialogue, you have to have two columns, source and target, where each row describes a single relationship. See the difference?

This is why if you left the edge weights in the csv file – let’s call it an adjacency file – you’d end up with weights becoming nodes, which is a mess. If you want to keep the weights, you have to do the second option.

I’ve tried it both ways. Ultimately, while the first option is much much faster, the second option is the one to go for because the edge weights (the proportion that a topic is present in a document) are extremely important. So I created a single list that included seven pairs of topic-weight combinations. (This doesn’t create a graph where k=7, because not every document had that many topics. But why 7? In truth, after that point, the topics all seemed to be well under 1% of each document’s composition).
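The second option (one row per relationship, weights kept) can be produced without cutting and pasting. The sketch below assumes each composition line looks like the ‘adur’ example above, with the district name first and then topic-weight pairs; the function name is hypothetical, and the ‘topic’ prefix is a design choice to keep topic numbers from being confused with other node ids.

```python
import csv
import io


def composition_to_edges(line, max_pairs=7):
    """Turn 'adur 4 0.238806 15 0.19403 ...' into (source, target, weight) rows."""
    name, *rest = line.split()
    # Pair every topic number with the proportion that follows it.
    pairs = list(zip(rest[0::2], rest[1::2]))
    return [(name, f"topic{t}", float(w)) for t, w in pairs[:max_pairs]]


line = "adur 4 0.238806 15 0.19403 22 0.179104 13 0.119403 17 0.089552"

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
# Column names that Gephi's 'import spreadsheet' dialogue recognises for edges.
writer.writerow(["Source", "Target", "Weight"])
writer.writerows(composition_to_edges(line))
print(out.getvalue())
```

Writing to a real file instead of `io.StringIO` gives an edge table in which every row describes a single district-topic relationship and its strength.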

With me so far? Great.

Now that I have a two mode network in Gephi, I can begin to analyze the pattern of topics in the documents. Using the multi-mode plugin, I separate this network into two one-mode networks: topics to topics (based on appearing in the same district) and district – district based on having the same topics, in different strengths.

Network visualization doesn’t offer anything useful here (although Wales is always quite distinctly apparent when you do; it’s because of the coin hoards). Instead, I simply compute useful network metrics. For instance, ‘betweenness’ counts the number of times a node lies on the shortest paths between all pairs of nodes. In a piece of text, words with high betweenness do the heavy semantic lifting. So identifying the topics that are most in between in the topic – topic network should be a useful thing to do. But what does ‘betweenness’ imply for the district – district network? I’m not sure yet. Pivotal areas in the formation of material culture?
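To make the betweenness idea concrete, here is a small sketch of Brandes' algorithm for an unweighted, undirected network. It is an illustration only: it ignores edge weights and applies no normalisation, so its numbers need not match what any particular package reports, and the toy network is invented.

```python
from collections import deque


def betweenness(graph):
    """Betweenness for an unweighted, undirected graph {node: set(neighbours)}."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, sharing out each path's credit.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair was counted once from each end.
    return {v: total / 2 for v, total in score.items()}


# Toy network: 'b' sits between every other pair of nodes.
star = {"a": {"b"}, "b": {"a", "c", "d"}, "c": {"b"}, "d": {"b"}}
print(betweenness(star))
```

In the toy network, every shortest path between a, c and d runs through b, so b scores 3 (one for each pair) and the others score 0.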

What is perhaps more useful is the ‘modularity’. It’s just one of a number of algorithms one could use to try to find structural sub-groups in a network (NodeXL has many more). But perhaps there are interesting geographical patterns if we examined the pattern of links. So I ran modularity, and uploaded the results to openheatmap to visualize them geographically. Network analysis doesn’t need to produce network visualizations, by the way.

See the result for yourself here: http://www.openheatmap.com/embed.html?map=AnteriorsFrijolsHermetists

It colours each district based on the group that it belongs to. If you mouse-over a district, it’ll give you that group’s number – those numbers shouldn’t be confused with anything else. I’d do this in QGIS, but this was quicker for getting a sense of what’s going on.

I asked on Twitter (referencing a slightly earlier version) if these patterns suggested anything to any of the Romano-Britain crowd.

Modularity for topic-topic also implies some interesting groupings, but these seem to mirror what one would expect by looking at their prominence in the keys.txt file.  So that’s where I am now, soon to try out Phil’s suggestion.

As Paul Harvey was wont to say, ‘…and now you know… the REST of the story’.  At DH2013 I hope to be able to tell you what all of this may mean.

Topic Modeling an archaeological database: today’s adventures

If you follow me on twitter, and saw a number of bizarre/cryptic tweets today, I was live tweeting my work stream. This is what I did today – think of this as stream of consciousness over the last five hours.

  • imported portable antiquities scheme database into access so I could work with it.
  • queried it, selecting just those columns I was interested in
  • exported back to csv
  • cleaned up the data by removing ‘=’ signs (circular reference error in excel), names of liaison officers, meta notes from PAS on the quality of the record, and indications that the record was sourced from the work of Guest and Wells (nb, not any citations to them). also celtic coins index note.
  • run a simple defaults topic model to get a sense of what words I need to add to a custom stopwords list.
  • 552438 rows (id numbers run to 548561, so I must have lost some).

it occurs to me that I should have left the names of the liaison officers in, in case they get associated with a particular topic. d’oh.

bin\mallet import-file --input pasnebraska/everything.csv --output paseverything.mallet --keep-sequence --token-regex '[\p{L}\p{M}\p{N}]+' --remove-stopwords

bin\mallet train-topics --input paseverything.mallet --num-topics 50 --optimize-interval 20 --output-state topic-state.gz --output-topic-keys everything_keys.txt --output-doc-topics everything_composition.txt

  • I think these results will be more useful than the previous ones. Although I believe I forgot to optimize-interval. Yes, I did.

so, running this:

bin\mallet run cc.mallet.topics.tui.TopicTrainer --input paseverything.mallet --num-topics 50 --optimize-interval 20 --diagnostics-file everythingdiagnostics.xml --output-topic-keys everythingdiag_keys.txt --output-doc-topics everythingdiag_topics.txt --xml-topic-phrase-report everythingdiag_phrase.txt --xml-topic-report everythingdiag_topicreport.xml --topic-word-weights-file everythingdiag_word_weights.txt --word-topic-counts-file everythingdiag_word_counts.txt --output-state output-state.gz

looking at the results, it looks like the first two columns, first three? were taken to be labels. shite.

  • reformat csv so that I have an id, and a text, per row.
  • found formula to combine all columns into a single column. but blank rows are buggering things up:

=stoneage!B1&" "&stoneage!C1&stoneage!D1&" "&stoneage!E1&" "&stoneage!F1&" "&stoneage!G1&stoneage!H1&" "&stoneage!I1&" "&stoneage!J1&" "&stoneage!K1&stoneage!L1&" "&stoneage!M1&" "&stoneage!N1&" "&stoneage!O1&stoneage!P1&" "&stoneage!Q1&" "&stoneage!R1&" "&stoneage!S1&stoneage!T1&" "&stoneage!U1&" "&stoneage!V1&" "&stoneage!W1&stoneage!X1&" "&stoneage!Y1&" "&stoneage!Z1&" "&stoneage!AA1&stoneage!AB1&" "&stoneage!AC1

  • returned to access database. gone with the april pas database (csv, download). importing selected columns, ignoring column shift. filtering out blank rows (and/or rows where everything’s column shifted all over the place)
  • exporting by period. leaving liaison officers in. too bloody awkward to deal with the entire database at once.
  • put all the columns into a single column, so now I have just two: an id number, and a ‘text’.
  • imported, with regex, and diagnostics topic model,

weird errors when running the model.

  • reimporting without regex.
  • rerunning with diagnostics. looks much better.

topic composition file is crazy talk.

  • ok, screw diagnostics. run normal. optimization 20, topics 50, for stoneage (9680 records).

ok, still the same problem with the composition file. What the hang?

  • re-running without optimization.

nope, still getting this kind of thing:

#doc name topic proportion …
“0 51″ “FLAKE 10 0.480592529670674 44 0.3208674000538096 32 0.10756601869328378
“15 422″ BLADE NEOLITHIC -4000 -2200 “Black/grey 22 0.8415325393137797 30 0.12728676320083662

So, that says to me that something weird happened in the initial import. Yet topic keys seem to make sense.

Sigh…

Wait, over on Mallet page it says,

… the first token of each line (whitespace delimited, with optional comma) becomes the instance name, the second token becomes the label, and all additional text on the line is interpreted as a sequence of word tokens.

Simple as that? So I just need a bloody extra column in there. For the love of god…

  • add column. filled it with document id (again).
  • reimporting. no regex.
  • running the topic model with diagnostics. 50 topics, optimized interval of 20.

By god, that was it.

Almost. Something still weird with the document names. Ah, found it. A blank in the first few rows.

  • reimporting. no regex.
  • running the topic model with diagnostics. 50 topics, optimized interval of 20.

SUCCESS!
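The fix that finally worked (an id, a second column for the label, then the text, with no blank rows) can be sketched like this. It is an illustration under assumptions rather than the exact procedure used above: the function name and sample rows are invented, the label is simply the document id again, and ids are assumed safe to join with underscores.

```python
def to_mallet_line(doc_id, fields):
    """Build one 'name<TAB>label<TAB>text' line, or None for a blank record."""
    # Join the non-empty columns and squeeze out stray whitespace and line breaks.
    words = " ".join(str(f) for f in fields if f is not None).split()
    # The name must be a single token, so internal spaces become underscores.
    name = "_".join(str(doc_id).split())
    if not name or not words:
        return None  # blank id or blank row: leave it out of the import file
    return "\t".join([name, name, " ".join(words)])


# Invented rows: (id, list of columns to fold into one text).
rows = [
    ("51", ["FLAKE", "NEOLITHIC", "", "Black/grey flint"]),
    ("", ["stray text with no id"]),
    ("422", ["", None]),
    ("0 51", ["BLADE", "snapped\nat one end"]),
]

lines = [line for line in (to_mallet_line(i, f) for i, f in rows) if line]
print("\n".join(lines))
```

Only the first and last rows survive: the second has no id and the third has no text, which are the two kinds of blank that caused trouble above.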

Now to repeat with other periods (I elided palaeo, meso, and neolithic into a separate csv file). Then to interpret what all this means.

And I think I really need to reformulate my idea of ‘document’ to not be the individual rows, but rather the districts. I could pull all that out by hand, but I really want to figure out how to make the computer do that.
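Making the computer pull the districts out could look something like the sketch below, which gathers every record's text under its district and writes one file per district (a folder MALLET's import-dir could then read). The column names 'district' and 'text', the function names, and the sample records are all assumptions for illustration.

```python
import re
import tempfile
from collections import defaultdict
from pathlib import Path


def documents_by_district(records):
    """records: iterable of dicts with 'district' and 'text' keys (assumed names)."""
    grouped = defaultdict(list)
    for record in records:
        district = (record.get("district") or "").strip().lower()
        text = (record.get("text") or "").strip()
        # A file-name-safe version of the district name.
        slug = re.sub(r"[^a-z0-9]+", "_", district).strip("_")
        if slug and text:
            grouped[slug].append(text)
    return {slug: "\n".join(texts) for slug, texts in grouped.items()}


def write_documents(docs, folder):
    """Write one plain-text file per district into `folder`."""
    for slug, text in docs.items():
        (Path(folder) / f"{slug}.txt").write_text(text + "\n", encoding="utf-8")


# Invented records standing in for rows of the database.
records = [
    {"district": "Adur", "text": "denarius"},
    {"district": "adur ", "text": "brooch"},
    {"district": "East Devon", "text": "sherd"},
    {"district": "", "text": "record with no district"},
]

docs = documents_by_district(records)
with tempfile.TemporaryDirectory() as folder:
    write_documents(docs, folder)
    written = sorted(p.name for p in Path(folder).iterdir())
print(docs, written)
```

Records with no district are silently dropped here; whether they should instead be collected into an 'unknown' document is a choice worth making deliberately.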

Anyway, some of the ‘topics’ from today’s adventures (what are ‘topics’ anyway? what might it mean, archaeologically, to think of these as ‘discourses’?, are some questions I need to ask):

implement lithic mesolithic neolithic flint flake bc blade dating date waste possibly period atherton rachel worked core flakes circa
grey flint colour brown dark neolithic light flake mottled cortex mid white cream flakes patina pale coloured knapped translucent

neolithic age bronze early flint date late scraper bc flake probable tool dating complete retouched knife made part tertiary

scraper end neolithic tool flint retouch semi edge circular thumb distal dorsal face abrupt side cortex nail thumbnail plan

core platform flint mesolithic single flakes removed cortex worked scars platforms tl pebble neolithic striking multi removals small blades

flake flint cortex rface dorsal grey retouch neolithic edge small damage face remaining scars edges brown secondary made white

bulb flake percussion platform striking end ventral rface flint proximal dorsal face neolithic scars small grey mid distal patina

side plan end profile margin left distal retouched dorsal proximal shaped flake flint convex scraper ventral snapped edge plano

mm mea length res width weighs flake flint neolithic ring wide long weighing adams kurt thick thickness west blade

section implement plan neolithic lithic triangular shaped date cross shape object oval rectangular rface sides roughly worked edge knapped

hill graham south flake west cornwall paul penwith flint sw margin cream end grey brown distal fig dorsal translucent

arrowhead neolithic leaf shaped flint tanged worked tip barbed point triangular broken early faces transverse missing oblique invasive tang

blade end distal mesolithic proximal snapped broken retouch flint ends parallel dorsal edges break incomplete missing sides damage long

Topic Modeling an Archaeological Database 2

Some things I have learned in recent days:

  • data must be cleaned. Really. It’s probably still too noisy, even when you think it isn’t. Eliminate frequently occurring meta-notes (as it were). All citations to Guest & Wells on Coins in the UK, for instance, really muck things up.
  • you can enter a single csv file as an input for MALLET. I knew this; but I had forgotten it, faced with a few hundred thousand rows of material (as I type this, the thought also occurs that I could run MALLET on the entire single file download I got from PAS, all ~500 000 rows. Presumably, locations and periods would sort themselves out into different topics?)
  • MALLET considers letter characters to make up words. If you’ve got other stuff in there – numerals, for instance – that are significant, you’ll need to become familiar with --token-regex, which you’d use during that initial file import. It was suggested to me to try these:

--token-regex '\s\d+\s'

--token-regex '[\p{L}\p{M}\p{N}]+'
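To see what difference the token pattern makes, here is a small sketch in Python rather than MALLET itself. Python's built-in `re` module does not support `\p{...}` classes, so the two patterns below are stand-ins: one keeps letters only, the other keeps letters and digits (it ignores combining marks, so it is only an approximation of the second suggestion). The sample sentence is made up.

```python
import re

# Letters only: word characters minus digits and underscore
LETTERS_ONLY = re.compile(r"[^\W\d_]+")
# Letters and digits: an approximation of the [\p{L}\p{M}\p{N}]+ idea
LETTERS_AND_DIGITS = re.compile(r"[^\W_]+")

# Made-up description, for illustration only
text = "coin 147 weighs 3 g, diameter 18 mm"

letters = LETTERS_ONLY.findall(text)
with_digits = LETTERS_AND_DIGITS.findall(text)
```

With the letters-only pattern the measurements vanish from the token list; with the second they survive as tokens of their own.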

What else? Oh, that’s about all, for now. Oh, wait: custom stopwords. Alongside --remove-stopwords, you’ll want --extra-stopwords yourlist.txt (as I understand it, the extra list is added to the default one rather than replacing it). And your list has to be formatted so that there is whitespace between the words. I’m not sure if that means ‘white space’ like how you and I would figure it, or if that means ‘white space’ in some kind of crazy hidden code kind of way (like this in regex: \s (see this)). If you open one of the default stopword lists, there’s no hit-the-space-bar kind of white space that I’d normally assume; the words sit one per line, so presumably a line break counts.

Onwards!

Topic Modeling the Portable Antiquities Scheme

The Frome Hoard of Roman Coins

I got my hands on the latest build of the Portable Antiquities Scheme database. I want to topic model the items in this database, to look for patterns in the small material culture of Britain, across time and space.

The data comes in a single CSV, with approximately 500 000 individual rows. The data’s a bit messy, as a result of extra commas slipping in here and there. The names of the Finds Liaison Officers slip into a column meant to record epigraphic info from coins, for instance, from time to time. Not a big deal, over 500 000 records.

The first issue I had was that after opening the CSV file in Excel, Excel would regard all of those epigraphic conventions (the use of =, +, or [ ] and so on) as formulae. This would generate ‘circular reference’ errors. I could sort that out by inserting a ‘ at the beginning of that column. But as you can imagine, sorting through, filtering, or any kind of manipulation of a single table that large would slow things considerably – and frequently crashed this poor ol’ desktop. I tried using Open Refine to clean up the data. I suspect with a bit of time and effort I’d be able to use that product well, but yesterday all I achieved, once I imported my csv file and clicked ‘make project’, was an ‘undefined error’ (after several minutes of chugging). This morning, I turned to Access and was able to import the csv, and begin querying it, cleaning things up a bit, and so on.

So I decided to focus on the Roman records, for the time being. There are some 66 000 unique records, coming from over 80 unique districts of the UK. This leaves me with a table with the chronological range for the object, a description of the object, and some measurements. I have a script that can take each individual row, and turn it into a txt file which I can then import into MALLET. Each individual row can also include the district name.

So I’m wondering now: should I just cut and paste all of the rows for a single district into a single txt file (and thus the routine will not have the place-name in the analyzed text)? Or should I preserve the granularity, and just topic model over every record, preserving the place name? I.e., a collection of 80 txt files where there are no place names, or a collection of 66 000 txt files where every file has the place name – and in that case, will the place names swamp the signals?
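The per-record option can be sketched like this, with a switch for whether the place name goes into the text. This is not the script I actually used; the column names (`id`, `broadperiod`, `district`, `description`) and the sample rows are assumptions for illustration.

```python
import csv
import io


def roman_records_as_documents(csv_text, include_district=True,
                               id_col="id", period_col="broadperiod",
                               district_col="district", text_col="description"):
    """Return {filename: text} for Roman rows only, one 'document' per record.

    All column names are hypothetical placeholders.
    """
    docs = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if (row.get(period_col) or "").strip().upper() != "ROMAN":
            continue  # keep only records tagged Roman
        parts = [row.get(text_col, "")]
        if include_district:
            parts.insert(0, row.get(district_col, ""))
        # Drop empty fields and join the rest with single spaces
        docs[f"{row[id_col]}.txt"] = " ".join(p.strip() for p in parts if p and p.strip())
    return docs


# Made-up sample rows
sample = """id,broadperiod,district,description
1,ROMAN,Ashford,copper alloy nummus
2,MEDIEVAL,Dover,silver penny
3,Roman,Dover,brooch fragment
"""
with_places = roman_records_as_documents(sample, include_district=True)
without_places = roman_records_as_documents(sample, include_district=False)
```

Running the topic model once on each version would be one way to check whether the place names really do swamp everything else.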

It’s too early in the morning for this kind of thinking.

 

Data Mining an Archaeological Database

I’m using their materials with permission.

In July, I’m presenting work related to data mining an archaeological database, in this case, the Portable Antiquities Scheme.

I wondered, if I treated each district in the UK as a ‘document’, and the items recovered in its territory as the words, would I see any interesting or useful patterns if I ran some topic models?

To give you a sense of the scale of this data, there are over 160 000 individual records in the material I obtained from PAS. An individual record might include a ‘hoard’, so there are *well* over 160 000 individual objects. When you sort this material into broad chronological periods, you find:

Paleolithic: 305 records
Mesolithic: 2281
Neolithic: 3608
Prehistoric: 426
Bronze Age: 2620
Iron Age: 4695
Roman: 63479
Greek/Roman Provincial: 25
Byzantine: 25
Early Medieval: 8421
Medieval: 44982
Post Medieval: 27879
Modern: 306
“Unknown”: 1486
Blank cells: 1278

Quite a lot of material. So, after massaging the data, cleaning things up, I began to work with a very small subset of materials – records tagged ‘bronze age’ from 14 districts (104 records). This was merely an exploration, to see if there’s any meat to my intuitive belief that there should be some sort of latent structure. The 14 districts I selected (the first 14 when I sorted ‘Bronze Age’) are:

Ashford
Bromley
Dover
East Hampshire
Hart
Medway
New Forest
Sevenoaks
Test Valley
Winchester
Wokingham

I put every record from Wokingham District into a single txt file, then every one from Winchester, until I was done (and I really need to automate that). Then I fed the text files through MALLET, using the Java GUI with its default settings for this initial exploration. In a more robust exploration, I would work directly from the command line, tweaking until I found the best number of topics, etc.

So here’s what I found.

List of Topics
1. alloy palstave mm copper green surface slight cast dark penannular
2. mouth sides loop dims looped corners armorican axeheads core cast
3. blade axehead prominent casting iron hoard intact uneven single narrow
4. age fragment late surfaces alan spear body faces head flanged
5. age socket collar sectioned alloy slightly ridge seams front square
6. record flint grey scraper antiquities dorsal tool angle black visible
7. bronze patina end stop made remains flat decoration found corroded
8. database central rectify working recording standards usual fall aware began
9. bronze copper flashes part side edge large ridges shallow top
10. socketed straight axe rounded complete horizontal moulded rectangular expanding upper

What do those topics mean? To a human, they are all variations on the description of the artefacts. Given that multiple humans described these artefacts in the first place, perhaps (and it depends too on the kind of guidance and rigour that the PAS uses in its data entry) these topics gather some of the blurriness of categorization, a way of bypassing the clumpers and the splitters amongst us. Obviously, some more thought about what these may mean is necessary. But onwards!

Two-mode, district to topics. Size: betweenness, colour: modularity

I brought the resultant ‘documents: topics, % contribution’ list into Gephi for some visualization. Since it was a small dataset, I did no pruning. Topic 4 does the most lifting in this network. In its ‘module’, you find topics 9, 10, 3, 5 (coloured purple) and the districts of Gravesham, Bromley, Dover, Canterbury, Test Valley, and New Forest. But how much weight does this visualization carry? Since it’s two-mode, and these metrics are really only appropriate for a one-mode graph, probably not much. So I collapsed this graph into a one-mode graph of district to district, based on weighted ties by topic.
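As a rough sketch of what collapsing to one mode can look like: for every pair of districts, add up the overlap on the topics they share. Taking the smaller of the two proportions as the overlap is an assumption made here for illustration; it is not necessarily the weighting used for the graph in this post, and the numbers below are made up.

```python
from itertools import combinations


def project_to_districts(doc_topics):
    """Collapse {district: {topic: weight}} into {(district_a, district_b): tie weight}.

    Tie weight = sum over shared topics of the smaller of the two weights
    (an illustrative choice, not the only possible one).
    """
    ties = {}
    for a, b in combinations(sorted(doc_topics), 2):
        shared = set(doc_topics[a]) & set(doc_topics[b])
        weight = sum(min(doc_topics[a][t], doc_topics[b][t]) for t in shared)
        if weight > 0:  # no shared topics means no tie
            ties[(a, b)] = weight
    return ties


# Made-up topic proportions for three districts
doc_topics = {
    "Ashford": {4: 0.6, 9: 0.4},
    "Dover": {4: 0.5, 3: 0.5},
    "Hart": {6: 1.0},
}
ties = project_to_districts(doc_topics)
```

The resulting pairs and weights can be saved as an edge list and loaded into network software as an undirected, weighted graph.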

The resultant graph is probably more useful for archaeology, for it ties areas together based on all of the material culture recorded in the database. At the recent SAA in Honolulu, in the Connected Past session, folks were constructing networks from artefacts using Brainerd-Robinson coefficients. The methodology I’m trying ought to be compared with those studies (see for instance Barbara Mills et al.’s recent article). I then ran modularity and betweenness statistics again. Why betweenness? If the ‘topics’ that emerge in this database reflect something within the underlying material culture, then interconnections between sites constructed from topics show some kind of flow (of ideas? culture? economics?), thus ‘between’ sites straddle the most important of those flows – in which case the most ‘between’ districts might be rather more important.
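For comparison's sake, the Brainerd-Robinson coefficient is usually given as 200 minus the summed absolute differences between two assemblages' category percentages, so identical profiles score 200 and wholly different ones score 0. A minimal sketch, with made-up counts and a hypothetical function name:

```python
def brainerd_robinson(counts_a, counts_b):
    """Brainerd-Robinson similarity between two assemblages.

    counts_a, counts_b: {category: count}. Returns a value from 0 to 200.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("each assemblage needs at least one object")
    categories = set(counts_a) | set(counts_b)
    # Compare the percentage of each category in the two assemblages
    difference = sum(
        abs(100 * counts_a.get(c, 0) / total_a - 100 * counts_b.get(c, 0) / total_b)
        for c in categories
    )
    return 200 - difference


# Made-up counts: same proportions at different scales, then no overlap at all
same_profile = brainerd_robinson({"axe": 3, "spear": 1}, {"axe": 6, "spear": 2})
no_overlap = brainerd_robinson({"axe": 2}, {"coin": 5})
```

The same function could be fed topic proportions instead of artefact counts, which would give one way of lining the two approaches up against each other.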

Colour = modularity; size = betweenness. Ties: shared topics

Remarkably (and this could be an artefact of the method, rather than the underlying data), I get next to no variation in betweenness – every district except for East Hampshire, Ashford, and New Forest has the same score (and these three all have the same score too). Modularity finds two groups. Perhaps it’s an east/west dichotomy? I laid the network out with the nodes at their geographic locations (typically, the district council office). No east-west dichotomy. (Incidentally, you can now export to Google Earth, overlaying your network against pretty satellite pictures).

So… there seems to be something to it. The thing to do now is to do every record, every district, and every period, mapping out changes over time.  In the interests of being able to assess this, though, I should perhaps stick to my knitting and just do the Roman period.

 

Reanimating Networks with Agent Modeling

I’m presenting next week at the Society for American Archaeology Annual Meeting. I’m giving two papers. One argues for parsimonious models when we do agent based modeling.  The other reverses the flow of archaeological network analysis and instead of finding nets in the archaeology, I use agent based models to generate networks that help me understand the archaeology. (The session is ‘Connected Past’.) Here is the draft of my talk, with all the usual caveats that that entails. Parts of it have been drawn from an unpublished piece that discusses this methodology and the results in much greater detail. It will appear…. eventually.

Scott Weingart has been an enormous help in all of this. You should follow his work. 

My interests lie in the social networks surrounding primary resource extraction in the Roman world. The Roman epigraphy of stamped brick easily lends itself to network analysis. One can string together, like pearls, individual landowners, estate names, individual brick makers, signa, brick fabrics, and locations. This leads to very complicated, multi-dimensional networks.

When I first started working with this material, I reduced this complexity by looking only at the humans, whom I tied together based on appearing in the same stamp type together. I called these ‘producer’ networks. I then looked at the ties implied by the shared use of fabrics, or the co-location of brick stamp types at various findspots, and called these ‘manufacturing’ networks.

I then sliced these networks up by reigning dynasty, and developed a story to account for their changing shapes over time.

This was in the late 1990s, and in terms of network theorists I had largely only Granovetter, Hanneman & Riddle, and Strogatz & Watts to go on. The story I told was little more than a just-so story, like how the Camel got its Hump.

I had the shape, I had points where I could hang the story, but I couldn’t account for how I got from the shape of the network in the Julio-Claudian period, to that of the Flavian, to that of the Antonines. I’ve done a lot of work on networks since then; now I want to know what generates these networks that we see archaeologically, in the first place.

In this talk today, I want to reverse the direction of my inquiry. We are all agreed that we can find networks in our archaeological materials. The problem, I think, for us, is to explain the network processes that produce these patterns, and then to use our understanding of those processes to narrow down the possible entangled human & thing interactions that could give rise to these possible processes.

We need to be able to understand the possible behaviour-spaces that could produce the networks we see, to tease out the inevitable from the contingent. We need to be able to rigorously explore the emergent or unintended consequences of the stories we tell. The only way I know how to do that systematically, is to encode those stories as computer code, to turn them from normal, archaeological storytelling rhetoric, to computational procedural rhetoric.

So this is what we did.

One story we tell about the Roman world, that might be useful for understanding things like the exploitation of land for building materials, is that its social economy functioned like a ‘bazaar’.

According to Peter Bang, the Roman economic system is best understood as a complex, agrarian tributary empire, of a kind similar to the Ottoman or Mughal (Bang 2006; 2008). Bang (2006: 72-9) draws attention to the concept of the bazaar. The bazaar was a complete social system that incorporated the small peddler with larger merchants, long distance trade, with a smearing of categories of role and scale. The bazaar emerged from the interplay of instability and fragmentation. The mechanisms developed to cope with these reproduced that same instability and fragmentation. Bang identifies four key mechanisms that did this: small parcels of capital (to combat risk, cf Skydsgaard 1976); little homogenization of products (agricultural output and quality varied year by year, and region by region as Pliny discusses in Naturalis Historia 12 and 18); opportunism; and social networks (80-4). As Bang demonstrates, these characteristics correspond well with the archaeology of the Roman economy and the picture we know from legal and other texts.

Bang’s model of the bazaar (2008; 2006), and the role of social networks within that model, can be simulated computationally. What follows is a speculative attempt to do so, and should be couched in all appropriate caveats and warnings. The model simulates the extraction of various natural resources, where social connections may emerge between individuals as a consequence of the interplay of the environment, transaction costs, and the agent’s knowledge of the world. If the networks generated from the computational simulation of our models for the ancient economy correspond to those we see in the ancient evidence, we have a powerful tool for exploring antiquity, for playing with different ideas about how the ancient world worked (cf. Dibble 2006). Computation might be able to bridge our models and our evidence. In particular, I mean ‘agent based modeling’.

Agent based modelling is an approach to simulation that focuses on the individual. In an agent based model, the agents or individuals are autonomous computing objects. They are their own programmes. They are allowed to interact within an environment (which frequently represents some real-world physical environment). Every agent has the same suite of variables but each agent’s individual combination of variables is unique (if it was a simulation of an ice-hockey game, every agent would have a ‘speed’ variable, and an ‘ability’ variable, and so the nature of every game would be unique). Agents can be aware of each other and the state of the world (or their location within it), depending on the needs of the simulation. It is a tool to simulate how we believe a particular phenomenon worked in the past. When we simulate, we are interrogating our own understandings and beliefs.

The model imagines a ‘world’ (‘gameboard’ would not be an inappropriate term) in which help is necessary to find and consume resources. The agents do not know when or where resources will appear or become exhausted. By accumulating resources, and ‘investing’ in improvements to make extraction easier, agents can accrue prestige. When agents get into ‘trouble’ (they run out of resources) they can examine their local area and become a ‘client’ of someone with more prestige than themselves.  It is an exceedingly simple simulation, and a necessary simplification of Bang’s ‘Bazaar’ model, but one that captures the essence and exhibits subtle complexity in its results. The resulting networks can be imported into social network analysis software like Gephi.
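To give a flavour of the rules just described, here is a toy sketch in Python. It is emphatically not the model itself: the one-dimensional 'gameboard', the class and field names, and the numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Agent:
    name: int
    position: int        # place on a one-dimensional 'gameboard' (a simplification)
    vision: int          # how far the agent can see: its knowledge of the world
    resources: float
    prestige: float = 0.0
    patron: Optional["Agent"] = None


def step(agents, consumption=1.0):
    """One tick: everyone consumes; agents in trouble look locally for a patron."""
    new_ties = []
    for agent in agents:
        agent.resources -= consumption
        if agent.resources > 0:
            # Accumulated resources translate into prestige (arbitrary rate)
            agent.prestige += 0.1 * agent.resources
            continue
        # In trouble: only neighbours within 'vision' and with more prestige qualify
        candidates = [
            other for other in agents
            if other is not agent
            and abs(other.position - agent.position) <= agent.vision
            and other.prestige > agent.prestige
        ]
        if candidates:
            agent.patron = max(candidates, key=lambda o: o.prestige)
            new_ties.append((agent.name, agent.patron.name))
    return new_ties


# Three made-up agents: one rich, one poor neighbour, one poor and far away
agents = [
    Agent(0, position=0, vision=2, resources=5.0),
    Agent(1, position=1, vision=2, resources=0.5),
    Agent(2, position=10, vision=2, resources=0.5),
]
ties = step(agents)
```

Here the poor neighbour becomes a client of the rich agent, while the distant one finds nobody within sight. The list of ties produced over many ticks is the social network that gets exported for analysis.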

It is always better to start with a simple simulation, even at the expense of fidelity to the phenomenon under consideration, on the grounds that it is easier to understand and interpret outputs. A simple model can always be made more complex when we understand what it is doing and why; a complex model is rather the inverse, its outcomes difficult to isolate and understand.

A criticism of computational simulation is that one only gets out of it what one puts in; that its results are tautological. This is to misunderstand what an agent based simulation does.  In the model developed here, I put no information into the model about the ‘real world’, the archaeological information against which I measure the results. The model is meant to simulate my understanding of key elements of Bang’s formulation of the ‘Imperial Bazaar’. We measure whether or not this formulation is useful by matching its results against archaeological information which was never incorporated into the agents’ rules, procedures, or starting points. I never pre-specify the shape of the social networks that the agents will employ; rather, I allow them to generate their own social networks which I then measure against those known from archaeology. In this way, I start with the dynamic to produce static snapshots.

We sweep the ‘parameter space’ to understand how the simulation behaves; i.e., the simulation is set to run multiple times with different variable settings. In this case, there are only two agent variables that we are interested in (having already pre-set the environment to reflect different kinds of resources), ‘transaction costs’ and ‘knowledge of the world’. Because we are ultimately interested in comparing the social networks produced by the model against a known network, the number of agents is set at 235, a number that reflects the networks known from archaeometric and epigraphic analysis of the South Etruria Collection of stamped Roman bricks (Graham 2006a).
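In outline, a sweep is just a loop over every combination of settings, repeated a few times. The sketch below assumes a hypothetical `run_model` function and made-up parameter values; only the figure of 235 agents comes from the text above.

```python
from itertools import product


def sweep(run_model, transaction_costs, knowledge_levels, repetitions=3, n_agents=235):
    """Run the model for every combination of the two agent variables."""
    results = []
    for cost, knowledge in product(transaction_costs, knowledge_levels):
        for rep in range(repetitions):
            outcome = run_model(n_agents=n_agents, transaction_cost=cost,
                                knowledge=knowledge, seed=rep)
            results.append({"transaction_cost": cost, "knowledge": knowledge,
                            "seed": rep, "outcome": outcome})
    return results


def placeholder_model(n_agents, transaction_cost, knowledge, seed):
    # Stand-in that only echoes its settings; a real run would return network metrics
    return (n_agents, transaction_cost, knowledge, seed)


# Made-up values: 2 cost settings x 3 knowledge settings x 3 repetitions
runs = sweep(placeholder_model, [0.1, 0.5], [1, 5, 10])
```

Repeating each combination matters because any one run of a stochastic model is a single draw; it is the range of outcomes per setting that gets compared with the archaeological network.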

What is particularly exciting about this kind of approach, to my mind, is that if you disagree with it, with my assumptions, with my encoded representation of how we as archaeologists believe the ancient world to have worked, you can simply download the code, make your own changes, and see for yourself. If you are presented with the results of a simulation but cannot open the hood and examine its inner workings for yourself, you have no reason to believe those findings. Thus agent based modeling plays into open access issues as well.

So let us consider then some of the results of this model, this computational petri dish for generating social networks. For my archaeological networks, I looked at clustering coefficient and average path length as indicator metrics (key elements of Watts’ small world formulation). We can tentatively identify a small-world then as one with a short average path length and a strong clustering coefficient, compared to a randomly connected network with the same number of actors and connections. Watts suggests that a small-world exists when the path lengths are similar but the clustering coefficient is an order of magnitude greater than in the equivalent random network (Watts 1999: 114).
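The two metrics are simple enough to sketch by hand. The code below is an illustration, not the software used for the analysis; the random-network baseline uses the common approximations of clustering as roughly k/n and path length as roughly ln(n)/ln(k), for n nodes with mean degree k greater than 1. The tiny example graph is made up.

```python
from collections import deque
from itertools import combinations
from math import log


def clustering_coefficient(graph):
    """Mean local clustering coefficient; graph is {node: set of neighbours}."""
    scores = []
    for node, nbrs in graph.items():
        k = len(nbrs)
        if k < 2:
            scores.append(0.0)  # fewer than two neighbours: nothing to cluster
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        scores.append(2 * links / (k * (k - 1)))
    return sum(scores) / len(scores)


def average_path_length(graph):
    """Mean shortest path length over connected pairs, by breadth-first search."""
    total = pairs = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def random_graph_baseline(n, mean_degree):
    """Approximate (clustering, path length) for a comparable random graph."""
    return mean_degree / n, log(n) / log(mean_degree)


# Made-up example: a triangle (a, b, c) with one extra node hanging off c
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
cc = clustering_coefficient(graph)
apl = average_path_length(graph)
```

A network would then be a small-world candidate when its path length is close to the baseline's while its clustering is something like ten times higher.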

In Roman economic history, discussions of the degree of market integration within and across the regions of the Empire could usefully be recast as a discussion of small-worlds. If small-worlds could be identified in the archaeology (or emerge as a consequence of a simulation of the economy), then we would have a powerful tool for exploring flows of power, information, and materials. Perhaps Rome’s structural growth – or lack thereof – could be understood in terms of the degree to which the imperial economy resembles a small-world (cf the papers in Manning and Morris 2005)?

The networks generated from the study of brick stamps are of course a proxy indicator at best. Not everyone (presumably) who made brick stamped it. That said, there are some combinations of settings that produce results broadly similar to those observed in stamp networks, in terms of their internal structure and the average path length between any two agents.

One such mimics a world where transaction costs are significant (but not prohibitive), and knowledge of the world is limited. The clustering coefficient and average path length observed for stamped bricks during the second century fall within the range of results for multiple runs with these settings. In the simulation, the rate at which individuals linked together into a network suggests that there was a constant demand for help and support. The world described by the model doesn’t sound quite like the world of the second century, the height of Rome’s power, that we think we know, suggesting something isn’t quite right, in either the model or our understandings. But how much of the world did brickmakers actually know, remembering that ‘knowledge of the world’ in the model is here limited to the location of new resources to exploit?

Agent based modeling also allows us to explore the consequences of things that didn’t happen. There were a number of simulated worlds that did not produce any clustering at all (and very little social network growth). Most of those runs occurred when the resource being simulated was coppiced woodland. This would suggest that the nature of the resource is such that social networks do not need to emerge to any great degree (for the most part, they are all dyadic pairs, as small groups of agents exploit the same patch of land over and over again). The implication is that some kinds of resources do not need to be tied into social networks to any great degree in order for them to be exploited successfully (these were also some of the longest model runs, another indicator of stability).

What are some of the implications of computationally searching for the networks characteristic of the Roman economy-as-bazaar? If, despite its flaws, this model correctly encapsulates something of the way the Roman economy worked, we have an idea of, and the ability to explore, some of the circumstances that promoted economic stability. It depends on the nature of the resource and the interplay with the degree of transaction costs and the agents’ knowledge of the world. In some situations, ‘patronage’ (as instantiated in the model) serves as a system for enabling continual extraction; in other situations, patronage does not seem to be a factor.

However, with that said, none of the model runs produced networks that had the classical signals of a small-world. This is rather interesting. If we have correctly modeled the way patronage works in the Roman world, and patronage is the key to understanding Rome (cf Verboven 2002), we should have expected that small-worlds would naturally emerge. This suggests that something is missing from the model – or our thinking about patronage is incorrect. We can begin to explore the conundrum by examining the argument made in the code of the simulation, especially in the way agents search for patrons. In the model, it is a local search. There is no way of creating those occasionally long-distance ties. We had initially imagined that the differences in the individual agents’ ‘vision’ would allow some agents to have a greater ability to know more about the world and thus choose from a wider range. In practice, those with greater ‘vision’ were able to find the best patches of resources, indeed, the variability in the distribution of resources allowed these individuals to squat on what was locally best. My ‘competition’ and prestige mechanisms seem to have promoted a kind of path dependence. Perhaps we should have instead included something like a ‘salutatio’, a way for the agents to assess patrons’ fitness or change patrons (cf Graham 2009; Garnsey and Woolf 1989: 154; Drummond 1989: 101; Wallace-Hadrill 1989b: 72-3). Even when models fail, their failures still throw useful light. This failure of my model suggests that we should focus on markets and fairs as not just economic mechanisms, but as social mechanisms that allow individuals to make the long distance links. A subsequent iteration of the model will include just this.

This model will come into its own once there is more and better network data drawn from archaeological, epigraphic, historical sources. This will allow the refining of both the set-up of the model and comparanda for the results. The model presented here is a very simple model, with obvious faults and limitations. Nevertheless, it does have the virtue of forcing us to think about how patronage, resource extraction, and social networks intersected in the Roman economy. It produces output that can be directly measured against archaeological data, unlike most models of the Roman economy. When one finds fault with the model (since every model is a simplification), and with the assumptions coded therein, he or she is invited to download the model and to modify it to better reflect his or her understandings. In this way, we develop a laboratory, a petri-dish, to test our beliefs about the Roman economy. We offer this model in that spirit.

[edited April 4th to make it less clumsy, and to fit in the 15 minute time frame]

 

Simulation as Deformation, or, the Role of Agent Based Modeling in Historical Archaeology

I’ll be at the Society for American Archaeology Annual Meeting next week, presenting in a session on  ‘modeling dynamics in coupled social-natural systems’, and in another on network methods for archaeology. For me, these two approaches are hard to tease apart. Below you’ll find my draft for the modeling session.

Over ten years ago, J.P. Marney and Heather Tarbert published a paper in the Journal of Artificial Societies and Social Simulation called, ‘Why do simulation? Towards a working epistemology for practitioners of the dark arts’.

Today, we’re discussing the potential of modeling for exploring human-environmental interactions in a wide variety of contexts, across an enormous span of time. We’re thinking about society and ecosystems as highly complex systems, where material, energy, and information flows through massively interconnected positive and negative feedback loops.

A dark art, indeed.

Coming to simulation from a background in the humanities – especially Roman archaeology and ancient history – means that my work is oftentimes viewed askance. There are very deep reasons for this, beyond the usual caricatures of ‘social science vs humanities’. There’s a deep history in Western culture surrounding the ways we try to know the future. I generalize horribly, but it seems to me to come down to the difference between the priest and the magician in Greco-Roman society. The priest examines the entrails, watches the flight of birds, performs the rituals correctly, and is rewarded with some glimpse into divine will. The magician, on the other hand, compels the spirits to visit her, through spells and carefully guarded craft, and wrests the certain knowledge of what is to come by dint of her own skill. The priest is ‘fas’, whereas the magician is ‘nefas’, the root of our word ‘nefarious’, meaning contrary to divine law. So too the simulationist.

In the humanities, when we are concerned about the human past, we read the texts closely, we follow our rituals correctly, and we are rewarded with a story about history; in simulation, our skill enables us to raise the dead, putting them through their paces, and we are rewarded with not just one history, but an entire landscape of possible histories.

Indeed, when I talk to humanists about simulation, I sometimes call it ‘practical necromancy’ for this very reason. Classicists don’t generally like what I do, although ancient historians are sometimes ok with it, and archaeologists (non Roman archaeologists) usually just smile and nod and say, ‘yes, so what?’

I have been creating simulations of various aspects of Greco-Roman antiquity for a while now. What I’d like to speak about to you today, is the degree to which these simulations have found traction amongst ancient historians, and what I’ve learned about how to incorporate agent based modeling into the exploration of a historical society like that of the ancient Mediterranean.

The first issue is that there is a sense that it is not at all needed. ‘Agent modeling might be useful for those non-literate societies, but we’ve got more than enough materials to work on here, Shawn’ is the gist of a conversation I once had with a distinguished Romanist. In Marney and Tarbert’s piece, they argue that simulation is perhaps the only way of addressing situations:

  1. Where there are complex emergent global processes and dynamics from simple local behaviour.
  2. Where coordinated global outcomes are generated by the heterogeneous local decision rules. [amongst others]

…which describes Rome pretty nicely. Or human culture more generally.

The next criticism that my distinguished Romanist colleague raised was that my models – indeed, any computational model – were simply tautological, that we only get out what we put in. This is such a weary chestnut to deal with, and perhaps folks in this room don’t need reminding of it, for it fundamentally misunderstands a significant characteristic of complex systems: that the dynamics of one level of organization do not lead linearly to, or necessarily imply, the dynamics of another level. Hence if we are interested in culture, we model at the level of an individual. Thus what comes out of the model is the emergent byproduct of countless individual interactions. What comes out is definitely not what went in.

More Roman historians and archaeologists need to be reading the literature of complex systems studies, I think.

A final issue is about what, exactly, we are modeling. Are we really raising the dead, and simulating the past? No, we are not. We are actually creating zombies. Normally, creating zombies never ends well, but as long as they don’t escape from our computers, all should be ok.

I call these autonomous software agents ‘zombies’ for the very good reason that I need to clearly specify what it is I believe about some phenomenon in the past in order for them to perform that behaviour. What I end up simulating then is not the past, but the story I am telling about the past. This lets me escape nearly all of the criticisms that my colleagues in the humanities raise about this dark art of simulation.

If I am simulating in effect a historiography, then the results, the landscape of possible emergent outcomes, are the consequences of that story I am telling about the past. Simulation becomes a way for me to explore the unintended outcomes of my beliefs about the past. I perform the past; I deform it.

The method forces me to become clear about what it is I believe about the past in an utterly transparent way. If I cannot encode those beliefs, then clearly I need to think more deeply. I use NetLogo for my agent modeling for a couple of reasons. One, its near-English syntax makes it easier for me to develop simulations. Two, it makes it possible for my colleagues to examine the procedural rhetoric of my simulation as well. A simulation is not complete until somebody else opens the hood and examines your code for your mistakes, your assumptions, and for the rhetorics hidden therein. I often tell my students that unless they can look at the code for themselves, they have no reason to believe the results of a simulation. My students are history students, without any great affinity for computing – but with a bit of help, they can easily flow-chart a NetLogo simulation to get a sense of what is going on.

(This, incidentally, is what excites me about the movement towards data-as-publication. I am beginning to put all of my models on Figshare to allow this kind of examination.)

The end result then is that I have found that I have to keep my models as tightly focused as possible. If my model becomes too ambitious, I typically have had two problems. One, it becomes difficult for me to tell the story of what is going on in my model, to tease apart the critical interactions that are producing the landscape of possibilities that have emerged. Two, there is little engagement with my code by those who could best critique it, as it becomes seemingly too complex.

Let me give you an example. In my PhD research, I became interested in the social networks surrounding landholding in the immediate vicinity of Rome during the first three centuries AD. I did some network analysis of this data (stitched together from the epigraphy of stamped bricks), but I wanted to reanimate these patterns. There are many episodes in Roman history of elite self-extermination, as different factions vying for power eliminate rivals through murder, forced suicide, or exile. How much disruption could these networks endure?  Thus, I became interested in the sources of civil violence in the Roman world.

I created a simulation where a population of agents were interlinked in the patterns suggested by the archaeology. Drawing on the literature connected with the Roman tradition of the ‘salutatio’, or morning greeting given by a client to his patron(s), I had prestige, gifts, and money flow over this network as the agents vied for status. No patron has to accept a client who is not suitably prestigious, and no one gains prestige without clients, so some individuals end up shut out of the networks. That exclusion, I argued, was the source of civil violence in the Roman world.
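To make the acceptance rule concrete, here is a minimal sketch in Python. It is not the original Netlogo model: the agent names, the prestige values, the `threshold` parameter, and the rule that a patron gains one unit of prestige per accepted client are all assumptions made for illustration.

```python
def salutatio_round(prestige, patrons_of, threshold):
    """Run one morning of greetings.

    prestige:   dict of agent -> prestige score (hypothetical values)
    patrons_of: dict of client -> list of patrons they are linked to
    threshold:  assumed fraction of a patron's prestige a client must hold
    Returns the updated prestige scores and the clients nobody accepted.
    """
    gained = {agent: 0.0 for agent in prestige}
    excluded = []
    for client, patrons in patrons_of.items():
        accepted = False
        for patron in patrons:
            # A patron need not accept a client who is not prestigious enough.
            if prestige[client] >= threshold * prestige[patron]:
                gained[patron] += 1.0  # assumed: prestige comes only from clients
                accepted = True
        if not accepted:
            excluded.append(client)
    updated = {agent: prestige[agent] + gained[agent] for agent in prestige}
    return updated, excluded


# Hypothetical three-agent network: one patron, two would-be clients.
prestige = {"patron": 10.0, "rich": 6.0, "poor": 1.0}
patrons_of = {"rich": ["patron"], "poor": ["patron"]}

for morning in range(1, 5):
    prestige, excluded = salutatio_round(prestige, patrons_of, threshold=0.5)
    print(morning, sorted(excluded))
```

Even in this toy version, the patron's growing prestige raises the bar for clients, so a client who was acceptable on the first morning is eventually shut out. That is the kind of unintended consequence the full simulation was built to explore.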

I was able to put these agents in a world where the economy ranged from one where everything was roses, to one where everything was sackcloth and ashes; I imagined that there would be no violence in the rose-world, and lots of violence in the sackcloth-world. And yes, this is duly what I saw, but there were interesting, unpredicted bouts of violence where there should be peace, and peace where there should be violence.

Teasing all of this apart became the subject of a journal article – a very long, tedious article. Google Scholar tells me that this article has had precisely zero impact. And I’m quite certain that no one has engaged with the code.

Another model I created has had quite a different trajectory. In this model, I simulate an excruciatingly simple mechanic representing the contentious process of ‘Romanization’. In my model, which is based on an even simpler model of disease transmission, an agent is ‘non-romanized’ until they run into an agent who has become ‘romanized’. Poof, the agent now becomes Romanized. Zombies indeed. (And of course, there are models of zombie infection too! Now we’re just getting recursive.)

The key element here was that the agents were not wandering around in an amorphous space. Rather, they were constrained to move along the paths suggested by the third-century Antonine Itineraries, the lists of towns one would use in order to figure out how to get from point A to point B. To get to Honolulu from Ottawa, go to Toronto, Winnipeg, Calgary, Vancouver, Seattle, Honolulu.

Thus, I was interested in exploring the consequences of this list-like, networked conception of geographic space. I could measure the amount of model time it took for everyone in the model to become ‘Romanized’ as they moved over the network of Roman Spain, versus Roman Britain, versus Roman Gaul, versus Roman Italy. I graphed these results, and the shape of this diffusionist model implied something about the way ideas of Romanness would penetrate, and how deeply, in these different regions. The model thus became a guide for looking at the archaeology in a new way.
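The mechanic is simple enough to sketch in a few lines of Python. What follows is an illustration only, not the published Netlogo code: the itineraries are made-up placeholder towns, and the number of agents, the single seed agent, and the 50% chance of staying put each tick are assumptions.

```python
import random

# Hypothetical itineraries: each list is a route, and consecutive towns are linked.
ITINERARIES = [
    ["A", "B", "C", "D"],
    ["C", "E", "F"],
    ["B", "G"],
]


def build_network(itineraries):
    """Turn itinerary lists into an undirected map of town -> neighbouring towns."""
    neighbours = {}
    for route in itineraries:
        for here, there in zip(route, route[1:]):
            neighbours.setdefault(here, set()).add(there)
            neighbours.setdefault(there, set()).add(here)
    return neighbours


def time_to_full_romanization(neighbours, n_agents=30, seed=1, max_ticks=100_000):
    """Return the tick at which every agent is 'Romanized', or None if never."""
    rng = random.Random(seed)
    options = {town: sorted(adj) for town, adj in neighbours.items()}
    towns = sorted(options)
    positions = [rng.choice(towns) for _ in range(n_agents)]
    romanized = [False] * n_agents
    romanized[0] = True  # a single 'Romanized' agent seeds the process
    for tick in range(1, max_ticks + 1):
        # Each agent either stays put or steps to a neighbouring town.
        positions = [
            rng.choice(options[p]) if rng.random() < 0.5 else p
            for p in positions
        ]
        # Any town holding a 'Romanized' agent converts everyone else there.
        hot_towns = {p for p, r in zip(positions, romanized) if r}
        romanized = [r or p in hot_towns for p, r in zip(positions, romanized)]
        if all(romanized):
            return tick
    return None


network = build_network(ITINERARIES)
print(time_to_full_romanization(network))
```

Swapping in a different set of itineraries and comparing the returned tick counts over many seeds gives the kind of region-by-region comparison described above. Letting agents sometimes stay put matters on tree-like route networks: if every agent moved on every tick, some agents could never occupy the same town at the same time.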

This model, according to Google Scholar, has had much, much more traction. Even better, the code has been queried, taken apart, and made better, being used both in teaching contexts and in others’ research.

Smaller, more constrained models can have bigger impact, I think – at least in the humanities.

This use of agent-based simulation fits into a kind of experimental archaeology mindset, of building as a way of knowing – indeed, it also sits within the developing traditions of the digital humanities. Trevor Owens, a digital archivist with the Library of Congress, recently blogged about the mutual incomprehension of computer scientists and humanists, and it’s worth quoting him at length:

“[...]I don’t think the issue here is different ways of knowing, incompatible paradigms, or anything big and lofty like that. I think the issue at the heart of this back and forth dialog is about two different contexts. This is about what you can do in the generative context of discovery vs. what you can do in the context of justifying a set of claims.”

What Owens argues is that, in the humanities, computational approaches are best suited for ‘the generative world of discovery’. He continues:  “If you aren’t using the results of a digital tool as evidence then anything goes. More specifically, if you aren’t trying to attribute particular inferential value to a particular process that process is simply producing another artifact which you can then go about considering, exploring, probing and analyzing.  I take this to be one of the key values of the idea of “deformance.” The results of a particular computational or statistical tool don’t need to be treated as facts, but instead can be used as part of an ongoing exploration.”

Because we are not simulating the past, but rather instantiating what we believe to be true about the past in a computer model, my sense is that agent modeling will take off in archaeology when it ceases to be a way of justifying our stories about the past and becomes instead a way of generating new stories, new ways of looking at our evidence about the past. And the models need to be small-ish, digestible, and explorable without a team of researchers (to build, well, that’s another matter I suppose).

So here are two new models I am working on, to put my money where my mouth is.

I’ve recently been reading Ian Hodder’s ‘Entangled: An Archaeology of the Relationships between Humans and Things’. In the book, Hodder develops an argument for looking at things not just as if they had agency, but rather as all tangled up in making humans human. He then offers up, by way of a methodological approach to this entangled perspective, a ‘tanglegram’, where all of the dependences and dependencies between humans and things at Çatalhöyük are mapped. He goes on to talk about flows of information or energy through these entanglements.

This to me seems to be a prime candidate for the kind of simulation that I do. If we can tie material culture, place, and humans together in this kind of tanglegram, what are the implications for energy flow? What are the emergent consequences? I begin by turning his figure 9.2 into a network diagram. I use a code snippet from Netlogo to import this same information into Netlogo, transforming the nodes into active agents connected by active links.

Because of the modularity of Netlogo, I don’t necessarily have to begin from scratch to explore the dynamics of this entanglement. Instead, I turned to my old friend, the virus-on-a-network, and gave it the tanglegram to run on.

The question becomes, well, so what? What does this prove? Right now, I’m working on that. At first blush, there seems to be a behaviour space where this tanglegram, this entangled pattern of things and humans, leads to extreme stability over most parameters (i.e., ‘life’ continues), and a small window that leads to paralysis (i.e., the ‘life’ of the model stops). That perhaps could be the beginning of a conversation where we look at entanglements at other times and places, modeling their dynamics, coming up with a comparative study of what patterns lead to change and transformation.

In my other model, my zombies represent amphora types. Here, I’m interested in why different kinds of amphora styles in the Aegean converge – and why some types are always unique. In this model, I have a population of amphorae which are all different. There are humans who flit into this world, and throw away the amphorae that (for whatever reason) are undesirable. Amphorae reproduce, with a certain amount of mutation. Over time, and without centralized direction, there is a convergence of different amphora types (having different origins, different clays) to share the same outward stylistic characteristics.
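A rough Python sketch of that loop follows. It is a stand-in rather than the model itself: reducing ‘style’ to a single number, labelling origins as four hypothetical workshops, and defining ‘undesirable’ as ‘furthest from the current average style’ are all assumptions, and that last one builds the convergence into the rule more bluntly than ‘for whatever reason’ implies.

```python
import random
import statistics


def simulate_amphorae(n=60, generations=50, discard=6, mutation=0.01, seed=42):
    """Discard-and-reproduce loop over a population of amphorae.

    Each amphora is an (origin, style) pair. The origin is fixed at creation;
    the style is a single number that offspring inherit with a small mutation.
    Returns the style spread before and after, plus the final population.
    """
    rng = random.Random(seed)
    population = [(f"workshop-{i % 4}", rng.random()) for i in range(n)]
    initial_spread = statistics.pstdev([style for _, style in population])

    for _ in range(generations):
        mean_style = statistics.mean([style for _, style in population])
        # Assumed rule: humans throw away the amphorae furthest from the norm.
        population.sort(key=lambda amphora: abs(amphora[1] - mean_style))
        survivors = population[: n - discard]
        # Survivors 'reproduce': copies keep their origin, the style mutates.
        offspring = []
        for _ in range(discard):
            origin, style = rng.choice(survivors)
            offspring.append((origin, style + rng.gauss(0, mutation)))
        population = survivors + offspring

    final_spread = statistics.pstdev([style for _, style in population])
    return initial_spread, final_spread, population


before, after, population = simulate_amphorae()
print(f"style spread before: {before:.3f}, after: {after:.3f}")
print("origins still present:", sorted({origin for origin, _ in population}))
```

Changing the discard rule (a moving target, a random whim, a preference tied to origin) is where a different story about stylistic change would enter, which is the point of keeping the model this small.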

This model is still undeveloped; it tells a story of stylistic evolution where it is the amphorae themselves that do the evolving. The stories we often tell, when it comes to pottery styles (and perhaps this is more of a problem in the classical world than elsewhere; I do not know), often seem to me to be a kind of just-so story, of how the camel got its hump. Whether or not you agree with the story I tell in this model, one can at least see how it works, and change the code to tell a better story, using the emergent results to generate a new perspective.

To conclude, then, I think you’ll all agree that we can find archaeological patterns and reanimate some kind of dynamic on those patterns.

But what I’m trying to suggest to you today is that we need to resist building extremely complex models on top of those archaeological patterns. There’s lots of low-hanging fruit around, when it comes to agent-based models. Small models, tightly focussed models, allow us to iterate quickly, to develop quickly, and to use multiple lines of attack on various problems.

If we want buy-in from our other colleagues interested in the human past – whether historians, historical archaeologists, classicists, or ancient historians – then the models have to be immediately digestible, and we have to acknowledge that we use these as generative, as a way of ‘deforming’ our perspectives and our own beliefs about the past, to develop new perspectives and insights.

[edited April 2 to reduce redundancies, fix awkwardness, and to fit it into the 15 minute time slot allotted to me. 300 words removed.]

Hodder’s ‘Tanglegram’ as Network

Hodder's fig 9.2 as network

I am reading Ian Hodder’s book, ‘Entangled: An Archaeology of the Relationships between Humans and Things’. Hodder writes that the tanglegram cannot be represented as a network, since a network doesn’t consider the nature of the relationships or nodes. This is not in fact the case. Representing these complex relationships as a network is quite possible, and allows the ‘tanglegram’ to actually become an object to query in its own right, rather than a suggestive illustration. I’ve uploaded the network data to Figshare:
http://dx.doi.org/10.6084/m9.figshare.654626

I used NodeXL to enter the data. If there was a bidirectional tie, I made two entries: A -> B, B -> A. If it was only one way, I entered it with the directionality of the original tanglegram. I saved it as a .net file, opened it in Gephi, and ran Gephi’s statistics.
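For anyone who would rather script this step than type it into a spreadsheet, the data-entry rule can be sketched in Python as below. The three ties are invented for illustration and are not read off Hodder’s figure; the function names are hypothetical; and the output uses only the basic `*Vertices` / `*Arcs` sections of the Pajek .net format.

```python
def expand_ties(ties):
    """Expand (source, target, bidirectional) ties into directed edges.

    A bidirectional tie becomes two entries, A -> B and B -> A; a one-way
    tie keeps the direction it was given.
    """
    edges = []
    for source, target, bidirectional in ties:
        edges.append((source, target))
        if bidirectional:
            edges.append((target, source))
    return edges


def to_pajek(edges):
    """Render directed edges as the text of a minimal Pajek .net file."""
    labels = []
    for source, target in edges:
        for name in (source, target):
            if name not in labels:
                labels.append(name)
    index = {name: i + 1 for i, name in enumerate(labels)}  # Pajek ids start at 1
    lines = [f"*Vertices {len(labels)}"]
    lines += [f'{index[name]} "{name}"' for name in labels]
    lines.append("*Arcs")
    lines += [f"{index[source]} {index[target]}" for source, target in edges]
    return "\n".join(lines) + "\n"


# Invented example ties, NOT taken from the tanglegram itself.
example_ties = [
    ("clay", "mudbrick", False),
    ("house", "plaster", True),
    ("mudbrick", "house", False),
]
print(to_pajek(expand_ties(example_ties)))
```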

This was all rather rough and ready; because I was working from a blown-up photocopy of the original figure, and I’m trying to get ready for a trip, there could be errors. One would need Hodder’s original data to do this properly, but I offer it up here to show that it’s possible, and indeed worthwhile: why else would you bother drawing a tanglegram, if not to use it to help your analysis?

In the image below, I resize the nodes to represent betweenness centrality (which elements of the tanglegram are doing the heavy lifting?) and recolour them according to modularity. Modularity finds five groups (nodes listed in descending order of betweenness centrality):

Group 0: house, groundstone, burial, plaster, figurines, pigment, skins, painting, personal artefacts, animal heads, food storage, human heads, special food, human body parts, burials, storage rooms, bins

Group 1: hoard, chipped stone, sheep, mats, dung, wild animals, fields, bone, cereals, wooden object, weeds

Group 2: food, hearth, fuel, ash, clay balls, oven, traps, wood

Group 3: clay, baskets, extraction pits, wetland, reeds, birds, dryland, marl, ditches, fish, clean water, landscape, field, eggs

Group 4: midden, dogs, colluvium, mortar, pen, mudbrick

Seems quite suggestive! To get the files for yourself, please see:

Hodder’s Figure 9.2, Entangled, as network. Shawn Graham. figshare.
http://dx.doi.org/10.6084/m9.figshare.654626

Retrieved 17:47, Mar 19, 2013 (GMT)
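
For readers who want to check a betweenness ranking without Gephi, here is a small pure-Python sketch of Brandes’ algorithm for an unweighted, directed edge list. It returns raw (unnormalised) scores, so the numbers need not match Gephi’s output exactly, and the two toy edge lists are placeholders rather than the tanglegram data. It does not attempt the modularity grouping.

```python
from collections import deque


def betweenness(edges):
    """Unnormalised betweenness centrality for a directed, unweighted graph."""
    nodes = sorted({name for edge in edges for name in edge})
    successors = {name: [] for name in nodes}
    for source, target in set(edges):  # ignore accidental duplicate entries
        successors[source].append(target)

    score = {name: 0.0 for name in nodes}
    for source in nodes:
        # Breadth-first search from source, counting shortest paths to each node.
        order = []
        predecessors = {name: [] for name in nodes}
        n_paths = {name: 0 for name in nodes}
        n_paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in successors[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        # Walk back from the farthest nodes, sharing out credit for each path.
        dependency = {name: 0.0 for name in nodes}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    return score


# Toy examples: a one-way chain, then the same chain with two-way ties.
print(betweenness([("a", "b"), ("b", "c")]))
print(betweenness([("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")]))
```

In both toy graphs only the middle node lies on a shortest path between the other two, so it alone receives a non-zero score.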