If I could read your mind – Sonifying John Adams’ Diary

Maybe the question isn’t one of reading someone’s thoughts, but rather, listening to the overall pattern of topics within them. Topic modeling does some rather magical things. It imposes sense (it fits a model) onto a body of text. The topics that the model duly provides give us insight into the semantic patterns latent within the text (but see Ben Schmidt’s WEM approach, which focuses on systems of relationships in the words themselves – more on this anon). There are a variety of ways emerging for visualizing these patterns. I’m guilty of a few myself (principally, I’ve spent a lot of time visualizing the interrelationships of topics as a kind of network graph, e.g. this). But I’ve never been happy with them because they often leave out the element of time. For a guy who sometimes thinks of himself as an archaeologist or historian, this is a bit problematic.

I’ve been interested in sonification for some time, the idea that we represent data (capta) aurally. I even won an award for one experiment in this vein, repurposing the excellent scripts of the Data Driven DJ, Brian Foo. What I like about sonification is that the time dimension becomes a significant element in how the data is represented, and how the data is experienced (cf. this recent interview on Spark with composer/prof Chris Chafe). I was once the chapel organist at Bishop’s University (I wasn’t much good, but that’s a story for another day) so my interest in sonification is partly in how the colour of music, the different instrumentation and so on can also be used to convey ideas and information (rather than using purely algorithmically generated tones; I’ve never had much formal musical training, so I know there’s a literature and language to describe what I’m thinking that I simply must go learn. Please excuse any awkwardness).

So – let’s take a body of text, in this case the diaries of John Adams. I scraped these, one line per diary entry (see this csv we prepped for our book, the Macroscope). I imported them into R and topic modeled for 20 topics. The output is a monstrous csv showing the proportion each topic contributes to the entire diary entry (so each row sums to 1). If you use conditional formatting in Excel, and dial the decimal places to 2, you get a pretty good visual of which topics are the major ones in any given entry (and the really minor ones just round to 0.00, so you can ignore them).
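The "round to two places and ignore the tiny ones" step can also be done outside Excel. Here is a minimal Python sketch of it, assuming one row of topic proportions per diary entry; the function name, the 0.10 threshold and the sample row are all made up for illustration and are not taken from the actual Adams model.

```python
def dominant_topics(row, threshold=0.10, places=2):
    """Return (topic_index, rounded_share) pairs at or above `threshold`,
    largest share first. `row` is one diary entry's topic proportions."""
    total = sum(row)
    if abs(total - 1.0) > 1e-6:
        # Each row of a document-topic table should sum to 1.
        raise ValueError(f"row sums to {total}, expected 1")
    kept = [(i, round(p, places)) for i, p in enumerate(row) if p >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical entry with five topics (the real model has twenty).
entry = [0.62, 0.004, 0.25, 0.003, 0.123]
print(dominant_topics(entry))
```

The threshold is a judgment call: set it too high and the woodwind-style minor topics vanish entirely.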

It rather looks like an old-timey player piano roll:

Player Piano Anyone?

I then used ‘Musical Algorithms‘ one column at a time to generate a midi file. I’ve got the various settings in a notebook at home; I’ll update this post with them later. I then uploaded each midi file (all twenty) into GarageBand in the order of their complexity – that is, as indicated by file size:

Size of a file indicates the complexity of the source. Isn’t that what Claude Shannon taught us?
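For readers who want to see what turning a column of proportions into notes might involve, here is a minimal sketch in Python. It is not what Musical Algorithms does internally (those settings are still in the notebook at home); it simply assumes a linear mapping from a proportion between 0 and 1 onto a range of MIDI note numbers, and the range 36 to 84 is an arbitrary choice.

```python
def proportions_to_pitches(column, low=36, high=84):
    """Linearly scale topic proportions in [0, 1] to integer MIDI note
    numbers in [low, high]. The mapping and range are assumptions."""
    pitches = []
    for p in column:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"proportion out of range: {p}")
        pitches.append(low + round(p * (high - low)))
    return pitches


# One hypothetical topic column: its share in three successive entries.
print(proportions_to_pitches([0.0, 0.5, 1.0]))
```

Each pitch list could then be written to a MIDI track with whatever library you prefer; that step is left out here so the sketch has no dependencies.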

The question then becomes: which instruments do I assign to what topics? In this, I tried to select from the instruments I had readily to hand, and to select instruments whose tone/colour seemed to resonate somehow with the keywords for each topic. Which gives me a selection of brass instruments for topics relating to governance (thank you, Sousa marches); guitar for topics connected perhaps with travels around the countryside (too much country music on the radio as a child, perhaps); strings for topics connected with college and studying (my own study music as an undergrad influencing the choice here); and woodwinds for the minor topics that chirp and peek here and there throughout the text (some onomatopoeia I suppose).

GarageBand’s own native visualization owes much to the player piano aesthetic, and so provides a rolling visualization to accompany the music. I used QuickTime to grab the GarageBand visuals, and iMovie to marry the two together again, since QuickTime doesn’t grab the audio generated within the computer. Then I changed the name of each of the tracks to reflect the keywords for that topic.

Drumroll: I give you the John Adams 20:

A Digital Archaeology of Digital Archaeology: work in progress

Ethan Watrall and I have been playing around with data mining as a way of writing a historiography of digital & computational archaeology. We’d like to invite you to play along.

We’ll probably have something to say on this at the SAA in April. Anyway, we’ve just been chugging along slowly, sharing the odd email, Google Doc, and so on – and a monstrous huge topic model browser I set up. Yesterday, an exchange on Twitter took place that prompted us to share those materials.

This prompted a lot of chatter.

So let’s get this party started, shall we?

~o0o~

While there’s a lot of movement towards sharing data, and open access publications, there’s also this other space of materials that we don’t talk about too much – the things we build from the data that we (sometimes) share that enable us to write those publications we (sometimes) make open access. This intermediate stage never gets shared. Probably with good reason, but I thought given the nature of digital work, perhaps there’s an opportunity here to open not just our research outputs & inputs, but also our process to wider participation.

Hence this post, and all that follows.

~o0o~

Here’s what I did. I scoured JSTOR’s DFR for anglophone journals, from 1935 onwards (full bibliography right here: http://graeworks.net/digitalarchae/20000/#/bib). Then I fitted various topic models to them, using Andrew Goldstone’s dfrtopics, which is an R package using MALLET on the bag-of-words that DFR gives you, running the result through Andrew’s dfr-browser (tag line: “Take a MALLET to disciplinary history!”).

The results can be viewed here. Like I said, this is the middle part of an analysis that we’re sharing here. Want to do some historiography with a distant reading approach? We’d love to see what you spot/think/observe in these models (maybe your students would like a go?). In which case, here’s an open pad for folks to share & comment.

Why would you bother? Well, it occurred to me that I’ve never seen anyone try to crowdsource this step of the process. Maybe it’s a foolish idea. But if folks did, and there was merit to this process, maybe some kind of digital publication could result where all contributors would be authors? Maybe a series of essays, all growing from this same body of analysis? Lots of opportunities.

Stranger things have happened, right?

~o0o~

Just to get you going, here are some of the things I’ve noticed, and some of my still-churning thoughts on what all this might mean (I’ve pasted this from another document; remember, work in progress!):

remembering that in topic modeling, a word can be used in different senses in different topics/discourses (thus something of the semantic sense of a word is preserved)

tools used:

-stanford tmt for detailed view on CAA (computer applications in archaeology)

-mimno’s browser based jslda for detailed view of correlations between topics (using CAA & IA) (internet archaeology, only the open access materials before it went fully OA in October 2014)

-Goldstone’s dfrtopics for R and dfr-browser to visualize 21 000 articles as entire topic model

-same again for individual journals: AJA, JFA, AmA, CA, JAMT, WA

——-

stanford tmt of caa 1974 – 2011

-no stoplist used; discards most prominent and least likely words from the analysis

-its output is formatted in such a way that it becomes easy to visualize the patterns of discourse over time (MALLET, the other major tool for doing topic modeling, requires much more massaging to get the output in such a form. The right tool for the right job).

-30 topics gives good breakdown; topic 26 contains garbage (‘caa proceedings’ etc as topic words)

In 1974, the most prominent topics were:

topic 1 – computer, program, may, storage, then, excavation, recording, all, into, form, using, retrieval, any, user, output, records, package, entry, one, unit

topic 6: but, they, one, time, their, all, some, only, will, there, would, what, very, our, other, any, most, them, even

topic 20: some, will, many, there, field, problems, may, but, archaeologists, excavation, their, they, recording, however, record, new, systems, most, should, need

The beginnings of the CAA are marked by hesitation and prognostication: what *are* computers for, in archaeology? There is a sense that for archaeologists, computation is something that will be useful insofar as it can be helpful for recording information in the field. With time, topic 1 diminishes. By 2000 it is nearly non-existent.  The hesitation expressed by topics 6 and 20 continues though. Archaeologists do not seem comfortable with the future.

Other early topics that thread their way throughout the entire period are topics 5, 2, 27 and 28:

Topic 5: matrix, units, stratigraphie, relationships, harris, unit, between, method, each, attributes, may, two, diagram, point, other, seriation, one, all, stratigraphy, sequence

Topic 2: area, survey, aerial, north, features, sites, region, located, excavation, river, areas, during, field, its, large, project, south, water, over, fig

Topic 27: sites, monuments, heritage, national, record, management, cultural, records, development, systems, england, database, english, its, survey, new, will, also, planning, protection.

Topic 28: museum, museums, collections, project, national, documentation, all, database, archives, about, archive, objects, sources, documents, university, text, our, also, collection, reports.

These topics suggest the ‘what’ of topic 1: how do we deal with contexts and units? Large surveys? Sites and monuments records and museum collections? Interestingly, topics 27 and 28 can be taken as representing something of the professional archaeological world (as opposed to ‘academic’ archaeology).

Mark Lake, in a recent review of simulation and modeling in archaeology (JAMT 2014) describes various trends in modeling [discuss]. Only topic 9 seems to capture this aspect of computational/digital archaeology:

model, models, social, modeling, simulation, human, their, between, network, approach, movement, networks, past, different, theory, how, one, population, approaches, through

Interestingly, for this topic, there is a thin thread from the earliest years of the CAA to the present (2011), with brief spurts in the late 70s and late 80s, then a consistent presence throughout the 90s, with a larger burst from 2005-2008. Lake characterizes thus…. [lake]. Of course, Lake also cites various books and monographs which this analysis does not take into account.

If we regard ‘digital archaeology’ as something akin to ‘digital humanities’ (and so distinct from ‘archaeological computation’) how does it, or does it even, appear in this tangled skein? A rough distinction between the two perspectives can be framed using Trevor Owens’s meditation on what computation is for. Per Owens, we can think of a humanistic use of computing as one that helps us deform our materials, to give us a different perspective on them. Alternatively, one can think of computing as something that helps us justify a conclusion. That is, the results of the computation are used to argue that such-a-thing is most likely in the past, given this model/map/cluster/statistic. In which case, there are certain topics that seem to imply a deformation of perspective (and thus, a ‘digital archaeology’ rather than an archaeological computation):

topic 03: cultural, heritage, semantic, model, knowledge, systems, web, standards, ontology, work, domain, conceptual, different, crm, between, project, based, approach

topic 04: knowledge, expert, process, its, artefacts, set, problem, different, concepts, human, systems, but, they, what, our, scientific, about, how, all, will

topic 07: project, web, digital, university, internet, access, online, service, through, electronic, http, european, technologies, available, public, heritage, will, services, network, other

topic 14: virtual, reality, museum, public, visualization, models, reconstruction, interactive, museums, multimedia, heritage, environment, scientific, reconstructions, will, computer, technologies, environments, communication

topic 29: gis, spatial, time, within, space, temporal, landscape, study, into, social, approaches, geographic, applications, approach, features, environmental, based, between, their, past

Topic 3 begins to emerge in 1996 (although its general discourse is present as early as 1988).  Topic 4 emerges with strength in the mid 1980s, though its general thrust (skepticism about how knowledge is created?) runs throughout the period. Topic 7 emerges in 1994 (naturally enough, when the internet/web first hit widespread public consciousness). Should topic 7 be included in this ‘digital archaeology’ group? Perhaps, inasmuch as it also seems to wrestle with public access to information, which would seem not to be about justifying some conclusion about the past but rather opening perspectives upon it. Topic 14 emerges in the early 1990s.

Topic 29, on first blush, would seem to be very quantitative. But the concern with time and temporality suggests that this is a topic that is trying to get to grips with the experience of space. Again, like the others, it emerges in the late 1980s and early 1990s. [perhaps some function of the personal computer revolution..? instead of being something rare and precious -thus rationed and only for ‘serious’ problems requiring distinct answers – computing power can now be played with and used to address less ‘obvious’ questions?]

What of justification? These are the topics that grapple with statistics and quantification:

Topic 10: age, sites, iron, settlement, early, bronze, area, burial, century, one, period, their, prehistoric, settlements, grave, within, first, neolithic, two, different

Topic 11: pottery, shape, fragments, classification, profile, ceramics, vessels, shapes, vessel, sherds, method, two, ceramic, object, work, finds, computer, fragment, matching, one

Topic 13: dating, radiocarbon, sampling, london, dates, some, but, between, than, e.g., statistical, chronological, date, there, different, only, sample, results, one, errors

Topic 15: landscape, project, study, landscapes, studies, cultural, area, gis, human, through, their, its, rock, history, historical, prehistoric, environment, our, different, approach

Topic 17: study, methods, quantitative, techniques, approach, statistical, using, method, studies, number, artifacts, results, variables, two, most, bones, based, various, analyses, applied

Topic 19: statistical, methods, techniques, variables, tiie, statistics, density, using, cluster, technique, multivariate, method, two, nottingham, example, principal, some, university

Topic 21: model, predictive, modelling, models, cost, elevation, viewshed, surface, sites, gis, visibility, van, location, landscape, areas, one, terrain, dem, digital

topic 23: image, digital, documentation, images, techniques, laser, scanning, models, using, objects, high, photogrammetry, methods, model, recording, object, surveying, drawings, accuracy, resolution

topic 24: surface, artefact, distribution, artefacts, palaeolithic, materials, sites, deposits, within, middle, area, activity, during, phase, soil, processes, lithic, survey, remains, france

Macroscopic patterns

This detail of the overall flow of topics in the CAA proceedings points to the period 1978 – 1983 as a punctuation point, an inflection point, of new topics within the computers-and-archaeology crowd. The period 1990-2011 contains minor inflections around 1997 and 2008.

1997-1998

1990-2011
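One way to go looking for such inflection points without eyeballing a chart is to average the topic proportions per year and measure how much the mix shifts from one year to the next. What follows is a minimal Python sketch of that idea, not the Stanford TMT procedure; the function names and the tiny two-topic sample are hypothetical.

```python
from collections import defaultdict


def yearly_topic_means(docs):
    """docs: iterable of (year, [topic proportions]).
    Returns {year: [mean share of each topic in that year]}."""
    sums = {}
    counts = defaultdict(int)
    for year, props in docs:
        if year not in sums:
            sums[year] = [0.0] * len(props)
        for k, p in enumerate(props):
            sums[year][k] += p
        counts[year] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sorted(sums)}


def year_over_year_change(means):
    """Total absolute change in topic shares between consecutive years.
    Large values are candidate 'punctuation points'."""
    years = sorted(means)
    return {
        later: sum(abs(b - a) for a, b in zip(means[earlier], means[later]))
        for earlier, later in zip(years, years[1:])
    }


# Hypothetical data: four papers, two topics.
papers = [(1977, [0.8, 0.2]), (1977, [0.6, 0.4]),
          (1978, [0.2, 0.8]), (1979, [0.25, 0.75])]
change = year_over_year_change(yearly_topic_means(papers))
print(max(change, key=change.get))  # the year with the biggest shift
```

With only a handful of papers in a given year the averages will be noisy, so a big jump should be read against how many documents sit behind it.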

In terms of broad trends, pivot points seem to be the late 70s, 1997, 2008. Given that our ‘digital archaeology’ themes emerge in the late 90s, let’s add Internet Archaeology to the mix [why this journal, why this time: because of the 90s inflection point? quicker publication schedule? ability to incorporate novel outputs that could never be replicated in print?]. This time, instead of searching for topics, let’s see what correlates with our digital archaeology topics. For this, David Mimno’s browser based LDA topic model is most useful. We run it for 1000 iterations, and find the following correlation matrix.

[insert discussion here]

http://www.graeworks.net/digitalarchae/mimno/jslda.html?docs=caa_and_intarch.txt&stoplist=en.txt&topics=30

-1000 iterations. Your 1000 iterations will be slightly different from mine, because this is a probabilistic approach

– the browser produces csv files for download, as well as a csv formatted for visualizing patterns of correlation as a network in Gephi or other network visualization software.

-stop list is en, fr, de from MALLET + archaeology, sites, data, research

-running this in a browser is not the most efficient way of doing this kind of analysis, but the advantage is that it allows the reader to explore how topics sort themselves out, and its visualization of correlated topics is very effective and useful.

-note word usage. Mimno’s browser calculates the ‘specificity’ of a word to a topic. The closer to 1.0, the closer the word is distributed only within a single topic. Thus, we can take such words as being true ‘keywords’ for particular kinds of discourses. [which will be useful in exploring the 20000 model]. “Computer” has a specificity of 0.61, while “virtual” has a specificity of 0.87, meaning that ‘computer’ is used in a number of topics, while ‘virtual’ is almost exclusively used in a single discourse. “Predictive” has a specificity of 1, and “statistical” of 0.9.
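To make ‘specificity’ concrete, here is a minimal Python sketch of one plausible definition: one minus the entropy of a word’s distribution across topics, divided by the maximum possible entropy. That this is the formula behind the browser’s numbers is an assumption on my part, not something checked against its source; the function name and the counts are invented for illustration.

```python
import math


def specificity(topic_counts):
    """1 minus the normalised entropy of a word's counts across topics.
    1.0 means the word lives in a single topic; 0.0 means it is spread
    evenly over all of them. (Assumed definition, see text.)"""
    total = sum(topic_counts)
    k = len(topic_counts)
    if total == 0 or k < 2:
        raise ValueError("need a non-empty word and at least two topics")
    entropy = -sum((c / total) * math.log(c / total)
                   for c in topic_counts if c > 0)
    return 1.0 - entropy / math.log(k)


# Hypothetical counts of one word in each of four topics.
print(specificity([10, 0, 0, 0]))   # confined to one topic
print(specificity([5, 5, 5, 5]))    # spread evenly
print(specificity([8, 1, 1, 0]))    # mostly one topic
```

Under this definition a word like ‘virtual’ at 0.87 would have nearly all its occurrences assigned to one topic, which matches the reading given above.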

In the jsLDA model, there are three topics that deal with GIS.

topic 19, gis landscape spatial social approach space study human studies approaches

topic 18, database management systems databases gis web software user model tool

topic 16, sites gis landscape model predictive area settlement modelling region land

The first, topic 19, seems to correspond well with our earlier topic that we argued was about using GIS to offer a new perspective on human use/conception of space (ie, a ‘digital’ approach, in our formulation). Topics 18 and 16 are clearly about GIS as a computational tool. In the correlation matrix below, blue equals topics that occur together greater than expected, while red equals less than expected; the size of the dot gives an indication of how much. Thus, if we look for the topics that go hand in hand with topic 19, the strongest are topic 16 (the predictive power of GIS), and topic 10 (social, spain, simulation, networks, models).

The ‘statistical, methods, techniques, artefact, quantitative, statistics, artefacts’ topic is positively correlated with ‘human, material, palaeolithic’, ‘time, matrix, relationship’, and ‘methods, points, point’ topics. This constellation of topics is clearly a use of computation to answer or address very specific questions.

-in jslda there’s a topic ‘database project digital databases web management systems access model semantic’ – positively correlated with ‘publication project electoric’, ‘text database maps map section user images museum’, ‘excavation recording’, ‘vr model’,  ‘cultural heritage museum’, ‘italy gis’, ‘sites monuments record’ [see keys.csv for exact label]. These seem to be topics that deal with deforming our perspectives while at the same time intersecting with extremely quantitative goals.
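The idea of topics occurring together ‘greater than expected’ or ‘less than expected’ can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the browser’s own calculation: a topic counts as present in a document when its proportion reaches a threshold (0.05 here, an arbitrary choice), and each pair is scored by the log of observed co-occurrence over what independence would predict.

```python
import math
from itertools import combinations


def cooccurrence_log_ratios(doc_topics, threshold=0.05):
    """doc_topics: list of per-document topic proportion lists.
    Returns {(i, j): log(observed / expected)} for topic pairs.
    Positive = together more than expected (the blue dots);
    negative = less than expected (the red dots)."""
    n_docs = len(doc_topics)
    n_topics = len(doc_topics[0])
    present = [{k for k, p in enumerate(row) if p >= threshold}
               for row in doc_topics]
    single = [sum(1 for s in present if k in s) for k in range(n_topics)]
    ratios = {}
    for i, j in combinations(range(n_topics), 2):
        both = sum(1 for s in present if i in s and j in s)
        if both == 0 or single[i] == 0 or single[j] == 0:
            continue  # the log ratio is undefined for pairs never seen together
        expected = single[i] * single[j] / n_docs
        ratios[(i, j)] = math.log(both / expected)
    return ratios


# Hypothetical corpus: four documents, three topics.
docs = [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0], [0.0, 0.5, 0.5]]
for pair, score in cooccurrence_log_ratios(docs).items():
    print(pair, round(score, 3))
```

The resulting pairs and scores are exactly the kind of edge list that can be loaded into Gephi as a weighted network.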

So far, we have been reading distantly some 40 years of archaeological work that is explicitly concerned with the kind of archaeology that uses computational and digital approaches. There are punctuation points, ‘virages’, and complicated patterns – there is no easy-to-see disjuncture between what the digital humanists imagine is the object of using computers, and their critics who see computation as positivism by the back door. It does show that archaeology should be regarded as an early mover in what has come to be known as ‘the digital humanities’, with quite early sophisticated and nuanced uses of computing. But how early? And how much has archaeological computing/digital archaeology permeated the discipline? To answer these questions, we turn to a much larger topic model.

Zoom Out Some More

Let’s put this into a broader context. 24 journals from JSTOR were selected for both general coverage of archaeology and for regional/topical specialities. The resulting dataset contains 21000 [get exact number] articles, mostly from the past 75 years (a target start date of 1940 was selected for journals whose print run predates the creation of the electronic computer, thus computer = machine and not = woman who computes). 100 topics seemed to capture the range of thematic discourses well. We looked first for topics that seem analogous to the CAA & IA topics (CAA and IA were not included in this analysis because they are not within the JSTOR DFR database; Goldstone’s DFR Browser was used for the visualization of the topics). [better explanation, rationale, to be written, along with implications]. We also observe ‘punctuation points’ in this broader global (anglosphere) representation of archaeology that correspond with the inflection points in the small model, many trends that fit but also other trends that do not fit with standard historiography of archaeology. We then dive into certain journals (AJA, JFA, AmA, JAMT) to tease these trends apart. Just what has been the impact of computational and digital archaeology in the broader field?

The silhouette in the second column gives a glimpse into the topic’s prevalence over the ca 75 years of the corpus. The largest topic, topic 10, with its focus on ‘time, made, work, years, great, place, make’ suggests a kind of special pleading, that in the rhetoric of archaeological argument, one always has to explain just why this particular site/problem/context is important. A similar topic was observed in the model fitted to the CAA & IA [-in 20000 model, there’s the ‘time’ topic time made work years great place make long case fact point important good people times; it’s the largest topic, and accounts for 5.5%. here, there is one called ‘paper time work archaeologists introduction present important problems field approach’. it’s slightly correlated with every other topic. Seems very similar. ]

More interesting are the topics a bit further down the list. Topic 45 (data, analysis, number, table, size, sample) is clearly quantitative in nature, and its silhouette matches our existing stories about the rise of the New Archaeology in the late 60s and early 70s. Topics 38 and 1 seem to be topics related to describing finds – ‘found, site, stone, small, area’; ‘found, century, area, early, excavations’. Topic 84 suggests the emergence of social theories and power – perhaps an indication of the rise of Marxist archaeologies? Further down the list we see ‘professional’ archaeology and cultural resource management, with peaks in the 1960s and early 1980s.

Topic 27 might indicate perspectives connected with gender archaeology – “social, women, material, gender, men, objects, female, meaning, press, symbolic” – and it accounts for 0.8% of the corpus: about 160 articles.  ‘Female’ appears in four topics, topic 27, topic 65 (‘head, figure, left, figures, back, side, hand, part’ – art history? 1.4% of the corpus) topic 58 (“skeletal, human, remains, age, bone”- osteoarchaeology, 1.1% of the corpus), and topic 82 (“age, population, human, children, fertility” – demographics? 0.8% of the corpus).

[other words that would perhaps key into major trends in archaeological thought? looking at these topics, things seem pretty conservative, whatever the theorists may think, which is surely important to draw out and discuss]

Concerned as we are to unpick the role of computers in archaeology more generally, if we look at the word ‘data’ in the corpus, we find it contributes to 9 different topics (http://graeworks.net/digitalarchae/20000/#/word/data ). It is the most important word in topic 45 (data, analysis, number, table, size, sample, study) and in topic 55 (data, systems, types, information, type, method, units, technique, design). The word ‘computer’ is also part of topic 55. Topic 45 looks like a topic connected with statistical analysis (indeed, ‘statistical’ is a minor word in that topic), while topic 55 seems to be more ‘digital’ in the sense we’ve been discussing here. Topic 45 is present in 3.2% of the corpus, growing in prominence from the early 1950s, falling in the 60s, and resurging in the 70s, and then decreasing to a more or less steady state in the 00s.

Topic 55 holds some surprises:

The papers in 1938 come from American Antiquity volume 4 and show an early awareness of not just quantitative methods, but also the reflective way those methods affect what we see [need to read all these to be certain of this]

next steps

– punctuation points – see http://graeworks.net/digitalarchae/20000/#/model/yearly

major – 1940 (but perhaps an artefact of the boundaries of the study)

minor- early 1950s

minor- mid 1960s

major- 1976 (american antiquity does something odd in this year)

major- 1997-8

 

Topics as Word Clouds

Elijah Meeks and Matt Jockers both have used word clouds to visualize topics from topic models. Colour, orientation, relative placement of the words – all of these could be used to convey different dimensions of the data. Below, you’ll find clouds for each of my initial 50 topics generated from the Roman materials in the Portable Antiquities Scheme database (some 100 000 rows, or nearly 1/5 the database, collected together into ‘documents’ where each unitary district authority is the ‘document’ and the text is the descriptions of things found there). The word clouds are generated from the word weights file that MALLET can output. There are 8100 unique tokens when I convert the database into a MALLET file; each one of those is present in each ‘bag of words’ or topic that MALLET generates, but to differing degrees. Thus, word clouds (here generated with Wordle) pull out important information that the word keys document does not. However, given that I optimized the interval whilst generating the topic models, the keys document provides an indication of the strength of the topic in the corpus. I’ve arranged the word clouds scaling them against the size of the strongest topic (topic 22), top-bottom, left-right. I’ll be damned if I can get wordpress to just display each image under the other one. Even stripped my table out, it did!

At any rate, as one churns through the 50 topics, after about the first 11 (depicted below), the topics get progressively more noisy as MALLET attempts to deal with incomplete transcriptions of the epigraphy of the coins, and the frequent notes about the source for the identification of the coins (the work of Guest & Wells). The final topic depicted here, topic 20, directly references a note often left in the database concerning the quality of an individual record; these frequently are in connection with materials that entered the British Museum collection before the Portable Antiquities Scheme got going and hence the information is not up to usual standards.

This exercise then suggests to me that 50 topics is just too much. I’m rerunning everything with 10 topics this time.
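For anyone wanting to build clouds like these from their own model, here is a minimal Python sketch that pulls the heaviest words per topic out of a word weights file. It assumes the file has one tab-separated line per topic and word, in the order topic number, word, weight, which is how I understand MALLET’s output; the function name and the sample lines are hypothetical.

```python
from collections import defaultdict


def top_words(lines, n=3):
    """lines: iterable of 'topic<TAB>word<TAB>weight' strings.
    Returns {topic: [(word, weight), ...]} with the n heaviest words."""
    weights = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        topic, word, weight = line.split("\t")
        weights[int(topic)].append((word, float(weight)))
    return {t: sorted(ws, key=lambda w: w[1], reverse=True)[:n]
            for t, ws in weights.items()}


# Hypothetical lines standing in for a real word weights file.
sample = [
    "0\tcoin\t120.01",
    "0\tdenarius\t45.01",
    "0\tbrooch\t0.01",
    "1\tcoin\t0.01",
    "1\tbrooch\t88.01",
]
print(top_words(sample, n=2))
```

Because every token appears in every topic with at least a small weight, trimming to the top few dozen words per topic is what keeps a cloud legible.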

Topic 22

Topic 48

Topic 43

Topic 32

Topic 7

Topic 33

Topic 13

Topic 47

Topic 46

Topic 35

Topic 20

Topic modeling the things that fell out of pockets

Modern Districts by Modularity, overlain with hand-drawn 1st century civitas boundaries

Topic modeling is very popular at the moment in the digital humanities. Ian, Scott and I described them as tools for extracting topics or injecting semantic meaning into vocabularies: “Topic models represent a family of computer programs that extract topics from texts. A topic to the computer is a list of words that occur in statistically meaningful ways. A text can be an email, a blog post, a book chapter, a journal article, a diary entry – that is, any kind of unstructured text” (Graham, Weingart, and Milligan 2012). In that tutorial, ‘unstructured’ means that there is no encoding in the text by which a computer can model any of its semantic meaning.

But there are topic models of ships’ logs, of computer code. So why not archaeological databases?

Archaeological datasets are rich, largely unstructured bodies of text. While there are examples of archaeological datasets that are coded with semantic meaning through xml and Text Encoding Initiative practices, many of these are done after the fact of excavation or collection. Day to day, things can be rather different, and this material can be considered to be  ‘largely unstructured’ despite the use of databases, controlled vocabulary, and other means to maintain standardized descriptions of what is excavated, collected, and analyzed. This is because of the human factor. Not all archaeologists are equally skilled. Not all data gets recorded according to the standards. Where some see few differences in a particular clay fabric type, others might see many, and vice versa. Archaeological custom might call a particular vessel type a ‘casserole’, thus suggesting a particular use, only because in the 19th century when that vessel type was first encountered it reminded the archaeologist of what was in his kitchen – there is no necessary correlation between what we as archaeologists call things and what those things were originally used for. Further, once data is recorded (and the site has been destroyed through the excavation process), we tend to analyze these materials in isolation. That is, we write our analyses based on all of the examples of a particular type, rather than considering the interrelationships amongst the data found in the same context or locus. David Mimno in 2009 turned the tools of data analysis on the databases of household materials recovered and recorded room by room at Pompeii. He considered each room as a ‘document’ and the artefacts therein as the ‘tokens’ or ‘words’ within that document, for the purposes of topic modeling. The resulting ‘topics’ of this analysis are what he calls ‘vocabularies’ of object types which when taken together can suggest the mixture of functions particular rooms may have had in Pompeii. 
He writes, ‘the purpose of this tool is not to show that topic modeling is the best tool for archaeological investigation, but that it is an appropriate tool that can provide a complement to human analysis….mathematically concrete in its biases’. The ‘casseroles’ of Pompeii turn out to have nothing to do with food preparation, in Mimno’s analysis. To date, I believe this is the only example of topic modeling applied to archaeological data.
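The room-as-document idea is easy to sketch. The following minimal Python example is an illustration of the general approach rather than Mimno’s actual code: it groups artefact records by room and joins each room’s object types into a single string of tokens, ready to be handed to a topic modeling tool. The room identifiers and object types are invented.

```python
from collections import defaultdict


def rooms_to_documents(records):
    """records: iterable of (room_id, artefact_type) pairs.
    Returns {room_id: 'token token ...'}, one 'document' per room,
    with multi-word artefact types joined into single tokens."""
    docs = defaultdict(list)
    for room, artefact in records:
        token = artefact.strip().lower().replace(" ", "_")
        docs[room].append(token)
    return {room: " ".join(tokens) for room, tokens in docs.items()}


# Hypothetical finds records.
finds = [
    ("house1_room3", "Casserole"),
    ("house1_room3", "Loom weight"),
    ("house1_room3", "loom weight"),
    ("house2_room1", "Lamp"),
]
print(rooms_to_documents(finds))
```

Repeating a token for every object found keeps the counts, so a room with many loom weights weighs more heavily towards whatever ‘vocabulary’ loom weights belong to.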

Directly inspired by that example, I’ve been exploring the use of topic models on another rich archaeological dataset, the Portable Antiquities Scheme database in the UK. The Portable Antiquities Scheme is a project “to encourage the voluntary recording of archaeological objects found by members of the public in England and Wales”. To date, there are over half a million unique records in the Scheme’s database. These are small things, things that fell out of pockets, things that often get found via metal-detecting.

Here’s what I’ve been doing.

1. I downloaded a nightly dump of the PAS data back in April; it came as a csv file. I opened the file, and discovered over a million lines of records. Upon closer examination, I think what happened is something to do with the encoding – there are line breaks, carriage returns, and other non-printing characters (as well as commas being used within fields), so that when I open the file I end up with a single record (say a coin hoard) occupying tens of lines, or with fields shifting at the extraneous commas.

2. I cleaned this data up using Notepad++ and the liberal use of regular expressions to put everything back together again. The entire file is something like 385 MB.

3. I imported it into MS Access so that I could begin to filter it. I’ve been playing with Palaeolithic, Mesolithic, and Neolithic records; Bronze Age records; and Roman records. The Roman material itself occupies somewhere around 100 000 unique records.

4. I exported my queries so that I would have a simpler table with dates, descriptions, and measurements.

5. I filtered this table in Excel so that I could copy and paste out all of the records found within a particular district (which left me with a folder of 275 files, totaling something like 25 MB of text).

6. Meanwhile, I began topic modeling the unfiltered total PAS database (just after #2 above). Each run takes about 3 hours, as I’ve been running diagnostics to explore the patterns. The problem I have here, though, is this: what, precisely, am I finding? What does a cluster of records that share a topic actually mean, archaeologically? Do topics sort themselves out by period, by place, by material, by finds officer…?

7. As that’s been going on, I’ve been topic modeling the folders that contain the districts of England and Wales for a given period. Let’s look at the Roman period.
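The copy-and-paste in step 5 could probably be scripted. What follows is only a rough sketch, not what I actually did: it assumes the cleaned export is a properly quoted csv, and the column names `district` and `description` are hypothetical stand-ins for whatever the real table uses. It writes one text file per district, ready to be handed to a topic modeling tool as a folder of documents.

```python
import csv
from collections import defaultdict
from pathlib import Path


def split_by_district(csv_path, out_dir, district_col="district",
                      text_col="description"):
    """Write one plain-text file per district, one description per line.

    The column names are hypothetical; check them against the real export.
    Returns a dict mapping each district to its number of records.
    """
    by_district = defaultdict(list)
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            district = (row.get(district_col) or "").strip()
            text = (row.get(text_col) or "").strip()
            if district and text:
                # Collapse internal whitespace so each record stays on one line.
                by_district[district].append(" ".join(text.split()))

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for district, descriptions in by_district.items():
        safe_name = district.lower().replace(" ", "_").replace("/", "_")
        (out / f"{safe_name}.txt").write_text(
            "\n".join(descriptions) + "\n", encoding="utf-8")
    return {d: len(v) for d, v in by_district.items()}
```

The returned counts give a quick look at how unevenly the records are spread across districts.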

There are 275 files, where a handful have *a lot* of data (> 1000 KB), while the vast majority are fairly small (< 100 KB). Perhaps that replicates patterns of metal detecting – see Bevan on biases in the PAS. The remaining districts seem to have no records in the database. So I’ve got 80% coverage for all of England and Wales. I’ve been iterating over all of this data, so I’ll just describe the most recent run, as it seems to be a typical result. Using MALLET 2.0.7, I made a topic model with 50 topics (with the optimize-interval option turned on, to shake out the useful from the not-so-useful topics). Last night, as I did this, the topic diagnostics package just wouldn’t work for me (you run it from the MALLET directory, but it lives at the MALLET site; perhaps they were working on it). So I’ll probably want to run all these again.
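For anyone wanting to repeat a run like this, here is a minimal sketch of the two MALLET steps (import the folder, then train), wrapped in Python so they can be re-run easily. The 50 topics come from the run described above. The interval value of 20, the `--remove-stopwords` switch, the folder name, and the file names are my assumptions for illustration, not a record of the exact settings used.

```python
import subprocess


def mallet_commands(input_dir, num_topics=50, optimize_interval=20,
                    mallet_bin="bin/mallet"):
    """Build the two MALLET command lines: import the folder, then train.

    The interval of 20 and the file names are assumed values, not the
    settings from the original run.
    """
    import_cmd = [
        mallet_bin, "import-dir",
        "--input", input_dir,
        "--output", "roman.mallet",
        "--keep-sequence",
        "--remove-stopwords",  # assumption: English stopwords dropped
    ]
    train_cmd = [
        mallet_bin, "train-topics",
        "--input", "roman.mallet",
        "--num-topics", str(num_topics),
        "--optimize-interval", str(optimize_interval),
        "--output-topic-keys", "keys.txt",
        "--output-doc-topics", "composition.txt",
    ]
    return import_cmd, train_cmd


def run_model(input_dir):
    """Run both steps from the MALLET directory; raises if either fails."""
    for cmd in mallet_commands(input_dir):
        subprocess.run(cmd, check=True)
```

Calling `run_model("roman_districts")` from inside the MALLET directory (the folder name is hypothetical) would leave the topic keys and the composition document side by side.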

If I sort the topic keys by their prominence (see ‘optimize interval’), the top 14 all seem to describe different kinds of objects: brooches, denarii, nummus, sherds, lead weights, radiate, coin dates, the ‘heads’ sides of coins (which Emperor). Then we get to the next topic, which reads: “record central database recording usual standards fall created scheme aware portable began antiquities rectify working corroded ae worn century”. This meta-note about data quality appears throughout the database, and refers to materials collected before the Scheme got going.
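Sorting the keys by prominence is easy to do by hand, but here is a small sketch in case it helps. It assumes the topic keys file has three tab-separated columns per line (topic number, weight, top words), which is how MALLET writes it as far as I know. The function name is my own.

```python
def rank_topics(keys_text):
    """Return (weight, topic_id, words) tuples, most prominent topic first.

    Assumes each non-empty line is: topic id, tab, weight, tab, top words.
    """
    topics = []
    for line in keys_text.splitlines():
        if not line.strip():
            continue
        topic_id, weight, words = line.split("\t", 2)
        topics.append((float(weight), int(topic_id), words.strip()))
    # Tuples sort on their first element, so this orders by weight.
    return sorted(topics, reverse=True)
```

Reading the keys file into a string and passing it to `rank_topics` gives the list in the order discussed above.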

After that, the remaining topics all seem to deal with the epigraphy of coins, and the various inscriptions, figurative devices, their weights & materials. A number of these topics also include allusions to the work of Guest and Wells, whose work on Iron Age Coins is frequently cited in the database.

Let’s look at the individual districts now, and how these topics play over geographic space. Given that these are modern districts, it’d be better – perhaps – to do this over again with the materials sorted into geographic entities which make sense from a Roman perspective. Perhaps do it by major Roman roads (sorting the records so that districts through which Watling Street passes are gathered into a single document). Often what people do when they want to visualize the patterns of topic interconnections in a corpus is to trim the composition document so that only topics greater than a certain threshold are imported to a package like Gephi.

My suspicion is that that would throw out a lot of useful data. It may be that it’s the very weak connections that matter. A very strong topic-document relationship might just mean that a coin hoard found in the area is blocking the other signals.

In which case, let’s bring the whole composition document into Gephi. Start with this:

adur 4 0.238806 15 0.19403 22 0.179104 13 0.119403 17 0.089552

and delete out the edge weights. (I’m trying to figure out how to do what follows without deleting those edge weights, but bear with me.)

You end up with something like this:

adur 4 15 22  […etc…]

Save the file with a new name, as csv.

Open in Notepad++ (or similar) and replace the commas with ;
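The deleting and replacing could also be scripted. A minimal sketch, assuming every line of the composition document looks like the `adur` line above (a district name followed by alternating topic numbers and weights, separated by whitespace):

```python
def to_adjacency_line(composition_line, sep=";"):
    """Drop the weights from one composition row, keeping district and topics."""
    parts = composition_line.split()
    district, rest = parts[0], parts[1:]
    # Every other item is a topic number; the items in between are weights.
    topics = rest[0::2]
    return sep.join([district] + topics)


line = "adur 4 0.238806 15 0.19403 22 0.179104 13 0.119403 17 0.089552"
print(to_adjacency_line(line))  # prints adur;4;15;22;13;17
```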

Go to Gephi. Under ‘open graph file’, select your csv file. This is not the same as ‘import spreadsheet’ under the data table tab. Through ‘open graph file’ you can import a csv file where the first item on a line is a node, and each subsequent item is another node to which it is attached. If you tried to open that file under the ‘import spreadsheet’ button, you’d get an error message – in that dialogue, you have to have two columns, source and target, where each row describes a single relationship. See the difference?

This is why, if you left the edge weights in the csv file – let’s call it an adjacency file – you’d end up with weights becoming nodes, which is a mess. If you want to keep the weights, you have to do the second option.

I’ve tried it both ways. Ultimately, while the first option is much, much faster, the second option is the one to go for because the edge weights (the proportion that a topic is present in a document) are extremely important. So I created a single list that included seven pairs of topic-weight combinations. (This doesn’t create a graph where k=7, because not every document had that many topics. But why 7? In truth, after that point, the topics all seemed to be well under 1% of each document’s composition.)
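Here is a sketch of how that source-target list could be built, keeping at most seven topic-weight pairs per district. It assumes the same line layout as the `adur` example. The `topic_` prefix is my own addition so that topic numbers can’t be mistaken for anything else once they become node names, and the `Source`, `Target`, `Weight` headers are what I understand Gephi’s ‘import spreadsheet’ to expect for an edge table.

```python
import csv
import io


def to_edge_rows(composition_line, max_pairs=7):
    """Turn one composition row into (source, target, weight) rows."""
    parts = composition_line.split()
    district, rest = parts[0], parts[1:]
    # Pair each topic number with the weight that follows it.
    pairs = list(zip(rest[0::2], rest[1::2]))[:max_pairs]
    return [(district, f"topic_{topic}", float(weight))
            for topic, weight in pairs]


def write_edge_csv(composition_lines, handle):
    """Write a Source,Target,Weight table, one row per district-topic link."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    for line in composition_lines:
        if line.strip():
            writer.writerows(to_edge_rows(line))
```

Pass `write_edge_csv` the lines of the composition document and an open file (opened with `newline=""`), and each row of the result describes a single weighted relationship.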

With me so far? Great.

Now that I have a two mode network in Gephi, I can begin to analyze the pattern of topics in the documents. Using the multi-mode plugin, I separate this network into two one-mode networks: topics to topics (based on appearing in the same district) and district – district based on having the same topics, in different strengths.
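To make the idea of splitting a two-mode network concrete, here is a minimal sketch in plain Python. I haven’t checked how the plugin weights its links; this version uses one common choice, where the weight between two districts is the sum, over the topics they share, of the product of the two topic proportions (and likewise for topics sharing districts).

```python
from collections import defaultdict
from itertools import combinations


def project(edges, onto="district"):
    """Collapse (district, topic, weight) edges into a one-mode network.

    Two nodes are linked when they share a neighbour in the other mode.
    The link weight is the sum, over shared neighbours, of the product of
    the two original weights (an assumed weighting, not the plugin's).
    """
    neighbours = defaultdict(dict)  # pivot node -> {kept node: weight}
    for district, topic, weight in edges:
        if onto == "district":
            neighbours[topic][district] = weight
        else:
            neighbours[district][topic] = weight

    projected = defaultdict(float)
    for kept in neighbours.values():
        # Every pair of nodes attached to the same pivot gets a link.
        for a, b in combinations(sorted(kept), 2):
            projected[(a, b)] += kept[a] * kept[b]
    return dict(projected)
```

Calling it with `onto="district"` gives the district – district network. Any other value gives topic – topic.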

Network visualization doesn’t offer anything useful here (although Wales is always quite distinctly apparent when you do visualize it. It’s because of the coin hoards). Instead, I simply compute useful network metrics. For instance, ‘betweenness’ counts the number of times a node lies in between pairs of other nodes, given the shortest paths connecting them. In a network built from a piece of text, the words with high betweenness do the heavy semantic lifting. So identifying topics that are most in between in the topic – topic network should be a useful thing to do. But what does ‘betweenness’ imply for the district – district network? I’m not sure yet. Pivotal areas in the formation of material culture?
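Gephi computes this for you, but it can help to see what is being counted. Below is a small sketch for an unweighted, undirected network, following Brandes’ algorithm. It ignores edge weights, which is a simplification, and I am not claiming it matches Gephi’s output digit for digit.

```python
from collections import deque


def betweenness(adjacency):
    """Betweenness centrality for an unweighted, undirected graph.

    `adjacency` maps every node to the set of its neighbours (each
    neighbour must itself be a key). For each node, the score adds up,
    over all pairs of other nodes, the fraction of shortest paths
    between the pair that pass through it.
    """
    score = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search, counting shortest paths from `source`.
        order = []
        predecessors = {node: [] for node in adjacency}
        paths = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    paths[w] += paths[v]
                    predecessors[w].append(v)
        # Walk back from the farthest nodes, sharing out path dependencies.
        dependency = {node: 0.0 for node in adjacency}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += paths[v] / paths[w] * (1.0 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each undirected pair was counted once from each end.
    return {node: value / 2.0 for node, value in score.items()}
```

In a simple chain a – b – c, the only shortest path between a and c runs through b, so b scores 1 and the two ends score 0.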

What is perhaps more useful is the ‘modularity’. It’s just one of a number of algorithms one could use to try to find structural sub-groups in a network (NodeXL has many more). But perhaps there are interesting geographical patterns if we examine the pattern of links. So I ran modularity, and uploaded the results to OpenHeatMap to visualize them geographically. Network analysis doesn’t need to produce network visualizations, by the way.

See the result for yourself here: http://www.openheatmap.com/embed.html?map=AnteriorsFrijolsHermetists

It colours each district based on the group that it belongs to. If you mouse-over a district, it’ll give you that group’s number – those numbers shouldn’t be confused with anything else. I’d do this in QGIS, but this was quicker for getting a sense of what’s going on.

I asked on Twitter (referencing a slightly earlier version) if these patterns suggested anything to any of the Romano-Britain crowd.

Modularity for topic-topic also implies some interesting groupings, but these seem to mirror what one would expect by looking at their prominence in the keys.txt file.  So that’s where I am now, soon to try out Phil’s suggestion.

As Paul Harvey was wont to say, ‘…and now you know… the REST of the story’.  At DH2013 I hope to be able to tell you what all of this may mean.