FutureFunder Campaign Picks Up Steam!

George Garth Graham

I’m continually fascinated by ways digital media can expand who gets to be a historian, who gets to be an archaeologist. Crowdsourcing expands our readership, too. Open peer review projects allow the potential readership for a volume to have a dialogue with the authors while the project unrolls.

My Futurefunder campaign adds a new facet to this. I’m trying to crowdfund direct tax-deductible donations to a fund that would support undergraduate students as they work on various digital history and humanities projects around the department. The Dean of the Faculty of Arts will match funds if we reach the halfway mark ($2500); the fund is currently only about $800 shy of that point!

I needed to do this. I kept finding that I was pulling funds from various nooks and crannies to send students to THATCamps, to help them get to DHSI, to set up laboratories for exploring data mining, to publish and work with me on projects. I found I was spending weeks a year writing research grants that, when boiled down to their essence, were all about finding funds to train students. This, it seems to me, is a very appropriate idea to take directly to the public, rather than the Tri-council agencies. I was very excited to be interviewed by the Globe and Mail about the project (the story appeared this past Saturday), and the fund has really picked up steam. I would be happy to chat with folks who are interested in this campaign (this experiment!). I would be extremely happy to chat with folks about the amazing work the undergraduates around here do, in digital history.

One last push folks, one last push!

Artwork By Steven Hughes for the Globe and Mail

Patterns in Roman Inscriptions

Update, August 22: I’ve now analyzed all 1385 inscriptions. I’ve put an interactive browser of the visualized topic model at http://graeworks.net/roman-occupations/.

See how nicely the Latin clusters?

I’ve played with topic modeling inscriptions before. I’ve now got a very effective script in R that runs the topic model and produces various kinds of output (I’ll be sharing the script once the relevant bit from our book project goes live). For instance, I’ve grabbed 220 inscriptions from Miko Flohr’s database of inscriptions regarding various occupations in the Roman world (there are many more; like everything else I do, this is a work in progress).

Above is the dendrogram of the resulting topics. Remember, those aren’t phrases, and I’ve made no accounting for case endings. (Now, it’s worth pointing out that I didn’t include any of the metadata for these inscriptions; just the text of the inscription itself, with the diacritical marks removed.) Nevertheless, you get a sense of both the structure and content of the inscriptions, reading from left to right, top to bottom.

We can also look at which inscriptions group together based on the similarity matrix of their topics, and graph the result.

Inscriptions, linked based on similarity of the language of the inscription, via topics. If the image appears wonky, just click through.
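To make the idea of linking inscriptions by topic similarity concrete, here is a minimal sketch in Python. It is not the R script mentioned above. The similarity measure (cosine), the 0.8 threshold, the function names, and the three sample inscriptions with their topic proportions are all assumptions made up for illustration.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length lists of topic proportions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_edges(doc_topics, threshold=0.8):
    """Return (doc_a, doc_b, score) for every pair at or above the threshold."""
    edges = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(doc_topics.items()), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            edges.append((id_a, id_b, round(score, 3)))
    return edges


# Hypothetical topic proportions for three inscriptions (not real output).
doc_topics = {
    "inscription_a": [0.90, 0.05, 0.05],
    "inscription_b": [0.85, 0.10, 0.05],
    "inscription_c": [0.05, 0.05, 0.90],
}

edges = similarity_edges(doc_topics)
for source, target, score in edges:
    print(source, target, score)
```

Only the first two inscriptions end up linked, because they load on the same topic. Raising or lowering the threshold changes how dense the resulting graph is, which is a choice worth experimenting with.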

So let’s look at these groups in a bit more depth. I can take the graph exported by R and import it into Gephi (or another package) to do some exploratory statistical analysis.

I’ve often put a lot of stock in ‘betweenness centrality’, reckoning that if a document is highly between in a network representation of the patterns of similarity of topics, then that document is representative of the kinds of discourses that run through it. What do we get, then?

We get this (here’s the page in the database):

aurifices Roma CIL 6, 9207 Inscription Occupation
M(arcus) Caedicius Iucundus / aurifex de / sacra via vix(it) a(nnos) XXX // Clodia …

But there are a lot of subgroupings in this graph. Something like ‘closeness’ might indicate more locally important inscriptions. In this case, the two with the highest ‘closeness’ measures are

aurifices Roma CIL 6, 9203 Inscription Occupation
Protogeni / aurfici / vix(it) an(nos) LXXX / et Claudiae / Pyrallidi con(iugi) …

and

aurifices Roma CIL 6, 3950 Inscription Occupation
Lucifer v(ixit) a(nnum) I et d(ies) XLV / Hesper v(ixit) a(nnos) II / Callistus …
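Gephi computes these measures for you, but a small sketch shows what they mean. The Python below works out betweenness (using Brandes’ algorithm) and closeness on an unweighted, undirected toy graph. The five node names and their links are hypothetical, and Gephi’s own normalisation of closeness may differ from the simple one used here.

```python
from collections import deque


def closeness(graph):
    """Closeness: (reachable nodes) / (sum of shortest-path distances to them)."""
    scores = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores


def betweenness(graph):
    """Unnormalised betweenness for an undirected, unweighted graph (Brandes)."""
    scores = {node: 0.0 for node in graph}
    for s in graph:
        stack = []
        preds = {node: [] for node in graph}
        sigma = {node: 0 for node in graph}  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {node: 0.0 for node in graph}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                scores[w] += delta[w]
    # Each pair is counted once from each end in an undirected graph.
    return {node: score / 2 for node, score in scores.items()}


# Hypothetical similarity graph: doc3 is the hub that ties the others together.
graph = {
    "doc1": ["doc2"],
    "doc2": ["doc1", "doc3"],
    "doc3": ["doc2", "doc4", "doc5"],
    "doc4": ["doc3"],
    "doc5": ["doc3"],
}

between = betweenness(graph)
close = closeness(graph)
print(max(between, key=between.get), max(close, key=close.get))
```

In this toy graph the same document tops both measures. In a graph with many subgroupings, as here, the two measures can pick out different inscriptions, which is why it is worth looking at both.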

If we look for subgroupings based on the patterning of connections, the biggest subgroup has 22 inscriptions:
Dis Manibus Felix publicus Brundisinorum servus aquarius vixit…
Dis Manibus Laetus publicus populi Romani 3 aquarius aquae An{n}ionis…
Dis Manibus sacrum Euporo servo vilico Caesaris aquario fecit Vestoria Olympias…
Nymphis Sanctis sacrum Epictetus aquarius Augusti nostri
Dis Manibus Agathemero Augusti liberto fecerunt Asia coniugi suo bene…
Agatho Aquarius Caesaris sibi et Anniae Myrine et suis ex parte parietis mediani…
Dis Manibus Sacrum Doiae Palladi coniugi dignissimae Caius Octavius…
Dis Manibus Tito Aelio Martiali architecto equitum singularium …
Dis Manibus Aureliae Fortunatae feminae incomparabili et de se bene merenti..
Dis Manibus Auliae Laodices filiae dulcissimae Rusticus Augusti libertus…
Dis Manibus Tychico Imperatoris Domitiani servo architecto Crispinilliano.
Dis Manibus Caio Iulio 3 architecto equitum singularium…
Dis Manibus Marco Claudio Tryphoni Augustali dupliciario negotiatori…
Dis Manibus Bromius argentarius
Faustus 3ae argentari
Dis Manibus sacrum Tiberius Claudius Hymeneus aurarius argentarius…
Dis Manibus Silio Victori filio et Naebiae Amoebae coniugi et Siliae…
Dis Manibus 3C3 argentari Allia coniugi? bene merenti fecit…
Dis Manibus Marco Ulpio Augusti liberto Martiali coactori argentario…
Suavis 3 aurarius
Dis Manibus sacrum Tiberius Claudius Hymeneus aurarius argentarius…
Dis Manibus Tito Aurelio Aniceto Augusti liberto aurifici Aurelia…

What ties these together? Well, ‘dis manibus’ is good, but it’s pretty common. The occupations in this group are all argentarii, architecti, or aquarii. So that’s a bit tighter. Many of these folks are mentioned in conjunction with their spouses.

In the next largest group, we get what must be a family (or familia, extended slave family) grouping:
Caius Flaminius Cai libertus Atticus argentarius Reatinus
Caius Octavius Parthenio Cai Octavi Chresti libertus argentarius
Musaeus argentarius
Caius Caicius Cai libertus Heracla argentarius de foro Esquilino sibi…
Caius Iunius Cai libertus Salvius Caius Iunius Cai libertus Aprodisi…
Caius Vedennius Cai filius Quirina Moderatus Antio militavit in legione…
Aurifex brattarius
Caius Acilius Luci filius Trebonia natus architectus
Caius Postumius Pollio architectus
Caius Camonius Cai libertus Gratus faber anularius
Caius Antistius Isochrysus architectus
Elegans architectus
Caius Cuppienus Cai filius Pollia Terminalis praefectus cohortis…
Cresces architectus
Cresces architectus
Caius Vedennius Cai filius Quirina Moderatus Antio militavit in legione…
Pompeia Memphis fecit sibi et Cnaeo Pompeio Iucundo coniugi suo aurifici…
Caius Papius Cai libertus Salvius Caius Papius Cai libertus Apelles…
Caius Flaminius Cai libertus Atticus argentarius Reatinus

The outliers here are graffiti, or must be being picked up by the algorithm because of the formation of the words. The inclusion of Pompeia here is interesting, and must be due to the overall structure of that inscription. Perhaps it is a stretch too far to wonder why these would be similar…?

This small experiment demonstrates, I think, the potential of topic modeling for digging out patterns in archaeological/epigraphic materials. In due time I will do Flohr’s entire database. Here are my files to play with yourself.

Giant component at the centre of these 220 inscriptions.

Topic Modeling #dh2013 with Paper Machines

I discovered the pdf with all of the abstracts from #dh2013 on a memory-stick-cum-swag this AM. What can I do with these? I know! I’ll topic model them using Paper Machines for Zotero.

Iteration 1.
1. Drop the pdf into a zotero collection.
2. Create a parent item from it.
3. Add a date (July 2013) to the date field on the parent item.
4. Right click on the collection, extract text for paper machines.
5. Right click on the collection, topic model –> by date.
6. Result: blank screen.

Damn.
Right-click the collection, ‘reset papermachines output’.

Iteration 2.
1. Split the pdfs for the abstracts themselves into separate pages. (pg 9 – 546).
2. Drop the pdfs into a zotero collection.
3. Create parent items for them. (Firefox hangs badly at this stage. And keeps redirecting through scholar.google.com for reasons I don’t know.)
4. Add dates to the date field; grab these by hand from the dh schedule page. God, there’s gotta be an easier way of doing this. Actually, I’ll just skip this for now and hope that the sequential page numbers/multiple documents will suffice.
5. Right click on the collection, extract text for paper machines.
6. Right click on the collection, topic model –> by date.
7. Result: IndexError: index out of range: -1.

Damn.
Right-click the collection, ‘reset papermachines output’.

Iteration 3.
Jump directly to #4, add dates to date field. In the interests of getting something done this morning, I will give them all the same date – a range from July 16 – July 19. If I gave them all their correct dates, you’d get a much more granular view. But I’m adding these by hand. (Though there probably exists some sort of batch edit for Zotero fields? Hang on, I right click on ‘change fields for items’ type ‘date’ for field, put in my range, hey presto! Thanks, Zotero)
5. Right click on the collection, extract text for paper machines.
6. Right click on the collection, topic model –> by date.
7. Result:

Damn.

Chased down the folder where all of these were being stored. Aha. Each extracted text file is blank. Nice.

Blow this for a lark. Sometimes, folks, the secret is to go away, and come back later.

Update: I tweeted:

And then walked away for a while. Came back, and went to the TEI file. I used Notepad++ to strip everything else out but the abstracts. I saved it as a csv. Then, in Excel, I used a custom script I found lying about on teh webs to turn each line into its own txt file. Then I copied the directory into Zotero. I gave each txt file its own parent. I mass edited those items so that they all carried the date July 16 – 19 2013. Then I extracted texts (which seems redundant, but you can’t jump ahead).
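The ‘each line into its own txt file’ step doesn’t need Excel. Here is a rough Python sketch of the same idea, assuming one abstract per line in the source file. The function name, the file names, and the `abstract_001.txt` naming pattern are my own inventions for illustration, not what the script I found actually did.

```python
import tempfile
from pathlib import Path


def split_lines_to_files(source_path, out_dir, prefix="abstract"):
    """Write each non-blank line of source_path to its own numbered .txt file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(source_path, encoding="utf-8") as handle:
        lines = [line.strip() for line in handle if line.strip()]
    written = []
    for number, text in enumerate(lines, start=1):
        target = out / f"{prefix}_{number:03d}.txt"
        target.write_text(text + "\n", encoding="utf-8")
        written.append(target)
    return written


# Small demonstration in a throwaway directory with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "abstracts.csv"
    source.write_text(
        "First abstract text.\n\nSecond abstract text.\n", encoding="utf-8"
    )
    files = split_lines_to_files(source, Path(tmp) / "txt")
    names = [f.name for f in files]
    contents = [f.read_text(encoding="utf-8") for f in files]

print(names)
```

Zero-padded numbers keep the files in their original order when a tool sorts them by name, which matters if you are leaning on sequence rather than dates.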

And then I selected topic modeling by time.

Which at least created a topic model, but it didn’t make the stream graph. The heat map worked, but all it showed was the US, UK, and Germany. And Florida, for reasons unexplained.

So I went back to Gephi for my topic model visualization. I used Ben Marwick’s Mallet-in-R script to do the modeling and to transform the output so I could easily visualize the correlations. Behold, I give you the network of strongly correlated #dh2013 abstracts by virtue of their shared topics:

It’s coloured by modularity and sized by betweenness, which gives us groups of abstracts and the identification of the abstract whose topics/discourse/text do all of the heavy lifting. A brief glance at the titles suggests that these papers are all concerned with issues of data management of text.

All of this data is now up on my space at Figshare.com, and I’ll provide some further reflections later. Currently, this machine is hanging up on me frequently, and I want to get this out before it crashes. Here are the topics; you can add labels if you’d like, but the top three seem to be ‘publishing & scholarly communication’; ‘visualization’; ‘teaching’:

Correlated topics at #dh2013

0.35142 humanities digital social scholarly http research history accessed work community scholarship www access dh journal publication citation communication publishing
0.28061 literary reading analysis visualization text texts digital literature century studies media topic humanities corpus mining modeling press textual paper
0.21684 digital humanities students university teaching research dh participants workshop projects education pedagogy program tools academic arts graduate project resources
0.18993 digital collections research collection content researchers users access library user resources image images libraries archives metadata cultural information tools
0.14539 tei text document documents encoding markup xml texts index london indexing http uk html encoded links search version modern
0.11833 data historical map time gis information temporal maps university spatial geographic locations texts geographical place names mapping date dates
0.11792 crowdsourcing digital project public states united archaeological america archaeology projects poster university virginia web community social civil media users
0.11289 systems model modeling system narrative media theory elements classification type features user markup ic gesture expression representation press character
0.09601 editions edition text scholarly digital women editing collation print textual texts tools http image manuscript electronic editorial versions environment
0.08569 authorship author words texts corpus attribution characters frequency plays fig classification results number novels genre authors analysis character delta
0.08016 semantic annotation web linked open ontology data rdf scholarly http ontologies research annotations information review project metadata knowledge org
0.07777 social network networks analysis graph relationships characters group graphs jazz science family de interaction publication relationship nodes discussion cultural
0.06328 language corpus text txm http german de web lexicon platform corpora tools analysis unicode research annotation encoding languages lexus
0.05286 digital knowledge community fabrication migration book open feminist learning field knitting desktop world practices cultural experience work lab academic
0.04856 text analysis programming voyant tools ca poster interface alberta live rank sinclair http latent environments ualberta touch screen environment
0.04131 words poetry word text poem texts poetic ford english author segments conrad analysis language poems zeta newton mining chapters
0.0364 simulation information time content model vsim environment narrative abm distribution feature light embedded study narratives virtual japan plot resources
0.03538 query search google alloy xml language words typesetting algorithm de detection cf engine speech mql algorithms body searches paris
0.01131 de la el homer movement uncertainty en se clock catalogue del astronomical una movements para los dance las imprecision
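If you want to label or re-sort a list like this yourself, lines of the shape ‘weight, then words’ are easy to parse. A quick Python sketch follows; the function names are hypothetical, and the sample lines are just the first few words of four of the topics above.

```python
def parse_topic_line(line):
    """Split a 'weight word word ...' line into (weight, [words])."""
    weight, *words = line.split()
    return float(weight), words


def rank_topics(lines, top_n=3, words_shown=5):
    """Return the top_n topics by weight, each with its first few words."""
    topics = [parse_topic_line(line) for line in lines if line.strip()]
    topics.sort(key=lambda topic: topic[0], reverse=True)
    return [(weight, " ".join(words[:words_shown])) for weight, words in topics[:top_n]]


# Truncated copies of four topic lines from the list above, deliberately shuffled.
sample_lines = [
    "0.28061 literary reading analysis visualization text texts",
    "0.18993 digital collections research collection content researchers",
    "0.35142 humanities digital social scholarly http research",
    "0.21684 digital humanities students university teaching research",
]

ranked = rank_topics(sample_lines)
for weight, words in ranked:
    print(weight, words)
```

The labels themselves still have to come from a human reading the words, which is rather the point.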

Historian’s Macroscope – how we’re organizing things

‘One of the sideshows was wrestling’ from National Library of Scotland on Flickr Commons; found by running this post through http://serendipomatic.org

How do you coordinate something as massive as a book project, between three authors across two countries?

Writing is a bit like sausage making. I write this, thinking of Otto von Bismarck, but Wikipedia tells me:

  • Laws, like sausages, cease to inspire respect in proportion as we know how they are made.
    • As quoted in University Chronicle. University of Michigan (27 March 1869) books.google.de, Daily Cleveland Herald (29 March 1869), McKean Miner (22 April 1869), and “Quote… Misquote” by Fred R. Shapiro in The New York Times (21 July 2008); similar remarks have long been attributed to Otto von Bismarck, but this is the earliest known quote regarding laws and sausages, and according to Shapiro’s research, such remarks only began to be attributed to Bismarck in the 1930s.

I was thinking just about the messiness rather than inspiring respect; but we think there is a lot to gain when we reveal the messiness of writing. Nevertheless, there are some messy first-first-first drafts that really ought not to see the light of day. We want to do a bit of writing ‘behind the curtain’, before we make the bits and pieces visible on our Commentpress site, themacroscope.org. We are all fans of Scrivener, too, for the way it allows the bits and pieces to be moved around, annotated, rejected, resurrected and so on. Two of us are Windows folks, the other a Mac. We initially tried using Scrivener and GitHub, as a way of managing version control over time and to provide access to the latest version simultaneously. This worked fine, for about three days, until I detached the head.

Who knew that decapitation was possible? Then, we started getting weird line breaks and dropped index cards. So we switched tack and moved our project into a shared Dropbox folder. We know that with Dropbox we absolutely can’t have more than one of us be in the project at the same time. We started emailing each other to say, ‘hey, I’m in the project….now. It’s 2.05 pm’ but that got very messy. We installed yshout and set it up to log our chats. Now, we can just check to see who’s in, and leave quick memos about what we were up to.

Once we’ve got a bit of the mess cleaned up, we’ll push bits and pieces to our Commentpress site for comments. Then, we’ll incorporate that feedback back into our Scrivener project, and perhaps re-push it out for further thoughts.

One promising avenue that we are not going down, at least for now, is to use Draft.  Draft has many attractive features, such as multiple authors, side-by-side comparisons, and automatic pushing to places such as WordPress. It even does footnotes! I’m cooking up an assignment for one of my classes that will require students to collaboratively write something, using Draft. More on that some other day.

A quick run with Serendip-o-matic

I just ran my announcement of our book through the #owot Serendip-o-matic serendipity engine.

It took the text of my post, and extracted these key words:

book, digital, writing, online, process, project, students, things, us, wanted, going, historian, nervous, one, programming.

I wondered if the selected keywords changed each time, if there was a bit of fuzziness to the extraction routine. The image results this second time looked different from the first (more digital than bookish the second time, more bookish than digital the first), but the results from the ‘save’ button were the same:

So, for pass one:

  1. Writing 2.0: Using Google Docs as a Collaborative Writing Tool in the Elementary Classroom: http://thoth.library.utah.edu:1701/primo_library/libweb/action/dlDisplay.do?vid=MWDL&afterPDS=true&docId=digcoll_uvu_19UVUTheses/609. From DPLA.
  2. Effectiveness of an Improvement Writing Program According to Students’ Reflexivity Levels: http://preview.europeana.eu/portal/record/9200102/F5795175AA2BAED57402D982C774072FE21364BF.html?utm_source=api&utm_medium=api&utm_campaign=iiecvYL4T. From Europeana.
  3. Students in the incubation room at the Woodbine Agricultural School, New Jersey: http://www.flickr.com/photos/36988361@N08/4296232936/. From Flickr Commons.
  4. Impossible things [book review]: http://thoth.library.utah.edu:1701/primo_library/libweb/action/dlDisplay.do?vid=MWDL&afterPDS=true&docId=digcoll_byu_12CBPR/201. From DPLA.
  5. Let The Feeling Flow: http://preview.europeana.eu/portal/record/2023601/F8C732E3D49AC67D886564EC78D0E37F02617C72.html?utm_source=api&utm_medium=api&utm_campaign=iiecvYL4T. From Europeana.
  6. Student reading to two little girls. Photographed for 1920 home economics catalog by Troy.: http://www.flickr.com/photos/30515687@N05/3856396957/. From Flickr Commons.

For pass two: – well, lots of different stuff, some overlaps, but a glitch meant that my results didn’t get saved.

Pass three: these words extracted- book, digital, writing, online, process, project, students, things, us, wanted, going, historian, nervous, one, programming. Same words, different order; but there were many different images from passes 1 and 2, while some images stayed the same. The ‘save’ page brought up the list above.  If I was serious about saving, I’d try to push from the results page into Zotero; in any event, after five workdays, this is a hell of a neat piece of work!  For contrast, let me take those keywords that serendipomatic extracted, and run them through google. Three results:

http://electricarchaeology.ca/2013/07/24/themacroscope/

http://electricarchaeology.ca/

http://dohistory.org/on_your_own/toolkit/oralHistory.html

So serendipomatic is the winner, hands down! Putting the keywords* extracted via natural language processing into google really highlights how google works: it exactly points to the post with which we began. And there, ladies and gentlemen, is the reason why Google, for all its power, is not the friend to research that you might have thought. Google is for generating needles; Serendipomatic is for generating haystacks, and it does it well. Well done #owot team!

*putting the whole text generated an error: Error 414 (Request URI too large!) Sorry google, didn’t mean to break you.

Announcing a live-writing project: the Historian’s Macroscope, an approach to big digital history

Robert Hooke’s Microscope http://www.history-of-the-microscope.org

I’ve just signed a book contract today with Imperial College Press; it’s winging its way to London as I type. I’m writing the book with the fantastically talented Ian Milligan and Scott Weingart. (Indeed, I sometimes feel the weakest link – goodbye!).

It seems strangely appropriate, given the twitter/blog furor over the AHA’s statement recommending that graduate students embargo their dissertations online, for fear of harming their eventual monograph-from-dissertation chances. We were approached by ICP to write this book largely on the strength of our blog posts, social media presence, and key articles, many of which come from our respective dissertations. The book will be targeted at senior/advanced undergrads for the most part, as a way of unpeeling the tacit knowledge around the practice of digital history. In essence, we can’t all be part of, or initiate, fantastic multi-investigator projects like ChartEx or Old Bailey Online; in which case, what can the individual achieve in the realm of fairly-big data? Our book will show you.

One could reasonably ask, ‘why a book? why not a website? why not just continue adding to things like the Programming Historian?’. We wanted to write more than tutorials (although we owe an enormous debt to the Programming Historian team whose example and project continue to inspire us). We wanted to make the case for why as much as explore the how, and we wanted to reach a broader audience than the digital techno-savvy. In our teaching, we’ve all experienced the pushback from students who are exposed to digital tools & media all the time; a book-length treatment normalizes these kinds of approaches so that students (and lay-people) can say, ‘oh, right, yes, these are the kinds of things that historians do’ – and then they’ll seek out Programming Historian, Stack Overflow, and myriad other sites to develop their nascent skills. Another attraction of doing a book is that we recognize that editors add value to the finished product. Indeed, our commissioning editor sent our first attempt at a proposal out to five single-blind reviewers! This project is all the stronger for it, and I wish to thank those reviewers for their generous reviews.

One thing that we insisted upon from the start was that we were going to live-write the book, openly, via a comment-press installation. I submitted a piece to the Writing History in the Digital Age project a few years ago. That project exposed the entire process of writing an edited volume. The number and quality of responses was fantastic, and we knew we wanted to try for that here. We argued in our proposal that this process would make the book stronger, save us from ourselves, and build a potential readership long before the book ever hit store shelves. We were astonished and pleased that ICP thought it was a great idea! They had no hesitation at all – thank you Alice! We’ve had long discussions about the relationship of the online materials to the eventual finished book, and wording to that effect is in the final contract. Does that mean that the final type-set manuscript will appear on the commentpress online? No, but nor will the book’s materials be embargoed.  None of us, including the Press, have tried this scale of things before. No doubt there will be hiccups along the way, but there’s a lot of goodwill built up and I trust that we will be able to work out any issues that may (will) arise.

We’re going to write this book over the course of one academic year. In all truthfulness, I’m a bit nervous about this, but the rationale is that digital tools and approaches can change rapidly. We want to be as up-to-date as possible, but we also have to be aware in our writing not to date ourselves either. That’s where all of you come in. As we put bits and parts up on The Historian’s Macroscope – Big Digital History, please do read and offer comments. Consider this an open invitation. We’d love to hear from undergraduate students. Some of these pieces I’m going to road test on my ‘HIST2809 Historian’s Craft’ students this autumn and winter. Ian, Scott, and I will be reflecting on the writing process itself (and my students’ experiences) on the blog portion of the live-writing website.

I’m excited, but nervous as hell, about doing this. Nervous, because this is a tall order. Excited, because it seems to me that the real transformative power of the digital humanities is not in the technology, but in a mindset that peels back the layers, to reveal the process underneath, that says it’s ok to tinker with the ways things have been done before.

Won’t you join us?

Shawn

The George Garth Graham Undergraduate Digital History Research Fellowship

My grandfather, George Garth Graham, in the 1930s.

At Carleton University, we have a number of essay awards for undergraduate history students. We do not have any awards geared towards writing history in new media, or doing historical research using digital tools, or any of the various permutations that would broadly fall within big-tent digital humanities.

So I decided to create an award, using the University’s micro-giving (crowdfunding) platform, Futurefunder.

I’m establishing this fellowship in tribute to my grandfather and the values he represented. George Garth Graham did not have any formal education after Grade 8. He educated himself through constant reading. One of my fondest memories is going through his stack of Popular Science and Popular Mechanics magazines, and making things with him in his basement workshop. Digital History is often about making, tinkering, and exploring, and this was something that my grandfather exemplified. He had a great love of history, showing my brothers and me around the area, telling us the stories of the land. He was generous with his time and would also quietly help those in need, never asking for nor expecting recognition for his contribution.

I’m calling this a ‘research fellowship’ rather than a scholarship because I want it to encourage future work, rather than reward past work. I intend this fellowship to be available to any History student who has taken the second year, required, HIST2809 Historian’s Craft course (a methods course). The student would have to have a certain GPA (appropriate to their year), and have a potential faculty member and project in mind (and I would help facilitate that). A committee of the department would adjudicate applications.

One of the conditions of the fellowship would be for the student to maintain an active research blog, where she or he would detail their work, their reflections, their explorations and experiments. It would become the locus for managing their digital online identity as a scholar. I imagine that holders of this fellowship would be well set-up to pursue further work in graduate programs in the digital humanities or in the digital media sector. I imagine opportunities for the students to publish with faculty (as did the students who worked on my 2011 ‘HeritageCrowd’ project, writinghistory.trincoll.edu). I know of no other undergraduate fellowship like this, in this field. Students who held such a post would not just be assistants, but potential leaders in the field.

For more details about the Fellowship, and how you can contribute to it, please see the Fellowship’s page on Futurefunder.

Prescot Street as Topic Model, or, reading an excavation distantly

I tried a new tack in my quest to data mine archaeological records. Stuart Eve sent me the csv from the Prescot Street excavations, where each record was a unique context. I fed this into the vanilla Java GUI for MALLET (so no tuning, just the basic settings, looking for 25 topics) to see what – if anything – might result. The output seems very promising. I deliberately did not look up any information on the excavation until after I’d run this analysis. Can reading site records algorithmically tell us anything useful, that we did not otherwise know?

As I often do, I posted my initial reaction to twitter:

How to visualize this? I’m growing cold towards network visualizations of this kind of data, but in this case a two-mode representation might be appropriate, since the topic modeling algorithm is functioning as a kind of unsupervised clustering routine, pulling words out of the records that seem to go together. Here’s a two-mode network of the results, contexts tied to their constituent topics:

Prescot Street as Topic Model.
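A two-mode network like this is just an edge list in which every edge runs from a context to a topic. Here is a minimal Python sketch of how one might be assembled for import into Gephi. The context numbers, the topic proportions, the 0.2 cut-off, and the function names are all hypothetical; the `Source,Target,Weight` header follows the column names Gephi’s spreadsheet importer looks for in an edge table.

```python
import csv
import io


def two_mode_edges(context_topics, min_weight=0.2):
    """Link each context to every topic whose proportion meets the cut-off."""
    edges = []
    for context, proportions in sorted(context_topics.items()):
        for topic_number, weight in enumerate(proportions, start=1):
            if weight >= min_weight:
                edges.append((f"context_{context}", f"topic_{topic_number}", weight))
    return edges


def edges_to_csv(edges):
    """Render the edges as CSV text with a Source,Target,Weight header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buffer.getvalue()


# Hypothetical topic proportions for two contexts over three topics.
context_topics = {
    "101": [0.70, 0.10, 0.20],
    "102": [0.05, 0.90, 0.05],
}

edges = two_mode_edges(context_topics)
csv_text = edges_to_csv(edges)
print(csv_text)
```

Because contexts only ever connect to topics and never to each other, any grouping that emerges comes from shared vocabulary, not from stratigraphic adjacency.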

It seems promising. In that image, I took the excavators’ names out. But upon reflection, I shouldn’t do that:

I asked Gephi to look for modules (communities; groups; based on similarity of ties) within this two mode network. Below are a series of images that focus on the individual modules. Two items jump out immediately – one, particular excavators are associated with particular word choice, patterning of word usages; two, particular kinds of materials clump together quite nicely.

Do particular excavators ‘see’ particular kinds of info that others don’t? Do they ‘specialize’ in certain kinds of info? As a newbie on the Forum Novum project for BSR many years ago, I was never allowed on any of the ‘interesting’ stuff, being consigned to digging through layers of fill to find the depth of the natural soil level. There’s only so many ways to describe dirt. This kind of thing happens often. You want your most experienced excavators to handle the most delicate/intricate/complicated situations, but… I wonder.

Topic modeling this material, whilst including the names of the excavators attached to each context, seems to shed interesting light on the ways we see things archaeologically. In my other experiments with the PAS database, because of extraneous commas creeping in and shifting the fields, I often ended up with an inconsistent inclusion of the finds officers’ names, so I tended to just exclude them completely. That might be an error. I think we need to know whose voice is most tied to the ‘topics’/‘discourses’ that make up our record (after all, once it’s excavated, this is all we have left, right?). This experiment here suggests that perhaps one of the more valuable outcomes of topic modeling archaeological material is the re-introduction of subjectivity into our records, the idea that many voices (modern and ancient) make up the ‘record’ – and we should listen to them.

In due course I’ll put the html up somewhere so that the interested reader can jump through the contexts along the topic – context – topic pathways suggested by the topic modeling. We use Harris matrices (a kind of network) to understand the three dimensional relationships amongst contexts (which imply their chronological ordering); what kinds of insights can deforming our reading of an excavation along the network paths suggested by the topic modeling produce?

Below are the visualizations of the modules.

pits and burials

roman pits, fills, structures

cellars and latrines

graves and cemeteries

roman fill

modern ditches



And the topics with their top words:
topicId words..

1 schager elisabet pottery area part remains found bone similar poss fills bones appears burnt located human pieces waste grey activity main animal clear cremations broken cbm fragments truncates domestic skull high underneath mid shells bit edge sort chalk vessels deposits charcoal nw sherds disarticulated lost oyster sterile specific includes thrown

2 pit roman ii po ossuary irregular large latest including probable mixed pictured truncating inside planned sealed appears cut continuation surviving soakaways remained intercutting step pitting results topped width relates infilling partial include moved northwards steven ashley contexts adult perpendicular offset remain aesthetically loaced disturb sprial mentioned compass fed skeletons connections

3 fill floor basement rubble concrete slab fl evidence bedding ce larger glass abutting represent demolition room darker suggesting repair boundaries situe remaining unclear feature continues samian cessy eval packed facade john photo subrectangular reused actual ws lay inclusion noted lie teh constrcution looked crees brick lots archaeology flexed state

4 soakaway late water sump collection su pm brick soak masonry structure horn core back lined bricks lining drainage masonary materials face smell fit red held system courses time functioned sloping putrid cores aid headers lain knocked pipes mottled lies bands buried rotten real lying tirtiary simple earthernware exterior acrivity respective

5 pm pooley ashley late backfill century brick lucas tom cellar made garden line deliberate material walls cistern places sitting leveling thc proximity shallow backfilling based lerza rivets lifting limestone rebuild characteristic general redep suggested potential campion signs putrid map shown phase bits occurance structure element disintegrated ash southwards act crumble

6 truncated linear modern clark william heavily south west east truncation due north shape rectangular foundation cist machine stone cutting running aligned relationship pre composition tiles ne note observed worked sides deeper manhole intrusion define identical machining unknown depression tile mod axis bagged tegula limit channel erosional forming sample loe uneven

7 cut construction structural back slightly ring recut ephemeral completely realised doesn left partly heading heavy fragment contents analogous suggests comprises properties limestone short wells thc intervening association reflecting pictures clarify count sotnes terminus browny vertically bar unarticulated highest repdeposited things redeposit crmated tank approx ifthe lessnes forming explaining inclination plan

8 fill top base finds contained clancy sara clay organic context section date level horncore excavated original sheet shallow sketch nature suggest silting pipe suggests depth sampled trap lined dumped fully put reverse cemetery hearth beam deliberately frequent removal orientation orange paper backfilled horncores lain discussion sealed cultural appeared thick tenon

9 pit roman cut howell paula oval tip big ground difficult exact probable vertical pocket reflect shows pretty phase work means times duffy region alignment man matches nail wasn sequence build silting clinker fl brittle abuts tentar db sewer quadrant disarticulation implying characteristics revised constuction bottomed pressed unimpressive smc extending

10 gary webster filled fill possibly surface metalling gravelly related underlying unclear laid em difficult compact cobbling overlying represent dark modern mix undetermined series metalled place yellow gaps se stratigraphically extend dumps intentionally missing size charnal foundations spilled lack unsure things areas barrell blue metal yard variety respects ploughsoil anphora

11 pm post early fill med large medieval cess lot contemporary light pc inclusions latrine observed single mortar collapse character leather recovered ceramic suggests extent glacial lense hand event green interpretation resting case demo roughly curved household apparently assist inflow setting render cores varying determining belongs tenuously derivation mixture unlike consistant

12 refuse pr greg crees pit rubbish kind determine paula representing previously rounded bs discovered full gradual probing based enclosing struck housing similarities fronted coursings characterisation excavate sharp valley abse meeting people compare chronologically indication hypocaust blurring subjected distinctive amost grain remaining forms patchy interred colours including similar time midden

13 fill lower natural secondary sand greyish shaped statter claire surveyed black yorkstone proper ashley loose return horizontally mmx rest slots tbm patches largest acidic order distinct interface terrace drainage seperated ark rubberley hit spiralled rebuild destruction coming eastwards sharply hold candidate smells distorted field air powdered stains overview vacant dated

14 roman carreton adrian upper pit colour excavation makes funerary alignment southern preservation fireplace cover collapsed extending scattered adhering pinkish comprised ns nw bag smelly find soakaway whoel gs meant belonged disused regard ditches meters quarries huge making corresponds ritual existing cemented dimensions starts dimension marked paired excavtion staining shipton

15 wall fergal donoghue late pm wa sill building georgian st internal tenter butting wooden victorian present buttress support extension long prescot barrel house rear street dividing facing immediately platform front rising moisture prevent slate thinks medium slabs beneath seemingly fl counstruction plot wider lienar knees erosional lies cu trample photographed

16 roman fill matt ceri shipton law nails williams earlier black form find situ obvious fe uncertain complete amounts objects culvert smae skulls notably addition wood stoney domed truncations rectilinear pyres quality moderate working bonding earliest dark gis ark failry compost peeled functional rows ended properly remnants buildings accounted variation

17 cremation burial cr urn disturbed pot plan tile dobosz ukasz votive diffuse recorded dug built sample cm represents bone chest cremated position surrounding box analysis nb lifted coin regular offering vessel concentration occasional deposits suggesting block intact urned sw notable lid samples deep stones western broad higher plate cms relate

18 make levelling mu layer gravel dump material brickearth redeposited ed sandy deposited earth dumped spread dirty ground silty slumped capping clayey charcoal derived quarrying stoney extraction layers thin square sorted period exposed occupation sands soft parts provide lines stuff didn true partly significantly basal white tom mixey cluster test central

19 void external soil deposit hole cultivation posthole ec soils sp fra features lerza site fairly agricultural number brown debris dep evaluation reworked dumping result dates horticultural environmental run unurned plough residue deposition manuring representing upright storage exit family connected cleaning difference squared linked geophoto amorphous gravely concentrations poo defined

20 pit david unspecified ross edge roman brenna lowest shallow expect final basal presume dimensions marcus pebbles angular appeared covering diffuse processing stage stuart lens stored missed thickness const irregularities souther button funcation limits uncear oblong wider poshole suggested works fil metaling patella jaws grounds greater major purposes elisabet derive pegtile

21 cut small ruth rolfe side cuts end pits grid sq eastern circular originally western piece hard mm fact partially edges removed orangey wood half northern thought directly separate nearby degraded marcus initial urns period solid straight slope inwash graves limit rough wide occured occasionally centre good concave leading survives undertermined

22 drain ditch gully feature possibly shallow trench bottom boundary hassett visible southern runs burials sides postmed aspects slot cemetary presence point robber quarrying footing essentially direction formed doesn land homogenous number indicating section constructed thc terraced gulley parallel holes assoc overflow longbone debitage arising pressure fragment mark glazed wash sealing

23 pit quarry pq hassan anies primary silt dark prob middle zone skull filling mausoleum planned machined edges tiled tanked evident reused stain northeast ts corner sit redepoisted doubt terminate pillow overleaf shale fits standard means dateable existant redundant easts dropping quarried usage gc report truncate trampling compositions marcus bag

24 morse chaz deposit mixed roman dumped brown gravels rich function silty forms lenses subsided narrow assume robbed rest past discernable pitcut con sitly barren bucket cesspit shot beneath late unfrogged sister occupancy flure terminates consister retrieved resolved parallel joining ideas give millefiore burrial cd assumption regularity uppermost imbrex deposite

25 grave roman skeleton cut sk moskal tomasz coffin dug inhumation preserved poorly erroded goods body left legs head articulated skeletal nos poor events juvenile severly feet condition fragmentary holding bed ends stain strongly info spaced cu deposited shaped assigned disturbance cleaned chalk disatriculated femur hands soakawy showing overhangs hom cen
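A listing like the one above can be turned into a two-mode edge list that Gephi will import as a spreadsheet. What follows is only a sketch under assumptions: it treats topics as one kind of node and words as the other, which may not be exactly how the network in this post was built, and the function names are made up for illustration. The three sample lines are abbreviated from topics 17, 25 and 9 above.

```python
import csv
import io


def topic_word_edges(listing):
    """Return (topic_node, word) pairs from lines of the form
    '<topic id> <word> <word> ...'. Lines that do not start with a
    number (such as a header) are skipped."""
    edges = []
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) < 2 or not parts[0].isdigit():
            continue
        topic = f"topic_{parts[0]}"
        for word in parts[1:]:
            edges.append((topic, word))
    return edges


def edges_to_csv(edges):
    """Serialise edges with the Source/Target/Type header that Gephi's
    spreadsheet importer recognises for edge tables."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Type"])
    for source, target in edges:
        writer.writerow([source, target, "Undirected"])
    return buf.getvalue()


listing = """topicId words..
17 cremation burial cr urn
25 grave roman skeleton cut
9 pit roman cut howell
"""

edges = topic_word_edges(listing)
print(edges_to_csv(edges))
```

Words shared between topics (here ‘roman’ and ‘cut’) are what tie the topic nodes together, and an excavator's name such as ‘howell’ sits in the network as just another word node. Running the modularity routine over the imported graph then groups topics, words and names into modules.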

In which I topic model the entire PAS database by individual rows

Previously, I was trying to consider the geography of Roman Britain as a corpus of documents – individual geographic (modern) areas – where the records in the Portable Antiquities Scheme database formed the words of the document.

Today, I inverted that process. I treated each individual row in the entire PAS database as an individual document, with the data within that record its words. It took about two hours of processing time, looking for 100 topics. I now have a series of outputs that neither Excel nor Notepad++ can open, as they are too big. I’ll have to break the files up before I can dig too much deeper into them. However, what I can examine seems promising – topics that seem to indicate various regions; topics that indicate particular finds officers; topics that indicate particular kinds of artefacts; topics that indicate the status of the object (whether it was returned to the finder). Here’s a sampling:

Topic Weight Words
94 0.01654 mm thick wide weighs long diameter measures grams length width weight thickness high weighing fragment edge section maximum measuring
3 0.01442 suffolk east metal detector minter faye finder returned alloy plouviez judith copper mid st jane carr coastal edmundsbury geake
22 0.01409 green patina surface dark colour mid brown alloy copper corrosion light worn grey slightly condition corroded object pitted original
45 0.01374 mm weight width thickness length diameter atherton rachel maximum thick derbyshire dimensions wt height fragment midlands max including complete
1 0.01165 yorkshire humber riding metal detector east finder north returned alloy copper holmes simon paynton ceinwen hambleton selby david illegible
56 0.01076 east lincolnshire adam daubney midlands detector metal alloy lindsey copper finder returned kesteven north west elwes marina nottinghamshire rushcliffe
49 0.01028 lines decorated incised side decoration line central edge raised ring centre border grooves dot cross rectangular end upper punched
69 0.01003 ae nummus constantine house gloria exercitvs bust soldiers standards copper standard victory prow ii left illegible constantinopolis helmeted ad
27 0.00985 frame buckle pin bar alloy copper medieval loop edge oval outer missing cast strap double narrowed shaped looped section
92 0.00976 sherd pottery rim fabric sherds ware vessel grey chance find body medieval ceramic roman detecting inclusions colour surface orange
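On the outputs that are too big for Excel or Notepad++: one way to break them up is to stream the file line by line and write it out in numbered chunks, so that nothing has to fit in memory or in a spreadsheet's row limit. This is a sketch only; the file names and the chunk size are placeholders, and the demonstration runs on a tiny temporary file rather than real MALLET output.

```python
import os
import tempfile


def split_file(path, lines_per_chunk, out_dir):
    """Stream `path` line by line into numbered chunk files.

    Returns the list of chunk paths in order.
    """
    chunk_paths = []
    out = None
    with open(path, encoding="utf-8") as src:
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:
                # Start a new chunk every `lines_per_chunk` lines.
                if out:
                    out.close()
                chunk_path = os.path.join(
                    out_dir, f"chunk_{len(chunk_paths):04d}.txt"
                )
                chunk_paths.append(chunk_path)
                out = open(chunk_path, "w", encoding="utf-8")
            out.write(line)
    if out:
        out.close()
    return chunk_paths


def count_lines(path):
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


# Demonstration on a ten-line stand-in for a large output file.
with tempfile.TemporaryDirectory() as tmp:
    big = os.path.join(tmp, "big_output.txt")
    with open(big, "w", encoding="utf-8") as f:
        for n in range(10):
            f.write(f"row {n}\n")
    chunks = split_file(big, 4, tmp)
    sizes = [count_lines(p) for p in chunks]
    print(sizes)  # [4, 4, 2]
```

For a real run, a chunk size a little under a spreadsheet's row limit would let each piece open on its own.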

topic 3, topic 1, topic 94, topic 3, topic 22, topic 45

Topics as Word Clouds

Elijah Meeks and Matt Jockers have both used word clouds to visualize topics from topic models. Colour, orientation, relative placement of the words – all of these could be used to convey different dimensions of the data. Below, you’ll find clouds for each of my initial 50 topics generated from the Roman materials in the Portable Antiquities Scheme database (some 100 000 rows, or nearly 1/5 of the database, collected together into ‘documents’ where each unitary district authority is the ‘document’ and the text is the descriptions of things found there). The word clouds are generated from the word weights file that MALLET can output. There are 8100 unique tokens when I convert the database into a MALLET file; each one of those is present in each ‘bag of words’ or topic that MALLET generates, but to differing degrees. Thus, word clouds (here generated with Wordle) pull out important information that the word keys document does not. However, given that I optimized the interval whilst generating the topic models, the keys document provides an indication of the strength of each topic in the corpus. I’ve arranged the word clouds by scaling them against the size of the strongest topic (topic 22), top-bottom, left-right. I’ll be damned if I can get WordPress to just display each image under the other one. Even stripped my table out, it did!
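To make the word weights idea concrete, here is a rough sketch of pulling the heaviest words per topic out of such a file. It assumes the file has one tab-separated `topic`, `word`, `weight` triple per line. The sample weights are invented for illustration, and the `word:weight` output is just one plausible input format for a weighted word cloud tool.

```python
import heapq
from collections import defaultdict


def top_words(weight_lines, n=5):
    """Group 'topic<TAB>word<TAB>weight' lines by topic and keep the
    n heaviest (weight, word) pairs for each topic."""
    by_topic = defaultdict(list)
    for line in weight_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        topic, word, weight = int(parts[0]), parts[1], float(parts[2])
        by_topic[topic].append((weight, word))
    return {t: heapq.nlargest(n, pairs) for t, pairs in by_topic.items()}


# Invented weights: every word appears in every topic, but to differing degrees.
sample = [
    "0\tcoin\t120.01\n",
    "0\tbrooch\t35.01\n",
    "0\tsherd\t0.01\n",
    "1\tcoin\t0.01\n",
    "1\tbrooch\t2.01\n",
    "1\tsherd\t88.01\n",
]

for topic, pairs in sorted(top_words(sample, n=2).items()):
    print(f"topic {topic}")
    for weight, word in pairs:
        print(f"  {word}:{weight}")
```

The point of working from weights rather than the keys listing is that the keys give only an ordered list of top words, whereas the weights say how much heavier ‘coin’ is than ‘brooch’ within a topic.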

At any rate, as one churns through the 50 topics, after about the first 11 (depicted below), the topics get progressively noisier as MALLET attempts to deal with incomplete transcriptions of the epigraphy of the coins, and the frequent notes about the source for the identification of the coins (the work of Guest & Wells). The final topic depicted here, topic 20, directly references a note often left in the database concerning the quality of an individual record; these notes are frequently attached to materials that entered the British Museum collection before the Portable Antiquities Scheme got going, and hence the information is not up to the usual standards.

This exercise suggests to me that 50 topics is just too many. I’m rerunning everything with 10 topics this time.
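Ordering topics by strength, and scaling each against the strongest, could be reproduced from a keys listing along these lines. This is a sketch that assumes each line holds a topic number, a weight and then the words, separated by whitespace. The three sample rows are abbreviated from the PAS-by-row table earlier on this page, not from the 50-topic run.

```python
def rank_topics(keys_lines):
    """Sort topics by weight, strongest first, and add each topic's
    weight as a fraction of the strongest topic's weight."""
    topics = []
    for line in keys_lines:
        parts = line.strip().split(None, 2)
        if len(parts) < 3:
            continue
        try:
            topic_id, weight = int(parts[0]), float(parts[1])
        except ValueError:
            continue  # header row or malformed line
        topics.append((topic_id, weight))
    if not topics:
        return []
    topics.sort(key=lambda t: t[1], reverse=True)
    strongest = topics[0][1]
    return [(tid, w, w / strongest) for tid, w in topics]


keys = [
    "Topic Weight Words",
    "22 0.01409 green patina surface dark colour",
    "94 0.01654 mm thick wide weighs long",
    "3 0.01442 suffolk east metal detector minter",
]

for topic_id, weight, scale in rank_topics(keys):
    print(f"topic {topic_id}: weight {weight}, scale {scale:.2f}")
```

The `scale` column is the sort of figure one could use to size each cloud relative to the largest, and a long tail of very small values is one informal hint that the number of topics is set too high.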

Topic 22

Topic 48

Topic 43

Topic 32

Topic 7

Topic 33

Topic 13

Topic 47

Topic 46

Topic 35

Topic 20