archaeology, data management, data mining, making

ODATE: Open Digital Archaeology Textbook Environment (original proposal)

“Never promise to do the possible. Anyone could do the possible. You should promise to do the impossible, because sometimes the impossible was possible, if you could find the right way, and at least you could often extend the limits of the possible. And if you failed, well, it had been impossible.”
Terry Pratchett, Going Postal

And so we did. And the proposal Neha, Michael, Beth, and I put together was successful. The idea we pitched to eCampusOntario is for an open textbook with an integral computational laboratory (DHBox!) for teaching digital archaeology. The work of the DHBox team, and their generous licensing of their code, makes this entire project possible: thank you!

We put together a pretty ambitious proposal. Right now, we’re working towards designing the minimum viable version of this. The original funding guidelines didn’t envision any sort of crowd-collaboration, but we think it’d be good to figure out how to make this less us and more all of you. That is, maybe we can provide a kernel that becomes the seed for development along the lines of the Programming Historian.

So, in the interests of transparency, here’s the meat-and-potatoes of the proposal. Comments & queries welcome at bottom, or if I forget to leave that open, on twitter @electricarchaeo.


Project Description

We are excited to propose this project to create an integrated digital laboratory and e-textbook environment, which will be a first for the broader field of archaeology.

Digital archaeology as a subfield rests upon the creative use of primarily open-source and/or open-access materials to archive, reuse, visualize, analyze and communicate archaeological data. Digital archaeology encourages innovative and critical use of open access data and the development of digital tools that facilitate linkages and analysis across varied digital sources. 

To that end, the proposed ‘e-textbook’ is an integrated cloud-based digital exploratory laboratory of multiple cloud-computing tools with teaching materials that instructors will be able to use ‘out-of-the-box’ with a single click, or to remix as circumstances dictate.

We are proposing to create in one package both the integrated digital exploratory laboratory and the written texts that engage the student with the laboratory. Institutions may install it on their own servers, or they may use our hosted version. By taking care of the digital infrastructure that supports learning, the e-textbook enables instructors and students to focus on core learning straight away. We employ a student-centred, experiential, and outcome-based pedagogy, where students develop their own personal learning environment (via remixing our tools and materials provided through the laboratory) networked with their peers, their course professors, and the wider digital community.

Project Overview

Digital archaeology as a field rests upon the creative use of primarily open-source and/or open-access materials to archive, reuse, visualize, analyze and communicate archaeological data. This reliance on open-source and open-access is a political stance that emerges in opposition to archaeology’s past complicity in colonial enterprises and scholarship; digital archaeology resists the digital neo-colonialism of Google, Facebook, and similar tech giants that typically promote disciplinary silos and closed data repositories. Specifically, digital archaeology encourages innovative, reflective, and critical use of open access data and the development of digital tools that facilitate linkages and analysis across varied digital sources. 

To that end, the proposed ‘e-textbook’ is an integrated cloud-based digital exploratory laboratory of multiple cloud-computing tools with teaching materials that instructors will be able to use ‘out-of-the-box’ with a single click, or to remix as circumstances dictate. The Open Digital Archaeology Textbook Environment will be the first of its kind to address methods and practice in digital archaeology.

Part of our inspiration comes from the ‘DHBox’ project from CUNY (City University of New York), a project that is creating a ‘digital humanities laboratory’ in the cloud. While the tools of the digital humanities are congruent with those of digital archaeology, they are typically configured to work with texts rather than the material culture in which archaeologists specialise. The second inspiration is the open-access guide ‘The Programming Historian’, a series of how-tos and tutorials pitched at historians confronting digital sources for the first time. A key challenge scholars face in carrying out novel digital analysis is how to install or configure software; each ‘Programming Historian’ tutorial therefore explains at length and in detail how to configure software. The present e-textbook merges the best of both approaches to create a singular experience for instructors and students: a one-click digital laboratory approach, where installation of materials is not an issue, and with carefully designed tutorials and lessons on theory and practice in digital archaeology.

The word ‘e-textbook’ will be used throughout this proposal to include both the integrated digital exploratory laboratory and the written texts that engage the student with it and the supporting materials. This digital infrastructure includes the source code for the exploratory laboratory so that faculty or institutions may install it on their own servers, or they may use our hosted version. This accessibility is a key component because one instructor alone cannot be expected to provide technical support across multiple operating systems on student machines whilst still bringing the data, tools and methodologies together in a productive manner. Moreover, at present, students in archaeology do not necessarily have the appropriate computing resources or skill sets to install and manage the various kinds of server-side software that digital archaeology typically uses. Thus, all materials will be appropriately licensed for maximum re-use. Written material will be provided as source markdown-formatted text files (this allows for the widest interoperability across platforms and operating systems; see sections 9 and 10). By taking care of the digital infrastructure that supports learning, the e-textbook enables instructors and students to focus on core learning straight away.

At our e-textbook’s website, an instructor will click once to ‘spin up’ a digital laboratory accessible within any current web browser, a unique version of the laboratory for that class, at a unique URL. At that address, students will select the appropriate tools for the tasks explored in the written materials. Thus, valuable class time is directed towards learning and experimenting with the material rather than installing or configuring software.

The e-textbook materials will be pitched at an intermediate level; appropriate remixing of the materials with other open-access materials on the web will allow the instructor to increase or decrease the learning level as appropriate. Its exercises and materials will be mapped to a typical one-semester time frame.


Digital archaeology sits at the intersection of the computational analysis of human heritage and material culture, and rapidly developing ecosystems of new media technologies. Very few universities in Ontario have digital archaeologists as faculty and thus digital archaeology courses are rarely offered as part of their roster. Of the ten universities in Ontario that offer substantial undergraduate and graduate programs in archaeology, only three (Western, Ryerson and Carleton) currently offer training in digital methods. Training in digital archaeology is offered on a per-project level, most often in the context of Museum Studies, History, or Digital Media programs. Yet growing numbers of students demand these skills, often seeking out international graduate programs in digital archaeology. This e-textbook therefore would be a valuable resource for this growing field, while simultaneously building on Ontario’s leadership in online learning and Open Educational Resources. Moreover, the data and informatics skills that students could learn via this e-textbook, as well as the theoretical and historiographical grounding for those skills, see high and growing demand, which means that this e-textbook could find utility beyond the anthropology, archaeology, and cultural heritage sectors.

Our e-textbook would arrive at an opportune moment to make Ontario a leading centre for digital archaeological education. Recently, the provincial government has made vast public investment in archaeology by creating ‘Sustainable Archaeology’, a physical repository of Ontario’s archaeological materials and centre for research. While growing amounts of digitized archaeological materials are being made available online via data publishers such as Open Context, and repositories such as tDAR, DINAA and ADS, materials for teaching digital archaeology have not kept pace with the sources now available for study (and print-only materials go out of date extremely quickly). Put simply, once archaeological material is online, we face the questions of “so what?” and “now what?” This e-textbook is about data mining the archaeological database, reading distantly thousands of ‘documents’ at once, graphing, mapping, visualizing what we find and working out how best to communicate those findings. It is about writing archaeology in digital media that are primarily visual media. Thus, through the e-textbook, students will learn how to collect and curate open data, how to visualize meaningful patterns within digital archaeological data, and how to analyze them.

Furthermore, this e-textbook has two social goals:

  1. It agitates for students to take control of their own digital identity, and to think critically about digital data, tools and methods. This, in turn, can enable them to embody open access principles of research and communication.
  2. It promotes the creation, use and re-use of digital archaeological data in meaningful ways that deepen our understanding of past human societies.

Research materials that are online do not speak for themselves, nor are they necessarily findable or ‘democratized’. To truly make access democratic, we must equip scholars with “digital literacy” — the relevant skills and theoretical perspectives that enable critical thinking. These aims are at the heart of the liberal arts curriculum. We know that digital tools are often repurposed from commercial services and set to work for research ends in the social sciences and liberal arts. We are well aware that digital tools inherently emphasize particular aspects of data, making some more important than others. Therefore, it is essential that students think critically about the digital tools they employ. What are the unintended consequences of working with these tools? There is a relative dearth of expertise in critically assessing digital tools, and in seeing how their biases (often literally encoded in how they work) can impact the practice of archaeology.

To that end, we employ a student-centred, experiential, and outcome-based pedagogy, where students develop their own personal learning environment (via remixing our tools and materials provided through the laboratory) networked with their peers, their course professors, and the wider digital community.

Content Map

E-textbook Structure (instructional materials to support the digital exploratory laboratory)

The individual pieces (files and documents) of this e-textbook will all be made available using the distributed Git version control software (via GitHub). This granularity of control will enable interested individuals to take the project to pieces to reuse or remix those elements that make the most sense for their own practice. Since the writing is in the markdown text format, learners can create EPubs, PDFs, and webpages on demand as necessary, which facilitates easy reuse, remixing and adaptation of the content. The granularity of control also has the added bonus that our readers/users can make their own suggestions for improvement of our code and writing, which we can then fold into our project easily. In this fashion our e-textbook becomes a living document that grows with its use and readership.

Introduction. Why Digital Archaeology?

Part One: Going Digital

  1. Project management basics
    1. GitHub & Version control
    2. Failing Productively
    3. Open Notebook Research & Scholarly Communication
  2. Introduction to Digital Libraries, Archives & Repositories
    1. Command Line Methods for Working with APIs
    2. Working with Open Context
    3. Working with Omeka
    4. Working with tDAR
    5. Working with ADS
  3. The Ethics of Big Data in Archaeology

The digital laboratory elements in this part enable the student to explore version control, a bash shell for command line interactions, and an Omeka installation.

Part Two: Making Data Useful

  1. Designing Data Collection
  2. Cleaning Data with OpenRefine
  3. Linked Open Data and Data publishing

The digital laboratory elements in this part continue to use the bash shell, as well as OpenRefine.

Part Three: Finding and Communicating the Compelling Story

  1. Statistical Computing with R and Python Notebooks; Reproducible code
  2. D3, Processing, and Data Driven Documents
  3. Storytelling and the Archaeological CMS: Omeka, Kora
  4. Web Mapping with Leaflet
  5. Place-based Interpretation with Locative Augmented Reality
  6. Archaeogaming and Virtual Archaeology
  7. Social media as Public Engagement & Scholarly Communication in Archaeology

The digital laboratory elements in this part include the bash shell, Omeka (with the Neatline mapping installation) and Kora installations, Mapwarper, RStudio Server, Jupyter notebooks (Python), Meshlab, and Blender.

Part Four: Eliding the Digital and the Physical

  1. 3D Photogrammetry & Structure from Motion
  2. 3D Printing, the Internet of Things and “Maker” Archaeology
  3. Artificial Intelligence in Digital Archaeology (agent models; machine learning for image captioning and other classificatory tasks)

The digital laboratory elements in this part include Wu’s Visual Structure from Motion package, and the torch-rnn machine learning package.

Part Five: Digital Archaeology’s Place in the World

  1. Marketing Digital Archaeology
  2. Sustainability & Power in Digital Archaeology

To reiterate, the digital laboratory portion of the e-textbook will contain within it a file manager; a bash shell for command line utilities (useful tools for working with CSV and JSON formatted data); a Jupyter Notebook installation; an RStudio installation; VSFM structure-from-motion; Meshlab; Omeka with Neatline; Jekyll; Mapwarper; Torch for machine learning and image classification. Other packages may be added as the work progresses. The digital laboratory will itself run on a Linux Ubuntu virtual machine. All necessary dependencies and packages will be installed and properly configured. The digital laboratory may be used from our website, or an instructor may choose to install locally. Detailed instructions will be provided for both options.

archaeology, competition, data management, data mining, making, mash up

Open Context & Carleton Prize for Archaeological Visualization

Increasingly, archaeology data are being made available openly on the web. But what do these data show? How can we interrogate them? How can we visualize them? How can we re-use data visualizations?

We’d like to know. This is why we have created the Open Context and Carleton University Prize for Archaeological Visualization, and we invite you to build, make, and hack with the Open Context data and API for fun and prizes.

Who Can Enter?

Anyone! Wherever you are in the world, we invite you to participate. All entries will be publicly accessible and promoted via a contest gallery on the Open Context website.


The prize competition is sponsored by the following:

  • The Alexandria Archive Institute (the nonprofit that runs Open Context)
  • The Digital Archaeology at Carleton University Project, led by Shawn Graham


We have prizes for the following categories of entries:

  • Individual entry: project developed by a single individual
  • Team entry: project developed by a collaborative group (2-3 people)
  • Individual student entry: project developed by a single student
  • Student team entry: project developed by a team of (2-3) students


All prizes are awarded in the form of cash awards or gift vouchers of equivalent value. Please note that the currency depends on the award type:

  • Best individual entry: $US200
  • Best team entry (teams of 2 or 3): $US300 (split accordingly)
  • Best student entry: $C200
  • Best student team entry (teams of 2 or 3): $C300 (split accordingly)

We will also note “Honorable Mentions” for each award category.

Entry Requirements

We want this prize competition to raise awareness of open data and reproducible research methods by highlighting some great examples of digital data in practice. To meet these goals, specific project entry requirements include the following:

  • The visualization should be publicly accessible/viewable, live on the open Web
  • The source code should be made available via Github or similar public software repository
  • The project needs to incorporate and/or create open source code, under licensing approved by the Free Software Foundation.
  • The source code must be well-commented and documented
  • The visualization must make use of the Open Context API; other data sources may also be utilized in addition to Open Context
  • A readme file should be provided (as .txt or .md or .rtf), which will include:
    • Instructions for reproducing the visualization from scratch
    • Interesting observations about the data that the visualization makes possible
    • Documentation of your process and methods (that is to say, ‘paradata’ as per the London Charter, section 4)

All entries have to meet the minimum requirements described in ‘Entry Requirements’ to be considered.

Entries are submitted by filling in a web form that will ask you for your particulars, the URL to your ‘live’ entry, and the URL to your code repository. You will also be required to attest that the entry is your own creation.

Important Dates

  • Closing date for entry submissions: December 16, 2016
  • Winners announced: January 16, 2017

Criteria for Judging

  • Potential archaeological insight provided by the visualization
  • Reproducibility
  • Aesthetic impact
  • Rhetorical impact
  • Appropriate recognition for/of data stakeholders (creators and other publics)

Attention will be paid in particular to entries that explore novel ways of visualizing archaeological data, or innovative re-uses of data, or work that takes advantage of the linked nature of Open Context data, or work that features robust/reproducible code for visualizations that could be easily/widely applied to other datasets.


The judges for this competition are drawn from across North America:


data mining, digital history, making

The Robotic Edward Gibbon

Continuing on from yesterday’s post and the suggestion of Steve Leahy:

I have fed The Decline and Fall of the Roman Empire into my neural network. What truths about Gibbon’s writing will be revealed? A first pass: footnotes!


[Footnote 42: Herodian and Marcus, videra de Mestaem.]
[Footnote 221: See the latter of Carthage, and they not in discipline his power in this present country, the attending states of the historians, and the whole senator, which aport the Imperial son and the states of the most army as the command, and their beautiful and singular instances probable to the accustomed and public enemy. The flower of his soldiers was a magnificent spirit of one hundred and even, whose revenge in which his death, for under the emperor was the frontiers of the Hustilians. (See Palegyria, l. xiii. c. 2) in the collection of their taughters. In the emperor Maximian had renewed the mind of the mention of the conduct of the Greeks, and the great human accompanient of the whole hears of the Bratish ineffection. See the ministers of his erections and long tribes who had atthibuted a carable of military grandacts, the fixed of the death, the empire, and who gable to possess the superior successors of the throne, but even the tribunes, were filled as having thrown a proper entire, which they assembled a fliend the two triumphal princes, the accused of the nations of Augustus, in the conduct of the Gallienus.]

[Footnote 35: Adopis, is of these generils of his monarchys of the End and military enjoyments. The consuls of Valerian followed the applausing of the army. [32] Had obtained on the tenting of the little distinction of the conquest of their manners. They were been disciplined by their government which were the guards of particulars and accused of the first names of a luxury of their insolence, and the expedition of the emperor Maximian from the collected sentence of the son who shared the son of the Arminian greatness, supposed to the former the three were cluins, which had only distributed the senate had the honor and expression of the empire. The ancient world, and the first defence of the army alled the military property.]


I waited for quite some time, then ran it again with the temperature dialed close to 0. When you have things arranged that way, you get a sense of the core issues apparent in the writer. In Mr. Gibbon’s case:


The senate was the senate and the senate and the senate and the constitution…The provinces of the Romans were supported by the soldiers who had been subsisted by the emperor the senate, and the provinces of the emperor was confined to the provinces of the senate, and the constitution of the emperor was supported by the soldiers of the Roman empire


No surprises there: Roman history is political history. But if we dial the temperature all the other way, we get a bit more creativity, a bit more … humanity… in the writing:


The Roman Empire was related by the immediate and refulary influence of its expiess in faction: which became the multitude of poperal prophet, which can the most sepurity from every prince, who admitted, in a country of importance and successful sense, or attended on this prince to divide a magnificence, particularly formed from his life and name. When Diocletian was a common tolerant murder of polished the latter, which must have could be found in the remainder, which is safe by Trajan, as the Agrispan and a hundred emperors, the theatres of emperor inferiors in fortifications of ancient pity, which cut the confession of the former disciplined Aurelian into a single and powerful design of attributes, and more sounded and became a sense of stones, they against his superiority of conquest and inconsiderable successors.

[Footnote 23: The Porticular might recover on the acclamations of the Several assent of various numbers and his own historian, to expect with the object of the history of the Paastis, considerated them to the separate respect of the world from the easy on the choice of Herodian. The contemporary church of Decia, and Shohians in the successors of Therspation; [14] the title of Goths and garding of the Romans deperded by their virtues under the conference of his possession.

[Footnote 83: See the city of Maximian into which it any life is impossible against the family of his own provinces. It is by preserved, that Sirman is supplied by the Gibbon sesters, and provided like the throne and his declining antiquity, and described the funishmentang and foundation of climate, by religious and the emperor of their government, formed contest to destroy the wails, expired them, and dreaded the villages of the most noble respectable taxes, both supported their consulship. A serious party the creation of government were alleged to the barbarians of his accounts the prudence of the three country, and in the enterprises of the larger Galicies.
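
What the temperature dial does in the two runs above can be sketched in a few lines of Python. This is a generic illustration, not the code of the network used here: I am assuming the usual arrangement in which the network's raw scores for each candidate next character are divided by the temperature before being turned into probabilities. The scores are made up.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three candidate next characters.
scores = [2.0, 1.0, 0.5]

cold = softmax_with_temperature(scores, 0.1)  # close to 0: the favourite almost always wins
warm = softmax_with_temperature(scores, 1.0)  # the distribution as the model scored it
hot = softmax_with_temperature(scores, 5.0)   # high: the candidates become nearly equal
```

With the temperature near 0 the most likely character is picked nearly every time, which is why the output loops on 'the senate'; with a high temperature, unlikely characters get a real chance, hence the inventive spelling.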


For the Terry Pratchett fans amongst you, all of this is probably an abomination unto Nuggan.


update: at around 8000 and 8500 iterations, we’ve got further output. Now I shall feed it every blog post from this blog…

data mining, digital history

On Punctuation

Posters of various literary works by Nicholas Rougeux – as represented by the punctuation therein – have been doing the rounds lately. They’re lovely; in the absence of words we intuit something of the nature of the work from the pauses, the parentheses, the short staccato dashes and dots; a kind of telegraphy of meaning.

Adam Calhoun posted some of his own reflections on this kind of work and, helpfully, some Python code for doing the same. Now, one might want to adjust the resulting output to be as wonderfully evocative as Rougeux’s work, but for getting started, it’s a great little piece of code.
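
The core of the idea fits in a few lines. What follows is my own minimal sketch, not Calhoun's code: it throws away everything in a text except the punctuation marks, keeping them in order. The extra curly quotes, en dash and ellipsis in the set are my assumption about what a Gutenberg text might contain.

```python
import string

def punctuation_only(text):
    """Keep only the punctuation marks of a text, in their original order."""
    marks = set(string.punctuation) | set("“”‘’–…")
    return "".join(ch for ch in text if ch in marks)

# A made-up sentence, just to show the function at work.
sample = "Hello, world! (Is it so?)"
print(punctuation_only(sample))
```

Run over a whole book, the resulting string can then be wrapped to a fixed width and printed or plotted.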

So I had to try it out. Behold! The sisters Susanna Moodie and Catharine Parr Traill both published reflections on life in the wilds of Canada in the 19th century, and happily, both ‘Roughing it in the Bush’ and ‘The Backwoods of Canada’ are available on Project Gutenberg. So what does the punctuation reveal about the sisters’ characterization of Canada / literary style?

A detail of the opening of Roughing it in the Bush:

[Screenshot: the punctuation of the opening of Roughing it in the Bush]

… you can really see changes in style quite clearly this way – what appears to be bits of dialogue and then lots and lots of exposition.

A detail of the opening of The Backwoods of Canada:

[Screenshot: the punctuation of the opening of The Backwoods of Canada]

…certainly a very different style, that much is clear. More variety? More richness? Someday, I must actually go and *read* these things… Today’s post is just really a reminder to myself to come back to all of this.

Postscript: Sebastian Heath mused on Twitter about sonifying this punctuation; I immediately thought that drums would be the best way to do that. So I mapped the ASCII values for the punctuation to sound, and I’ve started to play around. Have a listen:
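
For anyone who wants to try something similar, here is a rough sketch of one way to turn punctuation into drum notes. The mapping is an assumption of mine rather than the one I actually used: it folds each mark's ASCII value into the General MIDI percussion key range (35 to 81) and returns note numbers that a MIDI library could then play.

```python
import string

LOW, HIGH = 35, 81  # General MIDI percussion key range (assumed target)

def punctuation_to_notes(text):
    """Map each ASCII punctuation mark to a note number derived from its ASCII value."""
    span = HIGH - LOW + 1
    return [(ch, LOW + ord(ch) % span) for ch in text if ch in string.punctuation]

# A made-up line of text; letters and spaces are ignored.
notes = punctuation_to_notes("Wait -- what?! No; never.")
for mark, note in notes:
    print(mark, note)
```

Each mark always lands on the same drum, so a passage heavy with dashes sounds different from one full of commas.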



archaeology, data mining, digital history

Reactions to Battlefield Recovery episode 1

Battlefield Recovery, an execrable show that turns the looting of war dead into ‘entertainment’, was shown on Saturday on Channel 5 in the UK. I won’t dignify it by linking to it; instead see this article in the Guardian.

I wondered, however, what the tweeting public thought about the show – keeping in mind that Channel 5 viewers may or may not be the same kinds of folks who engage with Twitter. I used Ed Summers’ twarc to collect approximately 3600 tweets (there are likely many more, but the system timed out). The file containing the IDs of all of these tweets is available here. You can use this file in conjunction with twarc to recover all of the tweets and their associated metadata for yourself (which is approximately 19 MB worth of text). You can explore the language of the tweets for yourself via Voyant-Tools.

So the most retweeted interventions show a pretty strong signal of disapproval. I have not looked into users’ profiles to see whether or not folks identify as archaeologists. Nor have I mapped users’ networks to see how far these messages percolated, and into what kinds of communities. This is entirely possible to do of course, but this post just represents a first pass at the data.

Let’s look at the patterns of language in the corpus of tweets as a whole. I used the LDAvis package for R to create an interactive visualization of topics within the corpus, fitting it to 20 topics as a first stab. You can play with the visualization here. If you haven’t encountered topic modeling yet, it’s a technique to reverse engineer a corpus into the initial ‘topics’ from which the writers wrote (or could have written). So, it’s worth pointing out that it’s not ‘truth’ we’re seeing here, but a kind of thought exercise: if there were 20 topics that capture the variety of discourse expressed in these tweets, what would they look like? The answer is, quite a lot of outrage, dismay, and disappointment that this TV show was aired. Look in particular at, say, topic 8 or topic 3, and ‘disgust’. Topic 1, which accounts for the largest slice of the corpus, clearly shows how the discussants on Twitter were unpacking the rebranding of this show from its previous incarnation as ‘Nazi War Diggers’, and the pointed comments at Clearstory UK, the producers of Battlefield Recovery.

We can also look at patterns in the corpus from the point of view of individual words, imagining the interrelationships of word use as a kind of spatial map (see Ben Schmidt, Word Embeddings). If you give it a word – or a list of words – the approach will return to you words that are close in terms of their use. It’s a complementary approach to topic models. So, I wanted to see what terms were in the same vector as the name of the show & its producers (I’m using R). I give it this:

some_terms = nearest_to(model,model[[c("battlefieldrecovery", "naziwardiggers", "clearstoryuks")]],150)

And I see the interrelationships like so:

…a pretty clear statement about what 3600 tweets felt, in aggregate along this particular vector. Of the tweets I saw personally (I follow a lot of archaeologists), there was an unequivocal agreement that what this show was doing was no better than looting. With word vectors, I can explore the space between pairs of binaries. So let’s assume that ‘archaeologist’ and ‘looter’ are opposite ends of a spectrum. I can plot this using this code:

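library(wordVectors); library(dplyr); library(ggplot2)  # assumed setup: cosineSimilarity(), %>% and filter(), ggplot()
# The direction running from 'looters' towards 'archaeologists'; every word is scored against it
actor_vector = model[["archaeologists"]] - model[["looters"]]
word_scores = data.frame(word=rownames(model))
word_scores$actor_score = model %>% cosineSimilarity(actor_vector) %>% as.vector

# Plot only the words with the strongest skew in either direction
ggplot(word_scores %>% filter(abs(actor_score)>.725)) + geom_bar(aes(y=actor_score,x=reorder(word,actor_score),fill=actor_score<0),stat="identity") + coord_flip()+scale_fill_discrete("words associated with",labels=c("archaeologist","looter")) + labs(title="The words showing the strongest skew along the archaeologist-looter binary")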

which gives us:

You can see some individual usernames in there; to be clear, this isn’t equating those individuals with ‘archaeologist’ or ‘looter’, rather, tweets mentioning those individuals tend to be RT’ing them or they themselves are using language or discussing these particular aspects of the show. I’m at a loss to explain ‘muppets’. Perhaps that’s a term of derision.
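
To make the arithmetic behind that plot explicit, here is a toy sketch in plain Python. The three-dimensional vectors are invented for illustration (a real model has hundreds of dimensions and learns them from the tweets), but the two steps are the same as in the R code above: subtract one end of the binary from the other, then score each word by its cosine similarity to that difference.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 is same direction, -1 is opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up word vectors, purely for illustration.
vectors = {
    "archaeologists": [0.9, 0.1, 0.2],
    "looters": [0.1, 0.9, 0.2],
    "excavation": [0.8, 0.2, 0.3],
    "treasure": [0.2, 0.8, 0.1],
}

# The axis running from 'looters' towards 'archaeologists'.
axis = [a - l for a, l in zip(vectors["archaeologists"], vectors["looters"])]

# Positive scores lean towards 'archaeologists', negative ones towards 'looters'.
scores = {word: cosine_similarity(vec, axis) for word, vec in vectors.items()}
for word, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{word:15s} {score:+.2f}")
```

In this toy space 'excavation' comes out positive and 'treasure' negative, which is the same kind of skew the bar chart shows for the real corpus.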

So, as far as this analysis goes – and one ought really to map how far and into what communities these messages penetrate – I’d say on balance, the twittersphere was outraged at this television ‘show’. As Nick said,



data mining

gnōthi seauton, or, mine your own tweets

Sometimes, one of the best ways to understand a method is to run it on data that you know very well indeed. In which case, the ability to request one’s twitter archive and to feed it into R is quite handy. You make the request, download the csv, then paste the ‘text’ column into its own csv. Clean it up with regex to remove http and special characters etc, then feed it into this script:

This can take a while. When it’s done, go to the output folder, and copy each file into a single GitHub gist (as I’ve done here). Then, swap out the … for … and you can explore the result or share it. If you hit ‘view in another window’, you get the visualization full screen.

If you find something interesting in a topic or term, you can put that in the URL as appropriate and share/cite the relevant visualization directly. LDAvis is a really nice package.

So – what does all this mean? Well, at first blush, it shows that my tweeting activity is largely pretty consistent, for all of its mass. Topics 1 and 2 are on point for archaeology, history, and digital applications thereof; Topic 2 is filled with #msudai from the summer, where I went on a massive twitter-spree tweeting materials at participants and reporting on the institute to the wider world (indeed, at one point, we were trending in Detroit!). Other topics (6 for instance) evidence an interest in gaming and so on. In a way, it’s not the discrete topics, the clearly delimited ones, that are of interest. It’s the fuzzy stuff. Topics 19, 9, 15, and 7 all overlap. In Topic 15, two of the top three words are ‘fiction’ and ‘moocs’ (the top word is the username for a robot of mine that tweets the latest archaeological papers). A robot, a roboticized learning environment, fiction…. that perhaps says something.

Anyway, feel free to explore. Or give this a shot on your own materials (whether authored by you or from somewhere else).


data mining, digital history, making

If I could read your mind – Sonifying John Adams’ Diary

Maybe the question isn’t one of reading someone’s thoughts, but rather, listening to the overall pattern of topics within them. Topic modeling does some rather magical things. It imposes sense (it fits a model) onto a body of text. The topics that the model duly provides give us insight into the semantic patterns latent within the text (but see Ben Schmidt’s WEM approach, which focuses on systems of relationships in the words themselves – more on this anon). There are a variety of ways emerging for visualizing these patterns. I’m guilty of a few myself (principally, I’ve spent a lot of time visualizing the interrelationships of topics as a kind of network graph, eg this). But I’ve never been happy with them because they often leave out the element of time. For a guy who sometimes thinks of himself as an archaeologist or historian, this is a bit problematic.

I’ve been interested in sonification for some time, the idea that we represent data (capta) aurally. I even won an award for one experiment in this vein, repurposing the excellent scripts of the Data Driven DJ, Brian Foo. What I like about sonification is that the time dimension becomes a significant element in how the data is represented, and how the data is experienced (cf. this recent interview on Spark with composer/prof Chris Chafe). I was once the chapel organist at Bishop’s University (I wasn’t much good, but that’s a story for another day) so my interest in sonification is partly in how the colour of music, the different instrumentation and so on can also be used to convey ideas and information (rather than using purely algorithmically generated tones; I’ve never had much formal musical training, so I know there’s a literature and language to describe what I’m thinking that I simply must go learn. Please excuse any awkwardness).

So – let’s take a body of text, in this case the diaries of John Adams. I scraped these, one line per diary entry (see this csv we prepped for our book, the Macroscope). I imported it into R and topic modeled for 20 topics. The output is a monstrous csv showing the proportion each topic contributes to the entire diary entry (so each row adds to 1). If you use conditional formatting in Excel, and dial the decimal places to 2, you get a pretty good visual of which topics are the major ones in any given entry (and the really minor ones just round to 0.00, so you can ignore them).

It rather looks like an old-timey player piano roll:

Player Piano Anyone?
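The same rounding trick works outside Excel. A minimal Python 3 sketch, assuming the R output has been saved as a csv with the header row and any ID column stripped, so that each row is just the twenty proportions for one diary entry (the file layout and the function names are my assumptions):

```python
import csv


def read_proportions(path):
    """Read a headerless csv of floats: one row per entry, one column per topic."""
    with open(path, newline="") as handle:
        return [[float(cell) for cell in line] for line in csv.reader(handle)]


def dominant_topics(row):
    """Return (topic_number, proportion) pairs that survive rounding to 2 places.

    Topics are numbered from 1; anything that rounds to 0.00 is dropped,
    and the survivors are sorted from largest to smallest.
    """
    kept = [(i + 1, round(p, 2)) for i, p in enumerate(row) if round(p, 2) > 0]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Running dominant_topics over every row from read_proportions gives, entry by entry, the handful of topics that would show up as coloured cells in the spreadsheet view.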

I then used ‘Musical Algorithms‘ one column at a time to generate a midi file. I’ve got the various settings in a notebook at home; I’ll update this post with them later. I then uploaded each midi file (all twenty) into GarageBand in the order of their complexity – that is, as indicated by file size:

Size of a file indicates the complexity of the source. Isn’t that what Claude Shannon taught us?
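To give a sense of what turning one column of proportions into notes involves, here is a minimal Python 3 sketch of a linear mapping onto MIDI note numbers. It is not a description of what Musical Algorithms does internally, and it does not reproduce the settings in my notebook; the function name and the default range of 36 to 84 (two octaves either side of middle C, which is note 60) are assumptions chosen for illustration.

```python
def to_midi_notes(proportions, low=36, high=84):
    """Scale values in [0, 1] linearly onto MIDI note numbers low..high."""
    span = high - low
    notes = []
    for p in proportions:
        p = min(max(p, 0.0), 1.0)  # clamp stray values into [0, 1]
        notes.append(low + int(round(p * span)))
    return notes
```

With this mapping, a topic that is absent from an entry sits on the lowest note and one that dominates it climbs towards the top of the range, so each column becomes one melodic line through the diary.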

The question then becomes: which instruments do I assign to what topics? In this, I tried to select from the instruments I had readily to hand, and to select instruments whose tone/colour seemed to resonate somehow with the keywords for each topic. Which gives me a selection of brass instruments for topics relating to governance (thank you, Sousa marches); guitar for topics connected perhaps with travels around the countryside (too much country music on the radio as a child, perhaps); strings for topics connected with college and studying (my own study music as an undergrad influencing the choice here); and woodwinds for the minor topics that chirp and peek here and there throughout the text (some onomatopoeia I suppose).

GarageBand’s own native visualization owes much to the player piano aesthetic, and so provides a rolling visualization to accompany the music. I used QuickTime to grab the GarageBand visuals, and iMovie to marry the two together again, since QuickTime doesn’t grab the audio generated within the computer. Then I changed the name of each of the tracks to reflect the keywords for that topic.

Drumroll: I give you the John Adams 20:

data mining

Extracting Places with Python

Ok, a quick note to remind myself – I was interested in learning how to use this: 

Installation was a bit complicated; lots of dependencies. The following pages helped sort me out:

AND ultimately, I had to open up one of the geography/ file and change one line of code (line 31 as it happens), as per

So, first, let’s get all the bits and pieces installed. I downloaded the package as zip, unzipped, then:

$ sudo python setup.py install

At each stage, I would run a little Python script. In my text editor, I just pasted their default script and saved it as a .py file, which I’d then run from the command line. This thing:

import geograpy
url = ''
places = geograpy.get_place_context(url=url)

Every error message moved me one step closer as it would tell me whatever module I was missing.

For starters, it turned out ‘pil’ was needed. But pil isn’t maintained any more. Some googling revealed that Pillow is the answer!

$ sudo pip install pillow

Next thing missing: lxml

$ sudo pip install lxml

Then beautiful soup was missing. So:

$ sudo pip install beautifulsoup

At this point, the error messages got a bit more cryptic:

Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>

So, from the command line I typed `python`, then `import nltk` and `nltk.download()`. A little window pops open, and I found the punkt tokenizer package. Hit the download button, close the window, `exit()`, and run my script again:

Resource u'taggers/maxent_treebank_pos_tagger/english.pickle'
  not found.  

Solved that one the same way. Then:

Resource u'chunkers/maxent_ne_chunker/english_ace_multiclass.pickle' not found


Resource u'corpora/words' not found
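The four resources named in those error messages can also be fetched in one go, without the pop-up window. A minimal sketch, assuming NLTK is already installed; the resource identifiers are read off the paths in the errors above, the function name is made up for illustration, and newer NLTK releases may ask for differently named resources.

```python
# Identifiers taken from the resource paths in the error messages above
NLTK_RESOURCES = [
    "punkt",
    "maxent_treebank_pos_tagger",
    "maxent_ne_chunker",
    "words",
]


def fetch_resources(names, download=None):
    """Download each named NLTK resource; return the names that failed."""
    if download is None:
        import nltk  # imported here so the list above is usable without NLTK
        download = nltk.download
    failed = []
    for name in names:
        # nltk.download returns True on success and False otherwise
        if not download(name):
            failed.append(name)
    return failed
```

Calling fetch_resources(NLTK_RESOURCES) from a Python prompt should return an empty list once everything has downloaded.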

Then… success! My wee script ran. (It’s always rather anticlimactic when something works – often, you only know it worked because you’re presented with the $ again, without comment). Now to get something useful out of it. So, for interest’s sake, I pointed it at a Gutenberg Project copy of The Case-Book of Sherlock Holmes,

and instructed it to print things out, like so:

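import geograpy

# URL of the page to analyse
url = ''
# Fetch the page and extract the place names it mentions
places = geograpy.get_place_context(url=url)
# Each *_mentions attribute is a list of (name, count) tuples
print(places.country_mentions)
print(places.region_mentions)
print(places.city_mentions)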

And the results in my terminal:
[(u'Canada', 2), (u'Turkey', 1), ('Central African Republic', 1), ('United Kingdom', 1), (u'Japan', 1), (u'France', 1), (u'United States', 1), (u'Australia', 1), (u'Hungary', 1), (u'South Africa', 1), (u'Norfolk Island', 1), (u'Jamaica', 1), (u'Netherlands', 1)]
… the ‘canada’ is surely because this was, of course…

[(u'Baron', 16), (u'England', 3), (u'Adelbert', 3), (u'Kingston', 2), (u'Strand', 2), (u'Southampton', 1), (u'Briton', 1), (u'Bedford', 1), (u'Baker', 1), (u'Queen', 1), (u'Liverpool', 1), (u'Doyle', 1), (u'Damery', 1), (u'Bedfordshire', 1), (u'Greyminster', 1), (u'Euston', 1)]
… a few names creeping in there…

[(u'Watson', 37), (u'Holmes', 34), (u'Godfrey', 23), (u'Ralph', 10), (u'Baron', 8), (u'Merville', 5), (u'London', 5), (u'Johnson', 4), (u'England', 3), (u'Eastern', 2), (u'Strand', 2), (u'Pretoria', 2), (u'Kingston', 2), (u'Violet', 2), (u'Turkey', 1), (u'Middlesex', 1), (u'Dickens', 1), (u'Bedford', 1), (u'God', 1), (u'Damery', 1), (u'Wainwright', 1), (u'Nara', 1), (u'Bohemia', 1), (u'Liverpool', 1), (u'Doyle', 1), (u'America', 1), (u'Southampton', 1), (u'Sultan', 1), (u'Baker', 1), (u'Richardson', 1), (u'Square', 1), (u'Four', 1), (u'Lomax', 1), (u'Emsworth', 1), (u'Scott', 1), (u'Valhalla', 1)]

So yep, a bit noisy, but promising. Incidentally, when I run it on that BBC news story, the results are much more sensible:

[(u'Ukraine', 23), ('Russian Federation', 20), ('Czech Republic', 2), (u'Lithuania', 1), (u'United States', 1), (u'Belgium', 1)]

[(u'Luhansk', 4), (u'Donetsk', 2)]

[(u'Russia', 20), (u'Moscow', 5), (u'Kharkiv', 5), (u'Donetsk', 2), (u'Independence', 2), (u'Media', 1), (u'Brussels', 1)]

So obviously the corpus that NLTK is using is geared towards more contemporary situations than the worlds described by Arthur Conan Doyle. That’s interesting, and useful to know. I expect – though I haven’t looked yet – that one could train NLTK’s taggers etc. on, say, a 19th-century corpus to get more useful results. Hmmm! A project for someone, perhaps in my #digh5000.
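Short of retraining anything, one stopgap for the noise in the Sherlock Holmes lists is to throw out the character names by hand. This is only a sketch of that idea in Python 3: the stoplist approach and the function name are my assumptions, not something geograpy provides, and the stoplist itself has to be written by someone who knows the text.

```python
def drop_known_names(mentions, not_places):
    """Remove (name, count) pairs whose name appears in a hand-made stoplist."""
    blocked = {name.lower() for name in not_places}
    return [(name, count) for name, count in mentions
            if name.lower() not in blocked]
```

Fed the city list above with a stoplist such as Watson, Holmes, Godfrey and Ralph, it would leave entries like London and Pretoria in their original order. It does nothing for places the tagger missed, which is where a better-suited training corpus would matter.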

archaeology, data mining

Grabbing data from Open Context

This morning, on Twitter, there was a conversation about site diaries and the possibilities of topic modeling for extracting insight from them. Open Context has 2618 diaries – here’s one of them. Eric, who runs Open Context, has an excellent API for all that kind of data. Append .json on the end of a file name, and *poof*, lots of data. Here’s the json version of that same diary.  So, I wanted all of those diaries – this URL (click & then note where the .json lives; delete the .json to see the regular html) has ’em all.

I copied and pasted that list of urls into a .txt file, and fed it to wget:

wget -i urlstograb.txt -O output.txt

and now my computer is merrily pinging Eric’s, putting all of the info into a single txt file. And sometimes crashing it, too.

(Sorry Eric).
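A gentler way to do the same grab is to pause between requests. Here is a minimal Python 3 sketch using only the standard library. The function names, the two-second delay and the one-record-per-line output are my assumptions; the only thing taken from Open Context is the convention, described above, of appending .json to a record’s URL (this sketch assumes URLs without query strings).

```python
import json
import time
import urllib.request


def json_url(url):
    """Append .json to a record URL unless it is already there."""
    url = url.strip()
    return url if url.endswith(".json") else url + ".json"


def fetch_all(url_file, out_path, delay=2.0):
    """Fetch each URL in turn, pausing between requests.

    Writes one JSON record per line, so the file can be read back
    a line at a time.
    """
    with open(url_file, encoding="utf-8") as urls, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in urls:
            if not line.strip():
                continue  # skip blank lines in the list
            with urllib.request.urlopen(json_url(line)) as response:
                record = json.loads(response.read().decode("utf-8"))
            out.write(json.dumps(record) + "\n")
            time.sleep(delay)  # be kind to the server
```

Writing one record per line is a deliberate choice here: several JSON documents pasted end to end in one file do not parse as a single JSON document, whereas a line-per-record file can be read one line at a time. If you would rather stay with wget, its --wait option adds a similar pause between downloads.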

When it’s done, I’ll rename it .json and then use Rio to get it into useable form for R. The data has geographic coordinates too, so with much futzing I expect I could *probably* represent topics over space (maybe by exporting to Gephi & using its geolayout).

Futz: that’s the operative word, here.