Open Notebooks

This post is more a reminder to me than anything you’d like to read, but anyway –

I want to make my research more open, more reproducible, and more accessible. I work from several locations, so I want to have all my stuff easily to hand. I work on a Mac (sometimes), a PC (sometimes), and on Linux (rarely, but it happens; with new goodies from Bill Turkel et al I might work there more!).

I build models in Netlogo. I do text analysis in R. I visualize and analyze with things like Voyant and Overview. I scrape websites. I use Excel quite a lot. I’m starting to write in markdown more often. I want to teach students (my students typically have fairly low levels of digital literacy) how to do all this too. What I don’t do is much web development type stuff, which means that I’m still struggling with concepts and workflow around things like version control. And indeed, getting access to a server where I can just screw around to try things out is difficult (for a variety of reasons). So my server-side skills are weak.

What I think I need is an open notebook. Caleb McDaniel has an excellent post on what this could look like. He uses Gitit. I looked at the documentation, and was defeated out of the gate. Carl Boettiger uses a combination of GitHub and Jekyll and who knows what else. What I really like is Mark Madsen’s example, but I’m not au fait enough yet with all the bits and pieces (damn you version control, commits, make, rake, et cetera et cetera!)

I’ve got IPython notebooks working on my PC, which are quite cool (I installed the Anaconda version). I don’t know much Python though, so yeah. Stefan Sinclair is working on ‘voyant notebooks’, which use the same general idea to wrap analysis around Voyant, so I’m looking forward to that. IPython can be used to call R, which is cool, but it’s still early days for me (here’s a neat example passing data to R’s ggplot2).

So maybe that’s just the wrong tool. Much of what I want to do, at least as far as R is concerned, is covered in this post by Robert Flight on ‘creating an analysis as a package and vignette’ in RStudio. And there’s also ‘packrat’, for making sure things are reproducible.

Some combination of all of this, I expect, will be the solution that’ll work for me. Soon I want to start doing some more agent-based modeling & simulation work, and it’s mission critical that I sort out my data management, notebooks, versioning, etc. first this time.

God, you should see the mess around here from the last time!

SAA 2015: Macroscopic approaches to archaeological histories: Insights into archaeological practice from digital methods

Ben Marwick and I are organizing a session for the SAA2015 (the 80th edition, this year in San Francisco) on “Macroscopic approaches to archaeological histories: Insights into archaeological practice from digital methods”. It’s a pretty big tent. Below is the session ID and the abstract. If this sounds like something you’d be interested in, why don’t you get in touch?

Session ID 743.

The history of archaeology, like most disciplines, is often presented as a sequence of influential individuals and a discussion of their greatest hits in the literature.  Two problems with this traditional approach are that it sidelines the majority of participants in the archaeological literature who are excluded from these discussions, and it does not capture the conversations outside of the canonical literature.  Recently developed computationally intensive methods as well as creative uses of existing digital tools can address these problems by efficiently enabling quantitative analyses of large volumes of text and other digital objects, and enabling large scale analysis of non-traditional research products such as blogs, images and other media. This session explores these methods, their potentials, and their perils, as we employ so-called ‘big data’ approaches to our own discipline.


Like I said, if that sounds like something you’d be curious to know more about, ping me.

Quickly Extracting Data from PDFs

By ‘data’, I mean the tables. There are lots of archaeological articles out there that you’d love to compile together to do some sort of meta-study. Or perhaps you’ve gotten your hands on pdfs with tables and tables of census data. Wouldn’t it be great if you could just grab that data cleanly? Jonathan Stray has written a great synopsis of the various things you might try and has sketched out a workflow you might use. Having read that, I wanted to try ‘Tabula’, one of the options that he mentioned. Tabula is open source and runs on all the major platforms. You simply download it and double-click on the icon; it runs within your browser. You load your pdf into it, and then draw bounding boxes around the tables that you want to grab. Tabula will then extract that table cleanly, allowing you to download it as a csv or tab-separated file, or paste it directly into something else.

For instance, say you’re interested in the data that Gill and Chippindale compiled on Cycladic Figures. You can grab the pdf from JSTOR:

Material and Intellectual Consequences of Esteem for Cycladic Figures
David W. J. Gill and Christopher Chippindale
American Journal of Archaeology , Vol. 97, No. 4 (Oct., 1993) , pp. 601-659
Article DOI: 10.2307/506716

Download it, and then feed it into Tabula. Let’s look at table 2.

You could just highlight this table in your pdf reader and hit ctrl+c to copy it; when you paste that into your browser, you’d get:
Everything in a single column. For a small table, maybe that’s not such a big deal. But let’s look at what you get with Tabula. You drag the square over that same table; when you release the mouse button you get:
Much, much cleaner & faster! I say ‘faster’, because you can quickly drag the selection box around every table and hit download just the one time. Open the resulting csv file, and you have all of your tables in a useful format:
But wait, there’s more! Since you can copy directly to the clipboard, you can paste directly into a google drive spreadsheet (thus taking advantage of all the visualization options that Google offers) or into something like Raw from Density Design.
Tabula is a nifty little tool that you’ll probably want to keep handy.

Gaze & Eonydis for Archaeological Data

I’m experimenting with Clement Levallois’ data mining tools ‘Gaze’ and ‘Eonydis’. I created a table with some mock archaeological data in it: artefact, findspot, and date range for the artefact. More on dates in a moment. Here’s the fake dataset.

Firstly, Gaze will take a list of nodes (source, target), and create a network where the source nodes are connected to each other by virtue of sharing a common target. Clement explains:

Paul, hamster
… In this case, it is interesting to get a network made of Paul, Gerald, Marie and Donald (sources nodes), showing how similar they are in terms of pets they own. Make sure you do this by choosing “directed networks” in the parameters of Gaze. A related option for directed networks: you can choose a minimum number of times Paul should appear as a source to be included in the computations (useful to filter out unfrequent, irrelevant nodes: because you want only owners with many pets to appear for instance).

The output is in a nodes.dl file and an edges.dl file. In Gephi, go to the import spreadsheet button on the data table, import the nodes file first, then the edges file. Here’s the graph file.

Screenshot, Gaze output into Gephi, from mock archaeo-data

Eonydis on the other hand takes that same list and if it has time-stamps within it (a column with dates), will create a dynamic network over time. My mock dataset above seems to cause Eonydis to crash – is it my negative numbers? How do you encode dates from the Bronze Age in the day/month/year system? Checking the documentation, I see that I didn’t have proper field labels, so I needed to fix that. Trying again, it still crashed. I fiddled with the dates to remove the range (leaving a column to imply ‘earliest known date for this sort of thing’), which gave me this file.

Which still crashed. Now I have to go do some other stuff, so I’ll leave this here and perhaps one of you can pick up where I’ve left off. The example file that comes with Eonydis works fine, so I guess when I return to this I’ll carefully compare the two. Then the task will be to work out how to visualize dynamic networks in Gephi. Clement has a very good tutorial on this.


Ok, so I kept plugging away at it. I found if I put the dates yyyy-mm-dd, as in 1066-01-23 then Eonydis worked a treat. Here’s the mock data and here’s the gexf.

And here’s the dynamic animation!

Post post script:

I took the mock data (archaeo-test4.csv) and prepended a ‘-’ to the dates, thus -1023-01-01 to represent dates BC. In Eonydis, where it asks for the date format, I tried this:

#yyyy#mm#dd  which accepted the dates, but dropped the negative;

-yyyy#mm#dd, which accepted the dates and also dropped the negative.

Thus, it seems to me that I can still use Eonydis for archaeological data, but I should frame my date column in relative terms rather than absolute, as absolute isn’t really necessary for the network analysis/visualization anyway.
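The relative-dates idea can be sketched in a few lines of Python. This is my own illustration, not part of Eonydis: it shifts a list of BCE years (negative numbers) so that the earliest artefact becomes year 1 of a relative chronology, in the yyyy-mm-dd form that worked above.

```python
# Sketch: re-express BCE years as a relative timeline Eonydis can parse.
# The idea of "years since earliest artefact" is my assumption, not a
# feature of Eonydis itself.

def bce_to_relative(years):
    """Map negative (BCE) year numbers to positive yyyy-mm-dd strings,
    counting years from the earliest date in the dataset."""
    earliest = min(years)
    return [f"{y - earliest + 1:04d}-01-01" for y in years]

# e.g. artefacts dated -1023 and -900 become years 0001 and 0124
# of a relative chronology
```

The absolute calendar is lost, but the intervals between dates (which are what the dynamic network actually animates) are preserved.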

How I Lost the Crowd: A Tale of Sorrow and Hope

Yesterday, my HeritageCrowd project website was annihilated. Gone. Kaput. Destroyed. Joined the choir.

It is a dead parrot.

This is what I think happened, what I now know and need to learn, and what I think the wider digital humanities community needs to think about/teach each other.

HeritageCrowd was (may be again, if I can salvage from the wreckage) a project that tried to encourage the crowdsourcing of local cultural heritage knowledge for a community that does not have particularly good internet access or penetration. It was built on the Ushahidi platform, which allows folks to participate via cell phone text messages. We even had it set up so that a person could leave a voice message and software would automatically transcribe the message and submit it via email. It worked fairly well, and we wrote it up for Writing History in the Digital Age. I was looking forward to working more on it this summer.

Problem #1: Poor record-keeping of the process of getting things installed, and the decisions taken.

Now, originally, we were using the Crowdmap hosted version of Ushahidi, so we wouldn’t have to worry about things like security, updates, servers, that sort of thing. But… I wanted to customize the look, move the blocks around, and make some other cosmetic changes so that Ushahidi’s genesis in crisis-mapping wouldn’t be quite as evident. When you repurpose software meant for one domain to another, it’s the sort of thing you do. So, I set up a new domain, got some server space, downloaded Ushahidi and installed it. The installation tested my server skills. Unlike setting up WordPress or Omeka (which I’ve done several times), Ushahidi requires the concomitant set up of ‘Kohana’. This was not easy. There are many levels of tacit knowledge in computing and especially in web-based applications that I, as an outsider, have not yet learned. It takes a lot of trial and error, and sometimes, just dumb luck. I kept poor records of this period – I was working to a tight deadline, and I wanted to just get the damned thing working. Today, I have no idea what I actually did to get Kohana and Ushahidi playing nice with one another. I think it actually boiled down to file structure.

(It’s funny to think of myself as an outsider, when it comes to all this digital work. I am after all an official, card-carrying ‘digital humanist’. It’s worth remembering what that label actually means. At least one part of it is ‘humanist’. I spent well over a decade learning how to do that part. I’ve only been at the ‘digital’ part since about 2005… and my experience of ‘digital’, at least initially, is in social networks and simulation – things that don’t actually require me to mount materials on the internet. We forget sometimes that there’s more to the digital humanities than building flashy internet-based digital tools. Archaeologists have been using digital methods in their research since the 1960s; Classicists at least that long – and of course Father Busa).

Problem #2: Computers talk to other computers, and persuade them to do things.

I forget where I read it now (it was probably Stephen Ramsay or Geoffrey Rockwell), but digital humanists need to consider artificial intelligence. We do a humanities not just of other humans, but of humans’ creations that engage in their own goal-directed behaviours. As someone who has built a number of agent based models and simulations, I suppose I shouldn’t have forgotten this. But on the internet, there is a whole netherworld of computers corrupting and enslaving each other, for all sorts of purposes.

HeritageCrowd was destroyed so that one computer could persuade another computer to send spam to gullible humans with erectile dysfunction.

It seems that Ushahidi was vulnerable to ‘Cross-site Request Forgery’ and ‘Cross-site Scripting’ attacks. I think what happened to HeritageCrowd was an instance of persistent XSS:

The persistent (or stored) XSS vulnerability is a more devastating variant of a cross-site scripting flaw: it occurs when the data provided by the attacker is saved by the server, and then permanently displayed on “normal” pages returned to other users in the course of regular browsing, without proper HTML escaping.

When I examine every php file on the site, I find all sorts of injected base64 code. So this is what killed my site. Once my site started flooding spam all over the place, the internet’s immune systems (my host’s own, and others) shut it all down. Now, I could just clean everything out, and reinstall, but the more devastating issue: it appears my SQL database is gone. Destroyed. Erased. No longer present. I’ve asked my host to help confirm that, because at this point, I’m way out of my league. Hey all you lone digital humanists: how often does your computing services department help you out in this regard? Find someone at your institution who can handle this kind of thing. We can’t wear every hat. I’ve been a one-man band for so long, I’m a bit like the guy in Shawshank Redemption who asks his boss at the supermarket for permission to go to the bathroom. Old habits are hard to break.

Problem #3: Security Warnings

There are many Ushahidi installations all over the world, and they deal with some pretty sensitive stuff. Security is therefore something Ushahidi takes seriously. I should’ve too. I was not subscribed to the Ushahidi Security Advisories. The hardest pill to swallow is when you know it’s your own damned fault. The warning was there; heed the warnings! Schedule time into every week to keep on top of security. If you’ve got a team, task someone to look after this. I have lots of excuses – it was end of term, things were due, meetings to be held, grades to get in – but it was my responsibility. And I dropped the ball.

Problem #4: Backups

This is the most embarrassing to admit. I did not back things up regularly. I am not ever making that mistake again. Over on Looted Heritage, I have an IFTTT recipe set up that sends every new report to BufferApp, which then tweets it. I’ve also got one that sends every report to Evernote. There are probably more elegant ways to do this. But the worst would be to remind myself to manually download things. That didn’t work the first time. It ain’t gonna work the next.

So what do I do now?

If I can get my database back, I’ll clean everything out and reinstall, and then progress onwards wiser for the experience. If I can’t… well, perhaps that’s the end of HeritageCrowd. It was always an experiment, and as Scott Weingart reminds us,

The best we can do is not as much as we can, but as much as we need. There is a point of diminishing return for data collection; that point at which you can’t measure the coastline fast enough before the tides change it. We as humanists have to become comfortable with incompleteness and imperfection, and trust that in aggregate those data can still tell us something, even if they can’t reveal everything.

The HeritageCrowd project taught me quite a lot about crowdsourcing cultural heritage, about building communities, about the problems, potentials, and perils of data management. Even in its (quite probable) death, I’ve learned some hard lessons. I share them here so that you don’t have to make the same mistakes. Make new ones! Share them! The next time I go to THATCamp, I know what I’ll be proposing. I want a session on the Black Hats, and the dark side of the force. I want to know what the resources are for learning how they work, what I can do to protect myself, and frankly, more about the social and cultural anthropology of their world. Perhaps there is space in the Digital Humanities for that.


When I discovered what had happened, I tweeted about it. Thank you everyone who responded with help and advice. That’s the final lesson I think, about this episode. Don’t be afraid to share your failures, and ask for help. As Bethany wrote some time ago, we’re at that point where we’re building the new ways of knowing for the future, just like the Lunaticks in the 18th century. Embrace your inner Lunatick:

Those 18th-century Lunaticks weren’t about the really big theories and breakthroughs – instead, their heroic work was to codify knowledge, found professional societies and journals, and build all the enabling infrastructure that benefited a succeeding generation of scholars and scientists.


if you agree with me that there’s something remarkable about a generation of trained scholars ready to subsume themselves in collaborative endeavors, to do the grunt work, and to step back from the podium into roles only they can play – that is, to become systems-builders for the humanities — then we might also just pause to appreciate and celebrate, and to use “#alt-ac” as a safe place for people to say, “I’m a Lunatick, too.”

Perhaps my role is to fail gloriously & often, so you don’t have to. I’m ok with that.

Converting 2 mode networks with Multimodal plugin for Gephi

Scott Weingart drew my attention this morning to a new plugin for Gephi by Jaroslav Kuchar that converts multimodal networks to one mode networks.

This plugin allows multimode networks projection. For example: you can project your bipartite (2-mode) graph to monopartite (one-mode) graph. The projection/transformation is based on the matrix multiplication approach and allows different types of transformations. Not only bipartite graphs. The limitation is matrix multiplication – large matrix multiplication takes time and memory.

After some playing around, and some emails & tweets with Scott, we determined that it does not seem to work at the moment for directed graphs. But if you’ve got a bimodal undirected graph, it works very well indeed! It does require some massaging though. I assume you can already download and install the plugin.

1. Make sure your initial csv file with your data has a column called ‘type’. Fill that column with ‘undirected’. The plugin doesn’t work correctly with directed graphs.

2. Then, once your csv file is imported, create a new column on the nodes table, call it ‘node-type’ – here you specify what the thing is. Fill it up accordingly. (cheese, crackers, for instance).

3. I thank Scott for talking me through this step. First, save your network; this next step will irrevocably change your data. Click ‘load attributes’. Under attribute type, select the column you created for step 2. Then, for left matrix, select Cheese – Crackers; for right matrix, select Crackers – Cheese. Hit ‘run’. This gets you a new Cheese-Cheese network (select the inverse to get a crackers – crackers network).  You can then remove any isolates or dangly bits by ticking ‘remove edges’ or ‘remove nodes’ as appropriate.

4. Save your new 1 mode network. Go back to the beginning to create the other 1 mode network.
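If you’re curious what the plugin’s matrix multiplication actually does, here’s a miniature of the same projection in plain Python (the cheese-and-crackers edge list is invented for illustration). Multiplying the 2-mode incidence matrix by its transpose amounts to counting, for each pair of cheeses, how many crackers they share:

```python
from collections import defaultdict

def project(edges):
    """Project a 2-mode edge list (source, target) to a weighted
    source-source network, where edge weight = number of shared targets.
    Equivalent to the incidence-matrix-times-its-transpose trick."""
    by_target = defaultdict(set)
    for src, tgt in edges:
        by_target[tgt].add(src)
    weights = defaultdict(int)
    for sources in by_target.values():
        for a in sources:
            for b in sources:
                if a < b:  # count each unordered pair once
                    weights[(a, b)] += 1
    return dict(weights)

# Invented example data: cheeses connected to the crackers they pair with.
edges = [('brie', 'ritz'), ('cheddar', 'ritz'),
         ('brie', 'triscuit'), ('cheddar', 'triscuit'),
         ('stilton', 'water-biscuit')]
# brie and cheddar share two crackers; stilton shares none with anybody
```

The plugin’s warning about large matrices applies here too: the pairwise loop blows up on targets shared by very many sources.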

Looted Heritage: Infotrapping the Illicit Antiquities Trade

To crowdsource something – whether it is a problem of code, or the need to transcribe historical documents – is generally about fracturing a problem into its component pieces, and allowing an interested public to solve the tiny pieces. In 2011 I and my students embarked on a project to crowdsource sense of place, using Ushahidi to solicit and collect community memories about cultural heritage resources in Pontiac and Renfrew counties in Eastern Canada (The HeritageCrowd Project). As part of that project, I had initially set up a deployment of Ushahidi using their hosted service, called ‘Crowdmap’. It contains all of the functionality of the vanilla Ushahidi, but since they are doing the hosting, it cannot be customized to any great degree. I mothballed that initial deployment, and installed my own on my own server so I could customize. The entire experience was recounted in a contribution to the forthcoming born-digital volume, ‘Writing History in the Digital Age’, edited by Jack Dougherty and Kristen Nawrotzki. One of the findings of that small experiment was about the order of operations in a crowdsourced project:

…in one sense our project’s focus was misplaced. Crowdsourcing should not be a first step. The resources are already out there; why not trawl, crawl, spider, and collect what has already been uploaded to the internet? Once the knowledge is collected, then one could call on the crowd to fill in the gaps. This would perhaps be a better use of time, money, and resources.

This January, I started the second term of my year long first year seminar in digital antiquity, and I decided to re-start my mothballed Crowdmap as part of a module on crowdsourcing. But as I prepared, that one paragraph above kept haunting me. Perhaps what was needed was not so much a module on crowdsourcing, but rather, one on information trapping. I realized that Ushahidi and Crowdmap are better thought of as info-traps. Thus, ‘Looted Heritage: Monitoring the Illicit Antiquities Trade‘ was born. The site monitors various social media and regular media feeds for stories and reports about the trade in antiquities, which can then be mapped, giving a visual depiction of the impact of the trade. I intend to use Zotero public library feeds as well, to enable the mapping of academic bibliography on the trade.

Why illicit antiquities? The Illicit Antiquities Research Centre at Cambridge University (which is, sadly, closed) provides some statistics on the nature and scale of the illicit antiquities trade; these statistics date to 1999; the problem has only grown with the wars and disruptions of the past decade:

  • Italy: 120,000 antiquities seized by police in five years;
  • Italy: 100,000+ Apulian tombs devastated;
  • Niger: in southwest Niger between 50 and 90 per cent of sites have been destroyed by looters;
  • Turkey: more than 560 looters arrested in one year with 10,000 objects in their possession;
  • Cyprus: 60,000 objects looted since 1974;
  • China: catalogues of Sotheby’s sales found in the poor countryside: at least 15,000 sites vandalized, 110,000 illicit cultural objects intercepted in four years;
  • Cambodia: 300 armed bandits surround Angkor Conservation compound, using hand grenades to blow apart the monuments; 93 Buddha heads intercepted in June this year, 342 other objects a month later;
  • Syria: the situation is now so bad a new law has been passed which sends looters to jail for 15 years;
  • Belize: 73 per cent of major sites looted;
  • Guatemala: thieves now so aggressive they even looted from the laboratory at Tikal;
  • Peru: 100,000 tombs looted, half the known sites.

– Brodie and Watson

The idea then is to provide my students with hands-on experience using Crowdmap as a tool, and to foster engagement with the real-world consequences of pot-hunting and tomb-robbing. Crowdmap also allows users to download all of the reports that get created. One may download all reports on Looted Heritage in CSV format at the download page. I will be using this data as part of my teaching about data mining and text analysis, getting the students to run this data through tools like Voyant to spot patterns in the way that the antiquities trade is portrayed in the media.

At one point, I also had feeds from eBay for various kinds of artefacts, Greco-Roman, Egyptian, pre-Columbian, etc, but there is so much volume going through eBay that it completely overwhelmed the other signals. I think dealing with eBay and exploring the scope of the trade there will require different scrapers and quantitative tools, so I’m leaving that problem aside for the time being.

I’m also waiting anxiously for, which was described earlier this week in Profhacker, to allow exports of the information it finds. Right now, you can only consume what it finds within its ecosystem (or through kludgy workarounds). Ideally, I would be able to grab a feed from for one of its traps and bring that directly into Looted Heritage. is more of an active agent than the passive sieve approach that Crowdmap takes, so combining the two ought to be a powerful approach. From Profhacker:

[ is] a web service that combines machine learning algorithms with user-selected topics and filters. (The algorithms used in this project stem from the same research that led to Apple’s Siri.) After creating an account, you create a “trap” by entering in a keyword or short phrase into the Discovery box. Once you save your trap, you personalize it by clicking thumbs up or thumbs down on a number of articles in your trap. The more articles you rate, the closer attuned the trap becomes to the kinds of material you want to read.

Finally, one of the other aspects we learned in the HeritageCrowd project was the importance of outreach. For Looted Heritage, I am using a combination of If This Then That to monitor Looted Heritage’s feed, sending new reports to Buffer App, which then sends to Twitter, @looted_heritage. Aside from some problems with duplicated tweets, I’ve been quite happy with this setup (and thanks to Terry Brock for suggesting this). There’s an interesting possibility of circularity there, with Crowdmap picking up @looted_heritage in its traps, and then sending them out again… but my students should spot that if it occurs.

Ushahidi also provides an iOS app that can be deployed in the field, so that an archaeologist who discovers the work of tombaroli could take geo-tagged photographs and submit them with a click to the Looted Heritage installation, drawing police attention.

So far, the response to this project has been good, with 20 people now following the twitter account since I set it up last week (and which I haven’t promoted very actively). Please feel free to contribute reports to Looted Heritage via its submission page, or by tagging your tweets with #looted #antiquities.

Reading ‘Writing History in the Digital Age’ at a Distance

Topics by authors in Writing History in the Digital Age

I and my students have made some contributions to ‘Writing History in the Digital Age’, the born-digital volume edited by Jack Dougherty and Kristen Nawrotzki. Rather than reflect on the writing process, I thought I’d topic model the volume to see what patterns emerged in the contributions.

I use Mallet to do this. I’ve posted earlier about how to get Mallet running. I used Outwit Hub to scrape each individual paragraph from each paper (> 700 paragraphs) into a CSV file (I did not scrape block quotes, so my paragraph numbers are slightly out of sync with those used on the Writing History website). I used the Textme Excel macro (google it; it lives in multiple versions and requires a bit of modification to work exactly the way you want it to) to save each paragraph into its own unique text file, which I then load into Mallet.
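If you’d rather skip the Excel macro, the same one-file-per-paragraph step can be done in a few lines of Python. The function and file names here are my own; adjust the column index to match your own scrape:

```python
import csv
import os

def split_csv(csv_path, out_dir, text_column=0):
    """Write each row's text to its own numbered .txt file,
    ready for MALLET's import-dir."""
    os.makedirs(out_dir, exist_ok=True)
    with open(csv_path, newline='', encoding='utf-8') as f:
        for i, row in enumerate(csv.reader(f), start=1):
            out_path = os.path.join(out_dir, f'para-{i:04d}.txt')
            with open(out_path, 'w', encoding='utf-8') as out:
                out.write(row[text_column])
```

Point MALLET’s import at the resulting folder and each paragraph becomes one document in the model.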

Phew. Now, the tricky part with Mallet is deciding how many topics you want it to look for. Finding the *right* number of topics requires a bit of iteration – start with say 10. Look at the resulting composition of files to topics. If an inordinate number of files all fall into one topic, you don’t have enough granularity yet.

As an initial read, I went with 15 topics. One topic – which I’ll label ‘working with data’ – had quite a large number of files (composition document) (remember, the individual paragraphs from the papers). Ideally, I would re-run the analysis with a greater number of topics, so that the ‘working with data’ topic would get broken up.

I also graphed the results, so that each author is linked to the topics which compose his or her paper; the thickness of the line indicates multiple paragraphs with that topic. I have also graphed topics by individual paragraphs, but the granularity isn’t ideal making the resulting visual not all that useful. The colours correspond with the ‘modularity’ of the graph, that is, communities of similar patterns of connections. The size of the node represents ‘betweeness’ on all paths between every pair of nodes.

So what does it all mean? At the level of paragraph-by-topic, if we had the correct level of granularity, one might be able to read the entire volume by treating the graph as a guide to hyperlinking from paragraph to paragraph, perhaps – a machine generated map/index of the internal structure of ideas. At the level of individual authors, it perhaps suggests papers to read together and the organizing themes of the volume.

This is of course a quick and dirty visualization and analysis, and my initial impressions. More time and consideration, greater granularity, is to be desired.

Topic or Author, Community
Crowdsourcing 0
Students’ Learning 0
graham 0
grahammassiefeuerherm 0
sikarskie 0
Working with Data 1
Video 1
faltesek 1
Games 1
noonan 1
poe 1
zucconietal 1
castaneda 2
Activism, Protests 2
African Americans and the South 2
Primary Resources, Teaching, and Libraries 2
haber 2
judkins 2
madsen-brooks 2
sklardubin 2
tomasek 2
wolff 2
Blogging and Peer Interactions 3
Monitoring Wikipedia 3
introduction 3
jarret 3
lawrence 3
saxtonetal 3
seligman 3
bauer 4
Keywords and Search 4
cummings 4
Japan and History 4
Writing Process 4
dorn 4
Space and Geography 4
erikson 4
harbisonwaltzer 4
petrzelamanekin 4
roberston 4
tanaka 4
gibbsowens 5
Visualization 5
theibault 5

Getting Started with MALLET and Topic Modeling

UPDATE! September 19th 2012: Scott Weingart, Ian Milligan, and I have written an expanded ‘how to get started with Topic Modeling and MALLET’ for the Programming Historian 2. Please do consult that piece for detailed step-by-step instructions for getting the software installed, getting your data into it, and thinking through what the results might mean.

Original Post that Inspired It All:

I’m very interested in topic modeling at the moment. It has not been easy however to get started – I owe a debt of thanks to Rob Nelson for helping me to get going. In the interests of giving other folks a boost, of paying it forward, I’ll share my recipe. I’m also doing this for the benefit of some of my students. Let’s get cracking!

First, some background reading:

  1. Clay Templeton, “Topic Modeling in the Humanities: An Overview | Maryland Institute for Technology in the Humanities”, n.d.,
  2. Rob Nelson, Mining the Dispatch
  3. Cameron Blevins, “Topic Modeling Martha Ballard’s Diary” Historying, April 1, 2010,
  4. David J Newman and Sharon Block, “Probabilistic topic decomposition of an eighteenth‐century American newspaper,” Journal of the American Society for Information Science and Technology 57, no. 6 (April 1, 2006): 753-767.
  5. David Blei, Andrew Ng, and Michael Jordan, “Latent dirichlet allocation,” The Journal of Machine Learning Research 3 (2003),

Now you’ll need the software. Go to the MALLET project page, and download Mallet. (Mallet was developed by Andrew McCallum at U Massachusetts, Amherst).

Then, you’ll need the Java Developer’s Kit (JDK) – nb, not the regular Java runtime that’s on every computer, but the version that lets you develop Java programs. Install this.

Unzip MALLET into your C:\ directory. This is important: for these instructions to work, it can’t be anywhere else. You’ll then have a folder called C:\mallet-2.0.6 or similar.

Next, you’ll need to create an environment variable called MALLET_HOME. You do this by clicking on control panel >> system >> advanced system settings (in Windows 7; for XP, see this article), then ‘environment variables’. In the pop-up, click ‘new’ and type MALLET_HOME in the variable name box; type C:\mallet-2.0.6 (ie, the exact location where you unzipped MALLET) in the variable value box.

To run MALLET, click on your start menu >> all programs >> accessories >> command prompt. You’ll get the command prompt window, which will have a cursor at c:\user\user> (or similar). Type cd .. (two periods; that ain’t a typo) to go up a level; keep doing this until you’re at C:\. Then type cd mallet-2.0.6 and you’re in the MALLET directory. You can now type MALLET commands directly. If you type bin\mallet at this point, you should be presented with a list of MALLET commands – congratulations!

At this point, you’ll want some data. Using Windows Explorer, I create a folder within the MALLET directory where I put all of the data I want to study (let’s call it ‘data’). If I were to study someone’s diary, I’d create a unique text file for each entry, naming the text file with the entry’s date. Then, following the topic modeling instructions on the MALLET page, I’d import that folder and see what happens next. I’ve got some workflow for scraping data from websites and other repositories, but I’ll leave that for another day (or skip ahead to The Programming Historian for one way of going about it).
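If your diary (or other source) arrives as one big text file, the one-file-per-entry step can be scripted. Here’s a minimal Python sketch; the date format and the `split_diary` / `_write_entry` names are my own assumptions – adjust the regular expression to match however your source marks each entry.

```python
import os
import re

def split_diary(diary_path, out_dir):
    """Split a plain-text diary into one file per entry.

    Assumes each entry begins with a date on its own line, like
    '1785-01-17'; adjust the pattern for your own source.
    """
    os.makedirs(out_dir, exist_ok=True)
    date_line = re.compile(r"^(\d{4}-\d{2}-\d{2})\s*$")
    current_date, lines = None, []
    with open(diary_path, encoding="utf-8") as f:
        for line in f:
            m = date_line.match(line)
            if m:
                # A new date line closes the previous entry.
                if current_date:
                    _write_entry(out_dir, current_date, lines)
                current_date, lines = m.group(1), []
            elif current_date:
                lines.append(line)
    if current_date:
        _write_entry(out_dir, current_date, lines)

def _write_entry(out_dir, date, lines):
    # One .txt per entry, named by date, ready for MALLET's import-dir.
    with open(os.path.join(out_dir, date + ".txt"), "w", encoding="utf-8") as f:
        f.writelines(lines)
```

The resulting folder of dated .txt files is exactly what MALLET’s import-dir command expects.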

Once you’ve imported your documents, Mallet creates a single ‘mallet’ file that you then manipulate to determine topics.

bin\mallet import-dir --input data\johndoediary --output johndoediary.mallet --keep-sequence --remove-stopwords

(modified from the Mallet topic modeling page)

This command tells MALLET to import a directory called ‘johndoediary’ (which contains a sequence of txt files) located in the subfolder ‘data’. It then outputs that data into a single file we’re calling ‘johndoediary.mallet’. The --remove-stopwords flag strips out common words like ‘and’, ‘of’, ‘the’, and so on.

Then we’re ready to find some topics:

bin\mallet train-topics --input johndoediary.mallet --num-topics 100 --output-state topic-state.gz --output-topic-keys johndoediary_keys.txt --output-doc-topics johndoediary_composition.txt

(modified from the Mallet topic modeling page)
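The keys file that train-topics produces is plain text you can inspect in any editor, but it’s also easy to pull apart programmatically. Here’s a small Python sketch, assuming the usual MALLET layout of tab-separated lines (topic number, Dirichlet parameter, then the topic’s top words); the function name and example filename are mine, not MALLET’s.

```python
def read_topic_keys(path):
    """Parse a MALLET --output-topic-keys file.

    Each line is tab-separated: topic number, Dirichlet parameter,
    then that topic's most probable words separated by spaces.
    """
    topics = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            num, weight, words = line.rstrip("\n").split("\t", 2)
            topics[int(num)] = {"weight": float(weight),
                                "words": words.split()}
    return topics

# Example: print each topic's ten most probable words
# for topic_id, topic in read_topic_keys("johndoediary_keys.txt").items():
#     print(topic_id, " ".join(topic["words"][:10]))
```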

Now, there are more complicated things you can do with this – take a look at the documentation on the MALLET page. Is there a ‘natural’ number of topics? I do not know. What I have found is that I have to run train-topics with varying numbers of topics to see how the composition file breaks down. If the majority of my original texts end up concentrated in a very small number of topics, then my settings were too coarse and I need to increase the number of topics.
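That check – how many documents pile into each topic – can itself be scripted rather than eyeballed. A minimal Python sketch, assuming MALLET 2.0.6’s doc-topics layout (doc number, source name, then alternating topic/proportion pairs sorted strongest-first; later versions emit one column per topic instead). The function name and filename are illustrative, not MALLET’s own.

```python
from collections import Counter

def dominant_topics(composition_path):
    """Tally each document's single strongest topic from a MALLET
    --output-doc-topics file (doc number, source name, then
    alternating topic/proportion pairs, strongest pair first)."""
    counts = Counter()
    with open(composition_path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue  # skip the header line and blanks
            fields = line.split()
            # fields[2] is the highest-proportion topic for this document
            counts[int(fields[2])] += 1
    return counts

# If a handful of topics dominate most documents, re-run train-topics
# with a larger --num-topics:
# print(dominant_topics("johndoediary_composition.txt").most_common(5))
```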

More on interpreting the output of Mallet to follow.

Again, I owe an enormous debt of gratitude to Rob Nelson for talking me through the intricacies of getting Mallet to work, and for the record, I think the work he is doing is tremendously important and fascinating!

Google Goggles: Augmented Reality

[Image: Google Goggles translating on the fly]

Time was, if you wanted some augmented reality, you had to upload your own points of interest into something like Wikitude or Layar. However, in its quest for world domination, Google seems to be working on something that will render those services moot: Google Goggles (silly name, profound implications).

As Leonard Low says on the MLearning Blog:

The official Google site for the project (which is still in development) provides a number of ways Goggles can be used to accomplish a “visual search”, including landmarks, books, contact information, artwork, places, logos, and even wine labels (which I anticipate could go much further, to cover product packaging more broadly).

So why is this a significant development for m-learning? Because this innovation will enable learners to “explore” the physical world without assuming any prior knowledge. If you know absolutely nothing about an object, Goggles will provide you with a start. Here’s an example: you’re studying industrial design, and you happen to spot a rather nicely-designed chair. However, there’s no information on the chair about who designed it. How do you find out some information about the chair, which you’d like to note as an influence in your own designs? A textual search is useless, but a visual search would allow you to take a photo of the chair and let Google’s servers offer some suggestions about who might have manufactured, designed, or sold it. Ditto unusual insects, species of tree, graphic designs, sculptures, or whatever you might happen to be interested in learning.

Just watch this space. I think Google Goggles is going to rock m-learning…

Now imagine this in action at an archaeological site, where Google connects you with something less than what we as archaeological professionals would like to see. Say it’s an aboriginal site with profound cultural significance – but the site Goggles connects you with argues for the opposite. Another argument for archaeologists and historians to ‘create signal’ and to tell Google what’s important.

See the video: