Getting Started with MALLET and Topic Modeling

UPDATE! September 19th 2012: Scott Weingart, Ian Milligan, and I have written an expanded ‘how to get started with Topic Modeling and MALLET’ for the Programming Historian 2. Please do consult that piece for detailed step-by-step instructions for getting the software installed, getting your data into it, and thinking through what the results might mean.

Original Post that Inspired It All:

I’m very interested in topic modeling at the moment. It has not been easy, however, to get started – I owe a debt of thanks to Rob Nelson for helping me get going. In the interests of giving other folks a boost, of paying it forward, I’ll share my recipe. I’m also doing this for the benefit of some of my students. Let’s get cracking!

First, some background reading:

  1. Clay Templeton, “Topic Modeling in the Humanities: An Overview,” Maryland Institute for Technology in the Humanities, n.d., http://mith.umd.edu/topic-modeling-in-the-humanities-an-overview/.
  2. Rob Nelson, Mining the Dispatch, http://dsl.richmond.edu/dispatch/.
  3. Cameron Blevins, “Topic Modeling Martha Ballard’s Diary,” Historying, April 1, 2010, http://historying.org/2010/04/01/topic-modeling-martha-ballards-diary/.
  4. David J. Newman and Sharon Block, “Probabilistic topic decomposition of an eighteenth-century American newspaper,” Journal of the American Society for Information Science and Technology 57, no. 6 (April 1, 2006): 753-767.
  5. David Blei, Andrew Ng, and Michael Jordan, “Latent Dirichlet Allocation,” The Journal of Machine Learning Research 3 (2003), http://dl.acm.org/citation.cfm?id=944937.

Now you’ll need the software. Go to the MALLET project page, and download MALLET. (MALLET was developed by Andrew McCallum at the University of Massachusetts, Amherst.)

Then, you’ll need the Java Development Kit (JDK) – NB, not the regular Java runtime that’s on every computer, but the version that lets you develop and compile Java programs. Install this.

Unzip MALLET into your C:\ directory. This is important; the instructions below, and the MALLET_HOME variable you are about to set, assume it lives there. You’ll then have a folder called C:\mallet-2.0.6 or similar.

Next, you’ll need to create an environment variable called MALLET_HOME. You do this by clicking on control panel >> system >> advanced system settings >> ‘environment variables’ (in Windows 7; for XP, see this article). In the pop-up, click ‘new’ and type MALLET_HOME in the variable name box; in the variable value box, type c:\mallet-2.0.6 (ie, the exact location where you unzipped MALLET).

To run MALLET, click on your start menu >> all programs >> accessories >> command prompt. You’ll get the command prompt window, which will have a cursor at c:\user\user> (or similar). Type cd .. (two periods; that ain’t a typo) to go up a level; keep doing this until you’re at C:\. Then type cd mallet-2.0.6 and you’re in the MALLET directory. You can now type MALLET commands directly. If you type bin\mallet at this point, you should be presented with a list of MALLET commands – congratulations!

At this point, you’ll want some data. Using the regular Windows Explorer, I create a folder within the MALLET folder where I put all of the data I want to study (let’s call it ‘data’). If I were to study someone’s diary, I’d create a unique text file for each entry, naming the text file with the entry’s date. Then, following the topic modeling instructions on the MALLET page, I’d import that folder, and see what happens next. I’ve got a workflow for scraping data from websites and other repositories, but I’ll leave that for another day (or skip ahead to The Programming Historian for one way of going about it).

Once you’ve imported your documents, Mallet creates a single ‘mallet’ file that you then manipulate to determine topics.

bin\mallet import-dir --input data\johndoediary --output johndoediary.mallet --keep-sequence --remove-stopwords

(modified from the MALLET topic modeling page; type it all on one line)

This command tells MALLET to import a directory called ‘johndoediary’ located in the subfolder ‘data’ (which contains a sequence of txt files). It then outputs that data into a single file we’re calling ‘johndoediary.mallet’. The --keep-sequence option preserves the original word order, which the topic trainer needs; --remove-stopwords strips out ‘and’, ‘of’, ‘the’, etc.

Then we’re ready to find some topics:

bin\mallet train-topics --input johndoediary.mallet --num-topics 100 --output-state topic-state.gz --output-topic-keys johndoediary_keys.txt --output-doc-topics johndoediary_composition.txt

(modified from the MALLET topic modeling page; again, all on one line)

Now, there are more complicated things you can do with this – take a look at the documentation on the MALLET page. Is there a ‘natural’ number of topics? I do not know. What I have found is that I have to run train-topics with varying numbers of topics to see how the composition file breaks down. If I end up with the majority of my original texts all in a very limited number of topics, then I need to increase the number of topics; my settings were too coarse.
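If you want to check that quickly, here is a minimal Python sketch (my own, not from the MALLET documentation) that counts how often each topic turns up as a document’s strongest topic. It assumes the tab-separated doc-topics format of mallet-2.0.6, where each row lists a document index, the file name, and then topic/proportion pairs sorted from strongest to weakest:

from collections import Counter

top_topics = Counter()
with open('johndoediary_composition.txt', encoding='utf-8') as f:
    for line in f:
        if line.startswith('#'):  # skip the header comment line
            continue
        fields = line.strip().split('\t')
        top_topics[fields[2]] += 1  # the first topic listed is the strongest

# If a handful of topics dominate, re-run train-topics with more topics.
for topic, n_docs in top_topics.most_common(10):
    print('topic', topic, 'is the top topic for', n_docs, 'documents')

If one or two topics claim most of the documents, that is the ‘too coarse’ signal described above.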

More on interpreting the output of Mallet to follow.

Again, I owe an enormous debt of gratitude to Rob Nelson for talking me through the intricacies of getting Mallet to work, and for the record, I think the work he is doing is tremendously important and fascinating!

Google Goggles: Augmented Reality

[Image: Google Goggles translating on the fly]

Time was, if you wanted some augmented reality, you had to upload your own points of interest into something like Wikitude or Layar. However, in its quest for world domination, Google seems to be working on something that will render those services moot: Google Goggles (silly name, profound implications).

As Leonard Low says on the MLearning Blog:

The official Google site for the project (which is still in development) provides a number of ways Goggles can be used to accomplish a “visual search”, including landmarks, books, contact information, artwork, places, logos, and even wine labels (which I anticipate could go much further, to cover product packaging more broadly).

So why is this a significant development for m-learning? Because this innovation will enable learners to “explore” the physical world without assuming any prior knowledge. If you know absolutely nothing about an object, Goggles will provide you with a start. Here’s an example: you’re studying industrial design, and you happen to spot a rather nicely-designed chair. However, there’s no information on the chair about who designed it. How do you find out some information about the chair, which you’d like to note as an influence in your own designs? A textual search is useless, but a visual search would allow you to take a photo of the chair and let Google’s servers offer some suggestions about who might have manufactured, designed, or sold it. Ditto unusual insects, species of tree, graphic designs, sculptures, or whatever you might happen to be interested in learning.

Just watch this space. I think Google Goggles is going to rock m-learning…

Now imagine this in action with an archaeological site, where Google connects you with something less than what we as archaeological professionals would like to see. Say it was some sort of Aboriginal site with profound cultural significance – but the site Google connects you with argues for the opposite. Another argument for archaeologists and historians to ‘create signal’ and to tell Google what’s important.

See the video:

It would’ve been nice

It would’ve been nice if the IT folks at U Manitoba had given me some warning that they were about to close my account. I’m no longer going to be teaching for them in the fall, it is true; but a lot of my stuff – not to mention my agent models – is on their servers.

My own fault, I guess – I should’ve cleaned everything off of there when I decided to decline the fall courses, but still, it would’ve been nice to have had some warning.

So, if you’re looking for me @umanitoba.ca, that doesn’t work any more. I’ll be getting some new contact info before too much longer, and hopefully, some space for my simulations, too.

The Streets of London – iPhone app

Have you seen the old man
In the closed-down market
Kicking up the paper,
with his worn out shoes?
In his eyes you see no pride
And held loosely at his side
Yesterday’s paper telling yesterday’s news

So how can you tell me you’re lonely,
And say for you that the sun don’t shine?
Let me take you by the hand and lead you through the streets of London
I’ll show you something to make you change your mind

Ralph McTell

The Museum of London – always at the forefront in museology & archaeology – has released an iPhone app that transforms your experience of the streets of London into an augmented reality.  If only:

a) I had an iPhone and

b) I was in London.

I look forward to seeing more of these sorts of things emerge. Imagine – mashing the physical, the digital, the past, and the present all at once. Landscape archaeology as palimpsest is a fairly standard idea, but these sorts of applications should help the notion catch on more popularly [he said, hopefully…]

Manifesto for Digital Humanities, from THATCamp Paris

It’s THATCamp season, and while I can’t participate, I am following the twitter stream #thatcamp.

From THATCamp Paris, a manifesto for Digital Humanities (I translate from the French below, with a wee bit of a kickstart from Google Translate; I do not guarantee that this is a perfect or most accurate translation):

Manifesto for the Digital Humanities

Context

We, practitioners and observers of the digital humanities, met in Paris at THATCamp on May 18 and 19, 2010.

During these two days, we discussed, exchanged, and reflected together on what the digital humanities are, and we tried to imagine and invent what they might become.

At the end of these two days, which are only one step, we propose to research communities, and to all those involved in the creation, editing, enhancement, or preservation of knowledge, a manifesto for the “digital humanities”.

I. Definition

1. The computational turn taken by society changes and interrogates the conditions of production and dissemination of knowledge.

2. For us, the digital humanities concern the whole of the social sciences, arts, and letters. The digital humanities are not a clean slate. They rely instead on all the paradigms, skills, and knowledge specific to these disciplines, while leveraging the tools and unique perspectives of the digital field.

3. The digital humanities designate a ‘transdiscipline’, embodying the methods, devices and heuristics related to digital opportunities in the field of humanities and social sciences.

II. Situation

4. We note:

– that there has been increasing experimentation in the digital realm in the humanities and social sciences over the last half-century, and that what has emerged more recently – digital humanities centers – are, at present, prototypes or specific areas of application of a digital humanities approach;

– that computational or digital approaches impose a stronger technical constraint, and thus an economic one, and that this constraint is an opportunity to change how we work collectively;

– that there are a number of proven methods, unequally known and unequally shared;

– that there are multiple communities grounded in particular practices, tools, or interdisciplinary approaches (encoding of textual sources, geographic information systems, lexicometry, digitization of cultural, scientific, and technical heritage, web mapping, data mining, 3D, oral archives, digital arts and hypermedia literatures, etc.), and that these communities are converging to form the field of the “digital humanities”.

III. Statement

5. We, the practitioners of the digital humanities, are building a community of practice that is open, welcoming, and freely accessible.

6. We are a community without borders. We are a multilingual community and we are multidisciplinary.

7. Our aims are the advancement of knowledge, the enhancement of the quality of research in our disciplines, and the enrichment of knowledge not just within but also beyond the academic sphere.

8. We call for the integration of digital culture in the definition of the general culture of the twenty-first century.

IV. Guidelines

9. We call for open access to data and metadata. These must be documented and interoperable, both technically and conceptually.

10. We support the dissemination, free circulation, and free enhancement of methods, code, formats, and research results.

11. We call for the integration of digital humanities training within curricula in the social sciences, arts, and letters. We also want the creation of specialist diplomas in the digital humanities and the development of dedicated professional training. Finally, we hope that these skills will be taken into account in recruitment and career development.

12. We are committed to building a collective competency based on a common vocabulary, one that arises from the collective work of all practitioners. This collective expertise is to become a common good. It is a scientific opportunity, but also an opportunity for professional development in all sectors.

13. We want to participate in defining and disseminating best practices corresponding to identified disciplinary and interdisciplinary needs, as these emerge from debate and consensus among the communities concerned. The fundamental openness of the digital humanities nevertheless guarantees a pragmatic approach to protocols and visions, one that maintains the right of different and competing approaches to coexist, to the benefit of the enrichment of thinking and practice.

14. We call for the construction of scalable cyberinfrastructures responding to real needs. These cyberinfrastructures should be built iteratively, based on methods and approaches that have proven themselves in the research communities.

(errors of translation are my own)

How much would it cost to digitize all the UK’s archaeological grey literature?

I am a member of the Working Group on Open Archaeology. Recently, in the discussion, Anthony Beck linked to a presentation of his called ‘Dig the new breed: how open approaches can empower archaeologists’:

In one of his slides, he mentions Richard Bradley, from my alma mater, the University of Reading, and how Richard used the grey literature from various commercial bodies to write his history of Bronze Age Britain. He links to this article. As I was reading this, it occurred to me that here is a perfect opportunity for crowdsourcing… perhaps.

What would it cost to digitize all of the UK’s grey literature? Here are the plans for a $20 DIY book scanner which uses a basic point-and-shoot digital camera.  And here is an open source optical character recognition package from the good people at Google.
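As a rough sense of how little software stands in the way, here is a hypothetical Python sketch of the OCR step (my own assumption of a workflow, not anything from the linked projects). It supposes that the open source Tesseract engine is installed along with the pytesseract and Pillow packages, and that the photographed pages sit in a folder called ‘scans’:

from pathlib import Path

from PIL import Image
import pytesseract

# OCR each photographed page and save the text alongside the image.
for page in sorted(Path('scans').glob('*.jpg')):
    text = pytesseract.image_to_string(Image.open(page))
    page.with_suffix('.txt').write_text(text, encoding='utf-8')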

So only two hurdles remain: getting access to the grey literature, and the manpower to do this (hence the crowdsourcing). It would be interesting perhaps for a PhD student to try this out at their local archaeological consultancy, and then perhaps use some data mining techniques (like in this example) to quickly begin to extract useful information – see the sketch below for one trivial possibility.
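For instance (a hypothetical example of my own; the pattern and folder name are inventions, not from the linked article), once the reports are OCR’d to text, even crude pattern matching starts pulling out data, such as radiocarbon dates of the form ‘1234±56 BP’:

import re
from pathlib import Path

date_pattern = re.compile(r'\b(\d{3,5})\s*[±+/-]+\s*(\d{1,3})\s*BP\b')

# Scan every OCR'd report for radiocarbon dates.
for report in Path('scans').glob('*.txt'):
    for value, error in date_pattern.findall(report.read_text(encoding='utf-8')):
        print(report.name + ': ' + value + '±' + error + ' BP')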

The technology is there… let’s make it work!

Zotero Maps: Visualizing Archaeology?

You can now map your Zotero Library:

Potential Use Cases:
Map Your Collection By Key Places:
Many records from library catalogs and journal databases come pre-loaded with geographic keywords. Zotero Maps lets you quickly see the relationships between the terms catalogers, authors, and publishers have assigned to the items in your collection. Similarly, as you apply your own geographic tags to items you can then explore those geographic relationships. Whether you’re looking at key locations in studies of avian flu, ethnographic work in the American southwest, or the history of the transatlantic slave trade, the tags associated with your items provide valuable geographic information.

Map Places of Publication:
In many cases places of publication include crucial information about your items. If you’re working on a project involving the history of the book, how different media outlets cover an issue, or how different journals present distinct scientific points of view, the places in which those items are published can provide valuable insight.

In 2007, I was trying something along these lines using Platial (now deceased). Now – since you can add objects from things like Opencontext.org into your Zotero library, and describe these using tags – you could begin to build a map not only of ‘things’ but also of the relevant reports etc., all from your browser, without doing any of the fancy coding stuff…

From my library:

More on VUE

With Eric’s link, I think I might be able to make some headway on getting Opencontext.org materials into VUE… in the meantime, I thought it might behoove me to start on something more straightforward. So I took Tom Elliott’s Maia Atlantis feed, to see what’d happen… and I think it might be useful for working out the links of our own little corner of the blogosphere. With a dose of network analysis (VUE generates connectivity matrices), it should be possible to figure out who the key players are, and other implications for information flow. Hmm!

(I also ran the feed for this blog through it, and discovered just how awful my tagging/categories really are. Great big blocks of unconnected posts – the categories are the links – so I should really try to rationalize all that).
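For the curious, here is a minimal sketch of that key-player idea in Python (my own, not a VUE feature; the file name and its comma-separated layout are assumptions about how an exported connectivity matrix might look):

import networkx as nx
import numpy as np

# Rows and columns in the same node order; 1 marks a link between two blogs.
matrix = np.loadtxt('maia_atlantis_matrix.csv', delimiter=',')
graph = nx.from_numpy_array(matrix)

# Betweenness centrality highlights the brokers information must flow through.
ranked = sorted(nx.betweenness_centrality(graph).items(), key=lambda item: -item[1])
for node, score in ranked[:5]:
    print('node', node, 'betweenness', round(score, 3))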

A similar idea – well, standard social network analysis – is being done using VUE with regard to the WWI Poets:

Stuart Lee from The First World War Poetry Digital Archive is using VUE to draw out relationships between poets covered in the archive. From his post at the World War One Literature Google group:

What I have done, therefore, is take a preliminary stab at showing – in a mind-map – the relationships between the poets we have concentrated on in the project (or will be) and show how they might have known each other, etc. By no means is this complete, but it begins to show poets who were clearly at the centre of things (Sassoon, Thomas, Graves, and eventually Owen) and those who were on the periphery (Leighton, Jones, Brittain).

Check out the map he created:

See the VUE blog, and the original post.

VUE + OpenContext.org: quickly visualizing relationships in data

Archaeology is about context, about understanding relationships, about looking at the spaces between datapoints, as much as it is about the points themselves.

I’ve been experimenting with VUE, mostly as a way of organizing my Zotero libraries, and to help in the planning of a digital history course I really would like to teach next year. So far, it’s been great. But this morning, while watching some of the how-to videos, I started thinking about how VUE could be used to represent archaeological data.

Here’s the video:

So: import data from an RSS feed, along with its metadata… Hmmm. I tried it with the Atom feed from OpenContext.org for the Presidio of San Francisco project.
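For anyone who wants to poke at the same feed outside VUE, here is a rough Python sketch of reading an Atom feed’s items and metadata with the feedparser package; the URL below is a placeholder, not the real Presidio feed address:

import feedparser

feed = feedparser.parse('https://opencontext.org/feeds/example.atom')
for entry in feed.entries[:10]:
    # Each entry carries a title, link, and timestamps you could map or graph.
    print(entry.title, entry.link, entry.get('updated', ''))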

I’ve goofed it a bit – missed an important step – but I think there’s real potential here… further bulletins as events warrant; I’ve got a couple of training sessions to lead, so I’m off to class.

Theorizing Digital Archaeology

What does ‘digital archaeology’ or ‘digital humanities’ or ‘digital history’ actually mean?

Bill Caraher is giving it a stab; sounds like it’ll be a fascinating talk. I hope there’s a video:

[…] In particular, I am thinking about articulating the notion of digital workflow and its implications in my own archaeological research. By digital workflow, I mean the use of digital technologies across the entire range of archaeological procedures, from pre-season planning, to data collection in the field, to the dissemination of our results across multiple platforms for diverse audiences. I like to imagine that our deep dependence on digital data and applications shaped not only how we approached historical and archaeological problems but also how we understood the results of our research and imagined the process of scholarly critique as well as pedagogical [practice]. This is, in part, a response to the view of digital technology as merely a tool that scholars and teachers deploy in the ongoing search for truth, rather than an “active” participant in the process of determining what truths are significant, knowable, and even imaginable within a particular academic discourse.

[…]

Software Turns that Cheap Camera into a 3d Scanner

Now: can you think of some archaeological applications? 🙂

See this post in Wired.

It’s called ProFORMA, or Probabilistic Feature-based On-line Rapid Model Acquisition, but it is way cooler than it sounds. The software, written by a team headed by Qi Pan, a student at the Department of Engineering at Cambridge University in England, turns a regular, cheap webcam into a 3D scanner. Normally, scanning in 3D requires purpose-made gear and time. ProFORMA lets you rotate any object in front of the camera and it scans it in real time, building a fully 3D texture-mapped model as fast as you can turn an object. Even more impressive is what happens after the scan: the camera continues to track the object in space and matches its movement instantly with the on-screen model.

I haven’t found a website for this software yet, and I have no idea when/if it is available, but let’s hope it is soon. Should be a boon to those folks who are creating immersive archaeological simulations of real sites & artefacts (Colleen?)

edit: the website address turns up in the last few seconds of the video, at 3:16: http://mi.eng.cam.ac.uk/~qp202