Play along at home with #hist3812a

In my video games and history class, I assign one or two major pieces each week that I want everyone to read. Each week, a subset of the class has to attempt a ‘challenge’, which involves reading a bit more, reflecting, and devising a way of making their argument – a procedural rhetoric – via a game engine (in this case, Twine). Later on, they’ll be building in Minecraft. Right now, we have nearly 50 students enrolled.

If you’re interested in following along at home, here are the first few challenges. These are the actual prompts cut-n-pasted out of our LMS. Give ‘em a try if you’d like, upload to philome.la, and let us know! Ours will be at hist3812a.dhcworks.ca

I haven’t done this before, so it’ll be interesting to see what happens next.

Introduction to #hist3812a

Challenge #1

Read:

  1. Fogu, Claudio. ‘Digitalizing Historical Consciousness’, History and Theory 2, 2009.
  2. Tufekci, Zeynep. ‘What Happens to #Ferguson Affects Ferguson: Net Neutrality, Algorithmic Filtering and Ferguson.’ Medium, August 14, 2014.

Craft:

A basic Twine that highlights the ways the two articles are connected.

Share:

Put your Twine build (the *html file) into the ‘public’ folder in your Dropbox account (if you don’t have a public folder, just right-click and select public link – see this help file). Share the link on our course blog:

  1. Create a new post.
  2. Hit the ‘html’ button.
  3. type:
  4. Preview your post to make sure it loads your Twine.

Play:

Explore others’ Twines and be ready to discuss this process and these readings in Tuesday’s class.

A history of games, and of video games

Challenge #2

Read & Watch:

Antecedents (read the intros):

Shannon, C. ‘A Mathematical Theory of Communication.’ Reprinted with corrections from The Bell System Technical Journal, Vol. 27, pp. 379–423, 623–656, July, October 1948. http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf

Turing, Alan Mathison. “On computable numbers, with an application to the Entscheidungsproblem.” Proceedings of the London Mathematical Society, Series 2, Vol. 42 (1936–37), pp. 230–265. http://www.cs.virginia.edu/~robins/Turing_Paper_1936.pdf

Cold War (watch this entire lecture): https://www.youtube.com/watch?v=_otw7hWq58A

1980s:

Dillon, Roberto. The Golden Age of Video Games: The Birth of a Multi-Billion Dollar Industry. CRC Press, 2011.

Christiansen, Peter ‘Dwarf Norad: A Glimpse of Counterfactual Computing History’ Play the Past August 6 2014 http://www.playthepast.org/?p=4892

Craft:

A Twine that imagines what an ENIAC developed to serve the needs of historians might have looked like – i.e., explore Christiansen’s argument.

Share:

Put your Twine build (the *html file) into the ‘public’ folder in your Dropbox account. Share the link on our course blog by:

  1. Create a new post.
  2. Hit the ‘html’ button.
  3. type:
  4. Preview your post to make sure it loads your Twine.

Play:

Explore others’ Twines and be ready to discuss this process and these readings in Tuesday’s class.

Historical Consciousness and Worldview

Challenge #3

Read:

Kee, Graham, et al. ‘Towards a Theory of Good History Through Gaming.’ The Canadian Historical Review 90.2 (June 2009): 303-326.

http://muse.jhu.edu/journals/can/summary/v090/90.2.kee.html

Travis, Roger. ‘Your practomimetic school: Duck Hunt or BioShock?’ Play the Past Oct 21 2011 http://www.playthepast.org/?p=2067

Owens, T. ‘What does Simony say? An interview with Ian Bogost’ Play the Past Dec 13, 2012 http://www.playthepast.org/?p=3394

Travis, Roger. ‘A Modest Proposal for viewing literary texts as rulesets, and for making game studies beneficial for the publick’ Play the Past Feb 9 2012 http://www.playthepast.org/?p=2417

McCall, Jeremiah. “Historical Simulations as Problem Spaces: Some Guidelines for Criticism”. Play the Past http://www.playthepast.org/?p=2594

(Not assigned, but more of Travis’ work: http://livingepic.blogspot.ca/2012/07/rules-of-text-series-at-play-past.html)

Craft:

A Twine that exposes the underlying rhetorics of the game of teaching history.

Share:

Put your Twine build (the *html file) into the ‘public’ folder in your Dropbox account. Share the link on our course blog by:

  1. Create a new post.
  2. Hit the ‘html’ button.
  3. type:
  4. Preview your post to make sure it loads your Twine.

Play:

Explore others’ Twines and be ready to discuss this process and these readings in Tuesday’s class.

Critical Play Week

Challenge #4

Remember: 

Keep notes on the discussions from the critical play session; move around the class, talk with people about what they’re playing and why they make the moves they do, and think about the connections with the major reading.

(NB: I’ve assigned all the students to bring in video games and board games that we’ll play in both sessions this week. We might decamp to the game lab in the library to make this work. This group will observe the play. I’ve also pointed them to Feminist Frequency as an example of the kind of criticism I want them to emulate.)

Craft:

Devise a Twine that captures the dynamics and discussions of this week’s in-class critical play. Remember, for historians, it may be all about time and space.

Share:

Put your Twine build (the *html file) into the ‘public’ folder in your Dropbox account. Share the link on our course blog by:

  1. Create a new post.
  2. Hit the ‘html’ button.
  3. type:
  4. Preview your post to make sure it loads your Twine.

Play:

Explore others’ Twines and be ready to discuss this process and these readings in Tuesday’s class.

Material Culture and the Digital

Challenge #5

Read

Montfort et al, ‘Introduction’, 10 Print http://10print.org/ (download the pdf)

Montfort et al, ‘Mazes,’ 10 Print http://10print.org/ (download the pdf)

Bogost, Ian, Montfort, N. ‘New Media as Material Constraint: An Introduction to Platform Studies.’ 1st International HASTAC Conference, Duke University, Durham NC  http://bogost.com/downloads/Bogost%20Montfort%20HASTAC.pdf

Craft:

Make a Twine game that emulates Space Invaders; then discuss (within the Twine) the interaction between game, platform, and experience. Think also about ‘emulation’…

OR

Play one of these games, reviewing it via Twine, thinking about it in a way that reverses the points made by Montfort & Bogost (i.e., think about the way the physical is represented in the software).

Share:

Put your Twine build (the *html file) into the ‘public’ folder in your Dropbox account. Share the link on our course blog by:

  1. Create a new post.
  2. Hit the ‘html’ button.
  3. type:
  4. Preview your post to make sure it loads your Twine.

Play:

Explore others’ Twines and be ready to discuss this process and these readings in Tuesday’s class.

 

Web Seer and the Zeitgeist

I’ve been playing all evening with Web Seer, a toy that lets you contrast pairs of Google autocomplete suggestions. As is well known, Google autocomplete suggests completions based on what others have been searching for, given the pattern of text you are entering. This is sparking some thoughts on how I might use it to think about things like public archaeology or public history.

As Alan Liu put it,

But for now, enjoy the pairings that I’ve been feeding it….

In ancient/modern

Greek versus Roman

What School Should I Go To?

Games and Literature

Getting Down to Brass Tacks

Drunkards and Teetotallers, never the twain shall meet

Historians v Archaeologists, a Google Cage Match

The DH Dilemma

Future/Perfect

Two Solitudes Redux

SAA 2015: Macroscopic approaches to archaeological histories: Insights into archaeological practice from digital methods

Ben Marwick and I are organizing a session for the SAA2015 (the 80th edition, this year in San Francisco) on “Macroscopic approaches to archaeological histories: Insights into archaeological practice from digital methods”. It’s a pretty big tent. Below is the session ID and the abstract. If this sounds like something you’d be interested in, why don’t you get in touch?

Session ID 743.

The history of archaeology, like that of most disciplines, is often presented as a sequence of influential individuals and a discussion of their greatest hits in the literature. Two problems with this traditional approach are that it sidelines the majority of participants in the archaeological literature who are excluded from these discussions, and that it does not capture the conversations outside of the canonical literature. Recently developed computationally intensive methods, as well as creative uses of existing digital tools, can address these problems by efficiently enabling quantitative analyses of large volumes of text and other digital objects, and by enabling large-scale analysis of non-traditional research products such as blogs, images and other media. This session explores these methods, their potentials, and their perils, as we employ so-called ‘big data’ approaches to our own discipline.

—-

Like I said, if that sounds like something you’d be curious to know more about, ping me.

Historical Maps into Minecraft: My Workflow

The folks at the New York Public Library have a workflow and Python script for translating historical maps into Minecraft. It’s a three-step (quite big steps) process. First, they generate a DEM (digital elevation model) from the historical map, using QGIS. This is saved as ‘elevation.tiff’. Then, using Inkscape, they trace over the features from the historical map that they want to translate into Minecraft. Different colours equal different kinds of blocks. This is saved as ‘features.tiff’. Finally, a custom Python script combines the two layers to create a Minecraft map, which can be in either ‘creative’ or ‘survival’ mode.

There are a number of unspoken steps in that workflow, including a number of dependencies for the python script that have to be installed first. Similarly, QGIS and its plugins also have a steep (sometimes hidden) learning curve. As does Inkscape. And Imagemagick. This isn’t a criticism; it’s just the way this kind of thing works. The problem, from my perspective, is that if I want to use this in the classroom, I have to guide 40 students with widely varying degrees of digital fluency.* I’ve found in the past that many of my students “didn’t study history to have to work with computers” and that the payoff sometimes (to them) doesn’t seem to have (immediate) value. The pros and cons of that kind of work shall be the post for another day.

Right now, my immediate problem is, how can I smooth the gradient of the learning curve? I will do this by providing 3 separate paths for creating the digital elevation model.

Path 1, for when real world geography is not the most important aspect.

It may be that the shape of the world described by the historical map is what is of interest, rather than the current topography of the world. For example, I could imagine a student wanting to explore the historical geography of the Chats Falls before they were flooded by the building of a hydro dam. Current topographic maps and DEMs are not useful. For this path, the student will need to use the process described by the NYPL folks:

Requirements

QGIS 2.2.0 ( http://qgis.org )

  • Activate Contour plugin
  • Activate GRASS plugin if not already activated

A map image to work from

  • We used a geo-rectified TIFF exported from this map but any high rez scan of a map with elevation data and features will suffice.

Process:

Layer > Add Raster Layer > [select rectified tiff]

  • Repeat for each tiff to be analyzed

Layer > New > New Shapefile Layer

  • Type: Point
  • New Attribute: add ‘elevation’ type whole number
  • remove id

Contour (plugin)

  • Vector Layer: choose points layer just created
  • Data field: elevation
  • Number: at least 20 (maybe.. number of distinct elevations + 2)
  • Layer name: default is fine

Export and import contours as vector layer:

  • right click save (e.g. port-washington-contours.shp)
  • May report error like “Only 19 of 20 features written.” Doesn’t seem to matter much

Layer > Add Vector Layer > [add .shp layer just exported]

Edit Current Grass Region (to reduce rendering time)

  • clip to minimal lat longs

Open Grass Tools

  • Modules List: Select “v.in.ogr.qgis”
  • Select recently added contours layer
  • Run, View output, and close

Open Grass Tools

  • Modules List: Select “v.to.rast.attr”
  • Name of input vector map: (layer just generated)
  • Attribute field: elevation
  • Run, View output, and close

Open Grass Tools

  • Modules List: Select “r.surf.contour”
  • Name of existing raster map containing colors: (layer just generated)
  • Run (will take a while), View output, and close

Hide points and contours (and anything else above the b/w elevation image), then Project > Save as Image

You may want to create a cropped version of the result to remove un-analyzed/messy edges

The hidden, tacit bits here involve installing the Contour plugin, and working with the GRASS tools (especially the bit about ‘editing the current grass region’, which is always fiddly, I find). Students pursuing this path will need a lot of one-on-one.

Path 2, for when you already have a shapefile from a GIS:

This was cooked up for me by Joel Rivard, one of our GIS & Map specialists in the Library. He writes,

1) In the menu, go to Layer > Add Vector Layer. Find the point shapefile that has the elevation information.
Ensure that you select point in the file type.
2) In the menu, go to Raster > Interpolation. Select “Field 3” (this corresponds to the z or elevation field) for the Interpolation attribute and click on “Add”.
Feel free to keep the rest as default and save the output file as an image (.asc, bmp, jpg or any other raster – probably best to use .asc, since that’s what MicroDEM likes).

We’ll talk about MicroDEM in a moment. I haven’t tested this path myself yet, but it should work.

Path 3, for when modern topography is fine for your purposes

In this situation, modern topography is just what you need.

1. Grab Shuttle Radar Topography Mission data for the area you are interested in (it downloads as a tiff.)

2. Install MicroDEM and all of its bits and pieces (the installer wants a whole bunch of other supporting bits; just say yes. MicroDEM is PC software, but I’ve run it on a Mac within WineBottler).

3. This video tutorial covers working with MicroDEM and Worldpainter:

https://www.youtube.com/watch?v=Wha2m4_CPoo

But here are some screenshots – basically, you open up your .tiff or your .asc image file within MicroDEM, crop to the area you are interested in, and then convert the image to grayscale:

MicroDEM: open image, crop image.

Convert to grayscale

Remove legends, marginalia

Save your grayscaled image as a .tiff.
Regardless of the path you took (and think about the historical implications of those paths), you now have a grayscale DEM image that you can use to generate your Minecraft world.

Converting your grayscale DEM to a Minecraft World

At this point, the easiest thing to do is to use WorldPainter. It’s free, but you can donate to its developers to help them maintain and update it. The video above shows how to load your DEM image into WorldPainter. It parses the black-to-white pixel values and turns them into elevations. You have the option of setting where ‘sea level’ is on your map (so elevations below that point are covered with water). There are many, many options here; play with it! Adam Clarke, who made the video, suggests scaling up your image to 900%, but I’ve found that makes absolutely monstrous worlds. You’ll have to play around to see what makes most sense for you, but with real-world data of any area larger than a few kilometres on a side, I think 100 to 200% is fine.
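If you want a feel for what that parsing amounts to, here is a toy sketch in R – not WorldPainter’s actual code; it assumes the raster package and a hypothetical ‘grayscale-dem.tif’ file – that just rescales each grey pixel value onto a plausible range of block heights.

## toy illustration only: rescale grayscale pixel values onto Minecraft-ish block heights
## assumes the raster package is installed and 'grayscale-dem.tif' is the DEM made above
library(raster)
dem <- raster("grayscale-dem.tif")
v   <- getValues(dem)
## stretch the grey values onto a 62 ('sea level') to 255 (build limit) range
heights <- round((v - min(v, na.rm = TRUE)) /
                 (max(v, na.rm = TRUE) - min(v, na.rm = TRUE)) * (255 - 62)) + 62
summary(heights)   # quick sanity check on the resulting elevations

WorldPainter does all of this (and far more) for you; the sketch is only there to demystify the grayscale-to-elevation step.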

Now, the crucial bit for us: you can import an image into WorldPainter to use as an overlay to guide the placement of blocks, terrain, buildings, whatever. So, rather than me simply regurgitating what Adam narrates, go watch the video. Save as a .world file for editing; export to Minecraft when you’re ready (be warned: big maps can take *a very long time* to render. That’s another reason why I don’t scale up the way Adam suggests).

Go play.

To get you started: here are a number of DEMs and WorldPainter world files that I’ve been playing with. Try ‘em out for yourself.

 

* another problem I’ve encountered is that my feature colours don’t map onto the index values for blocks in the script. I’ve tried modifying the script to allow for a bit of fuzziness (a kind of ‘if the pixel value is between x and y, treat as z’). I end up with worlds filled with water. If I run the script on the Fort Washington maps provided by NYPL, it works perfectly. The script is supposed to only be looking at the R of the RGB values when it assigns blocks, but I wonder if there isn’t something else going on. I had it work once, correctly, for me – but I used MS Paint to recolour my image with the exact colours from the Fort Washington map. Tried it again, exact same workflow on a different map, nada. Nyet. Zip. Zilch. Just a whole lot of tears and heartache.

Assessing my upcoming seminar on the Illicit Antiquities trade, HIST4805b

So I’m putting together the syllabus for my illicit antiquities seminar. This is where I think I’m going with the course, which starts in less than a month (eep!). The first part is an attempt to revitalize my classroom blogging, and to formally tie it into the discussion within the classroom – that is, something done in advance of class in order to make the classroom discussion richer. In the second term, I want to make as much time as possible for students to pursue their own independent research, which I’m framing as an ‘unessay’ following the O’Donnell model.

~oOo~

Daylight: The Journal of #HIST4805b Studying Looted Heritage

Rationale: What we are studying is important, and what we are learning needs to be disseminated as widely as possible. In a world where ‘American Diggers‘ can be a TV show, and where National Geographic (for heaven’s sake!) can seriously contemplate putting on a show that desecrates war dead for entertainment, there is a need to shed daylight. The fall term’s major assessment piece does this. You will be writing and curating a Flipboard magazine that ties our readings and discussions into the current news regarding heritage crime.

There are a number of steps to this.

  1. Each week, everyone  logs into heritage.crowdmap.com and puts three new reports on the map.
  2. Each week, a different subset of the class will be the lead editors for our journal.
    1. lead editors each write an editorial that explores the issues raised in the readings, with specific reference to new reports on our crowdmap. Editorials should be 750–1000 words long.
    2. lead editors curate the Flipboard magazine so that it contains:
      1. the editorials
      2. the crowdmap reports
      3. the readings
  3. This should be completed before Monday’s class where we will discuss those readings. The lead editors will begin the class by discussing their edition of Daylight.*
  4. Each student will be a lead editor three times.

*if you can think of a better name, we’ll use that.

At the end of term you will nominate your two best pieces for grading. I will grade these for how you’ve framed your argument, for your use of evidence, and for your understanding of the issues. I will also take into account your in-class discussion of your edition of Daylight.

At the end of term you will also nominate two of your peers’ best pieces for consideration for bonus, with a single line explaining why.

This is worth 40% of your final grade.

—–

The Unessay Research Project

‘Unessay’, noun – as described by Daniel Paul O’Donnell,

“[...] the unessay is an assignment that attempts to undo the damage done by [traditional essay writing at the university level]. It works by throwing out all the rules you have learned about essay writing in the course of your primary, secondary, and post secondary education and asks you to focus instead solely on your intellectual interests and passions. In an unessay you choose your own topic, present it any way you please, and are evaluated on how compelling and effective you are.”

Which means for us:

The second term is an opportunity for exploration, and for you to use the time that you would normally spend in a classroom listening as time for active planning, researching, and learning the necessary skills, to effectively craft an ‘unessay’ of original research on a topic connected with the illicit antiquities trade. I will put together a schedule for weekly one on one or small group meetings where I can help you develop your project.

For this to work, you will have to come prepared to these meetings. This means keeping a research journal to which I will have access. You may choose to make this publicly accessible as well (and we’ll talk about why and how you might want to do that).  Periodically, we will meet as an entire class to discuss the issues we are having in our research. You will present your research formally to the class and invited visitors at the end of term – your project might not be finished at that point, but your presentation can take this into account. The project is due on the final day of term.

Grading:

Pass/Fail: Research Journal (ie, no complete research journal, no assessment for this project). We will discuss what is involved in a research journal. A Zotero library with notes would also be acceptable.

5% Presentation in class

45% Project

O’Donnell writes,

“If unessays can be about anything and there are no restrictions on format and presentation, how are they graded?

The main criteria is how well it all fits together. That is to say, how compelling and effective your work is.

An unessay is compelling when it shows some combination of the following:

  • it is as interesting as its topic and approach allows
  • it is as complete as its topic and approach allows (it doesn’t leave the audience thinking that important points are being skipped over or ignored)
  • it is truthful (any questions, evidence, conclusions, or arguments you raise are honestly and accurately presented)

In terms of presentation, an unessay is effective when it shows some combination of these attributes:

  • it is readable/watchable/listenable (i.e. the production values are appropriately high and the audience is not distracted by avoidable lapses in presentation)
  • it is appropriate (i.e. it uses a format and medium that suits its topic and approach)
  • it is attractive (i.e. it is presented in a way that leads the audience to trust the author and his or her arguments, examples, and conclusions).”

~oOo~

So that’s what I’m going with. I’m not giving points out for participation, as that has never really worked for me. There will of course be much more going on in the classroom than just what is described here, including technical tutorials on various digital tools that I think are useful, and beta-testing some other things, but my thinking is that these will see their expression in the quality of the independent research that takes place in the Winter term.

So Fall term: much reading, much discussion. Winter term: self-direction along trajectories established in the Fall. We shall see.

#hist3812a video games and simulations for historians, batting around some syllabus ideas

I’ve been batting around ideas for my video games class, trying to flesh them out some more. I put together a twine-based exploration of some of my ideas in this regard a few weeks ago; you can play it here. Anyway, what follows below is just me thinking out loud. The course runs for 12 weeks. (O my students, the version of the syllabus you should trust is the one that I am obligated to put on cuLearn).

What does Good History Through Gaming Look Like?

How do we know? Why should we care? What could we do with it, if we had it? Is it playing that matters, or is it building? Can a game foster critical play? What is critical play, anyway? ‘Close reading’ can happen not just of text, but also of code, and of experience. It pulls back the curtain (link to my essay discussing a previous iteration of this course).

Likely Topics

  1. A history of games, and of video games
  2. Historical Consciousness & Worldview
  3. Material culture, and the digital: software exists in the physical world
  4. Simulation & Practical Necromancy: representing the physical world in software
  5. Living History, LARPing, ARGs and AR: History, the Killer App
  6. Museums as gamed/gameful spaces
  7. Gamification and its bastards: or, nothing sucks the fun out of games like education
  8. Rolling your Own: Mods & Indies
  9. The politics of representation

Assessment

Which Might Include Weekly Responses & Critical Play Sessions:

  1. IF responses to readings (written using http://twinery.org)
  2. Play-throughs of others’ IF (other students; indie games in the wild)
  3. Critical play of Minecraft
  4. Critical play of ‘historical’ game of your choice
  5. Critical play of original SimCity (which can be downloaded or played online here). We’ll look at its source code, too, I think. Or we might play a version of Civilization. Haven’t decided yet.
  6. Critical boardgame play
  7. ARIS WW1 Simulation by Alex Crudas & Tyler Sinclair

Yes. I am going to have you play video games, for grades. But you will be looking for procedural rhetorics, worldviews, constraints, and other ways we share authority with algorithms (and who writes these, anyway?) when we consume digital representations of history. Consume? Is that the right verb? Co-create? Receive?

Major Works

  1. Midterm: IF your favourite academic paper that you have written, such that a player playing it could argue the other sides you ignored in your linear paper. Construct it in such a way that the player/reader can move through it at will and still engage with a coherent argument. (See for example ‘Buried’ http://taracopplestone.co.uk/buried.html). You will use the Twine platform. http://twinery.org
  2. Summative Project: Minecrafted History
    1. You will design and build an immersive experience in Minecraft that expresses ‘good history through gaming’. There will be checkpoints to meet over the course of the term. Worlds will be built by teams, in groups of 5. Worlds can be picked from three broad themes:
      THE HISTORY OF THE OTTAWA VALLEY
      THE CANADIANS ON THE WESTERN FRONT
      COLONIZATION AND RESISTANCE IN ROMAN BRITAIN  (…look, I was a Roman archaeologist, once…)
    2. You will need to obtain source maps; you will digitize these and translate them into Minecraft. We will in all likelihood be using Github to manage your projects. The historical challenge will be to frame the game play within the world that you have created such that it expresses good history. You will need to keep track of every decision you make and why, and think through what the historical implications are of those decisions.
    3. The final build will be accompanied by a paradata document that will discuss your build, detail all sources used (Harvard style), reference all appropriate literature, and explain how playing your world creates ‘good history’ for the player. This document should reference Fogu, Kee et al, and the papers in Elliott and Kapell at a minimum. More information about ‘paradata’ and examples may be found at http://heritagejam.org/what-are-paradata It is due in the first session of the last week of term, so that we can all play each others’ worlds. The in-class discussion that will follow in the second session is also a part of this project’s grade. Your work-in-progress may also be presented at Carleton’s GIS Day (3rd Wednesday in November).
    4. (These worlds will be made publicly available at the end of the term, ideally for local high school history classes to use. Many people at the university are interested to see what we come up with, too. No pressure).

So that’s what I’m thinking, with approximately 1 month to go until term starts. We’ve got Minecraft.edu installed in the Gaming Lab in the Discovery Centre in the Library, we’ve got logins and remote access all sorted out, I have most of the readings set … it’s coming together. Speaking of readings, we’ll use this as our bible:

Playing with the Past

and will probably dip into these:

Play the Past

PastPlay

… sensing a theme…

Topic Modeling Greek Consumerism

I’m experimenting. Here’s what I did today.

1. Justin Walsh published the data on which his book, ‘Consumerism in the Ancient World’, rests.

2. I downloaded it, and decided I would topic model it. The table, ‘Greek Vases’, has one row = one vase. Let’s start with that, though I think it might be more useful/illuminating to decide that ‘document’ might mean ‘site’ or ‘context’. But first things first; let’s sort out the workflow.

3. I deleted all columns with ‘true’ or ‘false’ values – they struck me as not useful – and concatenated the remaining columns into a single ‘text’ column. Then, per the description on the Mallet package page for R, I added a new column ‘class’, which I left blank. So I have ‘id’, ‘class’, and ‘text’. All of Walsh’s information is in the ‘text’ field.
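Something along these lines would do it – a rough sketch only, assuming the table was exported as a hypothetical ‘greek-vases.csv’ with an ‘id’ column; adjust the names to whatever your export actually looks like:

## rough sketch of the preprocessing in step 3; the filename and column names are placeholders
vases <- read.csv("greek-vases.csv", stringsAsFactors = FALSE)

## drop the columns that only hold true/false values
logicalish <- sapply(vases, function(col) all(tolower(as.character(col)) %in% c("true", "false")))
vases <- vases[, !logicalish]

## mash everything except the id into a single 'text' field
text <- apply(vases[, setdiff(names(vases), "id")], 1, paste, collapse = " ")

## assemble the three columns mallet.import expects, and write the tab-separated file
documents <- data.frame(id = vases$id, class = "", text = text, stringsAsFactors = FALSE)
write.table(documents, "modified-vases2.txt", sep = "\t",
            row.names = FALSE, col.names = FALSE, quote = FALSE)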

4. I ran this code in R, using RStudio:

## from http://cran.r-project.org/web/packages/mallet/mallet.pdf
library(mallet)
## Create a wrapper for the data with three elements, one for each column.
## R does some type inference, and will guess wrong, so give it hints with "colClasses".
## Note that "id" and "text" are special fields -- mallet will look there for input.
## "class" is arbitrary. We will only use that field on the R side.
documents <- read.table("modified-vases2.txt", col.names=c("id", "class", "text"),
                        colClasses=rep("character", 3), sep="\t", quote="")
## Create a mallet instance list object. Right now I have to specify the stoplist
## as a file, I can't pass in a list from R.
## This function has a few hidden options (whether to lowercase, how we
## define a token). See ?mallet.import for details.
mallet.instances <- mallet.import(documents$id, documents$text, "/Users/shawngraham/Desktop/data mining and tools/stoplist.csv",
                                  token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")
## Create a topic trainer object.
num.topics <- 20
topic.model <- MalletLDA(num.topics)

## Load our documents. We could also pass in the filename of a
## saved instance list file that we build from the command-line tools.
topic.model$loadDocuments(mallet.instances)

## Get the vocabulary, and some statistics about word frequencies.
## These may be useful in further curating the stopword list.
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)

## Optimize hyperparameters every 20 iterations,
## after 50 burn-in iterations.
topic.model$setAlphaOptimization(20, 50)

## Now train a model. Note that hyperparameter optimization is on, by default.
## We can specify the number of iterations. Here we'll use a large-ish round number.
topic.model$train(200)

## NEW: run through a few iterations where we pick the best topic for each token,
## rather than sampling from the posterior distribution.
topic.model$maximize(10)

## Get the probability of topics in documents and the probability of words in topics.
## By default, these functions return raw word counts. Here we want probabilities,
## so we normalize, and add "smoothing" so that nothing has exactly 0 probability.
doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)
topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)

## What are the top words in topic 7?
## Notice that R indexes from 1, so this will be the topic that mallet called topic 6.
mallet.top.words(topic.model, topic.words[7,])

## Show the first few documents with at least 5% of topics 7 and 10.
head(documents[ doc.topics[7,] > 0.05 & doc.topics[10,] > 0.05, ])

## End of Mimno's sample script (Not run)

###from my other script; above was mimno's example script
topic.docs <- t(doc.topics)
topic.docs <- topic.docs / rowSums(topic.docs)
write.csv(topic.docs, "vases-topics-docs.csv" ) 

## Get a vector containing short names for the topics
topics.labels <- rep("", num.topics)
for (topic in 1:num.topics) topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=5)$words, collapse=" ")

# have a look at keywords for each topic
topics.labels
write.csv(topics.labels, "vases-topics-labels.csv") ## "C:\\Mallet-2.0.7\\topics-labels.csv")

### do word clouds of the topics
library(wordcloud)
for(i in 1:num.topics){
  topic.top.words <- mallet.top.words(topic.model,
                                      topic.words[i,], 25)
  print(wordcloud(topic.top.words$words,
                  topic.top.words$weights,
                  c(4,.8), rot.per=0,
                  random.order=F))
}

And this is what I get:
Topic # Label
1 france greek west eating grey
2 spain ampurias neapolis girona arf
3 france rune herault colline nissan-lez-ens
4 spain huelva east greek drinking
5 france aude drinking montlaures cup
6 spain malaga settlement cup drinking
7 france drinking bouches-du-rhone settlement cup
8 france cup stemmed herault bessan
9 france marseille massalia bouches-du-rhone storage
10 spain ullastret settlement girona puig
11 france settlement mailhac drinking switzerland
12 spain badajoz cup stemless castulo
13 spain ampurias settlement girona neapolis
14 france beziers drinking cup pyrenees
15 spain krater bell arf drinking
16 transport amphora france gard massaliote
17 france settlement saint-blaise bouches-du-rhone greek
18 france marseille massalia west bouches-du-rhone
19 spain jaen drinking cemetery castulo
20 spain settlement abg eating alicante

The three-letter acronyms are ware types. The original data had location, context, ware, purpose, and dates. I still need to figure out how to get Mallet (either on the command line or in R) to treat numerals as words, but that’s something I can ignore for the moment. So what next? Map this, I guess, in physical and/or temporal space, and resolve the problem of what a ‘document’ really is for archaeological topic modeling. Here, look at the word clouds generated at the end of the script whilst I ruminate. And also a flow diagram. What it shows, I know not. Exploration, eh?
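One thing worth trying for the numerals, by the way (untested on this data, so treat it as a guess): the token.regexp passed to mallet.import above only admits letters, so a pattern that also admits digits ought to keep the dates as tokens.

## untested guess: allow tokens to begin and end with letters or digits, so dates survive
mallet.instances <- mallet.import(documents$id, documents$text,
                                  "/Users/shawngraham/Desktop/data mining and tools/stoplist.csv",
                                  token.regexp = "[\\p{L}\\p{N}][\\p{L}\\p{N}\\p{P}]*[\\p{L}\\p{N}]")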


Extracting Text from PDFs; Doing OCR; all within R

I am a huge fan of Ben Marwick. He has so many useful pieces of code for the programming archaeologist or historian!

Edit July 17 1.20 pm: Mea culpa: I originally titled this post, ‘Doing OCR within R’. But, what I’m describing below – that’s not OCR. That’s extracting text from pdfs. It’s very fast and efficient, but it’s not OCR. So, brain fart. But I leave the remainder of the post as it was. For command line OCR (really, actual OCR) on a Mac, see the link to Ben Schmidt’s piece at the bottom. Sorry.

Edit July 17 10 pm: I am now an even bigger fan of Ben’s. He’s updated his script to either a) perform OCR by calling Tesseract from within R or b) grab the text layer from a pdf image. So this post no longer misleads. Thank you Ben!

Optical Character Recognition, or OCR, is something that most historians will need to use at some point when working with digital documents. That is, you will often encounter pdf files of texts that you wish to work with in more detail (digitized newspapers, for instance). Often, there is a layer within the pdf image containing the text already: if you can highlight text by clicking and dragging over the image, you can copy and paste the text from the image. But this is often not the case, or worse, you have tens or hundreds or even thousands of documents to examine. There is commercial software that can do this for you, but it can be quite expensive.

One way of doing OCR on your own machine with free tools is to use Ben Marwick’s pdf-2-text-or-csv.r script for the R programming language. Marwick’s script uses R as a wrapper for the Xpdf programme from Foolabs. Xpdf is a pdf viewer, much like Adobe Acrobat. Using Xpdf on its own can be quite tricky, so Marwick’s script will feed your pdf files to Xpdf, and have Xpdf perform the text extraction. For OCR, the script acts as a wrapper for Tesseract, which is not an easy piece of software to work with. There’s a final part to Marwick’s script that will pre-process the resulting text files for various kinds of text analysis, but you can ignore that part for now.

  1. Make sure you have R downloaded and installed on your machine (available from http://www.r-project.org/)
  2. Make sure you have Xpdf downloaded and installed (available from ftp://ftp.foolabs.com/pub/xpdf/xpdfbin-win-3.04.zip ). Make a note of where you unzipped it. In particular, you are looking for the location of the file ‘pdftotext.exe’. Also, make sure you know where ‘pdftoppm’ is located too (it’s in that download).
  3. Download and install Tesseract https://code.google.com/p/tesseract-ocr/ 
  4. Download and install Imagemagick http://www.imagemagick.org/
  5. Have a folder with the pdfs you wish to extract text from.
  6. Open R, and paste Marwick’s script into the script editor window.
  7. Make sure you adjust the path for “dest” and the path to “pdftotext.exe” to the correct location
  8. Run the script! But read the script carefully and make sure you run the bits you need. Ben has commented out the code very well, so it should be fairly straightforward.

Obviously, the above is framed for Windows users. For Mac users, the steps are all the same, except that you use the versions of Xpdf, Tesseract, and Imagemagick built for OS X, and your paths to the other software are going to be different. And of course you’re using R for Mac, which means the ‘shell’ commands have to be swapped to ‘system’! (As of July 2014, the Xpdf file for Mac that you want is at ftp://ftp.foolabs.com/pub/xpdf/xpdfbin-mac-3.04.tar.gz.) I’m not 100% certain of any other Mac/PC differences in the R script – these should only exist at those points where R is calling on other resources (rather than on R packages). Caveat lector, eh?
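For instance, the pdftoppm call in the OCR section below, which uses shell() on Windows, would become something along these lines on a Mac (an untested sketch):

# Windows version, from the script below:
#   shell(shQuote(paste0("pdftoppm ", i, " -f 1 -l 10 -r 600 ocrbook")))
# Mac version, roughly:
system(paste0("pdftoppm ", i, " -f 1 -l 10 -r 600 ocrbook"))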

The full R script may be found at https://gist.github.com/benmarwick/11333467. So here is the section that does the text extraction from pdfs that already have a text layer (i.e., where you can highlight and copy text in the pdf):

###Note: there's some preprocessing that I (sg) haven't shown here: go see the original gist

################# Wait! ####################################
# Before proceeding, make sure you have a copy of pdf2text
# on your computer! Details: https://en.wikipedia.org/wiki/Pdftotext
# Download: http://www.foolabs.com/xpdf/download.html

# Tell R what folder contains your 1000s of PDFs
dest <- "G:/somehere/with/many/PDFs"

# make a vector of PDF file names
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

# now there are a few options...

############### PDF to TXT #################################
# convert each PDF file that is named in the vector into a text file
# text file is created in the same directory as the PDFs
# note that my pdftotext.exe is in a different location to yours
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"', paste0('"', i, '"')), wait = FALSE) )

# where are the txt files you just made?
dest # in this folder

And here’s the bit that does the OCR

                     ##### Wait! #####
# Before proceeding, make sure you have a copy of Tesseract
# on your computer! Details & download:
# https://code.google.com/p/tesseract-ocr/
# and a copy of ImageMagick: http://www.imagemagick.org/
# and a copy of pdftoppm on your computer!
# Download: http://www.foolabs.com/xpdf/download.html
# And then after installing those three, restart to
# ensure R can find them on your path.
# And note that this process can be quite slow...

# PDF filenames can't have spaces in them for these operations
# so let's get rid of the spaces in the filenames

sapply(myfiles, FUN = function(i){
  file.rename(from = i, to =  paste0(dirname(i), "/", gsub(" ", "", basename(i))))
})

# get the PDF file names without spaces
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

# Now we can do the OCR to the renamed PDF files. Don't worry
# if you get messages like 'Config Error: No display
# font for...' it's nothing to worry about

lapply(myfiles, function(i){
  # convert pdf to ppm (an image format), using
  shell(shQuote(paste0("pdftoppm ", i, " -f 1 -l 10 -r 600 ocrbook")))
  # convert ppm to tif ready for tesseract
  shell(shQuote(paste0("convert *.ppm ", i, ".tif")))
  # convert tif to text file
  shell(shQuote(paste0("tesseract ", i, ".tif ", i, " -l eng")))
  # delete tif file
  file.remove(paste0(i, ".tif" ))
  })

# where are the txt files you just made?
dest # in this folder

Besides showing how to do your own OCR, Marwick’s script shows some of the power of R for doing more than statistics. Mac users might be interested in Ben Schmidt’s tutorial ‘Command-line OCR on a Mac’ from his digital history graduate seminar at Northeastern University, online at http://benschmidt.org/dighist13/?page_id=129.

Government of Canada Edits

In recent days, a number of twitterbots have been set up to monitor changes to Wikipedia emerging from government IP address blocks. Seems to me that here is a window for data mining the mindset of government. Of course, there’s nothing to indicate that anything untoward is being done by the Government itself; I live in Ottawa, and I know what civil servants can get up to on their lunch break.

But let’s look at the recent changes documented by https://twitter.com/gccaedits; I’ve taken screenshots below, but you can just scroll through gccaedits’ feed. Actually, come to think of it, someone should be archiving those tweets, too. It’s only been operational for something like 3 days, but already, we see an interesting grammar/football fanatic; someone with opinions on Amanda Knox; someone setting military history right, and someone fixing the German version of Rene Levesque’s page.

Hmmm. Keep your eyes on this, especially as next year is an election year…

[Screenshots of recent @gccaedits tweets]

Setting up your own Data Refinery

Refinery at Oxymoron, by Wyatt Wellman, cc by-sa 2.0. Flickr.

I’ve been playing with a Mac. I’ve been a Windows person for a long time, so bear with me.

I’m setting up a number of platforms locally for data mining. But since what I’m *really* doing is smelting the ore of data scraped using things like Outwit Hub or Import.io (the ‘mining operation’, in this tortured analogy), what I’m setting up is a data refinery. Web-based services are awesome, but if you’re dealing with sensitive data (like oral history interviews, for example) you need something local – this will also help with your ethics board review. Onwards!

Voyant-Tools

You can now set Voyant-Tools up locally, keeping your data safe and sound. The documentation and downloads are all on this page. This was an incredibly easy setup on the Mac. Unzip, double-click voyant-tools.jar, and boom, you’ve got Voyant-Tools puttering away in your browser. It’ll be at http://127.0.0.1:8888. You can also hit the cogwheel icon in the top right to run your corpus through all sorts of other tools that come with Voyant but aren’t there on the main layout. You’ll want ‘export corpus with other tool’. You’ll end up with a URL something like http://127.0.0.1:8888/tool/RezoViz/?corpus=1404405786475.8384 . You can then swap the name of any other tool into that URL (to save time). RezoViz, by the way, uses named entity extraction to construct a network of entities mentioned in the same documents. So if you upload your corpus in small-ish chunks (paragraphs; pages; every 1000 words, whatever) you can see how it all ties together this way. From the cogwheel icon on the RezoViz layout, you can get a .net file which you can then import into Gephi. How frickin’ cool is that?

Overview Project


Topic modeling is all the rage, and yes, you should have MALLET or the Stanford TMT or R on your machine. But sometimes it’s nice to just see something rather like a dendrogram of folders with progressively finer levels of self-similarity. Overview uses term frequency–inverse document frequency (tf-idf) weightings to figure out the similarity of documents. The instructions (for all platforms) are here. It’s not quite as painless as Voyant, but it’s pretty darn close. You’ll need to have Postgres – download, install, run it once, then download Overview. You need to have Java 7. (At some point, you’ll probably need to look into running multiple versions of Java, if you continue to add elements to your refinery). Then:

  1. Ctrl-Click or Right-click on Overview for Mac OS X.command and select Open. When the dialog box asks if you are sure you want to open the application, click on the Open button. From then on, you can start Overview by double-clicking on Overview for Mac OS X.command.
  2. Browse to http://localhost:9000 and log in as admin@overviewproject.org with password admin@overviewproject.org.

And you now have Overview running. You can do many many things with Overview – it’ll read pdfs, for instance, which you can then export within a csv file. You can tag folders and export those tags, to do some fun visualizations with the next part of your refinery, RAW.
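To give a sense of what that tf-idf weighting is doing under the hood, here’s a toy sketch in R – an illustration of the general idea only, not Overview’s actual code: weight each term by how rare it is across documents, then compare documents by cosine similarity.

## toy illustration of tf-idf + cosine similarity (not Overview's implementation)
docs  <- c(a = "pots pans cups", b = "pots cups cups", c = "swords shields helmets")
words <- strsplit(docs, " ")
vocab <- unique(unlist(words))
tf    <- t(sapply(words, function(w) table(factor(w, levels = vocab))))  # term counts per doc
idf   <- log(length(docs) / colSums(tf > 0))        # rarer terms get higher weight
tfidf <- sweep(tf, 2, idf, "*")                     # tf-idf weighted matrix
norms <- sqrt(rowSums(tfidf^2))
round((tfidf %*% t(tfidf)) / (norms %o% norms), 2)  # cosine similarity between documents

Documents that score high against each other are the kind Overview tends to group into the same folder.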

Tags exported from Overview

Tags visualized. This wasn’t done with Raw (but rather, a commercial piece of software), but you get the idea.

RAW

Flow diagram in Raw, using sample movie data

Raw does wonderful things with CSV formatted data, all in your browser. You can use the webapp version; nothing gets communicated to the server. But, still, it’s nice to keep it close to home. So, you can get Raw source code here. It’s a little trickier to install than the others. First thing: you’ll need Bower. But you can’t install Bower without Node.js and npm. So, go to Node.js and hit install. Then, download Raw. Unzip Raw and go to that folder. To install Bower, type

$ sudo npm install -g bower

Once the dust settles, there’s a bunch of dependencies to install. Remember, you’re in the Raw folder. Type:

$ bower install

When the dust clears again, and assuming you have Python installed on your machine, fire Raw up in a server:

$ python -m SimpleHTTPServer 4000

(If you don’t have python, well, go get python. I’ll wait). Then in your browser go to

localhost:4000

And you can now do some funky visualizations of your data. There are a number of chart types packaged with Raw, but you can also develop your own – here’s the documentation. Michelle Moravec has been doing some lovely work visualizing her historical research using Raw. You should check it out.

 

Your Open Source Data Refinery

With these three pieces of data refinery infrastructure installed on your machine, or in your local digital history computer lab, you’ll have no excuse not to start adding some distant reading perspective to your method. Go. Do it now.