Assessing my upcoming seminar on the Illicit Antiquities trade, HIST4805b

So I’m putting together the syllabus for my illicit antiquities seminar. This is where I think I’m going with the course, which starts in less than a month (eep!). The first part is an attempt to revitalize my classroom blogging, and to formally tie it into the discussion within the classroom – that is, something done in advance of class in order to make the classroom discussion richer. In the second term, I want to make as much time as possible for students to pursue their own independent research, which I’m framing as an ‘unessay’ following the O’Donnell model.

~oOo~

Daylight: The Journal of #HIST4805b Studying Looted Heritage

Rationale: What we are studying is important, and what we are learning needs to be disseminated as widely as possible. In a world where ‘American Diggers’ can be a TV show, and where National Geographic (for heaven’s sake!) can seriously contemplate putting on a show that desecrates war dead for entertainment, there is a need to shed daylight. The fall term’s major assessment piece does this. You will be writing and curating a Flipboard magazine that ties our readings and discussions into the current news regarding heritage crime.

There are a number of steps to this.

  1. Each week, everyone logs into heritage.crowdmap.com and puts three new reports on the map.
  2. Each week, a different subset of the class will be the lead editors for our journal.
    1. lead editors each write an editorial that explores the issues raised in the readings, with specific reference to new reports on our crowdmap. Editorials should be 750-1000 words long.
    2. lead editors curate the Flipboard magazine so that it contains:
      1. the editorials
      2. the crowdmap reports
      3. the readings
  3. This should be completed before Monday’s class where we will discuss those readings. The lead editors will begin the class by discussing their edition of Daylight.*
  4. Each student will be a lead editor three times.

*if you can think of a better name, we’ll use that.

At the end of term you will nominate your two best pieces for grading. I will grade these for how you’ve framed your argument, for your use of evidence, and for your understanding of the issues. I will also take into account your in-class discussion of your edition of Daylight.

At the end of term you will also nominate two of your peers’ best pieces for consideration for bonus, with a single line explaining why.

This is worth 40% of your final grade.

—–

The Unessay Research Project

‘Unessay’, noun - as described by Daniel Paul O’Donnell:

“[...] the unessay is an assignment that attempts to undo the damage done by [traditional essay writing at the university level]. It works by throwing out all the rules you have learned about essay writing in the course of your primary, secondary, and post secondary education and asks you to focus instead solely on your intellectual interests and passions. In an unessay you choose your own topic, present it any way you please, and are evaluated on how compelling and effective you are.”

Which means for us:

The second term is an opportunity for exploration, and for you to use the time that you would normally spend in a classroom listening as time for active planning, researching, and learning the necessary skills, to effectively craft an ‘unessay’ of original research on a topic connected with the illicit antiquities trade. I will put together a schedule for weekly one on one or small group meetings where I can help you develop your project.

For this to work, you will have to come prepared to these meetings. This means keeping a research journal to which I will have access. You may choose to make this publicly accessible as well (and we’ll talk about why and how you might want to do that).  Periodically, we will meet as an entire class to discuss the issues we are having in our research. You will present your research formally to the class and invited visitors at the end of term – your project might not be finished at that point, but your presentation can take this into account. The project is due on the final day of term.

Grading:

Pass/Fail: Research Journal (i.e., no complete research journal, no assessment for this project). We will discuss what is involved in keeping a research journal. A Zotero library with notes would also be acceptable.

5% Presentation in class

45% Project

O’Donnell writes,

“If unessays can be about anything and there are no restrictions on format and presentation, how are they graded?

The main criteria is how well it all fits together. That is to say, how compelling and effective your work is.

An unessay is compelling when it shows some combination of the following:

  • it is as interesting as its topic and approach allows
  • it is as complete as its topic and approach allows (it doesn’t leave the audience thinking that important points are being skipped over or ignored)
  • it is truthful (any questions, evidence, conclusions, or arguments you raise are honestly and accurately presented)

In terms of presentation, an unessay is effective when it shows some combination of these attributes:

  • it is readable/watchable/listenable (i.e. the production values are appropriately high and the audience is not distracted by avoidable lapses in presentation)
  • it is appropriate (i.e. it uses a format and medium that suits its topic and approach)
  • it is attractive (i.e. it is presented in a way that leads the audience to trust the author and his or her arguments, examples, and conclusions).”

~oOo~

So that’s what I’m going with. I’m not giving points out for participation, as that has never really worked for me. There will of course be much more going on in the classroom than just what is described here, including technical tutorials on various digital tools that I think are useful and beta-testing of some other things, but my thinking is that these will see their expression in the quality of the independent research that takes place in the Winter term.

So Fall term: much reading, much discussion. Winter term: self-direction along trajectories established in the Fall. We shall see.

#hist3812a video games and simulations for historians, batting around some syllabus ideas

I’ve been batting around ideas for my video games class, trying to flesh them out some more. I put together a twine-based exploration of some of my ideas in this regard a few weeks ago; you can play it here. Anyway, what follows below is just me thinking out loud. The course runs for 12 weeks. (O my students, the version of the syllabus you should trust is the one that I am obligated to put on cuLearn).

What does Good History Through Gaming Look Like?

How do we know? Why should we care? What could we do with it, if we had it? Is it playing that matters, or is it building? Can a game foster critical play? What is critical play, anyway? ‘Close reading’ can happen not just of text, but also of code, and of experience. It pulls back the curtain (link to my essay discussing a previous iteration of this course).

Likely Topics

  1. A history of games, and of video games
  2. Historical Consciousness & Worldview
  3. Material culture, and the digital: software exists in the physical world
  4. Simulation & Practical Necromancy: representing the physical world in software
  5. Living History, LARPing, ARGs and AR: History, the Killer App
  6. Museums as gamed/gameful spaces
  7. Gamification and its bastards: or, nothing sucks the fun out of games like education
  8. Rolling your Own: Mods & Indies
  9. The politics of representation

Assessment

Which Might Include Weekly Responses & Critical Play Sessions:

  1. IF responses to readings (written using http://twinery.org)
  2. Play-throughs of others’ IF (other students; indie games in the wild)
  3. Critical play of Minecraft
  4. Critical play of ‘historical’ game of your choice
  5. Critical play of original SimCity (which can be downloaded or played online here). We’ll look at its source code, too, I think. Or we might play a version of Civilization. Haven’t decided yet.
  6. Critical boardgame play
  7. ARIS WW1 Simulation by Alex Crudas & Tyler Sinclair

Yes. I am going to have you play video games, for grades. But you will be looking for procedural rhetorics, worldviews, constraints, and other ways we share authority with algorithms (and who writes these, anyway?) when we consume digital representations of history. Consume? Is that the right verb? Co-create? Receive?

Major Works

  1. Midterm: IF your favourite academic paper that you have written, such that a player playing it could argue the other sides you ignored in your linear paper. Construct it in such a way that the player/reader can move through it at will and still engage with a coherent argument. (See for example ‘Buried’ http://taracopplestone.co.uk/buried.html). You will use the Twine platform: http://twinery.org
  2. Summative Project: Minecrafted History
    1. You will design and build an immersive experience in Minecraft that expresses ‘good history through gaming’. There will be checkpoints to meet over the course of the term. Worlds will be built by teams, in groups of 5. Worlds can be picked from three broad themes:
      THE HISTORY OF THE OTTAWA VALLEY
      THE CANADIANS ON THE WESTERN FRONT
      COLONIZATION AND RESISTANCE IN ROMAN BRITAIN  (…look, I was a Roman archaeologist, once…)
    2. You will need to obtain source maps; you will digitize these and translate them into Minecraft. We will in all likelihood be using Github to manage your projects. The historical challenge will be to frame the game play within the world that you have created such that it expresses good history. You will need to keep track of every decision you make and why, and think through what the historical implications are of those decisions.
    3. The final build will be accompanied by a paradata document that discusses your build, details all sources used (Harvard style), references all appropriate literature, and explains how playing your world creates ‘good history’ for the player. This document should reference Fogu, Kee et al., and the papers in Elliot and Kappell at a minimum. More information about ‘paradata’, along with examples, may be found at http://heritagejam.org/what-are-paradata. The build is due in the first session of the last week of term, so that we can all play each others’ worlds. The in-class discussion that follows in the second session is also part of this project’s grade. Your work-in-progress may also be presented at Carleton’s GIS Day (3rd Wednesday in November).
    4. (These worlds will be made publicly available at the end of the term, ideally for local high school history classes to use. Many people at the university are interested to see what we come up with, too. No pressure).

So that’s what I’m thinking, with approximately 1 month to go until term starts. We’ve got Minecraft.edu installed in the Gaming Lab in the Discovery Centre in the Library, we’ve got logins and remote access all sorted out, I have most of the readings set … it’s coming together. Speaking of readings, we’ll use this as our bible:

Playing with the Past

and will probably dip into these:

Play the Past

PastPlay

… sensing a theme…

Topic Modeling Greek Consumerism

I’m experimenting. Here’s what I did today.

1. Justin Walsh published the data on which his book, ‘Consumerism in the Ancient World’, rests.

2. I downloaded it, and decided I would topic model it. The table, ‘Greek Vases’, has one row = one vase. Let’s start with that, though I think it might be more useful/illuminating to decide that ‘document’ might mean ‘site’ or ‘context’. But first things first; let’s sort out the workflow.

3. I deleted all columns with ‘true’ or ‘false’ values; they struck me as not useful. I concatenated the remaining columns into a single ‘text’ column. Then, per the description on the Mallet package page for R, I added a new column ‘class’, which I left blank. So I have ‘id’, ‘class’, and ‘text’. All of Walsh’s information is in the ‘text’ field.
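
Here is a rough, hedged sketch of that preparation step in R. The input file name and the column names are placeholders standing in for Walsh’s actual table, so treat it as illustrative only:

## Hedged sketch of the preparation described above; "greek-vases.csv" and its
## column names are placeholders, not Walsh's actual headings.
vases <- read.csv("greek-vases.csv", stringsAsFactors = FALSE)

## drop the true/false columns
is.flag <- sapply(vases, function(x) all(x %in% c("true", "false", TRUE, FALSE)))
vases <- vases[, !is.flag]

## concatenate everything except the id into a single 'text' field
text <- apply(vases[, setdiff(names(vases), "id")], 1, paste, collapse = " ")
out <- data.frame(id = vases$id, class = "", text = text, stringsAsFactors = FALSE)

## tab-separated, no header, no quotes: the format the read.table() call below expects
write.table(out, "modified-vases2.txt", sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)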

4. I ran this code in R, using R studio:

## from http://cran.r-project.org/web/packages/mallet/mallet.pdf
library(mallet)
## Create a wrapper for the data with three elements, one for each column.
## R does some type inference, and will guess wrong, so give it hints with "colClasses".
## Note that "id" and "text" are special fields -- mallet will look there for input.
## "class" is arbitrary. We will only use that field on the R side.
documents <- read.table("modified-vases2.txt", col.names=c("id", "class", "text"),
                        colClasses=rep("character", 3), sep="\t", quote="")
## Create a mallet instance list object. Right now I have to specify the stoplist
## as a file, I can't pass in a list from R.
## This function has a few hidden options (whether to lowercase, how we
## define a token). See ?mallet.import for details.
mallet.instances <- mallet.import(documents$id, documents$text, "/Users/shawngraham/Desktop/data mining and tools/stoplist.csv",
                                  token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")
## Create a topic trainer object.
num.topics <- 20
topic.model <- MalletLDA(num.topics)

## Load our documents. We could also pass in the filename of a
## saved instance list file that we build from the command-line tools.
topic.model$loadDocuments(mallet.instances)

## Get the vocabulary, and some statistics about word frequencies.
## These may be useful in further curating the stopword list.
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)

## Optimize hyperparameters every 20 iterations,
## after 50 burn-in iterations.
topic.model$setAlphaOptimization(20, 50)

## Now train a model. Note that hyperparameter optimization is on, by default.
## We can specify the number of iterations. Here we'll use a large-ish round number.
topic.model$train(200)

## NEW: run through a few iterations where we pick the best topic for each token,
## rather than sampling from the posterior distribution.
topic.model$maximize(10)

## Get the probability of topics in documents and the probability of words in topics.
## By default, these functions return raw word counts. Here we want probabilities,
## so we normalize, and add "smoothing" so that nothing has exactly 0 probability.
doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)
topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)

## What are the top words in topic 7?
## Notice that R indexes from 1, so this will be the topic that mallet called topic 6.
mallet.top.words(topic.model, topic.words[7,])

## Show the first few documents with at least 5% probability of both topic 7 and topic 10.
## (Note: doc.topics has one row per document and one column per topic.)
head(documents[ doc.topics[,7] > 0.05 & doc.topics[,10] > 0.05, ])

## End of Mimno's sample script (not run)

### From my other script; the above was Mimno's example script
topic.docs <- t(doc.topics)
topic.docs <- topic.docs / rowSums(topic.docs)
write.csv(topic.docs, "vases-topics-docs.csv" ) 

## Get a vector containing short names for the topics
topics.labels <- rep("", num.topics)
for (topic in 1:num.topics) topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=5)$words, collapse=" ")

# have a look at keywords for each topic
topics.labels
write.csv(topics.labels, "vases-topics-labels.csv") ## "C:\\Mallet-2.0.7\\topics-labels.csv")

### do word clouds of the topics
library(wordcloud)
for(i in 1:num.topics){
  topic.top.words <- mallet.top.words(topic.model,
                                      topic.words[i,], 25)
  print(wordcloud(topic.top.words$words,
                  topic.top.words$weights,
                  c(4,.8), rot.per=0,
                  random.order=F))
}

And this is what I get:
Topic # Label
1 france greek west eating grey
2 spain ampurias neapolis girona arf
3 france rune herault colline nissan-lez-ens
4 spain huelva east greek drinking
5 france aude drinking montlaures cup
6 spain malaga settlement cup drinking
7 france drinking bouches-du-rhone settlement cup
8 france cup stemmed herault bessan
9 france marseille massalia bouches-du-rhone storage
10 spain ullastret settlement girona puig
11 france settlement mailhac drinking switzerland
12 spain badajoz cup stemless castulo
13 spain ampurias settlement girona neapolis
14 france beziers drinking cup pyrenees
15 spain krater bell arf drinking
16 transport amphora france gard massaliote
17 france settlement saint-blaise bouches-du-rhone greek
18 france marseille massalia west bouches-du-rhone
19 spain jaen drinking cemetery castulo
20 spain settlement abg eating alicante

The three-letter acronyms are ware types. The original data had location, context, ware, purpose, and dates. I still need to figure out how to get MALLET (either on the command line or in R) to treat numerals as words, but that’s something I can ignore for the moment. So what next? Map this, I guess, in physical and/or temporal space, and resolve the problem of what a ‘document’ really is for archaeological topic modeling. Here, look at the word clouds generated at the end of the script whilst I ruminate. And also a flow diagram. What it shows, I know not. Exploration, eh?

[Word cloud plots for the topics (Rplot1–Rplot4) and a flow diagram of the Walsh data (justin-walsh-data-flow) appeared here.]
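
As for what a ‘document’ should be: one hedged way to experiment would be to collapse all of the rows for a given site into a single document before handing the file to MALLET. The ‘Site’ column name below is hypothetical; substitute whatever Walsh’s table actually uses. Building on the preparation sketch above:

## Hedged sketch: treat each site as one 'document' by collapsing its rows.
## 'Site' is a hypothetical column name; substitute the real one from Walsh's table.
site.text <- tapply(text, vases$Site, paste, collapse = " ")
site.docs <- data.frame(id = names(site.text), class = "",
                        text = as.character(site.text), stringsAsFactors = FALSE)
write.table(site.docs, "vases-by-site.txt", sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)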

Extracting Text from PDFs; Doing OCR; all within R

I am a huge fan of Ben Marwick. He has so many useful pieces of code for the programming archaeologist or historian!

Edit July 17 1.20 pm: Mea culpa: I originally titled this post, ‘Doing OCR within R’. But, what I’m describing below – that’s not OCR. That’s extracting text from pdfs. It’s very fast and efficient, but it’s not OCR. So, brain fart. But I leave the remainder of the post as it was. For command line OCR (really, actual OCR) on a Mac, see the link to Ben Schmidt’s piece at the bottom. Sorry.

Edit July 17 10 pm: I am now an even bigger fan of Ben’s. He’s updated his script to either a) perform OCR by calling Tesseract from within R or b) grab the text layer from a pdf image. So this post no longer misleads. Thank you Ben!

Optical Character Recognition, or OCR, is something that most historians will need to use at some point when working with digital documents. That is, you will often encounter pdf files of texts that you wish to work with in more detail (digitized newspapers, for instance). Often, there is a layer within the pdf image containing the text already: if you can highlight text by clicking and dragging over the image, you can copy and paste the text from the image. But this is often not the case, or worse, you have tens or hundreds or even thousands of documents to examine. There is commercial software that can do this for you, but it can be quite expensive.

One way of doing OCR on your own machine with free tools is to use Ben Marwick’s pdf-2-text-or-csv.r script for the R programming language. Marwick’s script uses R as a wrapper for the Xpdf programme from Foolabs. Xpdf is a pdf viewer, much like Adobe Acrobat. Using Xpdf on its own can be quite tricky, so Marwick’s script will feed your pdf files to Xpdf, and have Xpdf perform the text extraction. For OCR, the script acts as a wrapper for Tesseract, which is not an easy piece of software to work with. There’s a final part to Marwick’s script that will pre-process the resulting text files for various kinds of text analysis, but you can ignore that part for now.

  1. Make sure you have R downloaded and installed on your machine (available from http://www.r-project.org/)
  2. Make sure you have Xpdf downloaded and installed (available from ftp://ftp.foolabs.com/pub/xpdf/xpdfbin-win-3.04.zip ). Make a note of where you unzipped it. In particular, you are looking for the location of the file ‘pdftotext.exe’. Also, make sure you know where ‘pdftoppm’ is located too (it’s in that download).
  3. Download and install Tesseract https://code.google.com/p/tesseract-ocr/ 
  4. Download and install Imagemagick http://www.imagemagick.org/
  5. Have a folder with the pdfs you wish to extract text from.
  6. Open R, and paste Marwick’s script into the script editor window.
  7. Make sure you adjust the path for “dest” and the path to “pdftotext.exe” to the correct location
  8. Run the script! But read the script carefully and make sure you run the bits you need. Ben has commented out the code very well, so it should be fairly straightforward.

Obviously, the above is framed for Windows users. For Mac users, the steps are all the same, except that you use the versions of Xpdf, Tesseract, and Imagemagick built for OS X, and your paths to the other software are going to be different. And of course you’re using R for Mac, which means the ‘shell’ commands have to be swapped to ‘system’! (As of July 2014, the Xpdf file for Mac that you want is at ftp://ftp.foolabs.com/pub/xpdf/xpdfbin-mac-3.04.tar.gz ) I’m not 100% certain of any other Mac/PC differences in the R script – these should only exist at those points where R is calling on other resources (rather than on R packages). Caveat lector, eh?

The full R script may be found at https://gist.github.com/benmarwick/11333467. Here is the section that does the text extraction from pdf images (i.e., when you can already highlight and copy text in the pdf):

###Note: there's some preprocessing that I (sg) haven't shown here: go see the original gist

################# Wait! ####################################
# Before proceeding, make sure you have a copy of pdf2text
# on your computer! Details: https://en.wikipedia.org/wiki/Pdftotext
# Download: http://www.foolabs.com/xpdf/download.html

# Tell R what folder contains your 1000s of PDFs
dest <- "G:/somehere/with/many/PDFs"

# make a vector of PDF file names
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

# now there are a few options...

############### PDF to TXT #################################
# convert each PDF file that is named in the vector into a text file
# text file is created in the same directory as the PDFs
# note that my pdftotext.exe is in a different location to yours
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"', paste0('"', i, '"')), wait = FALSE) )

# where are the txt files you just made?
dest # in this folder

And here’s the bit that does the OCR:

                     ##### Wait! #####
# Before proceeding, make sure you have a copy of Tesseract
# on your computer! Details & download:
# https://code.google.com/p/tesseract-ocr/
# and a copy of ImageMagick: http://www.imagemagick.org/
# and a copy of pdftoppm on your computer!
# Download: http://www.foolabs.com/xpdf/download.html
# And then after installing those three, restart to
# ensure R can find them on your path.
# And note that this process can be quite slow...

# PDF filenames can't have spaces in them for these operations
# so let's get rid of the spaces in the filenames

sapply(myfiles, FUN = function(i){
  file.rename(from = i, to =  paste0(dirname(i), "/", gsub(" ", "", basename(i))))
})

# get the PDF file names without spaces
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

# Now we can do the OCR to the renamed PDF files. Don't worry
# if you get messages like 'Config Error: No display
# font for...' it's nothing to worry about

lapply(myfiles, function(i){
  # convert pdf to ppm (an image format), using
  shell(shQuote(paste0("pdftoppm ", i, " -f 1 -l 10 -r 600 ocrbook")))
  # convert ppm to tif ready for tesseract
  shell(shQuote(paste0("convert *.ppm ", i, ".tif")))
  # convert tif to text file
  shell(shQuote(paste0("tesseract ", i, ".tif ", i, " -l eng")))
  # delete tif file
  file.remove(paste0(i, ".tif" ))
  })

# where are the txt files you just made?
dest # in this folder
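
As noted above, on a Mac the shell() calls need to be swapped for system(). A hedged sketch of the same loop follows, assuming pdftoppm, ImageMagick’s convert, and tesseract are all on your path; your mileage (and paths) may vary:

## Hedged Mac variant of the OCR loop above: shell() swapped for system(),
## assuming pdftoppm, convert (ImageMagick), and tesseract are on your PATH.
lapply(myfiles, function(i){
  # convert pdf to ppm (an image format)
  system(paste0("pdftoppm ", i, " -f 1 -l 10 -r 600 ocrbook"))
  # convert ppm to tif ready for tesseract
  system(paste0("convert *.ppm ", i, ".tif"))
  # convert tif to text file
  system(paste0("tesseract ", i, ".tif ", i, " -l eng"))
  # delete tif file
  file.remove(paste0(i, ".tif"))
})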

Besides showing how to do your own OCR, Marwick’s script shows some of the power of R for doing more than statistics. Mac users might be interested in Ben Schmidt’s tutorial ‘Command-line OCR on a Mac’ from his digital history graduate seminar at Northeastern University, online at http://benschmidt.org/dighist13/?page_id=129.

Government of Canada Edits

In recent days, a number of twitterbots have been set up to monitor changes to Wikipedia emerging from government IP address blocks. Seems to me that here is a window for data mining the mindset of government. Of course, there’s nothing to indicate that anything untoward is being done by the Government itself; I live in Ottawa, and I know what civil servants can get up to on their lunch break.

But let’s look at the recent changes documented by https://twitter.com/gccaedits; I’ve taken screenshots below, but you can just scroll through gccaedits’ feed. Actually, come to think of it, someone should be archiving those tweets, too. It’s only been operational for something like 3 days, but already, we see an interesting grammar/football fanatic; someone with opinions on Amanda Knox; someone setting military history right, and someone fixing the German version of Rene Levesque’s page.
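
(On archiving: a hedged sketch of how one might grab @gccaedits’ timeline with the twitteR package. The credential strings are placeholders you’d get by registering an app with Twitter, and the details here are my assumption rather than a tested recipe.)

## Hedged sketch: archive @gccaedits tweets with the twitteR package.
## The credential values are placeholders; register an app with Twitter to get real ones.
library(twitteR)
setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
tweets <- userTimeline("gccaedits", n = 3200, includeRts = TRUE)
write.csv(twListToDF(tweets), "gccaedits-archive.csv", row.names = FALSE)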

Hmmm. Keep your eyes on this, especially as next year is an election year…

[Screenshots of recent @gccaedits tweets, 14 July 2014, appeared here.]

Setting up your own Data Refinery

Refinery at Oxymoron, by Wyatt Wellman, cc by-sa 2.0. Flickr.

I’ve been playing with a Mac. I’ve been a windows person for a long time, so bear with me.

I’m setting up a number of platforms locally for data mining. But since what I’m *really* doing is smelting the ore of data scraped using things like Outwit Hub or Import.io (the ‘mining operation’, in this tortured analogy), what I’m setting up is a data refinery. Web-based services are awesome, but if you’re dealing with sensitive data (like oral history interviews, for example) you need something local – this will also help with your ethics board review. Onwards!

Voyant-Tools

You can now set Voyant-Tools up locally, keeping your data safe and sound. The documentation and downloads are all on this page. This was an incredibly easy setup on Mac. Unzip, double-click voyant-tools.jar, and boom, you’ve got Voyant-Tools puttering away in your browser. It’ll be at http://127.0.0.1:8888. You can also hit the cogwheel icon in the top right to run your corpus through all sorts of other tools that come with Voyant but aren’t there on the main layout. You’ll want ‘export corpus with other tool’. You’ll end up with a URL something like http://127.0.0.1:8888/tool/RezoViz/?corpus=1404405786475.8384. You can then swap the name of any other tool into that URL (to save time). RezoViz, by the way, uses named entity extraction to construct a network of entities mentioned in the same documents. So if you upload your corpus in small-ish chunks (paragraphs; pages; every 1000 words; whatever) you can see how it all ties together this way. From the cogwheel icon on the RezoViz layout, you can get a .net file which you can then import into Gephi. How frickin’ cool is that?

Overview Project


Topic modeling is all the rage, and yes, you should have MALLET or the Stanford TMT or R on your machine. But sometimes it’s nice to just see something rather like a dendrogram of folders with progressively finer levels of self-similarity. Overview uses term frequency–inverse document frequency (tf-idf) weightings to figure out the similarity of documents (a minimal sketch of the idea follows after the setup steps below). The instructions (for all platforms) are here. It’s not quite as painless as Voyant, but it’s pretty darn close. You’ll need to have Postgres – download, install, run it once, then download Overview. You need to have Java 7. (At some point, you’ll probably need to look into running multiple versions of Java, if you continue to add elements to your refinery). Then:

  1. Ctrl-Click or Right-click on Overview for Mac OS X.command and select Open. When the dialog box asks if you are sure you want to open the application, click on the Open button. From then on, you can start Overview by double-clicking on Overview for Mac OS X.command.
  2. Browse to http://localhost:9000 and log in as admin@overviewproject.org with password admin@overviewproject.org.

And you now have Overview running. You can do many many things with Overview – it’ll read pdfs, for instance, which you can then export within a csv file. You can tag folders and export those tags, to do some fun visualizations with the next part of your refinery, RAW.
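
About that tf-idf weighting: here is a toy sketch of the idea in R, just to make the intuition concrete. It is emphatically not Overview’s actual implementation, and the three little documents are made up:

## Toy illustration of tf-idf weighting and document similarity; not Overview's code.
docs <- c("pottery trade amphora wine", "amphora wine trade", "temple ritual offering")
tokens <- strsplit(docs, "\\s+")
vocab <- sort(unique(unlist(tokens)))

## term frequency matrix: one row per document, one column per word
tf <- t(sapply(tokens, function(x) table(factor(x, levels = vocab))))

## inverse document frequency: words found in fewer documents get more weight
idf <- log(length(docs) / colSums(tf > 0))
tfidf <- sweep(tf, 2, idf, "*")

## cosine similarity between documents (1 = same weighting, 0 = nothing shared)
norms <- sqrt(rowSums(tfidf^2))
round((tfidf %*% t(tfidf)) / (norms %o% norms), 2)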

Tags exported from Overview

Tags visualized. This wasn’t done with Raw (but rather, a commercial piece of software), but you get the idea.

RAW

Flow diagram in Raw, using sample movie data

Flow diagram in Raw, using sample movie data

Raw does wonderful things with CSV formatted data, all in your browser. You can use the webapp version; nothing gets communicated to the server. But, still, it’s nice to keep it close to home. So, you can get Raw source code here. It’s a little trickier to install than the others. First thing: you’ll need Bower. But you can’t install Bower without Node.js and npm. So, go to Node.js and hit install. Then, download Raw. Unzip Raw and go to that folder. To install Bower, type

$ sudo npm install -g bower

Once the dust settles, there’s a bunch of dependencies to install. Remember, you’re in the Raw folder. Type:

$ bower install

When the dust clears again, and assuming you have Python installed on your machine, fire Raw up in a server:

$ python -m SimpleHTTPServer 4000

(If you don’t have python, well, go get python. I’ll wait). Then in your browser go to

localhost:4000

And you can now do some funky visualizations of your data. There are a number of chart types packaged with Raw, but you can also develop your own – here’s the documentation. Michelle Moravec has been doing some lovely work visualizing her historical research using Raw. You should check it out.

 

Your Open Source Data Refinery

With these three pieces of data refinery infrastructure installed on your machine, or in your local digital history computer lab, you’ll have no excuse not to start adding some distant reading perspective to your method. Go. Do it now.

Stanford NER, extracting & visualizing patterns

This is just a quick note while I’m thinking about this. I say ‘visualizing’ patterns, but there are of course many ways of doing that. Here, I’m just going quick’n’dirty into a network.

Say you have the diplomatic correspondence of the Republic of Texas, and you suspect that there might be interesting patterns in the places named over time. You can use the Stanford Named Entity Recognition package to extract locations. Then, using some regular expressions, you can transform that output into a network file. BUT – and this is important – it’s a format that carries some baggage of its own. Anyway, first you’ll want the Correspondence. Over at The Macroscope, we’ve already written about how you can extract the patterns of correspondence between individuals using regex patterns. This doesn’t need the Stanford NER because there is an index to that correspondence, and the regex grabs & parses that information for you.

But there is no such index for locations named. So grab that document, and feed it into the NER as Michelle Moravec instructs on her blog here. In the terminal window, as the classifier classifies Persons, Organizations, and Locations, you’ll spot blank lines between batches of categorized items (edit: there’s a classifier that’ll grab time too; that’d be quite handy to incorporate here – SG). These blanks correspond to the blanks between the letters in the original document. Copy all of the terminal output into a new Notepad++ or Textwrangler document. We’re going to trim away every line that isn’t led by LOCATION:

\n[^LOCATION].+

and replace with nothing. This will delete everything that doesn’t have the location tag in front. Now, let’s mark those blank lines as the start of a new letter. A thread on Stack Overflow suggests this regex to find those blank lines:

^\s*$

where:

^ is the beginning of string anchor
$ is the end of string anchor
\s is the whitespace character class
* is zero-or-more repetition

and we replace with the string new-letter.

Now we want to get all of the locations for a single letter into a single line. Replace ‘LOCATION’ with a comma. This budges everything into a single line, so we need to reintroduce line breaks, by replacing ‘new-letter’ with the new line character:

find: (new-letter)
replace \n(\1)

I could’ve just replaced new-letter with a new-line, but I wanted to make sure that every new line did in fact start with new-letter. Now find and replace new-letter so that it’s removed. You now have a document with the same number of lines as original letters in the original correspondence file. Now to turn it into a network file! Add the following information at the start of the file:

DL
n=721
format = nodelist1
labels embedded:
data:

DL will tell a network analysis program that we are dealing with UCINET’s DL format. N equals the number of nodes. Format=nodelist1 says, ‘this is a format where the first item on the line is connected to all the subsequent items on that line’. As a historian or archaeologist, you can see that there’s a big assumption in that format. Is it justified? That’s something to mull over. Gephi only accepts DL in format=edgelist1, that is, binary pairs. If that describes the relationship in your data, there’s a lot of legwork involved in moving from nodelist1 to edgelist1, and I’m not covering that here. Let’s imagine that, on historical grounds, nodelist1 accurately describes the relationship between locations mentioned in letters, that the first location mentioned is probably the place where the letter is being written from, or the most important place, or….

“labels embedded:” tells a network program that the labels themselves are being used as data points, and “data:” indicates that everything afterwards is the data. But how did we know how many nodes there were? You could tally up by hand; you could copy and paste your data (back when each LOCATION was listed) into a spreadsheet and use its COUNT function to find uniques; I’m lazy and just bang any old number in there, and then save it with a .dl extension. Then I open it using a small program called Keyplayer. This isn’t what the program is for, but it will give you an error message that tells you the correct number of nodes! Put that number into your DL file, and try again. If you’ve got it right, Keyplayer won’t do anything – its silence speaks volumes (you can then run an analysis in Keyplayer; if your DL file is not formatted correctly, no results!).
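
If you’d rather compute the node count than guess, here’s a hedged R sketch. It assumes a file called locations.txt holding the cleaned one-line-per-letter output from above; the file name is just a placeholder:

## Hedged sketch: count the unique place names and write the DL header automatically.
## "locations.txt" stands in for the cleaned, one-line-per-letter file described above.
lines <- readLines("locations.txt")
lines <- lines[nzchar(trimws(lines))]                    # drop any empty lines
places <- trimws(unlist(strsplit(lines, ",")))
n <- length(unique(places[nzchar(places)]))              # number of distinct nodes
header <- c("DL", paste0("n=", n), "format = nodelist1", "labels embedded:", "data:")
writeLines(c(header, lines), "locations.dl")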

You now have a DL file that you can analyze in Pajek or UCINET. If you want to visualize in Gephi, you have to get it into a DL format that Gephi can use (edgelist) or else into .net format. Open your DL file in Pajek, and then save as Pajek format (which is .net). Then open in Gephi. (Alternatively, going back a step, you can open in Keyplayer, and then within Keyplayer, hit the ‘visualize in Pajek’ button, and you’ll automatically get that transformation). (edit: if you’re on a Mac, you have to run Pajek or Ucinet with something like Winebottler. Forgot to mention that).

Ta da!

Locations mentioned in letters of the Republic of Texas

 

 

-ing history!

Still playing with videogrep. I downloaded 25 Heritage Minute commercials (non-Canadians: a series of one-minute-or-so clips that teach us Canucks about the morally uplifting things we’ve done in the past, things we’ve invented, bad-things-we-did-but-we’ve-patched-over-now. You get the gist.). I ran them through various pattern matches based on parts-of-speech tagging. It was hard to do anything more than that because the closed captioning (on which this all rests) was simply awful. Anyway, there’s a healthy dose of serendipity in all of this, as even after the search is done, the exact sequence the clips are reassembled in is more or less random.

And with that, I give you the result of my pattern matching for gerunds:

-ing history! A Heritage Minute Auto-Supercut.

Heritage Jam entry: PARKER

I’m sure it isn’t quite what they were expecting, but I submitted something to HeritageJam.

View it here.

PARKER is an interactive experience in procedurally extracting, uncovering, and reversing, the burial of latent semantic core archaeological knowledge. In this era of neoliberal corporatization of cultural heritage knowledge, PARKER represents the way forward for its creation and appreciation. When we must balance funding for healthcare versus that for archaeologists, in this time of reduced availability of funds, how can we not turn to data mining and revisualization of knowledge? After all, what is the insight of the individual when millions of minutes of youtube videos are being created every minute? Further, PARKER extracts the core insights of archaeology and formats them automatically for patenting, so that DRM can be affixed and rightsholder value be fully realized.

PARKER:  for the archaeology we always dreamed of.

———

This visualization is an interactive story that frames the automatic search of YouTube, natural-language parsing, and the automatic supercutting and re-formatting of those search results, to highlight the ways code can frame archaeological knowledge. It applies Sam Lavigne’s ‘videogrep’ and ‘automatic patent generator’ to results from a search for ‘archaeological burials’ retrieved from YouTube, selecting the first few results that included closed-captioning. Videogrep uses natural-language pattern matching on those captioning files to select clips from a variety of pieces, restitching them at random. The result is similar to an I-Ching or other ways of divining meaning. Similarly, the patent generator grabs the transcription and reworks elements of it into the language of patent applications. As I have argued elsewhere, digital archaeology is not about justification of results, but rather the deformation of the familiar.

The result is a making-strange, an uncovering, of deeper truths. Code is not neutral, and we would be wise to recognize, to engage with, the theoretical perspectives encoded in our use of digital tools – especially when dealing with the human past.

 

A method and apparatus for observing the rhythmic cadence; or, an algorithmic alternative archaeology


Figure 1. A Wretched Garret Without A Fire (at least, according to Google Images)

A method and apparatus for observing the rhythmic cadence

ABSTRACT

A method and apparatus for observing the rhythmic cadence. The devices comprises a small shop, a wretched garret, a Russian letter, a mercantile house, a third storey

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 illustrates a wretched garret without a fire.

Figure 2 is a block diagram of a fearful storm off the island.

Figure 3 illustrates a mercantile house on my own account.

Figure 4 is a perspective view of the principal events of the Trojan war.

Figure 5 is an isometric view of a poor Jew for 4 francs a week.

Figure 6 is a cross section of a thorough knowledge of the English language.

Figure 7 is a block diagram of the hard trials of my life.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention is the son of a Protestant clergyman. The device is a wretched garret without a fire. The present invention facilitates the study of a language. The invention has my book in my hand. The invention acquires a thorough knowledge of the English language.

According to a beneficial embodiment, the invention such a degree that the study. The present invention shows his incapacity for the business. The device obtains a situation as correspondent and bookkeeper. The device understood a word of the language. The present invention established a mercantile house on my own account. The invention does not venture upon its study. The device devotes the rest of my life. The present invention realizes the dream of my whole life. The present invention publishes a work on the subject.

What is claimed is:

1. A method for observing the rhythmic cadence, comprising:
a wretched garret;
a small shop; and
a Russian letter.

2. The method of claim 1, wherein said wretched garret comprises a mercantile house on my own account.

3. The method of claim 1, wherein said small shop comprises the principal events of the Trojan war.

4. The method of claim 1, wherein said Russian letter comprises a fearful storm off the island.

5. An apparatus for observing the rhythmic cadence, comprising:
a mercantile house;
a small shop;
a third storey; and
a Russian letter.

6. The apparatus of claim 5, wherein said mercantile house comprises a wretched garret without a fire.

7. The apparatus of claim 5, wherein said small shop comprises a fearful storm off the island.

8. The apparatus of claim 5, wherein said third storey comprises a thorough knowledge of the English language.

9. The apparatus of claim 5, wherein said Russian letter comprises a thorough knowledge of the English language.

—————–
Did you recognize Troy and its Remains, by Henry (Heinrich) Schliemann, in that patent abstract? I took his ‘autobiographical notice’ from the opening of his account of the work at Troy, and ran it through Sam Lavigne’s Patent Generator. It’s a bit like the I-Ching. I have it in mind that this toy could be used to distort and reflect on, draw something new from, some of the classic works of archaeology – especially from that buccaneering phase when, well, pretty much anything went. What if, instead of publishing their discoveries, the early archaeologists had patented them instead? We live in such an era now, when new forms of life (or at least, its building blocks) can be patented; when workflows can be patented; when patents can be framed so broad that a word-generator and a lawyer will bring you riches beyond compare… the early archaeologists were after fame and fortune as much as they were about knowledge of the past. This patent of Schliemann’s uses as its source text an opening sketch about the man himself, rather than his discoveries. Doesn’t a sense of him shine through? Doesn’t he seem, well, rather over-inflated? What is the rhythmic cadence, I wonder. If I can sort out the encoding, I’ll try this on some of his discussion of what he found.

(think also the computational power that went into this toy: natural language processing, pattern matching… it’s rather impressive, actually, when you think what can be built by bolting existing bits together).

Here’s Chapter 1 of Schliemann’s account of Troy, run through the same patent generator. Please see the ‘detailed description of the preferred embodiments’, below.

——————-
An apparatus and method for according to the firman

ABSTRACT

An apparatus and method for according to the firman. The devices comprises a whole building, a large block

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a block diagram of the north-western end of the site.

Figure 2 is an isometric view of the second secretary of his chancellary.

Figure 3 is a perspective view of a large block of this kind.

Figure 4 is a diagrammatical view of the steep side of the hill.

Figure 5 is a schematic drawing of the native soil before the winter.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention sells me the field at any price. The device reach the native soil before the winter. The present invention is the highest mountain in the world.

What is claimed is:

1. An apparatus for according to the firman, comprising:
a whole building; and
a large block.

2. The apparatus of claim 1, wherein said whole building comprises a large block of this kind.

3. The apparatus of claim 1, wherein said large block comprises the native soil before the winter.

4. A method for according to the firman, comprising:
a large block; and
a whole building.

5. The method of claim 4, wherein said large block comprises the north-western end of the site.

6. The method of claim 4, wherein said whole building comprises the second secretary of his chancellary.