Topic Modeling Greek Consumerism

I’m experimenting. Here’s what I did today.

1. Justin Walsh published the data on which his book, ‘Consumerism in the Ancient World’, rests.

2. I downloaded it, and decided I would topic model it. The table, ‘Greek Vases’, has one row = one vase. Let’s start with that, though I think it might be more useful/illuminating to decide that ‘document’ might mean ‘site’ or ‘context’. But first things first; let’s sort out the workflow.

3. I deleted all columns with ‘true’ or ‘false’ values; they struck me as not useful. I concatenated all the remaining columns into a single ‘text’ column. Then, per the description on the Mallet package page for R, I added a new column ‘class’, which I left blank. So I have ‘id’, ‘class’, ‘text’. All of Walsh’s information is in the ‘text’ field.

4. I ran this code in R, using R studio:

## from http://cran.r-project.org/web/packages/mallet/mallet.pdf
library(mallet)
## Create a wrapper for the data with three elements, one for each column.
## R does some type inference, and will guess wrong, so give it hints with "colClasses".
## Note that "id" and "text" are special fields -- mallet will look there for input.
## "class" is arbitrary. We will only use that field on the R side.
documents <- read.table("modified-vases2.txt", col.names=c("id", "class", "text"),
                        colClasses=rep("character", 3), sep="\t", quote="")
## Create a mallet instance list object. Right now I have to specify the stoplist
## as a file, I can't pass in a list from R.
## This function has a few hidden options (whether to lowercase, how we
## define a token). See ?mallet.import for details.
mallet.instances <- mallet.import(documents$id, documents$text, "/Users/shawngraham/Desktop/data mining and tools/stoplist.csv",
                                  token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")
## Create a topic trainer object.
num.topics <- 20
topic.model <- MalletLDA(num.topics)

## Load our documents. We could also pass in the filename of a
## saved instance list file that we build from the command-line tools.
topic.model$loadDocuments(mallet.instances)

## Get the vocabulary, and some statistics about word frequencies.
## These may be useful in further curating the stopword list.
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)

## Optimize hyperparameters every 20 iterations,
## after 50 burn-in iterations.
topic.model$setAlphaOptimization(20, 50)

## Now train a model. Note that hyperparameter optimization is on, by default.
## We can specify the number of iterations. Here we'll use a large-ish round number.
topic.model$train(200)

## NEW: run through a few iterations where we pick the best topic for each token,
## rather than sampling from the posterior distribution.
topic.model$maximize(10)

## Get the probability of topics in documents and the probability of words in topics.
## By default, these functions return raw word counts. Here we want probabilities,
## so we normalize, and add "smoothing" so that nothing has exactly 0 probability.
doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)
topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)

## What are the top words in topic 7?
## Notice that R indexes from 1, so this will be the topic that mallet called topic 6.
mallet.top.words(topic.model, topic.words[7,])

## Show the first few documents with at least 5% of both topic 7 and topic 10.
## (Documents are rows and topics are columns in doc.topics, hence the column indexing.)
head(documents[ doc.topics[,7] > 0.05 & doc.topics[,10] > 0.05, ])

## End of Mimno's sample script(Not run)

###from my other script; above was mimno's example script
topic.docs <- t(doc.topics)
topic.docs <- topic.docs / rowSums(topic.docs)
write.csv(topic.docs, "vases-topics-docs.csv" ) 

## Get a vector containing short names for the topics
topics.labels <- rep("", num.topics)
for (topic in 1:num.topics) topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=5)$words, collapse=" ")

# have a look at keywords for each topic
topics.labels
write.csv(topics.labels, "vases-topics-labels.csv") ## "C:\\Mallet-2.0.7\\topics-labels.csv")

### do word clouds of the topics
library(wordcloud)
for(i in 1:num.topics){
  topic.top.words <- mallet.top.words(topic.model,
                                      topic.words[i,], 25)
  print(wordcloud(topic.top.words$words,
                  topic.top.words$weights,
                  c(4,.8), rot.per=0,
                  random.order=F))
}

And this is what I get:
Topic # Label
1 france greek west eating grey
2 spain ampurias neapolis girona arf
3 france rune herault colline nissan-lez-ens
4 spain huelva east greek drinking
5 france aude drinking montlaures cup
6 spain malaga settlement cup drinking
7 france drinking bouches-du-rhone settlement cup
8 france cup stemmed herault bessan
9 france marseille massalia bouches-du-rhone storage
10 spain ullastret settlement girona puig
11 france settlement mailhac drinking switzerland
12 spain badajoz cup stemless castulo
13 spain ampurias settlement girona neapolis
14 france beziers drinking cup pyrenees
15 spain krater bell arf drinking
16 transport amphora france gard massaliote
17 france settlement saint-blaise bouches-du-rhone greek
18 france marseille massalia west bouches-du-rhone
19 spain jaen drinking cemetery castulo
20 spain settlement abg eating alicante

The three-letter acronyms are ware types. The original data had location, context, ware, purpose, and dates. I still need to figure out how to get Mallet (either on the command line or in R) to treat numerals as words, but that’s something I can ignore for the moment. So what next? Map this, I guess, in physical and/or temporal space, and resolve the problem of what a ‘document’ really is for archaeological topic modeling. Here, look at the word clouds generated at the end of the script whilst I ruminate. And also a flow diagram. What it shows, I know not. Exploration, eh?

[Flow diagram of the workflow for Justin Walsh’s data, and word clouds for the topics (Rplot1–Rplot4)]
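
A thought on the numerals problem, for whenever I come back to it: mallet.import’s token.regexp argument controls what counts as a token, so adding the Unicode number class \p{N} to the pattern used in the script above ought to let dates and other numerals survive tokenization. Untested, but something like this:

mallet.instances <- mallet.import(documents$id, documents$text, "/Users/shawngraham/Desktop/data mining and tools/stoplist.csv",
                                  token.regexp = "[\\p{L}\\p{N}][\\p{L}\\p{N}\\p{P}]*[\\p{L}\\p{N}]")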

Extracting Text from PDFs; Doing OCR; all within R

I am a huge fan of Ben Marwick. He has so many useful pieces of code for the programming archaeologist or historian!

Edit July 17 1.20 pm: Mea culpa: I originally titled this post, ‘Doing OCR within R’. But, what I’m describing below – that’s not OCR. That’s extracting text from pdfs. It’s very fast and efficient, but it’s not OCR. So, brain fart. But I leave the remainder of the post as it was. For command line OCR (really, actual OCR) on a Mac, see the link to Ben Schmidt’s piece at the bottom. Sorry.

Edit July 17 10 pm: I am now an even bigger fan of Ben’s. He’s updated his script to either a) perform OCR by calling Tesseract from within R or b) grab the text layer from a pdf image. So this post no longer misleads. Thank you Ben!

Optical Character Recognition, or OCR, is something that most historians will need to use at some point when working with digital documents. That is, you will often encounter pdf files of texts that you wish to work with in more detail (digitized newspapers, for instance). Often, there is a layer within the pdf image containing the text already: if you can highlight text by clicking and dragging over the image, you can copy and paste the text from the image. But this is often not the case, or worse, you have tens or hundreds or even thousands of documents to examine. There is commercial software that can do this for you, but it can be quite expensive.

One way of doing OCR on your own machine with free tools is to use Ben Marwick’s pdf-2-text-or-csv.r script for the R programming language. Marwick’s script uses R as a wrapper for the Xpdf programme from Foolabs. Xpdf is a pdf viewer, much like Adobe Acrobat. Using Xpdf on its own can be quite tricky, so Marwick’s script will feed your pdf files to Xpdf, and have Xpdf perform the text extraction. For OCR, the script acts as a wrapper for Tesseract, which is not an easy piece of software to work with. There’s a final part to Marwick’s script that will pre-process the resulting text files for various kinds of text analysis, but you can ignore that part for now.

  1. Make sure you have R downloaded and installed on your machine (available from http://www.r-project.org/)
  2. Make sure you have Xpdf downloaded and installed (available from ftp://ftp.foolabs.com/pub/xpdf/xpdfbin-win-3.04.zip ). Make a note of where you unzipped it. In particular, you are looking for the location of the file ‘pdftotext.exe’. Also, make sure you know where ‘pdftoppm’ is located too (it’s in that download).
  3. Download and install Tesseract https://code.google.com/p/tesseract-ocr/ 
  4. Download and install Imagemagick http://www.imagemagick.org/
  5. Have a folder with the pdfs you wish to extract text from.
  6. Open R, and paste Marwick’s script into the script editor window.
  7. Make sure you adjust the path for “dest” and the path to “pdftotext.exe” to the correct location
  8. Run the script! But read the script carefully and make sure you run the bits you need. Ben has commented out the code very well, so it should be fairly straightforward.

Obviously, the above is framed for Windows users. For Mac users, the steps are all the same, except that you use the versions of Xpdf, Tesseract, and Imagemagick built for OS X, and your paths to the other software are going to be different. And of course you’re using R for Mac, which means the ‘shell’ commands have to be swapped to ‘system’! (As of July 2014, the Xpdf file for Mac that you want is at ftp://ftp.foolabs.com/pub/xpdf/xpdfbin-mac-3.04.tar.gz ) I’m not 100% certain of any other Mac/PC differences in the R script – these should only exist at those points where R is calling on other resources (rather than on R packages). Caveat lector, eh?

The full R script may be found at https://gist.github.com/benmarwick/11333467. So here is the section that does the text extraction from pdf images (i.e., where you can copy and highlight text in the pdf):

###Note: there's some preprocessing that I (sg) haven't shown here: go see the original gist

################# Wait! ####################################
# Before proceeding, make sure you have a copy of pdftotext
# on your computer! Details: https://en.wikipedia.org/wiki/Pdftotext
# Download: http://www.foolabs.com/xpdf/download.html

# Tell R what folder contains your 1000s of PDFs
dest <- "G:/somehere/with/many/PDFs"

# make a vector of PDF file names
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

# now there are a few options...

############### PDF to TXT #################################
# convert each PDF file that is named in the vector into a text file
# text file is created in the same directory as the PDFs
# note that my pdftotext.exe is in a different location to yours
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"', paste0('"', i, '"')), wait = FALSE) )

# where are the txt files you just made?
dest # in this folder
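
On a Mac, the same call works once the path points at wherever you unpacked the Xpdf binaries; a sketch only, since the path below is an assumption about your machine, not gospel:

lapply(myfiles, function(i) system(paste('"/usr/local/bin/pdftotext"', paste0('"', i, '"')), wait = FALSE) )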

And here’s the bit that does the OCR

                     ##### Wait! #####
# Before proceeding, make sure you have a copy of Tesseract
# on your computer! Details & download:
# https://code.google.com/p/tesseract-ocr/
# and a copy of ImageMagick: http://www.imagemagick.org/
# and a copy of pdftoppm on your computer!
# Download: http://www.foolabs.com/xpdf/download.html
# And then after installing those three, restart to
# ensure R can find them on your path.
# And note that this process can be quite slow...

# PDF filenames can't have spaces in them for these operations
# so let's get rid of the spaces in the filenames

sapply(myfiles, FUN = function(i){
  file.rename(from = i, to =  paste0(dirname(i), "/", gsub(" ", "", basename(i))))
})

# get the PDF file names without spaces
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

# Now we can do the OCR to the renamed PDF files. Don't worry
# if you get messages like 'Config Error: No display
# font for...' it's nothing to worry about

lapply(myfiles, function(i){
  # convert pdf to ppm (an image format), using
  shell(shQuote(paste0("pdftoppm ", i, " -f 1 -l 10 -r 600 ocrbook")))
  # convert ppm to tif ready for tesseract
  shell(shQuote(paste0("convert *.ppm ", i, ".tif")))
  # convert tif to text file
  shell(shQuote(paste0("tesseract ", i, ".tif ", i, " -l eng")))
  # delete tif file
  file.remove(paste0(i, ".tif" ))
  })

# where are the txt files you just made?
dest # in this folder
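
And for the Mac, as noted above, the shell() calls become system() calls. A sketch of the same loop under that assumption (and assuming pdftoppm, convert, and tesseract are all on your PATH):

lapply(myfiles, function(i){
  # convert pdf to ppm (an image format)
  system(paste0("pdftoppm ", i, " -f 1 -l 10 -r 600 ocrbook"))
  # convert ppm to tif ready for tesseract
  system(paste0("convert *.ppm ", i, ".tif"))
  # convert tif to text file
  system(paste0("tesseract ", i, ".tif ", i, " -l eng"))
  # delete tif file
  file.remove(paste0(i, ".tif"))
  })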

Besides showing how to do your own OCR, Marwick’s script shows some of the power of R for doing more than statistics. Mac users might be interested in Ben Schmidt’s tutorial ‘Command-line OCR on a Mac’ from his digital history graduate seminar at Northeastern University, online at http://benschmidt.org/dighist13/?page_id=129.

Government of Canada Edits

In recent days, a number of twitterbots have been set up to monitor changes to Wikipedia emerging from government IP address blocks. Seems to me that here is a window for data mining the mindset of government. Of course, there’s nothing to indicate that anything untoward is being done by the Government itself; I live in Ottawa, and I know what civil servants can get up to on their lunch break.

But let’s look at the recent changes documented by https://twitter.com/gccaedits; I’ve taken screenshots below, but you can just scroll through gccaedits’ feed. Actually, come to think of it, someone should be archiving those tweets, too. It’s only been operational for something like 3 days, but already, we see an interesting grammar/football fanatic; someone with opinions on Amanda Knox; someone setting military history right, and someone fixing the German version of Rene Levesque’s page.

Hmmm. Keep your eyes on this, especially as next year is an election year…

[Screenshots of the Wikipedia edits documented by @gccaedits, 14 July 2014]

Setting up your own Data Refinery

Refinery at Oxymoron, by Wyatt Wellman, cc by-sa 2.0. Flickr.

I’ve been playing with a Mac. I’ve been a windows person for a long time, so bear with me.

I’m setting up a number of platforms locally for data mining. But since what I’m *really* doing is smelting the ore of data scraped using things like Outwit Hub or Import.io (the ‘mining operation’, in this tortured analogy), what I’m setting up is a data refinery. Web-based services are awesome, but if you’re dealing with sensitive data (like oral history interviews, for example) you need something local – this will also help with your ethics board review. Onwards!

Voyant-Tools

You can now set Voyant-Tools up locally, keeping your data safe and sound. The documentation and downloads are all on this page. This was an incredibly easy setup on Mac. Unzip, double-click voyant-tools.jar, and boom, you’ve got Voyant-Tools puttering away in your browser. It’ll be at http://127.0.0.1:8888. You can also hit the cogwheel icon in the top right to run your corpus through all sorts of other tools that come with Voyant but aren’t there on the main layout. You’ll want ‘export corpus with other tool’. You’ll end up with a url something like http://127.0.0.1:8888/tool/RezoViz/?corpus=1404405786475.8384. You can then swap the name of any other tool into that URL (to save time). RezoViz, by the way, uses named entity extraction to construct a network of entities mentioned in the same documents. So if you upload your corpus in small-ish chunks (paragraphs; pages; every 1000 words, whatever) you can see how it all ties together this way. From the cogwheel icon on the RezoViz layout, you can get a .net file which you can then import into Gephi. How frickin’ cool is that?
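
For instance, swapping RezoViz for Cirrus (the word cloud tool) gives you something along these lines – the corpus ID here is just the one from my session:

http://127.0.0.1:8888/tool/Cirrus/?corpus=1404405786475.8384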

Overview Project


Topic modeling is all the rage, and yes, you should have MALLET or the Stanford TMT or R on your machine. But sometimes, it’s nice to just see something rather like a dendrogram of folders with progressively finer levels of self-similarity. Overview uses term frequency-inverse document frequency (tf-idf) weightings to figure out the similarity of documents. The instructions (for all platforms) are here. It’s not quite as painless as Voyant, but it’s pretty darn close. You’ll need to have Postgres – download, install, run it once, then download Overview. You need to have Java 7. (At some point, you’ll probably need to look into running multiple versions of Java, if you continue to add elements to your refinery). Then:

  1. Ctrl-Click or Right-click on Overview for Mac OS X.command and select Open. When the dialog box asks if you are sure you want to open the application, click on the Open button. From then on, you can start Overview by double-clicking on Overview for Mac OS X.command.
  2. Browse to http://localhost:9000 and log in as admin@overviewproject.org with password admin@overviewproject.org.

And you now have Overview running. You can do many many things with Overview – it’ll read pdfs, for instance, which you can then export within a csv file. You can tag folders and export those tags, to do some fun visualizations with the next part of your refinery, RAW.

Tags exported from Overview

Tags visualized. This wasn’t done with Raw (but rather, a commercial piece of software), but you get the idea.

RAW

Flow diagram in Raw, using sample movie data

Raw does wonderful things with CSV formatted data, all in your browser. You can use the webapp version; nothing gets communicated to the server. But, still, it’s nice to keep it close to home. So, you can get Raw source code here. It’s a little trickier to install than the others. First thing: you’ll need Bower. But you can’t install Bower without Node.js and npm. So, go to Node.js and hit install. Then, download Raw. Unzip Raw and go to that folder. To install Bower, type

$ sudo npm install -g bower

Once the dust settles, there’s a bunch of dependencies to install. Remember, you’re in the Raw folder. Type:

$ bower install

When the dust clears again, and assuming you have Python installed on your machine, fire Raw up in a server:

$ python -m SimpleHTTPServer 4000
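
(That’s the Python 2 incantation; if your machine has Python 3 instead – an assumption about your setup – the equivalent is:)

$ python3 -m http.server 4000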

(If you don’t have python, well, go get python. I’ll wait). Then in your browser go to

localhost:4000

And you can now do some funky visualizations of your data. There are a number of chart types packaged with Raw, but you can also develop your own – here’s the documentation. Michelle Moravec has been doing some lovely work visualizing her historical research using Raw. You should check it out.


Your Open Source Data Refinery

With these three pieces of data refinery infrastructure installed on your machine, or in your local digital history computer lab, you’ll have no excuse not to start adding some distant reading perspective to your method. Go. Do it now.

Stanford NER, extracting & visualizing patterns

This is just a quick note while I’m thinking about this. I say ‘visualizing’ patterns, but there are of course many ways of doing that. Here, I’m just going quick’n’dirty into a network.

Say you have the diplomatic correspondence of the Republic of Texas, and you suspect that there might be interesting patterns in the places named over time. You can use the Stanford Named Entity Recognition package to extract locations. Then, using some regular expressions, you can transform that output into a network file. BUT – and this is important – it’s a format that carries some baggage of its own. Anyway, first you’ll want the Correspondence. Over at The Macroscope, we’ve already written about how you can extract the patterns of correspondence between individuals using regex patterns. This doesn’t need the Stanford NER because there is an index to that correspondence, and the regex grabs & parses that information for you.

But there is no such index for locations named. So grab that document, and feed it into the NER as Michelle Moravec instructs on her blog here. In the  terminal window, as the classifier classifies Persons, Organizations, and Locations, you’ll spot blank lines between batches of categorized items (edit: there’s a classifier that’ll grab time too; that’d be quite handy to incorporate here – SG). These blanks correspond to the blanks between the letters in the original document. Copy all of the terminal output into a new Notepad++ or Textwrangler document. We’re going to trim away every line that isn’t led by LOCATION:

\n(?!LOCATION).+

and replace with nothing. (The (?!LOCATION) bit is a negative lookahead, which Notepad++ and TextWrangler both understand; a character class like [^LOCATION] won’t do the job, since it only negates single characters.) This will delete everything that doesn’t have the location tag in front. Now, let’s mark those blank lines as the start of a new letter. A thread on Stack Overflow suggests this regex to find those blank lines:

^\s*$

where:

^ is the beginning of string anchor
$ is the end of string anchor
\s is the whitespace character class
* is zero-or-more repetition

and we replace with the string new-letter.

Now we want to get all of the locations for a single letter onto a single line. Replace ‘\nLOCATION’ (the line break as well as the tag) with a comma. This budges everything onto a single line, so we need to reintroduce line breaks by replacing ‘new-letter’ with the new line character:

find: (new-letter)
replace: \n\1

I could’ve just replaced new-letter with a new-line, but I wanted to make sure that every new line did in fact start with new-letter. Now find and replace new-letter so that it’s removed. You now have a document with the same number of lines as there were letters in the original correspondence file. Now to turn it into a network file! Add the following information at the start of the file:

DL
n=721
format = nodelist1
labels embedded:
data:

DL will tell a network analysis program that we are dealing with UCINET’s DL format. N equals the number of nodes. Format=nodelist1 says, ‘this is a format where the first item on the line is connected to all the subsequent items on that line’. As a historian or archaeologist, you can see that there’s a big assumption in that format. Is it justified? That’s something to mull over. Gephi only accepts DL in format=edgelist1, that is, binary pairs. If that describes the relationship in your data, there’s a lot of legwork involved in moving from nodelist1 to edgelist1, and I’m not covering that here. Let’s imagine that, on historical grounds, nodelist1 accurately describes the relationship between locations mentioned in letters, that the first location mentioned is probably the place where the letter is being written from, or the most important place, or….

“labels embedded:” tells a network program that the labels themselves are being used as data points, and “data:” indicates that everything afterwards is the data. But how did we know how many nodes there were? You could tally up by hand; you could copy and paste your data (back when each LOCATION was listed) into a spreadsheet and use its COUNT function to find uniques; I’m lazy and just bang any old number in there, and then save it with a .dl extension. Then I open it using a small program called Keyplayer. This isn’t what the program is for, but it will give you an error message that tells you the correct number of nodes! Put that number into your DL file, and try again. If you’ve got it right, Keyplayer won’t do anything – its silence speaks volumes (you can then run an analysis in Keyplayer. If your DL file is not formatted correctly, no results!).
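
To make the format concrete, here’s a toy version of a finished file (place names invented): the first data line says that one letter mentioned Galveston, Houston, and Austin, so Galveston gets tied to the other two; the second line says another letter mentioned Austin and Washington.

DL
n=4
format = nodelist1
labels embedded:
data:
Galveston Houston Austin
Austin Washington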

You now have a DL file that you can analyze in Pajek or UCINET. If you want to visualize in Gephi, you have to get it into a DL format that Gephi can use (edgelist) or else into .net format. Open your DL file in Pajek, and then save as Pajek format (which is .net). Then open in Gephi. (Alternatively, going back a step, you can open in Keyplayer, and then within Keyplayer, hit the ‘visualize in Pajek’ button, and you’ll automatically get that transformation). (edit: if you’re on a Mac, you have to run Pajek or Ucinet with something like Winebottler. Forgot to mention that).
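
If the hand-editing gets tedious, the whole transformation could be scripted. Here’s a rough R sketch of the same steps – everything in it is an assumption on my part: the filenames are invented, it expects the NER output to have lines beginning with LOCATION and blank lines between letters, and it only copes with single-word place names.

ner <- readLines("ner-output.txt")
## number the letters: each blank line marks the start of a new one
letter <- cumsum(grepl("^\\s*$", ner)) + 1
## keep only the LOCATION lines, and strip the tag off
keep <- grepl("^LOCATION", ner)
places <- sub("^LOCATION:?\\s*", "", ner[keep])
## one line per letter, locations separated by spaces
rows <- sapply(split(places, letter[keep]), paste, collapse = " ")
## count the unique labels for the n= line, then write out the DL file
n <- length(unique(unlist(strsplit(rows, "\\s+"))))
writeLines(c("DL", paste0("n=", n), "format = nodelist1",
             "labels embedded:", "data:", rows), "locations.dl")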

Ta da!

Locations mentioned in letters of the Republic of Texas


-ing history!

Still playing with videogrep. I downloaded 25 heritage minute commercials (non-Canadians: a series of 1 minute or so clips that teach us Canucks about the morally uplifting things we’ve done in the past, things we’ve invented, bad-things-we-did-but-we’ve-patched-over-now. You get the gist.). I ran them through various pattern matches based on parts-of-speech tagging. It was hard to do anything more than that because the closed captioning (on which this all rests) was simply awful. Anyway, there’s a healthy dose of serendipity in all of this, as even after the search is done, the exact sequence the clips are reassembled in is more or less random.

And with that, I give you the result of my pattern matching for gerunds:

-ing history! A Heritage Minute Auto-Supercut.

Heritage Jam entry: PARKER

I’m sure it isn’t quite what they were expecting, but I submitted something to HeritageJam.

View it here.

PARKER is an interactive experience in procedurally extracting, uncovering, and reversing, the burial of latent semantic core archaeological knowledge. In this era of neoliberal corporatization of cultural heritage knowledge, PARKER represents the way forward for its creation and appreciation. When we must balance funding for healthcare versus that for archaeologists, in this time of reduced availability of funds, how can we not turn to data mining and revisualization of knowledge? After all, what is the insight of the individual when millions of minutes of youtube videos are being created every minute? Further, PARKER extracts the core insights of archaeology and formats them automatically for patenting, so that DRM can be affixed and rightsholder value be fully realized.

PARKER:  for the archaeology we always dreamed of.

———

This visualization is an interactive story that frames the automatic search of youtube, natural-language parsing, and automatic supercut & re-formatting of those search results to highlight the ways code can frame archaeological knowledge. It applies Sam Lavigne’s ‘videogrep’ and ‘automatic patent generator’ to results from a search for ‘archaeological burials’ retrieved from Youtube, selecting the first few results that included closed-captioning. Videogrep uses natural-language pattern matching on those captioning files to select clips from a variety of pieces, restitching them at random. The result is similar to the I-Ching or other ways of divining meaning. Similarly, the patent generator grabs the transcription and extracts the elements that fit the language of patent applications. As I have argued elsewhere, digital archaeology is not about justification of results, but rather, the deformation of the familiar.

The result is a making-strange, an uncovering, of deeper truths. Code is not neutral, and we would be wise to recognize, to engage with, the theoretical perspectives encoded in our use of digital tools – especially when dealing with the human past.


A method and apparatus for observing the rhythmic cadence; or, an algorithmic alternative archaeology

Figure 1

Figure 1. A Wretched Garret Without A Fire (at least, according to Google Images)

A method and apparatus for observing the rhythmic cadence

ABSTRACT

A method and apparatus for observing the rhythmic cadence. The devices comprises a small shop, a wretched garret, a Russian letter, a mercantile house, a third storey

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 illustrates a wretched garret without a fire.

Figure 2 is a block diagram of a fearful storm off the island.

Figure 3 illustrates a mercantile house on my own account.

Figure 4 is a perspective view of the principal events of the Trojan war.

Figure 5 is an isometric view of a poor Jew for 4 francs a week.

Figure 6 is a cross section of a thorough knowledge of the English language.

Figure 7 is a block diagram of the hard trials of my life.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention is the son of a Protestant clergyman. The device is a wretched garret without a fire. The present invention facilitates the study of a language. The invention has my book in my hand. The invention acquires a thorough knowledge of the English language.

According to a beneficial embodiment, the invention such a degree that the study. The present invention shows his incapacity for the business. The device obtains a situation as correspondent and bookkeeper. The device understood a word of the language. The present invention established a mercantile house on my own account. The invention does not venture upon its study. The device devotes the rest of my life. The present invention realizes the dream of my whole life. The present invention publishes a work on the subject.

What is claimed is:

1. A method for observing the rhythmic cadence, comprising:
a wretched garret;
a small shop; and
a Russian letter.

2. The method of claim 1, wherein said wretched garret comprises a mercantile house on my own account.

3. The method of claim 1, wherein said small shop comprises the principal events of the Trojan war.

4. The method of claim 1, wherein said Russian letter comprises a fearful storm off the island.

5. An apparatus for observing the rhythmic cadence, comprising:
a mercantile house;
a small shop;
a third storey; and
a Russian letter.

6. The apparatus of claim 5, wherein said mercantile house comprises a wretched garret without a fire.

7. The apparatus of claim 5, wherein said small shop comprises a fearful storm off the island.

8. The apparatus of claim 5, wherein said third storey comprises a thorough knowledge of the English language.

9. The apparatus of claim 5, wherein said Russian letter comprises a thorough knowledge of the English language.

—————–
Did you recognize Troy and its Remains, by Henry (Heinrich) Schliemann, in that patent abstract? I took his ‘autobiographical notice’ from the opening of his account of the work at Troy, and ran it through Sam Lavigne’s Patent Generator. It’s a bit like the I-Ching. I have it in mind that this toy could be used to distort and reflect on, draw something new from, some of the classic works of archaeology – especially from that buccaneering phase when, well, pretty much anything went. What if, instead of publishing their discoveries, the early archaeologists had patented them instead? We live in such an era now, when new forms of life (or at least, its building blocks) can be patented; when workflows can be patented; when patents can be framed so broad that a word-generator and a lawyer will bring you riches beyond compare… the early archaeologists were after fame and fortune as much as they were about knowledge of the past. This patent of Schliemann’s uses as its source text an opening sketch about the man himself, rather than his discoveries. Doesn’t a sense of him shine through? Doesn’t he seem, well, rather over-inflated? What is the rhythmic cadence, I wonder. If I can sort out the encoding, I’ll try this on some of his discussion of what he found.

(think also the computational power that went into this toy: natural language processing, pattern matching… it’s rather impressive, actually, when you think what can be built by bolting existing bits together).

Here’s Chapter 1 of Schliemann’s account of Troy. Please see the ‘detailed description of the preferred embodiments’, below.

——————-
An apparatus and method for according to the firman

ABSTRACT

An apparatus and method for according to the firman. The devices comprises a whole building, a large block

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a block diagram of the north-western end of the site.

Figure 2 is an isometric view of the second secretary of his chancellary.

Figure 3 is a perspective view of a large block of this kind.

Figure 4 is a diagrammatical view of the steep side of the hill.

Figure 5 is a schematic drawing of the native soil before the winter.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention sells me the field at any price. The device reach the native soil before the winter. The present invention is the highest mountain in the world.

What is claimed is:

1. An apparatus for according to the firman, comprising:
a whole building; and
a large block.

2. The apparatus of claim 1, wherein said whole building comprises a large block of this kind.

3. The apparatus of claim 1, wherein said large block comprises the native soil before the winter.

4. A method for according to the firman, comprising:
a large block; and
a whole building.

5. The method of claim 4, wherein said large block comprises the north-western end of the site.

6. The method of claim 4, wherein said whole building comprises the second secretary of his chancellary.

Using Storymaps.js

The Fonseca Bust, a storymap

A discussion on Twitter the other day – asking about the best way to represent ‘flowmaps’ (see DHQ&A) – led me to encounter a new toy from KnightLabs: Storymap.js. Knightlabs also provides quite a nice, fairly intuitive editor for making the storymaps. In essence, it provides a way, and a viewer, for tying various kinds of media and text to points along a map. Sounds fairly simple, right? Something that you could achieve with ‘my maps’ in Google? Well, sometimes, it’s not what you do but the way that you do it. Storymaps also allows you to upload your own large-dimension image so that you can bring the viewer around it, pointing out the detail. In the sample (so-called) ‘gigapixel’ storymap, you are brought around The Garden of Earthly Delights.

This struck me as a useful tool for my upcoming classes – both in terms of creating something that I could embed in our LMS and course website for later viewing, but also as something that the students themselves could use to support their own presentations. I also imagine using it in place of essays or blog post reflections. To that end, I whipped up two sample storymaps. One reports on an academic journal article, the other provides a synopsis of a portion of a book’s argument.

Here’s a storymap about the Fonseca Bust.

Here’s a storymap about looting Cambodian statues.

In the former, I’ve uploaded an image to a public google drive folder. It’s been turned into tiles, so as to load into the map engine that is used to jump around the story. Storymap’s own documentation suggests using Photoshop’s zoomify plugin. But if you don’t have zoomify? Go to sourceforge and get this: http://sourceforge.net/projects/zoomifyimage/ . It requires that you have Python and the Python Imaging Library (PIL) installed. Unzip zoomifyimage, and put the image that you want to use for your story in the same folder. Open your image in any image processing program, and find out how many pixels wide by high it is. Write this down. Close the program. Then, open a command prompt in the folder where you unzipped zoomify (shift+right click, ‘open command prompt here’, in Windows). At the prompt, type


ZoomifyFileProcessor.py <your_image_file>

If all goes well, nothing much seems to happen – except that you have a new folder with the name of your image, an xml file called ImageProperties.xml and one or more TileGroupN folders with your sliced and diced images. Move this entire folder (with its xml and subfolders) into your google drive. Make sure that it’s publicly viewable on the web, and take note of the hosting url. Copy and paste it somewhere handy.

see the Storymap.js documentation on this:

“If you don’t have a webserver, you can use Google Drive or Dropbox. You need the base url for your exported image tiles when you start making your gigapixel StoryMap. (show me how).”

In the Storymap.js editor, when you click on ‘make a new storymap’, you select ‘gigapixel’, and give it the url to your folder.  Enter the pixel dimensions of the complete image, and you’re good to go.

Your image could be a high-resolution google earth image; it could be a detail of a painting or a sculpture; it could be a historical map or photograph. There are also detailed instructions on running a storymap off your own server here.


Using Goldstone’s Topic Modeling R package with JSTOR’s Data for Research

Andrew Goldstone and Ted Underwood have an article on ‘the quiet transformation of literary studies’ (preprint), where they topic model a literary history journal and discuss the implications of that model for their discipline. Andrew has a blog post discussing their process and the coding that went into that process.

I’m currently playing with their code, and thought I’d share some of the things you’ll need to know if you want to try it out for yourself – get it on github. I’m assuming you’re using a Windows machine.

1. Get the right version of R. You need 3.0.3 for this. Use either the 32 bit or 64 bit version of R (both download in a single installer; when you install it, you can choose the 32 bit or the 64 bit version, depending on your machine. Choose wisely).

2. Make sure you’re using the right version of Java. If you are on a 64 bit machine, have 64 bit java; 32 bit: 32 bit java.

3. Dependencies. You need to have the rJava package, and the Mallet wrapper, installed in R. You’ll also need ‘devtools’. In the basic R gui, you can do this by clicking on packages >> install packages. Select rJava. Do the same again for Mallet. Do the same again for ‘devtools’. Now you can install Goldstone’s dfrtopics by typing, at the R command prompt

library(devtools)
install_github("dfrtopics","agoldst")

Now. Assuming that you’ve downloaded and unzipped a dataset from JSTOR (go to dfr.jstor.org to get one), here’s what you’re going to need to do. You’ll need to increase the available memory in Java for rJava to play with. You do this before you tell R to use the rJava library. I find it best to just close R, then reload it. Then, type the following, one at a time:

options(java.parameters="-Xmx2048m")
library(rJava)
library(mallet)
library(dfrtopics)

The first item in that list increases the memory heap size. If all goes well, there’ll be a little message telling you that your heap size is 2048 mb and you should really increase it to 2gb. As these are the same thing, then no worries. Now to topic model your stuff!

m <- model_documents(citations_files="[path to your]\\citations.CSV",
dirs="[path to your]\\wordcounts\\",
stoplist_file="[path to your]\\stoplist.txt",
n_topics=60)

Change n_topics to whatever you want. In the path to your files, remember to use double \\.

Now to export your materials.

output_model(m, "data")

This will create a ‘data’ folder with all of your outputs. But where? In your working directory! If you don’t know where this is, wait until the smoke clears (the prompt returns) and type

getwd()

You can use setwd() to set that to whatever you want:

setwd("c:\\path-to-your-preferred-work-directory")

You can also export all of this to work with Goldstone’s topic model browser, but that’ll be a post for another day. Open up your data folder and explore your results.
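
For a quick look at what output_model() actually wrote into that folder, plain base R will do (nothing dfrtopics-specific here):

list.files("data", recursive=TRUE)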