Stanford NER, extracting & visualizing patterns

This is just a quick note while I’m thinking about this. I say ‘visualizing’ patterns, but there are of course many ways of doing that. Here, I’m just going quick’n’dirty into a network.

Say you have the diplomatic correspondence of the Republic of Texas, and you suspect that there might be interesting patterns in the places named over time. You can use the Stanford Named Entity Recognition package to extract locations. Then, using some regular expressions, you can transform that output into a network file. BUT – and this is important – the network format we’ll end up with carries some baggage of its own (more on that below). Anyway, first you’ll want the Correspondence. Over at The Macroscope, we’ve already written about how you can extract the patterns of correspondence between individuals using regex patterns. This doesn’t need the Stanford NER because there is an index to that correspondence, and the regex grabs & parses that information for you.
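(For reference, the NER step we’re about to lean on boils down to a single command at the terminal. A sketch, assuming the 3-class English classifier that ships with the Stanford NER download; the jar and classifier file names vary with your version, and texas-letters.txt stands in for your correspondence file:

java -mx600m -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -textFile texas-letters.txt

)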

But there is no such index for locations named. So grab that document, and feed it into the NER as Michelle Moravec instructs on her blog here. In the terminal window, as the classifier tags Persons, Organizations, and Locations, you’ll spot blank lines between batches of categorized items (edit: there’s a classifier that’ll grab time too; that’d be quite handy to incorporate here – SG). These blanks correspond to the blanks between the letters in the original document. Copy all of the terminal output into a new Notepad++ or TextWrangler document. We’re going to trim away every line that isn’t led by LOCATION, so find:

\n(?!LOCATION).+

and replace with nothing. This will delete everything that doesn’t have the location tag in front. (The negative lookahead matters here: a character class like [^LOCATION] negates single characters rather than the whole word, and so would leave ORGANIZATION lines, which begin with an O, untouched.) Now, let’s mark those blank lines as the start of a new letter. A thread on Stack Overflow suggests this regex to find those blank lines:

^\s*$

where:

^ is the beginning of string anchor
$ is the end of string anchor
\s is the whitespace character class
* is zero-or-more repetition

and we replace with the string new-letter.

Now we want to get all of the locations for a single letter onto a single line. Find ‘\nLOCATION’ and replace it with a comma (matching the line break along with the tag is what pulls each location up onto the previous line). This budges everything into a single line, so we need to reintroduce line breaks, by replacing ‘new-letter’ with the new line character:

find: (new-letter)
replace: \n\1

I could’ve just replaced new-letter with a new-line directly, but I wanted to make sure that every new line did in fact start with new-letter. Now find and replace new-letter so that it’s removed. You now have a document with the same number of lines as there were letters in the original correspondence file.
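Incidentally, if you’d rather script these find-and-replace steps than click through them, here’s a minimal Python sketch of the same pipeline. It assumes, as above, that each tagged entity sits on its own line beginning with LOCATION and that blank lines separate the letters; the file names are placeholders:

import re
# Read the raw NER terminal output, saved to a file.
with open('ner-output.txt', encoding='utf-8') as f:
    text = f.read()
letters = []
for block in re.split(r'\n\s*\n', text):  # blank lines mark letter boundaries
    # Keep only the LOCATION lines, stripping the tag itself.
    locs = [re.sub(r'^LOCATION:?\s*', '', line).strip()
            for line in block.splitlines() if line.startswith('LOCATION')]
    if locs:  # note: letters naming no locations at all are skipped here
        letters.append(', '.join(locs))
# One comma-separated line of locations per letter, the same shape
# the find-and-replace steps above produce.
with open('locations-per-letter.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(letters) + '\n')

Now to turn it into a network file! Add the following information at the start of the file: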

DL
n=721
format = nodelist1
labels embedded:
data:

DL will tell a network analysis program that we are dealing with UCINET’s DL format. N equals the number of nodes. Format=nodelist1 says, ‘this is a format where the first item on the line is connected to all the subsequent items on that line’. As a historian or archaeologist, you can see that there’s a big assumption in that format. Is it justified? That’s something to mull over. Gephi only accepts DL in format=edgelist1, that is, binary pairs. If that describes the relationship in your data, there’s a lot of legwork involved in moving from nodelist1 to edgelist1, and I’m not covering that here. Let’s imagine that, on historical grounds, nodelist1 accurately describes the relationship between locations mentioned in letters, that the first location mentioned is probably the place where the letter is being written from, or the most important place, or….
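To make that concrete, here’s a toy file in this format (the places and the count are invented):

DL
n=4
format = nodelist1
labels embedded:
data:
Austin, Houston, Galveston
Houston, Matagorda
Austin, Galveston

Each data line is one letter, tying its first location to every other location named in that letter.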

“labels embedded:” tells a network program that the labels themselves are being used as data points, and “data:” indicates that everything afterwards is the data. But how did we know how many nodes there were? You could tally up by hand; you could copy and paste your data (back when each LOCATION was listed) into a spreadsheet and use its COUNT function to find the uniques; I’m lazy and just bang any old number in there, and then save the file with a .dl extension. Then I open it using a small program called Keyplayer. This isn’t what the program is for, but it will give you an error message that tells you the correct number of nodes! Put that number into your DL file, and try again. If you’ve got it right, Keyplayer won’t complain – its silence speaks volumes (you can then run an analysis in Keyplayer; if your DL file is not formatted correctly, you get no results!).
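(If you’d rather compute n directly than trick Keyplayer into telling you, a quick sketch, again assuming the one-line-per-letter file built above:

# Count the unique place-names to fill in n= in the DL header.
nodes = set()
with open('locations-per-letter.txt', encoding='utf-8') as f:
    for line in f:
        nodes.update(p.strip() for p in line.split(',') if p.strip())
print(len(nodes))

That number goes into the n= line of the header.)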

You now have a DL file that you can analyze in Pajek or UCINET. If you want to visualize in Gephi, you have to get it into a DL format that Gephi can use (edgelist) or else into .net format. Open your DL file in Pajek, and then save as Pajek format (which is .net). Then open it in Gephi. (Alternatively, going back a step, you can open the file in Keyplayer, and then within Keyplayer hit the ‘visualize in Pajek’ button, and you’ll get that transformation automatically.) (edit: if you’re on a Mac, you have to run Pajek or UCINET with something like Winebottler. Forgot to mention that.)

Ta da!

Locations mentioned in letters of the Republic of Texas

-ing history!

Still playing with videogrep. I downloaded 25 Heritage Minute commercials (non-Canadians: a series of one-minute-or-so clips that teach us Canucks about the morally uplifting things we’ve done in the past, things we’ve invented, bad-things-we-did-but-we’ve-patched-over-now. You get the gist.). I ran them through various pattern matches based on parts-of-speech tagging. It was hard to do anything more than that because the closed captioning (on which this all rests) was simply awful. Anyway, there’s a healthy dose of serendipity in all of this, as even after the search is done, the exact sequence in which the clips are reassembled is more or less random.
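If you want to try something similar, the shape of the command is roughly as follows. I’m reconstructing from memory, and videogrep’s flags have shifted between versions, so check its README and --help rather than trusting this verbatim; ‘VBG’ is the Penn Treebank tag for gerunds, and the file names are placeholders:

python videogrep.py --input heritage-minutes/ --search 'VBG' --search-type pos --output ing-history.mp4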

And with that, I give you the result of my pattern matching for gerunds:

-ing history! A Heritage Minute Auto-Supercut.

Heritage Jam entry: PARKER

I’m sure it isn’t quite what they were expecting, but I submitted something to HeritageJam.

View it here.

PARKER is an interactive experience in procedurally extracting, uncovering, and reversing, the burial of latent semantic core archaeological knowledge. In this era of neoliberal corporatization of cultural heritage knowledge, PARKER represents the way forward for its creation and appreciation. When we must balance funding for healthcare versus that for archaeologists, in this time of reduced availability of funds, how can we not turn to data mining and revisualization of knowledge? After all, what is the insight of the individual when millions of minutes of youtube videos are being created every minute? Further, PARKER extracts the core insights of archaeology and formats them automatically for patenting, so that DRM can be affixed and rightsholder value be fully realized.

PARKER:  for the archaeology we always dreamed of.

———

This visualization is an interactive story that frames the automatic search of YouTube, natural-language parsing, and the automatic supercut and re-formatting of those search results, to highlight the ways code can frame archaeological knowledge. It applies Sam Lavigne’s ‘videogrep’ and ‘automatic patent generator’ to results from a search for ‘archaeological burials’ retrieved from YouTube, selecting the first few results that included closed-captioning. Videogrep uses natural-language pattern matching on those captioning files to select clips from a variety of pieces, restitching them at random. The result is similar to the I-Ching or other ways of divining meaning. Similarly, the patent generator grabs the transcription and recasts elements of it into the language of patent applications. As I have argued elsewhere, digital archaeology is not about the justification of results, but rather the deformation of the familiar.

The result is a making-strange, an uncovering, of deeper truths. Code is not neutral, and we would be wise to recognize, to engage with, the theoretical perspectives encoded in our use of digital tools – especially when dealing with the human past.


A method and apparatus for observing the rhythmic cadence; or, an algorithmic alternative archaeology

Figure 1. A Wretched Garret Without A Fire (at least, according to Google Images)

A method and apparatus for observing the rhythmic cadence

ABSTRACT

A method and apparatus for observing the rhythmic cadence. The devices comprises a small shop, a wretched garret, a Russian letter, a mercantile house, a third storey

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 illustrates a wretched garret without a fire.

Figure 2 is a block diagram of a fearful storm off the island.

Figure 3 illustrates a mercantile house on my own account.

Figure 4 is a perspective view of the principal events of the Trojan war.

Figure 5 is an isometric view of a poor Jew for 4 francs a week.

Figure 6 is a cross section of a thorough knowledge of the English language.

Figure 7 is a block diagram of the hard trials of my life.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention is the son of a Protestant clergyman. The device is a wretched garret without a fire. The present invention facilitates the study of a language. The invention has my book in my hand. The invention acquires a thorough knowledge of the English language.

According to a beneficial embodiment, the invention such a degree that the study. The present invention shows his incapacity for the business. The device obtains a situation as correspondent and bookkeeper. The device understood a word of the language. The present invention established a mercantile house on my own account. The invention does not venture upon its study. The device devotes the rest of my life. The present invention realizes the dream of my whole life. The present invention publishes a work on the subject.

What is claimed is:

1. A method for observing the rhythmic cadence, comprising:
a wretched garret;
a small shop; and
a Russian letter.

2. The method of claim 1, wherein said wretched garret comprises a mercantile house on my own account.

3. The method of claim 1, wherein said small shop comprises the principal events of the Trojan war.

4. The method of claim 1, wherein said Russian letter comprises a fearful storm off the island.

5. An apparatus for observing the rhythmic cadence, comprising:
a mercantile house;
a small shop;
a third storey; and
a Russian letter.

6. The apparatus of claim 5, wherein said mercantile house comprises a wretched garret without a fire.

7. The apparatus of claim 5, wherein said small shop comprises a fearful storm off the island.

8. The apparatus of claim 5, wherein said third storey comprises a thorough knowledge of the English language.

9. The apparatus of claim 5, wherein said Russian letter comprises a thorough knowledge of the English language.

—————–
Did you recognize Troy and its Remains, by Henry (Heinrich) Schliemann, in that patent abstract? I took his ‘autobiographical notice’ from the opening of his account of the work at Troy, and ran it through Sam Lavigne’s Patent Generator. It’s a bit like the I-Ching. I have it in mind that this toy could be used to distort, reflect on, and draw something new from some of the classic works of archaeology – especially from that buccaneering phase when, well, pretty much anything went. What if, instead of publishing their discoveries, the early archaeologists had patented them instead? We live in such an era now, when new forms of life (or at least, its building blocks) can be patented; when workflows can be patented; when patents can be framed so broadly that a word-generator and a lawyer will bring you riches beyond compare… the early archaeologists were after fame and fortune as much as they were after knowledge of the past. This patent of Schliemann’s uses as its source text an opening sketch about the man himself, rather than his discoveries. Doesn’t a sense of him shine through? Doesn’t he seem, well, rather over-inflated? What is the rhythmic cadence, I wonder. If I can sort out the encoding, I’ll try this on some of his discussion of what he found.

(Think also of the computational power that went into this toy: natural language processing, pattern matching… it’s rather impressive, actually, when you consider what can be built by bolting existing bits together.)

Here’s Chapter 1 of Schliemann’s account of Troy. Please see the ‘detailed description of the preferred embodiments’, below.

——————-
An apparatus and method for according to the firman

ABSTRACT

An apparatus and method for according to the firman. The devices comprises a whole building, a large block

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a block diagram of the north-western end of the site.

Figure 2 is an isometric view of the second secretary of his chancellary.

Figure 3 is a perspective view of a large block of this kind.

Figure 4 is a diagrammatical view of the steep side of the hill.

Figure 5 is a schematic drawing of the native soil before the winter.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention sells me the field at any price. The device reach the native soil before the winter. The present invention is the highest mountain in the world.

What is claimed is:

1. An apparatus for according to the firman, comprising:
a whole building; and
a large block.

2. The apparatus of claim 1, wherein said whole building comprises a large block of this kind.

3. The apparatus of claim 1, wherein said large block comprises the native soil before the winter.

4. A method for according to the firman, comprising:
a large block; and
a whole building.

5. The method of claim 4, wherein said large block comprises the north-western end of the site.

6. The method of claim 4, wherein said whole building comprises the second secretary of his chancellary.

Using Storymaps.js

The Fonseca Bust, a storymap

A discussion on Twitter the other day – asking about the best way to represent ‘flowmaps’ (see DHQ&A) – led me to a new toy from KnightLabs: Storymap.js. KnightLabs also provides quite a nice, fairly intuitive editor for making the storymaps. In essence, it provides a way, and a viewer, for tying various kinds of media and text to points along a map. Sounds fairly simple, right? Something that you could achieve with ‘my maps’ in Google? Well, sometimes it’s not what you do but the way that you do it. Storymap.js also allows you to upload your own large-dimension image so that you can bring the viewer around it, pointing out the detail. In the sample (so-called) ‘gigapixel’ storymap, you are brought around The Garden of Earthly Delights.

This struck me as a useful tool for my upcoming classes – both as something that I could embed in our LMS and course website for later viewing, and as something that the students themselves could use to support their own presentations. I also imagine using it in place of essays or blog post reflections. To that end, I whipped up two sample storymaps. One reports on an academic journal article; the other provides a synopsis of a portion of a book’s argument.

Here’s a storymap about the Fonseca Bust.

Here’s a storymap about looting Cambodian statues.

In the former, I’ve uploaded an image to a public Google Drive folder. It’s been turned into tiles, so as to load into the map engine that is used to jump around the story. Storymap’s own documentation suggests using Photoshop’s Zoomify plugin. But if you don’t have Zoomify? Go to SourceForge and get this: http://sourceforge.net/projects/zoomifyimage/ . It requires that you have Python and the Python Imaging Library (PIL) installed. Unzip zoomifyimage, and put the image you want to use for your story in the same folder. Open your image in any image-processing program, and find out how many pixels wide by high it is. Write this down. Close the program. Then, open a command prompt in the folder where you unzipped zoomify (shift+right click, ‘open command prompt here’, in Windows). At the prompt, type


ZoomifyFileProcessor.py <your_image_file>

If all goes well, nothing much seems to happen – except that you have a new folder with the name of your image, an xml file called ImageProperties.xml and one or more TileGroupN folders with your sliced and diced images. Move this entire folder (with its xml and subfolders) into your google drive. Make sure that it’s publicly viewable on the web, and take note of the hosting url. Copy and paste it somewhere handy.
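For what it’s worth, the output for a (hypothetical) image called fonseca.jpg looks something like the sketch below; if memory serves, the WIDTH and HEIGHT attributes inside ImageProperties.xml also record the pixel dimensions you wrote down earlier.

fonseca/
    ImageProperties.xml
    TileGroup0/
        0-0-0.jpg
        1-0-0.jpg
        1-1-0.jpg
        ...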

see the Storymap.js documentation on this:

“If you don’t have a webserver, you can use Google Drive or Dropbox. You need the base url for your exported image tiles when you start making your gigapixel StoryMap. (show me how)”

In the Storymap.js editor, when you click on ‘make a new storymap’, you select ‘gigapixel’, and give it the URL to your folder. Enter the pixel dimensions of the complete image, and you’re good to go.

Your image could be a high-resolution google earth image; it could be a detail of a painting or a sculpture; it could be a historical map or photograph. There are also detailed instructions on running a storymap off your own server here.


Using Goldstone’s Topic Modeling R package with JSTOR’s Data for Research

Andrew Goldstone and Ted Underwood have an article on ‘the quiet transformation of literary studies’ (preprint), where they topic model a literary history journal and discuss the implications of that model for their discipline. Andrew has a blog post discussing their process and the coding that went into that process.

I’m currently playing with their code, and thought I’d share some of the things you’ll need to know if you want to try it out for yourself – get it on github. I’m assuming you’re using a Windows machine.

1. Get the right version of R. You need 3.0.3 for this. Both the 32-bit and 64-bit versions of R download in a single installer; when you install it, you can choose the 32-bit or the 64-bit version, depending on your machine. Choose wisely.

2. Make sure you’re using the right version of Java. If you are on a 64-bit machine, install 64-bit Java; on a 32-bit machine, 32-bit Java.

3. Dependencies. You need to have the rJava package, the Mallet wrapper, and ‘devtools’ installed in R. In the basic R GUI, you can do this by clicking on Packages >> Install packages: select rJava, then do the same again for mallet, and again for devtools. Now you can install Goldstone’s dfrtopics by typing, at the R command prompt

library(devtools)
install_github("dfrtopics","agoldst")

Now. Assuming that you’ve downloaded and unzipped a dataset from JSTOR (go to dfr.jstor.org to get one), here’s what you’re going to need to do. You’ll need to increase the available memory in Java for rJava to play with. You do this before you tell R to use the rJava library. I find it best to just close R, then reload it. Then, type the following, one at a time:

options(java.parameters="-Xmx2048m")
library(rJava)
library(mallet)
library(dfrtopics)

The first line in that list increases the memory heap size. If all goes well, there’ll be a little message telling you that your heap size is 2048 MB and that you should really increase it to 2 GB. As these are the same thing, no worries. Now to topic model your stuff!

m <- model_documents(citations_files="[path to your]\\citations.CSV",
dirs="[path to your]\\wordcounts\\",
stoplist_file="[path to your]\\stoplist.txt",
n_topics=60)

Change n_topics to whatever you want. In the path to your files, remember to use double \\.

Now to export your materials.

output_model(m, "data")

This will create a ‘data’ folder with all of your outputs. But where? In your working directory! If you don’t know where this is, wait until the smoke clears (the prompt returns) and type

getwd()

You can use setwd() to set that to whatever you want:

setwd("c:\\path-to-your-preferred-work-directory")

You can also export all of this to work with Goldstone’s topic model browser, but that’ll be a post for another day. Open up your data folder and explore your results.


Still playing with historical maps into minecraft

I managed to get my map of the zone between the Hog’s Back Falls and Dow’s Lake (née Swamp) into Minecraft. I completely screwed up the elevations though, so it’s a pretty… interesting… landscape. I’m trying again with a map of Lowertown, coupled with elevation data from a modern map. This clearly isn’t ideal, as the topography of the area has changed a lot with 150 years of urbanism, but it’s the best I have handy. Anyway, it’s nearly been working for me.

Nearly.

So I provide the elevation and features files for your own enjoyment; see if you can make ‘em run with the generate_map.py script. If you get ‘key errors’, try editing the features file in Paint, making sure the blocks of colour are not fuzzy on the edges.

https://dl.dropboxusercontent.com/u/37716296/byward-market/market-maps.zip