I’ve come across an interesting tool called ‘Overview‘. It’s meant for journalists, but I see no reason why it can’t serve historical/archaeological ends as well. It does recursive adaptive k-means clustering rather than topic modeling, as I’d initially assumed (more on process here). You can upload texts as pdfs or within a table. One of the columns in your table could be a ‘tags’ column, whereby – for example – you indicate the year in which the entry was made (if you’re working with a diary). Then, Overview sorts your documents or entries into nested folders of similiarity. You can then see how your tags – decades – play out across similar documents. In the screenshot below, I’ve fed the text of ca 600 historical plaques into Overview:
Overview divides the historical plaques, at the broadest level, of similarity into the following groups:
‘church, school, building, toronto, canada, street, first, house, canadian, college (545 plaques),
‘road, john_graves, humber, graves_simcoe, lake, river, trail, plant’ (41 plaques)
‘community’ with ‘italian, north_york, lansing, store, shepard, dempsey, sheppard_avenue’, 13 documents
‘: years’ with ‘years_ago, glacier, ice, temperance, transported, found, clay, excavation’, 11 documents.
That’s interesting information to know. In terms of getting the info back out, you can export a spreadsheet with tags attached. Within Overview, you might want to tag all documents together that sort into similar groupings, which you could then visualize with some other program. You can also search documents, and tag them manually. I wondered how plaques concerned with ‘children’, ‘women’, ‘agriculture’, ‘industry’, etc might play out, so I started using Overview’s automatic tagger (search for a word or phrase, apply that word or phrase as a tag to everything that is found). One could then visually explore the way various tags correspond with particular folders of similar documents (as in this example). That first broad group of ‘church school building canada toronto first york house street canadian’ just is too darned big, and so my tagging is hidden (see the image)- but it does give you a sense that the historical plaques in Toronto really are concerned with the first church, school, building, house, etc in Toronto (formerly, York). Architectural history trumps all. It would be interesting to know if these plaques are older than the other ones: has the interest in spaces/places of history shifted over time from buildings to people? Hmmm. I’d better check my topic models, and do some close reading.
Anyway, leaving that aside for now, I exported my tagged texts, and did a quick and dirty network visualization of tags connected to other tags by virtue of shared plaques. I only did this for 200 of the plaques, because, frankly, it’s Friday evening and I’d like to go home.
Here’s what I saw [pdf version]:
So a cluster with ‘elderly’, ‘industry’, ‘doctor’, ‘medical’, ‘woman’…. I don’t think this visualization that I did was particularly useful.
Probably, it would be better to generate tags that collect everything together in the groups that the tree visualization in Overview generates, export that, and visualize as some kind of dendrogram. It would be good if the groupings could be exported without having to do that though.
Cool, and love the interface with DocumentCloud. I’ll have to test it with a 600 page legal doc in Italian…