Text Analysis of the Grand Jury Documents

a topic in the grand jury documents, #ferguson

I watched Twitter and the CBC while the prosecutor was reading his statement. I watched the live feeds from Ferguson, and other cities around the US. Back in August, when this all first began, I was glued to my computer, several feeds going at once.

A spectator.

Yesterday, Mitch Fraas put the grand jury documents (transcripts of the statements, the proceedings) into Voyant Tools:

These ultimately came from here: http://apps.stlpublicradio.org/ferguson-project/evidence.html

So today, I began, in a small way, to try to make sense of it all, the only way that I can. Text analysis.

Here’s the Voyant Tools corpus

Not having read the full corpus closely (this is, of course, a *distant* tool), it certainly looks as if the focus was on working out what Brown was doing, rather than Wilson…

I started topic modeling, using R & MALLET.

and I put everything up on github

but then I felt that I could improve the analysis; I created one concatenated file, then broke it into 1000 line chunks. The latest inputs, outputs, and scripts, are all on my github page.
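That chunking step is a few lines in any language. Here's a sketch in Python (the actual script on my github may differ, and the filenames here are made up):

```python
def chunk_file(lines, chunk_size=1000):
    """Yield successive chunks of at most chunk_size lines, so each
    chunk can be treated as one 'document' by the topic model."""
    for start in range(0, len(lines), chunk_size):
        yield lines[start:start + chunk_size]

# Usage (filenames hypothetical):
# with open("grand_jury_concatenated.txt") as f:
#     for i, chunk in enumerate(chunk_file(f.readlines())):
#         with open(f"chunk_{i:04d}.txt", "w") as out:
#             out.writelines(chunk)
```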

The most haunting…

And all 100 topics…

None of this counts as analysis. But – by putting it all together, my hope is that more people will grab the text files, grab the R script, explore the Voyant corpus, and really put this all under the microscope. I was tremendously affected by Bethany’s latest blog post, ‘All at once‘, which discusses her own reaction to recent news in both Ferguson and UVa, and elsewhere. It was this bit at the end that really resonated:

[…]we need analytical and interpretive platforms, too, that help us embrace our own subjective positioning in the systems in which we labor–which means, inevitably, to embrace our own complicity and culpability in them. And we need these, at the same time, to help us see beyond: to see patterns and trends, to read close and distantly all at once, to know how to act and what to do next. We need platforms that help us understand the workings of the cogs, of which we are one.

So here’s my small contribution. Maybe this can be a platform for someone to do a deeper analysis, to get started with text analysis, to read distantly and closely, to see beyond, and to understand what happened during the Grand Jury.

A Digital Archaeology of Digital Archaeology: work in progress

Ethan Watrall and I have been playing around with data mining as a way of writing a historiography of digital & computational archaeology. We’d like to invite you to play along.

We’ll probably have something to say on this at the SAA in April. Anyway, we’ve just been chugging along slowly, sharing the odd email, google doc, and so on – and a monstrously huge topic model browser I set up. Yesterday, an exchange on twitter prompted us to share those materials.

This prompted a lot of chatter, including:

and this:

So let’s get this party started, shall we?

~o0o~

While there’s a lot of movement towards sharing data, and open access publications, there’s also this other space of materials that we don’t talk about too much – the things we build from the data that we (sometimes) share that enable us to write those publications we (sometimes) make open access. This intermediate stage never gets shared. Probably with good reason, but I thought given the nature of digital work, perhaps there’s an opportunity here to open not just our research outputs & inputs, but also our process to wider participation.

Hence this post, and all that follows.

~o0o~

Here’s what I did. I scoured JSTOR’s DFR for anglophone journals, from 1935 onwards (full bibliography right here: http://graeworks.net/digitalarchae/20000/#/bib). Then I fitted various topic models to them, using Andrew Goldstone’s dfr-topics, an R package that runs MALLET over the bag-of-words that DFR gives you, and ran the result through Andrew’s dfr-browser (tag line: “Take a MALLET to disciplinary history!”).

The results can be viewed here. Like I said, this is the middle part of an analysis that we’re sharing here. Want to do some historiography with a distant reading approach? We’d love to see what you spot/think/observe in these models (maybe your students would like a go?) In which case, here’s an open pad for folks to share & comment.

Why would you bother? Well, it occurred to me that I’ve never seen anyone try to crowdsource this step of the process. Maybe it’s a foolish idea. But if folks did, and there was merit to this process, maybe some kind of digital publication could result where all contributors would be authors? Maybe a series of essays, all growing from this same body of analysis? Lots of opportunities.

Stranger things have happened, right?

~o0o~

Just to get you going, here are some of the things I’ve noticed, and some of my still-churning thoughts on what all this might mean (I’ve pasted this from another document; remember, work in progress!):

remembering that in topic modeling, a word can be used in different senses in different topics/discourses (thus something of the semantic sense of a word is preserved)

tools used:

-stanford tmt for detailed view on CAA (computer applications in archaeology)

-Mimno’s browser-based jsLDA for a detailed view of correlations between topics, using CAA & IA (Internet Archaeology; only the open access materials from before it went fully OA in October 2014)

-Goldstone’s dfr-topics for R and dfr-browser to visualize 21,000 articles as an entire topic model

-same again for individual journals: AJA, JFA, AmA, CA, JAMT, WA

——-

stanford tmt of caa 1974 – 2011

Screen Shot 2014-11-09 at 3.43.49 PM

-no stoplist used; instead, the TMT discards the most prominent and least likely words from the analysis

-its output is formatted in such a way that it becomes easy to visualize the patterns of discourse over time (MALLET, the other major tool for doing topic modeling, requires much more massaging to get the output in such a form. The right tool for the right job).

-30 topics gives good breakdown; topic 26 contains garbage (‘caa proceedings’ etc as topic words)

In 1974, the most prominent topics were:

topic 1 – computer, program, may, storage, then, excavation, recording, all, into, form, using, retrieval, any, user, output, records, package, entry, one, unit

topic 6: but, they, one, time, their, all, some, only, will, there, would, what, very, our, other, any, most, them, even

topic 20: some, will, many, there, field, problems, may, but, archaeologists, excavation, their, they, recording, however, record, new, systems, most, should, need

The beginnings of the CAA are marked by hesitation and prognostication: what *are* computers for, in archaeology? There is a sense that for archaeologists, computation is something that will be useful insofar as it can be helpful for recording information in the field. With time, topic 1 diminishes. By 2000 it is nearly non-existent.  The hesitation expressed by topics 6 and 20 continues though. Archaeologists do not seem comfortable with the future.

Other early topics that thread their way throughout the entire period are topics 5, 2, 27 and 28:

Topic 5: matrix, units, stratigraphie, relationships, harris, unit, between, method, each, attributes, may two diagram, point, other, seriation, one, all, stratigraphy, sequence

Topic 2: area, survey, aerial, north, features, sites, region, located, excavation, river, areas, during, field, its, large, project, south, water, over, fig

Topic 27: sites, monuments, heritage, national, record, management, cultural, records, development, systems, england, database, english, its, survey, new, will, also, planning, protection.

Topic 28: museum, museums, collections, project, national, documentation, all, database, archives, about, archive, objects, sources, documents, university, text, our, also, collection, reports.

These topics suggest the ‘what’ of topic 1: how do we deal with contexts and units? Large surveys? Sites and monuments records and museum collections? Interestingly, topics 27 and 28 can be taken as representing something of the professional archaeological world (as opposed to ‘academic’ archaeology).

Mark Lake, in a recent review of simulation and modeling in archaeology (JAMT 2014) describes various trends in modeling [discuss]. Only topic 9 seems to capture this aspect of computational/digital archaeology:

model, models, social, modeling, simulation, human, their, between, network, approach, movement, networks, past, different, theory, how, one, population, approaches, through

Interestingly, for this topic, there is a thin thread from the earliest years of the CAA to the present (2011), with brief spurts in the late 70s and late 80s, then a consistent presence throughout the 90s, with a larger burst from 2005-2008. Lake characterizes thus…. [lake]. Of course, Lake also cites various books and monographs which this analysis does not take into account.

If we regard ‘digital archaeology’ as something akin to ‘digital humanities’ (and so distinct from ‘archaeological computation’) how does it, or does it even, appear in this tangled skein? A rough distinction between the two perspectives can be framed using Trevor Owens’ meditation on what computation is for. Per Owens, we can think of a humanistic use of computing as one that helps us deform our materials, to give us a different perspective on them. Alternatively, one can think of computing as something that helps us justify a conclusion. That is, the results of the computation are used to argue that such-a-thing is most likely in the past, given this model/map/cluster/statistic. In which case, there are certain topics that seem to imply a deformation of perspective (and thus, a ‘digital archaeology’ rather than an archaeological computation):

topic 03: cultural, heritage, semantic, model, knowledge, systems, web, standards, ontology, work, domain, conceptual, different, crm, between, project, based, approach

topic 04: knowledge, expert, process, its, artefacts, set, problem, different, concepts, human, systems, but, they, what, our, scientific, about, how, all, will

topic 07: project, web, digital, university, internet, access, online, service, through, electronic, http, european, technologies, available, public, heritage, will, services, network, other

topic 14: virtual, reality, museum, public, visualization, models, reconstruction, interactive, museums, multimedia, heritage, envrionment, scientific, reconstructions, will, computer, technologies, environments, communication

topic 29: gis, spatial, time, within, space, temporal, landscape, study, into, social, approaches, geographic, applications, approach, features, environmental, based, between, their, past

Topic 3 begins to emerge in 1996 (although its general discourse is present as early as 1988).  Topic 4 emerges with strength in the mid 1980s, though its general thrust (skepticism about how knowledge is created?) runs throughout the period. Topic 7 emerges in 1994 (naturally enough, when the internet/web first hit widespread public consciousness). Should topic 7 be included in this ‘digital archaeology’ group? Perhaps, inasmuch as it also seems to wrestle with public access to information, which would seem not to be about justifying some conclusion about the past but rather opening perspectives upon it. Topic 14 emerges in the early 1990s.

Topic 29, on first blush, would seem to be very quantitative. But the concern with time and temporality suggests that this is a topic that is trying to get to grips with the experience of space. Again, like the others, it emerges in the late 1980s and early 1990s. [perhaps some function of the personal computer revolution..? instead of being something rare and precious -thus rationed and only for ‘serious’ problems requiring distinct answers – computing power can now be played with and used to address less ‘obvious’ questions?]

What of justification? These are the topics that grapple with statistics and quantification:

Topic 10: age, sites, iron, settlement, early, bronze, area, burial, century, one, period, their, prehistoric, settlements, grave, within, first, neolithic, two, different

Topic 11: pottery, shape, fragments, classification, profile, ceramics, vessels, shapes, vessel, sherds, method, two, ceramic, object, work, finds, computer, fragment, matching, one

Topic 13: dating, radiocarbon, sampling, london, dates, some, but, betwen, than, e.g. , statistical, chronological, date, there, different, only, sample, results, one, errors

Topic 15: landscape, project, study, landscapes, studies, cultural, area, gis, human, through, their, its, rock, history, historical, prehistoric, environment, our, different, approach

Topic 17: sutdy, methods, quantitative, technqiues, approach, statistical, using, method, studies, number, artifacts, results, variables, two, most, bones, based, various, analyses, applied

Topic 19: statistical, methods, techniques, variables, tiie, statistics, density, using, cluster, technique, multivariate, method, two, nottingham, example, principal, some, university

Topic 21: model, predicitve, modelling, models, cost, elevation, viewshed, surface, sites, gis, visibility, van, location, landscape, areas, one, terrain, dem, digital

topic 23: image, digital, documentation, images, techniques, laser, scanning, models, using, objects, high, photogrammetry, methods, model, recording, object, surveying, drawings, accuracy, resolution

topic 24: surface, artefact, distribtuion, artefacts, palaeolithic, materials, sites, deposits, within, middle, area, activity, during, phase, soil, processes, lithic, survey, remains, france

Macroscopic patterns

Screen Shot 2014-11-09 at 3.45.25 PM

This detail of the overall flow of topics in the CAA proceedings points to the period 1978 – 1983 as a punctuation point, an inflection point, of new topics within the computers-and-archaeology crowd. The period 1990-2011 contains minor inflections around 1997 and 2008.

1997-1998

1990-2011

In terms of broad trends, pivot points seem to be the late 70s, 1997, 2008. Given that our ‘digital archaeology’ themes emerge in the late 90s, let’s add Internet Archaeology to the mix [why this journal, why this time: because of the 90s inflection point? quicker publication schedule? ability to incorporate novel outputs that could never be replicated in print?]. This time, instead of searching for topics, let’s see what correlates with our digital archaeology topics. For this, David Mimno’s browser-based jsLDA topic model is most useful. We run it for 1000 iterations, and find the following correlation matrix.

[insert discussion here]

http://www.graeworks.net/digitalarchae/mimno/jslda.html?docs=caa_and_intarch.txt&stoplist=en.txt&topics=30

-1000 iterations. Your 1000 iterations will be slightly different than mine, because this is a probabilistic approach

– the browser produces csv files for download, as well as a csv formatted for visualizing patterns of correlation as a network in Gephi or other network visualization software.

-stop list is en, fr, de from MALLET + archaeology, sites, data, research

-running this in a browser is not the most efficient way of doing this kind of analysis, but the advantage is that it allows the reader to explore how topics sort themselves out, and its visualization of correlated topics is very effective and useful.

-note word usage. Mimno’s browser calculates the ‘specificity’ of a word to a topic. The closer to 1.0, the closer the word is distributed only within a single topic. Thus, we can take such words as being true ‘keywords’ for particular kinds of discourses. [which will be useful in exploring the 20000 model]. “Computer” has a specificity of 0.61, while “virtual” has a specificity of 0.87, meaning that ‘computer’ is used in a number of topics, while ‘virtual’ is almost exclusively used in a single discourse. ‘Predictive’ has a specificity of 1, and ‘statistical’ of 0.9.
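If you wanted to compute something like that specificity score yourself from exported word-topic counts, one plausible reading of the measure (my assumption, not Mimno's documented formula) is the share of a word's occurrences that fall in its single most prominent topic:

```python
def specificity(word_topic_counts):
    """Fraction of a word's occurrences falling in its most prominent
    topic; 1.0 means the word lives in exactly one topic.
    (An assumed reading of the measure, for illustration only.)"""
    total = sum(word_topic_counts)
    if total == 0:
        return 0.0
    return max(word_topic_counts) / total
```

On this reading, a word spread evenly over many topics scores low, while a word confined to one discourse scores near 1.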

In the jsLDA model, there are three topics that deal with GIS.

topic 19, gis landscape spatial social approach space study human studies approaches

topic 18, database management systems databases gis web software user model tool

topic 16, sites gis landscape model predictive area settlement modelling region land

The first, topic 19, seems to correspond well with our earlier topic that we argued was about using GIS to offer a new perspective on human use/conception of space (ie, a ‘digital’ approach, in our formulation). Topics 18 and 16 are clearly about GIS as a computational tool. In the correlation matrix below, blue equals topics that occur together greater than expected, while red equals less than expected; the size of the dot gives an indication of how much. Thus, if we look for the topics that go hand in hand with topic 19, the strongest are topic 16 (the predictive power of GIS), and topic 10 (social, spain, simulation, networks, models).
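If you'd rather draw the correlation network in Gephi yourself, the exported matrix converts to an edge list in a few lines. A sketch (the threshold and labels here are my choices, not the browser's export format):

```python
import csv
import io

def correlation_to_edges(corr, labels, threshold=0.1):
    """Turn a symmetric topic-correlation matrix into a weighted
    edge list, keeping only pairs correlated above the threshold."""
    edges = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if corr[i][j] > threshold:
                edges.append((labels[i], labels[j], corr[i][j]))
    return edges

def write_gephi_csv(edges):
    """Gephi imports a CSV with Source, Target, Weight columns."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buf.getvalue()
```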

Screen Shot 2014-11-09 at 5.28.47 PM

The ‘statistical, methods, techniques, artefact, quantitative, statistics, artefacts’ topic is positively correlated with ‘human, material, palaeolithic’, ‘time, matrix, relationship’, and ‘methods, points, point’ topics. This constellation of topics is clearly a use of computation to answer or address very specific questions.

-in jslda there’s a topic ‘database project digital databases web management systems access model semantic’ – positively correlated with ‘publication project electoric’, ‘text database maps map section user images museum’, ‘excavation recording’, ‘vr model’,  ‘cultural heritage museum’, ‘italy gis’, ‘sites monuments record’ [see keys.csv for exact label]. These seem to be topics that deal with deforming our perspectives while at the same time intersecting with extremely quantitative goals.

So far, we have been reading distantly some 40 years of archaeological work that is explicitly concerned with the kind of archaeology that uses computational and digital approaches. There are punctuation points, ‘virages’, and complicated patterns – there is no easy-to-see disjuncture between what the digital humanists imagine is the object of using computers, and their critics who see computation as positivism by the back door. It does show that archaeology should be regarded as an early mover in what has come to be known as ‘the digital humanities’, with quite early sophisticated and nuanced uses of computing. But how early? And how much has archaeological computing/digital archaeology permeated the discipline? To answer these questions, we turn to a much larger topic model.

Zoom Out Some More

Let’s put this into a broader context. 24 journals from JSTOR were selected for both general coverage of archaeology and for regional/topical specialities. The resulting dataset contains 21000 [get exact number] articles, mostly from the past 75 years (a target start date of 1940 was selected for journals whose print run predates the creation of the electronic computer, thus computer = machine and not = woman who computes). 100 topics seemed to capture the range of thematic discourses well. We looked first for topics that seem analogous to the CAA & IA topics (CAA and IA were not included in this analysis because they are not within the JSTOR DFR database; Goldstone’s DFR Browser was used for the visualization of the topics). [better explanation, rationale, to be written, along with implications]. We also observe ‘punctuation points’ in this broader global (anglosphere) representation of archaeology that correspond with the inflection points in the small model, with many trends that fit but also other trends that do not fit with the standard historiography of archaeology. We then dive into certain journals (AJA, JFA, AmA, JAMT) to tease these trends apart. Just what has been the impact of computational and digital archaeology in the broader field?

Screen Shot 2014-11-09 at 5.29.24 PM

The silhouette in the second column gives a glimpse into the topic’s prevalence over the ca 75 years of the corpus. The largest topic, topic 10, with its focus on ‘time, made, work, years, great, place, make’ suggests a kind of special pleading: in the rhetoric of archaeological argument, one always has to explain just why this particular site/problem/context is important. A similar topic was observed in the model fitted to the CAA & IA [-in 20000 model, there’s the ‘time’ topic time made work years great place make long case fact point important good people times; it’s the largest topic, and accounts for 5.5%. here, there is one called ‘paper time work archaeologists introduction present important problems field approach’. it’s slightly correlated with every other topic. Seems very similar. ]

More interesting are the topics a bit further down the list. Topic 45 (data, analysis, number, table, size, sample) is clearly quantitative in nature, and its silhouette matches our existing stories about the rise of the New Archaeology in the late 60s and early 70s. Topics 38 and 1 seem to be topics related to describing finds – ‘found, site, stone, small, area’; ‘found, century, area, early, excavations’. Topic 84 suggests the emergence of social theories and power – perhaps an indication of the rise of Marxist archaeologies? Further down the list we see ‘professional’ archaeology and cultural resource management, with peaks in the 1960s and early 1980s.

Screen Shot 2014-11-09 at 5.29.56 PM

Topic 27 might indicate perspectives connected with gender archaeology – “social, women, material, gender, men, objects, female, meaning, press, symbolic” – and it accounts for 0.8% of the corpus: about 160 articles.  ‘Female’ appears in four topics, topic 27, topic 65 (‘head, figure, left, figures, back, side, hand, part’ – art history? 1.4% of the corpus) topic 58 (“skeletal, human, remains, age, bone”- osteoarchaeology, 1.1% of the corpus), and topic 82 (“age, population, human, children, fertility” – demographics? 0.8% of the corpus).

[other words that would perhaps key into major trends in archaeological thought? looking at these topics, things seem pretty conservative, whatever the theorists may think, which is surely important to draw out and discuss]

Concerned as we are to unpick the role of computers in archaeology more generally, if we look at the word ‘data’ in the corpus, we find it contributes to 9 different topics (http://graeworks.net/digitalarchae/20000/#/word/data ). It is the most important word in topic 45 (data, analysis, number, table, size, sample, study) and in topic 55 (data, systems, types, information, type, method, units, technique, design). The word ‘computer’ is also part of topic 55. Topic 45 looks like a topic connected with statistical analysis (indeed, ‘statistical’ is a minor word in that topic), while topic 55 seems to be more ‘digital’ in the sense we’ve been discussing here. Topic 45 is present in 3.2% of the corpus, growing in prominence from the early 1950s, falling in the 60s, resurging in the 70s, and then decreasing to a more or less steady state in the 00s.

Screen Shot 2014-11-09 at 5.30.34 PM

Topic 55 holds some surprises:

Screen Shot 2014-11-09 at 5.31.17 PM

The papers in 1938 come from American Antiquity volume 4 and show an early awareness of not just quantitative methods, but also the reflective way those methods affect what we see [need to read all these to be certain of this].

next steps

– punctuation points – see http://graeworks.net/digitalarchae/20000/#/model/yearly

major – 1940 (but perhaps an artefact of the boundaries of the study)

minor- early 1950s

minor- mid 1960s

major- 1976 (american antiquity does something odd in this year)

major- 1997-8


Setting up your own Data Refinery

Refinery at Oxymoron, by Wyatt Wellman, cc by-sa 2.0. Flickr.

I’ve been playing with a Mac. I’ve been a windows person for a long time, so bear with me.

I’m setting up a number of platforms locally for data mining. But since what I’m *really* doing is smelting the ore of data scraped using things like Outwit Hub or Import.io (the ‘mining operation’, in this tortured analogy), what I’m setting up is a data refinery. Web based services are awesome, but if you’re dealing with sensitive data (like oral history interviews, for example) you need something local – this will also help with your ethics board review. Onwards!

Voyant-Tools

You can now set Voyant-Tools up locally, keeping your data safe and sound. The documentation and downloads are all on this page. This was an incredibly easy setup on Mac. Unzip, double-click voyant-tools.jar, and boom you’ve got Voyant-Tools puttering away in your browser. It’ll be at http://127.0.0.1:8888. You can also hit the cogwheel icon in the top right to run your corpus through all sorts of other tools that come with Voyant but aren’t there on the main layout. You’ll want ‘export corpus with other tool’. You’ll end up with a url something like http://127.0.0.1:8888/tool/RezoViz/?corpus=1404405786475.8384 . You can then swap the name of any other tool into that URL (to save time). RezoViz, by the way, uses named entity extraction to construct a network of entities mentioned in the same documents. So if you upload your corpus in small-ish chunks (paragraphs; pages; every 1000 words, whatever) you can see how it all ties together this way. From the cogwheel icon on the RezoViz layout, you can get a .net file which you can then import into Gephi. How frickin’ cool is that?
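The URL-swapping trick is easy to script if you want to open the same corpus in several tools at once. A toy illustration, using the corpus id from the example above (the URL pattern is taken from Voyant's export link; I haven't checked every tool name):

```python
def tool_url(base, tool, corpus_id):
    """Voyant exposes each tool at /tool/<name>/?corpus=<id>, so the
    same corpus can be opened in any tool by swapping the name."""
    return f"{base}/tool/{tool}/?corpus={corpus_id}"

# e.g. tool_url("http://127.0.0.1:8888", "RezoViz", "1404405786475.8384")
```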

Overview Project


Topic modeling is all the rage, and yes, you should have MALLET or the Stanford TMT or R on your machine. But sometimes, it’s nice to just see something rather like a dendrogram of folders with progressively finer levels of self-similarity. Overview does term frequency-inverse document frequency (tf-idf) weightings to figure out similarity of documents. The instructions (for all platforms) are here. It’s not quite as painless as Voyant, but it’s pretty darn close. You’ll need to have Postgres – download, install, run it once, then download Overview. You need to have Java 7. (At some point, you’ll probably need to look into running multiple versions of Java, if you continue to add elements to your refinery). Then:

  1. Ctrl-Click or Right-click on Overview for Mac OS X.command and select Open. When the dialog box asks if you are sure you want to open the application, click on the Open button. From then on, you can start Overview by double-clicking on Overview for Mac OS X.command.
  2. Browse to http://localhost:9000 and log in as admin@overviewproject.org with password admin@overviewproject.org.

And you now have Overview running. You can do many many things with Overview – it’ll read pdfs, for instance, which you can then export within a csv file. You can tag folders and export those tags, to do some fun visualizations with the next part of your refinery, RAW.
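For the curious, the tf-idf weighting Overview uses to compare documents looks roughly like this (a bare-bones sketch, not Overview's actual code, which also normalizes vectors and prunes terms):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: a list of token lists. Returns one {term: weight} dict
    per document, using raw term frequency times log inverse
    document frequency: words common to every document score zero,
    words concentrated in few documents score high."""
    n = len(docs)
    df = Counter()                 # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)          # term frequency in this document
        weighted.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weighted
```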

Tags exported from Overview

Tags visualized. This wasn’t done with Raw (but rather, a commercial piece of software), but you get the idea.

RAW

Flow diagram in Raw, using sample movie data

Flow diagram in Raw, using sample movie data

Raw does wonderful things with CSV formatted data, all in your browser. You can use the webapp version; nothing gets communicated to the server. But, still, it’s nice to keep it close to home. So, you can get Raw source code here. It’s a little trickier to install than the others. First thing: you’ll need Bower. But you can’t install Bower without Node.js and npm. So, go to Node.js and hit install. Then, download Raw. Unzip Raw and go to that folder. To install Bower, type

$ sudo npm install -g bower

Once the dust settles, there’s a bunch of dependencies to install. Remember, you’re in the Raw folder. Type:

$ bower install

When the dust clears again, and assuming you have Python installed on your machine, fire Raw up in a server:

$ python -m SimpleHTTPServer 4000

(If you don’t have python, well, go get python. I’ll wait. Note that SimpleHTTPServer is the Python 2 module; on Python 3, the equivalent is python3 -m http.server 4000). Then in your browser go to

localhost:4000

And you can now do some funky visualizations of your data. There are a number of chart types packaged with Raw, but you can also develop your own – here’s the documentation. Michelle Moravec has been doing some lovely work visualizing her historical research using Raw. You should check it out.


Your Open Source Data Refinery

With these three pieces of data refinery infrastructure installed on your machine, or in your local digital history computer lab, you’ll have no excuse not to start adding some distant reading perspective to your method. Go. Do it now.

Visualizing texts using Overview

I’ve come across an interesting tool called ‘Overview‘. It’s meant for journalists, but I see no reason why it can’t serve historical/archaeological ends as well. It does recursive adaptive k-means clustering rather than topic modeling, as I’d initially assumed (more on process here). You can upload texts as pdfs or within a table. One of the columns in your table could be a ‘tags’ column, whereby – for example – you indicate the year in which the entry was made (if you’re working with a diary). Then, Overview sorts your documents or entries into nested folders of similarity. You can then see how your tags – decades – play out across similar documents. In the screenshot below, I’ve fed the text of ca 600 historical plaques into Overview:

Overview divides the historical plaques, at the broadest level of similarity, into the following groups:

‘church, school, building, toronto, canada, street, first, house, canadian, college’ (545 plaques)

‘road, john_graves, humber, graves_simcoe, lake, river, trail, plant’ (41 plaques)

‘community’ with ‘italian, north_york, lansing, store, shepard, dempsey, sheppard_avenue’ (13 plaques)

‘years’ with ‘years_ago, glacier, ice, temperance, transported, found, clay, excavation’ (11 plaques)
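The k-means step at the heart of that recursive clustering can be sketched in miniature (a toy version on plain number vectors, nothing like Overview's optimized implementation, which works on tf-idf vectors and splits clusters recursively):

```python
def kmeans(points, k, iterations=20):
    """Plain k-means with Euclidean distance. points: list of
    equal-length numeric tuples. Initializes centroids from the
    first k points (deterministic, for illustration)."""
    centroids = [list(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # assign each point to its nearest centroid
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # move each centroid to the mean of its members
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return clusters
```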

That’s interesting information to know. In terms of getting the info back out, you can export a spreadsheet with tags attached. Within Overview, you might want to tag all documents together that sort into similar groupings, which you could then visualize with some other program. You can also search documents, and tag them manually. I wondered how plaques concerned with ‘children’, ‘women’, ‘agriculture’, ‘industry’, etc might play out, so I started using Overview’s automatic tagger (search for a word or phrase, apply that word or phrase as a tag to everything that is found). One could then visually explore the way various tags correspond with particular folders of similar documents (as in this example). That first broad group of ‘church school building canada toronto first york house street canadian’ is just too darned big, and so my tagging is hidden (see the image) – but it does give you a sense that the historical plaques in Toronto really are concerned with the first church, school, building, house, etc in Toronto (formerly, York). Architectural history trumps all. It would be interesting to know if these plaques are older than the other ones: has the interest in spaces/places of history shifted over time from buildings to people? Hmmm. I’d better check my topic models, and do some close reading.
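That search-and-tag-everything workflow is trivially scriptable too, if you ever need to reproduce it outside Overview (a sketch; Overview's own matching rules may differ):

```python
def auto_tag(docs, terms):
    """Apply each search term as a tag to every document whose text
    contains it (case-insensitive substring match), mirroring
    Overview's search-then-tag-all workflow."""
    tags = {i: [] for i in range(len(docs))}
    for term in terms:
        for i, text in enumerate(docs):
            if term.lower() in text.lower():
                tags[i].append(term)
    return tags
```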

Anyway, leaving that aside for now, I exported my tagged texts, and did a quick and dirty network visualization of tags connected to other tags by virtue of shared plaques. I only did this for 200 of the plaques, because, frankly, it’s Friday evening and I’d like to go home.

Here’s what I saw [pdf version]:

visualizing-tags-via-overview

So a cluster with ‘elderly’, ‘industry’, ‘doctor’, ‘medical’, ‘woman’…. I don’t think this visualization that I did was particularly useful.

Probably, it would be better to generate tags that collect everything together in the groups that the tree visualization in Overview generates, export that, and visualize as some kind of dendrogram. It would be good if the groupings could be exported without having to do that though.

Introducing Voyant in a History Tutorial

This week my HIST2809 students are encountering digital history, as part of their ‘Historian’s Craft’ class (an introduction to various tools & methods). As part of the upcoming assignment, I’m having them run some history websites through Voyant, as a way of sussing out how these websites craft a particular historical consciousness. Each week, there’s a two-hour lecture and one hour of tutorial where the students lead discussions based on the lecture & assigned readings. For this week, I want the students to explore different flavours of Digital History – here are the readings:

“Possible discussion questions: How is digital history different? In ten years, will there still be something called ‘digital history’, or will all history be digital? Is there space for writing history through games or simulations? How should historians cope with that? What kinds of logical fallacies would such approaches be open to?”

To help the TAs bring the students up to speed with using Voyant, I’ve suggested to them that they might find it fun/interesting/useful/annoying to run one of those papers through Voyant. Here’s a link to the ‘Interchange’ article, loaded into Voyant:

http://voyant-tools.org/?corpus=1363622350848.367&stopList=stop.en.taporware.txt

The TAs could put that up on the screen and click on various words in the word cloud, to see how each word is used over the course of a single article (though in this case, there are several academics speaking, so the patterns are in part author-related). Click on ‘scholarship’ in the word cloud, and you get a graph of its usage on the right – the highest point is clickable (‘segment six’). Click on that, and the relevant bit of text appears in the middle, as Bill Turkel talks about the extent to which historical scholarship should be free. On the bottom left, if you click on ‘words in the entire corpus’, you can select ‘access’ and ‘scholarship’, which will put both of them on the graph

( http://voyant-tools.org/tool/TypeFrequenciesChart/?corpus=1363622350848.367&docIdType=d1363579550728.b646f3e3-65d1-2347-c580-5e5c0985e6d0%3Ascholarship&docIdType=d1363579550728.b646f3e3-65d1-2347-c580-5e5c0985e6d0%3Aaccess&stopList=stop.en.taporware.txt&mode=document&limit=2 )

and you’ll see that the two words move in perfect tandem – so the discussion here is all about digital tools opening access to scholarship – except in segment 8. The question then becomes: why?

….so by doing this exercise, the students should get a sense of how looking at macroscopic patterns involves jumping back to the close reading we’re normally familiar with, then back out again, in an iterative process, generating new questions all along the way. An hour is a short period of time, really, but I think this would be a valuable exercise.
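Under the hood, Voyant’s trend graph is just segment-by-segment frequency counting: split the text into equal chunks, count the word in each. A rough Python sketch of the idea (the toy text below stands in for the Interchange article):

```python
# What Voyant's trend graph computes, approximately: split a text into
# (roughly) equal segments and count a word's occurrences in each one.

def trend(text, word, segments=10):
    tokens = text.lower().split()
    size = max(1, len(tokens) // segments)
    chunks = [tokens[i:i + size] for i in range(0, len(tokens), size)]
    return [chunk.count(word) for chunk in chunks[:segments]]
```

Clicking ‘segment six’ in Voyant is then just jumping from one of these counts back to the corresponding slice of text – the macro-to-micro move the students are practising.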

(I have of course made screen capture videos walking the students through the various knobs and dials of Voyant.) This is a required course here at Carleton. 95 students are enrolled. 35 come to every lecture. Approximately 50 come to the tutorials. Roughly half the class never comes… In protest that it’s a requirement? Apathy? Thinking they know how to write an essay, so what could I possibly teach them? That’s a question for another day, but I’m fairly certain that the next assignment, as it requires careful use of Voyant, is going to be a helluva surprise for that fraction.

On Public Access to Digital Data: Mining Public Comment

Yesterday, Bethany suggested that the public comments on the American OSTP request for info regarding public access to digital data would be a good target for some data mining:

So, I downloaded the pdf and turned it into plain text. I did not do any cleanup; what follows is a brief look at large-scale patterns, with all the caveats and cautions that that implies. I loaded the raw txt into Voyant Tools, where one can do some initial frequency counts and so on – available here. (NB – the corpus reader tool does not seem to work from this link; but all other Voyant tools do. You may also upload the txt file yourself into Voyant, which may solve the corpus reader problem – the txt is available in the zip file below).

Then, as is often my wont, I topic modeled it by individual line (for 25 topics). Below are the raw topics, without interpretation. I also mapped the topics to their documents using Gephi. As there were >30 000 lines, I pruned to show just the lines where an individual topic accounted for more than 2/3rds of the line’s composition (joining each such line to its minor topics as well). I ran the modularity routine to determine ‘communities’ within those comments; Gephi suggests 15 communities. The communities centered on topics 15, 9, and 23 seem to be most prominent. Here are all my data files (zipped download, ca. 46mb; includes the Gephi files).
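That pruning step is just a filter over the topics-by-documents output. In Python it might look something like this – the weights below are invented for illustration; the real ones come from the MALLET output:

```python
# Keep only the (line, topic) links where a single topic accounts for more
# than 2/3 of that line's topic composition. Weights here are made up.

doc_topics = {
    "line1": {15: 0.8, 9: 0.1, 23: 0.1},
    "line2": {15: 0.4, 9: 0.4, 23: 0.2},   # no dominant topic: pruned
    "line3": {23: 0.7, 9: 0.3},
}

dominant = {doc: topic
            for doc, weights in doc_topics.items()
            for topic, w in weights.items()
            if w > 2 / 3}
```

The surviving lines (plus their minor topics) become the edge list that goes into Gephi for the modularity/community detection.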

What does this all mean? I’ll leave it to the reader to decide that for herself. Larger screenshot.  (scroll to bottom for update).

List of Topics

1. data digital types repository shared sets collected established serve place major variety small reasonable physionet low prevent establishing archived acquired continuing certified helpful aspects trial presented interested strategic base releasing spent put conventions decades directorate attention campus proteomics capturing confidential leveraging subsequent choices articulated efficiently initiated learned deposit methods meaningful
2. long community term project stewardship effort share practice nature al researcher projects driven ensuring opportunities supported individual exchange genetics end short collection includes defined find maintaining brazma responsibility distributed building addressed hosting hard communication ode advantage piwowar evaluation maintenance treated emerge evolution enables descriptive reducing widespread networks planning months met
3. publication require archiving ensure journals users field guidelines database high publishing quality significant reports primary supporting sharing widely ecg privacy life basis integrity challenge identify biomedical leads concerns underlying annotations electronic open mandates dryad enabling progress min clinical instance hours standardized parties released efficient permanent confidentiality expected individual remain beat
4. data open important identifiers grants future analysis persistent benefit unique free full step multiple requirement source point exist citations requiring repositories availability note accessibility list critical today reference researcher subjects encouraged lost increased experts raw larger move restrictions gathering ease image aera transparent considerations generate problems hand consortia noted outcome
5. economic comment centers creation gov easier growth allowing elements consideration washington id decisions icpsr fr education number statistical offer team de locations dc informed performance assessment st big usa controls inclusion practical skills digitaldata colleagues statistics run forward pp street home expert partnerships laboratory tt protections frameworks asked accessed fit
6. resources scholarly infrastructure including datasets develop potential models innovation key approach organization sustainable business number program online area reporting increase open greater machine global recognized adopted integration basic exclusive contributions represent great submission standardization infrastructures sufficient capacity understanding internet past importance ddi continued lifecycle learning early communications traditional expect website
7. policies agency costs benefits developing system differences respect burden general proposed relative stakeholders recognize article databases focus force problem interest fund task critical recommended difficult report longer scientist flexible cases students participate position blue rapidly initiatives financial november written requiring possibility range recognizing demonstrate framework longterm line networked books blog
8. data repositories time government technical relevant question private created products period raw investigators trusted collections record determine challenges continue change resources industry medical adopt evidence gis direct bodies intended sector ready track usability measures partners fully stored purposes responses personal structure companies functions extensive consortium host solution integrated countries manage
9. standards digital publications http reuse org interoperability www needed linking enable datacite repurposing format iso purposing orcid emerging migration uk openly define cultural initiatives worldwide inform inter verified eu promotes index beneficial approval site html pdf components likelihood seal insight circumstances creativecommons computing mechanism operations strongly ansi significantly previous permits
10. researchers grant published data nsf include articles journal funds datasets related part dataset fields programs cases code means final result materials papers applications lack similar education act highly cited species included dollars rules document supplemental paper submitted protected receive generally advance investments findings date subsequent active evolve administrative assume america
11. data citation organizations discipline set stewardship archive verification alliance ndsa licensing members generated minimized ongoing center single resource purpose complex embargo criteria location focused individuals grantees usable analyses multi present norms maintain committed signals independent detailed barriers selection patents description protocols transparency associations protein distribution engaged managers mandating actively usage
12. access public digital preservation information rfi minimum providing encouraging broadly experiment miame microarray page discoverability taxpayer tool comprehensive unclassified cyberinfrastructure utilize represented verifiable commission piece equilibrium measurements decentralized iwgdd capable entrepreneurs fear depositing sustain paleoanthropology recording enhancing permitted michael authority distinct confusion constitutes studies personnel scope strengths mechanism intuitive expression
13. work copyright required deposit level content current commons institutional collaboration years good creative licenses cc form society collaborative activities license publisher broad considered works subject mandate dois experience areas terms institution incentive success protection law participation start identify consistent diverse fees patent procedures recent knowledge boundaries action kind facts core
14. data management sharing plans requirements plan implementation contribute part proposals meet states proposal united include ethical complete awareness professionals interdisciplinary implement capture priorities kinds reviewers nasa endeavor explicit book dataone balance techniques criteria submissions mandatory physiotoolkit ad mining copyrighted contact sage statements healthy direction cycle increasingly relating goods contributing specification
15. scientific research funded federally resulting dissemination american discovery productivity enterprise valuable metrics rewards taxpayer reason diversity discussions existence assigning successfully reduction isn allocate secret attempts overcome evaluation visualizations billion organizing trained operational occurs documentary provided plays quickly connection measurement assessing conjunction recorded ore animals broadening archivist modes game hosts argue
16. funding provide costs mechanisms address issues preserving disciplinary questions real improved provided expertise comments requires study ways search methods recommendations specifically result social extent issue increasing establishment ieee minimal collaborations manner tracking participants answer cooperative copies considerable cover posed greatest path stage budget sensitive comply fostering citations selected exploitation basic
17. existing standard create tools economy archives web grow software formats model large sciences world build markets jobs industries order preserve linked wide proprietary activity pay lead biology directly scale type promoting people computational easily permit definition limited achieve production improving sites machines interoperable view text follow kitware controlled concept conduct
18. data make scientists developed review sharing produced peer assure original reviewed mechanisms legal easy incentives publish understand simple principles projects year makes literature produce regard viewed citing reward prior scholarship store facilitate due responsible expensive explore domains details sense assess ecosystem collections times display perform topics pass latency handle genetic
19. services accessible based publicly making university domain library additional curation innovative professional systems stimulate societies network addition author january state local retention adoption apply technologies infrastructure physical offer conditions material steps congress added lab fact physics baseline phd ecological banding california award market specialized allowed options fully press environmental bird
20. data metadata international cost common storage impact doi identifier essential creating facilitate deposited goal freely ensure object proper service central levels link ffsr producers semantic guidance acquisition file broader documents reasons culture links nations european agreements providers identifying fits cross astronomy committees issue partnership citable desired necessarily starting security job
21. information science policy national technology ostp request response foundation context nih opportunity human office social health council institute committee service member engineering studies genome medicine respond concern strong input institutes writing acra pub contract publically care enhance stakeholder behalf greatly dedicated administration design educational commitment similar detail dependent quantitative barriers
22. research communities institutions stakeholders libraries researchers universities clear establish knowledge academic investment government user association recognition range training ideas consensus managed vital survey return manage develop biological submit higher archival found advances back preserved growing board legitimate publication periods matter investigator limited mandated expressed vision collecting trust actual energy aabb
23. federal agencies compliance disciplines encourage improve effective promote approaches verify account coordination inherent measure differences control staff recommend faculty creation give files coordinate register regulations processing monitor describe collaborate accommodate operate discoveries redundant option layer supporting closely recommends dublin familiar university claim ensures worthwhile changing sponsored ample deep joint huge
24. specific publishers intellectual property working steps interests rights protect group groups authors funders commercial case report release profit publishing involved supports simply interest ability environment niso librarians issues initiative ip core managing discussion explicitly mission director primarily responsibilities expectations money transfer amounts applicable outcomes points interagency don stakeholder foster openness
25. support standards development practices results made process efforts attribution successful credit examples role secondary processes reported characteristics applied play sources goals documentation cite pm producing maintain maximize values computing draft download post provenance meeting biodiversity facilitates proposals accreditation helping educators small spread reader bermuda keeping funder intensive certification domains replication

UPDATE: This is why the DH & Twitter community is so awesome. I mentioned to Bethany that a one-mode network (topics joined directly to other topics on the basis of the shared composition of a line) would provide a ‘truer’ picture than my two-mode network, and Scott Weingart duly did the heavy lifting:


…and the resulting visualization shows that things boil down into two communities, with topics 24, 10, and 12 being most prominent. So which is right, one-mode or two-mode? While a two-mode network makes more apparent common sense, in terms of analyses and metrics you want to go with the one-mode version. The thickness of a line depicts a stronger relationship between the two topics.
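Converting the two-mode network into a one-mode one is a projection: topics become directly linked whenever they co-occur in the same line, and the edge weight is the number of lines they share. A Python sketch, with an invented line-to-topics mapping in place of the real data:

```python
from itertools import combinations
from collections import Counter

# One-mode projection of a two-mode (line x topic) network: topics sharing
# a line get an edge, weighted by the number of shared lines.
# The mapping below is illustrative only.
line_topics = {
    "line1": [24, 10],
    "line2": [24, 10, 12],
    "line3": [10, 12],
}

topic_edges = Counter()
for topics in line_topics.values():
    for a, b in combinations(sorted(topics), 2):
        topic_edges[(a, b)] += 1
```

It’s those weights that Gephi draws as line thickness – a thicker edge means the two topics co-occur in more lines.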

UPDATE 2: Scott Weingart’s Gephi visualization of the same materials, with the topics’ top words swapped in for the numbers.