Topic Modeling #dh2013 with Paper Machines

I discovered the pdf with all of the abstracts from #dh2013 on a memory-stick-cum-swag this AM. What can I do with these? I know! I’ll topic model them using Paper Machines for Zotero.

Iteration 1.
1. Drop the pdf into a zotero collection.
2. Create a parent item from it.
3. Add a date (July 2013) to the date field on the parent item.
4. Right click on the collection, extract text for paper machines.
5. Right click on the collection, topic model –> by date.
6. Result: blank screen.

Right-click the collection, ‘reset papermachines output’.

Iteration 2.
1. Split the pdf into separate pages, keeping only the abstracts themselves (pp. 9 – 546).
2. Drop the pdfs into a zotero collection.
3. Create parent items for them. (Firefox hangs badly at this stage, and keeps redirecting, for reasons I don’t know.)
4. Add dates to the date field; grab these by hand from the dh schedule page. God, there’s gotta be an easier way of doing this. Actually, I’ll just skip this for now and hope that the sequential page numbers/multiple documents will suffice.
5. Right click on the collection, extract text for paper machines.
6. Right click on the collection, topic model –> by date.
7. Result: IndexError: index out of range: -1.

Right-click the collection, ‘reset papermachines output’.

Iteration 3.
Jump directly to #4, add dates to date field. In the interests of getting something done this morning, I will give them all the same date – a range from July 16 – July 19. If I gave them all their correct dates, you’d get a much more granular view. But I’m adding these by hand. (Though there probably exists some sort of batch edit for Zotero fields? Hang on: I right-click, choose ‘change fields for items’, type ‘date’ for the field, put in my range, and hey presto! Thanks, Zotero.)
5. Right click on the collection, extract text for paper machines.
6. Right click on the collection, topic model –> by date.
7. Result:


Chased down the folder where all of these were being stored. Aha. Each extracted text file is blank. Nice.
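A quick way to confirm that diagnosis without opening files one by one is to scan the output folder for empty text files. This is only a minimal sketch: the folder path is whatever you pass in, and `find_blank_text_files` is a hypothetical helper name, not something Paper Machines provides.

```python
from pathlib import Path


def find_blank_text_files(folder):
    """Return the .txt files under `folder` that are empty or whitespace-only."""
    blank = []
    for path in sorted(Path(folder).rglob("*.txt")):
        # errors="ignore" so an odd byte in one file does not stop the scan
        text = path.read_text(encoding="utf-8", errors="ignore")
        if not text.strip():
            blank.append(path)
    return blank


if __name__ == "__main__":
    import sys

    # Assumption: the folder to check is given on the command line.
    folder = sys.argv[1] if len(sys.argv) > 1 else "."
    for path in find_blank_text_files(folder):
        print(path)
```

If every file turns up in that list, the problem is in the text extraction, not in the topic modeling step.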

Blow this for a lark. Sometimes, folks, the secret is to go away, and come back later.

Update: I tweeted:

And then walked away for a while. Came back, and went to the TEI file. I used Notepad++ to strip out everything but the abstracts. I saved it as a csv. Then, in Excel, I used a custom script I found lying about on teh webs to turn each line into its own txt file. Then I copied the directory into Zotero. I gave each txt file its own parent. I mass edited those items so that they all carried the date July 16 – 19 2013. Then I extracted texts (which seems redundant, but you can’t jump ahead).
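The one-line-per-file step does not strictly need Excel. Here is a rough Python sketch of the same idea, not the script I actually used: it assumes a plain text file with one abstract per line, and the function name, the `abstract` filename prefix, and the output folder are all made up for illustration.

```python
from pathlib import Path


def split_lines_to_files(source_file, out_dir, prefix="abstract"):
    """Write each non-empty line of `source_file` to its own .txt file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    with open(source_file, encoding="utf-8") as handle:
        abstracts = [line.strip() for line in handle if line.strip()]

    # Zero-pad the counter so the files sort in their original order.
    width = max(3, len(str(len(abstracts))))
    written = []
    for number, text in enumerate(abstracts, start=1):
        target = out / f"{prefix}_{number:0{width}d}.txt"
        target.write_text(text + "\n", encoding="utf-8")
        written.append(target)
    return written
```

Calling `split_lines_to_files("abstracts.csv", "abstracts_txt")` (both names hypothetical) would leave a folder of numbered txt files ready to drag into a Zotero collection.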

And then I selected topic modeling by time.

Which at least created a topic model, but it didn’t make the stream graph. The heat map worked, but all it showed was the US, UK, and Germany. And Florida, for reasons unexplained.

So I went back to Gephi for my topic model visualization. I used Ben Marwick’s Mallet-in-R script to do the modeling and to transform the output so I could easily visualize the correlations. Behold, I give you the network of strongly correlated #dh2013 abstracts by virtue of their shared topics:


It’s coloured by modularity and sized by betweenness, which gives us groups of abstracts and the identification of the abstract whose topics/discourse/text do all of the heavy lifting. A brief glance at the titles suggests that these papers are all concerned with issues of data management of text.
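For anyone wondering how a topic model becomes a network: the general idea is to correlate each pair of abstracts on their topic proportions and keep only the strong pairs as edges. The sketch below is a Python stand-in for that idea, not Ben Marwick's R script. It assumes you already have a dictionary mapping each document name to its list of topic proportions, and the 0.7 cutoff is an arbitrary value chosen for illustration.

```python
import csv
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat vector has no defined correlation
    return cov / sqrt(var_x * var_y)


def correlated_edges(doc_topics, threshold=0.7):
    """Return (doc_a, doc_b, r) for every pair correlated at or above threshold."""
    edges = []
    for a, b in combinations(sorted(doc_topics), 2):
        r = pearson(doc_topics[a], doc_topics[b])
        if r >= threshold:
            edges.append((a, b, r))
    return edges


def write_gephi_edges(edges, path):
    """Write an edge list with the Source/Target/Weight columns Gephi imports."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Source", "Target", "Weight"])
        for a, b, r in edges:
            writer.writerow([a, b, f"{r:.4f}"])
```

The resulting csv can be loaded through Gephi's spreadsheet import, after which modularity and betweenness are computed inside Gephi itself.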

All of this data is up on Figshare, and I’ll provide some further reflections later. Currently, this machine is hanging up on me frequently, and I want to get this out before it crashes. Here are the topics; you can add labels if you’d like, but the top three seem to be ‘publishing & scholarly communication’; ‘visualization’; ‘teaching’:

Correlated topics at #dh2013

0.35142 humanities digital social scholarly http research history accessed work community scholarship www access dh journal publication citation communication publishing
0.28061 literary reading analysis visualization text texts digital literature century studies media topic humanities corpus mining modeling press textual paper
0.21684 digital humanities students university teaching research dh participants workshop projects education pedagogy program tools academic arts graduate project resources
0.18993 digital collections research collection content researchers users access library user resources image images libraries archives metadata cultural information tools
0.14539 tei text document documents encoding markup xml texts index london indexing http uk html encoded links search version modern
0.11833 data historical map time gis information temporal maps university spatial geographic locations texts geographical place names mapping date dates
0.11792 crowdsourcing digital project public states united archaeological america archaeology projects poster university virginia web community social civil media users
0.11289 systems model modeling system narrative media theory elements classification type features user markup ic gesture expression representation press character
0.09601 editions edition text scholarly digital women editing collation print textual texts tools http image manuscript electronic editorial versions environment
0.08569 authorship author words texts corpus attribution characters frequency plays fig classification results number novels genre authors analysis character delta
0.08016 semantic annotation web linked open ontology data rdf scholarly http ontologies research annotations information review project metadata knowledge org
0.07777 social network networks analysis graph relationships characters group graphs jazz science family de interaction publication relationship nodes discussion cultural
0.06328 language corpus text txm http german de web lexicon platform corpora tools analysis unicode research annotation encoding languages lexus
0.05286 digital knowledge community fabrication migration book open feminist learning field knitting desktop world practices cultural experience work lab academic
0.04856 text analysis programming voyant tools ca poster interface alberta live rank sinclair http latent environments ualberta touch screen environment
0.04131 words poetry word text poem texts poetic ford english author segments conrad analysis language poems zeta newton mining chapters
0.0364 simulation information time content model vsim environment narrative abm distribution feature light embedded study narratives virtual japan plot resources
0.03538 query search google alloy xml language words typesetting algorithm de detection cf engine speech mql algorithms body searches paris
0.01131 de la el homer movement uncertainty en se clock catalogue del astronomical una movements para los dance las imprecision
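If you want to play with the list above, each line is just a weight followed by the topic's key words, so it is easy to parse and rank. A minimal sketch, assuming the lines are pasted in exactly that format; the function names are my own, and the sample lines are shortened copies of the first three and the last topics.

```python
def parse_topic_line(line):
    """Split 'weight word word ...' into (weight, [words])."""
    weight, *words = line.split()
    return float(weight), words


def top_topics(lines, n=3, words_shown=5):
    """Return the n heaviest topics, each with its first few key words."""
    topics = [parse_topic_line(line) for line in lines if line.strip()]
    topics.sort(key=lambda topic: topic[0], reverse=True)
    return [(weight, words[:words_shown]) for weight, words in topics[:n]]


if __name__ == "__main__":
    sample = [
        "0.01131 de la el homer movement uncertainty",
        "0.28061 literary reading analysis visualization text texts",
        "0.35142 humanities digital social scholarly http research",
        "0.21684 digital humanities students university teaching research",
    ]
    for weight, words in top_topics(sample):
        print(f"{weight:.5f}  {' '.join(words)}")
```

Labels still have to come from a human reading the words; the code only sorts.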
