My students and I have made some contributions to ‘Writing History in the Digital Age’, the born-digital volume edited by Jack Dougherty and Kristen Nawrotzki. Rather than reflect on the writing process, I thought I’d topic model the volume to see what patterns emerged in the contributions.
I used Mallet to do this. I’ve posted earlier about how to get Mallet running. I used Outwit Hub to scrape each individual paragraph from each paper (> 700 paragraphs) into a CSV file (I did not scrape block quotes, so my paragraph numbers are slightly out of sync with those used on the Writing History website). I then used the Textme Excel macro (google it; it lives in multiple versions and requires a bit of modification to work exactly the way you want it to) to save each paragraph into its own text file, and loaded those files into Mallet.
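For anyone who would rather not wrestle with the macro, the paragraph-splitting step can be sketched in a few lines of Python. This is an illustration rather than the Textme macro itself: the column name `paragraph` and the `para_0001.txt` naming scheme are assumptions, so adjust them to match your own CSV.

```python
import csv
from pathlib import Path


def split_paragraphs(csv_path, out_dir, text_column="paragraph"):
    """Write each CSV row's paragraph text to its own numbered .txt file.

    `text_column` is a hypothetical header name; change it to whatever
    your scraper called the column holding the paragraph text.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for number, row in enumerate(csv.DictReader(handle), start=1):
            text = (row.get(text_column) or "").strip()
            if not text:
                continue  # skip empty rows rather than create empty files
            target = out / f"para_{number:04d}.txt"
            target.write_text(text, encoding="utf-8")
            written.append(target)
    return written
```

Calling `split_paragraphs("paragraphs.csv", "paragraphs_txt")` would leave a folder of one-paragraph files that Mallet can import as a directory. Because the files are numbered by CSV row, a skipped empty row leaves a gap in the numbering instead of shifting later paragraphs.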
Phew. Now, the tricky part with Mallet is deciding how many topics you want it to look for. Finding the *right* number of topics requires a bit of iteration – start with, say, 10. Look at the resulting composition of files to topics. If an inordinate number of files all fall into one topic, you don’t have enough granularity yet, so raise the number of topics and run it again.
As an initial read, I went with 15 topics. One topic – which I’ll label ‘working with data’ – had quite a large number of files in the composition document (remember, the ‘files’ here are the individual paragraphs from the papers). Ideally, I would re-run the analysis with a greater number of topics, so that the ‘working with data’ topic would get broken up.
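The ‘too many files in one topic’ check can be done by eye, but here is a rough sketch of how one might count it. It assumes the composition file (what Mallet writes when asked for `--output-doc-topics`) is tab-separated with a document index, a file name, and then one proportion per topic in topic order. Some Mallet versions write topic/proportion pairs instead, so check your own file before trusting this. The function names are mine, not Mallet's.

```python
from collections import Counter


def dominant_topic_counts(lines):
    """Count how many documents have each topic as their largest share.

    Assumed line layout (tab-separated): index, document name, then one
    proportion per topic in topic order. Lines starting with '#' are skipped.
    """
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        proportions = [float(value) for value in fields[2:]]
        if not proportions:
            continue  # malformed line with no topic columns
        # Index of the largest proportion is the document's dominant topic.
        top = max(range(len(proportions)), key=proportions.__getitem__)
        counts[top] += 1
    return counts


def largest_share(counts):
    """Return (topic, fraction of documents) for the most crowded topic."""
    topic, n = counts.most_common(1)[0]
    return topic, n / sum(counts.values())
```

Passing the open composition file to `dominant_topic_counts` and the result to `largest_share` gives one number to watch between runs. If a single topic keeps swallowing a large fraction of the paragraphs, that is the signal to try more topics. What counts as ‘inordinate’ is still a judgment call.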
I also graphed the results, so that each author is linked to the topics which compose his or her paper; the thicker the line, the more paragraphs carry that topic. I have also graphed topics by individual paragraphs, but the granularity isn’t ideal, making the resulting visual not all that useful. The colours correspond with the ‘modularity’ of the graph, that is, communities of similar patterns of connections. The size of the node represents its ‘betweenness’, that is, how often it lies on the shortest paths between every pair of nodes.
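The author-to-topic graph boils down to a weighted edge list: one edge per author and topic pair, weighted by the number of paragraphs. A minimal sketch of that bookkeeping follows. The input format (one `(author, topic)` pair per paragraph) and the author names in the usage note are hypothetical, and this is not necessarily how the graph above was built.

```python
from collections import Counter


def author_topic_edges(assignments):
    """Collapse (author, topic) pairs, one per paragraph, into weighted edges.

    Returns (author, topic, weight) tuples, heaviest edge first, where the
    weight is the number of paragraphs by that author with that topic.
    """
    weights = Counter(assignments)
    return sorted(
        ((author, topic, weight) for (author, topic), weight in weights.items()),
        key=lambda edge: (-edge[2], edge[0], edge[1]),
    )
```

For example, two paragraphs by ‘Author A’ on ‘Working with Data’ and one on ‘Keywords and Search’ become two edges with weights 2 and 1. The resulting list can be written out as a CSV for whatever graph software you use. A library such as networkx also provides `betweenness_centrality` and modularity-based community detection if you would rather compute node size and colour in code.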
So what does it all mean? At the level of paragraph-by-topic, if we had the correct level of granularity, we might be able to read the entire volume by treating the graph as a guide to hyperlinking from paragraph to paragraph, perhaps – a machine-generated map/index of the internal structure of ideas. At the level of individual authors, it perhaps suggests which papers to read together, and what the organizing themes of the volume are.
This is of course a quick and dirty visualization and analysis, and these are my initial impressions. More time, more consideration, and greater granularity are to be desired.
|Working with Data||1|
|African Americans and the South||2|
|Primary Resources, Teaching, and Libraries||2|
|Blogging and Peer Interactions||3|
|Keywords and Search||4|
|Japan and History||4|
|Space and Geography||4|