I’ve been having a good conversation with Ben Marwick in the comments thread of my initial ‘Getting Started with Topic Modeling’ post. Ben pointed me to a GUI for Mallet, which may be downloaded here. I’ve been trying it out this morning, and I like what I’m seeing. Topic modeling is becoming more and more popular amongst the Digital Humanities crowd. Scott Weingart reports an interesting automated approach to generating networks of topics and ideas from texts, using the writings of Newton.
While I have nothing near so polished available, the GUI for Mallet used with Gephi can do nearly the same thing. My body of data comes from Writing History in the Digital Age. An earlier experiment with the same data is recounted here. I re-ran the data using the GUI approach, and have to say, this is a much easier and more accessible approach. Run the program; select the folder with your txt documents in it; select the target number of topics; select the appropriate language stopwords list if necessary; hit ‘train topics’. What is very neat about this program is that it presents its output in both HTML and CSV.
So in the spirit of crowdsourcing, I’ve put the output files online, and haven’t tried to decide yet what the topics might mean. Instead, why don’t you view the files for yourself, and let’s identify the topics using the comments of this post?
I then took the CSV files and got them ready for import into Gephi. Decide which two columns you’d like to represent as being connected, and prune away the extraneous data. I took the ‘topicsindocs.csv’ file and pruned it so that each paragraph of each author is paired with its major topic. I stripped away the info about the paragraph itself, so that the resulting visualization simply links authors to the topics they write about. In the screenshot below, you can see the open Gephi file with my own ‘Wikiblitz’ article highlighted, along with its connections.
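The pruning described above could also be scripted. Here is a minimal sketch in Python, with some loud assumptions: the column names `filename` and `top_topic`, the `<author>_<paragraph>.txt` file naming scheme, and the sample rows are all hypothetical, so check them against the actual header of your own CSV before using anything like this. The output uses `Source` and `Target` as column headers, which Gephi’s spreadsheet import reads as an edge list.

```python
import csv
import io


def author_topic_edges(rows, doc_col="filename", topic_col="top_topic"):
    """Yield one (author, topic) pair per paragraph row.

    doc_col and topic_col are assumed column names, not the tool's
    documented output format.
    """
    for row in rows:
        # Hypothetical naming scheme: "<author>_<paragraph>.txt".
        # Dropping everything after the last underscore discards the
        # paragraph information and keeps only the author.
        author = row[doc_col].rsplit("_", 1)[0]
        yield author, "topic_" + row[topic_col]


def write_edge_list(rows, out, **columns):
    """Write the pairs as a two-column edge list with Source/Target headers."""
    writer = csv.writer(out)
    writer.writerow(["Source", "Target"])
    for author, topic in author_topic_edges(rows, **columns):
        writer.writerow([author, topic])


# Invented sample data standing in for the real file.
sample = io.StringIO(
    "filename,top_topic\n"
    "authorA_p1.txt,3\n"
    "authorA_p2.txt,7\n"
    "authorB_p1.txt,3\n"
)
out = io.StringIO()
write_edge_list(csv.DictReader(sample), out)
edge_csv = out.getvalue()
print(edge_csv)
```

To run it on a real file, swap the `io.StringIO` objects for files opened with `open(..., newline="")` and pass your own column names as `doc_col=` and `topic_col=`.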
What’s also interesting is that when I ran the ‘modularity’ routine, which identifies communities based on patterns of self-similarity of ties, only four communities emerged (albeit with a very low modularity measurement, 0.235, which suggests that these communities are not all that strong). A natural grouping of the papers, perhaps? (By the way, here’s the pdf/svg file.)
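For a sense of what a modularity number measures, here is a minimal sketch of the standard (Newman) modularity score for a given grouping of nodes in an undirected, unweighted graph. It is not Gephi’s routine, which also has to search for the grouping in the first place; the function name and the toy graph of two triangles joined by a single edge are invented for illustration.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q of a partition.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to a community label.

    Q = sum over communities c of  L_c / m - (d_c / 2m)^2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the summed degree of the nodes in c.
    """
    m = len(edges)
    intra = defaultdict(int)       # edges with both ends in the community
    degree_sum = defaultdict(int)  # total degree of the community's nodes
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    return sum(
        intra[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Toy graph: two triangles (a-b-c and d-e-f) joined by the edge c-d.
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(round(q_two, 4), round(q_one, 4))
```

Putting every node in a single group scores zero, since that grouping captures no more internal ties than chance would. The more a grouping’s ties fall inside its communities rather than between them, the higher the score climbs.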