I’ve been having an interesting conversation with Ben Marwick, in the comments thread of my initial ‘Getting Started with Topic Modeling’ post. Ben pointed me to an interesting GUI for Mallet, which may be downloaded here. I’ve been trying it out this morning, and I like what I’m seeing. Topic modeling is becoming more and more popular amongst the Digital Humanities crowd. An interesting automated approach to generating networks of topics and ideas from texts is reported by Scott Weingart, using the writings of Newton.
While I have nothing near so polished available, the GUI for Mallet used with Gephi can do nearly the same thing. My body of data comes from Writing History in the Digital Age. An earlier experiment with the same data is recounted here. I re-ran the data using the GUI approach, and have to say, this is a much easier and more accessible approach. Run the program; select the folder with your .txt documents in it; select the target number of topics; select the appropriate language stopwords list if necessary; hit ‘train topics’. What is very neat about this program is how it presents its output in both HTML and CSV.
So in the spirit of crowdsourcing, I’ve put the output files online, and haven’t tried to decide yet what the topics might mean. Instead, why don’t you view the files for yourself, and let’s identify the topics using the comments of this post?
I then took the CSV files, and got them ready for import into Gephi. Decide which two columns you’d like to represent as being connected, and prune away the extraneous data. I took the ‘topicsindocs.csv’ file, and pruned it so that each paragraph of each author is paired with its major topic. I stripped away the info about the paragraph itself, so that the resulting visualization shows just authors connected to the topics they write about. In the screenshot below, you can see the open Gephi file with my own ‘Wikiblitz’ article highlighted, and its connections.
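If you would rather script the pruning than do it by hand in a spreadsheet, something like the following Python sketch might work. It rests on several assumptions that are mine, not the GUI's: the column names `filename` and `top_topic`, the file-naming scheme `<author>_<paragraph>.txt`, and the sample rows are all hypothetical, so adjust them to whatever your own ‘topicsindocs.csv’ actually contains. The `Weight` column (how many paragraphs link an author to a topic) is an extra I added; the `Source` and `Target` headers are the ones Gephi looks for when importing an edge list.

```python
import csv
import io
from collections import Counter


def author_topic_edges(csv_text, doc_col="filename", topic_col="top_topic"):
    """Count how many paragraphs connect each author to each topic.

    Column names and the "<author>_<paragraph>.txt" naming scheme are
    hypothetical; change them to match your own output file.
    """
    edges = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Strip the paragraph info: keep only what precedes the last "_".
        author = row[doc_col].rsplit("_", 1)[0]
        edges[(author, "topic_" + row[topic_col])] += 1
    return edges


def write_gephi_edges(edges, out):
    """Write the pairs as an edge list with Source/Target/Weight headers."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (author, topic), weight in sorted(edges.items()):
        writer.writerow([author, topic, weight])


# Made-up sample rows, standing in for the real file.
sample = """filename,top_topic
authorA_01.txt,3
authorA_02.txt,3
authorA_03.txt,7
authorB_01.txt,7
"""

edges = author_topic_edges(sample)
buf = io.StringIO()
write_gephi_edges(edges, buf)
print(buf.getvalue())
```

For a real run, read the file's text in place of `sample` and write to an open file instead of `buf`; the resulting CSV can then be imported into Gephi as an edges table.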
What’s also interesting is that when I ran the ‘modularity’ routine, which identifies communities based on patterns of self-similarity of ties, only four communities emerged (albeit with a very low modularity measurement, 0.235, which suggests that these communities are not all that strong). A natural grouping of the papers, perhaps? (By the way, here’s the pdf/svg file.)
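For anyone wondering what that modularity number measures, here is a rough Python sketch of the standard score for an undirected, unweighted graph: for each community, take the share of edges that fall inside it and subtract the share you would expect if ties were placed at random given each node's degree. This only scores a grouping you hand it; it does not search for communities the way Gephi's routine does, and the toy graph and function name are mine, not taken from the data above.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges     -- list of (u, v) pairs, each edge listed once
    community -- dict mapping every node to a community label
    """
    m = len(edges)
    intra = defaultdict(int)   # edges with both ends in the community
    degree = defaultdict(int)  # summed node degrees per community
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    # Q = sum over communities of (observed share - expected share)
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy example: two triangles joined by a single edge.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
two_groups = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
one_group = {node: 1 for node in two_groups}

q_two = modularity(toy_edges, two_groups)
q_one = modularity(toy_edges, one_group)
print(round(q_two, 3), round(q_one, 3))
```

Splitting the toy graph into its two triangles scores higher than lumping everything into one group, which is the sense in which a higher value means more clearly separated communities.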
16 thoughts on “Topic Modeling With the JAVA GUI + Gephi”
Postscriptum: You can set up your browser so that Gephi can listen in: within a new Gephi project, select the option to generate an HTTP graph dynamically, and then as you click through (or scrape) the links from the output files generated by the Java GUI, you end up with a graph of the connections rather automatically.
I have the output topic model html file from mallet, but I am unable to graph it dynamically as you suggested using the HTTP Graph plugin. I configure my proxy settings to a manual proxy configuration (browser:Firefox,127.0.0.1:8088), but it works only when I browse, not when I open the html file and follow the links. Am I missing something? Thanks!
Hi Sabrina – I missed your comment (there’s been a barrage of spam lately). If you’re still working with this, send me an email and we can try to troubleshoot.
Thanks for the hat-tip! Thanks also for sharing the files, they were helpful for me to see how to get Gephi working. I look forward to seeing more of what you and your students do with this.
I tried it on my texts written in Russian and got some incoherent “topics” like:
word xy le relspk app image nc ms ih yh
Is it a problem with encoding or with language?
I think it might be your input files. I’ve gotten results like that myself when I inadvertently used .doc files instead of clean .txt files. You can run the input using clean .csv too, where each cell in the first column contains the individual parcel of text (paragraphs, email, chapters, what have you).
I exported them into UTF-8 *.txt files, and in the output I received only the English words =(
Hi, sorry for my English.
I’m from Brazil. I am doing my final-year project on social networks and will use Gephi to present the results. But I can’t find a way to create the database. I’ve tried several tutorials, but none of them shows me how to create it. Can you help me?