Topic Modeling With the JAVA GUI + Gephi

I’ve been having an interesting conversation with Ben Marwick, in the comments thread of my initial ‘Getting Started with Topic Modeling’ post. Ben pointed me to an interesting GUI for Mallet, which may be downloaded here. I’ve been trying it out this morning, and I like what I’m seeing. Topic modeling is becoming more and more popular amongst the Digital Humanities crowd. An interesting automated approach to generating networks of topics and ideas from texts is reported by Scott Weingart, using the writings of Newton.

While I have nothing near so polished available, the GUI for Mallet used with Gephi can do nearly the same thing. My body of data comes from Writing History in the Digital Age. An earlier experiment with the same data is recounted here. I re-ran the data using the GUI approach, and have to say, this is a much easier and more accessible approach. Run the program; select the folder with your txt documents in it; select the target number of topics; select the appropriate language stopwords list if necessary; hit ‘train topics’. What is very neat about this program is that it presents its output in both html and csv.

So in the spirit of crowdsourcing, I’ve put the output files online, and haven’t tried to decide yet what the topics might mean. Instead, why don’t you view the files for yourself, and let’s identify the topics using the comments of this post?

I then took the CSV files, and got them ready for import into Gephi. Decide which two columns you’d like to represent as being connected, and prune away the extraneous data. I took the ‘topicsindocs.csv’ file, and pruned it so that each paragraph of each author is paired with its major topic. I then stripped away the info about the paragraph itself, so that the resulting visualization shows only authors connected to the topics they write about. In the screenshot below, you can see the open Gephi file with my own ‘Wikiblitz’ article highlighted, and its connections.
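If you would rather script the pruning than do it by hand in a spreadsheet, here is a minimal sketch in Python. It is not the procedure I actually followed, and it rests on assumptions: that each row holds a file name in the second column and the id of the document's top topic in the third, and that file names follow a hypothetical ‘author_para1.txt’ pattern. Check your own output file and adjust the column positions and the name-splitting rule to match. The only firm requirement is on the Gephi side, whose spreadsheet import expects an edge table with ‘Source’ and ‘Target’ columns.

```python
import csv
import io
from collections import Counter


def author_from_filename(filename):
    # ASSUMPTION: files are named like "author_para12.txt".
    # Drop the extension, then keep the part before the first underscore.
    return filename.rsplit(".", 1)[0].split("_", 1)[0]


def build_edges(rows, filename_col=1, topic_col=2):
    # Count how many paragraphs by each author have a given top topic.
    # The column positions are assumptions; adjust them to your file.
    edges = Counter()
    for row in rows:
        if len(row) <= max(filename_col, topic_col):
            continue  # skip short or empty rows
        author = author_from_filename(row[filename_col].strip())
        topic = "topic_" + row[topic_col].strip()
        edges[(author, topic)] += 1
    return edges


def write_edge_list(edges, out):
    # Gephi's spreadsheet import reads Source/Target (and Weight) columns.
    writer = csv.writer(out)
    writer.writerow(["Source", "Target", "Weight"])
    for (author, topic), weight in sorted(edges.items()):
        writer.writerow([author, topic, weight])


# Made-up sample rows standing in for the pruned topics-in-docs file.
sample = io.StringIO(
    "docId,filename,toptopic,proportion\n"
    "0,graham_para1.txt,3,0.61\n"
    "1,graham_para2.txt,3,0.44\n"
    "2,graham_para3.txt,7,0.52\n"
    "3,dougherty_para1.txt,7,0.70\n"
)
reader = csv.reader(sample)
next(reader)  # skip the header row
edges = build_edges(reader)

out = io.StringIO()
write_edge_list(edges, out)
print(out.getvalue())
```

Collapsing paragraphs to authors means an author who returns to the same topic several times gets one heavier edge rather than several parallel ones, which Gephi can then use as edge weight.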

What’s also interesting is that when I ran the ‘modularity’ routine – identifying communities based on patterns of self-similarity of ties – only four communities emerged (albeit with a very low modularity measurement, 0.235, which suggests that these communities are not all that strong). A natural grouping of the papers, perhaps? (by the way, here’s the pdf/svg file).
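To give a feel for what a modularity score measures, here is a small sketch in Python that scores a partition you hand it. It does not reproduce Gephi's routine, which searches for a good partition using the Louvain method; it only computes the standard modularity value Q for an unweighted, undirected graph, summing over communities the share of edges that fall inside the community minus the share expected by chance. The toy graph and the function name are made up for illustration.

```python
from collections import defaultdict


def modularity(edges, community):
    # Q = sum over communities c of  L_c / m  -  (d_c / (2 * m)) ** 2
    #   m   = total number of edges
    #   L_c = number of edges with both ends inside community c
    #   d_c = sum of the degrees of the nodes in community c
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)
    degree = defaultdict(int)
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Toy graph: two triangles joined by a single bridging edge.
toy_edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
toy_partition = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

q = modularity(toy_edges, toy_partition)
print(round(q, 3))  # 0.357
```

Even this very tidy toy graph only scores about 0.36, and lumping every node into a single community scores exactly 0, which gives some context for reading a value like 0.235.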

10 Comments

  1. Shawn says:

    Postscriptum: You can set up your browser so that Gephi can listen in. Select the option to generate an http graph dynamically from a new project within Gephi, and then as you click through (or scrape) the links from the output files generated by the Java GUI, you end up with a graph of the connections rather automatically.

  2. Ben says:

    Thanks for the hat-tip! Thanks also for sharing the files, they were helpful for me to see how to get Gephi working. I look forward to seeing more of what you and your students do with this.

  3. Alexander Semenov says:

    Hi!
    I tried it on my texts written in Russian and got some incoherent “topics” like:
    word xy le relspk app image nc ms ih yh
    Is it a problem with encoding or with language?

    • Shawn says:

      Hi Alexander,

      I think it might be your input files. I’ve gotten results like that myself when I inadvertently used .doc files instead of clean .txt files. You can run the input using clean .csv too, where each cell in the first column contains the individual parcel of text (paragraphs, email, chapters, what have you).

  4. [...] topic modeled your blog posts (for directions, methods, see this post) to see what the structure was. In a nutshell, topic modeling looks for patterns of word use. [...]

  5. [...] cut off which edges to use, especially with humanistic and inferred data. That’s what Shawn Graham showed us how to do when combining topic models with networks. The network was one of authors and topics; which authors wrote about which topics? The data itself [...]

  6. [...] I can point the Mallet Java GUI at the original csv file and topic model all of the posts (400 iterations, with 40 topic words and [...]

  7. [...] of MALLET is that the output can be a bit opaque without putting it into another environment. Shawn Graham has a great series on using the Gephi GUI to process it (if you want to use MALLET yourself, his how-to guide is an amazing resource; we have a [...]

  8. Bruna Tres says:

    Hi, sorry for my English.
    I’m from Brazil. I am doing my final degree project on social networks and will use Gephi to present the results. But I can’t find a way to create the database. I’ve tried several tutorials, but none of them shows me how to create it. Can you help me?
