Elijah Meeks and Mat Jockers have both used word clouds to visualize topics from topic models. Colour, orientation, relative placement of the words – all of these could be used to convey different dimensions of the data. Below, you’ll find clouds for each of my initial 50 topics generated from the Roman materials in the Portable Antiquities Scheme database (some 100 000 rows, or nearly 1/5 of the database, collected together into ‘documents’, where each unitary district authority is a ‘document’ and its text is the descriptions of the things found there). The word clouds are generated from the word weights file that MALLET can output. There are 8100 unique tokens when I convert the database into a MALLET file; each one of those is present in every ‘bag of words’, or topic, that MALLET generates, but to differing degrees. Thus, word clouds (here generated with Wordle) pull out important information that the word keys document does not. However, given that I optimized the interval whilst generating the topic models, the keys document does provide an indication of the strength of each topic in the corpus. I’ve arranged the word clouds top to bottom, left to right, scaling each against the size of the strongest topic (topic 22). I’ll be damned if I can get WordPress to just display each image under the other one. Even stripped my table out, it did!
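For anyone wanting to pull the heaviest words per topic out of a word weights file before handing them to a cloud generator, here is a minimal sketch in Python. It assumes each line of the file holds a topic number, a word and a weight separated by tabs; the function name `top_words_per_topic` and the sample rows are made up for illustration and are not taken from my actual output.

```python
from collections import defaultdict
import heapq


def top_words_per_topic(lines, n=50):
    """Return {topic_id: [(word, weight), ...]} holding the n heaviest words per topic.

    Assumes each line looks like: topic<TAB>word<TAB>weight
    """
    weights = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        topic, word, weight = line.split("\t")
        weights[int(topic)].append((word, float(weight)))
    # Keep only the n largest weights in each topic, heaviest first.
    return {
        topic: heapq.nlargest(n, pairs, key=lambda pair: pair[1])
        for topic, pairs in weights.items()
    }


# Invented sample rows standing in for a real word weights file.
sample = [
    "0\tcoin\t120.01\n",
    "0\tbrooch\t3.01\n",
    "0\tnummus\t45.01\n",
    "1\tcoin\t0.01\n",
    "1\tbrooch\t88.01\n",
    "1\tnummus\t0.01\n",
]

top = top_words_per_topic(sample, n=2)
for topic_id in sorted(top):
    print(topic_id, top[topic_id])
```

With a real file you would pass an open file handle instead of `sample`. Every token appears in every topic, so trimming to the top few dozen words keeps each cloud readable.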
At any rate, as one churns through the 50 topics, after about the first 11 (depicted below) the topics get progressively noisier, as MALLET attempts to deal with incomplete transcriptions of the epigraphy of the coins and with the frequent notes about the source for the identification of the coins (the work of Guest & Wells). The final topic depicted here, topic 20, directly references a note often left in the database concerning the quality of an individual record; these notes are frequently attached to materials that entered the British Museum collection before the Portable Antiquities Scheme got going, and hence the information is not up to the usual standards.
This exercise, then, suggests to me that 50 topics is just too many. I’m rerunning everything with 10 topics this time.