Maybe the question isn’t one of reading someone’s thoughts, but rather of listening to the overall pattern of topics within them. Topic modeling does some rather magical things. It imposes sense (it fits a model) onto a body of text. The topics that the model duly produces provide us with insight into the semantic patterns latent within the text (but see Ben Schmidt’s WEM approach, which focuses on systems of relationships in the words themselves; more on this anon). A variety of ways of visualizing these patterns are emerging. I’m guilty of a few myself (principally, I’ve spent a lot of time visualizing the interrelationships of topics as a kind of network graph, e.g., this). But I’ve never been happy with them because they often leave out the element of time. For a guy who sometimes thinks of himself as an archaeologist or historian, this is a bit problematic.
I’ve been interested in sonification for some time, the idea that we represent data (capta) aurally. I even won an award for one experiment in this vein, repurposing the excellent scripts of the Data Driven DJ, Brian Foo. What I like about sonification is that the time dimension becomes a significant element in how the data is represented, and how the data is experienced (cf. this recent interview on Spark with composer/prof Chris Chafe). I was once the chapel organist at Bishop’s University (I wasn’t much good, but that’s a story for another day), so my interest in sonification is partly in how the colour of music, the different instrumentation and so on, can also be used to convey ideas and information (rather than using purely algorithmically generated tones). I’ve never had much formal musical training, so I know there’s a literature and language to describe what I’m thinking that I simply must go learn. Please excuse any awkwardness.
So, let’s take a body of text, in this case the diaries of John Adams. I scraped these, one line per diary entry (see this csv we prepped for our book, The Macroscope). I imported it into R and topic modeled for 20 topics. The output is a monstrous csv showing the proportion each topic contributes to each diary entry (so each row adds to 1). If you use conditional formatting in Excel, and dial the decimal places to 2, you get a pretty good visual of which topics are the major ones in any given entry (and the really minor ones just round to 0.00, so you can ignore them).
It rather looks like an old-timey player piano roll:
Player Piano Anyone?
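The rounding trick described above can also be sketched in a few lines of Python, for anyone who would rather not squint at a spreadsheet. This is only a minimal illustration: the numbers in `doc_topics` are invented (three entries, four topics instead of twenty), and the helper names `round_row`, `visible_topics` and `dominant_topic` are hypothetical, not part of any topic modeling package.

```python
# Hypothetical topic proportions: one row per diary entry, one column per topic.
# These values are made up for illustration; each row adds to 1.
doc_topics = [
    [0.701, 0.250, 0.046, 0.003],
    [0.004, 0.120, 0.872, 0.004],
    [0.330, 0.330, 0.002, 0.338],
]


def round_row(row, places=2):
    """Round every proportion in an entry, as Excel does when set to 2 decimals."""
    return [round(p, places) for p in row]


def visible_topics(row, places=2):
    """Return the indices of topics that do not round down to zero."""
    return [i for i, p in enumerate(row) if round(p, places) > 0]


def dominant_topic(row):
    """Return the index of the topic with the largest proportion."""
    return max(range(len(row)), key=lambda i: row[i])


for entry_number, row in enumerate(doc_topics):
    print(entry_number, round_row(row), visible_topics(row), dominant_topic(row))
```

Each printed line gives the entry number, the rounded proportions, the topics that survive rounding, and the single loudest topic, which is roughly what the eye picks out of the conditional formatting.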
I then used ‘Musical Algorithms’ one column at a time to generate a MIDI file for each topic. I’ve got the various settings in a notebook at home; I’ll update this post with them later. I then uploaded each MIDI file (all twenty) into GarageBand in the order of their complexity – that is, as indicated by file size:
Size of a file indicates the complexity of the source. Isn’t that what Claude Shannon taught us?
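To give a feel for what turning a column of numbers into notes involves, here is a minimal sketch of one common approach: linearly rescaling the values onto a range of MIDI note numbers (which run from 0 to 127). I am not claiming this is what Musical Algorithms does internally, and it does not reproduce my settings; the function name `to_midi_pitches`, the default range of 36 to 96, and the sample column are all assumptions made for illustration.

```python
def to_midi_pitches(values, low=36, high=96):
    """Linearly rescale values onto integer MIDI note numbers in [low, high].

    Assumption: the smallest value maps to `low` and the largest to `high`.
    A flat column (all values equal) maps every entry to `low`.
    """
    smallest, largest = min(values), max(values)
    if largest == smallest:
        return [low] * len(values)
    span = high - low
    return [
        low + round((v - smallest) / (largest - smallest) * span)
        for v in values
    ]


# Hypothetical proportions for one topic across five diary entries.
topic_column = [0.00, 0.25, 0.50, 1.00, 0.10]
pitches = to_midi_pitches(topic_column)
print(pitches)  # [36, 51, 66, 96, 42]
```

The result is a list of note numbers, one per diary entry, so reading down the column becomes moving forward in time. Turning that list into an actual MIDI file (with durations, tempo and so on) is a separate step that this sketch leaves out.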
The question then becomes: which instruments do I assign to which topics? In this, I tried to select from the instruments I had readily to hand, and to select instruments whose tone/colour seemed to resonate somehow with the keywords for each topic. Which gives me a selection of brass instruments for topics relating to governance (thank you, Sousa marches); guitar for topics connected perhaps with travels around the countryside (too much country music on the radio as a child, perhaps); strings for topics connected with college and studying (my own study music as an undergrad influencing the choice here); and woodwinds for the minor topics that chirp and peek here and there throughout the text (some onomatopoeia, I suppose).
GarageBand’s own native visualization owes much to the player piano aesthetic, and so provides a rolling visualization to accompany the music. I used QuickTime to grab the GarageBand visuals, and iMovie to marry the video and audio together again, since QuickTime doesn’t grab the audio generated within the computer. Then I changed the name of each of the tracks to reflect the keywords for that topic.
Drumroll: I give you the John Adams 20: