I’m experimenting. Here’s what I did today.
1. Justin Walsh published the data on which his book, ‘Consumerism in the Ancient World’, rests.
2. I downloaded it, and decided I would topic model it. The table, ‘Greek Vases’, has one row = one vase. Let’s start with that, though I think it might be more useful/illuminating to decide that ‘document’ might mean ‘site’ or ‘context’. But first things first; let’s sort out the workflow.
3. I deleted all columns with ‘true’ or ‘false’ values, since they struck me as not useful. I concatenated the remaining columns into a single ‘text’ column. Then, per the description on the Mallet package page for R, I added a new ‘class’ column, which I left blank. So I have ‘id’, ‘class’, and ‘text’; all of Walsh’s information is in the ‘text’ field.
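(Incidentally, that column-deletion and concatenation could be scripted rather than done by hand in a spreadsheet, which would make the prep reproducible. A minimal sketch, assuming the raw table is a CSV called ‘greek-vases.csv’ with the vase ID in the first column — the file name and ID position are my assumptions, not Walsh’s:)

```r
## sketch of the data prep -- "greek-vases.csv" and the id-in-column-1
## assumption are placeholders for the real download
vases <- read.csv("greek-vases.csv", stringsAsFactors = FALSE)

## drop the true/false columns
is.boolean <- sapply(vases, function(col) all(tolower(col) %in% c("true", "false", NA)))
vases <- vases[ , !is.boolean]

## concatenate everything except the id into a single 'text' field,
## and add the blank 'class' column the mallet package expects
documents <- data.frame(id    = vases[[1]],
                        class = "",
                        text  = apply(vases[ , -1], 1, paste, collapse = " "),
                        stringsAsFactors = FALSE)

write.table(documents, "modified-vases2.txt", sep = "\t",
            row.names = FALSE, col.names = FALSE, quote = FALSE)
```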
4. I ran this code in R, using RStudio:
## from http://cran.r-project.org/web/packages/mallet/mallet.pdf
library(mallet)

## Create a wrapper for the data with three elements, one for each column.
## R does some type inference, and will guess wrong, so give it hints with "colClasses".
## Note that "id" and "text" are special fields -- mallet will look there for input.
## "class" is arbitrary. We will only use that field on the R side.
documents <- read.table("modified-vases2.txt", col.names=c("id", "class", "text"),
                        colClasses=rep("character", 3), sep="\t", quote="")

## Create a mallet instance list object. Right now I have to specify the stoplist
## as a file, I can't pass in a list from R.
## This function has a few hidden options (whether to lowercase, how we
## define a token). See ?mallet.import for details.
mallet.instances <- mallet.import(documents$id, documents$text,
                                  "/Users/shawngraham/Desktop/data mining and tools/stoplist.csv",
                                  token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

## Create a topic trainer object.
num.topics <- 20
topic.model <- MalletLDA(num.topics)

## Load our documents. We could also pass in the filename of a
## saved instance list file that we build from the command-line tools.
topic.model$loadDocuments(mallet.instances)

## Get the vocabulary, and some statistics about word frequencies.
## These may be useful in further curating the stopword list.
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)

## Optimize hyperparameters every 20 iterations,
## after 50 burn-in iterations.
topic.model$setAlphaOptimization(20, 50)

## Now train a model. Note that hyperparameter optimization is on, by default.
## We can specify the number of iterations. Here we'll use a large-ish round number.
topic.model$train(200)

## NEW: run through a few iterations where we pick the best topic for each token,
## rather than sampling from the posterior distribution.
topic.model$maximize(10)

## Get the probability of topics in documents and the probability of words in topics.
## By default, these functions return raw word counts. Here we want probabilities,
## so we normalize, and add "smoothing" so that nothing has exactly 0 probability.
doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)
topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)

## What are the top words in topic 7?
## Notice that R indexes from 1, so this will be the topic that mallet called topic 6.
mallet.top.words(topic.model, topic.words[7,])

## Show the first few documents with at least 5% of their tokens in topics 7 and 10.
## (doc.topics has one row per document and one column per topic, so we index
## the columns, not the rows.)
head(documents[ doc.topics[,7] > 0.05 & doc.topics[,10] > 0.05, ])

## End of Mimno's sample script.

### from my other script; above was Mimno's example script
topic.docs <- t(doc.topics)
topic.docs <- topic.docs / rowSums(topic.docs)
write.csv(topic.docs, "vases-topics-docs.csv")

## Get a vector containing short names for the topics
topics.labels <- rep("", num.topics)
for (topic in 1:num.topics)
  topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,],
                                                 num.top.words=5)$words, collapse=" ")

# have a look at the keywords for each topic
topics.labels
write.csv(topics.labels, "vases-topics-labels.csv")

### do word clouds of the topics
library(wordcloud)
for (i in 1:num.topics) {
  topic.top.words <- mallet.top.words(topic.model, topic.words[i,], 25)
  print(wordcloud(topic.top.words$words, topic.top.words$weights,
                  c(4, .8), rot.per=0, random.order=F))
}
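(While I’m at it: step 2 mused about letting ‘document’ mean ‘site’ rather than ‘vase’. One low-tech way to explore that without rebuilding anything is to average the vase-level topic proportions by site; or, more faithfully, paste the vase texts together per site and re-run the pipeline on those. A sketch, assuming a hypothetical documents$site column holding each vase’s findspot — that column doesn’t exist yet in my file:)

```r
## a sketch only: 'site' is a hypothetical column giving each vase's findspot

## option 1: average the per-vase topic proportions within each site
site.topics <- aggregate(doc.topics, by = list(site = documents$site), FUN = mean)
write.csv(site.topics, "site-topics.csv", row.names = FALSE)

## option 2: make one document per site by pasting the vase texts together,
## then feed site.docs back through mallet.import and retrain
site.text <- tapply(documents$text, documents$site, paste, collapse = " ")
site.docs <- data.frame(id = names(site.text), class = "",
                        text = as.vector(site.text),
                        stringsAsFactors = FALSE)
```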
And this is what I get:
Topic # Label
1 france greek west eating grey
2 spain ampurias neapolis girona arf
3 france rune herault colline nissan-lez-ens
4 spain huelva east greek drinking
5 france aude drinking montlaures cup
6 spain malaga settlement cup drinking
7 france drinking bouches-du-rhone settlement cup
8 france cup stemmed herault bessan
9 france marseille massalia bouches-du-rhone storage
10 spain ullastret settlement girona puig
11 france settlement mailhac drinking switzerland
12 spain badajoz cup stemless castulo
13 spain ampurias settlement girona neapolis
14 france beziers drinking cup pyrenees
15 spain krater bell arf drinking
16 transport amphora france gard massaliote
17 france settlement saint-blaise bouches-du-rhone greek
18 france marseille massalia west bouches-du-rhone
19 spain jaen drinking cemetery castulo
20 spain settlement abg eating alicante
The three-letter acronyms are ware types. The original data had location, context, ware, purpose, and dates. I still need to figure out how to get Mallet (either on the command line or in R) to treat numerals as words, but that’s something I can ignore for the moment. So what next? Map this, I guess, in physical and/or temporal space, and resolve the problem of what a ‘document’ really is for archaeological topic modeling. Here, look at the word clouds generated at the end of the script whilst I ruminate. And also a flow diagram. What it shows, I know not. Exploration, eh?
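(A note to self on the numerals problem: the token.regexp I passed to mallet.import above only admits letters — \p{L} — which is why the dates vanish. If I’m reading the regexp right, widening the character classes to include \p{N} should let numerals through as tokens. Untested sketch; everything else is as in the script above:)

```r
## same call as before, but the token regexp now admits digits (\p{N})
## as well as letters, so dates like 550-500 survive tokenization;
## note it still requires tokens of three or more characters
mallet.instances <- mallet.import(documents$id, documents$text,
                                  "/Users/shawngraham/Desktop/data mining and tools/stoplist.csv",
                                  token.regexp = "[\\p{L}\\p{N}][\\p{L}\\p{N}\\p{P}]+[\\p{L}\\p{N}]")
```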
Isn’t R brilliant? Nice word clouds man – I may use this resource myself.
R is fun. I need to learn how to do better viz though. Word clouds get no respect, but I like these for giving a sense of relative importance of words within a topic.