This morning, on Twitter, there was a conversation about site diaries and the possibilities of topic modeling for extracting insight from them. Open Context has 2618 diaries – here’s one of them. Eric, who runs Open Context, has an excellent API for all that kind of data. Append .json on the end of a file name, and *poof*, lots of data. Here’s the json version of that same diary. So, I wanted all of those diaries – this URL (click & then note where the .json lives; delete the .json to see the regular html) has ’em all.
I copied and pasted that list of urls into a .txt file, and fed it to wget
wget -i urlstograb.txt -O output.txt
and now my computer is merrily pinging Eric’s, putting all of the info into a single txt file. And sometimes crashing it, too.
When it’s done, I’ll rename it .json and then use Rio to get it into useable form for R. The data has geographic coordinates too, so with much futzing I expect I could *probably* represent topics over space (maybe by exporting to Gephi & using its geolayout).
Futz: that’s the operative word, here.