Yesterday, Bethany suggested that the public comments on the American OSTP request for info regarding public access to digital data would be a good target for some data mining:
DHers, let's see some text-mining & viz! All responses to OSTP RFIs on public access to fed-funded research now online: is.gd/jimJ6v
Bethany Nowviskie (@nowviskie) February 01, 2012
So, I downloaded the PDF and turned it into plain text. I did not do any cleanup; what follows is a brief look at large-scale patterns, with all the caveats and cautions that implies. I loaded the raw text into Voyant Tools, where one can do some initial frequency counts and so on (available here). (NB: the corpus reader tool does not seem to work from this link, but all other Voyant tools do. You may also upload the txt file yourself into Voyant, which may solve the corpus reader problem; the txt is available in the zip file below.)
Then, as is often my wont, I topic modeled it by individual line (for 25 topics). Below are the raw topics without interpretation. I also mapped the topics to their documents using Gephi. As there were more than 30,000 lines, I pruned the network to show just the lines where a single topic accounted for more than two-thirds of the line's composition, joining each such line to its minor topics as well. I ran the modularity routine to determine ‘communities’ within those comments; Gephi suggests 15 communities. The communities centered on topics 15, 9, and 23 seem to be most prominent. Here are all my data files (zipped download, ca. 46 MB; includes Gephi files).
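For anyone who wants to try the pruning step on their own topic model output, here is a minimal sketch in Python. It is not the script I used: the function names, the toy compositions, and the `line1`/`topic15` labels are all made up for illustration, and it assumes you have already parsed each line's topic proportions into a dictionary. The output is a plain Source/Target/Weight edge list of the kind Gephi can import as a spreadsheet.

```python
import csv
import io

THRESHOLD = 2 / 3  # keep a line only if one topic exceeds this share


def two_mode_edges(compositions, threshold=THRESHOLD):
    """Return (line, topic, weight) edges for lines dominated by one topic.

    compositions: dict mapping a line id to a dict of topic id -> proportion.
    A kept line is joined to its dominant topic and to its minor topics.
    """
    edges = []
    for line_id, topics in compositions.items():
        if not topics or max(topics.values()) <= threshold:
            continue  # no single topic dominates this line, so prune it
        for topic_id, share in topics.items():
            edges.append((f"line{line_id}", f"topic{topic_id}", share))
    return edges


def to_gephi_csv(edges):
    """Serialize edges as a Source,Target,Weight table."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buf.getvalue()


# Toy data (hypothetical): three lines, with topic ids borrowed from the post.
compositions = {
    1: {15: 0.80, 9: 0.15, 23: 0.05},  # kept: topic 15 dominates
    2: {15: 0.40, 9: 0.35, 23: 0.25},  # pruned: no topic above 2/3
    3: {9: 0.70, 23: 0.30},            # kept: topic 9 dominates
}

edges = two_mode_edges(compositions)
print(to_gephi_csv(edges))
```

The result is a two-mode network: lines on one side, topics on the other, with the topic proportion as the edge weight.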
What does this all mean? I’ll leave it to the reader to decide that for herself. Larger screenshot. (Scroll to the bottom for an update.)
List of Topics
UPDATE: This is why the DH & Twitter community is so awesome. I mentioned to Bethany that one-mode networks (topics joined directly to other topics on the basis of their shared composition of a line) would provide a ‘truer’ picture than my two-mode networks, and Scott Weingart duly did the heavy lifting:
Scott Weingart (@scott_bot) February 01, 2012
…and the resulting visualization shows that things boil down into two communities, with topics 24, 10, and 12 being most prominent. So which is right, one mode or two mode? While two-mode networks make more apparent common sense, for analyses and metrics you want to go with one-mode networks. In the visualization, a thicker line indicates a stronger relationship between the two topics.
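To make the one-mode idea concrete, here is a rough sketch in Python of collapsing line-to-topic data into a topic-to-topic network. I do not know exactly how Scott weighted his edges, so the weighting here is an assumption: whenever two topics appear in the same line, the edge between them grows by the product of their shares in that line. The function name and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def one_mode_projection(compositions):
    """Collapse line -> topic proportions into weighted topic-topic edges.

    compositions: dict mapping a line id to a dict of topic id -> proportion.
    Assumed weighting: each line adds share_a * share_b to the (a, b) edge.
    """
    weights = defaultdict(float)
    for topics in compositions.values():
        # sorted() gives every pair a stable (smaller, larger) key
        for a, b in combinations(sorted(topics), 2):
            weights[(a, b)] += topics[a] * topics[b]
    return dict(weights)


# Toy data (hypothetical): two lines sharing topics 15 and 9.
compositions = {
    1: {15: 0.5, 9: 0.5},
    2: {15: 0.5, 9: 0.25, 23: 0.25},
}

projected = one_mode_projection(compositions)
for (a, b), weight in sorted(projected.items()):
    print(a, b, weight)
```

The accumulated weight plays the role of line thickness: topics that keep turning up together in the same lines end up with heavier edges.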