Yesterday, Bethany suggested that the public comments on the American OSTP request for information regarding public access to digital data would be a good target for some data mining: https://twitter.com/nowviskie/status/164509026225364993
So, I downloaded the pdf and turned it into plain text. I did not do any cleanup; what follows is a brief look at large-scale patterns, with all the caveats and cautions that that implies. I loaded the raw txt into Voyant Tools, where one can do some initial frequency counts and so on – available here. (NB – the corpus reader tool does not seem to work from this link; but all other Voyant tools do. You may also upload the txt file yourself into Voyant, which may solve the corpus reader problem – the txt is available in the zip file below).
Then, as is often my wont, I topic modeled it by individual line (for 25 topics). Below are the raw topics, without interpretation. I also mapped the topics to their documents using Gephi. As there were more than 30,000 lines, I pruned the network to show only those lines where a single topic accounted for more than two-thirds of the line's composition, and joined each such line to its minor topics. I ran the modularity routine to determine ‘communities’ within those comments; Gephi suggests 15 communities. The communities centered on topics 15, 9, and 23 seem to be most prominent. Here are all my data files (zipped download, ca. 46 MB; includes Gephi files).
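For anyone who wants to try the pruning step on their own topic model output, here is a minimal sketch in Python. It is not the script I used. It assumes the per-line topic proportions are already loaded into a dictionary, and the names `prune_two_mode`, `dominant_cutoff`, and `minor_floor` are made up for illustration. In particular, the 0.05 floor for what counts as a 'minor topic' is an assumption, not a value taken from my files.

```python
from typing import Dict, List, Tuple


def prune_two_mode(
    compositions: Dict[str, List[float]],
    dominant_cutoff: float = 2 / 3,  # keep lines whose top topic exceeds this
    minor_floor: float = 0.05,       # assumed floor for "minor" topics
) -> List[Tuple[str, str, float]]:
    """Return (line, topic, weight) edges for a two-mode line/topic network."""
    edges = []
    for line_id, proportions in compositions.items():
        # Skip lines with no single topic above the cutoff.
        if max(proportions) <= dominant_cutoff:
            continue
        # Join the surviving line to its dominant topic and its minor topics.
        for topic, weight in enumerate(proportions):
            if weight >= minor_floor:
                edges.append((line_id, f"topic_{topic}", weight))
    return edges


if __name__ == "__main__":
    # Toy data: two lines, three topics (hypothetical values).
    toy = {
        "line_1": [0.7, 0.2, 0.1],
        "line_2": [0.4, 0.3, 0.3],
    }
    for edge in prune_two_mode(toy):
        print(edge)
```

The resulting edge list can be saved as a CSV and imported into Gephi, where the modularity routine can then be run on it.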
What does this all mean? I’ll leave it to the reader to decide that for herself. Larger screenshot. (scroll to bottom for update).
List of Topics
UPDATE: This is why the DH & Twitter community is so awesome. I mentioned to Bethany that one-mode networks (topics joined directly to other topics on the basis of the shared composition of a line) would provide a ‘truer’ picture than my two-mode networks, and Scott Weingart duly did the heavy lifting:
…and the resulting visualization shows that things boil down into two communities, with topics 24, 10, and 12 being most prominent. So which is right, one-mode or two-mode? While two-mode networks make more apparent common sense, for analyses and metrics you want to go with one-mode. The thickness of a line depicts the strength of the relationship between the two topics.
UPDATE 2: Scott Weingart’s Gephi visualization of the same materials, with topic top words swapped for the numbers.
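To make the one-mode idea concrete, here is a rough sketch of how a line/topic composition table could be collapsed into a topic-to-topic network. This is not Scott's code. The function name `project_one_mode`, the `floor` parameter, and the choice to weight each topic pair by the product of its two proportions in a line are all assumptions made for illustration; other weightings (simple co-occurrence counts, for instance) would be just as defensible.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple


def project_one_mode(
    compositions: Dict[str, List[float]],
    floor: float = 0.05,  # assumed: ignore topics below this share of a line
) -> Dict[Tuple[int, int], float]:
    """Collapse line/topic proportions into weighted topic-topic edges."""
    weights: Dict[Tuple[int, int], float] = defaultdict(float)
    for proportions in compositions.values():
        # Topics that are meaningfully present in this line.
        present = [(t, p) for t, p in enumerate(proportions) if p >= floor]
        # Every pair of topics sharing the line gets a heavier edge.
        for (a, pa), (b, pb) in combinations(present, 2):
            weights[(a, b)] += pa * pb
    return dict(weights)


if __name__ == "__main__":
    # Toy data: two lines, three topics (hypothetical values).
    toy = {
        "line_1": [0.5, 0.5, 0.0],
        "line_2": [0.5, 0.25, 0.25],
    }
    for pair, weight in sorted(project_one_mode(toy).items()):
        print(pair, weight)
```

The summed weight for each pair is what would be drawn as line thickness in the visualization.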
Comments on “On Public Access to Digital Data: Mining Public Comment”
PS – Just noticed that one of my output files from my previous post on playing with Play the Past found its way into the zip. Enjoy that too! But object lesson: develop good naming conventions and archiving strategies, and stick to ’em!
What an awesome thing to do.