On Public Access to Digital Data: Mining Public Comment

Yesterday, Bethany suggested that the public comments on the US Office of Science and Technology Policy (OSTP) request for information regarding public access to digital data would be a good target for some data mining.

So I downloaded the PDF and converted it to plain text. I did not do any cleanup; what follows is a brief look at large-scale patterns, with all the caveats and cautions that implies. I loaded the raw text into Voyant Tools, where one can do some initial frequency counts and so on (available here). (NB: the corpus reader tool does not seem to work from this link, but all other Voyant tools do. You may also upload the txt file yourself into Voyant, which may solve the corpus reader problem; the txt is available in the zip file below.)
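For readers who would rather do a first pass of frequency counts outside Voyant, here is a minimal sketch in Python. It is not what Voyant does internally; the function name, the `min_length` cutoff, and the sample sentence are all illustrative assumptions, and you would swap the sample for the contents of the txt file.

```python
import re
from collections import Counter


def word_frequencies(text, min_length=3):
    """Count lowercase alphabetic tokens of at least min_length letters.

    Both the tokenizer and the length cutoff are assumptions for this
    sketch; no stopword list is applied.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if len(w) >= min_length)


# Hypothetical stand-in for the raw text of the comments.
sample = "Public access to digital data. Digital data should be public."
counts = word_frequencies(sample)

# Show the most frequent words first.
for word, n in counts.most_common(5):
    print(word, n)
```

With uncleaned text, expect page headers and URL fragments to show up near the top of the list.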

Then, as is often my wont, I topic modeled it by individual line (for 25 topics), treating each line as its own document. The raw topics are listed below without interpretation. I also mapped the topics to their documents using Gephi. As there were more than 30,000 lines, I pruned the network to show just the lines where a single topic accounted for more than two-thirds of the line's composition, joining each such line to its minor topics as well. I ran Gephi's modularity routine to detect ‘communities’ within those comments; Gephi suggests 15 communities. The communities centered on topics 15, 9, and 23 seem to be most prominent. Here are all my data files (zipped download, ca. 46 MB; includes Gephi files).
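The pruning step can be sketched roughly as follows. This is an illustration under assumptions, not the exact procedure used here: it assumes each line's topic composition is already available as a mapping from topic number to share, and the names `prune_two_mode_edges` and `minor_floor` are hypothetical. The output is a two-mode edge list (line, topic, weight) of the kind Gephi can import.

```python
from typing import Dict, List, Tuple

Composition = Dict[int, float]  # topic number -> share of the line


def prune_two_mode_edges(
    compositions: Dict[str, Composition],
    dominant_threshold: float = 2 / 3,
    minor_floor: float = 0.0,  # hypothetical cutoff for minor topics
) -> List[Tuple[str, int, float]]:
    """Keep lines dominated by one topic and link them to all their topics."""
    edges = []
    for line_id, shares in compositions.items():
        if not shares:
            continue
        # Skip lines where no single topic exceeds the threshold.
        if max(shares.values()) <= dominant_threshold:
            continue
        # Join the surviving line to its dominant and minor topics.
        for topic, share in sorted(shares.items()):
            if share > minor_floor:
                edges.append((line_id, topic, share))
    return edges


# Made-up compositions, for illustration only.
example = {
    "line-1": {15: 0.80, 9: 0.15, 23: 0.05},
    "line-2": {15: 0.40, 9: 0.35, 23: 0.25},
    "line-3": {9: 0.70, 21: 0.30},
}
edges = prune_two_mode_edges(example)
for line_id, topic, share in edges:
    print(line_id, topic, share)
```

In this made-up example, line-2 is dropped because no topic accounts for more than two-thirds of it, while line-1 and line-3 keep links to every topic they contain.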

What does this all mean? I’ll leave it to the reader to decide that for herself. Larger screenshot. (Scroll to bottom for update.)

List of Topics

1. data digital types repository shared sets collected established serve place major variety small reasonable physionet low prevent establishing archived acquired continuing certified helpful aspects trial presented interested strategic base releasing spent put conventions decades directorate attention campus proteomics capturing confidential leveraging subsequent choices articulated efficiently initiated learned deposit methods meaningful
2. long community term project stewardship effort share practice nature al researcher projects driven ensuring opportunities supported individual exchange genetics end short collection includes defined find maintaining brazma responsibility distributed building addressed hosting hard communication ode advantage piwowar evaluation maintenance treated emerge evolution enables descriptive reducing widespread networks planning months met
3. publication require archiving ensure journals users field guidelines database high publishing quality significant reports primary supporting sharing widely ecg privacy life basis integrity challenge identify biomedical leads concerns underlying annotations electronic open mandates dryad enabling progress min clinical instance hours standardized parties released efficient permanent confidentiality expected individual remain beat
4. data open important identifiers grants future analysis persistent benefit unique free full step multiple requirement source point exist citations requiring repositories availability note accessibility list critical today reference researcher subjects encouraged lost increased experts raw larger move restrictions gathering ease image aera transparent considerations generate problems hand consortia noted outcome
5. economic comment centers creation gov easier growth allowing elements consideration washington id decisions icpsr fr education number statistical offer team de locations dc informed performance assessment st big usa controls inclusion practical skills digitaldata colleagues statistics run forward pp street home expert partnerships laboratory tt protections frameworks asked accessed fit
6. resources scholarly infrastructure including datasets develop potential models innovation key approach organization sustainable business number program online area reporting increase open greater machine global recognized adopted integration basic exclusive contributions represent great submission standardization infrastructures sufficient capacity understanding internet past importance ddi continued lifecycle learning early communications traditional expect website
7. policies agency costs benefits developing system differences respect burden general proposed relative stakeholders recognize article databases focus force problem interest fund task critical recommended difficult report longer scientist flexible cases students participate position blue rapidly initiatives financial november written requiring possibility range recognizing demonstrate framework longterm line networked books blog
8. data repositories time government technical relevant question private created products period raw investigators trusted collections record determine challenges continue change resources industry medical adopt evidence gis direct bodies intended sector ready track usability measures partners fully stored purposes responses personal structure companies functions extensive consortium host solution integrated countries manage
9. standards digital publications http reuse org interoperability www needed linking enable datacite repurposing format iso purposing orcid emerging migration uk openly define cultural initiatives worldwide inform inter verified eu promotes index beneficial approval site html pdf components likelihood seal insight circumstances creativecommons computing mechanism operations strongly ansi significantly previous permits
10. researchers grant published data nsf include articles journal funds datasets related part dataset fields programs cases code means final result materials papers applications lack similar education act highly cited species included dollars rules document supplemental paper submitted protected receive generally advance investments findings date subsequent active evolve administrative assume america
11. data citation organizations discipline set stewardship archive verification alliance ndsa licensing members generated minimized ongoing center single resource purpose complex embargo criteria location focused individuals grantees usable analyses multi present norms maintain committed signals independent detailed barriers selection patents description protocols transparency associations protein distribution engaged managers mandating actively usage
12. access public digital preservation information rfi minimum providing encouraging broadly experiment miame microarray page discoverability taxpayer tool comprehensive unclassified cyberinfrastructure utilize represented verifiable commission piece equilibrium measurements decentralized iwgdd capable entrepreneurs fear depositing sustain paleoanthropology recording enhancing permitted michael authority distinct confusion constitutes studies personnel scope strengths mechanism intuitive expression
13. work copyright required deposit level content current commons institutional collaboration years good creative licenses cc form society collaborative activities license publisher broad considered works subject mandate dois experience areas terms institution incentive success protection law participation start identify consistent diverse fees patent procedures recent knowledge boundaries action kind facts core
14. data management sharing plans requirements plan implementation contribute part proposals meet states proposal united include ethical complete awareness professionals interdisciplinary implement capture priorities kinds reviewers nasa endeavor explicit book dataone balance techniques criteria submissions mandatory physiotoolkit ad mining copyrighted contact sage statements healthy direction cycle increasingly relating goods contributing specification
15. scientific research funded federally resulting dissemination american discovery productivity enterprise valuable metrics rewards taxpayer reason diversity discussions existence assigning successfully reduction isn allocate secret attempts overcome evaluation visualizations billion organizing trained operational occurs documentary provided plays quickly connection measurement assessing conjunction recorded ore animals broadening archivist modes game hosts argue
16. funding provide costs mechanisms address issues preserving disciplinary questions real improved provided expertise comments requires study ways search methods recommendations specifically result social extent issue increasing establishment ieee minimal collaborations manner tracking participants answer cooperative copies considerable cover posed greatest path stage budget sensitive comply fostering citations selected exploitation basic
17. existing standard create tools economy archives web grow software formats model large sciences world build markets jobs industries order preserve linked wide proprietary activity pay lead biology directly scale type promoting people computational easily permit definition limited achieve production improving sites machines interoperable view text follow kitware controlled concept conduct
18. data make scientists developed review sharing produced peer assure original reviewed mechanisms legal easy incentives publish understand simple principles projects year makes literature produce regard viewed citing reward prior scholarship store facilitate due responsible expensive explore domains details sense assess ecosystem collections times display perform topics pass latency handle genetic
19. services accessible based publicly making university domain library additional curation innovative professional systems stimulate societies network addition author january state local retention adoption apply technologies infrastructure physical offer conditions material steps congress added lab fact physics baseline phd ecological banding california award market specialized allowed options fully press environmental bird
20. data metadata international cost common storage impact doi identifier essential creating facilitate deposited goal freely ensure object proper service central levels link ffsr producers semantic guidance acquisition file broader documents reasons culture links nations european agreements providers identifying fits cross astronomy committees issue partnership citable desired necessarily starting security job
21. information science policy national technology ostp request response foundation context nih opportunity human office social health council institute committee service member engineering studies genome medicine respond concern strong input institutes writing acra pub contract publically care enhance stakeholder behalf greatly dedicated administration design educational commitment similar detail dependent quantitative barriers
22. research communities institutions stakeholders libraries researchers universities clear establish knowledge academic investment government user association recognition range training ideas consensus managed vital survey return manage develop biological submit higher archival found advances back preserved growing board legitimate publication periods matter investigator limited mandated expressed vision collecting trust actual energy aabb
23. federal agencies compliance disciplines encourage improve effective promote approaches verify account coordination inherent measure differences control staff recommend faculty creation give files coordinate register regulations processing monitor describe collaborate accommodate operate discoveries redundant option layer supporting closely recommends dublin familiar university claim ensures worthwhile changing sponsored ample deep joint huge
24. specific publishers intellectual property working steps interests rights protect group groups authors funders commercial case report release profit publishing involved supports simply interest ability environment niso librarians issues initiative ip core managing discussion explicitly mission director primarily responsibilities expectations money transfer amounts applicable outcomes points interagency don stakeholder foster openness
25. support standards development practices results made process efforts attribution successful credit examples role secondary processes reported characteristics applied play sources goals documentation cite pm producing maintain maximize values computing draft download post provenance meeting biodiversity facilitates proposals accreditation helping educators small spread reader bermuda keeping funder intensive certification domains replication

UPDATE: This is why the DH & Twitter community is so awesome. I mentioned to Bethany that one-mode networks (topics joined directly to other topics on the basis of their shared composition of a line) would provide a ‘truer’ picture than my two-mode networks, and Scott Weingart duly did the heavy lifting:


…and the resulting visualization shows that things boil down into two communities, with topics 24, 10, and 12 being most prominent. So which is right, one mode or two mode? While two-mode networks make more apparent common sense, in terms of analyses and metrics you want to go with one-mode networks. A thicker line depicts a stronger relationship between two topics.
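A one-mode projection of this kind can be sketched as below. The weighting rule is an assumption made for illustration (the sum, over lines, of the product of the two topics' shares in that line), and it is not necessarily the one behind the visualization above; the function name is likewise hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Tuple

Composition = Dict[int, float]  # topic number -> share of the line


def project_to_topics(
    compositions: Dict[str, Composition],
) -> Dict[Tuple[int, int], float]:
    """Collapse a line-topic network into a weighted topic-topic network."""
    weights = defaultdict(float)
    for shares in compositions.values():
        # Every pair of topics sharing a line gets a heavier edge.
        for a, b in combinations(sorted(shares), 2):
            weights[(a, b)] += shares[a] * shares[b]
    return dict(weights)


# Made-up compositions, for illustration only.
example = {
    "line-1": {24: 0.5, 10: 0.5},
    "line-2": {24: 0.5, 10: 0.25, 12: 0.25},
}
topic_edges = project_to_topics(example)
for (a, b), weight in sorted(topic_edges.items()):
    print(a, b, weight)
```

The resulting weights play the role of line thickness: topics that repeatedly share lines in large proportions end up with the heaviest edges.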

UPDATE 2: Scott Weingart’s Gephi visualization of the same materials, with topic top words swapped for the numbers.


3 thoughts on “On Public Access to Digital Data: Mining Public Comment”

  1. PS – Just noticed that one of my output files from my previous post on playing with Play the Past found its way into the zip. Enjoy that too! But object lesson: develop good naming conventions and archiving strategies, and stick to ‘em!

