Open Notebook Research in Digital Archaeology

We’re currently writing the very first draft of our integrated DHBox virtual-machine-and-textbook for digital archaeology. It’s aimed at the same crowd as a regular intro-to-archaeology text, that is, first or second year students with little digital grounding. It won’t cover everyone’s wishlist for digital archaeology, but with care it will be a solid foundation for going further.

In the instructions below, we are imagining the student using the command line within DHBox. If you’re following along here, go over to DHBox, click on start hour long demo, then command line. The login and password are both ‘demonstration’. 

As always, please use hypothesis (annotated copy link) to annotate or leave comments. Thanks!

1.4 Open Notebook Research & Scholarly Communication

Digital archaeology necessarily generates a lot of files. Many of those files are data; many more are manipulations of that data, or the data in various stages of cleaning and analysis. Without any sort of version control or revision history (as detailed in the previous section), these files quickly replicate to the point where a project can be in serious danger of failing. Which file contains the ‘correct’ data? The correct analysis? Even worse, imagine coming back to a project after a few months’ absence. Worse still, after a major operating system update of the kind foisted on Windows users from Windows 7 to Windows 10. The bad news continues: magnetic storage can fail; online cloud services can be hacked; a key person on the project can die.

Even if the data makes it to publication, there is the problem of the data not being available to others for re-interrogation or re-analysis. Requests for data from the authors of journal articles are routinely ignored, whether by accident or design. Researchers may sit on data for years. We have all of us had the experience of working on a collection of material, and then writing to the author of the original article, requesting an explanation for some aspect of the data schema used, only to find out that the author has either died, kept no notes, left the field entirely, or simply doesn’t remember.

There is no excuse for this any longer. Open notebook science is a gathering movement across a number of fields to make the entire research process transparent by sharing materials online as they are generated. These include everything from the data files themselves, to the code used to manipulate them, to notes and observations in various archives. Variations on this ‘strong’ position include data-publishing of the materials after the main paper has been published (see for instance OpenContext or the Journal of Open Archaeology Data). Researchers such as Ben Marwick and Mark E. Madsen are leading the field in archaeology, while scholars such as Caleb McDaniel are pushing the boundaries in history. The combination of simple text files (whether written text or tabular data such as .csv files) with static website generators (ie, html rather than dynamically generated database websites like WordPress) enables the live publishing of in-progress work. Carl Boettiger is often cited as one of the godfathers of this movement. He makes an important distinction:

This [notebook, not blog] is the active, permanent record of my scientific research, standing in place of the traditional paper bound lab notebook. The notebook is primarily a tool for me to do science, not communicate it. I write my entries with the hope that they are intelligible to my future self; and maybe my collaborators and experts in my field. Only the occasional entry will be written for a more general audience. […] In these pages you will find not only thoughts and ideas, but references to the literature I read, the codes or manuscripts I write, derivations I scribble and graphs I create and mistakes I make. (Boettiger)

Major funding bodies are starting to require a similar transparency in the research that they support. Recently, the Social Sciences and Humanities Research Council of Canada published guidance on data management plans:

All research data collected with the use of SSHRC funds must be preserved and made available for use by others within a reasonable period of time. SSHRC considers “a reasonable period” to be within two years of the completion of the research project for which the data was collected.

Anecdotally, we have also heard of work being denied funding because the data management plan, and/or the plan for knowledge mobilization, made only the briefest of nods towards these issues: ‘we shall have a blog and will save the data onto a usb stick’ does not cut it any more. A recent volume of case-studies in ‘reproducible research’ includes a contribution from Ben Marwick that details not only the benefits of such an approach, but also the ‘pain points’. Key amongst them were the fact that not everyone participating in the project was on board with using scripted code to perform the analysis (preferring instead to use the point-and-click of Excel), the duplication of effort that emerged as a result, and the complexities that arose from what he calls the ‘dual universes’ of Microsoft tools versus the open source tools (MARWICK REF). On the other hand, the advantages outweighed the pain. For Marwick’s team, because their results and analysis can be re-queried and re-interrogated, they have an unusually high degree of confidence in what they’ve produced. Their data and their results have a complete history of revisions that can be examined by reviewers. Their code can be re-used and re-purposed, thus making their subsequent research more efficient. Marwick goes on to create an entire ‘compendium’ of code, notes, data, and software dependencies that can be duplicated by other researchers. Indeed, we will be re-visiting their compendium in Section XXXXXXXXX.

Ultimately, McDaniel says it best about keeping open notebooks of research in progress when he writes,

The truth is that we often don’t realize the value of what we have until someone else sees it. By inviting others to see our work in progress, we also open new avenues of interpretation, uncover new linkages between things we would otherwise have persisted in seeing as unconnected, and create new opportunities for collaboration with fellow travelers. These things might still happen through the sharing of our notebooks after publication, but imagine how our publications might be enriched and improved if we lifted our gems to the sunlight before we decided which ones to set and which ones to discard? What new flashes in the pan might we find if we sifted through our sources in the company of others?

A parallel development is the growing practice of placing materials online as pre-prints or even as drafts, for sharing and for soliciting comments. Graham for instance uses a blog as a place to share longer-form discursive writing in progress; with his collaborators Ian Milligan and Scott Weingart, he even wrote a book ‘live’ on the web, warts and all (which you may still view at The Macroscope). Sharing the draft in progress allowed them to identify errors and omissions as they wrote, and for their individual chapters and sections to be incorporated into class syllabi right away. In their particular case, they came to an arrangement with their publisher to permit the draft to remain online even after the formal publication of the ‘finished’ book – which was fortunate, as they ended up writing another chapter immediately after publication! In this, they were building on the work of scholars such as Kathleen Fitzpatrick, whose Planned Obsolescence was one of the first to use the Media Commons ‘comment press’ website to support the writing. Commentpress is a plugin for the widely used WordPress blogging system, which allows comments to be made at the level of individual paragraphs. This textbook you are currently reading uses another solution, the hypothes.is plugin that fosters communal reading and annotation of electronic texts. This points to another happy by-product of sharing one’s work this way – the ability to generate communities of interest around one’s research. The Kitzes et al. volume is written with the Gitbook platform, which is a graphical interface for writing that uses Git at its core, with markdown text files, to manage the collaboration. The commit history for the book is then also a record of how the book evolved, and who did what to it when. In a way, it functions a bit like ‘track changes’ in Word, with the significant difference that the evolution of the book can be rewound and taken down different branches when desired.

In an ideal world, we would recommend that everyone should push for such radical transparency in their research and teaching. But what is safe for a group of (mostly) white, tenured, men is not safe for everyone online. In which case, what we recommend is for individuals to assess what is safest for them to do, while still making use of the affordances of Git, remote repositories, and simple text files. Bitbucket at the time of writing offers free private repositories (so you can push your changes to a remote repository without fear of others looking at or cloning your materials); ReclaimHosting supports academic webhosting and allows one to set up the private ‘dropbox’ like file-sharing service Owncloud.

In the exercises below, we will explore how to make a simple open notebook via a combination of markdown files and a repository on Github. Ultimately, we endorse the model developed by Ben Marwick, of creating an entire ‘research compendium’ that can be installed on another researcher’s machine, but a good place to start is with the historian Lincoln Mullen’s simple notebook templates. This will introduce you to another tool in the digital archaeologist’s toolkit, the open source R programming language and the R Studio ‘IDE’ (‘integrated development environment’).

Far more complicated notebooks are possible, inasmuch as they combine more features and ways of compiling your research. Scholars such as Mark Madsen use a combination of Github pages and the Jekyll blog generator (for more on using Jekyll to create static websites, see Amanda Visconti’s Programming Historian tutorial.) A simple Github repository and WordPress blog can be used in tandem, where the blog serves for the narrative part of a notebook, the part that tries to make sense of the notes contained in the repository. This aspect of open notebook science is critically important in that it serves to signal your bona fides as a serious scholarly person. Research made available online is findable; given the way web search works, if something cannot be found easily, it might as well not exist.

Ultimately, you will need to work out what combination of tools works best for you. Some of our students have had success using Scrivener as a way of keeping notes, where Scrivener writes to a repository folder or some other folder synced across the web (like Dropbox, for instance). In this workflow, you have one Scrivener file per project. Scrivener uses the visual conceit of actual 3 x 5 notecards. Within Scrivener, one would make one card per note, and keep them in a ‘research’ folder. Then, when it comes time to write up the project, those notecards can be moved into the draft and rearranged as necessary so that the writing flows naturally from them.

1.4.1 How to Ask Questions

  • stuff here on how to ask a question on sites like stackoverflow etc.
  • also this https://speakerdeck.com/jennybc/reprex-help-me-help-you, although perhaps move it to 3.1 literate programming. perhaps use it to create actual examples that can be copied over to R though. In which case, talk about it in both places.
  • the idea being ways in which your open notebook becomes an invitation to others to help you, and also, a way of making sure you find the answer you’re looking for when the inevitable troubles emerge

1.4.2 Discussion

Questions for discussion:

  1. Search the archaeological literature (via jstor or Google Scholar) for examples of open notebook science ‘in the wild’. Are you finding anything, and if so, where? Do there seem to be impediments from the journals regarding this practice?
  2. What excites you about the possibilities of open notebook archaeology? What are the advantages?
  3. What frightens you? What are the disadvantages?
  4. Search online for the ‘replicability crisis in science’. Is there any such thing in archaeology?
  5. Study Marwick’s paper REF and compare it to its supporting Github repository. What new questions could be asked of this data?
  6. In what ways are terms like ‘open access’, ‘open source’, and ‘open science’ synonyms for a similar approach, and in what ways are they different?

1.4.3 Take-aways

Keeping an open notebook (or if necessary, a closed notebook) is a habit that must be cultivated. As a target to aim for, try to have

  • each experiment|project in its own folder
  • each experiment|project with a regular pattern of subfolders (data, figures, text, bib, etc.)
  • the experiments|projects under version control.
  • a plan for data publishing. One option is to submit the repository to zenodo or similar to obtain digital object identifiers (DOIs) for the repository
  • a plan to write as you go, on a fail log or blog or what-have-you. Obtain a DOI for this, too.

We haven’t mentioned DOIs in this section, but when your notebook and your narrative about your research has a DOI, it becomes easier for your colleagues to cite your work – even this work in progress!
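The folder-per-project habit in the list above can be sketched at the command line. This is only a minimal illustration: the project name `my-excavation-project` and the wording of the README are our own placeholders, not part of any template.

```shell
# Hypothetical project name; substitute your own.
project="my-excavation-project"

# One folder per project, with a regular pattern of subfolders.
mkdir -p "$project/data" "$project/figures" "$project/text" "$project/bib"

# A short README records what each subfolder is for.
printf '%s\n' \
  "# $project" \
  "" \
  "- data/    source data and cleaned data" \
  "- figures/ diagrams and plots" \
  "- text/    notes and drafts" \
  "- bib/     bibliography files" > "$project/README.md"

# Put the project under version control.
git init -q "$project"
```

Git does not track empty folders, so the subfolders will only appear in the repository history once they contain at least one file.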

1.4.4 Further Reading

Baker, James. ‘Preserving Your Research Data’, The Programming Historian. http://programminghistorian.org/lessons/preserving-your-research-data

1.4.5 On Privilege and Open Notebooks

While we argue for open notebooks, there may be circumstances where this is not desirable or safe to do. Readers may also want to explore an Evernote alternative, Laverna, which stores your notes in your web-browser’s cache, hence making them private, but also allows sync to services such as Dropbox (versioning and backup are still absolutely critical). If you work primarily on a Mac computer, nvAlt by Brett Terpstra is an excellent note-taking application that can sync remotely. Another possibility is Classeur, a web app that integrates with various blogging platforms, allows for syncing and collaboration, the choice of what to make public and what to keep private, and includes the ability to sort notes into various notebooks. It does not save locally, so be warned that your information is on their servers. There is an API (application programming interface) that allows you to download your materials (for more on APIs, see [Introduction to Digital Libraries, Archives & Repositories]).

A final word on the privilege involved in keeping an open notebook is warranted. To make one’s research available openly on the web, to discuss openly the things that worked, the things that haven’t, the experiments tried and the dead ends explored, is at the current moment something that depends on the perceived race, class, and gender of the person doing it. What passes without comment when I (Shawn Graham, a white, tenured, professor) do something could attract unwarranted, unwanted, and unfair attention if a woman of colour undergraduate tried. This is not to say this always happens; but disgracefully it happens far too often. It is important and necessary to fight back against the so-called ‘internet culture’ in these things, but it is not worth risking one’s safety. To those who benefit from privilege, it is incumbent upon them to make things safe for others, to recognize that open science, open humanities, represents a net boon to our field. In which case, it is up to them to normalize such practices, to make it safe to try things out. We discuss more in the following section on what [Failing Productively] means, why it matters, and why it is integral not only to digital archaeology, but the culture of academic research, teaching, and outreach more generally.

1.4.6 Exercises

In this series of exercises, we are going to take you through the process of setting up an open research notebook, where you control all of the code and all of the data. A good rule-of-thumb in terms of keeping a notebook is ‘one notecard per thought’, here adapted as ‘one file per thought, one folder per project’.

Let us set up a template open-notebook based on Lincoln Mullen’s Simple RmD Notebook. If you go to https://lmullen.github.io/rmd-notebook/ you’ll see a ‘live’ version of this template on the web. It is being served to us from a special branch in Mullen’s Github account, called gh-pages. When you have a gh-pages branch in very nearly any repo associated with your own Github account, Github.com will treat that repository not as a repository but as an actual website. This allows us to update or experiment with changes on other branches, and show a ‘polished’ version to the world via the gh-pages branch. That branch will have a special address, in the form your-account-name.github.io/your-repo. Whenever you see github.io in a URL, you know that the source for that website will be found at github.com/account-name/repo. Do you see the difference? (For more on gh-pages, see the Github documentation).
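The relationship between the two addresses can be sketched as a small shell function. The function name `pages_to_repo` is our own invention for illustration, and the sketch assumes a project site of the form account.github.io/repo (it does not handle an account’s top-level site, which has no repo part in its address).

```shell
# Turn a GitHub Pages address into the address of its source repository.
# pages_to_repo is a hypothetical helper, not a Git or GitHub command.
pages_to_repo() {
  url="${1#*://}"                  # drop the scheme (https://)
  url="${url%/}"                   # drop a trailing slash, if any
  account="${url%%.github.io*}"    # everything before .github.io
  repo="${url#*.github.io/}"       # everything after .github.io/
  echo "https://github.com/${account}/${repo}"
}

pages_to_repo "https://lmullen.github.io/rmd-notebook/"
# prints https://github.com/lmullen/rmd-notebook
```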

  1. Begin sketching out on paper an idea for a digital archaeological project, perhaps the one you imagined at the end of our section on Project Management Basics. Imagine its file structure. In the top level folder are going to go all of your notecards. Sub-folders are going to hold diagrams that you create in the course of your research; source-data that you leave untouched; data that you’ve cleaned or manipulated; any helper code that you might create; and so on. You will use this structure to help organize your open notebook once we’ve installed it.
  2. Make a fork of our copy of Mullen’s notebook (we’ve added hypothes.is to it). You can find our copy at https://github.com/o-date/rmd-notebook.
  3. Clone your copy to the ODATE environment at the command line. (Review Github & Version Control if necessary first).
  4. Type ls to make sure the rmd-notebook directory is present; then cd rmd-notebook.
  5. Check which branch you are on, and make sure it is the gh-pages branch. (Hint: check the status)
  6. Now you’re ready to start adding notes! Remember, .Rmd files are just markdown files into which you can insert working R code. We’re not ready to do that yet (but we will encounter it in due course), but for now, you can think of these files as simple notecards where one card = one idea. Note the existing .Rmd files in this folder. Their filenames all begin with a number. Yours should be numbered as well. You can create a new card by typing nano filename.Rmd where filename is whatever you want it to be. Your notecard can include images by using the markdown syntax, ![image title](path/to/image/filename.jpg); those images can be on the web or in an image folder. (A good place to practice markdown is at Dillinger.io).
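Checking the branch (step 5) can be rehearsed safely in a scratch repository before you try it on your clone. In this sketch the folder name `branch-demo` is a stand-in of our own; inside your real clone you would run only the `git status` and `git checkout` lines.

```shell
# Scratch repository standing in for your cloned fork (hypothetical name).
git init -q branch-demo
cd branch-demo

# Which branch am I on? The first line of git status answers the question.
git status
current=$(git symbolic-ref --short HEAD)
echo "Currently on: $current"

# Switch to gh-pages only if we are not already there. In a real clone the
# plain checkout finds the existing branch; in this empty scratch repository
# it fails quietly and the branch is created instead.
if [ "$current" != "gh-pages" ]; then
  git checkout -q gh-pages 2>/dev/null || git checkout -q -b gh-pages
fi

git symbolic-ref --short HEAD   # prints gh-pages
cd ..
```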

Your note must contain some descriptive metadata at the top. This is good practice no matter what kind of note-taking system you use. In our case here, we use the yaml approach. This is what a minimum example looks like:

---
title: "First page of the notebook"
author: "Lincoln Mullen"
date: "December 3, 2015"
---

Title, author, and date. These will get passed to the <meta> tags in the eventual HTML we are going to generate. This information makes your site easier to find and to archive and to associate with you as a scholar.
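As a rough sketch, a new notecard with this metadata can be created in one go from the command line rather than typed into nano. The filename, title, and author below are placeholders of our own, and the date is written in year-month-day form simply because that is easy to generate; any readable date string should do.

```shell
# Hypothetical filename, following the numbered pattern of the existing cards.
note="002-survey-observations.Rmd"

# Write the yaml metadata block, then a first line of the note itself.
cat > "$note" <<EOF
---
title: "Survey observations"
author: "Your Name"
date: "$(date '+%Y-%m-%d')"
---

One idea per card goes here.
EOF

# Check the top of the new card.
head -n 5 "$note"
```

You can then open the file with nano to carry on writing below the metadata block.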

  1. Now we make the public-facing website for your notebook. Mullen has bundled a series of commands into a make file, which acts as a short-cut for us and also ensures that the same sequence of operations is carried out every time. (You can read more about makefiles here.) At the command prompt type $ make.
SG <- note: pandoc and rmarkdown have to be bundled into the DHBox beforehand. Otherwise, the following commands have to be run:
$ wget https://github.com/jgm/pandoc/releases/download/1.19.2.1/pandoc-1.19.2.1-1-amd64.deb
to get pandoc, and then
$ sudo dpkg -i pandoc-1.19.2.1-1-amd64.deb
to unpack and install it. Rmarkdown has to be installed from the R Server code pane. The first sample .Rmd file in Lincoln's example has a leaflet webapp in it; I modified the R to install leaflet first, but I dunno. This'll have to be tested. Probably easier at this point to just remove it.
  1. The make file is pushing all of your .Rmd files through a program called pandoc, and adding various styling options to make an entire website.

As an aside, Pandoc is an extremely useful piece of software for converting files from one format to another. A series of examples can be found at http://www.pandoc.org/demos.html. At the command line, type $ pandoc and then your source file, then indicate the kind of output you want using the -o flag, eg: $ pandoc MANUAL.txt -o example1.html. One of the outputs Pandoc can generate is a Word document – which means, your source text is kept in a very simple, future-proof format, and Word can be used just for typography.
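A minimal sketch of that conversion, using a tiny markdown file of our own invention (`fieldnotes.md`) as the source. Pandoc works out the output format from the file extension given after -o; the check for whether pandoc is installed is there only so that the sketch fails gently on a machine without it.

```shell
# A small markdown source file (hypothetical name and content).
printf '%s\n' '# Field notes' '' 'Trench 1 opened today.' > fieldnotes.md

if command -v pandoc >/dev/null 2>&1; then
  pandoc fieldnotes.md -o fieldnotes.html   # markdown to HTML
  pandoc fieldnotes.md -o fieldnotes.docx   # markdown to a Word document
else
  echo "pandoc is not installed on this machine"
fi
```

The markdown file remains the master copy; the .html and .docx files can be regenerated from it at any time.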

Commit your changes and push them to your remote repository. Visit the live version of your repository (ie, the one at github.io, not github.com) to see your live open research notebook!
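The add, commit, push cycle can be rehearsed without a network connection. In the sketch below a local ‘bare’ repository (`demo-remote.git`) stands in for your fork on Github; that stand-in, the folder names, the note, and the inline name and email are all assumptions made so the example runs anywhere. In your real clone you would run only the three commands in the middle, with your own identity already configured.

```shell
# Hypothetical stand-in for your Github fork: a local bare repository.
git init -q --bare demo-remote.git

# Clone it, as you cloned your fork. Git warns that the repository is
# empty; that is expected here.
git clone -q demo-remote.git demo-notebook
cd demo-notebook
git checkout -q -b gh-pages

# A new note to publish (hypothetical file).
echo "A first note" > 001-first-note.Rmd

# The cycle itself: stage, commit, push.
git add .
git -c user.name="Your Name" -c user.email="you@example.org" \
  commit -q -m "Add first note"
git push -q origin gh-pages

cd ..
```

The order matters: files must be staged with git add before git commit has anything to record, and only committed changes are sent by git push.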

  1. Write a new note for your notebook recording your work, the problems you encountered, and the solutions you found. Save, make, commit, and push your note to the web.

Another approach involves writing markdown files, putting them online, and then using a kind of ‘helper’ file to manage their display as html. In this particular case, we are going to use something called mdwiki. Mdwiki involves a single html file which, when put in the same folder as a series of markdown files, acts as a kind of wrapper to turn the markdown files into pages of a wiki-style website. There is a lot of customization possible, but for now we’re going to make a basic notebook out of a mdwiki template.

  1. Fork the minimal mdwiki template to your Github account; md wiki template is linked here
  2. At this point, any markdown file you create and save into the mdwiki-seed/ll_CC/ folder will become a webpage, although the .md extension should still be used in the URL. If you study the folder structure, you’ll see that there are pre-made folders for pages, for pdfs, for images, and so on (if you clone the repo, you can then add or remove these folders as you like using the file manager). Remember to frame any internal links as relative links. That is to say, if you saved a markdown file in ll_CC/pages/observation1.md but wanted to link to ll_CC/pages/observation2.md, it is enough to just add [Click here](observation2.md). Because the mdwiki-seed you forked was already on a gh-pages branch, your notebook will be visible at YOURUSERNAME.github.io/mdwiki-seed/. But note: the page will reload and you’ll see #! or ‘hashbang’ inserted at the end of the URL. This is expected behaviour.
  3. Let’s customize this a bit. Via Github, click on the ll_CC directory. One of the files that will be listed is config.json. If you click on that file, you’ll see:
{
  "additionalFooterText": "All content and images &copy; by Your Name Goes Here&nbsp;",
  "anchorCharacter": "#",
  "lineBreaks": "gfm",
  "title": "Your wiki name",
  "useSideMenu": true
}

Change the title so that it says something like Your-name Open Research Notebook. You can do this by clicking on the pencil icon at the top right of the file viewer (if you don’t see a pencil icon, you might not be logged into github). Scroll to the bottom and click on the ‘commit changes’ button when you’re done.

  1. Let’s add notes to this notebook. You can do this in two ways. In the first, you clone your mdwiki-seed via the command line, and use the text editor to create new pages in the appropriate folder (in this case, ll_CC/pages), then git add ., git commit, and git push to get your changes live online. You can create a kind of table of contents by directing the ls command into a new file, like so:

$ ls > index.md

and then editing that file to turn the filenames into markdown links like so: [display text for link](filename.md).
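If you would rather not edit the listing by hand, a short loop can write the markdown links for you. This is a sketch under our own assumptions: the folder `pages-demo` and the two observation files are created only for the demonstration, and the filename (minus .md) is used as the display text for each link.

```shell
# Hypothetical demo folder with two notes in it.
mkdir -p pages-demo
cd pages-demo
touch observation1.md observation2.md

# Write one markdown link per note into index.md. The redirection creates
# index.md before the loop starts, so it is skipped explicitly.
for f in *.md; do
  if [ "$f" = "index.md" ]; then
    continue
  fi
  printf '%s\n' "- [${f%.md}]($f)"
done > index.md

cat index.md
cd ..
```

Re-running the loop after adding new notes rebuilds the whole index, overwriting any hand edits made to index.md.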

Alternatively, a more elegant approach to use in conjunction with mdwiki is to use Prose.io and keep your notebook live on the web. Prose.io is an editor for files hosted in Github. You log into Prose.io with your github credentials, and select the repository you wish to edit, in this case, mdwiki-seed. Then, click on the ‘new file’ button. This will give you a markdown text editor, and allow you to commit changes to your notebook! Warning: do not make changes to index.html when using mdwiki. If you want a particular markdown file to appear as the default page in a folder, call it index.md instead. You could then periodically update your cloned copy on your own machine for back up purposes.

Either way, add some notes to the notebook, and make them available online.
