R is for Archaeology: A report on the 2017 Society of American Archaeology meeting, by B Marwick

R is for Archaeology: A report on the 2017 Society of American Archaeology meeting, by B Marwick

This guest post is by Ben Marwick of The University of Washington in Seattle. He reports on R workshop at the recent SAA in Vancouver.

The Society of American Archaeology (SAA) is one of the largest professional organisations for archaeologists in the world, and just concluded its annual meeting in Vancouver, BC at the end of March. The R language has been a part of this meeting for more than a decade, with occasional citations of R Core in the posters, and more recently, the distinctive ggplot2 graphics appearing infrequently on posters and slides. However, among the few archaeologists that have heard of R, it has a reputation for being difficult to learn and use, idiosyncratic, and only suitable for highly specialized analyses. Generally, archaeology students are raised on Excel and SPSS. This year, a few of us thought it was time to administer some first aid to R’s reputation among archaeologists and generally broaden awareness of this wonderful tool. We developed a plan for this year’s SAA meeting to show our colleagues that R is not too hard to learn, it is useful for almost anything that involves numbers, and it has lots of fun and cool people that use it to get their research done quicker and easier.

Our plan had three main elements. The first element was the debut of two new SAA Interest Groups. The Open Science Interest Group (OSIG) was directly inspired by Andrew MacDonald’s work founding the ESA Open Science section, with the OSIG being approved by the SAA Board this year. It aims to promote the use of preprints (e.g. SocArXiv), open data (e.g. tDAR, Open Context), and open methods (e.g. R and GitHub). The OSIG recently released a manifesto describing these aims in more detail. At this SAA meeting we also saw the first appearance of the Quantitative Archaeology Interest Group, which has a strong focus on supporting the use R for archaeological research. The appearance of these two groups shows the rest of the archaeological community that there is now a substantial group of R users among academic and professional archaeologists, and they are keen to get organised so they can more effectively help others who are learning R. Some of us in these interest groups were also participants in fora and discussants in sessions throughout the conference, and so had opportunities to tell our colleagues, for example, that it would be ideal if R scripts were available for for certain interesting new analytical methods, or that R code should be submitted when manuscripts are submitted for publication.

The second element of our plan was a normal conference session titled ‘Archaeological Science Using R’. This was a two hour session of nine presentations by academic and professional archaeologists that were live code demonstrations of innovative uses of R to solve archaeological research problems. We collected R markdown files and data files from the presenters before the conference, and tested them extensively to ensure they’d work perfectly during the presentations. We also made a few editorial changes to speed things up a bit, for example using readr::read_csv instead of read.csv. We were told in advance by the conference organisers that we couldn’t count on good internet access, so we also had to ensure that the code demos worked offline. On the day, the live-coding presentations went very well, with no-one crashing and burning, and some presenters even doing some off-script code improvisation to answer questions from the audience. At the start of the session we announced the release of our online book containing the full text of all contributions, including code, data and narrative text, which is online at https://benmarwick.github.io/How-To-Do-Archaeological-Science-Using-R/ We could only do this thanks to the bookdown package, which allowed us to quickly combine the R markdown files into a single, easily readable website. I think this might be a new record for the time from an SAA conference session to a public release of an edited volume. The online book also uses Matthew Salganik’s Open Review Toolkit to collect feedback while we’re preparing this for publication as an edited volume by Springer (go ahead and leave us some feedback!). There was a lot of enthusiastic chatter later in the conference about a weird new kind of session where people were demoing R code instead of showing slides. We took this as an indicator of success, and received several requests for it to be a recurring event in future meetings.

The third element of our plan was a three hour training workshop during the conference to introduce archaeologists to R for data analysis and visualization. Using pedagogical techniques from Software Carpentry (i.e. sticky notes, live coding and lots of exercises), Matt Harris and I got people using RStudio (and discovering the miracle of tab-complete) and modern R packages such as readxl, dplyr, tidyr, ggplot2. At the end of three hours we found that our room wasn’t booked for anything, so the students requested a further hour of Q&A, which lead to demonstrations of knitr, plotly, mapview, sf, some more advanced ggplot2, and a little git. Despite being located in the Vancouver Hilton, this was another low-bandwidth situation (which we were warned about in advance), so we loaded all the packages to the student’s computers from USB sticks. In this case we downloaded package binaries for both Windows and OSX, put them on the USB sticks before the workshop, and had the students run a little bit of R code that used install.packages() to install the binaries to the .libpaths() location (for Windows) or untar’d the binaries to that location (for OSX). That worked perfectly, and seemed to be a very quick and lightweight method to get packages and their dependencies to all our students without using the internet. Getting the students started by running this bit of code was also a nice way to orient them to the RStudio layout, since they were seeing that for the first time.

This workshop was a first for the SAA, and was a huge success. Much of this is due to our sponsors who helped us pay for the venue hire (which was surprisingly expensive!). We got some major support from Microsoft Data Science User Group (which we learned about from a post by Joseph Rickert and Open Context, as well as cool stickers and swag for the students from RStudio, rOpenSci, and the Centre for Open Science. We used the stickers like tiny certificates of accomplishment, for example when our students produced their first plot, we handed out the ggplot2 stickers as a little reward.

Given the positive reception of our workshop, forum and interest groups, our feeling is that archaeologists are generally receptive to new tools for working with data, perhaps more so now than in the past (i.e. pre-tidyverse). Younger researchers seem especially motivated to learn R because they may have heard of it, but not had a chance to learn it because their degree program doesn’t offer it. If you are a researcher in a field where R (or any programming language) is only rarely used by your colleagues, now might be a good time to organise a rehabilitation of R’s reputation in your field. Our strategy of interest groups, code demos in a conference session, and a short training workshop during the meeting is one that we would recommend, and we imagine will transfer easily to many other disciplines. We’re happy to share more details with anyone who wants to try!

Advertisements

ODATE: Open Digital Archaeology Textbook Environment (original proposal)

ODATE: Open Digital Archaeology Textbook Environment (original proposal)

“Never promise to do the possible. Anyone could do the possible. You should promise to do the impossible, because sometimes the impossible was possible, if you could find the right way, and at least you could often extend the limits of the possible. And if you failed, well, it had been impossible.”
Terry Pratchett, Going Postal

And so we did. And the proposal Neha, Michael, Beth, and I put together was successful. The idea we pitched to ecampus ontario is for an open textbook that would have an integral computational laboratory (DHBox!), for teaching digital archaeology. The work of the DHBox team, and their generous licensing of their code makes this entire project possible: thank you!

We put together a pretty ambitious proposal. Right now, we’re working towards designing the minimal viable version of this. The original funding guidelines didn’t envision any sort of crowd-collaboration, but we think it’d be good to figure out how to make this less us and more all of you. That is, maybe we can provide a kernal that becomes the seed for development along the lines of the Programming Historian.

So, in the interests of transparency, here’s the meat-and-potatoes of the proposal. Comments & queries welcome at bottom, or if I forget to leave that open, on twitter @electricarchaeo.

~o0o~

Project Description

We are excited to propose this project to create an integrated digital laboratory and e-textbook environment, which will be a first for the broader field of archaeology

Digital archaeology as a subfield rests upon the creative use of primarily open-source and/or open-access materials to archive, reuse, visualize, analyze and communicate archaeological data. Digital archaeology encourages innovative and critical use of open access data and the development of digital tools that facilitate linkages and analysis across varied digital sources. 

To that end, the proposed ‘e-textbook’ is an integrated cloud-based digital exploratory laboratory of multiple cloud-computing tools with teaching materials that instructors will be able to use ‘out-of-the-box’ with a single click, or to remix as circumstances dictate.

We are proposing to create in one package both the integrated digital exploratory laboratory and the written texts that engage the student with the laboratory. Institutions may install it on their own servers, or they may use our hosted version. By taking care of the digital infrastructure that supports learning, the e-textbook enables instructors and students to focus on core learning straight away. We employ a student-centred, experiential, and outcome-based pedagogy, where students develop their own personal learning environment (via remixing our tools and materials provided through the laboratory) networked with their peers, their course professors, and the wider digital community.

Project Overview

Digital archaeology as a field rests upon the creative use of primarily open-source and/or open-access materials to archive, reuse, visualize, analyze and communicate archaeological data. This reliance on open-source and open-access is a political stance that emerges in opposition to archaeology’s past complicity in colonial enterprises and scholarship; digital archaeology resists the digital neo-colonialism of Google, Facebook, and similar tech giants that typically promote disciplinary silos and closed data repositories. Specifically, digital archaeology encourages innovative, reflective, and critical use of open access data and the development of digital tools that facilitate linkages and analysis across varied digital sources. 

To that end, the proposed ‘e-textbook’ is an integrated cloud-based digital exploratory laboratory of multiple cloud-computing tools with teaching materials that instructors will be able to use ‘out-of-the-box’ with a single click, or to remix as circumstances dictate. The Open Digital Archaeology Textbook Environment will be the first of its kind to address methods and practice in digital archaeology.

Part of our inspiration comes from the ‘DHBox’ project from CUNY (City University of New York, http://dhbox.org), a project that is creating a ‘digital humanities laboratory’ in the cloud. While the tools of the digital humanities are congruent with those of digital archaeology, they are typically configured to work with texts rather than material culture in which archaeologists specialise. The second inspiration is the open-access guide ‘The Programming Historian’, which is a series of how-tos and tutorials (http://programminghistorian.org) pitched at historians confronting digital sources for the first time. A key challenge scholars face in carrying out novel digital analysis is how to install or configure software; each ‘Programming Historian’ tutorial therefore explains in length and in detail how to configure software. The present e-textbook merges the best of both approaches to create a singular experience for instructors and students: a one-click digital laboratory approach, where installation of materials is not an issue, and with carefully designed tutorials and lessons on theory and practice in digital archaeology.

The word ‘e-textbook’ will be used throughout this proposal to include both the integrated digital exploratory laboratory and the written texts that engage the student with it and the supporting materials. This digital infrastructure includes the source code for exploratory laboratory so that faculty or institutions may install it on their own servers, or they may use our hosted version. This accessibility is a key component because one instructor alone cannot be expected to provide technical support across multiple operating systems on student machines whilst still bringing the data, tools and methodologies together in a productive manner. Moreover, at present, students in archaeology do not necessarily have the appropriate computing resources or skill sets to install and manage the various kinds of server-side software that digital archaeology typically uses. Thus, all materials will be appropriately licensed for maximum re-use. Written material will be provided as source markdown-formatted text files (this allows for the widest interoperability across platforms and operating systems; see sections 9 and 10). By taking care of the digital infrastructure that supports learning, the e-textbook enables instructors and students to focus on core learning straight away.

At our e-textbook’s website, an instructor will click once to ‘spin up’ a digital laboratory accessible within any current web browser, a unique version of the laboratory for that class, at a unique URL. At that address, students will select the appropriate tools for the tasks explored in the written materials. Thus, valuable class time is directed towards learning and experimenting with the material rather than installing or configuring software.

The e-textbook materials will be pitched at an intermediate level; appropriate remixing of the materials with other open-access materials on the web will allow the instructor to increase or decrease the learning level as appropriate. Its exercises and materials will be mapped to a typical one-semester time frame.

Rationale

Digital archaeology sits at the intersection of the computational analysis of human heritage and material cultural, and rapidly developing ecosystems of new media technologies. Very few universities in Ontario have digital archaeologists as faculty and thus digital archaeology courses are rarely offered as part of their roster. Of the ten universities in Ontario that offer substantial undergraduate and graduate programs in archaeology (see http://www.ontarioarchaeology.on.ca/archaeology-programs), only three (Western, Ryerson and Carleton) currently offer training in digital methods. Training in digital archaeology is offered on a per project level, most often in the context of Museum Studies, History, or Digital Media programs. Yet growing numbers of students demand these skills, often seeking out international graduate programs in digital archaeology. This e-textbook therefore would be a valuable resource for this growing field, while simultaneously building on Ontario’s leadership in online learning and Open Educational Resources. Moreover, the data and informatics skills that students could learn via this e-textbook, as well as the theoretical and historiographical grounding for those skills, see high and growing demand, which means that this e-textbook could find utility beyond the anthropology, archaeology, and cultural heritage sectors.

Our e-textbook would arrive at an opportune moment to make Ontario a leading centre for digital archaeological education. Recently, the provincial government has made vast public investment in archaeology by creating ‘Sustainable Archaeology’ (http://sustainablearchaeology.org/), a physical repository of Ontario’s archaeological materials and centre for research. While growing amounts of digitized archaeological materials are being made available online via data publishers such as Open Context (http://opencontext.org), and repositories such as tDAR (https://www.tdar.org), DINAA (http://ux.opencontext.org/archaeology-site-data/dinaa-overview/) and ADS (http://archaeologydataservice.ac.uk), materials for teaching digital archaeology have not kept pace with the sources now available for study (and print-only materials go out of date extremely quickly). Put simply, once archaeological material is online, we face the question of “so what?” and “now what?” This e-textbook is about data mining the archaeological database, reading distantly thousands of ‘documents’ at once, graphing, mapping, visualizing what we find and working out how best to communicate those findings. It is about writing archaeology in digital media that are primarily visual media. Thus, through the e-textbook, students will learn how to collect and curate open data, how to visualize meaningful patterns within digital archaeological data, and how to analyze them.

Furthermore, this e-textbook has two social goals:

  1. It agitates for students to take control of their own digital identity, and to think critically about digital data, tools and methods. This in turn, can enable them to embody open access principles of research and communication.
  2. It promotes the creation, use and re-use of digital archaeological data in meaningful ways that deepen our understanding of past human societies.

Research materials that are online do not speak for themselves, nor are they necessarily findable or ‘democratized’. To truly make access democratic, we must equip scholars with “digital literacy” — the relevant skills and theoretical perspectives that enable critical thinking. These aims are at the heart of the liberal arts curriculum. We know that digital tools are often repurposed from commercial services and set to work for research ends in the social sciences and liberal arts. We are well aware that digital tools inherently emphasize particular aspects of data, making some more important than others. Therefore, it is essential that students think critically about the digital tools they employ. What are the unintended consequences of working with these tools? There is a relative dearth of expertise in critically assessing digital tools, and in seeing how their biases (often literally encoded in how they work) can impact the practice of archaeology.

To that end, we employ a student-centred, experiential, and outcome-based pedagogy, where students develop their own personal learning environment (via remixing our tools and materials provided through the laboratory) networked with their peers, their course professors, and the wider digital community.

Content Map

E-textbook Structure (instructional materials to support the digital exploratory laboratory)

The individual pieces (files and documents) of this e-textbook will all be made available using the distributed Git versioning control software (via Github). This granularity of control will enable interested individuals to take the project to pieces to reuse or remix those elements that make the most sense for their own practice. Since the writing is in the markdown text format, learners can create EPubs, PDFs, and webpages on-demand as necessary, which facilitates easy reuse, remixing and adaptation of the content. The granularity of control also has the added bonus that our readers/users can make their own suggestions for improvement of our code and writing, which we can then fold into our project easily. In this fashion our e-textbook becomes a living document that grows with its use and readership.

Introduction. Why Digital Archaeology?

Part One: Going Digital

  1. Project management basics
    1. Github & Version control
    2. Failing Productively
    3. Open Notebook Research & Scholarly Communication
  2. Introduction to Digital Libraries, Archives & Repositories
    1. Command Line Methods for Working with APIs
    2. Working with Open Context
    3. Working with Omeka
    4. Working with tDAR
    5. Working with ADS
  3. The Ethics of Big Data in Archaeology

The digital laboratory elements in this part enable the student to explore versioning control, a bash shell for command line interactions, and an Omeka installation.

Part Two: Making Data Useful

  1. Designing Data Collection
  2. Cleaning Data with OpenRefine
  3. Linked Open Data and Data publishing

The digital laboratory elements in this part continue to use the bash shell, as well as OpenRefine.

Part Three: Finding and Communicating the Compelling Story

  1. Statistical Computing with R and Python Notebooks; Reproducible code
  2. D3, Processing, and Data Driven Documents
  3. Storytelling and the Archaeological CMS: Omeka, Kora
  4. Web Mapping with Leaflet
  5. Place-based Interpretation with Locative Augmented Reality
  6. Archaeogaming and Virtual Archaeology
  7. Social media as Public Engagement & Scholarly Communication in Archaeology

The digital laboratory elements in this part include the bash shell, Omeka (with the Neatline mapping installation) and Kora installations, mapwarper, RStudio Server, Jupyter notebooks (python), Meshlab, and Blender.

Part Four: Eliding the Digital and the Physical

  1. 3D Photogrammetry & Structure from Motion
  2. 3D Printing, the Internet of Things and “Maker” Archaeology
  3. Artificial Intelligence in Digital Archaeology (agent models; machine learning for image captioning and other classificatory tasks)

The digital laboratory elements in this part include Wu’s Visual Structure from Motion package, and the TORCH-RNN machine learning package.

Part Five: Digital Archaeology’s Place in the World

  1. Marketing Digital Archaeology
  2. Sustainability & Power in Digital Archaeology

To reiterate, the digital laboratory portion of the e-textbook will contain within it a file manager; a bash shell for command line utilities (useful tools for working with CSV and JSON formatted data); a Jupyter Notebook installation; an RStudio installation; VSFM structure-from-motion; Meshlab; Omeka with Neatline; Jekyll; Mapwarper; Torch for machine learning and image classification. Other packages may be added as the work progresses. The digital laboratory will itself run on a Linux Ubuntu virtual machine. All necessary dependencies and packages will be installed and properly configured. The digital laboratory may be used from our website, or an instructor may choose to install locally. Detailed instructions will be provided for both options.

Getting Data out of Open Context & Doing Useful Things With It: Coda

Previously, on tips to get stuff out of Open Context…

In part 1, I showed you how to generate a list of URLs that you could then feed into `wget` to download information.

In part 2, I showed you how to use `jq` and `jqplay` – via the amazing Matthew Lincoln, from whom I’ve learned whatever small things I know about the subject – to examine the data and to filter it for exactly what you want.

Today – combining wget & jq

Today, we use wget to pipe the material through jq to get the csv of your dreams. Assuming you’ve got a list of urls (generated with our script from part 1), you point your firehose of downloaded data directly into jq. The crucial thing is to flag wget with `-qO-` to tell it that the output will be *piped* to another program. In which case, you would type at the terminal prompt or command line:

wget -qO- -i urls2.txt | jq -r '.features [ ] | .properties | [.label, .href, ."context label", ."early bce/ce", ."late bce/ce", ."item category", .snippet] | @csv' > out.csv

Which in Human says, ” hey wget, grab all of the data at the urls in the list at urls2.txt and pipe that information into jq. JQ, you’re going to filter for raw output the information within properities (which is within features), in particular these fields. Split the information fields up via commas, and write everything to a new file called out.csv.”

…Extremely cool, eh? (Word to the wise: read Ian’s tutorial on wget to learn how to form your wget requests politely so that you don’t overwhelm the servers. Wait a moment between requests – look at how the wget was formed in the open context part 1 post).

Getting Data out of Open Context & Doing Useful Things With It: Part 2

If you recall, at the end of part 1 I said ‘oh, by the way, Open Context lets you download data as csv anyway’. You might have gotten frustrated with me there – Why are we bothering with the json then? The reason is that the full data is exposed via json, and who knows, there might be things in there that you find you need, or that catch your interest, or need to be explored further. (Note also, Open Context has unique URI’s – identifiers- for every piece of data they have; these unique URIs are captured in the json, which can also be useful for you).

Json is not easy to work with. Fortunately, Matthew Lincoln has written an excellent tutorial on json and jq over at The Programming Historian which you should go read now. Read the ‘what is json?’ part, at the very least. In essence, json is a text file where keys are paired with values. JQ is a piece of software that enables us to reach into a json file, grab the data we want, and create either new json or csv. If you intend to visualize and explore data using some sort of spreadsheet program, then you’ll need to extract the data you want into a csv that your spreadsheet can digest. If you wanted to try something like d3 or some other dynamic library for generating web-based visualizations (eg p5js), you’ll need json.

jqplay

JQ lets us do some fun filtering and parsing, but we won’t download and install it yet. Instead, we’ll load some sample data into a web-toy called jqplay. This will let us try different ideas out and see the results immediately. In the this file  called sample.json I have the query results from Open Context – Github recognizes that it is json and that it has geographic data within it, and turns it automatically into a map! To see the raw json, click on the < > button. Copy that data into the json box at jqplay.org.

JQPlay will colour-code the json. Everything in red is a key, everything in black is a value. Keys can be nested, as represented by the indentation. Scroll down through the json – do you see any interesting key:value pairs? Matthew Lincoln’s tutorial at the programming historian is one of the most cogent explanations of how this works, and I do recommend you read that piece. Suffice to say, for now, that if you see an interesting key:value pair that you’d like to extract, you need to figure out just how deeply nested it is. For instance, there is a properties key that seems to have interesting information within it about dates, wares, contexts and so on. Perhaps we’d like to build a query using JQ that extracts that information into a csv. It’s within the features key pair, so try entering the following in the filter box:

.features [ ] | .properties

You should get something like this:

{
  "id": "#geo-disc-tile-12023202222130313322",
  "href": "https://opencontext.org/search/?disc-geotile=12023202222130313322&prop=oc-gen-cat-object&rows=5&q=Poggio",
  "label": "Discovery region (1)",
  "feature-type": "discovery region (facet)",
  "count": 12,
  "early bce/ce": -700,
  "late bce/ce": -535
}
{
  "id": "#geo-disc-tile-12023202222130313323",
  "href": "https://opencontext.org/search/?disc-geotile=12023202222130313323&prop=oc-gen-cat-object&rows=5&q=Poggio",
  "label": "Discovery region (2)",
  "feature-type": "discovery region (facet)",
  "count": 25,
  "early bce/ce": -700,
  "late bce/ce": -535
}

For the exact syntax of why that works, see Lincoln’s tutorial. I’m going to just jump to the conclusion now. Let’s say we wanted to grab some of those keys within properties, and turn into a csv. We tell it to look inside features and find properties; then we tell it to make a new array with just those keys within properties we want; and then we tell it to pipe that information into comma-separated values. Try the following on the sample data:

.features [ ] | .properties | [.label, .href, ."context label", ."early bce/ce", ."late bce/ce", ."item category", .snippet] | @csv

…and make sure to tick the ‘raw output’ box at the top right. Ta da! You’ve culled the information of interest from a json file, into a csv. There’s a lot more you can do with jq, but this will get you started.

get jq and run the query from the terminal or command line

Install on OS – instructions from Lincoln

Install on PC – instructions from Lincoln

Got JQ installed? Good. Open your terminal or command prompt in the directory where you’ve got your json file with the data you extracted in part 1. Here we go:

jq -r '.features [ ] | .properties | [.label, .href, ."context label", ."early bce/ce", ."late bce/ce", ."item category", .snippet] | @csv' data.json > data.csv

So, we invoke jq, we tell it we want the raw output (-r), we give it the filter to apply, we give it the file to apply it to, and we tell it what to name the output.

one last thing

Take a look at how Lincoln pipes the output of a wget command into jq at the end of the section on ‘invoking jq’. Do you see how we might accelerate this entire process the next time you want data out of Open Context?

Now what?

Well, how about you take that csv data and see what stories you can tell with it? A good place to start is with wtfcsv, or Raw or Plot.ly or heaven help me, Excel or Numbers. Then, enter our contest maybe?

At the very least, you’ve now learned some powerful skills for working with the tsunami of open data now flooding the web. Happy wading!

Getting Data out of Open Context & Doing Useful Things With It: Part 1

a walkthrough for extracting and manipulating data from opencontext.org

Search for something interesting. I put ‘poggio’ in the search box, and then clicked on the various options to get the architectural fragments. Look at the URL:
https://opencontext.org/subjects-search/?prop=oc-gen-cat-object&q=Poggio#15/43.1526/11.4090/19/any/Google-Satellite
See all that stuff after the word ‘Poggio’? That’s to generate the map view. We don’t need it.

We’re going to ask for the search results w/o all of the website extras, no maps, no shiny interface. To do that, we take advantage of the API. With open context, if you have a search with a ‘?’ in the URL, you can put .json in front of the question mark, and delete all of the stuff from the # sign on, like so:

https://opencontext.org/subjects-search/.json?prop=oc-gen-cat-object&q=Poggio

Put that in the address bar. Boom! lots of stuff! But only one page’s worth, which isn’t lots of data. To get a lot more data, we have to add another parameter, the number of rows: ?rows=100&. Slot that in just before the p in prop= and see what happens.

Now, that isn’t all of the records though. Remove the .json and see what happens when you click on the arrows to page through the NEXT 100 rows. You get a URL like this:

https://opencontext.org/subjects-search/?rows=100&prop=oc-gen-cat-object&start=100&q=Poggio#15/43.1526/11.4090/19/any/Google-Satellite

So – to recap, the URL is searching for 100 rows at a time, in the general object category, starting from row 100, and grabbing materials from Poggio. We now know enough about how open context’s api works to grab material.

Couple of ways one could grab it:

  1. You could copy n’ paste -> but that will only get you one page’s worth of data (and if you tried to put, say, 10791 into the ‘rows’ parameter, you’ll just get a time-out error). You’d have to go back to the search page, hit the ‘next’ button, reinsert the .json etc over and over again.
  2. automatically. We’ll use a program called wget to do this. (To install wget on your machine, see the programming historian Wget will interact with the Open Context site to retrieve the data. We feed wget a file that contains all of the urls that we wish to grab, and it saves all of the data into a single file. So, open a new text file and paste our search URL in there like so:
https://opencontext.org/subjects-search/.json?rows=100&prop=oc-gen-cat-object---oc-gen-cat-arch-element&q=Poggio
https://opencontext.org/subjects-search/.json?rows=100&prop=oc-gen-cat-object---oc-gen-cat-arch-element&start=100&q=Poggio
https://opencontext.org/subjects-search/.json?rows=100&prop=oc-gen-cat-object---oc-gen-cat-arch-element&start=200&q=Poggio

…and so on until we’ve covered the full 4000 objects. Tedious? You bet. So we’ll get the computer to generate those URLS for us. Open a new text file, and copy the following in:

#URL-Generator.py

urls = '';
f=open('urls.txt','w')
for x in range(1, 4000, 100):
    urls = 'https://opencontext.org/subjects-search/.json?rows=100&prop=oc-gen-cat-object---oc-gen-cat-arch-element&start=%d&q=Poggio/\n' % (x)
    f.write(urls)
f.close

and save it as url-generator.py. This program is in the python language. If you’re on a Mac, it’s already installed. If you’re on a Windows machine, you’ll have to download and install it. To run the program, open your terminal (mac) or command prompt (windows) and make sure you’re in the same folder where you saved the program. Then type at the prompt:

python url-generator.py

This little program defines an empty container called ‘urls’; it then creates a new file called ‘urls.txt’; then we tell it to write the address of our search into the urls container. See the %d in there? The program writes a number between 1 and 4000; each time it does that, it counts by 100 so that the next time it goes through the loop, it adds a new address with the correct starting point! Then it saves that container of URLs into the file urls.txt. Go ahead, open it up, and you’ll see.

Now we’ll feed it to wget like so. At the prompt in your terminal or command line, type:

wget -i urls.txt -r --no-parent -nd –w 2 --limit-rate=10k

You’ll end up with a lot of files that have no file extension in your folder, eg,

.json?rows=100&prop=oc-gen-cat-object---oc-gen-cat-arch-element&start=61&q=Poggio%2F

Select all of these and rename them in your finder (instructions) or windows explorer (instructions), such that they have a sensible file name, and that the extension is .json. We are now going to concatenate these files into a single, properly formatted, .json file. (Note that it is possible for wget to push all of the downloaded information into a single json file, but it won’t be a properly formatted json file – it’ll just be a bunch of lumps of difference json hanging out together, which we don’t want).

We are going to use a piece of software written for NodeJS to concatenate our json files (this enables us to develop in javascript; it’s useful for lots of other things too). Go to the NodeJS download page and download and install for your machine. (Windows users, make sure you select the npm package manager as well during the install procedure). Once it’s installed, open a terminal or command prompt and type

npm install -g json-concat (mac users, you might need sudo npm install -g json-concat).

This installs the json-concat tool. We’ll now join our files together:

# As simple as this. Output file should be last
$ json-concat file1.json file2.json file3.json file4.json ouput.json

… for however many json files you have.

Congratulations

You now have downloaded data from Open Context as json, and you’ve compiled that data into a single json file. This ability for data to be called and retrieved programmaticaly also enables things like the Open Context package for the R statistical software environment. If you’re feeling adventurous, take a look at that.

In Part Two I’ll walk you through using JQ to masage the json into a csv file that can be explored in common spreadsheet software. (For a detailed lesson on JQ, see the programming historian, which also explains why json in the first place). Of course, lots of the more interesting data viz packages can deal with json itself, but more on that later.

And of course, if you’re looking for some quick and dirty data export, Open Context has recently implemented a ‘cloud download’ button that will export a simplified version of the data direct to csv on your desktop. Look for a little cloud icon with a down arrow at the bottom of your search results page. Now, you might wonder why I didn’t mention that at the outset, but look at it this way: now you know how to get the complete data, and with this knowledge, you could even begin building far more complicated visualizations or websites. It was good for you, right? Right? Right.

PS Eric adds: “Also, you can request different types of results from Open Context (see: https://opencontext.org/about/services#tab_query-meta). For instance, if you only want GeoJSON for the results of a search, add “response=geo-record” to the request. That will return just a list of geospatial features, without the metadata about the search, and without the facets. If you want a really simple list of URIs from a search, then add “response=uri”. Finally, if you want a simple list of search results with some descriptive attributes, add “response=uri-meta” to the search result.”

Open Context & Carleton Prize for Archaeological Visualization

Open Context & Carleton Prize for Archaeological Visualization

Increasingly, archaeology data are being made available openly on the web. But what do these data show? How can we interrogate them? How can we visualize them? How can we re-use data visualizations?

We’d like to know. This is why we have created the Open Context and Carleton University Prize for Archaeological Visualization and we invite you to build, make, hack, the Open Context data and API for fun and prizes.

Who Can Enter?

Anyone! Wherever you are in the world, we invite you to participate. All entries will be publicly accessible and promoted via a context gallery on the Open Context website.

Sponsors

The prize competition is sponsored by the following:

  • The Alexandria Archive Institute (the nonprofit that runs Open Context)
  • The Digital Archaeology at Carleton University Project, led by Shawn Graham

Categories

We have prizes for the following categories of entries:

  • Individual entry: project developed by a single individual
  • Team entry: project developed by a collaborative group (2-3 people)
  • Individual student entry: project developed by a single student
  • Student team entry: project developed by a team of (2-3) students

Prizes

All prizes are awarded in the form of cash awards or gift vouchers of equivalent value. Depending on the award type, please note currency:

  • Best individual entry: $US200
  • Best team entry (teams of 2 or 3): $US300 (split accordingly)
  • Best student entry: $C200
  • Best student team entry (teams of 2 or 3): $C300 (split accordingly)

We will also note “Honorable Mentions” for each award category.

Entry Requirements

We want this prize competition to raise awareness of open data and reproducible research methods by highlighting some great examples of digital data in practice. To meet these goals, specific project entry requirements include the following:

  • The visualization should be publicly accessible/viewable, live on the open Web
  • The source code should be made available via Github or similar public software repository
  • The project needs to incorporate and/or create open source code, under licensing approved by the Free Software Foundation.
  • The source code must be well-commented and documented
  • The visualization must make use of the Open Context API; other data sources may also be utilized in addition to Open Context
  • A readme file should be provided (as .txt or .md or .rtf), which will include:
    • Instructions for reproducing the visualization from scratch must be included
    • Interesting observations about the data that the visualization makes possible
    • Documentation of your process and methods (that is to say, ‘paradata’ as per theLondon Charter, section 4)

All entries have to meet the minimum requirements described in ‘Entry Requirements’ to be considered.

Entries are submitted by filling a Web form (http://goo.gl/forms/stmnS73qCznv1n4v1) that will ask you for your particulars and the URL to your ‘live’ entry and the URL to your code repository. You will also be required to attest that the entry is your own creation.

Important Dates

  • Closing date for entry submissions: December 16, 2016
  • Winners announced: January 16, 2017

Criteria for Judging

  • Potential archaeological insight provided by the visualization
  • Reproducibility
  • Aesthetic impact
  • Rhetorical impact
  • Appropriate recognition for/of data stakeholders (creators and other publics)

Attention will be paid in particular to entries that explore novel ways of visualizing archaeological data, or innovative re-uses of data, or work that takes advantage of the linked nature of Open Context data, or work that enables features robust/reproducible code for visualizations that could be easily/widely applied to other datasets.

Judges

The judges for this competition are drawn from across the North America:

Resources

Open By Default

Open By Default

Recently, there’s been a spat of openness happening in this here government town. This post is just me collecting my thoughts.

First thing I saw: a piece in the local paper about the Canadian Science and Technology Museum’s policy on being ‘open by default’. The actual news release was back in April.

This is exciting stuff; I’ve had many opportunities to work the folks from CSTM and they are consistently in the lead around here in terms of how they’re thinking about the ways their collections (archival, artefactual and textual) could intersect with the open web.

This morning, I was going over the Government of Canada’s ‘Draft Plan on Open Government’ and annotating it. (I’m using wordpress.com, so can’t use Kris Shaffer’s awesome new plugin that would pull these annotations into this post.)

There’s a lot of positive measures in this plan. Clearly, there’s been a lot of careful thought and consideration and I applaud this. There are a few things that I am concerned about though (and you can click on the link above, ‘annotating it’ to see my annotations). Broadly, it’s about the way access != openness. It’s not enough to simply put materials online, even if they have all sorts of linked open data goodness. There are two issues here.

  1. accessing data is something that is not equitably available to all. Big data dumps require fast connections, or good internet plans, or good connectivity. In Canada, if you’re in a major urban area, you’re in luck. If you live in a more rural area, or a poorer area, or an area that is broadly speaking under-educated, you will not have any of these. Where I’m from, there’s a single telephone cable that connects everything (although in recent years a cell phone tower was built. But have you looked at the farce that is Canadian mobile data?)
  2. accessing data so that it becomes useful depends on training. Even I struggle often to make use of things like linked open data to good effect. Open Context for instance (an open archaeological data publishing platform) provides example ‘api recipes‘ to show people what’s possible and how to actually accomplish something.

So my initial thought is this: without training and education (or funds to encourage same), open data becomes a public resource that only the private few can exploit successfully. Which makes things like the programming historian and the emergence of digital humanities programs at the undergraduate and graduate level all the more important, if the digital divide (and the riches being on the right side of it brings) is to be narrowed. If ‘open by default’ is to be for the common good.

 

Regex to grab your citations (provided you’re sensible*)

Regex to grab your citations (provided you’re sensible*)

* and by sensible, I mean, not mucking about with footnotes. Can’t abide footnotes. But I digress.

Earlier today, @archaeo_girl asked,

And Sasha Cuerda came up with this: http://regexr.com/3dl04. This is awesome.

Note, however, that the pattern Sasha shares will not work if, in the middle of a parenthesis, you have a citation like this:

(Doe, 2016; Smith 2010, 2012; Graham, 2008)

See the problem? It’s that pesky , between 2010 and 2012 for Smith. My regex-fu is not strong, so one shortcut might be (assuming you’re working on a *copy* of your text in a text editor) to find all commas and replace them with semi-colons. Then Sasha’s pattern will work. After all, you’re just after the citations.

Be sure to click on the ‘replace’ button in Sasha’s regexr to see how you could extract the citations. The replace pattern puts a # at the start of a line with a citation, and makes sure that only the citation is on that line. You could then search for all lines NOT starting with a # and delete them. Hey presto, all your citations in a handy list!  (Speaking of lists, I missed the ‘list’ tool at the bottom, which has the relevant regex pattern to replace the text directly with a list. Cool beans!)

Other handy regexes:

If the text is like this:

Shawn Graham (2008) writes in Electric Archaeology…

then

[A-Z]\w+ \(\d{4}\)

will find `Graham (2008)`

If the text is like this:

According to Graham (2008:45), “Smurfs are the problem…”

then

[A-Z]\w+ +\(\d{4}:\d+\)

will find `Graham (2008:45)`.