Getting Data out of Open Context & Doing Useful Things With It: Part 2

If you recall, at the end of part 1 I said ‘oh, by the way, Open Context lets you download data as csv anyway’. You might have gotten frustrated with me there – Why are we bothering with the json then? The reason is that the full data is exposed via json, and who knows, there might be things in there that you find you need, or that catch your interest, or need to be explored further. (Note also, Open Context has unique URI’s – identifiers- for every piece of data they have; these unique URIs are captured in the json, which can also be useful for you).

Json is not easy to work with. Fortunately, Matthew Lincoln has written an excellent tutorial on json and jq over at The Programming Historian which you should go read now. Read the ‘what is json?’ part, at the very least. In essence, json is a text file where keys are paired with values. JQ is a piece of software that enables us to reach into a json file, grab the data we want, and create either new json or csv. If you intend to visualize and explore data using some sort of spreadsheet program, then you’ll need to extract the data you want into a csv that your spreadsheet can digest. If you wanted to try something like d3 or some other dynamic library for generating web-based visualizations (eg p5js), you’ll need json.

jqplay

JQ lets us do some fun filtering and parsing, but we won’t download and install it yet. Instead, we’ll load some sample data into a web-toy called jqplay. This will let us try different ideas out and see the results immediately. In the this file  called sample.json I have the query results from Open Context – Github recognizes that it is json and that it has geographic data within it, and turns it automatically into a map! To see the raw json, click on the < > button. Copy that data into the json box at jqplay.org.

JQPlay will colour-code the json. Everything in red is a key, everything in black is a value. Keys can be nested, as represented by the indentation. Scroll down through the json – do you see any interesting key:value pairs? Matthew Lincoln’s tutorial at the programming historian is one of the most cogent explanations of how this works, and I do recommend you read that piece. Suffice to say, for now, that if you see an interesting key:value pair that you’d like to extract, you need to figure out just how deeply nested it is. For instance, there is a properties key that seems to have interesting information within it about dates, wares, contexts and so on. Perhaps we’d like to build a query using JQ that extracts that information into a csv. It’s within the features key pair, so try entering the following in the filter box:

.features [ ] | .properties

You should get something like this:

{
  "id": "#geo-disc-tile-12023202222130313322",
  "href": "https://opencontext.org/search/?disc-geotile=12023202222130313322&prop=oc-gen-cat-object&rows=5&q=Poggio",
  "label": "Discovery region (1)",
  "feature-type": "discovery region (facet)",
  "count": 12,
  "early bce/ce": -700,
  "late bce/ce": -535
}
{
  "id": "#geo-disc-tile-12023202222130313323",
  "href": "https://opencontext.org/search/?disc-geotile=12023202222130313323&prop=oc-gen-cat-object&rows=5&q=Poggio",
  "label": "Discovery region (2)",
  "feature-type": "discovery region (facet)",
  "count": 25,
  "early bce/ce": -700,
  "late bce/ce": -535
}

For the exact syntax of why that works, see Lincoln’s tutorial. I’m going to just jump to the conclusion now. Let’s say we wanted to grab some of those keys within properties, and turn into a csv. We tell it to look inside features and find properties; then we tell it to make a new array with just those keys within properties we want; and then we tell it to pipe that information into comma-separated values. Try the following on the sample data:

.features [ ] | .properties | [.label, .href, ."context label", ."early bce/ce", ."late bce/ce", ."item category", .snippet] | @csv

…and make sure to tick the ‘raw output’ box at the top right. Ta da! You’ve culled the information of interest from a json file, into a csv. There’s a lot more you can do with jq, but this will get you started.

get jq and run the query from the terminal or command line

Install on OS – instructions from Lincoln

Install on PC – instructions from Lincoln

Got JQ installed? Good. Open your terminal or command prompt in the directory where you’ve got your json file with the data you extracted in part 1. Here we go:

jq -r '.features [ ] | .properties | [.label, .href, ."context label", ."early bce/ce", ."late bce/ce", ."item category", .snippet] | @csv' data.json > data.csv

So, we invoke jq, we tell it we want the raw output (-r), we give it the filter to apply, we give it the file to apply it to, and we tell it what to name the output.

one last thing

Take a look at how Lincoln pipes the output of a wget command into jq at the end of the section on ‘invoking jq’. Do you see how we might accelerate this entire process the next time you want data out of Open Context?

Now what?

Well, how about you take that csv data and see what stories you can tell with it? A good place to start is with wtfcsv, or Raw or Plot.ly or heaven help me, Excel or Numbers. Then, enter our contest maybe?

At the very least, you’ve now learned some powerful skills for working with the tsunami of open data now flooding the web. Happy wading!

Advertisements

3 thoughts on “Getting Data out of Open Context & Doing Useful Things With It: Part 2

Comments are closed.