obsidian zotero integration plugin

I’m making a coursepack in Obsidian for my fall class. It’ll have useful features, tools, and templates for doing the work I want my students to do.

I want students to use Zotero to keep track of their readings and annotations. I’ve been playing with https://github.com/mgmeyers/obsidian-zotero-integration and have now got it where I want it, more or less. Now that I’ve got it set up, students just have to download my vault and turn off safe mode, and everything’ll be configured. For reference, this is what I did:
– add it via the community plugins
– create a new note for your template.
– use this template for your notes: https://forum.obsidian.md/t/zotero-desktop-connector-import-templates/36310/2?u=drgraham
– the ‘css snippet’ is a text file with .css extension in the .obsidian/snippets folder; you can tell obsidian to read the snippet from the ‘appearance’ settings
– in the template itself, the hex values for colour did not agree with the colours being used by my version of zotero (maybe somewhere in zotero you can set those?)
– Here’re mine:
```
{%- macro colorValueToName(color) -%}
{%- switch color -%}
{%- case "#a28ae5" -%}
Relevant / important
{%- case "#ff6666" -%}
Disagree
{%- case "#ffd400" -%}
Questions / confusion
{%- case "#5fb236" -%}
Agree
{%- case "#2ea8e5" -%}
Definitions / concepts
{%- default -%}
Interesting but not relevant
{%- endswitch -%}
{%- endmacro -%}
```
In the settings for the zotero integration, I turned on the option to insert citekeys with the @ symbol, and I renamed the note import command to ‘extract your notes from an item’. I have it create new notes in a folder called `zotero-notes`. I added the @ symbol in front of the filename template – which uses the BibTeX citekey – so that my extracted annotation notes all follow the pattern @graham2006 etc. Useful for eventual pandoc integration.

Use the colours in Zotero to indicate _why_ you’re highlighting something.

Yellow – Questions/confusion
Red – Disagree
Green – Agree
Blue – Definitions / concepts
Purple – Important

cmd + p to open the command palette
look for ‘**zotero – extract your notes from an item**’.
wait a few moments for the zotero selector to appear.
search for and select the item that you were annotating within zotero
a new note will appear in the zotero-notes folder
you can then refactor (use the ‘note composer’ command) the one big note into several small ones. Alternatively, in a brand new note you can link to the note and use the ^ symbol to link to a particular annotation.

If you have zotero on an ipad, and you have some kind of shared folder accessible between your computer and the ipad, there seems to be a glitch. Anything you annotate on your ipad in a shared folder needs to be moved to your main ‘my library’ on your computer, and then deleted from the shared folder. Otherwise you’ll get an error when you go to extract the annotations.

steampipe.io + hypothes.is -> obsidian.md

Steampipe.io lets you run sql against a variety of services, including hypothes.is. This strikes me as a nice way to perhaps develop a workflow from hypothesis to obsidian (there are any number of ways one could do this).

These are my notes for getting it up and running.
At the terminal, with brew installed (running on mac)


brew tap turbot/tap
brew install steampipe
steampipe -v
steampipe plugin install hypothesis


Then add my developer key for hypothesis to the config file for steampipe, which lives at

~/.steampipe/config/hypothesis.spc

There is a dashboard for hypothesis at https://github.com/turbot/steampipe-samples/tree/main/all/hypothesis . Copy those files into a text editor and save them with the .sp extension. Then, at the terminal, fire ’em up with

steampipe dashboard

Queries can be run at the query prompt, via

steampipe query

Or you can save your queries as a file, and pipe the results to output. So, I make an empty file and put this query into it:


select
uri,
tags,
exact,
text
from
hypothesis_search
where
query = 'tag=media&tag=review';


I save it as `query1`, no file extension. Then, at the terminal prompt,


$ steampipe query query1 > output.csv


The resulting output file actually uses the `|` character as a separator, so once I remember to specify that when opening it in Excel, I get a lovely file of annotations. I have to run now, but I can see opening this file in Obsidian.md and then refactoring it so that I end up with one note per annotation.
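If the end goal really is one note per annotation in Obsidian, that last step might look something like this minimal Python sketch, which assumes the pipe-delimited `output.csv` from above (with a header row) and a made-up vault subfolder called `hypothesis-notes`; I believe `steampipe query` can also emit proper CSV with `--output csv`, which would be cleaner still:

```
import pandas as pd
from pathlib import Path

# read the pipe-delimited steampipe output; column names may carry
# padding spaces, so strip them
df = pd.read_csv("output.csv", sep="|", skipinitialspace=True)
df.columns = df.columns.str.strip()

# 'hypothesis-notes' is a placeholder folder inside the vault
outdir = Path("hypothesis-notes")
outdir.mkdir(exist_ok=True)

# one annotation, one markdown note
for i, row in df.iterrows():
    note = (
        f"source: {row['uri']}\n"
        f"tags: {row['tags']}\n\n"
        f"> {row['exact']}\n\n"
        f"{row['text']}\n"
    )
    (outdir / f"annotation-{i}.md").write_text(note)
```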

references
https://hub.steampipe.io/plugins/turbot/hypothesis/tables/hypothesis_search
https://hub.steampipe.io/plugins/turbot/hypothesis

 

 

…god I hate the gutenberg editor here on wordpress.com. It sucks…

Memo to self: The Full Circuit of Humanities Computing

In 2013, Bethany Nowviskie wrote in ‘Resistance in the Materials’:

“What new, interpretive research avenues will open up for you, in places of interesting friction and resistance, when you gain access to the fresh, full circuit of humanities computing—that is, the loop from the physical to the digital to the material text and artifact again?”

I love this idea; have loved it for years. It was the organizing principle of a course I did a few years ago. I’m working on a proposal that takes this as the central conceit. And I’m making this note here, because that most petulant of man-children, that monstrous Ego, has bought the location where I might normally post such things. Perhaps I’ll be spending more time here…

A System of Relationships, or Getting My Stuff Into Neo4j


A context is just the name we give to describe a system of relationships at a particular point in time; a point that conventionally corresponds with a singular event in the life of the ‘site’. But there’s nothing real about a context. It’s just a convention that hides the fact that we are describing the edges, not the nodes, of material culture. What would an archaeology look like that focused exclusively on ‘assemblages’ in the D&G sense – agencements – that have their own agency, coming together?

When we think of things this way, it’s clear that a conventional ‘relational’ database is the wrong way of ordering things. A relational database, ironically, does not attach any meaning to the connections in themselves; it’s just a way of joining tables. A graph database on the other hand assigns properties to both the nodes and the edges, and allows us to query through both the properties and the relationships.

Graph databases emerge out of networks & graph theory. Neo4j is one of the most prominent in the field at the moment. But it has, to my mind, a glaring problem: it’s a right pain in the ass to get information into it. You can write queries that describe the nodes and the edges, and you can import csv files, but they have to be arranged just right. Much as in social network analysis, one effective way to import things is to have a table of the nodes, where the columns contain the properties, and another with the edges, and their properties.
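To make that two-table arrangement concrete, here’s a toy sketch in pandas; the column names and values are invented for the example, not what my forms actually export:

```
import pandas as pd

# one row per node; columns hold the node's properties
nodes = pd.DataFrame([
    {"id": "photo_001", "label": "Photo", "timestamp": "2021-07-01"},
    {"id": "artifact_17", "label": "Artifact", "material": "ceramic"},
])

# one row per edge; columns hold the relationship's properties
edges = pd.DataFrame([
    {"source": "artifact_17", "target": "photo_001",
     "relationship": "APPEARS_IN", "xmin": 120, "ymin": 80},
])

nodes.to_csv("nodes.csv", index=False)
edges.to_csv("edges.csv", index=False)
```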

But data entry is a nasty business. If you just start filling in cells in a spreadsheet, you quickly run into user interface issues, data validation issues, sheer pain-in-the-arsedness issues. The other requirement was that there was a series of images on which we wanted to annotate the location of objects. The forms would then capture the metadata around how these objects interrelated with each other: the photo captured one moment in time, one event, one context. I wanted therefore to design a form that would make that pain a little less awful.

One can design forms in Excel. Or in Access (though I didn’t have a copy of Access). Or Filemaker. Or Google Apps Script. But… part of me is bloody minded. If I was going to screw things up, I wanted it to be because I screwed things up, not because some part of Excel decided that the data I was entering was really a date, or some other ‘helpful’ automagical correction.

There are a variety of ways of doing this DIY business. I could’ve gone with html and webforms. Remember when writing html was straightforward? Now it’s all Django or Svelte or typescript or whatever. Not being much of a python person (or a front-end person either, for that matter), but recognizing that Python could run on whatever system, I thought I’d see what I could do in that regard. Which is how I came to Tkinter. Similarly, I could’ve used pyqt4, 5, or 6, or even gui tools (eg, like this) for designing python forms. But I had found someone’s tutorial that did pretty much what I wanted to do. I might not be able to write from scratch, but I can generally follow the logic and adapt/adopt things as I find ’em.

I wrote four forms. One for context metadata, one for artifact metadata, one for photo metadata, and one for notes. I built in validation and dropdowns that pulled from one form to the next, so that I’d minimize my ability to screw up consistency of names and descriptions. I got the forms to import image annotations from LabelImg, and to export the whole boiling to four csv tables. (Chantal, my student, pointed out that if I’d started in pyqt5, I could probably have built the forms directly into LabelImg in the first place. Maybe next time.)

Now the problem was getting all this into Neo4J. Two nodes were obvious: photo, and artifact.

(Artifact)-[APPEARS_IN]->(Photo).

But was the third node really ‘context’? I think the answer is ‘no’. The third node is the physical location; but the context describes an event that happens there. So:

(Photo)-[DEPICTS]->(Square)

and the various contextual information is a property sometimes of DEPICTS, sometimes of APPEARS_IN, sometimes of Square, Photo, and Artifact.

My forms didn’t export a nice table of relationships; instead I had to do some merging using Pandas to parse the various fields/cells into the places I want. Code examples follow below. Then it became a matter of writing print statements that would iterate through the rows of data and write the values in the various cells mixed together with the correct Cypher text and syntax. In this, I was following an example from this ; there are similar things around the web.

I bundled these up into little scripts, and wrote one last gui to help with the workflow, a launcher with buttons for all the various components. I can enter data (saving it to sqlite3, which could be pushed online with Datasette, which’ll wrap it in an API so others can play with it), I can export it to flat csv tables, and I can push a button to get the Cypher statements to add to the graph database.

In Neo4j, I can now start querying the material as if I was looking for shortest paths through it, attenuated by physical x,y,z. Community detection. Cliques. And so on. If an archaeological site is a series of relationships, then I want to use a method that is built around understanding the structure (including absences!) of relationships. Tune in later for when I start querying stuff; that’ll be the thing: was all of this worth it?
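By way of illustration only – the real queries will come in that later post – this is the sort of thing I have in mind, sketched with the official Neo4j Python driver. The connection details and artifact names are placeholders, and the property-only matches follow the same pattern as the Cypher generated by the scripts below:

```
from neo4j import GraphDatabase

# placeholder connection details for a local Neo4j instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# find a shortest path between two artifacts through the photos/squares
# that connect them; the artifact names here are invented
query = """
MATCH (a {artifactName: $start}), (b {artifactName: $end}),
      p = shortestPath((a)-[*..10]-(b))
RETURN [n IN nodes(p) | labels(n)] AS path_labels, length(p) AS hops
"""

with driver.session() as session:
    for record in session.run(query, start="artifact_17", end="artifact_42"):
        print(record["path_labels"], record["hops"])

driver.close()
```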

~~~

Python that creates a Cypher create statement from a CSV:

import sys
import csv

import pandas as pd

filename="../data/context_data.csv" 
  
# load the data with pd.read_csv
record = pd.read_csv(filename)

#print(record.Square.unique())
record = record.drop(columns='CONTEXT_NUMBER') # contexts will show up in the relationships

#a square might show up a couple of times 
#because of the other csv files - many artifacts from the same location - 
#so I just want ONE node to represent the square

record.drop_duplicates(subset=['Square'], inplace=True)
record.reset_index(drop=True, inplace=True) # renumber the rows so the node ids come out sequential
row_number = record.index

original_stdout = sys.stdout

with open('./square_nodes.cql', 'w') as f:
	sys.stdout = f # Change the standard output to the file we created.
	for i, row in record.iterrows():
		print("\n"+ "CREATE (s"+ str(row_number[i]+1) +":SQUARE {square:'"+record.loc[i][0] +"',module:'"+record.loc[i][1]+"',context_type:'"+record.loc[i][2]+"',description:'"+record.loc[i][6]+"'})")
	sys.stdout = original_stdout

And some Python that creates relationships; remember, there was no original relationships table; these are all derived implicitly or explicitly from the act of recording, and thinking of contexts as relationships, not things. Incidentally, I wrote these in a jupyter notebook first, so I could tinker slowly and make sure that I was grabbing the right things, that everything looked ok.

import pandas as pd
import csv
import sys

a = pd.read_csv("../data/photo_data.csv")
b = pd.read_csv("../data/artifact_data.csv")
b = b.dropna(axis=1)

c = pd.read_csv("../data/context_data.csv")
d = pd.read_csv("../data/photo_data.csv")
d = d.dropna(axis=1)

merged2 = c.merge(d, on='CONTEXT_NUMBER')
merged = a.merge(b, on='Photo_filename')


original_stdout = sys.stdout

with open('./relationships.cql', 'w') as f:

	# artifact -[APPEARS_IN]-> photo, with the LabelImg bounding box as edge properties
	for i, row in merged.iterrows():
		sys.stdout = f
		print("\n"+ "MATCH (a {artifactName:'"+ str(merged.loc[i][16]) +"'}), (b {photoNumber:"+ str(merged.loc[i][0])+"}) MERGE (a)-[:APPEARS_IN{timestamp:'"+ merged.loc[i][3]+"', xmin:"+ str(merged.loc[i][10])+", ymin:"+ str(merged.loc[i][11])+", xmax:"+ str(merged.loc[i][12])+", ymax:"+ str(merged.loc[i][13])+"}]->(b);")
		sys.stdout = original_stdout


with open('./relationships.cql', 'a') as f:

	# photo -[TAKEN_FROM]-> square, carrying the context information as edge properties
	for i, row in merged2.iterrows():
		sys.stdout = f
		print("\n"+ "MATCH (a {photoNumber:"+ str(merged2.loc[i][8]) +"}), (b {square:'"+ str(merged2.loc[i][1])+"'}) MERGE (a)-[:TAKEN_FROM{Square:'"+ merged2.loc[i][1]+"', Module:'"+ str(merged2.loc[i][2])+"', CONTEXT_NUMBER:'"+ str(merged2.loc[i][0])+"'}]->(b);")
		sys.stdout = original_stdout


Now I’d like to make my code more generalizable. But for the time being… it works, it’s alive!

Trying Something New

I never really got the hang of having a ‘holiday’, and dammit, futzing about on the computer still is fun, when there’s no deadline or pressing compulsion to do so… so the idea is:

  • i want a graph database (looks like neo4j is the most accessible thing?), because the relationships in this data seem pretty important
  • but the people entering the data can’t be expected to write cypher queries to do that

But we can import CSV tables describing objects and properties and relations and the properties of those relations. OK. So maybe just fill out a bunch of spreadsheets and export to csv.

But filling out cells in a spreadsheet can be pretty mind numbing, and we also know that excel does weird shit to data sometimes, and well, couldn’t we just avoid it?

So I’ve been designing data entry forms using jupyter notebooks, widgets, and voila. I’ve also been trying out tkinter for making forms directly in python too. So far, tkinter is winning. And at the end of that process, I end up with an sqlite database which I could push online using say datasette and have an api all of a sudden for the data, too. So that’s a win.
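To give a flavour of what I mean, here’s a toy sketch of a tkinter form writing straight to sqlite; the table and field names are invented for the example, and my real forms are considerably more involved:

```
import sqlite3
import tkinter as tk

# toy example: a two-field entry form that appends rows to a sqlite table
conn = sqlite3.connect("records.db")
conn.execute("CREATE TABLE IF NOT EXISTS contexts (square TEXT, description TEXT)")

root = tk.Tk()
root.title("context entry")

tk.Label(root, text="Square").grid(row=0, column=0)
square = tk.Entry(root)
square.grid(row=0, column=1)

tk.Label(root, text="Description").grid(row=1, column=0)
description = tk.Entry(root)
description.grid(row=1, column=1)

def save():
    # write the current values, then clear the form for the next record
    conn.execute("INSERT INTO contexts VALUES (?, ?)",
                 (square.get(), description.get()))
    conn.commit()
    square.delete(0, tk.END)
    description.delete(0, tk.END)

tk.Button(root, text="Save", command=save).grid(row=2, column=1)
root.mainloop()
```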

But back to the csv. I get a csv out of my tkinter data entry forms. It’s still not really in the format for entry into neo4j through its LOAD CSV stuff. This probably is more a reflection on me than on neo4j, of course. Came across this – https://aspen-lang.org/ – which looks like it’ll do what I want. Remember, the data is being added and worked with by folks with even less experience in this domain than me. Aspen looks pretty cool. Describe your data, it translates it into cypher queries, and pushes it into neo4j.

Problem is, I’m getting all sorts of weird errors that lead me to suspect that the version of ruby I’m running is the wrong version. I have no experience with ruby, other than a couple of jekyll experiments. The Aspen documentation says 2.6 or higher, but I’m running 3, and this isn’t working. It turns out you can run many versions of ruby at once (see stackoverflow) so I’ll give that a try, install 2.6 and see what happens….

… not much; but a new error message ‘undefined method `merge’ for false:FalseClass (NoMethodError)’ so that’s progress I guess.

**update**

Memo to self: in aspen.rb, if I comment out the merge call, everything works:

def self.compile_text(text, environment = {})
    assert_text(text)

    if text.include?(SEPARATOR)
      env, _sep, code = text.partition(SEPARATOR)
      compile_code(code, )#YAML.load(env).merge(environment))
    else
      code = text
      compile_code(code, environment)
    end
  end

Now, if I follow the wiki to do the csv import, modify the bin/convert script as directed and have empty ‘main.aspen’ files in ‘grammars’ and ‘discourse’, plus remember to capitalize nodes, node properties, then… hot damn, it writes a cypher query!

Five Years Of Epoiesen; Future Funding the Next Five!

Holy moly, it’s been 5 years of Epoiesen. I’d like to think we’ve had a bit of impact, a small moving of the needle in expanding the range of what it is possible to do! Neville Morley, of the University of Exeter, remarked a short while ago on Twitter about Epoiesen:

“Five years of being the most downright interesting and thought-provoking publication in archaeology/ancient history.”

The mission of Epoiesen has been referenced in journals like the Canadian Journal of Archaeology, the European Journal of Archaeology, and Advances in Archaeological Practice, and elsewhere; and individual pieces are being cited, used in teaching, and enjoyed by readers from all walks of life. Our authors range from tenured professors, to graduate and undergraduate students, to members of the public – probably the widest variety you’ll see! Of course, as a matter of principle we don’t use tracking cookies on Epoiesen, so I can’t give you ‘hits’ or shares or that sort of thing, but on Google Scholar you can see some of the pieces are gaining traction – https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=epoiesen.library.carleton.ca&btnG= .

Moving forward, I have set up a campaign on Carleton University’s micro-fundraising site, ‘FutureFunder’, to build up a bit of a reserve so that I can provide some paid training opportunities for students to help with Epoiesen; everything from copy-editing to site architecture to promotion. I would be grateful if readers of Electric Archaeology or Epoiesen could circulate this call around their own networks, link below. Starting at midnight on Nov 30 (ie, 12.01 am Nov 30), funds raised will be matched dollar-for-dollar by the University on ‘Giving Tuesday’, which is a great initiative.

link here because I can’t stop the new wordpress editor from turning it into an embed.

Epoiesen is free to read, and free to publish in; we do not charge, nor will we ever charge, article processing fees. Pieces are published under licenses chosen by the authors. The formats are only limited by my own technical skills – but if I can start hiring people, even that limitation will fall away! Paper ‘Annual’ versions are available courtesy of our friends at The Digital Press at the University of North Dakota. Volumes 1 – 4 are available (free pdf, $9 physical; any funds from sales go back into making other works from the Press open access); Volume 5 will be in production shortly.

Professor Graham’s Museum of Eldritch Antiquities

The more I study computational creativity, the more I’ve come to the conclusion that whether something is perceived as creative or not says a lot about us as humans and less about our algorithms. Or to introduce another idiom: creativity is in the eye of the beholder.
 

~

At night, the statues would come alive

...in Professor Graham’s Museum of Eldritch Antiquities. “Come – come!” he gestures. You enter a room; the darkness in the air shimmers…

As you walk along, you become aware of a faint clicking noise. “Now, in this room – please don’t get too close to the glass, ha ha, it’s only one pane thick – we have some lovely Greek kraters…”
 
His voice drops to a whisper. “Best to just keep on walking past the next gallery. Don’t make eye contact, whatever you do.”
 
 
It turns a baleful…eye? towards you. The cold certain knowledge drops into your brain. There is no escape. A noise from the other gallery distracts it momentarily, and your feet decide to flee. “Wait!” he yells, “You can only exit through the gift shoppe! It is most curious!”
 
 
Bewildered, you run down one evershifting corridor into another, until you crash into a dead-end wall…. and somehow, push through to the outside world. Panting, you stare up at the monstrous edifice…
 
 
At least you got a postcard from Professor Graham’s Museum of Eldritch Antiquities. Looking closely, you see it was painted by someone W someone Bl
 
 
You squint at the words at the bottom, trying to frame your lips around the unfamiliar syllables, momentarily forgetting they’re likely cursèd. It is the last mistake you make. /fin.
 
~
 
generative art. yeah, the computer pumps out the pixels, but figuring out how to _drive_ the computer to somewhere interesting, somewhere you want to go: there’s the art.
 
 
The top image in each pair was the starting place for a VQGAN+CLIP generative experiment; the bottom image is a still from the resulting movie. You can see the way the images emerge starting at this post. I’m exploring the spaces that the machine(s) ‘know’. Below, a simple prompt, ‘an archaeologist in the field’, drawing on the wikiart image model.
 
No conclusions in this post; experiments ongoing.

A Stack of Text Files + Obsidian Interface = Personal Knowledge Management System

(See my previous post about Obsidian here. This post is written for my students, except for the bit at the end)

In recent years, I’ve found that a lot of my research materials are all online. Everything I read, everything I study. I use Zotero to handle bibliography and to push pdfs to an iPad with zotfile; then I annotate as I read on the device, and eventually, retrieve the annotations back into Zotero.

I read on a browser on my work computer too; I use hypothes.is with a private group where I’m the only member to annotate websites and pdfs in the browser when I can’t be bothered to send them to the ipad.

I use a notebook where I scribble out ideas and page numbers.

All of this material is scattered around my devices. I’ve long admired the work of Caleb McDaniel, and his open notebook history. His research lives online, in a kind of wiki. Links between ideas and observations can be made that way and eventually, ideas can emerge out of the notes themselves by virtue of the way they’re connected. This is the idea behind what is called the Zettelkasten Method (‘slip box’, where each idea, each observation is kept on a discrete slip of paper numbered in such a way that connections can be formed):

A Zettelkasten is a personal tool for thinking and writing. It has hypertextual features to make a web of thought possible. The difference to other systems is that you create a web of thoughts instead of notes of arbitrary size and form, and emphasize connection, not a collection.

On a computer, such a thing can be created out of a folder of simple text files. Everything I observe, every idea that I have: one idea, one file. Then I use the free ‘second brain’ software, Obsidian, to sit ‘on top’ of that stack of files to enable easy linking and discoverability. Obsidian helps you identify connections – backlinks and outgoing links – between your notes. It visualizes them as a graph too, which can be helpful. Obsidian works as a personal knowledge base; seeing your notes develop and beginning to interlink – and being able to search those patterns – supercharges your note-taking abilities and the value of your reading. With time, you’ll start to see connections and patterns in your thoughts. You’re building up a personal knowledge base that becomes more valuable the more you use it. It can be extended with all sorts of plugins, written in javascript (since Obsidian is an electron app). Some of the plugins I use are Dataview (which lets you create new notes that query other notes’ metadata – keeping track of all notes relevant to a particular project, for instance) and Journey, which finds the shortest path between two notes (and you can then copy that path into a new note that embeds the contents of the notes along the journey: boom, instant article outline!)

I haven’t been playing with Obsidian for too long, but I do have a folder with over 400 separate markdown notes in it, the legacy of previous experiments with zettel-based systems. But so far, Obsidian seems like a winner. When I get an idea, I search my vault for relevant notes, and then it is easy to embed them into a new document (type [[ and start writing; any existing notes with that title are found; close with ]]. Then put a ! in front of it: the note is embedded. Add a ^ after the note title, and you can embed specific blocks or paragraphs from the source note). I write around the existing notes, and then export to Word for final polishing and citations (copy and paste; or install the pandoc plugin). This accelerates my writing considerably, and helps me pull my research together into coherent form. (Of course, since I’ve developed these notes across several different systems in my perennial search for the notetaking system to rule them all, my metadata usage isn’t consistent, which hampers things a bit).

Obsidian calls a folder with markdown files in it a ‘vault’. This vault can live in your icloud folder, your gdrive folder, your dropbox folder. You can initialize it with Git and push copies to a private git repo. Since the files are stored locally, the whole thing is very fast, and you’re not locked into using Obsidian if you decide something else would be better. I’ve started making a number of what I am calling ‘starter’ vaults for my students, to use in my upcoming courses. Since my courses this year are mostly methods-based, I want the students to use Obsidian as a kind of empty lab notebook. My starter vaults are configured with a few basic templates, some basic important course information, and a script for retrieving their Hypothesis annotations.
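The retrieval script varies by vault, so I won’t reproduce it here, but the general move is just a call to the Hypothesis search API. A rough Python sketch of that idea (not the vault’s actual script; the token is a placeholder, and you’d adjust the query parameters to taste):

```
import requests

# placeholder developer token from https://hypothes.is/account/developer
TOKEN = "YOUR-HYPOTHESIS-API-TOKEN"

resp = requests.get(
    "https://api.hypothes.is/api/search",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"limit": 20},  # add e.g. user= or group= to narrow the search
)
resp.raise_for_status()

# each row is one annotation; print the bits a note template would want
for row in resp.json()["rows"]:
    print(row["uri"], row.get("tags", []), row.get("text", ""))
```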

And for the ambitious student, there’s a vault configured with every plugin and template I could find that I figured would be helpful, at https://github.com/shawngraham/obsidian-student-starter-vault

These include:

  • ocr templater extracts text from images and inserts into a new note
  • extract notes from hypothesis
  • citations from zotero bib
  • with mdnotes installed in zotero, you can export your zotero notes to this vault
  • with refactor commands, you can break big notes into atomic notes
  • with the ‘related’ templater, you can use text analysis to find notes that are similar, to enable linking
  • open the backlink pane, and use backlinks and outgoing links to find potential connections with other notes (compares the text of the note to the title of notes, finding matching strings for potential linking)
  • use the journey plugin to find paths through the vault, chains of ideas; these chains can be inserted into new notes
  • use transclusion (embedding) to create long overview notes that pull your ideas/observations from atomic notes into something that can become the core for a paper or article
  • queries to see how long you’ve got left for a writing project, and to see which resources you’ve used for what project
  • kanban boards for project management

Finally, I am also using an Obsidian vault configured with Dataview to try to improve my project management. I have a few research projects with several moving parts, plus responsibilities coordinating a graduate programme and a minor programme. This vault looks like this:

On the left, the ‘preview’ (non markdown) version of the index note that links out to several project notes. At the bottom is a table that pulls together notes on each of my graduate students with things that I need to keep track of. In the right hand pane, another note is open, showing the markdown code block that invokes the dataview plugin. The plugin searches through the metadata of each note about my individual students, pulling out those students who are formally working on my various projects.

Anyway, so far, so good. Give obsidian.md a play! Download it, and open the demo vault to get a sense of whether or not this approach to note-keeping works for you.

Hands On with ‘Hands-On Data Visualization’

I was pleased to receive a physical copy of Jack Dougherty and Ilya Ilyankou’s Hands On Data Visualization: Interactive Storytelling from Spreadsheets to Code not long ago. The complete online open access version is available behind this link. 

I’ve worked with Jack before, contributing essays to some of the volumes he’s edited on writing digital history or otherwise crafting for the web with students.

The Hands On Data Visualization (henceforth, HODV) book continues Jack’s work making digital history methods accessible to the widest variety of people. That’s one of the key strengths of this book; it addresses those students who are interested in finding and visualizing patterns in the past but who do not, as yet, have the experience or confidence to ‘open the hood’ and get deep into the coding aspects of digital history. I love and frequently use, refer to, and assign, tutorials from The Programming Historian; but there is so much material there, I find my students often get overwhelmed and find it hard to get started. Of course, that says more about my teaching and pedagogical scaffolding than perhaps I am comfortable with sharing. HODV I think will serve as an on-ramp for these students because it builds on the things that they already know, familiar point-and-click GUIs and so on, but much more important is the way it scaffolds why and how a student, from any discipline, might want to get into data visualization. (And of course, once you can visualize some data, you end up wanting to build more complex explorations, or ask deeper questions.)

Let’s talk about the scaffolding then. The book opens with a series of ‘foundational skills’, most important amongst them being ‘sketching out the data story‘. I love this; starting with pen and paper, the authors guide the student through an exercise moving from problem, to question, to eventual visualization; this exercise bookends the entire volume; the final chapter emphasizes that:

The goal of data visualization is not simply to make pictures about numbers, but also to craft a truthful narrative that convinces readers how and why your interpretation matters….tell your audience what you found that’s interesting in the data, show them the visual evidence to support your argument, and remind us why it matters. In three words: tell—show—why. Whatever you do, avoid the bad habit of showing lots of pictures and leaving it up to the audience to guess what it all means. Because we rely on you, the storyteller, to guide us on a journey through the data and what aspects deserve our attention. Describe the forest, not every tree, but point out a few special trees as examples to help us understand how different parts of the forest stand out.

The focus throughout is on truthfulness and transparency and why they matter. We move from part one, the foundational skills (from mockups, to finding, organizing, and wrangling the data), to building a wide variety of visualizations, charts, maps, and tables and getting these things online at no cost in part two. Part three explores some slightly more complicated visualizations relying on templates that sometimes involve a wee bit of gentle coding, but these are laid out and illustrated clearly. This section is the one I’ve directed my own students to the most, as many of them are public history students interested in map making, and it is, I think, one of the best resources on the web at the moment for building custom map visualizations (and geocoding, &tc.). I think students navigating this material will be reassured and able to adapt when these various platforms/templates etc. change slightly (as they always do), given how carefully the various steps are documented and how they interrelate; this, I would think, enables the student to see how to adapt to new circumstances. In my own writing of tutorials, I rely too much on writing out the steps without providing enough illustrated material, even though my gang like the reassurance of a screen that matches what the person in charge says should happen (my reasoning for not providing too much illustrative material is that I’m also trying to teach my students how to identify gaps in their knowledge versus gaps in communication, and how to roll with things – you can judge for yourself how well that works, see e.g. https://craftingdh.netlify.app. But I digress).

The final section deals with truthfulness, with sections on ‘how to lie with charts’ and ‘how to lie with maps’, a tongue in cheek set of headings dedicated to helping students recognize and reduce the biases that using these various tools can introduce (whether intentionally or accidentally). The final chapter involves storyboarding, to get that truthful narrative out there on the web, tying us back to chapter one and trying to solve the initial problem we identified. I really appreciate the storyboarding materials; that’s something I want to try more of with my gang.

I’ve spent a lot of years trying to build up the digital skills of my history students, writing many tutorials, spending many hours one-on-one talking students through their projects, goals, and what they need to learn to achieve them. HODV fills an important gap between the dedicated tutorials for academics who know what it is they are after and have a fair degree of digital literacy, and folks who are just starting out, who might be overwhelmed by the wide variety of materials/tutorials/walk-throughs they can find online. HODV helps draw the student into the cultures of data visualization, equipping them with the lingo and the baseline knowledge that will empower them to push their visualizations and analyses further. Make sure also to check out the appendix on ‘common problems’, which gives a series of strategies to deal with the kinds of bugs we encounter most often.

My teaching schedule for the next little while is set, but I could imagine using HODV as a core text for a class on ‘visualizing history’ at say the second year level. Then, I would rejig my third year ‘crafting digital history’ course to explicitly build on the skills HODV teaches, focussing on more complex coding challenges (machine vision for historians, or NLP, topic models, text analysis). Then, my fourth year seminar on digital humanities in museums would function as a kind of capstone (the course works with undigitized collections, eventually publishing on the web with APIs, or doing reproducible research on already exposed collections).

Anyway, check it out for yourself at https://handsondataviz.org/ (the website is built with bookdown and R Studio; that’s something I’d like to teach my students too! Happily, there’s an appendix that sketches the process out, so a good place to start.) The physical book can be had at all the usual places. I don’t know what kinds of plans Jack and Ilya have for updating the book, but I expect the online version will probably be kept fresh, and will become a frequent stop for many of us.

 

Law & The Buying or Selling of Human Remains

Tanya Marsh compiled the relevant US laws surrounding human remains in ‘The Law of Human Remains’, 2016. I tried to gain a distant read of this body of law by topic modeling, term frequency–inverse document frequency (tf-idf), and pairwise comparison of the cosine distance between the laws. This is only possible due to the care, consistency, and regularity with which Marsh compiled the various laws. I also added relevant Canadian laws to my text corpus.

For the topic model, I took two approaches. The input documents are individual text files summarising each state’s laws. Then I created a topic model with 23 topics, based first on unigrams (individual words) and then on bigrams (pairs of words).
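The actual models were built in R, but for the curious, here is roughly the same unigram-versus-bigram move sketched in Python with scikit-learn; the `state-laws` folder of per-jurisdiction text files is a placeholder name:

```
from pathlib import Path
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# one text file per jurisdiction; 'state-laws' is a placeholder folder name
docs = [p.read_text() for p in sorted(Path("state-laws").glob("*.txt"))]

for ngrams in [(1, 1), (2, 2)]:  # unigrams first, then bigrams
    vec = CountVectorizer(stop_words="english", ngram_range=ngrams)
    dtm = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=23, random_state=42).fit(dtm)
    terms = vec.get_feature_names_out()
    # print the ten most heavily weighted terms per topic
    for i, topic in enumerate(lda.components_):
        top = [terms[j] for j in topic.argsort()[-10:][::-1]]
        print(ngrams, i + 1, " ".join(top))
```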

For the unigram topic model, these are the topics and their probabilities:

1 body person burial permit dead remains death funeral disinterment director 0.356792543
2 act body person offence burial permit death liable fine cemetery 0.113741585
3 person burial cemetery monument tomb structure guilty remains class removes 0.102140763
4 person body licence specimen purpose crime possession offence deceased anatomical 0.047599170
5 remains funerary object native violation objects profit individual indian title 0.042126856
6 code corpse offense commits tex ilcs treats conduct supervision offensive 0.038992423
7 corpse intentionally site reburial medical admin sexual report coroner examiner 0.032800544
8 disturb destroy ground unmarked regs skeletal memorial knowingly material kin 0.032624217
9 dollars imprisonment thousand fine punished hundred exceeding fined duly conviction 0.030167542
10 death art burial ashes burials civil commune lands container dissection 0.029045414
11 remains disposition person act funeral heritage object operator deceased cremated 0.026407496
12 vehicle rev procession import export sites unmarked skeletal site historic 0.024186852
13 communicable metal disease casket sealed encased stillbirth coffin embalmed lined 0.021603532
14 cemetery corporation thereof monument purpose remove provided notice owner county 0.019534437
15 stat ann entombed intentionally excavates thereof stolen interred surviving directed 0.016960718
16 products criminal importation elements provisional cadavers tissues article organs including 0.016726574
17 services funeral business minister public paragraph prescribed information director person 0.016712443
18 town ch gen clerk agent board rsa city designated registered 0.012288657
19 church cons lot society trustees belonging company weeks deaths association 0.011458630
20 category violates punishable ii offences procedure vehicle field provincial shipment 0.008089604

When we look at the laws this way, you can see that the vast majority of the law is related to regulating the funeral industry (topic 1), the regulation and care of cemeteries (topics 2, 3), and then various offenses against corpses (including specific mention of Indigenous remains; topics 4, 5, 7; there is a surprising number of statutes against necrophilia). Some topics deal with interfering with corpses for tissue and implantation (topic 16). Some topics deal with forms of memorialization (topics 10, 11).

If we take pairs of words as our ‘tokens’ for calculating the topic model, these are the topics and their probabilities:

  1. funeral director dead body transit permit burial transit final disposition local registrar licensed funeral death occurred common carrier burial ground 0.16873108
  2. burial permit statistics act vital statistics death occurred health officer anatomy act death occurs religious service common carrier act offence 0.06060749
  3. burial permit summary conviction coroners act statistics act vital statistics archaeological object act offence palaeontological object public health chief coroner 0.05312818
  4. funerary object native indian monument gravestone mutilate deface fence railing tomb monument authorized agent duly authorized willfully destroy destroy mutilate 0.05259857
  5. funerary objects burial site registration district responsible person family sensibilities disposition transit interment site sepulcher grave ordinary family person knowingly 0.05135363
  6. cons stat person commits health safety code ann penal code safety code legal authority cemetery company historic burial authority knowingly 0.05095100
  7. thousand dollars burial remains surviving spouse burial furniture county jail means including original interment pretended lien skeletal burial disturb vandalize 0.05092606
  8. deceased person relevant material anatomy licence religious worship united kingdom ii relevant mortem examination post mortem summary conviction anatomical examination 0.04980786
  9. knowingly sells tomb grave marked burial degree felony sells purchases sexual penetration dead fetus subsequent violation aids incites removing abandoning 0.04823685
  10. tomb monument monument gravestone offences procedure procedure act provincial offences defaces injures mutilates defaces skeletal remains burial artifacts destroys mutilates 0.04754577
  11. burial grounds gross misdemeanor disposition permit historic preservation unlawfully removed conviction thereof grave artifact fence wall grave vault private lands 0.04716950
  12. stat ann dead body disinters removes intentionally excavates deceased person grave tomb disinterment removal excavates disinters dead person thousand dollars 0.04301240
  13. stat ann rev stat unmarked burial skeletal remains burial site funeral procession burial sites burial artifacts admin regs funeral home 0.04141044
  14. cremated remains death certificate grand ducal mortal remains sexual intercourse article chapter level felony title article civil registrar grave rights 0.04090582
  15. medical examiner final disposition admin code burial site cataloged burial code dhs death occurred cremation permit sexual contact death certificate 0.03936073
  16. burial permit designated agent historic resources responsible person disinterment permit medical examiner chief medical palaeontological resource historic resource lineal descendant 0.03757965
  17. cemetery corporation dead body local health awaiting burial pub health profit corp religious corporation tombstones monuments attached thereto burial removal 0.03335480
  18. cremated remains funeral provider heritage object public health funeral services damage excavate provincial heritage tissue gift gift act damage desecrate 0.03295103
  19. coffin casket tomb monument airtight metal disinter individual individual remains lined burial private burying historic preservation enforcement officer metal lined 0.02902297
  20. funeral services services business public health health director business licensee national public services provider transportation services classified heritage funeral operations 0.02134617

You see the same general relative proportions, but the bigrams give a bit more clarity to the topics (read each list as pairs of words). Either way you cut it, there’s not much language given over to dealing with the buying or selling of the dead, and a lot more space given over to regulating the funeral industry and graveyards.

Calculating tf-idf gives a sense of what differentiates the different jurisdictions, since it will pull out words that are comparatively rare in the complete body of text but prominent in a single document. I’m having trouble getting the visualizations to lay out cleanly (text overlaps; darn R). In terms of comparing the cosine similarity of texts, there are some interesting patterns; here’s a sample:

1 Iowa Michigan 0.5987066
2 Michigan Iowa 0.5987066
3 Florida Michigan 0.5116013
4 Michigan Florida 0.5116013
5 Iowa Florida 0.5100568
6 Florida Iowa 0.5100568
7 District of Columbia Georgia 0.4800154
8 Georgia District of Columbia 0.4800154
9 Mississippi Georgia 0.4771568
10 Georgia Mississippi 0.4771568

…that is to say: Iowa & Michigan are about 60% similar; Florida and Michigan are about 51% similar; and so on. I had done this to see what the outliers are; I also tried representing these relationships as a network.
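For anyone wanting to reproduce the distance calculation, here’s a rough sketch in Python with scikit-learn (again, the original analysis was done in R, and `state-laws` is a placeholder folder of one text file per jurisdiction):

```
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

paths = sorted(Path("state-laws").glob("*.txt"))
names = [p.stem for p in paths]

# tf-idf weighted document vectors, then pairwise cosine similarity
tfidf = TfidfVectorizer(stop_words="english").fit_transform([p.read_text() for p in paths])
sims = cosine_similarity(tfidf)

# print the more similar pairs of jurisdictions
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if sims[i, j] > 0.45:
            print(names[i], names[j], round(sims[i, j], 3))
```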

So… what *are* the laws around buying and selling human remains? I went on an epic twitter thread yesterday as I read through Marsh 2016. Thread starts here:

And I managed to break the thread; it resumes with this one:

All of this will be summarised and discussed in our book about the human remains trade, in due course.

A museum bot

I wanted to build a bot, inspired by some of my students who made a jupyter notebook that pulls in a random object from the Canadian Science and Technology Museum’s open data, displaying all associated information.

The museum’s data is online as a csv file to download (go here to find it: http://data.techno-science.ca/en/dataset/cstmc-smstc-artifacts-artefact ). Which is great; but not easy to integrate – no API.

Build an API for the data

So, I used Simon Willison’s Datasette package to take the csv table, turn it into a sqlite database, and then push it online – https://datasette.io/.

First I installed sqlite-utils and datasette using homebrew:

brew install sqlite-utils datasette

then I turned the csv into sql:

sqlite-utils insert cstm.db artefacts cstmc-CSV-en.csv --csv

I installed the commandline tools for vercel, where my museum data api will live, with

npm i -g vercel

vercel login

then I pushed the data online with datasette; datasette wraps the database in all its datasette goodness:

datasette publish vercel cstm.db --project=cstm-artefacts

You can see the results for yourself at https://cstm-artefacts.vercel.app/ (click on ‘artefacts’).
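Because datasette exposes any SQL query as a JSON endpoint, the data is now scriptable from anywhere. Here’s a quick Python sketch of hitting that endpoint (the R version I actually use follows below), assuming the vercel instance is still live:

```
import requests

# same idea as the bot's query: one random artifact with a description and a thumbnail
sql = """
SELECT * FROM artefacts
WHERE GeneralDescription IS NOT NULL AND thumbnail IS NOT NULL
ORDER BY RANDOM() LIMIT 1;
"""

resp = requests.get("https://cstm-artefacts.vercel.app/cstm.json", params={"sql": sql})
resp.raise_for_status()
data = resp.json()

# datasette returns the column names and the rows separately
print(data["columns"])
print(data["rows"][0])
```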

Now, a few days ago, Dan Pett posted the code for a bot he made that tweets out pics & data from the Portable Antiquities Scheme database – see his repo at https://github.com/portableant/findsbot. I figured it should be easy enough to adapt his code, especially since my new api will return data as json.

Build a Bot with R

So I fired up RStudio on my machine, and began experimenting. The core of my code runs an sql query on the API looking for a random object where ideally the general description and thumbnail fields are not null. Then it parses out the information I want, and builds a tweet:

library(httr)
library(rtweet)
library(jsonlite)
library(digest)

search <- paste0('https://cstm-artefacts.vercel.app/cstm.json?sql=SELECT+*+FROM+artefacts+WHERE+NOT+GeneralDescription+IS+NULL+AND+NOT+thumbnail+IS+NULL+ORDER+BY+RANDOM%28%29+LIMIT+1%3B')
randomFinds <- fromJSON(search)
## grab the info, put it into a dataframe
df <- as.data.frame(randomFinds$rows)
artifactNumber <- df$V1
generalDescription <- df$V3
contextFunction <- df$V17
thumbnail <- df$V36

## write a tweet
tweet <- paste(artifactNumber,generalDescription,contextFunction, sep=' ')

## thank god the images have a sensible naming convention;
## grab the image data
imagedir <- randomFinds$results$imagedir
image <- paste0(artifactNumber,'.aa.cs.thumb.png')
imageUrl <- paste0('http://source.techno-science.ca/artifacts-artefacts/images/', URLencode(image))

## but sometimes despite my sql, I get results where there's an issue with the thumbnail
## so we'll test to see if there is an error, and if there is, we'll set up a 
## an image of the Museum's lighthouse, to signal that well, we're a bit lost here
if (http_error(imageUrl)){
  imageUrl <- paste0('https://ingeniumcanada.org/sites/default/files/styles/inline_image/public/2018-04/lighthouse_.jpg')
  tweet <- paste(artifactNumber,generalDescription,contextFunction, "no image available", sep=' ')
}

## then we download the image so that we can upload it within the tweet
temp_file <- tempfile()
download.file(imageUrl, temp_file)

So all that will construct our tweet.

Authenticate….Authenticate…

The next issue is setting up a bot on twitter, and getting it to… tweet. You have to make a new account, verify it, and then go to developer.twitter.com and create a new app. Once you’ve done that, find the consumer key, the consumer secret, the access token, and the access secret. Then, make a few posts from the new account as well just to make it appear like your account is a going concern. Now, back in our script, I add the following to authenticate with twitter:

findsbot_token <- rtweet::create_token(
  app = "THE-EXACT-NAME-YOU-GAVE-YOUR-APP",
  consumer_key = "THE-KEY-GOES-HERE",
  consumer_secret = "THE-SECRET-GOES-HERE",
  access_token = "THE-TOKEN-GOES-HERE",
  access_secret = "THE-ACCESS-SECRET-GOES-HERE"
)

# post the tweet
rtweet::post_tweet(
  status = tweet,
  media = temp_file,
  token = findsbot_token
)

And, if all goes according to plan, you’ll get a “your tweet has been posted!” message.

Getting the authentication to work for me took a lot longer than I care to admit; the hassle was all on the developer.twitter.com site because I couldn’t find the right damned place to click.

Secrets

Anyway, a bot that tweets when I run code on my machine is cool, but I’d rather the thing just ran on its own. Good thing I have Dan on speed-dial.

It turns out you can use Github Actions to run the script periodically. I created a new public repo (Github actions for private repos cost $) with the intention of putting my bot.R script in it. It is a very bad idea to put secret tokens in plain text on a public repo. So we’ll use the ‘secrets’ settings for the repo to store this info, and then modify the code to pull that info from there. Actually, let’s modify the code first. Change the create_token to look like this:

findsbot_token <- rtweet::create_token(
  app = "objectbot",
  consumer_key =    Sys.getenv("TWITTER_CONSUMER_API_KEY"),
  consumer_secret = Sys.getenv("TWITTER_CONSUMER_API_SECRET"),
  access_token =    Sys.getenv("TWITTER_ACCESS_TOKEN"),
  access_secret =   Sys.getenv("TWITTER_ACCESS_TOKEN_SECRET")
)

Save, and then commit to your repo. Then, click on the cogwheel for your repo, and select ‘Secrets’ from the menu on the left. Create a new secret, call it TWITTER_CONSUMER_API_KEY and then paste in the relevant info, and save. Do this for the other three items.

One thing left to do. Create a new file, and give it the file name .github/workflows/bot.yml; here’s what should go inside it:

name: findsbot

on:
  schedule:
    - cron: '0 */6 * * *'
  workflow_dispatch:
    inputs:
      logLevel:
        description: 'Log level'
        required: true
        default: 'warning'
      tags:
        description: 'Run findsbot manually'
jobs:
  findsbot-post:
    runs-on: macOS-latest
    env:
      TWITTER_CONSUMER_API_KEY: ${{ secrets.TWITTER_CONSUMER_API_KEY }}
      TWITTER_CONSUMER_API_SECRET: ${{ secrets.TWITTER_CONSUMER_API_SECRET }}
      TWITTER_ACCESS_TOKEN: ${{ secrets.TWITTER_ACCESS_TOKEN }}
      TWITTER_ACCESS_TOKEN_SECRET: ${{ secrets.TWITTER_ACCESS_TOKEN_SECRET }}
    steps:
      - uses: actions/checkout@v2
      - uses: r-lib/actions/setup-r@master
      - name: Install rtweet package
        run: Rscript -e 'install.packages("rtweet", dependencies = TRUE)'
      - name: Install httr package
        run: Rscript -e 'install.packages("httr", dependencies = TRUE)'
      - name: Install jsonlite package
        run: Rscript -e 'install.packages("jsonlite", dependencies = TRUE)'
      - name: Install digest package
        run: Rscript -e 'install.packages("digest", dependencies = TRUE)'
      - name: Create and post tweet
        run: Rscript bot.R

If you didn’t call your script bot.R then you’d change that last line accordingly. Commit your changes. Ta da!

The line that says `cron: '0 */6 * * *'` is the actual schedule. The five fields are minute, hour, day of month, month, and day of week, so `0 */6 * * *` means ‘at minute zero of every sixth hour’ – four tweets a day. A handy crontab reference lives here: https://www.adminschoice.com/crontab-quick-reference . If you want to test your workflow, click on the ‘actions’ link at the top of your repo, then on ‘findsbot’. If all goes according to plan, you’ll soon see a new tweet. If not, you can click on the log file to see where things broke. Here’s my repo, fyi: https://github.com/shawngraham/cstmbot.

So to reiterate – we found a whole bunch of open data; we got it online in a format that we can query; we wrote a script to query it, and build and post a tweet from the results; we’ve used github actions to automate the whole thing.

Oh, here’s my bot, by the way: https://twitter.com/BotCstm

Time for a drink of your choice.

Postscript

Individual objects are online, and the path to them can be built from the artefact number, as Steve Leahy pointed out to me: https://ingeniumcanada.org/ingenium/collection-research/collection-item.php?id=1979.0363.041 Just slap that number after the php?id=. So, I added that to the text of the tweet. But this also sometimes causes the thing to fail because of the character length. I’m sure I could probably test for tweet length and then swap in alternative text as appropriate, but one thing at least is easy to implement in R – the use of an url shortener. Thus:

library(urlshorteneR)

liveLink <- paste0('https://ingeniumcanada.org/ingenium/collection-research/collection-item.php?id=', artifactNumber)
shortlink <- isgd_LinksShorten(longUrl = liveLink)

tweet <- paste(artifactNumber,generalDescription,contextFunction,shortlink, sep=' ')

Which works well. Then, to make sure this works with Github actions, you have to install urlshorteneR with this line in your yaml:

      - name: Install urlshorteneR package
        run: Rscript -e 'install.packages("urlshorteneR", dependencies = TRUE)'

ta da!