A System of Relationships, or Getting My Stuff Into Neo4j


A context is just the name we give to a system of relationships at a particular point in time; a point that conventionally corresponds with a singular event in the life of the ‘site’. But there’s nothing real about a context. It’s just a convention that hides the fact that we are describing the edges, not the nodes, of material culture. What would an archaeology look like that focused exclusively on ‘assemblages’ in the sense of D&G – agencements – that had their own agency, coming together?

When we think of things this way, it’s clear that a conventional ‘relational’ database is the wrong way of ordering things. A relational database, ironically, does not attach any meaning to the connections in themselves; it’s just a way of joining tables. A graph database on the other hand assigns properties to both the nodes and the edges, and allows us to query through both the properties and the relationships.

Graph databases emerge out of networks & graph theory. Neo4j is one of the most prominent in the field at the moment. But it has, to my mind, a glaring problem: it’s a right pain in the ass to get information into it. You can write queries that describe the nodes and the edges, and you can import csv files, but they have to be arranged just right. Much as in social network analysis, one effective way to import things is to have a table of the nodes, where the columns contain the properties, and another with the edges, and their properties.
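A minimal sketch of that two-table layout, using only the standard library. The column names (`node_id`, `source`, `target`, and so on) and the sample rows are invented for illustration; they are not the columns my forms export. The small check at the end catches edges that point at nodes that do not exist, which is the kind of thing that makes an import fail.

```python
import csv
import io

# Hypothetical node table: one row per node, columns are its properties.
NODES = [
    {"node_id": "a1", "label": "Artifact", "name": "example sherd"},
    {"node_id": "p1", "label": "Photo", "name": "example_photo.jpg"},
]

# Hypothetical edge table: one row per relationship, with its own properties.
EDGES = [
    {"source": "a1", "target": "p1", "type": "APPEARS_IN", "xmin": 10, "ymin": 20},
]


def to_csv(rows):
    """Serialise a list of dicts to CSV text, header row first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def dangling_edges(nodes, edges):
    """Return the edges whose source or target is missing from the node table."""
    known = {node["node_id"] for node in nodes}
    return [e for e in edges if e["source"] not in known or e["target"] not in known]


nodes_csv = to_csv(NODES)
edges_csv = to_csv(EDGES)
problems = dangling_edges(NODES, EDGES)
```

Keeping the properties of a relationship in the edge table, rather than on either node, is what lets the graph treat the connection itself as something you can query.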

But data entry is a nasty business. If you just start filling in cells in a spreadsheet, you quickly run into user interface issues, data validation issues, sheer pain-in-the-arsedness issues. I wanted therefore to design a form that would make that pain a little less awful. The other requirement was that we had a series of images on which we wanted to annotate the location of objects. The forms would then capture the metadata around how these objects interrelated with each other: the photo captured one moment in time, one event, one context.

One can design forms in Excel. Or in Access (though I didn’t have a copy of Access). Or Filemaker. Or Google App script. But… part of me is bloody minded. If I was going to screw things up, I wanted it to be because I screwed things up, not because some part of Excel decided that the data I was entering was really a date, or some other ‘helpful’ automagical correction.

There are a variety of ways of doing this DIY business. I could’ve gone with html and webforms. Remember when writing html was straightforward? Now it’s all Django or Svelte or TypeScript or whatever. Not being much of a Python person (or a front-end person either, for that matter), but recognizing that Python could run on whatever system, I thought I’d see what I could do in that regard. Which is how I came to Tkinter. Similarly, I could’ve used pyqt4, 5, or 6, or even gui tools (eg, like this) for designing Python forms. But I had found someone’s tutorial that did pretty much what I wanted to do. I might not be able to write from scratch, but I can generally follow the logic and adapt/adopt things as I find ’em.

I wrote four forms. One for context metadata, one for artifact metadata, one for photo metadata, and one for notes. I built in validation and dropdowns that pulled from one form to the next, so that I’d minimize my ability to screw up consistency of names and descriptions. I got the forms to import image annotations from LabelImg, and to export the whole boiling to four csv tables. (Chantal, my student, pointed out that if I’d started in pyqt5, I could probably have built the forms directly into LabelImg in the first place. Maybe next time.)
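For the annotation import, here is a rough sketch of reading one LabelImg file. It assumes the annotations were saved in LabelImg's Pascal VOC XML format (LabelImg can also write other formats), and the sample XML, the function name `parse_annotation`, and the output keys are made up for illustration rather than taken from my forms.

```python
import xml.etree.ElementTree as ET

# A made-up annotation in the Pascal VOC layout that LabelImg can write.
SAMPLE_XML = """
<annotation>
  <filename>example_photo.jpg</filename>
  <object>
    <name>example artifact</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>220</ymax></bndbox>
  </object>
</annotation>
"""


def parse_annotation(xml_text):
    """Return one dict per annotated object: photo, label and bounding box."""
    root = ET.fromstring(xml_text.strip())
    photo = root.findtext("filename")
    rows = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        rows.append({
            "Photo_filename": photo,
            "label": obj.findtext("name"),
            "xmin": int(box.findtext("xmin")),
            "ymin": int(box.findtext("ymin")),
            "xmax": int(box.findtext("xmax")),
            "ymax": int(box.findtext("ymax")),
        })
    return rows


rows = parse_annotation(SAMPLE_XML)
```

Each dict is one row for the artifact table; the four box coordinates are what later end up as properties on the APPEARS_IN relationship.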

Now the problem was getting all this into Neo4j. Two nodes were obvious: photo, and artifact.

(Artifact)-[APPEARS_IN]->(Photo).

But was the third node really ‘context’? I think the answer is ‘no’. The third node is the physical location; but the context describes an event that happens there. So:

(Photo)-[DEPICTS]->(Square)

and the various contextual information is a property sometimes of DEPICTS, sometimes of APPEARS_IN, sometimes of Square, Photo, and Artifact.

My forms didn’t export a nice table of relationships; instead I had to do some merging using Pandas to parse the various fields/cells into the places I wanted. Code examples follow below. Then it became a matter of writing print statements that would iterate through the rows of data and write the values in the various cells mixed together with the correct Cypher text and syntax. In this, I was following an example from this; there are similar things around the web.
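One thing to watch when gluing cell values into Cypher text: a description with an apostrophe in it will close the string early and break the statement. A small sketch of a guard against that, relying on Cypher's backslash escapes inside string literals. The helper names `cypher_string` and `create_square` and the sample values are hypothetical, not part of my scripts.

```python
def cypher_string(value):
    """Render a value as a single-quoted Cypher string literal."""
    text = str(value)
    text = text.replace("\\", "\\\\")  # escape backslashes first
    text = text.replace("'", "\\'")    # then escape single quotes
    return "'" + text + "'"


def create_square(variable, square, description):
    """Build one CREATE statement for a hypothetical SQUARE node."""
    return (
        f"CREATE ({variable}:SQUARE "
        f"{{square:{cypher_string(square)}, description:{cypher_string(description)}}})"
    )


statement = create_square("s1", "A1", "potter's debris")
print(statement)
```

If the statements were being sent through a Neo4j driver rather than written to a file, query parameters would be the more robust route; for a generated .cql file, escaping like this is a reasonable stopgap.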

I bundled these up into little scripts, and wrote one last gui to help with the workflow, a launcher with buttons for all the various components. I can enter data (saving it to sqlite3, which could be pushed online with Datasette, which’ll wrap it in an API so others can play with it), I can export it to flat csv tables, and I can push a button to get the Cypher statements to add to the graph database.
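The sqlite3-to-flat-csv step can be sketched in a few lines. The `notes` table, its columns, and the sample row below are invented for the example (the real forms have their own schemas), and an in-memory database stands in for the file on disk.

```python
import csv
import io
import sqlite3

# In-memory stand-in for the database the forms save into (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, context_number TEXT, note TEXT)")
conn.execute("INSERT INTO notes (context_number, note) VALUES (?, ?)", ("C1", "example note"))
conn.commit()


def table_to_csv(connection, table):
    """Dump every row of one table to CSV text, header row first.

    The table name is pasted into the SQL, so only pass names you control.
    """
    cursor = connection.execute(f"SELECT * FROM {table}")
    header = [column[0] for column in cursor.description]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(cursor)
    return buffer.getvalue()


notes_csv = table_to_csv(conn, "notes")
```

Writing `notes_csv` out to a file gives the kind of flat table the Pandas scripts then read back in.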

In Neo4j, I can now start querying the material as if I was looking for shortest paths through it, attenuated by physical x,y,z. Community detection. Cliques. And so on. If an archaeological site is a series of relationships, then I want to use a method that is built around understanding the structure (including absences!) of relationships. Tune in later for when I start querying stuff; that’ll be the thing: was all of this worth it?
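To make the shortest-path idea concrete before any real querying happens, here is a toy sketch in plain Python. It is not how Neo4j does it and it ignores the x,y,z weighting entirely; it just runs a breadth-first search over a handful of invented node names, treating every relationship as undirected.

```python
from collections import deque

# Invented edges: artifacts appear in photos, photos relate to a square.
EDGES = [
    ("artifact_1", "photo_1"),
    ("artifact_2", "photo_1"),
    ("photo_1", "square_A"),
    ("photo_2", "square_A"),
    ("artifact_3", "photo_2"),
]


def shortest_path(edges, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


path = shortest_path(EDGES, "artifact_1", "artifact_3")
print(path)
```

Two artifacts that never share a photo are still connected here, through the square both photos relate to, and that is the sort of indirect relationship the graph makes visible.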

~~~

Python that creates a Cypher create statement from a CSV:

import pandas as pd

filename = "../data/context_data.csv"

# load the data with pd.read_csv
record = pd.read_csv(filename)

#print(record.Square.unique())
# contexts will show up in the relationships, so drop them from the node table
record = record.drop(columns=['CONTEXT_NUMBER'])

# a square might show up a couple of times
# because of the other csv files - many artifacts from the same location -
# so I just want ONE node to represent the square
record = record.drop_duplicates(subset=['Square'])

with open('./square_nodes.cql', 'w') as f:
    for i, row in record.iterrows():
        # i is the original row label, so s1, s5, ... stay unique after de-duplication
        # columns are read by position: 0 square, 1 module, 2 context_type, 6 description
        print(f"\nCREATE (s{i + 1}:SQUARE {{square:'{row.iloc[0]}',module:'{row.iloc[1]}',context_type:'{row.iloc[2]}',description:'{row.iloc[6]}'}})", file=f)

And some Python that creates relationships; remember, there was no original relationships table; these are all derived implicitly or explicitly from the act of recording, and thinking of contexts as relationships, not things. Incidentally, I wrote these in a jupyter notebook first, so I could tinker slowly and make sure that I was grabbing the right things, that everything looked ok.

import pandas as pd

a = pd.read_csv("../data/photo_data.csv")
b = pd.read_csv("../data/artifact_data.csv")
b = b.dropna(axis=1)  # drop any column that contains an empty cell

c = pd.read_csv("../data/context_data.csv")
d = pd.read_csv("../data/photo_data.csv")
d = d.dropna(axis=1)

merged2 = c.merge(d, on='CONTEXT_NUMBER')
merged = a.merge(b, on='Photo_filename')

with open('./relationships.cql', 'w') as f:

    # artifact -> photo; the annotation's box coordinates become relationship properties
    # (values are picked out of each merged row by column position)
    for i, row in merged.iterrows():
        print(f"\nMATCH (a {{artifactName:'{row.iloc[16]}'}}), (b {{photoNumber:{row.iloc[0]}}}) MERGE (a)-[:APPEARS_IN{{timestamp:'{row.iloc[3]}', xmin:{row.iloc[10]}, ymin:{row.iloc[11]}, xmax:{row.iloc[12]}, ymax:{row.iloc[13]}}}]->(b);", file=f)

    # photo -> square; the context number lives on the relationship, not on a node
    for i, row in merged2.iterrows():
        print(f"\nMATCH (a {{photoNumber:{row.iloc[8]}}}), (b {{square:'{row.iloc[1]}'}}) MERGE (a)-[:TAKEN_FROM{{Square:'{row.iloc[1]}', Module:'{row.iloc[2]}', CONTEXT_NUMBER:'{row.iloc[0]}'}}]->(b);", file=f)


Now I’d like to make my code more generalizable. But for the time being… it works, it’s alive!