a note on git-lfs

Sometimes I have files that are larger than GitHub's 100 MB limit. Here's what you need to do.

brew install git-lfs
brew upgrade git-lfs

Start a new git repository, and then make sure Git Large File Storage (git-lfs) is tracking the large file. For instance, I just moved a topic model visualization (20,000 archaeological journal articles) to a repo on GitHub. It has a data CSV that is 135 MB. So I made a new repo on GitHub, but didn't initialize it on the website. Instead, after getting git-lfs installed on my machine:

git init
git lfs track "20000/data/topic_words.csv"
git add .gitattributes 20000/data/topic_words.csv
git commit -m "initial"
git add .
git commit -m "the rest"
git remote add origin https://github.com/shawngraham/archae-topic-models.git
git push -u origin master

Making Nerdstep Music as Archaeological Enchantment, or, How do you Connect with People Who Lived 3000 Years Ago?

by Shawn Graham, Eric Kansa, Andrew Reinhard

What does data sound like?

Over the last few days, what began as a bit of a lark has transformed into something more profound and meaningful. We’d like to share it with you—not just the result, but also our process. And in what we’ve made, perhaps, we find a way of answering the title’s question: how do you connect with people who lived 3,000 years ago?

In the recent past, Shawn has become more and more interested in representing the patterns we might detect, at a distance, in the large collections of digital data that are becoming more and more available . . . using sound. Called ‘sonification’, this technique maps aspects of the information against things like timbre, scale, instrumentation, rhythm, and beats-per-minute to highlight aspects of the data that a visual representation might not pick up. It’s also partly about making something strange—we’ve become so used to visual representations of information that we don’t necessarily recognize the ways assumptions about it are encoded in the visual grammars of bar charts and graphs. By trying to represent historical information in sound, we have to think through all of those basic decisions and elaborate on their implications.

Last week, he was toying with mapping patterns of topics in publications from Scotland from the 18th and 19th centuries as sound, using an online app called ‘TwoTone’. He shared it on Twitter, and well, one thing led to another, and a conversation began between Shawn, Eric, and Andrew: What might archaeological data sound like?

Sing in me Muse, through thine API, of sherds and munsell colors, of stratigraphic relations, and of linked thesauri URIs!

—Eric Kansa

Get Some Data

First things first: get some data. Open Context (Eric’s pet project) carefully curates and publishes archaeological data from all over the world. He downloaded 38,000 rows of data from the excavations at the Etruscan site of Poggio Civitate (where, in a cosmic coincidence, Andrew attended field school in 1991) and began examining it for fields that could be usefully mapped to various sonic dimensions. Ultimately, it was too much data! While there are a variety of ways of performing a sonification (see Cristina Wood’s Songs of the Ottawa, for instance), TwoTone only accepts 2,000 rows. The data used for this audio experiment was therefore very simple: counts of objects from Poggio Civitate were rendered as arpeggiated piano lines over three octaves; average latitude and average longitude were calculated for each class of thing, thereby making a chord; and each class of thing had its own unique value. Shawn’s initial result of data-driven piano sonification can be listened to here.

The four original dimensions of the sonification, as mapped in TwoTone: the rising notes in the bottom track are the item type IDs. All of the materials come from the same chronological period, so listening (or viewing left-to-right) needed some sort of organizing principle. Whether or not it is the right principle is a matter of interpretation and debate.
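TwoTone does this mapping for you, but to make the underlying move concrete, here is a minimal sketch in R of rescaling a column of counts onto three octaves’ worth of MIDI pitches. The numbers, the note range, and the column are invented for illustration; this is not the code behind the piece.

# a sketch of mapping data values to pitch, not the actual TwoTone pipeline
counts <- c(3, 12, 7, 44, 19, 2, 31)   # hypothetical object counts, one per row of data
low  <- 48                             # MIDI note for C3
high <- 84                             # MIDI note for C6, three octaves up
pitches <- round(low + (counts - min(counts)) / (max(counts) - min(counts)) * (high - low))
pitches                                # each count now has a note somewhere in that range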

Archaeology is a Remix

But what if an actual musician got hold of these tracks? Andrew recently published a work called ‘Assemblage Theory’ where he remixed found digital music in order to explore ideas of archaeological assemblages.[1] Taking his experimentation in electronic dance music (EDM) a step beyond Assemblage Theory, he took Shawn’s four original tracks based on Eric’s 3,000-year-old data and began to play, iterating through a couple of versions, in a genre he calls ‘nerdstep’. He crafted a 5-minute piece whose movements isolate each of the four data threads, which sometimes crash together like waves of building data, yet remain linked together. He opted for 120 bpm, a dance music standard, and then, noting where the waves of data subside into quiet pools, was inspired to write some lyrics. “The quiet segues are basically data reflexivity in audio form,” he says.

Data propagation
All this information
Gives me a reaction
Need time for reflection

A one-way conversation
This endless computation
Numbs me from sensation
Need time for reflection

Reflexivity
Give me time to breathe
Give me time to think

Reflexivity
Data raining down on me

Emotionally exhausting
How much will this cost me
I’m alone but you are watching
Look up from your screen

Reflexivity
Give me time to breathe
Give me time to think
Look up from your screen.

Reinhard used the open source Audacity audio software application to create the song based on archaeological data sonification. The first four tracks are Shawn’s piano parts, staggered in such a way as to introduce the data bit-by-bit, and then merged with 16 other tracks—overburden or matrix. In the beginning, they are harmonious and in time, but because of subtle variations in bpm, by the time the song ends the data have become messy and frenetic, a reflection of the scattered pieces within the archaeological record, something that happens over time. Each movement in the song corresponds to an isolated data thread from one of Shawn’s piano parts, which then loops back in with the others to see how they relate.

Life is A Strange Loop

Speaking of loops, let’s think about the full loop we’ve encountered here. 3,000 years ago, at a plateau in the tufa landscape of southern Etruria, people lived their lives, only to have their debris carefully collected, studied, systematized, counted, digitized, and exposed online. No longer things but data, these counts and spaces were mapped to simple sonic dimensions using a web-toy, making a moderately pleasing experience. Remixed, the music moves us, enchants us, towards pausing and thinking through the material, the labour, the meanings, of a digital archaeology.[2] If/when this song is performed in a club (attn: John Schofield and the Theoretical Archaeology Groups [TAG] in both the UK and North America), the dancers would then be embodying our archaeological knowledge of Poggio in their movements, in the flows and subtle actions/reactions their bodies make across the floor. In dancing, we achieve a different kind of knowledge of the world, that reconnects us with the physicality of the world.[3] The eruptions of deep time into the present [4] – such as that encountered at an archaeological site – are weird and taxing and require a certain kind of trained imagination to engage with. But by turning the data into music, we let go of our authority over imagination, and let the dancers perform what they know.

For the three of us as creators, this playful sonification of data allows us to see archaeological material with fresh eyes . . . errrrrr ears . . . and by doing so restores the enchantment we once felt at the start of a new project, or of being interested in archaeology in the first place. Restoring a sense of wonder to three middle-aged archaeologists is no small feat, but the act of play enabled us to approach a wealth of artifacts from one site we know quite well, and realize that we didn’t know it quite like this. The new music bridges the gap between humans past and present, and in dancing we (and hopefully you) embody the data we present. It’s a new connection to something old, and it is experienced by bodies. This is perhaps almost as intoxicating as the work done by Patrick McGovern (U. Penn) and Sam Calagione (Dogfish Head) in their experimentation and creation of ancient ales, the first of which was “Midas Touch”, a surprisingly drinkable brew concocted from an ancient recipe recovered on excavation in Asia Minor. Archaeology is often a cerebral enterprise, which deserves—at times—a good ass-shaking derived from a driving beat.

I’m listening now and am amazed. It is really beautiful, not only as a finished product, but as a process that started with people who lived their lives almost 3000 years ago.

—Eric Kansa

Reflexivity, by KGR [5]

Endnotes

[1] Reinhard’s article, “Assemblage Theory: Recording the Archaeological Record,” and two responses by archaeologists Jolene Smith and Bill Caraher.

[2] An argument made by Perry, Sara. (2019). The Enchantment of the Archaeological Record. European Journal of Archaeology, 22(3), 354-371. doi:10.1017/eaa.2019.24

[3] See for instance Block, Betty, and Judith Kissel (2001). Dance: The Essence of Embodiment. Theoretical Medicine and Bioethics 22(1), 5-15. DOI: 10.1023/A:1009928504969

[4] Fredengren, Christina (2016). Unexpected Encounters with Deep Time Enchantment. Bog Bodies, Crannogs and ‘Otherworldly’ sites. The materializing powers of disjunctures in time. World Archaeology 48(4), 482-499, DOI: 10.1080/00438243.2016.1220327

[5]  Kansa-Graham-Reinhard (pronounced as either “Cager” or “Kegger”—the GIF-debate of archaeological nerdstep/nerdcore).

References

Block, Betty, and Judith Kissel (2001). Dance: The Essence of Embodiment. Theoretical Medicine and Bioethics 22(1), 5-15. DOI: 10.1023/A:1009928504969

Caraher, William. (2019). “Assemblage Theory: Recording the Archaeological Record: Second Response” Epoiesen http://dx.doi.org/10.22215/epoiesen/2019.10

Fredengren, Christina (2016). Unexpected Encounters with Deep Time Enchantment. Bog Bodies, Crannogs and ‘Otherworldly’ sites. The materializing powers of disjunctures in time. World Archaeology 48(4), 482-499, DOI: 10.1080/00438243.2016.1220327

Perry, Sara. (2019). The Enchantment of the Archaeological Record. European Journal of Archaeology, 22(3), 354-371. doi:10.1017/eaa.2019.24

Reinhard, Andrew. (2019). “Assemblage Theory: Recording the Archaeological Record” Epoiesen http://dx.doi.org/10.22215/epoiesen/2019.1

Smith, Jolene. (2019). “Assemblage Theory: Recording the Archaeological Record: First Response” Epoiesen http://dx.doi.org/10.22215/epoiesen/2019.5

Tuck, Anthony (Ed.). (2012). “Murlo”. Released: 2012-07-06. Open Context. <http://opencontext.org/projects/DF043419-F23B-41DA-7E4D-EE52AF22F92F> DOI: https://doi.org/10.6078/M77P8W98 ARK (Archive): https://n2t.net/ark:/28722/k2222wm10

Featured Image by Sarthak Navjivan https://unsplash.com/photos/iTZOPe7BpTM

A Song of Scottish Publishing, 1671-1893

The National Library of Scotland has made available a collection of chapbooks printed in Scotland from 1671 to 1893, on their website here. That’s nearly 11 million words’ worth of material. The booklets cover an enormous variety of subjects. So, what do you do with it? Today, I decided to turn it into music.

As part of writing the second edition of the Historian’s Macroscope, I’ve been rewriting the topic modeling section, and I’ve included working with this collection and building a topic model for it using R. As part of that exercise, I preprocessed all the data so that it would be a bit easier for the newcomer to work with (it will be held in a GitHub repo for the purpose). Part of the preprocessing was adding a ‘publication date’ to the NLS-provided inventory file (which involved a whole bunch of command-line regex to grab that info from the METS metadata files).

To turn this into sound, I used the Topic Modeling Tool to build a quick topic model on the 3,000+ text files containing the OCR’d text. The TMT can also match your metadata up against the topic results, which is very handy, especially for turning the results into music, which I did with the TwoTone app. Drop the resulting CSV onto TwoTone, and your columns are ready to map to the music; the visualization is also handy for getting a sense of when your topics are most prominent (the left-hand side is the earliest date, the right-hand side the latest).
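If you’d rather stay inside R for the modeling step, something like the following sketch (using the tm and topicmodels packages) gets you to the same kind of CSV that TwoTone wants. The folder name, the inventory file, the date column, and k = 20 topics are all assumptions for illustration; the Topic Modeling Tool does this matching for you.

library(tm)
library(topicmodels)

# read the OCR'd chapbook text files into a corpus
corpus <- VCorpus(DirSource("chapbooks_txt/"))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# build a quick topic model
dtm <- DocumentTermMatrix(corpus)
model <- LDA(dtm, k = 20, control = list(seed = 42))

# document-topic proportions, joined to the (hypothetical) inventory file
# carrying the publication dates pulled from the METS metadata
props <- posterior(model)$topics
meta  <- read.csv("inventory_with_dates.csv")  # assumed to be in the same order as the files
out   <- data.frame(date = meta$date, props)

# one row per document, one column per topic: ready to drop onto TwoTone
write.csv(out, "topics_for_twotone.csv", row.names = FALSE)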

Then I played with the settings in TwoTone, filtering things so that notes are only played if they make a meaningful contribution to the entire year’s text.

You can listen to it on Soundcloud.

The piano arpeggios are mapped to a topic that seems largely to be bad OCR. The pipe organ corresponds to a topic about religion. The trumpet seems to be stories of people off to make their fortune (as I read the topic words for that topic). There’s a double bass in there, which I assigned to the ‘histories’ topic (because why not). The glockenspiel is assigned to the topic that I understand as ‘folk wisdom’, while the harp is mapped to stories of love and romance (and doomed love too, for that matter).

What do we learn doing this? Well, for one thing, it forces us to think about the constructedness of our ‘visualizations’, which is never a bad thing. It foregrounds how much dirty data is in this thing. It shows change over time in Scottish publishing habits (“we could have done that with a graph, Shawn!” to which I say: So what? Now you can engage a different part of your brain and feel that change over time.)

Enjoy.

Revisiting AR, some notes

I haven’t futzed around with AR in a while. Here are some notes from a recent foray back into Unity3d + Vuforia. My students Ayda & Marissa were trying to use ARIS to do some AR in the National Gallery. It worked in their tests with one image, and so they went ahead and developed several more triggers and overlays (which took a lot of term time), but when they went to play the AR, it would crash. I thought maybe it was a memory error, but after reformatting their images, triggers, and database, the crash continued.

We quickly ported it to Unity3d and Vuforia.

– when installing Unity, you now have to install Unity Hub. It’s from the Hub that you add the Android SDK and the Vuforia modules, if you forgot to add those initially.

– the Programming Historian tutorial on Unity and Vuforia is a bit out of date as a consequence

– the Unity quickstart guide is pretty good for getting going: https://docs.unity3d.com/Manual/vuforia_get_started.html

– Vuforia adds 3D objects to the tracking images. For an image overlay, right-click on the ImageTarget and select ‘Plane’ from 3D Object. Drag your image overlay from your Assets folder ONTO the plane.

– make sure your plane occupies the same spatial coordinates as your target. Otherwise, in AR the image will float in ‘real’ space at those other coordinates

– name your scene containing all of your ImageTargets ‘scene 1’

– make a new scene for your splash/menu. Call it ‘scene 0’

– make sure to add your scenes to the build settings, and that scene 0 is loaded BEFORE scene 1.

– this tutorial can be followed, more or less, to create the menu http://theflyingkeyboard.net/unity/unity-ui-c-simple-main-menu/

– you really just need the exit button, and the button that loads scene 1 when pressed.

An Enchantment of Digital Archaeology – a peek at the contents

What is more rational than a computer, reducing all phenomena down to tractable ones and zeros? What is more magical than a computer, that we tie our identities to the particular hardware or software machines we use?…Archaeology, as conventionally practiced, uses computation to effect a distancing from the world; perhaps not intentionally but practically. Its rituals (the plotting of points on a map; the carefully controlled vocabularies to encode the messiness of the world into a database and thence a report, and so on) relieve us of the task of feeling the past, of telling the tales that enable us to envision actual lives lived. The power of the computer relieves us of the burden of having to be human.

An enchanted digital archaeology remembers that when we are using computers, the computer is not a passive tool. It is an active agent in its own right (in the same way that an environment can be seen to be active)…In that emergent dynamic, in that co-creation with a non-human but active agent, we might find the enchantment, the magic of archaeology that is currently lacking in the field.

I might be a posthumanist.

An Enchantment of Digital Archaeology: Raising the Dead with Agent Based Models, Archaeogaming, and Artificial Intelligence with Berghahn Books in New York, in the series edited by Andrew Reinhard, ‘Digital Archaeology: Documenting the Anthropocene’, has now moved to the next stage of the publishing process. I signed the contract two years ago, got the first draft to the editor in June of this year, got the peer reviews back in September, rejigged the damned thing, rewrote parts, rearranged the structure, added new parts, and resubmitted it earlier this month. The peer reviews were incredibly generous, even when some parts or decisions on my end left them cold, and so of course the end result isn’t their fault, as the acknowledgements will duly note.

In one way or another, this is the book that I’ve been trying to write since I came to Carleton. I might’ve even outlined the idea for this book in my original application for the job. It’s always had a pedagogical aspect to it, even when it was called Toying with the Past (too negative) and then later Practical Necromancy (too scary). But what finally made things start to click were conversations with the folks at the University of York, who as a department seem to really gel around ideas of reflexivity, affective engagement, and plain ol’ out-there digital archaeology. I love them all!

The book is meant to take the reader through my own experience of disenchantment with archaeology, and then the ways I found myself re-enchanted through digital work. The intended audience is undergraduate students; I am not writing a how-to, but rather, I want to enthuse with the possibilities, to spark curiosity, and to fire the imagination.

Anyway, the publisher asked me to write abstracts for each chapter, and said I could share them here, so awwaaaaay we gooooo!

Introduction: An Enchantment of Digital Archaeology?

‘Enchantment’ is discussed, drawing on the political philosophy of Bennett, and contrasted with the ways archaeology comes to know the past. The rupture of the past into the present is one locus of enchantment. The chapter argues that simulation and related digital technologies capture something similar. A rationale for why simulation should be a necessary part of the archaeologist’s toolkit is offered. Considering enchantment means confronting disenchantment, and so prompts a reflective examination of the purpose of archaeology and archaeological computing. This kind of reflexive writing necessarily requires a very personal engagement with the materials. The chapter concludes with a discussion on some of the potential dangers of misunderstanding ‘enchantment’ as ‘seduction’.

Keywords: enchantment; digital archaeology; new aesthetic; simulation; affective engagement

Chapter One: Imagine a Network

Networks are a foundational metaphor for digital archaeology. If we can imagine the archaeological past within a system of relationships, we are dealing with networks. Networks can then be operationalized as the substrate for simulation, and the substrate for computation. The chapter sets up a longer discussion where we begin with a network as metaphor before moving towards more grounded and less metaphorical uses. It imagines the city of Rome as a process of flows through intertwining networks, a process of concretization of flows of energy and power and materials.

Keywords: city of Rome; bricks; building trade; complexity; networks

Chapter Two: Reanimating Networks

Agent based simulations are introduced. Their potentials and limitations are discussed, as well as the ways the code of the simulation captures the historiography of the phenomenon under discussion. Part of the attraction of agent based models rests in their formalization of the ‘just-so’ stories we might normally tell. This allows us to test the unintended or unforeseen consequences of those stories. We create self-contained software agents and give them rules drawn from our understanding of the past to guide their behaviour – and then we turn them loose to interact within the channels of the archaeological networks we have uncovered. In this case, the network of inter-urban communications.

Keywords: agent based models; Antonine Itineraries; information diffusion; replicability; formal models

Chapter Three: Add Agents, and Stir

The network can exist in social space, in addition to physical space. The social information recovered from stamped Roman bricks can be stitched into a network of human interactions over time; these networks can then be used as the starting point for simulating ancient social dynamics, and for asking what-if questions. The chapter concludes with a reflection on how such computational agents might escape from the confines of the machine, and what that implies for how we might know or have an affective response for the past. One way is that the labour these resurrected Romans, these ’digital zombies’, do depends on compelled labour in the ‘real’ world. How we talk about the creatures we create (in silicon) has ramifications for the world outside the machine.

Keywords: agent based models; agency; salutatio; violence; assemblages; vibrant matter

Chapter Four: Archaeogaming

One way for the archaeologist to sink into the digital assemblages reanimated with simulation is to transform the simulation into a game. Archaeogaming is considered in the sense of playing games with archaeological themes. A theory of play is also a theory of learning. The simulation considered in the previous chapter is recast as an archaeogame and the consequences of ‘playing’ this game are considered. The points of intersection between archaeogames and agent based models are considered for the ways in which the two forms differ. The chapter concludes with a discussion of a case study where students were asked to design games to communicate ‘good history’. The play of building leads to greater engagement and enchantment.

Keywords: archaeogames; design; play; time; failure; pedagogy

Chapter Five: The Fun is in the Building

A case study building an actual video-game informed by an agent based model is discussed, including design elements and a post-mortem on the successes and failures of the project. The ethics of game play and meaningful individual choices as they intersect with a larger society-level simulation should make for an engaging experience, but our lack of expertise in actual game design hampers the project. There is a mismatch between the mechanics of the genre and the dynamics of the cultural experience we wish to explore. Returning to the idea of the city of Rome as a kind of emergent outcome of dynamic flows, we consider the city management genre and its connections to archaeogaming. The chapter concludes with a consideration of how an analogue format, the board game, promotes the kind of digital thinking and enchantment we are seeking.

Keywords: first person shooter; artificial anasazi; simCity; Will Wright; board games

Chapter Six: Artificial Intelligence

Networks are capable of computation. Neural networks enable us to represent our archaeological information and historical imagination in ways that a computer can engage with creatively. A simple recurrent neural network is trained on the writings of various historical personae so that it can mimic their voice. A very complex language model released by the OpenAI foundation is used as a kind of parameter space out of which we can collapse its understanding of ‘archaeology’, as filtered through its understanding of the writings of Flinders Petrie. The enchantment of digital archaeology might therefore sit at the point of combination of powerful neural network models of knowledge with agent based models of behaviour and archaeogaming methods for interaction.

Keywords: artificial intelligence; GPT-2; neural networks; ethics; augmented reality

Conclusion: Enchantment is a Remembering

Digital artefacts are subject to decay and ruin. They sometimes erupt into the digital world’s ever-present ‘now’ in the same ways archaeological materials interrupt the physical world of today. To program something necessarily means cutting away information, and to understand how something is programmed involves actively trying to break it, to see in its ruptures what has been cut away. There is enchantment in this process. The simulations and toys that the book considers also point to the playfulness that is necessary to find the enchantment in digital archaeology. Ultimately, the growing power of digital technologies to pluck representations of the past out of the possibility space of computation increases our responsibility to the dead to be truthful, to be engaged, and to be enchanted.

Keywords: ruin; forgetting; world-views; representation; complexity

Afterword: Guidelines for developing your own digital archaeology

Some thoughts on how one might get started in all of this.

Appendices

Code walkthroughs for developing some ABMs and re-implementing one of my earlier models.

Failing Gloriously and Other Essays

‘Failing Gloriously and Other Essays’, my book reflecting on what ‘failure’ means, can mean, and should mean in the digital humanities and digital archaeology, will be released on Dec 1. From the publisher website (where you’ll be able to get your copy in due course):

Failing Gloriously and Other Essays documents Shawn Graham’s odyssey through the digital humanities and digital archaeology against the backdrop of the 21st-century university. At turns hilarious, depressing, and inspiring, Graham’s book presents a contemporary take on the academic memoir, but rather than celebrating the victories, he reflects on the failures and considers their impact on his intellectual and professional development. These aren’t heroic tales of overcoming odds or paeans to failure as evidence for a macho willingness to take risks. They’re honest lessons laced with a genuine humility that encourages us to think about making it safer for ourselves and others to fail.

A foreword from Eric Kansa and an afterword by Neha Gupta engage the lessons of Failing Gloriously and consider the role of failure in digital archaeology, the humanities, and social sciences.

The book will be available in print for $, and for free via PDF download.

Quinn Dombrowski has posted a wonderfully generous review over on Stanford Digital Humanities. I hope you’ll find value in it too!

scraping with rvest

We’re working on a second edition for the Historian’s Macroscope. We’re pruning dead links, updating bits and bobs, and making sure things still work the way we imagined they’d work.

But we really relied on a couple of commercial pieces of software, and while there’s nothing wrong with doing that, I really don’t want to be shilling for various companies, or trying to explain in print how to click this, then that, then look for this menu…

So, I figured, what the hell, let’s take the new-to-digital-history person by the hand and push them into the R and RStudio pool.

What shall we scrape? Perhaps we’re interested in the diaries of the second American President, John Adams. The diaries have been transcribed and put online by the Massachusetts Historical Society. The diaries are sorted by date on this page. Each diary has its own webpage, and is linked to on that index page. We would like to collect all of these links into a list, and then iterate through the list, grabbing all of the text of the diaries (without all of the surrounding html!) and copying them into both a series of text files on our machine, and into a variable so that we can do further analysis (eventually).

If you look at any of the webpages containing the diary entries, and study the source (right-click, ‘view source’), you’ll see that the text of the diary is wrapped by an opening

<div class="entry">

and a closing

</div>

That’s what we’re after. If you look at the source code for the main index page listing all of the diaries, you’ll see that the links are all relative links rather than absolute ones – they just have the next bit of the url relative to a base url. Every webpage will be different; you will get used to right-clicking and ‘viewing source’ or using the ‘inspector’.

For the purposes of this exercise, it isn’t necessary to install R and RStudio on your own machine, although you are welcome to do so and you will want to do so eventually. For now we can run a version of RStudio in your browser courtesy of the Binder service – if you click the link here, a version of RStudio preconfigured with many useful packages (including rvest and dplyr, which we will be using shortly) will (eventually) fire up in your browser.

With RStudio loaded up, select File > New File > R Script (or, click on the green plus sign beside the R icon).

The panel that opens is where we’re going to write our code. We’re not going to write our code from first principles though. We’re going to take advantage of an existing package called ‘rvest’ (pronounce it as if you’re a pirate….) and we are going to reuse but gently modify code that Jerid Francom first wrote to scrape State of the Union Addresses. By writing scripts or code to do our work (from data gathering all the way through to visualization) we enable other scholars to build on our work, to replicate our work, and to critique our work.

In the code snippets below, any line that starts with a # is a comment. Anything else is a line we run.


library(rvest)
library(dplyr)

These first two lines tell R that we want to use the rvest and dplyr packages to make things a bit easier. Put your cursor at the end of each line, and hit the ‘run’ button. R will pass the code into the console window below; if all goes well, it will just show a new prompt down there. Error messages will appear if things go wrong, of course. The cursor will move down to the next line; hit ‘run’ again. Now let’s tell R the baseurl and the main page that we want to scrape. Type:


base_url <- "https://www.masshist.org"
# Load the page
main.page <- read_html(x = "https://www.masshist.org/digitaladams/archive/browse/diaries_by_date.php")


We give a variable a name, and then use the <- arrow to tell R what goes into that variable. In the code above, we are also using rvest’s function for reading html to tell R that, well, we want it to fill the variable ‘main.page’ with the html from that location. Now let’s get some data:

# Get link URLs
urls <- main.page %>% # feed `main.page` to the next step
    html_nodes("a") %>% # get the CSS nodes
    html_attr("href") # extract the URLs
# Get link text
links <- main.page %>% # feed `main.page` to the next step
    html_nodes("a") %>% # get the CSS nodes
    html_text() # extract the link text

In the code above, we first create a variable called ‘urls’. We feed it the html from main.page; the %>% then passes the data on the left to the next function on the right, in this case ‘html_nodes’, a function that travels through the html looking for the ‘a’ elements (using a CSS selector), and then passes those to the next step, which extracts the ‘href’ of each hyperlink. The url is thus extracted. Then we do it again, but this time pass the text of the link to our ‘links’ variable. You’ve scraped some data!

But it’s not very usable yet. We’re going to make a ‘data frame’, or a table, of these results, creating a column for ‘links’ and a column for ‘urls’. Remember how we said earlier that the links were all relative? We’re also going to paste the base url into those links so that we get the complete path, the complete url, to each diary’s webpage.


# Combine `links` and `urls` into a data.frame
# because the links are all relative, let's add the base url with paste
diaries <- data.frame(links = links, urls = paste(base_url,urls, sep=""), stringsAsFactors = FALSE)

Here, we have created a ‘diaries’ variable, and we’ve told R that it’s actually a dataframe. Into that dataframe we are saying, ‘make a links column, and put links into it; and make an urls column, but paste the base_url and the link url together and do not put a space between them’. The ‘stringsAsFactors’ bit isn’t germane to us right now (but you can read about it here.) Want to see what you’ve got so far?


View(diaries)

The uppercase ‘V’ is important; a lowercase view doesn’t exist in R. Your dataframe will open in a new tab beside your script, and you can see what you have. But there are a couple of rows there where we’ve grabbed links like ‘home’, ‘search’, ‘browse’ which we do not want. Every row that we want begins with ‘John Adams’ (and in fact, if we don’t get rid of those rows we don’t want, the next bit of code won’t work!).


# but we have a few links to 'home' etc that we don't want
# so we'll filter those out with grepl and a regular
# expression that looks for 'John' at the start of
# the links field.
diaries <- diaries %>% filter(grepl("^John", links))

We are telling R to overwrite ‘diaries’ with ‘diaries’ that we have passed through a filter. The filter command has also been told how to filter: use ‘grepl’ and the regular expression (or search pattern) ^John. In English: keep only the rows that begin with the word John in the links column. Try View(diaries) again. All the extra stuff should be gone now!

We still haven’t grabbed the diary entries themselves yet. We’ll do that in a moment, while at the same time writing those entries into their own folder in individual text files. Let’s create a directory to put them in:


#create a directory to keep our materials in

dir.create("diaries")

and now, we’re going to systematically move through our list of diaries, one row at a time, extracting the diary entry which, when we examined the webpage source code earlier, we saw was marked by an ‘entry’ div. Here we go!


# Loop over each row in `diaries`
for(i in seq(nrow(diaries))) { # we're going to loop over each row in 'diaries', extracting the entries from the pages and then writing them to file.
  text <- read_html(diaries$urls[i]) %>% # load the page
    html_nodes(".entry") %>% # isolate the text
    html_text() # get the text

  # Create the file name
  filename <- paste0("diaries/", diaries$links[i], ".txt") # this uses the relevant link text as the file name
  sink(file = filename) # open file to write
  cat(text) # write the file
  sink() # close the file
}

The first line sets up a loop – ‘i’ is used to keep track of which row in ‘diaries’ we are currently in. The code between the { and } is the code that we loop through, for each row. So, we start with the first row. We create a variable called ‘text’, into which we get the read_html function from rvest to read the html for the webpage address that exists in the urls column of ‘diaries’ at row i. We pass that html to the html_nodes function, which looks for the div that embraces the diary entry. We pass what we found there to the html_text function, which extracts the actual text.

That was part one of the loop. In part two of the loop we create a filename variable and create a name from the link text for the webpage by pasting the folder name diaries + link-name-from-this-row + .txt. We use the ‘sink’ command to tell R we want to drain the data into a file. ‘cat’, which is short for ‘concatenate’, does the writing, putting the contents of the text variable into the file. Then we close the sink. We get to the closing bracket } and we start the loop over again, moving to the next row.

Cool, eh?

You now have a folder filled with text files that we can analyze with a variety of tools or approaches, and a text variable ready for more analysis right now in R.
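A quick way to check what you’ve got before moving on (a sketch; it just assumes the loop above ran and the ‘diaries’ folder now holds the files):

files <- list.files("diaries", full.names = TRUE)  # the text files we just wrote
length(files)                                      # how many diaries did we capture?
first <- paste(readLines(files[1]), collapse = " ")
substr(first, 1, 200)                              # peek at the opening of the first one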

The full code is in this github gist:

#after https://francojc.github.io/2015/03/01/web-scraping-with-rvest-in-r/
library(rvest)
library(dplyr)
base_url <- "https://www.masshist.org"
# Load the page
main.page <- read_html(x = "https://www.masshist.org/digitaladams/archive/browse/diaries_by_date.php")
# Get link URLs
urls <- main.page %>% # feed `main.page` to the next step
html_nodes("a") %>% # get the CSS nodes
html_attr("href") # extract the URLs
# Get link text
links <- main.page %>% # feed `main.page` to the next step
html_nodes("a") %>% # get the CSS nodes
html_text() # extract the link text
# Combine `links` and `urls` into a data.frame
# because the links are all relative, let's add the base url with paste
diaries <- data.frame(links = links, urls = paste(base_url,urls, sep=""), stringsAsFactors = FALSE)
# but we have a few links to 'home' etc that we don't want
# so we'll filter those out with grepl and a regular
# expression that looks for 'John' at the start of
# the links field.
diaries <- diaries %>% filter(grepl("^John", links))
#update nov 9 – I find that line 26 doesn't work in some versions of r via binder that
#i have running. I think it's a versioning thing. Anyway, another way of achieving the same
#effect if you get an error there is to slice away the bits you don't want (thus keeping
#the range of stuff you *do* want:
diaries <- diaries %>% slice(9:59)
#create a directory to keep our materials in
dir.create("diaries")
# Loop over each row in `diaries`
for(i in seq(nrow(diaries))) { # we're going to loop over each row in 'diaries', extracting the entries from the pages and then writing them to file.
  text <- read_html(diaries$urls[i]) %>% # load the page
    html_nodes(".entry") %>% # isolate the text
    html_text() # get the text
  # Create the file name
  filename <- paste0("diaries/", diaries$links[i], ".txt") #this uses the relevant link text as the file name
  sink(file = filename) # open file to write
  cat(text) # write the file
  sink() # close the file
}


The Resurrection of Flinders Petrie

The following is an extended excerpt from my book-in-progress, “An Enchantment of Digital Archaeology: Raising the Dead with Agent Based Models, Archaeogaming, and Artificial Intelligence”, which is under contract with Berghahn Books, New York, and is to see the light of day in the summer of 2020. I welcome your thoughts. The final form of this section will no doubt change by the time I get through the entire process. I use the term ‘golems’ earlier in the book to describe the agents of agent based modeling, which I then translate into archaeogames, which then I muse might be powered by neural network models of language like GPT-2.

The code that I used to generate pseudo Gibbons and pseudo Sophocles modelled the probabilities of different letters following one another. While sophisticated at the time, that approach is now little more than a toy. With the increase in computational power and complexity, these newer models open up tricky ethical issues for us, and in particular, if we use them to try to give our digital creations their own voice to speak. Let me sketch out how these new models work, resurrect Flinders Petrie, and then we’ll examine the aftermath.

More complex models of language now try to work out the ‘rules’ of language by ‘masking’ whole words and working out the probabilities to deduce the hidden word from the words that sit to either side. Others try to represent a word as an ‘embedding’ in multi-dimensional space (using a model built by Google). But the most complicated model, and the most successful, as of this writing, is probably the GPT-2 model, developed by the OpenAI foundation. It was trained on webpages found from outward links on Reddit that Reddit users scored as highly useful, some 8 million websites. The sheer size of this data let OpenAI develop a model that contains 1.5 billion parameters. Because of the potential malicious uses of the model, OpenAI did not release it, but instead made available a series of smaller models: a ‘small’ model with 117 million parameters, and a ‘medium’ model with 355 million (there is a ‘large’ model available, but it is too large to run on a typical laptop).

Jesse Vig from the ‘Towards Data Science’ group blog has a post (2019) exploring how this model generates text. He begins by experimenting, trying to break the model (an ‘alchemical’ approach, as Rahimi 2017 laments), feeding it two distinct sentence prompts:

‘The dog on the ship ran’
‘The motor on the ship ran’

The model completes the first sentence with, “The dog on the ship ran off, and the dog was found by the crew.” It completes the second sentence with, “The motor on the ship ran at a speed of about 100 miles per hour.” The model understands the difference between dogs and motors, and the likely probabilities of different words completing the sentence. The model has what its creators call an attention mechanism, which focuses on the linguistic idea of the ‘head’ of the noun phrase. There are other linguistic patterns captured by the model, arranged 12 to a layer across 12 layers. It is this attention mechanism and the learning of linguistic patterns that accounts for this model’s success.

I fed the opening chapter of this book to talktotransformer.com, an interactive front-end for the GPT-2 model; the bolded text is what I wrote, the second paragraph is the machine’s:

This is a book about enchantment and digital archaeology. It is a book about trying to pull together the connective threads on nearly twenty years of work in simulation, agent modelling, video games, and Roman economic history. These are not, on the face of it, ‘enchanting’ topics. But hear me out. It’s about trying to remember what it was that was magical about archaeology when I first began, and why I find digital approaches to archaeology to still be magical.

In this book, I’d like to trace out some of the elements that underpin magical engagement with archaeology. I would like to show you some of the tools digital archaeology (D-arch), and computer archaeology have developed in order to be able to address specific aspects of magical engagement. But it isn’t hard to imagine that if you were an archaeologist working in the digital frontier of archaeology a lot of these new tools might not have occurred to you. To me, it’s about being able to be enchanted by the methods that a new medium has provided to the archaeological enterprise

I don’t know whether to be pleased or worried by this result. It could almost be the blurb on the back of this book’s cover. It gets a lot of things right.

Salesforce, a customer relationship management company, has built their own model called CTRL on top of GPT-2 and released it on Github (https://github.com/salesforce/ctrl). This model has 1.6 billion parameters “trained to condition on control codes that specify domain, subdomain, entities, relationships between entities, dates, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation” (Salesforce, 2019). Presumably, Salesforce is not as concerned about malicious weaponization of their model as OpenAI is.

This elaboration of the GPT-2 model can be prompted with, for instance, a URL (real or fake) and, told that the prompt text is a ‘link’, it will generate text informed by the metadata contained in the link (Woolf, 2019). For instance, using Woolf’s code for interacting with CTRL, I fed it the fake url, https://www.cnn.com/world/09/19/2019/evidence-for-atlantis-found-in-suburban-ottawa-backyard. This fake URL does not point to anything at CNN, but it does mimic the structure of a real CNN story. The words in the page title are understood by CTRL as controls on that 1.6-billion-parameter space, constraining the kind of language that would be appropriate, collapsing the probabilities onto the ‘news story’ dimensions. The result:

Evidence of Atlantis found in suburban Ottawa
MUST WATCH
Story highlights
The discovery was made by a man who lives on the property
It is believed to be one of only two known locations where ancient artifacts have been discovered
(CNN) A Canadian couple has uncovered evidence that could help solve an age-old mystery — and possibly prove Atlantis’ existence.
In what they say is the first time such a find has ever happened in Canada, a husband and wife from Ottawa, Ontario, stumbled upon a cairn…

I fed the same model fake URLs patterned after my university department’s faculty biography pages. I created two names, ‘Sarah O’Connor’ and ‘Stuart O’Connor’. Sarah’s biography was utterly glowing, but listed the many service organizations that Sarah belonged to, while Stuart’s biography put its energy into listing all of his books. The model understands the gendered and hidden labour of the academy it seems – should you play with talktotransformer.com and experiment with sexualized or racialized language, do not be surprised when it generates toxic or not-safe-for-work text, reflecting our current world view back at us (at least, current as represented by the internet in 2019).

The text that models like GPT-2 and CTRL can generate demonstrates an incredibly sophisticated understanding of how the English language works, and of the interrelationships of different concepts (seemingly without limit!) that might be the subject of conversation. GPT-2 and CTRL are not just language models, but knowledge models.

And somewhere in there is archaeology.

The experiment with using CTRL demonstrates that my thought experiment from earlier in this chapter, building a digital pseudo-Wheeler who I might want to interrogate about his worldview, is feasible now. What’s more, the CTRL codes show that it is possible to collapse the multiverse of possible generations to a local universe conditioned on a particular worldview. In neural network work on images, you might be familiar with the idea of photos that get ‘painted’ in the style of Monet or Picasso via ‘style-transfer’. This is the same idea, but with text.

I used Max Woolf’s code for retraining the GPT-2 ‘medium’ model (with 355 million parameters, 1.5 GB in size) on Flinders Petrie’s Ten Years’ Digging in Egypt 1881-1891, creating a model of Petrie’s language and – potentially – worldview (I would have used Wheeler, but Petrie’s writings are already available on the Project Gutenberg website, readily digitized.) Then, I ask this simPetrie some questions about archaeology. The dialogue below represents our ‘conversation’; each question by me was a prompt to the model to generate text. None of simPetrie’s responses have been edited. As you read simPetrie’s responses, reflect on what they (may) reveal about the doing of archaeology in the late 19th century. Given that Flinders Petrie is celebrated in archaeology for the development of both field and analytical methods, let’s ask his digital doppelgänger for its thoughts on the matter:

SG> “Professor Petrie, what is the key finding of your researches?”
SimPetrie> “The only way to find out what is the key to every mystery is to know everything about it, and then to try to discover the mystery as far as possible by deductions from what is known. When we come to the real problem, how do we find the real in the unreal?”
SG> “Professor Petrie, could you describe your methods for us?”
SimPetrie> “I have always tried to be fair to the student, and to give as much information as possible. I have tried to give as many facts as I can remember; and then try to match them with each other in the various accounts, so as to arrive at some general idea. People often ask me where the evidence for a fact comes from; and that is no easy matter, when the evidence is so scanty. I have often thought of writing a work on probability, in which every fact might be found, and the numbers by which they must be numbered, by heart, according to their weight in evidence. I believe I have found the right ratio; and the more I think of it, the more I see that it will not only be right, but will be of great service to our subject.
This method of ascertaining the amount of evidence is the only one that seems to me entirely consistent with scientific accuracy. It is easy enough to find the average of thousands of facts; but it is much harder to find the average of the facts themselves.”
SG> “Could you describe for us the role of local labour in your excavations and surveys?”
SimPetrie> “The local labourers, who live and work under me, are the most intelligent and hardy people I have met. They live by digging, and having to earn their living by selling the finished product to other local people who use it to make shoes, &c. Though they are often exploited in this business, they never trouble about it too much, and are content with what they can get for a good wage. Their only ambition is to live as cheaply as possible. The knowledge which I have gained from their labours has been of great service to me in my excavations, and I have always been glad when one of them has told me how he has improved his skill, or found a new construction, by improving the methods of his fellows. I should be sorry to see such methods go to waste for want of improvement and improvement of the methods of my neighbours, for the end in view is the same.”
SG> “Professor Petrie, what are the sorrows of archaeology?”
SimPetrie> “The sorrows of a dead man. I have seen the dead. All that is preserved is what is in their power to leave behind them. The remains of the temples, though they may be exposed, are of no use. They are only patches of stone, in a country where there are many other buildings. If the dead man had been a sculptor, he would have had no need of the temples; nor is it likely that any one could have made any use of them except by accident. The temples, then, were but a sad sacrifice to an unmeaning king.”

SimPetrie is startlingly coherent. The original Petrie’s concern for rigor and consistency in his archaeological recording and systemization of materials appears in the digital Petrie. The use/abuse of local labour is hinted at, with a touch of the colonial condescension we might expect from a 19th century Englishman. The overly florid (to our ears) language of the period comes through.

When we ask simPetrie about ‘his’ world, the neural network takes our question as a signal of how to collapse the possibilities for the generation of its response. Careful questioning and breaking could reveal the limits of that simulated world view. How does that worldview map back to the original Petrie’s? How far can it be pushed before it breaks? Much like an agent based model has to be run through all of its possible combinations of parameters to understand the simulated world view, the simulated history’s behaviour space, we have to figure out a method for doing the same for this neural network model. One way of doing this might perhaps be to deploy data mining and text analysis. I could imagine asking simPetrie the same question a thousand times at each ‘temperature’ or creativity setting between 0 and 1. Then, I would topic model (look for statistical patterns of co-occurrence of words in a response) these responses, and map how the discourses found therein persist or evolve over the creative space of simPetrie’s responses. That might begin to give us a map of the territory that we have stumbled upon. It will require much work and indeed play, experimentation, and the willful breaking of the models to expose the sharp edges.
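To make that proposal a little more concrete, here is a rough sketch of how the mapping might be done in R with the tm and topicmodels packages, assuming the generated responses had already been saved as plain-text files in folders named for their temperature setting (responses/0.2/, responses/0.5/, and so on). The folder layout, k = 10 topics, and everything else here are assumptions for illustration, not a worked method.

library(tm)
library(topicmodels)

# gather the (hypothetical) saved responses, one file per generated answer
files <- list.files("responses", pattern = "\\.txt$", recursive = TRUE, full.names = TRUE)
temps <- basename(dirname(files))   # the temperature setting, read off the folder name
texts <- vapply(files, function(f) paste(readLines(f), collapse = " "), character(1))

corpus <- VCorpus(VectorSource(texts))
dtm    <- DocumentTermMatrix(corpus, control = list(removePunctuation = TRUE, stopwords = TRUE))
model  <- LDA(dtm, k = 10, control = list(seed = 42))

# average topic proportions per temperature: a crude map of how the discourses
# persist or shift as simPetrie is allowed to be more 'creative'
props <- posterior(model)$topics
aggregate(as.data.frame(props), by = list(temperature = temps), FUN = mean)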

Some of the things we wish to play with, like the GPT-2 and CTRL models with their billions of parameters, are perhaps too big for enchantment? Is this where we spill from enchantment to terror? These models, after all, now that they’ve been generated (and consider that the energy and environmental cost of training such models is estimated to be five times worse than the lifetime emissions of a car, or approximately 626,000 pounds of carbon dioxide equivalent; Strubell et al 2019; Hao 2019), can now be deployed so easily that a single scholar on a commercial laptop can use them. The technology behind these models is not that far removed from the technologies that can simulate and generate perfect audio and perfect video of things that never happened or were never said, so-called ‘deepfakes’ (these too depend on neural network architectures). We will need to develop methods to deal with and identify when these models are deployed, and quickly. By the time this book is in your hands, there will be new models, larger models, of text generation, of language, and they will be deployed across a range of tasks. It will be exceedingly hard to spot the work written by the machine, versus that written by a human. Our golems are getting out of control. But there are other ethical issues, too.

The Ethics of Giving the Golems a Voice

“When we teach computers to write, the computers don’t replace us any more than pianos replace pianists—in a certain way, they become our pens, and we become more than writers. We become writers of writers.” – Goodwin 2016

“The hypothesis behind invisible writings was laughably complicated.  All books are tenuously connected through L-space and, therefore, the content of any book ever written or yet to be written may, in the right circumstances, be deduced from a sufficiently close study of books already in existence.  Future books exist in potentia, as it were…” Pratchett, The Last Continent

“How do we find the real in the unreal?” – simPetrie

In a world where computers can be creative on their own, ‘authorship’ is not about putting the words down on the page, and ‘scholarship’ is not necessarily about marshalling facts about the world in a logical order to make an argument. Instead, they become an act of creative composition and recomposition, of remixing and selecting texts for training and of tuning hyperparameters. It is in fact the same skills and techniques and scholarly work that inform the creation of agent based models. This kind of generative computational creative writing is not really about making a machine pass for a human, but, much like the agent based models discussed earlier in this volume, it is about discovering and mapping the full landscape of possibilities, the space within which Petrie could have written. These particular questions prompted the machine to collapse the possibility space around how archaeology was conducted, and whose voice mattered in that work; thus the results perhaps give us access to things that were so obvious they were never written down. What is the evidentiary status of a mapping of the behaviour space of the model? There could be a fascinating PhD thesis in that question. But this dialogue with simPetrie, for me, also raises some interesting ethical issues that so far in digital archaeology – led by the work of people like Meghan Dennis or Lorna Richardson or Colleen Morgan – we are only beginning to explore.

Tiffany Chan, for her MA thesis in English at the University of Victoria, used a recurrent neural network to map out the space of one particular author. She writes,

“[W]hat could we learn about our object of inquiry (in this case, literature) if we broke down, remade, and compared or interpreted it either alongside or as if it were the original? Articulated in Victorian terms, this project is like conducting a séance with a computer instead of a Ouija board. The computer mediates between human and machine, between the dead and the living. If, as Stephen Greenblatt suggests, literary study begins with “the desire to speak with the dead”… then [this project] begins by impelling the dead to speak.” (2017).

Colleen Morgan wrote, a decade ago, in the context of video games that use historical persons as non-player characters to decorate the games, “NPCs are nonhuman manifestations of a network of agents (polygons, “modern” humans, fiber-optics, and the dead person herself) and the relationships between these agents and as a result should be studied as such.  But does this understanding of an NPC as a network make it ethical to take such liberties with the visages of the dead? What does it mean when Joey Ramone comes back from the dead to sell Doc Martins?”

In these two passages, we find many of the threads of this book. We see ‘networks’ as both a literal series of connective technologies that thread the digital and analog worlds together. We see an impulse to raise the dead and ask them questions, and we see something of the ethical issues in making the dead speak. For instance, Petrie plainly did not say any of the things the simPetrie did in our dialogue. What if simPetrie had said something odious? It’s entirely possible that the model could extrapolate from hateful speech collected in its training corpus, triggered by passages in the small body of text of Petrie with which I perturbed the original.

What if that text gets taken out of context (an academic book or journal article) and is treated as if Petrie actually did say these things? In a conversation on Twitter about simPetrie, the computer scientist and sometimes archaeogamer John Aycock raised with me the issue of desecration: just as human remains can be desecrated and ill-used in the real world, could this use of computation be a kind of desecration of a person’s intellectual remains? Lorna Richardson points out that the creation of any kind of visualization of archaeological materials or narrative ‘is a conscious choice, as well as a political act’ (Richardson, 2018). If these models are the instrument through which I ‘play’ the past, as Goodwin (2016) suggests, then I am responsible for what collapses out of that possibility space. The ethical task would be to work out the ways the collapsing possibility space can do harm, and to whom.

The advertising and entertainment industries have the greatest experience so far with raising simulacra of dead celebrities to sell us things and to entertain us. Tupac Shakur raps on stage with Snoop Dogg, years after his death. Michael Jackson performs from beyond the grave at the Billboard Awards. Nat King Cole sings a duet with his daughter. Steve McQueen races a 2005 Ford Mustang. These uses of the dead, and their resurrection, are more troubling than portrayals of historical figures in films or video games, because of the aura of authenticity that they generate. Alexandra Sherlock argues that

“… The digital individual continues, irrelevant of the death of its author and prototype, and since the relationship that viewers have with this social entity was always conducted through representations and images anyway, nothing about this relationship actually changes… in popular culture the media persona becomes divorced from the actual embodied celebrity and their representations become a separate embodiment of their own – an embodiment with which people are able to identify and bond with in an authentic and real way.” (2013: 168).

These representations of dead celebrities worked because they depended upon, and continued to promote, para-social, one-sided relationships – the public was so used to the feeling of being connected with the idea of these individuals that their digital resurrection proved no obstacle, no barrier to enjoying the performance. Sherlock discusses an episode where the digital resurrection of a celebrity did go wrong – the resurrection of Orville Redenbacher, of popcorn fame: “Rather than promoting the enchanting notion of immortality, Redenbacher’s advertising agency had accidentally and rather embarrassingly reminded viewers of the mortality of Redenbacher, and themselves by extension” (170). The advertisement fell into the uncanny valley, the term from robotics that describes when a robot is so human-like that the few errors in the depiction (lifeless eyes, for instance) generate a feeling of creepiness.

Sherlock calls this entire process of using the images of entertainers, whether as holograms or on film, ‘digital necromancy’, and attributes some of the success (or failures) to the idea that, in addition to profiting from a para-social relationship, the revenants fill a need for answers, a need for reassurance in the face of death, given that Western culture largely avoids talking about death:

“…a form of necromancy does exist today, precisely in response to the marginalization of death. One might perhaps consider the technicians who created the Bob Monkhouse advertisement [where the comedian tells the audience about his own death from cancer] as modern necromancers – reanimating the digital remains of the deceased Monkhouse to impart his knowledge concerning his own death. It is as though the ancient art of necromancy has resurfaced in the practice of digital resurrection.” (171).

All of which is to say: simPetrie could become ‘real’ in the same way the personas of entertainers and celebrities become ‘real’, and the views and opinions expressed by the digital doppelgänger could be given far more weight than is warranted. “Subconsciously, their appearances may appeal to embedded beliefs that the dead are wise and knowledgeable: if they speak or show themselves to us, we should pay attention. Somehow the dead seem more believable.” (172)

When 2K Games, publishers of the Civilization series, included the Cree chief Pîhtokahanapiwiyin (Poundmaker) as one of the playable leader characters in the game’s sixth iteration, they put words in his mouth. Milton Tootoosis of the modern Poundmaker First Nation said, “[This representation] perpetuates this myth that First Nations had similar values that the colonial culture has, and that is one of conquering other peoples and accessing their land… That is totally not in concert with our traditional ways and world view.” (Chalk, 2018). While the depiction and the lack of consultation with the Poundmaker First Nation are troubling enough on their own, imagine if the game character of Pîhtokahanapiwiyin were coded in the way simPetrie was, and imagine further that the developers did not consult with the Cree on which texts to use for training – or on whether to do this at all.

The danger of the neural-network-powered representation is in its liveliness, in the possibility of fostering the kind of para-social bonds that make the examples drawn from the advertising and entertainment worlds work. A neural-network-powered representation of a key figure in Cree history would run the risk of becoming the version of Pîhtokahanapiwiyin that sticks; who builds and designs such a representation, and to what end, matters. This neural network approach to giving voice to a video game’s non-player characters, or to an agent-based simulation’s agents, is exceedingly powerful. If we are building simulations of the past, whether through archaeogaming or agent-based modeling, we either need to make our software agents mere ciphers for actual humans, or we need to think through the ethics of consultation, of representation, and of permission in a much deeper way. The technology is racing ahead of our ability to think through its potential harms.

There is also the ethical issue of the creation of the training data for GPT-2 in the first place, the creation of the possibility space. The authors of those 8 million webpages obviously never consented to being part of GPT-2; the material was simply taken (a kind of digital colonialism/terra nullius). The use of Reddit as a starting place, relying on Reddit users’ selection of ‘useful’ sites (by the awarding of ‘karma’ points of 3 or more to a link), does not take into account the demographics of the Reddit user community/communities. The things that white men aged 18-35 living in a technophilic West see as interesting or valuable may not be the kind of possibility-space that we really want to start baking into the artificial intelligences powering the world. Taking a page from information ethics, Sicart (2009) argues in the context of video games that permitting meaningful choices within a game situation is the correct ethical stance; where are the meaningful choices for someone like me who ‘plays’ the GPT-2 model, or for someone whose website may be somewhere inside the model?

A framework for considering the myriad ethical issues that might percolate out of this way of raising the dead and giving them a voice again might be the ‘informational ethics’ of Floridi and Sanders, as interpreted by Sicart from the perspective of video games. This perspective considers ‘beings’ in terms of their data properties. Data properties are the properties of relationships and the contingent situation of a thing. That is to say, what makes the rock on my desk a paperweight rather than merely debris is its relationship to me, our past history of a walk on the beach and the act of me picking the rock up, and the proper ways of using objects for holding down papers on desks (Sicart 2009, 246, citing Floridi 2003). Compare this with Ingold’s ‘material against materiality’, where he invites you to pick up a stone, wet it, and then come back to it a short while later:

“[…]the stone has changed as it has dried out. Stoniness, then, is not in the stone’s ‘nature’, in its materiality. Nor is it merely in the mind of the observer or practitioner. Rather, it emerges through the stone’s involvement in its total surroundings – including you, the observer – and from the manifold ways in which it is engaged in the currents of the lifeworld. The properties of materials, in short, are not attributes but histories.” (Ingold 2007, 15)

The meaning of data entities lies within the web of relationships with other data entities, and all things, whether biological or digital, are data entities (Sicart 2009, 128-130; Morgan 2009). From this perspective there is moral import, because to reduce information complexity is to cause damage: “information ethics considers moral actions an information process” (Sicart 2009, 130). The information processes that give birth to simPetrie, that abstract information out of GPT-2, that collapse the parameter space to one local universe out of its multiverses, are all moral actions. For instance, these language models and these neural network technologies are predicated on an English model of the world and an English approach to language. Models like GPT-2 obtain part of their power through their inscrutability. Foucault (1999: 222) wondered what an ‘author’ might be, and concluded that it emerges in the condensation of physical and cultural influences; ‘the author function’ might one day disappear, so that discourse would instead be experienced through questions such as:

“What are the modes of existence of this discourse? Where has it been used, how can it circulate, and who can appropriate it for himself? What are the places in it where there is room for possible subjects? […] What difference does it make who is speaking?”

That is the ethical question posed by archaeogaming, because the ‘who’ isn’t just humans anymore.

“An Open Access Oops?” – my #patc4 source

“An Open Access Oops?”

Abstract:
I generally believe that making my research and my results open access is a moral imperative. But recently, certain events in the reception of our research on the trade in human remains online have made me wonder if there are situations where the greater good is served by _not_ making our work openly available. In this piece, I recount what happened and reflect on the contexts of archaeological openness.

Delivered: 15 minutes/ 45 secs per tweet. Below is the text I pasted into the ‘what’s new?’ box as fast as I could go. Turns out you can’t schedule a thread in tweetdeck; or if you can, I couldn’t figure it out.

Hi folks, I’m Shawn Graham; I’m a prof in the history dept @Carleton_U . Somewhere along the way I became a digital archaeologist. My #patc4 paper is “An Open Access Oops”.

Lemme tell you a little story & let me ask some little questions. /1

[gif House saying oops ]

Firstly, I became a digital archaeologist from necessity. If people shared data, I cld pretend to myself that I was ‘doing’ archae! Open access was a lifeline. Playing, exploring, & building from other people’s data allowed me to re-invent myself /2 #patc4

I’ve always felt then, aside from all the other arguments for open access, there was a moral imperative to pay it back. Right? I had benefited; now that I’m in a position to do it, I need to get my materials out there, in remembrance of the lost post-phd guy I was. /3 #patc4

Fast-forward. I never set out to study the trade in human remains http://bonetrade.github.io. But here I am, & we’ve been publishing in OA journals, making code and data freely available… Good, right? Well… here’s what happened. Let’s air what feels like a fail. /4 #PATC4

[gif ‘fail’ krusty, judges 0]

In january, the faculty did a piece on our ‘Bone Trade’ project (@damien_huffer) (here: https://m.carleton.ca/fass/story/innovative-historian-studies-the-sale-of-human-remains-on-the-internet/).

This summer, a local journalist wanted to talk to me about the project; the story was published here: https://ottawacitizen.com/news/local-news/carleton-prof-harnesses-machine-learning-to-explore-the-bone-trade-netherworld /5 #PATC4


So far, so good! Everyone wants their research to attract some attention, right? The Citizen is part of the Postmedia group, so the story got taken up by various papers across Canada.

Then a political candidate bought a human skull as a gift for her boyfriend. /6 #PATC4

[oh no < – kermit gif]


APTN, Aboriginal Peoples Television Network, broke the story and asked me for comment, having seen the other newspaper article. The APTN story was taken up by lots of other outlets, including Newsweek. Suddenly, there were interview requests everywhere /7 #PATC4

Our work made it into Wired (the politician did not) https://www.wired.co.uk/article/instagram-skull-trade . But, in trying to be ‘balanced’, it seems, the story included interviews w collectors. And they made the editorial decision to embed _in the story_ posts from Instagram selling human remains /8 #PATC4

The story was picked up and re-worked across multiple outlets. Here’s the Sun’s attempt https://www.thesun.co.uk/tech/9542441/human-remains-for-sale-instagram-black-market/. We’ve been erased from the research, and the nuance we try for in our work is lost. But the collectors are getting a lot of oxygen! /9 #patc4

A number of outlets contacted us, for interviews (including BBC), requesting that we also put them in touch with collectors. I refused to do this. If we were studying sex trafficking, would you ask us to put you in touch with pimps? /10 #patc4

[gif why monkey]

I know this is not a particularly egregious case; there are far worse out there. But we know that buyers/sellers of human remains are reading our work and adapting accordingly. With the press attention, and the celebration of the ‘eccentric’ collectors, + /11 #patc4

how much traffic have we driven to collectors? to what degree have we helped promote the trade we are studying? how have we changed their behaviour to _enhance_ their ability to trade without prying eyes? /12 #patc4

These human remains were collected in morally, ethically, legally dubious circumstances. To reduce them to clickbait is to return us to the era of ‘human zoos’. How many times will these people be dehumanized? But… we published OA. We put our material out there. /12 #patc4

It’s our fault, right? Publishing the work needs to be done openly, I thought, given how these remains were collected in secret in the first place (eg https://www.academia.edu/14663044/Harlan_I._Smiths_Jesup_Fieldwork_on_the_Northwest_Coast p154). Sunlight, disinfectant?

Maybe I was wrong. /13 #PATC4

But hiding the work behind paywalls is wrong, too. Publicly funded work should be accessible by the public (which publics, SG?). We didn’t conceive the project as ‘public archae’, but if we had we would not have gotten into this mess of inadvertently promoting sellers. /14 #PATC4

A month or two later, I return to scraping Instagram, and I notice new figures active, old figures gone, & maybe the internet’s short attention span has taken care of the situation. Maybe I worry too much. But is this a case where OA is the wrong approach? /15 #patc4

Or is the error: the attracting of attention, drawing the eye of a media ecosystem addicted to both-sides-ism, an ecosystem addled by ‘engagement’ mechanics predicated on outrage? /16 #patc4

[eye of sauron]

I know I conceived this project without thinking about how, if you study things online, things online have a way of pushing back. In which case, I decided to talk about it here at #patc4, so that I can learn from wiser heads. /17

The human remains trade in its origins is part of the literal flow of human bodies from around the world into the West. As @priscillaulguim reminds us https://twitter.com/priscillaulguim/status/1169382105547202561 OA assumes I have the right to share; but not always true & the contexts are complex. /18 #PATC4

I am also from the global north, the consumer of these bodies, of these data. Unthinking OA (as @priscillaulguim alluded to last night https://twitter.com/priscillaulguim/status/1169382281485701127) allows me to profit academically from these bodies one more time. /19 #patc4

Before I was a prof, OA let me play at being an archaeologist. Now on the other side, I want to get my research out there: but naive OA, especially in archaeology, is not without its risks, as this summer has demonstrated. I need to do better. /fin #patc4

[screenshot of the thing below]

PS One more thing- The one seller, who got progressively higher and higher profile in the news stories? IG deleted his account. His webstore remains, but he’s rebuilding on Instagram. The internet makes Red Queens of us all. https://en.wikipedia.org/wiki/Red_Queen_hypothesis /really fin

[pic!]

quick visualization of tags – notes using sublime, zettelkasten, gephi, and bash

So you take your notes following the Zettelkasten method, do you? One thought per card? Cool. I was never taught how to take good notes, and I still struggle with it. Rene Schallner’s zk-sublime suits the way I like to work these days, in a text editor. I end up with a lovely folder filled with markdown notes that have internal links, tag searching, ‘friends’ searching… it’s great. As long as I’m using Sublime Text 3 (which is no chore).
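
(For concreteness, a hypothetical note in that folder might look something like the following – the filename and wording are invented, but what matters for the scripts further down is the digit-prefixed filename and a line containing ‘tags’:)

A file called 202001151030 Roman concrete.md containing:

tags: #materials #rome

Vitruvius on pozzolana; see also the note at [[202001151045]].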

Anyway, I was thinking to myself that it would be nice to feed the notes into a static site generator to make a nice online version that other folks could peruse. This would require converting all of the internal links to markdown links and, if I were using Jekyll etc., adding the right kind of metadata to every post. I cheated, and tried to use mdwiki, a no-longer-actively-maintained project that turns a folder into a site with the addition of a single html file (containing all of the necessary js and so on). I spent a lot of time on that; here’s a bash script that turns the directory listing of my note folder into an index.md that mdwiki can use:


#!/bin/bash
# A sample Bash script to turn the contents of a directory
# into a md file with filenames as md links


# put the directory contents into a file
echo "creating toc"
ls > index.md

# put the brackets around the line
echo "beginning line formatting"
sed -i '.bak' 's/^/[/' index.md
sed -i '.bak' 's/$/]/' index.md

# duplicate the line

sed -i '.bak' -E 's/^(.*)/\1\1/' index.md

# now to convert the SECOND [ and ] to ( )

sed -i '.bak' 's/\[/\(/2' index.md
sed -i '.bak' 's/\]/\)/2' index.md

# and this bit was the start of me trying to create a unique page for each
# tag, which eventually would end up listing all relevant
# note pages. I got the files made, at any rate; nothing in 'em yet.

grep tags *.md -R > tags.md

sed -i '.bak' 's/#/ /g' tags.md
sed -E 's/([0-9]+.)([A-Za-z ]+.)(md:tags:)//g' tags.md | tr ' ' '\n' > tags2.md
sed -i '.bak' '/^[[:space:]]*$/d' tags2.md
cat tags2.md | xargs touch
rm tags2.md
echo "done"

which was fine, but meh.
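
(One loose end: the per-tag pages that the last part of that script creates are still empty. If I ever come back to it, a sketch of filling them in might look like the following – this is not part of the original script, just the direction I was headed, and it assumes tags appear in the notes as #tagname and that the tag pages are the extensionless files touched above.)

# sketch only: fill each per-tag page with a markdown list of the
# notes that mention that tag
for tagfile in *; do
  # only plain, extensionless files are tag pages; skip everything else
  [ -f "$tagfile" ] || continue
  case "$tagfile" in *.*) continue;; esac
  echo "# $tagfile" > "$tagfile"
  # every note containing #tagname gets listed in the tag page
  grep -l "#$tagfile" *.md | sed 's/^/- /' >> "$tagfile"
done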

So I abandoned that, after so.many.hours. I started focusing on the tags instead, figuring that I could at least get a visualization of how my notes interconnect. Every note has ‘tags’ in the metadata, so a-grepping we go:


# grab the 'tags' line from every note
grep tags *.md -R > tags.md
# wrap the filename in quotes followed by a comma; the tags stay on the same line
sed -E 's/([0-9]+.)([A-Za-z ]+.)(md:tags:)/"\1 \2",/g' tags.md > net.csv

This gives me two columns: a file name in quotation marks, and the relevant tags. I cheat and use find-and-replace in Excel on the second column to replace spaces with semi-colons. This I can then open in Gephi, selecting ‘adjacency’ and ‘semi-colon’, and boom: a nice visual depiction of how my notes inter-connect.
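
(If you’d rather skip the Excel detour, something like the following would do the same job from the command line – a sketch only, and it assumes each line of net.csv looks like "1234 note title.", #tag1 #tag2.)

# sketch: replace the Excel find-and-replace step, turning net.csv into
# a fully semi-colon separated adjacency list for gephi
awk -F'",' '{
  node = $1 "\""                      # restore the closing quote on the filename
  tags = $2
  gsub(/#/, "", tags)                 # drop the hash marks
  gsub(/^[ \t]+|[ \t]+$/, "", tags)   # trim leading and trailing whitespace
  gsub(/[ \t]+/, ";", tags)           # spaces become semi-colons
  print node ";" tags
}' net.csv > adjacency.csv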

First part of the day: several hours. Second part: 30 minutes. Sigh.

SimRomanCity

Ever since I first read about the original SimCity source code being open sourced as Micropolis (play here), I have wanted to build a course around using that code to simulate a Roman city. Students would keep open notebooks and devlogs, and together, we’d build our simulation.

To start we would spend a few weeks looking at the literature, the archaeology, and the scholarship surrounding ideas of the ancient Roman city, and from these, develop an idea of what kinds of things one would want to have in a simulation – and what kinds of questions a simulation might answer, or lessons it might teach. This would take us about four or five weeks.

SimCity has had enormous influence in games and beyond, and in many ways our everyday thinking about how cities work can be traced back to the way SimCity modeled urban systems. I would have the students look into the history of SimCity and Will Wright’s influences, and discuss what that might mean for how we understand ancient cities, and how the study of the ancient city is entangled with these particular models of modern, Western cities that SimCity represents.

The second half of the course is where things’d get really interesting. We’d take those paper designs and that understanding of SimCity-as-an-artefact and we’d build. We’d take the source code, and try to modify it to model an ancient Roman city. Is this possible? What assumptions about the ways cities work are hard-baked into the ‘SimCity’ framework ab initio? If we can just change the skin of the game, its sprites and graphics, and come up with something that functions how we imagine ancient cities did, what does this say about our ideas of the past? Maybe we’d find that some of our ideas about the past are not as true as we perhaps thought. This might be a case where we could expect failure, but that would be ok, because then we could spend a few weeks on the why and how of that failure and what it tells us about the consequences of the influence of SimCity.

But alas, when I look at the Micropolis source code, I am stymied. I have no idea how to even begin. I shelved the idea.

But recently, I came across a port of the game, still in development, by Graeme McCutcheon. His port (works best in Chrome) translates the game to js/html5. And when I look at the code, it seems fairly intelligible!

So now it’s just a matter of figuring out how to build the game from his source code. After much farting around, I figured out more or less what one has to do.

1. Fork his repo.

2. Clone it to your machine.

3. Get nodejs

4. Open the micropolisjs folder in your terminal, and install the dependencies listed in the package.json file with npm install

5. You can start it up right away with npm run-script start and then go to localhost:8080 in your browser.

The various scripts and models that make up the game’s simulation are in the src folder; edit these, then use npm run-script build. And of course, all the sprites and graphics could be altered in any graphics program.
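
Pulling those steps together, the whole loop looks something like this (a sketch – swap in your own fork’s URL for the placeholder):

# clone your fork of Graeme McCutcheon's micropolisJS port (placeholder URL)
git clone https://github.com/your-username/micropolisJS.git
cd micropolisJS

# install the dependencies listed in package.json
npm install

# start it up, then visit localhost:8080 in your browser
npm run-script start

# after editing the simulation code in the src folder, rebuild
npm run-script build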

It would be a steep learning curve, but since we’d do this as a class, I think every student could find a role through which to contribute. Anyway, I’m off now to design a Roman tileset.

Don’t buy human remains

I was interviewed by Kristy Cameron for the Evan Solomon Show (radio) today. It was about my perspective on this story about a federal election candidate who bought a human skull as a gift for her boyfriend. Short answer:

Don’t buy human remains.

In anticipation of the interview, I wrote some notes about what I wanted to say, which I’m pasting here below:

What are the ethical issues?

– there are several ethical problems with giving a skull as a gift, and they circle around what a skull is, and where these remains come from, and how they come to be traded:

1. the skull was a human person. Trading skulls reduces people to mere things.
2. many of these skulls are on the market largely as a result of white people collecting non-white people, robbing graves, collecting the bones of slaves, of prisoners, for the purposes of ‘scientific racism’, of proving the superiority of one race over another.
3. Even skulls from ‘European’ sources: did they consent? Of course not.
4. a skull is not a ‘thing’, it is a person: for many Indigenous groups, from whose members many human remains were stolen, for remains not to be buried and accorded the respect and dignity appropriate to the group is a continuing harm to that group.
5. the skull has no archaeological context – the exact knowledge of the conditions of burial, the other objects or scientific information that allows us to work out the meaning of objects from the past – so the trade destroys knowledge about the past
6. from what we (Damien Huffer & I) can see in the photograph, there are some indications that make us suspicious about how this skull came to be on the market. For one thing, there looks to still be dirt on it. The skull itself seems to be flaking, which can be caused by alternating wet/dry or freeze/thaw conditions. There is also a chip on the skull that looks quite recent and doesn’t look like it was caused by an animal or natural causes; my first thought is maybe a pick or tool, as there also look to be root marks on the skull. So, given the photograph, we think there’s reason to be concerned that this skull might only have recently been dug up. We have seen videos on Facebook of recent graves being opened. Ms. Rattée says she has documentation that it is European in origin, but that’s no guarantee.

How are they sold?

– these are bought and sold on instagram, facebook, and other social media marketplaces. Skulls were bought and sold through shops long before social media, but social media increases the reach and size of the market. Facebook of course makes money from ads served alongside these posts, so it’s in FB’s interests to facilitate the reach and ‘engagement’ with the posts.

What are my thoughts on the situation?

– it is not illegal to buy and sell human remains in Canada, but I feel it ought to be, simply by virtue of the fact that we owe it to our fellow Canadians, Indigenous Canadians, to try to right some of the wrongs we have done in the name of ‘science’. Harlan Smith, the ‘father of BC archaeology’, robbed graves in the 19th century and sent the remains to New York to go into a museum. He knew what he was doing was wrong: there’s no excuse. Social media makes human remains into entertainment. If a potential politician sees no problem with buying and selling a dead human, that does not speak well to their judgement regarding living humans.

– as far as using the skull as a model: a resin cast is surely a good enough model for drawing skulls on skin.

(featured image: israel palacio on unsplash)