Open Notebooks Part IV – autogenerating a table of contents

I’ve got MDWiki installed as the public face of my open notebook.

Getting it installed was easy, but I made it hard, and so I’ll have to collect my thoughts and remember exactly what I did… but, as I recall, it was this bit I found in the documentation that got me going:

First off, create a new (empty) repository on GitHub, then:

git clone https://github.com/exalted/mdwiki-seed.git
cd mdwiki-seed
git remote add foobar <HTTPS/SSH Clone URL of the New Repository>
git push foobar gh-pages
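
Once the seed is pushed, publishing day-to-day updates is just an ordinary commit and push on the gh-pages branch, whether through the GitHub app or, roughly, from the command line (the local path and commit message below are placeholders):

cd ~/path-to/mdwiki-seed        # local clone of the notebook repository
git checkout gh-pages           # MDWiki serves everything from this branch
git add .
git commit -m "update notebook"
git push foobar gh-pages        # 'foobar' is the remote added above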

 

Then, I just had to remember to edit the ‘gh-pages’ branch. Also, on github, if you click on ‘settings’, it’ll give you the .io version of your page, which is the pretty bit. So, I updated robot 3 to push to the ‘uploads/documents’ folder. Hooray! But what I needed was a self-updating ‘table of contents’. Here’s how I did that.

In the .md file that describes a particular project (which goes in the 'pages' folder), I have a heading 'Current Notes' and a link to a file, contents.md, like so:

## [Current Notes](uploads/documents/contents.md)

Now I just train a robot to always make an updated contents.md file that gets pushed by robot 3.

I initially tried building this into robot 2 (‘convert-rtf-to-md’), but I outfoxed myself too many times. So I inserted a new robot into my flow between 2 & 3. Call it 2.5, ‘Create-toc’:

Screen Shot 2014-09-24 at 9.40.16 PM

It’s just a shell script:

cd ~/Documents/conversion-folder/Draft
ls *.md > nolinkcontents.md
sed -E -n 's/(^.*[0-9].*$)/ \* [\1](\1)/gpw contents.md' nolinkcontents.md 
rm nolinkcontents.md

Or, in human: go to the conversion folder. List out all the newly-created md files and write that to a file called ‘nolinkcontents.md’. Then, wrap markdown links around each line, and use each line as the text of the link, and call that ‘contents.md’. Then remove the first file.

Ladies and gentlemen, this has taken me the better part of four hours.

Anyway, this ‘contents.md’ file gets pushed to github, and since my project description page always links to it, we’re golden.

Of course, I realize now that I’ll have to modify things slightly, structurally and in my nomenclature, once I start pushing more than one project’s notes to the notebook. But that’s a task for another night.

Now to lesson plan for tomorrow.

(update: when I first posted this, I kept saying robot 4. Robot 4 is my take-out-the-trash robot, which cleans out the conversion folder, in readiness for the next time. I actually meant Robot 3. See Part III)

Open Notebooks Part III

Do my bidding my robots!

I've sussed the Scrivener syncing issue by moving the conversion process out of the syncing folder (remember, not the actual project folder, but the 'sync to external folder'). I've then created four Automator applications to push my stuff to github in lovely markdown. Another thing I've learned today: when writing in Scrivener, just keep your formatting simple. Don't use markdown syntax within Scrivener, or your stuff on github will end up looking like this: \##second-heading. I mean, it's still legible, but not as legible as we'd like.

So – I have four robots. I write in Scrivener, keep my notes, close the session, whereupon it syncs rtf to the ‘external folder’ (in this case, my dropbox folder for this purpose; again, not the actual scrivener project folder).

  1. I hit robot 1 on my desktop. Right now, this is called ‘abm-project-move-to-conversion-folder’. When I have a new project, I just open this application in Automator, and change the source directory to that project’s Scrivener external syncing folder. It grabs everything out of that folder, and copies it into a ‘conversion-folder’ that lives on my machine.
  2. I hit robot 2, 'convert-rtf-to-md', which opens 'conversion-folder' and turns everything it finds into markdown. The conversion scripts live in the 'conversion-folder'; the things to be converted live in a subfolder, conversion-folder/draft.
  3. I hit robot 3, ‘push-converted-files-to-github-repo’. This grabs just the markdown files, and copies them into my local github repository for the project. When I have a new project, I’d have to change this application to point to the new folder. This also overwrites anything with the same file name.
  4. I hit robot 4, 'clean-conversion-folder', which moves everything (rtfs, mds) to the trash. This is necessary because otherwise I can end up with duplicates of files I haven't actually modified getting through my pipeline onto my github page. (If you look at some of my experiments on github, you'll see the same card a number of times, with 1…2…3…4 versions.) A rough shell sketch of the whole chain follows below.
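
Strung together as one shell script, the whole chain would look roughly like this (the paths are placeholders for whatever the current project uses; the real robots are Automator apps, and robot 4 sends things to the trash rather than deleting them outright):

#!/bin/bash
# rough shell equivalent of the four robots; all paths are placeholders
SYNC=~/Dropbox/abm-project-sync      # Scrivener's 'sync to external folder'
CONV=~/Documents/conversion-folder   # holds the conversion scripts and the Draft subfolder
REPO=~/github/abm-project            # local clone of the project's github repo
cp -R "$SYNC"/. "$CONV"/             # robot 1: copy everything into the conversion folder
cd "$CONV" && ./rtf2md Draft/*.rtf   # robot 2: convert the rtfs to markdown
cp -f Draft/*.md "$REPO"/            # robot 3: copy just the md files into the repo, overwriting
rm -f Draft/*.rtf Draft/*.md         # robot 4: clear out the conversion folder for next time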

Maybe it's possible to create a meta-automator that strings those four robots into one. I'll try that someday.
[pause]
Ok, so of course, I tried stringing them just now. And it didn’t work. So I put that automator into the trash –
[pause]
and now my original four robots give me errors: 'the application …. can't be opened. -1712'. I found the solution here (basically, go to Spotlight, type in 'activity' to bring up Activity Monitor, then locate the application in the list and quit it).

Here are my automators:

Robot 1

Robot 2

Robot 3

Robot 4

Automator….

I think I love you.

 

An Open Research Notebook Workflow with Scrivener and Github Part 2: Now With Dillinger.io!

A couple of updates:

First Item

The four scripts that sparkygetsthegirl crafted allow him to

1. write in Scrivener,

2. sync to a Dropbox folder,

3. convert to md,

4. then open those md files on an Android tablet to write/edit/add,

5. and then reconvert to rtf for syncing back into Scrivener.

Screen Shot 2014-09-19 at 2.24.27 PM

I wondered to myself, what about some of the online markdown editors? Dillinger.io can scan Dropbox for md files. So, I went to Dillinger.io, linked it to my dropbox, scanned for md files, and lo! I found my project notes. So if the syncing folder is shared with other users, they can edit the notecards via Dillinger. Cool, eh? Not everyone has a native app for editing, so they can just point their device's browser at the website. I'm sure there are more options out there.

Second Item

I was getting syncing errors because I wasn’t flipping the md back to rtf.

But, one caveat: when I went to run the md to rtf script, to get my changes back into Scrivener (and then sync), things seemed to go very wonky indeed. One card was now blank, and the others were full of Scrivener's markup that Scrivener itself wasn't recognizing.

So I think the problem is me doing things out of order. I continue to play.

Third Item

I automated the running of the conversion scripts. You can see my Automator setup in the screenshot below. Again, I saved it as an application on my desktop. The first step is to grab the right folder; the second, to open the terminal, input the commands, then close the terminal.

Screen Shot 2014-09-19 at 2.36.03 PM
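
For the record, the commands that get typed into the terminal are just the conversion calls from sparkygetsthegirl's scripts, run from the synced folder (the path below is a placeholder):

cd ~/Dropbox/scrivener-sync   # wherever the 'sync to external folder' lives
./rtf2md Draft/*.rtf
./rtf2md Notes/*.rtf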

Postscript

I was asked why on earth I would want to share my research notes. Many, many reasons – see Caleb McDaniel's post, for instance – but one other feature is that, because I'm doing this on Github, a person could fork (copy) my entire research archive. They could then build upon it. Github keeps track of who forks what, so forking becomes a kind of mass citation and breadcrumb trail showing who had an idea first. Moreover, github code (or in this case, my research archive) can be archived on figshare too, thus giving it a unique DOI *and* proper digital archiving in multiple locations. Kinda neat, eh?

An Open Research Notebook Workflow with Scrivener and Github

I like Scrivener. I *really* like being able to have my research and my writing in the same place, and most of all, I like being able to re-arrange the cards until I start to see the ideas fall into place.

I’m a bit of a visual learner, I suppose. (Which makes it ironic that I so rarely provide screenshots here. But I digress). What I’ve been looking for is a way to share my research, my lab notes, my digital ephemera in a single notebook. Lots of examples are out there, but another criterion is that I need to be able to set something up that my students might possibly be able to replicate.

So my requirements:

1. Visually see my notes, their layout, their possible logical connections. The ability to rearrange my notes provides the framework for my later written outputs.

2. Get my notes (but not all of the other bits and pieces) onto the web in such a way that each note becomes a citable object, with revision history freely available.

3. Ideally, that could then feed into some sort of shiny interface for others' browsing – something like Jekyll, I guess – but not really a big deal at the moment.

So #1 is taken care of with Scrivener. Number 2? I'm thinking Github. Number 3? We'll worry about that some other day. There are Scrivener project templates that can be dropped into a Github repository (see previous post). You would create a folder/repo on your computer, drop the template into that, and write away to your heart's content, committing and syncing at the end of the day. This is what you'd get. All those slashes and curly brackets tell Scrivener what's going on, but it's not all that nice to read. (After all, that solution is about revision history, not open notebooks.)
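
The committing and syncing at day's end is nothing exotic; from the command line (rather than the GitHub app) it would look something like this, with the folder and remote names as placeholders:

cd ~/github/my-scrivener-project   # the folder/repo holding the Scrivener project
git add .
git commit -m "end of day notes"
git push origin master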

Now, it is possible to manually compile your whole document, or bits at a time, into markdown files and to commit/sync those. That's nice, but time-consuming. What I think I need is some way to turn Scrivener's rtfs into nice markdown. I found this, a collection of scripts by Sparkygetsthegirl as part of a Scrivener-to-Android-tablet-and-back writing flow. Check it out! Here's how it works. NB, this is all Mac-based, today.

1. Make a new Scrivener project.

2. Sync it to dropbox. (which is nice: backups, portability via Dropbox, sharing via Github! see below)

3. Drop the 4 scripts into the synced folder. Open a terminal window there. We'll come back to that.

4. Open Automator. What we're going to do is create an application that will open the 'drafts' folder in the synced project, grab everything, then filter for just the markdown files we made, then move them over to our github repo, overwriting any pre-existing files there. Here's a screenshot of what that application looks like in the Automator editing screen:

Remember, you're creating an 'application', not a 'workflow'

You drag the drafts folder into the ‘Get specified finder items’ box, get the folder contents, filter for files with file extension .md, and then copy to your github repo. Tick off the overwrite checkbox.
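
In shell terms, that Automator application amounts to a single copy-with-overwrite; something like this, with the paths as placeholders:

cp -f ~/Dropbox/scrivener-sync/Draft/*.md ~/github/my-project-repo/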

Back in Scrivener, you start to write.

Write write write.

Here’s a screenshot of how I’m setting up a new project.

Screen Shot 2014-09-17 at 1.50.14 PM

In this screenshot, I’ve already moved my notecards from ‘research’ into ‘draft’. In a final compile, I’d edit things heavily, add bits and pieces to connect the thoughts, shuffle them around, etc. But right now, you can see one main card that identifies the project and the pertinent information surrounding it (like for instance, when I’m supposed to have this thing done). I can compile just that card into multimarkdown, and save it directly to the github repository as readme.md.

Now the day is done, I’m finished writing/researching/playing. I sync the project one last time. Then, in the terminal window, I can type

./rtf2md Draft/*.rtf

for everything in the draft folder, and

./rtf2md Notes/*.rtf

for everything in the notes folder. Mirabile dictu, the resulting md files will have the title of the notecard as their file name!

Screen Shot 2014-09-17 at 1.56.06 PM

Here, I’ve used some basic citation info as the name for each card; a better idea might be to include tags in there too. Hey, this is all still improv theatre.

Now, when I created that application using Automator, I saved it to my desktop. I double-click on it, and it strains out the md files and moves them over to my github repository. I then commit & sync, and I now have an open lab notebook on the web. There are still some glitches, though; the markdown syntax I wrote in Scrivener isn't being recognized on github, because I think Scrivener is adding backslashes here and there, which act like escape characters.

Anyway, this seems a promising start. When I do further analysis in R, or build a model in Netlogo, I can record my observations this way, create an R notebook with knitr or a netlogo applet, and push these into subfolders in this repo. Thus the whole thing will stick together.

I think this works.

~o~
Update Sept 18. I’ve discovered that I might have messed something up with my syncing. It could be I’ve just done something foolish locally or it might be something with my workflow. I’m investigating, but the upshot is, I got an error when I synced and a new folder called ‘Trashed Files’, and well, I think I’m close to my ideal setup, but there’s still something wonky. Stay tuned.

Update Sept 19 Don’t write in Scrivener using markdown syntax! I had a ‘doh’ moment. Write in Scrivener using bold, italics, bullets, etc to mark up your text. Then, when the script converts to markdown, it’ll format it correctly – which means that github will render it more or less correctly, making your notes a whole lot easier to read. Click on ‘raw’ on this page to see what I mean!

Open Notebooks

This post is more a reminder to me than anything you'd like to read, but anyway –

I want to make my research more open, more reproducible, and more accessible. I work from several locations, so I want to have all my stuff easily to hand. I work on a Mac (sometimes) a PC (sometimes) and on Linux (rarely, but it happens; with new goodies from Bill Turkel et al I might work more there!).

I build models in Netlogo. I do text analysis in R. I visualize and analyze with things like Voyant and Overview. I scrape websites. I use Excel quite a lot. I’m starting to write in markdown more often. I want to teach students (my students typically have fairly low levels of digital literacy) how to do all this too. What I don’t do is much web development type stuff, which means that I’m still struggling with concepts and workflow around things like version control. And indeed, getting access to a server where I can just screw around to try things out is difficult (for a variety of reasons). So my server-side skills are weak.

What I think I need is an open notebook. Caleb McDaniel has an excellent post on what this could look like. He uses Gitit. I looked at the documentation, and was defeated out of the gate. Carl Boettiger uses a combination of github and jekyll and who knows what else. What I really like is Mark Madsen's example, but I'm not au fait enough yet with all the bits and pieces (damn you version control, commits, make, rake, et cetera et cetera!).

I've got IPython notebooks working on my PC, which are quite cool (I installed the Anaconda version). I don't know much Python though, so yeah. Stefan Sinclair is working on 'voyant notebooks', which uses the same general idea to wrap analysis around Voyant, so I'm looking forward to that. IPython can be used to call R, which is cool, but it's still early days for me (here's a neat example passing data to R's ggplot2).

So maybe that's just the wrong tool. Much of what I want to do, at least as far as R is concerned, is covered in this post by Robert Flight on 'creating an analysis as a package and vignette' in RStudio. And there's also this, for making sure things are reproducible – 'packrat'.

Some combination of all of this, I expect, will be the solution that'll work for me. Soon I want to start doing some more agent-based modeling & simulation work, and it's mission critical that I sort out my data management, notebooks, versioning, etc. first this time.

God, you should see the mess around here from the last time!

SAA 2015: Macroscopic approaches to archaeological histories: Insights into archaeological practice from digital methods

Ben Marwick and I are organizing a session for the SAA2015 (the 80th edition, this year in San Francisco) on “Macroscopic approaches to archaeological histories: Insights into archaeological practice from digital methods”. It’s a pretty big tent. Below is the session ID and the abstract. If this sounds like something you’d be interested in, why don’t you get in touch?

Session ID 743.

The history of archaeology, like most disciplines, is often presented as a sequence of influential individuals and a discussion of their greatest hits in the literature.  Two problems with this traditional approach are that it sidelines the majority of participants in the archaeological literature who are excluded from these discussions, and it does not capture the conversations outside of the canonical literature.  Recently developed computationally intensive methods as well as creative uses of existing digital tools can address these problems by efficiently enabling quantitative analyses of large volumes of text and other digital objects, and enabling large scale analysis of non-traditional research products such as blogs, images and other media. This session explores these methods, their potentials, and their perils, as we employ so-called ‘big data’ approaches to our own discipline.

—-

Like I said, if that sounds like something you’d be curious to know more about, ping me.

Quickly Extracting Data from PDFs

By 'data', I mean the tables. There are lots of archaeological articles out there that you'd love to compile together to do some sort of meta-study. Or perhaps you've gotten your hands on pdfs with tables and tables of census data. Wouldn't it be great if you could just grab that data cleanly? Jonathan Stray has written a great synopsis of the various things you might try and has sketched out a workflow you might use. Having read that, I wanted to try 'Tabula', one of the options that he mentioned. Tabula is open source and runs on all the major platforms. You simply download it and double-click on the icon; it runs within your browser. You load your pdf into it, and then draw bounding boxes around the tables that you want to grab. Tabula will then extract that table cleanly, allowing you to download it as a csv or tab-separated file, or paste it directly into something else.

For instance, say you’re interested in the data that Gill and Chippindale compiled on Cycladic Figures. You can grab the pdf from JSTOR:

Material and Intellectual Consequences of Esteem for Cycladic Figures
David W. J. Gill and Christopher Chippindale
American Journal of Archaeology, Vol. 97, No. 4 (Oct., 1993), pp. 601-659
Article DOI: 10.2307/506716

Download it, and then feed it into Tabula. Let’s look at table 2.

gillchippendaletable2
You could just highlight this table in your pdf reader and hit ctrl+c to copy it; when you paste that into your browser, you’d get:
gillchippendaletable2cutnpaste
Everything in a single column. For a small table, maybe that’s not such a big deal. But let’s look at what you get with Tabula. You drag the square over that same table; when you release the mouse button you get:
tabula1
Much, much cleaner & faster! I say ‘faster’, because you can quickly drag the selection box around every table and hit download just the one time. Open the resulting csv file, and you have all of your tables in a useful format:
tabula2
But wait, there’s more! Since you can copy directly to the clipboard, you can paste directly into a google drive spreadsheet (thus taking advantage of all the visualization options that Google offers) or into something like Raw from Density Design.
Tabula is a nifty little tool that you’ll probably want to keep handy.

Gaze & Eonydis for Archaeological Data

I’m experimenting with Clement Levallois‘ data mining tools ‘Gaze‘ and ‘Eonydis‘. I created a table with some mock archaeological data in it: artefact, findspot, and date range for the artefact. More on dates in a moment. Here’s the fake dataset.

Firstly, Gaze will take a list of nodes (source, target), and create a network where the source nodes are connected to each other by virtue of sharing a common target. Clement explains:

Paul,dog
Paul, hamster
Paul,cat
Gerald,cat
Gerald,dog
Marie,horse
Donald,squirrel
Donald,cat
… In this case, it is interesting to get a network made of Paul, Gerald, Marie and Donald (sources nodes), showing how similar they are in terms of pets they own. Make sure you do this by choosing “directed networks” in the parameters of Gaze. A related option for directed networks: you can choose a minimum number of times Paul should appear as a source to be included in the computations (useful to filter out unfrequent, irrelevant nodes: because you want only owners with many pets to appear for instance).

The output is in a nodes.dl file and an edges.dl file. In Gephi, go to the import spreadsheet button on the data table, import the nodes file first, then the edges file. Here’s the graph file.

Screenshot, Gaze output into Gephi, from mock archaeo-data
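
Gaze does all of this for you, but just to make the projection idea concrete, here is a rough shell approximation of the same co-ownership logic (an illustration only, not what Gaze actually does internally), assuming a pets.csv of source,target lines with no stray spaces:

sort -t, -k2 pets.csv > by-target.csv
# join the file against itself on the target column to get pairs of owners,
# drop self-pairs and mirror duplicates, then count shared targets as an edge weight
join -t, -1 2 -2 2 -o 1.1,2.1 by-target.csv by-target.csv | awk -F, '$1 < $2' | sort | uniq -c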

Eonydis on the other hand takes that same list and if it has time-stamps within it (a column with dates), will create a dynamic network over time. My mock dataset above seems to cause Eonydis to crash – is it my negative numbers? How do you encode dates from the Bronze Age in the day/month/year system? Checking the documentation, I see that I didn’t have proper field labels, so I needed to fix that. Trying again, it still crashed. I fiddled with the dates to remove the range (leaving a column to imply ‘earliest known date for this sort of thing’), which gave me this file.

Which still crashed. Now I have to go do some other stuff, so I’ll leave this here and perhaps one of you can pick up where I’ve left off. The example file that comes with Eonydis works fine, so I guess when I return to this I’ll carefully compare the two. Then the task will be to work out how to visualize dynamic networks in Gephi. Clement has a very good tutorial on this.

Postscript:

Ok, so I kept plugging away at it. I found that if I put the dates as yyyy-mm-dd, as in 1066-01-23, then Eonydis worked a treat. Here's the mock data and here's the gexf.

And here’s the dynamic animation! http://screencast.com/t/Nlf06OSEkuA

Post-postscript:

I took the mock data (archaeo-test4.csv) and concatenated a '-' in front of the dates, thus -1023-01-01, to represent dates BC. In Eonydis, where it asks for the date format, I tried this:

#yyyy#mm#dd, which accepted the dates but dropped the negative;

-yyyy#mm#dd, which accepted the dates and also dropped the negative.

Thus, it seems to me that I can still use Eonydis for archaeological data, but I should frame my date column in relative terms rather than absolute, as absolute isn’t really necessary for the network analysis/visualization anyway.
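
One quick way to do that re-framing is to shift every year by a fixed offset, so the BC dates become small positive years; the ordering, which is all the network analysis needs, is preserved. A rough awk sketch, assuming the dates sit in the third column as -1023-01-01 style strings and that an offset of 2000 years is enough:

awk -F, 'BEGIN { OFS = "," }
  NR == 1 { print; next }                          # keep the header row as-is
  {
    n = split($3, d, "-")                          # "-1023-01-01" splits into "",1023,01,01
    if (n == 4) { y = -d[2]; m = d[3]; dd = d[4] } # a leading "-" means BC
    else        { y =  d[1]; m = d[2]; dd = d[3] }
    $3 = sprintf("%04d-%s-%s", y + 2000, m, dd)
    print
  }' archaeo-test4.csv > archaeo-test4-relative.csv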

How I Lost the Crowd: A Tale of Sorrow and Hope

Yesterday, my HeritageCrowd project website was annihilated. Gone. Kaput. Destroyed. Joined the choir.

It is a dead parrot.

This is what I think happened, what I now know and need to learn, and what I think the wider digital humanities community needs to think about/teach each other.

HeritageCrowd was (may be again, if I can salvage from the wreckage) a project that tried to encourage the crowdsourcing of local cultural heritage knowledge for a community that does not have particularly good internet access or penetration. It was built on the Ushahidi platform, which allows folks to participate via cell phone text messages. We even had it set up so that a person could leave a voice message and software would automatically transcribe the message and submit it via email. It worked fairly well, and we wrote it up for Writing History in the Digital Age. I was looking forward to working more on it this summer.

Problem #1: Poor record-keeping of the process of getting things installed, and of the decisions taken.

Now, originally, we were using the Crowdmap hosted version of Ushahidi, so we wouldn't have to worry about things like security, updates, servers, that sort of thing. But… I wanted to customize the look, move the blocks around, and make some other cosmetic changes so that Ushahidi's genesis in crisis-mapping wouldn't be quite as evident. When you repurpose software meant for one domain to another, it's the sort of thing you do. So, I set up a new domain, got some server space, downloaded Ushahidi and installed it. The installation tested my server skills. Unlike setting up WordPress or Omeka (which I've done several times), Ushahidi requires the concomitant set-up of 'Kohana'. This was not easy. There are many levels of tacit knowledge in computing and especially in web-based applications that I, as an outsider, have not yet learned. It takes a lot of trial and error, and sometimes, just dumb luck. I kept poor records of this period – I was working to a tight deadline, and I wanted to just get the damned thing working. Today, I have no idea what I actually did to get Kohana and Ushahidi playing nice with one another. I think it actually boiled down to file structure.

(It’s funny to think of myself as an outsider, when it comes to all this digital work. I am after all an official, card-carrying ‘digital humanist’. It’s worth remembering what that label actually means. At least one part of it is ‘humanist’. I spent well over a decade learning how to do that part. I’ve only been at the ‘digital’ part since about 2005… and my experience of ‘digital’, at least initially, is in social networks and simulation – things that don’t actually require me to mount materials on the internet. We forget sometimes that there’s more to the digital humanities than building flashy internet-based digital tools. Archaeologists have been using digital methods in their research since the 1960s; Classicists at least that long – and of course Father Busa).

Problem #2: Computers talk to other computers, and persuade them to do things.

I forget where I read it now (it was probably Stephen Ramsay or Geoffrey Rockwell), but digital humanists need to consider artificial intelligence. We do a humanities not just of other humans, but of humans' creations that engage in their own goal-directed behaviours. As someone who has built a number of agent-based models and simulations, I suppose I shouldn't have forgotten this. But on the internet, there is a whole netherworld of computers corrupting and enslaving each other, for all sorts of purposes.

HeritageCrowd was destroyed so that one computer could persuade another computer to send spam to gullible humans with erectile dysfunction.

It seems that Ushahidi was vulnerable to ‘Cross-site Request Forgery‘ and ‘Cross-site Scripting‘ attacks. I think what happened to HeritageCrowd was an instance of persistent XSS:

The persistent (or stored) XSS vulnerability is a more devastating variant of a cross-site scripting flaw: it occurs when the data provided by the attacker is saved by the server, and then permanently displayed on “normal” pages returned to other users in the course of regular browsing, without proper HTML escaping.

When I examine every php file on the site, there are all sorts of injected base64 code. So this is what killed my site. Once my site started flooding spam all over the place, the internet’s immune systems (my host’s own, and others), shut it all down. Now, I could just clean everything out, and reinstall, but the more devastating issue: it appears my sql database is gone. Destroyed. Erased. No longer present. I’ve asked my host to help confirm that, because at this point, I’m way out of my league. Hey all you lone digital humanists: how often does your computing services department help you out in this regard? Find someone at your institution who can handle this kind of thing. We can’t wear every hat. I’ve been a one-man band for so long, I’m a bit like the guy in Shawshank Redemption who asks his boss at the supermarket for permission to go to the bathroom. Old habits are hard to break.
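
(If you ever need to check for this sort of thing yourself, a blunt but quick test is to grep the site for the usual obfuscation functions; anything it flags deserves a closer look:)

grep -rlE 'base64_decode|gzinflate|eval\(' --include='*.php' .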

Problem #3: Security Warnings

There are many Ushahidi installations all over the world, and they deal with some pretty sensitive stuff. Security is therefore something Ushahidi takes seriously. I should’ve too. I was not subscribed to the Ushahidi Security Advisories. The hardest pill to swallow is when you know it’s your own damned fault. The warning was there; heed the warnings! Schedule time into every week to keep on top of security. If you’ve got a team, task someone to look after this. I have lots of excuses – it was end of term, things were due, meetings to be held, grades to get in – but it was my responsibility. And I dropped the ball.

Problem #4: Backups

This is the most embarrassing to admit. I did not back things up regularly. I am not ever making that mistake again. Over on Looted Heritage, I have an IFTTT recipe set up that sends every new report to BufferApp, which then tweets it. I've also got one that sends every report to Evernote. There are probably more elegant ways to do this. But the worst would be to remind myself to manually download things. That didn't work the first time. It ain't gonna work the next.
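
For the database side, the obvious fix going forward is something automatic and boring: a nightly mysqldump from cron, along these lines (the user, password, database name, and path below are all placeholders, not my actual setup; stashing the credentials in a ~/.my.cnf would be smarter than putting them on the command line):

# crontab entry: dump the database every night at 2am, gzipped and dated
0 2 * * * mysqldump -u backupuser -p'secret' heritagecrowd_db | gzip > /home/me/backups/heritagecrowd-$(date +\%F).sql.gz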

So what do I do now?

If I can get my database back, I’ll clean everything out and reinstall, and then progress onwards wiser for the experience. If I can’t… well, perhaps that’s the end of HeritageCrowd. It was always an experiment, and as Scott Weingart reminds us,

The best we can do is not as much as we can, but as much as we need. There is a point of diminishing return for data collection; that point at which you can’t measure the coastline fast enough before the tides change it. We as humanists have to become comfortable with incompleteness and imperfection, and trust that in aggregate those data can still tell us something, even if they can’t reveal everything.

The HeritageCrowd project taught me quite a lot about crowdsourcing cultural heritage, about building communities, about the problems, potentials, and perils of data management. Even in its (quite probable) death, I’ve learned some hard lessons. I share them here so that you don’t have to make the same mistakes. Make new ones! Share them! The next time I go to THATCamp, I know what I’ll be proposing. I want a session on the Black Hats, and the dark side of the force. I want to know what the resources are for learning how they work, what I can do to protect myself, and frankly, more about the social and cultural anthropology of their world. Perhaps there is space in the Digital Humanities for that.

PS.

When I discovered what had happened, I tweeted about it. Thank you everyone who responded with help and advice. That’s the final lesson I think, about this episode. Don’t be afraid to share your failures, and ask for help. As Bethany wrote some time ago, we’re at that point where we’re building the new ways of knowing for the future, just like the Lunaticks in the 18th century. Embrace your inner Lunatick:

Those 18th-century Lunaticks weren’t about the really big theories and breakthroughs – instead, their heroic work was to codify knowledge, found professional societies and journals, and build all the enabling infrastructure that benefited a succeeding generation of scholars and scientists.

[…]

if you agree with me that there’s something remarkable about a generation of trained scholars ready to subsume themselves in collaborative endeavors, to do the grunt work, and to step back from the podium into roles only they can play – that is, to become systems-builders for the humanities — then we might also just pause to appreciate and celebrate, and to use “#alt-ac” as a safe place for people to say, “I’m a Lunatick, too.”

Perhaps my role is to fail gloriously & often, so you don’t have to. I’m ok with that.

Converting 2-mode networks with the Multimodal plugin for Gephi

Scott Weingart drew my attention this morning to a new plugin for Gephi by Jaroslav Kuchar that converts multimodal networks to one mode networks.

This plugin allows multimode networks projection. For example: you can project your bipartite (2-mode) graph to monopartite (one-mode) graph. The projection/transformation is based on the matrix multiplication approach and allows different types of transformations. Not only bipartite graphs. The limitation is matrix multiplication – large matrix multiplication takes time and memory.

After some playing around, and some emails & tweets with Scott, we determined that it does not seem to work at the moment for directed graphs. But if you’ve got a bimodal undirected graph, it works very well indeed! It does require some massaging though. I assume you can already download and install the plugin.

1. Make sure your initial csv file with your data has a column called ‘type’. Fill that column with ‘undirected’. The plugin doesn’t work correctly with directed graphs.

2. Then, once your csv file is imported, create a new column on the nodes table, call it ‘node-type’ – here you specify what the thing is. Fill it up accordingly. (cheese, crackers, for instance).

3. I thank Scott for talking me through this step. First, save your network; this next step will irrevocably change your data. Click 'load attributes'. Under attribute type, select the column you created for step 2. Then, for left matrix, select Cheese – Crackers; for right matrix, select Crackers – Cheese. Hit 'run'. This gets you a new Cheese-Cheese network (select the inverse to get a Crackers-Crackers network). You can then remove any isolates or dangly bits by ticking 'remove edges' or 'remove nodes' as appropriate.

4. Save your new 1 mode network. Go back to the beginning to create the other 1 mode network.

Looted Heritage: Infotrapping the Illicit Antiquities Trade

To crowdsource something – whether it is a problem of code, or the need to transcribe historical documents – is generally about fracturing a problem into its component pieces, and allowing an interested public to solve the tiny pieces. In 2011 I and my students embarked on a project to crowdsource sense of place, using Ushahidi to solicit and collect community memories about cultural heritage resources in Pontiac and Renfrew counties in Eastern Canada (The HeritageCrowd Project). As part of that project, I had initially set up a deployment of Ushahidi using their hosted service, called 'Crowdmap'. It contains all of the functionality of the vanilla Ushahidi, but since they are doing the hosting, it cannot be customized to any great degree. I mothballed that initial deployment, and installed my own on my own server so I could customize. The entire experience was recounted in a contribution to the forthcoming born-digital volume, 'Writing History in the Digital Age', edited by Jack Dougherty and Kristen Nawrotzki. One of the findings of that small experiment was about the order of operations in a crowdsourced project:

…in one sense our project’s focus was misplaced. Crowdsourcing should not be a first step. The resources are already out there; why not trawl, crawl, spider, and collect what has already been uploaded to the internet? Once the knowledge is collected, then one could call on the crowd to fill in the gaps. This would perhaps be a better use of time, money, and resources.

This January, I started the second term of my year-long first-year seminar in digital antiquity, and I decided to re-start my mothballed Crowdmap as part of a module on crowdsourcing. But as I prepared, that one paragraph above kept haunting me. Perhaps what was needed was not so much a module on crowdsourcing, but rather, one on information trapping. I realized that Ushahidi and Crowdmap are better thought of as info-traps. Thus, 'Looted Heritage: Monitoring the Illicit Antiquities Trade' was born. The site monitors various social media and regular media feeds for stories and reports about the trade in antiquities, which can then be mapped, giving a visual depiction of the impact of the trade. I intend to use Zotero public library feeds as well, to enable the mapping of the academic bibliography on the trade.

Why illicit antiquities? The Illicit Antiquities Research Centre at Cambridge University (which is, sadly, closed) provides some statistics on the nature and scale of the illicit antiquities trade; these statistics date to 1999; the problem has only grown with the wars and disruptions of the past decade:

  • Italy: 120,000 antiquities seized by police in five years;
  • Italy: 100,000+ Apulian tombs devastated;
  • Niger: in southwest Niger between 50 and 90 per cent of sites have been destroyed by looters;
  • Turkey: more than 560 looters arrested in one year with 10,000 objects in their possession;
  • Cyprus: 60,000 objects looted since 1974;
  • China: catalogues of Sotheby’s sales found in the poor countryside: at least 15,000 sites vandalized, 110,000 illicit cultural objects intercepted in four years;
  • Cambodia: 300 armed bandits surround Angkor Conservation compound, using hand grenades to blow apart the monuments; 93 Buddha heads intercepted in June this year, 342 other objects a month later;
  • Syria: the situation is now so bad a new law has been passed which sends looters to jail for 15 years;
  • Belize: 73 per cent of major sites looted;
  • Guatemala: thieves now so aggressive they even looted from the laboratory at Tikal;
  • Peru: 100,000 tombs looted, half the known sites.

-Brodie and Watson, http://www.mcdonald.cam.ac.uk/projects/iarc/culturewithoutcontext/issue5/brodie-watson.htm

The idea then is to provide my students with hands-on experience using Crowdmap as a tool, and to foster engagement with the real-world consequences of pot-hunting and tomb-robbing. Crowdmap also allows users to download all of the reports that get created. One may download all reports on Looted Heritage in CSV format at the download page. I will be using this data as part of my teaching about data mining and text analysis, getting the students to run this data through tools like Voyant to spot patterns in the way that the antiquities trade is portrayed in the media.

At one point, I also had feeds from eBay for various kinds of artefacts, Greco-Roman, Egyptian, pre-Columbian, etc, but there is so much volume going through eBay that it completely overwhelmed the other signals. I think dealing with eBay and exploring the scope of the trade there will require different scrapers and quantitative tools, so I’m leaving that problem aside for the time being.

I’m also waiting anxiously for trap.it, which was described earlier this week in Profhacker, to allow exports of the information it finds. Right now, you can only consume what it finds within its ecosystem (or through kludgy workarounds). Ideally, I would be able to grab a feed from Trap.it for one of its traps and bring that directly into Looted Heritage. Trap.it is more of an active agent than the passive sieve approach that Crowdmap takes, so combining the two ought to be a powerful approach. From Profhacker:

[Trap.it is] a web service that combines machine learning algorithms with user-selected topics and filters. (The algorithms used in this project stem from the same research that led to Apple’s Siri.) After creating an account, you create a “trap” by entering in a keyword or short phrase into the Discovery box. Once you save your trap, you personalize it by clicking thumbs up or thumbs down on a number of articles in your trap. The more articles you rate, the closer attuned the trap becomes to the kinds of material you want to read.

Finally, one of the other aspects we learned in the HeritageCrowd project was the importance of outreach. For Looted Heritage, I am using a combination of If This Then That to monitor Looted Heritage’s feed, sending new reports to Buffer App, which then sends to Twitter, @looted_heritage. Aside from some problems with duplicated tweets, I’ve been quite happy with this setup (and thanks to Terry Brock for suggesting this). There’s an interesting possibility of circularity there, with Crowdmap picking up @looted_heritage in its traps, and then sending them out again… but my students should spot that if it occurs.

Ushahidi also provides an iOS app that can be deployed in the field, so that an archaeologist who discovers the work of tombaroli could take geo-tagged photographs and submit them with a click to the Looted Heritage installation, drawing police attention.

So far, the response to this project has been good, with 20 people now following the twitter account since I set it up last week (and which I haven’t promoted very actively). Please feel free to contribute reports to Looted Heritage via its submission page, or by tagging your tweets with #looted #antiquities.

https://twitter.com/#!/gregshine/status/170543712944914433

https://twitter.com/#!/melissaterras/status/170223071834279939

Reading ‘Writing History in the Digital Age’ at a Distance

Topics by authors in Writing History in the Digital Age

I and my students have made some contributions to ‘Writing History in the Digital Age‘, the born-digital volume edited by  Jack Dougherty and Kristen Nawrotzki. Rather than reflect on the writing process, I thought I’d topic model the volume to see what patterns emerged in the contributions.

I use Mallet to do this. I’ve posted earlier about how to get Mallet running. I used Outwit Hub to scrape each individual paragraph from each paper (> 700 paragraphs) into a CSV file (I did not scrape block quotes, so my paragraph numbers are slightly out of sync with those used on the Writing History website). I used the Textme excel macro (google it; it lives in multiple versions and requires a bit of modification to work exactly the way you want it to) to save each paragraph into its own unique text file, which I then load into Mallet.

Phew. Now, the tricky part with Mallet is deciding how many topics you want it to look for. Finding the *right* number of topics requires a bit of iteration – start with, say, 10. Look at the resulting composition of files to topics. If an inordinate number of files all fall into one topic, you don't have enough granularity yet.
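
For reference, the Mallet side of this boils down to two commands, roughly as follows (the folder and output file names are placeholders; --num-topics is the knob to iterate on):

# import the folder of one-paragraph text files into Mallet's own format
bin/mallet import-dir --input paragraphs/ --output paragraphs.mallet --keep-sequence --remove-stopwords
# train the model; re-run with a different --num-topics until the composition of files to topics looks sensible
bin/mallet train-topics --input paragraphs.mallet --num-topics 15 --output-doc-topics doc-topics.txt --output-topic-keys topic-keys.txt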

As an initial read, I went with 15 topics. One topic – which I'll label 'working with data' – had quite a large number of files attached to it (see the composition document; remember, the 'files' here are the individual paragraphs from the papers). Ideally, I would re-run the analysis with a greater number of topics, so that the 'working with data' topic would get broken up.

I also graphed the results, so that each author is linked to the topics which compose his or her paper; the thickness of the line indicates multiple paragraphs with that topic. I have also graphed topics by individual paragraphs, but the granularity isn't ideal, making the resulting visual not all that useful. The colours correspond with the 'modularity' of the graph, that is, communities of similar patterns of connections. The size of the node represents 'betweenness' on all paths between every pair of nodes.

So what does it all mean? At the level of paragraph-by-topic, if we had the correct level of granularity, one might be able to read the entire volume by treating the graph as a guide to hyperlinking from paragraph to paragraph, perhaps – a machine-generated map/index of the internal structure of ideas. At the level of individual authors, it perhaps suggests papers to read together and the organizing themes of the volume.

This is of course a quick and dirty visualization and analysis, and my initial impressions. More time and consideration, greater granularity, is to be desired.

Topic or Author / Community
Crowdsourcing 0
Students’ Learning 0
graham 0
grahammassiefeuerherm 0
sikarskie 0
Working with Data 1
Video 1
faltesek 1
Games 1
noonan 1
poe 1
zucconietal 1
castaneda 2
Activism, Protests 2
African Americans and the South 2
Primary Resources, Teaching, and Libraries 2
haber 2
judkins 2
madsen-brooks 2
sklardubin 2
tomasek 2
wolff 2
Blogging and Peer Interactions 3
Monitoring Wikipedia 3
introduction 3
jarret 3
lawrence 3
saxtonetal 3
seligman 3
bauer 4
Keywords and Search 4
cummings 4
Japan and History 4
Writing Process 4
dorn 4
Space and Geography 4
erikson 4
harbisonwaltzer 4
petrzelamanekin 4
roberston 4
tanaka 4
gibbsowens 5
Visualization 5
theibault 5