Category Archives: data mining
Topic Modeling the Portable Antiquities Scheme
I got my hands on the latest build of the Portable Antiquities Scheme database. I want to topic model the items in this database, to look for patterns in the small material culture of Britain, across time and space.
The data comes in a single CSV, with approximately 500 000 individual rows. The data’s a bit messy, as a result of extra commas slipping in here and there. The names of the Finds Liaison Officers slip into a column meant to record epigraphic info from coins, for instance, from time to time. Not a big deal, over 500 000 records.
The first issue I had was that after opening the CSV file in Excel, Excel would regard all of those epigraphic conventions (the use of =, +, or [ ] and so on) as formulae. This would generate ‘circular reference’ errors. I could sort that out by inserting a ‘ at the beginning of that column. But as you can imagine, sorting through, filtering, or any kind of manipulation of a single table that large would slow things considerably – and frequently crashed this poor ol’ desktop. I tried using Open Refine to clean up the data. I suspect with a bit of time and effort I’d be able to use that product well, but yesterday all I achieved, once I imported my csv file and clicked ‘make project’, was an ‘undefined error’ (after several minutes of chugging). This morning, I turned to Access and was able to import the csv, and begin querying it, cleaning things up a bit, and so on.
So I decided to focus on the Roman records, for the time being. There are some 66 000 unique records, coming from over 80 unique districts of the UK. This leaves me with a table with the chronological range for the object, a description of the object, and some measurements. I have a script that can take each individual row, and turn it into a txt file which I can then import into MALLET. Each individual row can also include the district name.
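The row-to-file step might look something like the sketch below. It is not the script mentioned above: the column names (`id`, `district`, `period`, `description`) and the two sample records are invented for illustration, and the real export will need its own column list. Python's `csv` module copes with commas inside quoted fields, which helps with at least some of the stray-comma mess.

```python
import csv
import io
import tempfile
from pathlib import Path


def rows_to_text_files(csv_text, out_dir, text_columns, id_column):
    """Write one .txt file per CSV row; return the list of paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Join only the columns we want the topic model to see.
        body = " ".join(row[col].strip() for col in text_columns if row.get(col))
        path = out_dir / f"{row[id_column]}.txt"
        path.write_text(body, encoding="utf-8")
        written.append(path)
    return written


# Two made-up records standing in for the real export (hypothetical columns).
SAMPLE = (
    "id,district,period,description\n"
    "1,Winchester,ROMAN,Copper alloy brooch with green patina\n"
    '2,Dover,ROMAN,"Silver denarius, worn"\n'
)

with tempfile.TemporaryDirectory() as tmp:
    paths = rows_to_text_files(SAMPLE, tmp, ["district", "description"], "id")
    first_text = paths[0].read_text(encoding="utf-8")
    file_count = len(paths)
```

Leaving `"district"` out of `text_columns` gives the variant where the place name never reaches the analyzed text.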
So I’m wondering now: should I just cut and paste all of the rows for a single district into a single txt file (and thus the routine will not have the place-name in the analyzed text)? Or should I preserve the granularity, and just topic model over every record, preserving the place name? I.e., a collection of 80 txt files where there are no place names, or a collection of 66 000 txt files where every file has the place name – in which case, will the place names swamp the signals?
It’s too early in the morning for this kind of thinking.
Data Mining an Archaeological Database

I’m using their materials with permission.
In July, I’m presenting work related to data mining an archaeological database, in this case, the Portable Antiquities Scheme.
I wondered, if I treated each district in the UK as a ‘document’, and the items recovered in its territory as the words, would I see any interesting or useful patterns if I ran some topic models?
To give you a sense of the scale of this data, there are over 160 000 individual records in the material I obtained from PAS. An individual record might include a ‘hoard’, so there are *well* over 160 000 individual objects. When you sort this material into broad chronological periods, you find:
Paleolithic: 305 records
Mesolithic: 2281
Neolithic: 3608
Prehistoric: 426
Bronze Age: 2620
Iron Age: 4695
Roman: 63479
Greek/Roman Provincial: 25
Byzantine: 25
Early Medieval: 8421
Medieval: 44982
Post Medieval: 27879
Modern: 306
“Unknown”: 1486
Blank cells: 1278
Quite a lot of material. So, after massaging the data, cleaning things up, I began to work with a very small subset of materials – records tagged ‘bronze age’ from 14 districts (104 records). This was merely an exploration, to see if there’s any meat to my intuitive belief that there should be some sort of latent structure. The 14 districts I selected (the first 14 when I sorted ‘Bronze Age’) are:
Ashford
Bromley
Dover
East Hampshire
Hart
Medway
New Forest
Sevenoaks
Test Valley
Winchester
Wokingham
I put every record from Wokingham District into a single txt file, then every one from Winchester, and so on until I was done (I really need to automate that). Then I fed the text files through MALLET, using the Java GUI with its default settings for this initial exploration. In a more robust exploration, I would go direct from the command line, tweaking until I found the best number of topics, etc.
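Automating the one-file-per-district step could be as simple as grouping rows by district before writing anything out. This is only a sketch: the column names `district` and `description` and the three sample rows are assumptions, not the actual PAS field names.

```python
import csv
import io
from collections import defaultdict


def group_descriptions_by_district(csv_text, district_column="district",
                                   text_column="description"):
    """Return {district: all of its record descriptions joined into one string}."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        district = row[district_column].strip()
        # The district name is used as the key only, so it never enters the text.
        grouped[district].append(row[text_column].strip())
    return {district: "\n".join(texts) for district, texts in grouped.items()}


# Made-up rows for illustration.
SAMPLE = (
    "district,description\n"
    "Wokingham,Socketed axehead\n"
    "Winchester,Palstave fragment\n"
    "Wokingham,Flint scraper\n"
)

documents = group_descriptions_by_district(SAMPLE)
```

Each value in `documents` can then be written to a file named after its key, giving one ‘document’ per district.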
So here’s what I found.
List of Topics
1. alloy palstave mm copper green surface slight cast dark penannular
2. mouth sides loop dims looped corners armorican axeheads core cast
3. blade axehead prominent casting iron hoard intact uneven single narrow
4. age fragment late surfaces alan spear body faces head flanged
5. age socket collar sectioned alloy slightly ridge seams front square
6. record flint grey scraper antiquities dorsal tool angle black visible
7. bronze patina end stop made remains flat decoration found corroded
8. database central rectify working recording standards usual fall aware began
9. bronze copper flashes part side edge large ridges shallow top
10. socketed straight axe rounded complete horizontal moulded rectangular expanding upper
What do those topics mean? To a human, they are all variations on the description of the artefacts. Given that multiple humans described these artefacts in the first place, perhaps (and it depends too on the kind of guidance and rigour that the PAS uses in its data entry) these topics gather some of the blurriness of categorization, a way of bypassing the clumpers and the splitters amongst us. Obviously, some more thought about what these may mean is necessary. But onwards!
I brought the resultant ‘documents: topics, % contribution’ list into Gephi for some visualization. Since it was a small dataset, I did no pruning. Topic 4 does the most lifting in this network. In its ‘module’, you find topics 9, 10, 3, 5 (coloured purple) and districts of Gravesham, Bromley, Dover, Canterbury, Test Valley, and New Forest. But how much weight does this visualization carry? Since it’s two-mode, and these metrics are really only appropriate for a one-mode graph, probably not much. So I collapsed this graph into a one-mode graph of district to district, based on weighted ties by topic.
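For anyone who wants to try the collapse outside Gephi, here is one way it could be done. The weighting rule is an assumption on my part for illustration (the sum, over shared topics, of the product of the two districts’ topic shares), and the shares below are invented rather than taken from the MALLET output.

```python
from itertools import combinations


def project_to_one_mode(doc_topics):
    """Collapse {district: {topic: share}} into weighted district-district ties."""
    edges = {}
    for a, b in combinations(sorted(doc_topics), 2):
        shared = set(doc_topics[a]) & set(doc_topics[b])
        # Assumed weighting: sum over shared topics of the product of shares.
        weight = sum(doc_topics[a][t] * doc_topics[b][t] for t in shared)
        if weight > 0:
            edges[(a, b)] = weight
    return edges


# Invented topic shares for three districts.
doc_topics = {
    "Dover": {4: 0.5, 9: 0.5},
    "Bromley": {4: 0.5, 3: 0.5},
    "Hart": {6: 1.0},
}
edges = project_to_one_mode(doc_topics)
```

Here Dover and Bromley end up tied through topic 4, while Hart, sharing no topic with either, stays unconnected. Other weighting rules (for instance, taking the smaller of the two shares) would be just as defensible.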
The resultant graph is probably more useful for archaeology, for it ties areas together based on all of the material culture recorded in the database. At the recent SAA in Honolulu, in the Connected Past session, folks were constructing networks from artefacts using Brainerd-Robinson coefficients. The methodology I’m trying ought to be compared with those studies (see for instance the recent article by Barbara Mills et al.). I then ran modularity and betweenness statistics again. Why betweenness? If the ‘topics’ that emerge in this database reflect something within the underlying material culture, then interconnections between sites constructed from topics show some kind of flow (of ideas? culture? economics?), thus ‘between’ sites straddle the most important of those flows – in which case the most ‘between’ districts might be rather more important.
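For comparison, the Brainerd-Robinson coefficient mentioned above is simple to compute: convert each assemblage to percentages by category, sum the absolute differences, and subtract from 200. The sketch below is a plain rendering of that formula, not the code used in any of the studies cited, and the artefact counts are invented.

```python
def brainerd_robinson(counts_a, counts_b):
    """Similarity from 0 (nothing shared) to 200 (identical proportions)."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    categories = set(counts_a) | set(counts_b)
    # Sum of absolute differences between the two percentage profiles.
    difference = sum(
        abs(100 * counts_a.get(c, 0) / total_a - 100 * counts_b.get(c, 0) / total_b)
        for c in categories
    )
    return 200 - difference


# Invented artefact counts for three assemblages.
site_a = {"axehead": 5, "palstave": 5}
site_b = {"axehead": 10, "palstave": 10}
site_c = {"scraper": 4}
site_d = {"axehead": 10, "scraper": 10}

same = brainerd_robinson(site_a, site_b)      # identical proportions
nothing = brainerd_robinson(site_a, site_c)   # no categories in common
partial = brainerd_robinson(site_a, site_d)   # half of each profile overlaps
```

Because the coefficient works on proportions, `site_a` and `site_b` score as identical even though one has twice as many objects.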
Remarkably (and this could be an artefact of the method, rather than the underlying data), I get next to no variation in betweenness – every district except for East Hampshire, Ashford, and New Forest has the same score (and these three all have the same score too). Modularity finds two groups. Perhaps it’s an east/west dichotomy? I laid the network out with the nodes at their geographic locations (typically, the district council office). No east-west dichotomy. (Incidentally, you can now export to Google Earth, overlaying your network against pretty satellite pictures).
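One possible reason for the flat betweenness scores (a guess, not something checked against this data) is that when nearly every district shares a topic with nearly every other, the one-mode graph is close to complete, and in a complete graph nobody sits ‘between’ anybody. The sketch below uses a small pure-Python version of Brandes’ algorithm on two toy graphs; it ignores edge weights and does not normalize, so the numbers will not necessarily match Gephi’s.

```python
from collections import deque


def betweenness(adjacency):
    """Brandes' algorithm for an unweighted, undirected graph."""
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        stack = []
        predecessors = {node: [] for node in adjacency}
        paths = {node: 0 for node in adjacency}
        paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search outwards from source
            v = queue.popleft()
            stack.append(v)
            for w in adjacency[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    paths[w] += paths[v]
                    predecessors[w].append(v)
        dependency = {node: 0.0 for node in adjacency}
        while stack:  # accumulate dependencies back towards source
            w = stack.pop()
            for v in predecessors[w]:
                dependency[v] += paths[v] / paths[w] * (1 + dependency[w])
            if w != source:
                scores[w] += dependency[w]
    # Each undirected pair was counted once from each end.
    return {node: score / 2 for node, score in scores.items()}


# Toy graphs: four fully connected nodes, and a three-node chain.
complete = {n: [m for m in "ABCD" if m != n] for n in "ABCD"}
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}

complete_scores = betweenness(complete)
chain_scores = betweenness(chain)
```

In the complete graph every node scores zero; in the chain only the middle node scores anything. Pruning weak ties before running the statistic would be one way to test whether this is what is happening.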
So… there seems to be something to it. The thing to do now is to do every record, every district, and every period, mapping out changes over time. In the interests of being able to assess this, though, I should perhaps stick to my knitting and just do the Roman period.
Text Analysis of 2012 Digital Humanities Job Adverts part 3
Here is a zoomable pdf of the same image, for clarity
In this case, the two-mode network of jobs to top constituent topics provides much more clarity than the graph I posted at the end of part 2, the one-mode jobs-to-jobs via shared topics. I used the Java GUI for MALLET, which arranges the output in a very nice hyperlinked folder, which you may explore here. You can grab the CSV and the Gephi files from this directory.
Text Analysis of 2012 Digital Humanities Job Adverts part 2
If we look at simple word frequencies in the 2012 job advertisement documents for Digital Humanities, we find these top words and raw frequency counts:
research 650
university 577
experience 499
library 393
work 334
information 303
position 299
project 269
applications 257
(I’ve deleted ‘digital’ and ‘humanities’ from this list).
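A count like this does not need special tooling. The sketch below shows the general idea with two invented snippets of advert text; the excluded words are my own choice for the toy example, and Voyant’s tokenization will differ in its details.

```python
import re
from collections import Counter


def top_words(texts, excluded=(), n=10):
    """Return the n most frequent lowercase words across texts, minus excluded."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    for word in excluded:
        counts.pop(word, None)  # drop words we do not want in the ranking
    return counts.most_common(n)


# Two invented snippets standing in for the job adverts.
adverts = [
    "Digital humanities research position at the university library",
    "Research experience required; digital project experience preferred",
]
ranking = top_words(adverts, excluded={"digital", "humanities", "the", "at"}, n=2)
```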
If job advertisements are a way of signalling what an institution hopes the future will hold, one gets the sense that the focus of digital humanities work will be on projects, on research, in conjunction with libraries. But we can extract more nuance, using network analysis. You can feed the texts into Voyant’s ‘RezoViz’ tool, which extracts paired nouns in each document.
This can be outputted as a .net file, and then imported into Gephi. The resulting graph has 1461 nodes, and 20649 edges. Of course, there are some duplicates (like ‘US’ and ‘United States’), but this is only meant to be rough and ready, ‘generative’, as it were (and note also that a network visualization is not necessary for the analysis. So no spaghetti balls. What’s important are the metrics). What I’d like to find out is which concepts are doing the heavy lifting in these job advertisements. What is the hidden structure of the future of digital humanities, as evidenced by job advertisements in the English-speaking world?
My suspicion is that ‘modularity’ aka ‘community detection’, and ‘betweenness centrality’, are going to be the key metrics for figuring this out. Modularity groups nodes on the basis of shared similar local patternings of ties (or, to put it another way, it decomposes the global network into maximal subnetworks). Seth Long recently did some network analysis on the Unabomber’s manifesto, and lucidly explains why betweenness centrality is a useful metric for understanding semantic meaning: “A word with high betweenness centrality is a word through which many meanings in a text circulate.” In other words, the heavy lifters.
So let’s peer into the future.
I ended up with about 15 groups. The first three groups by modularity account for 75% of the nodes, and 80% of the ties. These are the groups where the action lies. So let’s look at words with the highest betweenness centrality scores for those first three groups.
The first group
University
CSS
PHP
Digital
Ruby
METS (Metadata Encoding and Transmission Standard)
United States
Python
MLS
New York
‘University’ is not surprising, and not useful. So let us discard it and bring in the next highest word:
MySQL
This one group by modularity also has all of the highest betweenness centrality scores – and it reads like a laundry list of the skills a budding DH practitioner must hold. The US, and New York would seem to be the centre of the world, too.
If we take the next ten words, we get:
MODS (Metadata Object Description Schema)
XHTML
University Libraries
CLIR (Council on Library and Information Resources)
University of Alberta
North America
Drupal
XML
MARC
Duke University
Again, skills and places figure – in Canada, U of A appears. So far, the impression is that DH is all about text, markup, and metadata. Our favorite programming languages are Python and Ruby. We use PHP, XHTML, XML, and Drupal (plain-jane vanilla HTML eventually turns up in the list, but it’s buried very, very deep).
So that’s an impression of the first group. (Remembering that groups are defined by patterns of similarity in their linkages).
The Second Group
The next group looks like this:
Digital Humanities
London
UK
CV
Dublin
Europe
Ireland
ICT
Department of Digital Humanities
Department of History
“digital humanities” is probably not helpful, so let’s eliminate that and go one more down: “US”. Indeed, let’s take a look at the next ten, too:
Human Resources
Department
Computer Science
BCE
Head of School
Faculty of Humanities
European
University of Amsterdam
MA
Italy
Here, we’re dealing very much with a UK, Ireland, and European focus. The ‘BCE’ is telling, for it suggests an archaeological focus in there, somewhere (unless this is some new DH acronym of which I’m not aware; I’m assuming ‘before the common era’).
The Third Group
In the final group we’ll consider here, we find a strong Canadian focus:
CRC (Canada Research Chair)
Canada
Waterloo
TEI (Text Encoding Initiative)
SSHRC
Victoria
Canada Research Chair
Skype
Digital Humanities Summer Institute
University of Victoria
Since we’ve got some duplication in here, let’s look at the next ten:
Canadian
Quebec
ETCL (Electronic Textual Cultures Laboratory, U Victoria)
Montreal
Concordia
University of Waterloo
DHSI (Digital Humanities Summer Institute)
Stratford
Faculty of Arts
Stratford Campus
‘Canada Research Chairs’ are well-funded government appointments, and so give an indication of where the state would like to see some research. Victoria continually punches above its weight, with look-ins from Waterloo and Concordia.
So what have we learned? Well, despite the efforts of the digital history community, ‘digital humanities’ is still largely a literary endeavor – although it’s quite possible that a lot of the marking up that these job advertisements might envision could be of historical documents. Invest in some Python skills (see Programming Historian). My friend in government tells me that if you can data mine, you’ll be set for life, as the government is looking for those skills. (Alright, that didn’t come out in this analysis at all, but he’s looking over my shoulder right now.)
Finally – London, Dublin, New York, Edmonton, Victoria, Waterloo, Montreal – these seem to be the geographic hotspots. Speaking of temperature, Victoria has the nicest weather. Go there, young student!
Or come to Carleton and study with me. We’ve got tunnels.
Update, March 4th: in the analysis above, I generated a network using Voyant’s RezoViz tool. Today, I topic modelled all of the texts, looking for 10 topics: a slightly different approach. I turned the resulting document composition (i.e. doc 1 is 44% topic 1, 22% topic 4, 10% topic 3, etc.) into a two-mode graph, job advert to top two constituent topics. I then turned this into a one-mode graph where job adverts are tied to other job adverts based on topic composition. Then I ran modularity and found 3 groups; edges are weighted by percent composition by topics discerned through topic modeling. Nodes are sized by betweenness centrality. Most between? George Mason University. I’m not yet sure what ‘betweenness centrality’ means in this context, though.
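The step from document composition to a two-mode edge list can be sketched as follows. The composition figures echo the example percentages above plus one invented advert; the function name and the ‘topic N’ labels are mine, not MALLET’s.

```python
def top_topic_edges(composition, k=2):
    """Return (document, topic, share) ties for each document's k largest topics."""
    edges = []
    for doc, shares in composition.items():
        # Rank this document's topics from largest share to smallest.
        ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
        for topic, share in ranked[:k]:
            edges.append((doc, f"topic {topic}", share))
    return edges


# Illustrative compositions; "advert 2" is invented.
composition = {
    "advert 1": {1: 0.44, 4: 0.22, 3: 0.10},
    "advert 2": {4: 0.50, 2: 0.30, 1: 0.05},
}
edges = top_topic_edges(composition)
```

Written out as a CSV with Source, Target and Weight columns, a list like this can be loaded into Gephi as an edge table.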
Makes for interesting clusters of job adverts. Topic model results to be discussed tomorrow.
Text analysis of 2012 Digital Humanities Job Adverts

2012 was a good year for hirings in the digital humanities. See for yourself at this archive of DH jobs: http://jobs.lofhm.org/ Now: what do these job adverts tell us, if you’re a graduate student trying to find your way?
Next week, I’m speaking to the Underhill Graduate Students’ Colloquium at Carleton University on ‘Living the life electric: becoming a digital humanist’. It’s broadly autobiographical in that I’ll talk about my own idiosyncratic path into this field.
That’s quite the point: there’s no firm/accepted/typical/you-ought-to-do X recipe for becoming a digital humanist. You have to find your own way, though the growing body of courses, books, journals, blog-o-sphere and twitterverse certainly makes a huge difference.
But in the interests of providing perhaps a more satisfying answer, I’ll try my hand at data mining those job posts (some 150 of them) using Voyant and MALLET to see what augurs for the future of the field.
Feel free to explore the corpus uploaded into Voyant. In any graphs you produce, January is on the left, December is on the right. If you spot anything interesting/curious, let me know.

And, because word counts are amazing:
| Word | Count |
| digital | 1082 |
| research | 650 |
| university | 577 |
| experience | 499 |
| library | 393 |
| humanities | 386 |
| work | 334 |
| information | 303 |
| position | 299 |
| project | 269 |
| applications | 257 |
| new | 223 |
| faculty | 222 |
| development | 216 |
| collections | 210 |
| department | 207 |
| management | 206 |
| projects | 195 |
| knowledge | 192 |
| data | 187 |
| including | 185 |
| ability | 182 |
| services | 180 |
| teaching | 180 |
| history | 177 |
| libraries | 176 |
| skills | 176 |
| qualifications | 172 |
| technology | 169 |
| required | 166 |
| media | 163 |
| jobs | 151 |
| application | 149 |
| original | 146 |
| program | 145 |
| link | 143 |
| web | 143 |
| working | 142 |
| loading | 140 |
| related | 140 |
| staff | 138 |
| academic | 137 |
| communication | 133 |
| job | 132 |
| college | 130 |
| degree | 127 |
| professor | 126 |
| education | 125 |
| students | 125 |
| studies | 123 |
Visualizing THATCamp
THATCamps are quite popular. I’m throwing one myself. But who are the people talking about them on Twitter? What does the THATCamp look like on the Twitterverse?
I used NodeXL to retrieve the data – a search for tweets, people, and the links between them. I then visualized the data in Gephi, where colour = community (per Gephi’s modularity routine), and sized the nodes (individual Twitterers) using PageRank, on the premise that this was a directed graph and one should follow the links (although there was little difference with betweenness centrality; major players are still major, either way).
I found 233 individuals, linked together by 4435 edges. Some general stats on this directed network:
| Top 10 Vertices, Ranked by Betweenness Centrality | Betweenness Centrality |
| thatcamp | 10493.93299 |
| marindacos | 3381.65598 |
| amandafrench | 2530.589717 |
| openeditionsays | 2491.27153 |
| inactinique | 2183.450362 |
| briancroxall | 2093.876857 |
| piotrr70 | 2014.064889 |
| brettbobley | 1798.013658 |
| miriamkp | 1693.203103 |
| melissaterras | 1596.42585 |
| Top Replied-To in Entire Graph | Entire Graph Count |
| colleengreene | 4 |
| thatcamp | 3 |
| rosemarysewart | 2 |
| normasalim | 2 |
| janaremy | 2 |
| spagnoloacht | 1 |
| chuckrybak | 1 |
| lawnsports | 1 |
| ncecire | 1 |
| academicdave | 1 |
| Top Mentioned in Entire Graph | Entire Graph Count |
| thatcamp | 25 |
| piotrr70 | 25 |
| briancroxall | 17 |
| ncecire | 16 |
| spouyllau | 14 |
| thtcmpfeminisms | 10 |
| marindacos | 8 |
| dhlib2012 | 8 |
| thatcamprtp | 7 |
| goldstoneandrew | 6 |
| Top URLs in Tweet in Entire Graph | Entire Graph Count |
| http://leo.hypotheses.org/9506 | 26 |
| http://bit.ly/RyrPvA | 19 |
| http://tcp.hypotheses.org/609 | 19 |
| http://tcp.hypotheses.org/programme | 15 |
| http://rtp2012.thatcamp.org/apply/ | 12 |
| http://bit.ly/w1IFmR | 11 |
| http://dhlib2012.thatcamp.org/register/ | 10 |
| http://goo.gl/qJ185 | 10 |
| http://dhlib2012.thatcamp.org/ | 8 |
| http://bit.ly/RNHLKO | 8 |
| Top Hashtags in Tweet in Entire Graph | Entire Graph Count |
| thatcamp | 137 |
| mla13 | 27 |
| dh | 22 |
| tcp2012 | 17 |
| thatcampsocal | 12 |
| dhlib2012 | 9 |
| unconferences | 7 |
| thatcamptheory | 6 |
| digitalhumanities | 6 |
| tcny2012 | 6 |
And now the visualization. You can download the zoomable pdf here.
Looking at the modularity in this graph, at first blush you can see quite a North American / European divide, with various satellite outposts. This could of course be because there’s a THATCamp Paris coming down the pipe (lots of French in the tweets).
Mining a Day of Archaeology
The Day of Archaeology is modeled after the Day of Digital Humanities. Archaeologists from around the world take a few moments to blog about what they’re doing, right now. This year, it was on June 29th. It’s a fascinating window into a fascinating profession. As an archaeologist-cum-digital-humanities person, the obvious thing to do with all of this info (over 700 individual archaeologists; over 300 individual posts of some 250-500 words each, at least) is to mine it, to analyze it, to topic model it. What are the discourses of practicing archaeologists?
- The first thing is to collect all of the information. I’m using OutWit Hub to scrape every post. Now scraping can be morally dubious, but happily the organizers of DoA and all of its contributors agree to a Creative Commons attribution licence. Designing a scraper involves looking at the source code for the page, figuring out the page structure, and identifying the tags that enclose the information that one wants to collect. Then, OutWit can be sent forth to work through each page in succession. The resulting information can then be exported into Excel; I send it over as a csv file so that I can then do further work with it. (The CSV file may be downloaded here)
- I then use a macro in Excel to save each individual row as an individual text file, into a separate folder. This folder can be zipped and uploaded into Voyant Tools.
- Finally, I can point the Mallet Java GUI at the original csv file and topic model all of the posts (400 iterations, with 40 topic words and topic proportion threshold 0.05, for 20 topics); I then visualize the interrelationships of the topics using Gephi.
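For those who prefer MALLET’s command line to the GUI, the settings in the last step might translate roughly as below. I ran the GUI, so the mapping of its options onto these flags is my assumption; the folder name `doa_posts`, the output file names, and the `bin/mallet` path are placeholders. The function only builds the two commands and does not run anything.

```python
def mallet_commands(text_dir, num_topics=20, iterations=400,
                    top_words=40, threshold=0.05, mallet="bin/mallet"):
    """Build argument lists for MALLET's import and training steps."""
    # Step 1: turn a folder of .txt files into MALLET's internal format.
    import_cmd = [
        mallet, "import-dir",
        "--input", text_dir,
        "--output", "corpus.mallet",
        "--keep-sequence",
        "--remove-stopwords",
    ]
    # Step 2: train the topic model with the settings described above.
    train_cmd = [
        mallet, "train-topics",
        "--input", "corpus.mallet",
        "--num-topics", str(num_topics),
        "--num-iterations", str(iterations),
        "--num-top-words", str(top_words),
        "--doc-topics-threshold", str(threshold),
        "--output-topic-keys", "topic_keys.txt",
        "--output-doc-topics", "doc_topics.txt",
    ]
    return import_cmd, train_cmd


# "doa_posts" is a placeholder folder of one-post-per-file text files.
import_cmd, train_cmd = mallet_commands("doa_posts")
```

Each list can be handed to `subprocess.run(cmd, check=True)` from the MALLET directory. Changing `num_topics` and re-running is then a one-line change, which makes it easier to compare different numbers of topics.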
In this post, I’m going to provide you with links to the data in various tools, and give a first pass over the data. Then, why not play with this info for yourself, and see what you find? Perhaps we can all crowdsource an article out of this; comments and findings in the comments please!
Let’s begin, shall we?
This is what I find. First, the topics (related files here):
List of Topics
Archaeologists love the communities they work in and the people that they work with. This I think is evident from the number of topics that could be labelled ‘public archaeology’, like #20, 14, and 11. You can click through the topics above to read the DoA posts that are composed of these topics; it will indicate to what percentage a given post is composed of the various topics. From this, you can begin to choose your own adventure through the day of archaeology.
I can also take that information, and represent it as a network where each document is linked to its highest percentage topic. Keeping in mind all of the caveats that such an approach entails (see Scott Weingart’s salutary warnings), we end up with a map of the mental geography of topics to posts; a mental geography of archaeological discourse. Interestingly, the top three topics holding it all together are 13, 17, and 10. The first two would seem to be topics related to the mundane every day tasks that archaeologists do; topic 10 seems to relate to how we teach the discipline.
The Gephi file may be downloaded here, so that you can explore this data for yourself. I ran the modularity routine to detect any ‘communities’ of thought in the topics/posts. The colours in the image below correspond to community; the size of the node relates to betweenness. In the Gephi file, you can filter the data table by ‘modularity’ to see which posts and topics are in what community. According to this routine, there are roughly 13 communities of thought across 335 posts. Where does your post fit in?
Voyant Tools
I’ve also uploaded all of the posts into Voyant Tools for text analysis. Obviously, ‘archaeology’ and its derivatives will skew things a bit. But let’s see what we find. “There are 335 documents in this corpus with a total of 156,396 words and 15,100 unique words” says Voyant. We’ve got a wide vocabulary, folks! But in the spirit of Stephen Ramsay’s algorithmic reading, what are the surprises? What do we see when we deform an entire corpus of text in this manner? It’s worth pointing out that you ought to open the corpus in Voyant in Chrome, as sometimes Firefox trips, crashes, and burns.
Let’s extract named entities from the text, and stitch them together based on appearance in the same post. You get the following (remember, open in Chrome for best results: http://voyant-tools.org/tool/RezoViz/?corpus=1341853693115.3474 ). If you mouse over an entity, it highlights all others to which it is connected. You can also fiddle with the settings to show more or fewer connections.
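The stitching-together step is easy to reproduce once you have a list of entities for each post. The sketch below is not RezoViz’s own code, and it skips the entity extraction itself; the entity lists are invented.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(entities_by_post):
    """Count how many posts each pair of entities shares."""
    edges = Counter()
    for entities in entities_by_post:
        # Sort and de-duplicate so (A, B) and (B, A) count as the same tie.
        for pair in combinations(sorted(set(entities)), 2):
            edges[pair] += 1
    return edges


# Invented entity lists, one per post.
posts = [
    ["York", "English Heritage", "York"],
    ["York", "English Heritage", "Pompeii"],
    ["Pompeii"],
]
edges = cooccurrence_edges(posts)
```

The counts become edge weights: here ‘English Heritage’ and ‘York’ are tied with weight 2 because they appear together in two posts.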
You can also do a principal components analysis on the word frequencies. In the image below, all instances of ‘day’ and derivatives of ‘archaeology’ have been excised, to make the patterns clearer (try for yourself here).
So – what patterns do you notice? What strikes you as odd and in need of explanation?
Mining the Open Web with ‘Looted Heritage’ – draft
What follows is a draft of a paper written in conjunction with Robert Blades concerning the Looted Heritage project.
Introduction
In his overview of what ‘open access’ might mean in the academy, Peter Suber draws attention to the salient features of what it means to call something ‘open’ – that it is digital, the cost (to the reader) is free, and most copyright or similar legal restrictions are relaxed (Suber 2012). In this paper, we describe ‘Looted Heritage’, a developing digital archaeology project and its early results that explore ways of leveraging open content, of dealing with the firehose of data that comes when materials can be freely collected and examined. We focus not on the academic open access movement, but rather on the torrent of archaeological materials shared through social media streams such as Twitter and blogs. We focus on user-generated content surrounding the trade in illicit antiquities and reports of looting, and explore the patterns in this data: not just what is shared, but why.
In a way, our approach is the inverse of ‘crowdsourcing’. To crowdsource something, whether a problem of software development, or the need to transcribe historical documents, is generally to fracture a problem into its component pieces, allowing an interested public to solve them. In archaeology, such approaches are starting to find currency in everything from funding fieldwork (Morgan 2011) to the entire excavation and its subsequent interpretation (Wilkins et al. 2012; Wilkins, B 2012). In 2011 I and my students embarked on a project to crowdsource the idea of ‘sense of place’, using an open-source software platform to solicit and collect community memories about cultural heritage resources in Central Canada (Graham, Massie and Feuerherm 2011). One of our findings in that project concerned the order of operations that should be followed, that perhaps it is better to collect what is freely available first, before asking the crowd to fill in the gaps (Graham, Massie, and Feuerherm 2011).
Accordingly, we set up a data-trap, to collect the tiny pieces out on the open access web. We then study these pieces using data mining and text analysis to develop a picture of what is happening right now. It is a kind of digital excavation, and what we are excavating is the world of social media. We then put all of our data, and our analysis, online to allow others to fill in the gaps. When we mine the open web for information about looted cultural heritage, what are the discourses? What are people saying, does what they say change over time, and do these trends and this exercise hold any lessons for us as archaeologists?
Social Media
The business model for many popular social media websites/services is built on allowing users to connect with other users, selling this data onwards to marketers. The microblogging website Twitter caches every ‘tweet’ (short messages of 140 characters) after approximately two weeks and sells them. The marketers then mine this data looking to predict the next big thing, or to understand the public perception of their product (Barnett 2012). This material can be considered ‘open’ as long as one is looking for it during that period before it is cached. Williams, Terras, and Warwick (in press) recently completed a meta-study of over 550 academic articles that focused on Twitter. Of these, the researchers identified 53 studies that relate to mining Twitter content. These ranged from using tweets to offer better personalized news recommendations (Abel, Gao, Houben, and Tao 2011) to predicting flu trends (Achrekar, Gandhe, Lazarus, Yu, and Liu 2011; Chew and Eysenbach 2010) to attempts to predict the future (Asur and Huberman 2010) and the stock market (Bollen and Mao 2011). (The full database of Twitter-related research developed by Williams et al. will be appearing online, Terras pers. comm.).
The other facet of user-generated content that we wish to mine is the world of archaeological blogging. Blogging, it should be noted, is not a genre of writing, but rather a platform for writing and for the rapid dissemination of material onto the web. Nevertheless, the caricature of blogging is that it is the narcissistic shouting into the void about narrow, meaningless, ephemera; that it is ‘noise’ in contrast to the strong ‘signal’ that a peer-reviewed journal might provide. How then can anything useful be found in this open environment? Until the advent of Google, there was no good answer to this question. Google is not a search engine, nor a catalog, nor an index: it is a massive experiment in prediction. Google benefits from the billions of searches that we the users perform every week. In essence, we are teaching the machine what is useful when we select one result out of the millions provided. Google observes this. Google uses over 200 such signals to match useful information to each individual user, who each have their own idea of what constitutes ‘useful’ (Levy 2010).
Blogging as a medium creates strong signals. Academic blogs tend to have a very tight focus (notable examples are Bill Caraher’s New Archaeology of the Mediterranean World and Colleen Morgan’s Middle Savagery). They are updated fairly regularly, as the academic incorporates them into his or her work cycles. They focus on a comparatively narrow range of topics, and are thus semantically tight. The anchor text for links tends to be rather unique combinations of words, and thus provides more signal to Google’s algorithm. A static, rarely-updated website (like many academic department websites) does not provide strong signals, and thus is not often returned in search results. Blogs and other high-signal sites like Wikipedia are displayed first. Research shows that most users never look further than the first few results provided by any search engine (Jansen and Spink 2006: 260). To the wider world, only that which is blogged, tweeted, or written about on Wikipedia, exists; that which is hidden behind a paywall, does not.
Data Collection Methods
In practice the web is infinite. Our project attempts to monitor that slice of it which is open, accessible, and taking place on Twitter, on blogs, using RSS feeds, automatic news aggregators, and other web 2.0 tools (cf. Kansa and Kansa 2011). We use an integrated environment for marshalling this data called ‘Ushahidi’. The word ‘ushahidi’ is a Swahili word meaning ‘testimony’. Ushahidi was developed in Kenya in 2008 to map reports of violence after the bitterly contested elections of December 2007 (Ushahidi 2012). It accepts information submitted via web form, email, and cell-phone short-message-service. It can also be used to collect RSS feeds and to trawl Twitter, copying tweets that contain particular keywords.
We set Ushahidi to search Twitter for #looting, #antiquities, #looted, #illicit. The search will also turn up results without the # symbol; the convention on Twitter however is to indicate descriptive keywords for one’s ‘tweet’ by using the # symbol in conjunction with the keyword. This allows for more effective searching, and for users to follow developing conversations even if they themselves do not follow all the participants in the conversation. We are subscribed to feeds from Art Theft Central; Conflict Antiquities; Illicit Cultural Property; Looting Matters; and Saving Antiquities for Everyone. We also have a saved search at Google News that returns items based on the keywords ‘looted antiquities’.
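The matching behaviour described here, where a keyword counts whether or not it carries the # symbol, can be illustrated with a few lines of Python. This is a sketch of the idea only and says nothing about how Ushahidi implements its Twitter search; the sample tweets are invented.

```python
import re

# The four search terms listed above, without the '#'.
KEYWORDS = ("looting", "antiquities", "looted", "illicit")


def matched_keywords(tweet, keywords=KEYWORDS):
    """Return the keywords found in a tweet, with or without a leading '#'."""
    # Capture each word, ignoring an optional leading '#'.
    words = set(re.findall(r"#?(\w+)", tweet.lower()))
    return [k for k in keywords if k in words]


# Invented tweets for illustration.
hit = matched_keywords("Police recover #looted antiquities in raid")
miss = matched_keywords("Lovely weather today")
```

Only whole words match, so ‘looted’ and ‘looting’ have to be listed separately, as they are in the search terms above.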
As of April 12, 2012, we have over 1300 items in the queue from these feeds with approximately 50 to 100 new items appearing each day – firehose indeed! In the first quarter of 2012 we have culled 207 reports from this stream. Ushahidi is also a form of simple GIS, wherein each report is also categorized and tied to a geographical location.
Analytical Methods
We use the techniques of text-analysis and topic modeling. Digital text analysis has a long tradition in what is now called ‘digital humanities’, emerging out of efforts to systematize the generation of concordances and vocabulary counts (see Hockey, 2004 for an overview). We use the ‘Voyant’ online tool (formerly ‘Voyeur Tools’, Rockwell and Sinclair, 2012; Sinclair and Rockwell 2009) to explore word use in our texts. Because this tool is online, we can share this step in our analysis with others by providing a unique URL to our corpus (see table 1). We loaded our reports in chronological order into Voyant so that we could examine word use over time through simple frequencies (see for instance Burrows, 2004 for the wide variety of approaches to which text analysis may be put).
| Looted Heritage: Monitoring the Illicit Antiquities Trade | http://heritage.crowdmap.com |
| Full Corpus of Reports loaded into Voyant Tools | http://j.mp/looted-heritage-reports |
| A guide to the Voyant interface | http://docs.voyant-tools.org/standard-ui-elements/ |
| Full output from MALLET Topic Modeling algorithm in csv and html format | http://j.mp/graham-blades-dataset |
| Full Corpus of Reports loaded into Voyant Tools Frequency Tool | http://j.mp/looted-heritage-word-frequencies |
| Visualizing the patterns of social connections in the corpus of reports | http://j.mp/looted-heritage-rezoviz |
Table 1. Internet Location for tools and datasets referenced in the text.
Voyant also has a tool called RezoViz which extracts named persons from documents, and links them together on the basis of occurrence in the same document. With this tool, Voyant becomes a tool for data discovery of social networks. However, it is still in ‘alpha’ meaning that not all of the idiosyncrasies of the code have been completely solved. Nevertheless, given the nature of our data, it is a useful tool to begin to understand who the key players might be, tracking them over time and space. One might then use this data to refine the social media searches, for instance.
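The linking step described above can be sketched in a few lines of Python. This is only an illustration of the general idea of tying names together by co-occurrence; it is not RezoViz's own code, and the function name `cooccurrence_edges` and the three sample documents are hypothetical. It assumes the names have already been extracted from each report.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(doc_names):
    """Count how often each pair of names appears in the same document."""
    edges = Counter()
    for names in doc_names:
        # sorted(set(...)) removes repeats and makes (a, b) and (b, a)
        # the same undirected edge
        for pair in combinations(sorted(set(names)), 2):
            edges[pair] += 1
    return edges

# Hypothetical input: names already extracted from three reports
docs = [
    ["Giacomo Medici", "Marion True"],
    ["Marion True", "Giacomo Medici", "Jason Felch"],
    ["Zahi Hawass"],
]
edges = cooccurrence_edges(docs)
```

The resulting counts can serve as edge weights; a name that never shares a document with anyone else simply produces no edges.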
We then explore the texts for deeper structure, using ‘topic modeling’ (a Bayesian statistical approach formally called ‘latent Dirichlet allocation’, Blei, Ng, and Jordan 2003; Underwood 2012a; Weingart 2011). Topic modeling determines collections of words that occur in semantically meaningful ways in different proportions within a text. As Ted Underwood puts it, ‘Topic modeling is a way of extrapolating backward from a collection of documents to infer the discourses (“topics”) that could have generated them’ (Underwood 2012b). It begins with simple frequencies of words, but also considers the way a particular word is used in some documents, but not others. The algorithm introduced by Blei et al. (2003) assumes that for any possible topic, a word has some probability of being part of that topic: it multiplies the frequency of this word in this topic by the number of words in the document that already belong to the topic. The result is a probability that the word actually belongs to that topic (Underwood 2012b). This is an iterative process that begins from a random position. As the algorithm cycles to produce a best fit, words are gradually sorted into ‘topics’, and ‘topics’ into documents. As Underwood emphasizes however, these are not ‘topics’ as one might understand from a book index. Rather, they might be better thought of as discourses (2012a, 2012b).
We use the ‘Machine Learning for LanguagE Toolkit’ (MALLET, McCallum 2002) and its implementation of the algorithm. MALLET is open source software that runs from the computer’s command line. A Java based graphical user interface (GUI; Newman and Balagopalan 2011) used in tandem with MALLET makes it easier to run the algorithms and to select and manage one’s data. MALLET outputs a series of comma separated files that give a breakdown of ‘topics in documents’ and ‘documents in topics’ and ‘key words in topics’. The GUI produces a series of webpages that allow one to explore the results by clicking through documents, topics, and the linkages between them. This output has been deposited with the data repository Figshare and may be downloaded and explored (see table 1).
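The multiplication described above can be written out as a rough sketch. This follows the usual sampling form of the calculation rather than MALLET's actual code; the function name, the counts in the example, and the smoothing values `alpha` and `beta` are all assumptions made for illustration.

```python
def topic_probabilities(word_topic_counts, topic_totals, doc_topic_counts,
                        vocab_size, alpha=0.1, beta=0.01):
    """Probability that one word in one document belongs to each topic.

    word_topic_counts[t]: times this word is currently assigned to topic t
    topic_totals[t]:      total words currently assigned to topic t
    doc_topic_counts[t]:  words in this document currently assigned to topic t
    alpha, beta:          small smoothing values (assumed, not the authors' settings)
    """
    weights = []
    for t in range(len(topic_totals)):
        # how common the word is within topic t
        word_in_topic = (word_topic_counts[t] + beta) / (topic_totals[t] + vocab_size * beta)
        # how much of the document already belongs to topic t
        topic_in_doc = doc_topic_counts[t] + alpha
        weights.append(word_in_topic * topic_in_doc)
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical counts for a single word and two topics
probs = topic_probabilities([8, 1], [100, 100], [5, 1], vocab_size=50)
```

In each cycle the word is reassigned to a topic drawn according to these probabilities, which is how the initially random assignment gradually settles.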
We set the algorithm to iterate 400 times as it converged on the best solution, assuming 15 topics. It ignored a preset list of ‘stopwords’ (‘the’, ‘and’, ‘of’ etc) that tend to obscure the patterns we wish to find. There is no way of predetermining the ‘best’ number of topics. Instead, one runs the analysis again and again, looking at the resulting composition of documents, looking for a distribution of topics to documents that does not clump too heavily. In general, we began by searching for five topics and increased until we hit 15 topics, which seemed to capture the variety well (one or two topics did not account for the majority of documents, for instance).
To get a sense of what the ‘topics’ found by the algorithm might mean, one examines the document composition output, looking for those documents where the topic occurs with the highest probability. We can then suggest a descriptive label, a meaningful ‘topic’ in human terms (cf Nelson 2011) or try to understand the list of words as a kind of discourse (Underwood 2012a,b). The composition output can also be visualized as a kind of network, where there are two kinds of nodes, documents and topics. The composition probability gives a weight to the strength of the connection (Meeks 2011). We use the Gephi open source network visualization software package to visualize these patterns (Bastian, Heymann, and Jacomy 2009). We can identify, in network terms, reports or topics that seem to be most crucial for keeping the network together. We can also calculate statistics on this network which give us an indication of the ‘communities’ of reports (ie, subnetworks) that seem to have similar patterns of composition. Topic modeling the reports for the underlying structural patterns of word use gives us insight into the ideological realm; visualizing this output as a network makes those patterns clearer.
Limitations of the Data
Obviously, what we search for dictates what we will find. These reports represent the concerns of the anglophone, Western world. To what degree is Twitter representative of other places and cultures? In China, the important social media player is Weibo. There are 500 million Chinese online; Weibo grows at the rate of approximately 10 million a month (DeWoskin 2012; Chen 2012). Currently on ‘Looted Heritage’, China is a comparative blank spot. Similarly the Spanish speaking nations are not well represented, nor is sub-Saharan Africa.
The other major blindspot concerns the traditional auction houses and newer entrants such as eBay. Online anonymous auctions make it easy for low-level, low-value antiquities to be bought and sold. On March 1st 2012 we scraped the RSS feed from eBay.ca for items tagged as Roman antiquities. We found over 400 items being sold, with a combined listed value of over $C 48,000. The median value of these items was approximately $C 20. We elected not to create reports of items listed for sale on eBay, from a desire to not encourage their sale.
Stanish points out that a great deal of what passes for an antiquity on eBay might well be a fake and so eBay, by flooding the market, is in fact preserving antiquities (Stanish 2009). Nevertheless the technology of eBay facilitates a trade in extremely portable antiquities like coins and brooches, and the search for these can be just as destructive as that for ‘high-end’ antiquities. The British Museum and the Portable Antiquities Scheme reached an understanding in 2006 to monitor eBay.co.uk for antiquities which might fall under the rubric of the Treasure Act (British Museum, 2006). In France, a group of archaeologists attempts to monitor eBay.fr to persuade it to remove antiquities from the lists (Champault, pers. comm.). By Champault’s reckoning, some 75,000 euros’ worth of sales have been cancelled due to their efforts.
In future iterations of ‘Looted Heritage’, we will be working to incorporate data from eBay as we suspect that monitoring and mapping the locations of sellers of sudden assemblages of small finds could point the way to tracking the field of operations of pot hunters and subsistence looting.
Findings – Text analysis
Frequencies are normalized in terms of keyword per 10 000 words over the corpus. ‘Museum’ is an obvious word with which to begin. The major spikes in its frequency correspond with thefts-to-order from the Museum at Olympia in Greece. ‘Looting’ occurs with high frequency in December and January, is virtually silent until the end of February, spikes enormously in early March, and thereafter percolates nicely until the end of the data (Figure 1). That there should be seasonality to looting is not surprising (presumably looters are not impervious to the weather) but it is gratifying to see this reflected in the materials trawled from the web. The spike in the frequency of ‘looting’ in early March seems extreme though, compared to the rest of the trend. It corresponds in our data with the release of four videos. The spike in word frequency perhaps can be explained by the relative shortness of the text that accompanies these videos, skewing the proportions.
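For readers who want to check a frequency by hand, the normalization can be sketched as below. The tokenizing rule here is our own simplification and an assumption; Voyant's tokenizer may split words differently, so small differences from its figures are to be expected.

```python
import re

def per_10k(text, keyword):
    """Occurrences of keyword per 10 000 words of text."""
    # Simplified tokenizer (an assumption): lower-case runs of letters and apostrophes
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return words.count(keyword.lower()) * 10000 / len(words)

# Toy example: 'museum' appears twice in a five-word text
rate = per_10k("Looting of the museum museum", "museum")
```

A short text inflates the rate quickly, which is the skewing effect suggested above for the brief texts that accompany the videos.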
[insert figure 1 about here]
One can compare multiple words at the same time. For instance, ‘art’ and ‘objects’ on the face of it should often go together. Figure 2 graphs the relative frequency of both words. At many points in December and January, this indeed seems to be the case. Where they differ often seems to be in cases of repatriation (the report speaks of ‘art’) whereas in cases of theft, the language tends to use ‘objects’. Do antiquities only become ‘art’ once they’ve been stolen, or displayed in a museum? There is perhaps an interesting discourse hinted at here to be explored concerning the semantics of looting and cultural heritage crime.
[insert figure 2 about here]
Turning to the RezoViz module (see table 1), figure 3 displays 25 individuals and their connections, as extracted from the reports. It appears to display names according to their frequency in the underlying documents. ‘Simon MacKenzie’ and ‘Neil Brodie’, scholars from the University of Glasgow who investigate the illicit antiquity trade are represented and tied together, by virtue of newspaper and blog articles reporting on a grant the two scholars received. Similarly, a cluster of individuals is focused on the nexus of the ‘Getty Museum’, ‘Giacomo Medici’, and ‘Marion True’. The investigative reporters Jason Felch and Ralph Frammolino, who wrote about the Getty’s connection to Medici and other aspects of its collecting history (2011), are also tied to that cluster, as is the scholar David Gill, who has written on his blog about the Getty. If one slides the ‘items’ slider to the right, more individuals are added to the display. Figure 4 displays 125 individuals and their connections; individuals named in the same reports as Zahi Hawass (former Minister of State for Antiquities, Egypt) are highlighted, for the sake of illustration (interestingly, Napoleon Bonaparte is also part of this subnetwork!). When this tool is formally incorporated into Voyant perhaps it will allow these networks to be exported to formal social network analysis tools so that the data may be cleaned and explored more rigorously. Who for instance is most central? Who is in a position of power, in terms of their ability to influence opinion or control information? Is it a complete network in the sense that one can chart a path between any pair of individuals, or are there isolated areas? Are there sub-communities within the graph?
[insert figures 3 & 4 about here]
Findings – Topic Modeling
The topics extracted by MALLET from the text of the reports, together with their associated key words and a descriptive label, are listed in table 2.
| Topic # | Key words | Possible descriptive label |
| 1 | museum objects getty art antiquities museums italy returned italian princeton potts collection university acquired collections director true investigation roman works curator aphrodite exhibition greek almagi | Museum ethics |
| 2 | project archaeologists research university community national heritage german dr state archaeology nigeria association development museums looting nok gundu nigerian work evidence news claims local communities | Archaeological ethics |
| 3 | turkey mosaics university art turkish made dealer information london find late dealers bgsu roman purchase received history collection center owner pa recently december expect | Repatriation & university museums: Old world |
| 4 | antiquities years world trade international countries past illicit looted director make future material year provide long information report policy local questions including part provenance study | Provenance |
| 5 | ancient artifacts city people back place return time large left iraq ruins discovered don experts set country great peru added east officials built excavated bank | Repatriation & university museums: New world |
| 6 | national history show archaeologist sites american artifacts treasure archaeology archaeological state historical based law wrote tv trafficking lost historic illegally malter park shows property years | Television |
| 7 | museum stolen culture ministry museums ancient security government minister pieces art country items head gallery olympia told guards thieves reported guard artefacts made building back | Museum thefts |
| 8 | hawass british syria war loot syrian gold treasure found foreign part zahi augustine french display funds army mubarak brought men japan return caught | War & antiquities |
| 9 | statue looted auction government sotheby cambodia statement states private york sale united case cambodian law legal officials million stolen sell house piece property sold mask | Auction houses |
| 10 | public committee aaa president letter funding learn coin day policy images arts research image human half recently item montreal king affairs read request report statement | Identifying looted antiquities |
| 11 | archaeological cultural site sites looting heritage antiquities country department authorities excavation important objects state international illicit protection red list area destruction property dealers general mali | Fighting the illicit antiquities trade |
| 12 | antiquities items stolen documents artifacts charges rare include church artefacts case golan jerusalem jewish court israel allegedly heart gang age police st accused trial believed | Theft from historic sites |
| 13 | egypt egyptian el antiquities council sites people hibeh cairo al foreign revolution archaeologists including digging work afghanistan police dr stop period article group rich supreme | Egypt |
| 14 | police greece theft year greek coins antiquities crime national metal athens crisis small years smuggling found arrested including century thieves work months month bronze members | Greece |
| 15 | art market high china chinese goods relics number stone paintings works bronze million palace dynasty jade fake thousands global collectors village fine summer antiques antique | China |
Table 2. Keywords in topics.
While we can give descriptive labels as a short hand for the ‘topic’, we can also imagine these as discourses. We can imagine that Topic 1 concerns discourse surrounding the Getty Museum. Similarly, Topic 7 should concern the Olympia Museum, while Topic 10 with the words ‘Montreal’ and ‘AAA’ (American Anthropological Association) concerns a North American discourse around museums and the identification of artefacts. Topic 6 is clearly connected with the controversy generated by Spike TV and the National Geographic Channel’s recent television programs connected with metal detecting. It is interesting that the topic modelling routine seems to tease out two distinct strands surrounding discourses of ‘repatriation’, split between old and new world, as well as discourses surrounding particular ‘source’ nations.
MALLET outputs a table indicating the relative percentage that each topic contributes towards the composition of each report. We can consider these percentages as a weighting of a link between reports and topics, and thus we can represent the ‘topic-space’ as a kind of network map. We extracted a list of reports to constituent topics, capturing at least 50% of each report’s composition (including every report’s complete breakdown would render the resulting map unintelligible).
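One way to read ‘capturing at least 50% of each report’s composition’ is to take each report’s topics from strongest to weakest until their shares add up to half. The sketch below follows that reading, which is an assumption about the procedure; the function name and the sample figures are hypothetical.

```python
def top_topic_edges(composition, threshold=0.5):
    """Keep each report's strongest topics until they cover the threshold."""
    edges = []
    for report, shares in composition.items():
        covered = 0.0
        # strongest topics first
        for topic, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
            edges.append((report, topic, share))
            covered += share
            if covered >= threshold:
                break
    return edges

# Hypothetical composition table: report -> {topic label: share}
composition = {
    "report_1": {"Greece": 0.45, "Museum thefts": 0.30, "Egypt": 0.25},
    "report_2": {"China": 0.70, "Provenance": 0.30},
}
edges = top_topic_edges(composition)
```

Each tuple is a weighted link between a report and a topic, a shape that can be written out as an edge list for a network tool.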
The resulting network visualization has 222 nodes (representing reports and topics) and 887 links between them. Gephi uses an algorithm called ‘modularity’ developed by Blondel et al. (2008), to identify subgroups (‘module’ being a synonym for ‘community’). Modules are based on the similarity of linkages. In this routine, a modularity score closer to 1 indicates a stronger division into communities. We ran modularity 200 times on this data; the best results clustered around 0.216 and tended to produce six communities (table 3):
| Group 1 | Identifying Looted Antiquities; China |
| Group 2 | Repatriation & university museums: old world materials; Museum Ethics; Auction Houses |
| Group 3 | Television; Archaeological Ethics |
| Group 4 | War & Antiquities; Fighting the Illicit Antiquities Trade; Egypt |
| Group 5 | Provenance; Repatriation & University Museums: new world materials |
| Group 6 | Greece; Museum Thefts; Theft from historic sites |
Table 3. Grouping topics into ‘communities’.
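To make the modularity score less of a black box, here is a minimal sketch of how the score is calculated for a grouping that has already been chosen. It does not search for the best grouping, which is the part Gephi’s routine performs; the function name and the toy network are hypothetical.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity score of a given partition of an undirected weighted network.

    edges:     list of (node_a, node_b, weight)
    community: dict mapping each node to its community label
    """
    m = sum(w for _, _, w in edges)   # total edge weight
    internal = defaultdict(float)     # edge weight falling inside each community
    degree = defaultdict(float)       # summed node strength per community
    for a, b, w in edges:
        degree[community[a]] += w
        degree[community[b]] += w
        if community[a] == community[b]:
            internal[community[a]] += w
    # observed share of within-community weight minus the share expected by chance
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Toy network: two separate pairs of nodes
toy_edges = [("a", "b", 1.0), ("c", "d", 1.0)]
split = modularity(toy_edges, {"a": 0, "b": 0, "c": 1, "d": 1})
lumped = modularity(toy_edges, {"a": 0, "b": 0, "c": 0, "d": 0})
```

Putting every node into one community always scores zero, so a positive score means the grouping holds more within-community weight than chance would predict.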
To make the resulting visualization more intelligible, we filtered out individual linkages weighing less than 20% (Figure 5). The colours represent the partition of the network into the communities detected through the modularity routine. We can then measure the network to identify those reports or topics that are positioned on the most paths between any two other items, a measurement called ‘betweenness centrality’. In a sense, what we have produced here is a map of the idea-space surrounding the illicit antiquities trade. The links between nodes are thicker the stronger the tie between the topic and the report. There appears to be a current of thought which runs from ‘museum thefts’ to ‘Greece’, to ‘theft from historic sites’ more generally. Another runs from ‘Egypt’ to ‘fighting the illicit antiquities trade’ to ‘provenance’. The one isolated topic that never ties in (except in the most tenuous of ways) is ‘television’, a body of reports connected with the outrage over shows that are seen in the professional community to be promoting pot hunting as a glamorous recreational pursuit.
The ‘betweenness centrality’ metric also directs our attention to particular reports. One such is Report 161 from January 26th, a report about the looting of a church. In fact, a spate of thefts from churches and other historic sites occurred in late February and early March. Is this a new trend?
[insert figure 5 about here]
Conclusion
In this paper, we demonstrate a workflow for collecting open-access materials created by individuals, academics, and the press who use social media to communicate information about the trade in illicit antiquities. We then analyze the text of these reports for both superficial and deep patterns. Ideally, as more and more information gets collected, we will be able to spot and understand underlying trends in the world antiquities market. We would also like to include the work of archaeologists and scholars published in journals, to provide the deeper insight and reflection that this just-in-time approach currently lacks. Alas, the majority of these works are not available to be analyzed and mined this way. We also understand that ‘open access’ could mean that our underlying data should be available for others to study or repurpose for themselves, whether to extend on our study or to contest it. We have deposited all of our materials in online repositories to encourage just that. By providing this material digitally, we accelerate the tedious process of corpus building. In this way, open access becomes about generating a discussion, and building upon each other’s work.
Bibliography
Abel, F, Gao, Q, Houben, G-J, Tao, K. 2011. Analyzing user modeling on Twitter for personalized news recommendations. Lecture Notes in Computer Science, 6787: 1-12.
Achrekar, H, Gandhe, A, Lazarus, R, Yu, S-H, Liu, B. 2011. Predicting flu trends using twitter data. In IEEE Conference on Computer Communications Workshops. Shanghai: INFOCOM WKSHPS, pp. 702-707.
Asur, S, Huberman, B.A. 2010. Predicting the future with social media. HP Laboratories Technical Report, no. 53.
Barnett, E. 2012. Twitter sells tweet archive to marketers. The Telegraph [online] February 28th. Available at: <http://www.telegraph.co.uk/technology/twitter/9110943/Twitter-sells-tweet-archive-to-marketers.html> [Accessed April 18, 2012].
Bastian M., Heymann S., Jacomy M. (2009). Gephi: an open source software for exploring and manipulating networks. International AAAI Conference on Weblogs and Social Media. [online] Available at: <https://gephi.org/> [Accessed April 23 2012].
Blei, D., Ng, A., & Jordan, M. 2003. Latent dirichlet allocation. The Journal of Machine Learning Research 3:993-1022.
Bollen, J., Mao, H. 2011. Twitter mood as a stock market predictor. Computer, 44.10: 91-94.
British Museum. 2006. eBay partners with British Museum and Museums, Libraries and Archives Council to protect British treasures. [press release] Available at: <http://www.britishmuseum.org/the_museum/news_and_debate/press_releases/2006/ebay_partnership.aspx> [Accessed April 15, 2012]
Burrows, J. 2004. Textual Analysis. In A Companion to Digital Humanities. (ed. S. Schreibman, R. Siemens, and J. Unsworth). Oxford: Blackwell. Available at: <http://www.digitalhumanities.org/companion/index.html> [Accessed April 18, 2012].
Caraher, B. 2010-2012 The New Archaeology of the Mediterranean World [blog]. Available at:<http://mediterraneanworld.wordpress.com/>[Accessed April 23 2012].
Chen, J. 2012. China’s Weibo Guru, Kai-fu Lee. Forbes [online] January 20th. Available at: <http://www.forbes.com/sites/china/2012/01/20/chinas-weibo-guru-kai-fu-lee/> [Accessed April 18, 2012].
Chew, C., and Eysenbach, G. 2010. Pandemics in the age of Twitter: Content analysis of tweets during the 2009 H1N1 outbreak. PLoS ONE, 5:11. [online] Available at: <doi:10.1371/journal.pone.0014118> [Accessed April 18, 2012].
Cohen, D., and J. Fragaszy Toryano (eds.) Journal of Digital Humanities Roy Rosenzweig Center for History and New Media [online]. Available at: <http://journalofdigitalhumanities.org/> [Accessed April 23 2012].
DeWoskin, R., 2012. East Meets Tweet. Vanity Fair [online] February 17 <http://www.vanityfair.com/culture/2012/02/weibo-china-twitter-chinese-microblogging-tom-cruise-201202> [Accessed April 18, 2012].
Durney, M. 2012. Art Theft Central. [blog] Available at: < http://arttheftcentral.blogspot.ca/> [Accessed April 23 2012].
Fincham, D. 2012. Illicit Cultural Property: A weblog about art, antiquities and the law [blog] Available at: < http://illicit-cultural-property.blogspot.ca/> [Accessed April 23 2012].
Gill, D. 2012. Looting Matters: Discussion of the archaeological ethics surrounding the collecting of antiquities [blog] Available at:< http://lootingmatters.blogspot.ca/> [Accessed April 23 2012].
Graham, S, Massie, G., and Feuerherm, N. 2012. The HeritageCrowd Project: A Case Study in Crowdsourcing Public History. In Writing History in the Digital Age (eds. J. Dougherty and K. Nawrotzki) Under contract with the University of Michigan Press. Trinity College (CT) web-book edition, Spring 2012, [online] Available at:<http://WritingHistory.trincoll.edu.> [Accessed April 23 2012].
Hardy, S. 2012. Conflict Antiquities: Illicit Antiquities Trading in Economic Crisis, Organised Crime, and Political Violence. [blog] Available at: <http://conflictantiquities.wordpress.com/> [Accessed April 23 2012].
Hockey, S. 2004. The History of Humanities Computing. In A Companion to Digital Humanities. (ed. S. Schreibman, R. Siemens, and J. Unsworth). Oxford: Blackwell. Available at: <http://www.digitalhumanities.org/companion/index.html> [Accessed April 18, 2012].
Jansen, B., and Spink, A. 2006. How are we searching the World Wide Web? A comparison of nine search engine transaction logs. Information Processing and Management 42.1: 248-263.
Kansa, Eric and Sara Kansa. 2011. Toward a Do-It-Yourself Cyberinfrastructure: Open Data, Incentives, and Reducing Costs and Complexities of Data Sharing. In Archaeology 2.0: New Approaches to Communication and Collaboration (eds. E. Kansa, S. Whitcher Kansa, and E. Watrall). Berkely: Cotsen Institute of Archaeology. [online] Available at: <http://escholarship.org/uc/item/1r6137tb> [Accessed April 20 2012].
Levy, S. 2010. How Google’s Algorithm Rules the Web Wired [online] February 22. Available at:< http://www.wired.com/magazine/2010/02/ff_google_algorithm/all/1> [Accessed April 23 2012]
McCallum, Andrew Kachites. 2002. MALLET: A Machine Learning for Language Toolkit. [online]. Available at:< http://mallet.cs.umass.edu> [Accessed April 23 2012].
Meeks, Elijah. 2011 Comprehending the Digital Humanities Digital Humanities Specialist Stanford University Libraries and Academic Information Resources[blog] February 19. Available at <https://dhs.stanford.edu/comprehending-the-digital-humanities/> [Accessed April 23 2012].
Morgan, C. 2006-2012 Middlesavagery [blog]. Available at:<http://middlesavagery.wordpress.com/ > [Accessed April 23 2012].
Morgan, C. 2011. Crowdsourcing Archaeology – The Maeander Project Kickstarter Page. Middle Savagery [blog] June 16. Available at: < http://middlesavagery.wordpress.com/2011/06/16/crowdsourcing-archaeology-the-maeander-project-kickstarter-page/> [Accessed April 23 2012].
Nelson, Robert K. 2011 Mining the Dispatch. Digital Scholarship Lab at the University of Richmond [online]. Available at <http://dsl.richmond.edu/dispatch/pages/about> [Accessed April 23 2012].
Newman, David and Alun Balagopalan, 2011. A graphical user interface tool for topic modeling. Google Code [online]. Available at:<http://code.google.com/p/topic-modeling-tool/> [Accessed April 23 2012].
Rockwell, G., and Sinclair, S. 2012. Voyant Tools: Reveal Your Texts, [online] Available at: http://voyant-tools.org/ [Accessed April 1st, 2012].
SAFE: Saving Antiquities for Everyone. 2012. Blog. [blog] Available at: <http://www.savingantiquities.org/> [Accessed April 23 2012].
Sinclair, S. 2009. The Rhetoric of Text Analysis. [online] Available at: <http://hermeneuti.ca/rhetoric>. [Accessed April 15, 2012].
Stanish, C. 2009. Forging Ahead, or, how I learned to stop worrying and love eBay. Archaeology, 62.3 [online] Available at <http://www.archaeology.org/0905/etc/insider.html> [Accessed April 12, 2012].
Suber, P. 2004. Open Access Overview. The SPARC Open Access Newsletter. [online] (Updated March 18 2012) Available at: <http://www.earlham.edu/~peters/fos/overview.htm> [Accessed April 17, 2012]
Underwood, T. 2011. Why humanists need to understand text mining. The Stone and Shell, [blog] May 29. Available at: <http://tedunderwood.wordpress.com/2011/06/29/why-humanists-need-to-understand-text-mining/> [Accessed February 27, 2012].
Underwood, T. 2012a. What kinds of “topics” does topic modeling actually produce? The Stone and Shell, [blog] April 1. Available at: <http://tedunderwood.wordpress.com/2012/04/01/what-kinds-of-topics-does-topic-modeling-actually-produce/> [Accessed April 14, 2012].
Underwood, T. 2012b. Topic modeling made just simple enough. The Stone and Shell, [blog] April 7. Available at <http://tedunderwood.wordpress.com/2012/04/07/topic-modeling-made-just-simple-enough/> [Accessed April 14, 2012].
Ushahidi. 2012, Ushahidi, < http://ushahidi.com/>.
Weingart, S. 2011. Topic Modeling and Network Analysis. The Scottbot Irregular, [blog] November 15. Available at <http://www.scottbot.net/HIAL/?p=221> [Accessed April 18, 2012].
Wilkins, B. 2012. Comment #1 on Graham, S. Digventures, Flag Fen, and Crowd-everything archaeology Electric Archaeology [blog] March 13. Available at <https://electricarchaeologist.wordpress.com/2012/03/13/digventures-flag-fen-and-crowd-everything-archaeology/#comments> [Accessed April 23 2012].
Wilkins, L., Wilkins, B., and Dave, R. 2012. Digventures: How it works. Digventures. [online] Available at: <http://digventures.com/how-it-works/ > [Accessed April 23 2012].
Williams, S., Terras, M., and Warwick, C. Forthcoming. What people study when they study twitter: Classifying Twitter Related Academic Papers. Journal of Documentation. Submitted 2012.
Play with the data from Looted Heritage

Visualizing the patterns of topic composition in 208 reports from Looted Heritage, first quarter of 2012.
Rob Blades (my student) and I are in the process of submitting an article concerning our Looted Heritage project. The gist of the article is a discussion of our workflow and the kinds of patterns that may be observed when data is available freely & openly. Ideally, this would include academic papers in journals. For the time being though, we focus on social media. We also try in the paper to include our reader in the exploration of the data. Rather than presenting static images, tables, graphs, and statistics, we put the onus on the reader to check our data for his or herself. Perhaps the reader will spot important patterns, which can then be discussed in another paper. Rather than the paper being the final end-point for our data, we want it to become a jumping off point instead.
In which case, to get the conversation started, you may find links to our dataset and our analytical tools below:
| Looted Heritage: Monitoring the Illicit Antiquities Trade | http://heritage.crowdmap.com |
| Full Corpus of Reports loaded into Voyant Tools | http://j.mp/looted-heritage-reports |
| A guide to the Voyant interface | http://docs.voyant-tools.org/standard-ui-elements/ |
| Full output from MALLET Topic Modeling algorithm in csv and html format | http://j.mp/graham-blades-dataset |
| Full Corpus of Reports loaded into Voyant Tools Frequency Tool | http://j.mp/looted-heritage-word-frequencies |
| Visualizing the patterns of social connections in the corpus of reports | http://j.mp/looted-heritage-rezoviz |
Converting 2 mode networks with Multimodal plugin for Gephi
Scott Weingart drew my attention this morning to a new plugin for Gephi by Jaroslav Kuchar that converts multimodal networks to one mode networks.
This plugin allows multimode networks projection. For example: you can project your bipartite (2-mode) graph to monopartite (one-mode) graph. The projection/transformation is based on the matrix multiplication approach and allows different types of transformations. Not only bipartite graphs. The limitation is matrix multiplication – large matrix multiplication takes time and memory.
After some playing around, and some emails & tweets with Scott, we determined that it does not seem to work at the moment for directed graphs. But if you’ve got a bimodal undirected graph, it works very well indeed! It does require some massaging though. I assume you can already download and install the plugin.
1. Make sure your initial csv file with your data has a column called ‘type’. Fill that column with ‘undirected’. The plugin doesn’t work correctly with directed graphs.
2. Then, once your csv file is imported, create a new column on the nodes table, call it ‘node-type’ – here you specify what the thing is. Fill it up accordingly. (cheese, crackers, for instance).
3. I thank Scott for talking me through this step. First, save your network; this next step will irrevocably change your data. Click ‘load attributes’. Under attribute type, select the column you created in step 2. Then, for left matrix, select Cheese – Crackers; for right matrix, select Crackers – Cheese. Hit ‘run’. This gets you a new Cheese-Cheese network (select the inverse to get a crackers – crackers network). You can then remove any isolates or dangly bits by ticking ‘remove edges’ or ‘remove nodes’ as appropriate.
4. Save your new 1 mode network. Go back to the beginning to create the other 1 mode network.
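If you want to see what the matrix multiplication is doing, here is a small Python sketch of the idea. It is not the plugin's code, just an illustration; the function name and the cheese and cracker names are made up.

```python
def project_one_mode(edges):
    """Project a two-mode (cheese, cracker) edge set onto a cheese-cheese network."""
    left = sorted({a for a, _ in edges})
    right = sorted({b for _, b in edges})
    # incidence matrix A: one row per cheese, one column per cracker
    A = [[1 if (a, b) in edges else 0 for b in right] for a in left]
    projected = {}
    for i, a in enumerate(left):
        for j in range(i + 1, len(left)):
            # entry (i, j) of A multiplied by its transpose:
            # the number of crackers the two cheeses share
            weight = sum(x * y for x, y in zip(A[i], A[j]))
            if weight:
                projected[(a, left[j])] = weight
    return projected

# Made-up two-mode network, as a set of (cheese, cracker) pairs
two_mode = {
    ("brie", "rye"), ("brie", "water"),
    ("cheddar", "rye"), ("cheddar", "water"),
    ("stilton", "oat"),
}
cheese_network = project_one_mode(two_mode)
```

Here brie and cheddar end up tied with a weight of 2 because they share two crackers, while stilton shares none and so has no ties, the kind of isolate the ‘remove nodes’ tick box cleans up.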






