Mining the Open Web with ‘Looted Heritage’ – draft

What follows is a draft of a paper written in conjunction with Robert Blades concerning the Looted Heritage project.


In his overview of what ‘open access’ might mean in the academy, Peter Suber draws attention to the salient features of calling something ‘open’: that it is digital, that it is free of charge to the reader, and that most copyright or similar legal restrictions are relaxed (Suber 2012). In this paper, we describe ‘Looted Heritage’, a developing digital archaeology project, and its early results, which explore ways of leveraging open content and of dealing with the firehose of data that comes when materials can be freely collected and examined. We focus not on the academic open access movement, but rather on the torrent of archaeological materials shared through social media streams such as Twitter and blogs. We focus on user-generated content surrounding the trade in illicit antiquities and reports of looting, and we explore the patterns in this data: not just what is shared, but why.

In a way, our approach is the inverse of ‘crowdsourcing’. To crowdsource something, whether a problem of software development or the need to transcribe historical documents, is generally to fracture a problem into its component pieces, allowing an interested public to solve them. In archaeology, such approaches are starting to find currency in everything from funding fieldwork (Morgan 2011) to the entire excavation and its subsequent interpretation (Wilkins et al. 2012; Wilkins, B 2012). In 2011 my students and I embarked on a project to crowdsource the idea of ‘sense of place’, using an open-source software platform to solicit and collect community memories about cultural heritage resources in Central Canada (Graham, Massie and Feuerherm 2011). One of our findings in that project concerned the order of operations that should be followed: it is perhaps better to collect what is freely available first, before asking the crowd to fill in the gaps (Graham, Massie, and Feuerherm 2011).

Accordingly, we set up a data-trap to collect the tiny pieces out on the open access web. We then study these pieces using data mining and text analysis to develop a picture of what is happening right now. It is a kind of digital excavation, and what we are excavating is the world of social media. We then put all of our data, and our analysis, online to allow others to fill in the gaps. When we mine the open web for information about looted cultural heritage, what are the discourses? What are people saying, does what they say change over time, and do these trends and this exercise hold any lessons for us as archaeologists?

Social Media

The business model for many popular social media websites and services is built on allowing users to connect with other users, and on selling the resulting data onwards to marketers. The microblogging website Twitter caches every ‘tweet’ (short messages of 140 characters) after approximately two weeks and sells them. The marketers then mine this data looking to predict the next big thing, or to understand the public perception of their product (Barnett 2012). This material can be considered ‘open’ as long as one is looking for it during that period before it is cached. Williams, Terras, and Warwick (in press) recently completed a meta-study of over 550 academic articles that focused on Twitter. Of these, the researchers identified 53 studies that relate to mining Twitter content. These ranged from using tweets to offer better personalized news recommendations (Abel, Gao, Houben, and Tao 2011) to predicting flu trends (Achrekar, Gandhe, Lazarus, Yu, and Liu 2011; Chew and Eysenbach 2010) to attempts to predict the future (Asur and Huberman 2010) and the stock market (Bollen and Mao 2011). (The full database of Twitter-related research developed by Williams et al. will be appearing online, Terras pers. comm.).

The other facet of user-generated content that we wish to mine is the world of archaeological blogging. Blogging, it should be noted, is not a genre of writing, but rather a platform for writing and for the rapid dissemination of material onto the web. Nevertheless, the caricature of blogging is that it is narcissistic shouting into the void about narrow, meaningless ephemera; that it is ‘noise’ in contrast to the strong ‘signal’ that a peer-reviewed journal might provide. How then can anything useful be found in this open environment? Until the advent of Google, there was no good answer to this question. Google is not a search engine, nor a catalog, nor an index: it is a massive experiment in prediction. Google benefits from the billions of searches that we the users perform every week. In essence, we are teaching the machine what is useful when we select one result out of the millions provided. Google observes this. Google uses over 200 such signals to match useful information to each individual user, each of whom has their own idea of what constitutes ‘useful’ (Levy 2010).

Blogging as a medium creates strong signals. Academic blogs tend to have a very tight focus (notable examples are Bill Caraher’s New Archaeology of the Mediterranean World and Colleen Morgan’s Middle Savagery). They are updated fairly regularly, as the academic incorporates them into his or her work cycles. They focus on a comparatively narrow range of topics, and are thus semantically tight. The anchor text for links tends to consist of distinctive combinations of words, and thus provides more signal to Google’s algorithm. A static, rarely-updated website (like many academic department websites) does not provide strong signals, and thus is not often returned in search results. Blogs and other high-signal sites like Wikipedia are displayed first. Research shows that most users never look further than the first few results provided by any search engine (Jansen and Spink 2006: 260). To the wider world, only that which is blogged, tweeted, or written about on Wikipedia exists; that which is hidden behind a paywall does not.

Data Collection Methods

In practice the web is infinite. Our project attempts to monitor that slice of it which is open, accessible, and taking place on Twitter, on blogs, using RSS feeds, automatic news aggregators, and other web 2.0 tools (cf. Kansa and Kansa 2011). We use an integrated environment for marshalling this data called ‘Ushahidi’. The word ‘ushahidi’ is a Swahili word meaning ‘testimony’. Ushahidi was developed in Kenya to map reports of violence after the bitterly contested elections of 2008 (Ushahidi 2012). It accepts information submitted via web form, email, and cell-phone short-message-service. It can also be used to collect RSS feeds and to trawl Twitter, copying tweets that contain particular keywords.

We set Ushahidi to search Twitter for #looting, #antiquities, #looted, #illicit. The search will also turn up results without the # symbol; the convention on Twitter however is to indicate descriptive keywords for one’s ‘tweet’ by using the # symbol in conjunction with the keyword. This allows for more effective searching, and for users to follow developing conversations even if they themselves do not follow all the participants in the conversation. We are subscribed to feeds from Art Theft Central; Conflict Antiquities; Illicit Cultural Property; Looting Matters; and Saving Antiquities for Everyone. We also have a saved search at Google News that returns items based on the keywords ‘looted antiquities’.
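The keyword search described above can be pictured with a small filter. This is only a minimal sketch, not Ushahidi's own code: the four keywords come from the text, while the function name and the matching rule (whole word, optional leading #, case-insensitive) are assumptions made for illustration.

```python
import re

# Keywords taken from the text; the matching rule itself is an assumption.
KEYWORDS = ("looting", "antiquities", "looted", "illicit")

def matched_keywords(tweet, keywords=KEYWORDS):
    """Return the keywords found in a tweet, with or without a leading '#'."""
    found = []
    for kw in keywords:
        # (?<!\w) and \b restrict the match to whole words, so 'looted'
        # is not found inside a longer word.
        pattern = r"(?<!\w)#?" + re.escape(kw) + r"\b"
        if re.search(pattern, tweet, flags=re.IGNORECASE):
            found.append(kw)
    return found

print(matched_keywords("Reports of #looting at the site"))
```

A tweet that returns an empty list would simply never enter the queue.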

As of April 12, 2012, we have over 1300 items in the queue from these feeds with approximately 50 to 100 new items appearing each day – firehose indeed! In the first quarter of 2012 we have culled 207 reports from this stream. Ushahidi is also a form of simple GIS, wherein each report is also categorized and tied to a geographical location.

Analytical Methods

We use the techniques of text-analysis and topic modeling. Digital text analysis has a long tradition in what is now called ‘digital humanities’, emerging out of efforts to systematize the generation of concordances and vocabulary counts (see Hockey, 2004 for an overview). We use the ‘Voyant’ online tool (formerly ‘Voyeur Tools’, Rockwell and Sinclair, 2012; Sinclair and Rockwell 2009) to explore word use in our texts. Because this tool is online, we can share this step in our analysis with others by providing a unique URL to our corpus (see table 1). We loaded our reports in chronological order into Voyant so that we could examine word use over time through simple frequencies (see for instance Burrows, 2004 for the wide variety of approaches to which text analysis may be put).

Looted Heritage: Monitoring the Illicit Antiquities Trade
Full Corpus of Reports loaded into Voyant Tools
A guide to the Voyant interface
Full output from MALLET Topic Modeling algorithm in csv and html format
Full Corpus of Reports loaded into Voyant Tools Frequency Tool
Visualizing the patterns of social connections in the corpus of reports

Table 1. Internet Location for tools and datasets referenced in the text.

Voyant also has a tool called RezoViz which extracts named persons from documents, and links them together on the basis of occurrence in the same document. With this tool, Voyant becomes a tool for data discovery of social networks. However, it is still in ‘alpha’ meaning that not all of the idiosyncrasies of the code have been completely solved. Nevertheless, given the nature of our data, it is a useful tool to begin to understand who the key players might be, tracking them over time and space. One might then use this data to refine the social media searches, for instance.
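The linking rule described here (two names are tied together when they occur in the same document) can be sketched as follows. This is not RezoViz's code: it assumes the named persons have already been extracted for each report, and the function name and placeholder names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(docs_to_names):
    """Count how often each pair of names appears in the same document.

    docs_to_names: {document id: list of names found in that document}
    """
    edges = Counter()
    for names in docs_to_names.values():
        # sorted(set(...)) gives every unordered pair one canonical form
        for a, b in combinations(sorted(set(names)), 2):
            edges[(a, b)] += 1
    return edges

# Placeholder data, not drawn from the corpus.
reports = {"r1": ["Person A", "Person B", "Person C"],
           "r2": ["Person B", "Person A"]}
for pair, count in cooccurrence_edges(reports).items():
    print(pair, count)
```

The resulting weighted pairs are an edge list of the kind a network analysis package can import.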

We then explore the texts for deeper structure, using ‘topic modeling’ (a Bayesian statistical approach formally called ‘latent Dirichlet allocation’, Blei, Ng, and Jordan 2003; Underwood 2012a; Weingart 2011). Topic modeling determines collections of words that occur in semantically meaningful ways in different proportions within a text. As Ted Underwood puts it, ‘Topic modeling is a way of extrapolating backward from a collection of documents to infer the discourses (“topics”) that could have generated them’ (Underwood 2012b). It begins with simple frequencies of words, but also considers the way a particular word is used in some documents, but not others. The algorithm introduced by Blei et al. (2003) assumes that for any possible topic, a word has some probability of being part of that topic: it multiplies the frequency of this word in this topic by the number of words in the document that already belong to the topic. The result is a probability that the word actually belongs to that topic (Underwood 2012b). This is an iterative process that begins from a random position. As the algorithm cycles to produce a best fit, words are gradually sorted into ‘topics’, and ‘topics’ into documents. As Underwood emphasizes, however, these are not ‘topics’ as one might understand from a book index. Rather, they might be better thought of as discourses (2012a, 2012b).
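The multiplication described above can be made concrete with a toy calculation. This is a deliberately simplified sketch that follows the verbal description only: full implementations also add smoothing terms and adjust for the overall size of each topic, both omitted here. The function name and the toy counts are hypothetical.

```python
def topic_probabilities(word, doc_topic_counts, topic_word_counts):
    """Simplified weights for assigning one word in one document to each topic.

    doc_topic_counts:  {topic: number of words in this document already in topic}
    topic_word_counts: {topic: {word: how often the word is assigned to topic}}
    """
    weights = {}
    for topic, n_in_doc in doc_topic_counts.items():
        n_word = topic_word_counts.get(topic, {}).get(word, 0)
        # frequency of the word in the topic, times the topic's presence in the document
        weights[topic] = n_word * n_in_doc
    total = sum(weights.values())
    if total == 0:
        # no evidence at all: fall back to a uniform guess
        return {t: 1 / len(weights) for t in weights}
    return {t: w / total for t, w in weights.items()}

# Toy counts: 'museum' is equally common in both topics,
# but the document leans towards topic 0.
print(topic_probabilities("museum", {0: 3, 1: 1},
                          {0: {"museum": 1}, 1: {"museum": 1}}))
```

Repeating this assignment for every word in every document, many times over, is what gradually sorts words into topics.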

We use the ‘Machine Learning for LanguagE Toolkit’ (MALLET, McCallum 2002) and its implementation of the algorithm. MALLET is open source software that runs from the computer’s command line. A Java based graphical user interface (GUI; Newman and Balagopalan 2011) used in tandem with MALLET makes it easier to run the algorithms and to select and manage one’s data. MALLET outputs a series of comma separated files that give a breakdown of ‘topics in documents’, ‘documents in topics’, and ‘key words in topics’. The GUI produces a series of webpages that allow one to explore the results by clicking through documents, topics, and the linkages between them. This output has been deposited with the data repository Figshare and may be downloaded and explored (see table 1).

We set the algorithm to iterate 400 times as it converged on the best solution, assuming 15 topics. It ignored a preset list of ‘stopwords’ (‘the’, ‘and’, ‘of’ etc) that tend to obscure the patterns we wish to find. There is no way of predetermining the ‘best’ number of topics. Instead, one runs the analysis again and again, looking at the resulting composition of documents, looking for a distribution of topics to documents that does not clump too heavily. In general, we began by searching for five topics and increased until we hit 15 topics, which seemed to capture the variety well (one or two topics did not account for the majority of documents, for instance).
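For readers who prefer the command line to the GUI, the settings above (15 topics, 400 iterations, stopwords removed) might translate into two MALLET invocations along the following lines. This is a sketch under assumptions: the analysis described here was run through the GUI, and the directory name, the output file names and the function name are hypothetical. The function only builds the argument lists; it does not run MALLET.

```python
def mallet_commands(corpus_dir, num_topics=15, num_iterations=400,
                    mallet_bin="bin/mallet", work_file="reports.mallet"):
    """Build (but do not run) the two MALLET invocations as argument lists."""
    import_cmd = [
        mallet_bin, "import-dir",
        "--input", corpus_dir,
        "--output", work_file,       # hypothetical file name
        "--keep-sequence",           # keep word order, needed for topic modeling
        "--remove-stopwords",        # drop MALLET's default English stopwords
    ]
    train_cmd = [
        mallet_bin, "train-topics",
        "--input", work_file,
        "--num-topics", str(num_topics),
        "--num-iterations", str(num_iterations),
        "--output-topic-keys", "topic_keys.txt",   # hypothetical file name
        "--output-doc-topics", "doc_topics.txt",   # hypothetical file name
    ]
    return import_cmd, train_cmd

for cmd in mallet_commands("reports"):
    print(" ".join(cmd))
```

Each list could be passed to subprocess.run on a machine where MALLET is installed; changing num_topics and re-running corresponds to the trial-and-error search from five to 15 topics.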

To get a sense of what the ‘topics’ found by the algorithm might mean, one examines the document composition output, looking for those documents where the topic occurs with the highest probability. We can then suggest a descriptive label, a meaningful ‘topic’ in human terms (cf Nelson 2011) or try to understand the list of words as a kind of discourse (Underwood 2012a,b). The composition output can also be visualized as a kind of network, where there are two kinds of nodes, documents and topics. The composition probability gives a weight to the strength of the connection (Meeks 2011). We use the Gephi open source network visualization software package to visualize these patterns (Bastian, Heymann, and Jacomy 2009). We can identify, in network terms, reports or topics that seem to be most crucial for keeping the network together. We can also calculate statistics on this network which give us an indication of the ‘communities’ of reports (ie, subnetworks) that seem to have similar patterns of composition. Topic modeling the reports for the underlying structural patterns of word use gives us insight into the ideological realm; visualizing this output as a network makes those patterns clearer.

Limitations of the Data

Obviously, what we search for dictates what we will find. These reports represent the concerns of the anglophone, Western world. To what degree is Twitter representative of other places and cultures? In China, the important social media player is Weibo. There are 500 million Chinese online; Weibo grows at the rate of approximately 10 million a month (DeWoskin 2012; Chen 2012). Currently on ‘Looted Heritage’, China is a comparative blank spot. Similarly the Spanish speaking nations are not well represented, nor is sub-Saharan Africa.

The other major blindspot concerns the traditional auction houses and newer entrants such as eBay. Online anonymous auctions make it easy for low-level, low-value antiquities to be bought and sold. On March 1st 2012 we scraped the RSS feed from for items tagged as Roman antiquities. We found over 400 items being sold, with a combined listed value of over $C 48,000. The median value of these items was approximately $C 20. We elected not to create reports of items listed for sale on eBay, from a desire to not encourage their sale.

Stanish points out that a great deal of what passes for an antiquity on eBay might well be a fake and so eBay, by flooding the market, is in fact preserving antiquities (Stanish 2009). Nevertheless, the technology of eBay facilitates a trade in extremely portable antiquities like coins and brooches, the search for which can be just as destructive as that for ‘high-end’ antiquities. The British Museum and the Portable Antiquities Scheme reached an understanding in 2006 to monitor for antiquities which might fall under the rubric of the Treasure Act (British Museum, 2006). In France, a group of archaeologists attempts to monitor to persuade it to remove antiquities from the lists (Champault, pers. comm.). By Champault’s reckoning, some 75,000 euros’ worth of sales have been cancelled due to their efforts.

In future iterations of ‘Looted Heritage’, we will be working to incorporate data from eBay as we suspect that monitoring and mapping the locations of sellers of sudden assemblages of small finds could point the way to tracking the field of operations of pot hunters and subsistence looting.

Findings – Text analysis

Frequencies are normalized in terms of keyword per 10 000 words over the corpus. ‘Museum’ is an obvious word with which to begin. The major spikes in its frequency correspond with thefts-to-order from the Museum at Olympia in Greece. ‘Looting’ occurs with high frequency in December and January, is virtually silent until the end of February, spikes enormously in early March, and thereafter percolates nicely until the end of the data (Figure 1). That there should be seasonality to looting is not surprising (presumably looters are not impervious to the weather) but it is gratifying to see this reflected in the materials trawled from the web. The spike in the frequency of ‘looting’ in early March seems extreme though, compared to the rest of the trend. It corresponds in our data with the release of four videos. The spike in word frequency perhaps can be explained by the relative shortness of the text that accompanies these videos, skewing the proportions.
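The normalization, and the way a short text can skew it, can be illustrated with a few lines of code. This is a minimal sketch rather than Voyant's implementation; the function name, the crude tokenizer and the toy segments are assumptions.

```python
import re

def per_10k(keyword, segments):
    """Occurrences of keyword per 10,000 words, for each text segment in order."""
    rates = []
    for text in segments:
        # crude tokenizer: runs of letters and apostrophes, lower-cased
        words = re.findall(r"[a-z']+", text.lower())
        if not words:
            rates.append(0.0)
            continue
        rates.append(10_000 * words.count(keyword.lower()) / len(words))
    return rates

short_segment = "looting at the site"        # 4 words, 1 hit
long_segment = "looting " + "word " * 99     # 100 words, 1 hit
print(per_10k("looting", [short_segment, long_segment]))
```

A single occurrence weighs far more heavily in the four-word segment than in the hundred-word one, which is the kind of distortion suspected for the early March spike.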

[insert figure 1 about here]

One can compare multiple words at the same time. For instance, ‘art’ and ‘objects’ on the face of it should often go together. Figure 2 graphs the relative frequency of both words. At many points in December and January, this indeed seems to be the case. Where they differ often seems to be in cases of repatriation (the report speaks of ‘art’) whereas in cases of theft, the language tends to use ‘objects’. Do antiquities only become ‘art’ once they’ve been stolen, or displayed in a museum? There is perhaps an interesting discourse hinted at here to be explored concerning the semantics of looting and cultural heritage crime.

[insert figure 2 about here]

Turning to the RezoViz module (see table 1), figure 3 displays 25 individuals and their connections, as extracted from the reports. It appears to display names according to their frequency in the underlying documents. ‘Simon MacKenzie’ and ‘Neil Brodie’, scholars from the University of Glasgow who investigate the illicit antiquity trade are represented and tied together, by virtue of newspaper and blog articles reporting on a grant the two scholars received. Similarly, a cluster of individuals is focused on the nexus of the ‘Getty Museum’, ‘Giacomo Medici’, and ‘Marion True’. The investigative reporters Jason Felch and Ralph Frammolino, who wrote about the Getty’s connection to Medici and other aspects of its collecting history (2011), are also tied to that cluster, as is the scholar David Gill, who has written on his blog about the Getty. If one slides the ‘items’ slider to the right, more individuals are added to the display. Figure 4 displays 125 individuals and their connections; individuals named in the same reports as Zahi Hawass (former Minister of State for Antiquities, Egypt) are highlighted, for the sake of illustration (interestingly, Napoleon Bonaparte is also part of this subnetwork!). When this tool is formally incorporated into Voyant perhaps it will allow these networks to be exported to formal social network analysis tools so that the data may be cleaned and explored more rigorously. Who for instance is most central? Who is in a position of power, in terms of their ability to influence opinion or control information? Is it a complete network in the sense that one can chart a path between any pair of individuals, or are there isolated areas? Are there sub-communities within the graph?

[insert figures 3 & 4 about here]

Findings – Topic Modeling

The topics extracted by MALLET from the text of the reports, indicating the associated key words and a descriptive label are listed in table 2.

Topic # | Key words | Possible descriptive label
1 | museum objects getty art antiquities museums italy returned italian princeton potts collection university acquired collections director true investigation roman works curator aphrodite exhibition greek almagi | Museum ethics
2 | project archaeologists research university community national heritage german dr state archaeology nigeria association development museums looting nok gundu nigerian work evidence news claims local communities | Archaeological ethics
3 | turkey mosaics university art turkish made dealer information london find late dealers bgsu roman purchase received history collection center owner pa recently december expect | Repatriation & university museums: Old world
4 | antiquities years world trade international countries past illicit looted director make future material year provide long information report policy local questions including part provenance study | Provenance
5 | ancient artifacts city people back place return time large left iraq ruins discovered don experts set country great peru added east officials built excavated bank | Repatriation & university museums: New world
6 | national history show archaeologist sites american artifacts treasure archaeology archaeological state historical based law wrote tv trafficking lost historic illegally malter park shows property years | Television
7 | museum stolen culture ministry museums ancient security government minister pieces art country items head gallery olympia told guards thieves reported guard artefacts made building back | Museum thefts
8 | hawass british syria war loot syrian gold treasure found foreign part zahi augustine french display funds army mubarak brought men japan return caught | War & antiquities
9 | statue looted auction government sotheby cambodia statement states private york sale united case cambodian law legal officials million stolen sell house piece property sold mask | Auction houses
10 | public committee aaa president letter funding learn coin day policy images arts research image human half recently item montreal king affairs read request report statement | Identifying looted antiquities
11 | archaeological cultural site sites looting heritage antiquities country department authorities excavation important objects state international illicit protection red list area destruction property dealers general mali | Fighting the illicit antiquities trade
12 | antiquities items stolen documents artifacts charges rare include church artefacts case golan jerusalem jewish court israel allegedly heart gang age police st accused trial believed | Theft from historic sites
13 | egypt egyptian el antiquities council sites people hibeh cairo al foreign revolution archaeologists including digging work afghanistan police dr stop period article group rich supreme | Egypt
14 | police greece theft year greek coins antiquities crime national metal athens crisis small years smuggling found arrested including century thieves work months month bronze members | Greece
15 | art market high china chinese goods relics number stone paintings works bronze million palace dynasty jade fake thousands global collectors village fine summer antiques antique | China

Table 2. Keywords in topics.

While we can give descriptive labels as a shorthand for the ‘topic’, we can also imagine these as discourses. We can imagine that Topic 1 concerns discourse surrounding the Getty Museum. Similarly, Topic 7 should concern the Olympia Museum, while Topic 10, with the words ‘Montreal’ and ‘AAA’ (American Anthropological Association), concerns a North American discourse around museums and the identification of artefacts. Topic 6 is clearly connected with the controversy generated by Spike TV and the National Geographic Channel’s recent television programs connected with metal detecting. It is interesting that the topic modelling routine seems to tease out two distinct strands surrounding discourses of ‘repatriation’, split between old and new world, as well as discourses surrounding particular ‘source’ nations.

MALLET outputs a table indicating the relative percentage that each topic contributes towards the composition of each report. We can consider these percentages as a weighting of a link between reports and topics, and thus we can represent the ‘topic-space’ as a kind of network map.  We extracted a list of reports to constituent topics, capturing at least 50% of each report’s composition (including every report’s complete breakdown would render the resulting map unintelligible).
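One way to turn such a composition table into a list of weighted links is sketched below. The exact selection rule used for the map is not spelled out in the text, so this is an assumed reading of ‘capturing at least 50% of each report’s composition’: for every report, take its largest topics in turn until their shares add up to at least half. The function name and the toy proportions are hypothetical.

```python
def strongest_links(composition, coverage=0.5):
    """For each report keep its largest topics until they cover `coverage` of it.

    composition: {report id: {topic label: proportion of the report}}
    Returns a list of (report id, topic label, proportion) edges.
    """
    edges = []
    for report, topics in composition.items():
        covered = 0.0
        ranked = sorted(topics.items(), key=lambda kv: kv[1], reverse=True)
        for topic, share in ranked:
            if covered >= coverage:
                break
            edges.append((report, topic, share))
            covered += share
    return edges

# Toy proportions, not taken from the MALLET output.
toy = {"r1": {"Egypt": 0.6, "Greece": 0.3, "China": 0.1},
       "r2": {"Egypt": 0.35, "Greece": 0.25, "China": 0.4}}
for edge in strongest_links(toy):
    print(edge)
```

Each tuple is one weighted link between a report node and a topic node, ready to be written out as an edge list for a network visualization package.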

The resulting network visualization has 222 nodes (representing reports and topics) and 887 links between them. Gephi uses an algorithm called ‘modularity’, developed by Blondel et al. (2008), to identify subgroups (‘module’ being a synonym for ‘community’). Modules are based on the similarity of linkages. The routine reports a score for each partition; the closer that score is to 1, the stronger the community structure. We ran modularity 200 times on this data; the best results clustered around 0.216 and tended to produce 6 communities (table 3):

Group 1: Identifying Looted Antiquities; China
Group 2: Repatriation & University Museums: old world materials; Museum Ethics; Auction Houses
Group 3: Television; Archaeological Ethics
Group 4: War & Antiquities; Fighting the Illicit Antiquities Trade
Group 5: Provenance; Repatriation & University Museums: new world materials
Group 6: Greece; Museum Thefts; Theft from Historic Sites

Table 3. Grouping topics into ‘communities’.
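The score reported by the modularity routine can be computed by hand for any given grouping. The sketch below evaluates the standard modularity score of a partition of a weighted, undirected network; it does not search for the best partition, which is the job of the community detection algorithm itself. The function name and the toy network are hypothetical.

```python
def modularity(edges, community):
    """Modularity score of a partition of an undirected weighted graph.

    edges:     iterable of (node_a, node_b, weight), one entry per link
    community: {node: community label}
    """
    total = 0.0     # sum of all edge weights
    internal = {}   # per community: weight of the edges inside it
    degree = {}     # per community: summed weighted degree of its nodes
    for a, b, w in edges:
        total += w
        ca, cb = community[a], community[b]
        degree[ca] = degree.get(ca, 0.0) + w
        degree[cb] = degree.get(cb, 0.0) + w
        if ca == cb:
            internal[ca] = internal.get(ca, 0.0) + w
    if total == 0:
        return 0.0
    # share of weight inside each community, minus the share expected by chance
    return sum(internal.get(c, 0.0) / total - (d / (2 * total)) ** 2
               for c, d in degree.items())

# Toy network: two triangles joined by a single link.
toy_edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
             ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),
             ("c", "d", 1)]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(round(modularity(toy_edges, two_groups), 3))
```

Putting every node of the toy network into a single group gives a score of zero, which is why a higher score is read as stronger community structure.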

To make the resulting visualization more intelligible, we filtered out individual linkages weighing less than 20% (Figure 5). The colours represent the partition of the network into the communities detected through the modularity routine. We can then measure the network to identify those reports or topics that are positioned on the most paths between any two other items, a measurement called ‘betweenness centrality’. In a sense, what we have produced here is a map of the idea-space surrounding the illicit antiquities trade. The links between nodes are thicker the stronger the tie between the topic and the report. There appears to be a current of thought which runs from ‘museum thefts’ to ‘Greece’, to ‘theft from historic sites’ more generally. Another runs from ‘Egypt’ to ‘fighting the illicit antiquities trade’ to ‘provenance’. The one isolated topic that never ties in (except in the most tenuous of ways) is ‘television’, a body of reports connected with the outrage over shows that are seen in the professional community to be promoting pot hunting as a glamorous recreational pursuit.

The ‘betweenness centrality’ metric also directs our attention to particular reports. One such is Report 161 from January 26th, a report about the looting of a church. In fact, a spate of thefts from churches and other historic sites occurred in late February and early March. Is this a new trend?
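For readers unfamiliar with the measure, betweenness centrality can be computed with a short routine. The sketch below follows the well-known Brandes approach for an unweighted, undirected network and ignores link weights, so it is a simplification of what a network package would report for the weighted map; the function name and the toy network are hypothetical.

```python
from collections import deque

def betweenness(adjacency):
    """Unnormalised betweenness centrality, undirected and unweighted.

    adjacency: {node: set of neighbours}; every node must appear as a key.
    """
    score = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # breadth-first search, counting shortest paths from source
        order = []
        preds = {v: [] for v in adjacency}
        paths = {v: 0 for v in adjacency}
        dist = {v: -1 for v in adjacency}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # walk back from the farthest nodes, sharing out credit
        delta = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # every unordered pair was counted once from each endpoint
    return {v: s / 2 for v, s in score.items()}

# Toy network: one topic linked to three reports.
star = {"topic": {"r1", "r2", "r3"},
        "r1": {"topic"}, "r2": {"topic"}, "r3": {"topic"}}
print(betweenness(star))
```

In the toy network every shortest path between two reports passes through the topic, so the topic scores highest and the reports score zero.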


[insert figure 5 about here]


In this paper, we demonstrate a workflow for collecting open-access materials created by individuals, academics, and the press who use social media to communicate information about the trade in illicit antiquities. We then analyze the text of these reports for both superficial and deep patterns. Ideally, as more and more information gets collected, we will be able to spot and understand underlying trends in the world antiquities market. We would also like to include the work of archaeologists and scholars published in journals, to provide the deeper insight and reflection that this just-in-time approach currently lacks. Alas, the majority of these works are not available to be analyzed and mined this way. We also understand that ‘open access’ could mean that our underlying data should be available for others to study or repurpose for themselves, whether to extend on our study or to contest it. We have deposited all of our materials in online repositories to encourage just that. By providing this material digitally, we accelerate the tedious process of corpus building. In this way, open access becomes about generating a discussion, and building upon each other’s work.


Abel, F, Gao, Q, Houben, G-J, Tao, K. 2011. Analyzing user modeling on Twitter for personalized news recommendations. Lecture Notes in Computer Science, 6787: 1-12.

Achrekar, H, Gandhe, A, Lazarus, R, Yu, S-H, Liu, B. 2011. Predicting flu trends using twitter data. In IEEE Conference on Computer Communications Workshops. Shanghai: INFOCOM WKSHPS, pp. 702-707.

Asur, S, Huberman, B.A. 2010. Predicting the future with social media. HP Laboratories Technical Report, no. 53.

Barnett, E. 2012. Twitter sells tweet archive to marketers. The Telegraph [online] February 28th. Available at: <> [Accessed April 18, 2012].

Bastian M., Heymann S., Jacomy M. (2009). Gephi: an open source software for exploring and manipulating networks. International AAAI Conference on Weblogs and Social Media. [online] Available at: <> [Accessed April 23 2012].

Blei, D., Ng, A., & Jordan, M. 2003. Latent dirichlet allocation. The Journal of Machine Learning Research 3:993-1022.

Bollen, J., Mao, H. 2011. Twitter mood as a stock market predictor. Computer, 44.10: 91-94.

British Museum. 2006. eBay partners with British Museum and Museums, Libraries and Archives Council to protect British treasures. [press release] Available at: <>  [Accessed April 15, 2012]

Burrows, J. 2004. Textual Analysis. In A Companion to Digital Humanities. (ed. S. Schreibman, R. Siemens, and J. Unsworth). Oxford: Blackwell. Available at: <> [Accessed April 18, 2012].

Caraher, B. 2010-2012 The New Archaeology of the Mediterranean World [blog]. Available at:<>[Accessed April 23 2012].

Chen, J. 2012. China’s Weibo Guru, Kai-fu Lee. Forbes [online] January 20th. Available at: <> [Accessed April 18, 2012].

Chew, C., and Eysenbach, G. 2010. Pandemics in the age of Twitter: Content analysis of tweets during the 2009 H1N1 outbreak. PLoS ONE, 5:11. [online] Available at: <doi:10.1371/journal.pone.0014118> [Accessed April 18, 2012].

Cohen, D., and J. Fragaszy Troyano (eds.) Journal of Digital Humanities. Roy Rosenzweig Center for History and New Media [online]. Available at: <> [Accessed April 23 2012].

DeWoskin, R., 2012. East Meets Tweet. Vanity Fair [online] February 17 <>  [Accessed April 18, 2012].

Durney, M. 2012. Art Theft Central. [blog] Available at: <> [Accessed April 23 2012].

Fincham, D. 2012. Illicit Cultural Property: A weblog about art, antiquities and the law [blog] Available at: <> [Accessed April 23 2012].

Gill, D. 2012. Looting Matters: Discussion of the archaeological ethics surrounding the collecting of antiquities [blog] Available at:<> [Accessed April 23 2012].

Graham, S, Massie, G., and Feuerherm, N. 2012. The HeritageCrowd Project: A Case Study in Crowdsourcing Public History. In Writing History in the Digital Age (eds. J. Dougherty and K. Nawrotzki) Under contract with the University of Michigan Press. Trinity College (CT) web-book edition, Spring 2012, [online] Available at:<> [Accessed April 23 2012].

Hardy, S. 2012. Conflict Antiquities: Illicit Antiquities Trading in Economic Crisis, Organised Crime, and Political Violence. [blog] Available at: <> [Accessed April 23 2012].

Hockey, S. 2004. The History of Humanities Computing. In A Companion to Digital Humanities. (ed. S. Schreibman, R. Siemens, and J. Unsworth). Oxford: Blackwell. Available at: <> [Accessed April 18, 2012].

Jansen, B., and Spink, A. 2006. How are we searching the World Wide Web? A comparison of nine search engine transaction logs. Information Processing and Management 42.1: 248-263.

Kansa, Eric and Sara Kansa. 2011. Toward a Do-It-Yourself Cyberinfrastructure: Open Data, Incentives, and Reducing Costs and Complexities of Data Sharing. In Archaeology 2.0: New Approaches to Communication and Collaboration (eds. E. Kansa, S. Whitcher Kansa, and E. Watrall). Berkely: Cotsen Institute of Archaeology. [online] Available at: <> [Accessed April 20 2012].

Levy, S. 2010. How Google’s Algorithm Rules the Web Wired [online] February 22. Available at:<> [Accessed April 23 2012]

McCallum, Andrew Kachites. 2002. MALLET: A Machine Learning for Language Toolkit. [online]. Available at:<> [Accessed April 23 2012].

Meeks, Elijah. 2011 Comprehending the Digital Humanities Digital Humanities Specialist Stanford University Libraries and Academic Information Resources[blog] February 19. Available at <> [Accessed April 23 2012].

Morgan, C. 2006-2012 Middlesavagery [blog]. Available at:< > [Accessed April 23 2012].

Morgan, C. 2011. Crowdsourcing Archaeology – The Maeander Project Kickstarter Page. Middle Savagery [blog] June 16. Available at: <> [Accessed April 23 2012].

Nelson, Robert K. 2011. Mining the Dispatch. Digital Scholarship Lab at the University of Richmond [online]. Available at <> [Accessed April 23 2012].

Newman, David and Alun Balagopalan, 2011. A graphical user interface tool for topic modeling. Google Code [online]. Available at:<> [Accessed April 23 2012].

Rockwell, G., and Sinclair, S. 2012. Voyant Tools: Reveal Your Texts, [online] Available at: [Accessed April 1st, 2012].

SAFE: Saving Antiquities for Everyone. 2012. Blog. [blog] Available at: <> [Accessed April 23 2012].

Sinclair, S. 2009. The Rhetoric of Text Analysis. [online] Available at: <>. [Accessed April 15, 2012].

Stanish, C. 2009. Forging Ahead, or, how I learned to stop worrying and love eBay. Archaeology, 62.3 [online] Available at <> [Accessed April 12, 2012].

Suber, P. 2004. Open Access Overview. The SPARC Open Access Newsletter. [online] (Updated March 18 2012) Available at: <> [Accessed April 17, 2012]

Underwood, T. 2011. Why humanists need to understand text mining. The Stone and Shell, [blog] May 29. Available at: <> [Accessed February 27, 2012].

Underwood, T. 2012a. What kinds of “topics” does topic modeling actually produce? The Stone and Shell, [blog] April 1. Available at: <> [Accessed April 14, 2012].

Underwood, T. 2012b. Topic modeling made just simple enough. The Stone and Shell, [blog] April 7. Available at <> [Accessed April 14, 2012].

Ushahidi. 2012. Ushahidi. <>.

Weingart, S. 2011. Topic Modeling and Network Analysis. The Scottbot Irregular, [blog] November 15. Available at <> [Accessed April 18, 2012].

Wilkins, B. 2012. Comment #1 on Graham, S. Digventures, Flag Gen, and Crowd-everything archaeology Electric Archaeology [blog] March 13. Available at <> [Accessed April 23 2012].

Wilkins, L., Wilkins, B., and Dave, R. 2012. Digventures: How it works. Digventures. [online] Available at: < > [Accessed April 23 2012].

Williams, S., Terras, M., and Warwick, C. Forthcoming. What people study when they study twitter: Classifying Twitter Related Academic Papers. Journal of Documentation. Submitted 2012.


12 thoughts on “Mining the Open Web with ‘Looted Heritage’ – draft

  1. Very interesting approach! Thank you for sharing. One quick question: in Voyant, is there a way to easily upload/target multiple texts/web pages? (I have a test in mind but it involves 70+ web pages)

    1. Hi Francis,
      Thank you for your note. Re your question – You can put the txts (in whatever order works for you; we went for creation-date) into a single zip file, and then you upload that single zip file.
      Good luck!

  2. Maybe I missed it, but what are the four videos that correspond with the spike in frequency noted in early March?