The following is a piece by Joe Aitken, a student in my CLCV3202a Roman Archaeology for Historians class at Carleton University. His slides may be found here. I asked Joe if I could share his work with the wider world, because I thought it an interesting example of using simple text analysis to explore broader trends in public archaeology. Happily, he said yes.

Exploring Trends in Archaeology: Professional, Public, and Media Discourses

An immense shift in content and terminology emerges when analysing the text of several documents relating to the archaeology of Colchester, as information grows from its genesis as an archaeological report, through the stage of public archaeology, and finally to mass media. Many inconsistencies emerge as the form in which archaeological information is presented changes.

This analysis was done with the help of Voyant Tools, “a web-based text analysis environment.”[1] Z-score, representing the number of standard deviations above the mean at which each term appears, will be used as the basic marker of frequency. Skew, “A measure of the asymmetry of relative frequency values for each document in the corpus,”[2] will also be used. Having a skew close to zero suggests that the term appears with relative consistency throughout the documents. This means that in comparison to, for example, “piggery,” with a skew of 11, terms with a low skew are not only frequent in the corpus as a whole, but are prevalent in many of the documents that make up the corpus.
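The two measures can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Voyant Tools' own code: the function names are hypothetical, the z-score is taken over raw term counts in a corpus, and skew is computed as the population skewness (third standardized moment) of a term's per-document relative frequencies. Voyant's exact formulas may differ.

```python
from statistics import mean, pstdev


def z_scores(term_counts):
    """Map each term to (count - mean) / standard deviation over all term counts.

    Assumption: the z-score is computed over raw counts of every term in the corpus.
    """
    counts = list(term_counts.values())
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        # Every term has the same count, so none stands out.
        return {term: 0.0 for term in term_counts}
    return {term: (count - mu) / sigma for term, count in term_counts.items()}


def skewness(values):
    """Population skewness of one term's per-document relative frequencies.

    Assumption: this textbook definition stands in for Voyant's skew measure.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Perfectly even distribution across documents.
        return 0.0
    return mean([((v - mu) / sigma) ** 3 for v in values])


# Hypothetical per-document frequencies, invented for illustration only.
evenly_spread = [1, 2, 3]          # present at similar levels in each document
concentrated = [0, 0, 0, 0, 10]    # absent everywhere except one document

print(skewness(evenly_spread))
print(skewness(concentrated))
```

A term that turns up at similar levels in every document gives a skew near zero, while a term concentrated in a single document gives a large positive skew, which is the pattern described above for "piggery."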

A text analysis of Colchester Archaeological Trust Reports 585-743 (February 2011 to 22nd October 2013)[3] is the basis of this comparison. Dominant in this corpus are terms related to archaeological excavations. The term “report” has a z-score of 8.69, “finds” has a z-score of 6.43, and “site” has a z-score of 8.81. The same terms, respectively, have skews of 0.93, 0, and 0.88. Another relatively consistent term is “pottery,” which has a skew of 1 and a z-score of 5.26. “Brick”, with a skew of 2.17 and a z-score of 3.1, is similarly consistent.

The relevance of these figures becomes clearer upon a comparison with the public archaeological writings as they appear on the Colchester Archaeologist blog. The blog is hosted on the public-facing website of the Colchester Archaeological Trust, which has been blogging about its archaeological discoveries since 2011. This analysis will use the Voyant Tools difference function, which returns a value based on a comparison between the z-scores of two corpora,[4] as well as a direct comparison of the z-score and skew of each term between the two corpora.

Some of the most consistent terms from the archaeological corpus appear very infrequently in the public archaeology. “Pottery” has a skew of 9.49 and a z-score of 0.25, and appears at about one-fifth of the frequency that it does in the reports. “Brick” similarly disappears: in the public archaeology, it has a skew of 9.56 and a z-score of -0.02, compared to a skew of 2.17 and a z-score of 3.1 in the archaeological reports.

Terms relating to the excavation also disappear. “Finds,” which in the archaeological reports has a skew of 0 and a z-score of 6.43, has a skew of 4.94 and a z-score of 0.42 in the public archaeology. “Report” similarly changes from a skew of 0.93 to 9.87, with its z-score dropping from 8.69 to -0.06. “Site” follows this trend to a lesser extent, although this is likely due to it appearing in the public archaeology in the context of “website,” rather than as an archaeological term. Still, the shifts in z-score and skew are significant, and in the same direction: an archaeological z-score of 8.81 to a public z-score of 3.83, and an archaeological skew of 0.88 to a public skew of 1.28. In each case, these commonly used terms from the archaeological reports appear less frequently and less consistently in the blog.

On the other hand, some terms are much more common in the public archaeology. Compared to the corpus of archaeological reports, the public archaeology texts contain the term “circus” at five times the frequency. In the blog, “circus” has a z-score of 5.77 and a relatively stable skew of 1.79, compared to a minimal z-score of 0.69 and a volatile skew of 6.3 in the archaeological reports. A similar change occurs for the term “burial,” although to a lesser extent: from report to blog, the z-score rises from 0.25 to 0.86, and the skew drops from 3.84 to 3.65.

Terms with a high skew and a non-negligible z-score in the archaeological reports seem to be the most prevalent terms altogether in the public archaeology, while terms with a skew closer to zero in the reports disappear in the public archaeology: that is, the terms that appear infrequently but in large numbers in the reports are the ones selected for representation in the blog. This emphasises rare and exciting discoveries, such as the circus and large burials, while ignoring the more regular and consistent discoveries of pottery and bricks. For terms with high skew, there is a consistent rise in z-score and drop in skew in the incidences of the term between the archaeological and public corpora. For terms with a skew closer to zero, there is a consistent decline in z-score. The two trends that terms follow with regard to their relative frequencies between the two corpora can be defined as follows: low-skew terms, which tend to disappear, and significant-z-score/high-skew terms, which tend to be emphasised in the public archaeology.
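The two trends can be checked against the figures quoted in this piece with a short Python sketch. The z-scores and skews below are the ones reported above for the archaeological reports and the blog (the “medieval” figures appear later in the piece); the cut-off of 3.0 separating low-skew from high-skew terms, and all names in the code, are assumptions introduced only for illustration. No z-score threshold is applied, so the "non-negligible z-score" condition is not modelled here.

```python
from typing import NamedTuple


class TermStats(NamedTuple):
    report_z: float
    report_skew: float
    blog_z: float
    blog_skew: float


# Figures as quoted in the text for the reports and the blog.
TERMS = {
    "report":   TermStats(8.69, 0.93, -0.06, 9.87),
    "finds":    TermStats(6.43, 0.00, 0.42, 4.94),
    "site":     TermStats(8.81, 0.88, 3.83, 1.28),
    "pottery":  TermStats(5.26, 1.00, 0.25, 9.49),
    "brick":    TermStats(3.10, 2.17, -0.02, 9.56),
    "medieval": TermStats(3.42, 2.64, -0.09, 10.30),
    "circus":   TermStats(0.69, 6.30, 5.77, 1.79),
    "burial":   TermStats(0.25, 3.84, 0.86, 3.65),
}

# Assumption: an arbitrary cut-off chosen for this sketch, not a value from the analysis.
SKEW_THRESHOLD = 3.0


def trend(stats, threshold=SKEW_THRESHOLD):
    """Label a term by how it moves from the reports to the blog."""
    z_change = stats.blog_z - stats.report_z
    skew_change = stats.blog_skew - stats.report_skew
    if stats.report_skew < threshold and z_change < 0:
        return "low-skew, disappears"
    if stats.report_skew >= threshold and z_change > 0 and skew_change < 0:
        return "high-skew, emphasised"
    return "unclassified"


trends = {term: trend(stats) for term, stats in TERMS.items()}
for term, label in trends.items():
    print(f"{term:10s} {label}")
```

With this cut-off, every term quoted here falls into one of the two groups: “circus” and “burial” are emphasised, and the rest decline. A different threshold would move the boundary, so the grouping should be read as a restatement of the pattern rather than an independent test of it.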

Archaeology in the media seems, in most respects, to follow from the public archaeology rather than the archaeological reports. The media corpus contains articles about the archaeology of Colchester from sources ranging from local to national media, including the BBC, the Colchester Daily Gazette, the Essex County Standard, and the Independent, in addition to international archaeological publications. In these articles, “circus” has a low skew of 1.51, although its z-score, at 1.64, isn’t as overwhelmingly high as it is in the public archaeology. Still, it is much greater than the z-score of 0.69 for “circus” in the reports, and this lower z-score most likely reflects a greater lexical variety rather than a focus on other aspects of the archaeology, as this is the fifth-highest z-score in the entire media corpus. Even so, there is less emphasis on the circus here than in the blog.

Common to the public and media corpora is their near-complete removal of non-Roman archaeological terminology. The term “medieval” appears 1555 times in the archaeological corpus, with a z-score of 3.42 and a skew of 2.64. In the public corpus, the same term appears twice, with a z-score of -0.09 and a skew of 10.30. In the selection of news about the archaeology of Colchester, the term never appears. This follows the same trends of selection as the public archaeology: “medieval,” a low-skew term in the archaeological corpus, is ignored in favour of high-skew terms.

Although the media and public corpora contain writings about the same discoveries and use similar language, the frequency at which they do so differs. The media, unlike the blog, is unlikely to repeatedly write about the circus even when no new information is available. Rather, each media outlet seems to be inspired by the archaeological reports, but takes its information from the public archaeology. That is, instead of repeating the public archaeology, the media takes inspiration from the actual archaeological discovery, but takes its information about this archaeology from the blog rather than directly from the report.

Altogether, archaeological writing about Colchester appears to become much narrower over time. While the archaeological reports presumably reflect accurately what is found, the public archaeology and, in turn, the media do not. Instead, they focus on more marketable and exciting aspects of the archaeology: these can be recognised as the high-skew/high-z-score terms in the analysis. As a result, the particulars of the excavation, as well as the majority of findings, are de-emphasised; these are the low-skew terms. By the stage of public presentation, only a very narrow view of the archaeology of Colchester has been presented. It is almost exclusively monumental and Roman, and is at odds with the multiplicity of archaeological findings that are seen in the reports.


Archaeological Reports: http://voyant-tools.org/?corpus=1385952648533.7651

Public Archaeology: http://voyant-tools.org/?corpus=1385952090402.1310

Archaeology in Media: http://voyant-tools.org/?corpus=1385743429982.2427

Academic Archaeology: http://voyant-tools.org/?corpus=1385756548766.8274

All reports, blog posts, articles, papers, corpora, and a list of stopwords used are available at: https://www.dropbox.com/sh/kdj0ez8mwep0c7e/ZKViQxSG99.

