Archeology versus Archaeology versus #Blogarch

I’m working on a paper that maps the archaeological blogosphere. I thought this morning it might be good to take a quick detour into the Twitterverse.

Behold!

'archaeology' on twitter, april 7 2014


‘archaeology’ on twitter

Here we have every twitter username, connected by referring to each other in a tweet. There’s a seriously strong spine of tweeting, but it doesn’t make for a unified graph. The folks keeping this clump all together, measured by betweenness centrality:

pompeiiapp
arqueologiabcn
herculaneumapp
romanheritage
openaccessarch
cmount1
groovyhistorian
lornarichardson
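
For anyone who wants to poke at this kind of graph outside a GUI, here is a minimal sketch of the idea: usernames become nodes, a mention in a tweet becomes an edge, and betweenness centrality is computed with Brandes' algorithm. The tweets and usernames below are invented for illustration, and treating mentions as undirected edges is an assumption, not necessarily how the graphs above were built.

```python
import re
from collections import defaultdict, deque

# Hypothetical tweets as (author, text) pairs; these usernames are invented.
tweets = [
    ("alice", "Great post by @bob on #archaeology"),
    ("bob", "@carol have you seen this?"),
    ("dave", "RT @erin: new dig photos"),
]

MENTION = re.compile(r"@(\w+)")

def mention_graph(tweets):
    """Undirected graph: an edge links a tweeter to every account they mention."""
    graph = defaultdict(set)
    for author, text in tweets:
        author = author.lower()
        graph[author]  # touch the key so isolated tweeters still become nodes
        for name in MENTION.findall(text):
            name = name.lower()
            if name != author:
                graph[author].add(name)
                graph[name].add(author)
    return dict(graph)

def components(graph):
    """Connected components, found by breadth-first search."""
    seen, parts = set(), []
    for start in graph:
        if start in seen:
            continue
        part, queue = {start}, deque([start])
        while queue:
            for w in graph[queue.popleft()]:
                if w not in part:
                    part.add(w)
                    queue.append(w)
        seen |= part
        parts.append(part)
    return parts

def betweenness(graph):
    """Unnormalised betweenness centrality (Brandes) for an unweighted graph."""
    bc = dict.fromkeys(graph, 0.0)
    for s in graph:
        stack = []
        preds = {v: [] for v in graph}
        sigma = dict.fromkeys(graph, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # every undirected path was counted once from each end
    return {v: score / 2 for v, score in bc.items()}

graph = mention_graph(tweets)
print(len(components(graph)), "components")
for name, score in sorted(betweenness(graph).items(), key=lambda kv: -kv[1]):
    print(name, score)
```

A graph that "doesn't make for a unified graph" shows up here as more than one component; the accounts with the highest scores are the ones sitting on the most shortest paths between other accounts.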

top replied-to
hotrodngold
raymondsnoddy
colesprouse
1014retold
janell_elise
yorksarch
holleyalex
bonesbehaviours
uclu
illustreets

Top URLS:

http://bit.ly/1husSFB

http://phy.so/316076983

http://bit.ly/1sqHFu0

http://beasiswaindo.com/1796

https://www.dur.ac.uk/archaeology/conferences/current/babao2014/

http://wanderinggypsyvoyager.blogspot.com/2014/04/archaeology-two-day-search.html?spref=tw

http://www.thisiscolossal.com/2014/04/aerial-archaeology/

http://news.sciencemag.org/archaeology/2014/04/did-europeans-get-fat-neandertals

http://www.smartsurvey.co.uk/s/HadriansWall

http://ift.tt/PWRYrf

Top hashtags:
archaeology 325
Pompeii 90
fresco 90
Archaeology 77
Herculaneum 40
Israel 24
nowplaying 20
roman 18
newslocker 16
Roman 14
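
Notice that 'archaeology' and 'Archaeology' (and 'roman' and 'Roman') are counted separately in that list. A small sketch of folding case before counting, using only the ten counts listed above, so the merged totals only reflect those top-ten entries:

```python
from collections import Counter

# Counts copied from the top-hashtag list above.
raw_counts = {
    "archaeology": 325, "Pompeii": 90, "fresco": 90, "Archaeology": 77,
    "Herculaneum": 40, "Israel": 24, "nowplaying": 20, "roman": 18,
    "newslocker": 16, "Roman": 14,
}

# Merge tags that differ only by capitalisation.
folded = Counter()
for tag, n in raw_counts.items():
    folded[tag.casefold()] += n

for tag, n in folded.most_common():
    print(tag, n)
```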

Archeology

Let’s look at american archeology – as signified by the dropped ‘e’.

'archeology' on twitter, april 7


An awful lot more fragmented – less popular consciousness of archaeology-as-a-community?
Top by betweenness centrality – the ones who hold this together:
illumynous
archeologynow
youtube
heritagedaily
algenpfleger
riosallier
david328124
ogurek3
gold248131
leafenthusiast

Top urls:

http://ift.tt/1hN75Lp

http://wp.me/p4jAM9-1cZ

http://fav.me/d7d95kp

http://bit.ly/1qdaHLD

http://newszap.com

http://www.valencia953fm.com.ve

http://bit.ly/PS6hg4

http://goo.gl/fb/MfmNZ

http://goo.gl/fb/IfRnh

Top hashtags:
archeology
history
rome
ancient
easterisland
mystery
easter
slave
esoteric
egypt

Top replied-to
atheistlauren
nofaith313
faraishah
sebpatrick
swbts
thebiblestrue
animal
christofpierson
simba_83
andystacey

#Blogarch on twitter

twitter search '#blogarch' april 7 2014


And now, the archaeologists themselves, as indicated by #blogarch

We talk to ourselves – but with the nature of the hashtag, I suppose that’s to be expected?

Top by betweenness centrality
openaccessarch
drspacejunk
bonesdonotlie
archeowebby
drkillgrove
fieldofwork
archaeo_girl
brennawalks
ejarchaeology
yagumboya

top urls

http://zoharesque.blogspot.com/2014/03/space-age-archaeology-and-future-do-i.html?spref=tw

http://bit.ly/1gBkNin

http://campusarch.msu.edu/?p=2782

http://wp.me/p36umf-cW

http://www.poweredbyosteons.org/2014/03/blogging-bioarchaeology-where-do-we-go.html#.Uzm7zM8kJUw.twitter

http://ow.ly/3iVK4f

http://wp.me/p3Kfwu-cb

http://bit.ly/PCdEIE

http://wp.me/p1rKjz-V2

http://diggin-it-archaeology.blogspot.com/2014/04/my-future-in-blogging-archaeology.html

Top hashtags
blogarch
BlogArch
archaeology
saa2014
SAA2014
blogging
CRMArch
newslocker
crmarch

Top replied to
electricarchaeo (yay me!)

Top mentioned:
drspacejunk
bonesdonotlie
fieldofwork
openaccessarch
archeowebby
jsatgra
cmount1
archaeo_girl
capmsu
drkillgrove

Put them all together now…

And now, we put them all together to get ‘archaeology’ on the twitterverse today:

'archaeology, archeology, and #blogarch' on twitter, april 7


Visually, it’s apparent that the #blogarch crew are the ones tying together the wider twitter worlds of archaeology & archeology, though it’s still pretty fragmented. There’re 460 folks in this graph.

Top by betweenness centrality:

openaccessarch
drspacejunk
bonesdonotlie
archeowebby
drkillgrove
fieldofwork
jamvallmitjana
archaeo_girl
brennawalks
ejarchaeology

Top urls

http://zoharesque.blogspot.com/2014/03/space-age-archaeology-and-future-do-i.html?spref=tw
http://bit.ly/1gBkNin
http://www.poweredbyosteons.org/2014/03/blogging-bioarchaeology-where-do-we-go.html#.Uzm7zM8kJUw.twitter
http://campusarch.msu.edu/?p=2782
http://wp.me/p4jAM9-1cZ
http://fav.me/d7d95kp
http://wp.me/p1rKjz-V2
http://diggin-it-archaeology.blogspot.com/2014/04/my-future-in-blogging-archaeology.html
http://bonesdontlie.wordpress.com/2014/04/01/the-future-of-blogging-for-bones-dont-lie/
http://soundcloud.com/vrecordings/l-side-andrezz-archeology-v

top hashtags (not useful, given the nature of the search, right? But anyway)

blogarch
archeology
archaeology
BlogArch
history
ancient
easterisland
mystery
easter
slave

Top word pairs in those largest groups:

archeology,professor 30
started,yesterday 21
yesterday,battle 21
battle,towton 21
towton,weapon 21
weapon,tests 21
tests,forensic 21
forensic,archeology 21
museum,archeology 19
blogging,archaeology 17

second group:
blogging,archaeology 13
future,blogging 12
archaeology,go 7
archaeology,future 7
archaeology,final 6
final,review 6
review,blogarch 6
hopes,dreams 6
dreams,fears 6
fears,blogging 6

third group:
space,age 6
age,archaeology 6
archaeology,future 6
future,know 6
know,going 6
saa2014,blogarch 6
going,blogarch 5
blogarch,post 3
post,future 3
future,blogging 3

fourth group:
easterisland,ancient 10
ancient,mystery 10
mystery,easter 10
easter,slave 10
slave,history 10
history,esoteric 10
esoteric,archeology 10
archeology,egypt 10
rt,illumynous 9
illumynous,easterisland 9

fifth group:
costa,rica 8
rt,archeologynow 7
archeologynow,modern 4
modern,archeology 4
archeology,researching 4
researching,dive 4
dive,bars 4
bars,costa 4
rica,costa 4
rica,star 4

(once I saw ‘bars’, I stopped. Archaeological stereotypes, maybe).
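
Word pairs like the ones above are just adjacent words counted across tweets. A minimal sketch, assuming lower-casing and a small stopword list applied before pairing (which would explain why 'battle,towton' appears as a pair). The two sample tweets are stitched together from the pairs listed above rather than being real tweets, and the stopword list is a placeholder.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "of", "at", "in", "on", "and", "to", "is"}

def word_pairs(texts):
    """Count adjacent word pairs after lower-casing and dropping stopwords."""
    pairs = Counter()
    for text in texts:
        words = [w for w in re.findall(r"[a-z0-9_]+", text.lower())
                 if w not in STOPWORDS]
        pairs.update(zip(words, words[1:]))
    return pairs

sample = [
    "Started yesterday: Battle of Towton weapon tests",
    "RT Battle of Towton weapon tests in forensic archeology",
]

pairs = word_pairs(sample)
for (first, second), n in pairs.most_common(5):
    print(f"{first},{second} {n}")
```

Long runs of pairs with identical counts (21, 21, 21…) are usually the signature of one tweet being retweeted many times.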

Top mentioned in the entire graph

illumynous 9 bonesdonotlie 8
drspacejunk 8 drkillgrove 4
bonesdonotlie 8 capmsu 4
archeologynow 7 yagumboya 3
openaccessarch 7 drspacejunk 3
macbrunson 6 archeowebby 3
swbts 6 allarchaeology 3
archeowebby 6 openaccessarch 3
algenpfleger 5 cmount1 3
youtube 5 brennawalks 2

So what does this all mean? Answers on a postcard, please…

(My network files will be on figshare.com eventually).

Quickly Extracting Data from PDFs

By ‘data’, I mean the tables. There are lots of archaeological articles out there that you’d love to compile together to do some sort of meta-study. Or perhaps you’ve gotten your hands on pdfs with tables and tables of census data. Wouldn’t it be great if you could just grab that data cleanly? Jonathan Stray has written a great synopsis of the various things you might try and has sketched out a workflow you might use. Having read that, I wanted to try ‘Tabula’, one of the options that he mentioned. Tabula is open source and runs on all the major platforms. You simply download it and double-click on the icon; it runs within your browser. You load your pdf into it, and then draw bounding boxes around the tables that you want to grab. Tabula will then extract that table cleanly, allowing you to download it as a csv or tab separated file, or paste it directly into something else.

For instance, say you’re interested in the data that Gill and Chippindale compiled on Cycladic Figures. You can grab the pdf from JSTOR:

Material and Intellectual Consequences of Esteem for Cycladic Figures
David W. J. Gill and Christopher Chippindale
American Journal of Archaeology , Vol. 97, No. 4 (Oct., 1993) , pp. 601-659
Article DOI: 10.2307/506716

Download it, and then feed it into Tabula. Let’s look at table 2.

gillchippendaletable2
You could just highlight this table in your pdf reader and hit ctrl+c to copy it; when you paste that into your browser, you’d get:
gillchippendaletable2cutnpaste
Everything in a single column. For a small table, maybe that’s not such a big deal. But let’s look at what you get with Tabula. You drag the square over that same table; when you release the mouse button you get:
tabula1
Much, much cleaner & faster! I say ‘faster’, because you can quickly drag the selection box around every table and hit download just the one time. Open the resulting csv file, and you have all of your tables in a useful format:
tabula2
But wait, there’s more! Since you can copy directly to the clipboard, you can paste directly into a google drive spreadsheet (thus taking advantage of all the visualization options that Google offers) or into something like Raw from Density Design.
Tabula is a nifty little tool that you’ll probably want to keep handy.
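
Once the csv is on disk, getting it into a script is a few lines. A minimal sketch with Python's standard csv module; the header and values here are invented stand-ins, not the contents of the Gill and Chippindale table, and an in-memory string stands in for the downloaded file.

```python
import csv
import io

# Stand-in for a csv exported by Tabula; header and values are invented.
exported = io.StringIO(
    "Category,Count\n"
    "Type A,12\n"
    "Type B,7\n"
)

# Each row becomes a dict keyed by the header line.
rows = list(csv.DictReader(exported))
total = sum(int(row["Count"]) for row in rows)

for row in rows:
    print(row["Category"], row["Count"])
print("total", total)
```

With a real export you would replace the StringIO object with `open("your-file.csv", newline="")`, and check the numeric columns for stray characters before converting them.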

Mapping the Web in Real Time

I don’t think I’ve shared my workflow before for mapping the structure of a webcrawl. After listening to Sebastian Heath speak at #dapw it occurred to me that it might be useful for, inter alia, linked open data type resources. So, here’s what you do (and my example draws from this year’s SAA 2014 blogging archaeology session blog-o-sphere):

1. install the http graph generator from the gephi plugin marketplace.

2. download the navicrawler + firefox portable zip file at the top of this page.

3. make sure no other instance of firefox is open. Open firefox portable. DO NOT click the ‘update firefox’ button, as this will make navicrawler unusable.

4. Navicrawler can be used to download or scrape the web. In the navicrawler window, click on the (+) to select the ‘crawl’ pane. This will let you set how deep and how far to crawl. Under the ‘file’ tab, you can save all of what you crawl in various file formats. With the httpgraph plugin for Gephi however, we will simply ‘listen’ to the browser and render the graph in real time.

5. The first time you run firefox portable, you will need to configure a manual proxy. Do this by going to tools >> options >> network >> settings. Set the manual proxy configuration for http to 127.0.0.1 and the port to 8088. Click ‘ok’.

If you tried loading a webpage at this point, you’d get an error. To resolve this, you need to tell Gephi to connect to that port as well, and then web traffic will be routed correctly.

6. Open Gephi. Select new project. Under ‘generate’, select ‘http graph’. This will open a dialogue box asking for the port number. Enter 8088.

7. Over in Firefox portable, you can now start a websearch or go to the page from which you wish to crawl. For instance, you could put in the address bar, http://dougsarchaeology.wordpress.com/2013/11/05/blogging-archaeology/. Over in gephi, you will start to see a number of nodes and edges appearing. In the ‘crawl’ window in Navicrawler, set ‘max depth’ to 1, ‘crawl distance’ to 2, and ‘tabs count’ to 25. Then hit the ‘start’ button. Your Gephi window will now begin to fill with the structure of the internet. There are 4 types of nodes: client, uri, host, and domain. For our purposes here, we will want to filter the resulting graph to hide most of the architecture of the web and just show the URIs. (This by the way could be very useful for visualizing archaeological resources organized via Linked Open Data principles).

Your crawl can run for quite some time. I was running the crawl described above for around 10 minutes when it crashed on me. The resulting gephi file (which has 5374 nodes and 14993 edges) can be downloaded from my space on figshare. For the illustration below, I filtered the ‘content-type’ for ‘text/html’, to present the structure of the human readable archaeo-blog-o-sphere as represented by Doug’s Blogging Archaeology Carnival.

The view from Doug's place
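
The same ‘text/html’ filter can be sketched outside Gephi, if you have saved the crawl as a list of records. This assumes each record carries a URI and a content-type string; the record layout, the stylesheet URL, and the content-type values are all invented for illustration. The two page URLs are ones mentioned in these posts.

```python
from urllib.parse import urlsplit

# Hypothetical crawl records; field names and content types are assumptions.
records = [
    {"uri": "http://dougsarchaeology.wordpress.com/2013/11/05/blogging-archaeology/",
     "content_type": "text/html; charset=UTF-8"},
    {"uri": "http://dougsarchaeology.wordpress.com/style.css",  # invented URL
     "content_type": "text/css"},
    {"uri": "http://campusarch.msu.edu/?p=2782",
     "content_type": "text/html"},
]

def is_html(record):
    """True when the media type, ignoring parameters like charset, is text/html."""
    media_type = record["content_type"].split(";")[0].strip().lower()
    return media_type == "text/html"

# Keep only the human-readable pages, then collapse URIs to their hosts.
html_uris = [r["uri"] for r in records if is_html(r)]
hosts = sorted({urlsplit(uri).hostname for uri in html_uris})

print(html_uris)
print(hosts)
```

Collapsing URIs to hosts is a rough equivalent of moving from the ‘uri’ node type up to the ‘host’ node type in the crawl graph.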

Gaze & Eonydis for Archaeological Data

I’m experimenting with Clement Levallois’ data mining tools ‘Gaze’ and ‘Eonydis’. I created a table with some mock archaeological data in it: artefact, findspot, and date range for the artefact. More on dates in a moment. Here’s the fake dataset.

Firstly, Gaze will take a list of nodes (source, target), and create a network where the source nodes are connected to each other by virtue of sharing a common target. Clement explains:

Paul,dog
Paul, hamster
Paul,cat
Gerald,cat
Gerald,dog
Marie,horse
Donald,squirrel
Donald,cat
… In this case, it is interesting to get a network made of Paul, Gerald, Marie and Donald (sources nodes), showing how similar they are in terms of pets they own. Make sure you do this by choosing “directed networks” in the parameters of Gaze. A related option for directed networks: you can choose a minimum number of times Paul should appear as a source to be included in the computations (useful to filter out unfrequent, irrelevant nodes: because you want only owners with many pets to appear for instance).
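
To make the idea concrete, here is a minimal sketch of that kind of projection using Clement's pets example. The edge weight here is simply the number of shared targets; Gaze's own similarity measure may well differ, and the function name and `min_occurrences` parameter are my own labels, not Gaze's.

```python
from collections import defaultdict
from itertools import combinations

# The (source, target) list from the example above.
pairs = [
    ("Paul", "dog"), ("Paul", "hamster"), ("Paul", "cat"),
    ("Gerald", "cat"), ("Gerald", "dog"),
    ("Marie", "horse"),
    ("Donald", "squirrel"), ("Donald", "cat"),
]

def project_sources(pairs, min_occurrences=1):
    """Link two sources when they share a target; weight = number shared."""
    targets_of = defaultdict(set)
    for source, target in pairs:
        targets_of[source].add(target)
    # Drop sources with fewer distinct targets than the threshold.
    kept = sorted(s for s, t in targets_of.items() if len(t) >= min_occurrences)
    edges = {}
    for a, b in combinations(kept, 2):
        shared = targets_of[a] & targets_of[b]
        if shared:
            edges[(a, b)] = len(shared)
    return edges

edges = project_sources(pairs)
for (a, b), weight in edges.items():
    print(a, b, weight)
```

Swap owners and pets for findspots and artefact types and you get a network of findspots linked by the kinds of things found at them. Marie, who shares no pet with anyone, drops out as an isolate.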

The output is in a nodes.dl file and an edges.dl file. In Gephi, go to the import spreadsheet button on the data table, import the nodes file first, then the edges file. Here’s the graph file.

Screenshot, Gaze output into Gephi, from mock archaeo-data


Eonydis on the other hand takes that same list and if it has time-stamps within it (a column with dates), will create a dynamic network over time. My mock dataset above seems to cause Eonydis to crash – is it my negative numbers? How do you encode dates from the Bronze Age in the day/month/year system? Checking the documentation, I see that I didn’t have proper field labels, so I needed to fix that. Trying again, it still crashed. I fiddled with the dates to remove the range (leaving a column to imply ‘earliest known date for this sort of thing’), which gave me this file.

Which still crashed. Now I have to go do some other stuff, so I’ll leave this here and perhaps one of you can pick up where I’ve left off. The example file that comes with Eonydis works fine, so I guess when I return to this I’ll carefully compare the two. Then the task will be to work out how to visualize dynamic networks in Gephi. Clement has a very good tutorial on this.

Postscript:

Ok, so I kept plugging away at it. I found if I put the dates yyyy-mm-dd, as in 1066-01-23 then Eonydis worked a treat. Here’s the mock data and here’s the gexf.

And here’s the dynamic animation! http://screencast.com/t/Nlf06OSEkuA

Post post script:

I took the mock data (archaeo-test4.csv) and concatenated a – in front of the dates, thus -1023-01-01 to represent dates BC. In Eonydis, where it asks for the date format, I tried this:

#yyyy#mm#dd  which accepted the dates, but dropped the negative;

-yyyy#mm#dd, which accepted the dates and also dropped the negative.

Thus, it seems to me that I can still use Eonydis for archaeological data, but I should frame my date column in relative terms rather than absolute, as absolute isn’t really necessary for the network analysis/visualization anyway.
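
One way to frame dates in relative terms is to shift every year so that the earliest one becomes year 1, and then write them out in the yyyy-mm-dd form that worked. A minimal sketch; the function name is hypothetical, only -1023 comes from the test above, and whether Eonydis is happy with years as small as 0001 is untested here.

```python
def to_relative_dates(years, base_year=1):
    """Shift signed years (negative = BC) so the earliest becomes base_year."""
    offset = base_year - min(years)
    return [f"{year + offset:04d}-01-01" for year in years]

# -1023 is the date from the test above; the other two years are invented.
years = [-1023, -950, -800]
relative = to_relative_dates(years)

for original, shifted in zip(years, relative):
    print(original, "->", shifted)
```

The intervals between events are preserved, which is all a dynamic network needs. Strings are used deliberately, since Python's own datetime type cannot represent years before 1 either.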

Exploring Trends in Archaeology: Professional, Public, and Media Discourses

The following is a piece by Joe Aitken, a student in my CLCV3202a Roman Archaeology for Historians class at Carleton University. His slides may be found here. I asked Joe if I could share his work with the wider world, because I thought it an interesting example of using simple text analysis to explore broader trends in public archaeology. Happily, he said yes.

Exploring Trends in Archaeology: Professional, Public, and Media Discourses

An immense shift in content and terminology emerges when analysing the text of several documents relating to the archaeology of Colchester, as information grows from its genesis as an archaeological report, through the stage of public archaeology, and finally to mass media. Many inconsistencies emerge as the form in which archaeological information is presented changes.

This analysis was done with the help of Voyant Tools, “a web-based text analysis environment.”[1] Z-score, representing the number of standard deviations above the mean at which each term appears, will be used as the basic marker of frequency. Skew, “A measure of the asymmetry of relative frequency values for each document in the corpus,”[2] will also be used. Having a skew close to zero suggests that the term appears with relative consistency throughout the documents. This means that in comparison to, for example, “piggery,” with a skew of 11, terms with a low skew are not only frequent in the corpus as a whole, but are prevalent in many of the documents that make up the corpus.
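
For readers unfamiliar with the two measures, here is a minimal sketch of how a z-score and a skew of this general kind can be computed. These are the textbook population formulas; Voyant's exact calculations are not given in this piece and may differ. All counts and frequencies below are invented.

```python
from statistics import mean, pstdev

def z_scores(term_counts):
    """Standard deviations above the mean count, for each term in a corpus.

    Assumes the counts are not all identical (otherwise the deviation is zero).
    """
    counts = list(term_counts.values())
    mu, sigma = mean(counts), pstdev(counts)
    return {term: (n - mu) / sigma for term, n in term_counts.items()}

def skewness(values):
    """Population skewness of a term's relative frequency across documents."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return 0.0  # perfectly even distribution
    return mean(((v - mu) / sigma) ** 3 for v in values)

# Invented raw counts for three terms in one corpus.
z = z_scores({"site": 50, "pottery": 20, "piggery": 2})

# Invented relative frequencies across four documents.
even_term = [0.01, 0.01, 0.01, 0.01]   # appears steadily everywhere
bursty_term = [0.0, 0.0, 0.0, 0.04]    # appears in one document only

print(z)
print(skewness(even_term), skewness(bursty_term))
```

A term spread evenly across documents gets a skew of zero; a term concentrated in a single document gets a large positive skew, which is the "piggery" pattern described above.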

A text analysis of Colchester Archaeological Trust Reports 585-743 (February 2011 to 22nd October 2013)[3] is the basis of this comparison. Dominant in this corpus are terms related to archaeological excavations. The term “report” has a z-score of 8.69, “finds” has a z-score of 6.43, and “site” has a z-score of 8.81. The same terms, respectively, have skews of 0.93, 0, and 0.88. Another relatively consistent term is “pottery,” which has a skew of 1 and a z-score of 5.26. “Brick”, with a skew of 2.17 and a z-score of 3.1, is similarly consistent.

The relevance of these figures becomes clearer upon a comparison with the public archaeological writings as they appear on the Colchester Archaeologist blog. The blog exists on the public-facing website of the Colchester Archaeological Trust, and has been blogging about its archaeological discoveries since 2011. This analysis will use the Voyant-Tools difference function, which returns a value based on a comparison between the z-scores of two corpora,[4] as well as a direct comparison of the z-score and skew of each term between the two corpora.

Some of the most consistent terms from the archaeological corpus appear very infrequently in the public archaeology. “Pottery” has a skew of 9.49 and a z-score of 0.25, and appears at about 1/5 of the frequency that it does in the reports. “Brick” similarly disappears: in the public archaeology, it has a skew of 9.56 and a z-score of -0.02, compared to a skew of 2.17 and a z-score of 3.1 in the archaeological reports.

Terms relating to the excavation also disappear. “Finds,” which in the archaeological reports has a skew of 0 and a z-score of 6.43, has a skew of 4.94 and a z-score of 0.42 in the public archaeology. “Report” similarly changes from a skew of 0.93 to 9.87, with its z-score dropping from 8.69 to -0.06. “Site” follows this trend to a lesser extent, although this is likely due to it appearing in the public archaeology in the context of “website,” rather than as an archaeological term. Still, the shifts in z-score and skew are significant, and in the same direction: an archaeological z-score of 8.81 to a public z-score of 3.83, and an archaeological skew of 0.88 to a public skew of 1.28. In each case, these commonly used terms from the archaeological reports appeared less frequently and less consistently in the blog.

On the other hand, some terms are much more common in the public archaeology. Compared to the corpus of archaeological reports, the public archaeology texts contain the term “circus” at 5 times the frequency. In the blog, “circus” has a z-score of 5.77 and a relatively stable skew of 1.79, compared to a minimal z-score of 0.69 and a volatile skew of 6.3 in the archaeological reports. A similar change occurs to the term “burial,” although to a lesser extent: from report to blog, the z-score rises from 0.25 to 0.86, and the skew drops from 3.84 to 3.65.

Terms with a high skew and a non-insignificant z-score in the archaeological reports seem to be the most prevalent terms altogether in the public archaeology, while terms with a skew closer to zero in the reports disappear in the public archaeology: that is, the terms that appear infrequently but in large numbers in the reports are the ones selected for representation in the blog. This emphasises rare and exciting discoveries, such as the circus and large burials, while ignoring the more regular and consistent discoveries of pottery and bricks. For terms with high skew, there is a consistent rise in z-score and drop in skew in the incidences of the term between the archaeological and public corpora. For terms with a skew closer to zero, there is a consistent decline in z-score. The two trends that terms follow with regards to their relative frequencies between the two corpora can be defined as follows: low-skew terms, which tend to disappear, and significant-z-score/high skew terms, which tend to be emphasised in the public archaeology.
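
The two trends can be checked mechanically against the figures quoted so far. A minimal sketch: the numbers are the ones given in the paragraphs above, while the `trend` function and its labels are only an illustrative reading of the argument, not part of the original analysis.

```python
# term: (report z-score, report skew, blog z-score, blog skew), as quoted above
terms = {
    "report":  (8.69, 0.93, -0.06, 9.87),
    "finds":   (6.43, 0.00, 0.42, 4.94),
    "site":    (8.81, 0.88, 3.83, 1.28),
    "pottery": (5.26, 1.00, 0.25, 9.49),
    "brick":   (3.10, 2.17, -0.02, 9.56),
    "circus":  (0.69, 6.30, 5.77, 1.79),
    "burial":  (0.25, 3.84, 0.86, 3.65),
}

def trend(report_z, report_skew, blog_z, blog_skew):
    """Classify the direction a term moves between the reports and the blog."""
    if blog_z > report_z and blog_skew < report_skew:
        return "emphasised"
    if blog_z < report_z and blog_skew > report_skew:
        return "de-emphasised"
    return "mixed"

trends = {term: trend(*values) for term, values in terms.items()}

for term, (report_z, _, blog_z, _) in terms.items():
    print(f"{term:8s} z change {blog_z - report_z:+.2f}  {trends[term]}")
```

On these figures the five excavation terms all move one way and “circus” and “burial” move the other, with no term landing in the "mixed" category.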

Archaeology in the media seems to mostly follow from the public archaeology rather than the archaeological reports on most aspects. The media corpus contains articles about the archaeology of Colchester from sources ranging from local to national media, including the BBC, the Colchester Daily Gazette, the Essex County Standard, and the Independent, in addition to international Archaeological publications. In these articles, “circus” has a low skew of 1.51, although its z-score isn’t as overwhelmingly high as it is in the public archaeology at 1.64. Still, it is much greater than the z-score of 0.69 for “circus” in the reports, and this z-score most likely reflects a greater lexical variety rather than a focus on other aspects of the archaeology, as this is the fifth-highest z-score in the entire media corpus. Still, there is less emphasis on the circus here than in the blog.

In common between the public and media corpora is their near complete removal of non-Roman archaeological terminology. The term “medieval” appears 1555 times in the archaeological corpus, with a z-score of 3.42 and a skew of 2.64. In the public corpus, the same term appears twice, with a z-score of -0.09 and a skew of 10.30. In the selection of news about the archaeology of Colchester, the term never appears. This follows the same trends of selection as the public archaeology: “medieval,” a low-skew term in the archaeological corpus, is ignored in favour of high-skew terms.

Although the media and public corpora contain writings about the same discoveries and use similar language, the frequency at which they do so differs. The media, unlike the blog, is unlikely to repeatedly write about the circus even when no new information is available. Rather, each media outlet seems to be inspired by the archaeological reports, but takes its information from the public archaeology. That is, instead of repeating the public archaeology, the media takes inspiration from the actual archaeological discovery, but takes its information about this archaeology from the blog rather than directly from the report.

Altogether, archaeological writing about Colchester appears to become much narrower over time. While the archaeological reports presumably reflect accurately what is found, the public archaeology and, in turn, the media do not. Instead, they focus on more marketable and exciting aspects of the archaeology: these can be recognized as the high-skew/high-z-score terms in the analysis. As a result, the particulars of the excavation, as well as the majority of findings, are de-emphasised; these are the low-skew terms. By the stage of public presentation, only a very narrow view of the archaeology of Colchester has been presented. It is almost exclusively monumental and Roman, and is at odds with the multiplicity of archaeological findings that are seen in the reports.

Corpora

Archaeological Reports: http://voyant-tools.org/?corpus=1385952648533.7651

Public Archaeology: http://voyant-tools.org/?corpus=1385952090402.1310

Archaeology in Media: http://voyant-tools.org/?corpus=1385743429982.2427

Academic Archaeology: http://voyant-tools.org/?corpus=1385756548766.8274

All reports, blog posts, articles, papers, corpora, and a list of stopwords used are available at: https://www.dropbox.com/sh/kdj0ez8mwep0c7e/ZKViQxSG99.

Professional Bibliography

“Colchester Archaeological Trust – Online Report Library.” CAT Reports 585-743. http://cat.essex.ac.uk/all-reports.html

Public Bibliography

“News | The Colchester Archaeologist.” All posts since 2013-11-30. http://www.thecolchesterarchaeologist.co.uk/?cat=11

Media Bibliography

Anonymous. “Colchester dig uncovers ‘spearmen’ skeletons.” BBC, 18 April 2011.

—-. “Colchester Roman circus visitor centre a step closer.” BBC, 14 May 2012.

—-. “Roman ruins to go on display as part of new restaurant.” Essex County Standard, 31 December 2012.

—-. “Colchester archaeology shares in £250,000 funding boost.” Daily Gazette, 27 March 2013.

—-. “2,000-Year-Old Warrior Grave & Spears Unearthed.” Archaeology, 18 September 2013.

Brading, Wendy. “Roman history all set to be revealed.” Daily Gazette, 19 June 2012.

—-. “Excavations to find out Colchester life – Roman style.” Daily Gazette, 11 October 2012.

—-. “Experts discover new Roman graves.” Daily Gazette, 16 January 2013.

—-. “Warrior grave found in excavation.” Essex County Standard, 16 September 2013.

Calnan, James. “Archaeologists discover 900-year-old-abbey.” Daily Gazette, 22 February 2011.

—-. “Uncovered: The remains of two Roman soldiers.” Daily Gazette, 14 April 2011.

—-. “Colchester Archaeological Trust unearths English Civil War star fort.” Daily Gazette, 26 August 2011.

—-. “Roman Circus site may open next summer.” Daily Gazette, 16 December 2011.

Cox, James. “Roman road found beneath the southwell arms.” Daily Gazette, 30 July 2012.


[1] “Voyeur Tools: See Through Your Texts,” http://hermeneuti.ca/voyeur

[2] Mouseover text.

[3] “Colchester Archaeological Trust,” http://cat.essex.ac.uk/all-reports.html.

[4] Brian Croxall, “Comparing Corpora in Voyant Tools.” http://www.briancroxall.net/2012/07/18/comparing-corpora-in-voyant-tools/.

Visualizing texts using Overview

I’ve come across an interesting tool called ‘Overview’. It’s meant for journalists, but I see no reason why it can’t serve historical/archaeological ends as well. It does recursive adaptive k-means clustering rather than topic modeling, as I’d initially assumed (more on process here). You can upload texts as pdfs or within a table. One of the columns in your table could be a ‘tags’ column, whereby – for example – you indicate the year in which the entry was made (if you’re working with a diary). Then, Overview sorts your documents or entries into nested folders of similarity. You can then see how your tags – decades – play out across similar documents. In the screenshot below, I’ve fed the text of ca 600 historical plaques into Overview:

Overview divides the historical plaques, at the broadest level of similarity, into the following groups:

‘church, school, building, toronto, canada, street, first, house, canadian, college’ (545 plaques),

‘road, john_graves, humber, graves_simcoe, lake, river, trail, plant’ (41 plaques)

‘community’ with ‘italian, north_york,  lansing, store, shepard, dempsey, sheppard_avenue’, 13 documents

‘: years’ with ‘years_ago, glacier, ice, temperance, transported, found, clay, excavation’, 11 documents.

That’s interesting information to know. In terms of getting the info back out, you can export a spreadsheet with tags attached. Within Overview, you might want to tag all documents together that sort into similar groupings, which you could then visualize with some other program. You can also search documents, and tag them manually. I wondered how plaques concerned with ‘children’, ‘women’, ‘agriculture’, ‘industry’, etc might play out, so I started using Overview’s automatic tagger (search for a word or phrase, apply that word or phrase as a tag to everything that is found). One could then visually explore the way various tags correspond with particular folders of similar documents (as in this example). That first broad group of ‘church school building canada toronto first york house street canadian’ just is too darned big, and so my tagging is hidden (see the image)- but it does give you a sense that the historical plaques in Toronto really are concerned with the first church, school, building, house, etc in Toronto (formerly, York). Architectural history trumps all. It would be interesting to know if these plaques are older than the other ones: has the interest in spaces/places of history shifted over time from buildings to people? Hmmm. I’d better check my topic models, and do some close reading.

Anyway, leaving that aside for now, I exported my tagged texts, and did a quick and dirty network visualization of tags connected to other tags by virtue of shared plaques. I only did this for 200 of the plaques, because, frankly, it’s Friday evening and I’d like to go home.
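
The quick and dirty step can be sketched like this: every pair of tags that lands on the same plaque becomes an edge, weighted by how many plaques the pair shares, written out with the Source/Target/Weight headers that Gephi's spreadsheet importer recognises. The plaque ids and tags here are invented, and the layout of Overview's actual export is not assumed.

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Hypothetical tagged export: plaque id -> tags applied (invented values).
plaque_tags = {
    "plaque-1": ["church", "school"],
    "plaque-2": ["church", "industry", "women"],
    "plaque-3": ["industry", "women"],
}

# Count how many plaques each pair of tags shares.
weights = Counter()
for tags in plaque_tags.values():
    for a, b in combinations(sorted(set(tags)), 2):
        weights[(a, b)] += 1

# Write an edge list; an in-memory buffer stands in for a file on disk.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["Source", "Target", "Weight"])
for (a, b), weight in sorted(weights.items()):
    writer.writerow([a, b, weight])

print(out.getvalue())
```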

Here’s what I saw [pdf version]:

visualizing-tags-via-overview

So a cluster with ‘elderly’, ‘industry’, ‘doctor’, ‘medical’, ‘woman’…. I don’t think this visualization that I did was particularly useful.

Probably, it would be better to generate tags that collect everything together in the groups that the tree visualization in Overview generates, export that, and visualize as some kind of dendrogram. It would be good if the groupings could be exported without having to do that though.

Getting Historical Network Data into Gephi

I’m running a workshop next week on getting started with networks & gephi. Below, please find my first pass at a largely self-directed tutorial. This may eventually get incorporated into the Macroscope.

Data files for this tutorial may be found here. There’s a pdf/pptx with the images below, too.

The data for this exercise comes from Peter Holdsworth’s MA dissertation research, which Peter shared on Figshare here. Peter was interested in the social networks surrounding ideas of commemoration of the centenary of the War of 1812, in 1912. He studied the membership rolls for women’s service organizations in Ontario both before and after that centenary. By making his data public, Peter enables others to build upon his own research in a way not commonly done in history. (Peter can be followed on Twitter at https://twitter.com/P_W_Holdsworth).

On with the show!

Download and install Gephi. (What follows assumes Gephi 0.8.2). You will need the MultiMode Projection plugin installed.

To install the plugin – select Tools >> Plugins  (across the top of Gephi you’ll see ‘File Workspace View Tools Window Plugins Help’. Don’t click on this ‘plugins’. You need to hit ‘tools’ first. Some images would be helpful, eh?).

In the popup, under ‘available plugins’ look for ‘MultimodeNetworksTransformation’. Tick this box, then click on Install. Follow the instructions, ignore any warnings, click on ‘finish’. You may or may not need to restart Gephi to get the plugin running. If you suddenly see on the far right of the Gephi window a new tab beside ‘statistics’, ‘filters’, called ‘Multimode Network’, then you’re ok.

Slide1

Getting the Plugin

Assuming you’ve now got that sorted out,

1. Under ‘file’, select -> New project.
2. On the data laboratory tab, select Import-spreadsheet, and in the pop-up, make sure to select ‘EDGES table’ under ‘as table’. Select women-orgs.csv. Click ‘next’, click finish.

(On the data table, have ‘edges’ selected. This is showing you the source and the target for each link (aka ‘edge’). This implies a directionality to the relationship that we just don’t know – so down below, when we get to statistics, we will always have to make sure to tell Gephi that we want the network treated as ‘undirected’. More on that below.)

Slide2

Loading your csv file, step 1.

Slide3

Loading your CSV file, step 2

3. Click on ‘copy data to other column’. Select ‘Id’. In the pop-up, select ‘Label’
4. Just as you did in step 2, now import NODES (Women-names.csv)

(nb. You can always add more attribute data to your network this way, as long as you always use a column called Id so that Gephi knows where to slot the new information. Make sure to never tick off the box labeled ‘force nodes to be created as new ones’.)

Adding new columns


5. Copy ID to Label
6. Add new column, make it boolean. Call it ‘organization’

Filtering & ticking off the boxes


7. In the Filter box, type [a-z], and select Id – this filters out all the women.
8. Tick off the check boxes in the ‘organization’ columns.

Save this as ‘women-organizations-2-mode.gephi’.
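If you’d rather see what those import steps amount to outside of Gephi, here is a minimal sketch in Python. It assumes the edge table has ‘Source’ and ‘Target’ columns, with women in the first and organizations in the second; the sample rows are made up and stand in for women-orgs.csv, whose actual contents may differ.

```python
import csv
import io

# Hypothetical stand-in for women-orgs.csv; the real file's column
# names and rows are assumptions for the sake of illustration.
SAMPLE_EDGES = """Source,Target
Woman A,Org X
Woman B,Org X
Woman B,Org Y
Woman C,Org Y
"""


def load_two_mode(handle):
    """Read a Source,Target edge table into nodes and undirected edges.

    Assumes women sit in the Source column and organizations in Target.
    """
    nodes = {}     # node id -> {"Label": ..., "organization": bool}
    edges = set()  # frozenset pairs, so direction is ignored
    for row in csv.DictReader(handle):
        woman, org = row["Source"].strip(), row["Target"].strip()
        # Label is copied from the Id, as in the 'copy data' step.
        nodes.setdefault(woman, {"Label": woman, "organization": False})
        # The boolean column marks which mode a node belongs to.
        nodes.setdefault(org, {"Label": org, "organization": True})
        edges.add(frozenset((woman, org)))
    return nodes, edges


nodes, edges = load_two_mode(io.StringIO(SAMPLE_EDGES))
```

Storing each edge as an unordered pair is the code equivalent of telling Gephi to treat the network as ‘undirected’.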

Now, we want to explore how women are connected to other women via shared membership.

Setting up the transformation.

Make sure you have the Multimode networks projection plugin installed.

On the multimode networks projection tab,
1. click load attributes.
2. in ‘attribute type’, select organization
4. in left matrix, select ‘false – true’ (or ‘null – true’)
5. in right matrix, select ‘true – false’. (or ‘true – null’)
(do you see why this is the case? what would selecting the inverse accomplish?)

6. select ‘remove edges’ and ‘remove nodes’.

7. Hit ‘run’. The organizations will be removed from your bipartite network, leaving you with a single-mode network.

8. save as ‘women to women network.csv’

…you can reload your ‘women-organizations-2-mode.gephi’ file and re-run the multimode networks projection so that you are left with an organization to organization network.
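As far as I understand it, the left and right matrix choice boils down to multiplying the membership table by its own transpose: two women get linked once for every organization they share. A minimal sketch of that idea in plain Python follows; the function name `project` and the sample memberships are hypothetical, and this is not the plugin’s own code.

```python
from collections import defaultdict
from itertools import combinations


def project(memberships, keep="women"):
    """Collapse (woman, organization) pairs onto a single mode.

    Returns {(a, b): weight}, where weight counts the neighbours that
    a and b share in the mode being removed.
    """
    groups = defaultdict(set)
    for woman, org in memberships:
        if keep == "women":
            groups[org].add(woman)   # group women by organization
        else:
            groups[woman].add(org)   # group organizations by woman
    weights = defaultdict(int)
    for members in groups.values():
        for a, b in combinations(sorted(members), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Made-up memberships for illustration only.
memberships = [
    ("Woman A", "Org X"),
    ("Woman B", "Org X"),
    ("Woman A", "Org Y"),
    ("Woman B", "Org Y"),
    ("Woman C", "Org Y"),
]
women_net = project(memberships, keep="women")
orgs_net = project(memberships, keep="organizations")
```

Swapping the `keep` argument is the inverse selection asked about in the steps: it leaves you with organizations linked by the members they share.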

! if your data table is blank, your filter might still be active. make sure the filter box is clear. You should be left with a list of women.

9. You can add the ‘women-years.csv’ table to your gephi file, to add the number of organizations the woman was active in, by year, as an attribute. You can then begin to filter your graph’s attributes…

10. Let’s filter by the year 1902. Under filters, select ‘attributes – equal’ and then drag ‘1902’ to the queries box.
11. in ‘pattern’ enter [0-9] and tick the ‘use regex’ box.
12. click ok, click ‘filter’.

You should now have a network with 188 nodes and 8728 edges, showing the women who were active in 1902.
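What the filter is doing, roughly: keep every node whose ‘1902’ column matches the pattern, and keep only the edges that run between two kept nodes. Here is a small sketch of that logic. It assumes that women who were not active in a year have an empty cell in that column, which I have not verified against women-years.csv; note that [0-9] would also match a literal 0.

```python
import re


def filter_by_attribute(nodes, edges, column, pattern):
    """Keep nodes whose attribute matches a regex, plus edges among them."""
    rx = re.compile(pattern)
    keep = {
        node
        for node, attrs in nodes.items()
        if rx.search(str(attrs.get(column, "")))
    }
    kept_edges = [(a, b) for a, b in edges if a in keep and b in keep]
    return keep, kept_edges


# Made-up attribute data: the value is the number of organizations a
# woman was active in that year; blank is assumed to mean 'not active'.
sample_nodes = {
    "Woman A": {"1902": "2"},
    "Woman B": {"1902": ""},
    "Woman C": {"1902": "1"},
}
sample_edges = [("Woman A", "Woman B"), ("Woman A", "Woman C")]

active, active_edges = filter_by_attribute(
    sample_nodes, sample_edges, "1902", "[0-9]"
)
```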

Let’s learn something about this network. On statistics,
13. Run ‘avg. path length’ by clicking on ‘run’
14. In the pop up that opens, select ‘undirected’ (as we know nothing about directionality in this network).
15. click ok.

16. run ‘modularity’ to look for subgroups. make sure ‘randomize’ and ‘use weights’ are selected. Leave ‘resolution’ at 1.0
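For the curious, average path length on an undirected network can be sketched with a breadth-first search from every node. This is a minimal illustration rather than Gephi’s implementation; in particular, averaging only over pairs that can actually reach each other is my assumption about how to handle disconnected graphs.

```python
from collections import deque


def average_path_length(edges):
    """Mean shortest-path length over all reachable ordered pairs."""
    adj = {}
    for a, b in edges:
        # Add both directions: the network is treated as undirected.
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0


# A simple chain A - B - C - D, for illustration.
chain = [("A", "B"), ("B", "C"), ("C", "D")]
chain_apl = average_path_length(chain)
```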

Let’s visualize what we’ve just learned.

17. On the ‘partition’ tab, over on the left hand side of the ‘overview’ screen, click on nodes, then click the green arrows beside ‘choose a partition parameter’.
18. Click on ‘choose a partition parameter’. Scroll down to modularity class. The different groups will be listed, with their colours and their % composition of the network.
19. Hit ‘apply’ to recolour your network graph.

20. Let’s resize the nodes to show off betweenness centrality (to figure out which woman was in the greatest position to influence flows of information in this network). Click ‘ranking’.
21. Click ‘nodes’.
22. Click the down arrow on ‘choose a rank parameter’. Select ‘betweenness centrality’.
23. Click the red diamond. This will resize the nodes according to their betweenness centrality.
24. Click ‘apply’.

Now, down at the bottom of the middle panel, you can click the large black ‘T’ to display labels. Do so. Click the black letter ‘A’ and select ‘node size’.
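If you want to know what that node size is actually measuring: betweenness centrality counts, for each node, the share of shortest paths between other pairs of nodes that pass through it. Below is a minimal sketch using Brandes’ approach for an unweighted, undirected network. It returns raw, unnormalized scores, so the numbers need not match whatever scaling Gephi applies.

```python
from collections import deque


def betweenness(edges):
    """Unnormalized betweenness centrality, undirected and unweighted."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        order = []                       # nodes in order of discovery
        preds = {v: [] for v in adj}     # predecessors on shortest paths
        sigma = dict.fromkeys(adj, 0)    # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = dict.fromkeys(adj, 0.0)
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted from both ends, so halve.
    return {v: c / 2 for v, c in score.items()}


# A hub with three spokes: every path between spokes runs through the hub.
star = betweenness([("Hub", "1"), ("Hub", "2"), ("Hub", "3")])
```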

Mrs. Mary Elliot-Murray-Kynynmound and Mrs. John Henry Wilson should now dominate your network. Who were they? What organizations were they members of? Who were they connected to? To the archives!

Congratulations! You’ve imported historical network data into Gephi, manipulated it, and run some analyses. Play with the settings on ‘preview’ in order to share your visualization as svg, pdf, or png.

Now go back to your original gephi file, and recast it as organizations to organizations via shared members, to figure out which organizations were key in early 20th century Ontario…

Patterns in Roman Inscriptions

Update, August 22: I’ve now analyzed all 1385 inscriptions. I’ve put an interactive browser of the visualized topic model at http://graeworks.net/roman-occupations/.

See how nicely the Latin clusters?

I’ve played with topic modeling inscriptions before. I’ve now got a very effective script in R that runs the topic model and produces various kinds of output (I’ll be sharing the script once the relevant bit from our book project goes live). For instance, I’ve grabbed 220 inscriptions from Miko Flohr’s database of inscriptions regarding various occupations in the Roman world (there are many more; like everything else I do, this is a work in progress).

Above is the dendrogram of the resulting topics. Remember, those aren’t phrases, and I’ve made no accounting for case endings. (Now, it’s worth pointing out that I didn’t include any of the meta data for these inscriptions; just the text of the inscription itself, with the diacritical marks removed.) Nevertheless, you get a sense of both the structure and content of the inscriptions, reading from left to right, top to bottom.

We can also look at which inscriptions group together based on the similarity matrix of their topics, and graph the result.

roman-occ-graph

Inscriptions, linked based on similarity of the language of the inscription, via topics. If the image appears wonky, just click through.
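To give a sense of how a graph like this can be built: each inscription gets a vector of topic proportions, every pair of vectors gets a similarity score, and pairs above some cut-off become edges. The sketch below uses cosine similarity and a threshold of 0.9; both are my stand-ins for illustration, not necessarily what the R script does, and the topic proportions are invented.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_edges(doc_topics, threshold=0.9):
    """Link documents whose topic vectors are at least `threshold` similar."""
    edges = []
    for (a, u), (b, v) in combinations(sorted(doc_topics.items()), 2):
        weight = cosine(u, v)
        if weight >= threshold:
            edges.append((a, b, round(weight, 3)))
    return edges


# Invented topic proportions for three hypothetical inscriptions.
doc_topics = {
    "ins1": [0.8, 0.1, 0.1],
    "ins2": [0.7, 0.2, 0.1],
    "ins3": [0.1, 0.1, 0.8],
}
linked = similarity_edges(doc_topics)
```

Each tuple can be written out as a Source, Target, Weight row for import into Gephi as an edges table.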

So let’s look at these groups in a bit more depth. I can take the graph exported by R and import it into Gephi (or another package) to do some exploratory statistical analysis.

I’ve often put a lot of stock in ‘betweenness centrality’, reckoning that if a document is highly between in a network representation of the patterns of similarity of topics, then that document is representative of the kinds of discourses that run through it. What do we get, then?

We get this (here’s the page in the database):

aurifices Roma CIL 6, 9207 Inscription Occupation
M(arcus) Caedicius Iucundus / aurifex de / sacra via vix(it) a(nnos) XXX // Clodia …

But there are a lot of subgroupings in this graph. Something like ‘closeness’ might indicate more locally important inscriptions. In this case, the two with the highest ‘closeness’ measures are

aurifices Roma CIL 6, 9203 Inscription Occupation
Protogeni / aurfici / vix(it) an(nos) LXXX / et Claudiae / Pyrallidi con(iugi) …

and

aurifices Roma CIL 6, 3950 Inscription Occupation
Lucifer v(ixit) a(nnum) I et d(ies) XLV / Hesper v(ixit) a(nnos) II / Callistus …

If we look for subgroupings based on the patterning of connections, the biggest subgroup has 22 inscriptions:
Dis Manibus Felix publicus Brundisinorum servus aquarius vixit…
Dis Manibus Laetus publicus populi Romani 3 aquarius aquae An{n}ionis…
Dis Manibus sacrum Euporo servo vilico Caesaris aquario fecit Vestoria Olympias…
Nymphis Sanctis sacrum Epictetus aquarius Augusti nostri
Dis Manibus Agathemero Augusti liberto fecerunt Asia coniugi suo bene…
Agatho Aquarius Caesaris sibi et Anniae Myrine et suis ex parte parietis mediani…
Dis Manibus Sacrum Doiae Palladi coniugi dignissimae Caius Octavius…
Dis Manibus Tito Aelio Martiali architecto equitum singularium …
Dis Manibus Aureliae Fortunatae feminae incomparabili et de se bene merenti..
Dis Manibus Auliae Laodices filiae dulcissimae Rusticus Augusti libertus…
Dis Manibus Tychico Imperatoris Domitiani servo architecto Crispinilliano.
Dis Manibus Caio Iulio 3 architecto equitum singularium…
Dis Manibus Marco Claudio Tryphoni Augustali dupliciario negotiatori…
Dis Manibus Bromius argentarius
Faustus 3ae argentari
Dis Manibus sacrum Tiberius Claudius Hymeneus aurarius argentarius…
Dis Manibus Silio Victori filio et Naebiae Amoebae coniugi et Siliae…
Dis Manibus 3C3 argentari Allia coniugi? bene merenti fecit…
Dis Manibus Marco Ulpio Augusti liberto Martiali coactori argentario…
Suavis 3 aurarius
Dis Manibus sacrum Tiberius Claudius Hymeneus aurarius argentarius…
Dis Manibus Tito Aurelio Aniceto Augusti liberto aurifici Aurelia…

What ties these together? Well, ‘dis manibus’ is good, but it’s pretty common. The occupations in this group are all argentarii, architecti, or aquarii. So that’s a bit tighter. Many of these folks are mentioned in conjunction with their spouses.

In the next largest group, we get what must be a family (or familia, extended slave family) grouping:
Caius Flaminius Cai libertus Atticus argentarius Reatinus
Caius Octavius Parthenio Cai Octavi Chresti libertus argentarius
Musaeus argentarius
Caius Caicius Cai libertus Heracla argentarius de foro Esquilino sibi…
Caius Iunius Cai libertus Salvius Caius Iunius Cai libertus Aprodisi…
Caius Vedennius Cai filius Quirina Moderatus Antio militavit in legione…
Aurifex brattarius
Caius Acilius Luci filius Trebonia natus architectus
Caius Postumius Pollio architectus
Caius Camonius Cai libertus Gratus faber anularius
Caius Antistius Isochrysus architectus
Elegans architectus
Caius Cuppienus Cai filius Pollia Terminalis praefectus cohortis…
Cresces architectus
Cresces architectus
Caius Vedennius Cai filius Quirina Moderatus Antio militavit in legione…
Pompeia Memphis fecit sibi et Cnaeo Pompeio Iucundo coniugi suo aurifici…
Caius Papius Cai libertus Salvius Caius Papius Cai libertus Apelles…
Caius Flaminius Cai libertus Atticus argentarius Reatinus

The outliers here are graffiti, or must be being picked up by the algorithm due to the formation of the words. The inclusion of Pompeia in here is interesting; it must be due to the overall structure of that inscription. Perhaps a stretch too far to wonder why these would be similar…?

This small experiment demonstrates I think the potential of topic modeling for digging out patterns in archaeological/epigraphic materials. In due time I will do Flohr’s entire database. Here are my files to play with yourself.

Giant component at the centre of these 220 inscriptions.

Topic Modeling #dh2013 with Paper Machines

I discovered the pdf with all of the abstracts from #dh2013 on a memory-stick-cum-swag this AM. What can I do with these? I know! I’ll topic model them using Paper Machines for Zotero.

Iteration 1.
1. Drop the pdf into a zotero collection.
2. Create a parent item from it.
3. Add a date (July 2013) to the date field on the parent item.
4. Right click on the collection, extract text for paper machines.
5. Right click on the collection, topic model –> by date.
6. Result: blank screen.

Damn.
Right-click the collection, ‘reset papermachines output’.

Iteration 2.
1. Split the pdfs for the abstracts themselves into separate pages. (pg 9 – 546).
2. Drop the pdfs into a zotero collection.
3. Create parent items for them. (Firefox hangs badly at this stage. And keeps redirecting through scholar.google.com for reasons I don’t know.)
4. Add dates to the date field; grab these by hand from the dh schedule page. God, there’s gotta be an easier way of doing this. Actually, I’ll just skip this for now and hope that the sequential page numbers/multiple documents will suffice.
5. Right click on the collection, extract text for paper machines.
6. Right click on the collection, topic model –> by date.
7. Result: IndexError: index out of range: -1.

Damn.
Right-click the collection, ‘reset papermachines output’.

Iteration 3.
Jump directly to #4, add dates to date field. In the interests of getting something done this morning, I will give them all the same date – a range from July 16 – July 19. If I gave them all their correct dates, you’d get a much more granular view. But I’m adding these by hand. (Though there probably exists some sort of batch edit for Zotero fields? Hang on, I right click on ‘change fields for items’ type ‘date’ for field, put in my range, hey presto! Thanks, Zotero)
5. Right click on the collection, extract text for paper machines.
6. Right click on the collection, topic model –> by date.
7. Result:

Damn.

Chased down the folder where all of these were being stored. Ahha. Each extracted text file is blank. Nice.

Blow this for a lark. Sometimes, folks, the secret is to go away, and come back later.

Update: I tweeted:

And then walked away for a while. Came back, and went to the TEI file. I used Notepad ++ to strip everything else out but the abstracts. I saved it as a csv. Then, in Excel, I used a custom script I found lying about on teh webs to turn each line into its own txt file. Then I copied the directory into Zotero. I gave each txt file its own parent. I mass edited those items so that they all carried the date July 16 – 19 2013. Then I extracted texts (which seems redundant, but you can’t jump ahead).
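The line-to-file step doesn’t have to go through Excel. Here is a minimal sketch of the same idea in Python, assuming one abstract per line in the saved file; the function name, file prefix, and paths are all hypothetical, and this is not the script I actually used.

```python
from pathlib import Path


def split_lines_to_files(source, out_dir, prefix="abstract"):
    """Write each non-empty line of `source` to its own .txt file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(source, encoding="utf-8") as handle:
        lines = [line.strip() for line in handle if line.strip()]
    written = []
    for number, text in enumerate(lines, start=1):
        # Zero-padded numbers keep the files in their original order.
        target = out / f"{prefix}_{number:04d}.txt"
        target.write_text(text + "\n", encoding="utf-8")
        written.append(target)
    return written
```

Calling something like `split_lines_to_files("abstracts.csv", "abstracts_txt")` would give you a directory of numbered text files ready to drop into Zotero.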

And then I selected topic modeling by time.

Which at least created a topic model, but it didn’t make the stream graph. The heat map worked, but all it showed was the US, UK, and Germany. And Florida, for reasons unexplained.

So I went back to Gephi for my topic model visualization. I used Ben Marwick’s Mallet-in-R script to do the modeling and to transform the output so I could easily visualize the correlations. Behold, I give you the network of strongly correlated #dh2013 abstracts by virtue of their shared topics:

dhabstracts-strongy-correlations

It’s coloured by modularity and sized by betweenness, which gives us groups of abstracts and the identification of the abstract whose topics/discourse/text do all of the heavy lifting. A brief glance at the titles suggests that these papers are all concerned with issues of data management of text.

All of this data is now up on my space at Figshare.com, and I’ll provide some further reflections. Currently, this machine is hanging up on me frequently, and I want to get this out before it crashes. Here are the topics; you can add labels if you’d like, but the top three seem to be ‘publishing & scholarly communication’; ‘visualization’; ‘teaching’:

Correlated topics at #dh2013

0.35142 humanities digital social scholarly http research history accessed work community scholarship www access dh journal publication citation communication publishing
0.28061 literary reading analysis visualization text texts digital literature century studies media topic humanities corpus mining modeling press textual paper
0.21684 digital humanities students university teaching research dh participants workshop projects education pedagogy program tools academic arts graduate project resources
0.18993 digital collections research collection content researchers users access library user resources image images libraries archives metadata cultural information tools
0.14539 tei text document documents encoding markup xml texts index london indexing http uk html encoded links search version modern
0.11833 data historical map time gis information temporal maps university spatial geographic locations texts geographical place names mapping date dates
0.11792 crowdsourcing digital project public states united archaeological america archaeology projects poster university virginia web community social civil media users
0.11289 systems model modeling system narrative media theory elements classification type features user markup ic gesture expression representation press character
0.09601 editions edition text scholarly digital women editing collation print textual texts tools http image manuscript electronic editorial versions environment
0.08569 authorship author words texts corpus attribution characters frequency plays fig classification results number novels genre authors analysis character delta
0.08016 semantic annotation web linked open ontology data rdf scholarly http ontologies research annotations information review project metadata knowledge org
0.07777 social network networks analysis graph relationships characters group graphs jazz science family de interaction publication relationship nodes discussion cultural
0.06328 language corpus text txm http german de web lexicon platform corpora tools analysis unicode research annotation encoding languages lexus
0.05286 digital knowledge community fabrication migration book open feminist learning field knitting desktop world practices cultural experience work lab academic
0.04856 text analysis programming voyant tools ca poster interface alberta live rank sinclair http latent environments ualberta touch screen environment
0.04131 words poetry word text poem texts poetic ford english author segments conrad analysis language poems zeta newton mining chapters
0.0364 simulation information time content model vsim environment narrative abm distribution feature light embedded study narratives virtual japan plot resources
0.03538 query search google alloy xml language words typesetting algorithm de detection cf engine speech mql algorithms body searches paris
0.01131 de la el homer movement uncertainty en se clock catalogue del astronomical una movements para los dance las imprecision

A quick run with Serendip-o-matic

I just ran my announcement of our book through the #owot Serendip-o-matic serendipity engine.

It took the text of my post, and extracted these key words:

book, digital, writing, online, process, project, students, things, us, wanted, going, historian, nervous, one, programming.

I wondered if the selected keywords changed each time, if there was a bit of fuzziness to the extraction routine.  The image results this second time looked different than the first (more digitally than booky the second time, more bookish the first time than digital), but the results from the ‘save’ button were the same:

So, for pass one:

  1. Writing 2.0: Using Google Docs as a Collaborative Writing Tool in the Elementary Classroom: http://thoth.library.utah.edu:1701/primo_library/libweb/action/dlDisplay.do?vid=MWDL&afterPDS=true&docId=digcoll_uvu_19UVUTheses/609. From DPLA.
  2. Effectiveness of an Improvement Writing Program According to Students’ Reflexivity Levels: http://preview.europeana.eu/portal/record/9200102/F5795175AA2BAED57402D982C774072FE21364BF.html?utm_source=api&utm_medium=api&utm_campaign=iiecvYL4T. From Europeana.
  3. Students in the incubation room at the Woodbine Agricultural School, New Jersey: http://www.flickr.com/photos/36988361@N08/4296232936/. From Flickr Commons.
  4. Impossible things [book review]: http://thoth.library.utah.edu:1701/primo_library/libweb/action/dlDisplay.do?vid=MWDL&afterPDS=true&docId=digcoll_byu_12CBPR/201. From DPLA.
  5. Let The Feeling Flow: http://preview.europeana.eu/portal/record/2023601/F8C732E3D49AC67D886564EC78D0E37F02617C72.html?utm_source=api&utm_medium=api&utm_campaign=iiecvYL4T. From Europeana.
  6. Student reading to two little girls. Photographed for 1920 home economics catalog by Troy.: http://www.flickr.com/photos/30515687@N05/3856396957/. From Flickr Commons.

For pass two: – well, lots of different stuff, some overlaps, but a glitch meant that my results didn’t get saved.

Pass three: these words extracted- book, digital, writing, online, process, project, students, things, us, wanted, going, historian, nervous, one, programming. Same words, different order; but there were many different images from passes 1 and 2, while some images stayed the same. The ‘save’ page brought up the list above.  If I was serious about saving, I’d try to push from the results page into Zotero; in any event, after five workdays, this is a hell of a neat piece of work!  For contrast, let me take those keywords that serendipomatic extracted, and run them through google. Three results:

http://electricarchaeology.ca/2013/07/24/themacroscope/

http://electricarchaeology.ca/

http://dohistory.org/on_your_own/toolkit/oralHistory.html

So serendipomatic is the winner, hands down! Putting the keywords* extracted via natural language processing into google really highlights how google works: it exactly points to the post with which we began. And there, ladies and gentlemen, is the reason why Google, for all its power, is not the friend to research that you might have thought. Google is for generating needles; Serendipomatic is for generating haystacks, and it does it well. Well done #owot team!

*putting the whole text generated an error: Error 414 (Request URI too large!) Sorry google, didn’t mean to break you.