Signal Versus Noise: Why Academic Blogging Matters: A Structural Argument. SAA 2011

Signal versus Noise: Why Academic Blogging Matters. Shawn Graham, Carleton University, Ottawa, Canada. (Comic from the New York Times article is by David G. Klein)

“Omnia disce; postea videbis nihil esse superfluum” said Hugh of St Victor in the 12th century. ‘Learn everything; later it will all be useful somehow’. The irony of course is that I would in all probability have never come across this epigram (not being a medievalist) if it hadn’t been for the magic of the internet and my faculty’s Dean’s blog. Hugh goes on to say, ‘coartata scientia iucunda non est’, ‘narrow knowledge is not pleasant’. That phrase fits neatly with one of the standard criticisms of blogging, that blogs are narrowly focused, shrill, and often an echo-chamber for their (and their readers’) own views.

In a final neat connection, this phrase of Hugh’s is the epitaph on the tomb of Father Leonard Boyle. Father Boyle is buried at San Clemente in Rome, in the ruins of the 4th century church. This ‘lower church’ was found in the mid 19th century underneath the present basilica (which dates from the 12th century). Father Boyle was archivist and historian to the Irish Dominicans, who manage the site, and it is indeed a moving testament to his life’s work that he should be buried in the ancient basilica. The epigram then is very much an archaeological sentiment, both in its context of display and in how it implores us to learn everything: for what else is an excavation but the careful recording of everything on the chance that it will be useful later on?

But it’s also directly useful to us who blog archaeology, who take on the mantle of public archaeology. It could, in a sense, be a motto for Google, who try to ‘learn everything’ with no idea of what will be useful to whom or in what way. But that’s the problem right there – deciding what is useful, and finding it. ‘Narrow knowledge is not pleasant’ I think neatly describes the results of search engines in that first phase of the internet, when the world wide web had just been created and people were still trying to produce human-curated guides to the ‘net. Google of course changed everything with the invention of ‘PageRank’. The mathematics of ‘PageRank’ are based on graph theory and network analysis. In essence, PageRank considers each link on a page as a kind of vote on the relative importance of the page being linked to. It also considers the relative importance of the pages doing the linking, and so it’s a recursive process. This was Google’s original insight: that the importance of a page depends on the kind and quality and number of its relations to all other pages on the net.
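The recursive ‘voting’ idea can be sketched in a few lines of Python. This is a textbook simplification and not Google’s production algorithm: the four-page toy web, the damping factor of 0.85, and the function name `pagerank` are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Toy PageRank. `links` maps each page to the pages it links to.

    Each round, a page splits its current score among the pages it
    links to, so a vote from an important page counts for more.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with every page equally important.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Baseline share every page receives regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web, invented for this example.
toy_web = {
    "blog": ["wikipedia", "department"],
    "department": ["wikipedia"],
    "forum": ["blog", "wikipedia"],
    "wikipedia": ["blog"],
}

ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:12s} {score:.3f}")
```

In this toy web the page that collects the most links from well-linked pages ends up on top, while the forum, which nothing links to, is left with only the baseline share.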

Learn everything: but that’s only half the battle. The other part is determining what is useful, extracting the signal from the noise of not only the search query, but of all those millions of pages of information. And in this, Google benefits from the billions of searches that we the users perform every week. In essence, we are teaching the machine what is useful when we skip over the first page of results, looking for the one that *really* seems to match what we were looking for. Google observes this. Wired Magazine not long ago looked under the hood to see how Google learns from user behaviors. Google isn’t a search engine, or a catalog, or an index: it’s a massive experiment in prediction. Apparently, Google uses over 200 signals to match useful information to each individual user (each of whom has their own idea of what constitutes ‘useful’). PageRank is one signal; others include the title of a webpage, the actual text of a hyperlink, freshness, and the geolocation of the person doing the search.

This isn’t foolproof, however. The system can be gamed. In November 2010, the New York Times published a story about an online seller of glasses and eyewear, run by one Vitaly Borker. Borker discovered that if he offered poor service to some of his customers, those customers would complain on the internet (especially in forums), linking to his site in warning to others. One would think this would be poison to his business, but on the contrary, Borker discovered that it made his site’s listing on Google search results improve. That is, all publicity is good publicity, as the algorithm powering the search did not consider the semantic meaning of those mentions. So Borker would then go out of his way to aggravate certain of his customers to such a degree that they would generate more web traffic to his store. Once the New York Times broke the story, Google made some changes to its algorithm. Google did not reveal what changes it had made, in order to prevent other unscrupulous individuals from similarly gaming the system. Borker’s website dropped from its number one position to somewhere deep on the twentieth page of results in the immediate aftermath of the changes.

This story is illuminating on a number of levels. As educators, we’re already familiar with the fact that our students turn to the internet, and more specifically, Google, when they begin their research. How deep do they go on a search results page? Search Engine Optimization is a bit of a black art, but all agree that appearing in the first five search results is the key: people do not click on results much after the fifth result. If one’s website does not appear in that golden group, it might as well not exist.

The story about Borker illustrates the way human interaction and Google search are linked. Google looks for actively updated materials; materials that are semantically tight; and materials that people link to. People link to the materials that Google serves up in its top five, thus creating a positive feedback loop. Wikipedia and Google were made for one another. Wikipedia is simultaneously the product of enormous human energy, and enormous human laziness. Wikipedia produces strong signals – whether good or bad, Google doesn’t care. Google returns a Wikipedia page, and a human reads it. A fraction of those readers edit it; another fraction link to it, whether to praise, disparage, or simply use it as a kind of glossary of terms, thus creating signals that Google picks up. (This incidentally is also an argument for why academics must engage with Wikipedia and actively work to improve its content! It’s also an argument for why Wikipedia cannot be displaced: it’s here to stay, and will only become more dominant through this positive feedback loop).

So how does blogging fit into this? Blogging is a medium, not a genre, and so content itself is a bit secondary. What is important is that blogging as a medium also creates strong signals. ‘Academic Blogging’, as a genre, creates very strong signals. That is, it should. Academic blogs tend to have a very tight focus. They are updated fairly regularly, as the academic incorporates them into his or her work cycles. The anchor text for linking tends to consist of distinctive combinations of words, what Amazon would call ‘statistically improbable phrases’, and thus provides more signal to Google’s robots. Contrast that with a static department website, for instance. It’s blogging that brings the latest research to that golden group of 5 results.

Let’s look at some structure. I searched ‘Blogging Archaeology’ via Google, crawled the results, and imported them into Gephi. I let the crawl run for about 20 minutes, recovering over 8500 nodes linked together by nearly 9000 edges. There’s a lot of noise, when you look at it. However, this network has a diameter (the maximum distance between the two furthest nodes) of 8 – that is, 8 steps from one side to the other. On average, to get from any node to any other node takes roughly 3 steps, so it is a rather tight network. But I want to know where the academic bloggers fit into this, so I run the ‘modularity’ routine in Gephi. This routine looks for areas of self-similarity in the patterning of connections. I find four communities. One translates into the archaeological blogosphere (green in the image), centered on Colleen’s Middle Savagery. Light blue in the image seems to correlate to ‘cloud’ based storage. Purple seems to be the social media sector (Facebook etc – showing incidentally what a walled garden it is becoming). Red appears to be aggregator websites. Interestingly, Twitter (microblogging) is the purple node that sits at the intersection of the green and purple (perhaps I need to do this study from a Twitter-centric point of view).
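For readers who want to see what those measures mean mechanically, here is a minimal Python sketch on an invented six-node network (two tight triangles joined by a single bridge). It is not the crawl data, and it treats links as undirected, which is an assumption. Gephi’s modularity routine searches for a good partition; the `modularity` function below only scores one partition that is handed to it. All names are hypothetical.

```python
from collections import deque


def build_graph(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def distances_from(graph, source):
    """Breadth-first search: number of steps from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def diameter_and_average_path(graph):
    """Longest and mean shortest-path length over all connected pairs."""
    nodes = sorted(graph)
    lengths = []
    for i, u in enumerate(nodes):
        dist = distances_from(graph, u)
        for v in nodes[i + 1:]:
            if v in dist:
                lengths.append(dist[v])
    return max(lengths), sum(lengths) / len(lengths)


def modularity(graph, communities):
    """Modularity score Q of a given partition (higher = more clumpy)."""
    m = sum(len(neighbours) for neighbours in graph.values()) / 2
    q = 0.0
    for community in communities:
        # Each internal edge is seen from both ends, hence the division by 2.
        internal = sum(
            1 for u in community for v in graph[u] if v in community
        ) / 2
        degree = sum(len(graph[u]) for u in community)
        q += internal / m - (degree / (2 * m)) ** 2
    return q


# Invented toy network: two triangles (a-b-c and d-e-f) joined by c-d.
toy_edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("c", "d"),
    ("d", "e"), ("e", "f"), ("d", "f"),
]
toy_graph = build_graph(toy_edges)

diameter, average_path = diameter_and_average_path(toy_graph)
q = modularity(toy_graph, [{"a", "b", "c"}, {"d", "e", "f"}])
print(diameter, average_path, round(q, 3))
```

On this toy network the diameter is 3 and the average path length is 1.8. The two triangles score as distinct communities, and the single c-d link plays the role of the bridge between them.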

In a sense, these results are not surprising, since I ‘gamed’ the system by looking for a term that I knew was active and heavily represented in the archaeological blogosphere. Let’s look for something a bit more generic: ‘Roman Archaeology’. Crawling the results for the same amount of time, we find 6240 nodes and 13 216 edges – a denser network already. The diameter of this network is 10, and the average path length is 6, which suggests that it’s going to be a bit more parochial, despite all those connections. Once I search this network for modularity, I find 9 communities. The image is striking, almost a barbell shape with Wikipedia being one of the weights and academia being the other (curiously, Columbia and Duke especially) – and the weak link connecting the two is a handful of blogs and Twitter accounts. What better argument for academic blogging, and considering digital archaeology as public archaeology, could be made? We’ve argued that academic bloggers tell the rest of the world what the academy is up to, but never so strongly as this image depicts. Academia, the font of ‘professional’ knowledge, and Wikipedia, the font of crowdsourced knowledge, connect through us, the academic bloggers.
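The density comparison can be checked with simple arithmetic. The short Python sketch below uses the node and edge counts quoted in the text. The figures for the first crawl are rounded (‘over 8500’ and ‘nearly 9000’), so those results are approximate, and treating the crawl as a directed graph is an assumption. The function names are hypothetical.

```python
def links_per_node(nodes, edges):
    """Average number of links recorded per node."""
    return edges / nodes


def directed_density(nodes, edges):
    """Share of all possible directed links that actually exist."""
    return edges / (nodes * (nodes - 1))


# Counts as quoted in the text; the first pair is rounded.
blogging_archaeology = (8500, 9000)
roman_archaeology = (6240, 13216)

for name, (nodes, edges) in [
    ("Blogging Archaeology", blogging_archaeology),
    ("Roman Archaeology", roman_archaeology),
]:
    print(
        name,
        round(links_per_node(nodes, edges), 2),
        f"{directed_density(nodes, edges):.6f}",
    )
```

On these counts the ‘Roman Archaeology’ crawl has roughly twice as many links per node, even though its paths are longer. That is consistent with links clustering inside communities rather than running between them.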

A consistent presence then by an academic blogger can perform magic. It begins to tell Google what’s important. Blogging is just a medium, not a genre. It’s a content management system. It’s unfortunate that so many academics are turned off by the word ‘blog’, because they are actually missing an important new venue for communicating what they do to the wider world. In this day and age, if you’re not making that argument for, someone else will make the argument against, and it becomes very easy for a decision maker in government to say, ‘what good is x? Let’s cut its funding’.  All archaeology is public archaeology, ultimately. And we ignore that at our peril. We need to create the strongest signal in the noise that we can: and blogging is a crucial part of that. ‘Omnia disce; postea videbis nihil esse superfluum’. Google learns everything, but it still needs to be taught.

That’s our job.

Carpenter, Brian. 2010. Observed Relationships between Size Measures of the Internet or Is the Internet really just a star network after all? [April 1 2010]
2010. Gephi, an open source graph visualization and manipulation software. [April 1 2010]
Jacomy, M. P. Girard, A. Delanoe. Navicrawler 1.7.3 [April 1 2010]
“Leonard Boyle” [April 1 2010]
Levy, Steve. 2010. ‘How Google’s Algorithm Rules the Web’ Wired March. [April 1 2010]
Segal, David. 2010 ‘A Bully Finds a Pulpit on the Web’. The New York Times, Nov. 26. [April 1 2010]
Singhal, Amit. 2010. “Official Google Blog: Being bad to your customers is bad for business”. Dec. 1. [April 1 2010]
“PageRank” [April 1 2010]