In these two one-mode networks (generated from Jonathan Bardill’s study of the brickstamps of Constantinople), colour indicates modularity class, while size indicates betweenness centrality. What network measures are most appropriate for understanding archaeological networks?
Update, April 4th: there is now a plugin for Gephi which converts multimode networks to one-mode networks: https://gephi.org/plugins/multimode-networks-transformations/
Say you’re interested in patterns of communication between individuals who are members of multiple organizations (historical societies, for instance), or in artefact types across multiple sites. You might like to map the network between those individuals and organizations to understand something of how information flows in that world, how social norms permeate, or how ideologies of consumption or display map across space (as Tom Brughmans does here, and as I’ve done in other places).
1. Make a list. Every time you encounter an individual mentioned as a member of a group, write it out. Two columns: Source, Target (for example: Shawn Graham, Carleton University). You might include a third column called ‘weight’ which gives some measure of the importance of that connection. Why ‘source’, why ‘target’? Because we’re going to import that list into Gephi, and that’s how Gephi requires the information. However, when we do any sort of metrics, we’ll always treat this network as undirected; that is, we’re making no claim to know anything about the direction of the relationship (in a directed network, Alice’s connection to Bob is different from Bob’s connection to Alice). Save that list as a .csv file. If you graphed this right now, you’d have a network with two kinds of nodes; hence, a two-mode network. Most network statistics assume a network where all the nodes are of the same kind, which is why we’re doing this tutorial.
2. Import the list into Gephi. Open Gephi, start a new project. Click on the ‘data laboratory’ tab. Under ‘data table’, click on ‘edges’ (this is important; if you click on ‘nodes’, this doesn’t work correctly). Click on ‘import spreadsheet’. Select your csv file, and make sure that ‘as table:’ is set to ‘edges table’. Click Next. Click Finish.
3. Go to File >> Export >> Graph file. Save as file type .net (Pajek).
4. Open Sci2; click on File >> Load and select your .net file.
5. Click ‘data preparation’ >> extract reference co-occurrence (bibliographic coupling) network. (See also 5.b, below under ‘variations’.)
6. Click ‘preprocessing’ >> networks >> delete isolates. You’ve now collapsed your two-mode network into a one-mode network where the nodes under ‘target’ are all connected to each other. If your source, target pair was ‘site’, ‘ware’, you’ve got a one-mode network where wares are connected to each other by virtue of being listed at the same site, i.e., the linkage implies the site. (If you’ve done step 5.b, your one-mode network would be sites connected to each other by virtue of sharing the same wares; the linkage implies the ware.)
7. At this point, go to File >> View and your notepad application will open, displaying a table where each node in your network has its own unique id and a ‘label*string’, which is your original label. Save this from Notepad as a .txt file. You might call it ‘ware to ware index’ (following our example in step 6).
8. This is where things get a bit tricky. Click File >> Save. Select ‘pajek .net’ as your file type. (See also 8.b below, under ‘variations’.)
9. You can then go back to Gephi, start a new project, click ‘open’ and select the .net file you just created. Your one-mode network will load up. HOWEVER, Gephi won’t recognize the original node labels anymore. This is why you need the index you saved in step 7, so that when you run metrics on this one-mode network, you’ll know that node 342 is actually Stamped Brick CIL XV.1 841.d (for instance). (See also 9.b below, under ‘variations’.)
5.b In step 5, you created a one-mode network based on your ‘target’ column. To create a one-mode network based on your ‘source’ column, click ‘data preparation’ >> extract document co-citation network. Resume at step 6.
8.b If you want to preserve the node labels in Gephi, instead of step 8, click on Visualization >> Networks >> GUESS. This is a small visualization tool (which also allows you to do some network metrics; if your network is very big, say more than 1,000 nodes, this might not be a good idea). In GUESS, click File >> export graph. Give it a file name that makes sense, and don’t forget to type in the extension .gdf; otherwise it won’t export. Go to step 9.b.
9.b Go to Gephi, start a new project, click on ‘open’, and select the .gdf file you just created. Your node labels will now be present in the graph, so you don’t need the index file you created. Perhaps a bug: in my experiments, node labels don’t seem to appear in the ‘graph overview’ pane when working with the gdf file. Your experience might be different. However, they do appear when I export an image of the network, under ‘preview’ >> export.
Fin. Let me know if/how your experience differs, or if these steps require clarification.
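As a postscript: the whole Gephi-to-Sci2 round trip can also be sketched in a few lines of Python with networkx, which preserves node labels throughout. This is a minimal sketch, not the Sci2 routine itself; the file name ‘edges.csv’ and the bipartite attribute convention are my assumptions, and it presumes no node appears in both columns.

```python
# A minimal sketch of the same two-mode -> one-mode collapse in Python,
# using networkx. Assumes a file 'edges.csv' with 'Source,Target' header
# columns, as described in step 1.
import csv

import networkx as nx
from networkx.algorithms import bipartite

B = nx.Graph()
with open('edges.csv', newline='') as f:
    for row in csv.DictReader(f):
        # Mark which mode each node belongs to, so we can project later.
        B.add_node(row['Source'], bipartite=0)   # e.g. individuals, or sites
        B.add_node(row['Target'], bipartite=1)   # e.g. organizations, or wares
        B.add_edge(row['Source'], row['Target'])

targets = {n for n, d in B.nodes(data=True) if d['bipartite'] == 1}

# One-mode network of the 'target' nodes, tied when they share a 'source'
# (step 5); edge weights count how many sources each pair shares.
G = bipartite.weighted_projected_graph(B, targets)

# Drop isolates (step 6) and write a file Gephi can open directly --
# node labels survive, so no index file is needed.
G.remove_nodes_from(list(nx.isolates(G)))
nx.write_gexf(G, 'one_mode.gexf')
```

Projecting on the ‘source’ column instead (the 5.b variation) is just a matter of passing the other node set to weighted_projected_graph.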
I’m working on a paper for a conference next month. In it, I consider artefact copresence at various sites as a means for generating networks, in an effort to get at some of the ideological or social frameworks underpinning the distribution of these artefacts. I’m looking at stamped brick from Constantinople. I create a list where each entry is a site and a single example of a stamped brick. These I can then visualize using Gephi, and using Sci2 I can convert the two-mode network (brick – place) into two one-mode networks (bricks – bricks, tied because they’re found at the same place; place – place, tied because they use the same bricks). Below is the one-mode graph, showing places tied to other places, based on the distribution of over 2,300 stamped bricks. Once I have the data as one-mode networks, I can run network stats:
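To give a flavour of the stats I mean, here is a hedged sketch with networkx in Python, run against a projected one-mode graph like the one produced in the previous post; the file name ‘one_mode.gexf’ is an assumption for illustration, and the community routine named here is networkx’s greedy algorithm, comparable in spirit to Gephi’s modularity routine though not identical.

```python
# Network stats on the one-mode place-to-place graph: betweenness
# centrality and community structure. 'one_mode.gexf' is hypothetical.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.read_gexf('one_mode.gexf')

# Betweenness centrality: which places lie on the most shortest paths?
bc = nx.betweenness_centrality(G)
for place, score in sorted(bc.items(), key=lambda kv: -kv[1])[:5]:
    print(f'{place}: {score:.3f}')

# Communities found via greedy modularity maximization.
for i, community in enumerate(greedy_modularity_communities(G)):
    print(i, sorted(community))
```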
I’ve been having an interesting conversation with Ben Marwick, in the comments thread of my initial ‘Getting Started with Topic Modeling’ post. Ben pointed me to an interesting GUI for Mallet, which may be downloaded here. I’ve been trying it out this morning, and I like what I’m seeing. Topic modeling is becoming more and more popular amongst the Digital Humanities crowd. An interesting automated approach to generating networks of topics and ideas from texts is reported by Scott Weingart, using the writings of Newton.
While I have nothing near so polished available, the GUI for Mallet used with Gephi can do nearly the same thing. My body of data comes from Writing History in the Digital Age. An earlier experiment with the same data is recounted here. I re-ran the data using the GUI approach, and I have to say, this is a much easier and more accessible approach. Run the program; select the folder with your txt documents in it; select the target number of topics; select the appropriate language stopwords list if necessary; hit ‘train topics’. What is very neat about this program is how it presents its output in both HTML and CSV.
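For those who would rather script that workflow than click through a GUI, here is a rough sketch of the same steps (folder of .txt files in, topics out) using gensim’s LDA rather than MALLET. This is not the GUI’s internals, just an illustrative alternative; the folder name, topic count, and toy stopword list are all my assumptions.

```python
# Sketch: train a topic model over a folder of plain-text documents.
# Uses gensim's LdaModel instead of MALLET; names are illustrative.
import os
import re

from gensim import corpora, models

STOPWORDS = {'the', 'a', 'an', 'and', 'of', 'to', 'in', 'is', 'it'}  # toy list

docs = []
for name in sorted(os.listdir('texts')):
    with open(os.path.join('texts', name), encoding='utf-8') as f:
        tokens = re.findall(r'[a-z]+', f.read().lower())
        docs.append([t for t in tokens if t not in STOPWORDS])

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lda = models.LdaModel(corpus, num_topics=20, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics(num_topics=20, num_words=8):
    print(topic_id, words)
```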
So in the spirit of crowdsourcing, I’ve put the output files online, and haven’t tried to decide yet what the topics might mean. Instead, why don’t you view the files for yourself, and let’s identify the topics using the comments of this post?
I then took the CSV files and got them ready for import into Gephi: decide which two columns you’d like to represent as being connected, and prune away the extraneous data. I took the ‘topicsindocs.csv’ file and pruned it so that each paragraph of each author is paired with its major topic. I stripped away the info about the paragraph itself, so that the resulting visualization is just authors connected to the topics they write about (a sketch of this pruning step follows). In the screenshot below, you can see the open Gephi file with my own ‘Wikiblitz’ article highlighted, along with its connections.
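The pruning itself is a one-screen script. A sketch in Python, with the caveat that the column names here (‘author’, ‘major_topic’) are hypothetical stand-ins; adjust them to whatever headers your topicsindocs.csv actually carries.

```python
# Prune topicsindocs.csv down to author -> major-topic pairs, written
# out with the Source/Target headers Gephi expects for an edges table.
# Column names are assumptions -- check your own file's header row.
import csv

with open('topicsindocs.csv', newline='') as src, \
     open('author_topic_edges.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    writer.writerow(['Source', 'Target'])
    for row in csv.DictReader(src):
        # Keep only the author and the major topic; drop paragraph details.
        writer.writerow([row['author'], row['major_topic']])
```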
What’s also interesting is that when I ran the ‘modularity’ routine – identifying communities based on patterns of self-similarity of ties – only four communities emerged (albeit with a very low modularity measurement, 0.235, which suggests that these communities aren’t all that strong). A natural grouping of the papers, perhaps? (By the way, here’s the pdf/svg file.)
I’m writing a lecture on social network analysis for one of my classes. I thought it would be good to illustrate some of the features of networks with reference to the parlour game, ‘Six Degrees of Kevin Bacon‘ – and then remembered that I had once been an extra in a film. Perhaps I too had a Bacon Number?
H2O was a miniseries starring Paul Gross. It was filmed in Ottawa in the winter of 2004. I thought it would be fun to see how TV movies get made, so I auditioned and got a few days of extra work. Most of the time, it was as a face in the crowd. But once – once! my big break! look ma! – I got to be in a scene with one of the principals (I saw Paul Gross once in the hallway of the Chateau Laurier, where we were filming, but that doesn’t seem enough for my purposes today). The principal was Yves Jacques, who was playing the premier of Quebec. If you watch the film, you can see me nod and answer the phone. Two seconds of cinematic glory. But sufficient to generate a Bacon number (from the Oracle of Bacon, more or less, since we extras aren’t in the database):
Perhaps we need an archaeological equivalent. What’s your Wheeler Number? How many steps to the icon?
Michael Fulford was one of my supervisors; Fulford worked for Barry Cunliffe, who studied with Glyn Daniel; Daniel was a host of the game show Animal, Vegetable, Mineral? along with Mortimer Wheeler (in 1971, Cunliffe also hosted)… making my Wheeler Number 4 (or 3, depending on whether or not Cunliffe co-hosted with Wheeler in 1971).
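A Wheeler Number, of course, is just a shortest path in a collaboration graph. For fun, a toy sketch with networkx, using only the links named above:

```python
# A 'Wheeler Number' as a shortest-path query on a tiny collaboration
# graph built from the links in the paragraph above.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ('Graham', 'Fulford'),      # supervisor
    ('Fulford', 'Cunliffe'),    # worked for
    ('Cunliffe', 'Daniel'),     # studied with
    ('Daniel', 'Wheeler'),      # co-hosted Animal, Vegetable, Mineral?
    ('Cunliffe', 'Wheeler'),    # only if the 1971 co-hosting counts
])

# Prints 3 with the 1971 link included; remove it and you get 4.
print(nx.shortest_path_length(G, 'Graham', 'Wheeler'))
```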
Of course, it might be more …academic… to look at citations and co-authorships, but somehow it’s more satisfying to know that a game show figures in this linkage.
There’s a new report out today from the Council on Library and Information Resources, by Alison Babeu: “Rome Wasn’t Digitized in a Day”: Building a Cyberinfrastructure for Digital Classicists. Lots of interesting stuff. I was chuffed to see that a bit of work I did with Giovanni Ruffini a few years ago on social networks and prosopography was examined in this report. SNA and prosopography could be a very powerful one-two punch. I’ve not been able to devote much energy to that field lately, but I’d be glad to help anyone who wanted to explore it.
p. 169: “While Graham and Ruffini acknowledged that most of their analysis is still fairly speculative, they also convincingly argued that the unique nature of their results derived from network analysis of ancient evidence suggests that there are many interesting avenues of future work.”
Giovanni’s work on the prosopography of Byzantine Egypt is absolutely astounding, and you really must look his work up.
I used NodeXL to search, scrape, and collect the pattern of linkages in tweets using the #dayofarch hashtag. I then exported these to a .net file, and used Gephi to visualize and study the pattern. The dayofarchaeology_tweets file is a zoomable svg/pdf showing the full pattern.
There are 454 nodes (individuals) connected by 993 edges (co-mentions, links, RTs, etc.). The diagram’s colours indicate degree – the darker the node, the higher the number of connections. Top three users: lornarichardson, portableant, dayofarch. The size of the node indicates betweenness centrality – in this case, the users who tie the twittersphere together (in that they lie on the most paths connecting any two users). Top three: lornarichardson, cmount1, jadufton.
The network diameter is 13, meaning that the longest shortest path between any two users is 13 jumps; the average path length is about 5 jumps.
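Figures like these are straightforward to reproduce in code. A hedged sketch with networkx, assuming a Pajek .net export like the one described above (the file name ‘dayofarch.net’ is a placeholder):

```python
# Compute diameter and average path length on an exported .net file.
import networkx as nx

# read_pajek returns a MultiGraph; collapse to a simple Graph.
G = nx.Graph(nx.read_pajek('dayofarch.net'))

# Both measures are only defined on a connected graph, so take the
# largest connected component first.
giant = G.subgraph(max(nx.connected_components(G), key=len))

print('diameter:', nx.diameter(giant))
print('average path length:', nx.average_shortest_path_length(giant))
```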
I’ll mull these – and other figures – in a forthcoming post. There is meaning in structure…
Signal versus Noise: Why Academic Blogging Matters. Shawn Graham, Carleton University, Ottawa Canada. [presentation with voice-over here; 15 mb] (Comic from the New York Times article is by David G. Klein)
“Omnia disce; postea videbis nihil esse superfluum” said Hugh of St Victor in the 12th century. ‘Learn everything; later it will all be useful somehow’. The irony of course is that I would in all probability have never come across this epigram (not being a medievalist) if it hadn’t been for the magic of the internet and my faculty’s Dean’s blog. Hugh goes on to say, ‘coartata scientia iucunda non est’, ‘narrow knowledge is not pleasant’. That phrase fits neatly with one of the standard criticisms of blogging, that blogs are narrowly focused, shrill, and often an echo-chamber for their (and their readers’) own views.
In a final neat connection, this phrase of Hugh’s is the epitaph on the tomb of Father Leonard Boyle. Father Boyle is buried at San Clemente in Rome, in the ruins of the 4th century church. This ‘lower church’ was found in the mid 19th century underneath the present basilica (which dates from the 12th century). Father Boyle was the archivist and historian of the Irish Dominicans (who manage the site), and it is indeed a moving testament to his life’s work that he should be buried in the ancient basilica. The epigram then is very much an archaeological sentiment, both in its context of display and in how it implores us to learn everything: for what else is an excavation but the careful recording of everything on the chance that it will be useful later on?
But it’s also directly useful to us who blog archaeology, who take on the mantle of public archaeology. It could, in a sense, be a motto for Google, who try to ‘learn everything’ with no idea of what will be useful to whom or in what way. But that’s the problem right there – deciding what is useful, and finding it. ‘Narrow knowledge is not pleasant’ I think neatly describes the results of search engines in that first phase of the internet, when the world wide web had just been created and people were still trying to produce human-curated guides to the ‘net. Google of course changed everything with the invention of PageRank. The mathematics of PageRank are based on graph theory and network analysis. In essence, PageRank considers each link on a page as a kind of vote on the relative importance of the page being linked to. It also considers the relative importance of the pages doing the linking, and so it’s a recursive process. This was Google’s original insight: that the importance of a page depends on the kind and quality and number of its relations to all other pages on the net.
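That recursive ‘voting’ idea can be shown in miniature. Here is a sketch of the principle – a power-iteration PageRank on a toy four-page web – and emphatically not Google’s production algorithm, which layers many signals on top:

```python
# PageRank by power iteration on a toy web of four pages.
import numpy as np

# links[i] = pages that page i links to
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = len(links)
d = 0.85  # damping factor

# Column-stochastic transition matrix: each outgoing link is a vote,
# split evenly among a page's outgoing links.
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - d) / n + d * M @ rank

print(rank)  # page 2, the most linked-to page, scores highest
```

The recursion is right there in the update line: a page’s new rank depends on the current ranks of the pages that link to it.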
Learn everything: but that’s only half the battle. The other part is determining what is useful, extracting the signal from the noise of not only the search query, but of all those millions of pages of information. And in this, Google benefits from the billions of searches that we the users perform every week. In essence, we are teaching the machine what is useful when we skip over the first page of results, looking for the one that *really* seems to match what we were looking for. Google observes this. Wired Magazine not long ago looked under the hood to see how Google learns from user behaviours. Google isn’t a search engine, or a catalog, or an index: it’s a massive experiment in prediction. Apparently, Google uses over 200 signals to match useful information to each individual user (each of whom has their own idea of what constitutes ‘useful’). PageRank is one signal; others include the title of a webpage, the actual text of a hyperlink, freshness, and the geolocation of the person doing the search.
This isn’t foolproof however. The system can be gamed. In November 2010, the New York Times published a story about DecorMyEyes.com, an online seller of glasses and eyewear, run by one Vitaly Borker. Borker discovered that if he offered poor service to some of his customers, those customers would complain on the internet (especially in forums), linking to his site in warning to others. One would think this would be poison to his business, but on the contrary, Borker discovered that it made his site’s listing on Google search results improve. That is, all publicity is good publicity, as the algorithm powering the search did not consider the semantic meaning of those mentions. So Borker would then go out of his way to aggravate certain of his customers to such a degree that they would generate more web traffic to his store. Once the New York Times broke the story, Google made some changes to its algorithm. Google did not reveal what changes it had made, in order to prevent other unscrupulous individuals from similarly gaming the system. Borker’s website dropped from its number one position to somewhere deep on the twentieth page of results in the immediate aftermath of the changes.
This story is illuminating on a number of levels. As educators, we’re already familiar with the fact that our students turn to the internet, and more specifically, Google, when they begin their research. How deep do they go on a search results page? Search Engine Optimization is a bit of a black art, but all agree that appearing in the first five search results is the key: people do not click on results much past the fifth result. If one’s website does not appear in that golden group, it might as well not exist.
The story about Borker illustrates the way human interaction and Google search are linked. Google looks for actively updated materials; materials that are semantically tight; and materials that people link to. People link to the materials that Google serves up in its top five, thus creating a positive feedback loop. Wikipedia and Google were made for one another. Wikipedia is simultaneously the product of enormous human energy, and enormous human laziness. Wikipedia produces strong signals – whether good or bad, Google doesn’t care. Google returns a Wikipedia page, and a human reads it, a fraction edit it, another fraction link to it whether to praise, disparage, or simply use it as a kind of glossary of terms, thus creating signals that Google picks up. (This incidentally is also an argument for why academics must engage with Wikipedia and actively work to improve its content! It’s also an argument for why Wikipedia cannot be displaced: it’s here to stay, and will only become more dominant through this positive feedback loop).
So how does blogging fit into this? Blogging is a medium, not a genre, and so content itself is a bit secondary. What is important is that blogging as a medium also creates strong signals. ‘Academic blogging’, as a genre, creates very strong signals. That is, it should. Academic blogs tend to have a very tight focus. They are updated fairly regularly, as the academic incorporates them into his or her work cycles. The anchor text for linking tends to be rather unique combinations of words – what Amazon would call ‘statistically improbable phrases’ – and thus provides more signal to Google’s robots. Contrast that with a static department website, for instance. It’s blogging that brings the latest research into that golden group of five results.
Let’s look at some structure. I searched ‘Blogging Archaeology’ via Google, crawled the results, and imported them into Gephi. I let the crawl run for about 20 minutes, recovering over 8,500 nodes linked together by nearly 9,000 edges. There’s a lot of noise when you look at it. However, this network has a diameter (the maximum distance between the two furthest nodes) of 8 – that is, 8 steps from one side to the other. On average, to get from any node to any other node takes roughly 3 steps: a rather tight network. But I want to know where the academic bloggers fit into this, so I run the ‘modularity’ routine in Gephi. This routine looks for areas of self-similarity in the patterning of connections. I find four communities. The archaeological blogosphere (green in the image) centres around Colleen’s Middle Savagery. Light blue in the image seems to correlate to ‘cloud’-based storage. Purple seems to be the social media sector (Facebook etc. – showing, incidentally, what a walled garden it is becoming). Red appears to be aggregator websites. Interestingly, Twitter – microblogging – is the purple node that sits at the intersection of the green and purple (perhaps I need to do this study from a Twitter-centric point of view).
In a sense, these results are not surprising, since I ‘gamed’ the system by looking for a term that I knew was active and heavily represented in the archaeological blogosphere. Let’s look for something a bit more generic: ‘Roman Archaeology’. Crawling the results for the same amount of time, we find 6,240 nodes and 13,216 edges – a denser network already. The diameter of this network is 10, and the average path length is 6, which suggests that it’s going to be a bit more parochial, despite all those connections. When I run the modularity routine on this network, I find 9 communities. The image is striking, almost a barbell shape, with Wikipedia being one of the weights and academia the other (curiously, Columbia and Duke especially) – and the weak link connecting the two is certain blogs and twitter accounts. What better argument for academic blogging, and for considering digital archaeology as public archaeology, could be made? We’ve argued that academic bloggers tell the rest of the world what the academy is up to, but never so strongly as this image depicts. Academia, the font of ‘professional’ knowledge, and Wikipedia, the font of crowdsourced knowledge, connect through us, the academic bloggers.
A consistent presence by an academic blogger can thus perform magic. It begins to tell Google what’s important. Blogging is just a medium, not a genre. It’s a content management system. It’s unfortunate that so many academics are turned off by the word ‘blog’, because they are missing an important new venue for communicating what they do to the wider world. In this day and age, if you’re not making the argument for what you do, someone else will make the argument against it, and it becomes very easy for a decision maker in government to say, ‘What good is x? Let’s cut its funding.’ All archaeology is public archaeology, ultimately, and we ignore that at our peril. We need to create the strongest signal in the noise that we can, and blogging is a crucial part of that. ‘Omnia disce; postea videbis nihil esse superfluum’. Google learns everything, but it still needs to be taught.
That’s our job.
The archaeological blogosphere [zoomable pdf of image] is strangely beautiful. I generated this by scraping over 8000 pages from a Google Search of ‘Blogging Archaeology’. MiddleSavagery sits right there in the middle of the Green Zone. For more on this, and what it means, see my discussion tomorrow at the Society for American Archaeology’s general meeting.
If you’d like to play with the files and data I scraped send me a note. You’ll need Gephi. To do your own crawling, you might try this.
The archaeological blogosphere: green
The cloud: light blue (Google, Amazon, YouTube)
Social Media: Purple (Facebook, Twitter; also online newspapers)
News aggregators: Red (news.google.com)
The unprocessed network is shown below:
This image represents all of the contributions in response to Colleen’s first question for the Blogging Archaeology Carnival. It was created in Gephi using the HTTP Graph plugin. With Gephi open and running, you set your browser to pass its information through Gephi, which then represents all of the resulting data in terms of its network relationships.
So, I began by pointing my browser to Colleen’s post. Data began to fill the Gephi window. Then, I clicked on each link in turn, which would pour more data into Gephi. I returned to Colleen’s post, and then clicked on the next link. And so on. The resulting image (click here for an svg/pdf higher resolution image) shows how we’re all interconnected. One can automate this process by using Chrome with a web crawler (or see the video).
(by the way, you could use this to visualize all sorts of relations scraped from online databases – that’s a post for another day)
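For the curious, a bare-bones version of that automation: fetch a page, follow its links one hop out, and record the page-to-page edges as a CSV that Gephi can import. This is a sketch only – it assumes the requests and beautifulsoup4 libraries are installed, the start URL is a placeholder, and a real crawl should respect robots.txt and rate limits.

```python
# Minimal one-hop link crawler producing a Gephi-ready edge list.
import csv
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

START = 'https://example.com/blog-post'  # placeholder: e.g. the carnival post

def outlinks(url):
    """Return the absolute http(s) links found on a page."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, 'html.parser')
    return {urljoin(url, a['href']) for a in soup.find_all('a', href=True)
            if urljoin(url, a['href']).startswith('http')}

with open('link_edges.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Source', 'Target'])
    first_hop = outlinks(START)
    for link in first_hop:
        writer.writerow([START, link])
    for link in list(first_hop)[:20]:   # politely cap the crawl
        try:
            for target in outlinks(link):
                writer.writerow([link, target])
        except requests.RequestException:
            pass                        # skip pages that fail to load
```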
So, in response to the questions posed for this week’s edition of #blogarch, I would say that one way I try to understand where my blogging fits into the wider ecosystem is to actually map it out from time to time. A bit of navel-gazing, I suppose, but who hasn’t googled themselves at one point or another? My more serious point is to build on Bill’s observation:
Of course the model for understanding blogs that downplays the atomized post:comment relationship is not a product of the digital age and the internet. In fact, I think that the way most people read and write to the web has close parallels with traditions of modern academic writing and reading. Most academics do not pause to comment on specific articles or even individual conference papers (although books and reviews are an exception); instead they build references to these articles into their own work through the predecessor of hyperlinks: footnotes. The networks that have emerged among bloggers have nice parallels with the intellectual networks manifest in academic citations. The biggest difference between the two practices is the speed with which the discourse can develop (and evaporate) through digital publication.
I was over the moon when I got my first comment on my blog, oh-so-long-ago; I was especially chuffed when Bill had kind things to say about my blogging too (thanks Bill!). Nowadays, what comments I get tend, on average, to be spam. Like Bill (and, I suspect, everyone else) I sometimes get emails, phone calls, or ‘by the way’ notes that reference something I have blogged. I recently heard that a class at York in the UK uses some of my blog posts in their course work (as examples of best practice or good ideas, I hope!). Given that, I think it is a useful exercise to try to map out the networks that we are creating through this prolonged short-form engagement with the profession, the public, and our subject matter. Blogging sometimes is a bit like ‘launch and forget’… but we need to have some idea who our community is and how far our thoughts are likely to percolate. We need to be aware of possible network effects in our blogging, and to use these to get our professional voice out there in those top five search results. Is anybody listening? Yes, probably; what I’ve tried to do in my little experiment today is to show how we can begin to approach the question of ‘who?’.