Category Archives: data management
Yesterday, my HeritageCrowd project website was annihilated. Gone. Kaput. Destroyed. Joined the choir.
This is what I think happened, what I now know and need to learn, and what I think the wider digital humanities community needs to think about/teach each other.
HeritageCrowd was (may be again, if I can salvage from the wreckage) a project that tried to encourage the crowdsourcing of local cultural heritage knowledge for a community that does not have particularly good internet access or penetration. It was built on the Ushahidi platform, which allows folks to participate via cell phone text messages. We even had it set up so that a person could leave a voice message and software would automatically transcribe the message and submit it via email. It worked fairly well, and we wrote it up for Writing History in the Digital Age. I was looking forward to working more on it this summer.
Problem #1: Poor record-keeping of the process of getting things installed, and of the decisions taken.
Now, originally, we were using the Crowdmap hosted version of Ushahidi, so we wouldn’t have to worry about things like security, updates, servers, that sort of thing. But… I wanted to customize the look, move the blocks around, and make some other cosmetic changes so that Ushahidi’s genesis in crisis-mapping wouldn’t be quite as evident. When you repurpose software meant for one domain to another, it’s the sort of thing you do. So, I set up a new domain, got some server space, downloaded Ushahidi and installed it. The installation tested my server skills. Unlike setting up WordPress or Omeka (which I’ve done several times), Ushahidi requires the concomitant setup of ‘Kohana’. This was not easy. There are many levels of tacit knowledge in computing, and especially in web-based applications, that I, as an outsider, have not yet learned. It takes a lot of trial and error, and sometimes, just dumb luck. I kept poor records of this period – I was working to a tight deadline, and I wanted to just get the damned thing working. Today, I have no idea what I actually did to get Kohana and Ushahidi playing nice with one another. I think it actually boiled down to file structure.
(It’s funny to think of myself as an outsider, when it comes to all this digital work. I am after all an official, card-carrying ‘digital humanist’. It’s worth remembering what that label actually means. At least one part of it is ‘humanist’. I spent well over a decade learning how to do that part. I’ve only been at the ‘digital’ part since about 2005… and my experience of ‘digital’, at least initially, is in social networks and simulation – things that don’t actually require me to mount materials on the internet. We forget sometimes that there’s more to the digital humanities than building flashy internet-based digital tools. Archaeologists have been using digital methods in their research since the 1960s; Classicists at least that long – and of course Father Busa).
Problem #2: Computers talk to other computers, and persuade them to do things.
I forget where I read it now (it was probably Stephen Ramsay or Geoffrey Rockwell), but digital humanists need to consider artificial intelligence. We do a humanities not just of other humans, but of humans’ creations that engage in their own goal-directed behaviours. As someone who has built a number of agent-based models and simulations, I suppose I shouldn’t have forgotten this. But on the internet, there is a whole netherworld of computers corrupting and enslaving each other, for all sorts of purposes.
HeritageCrowd was destroyed so that one computer could persuade another computer to send spam to gullible humans with erectile dysfunction.
The culprit was a persistent cross-site scripting attack. The persistent (or stored) XSS vulnerability is a more devastating variant of a cross-site scripting flaw: it occurs when the data provided by the attacker is saved by the server, and then permanently displayed on “normal” pages returned to other users in the course of regular browsing, without proper HTML escaping.
When I examine the PHP files on the site, I find all sorts of injected base64 code. So this is what killed my site. Once my site started flooding spam all over the place, the internet’s immune systems (my host’s own, and others) shut it all down. Now, I could just clean everything out and reinstall, but there is a more devastating issue: it appears my SQL database is gone. Destroyed. Erased. No longer present. I’ve asked my host to help confirm that, because at this point, I’m way out of my league. Hey, all you lone digital humanists: how often does your computing services department help you out in this regard? Find someone at your institution who can handle this kind of thing. We can’t wear every hat. I’ve been a one-man band for so long, I’m a bit like the guy in Shawshank Redemption who asks his boss at the supermarket for permission to go to the bathroom. Old habits are hard to break.
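(If it helps anyone else doing this kind of post-mortem: below is a minimal sketch, in Python, of the sort of scan I ended up doing by hand – walking an install and flagging PHP files that contain eval/base64_decode payloads. The install path and the patterns are illustrative only; this is not a real malware scanner.)

```python
# Minimal sketch: walk an install directory and flag PHP files containing
# suspicious eval/base64_decode payloads. Patterns and path are illustrative;
# a real malware scan needs far more than this.
import os
import re

SUSPICIOUS = re.compile(r'eval\s*\(\s*base64_decode|base64_decode\s*\(', re.IGNORECASE)

def scan_php(root):
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith('.php'):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding='utf-8', errors='ignore') as fh:
                    if SUSPICIOUS.search(fh.read()):
                        hits.append(path)
            except OSError:
                pass  # unreadable file; skip it
    return hits

if __name__ == '__main__':
    for path in scan_php('/var/www/heritagecrowd'):  # hypothetical install path
        print('possible injection:', path)
```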
Problem #3: Security Warnings
There are many Ushahidi installations all over the world, and they deal with some pretty sensitive stuff. Security is therefore something Ushahidi takes seriously. I should’ve too. I was not subscribed to the Ushahidi Security Advisories. The hardest pill to swallow is when you know it’s your own damned fault. The warning was there; heed the warnings! Schedule time into every week to keep on top of security. If you’ve got a team, task someone to look after this. I have lots of excuses – it was end of term, things were due, meetings to be held, grades to get in – but it was my responsibility. And I dropped the ball.
Problem #4: Backups
This is the most embarrassing to admit. I did not back things up regularly. I am not ever making that mistake again. Over on Looted Heritage, I have an IFTTT recipe set up that sends every new report to BufferApp, which then tweets it. I’ve also got one that sends every report to Evernote. There are probably more elegant ways to do this. But the worst would be to rely on reminding myself to manually download things. That didn’t work the first time. It ain’t gonna work the next.
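For what it’s worth, here is a minimal sketch of the nightly database dump I should have had running all along. It assumes a MySQL database behind the installation, and the database name and credentials are placeholders, not the project’s actual settings; schedule something like this with cron (or Task Scheduler) and the worst case becomes losing a day rather than a project.

```python
# Minimal sketch of a nightly dump: call mysqldump for the site's database
# and keep a dated copy. Database name and credentials are placeholders.
import datetime
import subprocess
from pathlib import Path

DB_NAME = 'heritagecrowd'      # hypothetical database name
BACKUP_DIR = Path('backups')

def dump_database():
    BACKUP_DIR.mkdir(exist_ok=True)
    stamp = datetime.date.today().isoformat()
    outfile = BACKUP_DIR / f'{DB_NAME}-{stamp}.sql'
    with open(outfile, 'wb') as fh:
        subprocess.run(
            ['mysqldump', '--user=backup_user', '--password=CHANGE_ME', DB_NAME],
            stdout=fh, check=True)
    return outfile

if __name__ == '__main__':
    print('wrote', dump_database())
```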
So what do I do now?
If I can get my database back, I’ll clean everything out and reinstall, and then progress onwards wiser for the experience. If I can’t… well, perhaps that’s the end of HeritageCrowd. It was always an experiment, and as Scott Weingart reminds us,
The best we can do is not as much as we can, but as much as we need. There is a point of diminishing return for data collection; that point at which you can’t measure the coastline fast enough before the tides change it. We as humanists have to become comfortable with incompleteness and imperfection, and trust that in aggregate those data can still tell us something, even if they can’t reveal everything.
The HeritageCrowd project taught me quite a lot about crowdsourcing cultural heritage, about building communities, about the problems, potentials, and perils of data management. Even in its (quite probable) death, I’ve learned some hard lessons. I share them here so that you don’t have to make the same mistakes. Make new ones! Share them! The next time I go to THATCamp, I know what I’ll be proposing. I want a session on the Black Hats, and the dark side of the force. I want to know what the resources are for learning how they work, what I can do to protect myself, and frankly, more about the social and cultural anthropology of their world. Perhaps there is space in the Digital Humanities for that.
When I discovered what had happened, I tweeted about it. Thank you everyone who responded with help and advice. That’s the final lesson I think, about this episode. Don’t be afraid to share your failures, and ask for help. As Bethany wrote some time ago, we’re at that point where we’re building the new ways of knowing for the future, just like the Lunaticks in the 18th century. Embrace your inner Lunatick:
Those 18th-century Lunaticks weren’t about the really big theories and breakthroughs – instead, their heroic work was to codify knowledge, found professional societies and journals, and build all the enabling infrastructure that benefited a succeeding generation of scholars and scientists.
if you agree with me that there’s something remarkable about a generation of trained scholars ready to subsume themselves in collaborative endeavors, to do the grunt work, and to step back from the podium into roles only they can play – that is, to become systems-builders for the humanities — then we might also just pause to appreciate and celebrate, and to use “#alt-ac” as a safe place for people to say, “I’m a Lunatick, too.”
Perhaps my role is to fail gloriously & often, so you don’t have to. I’m ok with that.
This plugin allows multimode network projection. For example: you can project your bipartite (2-mode) graph to a monopartite (one-mode) graph. The projection/transformation is based on the matrix multiplication approach and allows different types of transformations, not only of bipartite graphs. The limitation is matrix multiplication – large matrix multiplication takes time and memory.
After some playing around, and some emails & tweets with Scott, we determined that it does not seem to work at the moment for directed graphs. But if you’ve got a bimodal undirected graph, it works very well indeed! It does require some massaging though. I assume you can already download and install the plugin.
1. Make sure your initial csv file with your data has a column called ‘type’. Fill that column with ‘undirected’. The plugin doesn’t work correctly with directed graphs.
2. Then, once your csv file is imported, create a new column on the nodes table, call it ‘node-type’ – here you specify what the thing is. Fill it up accordingly. (cheese, crackers, for instance).
3. I thank Scott for talking me through this step. First, save your network; this next step will irrevocably change your data. Click ‘load attributes’. Under attribute type, select the column you created in step 2. Then, for left matrix, select Cheese – Crackers; for right matrix, select Crackers – Cheese. Hit ‘run’. This gets you a new Cheese-Cheese network (select the inverse to get a Crackers-Crackers network). You can then remove any isolates or dangly bits by ticking ‘remove edges’ or ‘remove nodes’ as appropriate.
4. Save your new 1 mode network. Go back to the beginning to create the other 1 mode network.
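To make concrete what the plugin is doing under the hood, here is a toy sketch of the matrix-multiplication approach, with cheese-and-crackers data invented purely for illustration: if B is the bipartite incidence matrix (cheeses by crackers), then B times its transpose gives the cheese-cheese projection, and the transpose times B gives the crackers-crackers projection.

```python
# Toy sketch of the matrix-multiplication idea behind one-mode projection:
# rows are cheeses, columns are crackers, a 1 means "appears together".
import numpy as np

cheeses = ['brie', 'cheddar', 'stilton']
crackers = ['water', 'oat', 'rye']

# Bipartite incidence matrix B (cheese x cracker), invented for illustration.
B = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]])

cheese_cheese = B @ B.T      # one-mode projection onto cheeses
cracker_cracker = B.T @ B    # one-mode projection onto crackers

print(cheese_cheese)    # off-diagonal entries count shared crackers
print(cracker_cracker)  # off-diagonal entries count shared cheeses
```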
To crowdsource something – whether it is a problem of code, or the need to transcribe historical documents – is generally about fracturing a problem into its component pieces, and allowing an interested public to solve the tiny pieces. In 2011, my students and I embarked on a project to crowdsource sense of place, using Ushahidi to solicit and collect community memories about cultural heritage resources in Pontiac and Renfrew counties in Eastern Canada (The HeritageCrowd Project). As part of that project, I had initially set up a deployment of Ushahidi using their hosted service, called ‘Crowdmap‘. It contains all of the functionality of vanilla Ushahidi, but since they are doing the hosting, it cannot be customized to any great degree. I mothballed that initial deployment, and installed my own on my own server so I could customize. The entire experience was recounted in a contribution to the forthcoming born-digital volume, ‘Writing History in the Digital Age‘, edited by Jack Dougherty and Kristen Nawrotzki. One of the findings of that small experiment was about the order of operations in a crowdsourced project:
…in one sense our project’s focus was misplaced. Crowdsourcing should not be a first step. The resources are already out there; why not trawl, crawl, spider, and collect what has already been uploaded to the internet? Once the knowledge is collected, then one could call on the crowd to fill in the gaps. This would perhaps be a better use of time, money, and resources.
This January, I started the second term of my year-long first-year seminar in digital antiquity, and I decided to re-start my mothballed Crowdmap as part of a module on crowdsourcing. But as I prepared, that one paragraph above kept haunting me. Perhaps what was needed was not so much a module on crowdsourcing, but rather one on information trapping. I realized that Ushahidi and Crowdmap are better thought of as info-traps. Thus, ‘Looted Heritage: Monitoring the Illicit Antiquities Trade‘ was born. The site monitors various social media and regular media feeds for stories and reports about the trade in antiquities, which can then be mapped, giving a visual depiction of the impact of the trade. I intend to use Zotero public library feeds as well, to enable the mapping of the academic bibliography on the trade.
Why illicit antiquities? The Illicit Antiquities Research Centre at Cambridge University (which is, sadly, closed) provides some statistics on the nature and scale of the illicit antiquities trade; these statistics date to 1999; the problem has only grown with the wars and disruptions of the past decade:
- Italy: 120,000 antiquities seized by police in five years;
- Italy: 100,000+ Apulian tombs devastated;
- Niger: in southwest Niger between 50 and 90 per cent of sites have been destroyed by looters;
- Turkey: more than 560 looters arrested in one year with 10,000 objects in their possession;
- Cyprus: 60,000 objects looted since 1974;
- China: catalogues of Sotheby’s sales found in the poor countryside: at least 15,000 sites vandalized, 110,000 illicit cultural objects intercepted in four years;
- Cambodia: 300 armed bandits surround Angkor Conservation compound, using hand grenades to blow apart the monuments; 93 Buddha heads intercepted in June this year, 342 other objects a month later;
- Syria: the situation is now so bad a new law has been passed which sends looters to jail for 15 years;
- Belize: 73 per cent of major sites looted;
- Guatemala: thieves now so aggressive they even looted from the laboratory at Tikal;
- Peru: 100,000 tombs looted, half the known sites.
The idea then is to provide my students with hands-on experience using Crowdmap as a tool, and to foster engagement with the real-world consequences of pot-hunting and tomb-robbing. Crowdmap also allows users to download all of the reports that get created. One may download all reports on Looted Heritage in CSV format at the download page. I will be using this data as part of my teaching about data mining and text analysis, getting the students to run this data through tools like Voyant to spot patterns in the way that the antiquities trade is portrayed in the media.
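As a taste of what that data mining might look like before we even get to Voyant, here is a rough sketch that counts the most frequent words in the downloaded reports. The column name is an assumption about the Crowdmap CSV export, so adjust it to whatever the download actually contains.

```python
# Quick sketch: count the most frequent words in downloaded Crowdmap reports.
# The column name is an assumption about the CSV export, not guaranteed.
import csv
from collections import Counter

STOPWORDS = {'the', 'a', 'an', 'of', 'and', 'in', 'to', 'for', 'on', 'is', 'was'}

def top_words(csv_path, column='DESCRIPTION', n=25):
    counts = Counter()
    with open(csv_path, newline='', encoding='utf-8') as fh:
        for row in csv.DictReader(fh):
            for word in row.get(column, '').lower().split():
                word = word.strip('.,;:"()')
                if word and word not in STOPWORDS:
                    counts[word] += 1
    return counts.most_common(n)

if __name__ == '__main__':
    for word, count in top_words('looted_heritage_reports.csv'):
        print(f'{count:5d}  {word}')
```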
At one point, I also had feeds from eBay for various kinds of artefacts, Greco-Roman, Egyptian, pre-Columbian, etc, but there is so much volume going through eBay that it completely overwhelmed the other signals. I think dealing with eBay and exploring the scope of the trade there will require different scrapers and quantitative tools, so I’m leaving that problem aside for the time being.
I’m also waiting anxiously for trap.it, which was described earlier this week in Profhacker, to allow exports of the information it finds. Right now, you can only consume what it finds within its ecosystem (or through kludgy workarounds). Ideally, I would be able to grab a feed from Trap.it for one of its traps and bring that directly into Looted Heritage. Trap.it is more of an active agent than the passive sieve approach that Crowdmap takes, so combining the two ought to be a powerful approach. From Profhacker:
[Trap.it is] a web service that combines machine learning algorithms with user-selected topics and filters. (The algorithms used in this project stem from the same research that led to Apple’s Siri.) After creating an account, you create a “trap” by entering in a keyword or short phrase into the Discovery box. Once you save your trap, you personalize it by clicking thumbs up or thumbs down on a number of articles in your trap. The more articles you rate, the closer attuned the trap becomes to the kinds of material you want to read.
Finally, one of the other aspects we learned in the HeritageCrowd project was the importance of outreach. For Looted Heritage, I am using If This Then That to monitor Looted Heritage’s feed, sending new reports to Buffer App, which then sends them to Twitter, @looted_heritage. Aside from some problems with duplicated tweets, I’ve been quite happy with this setup (and thanks to Terry Brock for suggesting this). There’s an interesting possibility of circularity there, with Crowdmap picking up @looted_heritage in its traps, and then sending them out again… but my students should spot that if it occurs.
Ushahidi also provides an iOS app that can be deployed in the field, so that an archaeologist who discovers the work of tombaroli could take geo-tagged photographs and submit them with a click to the Looted Heritage installation, drawing police attention.
So far, the response to this project has been good, with 20 people now following the twitter account since I set it up last week (and which I haven’t promoted very actively). Please feel free to contribute reports to Looted Heritage via its submission page, or by tagging your tweets with #looted #antiquities.
I and my students have made some contributions to ‘Writing History in the Digital Age‘, the born-digital volume edited by Jack Dougherty and Kristen Nawrotzki. Rather than reflect on the writing process, I thought I’d topic model the volume to see what patterns emerged in the contributions.
I used Mallet to do this. I’ve posted earlier about how to get Mallet running. I used Outwit Hub to scrape each individual paragraph from each paper (> 700 paragraphs) into a CSV file (I did not scrape block quotes, so my paragraph numbers are slightly out of sync with those used on the Writing History website). I used the Textme Excel macro (google it; it lives in multiple versions and requires a bit of modification to work exactly the way you want it to) to save each paragraph into its own unique text file, which I then loaded into Mallet.
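(For anyone who would rather skip the Excel macro: below is a rough Python sketch of the same step – write each row of the scraped CSV out as its own text file, ready for Mallet’s import-dir. The column name ‘paragraph’ is an assumption about my scraped file, not a standard.)

```python
# Rough sketch of what the Excel macro does: write each scraped paragraph
# (one row of the CSV) out as its own text file for MALLET's import-dir.
# The column name 'paragraph' is an assumption about the scraped file.
import csv
from pathlib import Path

def split_paragraphs(csv_path, out_dir='data/writing_history'):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(csv_path, newline='', encoding='utf-8') as fh:
        for i, row in enumerate(csv.DictReader(fh), start=1):
            (out / f'para_{i:04d}.txt').write_text(row['paragraph'], encoding='utf-8')

if __name__ == '__main__':
    split_paragraphs('writing_history_paragraphs.csv')
```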
Phew. Now, the tricky part with Mallet is deciding how many topics you want it to look for. Finding the *right* number of topics requires a bit of iteration – start with, say, 10. Look at the resulting composition of files to topics. If an inordinate number of files all fall into one topic, you don’t have enough granularity yet.
As an initial read, I went with 15 topics. One topic – which I’ll label ‘working with data’ – had quite a large number of files assigned to it (see the composition document); remember, the files are the individual paragraphs from the papers. Ideally, I would re-run the analysis with a greater number of topics, so that the ‘working with data’ topic would get broken up.
I also graphed the results, so that each author is linked to the topics which compose his or her paper; the thickness of the line indicates multiple paragraphs with that topic. I have also graphed topics by individual paragraphs, but the granularity isn’t ideal, making the resulting visual not all that useful. The colours correspond with the ‘modularity’ of the graph, that is, communities of similar patterns of connections. The size of the node represents ‘betweenness’, calculated over all paths between every pair of nodes.
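Gephi is what I actually used for the graphing, but for the curious, here is a small sketch of how the same kind of author-to-topic graph and its metrics could be computed in Python with networkx. The edges are invented for illustration; only the topic labels come from the run described above.

```python
# Sketch (not my actual Gephi workflow): build a small author-topic graph
# and compute betweenness and modularity communities with networkx.
# The author names and edges below are invented purely for illustration.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

edges = [
    ('Author A', 'working with data', 3),
    ('Author A', 'blogging and peer interactions', 1),
    ('Author B', 'working with data', 2),
    ('Author B', 'keywords and search', 2),
    ('Author C', 'space and geography', 4),
]

G = nx.Graph()
for author, topic, n_paragraphs in edges:
    G.add_edge(author, topic, weight=n_paragraphs)  # thickness ~ paragraph count

betweenness = nx.betweenness_centrality(G)
communities = greedy_modularity_communities(G, weight='weight')

for node, score in sorted(betweenness.items(), key=lambda kv: -kv[1]):
    print(f'{score:.3f}  {node}')
print('communities:', [sorted(c) for c in communities])
```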
So what does it all mean? At the level of paragraph-by-topic, if we had the correct level of granularity, one might be able to read the entire volume by treating the graph as a guide to hyperlinking from paragraph to paragraph, perhaps – a machine generated map/index of the internal structure of ideas. At the level of individual authors, it perhaps suggests papers to read together and the organizing themes of the volume.
This is of course a quick and dirty visualization and analysis, and these are my initial impressions. More time, more consideration, and greater granularity are to be desired.
- Working with Data (1)
- African Americans and the South (2)
- Primary Resources, Teaching, and Libraries (2)
- Blogging and Peer Interactions (3)
- Keywords and Search (4)
- Japan and History (4)
- Space and Geography (4)
UPDATE! September 19th 2012: Scott Weingart, Ian Milligan, and I have written an expanded ‘how to get started with Topic Modeling and MALLET’ for the Programming Historian 2. Please do consult that piece for detailed step-by-step instructions for getting the software installed, getting your data into it, and thinking through what the results might mean.
Original Post that Inspired It All:
I’m very interested in topic modeling at the moment. It has not been easy however to get started – I owe a debt of thanks to Rob Nelson for helping me to get going. In the interests of giving other folks a boost, of paying it forward, I’ll share my recipe. I’m also doing this for the benefit of some of my students. Let’s get cracking!
First, some background reading:
- Clay Templeton, “Topic Modeling in the Humanities: An Overview | Maryland Institute for Technology in the Humanities”, n.d., http://mith.umd.edu/topic-modeling-in-the-humanities-an-overview/.
- Rob Nelson, Mining the Dispatch http://dsl.richmond.edu/dispatch/
- Cameron Blevins, “Topic Modeling Martha Ballard’s Diary” Historying, April 1, 2010, http://historying.org/2010/04/01/topic-modeling-martha-ballards-diary/
- David J Newman and Sharon Block, “Probabilistic topic decomposition of an eighteenth‐century American newspaper,” Journal of the American Society for Information Science and Technology 57, no. 6 (April 1, 2006): 753-767.
- David Blei, Andrew Ng, and Michael Jordan, “Latent dirichlet allocation,” The Journal of Machine Learning Research 3 (2003), http://dl.acm.org/citation.cfm?id=944937.
Then, you’ll need the Java developer’s kit – nb, not the regular Java that’s on every computer, but the one that lets you program things. Install this.
Unzip Mallet into your C:\ directory. This is important; it can’t be anywhere else. You’ll then have a folder called C:\mallet-2.0.6 or similar.
Next, you’ll need to create an environment variable called MALLET_HOME. You do this by clicking on control panel >> system >> advanced system settings (in Windows 7; for XP, see this article), ‘environment variables’. In the pop-up, click ‘new’ and type MALLET_HOME in the variable name box; type c:/mallet-2.0.6 (ie, the exact location where you unzipped Mallet) in variable value.
To run mallet, click on your start menu >> all programs >> accessories >> command prompt. You’ll get the command prompt window, which will have a cursor at c:\user\user> (or similar). Type cd .. (two periods; that ain’t a typo) to go up a level; keep doing this until you’re at C:\. Then type cd mallet-2.0.6 and you’re in the Mallet directory. You can now type Mallet commands directly. If you type bin\mallet at this point, you should be presented with a list of Mallet commands – congratulations!
At this point, you’ll want some data. Using the regular windows explorer, I create a folder within mallet where I put all of the data I want to study (let’s call it ‘data’). If I were to study someone’s diary, I’d create a unique text file for each entry, naming the text file with the entry’s date. Then, following the topic modeling instructions on the mallet page, I’d import that folder, and see what happens next. I’ve got some work flow for scraping data from websites and other repositories, but I’ll leave that for another day (or skip ahead to The Programming Historian for one way of going about it.)
Once you’ve imported your documents, Mallet creates a single ‘mallet’ file that you then manipulate to determine topics.
bin\mallet import-dir --input data\johndoediary --output johndoediary.mallet --keep-sequence --remove-stopwords
(modified from the Mallet topic modeling page)
This sequence of commands tells mallet to import a directory located in the subfolder ‘data’ called ‘johndoediary’ (which contains a sequence of txt files). It then outputs that data into a file we’re calling ‘johndoediary.mallet’. Removing stopwords strips out ‘and’, ‘of’, ‘the’, etc.
Then we’re ready to find some topics:
bin\mallet train-topics --input johndoediary.mallet --num-topics 100 --output-state topic-state.gz --output-topic-keys johndoediary_keys.txt --output-doc-topics johndoediary_composition.txt
(modified from the Mallet topic modeling page)
Now, there are more complicated things you can do with this – take a look at the documentation on the Mallet page. Is there a ‘natural’ number of topics? I do not know. What I have found is that I have to run the train-topics with varying numbers of topics to see how the composition file breaks down. If I end up with the majority of my original texts all in a very limited number of topics, then I need to increase the number of topics; my settings were too coarse.
More on interpreting the output of Mallet to follow.
Again, I owe an enormous debt of gratitude to Rob Nelson for talking me through the intricacies of getting Mallet to work, and for the record, I think the work he is doing is tremendously important and fascinating!
Time was, if you wanted some augmented reality, you had to upload your own points of interest into something like Wikitude or Layar. However, in its quest for world domination, Google seems to be working on something that will render those services moot: Google Goggles (silly name, profound implications).
As Leonard Low says on the MLearning Blog:
The official Google site for the project (which is still in development) provides a number of ways Goggles can be used to accomplish a “visual search”, including landmarks, books, contact information, artwork, places, logos, and even wine labels (which I anticipate could go much further, to cover product packaging more broadly).
So why is this a significant development for m-learning? Because this innovation will enable learners to “explore” the physical world without assuming any prior knowledge. If you know absolutely nothing about an object, Goggles will provide you with a start. Here’s an example: you’re studying industrial design, and you happen to spot a rather nicely-designed chair. However, there’s no information on the chair about who designed it. How do you find out some information about the chair, which you’d like to note as an influence in your own designs? A textual search is useless, but a visual search would allow you to take a photo of the chair and let Google’s servers offer some suggestions about who might have manufactured, designed, or sold it. Ditto unusual insects, species of tree, graphic designs, sculptures, or whatever you might happen to be interested in learning.
Just watch this space. I think Google Goggles is going to rock m-learning…
Now imagine this in action with an archaeological site, and Google connects you with something less than what we as archaeological professionals would like to see. Say it was some sort of aboriginal site with profound cultural significance – but the site it connects with argues the opposite. Another argument for archaeologists and historians to ‘create signal’ and to tell Google what’s important.
It would’ve been nice if the IT folks at U Manitoba had given me some warning that they were about to close my account. I’m no longer going to be teaching for them in the fall, it is true; but a lot of my stuff – not to mention my agent models – are on their servers.
My own fault, I guess – I should’ve cleaned everything off of there when I decided to decline the fall courses, but still, it would’ve been nice to have had some warning.
So, if you’re looking for me @umanitoba.ca, that doesn’t work any more. I’ll be getting some new contact info before too much longer, and hopefully, some space for my simulations, too.
Have you seen the old man
In the closed-down market
Kicking up the paper,
with his worn out shoes?
In his eyes you see no pride
And held loosely at his side
Yesterday’s paper telling yesterday’s news
So how can you tell me you’re lonely,
And say for you that the sun don’t shine?
Let me take you by the hand and lead you through the streets of London
I’ll show you something to make you change your mind
a) I had an iPhone and
b) I was in London.
I look forward to seeing more of these sorts of things emerge. Imagine – mashing the physical, the digital, the past, and the present all at once. Landscape archaeology as palimpsest is a fairly standard idea, but these sorts of applications should only enhance the notion more popularly [he said, hopefully...]
From THATCamp Paris, a manifesto for Digital Humanities (I translate from the French below, with a wee bit of a kickstart from Google Translate; I do not guarantee that this is a perfect or most accurate translation):
Manifesto for the Digital Humanities
We, practitioners and observers of the digital humanities (Digital Humanities), met in Paris at THATCamp on May 18 and 19, 2010.
During these two days, we discussed, exchanged, and reflected together on what the digital humanities are, and we tried to imagine and invent what they might become.
After these two days, which are only one step, we propose to research communities, and to all those involved in the creation, editing, enhancement, or preservation of knowledge, a manifesto for the “digital humanities”.
1. The computational turn taken by society changes and calls into question the conditions of production and dissemination of knowledge.
2. For us, the digital humanities relate to all Social Sciences, Arts and Letters. The digital humanities are not a clean slate. They rely instead on all the paradigms, skills and knowledge specific to these disciplines, while leveraging the tools and the unique perspectives of the digital field.
3. The digital humanities designate a ‘transdiscipline’, embodying the methods, devices and heuristics related to digital opportunities in the field of humanities and social sciences.
4. We note:
- That there has been increasing experimentation in the digital field within the humanities and social sciences over the last half-century. What has emerged more recently – digital humanities centers – are, at present, prototypes or specific areas of application of an approach to the digital humanities;
- That computational or digital approaches impose a stronger technical constraint, and thus an economic one; and that this constraint is therefore an opportunity to change how we work collectively;
- That there are a number of proven methods, though they are unequally known and shared;
- That there are multiple communities arising from particular practices, tools, or interdisciplinary approaches (encoding of textual sources, geographic information systems, lexicometry, digitization of cultural heritage, scientific and technical web mapping, data mining, 3D, oral archives, digital arts and hypermedia literature, etc.), and that these communities are converging to form the field of the ‘digital humanities’.
5. We, the practitioners of digital humanities, are building a community of practice that is open, welcoming, and freely accessible.
6. We are a community without borders. We are a multilingual community and we are multidisciplinary.
7. Our aims are the advancement of knowledge, enhancing the quality of research in our disciplines, and the enrichment of the knowledge not just within but also beyond the academic sphere.
8. We call for the integration of digital culture in the definition of the general culture of the twenty-first century.
9. We call for open access to data and metadata. These must be documented and interoperable, both technically and conceptually.
10. We support the dissemination, movement and free enhancement of methods, code, formats and results of research.
11. We call for the integration of training in digital humanities within the curriculum in Social Studies in Arts and Letters. We also want the creation of specialist diplomas in the digital humanities and the development of dedicated professional training. Finally, we hope that these skills will be taken into account in recruitment and career development.
12. We are committed to building a collective competency based on a common vocabulary, one which emerges from the collective expertise of all working practitioners. This collective expertise is to become a common good. It is a scientific opportunity, but also an opportunity for professional development in all sectors.
13. We want to participate in defining and disseminating best practices corresponding to identified disciplinary and interdisciplinary needs. These needs will be identified as they emerge from debate and consensus amongst the communities concerned. The fundamental openness of the digital humanities nevertheless guarantees a pragmatic approach to protocols and visions, one which maintains the right of different and competing approaches to coexist, for the benefit of enriching thinking and practices.
14. We call for the construction of scalable cyberinfrastructures responding to real needs. These cyberinfrastructures will be built iteratively, based on methods and approaches that have proven successful in research communities.
(errors of translation are my own)
I am a member of the Working Group on Open Archaeology. Recently in the discussion, Anthony Beck linked to a recent presentation of his called ‘Dig the new breed: how open approaches can empower archaeologists’:
In one of his slides, he mentions Richard Bradley, from my alma mater, the University of Reading, and how Richard used the grey literature from various commercial bodies to write his history of Bronze Age Britain. He links to this article. As I was reading it, it occurred to me that here is a perfect opportunity for crowdsourcing… perhaps.
What would it cost to digitize all of the UK’s grey literature? Here are the plans for a $20 DIY book scanner which uses a basic point-and-shoot digital camera. And here is an open source optical character recognition package from the good people at Google.
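I believe the Google OCR package linked above is Tesseract; assuming that, here is a minimal sketch of how a pile of page images from such a scanner could be turned into text. The pytesseract wrapper and the paths are my assumptions for illustration, not part of any existing workflow.

```python
# Minimal sketch: OCR a folder of page images from a DIY book scanner.
# Assumes the open-source OCR engine is Tesseract, driven through the
# pytesseract wrapper; paths and filenames are placeholders.
from pathlib import Path

import pytesseract
from PIL import Image

def ocr_pages(image_dir='scans/report_0001', out_file='report_0001.txt'):
    pages = sorted(Path(image_dir).glob('*.jpg'))
    text = '\n\n'.join(pytesseract.image_to_string(Image.open(p)) for p in pages)
    Path(out_file).write_text(text, encoding='utf-8')
    return out_file

if __name__ == '__main__':
    print('wrote', ocr_pages())
```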
So only two hurdles remain: getting access to the grey literature, and the manpower to do this (hence the crowdsourcing). It would be interesting, perhaps, for a PhD student to try this out at their local archaeological consultancy, and then perhaps use some data mining techniques (like in this example) to quickly begin to extract useful information.
The technology is there… let’s make it work!