Curating the Dark Data in the long tail of science


ABSTRACT

There is a wealth of scientific data that is almost impossible to see. This is science’s dark data. Much of this data resides in the long tail of science or “small” data collection efforts. Instrumentation has made it possible to develop large collections of relatively homogeneous data, be it from space sensors or high throughput gene sequencers. The monolithic collections are easy to find and search. Dark data on the other hand may constitute the larger mass of scientific information. The collections that make up the dark data of science are much smaller but also much more numerous, being generated by thousands of scientists, on a much broader number of scientific questions, and in a complex array of formats. Unfortunately, it is also more prone to be overlooked and lost over time. Using new technology, the economics of the internet, and change in the sociology of science it is possible to make greater use of this data than was possible in the past. Data curators are the people who develop and use these technologies and procedures to make this data more useful, insuring a more efficient return on investment in the enterprise of science.

This is a really interesting tech talk given by P. Bryan Heidorn of the National Science Foundation Division of Biological Infrastructure, who is also an Associate Professor at the University of Illinois.

I found the talk particularly useful. I had never come across the term Digital Curation before, and was surprised to learn that it is defined as:

Digital curation is the acquisition, management, appraisal, and serving
of data to maximise its usefulness.

Curation embraces and goes beyond that of enhanced present day
re-use, and of archival responsibility, to embrace stewardship that adds
value through the provision of context and linkage: placing emphasis
on publishing data in ways that ease re-use and promoting accountability
and integration. (Rusbridge et al., 2005)

What surprises me is that the goals of these curators are not too dissimilar to the goals of those of us working in the Linked Open Data movement, and I’m wondering whether these two communities should work more closely together … very interesting indeed.

Talis and Creative Commons launch new Open Data licence

Yesterday we at Talis announced some wonderful news: Talis has been working in partnership with the Science Commons project of Creative Commons, and we are all pleased to announce the release of the new Open Data Commons Public Domain Dedication and Licence.

As an organisation Talis has been interested in the licensing issues surrounding Open Data for quite some time now. We’ve been talking about Open Data at conferences and also writing about many of these issues. In 2006 we began this process by launching our own attempt at an Open Data licence, called the Talis Community Licence, which helped to shape some of our initial thoughts. Earlier this year we even convened a special workshop on Open Data at the World Wide Web conference in Banff, which helped us to understand the direction we wanted to move in and who we needed to work with to make this a reality.

This new licence represents a real milestone for us. For the Semantic Web to succeed there needs to be more data coming online, marked up for linking and sharing in this web of data. Hopefully the licence can serve as a tool that enables more of us to share and contribute data.

Open Access and an example of how it can work in education

I’ve been thinking a lot about Open Access, Open Content and indeed Open Data for a while; they are all interrelated issues that we’re thinking about a lot at Talis. It’s true to say that the Open Data issue is probably the one we are focusing on primarily at the moment. In fact one of my colleagues, Paul, is giving a talk [1] on that exact subject at XTech in a couple of weeks, and another of my colleagues, Rob, presented his thoughts [2] on Open Data at EUSIDIC last month. They’ll both be sitting on a panel discussing Open Data at WWW2007 next week in Banff.

Right now though I want to talk about Open Access and a little on Open Content.

Knowledge should be free and open to use and re-use – that’s something I believe.

There has always been a desire amongst academics, in fact it’s more of a tradition, to publish their research in journals without payment, for the sake of inquiry and of sharing that knowledge. Is it altruism alone that motivates these authors, these researchers? I like to believe that it is the main reason 🙂. However, I recognise that Open Access offers these individuals tangible benefits and advantages [3]. For one thing, studies have shown [4] that openly accessible articles and papers are more likely to be cited than those which are locked away behind subscriptions, accessible only to those either willing to pay for that privilege or belonging to a closed community able to gain access to them.

Open Access should make sense because openly accessible articles can be harvested and indexed by search engines and can be viewed by anyone, anywhere. If you’re researching a subject and come across a text you want to read, there isn’t a barrier preventing you from gaining access to that item.

Back in 1994 Stevan Harnad wrote a seminal piece entitled the “Subversive Proposal” [5], which called upon authors of esoteric writings to archive them for free online, in anonymous FTP archives or on websites. His belief was that as soon as all research authors publicly self-archived their refereed and unrefereed papers online, research literature would be free and accessible to all. There was great debate around this proposal, and at the time the commonly held view was that what Harnad was asking for was naive and flawed. I managed to find an excellent retrospective piece by Richard Poynder that discusses the impact of the Subversive Proposal [6] and the history that led up to it.

Over a decade later the Open Access movement has gained a great deal of momentum, which is now threatening the entire scholarly publishing industry. There are numerous Open Access inspired toolkits and services that enable authors to self-archive content, which is then freely available to all. Yet critics of Open Access still maintain that the pay-for-access model is necessary … but I guess when you consider that the scholarly publishing business is worth an estimated $6 billion, it’s not hard to understand why they are so opposed to this.

I felt compelled to share my thoughts today after watching a TED Talk by Richard Baraniuk [7], in which he passionately argues that textbooks and educational materials used in schools should be made available to all through a vast interconnected repository. This would allow anyone to use the information and improve it, bringing the authors, who are often academics, closer to those using their material and encouraging more people to share their knowledge in this new ecosystem. It’s not hard to see how you could abstract this out further to encompass all scholarly articles and not just textbooks. I guess this is where Open Access and Open Content become a little blurred for me, but that’s only because what Richard is proposing is not only allowing people free access to these works but also empowering users to mix content together to create customised works made up of different constituent parts, whilst crediting the authors of each of those parts, and that’s really interesting!

Richard is the founder of Connexions, which is an environment for collaboratively developing, freely sharing, and rapidly publishing scholarly content on the Web under the Creative Commons [8] license. I think it’s a wonderful example of how Open Access and Open Content can be successful. Connexions is focused entirely on developing teaching materials, and whilst this is only a small subset of all scholarly publishing it’s still an extremely compelling and inspiring initiative which is gaining pace. Add to this the notion of on-demand publishing, where students who want an up-to-date physical copy of a book can purchase one for significantly less than they would have paid had the title been produced by a traditional publishing company, since the middleman is effectively cut out of the loop.

When you consider that most academics who write textbooks don’t actually make a significant amount of money from them, it’s understandable why they might wish to participate in initiatives like Connexions. Most of these individuals don’t necessarily write textbooks for money but to make an impact, and this type of system makes their work accessible to more people, thus increasing the potential impact.

Or is my naivety showing?

  1. Opening the Silos: sustainable models for open data
  2. The outlook and the Future
  3. Online or Invisible?
  4. The effect of open access and downloads on citation impact
  5. Subversive Proposal
  6. Ten Years After by Richard Poynder
  7. Goodbye Textbooks; hello, open source learning
  8. Creative Commons