Curating the Dark Data in the long tail of science


ABSTRACT

There is a wealth of scientific data that is almost impossible to see. This is science’s dark data. Much of this data resides in the long tail of science or “small” data collection efforts. Instrumentation has made it possible to develop large collections of relatively homogeneous data, be it from space sensors or high throughput gene sequencers. The monolithic collections are easy to find and search. Dark data on the other hand may constitute the larger mass of scientific information. The collections that make up the dark data of science are much smaller but also much more numerous, being generated by thousands of scientists, on a much broader number of scientific questions, and in a complex array of formats. Unfortunately, it is also more prone to be overlooked and lost over time. Using new technology, the economics of the internet, and change in the sociology of science it is possible to make greater use of this data than was possible in the past. Data curators are the people who develop and use these technologies and procedures to make this data more useful, insuring a more efficient return on investment in the enterprise of science.

This is a really interesting tech talk given by P. Bryan Heidorn of the National Science Foundation Division of Biological Infrastructure, who is also an Associate Professor at the University of Illinois.

I found the talk particularly useful. I had never come across the term Digital Curation before, and was surprised to learn that it is defined as:

Digital curation is the acquisition, management, appraisal, and serving
of data to maximise its usefulness.

Curation embraces and goes beyond that of enhanced present day
re-use, and of archival responsibility, to embrace stewardship that adds
value through the provision of context and linkage: placing emphasis
on publishing data in ways that ease re-use and promoting accountability
and integration. (Rusbridge et al., 2005)

What surprises me is that the goals of these curators are not too dissimilar from the goals of those of us working in the Linked Open Data movement, and I’m wondering whether these two communities should work more closely together … very interesting indeed.