Research data management
Who, what, where, when, why … and how?
Developed by Belinda Weaver, The University of Queensland Library
Today’s session
A new era in research
A thousand years ago:
describing natural phenomena
Last few hundred years:
using models, generalizations
Last few decades:
simulating complex phenomena
Today: data exploration (eScience)
– Data captured by instruments or generated by simulator
– Processed by software
– Information/knowledge stored in computer
– Scientist analyses database / files using data management and statistics
The data deluge is REAL
And it’s a continuum …
We are here now
We need to be here within a year
This is where we eventually want to be
We will only go here when we are ready
We have scoped what is needed
Gauge
Refer
Advise
Support
Partner
Australian Research Funding
Department of Innovation, Industry, Science and Research (DIISR)
National Collaborative Research Infrastructure Strategy (NCRIS)
A $542m, 7-year project
Provides and supports major research infrastructure
Research Data Storage Infrastructure Project (RDSI)
Platforms for Collaboration (PfC)
National Computational Infrastructure (NCI)
Australian National Data Service (ANDS)
Australian Research Collaboration Service (ARCS)
National eResearch Collaboration Tools and Resources Project (NecTAR)
Super Science
$1.1 billion to priority areas of Australian research, $90m for research infrastructure
Priority Areas
Space and Astronomy $160.5m
Marine and Climate $387.7m
Future Industries $504.0m
Education Investment Fund (EIF)
The game changers in research
Images from
State of play in the humanities
http://www.rin.ac.uk/our-work/using-and-accessing-information-resources/information-use-case-studies-humanities
State of play in the life sciences
http://www.rin.ac.uk/our-work/using-and-accessing-information-resources/patterns-information-use-and-exchange-case-studie
So … what is the eResearch landscape like?
The Bentham Project has recently begun an initiative to transcribe the manuscripts of Jeremy Bentham (1748–1832), the utilitarian philosopher, jurist, economist, political theorist, and social reformer.
Crowdsourcing
Civil War Diaries Transcription Project
Exhibit: http://digital.lib.uiowa.edu/cwd/index.php
Project: http://digital.lib.uiowa.edu/cwd/transcripts.html
Crowdsourcing
Galaxy Zoo
Crowdsourcing
One of a number of projects in the Zooniverse: http://www.zooniverse.org/
Model Earth’s climate using wartime ship logs
Helps scientists recover weather observations made by Royal Navy ships around the time of World War I.
Transcriptions will contribute to climate model projections and improve a database of weather extremes.
Historians use the work to track past ship movements and research the stories of people on board.
One of a number of projects in the Zooniverse
Crowdsourcing
https://www.zooniverse.org/project/oldweather
Ancient Lives
The data gathered by ANCIENT LIVES helps scholars study the Oxyrhynchus collection.
Transcriptions collected digitally are combined with human and computer logic to identify known texts and documents.
One of a number of projects in the Zooniverse: https://www.zooniverse.org/project/ancientlives
The citizens of this town, five days journey by road south of Memphis, called it Oxyrhynchus, or Oxyrhynchon polis … City of the Sharp-Nosed Fish
Crowdsourcing
“By February 2011 we had 20,000 plus people helping out and 30 million lines of text had been corrected during the last 2 years.”
Crowdsourcing tips from Rose Holley of the NLA
Currently 5 million pages, and counting …
Find them in Trove
http://trove.nla.gov.au/newspaper
Crowdsourcing
The Project: http://www.nla.gov.au/ndp/
MONK Project
Data Mining
MONK is a digital environment designed to help humanities scholars discover and analyse patterns in the texts they study. It includes
MONK provides texts and tools to enable literary research through the discovery, exploration, and visualization of patterns.
Each toolset is made up of individual tools (e.g. a search tool, a browsing tool, a rating tool, and a visualization).
Comparative novelty map
in three novels
http://datamining.typepad.com/
http://www.monkproject.org/
Old Bailey Online
A fully searchable edition of the largest body of texts detailing the lives of non-elite people ever published, containing 197,745 criminal trials held at London's central criminal court.
The Proceedings of the Old Bailey contain 120 million words, recording 197,000 trials held at the Old Bailey, or Central Criminal Court in London, between 1674 and 1913.
Digitisation
http://www.oldbaileyonline.org/
Data Mining
The Digital Republic of Letters – collaborative projects
Cultures of Knowledge: An Intellectual Geography of the Seventeenth-Century Republic of Letters
A collaborative, interdisciplinary research project reconstructing correspondence networks central to the revolutionary intellectual developments of the early modern period.
Mapping the Republic of Letters
A Stanford University project on Correspondence and Intellectual Community in the Early Modern Period (1500–1800), based around case studies of individuals such as Benjamin Franklin, Voltaire, and John Locke.
https://republicofletters.stanford.edu/
http://www.history.ox.ac.uk/cofk/
Circulation of Knowledge and learned practices in the 17th-century Dutch Republic
Until the publication of the first scientific journals in the 1660s, letters were the most important means of communication between intellectuals. The project has created a machine-readable and growing corpus of approximately 20,000 letters between 17th-century scholars.
http://ckcc.huygens.knaw.nl/
Visualisation
The game changers in research
Images from
The Australian Code for the Responsible Conduct of Research is …
Soon a UQ requirement as well. The draft policy, which will address the Code, will be considered by the UQ Research Committee and other key stakeholders.
http://www.nhmrc.gov.au/guidelines/publications/r39
What’s in the Code?
http://www.nhmrc.gov.au/guidelines/publications/r39
A Data Management Plan …
… is a document that describes how you will collect, organise, manage, store, secure, back up, preserve, and share your data.
What needs to be included in a plan?
What are the obstacles to good planning?
Fact Sheet 3
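A quick way to see what a draft plan still lacks is to check it against a list of topics. The sketch below is illustrative only: the section names are taken from the topics this session covers (data description, metadata, ownership, ethics, sharing, formats, storage, retention, long-term home), and the key names, the `missing_sections` helper, and the sample draft are all hypothetical, not part of any official template.

```python
# Topics drawn from this session's "Why cover ...?" slides.
# The key names are illustrative assumptions, not an official schema.
PLAN_SECTIONS = [
    "data_description",
    "metadata",
    "ownership_copyright_ip",
    "ethics",
    "data_sharing",
    "data_formats",
    "security_and_storage",
    "retention_period",
    "long_term_home",
]


def missing_sections(plan: dict) -> list:
    """Return the sections that are absent or left blank in a draft plan."""
    return [s for s in PLAN_SECTIONS if not (plan.get(s) or "").strip()]


# Hypothetical draft: two sections filled in, one left blank.
draft = {
    "data_description": "Interview recordings and transcripts",
    "ethics": "Clearance applied for; application number recorded",
    "data_formats": "  ",
}

for section in missing_sections(draft):
    print("Still to write:", section)
```

A list like this can double as a conversation starter with a researcher: each missing section maps to one of the fact sheets.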
Why include a data description?
The Code says:
“Policies are required that address the ownership of research materials and data, their storage, their retention beyond the end of the project, and appropriate access to them by the research community.”
Our role?
Scare factor?
Low
Reality?
Low
Researchers need to know
Fact Sheets 12 and 15
Examples from Research Data Australia
Repurposing data descriptions – from plans to RDA
RDA collections do not have to be
Why bother?
Fact Sheet 15
Why cover metadata?
Our role?
Scare factor?
High
Reality?
High
There are 3 main types of metadata
Fact Sheet 12
Metadata – 3 types
Descriptive metadata
Administrative metadata
Structural metadata
How to provide metadata:
In an accompanying document, an XML file, a README file, via repository metadata …
Fact Sheet 12
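As one example of the README route, the sketch below assembles a plain-text README with one block per metadata type. It assumes the usual reading of the three types (descriptive: what the data is; administrative: rights, access, and management details; structural: how the files fit together). The `build_readme` helper, the field names, and the sample values are hypothetical and do not follow any particular metadata standard.

```python
def build_readme(descriptive: dict, administrative: dict, structural: dict) -> str:
    """Lay out three groups of metadata fields as a plain-text README."""
    groups = [
        ("Descriptive metadata", descriptive),
        ("Administrative metadata", administrative),
        ("Structural metadata", structural),
    ]
    lines = []
    for heading, fields in groups:
        lines.append(heading)
        lines.append("-" * len(heading))  # underline the heading
        for name, value in fields.items():
            lines.append(f"{name}: {value}")
        lines.append("")  # blank line between groups
    return "\n".join(lines)


# Hypothetical dataset, for illustration only.
readme_text = build_readme(
    descriptive={"Title": "Coastal water quality survey", "Creator": "A. Researcher"},
    administrative={"Rights": "Contact the creator for access", "File format": "CSV"},
    structural={"Files": "one CSV file per sampling site, plus a codebook"},
)
print(readme_text)
```

Saving the result alongside the data files (for example as README.txt) keeps the description with the dataset wherever it is copied.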
Why cover ownership, copyright and IP?
Our role?
✓ Refer to the appropriate source of advice
Scare factor?
High
Reality?
Low
These issues strongly affect use and potential re-use.
Key advice
The Code says:
“Policies are required that address the ownership of research materials ….”
Fact Sheet 10
Why cover issues such as ethics?
Our role?
… and Innovation Division, who have committees on human and animal ethics with strict procedures to follow
Scare factor?
Medium
Reality?
Low
The Code says:
“Protection of human subjects is a fundamental tenet of research and an important ethical obligation for everyone involved in research projects. Disclosure of identities when privacy has been promised could result in lower participation rates and a negative impact on science.”
Researchers need to record that they have applied for the requisite ethical clearances for research with human or animal subjects.
This will involve documenting the decisions of ethical committees or recording application numbers.
Fact Sheet 9
Why cover data sharing?
The Code says:
“Policies are required that address the ownership of research materials and data …and appropriate access to them by the research community.”
Sharing may be hampered by
But sharing is encouraged by
Our role?
✓ Refer them to tools and services.
Scare factor?
Medium
Reality?
Medium
Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. doi: 10.1371/journal.pone.0000308
– provides evidence that data sharing boosts citations and increases research visibility
Fact Sheet 7
Why cover data formats?
Researchers should be encouraged to choose and use common, robust, well-documented formats that will last
Our role?
✓ Point them to checklists.
Scare factor?
High
Reality?
Medium
The Code says:
“Policies are required that address … data … storage … [and] retention beyond the end of the project ...”
Fact Sheet 16
Why cover security and storage?
Scare factor?
Medium
Our role?
✓ Provide storage pros and cons, e.g. checklists
✓ Monitor services that offer storage
✓ Monitor services such as HPC
Reality?
Medium
Researchers need three separate copies of their data to be safe.
The Code specifically covers
Management of research data and primary materials (section 2)
Researchers must ensure that all research data, regardless of format, is stored securely and backed up or copied regularly.
The plan records these arrangements.
Fact Sheets 5 and 17
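Copies are only useful if they are still identical. The sketch below shows one way a researcher might check that: it computes a SHA-256 checksum for each copy and confirms that at least three copies exist and all match. The `copies_match` helper and the file names are hypothetical, and the demonstration uses temporary files rather than real storage locations.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum a file in chunks so large data files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(paths) -> bool:
    """True when there are at least three copies, all present and identical."""
    paths = [Path(p) for p in paths]
    if len(paths) < 3 or not all(p.is_file() for p in paths):
        return False
    return len({sha256_of(p) for p in paths}) == 1


# Demonstration with temporary stand-ins for three storage locations.
with tempfile.TemporaryDirectory() as tmp:
    names = ("laptop.csv", "network_drive.csv", "external_disk.csv")
    copies = [Path(tmp) / name for name in names]
    for copy in copies:
        copy.write_bytes(b"site,reading\nA,1.5\n")
    print("All copies identical:", copies_match(copies))

    # Simulate one copy drifting out of date.
    copies[2].write_bytes(b"site,reading\nA,9.9\n")
    print("After one copy changes:", copies_match(copies))
```

Running a check like this on a schedule, and noting the schedule in the plan, is one way to record that backups are actually being verified.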
Why cover retention periods?
The Code says:
“The researcher must decide which data and materials should be retained, although in some cases this is determined by law, funding agency, publisher or by convention in the discipline.
The central aim is that sufficient materials and data are retained to justify the outcomes of the research and to defend them if they are challenged.
The potential value of the material for further research should also be considered, particularly where the research would be difficult or impossible to repeat.”
Scare factor?
Low
Our role?
Reality?
Low
Fact Sheet 6
Why cover long-term ‘homes’ for data?
“Archival experience has demonstrated that the durability of the data increases and the cost of processing and preservation decreases when data deposits are timely.
It is important that data be deposited while the producers are still familiar with the dataset and able to transfer their knowledge fully to the archive.”
Framework for Creating a Data Management Plan
http://www.icpsr.umich.edu/icpsrweb/content/ICPSR/dmp/framework.html
Scare factor?
Medium
Our role?
for in a long term service
Reality?
Medium
Data must go somewhere
Data must be managed
Plans should provide the eventual location of the data
Repositories take the worry out of
Fact Sheet 6
It’s a Web …
Who, what, where, when, why?
Implementation
Outreach activity might include
At this point
Implementation
Outreach activity might include
At these points
Implementation
Outreach activity might include
At this point