1 of 40

Research data management: Who, what, where, when, why … and how?

Developed by Belinda Weaver, The University of Queensland Library

2 of 40

Today’s session

  • Why are we here?
  • Background to what we are doing
  • Data management plans
    • Components and why each is important
    • Data descriptions – dual purpose
  • Actions from today – how to operationalise this work
  • Feedback

3 of 40

A new era in research

A thousand years ago:

  • Science was empirical, describing natural phenomena

Last few hundred years:

  • A theoretical branch developed, using models and generalizations

Last few decades:

  • A computational branch developed, simulating complex phenomena

Today: data exploration (eScience)

  • Unifies theory, experiment, and simulation
    • Data captured by instruments or generated by simulator
    • Processed by software
    • Information/knowledge stored in computer
    • Scientist analyses database / files using data management and statistics

4 of 40

The data deluge is REAL

  • In one day, a high-throughput DNA-sequencing machine can read about 26 billion characters of the human genetic code. That translates into 9 terabytes – or 9 trillion data units – in one year; alongside it is a wealth of related information that can be 20 times more voluminous. The total data flow: more than 20 new US Libraries of Congress each and every year. That is from one specialised instrument, in one scientific sub-discipline; enlarge that picture across all of science, across the world, and you start to see the dimension of the opportunity and challenge presented.
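The figures above can be checked with a quick back-of-envelope calculation. This is only a sketch: it assumes one byte per character and decimal terabytes, neither of which is stated in the quotation, and the variable names are illustrative.

```python
# Back-of-envelope check of the sequencing figures quoted above.
CHARS_PER_DAY = 26e9   # "about 26 billion characters" per day (from the quotation)
BYTES_PER_CHAR = 1     # assumption: one byte per character
DAYS_PER_YEAR = 365

bytes_per_year = CHARS_PER_DAY * BYTES_PER_CHAR * DAYS_PER_YEAR
terabytes_per_year = bytes_per_year / 1e12  # decimal terabytes (assumption)

# The related information "can be 20 times more voluminous".
related_terabytes = terabytes_per_year * 20

print(f"Sequence data: about {terabytes_per_year:.1f} TB per year")
print(f"Related data:  up to about {related_terabytes:.0f} TB per year")
```

Under these assumptions the result is roughly 9.5 TB per year, in line with the "9 terabytes" quoted, plus up to about 190 TB of related information.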

5 of 40

And it’s a continuum …

We are here now

We need to be here within a year

This is where we eventually want to be

We will only go here when we are ready

We have scoped what is needed

Gauge

Refer

Advise

Support

Partner

6 of 40

Australian Research Funding

Department of Innovation, Industry, Science and Research (DIISR)

National Collaborative Research Infrastructure Strategy (NCRIS)

A $542m, 7-year project

Provides and supports major research infrastructure

Research Data Storage Infrastructure Project (RDSI)

Platforms for Collaboration (PfC)

National Computational Infrastructure (NCI)

Australian National Data Service (ANDS)

Australian Research Collaboration Service (ARCS)

National eResearch Collaboration Tools and Resources Project (NeCTAR)

Super Science

$1.1 billion to priority areas of Australian research, $90m for research infrastructure

Priority Areas

Space and Astronomy $160.5m

Marine and Climate $387.7m

Future Industries $504.0m


Education Investment Fund (EIF)


7 of 40

The game changers in research

  • Scope and scale – projects can go wider and bigger
  • Resources – vast numbers of datasets available, easier to find and share data of all kinds
  • New digital tools – data mining, visualisation, sharing, collaborating
  • Crowdsourcing – the growth of ‘citizen’ science
  • Expectations – that data will be made open

8 of 40

State of play in the humanities

  • use and create a wide range of information – print, manuscript and digital
  • publish by traditional means
  • more formal, systematic collaborations beginning
  • little take-up of tools for text-mining, cloud computing, or the semantic web
  • limited uptake of freely available tools for data management and data sharing
  • adoption and take up of new technologies and services hampered by lack of awareness and lack of support

http://www.rin.ac.uk/our-work/using-and-accessing-information-resources/information-use-case-studies-humanities

9 of 40

State of play in the life sciences

  • most research groups operate on a relatively small scale
  • publish by traditional means
  • overlapping collaborations
  • strong desire for information support, closely integrated with research teams and laboratories
  • reluctance to adopt new tools and services – hampered by lack of awareness and lack of training and support
  • emerging specialist roles required – bioinformaticians, statisticians, modellers and curators – these need to be developed and rewarded

http://www.rin.ac.uk/our-work/using-and-accessing-information-resources/patterns-information-use-and-exchange-case-studie

10 of 40

So … what is the eResearch landscape like?

11 of 40

The Bentham Project has recently begun an initiative to transcribe the manuscripts of Jeremy Bentham (1748–1832), the utilitarian philosopher, jurist, economist, political theorist, and social reformer.

  • A “crowdsourced” project
  • Collection: 60,000 folios, arranged into 174 boxes.
  • Overall: 2,519 of the 5,262 manuscripts uploaded to the website have been transcribed (between 620,000 and 1.8 million words).

Crowdsourcing

12 of 40

Civil War Diaries Transcription Project

Exhibit: http://digital.lib.uiowa.edu/cwd/index.php

Project: http://digital.lib.uiowa.edu/cwd/transcripts.html

  • Project to transcribe diaries and letters
  • Improves access and searchability
  • Uses crowdsourcing to do the work
  • Creates new resources for scholars
  • Experts, historians and family members can contribute, thus enriching the data

Crowdsourcing

13 of 40

Galaxy Zoo

  • Project to classify galaxies by shape
  • Images taken by Hubble telescope
  • Anyone can take part
  • More than 250,000 people have contributed data so far
  • Data generated has fostered new research into what has been discovered

Crowdsourcing

One of a number of projects in the Zooniverse

http://www.zooniverse.org/

14 of 40

Model Earth’s climate using wartime ship logs

Helps scientists recover weather observations made by Royal Navy ships around the time of World War I.

Transcriptions will contribute to climate model projections and improve a database of weather extremes.

Historians use the work to track past ship movements and research the stories of people on board.

One of a number of projects in the Zooniverse

Crowdsourcing

https://www.zooniverse.org/project/oldweather

15 of 40

Ancient Lives

The data gathered by ANCIENT LIVES helps scholars study the Oxyrhynchus collection.

Transcriptions collected digitally are combined with human and computer logic to identify known texts and documents.

One of a number of projects in the Zooniverse

https://www.zooniverse.org/project/ancientlives

The citizens of this town, five days journey by road south of Memphis, called it Oxyrhynchus, or Oxyrhynchon polis … City of the Sharp-Nosed Fish

Crowdsourcing

16 of 40

“By February 2011 we had 20,000 plus people helping out and 30 million lines of text had been corrected during the last 2 years.”

Crowd-sourcing tips from Rose Holley of the NLA

Currently 5 million pages, and counting …

Find them in Trove

http://trove.nla.gov.au/newspaper

Crowdsourcing

The Project: http://www.nla.gov.au/ndp/

17 of 40

MONK Project

Data Mining

MONK is a digital environment designed to help humanities scholars discover and analyse patterns in the texts they study. It includes

  • 525 works of American literature (18th-19th centuries)
  • 37 plays, 5 works of poetry by William Shakespeare

MONK provides texts and tools to enable literary research through the discovery, exploration, and visualization of patterns.

Each toolset is made up of individual tools (e.g. a search tool, a browsing tool, a rating tool, and a visualization).

Comparative novelty map

in three novels

http://datamining.typepad.com/

http://www.monkproject.org/

18 of 40

Old Bailey Online

A fully searchable edition of the largest body of texts detailing the lives of non-elite people ever published, containing 197,745 criminal trials held at London's central criminal court.

The Proceedings of the Old Bailey contain 120 million words, recording 197,000 trials held at the Old Bailey, or Central Criminal Court in London, between 1674 and 1913.

Digitisation

http://www.oldbaileyonline.org/

Data Mining

19 of 40

The Digital Republic of Letters – collaborative projects

Cultures of Knowledge: An Intellectual Geography of the Seventeenth-Century Republic of Letters

A collaborative, interdisciplinary research project reconstructing correspondence networks central to the revolutionary intellectual developments of the early modern period.

Mapping the Republic of Letters

A Stanford University project on Correspondence and Intellectual Community in the Early Modern Period (1500-1800). Based around case studies of individuals such as Benjamin Franklin, Voltaire & John Locke.

https://republicofletters.stanford.edu/

http://www.history.ox.ac.uk/cofk/

Circulation of Knowledge and learned practices in the 17th century Dutch Republic

Until the publication of the first scientific journals in the 1660s, letters were the most important means of communication between intellectuals. The project has created a machine-readable and growing corpus of approximately 20,000 letters between 17th century scholars.

http://ckcc.huygens.knaw.nl/

Visualisation

20 of 40

The game changers in research

  • Scope and scale – projects can go wider and bigger
  • Resources – vast numbers of datasets available, easier to find and share data of all kinds
  • New digital tools – data mining, visualisation, sharing, collaborating
  • Crowdsourcing – the growth of ‘citizen’ science
  • Expectations – that data will be made open
  • The Code

21 of 40

The Australian Code for the Responsible Conduct of Research is …

  • An ARC and NHMRC funding requirement
  • Universities Australia helped draft it and expect researchers to comply

Soon a UQ requirement as well: the draft policy, to be considered by the UQ Research Committee and other key stakeholders, will address the Code, especially:

    • Management of research data and primary materials (section 2)
    • Publication and dissemination of research findings (section 4)

http://www.nhmrc.gov.au/guidelines/publications/r39

22 of 40

What’s in the Code?

  • The Code says a strong research culture will demonstrate:

    • honesty and integrity
    • respect for human research participants, animals and the environment
    • appropriate acknowledgment of the role of others in research
    • responsible communication of research results
    • good stewardship of public resources used to conduct research

http://www.nhmrc.gov.au/guidelines/publications/r39

23 of 40

A Data Management Plan …

… is a document that describes how you will collect, organise, manage, store, secure, back up, preserve, and share your data.

24 of 40

What needs to be included in a plan?

  • Data description
  • Metadata
  • Ownership, copyright and IP
  • Ethics
  • Formats
  • Sharing data and collaborating
    • During research
    • After it’s finished
  • Security and storage
  • Retention and destruction
  • Long term ‘home’ for the data

What are the obstacles to good planning?

Fact Sheet 3

25 of 40

Why include a data description?

The Code says:

“Policies are required that address the ownership of research materials and data, their storage, their retention beyond the end of the project, and appropriate access to them by the research community.”

Our role?

  • Help them write data descriptions for plans
  • Help them find keywords and FoR codes
  • Help them add descriptions to Research Data Australia to make their data discoverable

Scare factor?

Low

Reality?

Low

Researchers need to know

  • What the data is
  • Who is allowed to use it
  • When they can use it
  • How they can use it
  • What they might use it for
  • Where they can use it
  • How long it will be available

Fact Sheets 12 and 15

26 of 40

  • Example 1: Great Barrier Reef coral bleaching data derived from satellite imagery
    • Coral bleaching dataset for various locations in the Great Barrier Reef, Australia for March 2002 and March 2006. The dataset describes percentage of bleaching and level of bleaching for locations identified by name and geospatial coordinates. The dataset is derived from analysis of NASA MODIS (Moderate Resolution Imaging Spectroradiometer) ocean colour and sea surface temperature data.
  • Example 2: Variables derived from 2006 Census of Population & Housing and voting results at polling booth catchment level for 2007 Australian Federal Election 
    • Dataset derived from the 2006 Australian Census of Population and Housing, voting results for the 2007 Australian Federal election, and location of polling booths from the Australian Electoral Commission. Voting results data includes total and proportion of votes cast for each political party for all electoral divisions and polling booths in Australia.
  • Example 3: Australian Pulp Fiction Collection
    • The Australian Pulp Fiction Collection contains biographical and bibliographic information, digitised cover art, and artist information for more than 5,000 Australian pulp fiction items published between 1939 and 1959 by more than 100 authors. The collection has a specific focus on the internationally successful authors Alan Yates (who wrote as 'Carter Brown').

27 of 40

Repurposing data descriptions – from plans to RDA

Research Data Australia links

RDA collections do not have to be

    • Digital
    • Accessible, open or shared

Why bother?

    • Stops research being duplicated
    • Publicises existence of data and attracts collaborators

Fact Sheet 15

28 of 40

Why cover metadata?

  • Metadata is essential for effective data use and re-use.
  • Metadata describes
    • What the data is
    • Who can use it
    • When it can be used
    • How it can be used
    • What it might be used for
    • Where it can be found
    • And how long it will be available

Our role?

  • Help them develop metadata
  • Help choose the right method, standard or schema
  • Provide tools and templates

Scare factor?

High

Reality?

High

There are 3 main types of metadata

Fact Sheet 12

29 of 40

Metadata – 3 types

Descriptive metadata

  • Enables a dataset to be discovered and identified, e.g. a project title, a description of the research scope, an abstract, and relevant keywords. In many disciplines, controlled vocabularies and nomenclature exist.

Administrative metadata

  • Helps manage the dataset. Includes rights management, access control, use requirements, technical data on file creation and quality control, file formats, software/hardware for access and use, information relevant to archiving and preservation.

Structural metadata

  • Describes how items relate to one another, e.g. that file x is the JPEG format of the archival TIFF image file z.

How to provide metadata:

In an accompanying document, an XML file, a README file, via repository metadata
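As one illustration of an accompanying metadata file, the three types above could be recorded in a small structured document. This is a minimal sketch only: the field names, the file names `x.jpg` and `z.tif`, and the `missing_sections` helper are hypothetical and do not follow any particular metadata standard or schema. The descriptive values paraphrase Example 1 from the data description slide.

```python
import json

# Hypothetical field names; not taken from any metadata standard.
metadata = {
    "descriptive": {
        "title": "Great Barrier Reef coral bleaching data derived from satellite imagery",
        "description": (
            "Percentage and level of coral bleaching for named locations, "
            "March 2002 and March 2006."
        ),
        "keywords": ["coral bleaching", "Great Barrier Reef", "MODIS"],
    },
    "administrative": {
        # Placeholders: the researcher supplies these values.
        "rights": "TO BE CONFIRMED",
        "access": "TO BE CONFIRMED",
        "file_formats": ["TO BE CONFIRMED"],
    },
    "structural": {
        # Mirrors the slide's example: file x is the JPEG of archival TIFF z.
        "relations": [
            {"file": "x.jpg", "relation": "JPEG version of", "target": "z.tif"},
        ],
    },
}


def missing_sections(record):
    """Return the names of any of the three metadata types left empty or absent."""
    required = ("descriptive", "administrative", "structural")
    return [name for name in required if not record.get(name)]


# Serialise the record; the text could be saved alongside the data as a README.
readme_text = json.dumps(metadata, indent=2)
print(readme_text)
print("Missing sections:", missing_sections(metadata))
```

A record like this is only a starting point; where a discipline has an established standard or schema, that should take precedence.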

Fact Sheet 12

30 of 40

Why cover ownership, copyright and IP?

  • IP questions?
    • Refer them to UQ’s IP policy
  • Copyright?
    • Refer them to UQ’s IP policy, online copyright help pages and UQ’s Copyright Lawyer (Tom)
  • Ownership?
    • They have to work that out themselves from the IP policy and the funding requirements – they may need to consult Research Legal Services (in R&ID)

Our role?

✓ Refer to the appropriate source of advice

Scare factor?

High

Reality?

Low

These issues strongly affect use and potential re-use.

Key advice

  • UQ IP Policy (in PPL)
  • UQ DM policy and procedures (draft)

The Code says:

“Policies are required that address the ownership of research materials ….”

Fact Sheet 10

31 of 40

Why cover issues such as ethics?

Our role?

  • Refer all enquiries to the Research and Innovation Division, who have committees on human and animal ethics with strict procedures to follow

Scare factor?

Medium

Reality?

Low

The Code says:

“Protection of human subjects is a fundamental tenet of research and an important ethical obligation for everyone involved in research projects. Disclosure of identities when privacy has been promised could result in lower participation rates and a negative impact on science.”

Researchers need to record that they have applied for the requisite ethical clearances for research with human or animal subjects.

This will involve documenting the decisions of ethical committees or recording application numbers.

Fact Sheet 9

32 of 40

Why cover data sharing?

The Code says:

“Policies are required that address the ownership of research materials and data …and appropriate access to them by the research community.”

Sharing may be hampered by

  • Commercial considerations
  • Contracts and licensing
  • Privacy/confidentiality issues
  • Embargo periods

But sharing is encouraged by

  • Journals
  • Grant-making bodies
  • Government (open data)
  • Open Data initiatives

Our role?

  • Raise awareness about data sharing.
  • Advise them of the issues involved.

✓ Refer them to tools and services.

Scare factor?

Medium

Reality?

Medium

Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. doi: 10.1371/journal.pone.0000308

– provides evidence that data sharing boosts citations and increases research visibility

Fact Sheet 7

33 of 40

Why cover data formats?

Researchers should be encouraged to choose and use common, robust, well-documented formats that will last

  • at least as long as the research does
  • plus any statutory and legislative retention period.

Our role?

  • Inform and refer.
  • Encourage them to document their decisions.

✓ Point them to checklists.

Scare factor?

High

Reality?

Medium

The Code says:

“Policies are required that address … data … storage … [and] retention beyond the end of the project ...”

Fact Sheet 16

34 of 40

Why cover security and storage?

Scare factor?

Medium

Our role?

✓ Provide storage pros and cons, e.g. checklists

✓ Monitor services that offer storage

✓ Monitor services such as HPC

  • Monitor projects such as RDSI, NeCTAR
  • Inform and refer

Reality?

Medium

They need three separate copies to be safe.

The Code specifically covers

Management of research data and primary materials (section 2)

Researchers must ensure that all research data, regardless of format, is stored securely and backed up or copied regularly.

The plan records these arrangements.
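A simple way to confirm that copies exist and still match is to compare checksums. The sketch below is illustrative only: the function names and the default of three copies (taken from the advice above) are assumptions, and the script cannot tell whether the copies really sit on separate devices or at separate sites.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_copies(paths, required=3):
    """Report how many copies exist and whether their contents are identical."""
    existing = [Path(p) for p in paths if Path(p).is_file()]
    checksums = {str(p): sha256_of(p) for p in existing}
    return {
        "copies_found": len(existing),
        "enough_copies": len(existing) >= required,
        "all_identical": len(set(checksums.values())) == 1,
        "checksums": checksums,
    }
```

Calling `check_copies` with the paths of the working copy and its backups returns a small report that could be noted in the plan each time the check is run.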

Fact Sheets 5 and 17

35 of 40

Why cover retention periods?

The Code says:

“The researcher must decide which data and materials should be retained, although in some cases this is determined by law, funding agency, publisher or by convention in the discipline.

The central aim is that sufficient materials and data are retained to justify the outcomes of the research and to defend them if they are challenged.

The potential value of the material for further research should also be considered, particularly where the research would be difficult or impossible to repeat.”

Scare factor?

Low

Our role?

  • Refer them to the Code
  • Refer them to UQ policies
  • Refer them to disciplinary standards

Reality?

Low

  • The plan records their compliance with
    • established practice in the discipline
    • local and national policies
    • record-keeping legislation
  • It explains any deviations from the norm

Fact Sheet 6

36 of 40

Why cover long term ‘homes’ for data?

“Archival experience has demonstrated that the durability of the data increases and the cost of processing and preservation decreases when data deposits are timely.

It is important that data be deposited while the producers are still familiar with the dataset and able to transfer their knowledge fully to the archive.”

Framework for Creating a Data Management Plan

http://www.icpsr.umich.edu/icpsrweb/content/ICPSR/dmp/framework.html

Scare factor?

Medium

Our role?

  • Provide checklists on what they need to look for in a long term service

  • Find and monitor services in discipline areas
  • Advise on options

Reality?

Medium

Data must go somewhere

Data must be managed

Plans should provide the eventual location of the data

Repositories take the worry out of

  • discoverability of data
  • long term storage
  • keeping data accessible
  • mediating access over time

Fact Sheet 6

37 of 40

It’s a Web …

  • Grants
  • Ethical clearances

Who, what, where, when, why?

  • Assess storage needs
  • HPC services
  • Networked storage
  • Formats
  • Tools and services
  • Contracts
  • Agreements
  • MOUs
  • Internal
  • External
  • Data deposit
  • Data sharing
  • Established service culture
  • Data descriptions
  • Metadata
  • Plans, training and support
  • Copyright lawyer
  • Commercialisation
  • IP and ownership
  • Patents and licensing
  • Capacity building
  • Skills development
  • eResearch advice
  • Record-keeping
  • Legislative responsibilities
  • Privacy advice
  • Right to Information law

38 of 40

Implementation

Outreach activity might include

  • Raising awareness about issues and services
  • Surveying RHD students about data management practices
  • Identifying data collections for Research Data Australia
  • Conducting data interviews for Research Data Australia
  • Developing and providing basic checklists and templates

At this point

39 of 40

Implementation

Outreach activity might include

  • Offering grant recipients help with data scoping
  • Helping write plans and documentation
  • Partnering/adding value to funding applications
  • Data rescue advice
  • Specific skills training, e.g. data mining, visualisation
  • Assistance with specific tools or services
  • ‘Embedding’ staff within research teams

At these points

40 of 40

Implementation

Outreach activity might include

  • Digitisation projects, e.g. photographs, maps, rare books
  • Transcription projects, e.g. manuscripts, diaries, letters
  • Developing methodologies for ‘citizen’ science projects
  • Partnering on eResearch bid proposals
  • ‘Embedding’ staff within research teams
  • Managing crowd-sourced projects

At this point