1 of 23

Open Data in Biology

Tim Hubbard, @timjph

First International Open Economics Workshop

17th December 2012

2 of 23

Biology is a grand project

  • Build complete models of biological systems
    • Intelligent drug development
    • Disease prediction
    • Modelling individuals (e.g. virtual human, ITFoM)

  • Feels like this should run in parallel with economics: modelling economic systems

3 of 23

Data is organised towards this goal

  • Centralised repositories for raw data
    • one data type, one repository
    • mandatory submission linked to publication
  • Infrastructure to organise raw data for access
    • human genome presented to user as whole chromosomes instead of thousands of fragments
  • Curated databases of biological objects
    • supported by evidence from raw data repositories

  • First repositories >40 years old
  • 1,000s of full time staff supporting infrastructure distributed worldwide
  • Sanger/EBI alone >30 petabytes of data

4 of 23

Data, Databases & Bioinformatics

  • Diagram labels (in slide order): Researchers; Data; Resources; Repositories; Global Infrastructure; Curation; Website; Data mining; APIs; Downloads; Researchers; Submission; Policies; Systems; Sustainable funding; Reuse; Discoverability; Ease of use (+access); Scientific Record; Complete sets of components

5 of 23

Human genome race: won by public project, open access for all

6 of 23

Scale

  • Human genome = 6 billion letters
    • encodes ~30,000 genes, 4 million switches
  • Human individual = 100 trillion cells
    • more than 200 cell types
  • Human population = 7 billion

7 of 23

8 of 23

Scale of data

  • Size of current raw DNA database
    • 400 trillion letters of DNA
    • 3.4 trillion DNA items

  • Project size
    • UK Biobank = 500,000 individuals
    • Plan for NHS = sequence 100,000 in 3-5 years
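To put these figures side by side, here is a small back-of-envelope calculation in Python. It uses only the numbers quoted on the Scale slides. The NHS total assumes each genome is counted once at 6 billion letters; real sequencing reads each position many times, so the raw data volume would be larger.

```python
# Figures quoted on the slides.
GENOME_LETTERS = 6_000_000_000           # letters in one human genome
DATABASE_LETTERS = 400_000_000_000_000   # letters in the raw DNA database
DATABASE_ITEMS = 3_400_000_000_000       # items in the raw DNA database
NHS_INDIVIDUALS = 100_000                # planned NHS sequencing

# Average length of one item in the raw DNA database.
average_item_length = DATABASE_LETTERS / DATABASE_ITEMS

# How many whole human genomes the database would correspond to.
database_genome_equivalents = DATABASE_LETTERS / GENOME_LETTERS

# Letters needed to hold the NHS cohort once each (assumption: no repeated reads).
nhs_letters = NHS_INDIVIDUALS * GENOME_LETTERS

print(f"average item length: {average_item_length:.0f} letters")
print(f"database size: {database_genome_equivalents:,.0f} genome equivalents")
print(f"NHS cohort: {nhs_letters:,} letters")
```

On these assumptions the planned NHS cohort alone is larger than the current raw DNA database.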

9 of 23

Data infrastructure for biology

  • Enforces standards for submission
    • Grants contain ‘data sharing’ section
  • Enables rapid exploration of new questions
    • Assumption that if you have a new idea at 2am you can immediately access the data to test it
  • Structured for reproducibility
    • Unique identifiers
    • Versioned data items and datasets
    • Widespread use of ontologies
    • No licenses restricting use of data allowed
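A minimal sketch of what "unique identifiers" plus "versioned data items" can mean in practice. The `VersionedStore` class, the `accession.version` string format, and the `DEMO0001` identifier are illustrative assumptions, not the interface of any real repository.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VersionedStore:
    """Hypothetical in-memory store: every update gets a new version number."""
    _records: Dict[str, List[str]] = field(default_factory=dict)

    def submit(self, accession: str, payload: str) -> str:
        # Old versions are never overwritten, only appended to.
        versions = self._records.setdefault(accession, [])
        versions.append(payload)
        return f"{accession}.{len(versions)}"

    def fetch(self, versioned_id: str) -> str:
        # Split "ACCESSION.VERSION" on the last dot.
        accession, _, version = versioned_id.rpartition(".")
        return self._records[accession][int(version) - 1]


store = VersionedStore()
first = store.submit("DEMO0001", "ACGT")
second = store.submit("DEMO0001", "ACGA")
print(first, store.fetch(first))    # the original record is still retrievable
print(second, store.fetch(second))
```

Because an analysis can cite the exact versioned identifier it used, someone rerunning it later gets the same input even after the record has been updated.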

10 of 23

11 of 23

Sanger implementation of Data Sharing

  • Previously
    • Genome sequence (human consented + model organisms)
  • Today
    • Genome sequence, genotypes (some disease related), phenotype data (models), high throughput assay data (transcriptomics), WT Open Access (OA) publishing policy

  • Data sharing committee
  • Data sharing policy
  • Tracking of compliance

12 of 23

Data sharing issues for institutes

  • Tracking compliance needs to be proactive
  • Pragmatic assessment of value of intermediate data

  • WTSI approach to pre-publication data release
    • Raw data automatically, immediately deposited to repositories
    • Intermediate analysis provided via institute websites
    • Final analysis outputs (e.g. linked with publication) submitted to the appropriate database or repository

13 of 23

Ensuring data submission

  • Include potential IT costs in grant application
  • Submit metadata to repository
  • Obtain Accession number from repository
  • Collect Data
  • Submit Data to repository
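Reading the bullets above as an ordered sequence (an assumption; the slide does not say so explicitly), the workflow can be sketched as a tracker that refuses out-of-order steps. The `SubmissionTracker` class is hypothetical and only illustrates the idea that the accession number is obtained before any data is collected.

```python
from typing import List, Optional

# Steps as listed on the slide, assumed to be in order.
STEPS: List[str] = [
    "include IT costs in grant application",
    "submit metadata to repository",
    "obtain accession number from repository",
    "collect data",
    "submit data to repository",
]


class SubmissionTracker:
    """Hypothetical helper that enforces the step order for one project."""

    def __init__(self) -> None:
        self.completed = []

    @property
    def next_step(self) -> Optional[str]:
        # None once every step has been completed.
        if len(self.completed) < len(STEPS):
            return STEPS[len(self.completed)]
        return None

    def complete(self, step: str) -> None:
        expected = self.next_step
        if expected is None:
            raise ValueError("all steps are already complete")
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.completed.append(step)


tracker = SubmissionTracker()
tracker.complete("include IT costs in grant application")
print(tracker.next_step)
```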

14 of 23

15 of 23

New policy: sanctions for non-compliance

  • Current OA compliance levels “unacceptable”
  • New sanctions:
    1. In the End of Grant Report, all papers listed must be OA. If not, the final payment on the grant (typically 10%) will be withheld
    2. Non-compliant publications will be discounted as part of a researcher’s track record in any renewal of an existing grant or new grant application
    3. Trust-funded researchers will need to ensure that all publications associated with their Wellcome-funded research are OA before any funding renewals or new grant awards will be activated
  • Sanctions aimed at changing behaviour

16 of 23

17 of 23

Data Sharing

  • Open to all
    • Human Genome Projects where subject consented: HapMap, 1000 Genomes
    • Repository: GenBank, ENA, DDBJ (INSDC)
  • Managed access (must be bona fide researcher)
    • Genetic data for disease cohorts, with phenotypes
    • Repository: dbGaP, EGA (Encrypted distributions etc.)
  • Currently very limited access
    • Patient records
    • Repository: Clinical Practice Research Datalink (CPRD) (formerly RCP) (anonymised data)

18 of 23

Traditional: (Honest Broker)

  • Diagram labels (in slide order): Data set A; Researcher; “Run X on A & B”; Data set B; Results; “Request A & B data set”; Algorithm X; Data set combination and anonymisation process

Honest Broker Model:

  • Identifiable data sets held securely by Honest Broker
  • Anonymised data sets generated on demand and distributed to researchers
  • Researcher applies custom algorithm or analysis to generate results

Honest Broker

Anonymised data set
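The Honest Broker model on this slide can be sketched in a few lines of Python. Everything here is an illustrative assumption: the `broker_release` function, the `person_id` field, and the idea that replacing the identifier with a random pseudonym counts as anonymisation (in practice the remaining fields can still re-identify people, so real brokers do much more).

```python
import secrets
from typing import Dict, List


def broker_release(dataset_a: List[Dict], dataset_b: List[Dict],
                   id_field: str = "person_id") -> List[Dict]:
    """Combine two identifiable data sets and return an anonymised copy.

    Runs inside the Honest Broker; only the return value leaves it.
    """
    b_by_id = {row[id_field]: row for row in dataset_b}
    released = []
    for row_a in dataset_a:
        row_b = b_by_id.get(row_a[id_field])
        if row_b is None:
            continue  # keep only individuals present in both data sets
        merged = {**row_a, **row_b}
        del merged[id_field]                         # drop the direct identifier
        merged["pseudonym"] = secrets.token_hex(8)   # fresh random label per release
        released.append(merged)
    return released


genetic = [{"person_id": 1, "genotype": "AA"}, {"person_id": 2, "genotype": "AG"}]
clinical = [{"person_id": 1, "diagnosis": "example"}]
for row in broker_release(genetic, clinical):
    print(row)
```

The researcher then runs Algorithm X on the released rows, which is exactly where the model is weak: once distributed, the anonymised data set is out of the broker's control.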

19 of 23

Proposed: (SVM)

  • Diagram labels (in slide order): Data set A; Researcher; “Run X on A & B”; Data set B; Results; “Run VM on A & B”; Summary data only via output API (no raw data); Algorithm X

Secure Virtual machine (SVM):

  • Access to raw data by algorithm can only be via APIs contained in template VM
  • Export of summary data can only be via output APIs contained in template VM

  • Diagram labels: Honest Broker (with local cloud); Download VM Template
  • Secure Virtual Machine (SVM) as downloaded: Template with three APIs
  • Secure Virtual Machine (SVM) as run: Algorithm X with three APIs
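A rough sketch of the SVM idea, assuming hypothetical `InputAPI` and `OutputAPI` classes stand in for the APIs in the template VM. Python cannot enforce the isolation itself; in the proposal that is the job of the virtual machine boundary. The minimum group size is also an assumption, added to show the kind of check an output API might apply.

```python
from statistics import mean
from typing import Callable, Dict, List


class InputAPI:
    """The only route by which the algorithm can read raw data."""

    def __init__(self, rows: List[Dict]) -> None:
        self._rows = rows

    def values(self, column: str) -> List:
        return [row[column] for row in self._rows]


class OutputAPI:
    """The only route by which results can leave the SVM."""
    MIN_GROUP_SIZE = 5  # hypothetical disclosure-control threshold

    def __init__(self) -> None:
        self.exported: Dict[str, float] = {}

    def export_summary(self, name: str, value: float, n: int) -> None:
        if not isinstance(value, (int, float)):
            raise TypeError("only scalar summaries may be exported")
        if n < self.MIN_GROUP_SIZE:
            raise ValueError("summary covers too few records")
        self.exported[name] = value


def run_in_svm(algorithm: Callable[[InputAPI, OutputAPI], None],
               rows: List[Dict]) -> Dict[str, float]:
    """Run the researcher's algorithm next to the data; return summaries only."""
    out = OutputAPI()
    algorithm(InputAPI(rows), out)
    return out.exported


def algorithm_x(data: InputAPI, out: OutputAPI) -> None:
    ages = data.values("age")
    out.export_summary("mean_age", mean(ages), n=len(ages))


records = [{"age": a} for a in (30, 40, 50, 60, 70)]
print(run_in_svm(algorithm_x, records))
```

The contrast with the Honest Broker model is that the algorithm travels to the data, and the raw rows never reach the researcher.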

20 of 23

Changing sources of biological data

  • Healthcare Professional
  • Component 4: Individual query analysis
  • Component 3: Additional clinical annotation
  • Component 2: Genotype and Phenotype relationship capture
  • Component 1: Human sequence data repositories
  • Component 5: Electronic Health Record & Personal Genome Sequence
  • Data from Patients (NHS)
  • Data from Collections (Research Institutes)

21 of 23

E-Health, Genomic Medicine & Linked Data

  • Healthcare Professional
  • Component 4: Individual query analysis
  • Component 3: Additional clinical annotation
  • Component 2: Genotype and Phenotype relationship capture
  • Component 1: Human sequence data repositories
  • Component 5: Electronic Health Record & Personal Genome Sequence
  • Data from Patients (NHS)
  • CPRD
  • Data from Collections (Research Institutes)
  • DH/NHS, DWP, HMRC, ONS, DfE, BIS, DJ, IC, ESRC, WT, MRC

22 of 23

Openness vs Privacy

  • Can you have both?

23 of 23

Acknowledgements

Discussions with many at Sanger Institute, EBI, Wellcome Trust, NCBI, NHGRI, Europe PMC, Human Genome Strategy Group, Administrative Data Taskforce