1 of 23

Open Data in Biology

Tim Hubbard, @timjph

First International Open Economics Workshop

17th December 2012

2 of 23

Biology is a grand project

Build complete models of biological systems

Intelligent drug development
disease prediction
modelling individuals (e.g. virtual human, ITFoM)

Feels like there should be a parallel with economics: modelling economic systems

3 of 23

Data is organised towards this goal

Centralised repositories for raw data

one data type, one repository
mandatory submission linked to publication

Infrastructure to organise raw data for access

human genome presented to user as whole chromosomes instead of thousands of fragments

Curated databases of biological objects

supported by evidence from raw data repositories

First repositories >40 years old
1,000s of full time staff supporting infrastructure distributed worldwide
Sanger/EBI alone >30 petabytes of data

4 of 23

Data, Databases & Bioinformatics

Researchers

Data

Resources

Repositories

Global Infrastructure

Curation

Website

Data mining

APIs

Downloads

Researchers

Submission

Policies

Systems

Sustainable funding

Reuse

Discoverability

Ease of use (+access)

Scientific Record

Complete sets of components

5 of 23

Human genome race: won by public project, open access for all

6 of 23

Scale

Human genome = 6 billion letters

encodes ~30,000 genes, 4 million switches

Human individual = 100 trillion cells

more than 200 cell types

Human population = 7 billion

7 of 23

8 of 23

Scale of data

Size of current raw DNA database

400 trillion letters of DNA
3.4 trillion DNA items

Project size

UK Biobank = 500,000 individuals
Plan for NHS = sequence 100,000 in 3-5 years
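
To put these figures side by side, here is a rough back-of-envelope calculation using only the numbers quoted on this slide and the "Scale" slide. It is a sketch, not a storage estimate: it assumes each genome is stored exactly once, whereas raw sequencing data normally covers each genome many times over.

```python
# Figures quoted on the slides (2012).
DB_LETTERS = 400e12       # letters of DNA in the current raw DNA database
DB_ITEMS = 3.4e12         # DNA items in the same database
GENOME_LETTERS = 6e9      # letters in one human genome
NHS_GENOMES = 100_000     # individuals in the planned NHS project

# Assumption for illustration only: one stored copy per genome (1x coverage).
letters_per_item = DB_LETTERS / DB_ITEMS
genome_equivalents = DB_LETTERS / GENOME_LETTERS
nhs_letters = NHS_GENOMES * GENOME_LETTERS
nhs_vs_database = nhs_letters / DB_LETTERS

print(f"Average item length:        {letters_per_item:.0f} letters")
print(f"Database in human genomes:  {genome_equivalents:,.0f}")
print(f"NHS plan vs. database size: {nhs_vs_database:.1f}x")
```

Under that assumption the average item is about 118 letters long, the database holds the equivalent of about 66,667 human genomes, and the NHS plan alone would be about 1.5 times the size of the current database.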

9 of 23

Data infrastructure for biology

Enforces standards for submission

Grants contain ‘data sharing’ section

Enables rapid exploration of new questions

Assumption that if you have a new idea at 2am you can immediately access the data to test it

Structured for reproducibility

Unique identifiers
Versioned data items and datasets
Widespread use of ontologies
No licences restricting use of data allowed
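
A minimal sketch of how unique identifiers and versioning combine to support reproducibility: an analysis cites `ACCESSION.VERSION` and later retrieves exactly that version, even after the record has been updated. The class name, the `resolve` helper, the in-memory store and the `EXAMPLE0001` accession are all hypothetical; real repositories define their own identifier formats and lookup services.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedId:
    accession: str  # stable unique identifier for the data item
    version: int    # incremented whenever the item changes

    @classmethod
    def parse(cls, text: str) -> "VersionedId":
        # Split on the last dot so accessions containing dots still parse.
        accession, sep, version = text.rpartition(".")
        if not sep or not accession or not version.isdecimal():
            raise ValueError(f"expected ACCESSION.VERSION, got {text!r}")
        return cls(accession, int(version))

    def __str__(self) -> str:
        return f"{self.accession}.{self.version}"


def resolve(store: dict, reference: str) -> str:
    """Return exactly the version of the item that a publication cited."""
    return store[VersionedId.parse(reference)]


# Hypothetical store holding two versions of the same item.
store = {
    VersionedId("EXAMPLE0001", 1): "ACGT",
    VersionedId("EXAMPLE0001", 2): "ACGTT",
}

cited = resolve(store, "EXAMPLE0001.1")   # the old version is still retrievable
latest = resolve(store, "EXAMPLE0001.2")
```

Keeping old versions addressable is the design choice that matters here: a result computed against version 1 can be re-run against version 1.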

10 of 23

11 of 23

Sanger implementation of Data Sharing

Previously

Genome sequence (human consented + model organisms)

Today

Genome sequence, genotypes (some disease related), phenotype data (models), high throughput assay data (transcriptomics), WT Open Access (OA) publishing policy

Data sharing committee
Data sharing policy
Tracking of compliance

12 of 23

Data sharing issues for institutes

Tracking compliance needs to be proactive
Pragmatic assessment of value of intermediate data

WTSI approach to pre-publication data release

Raw data automatically, immediately deposited to repositories
Intermediate analysis provided via institute websites
Final analysis outputs (e.g. linked with publication) submitted to the appropriate database or repository

13 of 23

Ensuring data submission

Include potential IT costs in grant application
Submit metadata to repository
Obtain Accession number from repository
Collect Data
Submit Data to repository
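
The ordering of the steps above is the point: the accession number is obtained before any data is collected, so submission cannot be forgotten afterwards. A minimal sketch of a checklist that enforces that order follows; the `Step` and `SubmissionPlan` names are hypothetical and do not correspond to any repository's actual submission system.

```python
from enum import IntEnum


class Step(IntEnum):
    # The five steps from the slide, in the order they must happen.
    COST_IT_IN_GRANT = 1
    SUBMIT_METADATA = 2
    OBTAIN_ACCESSION = 3
    COLLECT_DATA = 4
    SUBMIT_DATA = 5


class SubmissionPlan:
    """Tracks progress and refuses steps taken out of order."""

    def __init__(self):
        self.completed = []

    def next_step(self):
        # None once every step has been completed.
        if len(self.completed) == len(Step):
            return None
        return Step(len(self.completed) + 1)

    def complete(self, step):
        expected = self.next_step()
        if step != expected:
            raise RuntimeError(f"expected {expected!r}, got {step!r}")
        self.completed.append(step)


plan = SubmissionPlan()
plan.complete(Step.COST_IT_IN_GRANT)
plan.complete(Step.SUBMIT_METADATA)
plan.complete(Step.OBTAIN_ACCESSION)
# Data collection is only unlocked once an accession number exists.
```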

14 of 23

15 of 23

New policy: sanctions for non-compliance

Current OA compliance levels “unacceptable”
New sanctions:

In the End of Grant Report, all papers listed must be OA; if not, the final payment on the grant (typically 10%) will be withheld
Non-compliant publications will be discounted as part of a researcher’s track record in any renewal of an existing grant or new grant application
Trust-funded researchers will need to ensure that all publications associated with their Wellcome-funded research are OA before any funding renewals or new grant awards will be activated

Sanctions aimed at changing behaviour

16 of 23

17 of 23

Data Sharing

Open to all

Human Genome Projects where subject consented: HapMap, 1000 Genomes
Repository: GenBank, ENA, DDBJ (INSDC)

Managed access (must be bona fide researcher)

Genetic data for disease cohorts, with phenotypes
Repository: dbGaP, EGA (encrypted distributions etc.)

Currently very limited access

Patient records
Repository: Clinical Practice Research Datalink (CPRD) (formerly RCP) (anonymised data)
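
The three tiers on this slide can be read as a simple access rule. The sketch below is illustrative only: the `special_approval` flag is a hypothetical stand-in for whatever process governs the "very limited" tier, and real repositories apply their own application and review procedures.

```python
from enum import Enum


class Tier(Enum):
    OPEN = "open to all"
    MANAGED = "managed access"
    LIMITED = "currently very limited access"


# Examples taken from the slide.
TIER_OF = {
    "1000 Genomes": Tier.OPEN,
    "disease cohort genetic data with phenotypes": Tier.MANAGED,
    "patient records": Tier.LIMITED,
}


def may_access(tier, *, bona_fide_researcher=False, special_approval=False):
    """Return True if the requester may obtain data held at this tier."""
    if tier is Tier.OPEN:
        return True
    if tier is Tier.MANAGED:
        return bona_fide_researcher
    # Hypothetical rule for the most restricted tier.
    return special_approval


anyone = may_access(TIER_OF["1000 Genomes"])
researcher = may_access(
    TIER_OF["disease cohort genetic data with phenotypes"],
    bona_fide_researcher=True,
)
records = may_access(TIER_OF["patient records"], bona_fide_researcher=True)
```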

18 of 23

Traditional: (Honest Broker)

Diagram labels: Researcher (“Run X on A & B”); “Request A & B data set”; Honest Broker holding Data set A and Data set B; Data set combination and anonymisation process; Anonymised data set; Algorithm X; Results

Honest Broker Model:

Identifiable data sets held securely by Honest Broker
Anonymised data sets generated on demand and distributed to researchers
Researcher applies custom algorithm or analysis to generate results
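
The Honest Broker model described on this slide can be sketched in a few lines. Everything here is an assumption made for illustration: the field names, the toy records, and the use of a keyed hash as a pseudonym. Replacing identifiers with a keyed hash is strictly pseudonymisation, and real anonymisation needs considerably more care about re-identification.

```python
import hashlib
import hmac

IDENTIFYING_FIELDS = {"patient_id", "name"}  # hypothetical field names


class HonestBroker:
    """Holds identifiable data; hands out combined, de-identified copies."""

    def __init__(self, datasets, secret: bytes):
        self._datasets = datasets
        self._secret = secret

    def _pseudonym(self, patient_id: str) -> str:
        # Keyed hash: stable across data sets, not reversible without the key.
        digest = hmac.new(self._secret, patient_id.encode(), hashlib.sha256)
        return digest.hexdigest()[:12]

    def request(self, names):
        """Combine the named data sets per patient and strip identifiers."""
        merged = {}
        for name in names:
            for record in self._datasets[name]:
                row = merged.setdefault(record["patient_id"], {})
                row.update(
                    (k, v) for k, v in record.items()
                    if k not in IDENTIFYING_FIELDS
                )
        return [
            {"pseudonym": self._pseudonym(pid), **row}
            for pid, row in merged.items()
        ]


def algorithm_x(rows):
    """The researcher's own analysis, run on the anonymised copy."""
    ages = [r["age"] for r in rows if r.get("diagnosis") == "T2D"]
    return sum(ages) / len(ages)


broker = HonestBroker(
    {
        "A": [
            {"patient_id": "p1", "name": "Ann", "age": 40},
            {"patient_id": "p2", "name": "Bob", "age": 60},
            {"patient_id": "p3", "name": "Cy", "age": 50},
        ],
        "B": [
            {"patient_id": "p1", "diagnosis": "T2D"},
            {"patient_id": "p2", "diagnosis": "none"},
            {"patient_id": "p3", "diagnosis": "T2D"},
        ],
    },
    secret=b"held-only-by-the-broker",
)

anonymised = broker.request(["A", "B"])  # “Request A & B data set”
result = algorithm_x(anonymised)         # “Run X on A & B”
```

Note that in this model a copy of row-level data, however de-identified, still leaves the broker.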

19 of 23

Proposed: (SVM)

Diagram labels: Researcher (“Run X on A & B”); Algorithm X; Download VM Template; “Run VM on A & B”; Honest Broker (with local cloud) holding Data set A and Data set B; Secure Virtual Machine (SVM) = Template with APIs + Algorithm X; Results: summary data only via output API (no raw data)

Secure Virtual machine (SVM):

Access to raw data by algorithm can only be via APIs contained in template VM
Export of summary data can only be via output APIs contained in template VM
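
The two SVM rules above (raw data reachable only through an input API, results leaving only through an output API) can be mimicked in plain Python to show the shape of the idea. This is a toy under stated assumptions: `SecureVM`, `input_api`, `output_api` and the `min_group_size` threshold are all hypothetical, and real isolation has to come from the virtual machine itself, since Python cannot enforce it against a hostile algorithm.

```python
from statistics import mean


class SecureVM:
    """Stand-in for the template VM run on the Honest Broker's own cloud."""

    def __init__(self, datasets, min_group_size=5):
        self._datasets = datasets
        # Hypothetical disclosure rule: no summary over fewer individuals.
        self._min_group_size = min_group_size

    def run(self, algorithm):
        released = {}

        def input_api(name):
            # Hand out copies so the algorithm cannot alter the held data.
            return [dict(record) for record in self._datasets[name]]

        def output_api(label, value, group_size):
            if not isinstance(value, (int, float)):
                raise TypeError("only numeric summaries may leave the VM")
            if group_size < self._min_group_size:
                raise PermissionError("summary covers too few individuals")
            released[label] = value

        algorithm(input_api, output_api)
        return released  # summary data only, no raw records


def algorithm_x(read, emit):
    """The researcher's algorithm, uploaded to run next to the data."""
    ages = [record["age"] for record in read("A")]
    emit("mean_age", mean(ages), len(ages))


vm = SecureVM({"A": [{"age": a} for a in (30, 40, 50, 60, 70, 80)]})
results = vm.run(algorithm_x)
```

The contrast with the Honest Broker model is that the algorithm travels to the data: the researcher receives `results` and never a row-level data set.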

20 of 23

Changing sources of biological data

Healthcare Professional

Component 4: Individual query analysis
Component 3: Additional clinical annotation
Component 2: Genotype and Phenotype relationship capture
Component 1: Human sequence data repositories
Component 5: Electronic Health Record & Personal Genome Sequence

Data from Patients (NHS)
Data from Collections (Research Institutes)

21 of 23

E-Health, Genomic Medicine & Linked Data

Healthcare Professional

Component 4: Individual query analysis
Component 3: Additional clinical annotation
Component 2: Genotype and Phenotype relationship capture
Component 1: Human sequence data repositories
Component 5: Electronic Health Record & Personal Genome Sequence

Data from Patients (NHS)
CPRD
Data from Collections (Research Institutes)

DH/NHS, DWP, HMRC, ONS, DfE, BIS, DJ, IC, ESRC, WT, MRC

22 of 23

Openness and Privacy

Can you have both?

23 of 23

Acknowledgements

Discussions with many at Sanger Institute, EBI, Wellcome Trust, NCBI, NHGRI, europePMC, Human Genome Strategy Group, Administrative Data Taskforce