1 of 159

Upcycling genomics data:

From publicly available "junk" to priceless "treasure"

Shannon E. Ellis

2 of 159

What makes primary cancer different than metastatic cancer?

www.hopkinsmedicine.org

3 of 159

Find a researcher with access to patient samples

What makes primary cancer different than metastatic cancer?

4 of 159

Find a researcher with access to patient samples

What makes primary cancer different than metastatic cancer?

Collect patient samples and information

3~6 months

5 of 159

Find a researcher with access to patient samples

What makes primary cancer different than metastatic cancer?

Collect patient samples and information

Extract DNA/RNA from samples

Sequence samples

3~6 months

1-2 wks

2-4 wks

6 of 159

Find a researcher with access to patient samples

What makes primary cancer different than metastatic cancer?

Collect patient samples and information

Extract DNA/RNA from samples

Sequence samples

Process sequencing data

3-6 months

1-3 months

1 month - 1+ years

3~6 months

1-2 wks

2-4 wks

Analyze data and answer biological question

Data cleaning

7 of 159

Find a researcher with access to patient samples

What makes primary cancer different than metastatic cancer?

Collect patient samples and information

Extract DNA/RNA from samples

Sequence samples

Process sequencing data

3-6 months

1-3 months

1 month - 1+ years

3~6 months

1-2 wks

2-4 wks

Total: 2+ years

Analyze data and answer biological question

Data cleaning

8 of 159

Biologists have recently gotten pretty good at making their data available to the public.

9 of 159

Find a researcher with access to patient samples

What makes primary cancer different than metastatic cancer?

Collect patient samples and information

Extract DNA/RNA from samples

Sequence samples

Process sequencing data

Data cleaning

3-6 months

1-3 months

1 month - 1+ years

3~6 months

1-2 wks

2-4 wks

Total: 1+ years

Analyze data and answer biological question

10 of 159

Biologists have recently gotten pretty good at making their data available to the public.

...but they’re not great at making these data easily accessible and well-annotated.

11 of 159

Find a researcher with access to patient samples

What makes primary cancer different than metastatic cancer?

Collect patient samples and information

Extract DNA/RNA from samples

Sequence samples

Process sequencing data

Data cleaning

3-6 months

1-3 months

1 month - 1+ years

3~6 months

1-2 wks

2-4 wks

Total: 1+ years

Analyze data and answer biological question

12 of 159

Project Goal: Take publicly available RNA-Seq data and make it available and easy-to-use

Easy to Get

Comprehensive

Useful for future study

13 of 159

Genetics101

14 of 159

The Central Dogma of Genetics

ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT

DNA

slide adapted from jeff leek

TGACTGGATCTAGTCAGCTAGCTAGCATATGCTAATGTTTTAGTAGCCGTA

15 of 159

The Central Dogma of Genetics

ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT

DNA

slide adapted from jeff leek

16 of 159

The Central Dogma of Genetics

ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT

DNA

slide adapted from alyssa frazee

slide adapted from jeff leek

gene

17 of 159

The Central Dogma of Genetics

ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT

DNA

slide adapted from alyssa frazee

slide adapted from jeff leek

gene

exons

18 of 159

The Central Dogma of Genetics

AUCAGUCGAUCACCGAU

transcription

ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT

DNA

slide adapted from alyssa frazee

slide adapted from jeff leek

RNA

19 of 159

The Central Dogma of Genetics

AUCAGUCGAUCACCGAU

transcription

translation

ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT

DNA

slide adapted from alyssa frazee

slide adapted from jeff leek

RNA

proteins

20 of 159

Two copies of DNA -> many transcripts -> many proteins

blueprint

role

in the cell

# copies/cell

2

RNA

proteins

DNA

functional

unit

gene

# unique functional

units

20,000

21 of 159

Two copies of DNA -> many transcripts -> many proteins

blueprint

carry out cellular functions

2

varies

~10^10

RNA

proteins

DNA

gene

proteins

(metabolites, hormones, etc.)

20,000

~100,000

role

in the cell

# copies/cell

functional

unit

# unique functional

units

22 of 159

Two copies of DNA -> many transcripts -> many proteins

                           DNA        RNA                 proteins
role in the cell           blueprint  messenger           carry out cellular functions
# copies/cell              2          varies (~360,000)   varies (~10^10)
functional unit            gene       transcript          proteins (metabolites, hormones, etc.)
# unique functional units  20,000     ~100,000            ~100,000

23 of 159

Variability at the level of RNA allows for a heart cell to function differently than a brain cell

!=

24 of 159

Measuring RNA levels

slide adapted from jeff leek

25 of 159

Next Generation Sequencing (NGS) Has Completely Revolutionized How We Study Genetics

26 of 159

Next Generation Sequencing (NGS) in one slide

RNA

Step 1: Extract RNA to get sample of interest

27 of 159

Next Generation Sequencing (NGS) in one slide

RNA

Step 1: Extract RNA to get sample of interest

Step 2: Chop up RNA into smaller pieces

28 of 159

Next Generation Sequencing (NGS) in one slide

RNA

Step 1: Extract RNA to get sample of interest

Step 3: Sequence the sample

Step 2: Chop up RNA into smaller pieces

29 of 159

Next Generation Sequencing (NGS) in one slide

RNA

Step 1: Extract RNA to get sample of interest

Step 3: Sequence the sample

Step 4: Obtain short read data from the sequencer

Step 2: Chop up RNA into smaller pieces

30 of 159

Next Generation Sequencing (NGS) in one slide

RNA

AUCAGUCGAUCACCGAU

A short read tells you the sequence of the RNA in that read

31 of 159

slide adapted from jeff leek

32 of 159

slide adapted from jeff leek

Sequence Identifier

Sequence

Quality scores
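The three labelled parts of a read can be pulled apart in a few lines. This is a minimal sketch that assumes the standard four-line FASTQ layout and Phred+33 quality encoding, neither of which the slide states; the read name and quality string are made up for illustration.

```python
from typing import NamedTuple


class FastqRecord(NamedTuple):
    identifier: str        # sequence identifier (text after the leading "@")
    sequence: str          # the bases called for this read
    qualities: list[int]   # one Phred quality score per base


def parse_fastq_record(lines: list[str], offset: int = 33) -> FastqRecord:
    """Parse one four-line FASTQ record (Phred+33 encoding is assumed)."""
    header, sequence, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    scores = [ord(char) - offset for char in quality]
    return FastqRecord(header[1:], sequence, scores)


# Hypothetical read: 17 bases, the last two with low quality.
record = parse_fastq_record(
    ["@read_1", "ACTGACCTAGATCAGTC", "+", "I" * 15 + "##"]
)
print(record.identifier, record.sequence, record.qualities)
```

Low quality scores at the end of a read are common, which is one reason the scores are kept alongside the sequence.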

33 of 159

We’ve got 40M+ reads.

What does that all mean?

34 of 159

We first need to align these reads back to the genome

Genome

(DNA)

slide adapted from jeff leek

The Human Genome Project (HGP) published a draft reference sequence of the human genome in 2001.

35 of 159

We first need to align these reads back to the genome

Genome

(DNA)

slide adapted from jeff leek

Nellore et al. (2016)

Bioinformatics

http://rail.bio/

We can then align each short read back to the reference genome to figure out where it came from.

36 of 159

We first need to align these reads back to the genome

coverage vector

2

6

0

11

6

Genome

(DNA)

slide adapted from jeff leek

The number of reads at each position lets us know abundance

37 of 159

We first need to align these reads back to the genome

coverage vector

2

6

0

11

6

Genome

(DNA)

slide adapted from jeff leek

We then summarize across all the positions in a gene to get an estimate for gene expression.
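The coverage vector and the per-gene summary can be sketched as follows. The read coordinates and gene boundaries are invented for illustration, reads are treated as half-open (start, end) intervals, and summing coverage is only one of several possible summaries; this is not how any particular aligner or recount2 computes it.

```python
def coverage_vector(reads, genome_length):
    """Count how many aligned reads overlap each genome position.

    Each read is a half-open (start, end) interval in genome coordinates.
    """
    coverage = [0] * genome_length
    for start, end in reads:
        for position in range(max(start, 0), min(end, genome_length)):
            coverage[position] += 1
    return coverage


def gene_expression(coverage, gene_start, gene_end):
    """Summarize a gene by summing coverage across its positions."""
    return sum(coverage[gene_start:gene_end])


# Hypothetical alignments on a 10-base toy genome.
reads = [(0, 4), (2, 6), (3, 7), (8, 10)]
cov = coverage_vector(reads, 10)
expression = gene_expression(cov, 2, 7)  # toy gene spanning positions 2..6
print(cov, expression)
```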

38 of 159

RNA-Seq = estimate expression across entire genome

estimate expression!

39 of 159

RNA-Seq = estimate expression across entire genome

estimate expression!

expression ≅ # RNA-Seq reads

40 of 159

Project Goal: Take publicly available RNA-Seq data and make it available and easy-to-use

Easy to Get

Comprehensive

Useful for future study

41 of 159

GTEx

https://commonfund.nih.gov/GTEx

42 of 159

TCGA

GTEx

https://commonfund.nih.gov/GTEx

43 of 159

SRA

44 of 159

Project                                     No. of Samples
GTEx (Genotype-Tissue Expression Project)   9,662
TCGA (The Cancer Genome Atlas)              11,284
SRA (Sequence Read Archive)                 49,848

45 of 159

Project                                     No. of Samples
GTEx (Genotype-Tissue Expression Project)   9,662
TCGA (The Cancer Genome Atlas)              11,284
SRA (Sequence Read Archive)                 49,848

We’ll take these ~70,000 samples, align each back to the reference genome, and then, for each sample, we’ll estimate expression across the genome.

46 of 159

Project Goal: Take publicly available RNA-Seq data and make it available and easy-to-use

Easy to Get

Comprehensive

Useful for future study

X

47 of 159

expression data for ~70,000 human samples

samples

expression estimates

48 of 159

expression data for ~70,000 human samples

samples

expression estimates

gene

exon

junctions

ERs

49 of 159

50 of 159

Find a researcher with access to patient samples

What makes primary cancer different than metastatic cancer?

Collect patient samples and information

Extract DNA/RNA from samples

Sequence samples

Process sequencing data

Data cleaning

3-6 months

1-3 months

1 month - 1+ years

3~6 months

1-2 wks

2-4 wks

Determine expression differences between primary cancer and metastatic cancer

Find publicly available RNA-sequencing data from primary cancer and metastasis

Total: 1+ years

51 of 159

Find a researcher with access to patient samples

Collect patient samples and information

Extract DNA/RNA from samples

Sequence samples

Process sequencing data

Data cleaning

3-6 months

1-3 months

1 month - 1+ years

3~6 months

1-2 wks

2-4 wks

Determine expression differences between primary cancer and metastatic cancer

Get already processed and summarized RNA-Seq data from recount2

Total: months

What makes primary cancer different than metastatic cancer?

52 of 159

53 of 159

Since September of this year, 555 different people have accessed data in recount2

54 of 159

These 555 people have accessed 417,488 files (91,003 unique) in recount2

55 of 159

Project Goal: Take publicly available RNA-Seq data and make it available and easy-to-use

Easy to Get

Comprehensive

Useful for future study

X

X

56 of 159

expression data for ~70,000 human samples

Answer meaningful questions about human biology and expression

samples

expression estimates

gene

exon

junctions

ERs

GTEx

N=9,662

TCGA

N=11,284

SRA

N=49,848

57 of 159

expression data for ~70,000 human samples

samples

phenotypes

?

samples

expression estimates

gene

exon

junctions

ERs

Answer meaningful questions about human biology and expression

GTEx

N=9,662

TCGA

N=11,284

SRA

N=49,848

58 of 159

SRA phenotype information is far from complete

      Sex     Tissue  Race  Age
6620  female  liver   NA    NA
6621  female  liver   NA    NA
6622  female  liver   NA    NA
6623  female  liver   NA    NA
6624  female  liver   NA    NA
6625  male    liver   NA    NA
6626  male    liver   NA    NA
6627  male    liver   NA    NA
6628  male    liver   NA    NA
6629  male    liver   NA    NA
6630  male    liver   NA    NA
6631  NA      blood   NA    NA
6632  NA      blood   NA    NA
6633  NA      blood   NA    NA
6634  NA      blood   NA    NA
6635  NA      blood   NA    NA
6636  NA      blood   NA    NA

SRA

59 of 159

SRA phenotype information is far from complete

      Sex     Tissue  Race  Age
6620  female  liver   NA    NA
6621  female  liver   NA    NA
6622  female  liver   NA    NA
6623  female  liver   NA    NA
6624  female  liver   NA    NA
6625  male    liver   NA    NA
6626  male    liver   NA    NA
6627  male    liver   NA    NA
6628  male    liver   NA    NA
6629  male    liver   NA    NA
6630  male    liver   NA    NA
6631  NA      blood   NA    NA
6632  NA      blood   NA    NA
6633  NA      blood   NA    NA
6634  NA      blood   NA    NA
6635  NA      blood   NA    NA
6636  NA      blood   NA    NA

z

z

z

SRA

60 of 159

Even when information is provided, it’s not always clear…

Category  Frequency
F         95
female    2036
Female    51
M         77
male      1240
Male      141
Total     3640

SRA

61 of 159

Even when information is provided, it’s not always clear…

Category  Frequency
F         95
female    2036
Female    51
M         77
male      1240
Male      141
Total     3640

“1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”, “DK”, “male and female”, “Male (note: ….)”, “missing”, “mixed”, “mixture”, “N/A”, “Not available”, “not applicable”, “not collected”, “not determined”, “pooled male and female”, “U”, “unknown”, “Unknown”

SRA

62 of 159

Even when information is provided, it’s not always clear…

Category  Frequency
F         95
female    2036
Female    51
M         77
male      1240
Male      141
Total     3640

“1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”, “DK”, “male and female”, “Male (note: ….)”, “missing”, “mixed”, “mixture”, “N/A”, “Not available”, “not applicable”, “not collected”, “not determined”, “pooled male and female”, “U”, “unknown”, “Unknown”

# of NAs          44,957
# w/sex assigned  4,700
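Collapsing the reported labels into consistent categories is mostly string cleanup. The sketch below is one possible approach, not the method used for recount2: it maps the six clean spellings from the frequency table to "female" or "male" and treats every other entry (pooled, mixed, unknown, and so on) as missing. The function name `normalize_sex` is hypothetical.

```python
from collections import Counter

FEMALE_LABELS = {"f", "female"}
MALE_LABELS = {"m", "male"}


def normalize_sex(label):
    """Map a free-text sex label to 'female', 'male', or None (missing)."""
    if label is None:
        return None
    cleaned = label.strip().lower()
    if cleaned in FEMALE_LABELS:
        return "female"
    if cleaned in MALE_LABELS:
        return "male"
    return None  # ambiguous, pooled, or unknown entries stay missing


# Frequencies copied from the table on this slide.
reported = {"F": 95, "female": 2036, "Female": 51,
            "M": 77, "male": 1240, "Male": 141}

tally = Counter()
for label, count in reported.items():
    tally[normalize_sex(label)] += count
print(dict(tally))
```

Entries such as “pooled male and female” are deliberately left missing rather than guessed, since a single category would misdescribe them.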

SRA

63 of 159

in-silico Phenotyping

slide adapted from jeff leek

64 of 159

Goal: to accurately predict critical phenotype information for all samples in recount2

gene, exon, exon-exon junction and expressed region RNA-Seq data

SRA

Sequence Read Archive

N=49,848

TCGA

The Cancer Genome Atlas

N=11,284

GTEx

Genotype Tissue Expression Project

N=9,662

65 of 159

Machine Learning: Making predictions from data

Data Set #1

Data Set #2

66 of 159

Machine Learning: Making predictions from data

Data Set #1

Data Set #2

Training Data

Validation

Data

67 of 159

Machine Learning: Making predictions from data

Data Set #1

Data Set #2

Training Data

Data used to build the predictor

68 of 159

We’re interested in predicting phenotype from expression data...

69 of 159

There are a number of different curves you could fit through these data

This curve fits every point in the training data perfectly...

70 of 159

Machine Learning: Making predictions from data

Data Set #1

Data Set #2

Training Data

Data used to build the predictor

Validation

Data

Samples held back from training

71 of 159

What if we tried to predict phenotype in the validation data?

72 of 159

The curve no longer fits every point perfectly.

73 of 159

The curve no longer fits every point perfectly.

74 of 159

Machine Learning: Making predictions from data

Data Set #1

Data Set #2

Training Data

Data used to build the predictor

Validation

Data

Test

Data

Samples held back from training

Independent data set to test predictor
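Holding samples back from training can be as simple as shuffling sample IDs and cutting the list. This is a sketch under assumptions: the even split and the fixed seed are choices made here for illustration, and the sample IDs are made up. The test data are not produced by this split at all, because they come from an independent data set.

```python
import random


def split_samples(sample_ids, train_fraction=0.5, seed=0):
    """Randomly divide sample IDs into training and validation sets."""
    shuffled = list(sample_ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical sample IDs from "Data Set #1".
data_set_1 = [f"sample_{i}" for i in range(10)]
training, validation = split_samples(data_set_1)
print(training, validation)
```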

75 of 159

We can now test prediction accuracy in an independent set of samples.

76 of 159

The line generated from the training data accurately predicts phenotype in the test data

77 of 159

Goal: to accurately predict critical phenotype information for all samples in recount2

gene, exon, exon-exon junction and expressed region RNA-Seq data

SRA

Sequence Read Archive

N=49,848

TCGA

The Cancer Genome Atlas

N=11,284

GTEx

Genotype Tissue Expression Project

N=9,662

78 of 159

Goal: to accurately predict critical phenotype information for all samples in recount2

gene, exon, exon-exon junction and expressed region RNA-Seq data

SRA

Sequence Read Archive

N=49,848

TCGA

The Cancer Genome Atlas

N=11,284

GTEx

Genotype Tissue Expression Project

N=9,662

build and optimize phenotype predictor

Training Data

79 of 159

Missingness limited in GTEx phenotype data

    Sex     Tissue          Race                       Age
1   male    Lung            White                      59
2   male    Brain           White                      27
3   female  Heart           Black or African American  23
4   male    Brain           White                      51
5   male    Skin            White                      27
6   male    Lung            White                      68
7   female  Brain           White                      61
8   female  Adipose Tissue  White                      42
9   male    Brain           White                      40
10  female  Uterus          White                      33
11  female  Nerve           White                      60
12  male    Muscle          White                      54
13  female  Ovary           White                      31
14  male    Blood           White                      53
15  female  Brain           White                      56
16  male    Muscle          White                      44

GTEx

80 of 159

Missingness limited in GTEx phenotype data

Category  Frequency
female    3,626
male      6,036
NA        0

    Sex     Tissue          Race                       Age
1   male    Lung            White                      59
2   male    Brain           White                      27
3   female  Heart           Black or African American  23
4   male    Brain           White                      51
5   male    Skin            White                      27
6   male    Lung            White                      68
7   female  Brain           White                      61
8   female  Adipose Tissue  White                      42
9   male    Brain           White                      40
10  female  Uterus          White                      33
11  female  Nerve           White                      60
12  male    Muscle          White                      54
13  female  Ovary           White                      31
14  male    Blood           White                      53
15  female  Brain           White                      56
16  male    Muscle          White                      44

GTEx

81 of 159

Goal: to accurately predict critical phenotype information for all samples in recount2

gene, exon, exon-exon junction and expressed region RNA-Seq data

SRA

Sequence Read Archive

N=49,848

TCGA

The Cancer Genome Atlas

N=11,284

GTEx

Genotype Tissue Expression Project

N=9,662

build and optimize phenotype predictor

Training Data

82 of 159

Goal: to accurately predict critical phenotype information for all samples in recount2

gene, exon, exon-exon junction and expressed region RNA-Seq data

SRA

Sequence Read Archive

N=49,848

TCGA

The Cancer Genome Atlas

N=11,284

GTEx

Genotype Tissue Expression Project

N=9,662

divide samples

build and optimize phenotype predictor

test accuracy of predictor

Training Data

Validation

Data

83 of 159

Goal: to accurately predict critical phenotype information for all samples in recount2

gene, exon, exon-exon junction and expressed region RNA-Seq data

SRA

Sequence Read Archive

N=49,848

predict phenotypes across samples in TCGA

TCGA

The Cancer Genome Atlas

N=11,284

GTEx

Genotype Tissue Expression Project

N=9,662

divide samples

build and optimize phenotype predictor

test accuracy of predictor

Training Data

Validation

Data

Test

Data

84 of 159

Goal: to accurately predict critical phenotype information for all samples in recount2

gene, exon, exon-exon junction and expressed region RNA-Seq data

SRA

Sequence Read Archive

N=49,848

divide samples

build and optimize phenotype predictor

predict phenotypes across SRA samples

test accuracy of predictor

TCGA

The Cancer Genome Atlas

N=11,284

GTEx

Genotype Tissue Expression Project

N=9,662

predict phenotypes across samples in TCGA

Training Data

Validation

Data

Test

Data

85 of 159

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

phenopredict

86 of 159

Prediction is done using linear regression

87 of 159

Prediction is done using linear regression

88 of 159

Prediction is done using linear regression

89 of 159

Prediction is done using linear regression

90 of 159

Prediction is done using linear regression

91 of 159

Prediction is done using linear regression

92 of 159

Prediction is done using linear regression

93 of 159

Prediction is done using linear regression

New sample’s expression

94 of 159

Prediction is done using linear regression

New sample’s expression

95 of 159

Prediction is done using linear regression

New sample’s expression

New sample’s predicted phenotype
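The idea of fitting a line to training data and reading off a prediction for a new sample can be sketched with ordinary least squares on a single predictor. The numbers below are made up, and this one-variable fit is only an illustration of linear regression in general, not the model phenopredict fits.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, new_x):
    """Predicted phenotype for a new sample's expression value."""
    return intercept + slope * new_x


# Hypothetical training data: expression values and a numeric phenotype.
expression = [1.0, 2.0, 3.0, 4.0]
phenotype = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(expression, phenotype)
predicted = predict(slope, intercept, 5.0)  # new sample's expression
print(slope, intercept, predicted)
```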

96 of 159

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

phenopredict

97 of 159

92.7%

filter_regions()

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

sex

expression

male

female

Identify regions with differential expression for each level

phenopredict

98 of 159

92.7%

filter_regions()

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

sex

expression

male

female

Identify regions with differential expression for each level

phenopredict

Set of discriminatory regions

99 of 159

92.7%

filter_regions()

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

sex

expression

male

female

Identify regions with differential expression for each level

phenopredict

Set of discriminatory regions

build_predictor()

Extract coefficient estimates across regions

expression ~ phenotype

filtered regions (r)

male

female

phenotype (P)

100 of 159

92.7%

filter_regions()

build_predictor()

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

sex

expression

male

female

Identify regions with differential expression for each level

Extract coefficient estimates across regions

expression ~ phenotype

filtered regions (r)

male

female

phenotype (P)

phenopredict

Set of discriminatory regions

Relationship between expression and phenotype

101 of 159

92.7%

filter_regions()

build_predictor()

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

sex

expression

male

female

Identify regions with differential expression for each level

Extract coefficient estimates across regions

expression ~ phenotype

filtered regions (r)

male

female

phenotype (P)

phenopredict

test_predictor()

predictions

Predict phenotype and assess accuracy in training set data

male

female

0.99

0.01

0.02

0.98

0.04

0.96

0.98

0.02

Relationship between expression and phenotype

Set of discriminatory regions

Likelihood of phenotype for each individual

102 of 159

92.7%

filter_regions()

build_predictor()

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

sex

expression

male

female

Identify regions with differential expression for each level

Extract coefficient estimates across regions

expression ~ phenotype

filtered regions (r)

male

female

phenotype (P)

phenopredict

test_predictor()

predictions

Predict phenotype and assess accuracy in training set data

male

female

0.99

0.01

0.02

0.98

0.04

0.96

0.98

0.02

Relationship between expression and phenotype

Set of discriminatory regions

Assign phenotype to most likely category

103 of 159

92.7%

filter_regions()

build_predictor()

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

sex

expression

male

female

Identify regions with differential expression for each level

Extract coefficient estimates across regions

expression ~ phenotype

filtered regions (r)

male

female

phenotype (P)

phenopredict

test_predictor()

male

predictions

Predict phenotype and assess accuracy in training set data

male

female

0.99

0.01

0.02

0.98

0.04

0.96

0.98

0.02

Relationship between expression and phenotype

Set of discriminatory regions

Assign phenotype to most likely category

104 of 159

92.7%

filter_regions()

build_predictor()

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

sex

expression

male

female

Identify regions with differential expression for each level

Extract coefficient estimates across regions

expression ~ phenotype

filtered regions (r)

male

female

phenotype (P)

phenopredict

test_predictor()

male

female

female

male

predictions

Predict phenotype and assess accuracy in training set data

male

female

0.99

0.01

0.02

0.98

0.04

0.96

0.98

0.02

Relationship between expression and phenotype

Set of discriminatory regions

Assign phenotype to most likely category

105 of 159

92.7%

filter_regions()

test_predictor()

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

sex

expression

male

female

Identify regions with differential expression for each level

male

female

female

male

predictions

male

female

female

male

reported

phenopredict

Predict phenotype and assess accuracy in training set data

build_predictor()

Extract coefficient estimates across regions

expression ~ phenotype

filtered regions (r)

male

female

phenotype (P)

106 of 159

92.7%

filter_regions()

test_predictor()

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

sex

expression

male

female

Identify regions with differential expression for each level

male

female

female

male

predictions

male

female

female

male

reported

Prediction accuracy: 100%

phenopredict

Predict phenotype and assess accuracy in training set data

build_predictor()

Extract coefficient estimates across regions

expression ~ phenotype

filtered regions (r)

male

female

phenotype (P)
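Assigning each sample to its most likely category and scoring the result against the reported labels takes only a few lines. The likelihoods and reported labels below are the four example rows shown on these slides; the function names are hypothetical and this is not the test_predictor() source.

```python
def assign_category(likelihoods):
    """Return the phenotype level with the highest likelihood."""
    return max(likelihoods, key=likelihoods.get)


def accuracy(predicted, reported):
    """Fraction of samples whose prediction matches the reported label."""
    matches = sum(p == r for p, r in zip(predicted, reported))
    return matches / len(reported)


# Likelihood of each phenotype level for four samples (values from the slide).
likelihoods = [
    {"male": 0.99, "female": 0.01},
    {"male": 0.02, "female": 0.98},
    {"male": 0.04, "female": 0.96},
    {"male": 0.98, "female": 0.02},
]
reported = ["male", "female", "female", "male"]

predictions = [assign_category(row) for row in likelihoods]
print(predictions, f"{accuracy(predictions, reported):.0%}")
```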

107 of 159

filter_regions()

test_predictor()

extract_data()

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

Identify regions with differential expression for each level

male

female

female

male

predictions

male

female

female

male

reported

Extract expression information at regions identified by filter_regions() in a new data set

Prediction accuracy: 100%

expression @ filtered regions

new data set samples

92.7%

sex

expression

male

female

phenopredict

Predict phenotype and assess accuracy in training set data

build_predictor()

Extract coefficient estimates across regions

expression ~ phenotype

filtered regions (r)

male

female

phenotype (P)

108 of 159

92.7%

filter_regions()

test_predictor()

extract_data()

predict_pheno()

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

sex

expression

male

female

Identify regions with differential expression for each level

male

female

female

male

predictions

male

female

female

male

reported

Extract expression information at regions identified by filter_regions() in a new data set

Predict phenotypes across samples in this new data set

Prediction accuracy: 100%

expression @ filtered regions

new data set samples

apply coefficient estimates to the extracted data

phenopredict

build_predictor()

Extract coefficient estimates across regions

expression ~ phenotype

filtered regions (r)

male

female

phenotype (P)

Predict phenotype and assess accuracy in training set data

109 of 159

92.7%

filter_regions()

test_predictor()

extract_data()

predict_pheno()

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

sex

expression

male

female

Identify regions with differential expression for each level

male

female

female

male

predictions

male

female

female

male

reported

Extract expression information at regions identified by filter_regions() in a new data set

Predict phenotypes across samples in this new data set

Prediction accuracy: 100%

expression @ filtered regions

new data set samples

apply coefficient estimates to the extracted data

phenopredict

build_predictor()

Extract coefficient estimates across regions

expression ~ phenotype

filtered regions (r)

male

female

phenotype (P)

Predict phenotype and assess accuracy in training set data

110 of 159

92.7%

filter_regions()

test_predictor()

extract_data()

predict_pheno()

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

sex

expression

male

female

Identify regions with differential expression for each level

male

female

female

male

predictions

male

female

female

male

reported

Extract expression information at regions identified by filter_regions() in a new data set

Predict phenotypes across samples in this new data set

Prediction accuracy: 100%

expression @ filtered regions

new data set samples

apply coefficient estimates to the extracted data

male

male

male

female

predictions in new data set

phenopredict

build_predictor()

Extract coefficient estimates across regions

expression ~ phenotype

filtered regions (r)

male

female

phenotype (P)

Predict phenotype and assess accuracy in training set data
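The four steps can be strung together in a toy version. This sketch borrows the function names from the slides but not their implementation: region selection here is a plain difference in group means, the "coefficients" are per-level mean expression, and prediction is a nearest-mean rule swapped in for simplicity. Region names and expression values are invented.

```python
from statistics import mean


def level_means(values, phenotype, levels):
    """Mean expression of one region within each phenotype level."""
    return [mean(v for v, p in zip(values, phenotype) if p == lvl)
            for lvl in levels]


def filter_regions(expression, phenotype, n_regions):
    """Keep the regions whose mean expression differs most between levels."""
    levels = sorted(set(phenotype))

    def spread(region):
        means = level_means(expression[region], phenotype, levels)
        return max(means) - min(means)

    return sorted(expression, key=spread, reverse=True)[:n_regions]


def build_predictor(expression, phenotype, regions):
    """Per-level mean expression at each selected region."""
    levels = sorted(set(phenotype))
    table = [level_means(expression[r], phenotype, levels) for r in regions]
    return {lvl: [row[i] for row in table] for i, lvl in enumerate(levels)}


def extract_data(new_expression, regions):
    """Pull expression at the selected regions from a new data set."""
    return [new_expression[r] for r in regions]


def predict_pheno(new_data, predictor):
    """Assign each new sample to the level with the closest mean profile."""
    predictions = []
    for s in range(len(new_data[0])):
        profile = [region_values[s] for region_values in new_data]
        predictions.append(min(
            predictor,
            key=lambda lvl: sum((a - b) ** 2
                                for a, b in zip(profile, predictor[lvl])),
        ))
    return predictions


# Hypothetical training data: expression per region across four samples.
train = {"chrY_region": [9.0, 0.1, 0.2, 8.5],
         "chrX_region": [1.0, 6.0, 5.5, 1.2],
         "autosomal_region": [4.0, 4.1, 3.9, 4.0]}
sex = ["male", "female", "female", "male"]

regions = filter_regions(train, sex, n_regions=2)
predictor = build_predictor(train, sex, regions)

# Hypothetical new data set with two samples.
new = {"chrY_region": [8.0, 0.3],
       "chrX_region": [1.5, 5.0],
       "autosomal_region": [4.2, 3.8]}
new_predictions = predict_pheno(extract_data(new, regions), predictor)
print(regions, new_predictions)
```

The uninformative region is dropped at the filtering step, so only regions that separate the levels contribute to the prediction.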

111 of 159

gene, exon, exon-exon junction and expressed region RNA-Seq data

SRA

Sequence Read Archive

N=49,848

TCGA

The Cancer Genome Atlas

N=11,284

GTEx

Genotype Tissue Expression Project

N=9,662

Training Data

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

phenopredict

112 of 159

gene, exon, exon-exon junction and expressed region RNA-Seq data

SRA

Sequence Read Archive

N=49,848

TCGA

The Cancer Genome Atlas

N=11,284

GTEx

Genotype Tissue Expression Project

N=9,662

Training Data

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

phenopredict

Accuracy

100%

100%

100%

100%

113 of 159

gene, exon, exon-exon junction and expressed region RNA-Seq data

SRA

Sequence Read Archive

N=49,848

TCGA

The Cancer Genome Atlas

N=11,284

GTEx

Genotype Tissue Expression Project

N=9,662

Training Data

Validation

Data

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

phenopredict

100%

100%

100%

100%

Accuracy

114 of 159

gene, exon, exon-exon junction and expressed region RNA-Seq data

SRA

Sequence Read Archive

N=49,848

TCGA

The Cancer Genome Atlas

N=11,284

GTEx

Genotype Tissue Expression Project

N=9,662

Training Data

Validation

Data

Test

Data

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

phenopredict

100%

100%

100%

100%

Accuracy

115 of 159

gene, exon, exon-exon junction and expressed region RNA-Seq data

SRA

Sequence Read Archive

N=49,848

TCGA

The Cancer Genome Atlas

N=11,284

GTEx

Genotype Tissue Expression Project

N=9,662

Training Data

Validation

Data

Test

Data

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

phenopredict

Make predictions!

100%

100%

100%

100%

Accuracy

116 of 159

gene, exon, exon-exon junction and expressed region RNA-Seq data

SRA

Sequence Read Archive

N=49,848

TCGA

The Cancer Genome Atlas

N=11,284

GTEx

Genotype Tissue Expression Project

N=9,662

Training Data

Validation

Data

Test

Data

filter_regions()

build_predictor()

test_predictor()

extract_data()

predict_pheno()

functions

phenopredict

Make predictions!

100%

100%

100%

100%

Accuracy

117 of 159

Sex prediction is accurate across data sets

Data set           Number of Regions  Number of Samples (N)  Accuracy
GTEx (training)    40                 4,769                  99.9%
GTEx (validation)  40                 4,769                  99.8%
TCGA (test)        40                 11,245                 99.0%
SRA                40                 3,640                  86.3%

118 of 159

Sex prediction is accurate across data sets

Data set           Number of Regions  Number of Samples (N)  Accuracy
GTEx (training)    40                 4,769                  99.9%
GTEx (validation)  40                 4,769                  99.8%
TCGA (test)        40                 11,245                 99.0%
SRA                40                 3,640                  86.3%

119 of 159

http://www.rna-seqblog.com/

Can we use expression data to predict tissue?

120 of 159

Tissue prediction is accurate across data sets

Data set           Number of Regions  Number of Samples (N)  Accuracy
GTEx (training)    2,281              4,769                  97.7%
GTEx (validation)  2,281              4,769                  96.6%
TCGA (test)        2,281              7,193                  76.8%
SRA                2,281              8,951                  51.9%

121 of 159

Prediction is more accurate in healthy tissue

Data set              Number of Regions  Number of Samples (N)  Accuracy
GTEx (training)       2,281              4,769                  97.7%
GTEx (validation)     2,281              4,769                  96.6%
TCGA: healthy tissue  2,281              613                    92.7%
TCGA: cancer          2,281              6,579                  75.3%
SRA                   2,281              8,951                  51.9%

122 of 159

Across the samples in recount2, brain, blood, and skin are the three most frequently predicted tissue types

123 of 159

124 of 159

125 of 159

126 of 159

A sample reported to be Intestine is predicted to be Colon. That makes good sense.

127 of 159

128 of 159

129 of 159

A sample reported to be Breast is often predicted to be Adipose Tissue. Makes enough sense too…

130 of 159

131 of 159

132 of 159

But sometimes the predictions and reported tissues make less sense…

 ¯\_(ツ)_/¯

133 of 159

Tissue prediction is largely accurate across recount2

Tissue can be accurately predicted from expression data.

Discordant predictions are often made to biologically similar tissues.

Sometimes, predictions are inaccurate.

134 of 159

Project Goal: Take publicly available RNA-Seq data and make it available and easy-to-use

Easy to Get

Comprehensive

Useful for future study

X

X

X

135 of 159

Find a researcher with access to patient samples

Collect patient samples and information

Extract DNA/RNA from samples

Sequence samples

Process sequencing data

Data cleaning

3-6 months

1-3 months

1 month - 1+ years

3~6 months

1-2 wks

2-4 wks

Determine expression differences between primary cancer and metastatic cancer

Get already processed and summarized RNA-Seq data from recount2

Total: months

What makes primary cancer different than metastatic cancer?

136 of 159

Find a researcher with access to patient samples

Collect patient samples and information

Extract DNA/RNA from samples

Sequence samples

Process sequencing data

Data cleaning

3-6 months

1-3 months

1 month - 1+ years

3~6 months

1-2 wks

2-4 wks

Determine expression differences between primary cancer and metastatic cancer

Get already processed and summarized RNA-Seq data from recount2 with sample information

Total: months

What makes primary cancer different than metastatic cancer?

137 of 159

Ok. Ok. What about actually using all of these predictions…?

138 of 159

What makes primary cancer different than metastatic cancer?

www.hopkinsmedicine.org

139 of 159

Molecular Oncology, July 2014

140 of 159

Kim et al. sought to identify genes that contribute to metastasis in colon cancer.

N=18

1. Healthy Colon (NC)

2. Primary Cancer (PC)

3. Liver Metastasis (MC)

141 of 159

Kim et al. sought to identify genes that contribute to metastasis in colon cancer.

N=18

1. Healthy Colon (NC)

2. Primary Cancer (PC)

3. Liver Metastasis (MC)

NC vs. PC

MC vs. PC

142 of 159

Predictions can be used to:
(1) Identify studies of interest
(2) Appropriately analyze data

NC: non-cancerous

PC: primary cancer

MC: metastatic cancer

143 of 159

Predictions can be used to:
(1) Identify studies of interest
(2) Appropriately analyze data

NC: non-cancerous

PC: primary cancer

MC: metastatic cancer

144 of 159

Predictions can be used to:
(1) Identify studies of interest
(2) Appropriately analyze data

NC: non-cancerous

PC: primary cancer

MC: metastatic cancer

Are the same genes found when sex is included in the analysis?
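
One way to explore this question is to fit the same model for a gene with and without sex as a covariate and compare the estimated group effect. The sketch below uses base R `lm()` on made-up expression values for a single hypothetical gene. It is not the analysis from the talk or from Kim et al., and a real RNA-Seq analysis would normally use a dedicated differential expression package rather than a plain linear model.

```r
# Hypothetical toy data for one gene: 12 samples, values made up for
# illustration. Expression depends on sex only, and sex is unevenly
# distributed between the two groups.
toy <- data.frame(
  group = factor(rep(c("PC", "MC"), each = 6), levels = c("PC", "MC")),
  sex   = factor(c("male", "male", "male", "male", "female", "female",
                   "male", "male", "female", "female", "female", "female")),
  expr  = c(5.1, 4.9, 5.2, 4.8, 7.1, 6.9,
            5.2, 4.8, 7.1, 6.9, 7.1, 6.9)
)

# Model 1: group only (sex ignored).
fit_unadjusted <- lm(expr ~ group, data = toy)

# Model 2: group plus sex as a covariate.
fit_adjusted <- lm(expr ~ group + sex, data = toy)

# Estimated MC vs. PC difference under each model.
unadjusted_effect <- unname(coef(fit_unadjusted)["groupMC"])
adjusted_effect   <- unname(coef(fit_adjusted)["groupMC"])

print(c(unadjusted = unadjusted_effect, adjusted = adjusted_effect))
```

In this toy setup the apparent MC vs. PC difference comes entirely from the sex imbalance, so it disappears once sex is in the model. With predicted sex available for every sample, this kind of check becomes possible even when the original study did not report sex.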

145 of 159

Predictions can be used to:
(1) Identify studies of interest
(2) Appropriately analyze data

NC: non-cancerous

PC: primary cancer

MC: metastatic cancer

146 of 159

Predictions can be used to:
(1) Identify studies of interest
(2) Appropriately analyze data

NC: non-cancerous

PC: primary cancer

MC: metastatic cancer

147 of 159

Predictions can be used to:
(1) Identify studies of interest
(2) Appropriately analyze data

NC: non-cancerous

PC: primary cancer

MC: metastatic cancer

148 of 159

Predictions can be used to:
(1) Identify studies of interest
(2) Appropriately analyze data

NC: non-cancerous

PC: primary cancer

MC: metastatic cancer

149 of 159

Predictions can be used to:
(1) Identify studies of interest
(2) Appropriately analyze data

NC: non-cancerous

PC: primary cancer

MC: metastatic cancer

150 of 159

Find a researcher with access to patient samples

Collect patient samples and information

Extract DNA/RNA from samples

Sequence samples

Process sequencing data

Data cleaning

3-6 months

1-3 months

1 month - 1+ years

3~6 months

1-2 wks

2-4 wks

Determine expression differences between primary cancer and metastatic cancer

Total: days-months

What makes primary cancer different than metastatic cancer?

151 of 159

Finally, what if YOU want to use recount2...?

152 of 159

predictions (v0.0.06)

sample_id    dataset  reported_sex  predicted_sex  accuracy_sex  reported_tissue  predicted_tissue  accuracy_tissue
SRR660824    gtex     male          male           0.999         lung             lung              0.977
SRR2166176   gtex     male          male           0.999         brain            brain             0.977
SRR606939    gtex     female        female         0.999         heart            heart             0.966
SRR2167642   gtex     male          male           0.999         brain            brain             0.966
SRR2165473   gtex     male          male           0.999         skin             skin              0.966
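
A table in this shape makes it easy to check where reported and predicted values agree. Below is a minimal base R sketch that types in the five example rows from the slide by hand (the accuracy columns are left out for brevity) and flags any discordant samples. The helper column names `sex_concordant` and `tissue_concordant` are introduced here for illustration and are not part of the predictions table.

```r
# The five example rows shown on the slide, entered by hand.
preds <- data.frame(
  sample_id        = c("SRR660824", "SRR2166176", "SRR606939",
                       "SRR2167642", "SRR2165473"),
  dataset          = "gtex",
  reported_sex     = c("male", "male", "female", "male", "male"),
  predicted_sex    = c("male", "male", "female", "male", "male"),
  reported_tissue  = c("lung", "brain", "heart", "brain", "skin"),
  predicted_tissue = c("lung", "brain", "heart", "brain", "skin"),
  stringsAsFactors = FALSE
)

# Flag samples whose reported and predicted values agree
# (illustrative helper columns, not part of the original table).
preds$sex_concordant    <- preds$reported_sex == preds$predicted_sex
preds$tissue_concordant <- preds$reported_tissue == preds$predicted_tissue

# Fraction of concordant samples for each phenotype.
sex_concordance    <- mean(preds$sex_concordant)
tissue_concordance <- mean(preds$tissue_concordant)

# Samples where either phenotype disagrees and deserves a second look.
discordant <- preds[!(preds$sex_concordant & preds$tissue_concordant), ]

print(c(sex = sex_concordance, tissue = tissue_concordance))
print(nrow(discordant))
```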

153 of 159

Expression data and predictions available in recount R package

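library('recount')
# Download the gene-level data for study ERP001942
download_study('ERP001942', type = 'rse-gene')
# Load the downloaded object, which is named rse_gene
load(file.path('ERP001942', 'rse_gene.Rdata'))
# Scale the counts before analysis
rse <- scale_counts(rse_gene)
# Add the predicted phenotypes to the sample information
rse_with_pred <- add_predictions(rse)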

154 of 159

There are a lot of questions an undergraduate could answer with recount2...

  1. Which genes are expressed in which tissues?
  2. Which genes contribute to X-Inactivation?
  3. Which genes play a role in cancer? In autism? In Alzheimer’s?
  4. How different is expression between individuals?

155 of 159

…and all of this is also up on bioRxiv

http://biorxiv.org/content/early/2017/06/03/145656

156 of 159

If you want to…

Align RNA-Seq data
http://rail.bio

Learn about human expression
https://jhubiostatistics.shinyapps.io/recount/

Predict phenotype information
phenopredict
https://github.com/leekgroup/phenopredict

157 of 159

The Leek group

Collaborators

  • Huan Chen
  • Jack Fu
  • Aboozar Hadavand
  • Leslie Myint
  • Kayode Sosina
  • Sara Wang
  • Jeff Leek
  • Andrew Jaffe
  • Kasper Hansen
  • Margaret Taub
  • Leah Jager
  • Ben Langmead
  • Abhi Nellore
  • Kai Kammers
  • Leo Collado-Torres
  • Ashkaun Razmara

158 of 159

If you want to…

Align RNA-Seq data
http://rail.bio

Learn about human expression
https://jhubiostatistics.shinyapps.io/recount/

Predict phenotype information
phenopredict
https://github.com/leekgroup/phenopredict

159 of 159