1 of 102

Population-scale transcriptomics for precision oncology

Jeff Leek

VP & CDO

Professor Biostatistics Program, PHS

J Orin Edson Foundation Chair

@jtleek

2 of 102

3 of 102

Fred Hutchinson Cancer Center

4 of 102

www.github.com/jtleek/talks

5 of 102

Precision oncology:

A tour of data plumbing challenges

6 of 102

7 of 102

What gene expression patterns are prognostic of colorectal cancer metastasis?

8 of 102

https://pubmed.ncbi.nlm.nih.gov/25049118/

9 of 102

Q: What genes are prognostic of colorectal cancer metastasis?

Data

Processing

Computing

Metadata

Analysis

A: This gene signature is/isn’t a potential prognostic biomarker.

10 of 102

Find a researcher with access to patient samples

What genes are prognostic for metastasis?

11 of 102

Find a researcher with access to patient samples

What genes are prognostic for metastasis?

Collect patient samples and information

3~6 months

12 of 102

Find a researcher with access to patient samples

Collect patient samples and information

Extract DNA/RNA from samples

Sequence samples

3~6 months

1-2 wks

2-4 wks

What genes are prognostic for metastasis?

13 of 102

Find a researcher with access to patient samples

Collect patient samples and information

Extract DNA/RNA from samples

Sequence samples

Process sequencing data

3-6 months

1-3 months

1 month - 1+ years

3~6 months

1-2 wks

2-4 wks

Analyze data and answer biological question

Data cleaning

What genes are prognostic for metastasis?

14 of 102

Find a researcher with access to patient samples

Collect patient samples and information

Extract DNA/RNA from samples

Sequence samples

Process sequencing data

3-6 months

1-3 months

1 month - 1+ years

3~6 months

1-2 wks

2-4 wks

Total: 2+ years

Analyze data and answer biological question

Data cleaning

What genes are prognostic for metastasis?

15 of 102

16 of 102

Data

Processing

Computing

Metadata

Analysis

17 of 102

We are at the intersection of two revolutions

18 of 102

19 of 102

SAMPLE SIZE

N =

20 of 102

N = ($ YOU HAVE) / ($ PER SAMPLE)

21 of 102

Langmead & Nellore, Nat. Rev. Genet. 2018

22 of 102

http://www.washingtonpost.com/sf/national/2015/06/27/watsons-next-feat-taking-on-cancer/

23 of 102

24 of 102

Data sharing is improving over time

25 of 102

...but data aren’t always easy to use

Data sharing is improving over time

26 of 102

Data

Processing

Computing

Metadata

Analysis

27 of 102

Find a researcher with access to patient samples

Collect patient samples and information

Extract DNA/RNA from samples

Sequence samples

Process sequencing data

3-6 months

1-3 months

1 month - 1+ years

~6 months

1-2 wks

2-4 wks

Total: 1.5+ years

Analyze data and answer biological question

Data cleaning

What genes are prognostic for metastasis?

28 of 102

AUCAGUCGAUCACCGAU

transcription

RNA

translation

protein

ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT

DNA

M

M

M

Slide adapted from Alyssa Frazee

29 of 102

AUCAGUCGAUCACCGAU

transcription

RNA

translation

protein

ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT

DNA

M

M

M

RNA

Slide adapted from Alyssa Frazee

30 of 102

Genome

Transcripts

Reads

31 of 102

@22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2

CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA

+

GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD

@22:16362385-16362561W:ENST00000440999:3:177:-56:294:S/2

GCGTGAGCCACAGGGCCCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCT

+

@=ABBBBIIIIIIIIHHGGGGIIDBDIIIIIIGIIIIHIIIIHFDD@BBDBGGFIDEE8DCC/29>BGFCGHHHGF

@22:16362385-16362561W:ENST00000440999:4:177:137:254:S/1

TCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCAAGGCCTGAACTACCTGCaGTGGGGAGCACCTCAGGGTTT

+

DDGBBCGGGIGGGBDDDHIIGGDGD77=BDIIIIIIIIFHHHHIIIHEFFHGGDD8A>DEGHHIFDDHH8@BEDDI

@22:16362385-16362561W:ENST00000440999:5:177:68:251:S/2

AGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCCTGGAGCGAGTTGTGGATGGCAAAAAGACNCGCC

+

HIGHIHFHEGE4111:.;8@?@HDIIIIIIIEGGIHHHIIGA?=:FIIIDD8.02506A8=AC#############

@22:16362385-16362561W:ENST00000440999:6:177:348:453:S/1

AAGGCCTGAACTACCTGCGGTGGGGAGCACCTCAGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCC

+

B9?@8=42:E@GDEDIIIIIGGHIIIFBEEAGIIDIIDHHGGHIIEGEIIIIIHIHFHFFEEFGGGGGB88>:DGH

@22:51205934-51222090C:ENST00000464740:132:612:223:359:S/2

GGAAGTATGATGCTGATGACAACGTGAAGATCATCTGCCTGGGAGACAGCGCAGTGGGCAAATCCAAACTCATGGA

+

IIEHHHHHIIIIIIIHGGDGHHEDDG8=;?==19;<<>D@@GGGIIHIIHGGDDHGBA=ABEG@@DFCCAA<:=>8

@22:51205934-51222090C:ENST00000464740:125:612:-1:185:S/1

TGGAGTGCGCTGCGGCGCGAGCTGGGCCGGCGGGCGTGGTTCGAGAGCGCGCAGAGTCCAGACTGGCGGCAGGGCC

+

HHIIIHIDGG@;=@GIIIIIDDGBBBEDB@8>5554,/':9B@@C?==@1:2@?=GG=;<HHHHGIIHHEC-;;3?

3 GB
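A minimal sketch of how reads like the ones above can be parsed, assuming plain four-line FASTQ records and Phred+33 quality encoding (the quality strings on this slide include characters below '@', which is consistent with that offset). The names `FastqRecord` and `parse_fastq` and the toy record are illustrative, not from the talk.

```python
from typing import Iterator, List, NamedTuple, Sequence


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(lines: Sequence[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield one record per four non-blank lines (header, bases, '+', qualities).

    Quality strings may themselves begin with '@' (one on this slide does),
    so records are read in fixed groups of four rather than by scanning for '@'.
    """
    rows = [line.strip() for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    for i in range(0, len(rows), 4):
        header, bases, plus, quals = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at row {i + 1}")
        if len(bases) != len(quals):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Phred score = ASCII code of the quality character minus the offset.
        yield FastqRecord(header[1:], bases, [ord(c) - offset for c in quals])


# Toy record (hypothetical), with blank lines between rows as on the slide.
toy_lines = ["@read1", "", "ACGT", "", "+", "", "II5#"]
for record in parse_fastq(toy_lines):
    print(record.name, record.sequence, record.qualities)
```

Reading in fixed groups of four keeps the parser short; a production pipeline would more likely rely on an established FASTQ reader.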

32 of 102

33 of 102

Carefully!!

34 of 102

35 of 102

coverage vector

2

6

0

11

6

Genome

(DNA)
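A coverage vector records, for each genome position, how many aligned reads overlap it. The sketch below shows one way to compute it, assuming each read is already reduced to a 0-based, half-open `(start, end)` interval on a single reference; the function name and the toy reads are hypothetical.

```python
from typing import Iterable, List, Tuple


def coverage_vector(reads: Iterable[Tuple[int, int]], length: int) -> List[int]:
    """Per-base read depth for half-open intervals on a reference of `length` bases."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (length + 1)
    for start, end in reads:
        if not 0 <= start <= end <= length:
            raise ValueError(f"interval ({start}, {end}) is outside the reference")
        diff[start] += 1
        diff[end] -= 1
    # A running sum of the differences gives the depth at each position.
    coverage = []
    depth = 0
    for delta in diff[:length]:
        depth += delta
        coverage.append(depth)
    return coverage


toy_reads = [(0, 3), (1, 4), (1, 2)]
print(coverage_vector(toy_reads, length=5))  # [1, 3, 2, 1, 0]
```

The difference-array form touches each read once and each position once, which matters when the reference is a whole chromosome.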

36 of 102

junction

counts

3

3

Genome

(DNA)
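Junction counts tally how many reads span each exon-exon junction. A minimal sketch, assuming each spliced alignment is given as an ordered list of aligned blocks in 0-based, half-open coordinates; this representation, the function name and the toy alignments are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Block = Tuple[int, int]  # (start, end) of one aligned segment of a read


def junction_counts(alignments: Sequence[List[Block]]) -> Counter:
    """Count reads supporting each junction.

    A junction is the gap between two consecutive blocks of one read, keyed
    here by (end of upstream block, start of downstream block).
    """
    counts = Counter()
    for blocks in alignments:
        for (_, upstream_end), (downstream_start, _) in zip(blocks, blocks[1:]):
            counts[(upstream_end, downstream_start)] += 1
    return counts


toy_alignments = [
    [(100, 150), (300, 326)],              # spans one junction
    [(120, 150), (300, 346)],              # same junction
    [(130, 150), (300, 320), (500, 530)],  # spans two junctions
    [(400, 476)],                          # unspliced read, no junction
]
print(junction_counts(toy_alignments))
```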

37 of 102

~71.7 MB

38 of 102

Data

Processing

Computing

Metadata

Analysis

39 of 102

SRA

Human

RNA-seq

Illumina

≈22,000 samples

≈50,000 samples

≈60,000 samples

≈247,000 samples

40 of 102

http://rail.bio/

Slide courtesy Ben Langmead

41 of 102

http://blogs.citrix.com/2012/10/17/announcing-general-availability-of-sharefile-with-storagezones/

42 of 102

Slide adapted from Andrew Jaffe

43 of 102

Obstacle: our research moves (spot) markets

Spike in market price due to preprocessing job flows

44 of 102

expression data for ~70,000 human samples

GTEx

N=9,662

TCGA

N=11,284

SRA

N=49,848

samples

expression estimates

gene

exon

junctions

ERs

45 of 102

< 1,000 junctions

> 20,000 junctions

46 of 102

=

47 of 102

A global view of transcript variability

Nellore et al. Genome Biology 2016

48 of 102

We discovered and redefined hundreds of human genes using these data!

49 of 102

50 of 102

Data

Processing

Computing

Metadata

Analysis

51 of 102

Find a researcher with access to patient samples

Collect patient samples and information

Extract DNA/RNA from samples

Sequence samples

Process sequencing data

3-6 months

1-3 months

1 month - 1+ years

~6 months

1-2 wks

2-4 wks

Total: 1.5+ years

Analyze data and answer biological question

Data cleaning

What genes are prognostic for metastasis?

52 of 102

What % expressed?

New genes?

Important outliers?

Prognostic signatures

Sequence data

Process/quantify

ACTACTTT

Metadata

Clean/predict

53 of 102

expression data for ~70,000 human samples

GTEx

N=9,662

TCGA

N=11,284

SRA

N=49,848

samples

expression estimates

gene

exon

junctions

ERs

Answer meaningful questions about human biology and expression

54 of 102

expression data for ~70,000 human samples

samples

phenotypes

?

GTEx

N=9,662

TCGA

N=11,284

SRA

N=49,848

samples

expression estimates

gene

exon

junctions

ERs

Answer meaningful questions about human biology and expression

55 of 102

SRA phenotype information is far from complete

       Sex      Tissue   Race   Age
6620   female   liver    NA     NA
6621   female   liver    NA     NA
6622   female   liver    NA     NA
6623   female   liver    NA     NA
6624   female   liver    NA     NA
6625   male     liver    NA     NA
6626   male     liver    NA     NA
6627   male     liver    NA     NA
6628   male     liver    NA     NA
6629   male     liver    NA     NA
6630   male     liver    NA     NA
6631   NA       blood    NA     NA
6632   NA       blood    NA     NA
6633   NA       blood    NA     NA
6634   NA       blood    NA     NA
6635   NA       blood    NA     NA
6636   NA       blood    NA     NA
...

56 of 102

Even when information is provided, it’s not always clear…

Sex across the SRA:

Level    Frequency
F        95
female   2036
Female   51
M        77
male     1240
Male     141
Total    3640

57 of 102

Even when information is provided, it’s not always clear…

Sex across the SRA:

Level    Frequency
F        95
female   2036
Female   51
M        77
male     1240
Male     141
Total    3640

“1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”, “DK”, “male and female”, “Male (note: ….)”, “missing”, “mixed”, “mixture”, “N/A”, “Not available”, “not applicable”, “not collected”, “not determined”, “pooled male and female”, “U”, “unknown”, “Unknown”

# of NAs: 44,957

# w/sex assigned: 4,700
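The six spellings in the table above collapse to two categories once case and abbreviations are normalized. The sketch below illustrates that cleaning step; the mapping in `CANONICAL_SEX`, the function names and the choice to leave pooled or ambiguous entries unresolved are assumptions for illustration, not the procedure used in the talk.

```python
from collections import Counter
from typing import Dict, Optional

# Spellings treated as equivalent (illustrative assumption).
CANONICAL_SEX = {"f": "female", "female": "female", "m": "male", "male": "male"}


def normalize_sex(label: Optional[str]) -> Optional[str]:
    """Map a free-text label to 'female' or 'male'; return None if ambiguous."""
    if label is None:
        return None
    return CANONICAL_SEX.get(label.strip().lower())


def tally_sex(frequencies: Dict[str, int]) -> Counter:
    """Sum frequencies per normalized category; unmapped labels go to 'unresolved'."""
    totals = Counter()
    for label, count in frequencies.items():
        totals[normalize_sex(label) or "unresolved"] += count
    return totals


# Level/frequency pairs copied from the "Sex across the SRA" table.
sra_sex_counts = {"F": 95, "female": 2036, "Female": 51,
                  "M": 77, "male": 1240, "Male": 141}
totals = tally_sex(sra_sex_counts)
print(totals["female"], totals["male"])  # 2182 1458

# Entries such as pooled samples are deliberately left unresolved.
print(normalize_sex("pooled male and female"))  # None
```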

58 of 102

59 of 102

Goal: to accurately predict critical phenotype information for all samples in recount2

gene, exon, exon-exon junction and expressed region RNA-Seq data

SRA

Sequence Read Archive

N=49,848

TCGA

The Cancer Genome Atlas

N=11,284

GTEx

Genotype-Tissue Expression Project

N=9,662

60 of 102

Missingness is limited in GTEx phenotype data

Sex across GTEx:

Level    Frequency
female   3,626
male     6,036
NA       0

GTEx

     Sex      Tissue           Race                        Age
1    male     Lung             White                       59
2    male     Brain            White                       27
3    female   Heart            Black or African American   23
4    male     Brain            White                       51
5    male     Skin             White                       27
6    male     Lung             White                       68
7    female   Brain            White                       61
8    female   Adipose Tissue   White                       42
9    male     Brain            White                       40
10   female   Uterus           White                       33
11   female   Nerve            White                       60
12   male     Muscle           White                       54
13   female   Ovary            White                       31
14   male     Blood            White                       53
15   female   Brain            White                       56
16   male     Muscle           White                       44

61 of 102

Goal: to accurately predict critical phenotype information for all samples in recount2

Ellis et al., Nucleic Acids Res. 2018

gene, exon, exon-exon junction and expressed region RNA-Seq data

SRA

Sequence Read Archive

N=49,848

divide samples

build and optimize phenotype predictor

predict phenotypes across SRA samples

test accuracy of predictor

TCGA

The Cancer Genome Atlas

N=11,284

GTEx

Genotype-Tissue Expression Project

N=9,662

predict phenotypes across samples in TCGA

Training Data

Validation Data

Test Data

Validation Data
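The "divide samples" step can be sketched as a seeded random split into training and validation sets. The slide does not say how the split was actually made, so the random assignment, the 50/50 default and the name `split_samples` are all assumptions for illustration.

```python
import random
from typing import List, Sequence, Tuple


def split_samples(sample_ids: Sequence[str],
                  train_fraction: float = 0.5,
                  seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly divide sample IDs into (training, validation) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    ids = sorted(sample_ids)          # fixed starting order, so the seed alone
    random.Random(seed).shuffle(ids)  # determines the split
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# Hypothetical sample IDs.
toy_ids = [f"SAMPLE_{i:02d}" for i in range(10)]
training, validation = split_samples(toy_ids)
print(len(training), len(validation))  # 5 5
```

The predictor is then built on the training half only, checked on the validation half, and tested on an independent data set before being applied elsewhere.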

62 of 102

63 of 102

Sex prediction is accurate across data sets

Data set            Number of Regions   Number of Samples (N)   Accuracy
GTEx (training)     40                  4,769                   99.9%
GTEx (validation)   40                  4,769                   99.8%
TCGA (test)         40                  11,245                  99.0%
SRA                 40                  3,640                   86.3%

64 of 102

To assess misreporting of sex in the SRA, we can use Y-chromosome expression

XX

XY

reported female

reported male

predicted female

predicted male
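The comparison of reported and predicted sex can be illustrated with a deliberately simple stand-in: call a sample male when its mean expression over Y-linked genes exceeds a cutoff, then flag samples whose reported sex disagrees. This is not the predictor from the talk; the threshold value, the function names and the toy expression values are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def predict_sex(y_expression: Sequence[float], threshold: float = 1.0) -> str:
    """Call 'male' if mean Y-linked expression exceeds the (assumed) threshold."""
    if len(y_expression) == 0:
        raise ValueError("need at least one Y-linked expression value")
    return "male" if mean(y_expression) > threshold else "female"


def discordant_samples(samples: Dict[str, Tuple[str, Sequence[float]]],
                       threshold: float = 1.0) -> List[str]:
    """Return IDs whose reported sex disagrees with the expression-based call."""
    flagged = []
    for sample_id, (reported, y_expression) in samples.items():
        if reported not in ("male", "female"):
            continue  # nothing to compare against when sex is missing
        if predict_sex(y_expression, threshold) != reported:
            flagged.append(sample_id)
    return flagged


# Toy data: (reported sex, expression of a few Y-linked genes).
toy_samples = {
    "s1": ("male", [5.2, 4.8, 6.1]),
    "s2": ("female", [0.0, 0.1, 0.0]),
    "s3": ("female", [4.9, 5.5, 5.0]),  # reported female, strong Y signal
}
print(discordant_samples(toy_samples))  # ['s3']
```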

65 of 102

Expression from the Y chromosome suggests misreporting of sex in the SRA

66 of 102

67 of 102

Ellis et al., Nucleic Acids Res. 2018

expression data for ~70,000 human samples

samples

phenotypes

GTEx

N=9,662

TCGA

N=11,284

SRA

N=49,848

samples

expression estimates

gene

exon

junctions

ERs

Answer meaningful questions about human biology and expression

sex   tissue   Cell line?
M     Blood    yes
F     Heart    no
F     Liver    no

68 of 102

What % expressed?

New genes?

Important outliers?

Sequence data

Process/quantify

Metadata

Clean/predict

ACTACTTT

Primary vs. Metastatic

69 of 102

What % expressed?

New genes?

Important outliers?

Sequence data

Process/quantify

Metadata

Clean/predict

ACTACTTT

Primary vs. Metastatic

ACTATTTT

Alt Sequence

70 of 102

71 of 102

72 of 102

73 of 102

Data

Processing

Computing

Metadata

Analysis

74 of 102

Find a researcher with access to patient samples

Collect patient samples and information

Extract DNA/RNA from samples

Sequence samples

Process sequencing data

3-6 months

1-3 months

1 month - 1+ years

~6 months

1-2 wks

2-4 wks

Total: 1 month - 1+ years

Analyze data and answer biological question

Data cleaning

What genes are prognostic for metastasis?

75 of 102

What gene expression patterns are prognostic of colorectal cancer metastasis?

76 of 102

https://pubmed.ncbi.nlm.nih.gov/25049118/

77 of 102

The Kim et al. analysis aimed to identify genes that contribute to metastasis in colon cancer.

N=18

1. Healthy Colon (NC)

2. Primary Cancer (PC)

3. Liver Metastasis (MC)

78 of 102

Predictions can be used to:
(1) Identify studies of interest
(2) Appropriately analyze data

NC: non-cancerous

PC: primary cancer

MC: metastatic cancer

Are the same genes found when sex is included in the analysis?

79 of 102

Predictions can be used to:
(1) Identify studies of interest
(2) Appropriately analyze data

NC: non-cancerous

PC: primary cancer

MC: metastatic cancer

80 of 102

Concordance at the Top (CAT) Plots

How similar are the results from Analysis A and Analysis B?

Analysis A

Analysis B

81 of 102

Concordance at the Top (CAT) Plots

How similar are the results from Analysis A and Analysis B?

Analysis A

Analysis B

10

10

82 of 102

Concordance at the Top (CAT) Plots

How similar are the results from Analysis A and Analysis B?

Analysis A

Analysis B

10

10

83 of 102

Concordance at the Top (CAT) Plots

How similar are the results from Analysis A and Analysis B?

Analysis A

Analysis B

The top results from A and B are the same

Some differences at less significant genes

84 of 102

Concordance at the Top (CAT) Plots

How similar are the results from Analysis A and Analysis B?

Analysis A

Analysis B

The results of the orange condition are less similar between Analysis A and B than the green condition
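A CAT curve can be computed directly from two ranked gene lists: for each k, take the top k genes from each analysis and report the fraction they share. A minimal sketch, assuming each list is ordered from most to least significant and contains no repeated genes; the function name and the toy gene IDs are hypothetical.

```python
from typing import List, Optional, Sequence


def cat_curve(ranked_a: Sequence[str],
              ranked_b: Sequence[str],
              max_k: Optional[int] = None) -> List[float]:
    """Concordance at the top: entry k-1 is |top-k of A ∩ top-k of B| / k."""
    limit = min(len(ranked_a), len(ranked_b))
    if max_k is not None:
        limit = min(limit, max_k)
    curve = []
    for k in range(1, limit + 1):
        shared = set(ranked_a[:k]) & set(ranked_b[:k])
        curve.append(len(shared) / k)
    return curve


# Toy rankings: the top two genes agree, the order then diverges.
analysis_a = ["g1", "g2", "g3", "g4", "g5"]
analysis_b = ["g1", "g2", "g4", "g5", "g3"]
print(cat_curve(analysis_a, analysis_b))
```

A curve that stays near 1 means the two analyses rank the same genes at the top; a curve that drops away from 1 means the top results differ.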

85 of 102

Predictions can be used to:
(1) Identify studies of interest
(2) Appropriately analyze data

NC: non-cancerous

PC: primary cancer

MC: metastatic cancer

86 of 102

Predictions can be used to:
(1) Identify studies of interest
(2) Appropriately analyze data

NC: non-cancerous

PC: primary cancer

MC: metastatic cancer

87 of 102

Predictions can be used to:
(1) Identify studies of interest
(2) Appropriately analyze data

NC: non-cancerous

PC: primary cancer

MC: metastatic cancer

88 of 102

Predictions can be used to:
(1) Identify studies of interest
(2) Appropriately analyze data

89 of 102

Loss of concordance suggests that differential expression is detecting tissue differences, not cancer-related changes.

90 of 102

We have expression data from both healthy liver and colon samples (GTEx)...

91 of 102

So…what if we compared the MC:PC results with differential expression between colon and liver?

Hypothesis: MC:PC results should be most similar to GTEx colon vs. liver

92 of 102

Comparison of results with GTEx colon vs. liver suggests the differential expression results are detecting tissue differences

93 of 102

Data

Processing

Computing

Metadata

Analysis

94 of 102

Data

Processing

Computing

Metadata

Analysis

Software

95 of 102

Study explorer

https://jhubiostatistics.shinyapps.io/recount3-study-explorer/

96 of 102

Bioconductor packages

recount3

snaptron

dasper

97 of 102

Data

Processing

Computing

Metadata

Analysis

Software

Training

98 of 102

General purpose training

https://www.coursera.org/specializations/genomic-data-science#instructors

99 of 102

recount workshops

http://research.libd.org/recountWorkshop/

100 of 102

Thanks to many, many people!

Ben Langmead

Kasper Hansen

Abhi Nellore

Chris Wilkes

Leo Collado Torres

Shannon Ellis

Afrooz Razi

Chris Lo

Alyssa Frazee

Feng Young Chen

Rone Charles

Kai Kammers

Margaret Taub

Jack Fu

  • Many others!

Siruo Wang

Andrew Jaffe

Shije Zheng

Lance Joseph

David Zhang

Eddie Luidy Imada

Jonathan Ling

Brad Solomon

101 of 102

102 of 102

Thank you