Population scale transcriptomics for
precision oncology
Jeff Leek
VP & CDO
Professor Biostatistics Program, PHS
J Orin Edson Foundation Chair
@jtleek
Fred Hutchinson Cancer Center
www.github.com/jtleek/talks
Precision oncology:
A tour of data plumbing challenges
What gene expression patterns are prognostic of colorectal cancer metastasis?
https://pubmed.ncbi.nlm.nih.gov/25049118/
Q: What genes are prognostic of colorectal cancer metastasis?
Data
Processing
Computing
Metadata
Analysis
A: This gene signature is/isn’t a potential prognostic biomarker.
Find a researcher with access to patient samples
Collect patient samples and information
Extract DNA/RNA from samples
Sequence samples
Process sequencing data
3-6 months
1-3 months
1 month - 1+ years
3~6 months
1-2 wks
2-4 wks
Total: 2+ years
Analyze data and answer biological question
Data cleaning
What genes are prognostic for metastasis?
We are at the intersection of two revolutions
SAMPLE SIZE
N = ($ YOU HAVE) / ($ PER SAMPLE)
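The budget rule above amounts to integer division. A minimal sketch, where the function name and the dollar amounts are illustrative assumptions and not figures from the talk:

```python
def affordable_sample_size(budget_dollars: float, cost_per_sample: float) -> int:
    """Largest whole number of samples that the budget covers."""
    if cost_per_sample <= 0:
        raise ValueError("cost_per_sample must be positive")
    return int(budget_dollars // cost_per_sample)


# Hypothetical numbers, for illustration only.
n = affordable_sample_size(budget_dollars=50_000, cost_per_sample=400)
print(n)
```

Halving the per-sample cost doubles N for the same budget, which is why falling sequencing costs matter so much here.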
Langmead & Nellore, Nat Rev. Genet. 2018
http://www.washingtonpost.com/sf/national/2015/06/27/watsons-next-feat-taking-on-cancer/
Data sharing is improving over time
...but data aren’t always easy to use
Find a researcher with access to patient samples
Collect patient samples and information
Extract DNA/RNA from samples
Sequence samples
Process sequencing data
3-6 months
1-3 months
1 month - 1+ years
~6 months
1-2 wks
2-4 wks
Total: 1.5+ years
Analyze data and answer biological question
Data cleaning
What genes are prognostic for metastasis?
DNA: ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
transcription
RNA: AUCAGUCGAUCACCGAU
translation
protein
slide adapted from Alyssa Frazee
Genome
Transcripts
Reads
@22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
+
GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD
@22:16362385-16362561W:ENST00000440999:3:177:-56:294:S/2
GCGTGAGCCACAGGGCCCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCT
+
@=ABBBBIIIIIIIIHHGGGGIIDBDIIIIIIGIIIIHIIIIHFDD@BBDBGGFIDEE8DCC/29>BGFCGHHHGF
@22:16362385-16362561W:ENST00000440999:4:177:137:254:S/1
TCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCAAGGCCTGAACTACCTGCaGTGGGGAGCACCTCAGGGTTT
+
DDGBBCGGGIGGGBDDDHIIGGDGD77=BDIIIIIIIIFHHHHIIIHEFFHGGDD8A>DEGHHIFDDHH8@BEDDI
@22:16362385-16362561W:ENST00000440999:5:177:68:251:S/2
AGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCCTGGAGCGAGTTGTGGATGGCAAAAAGACNCGCC
+
HIGHIHFHEGE4111:.;8@?@HDIIIIIIIEGGIHHHIIGA?=:FIIIDD8.02506A8=AC#############
@22:16362385-16362561W:ENST00000440999:6:177:348:453:S/1
AAGGCCTGAACTACCTGCGGTGGGGAGCACCTCAGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCC
+
B9?@8=42:E@GDEDIIIIIGGHIIIFBEEAGIIDIIDHHGGHIIEGEIIIIIHIHFHFFEEFGGGGGB88>:DGH
@22:51205934-51222090C:ENST00000464740:132:612:223:359:S/2
GGAAGTATGATGCTGATGACAACGTGAAGATCATCTGCCTGGGAGACAGCGCAGTGGGCAAATCCAAACTCATGGA
+
IIEHHHHHIIIIIIIHGGDGHHEDDG8=;?==19;<<>D@@GGGIIHIIHGGDDHGBA=ABEG@@DFCCAA<:=>8
@22:51205934-51222090C:ENST00000464740:125:612:-1:185:S/1
TGGAGTGCGCTGCGGCGCGAGCTGGGCCGGCGGGCGTGGTTCGAGAGCGCGCAGAGTCCAGACTGGCGGCAGGGCC
+
HHIIIHIDGG@;=@GIIIIIDDGBBBEDB@8>5554,/':9B@@C?==@1:2@?=GG=;<HHHHGIIHHEC-;;3?
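The reads above are in FASTQ format: four lines per read (header, sequence, `+`, per-base quality string). A minimal parsing sketch follows, assuming the common Phred+33 quality encoding. The record and function names are hypothetical, and the toy read is made up rather than taken from the slide:

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield one record per 4-line FASTQ entry (header, sequence, '+', qualities)."""
    stripped = (line.rstrip("\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped, None)
        separator = next(stripped, None)
        quality = next(stripped, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Assumption: qualities are ASCII characters offset by 33 (Phred+33).
        yield FastqRecord(header[1:], sequence, [ord(c) - offset for c in quality])


# Toy record for illustration only.
toy = ["@read1", "ACGT", "+", "II#5"]
records = list(parse_fastq(toy))
print(records[0].name, records[0].qualities)
```

Real pipelines would typically rely on an established parser and aligner; this only shows what the four lines mean.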
3 gb
Carefully!!
coverage vector (along the genome, DNA): 2 6 0 11 6
junction counts (along the genome, DNA): 3 3
~71.7 mb
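The two summaries named above can be sketched in a few lines: a coverage vector counts how many aligned reads overlap each genome position, and junction counts tally how many spliced reads support each exon-exon junction. This is an illustrative sketch with hypothetical coordinates and function names, not the pipeline used in the talk:

```python
from collections import Counter
from typing import Iterable, List, Tuple


def coverage_vector(alignments: Iterable[Tuple[int, int]], genome_length: int) -> List[int]:
    """Per-base read depth from half-open (start, end) alignment intervals."""
    diff = [0] * (genome_length + 1)
    for start, end in alignments:
        if not 0 <= start <= end <= genome_length:
            raise ValueError("alignment outside genome bounds")
        diff[start] += 1  # depth rises where a read begins
        diff[end] -= 1    # and falls just past where it ends
    coverage, depth = [], 0
    for delta in diff[:genome_length]:
        depth += delta
        coverage.append(depth)
    return coverage


def junction_counts(junctions: Iterable[Tuple[int, int]]) -> Counter:
    """Number of spliced reads supporting each (donor, acceptor) junction."""
    return Counter(junctions)


# Hypothetical alignments on a toy 8-base genome.
cov = coverage_vector([(0, 4), (2, 6), (5, 8)], genome_length=8)
juncs = junction_counts([(4, 10), (4, 10), (4, 10), (12, 20)])
print(cov, dict(juncs))
```

Storing these summaries instead of raw reads is what shrinks the data enough to share at scale.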
SRA
Human
RNA-seq
Illumina
≈22,000 samples
≈50,000 samples
≈60,000 samples
≈247,000 samples
http://rail.bio/
Slide courtesy Ben Langmead
http://blogs.citrix.com/2012/10/17/announcing-general-availability-of-sharefile-with-storagezones/
slide adapted from andrew jaffe
Obstacle: our research moves (spot) markets
Spike in market price due to preprocessing job flows
Expression data for ~70,000 human samples
GTEx: N=9,662
TCGA: N=11,284
SRA: N=49,848
samples × expression estimates (gene, exon, junctions, ERs)
< 1,000 junctions
> 20,000 junctions
=
A global view of transcript variability
Nellore et al. Genome Biology 2016
We discovered and redefined hundreds of human genes using these data!
What % expressed?
New genes?
Important outliers?
Prognostic signatures
Sequence data
Process/quantify
ACTACTTT
Metadata
Clean/predict
Answer meaningful questions about human biology and expression
samples × phenotypes: ?
SRA phenotype information is far from complete
| | Sex | Tissue | Race | Age |
| 6620 | female | liver | NA | NA |
| 6621 | female | liver | NA | NA |
| 6622 | female | liver | NA | NA |
| 6623 | female | liver | NA | NA |
| 6624 | female | liver | NA | NA |
| 6625 | male | liver | NA | NA |
| 6626 | male | liver | NA | NA |
| 6627 | male | liver | NA | NA |
| 6628 | male | liver | NA | NA |
| 6629 | male | liver | NA | NA |
| 6630 | male | liver | NA | NA |
| 6631 | NA | blood | NA | NA |
| 6632 | NA | blood | NA | NA |
| 6633 | NA | blood | NA | NA |
| 6634 | NA | blood | NA | NA |
| 6635 | NA | blood | NA | NA |
| 6636 | NA | blood | NA | NA |
Even when information is provided, it’s not always clear…
Sex across the SRA:
| Level | Frequency |
| F | 95 |
| female | 2036 |
| Female | 51 |
| M | 77 |
| male | 1240 |
| Male | 141 |
| Total | 3640 |
“1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”, “DK”, “male and female”, “Male (note: ….)”, “missing”, “mixed”, “mixture”, “N/A”, “Not available”, “not applicable”, “not collected”, “not determined”, “pooled male and female”, “U”, “unknown”, “Unknown”
| # of NAs | # w/sex assigned |
| 44,957 | 4,700 |
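Collapsing the inconsistent labels above is a small normalization step. A minimal sketch, assuming that only the six unambiguous spellings from the frequency table map to a sex and everything else (pooled, mixed, unknown) is treated as missing. The function name and the mapping policy are assumptions, not the talk's method:

```python
from collections import Counter
from typing import Optional

FEMALE_LABELS = {"f", "female"}
MALE_LABELS = {"m", "male"}


def normalize_sex(raw: Optional[str]) -> Optional[str]:
    """Map a free-text sex label to 'female', 'male', or None (missing/ambiguous)."""
    if raw is None:
        return None
    value = raw.strip().lower()
    if value in FEMALE_LABELS:
        return "female"
    if value in MALE_LABELS:
        return "male"
    return None  # e.g. "pooled male and female", "DK", "not collected"


# Frequencies copied from the "Sex across the SRA" table.
reported = Counter({"F": 95, "female": 2036, "Female": 51,
                    "M": 77, "male": 1240, "Male": 141})
clean = Counter()
for label, count in reported.items():
    clean[normalize_sex(label)] += count
print(dict(clean))
```

Normalization only helps the minority of samples that report anything at all, which motivates predicting the phenotype from expression instead.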
Goal: to accurately predict critical phenotype information for all samples in recount2
gene, exon, exon-exon junction and expressed region RNA-Seq data
SRA
Sequence Read Archive
N=49,848
TCGA
The Cancer Genome Atlas
N=11,284
GTEx
Genotype Tissue Expression Project
N=9,662
Missingness limited in GTEx phenotype data
Sex across GTEx:
| Level | Frequency |
| female | 3,626 |
| male | 6,036 |
| NA | 0 |
GTEx:
| | Sex | Tissue | Race | Age |
| 1 | male | Lung | White | 59 |
| 2 | male | Brain | White | 27 |
| 3 | female | Heart | Black or African American | 23 |
| 4 | male | Brain | White | 51 |
| 5 | male | Skin | White | 27 |
| 6 | male | Lung | White | 68 |
| 7 | female | Brain | White | 61 |
| 8 | female | Adipose Tissue | White | 42 |
| 9 | male | Brain | White | 40 |
| 10 | female | Uterus | White | 33 |
| 11 | female | Nerve | White | 60 |
| 12 | male | Muscle | White | 54 |
| 13 | female | Ovary | White | 31 |
| 14 | male | Blood | White | 53 |
| 15 | female | Brain | White | 56 |
| 16 | male | Muscle | White | 44 |
Goal: to accurately predict critical phenotype information for all samples in recount2
Ellis et al. Nuc. Acids Res. 2018
gene, exon, exon-exon junction and expressed region RNA-Seq data
SRA
Sequence Read Archive
N=49,848
divide samples
build and optimize phenotype predictor
predict phenotypes across SRA samples
test accuracy of predictor
TCGA
The Cancer Genome Atlas
N=11,284
GTEx
Genotype Tissue Expression Project
N=9,662
predict phenotypes across samples in TCGA
Training Data
Validation Data
Test Data
Validation Data
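The "divide samples" step can be sketched as a reproducible random split. The slides do not say how the GTEx samples were divided, so the even split, the seed, and the sample IDs below are all assumptions for illustration:

```python
import random
from typing import List, Sequence, Tuple


def split_samples(sample_ids: Sequence[str],
                  train_fraction: float = 0.5,
                  seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle sample IDs reproducibly and split them into training and validation sets."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split can be reproduced
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# Hypothetical sample identifiers.
all_ids = [f"sample_{i}" for i in range(10)]
train_ids, validation_ids = split_samples(all_ids)
print(len(train_ids), len(validation_ids))
```

The predictor is built only on the training half; the validation half, TCGA, and the SRA samples with reported labels are never seen during fitting.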
Sex prediction is accurate across data sets
| | GTEx (training) | GTEx (validation) | TCGA (test) | SRA |
| Number of Regions | 40 | 40 | 40 | 40 |
| Number of Samples (N) | 4,769 | 4,769 | 11,245 | 3,640 |
| Accuracy | 99.9% | 99.8% | 99.0% | 86.3% |
To assess misreporting of sex in the SRA, we can use Y-chromosome expression
XX
XY
reported female
reported male
predicted female
predicted male
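Comparing reported and predicted sex comes down to a cross-tabulation plus an accuracy figure over the samples that have a reported label. A minimal sketch with made-up labels (the function names and the data are hypothetical):

```python
from collections import Counter
from typing import Optional, Sequence


def cross_tabulate(reported: Sequence[Optional[str]],
                   predicted: Sequence[str]) -> Counter:
    """Count samples in each (reported, predicted) cell."""
    return Counter(zip(reported, predicted))


def accuracy(reported: Sequence[Optional[str]], predicted: Sequence[str]) -> float:
    """Fraction of samples with a reported label whose prediction matches it."""
    pairs = [(r, p) for r, p in zip(reported, predicted) if r is not None]
    if not pairs:
        raise ValueError("no samples with a reported label")
    return sum(r == p for r, p in pairs) / len(pairs)


# Made-up labels for illustration only.
reported = ["female", "female", "male", "male", "male", None]
predicted = ["female", "male", "male", "male", "male", "female"]
table = cross_tabulate(reported, predicted)
print(table[("female", "male")], accuracy(reported, predicted))
```

A discordant cell such as (reported female, predicted male) does not say which side is wrong; an independent signal like Y-chromosome expression is what lets misreporting be told apart from prediction error.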
Expression from the Y chromosome suggests misreporting of sex in the SRA
Ellis et al. Nuc. Acids Res. 2018
| sex | tissue | Cell line? |
| M | Blood | yes |
| F | Heart | no |
| F | Liver | no |
What % expressed?
New genes?
Important outliers?
Sequence data
Process/quantify
Metadata
Clean/predict
ACTACTTT
Primary vs. Metastatic
ACTATTTT
Alt Sequence
Find a researcher with access to patient samples
Collect patient samples and information
Extract DNA/RNA from samples
Sequence samples
Process sequencing data
3-6 months
1-3 months
1 month - 1+ years
~6 months
1-2 wks
2-4 wks
Total: 1 month - 1+ years
Analyze data and answer biological question
Data cleaning
What genes are prognostic for metastasis?
What gene expression patterns are prognostic of colorectal cancer metastasis?
https://pubmed.ncbi.nlm.nih.gov/25049118/
The Kim et al. analysis sought to identify genes that contribute to metastasis in colon cancer.
N=18
1. Healthy Colon (NC)
2. Primary Cancer (PC)
3. Liver Metastasis (MC)
Predictions can be used to:
(1) identify studies of interest
(2) appropriately analyze data
NC: non-cancerous
PC: primary cancer
MC: metastatic cancer
Are the same genes found when sex is included in the analysis?
Concordance at the Top (CAT) Plots
How similar are the results from Analysis A and Analysis B?
Analysis A
Analysis B
The top results from A and B are the same
Some differences at less significant genes
The results of the orange condition are less similar between Analysis A and B than the green condition
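A CAT curve can be computed directly from two ranked gene lists: for each k, take the top k genes from each analysis and report the fraction they share. A minimal sketch, assuming each list holds unique gene names ordered from most to least significant. The gene names are placeholders:

```python
from typing import List, Optional, Sequence


def cat_curve(ranked_a: Sequence[str],
              ranked_b: Sequence[str],
              max_k: Optional[int] = None) -> List[float]:
    """Concordance at the top: shared fraction of the top-k genes, for k = 1..max_k."""
    limit = min(len(ranked_a), len(ranked_b))
    if max_k is None:
        max_k = limit
    if not 0 <= max_k <= limit:
        raise ValueError("max_k exceeds the shorter ranked list")
    top_a, top_b = set(), set()
    curve = []
    for k in range(1, max_k + 1):
        top_a.add(ranked_a[k - 1])
        top_b.add(ranked_b[k - 1])
        curve.append(len(top_a & top_b) / k)
    return curve


# Placeholder gene rankings from two hypothetical analyses.
analysis_a = ["g1", "g2", "g3", "g4"]
analysis_b = ["g1", "g2", "g4", "g5"]
curve = cat_curve(analysis_a, analysis_b)
print(curve)
```

A curve that stays near 1 means the two analyses agree on their top hits; a curve that drops early, like the orange condition described above, means the rankings diverge.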
Loss of concordance suggests that differential expression is detecting tissue differences, not cancer-related changes.
We have expression data from both healthy liver and colon samples (GTEx)...
So…what if we compared the MC:PC results with differential expression between colon and liver?
Hypothesis: MC:PC results should be most similar to GTEx colon vs. liver
Comparison with GTEx colon vs. liver suggests the differential expression results are detecting tissue differences
Data
Processing
Computing
Metadata
Analysis
Software
Study explorer
https://jhubiostatistics.shinyapps.io/recount3-study-explorer/
Bioconductor packages
recount3
snaptron
dasper
Data
Processing
Computing
Metadata
Analysis
Software
Training
General purpose training
https://www.coursera.org/specializations/genomic-data-science#instructors
recount workshops
http://research.libd.org/recountWorkshop/
Thanks to many, many people!
Ben Langmead
Kasper Hansen
Abhi Nellore
Chris Wilkes
Leo Collado Torres
Shannon Ellis
Afrooz Razi
Chris Lo
Alyssa Frazee
Feng Young Chen
Rone Charles
Kai Kammers
Margaret Taub
Jack Fu
Siruo Wang
Andrew Jaffe
Shije Zheng
Lance Joseph
David Zhang
Eddie Luidy Imada
Jonathan Ling
Brad Solomon
Thank you