Upcycling genomics data:
From publicly available "junk" to priceless "treasure"
Shannon E. Ellis
What makes primary cancer different than metastatic cancer?
www.hopkinsmedicine.org
Find a researcher with access to patient samples
What makes primary cancer different than metastatic cancer?
Collect patient samples and information
Extract DNA/RNA from samples
Sequence samples
Process sequencing data
3-6 months
1-3 months
1 month - 1+ years
3~6 months
1-2 wks
2-4 wks
Total: 2+ years
Analyze data and answer biological question
Data cleaning
Biologists have recently gotten pretty good at making their data available to the public...
...but they’re not great at making these data easily accessible and well-annotated.
Find a researcher with access to patient samples
What makes primary cancer different than metastatic cancer?
Collect patient samples and information
Extract DNA/RNA from samples
Sequence samples
Process sequencing data
Data cleaning
3-6 months
1-3 months
1 month - 1+ years
3~6 months
1-2 wks
2-4 wks
Total: 1+ years
Analyze data and answer biological question
Project Goal: Take publicly available RNA-Seq data and make it available and easy-to-use
Easy to Get
Comprehensive
Useful for future study
Genetics 101
The Central Dogma of Genetics
ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
DNA
slide adapted from jeff leek
TGACTGGATCTAGTCAGCTAGCTAGCATATGCTAATGTTTTAGTAGCCGTA
The Central Dogma of Genetics
ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
DNA
slide adapted from alyssa frazee
slide adapted from jeff leek
gene
exons
The Central Dogma of Genetics
AUCAGUCGAUCACCGAU
transcription
translation
ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
DNA
slide adapted from alyssa frazee
slide adapted from jeff leek
RNA
proteins
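The two DNA strands and the transcription step shown on these slides can be reproduced in a few lines of R. This is a minimal sketch: the DNA string is the one on the slide, while the choice of bases 11 to 21 as the transcribed stretch is an assumption made only for illustration.

```r
# DNA strand as printed on the slide
dna <- "ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT"

# The paired strand: A pairs with T, C pairs with G
complement <- chartr("ACGT", "TGCA", dna)
print(complement)

# Transcription: the RNA reads like the DNA strand above, with U in place of T.
# Bases 11 to 21 are an assumed stretch, picked for illustration only.
rna <- chartr("T", "U", substr(dna, 11, 21))
print(rna)
```

The complement matches the second strand drawn on the first Central Dogma slide, and the short RNA matches the first 11 bases of the RNA shown on the transcription slide.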
Two copies of DNA -> many transcripts -> many proteins
| DNA | RNA | proteins |
role in the cell | blueprint | messenger | carry out cellular functions |
# copies/cell | 2 | varies (~360,000) | varies (~10^10) |
functional unit | gene | transcript | proteins (metabolites, hormones, etc.) |
# unique functional units | 20,000 | ~100,000 | ~100,000 |
Variability at the level of RNA allows for a heart cell to function differently than a brain cell
!=
Measuring RNA levels
slide adapted from jeff leek
Next Generation Sequencing (NGS) Has Completely Revolutionized How We Study Genetics
Next Generation Sequencing (NGS) in one slide
RNA
Step 1: Extract RNA to get sample of interest
Step 2: Chop up RNA into smaller pieces
Step 3: Sequence the sample
Step 4: Obtain short read data from the sequencer
Next Generation Sequencing (NGS) in one slide
RNA
AUCAGUCGAUCACCGAU
A short read tells you the sequence of the RNA in that read
slide adapted from jeff leek
Sequence Identifier
Sequence
Quality scores
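The three labelled parts of a short read correspond to lines of a FASTQ record. Here is a minimal sketch of pulling them apart in R. The record itself is made up, and the quality line is assumed to use the common Phred+33 encoding.

```r
# One made-up FASTQ record: identifier, sequence, separator, quality string
record <- c("@read_1", "GATTACA", "+", "IIIIH#5")

identifier <- sub("^@", "", record[1])
sequence   <- record[2]

# Phred+33: each quality character encodes a score as (ASCII code - 33)
phred <- utf8ToInt(record[4]) - 33L

# A score Q corresponds to an error probability of 10^(-Q/10)
error_probability <- 10^(-phred / 10)

print(data.frame(base = strsplit(sequence, "")[[1]], phred, error_probability))
```

Each base gets its own quality score, so the sequence and quality strings always have the same length.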
We’ve got 40M+ reads.
What does that all mean?
We first need to align these reads back to the genome
Genome
(DNA)
slide adapted from jeff leek
The Human Genome Project (HGP) published a draft reference sequence of the human genome in 2001 and completed it in 2003.
We first need to align these reads back to the genome
Genome
(DNA)
slide adapted from jeff leek
Nellore et al. (2016)
Bioinformatics
http://rail.bio/
We can then align each short read back to the reference genome to figure out where it came from.
We first need to align these reads back to the genome
coverage vector: 2, 6, 0, 11, 6
Genome
(DNA)
slide adapted from jeff leek
The number of reads at each position lets us know abundance
We then summarize across all the positions in a gene to get an estimate for gene expression.
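The idea of counting reads per position and then summarizing over a gene can be sketched in base R. Everything here is a toy assumption: the read coordinates, the 10-base reference, and the gene boundaries are invented, and real pipelines work from alignment files instead of a hand-typed table.

```r
# Hypothetical alignments: 1-based start and end of each read on a 10-base toy reference
reads <- data.frame(start = c(1, 1, 2, 4, 6), end = c(3, 4, 5, 7, 10))
reference_length <- 10

# Coverage vector: number of reads overlapping each position
coverage <- integer(reference_length)
for (i in seq_len(nrow(reads))) {
  covered <- reads$start[i]:reads$end[i]
  coverage[covered] <- coverage[covered] + 1L
}
print(coverage)

# Summarize across the positions of a made-up gene spanning positions 2 to 7
gene <- 2:7
gene_expression <- sum(coverage[gene])
print(gene_expression)
```

Summing is one possible summary; a mean over the gene's positions would serve equally well for a sketch like this.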
RNA-Seq = estimate expression across entire genome
estimate expression!
expression ≅ # RNA-Seq reads
Project Goal: Take publicly available RNA-Seq data and make it available and easy-to-use
Easy to Get
Comprehensive
Useful for future study
GTEx
https://commonfund.nih.gov/GTEx
TCGA
SRA
Project | No. of Samples |
GTEx (Genotype-Tissue Expression Project) | 9,662 |
TCGA (The Cancer Genome Atlas) | 11,284 |
SRA (Sequence Read Archive) | 49,848 |
We’ll take these ~70,000 samples, align each back to the reference genome, and then, for each sample, we’ll estimate expression across the genome.
Project Goal: Take publicly available RNA-Seq data and make it available and easy-to-use
Easy to Get
Comprehensive
Useful for future study
X
Expression data for ~70,000 human samples
[Figure: matrix of samples by expression estimates, summarized at four levels: gene, exon, junctions, ERs]
Find a researcher with access to patient samples
What makes primary cancer different than metastatic cancer?
Collect patient samples and information
Extract DNA/RNA from samples
Sequence samples
Process sequencing data
Data cleaning
3-6 months
1-3 months
1 month - 1+ years
3~6 months
1-2 wks
2-4 wks
Determine expression differences between primary cancer and metastatic cancer
Find publicly available RNA-sequencing data from primary cancer and metastasis
Total: 1+ years
Find a researcher with access to patient samples
Collect patient samples and information
Extract DNA/RNA from samples
Sequence samples
Process sequencing data
Data cleaning
3-6 months
1-3 months
1 month - 1+ years
3~6 months
1-2 wks
2-4 wks
Determine expression differences between primary cancer and metastatic cancer
Get already processed and summarized RNA-Seq data from recount2
Total: months
What makes primary cancer different than metastatic cancer?
Since September of this year, 555 different people have accessed data in recount2
These 555 people have accessed 417,488 files (91,003 unique) in recount2
Project Goal: Take publicly available RNA-Seq data and make it available and easy-to-use
Easy to Get
Comprehensive
Useful for future study
X
X
Expression data for ~70,000 human samples
[Figure: matrix of samples by expression estimates (gene, exon, junctions, ERs) for GTEx (N=9,662), TCGA (N=11,284), and SRA (N=49,848), next to a samples-by-phenotypes matrix marked "?"]
Answer meaningful questions about human biology and expression
SRA phenotype information is far from complete
| Sex | Tissue | Race | Age |
6620 | female | liver | NA | NA |
6621 | female | liver | NA | NA |
6622 | female | liver | NA | NA |
6623 | female | liver | NA | NA |
6624 | female | liver | NA | NA |
6625 | male | liver | NA | NA |
6626 | male | liver | NA | NA |
6627 | male | liver | NA | NA |
6628 | male | liver | NA | NA |
6629 | male | liver | NA | NA |
6630 | male | liver | NA | NA |
6631 | NA | blood | NA | NA |
6632 | NA | blood | NA | NA |
6633 | NA | blood | NA | NA |
6634 | NA | blood | NA | NA |
6635 | NA | blood | NA | NA |
6636 | NA | blood | NA | NA |
SRA
Even when information is provided, it’s not always clear…
Category | Frequency |
F | 95 |
female | 2036 |
Female | 51 |
M | 77 |
male | 1240 |
Male | 141 |
Total | 3640 |
“1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”, “DK”, “male and female”, “Male (note: ….)”, “missing”, “mixed”, “mixture”, “N/A”, “Not available”, “not applicable”, “not collected”, “not determined”, “pooled male and female”, “U”, “unknown”, “Unknown”
# of NAs | # w/sex assigned |
44,957 | 4,700 |
SRA
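The spelling problem in the reported sex field can be illustrated with a small harmonization function. This is a sketch only: `normalize_sex` is a hypothetical helper, the six spellings and their counts are taken from the frequency table, and three of the ambiguous labels are appended to show that anything unrecognized becomes NA.

```r
# Hypothetical helper: collapse spelling variants, leave everything else as NA
normalize_sex <- function(x) {
  key <- tolower(trimws(x))
  out <- rep(NA_character_, length(key))
  out[key %in% c("f", "female")] <- "female"
  out[key %in% c("m", "male")] <- "male"
  out
}

# Rebuild the reported values from the frequency table
reported <- rep(c("F", "female", "Female", "M", "male", "Male"),
                times = c(95, 2036, 51, 77, 1240, 141))

# Add a few of the ambiguous labels listed on the slide
reported <- c(reported, "mixed", "pooled male and female", "unknown")

harmonized <- normalize_sex(reported)
print(table(harmonized, useNA = "ifany"))
```

Harmonizing the labels only helps the samples that have a usable label; the samples with no sex reported still need another source of information.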
in-silico Phenotyping
slide adapted from jeff leek
Goal: to accurately predict critical phenotype information for all samples in recount2
gene, exon, exon-exon junction and expressed region RNA-Seq data
SRA
Sequence Read Archive
N=49,848
TCGA
The Cancer Genome Atlas
N=11,284
GTEx
Genotype Tissue Expression Project
N=9,662
Machine Learning: Making predictions from data
Data Set #1
Data Set #2
Training Data
Data used to build the predictor
We’re interested in predicting phenotype from expression data...
There are a number of different curves you could fit through these data
This curve fits every point in the training data perfectly...
Machine Learning: Making predictions from data
Data Set #1
Data Set #2
Training Data
Data used to build the predictor
Validation
Data
Samples held back from training
What if we tried to predict phenotype in the validation data?
The curve no longer fits every point perfectly.
Machine Learning: Making predictions from data
Data Set #1
Data Set #2
Training Data: Data used to build the predictor
Validation Data: Samples held back from training
Test Data: Independent data set to test predictor
We can now test prediction accuracy in an independent set of samples.
The line generated from the training data accurately predicts phenotype in the test data
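The contrast between a curve that hits every training point and a simple line can be sketched with base R. All numbers are made up for illustration. A degree-4 polynomial through five training points fits them exactly, which is the "perfect" curve, while the straight line does not.

```r
# Made-up training and validation samples (x = expression, y = phenotype)
train <- data.frame(x = c(1, 3, 5, 7, 9),  y = c(2.5, 5.1, 11.2, 13.6, 18.9))
valid <- data.frame(x = c(2, 4, 6, 8, 10), y = c(4.3, 8.4, 11.5, 16.8, 19.6))

# Root mean squared error between observed and predicted values
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# A curve flexible enough to pass through every training point, and a plain line
wiggly <- lm(y ~ poly(x, 4), data = train)
line   <- lm(y ~ x, data = train)

errors <- data.frame(
  model      = c("degree-4 curve", "straight line"),
  training   = c(rmse(train$y, fitted(wiggly)), rmse(train$y, fitted(line))),
  validation = c(rmse(valid$y, predict(wiggly, valid)), rmse(valid$y, predict(line, valid)))
)
print(errors)
```

The training column always favors the flexible curve. The validation column is the one to read, since it shows how each model does on samples it has not seen.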
Goal: to accurately predict critical phenotype information for all samples in recount2
gene, exon, exon-exon junction and expressed region RNA-Seq data
SRA
Sequence Read Archive
N=49,848
TCGA
The Cancer Genome Atlas
N=11,284
GTEx
Genotype Tissue Expression Project
N=9,662
build and optimize phenotype predictor
Training Data
Missingness limited in GTEx phenotype data
Category | Frequency |
female | 3,626 |
male | 6,036 |
NA | 0 |
| Sex | Tissue | Race | Age |
1 | male | Lung | White | 59 |
2 | male | Brain | White | 27 |
3 | female | Heart | Black or African American | 23 |
4 | male | Brain | White | 51 |
5 | male | Skin | White | 27 |
6 | male | Lung | White | 68 |
7 | female | Brain | White | 61 |
8 | female | Adipose Tissue | White | 42 |
9 | male | Brain | White | 40 |
10 | female | Uterus | White | 33 |
11 | female | Nerve | White | 60 |
12 | male | Muscle | White | 54 |
13 | female | Ovary | White | 31 |
14 | male | Blood | White | 53 |
15 | female | Brain | White | 56 |
16 | male | Muscle | White | 44 |
GTEx
Goal: to accurately predict critical phenotype information for all samples in recount2
gene, exon, exon-exon junction and expressed region RNA-Seq data
GTEx (Genotype Tissue Expression Project, N=9,662): divide samples
Training Data: build and optimize phenotype predictor
Validation Data: test accuracy of predictor
TCGA (The Cancer Genome Atlas, N=11,284)
Test Data: predict phenotypes across samples in TCGA
SRA (Sequence Read Archive, N=49,848): predict phenotypes across SRA samples
phenopredict functions:
filter_regions()
build_predictor()
test_predictor()
extract_data()
predict_pheno()
Prediction is done using linear regression
New sample’s expression -> New sample’s predicted phenotype
phenopredict functions
[Figure: expression by sex (male, female)]
1. filter_regions(): Identify regions with differential expression for each level
Output: Set of discriminatory regions
2. build_predictor(): Extract coefficient estimates across regions (expression ~ phenotype)
Output: Relationship between expression and phenotype, a matrix of filtered regions (r) by phenotype levels (P: male, female)
3. test_predictor(): Predict phenotype and assess accuracy in training set data
Likelihood of phenotype for each individual:
male | female |
0.99 | 0.01 |
0.02 | 0.98 |
0.04 | 0.96 |
0.98 | 0.02 |
Assign phenotype to most likely category:
predictions | reported |
male | male |
female | female |
female | female |
male | male |
Prediction accuracy: 100%
4. extract_data(): Extract expression information at regions identified by filter_regions() in a new data set
Output: expression @ filtered regions for the new data set samples
5. predict_pheno(): Predict phenotypes across samples in this new data set (apply coefficient estimates to the extracted data)
predictions in new data set: male, male, male, female
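The five steps named on these slides can be mimicked on toy data in base R. This is a rough sketch and not the phenopredict implementation: the `*_sketch` functions are hypothetical stand-ins, the expression values are invented, the filter simply ranks regions by the gap between level means, and the final call uses a nearest-profile rule (least squares distance to each level's coefficients), which is a swapped-in simplification.

```r
# Toy training data (all values made up): 4 regions (rows) x 6 samples (columns)
expr <- rbind(
  region1 = c(9, 10, 11, 1, 2, 0),
  region2 = c(5, 6, 5, 6, 5, 6),
  region3 = c(0, 1, 0, 8, 9, 10),
  region4 = c(3, 3, 4, 3, 4, 3)
)
sex <- factor(c("male", "male", "male", "female", "female", "female"))

# Stand-in for filter_regions(): keep the n regions whose mean differs most between levels
filter_regions_sketch <- function(expr, pheno, n = 2) {
  level_means <- sapply(levels(pheno), function(l) {
    rowMeans(expr[, pheno == l, drop = FALSE])
  })
  spread <- apply(level_means, 1, function(m) max(m) - min(m))
  names(sort(spread, decreasing = TRUE))[seq_len(n)]
}

# Stand-in for build_predictor(): fit expression ~ phenotype per region.
# Without an intercept, each coefficient is that level's mean expression.
build_predictor_sketch <- function(expr, pheno, regions) {
  coefs <- t(sapply(regions, function(r) coef(lm(expr[r, ] ~ 0 + pheno))))
  colnames(coefs) <- levels(pheno)
  coefs
}

# Stand-in for extract_data() + predict_pheno(): pull the filtered regions,
# then assign each sample to the level with the closest coefficient profile
predict_pheno_sketch <- function(expr, coefs) {
  at_regions <- expr[rownames(coefs), , drop = FALSE]
  apply(at_regions, 2, function(x) {
    colnames(coefs)[which.min(colSums((coefs - x)^2))]
  })
}

regions <- filter_regions_sketch(expr, sex)
coefs   <- build_predictor_sketch(expr, sex, regions)

# Stand-in for test_predictor(): compare predictions with reported values
training_accuracy <- mean(predict_pheno_sketch(expr, coefs) == as.character(sex))

# Two made-up samples from a "new data set"
new_expr <- rbind(region1 = c(10, 1), region2 = c(5, 6),
                  region3 = c(1, 9), region4 = c(3, 4))
colnames(new_expr) <- c("new1", "new2")
new_predictions <- predict_pheno_sketch(new_expr, coefs)

print(regions)
print(coefs)
print(training_accuracy)
print(new_predictions)
```

On this toy data the two informative regions are kept and every training sample is recovered, which mirrors the "Prediction accuracy: 100%" shown for the four-sample illustration.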
gene, exon, exon-exon junction and expressed region RNA-Seq data
GTEx (Genotype Tissue Expression Project, N=9,662): Training Data and Validation Data
TCGA (The Cancer Genome Atlas, N=11,284): Test Data
SRA (Sequence Read Archive, N=49,848)
phenopredict functions: filter_regions(), build_predictor(), test_predictor(), extract_data(), predict_pheno()
Make predictions!
Accuracy: 100% | 100% | 100% | 100% |
Sex prediction is accurate across data sets
| GTEx (training) | GTEx (validation) | TCGA (test) | SRA |
Number of Regions | 40 | 40 | 40 | 40 |
Number of Samples (N) | 4,769 | 4,769 | 11,245 | 3,640 |
Accuracy | 99.9% | 99.8% | 99.0% | 86.3% |
http://www.rna-seqblog.com/
Can we use expression data to predict tissue?
Tissue prediction is accurate across data sets
| GTEx (training) | GTEx (validation) | TCGA (test) | SRA |
Number of Regions | 2,281 | 2,281 | 2,281 | 2,281 |
Number of Samples (N) | 4,769 | 4,769 | 7,193 | 8,951 |
Accuracy | 97.7% | 96.6% | 76.8% | 51.9% |
Prediction is more accurate in healthy tissue
| GTEx (training) | GTEx (validation) | TCGA: healthy tissue | TCGA: cancer | SRA |
Number of Regions | 2,281 | 2,281 | 2,281 | 2,281 | 2,281 |
Number of Samples (N) | 4,769 | 4,769 | 613 | 6,579 | 8,951 |
Accuracy | 97.7% | 96.6% | 92.7% | 75.3% | 51.9% |
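As a quick arithmetic check, the overall TCGA tissue accuracy should be the sample-weighted average of the healthy and cancer subsets. The sketch below uses only the sample counts and accuracies printed on these slides; the rounding to one decimal is an assumption about how the slide values were reported.

```r
# Sample counts and tissue prediction accuracy for the two TCGA subsets
n        <- c(healthy = 613, cancer = 6579)
accuracy <- c(healthy = 0.927, cancer = 0.753)

# Sample-weighted average across the two subsets
pooled <- sum(n * accuracy) / sum(n)
pooled_percent <- round(100 * pooled, 1)
print(pooled_percent)
```

The result lines up with the 76.8% reported for TCGA as a whole, so the lower overall figure is driven by the much larger cancer subset. The two subset sizes add up to 7,192, one fewer than the 7,193 listed for the full TCGA test set.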
Across the samples in recount2, brain, blood, and skin are the three most frequently predicted tissue types.
A sample reported to be Intestine is predicted to be Colon. That makes good sense.
A sample reported to be Breast is often predicted to be Adipose Tissue. Makes enough sense too…
But sometimes the predictions and reported tissues make less sense…
¯\_(ツ)_/¯
Tissue prediction is largely accurate across recount2
Tissue can be accurately predicted from expression data.
Discordant predictions are often made to biologically similar tissues.
Sometimes, predictions are inaccurate.
Project Goal: Take publicly available RNA-Seq data and make it available and easy-to-use
Easy to Get
Comprehensive
Useful for future study
X
X
X
Find a researcher with access to patient samples
Collect patient samples and information
Extract DNA/RNA from samples
Sequence samples
Process sequencing data
Data cleaning
3-6 months
1-3 months
1 month - 1+ years
3~6 months
1-2 wks
2-4 wks
Determine expression differences between primary cancer and metastatic cancer
Get already processed and summarized RNA-Seq data from recount2 with sample information
Total: months
What makes primary cancer different than metastatic cancer?
Ok. Ok. What about actually using all of these predictions…?
What makes primary cancer different than metastatic cancer?
www.hopkinsmedicine.org
Molecular Oncology, July 2014
Kim et al. set out to identify genes that contribute to metastasis in colon cancer.
N=18
1. Healthy Colon (NC)
2. Primary Cancer (PC)
3. Liver Metastasis (MC)
NC vs. PC
MC vs. PC
Predictions can be used to:
(1) identify studies of interest
(2) appropriately analyze data
NC: non-cancerous
PC: primary cancer
MC: metastatic cancer
Are the same genes found when sex is included in the analysis?
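Why including sex can change the answer is easiest to see on a toy example. The sketch below uses invented, noise-free expression values for a single gene, with more males in the MC group than in the PC group. It is not the Kim et al. data, and a real RNA-Seq analysis would normally use a dedicated package such as limma or DESeq2 in place of a plain `lm()`.

```r
# Made-up design: 6 primary (PC) and 6 metastatic (MC) samples, sex unbalanced across groups
toy <- data.frame(
  group = factor(rep(c("PC", "MC"), each = 6), levels = c("PC", "MC")),
  sex   = factor(c("male", "male", "female", "female", "female", "female",
                   "male", "male", "male", "male", "female", "female"))
)

# Made-up expression for one gene: baseline 5, +2 in MC, +3 in males, no noise
toy$expression <- 5 + 2 * (toy$group == "MC") + 3 * (toy$sex == "male")

unadjusted <- lm(expression ~ group, data = toy)        # ignores sex
adjusted   <- lm(expression ~ group + sex, data = toy)  # includes (predicted) sex

print(coef(unadjusted)["groupMC"])  # MC effect mixed with the sex imbalance
print(coef(adjusted)["groupMC"])    # MC effect with sex accounted for
```

Here the unadjusted model reports an MC effect of 3, while the adjusted model recovers the 2 that was built into the data. The difference comes entirely from the extra males in the MC group.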
Find a researcher with access to patient samples
Collect patient samples and information
Extract DNA/RNA from samples
Sequence samples
Process sequencing data
Data cleaning
3-6 months
1-3 months
1 month - 1+ years
3~6 months
1-2 wks
2-4 wks
Determine expression differences between primary cancer and metastatic cancer
Total: days-months
What makes primary cancer different than metastatic cancer?
Finally, what if YOU want to use recount2...?
predictions (v0.0.06)
sample_id | dataset | reported_sex | predicted_sex | accuracy_sex | … | reported_tissue | predicted_tissue | accuracy_tissue |
SRR660824 | gtex | male | male | 0.999 | … | lung | lung | 0.977 |
SRR2166176 | gtex | male | male | 0.999 | … | brain | brain | 0.977 |
SRR606939 | gtex | female | female | 0.999 | … | heart | heart | 0.966 |
SRR2167642 | gtex | male | male | 0.999 | … | brain | brain | 0.966 |
SRR2165473 | gtex | male | male | 0.999 | … | skin | skin | 0.966 |
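One simple use of a table like this is to check how often the reported and predicted values agree. The sketch below retypes the five rows shown above into a data frame; with real data the same columns would come from the predictions file itself, and the column names here are assumed to match the header on this slide.

```r
# The five rows shown on the slide, typed in by hand
preds <- data.frame(
  sample_id        = c("SRR660824", "SRR2166176", "SRR606939", "SRR2167642", "SRR2165473"),
  reported_sex     = c("male", "male", "female", "male", "male"),
  predicted_sex    = c("male", "male", "female", "male", "male"),
  reported_tissue  = c("lung", "brain", "heart", "brain", "skin"),
  predicted_tissue = c("lung", "brain", "heart", "brain", "skin"),
  stringsAsFactors = FALSE
)

# Fraction of samples where the prediction agrees with the reported value
concordance <- c(
  sex    = mean(preds$reported_sex == preds$predicted_sex),
  tissue = mean(preds$reported_tissue == preds$predicted_tissue)
)
print(concordance)

# Samples worth a second look: any disagreement between reported and predicted
discordant <- preds[preds$reported_sex != preds$predicted_sex |
                    preds$reported_tissue != preds$predicted_tissue, ]
print(nrow(discordant))
```

For samples with no reported value, the comparison returns NA, so a real analysis would need to handle missing reported phenotypes explicitly.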
Expression data and predictions available in recount R package
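> library('recount')
> # Download the gene-level data for study ERP001942
> download_study('ERP001942', type='rse-gene')
> load(file.path('ERP001942', 'rse_gene.Rdata'))
> # Scale the counts so that samples are comparable
> rse <- scale_counts(rse_gene)
> # Add the predicted phenotypes to the sample information
> rse_with_pred <- add_predictions(rse)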
There are a lot of questions an undergraduate could answer with recount2...
3-6 months
1-3 months
1 month - 1+ years
3~6 months
1-2 wks
2-4 wks
…and all of this is also up on bioRxiv
http://biorxiv.org/content/early/2017/06/03/145656
If you want to… | |
Align RNA-Seq data | http://rail.bio |
Learn about human expression | https://jhubiostatistics.shinyapps.io/recount/ |
Predict phenotype information | phenopredict: https://github.com/leekgroup/phenopredict |
The Leek group
Collaborators