Inference after prediction
(what we do after we have machine learned everything)
@jtleek
jtleek.com/talks
Siruo (Sara) Wang - the real hero of our journey
Reminds me of another grad student
An old observation (ca 2011)
It happened everywhere…
...and caused lots of problems
A key observation
Data (genes/transcripts/probes × samples) = Group + Batch + Error
Back to Sara
What makes primary cancer different than metastatic cancer?
www.hopkinsmedicine.org
Find a researcher with access to patient samples
What makes primary cancer different than metastatic cancer?
Collect patient samples and information
Extract DNA/RNA from samples
Sequence samples
Process sequencing data
3-6 months
1-3 months
1 month - 1+ years
3~6 months
1-2 wks
2-4 wks
Total: 2+ years
Analyze data and answer biological question
Data cleaning
What % expressed?
New genes?
Important outliers?
$\log\det\Theta - \operatorname{tr}(S\Theta) - \rho\lVert\Theta\rVert_1$
Best methods?
Sequence data
Process/quantify
Metadata
Clean/predict
ACTACTTT
Find a researcher with access to patient samples
What makes primary cancer different than metastatic cancer?
Collect patient samples and information
Extract DNA/RNA from samples
Sequence samples
Process sequencing data
Data cleaning
3-6 months
1-3 months
1 month - 1+ years
3~6 months
1-2 wks
2-4 wks
Total: 1+ years
Analyze data and answer biological question
A new observation
Expression data for ~70,000 human samples
GTEx
N=9,662
TCGA
N=11,284
SRA
N=49,848
samples
expression estimates
gene
exon
junctions
ERs
Answer meaningful questions about human biology and expression
A new problem
Expression data for ~70,000 human samples
samples
phenotypes
?
GTEx
N=9,662
TCGA
N=11,284
SRA
N=49,848
samples
expression estimates
gene
exon
junctions
ERs
Answer meaningful questions about human biology and expression
SRA phenotype information is far from complete
| Sex | Tissue | Race | Age |
6620 | female | liver | NA | NA |
6621 | female | liver | NA | NA |
6622 | female | liver | NA | NA |
6623 | female | liver | NA | NA |
6624 | female | liver | NA | NA |
6625 | male | liver | NA | NA |
6626 | male | liver | NA | NA |
6627 | male | liver | NA | NA |
6628 | male | liver | NA | NA |
6629 | male | liver | NA | NA |
6630 | male | liver | NA | NA |
6631 | NA | blood | NA | NA |
6632 | NA | blood | NA | NA |
6633 | NA | blood | NA | NA |
6634 | NA | blood | NA | NA |
6635 | NA | blood | NA | NA |
6636 | NA | blood | NA | NA |
Even when information is provided, it’s not always clear…
Sex across the SRA:
Level | Frequency |
F | 95 |
female | 2036 |
Female | 51 |
M | 77 |
male | 1240 |
Male | 141 |
Total | 3640 |
“1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”, “DK”, “male and female”, “Male (note: ….)”, “missing”, “mixed”, “mixture”, “N/A”, “Not available”, “not applicable”, “not collected”, “not determined”, “pooled male and female”, “U”, “unknown”, “Unknown”
# of NAs | # w/sex assigned |
44,957 | 4,700 |
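A minimal sketch of how the reported sex labels could be harmonized before use. The frequencies are copied from the table above. The mapping rules (case-folding, reading single letters as abbreviations, sending every other value to missing) are assumptions for illustration, not the procedure actually applied to the SRA metadata.

```python
from collections import Counter

# Reported levels and frequencies, copied from the slide's table.
reported = {"F": 95, "female": 2036, "Female": 51,
            "M": 77, "male": 1240, "Male": 141}

# Assumed canonical spellings; any other value is treated as missing.
CANONICAL = {"f": "female", "female": "female", "m": "male", "male": "male"}

def normalize_sex(raw):
    """Map a free-text label to 'female', 'male', or None (missing/ambiguous)."""
    if raw is None:
        return None
    return CANONICAL.get(raw.strip().lower())

# Collapse the six spellings into two levels.
harmonized = Counter()
for level, freq in reported.items():
    harmonized[normalize_sex(level)] += freq

# Free-text entries such as pooled samples stay missing rather than being guessed.
ambiguous = ["3 Female", "DK", "pooled male and female", "Unknown"]
ambiguous_mapped = [normalize_sex(value) for value in ambiguous]

print(dict(harmonized))   # female: 2182, male: 1458
print(ambiguous_mapped)   # all None
```

Even after harmonizing, most SRA samples still have no usable label, which is why prediction from the expression data is attractive.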
Goal: to accurately predict critical phenotype information for all samples in recount2
gene, exon, exon-exon junction and expressed region RNA-Seq data
SRA
Sequence Read Archive
N=49,848
divide samples
build and optimize phenotype predictor
predict phenotypes across SRA samples
test accuracy of predictor
TCGA
The Cancer Genome Atlas
N=11,284
GTEx
Genotype Tissue Expression Project
N=9,662
predict phenotypes across samples in TCGA
Training Data
Validation Data
Test Data
Validation Data
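The workflow on this slide (divide samples, build a predictor, test its accuracy, then predict for unlabeled samples) can be sketched roughly as follows. Everything here is a stand-in: the data are synthetic, the nearest-centroid classifier is a placeholder rather than the predictor built for recount2, and all function names are hypothetical.

```python
import random

def make_samples(n, rng):
    """Synthetic (expression vector, sex label) pairs; two 'genes' carry signal."""
    samples = []
    for _ in range(n):
        label = rng.choice(["female", "male"])
        shift = 2.0 if label == "male" else -2.0
        expr = [rng.gauss(shift, 1.0), rng.gauss(-shift, 1.0), rng.gauss(0.0, 1.0)]
        samples.append((expr, label))
    return samples

def fit_centroids(train):
    """Average expression vector per label."""
    sums, counts = {}, {}
    for expr, label in train:
        acc = sums.setdefault(label, [0.0] * len(expr))
        for j, value in enumerate(expr):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(centroids, expr):
    """Assign the label whose centroid is closest in squared distance."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(expr, center))
    return min(centroids, key=lambda lab: dist2(centroids[lab]))

def accuracy(centroids, data):
    return sum(predict(centroids, e) == lab for e, lab in data) / len(data)

rng = random.Random(1)

# Divide a well-annotated cohort into training and test halves.
labeled = make_samples(400, rng)
rng.shuffle(labeled)
train, test = labeled[:200], labeled[200:]

# Build the predictor on the training half, check it on the held-out half.
centroids = fit_centroids(train)
test_accuracy = accuracy(centroids, test)

# Predict phenotypes for samples whose labels are missing.
unlabeled = [expr for expr, _ in make_samples(50, rng)]
predicted = [predict(centroids, expr) for expr in unlabeled]

print(f"held-out accuracy: {test_accuracy:.3f}")
```

The held-out accuracy is the only check available before the predictor is applied to samples with no reported phenotype, so it is worth keeping the test half strictly separate from training.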
Problem solved (thanks Shannon!)
Expression data for ~70,000 human samples
samples
phenotypes
GTEx
N=9,662
TCGA
N=11,284
SRA
N=49,848
samples
expression estimates
gene
exon
junctions
ERs
Answer meaningful questions about human biology and expression
sex | tissue | Cell line? |
M | Blood | yes |
F | Heart | no |
F | Liver | no |
But sometimes the predictions and reported tissues make less sense…
¯\_(ツ)_/¯
It’s happening everywhere…
It can cause problems
Simulation
Underestimated variance
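A small simulation sketch of the underestimated-variance problem, under stated assumptions; it is not the simulation shown in the talk. The truth is a linear model with noise, the "machine learning" step is a stand-in (mean outcome within bins of x), and predicted outcomes are then plugged into an ordinary regression as if they had been observed. All names and parameter values are hypothetical.

```python
import math
import random
import statistics

def slope_and_se(x, y):
    """Least-squares slope of y on x and its usual model-based standard error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope, math.sqrt(rss / (n - 2) / sxx)

def draw(n, rng, beta=1.0, noise_sd=2.0):
    """Assumed truth: y = beta * x + Gaussian noise."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [beta * xi + rng.gauss(0.0, noise_sd) for xi in x]
    return x, y

def fit_binned_means(x, y, n_bins=10, lo=-3.0, hi=3.0):
    """Stand-in predictor: average outcome within equal-width bins of x."""
    def bin_of(xi):
        return min(n_bins - 1, max(0, int((xi - lo) / (hi - lo) * n_bins)))
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for xi, yi in zip(x, y):
        b = bin_of(xi)
        sums[b] += yi
        counts[b] += 1
    overall = sum(y) / len(y)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda xi: means[bin_of(xi)]

rng = random.Random(2020)
slopes, reported_ses = [], []
for _ in range(200):
    x_train, y_train = draw(300, rng)
    x_new, _unobserved = draw(300, rng)        # outcomes never measured here
    predictor = fit_binned_means(x_train, y_train)
    y_pred = [predictor(xi) for xi in x_new]   # predictions used as if observed
    slope, se = slope_and_se(x_new, y_pred)
    slopes.append(slope)
    reported_ses.append(se)

# Actual spread of the estimate across repetitions vs. what the regression reports.
empirical_sd = statistics.stdev(slopes)
mean_reported_se = statistics.mean(reported_ses)
print(f"empirical SD of slope: {empirical_sd:.3f}")
print(f"mean reported SE:      {mean_reported_se:.3f}")
```

The predictions are smoother than real outcomes, so the regression sees small residuals and reports a standard error well below the spread the estimate actually shows across repetitions.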
It can get worse!
Seems a little circular - but it is common!
“Estimates obtained from the Cooperative Congressional Election Study (CCES) of the 2016 US presidential election suggest a ρ(R,X) ≈ −0.005 for self-reporting to vote for Donald Trump. Because of LLP, this seemingly minuscule data defect correlation implies that the simple sample proportion of the self-reported voting preference for Trump from 1% of the US eligible voters, that is, n ≈ 2,300,000, has the same mean squared error as the corresponding sample proportion from a genuine simple random sample of size n ≈ 400, a 99.98% reduction of sample size (and hence our confidence).”
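The arithmetic behind the quoted numbers can be checked with a short calculation. This assumes the effective sample size approximation n_eff ≈ f / (1 − f) × 1 / ρ², where f is the sampled fraction of the population; that formula is my reading of the LLP result the quote refers to and is not stated on the slide. The function name is hypothetical.

```python
def effective_sample_size(sampling_fraction, data_defect_correlation):
    """Assumed approximation: n_eff = f / (1 - f) / rho^2."""
    f = sampling_fraction
    rho = data_defect_correlation
    return f / (1.0 - f) / rho ** 2

n = 2_300_000    # 1% of eligible voters, from the quote
f = 0.01         # sampled fraction, from the quote
rho = -0.005     # data defect correlation, from the quote

n_eff = effective_sample_size(f, rho)
reduction_pct = 100.0 * (1.0 - n_eff / n)

print(f"effective sample size: {n_eff:.0f}")             # about 404
print(f"reduction in sample size: {reduction_pct:.2f}%")  # 99.98%
```

Under that approximation a correlation of only −0.005 is enough to make 2.3 million responses worth about 400 random ones, which matches the figures in the quote.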
A key observation
Derivation Algorithm
=
Problem solved (thanks Shannon!)
Expression data for ~70,000 human samples
samples
phenotypes
GTEx
N=9,662
TCGA
N=11,284
SRA
N=49,848
samples
expression estimates
gene
exon
junctions
ERs
Answer meaningful questions about human biology and expression
sex | tissue | Cell line? |
M | Blood | yes |
F | Heart | no |
F | Liver | no |
RIN model
We can work with the machines!
Siruo (Sara) Wang - the real hero of our journey
The Leek group
Collaborators