1 of 60

Inference after prediction

(what we do after we have machine learned everything)

@jtleek

2 of 60

3 of 60

4 of 60

jtleek.com/talks

5 of 60

Siruo (Sara) Wang - the real hero of our journey

6 of 60

Reminds me of another grad student

7 of 60

An old observation (ca 2011)

8 of 60

It happened everywhere…

9 of 60

...and caused lots of problems

10 of 60

A key observation

[Figure: data matrix (genes/transcripts/probes × samples) = Group + Batch + Error]

11 of 60

A key observation

[Figure: data matrix (genes/transcripts/probes × samples) = Group + Batch + Error]

12 of 60

Back to Sara

13 of 60

What makes primary cancer different than metastatic cancer?

www.hopkinsmedicine.org

14 of 60

Find a researcher with access to patient samples

What makes primary cancer different than metastatic cancer?

15 of 60

Find a researcher with access to patient samples

What makes primary cancer different than metastatic cancer?

Collect patient samples and information

3~6 months

16 of 60

Find a researcher with access to patient samples

What makes primary cancer different than metastatic cancer?

Collect patient samples and information

Extract DNA/RNA from samples

Sequence samples

3~6 months

1-2 wks

2-4 wks

17 of 60

Find a researcher with access to patient samples

What makes primary cancer different than metastatic cancer?

Collect patient samples and information

Extract DNA/RNA from samples

Sequence samples

Process sequencing data

3-6 months

1-3 months

1 month - 1+ years

3~6 months

1-2 wks

2-4 wks

Analyze data and answer biological question

Data cleaning

18 of 60

Find a researcher with access to patient samples

What makes primary cancer different than metastatic cancer?

Collect patient samples and information

Extract DNA/RNA from samples

Sequence samples

Process sequencing data

3-6 months

1-3 months

1 month - 1+ years

3~6 months

1-2 wks

2-4 wks

Total: 2+ years

Analyze data and answer biological question

Data cleaning

19 of 60

What % expressed?

New genes?

Important outliers?

$\log\det\Theta - \operatorname{tr}(S\Theta) - \rho\lVert\Theta\rVert_1$

Best methods?

Sequence data

Process/quantify

Metadata

Clean/predict

ACTACTTT

20 of 60

Find a researcher with access to patient samples

What makes primary cancer different than metastatic cancer?

Collect patient samples and information

Extract DNA/RNA from samples

Sequence samples

Process sequencing data

Data cleaning

3-6 months

1-3 months

1 month - 1+ years

3~6 months

1-2 wks

2-4 wks

Total: 1+ years

Analyze data and answer biological question

21 of 60

A new observation

expression data for ~70,000 human samples

GTEx

N=9,662

TCGA

N=11,284

SRA

N=49,848

samples

expression estimates

gene

exon

junctions

ERs

Answer meaningful questions about human biology and expression

22 of 60

A new problem

expression data for ~70,000 human samples

samples

phenotypes

?

GTEx

N=9,662

TCGA

N=11,284

SRA

N=49,848

samples

expression estimates

gene

exon

junctions

ERs

Answer meaningful questions about human biology and expression

23 of 60

SRA phenotype information is far from complete

        Sex      Tissue   Race   Age
6620    female   liver    NA     NA
6621    female   liver    NA     NA
6622    female   liver    NA     NA
6623    female   liver    NA     NA
6624    female   liver    NA     NA
6625    male     liver    NA     NA
6626    male     liver    NA     NA
6627    male     liver    NA     NA
6628    male     liver    NA     NA
6629    male     liver    NA     NA
6630    male     liver    NA     NA
6631    NA       blood    NA     NA
6632    NA       blood    NA     NA
6633    NA       blood    NA     NA
6634    NA       blood    NA     NA
6635    NA       blood    NA     NA
6636    NA       blood    NA     NA

24 of 60

Even when information is provided, it’s not always clear…

Sex across the SRA:

Level     Frequency
F         95
female    2036
Female    51
M         77
male      1240
Male      141
Total     3640

25 of 60

Even when information is provided, it’s not always clear…

Sex across the SRA:

Level     Frequency
F         95
female    2036
Female    51
M         77
male      1240
Male      141
Total     3640

“1 Male, 2 Female”, “2 Male, 1 Female”, “3 Female”, “DK”, “male and female”, “Male (note: ….)”, “missing”, “mixed”, “mixture”, “N/A”, “Not available”, “not applicable”, “not collected”, “not determined”, “pooled male and female”, “U”, “unknown”, “Unknown”

# of NAs    # w/sex assigned
44,957      4,700
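The frequency table above is a small cleaning problem in its own right. A minimal sketch of one way to collapse the spellings, assuming a hand-written lookup (the `CANONICAL` mapping and the `normalize_sex` helper are illustrative names, not part of recount2 or any package); anything outside the lookup is treated as missing rather than guessed:

```python
from collections import Counter

# Frequencies copied from the "Sex across the SRA" table.
reported = {"F": 95, "female": 2036, "Female": 51,
            "M": 77, "male": 1240, "Male": 141}

# Hypothetical lookup: only unambiguous spellings are mapped.
CANONICAL = {"f": "female", "female": "female", "m": "male", "male": "male"}


def normalize_sex(label):
    """Return 'female', 'male', or None for ambiguous or missing labels."""
    if label is None:
        return None
    return CANONICAL.get(label.strip().lower())


# Collapse the six spellings into two levels.
counts = Counter()
for label, n in reported.items():
    counts[normalize_sex(label)] += n

# A few of the free-text values from the slide; none can be resolved safely.
ambiguous = ["DK", "mixed", "pooled male and female", "N/A", "U", "unknown"]
unresolved = [label for label in ambiguous if normalize_sex(label) is None]

print(dict(counts))
print(f"{len(unresolved)} of {len(ambiguous)} free-text labels left as missing")
```

The six clean spellings collapse to two levels whose counts add back up to the table total, while the free-text values stay missing. That gap is what motivates predicting the phenotype from the expression data instead.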

26 of 60

27 of 60

Goal: to accurately predict critical phenotype information for all samples in recount2

gene, exon, exon-exon junction and expressed region RNA-Seq data

SRA

Sequence Read Archive

N=49,848

divide samples

build and optimize phenotype predictor

predict phenotypes across SRA samples

test accuracy of predictor

TCGA

The Cancer Genome Atlas

N=11,284

GTEx

Genotype Tissue Expression Project

N=9,662

predict phenotypes across samples in TCGA

Training Data

Validation Data

Test Data

Validation Data
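The divide / build / test steps in this diagram can be sketched on toy data. Everything below is simulated and hypothetical: a single made-up expression value that is higher in males, and a one-threshold classifier standing in for the real phenotype predictor. It only illustrates the workflow of tuning on one part of the data and reporting accuracy on samples held out from tuning.

```python
import random

random.seed(0)


def simulate_sample():
    """Toy sample: one expression value that is higher in males (an assumption)."""
    sex = random.choice(["female", "male"])
    mean = 5.0 if sex == "male" else 0.5
    return {"sex": sex, "expr": random.gauss(mean, 1.0)}


samples = [simulate_sample() for _ in range(1000)]

# Divide samples: 70% to build the predictor, 30% held out to test it.
random.shuffle(samples)
cut = int(0.7 * len(samples))
train, test = samples[:cut], samples[cut:]


def accuracy(data, threshold):
    """Fraction of samples where 'expr > threshold' agrees with reported sex."""
    hits = sum((s["expr"] > threshold) == (s["sex"] == "male") for s in data)
    return hits / len(data)


# Build and optimize: pick the threshold with the best training accuracy.
candidates = [t / 10 for t in range(0, 60)]
best_threshold = max(candidates, key=lambda t: accuracy(train, t))

# Test accuracy on samples the predictor has never seen.
test_accuracy = accuracy(test, best_threshold)
print(f"threshold={best_threshold:.1f}, held-out accuracy={test_accuracy:.3f}")
```

Keeping the test samples out of the tuning step is what makes the reported accuracy an honest estimate for new samples.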

28 of 60

Problem solved (thanks Shannon!)

expression data for ~70,000 human samples

samples

phenotypes

GTEx

N=9,662

TCGA

N=11,284

SRA

N=49,848

samples

expression estimates

gene

exon

junctions

ERs

Answer meaningful questions about human biology and expression

sex    tissue    Cell line?
M      Blood     yes
F      Heart     no
F      Liver     no

29 of 60

But sometimes the predictions and reported tissues make less sense…

 ¯\_(ツ)_/¯

30 of 60

31 of 60

It’s happening everywhere…

32 of 60

It can cause problems

33 of 60

It can cause problems

34 of 60

Simulation

35 of 60

Simulation

36 of 60

Simulation

37 of 60

It can cause problems

38 of 60

Underestimated variance
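A minimal simulation of why variance gets underestimated, under stated assumptions: the true outcome is signal plus noise, and an idealized predictor (hypothetical, assumed to recover the signal exactly) returns the signal with the noise stripped out. Regressing the predicted outcome on a covariate then reports a standard error that is too small relative to the same regression on the observed outcome.

```python
import math
import random

random.seed(1)


def slope_and_se(x, y):
    """Ordinary least squares slope of y on x, with its standard error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se


n = 2000
x = [random.gauss(0, 1) for _ in range(n)]  # covariate of interest
u = [random.gauss(0, 1) for _ in range(n)]  # other feature the predictor sees
e = [random.gauss(0, 2) for _ in range(n)]  # noise no predictor can recover

y_true = [xi + ui + ei for xi, ui, ei in zip(x, u, e)]
# Assumption: an idealized predictor that returns the signal without the noise.
y_pred = [xi + ui for xi, ui in zip(x, u)]

slope_true, se_true = slope_and_se(x, y_true)
slope_pred, se_pred = slope_and_se(x, y_pred)
print(f"observed outcome : slope={slope_true:.3f}, se={se_true:.4f}")
print(f"predicted outcome: slope={slope_pred:.3f}, se={se_pred:.4f}")
```

Both regressions recover a slope near 1, but the one on predicted outcomes looks far more certain than it should, because the predictions are smoother than real data.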

39 of 60

It can get worse!

40 of 60

Seems a little circular - but it is common!

41 of 60

“Estimates obtained from the Cooperative Congressional Election Study (CCES) of the 2016 US presidential election suggest a ρ(R,X) ≈ −0.005 for self-reporting to vote for Donald Trump. Because of LLP, this seemingly minuscule data defect correlation implies that the simple sample proportion of the self-reported voting preference for Trump from 1% of the US eligible voters, that is, n ≈ 2,300,000, has the same mean squared error as the corresponding sample proportion from a genuine simple random sample of size n ≈ 400, a 99.98% reduction of sample size (and hence our confidence).”
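The arithmetic in the quote can be checked in a few lines. This sketch assumes the large-population approximation n_eff ≈ f / (1 − f) × 1 / ρ², where f is the sampled fraction and ρ is the data defect correlation. The formula is not stated on the slide, so treat it as an assumption; the helper name `effective_sample_size` is illustrative.

```python
def effective_sample_size(f, rho):
    """Approximate effective sample size: f / (1 - f) / rho**2 (assumed formula)."""
    return f / (1.0 - f) / rho ** 2


n = 2_300_000   # self-reported sample size from the quote
f = 0.01        # sampled fraction: 1% of eligible voters
rho = -0.005    # data defect correlation from the quote

n_eff = effective_sample_size(f, rho)
reduction = 1 - n_eff / n

print(f"effective sample size: {n_eff:.0f}")
print(f"reduction in sample size: {reduction:.2%}")
```

With the numbers from the quote this gives an effective sample size of roughly 400 and a reduction of about 99.98%, matching the quoted figures.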

42 of 60

43 of 60

44 of 60

45 of 60

A key observation

46 of 60

A key observation

47 of 60

48 of 60

Derivation Algorithm

49 of 60

50 of 60

51 of 60

52 of 60

53 of 60

=

54 of 60

55 of 60

Problem solved (thanks Shannon!)

expression data for ~70,000 human samples

samples

phenotypes

GTEx

N=9,662

TCGA

N=11,284

SRA

N=49,848

samples

expression estimates

gene

exon

junctions

ERs

Answer meaningful questions about human biology and expression

sex    tissue    Cell line?
M      Blood     yes
F      Heart     no
F      Liver     no

56 of 60

RIN model

57 of 60

58 of 60

We can work with the machines!

59 of 60

Siruo (Sara) Wang - the real hero of our journey

60 of 60

The Leek group

Collaborators

  • Jack Fu
  • Aboozar Hadavand
  • Leslie Myint
  • Kayode Sosina
  • Sara Wang
  • Shannon Ellis
  • Andrew Jaffe
  • Kasper Hansen
  • Margaret Taub
  • Leah Jager
  • Sean Kross
  • Ben Langmead
  • Abhi Nellore
  • Kai Kammers
  • Leo Collado-Torres
  • Ashkaun Razmara