Statistical Genomics & Genetics
Johns Hopkins Biostatistics
February 21, 2020
Stephanie Hicks
Assistant Professor, Biostatistics Department
Faculty Member, Johns Hopkins Data Science Lab
stephaniehicks.com
Twitter: @stephaniehicks
what makes us diverse?
slide adapted from alyssa frazee
how does this happen?
slide adapted from rafa irizarry
how does a healthy cell become a cancer cell?
AUCAGUCGAUCACCGAU
transcription
RNA
translation
protein
ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
DNA
central dogma
slide adapted from alyssa frazee
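A minimal sketch of the two steps named on this slide. It assumes the DNA string is the coding strand, so transcription reduces to replacing T with U. The codon table is deliberately partial and covers only the toy sequence used here, and the function names are illustrative rather than taken from any library.

```python
# Partial codon table: only the codons needed for the toy example below.
CODON_TABLE = {"AUG": "Met", "GCC": "Ala", "UAA": "Stop"}


def transcribe(dna):
    """Return the RNA for a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Translate codon by codon, stopping at a stop codon or an unknown codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE.get(rna[i:i + 3])
        if amino_acid is None or amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein


# The DNA string shown on the slide, transcribed under the assumption above.
slide_dna = "ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT"
slide_rna = transcribe(slide_dna)

# A toy gene (not from the slide): start codon, one alanine, stop codon.
toy_protein = translate(transcribe("ATGGCCTAA"))
```

Real transcription reads the template strand and real translation needs the full 64-codon table and a start codon; both are left out to keep the sketch short.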
ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
DNA
genetics
phenotype
Different genomes, different phenotypes
Sloth
Human
ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
DNA
genetics
AUCAGUCGAUCACCGAU
transcription
RNA
translation
protein
ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
DNA
central dogma
slide adapted from alyssa frazee
AUCAGUCGAUCACCGAU
transcription
RNA
translation
protein
ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
DNA
genomics
M
M
M
slide adapted from alyssa frazee
Taub, Ruczinski, Chatterjee, Zhao
Hansen, Hicks
Ji, Hansen
Hicks, Ji, Hansen, Leek
Ruczinski
DNA-seq
DNAm
ChIP-seq
RNA-seq
Protein
Genome
Function
Slide courtesy: Ben Langmead
ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
DNA
data generation
GATCGATCGTATACGAT
Fragments
ACTGACCTAGATCAGTC
TACAAAATCATCGGCAT
ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
DNA
data generation
GATCGATCGTATACGAT
Fragments
ACTGACCTAGATCAGTC
TACAAAATCATCGGCAT
Reads
TACAAAATCA
AGATCAGTC
GATCGATCG
ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
DNA
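The fragments and reads on this slide can be reproduced with a few lines of code. This is a sketch under simplifying assumptions: the DNA is cut into equal 17-base pieces, and each read is an error-free stretch copied from one end of a fragment. Real fragmentation is random, and real reads can come from either strand and contain errors. The function name is illustrative.

```python
DNA = "ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT"


def fragment(sequence, size):
    """Cut a sequence into consecutive pieces of the given size."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


# Three 17-base fragments, as drawn on the slide.
fragments = fragment(DNA, 17)

# Each read is a short stretch from one end of a fragment.
reads = [
    fragments[2][:10],   # first 10 bases of the third fragment
    fragments[0][-9:],   # last 9 bases of the first fragment
    fragments[1][:9],    # first 9 bases of the second fragment
]

# Every read should be found somewhere in the original DNA.
read_positions = {read: DNA.find(read) for read in reads}
```

The sequencer reports only the reads; their positions in the genome have to be recovered afterwards, which is why alignment is a core computational step.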
@22:16362385-16362561W:ENST00000440999:2:177:-40:244:S/2
CCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCA
+
GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD
@22:16362385-16362561W:ENST00000440999:3:177:-56:294:S/2
GCGTGAGCCACAGGGCCCAGCCCACCTGAGGCTTCTTTTTCCTTCCCAAGCCACATCACCATCCTGGTGGAACTCT
+
@=ABBBBIIIIIIIIHHGGGGIIDBDIIIIIIGIIIIHIIIIHFDD@BBDBGGFIDEE8DCC/29>BGFCGHHHGF
@22:16362385-16362561W:ENST00000440999:4:177:137:254:S/1
TCACCATCCTGGTGGAACTCTCCTGTGAGGACAGCCAAGGCCTGAACTACCTGCaGTGGGGAGCACCTCAGGGTTT
+
DDGBBCGGGIGGGBDDDHIIGGDGD77=BDIIIIIIIIFHHHHIIIHEFFHGGDD8A>DEGHHIFDDHH8@BEDDI
@22:16362385-16362561W:ENST00000440999:5:177:68:251:S/2
AGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCCTGGAGCGAGTTGTGGATGGCAAAAAGACNCGCC
+
HIGHIHFHEGE4111:.;8@?@HDIIIIIIIEGGIHHHIIGA?=:FIIIDD8.02506A8=AC#############
@22:16362385-16362561W:ENST00000440999:6:177:348:453:S/1
AAGGCCTGAACTACCTGCGGTGGGGAGCACCTCAGGGTTTGCCCAGGCAACCAGCCAGCCCTGGTCCAAGGCATCC
+
B9?@8=42:E@GDEDIIIIIGGHIIIFBEEAGIIDIIDHHGGHIIEGEIIIIIHIHFHFFEEFGGGGGB88>:DGH
@22:51205934-51222090C:ENST00000464740:132:612:223:359:S/2
GGAAGTATGATGCTGATGACAACGTGAAGATCATCTGCCTGGGAGACAGCGCAGTGGGCAAATCCAAACTCATGGA
+
IIEHHHHHIIIIIIIHGGDGHHEDDG8=;?==19;<<>D@@GGGIIHIIHGGDDHGBA=ABEG@@DFCCAA<:=>8
@22:51205934-51222090C:ENST00000464740:125:612:-1:185:S/1
TGGAGTGCGCTGCGGCGCGAGCTGGGCCGGCGGGCGTGGTTCGAGAGCGCGCAGAGTCCAGACTGGCGGCAGGGCC
+
GGFF<BB=>GBGIIIIIIIIIIIIIIEGEHGHHIIIIIIIIHFHBB2/:=??EGGGEGFHHIHHEDBD?@@DDHHD
billions more
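The records above are in FASTQ format: four lines per read (a header starting with @, the bases, a + separator, and one quality character per base). The sketch below parses such text and decodes the quality characters, assuming the common Phred+33 encoding; the slide does not state which encoding was used. The two toy records and the function name are made up for illustration.

```python
PHRED_OFFSET = 33  # assumption: Sanger / Phred+33 quality encoding


def parse_fastq(text):
    """Parse FASTQ text into a list of dicts, reading four lines per record."""
    lines = [line.strip() for line in text.strip().splitlines()]
    records = []
    for i in range(0, len(lines) - 3, 4):
        header, bases, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(bases) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        records.append({
            "name": header[1:],
            "bases": bases,
            "quality": [ord(char) - PHRED_OFFSET for char in quality],
        })
    return records


toy_fastq = """
@toy_read_1
GATCGATCG
+
IIIIHHH#5
@toy_read_2
TACAAAATCA
+
IIIIIIIIII
"""

records = parse_fastq(toy_fastq)
```

Records are read in groups of four lines rather than by searching for @, because a quality line can itself begin with @, as one of the quality lines above does.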
N = SAMPLE SIZE
N = ($ YOU HAVE) / ($ PER SAMPLE)
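As a quick worked example of this formula, with a budget and a per-sample cost that are purely hypothetical:

```python
def affordable_sample_size(budget, cost_per_sample):
    """Return the largest whole number of samples the budget covers."""
    if cost_per_sample <= 0:
        raise ValueError("cost_per_sample must be positive")
    return int(budget // cost_per_sample)


# Hypothetical numbers: a $50,000 budget at $1,200 per sample.
n = affordable_sample_size(50_000, 1_200)
```

Halving the per-sample cost doubles the affordable sample size, which is why falling sequencing costs matter so much for study design.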
$ per (human) Genome
http://www.genome.gov/sequencingcosts/
All the data
ACTGACCTAGATCAGTCGATCGATCGTATACGATTACAAAATCATCGGCAT
DNA
genetics
phenotype
Rare and common variants → relative risk of disease
Ingo Ruczinski
Family study (rare variants)
Goal: Identify highly penetrant disease variants by sequencing distant relatives
Genome-wide association study (common variants)
Telomere length from 75,000 individuals
Manhattan plot showing peak genetic signals
Margaret Taub
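A Manhattan plot shows, for each variant, the value -log10(p) from its association test, so that strong signals appear as tall peaks. The sketch below computes those heights and flags variants that pass the conventional genome-wide threshold of 5e-8. The variant names and p-values are invented for illustration and are not results from the telomere length study.

```python
import math

GENOME_WIDE_ALPHA = 5e-8  # conventional genome-wide significance threshold


def manhattan_height(p_value):
    """Return the y-axis value of a variant on a Manhattan plot."""
    return -math.log10(p_value)


# Hypothetical association p-values, one per variant.
p_values = {"variant_a": 3e-12, "variant_b": 0.04, "variant_c": 6e-8}

heights = {name: manhattan_height(p) for name, p in p_values.items()}
significant = sorted(name for name, p in p_values.items() if p < GENOME_WIDE_ALPHA)
```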
TACAAAATCA
AGATCAGTC
GATCGATCG
All the dataz
+
what we do
TACAAAATCA
AGATCAGTC
GATCGATCG
All the dataz
+
experimental design
TACAAAATCA
AGATCAGTC
GATCGATCG
All the dataz
+
experimental design
preprocessing
+
normalization
TACAAAATCA
AGATCAGTC
GATCGATCG
All the dataz
+
genomic data science
Ni Zhao
Measuring the impact of the microbiome
MiRKAT: kernel methods for associating microbiome data with phenotypes of interest
Kasper Hansen
De-noising DNA methylation data
Kasper Hansen
Understanding changes in DNA methylation in colon cancer
“...10 billion observations (or cells) by 2020”
Stephanie Hicks
Modeling single-cell RNA-sequencing data
(single-cell) RNA-seq data are nonnegative integers
Stephanie Hicks
Generalized Principal Component Analysis (GLM PCA)
PCA
GLM PCA
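Because the data are nonnegative integer counts, a count model such as the Poisson is a more natural starting point than the Gaussian assumption behind ordinary PCA. The sketch below is not the GLM PCA algorithm itself. It only computes the Poisson deviance of one gene against a null model in which the gene makes up the same fraction of every cell's total count, which is one way to see how far a gene departs from "no biological variation". The counts, totals, and function name are hypothetical.

```python
import math


def poisson_deviance(counts, totals):
    """Poisson deviance of one gene under a constant-relative-abundance null.

    counts: observed counts of the gene, one per cell
    totals: total counts per cell (sequencing depth)
    """
    relative_abundance = sum(counts) / sum(totals)
    deviance = 0.0
    for observed, total in zip(counts, totals):
        expected = total * relative_abundance
        # By convention, y * log(y / mu) is 0 when y is 0.
        log_term = observed * math.log(observed / expected) if observed > 0 else 0.0
        deviance += 2.0 * (log_term - (observed - expected))
    return deviance


# Hypothetical data for four cells with different sequencing depths.
totals = [1000, 2000, 4000, 1000]
flat_gene = [1, 2, 4, 1]        # counts proportional to depth
variable_gene = [0, 0, 8, 0]    # same total, concentrated in one cell

flat_deviance = poisson_deviance(flat_gene, totals)
variable_deviance = poisson_deviance(variable_gene, totals)
```

The flat gene has deviance near zero because its counts track sequencing depth exactly, while the variable gene has a large deviance even though both genes have the same total count.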
In a nutshell
Interesting, intellectually challenging, scientifically important problems
Big and complex data
Unique contributions to both science and statistics
slide adapted from kasper hansen
Two major roles for statisticians
As safeguards against mistakes
As engines of discovery
slide adapted from hongkai ji
Outside department:
JHU genomics (broader Hopkins community) meets twice a month
Inside department:
Lots of working groups in biostats for both statistical genetics and genomics
Group meetings
Taub
Leek
Ji
Hansen
Hicks
Zhao
Ruczinski
Chatterjee