Poisson processes in biology.
Aaron Quinlan
Departments of Human Genetics and Biomedical Informatics
USTAR Center for Genetic Discovery
University of Utah
quinlanlab.org
Siméon-Denis Poisson
French mathematician, engineer, physicist (1781 – 1840)
The Poisson distribution is used to describe the distribution of rare events in a large population. For example, at any particular time, there is a certain probability that a particular cell within a large population of cells will acquire a mutation.
A Poisson process is appropriate here because mutation acquisition is a rare event, and mutation events are approximately independent of one another.
Poisson Process
Useful model when counting the occurrences of events that appear to happen at a certain rate, but at random.
Let's say that, on average, there is 1 earthquake above magnitude 6.0 worldwide per month.
[Table: simulated earthquake counts for Months 1–32: 2, 0, 0, 1, 0, 1, 0, 3, 0, 2, 0, 0, 1, 1, 1, 1, 1, 0, 3, 0, 1, 2, 3, 1, 0, 1, 2, 0, 1, 0, 3, 1]
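A month-by-month table like the one above can be simulated directly with `rpois()`. A sketch, assuming the same rate of 1 quake per month (the seed is arbitrary and chosen only for reproducibility):

```r
# simulate 32 months of earthquake counts, assuming an
# average rate (lambda) of 1 quake above 6.0 per month
set.seed(42)
quakes_per_month <- rpois(32, lambda = 1)
quakes_per_month
table(quakes_per_month)
mean(quakes_per_month)  # should land close to lambda = 1
```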
Important requirement: the numbers of events in disjoint time periods are independent.
Poisson Distribution. Discrete.
One parameter: lambda (λ), the average number of events per interval.
Expresses the probability of a given number of events, k, occurring in a fixed interval of time or space if these events occur independently at a known constant rate:
P(X = k) = λ^k e^(−λ) / k!
P(2 earthquakes in one month) = e^(−1) × (1^2 / 2!)
# Plug lambda and k into the equation
exp(-1) * (1^2/factorial(2))
[1] 0.1839397
# Shortcut (dpois)
dpois(x=2,lambda=1)
[1] 0.1839397
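A cumulative question, such as the chance of 2 or more earthquakes in a month, uses the upper tail of the Poisson CDF via `ppois()`:

```r
# P(k earthquakes in a month) for k = 0..5, with lambda = 1
dpois(0:5, lambda = 1)

# P(2 or more earthquakes in a month): use the upper tail of the CDF
p_two_or_more <- ppois(1, lambda = 1, lower.tail = FALSE)
p_two_or_more
# [1] 0.2642411
```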
How many red cards do we expect in an English Premier League game?
library(ggplot2)
num_rc <- 0:10
p_num_rc <- dpois(0:10, lambda=0.11)
rc_prob = data.frame(num_rc, p_num_rc)
ggplot(rc_prob, aes(x=as.factor(num_rc), y=p_num_rc)) +
geom_col()
Average of 0.11 red cards per game in the 2018/2019 season.
How many home runs do we expect in a Major League Baseball game?
library(ggplot2)
num_hr <- 0:10
p_num_hr <- dpois(0:10, lambda=1.26)
home_run_prob = data.frame(num_hr, p_num_hr)
ggplot(home_run_prob, aes(x=as.factor(num_hr), y=p_num_hr)) +
geom_col()
The average number of home runs in 2017 was 1.26 (a record).
Luria–Delbrück experiment
How do phage-resistant bacteria colonies arise?
Mutation rates at autosomal nucleotide sites are roughly 10^-9 per year.
Consider a position in your genome. If you could trace its ancestry back across the last 10^9 years, what is the probability that you would find no mutations?
The expected number of mutations is λ = ut, where u = 10^-9 and t = 10^9. Thus, λ = 1. The probability of no mutations at the site you chose is:
1^0 * exp(-1) / factorial(0)
[1] 0.3678794
dpois(x=0,lambda=1)
[1] 0.3678794
What about the probability of one or more mutations?
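The probability of one or more mutations is just the complement of "no mutations":

```r
# probability of one or more mutations at the site:
# the complement of "no mutations", still with lambda = 1
p_one_or_more <- 1 - dpois(x = 0, lambda = 1)
p_one_or_more
# [1] 0.6321206
```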
Alan Rogers
Is the number of chocolate chunks in a cookie a Poisson random variable?
Count the visible chunks in your cookie and enter the count.
Load our observed data into R:
1. Install the "datapasta" package:
   install.packages("datapasta")
   library(datapasta)
2. Copy the data from Google Sheets.
3. Use the "Addins" dropdown to paste as a tribble.
4. Create a data frame from the generated code.
Plot our observed data in R:
5. Look at the frequencies with table().
6. Plot the number of chunks per cookie:
   library(ggplot2)
   ggplot(chunks, aes(x=num_chunks)) +
     geom_bar()
chunks <- tibble::tribble(
~unid, ~num_chunks,
"u1007787", 0,
"u1007788", 0,
"u1007789", 2,
"u1007790", 0,
"u1007791", 1,
"u1007792", 1,
"u1007793", 0,
"u1007794", 3,
"u1007795", 0,
"u1007796", 0,
"u1007797", 0,
"u1007798", 2,
"u1007799", 1,
"u1007800", 1,
"u1007801", 1,
"u1007802", 1,
"u1007803", 0,
"u1007804", 1,
"u1007805", 1
)
4. Create a tibble from the generated code (above is an example).
5. Look at the frequencies with table():
   table(chunks$num_chunks)
Is the number of chocolate chunks in a cookie a Poisson random variable?
install.packages("vcd")
library("vcd")
gf <- goodfit(chunks$num_chunks, "poisson")
rootogram(gf)
Too few cookies with 1 or 2 chunks
Too many cookies with 3 or 4 chunks
Control: what does the fit look like with data simulated from a Poisson random variable?
# simulate 1000 cookies with a Poisson
# distributed number of chunks. The mean
# is 3 chunks per cookie
sim_chunks <- rpois(1000, lambda=3)
gf2 <- goodfit(sim_chunks, "poisson")
rootogram(gf2)
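A quick sanity check on the simulated control: for a Poisson random variable the mean and the variance both equal lambda, so the two estimates should land near 3. A sketch (sim_chunks is re-simulated here, with an arbitrary seed, so the block is self-contained):

```r
# for a Poisson random variable the mean and the variance are both
# lambda, so both estimates below should land near 3
set.seed(1)
sim_chunks <- rpois(1000, lambda = 3)
mean(sim_chunks)
var(sim_chunks)
```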
Drop-seq (single cell sequencing in droplets)
Goal: one cell per droplet
Should one shoot for an average of one cell per droplet (lambda = 1)?
Let's simulate.
sc_sim <- rpois(2000, lambda=1)
barplot(table(sc_sim), col="dodgerblue")
Lots of droplets with >1 cell: "doublets", "triplets". What fraction of droplets are expected to be this way?
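That fraction can be read directly off the upper tail of the Poisson CDF. A sketch for λ = 1 and for the two smaller loading rates tried next (0.5 and 0.1):

```r
# expected fraction of droplets containing 2+ cells
# ("doublets", "triplets", ...) at three loading rates
lams <- c(1, 0.5, 0.1)
frac_multi <- ppois(1, lambda = lams, lower.tail = FALSE)
round(frac_multi, 4)
# [1] 0.2642 0.0902 0.0047
```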
What about (lambda = 0.5)?
sc_sim <- rpois(2000, lambda=0.5)
barplot(table(sc_sim), col="chartreuse4")
Better, but still many droplets with >1 cell.
What about (lambda = 0.1)?
sc_sim <- rpois(2000, lambda=0.1)
barplot(table(sc_sim), col="darkred")
> dpois(x=1, lambda=0.1)
[1] 0.09048374
> 0.09048374 * 2000
[1] 180.9675
Goal: one cell and one bead per droplet.
Lambda = 0.1 to minimize >1 cell per droplet:
> dpois(x=1, lambda=0.1)
[1] 0.09048374
Lambda = 0.1 to minimize >1 bead per droplet:
> dpois(x=1, lambda=0.1)
[1] 0.09048374
0.09048374 × 0.09048374 ≈ 0.0082: the expected fraction of droplets with exactly one cell and one bead.
[Figure: bulk RNA expression — single-cell RNAseq vs. bulk RNAseq]
Poisson in differential (bulk) RNA expression
# Simulate 10,000 genes
N_genes = 10000
# Simulate a range of 10,000 lambdas to reflect the mean
# expression of each of the 10,000 genes:
# 10,000 evenly-spaced numbers between 1 and 1024
lambdas = seq(1, 1024, len=N_genes)
# Simulate two technical replicates for an RNA-seq
# experiment. The observed expression for each of the
# 10,000 genes will be sampled from a Poisson distribution
# using the same lambda assigned to the gene for each
# replicate.
rep1 = rpois(N_genes, lambdas)
rep2 = rpois(N_genes, lambdas)
# remove genes where the expression was 0 in either rep.
non_zero = which(rep1>0 & rep2>0)
lambdas = lambdas[non_zero]
rep1 = rep1[non_zero]
rep2 = rep2[non_zero]
# make a data frame of lambdas and expression from replicates
rna_sim = data.frame(lambdas, rep1, rep2)
Sources of variance. Poisson (counting) noise
[Figure: Poisson distribution — expected frequency vs. number of events]
# plot the expression ratio from the two replicates as a function
# of the mean expression for the gene (lambda);
# plot in log2 space for better separation of the data
library(ggplot2)
ggplot(rna_sim, aes(x=lambdas, y=log2(rep1/rep2))) + geom_point(size=0.5)
Far greater random or "technical" variation in genes with low expression
http://michelebusby.tumblr.com/post/26913184737/thinking-about-designing-rna-seq-experiments-to
[Figure: read counts for Gene A, Gene B, and Gene C in control vs. experimental samples]
In each case, there is an apparent 2X difference in the mean read counts between control and experimental.
Yet Poisson variance (noise) is higher relative to the total count when counts are low than when they are high.
For example, the difference in expression of a gene measured with one read versus two reads is inherently less certain than the differences in expression of a gene measured with 100 reads versus 200 reads, even though both differences are nominally a 2X fold change.
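This intuition follows from the Poisson mean-variance relationship: the standard deviation is sqrt(lambda), so the relative noise is 1/sqrt(lambda) and shrinks as counts grow. A quick sketch with the counts from the example:

```r
# relative Poisson noise (coefficient of variation) is
# sd / mean = sqrt(lambda) / lambda = 1 / sqrt(lambda)
counts <- c(1, 2, 100, 200)
cv <- sqrt(counts) / counts
round(cv, 3)
# [1] 1.000 0.707 0.100 0.071
```

At 1-2 reads the Poisson noise is on the order of the signal itself; at 100-200 reads it is under 10% of it.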
Thank you.
Multiple testing
In biology, the problem is worse: 20,000 genes to test for rejection of the null hypothesis instead of just 20 jelly bean colors.
The p-value histogram
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DESeq2")
BiocManager::install("airway")
library("DESeq2")
library("dplyr")
library("airway")
data("airway")
aw = DESeqDataSet(se = airway, design = ~ cell + dex)
aw = DESeq(aw)
awde = as.data.frame(results(aw)) %>% dplyr::filter(!is.na(pvalue))
* Airway dataset: gene expression measurements (counts) of four primary human airway smooth muscle cell lines with and without treatment with dexamethasone
library("ggplot2")
ggplot(awde, aes(x = pvalue)) +
geom_histogram(binwidth = 0.025, boundary = 0)
The histogram is a mixture of p-values from tests where the null hypothesis is TRUE and from tests where the null hypothesis is FALSE.
Suppose we reject the null for all tests with a p-value less than alpha=0.025
alpha = binw = 0.025
pi0 = 2 * mean(awde$pvalue > 0.5)
ggplot(awde,
aes(x = pvalue)) + geom_histogram(binwidth = binw, boundary = 0) +
geom_hline(yintercept = pi0 * binw * nrow(awde), col = "blue") +
geom_vline(xintercept = alpha, col = "red")
Mostly true positives vs. mostly false positives.
FDR estimate "by eye": ~950/4700 ≈ 0.20. Too high?
The Benjamini-Hochberg algorithm for controlling the FDR
Arrange all of your individual P values in order, from smallest to largest. The smallest P value has a rank of i=1, the next smallest has i=2, etc.
Compare each individual P value to its Benjamini-Hochberg critical value, (i/m)Q, where i is the rank, m is the total number of tests, and Q is the false discovery rate you choose.
The largest P value that has P<(i/m)Q is significant, and all of the P values smaller than it are also significant, even the ones that aren't less than their Benjamini-Hochberg critical value.
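The three steps above can be sketched directly in R and checked against the built-in p.adjust(..., method = "BH"). The p-values here are toy values invented for illustration:

```r
# toy p-values (hypothetical, for illustration only), already sorted
p <- c(0.001, 0.012, 0.020, 0.040, 0.100, 0.200, 0.500, 0.740, 0.900, 0.950)
Q <- 0.25              # chosen false discovery rate
m <- length(p)         # total number of tests
i <- rank(p)           # rank of each p-value

# step 2: compare each p-value to its critical value (i/m)Q
crit <- (i / m) * Q

# step 3: the largest p with p < (i/m)Q, and everything smaller, is significant
signif_bh <- p <= max(p[p < crit])
signif_bh

# agrees with R's built-in BH-adjusted p-values at the same threshold
signif_padjust <- p.adjust(p, method = "BH") <= Q
all(signif_bh == signif_padjust)  # TRUE
```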
The Benjamini-Hochberg algorithm for controlling the FDR. An example with FDR set to 0.25
http://www.biostathandbook.com/multiplecomparisons.html
* p-values are associations of 25 dietary variables with mammographic density, a risk factor for breast cancer
Apply B-H
m = 25
Q = 0.25
The largest P value that has P<(i/m)Q is significant
Apply B-H ranking to the airway p-values.
This time, set FDR to 0.10
fdr = 0.10
awde = mutate(awde, rank = rank(pvalue))
m = nrow(awde)
ggplot(dplyr::filter(awde, rank <= 7000), aes(x = rank, y = pvalue)) +
geom_line() + geom_abline(slope = fdr / m, col = "red")
← Reject the null: ~10% of these significant tests will be false positives.
DESeq2 uses B-H to create "padj"
> head(awde)
baseMean log2FoldChange lfcSE stat pvalue padj rank
1 708.6021697 0.38125398 0.10065441 3.7877523 1.520163e-04 1.283630e-03 2135
2 520.2979006 -0.20681260 0.11221865 -1.8429433 6.533729e-02 1.965461e-01 6097
3 237.1630368 -0.03792043 0.14344465 -0.2643558 7.915057e-01 9.114595e-01 22414
4 57.9326331 0.08816818 0.28714182 0.3070545 7.588019e-01 8.950328e-01 21579
5 0.3180984 1.37822703 3.49987280 0.3937935 6.937335e-01 NA 19261
6 5817.3528677 -0.42640213 0.08831339 -4.8282840 1.377146e-06 1.821511e-05 1363
Q-value versus adjusted p-values
In 1995, Yoav Benjamini and Yosef Hochberg proposed controlling the false discovery rate (FDR) as a more statistically powerful alternative to controlling the FWER in multiple hypothesis testing.[2] The pFDR and the q-value were introduced by John D. Storey in 2002 in order to improve upon a limitation of the FDR, namely that the FDR is not defined when there are no positive results.[1][4]
https://en.wikipedia.org/wiki/Q-value_(statistics)