Poisson processes in biology.
Aaron Quinlan
Departments of Human Genetics and Biomedical Informatics
USTAR Center for Genetic Discovery
University of Utah
quinlanlab.org
Siméon-Denis Poisson
French mathematician, engineer, physicist (1781 – 1840)
The Poisson distribution is used to describe the distribution of rare events in a large population. For example, at any particular time, there is a certain probability that a particular cell within a large population of cells will acquire a mutation.
A Poisson process is appropriate here because mutation acquisition is a rare event, and mutation events are approximately independent of one another.
Poisson Process
Useful model when counting the occurrences of events that appear to happen at a certain rate, but at random.
Let's say that, on average, there is 1 earthquake above magnitude 6.0 worldwide per month.
[Table: simulated earthquake counts for Months 1–32: 2, 0, 0, 1, 0, 1, 0, 3, 0, 2, 0, 0, 1, 1, 1, 1, 1, 0, 3, 0, 1, 2, 3, 1, 0, 1, 2, 0, 1, 0, 3, 1]
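A month-by-month table like the one above can be simulated directly with `rpois()`. A sketch, assuming the same rate of 1 quake per month (the seed is arbitrary and chosen only for reproducibility):

```r
# simulate 32 months of earthquake counts, assuming an
# average rate (lambda) of 1 quake above 6.0 per month
set.seed(42)
quakes_per_month <- rpois(32, lambda = 1)
quakes_per_month
table(quakes_per_month)
mean(quakes_per_month)  # should land close to lambda = 1
```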
Important requirement: the numbers of events in disjoint time periods are independent.
Poisson Distribution. Discrete.
One parameter: lambda (λ), the average number of events per interval.
Expresses the probability of a given number of events, k, occurring in a fixed interval of time or space if these events occur independently at a known constant rate:
P(X = k) = λ^k e^(−λ) / k!
P(2 earthquakes in one month) = e^(−1) × (1^2 / 2!)
# Plug lambda and k into the equation
exp(-1) * (1^2/factorial(2))
[1] 0.1839397
# Shortcut (dpois)
dpois(x=2,lambda=1)
[1] 0.1839397
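A cumulative question, such as the chance of 2 or more earthquakes in a month, uses the upper tail of the Poisson CDF via `ppois()`:

```r
# P(k earthquakes in a month) for k = 0..5, with lambda = 1
dpois(0:5, lambda = 1)

# P(2 or more earthquakes in a month): use the upper tail of the CDF
p_two_or_more <- ppois(1, lambda = 1, lower.tail = FALSE)
p_two_or_more
# [1] 0.2642411
```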
How many red cards do we expect in an English Premier League game?
library(ggplot2)
num_rc <- 0:10
p_num_rc <- dpois(0:10, lambda=0.11)
rc_prob = data.frame(num_rc, p_num_rc)
ggplot(rc_prob, aes(x=as.factor(num_rc), y=p_num_rc)) +
geom_col()
Average of 0.11 red cards per game in the 2018/2019 season.
How many home runs do we expect in a Major League Baseball game?
library(ggplot2)
num_hr <- 0:10
p_num_hr <- dpois(0:10, lambda=1.26)
home_run_prob = data.frame(num_hr, p_num_hr)
ggplot(home_run_prob, aes(x=as.factor(num_hr), y=p_num_hr)) +
geom_col()
The average number of home runs in 2017 was 1.26 (a record).
Luria–Delbrück experiment
How do phage-resistant bacteria colonies arise?
Mutation rates at autosomal nucleotide sites are roughly 10^-9 per year.
Consider a position in your genome. If you could trace its ancestry back across the last 10^9 years, what is the probability that you would find no mutations?
The expected number of mutations is λ = ut, where u = 10^-9 and t = 10^9. Thus, λ = 1. The probability of no mutations at the site you chose is:
1^0 * exp(-1) / factorial(0)
[1] 0.3678794
dpois(x=0,lambda=1)
[1] 0.3678794
What about the probability of one or more mutations?
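The probability of one or more mutations is just the complement of "no mutations":

```r
# probability of one or more mutations at the site:
# the complement of "no mutations", still with lambda = 1
p_one_or_more <- 1 - dpois(x = 0, lambda = 1)
p_one_or_more
# [1] 0.6321206
```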
Alan Rogers
Is the number of chocolate chunks in a cookie a Poisson random variable?
Count the visible chunks in your cookie and enter the count.
Load our observed data into R:
1. Install the "datapasta" package:
   install.packages("datapasta")
   library(datapasta)
2. Copy the data from Google Sheets.
3. Use the "Addins" dropdown to paste as a tribble.
4. Create a data frame from the generated code.
Plot our observed data in R:
5. Look at the frequencies with table().
6. Plot the number of chunks per cookie:
   library(ggplot2)
   ggplot(chunks, aes(x=num_chunks)) +
     geom_bar()
chunks <- tibble::tribble(
~unid, ~num_chunks,
"u1007787", 0,
"u1007788", 0,
"u1007789", 2,
"u1007790", 0,
"u1007791", 1,
"u1007792", 1,
"u1007793", 0,
"u1007794", 3,
"u1007795", 0,
"u1007796", 0,
"u1007797", 0,
"u1007798", 2,
"u1007799", 1,
"u1007800", 1,
"u1007801", 1,
"u1007802", 1,
"u1007803", 0,
"u1007804", 1,
"u1007805", 1
)
4. Create a tibble from the generated code (above is an example).
5. Look at the frequencies with table():
   table(chunks$num_chunks)
Is the number of chocolate chunks in a cookie a Poisson random variable?
install.packages("vcd")
library("vcd")
gf <- goodfit(chunks$num_chunks, "poisson")
rootogram(gf)
Too few cookies with 1 or 2 chunks
Too many cookies with 3 or 4 chunks
Control: what does the fit look like with data simulated from a Poisson random variable?
# simulate 1000 cookies with a Poisson
# distributed number of chunks. The mean
# is 3 chunks per cookie
sim_chunks <- rpois(1000, lambda=3)
gf2 <- goodfit(sim_chunks, "poisson")
rootogram(gf2)
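A quick sanity check on the simulated control: for a Poisson random variable the mean and the variance both equal lambda, so the two estimates should land near 3. A sketch (sim_chunks is re-simulated here, with an arbitrary seed, so the block is self-contained):

```r
# for a Poisson random variable the mean and the variance are both
# lambda, so both estimates below should land near 3
set.seed(1)
sim_chunks <- rpois(1000, lambda = 3)
mean(sim_chunks)
var(sim_chunks)
```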
Drop-seq (single cell sequencing in droplets)
Goal: one cell per droplet
Should one shoot for an average of one cell per droplet (lambda = 1)?
Let's simulate.
sc_sim <- rpois(2000, lambda=1)
barplot(table(sc_sim), col="dodgerblue")
Lots of droplets with >1 cell: "doublets", "triplets". What fraction of droplets are expected to be this way?
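That fraction can be read directly off the upper tail of the Poisson CDF. A sketch for λ = 1 and for the two smaller loading rates tried next (0.5 and 0.1):

```r
# expected fraction of droplets containing 2+ cells
# ("doublets", "triplets", ...) at three loading rates
lams <- c(1, 0.5, 0.1)
frac_multi <- ppois(1, lambda = lams, lower.tail = FALSE)
round(frac_multi, 4)
# [1] 0.2642 0.0902 0.0047
```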
What about (lambda = 0.5)?
sc_sim <- rpois(2000, lambda=0.5)
barplot(table(sc_sim), col="chartreuse4")
Better, but still many droplets with >1 cell.
What about (lambda = 0.1)?
sc_sim <- rpois(2000, lambda=0.1)
barplot(table(sc_sim), col="darkred")
> dpois(x=1, lambda=0.1)
[1] 0.09048374
> 0.09048374 * 2000
[1] 180.9675
Goal: one cell and one bead per droplet.
Lambda = 0.1 to minimize >1 cell per droplet:
> dpois(x=1, lambda=0.1)
[1] 0.09048374
Lambda = 0.1 to minimize >1 bead per droplet:
> dpois(x=1, lambda=0.1)
[1] 0.09048374
0.09048374 × 0.09048374 ≈ 0.0082: the expected fraction of droplets with exactly one cell and one bead.
[Figure: bulk RNA expression — single-cell RNAseq vs. bulk RNAseq]
Poisson in differential (bulk) RNA expression
# Simulate 10,000 genes
N_genes = 10000
# Simulate a range of 10,000 lambdas to reflect the mean
# expression of each of the 10,000 genes:
# 10,000 evenly-spaced numbers between 1 and 1024
lambdas = seq(1, 1024, len=N_genes)
# Simulate two technical replicates for an RNA-seq
# experiment. The observed expression for each of the
# 10,000 genes will be sampled from a Poisson distribution
# using the same lambda assigned to the gene for each
# replicate.
rep1 = rpois(N_genes, lambdas)
rep2 = rpois(N_genes, lambdas)
# remove genes where the expression was 0 in either rep.
non_zero = which(rep1>0 & rep2>0)
lambdas = lambdas[non_zero]
rep1 = rep1[non_zero]
rep2 = rep2[non_zero]
# make a data frame of lambdas and expression from replicates
rna_sim = data.frame(lambdas, rep1, rep2)
Sources of variance. Poisson (counting) noise
[Figure: Poisson distribution — expected frequency vs. number of events]
# plot the expression ratio from the two replicates as a function
# of the mean expression for the gene (lambda);
# plot in log2 space for better separation of the data
library(ggplot2)
ggplot(rna_sim, aes(x=lambdas, y=log2(rep1/rep2))) + geom_point(size=0.5)
Far greater random or "technical" variation in genes with low expression
http://michelebusby.tumblr.com/post/26913184737/thinking-about-designing-rna-seq-experiments-to
[Figure: read counts for Gene A, Gene B, and Gene C in control vs. experimental samples]
In each case, there is an apparent 2X difference in the mean read counts between control and experimental.
Yet Poisson variance (noise) is higher relative to the total count when counts are low than when they are high.
For example, the difference in expression of a gene measured with one read versus two reads is inherently less certain than the differences in expression of a gene measured with 100 reads versus 200 reads, even though both differences are nominally a 2X fold change.
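This intuition follows from the Poisson mean-variance relationship: the standard deviation is sqrt(lambda), so the relative noise is 1/sqrt(lambda) and shrinks as counts grow. A quick sketch with the counts from the example:

```r
# relative Poisson noise (coefficient of variation) is
# sd / mean = sqrt(lambda) / lambda = 1 / sqrt(lambda)
counts <- c(1, 2, 100, 200)
cv <- sqrt(counts) / counts
round(cv, 3)
# [1] 1.000 0.707 0.100 0.071
```

At 1-2 reads the Poisson noise is on the order of the signal itself; at 100-200 reads it is under 10% of it.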
Thank you.
Multiple testing
In biology, the problem is worse: 20,000 genes to test for rejection of the null hypothesis instead of just 20 jelly bean colors.
The p-value histogram
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DESeq2")
BiocManager::install("airway")
library("DESeq2")
library("dplyr")
library("airway")
data("airway")
aw = DESeqDataSet(se = airway, design = ~ cell + dex)
aw = DESeq(aw)
awde = as.data.frame(results(aw)) %>% dplyr::filter(!is.na(pvalue))
* Airway dataset: gene expression measurements (counts) of four primary human airway smooth muscle cell lines with and without treatment with dexamethasone
library("ggplot2")
ggplot(awde, aes(x = pvalue)) +
geom_histogram(binwidth = 0.025, boundary = 0)
The histogram is a mixture of p-values from tests where the null hypothesis is TRUE and from tests where the null hypothesis is FALSE.
Suppose we reject the null for all tests with a p-value less than alpha=0.025
alpha = binw = 0.025
pi0 = 2 * mean(awde$pvalue > 0.5)
ggplot(awde,
aes(x = pvalue)) + geom_histogram(binwidth = binw, boundary = 0) +
geom_hline(yintercept = pi0 * binw * nrow(awde), col = "blue") +
geom_vline(xintercept = alpha, col = "red")
Mostly true positives vs. mostly false positives.
FDR estimate "by eye": ~950/4700 ≈ 0.20. Too high?
The Benjamini-Hochberg algorithm for controlling the FDR
Arrange all of your individual P values in order, from smallest to largest. The smallest P value has a rank of i=1, the next smallest has i=2, etc.
Compare each individual P value to its Benjamini-Hochberg critical value, (i/m)Q, where i is the rank, m is the total number of tests, and Q is the false discovery rate you choose.
The largest P value that has P<(i/m)Q is significant, and all of the P values smaller than it are also significant, even the ones that aren't less than their Benjamini-Hochberg critical value.
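The three steps above can be sketched directly in R and checked against the built-in p.adjust(..., method = "BH"). The p-values here are toy values invented for illustration:

```r
# toy p-values (hypothetical, for illustration only), already sorted
p <- c(0.001, 0.012, 0.020, 0.040, 0.100, 0.200, 0.500, 0.740, 0.900, 0.950)
Q <- 0.25              # chosen false discovery rate
m <- length(p)         # total number of tests
i <- rank(p)           # rank of each p-value

# step 2: compare each p-value to its critical value (i/m)Q
crit <- (i / m) * Q

# step 3: the largest p with p < (i/m)Q, and everything smaller, is significant
signif_bh <- p <= max(p[p < crit])
signif_bh

# agrees with R's built-in BH-adjusted p-values at the same threshold
signif_padjust <- p.adjust(p, method = "BH") <= Q
all(signif_bh == signif_padjust)  # TRUE
```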
The Benjamini-Hochberg algorithm for controlling the FDR. An example with FDR set to 0.25
http://www.biostathandbook.com/multiplecomparisons.html
* p-values are associations of 25 dietary variables with mammographic density, a risk factor for breast cancer
Apply B-H
m = 25
Q = 0.25
The largest P value that has P<(i/m)Q is significant
Apply B-H ranking to the airway p-values.
This time, set FDR to 0.10
fdr = 0.10
awde = mutate(awde, rank = rank(pvalue))
m = nrow(awde)
ggplot(dplyr::filter(awde, rank <= 7000), aes(x = rank, y = pvalue)) +
geom_line() + geom_abline(slope = fdr / m, col = "red")
← Reject the null: ~10% of these significant tests will be false positives.
DESeq2 uses B-H to create "padj"
> head(awde)
baseMean log2FoldChange lfcSE stat pvalue padj rank
1 708.6021697 0.38125398 0.10065441 3.7877523 1.520163e-04 1.283630e-03 2135
2 520.2979006 -0.20681260 0.11221865 -1.8429433 6.533729e-02 1.965461e-01 6097
3 237.1630368 -0.03792043 0.14344465 -0.2643558 7.915057e-01 9.114595e-01 22414
4 57.9326331 0.08816818 0.28714182 0.3070545 7.588019e-01 8.950328e-01 21579
5 0.3180984 1.37822703 3.49987280 0.3937935 6.937335e-01 NA 19261
6 5817.3528677 -0.42640213 0.08831339 -4.8282840 1.377146e-06 1.821511e-05 1363
Q-value versus adjusted p-values
In 1995, Yoav Benjamini and Yosef Hochberg proposed controlling the false discovery rate (FDR) as a more statistically powerful alternative to controlling the FWER in multiple hypothesis testing.[2] The pFDR and the q-value were introduced by John D. Storey in 2002 in order to improve upon a limitation of the FDR, namely that the FDR is not defined when there are no positive results.[1][4]
https://en.wikipedia.org/wiki/Q-value_(statistics)