1 of 43

Meeting #12. Gaussian distributions and QQ plots

Aaron Quinlan

September 16, 2019

bit.ly/sllobs

2 of 43

Gaussian distributions and Quantile-Quantile (QQ) plots

Applied Computational Genomics

https://github.com/quinlan-lab/applied-computational-genomics

Aaron Quinlan

Departments of Human Genetics and Biomedical Informatics

USTAR Center for Genetic Discovery

University of Utah

quinlanlab.org

3 of 43

Discrete v. continuous random variables

Continuous random variables (e.g., Gaussian, Student's t, chi-squared):

  • Your exact height
  • Your dog's exact weight
  • The winning time in a race
  • Exact distance between stars
  • Your exact age
  • Time it takes a computer to complete a task
  • Infinite number of possible values

Discrete random variables (e.g., Poisson, binomial):

  • The number of lightbulbs that burn out in a warehouse per week
  • The number of heads when flipping a coin 10 times
  • The number of green (i.e., best!) M&Ms in a bag
  • The number of times a given genome nucleotide is sequenced
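
A minimal R sketch (an added illustration, not from the slides). Continuous draws can take any real value; discrete draws land only on whole numbers:

set.seed(42)
rnorm(3, mean = 170, sd = 10)     # continuous: e.g., heights in cm
rpois(3, lambda = 30)             # discrete: e.g., sequencing depth at a nucleotide
rbinom(3, size = 10, prob = 0.5)  # discrete: e.g., heads in 10 coin flips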

4 of 43

Johann Carl Friedrich Gauss

(30 April 1777 – 23 February 1855)

  • Prolific German mathematician.
  • Child prodigy
  • At age 7, was asked "What is the sum of 1 through 100?" Quickly answered 5050 without writing anything down...
  • Numerous contributions to many different areas of mathematics.
  • Listen to this podcast for more: https://www.bbc.co.uk/programmes/b09gbnfj

5 of 43

Gaussian distributions

Two parameters: mean (𝜇) and standard deviation (𝜎)
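
For reference, the Gaussian probability density function in its standard form (this is what R's dnorm() evaluates):

f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)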

6 of 43

Gaussian: randomness from many small effects

Francis Galton

1822 – 1911

  • Galton conceived of standard deviation to quantify variation.
  • Showed that the median of >800 participants' guesses came remarkably close to the exact weight of an ox after it had been slaughtered and dressed.
  • An early proponent of the study of eugenics.
  • Invented the "bean machine" (aka the "quincunx"), a pachinko-like device to illustrate the many small random effects that lead to the normal distribution.

7 of 43

The bean machine (a simulation)

# Simulate Galton's bean machine with the animation package
library(animation)

set.seed(123)
ani.options(nmax = 200 + 15 - 2)  # animation frames: balls + layers - 2
freq <- quincunx(balls = 200, col.balls = rainbow(200))

8 of 43

Gaussian ("Normal") distribution

library(ggplot2)
library(cowplot)

x <- seq(-4, 4, length = 100)
y <- dnorm(x, mean = 0, sd = 1)
gauss <- data.frame(x, y)

ggplot(gauss, aes(x, y)) +
  geom_line() +
  theme_half_open()

9 of 43

Gaussian ("Normal") distribution

library(ggplot2)
library(cowplot)

x <- seq(-4, 4, length = 100)
y <- dnorm(x, mean = 0, sd = 1)
gauss <- data.frame(x, y)

ggplot(gauss, aes(x, y)) +
  geom_line() +
  geom_area(stat = "function", fun = dnorm,
            fill = "#00998a", alpha = 1,
            xlim = c(-1, 1))

68.3% of the data are within +/- 1 standard deviation of the mean

10 of 43

Gaussian ("Normal") distribution

library(ggplot2)
library(cowplot)

x <- seq(-4, 4, length = 100)
y <- dnorm(x, mean = 0, sd = 1)
gauss <- data.frame(x, y)

ggplot(gauss, aes(x, y)) +
  geom_line() +
  geom_area(stat = "function", fun = dnorm,
            fill = "#00998a", alpha = 1.0,
            xlim = c(-1, 1)) +
  geom_area(stat = "function", fun = dnorm,
            fill = "#00998a", alpha = 0.50,
            xlim = c(-2, 2))

95.4% of the data are within +/- 2 standard deviations of the mean

11 of 43

Gaussian ("Normal") distribution

Z-scores (aka standard scores) are simply the number of standard deviations from the mean. They also have predictable areas under the curve (AUC). More later.

library(ggplot2)
library(cowplot)

x <- seq(-4, 4, length = 100)
y <- dnorm(x, mean = 0, sd = 1)
gauss <- data.frame(x, y)

ggplot(gauss, aes(x, y)) +
  geom_line() +
  geom_area(stat = "function", fun = dnorm,
            fill = "#00998a", alpha = 1.0,
            xlim = c(-1, 1)) +
  geom_area(stat = "function", fun = dnorm,
            fill = "#00998a", alpha = 0.50,
            xlim = c(-2, 2)) +
  geom_area(stat = "function", fun = dnorm,
            fill = "#00998a", alpha = 0.25,
            xlim = c(-3, 3))

99.7% of the data are within +/- 3 standard deviations of the mean
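
A quick sketch of the z-score calculation (the numbers here are hypothetical, not from the slides):

# z-score: how many standard deviations an observation lies from the mean
x     <- 72      # an observed height in inches (made-up value)
mu    <- 67.7    # assumed population mean
sigma <- 2.74    # assumed population standard deviation
(x - mu) / sigma # ~1.57: about 1.57 s.d. above the mean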

12 of 43

What is the absolute probability of a baby being exactly 7 pounds (3175.15 grams) at birth?

13 of 43

What is the absolute probability of a baby being exactly 7 pounds (3175.15 grams) at birth?

0
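
Why zero? For a continuous random variable, probability is area under the density curve, and a single exact value is an interval of zero width. A quick sanity check in R (using the birthweight mean and s.d. from the next slide):

# P(X = 3175.15) = P(X <= 3175.15) - P(X <= 3175.15)
pnorm(3175.15, mean = 3586, sd = 460) - pnorm(3175.15, mean = 3586, sd = 460)
# [1] 0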

14 of 43

What is the relative likelihood of a baby being exactly 7 pounds (3175.15 grams) at birth?

library(ggplot2)
library(cowplot)

weight <- seq(2000, 6000, length = 10000)

# PDF
density <- dnorm(weight, mean = 3586, sd = 460)
gauss <- data.frame(weight, density)

ggplot(gauss, aes(weight, density)) +
  geom_line()

Probability Density Function

15 of 43

Probability Density Function (PDF) versus

Cumulative Distribution Function (CDF)

library(ggplot2)
library(cowplot)

weight <- seq(2000, 6000, length = 10000)

# PDF
density <- dnorm(weight, mean = 3586, sd = 460)
pdf <- data.frame(weight, density)
ggplot(pdf, aes(weight, density)) +
  geom_line()

# CDF
p <- pnorm(weight, mean = 3586, sd = 460, lower.tail = TRUE)
cdf <- data.frame(weight, p)
ggplot(cdf, aes(weight, p)) +
  geom_line()

16 of 43

Probability Density Function (PDF) versus

Cumulative Distribution Function (CDF)

library(ggplot2)
library(cowplot)

weight <- seq(2000, 6000, length = 10000)

# PDF
density <- dnorm(weight, mean = 3586, sd = 460)
pdf <- data.frame(weight, density)
ggplot(pdf, aes(weight, density)) +
  geom_line()

# CDF
p <- pnorm(weight, mean = 3586, sd = 460, lower.tail = TRUE)
cdf <- data.frame(weight, p)
ggplot(cdf, aes(weight, p)) +
  geom_line()

# Shade the area under the PDF below 3000 grams
curve_area <- function(weight) dnorm(weight, mean = 3586, sd = 460)
gauss <- data.frame(weight, density)
ggplot(gauss, aes(weight, density)) +
  geom_line() +
  geom_area(stat = "function", fun = curve_area,
            fill = "#00998a", alpha = 1.0,
            xlim = c(2000, 3000))

P(X < 3000)

pnorm(3000, mean = 3586, sd = 460, lower.tail = TRUE)
[1] 0.1013471

17 of 43

Probability Density Function (PDF) versus

Cumulative Distribution Function (CDF)

P(X < 3586)

pnorm(3586, mean = 3586, sd = 460, lower.tail = TRUE)
[1] 0.5

library(ggplot2)
library(cowplot)

weight <- seq(2000, 6000, length = 10000)

# PDF
density <- dnorm(weight, mean = 3586, sd = 460)
pdf <- data.frame(weight, density)
ggplot(pdf, aes(weight, density)) +
  geom_line()

# CDF
p <- pnorm(weight, mean = 3586, sd = 460, lower.tail = TRUE)
cdf <- data.frame(weight, p)
ggplot(cdf, aes(weight, p)) +
  geom_line()

# Shade the area under the PDF below the mean (3586 grams)
curve_area <- function(weight) dnorm(weight, mean = 3586, sd = 460)
gauss <- data.frame(weight, density)
ggplot(gauss, aes(weight, density)) +
  geom_line() +
  geom_area(stat = "function", fun = curve_area,
            fill = "#00998a", alpha = 1.0,
            xlim = c(2000, 3586))

18 of 43

Probability Density Function (PDF) versus

Cumulative Distribution Function (CDF)

P(X < 5000)

pnorm(5000, mean = 3586, sd = 460, lower.tail = TRUE)
[1] 0.9989436

library(ggplot2)
library(cowplot)

weight <- seq(2000, 6000, length = 10000)

# PDF
density <- dnorm(weight, mean = 3586, sd = 460)
pdf <- data.frame(weight, density)
ggplot(pdf, aes(weight, density)) +
  geom_line()

# CDF
p <- pnorm(weight, mean = 3586, sd = 460, lower.tail = TRUE)
cdf <- data.frame(weight, p)
ggplot(cdf, aes(weight, p)) +
  geom_line()

19 of 43

What is the probability of a birthweight being greater than or equal to 4000 grams?

# flip to the upper tail of the distribution
pnorm(4000, mean = 3586, sd = 460, lower.tail = FALSE)
[1] 0.1840601

# or use the complement rule
1 - pnorm(4000, mean = 3586, sd = 460, lower.tail = TRUE)
[1] 0.1840601

20 of 43

What is the probability of a birthweight being between

3000 and 4000 grams?

pnorm(4000, mean = 3586, sd = 460, lower.tail = TRUE)
[1] 0.8159399

pnorm(3000, mean = 3586, sd = 460, lower.tail = TRUE)
[1] 0.1013471

# subtract the two CDF probabilities
pnorm(4000, mean = 3586, sd = 460, lower.tail = TRUE) -
  pnorm(3000, mean = 3586, sd = 460, lower.tail = TRUE)
[1] 0.7145928

21 of 43

Galton and the heredity of height

In 1885, Francis Galton began studies of human heredity. He was specifically interested in the relationship between parental heights and the heights of their offspring. He collected data on 928 adult children and their parents.

This analysis gave rise to the notion of regression to the mean (more on this later).

22 of 43

Are paternal heights normally distributed?

# If needed, install the package first:
# install.packages("UsingR", dependencies = TRUE)
library(UsingR)
library(ggplot2)
library(cowplot)

data("father.son")
head(father.son)
   fheight  sheight
1 65.04851 59.77827
2 63.25094 63.21404
3 64.95532 63.34242
4 65.75250 62.79238
5 61.13723 64.28113
6 63.02254 64.24221

mean(father.son$fheight)
[1] 67.6871

sd(father.son$fheight)
[1] 2.744868

ggplot(father.son) +
  geom_histogram(aes(fheight)) +
  geom_vline(xintercept = mean(father.son$fheight),  # overlay the mean paternal height
             col = "orange", size = 2)

23 of 43

Quantile-quantile plots (QQ-plots) can help to assess how well data fit a theoretical distribution (e.g., Gaussian)

24 of 43

But what is a quantile?

Values that divide a frequency distribution into groups, each containing an equal number of values:

  • quartiles (4 groups)
  • quintiles (5 groups)
  • deciles (10 groups)
  • vigintiles (20 groups)
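
In R, quantile() computes these cut points directly. A quick illustration (simulated data, not from the slides):

set.seed(1)
x <- rnorm(1000)
quantile(x, probs = c(0.25, 0.50, 0.75))      # quartiles
quantile(x, probs = seq(0.1, 0.9, by = 0.1))  # deciles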

25 of 43

What is the purpose of a QQ plot?

How similar are the quantiles of my dataset to the quantiles it would have if it followed a given theoretical probability distribution?

26 of 43

Q-Q plots from StatQuest

27 of 43

For example, we may want to know if an observed data set is normally distributed. Why?

28 of 43

For example, we may want to know if an observed data set is normally distributed. Why?

Knowing the underlying distribution gives us convenient tools for computing probabilities, Z-scores, p-values, etc. directly from that distribution. In other words, it makes life simpler.
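
For instance (a sketch, not from the slides): once a normal fit is justified, a two-sided p-value for an observation drops straight out of pnorm().

z <- 2.5                               # an observation 2.5 s.d. from the mean
2 * pnorm(abs(z), lower.tail = FALSE)  # two-sided tail area, ~0.0124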

29 of 43

Making a Q-Q plot.

observed <- c(7.19, 6.31, 5.89, 4.5, 3.77, 4.25, 5.19, 5.79, 6.79)
sort(observed)
[1] 3.77 4.25 4.50 5.19 5.79 5.89 6.31 6.79 7.19

https://www.statisticshowto.datasciencecentral.com/q-q-plots/

Step 1: order the observed data from smallest to largest

30 of 43

Making a Q-Q plot.

sort(observed)
[1] 3.77 4.25 4.50 5.19 5.79 5.89 6.31 6.79 7.19

https://www.statisticshowto.datasciencecentral.com/q-q-plots/

Step 2: "Draw" a "standard normal" (mean=0, sd=1) distribution curve. Divide the curve into n+1 segments. We have 9 values, so divide the curve into 10 equally-sized areas. For this example, each segment is 10% of the area (because 100% / 10 = 10%).

31 of 43

Making a Q-Q plot.

https://www.statisticshowto.datasciencecentral.com/q-q-plots/

Step 3: Find the z-value (cut-off point) for each segment in Step 2. These segments are areas, so refer to a z-table (or use software) to get a z-value for each segment.

sort(observed)
[1] 3.77 4.25 4.50 5.19 5.79 5.89 6.31 6.79 7.19

Z-values:
  10% = -1.28
  20% = -0.84
  30% = -0.52
  40% = -0.25
  50% = 0
  60% = 0.25
  70% = 0.52
  80% = 0.84
  90% = 1.28
  100% = 3.0

[Figure: the standard normal curve divided into ten segments, each covering 10% of the area under the curve (AUC)]
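
In practice, software replaces the z-table: qnorm() inverts the standard normal CDF and reproduces the cut-offs above.

qnorm(seq(0.1, 0.9, by = 0.1))
# [1] -1.28 -0.84 -0.52 -0.25  0.00  0.25  0.52  0.84  1.28  (rounded)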

32 of 43

Digression: z-scores and z-tables

http://users.stat.ufl.edu/~athienit/Tables/Ztable.pdf

33 of 43

Digression: z-scores and z-tables

http://users.stat.ufl.edu/~athienit/Tables/Ztable.pdf

Z-values:
  10% = -1.28
  20% = -0.84
  30% = -0.52
  40% = -0.25
  50% = 0
  60% = 0.25
  70% = 0.52
  80% = 0.84
  90% = 1.28
  100% = 3.0
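
The printed table maps a z-value to the area under the curve to its left; pnorm() performs the same lookup:

pnorm(1.28)   # ~0.8997: area to the left of z = 1.28
pnorm(-1.28)  # ~0.1003: area to the left of z = -1.28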

34 of 43

Making a Q-Q plot.

https://www.statisticshowto.datasciencecentral.com/q-q-plots/

Step 4: Plot your sorted data set values (Step 1) against your normal distribution cut-off points (Step 3).

sort(observed)
[1] 3.77 4.25 4.50 5.19 5.79 5.89 6.31 6.79 7.19

zvals <- c(-1.28, -0.84, -0.52, -0.25, 0, 0.25, 0.52, 0.84, 1.28, 3.0)

plot(zvals[1:9], sort(observed),
     xlab = "Theoretical quantiles (z-scores)",
     ylab = "Observed quantiles")
abline(lm(sort(observed) ~ zvals[1:9]))

35 of 43

What might we conclude?

  • Our observed data are well correlated with what we would expect if they were generated by a normal distribution.
  • It isn't perfect…why?

36 of 43

What do we not know?

  • The Q-Q plot is non-parametric.
  • Thus, we don't know the parameters of the underlying normal distribution, i.e., the mean (𝛍) and standard deviation (𝛔).
  • Is there a theoretical distribution that fits better than the normal? One can compare against other theoretical distributions with qqplot() (see the sketch below).
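
For example (a sketch; the t-distribution and df = 5 are arbitrary illustrative choices), the same observed data can be compared against Student's t quantiles:

theoretical <- qt(ppoints(length(observed)), df = 5)  # theoretical t quantiles
qqplot(theoretical, observed,
       xlab = "Theoretical t quantiles",
       ylab = "Observed quantiles")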

37 of 43

R makes Q-Q plots automatically (qqnorm)

qqnorm(sort(observed))

38 of 43

R makes Q-Q plots automatically (qqnorm)

qqnorm(sort(observed))
qqline(sort(observed))

39 of 43

Are paternal heights normally distributed?

library(UsingR)
library(ggplot2)
library(cowplot)

data("father.son")
head(father.son)
   fheight  sheight
1 65.04851 59.77827
2 63.25094 63.21404
3 64.95532 63.34242
4 65.75250 62.79238
5 61.13723 64.28113
6 63.02254 64.24221

ggplot(father.son) +
  geom_histogram(aes(fheight))

40 of 43

Let's make a QQ plot

qqnorm(sort(father.son$fheight))
qqline(sort(father.son$fheight))

41 of 43

A positive control experiment: simulate data (N=1000) from a normal distribution.

sim <- rnorm(1000, mean = 67.68, sd = 2.74)
qqnorm(sim)
qqline(sim)

42 of 43

A negative control: compare values from a Chi-squared distribution. What do we expect the Q-Q plot to show?

sim <- rchisq(1000, 3)
qqnorm(sim)
qqline(sim)

43 of 43

Thank you.

More on Q-Q plots:

https://data.library.virginia.edu/understanding-q-q-plots/

Next lecture will be on the Central Limit Theorem and Confidence Intervals. It will likely be Sept. 30.

Please read "Importance of being uncertain" beforehand.

https://www.nature.com/articles/nmeth.2613