Meeting #12. Gaussian distributions and QQ plots
Aaron Quinlan
September 16, 2019
bit.ly/sllobs
Gaussian distributions and Quantile-Quantile (QQ) plots
Applied Computational Genomics
https://github.com/quinlan-lab/applied-computational-genomics
Aaron Quinlan
Departments of Human Genetics and Biomedical Informatics
USTAR Center for Genetic Discovery
University of Utah
quinlanlab.org
Discrete v. continuous random variables
Continuous random variables (e.g., Gaussian, Student's t, chi-squared):
Discrete random variables (e.g., Poisson, binomial):
Johann Carl Friedrich Gauss
(30 April 1777 – 23 February 1855)
Gaussian distributions
Two parameters: mean (𝜇) and standard deviation (𝜎)
Gaussian: randomness from many small effects
Francis Galton
1822 – 1911
The bean machine (a simulation)
library(animation)
set.seed(123)
# number of animation frames: balls + layers - 2 (see ?quincunx)
ani.options(nmax = 200 + 15 - 2)
freq <- quincunx(balls = 200, col.balls = rainbow(200))
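If the animation package is unavailable, a minimal non-animated sketch (an addition here, base R only) shows the same idea: summing many small, independent left/right nudges produces an approximately Gaussian total.
# each of 10,000 "balls" is the sum of 15 random +/-1 nudges
set.seed(123)
totals <- replicate(10000, sum(sample(c(-1, 1), size = 15, replace = TRUE)))
hist(totals, breaks = 30,
     main = "Sum of many small effects", xlab = "Total displacement")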
Gaussian ("Normal") distribution
library(ggplot2)
library(cowplot)
x = seq(-4, 4, length=100)
y = dnorm(x, mean = 0, sd = 1)
gauss <- data.frame(x, y)
ggplot(gauss, aes(x, y)) +
geom_line() +
theme_half_open()
Gaussian ("Normal") distribution
library(ggplot2)
library(cowplot)
x = seq(-4, 4, length=100)
y <- dnorm(x, mean = 0, sd = 1)
gauss <- data.frame(x, y)
ggplot(gauss, aes(x, y)) +
geom_line() +
geom_area(stat = "function",
fun = dnorm,
fill = "#00998a",
alpha=1,
xlim = c(-1, 1))
68.3% of the data are within +/- 1 standard deviation of the mean
Gaussian ("Normal") distribution
library(ggplot2)
library(cowplot)
x = seq(-4, 4, length=100)
y <- dnorm(x, mean = 0, sd = 1)
gauss <- data.frame(x, y)
ggplot(gauss, aes(x, y)) +
geom_line() +
geom_area(stat = "function",
fun = dnorm,
fill = "#00998a",
alpha=1.0,
xlim = c(-1, 1)) +
geom_area(stat = "function",
fun = dnorm,
fill = "#00998a",
alpha=0.50,
xlim = c(-2, 2))
95.4% of the data are within +/- 2 standard deviations of the mean
Gaussian ("Normal") distribution
Z-scores (a.k.a. standard scores) are simply the number of standard deviations from the mean. They also have predictable areas under the curve (AUC). More on this later.
library(ggplot2)
library(cowplot)
x = seq(-4, 4, length=100)
y <- dnorm(x, mean = 0, sd = 1)
gauss <- data.frame(x, y)
ggplot(gauss, aes(x, y)) +
geom_line() +
geom_area(stat = "function",
fun = dnorm,
fill = "#00998a",
alpha=1.0,
xlim = c(-1, 1)) +
geom_area(stat = "function",
fun = dnorm,
fill = "#00998a",
alpha=0.50,
xlim = c(-2, 2)) +
geom_area(stat = "function",
fun = dnorm,
fill = "#00998a",
alpha=0.25,
xlim = c(-3, 3))
99.7% of the data are within +/- 3 standard deviations of the mean
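These three coverage figures can be checked directly with pnorm() (a quick sanity check added here, not from the original slides):
# fraction of a standard normal within k standard deviations of the mean
sapply(1:3, function(k) pnorm(k) - pnorm(-k))
# [1] 0.6826895 0.9544997 0.9973002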
What is the absolute probability of a baby being exactly 7 pounds (3175.15 grams) at birth?
0
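Why zero? For a continuous random variable, only intervals carry probability, and a single point has width 0. A small sketch (an addition, borrowing the birthweight model used on the later slides: mean = 3586 g, sd = 460 g):
# probability of a birthweight within +/- 0.01 g of 3175.15 g
pnorm(3175.15 + 0.01, mean = 3586, sd = 460) -
  pnorm(3175.15 - 0.01, mean = 3586, sd = 460)
# ~1.2e-05, and it shrinks to 0 as the interval shrinks to a single point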
What is the relative likelihood of a baby being exactly 7 pounds (3175.15 grams) at birth?
library(ggplot2)
library(cowplot)
weight = seq(2000, 6000, length=10000)
density <- dnorm(weight, mean = 3586, sd = 460)
gauss <- data.frame(weight, density)
ggplot(gauss, aes(weight, density)) +
geom_line()
Probability Density Function
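To put a number on the relative likelihood (an addition here, reusing the mean and sd from the plotting code above):
# density (relative likelihood) at exactly 3175.15 g
dnorm(3175.15, mean = 3586, sd = 460)
# ~0.00058 per gram -- a density, not a probability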
Probability Density Function (PDF) versus
Cumulative Distribution Function (CDF)
library(ggplot2)
library(cowplot)
weight = seq(2000, 6000, length=10000)
density <- dnorm(weight, mean = 3586, sd = 460)
pdf <- data.frame(weight, density)
ggplot(pdf, aes(weight, density)) +
geom_line()
# CDF
p <- pnorm(weight, mean = 3586, sd = 460, lower.tail=TRUE)
cdf <- data.frame(weight, p)
ggplot(cdf, aes(weight, p)) +
geom_line()
Probability Density Function (PDF) versus
Cumulative Distribution Function (CDF)
curve_area <- function(weight) { dnorm(weight, mean = 3586, sd = 460) }
weight <- seq(2000, 6000, length = 10000)
density <- dnorm(weight, mean = 3586, sd = 460)
gauss <- data.frame(weight, density)
ggplot(gauss, aes(weight, density)) +
  geom_line() +
  geom_area(stat = "function",
            fun = curve_area,
            fill = "#00998a",
            alpha = 1.0,
            xlim = c(2000, 3000))
P(X<3000)
pnorm(3000, mean = 3586,
sd = 460, lower.tail=TRUE)
[1] 0.1013471
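As a cross-check (an addition here), the CDF value is literally the area under the density curve up to 3000 g, which can also be computed by numerical integration:
# integrate the PDF from -Inf to 3000; extra arguments are passed to dnorm()
integrate(dnorm, lower = -Inf, upper = 3000, mean = 3586, sd = 460)
# ~0.1013471, matching pnorm(3000, mean = 3586, sd = 460) above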
Probability Density Function (PDF) versus
Cumulative Distribution Function (CDF)
P(X<3586)
pnorm(3586, mean = 3586,
sd = 460, lower.tail=TRUE)
[1] 0.5
curve_area <- function(weight) { dnorm(weight, mean = 3586, sd = 460) }
weight <- seq(2000, 6000, length = 10000)
density <- dnorm(weight, mean = 3586, sd = 460)
gauss <- data.frame(weight, density)
ggplot(gauss, aes(weight, density)) +
  geom_line() +
  geom_area(stat = "function",
            fun = curve_area,
            fill = "#00998a",
            alpha = 1.0,
            xlim = c(2000, 3586))
Probability Density Function (PDF) versus
Cumulative Distribution Function (CDF)
P(X<5000)
pnorm(5000, mean = 3586,
sd = 460, lower.tail=TRUE)
[1] 0.9989436
What is the probability of a birthweight being greater than or equal to 4000 grams?
# use the upper tail of the distribution
pnorm(4000, mean = 3586,
sd = 460, lower.tail=FALSE)
[1] 0.1840601
# or use the complement rule
1 - pnorm(4000, mean = 3586,
sd = 460, lower.tail=TRUE)
[1] 0.1840601
What is the probability of a birthweight being between
3000 and 4000 grams?
pnorm(4000, mean = 3586,
sd = 460, lower.tail=TRUE)
[1] 0.8159399
pnorm(3000, mean = 3586,
sd = 460, lower.tail=TRUE)
[1] 0.1013471
# subtract the two CDF probabilities
pnorm(4000, mean = 3586,
sd = 460, lower.tail=TRUE) -
pnorm(3000, mean = 3586,
sd = 460, lower.tail=TRUE)
[1] 0.7145928
Galton and the heredity of height
In 1885, Francis Galton began studies of human heredity. He was specifically interested in the relationship between parental heights and the heights of their offspring. He collected data on 928 adult children and their parents.
This analysis gave rise to the notion of regression to the mean (more on this later).
Are paternal heights normally distributed?
# if not already installed:
# install.packages("UsingR", dependencies=TRUE)
library(UsingR)
library(ggplot2)
library(cowplot)
data("father.son")
head(father.son)
fheight sheight
1 65.04851 59.77827
2 63.25094 63.21404
3 64.95532 63.34242
4 65.75250 62.79238
5 61.13723 64.28113
6 63.02254 64.24221
mean(father.son$fheight)
[1] 67.6871
sd(father.son$fheight)
[1] 2.744868
ggplot(father.son) +
geom_histogram(aes(fheight)) +
geom_vline(xintercept=mean(father.son$fheight),  # mean of the plotted variable (fheight)
col="orange", size=2)
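To eyeball the fit, the histogram can be overlaid with the Gaussian implied by the sample mean and standard deviation (a sketch added here, assuming ggplot2 >= 3.3 for after_stat()):
ggplot(father.son, aes(fheight)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30) +
  stat_function(fun = dnorm,
                args = list(mean = mean(father.son$fheight),
                            sd = sd(father.son$fheight)),
                col = "orange")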
Quantile-quantile plots (QQ-plots) can help to assess how well data fit a theoretical distribution (e.g., Gaussian)
But what is a quantile?
Values which divide a frequency distribution into groups containing an equal number of values:
quartiles (4 groups), deciles (10 groups),
quintiles (5 groups), vigintiles (20 groups)
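R's built-in quantile() function computes these cut points for any dataset; for example (an illustration added here), the deciles of the integers 1 through 100:
quantile(1:100, probs = seq(0.1, 0.9, by = 0.1))
#  10%  20%  30%  40%  50%  60%  70%  80%  90%
# 10.9 20.8 30.7 40.6 50.5 60.4 70.3 80.2 90.1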
What is the purpose of a QQ plot?
How similar are the quantiles of my dataset to the quantiles it would have if it followed a theoretical probability distribution?
Q-Q plots from StatQuest
For example, we may want to know if an observed data set is normally-distributed. Why?
Knowing the underlying distribution gives us direct access to convenient tools for computing probabilities, Z-scores, p-values, etc. In other words, it makes life simpler.
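For example (a small sketch added here, reusing the birthweight model from earlier: mean = 3586 g, sd = 460 g), assuming normality lets us convert any observation into a z-score and read off a tail probability:
z <- (4500 - 3586) / 460            # z-score of a 4500 g baby, ~1.99
pnorm(z, lower.tail = FALSE)        # upper-tail probability, ~0.023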
Making a Q-Q plot.
observed <- c(7.19, 6.31, 5.89, 4.5, 3.77, 4.25, 5.19, 5.79, 6.79)
sort(observed)
[1] 3.77 4.25 4.50 5.19 5.79 5.89 6.31 6.79 7.19
https://www.statisticshowto.datasciencecentral.com/q-q-plots/
Step 1: order the observed data from smallest to largest
Making a Q-Q plot.
sort(observed)
[1] 3.77 4.25 4.50 5.19 5.79 5.89 6.31 6.79 7.19
https://www.statisticshowto.datasciencecentral.com/q-q-plots/
Step 2: "Draw" a "standard normal" (mean=0, sd=1) distribution curve. Divide the curve into n+1 segments. We have 9 values, so divide the curve into 10 equally-sized areas. For this example, each segment is 10% of the area (because 100% / 10 = 10%).
Making a Q-Q plot.
https://www.statisticshowto.datasciencecentral.com/q-q-plots/
Step 3: Find the z-value (cut-off point) for each segment in Step 2. These segments are areas, so refer to a z-table (or use software) to get a z-value for each segment.
sort(observed)
[1] 3.77 4.25 4.50 5.19 5.79 5.89 6.31 6.79 7.19
Z-values:
10% = -1.28
20% = -0.84
30% = -0.52
40% = -0.25
50% = 0
60% = 0.25
70% = 0.52
80% = 0.84
90% = 1.28
100% = 3.0 (a practical cut-off; the true 100% point is +Inf)
[Figure: standard normal curve divided into ten segments, each covering 10% of the AUC]
Digression: z-scores and
z-tables
http://users.stat.ufl.edu/~athienit/Tables/Ztable.pdf
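The same cut-offs can also be computed directly in R with qnorm() rather than read from a printed z-table (a one-liner added here for reference):
round(qnorm(seq(0.1, 0.9, by = 0.1)), 2)
# [1] -1.28 -0.84 -0.52 -0.25  0.00  0.25  0.52  0.84  1.28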
Making a Q-Q plot.
https://www.statisticshowto.datasciencecentral.com/q-q-plots/
Step 4: Plot your sorted data set values (Step 1) against your normal distribution cut-off points (Step 3).
sort(observed)
[1] 3.77 4.25 4.50 5.19 5.79 5.89 6.31 6.79 7.19
zvals <- c(-1.28, -0.84, -0.52, -0.25, 0, 0.25, 0.52, 0.84, 1.28, 3.0)
plot(zvals[1:9], sort(observed),
     xlab = "Theoretical quantiles (z-scores)", ylab = "Observed quantiles")
abline(lm(sort(observed) ~ zvals[1:9]))
What might we conclude?
What do we not know?
R makes Q-Q plots automatically (qqnorm)
qqnorm(sort(observed))
qqline(sort(observed))
Are paternal heights normally distributed?
ggplot(father.son) +
geom_histogram(aes(fheight))
Let's make a QQ plot
qqnorm(sort(father.son$fheight))
qqline(sort(father.son$fheight))
A positive control experiment: simulate data (N=1000) from a normal distribution.
sim <- rnorm(1000, mean=67.68, sd=2.74)
qqnorm(sim)
qqline(sim)
A negative control: compare values from a Chi-squared distribution. What do we expect the Q-Q plot to show?
sim <- rchisq(1000, 3)
qqnorm(sim)
qqline(sim)
Thank you.
More on Q-Q plots:
https://data.library.virginia.edu/understanding-q-q-plots/
Next lecture will be on the Central Limit Theorem and Confidence Intervals. It will likely be Sept. 30.
Please read "Importance of being uncertain" beforehand.
https://www.nature.com/articles/nmeth.2613