LECTURE 2
CENTRAL TENDENCY + DATA SPREAD
DATA EXPLORATION
THE NORMAL DISTRIBUTION
CENTRAL LIMIT THEOREM
DATA VISUALIZATION
MEAN
It is... The average value

MEDIAN
It is... The middle value
Calculated by... If n is odd, the median is the middle value. If n is even, the median is the average of the two central values.

MODE
It is... The most frequent value
Calculated by... Whichever value appears most frequently in your samples. There can be multiple modes.
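A minimal sketch of the three measures, using Python's standard statistics module. The sample values are made up for illustration, and the mean is taken to be the sum of the observations divided by n.

```python
import statistics

# Hypothetical sample of 8 observations (made up for illustration)
sample = [2, 3, 3, 5, 7, 8, 8, 12]

mean_value = statistics.mean(sample)      # sum of the values divided by n
median_value = statistics.median(sample)  # n is even: average of the two central values
modes = statistics.multimode(sample)      # every value tied for most frequent

print(mean_value, median_value, modes)
```

Here two values tie for most frequent, so the sample has two modes.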
Sample variance is a more robust measure of data spread because it is not so highly influenced by outliers and takes into account all observations. In words, variance is the mean squared distance of observations from the sample mean.
Mathematically:
Units of variance are the units of the observation squared, [x_i]^2
VARIANCE
sample
STANDARD DEVIATION
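A minimal sketch of sample variance and standard deviation. It assumes the conventional sample formula, s^2 = sum of (x_i - mean)^2 divided by (n - 1), since the slide's own formula did not survive extraction. The sample values are made up.

```python
import math
import statistics

# Hypothetical sample (made up for illustration)
sample = [4.0, 7.0, 9.0, 10.0, 15.0]

n = len(sample)
sample_mean = sum(sample) / n

# Sample variance: squared distances from the sample mean, summed,
# then divided by n - 1 (assumed conventional sample formula)
sample_variance = sum((x - sample_mean) ** 2 for x in sample) / (n - 1)

# Standard deviation: square root of the variance, back in the original units
sample_sd = math.sqrt(sample_variance)

print(sample_variance, sample_sd)
print(statistics.variance(sample), statistics.stdev(sample))  # library equivalents
```

The library functions statistics.variance and statistics.stdev use the same n - 1 denominator, so both lines should print the same values.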
From histograms to probability density
(board drawing: probability density curve; x-axis: Random variable; y-axis: Probability Density; total area under the curve A = 1)
What is the probability of randomly selecting a fish that is…
Fish Length (inches)    Area under the probability density curve (shown in gray)
[1 – 4)                 0.09
[4 – 6)                 0.13
[6 – 9)                 0.31
[9 – 13)                0.40
[13 – 17)               0.07
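A minimal sketch of how the gray areas answer probability questions: the probability of landing in a range is the sum of the areas over that range. It assumes the five areas pair with the five length intervals in the order listed. The two example questions are hypothetical, since the slide's own questions are not shown.

```python
# Area (probability) for each fish-length interval in inches, [low, high)
areas = {
    (1, 4): 0.09,
    (4, 6): 0.13,
    (6, 9): 0.31,
    (9, 13): 0.40,
    (13, 17): 0.07,
}

def probability_between(low, high):
    """Sum the areas of all intervals lying entirely within [low, high)."""
    return sum(p for (a, b), p in areas.items() if a >= low and b <= high)

total_area = probability_between(1, 17)     # whole density, should be 1
shorter_than_6 = probability_between(1, 6)  # hypothetical question: under 6 inches
at_least_9 = probability_between(9, 17)     # hypothetical question: 9 inches or longer

print(total_area, shorter_than_6, at_least_9)
```

This only works for ranges that line up with the interval edges; a range that cuts through an interval would need the density curve itself.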
NORMALLY DISTRIBUTED DATA
(figure: normal curve centered at the mean μ, with spread defined by sd (σ))
The normal distribution is common. But NEVER assume your data is normally distributed without looking at it and thinking critically first.
There are infinitely many normal distributions.
Their mean and sd determine the particular shape.
R code for this graph on GauchoSpace
Normal Distribution: 68-95-99.7
Study.com
Example
A researcher tells you that masses of the elusive Magic Rainbowfish are normally distributed with a population mean of 415 g and a population standard deviation of 30 g.
a) greater than 445 g?
b) between 355 and 445 g?
c) exactly 415 g?
d) less than 485 g?
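A minimal sketch for checking parts a) through d), assuming each asks for the probability that a randomly selected fish has a mass in the stated range. It uses NormalDist from Python's standard statistics module in place of the 68-95-99.7 rule.

```python
from statistics import NormalDist

# Population of Magic Rainbowfish masses: mean 415 g, sd 30 g
masses = NormalDist(mu=415, sigma=30)

p_greater_445 = 1 - masses.cdf(445)               # a) more than 1 sd above the mean
p_355_to_445 = masses.cdf(445) - masses.cdf(355)  # b) from 2 sd below to 1 sd above
p_exactly_415 = 0.0                               # c) a single point has zero area under a continuous density
p_less_485 = masses.cdf(485)                      # d) 485 g is 70 / 30 sd above the mean

print(p_greater_445, p_355_to_445, p_exactly_415, p_less_485)
```

By the 68-95-99.7 rule alone, a) is roughly (1 - 0.68) / 2 = 0.16 and b) is roughly 0.475 + 0.34 = 0.815. Part d) does not fall on a whole number of SDs, so it needs a table or software.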
Z-score: How many standard deviations a value is away from the mean
That’s easy enough if you’re interested in values at exactly 1, 2 or 3 SDs, but what about everything else?
http://www.stat.ufl.edu/~athienit/Tables/Ztable.pdf
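A minimal sketch of a z-score calculation, with the standard normal cumulative probability computed in software rather than read from a Z table. The 400 g value is hypothetical; the mean and sd are borrowed from the Rainbowfish example.

```python
from statistics import NormalDist

def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

standard_normal = NormalDist()  # mean 0, sd 1

# Hypothetical value: a 400 g fish from a population with mean 415 g and sd 30 g
z = z_score(400, 415, 30)
p_below = standard_normal.cdf(z)  # cumulative probability to the left of z

print(z, p_below)
```

A negative z means the value is below the mean. Check whether a given Z table lists area to the left of z or area between 0 and z before comparing against p_below.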
If the lengths of female blue whales are normally distributed with a mean of 82 ft and a standard deviation of 4 ft (both for the population), what is the probability of randomly selecting a blue whale that is:
The mean and standard deviation for the American pika at 6 wks old are 75.0 ± 13.2 g [1]. Assuming that this data describes the normally distributed population:
[1] Hayssen et al. (1993). Asdell's Patterns of Mammalian Reproduction: A Compendium of Species-Specific Data, Cornell University Press, Ithaca, New York.
THE CENTRAL LIMIT THEOREM
Many analyses require an assumption of normality – meaning that the sampling distribution of the statistic being compared is normal.
The Central Limit Theorem: Regardless of the underlying shape of a population, when repeated samples are taken from the population and a statistic (e.g. the mean) is calculated for each sample, the sampling distribution of that statistic will be approximately normal, provided the sample size n is sufficiently large.
What is the sampling distribution?
Why does that matter?
For many types of hypothesis tests, there is an assumption that the sampling distribution of the statistic being studied is normal.
If we have a large enough sample size (a common rule of thumb is n > 30), then we may be able to use those tests even if the underlying population shape is unknown, because the CLT tells us that the sampling distribution of the statistic will be approximately normal.
That does NOT mean you don’t have to look at and explore your data if n > 30.
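A minimal simulation sketch of the Central Limit Theorem. The exponential population, the sample size of 40 and the 5000 repeats are all illustrative assumptions. Skewness is used here as a rough stand-in for how far a distribution is from symmetric.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

def sample_mean_from_skewed_population(n):
    """Draw n observations from an exponential population (mean 1) and return their mean."""
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

def skewness(values):
    """Third standardized moment: 0 for a symmetric distribution."""
    m = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum(((v - m) / sd) ** 3 for v in values) / len(values)

# Individual observations from the population: strongly right-skewed
population_draws = [random.expovariate(1.0) for _ in range(5000)]

# Means of 5000 repeated samples of n = 40: the sampling distribution of the mean
sample_means = [sample_mean_from_skewed_population(40) for _ in range(5000)]

print(skewness(population_draws))  # far from 0
print(skewness(sample_means))      # much closer to 0
```

Plotting a histogram of sample_means next to one of population_draws shows the same thing visually.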
Wait…you just told me that I can find the sampling distribution if I take MULTIPLE samples (> 30 observations each).
Is that feasible? NO. So...
Either we have to convince ourselves that the sampling distribution will be normal (especially at low n), or...
we simulate a greater number of samples in order to estimate the sampling distribution (e.g. bootstrapping).
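A minimal sketch of bootstrapping the sampling distribution of the mean from a single sample. The observed values and the 2000 resamples are made up for illustration. This is one simple form of the bootstrap, not the only one.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

# Hypothetical observed sample (made up for illustration)
observed = [3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 9.9, 12.3]

def bootstrap_means(data, n_resamples=2000):
    """Resample the data with replacement and record the mean of each resample."""
    return [
        statistics.mean(random.choices(data, k=len(data)))
        for _ in range(n_resamples)
    ]

boot_means = bootstrap_means(observed)

# Spread of the bootstrap means estimates the standard error of the sample mean
boot_se = statistics.stdev(boot_means)

print(statistics.mean(observed), statistics.mean(boot_means), boot_se)
```

A histogram or Q-Q plot of boot_means is then a way to look at the estimated sampling distribution directly.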
HOW DO I EVALUATE NORMALITY?
FIRST: THINK about the data. Would it make sense for it to be normally distributed at high enough n, even if it doesn’t look highly normal at your sample size? Or do you know something about the variable that makes you think it is NOT normally distributed?
SECOND: LOOK AT YOUR DATA. LOOK AT YOUR DATA. LOOK AT YOUR DATA. Histograms, jitterplots, boxplots and quantile-quantile plots are a good place to start.
THIRD: Quantitative analysis (skewness, kurtosis, tests for normality, etc.) - but use these with caution (far less important than 1 & 2).
FIRST: THINK ABOUT YOUR DATA
SECOND: LOOK AT YOUR DATA
HISTOGRAMS
JITTERPLOTS
BOXPLOTS
Q-Q PLOTS
Samples can be taken from a normally distributed population, but at small n the sample may appear non-normally distributed.
Not all populations are normally distributed, even at high n!
Quantile-Quantile (QQ) plots
What is a quantile-quantile plot?
An example:
Let’s say we have a sample, with 7 values:
1, 2, 4, 5, 8, 10, 15
A Q-Q plot answers the question: How does this selection of 7 observations compare to the values we would EXPECT to get if we drew 7 observations from a normal distribution?
R code for this graph on GauchoSpace
The closer the points on a Q-Q plot are to a straight line, the closer the data are to normally distributed.
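A minimal sketch of the numbers behind a Q-Q plot for the 7-value sample. The plotting positions (i - 0.5) / n are one common convention and are an assumption here, as is using the sample's own mean and sd for the reference normal distribution. Plotting software may use slightly different positions.

```python
from statistics import NormalDist, mean, stdev

sample = sorted([1, 2, 4, 5, 8, 10, 15])
n = len(sample)

# Plotting positions: the cumulative probability assigned to each ordered observation
probabilities = [(i - 0.5) / n for i in range(1, n + 1)]

# Quantiles we would expect from a normal distribution with the sample's mean and sd
reference = NormalDist(mean(sample), stdev(sample))
theoretical = [reference.inv_cdf(p) for p in probabilities]

# Each (expected, observed) pair is one point on the Q-Q plot
for expected, observed in zip(theoretical, sample):
    print(f"{expected:7.2f}  {observed:3d}")
```

If the observed values track the expected ones along a straight line, the sample is consistent with a normal distribution.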
Skewness: measure of distribution asymmetry
Kurtosis: measure of peakedness
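A minimal sketch of skewness and kurtosis computed from standardized moments. These are the simple moment formulas, used here as an assumption because the slide's own formulas did not survive extraction. Statistical software often applies small-sample corrections, so its values can differ slightly. The two samples are made up.

```python
from statistics import mean, pstdev

def skewness(values):
    """Third standardized moment: 0 for a symmetric distribution."""
    m, sd = mean(values), pstdev(values)
    return sum(((v - m) / sd) ** 3 for v in values) / len(values)

def excess_kurtosis(values):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    m, sd = mean(values), pstdev(values)
    return sum(((v - m) / sd) ** 4 for v in values) / len(values) - 3

# Hypothetical samples (made up for illustration)
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

Positive skewness indicates a longer right tail, negative a longer left tail.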
THIRD: QUANTIFY SOME CHARACTERISTICS
Source: Wikipedia
Source: StackExchange – Cross Validated
Skewness (Bulmer 1979):
Kurtosis
Every time: