1 of 31

LECTURE 2

CENTRAL TENDENCY + DATA SPREAD

DATA EXPLORATION

THE NORMAL DISTRIBUTION

CENTRAL LIMIT THEOREM

DATA VISUALIZATION

2 of 31

MEAN

It is... The average value

Calculated by... Adding up all observations and dividing by the number of observations, n.

NOTES

Most commonly reported measure of central tendency (doesn’t make it best)
“Pulled” by outliers

MEDIAN

It is... The middle value

Calculated by... If n is odd, the median is the middle value. If n is even, the median is the average of the two central values.

NOTES

Improves as a measure of central tendency as n increases
Less influenced by outliers + skew

MODE

It is... The most frequent value

Calculated by... Whichever value appears most frequently in your samples. There can be multiple modes.

NOTES

More often reported for discrete data
Less common
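A minimal sketch of the three measures, using Python's standard `statistics` module. The sample values are made up for illustration (they are not course data), and 40 is included as an outlier to show how it pulls the mean but not the median.

```python
from statistics import mean, median, multimode

# Hypothetical sample (made-up values); 40 is a deliberate outlier.
sample = [2, 3, 3, 5, 7, 40]

sample_mean = mean(sample)         # sum of observations divided by n
sample_median = median(sample)     # n is even: average of the two central values
sample_modes = multimode(sample)   # list of every most-frequent value

print("mean:", sample_mean)
print("median:", sample_median)
print("mode(s):", sample_modes)
```

Here the mean (10) is larger than five of the six observations, while the median (4) stays near the bulk of the data.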

3 of 31

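SAMPLE VARIANCE

Sample variance is a more robust measure of data spread than a measure such as the range, because it takes all observations into account and is less strongly influenced by any single outlier. In words, variance is (approximately) the mean squared distance of observations from the sample mean.

Mathematically:

s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2

where x_i is an observation, \bar{x} is the sample mean, and n is the number of observations.

Units of variance are the units of the observation squared, [x_i]²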

4 of 31

STANDARD DEVIATION

Square root of sample variance

A measure of dispersion of the sample

Smaller SD = more values closer to mean, larger SD = greater data spread from mean

Units are units of the random variable
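A short sketch of sample variance and standard deviation computed by hand (dividing by n - 1) and checked against Python's `statistics` module. The sample values are hypothetical, chosen so the arithmetic is easy to follow.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical sample (made-up values).
sample = [2.0, 4.0, 4.0, 5.0, 10.0]

n = len(sample)
x_bar = mean(sample)

# Sum of squared distances from the sample mean, divided by n - 1.
s2_by_hand = sum((x - x_bar) ** 2 for x in sample) / (n - 1)
s_by_hand = sqrt(s2_by_hand)  # standard deviation: same units as the data

print("variance:", s2_by_hand, "(library:", variance(sample), ")")
print("sd:", s_by_hand, "(library:", stdev(sample), ")")
```

For this sample the squared distances sum to 36, so the variance is 36 / 4 = 9 and the standard deviation is 3.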

5 of 31

From histograms to probability density

(board drawing)

6 of 31

(figure: probability density curve; x-axis: random variable; y-axis: probability density; total area under the curve A = 1)

7 of 31

What is the probability of randomly selecting a fish that is…

Less than 6” long?
Less than 9” long or greater than 13” long?
Exactly 4” long?
Less than 4” long and greater than 9” long?

(histogram; x-axis: fish length in inches; y-axis: probability density; bin areas shown in gray)

Fish Length (inches)    Area
[1 – 4)                 0.09
[4 – 6)                 0.13
[6 – 9)                 0.31
[9 – 13)                0.40
[13 – 17)               0.07
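One way to work through the fish questions is to add up bin areas, since area under a probability density equals probability. The sketch below assumes the five areas on this slide belong to the five length bins in the order listed; the helper name `prob_where` is made up for illustration.

```python
# Bin areas from the slide: (lower, upper) bounds in inches -> area (probability).
bin_area = {
    (1, 4): 0.09,
    (4, 6): 0.13,
    (6, 9): 0.31,
    (9, 13): 0.40,
    (13, 17): 0.07,
}

def prob_where(condition):
    """Sum the areas of every bin [lo, hi) that satisfies condition(lo, hi)."""
    return sum(area for (lo, hi), area in bin_area.items() if condition(lo, hi))

p_less_6 = prob_where(lambda lo, hi: hi <= 6)
p_less_9_or_more_13 = prob_where(lambda lo, hi: hi <= 9 or lo >= 13)
p_exactly_4 = 0.0          # a single value has zero width, so zero area
p_less_4_and_more_9 = 0.0  # no length can be both below 4 and above 9

print(round(p_less_6, 2), round(p_less_9_or_more_13, 2))
```

The areas sum to 1, which is what makes them usable as probabilities.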

8 of 31

NORMALLY DISTRIBUTED DATA

Symmetric
Bell shaped
Unimodal
Centered at mean (μ)
Spread defined by sd (σ)

10 of 31

The normal distribution is common. But NEVER assume your data is normally distributed without looking at it and thinking critically first.

11 of 31

There are infinitely many normal distributions.

The mean (μ) sets where the curve is centered and the sd (σ) sets how spread out it is; together they determine the particular shape.

R code for this graph on GauchoSpace

12 of 31

Normal Distribution: 68-95-99.7

Study.com
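The 68-95-99.7 rule can be checked numerically. This is a minimal sketch using `NormalDist` from Python's standard library; the helper name `prob_within` is made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def prob_within(k):
    """Probability that a normal value falls within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {prob_within(k):.4f}")
```

The three values come out near 0.6827, 0.9545 and 0.9973, which round to the 68, 95 and 99.7 percent in the rule.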

13 of 31

Example

A researcher tells you that masses of the elusive Magic Rainbowfish are normally distributed with a population mean of 415 g and a population standard deviation of 30 g.

Sketch the distribution. Label axes.
Label the x-axis at values for ±1 and ±2 SD
What is the probability of randomly selecting a Rainbowfish with a mass…

a) greater than 445 g?

b) between 355 and 445 g?

c) exactly 415 g?

d) less than 485 g?
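After sketching the curve by hand, the Rainbowfish probabilities can be checked numerically. This sketch assumes Python's standard-library `NormalDist` as a stand-in for a Z-table; the variable names are illustrative.

```python
from statistics import NormalDist

# Population mean 415 g and standard deviation 30 g, as given on the slide.
mass = NormalDist(mu=415, sigma=30)

p_a = 1 - mass.cdf(445)               # greater than 445 g (above +1 SD)
p_b = mass.cdf(445) - mass.cdf(355)   # between 355 g (-2 SD) and 445 g (+1 SD)
p_c = 0.0                             # exactly 415 g: a single point has zero area
p_d = mass.cdf(485)                   # less than 485 g

for label, p in (("a", p_a), ("b", p_b), ("c", p_c), ("d", p_d)):
    print(f"{label}) {p:.3f}")
```

Parts a and b land on whole numbers of standard deviations, so they should agree with the 68-95-99.7 rule (about 0.16 and about 0.82). Part d does not, which is where a Z-score is needed.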

14 of 31

Z-score: How many standard deviations a value is away from the mean

Find Z-score associated with random variable value of interest

Find associated probabilities

That’s easy enough if you’re interested in values at exactly 1, 2 or 3 SDs, but what about everything else?
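The standard definition is z = (x - μ) / σ. A minimal sketch of the two steps above follows, reusing the Rainbowfish mean and sd with a hypothetical value of interest (430 g, which is not from the slides). `NormalDist().cdf` stands in for looking the Z-score up in a table.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

# Step 1: Z-score for the value of interest (430 g is a made-up example value).
z = z_score(430, mu=415, sigma=30)

# Step 2: associated probability, here P(mass < 430 g), from the standard normal.
p_below = NormalDist().cdf(z)

print(f"z = {z:.2f}, P(below) = {p_below:.4f}")
```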

16 of 31

http://www.stat.ufl.edu/~athienit/Tables/Ztable.pdf

If the lengths of female blue whales are normally distributed with a mean of 82 ft and a standard deviation of 4 ft (both for the population), what is the probability of randomly selecting a female blue whale that is:

More than 88 ft long?
Less than 79 ft long?
Between 79 and 88 ft long?
Exactly 82 ft long?
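A sketch for checking Z-table answers to the blue whale questions. It assumes Python's standard-library `NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

# Population mean 82 ft and standard deviation 4 ft, as given on the slide.
length = NormalDist(mu=82, sigma=4)

z_88 = (88 - 82) / 4   # 1.5 SD above the mean
z_79 = (79 - 82) / 4   # 0.75 SD below the mean

p_more_88 = 1 - length.cdf(88)
p_less_79 = length.cdf(79)
p_between = length.cdf(88) - length.cdf(79)
p_exactly_82 = 0.0     # a single point has zero area under the curve

print(f"z = {z_88}: P(> 88 ft) = {p_more_88:.4f}")
print(f"z = {z_79}: P(< 79 ft) = {p_less_79:.4f}")
print(f"P(79 to 88 ft) = {p_between:.4f}")
```

The first three probabilities sum to 1, a quick check that the regions cover the whole curve without overlapping.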

17 of 31

The mean ± standard deviation of mass for the American pika at 6 weeks old is 75.0 ± 13.2 g [1]. Assuming that these values describe a normally distributed population:

What is the probability that a randomly selected 6-week-old pika will have a mass between 70 and 80 g?

What mass would put a 6-week-old pika at the 25th percentile in mass?

[1] Hayssen et al. (1993). Asdell's Patterns of Mammalian Reproduction: A Compendium of Species-Specific Data, Cornell University Press, Ithaca, New York.
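The pika questions run in both directions: from values to a probability, and from a percentile back to a value. The sketch below assumes Python's standard-library `NormalDist`, using `cdf` for the first and `inv_cdf` for the second.

```python
from statistics import NormalDist

# Mean 75.0 g and standard deviation 13.2 g, treated as population values.
pika_mass = NormalDist(mu=75.0, sigma=13.2)

# Probability of a mass between 70 and 80 g.
p_70_to_80 = pika_mass.cdf(80) - pika_mass.cdf(70)

# Mass at the 25th percentile: the value with 25% of the area below it.
mass_25th = pika_mass.inv_cdf(0.25)

print(f"P(70 to 80 g) = {p_70_to_80:.3f}")
print(f"25th percentile = {mass_25th:.1f} g")
```

With a Z-table, the second question means finding the Z-score with an area of 0.25 to its left (about -0.67) and converting back with x = μ + zσ.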

18 of 31

THE CENTRAL LIMIT THEOREM

Many analyses require an assumption of normality, meaning that the sampling distribution of the statistic being compared (e.g. the mean) is normal.

The Central Limit Theorem: Regardless of the underlying shape of a population, when repeated samples are taken from the population and a statistic (e.g. the mean) is calculated for each sample, the sampling distribution of that statistic will be approximately normal, provided the sample size n is sufficiently large.
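A small simulation can make the theorem concrete. The sketch below assumes a right-skewed population (exponential with rate 1, so mean 1 and sd 1), chosen only for illustration. It draws many samples and keeps each sample's mean; the function name and the number of samples are arbitrary choices.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the sketch is repeatable

def sample_means(sample_size, n_samples=2000):
    """Draw n_samples samples from a right-skewed population; return each sample's mean."""
    return [
        mean([random.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]

means_n5 = sample_means(5)
means_n40 = sample_means(40)

# The population (exponential, rate 1) has mean 1 and sd 1.
for label, means in (("n = 5", means_n5), ("n = 40", means_n40)):
    print(f"{label}: mean of sample means = {mean(means):.3f}, sd = {stdev(means):.3f}")
```

The sample means center on the population mean, and their spread shrinks toward σ/√n as n grows. A histogram of `means_n40` should look far more bell shaped than the population itself.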

19 of 31

What is the sampling distribution?

20 of 31

Why does that matter?

For many types of hypothesis tests, there is an assumption that the sampling distribution of the statistic being studied is normal.

If we have a large enough sample size (a common rule of thumb is n > 30), then we may be able to use those tests even if the underlying population shape is unknown, because the CLT tells us that the sampling distribution of the statistic will be approximately normal.

That does NOT mean you don’t have to look at and explore your data if n > 30.

21 of 31

Wait…you just told me that I can find the sampling distribution if I take MULTIPLE samples (> 30 observations each).

Is that feasible? NO. So...

Either we have to convince ourselves that the sampling distribution will be normal (this takes the most care at low n), or...

Simulate a greater number of samples in order to estimate the sampling distribution (e.g. bootstrapping).
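A minimal bootstrapping sketch: resample the one sample you have, with replacement, many times, and use the resampled means as an estimate of the sampling distribution. The observations below are made up for illustration, and the function name and number of resamples are arbitrary choices.

```python
import random
from statistics import mean, stdev

random.seed(2)  # fixed seed so the sketch is repeatable

# Hypothetical single sample of 12 observations (made-up values).
observed = [3.1, 4.7, 2.2, 8.9, 5.0, 3.8, 12.4, 4.1, 6.3, 2.9, 7.5, 5.6]

def bootstrap_means(data, n_resamples=5000):
    """Resample the data with replacement and record the mean of each resample."""
    n = len(data)
    return [mean(random.choices(data, k=n)) for _ in range(n_resamples)]

boot = bootstrap_means(observed)

print(f"observed mean: {mean(observed):.2f}")
print(f"bootstrap estimate of the standard error: {stdev(boot):.2f}")
```

A histogram or Q-Q plot of `boot` then gives a direct look at whether the estimated sampling distribution is close to normal.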

22 of 31

HOW DO I EVALUATE NORMALITY?

FIRST: THINK about the data. Would it make sense for it to be normally distributed at high enough n, even if it doesn’t look very normal at your sample size? Or do you know something about the variable that makes you think it is NOT normally distributed?

SECOND: LOOK AT YOUR DATA. LOOK AT YOUR DATA. LOOK AT YOUR DATA. Histograms, jitterplots, boxplots and quantile-quantile plots are a good place to start.

THIRD: Quantitative analysis (skewness, kurtosis, tests for normality, etc.), but use these with caution (far less important than the FIRST and SECOND steps).

23 of 31

FIRST: THINK ABOUT YOUR DATA

24 of 31

SECOND: LOOK AT YOUR DATA

HISTOGRAMS

JITTERPLOTS

BOXPLOTS

Q-Q PLOTS

25 of 31

Samples can be taken from a normally distributed population, but at small n the sample may appear non-normally distributed.

26 of 31

Not all populations are normally distributed, even at high n!

27 of 31

Quantile-Quantile (QQ) plots

A way of graphically comparing YOUR data to a theoretical distribution (most common: comparing to the normal distribution)

Uses quantiles (can think of as percentiles)

28 of 31

What is a quantile-quantile plot?

An example:

Let’s say we have a sample, with 7 values:

1, 2, 4, 5, 8, 10, 15

A Q-Q plot answers the question: How does this selection of 7 observations compare to the values we would EXPECT to get if we drew 7 observations from a normal distribution?
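A sketch of the numbers behind a normal Q-Q plot for this sample. It assumes the plotting positions (i - 0.5) / n for the i-th sorted value; this is one common convention, and statistical software may use a slightly different one, so the exact theoretical quantiles can differ.

```python
from statistics import NormalDist

sample = sorted([1, 2, 4, 5, 8, 10, 15])
n = len(sample)

# Assumed plotting positions: the i-th sorted value sits at quantile (i - 0.5) / n.
probs = [(i - 0.5) / n for i in range(1, n + 1)]

# Values we would EXPECT at those quantiles from a standard normal distribution.
theoretical = [NormalDist().inv_cdf(p) for p in probs]

# Each (theoretical, observed) pair is one point on the Q-Q plot.
for z, x in zip(theoretical, sample):
    print(f"{z:6.2f}  {x:3d}")
```

Plotting the observed values against the theoretical quantiles gives the Q-Q plot.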

29 of 31

R code for this graph on GauchoSpace

The closer the points fall to a straight line on a Q-Q plot, the closer the data are to normally distributed.

30 of 31

THIRD: QUANTIFY SOME CHARACTERISTICS

Skewness: measure of distribution asymmetry

Kurtosis: measure of peakedness

Source: Wikipedia

Source: StackExchange – Cross Validated

Skewness (Bulmer 1979):

0 : perfectly symmetrical
|skewness| between 0 and 0.5 : approximately symmetric
|skewness| between 0.5 and 1 : moderately skewed
|skewness| greater than 1 : highly skewed

Kurtosis

= 3 : normal distribution (mesokurtic)
> 3: sharper peak, high kurtosis (leptokurtic)
< 3: flatter peak, low kurtosis (platykurtic)
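A sketch of how these two quantities can be computed from central moments. It assumes the simple moment-based definitions, with no small-sample correction and with kurtosis not shifted by 3 (so a normal distribution gives 3); statistical packages may apply corrections or report excess kurtosis instead. The function names are made up for illustration.

```python
from statistics import mean

def central_moment(data, k):
    """Average of (x - mean)^k over the data."""
    m = mean(data)
    return sum((x - m) ** k for x in data) / len(data)

def skewness(data):
    """Third central moment scaled by the variance to the power 3/2."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    """Fourth central moment scaled by the squared variance (normal = 3)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2

sample = [1, 2, 4, 5, 8, 10, 15]   # the 7-value sample from the Q-Q plot slides
symmetric = [1, 2, 3, 4, 5]        # hypothetical, perfectly symmetric values

print(f"sample: skewness = {skewness(sample):.2f}, kurtosis = {kurtosis(sample):.2f}")
print(f"symmetric: skewness = {skewness(symmetric):.2f}, kurtosis = {kurtosis(symmetric):.2f}")
```

The 7-value sample has a long right tail, so its skewness is positive. The evenly spaced symmetric values give a skewness of 0 and a kurtosis below 3.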

31 of 31

Every time:

Think really hard

Type of data
Expected distribution
How it was collected

Visual data exploration - multiple ways

Histograms, boxplots, jitterplots, QQ plots

Other analyses, evaluation

Skewness, kurtosis, tests for normality, etc.