1 of 33

Chapter 7 - Visualizing distributions

Badri ADHIKARI

Badri Adhikari

Visualizing distributions

Slide

2 of 33

Three models for presenting the same data

Below are three models for presenting the distribution of the same data. Which one is most effective? Why?

Model 1

“Upon surveying men and women between 20 and 50, here is what we found. Most women like men of their own age. To most men, however, 20- and 21-year-old women look best. For example, the age of women that most men of 40 find most good-looking is 21.”

Model 2

Model 3


3 of 33

Model 1 needs “numerical summaries of spread”!

Numerical summaries of spread/dispersion supplement the numerical summaries of central tendencies
Presenting only the central tendencies (mean, mode, etc.) could create a different reality in a reader’s mind
Many news stories report just averages, with no mention of ranges, distributions, frequencies, or outliers

For example, the summary “most men of 40 find women of 21 most good looking” could mean several things:


4 of 33

Which technique is most effective in conveying spread?

Example 1: [19, 20, 21, 22, 23]

N = 5, Mean (μ) = 21

Example 2: [15, 18, 21, 24, 27]

N = 5, Mean (μ) = 21

Example 3: [8, 18, 21, 24, 34]

N = 5, Mean (μ) = 21

All three examples have the same mean

This shows that the mean alone is ineffective in summarizing a list

“Spread” measures how far the values are away from the mean

Technique 1: mean of [“μ - value” for all values]

Spread in example 1 = mean of [ 2, 1, 0, -1, -2 ] = 0
Spread in example 2 = mean of [ 6, 3, 0, -3, -6 ] = 0
Spread in example 3 = mean of [ 13, 3, 0, -3, -13 ] = 0

Technique 2: mean of | μ - value | for all values (AVEDEV in Google Sheets)

Spread in example 1 = mean of [ 2, 1, 0, 1, 2 ] = 1.2
Spread in example 2 = mean of [ 6, 3, 0, 3, 6 ] = 3.6
Spread in example 3 = mean of [ 13, 3, 0, 3, 13 ] = 6.4

Technique 3: square root of ( ( sum of ( μ - value ) ^ 2 ) / total items ) [the standard deviation]

Spread in example 1 = 1.4
Spread in example 2 = 4.2
Spread in example 3 = 8.4
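
The three techniques above can be reproduced with a short Python sketch. It assumes only the standard library; the function names are illustrative and not part of the slides. Technique 3 is computed as the population standard deviation (dividing by N), which matches the formula given above.

```python
from statistics import mean, pstdev

examples = {
    "Example 1": [19, 20, 21, 22, 23],
    "Example 2": [15, 18, 21, 24, 27],
    "Example 3": [8, 18, 21, 24, 34],
}

def signed_deviation_mean(values):
    """Technique 1: mean of (mu - value); positives and negatives cancel out."""
    mu = mean(values)
    return mean([mu - v for v in values])

def mean_absolute_deviation(values):
    """Technique 2: mean of |mu - value| (the quantity AVEDEV reports)."""
    mu = mean(values)
    return mean([abs(mu - v) for v in values])

def population_std_dev(values):
    """Technique 3: sqrt(sum((mu - value)^2) / N)."""
    return pstdev(values)

for name, values in examples.items():
    print(
        name,
        signed_deviation_mean(values),
        round(mean_absolute_deviation(values), 1),
        round(population_std_dev(values), 1),
    )
```

Technique 1 always returns 0 because the deviations cancel, which is why it cannot distinguish the three lists.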


5 of 33

Technique 3 is more sensitive to outliers than Technique 2

Technique 3

Technique 2

Then, why calculate standard deviation as your measure of spread/dispersion?
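
The sensitivity claim can be checked numerically. The sketch below is an illustration, not the slide's own figure: the list `with_outlier` is a hypothetical variant of Example 1 in which the last value is replaced by 100. It compares how much each measure grows once the outlier is present.

```python
from statistics import mean, pstdev

def mean_absolute_deviation(values):
    """Technique 2: mean of |mu - value|."""
    mu = mean(values)
    return mean([abs(mu - v) for v in values])

base = [19, 20, 21, 22, 23]
with_outlier = [19, 20, 21, 22, 100]  # hypothetical: 23 replaced by an outlier

# Growth of each measure caused by the single outlier
mad_gain = mean_absolute_deviation(with_outlier) - mean_absolute_deviation(base)
sd_gain = pstdev(with_outlier) - pstdev(base)

print("Technique 2 grows by", round(mad_gain, 2))
print("Technique 3 grows by", round(sd_gain, 2))
```

Because Technique 3 squares each deviation before averaging, one large deviation contributes disproportionately to the result.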


6 of 33

The Standard Deviation Exposed

This week’s interview: Getting the measure of standard deviation

https://docs.google.com/document/d/1vYg798SnG-eJIHrKbnBkDrw0O-6aze_y6PAxcCQJxb0/edit?usp=sharing


7 of 33

If you know the mean and standard deviation, you know almost everything about the data at hand.

[Figure: strip plot of our data (N = 30) with the mean marked, shown with its distribution and the standard normal distribution]


8 of 33

Peculiarities of the normal distribution

Its mean, median, and mode are the same
The distribution is symmetrical

50 percent of scores are above the mean, and 50 percent are below it

We know what percentage of scores lie within certain ranges:

68.3 percent of cases in the data are within 1 standard deviation (sd) of the mean,
95.4 percent are within 2 sd, and
99.7 percent are within 3 sd

We can do some arithmetic with those figures (next slide)
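
The percentages for 1, 2, and 3 standard deviations can be recomputed from the normal cumulative distribution function. This is a minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `share_within` is illustrative. The exact values round to roughly 68.3, 95.4, and 99.7 percent.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution lying within k sd of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.1%}")
```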


9 of 33

Example problem

The mean of science test scores is 54 and the standard deviation is 14.

How likely is it to find a score of 82 or higher?

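Answer: a score of 82 is (82 - 54)/14 = 2 standard deviations above the mean. Since 95.4 percent of scores lie within 2 sd, about (100 - 95.4)/2 = 2.3% of the scores are that high, assuming the scores are normally distributed.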
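
The same answer can be obtained directly from the normal distribution rather than from the rounded percentages. This sketch assumes the test scores are normally distributed with the stated mean and standard deviation; the variable names are illustrative.

```python
from statistics import NormalDist

scores = NormalDist(mu=54, sigma=14)

# How many standard deviations above the mean is a score of 82?
z = (82 - 54) / 14

# Probability of a score of 82 or higher (upper tail)
p_at_least_82 = 1 - scores.cdf(82)

print("z =", z)
print("share of scores at or above 82:", f"{p_at_least_82:.1%}")
```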


10 of 33

But wait… are we assuming that our data is “normal”?

If you are summarizing a list of numbers using “mean” and “standard deviation” and performing any further analysis based on these two numbers, you are automatically assuming that the numbers are “normally” distributed

How can we check if a list of numbers is “normally” distributed?


11 of 33

Q-Q plots

[Figure: example Q-Q plots for normally distributed, left-skewed, right-skewed, under-dispersed, and over-dispersed data]

https://www.ucd.ie/ecomodel/Resources/QQplots_WebVersion.html
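
A Q-Q plot pairs each sorted data value with the value a normal distribution would be expected to produce at the same rank. The sketch below computes those pairs without any plotting library and summarizes how straight the resulting line is with a correlation coefficient. Everything here is an assumption for illustration: the plotting position `(i - 0.5) / n` is one common convention among several, and the two samples are synthetic.

```python
import random
from statistics import NormalDist, mean, pstdev

def qq_points(values):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    ordered = sorted(values)
    n = len(ordered)
    mu, sigma = mean(ordered), pstdev(ordered)
    std_normal = NormalDist()
    pairs = []
    for i, observed in enumerate(ordered, start=1):
        p = (i - 0.5) / n  # plotting position (one common convention)
        pairs.append((std_normal.inv_cdf(p), (observed - mu) / sigma))
    return pairs

def qq_correlation(values):
    """Pearson correlation of the Q-Q points; values near 1 suggest normality."""
    xs, ys = zip(*qq_points(values))
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

rng = random.Random(0)
normal_sample = [rng.gauss(50, 10) for _ in range(200)]     # synthetic, normal
skewed_sample = [rng.expovariate(1.0) for _ in range(200)]  # synthetic, right-skewed

print("normal sample:", round(qq_correlation(normal_sample), 3))
print("skewed sample:", round(qq_correlation(skewed_sample), 3))
```

Plotting the pairs from `qq_points` as a scatter plot gives the Q-Q plot itself; points that hug a straight line indicate data that are close to normal.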


12 of 33

How to generate a Q-Q plot?

Free online tool: http://www.wessa.net/rwasp_varia1.wasp

PIMA dataset at: https://badriadhikari.github.io/DV/week5_explore_data/homework/

Obtain Q-Q plots for age and blood pressure variables.

Which one is more normally distributed?


13 of 33

Sample problem: Generate numbers

Using the statistical summaries below, generate an unordered list of numbers that comply with the summary:

Mean = 20
Standard deviation = 1
Number of items = 40
Distribution type = normal

https://goodcalculators.com/normal-distribution-generator/
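
As an alternative to the online generator, the list can be produced in Python. This is a sketch under stated assumptions: random draws from a normal distribution will not have a mean of exactly 20 or a standard deviation of exactly 1, so the hypothetical helper `generate_normal_list` rescales the draws to match the summary exactly (using the population standard deviation). The seed is arbitrary.

```python
import random
from statistics import mean, pstdev

def generate_normal_list(target_mean, target_sd, n, seed=None):
    """Draw n normal values, then rescale so mean and population sd match exactly."""
    rng = random.Random(seed)
    draws = [rng.gauss(0, 1) for _ in range(n)]
    m, s = mean(draws), pstdev(draws)
    return [target_mean + target_sd * (x - m) / s for x in draws]

numbers = generate_normal_list(target_mean=20, target_sd=1, n=40, seed=7)

print(len(numbers), round(mean(numbers), 6), round(pstdev(numbers), 6))
print([round(x, 2) for x in numbers[:5]])  # first few values, in generated (unordered) order
```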


14 of 33

Sample problem: Generate summary statistics

Use the minimum possible number of summary statistics to describe the list of numbers below. Your summary values should describe the entire list as accurately as possible. When an accurate description is not possible, you can resort to an approximate summary.

Example:

A = [ 20 20 20 20 20 20 20 ]

Summary: Mean = 20, Std. dev. = 0, N = 7

Problem:

B = [ 19 19 20 20 21 21 100 102 ]
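
One way to explore list B is to compute the usual summaries and see how well they describe it. The sketch below is only one possible reading of the problem, not the intended answer: splitting B at 50 is an assumption made because of the visible gap between 21 and 100.

```python
from statistics import mean, median, pstdev

B = [19, 19, 20, 20, 21, 21, 100, 102]

# Whole-list summaries: the mean lands where no actual value is
print("mean:", mean(B), "median:", median(B), "sd:", round(pstdev(B), 1))

# Assumption: treat B as two groups separated by the gap between 21 and 100
low = [x for x in B if x < 50]
high = [x for x in B if x >= 50]
for label, group in (("low group", low), ("high group", high)):
    print(label, "N =", len(group), "mean =", mean(group), "sd =", round(pstdev(group), 2))
```

The mean of the whole list falls in the empty region between the two clusters, which suggests that a single mean and standard deviation describe this list poorly.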


15 of 33

Summary

Measures of spread are extremely important when summarizing data

Without spread, your data may read differently to different people

When calculating and presenting the mean and standard deviation, we may be automatically assuming that the data follow a normal distribution
Q-Q plots allow us to check whether data are normally distributed
Some data are almost impossible to summarize

They should be presented as they are (whenever possible)


16 of 33

Standard Deviation and Standard Scores

(for comparing distributions)


17 of 33

Example 1: Comparing salary distributions

U.S. salaries. Mean: 122,400; Standard deviation: 10,746

Nigerian salaries. Mean: 29,170; Standard deviation: 12,589

Question 1: In which country are salaries more nearly equal?

Question 2: Say, an IT employee makes $125,335 a year in the United States. What would be a similar salary in Nigeria?


18 of 33

Example 2: English vs Math scores

Your score in English = 83 / 100

Your score in Maths = 68 / 100

English

Class mean = 79
Std. dev. = 7
Your score = 83

Maths

Class mean = 59
Std. dev. = 5
Your score = 68

In which course did you do better?

Your grade in the math exam: 68 -> z-score or standard score = (68-59)/5 = 1.8

Your grade in the English exam: 83 -> z-score or standard score = (83-79)/7 = 0.6

(z-score of a score = (Raw score – mean)/standard deviation)
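
The z-score comparison above can be written as a small function. This is a minimal sketch; the function and variable names are illustrative.

```python
def z_score(raw, mean, sd):
    """Standard score: how many standard deviations raw lies above the mean."""
    return (raw - mean) / sd

maths_z = z_score(raw=68, mean=59, sd=5)
english_z = z_score(raw=83, mean=79, sd=7)

print("Maths z-score:", round(maths_z, 1))
print("English z-score:", round(english_z, 1))
print("Better relative to the class:", "Maths" if maths_z > english_z else "English")
```

Although the raw English score is higher, the Maths score sits further above its class mean once each is measured in units of its own standard deviation.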


19 of 33

Example 1: Comparing salary distributions

U.S. salaries. Mean: 122,400; Standard deviation: 10,746

Nigerian salaries. Mean: 29,170; Standard deviation: 12,589

Question 2: Say, an IT employee makes $125,335 a year in the United States. What would be a similar salary in Nigeria?

(z-score of a score = (Raw score – mean)/standard deviation)

US -> (Raw score – mean)/standard deviation = z-score (Z)

Nigeria -> (? – mean)/standard deviation = z-score (Z)
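
The two equations above can be solved in sequence: compute the z-score of the U.S. salary, then find the Nigerian salary with the same z-score. The sketch assumes both salary distributions are roughly normal and that "similar" means "at the same standard score"; the helper names are illustrative.

```python
def z_score(raw, mean, sd):
    """Standard score of a raw value."""
    return (raw - mean) / sd

def raw_from_z(z, mean, sd):
    """Invert the z-score formula: raw = mean + z * sd."""
    return mean + z * sd

us_z = z_score(raw=125_335, mean=122_400, sd=10_746)
nigeria_equivalent = raw_from_z(us_z, mean=29_170, sd=12_589)

print("U.S. z-score:", round(us_z, 3))
print("Equivalent Nigerian salary:", round(nigeria_equivalent))
```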


20 of 33

Summary

Any normal distribution can be converted to the standard normal distribution by standardizing the data
To compare two distributions, standardize them


21 of 33

How standardized scores mislead us


22 of 33

Counties with highest and lowest cancer death rates

Upon analyzing a kidney cancer dataset, you observe that rural, sparsely populated counties have both the highest and the lowest figures

In other words, estimates based on small populations tend to show more variation than estimates based on large populations. Why?


23 of 33

Galton probability box can explain why!

The more marbles you send rolling, the more of them will end up in the middle portion of the box
As we add more, the marbles will start forming a peculiar bell shape

The mean of all scores will approach 7

At the end, it WILL form a bell shape, but what happens towards the beginning is EXTREMELY important

At the beginning, the mean and standard deviation will fluctuate a lot
In other words, when the sample size is small, there is a very high chance of variation

https://www.dropbox.com/s/043ft0h1hebq7x7/20220905_142458.mp4?dl=0
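
The behaviour of the box can be simulated. The sketch below is an illustration under assumptions that are not stated on the slide: it uses 14 rows of pegs so that bins run from 0 to 14 and the expected mean is 7, and each marble goes left or right with equal probability at every peg. It compares how much the sample mean varies between repeated runs with 10 marbles and with 1,000 marbles.

```python
import random
from statistics import mean, pstdev

ROWS = 14  # assumption: 14 rows of pegs give bins 0..14, whose mean is 7

def drop_marbles(n_marbles, rng, rows=ROWS):
    """Each marble bounces right (1) or left (0) at every peg; its bin is the count of rights."""
    return [sum(rng.random() < 0.5 for _ in range(rows)) for _ in range(n_marbles)]

def spread_of_means(n_marbles, trials, rng):
    """Standard deviation of the sample mean across repeated runs of the box."""
    return pstdev([mean(drop_marbles(n_marbles, rng)) for _ in range(trials)])

rng = random.Random(42)
small_runs = spread_of_means(n_marbles=10, trials=100, rng=rng)
large_runs = spread_of_means(n_marbles=1000, trials=100, rng=rng)
big_run_mean = mean(drop_marbles(10_000, rng))

print("spread of the mean with 10 marbles:  ", round(small_runs, 3))
print("spread of the mean with 1000 marbles:", round(large_runs, 3))
print("mean after 10,000 marbles:", round(big_run_mean, 2))
```

With few marbles the mean jumps around from run to run; with many marbles it settles close to 7.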


24 of 33

Back to the kidney cancer example

Counties with small populations have high variation in cancer rates, with many of them at the highest and lowest extremes

A county has a population of 10 and one person dies of kidney cancer

Rate of cancer = 1/10 = 10%

A county has a population of 10,000 and 900 people die of kidney cancer

Rate of cancer = 900/10000 = 9%

Implication

Researchers conducting any study (a survey or an experiment) always try to increase the size of their random samples as much as possible
A study based on a sample size of 10,000 is much more reliable than one based on a sample of only 10


25 of 33

Use “funnel plot” for investigation

X-axis is population, measured on a logarithmic scale
Y-axis is age-adjusted cancer rate
Each dot is a county
Many counties with small populations (on the left side) show both very high and very low cancer rates

The more we move to the right, the narrower the variation
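
The funnel shape follows from how sampling variation shrinks with population size. The sketch below computes an approximate 95% band around a baseline rate for several population sizes. It is an illustration under assumptions: the band uses the normal approximation to the binomial (which is rough for very small populations), the baseline of 0.09 is a hypothetical value borrowed from the 9% county example, and `funnel_limits` is an illustrative name.

```python
from math import sqrt

def funnel_limits(baseline_rate, population, z=1.96):
    """Approximate 95% band for an observed rate when the true rate is baseline_rate."""
    half_width = z * sqrt(baseline_rate * (1 - baseline_rate) / population)
    low = max(0.0, baseline_rate - half_width)   # a rate cannot go below 0
    high = min(1.0, baseline_rate + half_width)  # or above 1
    return low, high

baseline = 0.09  # hypothetical overall rate
for population in (10, 100, 1_000, 10_000, 100_000):
    low, high = funnel_limits(baseline, population)
    print(f"population {population:>7}: {low:.3f} to {high:.3f}")
```

Drawing these limits over the scatter of counties, with population on a logarithmic x-axis, gives the funnel; counties outside the band are the ones worth a closer look.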


26 of 33

Summary

Estimates based on small populations tend to show more variation than estimates based on large populations
Funnel plots help us check the possible impact/effect of small populations
Very small samples are living bait for demons of chance


27 of 33

Percentiles, quartiles, and box plot


28 of 33

Use quartiles when exploring data!


29 of 33

Box plots are excellent for comparing distributions


30 of 33

Issue 1: Box plots can conceal data!


31 of 33

Issue 2: Deciding outliers

Expectation Reality


32 of 33

Show the data!

Use raincloud plots, jittered strip plots, violin plots, bean/frequency plots, boxen plots, sina plots, and more.


33 of 33

Summary

Box plots have two issues

It does not show the data
Deciding the rule for outliers can be tricky

Still, box plots are useful for visualizing and comparing distributions
Other plotting techniques, such as raincloud plots and violin plots, should be used when the data itself needs to be shown
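
Why deciding outliers is tricky can be shown with a small experiment. The sketch below applies the widely used 1.5 × IQR fence convention, which is an assumption here since the slides do not name a specific rule, to the list [8, 18, 21, 24, 34] from the earlier spread examples. It computes quartiles with two methods offered by Python's `statistics.quantiles` and reports which values each one flags.

```python
from statistics import quantiles

def fence_outliers(values, method):
    """Flag values beyond 1.5 * IQR from the quartiles (a common box plot convention)."""
    q1, _median, q3 = quantiles(values, n=4, method=method)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

data = [8, 18, 21, 24, 34]

for method in ("inclusive", "exclusive"):
    print(method, "quartiles:", quantiles(data, n=4, method=method),
          "outliers:", fence_outliers(data, method))
```

On this small list the two quartile methods disagree: one flags both 8 and 34 as outliers and the other flags nothing, even though the data and the fence rule are identical.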
