Chapter 7 - Visualizing distributions

Badri ADHIKARI

Three models for presenting the same data

Below are three models for presenting the distribution of the same data. Which one is most effective? Why?

Model 1

“Upon surveying men and women between 20 and 50, here is what we found. Most women like men of their own age. To most men, however, 20- and 21-year-old women look best. For example, the age of women that most men of 40 find most good-looking is 21.”

Model 2

Model 3

Model 1 needs “numerical summaries of spread”!

  • Numerical summaries of spread/dispersion supplement the numerical summaries of central tendencies
  • Presenting only the central tendencies (mean, mode, etc.) could create a different reality in a reader’s mind
  • Many news stories report just averages, with no mention of ranges, distributions, frequencies, or outliers

For example, the summary “most men of 40 find women of 21 most good looking” could mean several things:

Which technique is most effective in conveying spread?

  • Example 1: [19, 20, 21, 22, 23]
    • N = 5, Mean (μ) = 21
  • Example 2: [15, 18, 21, 24, 27]
    • N = 5, Mean (μ) = 21
  • Example 3: [8, 18, 21, 24, 34]
    • N = 5, Mean (μ) = 21
  • All three examples have the same mean
    • This shows that the mean alone is ineffective in summarizing a list
  • “Spread” measures how far the values are from the mean
    • Technique 1: mean of [“μ - value” for all values]
      • Spread in example 1 = mean of [ 2, 1, 0, -1, -2 ] = 0
      • Spread in example 2 = mean of [ 6, 3, 0, -3, -6 ] = 0
      • Spread in example 3 = mean of [ 13, 3, 0, -3, -13 ] = 0
    • Technique 2: mean of | μ - value | for all values (AVEDEV in Google Sheets)
      • Spread in example 1 = mean of [ 2, 1, 0, 1, 2 ] = 1.2
      • Spread in example 2 = mean of [ 6, 3, 0, 3, 6 ] = 3.6
      • Spread in example 3 = mean of [ 13, 3, 0, 3, 13 ] = 6.4
    • Technique 3: square root of ( ( sum of ( μ - value ) ^ 2 ) / total items ) [the population standard deviation; STDEVP in Google Sheets]
      • Spread in example 1 = 1.4
      • Spread in example 2 = 4.2
      • Spread in example 3 = 8.4
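
The three techniques above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper names (`signed_mean_deviation`, `mean_absolute_deviation`, `spread_table`) are illustrative, not part of the original slides.

```python
from statistics import fmean, pstdev

examples = {
    "Example 1": [19, 20, 21, 22, 23],
    "Example 2": [15, 18, 21, 24, 27],
    "Example 3": [8, 18, 21, 24, 34],
}

def signed_mean_deviation(values):
    """Technique 1: mean of (mu - value); positives and negatives cancel."""
    mu = fmean(values)
    return fmean(mu - v for v in values)

def mean_absolute_deviation(values):
    """Technique 2: mean of |mu - value|."""
    mu = fmean(values)
    return fmean(abs(mu - v) for v in values)

def spread_table(data):
    """Return (technique 1, technique 2, technique 3) per list, rounded to 1 decimal."""
    return {
        name: (
            round(signed_mean_deviation(values), 1),
            round(mean_absolute_deviation(values), 1),
            round(pstdev(values), 1),  # Technique 3: population standard deviation
        )
        for name, values in data.items()
    }

for name, row in spread_table(examples).items():
    print(name, row)
```

Technique 1 returns 0 for every list, which is why it cannot serve as a measure of spread.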

Technique 3 is more sensitive to outliers than Technique 2

[Figure: comparison of Technique 3 (standard deviation) and Technique 2 (mean absolute deviation)]
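
One way to see the difference in sensitivity is to ask how much of each total a single outlier accounts for. The sketch below uses a hypothetical list (Example 1 with its last value replaced by 73, which is not from the slides) and compares the outlier's share of the summed absolute deviations (Technique 2) with its share of the summed squared deviations (Technique 3).

```python
from statistics import fmean

# Hypothetical data: Example 1 with the last value replaced by an outlier
with_outlier = [19, 20, 21, 22, 73]

mu = fmean(with_outlier)
abs_devs = [abs(mu - v) for v in with_outlier]   # ingredients of Technique 2
sq_devs = [(mu - v) ** 2 for v in with_outlier]  # ingredients of Technique 3

# Share of each total that comes from the single most extreme value
abs_share = max(abs_devs) / sum(abs_devs)
sq_share = max(sq_devs) / sum(sq_devs)

print(f"outlier's share of absolute deviations: {abs_share:.0%}")
print(f"outlier's share of squared deviations:  {sq_share:.0%}")
```

Squaring lets the outlier dominate the sum, so the standard deviation reacts more strongly to it than the mean absolute deviation does.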

Then, why calculate standard deviation as your measure of spread/dispersion?

The Standard Deviation Exposed

This week’s interview: Getting the measure of standard deviation

If you know mean and standard deviation, you know almost everything about the data at hand.

[Figure: strip plot of our data (N = 30) with the mean marked, shown alongside the standard normal distribution]

Peculiarities of the normal distribution

  • Its mean, median, and mode are the same
  • The distribution is symmetrical
    • 50 percent of scores are above the mean, and 50 percent are below it
  • We know what percentage of scores lie within certain ranges:
    • 68.3 percent of cases in the data are within 1 standard deviation (sd) of the mean,
    • 95.4 percent are within 2 sd, and
    • 99.7 percent are within 3 sd
  • We can do some arithmetic with those figures (next slide)
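
The percentages for 1, 2, and 3 standard deviations can be recomputed from the normal distribution itself. A minimal sketch using `statistics.NormalDist` from the Python standard library; the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution lying within k sd of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k) * 100:.1f}%")
```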

Example problem

The mean of science test scores is 54 and the standard deviation is 14.

How likely is it to find a score of 82 or higher?

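Answer: a score of 82 is exactly 2 standard deviations above the mean ((82 - 54) / 14 = 2). Since 95.4 percent of scores lie within 2 sd, about (100 - 95.4) / 2 = 2.3% of the scores are that high.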
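
The same answer can be obtained directly, assuming the scores are normally distributed. A short sketch with the standard library; the variable names are illustrative.

```python
from statistics import NormalDist

# Assumption: test scores follow a normal distribution
scores = NormalDist(mu=54, sigma=14)

# Probability of a score of 82 or higher = 1 - P(score < 82)
share_82_or_higher = 1 - scores.cdf(82)

print(f"P(score >= 82) = {share_82_or_higher:.1%}")
```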

But wait… are we assuming that our data is “normal”?

  • If you are summarizing a list of numbers using “mean” and “standard deviation” and performing any further analysis based on these two numbers
    • you are automatically assuming that the numbers are “normally” distributed
  • How can we check if a list of numbers is “normally” distributed?

Q-Q plots

[Figure: example Q-Q plots for normally distributed, right-skewed, left-skewed, under-dispersed, and over-dispersed data]

https://www.ucd.ie/ecomodel/Resources/QQplots_WebVersion.html

How to generate a Q-Q plot?

Free online tool: http://www.wessa.net/rwasp_varia1.wasp

PIMA dataset at: https://badriadhikari.github.io/DV/week5_explore_data/homework/

Obtain Q-Q plots for age and blood pressure variables.

Which one is more normally distributed?
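
For readers who prefer code to an online tool, the points of a normal Q-Q plot can be computed by hand. This is a rough sketch, not the method used by the tool above: the sample values are hypothetical (not taken from the PIMA dataset), and the plotting positions (i - 0.5) / n are one common convention among several.

```python
from statistics import NormalDist, fmean, pstdev

def qq_points(values):
    """Return (theoretical quantile, observed value) pairs for a normal Q-Q plot."""
    ordered = sorted(values)
    n = len(ordered)
    # Reference normal distribution with the sample's own mean and sd
    reference = NormalDist(fmean(ordered), pstdev(ordered))
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n
    return [
        (reference.inv_cdf((i - 0.5) / n), observed)
        for i, observed in enumerate(ordered, start=1)
    ]

# Hypothetical ages, for illustration only
ages = [21, 22, 22, 23, 24, 25, 26, 28, 31, 35, 42, 57]

points = qq_points(ages)
for theoretical, observed in points:
    print(f"{theoretical:6.1f}  {observed}")
```

If the data were normally distributed, the two columns would be close to each other, so the points would fall near the diagonal when plotted.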

Sample problem: Generate numbers

Using the statistical summaries below, generate an unordered list of numbers that comply with the summary:

  • Mean = 20
  • Standard deviation = 1
  • Number of items = 40
  • Distribution type = normal

https://goodcalculators.com/normal-distribution-generator/
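
The same exercise can be sketched in Python. Random draws from a normal distribution only approximate the requested mean and standard deviation, so this sketch rescales the draw to match them exactly; that rescaling step is a design choice of this example, and the function name and seed are hypothetical.

```python
import random
from statistics import fmean, pstdev

def generate_normal_list(mean, sd, n, seed=None):
    """Draw n normal values, then rescale so mean and population sd match exactly."""
    rng = random.Random(seed)
    raw = [rng.gauss(0, 1) for _ in range(n)]
    m, s = fmean(raw), pstdev(raw)
    # Standardize the raw draw, then shift and scale to the requested summary
    return [mean + sd * (x - m) / s for x in raw]

numbers = generate_normal_list(mean=20, sd=1, n=40, seed=7)
print([round(x, 1) for x in numbers])
```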

Sample problem: Generate summary statistics

Use the minimum possible number of summary statistics to describe the list of numbers below. Your summary measure values should describe the entire list as accurately as possible. When an accurate description is not possible, you can resort to an approximate summary.

Example:

A = [ 20 20 20 20 20 20 20 ]

Summary: Mean = 20, Std. dev. = 0, N = 7

Problem:

B = [ 19 19 20 20 21 21 100 102 ]
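
A quick computation shows why list B is hard to summarize. The sketch below first computes a single mean and standard deviation, then tries one possible reading of the data as two groups; the cutoff of 50 is a hypothetical choice made for this example, not part of the problem statement.

```python
from statistics import fmean, pstdev

B = [19, 19, 20, 20, 21, 21, 100, 102]

# A single mean/sd pair: the mean is not close to any actual value in B
print(f"all of B: mean={fmean(B):.2f}, sd={pstdev(B):.2f}, N={len(B)}")

# One possible reading (an assumption): two groups split at a cutoff of 50
low = [x for x in B if x < 50]
high = [x for x in B if x >= 50]
for name, group in (("low group", low), ("high group", high)):
    print(f"{name}: mean={fmean(group):.2f}, sd={pstdev(group):.2f}, N={len(group)}")
```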

Summary

  • Measures of spread are extremely important when summarizing data
    • Without spread, your data may read differently to different people
  • When calculating and presenting the mean and standard deviation, we may be implicitly assuming that the data follow a normal distribution
  • Q-Q plots allow us to check for a normal distribution
  • Some data are almost impossible to summarize
    • They should be presented as they are (whenever possible)

Standard Deviation and Standard Scores

(for comparing distributions)

Example 1: Comparing salary distributions

U.S. salaries. Mean: 122,400; Standard deviation: 10,746

Nigerian salaries. Mean: 29,170; Standard deviation: 12,589

Question 1: In which country are salaries closer to being equal?

Question 2: Say, an IT employee makes $125,335 a year in the United States. What would be a similar salary in Nigeria?

Example 2: English vs Math scores

Your score in English = 83 / 100

Your score in Maths = 68 / 100

  • English
    • Class mean = 79
    • Std. dev. = 7
    • Your score = 83
  • Maths
    • Class mean = 59
    • Std. dev. = 5
    • Your score = 68

In which course did you do better?

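Your grade in the Maths exam: 68 -> z-score or standard score = (68-59)/5 = 1.8

Your grade in the English exam: 83 -> z-score or standard score = (83-79)/7 ≈ 0.57

(z-score of a score = (Raw score – mean)/standard deviation)

Relative to the rest of the class, you did better in Maths: 1.8 standard deviations above the mean, versus about 0.57 in English.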
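
The z-score formula translates directly into code. A minimal sketch; the function name `z_score` is illustrative.

```python
def z_score(raw, mean, sd):
    """Standard score: how many standard deviations raw lies above the mean."""
    return (raw - mean) / sd

english = z_score(83, mean=79, sd=7)
maths = z_score(68, mean=59, sd=5)

print(f"English z = {english:.2f}, Maths z = {maths:.2f}")
```

The higher z-score marks the course in which the score stands out more from the class.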

Example 1 (continued): Comparing salary distributions

U.S. salaries. Mean: 122,400; Standard deviation: 10,746

Nigerian salaries. Mean: 29,170; Standard deviation: 12,589

Question 2: Say, an IT employee makes $125,335 a year in the United States. What would be a similar salary in Nigeria?

(z-score of a score = (Raw score – mean)/standard deviation)

US -> z-score (Z) = (125,335 – 122,400)/10,746 ≈ 0.27

Nigeria -> (? – 29,170)/12,589 = Z, so ? = 29,170 + Z × 12,589
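
The two steps (standardize in one distribution, then convert back in the other) can be sketched as follows. The helper names `z_score` and `from_z` are illustrative, and the result is only as meaningful as the assumption that both salary distributions are roughly normal.

```python
def z_score(raw, mean, sd):
    """Standard score of raw within a distribution."""
    return (raw - mean) / sd

def from_z(z, mean, sd):
    """Inverse of z_score: the raw value that has standard score z."""
    return mean + z * sd

# Step 1: standardize the U.S. salary
us_z = z_score(125_335, mean=122_400, sd=10_746)

# Step 2: find the Nigerian salary with the same z-score
nigeria_equivalent = from_z(us_z, mean=29_170, sd=12_589)

print(f"z = {us_z:.2f}, comparable Nigerian salary is about {nigeria_equivalent:,.0f}")
```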

Summary

  • Any normal distribution can be converted to the standard normal distribution by standardizing the data
  • To compare two distributions, standardize them

How standardized scores mislead us

Counties with highest and lowest cancer death rates

  • Upon analyzing a kidney cancer dataset, you observe that rural, sparsely populated counties have both the highest and the lowest figures

  • In other words, estimates based on small populations tend to show more variation than estimates based on large populations. Why?

Galton probability box can explain why!

  • The more marbles you send rolling, the more of them will end up in the middle portion of the box
  • As we add more, the marbles will start forming a peculiar bell shape
    • the mean of all scores will approach 7

  • In the end, it WILL form a bell shape, but what happens toward the beginning is EXTREMELY important
    • At the beginning, the mean and standard deviation will fluctuate a lot
    • In other words, when the sample size is small, the estimates are very likely to vary widely
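
The behavior of the Galton box can be simulated. The sketch below assumes a box with 14 rows of pegs (so bins run from 0 to 14 and the long-run mean is 7); the row count, trial counts, and seed are assumptions chosen for this example. It compares how much the sample mean fluctuates when only 10 marbles are dropped versus 1,000.

```python
import random
from statistics import fmean, pstdev

ROWS = 14  # assumption: 14 rows of pegs, so the long-run mean bin is 7

def drop_marble(rng):
    """Each row sends the marble left (0) or right (1) with equal chance;
    the final bin is the number of right turns."""
    return bin(rng.getrandbits(ROWS)).count("1")

def sample_means(n_marbles, trials, rng):
    """Mean bin position of n_marbles, repeated for several independent trials."""
    return [
        fmean(drop_marble(rng) for _ in range(n_marbles))
        for _ in range(trials)
    ]

rng = random.Random(0)
small = sample_means(n_marbles=10, trials=200, rng=rng)
large = sample_means(n_marbles=1000, trials=200, rng=rng)

print(f"10 marbles:   means range from {min(small):.2f} to {max(small):.2f}")
print(f"1000 marbles: means range from {min(large):.2f} to {max(large):.2f}")
```

With few marbles the mean wanders far from 7; with many it barely moves.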

Back to the kidney cancer example

  • Counties with small populations have high variation in cancer rates
    • many of them fall at the highest and lowest extremes

  • A county has a population of 10 and one person dies of kidney cancer
    • Rate of cancer = 1/10 = 10%
  • A county has a population of 10,000 and 900 people die of kidney cancer
    • Rate of cancer = 900/10000 = 9%

Implication

  • Researchers conducting any study (a survey or an experiment) try to make their random samples as large as practical
  • A study based on a sample size of 10,000 is much more reliable than one based on a sample of only 10
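
A small calculation makes the instability of small counties concrete: it shows how far the rate moves if just one more person dies. The sketch uses the two counties from the example above; the function name is illustrative.

```python
def death_rate(deaths, population):
    """Deaths as a fraction of the population."""
    return deaths / population

# Small county: 1 death among 10 people, then one additional death
small_change = death_rate(2, 10) - death_rate(1, 10)

# Large county: 900 deaths among 10,000 people, then one additional death
large_change = death_rate(901, 10_000) - death_rate(900, 10_000)

print(f"small county: rate moves by {small_change * 100:.2f} percentage points")
print(f"large county: rate moves by {large_change * 100:.2f} percentage points")
```

A single chance event shifts the small county's rate a thousand times more than the large county's.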

Use “funnel plot” for investigation

  • X-axis is population, measured on a logarithmic scale
  • Y-axis is age-adjusted cancer rate
  • Each dot is a county
  • Many counties with small populations (on the left side) show both very high and very low cancer rates
    • The farther we move to the right, the narrower the variation
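
The funnel shape can be anticipated with a rough calculation. The sketch below uses the normal approximation to the binomial to estimate the band in which a county's rate would fall by chance alone, for a given population. The overall rate of 9% is hypothetical (it echoes the earlier worked example, not real kidney cancer data), and the choice of 2 standard errors is an assumption.

```python
from math import sqrt

def funnel_limits(overall_rate, population, z=2.0):
    """Approximate chance band around overall_rate for a county of this size,
    using the normal approximation: rate +/- z * sqrt(p * (1 - p) / n)."""
    se = sqrt(overall_rate * (1 - overall_rate) / population)
    return max(0.0, overall_rate - z * se), min(1.0, overall_rate + z * se)

overall_rate = 0.09  # hypothetical rate, for illustration only
populations = (10, 100, 1_000, 10_000, 100_000)

for population in populations:
    low, high = funnel_limits(overall_rate, population)
    print(f"population {population:>7,}: {low:.3f} to {high:.3f}")
```

The band narrows as population grows, which is the funnel seen in the plot.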

Summary

  • Estimates based on small populations tend to show more variation than estimates based on large populations
  • Funnel plots help us check the possible impact/effect of small populations
  • Very small samples are living bait for demons of chance

Percentiles, quartiles, and box plot

Use quartiles when exploring data!

Box plots are excellent for comparing distributions

Issue 1: Box plots can conceal data!

Issue 2: Deciding outliers

[Figure: deciding outliers, expectation vs. reality]
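
How tricky the outlier rule can be shows up even on a tiny list. The sketch below applies the common 1.5 × IQR convention (an assumption; the slide does not name a rule) to list B from the earlier sample problem, using the two quartile methods offered by Python's `statistics.quantiles`.

```python
from statistics import quantiles

def iqr_outliers(values, method):
    """Values outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR, for a given quartile method."""
    q1, _, q3 = quantiles(values, n=4, method=method)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

B = [19, 19, 20, 20, 21, 21, 100, 102]

for method in ("inclusive", "exclusive"):
    print(method, iqr_outliers(B, method))
```

The two quartile methods disagree on whether 100 and 102 are outliers at all, so the drawn box plot depends on a choice the reader never sees.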

Show the data!

Use a raincloud plot, jittered strip plot, violin plot, bean/frequency plot, boxen plot, sina plot, and more.

Summary

  • Box plots have two issues
    • They do not show the data
    • Deciding the rule for outliers can be tricky
  • They are still useful for visualizing and comparing distributions
  • Where possible, complement or replace them with techniques that show the data, such as raincloud plots and violin plots
