Chapter 7 - Visualizing distributions
Badri ADHIKARI
Badri Adhikari
Visualizing distributions
1
Slide
Three models for presenting the same data
Below are three models for presenting the distribution of the same data. Which one is most effective? Why?
Model 1
“Upon surveying men and women between 20 and 50, here is what we found. Most women like men of their own age. To most men, however, 20- and 21-year-old women look best. For example, the age of women that most men of 40 find most good-looking is 21.”
Model 2
Model 3
Model 1 needs “numerical summaries of spread”!
For example, the summary “most men of 40 find women of 21 most good-looking” could mean several things:
Which technique is most effective in conveying spread?
Technique 3 is more sensitive to outliers than Technique 2
Technique 3
Technique 2
Then, why calculate standard deviation as your measure of spread/dispersion?
The Standard Deviation Exposed
This week’s interview: Getting the measure of standard deviation
If you know the mean and standard deviation, you know almost everything about the data at hand.
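Figure: strip plot of our data (N = 30) with the mean marked, shown next to the distribution of the data and the standard normal distribution. The values 9, 4, and 5 are labeled in the figure.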
Peculiarities of the normal distribution
Example problem
The mean of science test scores is 54 and the standard deviation is 14.
How likely is it to find a score of 82 or higher?
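Answer: 82 is two standard deviations above the mean, since (82 - 54)/14 = 2. If the scores are normally distributed, only about 2.3% of them are that high.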
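The answer to the example problem can be checked numerically. The sketch below assumes the scores follow a normal distribution (the slide does not verify this) and uses a hypothetical helper, `normal_tail_probability`, built on the standard library's `math.erfc`.

```python
import math

def normal_tail_probability(x, mean, std_dev):
    """Return P(X >= x) for a normal distribution with the given mean and SD."""
    z = (x - mean) / std_dev
    # Upper-tail area of the standard normal curve: 0.5 * erfc(z / sqrt(2)).
    return 0.5 * math.erfc(z / math.sqrt(2))

# Numbers from the example problem: mean 54, standard deviation 14, score 82.
z_score = (82 - 54) / 14
tail = normal_tail_probability(82, mean=54, std_dev=14)
print(f"z = {z_score:.1f}, P(score >= 82) = {tail:.4f}")
```

The tail area is a little under 2.3%, which agrees with the answer stated above to within rounding.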
But wait… are we assuming that our data is “normal”?
Q-Q plots
Panels: normally distributed data, right-skewed data, left-skewed data, under-dispersed data, over-dispersed data
https://www.ucd.ie/ecomodel/Resources/QQplots_WebVersion.html
How to generate a Q-Q plot?
Free online tool: http://www.wessa.net/rwasp_varia1.wasp
PIMA dataset at: https://badriadhikari.github.io/DV/week5_explore_data/homework/
Obtain Q-Q plots for age and blood pressure variables.
Which one is more normally distributed?
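As an alternative to the online tool, the points of a Q-Q plot can be computed with Python's standard library. This is a minimal sketch, not the tool's method: the helper names `qq_points` and `straightness` are hypothetical, the plotting positions (i + 0.5)/n are one common convention among several, and the two samples are synthetic stand-ins rather than the PIMA age and blood pressure columns.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Return (theoretical normal quantile, observed value) pairs for a Q-Q plot."""
    ordered = sorted(values)
    n = len(ordered)
    reference = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i + 0.5) / n stay strictly between 0 and 1.
    return [(reference.inv_cdf((i + 0.5) / n), v) for i, v in enumerate(ordered)]

def straightness(points):
    """Pearson correlation of the Q-Q points; closer to 1 means closer to a line."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in points)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Synthetic samples (assumptions for illustration only).
rng = random.Random(0)
normal_like = [rng.gauss(50, 10) for _ in range(200)]
right_skewed = [rng.expovariate(1 / 10) for _ in range(200)]

print("normal-like :", round(straightness(qq_points(normal_like)), 3))
print("right-skewed:", round(straightness(qq_points(right_skewed)), 3))
```

Plotting each pair as a scatter point gives the Q-Q plot itself. The variable whose points hug a straight line more closely is the more normally distributed one.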
Sample problem: Generate numbers
Using the statistical summaries below, generate an unordered list of numbers that comply with the summary:
Sample problem: Generate summary statistics
Use the minimum possible number of summary statistics to describe the list of numbers below. Your summary values should describe the entire list as accurately as possible. When an accurate description is not possible, you can resort to an approximate summary.
Example:
A = [ 20 20 20 20 20 20 20 ]
Summary: Mean = 20, Std. dev. = 0, N = 7
Problem:
B = [ 19 19 20 20 21 21 100 102 ]
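A candidate summary can be checked by computing the statistics directly. The sketch below uses the two lists from this slide and a hypothetical helper, `summarize`. It assumes the population standard deviation (`pstdev`) is intended, since the slide does not say which variant it uses.

```python
from statistics import mean, median, pstdev

A = [20, 20, 20, 20, 20, 20, 20]
B = [19, 19, 20, 20, 21, 21, 100, 102]

def summarize(values):
    """Collect a few basic summary statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "std_dev": pstdev(values),  # population SD; an assumption, see lead-in
    }

summary_a = summarize(A)
summary_b = summarize(B)
print("A:", summary_a)
print("B:", summary_b)
```

For A the output reproduces the example summary. For B the mean lands far from every value in the list, a hint that a single mean and standard deviation describe B poorly.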
Summary
Standard Deviation and Standard Scores
(for comparing distributions)
Example 1: Comparing salary distributions
U.S. salaries. Mean: 122,400; Standard deviation: 10,746
Nigerian salaries. Mean: 29,170; Standard deviation: 12,589
Question 1: In which country are salaries more nearly equal?
Question 2: Say an IT employee makes $125,335 a year in the United States. What would be a comparable salary in Nigeria?
Example 2: English vs Math scores
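Your score in English = 83 / 100
Your score in math = 68 / 100
In which course did you do better?
Your grade in the math exam: 68 -> z-score (standard score) = (68 - 59)/5 = 1.8, where 59 and 5 are the mean and standard deviation of the math scores
Your grade in the English exam: 83 -> z-score (standard score) = (83 - 79)/7 ≈ 0.57, where 79 and 7 are the mean and standard deviation of the English scores
(z-score of a score = (raw score - mean)/standard deviation)
Relative to the other students, you did better in math.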
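The z-score formula translates directly into code. The sketch below uses the numbers from this example; the function name `z_score` is an illustrative choice.

```python
def z_score(raw_score, mean, std_dev):
    """Standard score: how many standard deviations a raw score lies from the mean."""
    return (raw_score - mean) / std_dev

# Values from the slide: math (mean 59, SD 5), English (mean 79, SD 7).
math_z = z_score(68, mean=59, std_dev=5)
english_z = z_score(83, mean=79, std_dev=7)

print(f"math z-score:    {math_z:.2f}")
print(f"English z-score: {english_z:.2f}")
print("better relative result:", "math" if math_z > english_z else "English")
```

Although the raw English score is higher, the math score sits further above its class mean once both are expressed in standard deviations.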
Example 1 (continued): Comparing salary distributions
U.S. salaries. Mean: 122,400; Standard deviation: 10,746
Nigerian salaries. Mean: 29,170; Standard deviation: 12,589
Question 2: Say an IT employee makes $125,335 a year in the United States. What would be a comparable salary in Nigeria?
(z-score of a score = (raw score - mean)/standard deviation)
US -> z-score (Z) = (raw score - mean)/standard deviation = (125,335 - 122,400)/10,746
Nigeria -> (? - 29,170)/12,589 = Z, so ? = 29,170 + Z × 12,589
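The two steps above (compute the z-score in one distribution, then invert the formula in the other) can be sketched as follows. The helper names are hypothetical, and the calculation assumes that "similar" means "same z-score", as the slide's setup suggests.

```python
def z_score(raw_score, mean, std_dev):
    """Standard score of a raw value within its own distribution."""
    return (raw_score - mean) / std_dev

def from_z_score(z, mean, std_dev):
    """Invert the z-score formula: the raw value with standard score z."""
    return mean + z * std_dev

# Salary figures from the slide.
us_z = z_score(125_335, mean=122_400, std_dev=10_746)
nigeria_equivalent = from_z_score(us_z, mean=29_170, std_dev=12_589)

print(f"US z-score: {us_z:.3f}")
print(f"Salary with the same z-score in Nigeria: {nigeria_equivalent:,.0f}")
```

The U.S. salary is roughly 0.27 standard deviations above its mean, so the matching Nigerian salary is the one roughly 0.27 standard deviations above the Nigerian mean.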
Summary
How standardized scores mislead us
Counties with highest and lowest cancer death rates
Galton probability box can explain why!
Back to the kidney cancer example
Implication
Use “funnel plot” for investigation
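A small simulation may help show why a funnel plot is useful here. It rests on an assumption about the point of the kidney cancer example: that sparsely populated counties produce both the highest and the lowest rates purely by chance. All numbers below (true rate, population sizes, number of counties) are made up for illustration, and `simulate_rates` is a hypothetical helper.

```python
import random
from statistics import pstdev

def simulate_rates(population, true_rate, n_counties, rng):
    """Simulate observed rates for counties that all share the same true rate."""
    rates = []
    for _ in range(n_counties):
        # Each resident is an independent trial with probability true_rate.
        cases = sum(1 for _ in range(population) if rng.random() < true_rate)
        rates.append(cases / population)
    return rates

rng = random.Random(42)
TRUE_RATE = 0.01  # identical in every simulated county (illustrative value)

small = simulate_rates(population=200, true_rate=TRUE_RATE, n_counties=100, rng=rng)
large = simulate_rates(population=5000, true_rate=TRUE_RATE, n_counties=100, rng=rng)

spread_small = pstdev(small)
spread_large = pstdev(large)
print(f"small counties: min={min(small):.4f} max={max(small):.4f} sd={spread_small:.4f}")
print(f"large counties: min={min(large):.4f} max={max(large):.4f} sd={spread_large:.4f}")
```

Plotting each county's observed rate against its population gives the funnel shape: wide scatter for small populations, narrowing toward the true rate as population grows.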
Summary
Percentiles, quartiles, and box plot
Use quartiles when exploring data!
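Quartiles are easy to compute with the standard library. The sketch below uses a made-up list of scores and the `inclusive` method of `statistics.quantiles`. Other quartile conventions exist and can give slightly different values on small samples.

```python
from statistics import quantiles

# Made-up scores, for illustration only.
scores = [54, 40, 68, 82, 61, 47, 75, 33, 89]

# n=4 splits the data into four equal-probability groups, giving Q1, Q2 (median), Q3.
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the width of the box in a box plot
five_number_summary = (min(scores), q1, q2, q3, max(scores))

print("five-number summary:", five_number_summary)
print("interquartile range:", iqr)
```

The five-number summary (minimum, Q1, median, Q3, maximum) holds the values on which a box plot is typically based.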
Box plots are excellent for comparing distributions
Issue 1: Box plots can conceal data!
Issue 2: Deciding outliers
Expectation vs. reality
Show the data!
Use a raincloud plot, jittered strip plot, violin plot, bean/frequency plot, boxen plot, sina plot, and more.
Summary