Data literacy
Making sense of stats
Giving life to numbers
X | A | B | C |
1 | 20 | | |
1.1 | 0 | | |
1.2 | 7 | | |
1.5 | 10 | | |
1.8 | 7 | | |
2 | 0 | | |
2.2 | 1 | 5 | 5 |
2.5 | 2.5 | 10 | 1 |
2.8 | 4 | 9 | 0 |
3 | 5 | 5 | 0 |
3.1 | | | 0 |
3.2 | | | 2 |
3.3 | | | 20 |
3.4 | | | 4 |
3.5 | | | 1 |
3.6 | | | 0 |
3.7 | | | 2 |
3.8 | | | 20 |
3.9 | | | 4 |
4 | | | 1 |
4.1 | | | 0 |
4.2 | 5 | 5 | 6 |
4.3 | 3 | 7 | 8 |
4.4 | 2 | 8 | 9 |
4.8 | 0 | 10 | 10 |
5.3 | 1 | | 8 |
5.5 | 5 | | 5 |
5.6 | | | |
6 | 0 | 5 | |
6.1 | 1 | 20 | |
What we will look into today…
How statistics can be presented misleadingly:
Break
How statistics themselves can be misleading:
Research in the media
Which are the real headlines?
Eating Marmite could help prevent dementia
Scientists discover way to say ‘I love you’ to dogs in way they understand
If you snore you could be THREE TIMES more likely to die of coronavirus, docs warn
Don't laugh, but a good giggle can help you live longer...
Too much caffeine ‘can wake the dead’
Statistical evidence is seen as more persuasive…
… and people do tend to exaggerate numbers in line with their beliefs
Number format can also exaggerate effects…
Number format
25 people prefer mountains | 75 people prefer beaches |
25% | 75% |
0.25 | 0.75 |
One in four | Three in four |
1/4 | 3/4 |
A quarter | Three quarters |
1:4 | 3:4 |
Three times as many people prefer beaches as mountains | |
Only a third as many people prefer mountains as prefer beaches | |
200% more people prefer beaches than mountains (300% as many) | |
If the proportion of people who own dogs increased from 12% in 2018 to 36% in 2023…
The proportion did not increase by 24%
It increased by 200%
As 36% is three times as large as 12%
However, there was an increase of 24 percentage points
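The dog-ownership arithmetic can be checked in a few lines of Python. A jump from 12% to 36% is a rise of 24 percentage points, a 200% relative increase, and three times the starting value. This is a minimal sketch; the function names are illustrative and do not come from any library.

```python
def percentage_point_change(old_pct, new_pct):
    """Absolute difference between two percentages, in percentage points."""
    return new_pct - old_pct


def relative_change_pct(old_pct, new_pct):
    """Change relative to the starting value, expressed as a percentage."""
    return (new_pct - old_pct) / old_pct * 100


old, new = 12, 36  # dog ownership in 2018 and 2023 (percent), from the slide

print(percentage_point_change(old, new))  # 24 percentage points
print(relative_change_pct(old, new))      # 200.0 (% increase)
print(new / old)                          # 3.0 (times the 2018 figure)
```

The same change can therefore be reported as "24", "200%" or "three times", which is why it matters which one a headline picks.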
Dubious election graphs…
We need to talk about 2019
Do we need to talk about 2019?
Context is everything
Carter Racing (Brittain & Sitkin, 1987)
7 engine failures in 24 races (29%)
Incidents by temperature
Adding the missing data
In reality…
They raced. 🍎
Space launches influence the awarding of sociology doctorates
tylervigen.com
Correlation: 78.92% (r=0.78915)
Data sources: Federal Aviation Administration and National Science Foundation
Nic Cage films influence pool drownings
tylervigen.com
Correlation: 66.6% (r=0.666004)
Data sources: Centers for Disease Control & Prevention and Internet Movie Database
Finding unusually correlated data
Google Trends
Try to find two seemingly unrelated search terms that over the past 12 months appear to be closely correlated
trends.google.com
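Once two series have been found, their correlation can be calculated rather than judged by eye. The sketch below computes Pearson's r by hand. The weekly scores are made up for illustration and are not real Google Trends data, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up weekly interest scores for two hypothetical search terms.
term_a = [10, 12, 15, 14, 18, 21, 20, 25]
term_b = [30, 33, 35, 36, 41, 44, 42, 50]

r = pearson_r(term_a, term_b)
print(f"r = {r:.3f}")
```

A high r here says only that the two lines move together; as with the space launches and Nic Cage examples, it says nothing about one causing the other.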
Spreadsheet fails
92 out of 97 lecturers eat catfood
Why 97?
Who was surveyed?
How much catfood are they actually eating?
WEIRD Samples
White, Educated, Industrialized, Rich, Democratic
Practical limits (mostly funding restrictions) can restrict access to diverse samples.
Important to replicate studies
Sample size matters…
A small sample size is unlikely to be representative
In the population most variables will be normally distributed
Central limit theorem - the more participants in the sample, the closer the distribution of the sample mean will be to normal
A sample that is normally distributed allows you to run parametric tests which are more likely to detect effects
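The central limit theorem can be seen in a small simulation. The sketch below draws repeated samples from a deliberately skewed population (an exponential distribution, chosen here purely as an assumption for illustration) and measures how lopsided the distribution of sample means is. Skewness near 0 indicates a symmetric, more normal-looking shape. The helper names are hypothetical.

```python
import random
import statistics


def skewness(values):
    """Skewness of a list of numbers: about 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def sample_means(sample_size, n_samples, rng):
    """Means of repeated samples drawn from a skewed (exponential) population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]


rng = random.Random(42)  # fixed seed so the run is repeatable
for n in (2, 10, 50):
    means = sample_means(n, 2000, rng)
    print(f"sample size {n:>2}: skewness of sample means = {skewness(means):.2f}")
```

The skewness should shrink towards 0 as the sample size grows, even though the underlying population stays just as skewed.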
Sample size matters…
A large sample size makes a statistic more persuasive
However, a large sample is more likely to return a significant result, even for a very small effect, suggesting there is a relationship between variables
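To illustrate, the sketch below keeps the same tiny difference between two groups (0.1 standard deviations) and changes only the number of participants. It uses a simple z-test with a known standard deviation, which is an assumption made to keep the arithmetic short; real analyses would usually use a t-test. `two_sample_z_p_value` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def two_sample_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference between two group means.

    Assumes both groups share a known standard deviation `sd`,
    which keeps the arithmetic simple (a z-test rather than a t-test).
    """
    standard_error = sd * math.sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same tiny difference (0.1 standard deviations), growing sample sizes.
for n in (20, 200, 2000, 20000):
    p = two_sample_z_p_value(mean_diff=0.1, sd=1.0, n_per_group=n)
    print(f"n per group = {n:>6}: p = {p:.4f}")
```

The effect never changes, yet the result moves from clearly non-significant to highly significant as the sample grows, which is why significance alone says little about how big an effect is.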
Population vs. Sample
We can’t test everyone
Collect a smaller sample from the wider population
Is there a consistent enough effect in the sample that the same effect is highly likely to exist in the population?
Hypotheses
Null Hypothesis - That there are no patterns or differences
Alternative or Experimental Hypothesis - That there are patterns or differences
Significance
Statistical tests tell us if there is a significant difference/association in our sample data.
By convention, the calculated p-value must be below .05 (p < .05) for a result to be called statistically significant.
This means accepting a 5% chance of a false positive when there is really no effect
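The 5% figure can be checked by simulation. The sketch below runs many imaginary studies in which both groups come from the same population, so there is truly nothing to find, and counts how often p still falls below .05. The z-test with a known standard deviation of 1, the group size of 30 and the helper names are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, fmean


def null_experiment_p_value(n_per_group, rng):
    """p-value from one simulated study in which there is truly no effect.

    Both groups are drawn from the same normal population (mean 0, SD 1),
    so any 'significant' difference is a false positive.
    """
    group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
    group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
    z = (fmean(group_a) - fmean(group_b)) / math.sqrt(2 / n_per_group)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_studies, n_per_group, alpha, rng):
    """Share of no-effect studies that still come out 'significant'."""
    hits = sum(
        null_experiment_p_value(n_per_group, rng) < alpha
        for _ in range(n_studies)
    )
    return hits / n_studies


rng = random.Random(2024)  # fixed seed so the run is repeatable
rate = false_positive_rate(n_studies=2000, n_per_group=30, alpha=0.05, rng=rng)
print(f"False positive rate: {rate:.3f}")
```

The printed rate should land close to 0.05: roughly one in twenty studies of nothing at all will still report a significant result.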
P-value
Is a 5% chance of making a false positive claim too high?
Or is it too low?
There is debate surrounding p-values and whether the threshold should be lowered.
Is having a threshold too strict?
P-value
Phrases published papers have used to describe p-values above .05
non-insignificant result (p=0.500)
very closely brushed the limit of statistical significance (p=0.051)
a clear tendency to significance (p=0.052)
just failed significance (p=0.057)
just borderline significant (p=0.058)
just above the arbitrary level of significance (p=0.07)
a barely detectable statistically significant difference (p=0.073)
narrowly eluded statistical significance (p=0.0789)
moderately significant (p>0.11)
non-significant in the statistical sense (p>0.05)
Effect sizes
A measure of the size of the pattern or differences in your sample
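One common effect size for a difference between two groups is Cohen's d: the difference in means divided by the pooled standard deviation. The slides do not name a specific measure, so Cohen's d is used here as an assumption, and the scores are made up for illustration.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (divides by n - 1)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Made-up scores for two hypothetical groups.
treatment = [6, 7, 8, 9, 10]
control = [4, 5, 6, 7, 8]

print(round(cohens_d(treatment, control), 2))  # 1.26
```

Unlike a p-value, d does not shrink or grow just because more participants are added; it describes how far apart the groups are in standard-deviation units.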
“Lies, damned lies and statistics”
33.7% of scientists admitted to questionable practices that could lead to misleading or false statistics
Data pruning and removing outliers unreasonably
P-Hacking
Overly complicated models: adding variables will always explain more, so parsimony (the simplest model that explains the largest effect) is important
Falsifying and fabricating data
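One reason P-hacking works is simple arithmetic: the more tests that are run, the more likely at least one comes out significant by chance. The sketch below assumes the tests are independent and that no real effect exists; `chance_of_false_positive` is a hypothetical helper name.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of `n_tests` independent tests is
    'significant' at level `alpha` when no real effect exists."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 10, 20):
    print(f"{n_tests:>2} tests: {chance_of_false_positive(n_tests):.0%}")
```

With 20 tests the chance of at least one false positive is roughly 64%, so reporting only the test that "worked" is misleading.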
Conflicts of interest
Studies being funded or researched with a motive in mind and conducted in a manner to achieve that motive
Researchers' choices
Research only includes the variables, measures and methods selected by the researcher
One study gave 29 teams of analysts the same data set and asked them to find an answer to “whether soccer referees are more likely to give red cards to dark-skin-toned players than to light-skin-toned players”.
Statistical analysis methods varied, and the teams chose 21 unique combinations of variables to include. 20 teams found a significant result, 9 teams did not.
Better practices
Better statistical practices - consider what the data look like and dig deeper than p-values alone
Diversify samples and run replications
Collate research in one area with meta-analyses and systematic reviews
Honest graphs
Check research for any conflicts of interest
More of this sort of thing…