HOW TO AVOID COMMON MISTAKES WITH STATISTICS
Hints and free tools
OVERVIEW
WHY AM I DOING THIS?
I FEEL YOUR PAIN
SO!
I made a free online stats tool that:
https://springsenglish.online/stats
→ Partially to help me learn programming
SPSS’ STRANGLEHOLD IS GONE
| My Site | R | SPSS |
| Free | Free | EXPENSIVE |
| Significance (effect size), Correlation +, Interrater Reliability | Bayesian Statistics, Machine Learning, everything on my site (minus DA) | Significance (most ES), Correlation (costs extra!) (+/-), Interrater Reliability ($$) (+/-) |
| No graphs | Excellent graphs | Some graphs |
| Helps you choose the test | Doesn't help | Doesn't help |
| Has linguistics / language-acquisition examples | Doesn't have such examples | Doesn't have any examples |
R is difficult, but…
FOR THIS WORKSHOP!
springsenglish.online/stats (my site)
→ "springs" (my name)
→ "english" (what I teach)
→ ".online" (because I'm poor; .com is REALLY expensive!!!)
→ "/stats" (to go to the stats part)
COMMON MISTAKE 1: CHOOSING THE WRONG TEST
Part 1:
Considering Data Types
COMMON MISTAKE 1: CHOOSING THE WRONG TEST
Consideration 1: Is my data normal and continuous?
1. The average is in the middle
2. Most data are close to the average
(a quick normality check is sketched below)
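A minimal sketch of such a check in Python (hypothetical data; scipy assumed):

```python
# Minimal normality check on made-up scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
scores = rng.normal(loc=70, scale=10, size=30)  # bell-shaped scores around 70

print(f"mean = {scores.mean():.1f}, median = {np.median(scores):.1f}")

# Shapiro-Wilk test: a small p (< .05) suggests the data are NOT normal.
w, p = stats.shapiro(scores)
print(f"Shapiro-Wilk W = {w:.3f}, p = {p:.3f}")
```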
COMMON MISTAKE 1: CHOOSING THE WRONG TEST
Normal data might look like a bell curve: symmetric, with most scores near the average.
COMMON MISTAKE 1: CHOOSING THE WRONG TEST
Why does data normality matter?
・Beware the AVERAGE TRAP!!
WHY DOES THIS MATTER?
If the data isn’t normal / parametric:
The average doesn't mean much.
M = 70, scores normally distributed → students knew about 70%? YES!
M = 70, scores bimodal (e.g., half near 40, half near 100) → students knew about 70%? NO!
WHY DOES THIS MATTER (2)?
t-test = What is the chance of this much difference between the averages? (Could these two bell curves come from the same normal population?)
Mann-Whitney = What is the chance that so many of the high ranks fall in one group? (Could these rank distributions come from the same population?)
Calculations are ENTIRELY DIFFERENT!!!
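To see that these really are different calculations, here is a minimal sketch (hypothetical scores; scipy assumed):

```python
# Two independent groups of made-up scores.
from scipy import stats

group1 = [55, 60, 62, 65, 68, 70, 72, 75, 80, 85]
group2 = [50, 52, 58, 60, 63, 65, 66, 70, 74, 78]

# t-test: compares AVERAGES; assumes roughly normal data.
t, p_t = stats.ttest_ind(group1, group2)
print(f"t-test:       t = {t:.2f}, p = {p_t:.3f}")

# Mann-Whitney U: compares RANKS; no normality assumption.
u, p_u = stats.mannwhitneyu(group1, group2, alternative="two-sided")
print(f"Mann-Whitney: U = {u:.1f}, p = {p_u:.3f}")
```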
COMMON MISTAKE 1: CHOOSING THE WRONG TEST
Consideration 2: Are the data sets paired?
| Student | Test 1 | Test 2 |
| Student A | 100 | 90 |
| Student B | 40 | 50 |
| Student C | 70 | 50 |
| Student D | 60 | 60 |
| Student E | 50 | 70 |
(same students took both tests → PAIRED data)
OR
| Student | Test 1 | Student | Test 2 |
| Student A | 100 | Student F | 90 |
| Student B | 40 | Student G | 50 |
| Student C | 70 | Student H | 50 |
| Student D | 60 | Student I | 60 |
| Student E | 50 | Student J | 70 |
(different students in each group → UNPAIRED / independent data)
WHY DOES THIS MATTER?
INDEPENDENT t-test = Is there a difference between these two groups' averages? (Could these two bell curves come from the same normal population?)
DEPENDENT t-test = Is the average of each individual's change different from zero? (Do the paired differences indicate a real change?)
Calculations are ENTIRELY DIFFERENT!!!
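A minimal sketch of the difference (hypothetical scores; scipy assumed):

```python
# Same ten numbers, two different questions.
from scipy import stats

test1 = [60, 55, 72, 64, 58]   # five students, first test
test2 = [66, 59, 75, 70, 63]   # second test

# SAME students took both tests: dependent (paired) t-test.
t_dep, p_dep = stats.ttest_rel(test2, test1)
print(f"dependent:   t = {t_dep:.2f}, p = {p_dep:.4f}")  # consistent gains -> tiny p

# DIFFERENT students took each test: independent t-test.
t_ind, p_ind = stats.ttest_ind(test2, test1)
print(f"independent: t = {t_ind:.2f}, p = {p_ind:.4f}")  # overlapping groups -> large p
```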
LET’S TRY USING THE TOOL!
COMMON MISTAKE 1: CHOOSING THE WRONG TEST
Consideration 3: How many data sets do I have?
2 ONLY:
- t-test
- Mann-Whitney
- Wilcoxon
- etc.
3+:
- ANOVA
- Kruskal-Wallis
- Friedman
- etc.
WHY DOES THIS MATTER?
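Because running many pairwise two-group tests inflates the chance of a false positive; the 3+ tests ask one overall question instead. A minimal sketch with three hypothetical classes (scipy assumed):

```python
# Three made-up classes: one overall test, not three t-tests.
from scipy import stats

class_a = [70, 72, 68, 75, 71]
class_b = [66, 69, 64, 70, 67]
class_c = [74, 78, 73, 80, 76]

f, p_f = stats.f_oneway(class_a, class_b, class_c)  # parametric: one-way ANOVA
h, p_h = stats.kruskal(class_a, class_b, class_c)   # non-parametric: Kruskal-Wallis
print(f"ANOVA:          F = {f:.2f}, p = {p_f:.4f}")
print(f"Kruskal-Wallis: H = {h:.2f}, p = {p_h:.4f}")
```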
LET’S TRY USING THE TOOL!
COMMON MISTAKE 1: CHOOSING THE WRONG TEST
Part 2:
Misunderstanding p > .05
COMMON MISTAKE 1: CHOOSING THE WRONG TEST
GENERALLY…
p < .05 means there IS a difference (effect / etc.)
But…
p > .05 does NOT mean "the groups are the same."
It means we don't KNOW if there is a difference.
WHY DOES THIS MATTER?
(e.g.) Two classes of 10 students respond to a 1~5 Likert-scale question
p = .09
BUT!!!! The response patterns look clearly different; with only 10 students per class, the test simply lacks power.
WHY DOES THIS MATTER?
Group 1 pre-post: p = .06
Group 2 pre-post: p = .02
Many people think: Group 2 improved more than Group 1
BUT!!!!
Look at the data: clearly they improve about the same amount
Pre-Post of 2 groups is FOUR GROUPS
No double t-tests!
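One common fix (an illustration, not the only correct design): compare each group's GAIN scores (post minus pre) with a single test, instead of running a separate pre-post t-test inside each group. All numbers below are made up:

```python
# Compare GAINS between groups with ONE test.
import numpy as np
from scipy import stats

pre_1  = np.array([60, 55, 70, 65, 58])
post_1 = np.array([68, 63, 77, 71, 66])
pre_2  = np.array([62, 57, 69, 64, 59])
post_2 = np.array([69, 65, 75, 72, 66])

gain_1 = post_1 - pre_1   # how much each student in group 1 improved
gain_2 = post_2 - pre_2

t, p = stats.ttest_ind(gain_1, gain_2)
print(f"gain vs. gain: t = {t:.2f}, p = {p:.3f}")  # similar gains -> no difference
```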
COMMON MISTAKE 2: NOT REPORTING EFFECT SIZE
Effect size: a measure of HOW BIG a difference (or relationship) is.
WHY DOES THIS MATTER (1)?
HELPS TO INTERPRET RESULTS BEYOND THE P VALUE
- The p value is affected by sample size
- Effect size is not* (one common measure is sketched below)
- Reporting it helps prevent "p-hacking"
- The American Statistical Association now cautions against relying on p values alone (Wasserstein & Lazar, 2016)
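A minimal sketch of one common effect size, Cohen's d with a pooled SD (hypothetical scores):

```python
# Cohen's d with a pooled standard deviation (one common formula).
import numpy as np

def cohens_d(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                        / (nx + ny - 2))
    return (x.mean() - y.mean()) / pooled_sd

# Made-up scores for two groups.
print(f"d = {cohens_d([80, 85, 78, 90, 84], [75, 79, 73, 82, 77]):.2f}")
```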
REMEMBER THIS EXAMPLE?
(e.g.) Two classes of 10 students respond to a 1~5 Likert-scale question
p = .09, but an effect size would still tell readers HOW BIG the difference between the classes is.
PRACTICAL EXAMPLE (2)
| Year | N | Pre-test | Post-test | Improvement | p value | Effect Size |
| 2020 | 1500 | 80 | 86 | 6 points | p < .01 | d = 0.6 |
| 2021 | 1549 | 81 | 84 | 3 points | p < .01 | d = 0.3 |
| 2022 | 1497 | 82 | 88 | 6 points | p < .01 | d = 1.2 |
Which year had the best learning outcomes?
A: 2022 (d = 1.2): more students improved overall, with a more even spread of improvement
With large sample sizes, p value becomes uninformative.
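A quick simulation of why: with N ≈ 1500 per group, even a 1-point average difference will usually come out p < .05, while d stays small (made-up data; scipy assumed):

```python
# Huge N: a trivial 1-point difference still looks "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
pre  = rng.normal(80, 10, 1500)
post = rng.normal(81, 10, 1500)   # only ~1 point higher on average

t, p = stats.ttest_ind(post, pre)
d = (post.mean() - pre.mean()) / np.sqrt((post.var(ddof=1) + pre.var(ddof=1)) / 2)
print(f"t = {t:.2f}, p = {p:.4f}, d = {d:.2f}")  # p usually < .05, but d only ~0.1
```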
PRACTICAL EXAMPLE (3)
Differences in Males/Females to survey questions
| Question | Male (M) | Female (M) | Difference (p) |
| Q1 | 4.4 | 4.5 | p = .23 |
| Q2 | 3.3 | 3.5 | p = .22 |
| Q3 | 3.8 | 3.9 | p = .42 |
| Q4 | 4.2 | 4.2 | p = .87 |
| Q5 | 4.1 | 4.1 | p = .99 |
| … | … | … | … |
| Q47 | 4.9 | 4.5 | p = .05 |
| Q48 | 3.7 | 3.9 | p = .21 |
If you ran these tests with no theory behind them (N = 40~90), how many times would you expect to see p ≤ .05 across 50 questions just by chance?
A: roughly 2 or 3 times (50 × .05 = 2.5)
The effect sizes were all about the same, so Q47's p = .05 is most likely chance.
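A quick simulation of the expectation above: 50 questions with NO real male/female difference still produce p ≤ .05 a couple of times (made-up Likert data; scipy assumed):

```python
# 50 "questions" with no real difference: count p <= .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
hits = 0
for _ in range(50):
    male   = rng.integers(1, 6, size=60)   # random 1-5 Likert answers
    female = rng.integers(1, 6, size=60)
    _, p = stats.mannwhitneyu(male, female, alternative="two-sided")
    hits += p <= 0.05
print(hits)   # expect around 50 * .05 = 2.5 false positives
```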
WHY DOES THIS MATTER (2)?
- Effect sizes are directly comparable across different studies
- By reporting them, your study can be included in META-ANALYSES
- It is very difficult to include a study without reported effect sizes
COMMON MISSED OPPORTUNITY: NOT CONSIDERING CORRELATION / ASSOCIATION
WHY DOES THIS MATTER (1)?
Example: Four Likert-scale questions (Q1~Q3 below, correlated with Q4, which asked how much students liked the lesson)
| Question | Correlation with Q4 |
| I like music | rs = 0.14, p = 0.54 |
| I like movies | rs = 0.46, p = 0.03 |
| I like English | rs = 0.07, p = 0.78 |
Students who like movies liked the lesson much more than those who don't like movies.
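A minimal sketch of the Spearman correlation (rs) used above, which ranks the data and so suits ordinal Likert responses (hypothetical answers; scipy assumed):

```python
# Spearman correlation between two Likert questions.
from scipy import stats

likes_movies = [5, 4, 2, 5, 3, 1, 4, 5, 2, 3]   # made-up 1-5 responses
liked_lesson = [5, 4, 3, 4, 3, 2, 4, 5, 2, 4]

rs, p = stats.spearmanr(likes_movies, liked_lesson)
print(f"rs = {rs:.2f}, p = {p:.3f}")
```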
WHY DOES THIS MATTER (2)?
Mean (SD) for each measure:
| PreTest | PostTest | Improv. | HW | Dictogloss | ReadingBk |
| 54.5 (8.9) | 84.5 (10.9) | 30.0 (11.3) | 78.1 (14.4) | 81.1 (14.7) | 77.9 (16.3) |
Definite Improvement:
t(19) = 11.92, p < 0.01, d = 2.67
What activity pushed improvement?
| Variable | b value | Beta | t value | p value | Relative Weight |
| PreTest | -0.665 | -0.525 | -3.639 | 0.002 | 0.229 (31.34%) |
| Homework | 0.192 | 0.246 | 1.349 | 0.197 | 0.113 (15.49%) |
| Dictogloss | 0.448 | 0.587 | 3.271 | 0.005 | 0.344 (47.08%) |
| ReadingBk | -0.018 | -0.025 | -0.164 | 0.872 | 0.044 (6.07%) |
Multiple regression with relative importance: F = 10.810, p < .001, R² = 0.730
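A minimal sketch of the regression step using statsmodels (an assumption: the original analysis may have used different software, and the relative-weights step needs a separate routine not shown here). All numbers are made up:

```python
# Regress improvement on the activity scores (hypothetical data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 20
pretest    = rng.normal(55, 9, n)
homework   = rng.normal(78, 14, n)
dictogloss = rng.normal(81, 15, n)
improvement = 40 - 0.5 * pretest + 0.4 * dictogloss + rng.normal(0, 8, n)

X = sm.add_constant(np.column_stack([pretest, homework, dictogloss]))
model = sm.OLS(improvement, X).fit()
print(model.summary())   # b values, t values, p values, R-squared
```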
LET’S USE THE TOOL!
QUESTIONS?
Email:
Spring.ryan.Edward.c4@tohoku.ac.jp
Homepage:
Please cite or work with me!
(I have many other free tools as well)