1 of 35

HOW TO AVOID COMMON MISTAKES WITH STATISTICS

Hints and free tools

2 of 35

OVERVIEW

  1. Why am I doing this?
  2. The free tools & workshop download
  3. Common mistake 1: Choosing the wrong test
     → Not considering data types
     → Misunderstanding p > .05
  4. Common mistake 2: Not reporting effect size
     → Not realizing its importance
  5. Common missed extra step: Not considering association / correlation

3 of 35

WHY AM I DOING THIS?

  • I review a lot of papers
  • I see a lot of the same mistakes
  • I used to make these mistakes (and more)
  • We need to do better as a field (Plonsky, 2023)
  • I am not a jerk (Wife, 2022)

4 of 35

I FEEL YOUR PAIN

  1. Most examples are not from our field
     → Most textbooks use examples from medicine or economics
  2. Our field mixes many different types of studies (sometimes old tests, styles, etc.)
     → Copying a previous study’s stats methods can be risky

5 of 35

SO!

I made a free online stats tool that:

  1. Helps ensure you have the right test
  2. Helps ensure you are justifying / reporting correctly
  3. Gives examples in the field of linguistics / language learning

https://springsenglish.online/stats

→ Partially to help me learn programming

6 of 35

SPSS’ STRANGLEHOLD IS GONE

My site (FREE):
  • Significance tests (with effect sizes), correlation +, interrater reliability
  • No graphs
  • Helps you choose the test
  • Has linguistics / language acquisition based examples

R (FREE):
  • Bayesian statistics, machine learning, everything my site does (minus DA)
  • Excellent graphs
  • Doesn’t help you choose the test
  • Doesn’t have such examples

SPSS (EXPENSIVE):
  • Significance tests (most ES), correlation (costs extra!) (+/-), interrater reliability ($$) (+/-)
  • Some graphs
  • Doesn’t help you choose the test
  • Doesn’t have any examples

R is difficult, but…

7 of 35

FOR THIS WORKSHOP!

  • Please go to my website and download today’s example data sets

springsenglish.online/stats (my site)

  • “springs” → my name
  • “english” → what I teach
  • “.online” → because I’m poor (.com is REALLY expensive!!!)
  • “/stats” → to go to the stats part

8 of 35

COMMON MISTAKE 1: CHOOSING THE WRONG TEST

Part 1:

Considering Data Types

9 of 35

COMMON MISTAKE 1: CHOOSING THE WRONG TEST

Consideration 1: Is my data normal and continuous?

Normal:
  1. Average in the middle
  2. Most data close to the average

Continuous:
  1. Many possible values
  2. Steps between values roughly “equal” (+/-)

(a quick way to check normality is sketched below)
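Beyond eyeballing a histogram, one common numerical check is the Shapiro-Wilk test. A minimal sketch in Python, assuming scipy is installed and using made-up scores (not the workshop data):

```python
# Shapiro-Wilk normality check (made-up scores for illustration)
from scipy import stats

scores = [55, 60, 62, 65, 68, 70, 71, 74, 78, 85]
w, p = stats.shapiro(scores)
print(f"Shapiro-Wilk: W = {w:.3f}, p = {p:.3f}")
# p > .05 here: no evidence of non-normality (it does NOT prove normality)
```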

10 of 35

COMMON MISTAKE 1: CHOOSING THE WRONG TEST

Normal data might look like:

11 of 35

COMMON MISTAKE 1: CHOOSING THE WRONG TEST

Why does data normality matter?

  • Beware the AVERAGE TRAP!!

12 of 35

WHY DOES THIS MATTER?

If the data isn’t normal / parametric, the average doesn’t mean much:

  • Normal data: M = 70; students knew about 70%? YES!
  • Non-normal data: M = 70; students knew about 70%? NO!

13 of 35

WHY DOES THIS MATTER (2)?

t-test = What is the chance of a difference in averages this large? (Could these two curves have come from the same normal population?)

Mann-Whitney = What is the chance that so many of the high scores fall in one group? (Could these rank distributions have come from the same population?)

Calculations are ENTIRELY DIFFERENT!!! (see the sketch below)
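To make the difference concrete, a minimal sketch of both tests in Python (scipy assumed installed; the scores are made up for illustration):

```python
from scipy import stats

group_a = [55, 60, 62, 65, 70, 71, 74, 78, 80, 85]  # made-up scores
group_b = [50, 52, 58, 60, 63, 64, 66, 70, 72, 75]

# Parametric: compares means, assumes roughly normal data
t, p_t = stats.ttest_ind(group_a, group_b)

# Non-parametric: compares rank distributions, no normality assumption
u, p_u = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"t-test:       t = {t:.2f}, p = {p_t:.3f}")
print(f"Mann-Whitney: U = {u:.1f}, p = {p_u:.3f}")
```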

14 of 35

COMMON MISTAKE 1: CHOOSING THE WRONG TEST

Consideration 2: Are the data sets paired?

Paired (the same students take both tests):

  Student   | Test 1 | Test 2
  Student A |  100   |   90
  Student B |   40   |   50
  Student C |   70   |   50
  Student D |   60   |   60
  Student E |   50   |   70

OR unpaired (different students in each group):

  Test 1          | Test 2
  Student A: 100  | Student F: 90
  Student B:  40  | Student G: 50
  Student C:  70  | Student H: 50
  Student D:  60  | Student I: 60
  Student E:  50  | Student J: 70

15 of 35

WHY DOES THIS MATTER?

INDEPENDENT t-test = Is there a difference in these groups’ averages? (Could these two curves have come from the same normal population?)

DEPENDENT t-test = Is the average of each individual’s change large? (Is the mean of the paired differences far from zero?)

Calculations are ENTIRELY DIFFERENT!!! (see the sketch below)
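A minimal sketch of both versions in Python, reusing the scores from the table above (scipy assumed installed):

```python
from scipy import stats

test1 = [100, 40, 70, 60, 50]  # scores from the example table
test2 = [90, 50, 50, 60, 70]

# Unpaired: Students A-E vs. Students F-J (different people)
t_ind, p_ind = stats.ttest_ind(test1, test2)

# Paired: Students A-E took both tests (each position is one person)
t_rel, p_rel = stats.ttest_rel(test1, test2)

print(f"Independent: t = {t_ind:.2f}, p = {p_ind:.3f}")
print(f"Dependent:   t = {t_rel:.2f}, p = {p_rel:.3f}")
```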

16 of 35

LET’S TRY USING THE TOOL!

  • Get data from the first tab “ParametricOrNot”

  1. Is there a significant difference in Parametric TestScore(A) and TestScore(B)?
  2. How about the Non-Parametric sets?
  3. How about between TestScore(B) Parametric and TestScore(B) Nonparametric?
     (try saying that both sets are parametric and see what happens!)

17 of 35

COMMON MISTAKE 1: CHOOSING THE WRONG TEST

Consideration 3: How many data sets do I have?

2 sets ONLY:
  • t-test
  • Mann-Whitney
  • Wilcoxon
  • etc.

3+ sets:
  • ANOVA
  • Kruskal-Wallis
  • Friedman
  • etc.

(a sketch of the 3+ tests follows below)
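A minimal sketch of the 3+ options in Python (scipy assumed installed; the class scores are made up):

```python
from scipy import stats

class_a = [60, 65, 70, 72, 75]  # made-up scores for three classes
class_b = [55, 58, 62, 66, 68]
class_c = [70, 74, 78, 80, 85]

# 3+ independent groups, parametric: one-way ANOVA
f, p_anova = stats.f_oneway(class_a, class_b, class_c)

# 3+ independent groups, non-parametric: Kruskal-Wallis
h, p_kw = stats.kruskal(class_a, class_b, class_c)

# 3+ repeated measures (same students each time), non-parametric: Friedman
chi2, p_fr = stats.friedmanchisquare(class_a, class_b, class_c)

print(f"ANOVA:          F = {f:.2f}, p = {p_anova:.3f}")
print(f"Kruskal-Wallis: H = {h:.2f}, p = {p_kw:.3f}")
print(f"Friedman:       chi2 = {chi2:.2f}, p = {p_fr:.3f}")
```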

18 of 35

WHY DOES THIS MATTER?

19 of 35

LET’S TRY USING THE TOOL!

  • Get data from the tab “3Sets”
  • Where are the differences in the classes?
    → If these are different students and the data is parametric?
    → If these are the same students and the data is non-parametric?

20 of 35

COMMON MISTAKE 1: CHOOSING THE WRONG TEST

Part 2:

Misunderstanding p > .05

21 of 35

COMMON MISTAKE 1: CHOOSING THE WRONG TEST

GENERALLY…

p < .05 means there IS a difference (effect / etc.)

But…

p > .05:

  ✗ “The groups are the same”
  ✓ We don’t KNOW if there is a difference

22 of 35

WHY DOES THIS MATTER?

(e.g.) Two classes of 10 students each: responses to a 1~5 Likert-scale question

p = .09

BUT!!!!

  • The same trend with 11 students per class would give p = .05
  • Concluding “no difference” here risks a Type II error (missing a real effect)
  • You should collect more data, NOT NECESSARILY give up on this idea (see the sample-size sketch below)
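How much more data? A power analysis gives a rough answer. A minimal sketch, assuming statsmodels is installed; the effect size of d = 0.8 is an illustrative assumption, not a value from this example:

```python
from statsmodels.stats.power import TTestIndPower

# Per-group sample size for an independent t-test at
# d = 0.8, alpha = .05, and 80% power (all assumed for illustration)
n = TTestIndPower().solve_power(effect_size=0.8, alpha=0.05, power=0.8)
print(f"Students needed per group: {n:.0f}")  # roughly 26
```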

23 of 35

WHY DOES THIS MATTER?

Group 1: p = .06
Group 2: p = .02

Many people think: Group 2 improved more than Group 1

BUT!!!!

Look at the data: clearly they improved about the same amount

24 of 35

WHY DOES THIS MATTER?

Pre-post of 2 groups is FOUR GROUPS

No double t-tests!

  • There is a separate page for this in my tool
  • Multiple pieces of data (groups, times, etc.) MUST be put into a combined model
  • Let’s try the data in my example set (a model sketch follows below)
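A minimal sketch of one such combined model: a group × time interaction in Python (pandas and statsmodels assumed installed; the scores are made up). Note this simple OLS version ignores the pairing of pre and post scores within students; a mixed-effects model (e.g., smf.mixedlm) would handle that properly:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up pre/post scores for two groups of four students each
df = pd.DataFrame({
    "score": [60, 55, 62, 58, 70, 68, 72, 69,   # pre
              68, 66, 70, 65, 78, 80, 79, 77],  # post
    "time":  ["pre"] * 8 + ["post"] * 8,
    "group": (["G1"] * 4 + ["G2"] * 4) * 2,
})

# The group:time interaction is the real question:
# did the two groups change by different amounts?
model = smf.ols("score ~ group * time", data=df).fit()
print(model.summary())
```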

25 of 35

COMMON MISTAKE 2: NOT REPORTING EFFECT SIZE

Effect size:

  • How meaningful the relationship is
  • Do we care?
  • Does the trend hold for most points in the data set?

(a small calculation sketch follows below)
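One common effect size for two groups is Cohen's d, the mean difference in pooled standard-deviation units. A minimal sketch with made-up scores (numpy assumed installed):

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                        / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled_sd

print(cohens_d([70, 75, 80, 85, 90], [60, 65, 70, 75, 80]))  # ~1.26
```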

26 of 35

WHY DOES THIS MATTER (1)?

HELPS TO INTERPRET RESULTS BEYOND P VALUE

  • P value is affected by sample size
  • Effect size is not*
  • Reporting effect size helps prevent “p-hacking”
  • The American Statistical Association now cautions against over-reliance on p values (Wasserstein & Lazar, 2016)

27 of 35

REMEMBER THIS EXAMPLE?

(e.g.) Two classes of 10 students each: responses to a 1~5 Likert-scale question

p = .09

  • The effect size is MODERATE
  • Alerts you that more data would likely give you p < .05

28 of 35

PRACTICAL EXAMPLE (2)

  Year | N    | Pre-test | Post-test | Improvement | P-value | Effect size
  2020 | 1500 |    80    |    86     |  6 points   | p < .01 | d = 0.6
  2021 | 1549 |    81    |    84     |  3 points   | p < .01 | d = 0.3
  2022 | 1497 |    82    |    88     |  6 points   | p < .01 | d = 1.2

Which year had the best learning outcomes?

→ 2022 (d = 1.2): more students improved overall, with a more even spread of improvement

With large sample sizes, the p value becomes uninformative.

29 of 35

PRACTICAL EXAMPLE (3)

Differences in Males/Females to survey questions

Question

Male

Female

Diff.

Q1

4.4

4.5

p = .23

Q2

3.3

3.5

p = .22

Q3

3.8

3.9

p = .42

Q4

4.2

4.2

p = .87

Q5

4.1

4.1

p = .99

Q47

4.9

4.5

p = .05

Q48

3.7

3.9

p = .21

If you did this randomly with no theory for N=40~90, how many times would you expect to see p<=.05 for 50 questions?

A: depends on sample size, but about 2 times

Effect sizes were all about the same
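The expected count is just the false-positive rate times the number of tests:

```python
# Expected "significant" results from 50 unrelated tests at alpha = .05
n_tests, alpha = 50, 0.05
print(n_tests * alpha)  # 2.5 false positives expected by chance alone
```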

30 of 35

WHY DOES THIS MATTER (2)?

  • Effect size is more directly comparable across various studies
  • By including it, your study can be included in META-ANALYSES
  • Without effect size reporting, your study is very difficult to include

31 of 35

COMMON MISSED OPPORTUNITY: NOT CONSIDERING CORRELATION / ASSOCIATION

  1. Starting point for analyzing survey data
  2. Data triangulation
  3. Finding factors involved in improvement

32 of 35

WHY DOES THIS MATTER (1)?

Example: Four Likert-scale questions

  Question           | Correlation to Q4
  “I like music”     | rs = 0.14, p = 0.54
  “I like movies”    | rs = 0.46, p = 0.03
  “I like English”   | rs = 0.07, p = 0.78

→ Students who like movies liked the lesson much more than those who don’t like movies
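The rs values above are presumably Spearman's rank correlation (appropriate for ordinal Likert data). A minimal sketch with made-up responses, scipy assumed installed:

```python
from scipy import stats

# Made-up 1-5 Likert responses from the same ten students
likes_movies = [5, 4, 3, 5, 2, 4, 1, 3, 5, 4]
liked_lesson = [5, 5, 3, 4, 2, 4, 2, 3, 5, 3]

rs, p = stats.spearmanr(likes_movies, liked_lesson)
print(f"rs = {rs:.2f}, p = {p:.3f}")
```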

33 of 35

WHY DOES THIS MATTER (2)?

             | PreTest    | PostTest    | Improv.   | HW          | Dictogloss  | ReadingBk
  Mean (SD)  | 54.5 (8.9) | 84.5 (10.9) | 30 (11.3) | 78.1 (14.4) | 81.1 (14.7) | 77.9 (16.3)

Definite improvement: t(19) = 11.92, p < 0.01, d = 2.67

What activity pushed improvement?

  Variable   | b value | Beta   | t value | p value | Relative weight
  PreTest    | -0.665  | -0.525 | -3.639  | 0.002   | 0.229 (31.34%)
  Homework   |  0.192  |  0.246 |  1.349  | 0.197   | 0.113 (15.49%)
  Dictogloss |  0.448  |  0.587 |  3.271  | 0.005   | 0.344 (47.08%)
  ReadingBk  | -0.018  | -0.025 | -0.164  | 0.872   | 0.044 (6.07%)

Multiple correlation with relative importance: F = 10.810, p < .001, R² = 0.730
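A minimal sketch of a multiple regression along these lines (pandas and statsmodels assumed installed; the numbers are made up, not the study's data). The relative-importance weights in the table require an extra step that plain statsmodels output does not provide:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up scores for eight students (illustration only)
df = pd.DataFrame({
    "improvement": [30, 25, 40, 35, 20, 45, 28, 33],
    "pretest":     [55, 60, 50, 52, 65, 48, 58, 54],
    "homework":    [80, 70, 90, 85, 60, 95, 75, 78],
    "dictogloss":  [82, 75, 92, 88, 65, 96, 78, 80],
    "readingbk":   [77, 80, 75, 79, 82, 74, 78, 76],
})

model = smf.ols(
    "improvement ~ pretest + homework + dictogloss + readingbk", data=df
).fit()
print(model.summary())  # b values, t, p, F, and R-squared
```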

34 of 35

LET’S USE THE TOOL!

  • Check the Correl1 tab
  • Compare the correlation between Q1 and Q2
  • Make a correlation matrix

  • Check the MultiCorrel tab
  • Check my calculations on the previous slide!
  • See how different the calculations are if you predict the post-test scores instead of “improvement”

35 of 35

QUESTIONS?

Email:

Spring.ryan.Edward.c4@tohoku.ac.jp

Homepage:

www.springsenglish.online

Please cite or work with me!
(I have many other free tools as well)