1 of 165

Community for Rigor

Confirmation bias: the original error

2 of 165

Activity: Deduce the number rule.

Let’s try some hypothesis testing.

But first, brief instructions.

3 of 165

Activity: Enter numbers to guess a secret rule.

Initial screen

Input a number sequence

Test the number sequence

Does it match? Keep testing.

Know the rule? Submit hypothesis.

4 of 165

Activity: Deduce the number rule.

Now it’s your turn to determine the secret rule.

5 of 165

So, what happened?

6 of 165

Discuss: Did you falsify your hypothesis?

How did this activity go for you? Were there moments when you realized you needed a new strategy?

Think

Pair

Share

7 of 165

What have researchers found about bias?

Here’s one experiment that shows how context inserts bias that interferes with expertise.

5 fingerprint experts agreed to a study and examined prints they had previously identified as matches.

The experts were instructed to ignore contextual information and given unlimited time to examine the prints.

The catch—this time the experts were told the prints were “erroneously matched.”

8 of 165

We are swayed by misleading context even when we know better.

“... experts are vulnerable to irrelevant and misleading contextual influences.”

(Dror et al. 2006).

4 out of 5 experts contradicted their own previous identification, and incorrectly assessed the prints were not a match.

Only 1 expert maintained the original "match" judgment.

9 of 165

We discount information that undermines our past judgments.

In a betting scenario, participants significantly increased their bets when their partners agreed with them.

They only slightly decreased their wagers, however, when their partners disagreed (Kappes et al. 2020).

10 of 165

As you can see, we tend to confirm our hypotheses.

11 of 165

This impairs our research.

12 of 165

This tendency is a cognitive bias called confirmation bias.

13 of 165

Our biased brains

Lesson 1

14 of 165

Confirmation bias:

People’s tendency to process information by looking for, or interpreting, information that is consistent with their existing beliefs.

15 of 165

By distorting how we think,

Confirmation bias skews our ability to conceive, test, and evaluate scientific hypotheses

(Kahneman & Tversky, 1996).

16 of 165

In this unit, confirmation bias also covers two related cognitive biases:

expectation bias & observer bias.

17 of 165

Expectations can produce changes

In the 1960s, Robert Rosenthal and Kermit Fode asked students to train 2 breeds of rats to solve mazes.

There were "maze-bright" & "maze-dull" rats.

The rats were actually genetically identical.

Student expectations affected training & created performance differences by the end of the study.

18 of 165

This is expectation bias

Also known as experimenter’s bias, it is:

The tendency for researcher expectations to influence subjects and outcomes.

Experimenters’ expectations can produce an actual, measurable difference in performance.

19 of 165

Expectations can affect data collection

In a more recent study, students were asked to rate the sociability of 2 breeds of pigs: normal vs. high social breeding value [SBV+] (Tuyttens et al. 2014).

The video recordings were identical, but SBV+ pigs were rated as more social.

Students did not interact with the pigs, so the difference resulted from varying human perceptions.

20 of 165

This is observer bias

Tendency for researcher expectations to influence their perceptions during a study, thus affecting outcomes.

This can look like:

Expectations of data collectors can cause them to perceive differences that are not actually there.

21 of 165

These biases impede rigorous science & impair our ability to:

Design informative, objective experiments and interpret experimental results impartially.

22 of 165

We can manage this inclination toward confirmation bias.

23 of 165

Since we know confirmation bias distorts how we:

Frame & ask questions.

Seek out information.

Collect observations.

Make sense of data.

24 of 165

Since we know confirmation bias distorts how we:

Frame & ask questions.

Seek out information.

Collect observations.

Make sense of data.

We can mitigate bias by:

Building habits to formulate better hypotheses.

Designing experiments to rigorously test and compare hypotheses.

Using appropriate experimental methods to reduce errors.

Placing results correctly on the exploratory/confirmatory axis.

25 of 165

Unit overview

  1. Our biased brains
  2. “Favored” vs. alternative hypotheses
  3. CB distorts observations & data collection
  4. Mitigating CB through masking (blinding)
  5. How good is your mask?
  6. CB motivates poor analytical practices
  7. Statistical models need data masking
  8. Bonus biases that disrupt research

26 of 165

Next, let’s learn how to formulate better hypotheses to address how we frame and ask questions.

27 of 165

“Favored” vs. alternative hypotheses

Lesson 2

28 of 165

Many studies don't rigorously test a hypothesis.

Instead, they show weak evidence in support of a favored hypothesis.

29 of 165

How does this happen?

30 of 165

Often studies hit stumbling blocks because they:

Start with a vague hypothesis.

Compare a favored hypothesis with a trivial null hypothesis.

Don't suitably compare or test multiple plausible hypotheses.

31 of 165

Often studies hit stumbling blocks because they:

Start with a vague hypothesis.

Compare a favored hypothesis with a trivial null hypothesis.

Don't suitably compare or test multiple plausible hypotheses.

What happens as a result?

We interpret any outcome as supporting our hypothesis.

We get results that don't provide any new insights.

We design the experiment to disprove H0.

Tip!

H₀ = Null Hypothesis.

It’s the default assumption that there is no effect.

32 of 165

What’s the harm?

Vague hypothesis: Without a connection to a specific hypothesis or related models or theories, the result has limited value to others working in this area.

Disproving H0: Disproving a null hypothesis gives no information about what is occurring. A false H0 is consistent with many hypotheses, including the favored Ha.

H1 & H2 are NOT mutually exclusive: When H1 and H2 are not mutually exclusive, both could be right (or both wrong); the study results don't give clear answers one way or another.

Ha = Alternative Hypothesis, e.g. the effect exists / the magnitude is NOT 0.

H1 = Explanatory Hypothesis 1, e.g. the effect operates through mechanism 1.

H2 = Explanatory Hypothesis 2, e.g. the effect operates through mechanism 2.

33 of 165

What solutions are there?

Vague hypothesis: To find a hypothesis worth testing, we can run exploratory studies, search the literature, consult an LLM, or ask an expert in the area.

Disproving H0: Find one of the other possible hypotheses. Other researchers in this area (that we disagree with) usually have one that they like.

H1 & H2 are NOT mutually exclusive: Make the hypotheses exclusive so that any result is informative; or design a study to be comparative, e.g. about the magnitude of H1 vs H2.

Let’s focus on solving this last problem.

34 of 165

Experiments with 2 mutually exclusive hypotheses are more likely to yield

a clear differentiation between potential explanations.

35 of 165

When hypotheses are not mutually exclusive, however, they increase the likelihood of confirming

a favored hypothesis, because we don't consider other plausible outcomes in the experiment.

36 of 165

How does a “favored” hypothesis lead to confirmation bias?

When only one hypothesis is tested, we tend to design studies to look for results consistent with it being true (e.g. having a hypothesized number rule, and testing a matching sequence).

37 of 165

Ex: The Left-Brain/Right-Brain Myth

Research conducted by neuropsychologist Roger Sperry on the specialized functions of the two brain hemispheres (Sperry 1967, Sperry 1968) was misinterpreted in the ‘70s and ‘80s, giving rise to early theories that the left hemisphere is exclusively logical and the right exclusively creative (Gazzaniga 2005).

38 of 165

This dichotomy became a “favored” explanation for personality traits, relying on oversimplified evidence and ignoring conflicting findings.

Focusing on supporting data reinforced the myth and discouraged consideration of more complex models.

Confirmation bias in action

39 of 165

The persistence of this myth influenced teaching strategies, career counseling, and how individuals viewed their cognitive abilities—potentially limiting personal growth and misdirecting educational efforts.

Impact on education and self-perception

40 of 165

What does the science say?

Neuroimaging studies reveal extensive communication between the two hemispheres and show that complex behaviors result from coordinated activity across these brain regions (Toga & Thompson, 2003).

Cognitive functions emerge from an integrated network that spans both hemispheres—not from isolated “left” or “right” processes.

41 of 165

Toga and Thompson exemplified exploring mutually exclusive hypotheses by examining whether structural brain asymmetries are intrinsically determined or experientially shaped.

They methodically evaluated both through their comprehensive review of anatomical differences between hemispheres and their developmental origins.

42 of 165

So, we have a way out: develop competing hypotheses!

Strong(er) inference practices depend on us making better hypotheses

(Platt 1964).

43 of 165

ACTIVITY: develop a competing hypothesis.

Consider other possible explanations to build out other plausible hypotheses.

44 of 165

Which prompts helped you to think of a competing hypothesis?

Think

Pair

Share

Discussion

45 of 165

Recap: What’s a good place to start?

There are 2 hypotheses that are mutually exclusive.

Both H1 and H2 are plausible.

46 of 165

Next up: countering bias that creeps into experimental design.

47 of 165

Researcher degrees of freedom

Lesson 3

48 of 165

Confirmation bias can creep into our experimental design.

49 of 165

Watch out for design choices, such as:

Which population/animal model to test.

What specific form of treatment to apply.

How to measure the outcome.

Which kind of statistical test to use.

50 of 165

Consider these 2 workable hypotheses:

H1 = Emotional images have a localized neural correlate.

H2 = Emotional images have a diffuse neural correlate.

What could go wrong?

51 of 165

Consider these 2 workable hypotheses.

H1 = Emotional images have a localized neural correlate.

H2 = Emotional images have a diffuse neural correlate.

Diffuse: Multiple or undefined regions of the brain undergo significant change in response to a stimulus.

Emotional Images: Images that stimulate a dominant emotion such as happiness or fear through depictions of people and animals.

Localized: A specific region of the brain undergoes significant change in response to a stimulus.

Neural correlate: A measurable pattern of nervous-system activity that reliably accompanies a mental state or event, measured by an fMRI technique that infers brain activity from localized changes in blood oxygenation.

52 of 165

ACTIVITY: Pick the most biased choice

Compare choices and pick the option that you think will most bias results toward H1, then explain why.

53 of 165

Discussion: How was voting?

Did you agree or disagree with the results of the voting? What did you notice that helped you make your decision?

Think

Pair

Share

54 of 165

Let’s take a closer look at how some of the researcher choices insert bias:

The authors choose the smoothing, region of interest, and voxels to analyze so that the effects are as large as possible, favoring H1.

We need to avoid double dipping - never use what you see to decide how to analyze the same data.

Choice 1: Double dipping

55 of 165

Let’s take a closer look at how some of the researcher choices insert bias:

The authors screen participants to find personality types more likely to create strong results. How do we know those people are representative of a larger population? Do we know how personality type affects the brain?

Choice 2: Personality screen

We need to avoid selecting ourselves into populations that seem more likely to have the result we want.

56 of 165

Both choices closely favored H1 & took complex steps to erase real variability from data.

57 of 165

How does this happen?

The study favors H1.

Researchers make MANY choices that cut real variability from data.

Then averaging, biased selection, & augmenting data help make H1 SEEM true.

Hazard!

These choices all make H1 likely to be supported.

58 of 165

Researcher degrees of freedom:

The myriad choices researchers make when designing experiments, and the flexibility that arises from these choices, allow bias to creep in via subjectivity & interpretation.

59 of 165

Researcher degrees of freedom introduce many instances of flexibility, including (Gelman and Loken 2016):

Vague hypotheses that could be validated by different results.

Choosing when to stop data collection.

Selecting subsets of data.

Removing data anomalies.

Measuring multiple outcomes to select from.

Selecting and designing models.

Choosing the results or analyses to emphasize in the paper.

“This is how we do it in our lab.”

60 of 165

The more degrees of freedom we have

the more we’re going to choose options

that favor our initial hypothesis.

61 of 165

We minimize the confirmation bias posed by this flexibility through rigorous experimental practices:

Developing specific, falsifiable hypotheses.

Randomizing subject allocation.

Masking study participants, researchers, and analysts.

Pre-registering hypotheses and analysis plans.

Transparently reporting ALL data analysis.

Reporting null and unexpected results.

62 of 165

Confirmation bias can introduce error into research.

However, one key principle of rigorous experimental design that will help us combat it is: masking.

63 of 165

Let’s dive deeper to see how masking curbs researcher degrees of freedom & bias.

64 of 165

Mitigating bias through masking

Lesson 4

65 of 165

Experiment design

& execution involve many decisions.

66 of 165

These decision points can be a place where confirmation bias affects our work.

67 of 165

Let’s look at some choices in one segment of an experiment.

68 of 165

Ex: Testing a new Parkinson’s drug

Hypothesis: Treated mice are more active.

Preparer (You!): prepares the Treatment and Control groups.

Administrator (You)

Choice bias! You’re aware which mice look more active before the experiment. You may assign them to treatment (which you want to work).

Data collector (Also You): Mouse 1, 2, 3 are all more active; only mouse 6 is more active.

Labeling bias! You expect the treated mice to do better, so you may view them as more active.

Analyst (Also Also You): After outlier correction the treatment group responds well, relative to the control group (p < 0.05).

Data analysis bias! You feel that the outlier (which is doing well) needs to be rejected. Thankfully your results were statistically significant.

69 of 165

Discuss: What went wrong?

What problems occurred in this procedure?

Think

Pair

Share

70 of 165

By not withholding information at important steps, all these choices introduced bias.

71 of 165

What should we do instead?

Mask our study!

72 of 165

“Masking is the process by which

information that has the potential to influence study results is withheld from one or more parties involved in a research study.”

(Monaghan et al. 2021)

Terminology

Masking & blinding are terms that describe the same practice. We use masking due to the ableist connotations of associating blindness with being unaware of information.

73 of 165

But how can you know if biases are distorting results?

To answer that, we need to review p-values.

74 of 165

A p-value (p):

Is a measurement ranging from 0 to 1 that quantifies the probability of observing data at least as extreme as what was observed, if the null hypothesis, H0, is true.

If the null hypothesis H0 is true, all p-values from 0 to 1 are equally likely.

A small p-value suggests that data are extreme, relative to what would be expected if H0 were true.

A p-value lower than a pre-set significance threshold (usually 0.05) is used to reject H0.
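A minimal sketch of these properties (assuming Python with NumPy and SciPy; the group sizes and seed are arbitrary): when two groups are drawn from the same distribution, H0 is true, the resulting p-values spread roughly evenly between 0 and 1, and about 5% fall below the 0.05 threshold by chance alone.

```python
# Sketch: behavior of p-values when H0 is true (no real group difference).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p_values = []
for _ in range(10_000):
    # Two groups drawn from the SAME distribution, so H0 is true.
    control = rng.normal(loc=0.0, scale=1.0, size=20)
    treatment = rng.normal(loc=0.0, scale=1.0, size=20)
    _, p = stats.ttest_ind(treatment, control)
    p_values.append(p)

p_values = np.array(p_values)
# Under H0, p-values are spread roughly evenly between 0 and 1,
# so about 5% land below the usual 0.05 threshold by chance alone.
print("Fraction of p-values below 0.05:", (p_values < 0.05).mean())
```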

75 of 165

Can masking really have an impact on experiments?

Let's look at some examples.

76 of 165

Failing to mask significantly inflates effect sizes

(Vesterinen et al. 2010).

When treatment allocation is masked,

measured efficacy of drug NXY-059 drops from an average of 54.0% to 25.1%.

(MacLeod et al. 2008).

77 of 165

In a meta-analysis of 290 animal research studies,


the odds of reporting a positive result were 5.2 times greater when neither masking nor randomization were used (Bebarta et al. 2003).

78 of 165

In other words, failing to mask adds a bias that can increase effect size by ~0.28 to 0.91σ.

σ = standard deviation.

79 of 165

ACTIVITY: explore the impact of bias.

Use these power curves to explore how the probability of detecting an effect is affected by masking failures.
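A minimal sketch of how such power curves can be generated (assuming Python with NumPy and SciPy; the effect sizes and the 0.5σ unmasking bias are hypothetical illustration values, not figures from the studies above): it approximates power for a two-sided, two-sample test with and without a bias added to the effect size.

```python
# Sketch: power curves with and without a hypothetical unmasking bias on effect size.
import numpy as np
from scipy.stats import norm

def approx_power(effect_size_sd, n_per_group, alpha=0.05):
    # Normal approximation for a two-sided, two-sample comparison.
    z_crit = norm.ppf(1 - alpha / 2)
    noncentrality = effect_size_sd * np.sqrt(n_per_group / 2)
    return 1 - norm.cdf(z_crit - noncentrality)

true_effect = 0.3        # hypothetical real effect, in units of sigma
unmasking_bias = 0.5     # hypothetical bias added when masking fails

for n in range(5, 105, 10):
    masked = approx_power(true_effect, n)
    unmasked = approx_power(true_effect + unmasking_bias, n)
    print(f"n/group={n:3d}  power (masked)={masked:.2f}  power (unmasked)={unmasked:.2f}")
```

The bias inflates apparent power at every sample size, which is why unmasked studies report positive results more often.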

80 of 165

Discuss: What is the impact of sample size?

What are the implications for the reliability of research findings when there is a bias on effect size?

How is this different for small or large samples?

Think

Pair

Share

81 of 165

How can we use masking to combat confirmation bias?

82 of 165

Just as confirmation bias distorts how we:

Frame & ask questions.

Seek out information.

Collect observations.

Make sense of data.

It also exposes research to risks, such as:

Raters make judgments when collecting data.

Data analysts make modeling choices.

We can use masking to mitigate bias by:

Withholding details that affect direct observation and judgment.

Providing de-identified, coded data to prevent biased analysis.

83 of 165

Think back to our example.

How do you actually do the masking?

84 of 165

Ex: Testing a new Parkinson’s drug

Hypothesis: Treated mice are more active.

Preparer (You!): prepares the Treatment and Control groups.

Administrator (Colleague!)

Choice bias averted! Your colleague ideally doesn’t know which mice are more or less active; they randomly assign mice to each group.

Data collector (You): Mouse 1 and 3 are more active. Mouse 6 is more active.

Another bias averted! Since you don’t know the group assignments, your observations will be more objective.

Analyst (New Colleague!): Your colleague carefully compares the two groups and finds there is no statistically significant difference.

What next? Since you followed a thoughtful experimental design, you have results you can trust.

85 of 165

Discuss: What is different in this experiment?

How did taking steps to mask this study impact the results?

Think

Pair

Share

86 of 165

So, who & what must

be masked?

87 of 165

Who gets masked

Mask anyone with a possible influence on the outcome of the experiment:

Single-masked: participants (for human studies) & animal care staff (for animal studies)

Double-masked: experimentalists & clinicians

Triple-masked: data collectors & data analysts

88 of 165

What gets masked

Mask samples

So, people conducting the experiments (administering drug or placebo) & people assessing the outcome (making observations & recording data) are not aware of:

which treatment is being administered

which group a given sample belongs to

89 of 165

What gets masked

Mask samples: which treatment is being administered; which group a given sample belongs to.

Also called allocation concealment: for example, “using sealed envelopes … until after a participant had been irrevocably entered into a trial” (Schulz et al. 2018).

Allocation concealment works to prevent selection bias by using “a mechanism to prevent foreknowledge of upcoming assignments.”

90 of 165

As you can see, masking is a complex process even when seemingly simple.

91 of 165

What should you do if you have limited resources or tools?

Randomization: Use a random number generator to set treatment-allocation order, reducing selection bias (see the sketch after this list).

Strategic masking: Keep key support staff (e.g., coders, care staff) unaware of group assignments.

Robust measurements: Design studies with measurements that resist manipulation from personal insights.

Pre-registration safeguard: Preregister study protocols to validate and monitor potential biases from unmasked practices.
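A minimal sketch of the first two points (assuming Python’s standard library; the animal IDs and group codes are hypothetical): allocation order is set with a seeded random number generator, and the key linking coded labels to treatment vs. control is stored separately so data collectors and analysts see only the codes.

```python
# Sketch: randomized allocation with coded group labels (standard library only).
import random

animal_ids = [f"mouse_{i:02d}" for i in range(1, 13)]  # 12 hypothetical subjects

rng = random.Random(42)       # seeded random number generator sets allocation order
shuffled = animal_ids.copy()
rng.shuffle(shuffled)

# First half -> code "A", second half -> code "B".
allocation = {aid: ("A" if i < len(shuffled) // 2 else "B")
              for i, aid in enumerate(shuffled)}

# The key is stored separately and withheld from data collectors and analysts.
group_key = {"A": "treatment", "B": "control"}

for aid in animal_ids:
    print(aid, "->", allocation[aid])   # only coded labels are visible downstream
```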

92 of 165

Next, let’s examine some risks to unmasking.

93 of 165

How good is your mask?

Lesson 5

94 of 165

You’ve masked your study!

Or have you?!

Masking may not be fully effective depending on the circumstances.

95 of 165

What if …

The treatment causes inflammation at the site of injection, whereas the control doesn’t.

A psychoactive treatment causes an obvious good time and patients notice changes in perception and behavior.


Animal strains have differences that are unrelated to the behavior of interest, but are noticed by experimenters (like raven funk).

96 of 165

Even when you’ve taken steps to mask an experiment, there are unmasking risks.

Certain clues can accidentally unmask an experiment and come from sources such as:

Caretaker behavior.

Equipment or procedure discrepancies.

Differing side effects.

Healing rates.

97 of 165

Let’s explore where in an experiment unmasking might occur.

98 of 165

ACTIVITY: Spot potential risks for unmasking!

Review a study and suggest solutions to a research team.

99 of 165

Discuss: How can we manage these challenges?

How do the lessons learned from this case study apply to other areas of research where similar biases or rigor issues could occur?

Think

Pair

Share

100 of 165

But should we also formally assess whether participants, researchers, or administrators have been unmasked?

101 of 165

Assessing if masking has been done (or done correctly) can be complicated or misleading.

There is also no consensus on how the efficacy of masking should be assessed (Born 2024).

102 of 165

In fact, sometimes explicitly assessing the effectiveness of masking inserts bias when questioning study participants (Schulz et al. 2010).

103 of 165

Strategies to mitigate & test for accidental unmasking broadly (Muthukumaraswamy et al. 2021):

Use placebo controls to prevent treatment effects from revealing information.

Provide clear, neutral instructions to minimize expectation bias.

Report participant expectations and unmasking instances for accurate result interpretation.

Review past masking procedures to improve future designs.

Placebo controls aren’t perfect!

Differential side effects (e.g., noticeable nausea or drowsiness) or other treatment cues can lead to unmasking.

104 of 165

What if an experiment absolutely cannot be masked?

Sometimes masking is nearly impossible (as in psychedelic drug trials)

Even so, bias can be reduced by measuring & accounting for pre-trial treatment expectations.

Report it!

105 of 165

Strategies for reporting studies that cannot be masked (Muthukumaraswamy et al. 2021):

State masking was incomplete & describe why (e.g., due to the unmistakable psychoactive effects of the treatment).

Summarize participant/assessor guesses (ideally using indexes like the Bang/James Blinding Index) & include confidence ratings for these guesses.

Describe key trial instructions, advertising, & consent details shaping perceptions.

Explain the potential bias in treatment effect estimates & note any analytic strategies (like conditioning on participants’ beliefs) to adjust for these biases.

Present baseline & post-intervention expectancy using standardized tools (like the Stanford Expectations of Treatment Scale) to quantify how participant beliefs may bias outcomes.

106 of 165

Next, learn about important analytical practices to prevent bias when exploring data.

107 of 165

Analytical practices to mitigate bias

Lesson 6

108 of 165

We’ve seen how confirmation bias can influence how data is collected.

But what happens after that?

109 of 165

Let’s explore the choices that occur in a data analysis.

The StudentLife dataset (Wang et al. 2014) measures:

student sleep habits, exercise, socialization, etc.

110 of 165

When exploring a dataset for relationships of interest, we often start by computing a

correlation coefficient

111 of 165

A correlation coefficient (r):

Is a specific measure, ranging from -1 to 1, that quantifies the strength & direction of a linear relationship between two variables.

The closer r is to zero, the weaker the relationship.

Positive r values indicate a positive correlation: both variables increase or decrease together.

Negative r values indicate a negative correlation: as one variable increases, the other decreases.
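A minimal sketch of computing r (assuming Python with NumPy; the toy values and variable names are hypothetical stand-ins, not actual StudentLife fields):

```python
# Sketch: computing a Pearson correlation coefficient r on toy data.
import numpy as np

hours_of_sleep  = np.array([6.5, 7.0, 5.5, 8.0, 7.5, 6.0, 7.2, 5.8])
reported_stress = np.array([7.0, 5.5, 8.0, 4.0, 5.0, 7.5, 5.2, 8.5])

r = np.corrcoef(hours_of_sleep, reported_stress)[0, 1]
print(f"r = {r:.2f}")  # negative: more sleep goes with lower reported stress in this toy data
```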

112 of 165

Specifically, the correlation coefficient, r, is useful for:

Quantifying trends.

Identifying relationships.

Making predictions.

113 of 165

ACTIVITY: Now, let’s look at the Dartmouth data.

Test a given hypothesis and try to justify it with the data provided.

114 of 165

Discuss: What did you discover?

For the relationships that you found in the data, did you generate (causal) stories or explanations?

Think

Pair

Share

115 of 165

Although our explanations may be plausible, did we actually test them as formal hypotheses?

(No)

116 of 165

We did engage in exploratory data analysis!

Discovering trends & patterns can generate hypotheses for further testing.

117 of 165

Much of research is exploratory in nature.

(and that’s okay!!)

Note:

118 of 165

However, exploratory work is sometimes reported as confirmatory.

(and that’s NOT okay!!)

119 of 165

Recall the warning from lesson 2: When you have a vague hypothesis, you are inclined to find evidence for it.

The problem is that many different data patterns can be interpreted as support.

For example, both of these specific patterns in the data:

Students who spend more time conversing also report less stress.

Students who interact with more people also report feeling less lonely.

Could be interpreted as evidence for this vague hypothesis:

Undergrad students who socialize more have better mental health.

120 of 165

This is an example of:

First exploring a data set.

Making analysis choices that reveal an interesting pattern.

Then interpreting the pattern as a test of statistical inference.

Watch out!

Making data analysis choices to obtain a statistically significant result is a questionable research practice known as p-hacking.
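A minimal sketch of why this matters (assuming Python with NumPy and SciPy; the dataset here is pure noise, not the StudentLife data): screening many variable pairs for small p-values will turn up "significant" correlations by chance, which is exactly the pattern-then-inference trap described above.

```python
# Sketch: "significant" correlations appear by chance when many pairs are screened.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_students, n_variables = 50, 20
data = rng.normal(size=(n_students, n_variables))  # pure noise: no true relationships

hits = []
for i in range(n_variables):
    for j in range(i + 1, n_variables):
        r, p = stats.pearsonr(data[:, i], data[:, j])
        if p < 0.05:
            hits.append((i, j, r, p))

n_pairs = n_variables * (n_variables - 1) // 2
print(f"{len(hits)} 'significant' correlations out of {n_pairs} pairs tested")
# Reporting these as confirmed findings, instead of as leads to test on new data, is p-hacking.
```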

121 of 165

What do we do instead?

122 of 165

Validate trends & patterns by conducting confirmatory tests on other datasets.

Pre-register predictive hypotheses & test them by collecting new data.

See trends after exploratory research? Awesome! Now you can:

Use identified relationships as the basis to develop specific hypotheses for why the relationship exists.

123 of 165

Ultimately, we must be careful to not mistake an exploration of data with a confirmatory test of a hypothesis.

124 of 165

Next, see how even machine learning models can have bias.

125 of 165

Data masking for machine learning

Lesson 7

126 of 165

We have seen how researcher degrees of freedom allow bias to occur in data analyses.

But what about when we deploy machine learning to make those choices for us?

127 of 165

Scenario: A research team is

developing a machine learning model to detect Parkinson’s disease.

128 of 165

The plan:

The data: Medical data from 40 patients, half with PD (PD+) & half without (PD-).

Training: Train the model on data from all 40 patients.

Testing: Evaluate model performance on a subset of patients.

129 of 165

We've already used all the data to train the model.

Evaluating the model on the same data used for training is “double-dipping” and it contaminates our evaluation of the model.

But wait!

130 of 165

Leakage occurs when model training includes inappropriate information, such as the same data used to evaluate performance.

This problem is referred to as leakage!

131 of 165

Data leakage is a flaw in machine learning that leads to over-optimistic results,

per a survey showing that leakage affects 294 papers across 17 scientific fields

(Kapoor & Narayanan 2023).

132 of 165

So, what should we do instead?

133 of 165

Prevent leakage through data holdout!

Standard data holdout partitions data for training & testing across non-overlapping splits.

Training:

134 of 165

Prevent leakage through data holdout!

Standard data holdout partitions data for training & testing across non-overlapping splits.

Hold this data out and use it for testing only!

Training:

135 of 165

Activity: Build a Parkinson’s Disease detector!

See these principles in action for yourself.

136 of 165

Discuss: How to evaluate model performance?

How did the performance of the model change? What other kinds of information in the training data could result in leakage?

Think

Pair

Share

137 of 165

Data leakage can occur in several ways:

Splitting temporal data into segments can still result in overlap due to trends.

Trying multiple models or feature sets and selecting the one that performs best on the test set.

Curated datasets may include the same data points in both training and testing, compromising generalization.

Augmenting small datasets with simulated data may inadvertently include training information in the test set.

138 of 165

When doing model building & evaluation,

make sure to:

Include more than 50 subjects, if possible.

Ensure training & test datasets don’t share extra information.

Train on a subset & test on the remaining (e.g. 80% and 20%).
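A minimal sketch of such a subject-level holdout (assuming Python with scikit-learn; the arrays and the patient setup are hypothetical): records are split roughly 80%/20%, and no patient contributes data to both the training and the test side.

```python
# Sketch: patient-level holdout so training and test sets share no subjects.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_records, n_features = 200, 10
X = rng.normal(size=(n_records, n_features))          # hypothetical features
patient_ids = rng.integers(0, 40, size=n_records)     # 40 hypothetical patients
y = rng.integers(0, 2, size=n_records)                # toy PD+/PD- labels

# ~80%/20% split, keeping every patient's records on one side only.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))
assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])

model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
print("held-out accuracy:", model.score(X[test_idx], y[test_idx]))
```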

139 of 165

Reminder:

To use data holdout to assess model generalizability & identify potential overfitting:

Use appropriate partitions to avoid leakage.

Recognize that inflated performance can mislead model evaluation & subsequent decision-making.

Always separate training & test data to avoid shared subject-specific information.

140 of 165

Next up is our final lesson! Let’s learn about the wider world of cognitive biases (& errors).

141 of 165

Bonus biases that disrupt research

Lesson 8

142 of 165

Confirmation bias is not the only bias to watch out for.

The Catalogue of Bias reveals dozens of other biases that can distort research outcomes.

143 of 165

Further, confirmation bias is:

A cognitive phenomenon (& bias) that

stacks with other biases &

cognitive phenomena to

create even more rigor problems.

144 of 165

And it’s not the only bias to avoid.

Confirmation bias can compound with other distortions, like:

The bandwagon effect.

Cognitive dissonance.

Both of which can:

Compound the effects of confirmation bias.

Create errors that impair research quality.

Result in systematically inaccurate decision-making.

145 of 165

Let’s look at examples for these new biases.

146 of 165

Example 1: the bandwagon effect

Recall the misinterpretation of neuropsychologist Roger Sperry’s work that gave rise to a “favored” explanation for personality traits aligned to the left vs right brain hemispheres.

The willingness of so many others to ignore conflicting findings is also indicative of a tendency to align with the thinking of others.

147 of 165

Example 2: the bandwagon effect

1951 Swarthmore experiment:

Solomon Asch ran several experiments with groups of 8 participants made up of 1 subject and 7 actors.

Participants were asked to identify which of various comparison lines matched a reference line.

They played 15 rounds where actors gave obviously wrong answers for 12 out of 15 rounds.

148 of 165

In control trials, with no pressure to conform to actors, the error rate on identifying the right line was less than 0.7%.

In test trials, 74% of participants gave at least one incorrect answer out of the 12 critical trials.

35.7% conformed to the larger group’s incorrect responses for a majority of their answers.

12% of participants followed the group in nearly all of the tests.

Results: Swarthmore experiment

149 of 165

What happened?

Of the participants who followed the group at some point, most stated afterward that they knew the group was wrong, but didn’t want to go against them.

Participants who conformed to the majority on at least 50% of trials reported experiencing what Asch termed a “distortion of perception”:

They started to really believe that they themselves were wrong!

150 of 165

This is conformity bias

Also known as the bandwagon effect, it is:

The tendency to change opinion or behavior to align with the majority.

Occurs even when we believe the group is clearly wrong.

151 of 165

Example 3: cognitive dissonance

Data from a survey on attitudes toward data sharing & pre-registration in the social sciences shows a significant belief-behavior gap.

While engagement in open science nearly doubled between 2010 and 2020, the strong favorable opinion greatly outstrips practice (Ferguson et al. 2023).

[Chart: beliefs vs. behavior for posting data/code and preregistering, comparing the share very to moderately in favor, neutral, or not in favor with the share who have actually done each task.]

152 of 165

Example 4: cognitive dissonance

The 1959 forced-compliance experiments at Stanford

Leon Festinger & James Carlsmith ran several experiments with students tasked with completing boring, repetitive tasks.

After the tasks, students were told to lie to the next group of participants, describing the tasks as fun and exciting, and then rate their experience. (The control group was not asked to speak with the actors.)

Subjects in group 1 were paid $20 to lie.

Subjects in group 2 were paid $1 to lie.

Festinger & Carlsmith’s prediction: Group 2 ($1) would rate their experience higher than Group 1 ($20).

153 of 165

Subjects who were paid $20 rated the mundane tasks slightly more positively than the subjects in the control group.

Subjects who were paid $1 rated the mundane tasks more positively than the subjects who were paid $20 rated the tasks.

Results: Stanford experiment

154 of 165

Why would they do this?

Festinger & Carlsmith concluded that the subjects who were told to lie contorted their experience to offset the contradiction between their experience and their lie.

Subjects paid $20 did not change as much because the larger payment justified lying.

Subjects paid $1 couldn’t justify lying with such a small reward, so they “revised” their experience to decrease their discomfort.

155 of 165

Like confirmation bias, cognitive dissonance is a cognitive phenomenon that:

Causes a mental disturbance when our thoughts and actions conflict.

When this occurs, we try to change either our actions or our values.

The great threat to rigor is when we deceive ourselves about reality to ease this discomfort.

156 of 165

Let’s explore more ways different biases and rigor issues connect!

157 of 165

Activity: map the biases!

158 of 165

Discuss: what connections did you make?

In what ways might rigor issues stack to create greater problems? How could research in your field be improved through an awareness of these rigor issues?

Think

Pair

Share

159 of 165

Bias infiltrates all levels of the research process & burdens decision-making with errors that stack.

Designing

Framing biases steer study design away from rigorous tests of well-specified hypotheses.

Selecting

Sampling & attrition biases lead to unrepresentative study groups, compounding initial biases.

Conducting

Observer & measurement biases skew data collection, further distorting study findings.

Reporting

Publication bias & spin amplify prior errors by highlighting positive results & omitting counter-evidence or limitations.

160 of 165

Bias infiltrates all levels of the research process & burdens decision-making with errors that stack.

Confirmation bias can trigger compounding distortions across the entire research process & balloon into a hydra of other biases.

161 of 165

Bias amounts to systematically inaccurate decision-making.

These decisions insert error at every stage of how science is supposed to produce knowledge.

162 of 165

By seeping into every corner of our scientific work, bias undermines even the best intentions.

163 of 165

Rigor is our antidote.

It systematically reduces error at every stage & serves as our ultimate safeguard against bias.

164 of 165

Better Science Every Day

Confirmation Bias

165 of 165