1 of 24

The Good, the Bad, the Challenging: Interrater Reliability

Christine Thomas, PhD, RN, CHSE-A

Vivian Bowman, MSN, RN, PCCN, CNE

Jennifer Wendel, MEd, MSN, RN, CHSE

2 of 24

Objectives

  • Discuss the purpose and value of establishing interrater reliability
  • Examine biases that impact the consistency and objectivity of the rater’s assessment
  • Appraise the process used to establish interrater reliability

3 of 24

Why IRR is Essential

  • NLN Jeffries Simulation Theory
  • Facilitator actions guide learning
  • Strategies impact engagement and performance
  • Bias can affect how we judge students
  • IRR = consistent, fair scoring
  • IRR reduces bias and improves validity
  • Critical for grading, progression, and readiness

4 of 24

The Importance of IRR

  • Promotes consistency and fairness in evaluation
  • Reduces bias, ensures equitable assessment
  • Strengthens validity of rubrics and tools
  • Supports competency-based education (CBE)
  • Essential for high-stakes decisions
  • Ensures objective evaluation
  • Improves clarity and quality of student feedback
  • Fosters growth in clinical judgment and skill

5 of 24

IRR and Standards of Best Practice

  • Evaluators must be trained in simulation assessment
  • Use valid, reliable tools aligned with best practices
  • Consistent scoring across evaluators is essential
  • Multiple raters recommended
  • Programs must validate tools and demonstrate IRR
  • IRR protects program credibility and decision integrity
  • Ensures fair, bias-free, defensible evaluations
  • Reflects commitment to excellence and student fairness

6 of 24

The Impact of IRR on Students

  • Builds trust in the learning environment
  • Reduces perceived evaluator bias
  • Clarifies expectations for clinical competence
  • Promotes resilience through meaningful feedback
  • Inconsistent evaluation can erode confidence and learning

7 of 24

Factors that impact IRR

    • Interrater factors
      • Consistency between raters
      • Veering away from objectives
      • Interpretation of descriptions/items
      • Content knowledge
      • Systematic bias (dove vs. hawk)
      • Rater bias (irrelevant characteristics)
        • Gender, race, prior performance of the student

8 of 24

Factors that impact IRR

    • Consistency within the rater
      • Internal factors
        • Mood, fatigue, illness
      • External factors
        • Order of evaluation, time of day, noise, temperature
      • Rater drift
        • Leniency/severity drift
        • Interpretation drift
        • Fatigue drift


9 of 24

Interrater Reliability vs. Instrument Validity

  • Interrater reliability refers to consistency between raters
  • Instrument validity refers to whether the instrument measures what it is supposed to measure
  • The instrument must be used consistently between raters
    • Face validity
    • Delineate different rating options
    • Training is crucial

10 of 24

Establishing IRR

Training is best done with all evaluators as a group to reach consensus and enable calibration

Ideally, raters observe performances at all levels of proficiency and identify the behaviors characteristic of each level

Consensus approach is most common (percent agreement or Cohen’s kappa statistic)
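For reference, Cohen’s kappa corrects observed agreement for the agreement expected by chance alone (McHugh, 2012):

```latex
\kappa = \frac{p_o - p_e}{1 - p_e}
```

where p_o is the observed proportion of agreement and p_e is the proportion expected by chance. For example, if two raters agree on 80% of items (p_o = 0.80) and chance alone would produce 50% agreement (p_e = 0.50), then kappa = (0.80 - 0.50)/(1 - 0.50) = 0.60.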

11 of 24

Simulation Rubrics

  • Simulation rubrics require interrater reliability training
  • Creighton Competency Evaluation Instrument (CCEI):
    • Measures competencies (communication, assessment, critical thinking, & skills)
    • Defining qualifiers for each case improves IRR
  • Lasater Clinical Judgment Rubric (LCJR):
    • Assesses clinical judgment development
    • Requires faculty calibration for consistent scoring

12 of 24

Lasater Clinical Judgment Rubric

13 of 24

Lasater Clinical Judgment Rubric for Today

14 of 24

Let’s Practice

  • Watch the video
  • Rate the student’s performance
  • Note your rationale for each rating

15 of 24

Video #1

16 of 24

Interrater Reliability Calculations

  • Cohen’s kappa (κ)
  • Intraclass correlation coefficient (ICC)
  • Percent agreement
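A minimal sketch (not from the presentation) of how percent agreement and Cohen’s kappa can be computed for a pair of raters; the scores below are hypothetical, and ICC is noted only in a comment because it usually requires a dedicated statistics package.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Proportion of items on which the two raters gave the same score."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)  # observed agreement
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters independently assign the
    # same score, given each rater's own base rates.
    p_e = sum(counts_a[s] * counts_b[s] for s in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical rubric scores (1-3) from two raters across seven items
rater_1 = [3, 3, 1, 2, 3, 3, 2]
rater_2 = [2, 1, 1, 2, 3, 2, 2]

print(percent_agreement(rater_1, rater_2))  # 0.571... (4 of 7 items match)
print(cohens_kappa(rater_1, rater_2))       # ≈ 0.40
# An ICC (suited to ordinal/continuous scores and more than 2 raters) is
# usually computed with a statistics package rather than by hand.
```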

17 of 24

18 of 24

 

 

Example: Percent Agreement Worksheet

Each cell shows score / agree flag (1 = agree, 0 = disagree).

NOTICING: FO = Focused Observation, RD = Recognizing Deviations from Expected Patterns, IS = Information Seeking
RESPONDING: CM = Calm, Confident Manner, CC = Clear Communication, WP = Well-Planned Intervention/Flexibility, BS = Being Skillful

Rater      FO    RD    IS    CM    CC    WP    BS
Chris      3/0   3/0   1/1   2/1   3/1   3/0   2/1
Vivian     1/0   1/1   1/1   1/0   3/1   1/0   1/0
Jen        2/1   2/0   1/1   2/1   3/1   1/0   2/1
Carrie     2/1   1/1   3/0   2/1   3/1   2/1   2/1
Emily      2/1   3/0   2/0   2/1   3/1   2/1   2/1
Item %     0.6   0.4   0.6   0.8   1.0   0.4   0.8

Overall percentage: 0.6
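The worksheet arithmetic can be reproduced directly from the agree flags. A minimal Python sketch, with the flags transcribed from the table above (the exact mean of all 35 flags is ≈0.657, which the worksheet displays as 0.6):

```python
# Agree flags (1/0) transcribed from the worksheet above, one list per item,
# in rater order: Chris, Vivian, Jen, Carrie, Emily.
agree_flags = {
    "Focused Observation":                   [0, 0, 1, 1, 1],
    "Recognizing Deviations":                [0, 1, 0, 1, 0],
    "Information Seeking":                   [1, 1, 1, 0, 0],
    "Calm, Confident Manner":                [1, 0, 1, 1, 1],
    "Clear Communication":                   [1, 1, 1, 1, 1],
    "Well-Planned Intervention/Flexibility": [0, 0, 0, 1, 1],
    "Being Skillful":                        [1, 0, 1, 1, 1],
}

# Item agreement percentage = share of raters whose flag is 1 for that item
for item, flags in agree_flags.items():
    print(f"{item}: {sum(flags) / len(flags):.1f}")

# Overall agreement across all 35 rater-item flags
all_flags = [f for flags in agree_flags.values() for f in flags]
print(f"overall: {sum(all_flags) / len(all_flags):.3f}")  # 0.657
```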

 

19 of 24

Calculate Percent Agreement

Discussion

What bias do you have?

Can you come to a consensus or agreement?

20 of 24

Video #2

21 of 24

Recalculate Percent Agreement

22 of 24

Maintaining IRR

  • Regular calibration and refresher training
  • Share feedback regarding use of tool and process
  • Clear and detailed rubrics
  • Monitor rater performance
  • Encourage open communication about uncertainties or challenges in scoring

One-time training is not enough; IRR must be maintained over time

23 of 24

Thank You

Christine Thomas, PhD, RN, CHSE-A christine.thomas@gwu.edu

Vivian Bowman, MSN, RN, PCCN, CNE vivian.bowman@gwu.edu

Jennifer Wendel, MEd, MSN, RN, CHSE jennifer.wendel@gwu.edu

24 of 24

References

Adamson, K. A., & Kardong-Edgren, S. (2012). A method and resources for assessing the reliability of simulation evaluation instruments. Nursing Education Perspectives, 33(5), 334–339.

Burns, M. K. (2014). How to establish interrater reliability. Nursing, 44(10), 56–57. https://doi.org/10.1097/01.NURSE.0000453705.41413.c6

Haerling, K. A. (2021). Simulation evaluation. In P. R. Jeffries (Ed.), Simulation in nursing education: From conceptualization to evaluation (pp. 83–99). Wolters Kluwer.

Lasater, K. (2007). Clinical judgment development: Using simulation to create a rubric. Journal of Nursing Education, 46(11), 496–503.

McHugh, M. L. (2012). Interrater reliability: The kappa statistic. Biochemia Medica, 22(3), 276–282. https://doi.org/10.11613/BM.2012.031

Oermann, M. H., Kardong-Edgren, S., & Rizzolo, M. A. (2016). Towards an evidence-based methodology for high-stakes evaluation of nursing students’ clinical performance using simulation. Teaching and Learning in Nursing, 11(4), 133–137. https://doi.org/10.1016/j.teln.2016.04.004

Polit, D. F., & Beck, C. T. (2021). Nursing research: Generating and assessing evidence for nursing practice (11th ed.). Wolters Kluwer.

Yudkowsky, R., Park, Y. S., & Downing, S. M. (Eds.). (2020). Assessment in health professions education (2nd ed.). Routledge.