Planning for Mistakes
Machine Learning in Production
Christian Kaestner & Bogdan Vasilescu, Carnegie Mellon University • Fall 2025
What do you remember from the last lecture?
Recall: The importance of understanding goals and assumptions
More Requirements to Understand Risks
Learning goals:
Readings
Required reading: Kocielnik, Rafal, Saleema Amershi, and Paul N. Bennett. "Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-User Expectations of AI Systems." In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1-14. 2019.
ML Models = Unreliable Components
Models make mistakes
Common excuse: Software mistake -- nobody's fault
Common excuse: The problem is just data
Common excuse: Nobody could have foreseen this...
What responsibility do designers have to anticipate problems?
Designing for Mistakes
Planning for Mistakes
Living with ML mistakes
No model is ever "correct"
Some mistakes are unavoidable
Anticipate the eventual mistake
ML model = unreliable component
Many different strategies
Based on fault-tolerant design, assuming that there will be software/ML mistakes or environment changes violating assumptions
We will cover today: human in the loop, undoable actions, guardrails, mistake detection and recovery, and containment and isolation
Designing for Mistakes Strategy: Human in the Loop
Today's Running Example: Autonomous Train
Human-AI Interaction Design (Human in the Loop)
Recall:
Human in the Loop
Human in the Loop - Examples
Human in the Loop - Examples
Fall detection / crash detection with smartwatch
From the reading…
Human in the Loop - Examples?
Designing for Mistakes Strategy: Undoable Actions
Undoable actions
Examples?
Undoable actions - Examples
Undoable actions - Examples?
Designing for Mistakes Strategy: Guardrails
Guardrails
Ensure safe operating parameters despite wrong model predictions, without having to detect mistakes
Traditionally symbolic guardrails; today often another model to increase reliability
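A minimal sketch of the idea in Python (names and the bound are illustrative, not from the lecture): whatever toasting time a smart-toaster model predicts, a hard limit enforced outside the model keeps operation safe, so no mistake detection is needed.

MAX_TOAST_SECONDS = 240  # hard safety bound chosen by engineers, not learned

def safe_toast_time(predicted_seconds: float) -> float:
    # Guardrail: clamp any model prediction into the safe operating range;
    # the bound holds for every prediction, even a wildly wrong one
    return min(max(predicted_seconds, 0.0), MAX_TOAST_SECONDS)

print(safe_toast_time(900.0))  # wildly wrong prediction -> 240.0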
Guardrails: Bollards
https://twitter.com/WorldBollard/status/1542959589276192770
Guardrails - Examples
Recall: Thermal fuse in smart toaster
Guardrails - Examples?
Guardrails - Examples
Designing for Mistakes Strategy: Detection and Recovery
Mistake detection and recovery
Design a recovery mechanism if mistakes are detectable, directly or indirectly
Requires (1) a detection mechanism (e.g., external monitor, redundancy) and (2) a response
Mistake detection
An independent mechanism to detect problems (in the real world)
Example: Gyrosensor to detect a train taking a turn too fast
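A minimal sketch of such a detector in Python (threshold and names are illustrative): the monitor relies only on an independent gyrosensor reading, not on the ML controller's own outputs, and pairs detection with a response.

MAX_LATERAL_G = 0.3  # illustrative limit; a real bound comes from domain experts

def monitor_turn(gyro_lateral_g: float) -> str:
    # Detection: independent sensor signal, separate from the ML controller
    if abs(gyro_lateral_g) > MAX_LATERAL_G:
        return "trigger_emergency_brake"  # Response: recovery action
    return "ok"

print(monitor_turn(0.45))  # -> trigger_emergency_brake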
Mistake detection -- many strategies
Examples in autonomous train scenario?
Doer-Checker Example: AV
ML-based controller (doer): Generate commands to steer the vehicle
Safety controller (checker): Checks commands from ML controller; overrides it with a safe default command if the ML action is risky
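A sketch of the pattern in Python (bounds and names are illustrative): the checker is deliberately simple and symbolic, so it stays trustworthy even when the ML doer is not.

MAX_STEERING_ANGLE = 15.0  # degrees; conservative bound for the checker
SAFE_DEFAULT = 0.0         # keep the wheel straight

def checked_steering(ml_command_degrees: float) -> float:
    # Checker: veto risky commands from the ML controller (doer)
    if abs(ml_command_degrees) > MAX_STEERING_ANGLE:
        return SAFE_DEFAULT  # override with a safe default command
    return ml_command_degrees

print(checked_steering(40.0))  # risky ML output -> 0.0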
Graceful Degradation (Fail-safe)
Goal: When a component failure is detected, achieve system safety by reducing functionality and performance
Switches operating mode when failure detected (e.g., slower, conservative)
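A sketch of the mode switch in Python (speeds are illustrative): on a detected vision failure, the train keeps operating, but slower and more conservatively, rather than shutting down.

NORMAL_SPEED_KMH = 80
DEGRADED_SPEED_KMH = 25  # conservative fallback mode

def target_speed(vision_ok: bool) -> int:
    # Fail-safe: reduced functionality/performance instead of a full stop
    return NORMAL_SPEED_KMH if vision_ok else DEGRADED_SPEED_KMH

print(target_speed(vision_ok=False))  # -> 25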
Designing for Mistakes Strategy: Redundancy
Redundancy
Useful for problem detection and response
Challenge: Software + models are rarely truly independent
Redundancy Example: Sensor Fusion
Combine data from a wide range of sensors
Provides partial information even when some sensor is faulty
A critical part of modern self-driving vehicles
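A minimal fusion sketch in Python (sensor names are illustrative): a median vote over redundant distance estimates tolerates one faulty reading without even identifying which sensor failed.

from statistics import median

def fused_distance(lidar_m: float, radar_m: float, camera_m: float) -> float:
    # Redundancy: any single outlier is outvoted by the other two sensors
    return median([lidar_m, radar_m, camera_m])

print(fused_distance(12.1, 11.9, 250.0))  # faulty camera outvoted -> 12.1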
Designing for Mistakes Strategy: Containment and Isolation
Containment: Decoupling & Isolation
Design principle: Faults in low-criticality (LC) components should not impact high-criticality (HC) components
Example: Do not connect fly-by-wire software with the plane's entertainment system
Example in autonomous train?
Poor Decoupling: USS Yorktown (1997)
Invalid data entered into DB; divide-by-zero crashes entire network
Required rebooting the whole system; ship dead in water for 3h
Lesson: Handle expected component faults; prevent propagation
Recall: A Secure (but Less Useful?) Version
def analyze_email(email):
    # Neural: the model only rates sentiment; its output is strictly validated
    prompt = f"Rate sentiment: 1=positive, 0=neutral, -1=negative\n{email}"
    response = ai_model.generate(prompt)
    return int(response.strip()) if response.strip() in ['-1', '0', '1'] else 0

def generate_report(email_batch):
    # Symbolic: aggregation and report formatting never touch the model
    emails = split_emails(email_batch)
    scores = [analyze_email(email) for email in emails]
    positive = scores.count(1)
    negative = scores.count(-1)
    total = len(scores)
    return f"Sentiment Report: {positive}/{total} positive, {negative}/{total} negative"
Note the clear separation between symbolic and neural reasoning
Containment in AI Agents?
def react_agent(query):
    context = llm(f"Think: How to answer '{query}'?")
    while not is_final_answer(context):
        # The LLM picks the next action; tool calls and control flow are
        # driven by unvalidated model output -- no criticality boundary
        action = llm(f"Context: {context}\nAction:")
        if "get_email" in action:
            result = get_email(extract_date_range(action))
        elif "send_email" in action:
            result = send_email(extract_to_body(action))
        context = llm(f"Previous: {context}\n"
                      f"Action: {action}\nResult: {result}\nContext:")
    return llm(f"Final answer based on: {context}")
Poor Decoupling: Automotive Security
Containment: Decoupling & Isolation
Design principle: Faults in low-criticality (LC) components should not impact high-criticality (HC) components
Apply the principle of least privilege: LC components should have minimal necessary access
Limit interactions across criticality boundaries: Deploy LC & HC components on different networks; add monitors/checks at interfaces
Is an ML component in my system performing an LC or HC task? If HC, can we "demote" it to LC? Alternatively, if possible, replace/augment HC ML components with non-ML ones (see the sketch below)
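A sketch of least privilege for the email agent below (helper names are hypothetical): the low-criticality path gets read-only tools; anything that acts on the world must cross an explicit, checked boundary.

READ_ONLY_TOOLS = {"get_email"}    # allowed for the LC agent
PRIVILEGED_TOOLS = {"send_email"}  # HC action: requires explicit approval

def dispatch(action: str, human_approved: bool = False) -> str:
    # Monitor/check at the criticality boundary
    tool = action.split("(")[0]
    if tool in READ_ONLY_TOOLS:
        return f"executed {tool}"
    if tool in PRIVILEGED_TOOLS and human_approved:
        return f"executed {tool} after approval"
    raise PermissionError(f"{tool} blocked at criticality boundary")

print(dispatch("get_email(last week)"))  # allowed
# dispatch("send_email(...)") raises PermissionError unless approved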
Simplified Agent Implementation
def react_agent(query):
    context = llm(f"Think: How to answer '{query}'?")
    while not is_final_answer(context):
        action = llm(f"Context: {context}\nAction:")
        if "get_email" in action:
            result = get_email(extract_date_range(action))
        elif "send_email" in action:
            result = send_email(extract_to_body(action))
        context = llm(f"Previous: {context}\n"
                      f"Action: {action}\nResult: {result}\nContext:")
    return llm(f"Final answer based on: {context}")
Design Strategies Summary
Human in the loop
Undoable actions
Guardrails
Mistake detection and recovery (monitoring, doer-checker, fail-over, redundancy)
Containment and isolation
Breakout
What harms from ML mistakes are possible and what design strategies would you consider to mitigate them?
Consider: Human in the loop, Undoable actions, Guardrails, Mistake detection and recovery (monitoring, doer-checker, fail-over, redundancy), Containment and isolation
As a group, post #lecture and tag all group members.
Hazard Analysis: Anticipating and Analyzing Risks
What's the worst that could happen?
What to mitigate?
Recall:
We can reduce/eliminate many risks, but not for free
We can only mitigate risks we know
Wait for problems to occur or be proactive?
Hazard Analysis for Risk Identification
Proactively identifying potential problems before they occur
Traditional safety engineering techniques
Essentially “structured brainstorming”
Resulting risks are subsequently analyzed and possibly mitigated
STPA
STPA (system-theoretic process analysis)
Identify stakeholders, including indirect ones
Identify their stakes, values, goals
For each value explore corresponding loss (e.g., loss of life, injury, damage, loss of mission, loss of customer satisfaction, financial loss, environmental loss, information leakage)
For each loss, identify requirement to prevent it
Identify possible reasons for violating requirement
…
See also Leveson, Nancy G. Engineering a safer world: Systems thinking applied to safety. The MIT Press, 2016.
STPA Handbook https://psas.scripts.mit.edu/home/materials/
We only use the initial steps of STPA for hazard identification here.
STPA Example
STPA Advice
Explore stakeholders broadly, both direct (e.g., train passengers, operators) and indirect (e.g., people living near tracks, city government) – often many
Understand what they care about, see user goals (e.g., fast travel times, safety)
Explore possible losses broadly, small and large, including financial losses, injuries, and mental stress (e.g., late for work, injured in the train doors)
Translation to requirements often straightforward (e.g., leave on time, do not trap passengers in doors)
Be comprehensive but focus on more severe problems
Example: Hazard Analysis for Trail Recommendation
Stakeholders: end users, app developers, API providers, trail management organizations, local businesses.
Hazard Analysis as Structured Brainstorming
Lots of paperwork? Tedious?
Reliable?
LLM assistance?
Other Risk Identification Strategies
Brainstorm worst case scenarios and their causes (from the perspective of different stakeholders)
Read about common risks in domains (e.g., web security risks, accounting) and accidents/failures in competing projects
Expert opinions
Early warning indicators, incident analysis, near-miss reporting
Risk Analysis
For each risk judge severity and likelihood to prioritize
Focus on high severity or high frequency issues first
Involve more people in the conversation, plan next steps
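A minimal prioritization sketch in Python (scales and entries are illustrative, not from the lecture): score each risk by severity × likelihood and discuss the highest-scoring ones first.

risks = [  # (risk, severity 1-5, likelihood 1-5) -- hypothetical judgments
    ("person trapped in train door", 5, 2),
    ("train departs late", 2, 4),
    ("wrong arrival announcement", 1, 3),
]

for name, severity, likelihood in sorted(
        risks, key=lambda r: r[1] * r[2], reverse=True):
    print(f"{severity * likelihood:>2}  {name}")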
Fault Tree Analysis
Analyzing Possible Causes of Loss
Risk identification with hazard analysis tells us what to avoid
Next: What can go wrong to lead to the loss? How can we prevent it?
Fault Tree Analysis (FTA)
Fault tree: A diagram that displays relationships between a system failure (i.e., requirement violation) and potential causes.
Often used for safety & reliability, but can also be used for other types of requirements (e.g., poor performance, security attacks...)
Fault Tree Analysis & ML
Fault Trees: Basic Building Blocks
Event: An occurrence of a fault or an undesirable action
Gate: Logical relationship between an event & its immediate subevents
Fault Tree Example
Every tree begins with a TOP event (typically a violation of a requirement)
Every branch of the tree must terminate with a basic event
Analysis: What can we do with fault trees?
Minimal Cut Set Analysis
Cut set: A set of basic events whose simultaneous occurrence is sufficient to guarantee that the TOP event occurs.
Minimal cut set: A cut set from which a smaller cut set can't be obtained by removing a basic event.
What are minimal cut sets here?
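Since the tree from the slide isn't reproduced here, a small Python sketch for a hypothetical tree TOP = OR(A, AND(B, C)) shows how cut sets fall out of the gate structure.

from itertools import product

def cut_sets(node):
    # Returns sets of basic events whose joint occurrence triggers the node
    kind = node[0]
    if kind == "basic":
        return [{node[1]}]
    children = [cut_sets(child) for child in node[1]]
    if kind == "or":   # any one child's cut set suffices
        return [cs for child in children for cs in child]
    if kind == "and":  # need one cut set from every child simultaneously
        return [set().union(*combo) for combo in product(*children)]

top = ("or", [("basic", "A"), ("and", [("basic", "B"), ("basic", "C")])])
print(cut_sets(top))  # [{'A'}, {'B', 'C'}] -- both already minimal here
# A full algorithm would additionally drop any cut set that is a
# superset of another to obtain the minimal cut sets.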
Failure Probability Analysis
To compute the probability of the top event, propagate the probabilities of the basic events up through the gates: assuming independent events, multiply at an AND gate (P = p1 · p2 · …) and combine at an OR gate (P = 1 − (1 − p1)(1 − p2) · …).
In this class, we won't ask you to do this.
FTA Process
Example: Autonomous Train
Modern ML-powered vision system to efficiently and safely close doors before departure
Using a fault tree, identify possible problems that could lead to trapping a person in the door.
Probably unavoidable, but can increase reliability
Necessary risk remaining?
Reduce with legal threats?
Mitigation so that vision failure alone does not cause violation (add AND event)
Terrible design, have safe default
Single point of failure?
Eliminate software crash as possible cause (remove event)
Add redundancy to increase reliability (add AND event)
One more example: FTA for Lane Assist
REQ: The vehicle must be prevented from veering off the lane.
SPEC: Lane detector accurately identifies lane markings in the input image; the controller generates correct steering commands
ASM: Sensors are providing accurate information about the lane; the driver responds when given a warning; the steering wheel is functional
Possible mitigations?
FTA: Caveats
In general, building a complete tree is impossible
Domain knowledge is crucial for improving coverage
FTA is still very valuable for risk reduction!
Breakout: Fault Tree
REQ: The generated music featured on the front page should not contain lyrics denigrating minorities
As a group, draw a fault tree for a violation of this requirement. Use pen & paper or any software, then post a photo or screenshot to #lecture, tagging all members.
Aside: STPA View
Most losses are caused by complex issues, not just single mistakes
Consider the entire system and its control structure, including humans and their training/oversight
Systematically evaluate whether the controls are effective
Zooming Out: General Safety Engineering Strategies
Identify possible hazards from stakeholder goals (STPA, early steps)
Identify possible hazards from component failures (FMEA, HAZOP)
Forward: from cause to hazard
Analyze causes of anticipated/known hazards (FTA)
Backward: from hazard to cause
Analyze effectiveness of control mechanisms for anticipated/known hazards, including non-technical controls (STPA)
Bonus Slides: FMEA
Failure Mode and Effects Analysis (FMEA)
Forward search from possible root causes to hazards
Does not assume the hazards are known (as FTA requires)
Consider component failures (SPEC violations) and failed assumptions (ASM violations) as possible causes
Widely used in aeronautics, automotive, healthcare, food services, semiconductor processing, and (to some extent) software
FMEA Process
(a) Identify system components
(b) Enumerate potential failure modes for each component
(c) For each failure mode, identify its potential effects on the system, its severity and likelihood, and possible detection and mitigation mechanisms (see the sketch below)
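A common FMEA practice (not required in this class) is to rank failure modes by a risk priority number, RPN = severity × occurrence × detection. A minimal sketch in Python with hypothetical entries:

failure_modes = [
    # (component, failure mode, severity, occurrence, detection), 1-10 scales
    ("door vision", "misses person in doorway", 9, 4, 5),
    ("door vision", "false obstacle detected", 3, 6, 3),
    ("door motor", "fails to reopen", 8, 2, 4),
]

for comp, mode, sev, occ, det in sorted(
        failure_modes, key=lambda m: m[2] * m[3] * m[4], reverse=True):
    print(f"RPN {sev * occ * det:>3}  {comp}: {mode}")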
FMEA Example: Autonomous Train Doors
Failure modes? Failure effects? Detection? Mitigation?
FMEA Example Excerpt: Autonomous Car
"Wrong Prediction" as Failure Mode?
"Wrong prediction" is a coarse grained failure mode of every model
May not be possible to decompose further
However, may evaluate causes of wrong prediction for better understanding, as far as possible (FTA could be used for this)
FMEA Summary
Forward analysis: From components to possible failures
Focuses on single component failures; does not consider interactions
Identifying failure modes may require domain understanding
Bonus Slides: HAZOP
Hazard and Operability Study (HAZOP)
Identify hazards and component fault scenarios through guided inspection of requirements
Hazard and Operability Study (HAZOP)
A forward search method to identify potential hazards from component failures (and assumption violations)
For each component, use a set of guide words to generate possible deviations from expected behavior
Consider the impact of each generated deviation: Can it result in a system-level hazard?
HAZOP Example: Emergency Braking (EB)
Specification: EB must apply a maximum braking command to the engine.
HAZOP & ML
In addition to traditional analysis: Analyze possible mistakes of all ML components
Original guidewords: NO OR NOT, MORE, LESS, AS WELL AS, PART OF, REVERSE, OTHER THAN / INSTEAD, EARLY, LATE, BEFORE, AFTER
Additional ML-specific guidewords: WRONG, INVALID, INCOMPLETE, PERTURBED, and INCAPABLE.
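A sketch of the guided inspection in Python (component and behavior names are illustrative): pairing each guideword with a component's expected behavior yields the systematic "what if" questions engineers then answer.

GUIDEWORDS = ["NO OR NOT", "MORE", "LESS", "AS WELL AS", "PART OF",
              "REVERSE", "OTHER THAN / INSTEAD", "EARLY", "LATE",
              "BEFORE", "AFTER", "WRONG", "INVALID", "INCOMPLETE",
              "PERTURBED", "INCAPABLE"]

def deviation_prompts(component: str, behavior: str):
    # One prompt per guideword; engineers judge which deviations are
    # plausible and whether they lead to a system-level hazard
    return [f"{component} / {gw}: can '{behavior}' deviate this way, "
            f"and would it cause a system-level hazard?"
            for gw in GUIDEWORDS]

for prompt in deviation_prompts("door vision model", "obstacle detection"):
    print(prompt)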
Breakout: Automated Train Doors
Analyze the vision component to detect obstacles in train doors
NO OR NOT, MORE, LESS, AS WELL AS, PART OF, REVERSE, OTHER THAN / INSTEAD, EARLY, LATE, BEFORE, AFTER, WRONG, INVALID, INCOMPLETE, PERTURBED, and INCAPABLE.
Using HAZOP, answer as a group in #lecture, tagging all group members:
HAZOP: Benefits & Limitations
Easy to use; encourages systematic reasoning about component faults
Can be combined with STPA/FMEA to generate faults (i.e., basic events in FTA)
Potentially labor-intensive; relies on engineers' judgment
Does not guarantee finding all hazards (but this is also true of other techniques)
Remarks: Hazard Analysis
None of these methods guarantee completeness
Intended as structured approaches to thinking about failures
Summary
Accept that failures are inevitable
Design strategies for mitigating mistakes
Use risk analysis to identify and mitigate potential problems
Further readings
🕮 Google PAIR. People + AI Guidebook. 2019, especially chapters “Errors + Graceful Failure” and “Mental Models.”
🗎 Martelaro, Nikolas, Carol J. Smith, and Tamara Zilovic. “Exploring Opportunities in Usable Hazard Analysis Processes for AI Engineering.” In AAAI Spring Symposium Series Workshop on AI Engineering: Creating Scalable, Human-Centered and Robust AI Systems (2022).
🗎 Qi, Yi, Philippa Ryan Conmy, Wei Huang, Xingyu Zhao, and Xiaowei Huang. “A Hierarchical HAZOP-Like Safety Analysis for Learning-Enabled Systems.” In AISafety2022 Workshop at IJCAI2022 (2022).
🗎 Beachum, David Robert. “Methods for assessing the safety of autonomous vehicles.” MSc thesis, 2019.
🗎 Amershi, Saleema, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh et al. “Guidelines for human-AI interaction.” In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019.
🗎 Shneiderman, Ben. “Bridging the gap between ethics and practice: Guidelines for reliable, safe, and trustworthy Human-Centered AI systems.” ACM Transactions on Interactive Intelligent Systems (TiiS) 10, no. 4 (2020): 1–31.
🗎 Rismani, Shalaleh, Renee Shelby, Andrew Smart, Edgar Jatho, Joshua Kroll, AJung Moon, and Negar Rostamzadeh. "From plane crashes to algorithmic harm: applicability of safety engineering frameworks for responsible ML." In Proceedings CHI, pp. 1-18. 2023.
🗎 Hong, Yining, Christopher S. Timperley, and Christian Kästner. "From Hazard Identification to Controller Design: Proactive and LLM-Supported Safety Engineering for ML-Powered Systems." In Proc. CAIN, pp. 113-118. IEEE, 2025.
🕮 Leveson, Nancy G. Engineering a safer world: Systems thinking applied to safety. The MIT Press, 2016.
🗎 STPA Handbook https://psas.scripts.mit.edu/home/materials/