A filled-in example of this template is provided at the end of this document.
For the 7 Rules of Causation, visit https://delbius.com/rules-of-causation/.
For all related RCA resources, visit https://delbius.com/resources#rca-guide.
Incident ID/Title | |
Date of Incident | |
RCA Team | |
Date of RCA | |
Briefly describe the event or issue that occurred. Indicate what happened vs what should have happened. When and where did it happen?
[Start write-up here]
What immediate steps were taken to address or remediate the issue?
[Start write-up here]
Action | Owner | Target Date | Related Links |
[Remediate] Action 1 statement. | Owner 1 | [date 1] | Link 1 |
State the key causes using simple causal statements, taking care to note the nature of the cause: policy, system, human error, or process failure. (See also: 7 Rules of Causation.)
[Start write-up here]
Create an action plan, assign owners for each action, and set a timeline. Identify immediate remediation actions and actions that address the root causes and contributing factors (RC/CF).
These actions have been assigned with the relevant task tickets created/linked:
Action | Owner | Target Date | Related Links |
[Address RC/CF] Action statement. | Owner 1 | [date 1] | Link 1 |
What metrics will you use to confirm that the action(s) worked?
[Start write-up here]
Who will collect and report on these metrics?
[Start write-up here]
Action | Owner | Target Date | Related Links |
[Measure] Measurement Action 1. | Owner 1 | [date 1] | Link 1 |
Decide how long the impact of the action will be tracked.
[Start write-up here]
Did the action resolve the issue? Are further improvements needed? Outline any additional actions if required.
[Start write-up here]
Action | Owner | Target Date | Related Links |
[F/Up] Follow-up Action 1. | Owner 1 | [date 1] | Link 1 |
Text styled as hyperlinks is for illustration purposes only.
Incident ID/Title | [INC-544] Misclassification of the term “slay mother” by moderation tool |
Date of Incident | 2 September 2024 |
RCA Team | Anjali Bahri (T&S Policy), Casey Denton (T&S Internal Tools), Élodie Fournier (T&S Ops), Goichi Hagiwara (T&S Product), Ivria Jackson (T&S Heuristics), Kwanele Lungu (Incident Management) |
Date of RCA | 5 September 2024 |
Briefly describe the event or issue that occurred. Indicate what happened vs what should have happened. When and where did it happen?
On 28 August 2024, the Policy team published OPSUPDATE-037 to add a new set of slang/terms to the Allow List. The policy change was intended to reflect how different marginalized communities have reappropriated these terms.
The policy change was scheduled to take effect on 2 September 2024 at 12:00 PM PT. After that time, however, the auto-classification bots continued to classify user posts containing the slang/terms as violative, when they should have stopped flagging them.
From 12:00 PM PT on 2 September 2024 until the auto-classification bots were paused at 2:23 PM PT on 3 September 2024, a total of 94,673 posts were wrongly flagged for content violations. This affected 43,862 unique users, whose accounts were placed under temporary restrictions.
The issue was first identified on 3 September 2024, when Anjali Bahri (T&S Policy) noticed that users were still complaining about account restrictions tied to these slang/terms despite the planned policy change.
What immediate steps were taken to address or remediate the issue?
Upon discovering the issue on 3 September 2024, the Policy team filed incident ticket INC-544.
The Incident Manager On Call filed a ticket with the Heuristics team (HEUR-934) to request that the auto-classification of posts for the relevant terms be halted pending investigation.
The Heuristics team successfully paused the relevant auto-classification bots on 3 September 2024, at 2:23 PM PT, pending the integration of OPSUPDATE-037.
Integration of OPSUPDATE-037 was completed on 3 September 2024 at 3:45 PM PT (HEUR-935), at which point the bots were unpaused and normal auto-classification operations resumed. The bots were configured to review all posts made since the moment they had been paused.
On 4 September 2024, the Internal Tools team identified the affected users based on audit logs (INTTOOLS-1021) and lifted account restrictions (INTTOOLS-1022) for all accounts after confirming that the accounts had not also been flagged as violative by other mechanisms.
The affected users were notified about the error and subsequent lifting of account restrictions via an in-app notification (INTTOOLS-1023) on 4 September 2024.
Action | Owner | Target Date | Related Links |
[Remediate] Pause auto-classification bots for the slang/terms in OPSUPDATE-037. | Ivria Jackson, Heuristics | 3 Sept 2024 | HEUR-934 |
[Remediate] Update the moderation tool’s algorithm and content library to include the latest slang and contextual references in OPSUPDATE-037. | Ivria Jackson, Heuristics | 3 Sept 2024 | HEUR-935 |
[Remediate] Identify users affected by INC-544. | Casey Denton, Internal Tools | 4 Sept 2024 | INTTOOLS-1021 |
[Remediate] Lift account restrictions for users identified in INTTOOLS-1021 after confirming they had not been flagged as violative by other mechanisms. | Casey Denton, Internal Tools | 4 Sept 2024 | INTTOOLS-1022 |
[Remediate] Send “mea culpa” in-app notification to users identified in INTTOOLS-1021 that their account restrictions have been lifted. | Casey Denton, Internal Tools | 4 Sept 2024 | INTTOOLS-1023 |
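The filtering step described for INTTOOLS-1021 and INTTOOLS-1022 (lift restrictions only for accounts that were not also flagged by other mechanisms) can be sketched as a set difference. This is a minimal illustration, not the Internal Tools team's actual implementation: the FlagEvent record, its field names, and the sample user IDs are all hypothetical, and the real audit log format is not described in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlagEvent:
    """Hypothetical audit-log record; field names are assumptions."""
    user_id: str
    mechanism: str          # e.g. "auto_classifier" or "manual_review"
    incident_related: bool  # True if the flag stems from the incident under review


def users_safe_to_unrestrict(events):
    """Return user IDs whose only flags are incident-related."""
    incident_users = set()
    otherwise_flagged = set()
    for event in events:
        if event.incident_related:
            incident_users.add(event.user_id)
        else:
            otherwise_flagged.add(event.user_id)
    # Anyone also flagged by another mechanism keeps their restriction.
    return sorted(incident_users - otherwise_flagged)


sample_events = [
    FlagEvent("u1", "auto_classifier", True),
    FlagEvent("u2", "auto_classifier", True),
    FlagEvent("u2", "manual_review", False),
    FlagEvent("u3", "manual_review", False),
]
print(users_safe_to_unrestrict(sample_events))
```

In this sample, only u1 qualifies: u2 was also flagged by a separate mechanism, and u3 was never affected by the incident.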
State the key causes using simple causal statements, taking care to note the nature of the cause: policy, system, human error, or process failure. (See also: 7 Rules of Causation.)
Process Failures
Create an action plan, assign owners for each action, and set a timeline. Identify immediate remediation actions and actions that address the root causes and contributing factors (RC/CF).
These actions have been assigned with relevant task tickets created/linked.
Action | Owner | Target Date | Related Links |
[Address RC/CF] Add a task to the managers' parental leave checklist requiring them to check for and reassign any on-call tickets when a team member goes on leave. Notify all managers. | Ivria Jackson, Heuristics | In place by | HEUR-945 |
[Address RC/CF] Create a daily exception alert auto-email that will notify a manager if any of their team members who are on leave have an open on-call ticket. | Ivria Jackson, Heuristics | In place by | HEUR-946 |
[Address RC/CF] Update the Policy team’s OPSUPDATE checklist to include a step where the policy ticket owner must confirm that the changes went live correctly. | Anjali Bahri, Policy | In place by | POLICY-719 |
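The daily exception alert in HEUR-946 amounts to a simple rule: for every open on-call ticket whose assignee is on leave, notify that assignee's manager. The sketch below shows one way that rule could be expressed. The Ticket fields, the on_leave set, and the manager_of mapping are hypothetical stand-ins for whatever the ticketing and HR systems actually expose, and sending the email itself is left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    """Hypothetical ticket record; field names are assumptions."""
    ticket_id: str
    assignee: str
    is_on_call: bool
    is_open: bool


def build_exception_alerts(tickets, on_leave, manager_of):
    """Map each manager to the open on-call tickets held by reports who are on leave."""
    alerts = defaultdict(list)
    for ticket in tickets:
        if ticket.is_on_call and ticket.is_open and ticket.assignee in on_leave:
            # Assumes every assignee has an entry in manager_of.
            alerts[manager_of[ticket.assignee]].append(ticket.ticket_id)
    return dict(alerts)


sample_tickets = [
    Ticket("T-1", "alice", True, True),    # open on-call ticket, assignee on leave
    Ticket("T-2", "bob", True, True),      # assignee not on leave
    Ticket("T-3", "alice", True, False),   # already closed
    Ticket("T-4", "alice", False, True),   # not an on-call ticket
]
sample_alerts = build_exception_alerts(
    sample_tickets,
    on_leave={"alice"},
    manager_of={"alice": "mgr-a", "bob": "mgr-b"},
)
print(sample_alerts)
```

An empty result means no alert email needs to go out that day.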
What metrics will you use to confirm that the action(s) worked?
The actions taken will be considered successful if:
Numerator: number of incidents arising from un-reassigned tickets = 0
Denominator: total number of incidents.
Numerator: number of email alerts sent
Denominator: total number of un-reassigned tickets that should have triggered an email alert.
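The two numerator/denominator pairs above can be computed as plain ratios. The sketch below is illustrative only: the function and key names are hypothetical, and the only success criterion taken from this document is that the first numerator equals 0. No target is stated here for the second ratio, so the code reports it without judging it.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    if denominator == 0:
        return None
    return numerator / denominator


def summarize_metrics(incidents_from_unreassigned, total_incidents,
                      alerts_sent, tickets_needing_alert):
    """Compute both ratios plus the one stated success check (numerator = 0)."""
    return {
        "unreassigned_incident_rate": ratio(incidents_from_unreassigned, total_incidents),
        "alert_coverage": ratio(alerts_sent, tickets_needing_alert),
        "zero_unreassigned_incidents": incidents_from_unreassigned == 0,
    }


# Made-up quarter: 12 incidents, none from un-reassigned tickets,
# and 5 alerts sent for 5 tickets that should have triggered one.
sample_summary = summarize_metrics(0, 12, 5, 5)
print(sample_summary)
```

Returning None for a zero denominator keeps a quarter with no incidents, or no tickets requiring an alert, from raising a division error.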
Who will collect and report on these metrics?
The [Measure] actions in the table below will be used to track and monitor the relevant metrics.
The owner of the Quarterly Incident Post-mortem Meeting will be responsible for collecting and presenting the relevant metrics at the meeting.
Action | Owner | Target Date | Related Links |
[Measure] Add “On-Call Ticket Not Reassigned” as a valid value for the Incident Cause field in the incident tracker. Notify all incident managers of this update. | Kwanele Lungu, Incident Management | 10 Sept 2024 | IMOC-841 |
[Measure] Update the template agenda of the Quarterly Incident Post-mortem Meeting to include a review of incidents of this nature. | Kwanele Lungu, Quarterly Post-mortem Meeting Owner | 10 Sept 2024 | IMOC-842 |
Decide how long the impact of the action will be tracked.
The impact of the [Address RC/CF] actions identified in earlier sections will be monitored for one quarter (21 September to 21 December 2024) to evaluate effectiveness.
The effectiveness of the changes will be assessed at the next Quarterly Incident Post-mortem Meeting, currently scheduled for 23 December 2024.
Did the action resolve the issue? Are further improvements needed? Outline any additional actions if required.
If needed, follow-up actions will be identified at the discretion of the Quarterly Incident Post-mortem Meeting owner during the next quarterly meeting.
Action | Owner | Target Date | Related Links |
[Follow-up] Review metrics at the next Quarterly Incident Post-mortem Meeting and identify follow-up actions if needed. | Kwanele Lungu, Quarterly Post-mortem Meeting Owner | 23 Dec 2024 | IMOC-843 |