A filled-in example of this template is provided at the end of this document.

For the 7 Rules of Causation, visit https://delbius.com/rules-of-causation/.

For all related RCA resources, visit https://delbius.com/resources#rca-guide.

A. Incident Info

Incident ID/Title

Date of Incident

RCA Team

Date of RCA

B. What happened?

B.1 Incident Summary

Briefly describe the event or issue that occurred. Indicate what happened vs what should have happened. When and where did it happen?

[Start write-up here]

B.2 Initial Actions

What immediate steps were taken to address or remediate the issue?

[Start write-up here]

| Action | Owner | Target Date | Related Links |
| --- | --- | --- | --- |
| [Remediate] Action 1 statement. | Owner 1 | [date 1] | Link 1 |

C. Why did it happen?

State the key causes using simple causal statements, taking care to note the nature of the cause: policy, system, human error, or process failure. (See also: 7 Rules of Causation.)

[Start write-up here]

D. How can we prevent it from happening again?

Create an action plan, assign owners for each action, and set a timeline. Identify immediate remediation actions and actions that address the root causes and contributing factors (RC/CF).

These actions have been assigned with the relevant task tickets created/linked: 

| Action | Owner | Target Date | Related Links |
| --- | --- | --- | --- |
| [Address RC/CF] Action statement. | Owner 1 | [date 1] | Link 1 |

E. Measuring success

E.1 Set quantifiable metrics

What metrics will you use to confirm that the action(s) worked?

[Start write-up here]

E.2 Define how data will be collected and by whom

Who will collect and report on these metrics?

[Start write-up here]

| Action | Owner | Target Date | Related Links |
| --- | --- | --- | --- |
| [Measure] Measurement Action 1. | Owner 1 | [date 1] | Link 1 |

E.3 Monitor impact

Decide how long the impact of the action will be tracked.

[Start write-up here]

F. Follow-up: Review Outcome

Did the action resolve the issue? Are further improvements needed? Outline any additional actions if required.

[Start write-up here]

| Action | Owner | Target Date | Related Links |
| --- | --- | --- | --- |
| [Follow-up] Follow-up Action 1. | Owner 1 | [date 1] | Link 1 |




FILLED-OUT TEMPLATE EXAMPLE BELOW

Text styled as a hyperlink is for illustration purposes only.

A. Incident Info

Incident ID/Title

[INC-544] Misclassification of the term “slay mother” by moderation tool

Date of Incident

2 September 2024

RCA Team

Anjali Bahri (T&S Policy), Casey Denton (T&S Internal Tools), Élodie Fournier (T&S Ops), Goichi Hagiwara (T&S Product), Ivria Jackson (T&S Heuristics), Kwanele Lungu (Incident Management). 

Date of RCA

5 September 2024

B. What happened?

B.1 Incident Summary

Briefly describe the event or issue that occurred. Indicate what happened vs what should have happened. When and where did it happen?

On 28 August 2024, the Policy team published OPSUPDATE-037 to add a new set of slang/terms to the Allow List. This policy change was intended to reflect how different marginalized communities have reappropriated these terms.

The policy change was scheduled to take effect on 2 September 2024 at 12:00 PM PT. After that time, however, the auto-classification bots continued to classify user posts containing these slang/terms as violative, when they should have treated the terms as allowed.

From 12:00 PM PT on 2 September 2024 until the auto-classification bots were paused, a total of 94,673 posts were wrongly flagged for content violations, affecting 43,862 unique users and resulting in these user accounts being placed under temporary account restrictions.

The issue was first identified on 3 September 2024, when Anjali Bahri (T&S Policy) noticed that user complaints about accounts being restricted because of these slang/terms had continued despite the planned policy change.

B.2 Initial Actions

What immediate steps were taken to address or remediate the issue?

Upon discovering the issue on 3 September 2024, the Policy team filed incident ticket INC-544.

The Incident Manager On Call filed a ticket with the Heuristics team (HEUR-934) to request that the auto-classification of posts for the relevant terms be halted pending investigation.

The Heuristics team successfully paused the relevant auto-classification bots on 3 September 2024, at 2:23 PM PT, pending the integration of OPSUPDATE-037.

Integration of OPSUPDATE-037 was completed on 3 September 2024 at 3:45 PM PT (HEUR-935), at which point the bots were unpaused and normal auto-classification operations resumed. The bots were configured to review all posts made since the moment they had been paused.

On 4 September 2024, the Internal Tools team identified the affected users based on audit logs (INTTOOLS-1021) and lifted account restrictions (INTTOOLS-1022) for all accounts after confirming that the accounts had not also been flagged as violative by other mechanisms.

The affected users were notified about the error and subsequent lifting of account restrictions via an in-app notification (INTTOOLS-1023) on 4 September 2024.

| Action | Owner | Target Date | Related Links |
| --- | --- | --- | --- |
| [Remediate] Pause auto-classification bots for the slang/terms in OPSUPDATE-037. | Ivria Jackson, Heuristics | 3 Sept 2024 | HEUR-934 |
| [Remediate] Update the moderation tool’s algorithm and content library to include the latest slang and contextual references in OPSUPDATE-037. | Ivria Jackson, Heuristics | 3 Sept 2024 | HEUR-935 |
| [Remediate] Identify users affected by INC-544. | Casey Denton, Internal Tools | 4 Sept 2024 | INTTOOLS-1021 |
| [Remediate] Lift account restrictions for users identified in INTTOOLS-1021 after confirming they had not been flagged as violative by other mechanisms. | Casey Denton, Internal Tools | 4 Sept 2024 | INTTOOLS-1022 |
| [Remediate] Send “mea culpa” in-app notification to users identified in INTTOOLS-1021 that their account restrictions have been lifted. | Casey Denton, Internal Tools | 4 Sept 2024 | INTTOOLS-1023 |
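The check described in INTTOOLS-1022 (lift restrictions only for accounts not also flagged by other mechanisms) can be illustrated with a minimal Python sketch. It assumes audit-log entries can be reduced to a user ID and a label for whatever raised the flag; the `FlagEvent` shape, the `source` labels, and the function name are hypothetical and do not describe the actual audit-log schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlagEvent:
    user_id: str
    source: str  # hypothetical label for the mechanism that raised the flag


def split_affected_users(events, incident_source):
    """Return (eligible, held_back) sets of user IDs.

    eligible: flagged only by the incident's mechanism, safe to unrestrict.
    held_back: also flagged by some other mechanism, needs separate review.
    """
    affected = {e.user_id for e in events if e.source == incident_source}
    flagged_elsewhere = {e.user_id for e in events if e.source != incident_source}
    eligible = affected - flagged_elsewhere
    held_back = affected & flagged_elsewhere
    return eligible, held_back
```

Keeping the held-back set explicit, rather than silently dropping those users, gives the team a list to review by hand before any restriction is lifted.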

C. Why did it happen?

State the key causes using simple causal statements, taking care to note the nature of the cause: policy, system, human error, or process failure. (See also: 7 Rules of Causation.)

Process Failures

  • The moderation tool’s algorithm was not updated to handle the new slang and context variations documented in OPSUPDATE-037 because the Heuristics team member who was auto-assigned the ticket based on the on-call rotation had started parental leave earlier than expected.
  • There is no established process to ensure that already-assigned tickets are reassigned when the on-call rotation is retroactively updated. As such, OPSUPDATE-037 was not bundled with other planned updates and did not ship as scheduled.
  • There is also no established process within the Policy team for confirming that planned policy updates shipped as scheduled.

D. How can we prevent it from happening again?

Create an action plan, assign owners for each action, and set a timeline. Identify immediate remediation actions and actions that address the root causes and contributing factors (RC/CF).

These actions have been assigned with relevant task tickets created/linked.

| Action | Owner | Target Date | Related Links |
| --- | --- | --- | --- |
| [Address RC/CF] Add a task to the parental leave checklist for managers to require checking and reassigning of any on-call tickets when a team member goes on leave. Notify all managers. | Ivria Jackson, Heuristics | In place by 10 Sept 2024 | HEUR-945 |
| [Address RC/CF] Create a daily exception alert auto-email that will notify a manager if any of their team members who are on leave have an open on-call ticket. | Ivria Jackson, Heuristics | In place by 10 Sept 2024 | HEUR-946 |
| [Address RC/CF] Update the Policy team’s OPSUPDATE checklist to include a step where the policy ticket owner must confirm that the changes went live correctly. | Anjali Bahri, Policy | In place by 10 Sept 2024 | POLICY-719 |
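To make the daily exception alert (HEUR-946) more concrete, here is a minimal Python sketch of the selection logic such an alert might run. It assumes ticket and leave records are available as simple objects; the `Ticket` and `Leave` shapes, their field names, and the function name are hypothetical, and sending the email itself is left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Ticket:
    ticket_id: str
    assignee: str
    is_open: bool
    is_on_call: bool


@dataclass
class Leave:
    employee: str
    manager: str
    start: date
    end: date


def tickets_needing_reassignment(tickets, leaves, today):
    """Map each manager to open on-call tickets held by reports on leave today."""
    # Employees whose leave period covers today, mapped to their manager.
    on_leave = {l.employee: l.manager for l in leaves if l.start <= today <= l.end}
    by_manager = defaultdict(list)
    for t in tickets:
        if t.is_open and t.is_on_call and t.assignee in on_leave:
            by_manager[on_leave[t.assignee]].append(t.ticket_id)
    return dict(by_manager)
```

Each key in the returned mapping would correspond to one alert email. An empty mapping means no alert is needed that day.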

E. Measuring success

E.1 Set quantifiable metrics

What metrics will you use to confirm that the action(s) worked?

The actions taken will be considered successful if:

  1. No further incidents arise due to on-call tickets not being reassigned when a team member goes on leave.

Numerator: number of incidents arising from un-reassigned tickets = 0
Denominator: total number of incidents.

  2. Daily email alerts are generated for 100% of the cases where an on-call ticket is not reassigned within 24 hours of the on-call rotation getting a retroactive update.

Numerator: number of email alerts sent
Denominator: total number of un-reassigned tickets that should have triggered an email alert.
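As a rough illustration, the two metrics above could be computed as in the following Python sketch. It assumes incident records carry a cause value and that counts of sent and expected alerts are available; the `Incident` shape and the function names are hypothetical, and the cause value is assumed to match the one recorded in the incident tracker.

```python
from dataclasses import dataclass

# Assumed to match the Incident Cause value used in the incident tracker.
NOT_REASSIGNED = "On-Call Ticket Not Reassigned"


@dataclass
class Incident:
    incident_id: str
    cause: str


def unreassigned_incident_ratio(incidents):
    """Metric 1: (incidents caused by un-reassigned tickets, total incidents)."""
    numerator = sum(1 for i in incidents if i.cause == NOT_REASSIGNED)
    return numerator, len(incidents)


def alert_coverage(alerts_sent, alerts_expected):
    """Metric 2: share of un-reassigned tickets that triggered an email alert."""
    if alerts_expected == 0:
        return 1.0  # nothing should have triggered, so coverage is complete
    return alerts_sent / alerts_expected


def actions_successful(incidents, alerts_sent, alerts_expected):
    """True when metric 1 has a zero numerator and metric 2 is at 100%."""
    numerator, _ = unreassigned_incident_ratio(incidents)
    return numerator == 0 and alert_coverage(alerts_sent, alerts_expected) == 1.0
```

Returning the numerator and denominator separately for the first metric keeps the raw counts available for reporting, since the success criterion depends only on the numerator being zero.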

E.2 Define how data will be collected and by whom

Who will collect and report on these metrics?

These steps will be used to track and monitor the relevant metrics.

  • Incident Cause Logging. The ticket for any incident that is found to be caused by on-call tickets not being reassigned will have “On-Call Ticket Not Reassigned” tagged as the official incident cause.
  • Quarterly Incident Post-mortem Meeting. The Quarterly Incident Post-mortem Meetings will now include a review of incidents caused by the non-reassignment of on-call tickets.

The owner of the Quarterly Incident Post-mortem Meeting will be responsible for collecting and presenting the relevant metrics at the meeting.

| Action | Owner | Target Date | Related Links |
| --- | --- | --- | --- |
| [Measure] Add “On-Call Ticket Not Reassigned” as a valid value for the Incident Cause field in the incident tracker. Notify all incident managers of this update. | Kwanele Lungu, Incident Management | 10 Sept 2024 | IMOC-841 |
| [Measure] Update the template agenda of the Quarterly Incident Post-mortem Meeting to include a review of incidents of this nature. | Kwanele Lungu, Quarterly Post-mortem Meeting Owner | 10 Sept 2024 | IMOC-842 |

E.3 Monitor impact

Decide how long the impact of the action will be tracked.

The impact of the [Address RC/CF] actions identified in section D will be monitored for one quarter (21 September to 21 December 2024) to evaluate effectiveness.

The effectiveness of the changes will be assessed at the next Quarterly Incident Post-mortem Meeting, currently scheduled for 23 December 2024.

F. Follow-up: Review Outcome

Did the action resolve the issue? Are further improvements needed? Outline any additional actions if required.

If needed, follow-up actions will be identified at the discretion of the Quarterly Incident Post-mortem Meeting owner during the next quarterly meeting.

| Action | Owner | Target Date | Related Links |
| --- | --- | --- | --- |
| [Follow-up] Review metrics at the next Quarterly Incident Post-mortem Meeting and identify follow-up actions if needed. | Kwanele Lungu, Quarterly Post-mortem Meeting Owner | 23 Dec 2024 | IMOC-843 |