Rings of Power - Multiple Sign Off System for Balrog

Overview

Balrog is Mozilla’s update server. It is responsible for deciding which updates to deliver for a given update request. Because updates deliver arbitrary code to users, a badly configured update server could orphan users or be used as an attack vector to infect them with malware. It is crucial that we make it more difficult for a single user’s account to make changes that affect large populations of users. Not only does this provide some footgun protection, but it also safeguards our users from attacks if an account is compromised or an employee goes rogue.

While the current version of Balrog has a notion of permissions, most people effectively have carte-blanche access to one or more products. This means that an under-caffeinated Release Engineer could ship the wrong thing, or a single compromised account can begin an attack. Requiring multiple different accounts to sign off on any sensitive changes will protect us against both of these scenarios.

Multiple sign offs may also be used to enhance Balrog’s ability to support workflows that are more reflective of reality. For example, the Release Management team are the final gatekeepers for most products (ie: we can’t ship without their sign off), but they are usually not the people in the best place to propose changes to Rules. A multiple sign off system that supports different types of roles would allow some people to propose changes and others to sign off on them.

Scope

Because so much automation relies on Balrog we can’t simply require multiple sign off for any row of any table in Balrog’s database, but there are some things that clearly must be protected by it. At minimum, we must require multiple sign off for:

All Rules or Scheduled Rule Changes that affect primary release channels. This is inclusive of the Firefox, Fennec, GMP, and SystemAddons products on the beta, release, or esr channels^[1].
Any Releases that are mapped to by any Rule or Scheduled Rule Change matching the above criteria.^[2]
All permissions.
Changes to any sign-off requirements, if we choose to store them in Balrog’s database.

On the flip side, in order to ensure that automation continues to function correctly, we must NOT require multiple sign off for:

Rules or Scheduled Rule Changes that only affect “nightly-style” channels.
Rules or Scheduled Rule Changes that only affect release “test” channels (eg: anything ending in -localtest or -cdntest).
Releases that are mapped to only by Rules or Scheduled Rule Changes matching any of the above criteria.
Release builds that have not yet been made live on a test channel.
Global shut-off rule.

We may need to consider another way of implementing an emergency shut-off.

These explicit includes and excludes cover the vast majority of objects in Balrog’s database. We can decide later whether or not to require multiple sign off for things not covered by the above.
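As a rough illustration of the include and exclude lists above, the check for a single product and channel could look like the following. This is only a sketch: the constant names and the function `channel_needs_signoff` are hypothetical, it handles exact channel names only (no wildcards), and it treats anything not explicitly listed as not requiring sign off, which this document leaves undecided.

```python
# Hypothetical constants for illustration; a real implementation would be data driven.
PROTECTED_PRODUCTS = {"Firefox", "Fennec", "GMP", "SystemAddons"}
PROTECTED_CHANNELS = {"beta", "release", "esr"}
TEST_SUFFIXES = ("-localtest", "-cdntest")


def channel_needs_signoff(product: str, channel: str) -> bool:
    """Return True if a Rule for this product/channel is in the must-protect set."""
    # Release "test" channels are explicitly excluded so automation keeps working.
    if channel.endswith(TEST_SUFFIXES):
        return False
    # Assumption: everything not listed (nightly-style channels, other products)
    # is treated as not requiring sign off.
    return product in PROTECTED_PRODUCTS and channel in PROTECTED_CHANNELS
```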

Also explicitly out of scope, but things to keep in mind when we implement:

The ability to group multiple changes that would be completed together in a single transaction.
Multi-factor Authentication

Other Requirements

Without dictating the details, the following requirements must be met for a successful implementation of multiple sign off:

Users must be able to have zero to many Roles^[3].
User Roles must be data driven - not hardcoded in the source code.

This does not preclude the possibility of User Roles being defined in LDAP^[4] or another outside authentication system.

A User must not be able to sign off on their own changes.
Each unique product+channel combination may define its own set of Required Signoffs.
The number of sign offs per Required Role must be configurable.

Eg: 2 sign offs from “relman” must be possible.
Could be data driven or done statically like domain whitelisting.

Proposal and sign off of changes must be possible through the Admin API and UI.

Implementation

Overview

The recently completed Scheduled Changes system has provided us with a way to “stage” changes to Rules, which is one of the key pieces needed to implement Multiple Sign Off. We can use Scheduled Changes as a base, and enhance it to support the idea of optional Signoffs. Using it as a base has a number of advantages:

Scheduled Changes already know how to store future versions of objects, so we don’t need to invent another way of doing that.
The Balrog Agent already knows how to enact Scheduled Changes.
It enables Signoff of Scheduled Changes for free (vs. a completely separate Signoff implementation, where we’d have to tie the two together).

While the Scheduled Changes tables will take care of storing proposed changes, we will still need new tables to associate Users with Roles, define Required Signoffs for different types of changes, and track Signoffs to proposed changes. These new tables will also need support in the web API and UI to expose them to humans and the Balrog Agent.

Scheduled Changes Enhancements

Currently, the Scheduled Changes system is only enabled for the Rules table. We will need to enable it for Releases, Permissions, and Required Signoffs (see the “Required Signoffs” section for more on this) in order to enable Multiple Signoff for them. No changes will be needed to the schema of the Scheduled Changes tables, as Signoff information will be tracked elsewhere (see the “Signoffs” section for more on this), but the Scheduled Changes table will need to be taught how to find and validate Signoffs.

Scheduled Changes currently have two conditions that can cause them to be enacted: a timestamp or when uptake hits a certain point. Signoff will be implemented as an additional type of condition. Unlike existing conditions, Signoff will not be user controllable when creating a Scheduled Change - it will be implied for any changes to objects that have Required Signoffs defined.
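A minimal sketch of how Signoff could sit alongside the existing time condition is shown below. The function names and the in-memory shapes (`required` as role to count, `signoffs` as username to role) are assumptions made for illustration, and the uptake condition is left out for brevity.

```python
from collections import Counter
from typing import Mapping, Optional


def signoffs_satisfied(required: Mapping[str, int], signoffs: Mapping[str, str]) -> bool:
    """required maps role -> Signoffs needed; signoffs maps username -> role used."""
    counts = Counter(signoffs.values())
    return all(counts[role] >= needed for role, needed in required.items())


def ready_to_enact(when: Optional[int], now: int,
                   required: Mapping[str, int],
                   signoffs: Mapping[str, str]) -> bool:
    """A change is enacted only when its time condition AND its Signoffs are met."""
    time_ok = when is None or now >= when
    # The Signoff condition is implied: an empty `required` is trivially satisfied.
    return time_ok and signoffs_satisfied(required, signoffs)
```

With `required = {"relman": 2}`, a change with a single relman Signoff stays pending even after its scheduled time has passed.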

At this time, Scheduled Changes does not support scheduling the deletion of objects. Because changes to the Required Signoffs table will be implemented as a Scheduled Change, we must add support for scheduling the deletion of objects to prevent a potential attacker from removing Required Signoffs and directly manipulating important Rules or Releases.

We will also modify the Scheduled Changes implementation to make it possible to disable the uptake condition on certain tables, because it will not make sense to have anything except Rules accept them.

New UI will be added for Scheduled Changes on Releases, Permissions, and Required Signoffs.

The Balrog Agent will need to be taught how to look for Scheduled Changes for Releases, Permissions, and Required Signoffs. It will also need to be taught how to determine when Required Signoffs have been satisfied.

User Roles

We will add a new table that associates Users with Roles. The table will have a composite key consisting of username and role. No other columns are required.
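For illustration, such a table could be sketched with Python's built-in sqlite3 module as follows. The table name `user_roles` and the helper `roles_for` are assumptions, not names from Balrog's schema, and the sample rows are invented.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
# Composite primary key: a User may hold many Roles, but each pairing only once.
conn.execute(
    """
    CREATE TABLE user_roles (
        username TEXT NOT NULL,
        role     TEXT NOT NULL,
        PRIMARY KEY (username, role)
    )
    """
)
conn.executemany(
    "INSERT INTO user_roles VALUES (?, ?)",
    [("janet", "relman"), ("janet", "releng"), ("alice", "relman")],
)


def roles_for(username: str) -> List[str]:
    """Return the Roles held by a User (possibly none), sorted by name."""
    rows = conn.execute(
        "SELECT role FROM user_roles WHERE username = ? ORDER BY role", (username,)
    )
    return [row[0] for row in rows]
```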

New web APIs will be added to let us manage these User Roles. The existing Permissions UI will be modified to allow for management of User Roles.

Modifying User Roles will not require Multiple Signoff. Even if an attacker were to gain access to a highly privileged account, they would not be able to ship anything by modifying User Roles, so this does not provide an indirect way of bypassing Multiple Sign Off.

Required Signoffs

Different types of objects have different dependencies that affect which types of Signoffs are needed. Because of this, we will add a new table to track Required Signoffs for each of: Rules & Releases (one table for both) and Permissions. These tables will be structured as follows:

Object(s)                          | Table                             | Columns
Rules, Releases, Required Signoffs | product_channel_required_signoffs | product (PK), channel (PK), role (PK), signoffs_required
Permissions                        | permissions_required_signoffs     | product (PK), role (PK), signoffs_required

Including “role” in the key allows us to require Signoffs from multiple different Roles for the same type of change. The signoffs_required column lets us require more than one Signoff from within one Role (eg: two Signoffs from “relman”).

Some columns have special constraints on them:

In the Rules table, “channel” may have wildcards in it (eg: release*), which must be considered when checking if a change needs Signoff. Wildcards will not be interpreted in the “channel” column of the Required Signoffs tables (to avoid confusion and footguns), but a wildcard in a Rule’s channel must be interpreted when comparing the two.
When new Required Signoffs are added, we must ensure the signoffs_required does not exceed the number of users who hold the given Role. Eg: if we required 10 Signoffs from the “releng” Role, but only 5 Users held that Role, making that change (or even removing that Signoff requirement) would be impossible.
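The two constraints above can be sketched as follows. This assumes an in-memory dictionary standing in for product_channel_required_signoffs, assumes only a trailing * is meaningful in a Rule's channel, and assumes that when several channels match, the largest signoffs_required per Role applies. All function names are hypothetical.

```python
from typing import Dict, Mapping, Set, Tuple

# (product, channel, role) -> signoffs_required
RequiredSignoffs = Dict[Tuple[str, str, str], int]


def rule_channel_matches(rule_channel: str, required_channel: str) -> bool:
    """Interpret a wildcard in the Rule's channel only, never in the table's."""
    if rule_channel.endswith("*"):
        return required_channel.startswith(rule_channel[:-1])
    return rule_channel == required_channel


def required_for_rule(product: str, rule_channel: str,
                      table: RequiredSignoffs) -> Dict[str, int]:
    """Collect role -> count for every table row the Rule's channel matches."""
    needed: Dict[str, int] = {}
    for (row_product, row_channel, role), count in table.items():
        if row_product == product and rule_channel_matches(rule_channel, row_channel):
            # Assumption: keep the strictest requirement per Role.
            needed[role] = max(needed.get(role, 0), count)
    return needed


def validate_new_requirement(role: str, signoffs_required: int,
                             role_holders: Mapping[str, Set[str]]) -> None:
    """Reject a requirement that could never be met by the current Role holders."""
    holders = len(role_holders.get(role, set()))
    if signoffs_required > holders:
        raise ValueError(
            f"{signoffs_required} Signoffs required from {role!r}, "
            f"but only {holders} Users hold that Role"
        )
```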

Changes to these tables will also require Signoff, but the Required Signoffs will be inherited from the associated table. That is to say, if changes to Firefox release channel Rules require 2 relman Signoffs, changing the number of Required Signoffs for Firefox release channel Rules will also require 2 relman Signoffs. This prevents a bad actor from indirectly modifying something by reducing the Signoff requirements, without the need to duplicate the Required Signoffs across two tables.

New web APIs will be added to let us manage Required Signoffs. New sections will be added to the existing Rules, Releases, and Permissions UI to let us manage them from the Admin Interface.

Signoffs

The Signoffs tables are responsible for tracking who has signed off on proposed changes. Because they are related to the Scheduled Changes tables, we will need one Signoffs table for each of: Rules, Releases, Permissions, and Required Signoffs. Each table will have a composite key that consists of sc_id (a reference to the associated scheduled_changes table) and username, and also a role^[5] column. The role column is not part of the primary key, because a user may only Signoff on any Change under one Role.
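To show why leaving role out of the primary key enforces one Signoff per User per Change, here is a small sketch using Python's built-in sqlite3 module. The table and column names follow this document, while the `sign_off` helper is hypothetical and the foreign key to the scheduled changes table is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE rules_signoffs (
        sc_id    INTEGER NOT NULL,
        username TEXT    NOT NULL,
        role     TEXT    NOT NULL,
        PRIMARY KEY (sc_id, username)
    )
    """
)


def sign_off(sc_id: int, username: str, role: str) -> bool:
    """Record a Signoff; return False if this User already signed off on sc_id."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO rules_signoffs VALUES (?, ?, ?)", (sc_id, username, role)
            )
        return True
    except sqlite3.IntegrityError:
        # Same (sc_id, username) pair: a second Signoff under another Role is rejected.
        return False
```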

A new web API will be needed to allow users to Signoff on changes. UI for this will be part of the Scheduled Changes sections. Existing Signoffs will be shown for each proposed change as well as which Roles still need to Signoff. An “Approve” button will be shown for any Change that still requires signoff. If the User holds multiple Roles, they will be prompted to specify which Role they are signing off under.

If the proposer of a change is a member of one of the groups that the change requires Signoff from, a Signoff from them will be recorded as part of the proposal. This means that anything that we want to require more than one person to change must have the sum of its signoffs_required be 2 or more. If the User is not a member of one of the required groups they may still propose a change, but no Signoff will be recorded for it. For example, if a change requires 2 RelMan Signoffs and a RelEng person proposes it, 2 RelMan people will need to approve it. If a change requires 2 RelMan Signoffs and a RelMan person proposes it, only 1 other RelMan person will need to explicitly approve it.
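The two examples above can be worked through with a short sketch. The function names are hypothetical, and picking the alphabetically first matching Role when a proposer holds several is an assumption; the UI described above would prompt the User instead.

```python
from collections import Counter
from typing import Dict, Mapping, Set


def initial_signoffs(proposer: str, proposer_roles: Set[str],
                     required: Mapping[str, int]) -> Dict[str, str]:
    """Record the proposer's implicit Signoff if they hold a required Role."""
    for role in sorted(proposer_roles):
        if role in required:
            return {proposer: role}
    return {}  # the proposal is still allowed, but no Signoff is recorded


def remaining(required: Mapping[str, int],
              signoffs: Mapping[str, str]) -> Dict[str, int]:
    """Return role -> number of Signoffs still outstanding."""
    counts = Counter(signoffs.values())
    return {
        role: needed - counts[role]
        for role, needed in required.items()
        if counts[role] < needed
    }


required = {"relman": 2}
by_releng = initial_signoffs("jeff", {"releng"}, required)
by_relman = initial_signoffs("janet", {"relman"}, required)
```

Here `remaining(required, by_releng)` reports two outstanding relman Signoffs, while `remaining(required, by_relman)` reports one.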

Other Enhancements

As a safeguard measure against future changes to the Signoffs system, whenever a change is made to a Rule, Release, Permission, or Required Signoff, Balrog will look up whether that change should have required Signoff, and verify that Signoff actually happened. This safeguard will be implemented outside of the Scheduled Changes logic to ensure that it is a fully separate safeguard (and therefore less likely to break when the Signoff system is modified in the future).

Use Cases

Scenario 1: Shipping Firefox

Initial State

The firefox_release Rule points at Firefox-52.0-build1 with a Background Rate of 100%.
No other Rules or Scheduled Rule Changes point at Firefox-52.0-build1.
The Firefox-53.0-build1 Release exists and is pointed at by the release-{local,cdn}test Rules.
Required Signoffs for the “Firefox” Product and “release” Channel are:

role: relman, signoffs_required: 2

Janet and Alice have the “relman” Role.

Workflow

Janet decides we’re ready to ship Firefox 53.0.
She opens the Balrog Admin UI, finds the main Firefox release Rule, and clicks its “Propose a Change” button.
In the modal dialog, she makes the following changes to the fields, and hits “Save”.

Mapping: Firefox-53.0-build1
Background Rate: 25%
When: February 23, 2017 @ 6am Pacific.

On the backend, the following changes are made:

The rules_scheduled_changes table stores Janet’s proposed new version of the Rule.
The rules_signoffs table stores a new row that shows that Janet has signed off (username: janet, role: relman)

Janet asks Alice to review what she’s proposed.
Alice opens the Balrog Admin UI and reviews the proposed change in the Proposed Changes section of the UI.
She decides that everything looks good, and clicks “Approve”.
On the backend, the rules_signoffs table stores a new row that shows Alice has signed off (username: alice, role: relman).
The first time the Balrog Agent runs after February 23, 2017 @ 6am Pacific, it notices that both the time and sign off requirements of the scheduled change have been satisfied, and enacts it.

Scenario 2: Granting a new Permission

Initial State

No permission for username “janet@mozilla.com” exists in the “permissions” table.
Required Signoffs for Permissions to the “Firefox” Product are:

role: relman, signoffs_required: 1
role: releng, signoffs_required: 1

Scott has the “relman” role.
Mary has the “releng” role.

Workflow

Janet asks Mary to be granted access to manage Firefox releases.
Mary opens the Permissions section of the Balrog Admin UI and clicks “Add a Permission”.
In the modal dialog, she enters the following and hits “Save”:

Username: janet@mozilla.com
Permission: admin
Options: {“products”: [“Firefox”, “Fennec”]}

On the backend, the permissions_scheduled_changes table stores Mary’s proposed Permission grant, and the permissions_signoffs table records her Signoff (username: mary, role: releng).
Mary asks Scott to approve Janet’s access.
Scott opens the Balrog Admin UI, but doesn’t agree that Janet should have access to Fennec.
Scott edits the proposed Permission to remove Fennec, and clicks Save.
On the backend, the following changes are made:

The existing proposed change is updated to reflect Scott’s change.
Mary’s earlier Signoff is removed from the permissions_signoffs table.

Scott asks Mary to review the updated Permission grant.
Mary decides that everything looks good, and clicks “Approve”.
The next time the Balrog Agent runs it notices that the Signoff requirements have been met, and enacts the proposed change.
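The steps of this scenario can be replayed with a small in-memory sketch. The `ProposedChange` class is hypothetical. Recording a Signoff for the editor when a proposal is edited is an assumption: the scenario only states that Mary's earlier Signoff is removed, but the relman requirement could not otherwise be met by the final step.

```python
from collections import Counter
from typing import Dict, Optional


class ProposedChange:
    """Hypothetical stand-in for a scheduled change plus its Signoffs."""

    def __init__(self, data: dict, required: Dict[str, int]) -> None:
        self.data = data
        self.required = required
        self.signoffs: Dict[str, str] = {}  # username -> role

    def sign_off(self, username: str, role: str) -> None:
        self.signoffs[username] = role

    def edit(self, new_data: dict, editor: str, editor_role: Optional[str]) -> None:
        # Earlier Signoffs applied to the old contents, so they are all dropped.
        self.data = new_data
        self.signoffs = {}
        if editor_role in self.required:
            # Assumption: the editor is treated like a proposer.
            self.signoffs[editor] = editor_role

    def satisfied(self) -> bool:
        counts = Counter(self.signoffs.values())
        return all(counts[role] >= n for role, n in self.required.items())


change = ProposedChange({"products": ["Firefox", "Fennec"]}, {"relman": 1, "releng": 1})
change.sign_off("mary", "releng")                          # Mary proposes
change.edit({"products": ["Firefox"]}, "scott", "relman")  # Scott removes Fennec
after_edit = dict(change.signoffs)                         # Mary's Signoff is gone
change.sign_off("mary", "releng")                          # Mary approves the update
```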

Scenario 3: Adding a What’s New Page to an already shipped Firefox

Initial State

The firefox_release Rule points at the Firefox-51.0-build1 Release.
No other Rules or Scheduled Changes point at the Firefox-51.0-build1 Release.
Firefox-51.0-build1 Release exists.
Required Signoffs for the “Firefox” Product and “release” Channel are:

role: relman, signoffs_required: 2

Scott and Janet have the “relman” role.
Jeff has the “releng” role.

Workflow

After Firefox 51.0 has already shipped, Janet realizes that we should have a What’s New Page for it.
Because Janet is unfamiliar with Release blob structure, she asks Jeff to make the necessary change in Balrog.
Jeff goes to the Releases section of the Balrog Admin UI and downloads the current Firefox-51.0-build1 Release.
After editing it to add the What’s New Page he clicks “Propose a Change” and uploads the modified Blob.
On the backend, the following changes are made:

The releases_scheduled_changes table stores Jeff’s new version of the Firefox-51.0-build1 blob with the What’s New Page enabled.
The releases_signoffs table stores a new row that shows that Jeff has signed off (username: jeff, role: releng)^[6]

Jeff asks Janet and Scott to review the proposed change to ensure the What’s New Page URL is correct.
Janet thinks things are fine, and clicks “Approve”.
On the backend, the releases_signoffs table stores a row that shows that she has signed off (username: janet, role: relman)
Scott also thinks things are correct, and clicks “Approve”.
On the backend, the releases_signoffs table stores a row that shows that he has signed off (username: scott, role: relman)
The first time the Balrog Agent runs, it notices that the sign off requirements of the scheduled change have been satisfied, and enacts it.

Scenario 4: Shipping a new version of the CDM

Initial State

The lowest priority Rule with the “CDM” Product points at CDM-19.
“CDM-20” does not exist in the Releases table.
Required Signoffs for the “CDM” Product are:

role: mediateam, signoffs_required: 1
role: releng, signoffs_required: 1
role: relman, signoffs_required: 1

Tim has the “mediateam” role.
Jason has the “releng” role.
Scott has the “relman” role.

Workflow

Tim decides he wants to ship CDM 20.

He prepares and uploads the CDM-20 Release with the Balrog Admin UI.

Because no Rules point at it, no Signoff is needed.

In the Rules section of the UI, he finds the lowest priority Rule for the “CDM” product, and clicks its “Propose a Change” button.
In the modal dialog, he updates the form as follows, then hits “Save”:

Mapping: CDM-20

On the backend, the following changes are made:

The rules_scheduled_changes table stores his proposed new version of the Rule.
The rules_signoffs table stores a new row that shows that he has signed off (username: tim, role: mediateam)

Tim asks Jason to review his proposed Rule for validity, and asks Scott to Signoff if it’s a good time to ship.
Jason looks at the proposed Rule, and the new Release it points at, and decides that everything looks fine. He clicks “Approve”.
On the backend, the rules_signoffs table stores a new row that shows Jason has signed off (username: jason, role: releng)
Scott is fine with the new version of the CDM shipping, and also clicks “Approve”.
On the backend, the rules_signoffs table stores a new row that shows he has signed off (username: scott, role: relman)
The next time the Balrog Agent runs it notices that the Signoff requirements have been met, and enacts the proposed change.

Scenario 5: Shipping a System Addon

Initial State

No Rule exists with product “SystemAddons” and version “53.0”.
“HotNewFeature-1.0” does not exist in the Releases table.
Required Signoffs for the “SystemAddons” Product and “release” Channel are:

role: gofaster, signoffs_required: 1
role: relman, signoffs_required: 1
role: qa, signoffs_required: 1

Carmen has the “gofaster” role.
Alice has the “relman” role.
Georges has the “qa” role.

Workflow

Carmen has an awesome new System Addon that he wants to ship (HotNewFeature 1.0).
He prepares and uploads the HotNewFeature-1.0 Release with the Balrog Admin UI.

Because no Rules point at it, no Signoff is needed.

In the Rules section of the UI, he proposes a new Rule with the following values and hits “Save”:

Product: SystemAddons
Version: 52.0
Channel: release
Mapping: HotNewFeature-1.0

On the backend, the following changes are made:

The rules_scheduled_changes table stores his proposed new Rule.
The rules_signoffs table stores a new row that shows his signoff (username: carmen, role: gofaster)

Carmen asks Georges to QA HotNewFeature 1.0.
Georges does some QA and decides that everything looks good. He clicks “Approve”.
On the backend, the rules_signoffs table stores a new row that shows that he has signed off (username: georges, role: qa)
Carmen then asks Alice to Signoff on shipping HotNewFeature 1.0.
Alice looks at the proposed Rule and the new Release it points at, and sees that QA has signed off and decides that everything looks fine. She clicks “Approve”.
On the backend, the rules_signoffs table stores a new row that shows that she has signed off (username: alice, role: relman)
The next time the Balrog Agent runs it notices that the Signoff requirements have been met, and enacts the proposed change, creating the new Rule.

[1] The Firefox aurora channel will also require Signoff, but only during periods where updates are frozen during uplifts.

[2] This does not affect how existing automation submits Releases because beta, release, and esr channel Releases are not mapped to until after automation finishes populating them.

[3] Depending on implementation, proposing a change may require one role, and approving it may require another. Even if multiple Roles per user are not strictly necessary for the implementation to work, it’s still handy to keep us flexible for the future.

[4] After some security events in the past, we should ensure LDAP groups are not automatically granted if we go this route.

[5] It may seem unintuitive that Role is in this table, but it is necessary because a User may hold multiple Roles, and we need to know which they have performed the Signoff under.

[6] This does not count as one of the required Signoffs because Jeff does not hold the “relman” role. Recording this Signoff is not strictly necessary, and we may choose not to do so.