Opt-in SCT Auditing

This Document is Public

Emily Stark (estark@chromium.org), Chris Thompson (cthomp@chromium.org).

Last modified: September 30, 2020

Status: Implementation complete

One-page overview

Summary

This design doc outlines an approach to verify that server certificates are being properly logged via Certificate Transparency (CT). This is a step along the way to CT achieving its full intended security goals.

Currently, Chrome validates Signed Certificate Timestamps (SCTs) that it receives alongside server certificates. SCTs are verifiable promises that the certificate will be publicly logged. Chrome validates that each SCT is properly signed, but does not currently verify that the corresponding certificate is actually logged as promised. In this doc, we propose that, for opted-in users, Chrome report a sample of SCTs and certificates to Google servers. Google audits the SCTs (that is, verifies that each certificate is properly logged as its SCT promised) and raises an alert about any SCT that cannot be audited, since an unauditable SCT is evidence of log misbehavior.

This document covers client-side changes, with server-side design covered in a Google-internal doc.

Team

estark@chromium.org, asymmetric@chromium.org, cthomp@chromium.org, jdeblasio@chromium.org

Platforms

Linux, Mac, Windows, Chrome OS, maybe Android

Bug

Pending.

Code affected

Network stack, Safe Browsing code to enable/configure reporting



Design

As a step on the way to achieving CT’s intended security goals, Chrome needs to audit SCTs. That is, when Chrome receives a server certificate with SCTs, it needs to check that the logs that issued the SCTs actually incorporated and published the certificate. Otherwise, we must trust logs to publish certificates that they say they will publish, and this is undesirable because CT logs might be malicious or misbehaving (just as a CA can be malicious or misbehaving).

A client can audit an SCT by querying the issuing log’s API endpoint for a Merkle audit (inclusion) proof, which can then be verified with simple cryptographic operations. However, this seemingly simple operation is surprisingly complex to deploy in practice. The Google-internal PRD explores some of the constraints and requirements. In particular, if Chrome clients were to query logs for inclusion proofs directly, they would leak browsing history to the CT logs, which is unacceptable from a privacy perspective.
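For concreteness, the verification step amounts to recomputing the tree head from the leaf hash and the audit proof, per RFC 6962 / RFC 9162. Below is a minimal sketch; the Sha256 helper and surrounding types are assumptions for illustration, not Chrome’s actual implementation.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Placeholder: any SHA-256 implementation (e.g., from BoringSSL) works here.
std::string Sha256(const std::string& data);

// Recomputes the Merkle tree head from a leaf hash and an audit proof, per
// RFC 6962 / RFC 9162, and compares it to the root in the signed tree head.
bool VerifyInclusionProof(uint64_t leaf_index,
                          uint64_t tree_size,
                          const std::string& leaf_hash,
                          const std::vector<std::string>& audit_path,
                          const std::string& expected_root_hash) {
  if (leaf_index >= tree_size)
    return false;
  uint64_t fn = leaf_index;
  uint64_t sn = tree_size - 1;
  std::string r = leaf_hash;
  for (const std::string& p : audit_path) {
    if (sn == 0)
      return false;
    if ((fn & 1) == 1 || fn == sn) {
      // The sibling from the proof goes on the left.
      r = Sha256(std::string(1, '\x01') + p + r);
      // Skip levels where this node has no sibling.
      while ((fn & 1) == 0 && fn != 0) {
        fn >>= 1;
        sn >>= 1;
      }
    } else {
      // The sibling from the proof goes on the right.
      r = Sha256(std::string(1, '\x01') + r + p);
    }
    fn >>= 1;
    sn >>= 1;
  }
  // The proof must consume the whole path and reproduce the root.
  return sn == 0 && r == expected_root_hash;
}
```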

To protect users’ privacy while still auditing some SCTs, this document proposes that Chrome report a sample of SCTs observed by Safe Browsing Extended Reporting (SBER) users to Google. These users have opted in to sharing additional data with Google for security purposes. Google will audit these SCTs asynchronously and raise an alert if any log misbehavior is detected. The server-side auditing and alerting is covered in a separate Google-internal doc.

The resulting security guarantees are closer to what CT is intended to achieve, providing a form of herd immunity, though there are still various ways that an attacker can abuse a misissued certificate without detection (see the Security discussion below). The Chrome and CT teams are currently exploring a number of other proposals that provide better security and privacy; these are likely longer-term solutions and are being researched in parallel with this effort.

Design overview

In Chrome’s net stack, we aim to report SCTs for a sample of HTTPS certificate validations to an embedder-configured endpoint. There are a few constraints on how we should accomplish this:

  • We want to send reports generated by the profile’s main network context, but the SystemNetworkContext should be the network context that actually sends the reports to a Safe Browsing endpoint. This is to avoid mingling state from the user’s regular web browsing with the report requests and to avoid any circular reporting (where the report requests themselves trigger reports).
  • We want to avoid sending SCTs and certificate chains across the IPC boundary; we typically avoid doing so for performance reasons.
  • Ideally we would like to deduplicate reports, at least per-profile if not browser-wide, so that we don’t report and audit the same SCT multiple times from the same client.

To meet these requirements, we’ll add a cache for pending SCT reports in the NetworkService. NetworkContext will have a method to enable/disable SCT reporting and configure parameters such as the report URL and the URLLoaderFactory to use for sending reports. When enabled, connections made through that network context will queue SCT reports into the NetworkService’s cache, which sends each report immediately if it is sampled.

This design allows us to deduplicate reports browser-wide, and there are no additional IPCs imposed for sending SCT reports.

The following sections discuss this design in more detail.

Reporting in the network service

All SCT reports will go through a cache that we’ll add to the NetworkService. This cache deduplicates reports across profiles. Later, it may also be responsible for persisting reports that fail to send and retrying later (see Persistence and retry). The privacy considerations of deduplicating reports across profiles are discussed in Privacy considerations.

The cache will have a MaybeSendReport method that takes the NetworkContext that generated the report (used to check that reporting is enabled for that NetworkContext) and the data associated with the report (see Report format). It caches the report data in memory if the same report is not already cached and it is not discarded due to sampling. If the report was cached rather than discarded, the cache sends it using a URLLoaderFactory configured at network service startup time.
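A declaration-level sketch of what this implies; class, member, and type choices below are illustrative, not the final Chromium API:

```cpp
// Sketch of the report cache that lives in the NetworkService and is shared
// by all NetworkContexts, so deduplication is browser-wide.
class SCTAuditingCache {
 public:
  // |network_context| identifies the context that generated the report and
  // is consulted to check whether auditing is enabled for it. |report|
  // carries the hostname/port, validated certificate chain, and SCTs (see
  // Report format).
  void MaybeSendReport(NetworkContext* network_context,
                       const SCTAuditingReport& report);

 private:
  // Sends a newly cached report using the URLLoaderFactory configured at
  // network service startup.
  void SendReport(const SCTAuditingReport& report);

  // Deduplication cache, keyed by a hash of the report contents.
  base::MRUCache<net::SHA256HashValue, SCTAuditingReport> cache_{128};

  GURL report_uri_;
  mojo::Remote<network::mojom::URLLoaderFactory> url_loader_factory_;
};
```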

Consumers of a CTVerifier are responsible for reporting SCTs to the NetworkService’s SCT reporting cache. The CTVerifier itself does not report SCTs because logically auditing SCTs is a separate operation from validating them (e.g., verifying their signatures); thus, CTVerifier implementations can remain focused on validating SCTs. For example:

  • SSLClientSocketImpl will pass the certificate chain and SCTs into the cache after calling CTVerifier::Verify (see the sketch after this list).
  • NetworkContext::OnCertVerifyForSignedExchangeComplete will similarly pass the certificate chain and SCTs into the cache after verifying SCTs.
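Illustrating the first bullet, the hook might look roughly like the following; every name here other than SSLClientSocketImpl and CTVerifier::Verify is a hypothetical placeholder:

```cpp
// Sketch of the reporting hook in SSLClientSocketImpl, run after CT
// verification has completed (all helper and member names are assumptions).
void SSLClientSocketImpl::MaybeReportSCTs() {
  // |signed_cert_timestamps_| was populated by CTVerifier::Verify, with a
  // validation status attached to each SCT.
  if (!HasAtLeastOneValidSCT(signed_cert_timestamps_))
    return;  // Only connections carrying valid SCTs are reported (see below).

  // Hand the validated chain and only the valid SCTs to the NetworkService's
  // report cache (via whatever plumbing object the socket holds).
  ReportToSCTAuditingCache(host_and_port_,
                           server_cert_verify_result_.verified_cert,
                           FilterValidSCTs(signed_cert_timestamps_));
}
```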

We will only report SCTs for certs that have valid SCTs. This excludes certs from private roots (which we don’t care about) and certs that only have invalid SCTs (for which we already show a CT interstitial). To reduce report size, when a connection has a mixture of valid and invalid SCTs, we will include only the valid ones. See Ordering of CT validation with certificate validation for more details about this decision.

Finally, the embedder needs to be able to dynamically enable/disable reporting (because the Safe Browsing Extended Reporting preference can be toggled over the course of the browser’s lifetime). The NetworkContext will therefore expose a SetSCTAuditingEnabled(bool) mojo API that sets whether SCT reporting is enabled or disabled. The report cache in the Network Service can then query the enabled state and only add reports to the cache if the NetworkContext for that report has auditing enabled.
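A sketch of this plumbing on the network service side; the mojom method is the one proposed above, while the member and accessor names are illustrative:

```cpp
// services/network/network_context.cc (sketch)
void NetworkContext::SetSCTAuditingEnabled(bool enabled) {
  is_sct_auditing_enabled_ = enabled;
}

// Queried by the SCT report cache before accepting a report generated by
// this context (see the sampling sketch later in this doc).
bool NetworkContext::is_sct_auditing_enabled() const {
  return is_sct_auditing_enabled_;
}
```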

See Configuring reporting in Chrome for how SCT reports will be configured in //chrome.

Sampling

Sampling strategy

To avoid overwhelming the report server, we don’t want to report every SCT that every Chrome client sees. We’ll define a Finch feature for the SCT reporting cache with a parameter that allows us to throttle reports (by sending a random fraction of generated reports) or disable them entirely. The feature will initially be off by default and enabled only for Finch users, and then enabled by default in a subsequent milestone with a conservative throttling parameter. Every time a report is enqueued to the cache, the cache will generate a random number and only send the report if it falls within the threshold set by the Finch parameter. We have implemented Option C below.
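For illustration, the Finch control could be declared roughly as follows; the feature and parameter names are placeholders, not the final ones:

```cpp
#include "base/feature_list.h"
#include "base/metrics/field_trial_params.h"

// Disabled by default; enabled via Finch first, then by default in a later
// milestone. (Names are placeholders.)
const base::Feature kSCTAuditing{"SCTAuditing",
                                 base::FEATURE_DISABLED_BY_DEFAULT};

// Fraction of eligible reports that are actually sent, e.g. 0.0001.
const base::FeatureParam<double> kSCTAuditingSamplingRate{
    &kSCTAuditing, "sampling_rate", 0.0};
```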

Option A: Sample before caching

  • Draw random n=[0,1)
  • If n < sampling_threshold, then
  • Check if already in cache*; if not,
  • Insert into cache and notify embedder

This generates reports for a sample of connections (but sending at most one report per certificate). The cache is used to maintain a list of already handled SCTs to avoid sending twice, and to store the report until the embedder can send it (to avoid passing the report over IPC).

Over time, we’d expect a user to report all certificates they regularly see (eventually a connection for one of them will get sampled).

Pros: Never tries to send the same SCTs twice (and doesn’t need a separate “sent” bit -- cache deduplication handles that). Fewer entries in the cache (only get added if they get sampled).

Cons: “unique connections” != “unique hosts” necessarily, and the relationship between the two may be unclear, so it may be harder to reason about the privacy properties of sampling. May be overloading the cache some to have it implicitly track whether a report was “sent” or not.

Option B: Cache before sampling

  • Insert into cache (bumping the last_seen time if already present)
  • Draw random n=[0,1)
  • If n < sampling_threshold, then
  • Notify embedder to try to send (if isPending is true)
  • (When the report is sent, set isPending to false)

This lets each connection have a chance to notify the embedder about the report. The report will already be in the cache. This has the same sampling properties as Option A (a fraction of connections will cause reports to be sent), but presence in the cache is separated from tracking whether the SCTs have been reported yet.

This is probably closest to the original wording in the section above.

To the server, Option A and Option B should be indistinguishable.

Pros: Cache may be conceptually more explicit than in Option A (see “Cons” in Option A).

Cons: Requires a separate “sent” bit to avoid sending duplicate reports. Cache fills up quicker (and does more work even when not sampled).

Option C: Cache SCTs, only sample if not already in cache [IMPLEMENTED]

  • Let item = cache.Get(cache_key) // Bumps the last_seen time if present
  • If item wasn’t in the cache, then
  • Insert into cache
  • Draw random n=[0,1)
  • If n < sampling_threshold, then
  • Notify embedder to try to send

This sends reports for a sample of certificates. Certificates not chosen are kept in the cache to “save” the sampling decision. For example, if a user visits 100 different sites overall and the sampling rate was 1/10, we would expect them to send reports for 10 sites, ever.

* A second dimension is whether checking if in the cache should bump the last_seen time or not.
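A compact sketch of this Option C flow, assuming the cache is keyed by a hash of the report contents and the sampling rate comes from the Finch parameter sketched above (all names are illustrative):

```cpp
void SCTAuditingCache::MaybeSendReport(NetworkContext* network_context,
                                       const SCTAuditingReport& report) {
  if (!network_context->is_sct_auditing_enabled())
    return;  // Reporting is off for this context (e.g., SBER disabled).

  // HashReport() is a hypothetical helper producing a stable digest of the
  // certificate chain + SCTs, used for deduplication.
  net::SHA256HashValue cache_key = HashReport(report);

  // Lookup bumps the entry's recency if it is already present.
  if (cache_.Get(cache_key) != cache_.end())
    return;  // Already seen (and possibly reported); nothing to do.

  // Remember this certificate + SCTs so the sampling decision is "saved" and
  // never reconsidered for this entry.
  cache_.Put(cache_key, report);

  // Sample: only a small fraction of newly cached reports are sent.
  if (base::RandDouble() >= kSCTAuditingSamplingRate.Get())
    return;

  SendReport(report);  // Uses the URLLoaderFactory configured at startup.
}
```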

Ordering of CT validation with certificate validation

In addition to probabilistic sampling (described above), we will not sample from the set of all SCTs received by Chrome: we are only interested in auditing SCTs with valid signatures that are associated with certificates that Chrome trusts.

We are only interested in SCTs with valid signatures because our goal is to determine whether logs honestly logged everything they promised to log, and the log hasn’t promised to log an SCT that doesn’t have a valid signature.

We are only interested in SCTs for certificates that Chrome trusts because of three considerations:

  1. Conceptually, we want to improve our confidence that every certificate trusted by Chrome is publicly logged, and we aren’t interested in verifying that untrusted certificates are logged (because we already don’t trust them). In other words, we are far more interested in auditing SCTs that directly affect Chrome’s trust decisions than those that don’t, though there is some nuance to consider. Consider a malformed or misissued certificate issued by a CA that Chrome trusts. For example, a certificate using an invalid encoding or invalid serial number might be rejected by Chrome but accepted by other software or platforms. This certificate could come with validly signed SCTs which could in theory be audited. But our proposal is to not audit them, with the following considerations:
  • If the CA is misbehaving but the log is behaving honestly, the certificate will appear in the log where monitors can detect it, so SCT auditing adds no additional security benefit.
  • If both the CA and the log are misbehaving, auditing the SCT would be useful: the results could reveal misbehavior that affects confidence in both the CA and the log. But we accept not detecting this situation, because we are not convinced that this indirect security benefit is worth the costs described below.
  2. Reporting SCTs for invalid certificates would introduce some additional privacy risks. For example, Chrome’s certificate validator doesn’t distinguish between certificates that are malformed but legitimately signed by a trusted CA and certificates that are malformed junk data with no valid issuer signature. Because invalid certificates typically fast-fail before signature validation, we can’t be sure whether or not they’re public, and so we default to assuming they may contain personally identifying information. This might be acceptable because the user is opting in, but it complicates the privacy considerations to achieve only an indirect security benefit (as described in #1).
  3. For efficiency and simplicity, Chrome already bypasses CT validation if the certificate isn’t otherwise trusted; we’d have to rework this assumption and refactor the code to report SCTs for invalid certificates.

Report format

The report will be a protobuf object containing a subset of the fields from the Expect-CT report format. We omit served-certificate-chain, effective-expiration-date, failure-mode, and test-report, since they are not relevant.

Using protocol buffers instead of JSON reduces overhead by almost a third (~6.5KB vs ~9KB for JSON) and simplifies implementation on the server side.
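For illustration, the fields retained from the Expect-CT format, sketched as a C++ struct; the real definition is a protobuf, and the field names here are approximations:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Approximate contents of one SCT auditing report (sketch only; the real
// message is a protobuf carrying a subset of the Expect-CT report fields).
struct SCTAuditingReport {
  std::string hostname;                      // Expect-CT "hostname"
  uint16_t port = 443;                       // Expect-CT "port"
  std::vector<std::string> validated_chain;  // DER certificates, leaf first
  std::vector<std::string> scts;             // Each SCT as served, plus how it
                                             // was delivered (TLS extension,
                                             // OCSP, or embedded)
};
// Omitted relative to Expect-CT: served-certificate-chain,
// effective-expiration-date, failure-mode, and test-report.
```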

Persistence and retry

In the first version of this project, reports will be fire-and-forget for simplicity. However, for a coherent threat model, we will eventually need to persist reports to disk and retry over time if they fail to send. This is because an active network attacker could block SCT auditing reports to evade detection. All SCT auditing solutions must therefore retry over time to be effective. (See, for example, this SCT gossip proposal: “Note that clients will send the same SCTs and chains to a server multiple times with the assumption that any man-in-the-middle attack eventually will cease, and an honest server will eventually receive collected malicious SCTs and certificate chains.”)

The NetworkService will also expose a method to clear the cache, which can be called when the user clears browsing data. This sacrifices a bit of auditing coverage for the sake of simplicity, because reports generated by all profiles will be cleared even if only a single profile’s browsing data is cleared.

A future iteration of the NetworkService’s SCT report cache can persist reports to disk if they fail to send and periodically retry. We’ll need to take the following considerations into account:

  • Reports should be deleted from disk when the user clears their browsing data.
  • An attacker can try to fill up the client’s storage and thereby evict malicious SCTs so that they never get reported. (See this blog post.)

Configuring reporting in Chrome

//chrome code will configure reporting according to the Safe Browsing Extended Reporting preference. This preference can change on the fly, for example when the user manually toggles it. The SafeBrowsingService lets other code subscribe to changes in this preference.

We’ll introduce a new SCTReportingService and corresponding SCTReportingServiceFactory in //chrome/browser/safe_browsing. Similar to CertificateReportingServiceFactory, the SCTReportingServiceFactory is a BrowserContextKeyedServiceFactory that will create an SCTReportingService per BrowserContext.

The SCTReportingService will subscribe to Safe Browsing preference changes for the profile. When a change is observed, it will retrieve the NetworkContext for the profile’s StoragePartition and call a method to enable/disable SCT reporting.
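A sketch of this preference plumbing, using the standard PrefChangeRegistrar pattern; the exact pref name, helper functions, and class shape are assumptions for illustration rather than the final implementation:

```cpp
// chrome/browser/safe_browsing/sct_reporting_service.h (sketch)
class SCTReportingService : public KeyedService {
 public:
  SCTReportingService(PrefService* prefs, Profile* profile)
      : prefs_(prefs), profile_(profile) {
    pref_change_registrar_.Init(prefs_);
    // Observe the Safe Browsing Extended Reporting preference; the pref
    // name below is an assumption.
    pref_change_registrar_.Add(
        prefs::kSafeBrowsingScoutReportingEnabled,
        base::BindRepeating(&SCTReportingService::OnPreferenceChanged,
                            base::Unretained(this)));
  }

 private:
  void OnPreferenceChanged() {
    const bool enabled = safe_browsing::IsExtendedReportingEnabled(*prefs_);
    // Push the new state to the NetworkContext of the profile's
    // StoragePartition.
    content::BrowserContext::GetDefaultStoragePartition(profile_)
        ->GetNetworkContext()
        ->SetSCTAuditingEnabled(enabled);
  }

  PrefService* prefs_;
  Profile* profile_;
  PrefChangeRegistrar pref_change_registrar_;
};
```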

Alternatives considered

This section documents alternative implementation approaches for opt-in SCT auditing, in which Chrome sends SCTs to Google to be audited. Longer-term privacy-preserving proposals are explored in other Google-internal docs.

Send reports with SafeBrowsingNetworkContext

The proposed design described above sends reports with the SystemNetworkContext. It would instead be possible to send reports with the SafeBrowsingNetworkContext corresponding to the profile that generated the report. In this design, when the SCT report cache receives a report, instead of sending it immediately, it returns a cache key to the embedder’s NetworkContextClient that generated the report. The embedder retrieves the profile’s SafeBrowsingNetworkContext and asks it to send the report. This imposes two additional IPCs per SCT report (one to notify the embedder of the pending report, and one to send the report), but the data included in the IPC is minimal (just the cache key).

However, there doesn’t seem to be any need to use the SafeBrowsingNetworkContext rather than the SystemNetworkContext; the SafeBrowsingNetworkContext is particularly useful when we want to send Safe Browsing cookies along with the request, which is unnecessary in this context. Using the SystemNetworkContext simplifies the design by reducing cross-process communication and state.

Plumb SCTs out to embedder and embedder reports them

We could consider plumbing SCTs out of the network stack to the embedder and having the embedder report them. We rejected this design for performance reasons: IPCing data on every network request is expensive. Even though SCTs are fairly small, we want to report the certificate as well, and for performance reasons we only send certificate chains to the embedder when explicitly requested (i.e., for main-frame navigation requests or when DevTools is enabled).

Manage reporting entirely within NetworkContext

An earlier design had all reporting logic live within the NetworkContext, with the embedder only getting involved to enable/disable reporting and set an endpoint. This design had two shortcomings: (1) reports could only be de-duplicated within a NetworkContext, not across NetworkContexts, and (2) reports could be sent only by the NetworkContext that made the connection that generated the report. Instead, we chose to cache reports in the NetworkService for better de-duplication, and to let the embedder decide which NetworkContext is used to send the reports.

Orchestrate reporting from the embedder via URLLoaderThrottles

An earlier design notified the embedder when there was a report ready to be sent via a URLLoaderThrottle. This was rejected because not all certificate verifications happen via URLLoader requests. For example, the nascent WebTransport API allows websites to create QUIC sockets directly. This design also had the drawback that separate throttle implementations were needed for navigation requests versus subresource requests, since the latter go to the renderer directly without passing through the browser process.

Metrics

Success metrics

This isn’t a user-facing feature, so there’s no success metric per se. A steady stream of received SCT reports will indicate that the feature is working as expected (exact volume TBD depending on how aggressively we throttle them). We can also track the % of SCTs that Chrome observes in a given time period that are audited via this mechanism. This will always be a small fraction because we’ll have to sample, but higher is better.

Regression metrics

Our regression metrics will be standard heartbeat metrics like crash rate and memory consumption. We don’t expect this feature to affect metrics like navigations or retention because it’s not user-facing.

Experiments

As described above, the feature will be gated by a Finch feature with a parameter that controls how aggressively reports are throttled.

TODO(cthomp): Describe new metrics we want to add to Chrome for helping us make decisions about sampling rate.

WIP: Brainstorming metrics we might want to have:

  • # unique certificates seen per-user per-day (proxy for the post-deduplication number of reports we could send with no sampling)
      • Not sure how to implement without privacy complications
  • Metrics for how well we deduplicate reports
      • “Would have sent a report but was deduplicated” event vs. “Sent a report” event (lets us measure deduplication rate)

Rollout plan

Standard Finch-based rollout. Unlike a standard Finch feature, we plan to leave the Finch flag around indefinitely so that we can use the parameter to throttle reports as needed.

Core principle considerations

Speed

This feature introduces new network requests, which could impact speed by consuming sockets, bandwidth, etc. However, these requests are non-blocking, only enabled for a subset of users (Safe Browsing Extended Reporting users), and throttled to affect only a small percentage of connections for those users. There is one additional IPC (with minimal associated data) associated with each such connection. We therefore don’t expect this feature to have a significant impact on Chrome speed overall.

Security

Security against network attackers

The goal of this feature is to improve user security by discovering misbehaving CT logs that do not log certificates that they are supposed to log. However, there are a variety of reasons why, in practice, an attacker might still succeed in preventing a malicious certificate from being published in CT logs:

  • Sampling. If the attacker does not use their malicious certificate against all users, then there is a chance that its SCTs will never be reported to Google, because the certificate may not be seen by SBER users, or may not be reported by them due to the throttling mechanism. However, we expect to achieve a strong chance of detecting attacks that target large swaths of users.
  • SBER fingerprinting. An attacker could fingerprint users who have opted in to SBER (e.g., by watching their network traffic for a period of time before mounting their attack) and avoid serving their malicious certificate to those users. This would circumvent our detection.
  • Report blocking. An attacker could block SCT reports while the user remains on the attacker-controlled network, preventing Google from ever detecting log misbehavior.

We plan to address some of these limitations in future versions of this feature (for example, by developing a privacy-preserving auditing mechanism so that we don’t have to limit auditing to SBER users, and by persisting and retrying SCT reports on failure).

SCT auditing will likely always be circumventable by a sufficiently determined or lucky attacker. For example, we might always have to choose a random sample of SCTs to audit, or an attacker might be able to indefinitely prevent a victim’s device from reporting a malicious SCT. Even so, this SCT auditing strategy raises the cost of performing an undetected attack. For example, an attacker would need to maintain a long-term MITM instead of a transient one, and would need to maintain awareness of Chrome-specific implementation characteristics and tailor their attack specifically to counter them. In addition, SCT auditing will catch log misbehavior due to accident or incompetence, overall improving the health and strength of the CT log ecosystem.

Security against Google

Because we will be checking SCTs for inclusion in a table of valid SCTs maintained by the CT team, we are trusting Google to identify log misbehavior. Ordinarily, trusting Google to verify security properties on behalf of Chrome clients would be obviously acceptable. However, it bears further discussion in this case because one of the goals of this project is to remove trust in Google’s CT logs (the “One Google” log requirement, as discussed in the Google-internal PRD). We may end up replacing the One Google requirement with an SCT auditing system that relies on the same Google team to determine what is a valid SCT. This is acceptable because the One Google requirement is not being removed out of distrust of the Google CT team or Google infrastructure; rather, the point is primarily to remove Google from the critical path of certificate issuance, and to demonstrate that CT can be deployed in other browsers without a dependency on Google.

Stability

No specific stability concerns

Simplicity

This feature is not user-facing and therefore doesn’t increase the user-facing complexity of the product.

Privacy considerations

Browsing history

SCTs and certificate chains are a rough proxy for browsing history. This information will be sent to Google, but only for Safe Browsing Extended Reporting users who have explicitly opted in to sharing browsing data with Google for security purposes.

Third-party logs don’t receive any information about Chrome users’ browsing history because we query Google-operated mirrors of CT data instead of querying the logs directly. Even if we decide to fall back to querying logs directly, we would do it periodically, batching all clients’ browsing history together over periods of at least an hour to avoid leaking any individual user’s history.

Because pending SCT reports contain information about the user’s browsing history, the cache of pending reports will be cleared when the user clears browsing data. (This is overly conservative from a privacy perspective, as we will be clearing reports generated by all profiles, not just the one for which browsing data is being cleared.) When we later implement Persistence and retry in hopes of achieving better security properties, we may want to refine this strategy to only clear reports for the relevant profile.

Cross-profile leakage

As discussed in Reporting in the network service, reports will be deduplicated across the entire browser. This opens up the opportunity for profiles to influence each other and potentially leak information. For example, if one profile’s network context enqueues a report for some website but learns from the NetworkService that the report is redundant, then that network context “knows” that the user visited that website in another profile. However, reports are sent by the SystemNetworkContext, which carries no profile’s browsing state (such as cookies), so data should not become intermingled among network contexts for different profiles.

Testing plan

This project shouldn’t require any special testing considerations. It introduces no new UI and should be fully covered by automated tests.