Opt-in SCT Auditing
This Document is Public
Emily Stark (estark@chromium.org), Chris Thompson (cthomp@chromium.org).
Last modified: September 30, 2020
Status: Implementation complete
This design doc outlines an approach to verify that server certificates are being properly logged via Certificate Transparency (CT). This is a step along the way to CT achieving its full intended security goals.
Currently, Chrome validates Signed Certificate Timestamps (SCTs) that it receives alongside server certificates. SCTs are verifiable promises that the certificate will be publicly logged. Chrome validates that each SCT is properly signed, but does not currently verify that the corresponding certificate is actually logged as promised. In this doc, we propose that Chrome reports a sample of SCTs and certificates to Google servers for opted-in users. Google audits the SCTs (that is, verifies that the certificate is properly logged as the SCT promised) and raises an alert about any SCT that fails the audit, since such an SCT is proof of log misbehavior.
This document covers client-side changes, with server-side design covered in a Google-internal doc.
estark@chromium.org, asymmetric@chromium.org, cthomp@chromium.org, jdeblasio@chromium.org
Linux, Mac, Windows, Chrome OS, maybe Android
Pending.
Network stack, Safe Browsing code to enable/configure reporting
As a step on the way to achieving CT’s intended security goals, Chrome needs to audit SCTs. That is, when Chrome receives a server certificate with SCTs, it needs to check that the logs that issued the SCTs actually incorporated and published the certificate. Otherwise, we must trust logs to publish certificates that they say they will publish, and this is undesirable because CT logs might be malicious or misbehaving (just as a CA can be malicious or misbehaving).
A client can audit an SCT by querying an API endpoint from the log to request a Merkle audit proof, which can then be verified via simple cryptographic operations. However, this simple operation is surprisingly complex to deploy in practice. The Google-internal PRD explores some of the constraints and requirements. In particular, if Chrome clients were to query for inclusion proofs directly, it would leak browsing history to the CT logs, which is unacceptable from a privacy perspective.
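To make the "simple cryptographic operations" concrete, here is a minimal sketch of verifying a Merkle audit (inclusion) proof, assuming the RFC 6962 hashing convention (SHA-256 with a 0x00 prefix for leaves and a 0x01 prefix for interior nodes). The function names are hypothetical and this is not Chrome's or Google's implementation; merkle_root and audit_path exist only to build a toy log to verify against.

```python
import hashlib
from typing import List


def leaf_hash(data: bytes) -> bytes:
    # RFC 6962 leaf hash: SHA-256(0x00 || leaf).
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # RFC 6962 interior node hash: SHA-256(0x01 || left || right).
    return hashlib.sha256(b"\x01" + left + right).digest()


def _split(n: int) -> int:
    # Largest power of two strictly smaller than n (n >= 2).
    k = 1
    while k * 2 < n:
        k *= 2
    return k


def merkle_root(leaves: List[bytes]) -> bytes:
    # Toy log: Merkle tree hash over a non-empty list of leaves.
    if len(leaves) == 1:
        return leaf_hash(leaves[0])
    k = _split(len(leaves))
    return node_hash(merkle_root(leaves[:k]), merkle_root(leaves[k:]))


def audit_path(index: int, leaves: List[bytes]) -> List[bytes]:
    # Toy log: sibling hashes from the leaf at `index` up to the root.
    if len(leaves) == 1:
        return []
    k = _split(len(leaves))
    if index < k:
        return audit_path(index, leaves[:k]) + [merkle_root(leaves[k:])]
    return audit_path(index - k, leaves[k:]) + [merkle_root(leaves[:k])]


def verify_inclusion(leaf: bytes, index: int, tree_size: int,
                     path: List[bytes], root: bytes) -> bool:
    # Client side: recompute the root from the leaf and the audit path.
    if index >= tree_size:
        return False
    fn, sn = index, tree_size - 1
    r = leaf_hash(leaf)
    for p in path:
        if sn == 0:
            return False
        if fn & 1 or fn == sn:
            r = node_hash(p, r)
            if not fn & 1:
                # Skip levels where this node has no right sibling.
                while fn and not fn & 1:
                    fn >>= 1
                    sn >>= 1
        else:
            r = node_hash(r, p)
        fn >>= 1
        sn >>= 1
    return sn == 0 and r == root
```

The check itself is only a few hashes per proof. The deployment difficulty discussed above comes from who asks the log for the proof, not from verifying it.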
To protect users’ privacy while still auditing some SCTs, this document proposes that Chrome reports a sample of SCTs observed by Safe Browsing Extended Reporting users to Google. These users have opted in to sharing additional data with Google for security purposes. Google will audit these SCTs asynchronously and raise an alert if any log misbehavior is detected. The server-side auditing and alerting is covered in a separate Google-internal doc.
The resulting security guarantees are closer to what CT is intended to achieve, providing a form of herd immunity, though there are still various ways that an attacker can abuse a misissued certificate without detection (see Security discussion below). The Chrome and CT teams are currently exploring a number of other possible proposals that provide better security and privacy; they are likely longer-term solutions that we are researching in parallel to this effort.
In Chrome’s net stack, we aim to report SCTs for a sample of HTTPS certificate validations to an embedder-configured endpoint. There are a few constraints on how we should accomplish this:
To meet these requirements, we’ll add a cache for pending SCT reports in the NetworkService. NetworkContext will have a method to enable/disable SCT reporting and configure parameters such as the report URL and the URLLoaderFactory to use for sending reports. When enabled, connections via that network context will queue SCT reports into the NetworkService’s cache, which will send them immediately if the report is sampled.
This design allows us to deduplicate reports browser-wide, and there are no additional IPCs imposed for sending SCT reports.
The following sections discuss this design in more detail.
All SCT reports will go through a cache that we’ll add to the NetworkService. This cache deduplicates reports across profiles. Later, it may also be responsible for persisting reports that fail to send and retrying later (see Persistence and retry). The privacy considerations of deduplicating reports across profiles are discussed in Privacy considerations.
The cache will have a MaybeSendReport method that takes the NetworkContext that generated the report (used to check that reporting is enabled for that NetworkContext) and the data associated with the report (see Report format). It caches the report data in memory if the same report is not already cached and if it is not discarded due to sampling. If the report was cached rather than discarded, the cache sends the report with a URLLoaderFactory configured at network service startup time.
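The control flow described in the paragraph above can be sketched as follows. This is an illustrative Python model rather than the C++ implementation: the class name, the context_id and report_key parameters, and the send callback (standing in for the URLLoaderFactory) are all assumptions made for the example.

```python
import random
from typing import Callable, Dict, Optional, Set


class SCTReportCache:
    """Hypothetical model of the NetworkService-wide SCT report cache."""

    def __init__(self, sampling_rate: float,
                 send: Callable[[bytes], None],
                 rng: Optional[random.Random] = None) -> None:
        self._sampling_rate = sampling_rate  # fraction of reports to send
        self._send = send                    # stand-in for the URLLoaderFactory
        self._rng = rng or random.Random()
        self._enabled_contexts: Set[str] = set()
        self._reports: Dict[str, bytes] = {}

    def set_auditing_enabled(self, context_id: str, enabled: bool) -> None:
        # Models the per-NetworkContext enable/disable switch.
        if enabled:
            self._enabled_contexts.add(context_id)
        else:
            self._enabled_contexts.discard(context_id)

    def maybe_send_report(self, context_id: str, report_key: str,
                          report: bytes) -> bool:
        """Returns True if the report was cached and sent."""
        if context_id not in self._enabled_contexts:
            return False  # reporting is off for this context
        if report_key in self._reports:
            return False  # duplicate, possibly from another profile
        if self._rng.random() >= self._sampling_rate:
            return False  # discarded by sampling
        self._reports[report_key] = report
        self._send(report)
        return True

    def clear(self) -> None:
        # Called when the user clears browsing data.
        self._reports.clear()
```

Because the dictionary is keyed only by report_key, a report enqueued by one context suppresses the same report from every other context, which is the browser-wide deduplication the design relies on.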
Consumers of a CTVerifier are responsible for reporting SCTs to the NetworkService’s SCT reporting cache. The CTVerifier itself does not report SCTs because logically auditing SCTs is a separate operation from validating them (e.g., verifying their signatures); thus, CTVerifier implementations can remain focused on validating SCTs. For example:
We will only report SCTs for certs that have valid SCTs. This excludes certs from private roots (which we don’t care about) and certs that only have invalid SCTs (for which we already show a CT interstitial). To reduce the size of reports, we will only include the valid SCTs in the case of a connection with a mixture of valid and invalid SCTs. See Ordering of CT validation with certificate validation for more details about this decision.
Finally, the embedder needs to be able to dynamically enable/disable reporting (because the Safe Browsing Extended Reporting preference can be toggled over the course of the browser’s lifetime). The NetworkContext will therefore expose a SetSCTAuditingEnabled(bool) mojo API that sets whether SCT reporting is enabled or disabled. The report cache in the Network Service can then query the enabled state and only add reports to the cache if the NetworkContext for that report has auditing enabled.
See Configuring reporting in Chrome for how SCT reports will be configured in //chrome.
To avoid overwhelming the report server, we don’t want to report every SCT that every Chrome client sees. We’ll define a Finch feature in the SCT reporting cache with a parameter that allows us to throttle the reports (by sending a random fraction of generated reports) or disable them entirely. The feature will initially be default off and then enabled only for Finch users, and then enabled in a subsequent milestone by default with a conservative throttling parameter. Every time a report is enqueued to the cache, the cache will generate a random number and send the report only if that number falls within the threshold set by the Finch parameter. We have implemented Option C below.
Option A: Sample before caching
This generates reports for a sample of connections (but sending at most one report per certificate). The cache is used to maintain a list of already handled SCTs to avoid sending twice, and to store the report until the embedder can send it (to avoid passing the report over IPC).
Over time, we’d expect a user to report all certificates they regularly see (eventually a connection for one of them will get sampled).
Pros: Never tries to send the same SCTs twice (and doesn’t need a separate “sent” bit -- cache deduplication handles that). Fewer entries in the cache (only get added if they get sampled).
Cons: “unique connections” != “unique hosts” necessarily, and the relationship between the two may be unclear, so it may be harder to reason about the privacy properties of sampling. May be overloading the cache some to have it implicitly track whether a report was “sent” or not.
Option B: Cache before sampling
This lets each connection have a chance to notify the embedder about the report. The report will already be in the cache. This has the same sampling properties as Option A (a fraction of connections will cause reports to be sent), but presence in the cache is separated from tracking whether the SCTs have been reported yet.
This is probably closest to the original wording in the section above.
To the server, Option A and Option B should be indistinguishable.
Pros: Cache may be conceptually more explicit than in Option A (see “Cons” in Option A).
Cons: Requires a separate “sent” bit to avoid sending duplicate reports. Cache fills up quicker (and does more work even when not sampled).
Option C: Cache SCTs, only sample if not already in cache [IMPLEMENTED]
This sends reports for a sample of certificates. Certificates not chosen are kept in the cache to “save” the sampling decision. For example, if a user visits 100 different sites overall and the sampling rate is 1/10, we would expect them to send reports for 10 sites, ever.
* A second dimension is whether a cache lookup that finds an existing entry should bump that entry’s last_seen time.
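The difference between the three sampling options can be seen in a small simulation. This is a sketch under simplifying assumptions: each "site" stands for one certificate with its SCTs, the cache never evicts, and the function and variable names are invented for the example.

```python
import random
from typing import List, Set


def simulate(option: str, visits: List[str], rate: float, seed: int) -> Set[str]:
    """Returns the set of sites for which a report would be sent."""
    rng = random.Random(seed)
    cached: Set[str] = set()
    sent: Set[str] = set()
    for site in visits:
        if option == "A":
            # Sample before caching: unsampled connections leave no trace,
            # so the next connection to the same site rolls again.
            if site in cached:
                continue
            if rng.random() < rate:
                cached.add(site)
                sent.add(site)
        elif option == "B":
            # Cache before sampling: always cached, separate "sent" bit.
            cached.add(site)
            if site not in sent and rng.random() < rate:
                sent.add(site)
        elif option == "C":
            # Cache first, sample only on first sight: the decision is saved.
            if site in cached:
                continue
            cached.add(site)
            if rng.random() < rate:
                sent.add(site)
        else:
            raise ValueError("unknown option: " + option)
    return sent


sites = ["site-%d.example" % i for i in range(100)]
repeat_visits = sites * 50  # every site is visited 50 times
for option in ("A", "B", "C"):
    count = len(simulate(option, repeat_visits, rate=0.1, seed=1))
    print(option, count)
```

Under Options A and B, repeat visits give each site many chances to be sampled, so the number of reported sites climbs toward all 100 as the visit count grows. Under Option C, each site gets exactly one roll, so the count stays near 10 no matter how often the sites are revisited.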
In addition to probabilistic sampling (described above), we won’t be sampling from the set of all SCTs received by Chrome. We are only interested in auditing SCTs with valid signatures, associated with certificates that Chrome trusts.
We are only interested in SCTs with valid signatures because our goal is to determine whether logs honestly logged everything they promised to log, and the log hasn’t promised to log an SCT that doesn’t have a valid signature.
We are only interested in SCTs for certificates that Chrome trusts because of three considerations:
The report will be a protobuf object containing a subset of the fields from the Expect-CT report format. We omit served-certificate-chain, effective-expiration-date, failure-mode, and test-report, since they are not relevant.
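As a rough illustration of the field subset, the sketch below drops the four omitted fields from an Expect-CT-shaped dictionary. Only the four omitted names come from this document. The remaining keys in the sample (date-time, hostname, port, validated-certificate-chain, scts) are my reading of the Expect-CT report format and the values are placeholders, so treat the exact set as an assumption. The real report is a protobuf message, not a Python dict.

```python
from typing import Any, Dict

# Fields this design omits from the Expect-CT report format.
OMITTED_FIELDS = frozenset({
    "served-certificate-chain",
    "effective-expiration-date",
    "failure-mode",
    "test-report",
})


def to_sct_auditing_fields(expect_ct_report: Dict[str, Any]) -> Dict[str, Any]:
    """Keeps only the fields that are relevant to SCT auditing."""
    return {
        name: value
        for name, value in expect_ct_report.items()
        if name not in OMITTED_FIELDS
    }


# Placeholder values; a real report carries PEM certificates and encoded SCTs.
sample = {
    "date-time": "2020-09-30T00:00:00Z",
    "hostname": "www.example.com",
    "port": 443,
    "effective-expiration-date": "2020-10-30T00:00:00Z",
    "served-certificate-chain": ["<pem>"],
    "validated-certificate-chain": ["<pem>"],
    "scts": [{"serialized_sct": "<base64>"}],
    "failure-mode": "report-only",
    "test-report": False,
}
print(sorted(to_sct_auditing_fields(sample)))
```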
Using protocol buffers instead of JSON reduces report size by a little over a quarter (~6.5KB vs ~9KB for JSON) and simplifies implementation on the server side.
In the first version of this project, reports will be fire-and-forget for simplicity. However, for a coherent threat model, we will eventually need to persist reports to disk and retry over time if they fail to send. This is because an active network attacker could block SCT auditing reports to evade detection. All SCT auditing solutions must therefore retry over time to be effective. (See, for example, this SCT gossip proposal: “Note that clients will send the same SCTs and chains to a server multiple times with the assumption that any man-in-the-middle attack eventually will cease, and an honest server will eventually receive collected malicious SCTs and certificate chains.”)
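One way the eventual retry behavior could look is sketched below. The design only says that reports must be retried over time. The exponential backoff, the delay values, and all names here are assumptions for illustration, and persistence to disk is left out.

```python
import heapq
import itertools
from typing import Callable, List, Tuple


class PendingReportQueue:
    """Hypothetical retry queue: failed reports are retried with backoff."""

    def __init__(self, send: Callable[[bytes], bool],
                 base_delay_s: float = 60.0,
                 max_delay_s: float = 24 * 3600.0) -> None:
        self._send = send  # returns True if the server accepted the report
        self._base_delay_s = base_delay_s
        self._max_delay_s = max_delay_s
        # Heap entries: (due_time, sequence number, failed attempts, report).
        self._heap: List[Tuple[float, int, int, bytes]] = []
        self._seq = itertools.count()  # tie-breaker for equal due times

    def enqueue(self, report: bytes, now: float) -> None:
        heapq.heappush(self._heap, (now, next(self._seq), 0, report))

    def run_due(self, now: float) -> int:
        """Tries every report that is due; returns how many were delivered."""
        delivered = 0
        to_retry = []
        while self._heap and self._heap[0][0] <= now:
            _, _, attempts, report = heapq.heappop(self._heap)
            if self._send(report):
                delivered += 1
                continue
            # Double the delay after each failure, up to a cap.
            delay = min(self._base_delay_s * 2 ** attempts, self._max_delay_s)
            to_retry.append((now + delay, next(self._seq), attempts + 1, report))
        for entry in to_retry:
            heapq.heappush(self._heap, entry)
        return delivered

    def __len__(self) -> int:
        return len(self._heap)
```

With this shape, an attacker who blocks reports only delays them: the report stays queued and is delivered on the first attempt made after the block ends.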
The NetworkService will also expose a method to clear the cache, which can be called when the user clears browsing data. This sacrifices a bit of auditing coverage for the sake of simplicity, because reports generated by all profiles will be cleared even if only a single profile’s browsing data is cleared.
A future iteration of the NetworkService’s SCT report cache can persist reports to disk if they fail to send and periodically retry. We’ll need to take the following considerations into account:
//chrome code will configure reporting according to the Safe Browsing Extended Reporting preference. This preference can change on-the-fly, for example by the user manually toggling it. The SafeBrowsingService lets other code subscribe to changes in this preference.
We’ll introduce a new SCTReportingService and corresponding SCTReportingServiceFactory in //chrome/browser/safe_browsing. Similar to CertificateReportingServiceFactory, the SCTReportingServiceFactory is a BrowserContextKeyedServiceFactory that will create a SCTReportingService per BrowserContext.
The SCTReportingService will subscribe to Safe Browsing preference changes for the profile. When a change is observed, it will retrieve the NetworkContext for the profile’s StoragePartition and call a method to enable/disable SCT reporting.
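The subscription pattern described above can be modeled in a few lines. This is a Python stand-in for the C++ classes: FakePrefService and FakeNetworkContext are hypothetical test doubles, and the method names are invented for the example.

```python
from typing import Callable, List


class FakeNetworkContext:
    """Stand-in for a profile's NetworkContext."""

    def __init__(self) -> None:
        self.sct_auditing_enabled = False

    def set_sct_auditing_enabled(self, enabled: bool) -> None:
        self.sct_auditing_enabled = enabled


class FakePrefService:
    """Stand-in for the Safe Browsing Extended Reporting preference."""

    def __init__(self, extended_reporting: bool = False) -> None:
        self.extended_reporting = extended_reporting
        self._observers: List[Callable[[bool], None]] = []

    def subscribe(self, callback: Callable[[bool], None]) -> None:
        self._observers.append(callback)

    def set_extended_reporting(self, enabled: bool) -> None:
        self.extended_reporting = enabled
        for callback in self._observers:
            callback(enabled)


class SCTReportingService:
    """One instance per profile; mirrors the preference into the context."""

    def __init__(self, prefs: FakePrefService,
                 network_context: FakeNetworkContext) -> None:
        self._network_context = network_context
        prefs.subscribe(self._on_preference_changed)
        # Apply the current value so the initial state is correct too.
        self._on_preference_changed(prefs.extended_reporting)

    def _on_preference_changed(self, enabled: bool) -> None:
        self._network_context.set_sct_auditing_enabled(enabled)
```

Applying the current preference value in the constructor is a choice made in this sketch so that a profile that already has Extended Reporting enabled at startup begins reporting without waiting for a toggle.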
This section documents alternative implementation approaches for opt-in SCT auditing, in which Chrome sends SCTs to Google to be audited. Longer-term privacy-preserving proposals are explored in other Google-internal docs.
The proposed design described above sends reports with the SystemNetworkContext. It would instead be possible to send reports with the SafeBrowsingNetworkContext corresponding to the profile that generated the report. In this design, when the SCT report cache receives a report, instead of sending it immediately, it returns a cache key to the embedder’s NetworkContextClient that generated the report. The embedder retrieves the profile’s SafeBrowsingNetworkContext and asks it to send the report. This imposes two additional IPCs per SCT report (one to notify the embedder of the pending report, and one to send the report), but the data included in the IPC is minimal (just the cache key).
However, there doesn’t seem to be any need to use the SafeBrowsingNetworkContext rather than the SystemNetworkContext; the SafeBrowsingNetworkContext is useful particularly when we want to send Safe Browsing cookies along with the request (which are unnecessary in this context). Using the SystemNetworkContext simplifies the design by reducing cross-process communication and state.
We could consider plumbing SCTs out of the network stack to the embedder, and the embedder reports them. We rejected this design for performance reasons: IPCing data on every network request is expensive. Even though SCTs are fairly small, we want to report the certificate as well, and for performance reasons we only send certificate chains to the embedder when explicitly requested (i.e., for main-frame navigation requests or when DevTools is enabled).
An earlier design had all reporting logic live within the NetworkContext, with the embedder only getting involved to enable/disable reporting and set an endpoint. This design had two shortcomings: (1) reports could only be de-duplicated within a NetworkContext, not across NetworkContexts, and (2) reports could be sent only by the NetworkContext that made the connection that generated the report. Instead, we chose to cache reports in the NetworkService for better de-duplication, and to let the embedder decide which NetworkContext is used to send the reports.
An earlier design notified the embedder when there was a report ready to be sent via a URLLoaderThrottle. This was rejected because not all certificate verifications happen via URLLoader requests. For example, the nascent WebTransport API allows websites to create QUIC sockets directly. This design also had the drawback that separate throttle implementations were needed for navigation requests versus subresource requests, since the latter go to the renderer directly without passing through the browser process.
This isn’t a user-facing feature, so there’s no success metric per se. A steady stream of received SCT reports will indicate that the feature is working as expected (exact volume TBD depending on how aggressively we throttle them). We can also track the % of SCTs that Chrome observes in a given time period that are audited via this mechanism. This will always be a small fraction because we’ll have to sample, but higher is better.
Our regression metrics will be standard heartbeat metrics like crash rate and memory consumption. We don’t expect this feature to affect metrics like navigations or retention because it’s not user-facing.
As described above, the feature will be gated by a Finch feature with a parameter that controls how aggressively reports are throttled.
TODO(cthomp): Describe new metrics we want to add to Chrome for helping us make decisions about sampling rate.
WIP Brainstorming metrics we might want to have:
Standard Finch-based rollout. Unlike a standard Finch feature, we plan to leave the Finch flag around indefinitely so that we can use the parameter to throttle reports as needed.
This feature introduces new network requests, which could impact speed by consuming sockets, bandwidth, etc. However, these requests are non-blocking, only enabled for a subset of users (Safe Browsing Extended Reporting users), and throttled to affect only a small percentage of connections for those users. There is one additional IPC (with minimal associated data) associated with each such connection. We therefore don’t expect this feature to have a significant impact on Chrome speed overall.
The goal of this feature is to improve user security by discovering misbehaving CT logs that do not log certificates that they are supposed to log. However, there are a variety of reasons why, in practice, an attacker might still succeed in preventing a malicious certificate from being published in CT logs:
We plan to address some of these limitations in future versions of this feature (for example, by developing a privacy-preserving auditing mechanism so that we don’t have to limit auditing to SBER users, and by persisting and retrying SCT reports on failure).
SCT auditing will likely always be circumventable by a sufficiently determined or lucky attacker. For example, we might always have to choose a random sample of SCTs to audit, or an attacker might be able to indefinitely prevent a victim’s device from reporting a malicious SCT. Even so, this SCT auditing strategy raises the cost of performing an undetected attack. For example, an attacker would need to maintain a long-term MITM instead of a transient one, and would need to maintain awareness of Chrome-specific implementation characteristics and tailor their attack specifically to counter them. In addition, SCT auditing will catch log misbehavior due to accident or incompetence, overall improving the health and strength of the CT log ecosystem.
Because we will be checking SCTs for inclusion in a table of valid SCTs maintained by the CT team, we are trusting Google to identify log misbehavior. Ordinarily, trusting Google to verify security properties on behalf of Chrome clients would be obviously acceptable. However, it bears further discussion in this case because one of the goals of this project is to remove trust in Google’s CT logs (the “One Google” log requirement, as discussed in the Google-internal PRD). We may end up replacing the One Google requirement with an SCT auditing system that relies on the same Google team to determine what is a valid SCT. This is acceptable because the One Google requirement is not being removed out of distrust of the Google CT team or Google infrastructure. Instead, the point is primarily to remove Google from the critical path of certificate issuance, and to demonstrate that CT can be deployed in other browsers without a dependency on Google.
No specific stability concerns
This feature is not user-facing and therefore doesn’t increase the user-facing complexity of the product.
SCTs and certificate chains are a rough proxy for browsing history. This information will be sent to Google, but only for Safe Browsing Extended Reporting users who have explicitly opted in to sharing browsing data with Google for security purposes.
Third-party logs don’t receive any information about Chrome users’ browsing history because we query Google-operated mirrors of CT data instead of querying the logs directly. Even if we decide to fall back to querying logs directly, we would do it periodically, batching all clients’ browsing history together over periods of at least an hour to avoid leaking any individual user’s history.
Because pending SCT reports contain information about the user’s browsing history, the cache of pending reports will be cleared when the user clears browsing data. (This is overly conservative from a privacy perspective, as we will be clearing reports generated by all profiles, not just the one for which browsing data is being cleared.) When we later implement Persistence and retry in hopes of achieving better security properties, we may want to refine this strategy to only clear reports for the relevant profile.
As discussed in Reporting in the network service, reports will be deduplicated across the entire browser. This opens up the opportunity for profiles to influence each other and potentially leak information. For example, if one profile’s network context enqueues a report to some website but learns from the NetworkService that the report is redundant, then that network context “knows” that the user visited that website in another profile. However, reports are always sent from the Safe Browsing network context for the profile that issued the connection that triggered the report, so data should not become intermingled among network contexts for different profiles.
This project shouldn’t require any special testing considerations. It introduces no new UI and should be fully covered by automated tests.