HTTP/2 Receive Window Auto-Tuning
Eric Anderson
gRPC is an RPC system intended for wide deployment: in data centers, on mobile, and on consumer machines. It uses HTTP/2 as its transport mechanism, mapping each RPC onto a different HTTP/2 stream.
TCP has a receive window that is used for flow control. Flow control “pushes back” on a sender when the receiver is slower than the sender, preventing the receiver from over-buffering and exhausting its memory.
Flow control requires a window of an appropriately chosen size: too large a window can waste memory by orders of magnitude, while too small a window can reduce throughput by orders of magnitude. Unfortunately, multiple aspects of HTTP/2 defeat TCP’s flow control, so it needs to be reimplemented at the HTTP/2 level.
Note that congestion control needs no discussion here: TCP’s congestion control operates without issue under HTTP/2.
While users should likely be permitted to manually adjust the window size, most users should be able to use gRPC without choosing a window size, much as they would use TCP. Because of the wide variety of environments, and because those environments change over time, this necessitates auto-tuning the window size based on observed network behavior.
(Note: The approach here isn’t exactly what is being implemented; it is conceptually the same though. The most important change is bufferbloat avoidance/detection.)
Use HTTP/2 PING frames to measure the bandwidth-delay product (BDP) for auto-scaling. We would not measure round-trip time (RTT) or bandwidth, but rather the BDP directly.
When receiving a DATA frame, send a PING frame if one isn’t already outstanding (i.e., still awaiting its ACK). Keep a counter of all DATA payload bytes received from that point forward (including the just-received frame). When the PING ACK arrives, note the counter’s current value. Call that value the observed BDP (OBDP). The OBDP cannot be larger than the current connection window size.
If the OBDP is less than 2/3 the current connection window size, then the current window size remains unmodified. If the OBDP is greater than 2/3 the current connection window size, then the connection window size will become 2 * OBDP. The connection window size will be initialized to 64 KiB.
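A minimal sketch of this scheme in Go follows. The type and method names are invented purely for illustration (they do not come from any gRPC implementation), and the upper bound anticipates the limit suggested later in this document.

```go
package main

import "fmt"

// bdpEstimator sketches the receiver-side logic described above. All names
// are illustrative and not taken from any actual gRPC implementation.
type bdpEstimator struct {
	window    uint32 // current connection-level flow-control window
	maxWindow uint32 // hard upper bound (see the limit discussed below)
	pingSent  bool   // is a BDP-measuring PING currently awaiting its ACK?
	sample    uint32 // DATA payload bytes counted since that PING was sent
}

func newBDPEstimator() *bdpEstimator {
	return &bdpEstimator{
		window:    64 * 1024,       // proposed initial connection window
		maxWindow: 4 * 1024 * 1024, // suggested 2-4 MiB cap against abuse
	}
}

// onData is called for each received DATA frame. It returns true when a PING
// frame should be sent to begin a new BDP measurement.
func (b *bdpEstimator) onData(payloadLen uint32) (sendPing bool) {
	if b.pingSent {
		b.sample += payloadLen
		return false
	}
	b.pingSent = true
	b.sample = payloadLen // count the just-received frame as well
	return true
}

// onPingAck is called when the PING ACK arrives. It returns the (possibly
// unchanged) connection window size to advertise.
func (b *bdpEstimator) onPingAck() uint32 {
	b.pingSent = false
	obdp := b.sample
	if obdp > b.window {
		obdp = b.window // the OBDP cannot exceed the current window
	}
	if obdp > b.window*2/3 {
		newWindow := 2 * obdp
		if newWindow > b.maxWindow {
			newWindow = b.maxWindow
		}
		b.window = newWindow
	}
	return b.window
}

func main() {
	e := newBDPEstimator()
	e.onData(60 * 1024)        // DATA arrives; a PING would be sent here
	fmt.Println(e.onPingAck()) // 122880 (120 KiB): the sample exceeded 2/3 of 64 KiB
}
```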
We measure DATA payload bytes rather than raw bytes because DATA payload bytes are the only bytes subject to the flow control window.
This approach is RTT-sensitive and overly aggressive, and it may fight with TCP congestion control (CUBIC) and TCP slow start. But it is simple, better than a hard-coded value, and can be iterated on.
It takes 4 RTTs to go from 64 KiB to 1 MiB, since the window can at most double per measurement round trip.
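A quick sketch of that best-case growth (assuming each measurement observes a full window, so OBDP equals the current window):

```go
package main

import "fmt"

func main() {
	// Best case for the scheme above: every measurement observes a full
	// window, so the connection window doubles once per round trip.
	window, rtts := 64*1024, 0
	for window < 1024*1024 {
		window *= 2
		rtts++
	}
	fmt.Println(rtts) // 4 (64 -> 128 -> 256 -> 512 -> 1024 KiB)
}
```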
It is trivial for an attacker to emulate an arbitrarily large BDP, so a limit must be in place. A default limit of 2-4 MiB is suggested.
For certain platforms, we could possibly query and make use of TCP statistics made available by the OS. For example, getsockopt with SO_RCVBUF or TCP_INFO on Linux.
(Note: while I doubted TCP_INFO’s usefulness when writing this, we were later able to show it more clearly to be unhelpful. Basically, you need the remote’s TCP_INFO.)
These could be considered optimizations on certain platforms and could possibly reduce the avenue for attack, but some minimum level of auto-tuning independent of the OS is still necessary.
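For illustration only, and subject to the caveat above that the local statistics turned out not to be very useful: a sketch of reading SO_RCVBUF and TCP_INFO for an established connection on Linux, using golang.org/x/sys/unix.

```go
package main

import (
	"fmt"
	"net"

	"golang.org/x/sys/unix"
)

// inspectTCP prints local TCP statistics for an established connection.
// Note these are the *local* statistics; as noted above, the remote side's
// TCP_INFO is what would really be needed.
func inspectTCP(conn *net.TCPConn) error {
	raw, err := conn.SyscallConn()
	if err != nil {
		return err
	}
	return raw.Control(func(fd uintptr) {
		// Kernel-chosen receive buffer size for the socket.
		if rcvbuf, err := unix.GetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_RCVBUF); err == nil {
			fmt.Println("SO_RCVBUF:", rcvbuf)
		}
		// Per-connection statistics, e.g. the smoothed RTT in microseconds.
		if info, err := unix.GetsockoptTCPInfo(int(fd), unix.IPPROTO_TCP, unix.TCP_INFO); err == nil {
			fmt.Println("srtt (us):", info.Rtt)
		}
	})
}

func main() {
	conn, err := net.Dial("tcp", "example.com:80")
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	if err := inspectTCP(conn.(*net.TCPConn)); err != nil {
		panic(err)
	}
}
```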
Now that BBR is public, we could investigate adapting its approach to HTTP/2. BBR adjusts its congestion window based on measured bandwidth and latency, instead of relying heavily on packet loss; this corresponds reasonably well with what we are able to measure. We would also need to investigate whether a pseudo-BBR (something of our creation) behaves well on top of BBR-enabled TCP, or whether one BBR may interfere with the other. Understand that this work would be non-trivial because BBR operates at the sender, while we need the logic at the receiver.
HTTP/2 multiplexes multiple independent streams onto one TCP connection. This causes no problems for congestion control, but since streams are independent, they may be consumed at different speeds and thus need independent flow control. HTTP/2 supports window-based flow control similar to TCP’s. To manage that flow control, HTTP/2 has control frames that are sent on the same TCP connection as the data. In order to prevent deadlock, an HTTP/2 receiver must consistently drain the TCP connection, which prevents utilizing TCP’s flow control for the connection. Thus HTTP/2 also has its own connection-wide flow control.
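To make the two layers concrete, here is a minimal, illustrative sketch of receiver-side window accounting; the frame names are HTTP/2’s, while the struct and method names are invented for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// recvFlowControl tracks what an HTTP/2 receiver must account for: one
// connection-wide window plus an independent window per stream.
type recvFlowControl struct {
	connWindow    int
	streamWindows map[uint32]int
}

// onData charges a received DATA frame's payload against both the stream's
// window and the connection's window.
func (fc *recvFlowControl) onData(streamID uint32, payloadLen int) error {
	if payloadLen > fc.connWindow || payloadLen > fc.streamWindows[streamID] {
		return errors.New("FLOW_CONTROL_ERROR: sender overran a window")
	}
	fc.connWindow -= payloadLen
	fc.streamWindows[streamID] -= payloadLen
	return nil
}

// onConsumed is called once the application has read n bytes from a stream;
// the receiver returns the credit with WINDOW_UPDATE frames for the stream
// and for the connection (stream 0).
func (fc *recvFlowControl) onConsumed(streamID uint32, n int) {
	fc.streamWindows[streamID] += n
	fc.connWindow += n
	fmt.Printf("send WINDOW_UPDATE(stream=%d, %d) and WINDOW_UPDATE(stream=0, %d)\n",
		streamID, n, n)
}

func main() {
	fc := &recvFlowControl{
		connWindow:    65535, // HTTP/2 default initial window
		streamWindows: map[uint32]int{1: 65535},
	}
	if err := fc.onData(1, 16384); err != nil {
		panic(err)
	}
	fc.onConsumed(1, 16384)
}
```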
As is well-understood for TCP, for optimal throughput the receive window must be >= bandwidth * delay of the connection, also known as the bandwidth-delay product (BDP). Delay is typically measured as the “round-trip time” (RTT): the time after sending a message before a response could arrive, due to the time it takes for the message to travel through the network. That is, the time it takes for the message to travel from the sender to the receiver plus the time it takes for the response to travel back. Calculating the BDP from the bandwidth and delay is simply a multiplication with units (as done in stoichiometry in chemistry): a 100 Mbit/s Ethernet connection may have a 1 ms RTT, so 100 Mb/s * 1 ms = 100 Kb/ms * 1 ms = 100 Kbit, or a 12.5 KB BDP.
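The same unit arithmetic as a tiny sketch:

```go
package main

import "fmt"

func main() {
	// BDP = receive bandwidth * round-trip time.
	bandwidth := 100_000_000.0 // 100 Mbit/s, in bits per second
	rtt := 0.001               // 1 ms, in seconds
	bdpBits := bandwidth * rtt
	fmt.Println(bdpBits/8/1000, "KB") // 12.5 KB
}
```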
A rough way to conceptualize that 12.5 KB is that if the sender is constantly transmitting, at any point in time there would be 6.25 KB of data “in the network” from the sender to the receiver and 6.25 KB of requests for more data in the opposite direction. Neither the sender nor the receiver would be buffering that data, as it is effectively buffered inside the network itself as electrons flowing through wires.
Although it is easier to conceptualize the inbound and outbound network as having the same bandwidth and delay, virtually all consumer Internet access is non-symmetric. Thus, to be precise, the bandwidth is the receive bandwidth. Although the delays in each direction may be different, the RTT of a connection includes the delay of both the inbound and outbound network. It can be useful to realize that the sending and receiving paths may be completely different, as receiving could use a high-bandwidth satellite connection, but due to the cost of transmitting to a satellite, sending could use a low-bandwidth modem.
HTTP/2 specifies that receivers’ windows default to 64 KB, but it does not specify any preferred algorithm for determining a window size. 64 KB is a reasonable value for most website serving to consumers, but in data centers a value of 1 MB is easily required. Choosing too low a value wastes bandwidth, and choosing too high a value wastes memory. Choosing 64 KB for a data center network with a 1 MB BDP would limit throughput to about 6% of what is available (64 KB / 1 MB ≈ 6%). Since there can easily be 100 or 1000 streams on one connection, choosing 1 MB when 64 KB was appropriate could waste 1 GB of memory for no benefit.
Unfortunately, the BDP varies over time due to the behavior of other users of the network, fiber cuts, wireless signal strength, and many other factors. The BDP also varies per connection, since a single server may simultaneously be communicating with a mobile user, another server in the same rack, and a server on a different continent.
Although a fully “optimal” solution does not exist for all environments, window sizing is a solved problem as far as most TCP users are concerned: the default TCP behavior of their OS is “good enough” for their needs.