Multiplexed Flow Control

Issues:

  1. Multiple streams on a multiplexed ingress connection to a proxy server may have different egress rates. If a stream’s ingress rate exceeds its egress rate, then the proxy server needs to buffer data, somehow instruct the sender not to send more data for that stream, or stop reading from the connection, which will block all streams.
  2. Flow control should, in the absence of other factors such as memory pressure, allow for maximal ingress throughput unless egress links cannot keep up, regardless of whether the connection has a single stream or many. Yet we still want to maintain reasonable memory consumption per connection.
  3. Kernel-based TCP implementations require user space applications to leave data unread (sitting in kernel buffers) in order to cause the TCP flow control window to fill up. It’s better to have that data in user space so that the application can examine it and perhaps use it (e.g. a proxy can forward data onward). On the other hand, it may be more difficult in user space to know the appropriate buffer sizes for connections (allow too much and you have memory issues, allow too little and you may not saturate the link).

(1)

Let’s say we have a client sending two requests over a multiplexed connection to a proxy server with multiple backends.  Imagine a scenario like so:

 +----------+                       +---------+                   +------------+
 |          |                       |         |                   |            |
 |          |                       |         | Request 1         |            |
 |          |                       |         | +---------------> |  Backend X |
 |          |                       |         |                   |            |
 |          |                       |         |                   |            |
 |          | Request 1, Request 2  |         |                   +------------+
 |   User   | +------------------>  |  Proxy  |
 |          |                       |         |
 |          |                       |         |                   +------------+
 |          |                       |         |                   |            |
 |          |                       |         | Request 2         |            |
 |          |                       |         | +---------------> |  Backend Y |
 |          |                       |         |                   |            |
 |          |                       |         |                   |            |
 +----------+                       +---------+                   +------------+

The two streams will have the same ingress rate at the proxy server, but the egress rates from the proxy server to its backends may differ and may be slower than the ingress rate from the client. For the sake of discussion, let’s assume the proxy => backend Y egress rate is slower than the ingress rate of an individual request/stream at the proxy, but the proxy => backend X egress rate exceeds the ingress rate. When that happens, the proxy has a few options: buffer the excess data for request 2, stop reading from the client connection entirely, or somehow tell the client to stop sending data for request 2.

Preventing requests 1 and 2 from being able to make progress independently is rather unacceptable, but it’s likewise unacceptable to have unbounded memory consumption. The obvious solution is to introduce per-stream flow control windows to bound memory consumption on a per-stream basis.
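
To make that concrete, here is a minimal sketch (in Go, with assumed names and a hypothetical 64 KB initial window) of per-stream receive window accounting at the proxy: data arriving on a stream consumes that stream’s window, and the window is only replenished, via a WINDOW_UPDATE-style frame, once the proxy has forwarded the data to the backend.

  // Sketch of per-stream receive-window accounting at a proxy (assumed names).
  package main

  import (
    "errors"
    "fmt"
  )

  // stream tracks how many more bytes the peer may send on one stream before
  // it must wait for a window update.
  type stream struct {
    id         uint32
    recvWindow int
  }

  var errFlowControl = errors.New("peer overflowed the stream receive window")

  // onData is called when a DATA frame for this stream arrives from the
  // client. At most recvWindow bytes are ever buffered for the stream, so
  // per-stream memory stays bounded even if this stream's backend is slow.
  func (s *stream) onData(n int) error {
    if n > s.recvWindow {
      return errFlowControl // sender ignored the advertised window
    }
    s.recvWindow -= n
    return nil
  }

  // onForwarded is called once n bytes have been written to the backend and
  // freed from the proxy's buffer; the return value is the delta to advertise
  // in a WINDOW_UPDATE-style frame.
  func (s *stream) onForwarded(n int) int {
    s.recvWindow += n
    return n
  }

  func main() {
    s := &stream{id: 1, recvWindow: 64 * 1024} // assumed initial window
    _ = s.onData(16 * 1024)                    // client sends 16 KB
    delta := s.onForwarded(16 * 1024)          // backend drains it
    fmt.Printf("send WINDOW_UPDATE(stream=%d, delta=%d)\n", s.id, delta)
  }

A slow backend simply stops freeing buffer space, so the client is throttled on that stream alone while other streams on the same connection keep flowing.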

Note that, to avoid roundtrips spent opening the stream flow control window wide enough to saturate the ingress link if desired, it’s necessary to have a “reasonable” initial stream flow control window size. There can be a default for all connections, and a control frame can tweak this initial stream flow control window size[1]. It’s important to note that, given a stream window update control message, there’s no requirement that all streams have the same target window size. They need only share the same initial window size; future window updates[2] for a stream can be controlled independently of other streams, so each stream can have differently sized windows / buffer allocations.
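
As an illustration of that last point, a receiver might (this is an assumed policy, not anything SPDY specifies) size each stream’s target window around that stream’s observed egress rate times the round trip time, keeping streams bound for slow backends on small windows while giving streams bound for fast backends large ones:

  // Assumed policy sketch: size each stream's target window around its own
  // egress rate times the RTT; nothing in SPDY mandates this.
  package main

  import (
    "fmt"
    "time"
  )

  // targetWindow returns roughly one round trip's worth of data at the rate
  // the stream's backend can drain it.
  func targetWindow(egressBytesPerSec float64, rtt time.Duration) int {
    return int(egressBytesPerSec * rtt.Seconds())
  }

  func main() {
    rtt := 100 * time.Millisecond
    fmt.Println("stream to fast backend X:", targetWindow(10e6, rtt)) // ~1 MB
    fmt.Println("stream to slow backend Y:", targetWindow(1e5, rtt))  // ~10 KB
  }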

(2)

As explained in (1), the difference in egress rates from a proxy for corresponding incoming stream data may lead to memory pressure issues, which stream flow control helps address by allowing the proxy to specify memory constraints (via window sizes) on a per-stream basis. However, since a multiplexed protocol often wants to support a large number of streams[3], the effective receive window for the session is the per-stream window size * max number of streams. We want a single stream to be able to saturate the ingress link at the proxy, which means that the window needs to be at least the bandwidth-delay product (BDP). In the absence of any other flow control mechanism, such as may be provided by the transport layer[4], that means the effective receive window could be ingress BDP * max streams, which is too much. Providing a session flow control window addresses this, and adds an important second axis of flow control.
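
For a rough sense of scale (the link speed, rtt, and stream limit below are hypothetical, not taken from anything above): a 100 Mbit/s ingress link with a 100 ms round trip needs a per-stream window of roughly 1.25 MB for one stream to saturate the link, and with 100 streams the worst case per connection is on the order of 125 MB unless a session window caps it.

  // Hypothetical numbers only: worst-case buffering per connection without a
  // session window is the per-stream window times the stream limit.
  package main

  import "fmt"

  func main() {
    const (
      linkBytesPerSec = 12_500_000 // assumed 100 Mbit/s ingress link
      rttSec          = 0.1        // assumed 100 ms round trip
      maxStreams      = 100        // assumed per-connection stream limit
    )
    bdp := int(linkBytesPerSec * rttSec) // window one stream needs to saturate the link
    fmt.Printf("per-stream window (BDP): %d bytes\n", bdp)                         // 1,250,000
    fmt.Printf("worst case without a session window: %d bytes\n", bdp*maxStreams)  // 125,000,000
  }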

It’s important to recognize that stream and session flow control serve two separate roles. Session flow control helps manage the memory consumption per session, whereas stream flow control helps manage what portion of that memory an individual stream is allowed to consume. And to reiterate, different streams can receive different window updates, so different streams can consume different amounts of the available memory per session.
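
A minimal sketch of the two axes working together (window sizes are assumed for illustration): every incoming DATA frame must fit within both its stream’s window and the session’s window, and the two are replenished independently.

  // Sketch of the two axes of receiver-side accounting (assumed window sizes).
  package main

  import (
    "errors"
    "fmt"
  )

  type session struct {
    recvWindow    int            // bounds total buffered bytes for the connection
    streamWindows map[uint32]int // bounds each stream's share of that memory
  }

  func (c *session) onData(streamID uint32, n int) error {
    sw, ok := c.streamWindows[streamID]
    if !ok {
      return errors.New("unknown stream")
    }
    if n > sw || n > c.recvWindow {
      return errors.New("flow control violation")
    }
    c.streamWindows[streamID] = sw - n
    c.recvWindow -= n
    return nil
  }

  func main() {
    c := &session{
      recvWindow:    256 * 1024, // assumed session window
      streamWindows: map[uint32]int{1: 64 * 1024, 3: 64 * 1024},
    }
    fmt.Println(c.onData(1, 64*1024)) // <nil>: consumes all of stream 1's window
    fmt.Println(c.onData(1, 1))       // error: stream 1 must wait for a window update
  }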

(3)

As noted previously, it’s possible to use either the transport flow control mechanism or introduce a session flow control mechanism. We’ll assume TCP here, since it’s the most common reliable transport. Since most TCP stacks are kernel based, using TCP flow control for this purpose is suboptimal in a few ways. From the TCP stack’s perspective, once data has been read() into user space, it’s fair game to reopen the TCP rwin. That means that relying on TCP flow control to prevent excessive memory consumption by the application requires leaving data sitting in kernel socket buffers, even if those buffers contain multiple frames of application data that the application may want to process. For example, the receive buffer may contain control frames, or it may have data frames that could be forwarded on to the next hop. Also, since kernel socket buffers are quite large, it is more difficult for applications to control memory usage precisely.
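
The head of line blocking consequence is easy to see in a sketch of a receive loop that relies on TCP flow control alone (the helper names here are assumptions, not from any real implementation): the only lever is to stop calling Read(), which stalls every stream on the connection, and any data already read has already reopened the TCP receive window.

  // Sketch of a receive loop that relies on TCP flow control alone.
  package main

  import (
    "io"
    "net"
  )

  // readLoop drains a multiplexed connection only while the application is
  // willing to buffer more. When overLimit reports the buffer is full, the
  // loop waits; unread bytes then back up in the kernel socket buffer and the
  // TCP receive window closes -- for every stream on the connection, not just
  // the slow one. Anything already read has already reopened the window, so
  // it must be covered by application-level flow control instead.
  func readLoop(conn net.Conn, overLimit func() bool, resume <-chan struct{}, handle func([]byte)) error {
    buf := make([]byte, 32*1024)
    for {
      if overLimit() {
        <-resume // blocked: cannot selectively pause a single stream this way
      }
      n, err := conn.Read(buf)
      if n > 0 {
        handle(buf[:n]) // now in user space: demux frames, forward onward
      }
      if err == io.EOF {
        return nil
      }
      if err != nil {
        return err
      }
    }
  }

  func main() {}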

That said, predicting the future remains an unsolved problem, and so there is a limit to the optimality of any flow-control solution. The kernel has much better access to information (rtt, throughput, etc) for making decisions about flow control windows / buffers. While additional session flow control mechanisms enable servers to better manage memory and avoid head of line blocking under memory pressure, they also bring a very real risk of implementations advertising flow control windows that are too small to saturate links, thereby hurting performance[5]. One possible way to address this is adding another control frame to indicate that the sender is hitting the flow control window, in order to let the receiver know it should perhaps increase the windows if possible. It may be easier for servers to start out with smaller, more conservative window sizes, and then open them wider as needed. This of course costs roundtrips.
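
One way the “start small, then widen” policy might look (the blocked-style control frame and the doubling policy are assumptions for illustration, not part of SPDY):

  // Assumed "start small, then widen" policy: double the advertised window
  // whenever the peer signals, via a hypothetical blocked-style control frame,
  // that it stalled on flow control.
  package main

  import "fmt"

  const (
    initialWindow = 64 * 1024        // conservative default
    maxWindow     = 16 * 1024 * 1024 // cap to keep memory bounded
  )

  // nextWindow trades extra early round trips for bounded memory: each
  // blocked signal widens the window, up to the cap.
  func nextWindow(current int) int {
    if current*2 > maxWindow {
      return maxWindow
    }
    return current * 2
  }

  func main() {
    w := initialWindow
    for i := 0; i < 3; i++ {
      w = nextWindow(w)
      fmt.Println("peer reported blocked; widening window to", w)
    }
  }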


[1] SPDY provides SETTINGS frames that can do this.

[2] SPDY3 provides WINDOW_UPDATE frames that increment the existing stream receive window size by specified amounts. TCP specifies receive windows in absolute terms (the size of the receive window beyond the absolute sequence number in the acknowledgement field). SPDY does not have sequence numbers.

[3] SPDY supports an initial stream limit of 100.

[4] SPDY generally runs over a transport (TCP) with its own flow control windows.

[5] Google’s implementation of SPDY/3 flow control did this, since the initial per-stream receive windows were not large enough to allow saturating the link.