Buffer Sizing: Position Paper

Matt Mathis (mattmathis@google.com)
Andrew McGregor (amcgregor@fastly.com)

Stanford Buffer Sizing Workshop

Dec 2, 2019

The Punchline: Pace everything at scale

At the largest scales we cannot afford "properly" sized buffers

  • Large buffers will perpetually be doomed by Moore's law
  • Small buffers doom self clocked protocols
  • We need to change the end systems
    • Break self clock and packet conservation [VJ88]
    • Pacing at scale
    • BBR is a good start

My charge to this community: invert the question.

Given buffer sizes in key places are smaller than we would prefer, how can we maximize effective network capacity and efficiency?

Moore's law

Colloquially: the speed-complexity product doubles every 18 months.

Network link rates double every 2 years.

To maintain constant drain time:

  • Buffer speed has to double every 2 years
  • Buffer size has to double every 2 years
  • So the buffer speed-complexity product (speed × size) needs to quadruple every 2 years

But this is economically infeasible in the fastest parts of the Internet

So buffer drain times keep falling

  • Sub-millisecond drain times are becoming common (see the sketch below)
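
A back-of-the-envelope sketch of the squeeze (Python; the 10 Gb/s starting link and 12.5 MB buffer are illustrative assumptions, not numbers from the talk). If link rates double every 2 years but the buffer's speed-complexity budget can only double in the same period, drain time falls by a factor of √2 per generation:

    # Back-of-the-envelope: why drain times keep falling when the
    # buffer's speed-complexity budget grows more slowly than link
    # rates. The 10 Gb/s link and 12.5 MB buffer are assumed.

    link_rate_bps = 10e9          # link rate, doubles every 2 years
    buffer_bits = 12.5e6 * 8      # 12.5 MB buffer, i.e. 10 ms drain

    for year in range(0, 11, 2):
        drain_ms = buffer_bits / link_rate_bps * 1e3
        print(f"year {year:2d}: {link_rate_bps / 1e9:5.0f} Gb/s, "
              f"{buffer_bits / 8e6:6.1f} MB, drain {drain_ms:5.2f} ms")
        link_rate_bps *= 2        # link rate doubles every 2 years
        # Suppose the buffer's speed-complexity product can also only
        # double in that time, split evenly between speed and size,
        # so size grows by sqrt(2) instead of the 2x that a constant
        # drain time would require.
        buffer_bits *= 2 ** 0.5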

Why do we want large buffers?

  • Many reasons... but we dwell on one.

  • [VJ88] Design principles:
    • Packet conservation and TCP self clock
      • Vast majority of transmissions are triggered by ACKs
    • Explicitly stated: the entire TCP system is clocked by packets flowing through the bottleneck queue
    • This clearly works when buffer size > bandwidth-delay product (BDP)
    • But does this really work when the buffer size is only 1% of the BDP? (See the sketch below.)
      • The clock source (the bottleneck) does not have enough memory to significantly spread or smooth bursts
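
A quick worked example of what a 1%-of-BDP buffer means (Python; the 100 Gb/s bottleneck, 50 ms RTT, and 1500 B packets are assumed for illustration):

    # How much memory would a "properly" sized buffer need, vs. 1% of it?
    # Assumed illustration: 100 Gb/s bottleneck, 50 ms RTT, 1500 B MTU.

    link_rate_bps = 100e9
    rtt_s = 50e-3
    mtu_bits = 1500 * 8

    bdp_bits = link_rate_bps * rtt_s           # classic rule: buffer >= BDP
    small_buffer_bits = 0.01 * bdp_bits        # the 1%-of-BDP case
    drain_ms = small_buffer_bits / link_rate_bps * 1e3

    print(f"BDP:       {bdp_bits / 8e6:7.1f} MB "
          f"({bdp_bits / mtu_bits:,.0f} packets)")
    print(f"1% buffer: {small_buffer_bits / 8e6:7.1f} MB, "
          f"drain time {drain_ms:.1f} ms")
    # The 1% buffer holds ~4,000 packets but only 0.5 ms at line rate:
    # far too little memory to spread or smooth window-sized bursts.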

BBR: New first principles for congestion control

  • BBR builds an explicit model of the network
    • Estimate max_BW and min_RTT

  • The BBR core algorithm (sketched below):
    • By default pace at a previously measured max_BW
    • Dither the pacing rate to measure model parameters
      • Up to observe new max rates
      • Down to observe the min RTT
      • Gather other signals such as ECN

  • BBR's "personality" is determined by the heuristics used to dither the rates and perform the measurements
    • These heuristics are completely unspecified in the core algorithm
    • Relatively easy to extend or adapt
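
A toy sketch of that core loop in Python. The class, the 8-phase gain cycle, and the 1.25x/0.75x gains are illustrative (they echo BBR's ProbeBW gains), not the Linux BBR state machine:

    import time

    class ToyBBR:
        """Minimal sketch of the core loop above: keep a model
        (max_bw, min_rtt), pace at max_bw by default, and dither the
        pacing rate to refresh the model. Not the real BBR state
        machine (which, e.g., advances phases per min_rtt rather than
        per packet, and has separate startup/drain/probe-RTT states)."""

        def __init__(self, initial_bw_bps):
            self.max_bw = initial_bw_bps   # model: estimated bottleneck bw
            self.min_rtt = float("inf")    # model: estimated min RTT
            self.phase = 0
            self.gains = [1.25, 0.75, 1, 1, 1, 1, 1, 1]

        def on_ack(self, delivery_rate_bps, rtt_s):
            # Dithering up exposes new max rates; dithering down lets
            # the queue drain so a true min RTT can be observed.
            self.max_bw = max(self.max_bw, delivery_rate_bps)
            self.min_rtt = min(self.min_rtt, rtt_s)

        def send(self, packet_bits):
            # Pace: transmissions are timed by the local clock, not
            # triggered by returning ACKs (no self clock).
            rate = self.gains[self.phase] * self.max_bw
            self.phase = (self.phase + 1) % len(self.gains)
            time.sleep(packet_bits / rate)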

BBR TCP

  • TCP estimates max_BW (at far edge) and min_RTT (entire path)
  • Servers send at ~1 Mb/s per client
  • Traffic is smoother than Markov at some scales
    • Nominally no significant queues in the core
  • No loss in the core except true overload or pathological pacing synchronization (extremely unlikely)

[Figure: example path. A server (10 Gb/s) sends through a core switch (1 ms drain time, flow-pinned ECMP) onto one 100 Gb/s strand of a 1.2 Tb/s Link Aggregation Group (LAG), then through an access-edge router with large buffers and AQM, to a client (1 Mb/s). Assume a 50 ms RTT and that the return path batches or thins ACKs.]
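
Why paced traffic looks smooth in this scenario, using only the figure's numbers (a Python sketch):

    # Per-flow packet spacing when a server paces each client at
    # ~1 Mb/s across a 100 Gb/s core strand (numbers from the figure).

    client_rate_bps = 1e6          # per-client service rate
    strand_rate_bps = 100e9        # one strand of the LAG
    mtu_bits = 1500 * 8

    gap_ms = mtu_bits / client_rate_bps * 1e3
    flows_at_capacity = strand_rate_bps / client_rate_bps

    print(f"each paced flow sends one 1500 B packet every {gap_ms:.0f} ms")
    print(f"~{flows_at_capacity:,.0f} such flows fill the strand")
    # 100,000 independent flows, each contributing one packet every
    # 12 ms, superpose into a very smooth arrival process at the core
    # switch, hence "nominally no significant queues in the core".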

Self clock is not good in a short-queue Internet

  • Server rate bursts are delivered all the way to the far access edge
    • Where the bottleneck clocks the entire system
    • ACK thinning or compression causes persistent server rate bursts
      • e.g. WiFi and LTE channel arbitration
  • Concurrent bursts from 11 servers will cause queues in the core (see the sketch below)
  • Pathological ACK synchronization can cause loss at 2% load
  • The details of the burst structure come from weakly bound properties
    • Average window size, mechanisms that retime ACKs, etc
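
A rough sizing of the problem with the figure's numbers (Python; how long bursts actually overlap depends on the weakly bound burst structure noted above):

    # How much burst overlap a 1 ms core buffer can tolerate when ACK
    # thinning produces line-rate bursts (numbers from the figure).

    server_rate_bps = 10e9      # a burst leaves the server at line rate
    strand_rate_bps = 100e9     # one core strand
    drain_time_s = 1e-3         # switch buffer = 1 ms at strand rate
    n_servers = 11              # concurrent bursting servers (slide)

    buffer_bits = strand_rate_bps * drain_time_s
    excess_bps = n_servers * server_rate_bps - strand_rate_bps

    if excess_bps > 0:
        # While the bursts overlap, the queue grows at the excess rate.
        fill_ms = buffer_bits / excess_bps * 1e3
        print(f"{n_servers} overlapping {server_rate_bps / 1e9:.0f} Gb/s "
              f"bursts overfill the buffer after {fill_ms:.0f} ms of overlap")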

[Figure: same example path as the previous slide.]

Deprecating VJ88 has profound implications

  • 30 years of research on congestion control algorithms (CCAs) with self clock and packet conservation
    • Some things that we think we "know" are wrong
    • There might be gold in some ideas that were abandoned
    • Pretty much everything needs to be revisited
  • Conjectures:
    • BBR framework easily adapts to multiple modeling strategies
    • Most window-based CC algorithms have paced equivalents
    • Some CC algorithms fit even better (e.g. chirping)
    • 20 years of past CC work needs to be ported into BBR

See: Mathis & Mahdavi, "Deprecating the TCP Macroscopic Model" [CCR, Oct 2019]

Buffer Sizing Research questions

  • Ongoing improvements to BBR
    • Port and test every window-based CCA in BBR
    • Don't wait for BBR to be done before starting
  • Quantify the impact of bursty traffic on other traffic
    • What does it cost: buffer space, or extra headroom (wasted capacity)?
    • Can ISPs incentivize reducing bursty traffic?
  • Does pacing at servers simplify queue management at the edges?
  • Are there alternatives beyond pacing and self-clocked TCP?
  • Does application transaction smoothing help?
    • BBR natively restarts at the old max_BW. Should that decay?
  • Does ECMP still need flow pinning?
    • Paced packets are less likely to be reordered due to path diversity.
    • How much would it save us to discard flow pinning?

Conclusions

  • Moore's law squared dooms large buffers
  • Small buffers doom self clocked protocols
  • Some form of pacing is inevitable
    • BBR is a good start, but not done yet
    • Large content providers already have incentives
      • BBR solves real problems for them
  • Traffic statistics will change

Paced CUBIC is not a good solution

  • Pacing tests for available queue space every xx µs
  • Self clocked (bursty) cross traffic can cause transient full queues
    • Transient queues from different flows often interleave
  • For hypersensitive loss-based CCAs, pacing loses to self clock (see the sketch below)
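
A toy queue simulation of this mechanism (Python; the buffer depth, burst size, rates, and tick model are all illustrative assumptions). Bursty self-clocked cross traffic transiently fills a short queue, and the evenly spaced paced flow keeps sampling it, taking isolated losses even at modest average load:

    import random

    # Self-clocked cross traffic idles most ticks, then dumps a burst
    # bigger than the whole buffer, transiently filling the queue. The
    # paced flow probes the queue with evenly spaced packets and takes
    # isolated losses even though total load is only about 50%.

    random.seed(1)
    QUEUE_PKTS = 8                     # short buffer, in packets
    queue = 0
    offered = dropped = 0

    for tick in range(100_000):
        queue = max(0, queue - 1)      # link drains 1 packet per tick
        if random.random() < 0.02:     # bursty self-clocked cross traffic
            queue = min(QUEUE_PKTS, queue + 12)  # burst > whole buffer
        if tick % 4 == 0:              # paced flow: 1 packet per 4 ticks
            offered += 1
            if queue >= QUEUE_PKTS:
                dropped += 1           # isolated loss, its own backoff
            else:
                queue += 1

    print(f"paced flow loss rate: {100 * dropped / offered:.1f}% "
          f"({dropped} isolated events)")

One reading of the slide's claim: each of those isolated losses is a separate congestion event (a separate multiplicative decrease) for a hypersensitive loss-based CCA, while the self-clocked flow's own drops arrive clustered inside single bursts.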