Buffer Sizing: Position Paper

Matt Mathis (mattmathis@google.com)
Andrew McGregor (amcgregor@fastly.com)

Stanford Buffer Sizing Workshop

Dec 2, 2019

The Punchline: Pace everything at scale

At the largest scales we cannot afford "properly" sized buffers

  • Large buffers will perpetually be doomed by Moore's law
  • Small buffers doom self clocked protocols
  • We need to change the end systems
    • Break self clock and packet conservation [VJ88]
    • Pacing at scale
    • BBR is a good start

My charge to this community: invert the question.

Given buffer sizes in key places are smaller than we would prefer, how can we maximize effective network capacity and efficiency?

Moore's law

Colloquially: the speed-complexity product doubles every 18 months.

Network link rates double every 2 years.

To maintain constant drain time:

  • Buffer speed has to double every 2 years
  • Buffer size has to double every 2 years
  • So the buffer speed-complexity product (speed × size) needs to quadruple every 2 years

But this is economically infeasible in the fastest parts of the Internet

So buffer drain times keep falling

  • Sub-millisecond drain times are becoming common (see the sketch below)
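
A back-of-the-envelope sketch of the squeeze (Python; the 10 Gb/s starting link and 12.5 MB buffer are illustrative assumptions, not numbers from the talk). If link rates double every 2 years but the buffer's speed-complexity budget can only double in the same period, drain time falls by a factor of √2 per generation:

    # Back-of-the-envelope: why drain times keep falling when the
    # buffer's speed-complexity budget grows more slowly than link
    # rates. The 10 Gb/s link and 12.5 MB buffer are assumed.

    link_rate_bps = 10e9          # link rate, doubles every 2 years
    buffer_bits = 12.5e6 * 8      # 12.5 MB buffer, i.e. 10 ms drain

    for year in range(0, 11, 2):
        drain_ms = buffer_bits / link_rate_bps * 1e3
        print(f"year {year:2d}: {link_rate_bps / 1e9:5.0f} Gb/s, "
              f"{buffer_bits / 8e6:6.1f} MB, drain {drain_ms:5.2f} ms")
        link_rate_bps *= 2        # link rate doubles every 2 years
        # Suppose the buffer's speed-complexity product can also only
        # double in that time, split evenly between speed and size,
        # so size grows by sqrt(2) instead of the 2x that a constant
        # drain time would require.
        buffer_bits *= 2 ** 0.5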

Why do we want large buffers?

  • Many reasons... but we dwell on one.

  • [VJ88] Design principles:
    • Packet conservation and TCP self clock
      • Vast majority of transmissions are triggered by ACKs
    • Explicitly stated: the entire TCP system is clocked by packets flowing through the bottleneck queue
    • This clearly works when buffer size > bandwidth-delay product (BDP)
    • But does this really work when the buffer size is only 1% of the BDP? (See the sketch below.)
      • The clock source (the bottleneck) does not have enough memory to significantly spread or smooth bursts
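
A quick worked example of what a 1%-of-BDP buffer means (Python; the 100 Gb/s bottleneck, 50 ms RTT, and 1500 B packets are assumed for illustration):

    # How much memory would a "properly" sized buffer need, vs. 1% of it?
    # Assumed illustration: 100 Gb/s bottleneck, 50 ms RTT, 1500 B MTU.

    link_rate_bps = 100e9
    rtt_s = 50e-3
    mtu_bits = 1500 * 8

    bdp_bits = link_rate_bps * rtt_s           # classic rule: buffer >= BDP
    small_buffer_bits = 0.01 * bdp_bits        # the 1%-of-BDP case
    drain_ms = small_buffer_bits / link_rate_bps * 1e3

    print(f"BDP:       {bdp_bits / 8e6:7.1f} MB "
          f"({bdp_bits / mtu_bits:,.0f} packets)")
    print(f"1% buffer: {small_buffer_bits / 8e6:7.1f} MB, "
          f"drain time {drain_ms:.1f} ms")
    # The 1% buffer holds ~4,000 packets but only 0.5 ms at line rate:
    # far too little memory to spread or smooth window-sized bursts.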

BBR: New first principles for congestion control

  • BBR builds an explicit model of the network
    • Estimate max_BW and min_RTT

  • The BBR core algorithm (sketched below):
    • By default pace at a previously measured max_BW
    • Dither the pacing rate to measure model parameters
      • Up to observe new max rates
      • Down to observe the min RTT
      • Gather other signals such as ECN

  • BBR's "personality" is determined by the heuristics used to dither the rates and perform the measurements
    • These heuristics are completely unspecified in the core algorithm
    • Relatively easy to extend or adapt
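
A toy sketch of that core loop in Python. The class, the 8-phase gain cycle, and the 1.25x/0.75x gains are illustrative (they echo BBR's ProbeBW gains), not the Linux BBR state machine:

    import time

    class ToyBBR:
        """Minimal sketch of the core loop above: keep a model
        (max_bw, min_rtt), pace at max_bw by default, and dither the
        pacing rate to refresh the model. Not the real BBR state
        machine (which, e.g., advances phases per min_rtt rather than
        per packet, and has separate startup/drain/probe-RTT states)."""

        def __init__(self, initial_bw_bps):
            self.max_bw = initial_bw_bps   # model: estimated bottleneck bw
            self.min_rtt = float("inf")    # model: estimated min RTT
            self.phase = 0
            self.gains = [1.25, 0.75, 1, 1, 1, 1, 1, 1]

        def on_ack(self, delivery_rate_bps, rtt_s):
            # Dithering up exposes new max rates; dithering down lets
            # the queue drain so a true min RTT can be observed.
            self.max_bw = max(self.max_bw, delivery_rate_bps)
            self.min_rtt = min(self.min_rtt, rtt_s)

        def send(self, packet_bits):
            # Pace: transmissions are timed by the local clock, not
            # triggered by returning ACKs (no self clock).
            rate = self.gains[self.phase] * self.max_bw
            self.phase = (self.phase + 1) % len(self.gains)
            time.sleep(packet_bits / rate)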

BBR TCP

  • TCP estimates max_BW (at far edge) and min_RTT (entire path)
  • Servers send at ~1 Mb/s per client
  • Traffic is smoother than Markov at some scales
    • Nominally no significant queues in the core
  • No loss in the core except true overload or pathological pacing synchronization (extremely unlikely)

[Figure: example path. A server (10 Gb/s) sends through a core switch (1 ms drain time, flow-pinned ECMP) onto one 100 Gb/s strand of a 1.2 Tb/s Link Aggregation Group (LAG), then through an access-edge router with large buffers and AQM, to a client (1 Mb/s). Assume a 50 ms RTT and that the return path batches or thins ACKs.]
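
Why paced traffic looks smooth in this scenario, using only the figure's numbers (a Python sketch):

    # Per-flow packet spacing when a server paces each client at
    # ~1 Mb/s across a 100 Gb/s core strand (numbers from the figure).

    client_rate_bps = 1e6          # per-client service rate
    strand_rate_bps = 100e9        # one strand of the LAG
    mtu_bits = 1500 * 8

    gap_ms = mtu_bits / client_rate_bps * 1e3
    flows_at_capacity = strand_rate_bps / client_rate_bps

    print(f"each paced flow sends one 1500 B packet every {gap_ms:.0f} ms")
    print(f"~{flows_at_capacity:,.0f} such flows fill the strand")
    # 100,000 independent flows, each contributing one packet every
    # 12 ms, superpose into a very smooth arrival process at the core
    # switch, hence "nominally no significant queues in the core".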

Self clock is not good in a short-queue Internet

  • Server rate bursts are delivered all the way to the far access edge
    • Where the bottleneck clocks the entire system
    • ACK thinning or compression causes persistent server rate bursts
      • e.g. WiFi and LTE channel arbitration
  • Concurrent bursts from 11 servers will cause queues in the core (see the sketch below)
  • Pathological ACK synchronization can cause loss at 2% load
  • The details of the burst structure come from weakly bound properties
    • Average window size, mechanisms that retime ACKs, etc
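
A rough sizing of the problem with the figure's numbers (Python; how long bursts actually overlap depends on the weakly bound burst structure noted above):

    # How much burst overlap a 1 ms core buffer can tolerate when ACK
    # thinning produces line-rate bursts (numbers from the figure).

    server_rate_bps = 10e9      # a burst leaves the server at line rate
    strand_rate_bps = 100e9     # one core strand
    drain_time_s = 1e-3         # switch buffer = 1 ms at strand rate
    n_servers = 11              # concurrent bursting servers (slide)

    buffer_bits = strand_rate_bps * drain_time_s
    excess_bps = n_servers * server_rate_bps - strand_rate_bps

    if excess_bps > 0:
        # While the bursts overlap, the queue grows at the excess rate.
        fill_ms = buffer_bits / excess_bps * 1e3
        print(f"{n_servers} overlapping {server_rate_bps / 1e9:.0f} Gb/s "
              f"bursts overfill the buffer after {fill_ms:.0f} ms of overlap")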

[Figure: same example path as the previous slide.]

Deprecating VJ88 has profound implications

  • 30 years of research on congestion control algorithms (CCAs) with self clock and packet conservation
    • Some things that we think we "know" are wrong
    • There might be gold in some ideas that were abandoned
    • Pretty much everything needs to be revisited
  • Conjectures:
    • BBR framework easily adapts to multiple modeling strategies
    • Most window-based CC algorithms have paced equivalents
    • Some CC algorithms fit even better (e.g. chirping)
    • 20 years of past CC work needs to be ported into BBR

See: Mathis & Mahdavi, "Deprecating the TCP Macroscopic Model" [CCR, Oct 2019]

Buffer Sizing Research questions

  • Ongoing improvements to BBR
    • Port and test every window-based CCA in BBR
    • Don't wait for BBR to be done before starting
  • Quantify the impact of bursty traffic on other traffic
    • What does it cost: buffer space, or extra headroom (wasted capacity)?
    • Can ISPs incentivize reducing bursty traffic?
  • Does pacing at servers simplify queue management at the edges?
  • Are there alternatives beyond pacing and self-clocked TCP?
  • Does application transaction smoothing help?
    • BBR natively restarts at the old max_BW. Should that decay?
  • Does ECMP still need flow pinning?
    • Paced packets are less likely to be reordered due to path diversity.
    • How much would it save us to discard flow pinning?

Conclusions

  • Moore's law squared dooms large buffers
  • Small buffers doom self clocked protocols
  • Some form of pacing is inevitable
    • BBR is a good start, but not done yet
    • Large content providers already have incentives
      • BBR solves real problems for them
  • Traffic statistics will change

Paced CUBIC is not a good solution

  • Pacing tests for available queue space every xx µs
  • Self clocked (bursty) cross traffic can cause transient full queues
    • Transient queues from different flows often interleave
  • For hypersensitive loss-based CCAs, pacing loses to self clock (see the sketch below)
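
A toy queue simulation of this mechanism (Python; the buffer depth, burst size, rates, and tick model are all illustrative assumptions). Bursty self-clocked cross traffic transiently fills a short queue, and the evenly spaced paced flow keeps sampling it, taking isolated losses even at modest average load:

    import random

    # Self-clocked cross traffic idles most ticks, then dumps a burst
    # bigger than the whole buffer, transiently filling the queue. The
    # paced flow probes the queue with evenly spaced packets and takes
    # isolated losses even though total load is only about 50%.

    random.seed(1)
    QUEUE_PKTS = 8                     # short buffer, in packets
    queue = 0
    offered = dropped = 0

    for tick in range(100_000):
        queue = max(0, queue - 1)      # link drains 1 packet per tick
        if random.random() < 0.02:     # bursty self-clocked cross traffic
            queue = min(QUEUE_PKTS, queue + 12)  # burst > whole buffer
        if tick % 4 == 0:              # paced flow: 1 packet per 4 ticks
            offered += 1
            if queue >= QUEUE_PKTS:
                dropped += 1           # isolated loss, its own backoff
            else:
                queue += 1

    print(f"paced flow loss rate: {100 * dropped / offered:.1f}% "
          f"({dropped} isolated events)")

One reading of the slide's claim: each of those isolated losses is a separate congestion event (a separate multiplicative decrease) for a hypersensitive loss-based CCA, while the self-clocked flow's own drops arrive clustered inside single bursts.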