CS 168, Summer 2025 @ UC Berkeley
Slides credit: Sylvia Ratnasamy, Rob Shakir, Peyrin Kao, Ankit Singla, Murphy McCauley
Datacenter Topology
Lecture 18 (Datacenters 1)
What is a Datacenter?
Lecture 18, CS 168, Summer 2025
Designing Datacenters
Congestion Control in Datacenters
What is a Datacenter?
Our model of the Internet so far: Your computer exchanges packets with a server.
Is YouTube really a single machine serving videos to the entire world? No.
[Diagram: your computer → home router → Internet → YouTube router → YouTube server]
What is a Datacenter?
In reality: Large applications (e.g. YouTube, Facebook) are hosted in datacenters.
Datacenter: A building of inter-connected machines, hosting applications.
Our goal: How do we build network infrastructure to connect machines in the datacenter?
[Diagram: your computer → home router → Internet → datacenter, a building full of Google routers and servers. Our goal: designing the network inside the datacenter.]
What is a Datacenter?
So far, we've thought about general-purpose networks.
Today, we'll think about datacenter networks.
Unlike general-purpose networks we've seen so far, datacenter networks have unique constraints and specialized solutions.
Our goal: Designing this.
Why are Datacenters Different?
What makes datacenter network infrastructure different?
Datacenter Traffic Patterns
High performance is critical because the datacenter has lots of internal traffic.
[Diagram: your computer ↔ Internet ↔ datacenter]
External traffic: between you (the user) and the datacenter.
Internal traffic: between servers in the datacenter.
Datacenter Traffic Patterns
High performance is critical because the datacenter has lots of internal traffic.
“Scaling Memcache at Facebook”, Rajesh Nishtala et al., USENIX NSDI 2013
Loading one popular page requires 521 internal loads on average.
99th percentile = 1740 loads!
Datacenter Traffic Patterns
Terminology:
North-south traffic flows between the datacenter and the Internet.
East-west traffic flows between servers inside the datacenter.
East-west traffic is several orders of magnitude larger than north-south traffic.
Datacenter Traffic Patterns
Datacenter traffic demand has increased significantly.
“Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network”, Arjun Singh et al. @ Google, ACM SIGCOMM’15
Inside a Real Datacenter
Lecture 18, CS 168, Summer 2025
Designing Datacenters
Congestion Control in Datacenters
Accessing an Application in a Datacenter
When you send a packet to the service (e.g. Facebook):
[Diagram: your computer → home router → ISP-owned infrastructure → peering → Facebook-owned WAN → Facebook datacenter]
Internet Infrastructure: Peering Locations
Peering locations host network interconnection infrastructure.
[Same diagram, highlighting the peering point between the ISP-owned and Facebook-owned infrastructure.]
Internet Infrastructure: Datacenters
Datacenters host the servers and application infrastructure.
[Same diagram, highlighting the datacenter at the Facebook end of the path.]
Internet Infrastructure: The Big Picture
Datacenters host the servers and application infrastructure.
Peering locations host network interconnection infrastructure.
Facebook's wide area network (WAN) connects its locations together.
[Diagram: multiple ISPs connect through peering locations to Facebook's WAN, which links several Facebook datacenters. Our special datacenter networks connect the servers inside each datacenter.]
Internet Infrastructure: Datacenters
Datacenter locations are often chosen based on practical constraints, such as the availability of power and cooling.
Infrastructure for cooling inside a Google datacenter.
Ultimately, a datacenter is just a building with a ton of servers. Our job is to build the network that connects the servers.
Inside a Datacenter: Racks
Inside a datacenter, servers are organized in physical racks.
Each unit fits 1–2 servers.
~40 units in each rack.
Inside a Datacenter: Top-of-Rack Switches
To connect all the servers in the same rack, we place a top-of-rack (ToR) switch in the rack and plug every server into it.
[Diagram: a ToR switch built around a forwarding chip; each server connects to the ToR switch over a 100 Gbps link, and the ToR switch has uplinks (shown in blue) to the rest of the network.]
Inside a Datacenter: Connecting Racks?
We built a rack of 40–80 servers, and connected all the servers in the rack.
Each datacenter has many racks.
Bisection Bandwidth
Lecture 18, CS 168, Summer 2025
Designing Datacenters
Congestion Control in Datacenters
Measuring Network Connectivity
Before we think about connecting racks, we need a way to measure connectedness.
Bisection bandwidth is a way to formally measure how connected a network is.
[Diagram: three example topologies, each connecting the same six routers R1–R6 with a different set of links.]
Bisection Bandwidth
To compute bisection bandwidth: partition the nodes into two halves of equal size, and add up the bandwidth of the links you must delete to disconnect the two halves. The bisection bandwidth is this total for the worst-case (minimum) partition.
Intuition: If the network is more connected, we have to delete more links to split it.
[Diagram: assuming each link is 1 Gbps, the three topologies above have bisection bandwidths of 9 Gbps, 3 Gbps, and 1 Gbps, respectively.]
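To make the procedure concrete, here is a minimal brute-force sketch (my own illustration, not from the slides): it tries every way to split the nodes into two equal halves and reports the split that cuts the least total link bandwidth.

from itertools import combinations

def bisection_bandwidth(nodes, links):
    # links: dict mapping (u, v) node pairs to link bandwidth in Gbps
    half = len(nodes) // 2
    best = float("inf")
    for group in combinations(sorted(nodes), half):
        left = set(group)
        # Total bandwidth of the links that cross this particular split.
        cut = sum(bw for (u, v), bw in links.items() if (u in left) != (v in left))
        best = min(best, cut)
    return best

# Example: six routers in a chain, each link 1 Gbps (an assumed topology for illustration).
chain = {("R1", "R2"): 1, ("R2", "R3"): 1, ("R3", "R4"): 1,
         ("R4", "R5"): 1, ("R5", "R6"): 1}
print(bisection_bandwidth(["R1", "R2", "R3", "R4", "R5", "R6"], chain))  # prints 1

Brute force is exponential in the number of nodes, so this only works for tiny examples like the ones on these slides.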
Bisection Bandwidth
Another equivalent definition: the bisection bandwidth is the maximum total rate at which hosts in one half of the network can send to hosts in the other half.
Intuition: This is the bottleneck bandwidth of the network.
Full Bisection Bandwidth
Full bisection bandwidth: all hosts in one partition can simultaneously send at their full link rate to the other partition.
Oversubscription measures how far the network falls short of this: the ratio of full bisection bandwidth to actual bisection bandwidth.
[Diagram: three example topologies.]
Example 1: full bisection bandwidth 10 Gbps, actual bisection bandwidth 1 Gbps, so the network is 10x oversubscribed.
Example 2: full bisection bandwidth 10 Gbps, actual bisection bandwidth 2 Gbps, so 5x oversubscribed.
Example 3 (all links 1 Gbps): full bisection bandwidth 4 Gbps, actual bisection bandwidth 2 Gbps, so 2x oversubscribed.
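A small sketch of the arithmetic, using the numbers from the three examples above:

def oversubscription(full_bisection_gbps, actual_bisection_gbps):
    # How many times more traffic the hosts could offer than the network can actually carry.
    return full_bisection_gbps / actual_bisection_gbps

print(oversubscription(10, 1))  # 10.0 (first example)
print(oversubscription(10, 2))  # 5.0  (second example)
print(oversubscription(4, 2))   # 2.0  (third example)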
Datacenter Topologies (Clos Networks)
Lecture 18, CS 168, Summer 2025
Designing Datacenters
Congestion Control in Datacenters
Maximizing Bisection Bandwidth
Back to our original problem: how do we connect many racks of servers while maximizing bisection bandwidth?
Topology #1: Big Switch
Problem 1: High switching speed needed.
Problem 2: Large radix. The switch needs a large number of (physical) ports.
[Diagram: every server connected directly to one large switch.]
Topology #1: Big Switch
A single big switch scales poorly.
via Urs Hölzle (Google) on LinkedIn:
"But what we needed was a 10,000 port switch that cost $100/port. So, almost exactly 20 years ago, we sent this 5-page RFP to four different switch vendors (IIRC: Cisco, Force10, HP, and Quanta) and tried to interest them in building such a switch. They politely declined because 'nobody is asking for such a product except you', and they anticipated margins to be low."
10K Gigabit Ethernet Switch Request for Proposal (2004): "This section attempts to explain why we believe it is possible to build a 10,000 port non-blocking switch for $100/port."
Google tried to build this. It didn't go so well.
Topology #2: Tree
A tree topology solves both problems: each switch needs only a modest switching speed and a small number of ports.
Problem: Low bisection bandwidth.
[Diagram: a tree of switches built from 100 Gbps links.]
Topology #3: Fat Tree
Increase link bandwidth at the top layer.
This can be used, but the links near the top of the tree must be faster (and more expensive) than commodity links.
[Diagram: 100 Gbps links at the lower layers, 300 Gbps links at the top layer.]
Clos Networks
Instead of custom-built switches, use commodity switches to build networks.
Clos Networks
High performance comes from creating many paths.
Ideally, any two hosts talking can use a dedicated path.
Folded Clos Networks
So far, we've drawn sender racks on the left, and recipient racks on the right.
In real networks, each rack could be both sender and recipient.
In a folded Clos network, traffic flows into the network, and then back out.
[Diagram: a Clos network with senders on the left and receivers on the right, and the equivalent folded Clos network in which traffic flows up into the network and back down.]
Topology #4: Fat Tree Clos Network
Idea: Combine the tree and the folded Clos network.
“A Scalable, Commodity Data Center Network Architecture”, Mohammad Al-Fares, Alexander Loukissas, Amin Vahdat, ACM SIGCOMM 2008
[Diagram: a k = 4 fat-tree Clos network with four pods (Pod 1–Pod 4), an edge layer, an aggregation layer, and a core layer.]
Topology #4: Fat Tree Clos Network
A k-ary fat tree has k pods.
Topology #4: Fat Tree Clos Network
Each switch has k links. Half connect up, and half connect down.
[Diagram, k = 4: each aggregation switch has 2 links up to the core and 2 links down to the edge layer; each edge switch has 2 links up to the aggregation layer and 2 links down to servers.]
Topology #4: Fat Tree Clos Network
A k-ary fat tree has (k/2)² core switches. Each core switch has k links, one down to each of the k pods.
[Diagram, k = 4: each core switch has 4 links down, one per pod.]
Topology #4: Fat Tree Clos Network
A k-ary fat tree has: k pods; k/2 edge switches and k/2 aggregation switches per pod; (k/2)² core switches; and (k/2)² servers per pod, for k³/4 servers in total.
[Diagram: a k = 6 fat tree with six pods (Pod 1–Pod 6), plus edge, aggregation, and core layers.]
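As a sanity check on these counts, here is a small sketch (assuming the standard k-ary fat-tree construction from the Al-Fares paper cited above) that computes the number of switches and servers for a given k:

def fat_tree_counts(k):
    # Switch and server counts for a k-ary fat tree (k must be even).
    assert k % 2 == 0
    pods = k
    edge_per_pod = k // 2             # each edge switch: k/2 links down to servers, k/2 up
    agg_per_pod = k // 2              # each aggregation switch: k/2 links down, k/2 up to core
    core = (k // 2) ** 2              # each core switch has one link down to every pod
    servers_per_pod = (k // 2) ** 2   # k/2 edge switches x k/2 servers each
    servers = pods * servers_per_pod  # = k^3 / 4
    return {"pods": pods, "edge": pods * edge_per_pod, "agg": pods * agg_per_pod,
            "core": core, "servers": servers}

print(fat_tree_counts(4))   # 4 pods, 8 edge, 8 agg, 4 core, 16 servers
print(fat_tree_counts(48))  # 48 pods, 1152 edge, 1152 agg, 576 core, 27648 servers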
Topology #4: Fat Tree Clos Network – Bisection Bandwidth
Achieves full bisection bandwidth.
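One way to see why (a sketch building on the counts above, assuming every link has the same rate): each pod holds (k/2)² servers and also has (k/2)² links up into the core (k/2 aggregation switches times k/2 core-facing links each), so every server in one half of the network can push its full link rate toward the other half.

# Sketch: per-pod uplink capacity matches per-pod server capacity in a k-ary fat tree.
def pod_capacity_balanced(k):
    servers_per_pod = (k // 2) ** 2       # k/2 edge switches x k/2 servers each
    core_uplinks_per_pod = (k // 2) ** 2  # k/2 aggregation switches x k/2 core-facing links each
    return servers_per_pod == core_uplinks_per_pod

assert all(pod_capacity_balanced(k) for k in (4, 6, 48))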
Topology #4: Fat Tree Clos Network – Racks
The servers and switches inside a pod can be organized into racks.
Pod
k = 48:
k/2 = 24 aggregation switches
k/2 = 24 edge switches
(k/2)² = 576 servers per pod
[Diagram: the pod's 576 servers organized as 12 racks of 48 servers each, plus the 48 aggregation and edge switches.]
Evolution of Clos Networks for DC
[Figure from the Jupiter Rising paper, ACM SIGCOMM 2015.]
Design Variants are Common
Congestion Control in Datacenters
Lecture 18, CS 168, Summer 2025
Designing Datacenters
Congestion Control in Datacenters
Why are Datacenters Different? (1/2) – Queuing Delay
Recall: Packet delay is the sum of 3 values.
Delay = Transmission + Propagation + Queuing
General Internet: distances are long, so propagation delay is significant and often dominates.
Datacenters: distances are short, so propagation delay is tiny and queuing delay dominates.
Why are Datacenters Different? (1/2) – Queuing Delay
Example: adding up the three delay components for a short path inside the datacenter (see the sketch below); queuing ends up dominating the total delay.
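As an illustrative sketch with assumed values (a 1500-byte packet, a 100 Gbps link, roughly 100 m of cable; none of these numbers come from the lecture):

# All numbers below are assumptions for illustration, not the lecture's values.
PACKET_BITS = 1500 * 8     # one 1500-byte packet
LINK_RATE = 100e9          # 100 Gbps datacenter link
DISTANCE_M = 100           # ~100 m of cable inside the building
SPEED = 2e8                # signal propagation speed in cable, ~2e8 m/s

transmission = PACKET_BITS / LINK_RATE   # 0.12 microseconds
propagation = DISTANCE_M / SPEED         # 0.5 microseconds
queuing = 50 * transmission              # assume the packet waits behind 50 queued packets

print(f"transmission = {transmission * 1e6:.2f} us")
print(f"propagation  = {propagation * 1e6:.2f} us")
print(f"queuing      = {queuing * 1e6:.2f} us")   # dominates the total

With these assumed numbers, even a modest queue adds several microseconds, an order of magnitude more than transmission and propagation combined.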
Why are Datacenters Different? (2/2) – Types of Flows
Problem: TCP deliberately tries to fill up queues.
The problem is worse in datacenters, where traffic is dominated by two types of flows: a few large, bandwidth-hungry flows that fill up queues, and many small, latency-sensitive flows (mice) that get stuck waiting behind them.
Datacenter Congestion Control
Congestion control in datacenters is special because a single operator controls both the hosts and the network (recall: datacenters are constrained environments), so custom protocols can be deployed.
Datacenter-specific congestion control algorithms include BBR, DCTCP, and pFabric, covered next.
Datacenter Congestion Control (1/3) – BBR
BBR (Google) reacts to delay instead of loss.
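A rough sketch of the idea (my own simplification; real BBR also has pacing gains and probing phases): instead of waiting for loss, the sender keeps running estimates of the bottleneck bandwidth and the minimum RTT, and sizes its sending rate and window from those.

# Simplified sketch of the BBR idea (not Google's actual implementation).
class BBRLikeSender:
    def __init__(self):
        self.max_bw = 0.0             # highest delivery rate observed (bytes/sec)
        self.min_rtt = float("inf")   # lowest round-trip time observed (seconds)

    def on_ack(self, delivery_rate, rtt):
        # Track the best view of the path: max bandwidth, min RTT.
        self.max_bw = max(self.max_bw, delivery_rate)
        self.min_rtt = min(self.min_rtt, rtt)

    def pacing_rate(self):
        return self.max_bw            # send at the estimated bottleneck rate

    def cwnd(self):
        # Keep roughly one bandwidth-delay product of data in flight.
        return self.max_bw * self.min_rtt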
Datacenter Congestion Control (2/3) – DCTCP
DCTCP (Microsoft, 2010) reacts to explicit congestion feedback (ECN marks) from switches, rather than waiting for packet loss.
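A minimal sketch of the sender logic, following the DCTCP paper's window-update equations (the class name and the gain value here are assumptions for illustration): switches mark packets when their queue exceeds a threshold, and the sender cuts its window in proportion to the fraction of marked packets instead of halving on every sign of congestion.

# Sketch of the DCTCP window update.
class DCTCPSender:
    def __init__(self, cwnd=10.0, gain=1 / 16):
        self.cwnd = cwnd     # congestion window (packets)
        self.alpha = 0.0     # running estimate of the fraction of marked packets
        self.g = gain        # EWMA gain

    def on_window_acked(self, acked, marked):
        frac_marked = marked / acked if acked else 0.0
        # alpha <- (1 - g) * alpha + g * F
        self.alpha = (1 - self.g) * self.alpha + self.g * frac_marked
        if marked:
            # Cut in proportion to congestion: cwnd <- cwnd * (1 - alpha / 2)
            self.cwnd *= (1 - self.alpha / 2)
        else:
            self.cwnd += 1   # otherwise grow like normal TCP, one packet per RTT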
Datacenter Congestion Control (2/3) – DCTCP Performance
We can measure performance with FCT (flow completion time): the time from when a flow starts sending until its last byte is delivered.
Our benchmark is the ideal FCT: the completion time the flow would achieve if it had the bottleneck link entirely to itself.
Normalized FCT = FCT / ideal FCT.
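A small sketch of the metric (the helper names and the example numbers are my own): the ideal FCT assumes the flow has the whole bottleneck link to itself, and the normalized FCT says how many times slower the flow actually was.

def ideal_fct(flow_size_bytes, link_rate_bps, rtt_s=0.0):
    # Best-case completion time: the flow uses the entire bottleneck link.
    return rtt_s + flow_size_bytes * 8 / link_rate_bps

def normalized_fct(measured_fct_s, flow_size_bytes, link_rate_bps, rtt_s=0.0):
    return measured_fct_s / ideal_fct(flow_size_bytes, link_rate_bps, rtt_s)

# Example (assumed numbers): a 1 MB flow on a 10 Gbps bottleneck that took 2.4 ms.
print(normalized_fct(2.4e-3, 1_000_000, 10e9))  # about 3x worse than ideal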
Datacenter Congestion Control (3/3) – pFabric
Goal: Give mice a way to skip to the front of the queue.
Packets carry a single priority number in the header.
Implementation: switches always transmit the highest-priority packet they hold, and when the (small) buffer fills up, they drop the lowest-priority packet.
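A minimal sketch of the switch side (assuming priority is set to the flow's remaining size, as in the pFabric paper; the class and capacity here are illustrative): the switch always dequeues the highest-priority packet and, when its small buffer fills up, evicts the lowest-priority one.

import heapq
import itertools

class PFabricQueue:
    # Sketch of a pFabric-style switch queue (smaller number = higher priority).
    def __init__(self, capacity=24):
        self.capacity = capacity          # pFabric switches keep very small buffers
        self.pkts = []                    # min-heap of (priority, seq, packet)
        self._seq = itertools.count()     # tie-breaker so equal priorities compare cleanly

    def enqueue(self, priority, pkt):
        heapq.heappush(self.pkts, (priority, next(self._seq), pkt))
        if len(self.pkts) > self.capacity:
            # Buffer full: evict the packet with the worst (largest) priority value.
            worst = max(range(len(self.pkts)), key=lambda i: self.pkts[i][0])
            self.pkts.pop(worst)
            heapq.heapify(self.pkts)

    def dequeue(self):
        # Always transmit the most urgent packet (e.g., a mouse flow's packet) first.
        return heapq.heappop(self.pkts)[2] if self.pkts else None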
Datacenter Congestion Control (3/3) – pFabric Performance
pFabric performs even better than DCTCP, and close to ideal!
Good example of network and host co-design.
But it is harder to realize in practice, since it requires full control over both hosts and switches.
Datacenter Congestion Control (3/3) – pFabric Performance
Why does pFabric work so well? Small, latency-sensitive flows are never stuck queued behind large flows, so they finish close to their ideal FCT.
Summary: Datacenters