Discussion 12
CS 168, Summer 2025 @ UC Berkeley
Slides credit: Sylvia Ratnasamy, Rob Shakir, Peyrin Kao, Iuniana Oprescu
Multicast and Collectives 🎉
Logistics
Multicast
Implementing Multicast
Where are multicast protocols implemented? There are two options:
We'll study both options and their trade-offs.
Multicast can be implemented as Overlay Multicast (in the hosts) or as IP Multicast (in the routers). IP Multicast itself needs a host-to-router protocol (IGMP) and a routing protocol (DVMRP, CBT).
Multicast
Packet Delivery Models
We've seen four models for delivering a packet:
Broadcast: Send the packet to all destinations.
Unicast: Send the packet to exactly one destination.
Anycast: Send the packet to one of a set of possible destinations.
Multicast: Send the packet to all members of a group.
Some notes on multicast:
IP Multicast Model
Each multicast group has an IP address.
Hosts can send 3 different types of packets: Join, Leave, and Data.
Routers must know how to process Join, Leave, and Data packets from hosts.
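To make this model concrete, here is a minimal sketch (in Python, with hypothetical class and method names) of the bookkeeping a router could do for Join, Leave, and Data packets from its directly-connected hosts. It illustrates the model above; it is not a real router implementation.

```python
# Minimal sketch of per-group membership state on a router (hypothetical names).
class MulticastRouter:
    def __init__(self):
        # group address -> set of local ports with at least one member behind them
        self.members = {}

    def on_join(self, group, port):
        # A host reachable via `port` wants to receive traffic for `group`.
        self.members.setdefault(group, set()).add(port)

    def on_leave(self, group, port):
        # A host reachable via `port` no longer wants traffic for `group`.
        if group in self.members:
            self.members[group].discard(port)
            if not self.members[group]:
                del self.members[group]

    def on_data(self, group, payload, in_port):
        # Deliver a Data packet to every local member port except the one it
        # arrived on. (Forwarding toward other routers is the job of the
        # multicast routing protocol, covered below.)
        for port in self.members.get(group, set()):
            if port != in_port:
                self.send(port, payload)

    def send(self, port, payload):
        print(f"forwarding {payload!r} out port {port}")
```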
Implementing IP Multicast
The goal of routers: deliver each data packet to all current members of the destination group.
Routers coordinate with two protocols: a host-to-router protocol (IGMP) and a routing protocol (DVMRP or CBT).
Implementing IP Multicast: Host-to-Router
Host-to-Router protocol: Router learns the group(s) that its directly-connected hosts are members of.
Similar to static routing in unicast:
Learn about your directly-connected hosts.
Doesn't help you learn about other hosts in the network.
Implementing IP Multicast: Routing Protocol
Routing protocol: Routers talk to each other to build forwarding tables, so they learn how to forward packets toward the members of the destination group.
Think of this as the multicast equivalent of distance-vector or link-state.
IGMP: Directly-Connected Hosts
The router and its directly-connected hosts periodically exchange messages, so that the router is informed about the hosts' group memberships.
Similar to static routing in unicast: Learn about directly-connected hosts, but not other hosts in the network.
In IGMP, the router sends Queries, and hosts answer with Reports listing their group memberships.
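As a rough sketch of this exchange, the router-side bookkeeping might look like the following; the timer values and names are assumptions for illustration, not IGMP's actual message formats or constants.

```python
# Rough sketch of the router side of the query/report exchange (assumed values).
import time

QUERY_INTERVAL = 125       # assumed value: how often the router sends a Query
MEMBERSHIP_TIMEOUT = 260   # assumed value: forget a group nobody re-reports

class IgmpRouterSide:
    def __init__(self):
        self.last_query = 0.0
        self.last_report = {}   # group address -> time of most recent Report

    def tick(self):
        # Called periodically: re-send a Query if enough time has passed.
        now = time.time()
        if now - self.last_query >= QUERY_INTERVAL:
            print("Query: which groups are you a member of?")
            self.last_query = now

    def on_report(self, group):
        # A directly-connected host reported membership in `group`.
        self.last_report[group] = time.time()

    def active_groups(self):
        # Groups reported recently are the ones we must keep forwarding.
        now = time.time()
        return {g for g, t in self.last_report.items()
                if now - t < MEMBERSHIP_TIMEOUT}
```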
Multicast
Building Spanning Trees
How do we build a tree rooted at a given host?
Two interpretations of the same tree:
Unicast: the union of the paths each host uses to send unicast packets to A.
Multicast: reverse those paths, and A can deliver one packet to every host along the tree.
DVMRP
Routing: Build a spanning tree from the source to all hosts.
Pruning: Trim the tree to make a tree from the source to all members only.
Forwarding: Packets are forwarded from your parent to your children.
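The following sketch shows the forwarding logic this implies, using reverse-path checks and prune state; the data structures and names are illustrative assumptions rather than the real DVMRP messages.

```python
# Simplified sketch of DVMRP-style forwarding: reverse-path check plus pruning.
class DvmrpRouter:
    def __init__(self, rpf_port_for_source, child_ports):
        # rpf_port_for_source: source address -> the port on our shortest path
        #   back toward that source (taken from the unicast distance-vector tables).
        # child_ports: source address -> ports leading to our children in the
        #   spanning tree rooted at that source.
        self.rpf_port = rpf_port_for_source
        self.children = child_ports
        self.pruned = set()   # (source, group, port) tuples that sent a Prune

    def on_data(self, source, group, in_port):
        # Reverse-path check: only accept packets that arrive on the port we
        # would use to send unicast traffic toward the source.
        if in_port != self.rpf_port.get(source):
            return []
        # Forward to children, skipping branches that pruned themselves off.
        return [p for p in self.children.get(source, [])
                if (source, group, p) not in self.pruned]

    def on_prune(self, source, group, port):
        # A child with no group members downstream asks to be cut off.
        self.pruned.add((source, group, port))
```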
DVMRP Pros and Cons
Con: DVMRP scales poorly.
Pro: DVMRP gives optimal performance.
Pro: DVMRP is a simple, elegant extension to an existing protocol.
Multicast
CBT: Definition
A Core-Based Tree (CBT) for a given destination group is a single shared tree, rooted at a designated core router, that reaches all members of the group. Routers join the tree by sending a Join toward the core, and every sender's packets are delivered along the same shared tree.
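A small sketch of the two behaviors above, joining toward the core and forwarding along the shared tree, assuming hypothetical names and simplified state:

```python
# Sketch of a CBT router: graft joining ports onto the shared tree, and
# forward data out every tree port except the one it arrived on.
class CbtRouter:
    def __init__(self, port_toward_core):
        self.core_port = port_toward_core   # from the unicast routing table
        self.tree_ports = set()             # ports that are part of the shared tree

    def on_join(self, in_port):
        # A downstream router (or local host) joins the group: graft the port
        # the Join arrived on, and propagate the Join toward the core if we
        # were not already on the tree.
        newly_on_tree = not self.tree_ports
        self.tree_ports.add(in_port)
        if newly_on_tree:
            self.send_join(self.core_port)
            self.tree_ports.add(self.core_port)

    def on_data(self, in_port, payload):
        # One shared tree serves every sender in the group.
        return [p for p in self.tree_ports if p != in_port]

    def send_join(self, port):
        print(f"sending Join toward the core out port {port}")
```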
Multicast
Overlay Multicast: Key Idea
Key idea: Build a virtual network topology to connect the hosts together.
The hosts can run a multicast routing protocol (CBT, DVMRP) to learn how to send multicast packets.
Overlay Multicast: Key Idea
Overlay multicast is implemented in hosts, not routers.
Overlay Multicast Performance: Stretch
Overlay multicast performance depends on how well the virtual topology matches the physical topology.
[Diagram: hosts A, B, and C attached to routers R51, R52, and R53.]
Assume all links are assigned cost 1.
Physical path cost: 4
Virtual path cost: 1
Stretch: 4 / 1 = 4x
Collectives
Distributed Training Infrastructure: GPUs, TPUs
What exactly is each "node"?
The nodes are inter-connected in a datacenter-like network.
Recall the properties of a datacenter network:
Distributed Training Infrastructure: Servers
Each server in the datacenter has a CPU, multiple GPUs with their own memory, and a NIC connecting the server to the network.
[Diagram: Server 1 and Server 2, each containing a CPU, GPU 1, GPU 2, memory, and a NIC.]
Distributed Training Infrastructure: Network Topology
The servers are connected in a structured topology.
[Diagram: a datacenter topology with four pods connected through edge, aggregation, and core layers.]
Distributed Training Infrastructure: Network Topology
Notice: Some pairs of nodes are more closely connected than others.
[Diagram: the same four-pod topology; nodes in the same pod get good performance, while nodes that must communicate through the aggregation and core layers get worse performance.]
Distributed Training Infrastructure: Network Topology
Alternate topology: 2D torus: a square grid of nodes where the edges wrap around.
Alternate topology: 3D torus: a cubic grid of nodes where the edges wrap around.
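As a small illustration of what "the edges wrap around" means, the helper below (a hypothetical function, not from the slides) computes a node's neighbors on an n-by-n 2D torus using modular arithmetic:

```python
# Each node (r, c) on an n x n 2D torus has exactly four neighbors,
# because edges wrap around at the boundaries.
def torus_neighbors_2d(r, c, n):
    """Return the 4 neighbors of node (r, c) on an n x n 2D torus."""
    return [
        ((r - 1) % n, c),  # up (wraps from row 0 to row n-1)
        ((r + 1) % n, c),  # down
        (r, (c - 1) % n),  # left
        (r, (c + 1) % n),  # right
    ]

# Example: on a 4x4 torus, even a corner node still has 4 neighbors.
print(torus_neighbors_2d(0, 0, 4))  # [(3, 0), (1, 0), (0, 3), (0, 1)]
```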
Collectives
Collective Communication: Definition
Collective communication: a group of nodes exchanging data in a coordinated manner as part of a group computation.
Brief history:
Collective Communication Definition: Recap
To recap, collective operations are:
We'll now define the 7 collective operations.
The operations can roughly be split into two categories: operations that only move data around (Broadcast, Scatter, Gather, AllGather), and operations that also perform a computation on the data (Reduce, AllReduce, ReduceScatter).
Collectives
Broadcast
The target node sends its vector to everybody.
Before: Node 1 holds (w1, w2, w3, w4), Node 2 holds (x1, x2, x3, x4), Node 3 holds (y1, y2, y3, y4), Node 4 holds (z1, z2, z3, z4).
After: every node holds the target node's vector (w1, w2, w3, w4).
Scatter
The target node sends its ith element to the ith node.
Before: Node 1 (the target) holds (w1, w2, w3, w4); Nodes 2, 3, and 4 hold (x1, x2, x3, x4), (y1, y2, y3, y4), and (z1, z2, z3, z4).
After: Node 1 holds w1, Node 2 holds w2, Node 3 holds w3, Node 4 holds w4.
Gather
Node i sends its ith element to the target node.
Before: Node 1 holds (w1, w2, w3, w4), Node 2 holds (x1, x2, x3, x4), Node 3 holds (y1, y2, y3, y4), Node 4 holds (z1, z2, z3, z4).
After: the target node holds (w1, x2, y3, z4).
AllGather
Node i sends its ith element to everybody.
Before: each node holds its own vector, as above.
After: every node holds (w1, x2, y3, z4).
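To make the four data-movement operations concrete, here is a toy single-process simulation in Python. It is an illustrative sketch (node indices start at 0, and the function names are our own), not an implementation from any real collectives library such as MPI or NCCL.

```python
# Toy simulation of the data-movement collectives: "nodes" is a list of
# vectors, one per node, and each function returns the new per-node buffers.

def broadcast(nodes, target):
    # Every node ends up with a copy of the target node's vector.
    return [list(nodes[target]) for _ in nodes]

def scatter(nodes, target):
    # Node i ends up with element i of the target node's vector.
    return [[nodes[target][i]] for i in range(len(nodes))]

def gather(nodes, target):
    # The target node ends up with element i from node i; others keep their data.
    out = [list(v) for v in nodes]
    out[target] = [nodes[i][i] for i in range(len(nodes))]
    return out

def allgather(nodes):
    # Every node ends up with element i from node i.
    collected = [nodes[i][i] for i in range(len(nodes))]
    return [list(collected) for _ in nodes]

# Example with 4 nodes holding w, x, y, z (matching the figures above):
nodes = [["w1", "w2", "w3", "w4"],
         ["x1", "x2", "x3", "x4"],
         ["y1", "y2", "y3", "y4"],
         ["z1", "z2", "z3", "z4"]]
print(allgather(nodes)[0])  # ['w1', 'x2', 'y3', 'z4'] on every node
```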
Collectives
Reduce
Sum up the vectors, then send the resulting sum vector to a target node.
Before: Node 1 holds (w1, w2, w3, w4), Node 2 holds (x1, x2, x3, x4), Node 3 holds (y1, y2, y3, y4), Node 4 holds (z1, z2, z3, z4).
After: the target node holds (w1 + x1 + y1 + z1, w2 + x2 + y2 + z2, w3 + x3 + y3 + z3, w4 + x4 + y4 + z4).
AllReduce
Sum up the vectors, then send the resulting sum vector to all nodes.
Before: each node holds its own vector, as above.
After: every node holds the sum vector (w1 + x1 + y1 + z1, w2 + x2 + y2 + z2, w3 + x3 + y3 + z3, w4 + x4 + y4 + z4).
ReduceScatter
The ith node has the sum of the ith elements of each vector.
Before: each node holds its own vector, as above.
After: Node 1 holds w1 + x1 + y1 + z1, Node 2 holds w2 + x2 + y2 + z2, Node 3 holds w3 + x3 + y3 + z3, Node 4 holds w4 + x4 + y4 + z4.
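The three reduction operations can be sketched the same way; again, this is a toy single-process illustration with our own function names, not a real library implementation.

```python
# Toy simulation of the reduction collectives.

def reduce_(nodes, target):
    # The target node ends up with the elementwise sum of all vectors.
    total = [sum(col) for col in zip(*nodes)]
    out = [list(v) for v in nodes]
    out[target] = total
    return out

def allreduce(nodes):
    # Every node ends up with the elementwise sum of all vectors.
    total = [sum(col) for col in zip(*nodes)]
    return [list(total) for _ in nodes]

def reduce_scatter(nodes):
    # Node i ends up with element i of the elementwise sum.
    total = [sum(col) for col in zip(*nodes)]
    return [[total[i]] for i in range(len(nodes))]

# Example with 4 nodes holding numeric vectors:
nodes = [[1, 2, 3, 4],
         [10, 20, 30, 40],
         [100, 200, 300, 400],
         [1000, 2000, 3000, 4000]]
print(allreduce(nodes)[2])       # [1111, 2222, 3333, 4444] on every node
print(reduce_scatter(nodes)[0])  # [1111] on node 0
```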
Collectives
Mesh Topology
[Diagram: five nodes (1 through 5), each holding a five-element vector (v1-v5, w1-w5, x1-x5, y1-y5, z1-z5), with a direct link between every pair of nodes.]
Single Root Topology
[Diagram: the same five nodes and vectors, all connected to a single root node.]
Tree-Based AllReduce
[Diagram: the five nodes and their vectors arranged in a tree.]
Ring-Based AllReduce
[Diagram: the five nodes and their vectors arranged in a ring.]
Ring-Based AllReduce (Optimized)
Topology Efficiency
Total bandwidth: each node receives 1 vector (D bytes) from the right and sends 1 vector (D bytes) to the left, with p nodes in total.
Number of steps: the vector must travel around all p nodes in the ring.
Bandwidth per step: each node receives 1 element (D/p bytes) from the right and sends 1 element (D/p bytes) to the left.
| Topology | Total Bandwidth | Number of Steps | Bandwidth Per Step |
| Mesh | Dp² | 1 | Dp |
| Single Root | Dp | 2 | Dp |
| Tree | Dp | log(p) | D |
| Ring (Naive) | Dp | p | D |
| Ring (Optimized) | Dp | p | D/p |
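To tie the table to the algorithm, here is a sketch of the optimized ring AllReduce: a reduce-scatter pass around the ring followed by an allgather pass, so each node sends only D/p bytes per step. It is simulated in one process, and the chunk/step indexing is one common convention, not the only one.

```python
# Sketch of the optimized ring AllReduce, simulated in a single process.
# Each "chunk" here is one element of the vector (p chunks for p nodes).

def ring_allreduce(vectors):
    p = len(vectors)                      # number of nodes (and chunks)
    data = [list(v) for v in vectors]     # data[i][j] = node i's copy of chunk j
    # Phase 1: reduce-scatter. In each of p-1 steps, every node sends one
    # chunk (D/p bytes) to its right neighbor, which adds it into its own copy.
    for step in range(p - 1):
        for i in range(p):
            j = (i - step) % p            # chunk that node i forwards this step
            data[(i + 1) % p][j] += data[i][j]
    # After phase 1, node i holds the fully reduced chunk (i + 1) % p.
    # Phase 2: allgather. In each of p-1 steps, every node forwards its most
    # recently completed chunk to its right neighbor, which overwrites its copy.
    for step in range(p - 1):
        for i in range(p):
            j = (i + 1 - step) % p
            data[(i + 1) % p][j] = data[i][j]
    return data                           # every node now holds the full sum

# Example: 4 nodes, 4-element vectors. Every node ends with the same sums.
print(ring_allreduce([[1, 2, 3, 4],
                      [10, 20, 30, 40],
                      [100, 200, 300, 400],
                      [1000, 2000, 3000, 4000]])[0])
# [1111, 2222, 3333, 4444]
```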
Overlay and Underlay Topologies
Suppose Hosts A, B, C, D want to run an AllReduce operation.
A, B, C, D are not in a ring topology. But, we can add virtual links to create a ring!
[Diagram: hosts A, B, C, and D attached to a network of routers R1 through R12, with virtual links between the hosts forming a ring.]
Layers of Abstraction
We thought about collectives at 3 different layers of abstraction:
The collective operation itself: which data each node starts and ends with (e.g., AllReduce leaves every node holding the elementwise sum of all the vectors).
The logical topology connecting the nodes (mesh, single root, tree, or ring) and the algorithm run over it.
The physical network underneath, where hosts like A, B, C, and D are connected through routers and virtual links are stitched together to form the logical topology.
Worksheet
Question 1: True or False
Question 2: Multicast I
Question 3: Multicast II
Question 4: Collectives
Questions?
Feedback Form: https://tinyurl.com/cs168-su25-disc-feedback