1 of 76

Discussion 12

CS 168, Summer 2025 @ UC Berkeley

Slides credit: Sylvia Ratnasamy, Rob Shakir, Peyrin Kao, Iuniana Oprescu

Multicast and Collectives 🎉

2 of 76

Logistics

  • Final on August 13th, 3-6pm (put it in your calendar)!
    • You can bring four 8.5” x 11” cheat sheets, each two-sided (8 sides total)
    • All lectures, discussions, and projects are included in the exam scope
  • Please fill out course evaluations for extra credit!

3 of 76

Multicast

  • IP Multicast
  • DVMRP
  • CBT
  • Overlay Multicast

4 of 76

Implementing Multicast

Where are multicast protocols implemented? There are two options:

  • IP Multicast: Implemented in routers, at Layer 3.
  • Overlay Multicast: Implemented in hosts, at Layer 7.

We'll study both options and their trade-offs.

[Figure: Multicast splits into IP Multicast and Overlay Multicast; IP Multicast needs a host-to-router protocol (IGMP) and a routing protocol (DVMRP, CBT).]

5 of 76

Multicast

  • IP Multicast
  • DVMRP
  • CBT
  • Overlay Multicast

6 of 76

Packet Delivery Models

We've seen four models for delivering a packet:

  • Broadcast: Send the packet to all destinations.
  • Unicast: Send the packet to exactly one destination.
  • Anycast: Send the packet to one of a set of possible destinations.
  • Multicast: Send the packet to all members of a group.

Some notes on multicast:

  • Hosts can choose to join/leave groups at any time.
  • You can send a packet to a group, even if you are not a member of that group.

7 of 76

IP Multicast Model

Each multicast group has an IP address.

  • Addresses in the range 224.0.0.0 to 239.255.255.255 are reserved for multicast groups.
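
As a quick sanity check, here is a short sketch using Python's standard ipaddress module; this range is exactly the 224.0.0.0/4 block, which the library flags as multicast:

    import ipaddress

    # 224.0.0.0/4 covers exactly 224.0.0.0 through 239.255.255.255.
    for addr in ["223.255.255.255", "224.0.0.0", "226.1.2.3",
                 "239.255.255.255", "240.0.0.0"]:
        print(addr, ipaddress.ip_address(addr).is_multicast)
    # Prints: False, True, True, True, False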

Hosts can send 3 different types of packets:

  • Join: Send a packet to announce that you are joining a group.
  • Leave: Send a packet to announce that you are leaving a group.
  • Data: Send a data packet to a multicast group.

Routers must know how to process Join, Leave, and Data packets from hosts.

  • Routers must store state. (e.g. "where are the members of group 226.1.2.3?")
  • Routers must run protocols to learn how to forward packets.

8 of 76

Implementing IP Multicast

The goal of routers:

  • Hosts send Join, Leave, and Data packets.
  • Routers must coordinate to process these packets and build forwarding tables.
  • The forwarding tables say: When I receive a packet with destination group G1, what link(s) do I forward this packet out of?

Routers coordinate with two protocols:

  • Host-to-Router: Learns the group(s) that your directly-connected hosts belong to.
  • Routing: Routers talk to each other to build forwarding tables.


9 of 76

Implementing IP Multicast: Host-to-Router

Host-to-Router protocol: Router learns the group(s) that its directly-connected hosts are members of.

[Figure: Example topology with routers R1–R8 and their directly-connected hosts.]

Similar to static routing in unicast routing: you learn about your directly-connected hosts, but this doesn't help you learn about other hosts in the network.

10 of 76

Implementing IP Multicast: Routing Protocol

Routing protocol: Routers talk to each other to build forwarding tables, so they learn how to forward packets toward the members of the destination group.

[Figure: Routers R1–R8 run the routing protocol with each other to build their multicast forwarding tables.]

Think of this as the multicast equivalent of distance-vector or link-state.

11 of 76

IGMP: Directly-Connected Hosts

The router and its directly-connected hosts periodically exchange messages, so that the router is informed about the hosts' group memberships.

  • Query: Router periodically asks hosts: "What groups do you belong to?"
  • Reports: Hosts periodically tell the router: "I belong to these groups."
  • If the router doesn't receive a Report for a long time, it assumes the membership has expired (see the sketch below).

Similar to static routing in unicast: Learn about directly-connected hosts, but not other hosts in the network.

[Figure: Router R2 periodically exchanges Query and Report messages with its directly-connected hosts.]
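
This membership state is soft state. Below is a minimal sketch of that idea; the class name and timeout value are hypothetical, and real IGMP Query/Report packet handling is omitted:

    import time

    MEMBERSHIP_TIMEOUT = 260  # seconds; an arbitrary value for this sketch

    class MembershipTable:
        """Soft-state table mapping group -> {host: time of last Report}."""

        def __init__(self):
            self.members = {}

        def on_report(self, host, group):
            # A host periodically reports "I belong to this group."
            self.members.setdefault(group, {})[host] = time.time()

        def active_members(self, group):
            # Forget hosts we haven't heard a Report from in a long time.
            now = time.time()
            return [host for host, last in self.members.get(group, {}).items()
                    if now - last < MEMBERSHIP_TIMEOUT]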

12 of 76

Multicast

  • IP Multicast
  • DVMRP
  • CBT
  • Overlay Multicast

13 of 76

Building Spanning Trees

How do we build a tree rooted at a given host?

  • Have we seen spanning trees anywhere else?

[Figure: Topology with source host A, routers R1–R9, and hosts C, D, E, F.]

14 of 76

Building Spanning Trees

How do we build a tree rooted at a given host?

  • Have we seen spanning trees anywhere else?
  • When we did unicast routing, we built a tree!
  • Key idea: Reuse this same tree.


15 of 76

Building Spanning Trees

How do we build a tree rooted at a given host?

  • Have we seen spanning trees anywhere else?
  • When we did unicast routing, we built a tree!
  • Key idea: Reuse this same tree.

Two interpretations of the same tree:

Unicast:

  • Packets flow up toward the root.
  • The root is the destination.

Multicast:

  • Packets flow down, from the root to everyone.
  • The root is the sender.


16 of 76

DVMRP

Routing: Build a spanning tree from the source to all hosts.

  • Learning parent: Use unicast forwarding table.
  • Learning children: Everyone sends an advertisement to their parent. The multicast table has a list of your children.

Pruning: Trim the tree to make a tree from the source to all members only.

  • If you receive a prune from a child, remove the child from your multicast table.
  • If none of your descendants (directly-connected hosts or children) belong to this group, send a pruning message to your parent.
  • Periodically clear all pruning information (revert to forwarding to all children).

Forwarding: Packets are forwarded from your parent to your children.

  • Use unicast table to check if packet is from parent.
  • If so, use multicast table to forward packet to children. Else, drop the packet.
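
A rough sketch of this forwarding rule, with hypothetical table and packet field names (packet.source, packet.group):

    def forward(packet, in_link, unicast_table, multicast_children, local_member_links):
        # Reverse-path check: accept the packet only if it arrived over the link
        # we would use to send unicast traffic back toward the source (our parent).
        if in_link != unicast_table[packet.source]:
            return []  # drop the packet

        # Otherwise, forward to directly-connected members of the group, plus
        # any children that have not been pruned for this (source, group) pair.
        out_links = list(local_member_links.get(packet.group, []))
        out_links += multicast_children.get((packet.source, packet.group), [])
        return out_links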

17 of 76

DVMRP Pros and Cons

Con: DVMRP scales poorly.

  • Pruning information is periodically cleared. Packets are occasionally broadcast to everybody, wasting bandwidth.
  • Multicast table needs one entry per source, per destination group.

Pro: DVMRP gives optimal performance.

  • Spanning tree gives us the least-cost paths from source to all group members.
  • Multicast packets travel least-cost paths to their destinations.

Pro: DVMRP is a simple, elegant extension to an existing protocol.

  • We reused the unicast forwarding table to save ourselves some work.

18 of 76

Multicast

  • IP Multicast
  • DVMRP
  • CBT
  • Overlay Multicast

19 of 76

CBT: Definition

A Core-Based Tree (CBT) for a given destination group is:

  • Rooted at a specific core router.
  • Touches every member of that destination group.

[Figure: Topology with routers R1–R10; the core-based tree is rooted at the core router and reaches every member of the group.]

20 of 76

CBT: Definition

A Core-Based Tree (CBT) for a given destination group is:

  • Rooted at a specific core router.
  • Touches every member of that destination group.


21 of 76

Multicast

  • IP Multicast
  • DVMRP
  • CBT
  • Overlay Multicast

22 of 76

Overlay Multicast: Key Idea

Key idea: Build a virtual network topology to connect the hosts together.

  • The virtual links are a fiction. Packets still have to be sent along physical links.
  • But the virtual links give us the illusion that the hosts are in a single network.

The hosts can run a multicast routing protocol (CBT, DVMRP) to learn how to send multicast packets.

[Figure: Hosts A–G attached to routers R51–R73; virtual links between the hosts form the overlay.]

23 of 76

Overlay Multicast: Key Idea

Overlay multicast is implemented in hosts, not routers.

  • Routers just forward unicast packets, as usual.
  • Virtual routers in the hosts run routing protocols and populate multicast tables.
  • Hosts are now forwarding packets between each other!


24 of 76

Overlay Multicast Performance: Stretch

Overlay multicast performance depends on how well the virtual topology matches the physical topology.

  • Stretch: Ratio of physical path cost to the corresponding virtual path cost.
  • Low stretch is good: Physical path cost is similar to virtual path cost.
  • High stretch is bad: Physical path is many times longer than virtual path.

[Figure: Hosts A, B, C attached to routers R51–R53. Assume all links are assigned cost 1. The virtual link between two hosts corresponds to a 4-hop physical path: physical path cost 4, virtual path cost 1, stretch = 4 / 1 = 4x.]

25 of 76

Collectives

  • Distributed Training Infrastructure
  • Collective Communication
  • Redistribution Operations
  • Consolidation Operations
  • AllReduce Topologies

26 of 76

Collectives

  • Distributed Training Infrastructure
  • Collective Communication
  • Redistribution Operations
  • Consolidation Operations
  • AllReduce Topologies

27 of 76

Distributed Training Infrastructure: GPUs, TPUs

What exactly is each "node"?

  • GPUs (Graphics Processing Units): Chips that are good at AI operations.
  • TPUs (Tensor Processing Units): AI-optimized chips designed by Google.

The nodes are inter-connected in a datacenter-like network.

Recall the properties of a datacenter network:

  • Nodes are physically close (e.g. same building).
  • Nodes are organized in a structured topology (e.g. Clos network).
  • Nodes are homogeneous (e.g. all the same model of GPU).
  • Links have very high bandwidth.

28 of 76

Distributed Training Infrastructure: Servers

Each server in the datacenter has:

  • One or more GPUs/TPUs. Used to do the AI training.
  • A relatively weak CPU. Used for miscellaneous operations, not AI training.
  • A NIC, shared by all the chips on this server.

[Figure: Server 1 and Server 2, each with a CPU, two GPUs with their own memory, and a single shared NIC.]

29 of 76

Distributed Training Infrastructure: Network Topology

The servers are connected in a structured topology.

  • Many different possible topologies. Example: Fat-tree Clos topology.
  • Remember: Each server could have one or more GPUs/TPUs.

[Figure: Fat-tree Clos topology with Pods 1–4 connected through edge, aggregation, and core layers.]

30 of 76

Distributed Training Infrastructure: Network Topology

Notice: Some pairs of nodes are more closely-connected than others.

  • Two nodes on the same server: Great performance.
    • Don't need to use the network at all.
    • Nodes can communicate with effectively infinite bandwidth and zero latency.
  • Two nodes in the same pod: Good performance.
  • Two nodes in different pods: Worse performance.

[Figure: The same fat-tree topology; communication within a pod gets good performance, while communication across pods gets worse performance.]

31 of 76

Distributed Training Infrastructure: Network Topology

Alternate topology: 2D torus: A square, where the edges wrap around.

  • TPUs can be directly connected to each other with ICI (Inter-Chip Interconnect), with no routers needed.
  • As before, some pairs of nodes are more closely-connected than others.
  • "Wrap around" means: If you're at a bottom node and traverse a link down, you'll end up at the top node. (And similar for the other directions.)

32 of 76

Distributed Training Infrastructure: Network Topology

Alternate topology: 3D torus: A cube, where the edges wrap around.

  • TPUs can be directly connected to each other with ICI (Inter-Chip Interconnect), with no routers needed.
  • As before, some pairs of nodes are more closely-connected than others.

33 of 76

Collectives

  • Distributed Training Infrastructure
  • Collective Communication
  • Redistribution Operations
  • Consolidation Operations
  • AllReduce Topologies

34 of 76

Collective Communication: Definition

Collective communication: A group of nodes that exchange data in a coordinated manner as part of a group computation.

  • The nodes work together to achieve a common goal.
  • The nodes have to exchange data during that process.

Brief history:

  • Originally developed for supercomputers.
  • Now an active area of research again in the context of AI training.
  • Modern implementations: NCCL (Nvidia), MSCCL (Microsoft), etc.
  • NCCL code is available online if you're curious.

35 of 76

Collective Communication Definition: Recap

To recap, collective operations are:

  • Orchestrated: A centralized controller plans out the job ahead of time.
  • Synchronized: Every node starts the operation at the same time.
  • Blocking: Must wait for all nodes to finish before proceeding.

We'll now define the 7 collective operations.

  • First, we'll define the inputs and outputs (the what).
  • Then, we'll look at how the operation is implemented in the network (the how).

The operations can roughly split into two categories:

  • Redistribution operations move data around without transforming it.
  • Consolidation operations aggregate many pieces of data into a single output.

36 of 76

Collectives

  • Distributed Training Infrastructure
  • Collective Communication
  • Redistribution Operations
  • Consolidation Operations
  • AllReduce Topologies

37 of 76

Broadcast

The target node sends its vector to everybody.

[Figure: Before — Node 1 holds (w1, w2, w3, w4), Node 2 holds (x1, x2, x3, x4), Node 3 holds (y1, y2, y3, y4), Node 4 holds (z1, z2, z3, z4). After Broadcast (target = Node 1) — every node holds (w1, w2, w3, w4).]

38 of 76

Scatter

The target node sends its ith element to the ith node.

[Figure: Before — each Node i holds its own vector, as above. After Scatter (target = Node 1) — Node 1 holds w1, Node 2 holds w2, Node 3 holds w3, Node 4 holds w4.]

39 of 76

Gather

Node i sends its ith element to the target node.

[Figure: Before — each Node i holds its own vector, as above. After Gather — the target node holds (w1, x2, y3, z4).]

40 of 76

AllGather

Node i sends its ith element to everybody.

[Figure: Before — each Node i holds its own vector, as above. After AllGather — every node holds (w1, x2, y3, z4).]
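
Here is a minimal sketch of the four redistribution operations, simulating the nodes with plain Python lists (vectors[i] is node i's vector). This only mirrors the input/output definitions above, not how the data actually moves through the network:

    def broadcast(vectors, target):
        # Every node ends up with a copy of the target node's entire vector.
        return [list(vectors[target]) for _ in vectors]

    def scatter(vectors, target):
        # Node i ends up with element i of the target node's vector.
        return [[vectors[target][i]] for i in range(len(vectors))]

    def gather(vectors, target):
        # The target node collects element i from node i (other nodes shown unchanged here).
        result = [list(v) for v in vectors]
        result[target] = [vectors[i][i] for i in range(len(vectors))]
        return result

    def allgather(vectors):
        # Every node ends up with element i from node i.
        collected = [vectors[i][i] for i in range(len(vectors))]
        return [list(collected) for _ in vectors]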

41 of 76

Collectives

  • Distributed Training Infrastructure
  • Collective Communication
  • Redistribution Operations
  • Consolidation Operations
  • AllReduce Topologies

42 of 76

Reduce

Sum up the vectors, then send the resulting sum vector to a target node.

[Figure: Before — Node 1 holds (w1, ..., w4), Node 2 holds (x1, ..., x4), Node 3 holds (y1, ..., y4), Node 4 holds (z1, ..., z4). After Reduce — the target node holds (w1 + x1 + y1 + z1, ..., w4 + x4 + y4 + z4).]

43 of 76

AllReduce

Sum up the vectors, then send the resulting sum vector to all nodes.

[Figure: Before — each node holds its own vector, as above. After AllReduce — every node holds (w1 + x1 + y1 + z1, ..., w4 + x4 + y4 + z4).]

44 of 76

ReduceScatter

After the operation, the ith node holds the sum of the ith elements of all the vectors.

[Figure: Before — each node holds its own vector, as above. After ReduceScatter — Node 1 holds w1 + x1 + y1 + z1, Node 2 holds w2 + x2 + y2 + z2, Node 3 holds w3 + x3 + y3 + z3, Node 4 holds w4 + x4 + y4 + z4.]
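
A matching sketch of the three consolidation operations, again only simulating the input/output definitions on plain Python lists:

    def reduce(vectors, target):
        # The target node ends up with the elementwise sum of all vectors.
        total = [sum(column) for column in zip(*vectors)]
        result = [list(v) for v in vectors]
        result[target] = total
        return result

    def allreduce(vectors):
        # Every node ends up with the elementwise sum of all vectors.
        total = [sum(column) for column in zip(*vectors)]
        return [list(total) for _ in vectors]

    def reducescatter(vectors):
        # Node i ends up with element i of the elementwise sum.
        total = [sum(column) for column in zip(*vectors)]
        return [[total[i]] for i in range(len(vectors))]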

45 of 76

Collectives

  • Distributed Training Infrastructure
  • Collective Communication
  • Redistribution Operations
  • Consolidation Operations
  • AllReduce Topologies

46 of 76

Mesh Topology

[Figure: Five nodes (1–5) in a full mesh, each holding its own 5-element vector and exchanging it directly with every other node.]
47 of 76

Single Root Topology

[Figure: Five nodes (1–5), each holding its own 5-element vector, with a single root node that the other nodes send their vectors to.]
48 of 76

Tree-Based AllReduce

[Figure: Five nodes (1–5), each holding its own 5-element vector, arranged in a tree.]

49 of 76

Ring-Based AllReduce

[Figure: Five nodes (1–5) arranged in a ring, each holding its own 5-element vector; whole vectors are passed around the ring from neighbor to neighbor.]

50 of 76

Ring-Based AllReduce (Optimized)
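
Below is a minimal single-process simulation of the optimized ring AllReduce (an illustrative sketch, not NCCL's actual implementation): a reduce-scatter phase where each node passes one chunk-sized partial sum to its neighbor per step, followed by an all-gather phase that circulates the finished chunks.

    def ring_allreduce(vectors):
        # vectors[i] is node i's vector. To keep the sketch small, the vector is
        # split into p chunks of one element each (so len(vector) == p).
        p = len(vectors)
        data = [list(v) for v in vectors]  # data[i][c] = node i's copy of chunk c

        # Phase 1 (reduce-scatter): in step s, node i sends its running sum of
        # chunk (i - s) % p to its right neighbor, which adds it in.
        for s in range(p - 1):
            messages = [(i, (i - s) % p, data[i][(i - s) % p]) for i in range(p)]
            for sender, chunk, value in messages:
                data[(sender + 1) % p][chunk] += value
        # Now node i holds the fully reduced chunk (i + 1) % p.

        # Phase 2 (all-gather): in step s, node i forwards the finished chunk
        # (i + 1 - s) % p to its right neighbor, which overwrites its copy.
        for s in range(p - 1):
            messages = [(i, (i + 1 - s) % p, data[i][(i + 1 - s) % p]) for i in range(p)]
            for sender, chunk, value in messages:
                data[(sender + 1) % p][chunk] = value

        return data  # every node now holds the elementwise sum

    # Example: ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    # returns [[12, 15, 18], [12, 15, 18], [12, 15, 18]].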

51 of 76

Topology Efficiency

Total bandwidth: Each node receives 1 vector (D bytes) from the right. Each node sends 1 vector (D bytes) to the left. p nodes in total.

Number of steps: The vector must travel around all p nodes in the ring.

Bandwidth per step: Each node receives 1 element (D/p bytes) from the right. Each node sends 1 element (D/p bytes) to the left.

Topology          Total Bandwidth    Number of Steps    Bandwidth Per Step
Mesh              Dp²                1                  Dp
Single Root       Dp                 2                  Dp
Tree              Dp                 log(p)             D
Ring (Naive)      Dp                 p                  D
Ring (Optimized)  Dp                 p                  D/p

52 of 76

Overlay and Underlay Topologies

Suppose Hosts A, B, C, D want to run an AllReduce operation.

A, B, C, D are not in a ring topology. But, we can add virtual links to create a ring!

[Figure: Hosts A, B, C, D attached to a physical topology of routers R1–R12; virtual links between the hosts connect them in a ring.]

53 of 76

Layers of Abstraction

We thought about collectives at 3 different layers of abstraction:

  • Definitions: Users just need to know the input/outputs to use the operations.

[Figure: The AllReduce definition from earlier: before, each node holds its own vector; after, every node holds the elementwise sum.]

54 of 76

Layers of Abstraction

We thought about collectives at 3 different layers of abstraction:

  • Definitions: Users just need to know the input/outputs to use the operations.
  • Overlay: The data exchanged over virtual links.

[Figure: The overlay of five nodes (1–5) connected by virtual links.]

55 of 76

Layers of Abstraction

We thought about collectives at 3 different layers of abstraction:

  • Definitions: Users just need to know the input/outputs to use the operations.
  • Overlay: The data exchanged over virtual links.
  • Underlay: How the virtual links correspond to actual physical links.

[Figure: The underlay: hosts A, B, C, D attached to routers R1–R12, showing which physical links each virtual link maps onto.]

56 of 76

Worksheet

  • True or False
  • Multicast I
  • Multicast II
  • Collectives

57 of 76

Question 1: True or False

58 of 76

Question 1: True or False

59 of 76

Worksheet

  • True or False
  • Multicast I
  • Multicast II
  • Collectives

60 of 76

Question 2: Multicast I

61 of 76

Question 2: Multicast I

62 of 76

Worksheet

  • True or False
  • Multicast I
  • Multicast II
  • Collectives

63 of 76

Question 3: Multicast II

64 of 76

Question 3: Multicast II

65 of 76

Question 3: Multicast II

66 of 76

Question 3: Multicast II

67 of 76

Worksheet

  • True or False
  • Multicast I
  • Multicast II
  • Collectives

68 of 76

Question 4: Collectives

69 of 76

Question 4: Collectives

70 of 76

Question 4: Collectives

71 of 76

Question 4: Collectives

72 of 76

Question 4: Collectives

73 of 76

Question 4: Collectives

74 of 76

Question 4: Collectives

75 of 76

Question 4: Collectives

76 of 76

Questions?

Feedback Form: https://tinyurl.com/cs168-su25-disc-feedback