1 of 57

Lecture 19: Disjoint Sets

CSE 373: Data Structures and Algorithms

1

CSE 373 23SU

1

2 of 57

Warmup

KruskalMST(Graph G)

initialize each vertex to be an independent component

sort the edges by weight

foreach(edge (u, v) in sorted order){

if(u and v are in different components){

add (u,v) to the MST

update u and v to be in the same component

}

Run Kruskal’s algorithm on the following graph to find the MST (minimum spanning tree) of the graph below.

Below is the provided pseudocode for Kruksal’s algorithm to choose all the edges.

CSE 373 23SU

2

3 of 57

Announcements

Ex 5 due Monday 8/7 at 11:59PM
P4 due 8/18 at 11:59PM
Extra credit survey out! Due tonight!

worth 0.1 final GPA boost!
survey link: https://forms.gle/o5cCL4R9LU3ii2qy8

CSE 373 23SU

3

4 of 57

Disjoint Sets

CSE 373 23SU

4

5 of 57

Selecting an ADT

Kruskal’s needs to find what MST a vertex belongs to, and union those MSTs together

Our existing ADTs don’t lend themselves well to “unioning” two sets…
Let’s define a new one!

A

B

D

E

C

4

2

11

1

3

5

8

9

10

7

6

F

kruskalMST(G graph)

Set(?) msts; Set finalMST;

initialize msts with each vertex as single-element MST

sort all edges by weight (smallest to largest)

for each edge (u,v) in ascending order:

uMST = msts.find(u)

vMST = msts.find(v)

if (uMST != vMST):

finalMST.add(edge (u, v))

msts.union(uMST, vMST)

CSE 373 23SU

5

6 of 57

Disjoint Sets ADT (aka “Union-Find”)

Kruskal’s will use a Disjoint Sets ADT under the hood

Conceptually, a single instance of this ADT contains a “family” of sets that are disjoint (no element belongs to multiple sets)

DISJOINT SETS ADT

State

Family of Sets

disjoint: no shared elements
each set has a representative (either a member or a unique ID)

Behavior

makeSet(value) – new set with value as only member (and representative)

find(value) – return representative of the set containing value

union(x, y) – combine sets containing x and y into one set with all elements, choose single new representative

kruskalMST(G graph)

DisjointSets<V> msts; Set finalMST;

initialize msts with each vertex as single-element MST

sort all edges by weight (smallest to largest)

for each edge (u,v) in ascending order:

uMST = msts.find(u)

vMST = msts.find(v)

if (uMST != vMST):

finalMST.add(edge (u, v))

msts.union(uMST, vMST);

A

B

D

E

C

4

2

11

1

3

5

8

9

10

7

6

F

CSE 373 23SU

6

7 of 57

Disjoint Sets in mathematics

“In mathematics, two sets are said to be disjoint sets if they have no element in common.” - Wikipedia

disjoint = not overlapping

Kevin

Aileen

Keanu

Sherdil

Leona

These two sets are disjoint sets

Nishu

Santino

Brian

These two sets are not disjoint sets

Santino

Set #1

Set #2

Set #3

Set #4

CSE 373 23SU

7

8 of 57

Disjoint Sets in computer science

In computer science, disjointsets can refer to this ADT/data structure that keeps track of the multiple “mini” sets that are disjoint

This overall grey blob thing is the actual

disjoint sets, and it’s keeping track of any number of mini-sets, which are all disjoint (the mini sets have no overlapping values).

Note: this might feel really different than ADTs we’ve run into before. The ADTs we’ve seen before (dictionaries, lists, sets, etc.) just store values directly.

But the Disjoint Set ADT is particularly interested in

letting you group your values into sets and

keep track of which particular set your values are in.

new ADT!

Kevin

Aileen

Keanu

Sherdil

Leona

Set #1

Set #2

CSE 373 23SU

8

9 of 57

New ADT

Set ADT

create(x) - creates a new set with a single member, x

Count of Elements

state

behavior

Set of elements

Elements must be unique!
No required order

add(x) - adds x into set if it is unique, otherwise add is ignored

remove(x) – removes x from set

size() – returns current number of elements in set

Disjoint-Set ADT

makeSet(x) – creates a new set within the disjoint set where the only member is x. Picks representative for set

Count of Sets

state

behavior

Set of Sets

Disjoint: Elements must be unique across sets
No required order
Each set has representative

findSet(x) – looks up the set containing element x, returns representative of that set

union(x, y) – looks up set containing x and set containing y, combines two sets into one. Picks new representative for resulting set

D

B

C

A

D

C

F

B

A

G

H

CSE 373 23SU

9

10 of 57

findSet(value)

findSet(value) returns some ID for which particular set the value is in. For Disjoint Sets, we often call this the representative (as it’s a value that represents the whole set).

Examples:

findSet(Brian)

findSet(Sherdil)

findSet(Santino)

findSet(Kevin) == findSet(Aileen)

4

2

3

true

Kevin

Aileen

Keanu

Set #1

Sherdil

Leona

Set #2

Nishu

Santino

Set #3

Brian

Rahul

Set #4

CSE 373 23SU

10

11 of 57

union(valueA, valueB)

union(valueA, valueB) merges the set that A is in with the set that B is in. (basically add the two sets together into one)

Example: union(Kevin, Nishu)

Kevin

Aileen

Keanu

Set #1

Sherdil

Leona

Set #2

Nishu

Santino

Set #3

Brian

Rahul

Set #4

Kevin

Aileen

Keanu

Set #1

Sherdil

Leona

Set #2

Nishu

Santino

Brian

Rahul

Set #4

CSE 373 23SU

11

12 of 57

makeSet(value)

makeSet(value) makes a new mini set that just has the value parameter in it.

Examples:

makeSet(Elena)

makeSet(Anish)

Kevin

Aileen

Keanu

Set #1

Sherdil

Leona

Set #2

Nishu

Santino

Set #3

Brian

Rahul

Set #4

Set #5

Set #6

Elena

Anisha

CSE 373 23SU

12

13 of 57

Example

b

Rep: 1

a

Rep: 0

e

Rep: 4

c

Rep: 2

d

Rep: 3

new()

makeSet(a)

makeSet(b)

makeSet(c)

makeSet(d)

makeSet(e)

findSet(a)

findSet(d)

union(a, c)

CSE 373 23SU

13

14 of 57

Example

b

Rep: 1

e

Rep: 4

d

Rep: 3

c

a

Rep: 0

new()

makeSet(a)

makeSet(b)

makeSet(c)

makeSet(d)

makeSet(e)

findSet(a)

findSet(d)

union(a, c)

union(b, d)

CSE 373 23SU

14

15 of 57

Example

findSet(a) == findSet(c)

findSet(a) == findSet(d)

e

Rep: 4

c

a

Rep: 0

b

Rep: 1

d

new()

makeSet(a)

makeSet(b)

makeSet(c)

makeSet(d)

makeSet(e)

findSet(a)

findSet(d)

union(a, c)

union(b, d)

true

false

CSE 373 23SU

15

16 of 57

Questions?

CSE 373 23SU

16

17 of 57

Disjoint Set Implementation

Weighted Union

Path Compression

Array Implementation

CSE 373 23SU

17

18 of 57

Implementation

TreeDisjointSet<E>

makeSet(x)-create a new tree of size 1 and add to our forest

state

behavior

Set<TreeSet> forest

findSet(x)-locates node with x and moves up tree to find root

union(x, y)-append tree with y as a child of tree with x

Disjoint-Set ADT

makeSet(x) – creates a new set within the disjoint set where the only member is x. Picks representative for set

Count of Sets

state

behavior

Set of Sets

Disjoint: Elements must be unique across sets
No required order
Each set has representative

findSet(x) – looks up the set containing element x, returns representative of that set

union(x, y) – looks up set containing x and set containing y, combines two sets into one. Picks new representative for resulting set

Map<NodeValues, NodeLocations> nodeInventory

TreeSet<E>

TreeSet(x)

state

behavior

SetNode overallRoot

add(x)

remove(x, y)

getRep()- returns data of overallRoot

SetNode<E>

SetNode(x)

state

behavior

E data

updateParent(x)

SetNode<E> parent

CSE 373 23SU

18

19 of 57

Implementation

Nishu

Santino

Brian

Rahul

TreeDisjointSet<E>

TreeSet<E>

SetNode<E>

1

4

Kevin

Aileen

Keanu

Sherdil

Leona

9

Disjoint Sets are built as a collection of three objects

The TreeDisjointSet<E> is the top level object with two fields:

Set<TreeSet> forest
Map with node references

Each TreeSet<E> is a collection of SetNodes<E>

Each SetNode<E> can have an unlimited number of children, but only one parent. Nodes store parents instead of children.

Set<TreeSet> forest =

Map<E, SetNode<E>> nodeInventory =

CSE 373 23SU

19

20 of 57

Implement makeSet(x)

Worst case runtime?

O(1)

TreeDisjointSet<E>

makeSet(x)-create a new tree of size 1 and add to our forest

state

behavior

Collection<TreeSet> forest

findSet(x)-locates node with x and moves up tree to find root

union(x, y)-append tree with y as a child of tree with x

Dictionary<NodeValues, NodeLocations> nodeInventory

0

1

2

3

4

5

forest

0	1	2	3	4	5

makeSet(0)
makeSet(1)
makeSet(2)
makeSet(3)
makeSet(4)
makeSet(5)

CSE 373 23SU

20

21 of 57

Implement find(X)

Key Idea: Jump to the node given. Travel upward to parent until parent field is null, nodes with null parents are roots and their data will act as the representative for the set

How do we jump to a node quickly?

Store a map from value to its node (Omitted in future slides)

Runtime

jump to node O(1)
travel up to root

based on height of TreeSet
Worst case: O(n) if TreeSet is degenerate Tree

find(Santino) -> Aileen

find(Ken) -> Joyce

find(Santino) != find(Ken)

find(Santino) == find(Aileen)

find(Ken):

jump to Ken node

travel upward until root Joyce

return root “Joyce”

Aileen

Santino

Paul

Joyce

Ken

Sam

Alex

Sam

Alex

Paul

…

Map<E, SetNode<E>> nodeInventory =

CSE 373 23SU

21

22 of 57

Implement union(x, y)

Key idea: easy to simply rearrange pointers to union entire trees together!

it doesn’t matter what the order of the trees are, only that all the nodes from one tree are connected to the other tree

union(Ken, Santino):

rootK = find(Ken) //Joyce

rootS = find(Santino) //Aileen

set rootK to point to rootS

Aileen

Santino

Paul

Joyce

Ken

Sam

Alex

RESULT:

Aileen

Santino

Paul

Joyce

Ken

Sam

Alex

CSE 373 23SU

22

23 of 57

Union: Why bother with the second root?

Key idea: keeping the height of each tree short will help minimize runtime for future find(x)

Pointing directly to the individual element instead of the root can grow the tree height

union(Ken, Santino):

rootK = find(Ken)

rootS = find(Santino)

set rootK to point to rootS

union(Ken, Santino):

rootK = find(Ken)

set rootK to point to Santino

Aileen

Santino

Paul

Joyce

Ken

Sam

Alex

Why not just use:

Aileen

Santino

Paul

Joyce

Ken

Sam

Alex

Height = 3

Height = 4

CSE 373 23SU

23

24 of 57

Union runtime

union(A, B):

rootA = find(A)

rootB = find(B)

set rootA to point to rootB

find(A):

jump to A node

travel upward until root

return ID

A series of calls to union that would create a worst-case runtime for find on these Disjoint Sets:

A

B

C

D

union(A, B)

B

A

C

D

union(B, C)

union(C, D)

find(A)

n runtime :(

CSE 373 23SU

24

25 of 57

Analyzing the union worst case

How did we get a degenerate tree?

Even though pointing a root to a root usually helps with this, we can still get a degenerate tree if we put the root of a large tree under the root of a small tree.
Instead of always putting rootA under rootB what if we could ensure the smaller tree went under the larger tree?

B

A

C

D

union(C, D)

B

A

C

D

What currently happens

What would help avoid degenerate tree

CSE 373 23SU

25

26 of 57

Practice

Given the following disjoint-set what would be the result of the following calls on union if we always add the smaller tree (fewer nodes) into the larger tree (more nodes). Draw the forest at each stage with corresponding ranks for each tree.

6

4

5

0

3

1

2

8

10

12

9

11

7

13

union(2, 13)
union(4, 12)
union(2, 8)

CSE 373 23SU

26

27 of 57

Practice

Given the following disjoint-set what would be the result of the following calls on union if we add the “union-by-weight” optimization. Draw the forest at each stage with corresponding ranks for each tree.

6

4

5

0

3

1

2

8

10

12

9

11

7

13

union(2, 13)

CSE 373 23SU

27

28 of 57

Practice

Given the following disjoint-set what would be the result of the following calls on union if we add the “union-by-weight” optimization. Draw the forest at each stage with corresponding ranks for each tree.

6

4

5

0

3

1

2

8

10

12

9

11

7

13

union(2, 13)
union(4, 12)

CSE 373 23SU

28

29 of 57

Practice

Given the following disjoint-set what would be the result of the following calls on union if we add the “union-by-weight” optimization. Draw the forest at each stage with corresponding ranks for each tree.

6

4

5

0

3

1

2

8

10

12

9

11

7

13

union(2, 13)
union(4, 12)
union(2, 8)

CSE 373 23SU

29

30 of 57

Practice

Given the following disjoint-set what would be the result of the following calls on union if we add the “union-by-weight” optimization. Draw the forest at each stage with corresponding ranks for each tree.

Does this improve the worst case runtimes?
findSet is more likely to be O(log(n)) than O(n)

8

10

12

9

11

union(2, 13)
union(4, 12)
union(2, 8)

6

4

5

0

3

1

2

7

13

CSE 373 23SU

30

31 of 57

Disjoint Set Implementation

Weighted Union

Path Compression

Array Implementation

CSE 373 23SU

31

32 of 57

WeightedUnion

Goal: Always pick the smaller tree to go under the larger tree

Implementation: Store the number of nodes (or “weight”) of each tree in the root

Constant-time lookup instead of having to traverse the entire tree to count

union(A, B):

rootA = find(A)

rootB = find(B)

put lighter root under heavier root

A

union(A, B)

union(B, C)

union(C, D)

find(A)

weight: 1

B

weight: 1

C

weight: 1

D

weight: 1

2

3

4

O(1) runtime :)

CSE 373 23SU

32

33 of 57

WeightedUnion: Performance

Consider the worst case where the tree height grows as fast as possible

0

N	H
1	0

CSE 373 23SU

33

34 of 57

WeightedUnion: Performance

Consider the worst case where the tree height grows as fast as possible

0

1

N	H
1	0
2	1

CSE 373 23SU

34

35 of 57

WeightedUnion: Performance

Consider the worst case where the tree height grows as fast as possible

0

1

2

3

N	H
1	0
2	1
4	?

CSE 373 23SU

35

36 of 57

WeightedUnion: Performance

Consider the worst case where the tree height grows as fast as possible

0

1

2

3

N	H
1	0
2	1
4	2

CSE 373 23SU

36

37 of 57

WeightedUnion: Performance

Consider the worst case where the tree height grows as fast as possible

0

1

2

3

4

5

6

7

N	H
1	0
2	1
4	2
8	?

CSE 373 23SU

37

38 of 57

WeightedUnion: Performance

Consider the worst case where the tree height grows as fast as possible

0

1

2

3

N	H
1	0
2	1
4	2
8	3

4

5

6

7

CSE 373 23SU

38

39 of 57

WeightedUnion: Performance

Consider the worst case where the tree height grows as fast as possible

Worst case tree height is O(log N)

0

1

2

3

N	H
1	0
2	1
4	2
8	3
16	4

4

5

6

7

8

9

10

11

12

13

14

15

CSE 373 23SU

39

40 of 57

Runtime so far…

This is pretty good! But there’s one final optimization we can make:

path compression��

	Worst Case Runtime
makeSet(value)	O(1)
find(value)	O(log n)
union(x, y)	O(log n)

kruskalMST(G graph)

DisjointSets<V> msts; Set finalMST;

initialize msts with each vertex as single-element MST

sort all edges by weight (smallest to largest)

for each edge (u,v) in ascending order:

uMST = msts.find(u)

vMST = msts.find(v)

if (uMST != vMST):

finalMST.add(edge (u, v))

msts.union(uMST, vMST);

O(ElogV)

CSE 373 23SU

40

41 of 57

Disjoint Set Implementation

Weighted Union

Path Compression

Array Implementation

CSE 373 23SU

41

42 of 57

Modifying Data Structures for Future Gains

Thus far, the modifications we’ve studied are designed to preserve invariants

E.g. Performing rotations to preserve the AVL invariant
We rely on those invariants always being true so every call is fast

Path compression is entirely different: we are modifying the tree structure to improve future performance

Not adhering to a specific invariant
The first call may be slow, but will optimize so future calls can be fast

CSE 373 23SU

42

43 of 57

Path Compression: Idea

This is the worst-case topology if we use WeightedUnion��

Key Idea: When we do find(15), move all visited nodes under the root

Additional cost is insignificant (we already have to visit those nodes, just constant time work to point to root too)

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

CSE 373 23SU

43

44 of 57

Path Compression: Idea

This is the worst-case topology if we use WeightedUnion��

Key Idea: When we do find(15), move all visited nodes under the root

Additional cost is insignificant (we already have to visit those nodes, just constant time work to point to root too)

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

Perform Path Compression on every find(), so future calls to find() are faster!

CSE 373 23SU

44

45 of 57

Path Compression: Details and Runtime

Run path compression on every find()!

Including the find()s that are invoked as part of a union()��

Understanding the performance of M>1 operations requires amortized analysis

Effectively averaging out rare events over many common ones
Typically used for “In-Practice” case

E.g. when we assume an array doesn’t resize “in practice”, we can do that because the rare resizing calls are amortized over many faster calls

In 373 we don’t go in-depth on amortized analysis

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

CSE 373 23SU

45

46 of 57

Path Compression: Runtime

M find()s on WeightedUnion requires takes O(M log N)��

… but M find()s using the WeightedUnion and PathCompression optimizations takes O(M log*N)!

log*n is the “iterated log”: the number of times you need to apply log to n before it’s <= 1
Note: log* is a loose bound�

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

CSE 373 23SU

46

47 of 57

Path Compression: Runtime

Path compression results in find()s and union()s that are very very close to (amortized) constant time

log* is less than 5 for any realistic input
If M find()s/union()s on N nodes is O(M log*N)�and log*N ≈ 5, then find()/union()s amortizes�to O(1)! 🤯

N	log* N
1	0
2	1
4	2
16	3
65536	4
2⁶⁵⁵³⁶	5

2¹⁶

Number of atoms in the known universe is 2²⁵⁶ish

CSE 373 23SU

47

48 of 57

Kruskal’s Runtime

Find and union are log|V| in worst case, but amortized constant “in practice”

Either way, dominated by time to sort the edges ☹

For an MST to exist, E can’t be smaller than V, so assume it dominates
Note: some people write |E|log|V|, which is the same (within a constant factor)

kruskalMST(G graph)

DisjointSets<V> msts; Set finalMST;

initialize msts with each vertex as single-element MST

sort all edges by weight (smallest to largest)

for each edge (u,v) in ascending order:

uMST = msts.find(u)

vMST = msts.find(v)

if (uMST != vMST):

finalMST.add(edge (u, v))

msts.union(uMST, vMST)

CSE 373 23SU

48

49 of 57

Disjoint Set Implementation

Weighted Union

Path Compression

Array Implementation

CSE 373 23SU

49

50 of 57

Using Arrays for Up-Trees

Since every node can have at most one parent, what if we use an array to store the parent relationships?

Proposal: each node corresponds to an index, where we store the index of the parent (or –1 for roots). Use the root index as the representative ID!

Just like with heaps, tree picture still conceptually correct, but exists in our minds!

Aileen

Santino

Paul

Joyce

Ken

Sam

Alex

0	1	2	3	4	5	6
-1	0	-1	6	-1	2	0
Joyce	Sam	Aileen	Alex	Paul	Santino	Ken

CSE 373 23SU

50

51 of 57

Using Arrays: Find

Initial jump to element still done with extra Map

But traversing up the tree can be done purely within the array!

0	1	2	3	4	5	6
-1	0	-1	6	-1	2	0
Joyce	Sam	Aileen	Alex	Paul	Santino	Ken

Aileen

Santino

Paul

Joyce

Ken

Sam

Alex

Aileen

Sam

…

find(A):

index = jump to A node’s index

while array[index] > 0:

index = array[index]

path compression

return index

1

2

find(Alex)

1

2

= 0

Can still do path compression by setting all indices along the way to the root index!

0

3

CSE 373 23SU

51

52 of 57

Using Arrays: Union

For WeightedUnion, we need to store the number of nodes in each tree (the weight)

Instead of just storing -1 to indicate a root, we can store -1 * weight!

0	1	2	3	4	5	6
-4	0	-2	6	-1	2	0
Joyce	Sam	Aileen	Alex	Paul	Santino	Ken

Aileen

Santino

Joyce

Ken

Sam

Alex

union(A, B):

rootA = find(A)

rootB = find(B)

use -1 * array[rootA] and -1 *

array[rootB] to determine weights

put lighter root under heavier root

weight 4

weight 2

union(Ken, Santino)

Paul

weight 1

Aileen

Santino

Paul

Joyce

Ken

Sam

Alex

CSE 373 23SU

52

53 of 57

Using Arrays: Union

For WeightedUnion, we need to store the number of nodes in each tree (the weight)

Instead of just storing -1 to indicate a root, we can store -1 * weight!

0	1	2	3	4	5	6
-4	0	-2	6	-1	2	0
Joyce	Sam	Aileen	Alex	Paul	Santino	Ken

union(A, B):

rootA = find(A)

rootB = find(B)

use -1 * array[rootA] and -1 *

array[rootB] to determine weights

put lighter root under heavier root

-6

0

Aileen (2)

Santino

Joyce

Ken

Sam

Alex

weight 6

Paul

weight 1

Aileen

union(Ken, Santino)

CSE 373 23SU

53

54 of 57

Array Implementation Practice

1

6

3

weight = 1

4

2

10

5

7

0

9

8

11

15

13

weight = 8

14

12

17

16

18

weight = 10

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18

Fill in the array representing this DisJoint set. Remember to Store (weight * -1) - 1 as the “parent” of the root nodes

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18
-1	-10	1	2	2	2	1	6	7	7	6	-8	11	12	12	11	15	15	17

Each “node” now only takes 4 bytes of memory instead of 32

CSE 373 23SU

54

55 of 57

Array Implementation Practice

55

3

0

weight = 1

4

11

1

5

2

13

12

weight = 8

10

9

14

15

8

weight = 6

weight = 2

6

7

16

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
3	0	0	-6	3	-1	-2	6	12	13	13	0	13	-8	12	12	12

Update the Array with the correct values after a call of union(14, 11) using WeightedUnion and PathCompression

CSE 373 23SU

55

56 of 57

Array Implementation Practice

56

3

0

weight = 1

4

11

1

5

2

13

12

weight = 14

10

9

14

15

8

weight = 2

6

7

16

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
3	0	0	13	3	-1	-2	6	12	13	13	13	13	-14	13	12	12

Update the Array with the correct values after a call of union(14, 11) using WeightedUnion and PathCompression

CSE 373 23SU

56

57 of 57

That’s all!

CSE 373 23SU

57