1 of 43

Lecture 18: Disjoint Sets

CSE 373: Data Structures and Algorithms

CSE 373 2` SP – CHAMPION

1

2 of 43

Announcements

P4 due in 2 weeks on Wednesday August 17th

extra credit dynamic programming part worth 8 points

E4 due Friday, EX5 out Friday

3 of 43

Practice - no poll today

Given the following disjoint-set what would be the result of the following calls on union if we always add the smaller tree (fewer nodes) into the larger tree (more nodes). Draw the forest at each stage with corresponding ranks for each tree.

CSE 373 SP 21 - CHAMPION

3

6

4

5

0

3

1

2

8

10

12

9

11

7

13

union(2, 13)
union(4, 12)
union(2, 8)

4 of 43

Practice

Given the following disjoint-set what would be the result of the following calls on union if we add the “union-by-weight” optimization. Draw the forest at each stage with corresponding ranks for each tree.

CSE 373 SP 21 - CHAMPION

4

6

4

5

0

3

1

2

8

10

12

9

11

7

13

union(2, 13)

5 of 43

Practice

Given the following disjoint-set what would be the result of the following calls on union if we add the “union-by-weight” optimization. Draw the forest at each stage with corresponding ranks for each tree.

CSE 373 SP 21 - CHAMPION

5

6

4

5

0

3

1

2

8

10

12

9

11

7

13

union(2, 13)
union(4, 12)

6 of 43

Practice

Given the following disjoint-set what would be the result of the following calls on union if we add the “union-by-weight” optimization. Draw the forest at each stage with corresponding ranks for each tree.

CSE 373 SP 21 - CHAMPION

6

4

5

0

3

1

2

8

10

12

9

11

7

13

union(2, 13)
union(4, 12)
union(2, 8)

7 of 43

Practice

Given the following disjoint-set what would be the result of the following calls on union if we add the “union-by-weight” optimization. Draw the forest at each stage with corresponding ranks for each tree.

CSE 373 SP 21 - CHAMPION

7

8

10

12

9

11

union(2, 13)
union(4, 12)
union(2, 8)

6

4

5

0

3

1

2

7

13

Does this improve the worst case runtimes?
findSet is more likely to be O(log(n)) than O(n)

8 of 43

New ADT

CSE 373 SP 18 - KASEY CHAMPION

8

Set ADT

create(x) - creates a new set with a single member, x

Count of Elements

state

behavior

Set of elements

Elements must be unique!
No required order

add(x) - adds x into set if it is unique, otherwise add is ignored

remove(x) – removes x from set

size() – returns current number of elements in set

Disjoint-Set ADT

makeSet(x) – creates a new set within the disjoint set where the only member is x. Picks representative for set

Count of Sets

state

behavior

Set of Sets

Disjoint: Elements must be unique across sets
No required order
Each set has representative

findSet(x) – looks up the set containing element x, returns representative of that set

union(x, y) – looks up set containing x and set containing y, combines two sets into one. Picks new representative for resulting set

D

B

C

A

D

C

F

B

A

G

H

9 of 43

Implementation

CSE 373 SP 18 - KASEY CHAMPION

9

TreeDisjointSet<E>

makeSet(x)-create a new tree of size 1 and add to our forest

state

behavior

Collection<TreeSet> forest

findSet(x)-locates node with x and moves up tree to find root

union(x, y)-append tree with y as a child of tree with x

Disjoint-Set ADT

makeSet(x) – creates a new set within the disjoint set where the only member is x. Picks representative for set

Count of Sets

state

behavior

Set of Sets

Disjoint: Elements must be unique across sets
No required order
Each set has representative

findSet(x) – looks up the set containing element x, returns representative of that set

union(x, y) – looks up set containing x and set containing y, combines two sets into one. Picks new representative for resulting set

Dictionary<NodeValues, NodeLocations> nodeInventory

TreeSet<E>

TreeSet(x)

state

behavior

SetNode overallRoot

add(x)

remove(x, y)

getRep()-returns data of overallRoot

SetNode<E>

SetNode(x)

state

behavior

E data

addChild(x)

removeChild(x, y)

Collection<SetNode> children

10 of 43

Review QuickFind vs. QuickUnion

Joyce, Sam, Ken, Alex

Aileen, Santino

Paul

DISJOINT SETS ADT

QuickFind

QuickUnion

map from value to representative ID

Aileen

Joyce

Santino

Sam

Ken

1

2

1

Alex

Paul

2

3

Aileen (1)

Santino

Paul (3)

Joyce (2)

Ken

Sam

Alex

trees of values with representative ID at each root

	(Baseline)	QuickFind	QuickUnion
makeSet(value)	𝚯(1)	𝚯(1)	𝚯(1)
find(value)	𝚯(n)	𝚯(1)	𝚯(n)
union(x, y)	𝚯(n)	𝚯(n)	𝚯(1)

Could also use one element from each set (e.g. the root) as its representative: only uniqueness matters

11 of 43

Review QuickUnion: Why Use Both Roots?

Example: result of union(Ken, Santino) on these Disjoint Sets given three possible implementations:

union(A, B):

rootA = find(A)

rootB = find(B)

set rootA to point to rootB

union(A, B):

rootB = find(B)

set A to point to rootB

union(A, B):

rootA = find(A)

set rootA to point to B

Aileen (1)

Santino

Paul (3)

Joyce

Ken

Sam

Alex

Aileen (1)

Santino

Paul (3)

Joyce (2)

Ken

Sam

Alex

Aileen (1)

Santino

Paul (3)

Joyce

Ken

Sam

Alex

Aileen (1)

Santino

Paul (3)

Joyce (2)

Ken

Sam

Alex

Correct: Everything in Ken’s set now connected to everything in Santino’s set!

Incorrect: Ken and Joyce were connected before; the union operation shouldn’t remove connections.

Inefficient: Technically correct, but increases height of the up-tree so makes

12 of 43

QuickUnion: Checking in on those runtimes

	Maps to Sets	QuickFind	QuickUnion
makeSet(value)	𝚯(1)	𝚯(1)	𝚯(1)
findSet(value)	𝚯(n)	𝚯(1)	𝚯(n)
union(x, y)	𝚯(n)	𝚯(n)	𝚯(1)

union(A, B):

rootA = find(A)

rootB = find(B)

set rootA to point to rootB

kruskalMST(G graph)

DisjointSets<V> msts; Set finalMST;

initialize msts with each vertex as single-element MST

sort all edges by weight (smallest to largest)

for each edge (u,v) in ascending order:

uMST = msts.find(u)

vMST = msts.find(v)

if (uMST != vMST):

finalMST.add(edge (u, v))

msts.union(uMST, vMST);

*

13 of 43

QuickUnion: Let’s Build a Worst Case

union(A, B):

rootA = find(A)

rootB = find(B)

set rootA to point to rootB

find(A):

jump to A node

travel upward until root

return ID

Even with the ”use-the-roots” implementation of union, try to come up with a series of calls to union that would create a worst-case runtime for find on these Disjoint Sets:

A

B

C

D

14 of 43

QuickUnion: Let’s Build a Worst Case

union(A, B):

rootA = find(A)

rootB = find(B)

set rootA to point to rootB

find(A):

jump to A node

travel upward until root

return ID

Even with the ”use-the-roots” implementation of union, try to come up with a series of calls to union that would create a worst-case runtime for find on these Disjoint Sets:

A

B

C

D

union(A, B)

union(B, C)

union(C, D)

find(A)

B

A

C

D

15 of 43

Analyzing the QuickUnion Worst Case

How did we get a degenerate tree?

Even though pointing a root to a root usually helps with this, we can still get a degenerate tree if we put the root of a large tree under the root of a small tree.
In QuickUnion, rootA always goes under rootB

But what if we could ensure the smaller tree went under the larger tree?

B

A

C

D

union(C, D)

B

A

C

D

What currently happens

What would help avoid degenerate tree

16 of 43

WeightedQuickUnion

Goal: Always pick the smaller tree to go under the larger tree
Implementation: Store the number of nodes (or “weight”) of each tree in the root

Constant-time lookup instead of having to traverse the entire tree to count

union(A, B):

rootA = find(A)

rootB = find(B)

put lighter root under heavier root

union(A, B)

union(B, C)

union(C, D)

find(A)

A

B

C

D

Now what happens?

B

A

C

D

Perfect! Best runtime we can get.

17 of 43

WeightedQuickUnion: Performance

union()’s runtime is still dependent on find()’s runtime, which is a function of the tree’s height
What’s the worst-case height for WeightedQuickUnion?

union(A, B):

rootA = find(A)

rootB = find(B)

put lighter root under heavier root

18 of 43

WeightedQuickUnion: Performance

Consider the worst case where the tree height grows as fast as possible

0

N	H
1	0

19 of 43

WeightedQuickUnion: Performance

Consider the worst case where the tree height grows as fast as possible

0

1

N	H
1	0
2	1

20 of 43

WeightedQuickUnion: Performance

Consider the worst case where the tree height grows as fast as possible

0

1

2

3

N	H
1	0
2	1
4	?

21 of 43

WeightedQuickUnion: Performance

Consider the worst case where the tree height grows as fast as possible

0

1

2

3

N	H
1	0
2	1
4	2

22 of 43

WeightedQuickUnion: Performance

Consider the worst case where the tree height grows as fast as possible

0

1

2

3

4

5

6

7

N	H
1	0
2	1
4	2
8	?

23 of 43

WeightedQuickUnion: Performance

Consider the worst case where the tree height grows as fast as possible

0

1

2

3

N	H
1	0
2	1
4	2
8	3

4

5

6

7

24 of 43

WeightedQuickUnion: Performance

0

1

2

3

N	H
1	0
2	1
4	2
8	3
16	4

4

5

6

7

8

9

10

11

12

13

14

15

Consider the worst case where the tree height grows as fast as possible
Worst case tree height is Θ(log N)

25 of 43

Why Weights Instead of Heights?

We used the number of items in a tree to decide upon the root�
Why not use the height of the tree?

HeightedQuickUnion’s runtime is asymptotically the same: Θ(log(n))
It’s easier to track weights than heights, even though WeightedQuickUnion can lead to some suboptimal structures like this one:

1

2

0

4

6

5

3

8

9

7

+

1

2

0

4

6

5

3

8

9

7

26 of 43

WeightedQuickUnion Runtime

This is pretty good! But there’s one final optimization we can make: path compression��

	Maps to Sets	QuickFind	QuickUnion	WeightedQuickUnion
makeSet(value)	𝚯(1)	𝚯(1)	𝚯(1)	𝚯(1)
find(value)	𝚯(n)	𝚯(1)	𝚯(n)	𝚯(log n)
union(x, y) assuming root args	𝚯(n)	𝚯(n)	𝚯(1)	𝚯(1)
union(x, y)	𝚯(n)	𝚯(n)	𝚯(n)	𝚯(log n)

Aileen

Joyce

Santino

Sam

Ken

1

2

1

B

A

C

D

1

2

3

4

5

6

27 of 43

Modifying Data Structures for Future Gains

Thus far, the modifications we’ve studied are designed to preserve invariants

E.g. Performing rotations to preserve the AVL invariant
We rely on those invariants always being true so every call is fast

Path compression is entirely different: we are modifying the tree structure to improve future performance

Not adhering to a specific invariant
The first call may be slow, but will optimize so future calls can be fast

28 of 43

Path Compression: Idea

This is the worst-case topology if we use WeightedQuickUnion��
Idea: When we do find(15), move all visited nodes under the root

Additional cost is insignificant (we already have to visit those nodes, just constant time work to point to root too)

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

29 of 43

Path Compression: Idea

This is the worst-case topology if we use WeightedQuickUnion��
Idea: When we do find(15), move all visited nodes under the root

Additional cost is insignificant (we already have to visit those nodes, just constant time work to point to root too)

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

Perform Path Compression on every find(), so future calls to find() are faster!

30 of 43

Path Compression: Details and Runtime

Run path compression on every find()!

Including the find()s that are invoked as part of a union()��

Understanding the performance of M>1 operations requires amortized analysis

Effectively averaging out rare events over many common ones
Typically used for “In-Practice” case

E.g. when we assume an array doesn’t resize “in practice”, we can do that because the rare resizing calls are amortized over many faster calls

In 373 we don’t go in-depth on amortized analysis

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

31 of 43

Path Compression: Runtime

M find()s on WeightedQuickUnion requires takes Θ(M log N)��

… but M find()s on WeightedQuickUnionWithPathCompression takes O(M log*N)!

log*n is the “iterated log”: the number of times you need to apply log to n before it’s <= 1
Note: log* is a loose bound�

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

32 of 43

Path Compression: Runtime

Path compression results in find()s and union()s that are very very close to (amortized) constant time

log* is less than 5 for any realistic input
If M find()s/union()s on N nodes is O(M log*N)�and log*N ≈ 5, then find()/union()s amortizes�to O(1)! 🤯

N	log* N
1	0
2	1
4	2
16	3
65536	4
2⁶⁵⁵³⁶	5

2¹⁶

Number of atoms in the known universe is 2²⁵⁶ish

33 of 43

WQU + Path Compression Runtime

And if log* n <= 5 for any reasonable input…

We’ve just witnessed an incredible feat of data structure engineering: every operation is constant!?*
*Caveat: amortized constant, in the “in-practice” case; still logarithmic in the worst case!

	(Baseline)	QuickFind	QuickUnion	WeightedQuickUnion	WQU + Path Compression
makeSet(value)	𝚯(1)	𝚯(1)	𝚯(1)	𝚯(1)	𝚯(1)
find(value)	𝚯(n)	𝚯(1)	𝚯(n)	𝚯(log n)	𝚯(log* n)
union(x, y) assuming root args	𝚯(n)	𝚯(n)	𝚯(1)	𝚯(1)	𝚯(1)
union(x, y)	𝚯(n)	𝚯(n)	𝚯(n)	𝚯(log n)	𝚯(log* n)

In-Practice Runtimes:

34 of 43

Disjoint Sets Implementation

	(Baseline)	QuickFind	QuickUnion	WeightedQuickUnion	WQU + Path Compression
makeSet(value)	𝚯(1)	𝚯(1)	𝚯(1)	𝚯(1)	𝚯(1)
find(value)	𝚯(n)	𝚯(1)	𝚯(n)	𝚯(log n)	𝚯(log* n)
union(x, y) assuming root args	𝚯(n)	𝚯(n)	𝚯(1)	𝚯(1)	𝚯(1)
union(x, y)	𝚯(n)	𝚯(n)	𝚯(n)	𝚯(log n)	𝚯(log* n)

In-Practice Runtimes:

Aileen

Joyce

Santino

Sam

Ken

1

2

1

B

A

C

D

0

1

2

3

4

5

6

7

0

1

2

3

4

5

35 of 43

Kruskal’s Runtime

find and union are log|V| in worst case, but amortized constant “in practice”
Either way, dominated by time to sort the edges ☹

For an MST to exist, E can’t be smaller than V, so assume it dominates
Note: some people write |E|log|V|, which is the same (within a constant factor)

kruskalMST(G graph)

DisjointSets<V> msts; Set finalMST;

initialize msts with each vertex as single-element MST

sort all edges by weight (smallest to largest)

for each edge (u,v) in ascending order:

uMST = msts.find(u)

vMST = msts.find(v)

if (uMST != vMST):

finalMST.add(edge (u, v))

msts.union(uMST, vMST)

36 of 43

Using Arrays for Up-Trees

Since every node can have at most one parent, what if we use an array to store the parent relationships?
Proposal: each node corresponds to an index, where we store the index of the parent (or –1 for roots). Use the root index as the representative ID!
Just like with heaps, tree picture still conceptually correct, but exists in our minds!

Aileen (2)

Santino

Paul (4)

Joyce (0)

Ken

Sam

Alex

0	1	2	3	4	5	6
-1	0	-1	6	-1	2	0
Joyce	Sam	Aileen	Alex	Paul	Santino	Ken

37 of 43

Using Arrays: Find

Initial jump to element still done with extra Map
But traversing up the tree can be done purely within the array!

0	1	2	3	4	5	6
-1	0	-1	6	-1	2	0
Joyce	Sam	Aileen	Alex	Paul	Santino	Ken

Aileen (2)

Santino

Paul (4)

Joyce (0)

Ken

Sam

Alex

Aileen

Sam

…

find(A):

index = jump to A node’s index

while array[index] > 0:

index = array[index]

path compression

return index

1

2

find(Alex)

1

2

= 0

Can still do path compression by setting all indices along the way to the root index!

0

3

38 of 43

Using Arrays: Union

For WeightedQuickUnion, we need to store the number of nodes in each tree (the weight)
Instead of just storing -1 to indicate a root, we can store -1 * weight!

0	1	2	3	4	5	6
-4	0	-2	6	-1	2	0
Joyce	Sam	Aileen	Alex	Paul	Santino	Ken

Aileen (2)

Santino

Joyce (0)

Ken

Sam

Alex

union(A, B):

rootA = find(A)

rootB = find(B)

use -1 * array[rootA] and -1 *

array[rootB] to determine weights

put lighter root under heavier root

weight 4

weight 2

union(Ken, Santino)

Paul (4)

weight 1

39 of 43

Using Arrays: Union

For WeightedQuickUnion, we need to store the number of nodes in each tree (the weight)
Instead of just storing -1 to indicate a root, we can store -1 * weight!

0	1	2	3	4	5	6
-4	0	-2	6	-1	2	0
Joyce	Sam	Aileen	Alex	Paul	Santino	Ken

union(A, B):

rootA = find(A)

rootB = find(B)

use -1 * array[rootA] and -1 *

array[rootB] to determine weights

put lighter root under heavier root

-6

0

Aileen (2)

Santino

Joyce (0)

Ken

Sam

Alex

weight 6

Paul (4)

weight 1

Aileen

union(Ken, Santino)

40 of 43

Array Implementation

CSE 373 SP 18 - KASEY CHAMPION

40

1

6

3

rank = 0

4

2

10

5

7

0

9

8

11

15

13

rank = 3

14

12

17

16

18

rank = 3

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18
-1	-1	1	2	2	2	1	6	7	7	6	-1	11	12	12	11	15	15	17

Store (rank * -1) - 1

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18
-1	-4	1	2	2	2	1	6	7	7	6	-4	11	12	12	11	15	15	17

Each “node” now only takes 4 bytes of memory instead of 32

41 of 43

Practice

CSE 373 SP 18 - KASEY CHAMPION

41

3

0

rank = 0

4

11

1

5

2

13

12

rank = 2

10

9

14

15

8

rank = 2

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16

rank = 1

6

7

16

0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
3	0	0	-6	3	-1	-2	6	12	13	13	0	13	-3	12	12	12

42 of 43

Using Arrays for WQU+PC

Same asymptotic runtime as using tree nodes, but check out all these other benefits:

More compact in memory
Better spatial locality, leading to better constant factors from cache usage
Simplify the implementation!

	(Baseline)	QuickFind	QuickUnion	WeightedQuickUnion	WQU + Path Compression	ArrayWQU+PC
makeSet(value)	𝚯(1)	𝚯(1)	𝚯(1)	𝚯(1)	𝚯(1)	𝚯(1)
find(value)	𝚯(n)	𝚯(1)	𝚯(n)	𝚯(log n)	𝚯(log* n)	𝚯(log* n)
union(x, y) assuming root args	𝚯(n)	𝚯(n)	𝚯(1)	𝚯(1)	𝚯(1)	𝚯(1)
union(x, y)	𝚯(n)	𝚯(n)	𝚯(n)	𝚯(log n)	𝚯(log* n)	𝚯(log* n)

43 of 43

Appendix