1 of 43

Lecture 18: Disjoint Sets

CSE 373: Data Structures and Algorithms

CSE 373 2` SP – CHAMPION

1

2 of 43

Announcements

  • P4 due in 2 weeks on Wednesday August 17th
    • extra credit dynamic programming part worth 8 points
  • E4 due Friday, EX5 out Friday

3 of 43

Practice - no poll today

  • Given the following disjoint-set what would be the result of the following calls on union if we always add the smaller tree (fewer nodes) into the larger tree (more nodes). Draw the forest at each stage with corresponding ranks for each tree.

CSE 373 SP 21 - CHAMPION

3

6

4

5

0

3

1

2

8

10

12

9

11

7

13

  • union(2, 13)
  • union(4, 12)
  • union(2, 8)

4 of 43

Practice

  • Given the following disjoint-set what would be the result of the following calls on union if we add the β€œunion-by-weight” optimization. Draw the forest at each stage with corresponding ranks for each tree.

CSE 373 SP 21 - CHAMPION

4

6

4

5

0

3

1

2

8

10

12

9

11

7

13

  • union(2, 13)

5 of 43

Practice

  • Given the following disjoint-set what would be the result of the following calls on union if we add the β€œunion-by-weight” optimization. Draw the forest at each stage with corresponding ranks for each tree.

CSE 373 SP 21 - CHAMPION

5

6

4

5

0

3

1

2

8

10

12

9

11

7

13

  • union(2, 13)
  • union(4, 12)

6 of 43

Practice

  • Given the following disjoint-set what would be the result of the following calls on union if we add the β€œunion-by-weight” optimization. Draw the forest at each stage with corresponding ranks for each tree.

CSE 373 SP 21 - CHAMPION

6

6

4

5

0

3

1

2

8

10

12

9

11

7

13

  • union(2, 13)
  • union(4, 12)
  • union(2, 8)

7 of 43

Practice

  • Given the following disjoint-set what would be the result of the following calls on union if we add the β€œunion-by-weight” optimization. Draw the forest at each stage with corresponding ranks for each tree.

CSE 373 SP 21 - CHAMPION

7

8

10

12

9

11

  • union(2, 13)
  • union(4, 12)
  • union(2, 8)

6

4

5

0

3

1

2

7

13

  • Does this improve the worst case runtimes?
  • findSet is more likely to be O(log(n)) than O(n)

8 of 43

New ADT

CSE 373 SP 18 - KASEY CHAMPION

8

Set ADT

create(x) - creates a new set with a single member, x

Count of Elements

state

behavior

Set of elements

  • Elements must be unique!
  • No required order

add(x) - adds x into set if it is unique, otherwise add is ignored

remove(x) – removes x from set

size() – returns current number of elements in set

Disjoint-Set ADT

makeSet(x) – creates a new set within the disjoint set where the only member is x. Picks representative for set

Count of Sets

state

behavior

Set of Sets

  • Disjoint: Elements must be unique across sets
  • No required order
  • Each set has representative

findSet(x) – looks up the set containing element x, returns representative of that set

union(x, y) – looks up set containing x and set containing y, combines two sets into one. Picks new representative for resulting set

D

B

C

A

D

C

F

B

A

G

H

9 of 43

Implementation

CSE 373 SP 18 - KASEY CHAMPION

9

TreeDisjointSet<E>

makeSet(x)-create a new tree of size 1 and add to our forest

state

behavior

Collection<TreeSet> forest

findSet(x)-locates node with x and moves up tree to find root

union(x, y)-append tree with y as a child of tree with x

Disjoint-Set ADT

makeSet(x) – creates a new set within the disjoint set where the only member is x. Picks representative for set

Count of Sets

state

behavior

Set of Sets

  • Disjoint: Elements must be unique across sets
  • No required order
  • Each set has representative

findSet(x) – looks up the set containing element x, returns representative of that set

union(x, y) – looks up set containing x and set containing y, combines two sets into one. Picks new representative for resulting set

Dictionary<NodeValues, NodeLocations> nodeInventory

TreeSet<E>

TreeSet(x)

state

behavior

SetNode overallRoot

add(x)

remove(x, y)

getRep()-returns data of overallRoot

SetNode<E>

SetNode(x)

state

behavior

E data

addChild(x)

removeChild(x, y)

Collection<SetNode> children

10 of 43

Review QuickFind vs. QuickUnion

Joyce, Sam, Ken, Alex

Aileen, Santino

Paul

DISJOINT SETS ADT

QuickFind

QuickUnion

map from value to representative ID

Aileen

Joyce

Santino

Sam

Ken

1

2

2

2

1

Alex

Paul

2

3

Aileen (1)

Santino

Paul (3)

Joyce (2)

Ken

Sam

Alex

trees of values with representative ID at each root

(Baseline)

QuickFind

QuickUnion

makeSet(value)

𝚯(1)

𝚯(1)

𝚯(1)

find(value)

𝚯(n)

𝚯(1)

𝚯(n)

union(x, y)

𝚯(n)

𝚯(n)

𝚯(1)

Could also use one element from each set (e.g. the root) as its representative: only uniqueness matters

11 of 43

Review QuickUnion: Why Use Both Roots?

Example: result of union(Ken, Santino) on these Disjoint Sets given three possible implementations:

union(A, B):

rootA = find(A)

rootB = find(B)

set rootA to point to rootB

union(A, B):

rootB = find(B)

set A to point to rootB

union(A, B):

rootA = find(A)

set rootA to point to B

Aileen (1)

Santino

Paul (3)

Joyce

Ken

Sam

Alex

Aileen (1)

Santino

Paul (3)

Joyce (2)

Ken

Sam

Alex

Aileen (1)

Santino

Paul (3)

Joyce

Ken

Sam

Alex

Aileen (1)

Santino

Paul (3)

Joyce (2)

Ken

Sam

Alex

Correct: Everything in Ken’s set now connected to everything in Santino’s set!

Incorrect: Ken and Joyce were connected before; the union operation shouldn’t remove connections.

Inefficient: Technically correct, but increases height of the up-tree so makes

12 of 43

QuickUnion: Checking in on those runtimes

  • Β 

Maps to Sets

QuickFind

QuickUnion

makeSet(value)

𝚯(1)

𝚯(1)

𝚯(1)

findSet(value)

𝚯(n)

𝚯(1)

𝚯(n)

union(x, y)

𝚯(n)

𝚯(n)

𝚯(1)

union(A, B):

rootA = find(A)

rootB = find(B)

set rootA to point to rootB

kruskalMST(G graph)

DisjointSets<V> msts; Set finalMST;

initialize msts with each vertex as single-element MST

sort all edges by weight (smallest to largest)

for each edge (u,v) in ascending order:

uMST = msts.find(u)

vMST = msts.find(v)

if (uMST != vMST):

finalMST.add(edge (u, v))

msts.union(uMST, vMST);

*

*

13 of 43

QuickUnion: Let’s Build a Worst Case

union(A, B):

rootA = find(A)

rootB = find(B)

set rootA to point to rootB

find(A):

jump to A node

travel upward until root

return ID

Even with the ”use-the-roots” implementation of union, try to come up with a series of calls to union that would create a worst-case runtime for find on these Disjoint Sets:

A

B

C

D

14 of 43

QuickUnion: Let’s Build a Worst Case

union(A, B):

rootA = find(A)

rootB = find(B)

set rootA to point to rootB

find(A):

jump to A node

travel upward until root

return ID

Even with the ”use-the-roots” implementation of union, try to come up with a series of calls to union that would create a worst-case runtime for find on these Disjoint Sets:

A

B

C

D

union(A, B)

union(B, C)

union(C, D)

find(A)

B

A

C

D

15 of 43

Analyzing the QuickUnion Worst Case

  • How did we get a degenerate tree?
    • Even though pointing a root to a root usually helps with this, we can still get a degenerate tree if we put the root of a large tree under the root of a small tree.
    • In QuickUnion, rootA always goes under rootB
      • But what if we could ensure the smaller tree went under the larger tree?

B

A

C

D

union(C, D)

B

A

C

D

What currently happens

What would help avoid degenerate tree

16 of 43

WeightedQuickUnion

  • Goal: Always pick the smaller tree to go under the larger tree
  • Implementation: Store the number of nodes (or β€œweight”) of each tree in the root
    • Constant-time lookup instead of having to traverse the entire tree to count

union(A, B):

rootA = find(A)

rootB = find(B)

put lighter root under heavier root

union(A, B)

union(B, C)

union(C, D)

find(A)

A

B

C

D

Now what happens?

B

A

C

D

Perfect! Best runtime we can get.

17 of 43

WeightedQuickUnion: Performance

  • union()’s runtime is still dependent on find()’s runtime, which is a function of the tree’s height
  • What’s the worst-case height for WeightedQuickUnion?

union(A, B):

rootA = find(A)

rootB = find(B)

put lighter root under heavier root

18 of 43

WeightedQuickUnion: Performance

  • Consider the worst case where the tree height grows as fast as possible

0

N

H

1

0

19 of 43

WeightedQuickUnion: Performance

  • Consider the worst case where the tree height grows as fast as possible

0

1

N

H

1

0

2

1

20 of 43

WeightedQuickUnion: Performance

  • Consider the worst case where the tree height grows as fast as possible

0

1

2

3

N

H

1

0

2

1

4

?

21 of 43

WeightedQuickUnion: Performance

  • Consider the worst case where the tree height grows as fast as possible

0

1

2

3

N

H

1

0

2

1

4

2

22 of 43

WeightedQuickUnion: Performance

  • Consider the worst case where the tree height grows as fast as possible

0

1

2

3

4

5

6

7

N

H

1

0

2

1

4

2

8

?

23 of 43

WeightedQuickUnion: Performance

  • Consider the worst case where the tree height grows as fast as possible

0

1

2

3

N

H

1

0

2

1

4

2

8

3

4

5

6

7

24 of 43

WeightedQuickUnion: Performance

0

1

2

3

N

H

1

0

2

1

4

2

8

3

16

4

4

5

6

7

8

9

10

11

12

13

14

15

  • Consider the worst case where the tree height grows as fast as possible
  • Worst case tree height is Θ(log N)

25 of 43

Why Weights Instead of Heights?

  • We used the number of items in a tree to decide upon the rootοΏ½
  • Why not use the height of the tree?
    • HeightedQuickUnion’s runtime is asymptotically the same: Θ(log(n))
    • It’s easier to track weights than heights, even though WeightedQuickUnion can lead to some suboptimal structures like this one:

1

2

0

4

6

5

3

8

9

7

+

1

2

0

4

6

5

3

8

9

7

26 of 43

WeightedQuickUnion Runtime

  • This is pretty good! But there’s one final optimization we can make: path compressionοΏ½οΏ½οΏ½οΏ½

Maps to Sets

QuickFind

QuickUnion

WeightedQuickUnion

makeSet(value)

𝚯(1)

𝚯(1)

𝚯(1)

𝚯(1)

find(value)

𝚯(n)

𝚯(1)

𝚯(n)

𝚯(log n)

union(x, y)

assuming root args

𝚯(n)

𝚯(n)

𝚯(1)

𝚯(1)

union(x, y)

𝚯(n)

𝚯(n)

𝚯(n)

𝚯(log n)

Aileen

Joyce

Santino

Sam

Ken

1

2

2

2

1

B

A

C

D

1

2

3

4

5

6

27 of 43

Modifying Data Structures for Future Gains

  • Thus far, the modifications we’ve studied are designed to preserve invariants
    • E.g. Performing rotations to preserve the AVL invariant
    • We rely on those invariants always being true so every call is fast

  • Path compression is entirely different: we are modifying the tree structure to improve future performance
    • Not adhering to a specific invariant
    • The first call may be slow, but will optimize so future calls can be fast

28 of 43

Path Compression: Idea

  • This is the worst-case topology if we use WeightedQuickUnionοΏ½οΏ½οΏ½οΏ½οΏ½οΏ½οΏ½
  • Idea: When we do find(15), move all visited nodes under the root
    • Additional cost is insignificant (we already have to visit those nodes, just constant time work to point to root too)

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

29 of 43

Path Compression: Idea

  • This is the worst-case topology if we use WeightedQuickUnionοΏ½οΏ½οΏ½οΏ½οΏ½οΏ½οΏ½
  • Idea: When we do find(15), move all visited nodes under the root
    • Additional cost is insignificant (we already have to visit those nodes, just constant time work to point to root too)

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

  • Perform Path Compression on every find(), so future calls to find() are faster!

30 of 43

Path Compression: Details and Runtime

  • Run path compression on every find()!
    • Including the find()s that are invoked as part of a union()οΏ½οΏ½οΏ½οΏ½

  • Understanding the performance of M>1 operations requires amortized analysis
    • Effectively averaging out rare events over many common ones
    • Typically used for β€œIn-Practice” case
      • E.g. when we assume an array doesn’t resize β€œin practice”, we can do that because the rare resizing calls are amortized over many faster calls
    • In 373 we don’t go in-depth on amortized analysis

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

31 of 43

Path Compression: Runtime

  • M find()s on WeightedQuickUnion requires takes Θ(M log N)οΏ½οΏ½οΏ½οΏ½

  • … but M find()s on WeightedQuickUnionWithPathCompression takes O(M log*N)!
    • log*n is the β€œiterated log”: the number of times you need to apply log to n before it’s <= 1
    • Note: log* is a loose boundοΏ½

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

32 of 43

Path Compression: Runtime

  • Path compression results in find()s and union()s that are very very close to (amortized) constant time
    • log* is less than 5 for any realistic input
    • If M find()s/union()s on N nodes is O(M log*N)οΏ½and log*N β‰ˆ 5, then find()/union()s amortizesοΏ½to O(1)! 🀯

N

log* N

1

0

2

1

4

2

16

3

65536

4

265536

5

216

Number of atoms in the known universe is 2256ish

33 of 43

WQU + Path Compression Runtime

  • And if log* n <= 5 for any reasonable input…
    • We’ve just witnessed an incredible feat of data structure engineering: every operation is constant!?*
    • *Caveat: amortized constant, in the β€œin-practice” case; still logarithmic in the worst case!

(Baseline)

QuickFind

QuickUnion

WeightedQuickUnion

WQU + Path Compression

makeSet(value)

𝚯(1)

𝚯(1)

𝚯(1)

𝚯(1)

𝚯(1)

find(value)

𝚯(n)

𝚯(1)

𝚯(n)

𝚯(log n)

𝚯(log* n)

union(x, y)

assuming root args

𝚯(n)

𝚯(n)

𝚯(1)

𝚯(1)

𝚯(1)

union(x, y)

𝚯(n)

𝚯(n)

𝚯(n)

𝚯(log n)

𝚯(log* n)

In-Practice Runtimes:

34 of 43

Disjoint Sets Implementation

(Baseline)

QuickFind

QuickUnion

WeightedQuickUnion

WQU + Path Compression

makeSet(value)

𝚯(1)

𝚯(1)

𝚯(1)

𝚯(1)

𝚯(1)

find(value)

𝚯(n)

𝚯(1)

𝚯(n)

𝚯(log n)

𝚯(log* n)

union(x, y)

assuming root args

𝚯(n)

𝚯(n)

𝚯(1)

𝚯(1)

𝚯(1)

union(x, y)

𝚯(n)

𝚯(n)

𝚯(n)

𝚯(log n)

𝚯(log* n)

In-Practice Runtimes:

Aileen

Joyce

Santino

Sam

Ken

1

2

2

2

1

B

A

C

D

0

1

2

3

4

5

6

7

0

1

2

3

4

5

35 of 43

Kruskal’s Runtime

  • find and union are log|V| in worst case, but amortized constant β€œin practice”
  • Either way, dominated by time to sort the edges ☹
    • For an MST to exist, E can’t be smaller than V, so assume it dominates
    • Note: some people write |E|log|V|, which is the same (within a constant factor)

kruskalMST(G graph)

DisjointSets<V> msts; Set finalMST;

initialize msts with each vertex as single-element MST

sort all edges by weight (smallest to largest)

for each edge (u,v) in ascending order:

uMST = msts.find(u)

vMST = msts.find(v)

if (uMST != vMST):

finalMST.add(edge (u, v))

msts.union(uMST, vMST)

Β 

Β 

Β 

Β 

Β 

Β 

Β 

Β 

Β 

36 of 43

Using Arrays for Up-Trees

  • Since every node can have at most one parent, what if we use an array to store the parent relationships?
  • Proposal: each node corresponds to an index, where we store the index of the parent (or –1 for roots). Use the root index as the representative ID!
  • Just like with heaps, tree picture still conceptually correct, but exists in our minds!

Aileen (2)

Santino

Paul (4)

Joyce (0)

Ken

Sam

Alex

0

1

2

3

4

5

6

-1

0

-1

6

-1

2

0

Joyce

Sam

Aileen

Alex

Paul

Santino

Ken

37 of 43

Using Arrays: Find

  • Initial jump to element still done with extra Map
  • But traversing up the tree can be done purely within the array!

0

1

2

3

4

5

6

-1

0

-1

6

-1

2

0

Joyce

Sam

Aileen

Alex

Paul

Santino

Ken

Aileen (2)

Santino

Paul (4)

Joyce (0)

Ken

Sam

Alex

Alex

Aileen

Sam

…

find(A):

index = jump to A node’s index

while array[index] > 0:

index = array[index]

path compression

return index

1

2

find(Alex)

1

2

= 0

  • Can still do path compression by setting all indices along the way to the root index!

0

3

3

38 of 43

Using Arrays: Union

  • For WeightedQuickUnion, we need to store the number of nodes in each tree (the weight)
  • Instead of just storing -1 to indicate a root, we can store -1 * weight!

0

1

2

3

4

5

6

-4

0

-2

6

-1

2

0

Joyce

Sam

Aileen

Alex

Paul

Santino

Ken

Aileen (2)

Santino

Joyce (0)

Ken

Sam

Alex

union(A, B):

rootA = find(A)

rootB = find(B)

use -1 * array[rootA] and -1 *

array[rootB] to determine weights

put lighter root under heavier root

weight 4

weight 2

union(Ken, Santino)

Paul (4)

weight 1

39 of 43

Using Arrays: Union

  • For WeightedQuickUnion, we need to store the number of nodes in each tree (the weight)
  • Instead of just storing -1 to indicate a root, we can store -1 * weight!

0

1

2

3

4

5

6

-4

0

-2

6

-1

2

0

Joyce

Sam

Aileen

Alex

Paul

Santino

Ken

union(A, B):

rootA = find(A)

rootB = find(B)

use -1 * array[rootA] and -1 *

array[rootB] to determine weights

put lighter root under heavier root

-6

0

Aileen (2)

Santino

Joyce (0)

Ken

Sam

Alex

weight 6

Paul (4)

weight 1

Aileen

union(Ken, Santino)

40 of 43

Array Implementation

CSE 373 SP 18 - KASEY CHAMPION

40

1

6

3

rank = 0

4

2

10

5

7

0

9

8

11

15

13

rank = 3

14

12

17

16

18

rank = 3

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

-1

-1

1

2

2

2

1

6

7

7

6

-1

11

12

12

11

15

15

17

Store (rank * -1) - 1

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

-1

-4

1

2

2

2

1

6

7

7

6

-4

11

12

12

11

15

15

17

Each β€œnode” now only takes 4 bytes of memory instead of 32

41 of 43

Practice

CSE 373 SP 18 - KASEY CHAMPION

41

3

0

rank = 0

4

11

1

5

2

13

12

rank = 2

10

9

14

15

8

rank = 2

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

rank = 1

6

7

16

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

3

0

0

-6

3

-1

-2

6

12

13

13

0

13

-3

12

12

12

42 of 43

Using Arrays for WQU+PC

  • Same asymptotic runtime as using tree nodes, but check out all these other benefits:
    • More compact in memory
    • Better spatial locality, leading to better constant factors from cache usage
    • Simplify the implementation!

(Baseline)

QuickFind

QuickUnion

WeightedQuickUnion

WQU + Path Compression

ArrayWQU+PC

makeSet(value)

𝚯(1)

𝚯(1)

𝚯(1)

𝚯(1)

𝚯(1)

𝚯(1)

find(value)

𝚯(n)

𝚯(1)

𝚯(n)

𝚯(log n)

𝚯(log* n)

𝚯(log* n)

union(x, y)

assuming root args

𝚯(n)

𝚯(n)

𝚯(1)

𝚯(1)

𝚯(1)

𝚯(1)

union(x, y)

𝚯(n)

𝚯(n)

𝚯(n)

𝚯(log n)

𝚯(log* n)

𝚯(log* n)

43 of 43

Appendix