1 of 54

Preannouncements

Leverage content from this class + 170 + 176 for life saving purposes.

  • Specifically… Computational Biology
  • There exists a club that does comp bio, it is: ALIGN.
  • This March 9th Thursday 6:30 - 8:00 in 151 Barrow
  • Ask questions of grad students who do computational biology.
    • EE/CS
    • MCB
    • How medical interfaces with CS.

2 of 54

Announcements

Project 2 due tonight at 11:59 PM.

  • 24 hour grace period.
  • 10% off for subsequent days.
  • Late penalties scaled by your score (1 day late with 30 point score means you get 29.4 points)
  • Project has seemed like a more appropriate level of difficulty than previous semesters (!). Still very challenging but not utterly merciless.
  • Partnerships: Some were rocky, and that’s OK. Good to know what sort of things you might face in the future. Sorry if it was stressful.

3 of 54

CS61B

Lecture 20: Disjoint Sets

  • Dynamic Connectivity and the Disjoint Sets Problem
  • Quick Find
  • Quick Union
  • Weighted Quick Union
  • Path Compression (CS170 Preview)

4 of 54

Meta-goals of the Coming Lectures: Data Structure Refinement

Project 2: A chance to see how a design evolves.

Next couple of weeks: Deriving classic solutions to interesting problems, with an emphasis on how set, map, and priority queue ADTs are implemented.

  • Today: A chance to see how an implementation of an ADT can evolve and how different underlying data structures affect asymptotic runtime (using our formal notation).

5 of 54

Today’s Goal: Dynamic Connectivity

Today: A case study in ADT implementation.

Goal: Given a series of pairwise connectedness declarations, determine if two items are connected transitively. Only care about yes vs. no.

  • Example: We have Mexico, USA, Canada, Ukraine, and Estonia.
    • USA is connected to Mexico.
    • USA is connected to Canada.
    • Is Mexico connected to Canada? Yes.
    • Ukraine is connected to Estonia.
    • Is Ukraine connected to USA? No.

Solvable using a simple ADT with cool implementation details.

6 of 54

The Dynamic Connectivity Problem

Goal: Given a series of pairwise integers connectedness declarations, determine if two integers (or items) are connected. Two operations:

  • connect(p, q): Connect items p and q.
  • isConnected(p, q): Are p and q connected?�

0

4

3

1

2

5

connect(0, 1)

connect(1, 2)

connect(0, 4)

connect(3, 5)

isConnected(2, 4): true

isConnected(3, 0): false

6

7 of 54

The Dynamic Connectivity Problem

Goal: Given a series of pairwise integers connectedness declarations, determine if two integers (or items) are connected. Two operations:

  • connect(p, q): Connect items p and q.
  • isConnected(p, q): Are p and q connected?�

connect(0, 1)

connect(1, 2)

connect(0, 4)

connect(3, 5)

isConnected(2, 4): true

isConnected(3, 0): false

connect(4, 2)

0

4

3

1

2

5

6

8 of 54

The Dynamic Connectivity Problem

Goal: Given a series of pairwise integers connectedness declarations, determine if two integers (or items) are connected. Two operations:

  • connect(p, q): Connect items p and q.
  • isConnected(p, q): Are p and q connected?�

connect(0, 1)

connect(1, 2)

connect(0, 4)

connect(3, 5)

isConnected(2, 4): true

isConnected(3, 0): false

connect(4, 2)

connect(4, 6)

0

4

3

1

2

5

6

9 of 54

The Dynamic Connectivity Problem

Goal: Given a series of pairwise integers connectedness declarations, determine if two integers (or items) are connected. Two operations:

  • connect(p, q): Connect items p and q.
  • isConnected(p, q): Are p and q connected?�

connect(0, 1)

connect(1, 2)

connect(0, 4)

connect(3, 5)

isConnected(2, 4): true

isConnected(3, 0): false

connect(4, 2)

connect(4, 6)

connect(3, 6)

0

4

3

1

2

5

6

10 of 54

The Dynamic Connectivity Problem

Goal: Given a series of pairwise integers connectedness declarations, determine if two integers (or items) are connected. Two operations:

  • connect(p, q): Connect items p and q.
  • isConnected(p, q): Are p and q connected?�

connect(0, 1)

connect(1, 2)

connect(0, 4)

connect(3, 5)

isConnected(2, 4): true

isConnected(3, 0): false

connect(4, 2)

connect(4, 6)

connect(3, 6)

isConnected(3, 0): true

0

4

3

1

2

5

6

11 of 54

The Disjoint Sets ADT

Goal: Design an efficient DisjointSets implementation.

  • Number of elements N can be huge.
  • Number of method calls M can be huge.
  • Calls to methods may be interspersed (e.g. can’t assume that we stop getting connect calls after some point).

public interface DisjointSets {

/** Connects two items P and Q. */

void connect(int p, int q);

/** Checks to see if two items are connected. */

boolean isConnected(int p, int q);

}

getBack()

get(int i)

deleteBack()

connect(int p, int q)

isConnected(int p, int q)

12 of 54

The Naive Approach

Naive approach:

  • Connecting two things: Record every single connecting line in some data structure.
  • Checking connectedness: Do some sort of (??) iteration over the lines to see if one thing can be reached from the other.

0

4

1

2

3

5

6

13 of 54

A Better Approach: Connected Components

Rather than manually writing out every single connecting line, record the sets that something belongs to.�

connect(0, 1)

connect(1, 2)

connect(0, 4)

connect(3, 5)

isConnected(2, 4): true

isConnected(3, 0): false

connect(4, 2)

connect(4, 6)

connect(3, 6)

isConnected(3, 0): true

{0}, {1}, {2}, {3}, {4}, {5}, {6}, {7}

{0, 1}, {2}, {3}, {4}, {5}, {6}, {7}

{0, 1, 2}, {3}, {4}, {5}, {6}, {7}

{0, 1, 2, 4}, {3}, {5}, {6}, {7}

{0, 1, 2, 4}, {3, 5}, {6}, {7}

{0, 1, 2, 4}, {3, 5}, {6}, {7}

{0, 1, 2, 4, 6}, {3, 5}, {7}

{0, 1, 2, 3, 4, 5, 6}, {7}

14 of 54

A Better Approach: Connected Components

A connected component is a maximal set of items that are mutually connected.

  • Naive approach: Record every single connecting line somehow.
  • Better approach: Model connectedness in terms of sets.
    • How things are connected isn’t something we need to know.

{ 0, 1, 2, 4 }, {3, 5}, {6}

0

4

1

2

3

5

6

15 of 54

Quick Find

16 of 54

Challenge: Pick Data Structures to Support Tracking of Sets

Before connect(2, 5) operation: After connect(2, 5) operation:

{ 0, 1, 2, 4 }, {3, 5}, {6}

{ 0, 1, 2, 4, 3, 5}, {6}

0

4

1

2

3

5

6

0

4

1

2

3

5

6

17 of 54

Challenge: Pick Data Structures to Support Tracking of Sets

Before connect(2, 5) operation: After connect(2, 5) operation:

Idea #1: ArrayList<ArrayList<Integer>> [linked list variant]

  • 0th list is {0, 1, 2, 4}, next up is {3, 5}, and then also {6}

Idea #2: HashMap<Integer, Set>

  • 0 → put(0), put(1), put(2), put(4)
  • isConnected(4, 5): map.get(4) .equals map.get(5)

{ 0, 1, 2, 4 }, {3, 5}, {6}

{ 0, 1, 2, 4, 3, 5}, {6}

0

4

1

2

3

5

6

0

4

1

2

3

5

6

18 of 54

Challenge: Pick Data Structures to Support Tracking of Sets

Before connect(2, 5) operation: After connect(2, 5) operation:

Idea #3: Linked List

{ 0, 1, 2, 4 }, {3, 5}, {6}

{ 0, 1, 2, 4, 3, 5}, {6}

0

4

1

2

3

5

6

0

4

1

2

3

5

6

19 of 54

Using an Array

One natural choice: int[] where ith entry gives set number of item i.

  • connect(p, q): Change entries that equal id[p] to id[q]

Before connect(2, 5) operation: After connect(2, 5) operation:

0

0

0

3

0

3

6

0 1 2 3 4 5 6

3

3

3

3

3

3

6

0 1 2 3 4 5 6

int[] id

int[] id

{ 0, 1, 2, 4 }, {3, 5}, {6}

{ 0, 1, 2, 4, 3, 5}, {6}

0

4

1

2

3

5

6

0

4

1

2

3

5

6

20 of 54

QuickFindDS

public class QuickFindDS implements DisjointSets {

private int[] id;

public boolean isConnected(int p, int q) {

return id[p] == id[q];

}

public void connect(int p, int q) {

int pid = id[p];

int qid = id[q];

for (int i = 0; i < id.length; i++) {

if (id[i] == pid) {

id[i] = qid;

}

}...

Very fast: Two array accesses.

Relatively slow: N+2 to 2N+2 array accesses.

public QuickFindDS(int N) {

id = new int[N];

for (int i = 0; i < N; i++)

id[i] = i;

}

}

21 of 54

Performance Summary

QuickFindDS is too slow: Connecting two items takes N time.

Implementation

constructor

connect

isConnected

QuickFindDS

Θ(N)

Θ(N)

Θ(1)

22 of 54

Quick Union

23 of 54

Improving the Connect Operation

Approach zero: Represent everything as boxes and lines. This was overkill.

Approach one: Represent everything as connected components. Represented connected components as an array.

Approach two: We’re still going to stick with connected components, but will represent connected components differently.

0

4

1

2

3

5

6

??? in Java.

{ 0, 1, 2, 4 }, {3, 5}, {6}

0

0

0

3

0

3

6

0 1 2 3 4 5 6

int[] id

{ 0, 1, 2, 4 }, {3, 5}, {6}

??? in Java.

24 of 54

Improving the Connect Operation

A hard question: How could we change our set representation so that combining two sets into their union requires changing one value?

0

0

0

3

0

3

6

0 1 2 3 4 5 6

3

3

3

3

3

3

6

0 1 2 3 4 5 6

int[] id

int[] id

{ 0, 1, 2, 4 }, {3, 5}, {6}

{ 0, 1, 2, 4, 3, 5}, {6}

0

4

1

2

3

5

6

0

4

1

2

3

5

6

25 of 54

Improving the Connect Operation

A hard question: How could we change our set representation so that combining two sets into their union requires changing one value?

0

0

0

3

0

3

6

0 1 2 3 4 5 6

3

3

3

3

3

3

6

0 1 2 3 4 5 6

int[] id

int[] id

{ 0, 1, 2, 4 }, {3, 5}, {6}

{ 0, 1, 2, 4, 3, 5}, {6}

0

4

1

2

3

5

6

0

4

1

2

3

5

6

SET: 0

SET: 0

26 of 54

Improving the Connect Operation

Possibly harder question:

  • Knowing that we only need to support the union/belongs operations, how can we represent a set such that the set union operation is very fast?
  • Idea: Assign each node a parent (instead of an id).
    • An innocuous sounding, seemingly arbitrary solution.
    • Unlocks a pretty amazing universe of math that we won’t discuss.

{0, 1, 2, 4} {3, 5} {6}

2

1

4

0

6

0 is the “boss” of this set.

0

0

1

3

0

3

6

0 1 2 3 4 5 6

int[] parent

5

3

27 of 54

Improving the Connect Operation

connect(5, 2)

  • How do we do this?
    • If you’re not sure where to start, consider: why can’t we just set id[5] to 2?

{0, 1, 2, 4} {3, 5} {6}

2

1

4

0

6

0 is the “boss” of this set.

0

0

1

3

0

3

6

0 1 2 3 4 5 6

int[] parent

5

3

28 of 54

Improving the Connect Operation

connect(5, 2)

  • How do we do this?
    • If you’re not sure where to start, consider: why can’t we just set id[5] to 2?
    • make root(5) into a child of root(2).

{0, 1, 2, 4, 3, 5} {6}

2

1

4

0

5

3

6

0 is the “boss” of this set.

0

0

1

0

0

3

6

0 1 2 3 4 5 6

int[] parent

29 of 54

Set Union Using Rooted-Tree Representation

connect(5, 2)

  • Make root(5) into a child of root(2).

What are the potential performance issues with this approach?

{0, 1, 2, 4, 3, 5} {6}

2

1

4

0

5

3

6

0 is the “boss” of this set.

0

0

1

0

0

3

6

0 1 2 3 4 5 6

int[] parent

30 of 54

Set Union Using Rooted-Tree Representation

connect(5, 2)

  • Make root(5) into a child of root(2).

What are the potential performance issues with this approach?

  • Tree can get too tall!

{0, 1, 2, 4, 3, 5} {6}

2

1

4

0

5

3

6

0 is the “boss” of this set.

0

0

1

0

0

3

6

0 1 2 3 4 5 6

int[] parent

31 of 54

The Worst Case

If we always connect the first items’ tree below the second item’s tree, we can end up with a tree of height M after M operations:

  • connect(4, 3)
  • connect(3, 2)
  • connect(2, 1)
  • connect(1, 0)

For N items, what’s the worst case runtime…

  • For connect(p, q)?
    • Find the boss! Well finding the boss twice is also theta(N)
  • For isConnected(p, q)?
    • Walk entire tree twice, theta(N)!

3

2

1

0

4

32 of 54

QuickUnionDS

public class QuickUnionDS implements DisjointSets {

private int[] parent;

public QuickUnionDS(int N) {

parent = new int[N];

for (int i = 0; i < N; i++)

parent[i] = i;

}

private int find(int p) {

while (p != parent[p])

p = parent[p];

return p;

}

For N items, this means a worst case runtime of Θ(N).

public boolean isConnected(int p, int q) {

return find(p) == find(q);

}

public void connect(int p, int q) {

int i = find(p);

int j = find(q);

parent[i] = j;

}

33 of 54

Performance Summary

QuickFindDS defect: QuickFindDS is too slow: Connecting two items takes N time in the worst case.

QuickUnion defect: Trees can get tall. Results in potentially even worse performance than QuickFind if tree is imbalanced.

Implementation

Constructor

connect

isConnected

QuickFindDS

Θ(N)

Θ(N)

Θ(1)

QuickUnionDS

Θ(N)

O(N)

O(N)

34 of 54

Weighted Quick Union

35 of 54

Weighted QuickUnion: http://shoutkey.com/black

Modify quick-union to avoid tall trees.

  • Track tree size (number of elements).
  • New rule: Always link root of smaller tree to larger tree.

New rule: If we call connect(3, 8), which entry (or entries) of parent[] changes?

  • parent[3]
  • parent[0]
  • parent[8]
  • parent[6]

1

2

0

4

6

5

3

8

9

7

0

0

0

0

0

0

6

6

6

8

0 1 2 3 4 5 6 7 8 9

int[] parent

36 of 54

Improvement #1: Weighted QuickUnion

Modify quick-union to avoid tall trees.

  • Track tree size (number of elements).
  • New rule: Always link root of smaller tree to larger tree.

New rule: If we call connect(3, 8), which entry (or entries) of parent[] changes?

  • parent[3]
  • parent[0]
  • parent[8]
  • parent[6]

1

2

0

4

6

5

3

8

9

7

0

0

0

0

0

0

0

6

6

8

0 1 2 3 4 5 6 7 8 9

int[] parent

37 of 54

Implementing WeightedQuickUnion

Minimal changes needed:

  • Use parent[] array as before, but also add size[] array.
  • isConnected(int p, int q) requires no changes.
  • connect(int p, int q) needs two changes:
    • Link root of smaller tree to larger tree.
    • Update size[] array.

Now the connect method looks like:

public void connect(int p, int q) {

int i = find(p);

int j = find(q);

if (i == j) return;

if (size[i] < size[j]) { parent[i] = j; size[j] += size[i]; }

else { parent[j] = i; size[i] += size[j]; }

}

0

0

0

0

0

0

0

6

6

8

0 1 2 3 4 5 6 7 8 9

int[] parent

10

1

1

1

1

1

4

1

2

1

0 1 2 3 4 5 6 7 8 9

int[] size

38 of 54

WeightedQuickUnion Performance

As before, connect and isConnected require time proportional to depth of the items involved.

Max depth of any item: log N

Very brief proof for the curious (not covered in lecture):

  • Depth of an element x increases only when tree T1 that contains x is linked below some other tree T2.
    • The size of the tree at least doubles since weight(T2) ≥ weight(T1).
    • Tree containing x doubles at most log N times.

39 of 54

Performance Summary

By tweaking QuickUnionDS we’ve achieved logarithmic time performance.

  • Fast enough for most (all?) practical problems.

Implementation

Constructor

connect

isConnected

QuickFindDS

Θ(N)

Θ(N)

Θ(1)

QuickUnionDS

Θ(N)

O(N)

O(N)

WeightedQuickUnionDS

Θ(N)

O(log N)

O(log N)

40 of 54

Performance Summary

Performing M operations on a DisjointSet object with N elements:

  • Runtime goes from O(MN) to O(N + M log N)
  • For N = 109 and M = 109, time to run goes from 30 years to 6 seconds.
    • Key point: Good data structure unlocks solutions to problems that could otherwise not be solved!
  • … pretty good for most problems, but could we do better?

Implementation

Constructor

connect

isConnected

QuickFindDS

Θ(N)

Θ(N)

Θ(1)

QuickUnionDS

Θ(N)

O(N)

O(N)

WeightedQuickUnionDS

Θ(N)

O(log N)

O(log N)

41 of 54

Path Compression

(CS170 Spoiler)

42 of 54

170 Spoiler: Path Compression: A Clever Idea

Below is the topology of the worst case if we use WeightedQuickUnion.

  • Clever idea: When we do isConnected(15, 10), tie all nodes seen to the root.
    • Additional cost is insignificant (same order of growth).

15

11

5

12

13

6

1

7

14

8

2

9

10

3

0

4

43 of 54

170 Spoiler: Path Compression: A Clever Idea

Below is the topology of the worst case if we use WeightedQuickUnion

  • Clever idea: When we do isConnected(15, 10), tie all nodes seen to the root.
    • Additional cost is insignificant (same order of growth).

15

11

5

12

13

6

1

7

14

8

2

9

10

3

0

4

44 of 54

170 Spoiler: Path Compression: A Clever Idea

Path compression results in a union/connected operations that are very very close to amortized constant time.

  • M operations on N nodes is O(N + M lg* N) - you will see this in CS170.
  • lg* is less than 5 for any realistic input.

15

11

5

12

13

6

1

7

14

8

2

9

10

3

0

4

N

lg* N

1

0

2

1

4

2

16

3

65536

4

265536

5

45 of 54

170 Spoiler: Path Compression: A Clever Idea

Path compression results in a union/connected operations that are very very close to amortized constant time.

  • M operations on N nodes is O(N + M lg* N) - you will see this in CS170.
  • A tighter bound: O(N + M α(N)), where α is the inverse Ackermann function.
  • The inverse ackermann function is less than 5 for all practical inputs!

15

11

5

12

13

6

1

7

14

8

2

9

10

3

0

4

N

α(N)

1

0

...

1

...

2

...

3

...

4

5

46 of 54

170 Spoiler: Path Compression: A Clever Idea

  • And we’re done! The end result of our iterative design process is the standard way disjoint sets are implemented today - quick union and path compression.
  • The resulting code for find() is simple:

private int find(int p) {

if (p == parent[p]) {

return p;

} else {

parent[p] = find(parent[p]);

return parent[p];

}

}

47 of 54

Everything All Together...

public class WeightedQuickUnionDSWithPathCompression implements DisjointSets {

private int[] parent; private int[] size;

public WeightedQuickUnionDSWithPathCompression(int N) {

parent = new int[N]; size = new int[N];

for (int i = 0; i < N; i++) {

parent[i] = i;

size[i] = 1;

}

}

private int find(int p) {

if (p == parent[p]) {

return p;

} else {

parent[p] = find(parent[p]);

return parent[p];

}

}

public boolean isConnected(int p, int q) {

return find(p) == find(q);

}

public void connect(int p, int q) {

int i = find(p);

int j = find(q);

if (i == j) return;

if (size[i] < size[j]) {

parent[i] = j; size[j] += size[i];

} else {

parent[j] = i; size[i] += size[j];

}

}

48 of 54

Performance Summary

Runtimes are given assuming:

  • We have a DisjointSets object of size N.
  • We perform M operations, where an operation is defined as either a call to connected or isConnected.

Implementation

Runtime

QuickFindDS

Θ(NM)

QuickUnionDS

O(NM)

WeightedQuickUnionDS

O(N + M log N)

WeightedQuickUnionDSWithPathCompression

O(N + M α(N))

49 of 54

Citations

Nazca Lines: http://redicecreations.com/ul_img/24592nazca_bird.jpg

Implementation code adapted from Algorithms, 4th edition and Professor Jonathan Shewchuk’s lecture notes on disjoint sets, where he presents a faster one-array solution. I would recommend taking a look.

(http://www.cs.berkeley.edu/~jrs/61b/lec/33)

The proof of the inverse ackermann runtime for disjoint sets is given here:

http://www.uni-trier.de/fileadmin/fb4/prof/INF/DEA/Uebungen_LVA-Ankuendigungen/ws07/KAuD/effi.pdf

as originally proved by Tarjan here at UC Berkeley in 1975.

50 of 54

extra

51 of 54

Set Union Using Parent Representation

connect(11, 3)

  • Any ideas how to do this quickly?

0

8

2

0

0

0

0

8

8

6

0

8

8

8

0 1 2 3 4 5 6 7 8 9 10 11 12 13

int[] parent

{0, 3, 4, 5, 6, 9, 10}, {1, 7, 8, 11, 12, 13}, {2}

10

9

6

5

4

3

0

13

11

12

7

1

8

2

52 of 54

Improving the Connect Operation

Possibly harder question:

  • Knowing that we only need to support the union/belongs operations, how can we represent a set such that the set union operation is very fast?
  • Idea: Assign each node a parent (instead of an id).
    • An innocuous sounding, seemingly arbitrary solution.
    • Unlocks a pretty amazing universe of math that we won’t discuss. D:

{0, 3, 4, 5, 6, 9, 10}, {1, 7, 8, 11, 12, 13}, {2}

10

9

6

5

4

3

0

13

11

12

7

1

8

2

0 is the “boss” of this set.

0

8

2

0

0

0

0

8

8

6

0

8

8

8

0 1 2 3 4 5 6 7 8 9 10 11 12 13

int[] parent

53 of 54

Set Union Using Rooted-Tree Representation

connect(11, 3)

  • Make root(11) into a child of root(3).

Anybody see any issues with this?

{0, 3, 4, 5, 6, 9, 10, 1, 7, 8, 11, 12, 13}, {2}

10

9

6

5

4

3

0

13

11

12

7

1

8

2

0

8

2

0

0

0

0

8

0

6

0

8

8

8

0 1 2 3 4 5 6 7 8 9 10 11 12 13

int[] id

54 of 54

Set Union Using Rooted-Tree Representation

connect(11, 3)

  • Make root(11) into a child of root(3).

Anybody see any issues with this? Tree can get too tall!

0

8

2

0

0

0

0

8

0

6

0

8

8

8

0 1 2 3 4 5 6 7 8 9 10 11 12 13

int[] id

{0, 3, 4, 5, 6, 9, 10, 1, 7, 8, 11, 12, 13}, {2}

10

9

6

5

4

3

0

13

11

12

7

1

8

2