1 of 25

CP190-CP0

Topic 10: Set and Map

Ninghui Li

Purdue University

2 of 25

Limitation of vector

  • Some operations in vector are expensive, i.e, they take time linear in the length of the vector:
  • What are the complexities for the following operations on vector A?
    • A[k]
    • A.push_back(val);
    • A.insert(pos, val);
    • find(A.begin(), A.end(), val);
    • A.erase(pos);

A[k] is constant time

A.push_back on avg constant time

A.insert is worst/avg case linear

find is worst/avg case linear

A.erase is worst/avg case linear

3 of 25

C++ STL set and unordered_set

(Java TreeSet and HashSet)

  • A set is a collection containing no duplicate elements
  • A set supports the following operations efficiently:
    • Testing for membership (find, count), Adding elements (insert), Removing elements (erase)
    • Worst/avg case O(log n) in set, and avg case O(1) in unordered_set
  • In set, which is ordered, elements are ordered by some order, and thus additionally support efficient
    • first, last, ordered traversal (finding next element by incrementing the iterator is efficient), etc.

4 of 25

C++ map and unordered_map

Java TreeMap and HashMap

  • Map is a set of ordered pairs, each (key, value)
  • In a given Map, there are no duplicate keys
  • Each key is mapped to one value
  • Different keys can be mapped to the same value
  • Can be used to implement multi-set, i.e., a collection of things where one thing can appear multiple times.
  • An array can be viewed as a Map from a set of integer indices to a set of values
  • Whenever you want to use an array, but the index is not an integer, you can use map.

5 of 25

LeetCode: Repeated DNA Sequences

6 of 25

Problem Description

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

Output: ["AAAAACCCCC", "CCCCCAAAAA"]

7 of 25

Version 0a: Bruteforce 1

vector<string> findRepeatedDnaSequences(string s) {

vector<string> ans;

for (int i=0; i+10<=s.size(); i++)

for (int j=i+1; j+10<=s.size(); j++) {

int k = 0;

while (k<10 && s[i+k] == s[j+k]) k++;

if (k == 10) ans.push_back(s.substr(i,10));

}

return ans;

}

This version has a bug and give wrong answer. Can you find the bug?

When a string appears more than twice, it will be outputted more than once.

8 of 25

Version 0b: Bug Fixed

vector<string> findRepeatedDnaSequences(string s) {

vector<string> ans;

for (int i=0; i+10<=s.size(); i++)

for (int j=i+1; j+10<=s.size(); j++) {

int k = 0;

while (k<10 && s[i+k] == s[j+k]) k++;

if (k == 10) {

string x = s.substr(i,10);

if (find(ans.begin(), ans.end(), x) == ans.end())

ans.push_back(x);

}

}

return ans;

}

This version gets TLE? What is its time complexity?

The complexity is quadratic. O(n^2), where n is length. Can we speed this up?

9 of 25

Version 1a: Using Set (the version that should not pass, but did)

vector<string> findRepeatedDnaSequences(string s) {

unordered_set<string> ss;

vector<string> ans;

for (int i=0; i+10<=s.size(); i++) {

string x = s.substr(i, 10);

if (ss.count(x)==0) ss.insert(x);

else if (count(ans.begin(),ans.end(),x)==0)

ans.push_back(x);

}

return ans;

}

Constructing the solution is potentially quadratic in the number of strings, which could be linear in length. Worst case is when every length-10 appears exactly twice. This can be solved by adding another unordered_set. But that is a bit ugly.

10 of 25

Version 1b: Using Map

vector<string> findRepeatedDnaSequences(string s) {

unordered_map<string, int> ss;

vector<string> ans;

for (int i=0; i<s.size()-10; i++) {

string x = s.substr(i, 10);

int c = ++ss[x];

if (c == 2)

ans.push_back(x);

}

return ans;

}

This version has a bug. What is it?

s.size() returns uint64. When computing s.size()-10, computation is done in uint64, 0-10 becomes a huge positive number. How to fix it?

(At least) three fixes: (1) always cast s.size() to int, (2) use addition instead of subtraction, (3) checks.size() first to avoid negative number.

11 of 25

Version 1c: Using map with Bug Fixed

vector<string> findRepeatedDnaSequences(string s) {

unordered_map<string, int> ss;

vector<string> ans;

for (int i=0; i+10<=s.size(); i++) {

string x = s.substr(i, 10);

int c = ++ss[x];

if (c == 2) ans.push_back(x);

}

return ans;

}

This version is accepted. Can it be made faster?

12 of 25

Version 2: Encoding Strings (part 1)

class Solution {

int c2b (char x) { // char to bits

if (x == 'A') return 0;

else if (x == 'C') return 1;

else if (x == 'G') return 2;

else return 3;

}

public:

vector<string> findRepeatedDnaSequences(string s) {

vector<string> ans;

if (s.size() < 10) return ans;

// to be continued on next slide

13 of 25

Version 2: Encoding Strings (part 2)

bitset<1<<20> f1, f2;

int bp = 0, mask = (1 << 20) - 1;

for (int i=0; i<10; i++) { bp <<= 2; bp |= c2b(s[i]); }

f1[bp] = true;

for (int i=10; i<s.size(); i++) {

bp <<= 2; bp &= mask; bp |= c2b(s[i]);

if (f1[bp]) {

if (! f2[bp]) {

ans.push_back(s.substr(i-9, 10)); f2[bp] = true;

}

} else f1[bp] = true;

}

return ans;

}

};

14 of 25

LeetCode: Avoid Flood in the City

15 of 25

Description

Your country has an infinite number of lakes. Initially, all the lakes are empty, but when it rains over the nth lake, the nth lake becomes full of water. If it rains over a lake which is full of water, there will be a flood. Your goal is to avoid the flood in any lake.

Given an integer array rains where:

  • rains[i] > 0 means there will be rains over the rains[i] lake.
  • rains[i] == 0 means there are no rains this day and you can choose one lake this day and dry it.

16 of 25

Description and Sample

Return an array ans where:

  • ans.length == rains.length
  • ans[i] == -1 if rains[i] > 0.
  • ans[i] is the lake you choose to dry in the ith day if rains[i] == 0.

If there are multiple valid answers return any of them. If it is impossible to avoid flood return an empty array.

Input

1

3

0

2

0

1

2

0

1

0

0

1

3

Output

-1

-1

1

-1

2

-1

-1

1

-1

1

3

-1

-1

17 of 25

Approach 1: Look Back to Avoid Each Flood

  • For each dangerous situation (rains happen twice for a lake), look back to figure out how to defuse the danger.
  • We want two conditions:
    • It has to be after the most recent raining at the lake.
    • We want to be as early as possible.
  • We need to maintain all available days.

18 of 25

Approach 2: Look Ahead to defuse Nearest Threat

  • For each dry day, look ahead to determine which one needs to be dried.
  • Two conditions:
    • It should a lake that is already filled.
    • We will be fine choosing the lake whose next raining day is earliest.

19 of 25

Approach 1: Solution Code (1)

vector<int> avoidFlood(vector<int>& rains) {

int n = rains.size();

// lake id -> recent raining time

unordered_map<int, int> filled;

set<int> dry; // unused dry days

vector<int> ans(n, 1);

for (int j=0; j<n; j++) {

if (rains[j] == 0) { dry.insert(j); continue; }

// to be continued on the next slide

20 of 25

Approach 1: Solution Code (2)

ans[j] = -1; // rain happens on this day

auto p = filled.find(rains[j]);

if (p == filled.end()) { // no undried raining

filled.insert(make_pair(rains[j],j));

continue;

}

auto q = dry.upper_bound(p->second); // find dry day

if (q == dry.end()) return vector<int>();

ans[*q] = rains[j]; dry.erase(q);

filled[rains[j]] = j;

}

return ans;

}

21 of 25

22 of 25

Anaysis

  • Naive solution
    • Add each number to the end of a vector (constant time). When need to return the ranges, sort, scan and construct.
  • Maintain ranges
    • Update vector of ranges as new numbers come in.

23 of 25

Homework Problem Hints:

24 of 25

Hints

  • Galactic Election: Use maps. Note that one can map a key to a value, which can itself be a map.
  • User Names: A clean way is to implement a class that supports all queries regarding one last name. The class should support two public methods: one for getting the next available number for that last name, and one for freeing one number. You need to design a data structure that support both operation in at most log time.
  • Data Stream Queries I: See the C language for detailed hints.

25 of 25

End of Topic 10