CP190-CP0
Topic 10: Set and Map
Ninghui Li
Purdue University
Limitation of vector
A[k] is constant time
A.push_back on avg constant time
A.insert is worst/avg case linear
find is worst/avg case linear
A.erase is worst/avg case linear
C++ STL set and unordered_set
(Java TreeSet and HashSet)
C++ map and unordered_map
Java TreeMap and HashMap
LeetCode: Repeated DNA Sequences
Problem Description
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC", "CCCCCAAAAA"]
Version 0a: Bruteforce 1
vector<string> findRepeatedDnaSequences(string s) {
vector<string> ans;
for (int i=0; i+10<=s.size(); i++)
for (int j=i+1; j+10<=s.size(); j++) {
int k = 0;
while (k<10 && s[i+k] == s[j+k]) k++;
if (k == 10) ans.push_back(s.substr(i,10));
}
return ans;
}
This version has a bug and give wrong answer. Can you find the bug?
When a string appears more than twice, it will be outputted more than once.
Version 0b: Bug Fixed
vector<string> findRepeatedDnaSequences(string s) {
vector<string> ans;
for (int i=0; i+10<=s.size(); i++)
for (int j=i+1; j+10<=s.size(); j++) {
int k = 0;
while (k<10 && s[i+k] == s[j+k]) k++;
if (k == 10) {
string x = s.substr(i,10);
if (find(ans.begin(), ans.end(), x) == ans.end())
ans.push_back(x);
}
}
return ans;
}
This version gets TLE? What is its time complexity?
The complexity is quadratic. O(n^2), where n is length. Can we speed this up?
Version 1a: Using Set (the version that should not pass, but did)
vector<string> findRepeatedDnaSequences(string s) {
unordered_set<string> ss;
vector<string> ans;
for (int i=0; i+10<=s.size(); i++) {
string x = s.substr(i, 10);
if (ss.count(x)==0) ss.insert(x);
else if (count(ans.begin(),ans.end(),x)==0)
ans.push_back(x);
}
return ans;
}
Constructing the solution is potentially quadratic in the number of strings, which could be linear in length. Worst case is when every length-10 appears exactly twice. This can be solved by adding another unordered_set. But that is a bit ugly.
Version 1b: Using Map
vector<string> findRepeatedDnaSequences(string s) {
unordered_map<string, int> ss;
vector<string> ans;
for (int i=0; i<s.size()-10; i++) {
string x = s.substr(i, 10);
int c = ++ss[x];
if (c == 2)
ans.push_back(x);
}
return ans;
}
This version has a bug. What is it?
s.size() returns uint64. When computing s.size()-10, computation is done in uint64, 0-10 becomes a huge positive number. How to fix it?
(At least) three fixes: (1) always cast s.size() to int, (2) use addition instead of subtraction, (3) checks.size() first to avoid negative number.
Version 1c: Using map with Bug Fixed
vector<string> findRepeatedDnaSequences(string s) {
unordered_map<string, int> ss;
vector<string> ans;
for (int i=0; i+10<=s.size(); i++) {
string x = s.substr(i, 10);
int c = ++ss[x];
if (c == 2) ans.push_back(x);
}
return ans;
}
This version is accepted. Can it be made faster?
Version 2: Encoding Strings (part 1)
class Solution {
int c2b (char x) { // char to bits
if (x == 'A') return 0;
else if (x == 'C') return 1;
else if (x == 'G') return 2;
else return 3;
}
public:
vector<string> findRepeatedDnaSequences(string s) {
vector<string> ans;
if (s.size() < 10) return ans;
// to be continued on next slide
Version 2: Encoding Strings (part 2)
bitset<1<<20> f1, f2;
int bp = 0, mask = (1 << 20) - 1;
for (int i=0; i<10; i++) { bp <<= 2; bp |= c2b(s[i]); }
f1[bp] = true;
for (int i=10; i<s.size(); i++) {
bp <<= 2; bp &= mask; bp |= c2b(s[i]);
if (f1[bp]) {
if (! f2[bp]) {
ans.push_back(s.substr(i-9, 10)); f2[bp] = true;
}
} else f1[bp] = true;
}
return ans;
}
};
LeetCode: Avoid Flood in the City
Description
Your country has an infinite number of lakes. Initially, all the lakes are empty, but when it rains over the nth lake, the nth lake becomes full of water. If it rains over a lake which is full of water, there will be a flood. Your goal is to avoid the flood in any lake.
Given an integer array rains where:
Description and Sample
Return an array ans where:
If there are multiple valid answers return any of them. If it is impossible to avoid flood return an empty array.
Input | 1 | 3 | 0 | 2 | 0 | 1 | 2 | 0 | 1 | 0 | 0 | 1 | 3 |
Output | -1 | -1 | 1 | -1 | 2 | -1 | -1 | 1 | -1 | 1 | 3 | -1 | -1 |
Approach 1: Look Back to Avoid Each Flood
Approach 2: Look Ahead to defuse Nearest Threat
Approach 1: Solution Code (1)
vector<int> avoidFlood(vector<int>& rains) {
int n = rains.size();
// lake id -> recent raining time
unordered_map<int, int> filled;
set<int> dry; // unused dry days
vector<int> ans(n, 1);
for (int j=0; j<n; j++) {
if (rains[j] == 0) { dry.insert(j); continue; }
// to be continued on the next slide
Approach 1: Solution Code (2)
ans[j] = -1; // rain happens on this day
auto p = filled.find(rains[j]);
if (p == filled.end()) { // no undried raining
filled.insert(make_pair(rains[j],j));
continue;
}
auto q = dry.upper_bound(p->second); // find dry day
if (q == dry.end()) return vector<int>();
ans[*q] = rains[j]; dry.erase(q);
filled[rains[j]] = j;
}
return ans;
}
LeetCode: Data Streams as Disjoint Intervals
Anaysis
Homework Problem Hints:
Hints
End of Topic 10