Locality Sensitive Hashing
What is it, and how does it work?
Overview
Locality sensitive hashing is a technique that, with high probability, maps similar inputs to outputs that fall in the same "bucket".
Unlike traditional hashing, where we aim to minimize collisions, in locality sensitive hashing we strive to maximize collisions for similar items.
Simple Hashing Example

[Figure: the inputs 1-6 hashed two ways]
Traditional Hash Function: inputs 1, 2, 3, ... land in widely scattered buckets (1, 17, 30, ...)
Locality Sensitive Hash: the same inputs land in neighboring buckets (1, 2, 3, ...)
"Try to group similar values closer together!"
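A toy Python sketch of the contrast above; the "bucket every two integers" rule is made up purely for illustration:

```python
import hashlib

def traditional_hash(x: int) -> int:
    # Nearby inputs land in unrelated buckets.
    return int(hashlib.md5(str(x).encode()).hexdigest(), 16) % 100

def toy_lsh(x: int) -> int:
    # Illustrative only: neighboring integers share a bucket.
    return x // 2

for x in range(1, 7):
    print(x, traditional_hash(x), toy_lsh(x))
```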
Further Applications

Traditional Hashing
• Checksums
• Password Hashing
• Hash Tables
• Possible functions: Adler32, MD5

Locality Sensitive Hashing
• Dimensionality Reduction
• Nearest Neighbor Search
• Possible functions: Bit sampling, min-wise independent permutations
Formal Definition
We consider an LSH family F defined for a metric space M = (M, d), a threshold R > 0, and an approximation factor c > 1. The family F of functions h: M → S maps elements from the metric space to buckets s ∈ S. For any two points p, q ∈ M, using a function h ∈ F chosen uniformly at random:
• if d(p, q) ≤ R, then h(p) = h(q) with probability at least P1
• if d(p, q) ≥ cR, then h(p) = h(q) with probability at most P2
We consider a family interesting if P1 > P2.
Source: Wikipedia
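In the standard notation such a family is called (R, cR, P1, P2)-sensitive; the two conditions above can be written compactly as:

```latex
d(p, q) \le R  \;\Rightarrow\; \Pr_{h \in F}\big[\, h(p) = h(q) \,\big] \ge P_1
d(p, q) \ge cR \;\Rightarrow\; \Pr_{h \in F}\big[\, h(p) = h(q) \,\big] \le P_2
```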
Locality Sensitive Hashing Techniques
Bitwise Sampling, Min Hashing, SimHash
Bitwise Sampling
Simple technique that "samples" a particular bit at random from a binary string.
We can express it formally as:
F = { h: {0, 1}^d → {0, 1} | ∃ i ∈ {1, ..., d}, h(x) = x_i }
Example:
We may consider the following bit-strings:
A = {0, 1, 0, 1, 0}
B = {0, 0, 0, 1, 0}
C = {1, 1, 1, 1, 1}
One possible sample would be to select the second bit - so in this case for:
A = 1
B = 0
C = 1
Clearly the results are not very powerful using bitwise sampling alone.
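A minimal Python sketch of the family above, run on the strings A, B, and C from the example (the sampled index is chosen uniformly at random, so the output varies between runs):

```python
import random

def make_bit_sampler(d: int):
    """Choose a random index i in {0, ..., d-1}; the hash of x is simply its i-th bit."""
    i = random.randrange(d)
    return lambda x: x[i]

A = [0, 1, 0, 1, 0]
B = [0, 0, 0, 1, 0]
C = [1, 1, 1, 1, 1]

h = make_bit_sampler(5)
print(h(A), h(B), h(C))  # sampling the second bit would give 1 0 1

# Under a random h, Pr[h(x) = h(y)] = 1 - HammingDistance(x, y) / d,
# so closer strings collide more often: bit sampling is an LSH family
# for the Hamming distance.
```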
Min Hashing
A more complex technique that applies random permutations to a binary string (or set) and, for each permutation, records the position of the first "1".
By using multiple permutations, a sequence of these values can be generated that acts as a "signature" for the original features while taking up far less space.
This is likely still unclear, so a dedicated video for MinHashing and using it for nearest neighbor search can be found here.
Example:
Source: link
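Since the example figure is easier to follow in the linked video, here is only a rough Python sketch of the idea, assuming the binary string is represented as the set of its "1" positions (all names and parameters below are illustrative):

```python
import random

def minhash_signature(items: set, universe: int, num_perms: int, seed: int = 0) -> list:
    """For each random permutation of {0, ..., universe-1}, record the rank of the
    first permuted element that belongs to `items`; these ranks form the signature."""
    rng = random.Random(seed)
    signature = []
    for _ in range(num_perms):
        perm = list(range(universe))
        rng.shuffle(perm)
        signature.append(next(rank for rank, idx in enumerate(perm) if idx in items))
    return signature

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    # The fraction of positions where two signatures agree estimates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

A = {0, 2, 3, 5, 7}   # indices of the "1" bits of one binary string
B = {0, 2, 3, 5, 8}
sig_a = minhash_signature(A, universe=10, num_perms=128)
sig_b = minhash_signature(B, universe=10, num_perms=128)
print(estimated_jaccard(sig_a, sig_b))  # true Jaccard(A, B) = 4/6 ≈ 0.67
```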
SimHash
Google-patented technique for rapidly identifying "near-identical" documents.
• Unlike MinHash, it is better suited to finding "near-identical" documents than to clustering loosely related ones
• More performant than MinHash
• Used by Google for duplicate detection in web crawling, and FLoC in ads
Process for finding near-identical documents
1. Generate simhashes
2. Solve the "Hamming distance problem"
Generating simhashes
Simhashes are typically 64-bit sequences generated from a set of features.
From the set of features, "average" the individual bits: hash each feature (e.g., to a 64-bit value, so with 10 features you have 10 64-bit numbers), then for each bit index check whether 1s or 0s occur more often and set that bit of the simhash accordingly (see the sketch below).
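A hedged sketch of that bit-averaging step in Python; hashing each feature with MD5 here is an arbitrary stand-in, not the hash used by Google or the paper:

```python
import hashlib

def simhash(features, bits: int = 64) -> int:
    """Hash every feature to a `bits`-bit value, then take a bitwise majority vote:
    a bit of the simhash is 1 iff more feature hashes have a 1 than a 0 there."""
    counts = [0] * bits
    for feature in features:
        h = int.from_bytes(hashlib.md5(feature.encode()).digest()[: bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

# Near-identical documents share most feature hashes, so their
# simhashes differ in only a few bit positions.
print(bin(simhash("the quick brown fox".split())))
```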
Hamming distance problem
Efficiently determine the closest simhashes (i.e., without a quadratic all-pairs comparison) - achieved via random projections or permutations plus interpolation search.
For reference, view the following link
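The pairwise check itself is cheap (an XOR plus a popcount); the hard part the techniques above address is avoiding the all-pairs comparison. A minimal sketch, using a threshold of 3 differing bits for 64-bit fingerprints as in the simhash paper:

```python
def hamming_distance(a: int, b: int) -> int:
    # Number of bit positions where two simhashes differ.
    return (a ^ b).bit_count()

def is_near_duplicate(a: int, b: int, threshold: int = 3) -> bool:
    # 64-bit fingerprints differing in at most ~3 bits are treated as near-duplicates.
    return hamming_distance(a, b) <= threshold

print(is_near_duplicate(0b10110010, 0b10110110))  # distance 1 -> True
```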
Future Reading
Locality Preserving Hashing
• Hash function f that maps a single point or multiple points in a multi-dimensional coordinate space to a single scalar value
• Maintains the property that:
|A - B| < |B - C| ⇒ |f(A) - f(B)| < |f(B) - f(C)|
Simhash Paper
• Link: http://www2007.org/papers/paper215.pdf
SuperMinHash Paper
• Link: https://arxiv.org/pdf/1706.05698.pdf
Thanks for watching!
Feel free to leave feedback below! :)