1 of 31

Locality Sensitive Hashing

2 of 31

Locality Sensitive Hashing

What is it, and how does it work?

2

3 of 31

Overview

Locality sensitive hashing is a technique that for similar inputs, it would with high probability create outputs that fall in the same "bucket".

Unlike traditional hashing, where we aim to minimize hashing collisions, for locality sensitive hashing we strive to maximize collisions for similar items.

4 of 31

Simple Hashing Example

5 of 31

Simple Hashing Example

Traditional Hash Function

1

2

3

4

5

6

...

6 of 31

Simple Hashing Example

Traditional Hash Function

1

2

3

4

5

6

1

17

30

...

...

...

7 of 31

Simple Hashing Example

Traditional Hash Function

Locality Sensitive Hash

1

2

3

4

5

6

1

17

30

1

2

3

4

5

6

...

...

...

...

8 of 31

Simple Hashing Example

Traditional Hash Function

Locality Sensitive Hash

1

2

3

4

5

6

1

17

30

1

2

3

4

5

6

1

2

3

...

...

...

...

...

9 of 31

Simple Hashing Example

Traditional Hash Function

Locality Sensitive Hash

1

2

3

4

5

6

1

17

30

1

2

3

4

5

6

1

2

3

"Try to group similar values closer together!"

...

...

...

...

...

10 of 31

Further Applications

11 of 31

Traditional Hashing

Further Applications

Checksums

Password Hashing

Hash Tables

Possible functions: Adler32, MD5

12 of 31

Traditional Hashing

Locality Sensitive Hashing

Further Applications

Checksums

Password Hashing

Hash Tables

Possible functions: Adler32, MD5

Dimensionality Reduction

Nearest Neighbor Search

Possible functions: Bit sampling, min-wise independent permutations

13 of 31

Formal Definition

We consider an LSH family F defined for a metric space M = (M, d), a threshold R > 0 and an approximation factor c > 1. The family F of functions h: M → S maps elements from the metric space to buckets s ∈ S.

14 of 31

Formal Definition

We consider an LSH family F defined for a metric space M = (M, d), a threshold R > 0 and an approximation factor c > 1. The family F of functions h: M → S maps elements from the metric space to buckets s ∈ S. For any two points p, q M, using a function h ∈ F chosen randomly:

• if d(p, q) ≤ R → h(p) = h(q) with probability ≥ P1

• if d(p, q) ≥ cR→ h(p) = h(q) with probability at most P2

15 of 31

Formal Definition

We consider an LSH family F defined for a metric space M = (M, d), a threshold R > 0 and an approximation factor c > 1. The family F of functions h: M → S maps elements from the metric space to buckets s ∈ S. For any two points p, q M, using a function h ∈ F chosen randomly:

• if d(p, q) ≤ R → h(p) = h(q) with probability ≥ P1

• if d(p, q) ≥ cR→ h(p) = h(q) with probability at most P2

We consider a family interesting if P1 > P2.

Source: Wikipedia

16 of 31

Locality Sensitive Hashing Techniques

Bitwise Sampling, Min Hashing, SimHash

17 of 31

Bitwise Sampling

Simple technique that "samples" a particular bit at random from a binary string.

We can express it formally as:

F = { h: {0, 1}d -> {0, 1}

| 3 i ε {1, ..., d}, h(x) = xi }

Example:

We may consider the following bit-strings:

A = {0, 1, 0, 1, 0}

B = {0, 0, 0, 1, 0}

C = {1, 1, 1, 1, 1}

One possible sample would be to select the second bit - so in this case for:

A = 1

B = 0

C = 1

Clearly the results are not very powerful using bitwise sampling alone.

18 of 31

Min Hashing

More complex technique that leverages permutations of a binary string, and then aims to identify the first value in the string that is "1".

By using multiple permutations, a sequence can be generated acting as a "signature" for different features, but does not take up as much space.

This is likely very unclear, so a dedicated video for MinHashing and using it for nearest neighbor search can be found here.

Example:

Source: link

19 of 31

Min Hashing

More complex technique that leverages permutations of a binary string, and then aims to identify the first value in the string that is "1".

By using multiple permutations, a sequence can be generated acting as a "signature" for different features, but does not take up as much space.

This is likely very unclear, so a dedicated video for MinHashing and using it for nearest neighbor search can be found here.

Example:

Source: link

20 of 31

Min Hashing

More complex technique that leverages permutations of a binary string, and then aims to identify the first value in the string that is "1".

By using multiple permutations, a sequence can be generated acting as a "signature" for different features, but does not take up as much space.

This is likely very unclear, so a dedicated video for MinHashing and using it for nearest neighbor search can be found here.

Example:

Source: link

21 of 31

Min Hashing

More complex technique that leverages permutations of a binary string, and then aims to identify the first value in the string that is "1".

By using multiple permutations, a sequence can be generated acting as a "signature" for different features, but does not take up as much space.

This is likely very unclear, so a dedicated video for MinHashing and using it for nearest neighbor search can be found here.

Example:

Source: link

22 of 31

Min Hashing

More complex technique that leverages permutations of a binary string, and then aims to identify the first value in the string that is "1".

By using multiple permutations, a sequence can be generated acting as a "signature" for different features, but does not take up as much space.

This is likely very unclear, so a dedicated video for MinHashing and using it for nearest neighbor search can be found here.

Example:

Source: link

23 of 31

Min Hashing

More complex technique that leverages permutations of a binary string, and then aims to identify the first value in the string that is "1".

By using multiple permutations, a sequence can be generated acting as a "signature" for different features, but does not take up as much space.

This is likely very unclear, so a dedicated video for MinHashing and using it for nearest neighbor search can be found here.

Example:

Source: link

24 of 31

Min Hashing

More complex technique that leverages permutations of a binary string, and then aims to identify the first value in the string that is "1".

By using multiple permutations, a sequence can be generated acting as a "signature" for different features, but does not take up as much space.

This is likely very unclear, so a dedicated video for MinHashing and using it for nearest neighbor search can be found here.

Example:

Source: link

25 of 31

Min Hashing

More complex technique that leverages permutations of a binary string, and then aims to identify the first value in the string that is "1".

By using multiple permutations, a sequence can be generated acting as a "signature" for different features, but does not take up as much space.

This is likely very unclear, so a dedicated video for MinHashing and using it for nearest neighbor search can be found here.

Example:

Source: link

26 of 31

Min Hashing

More complex technique that leverages permutations of a binary string, and then aims to identify the first value in the string that is "1".

By using multiple permutations, a sequence can be generated acting as a "signature" for different features, but does not take up as much space.

This is likely very unclear, so a dedicated video for MinHashing and using it for nearest neighbor search can be found here.

Example:

Source: link

27 of 31

Min Hashing

More complex technique that leverages permutations of a binary string, and then aims to identify the first value in the string that is "1".

By using multiple permutations, a sequence can be generated acting as a "signature" for different features, but does not take up as much space.

This is likely very unclear, so a dedicated video for MinHashing and using it for nearest neighbor search can be found here.

Example:

Source: link

28 of 31

Min Hashing

More complex technique that leverages permutations of a binary string, and then aims to identify the first value in the string that is "1".

By using multiple permutations, a sequence can be generated acting as a "signature" for different features, but does not take up as much space.

This is likely very unclear, so a dedicated video for MinHashing and using it for nearest neighbor search can be found here.

Example:

Source: link

29 of 31

SimHash

Google patented technique for rapidly identifying "near-identical" documents.

• Unlike MinHash, it is more suitable for finding "near-identical" (unlike clustering closely related documents)

• More performant than MinHash

• Used by Google for duplicate detection in web crawling, and FLoC in ads

Process for identical documents

1. Generate simhashes

2. Perform "Hamming distance problem"

Generating simhashes

Simhashes are typically 64-bit sequences generated from a set of features.

From the set of features, "average" the individual bits (i.e., you will have a set of 10 64-bit numbers, determine if more 1s than 0s occur in an index and apply that value).

Hamming distance problem

Efficiently determine closest simhashes (i.e., not in quadratic time) - achieved via random projections or permutations + interpolation search.

For reference, view the following link

30 of 31

Future Reading

Locality Preserving Hashing

• Hash function f that maps a single point or multiple points in a multi-dimensional coordinate space to a single scalar value

• Maintains the property that:

|A - B| < |B - C| =>

|ƒ(A) - ƒ(B)| < |ƒ(B) - ƒ(C)|

Simhash Paper

• Link: http://www2007.org/papers/paper215.pdf

SuperMinHash Paper

• Link: https://arxiv.org/pdf/1706.05698.pdf

31 of 31

Thanks for watching!

Feel free to leave feedback below! :)