1 of 64

10-605 / 10-805

Machine Learning from Large Datasets

2 of 64

Recap: the course so far

  • Tool: Scalable computation on clusters
    • MapReduce, Spark and iteration, Spark workflows
  • Tool: Learning as optimization
    • MLE for binomials, linear regression, logistic regression, k-means clustering
    • Tools: exact solutions, gradient descent (batch, parallel, stochastic), EM and EM-like methods (for k-means)
  • “Craft” of ML
    • Tool: feature extraction / augmentation/ kernels
    • Tool: regularization and sparsity, hyperparameter tuning
    • Tool: the hash trick and efficient SGD
    • Tool: distributed ML – interaction with sparsity, AllReduce
  • Tool: Randomized algorithms
    • Bloom filter (a randomized set)
    • Count-Min sketch (a randomized mapping x → R+)
    • Today: locality sensitive hashing (LSH)


3 of 64

LOCALITY SENSITIVE HASHING (LSH)

4 of 64

LSH

  • Bloom Filter and CountMin sketches
    • weak signal (of x ∈ S, or a[x] ≥ c) is boosted by repetition
  • Locality Sensitive Hashing (LSH)
    • weak signal of similarity of two things
      • SimHash: sign(r.dot(x1)) = sign(r.dot(x2)) for random projection r
        • x1, x2 vectors
      • MinHash: min(h(w): w∊S1) = min(h(w): w∊S2) for a random hash h
        • S1, S2 sets of tokens
    • Start with LSH for SimHash (random projections)

5 of 64

LSH: key ideas

  • Goal:
    • map feature vector x to bit vector bx
    • ensure that bx preserves “similarity”

6 of 64

Random Projections

7 of 64

Random projections

[Figure: points labeled + and −, with u, −u, and a gap of 2γ marked]

8 of 64

Random projections

[Figure: the + and − points projected onto a random direction r]

Consider two points: if they are close, their projections onto r will probably be close.

9 of 64

Random projections

[Figure: the + and − points projected onto a random direction r]

Consider two points: if they are distant, their projections onto r will probably be distant …

When will the projections be close?

10 of 64

Random projections

[Figure: the + and − points projected onto a random direction r]

r can also make two distant points “close”. When do the signs agree?

11 of 64

Random projections

[Figure: vectors x1 and x2 and a random direction r]

sign(r.dot(x1)) < 0

sign(r.dot(x2)) > 0

12 of 64

Random projections

[Figure: vectors x1 and x2 and a random direction r]

sign(r.dot(x1)) > 0

sign(r.dot(x2)) > 0

13 of 64

Random projections

[Figure: vectors x1 and x2 and a random direction r]

sign(r.dot(x1)) < 0

sign(r.dot(x2)) > 0

sign(r.dot(x1)) ≠ sign(r.dot(x2))

possible angles for the hyperplane ⟂ r that separate x1 and x2: a wedge of angle 𝜃, the angle between x1 and x2

14 of 64

Random projections

[Figure: vectors x1 and x2 and a random direction r]

sign(r.dot(x1)) > 0

sign(r.dot(x2)) < 0

sign(r.dot(x1)) != sign(r.dot(x2))

possible angles for ⟂ r: the same wedge of angle 𝜃(x1, x2)

… or rotate r by 180 degrees

15 of 64

LSH: key ideas

Pr[sign(r.dot(x1)) ≠ sign(r.dot(x2))] = 𝜃(x1, x2)/π, where 𝜃(x1, x2) is the angle between x1 and x2: the higher the cosine similarity, the more often the signs agree

SimHash(x) = sign(r.dot(x)) where r is random projection

16 of 64

LSH and SimHash

  • Bloom Filter and CountMin sketches
    • weak randomized signal (x ∈ S, or a[x] ≥ c) is boosted by repetition
  • Locality Sensitive Hashing (LSH)
    • weak signal of similarity of two points
    • SimHash: sign(r.dot(x1)) = sign(r.dot(x2)) for random projection r

17 of 64

LSH: key ideas: SimHash

  • Goal:
    • map feature vector x to bit vector bx
    • ensure that bx preserves “similarity”
  • Basic idea: use random projections of x
    • Repeat many times:
      • Pick a random hyperplane r by picking random weights for each feature (say from a Gaussian)
      • Compute the inner product of r with x
      • Record if x is “close to” r (r.dot(x)>=0)
        • the next bit in bx
      • Theory says that if x’ and x have small cosine distance then bx and bx’ will have small Hamming distance (a minimal sketch follows)
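A minimal numpy sketch of the procedure above. The function names (simhash_bits, hamming_sim) and the Gaussian projection matrix are illustrative choices, not part of the slides.

```python
import numpy as np

def simhash_bits(X, n_bits=64, seed=0):
    """Map each row of X (n x d) to an n_bits-long bit vector via random projections."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], n_bits))  # one random hyperplane per output bit
    return (X @ R >= 0).astype(np.uint8)       # bit i of row j is [r_i . x_j >= 0]

def hamming_sim(b1, b2):
    """Fraction of agreeing bits; approximately 1 - angle(x1, x2)/pi."""
    return float(np.mean(b1 == b2))

# tiny check: a nearby vector agrees on most bits, an opposite one on few
x1 = np.array([1.0, 2.0, 0.5])
x2 = x1 + 0.05 * np.random.default_rng(1).normal(size=3)  # close to x1
x3 = -x1                                                   # opposite direction
B = simhash_bits(np.vstack([x1, x2, x3]), n_bits=256)
print(hamming_sim(B[0], B[1]))  # near 1.0
print(hamming_sim(B[0], B[2]))  # near 0.0
```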

18 of 64

[Slides: Ben van Durme]

19 of 64

[Slides: Ben van Durme]

20 of 64

[Slides: Ben van Durme]

21 of 64

[Slides: Ben van Durme]

22 of 64

Hamming similarity for the bit vector approximates cosine similarity

23 of 64

“Collisions are your friend” – they indicate similarity

24 of 64

SimHash and MinHash

25 of 64

MinHash and Jaccard similarity

  • Cosine is a useful similarity measure for vectors
  • A useful similarity measure for sets is:

Jaccard(S, T) = |S ∩ T | / |S ∪ T|

S = {william, w, cohen, carnegie, mellon}

T = {william, cohen, carnegie, mellon, univ}

S ∩ T = {william, cohen, carnegie, mellon}

S ∪ T = {william, w, cohen, carnegie, mellon, univ}

Jaccard(S, T) = 4/6 = 0.667

26 of 64

MinHash and Jaccard similarity

  • A useful similarity measure for sets:

Jaccard(S, T) = |S ∩ T | / |S ∪ T|

SimHash(x) = sign(r.dot(x)) where r is random projection

MinHash(S) = min(h(w) for w in S) where h is random hash

Claim: Pr(MinHash(S) = MinHash(T)) = Jaccard(S, T)

where probability is taken over random choice of h

Pr[sign(r.dot(x’)) ≠ sign(r.dot(x))] = 𝜃(x, x’)/π, where the probability is over the choice of r

27 of 64

S = {william, w, cohen, carnegie, mellon, univ}

T = {william, cohen, carnegie, mellon}

[Figure: the tokens of S ∪ T laid out by hash value: h(william), h(w), h(carnegie), h(mellon), h(univ), h(cohen)]

MinHash(S) = MinHash(T) = h(‘carnegie’)

MinHash(S) = MinHash(T) ⇒ the token with the minimum hash value lies in (S ∩ T)

but that minimizing token could be any element of (S ∪ T)

h(w) is an integer; for a random h, the ordering of h(w1), …, h(wn) is a random permutation of the tokens

28 of 64

S = {william, w, cohen, carnegie, mellon, univ}

T = {william, cohen, carnegie, mellon}

[Figure: the tokens of S ∪ T laid out by hash value: h(william), h(w), h(carnegie), h(mellon), h(univ), h(cohen)]

MinHash(S) = h(‘w’) ≠ MinHash(T): the minimizing token is in (S ∪ T) but not in (S ∩ T)

Claim: Pr(MinHash(S) = MinHash(T)) = Jaccard(S, T)

where probability is taken over random choice of h

29 of 64

MinHash and Jaccard similarity

A minhash signature for S is a vector kS = (k1, …, kM), where

ki = min(hi(w) for w in S)

It approximates Jaccard similarity, just as SimHash bit vectors approximate cosine similarity (a minimal sketch follows).

The number of hashes M can be smaller than |S|.

The range of each hi can also be small: O(|S|).
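A minimal sketch of minhash signatures, assuming salted CRC32 hashes stand in for the M independent random hash functions; names like minhash_signature and estimated_jaccard are illustrative.

```python
import zlib

def minhash_signature(S, M=100):
    """M-hash minhash signature of a set of string tokens; salted CRC32
    stands in for M independent random hash functions."""
    return [min(zlib.crc32(f"{i}:{w}".encode()) for w in S) for i in range(M)]

def estimated_jaccard(sig_s, sig_t):
    """Fraction of matching signature positions estimates Jaccard(S, T)."""
    return sum(a == b for a, b in zip(sig_s, sig_t)) / len(sig_s)

S = {"william", "w", "cohen", "carnegie", "mellon"}
T = {"william", "cohen", "carnegie", "mellon", "univ"}
print(estimated_jaccard(minhash_signature(S), minhash_signature(T)))  # roughly 4/6
```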

30 of 64

[Worked example: minhash values for the sets {a,d}, {c}, {b,d,e}, {a,c,d}]

31 of 64

LSH applications

  • Compact storage of data
    • size(x) ≫ size(bx)
    • and we can still compute similarities
  • LSH can also be used for fast approximate nearest-neighbor search!

32 of 64

From MinHash Signatures to Approximate Nearest Neighbors

33 of 64

MinHash LSH and approximate k-NN

  • Goal: given q find all x with sim(q, x) > s0
    • We have M hashes in signatures k(x)

ki(q) = min(hi(w) for w in q)

    • Can we use x for which k1(x) = k1(q)?
      • Very noisy, because this is a weak signal
    • Can we use x for which k(x) = k(q), i.e., all M hashes match?
      • Maybe: these x’s should be similar to q
      • But Pr(x is missed) = 1 − sim(q, x)^M
      • … so we will miss many similar values

34 of 64

MinHash LSH and approximate k-NN

  • Goal: given q find all x with sim(q, x) > s0
    • We have M hashes

ki(q) = min(hi(w) for w in q)

    • Idea: group the M = b*r hashes into b bands with r signals in each band
      • Take x as a candidate neighbor if it matches q on all signals in any band.
      • Pr(x is missed) = (1 − sim(q, x)^r)^b (a small numeric sketch follows)
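A small numeric check of the miss-probability formula above, comparing one band of M = 20 hashes against b = 5 bands of r = 4; all names are illustrative.

```python
def miss_prob(sim, r, b):
    """Pr(x is missed) = (1 - sim^r)^b when M = b*r hashes are split
    into b bands of r hashes each and a match on any full band suffices."""
    return (1 - sim ** r) ** b

# M = 20 hashes: one band of 20 vs. 5 bands of 4
for sim in (0.3, 0.5, 0.7, 0.9):
    print(f"sim={sim}: all-in-one={miss_prob(sim, 20, 1):.3f}, banded={miss_prob(sim, 4, 5):.3f}")
```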

35 of 64

MinHash LSH and approximate k-NN

  • Goal: given q find all x with sim(q, x) > s0
    • We have M hashes

ki(q) = min(hi(w) for w in q)

    • Pr(x is missed) = (1 − sim(q, x)^r)^b

[Plot: Pr(x is missed) as a function of sim(x, q)]

36 of 64

MinHash LSH and approximate k-NN

  • Goal: given q find all x with sim(q, x) > s0
    • We have M hashes

ki(q) = min(hi(w) for w in q)

    • Ideas:
      • replace k with concat(k, k’) where k’ is a random permutation of k
      • now the signature is longer
    • group the 2b*r hashes into 2b bands with r signals in each band
    • more bands ⇒ more recall
    • length-r bands are unlikely to be duplicated (an index sketch follows)
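A sketch of a banded minhash-LSH index, assuming signatures are plain Python lists of ints; the class and method names (BandedLSHIndex, add, candidates) are illustrative, not a reference implementation.

```python
from collections import defaultdict

class BandedLSHIndex:
    """Items whose minhash signatures agree on all r hashes of any band
    become candidate neighbors."""
    def __init__(self, b, r):
        self.b, self.r = b, r
        self.buckets = [defaultdict(set) for _ in range(b)]  # one table per band

    def _bands(self, sig):
        return [tuple(sig[z * self.r:(z + 1) * self.r]) for z in range(self.b)]

    def add(self, item_id, sig):
        for z, band in enumerate(self._bands(sig)):
            self.buckets[z][band].add(item_id)

    def candidates(self, sig):
        cands = set()
        for z, band in enumerate(self._bands(sig)):
            cands |= self.buckets[z].get(band, set())
        return cands
```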

37 of 64

More on Nearest Neighbors

38 of 64

SimHash LSH for k-NN search

  • Problem: given a bit vector q find the closest bit vector, h, in a collection
  • We could look at all buckets with Hamming distance 0, then 1, then 2, … but the number of buckets at distance k from an n-bit code is (n choose k), which blows up quickly
  • Multi-index hashing for SimHash
    • start out with a long code h
    • break it into m equal-size bitstrings h(1), …, h(m)
    • claim: if HammingDist(h, g) ≥ r, then for some z, HammingDist(h(z), g(z)) ≥ ceil(r/m)

  • Example if m=4:
    • h and g differ in 3 bits ⇒ at least one of the four “bands” has HammingDist(h(z), g(z)) >= 1
    • h and g differ in 7 bits ⇒ … >= 2

like bands

  • Old: look at buckets with Hamming distance 0 to g, then 1, … ⇒ up to 2^n buckets
  • New: look at buckets with Hamming distance 0 to g(z), then 1, … ⇒ up to m*2^(n/m) buckets

  • Fast in practice! (a simplified sketch follows)
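A simplified sketch of the multi-index idea: split each code into m substrings and probe each table only at radius 0 (exact substring match); the full method also probes substrings at small Hamming distances. Names (MultiIndexHash, query) are illustrative.

```python
from collections import defaultdict

class MultiIndexHash:
    """Split each n-bit code into m substrings and index each in its own table.
    Probing each table with the query's matching substring (radius 0) is
    guaranteed, by pigeonhole, to find every code within Hamming distance m-1
    of the query; probing at larger radii extends the guarantee."""
    def __init__(self, n_bits, m):
        assert n_bits % m == 0
        self.m, self.w = m, n_bits // m
        self.tables = [defaultdict(list) for _ in range(m)]
        self.codes = {}

    def _chunks(self, bits):  # bits: tuple of 0/1 values of length n_bits
        return [tuple(bits[z * self.w:(z + 1) * self.w]) for z in range(self.m)]

    def add(self, item_id, bits):
        self.codes[item_id] = bits
        for z, chunk in enumerate(self._chunks(bits)):
            self.tables[z][chunk].append(item_id)

    def query(self, bits, max_dist):
        cands = {i for z, chunk in enumerate(self._chunks(bits))
                 for i in self.tables[z].get(chunk, [])}
        dist = lambda a, b: sum(x != y for x, y in zip(a, b))
        return [i for i in cands if dist(self.codes[i], bits) <= max_dist]
```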

39 of 64

k-means for k-NN search

40 of 64

k-means and k-NN

  1. Find centroid i closest to qx
  2. Find element x in cluster i closest to qx

Not guaranteed to find the closest neighbor

You can make guarantees using the triangle inequality if you precompute distances between x and its centroid and check multiple centroids

41 of 64

J-means for k-NN search

  • Does J-means help you find points close to qx?
  • After you have learned the J clusters of N objects
    • Find centroid i closest to qx
      • O(J) distance computations
    • Find element x in cluster i closest to qx
      • O(N/J) distance computations
  • Overall comparison
      • if J=sqrt(N): O(J) + O(N/J) beats O(N)

  • If this isn’t enough you can try running J-means internally in each cluster, and so on
    • Build a tree of depth d
      • O(d*J) + O(N / J^d) distance computations (a two-stage search sketch follows)
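A sketch of the two-stage search on this slide, assuming the J centroids and the per-cluster membership lists were already computed (e.g., by k-means); the name kmeans_nn_search is illustrative.

```python
import numpy as np

def kmeans_nn_search(q, centroids, cluster_members, X):
    """Two-stage approximate NN: nearest of the J centroids (O(J) distances),
    then an exhaustive search inside that one cluster (about N/J distances).
    centroids: (J, d); cluster_members[i]: array of row indices of X in cluster i."""
    i = int(np.argmin(np.linalg.norm(centroids - q, axis=1)))       # stage 1
    members = np.asarray(cluster_members[i])
    j = members[np.argmin(np.linalg.norm(X[members] - q, axis=1))]  # stage 2
    return int(j)  # index into X of the (approximate) nearest neighbor
```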

42 of 64

Vector quantization

  • We can use the same data structure to either
    • find approximate k-NN’s to qx
    • map qx to a centroid that is similar
      • centroid is an approximate encoding of the near neighbor x
  • With a hierarchical clustering and millions of centroids this will be a good approximation of qx
    • but it is compact to describe (as a path through the J-means tree)
    • so J-means also compresses a set of vectors, like LSH

43 of 64

Product quantization for k-NN search

  • Another trick:
    • split x into m equal-size subvectors x(1), …, x(m)
    • build a J-means index for each of the m sets of subvectors
    • to “encode” qx find the closest centroids to qx(1), …, qx(m)
      • call them c(1), …, c(m)
      • call the concatenation c
    • there are J^m possible c’s
  • The larger space of encodings means PQ can be very accurate
  • Theory says that the encoding loss is lower for qx, c pairs that are highly similar
    • these are the important ones (a minimal encode/decode sketch follows)

like bands
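A minimal encode/decode sketch of product quantization, assuming the m per-subvector codebooks (J centroids each) were already trained; pq_encode and pq_decode are hypothetical helper names.

```python
import numpy as np

def pq_encode(x, codebooks):
    """codebooks[z] is a (J, d/m) array of centroids for the z-th subvector;
    the PQ code is the tuple of nearest-centroid ids, one per subvector.
    Assumes len(x) is divisible by m = len(codebooks)."""
    subvecs = np.split(x, len(codebooks))
    return tuple(int(np.argmin(np.linalg.norm(cb - sv, axis=1)))
                 for cb, sv in zip(codebooks, subvecs))

def pq_decode(code, codebooks):
    """Approximate reconstruction: concatenate the chosen centroids."""
    return np.concatenate([cb[c] for cb, c in zip(codebooks, code)])
```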

44 of 64

Product quantization for k-NN search

45 of 64

A Recent Survey

2024

46 of 64

Extension/Application: Online LSH and Pooling


47 of 64


2010

48 of 64

LSH algorithm

  • Naïve algorithm:
    • Initialization:
      • For i=1 to outputBits:
        • For each feature f:
          • Draw r(f,i) ~ Normal(0,1)
    • Given an instance x
      • For i=1 to outputBits:

        • LSH[i] = sum(x[f]*r[i,f] for f with non-zero weight in x) > 0 ? 1 : 0
      • Return the bit-vector LSH (a runnable sketch follows)
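A runnable sketch of the naïve algorithm above for sparse instances represented as feature-to-weight dicts; init_projections and lsh_encode are illustrative names.

```python
import numpy as np

def init_projections(vocabulary, output_bits, seed=0):
    """Naive initialization: one Normal(0,1) weight per (feature, output bit)."""
    rng = np.random.default_rng(seed)
    return {f: rng.normal(size=output_bits) for f in vocabulary}

def lsh_encode(x, r):
    """x: dict feature -> weight (sparse instance); r: output of init_projections.
    Assumes every feature of x was seen at initialization."""
    output_bits = len(next(iter(r.values())))
    acc = np.zeros(output_bits)
    for f, v in x.items():                 # only features with non-zero weight in x
        acc += v * r[f]
    return (acc > 0).astype(np.uint8)      # LSH[i] = 1 iff sum_f x[f]*r[i,f] > 0
```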


49 of 64

LSH algorithm

  • But: storing the k classifiers is expensive in high dimensions
    • For each of 256 bits, a dense vector of weights for every feature in the vocabulary
  • Storing seeds and random number generators:
    • Possible but somewhat fragile


50 of 64

LSH: “pooling” (van Durme)

  • Alternative algorithm:
    • Initialization:
      • Create a pool:
        • Pick a random seed s
        • For i=1 to poolSize:
          • Draw pool[i] ~ Normal(0,1)
      • For i=1 to outputBits:
        • Devise a random hash function hash(i,f):
    • Given an instance x
      • For i=1 to outputBits:

        • LSH[i] = sum(x[f] * pool[hash(i,f) % poolSize] for f in x) > 0 ? 1 : 0
      • Return the bit-vector LSH (a runnable sketch follows)
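A runnable sketch of the pooled variant, assuming an MD5-based hash(i, f) (any cheap deterministic hash would do); make_pool and pooled_lsh_encode are illustrative names.

```python
import hashlib
import numpy as np

def make_pool(pool_size, seed=0):
    """One shared pool of Normal(0,1) values replaces the per-bit weight vectors."""
    return np.random.default_rng(seed).normal(size=pool_size)

def feature_hash(i, f):
    """Deterministic hash of (output bit i, feature name f); MD5 used for portability."""
    return int(hashlib.md5(f"{i}|{f}".encode()).hexdigest(), 16)

def pooled_lsh_encode(x, pool, output_bits):
    """x: dict feature -> weight. The weight of feature f in projection i is
    pool[hash(i, f) % poolSize], so only the pool (not the r's) must be stored."""
    bits = np.zeros(output_bits, dtype=np.uint8)
    for i in range(output_bits):
        s = sum(v * pool[feature_hash(i, f) % len(pool)] for f, v in x.items())
        bits[i] = 1 if s > 0 else 0
    return bits
```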


51 of 64


52 of 64

LSH: key ideas: pooling

  • Advantages:
    • with pooling, this is a compact re-encoding of the data
      • you don’t need to store the r’s, just the pool
      • no worries about saving/reusing the particular hash functions you used


53 of 64

Locality Sensitive Hashing (LSH) in an On-line Setting


54 of 64

Recap: using a Count-Min Sketch

x appears near y

Turney, 2002 used two seed words, “excellent” and “poor”

In general, SO(w) (the semantic orientation of w) can be written in terms of logs of products of counters for w, with and without the seeds

55 of 64

LSH: key ideas: online computation

  • This SO trick is based on another common task: distributional clustering
    • for a word w, v(w) is sparse vector of words that co-occur with w
    • cluster the v(w)’s


…guards at Pentonville prison in North London discovered that an escape attempt…

An American Werewolf in London is to be remade by the son of the original director…

…UK pop up shop on Monmouth Street in London today and on Friday the brand…

v(London): Pentonville, prison, in, North, …. and, on Friday

56 of 64

LSH: key ideas: online computation

  • Common task: distributional clustering
    • for a word w, v(w) is sparse vector of words that co-occur with w
    • cluster the v(w)’s – or here, compute similarities


[Figure: example word pairs, each shown as “… is similar to: …”]

57 of 64

LSH: key ideas: online computation

  • Common task: distributional clustering
    • for a word w, v(w) is sparse vector of words that co-occur with w
    • cluster the v(w)’s


v(w) is very similar to a word embedding (e.g., from word2vec or GloVe)

Levy, Omer, Yoav Goldberg, and Ido Dagan. "Improving distributional similarity with lessons learned from word embeddings." Transactions of the Association for Computational Linguistics 3 (2015): 211-225.

58 of 64

LSH: key ideas: online computation

  • Construct LSH(v(w)) for each word w in the corpus
    • stream through the corpus once
    • use minimal amount of memory
    • ideally no more than size of output


…guards at Pentonville prison in North London discovered that an escape attempt…

An American Werewolf in London is to be remade by the son of the original director…

…UK pop up shop on Monmouth Street in London today and on Friday the brand…

v(London): Pentonville, prison, in, North, …. and, on Friday

LSH(v(London)): 0101101010111011101

LSH(v(Monmouth)): 0111101000110010111

59 of 64


[Annotations on the online update:]
  • v is the context vector; d is the vocabulary size
  • ri is the i-th random projection; hi(v) is the i-th bit of the LSH encoding
  • because the context vector is a sum of mention contexts, which arrive one by one as we stream through the corpus, keep the running projection as H[w][i] and threshold it to a bit later on

60 of 64


[Annotations: the j-th hash of fi gives the weight of fi in rj; at the end, compress H[w][j] to bits]

61 of 64

Comments

  • Tricks in this paper
    • saving calls to random() with pooling
    • computing word distribution vectors incrementally
      • and at each increment compress, via random projection, to d real numbers
    • thresholding each random projection to a bit (a streaming sketch follows)
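A streaming sketch combining these tricks, assuming the corpus arrives as (word, context-word) pairs and each co-occurrence adds weight 1; the pooled-weight lookup and function names are illustrative, not the paper's exact recipe.

```python
import hashlib
from collections import defaultdict
import numpy as np

def pooled_weight(pool, i, f):
    """Weight of context feature f in random projection i, read from the shared pool."""
    h = int(hashlib.md5(f"{i}|{f}".encode()).hexdigest(), 16)
    return pool[h % len(pool)]

def online_lsh(pairs, output_bits=64, pool_size=10_000, seed=0):
    """Stream (word, context_word) pairs; keep only the running projections
    H[w][i] = sum over contexts of the pooled weight, and threshold to bits at the end."""
    pool = np.random.default_rng(seed).normal(size=pool_size)
    H = defaultdict(lambda: np.zeros(output_bits))
    for w, c in pairs:                                  # one pass over the corpus
        H[w] += np.array([pooled_weight(pool, i, c) for i in range(output_bits)])
    return {w: (h > 0).astype(np.uint8) for w, h in H.items()}

# usage: pairs could come from sliding context windows over the token stream
pairs = [("london", "prison"), ("london", "north"), ("monmouth", "street")]
print(online_lsh(pairs, output_bits=16))
```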

62 of 64

Experiment

  • Corpus: 700M+ tokens, 1.1M distinct bigrams
  • For each, build a feature vector of words that co-occur near it, using on-line LSH
  • Check results with 50,000 actual vectors


63 of 64

Experiment


64 of 64


[Results plot: performance at 16, 32, 64, 128, and 256 LSH bits]