1 of 64

10-605 / 10-805

Machine Learning from Large Datasets

2 of 64

Recap: the course so far

  • Tool: Scalable computation on clusters
    • MapReduce, Spark and iteration, Spark workflows
  • Tool: Learning as optimization
    • MLE for binomials, linear regression, logistic regression, k-means clustering
    • Tools: exact solutions, gradient descent (batch, parallel, stochastic), EM and EM-like methods (for k-means)
  • “Craft” of ML
    • Tool: feature extraction / augmentation/ kernels
    • Tool: regularization and sparsity, hyperparameter tuning
    • Tool: the hash trick and efficient SGD
    • Tool: distributed ML – interaction with sparsity, AllReduce
  • Tool: Randomized algorithms
    • Bloom filter (a randomized set)
    • Count-Min sketch (a randomized mapping x → R+)
    • Today: locality sensitive hashing (LSH)


3 of 64

LOCALITY SENSITIVE HASHING (LSH)

4 of 64

LSH

  • Bloom Filter and CountMin sketches
    • weak signal (of x ∈ S, or a[x] ≥ c) is boosted by repetition
  • Locality Sensitive Hashing (LSH)
    • weak signal of similarity of two things
      • SimHash: sign(r.dot(x1)) = sign(r.dot(x2)) for random projection r
        • x1, x2 vectors
      • MinHash: min(h(w): w∊S1) = min(h(w): w∊S2) for a random hash h
        • S1, S2 sets of tokens
    • Start with LSH for SimHash (random projections)

5 of 64

LSH: key ideas

  • Goal:
    • map feature vector x to bit vector bx
    • ensure that bx preserves “similarity”

6 of 64

Random Projections

7 of 64

Random projections

[Figure: points labeled + and −, with u, −u, and a gap of 2γ marked]

8 of 64

Random projections

[Figure: the + and − points projected onto a random direction r]

Consider two points: if they are close, their projections onto r will probably be close.

9 of 64

Random projections

[Figure: the + and − points projected onto a random direction r]

Consider two points: if they are distant, their projections onto r will probably be distant …

When will the projections be close?

10 of 64

Random projections

[Figure: the + and − points projected onto a random direction r]

r can also make two distant points “close”. When do the signs agree?

11 of 64

Random projections

[Figure: vectors x1 and x2 and a random direction r]

sign(r.dot(x1)) < 0

sign(r.dot(x2)) > 0

12 of 64

Random projections

[Figure: vectors x1 and x2 and a random direction r]

sign(r.dot(x1)) > 0

sign(r.dot(x2)) > 0

13 of 64

Random projections

[Figure: vectors x1 and x2 and a random direction r]

sign(r.dot(x1)) < 0

sign(r.dot(x2)) > 0

sign(r.dot(x1)) ≠ sign(r.dot(x2))

possible angles for the hyperplane ⟂ r that separate x1 and x2: a wedge of angle 𝜃, the angle between x1 and x2

14 of 64

Random projections

[Figure: vectors x1 and x2 and a random direction r]

sign(r.dot(x1)) > 0

sign(r.dot(x2)) < 0

sign(r.dot(x1)) != sign(r.dot(x2))

possible angles for ⟂ r: the same wedge of angle 𝜃(x1, x2)

… or rotate r by 180 degrees

15 of 64

LSH: key ideas

Pr[sign(r.dot(x1)) ≠ sign(r.dot(x2))] = 𝜃(x1, x2)/π, where 𝜃(x1, x2) is the angle between x1 and x2: the higher the cosine similarity, the more often the signs agree

SimHash(x) = sign(r.dot(x)) where r is random projection

16 of 64

LSH and SimHash

  • Bloom Filter and CountMin sketches
    • weak randomized signal (x ∈ S, or a[x] ≥ c) is boosted by repetition
  • Locality Sensitive Hashing (LSH)
    • weak signal of similarity of two points
    • SimHash: sign(r.dot(x1)) = sign(r.dot(x2)) for random projection r

17 of 64

LSH: key ideas: SimHash

  • Goal:
    • map feature vector x to bit vector bx
    • ensure that bx preserves “similarity”
  • Basic idea: use random projections of x
    • Repeat many times:
      • Pick a random hyperplane r by picking random weights for each feature (say from a Gaussian)
      • Compute the inner product of r with x
      • Record if x is “close to” r (r.dot(x)>=0)
        • the next bit in bx
      • Theory says that if x’ and x have small cosine distance then bx and bx’ will have small Hamming distance (a minimal sketch follows)
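A minimal numpy sketch of the procedure above. The function names (simhash_bits, hamming_sim) and the Gaussian projection matrix are illustrative choices, not part of the slides.

```python
import numpy as np

def simhash_bits(X, n_bits=64, seed=0):
    """Map each row of X (n x d) to an n_bits-long bit vector via random projections."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], n_bits))  # one random hyperplane per output bit
    return (X @ R >= 0).astype(np.uint8)       # bit i of row j is [r_i . x_j >= 0]

def hamming_sim(b1, b2):
    """Fraction of agreeing bits; approximately 1 - angle(x1, x2)/pi."""
    return float(np.mean(b1 == b2))

# tiny check: a nearby vector agrees on most bits, an opposite one on few
x1 = np.array([1.0, 2.0, 0.5])
x2 = x1 + 0.05 * np.random.default_rng(1).normal(size=3)  # close to x1
x3 = -x1                                                   # opposite direction
B = simhash_bits(np.vstack([x1, x2, x3]), n_bits=256)
print(hamming_sim(B[0], B[1]))  # near 1.0
print(hamming_sim(B[0], B[2]))  # near 0.0
```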

18 of 64

[Slides: Ben van Durme]

19 of 64

[Slides: Ben van Durme]

20 of 64

[Slides: Ben van Durme]

21 of 64

[Slides: Ben van Durme]

22 of 64

Hamming similarity for the bit vector approximates cosine similarity

23 of 64

“Collisions are your friend” – they indicate similarity

24 of 64

SimHash and MinHash

25 of 64

MinHash and Jaccard similarity

  • Cosine is a useful similarity measure for vectors
  • A useful similarity measure for sets is:

Jaccard(S, T) = |S ∩ T | / |S ∪ T|

S = {william, w, cohen, carnegie, mellon}

T = {william, cohen, carnegie, mellon, univ}

S ∩ T = {william, cohen, carnegie, mellon}

S ∪ T = {william, w, cohen, carnegie, mellon, univ}

Jaccard(S, T) = 4/6 = 0.667

26 of 64

MinHash and Jaccard similarity

  • A useful similarity measure for sets:

Jaccard(S, T) = |S ∩ T | / |S ∪ T|

SimHash(x) = sign(r.dot(x)) where r is random projection

MinHash(S) = min(h(w) for w in S) where h is random hash

Claim: Pr(MinHash(S) = MinHash(T)) = Jaccard(S, T)

where probability is taken over random choice of h

Pr[sign(r.dot(x’)) ≠ sign(r.dot(x))] = 𝜃(x, x’)/π, where the probability is over the choice of r

27 of 64

S = {william, w, cohen, carnegie, mellon, univ}

T = {william, cohen, carnegie, mellon}

[Figure: the tokens of S ∪ T laid out by hash value: h(william), h(w), h(carnegie), h(mellon), h(univ), h(cohen)]

MinHash(S) = MinHash(T) = h(‘carnegie’)

MinHash(S) = MinHash(T) ⇒ the token with the minimum hash value lies in (S ∩ T)

but that minimizing token could be any element of (S ∪ T)

h(w) is an integer; for a random h, the ordering of h(w1), …, h(wn) is a random permutation of the tokens

28 of 64

S = {william, w, cohen, carnegie, mellon, univ}

T = {william, cohen, carnegie, mellon}

[Figure: the tokens of S ∪ T laid out by hash value: h(william), h(w), h(carnegie), h(mellon), h(univ), h(cohen)]

MinHash(S) = h(‘w’) ≠ MinHash(T): the minimizing token is in (S ∪ T) but not in (S ∩ T)

Claim: Pr(MinHash(S) = MinHash(T)) = Jaccard(S, T)

where probability is taken over random choice of h

29 of 64

MinHash and Jaccard similarity

A minhash signature for S is a vector kS = (k1, …, kM), where

ki = min(hi(w) for w in S)

It approximates Jaccard similarity, just as SimHash bit vectors approximate cosine similarity (a minimal sketch follows).

The number of hashes M can be smaller than |S|.

The range of each hi can also be small: O(|S|).
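A minimal sketch of minhash signatures, assuming salted CRC32 hashes stand in for the M independent random hash functions; names like minhash_signature and estimated_jaccard are illustrative.

```python
import zlib

def minhash_signature(S, M=100):
    """M-hash minhash signature of a set of string tokens; salted CRC32
    stands in for M independent random hash functions."""
    return [min(zlib.crc32(f"{i}:{w}".encode()) for w in S) for i in range(M)]

def estimated_jaccard(sig_s, sig_t):
    """Fraction of matching signature positions estimates Jaccard(S, T)."""
    return sum(a == b for a, b in zip(sig_s, sig_t)) / len(sig_s)

S = {"william", "w", "cohen", "carnegie", "mellon"}
T = {"william", "cohen", "carnegie", "mellon", "univ"}
print(estimated_jaccard(minhash_signature(S), minhash_signature(T)))  # roughly 4/6
```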

30 of 64

[Worked example: minhash values for the sets {a,d}, {c}, {b,d,e}, {a,c,d}]

31 of 64

LSH applications

  • Compact storage of data
    • size(x) ≫ size(bx)
    • and we can still compute similarities
  • LSH can also be used for fast approximate nearest-neighbor search!

32 of 64

From MinHash Signatures to Approximate Nearest Neighbors

33 of 64

MinHash LSH and approximate k-NN

  • Goal: given q find all x with sim(q, x) > s0
    • We have M hashes in signatures k(x)

ki(q) = min(hi(w) for w in q)

    • Can we use x for which k1(x) = k1(q)?
      • Very noisy, because this is a weak signal
    • Can we use x for which k(x) = k(q), i.e., all M hashes match?
      • Maybe: these x’s should be similar to q
      • But Pr(x is missed) = 1 − sim(q, x)^M
      • … so we will miss many similar values

34 of 64

MinHash LSH and approximate k-NN

  • Goal: given q find all x with sim(q, x) > s0
    • We have M hashes

ki(q) = min(hi(w) for w in q)

    • Idea: group the M = b*r hashes into b bands with r signals in each band
      • Take x as a candidate neighbor if it matches q on all signals in any band.
      • Pr(x is missed) = (1 − sim(q, x)^r)^b (a small numeric sketch follows)
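A small numeric check of the miss-probability formula above, comparing one band of M = 20 hashes against b = 5 bands of r = 4; all names are illustrative.

```python
def miss_prob(sim, r, b):
    """Pr(x is missed) = (1 - sim^r)^b when M = b*r hashes are split
    into b bands of r hashes each and a match on any full band suffices."""
    return (1 - sim ** r) ** b

# M = 20 hashes: one band of 20 vs. 5 bands of 4
for sim in (0.3, 0.5, 0.7, 0.9):
    print(f"sim={sim}: all-in-one={miss_prob(sim, 20, 1):.3f}, banded={miss_prob(sim, 4, 5):.3f}")
```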

35 of 64

MinHash LSH and approximate k-NN

  • Goal: given q find all x with sim(q, x) > s0
    • We have M hashes

ki(q) = min(hi(w) for w in q)

    • Pr(x is missed) = (1 − sim(q, x)^r)^b

[Plot: Pr(x is missed) as a function of sim(x, q)]

36 of 64

MinHash LSH and approximate k-NN

  • Goal: given q find all x with sim(q, x) > s0
    • We have M hashes

ki(q) = min(hi(w) for w in q)

    • Ideas:
      • replace k with concat(k, k’) where k’ is a random permutation of k
      • now the signature is longer
    • group the 2b*r hashes into 2b bands with r signals in each band
    • more bands ⇒ more recall
    • length-r bands are unlikely to be duplicated (an index sketch follows)
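A sketch of a banded minhash-LSH index, assuming signatures are plain Python lists of ints; the class and method names (BandedLSHIndex, add, candidates) are illustrative, not a reference implementation.

```python
from collections import defaultdict

class BandedLSHIndex:
    """Items whose minhash signatures agree on all r hashes of any band
    become candidate neighbors."""
    def __init__(self, b, r):
        self.b, self.r = b, r
        self.buckets = [defaultdict(set) for _ in range(b)]  # one table per band

    def _bands(self, sig):
        return [tuple(sig[z * self.r:(z + 1) * self.r]) for z in range(self.b)]

    def add(self, item_id, sig):
        for z, band in enumerate(self._bands(sig)):
            self.buckets[z][band].add(item_id)

    def candidates(self, sig):
        cands = set()
        for z, band in enumerate(self._bands(sig)):
            cands |= self.buckets[z].get(band, set())
        return cands
```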

37 of 64

More on Nearest Neighbors

38 of 64

SimHash LSH for k-NN search

  • Problem: given a bit vector q find the closest bit vector, h, in a collection
  • We could look at all buckets with Hamming distance 0, then 1, then 2, … but the number of buckets at distance k from an n-bit code is (n choose k), which blows up quickly
  • Multi-index hashing for SimHash
    • start out with a long code h
    • break it into m equal-size bitstrings h(1), …, h(m)
    • claim: if HammingDist(h, g) ≥ r, then for some z, HammingDist(h(z), g(z)) ≥ ceil(r/m)

  • Example if m=4:
    • h and g differ in 3 bits ⇒ at least one of the four “bands” has HammingDist(h(z), g(z)) >= 1
    • h and g differ in 7 bits ⇒ … >= 2

like bands

  • Old: look at buckets with Hamming distance 0 to g, then 1, … ⇒ up to 2^n buckets
  • New: look at buckets with Hamming distance 0 to g(z), then 1, … ⇒ up to m*2^(n/m) buckets

  • Fast in practice! (a simplified sketch follows)
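A simplified sketch of the multi-index idea: split each code into m substrings and probe each table only at radius 0 (exact substring match); the full method also probes substrings at small Hamming distances. Names (MultiIndexHash, query) are illustrative.

```python
from collections import defaultdict

class MultiIndexHash:
    """Split each n-bit code into m substrings and index each in its own table.
    Probing each table with the query's matching substring (radius 0) is
    guaranteed, by pigeonhole, to find every code within Hamming distance m-1
    of the query; probing at larger radii extends the guarantee."""
    def __init__(self, n_bits, m):
        assert n_bits % m == 0
        self.m, self.w = m, n_bits // m
        self.tables = [defaultdict(list) for _ in range(m)]
        self.codes = {}

    def _chunks(self, bits):  # bits: tuple of 0/1 values of length n_bits
        return [tuple(bits[z * self.w:(z + 1) * self.w]) for z in range(self.m)]

    def add(self, item_id, bits):
        self.codes[item_id] = bits
        for z, chunk in enumerate(self._chunks(bits)):
            self.tables[z][chunk].append(item_id)

    def query(self, bits, max_dist):
        cands = {i for z, chunk in enumerate(self._chunks(bits))
                 for i in self.tables[z].get(chunk, [])}
        dist = lambda a, b: sum(x != y for x, y in zip(a, b))
        return [i for i in cands if dist(self.codes[i], bits) <= max_dist]
```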

39 of 64

k-means for k-NN search

40 of 64

k-means and k-NN

  1. Find centroid i closest to qx
  2. Find element x in cluster i closest to qx

Not guaranteed to find the closest neighbor

You can make guarantees using the triangle inequality if you precompute distances between x and its centroid and check multiple centroids

41 of 64

J-means for k-NN search

  • Does J-means help you find points close to qx?
  • After you have learned the J clusters of N objects
    • Find centroid i closest to qx
      • O(J) distance computations
    • Find element x in cluster i closest to qx
      • O(N/J) distance computations
  • Overall comparison
      • if J=sqrt(N): O(J) + O(N/J) beats O(N)

  • If this isn’t enough you can try running J-means internally in each cluster, and so on
    • Build a tree of depth d
      • O(d*J) + O(N / J^d) distance computations (a two-stage search sketch follows)
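A sketch of the two-stage search on this slide, assuming the J centroids and the per-cluster membership lists were already computed (e.g., by k-means); the name kmeans_nn_search is illustrative.

```python
import numpy as np

def kmeans_nn_search(q, centroids, cluster_members, X):
    """Two-stage approximate NN: nearest of the J centroids (O(J) distances),
    then an exhaustive search inside that one cluster (about N/J distances).
    centroids: (J, d); cluster_members[i]: array of row indices of X in cluster i."""
    i = int(np.argmin(np.linalg.norm(centroids - q, axis=1)))       # stage 1
    members = np.asarray(cluster_members[i])
    j = members[np.argmin(np.linalg.norm(X[members] - q, axis=1))]  # stage 2
    return int(j)  # index into X of the (approximate) nearest neighbor
```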

42 of 64

Vector quantization

  • We can use the same data structure to either
    • find approximate k-NN’s to qx
    • map qx to a centroid that is similar
      • centroid is an approximate encoding of the near neighbor x
  • With a hierarchical clustering and millions of centroids this will be a good approximation of qx
    • but it is compact to describe (as a path through the J-means tree)
    • so J-means also compresses a set of vectors, like LSH

43 of 64

Product quantization for k-NN search

  • Another trick:
    • split x into m equal-size subvectors x(1), …, x(m)
    • build a J-means index for each of the m sets of subvectors
    • to “encode” qx find the closest centroids to qx(1), …, qx(m)
      • call them c(1), …, c(m)
      • call the concatenation c
    • there are J^m possible c’s
  • The larger space of encodings means PQ can be very accurate
  • Theory says that the encoding loss is lower for qx, c pairs that are highly similar
    • these are the important ones (a minimal encode/decode sketch follows)

like bands
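A minimal encode/decode sketch of product quantization, assuming the m per-subvector codebooks (J centroids each) were already trained; pq_encode and pq_decode are hypothetical helper names.

```python
import numpy as np

def pq_encode(x, codebooks):
    """codebooks[z] is a (J, d/m) array of centroids for the z-th subvector;
    the PQ code is the tuple of nearest-centroid ids, one per subvector.
    Assumes len(x) is divisible by m = len(codebooks)."""
    subvecs = np.split(x, len(codebooks))
    return tuple(int(np.argmin(np.linalg.norm(cb - sv, axis=1)))
                 for cb, sv in zip(codebooks, subvecs))

def pq_decode(code, codebooks):
    """Approximate reconstruction: concatenate the chosen centroids."""
    return np.concatenate([cb[c] for cb, c in zip(codebooks, code)])
```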

44 of 64

Product quantization for k-NN search

45 of 64

A Recent Survey

2024

46 of 64

Extension/Application: Online LSH and Pooling


47 of 64


2010

48 of 64

LSH algorithm

  • Naïve algorithm:
    • Initialization:
      • For i=1 to outputBits:
        • For each feature f:
          • Draw r(f,i) ~ Normal(0,1)
    • Given an instance x
      • For i=1 to outputBits:

        • LSH[i] = sum(x[f]*r[i,f] for f with non-zero weight in x) > 0 ? 1 : 0
      • Return the bit-vector LSH (a runnable sketch follows)
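A runnable sketch of the naïve algorithm above for sparse instances represented as feature-to-weight dicts; init_projections and lsh_encode are illustrative names.

```python
import numpy as np

def init_projections(vocabulary, output_bits, seed=0):
    """Naive initialization: one Normal(0,1) weight per (feature, output bit)."""
    rng = np.random.default_rng(seed)
    return {f: rng.normal(size=output_bits) for f in vocabulary}

def lsh_encode(x, r):
    """x: dict feature -> weight (sparse instance); r: output of init_projections.
    Assumes every feature of x was seen at initialization."""
    output_bits = len(next(iter(r.values())))
    acc = np.zeros(output_bits)
    for f, v in x.items():                 # only features with non-zero weight in x
        acc += v * r[f]
    return (acc > 0).astype(np.uint8)      # LSH[i] = 1 iff sum_f x[f]*r[i,f] > 0
```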


49 of 64

LSH algorithm

  • But: storing the k classifiers is expensive in high dimensions
    • For each of 256 bits, a dense vector of weights for every feature in the vocabulary
  • Storing seeds and random number generators:
    • Possible but somewhat fragile


50 of 64

LSH: “pooling” (van Durme)

  • Alternative algorithm:
    • Initialization:
      • Create a pool:
        • Pick a random seed s
        • For i=1 to poolSize:
          • Draw pool[i] ~ Normal(0,1)
      • For i=1 to outputBits:
        • Devise a random hash function hash(i,f):
    • Given an instance x
      • For i=1 to outputBits:

        • LSH[i] = sum(x[f] * pool[hash(i,f) % poolSize] for f in x) > 0 ? 1 : 0
      • Return the bit-vector LSH (a runnable sketch follows)
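A runnable sketch of the pooled variant, assuming an MD5-based hash(i, f) (any cheap deterministic hash would do); make_pool and pooled_lsh_encode are illustrative names.

```python
import hashlib
import numpy as np

def make_pool(pool_size, seed=0):
    """One shared pool of Normal(0,1) values replaces the per-bit weight vectors."""
    return np.random.default_rng(seed).normal(size=pool_size)

def feature_hash(i, f):
    """Deterministic hash of (output bit i, feature name f); MD5 used for portability."""
    return int(hashlib.md5(f"{i}|{f}".encode()).hexdigest(), 16)

def pooled_lsh_encode(x, pool, output_bits):
    """x: dict feature -> weight. The weight of feature f in projection i is
    pool[hash(i, f) % poolSize], so only the pool (not the r's) must be stored."""
    bits = np.zeros(output_bits, dtype=np.uint8)
    for i in range(output_bits):
        s = sum(v * pool[feature_hash(i, f) % len(pool)] for f, v in x.items())
        bits[i] = 1 if s > 0 else 0
    return bits
```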


51 of 64


52 of 64

LSH: key ideas: pooling

  • Advantages:
    • with pooling, this is a compact re-encoding of the data
      • you don’t need to store the r’s, just the pool
      • no worries about saving/reusing the particular hash functions you used


53 of 64

Locality Sensitive Hashing (LSH) in an On-line Setting


54 of 64

Recap: using a Count-Min Sketch

x appears near y

Turney, 2002 used two seed words, “excellent” and “poor”

In general, SO(w) (the semantic orientation of w) can be written in terms of logs of products of counters for w, with and without the seeds

55 of 64

LSH: key ideas: online computation

  • This SO trick is based on another common task: distributional clustering
    • for a word w, v(w) is sparse vector of words that co-occur with w
    • cluster the v(w)’s


…guards at Pentonville prison in North London discovered that an escape attempt…

An American Werewolf in London is to be remade by the son of the original director…

…UK pop up shop on Monmouth Street in London today and on Friday the brand…

v(London): Pentonville, prison, in, North, …. and, on Friday

56 of 64

LSH: key ideas: online computation

  • Common task: distributional clustering
    • for a word w, v(w) is sparse vector of words that co-occur with w
    • cluster the v(w)’s – or here, compute similarities


[Figure: example word pairs, each shown as “… is similar to: …”]

57 of 64

LSH: key ideas: online computation

  • Common task: distributional clustering
    • for a word w, v(w) is sparse vector of words that co-occur with w
    • cluster the v(w)’s


v(w) is very similar to a word embedding (e.g., from word2vec or GloVe)

Levy, Omer, Yoav Goldberg, and Ido Dagan. "Improving distributional similarity with lessons learned from word embeddings." Transactions of the Association for Computational Linguistics 3 (2015): 211-225.

58 of 64

LSH: key ideas: online computation

  • Construct LSH(v(w)) for each word w in the corpus
    • stream through the corpus once
    • use minimal amount of memory
    • ideally no more than size of output


…guards at Pentonville prison in North London discovered that an escape attempt…

An American Werewolf in London is to be remade by the son of the original director…

…UK pop up shop on Monmouth Street in London today and on Friday the brand…

v(London): Pentonville, prison, in, North, …. and, on Friday

LSH(v(London)): 0101101010111011101

LSH(v(Monmouth)): 0111101000110010111

59 of 64


[Annotations on the online update:]
  • v is the context vector; d is the vocabulary size
  • ri is the i-th random projection; hi(v) is the i-th bit of the LSH encoding
  • because the context vector is a sum of mention contexts, which arrive one by one as we stream through the corpus, keep the running projection as H[w][i] and threshold it to a bit later on

60 of 64


[Annotations: the j-th hash of fi gives the weight of fi in rj; at the end, compress H[w][j] to bits]

61 of 64

Comments

  • Tricks in this paper
    • saving calls to random() with pooling
    • computing word distribution vectors incrementally
      • and at each increment compress, via random projection, to d real numbers
    • thresholding each random projection to a bit (a streaming sketch follows)
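A streaming sketch combining these tricks, assuming the corpus arrives as (word, context-word) pairs and each co-occurrence adds weight 1; the pooled-weight lookup and function names are illustrative, not the paper's exact recipe.

```python
import hashlib
from collections import defaultdict
import numpy as np

def pooled_weight(pool, i, f):
    """Weight of context feature f in random projection i, read from the shared pool."""
    h = int(hashlib.md5(f"{i}|{f}".encode()).hexdigest(), 16)
    return pool[h % len(pool)]

def online_lsh(pairs, output_bits=64, pool_size=10_000, seed=0):
    """Stream (word, context_word) pairs; keep only the running projections
    H[w][i] = sum over contexts of the pooled weight, and threshold to bits at the end."""
    pool = np.random.default_rng(seed).normal(size=pool_size)
    H = defaultdict(lambda: np.zeros(output_bits))
    for w, c in pairs:                                  # one pass over the corpus
        H[w] += np.array([pooled_weight(pool, i, c) for i in range(output_bits)])
    return {w: (h > 0).astype(np.uint8) for w, h in H.items()}

# usage: pairs could come from sliding context windows over the token stream
pairs = [("london", "prison"), ("london", "north"), ("monmouth", "street")]
print(online_lsh(pairs, output_bits=16))
```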

62 of 64

Experiment

  • Corpus: 700M+ tokens, 1.1M distinct bigrams
  • For each, build a feature vector of words that co-occur near it, using on-line LSH
  • Check results with 50,000 actual vectors


63 of 64

Experiment


64 of 64


[Results plot: performance at 16, 32, 64, 128, and 256 LSH bits]