10-605 / 10-805
Machine Learning from Large Datasets
Recap: the course so far
LOCALITY SENSITIVE HASHING (LSH)
LSH
LSH: key ideas
Random Projections
Random projections
[Figure: + and - points, with vectors u and -u and a margin of width 2γ]
Random projections
[Figure: + and - points with a random projection direction r]
Consider two points: if they are close, their projections onto r will probably be close.
Random projections
[Figure: + and - points with a random projection direction r]
Consider two points: if they are distant, their projections will probably be distant…
When will the projections be close?
Random projections
[Figure: + and - points with a random projection direction r]
r can also make two distant points “close”
when do the signs agree?
Random projections
[Figure: points x1 and x2 and a random direction r]
sign(r.dot(x1)) < 0
sign(r.dot(x2)) > 0
Random projections
[Figure: points x1 and x2 and a random direction r]
sign(r.dot(x1)) > 0
sign(r.dot(x2)) > 0
Random projections
[Figure: points x1 and x2 and a random direction r]
sign(r.dot(x1)) < 0
sign(r.dot(x2)) > 0
sign(r.dot(x1)) ≠ sign(r.dot(x2))
possible angles for ⟂r: a range of width 𝜃, where 𝜃 is the angle between x1 and x2 (cos 𝜃 = cos(x1,x2))
Random projections
[Figure: points x1 and x2 and a random direction r]
sign(r.dot(x1)) > 0
sign(r.dot(x2)) < 0
sign(r.dot(x1)) != sign(r.dot(x2))
possible angles for ⟂r: again a range of width 𝜃, with cos 𝜃 = cos(x1,x2)
… or rotate r by 180 degrees
LSH: key ideas
Pr[sign(r.dot(x1)) ≠ sign(r.dot(x2))] = 𝜃(x1,x2)/π, where 𝜃(x1,x2) is the angle between x1 and x2 (a monotone function of cos(x1,x2))
SimHash(x) = sign(r.dot(x)) where r is random projection
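A minimal sketch of SimHash with multiple bits, assuming NumPy; the sizes d, n_bits, and the seed are illustrative, not from the slides:

import numpy as np

def simhash(x, R):
    # one bit per random projection r: 1 if r.dot(x) > 0, else 0
    return (R @ x > 0).astype(int)

d, n_bits = 100, 64
rng = np.random.default_rng(0)
R = rng.standard_normal((n_bits, d))   # each row is a random projection r
x = rng.standard_normal(d)
print(simhash(x, R))                   # a vector of n_bits 0/1 values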
LSH and SimHash
LSH: key ideas: SimHash
[Slides: Ben van Durme]
Hamming similarity for the bit vector approximates cosine similarity
“Collisions are your friend” – they indicate similarity
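As a small illustration of this claim (sizes and vectors below are made up for the demo): the fraction of matching SimHash bits tracks 1 - 𝜃/π, where 𝜃 is the angle between the two vectors.

import numpy as np

rng = np.random.default_rng(1)
d, n_bits = 200, 2048
R = rng.standard_normal((n_bits, d))
x1 = rng.standard_normal(d)
x2 = x1 + 0.5 * rng.standard_normal(d)          # a nearby point

hamming_sim = np.mean((R @ x1 > 0) == (R @ x2 > 0))
theta = np.arccos(x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2)))
print(hamming_sim, 1 - theta / np.pi)           # these should be close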
SimHash and MinHash
MinHash and Jaccard similarity
Jaccard(S, T) = |S ∩ T | / |S ∪ T|
S = {william, w, cohen, carnegie, mellon}
T = {william, cohen, carnegie, mellon, univ}
S ∩ T = {william, cohen, carnegie, mellon}
S ∪ T = {william, w, cohen, carnegie, mellon, univ}
Jaccard(S, T) = 4/6 = 0.667
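The same computation, spelled out in Python for reference:

S = {"william", "w", "cohen", "carnegie", "mellon"}
T = {"william", "cohen", "carnegie", "mellon", "univ"}
print(len(S & T) / len(S | T))   # 4/6 ≈ 0.667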
MinHash and Jaccard similarity
Jaccard(S, T) = |S ∩ T | / |S ∪ T|
SimHash(x) = sign(r.dot(x)) where r is random projection
MinHash(S) = min(h(w) for w in S) where h is random hash
Claim: Pr(MinHash(S) = MinHash(T)) = Jaccard(S, T)
where probability is taken over random choice of h
Pr[sign(r.dot(x’)) ≠ sign(r.dot(x))] = 𝜃(x,x’)/π where prob is over choice of r
S = {william, w, cohen, carnegie, mellon, univ}
T = {william, cohen, carnegie, mellon}
[Figure: hash values h(william), h(w), h(carnegie), h(mellon), h(univ), h(cohen) plotted on a line]
MinHash(S) = MinHash(T) = h(‘carnegie’)
MinHash(S) = MinHash(T) ⇒ hmin ∊ (S ∩ T)
but hmin could be anything from (S ∪ T)
h(w) is an integer
h(w1), …, h(wn) for random h is a random permutation of the tokens
so each token in S ∪ T is equally likely to have the minimum hash, and the minhashes match exactly when that token lies in S ∩ T
S = {william, w, cohen, carnegie, mellon, univ}
T = {william, cohen, carnegie, mellon}
[Figure: hash values h(william), h(w), h(carnegie), h(mellon), h(univ), h(cohen) plotted on a line, with h(w) smallest]
MinHash(S) = h(‘w’) ≠ MinHash(T): the overall minimum h(‘w’) is in (S ∪ T) but not in (S ∩ T), so the minhashes differ
Claim: Pr(MinHash(S) = MinHash(T)) = Jaccard(S, T)
where probability is taken over random choice of h
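A quick Monte-Carlo check of the claim (a sketch; the salted hash below is just one convenient way to draw a “random” h):

import hashlib

def h(word, seed):
    # a different integer-valued hash function for every seed
    return int.from_bytes(hashlib.md5(f"{seed}:{word}".encode()).digest(), "big")

S = {"william", "w", "cohen", "carnegie", "mellon", "univ"}
T = {"william", "cohen", "carnegie", "mellon"}

trials = 10000
matches = sum(min(h(w, i) for w in S) == min(h(w, i) for w in T) for i in range(trials))
print(matches / trials, len(S & T) / len(S | T))   # both ≈ 0.667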
MinHash and Jaccard similarity
A minhash signature for S is a vector kS = (k1, …, kM)
ki = min(hi(w) for w in S)
It approximates Jaccard similarity, just as SimHash vectors approximate cosine similarity
The number of hashes M can be smaller than |S|
The range of each hi can also be small: O(|S|)
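A sketch of building such a signature and using it to estimate Jaccard similarity (the seeded hashes hi below are an assumed construction, not the only choice):

import hashlib

def hi(word, i):
    # the i-th hash function, simulated by salting one base hash with i
    return int.from_bytes(hashlib.md5(f"{i}:{word}".encode()).digest(), "big")

def minhash_signature(S, M=128):
    # ki = min(hi(w) for w in S), for i = 1..M
    return [min(hi(w, i) for w in S) for i in range(M)]

def estimated_jaccard(kS, kT):
    # fraction of signature positions that agree
    return sum(a == b for a, b in zip(kS, kT)) / len(kS)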
[Worked example (figure): MinHash signatures computed for the sets {a,d}, {c}, {b,d,e}, {a,c,d} under several random hash functions]
LSH applications
From MinHash Signatures to Approximate Nearest Neighbors
MinHash LSH and approximate k-NN
ki(q) = min(hi(w) for w in q)
sim(x,q)
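One hedged sketch of how the signatures ki(q) can drive approximate k-NN search, using the “bands” idea the next slides refer to (band and row counts are illustrative):

from collections import defaultdict

BANDS, ROWS = 16, 8                      # signature length = BANDS * ROWS

def band_keys(sig):
    # split the signature into bands; each band becomes one bucket key
    return [(b, tuple(sig[b*ROWS:(b+1)*ROWS])) for b in range(BANDS)]

index = defaultdict(set)

def add(doc_id, sig):
    for key in band_keys(sig):
        index[key].add(doc_id)

def candidates(sig_q):
    # documents sharing at least one band with q; only these get an exact sim(x,q)
    cands = set()
    for key in band_keys(sig_q):
        cands |= index[key]
    return cands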
More on Nearest Neighbors
SimHash LSH for k-NN search
like bands
k-means for k-NN search
k-means and k-NN
Not guaranteed to find the closest neighbor
You can get guarantees from the triangle inequality if you precompute the distance from each x to its centroid and check multiple centroids
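A minimal sketch of that pruning idea, assuming points have already been assigned to centroids (all names here are illustrative, not from the slides):

import numpy as np

def nn_with_pruning(q, points, assign, centroids, dist_to_centroid):
    # triangle inequality: d(q, x) >= |d(q, c(x)) - d(x, c(x))|,
    # so x can be skipped whenever this lower bound already exceeds the best distance
    q_to_c = np.linalg.norm(centroids - q, axis=1)
    best, best_d = None, np.inf
    for i, x in enumerate(points):
        if abs(q_to_c[assign[i]] - dist_to_centroid[i]) >= best_d:
            continue
        d = np.linalg.norm(x - q)
        if d < best_d:
            best, best_d = i, d
    return best, best_d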
k-means for k-NN search
Vector quantization
Product quantization for k-NN search
like bands
Product quantization for k-NN search
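A hedged sketch of product quantization: split each vector into m subvectors, learn a small codebook per subspace, and represent a vector by m small centroid indices; distances to a query are then approximated subspace by subspace. The codebook sizes and the use of scipy's kmeans2 here are illustrative assumptions.

import numpy as np
from scipy.cluster.vq import kmeans2

def train_codebooks(X, m=4, k=16):
    # one codebook of k centroids per contiguous subspace (assumes d divisible by m)
    return [kmeans2(sub, k, minit="points")[0] for sub in np.split(X, m, axis=1)]

def encode(x, codebooks):
    # the PQ code: index of the nearest centroid in each subspace
    parts = np.split(x, len(codebooks))
    return [int(np.argmin(np.linalg.norm(cb - p, axis=1)))
            for cb, p in zip(codebooks, parts)]

def approx_sq_distance(q, code, codebooks):
    # approximate ||q - x||^2 by summing per-subspace distances to x's centroids
    parts = np.split(q, len(codebooks))
    return sum(np.sum((p - cb[c]) ** 2) for p, cb, c in zip(parts, codebooks, code))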
A Recent Survey
2024
Extension/Application: Online LSH and Pooling
2010
LSH algorithm
LSH[i] = 1 if sum(x[f] * r[i,f] for f with non-zero weight in x) > 0 else 0
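The same rule as runnable Python, with x as a sparse dict of feature weights and r as a dense table of random projection weights (data structures assumed for illustration):

def lsh_bit(x, r, i):
    # x: {feature: weight}; r[i][f]: weight of feature f in the i-th projection
    return 1 if sum(w * r[i][f] for f, w in x.items()) > 0 else 0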
LSH: “pooling” (van Durme)
LSH[i] = 1 if sum(x[f] * pool[hash(i, f) % poolSize] for f in x) > 0 else 0
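A runnable sketch of the pooled variant: instead of storing a full random weight for every (projection, feature) pair, a small shared pool of random values is indexed by a hash of (i, f). The pool size and hash function below are illustrative.

import random, zlib

poolSize = 10000
random.seed(0)
pool = [random.gauss(0.0, 1.0) for _ in range(poolSize)]

def pooled_lsh_bit(x, i):
    def r(f):
        # the "random" weight for (projection i, feature f), looked up in the shared pool
        return pool[zlib.crc32(f"{i}:{f}".encode()) % poolSize]
    return 1 if sum(w * r(f) for f, w in x.items()) > 0 else 0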
LSH: key ideas: pooling
Locality Sensitive Hashing (LSH) in an On-line Setting
Recap: using a Count-Min Sketch
x appears near y
Turney (2002) used two seed words, “excellent” and “poor”
In general, SO(w) can be written in terms of logs of products of counters for w, with and without seeds
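For example, Turney (2002)'s semantic orientation is (roughly) the following combination of such counters, all of which a Count-Min Sketch can maintain:

SO(w) = log2 [ count(w NEAR excellent) · count(poor) / ( count(w NEAR poor) · count(excellent) ) ]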
LSH: key ideas: online computation
…guards at Pentonville prison in North London discovered that an escape attempt…
An American Werewolf in London is to be remade by the son of the original director…
…UK pop up shop on Monmouth Street in London today and on Friday the brand…
v(London): Pentonville, prison, in, North, …. and, on Friday
LSH: key ideas: online computation
LSH: key ideas: online computation
v(w) is very similar to a word embedding (e.g., from word2vec or GloVe)
Levy, Omer, Yoav Goldberg, and Ido Dagan. "Improving distributional similarity with lessons learned from word embeddings." Transactions of the Association for Computational Linguistics 3 (2015): 211-225.
LSH: key ideas: online computation
v(London): Pentonville, prison, in, North, …. and, on Friday
LSH(v(London)): 0101101010111011101
LSH(v(Monmouth)): 0111101000110010111
…
v is the context vector; d is the vocabulary size
ri is the i-th random projection
hi(v) is the i-th bit of the LSH encoding
because the context vector is a sum of mention contexts,
and these come one by one as we stream through the corpus,
keep the running sum as H[w][i]
and threshold later on
hash(j, fi) indexes into the pool to give the weight of fi in rj
compress H[w][j] to bits at the end
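Putting the last two slides together, a hedged sketch of the streaming computation (pool-based weights as before; all names and sizes are illustrative):

import random, zlib
from collections import defaultdict

nbits, poolSize = 64, 10000
random.seed(0)
pool = [random.gauss(0.0, 1.0) for _ in range(poolSize)]

def r(i, f):
    # pooled random-projection weight for (bit i, context feature f)
    return pool[zlib.crc32(f"{i}:{f}".encode()) % poolSize]

H = defaultdict(lambda: [0.0] * nbits)   # running sums H[w][i]

def observe(word, context_features):
    # one streamed mention of `word`: add its context's contribution to every bit's sum
    for i in range(nbits):
        H[word][i] += sum(r(i, f) for f in context_features)

def signature(word):
    # threshold the accumulated sums into the final bit vector
    return [1 if s > 0 else 0 for s in H[word]]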
Comments
Experiment
Experiment
[Results figure: comparisons at settings of 16, 32, 64, 128, and 256]