The Neural Search Frontier
The Relevance Engineer's Deep Learning Toolbelt
About us
Doug Turnbull, CTO - OpenSource Connections (http://o19s.com); author of "Relevant Search"
Tommaso Teofili
Computer Scientist at Adobe; ASF member
Discount Code:
ctwactivate18
Relevance: challenge of 'aboutness'
[Chart: Doc 1, Doc 2, and the query 'bitcoin regulation' plotted on two axes: 'bitcoin' aboutness vs. 'regulation' aboutness.]
Doc 2 is more 'about' the query terms, so it is ranked higher
One hypothesis: TF*IDF vector similarity (and BM25, ...)
[Chart: the same plot, but the axes are now 'bitcoin' TF*IDF vs. 'regulation' TF*IDF.]
Doc 2 is more 'about' the query because it has a higher concentration of the corpus's 'bitcoin' and 'regulation' terms
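A minimal sketch of this hypothesis (scikit-learn assumed available; the documents and query are toy stand-ins, not from the talk):

```python
# Toy TF*IDF similarity: Doc 2 should score higher for "bitcoin regulation".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "bitcoin hype and more hype",                      # Doc 1
    "bitcoin regulation: regulators discuss bitcoin",  # Doc 2
]
query = ["bitcoin regulation"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform(query)

# Doc 2 concentrates more of the corpus's 'bitcoin' and 'regulation' weight.
print(cosine_similarity(query_vector, doc_vectors))
```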
Sadly 'Aboutness' is hard
[Chart: possible search quality vs. effort. Minutes of simple TF*IDF gets you part of the way; years of relevance / feature engineering work still falls short of the possible search quality. Why!?]
There's Something About Aboutness?
Neural Search Frontier
Hunt for the About October?
The Last Sharknado: It's *About* Time?
Embeddings: another vector sim
A vector storing information about the context a word appears in
“Have you heard the hype about bitcoin currency?”
“Bitcoin a currency for hackers?”
“Have you heard the hype about bitcoin currency?”
“Hacking on my Bitcoin server! It's not hyped”
Encode contexts
[0.6,-0.4] bitcoin
Docs
Easier to see as a graph
[0.5,-0.5] cryptocurrency
[0.6,-0.4] bitcoin
These two terms occur in similar contexts
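With the toy 2-d vectors above, 'similar context' shows up as a high cosine similarity (a minimal numpy sketch):

```python
# Toy 2-d embeddings from the slide; cosine similarity shows they sit close together.
import numpy as np

cryptocurrency = np.array([0.5, -0.5])
bitcoin = np.array([0.6, -0.4])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(cryptocurrency, bitcoin))  # ~0.98: similar contexts, close vectors
```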
Document embeddings
Encode full text as context for document
Encode
[0.5,-0.4] doc1234
Have you heard of cryptocurrency? It's the cool currency being worked on by hackers! Bitcoin is one prominent example. Some say it's all hype
...
Is shared 'context' the same as 'aboutness'!?
[0.5,-0.5] cryptocurrency
[0.6,-0.4] bitcoin
[0.5,-0.4] doc1234
[0.4,-0.2] doc5678
Embeddings as ranking: docs whose embeddings are closer to the 'bitcoin' query embedding are ranked higher
Sadly no :(
I want to cancel my reservation
I want to confirm my reservation
Words that occur in similar contexts can have opposite meanings!!
'Doc about canceling' not relevant for 'confirm'
Fundamentally there’s still a mismatch
Fiat currencies have a global conspiracy against dogecoin
bit dollar?
Crypto expert creates content with nerd speak
Average Citizen searches in lay-speak
Embeddings are derived from the corpus, not from our searchers' language
Embedding Modeling
Can we improve on embeddings?
Dot Product ('similarity'), then Sigmoid (forces to 0..1, a 'probability'):
'Bitcoin' embedding [5.5, 8.1, ..., 0.6] · 'Energy' embedding [6.0, 1.7, ..., 8.4] = 5.5*6.0 + 8.1*1.7 + ... + 0.6*8.4 = 102.65
Sigmoid squashes that into 0..1: prediction 0.67 (0 = out of context, 1 = true context)
Term | Other Term | In Same Context? | Model Prediction |
bitcoin | energy | 1 | 0.67 |
The model is trying to predict the 'In Same Context?' column
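A sketch of that prediction, sigmoid of the dot product (random stand-in vectors; the slide's 102.65 and 0.67 are illustrative numbers, not reproduced here):

```python
# Prediction = sigmoid(dot product of two embeddings).
import numpy as np

rng = np.random.default_rng(0)
v_bitcoin = rng.normal(scale=0.1, size=50)  # stand-in 'bitcoin' embedding
v_energy = rng.normal(scale=0.1, size=50)   # stand-in 'energy' embedding

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

prediction = sigmoid(v_bitcoin @ v_energy)  # 'probability' the pair shares a context
print(prediction)
```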
A word2vec Model
Two Term embedding tables (one for each side of the dot product), initialized with random values.
Tweak…
Term | Other Term | True Context? | Prediction |
bitcoin | energy | 1 | 0.67 |
bitcoin | dongles | 0 | 0.78 |
bitcoin | relevance | 0 | 0.56 |
bitcoin | cables | 0 | 0.34 |
To get into the weeds on the loss function & gradient descent for word2vec: Stanford CS224D Deep Learning for NLP Lecture Notes (Chaubard, Mundra, Socher)
Goal: for terms sharing context (True Context? = 1), tweak 'bitcoin' to get the prediction closer to 1.
Goal: for random unrelated terms (True Context? = 0), tweak 'bitcoin' to get the prediction closer to 0.
(showing skipgram / negative sampling method)
Maximize Me!
F = sigmoid(v_bitcoin · v_energy) - Σ over random 'false' context terms sigmoid(v_bitcoin · v_random)
Maximize 'True' Contexts; Minimize 'False' Contexts.
Each piece is just the model (sigmoid of dot product), e.g. the 'bitcoin' embedding [5.5, 8.1, ..., 0.6] dotted with the 'energy' embedding [6.0, 1.7, ..., 8.4].
Did you just neural network?
Backpropagate tweaks to weights to reduce error
[Diagram: the Word1 and Word2 embeddings feed a single Neuron (sigmoid); the dot product is implicit.]
The partial derivatives dF/dv[0], dF/dv[1], dF/dv[2], ... of our fitness function F tell us how to tweak each embedding component to maximize fit.
More complex / 'deep' neural nets can propagate the error back to earlier layers to learn their weights.
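A minimal numpy sketch of one of those tweaks: a single skipgram negative-sampling update on toy tables (vocabulary, dimensions, and learning rate are made up; see the CS224D notes referenced above for the full derivation):

```python
# One skipgram negative-sampling update; real word2vec loops this over every
# (word, context) pair in the corpus.
import numpy as np

rng = np.random.default_rng(42)
vocab = ["bitcoin", "energy", "dongles", "relevance", "cables"]
dim = 25
word_vecs = {w: rng.normal(scale=0.1, size=dim) for w in vocab}  # focus-word table
ctx_vecs = {w: rng.normal(scale=0.1, size=dim) for w in vocab}   # context-word table

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(word, true_ctx, neg_ctxs, lr=0.05):
    v = word_vecs[word]
    grad_v = np.zeros(dim)
    # True context: push prediction toward 1
    u = ctx_vecs[true_ctx]
    g = 1.0 - sigmoid(v @ u)
    grad_v += g * u
    ctx_vecs[true_ctx] = u + lr * g * v
    # Random 'false' contexts: push predictions toward 0
    for neg in neg_ctxs:
        u = ctx_vecs[neg]
        g = -sigmoid(v @ u)
        grad_v += g * u
        ctx_vecs[neg] = u + lr * g * v
    word_vecs[word] = v + lr * grad_v

sgns_update("bitcoin", "energy", ["dongles", "relevance", "cables"])
```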
Dot Product ('similarity'), then Sigmoid (forces to 0..1), now with a paragraph vector in the mix:
'Bitcoin' embedding [5.5, 8.1, ..., 0.6] and 'Energy' embedding [6.0, 1.7, ..., 8.4], weighted component-wise by the paragraph embedding [2.0, 0.5, ..., 1.4]: 5.5*6.0*2.0 + 8.1*1.7*0.5 + ... + 0.6*8.4*1.4 = 102.65
Sigmoid squashes that into 0..1: prediction 0.67 (0 = out of context, 1 = true context)
Doc2Vec, train paragraph vector w/ term vectors
A Para. embedding matrix (the source of [2.0, 0.5, ..., 1.4]) is trained alongside the Term embedding matrices; backpropagate errors into all of them.
The same negative sampling can apply here.
Term | Para | True Context? | Prediction |
bitcoin | Doc 1234 Para 1 | 1 | 0.67 |
bitcoin | Doc 5678 Para 5 | 0 | 0.78 |
bitcoin | Doc 1234 Para 8 | 0 | 0.56 |
bitcoin | Doc 1537 Para 1 | 0 | 0.34 |
I’m continuing the thread of negative sampling, but doc2vec need not use negative sampling
Goal: for a term and paragraph sharing context, tweak embeddings to get the prediction closer to 1; for random unrelated paragraphs, tweak embeddings to get the prediction closer to 0.
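A toy Doc2Vec sketch with gensim (assumed installed, 4.x API); real corpora need far more text than this:

```python
# Train paragraph vectors alongside term vectors with gensim's Doc2Vec.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words="have you heard of cryptocurrency".split(), tags=["doc1234_p1"]),
    TaggedDocument(words="bitcoin is one prominent example".split(), tags=["doc1234_p2"]),
    TaggedDocument(words="my cat sat on the mat".split(), tags=["doc5678_p1"]),
]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40, negative=5)
print(model.dv["doc1234_p1"])                       # the paragraph embedding
print(model.dv.most_similar("doc1234_p1", topn=2))  # nearest paragraphs
```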
Embedding Modeling:
[0.5,-0.4] doc1234
...For years you've been manipulating this...
Your new clay to mold...
Manipulating the construction of embeddings to measure something *domain specific*
You're *about* to see some fun embedding hacks...
Query-based embeddings?
Query Term | Other Q. Term | Same Session | Relevance Score |
bitcoin | regulation | 1 | 0.67 |
bitcoin | bananas | 0 | 0.78 |
bitcoin | headaches | 0 | 0.56 |
bitcoin | daycare | 0 | 0.34 |
Goal: for query terms sharing a session, tweak embeddings to get the prediction closer to 1; for random unrelated query terms, tweak embeddings to get the prediction closer to 0.
Embeddings from judgments?
Query Term | Doc | Relevant? | Relevance Score |
bitcoin | Doc 1234 | 1 | 0.67 |
bitcoin | Doc 5678 | 0 | 0.78 |
bitcoin | Doc 1234 | 0 | 0.56 |
bitcoin | Doc 1537 | 0 | 0.34 |
Goal: for a doc relevant to the term, tweak embeddings to get the query-doc score closer to 1; for random unrelated docs, tweak embeddings to get the prediction closer to 0.
(derived from clicks, etc)
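Sketch: the same sigmoid-of-dot-product update, but the label is a relevance judgment (or click) instead of corpus co-occurrence. Query/doc IDs, dimensions, and the learning rate are hypothetical:

```python
# Train query-term and doc embeddings directly from judgments.
import numpy as np

rng = np.random.default_rng(7)
dim = 25
query_vecs = {"bitcoin": rng.normal(scale=0.1, size=dim)}
doc_vecs = {d: rng.normal(scale=0.1, size=dim) for d in ["doc1234", "doc5678", "doc1537"]}

judgments = [  # (query term, doc, relevant?)
    ("bitcoin", "doc1234", 1),
    ("bitcoin", "doc5678", 0),
    ("bitcoin", "doc1537", 0),
]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(100):
    for q, d, rel in judgments:
        vq, vd = query_vecs[q], doc_vecs[d]
        g = rel - sigmoid(vq @ vd)        # push score toward 1 if relevant, else toward 0
        query_vecs[q] = vq + 0.05 * g * vd
        doc_vecs[d] = vd + 0.05 * g * vq

print(sigmoid(query_vecs["bitcoin"] @ doc_vecs["doc1234"]))  # should drift toward 1
```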
Pretrain word2vec with corpus/sessions?
Query Term | Doc | Corpus Model | True Relevance | TweakedScore |
bitcoin | Doc 1234 | 0 | 1 | 0.01 |
bitcoin | Doc 5678 | 1 | 0 | 0.99 |
bitcoin | Doc 1234 | 0 | 0 | 0.56 |
bitcoin | Doc 1537 | 0 | 0 | 0.34 |
It takes a lot to overcome the original, unified model, and it's a case-by-case basis
(a kind of prior)
An online / evolutionary approach to converge on improved embeddings?
Geometrically adjusted query embeddings?
Average the query vector with the vectors of some (possibly) relevant documents
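A Rocchio-flavored sketch of that averaging, using the toy vectors from earlier (the alpha/beta mixing weights are made up):

```python
# Nudge the query embedding toward the centroid of possibly-relevant docs.
import numpy as np

query_vec = np.array([0.6, -0.4])             # 'bitcoin'
relevant_doc_vecs = np.array([[0.5, -0.4],    # doc1234
                              [0.4, -0.2]])   # doc5678

alpha, beta = 0.7, 0.3                        # how much to trust the original query
adjusted = alpha * query_vec + beta * relevant_doc_vecs.mean(axis=0)
print(adjusted)
```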
Embeddings go beyond language
User | Item | Purchased? | Recommended? |
Tom | Jeans | 1 | 0.67 |
Tom | Khakis | 0 | 0.78 |
Tom | iPad | 0 | 0.56 |
Tom | Dress Shoes | 0 | 0.34 |
Tweak embeddings to get user-item score closer to 1
Tweak embeddings to get prediction closer to 0
(use to find similar items / users)
More features beyond just 'context'
Query Term | Doc | Sentiment | Same Context | TweakedScore |
cancel | Doc 1234 | Angry | 1 | 0.01 |
confirm | Doc 5678 | Angry | 0 | 0.99 |
cancel | Doc 1234 | Happy | 0 | 0.56 |
confirm | Doc 1537 | Happy | 0 | 0.34 |
See: https://arxiv.org/abs/1805.07966
Evolutionary/contextual bandit embeddings?
Query Term | Doc | Corpus Model | True Relevance | TweakedScore |
bitcoin | Doc 1234 | 0 | 1 | 0.01 |
bitcoin | Doc 5678 | 1 | 0 | 0.99 |
bitcoin | Doc 1234 | 0 | 0 | 0.56 |
bitcoin | Doc 1537 | 0 | 0 | 0.34 |
Pull the embeddings in the direction your KPIs suggest is successful
(a kind of prior)
An online / evolutionary approach to converge on improved embeddings?
Embedding Frontier...
There's a lot we could do
Neural Language Models
Translators grok 'aboutness'
Serĉa konferenco
What is in the head of this person that 'groks' Esperanto?
'a search conference?'
Language Model
Can seeing text, and guessing what would "come next" help us on our quest for 'aboutness'?
“eat __?_____”
cat: 0.001
...
chair: 0.001
pizza: 0.02  <- highest probability
...
nap: 0.001
Obvious applications: autosuggest, etc
Markov language model, vocab size V
| pizza | chair | nap | ... |
eat | 0.02 | 0.0002 | 0.0001 | ... |
cat | 0.0001 | 0.0001 | 0.02 | ... |
... | ... | ... | ... | |
Probability of ‘pizza’ following ‘eat’ (as in “eat pizza”)
Rows and columns each range over the V words of the vocabulary.
From the corpus we can build this transition matrix, reflecting the frequency of word transitions.
Transition Matrix Math
| pizza | chair | nap | ... |
eat | 0.02 | 0.0002 | 0.0001 | ... |
cat | 0.0001 | 0.0001 | 0.02 | ... |
... | ... | ... | ... | |
Let's say we don't know the 'current' word with 100% certainty. Can we still guess the 'next' word?
Pcurr x (transition matrix) = Pnext
Pcurr: P(eat) = 0.25, P(cat) = 0.40, ...
Pnext: P('pizza' next) = 0.25*0.02 + 0.40*0.0001 + ..., and likewise P(chair), P(nap), ... across all V words
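The same math in numpy (the matrix is a toy fragment, with unspecified entries left at zero):

```python
# Belief over the current word times the V x V transition matrix = distribution over the next word.
import numpy as np

vocab = ["eat", "cat", "pizza", "chair", "nap"]
T = np.array([  # rows: current word, columns: next word (toy probabilities)
    [0.00, 0.00, 0.02,   0.0002, 0.0001],  # eat ->
    [0.00, 0.00, 0.0001, 0.0001, 0.02],    # cat ->
    [0.00, 0.00, 0.00,   0.00,   0.00],
    [0.00, 0.00, 0.00,   0.00,   0.00],
    [0.00, 0.00, 0.00,   0.00,   0.00],
])
p_curr = np.array([0.25, 0.40, 0.0, 0.0, 0.0])  # not 100% sure of the current word

p_next = p_curr @ T
print(dict(zip(vocab, p_next)))  # P('pizza' next) = 0.25*0.02 + 0.40*0.0001 + ...
```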
Language model for embeddings?
The 'eat' embedding [5.5, 8.1, ..., 0.6] (a row of the term embedding matrix, h dimensions) multiplied by an h x h transition matrix (rows like [0.2, -0.4, 0.1, ...], [-0.1, 0.3, -0.24, ...], ...) gives the embedding of the next word, e.g. [0.5, 6.1, ..., -4.8] (which probably clusters near 'pizza', etc.).
A transition matrix whose weights are learned through backpropagation.
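As a sketch (random stand-in weights; in practice they are learned through backpropagation):

```python
# Predict the next word's embedding by multiplying the current word's embedding
# by an h x h transition matrix.
import numpy as np

rng = np.random.default_rng(3)
h = 50
W = rng.normal(scale=0.1, size=(h, h))   # transition matrix (stand-in weights)
eat = rng.normal(size=h)                 # 'eat' embedding (stand-in)

next_word_embedding = W @ eat            # hopefully lands near 'pizza', 'dust', ...
```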
Accuracy requires context...
'The race is on! ... eat ___' should continue with 'dust!'
But a plain Word => Next Word model (the 'embedding' Markov model from earlier) predicts 'pizza!'
What we want: Context + new word => New Context & next word
Old Context + new word => New Context, spelled out:
(Input -> New Context 'transition' matrix) x the 'eat' embedding [5.5, 8.1, ...] = an input contribution
(Prev Context -> New Context transition matrix, e.g. [0.5, 1.6, ...; 6.1, -4.8, ...; ...]) x the old context vector = a context contribution
Input contribution + context contribution = the New Context
(Context -> Output transition matrix) x New Context = the Output Embedding, the 'embedding' for the next word ('pizza'). Next...
Simpler view
[Diagram: an unrolled network reading 'race', 'is', 'on', ..., 'eat', one embedding at a time. Each input embedding enters through W_xh, the hidden state carries forward through W_hh, and W_hy maps the final hidden state to the predicted next word (or its embedding): 'dust'.]
Weights W_xh, W_hh, W_hy are learned through backpropagation (trained on true sentences in our corpus)
Psssst… this is a Recurrent Neural Network (RNN)!
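A bare-bones numpy sketch of one step of that RNN (random stand-in weights and embeddings; real training would backpropagate through time):

```python
# New context mixes the input embedding (W_xh) with the previous context (W_hh);
# W_hy maps context to a predicted next-word embedding.
import numpy as np

rng = np.random.default_rng(5)
e, h = 50, 64                                  # embedding size, hidden size
W_xh = rng.normal(scale=0.1, size=(h, e))
W_hh = rng.normal(scale=0.1, size=(h, h))
W_hy = rng.normal(scale=0.1, size=(e, h))

def rnn_step(x_emb, h_prev):
    h_new = np.tanh(W_xh @ x_emb + W_hh @ h_prev)   # new context
    y = W_hy @ h_new                                # predicted next-word embedding
    return h_new, y

# Scan a sentence ("the race is on ... eat"); the final prediction should lean toward "dust".
h_state = np.zeros(h)
for word_embedding in [rng.normal(size=e) for _ in range(5)]:  # stand-in embeddings
    h_state, y_pred = rnn_step(word_embedding, h_state)
```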
For search...
[Diagram: the same unrolled RNN (W_xh, W_hh, W_hy), now scanning a document word by word.]
Inject other contextually likely terms:
At any given point scanning our document, we can get a probability distribution of likely terms
A dog is loose! Please call the
animal
control
dog
catcher
phone
yodel
walk to
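A sketch of using that distribution for document expansion. `next_term_probs` is a hypothetical language-model interface, not a real library call:

```python
# While scanning a document with a language model, collect high-probability
# next terms at each position and index them in a separate "expansion" field.
def expand_document(tokens, next_term_probs, top_k=3, threshold=0.01):
    """next_term_probs(prefix_tokens) -> dict of {term: probability} (hypothetical)."""
    expansion = set()
    for i in range(1, len(tokens)):
        probs = next_term_probs(tokens[:i])
        likely = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        expansion.update(term for term, p in likely if p >= threshold)
    return expansion  # e.g. {"animal", "control", "dog", "catcher", ...}
```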
Not a silver bullet
[Diagram: the same unrolled RNN (W_xh, W_hh, W_hy) again.]
A dog is loose! Please call the
animal control (corpus language) vs. dog catcher (searcher language)
The model won't learn this ideal language model if the corpus never says 'dog catcher'
One Vector Space To Rule Them All: Seq2Seq
Translation: RNN encoder/decoder
[Diagram: an Encoder RNN reads the source sentence ('... race ... is ... on ...') word by word through W_xh and W_hh, ending in a final context vector. A Decoder RNN is seeded with that context and a <START> token and emits the translation one word at a time through W_hy: 'la', 'carrera', 'es', ... Backprop tweaks all the weights.]
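A skeletal encoder/decoder in PyTorch (assumed available); vocab sizes, dimensions, and the greedy decode loop are placeholders, and a real system would add teacher forcing, batching, and attention:

```python
# Minimal seq2seq: encode the source into a context vector, decode a target sequence.
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID, START = 10_000, 10_000, 128, 256, 1

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, EMB)
        self.tgt_emb = nn.Embedding(TGT_VOCAB, EMB)
        self.encoder = nn.GRU(EMB, HID, batch_first=True)   # W_xh / W_hh rolled together
        self.decoder = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)                 # the W_hy of the diagram

    def forward(self, src_ids, max_len=20):
        _, context = self.encoder(self.src_emb(src_ids))     # final hidden state = context
        token = torch.full((src_ids.size(0), 1), START, dtype=torch.long)
        outputs = []
        for _ in range(max_len):                              # greedy decode, one word at a time
            dec_out, context = self.decoder(self.tgt_emb(token), context)
            logits = self.out(dec_out[:, -1])
            token = logits.argmax(dim=-1, keepdim=True)
            outputs.append(token)
        return torch.cat(outputs, dim=1)

model = Seq2Seq()
print(model(torch.randint(0, SRC_VOCAB, (1, 6))))  # e.g. "the race is on ..." -> "la carrera es ..."
```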
'Translate' documents to queries?
[Diagram: an Encoder RNN reads the document ('Animal Control Law: Herefore let it be … And therefore… with much to be blah blah blah … The End'); a Decoder RNN emits Predicted Queries.]
Doc | Query | Relevant? |
Animal Control Law | Dog catcher | 1 |
Littering Law | Dog Catcher | 0 |
Using judgments/clicks as training data:
Can we use graded judgments?
[Diagram: the same Encoder RNN / Decoder RNN over the 'Animal Control Law' document, again emitting Predicted Queries.]
Doc | Query | Relevant? |
Animal Control Law | Dog catcher | 4 |
Animal Leash Law | Dog Catcher | 2 |
Construct training data to reflect weighting
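One simple way to construct that weighting: repeat (doc, query) training pairs in proportion to their grade (this repetition scheme is an assumption, not the only option):

```python
# Higher-graded judgments become more training pairs, so they pull the model harder.
judgments = [
    ("Animal Control Law", "dog catcher", 4),
    ("Animal Leash Law", "dog catcher", 2),
]

training_pairs = [(doc, query) for doc, query, grade in judgments for _ in range(grade)]
# [('Animal Control Law', 'dog catcher') x4, ('Animal Leash Law', 'dog catcher') x2]
```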
Skip-Thought Vectors (a kind of ‘sentence2vec’)
Encoder RNN
Decoder RNN
(sentence before)
Decoder RNN
(sentence after)
Its fleece was white as snow
The encoder output becomes an embedding for the sentence, encoding its semantic meaning
Mary had a little lamb
And everywhere that Mary went, that lamb was sure to go
Skip-Thought Vectors for queries? (‘doc2queryVec’)
Encoder RNN
Decoder RNN
(relevant query)
Decoder RNN
(relevant query)
Becomes an embedding mapping docs into the query’s semantic space
[Diagram: the 'Animal Control Law' document ('Herefore let it be … blah blah blah … The End') is encoded; Decoder RNNs, one per relevant query, reproduce q=dog catcher, q=loose dog, q=dog fight, ...]
The Frontier
Anything2vec
Deep learning is especially good at learning representations of... anything.
If everything can be 'tokenized' into some kind of token space for retrieval, then everything can be 'vectorized' into some embedding space for retrieval / ranking.
The Lucene community needs better first-class vector support
Similarity API and vectors?
long computeNorm(FieldInvertState state)
SimWeight computeWeight(float boost, CollectionStatistics collectionStats, TermStatistics... termStats)
SimScorer simScorer(SimWeight weight, LeafReaderContext context)
Some progress
Check out Simon Hughes' talk: "Vectors in Search"
It’s not magic, it’s math!
Join the Search Relevancy Slack Community
(projects, chat, conferences, help, book authors, and more…)
Matching: vocab mismatch!
"Taxonomical Semantical Magical Search", Doug Turnbull, Lucene/Solr Revolution 2017
"Animal Control Law"
Dog catcher
Dog
Catcher
Animal Control
Concept 1234
Taxonomist / Librarian
Costly to manually maintain mapping between searcher & corpus vocabulary
No Match, despite relevant results!
legalese (the corpus) vs. lay-speak (the searcher)
Animal Control Law: 'Herefore let it be … And therefore… with much to be blah blah blah … The End'
Ranking optimization hard!
Dog catcher
Score doc
Ranked Results
LTR is only as good as your features: Solr/ES queries, which are only as good as your skill with Solr/ES queries
LTR?
Ranking can be a hard optimization problem:
Fine tuning heuristics: TF*IDF, Solr/ES queries, analyzers, etc...
How can deep learning help?
"Taxonomical Semantical Magical Search", Doug Turnbull, Lucene/Solr Revolution 2017
Can we learn, quickly at scale, the semantic relationships between concepts?
What other forms of every-day enrichment could deep-learning accelerate?
Dog catcher
Animal control
Sessions as documents
Embeddings built for just query terms
Dog catcher
Dog bit me
Lost dog
word2vec / doc2vec over the session 'documents' -> query-term embeddings like [5.5, 8.1, ..., 0.6]
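A sketch with gensim (assumed installed): treat each session's queries as one 'document' and train word2vec over them; the sessions here are toy:

```python
# Embeddings built for just query terms, from search sessions.
from gensim.models import Word2Vec

sessions = [
    ["dog", "catcher", "animal", "control"],   # one user's session, tokenized queries
    ["lost", "dog", "dog", "bit", "me"],
    ["animal", "control", "phone", "number"],
]

model = Word2Vec(sessions, vector_size=50, window=5, min_count=1, sg=1, negative=5)
print(model.wv.most_similar("dog", topn=3))
```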
Can we translate between searcher & corpus?
Search term embedding [5.5, 8.1, ..., 0.6] + Doc embedding [2.0, 0.5, ..., 1.4] -> [Insert Machine Learning Here] -> Ranking Prediction
(classic LTR could go here)
Embeddings
Words with similar context share close embeddings
Cryptocurrency is not hyped! It's the future
Hackers invent their own cryptocurrency
Cryptocurrency is a digital currency
Hyped cryptocurrency is for hackers!
Encode contexts
[0.5,-0.5] cryptocurrency
[0.6,-0.4] bitcoin