1 of 57

Bringing the neural search paradigm shift to production

Jo Kristian Bergum (twitter.com/jobergum)


2 of 57

Agenda

  1. The neural search paradigm shift - what do I mean by that?
  2. Transformers/BERT and how NOT to use BERT for search ranking
  3. Introduction to Vespa and Vespa @ Yahoo
  4. Representing text - sparse and dense and retrieval
  5. Ranking framework in Vespa
  6. Vespa sample app - MS Marco passage ranking


3 of 57

The neural search paradigm shift

BERT (Bidirectional Encoder Representations from Transformers) was released by Google in late 2018.

A large gain in effectiveness kicked off the “BERT revolution”.

Observed on MS Marco, the largest data-intensive relevance collection.

Metric: RR (Reciprocal Rank) = 1 / (position of the first relevant hit)
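As a quick illustration (not from the talk), a minimal Python sketch of computing RR per query and its mean over queries:

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """RR = 1 / position of the first relevant hit (0 if no relevant hit is returned)."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(rankings):
    """MRR: the average reciprocal rank over all queries, as reported on the MS Marco leaderboard."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in rankings) / len(rankings)

# First relevant passage at position 3 -> RR = 1/3
print(reciprocal_rank(["d7", "d2", "d5"], {"d5"}))
```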


4 of 57

The neural search paradigm shift

  • Neural paradigm shift confirmed in TREC Deep Learning Ranking in both 2019 and 2020.
  • TREC uses deep graded relevance judgements (highly relevant, relevant, ...)
  • Neural rankers built on pre-trained language models significantly better on NDCG@10 than traditional sparse methods
  • TREC/Leaderboards - No runtime constraints
  • Ranking runs might take days


5 of 57

Paradigm shifts require new skills and tools


6 of 57

BERT

Demystified: A deep neural network architecture with a text tokenizer and a fixed vocabulary.

Pre-trained neural network weights from language modelling (self-supervised training on large amounts of text)

BERT-base uncased

(12 layers, hidden size 768)

110M tunable parameters

(Figure: the tokenized input “[CLS] what is love [SEP] love could be a song by hadaway” is fed through the transformer layers, which produce an output vector for every token, including [CLS].)


7 of 57

Pre-trained language model to task specific model

  • Weights off the shelf from language modelling (self-supervised learning), e.g. masked language modelling
  • Needs domain- and task-specific supervised learning
  • Needs supervised data to shine on the task at hand

(Figure: Language Model with pre-trained weights + labeled data → task-specific supervised training → Task-Specific Model with task-specific tuned weights.)


8 of 57

Applying BERT for ranking and retrieval needs task-specific training

Needs supervision to shine (query, relevant doc, irrelevant doc)

  • BERT as a classification model: train a classifier with all-to-all interaction between query and document terms
  • BERT as a representation model: train a joint representation of queries and documents in an embedding space, reducing the query-document interaction to a vector similarity

The magic is in the fine-tuning; there is a lot of development on representation models.
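For concreteness, a hedged sketch of the classification (cross-encoder) flavour using Hugging Face Transformers; the checkpoint name is an illustrative fine-tuned MS Marco model, not one prescribed by this talk:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Cross-encoder style: query and passage are concatenated into one input so
# every query term can attend to every document term; the classifier head
# outputs a single relevance score. Checkpoint name is an example.
model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

query = "when was the last time anyone was on the moon"
passage = "A total of twelve men have landed on the Moon ..."

inputs = tokenizer(query, passage, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()  # higher = more relevant
print(score)
```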


9 of 57

Dense representation learning - State of the art retrievers

(Figure: the query “[CLS] when was the last time anyone was on the moon” goes through a Query Embedder (Transformer model), and the document “[CLS] A total of twelve men have landed on the Moon ...” goes through a Document Embedder (Transformer model); the two embeddings are compared with a distance metric.)
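A corresponding sketch of the representation (bi-encoder) flavour with the sentence-transformers library; again, the model name is just an example MS Marco-trained checkpoint:

```python
from sentence_transformers import SentenceTransformer, util

# Bi-encoder style: query and document are embedded independently and the
# interaction is reduced to a vector similarity. Example checkpoint only.
model = SentenceTransformer("sentence-transformers/msmarco-MiniLM-L-6-cos-v5")

query_embedding = model.encode("when was the last time anyone was on the moon")
doc_embedding = model.encode("A total of twelve men have landed on the Moon ...")

print(util.cos_sim(query_embedding, doc_embedding))  # distance metric: cosine similarity
```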


10 of 57

How NOT to use BERT for text ranking


11 of 57

How NOT to use BERT for text ranking

Using the BERT CLS vector representation without fine-tuning is about 10x worse than BM25 on the MS Marco Passage Ranking leaderboard


12 of 57

Introduction to Vespa.ai


13 of 57

Vespa.ai

An open-source platform for low latency computations over large, evolving data

Apache 2.0 Licensed

https://github.com/vespa-engine/vespa

@vespaengine

  • Search, filter and rank structured and unstructured data
  • Dense and sparse representations
  • Scalable in any dimension
  • Multiphase retrieval & ranking
    • Dense HNSW - nearest neighbor search
    • Sparse WAND
    • Hybrid combinations
  • Tensors and ML are first class citizens
  • Real-time Indexing and true partial updates
  • Elastic content scalability (no pre-sharding)


14 of 57

Quick history of Vespa Serving Engine

1998 - Fast Search & Transfer web search division, alltheweb.com

2003 - Yahoo buys Inktomi, Overture, AltaVista and the web search division of Fast Search & Transfer

2004 - Vespa is born: vertical search platform in Yahoo

2010 - Vespa 5.x: new real-time indexing

2017 - Vespa open sourced under the Apache 2.0 license; all development in the open at github.com/vespa-engine

2021 - Vespa Cloud: commercial offering with free trials, 4 production zones


15 of 57

Vespa @ Yahoo

Serves about 45B real-time queries per day across 10+ production zones globally, run as software as a service (SaaS)

150+ different applications (search, recommendation, ads)

  • Gemini Native Ads, Gemini Product Ads, Yahoo News, Yahoo Finance, Yahoo Shopping, Yahoo Search

Check blog.vespa.ai for some use cases @ Yahoo in detail.

Also available as a cloud service at cloud.vespa.ai


16 of 57

Vespa Overview

(Architecture diagram, summarized:)

  • Vespa stateless container cluster (Java runtime): HTTP API (mTLS, HTTP/2), query functions, document functions, custom components/linguistics, model inference, scatter & gather
  • Vespa stateful content cluster (C++ runtime): content and index management (index, db, hnsw), retrieval, ranking, model inference
  • Vespa configuration system: configured by deploying an application package with schemas, models and code
  • Tooling and distribution: vespa-cli; x86 RPM (CentOS 7), container image (vespaengine/vespa), Vespa Cloud


17 of 57

Vespa Stateful Content Cluster

(Figure: the Vespa stateless container cluster connects over TLS to a cluster of content nodes (Content Node 0 ... Content Node 4). Each content node holds a document store, inverted index, memory index, HNSW index and tensor store, and runs retrieval, ranking and re-ranking with its models locally.)


18 of 57

Retrieval and text representations



20 of 57

Representing text (queries and documents)

  • Sparse
    • High dimensional vector space with |Vocabulary| dimensions, most dimensions 0
  • Dense
    • Relatively high dimensional vector space |D|
  • Multi-Dense
    • Query and document represented by multiple dense representations, one per token (see the MaxSim sketch after this list)
    • Where is the love => [[0.1,0.2], [0.01,0.005],[0.02,0.003],[0.1,0.3]]
  • Hybrid Sparse - Dense
    • Combination of sparse and dense representations
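A toy numpy sketch of the multi-dense (late interaction, ColBERT-style) similarity, with made-up dimensions and document values:

```python
import numpy as np

# Multi-dense: one small embedding per query token and per document token.
# MaxSim scoring: for each query token, take the best-matching document
# token, then sum over query tokens. All numbers are illustrative.
query_tokens = np.array([[0.1, 0.2], [0.01, 0.005], [0.02, 0.003], [0.1, 0.3]])  # "Where is the love"
doc_tokens = np.random.rand(12, 2)                # 12 document tokens, 2 dims each

token_similarities = query_tokens @ doc_tokens.T  # (4 query tokens) x (12 doc tokens)
maxsim_score = token_similarities.max(axis=1).sum()
print(maxsim_score)
```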


21 of 57

Sparse text representation

D1: “... men landed on the moon ...” D2: “... the moon is made of cheese ...moon”

Sparse representation (term counts over a |V|-dimensional vocabulary, most entries 0):

               D1   D2
    ...
    cheese          1
    is              1
    land       1
    made            1
    men        1
    moon       1    2
    of              1
    on         1
    the        1    1
    ...


22 of 57

Accelerated retrieval with sparse text representation

Q: “When was the last time anyone was on the moon”

Inverted index posting lists:

    when   → D4, D6
    was    → D1, D4, D7
    the    → D1, D2, D3, D4, D5, D6, D7, ...
    last   → D2
    time   → D2, D3
    anyone → D2, D4, D7
    on     → D1, D4, D5, D7
    moon   → D1, D2

    WAND(...) → candidates: D1, D2, D4

Dynamic pruning algorithms avoid scoring all documents that contain any of the query terms.

Efficient Query Evaluation using a Two-Level Retrieval Process, Broder et al., CIKM 2003
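A toy Python inverted index over the earlier example documents. This is only the data structure, not Vespa's implementation and not WAND itself, which additionally keeps per-term score upper bounds so it can skip documents entirely:

```python
from collections import defaultdict

# Build posting lists: term -> set of document ids containing that term.
docs = {
    "D1": "men landed on the moon",
    "D2": "the moon is made of cheese",
}
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

query = "when was the last time anyone was on the moon".split()

# Union of posting lists = every document matching at least one query term.
# WAND/MaxScore-style pruning exists to avoid fully scoring all of these.
candidates = set().union(*(index[t] for t in query if t in index))
print(candidates)  # {'D1', 'D2'}
```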


23 of 57

A platform without WAND* is not a search engine

* Dynamic sparse pruning algorithms in general (Block-Max WAND, MaxScore and variants). It is costly to brute-force score every document that matches at least one of the query terms (think “to be or not to be”).


24 of 57

Dense text representations - Representation Learning

Sparse representation: land:1, men:1, moon:1, on:1, the:1 in a |V|-dimensional space, all other dimensions 0

    | Vectorize/Embed/Transform
    v

Dense representation: [0.12, 0.77, 0.92, 0.04, 0.33, 0.09]


25 of 57

Retrieval with dense text representation

Q: “When was the last time anyone was on the moon” → [0.1, 0.7, 0.9, 0.2, 0.3, 0.1]
D: “A total of twelve men have landed on the Moon ...” → [0.2, 0.6, 0.8, 0.3, 0.2, 0.1]

Distance/Similarity between the two vectors:

    Dot product  1.290
    Euclidean    0.099
    Cosine       0.986
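A short numpy sketch of the three distance/similarity computations over the example vectors above:

```python
import numpy as np

# Example query and document embeddings from the slide above.
q = np.array([0.1, 0.7, 0.9, 0.2, 0.3, 0.1])
d = np.array([0.2, 0.6, 0.8, 0.3, 0.2, 0.1])

dot = float(q @ d)                                      # inner product, ~1.29
euclidean = float(np.linalg.norm(q - d))                # L2 distance
cosine = dot / (np.linalg.norm(q) * np.linalg.norm(d))  # angular similarity, ~0.986

print(dot, euclidean, cosine)
```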


26 of 57

Accelerated retrieval with dense text representation

Exact nearest neighbor search is linear in the number of documents, so achieving servable low latency is costly for large document collections

Approximate nearest neighbor search is sub-linear, but comes with an accuracy loss

There are many ANN methods, and some are worse than others (low overlap between the exact and the approximate result sets)

Billion-scale vector datasets can be served on a single node in < 5 ms single-threaded with high recall@10 (> 0.9)
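A standalone ANN example with the hnswlib library (Vespa ships its own real-time HNSW implementation; this only illustrates the recall/latency trade-off an HNSW index gives you):

```python
import numpy as np
import hnswlib

dim, n = 384, 100_000
vectors = np.random.rand(n, dim).astype(np.float32)

index = hnswlib.Index(space="ip", dim=dim)        # inner-product distance
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(vectors, np.arange(n))
index.set_ef(50)                                  # higher ef -> better recall, more work per query

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=10)  # approximate top-10 neighbors
```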


27 of 57

A platform without (A)NN* is not a search engine

Dense representations with accurate approximate nearest neighbor search are state-of-the-art first-phase retrievers in data-rich domains with training data, whether implicit from user interactions or explicit from editorial judgements


28 of 57

Supervised text representations

  • Representation learning possible for both sparse and dense representations
  • Given training triplets (query, relevant document, irrelevant document)
    • Learn sparse term weights for sparse retrieval
    • Learn document and query representation for dense retrieval (bi-encoders built on transformer models)

Note: Sparse BM25 is unsupervised, with just two hyperparameters (k1 and b)

It is unfair to compare it with learned representations (sparse or dense), but it is usually a good baseline when starting without training data.


29 of 57

(Figure: supervised representations: in-domain 👍, out-of-domain 👎.)


30 of 57

A platform without BM25* is not a search engine

BM25/sparse text ranking is hard to beat in the low-data regime, before you have gathered training data from online interactions
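A minimal unsupervised BM25 baseline with the rank_bm25 package (illustration only; in Vespa the comparable scoring is the built-in bm25(field) rank feature):

```python
from rank_bm25 import BM25Okapi

# BM25 needs no training data, only the two hyperparameters k1 and b.
corpus = [
    "a total of twelve men have landed on the moon",
    "the moon is made of cheese",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])  # defaults: k1=1.5, b=0.75

query = "when was the last time anyone was on the moon".split()
print(bm25.get_scores(query))  # one BM25 score per document
```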


31 of 57

Where dense representations shine - multimodal representations

Our cat, Truls, says “Winter is coming”. Let us stay indoors.

    [0.11, 0.14, 0.38, ...]
    [0.27, 0.79, 0.98, ...]
    [page_rank, global_ctr, page_quality, image_quality] = [0.89, 0.0014, 0.52, 0.0000045]


32 of 57

Vespa Ranking Framework

Retrieval and Ranking


33 of 57

Retrieval and Ranking Conceptual

Effective candidate retrievers: accelerated, with a limited choice of scoring function

First-phase, second-phase and N-phase ranking: any scoring function (features, models)

Funnel: Billions → Millions → Thousands → 100s


34 of 57

Physical retrieval versus logical ranking

Query formulation is “the physical retriever”.

Documents retrieved by query formulation are exposed to configurable ranking

  • Sparse weakAnd & wand (sub-linear dynamic scoring algorithms)
  • Dense using nearest neighbor search (exact or approximate with HNSW)
  • Hybrid combining the above with logical disjunction
  • Combine with real-world filters and constraints

Ranking is expressed using Vespa ranking expressions in rank profiles; the rank profile is chosen at run time (e.g. after query classification in the stateless layer)


35 of 57

Efficient sparse retriever implementations in Vespa

  • weakAnd query operator
    • Uses text TF-IDF to dynamically prune result list
    • Integrates with linguistic processing/stemming
    • Vespa calculates the document and corpus statistics (tf/idf) for you
  • wand query operator
    • Uses user provided query and document weights (integer weights)
    • No linguistic integration. Vocabulary controlled by user
      • q={123:1, 134:2}, d={123:2, 134:3, 145:4}
    • Maximum inner dot product score (safe)
    • Support both text/long/int
    • Great for learned sparse representations (E.g. using BERT subword vocabulary ids)
    • Multi-threaded per search (Latency control)


36 of 57

Efficient dense retriever implementation in Vespa

nearestNeighbor query operator

  • Input query vector (a single, order-one n-dimensional tensor)
  • Document tensor field
  • Target hits, approximate true/false

Distance metrics: L2/euclidean, angular, hamming

Tensor cell precision: float (4 bytes), bfloat16 (2 bytes), int8 (1 byte)

Vespa ANN is based on HNSW (Hierarchical Navigable Small World) graphs:

  • Real-time insert, delete, update
  • HNSW settings controlled in the schema (accuracy versus performance)
  • Supports multiple HNSW indexes per document
  • Switching between exact and approximate search allows measuring the accuracy drop
  • Multi-threaded exact search


37 of 57

Vespa Query Language (YQL)

{
  "yql": "select id, text from passages where ({targetHits:10}nearestNeighbor(doc_embedding, query_embedding)) or ({targetHits:10}weakAnd(userQuery()))",
  "query": "what was the manhattan project?",
  "ranking.features.query(query_embedding)": [0.1, 0.2, 0.2],
  "ranking.profile": "my-ranking-profile"
}

Combine sparse and dense representations in the same query
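A sketch of issuing the hybrid query above from Python against a local Vespa instance; the endpoint, schema and field names follow the example and are deployment-specific:

```python
import requests

body = {
    "yql": "select id, text from passages where "
           "({targetHits:10}nearestNeighbor(doc_embedding, query_embedding)) or "
           "({targetHits:10}weakAnd(userQuery()))",
    "query": "what was the manhattan project?",
    "ranking.features.query(query_embedding)": [0.1, 0.2, 0.2],  # must match the document tensor type
    "ranking.profile": "my-ranking-profile",
}
response = requests.post("http://localhost:8080/search/", json=body)
for hit in response.json()["root"].get("children", []):
    print(hit["relevance"], hit["fields"].get("text"))
```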


38 of 57

Real-world search adds constraints and diversity

select id, text from passages where
    ( {targetHits:10}nearestNeighbor(doc_embedding, query_embedding) or
      {targetHits:10}weakAnd(userQuery()) )
    and visible = true and category contains "sports"
    and quality > 0.1 and market contains "us"
  limit 0
  | all(group(category) order(-max(relevance())) max(10) each(max(2) each(output(summary()))))


39 of 57

Ranking with Vespa

Photo by Roman Mager on Unsplash


40 of 57

Ranking with Vespa

Documents matching the query formulation are exposed to configurable ranking

Configured in the document schema(s) and compiled (LLVM) in the C++ runtime

rank-profile my-ranking-profile {
    first-phase {
        expression: bm25(text) + attribute(doc_ctr) + sum(query(query_categories) * attribute(document_categories))
    }
}


41 of 57

Ranking with Vespa

Rich set of built-in text matching rank features

  • bm25(field), nativeFieldMatch(field), nativeProximity(field)

No need to change query formulation to express changes in ranking

Vespa rank features

Easy to express multi-phased ranking and ensemble models


42 of 57

Ranking - Tensors

Tensors are first class citizens in Vespa

Tensors in queries, documents and global documents

field embedding type tensor<int8>(x[128])

query(query_embedding) type tensor<int8>(x[128])

query(qt_colbert) type tensor<float>(qt[32], x[128])

Tensor feature store - access document tensors and global-document tensors in ranking

Tensor compute engine - sum(query(qt)*attribute(dt)) - expressions are recognized and hardware-accelerated when possible


43 of 57

Ranking - Models from your favorite framework

Ranking Model integrations

ONNX (Open Neural Network Exchange) - Interoperability

GBDT family (XGBoost/LightGBM) - classic LTR

PyTorch, TensorFlow, scikit-learn, ... => export to ONNX

Ability to run model inference in both the stateless layer and the stateful content layer

Vespa integrates with ONNX-Runtime for accelerated inference of ONNX models (both stateless and stateful)
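A hedged sketch of exporting a Hugging Face ranking model to ONNX with torch.onnx.export; the checkpoint and file names are illustrative, and the input/output names are what you would then reference from the onnx-model section of the schema:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "cross-encoder/ms-marco-MiniLM-L-6-v2"   # example checkpoint
model = AutoModelForSequenceClassification.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)
model.eval()

dummy = tokenizer("what is love", "love could be a song by hadaway", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"], dummy["token_type_ids"]),
    "ranker.onnx",                              # ship this file in the application package
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["logits"],
    dynamic_axes={key: {0: "batch", 1: "sequence"}
                  for key in ["input_ids", "attention_mask", "token_type_ids"]},
    opset_version=14,
)
```

Hugging Face's transformers/Optimum ONNX exporters are an alternative route to the same file.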


44 of 57

No model to rule them all


45 of 57

Easy to deploy multiple models applied to same query

Stateless query intent classification

Hybrid retrieval (dense + sparse)

GBDT (Classic LTR)

Neural methods

In the same query, in the same engine

rank-profile ensemble-phased {

    function intent_score() {
        expression: sum(query(cat) * attribute(categories))
    }

    function neural() {
        expression: onnx(minilmranker){d0:0, d1:0}
    }

    first-phase {
        expression: lightgbm("light-gbdt-v3.json")
    }

    second-phase {
        expression: firstPhase + intent_score() + neural()
    }
}


46 of 57

Vespa replaces all of this

(Figure: a typical do-it-yourself stack - a query flows through search ranking middleware into an inverted index library, a vector search library, embedding data storage, document storage, feature storage, a GBDT inference service, a transformer inference service and a snippet service before ranked docs come back. #MLOps #DevOps #pagerduty #failuremodes)


47 of 57

Ranking with Transformers with Vespa

Photo by Aditya Vyas on Unsplash


48 of 57

Scaling Transformer Inference and Ranking

Scaling Transformer for production workloads

  • Latency SLA
  • Cost of serving the model with throughput targets

Getting there? On CPU?

  • Model distillation
  • Quantization
  • Graph optimization
  • Model input sequence length ( |query| <<< |document| )
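For the quantization bullet, a sketch of post-training dynamic int8 quantization with ONNX Runtime (file names are placeholders):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Store the transformer weights as int8: a smaller model and faster CPU
# inference at a small accuracy cost - the kind of win behind the int8
# numbers on a later slide.
quantize_dynamic(
    model_input="ranker.onnx",
    model_output="ranker-int8.onnx",
    weight_type=QuantType.QInt8,
)
```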


49 of 57

The best way to scale BERT is not to use BERT!

Don’t default to BERT-Base for your production retrieval and ranking pipelines


50 of 57

Transformer models versus accuracy & serving cost

https://huggingface.co/microsoft/xtremedistil-l6-h384-uncased

Same accuracy, but 9x reduced cost. With int8 weights: 27x reduced cost.

Your go-to model?


51 of 57

Vespa MS Marco Passage Ranking


52 of 57

Vespa App Package

  • text - the passage text
  • text_token_ids - BERT subword ids
  • embedding - dense single-vector representation
  • dt - multi-dense representation: each token in the passage has a contextual dense embedding

schema passage {

    document passage {

        field text type string {}

        field text_token_ids type tensor<float>(d0[128]) {}

        field embedding type tensor<float>(x[384]) {
            index {
                hnsw { .. }
            }
        }

        field dt type tensor<bfloat16>(dt{}, x[32]) {}
    }
}


53 of 57

App Package - Models

Export Transformer models to ONNX. All three models are based on 6-layer MiniLM and shipped with the application package.

  • Dense Query embedding model inference in stateless container cluster
  • ColBERT query embedding model inference in stateless container cluster
  • BERT re-ranking on content node(s)

schema passage {

    document passage {}

    onnx-model minilmranker {
        file: models/msmarco_v2.onnx
        input input_ids: input_ids
        input attention_mask: attention_mask
        input token_type_ids: token_type_ids
    }

    rank-profile sparse {}

    rank-profile dense {}

    rank-profile colbert inherits dense {}
}


54 of 57

MS Marco Passage Ranking with Vespa

(Figure: request flow for MS Marco passage ranking. An HTTPS request to /search/?query=.. enters a stateless chain where the BERT subword tokenizer produces query=tensor(d[32]): [101, 2043, 2001, ..] and the dense and ColBERT query embedders run. Search is dispatched over RPC to the stateful content nodes, which hold the inverted index, HNSW index, tensor store and summary store (passage text, word-piece tokens) for the passages and run the local ranking phases. Results are merged into the global top-K passages, with optional re-ranking and diversification, and ranked passages are returned.)


55 of 57

App package - Code

Stateless container (Java Runtime)

Supports deploying custom searchers, processors, renderers and other components

Chained execution of searchers

Deploy as part of application package

Injectable components

  • Linguistics
  • Model Inference
  • Tokenizer


56 of 57

Hated it?

  • Tweet me @jobergum

QA


57 of 57

Resources

QA
