Bringing the neural search paradigm shift to production
Jo Kristian Bergum (twitter.com/jobergum)
Copyright Yahoo
Agenda
The neural search paradigm shift
BERT (Bidirectional Encoder Representations from Transformers) released by Google late 2018
Large gain in effectiveness kicked off the “BERT revolution”.
Observed on MS MARCO, the largest data-intensive relevance collection.
Metric: RR (Reciprocal Rank)
RR = 1/(position of the first relevant hit)
Pretrained Transformers for Text Ranking: BERT and Beyond. Jimmy Lin, Rodrigo Nogueira, Andrew Yates
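As a concrete illustration of the metric above, a minimal Python sketch computing reciprocal rank and mean reciprocal rank over ranked result lists (function and variable names are invented for the example):

```python
def reciprocal_rank(ranked_doc_ids, relevant_doc_ids):
    """1 / position of the first relevant hit, 0.0 if no relevant document is retrieved."""
    for position, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(rankings):
    """rankings: list of (ranked_doc_ids, relevant_doc_ids) pairs, one per query."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in rankings) / len(rankings)

# First relevant hit at position 3 => RR = 1/3
print(reciprocal_rank(["d7", "d2", "d9"], {"d9"}))  # 0.333...
```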
The neural search paradigm shift
Paradigm shifts require new skills and tools
Photo by Siora Photography on Unsplash
BERT
Demystified: A deep neural network architecture with a text tokenizer and a fixed vocabulary.
Pre-trained neural network weights from language modeling (self-supervised training on large amounts of text)
BERT-base uncased
(12 layers, hidden size 768)
110M tunable parameters
Example input (query, passage) token sequence:
[CLS] what is love [SEP] love could be a song by hadaway [SEP]
The output vector of the [CLS] token is used as the representation of the whole input.
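A minimal sketch of this tokenization step, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (not part of the original slides):

```python
from transformers import AutoTokenizer

# BERT-base uncased: WordPiece tokenizer with a fixed ~30k token vocabulary
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer(
    "what is love",                      # query (segment A)
    "love could be a song by haddaway",  # passage (segment B)
    return_tensors="pt",
)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))
# ['[CLS]', 'what', 'is', 'love', '[SEP]', 'love', 'could', 'be', 'a', 'song', 'by', ...]
```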
Pre-trained language model to task specific model
Language Model (pre-trained weights) + labeled data → task-specific supervised training → Task Specific Model (task-specific tuned weights)
Applying BERT for ranking and retrieval needs task specific training
Needs supervision to shine (query, relevant doc, irrelevant doc)
BERT as a classification model
Train a classifier with all-to-all interaction between query and document terms
BERT as a representation model
Train a joint representation of queries and documents in an embedding space - reduce the interaction between query and document to a vector similarity
The magic is in the fine-tuning; lots of development on representation models
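A hedged sketch of the two approaches using the sentence-transformers library; the two model names are publicly available MS MARCO checkpoints used purely as examples here, not the models from these slides:

```python
from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "when was the last time anyone was on the moon"
passage = "A total of twelve men have landed on the Moon ..."

# Classification / interaction model: query and passage are scored together,
# with full attention (all-to-all interaction) between their terms.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print(cross_encoder.predict([(query, passage)]))

# Representation model: query and passage are embedded independently,
# and the interaction is reduced to a vector similarity.
bi_encoder = SentenceTransformer("sentence-transformers/msmarco-MiniLM-L6-cos-v5")
q_emb = bi_encoder.encode(query, convert_to_tensor=True)
p_emb = bi_encoder.encode(passage, convert_to_tensor=True)
print(util.cos_sim(q_emb, p_emb))
```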
Dense representation learning - State of the art retrievers
Query: [CLS] when was the last time anyone was on the moon
Document: [CLS] A total of twelve men have landed on the Moon ...
Query Embedder (Transformer model) and Document Embedder (Transformer model) map each input to a vector; relevance is scored by a distance metric between the two vectors.
How NOT to use BERT for text ranking
How NOT to use BERT for text ranking
Using the BERT [CLS] vector representation without fine-tuning scores roughly 10x worse than BM25 on the MS MARCO Passage Ranking leaderboard
[Chart: BM25 versus un-tuned BERT CLS representation]
Introduction to Vespa.ai
Vespa.ai
An open-source platform for low latency computations over large, evolving data
Apache 2.0 Licensed
https://github.com/vespa-engine/vespa
@vespaengine
Quick history of Vespa Serving Engine
1998: Fast Search & Transfer web search division - Alltheweb.com
2003: Yahoo buys Inktomi, Overture, AltaVista, and the web search division of Fast Search & Transfer
2004: Vespa is born - vertical search platform in Yahoo
2010: Vespa 5.x - new real-time indexing
2017: Vespa open sourced - Apache 2.0 License, all development in the open at github.com/vespa-engine
2021: Vespa Cloud - commercial offering with free trials, 4 production zones
Vespa @ Yahoo
Serves about 45B real-time queries per day across 10+ production zones globally, run as software as a service (SaaS)
150+ different applications (search, recommendation, ads)
See blog.vespa.ai for detailed Yahoo use cases
Also available as a cloud service at cloud.vespa.ai
Vespa Overview
Vespa stateless container cluster (Java runtime): HTTP API (mTLS, HTTP/2), query functions, document functions, components/linguistics, model inference, scatter & gather
Vespa stateful content cluster (C++ runtime): content and index management (index, db, HNSW), model inference, retrieval and ranking; scales horizontally
Vespa configuration system: deploys the application package (schemas, models, and code)
Tooling: vespa-cli
Distribution: x86 RPM (CentOS 7), container (vespaengine/vespa), Vespa Cloud
Vespa Stateful Content Cluster
Each content node (content node 0..N) holds a document store, an inverted index, a memory index, an HNSW index, a tensor store, and ranking models.
Retrieval, ranking, and re-ranking run locally on every content node.
The Vespa stateless container cluster communicates with the content nodes over TLS.
Retrieval and text representations
Representing text (queries and documents)
Sparse text representation
D1: “... men landed on the moon ...” D2: “... the moon is made of cheese ...moon”
Sparse representation over the vocabulary V (term frequencies):
| term | cheese | is | land | made | men | moon | of | on | the |
| D1   |        |    | 1    |      | 1   | 1    |    | 1  | 1   |
| D2   | 1      | 1  |      | 1    |     | 2    | 1  |    | 1   |
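A minimal sketch of building such sparse term-frequency vectors in plain Python; tokenization here is naive whitespace splitting without the stemming the slide assumes ("landed" → "land"), purely for illustration:

```python
from collections import Counter

docs = {
    "D1": "men landed on the moon",
    "D2": "the moon is made of cheese moon",
}

# Vocabulary = union of all terms; each document becomes a |V|-dimensional sparse vector
term_freqs = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
vocab = sorted(set().union(*term_freqs.values()))

for doc_id, tf in term_freqs.items():
    print(doc_id, [(term, tf[term]) for term in vocab if tf[term] > 0])
# D1 [('landed', 1), ('men', 1), ('moon', 1), ('on', 1), ('the', 1)]
# D2 [('cheese', 1), ('is', 1), ('made', 1), ('moon', 2), ('of', 1), ('the', 1)]
```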
Accelerated retrieval with sparse text representation
Q: “When was the last time anyone was on the moon”
when   → D4, D6
was    → D1, D4, D7
the    → D1, D2, D3, D4, D5, D6, D7, ...
last   → D2
time   → D2, D3
anyone → D2, D4, D7
on     → D1, D4, D5, D7
moon   → D1, D2
WAND(...) candidates: D1, D2, D4
Efficient Query Evaluation using a Two-Level Retrieval Process, Broder et. al, CIKM 2003
Inverted Index
Dynamic pruning algorithms avoid scoring all documents that contain any of the query terms
A platform without WAND* is not a search engine
* Dynamic sparse pruning algorithms in general (Block-Max WAND, MaxScore and variants) - it is too costly to brute-force evaluate all documents that match at least one of the query terms (“to be or not to be”)
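To make the pruning idea concrete, here is a heavily simplified, single-threaded WAND-style sketch in Python over in-memory posting lists. It illustrates the upper-bound pivoting idea from the Broder et al. paper cited above; it is not Vespa's implementation, and all names and data structures are invented for the example:

```python
import heapq
from typing import Dict, List, Tuple

def wand_top_k(postings: Dict[str, List[Tuple[int, float]]],
               max_score: Dict[str, float],
               k: int) -> List[Tuple[float, int]]:
    """postings: term -> [(docid, score contribution)], sorted by docid.
    max_score: term -> upper bound on that term's score contribution."""
    pos = {t: 0 for t in postings}          # one cursor per query term
    heap: List[Tuple[float, int]] = []      # min-heap holding the current top-k
    threshold = 0.0                         # k-th best score seen so far

    def current_doc(term):
        plist = postings[term]
        return plist[pos[term]][0] if pos[term] < len(plist) else None

    while True:
        active = [(current_doc(t), t) for t in postings if current_doc(t) is not None]
        if not active:
            break
        active.sort()                       # order cursors by their current docid
        # Pivot: shortest prefix whose summed upper bounds can beat the threshold
        upper_bound, pivot = 0.0, None
        for i, (_, term) in enumerate(active):
            upper_bound += max_score[term]
            if upper_bound > threshold:
                pivot = i
                break
        if pivot is None:
            break                           # no remaining document can enter the top-k
        pivot_doc = active[pivot][0]
        if active[0][0] == pivot_doc:
            # Enough cursors sit on pivot_doc: score it fully and advance those cursors
            score = 0.0
            for doc, term in active:
                if doc != pivot_doc:
                    break
                score += postings[term][pos[term]][1]
                pos[term] += 1
            if len(heap) < k:
                heapq.heappush(heap, (score, pivot_doc))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, pivot_doc))
            if len(heap) == k:
                threshold = heap[0][0]
        else:
            # Prune: skip the leading cursor forward to pivot_doc without scoring
            _, term = active[0]
            while current_doc(term) is not None and current_doc(term) < pivot_doc:
                pos[term] += 1
    return sorted(heap, reverse=True)

postings = {
    "moon": [(1, 2.1), (2, 1.8)],
    "anyone": [(2, 1.2), (4, 1.0), (7, 1.1)],
    "time": [(2, 0.7), (3, 0.9)],
}
max_score = {"moon": 2.1, "anyone": 1.2, "time": 0.9}
print(wand_top_k(postings, max_score, k=2))   # [(3.7, 2), (2.1, 1)]
```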
Dense text representations - Representation Learning
Sparse representation (one dimension per vocabulary term):
| term | cheese | is | land | made | men | moon | of | on | the |
| D1   |        |    | 1    |      | 1   | 1    |    | 1  | 1   |
Vectorize/Embed/Transform ↓
Dense representation (low-dimensional, learned): | 0.12 | 0.77 | 0.92 | 0.04 | 0.33 | 0.09 |
Retrieval with dense text representation
Q: “When was the last time anyone was on the moon” → | 0.1 | 0.7 | 0.9 | 0.2 | 0.3 | 0.1 |
D: “A total of twelve men have landed on the Moon ...” → | 0.2 | 0.6 | 0.8 | 0.3 | 0.2 | 0.1 |
Distance/Similarity:
Dot product | 1.290 |
Euclidean   | 0.099 |
Cosine      | 0.986 |
Accelerated retrieval with dense text representation
Exact nearest neighbor search is linear in the number of documents - too costly to achieve servable low latency for large document collections
Approximate nearest neighbor search is sub-linear - comes with an accuracy loss
Many methods for ANN - some are worse than others (low overlap between the exact and the approximate neighbors returned)
Billion-scale vector datasets can be searched on a single node in < 5 ms single-threaded with high recall@10 (> 0.9)
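A hedged sketch of measuring that accuracy loss (recall@10 = overlap between the exact and approximate top-10) using the hnswlib library on random vectors; the dimensions, dataset size, and HNSW parameters are illustrative only:

```python
import numpy as np
import hnswlib

dim, n = 128, 100_000
data = np.random.rand(n, dim).astype(np.float32)
query = np.random.rand(dim).astype(np.float32)

# Exact nearest neighbor search: linear in the number of vectors
exact_top10 = np.argsort(np.linalg.norm(data - query, axis=1))[:10]

# Approximate nearest neighbor search with an HNSW graph: sub-linear at query time
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))
index.set_ef(50)  # higher ef => better recall, higher latency
approx_top10, _ = index.knn_query(query, k=10)

recall_at_10 = len(set(exact_top10) & set(approx_top10[0])) / 10
print(recall_at_10)
```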
A platform without (A)NN* is not a search engine
Dense representations with accurate approximate nearest neighbor search are state-of-the-art first-phase retrievers in data-rich domains with training data, either implicit from user interactions or explicit from editorial judgements
Supervised text representations
Note: sparse BM25 is unsupervised - two hyperparameters (k1 and b)
It is unfair to compare it with learned representations (sparse or dense), but it is usually a good baseline when starting without training data.
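A minimal sketch of a BM25 scoring function and its two hyperparameters, k1 and b; the default values (k1=1.2, b=0.75) and the corpus statistics below are toy values for illustration, not Vespa's internals:

```python
import math

def bm25_term(tf, df, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 contribution of a single query term to a single document."""
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

def bm25(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len):
    counts = {t: doc_terms.count(t) for t in set(doc_terms)}
    return sum(
        bm25_term(counts[t], doc_freqs.get(t, 1), num_docs, len(doc_terms), avg_doc_len)
        for t in query_terms
        if counts.get(t, 0) > 0
    )

doc = "a total of twelve men have landed on the moon".split()
print(bm25(["moon", "landing"], doc, {"moon": 50, "landing": 80},
           num_docs=8_800_000, avg_doc_len=56))
```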
Supervised (learned) representations: in-domain 👍, out-of-domain 👎
A platform without BM25* is not a search engine
BM25/sparse text ranking is hard to beat in the low-data regime, before gathering training data from online interactions
Where dense representations shine - multimodal representations
Our cat, Truls, says “Winter is coming”. Let us stay indoors.
[0.11, 0.14, 0.38, ...]
[0.27, 0.79, 0.98, ...]
[0.89, 0.0014, 0.52, 0.0000045]
[page_rank, global_ctr, page_quality, image_quality]
Vespa Ranking Framework
Retrieval and Ranking
Retrieval and Ranking Conceptual
Effective candidate retrievers: accelerated, limited choice of scoring functions
First-phase, second-phase, and N-phase ranking: any scoring function (features, models)
Funnel: billions (corpus) → millions (retrieved) → thousands (first phase) → 100s (second phase)
Physical retrieval versus logical ranking
Query formulation is “the physical retriever”.
Documents retrieved by the query formulation are exposed to configurable ranking
Ranking is expressed using Vespa ranking expressions in ranking profiles - the ranking profile can be chosen at run time (e.g. after query classification in the stateless layer)
Efficient sparse retriever implementations in Vespa
Efficient dense retriever implementation in Vespa
nearestNeighbor query operator
Distance metrics: euclidean (L2), angular, hamming
Tensor cell precision: float (4 bytes), bfloat16 (2 bytes), int8 (1 byte)
Vespa ANN is based on HNSW (Hierarchical Navigable Small World) graphs
Real-time insert, delete, and update
HNSW settings controlled in the schema (accuracy versus performance)
Supports multiple HNSW indexes per document
Support for switching between exact and approximate search makes it easy to measure the accuracy drop
Multi-threaded exact search
Vespa Query Language (YQL)
{
  "yql": "select id, text from passages where ([{\"targetHits\":10}]nearestNeighbor(doc_embedding, query_embedding)) or ([{\"targetHits\":10}]weakAnd(userQuery()))",
  "query": "what was the manhattan project?",
  "ranking.features.query(query_embedding)": [0.1, 0.2, 0.2],
  "ranking.profile": "my-ranking-profile"
}
Combine sparse and dense representations in the same query
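A hedged sketch of issuing such a hybrid query over HTTP from Python; the endpoint (http://localhost:8080/search/) and the zero-valued query embedding are placeholders, and a real application would obtain the embedding from the query embedder model:

```python
import requests

# Placeholder embedding; a real application would run the query text through the query embedder
query_embedding = [0.0] * 384

body = {
    "yql": (
        "select id, text from passages where "
        '([{"targetHits":10}]nearestNeighbor(doc_embedding, query_embedding)) or '
        '([{"targetHits":10}]weakAnd(userQuery()))'
    ),
    "query": "what was the manhattan project?",
    "ranking.features.query(query_embedding)": query_embedding,
    "ranking.profile": "my-ranking-profile",
}

response = requests.post("http://localhost:8080/search/", json=body)
for hit in response.json().get("root", {}).get("children", []):
    print(hit["relevance"], hit["fields"]["text"])
```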
Real world search is constrained + diversity
select id, text from passages where
  ( [{"targetHits":10}]nearestNeighbor(doc_embedding, query_embedding) or
    [{"targetHits":10}]weakAnd(userQuery()) )
  and visible = true and category contains "sports" and quality > 0.1 and market contains "us"
limit 0 | all(group(category) order(-max(relevance())) max(10) each(max(2) each(output(summary()))))
Ranking with Vespa
Photo by Roman Mager on Unsplash
Ranking with Vespa
Documents matching the query formulation are exposed to configurable ranking
Configured in the document schema(s) - ranking expressions are compiled (LLVM) in the C++ runtime
rank-profile my-ranking-profile {
    first-phase {
        expression: bm25(text) + attribute(doc_ctr) + sum(query(query_categories) * attribute(document_categories))
    }
}
Ranking with Vespa
Rich set of built-in text (matching) ranking features
No need to change query formulation to express changes in ranking
Easy to express multi-phased ranking and ensemble models
Ranking - Tensors
Tensors are first class citizens in Vespa
Tensors in queries, documents and global documents
field embedding type tensor<int8>(x[128])
query(query_embedding) type tensor<int8>(x[128])
query(qt_colbert) type tensor<float>(qt[32], x[128])
Tensor feature store - access document and global-document tensors in ranking
Tensor compute engine - e.g. sum(query(qt)*attribute(dt)) - expressions are recognized and HW accelerated when possible
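For intuition, the tensor expression sum(query(qt)*attribute(dt)) over two tensors sharing the same dimension is just a dot product; a minimal numpy sketch with illustrative random 128-dimensional vectors:

```python
import numpy as np

qt = np.random.rand(128).astype(np.float32)  # query(qt): query tensor
dt = np.random.rand(128).astype(np.float32)  # attribute(dt): document tensor

# sum over the elementwise product == dot product
score = float(np.sum(qt * dt))
assert abs(score - float(np.dot(qt, dt))) < 1e-3
print(score)
```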
Ranking - Models from your favorite framework
Ranking Model integrations
ONNX (Open Neural Network Exchange) - Interoperability
GBDT family (XGBoost/LightGBM) - classic LTR
PyTorch, TensorFlow, scikit-learn, ... => export to ONNX
Ability to run model inference in both the stateless container and the stateful content layer
Vespa integrates with ONNX Runtime for accelerated inference of ONNX models (both stateless and stateful)
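A hedged sketch of exporting a Hugging Face cross-encoder to ONNX with torch.onnx.export; the checkpoint name and file name are examples only, and tools such as Hugging Face optimum can do the same export with less code:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # example checkpoint, not the model from the slides
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, return_dict=False)
model.eval()

encoded = tokenizer("what is love", "love could be a song", return_tensors="pt")

torch.onnx.export(
    model,
    (encoded["input_ids"], encoded["attention_mask"], encoded["token_type_ids"]),
    "msmarco_ranker.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "token_type_ids": {0: "batch", 1: "sequence"},
    },
)
```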
No model to rule them all
Easy to deploy multiple models applied to same query
Stateless query intent classification
Hybrid retrieval (dense + sparse)
GBDT (Classic LTR)
Neural methods
In the same query, in the same engine
rank-profile ensemble-phased {
    function intent_score() {
        expression: sum(query(cat) * attribute(categories))
    }
    function neural() {
        expression: onnx(minilmranker){d0:0,d1:0}
    }
    first-phase {
        expression: lightgbm("light-gbdt-v3.json")
    }
    second-phase {
        expression: firstPhase + intent_score() + neural()
    }
}
Vespa replaces all of this
Query → [ inverted index library + embedding data storage + vector search library + document storage + feature storage + GBDT inference service + transformer inference service + snippet service + search ranking middleware ] → ranked docs
#MLOps #DevOps #pagerduty #failuremodes
Ranking with Transformers with Vespa
Photo by Aditya Vyas on Unsplash
Scaling Transformer Inference and Ranking
Scaling Transformer for production workloads
Getting there? On CPU?
Best way to scale BERT is not using BERT!
Don’t default to BERT-Base for your production retrieval and ranking pipelines
Transformer models versus accuracy & serving cost
https://huggingface.co/microsoft/xtremedistil-l6-h384-uncased
Same accuracy, but 9x reduced cost; with int8 weights, 27x reduced cost.
Your go-to model?
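A hedged sketch of producing such int8 weights with ONNX Runtime's dynamic quantization; the file names are placeholders:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize the fp32 weights of an exported ONNX model to int8 (dynamic quantization)
quantize_dynamic(
    model_input="msmarco_ranker.onnx",
    model_output="msmarco_ranker-int8.onnx",
    weight_type=QuantType.QInt8,
)
```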
Vespa MS MARCO Passage Ranking
Vespa App Package
text (passage text)
text_token_ids (BERT subword ids)
embedding (dense single-vector representation)
dt (multiple dense representations per passage)
schema passage {
    document passage {
        field text type string {}
        field text_token_ids type tensor<float>(d0[128]) {}
        field embedding type tensor<float>(x[384]) {
            index { hnsw { .. } }
        }
        field dt type tensor<bfloat16>(dt{}, x[32]) {}
    }
}
App Package - Models
Export the Transformer models to ONNX. All three models are based on a 6-layer MiniLM and shipped with the application package.
schema passage {
    document passage {}
    onnx-model minilmranker {
        file: models/msmarco_v2.onnx
        input input_ids: input_ids
        input attention_mask: attention_mask
        input token_type_ids: token_type_ids
    }
    rank-profile sparse {}
    rank-profile dense {}
    rank-profile colbert inherits dense {}
}
MS Marco Passage Ranking with Vespa
Stateless container (searcher chain):
/search/?query=..
BERT subword tokenizer → query=tensor(d[32]): [101, 2043, 2001, ..]
Dense query embedder and ColBERT query embedder
Merge of the global top-K passages, optional re-ranking and diversification → ranked passages (HTTPS)
Stateful content nodes (RPC from the container):
Inverted index and HNSW index over passages
Tensor store and summary store (passage text, WordPiece tokens)
Local ranking phases (search, fetch)
App package - Code
Stateless container (Java Runtime)
Supports deploying custom searchers, processors, renderers, and other components
Chained execution of searchers
Deploy as part of application package
Injectable components
Hated it?
Q&A
Resources
https://github.com/vespa-engine/vespa
Slack http://slack.vespa.ai/
https://github.com/vespa-engine/sample-apps/blob/master/msmarco-ranking/passage-ranking.md (Neural end to end)
https://github.com/vespa-engine/sample-apps/blob/master/msmarco-ranking/document-ranking.md (WAND BM25 + classic GBDT re-ranking)
https://cloud.vespa.ai/ (free trial)