1 of 7

AI For Document Retrieval and Data Quality

Using text embeddings to enhance the user experience of data collections

2 of 7

What is a text embedding?

  • Large Language Models (LLMs) offer a representation of text in a high-dimensional vector space
  • These semantic spaces allow meaning to be represented with maths
  • king - man + woman ≈ queen (see the sketch below)
  • Transformer models are the current state of the art for obtaining vector representations
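
A minimal sketch of this vector arithmetic in Python, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (both assumptions; any embedding model works). The classic analogy comes from word-vector models, so a sentence-embedding model may reproduce it less crisply:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedder

    # Embed the four words; each becomes a vector in the semantic space.
    king, man, woman, queen = model.encode(["king", "man", "woman", "queen"])

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Arithmetic on meanings: the result should land nearer "queen" than "man".
    result = king - man + woman
    print(cosine(result, queen))  # relatively high
    print(cosine(result, man))    # relatively lower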

3 of 7

What can you do with a vector representation?

  • Semantic record retrieval
  • Improve data quality
    • Entity matching
    • Anomaly detection
  • RAG (Retrieval-Augmented Generation)

4 of 7

Semantic record retrieval

  • The semantic distance between two texts is the distance between their vectors
  • Similar documents are near each other in the space, allowing clustering, matching, etc.
  • Some transformers can also split the space into queries and answers, letting us obtain a vector for a question that lies near its answer rather than near similar questions (examples and a sketch below)
  • d("dogs are the best", � "canines are the greatest") ~ 0
  • d("QUERY Who was president in 1964",� "ANSWER Lyndon Johnson") ~ 0
  • d("ANSWER Lyndon B. Johnson",
  • "ANSWER Lyndon Johnson") ~ 0

5 of 7

Data Quality

  • Matching with LLM embeddings is flexible: texts can match across languages and orthographies, despite spelling mistakes, and without normalisation. This is very helpful in entity-recognition tasks.
    • d("Jim", "James") ~ 0
    • d("Khrushchev", "Chruschtschow") ~ 0
  • The strategy of "embed and cluster" can help to find duplicate records (sketched after this list).
  • It also makes record matching tractable at scale: comparing 1 billion records against 1 billion is 1 quintillion pairs, so it's much faster to search only the neighbours (low-distance vectors) of each record.
  • Anomaly detection: records whose vectors sit far from every cluster are outliers worth reviewing
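
A sketch of the neighbour-search strategy, assuming sentence-transformers for embeddings and hnswlib for the approximate-nearest-neighbour index (both assumptions; the records and threshold are illustrative):

    import hnswlib
    import numpy as np
    from sentence_transformers import SentenceTransformer

    records = ["Jim Smith, b. 1964", "James Smith, born 1964",
               "Nikita Khrushchev", "Nikita Chruschtschow", "Ada Lovelace"]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    vecs = model.encode(records, normalize_embeddings=True)

    # Build an HNSW index so each record is compared only with its
    # nearest neighbours instead of all N^2 pairs.
    index = hnswlib.Index(space="cosine", dim=vecs.shape[1])
    index.init_index(max_elements=len(records), ef_construction=200, M=16)
    index.add_items(vecs, np.arange(len(records)))

    # k=2: the nearest neighbour of each record is itself, so take the second.
    labels, dists = index.knn_query(vecs, k=2)
    for i, (j, dist) in enumerate(zip(labels[:, 1], dists[:, 1])):
        if dist < 0.3:  # illustrative threshold; tune on labelled pairs
            print(f"possible duplicate: {records[i]!r} ~ {records[j]!r}")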

6 of 7

RAG: Retrieval-Augmented Generation

  • A multi-stage process that uses a chatbot to obtain answers (sketched after this list)
    • Vectorize documents
    • Ask a question
    • Get the "QUESTION" embedding, and search for neighbors, e.g. "QUESTION What are some novels about the Cold War"
    • Extract information about the neighbours from a traditional RDBMS or Graph Database (title, abstract, author, hyper-link etc)
    • Use this information to produce a prompt.
    • "Answer the following question given the relevant documents and their hyperlinks: � 1. Author: Tom Clancy, Title: The Hunt for Red October� 2. Author: Tom Clancy, Title: Clear and Present Danger"
    • Feed the prompt and question to a chatbot
    • The chatbot responds with answers grounded in the internal document records, without hallucinating (hopefully)
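
A minimal end-to-end sketch of the pipeline above, assuming sentence-transformers for embeddings. The toy catalogue stands in for the traditional database, brute-force search stands in for the vector database, and the chat step is left as a stub for whichever chat model you use:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Toy catalogue standing in for the RDBMS / graph database.
    CATALOGUE = [
        {"author": "Tom Clancy", "title": "The Hunt for Red October"},
        {"author": "Tom Clancy", "title": "Clear and Present Danger"},
        {"author": "Jane Austen", "title": "Pride and Prejudice"},
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = model.encode(
        [f"ANSWER {d['author']}: {d['title']}" for d in CATALOGUE],
        normalize_embeddings=True,
    )

    def chat(prompt: str) -> str:
        raise NotImplementedError("call your chat model here")  # stub

    def answer(question: str, k: int = 2) -> str:
        # 1. Embed the question (prefix conventions are model-specific).
        qvec = model.encode(f"QUERY {question}", normalize_embeddings=True)
        # 2. Nearest neighbours; a vector database replaces this at scale.
        top = np.argsort(doc_vecs @ qvec)[::-1][:k]
        # 3. Pull metadata for the neighbours and build the prompt.
        context = "\n".join(
            f"{i + 1}. Author: {CATALOGUE[j]['author']}, "
            f"Title: {CATALOGUE[j]['title']}"
            for i, j in enumerate(top)
        )
        prompt = ("Answer the following question given the relevant "
                  f"documents:\n{context}\n\nQuestion: {question}")
        # 4. Feed the grounded prompt to the chatbot.
        return chat(prompt)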

7 of 7

What do I need to make these things work?

  • An LLM tuned for embeddings (MxBai is good for small things, Ada is better for big things)
  • A traditional database (Graph or RDBMS)
  • A vector database (currently, HNSW-based indexes are the best performing)
  • A way to create strings for the embeddings from records (JSON + Handlebars templates? See the sketch below)
  • Good prompt engineering (Good luck!)
  • Some glue code (python?)
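
As a sketch of the record-to-string step: the deck suggests JSON plus Handlebars templates; Python's built-in string.Template serves as a stand-in here, and the field names are illustrative:

    import json
    from string import Template

    TEMPLATE = Template("Author: $author. Title: $title. Abstract: $abstract")

    record = json.loads(
        '{"author": "Tom Clancy", "title": "The Hunt for Red October",'
        ' "abstract": "A Soviet submarine captain attempts to defect."}'
    )

    # This string is what gets embedded and indexed for the record.
    print(TEMPLATE.substitute(record))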