1 of 7

AI For Document Retrieval and Data Quality

Using text embeddings to enhance the user experience of data collections

2 of 7

What is a text embedding?

  • Large Language Models (LLMs) offer a representation of text in a high-dimensional vector space
  • These semantic spaces allow meaning to be represented with maths
  • king - man + woman ≈ queen (see the sketch below)
  • Transformer models are the current state of the art for obtaining vector representations
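
A minimal sketch of this vector arithmetic in Python, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (both assumptions; any embedding model works). The classic analogy comes from word-vector models, so a sentence-embedding model may reproduce it less crisply:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedder

    # Embed the four words; each becomes a vector in the semantic space.
    king, man, woman, queen = model.encode(["king", "man", "woman", "queen"])

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Arithmetic on meanings: the result should land nearer "queen" than "man".
    result = king - man + woman
    print(cosine(result, queen))  # relatively high
    print(cosine(result, man))    # relatively lower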

3 of 7

What can you do with a vector representation?

  • Semantic record retrieval
  • Improve data quality
    • Entity matching
    • Anomaly detection
  • RAG (Retrieval-Augmented Generation)

4 of 7

Semantic record retrieval

  • The semantic distance between two texts is the distance between their vectors
  • Similar documents are near each other in the space, allowing clustering, matching, etc.
  • Some transformers can also split the space into queries and answers, letting us obtain a vector for a question that lies near its answer rather than near similar questions (examples and a sketch below)
  • d("dogs are the best", � "canines are the greatest") ~ 0
  • d("QUERY Who was president in 1964",� "ANSWER Lyndon Johnson") ~ 0
  • d("ANSWER Lyndon B. Johnson",
  • "ANSWER Lyndon Johnson") ~ 0

5 of 7

Data Quality

  • Matching with LLM embeddings is flexible: texts can match across languages and orthographies, despite spelling mistakes, and without normalisation. This is very helpful in entity-recognition tasks.
    • d("Jim", "James") ~ 0
    • d("Khrushchev", "Chruschtschow") ~ 0
  • The strategy of "embed and cluster" can help to find duplicate records (sketched after this list).
  • It also makes record matching tractable at scale: comparing 1 billion records against 1 billion is 1 quintillion pairs, so it's much faster to search only the neighbours (low-distance vectors) of each record.
  • Anomaly detection: records whose vectors sit far from every cluster are outliers worth reviewing
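
A sketch of the neighbour-search strategy, assuming sentence-transformers for embeddings and hnswlib for the approximate-nearest-neighbour index (both assumptions; the records and threshold are illustrative):

    import hnswlib
    import numpy as np
    from sentence_transformers import SentenceTransformer

    records = ["Jim Smith, b. 1964", "James Smith, born 1964",
               "Nikita Khrushchev", "Nikita Chruschtschow", "Ada Lovelace"]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    vecs = model.encode(records, normalize_embeddings=True)

    # Build an HNSW index so each record is compared only with its
    # nearest neighbours instead of all N^2 pairs.
    index = hnswlib.Index(space="cosine", dim=vecs.shape[1])
    index.init_index(max_elements=len(records), ef_construction=200, M=16)
    index.add_items(vecs, np.arange(len(records)))

    # k=2: the nearest neighbour of each record is itself, so take the second.
    labels, dists = index.knn_query(vecs, k=2)
    for i, (j, dist) in enumerate(zip(labels[:, 1], dists[:, 1])):
        if dist < 0.3:  # illustrative threshold; tune on labelled pairs
            print(f"possible duplicate: {records[i]!r} ~ {records[j]!r}")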

6 of 7

RAG: Retrieval-Augmented Generation

  • A multi-stage process that uses a chatbot to obtain answers (sketched after this list)
    • Vectorize documents
    • Ask a question
    • Get the "QUESTION" embedding, and search for neighbors, e.g. "QUESTION What are some novels about the Cold War"
    • Extract information about the neighbours from a traditional RDBMS or Graph Database (title, abstract, author, hyper-link etc)
    • Use this information to produce a prompt.
    • "Answer the following question given the relevant documents and their hyperlinks: � 1. Author: Tom Clancy, Title: The Hunt for Red October� 2. Author: Tom Clancy, Title: Clear and Present Danger"
    • Feed the prompt and question to a chatbot
    • The chatbot responds with answers grounded in the internal document records, without hallucinating (hopefully)
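
A minimal end-to-end sketch of the pipeline above, assuming sentence-transformers for embeddings. The toy catalogue stands in for the traditional database, brute-force search stands in for the vector database, and the chat step is left as a stub for whichever chat model you use:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Toy catalogue standing in for the RDBMS / graph database.
    CATALOGUE = [
        {"author": "Tom Clancy", "title": "The Hunt for Red October"},
        {"author": "Tom Clancy", "title": "Clear and Present Danger"},
        {"author": "Jane Austen", "title": "Pride and Prejudice"},
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = model.encode(
        [f"ANSWER {d['author']}: {d['title']}" for d in CATALOGUE],
        normalize_embeddings=True,
    )

    def chat(prompt: str) -> str:
        raise NotImplementedError("call your chat model here")  # stub

    def answer(question: str, k: int = 2) -> str:
        # 1. Embed the question (prefix conventions are model-specific).
        qvec = model.encode(f"QUERY {question}", normalize_embeddings=True)
        # 2. Nearest neighbours; a vector database replaces this at scale.
        top = np.argsort(doc_vecs @ qvec)[::-1][:k]
        # 3. Pull metadata for the neighbours and build the prompt.
        context = "\n".join(
            f"{i + 1}. Author: {CATALOGUE[j]['author']}, "
            f"Title: {CATALOGUE[j]['title']}"
            for i, j in enumerate(top)
        )
        prompt = ("Answer the following question given the relevant "
                  f"documents:\n{context}\n\nQuestion: {question}")
        # 4. Feed the grounded prompt to the chatbot.
        return chat(prompt)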

7 of 7

What do I need to make these things work?

  • An LLM tuned for embeddings (MxBai is good for small things, Ada is better for big things)
  • A traditional database (Graph or RDBMS)
  • A vector database (currently, HNSW-based indexes are the best performing)
  • A way to create strings for the embeddings from records (JSON + Handlebars templates? See the sketch below)
  • Good prompt engineering (Good luck!)
  • Some glue code (python?)
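
As a sketch of the record-to-string step: the deck suggests JSON plus Handlebars templates; Python's built-in string.Template serves as a stand-in here, and the field names are illustrative:

    import json
    from string import Template

    TEMPLATE = Template("Author: $author. Title: $title. Abstract: $abstract")

    record = json.loads(
        '{"author": "Tom Clancy", "title": "The Hunt for Red October",'
        ' "abstract": "A Soviet submarine captain attempts to defect."}'
    )

    # This string is what gets embedded and indexed for the record.
    print(TEMPLATE.substitute(record))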