Using MongoDB Atlas Vector Search for AI Semantic Search
Leonardo Gomes - Certified MongoDB Developer
Jan., 2025
What is the first thing that comes to mind when you hear the word AI?
Artificial Intelligence (AI)
A field in computer science that trains computers to simulate human intelligence.
Machine Learning
Machine learning is a subset of artificial intelligence in which algorithms learn patterns from data and use them to make predictions or decisions.
Deep Learning
Deep learning is a subset of machine learning loosely modeled on how the human brain processes information.
Deep learning models are artificial neural networks built from interconnected neurons (or nodes) arranged in many layers, which lets them process more complex data patterns than traditional machine learning algorithms.
Large language model (LLM) and Generative AI
Generative AI refers to the use of AI to create new content, like text, images, music, audio, and videos. E.g.: ChatGPT, DALL-E, Llama Live.
An LLM is a statistical language model that can be used to generate and translate text and other content.
LLMs and generative AI are subsets of deep learning.
Natural language processing (NLP)
A subfield of computer science and artificial intelligence (AI) that uses machine learning to enable computers to understand and communicate with human language. E.g.: chatbots.
Summary of concepts:
Artificial Intelligence
Machine Learning
Deep Learning
Generative AI
Natural Language Processing
Vectors
A vector has magnitude and direction, and can represent complex data in data science through numerical features. Vector databases store these representations, enabling efficient similarity searches in a multi-dimensional space.
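As a small illustration of "magnitude and direction", the sketch below splits a vector into its length and its unit direction with NumPy. The 2-D vector is a made-up example, not data from this talk.

```python
import numpy as np

# Hypothetical 2-D feature vector, chosen only for illustration
v = np.array([3.0, 4.0])

magnitude = np.linalg.norm(v)   # Euclidean length: sqrt(3^2 + 4^2)
direction = v / magnitude       # unit vector pointing the same way as v

print(magnitude)   # 5.0
print(direction)   # [0.6 0.8]
```

Real embeddings work the same way, only with hundreds or thousands of dimensions instead of two.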
Embeddings
An embedding model converts diverse data types like text, images, and audio into vectors, positioning them in a multi-dimensional space.
Vector Databases
Vector search and the cosine algorithm
Cosine Similarity calculates the cosine of the angle between two vectors, revealing how closely the vectors are aligned.
For instance, words like "cat" and "dog" will have a higher cosine similarity than "cat" and "banana."
When was vector search created?
MongoDB Atlas Vector Search currently provides 3 approaches to calculate vector similarity: euclidean, cosine, and dotProduct.
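The three similarity options in Atlas Vector Search are named euclidean, cosine, and dotProduct. A minimal NumPy sketch of the math behind each one follows. The vectors and helper function names are hypothetical, and the sketch does not reproduce how Atlas normalizes these values into a search score.

```python
import numpy as np

# Hypothetical toy vectors, not real embeddings
a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 4.0, 4.0])   # same direction as a, twice the length

def euclidean_distance(x, y):
    # Straight-line distance between the two points (smaller = more similar)
    return float(np.linalg.norm(x - y))

def dot_product(x, y):
    # Sensitive to both direction and magnitude (larger = more similar)
    return float(np.dot(x, y))

def cosine(x, y):
    # Depends only on the angle between the vectors (1.0 = same direction)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

print(euclidean_distance(a, b))  # 3.0
print(dot_product(a, b))         # 18.0
print(cosine(a, b))              # 1.0 (up to floating-point rounding)
```

Note how the two vectors are 3.0 apart by Euclidean distance yet perfectly aligned by cosine: the choice of function changes what "similar" means.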
Calculating the cosine similarity
We define cosine similarity mathematically as the dot product of the two vectors divided by the product of their magnitudes: cos(θ) = (x · y) / (‖x‖ ‖y‖).
import numpy as np

def cosine_similarity(x, y):
    # Convert inputs to float arrays so plain Python lists also work
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Ensure x and y have the same length
    if len(x) != len(y):
        return None
    # Compute the dot product between x and y
    dot_product = np.dot(x, y)
    # Compute the L2 norms (magnitudes) of x and y
    magnitude_x = np.sqrt(np.sum(x**2))
    magnitude_y = np.sqrt(np.sum(y**2))
    # Compute the cosine similarity
    return dot_product / (magnitude_x * magnitude_y)
Using the cosine to find similarities
Using cosine, the closest words are those whose vectors point in nearly the same direction as the search term, regardless of their magnitude.
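To make the ranking idea concrete, here is a minimal sketch that sorts a tiny vocabulary by cosine similarity to a query vector. The 3-D "embeddings" are invented for illustration; real embedding models produce far longer vectors with different values.

```python
import numpy as np

def cosine_similarity(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical 3-D "embeddings", invented for illustration only
vocabulary = {
    "dog": [0.9, 0.8, 0.1],
    "kitten": [1.6, 1.8, 0.4],   # same direction as the query, twice as long
    "banana": [0.1, 0.2, 0.9],
}
query = [0.8, 0.9, 0.2]          # stands in for the search term "cat"

# Sort words from most to least similar to the query
ranked = sorted(
    vocabulary,
    key=lambda word: cosine_similarity(query, vocabulary[word]),
    reverse=True,
)
print(ranked)  # ['kitten', 'dog', 'banana']
```

"kitten" ranks first even though its vector is farther from the query in absolute distance than "dog", because cosine compares direction only.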
MongoDB Atlas
MongoDB Atlas is a multi-cloud database service provided by MongoDB.
Atlas simplifies deploying and managing databases while offering versatility to build resilient and performant global applications on cloud providers.
MongoDB Atlas Vector Search
The combined power of vectors and MongoDB
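As a rough sketch of what this looks like in practice, the helpers below build a vector index definition and a `$vectorSearch` aggregation pipeline as plain Python dictionaries. The index name `vector_index`, the field names `embedding` and `text`, and the helper function names are assumptions made for illustration; 1536 matches the ada-002 embedding size mentioned later in this deck.

```python
def build_vector_index_definition(path="embedding", dimensions=1536, similarity="cosine"):
    # Definition for an Atlas Vector Search index on one vector field
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,                 # hypothetical field holding the embedding
                "numDimensions": dimensions,  # must match the embedding model's output size
                "similarity": similarity,     # euclidean, cosine, or dotProduct
            }
        ]
    }

def build_vector_search_pipeline(query_vector, index_name="vector_index",
                                 path="embedding", limit=5, num_candidates=100):
    # Aggregation pipeline: nearest-neighbor search, then keep text and score
    return [
        {
            "$vectorSearch": {
                "index": index_name,          # hypothetical index name
                "path": path,
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {"$project": {"_id": 0, "text": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]

pipeline = build_vector_search_pipeline([0.1] * 1536, limit=3)
```

With PyMongo, a pipeline like this would typically be passed to `collection.aggregate(pipeline)` on a collection whose documents store their embedding in the indexed field.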
OpenAI
OpenAI specializes in AI for natural language processing and offers the Embeddings API, a tool for generating document embeddings.
Example request:
curl https://api.openai.com/v1/embeddings \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "The food was delicious and the waiter...",
    "model": "text-embedding-ada-002",
    "encoding_format": "float"
  }'
Response:
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [
        0.0023064255,
        -0.009327292,
        .... (1536 floats total for ada-002)
        -0.0028842222
      ],
      "index": 0
    }
  ],
  "model": "text-embedding-ada-002",
  "usage": {
    "prompt_tokens": 8,
    "total_tokens": 8
  }
}
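The same request can be sketched in Python using only the standard library. The helper names below are hypothetical; the URL, headers, and payload fields mirror the curl example, and the parsing follows the response shape shown above. Actually sending the request requires a valid API key and network access, so that part is left as a comment.

```python
import json
import urllib.request

OPENAI_EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings"

def build_embedding_request(text, api_key, model="text-embedding-ada-002"):
    # Same JSON body as the curl example
    payload = {"input": text, "model": model, "encoding_format": "float"}
    return urllib.request.Request(
        OPENAI_EMBEDDINGS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def extract_embedding(response_body):
    # The vector lives under data[0].embedding in the response
    return response_body["data"][0]["embedding"]

# Sending the request (needs a real key and network access):
# request = build_embedding_request("The food was delicious...", api_key)
# with urllib.request.urlopen(request) as response:
#     vector = extract_embedding(json.load(response))
```

The resulting list of floats is what gets stored alongside each document so that it can be searched by similarity.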
Pricing | OpenAI
Embedding models: Build advanced search, clustering, topic modeling, and classification functionality with our embeddings offering.
https://openai.com/api/pricing
Model | Pricing | Pricing with Batch API* |
text-embedding-3-small | $0.020 / 1M tokens | $0.010 / 1M tokens |
text-embedding-3-large | $0.130 / 1M tokens | $0.065 / 1M tokens |
ada v2 | $0.100 / 1M tokens | $0.050 / 1M tokens |
*Batch API pricing requires requests to be submitted as a batch.
What are tokens in OpenAI, and how do you count them?
Tokens are pieces of words. Before processing, the input is split into tokens, which may include trailing spaces or sub-words. Here are some examples:
Input | Approx. # of tokens | Pricing with ada v2 |
4 chars in English | 1 token | $0.0000001 |
¾ of a word | 1 token | $0.0000001 |
75 words | 100 tokens | $0.00001 |
1-2 sentences | 30 tokens | $0.000003 |
1 paragraph | 100 tokens | $0.00001 |
1,500 words | 2048 tokens | $0.0002048 |
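The prices in the table follow from simple arithmetic: tokens multiplied by the per-million-token rate, divided by one million. A minimal sketch, using the rates listed on the pricing slide (the function and dictionary names are hypothetical):

```python
# Standard (non-batch) rates in USD per 1M tokens, from the pricing slide
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.020,
    "text-embedding-3-large": 0.130,
    "ada v2": 0.100,
}

def embedding_cost(num_tokens, model="ada v2"):
    # Cost in USD for embedding num_tokens tokens with the given model
    return num_tokens * PRICE_PER_MILLION_TOKENS[model] / 1_000_000

print(f"{embedding_cost(1):.7f}")      # 0.0000001
print(f"{embedding_cost(100):.7f}")    # 0.0000100
print(f"{embedding_cost(2048):.7f}")   # 0.0002048
```

Prices change over time, so the rates in the dictionary should be checked against the current pricing page before relying on the numbers.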
Pricing | MongoDB Atlas Vector Search
For this demo, we will use an Atlas M0 cluster (free tier).
* Vector Search indexes are different from database indexes. The tier limit on search indexes does not apply to standard database indexes.
Part 2: Hands-on!
Curious to learn more?
Toronto MongoDB User Group: https://www.meetup.com/toronto-mongodb-usergroup
Connect with MUGs From All Over the World: https://mdb.link/mug