LESSON 12: Agents + Python

LESSON 11: Agents + Data

Mapping of the video evaluation process

[Flowchart from the original slide, roughly: Google Forms → XLSX file (column F, second row: video links) → evaluation criteria → evaluations → report → XLSX file → SITE page]

Data acquisition

Encodings

Chunks
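
The chunking details live in the deck's figures; as a minimal sketch, a fixed-size character chunker with overlap might look like this (chunk_size and overlap are arbitrary assumed values, not taken from the slides):

# Sketch: fixed-size character chunking with overlap, a common way to
# split documents before embedding. Parameter values are assumptions.
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

# A 1200-character document yields chunks starting at 0, 450, and 900.
print(len(chunk_text("x" * 1200)))  # 3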

Word embeddings are numerical representations of a word's meaning. They are formed based on the assumption that meaning is contextual; that is, a word's meaning is dependent on its neighbors.

Creating vector embeddings

Embeddings translate the complexities of human language into a format that computers can understand. An embedding model uses neural networks to assign numerical values to the input data, in such a way that similar data gets similar values.
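
As a minimal sketch of that idea (assuming the sentence-transformers package and the all-MiniLM-L6-v2 model, neither of which is named in the slides):

# Sketch: turning sentences into vectors with a pretrained embedding
# model. Library and model name are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "A kitten rested on the rug.",
    "Stock prices fell sharply.",
]
embeddings = model.encode(sentences)  # one 384-dimensional vector each

# Similar sentences get similar vectors:
print(util.cos_sim(embeddings[0], embeddings[1]))  # relatively high
print(util.cos_sim(embeddings[0], embeddings[2]))  # relatively low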

What is a Word Embedding?

A word embedding is a language modeling technique for mapping words to vectors of real numbers. It represents words or phrases in a vector space with several dimensions. Word embeddings can be generated using various methods, like neural networks, co-occurrence matrices, probabilistic models, etc. Word2Vec consists of models for generating word embeddings. These models are shallow neural networks with one input layer, one hidden layer, and one output layer.

Word2Vec is a widely used method in natural language processing (NLP) that represents words as vectors in a continuous vector space. Developed by researchers at Google, Word2Vec maps words to high-dimensional vectors in order to capture the semantic relationships between them; its main principle is that words with similar meanings should have similar vector representations. Word2Vec utilizes two architectures: CBOW (Continuous Bag of Words), which predicts a word from its surrounding context, and Skip-gram, which predicts the surrounding context from a word.
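
A minimal training sketch, assuming the gensim library (which the slides do not name); the toy corpus and hyperparameters are illustrative:

# Sketch: training a tiny Word2Vec model with gensim (an assumed library).
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "in", "the", "city"],
    ["the", "woman", "walks", "in", "the", "city"],
]

# sg=0 selects the CBOW architecture; sg=1 would select Skip-gram.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)

vector = model.wv["king"]                    # the 50-dimensional vector
print(model.wv.most_similar("king", topn=2))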

Word embeddings are represented as mathematical vectors. This representation makes it possible to perform standard mathematical operations on words, like addition and subtraction.
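
The classic illustration is king - man + woman ≈ queen. A sketch with a pretrained model, assuming gensim's downloader API (not something the slides show):

# Sketch: word arithmetic on pretrained vectors. The gensim downloader
# and the model name are assumptions; the download is large on first use.
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")

# king - man + woman ~ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# expected top result: ('queen', ...)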

Embeddings

What are Vector Embeddings?

Embeddings are numerical machine learning representations of the semantics of the input data. They capture the meaning of complex, high-dimensional data, like text, images, or audio, in vectors, enabling algorithms to process and analyze the data more efficiently.

How do embeddings work?

The quality of the vector representations drives downstream performance; the embedding model that works best for you depends on your use case.

Word Embeddings: Example

Neural Networks

Word2Vec: the nuances of the word "right" are blended into a single vector, regardless of context. BERT: the embedding captures the use of a word in its surroundings, so it can understand the context.

A Dummy’s Guide to Word2Vec

FAISS

Scenario

Two file names are defined: denso_vectors_file, to store the dense vectors, and faiss_index_file, to store the FAISS index.
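
A minimal sketch of that setup (the concrete file names, array shapes, and the use of numpy to hold the dense vectors are assumptions; faiss.write_index is FAISS's standard persistence call):

# Sketch: persisting dense vectors and a FAISS index to the two files
# named on the slide. Data and shapes are illustrative assumptions.
import numpy as np
import faiss

denso_vectors_file = "denso_vectors.npy"
faiss_index_file = "faiss.index"

d = 384                                        # embedding dimension (assumed)
vectors = np.random.rand(1000, d).astype("float32")  # stand-in embeddings

index = faiss.IndexFlatL2(d)
index.add(vectors)

np.save(denso_vectors_file, vectors)           # store the dense vectors
faiss.write_index(index, faiss_index_file)     # store the FAISS index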

IndexFlatL2

It measures the L2 (or Euclidean) distance between our query vector and every vector loaded into the index.

It is simple and very accurate, but not very fast.
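
A self-contained sketch of an exhaustive IndexFlatL2 search (data and dimensions are illustrative):

# Sketch: brute-force L2 search. Every stored vector is compared to the
# query, which is exact but costs O(n) per query.
import numpy as np
import faiss

d = 384
vectors = np.random.rand(1000, d).astype("float32")

index = faiss.IndexFlatL2(d)
index.add(vectors)

query = np.random.rand(1, d).astype("float32")
k = 4
distances, ids = index.search(query, k)
print(ids[0])        # positions of the 4 nearest vectors
print(distances[0])  # their squared L2 distances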

More methods: speed or accuracy?

Even when one method is faster than another, we should also take note of the slightly different results returned. Earlier, with our exhaustive L2 search, we were getting back 7460, 10940, 3781, and 5747. Now we see a slightly different order of results, and two different IDs: 5013 and 5370.
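
The faster method's code isn't reproduced in this transcript, but a common FAISS speed-for-accuracy trade is an IVF index, sketched below (nlist and nprobe are illustrative values):

# Sketch: an approximate IVF index. It clusters the vectors and searches
# only the nprobe nearest clusters, which is why its results can differ
# slightly from the exhaustive search above.
import numpy as np
import faiss

d = 384
vectors = np.random.rand(1000, d).astype("float32")
query = np.random.rand(1, d).astype("float32")

nlist = 20                                     # number of clusters (assumed)
quantizer = faiss.IndexFlatL2(d)               # coarse quantizer
ivf = faiss.IndexIVFFlat(quantizer, d, nlist)

ivf.train(vectors)                             # learn the cluster centroids
ivf.add(vectors)

ivf.nprobe = 4                                 # clusters visited per query
distances, ids = ivf.search(query, 4)          # approximate neighbors
print(ids[0])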

ChromaDB
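
The ChromaDB slides are figures in the original deck; as a minimal sketch of the library's basic flow (the collection name, documents, and query text are illustrative):

# Sketch: the basic ChromaDB flow: create a collection, add documents,
# query by text. ChromaDB embeds the documents with its default
# embedding function. All names and data here are assumptions.
import chromadb

client = chromadb.Client()                     # in-memory client
collection = client.create_collection(name="videos")

collection.add(
    documents=["a video about word embeddings", "a video about FAISS"],
    ids=["vid1", "vid2"],
)

results = collection.query(query_texts=["vector search index"], n_results=1)
print(results["ids"])                          # e.g. [['vid2']]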

LESSON 13: Chatbots + content