PyBay - Saturday, September 21, 2024
Thinking of Topic Modeling as Search
The social network for activism
Shoutout to the team
Unified
Shion Deysarkar
CEO & Co-Founder
Prev. founded Datafiniti, grew sales to $3M+
Deployed relational organizing software to 40+ campaigns
CMU CompSci, Student President @ Rice MBA
Kas Stohr
Lead Data Scientist
Led social listening for Simon & Schuster
Led non-profit that reached 1.2 million people
Jeremy Smith
Co-Founder
CEO @ Civitech ($63M valuation)
Registered 1M+ voters via Register2Vote
Francisco Jimenez
Lead Mobile Developer
Developer of organizing app for 2,000+ volunteers
10 yrs leading app development for Shion
Brian T. Smith
Head of Partnerships
22 yrs campaigns & government relations
Election-winner and legislation-crafter
Jack Klika
Lead Backend Engineer
Led scaling for Foxconn’s AI systems
Enumerator for US census
Madelyn Morneault
Events Manager
PM, Austin Indie Film Fest
Co-host, Austin Women’s march
Ben Magos
Lead Web Developer
Led web and mobile development at interactive media firm
Olivia Pasion
Web Developer
Web developer for multiple projects
Also:
Simon (dog)
Linux (cat)
(cat)
FEATURE OBJECTIVES
Track trending topics in real-time
Suggest organizations and users by topic
Why are we even talking about topic modeling "as search?"
ENGINEERING CHALLENGE
Retrieve and Rank Posts
Typically, you'd use a topic model to classify new posts and assign them to a topic. However, “tagging” posts every time the topics change does not scale.
Why are we even talking about topic modeling "as search?"
Me: Oh wait…You want to do this on the fly????
Constraints
Benefits of “searching” instead of “tagging”
Brief refresher on embeddings
To build a topic model, you typically start by converting your documents into text embeddings; for longer documents, you first embed individual sentences and aggregate them.
Definition: Embeddings are numerical representations of text that capture semantic meaning. Similar items are positioned closer to one another than less similar items in the “embedding space.” In the case of text, sentences that are semantically similar should have similar embedded vectors and thus be closer together in the space.
Why It Matters: They enable us to perform various NLP tasks, including similarity search and clustering using metrics such as Euclidean distance and cosine similarity.
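For instance, a minimal sketch with the sentence-transformers package (using all-MiniLM-L6-v2, the same model that appears later in this talk; the example sentences are illustrative):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["Afternoon cat", "Meow", "Fair wages for rideshare drivers"])

# Semantically related sentences land closer together in embedding space,
# so their cosine similarity is higher.
print(util.cos_sim(emb[0], emb[1]))  # cat post vs. cat post: higher
print(util.cos_sim(emb[0], emb[2]))  # cat post vs. labor post: lower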
| | all | about | cat | afternoon | meow |
| All about that cat. | 1 | 1 | 1 | 0 | 0 |
| Afternoon cat 🐈‍⬛ | 0 | 0 | 1 | 1 | 0 |
| Meow 🐈‍⬛🐈‍⬛ | 0 | 0 | 0 | 0 | 1 |
Document-term matrix ("Bag of Words" approach)
Each word is a vector. Reading down the columns of the document-term matrix, each word's column is a "sparse vector":
all = [1, 0, 0]
about = [1, 0, 0]
cat = [1, 1, 0]
With ~30,000 English words, these sparse vectors get unwieldy, so reduce the vocabulary and apply dimensionality reduction to create a "dense vector":
all = [0.43776, 0.00021, 0.00744]
about = [0.35821, 0.00011, 0.00213]
cat = [0.75543, 0.34675, 0.68345]
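The sparse representation can be reproduced with scikit-learn (a sketch; pinning the vocabulary to the table's columns keeps the output identical to the matrix above):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["All about that cat.", "Afternoon cat", "Meow"]
# Fix the vocabulary to the table's columns to reproduce it exactly
vectorizer = CountVectorizer(binary=True,
                             vocabulary=["all", "about", "cat", "afternoon", "meow"])
dtm = vectorizer.fit_transform(docs)

print(dtm.toarray())
# [[1 1 1 0 0]
#  [0 0 1 1 0]
#  [0 0 0 0 1]]
# Each column, read top to bottom, is a word's sparse vector:
# all = [1, 0, 0], about = [1, 0, 0], cat = [1, 1, 0]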
Python packages: SentenceTransformers, spaCy, FastText, Hugging Face Transformers, Gensim (Word2Vec, Doc2Vec), etc.
Generating document embeddings
Preprocessed Text -> Sentence Embedding (dense vectors) -> Document Embedding (dense vector) -> [vector store]
import numpy as np
import spacy
from sentence_transformers import SentenceTransformer
from spacy.tokens import Doc

async def preprocess_sentences(text: str) -> list[str]:
    """
    Splits the text into sentences. Pre-processes each sentence. Returns a list
    of cleaned sentences.
    """
    # Use spaCy to segment text into sentences
    nlp = spacy.load("en_core_web_sm")
    doc: Doc = nlp(text)
    # Preprocess each sentence (preprocess_text is defined elsewhere)
    sentences = [preprocess_text(sent.text) for sent in doc.sents]
    # Return list of cleaned sentences
    return sentences

async def create_doc_embedding(sentences: list[str]) -> list[float]:
    """
    Creates a document embedding by taking the mean of each sentence embedding.
    """
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(sentences)
    # The mean of the sentence vectors is representative of the
    # document as a whole. Each sentence is given equal weight.
    doc_embedding = np.mean(embeddings, axis=0).tolist()
    # Return document embedding
    return doc_embedding
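Chained together, following the pipeline above (a sketch; preprocess_text is assumed to exist elsewhere in the codebase):

import asyncio

async def embed_post(text: str) -> list[float]:
    sentences = await preprocess_sentences(text)
    return await create_doc_embedding(sentences)

doc_embedding = asyncio.run(embed_post("All about that cat. Afternoon cat. Meow."))
print(len(doc_embedding))  # 384 dimensions for all-MiniLM-L6-v2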
Choosing an embedding model
Short Text (Word, Phrase, Sentence)
Best Models: Word2Vec, Sentence-BERT, Universal Sentence Encoder (USE), BERT, FastText
Key Features:
- capture sentence-level semantics and context
- good for short or medium-length texts
- smaller in size (production considerations)
Long Text (Paragraph, Document)
Best Models: Longformer, Doc2Vec, T5, XLNet, Transformer-based models (BERT, GPT/LLMs for large contexts)
Key Features:
- designed or adapted for handling longer documents (LLM contexts up to 128k tokens)
- leverage attention mechanisms or hierarchical encoding to maintain contextual relevance
Storing embeddings
To speed things up, my first thought was to pre-compute these embeddings and cache them each time a post was made.
Vector Store Options:
https://redis.io/docs/latest/develop/interact/search-and-query/advanced-concepts/vectors/
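For the Redis option linked above, a cache-and-search sketch might look like this (redis-py syntax; the index name, key prefix, field names, and the stand-in vectors are my own placeholders):

import numpy as np
import redis
from redis.commands.search.field import VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

r = redis.Redis()

# One-time: index posts by their embedding (384 dims = all-MiniLM-L6-v2)
r.ft("idx:posts").create_index(
    fields=[VectorField("doc_embedding", "HNSW",
                        {"TYPE": "FLOAT32", "DIM": 384, "DISTANCE_METRIC": "COSINE"})],
    definition=IndexDefinition(prefix=["post:"], index_type=IndexType.HASH),
)

# On each new post: cache the pre-computed embedding as raw bytes
embedding = np.random.rand(384).astype(np.float32)  # stand-in for a real doc embedding
r.hset("post:123", mapping={"doc_embedding": embedding.tobytes()})

# Later: k-NN search for the 10 most similar posts
q = Query("*=>[KNN 10 @doc_embedding $vec AS score]").sort_by("score").dialect(2)
hits = r.ft("idx:posts").search(q, query_params={"vec": embedding.tobytes()})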
Storing embeddings
… Storing embeddings as dense vectors means you can also search for similar embeddings…
Database Options:
Me:
Tell me more…
Searching for topics
New plan: instead of tagging documents and maintaining those tags, use embedding retrieval to fetch documents similar to a given topic defined by my topic model.
Posts into structured documents
Posts are media-rich, unstructured text. Before generating an embedding, each post must be converted into a structured text document.
{
"post_id": "4162a0db-6eab-423b-94f8-8eb452bec7db",
"post_author": "72",
"modified_at": "2023-10-18 18:07:58.968690+00:00",
"created_at": "2023-10-18 18:07:58.907013+00:00",
"deleted_at": null,
"body": {
"post_text": "Some pretty cool and amazing pictures generated by ChatGPT",
"links": null,
"assets": [
{
"image_url": "https://ufd-prod-asset-uploads.s3.amazonaws.com/20231018/cf7afe2dd420bfd5f1c608cf20ef2018.jpg",
"caption": "a man and a woman sitting at a table with a cat ",
"model": "nlpconnect/vit-gpt2-image-captioning",
},
{
"image_url": "https://ufd-prod-asset-uploads.s3.amazonaws.com/20231018/2db56233c21dabe363b25e0d652d325c.jpg",
"caption": "a man and a woman sitting at a table with a cat ",
"model": "nlpconnect/vit-gpt2-image-captioning",
},
"actions": null,
"mentions": null,
"entities": [
{
"text": "chatgpt",
"start_char": 51,
"end_char": 58,
"label": "ORG"
}
],
"hashtags": null
},
"doc_embedding": {
"embedding": [
-0.031969476491212845,
-0.002114212140440941,
0.029350850731134415,
...
],
"model": "all-MiniLM-L6-v2",
"embedding_type": "doc"
}
}
[Architecture diagram]
Write path: API /v1/post -> Controller Logic (awaited) -> Postgres; then Post Processing (background task) -> MLOPS API /v1/process_post -> NLP Tasks + Moderation Tasks -> Elastic (Vector Store)
Search path: API /v1/post/search -> MLOPS API /v1/search_posts -> Generate Embedding -> Search PostDocs
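A sketch of how the write path could be wired up (assuming FastAPI and httpx, which the talk doesn't name; save_post and the MLOps URL are hypothetical):

import httpx
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()

async def notify_mlops(post_id: str) -> None:
    # Kick off NLP + moderation processing without blocking the response
    async with httpx.AsyncClient() as client:
        await client.post("http://mlops/v1/process_post", json={"post_id": post_id})

@app.post("/v1/post")
async def create_post(post: dict, background_tasks: BackgroundTasks) -> dict:
    post_id = await save_post(post)  # hypothetical: awaited write to Postgres
    background_tasks.add_task(notify_mlops, post_id)
    return {"post_id": post_id}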
class PostToPostDoc:
    ...
    async def process(self):  # illustrative method name for this excerpt
        await self.init_doc()
        tasks = [
            self.process_child_posts(),  # -> extract text
            self.process_links(),        # -> fetch metadata, url, extract text
            self.process_assets(),       # -> image -> vision model -> caption text
            self.process_actions(),      # -> fetch officials, office -> action text
            self.process_mentions(),     # -> extract text (excluded from embedding, PII)
            self.process_entities(),     # -> extract entities -> NER model -> entity objects
            self.process_hashtags(),     # -> extract text
        ]
        results = await asyncio.gather(*tasks)
        await self.generate_doc_embedding()  # tokenize -> embedding model
        self.store_es()
        gc.collect()
    ...
Training the topic model
Document Embeddings [vector store] -> [topic model] -> Topic Embeddings -> [topic cache]
Example Model pipeline:
*Credit: https://github.com/MaartenGr/BERTopic By Maarten Grootendorst
c-TF-IDF: https://www.maartengrootendorst.com/blog/ctfidf/
👀
K8s Workflow [Temporal]:
pre-process text
generate embeddings
reduce dimensionality
cluster
count vectorization
c-TF-IDF
label topics (LLM)
generate topic embeddings
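A minimal sketch of this pipeline using BERTopic's standard components (parameter values are illustrative, not the production settings; the LLM labeling step is omitted):

from bertopic import BERTopic
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

# reduce dimensionality -> cluster -> count vectorization -> c-TF-IDF
topic_model = BERTopic(
    umap_model=UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine"),
    hdbscan_model=HDBSCAN(min_cluster_size=10, metric="euclidean", prediction_data=True),
    vectorizer_model=CountVectorizer(stop_words="english"),  # feeds the c-TF-IDF step
)

# texts: pre-processed post documents; embeddings: the cached doc vectors
topics, probs = topic_model.fit_transform(texts, embeddings)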
Searching for documents
OpenSearch Example:
** Cosine similarity returns a number between -1 and 1, but because OpenSearch relevance scores can’t be below 0, the k-NN plugin adds 1 to get the final score. [scale: 0, 2]
Reference: https://opensearch.org/docs/latest/search-plugins/knn/knn-score-script/
from typing import List, Optional

import numpy as np

# Method excerpt from a search client class; self.opensearch_client is an
# opensearch-py client and self.index_name the target index.
def search_similar_documents(self, embedding: np.ndarray, top_k: int = 1000,
                             filters: Optional[List[dict]] = None) -> List[dict]:
    """Search for similar documents in OpenSearch using an embedding."""
    filters = filters or []
    query = {
        "size": top_k,
        "query": {
            "bool": {
                "must": [
                    {
                        "script_score": {
                            "query": {"match_all": {}},
                            "script": {
                                "source": "knn_score",
                                "lang": "knn",
                                "params": {
                                    "field": "doc_embedding",
                                    "query_value": embedding.tolist(),
                                    "space_type": "cosinesimil",
                                },
                            },
                        }
                    }
                ],
                "filter": filters,  # Apply filters if provided
            }
        },
        "sort": [{"_score": {"order": "desc"}}],
    }
    response = self.opensearch_client.search(index=self.index_name, body=query)
    hits = response["hits"]["hits"]
    results = [{"score": hit["_score"], **hit["_source"]} for hit in hits]
    return results
Initial Results…
Given the same data over the same period of time, if you limit the number of posts returned by the search to the number of posts assigned to the topic, the majority of those results should be posts that were assigned to the topic. Similarity scores should be higher for the posts the model assigned to the topic than for posts it did not.
Mean search similarity score for all topics: 0.701 [scale: 0, 2]
Mean ratio of posts assigned to topic returned in results for all topics: 0.013
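A sketch of how that ratio can be computed per topic (topic_recall is a hypothetical helper; searcher stands in for an instance of the class holding search_similar_documents above):

import numpy as np

def topic_recall(topic_embedding: np.ndarray, assigned_ids: set[str]) -> float:
    """Ratio of a topic's assigned posts returned by the vector search
    when capped at the same number of results."""
    results = searcher.search_similar_documents(topic_embedding, top_k=len(assigned_ids))
    returned_ids = {r["post_id"] for r in results}
    return len(returned_ids & assigned_ids) / len(assigned_ids)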
Initial Results…
Example (Topic Vector Search)
Topic: Cats, cats, cats
post_text | post_link |
🎉 Congratulations to the Minnesota House and Senate for passing legislation to ensure fair wages for Lyft and Uber drivers! | |
Gathering signatures for Ohio Reproductive Rights Ballot Initiative! | |
📢SEE YOU AT #NN23! Heading to Chicago TOMORROW for @Netroots_Nation! | |
👋🏼 ✊🏼 Here for the first time from on Twitter. What are folks using Unified for? Is it useful? | |
📢CALLING ALL AUSTIN #ACTIVISTS! Join Unified on April 28th for our first Monthly Meet-up! | |
2023 has been a hell of a year for my organization, New Sun Coalition. | |
Very cool to see several Unifiers on the TX Progressive Caucus endorsement list! .. did I miss anybody?? https://www.texasprogressivecaucus.org/party_endorsements_2024 | |
📆 4 months until Unified Jam! 🎉 Let the countdown to this year's epic celebration of activism culture begin! | |
🚨BREAKING NEWS!!! My Rep Lloyd Doggett became the FIRST Democrat to call for Biden to withdraw from the presidential race. 👀So what do you think? Should Biden withdraw? |
Me:
Why?
Poorly fit topic model? Well, yes… but…
coherence_score: 0.450
topic_diversity: 0.0196
Why?
Mean Vector vs. Centroid
[Diagram: a topic's documents in embedding space, with two candidate search areas. "Topic Search (mean)" 👎 centers the search area on the mean vector, away from the bulk of the topic; "Localized Topic Search (centroid)" 👍 centers it on the cluster's actual center.]
Improving search results with 'localized' topic embeddings
Instead of computing the topic embedding from the mean of all document embeddings assigned to the topic, use an approach that better approximates the “center” of the topic.
[topic model] -> assigned documents -> c-TF-IDF -> keywords -> keyword embedding
Options (depends on use case):
- Keywords -> Keyword Embedding (observability, short texts)
- Cosine similarity -> Representative Documents -> Representative Embedding (longer texts; see the sketch below)
- Documents -> Summary -> Summary Embedding (large context window)
Source: https://github.com/MaartenGr/BERTopic By Maarten Grootendorst
c-TF-IDF: https://www.maartengrootendorst.com/blog/ctfidf/
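For the representative-documents option, BERTopic already tracks a few representative documents per topic; here's a sketch of turning them into a topic embedding (the mean-pooling is my assumption):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def representative_embedding(topic_model, topic_id: int) -> np.ndarray:
    # BERTopic selects representative docs via similarity to the topic's c-TF-IDF
    rep_docs = topic_model.get_representative_docs(topic_id)
    # Average their embeddings to approximate the topic's "center"
    return np.mean(model.encode(rep_docs), axis=0)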
Create Keyword Embeddings (pseudo code)
"""
Calculate keywords for each topic using TFI-DF or (c-TF-IDF)
"""
# Generate topics
topics, probs = topic_model.fit(embeddings, texts)
# Identify texts associated with each topic
topic_assignments = list(zip(topics, texts)
# Group the documents by topic
for topic, text in topic_assignments:
if topic not in topic_dict:
topic_dict[topic] = []
topic_dict[topic].append(text)
# Extract keywords from TF-IDF matrix
vectorizer = TfidfVectorizer(max_df=0.95, min_df=0.01)
def get_top_words(tfidf_matrix, top_n=10):
sum_tfidf = tfidf_matrix.sum(axis=0)
tfidf_scores = [
(feature, sum_tfidf[0, idx]) for idx, feature in enumerate(vectorizer.get_feature_names_out())
]
sorted_scores = sorted(tfidf_scores, key=lambda x: x[1], reverse=True)
return sorted_scores[:top_n]
# Apply TF-IDF on the documents for each topic
topic_kw_embeddings = {}
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
for topic, docs in topic_dict.items()
tfidf_matrix = vectorizer.fit_transform(docs)
# Get feature names (words)
top_words = get_top_words(tfidf_matrix, top_n=10)� words = [word[0] for word in top_words]
# Return keyword embeddings
topic_kw_embeddings[topic] = embedding_model.encode(words)
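To actually query with these, one option (my assumption; this step isn't shown in the talk) is to average a topic's keyword vectors into a single search vector:

import numpy as np

# Collapse the per-keyword vectors into one topic vector, then reuse the
# OpenSearch helper from earlier (searcher is a hypothetical instance).
topic_vector = np.mean(topic_kw_embeddings[topic], axis=0)
results = searcher.search_similar_documents(topic_vector, top_k=1000)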
Updated Results…
Using keyword embeddings resulted in a greatly improved topic "match" ratio, but, interestingly, a much lower overall search similarity score.
Mean search similarity score for all topics: 0.449 [scale: 0, 2]
Mean ratio of posts assigned to topic returned in results for all topics: 0.243
Updated Results…
Example (Keyword Vector Search)
post_text | post_link |
Afternoon cats ☀️ 🎃 🐈⬛ | |
Toasty cat ☀️ | |
Living the life 🐈⬛🐈⬛ | |
| |
Meow | |
Embrace the tranquility of the weekend. It’s a perfect time to sort our thoughts, seek peace through meditation, and rejuvenate our minds. #WeekendMindfulness #InnerPeace 🌱🕉️ #cattax | |
Rockin’ the Unified ! | |
Good morning ☀️ I hope yall have a wonderful weekend! Cat tax attached. | |
Some pretty cool and amazing pictures generated by ChatGPT | |
Have a good Sunday yall ☀️ |
Ongoing Evaluation
Key success metric:
Mean ratio of posts assigned to topic returned by search for all topics
Model cards:
Monitoring:
Observability:
Conclusion
GitHub Repository
References & Credits
Thanks go to Maarten Grootendorst for his work and excellent documentation on BERTopic, as well as to colleagues at Unified and peer coder Ray 'Urgent' McLendon for his interest and input.
Text Embeddings:
https://stackoverflow.blog/2023/11/09/an-intuitive-introduction-to-text-embeddings/
BERTopic - Package for topic modeling by Maarten Grootendorst
https://github.com/MaartenGr/BERTopic
Comparing Clustering Algorithms (HDBSCAN)
https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html
c-TF-IDF
https://www.maartengrootendorst.com/blog/ctfidf/
Vector Search
https://towardsdatascience.com/text-search-vs-vector-search-better-together-3bd48eb6132a
Hybrid Search
https://machine-mind-ml.medium.com/enhancing-llm-performance-with-vector-search-and-vector-databases-1f20eb1cc650
🤓
Thoughts? Suggestions?
Contact me: