1 of 29

PyBay - Saturday, September 21, 2024

Thinking of Topic Modeling as Search

2 of 29

The social network for activism

3 of 29

Shoutout to the team

Unified

Shion Deysarkar

CEO & Co-Founder

Prev. founded Datafiniti, grew sales to $3M+

Deployed relational organizing software to 40+ campaigns

CMU CompSci, Student President @ Rice MBA

Kas Stohr

Lead Data Scientist

Led social listening for Simon & Schuster

Led non-profit that reached 1.2 million people

Jeremy Smith

Co-Founder

CEO @ Civitech ($63M valuation)

Registered 1M+ voters via Register2Vote

Francisco Jimenez

Lead Mobile Developer

Developer of organizing app for 2000+ vols

10 yrs leading app development for Shion

Brian T. Smith

Head of Partnerships

22 yrs campaigns & government relations

Election-winner and legislation-crafter

Jack Klika

Lead Backend Engineer

Led scaling for Foxconn’s AI systems

Enumerator for US census

Madelyn Morneault

Events Manager

PM, Austin Indie Film Fest

Co-host, Austin Women’s march

Ben Magos

Lead Web Developer

Led web and mobile development at interactive media firm

Olivia Pasion

Web Developer

Web developer for multiple projects

Also:

Simon (dog)

Linux (cat)

(cat)

4 of 29

FEATURE OBJECTIVES

Track trending topics in real-time

  • ever-changing
  • highly influenced by news events
  • tend to focus on named entities such as people or places (politics!)

Suggest organizations and users by topic

  • Most users do not post (80:20 Pareto principle)
  • Help new users discover people near them
  • Existing users follow a topic or organization

Why are we even talking about topic modeling "as search?"

5 of 29

ENGINEERING CHALLENGE

Retrieve and Rank Posts

We would typically use a topic model to classify new posts and assign them to a topic. However, “tagging” posts every time the topics change does not scale.

  • Mobile app needs to be performant
  • Posts accumulate topic tags over time
  • Tagging alone does not rank posts by relevance
  • How do we track, merge, and deprecate topics over time?

Why are we even talking about topic modeling "as search?"

Me: Oh wait…You want to do this on the fly????

6 of 29

Constraints

  • Small team (1 person)
  • Limited to infrastructure in place (or what I could stand up with limited support)
  • Must be easy to maintain (1 person!!!)
  • Easy to evaluate, track model drift and improve over time.
  • MVP - feature enhancements expected, design with backwards compatibility in mind

7 of 29

Benefits of “searching” instead of “tagging”

  • Use the topic model to discover topics, but not to classify documents. More forgiving: the model does not have to perform well at classification.

  • Simplify and improve performance for storing and retrieving documents related to a topic in production environments. No tagging!

  • Capture topics related to fast-moving, evolving conversations

  • Hybrid Search (i.e. #hashtags AND topic embedding)

  • Allow user-generated topics (personalization)

  • Anticipate topics -- Yep, you can create a topic that does not yet exist. For example, you may want to create a topic related to an upcoming event, such as say, the "2024 Presidential Election."

8 of 29

Brief refresher on embeddings

To build a topic model, you typically start by converting your documents into text embeddings, and for longer documents, sentence embeddings.

Definition: Embeddings are numerical representations of text that capture semantic meaning. Similar items are positioned closer to one another than less similar items in the “embedding space.” In the case of text, sentences that are semantically similar should have similar embedded vectors and thus be closer together in the space.

Why It Matters: They enable us to perform various NLP tasks, including similarity search and clustering using metrics such as Euclidean distance and cosine similarity.

Document-term matrix ("Bag of Words" approach):

                        all   about   cat   afternoon   meow
All about that cat.      1      1      1        0         0
Afternoon cat 🐈‍⬛         0      0      1        1         0
Meow 🐈‍⬛🐈‍⬛               0      0      0        0         1

Each word is a vector:

all = [1, 0, 0]
about = [1, 0, 0]
cat = [1, 1, 0]

~30,000 English words => “sparse vector.” Reduce vocabulary and use dimensionality reduction to create a “dense vector”:

all = [0.43776, 0.00021, 0.00744]
about = [0.35821, 0.00011, 0.00213]
cat = [0.75543, 0.34675, 0.68345]

Python packages: SentenceTransformers, spaCy, FastText, Hugging Face, Word2Vec, Doc2Vec, etc.
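For example, a minimal sketch using SentenceTransformers (the model and sentences are illustrative):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["All about that cat.", "Afternoon cat"], convert_to_tensor=True)

# Cosine similarity close to 1.0 means the two sentences are semantically similar
print(float(util.cos_sim(embeddings[0], embeddings[1])))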

9 of 29

Generating document embeddings

Preprocessed Text -> Sentence Embedding (dense vectors) -> Document Embedding (dense vector) -> [vector store]

import spacy
import numpy as np
from spacy.tokens import Doc
from sentence_transformers import SentenceTransformer


async def preprocess_sentences(text: str) -> list[str]:
    """
    Splits the text into sentences, pre-processes each sentence, and returns a
    list of cleaned sentences.
    """
    # Use spaCy to segment text into sentences
    nlp = spacy.load("en_core_web_sm")
    split_text: Doc = nlp(text)
    # Preprocess each sentence (preprocess_text is an app-specific cleaning helper)
    sentences = [preprocess_text(sent.text) for sent in split_text.sents]
    # Return list of sentences
    return sentences


async def create_doc_embedding(sentences: list[str]) -> list[float]:
    """
    Creates a document embedding by taking the mean of the sentence embeddings.
    """
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(sentences)
    # The mean of the sentence vectors represents the document as a whole.
    # Each sentence is given equal weight.
    doc_embedding = np.mean(embeddings, axis=0).tolist()
    # Return document embedding
    return doc_embedding
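Usage is a minimal sketch (the sample text is illustrative):

import asyncio

async def main():
    text = "All about that cat. Afternoon cat."  # illustrative post text
    sentences = await preprocess_sentences(text)
    doc_embedding = await create_doc_embedding(sentences)
    print(len(doc_embedding))  # 384 dimensions for all-MiniLM-L6-v2

asyncio.run(main())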

Choosing an embedding model

Short Text (Word, Phrase, Sentence)

Best Models: Word2Vec, Sentence-BERT, Universal Sentence Encoder (USE), BERT, FastText

Key Features:
  • capture sentence-level semantics and context
  • good for short or medium-length texts
  • smaller in size (production considerations)

Long Text (Paragraph, Document)

Best Models: Longformer, Doc2Vec, T5, XLNet, Transformer-based models (BERT, GPT/LLMs for large contexts)

Key Features:
  • designed or adapted for handling longer documents (LLM contexts => 128k)
  • leverage attention mechanisms or hierarchical encoding to maintain contextual relevance

10 of 29

Storing embeddings

To speed performance, my first thought was to pre-compute these embeddings and cache them each time a post was made.

Vector Store Options:

  • Redis

https://redis.io/docs/latest/develop/interact/search-and-query/advanced-concepts/vectors/

  • FAISS (python package developed by Meta) https://github.com/facebookresearch/faiss

  • Other Options: Pinecone, Weaviate (ChatGPT suggestion ¯\_(ツ)_/¯ )
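For example, a minimal FAISS sketch (the dimension, vectors, and k are illustrative; 384 matches all-MiniLM-L6-v2):

import faiss
import numpy as np

dim = 384
index = faiss.IndexFlatIP(dim)  # inner product; L2-normalized vectors => cosine similarity

doc_embeddings = np.random.rand(1000, dim).astype("float32")  # placeholder embeddings
faiss.normalize_L2(doc_embeddings)
index.add(doc_embeddings)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)  # top-10 most similar documents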

11 of 29

Storing embeddings

… Storing embeddings as dense vectors means you can also search for similar embeddings…

Database Options:

Me:

Tell me more…

12 of 29

Searching for topics

New plan: Instead of tagging documents and maintaining tags, use embedding retrieval to return documents similar to a given topic defined by the topic model.

13 of 29

Posts into structured documents

Posts are media-rich, unstructured text. Before generating an embedding, the post must be converted into a structured text document.

  • Images captioned
  • Video keyframes captioned
  • Hashtags extracted and handled
  • Mentions extracted and handled** (PII)
  • etc.

{
  "post_id": "4162a0db-6eab-423b-94f8-8eb452bec7db",
  "post_author": "72",
  "modified_at": "2023-10-18 18:07:58.968690+00:00",
  "created_at": "2023-10-18 18:07:58.907013+00:00",
  "deleted_at": null,
  "body": {
    "post_text": "Some pretty cool and amazing pictures generated by ChatGPT",
    "links": null,
    "assets": [
      {
        "image_url": "https://ufd-prod-asset-uploads.s3.amazonaws.com/20231018/cf7afe2dd420bfd5f1c608cf20ef2018.jpg",
        "caption": "a man and a woman sitting at a table with a cat",
        "model": "nlpconnect/vit-gpt2-image-captioning"
      },
      {
        "image_url": "https://ufd-prod-asset-uploads.s3.amazonaws.com/20231018/2db56233c21dabe363b25e0d652d325c.jpg",
        "caption": "a man and a woman sitting at a table with a cat",
        "model": "nlpconnect/vit-gpt2-image-captioning"
      }
    ],
    "actions": null,
    "mentions": null,
    "entities": [
      {
        "text": "chatgpt",
        "start_char": 51,
        "end_char": 58,
        "label": "ORG"
      }
    ],
    "hashtags": null
  },
  "doc_embedding": {
    "embedding": [
      -0.031969476491212845,
      -0.002114212140440941,
      0.029350850731134415,
      ...
    ],
    "model": "all-MiniLM-L6-v2",
    "embedding_type": "doc"
  }
}

14 of 29

[Architecture diagram, flattened]

Write path: API /v1/post -> Postgres

Controller logic (awaited):
  • Store assets
  • Store link metadata
  • Store posts

Post processing (background task): MLOPS API /v1/process_post
  • NLP tasks
  • Moderation tasks
  • Generate embedding
  -> Elastic (vector store)

Read path: API /v1/post/search -> MLOPS API /v1/search_posts -> Search PostDocs

class PostToPostDoc:
    ...
    async def process(self):  # illustrative method name wrapping the original snippet
        await self.init_doc()
        tasks = [
            self.process_child_posts(),  # -> extract text
            self.process_links(),        # -> fetch metadata, url, extract text
            self.process_assets(),       # -> image -> vision model -> caption text
            self.process_actions(),      # -> fetch officials, office -> action text
            self.process_mentions(),     # -> extract text (excluded from embedding, PII)
            self.process_entities(),     # -> extract entities -> NER model -> entity objects
            self.process_hashtags(),     # -> extract text
        ]
        results = await asyncio.gather(*tasks)
        await self.generate_doc_embedding()  # tokenize -> embedding model
        self.store_es()
        gc.collect()
        ...

15 of 29

Training the topic model

Document Embeddings [vector store] -> [topic model] -> Topic Embeddings -> [topic cache]

Example Model pipeline:

  1. Clean and preprocess text (Critical)
    • All text is “unstructured”; take the time to convert unstructured text into a structured format (🔥 Tip: Pydantic model).
    • Addresses? Remove numbers or “99” will become a topic. Convert emojis into text. Identify headlines and section headers, and use them to “chunk” your text into meaningful embeddings.
  2. Generate embeddings (SentenceTransformers, GPT, etc. See above.)
  3. Reduce dimensionality of embeddings (UMAP, PCA, etc.)
  4. Cluster embeddings into topics (K-Means, LDA, HDBSCAN, Agglomerative Clustering, etc.)
  5. Tokenize words in topics (CountVectorizer)
  6. Weight tokens with c-TF-IDF (class-based Term Frequency - Inverse Document Frequency)* (Critical)
  7. Optional: Label topics (OpenAI, LLM, KeyBERT)
  8. Convert documents assigned to each topic into a topic embedding (np.mean(assigned_doc_embeddings, axis=0).tolist()) — see the sketch below.

*Credit: https://github.com/MaartenGr/BERTopic By Maarten Grootendorst

c-TF-IDF: https://www.maartengrootendorst.com/blog/ctfidf/

👀 K8s Workflow [Temporal]: pre-process text -> generate embeddings -> reduce dimensionality -> cluster -> count vectorization -> c-TF-IDF -> label topics (LLM) -> generate topic embeddings
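A minimal sketch of steps 2-8 of this pipeline (BERTopic-style, following Grootendorst's approach; the model names and parameters are illustrative assumptions, not the production configuration):

import numpy as np
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

docs = ["..."]  # preprocessed, structured text (step 1)

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")            # step 2
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine")   # step 3
hdbscan_model = HDBSCAN(min_cluster_size=10, prediction_data=True)   # step 4
vectorizer_model = CountVectorizer(stop_words="english")             # step 5

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,  # c-TF-IDF weighting (step 6) is built in
)
topics, probs = topic_model.fit_transform(docs)

# Step 8: mean of the document embeddings assigned to each topic
doc_embeddings = embedding_model.encode(docs)
topic_embeddings = {
    t: np.mean(doc_embeddings[np.array(topics) == t], axis=0).tolist()
    for t in set(topics) if t != -1  # skip the HDBSCAN outlier topic
}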

16 of 29

Searching for documents

OpenSearch Example:

  • Map the field to a dense vector (knn_vector) in the index (example in Github repo; a minimal sketch follows below)
  • Choose a distance metric (cosine similarity, L1, L2, etc.)
    • You may need to normalize embeddings before storing them, depending on the distance metric you choose (e.g. L2)
  • If using hybrid search, normalize scores from both search methods before combining.

** Cosine similarity returns a number between -1 and 1, but because OpenSearch relevance scores can’t be below 0, the k-NN plugin adds 1 to get the final score. [scale: 0, 2]

Reference: https://opensearch.org/docs/latest/search-plugins/knn/knn-score-script/
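A minimal sketch of the mapping step via opensearch-py (index name, field name, and connection details are illustrative; 384 is the all-MiniLM-L6-v2 dimension; per the k-NN score script docs, the field only needs to be of type knn_vector):

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # illustrative connection

client.indices.create(
    index="post_docs",
    body={
        "mappings": {
            "properties": {
                "doc_embedding": {"type": "knn_vector", "dimension": 384},
                "post_text": {"type": "text"},
            }
        }
    },
)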

from typing import List, Optional

import numpy as np


# Method of a search client wrapper class (uses self.opensearch_client and self.index_name)
def search_similar_documents(
    self, embedding: np.ndarray, top_k: int = 1000, filters: Optional[List[dict]] = None
) -> List[dict]:
    """Search for similar documents in OpenSearch using an embedding."""
    filters = filters or []
    query = {
        "size": top_k,
        "query": {
            "bool": {
                "must": [
                    {
                        "script_score": {
                            "query": {"match_all": {}},
                            "script": {
                                "source": "knn_score",
                                "lang": "knn",
                                "params": {
                                    "field": "doc_embedding",
                                    "query_value": embedding.tolist(),
                                    "space_type": "cosinesimil",
                                },
                            },
                        }
                    }
                ],
                "filter": filters,  # Apply filters if provided
            }
        },
        "sort": [{"_score": {"order": "desc"}}],
    }
    response = self.opensearch_client.search(index=self.index_name, body=query)
    hits = response["hits"]["hits"]
    results = [{"score": hit["_score"], **hit["_source"]} for hit in hits]
    return results

17 of 29

Initial Results…

Given the same data over the same period of time, if you limit the number of posts returned by the search to the number of posts assigned to the topic, the majority of those results should be posts that were assigned to the topic. Similarity scores should be higher for the posts the model assigned to the topic than for posts that were not.

Mean search similarity score for all topics: 0.701 [scale: 0, 2]

Mean ratio of posts assigned to topic returned in results for all topics: 0.013
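A minimal sketch of that check (the helper name is hypothetical; it assumes the search_similar_documents method shown earlier and a mapping from each topic to the post IDs the model assigned to it):

def topic_match_ratio(topic_embedding, assigned_post_ids, search_fn) -> float:
    """Share of a topic's assigned posts returned by vector search when
    top_k equals the number of posts assigned to that topic."""
    results = search_fn(topic_embedding, top_k=len(assigned_post_ids))
    returned_ids = {hit["post_id"] for hit in results}
    return len(returned_ids & set(assigned_post_ids)) / len(assigned_post_ids)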

18 of 29

Initial Results…

Example (Topic Vector Search)

Topic: Cats, cats, cats

post_text

post_link

🎉 Congratulations to the Minnesota House and Senate for passing legislation to ensure fair wages for Lyft and Uber drivers!

Gathering signatures for Ohio Reproductive Rights Ballot Initiative!

📢SEE YOU AT #NN23! Heading to Chicago TOMORROW for @Netroots_Nation!

👋🏼 ✊🏼 Here for the first time from on Twitter. What are folks using Unified for? Is it useful?

📢CALLING ALL AUSTIN #ACTIVISTS! Join Unified on April 28th for our first Monthly Meet-up!

2023 has been a hell of a year for my organization, New Sun Coalition.

Very cool to see several Unifiers on the TX Progressive Caucus endorsement list! .. did I miss anybody?? https://www.texasprogressivecaucus.org/party_endorsements_2024

📆 4 months until Unified Jam! 🎉 ⁠ Let the countdown to this year's epic celebration of activism culture begin!

🚨BREAKING NEWS!!! My Rep Lloyd Doggett became the FIRST Democrat to call for Biden to withdraw from the presidential race.

👀So what do you think? Should Biden withdraw?

Me:

19 of 29

Why?

Poorly fit topic model? Well, yes… but…

coherence_score: 0.450
topic_diversity: 0.0196

20 of 29

Why?

Mean Vector vs. Centroid

  1. Mean Vector of Documents Assigned to Topic Cluster:
    • Calculated by taking the average of all document vectors assigned to the topic. Mathematically, this is done by summing all the vectors and dividing by the number of vectors (documents).
    • Reflects the mean position in the embedding space, accounting for the semantic contribution of each document equally.
    • Captures an overall semantic representation of the topic as a whole. Useful for comparing one topic to another.
    • Searching on this vector will retrieve anything "near" the area of the topic as a whole, which may fall outside the area defined by the cluster.
  2. Centroid of a Topic Cluster
    • Refers to the most "central" point of a cluster.
    • Can be calculated from a weighted average of the TF-IDF word scores for each cluster (or c-TF-IDF).
    • Alternately, can be represented as an embedding of the most representative documents in the cluster or key words in the cluster.
    • This more "localized" embedding represents the most representative words or documents in the cluster: the 'center' of the cluster.
    • Searching on the center of the cluster will retrieve documents **within** the area defined by the cluster.

[Diagram: Topic Search (mean) covers a broad search area around the topic 👎, while Localized Topic Search (centroid) covers the search area within the cluster 👍]

21 of 29

Improving search results with 'localized' topic embeddings

Instead of computing the topic embedding as the mean of all document embeddings assigned to the topic, use an approach that better approximates the “center” of the topic.

[topic model] -> assigned documents -> c-TF-IDF -> keywords -> keyword embedding

Options (depends on use case):

Keywords -> Keyword Embedding (observability, short texts)

  • c-TF-IDF (class-based Term Frequency - Inverse Document Frequency)
  • Convert top-N keywords into a keyword embedding that represents the topic
  • Beware of highly specific, rarer words skewing the embedding (🔥 Tip: Set TfidfVectorizer min/max word frequency)

Cosine similarity -> Representative documents -> Representative Embedding (longer texts; see the sketch after the keyword pseudo code below)

  • compare the embedding of each document to the topic embedding
  • identify most representative documents
  • convert representative documents into an embedding that represents the topic

Documents -> Summary -> Summary Embedding (large context window)

  • Prompt: Ask an LLM to “summarize” a sample of representative documents assigned to a topic.
  • Convert the summary into a summary embedding of the topic.

Source: https://github.com/MaartenGr/BERTopic By Maarten Grootendorst

c-TF-IDF: https://www.maartengrootendorst.com/blog/ctfidf/

Create Keyword Embeddings (pseudo code)

"""

Calculate keywords for each topic using TFI-DF or (c-TF-IDF)

"""

# Generate topics

topics, probs = topic_model.fit(embeddings, texts)

# Identify texts associated with each topic

topic_assignments = list(zip(topics, texts)

# Group the documents by topic

for topic, text in topic_assignments:

if topic not in topic_dict:

topic_dict[topic] = []

topic_dict[topic].append(text)

# Extract keywords from TF-IDF matrix

vectorizer = TfidfVectorizer(max_df=0.95, min_df=0.01)

def get_top_words(tfidf_matrix, top_n=10):

sum_tfidf = tfidf_matrix.sum(axis=0)

tfidf_scores = [

(feature, sum_tfidf[0, idx]) for idx, feature in enumerate(vectorizer.get_feature_names_out())

]

sorted_scores = sorted(tfidf_scores, key=lambda x: x[1], reverse=True)

return sorted_scores[:top_n]

# Apply TF-IDF on the documents for each topic

topic_kw_embeddings = {}

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

for topic, docs in topic_dict.items()

tfidf_matrix = vectorizer.fit_transform(docs)

# Get feature names (words)

top_words = get_top_words(tfidf_matrix, top_n=10)� words = [word[0] for word in top_words]

# Return keyword embeddings

topic_kw_embeddings[topic] = embedding_model.encode(words)
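And a minimal sketch of the representative-documents option described above (variable names are illustrative; it assumes the document embeddings and a mean topic embedding are already computed):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def representative_embedding(doc_embeddings: np.ndarray,
                             topic_embedding: np.ndarray,
                             top_n: int = 5) -> list[float]:
    """Embed a topic as the mean of its top-N most representative documents."""
    # Compare each document embedding to the (mean) topic embedding
    sims = cosine_similarity(doc_embeddings, topic_embedding.reshape(1, -1)).ravel()
    # Identify the most representative documents
    top_idx = np.argsort(sims)[::-1][:top_n]
    # Convert the representative documents into a single topic embedding
    return np.mean(doc_embeddings[top_idx], axis=0).tolist()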

22 of 29

Updated Results…

Using keyword embeddings resulted in a greatly improved topic "match" score, but interestingly a much lower overall search similarity score.

Mean search similarity score for all topics: 0.449 [scale: 0, 2]

Mean ratio of posts assigned to topic returned in results for all topics: 0.243

23 of 29

Updated Results…

Example (Keyword Vector Search)

post_text

post_link

Afternoon cats ☀️ 🎃 🐈‍⬛

Toasty cat ☀️

Living the life 🐈‍⬛🐈‍⬛

Meow

Embrace the tranquility of the weekend. It’s a perfect time to sort our thoughts, seek peace through meditation, and rejuvenate our minds. #WeekendMindfulness #InnerPeace 🌱🕉️ #cattax

Rockin’ the Unified !

Good morning ☀️ I hope yall have a wonderful weekend! Cat tax attached.

Some pretty cool and amazing pictures generated by ChatGPT

Have a good Sunday yall ☀️


25 of 29

Ongoing Evaluation

Key success metric:

Mean ratio of posts assigned to topic returned by search for all topics

Model cards:

  • Key training metrics (model parameters, coherence_score, topic_diversity, silhouette score, training artifacts, visualizations, runtime, etc.). (Temporal workflow, S3)

Monitoring:

  • Test ratio of posts assigned to topic returned in search results for all topics; alert if score < 0.25 (Temporal workflow)

Observability:

  • Slack notification displays output of topic ranking workflow each time it runs to enable real-time monitoring and observability. (Temporal workflow)

26 of 29

Conclusion

  • Most commonly used databases now include support for dense vectors and knn or cosine similarity search

  • Search embeddings need to be localized. Take the centroid of the topic, not the embedding for the topic overall.
  • Opportunity to define and anticipate topics **not** discoverable by topic model.

27 of 29

Github Repository

Topic Modeling as Search

https://github.com/kstohr/topic-vector-search

28 of 29

References & Credits

Thanks go to Maarten Grootendorst for his work and excellent documentation in BERTopic as well as colleagues at Unified and peer coder Ray 'Urgent' McLendon for his interest and input.

Text Embeddings:

https://stackoverflow.blog/2023/11/09/an-intuitive-introduction-to-text-embeddings/

BERTopic - Package for topic modeling by Maarten Grootendorst

https://github.com/MaartenGr/BERTopic

Comparing Clustering Algorithms (HDBSCAN)

https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html

c-TF-IDF

https://www.maartengrootendorst.com/blog/ctfidf/

Vector Search

https://towardsdatascience.com/text-search-vs-vector-search-better-together-3bd48eb6132a

Hybrid Search

https://machine-mind-ml.medium.com/enhancing-llm-performance-with-vector-search-and-vector-databases-1f20eb1cc650

🤓

29 of 29

Thoughts? Suggestions?

Contact me:

Kas Stohr

kas@joinunified.us

kas@99antennas.com

Linkedin: https://www.linkedin.com/in/katestohr/