1 of 29

PyBay - Saturday, September 21, 2024

Thinking of Topic Modeling as Search

2 of 29

The social network for activism

3 of 29

Shoutout to the team

Unified

Shion Deysarkar

CEO & Co-Founder

Prev. founded Datafiniti, grew sales to $3M+

Deployed relational organizing software to 40+ campaigns

CMU CompSci, Student President @ Rice MBA

Kas Stohr

Lead Data Scientist

Led social listening for Simon & Schuster

Led non-profit that reached 1.2 million people

Jeremy Smith

Co-Founder

CEO @ Civitech ($63M valuation)

Registered 1M+ voters via Register2Vote

Francisco Jimenez

Lead Mobile Developer

Developer of organizing app for 2000+ vols

10 yrs leading app development for Shion

Brian T. Smith

Head of Partnerships

22 yrs campaigns & government relations

Election-winner and legislation-crafter

Jack Klika

Lead Backend Engineer

Led scaling for Foxconn’s AI systems

Enumerator for US census

Madelyn Morneault

Events Manager

PM, Austin Indie Film Fest

Co-host, Austin Women’s march

Ben Magos

Lead Web Developer

Led web and mobile development at interactive media firm

Olivia Pasion

Web Developer

Web developer for multiple projects

Also:

Simon (dog)

Linux (cat)

(cat)

4 of 29

FEATURE OBJECTIVES

Track trending topics in real-time

  • ever-changing
  • highly influenced by news events
  • tend to focus on named entities such as people or places (politics!)

Suggest organizations and users by topic

  • Most users do not post (80:20 Pareto principle)
  • Help new users discover people near them
  • Existing users follow a topic or organization

Why are we even talking about topic modeling "as search?"

5 of 29

ENGINEERING CHALLENGE

Retrieve and Rank Posts

We would typically use a topic model to classify new posts and assign them to a topic. However, “tagging” posts every time the topics change does not scale.

  • Mobile app needs to be performant
  • Posts accumulate topic tags over time
  • Tagging alone does not rank posts by relevance
  • How do we track, merge, and deprecate topics over time?

Why are we even talking about topic modeling "as search?"

Me: Oh wait…You want to do this on the fly????

6 of 29

Constraints

  • Small team (1 person)
  • Limited to infrastructure in place (or what I could stand up with limited support)
  • Must be easy to maintain (1 person!!!)
  • Easy to evaluate, track model drift and improve over time.
  • MVP - feature enhancements expected, design with backwards compatibility in mind

7 of 29

Benefits of “searching” instead of “tagging”

  • Use the topic model to discover topics, but not to classify documents. More forgiving: the model does not have to perform well at classification.

  • Simplify and improve performance for storing and retrieving documents related to a topic in production environments. No tagging!

  • Capture topics related to fast-moving, evolving conversations

  • Hybrid Search (i.e. #hashtags AND topic embedding)

  • Allow user-generated topics (personalization)

  • Anticipate topics -- Yep, you can create a topic that does not yet exist. For example, you may want to create a topic related to an upcoming event, such as say, the "2024 Presidential Election."

8 of 29

Brief refresher on embeddings

To build a topic model, you typically start by converting your documents into text embeddings, and for longer documents, sentence embeddings.

Definition: Embeddings are numerical representations of text that capture semantic meaning. Similar items are positioned closer to one another than less similar items in the “embedding space.” In the case of text, sentences that are semantically similar should have similar embedded vectors and thus be closer together in the space.

Why It Matters: They enable us to perform various NLP tasks, including similarity search and clustering using metrics such as Euclidean distance and cosine similarity.

Document-term matrix ("Bag of Words" approach):

                        all   about   cat   afternoon   meow
All about that cat.      1      1      1        0         0
Afternoon cat 🐈‍⬛         0      0      1        1         0
Meow 🐈‍⬛🐈‍⬛               0      0      0        0         1

Each word is a vector:

all = [1, 0, 0]
about = [1, 0, 0]
cat = [1, 1, 0]

~30,000 English words => “sparse vector.” Reduce vocabulary and use dimensionality reduction to create a “dense vector”:

all = [0.43776, 0.00021, 0.00744]
about = [0.35821, 0.00011, 0.00213]
cat = [0.75543, 0.34675, 0.68345]

Python packages: SentenceTransformers, spaCy, FastText, Hugging Face, Word2Vec, Doc2Vec, etc.
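For example, a minimal sketch using SentenceTransformers (the model and sentences are illustrative):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["All about that cat.", "Afternoon cat"], convert_to_tensor=True)

# Cosine similarity close to 1.0 means the two sentences are semantically similar
print(float(util.cos_sim(embeddings[0], embeddings[1])))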

9 of 29

Generating document embeddings

Preprocessed Text -> Sentence Embedding (dense vectors) -> Document Embedding (dense vector) -> [vector store]

import spacy
import numpy as np
from spacy.tokens import Doc
from sentence_transformers import SentenceTransformer


async def preprocess_sentences(text: str) -> list[str]:
    """
    Splits the text into sentences, pre-processes each sentence, and returns a
    list of cleaned sentences.
    """
    # Use spaCy to segment text into sentences
    nlp = spacy.load("en_core_web_sm")
    split_text: Doc = nlp(text)
    # Preprocess each sentence (preprocess_text is an app-specific cleaning helper)
    sentences = [preprocess_text(sent.text) for sent in split_text.sents]
    # Return list of sentences
    return sentences


async def create_doc_embedding(sentences: list[str]) -> list[float]:
    """
    Creates a document embedding by taking the mean of the sentence embeddings.
    """
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(sentences)
    # The mean of the sentence vectors represents the document as a whole.
    # Each sentence is given equal weight.
    doc_embedding = np.mean(embeddings, axis=0).tolist()
    # Return document embedding
    return doc_embedding
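Usage is a minimal sketch (the sample text is illustrative):

import asyncio

async def main():
    text = "All about that cat. Afternoon cat."  # illustrative post text
    sentences = await preprocess_sentences(text)
    doc_embedding = await create_doc_embedding(sentences)
    print(len(doc_embedding))  # 384 dimensions for all-MiniLM-L6-v2

asyncio.run(main())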

Choosing an embedding model

Short Text (Word, Phrase, Sentence)

Best Models: Word2Vec, Sentence-BERT, Universal Sentence Encoder (USE), BERT, FastText

Key Features:
  • capture sentence-level semantics and context
  • good for short or medium-length texts
  • smaller in size (production considerations)

Long Text (Paragraph, Document)

Best Models: Longformer, Doc2Vec, T5, XLNet, Transformer-based models (BERT, GPT/LLMs for large contexts)

Key Features:
  • designed or adapted for handling longer documents (LLM contexts => 128k)
  • leverage attention mechanisms or hierarchical encoding to maintain contextual relevance

10 of 29

Storing embeddings

To speed performance, my first thought was to pre-compute these embeddings and cache them each time a post was made.

Vector Store Options:

  • Redis

https://redis.io/docs/latest/develop/interact/search-and-query/advanced-concepts/vectors/

  • FAISS (python package developed by Meta) https://github.com/facebookresearch/faiss

  • Other Options: Pinecone, Weaviate (ChatGPT suggestion ¯\_(ツ)_/¯ )
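For example, a minimal FAISS sketch (the dimension, vectors, and k are illustrative; 384 matches all-MiniLM-L6-v2):

import faiss
import numpy as np

dim = 384
index = faiss.IndexFlatIP(dim)  # inner product; L2-normalized vectors => cosine similarity

doc_embeddings = np.random.rand(1000, dim).astype("float32")  # placeholder embeddings
faiss.normalize_L2(doc_embeddings)
index.add(doc_embeddings)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)  # top-10 most similar documents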

11 of 29

Storing embeddings

… Storing embeddings as dense vectors means you can also search for similar embeddings…

Database Options:

Me:

Tell me more…

12 of 29

Searching for topics

New plan: Instead of tagging documents and maintaining tags, use embedding retrieval to return documents similar to a given topic defined by the topic model.

13 of 29

Posts into structured documents

Posts are media-rich, unstructured text. Before generating an embedding, the post must be converted into a structured text document.

  • Images captioned
  • Video keyframes captioned
  • Hashtags extracted and handled
  • Mentions extracted and handled** (PII)
  • etc.

{
  "post_id": "4162a0db-6eab-423b-94f8-8eb452bec7db",
  "post_author": "72",
  "modified_at": "2023-10-18 18:07:58.968690+00:00",
  "created_at": "2023-10-18 18:07:58.907013+00:00",
  "deleted_at": null,
  "body": {
    "post_text": "Some pretty cool and amazing pictures generated by ChatGPT",
    "links": null,
    "assets": [
      {
        "image_url": "https://ufd-prod-asset-uploads.s3.amazonaws.com/20231018/cf7afe2dd420bfd5f1c608cf20ef2018.jpg",
        "caption": "a man and a woman sitting at a table with a cat",
        "model": "nlpconnect/vit-gpt2-image-captioning"
      },
      {
        "image_url": "https://ufd-prod-asset-uploads.s3.amazonaws.com/20231018/2db56233c21dabe363b25e0d652d325c.jpg",
        "caption": "a man and a woman sitting at a table with a cat",
        "model": "nlpconnect/vit-gpt2-image-captioning"
      }
    ],
    "actions": null,
    "mentions": null,
    "entities": [
      {
        "text": "chatgpt",
        "start_char": 51,
        "end_char": 58,
        "label": "ORG"
      }
    ],
    "hashtags": null
  },
  "doc_embedding": {
    "embedding": [
      -0.031969476491212845,
      -0.002114212140440941,
      0.029350850731134415,
      ...
    ],
    "model": "all-MiniLM-L6-v2",
    "embedding_type": "doc"
  }
}

14 of 29

[Architecture diagram, flattened]

Write path: API /v1/post -> Postgres

Controller logic (awaited):
  • Store assets
  • Store link metadata
  • Store posts

Post processing (background task): MLOPS API /v1/process_post
  • NLP tasks
  • Moderation tasks
  • Generate embedding
  -> Elastic (vector store)

Read path: API /v1/post/search -> MLOPS API /v1/search_posts -> Search PostDocs

class PostToPostDoc:
    ...
    async def process(self):  # illustrative method name wrapping the original snippet
        await self.init_doc()
        tasks = [
            self.process_child_posts(),  # -> extract text
            self.process_links(),        # -> fetch metadata, url, extract text
            self.process_assets(),       # -> image -> vision model -> caption text
            self.process_actions(),      # -> fetch officials, office -> action text
            self.process_mentions(),     # -> extract text (excluded from embedding, PII)
            self.process_entities(),     # -> extract entities -> NER model -> entity objects
            self.process_hashtags(),     # -> extract text
        ]
        results = await asyncio.gather(*tasks)
        await self.generate_doc_embedding()  # tokenize -> embedding model
        self.store_es()
        gc.collect()
        ...

15 of 29

Training the topic model

Document Embeddings [vector store] -> [topic model] -> Topic Embeddings -> [topic cache]

Example Model pipeline:

  1. Clean and preprocess text (Critical)
    • All text is “unstructured”; take the time to convert unstructured text into a structured format (🔥 Tip: Pydantic model).
    • Addresses? Remove numbers or “99” will become a topic. Convert emojis into text. Identify headlines and section headers, and use them to “chunk” your text into meaningful embeddings.
  2. Generate embeddings (SentenceTransformers, GPT, etc. See above.)
  3. Reduce dimensionality of embeddings (UMAP, PCA, etc.)
  4. Cluster embeddings into topics (K-Means, LDA, HDBSCAN, Agglomerative Clustering, etc.)
  5. Tokenize words in topics (CountVectorizer)
  6. Weight tokens with c-TF-IDF (class-based Term Frequency - Inverse Document Frequency)* (Critical)
  7. Optional: Label topics (OpenAI, LLM, KeyBERT)
  8. Convert documents assigned to each topic into a topic embedding (np.mean(assigned_doc_embeddings, axis=0).tolist()) — see the sketch below.

*Credit: https://github.com/MaartenGr/BERTopic By Maarten Grootendorst

c-TF-IDF: https://www.maartengrootendorst.com/blog/ctfidf/

👀 K8s Workflow [Temporal]: pre-process text -> generate embeddings -> reduce dimensionality -> cluster -> count vectorization -> c-TF-IDF -> label topics (LLM) -> generate topic embeddings
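A minimal sketch of steps 2-8 of this pipeline (BERTopic-style, following Grootendorst's approach; the model names and parameters are illustrative assumptions, not the production configuration):

import numpy as np
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

docs = ["..."]  # preprocessed, structured text (step 1)

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")            # step 2
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine")   # step 3
hdbscan_model = HDBSCAN(min_cluster_size=10, prediction_data=True)   # step 4
vectorizer_model = CountVectorizer(stop_words="english")             # step 5

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    vectorizer_model=vectorizer_model,  # c-TF-IDF weighting (step 6) is built in
)
topics, probs = topic_model.fit_transform(docs)

# Step 8: mean of the document embeddings assigned to each topic
doc_embeddings = embedding_model.encode(docs)
topic_embeddings = {
    t: np.mean(doc_embeddings[np.array(topics) == t], axis=0).tolist()
    for t in set(topics) if t != -1  # skip the HDBSCAN outlier topic
}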

16 of 29

Searching for documents

OpenSearch Example:

  • Map the field to a dense vector (knn_vector) in the index (example in Github repo; a minimal sketch follows below)
  • Choose a distance metric (cosine similarity, L1, L2, etc.)
    • You may need to normalize embeddings before storing them, depending on the distance metric you choose (e.g. L2)
  • If using hybrid search, normalize scores from both search methods before combining.

** Cosine similarity returns a number between -1 and 1, but because OpenSearch relevance scores can’t be below 0, the k-NN plugin adds 1 to get the final score. [scale: 0, 2]

Reference: https://opensearch.org/docs/latest/search-plugins/knn/knn-score-script/
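A minimal sketch of the mapping step via opensearch-py (index name, field name, and connection details are illustrative; 384 is the all-MiniLM-L6-v2 dimension; per the k-NN score script docs, the field only needs to be of type knn_vector):

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])  # illustrative connection

client.indices.create(
    index="post_docs",
    body={
        "mappings": {
            "properties": {
                "doc_embedding": {"type": "knn_vector", "dimension": 384},
                "post_text": {"type": "text"},
            }
        }
    },
)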

from typing import List, Optional

import numpy as np


# Method of a search client wrapper class (uses self.opensearch_client and self.index_name)
def search_similar_documents(
    self, embedding: np.ndarray, top_k: int = 1000, filters: Optional[List[dict]] = None
) -> List[dict]:
    """Search for similar documents in OpenSearch using an embedding."""
    filters = filters or []
    query = {
        "size": top_k,
        "query": {
            "bool": {
                "must": [
                    {
                        "script_score": {
                            "query": {"match_all": {}},
                            "script": {
                                "source": "knn_score",
                                "lang": "knn",
                                "params": {
                                    "field": "doc_embedding",
                                    "query_value": embedding.tolist(),
                                    "space_type": "cosinesimil",
                                },
                            },
                        }
                    }
                ],
                "filter": filters,  # Apply filters if provided
            }
        },
        "sort": [{"_score": {"order": "desc"}}],
    }
    response = self.opensearch_client.search(index=self.index_name, body=query)
    hits = response["hits"]["hits"]
    results = [{"score": hit["_score"], **hit["_source"]} for hit in hits]
    return results

17 of 29

Initial Results…

Given the same data over the same period of time, if you limit the number of posts returned by the search to the number of posts assigned to the topic, the majority of those results should be posts that were assigned to the topic. Similarity scores should be higher for the posts the model assigned to the topic than for posts that were not.

Mean search similarity score for all topics: 0.701 [scale: 0, 2]

Mean ratio of posts assigned to topic returned in results for all topics: 0.013
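A minimal sketch of that check (the helper name is hypothetical; it assumes the search_similar_documents method shown earlier and a mapping from each topic to the post IDs the model assigned to it):

def topic_match_ratio(topic_embedding, assigned_post_ids, search_fn) -> float:
    """Share of a topic's assigned posts returned by vector search when
    top_k equals the number of posts assigned to that topic."""
    results = search_fn(topic_embedding, top_k=len(assigned_post_ids))
    returned_ids = {hit["post_id"] for hit in results}
    return len(returned_ids & set(assigned_post_ids)) / len(assigned_post_ids)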

18 of 29

Initial Results…

Example (Topic Vector Search)

Topic: Cats, cats, cats

post_text

post_link

🎉 Congratulations to the Minnesota House and Senate for passing legislation to ensure fair wages for Lyft and Uber drivers!

Gathering signatures for Ohio Reproductive Rights Ballot Initiative!

📢SEE YOU AT #NN23! Heading to Chicago TOMORROW for @Netroots_Nation!

👋🏼 ✊🏼 Here for the first time from on Twitter. What are folks using Unified for? Is it useful?

📢CALLING ALL AUSTIN #ACTIVISTS! Join Unified on April 28th for our first Monthly Meet-up!

2023 has been a hell of a year for my organization, New Sun Coalition.

Very cool to see several Unifiers on the TX Progressive Caucus endorsement list! .. did I miss anybody?? https://www.texasprogressivecaucus.org/party_endorsements_2024

📆 4 months until Unified Jam! 🎉 ⁠ Let the countdown to this year's epic celebration of activism culture begin!

🚨BREAKING NEWS!!! My Rep Lloyd Doggett became the FIRST Democrat to call for Biden to withdraw from the presidential race.

👀So what do you think? Should Biden withdraw?

Me:

19 of 29

Why?

Poorly fit topic model? Well, yes… but…

coherence_score: 0.450
topic_diversity: 0.0196

20 of 29

Why?

Mean Vector vs. Centroid

  1. Mean Vector of Documents Assigned to Topic Cluster:
    • Calculated by taking the average of all document vectors assigned to the topic. Mathematically, this is done by summing all the vectors and dividing by the number of vectors (documents).
    • Reflects the mean position in the embedding space, accounting for the semantic contribution of each document equally.
    • Captures an overall semantic representation of the topic as a whole. Useful for comparing one topic to another.
    • Searching on this vector will retrieve anything "near" the area of the topic as a whole, which may fall outside the area defined by the cluster.
  2. Centroid of a Topic Cluster
    • Refers to the most "central" point of a cluster.
    • Can be calculated from a weighted average of the TF-IDF word scores for each cluster (or c-TF-IDF).
    • Alternately, can be represented as an embedding of the most representative documents in the cluster or key words in the cluster.
    • This more "localized" embedding represents the most representative words or documents in the cluster: the 'center' of the cluster.
    • Searching on the center of the cluster will retrieve documents **within** the area defined by the cluster.

[Diagram: Topic Search (mean) covers a broad search area around the topic 👎, while Localized Topic Search (centroid) covers the search area within the cluster 👍]

21 of 29

Improving search results with 'localized' topic embeddings

Instead of computing the topic embedding as the mean of all document embeddings assigned to the topic, use an approach that better approximates the “center” of the topic.

[topic model] -> assigned documents -> c-TF-IDF -> keywords -> keyword embedding

Options (depends on use case):

Keywords -> Keyword Embedding (observability, short texts)

  • c-TF-IDF (class-based Term Frequency - Inverse Document Frequency)
  • Convert top-N keywords into a keyword embedding that represents the topic
  • Beware of highly specific, rarer words skewing the embedding (🔥 Tip: Set TfidfVectorizer min/max word frequency)

Cosine similarity -> Representative documents -> Representative Embedding (longer texts; see the sketch after the keyword pseudo code below)

  • compare the embedding of each document to the topic embedding
  • identify most representative documents
  • convert representative documents into an embedding that represents the topic

Documents -> Summary -> Summary Embedding (large context window)

  • Prompt: Ask an LLM to “summarize” a sample of representative documents assigned to a topic.
  • Convert the summary into a summary embedding of the topic.

Source: https://github.com/MaartenGr/BERTopic By Maarten Grootendorst

c-TF-IDF: https://www.maartengrootendorst.com/blog/ctfidf/

Create Keyword Embeddings (pseudo code)

"""

Calculate keywords for each topic using TFI-DF or (c-TF-IDF)

"""

# Generate topics

topics, probs = topic_model.fit(embeddings, texts)

# Identify texts associated with each topic

topic_assignments = list(zip(topics, texts)

# Group the documents by topic

for topic, text in topic_assignments:

if topic not in topic_dict:

topic_dict[topic] = []

topic_dict[topic].append(text)

# Extract keywords from TF-IDF matrix

vectorizer = TfidfVectorizer(max_df=0.95, min_df=0.01)

def get_top_words(tfidf_matrix, top_n=10):

sum_tfidf = tfidf_matrix.sum(axis=0)

tfidf_scores = [

(feature, sum_tfidf[0, idx]) for idx, feature in enumerate(vectorizer.get_feature_names_out())

]

sorted_scores = sorted(tfidf_scores, key=lambda x: x[1], reverse=True)

return sorted_scores[:top_n]

# Apply TF-IDF on the documents for each topic

topic_kw_embeddings = {}

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

for topic, docs in topic_dict.items()

tfidf_matrix = vectorizer.fit_transform(docs)

# Get feature names (words)

top_words = get_top_words(tfidf_matrix, top_n=10)� words = [word[0] for word in top_words]

# Return keyword embeddings

topic_kw_embeddings[topic] = embedding_model.encode(words)
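And a minimal sketch of the representative-documents option described above (variable names are illustrative; it assumes the document embeddings and a mean topic embedding are already computed):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def representative_embedding(doc_embeddings: np.ndarray,
                             topic_embedding: np.ndarray,
                             top_n: int = 5) -> list[float]:
    """Embed a topic as the mean of its top-N most representative documents."""
    # Compare each document embedding to the (mean) topic embedding
    sims = cosine_similarity(doc_embeddings, topic_embedding.reshape(1, -1)).ravel()
    # Identify the most representative documents
    top_idx = np.argsort(sims)[::-1][:top_n]
    # Convert the representative documents into a single topic embedding
    return np.mean(doc_embeddings[top_idx], axis=0).tolist()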

22 of 29

Updated Results…

Using keyword embeddings resulted in a greatly improved topic "match" score, but interestingly a much lower overall search similarity score.

Mean search similarity score for all topics: 0.449 [scale: 0, 2]

Mean ratio of posts assigned to topic returned in results for all topics: 0.243

23 of 29

Updated Results…

Example (Keyword Vector Search)

post_text

post_link

Afternoon cats ☀️ 🎃 🐈‍⬛

Toasty cat ☀️

Living the life 🐈‍⬛🐈‍⬛

Meow

Embrace the tranquility of the weekend. It’s a perfect time to sort our thoughts, seek peace through meditation, and rejuvenate our minds. #WeekendMindfulness #InnerPeace 🌱🕉️ #cattax

Rockin’ the Unified !

Good morning ☀️ I hope yall have a wonderful weekend! Cat tax attached.

Some pretty cool and amazing pictures generated by ChatGPT

Have a good Sunday yall ☀️


25 of 29

Ongoing Evaluation

Key success metric:

Mean ratio of posts assigned to topic returned by search for all topics

Model cards:

  • Key training metrics (model parameters, coherence_score, topic_diversity, silhouette score, training artifacts, visualizations, runtime, etc.). (Temporal workflow, S3)

Monitoring:

  • Test ratio of posts assigned to topic returned in search results for all topics; alert if score < 0.25 (Temporal workflow)

Observability:

  • Slack notification displays output of topic ranking workflow each time it runs to enable real-time monitoring and observability. (Temporal workflow)

26 of 29

Conclusion

  • Most commonly used databases now include support for dense vectors and knn or cosine similarity search

  • Search embeddings need to be localized. Take the centroid of the topic, not the embedding for the topic overall.
  • Opportunity to define and anticipate topics **not** discoverable by topic model.

27 of 29

Github Repository

Topic Modeling as Search

https://github.com/kstohr/topic-vector-search

28 of 29

References & Credits

Thanks go to Maarten Grootendorst for his work and excellent documentation in BERTopic as well as colleagues at Unified and peer coder Ray 'Urgent' McLendon for his interest and input.

Text Embeddings:

https://stackoverflow.blog/2023/11/09/an-intuitive-introduction-to-text-embeddings/

BERTopic - Package for topic modeling by Maarten Grootendorst

https://github.com/MaartenGr/BERTopic

Comparing Clustering Algorithms (HDBSCAN)

https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html

c-TF-IDF

https://www.maartengrootendorst.com/blog/ctfidf/

Vector Search

https://towardsdatascience.com/text-search-vs-vector-search-better-together-3bd48eb6132a

Hybrid Search

https://machine-mind-ml.medium.com/enhancing-llm-performance-with-vector-search-and-vector-databases-1f20eb1cc650

🤓

29 of 29

Thoughts? Suggestions?

Contact me:

Kas Stohr

kas@joinunified.us

kas@99antennas.com

Linkedin: https://www.linkedin.com/in/katestohr/