1
RAG-TEC: Extracting and Classifying Topics in
Digital News Collections Using LLMs
WSDL Research Expo — 2026
Department of Computer Science
Old Dominion University, Norfolk, VA
June 23, 2026
Brian Llinas
Advisor: Dr. Michael L. Nelson
2
I want to learn about 180 countries around the world
WebSciDL.bsky.social
WSDL Research Expo – 2026
4
I do not know anything about Mozambique
Mozambique is
Summary from https://en.wikipedia.org/wiki/Mozambique
5
But … I could read the news to check what is happening
6
I found a news article, what can I do?
7
Read the title
8
Write a summary
Mozambique is in turmoil following the assassination of two prominent opposition figures, Elvino Dias and Paulo Guambe, escalating tensions ahead of planned protests against the election results. The killings come as the country awaits the contested election results, with opposition parties alleging fraud and calling for a nationwide strike.
9
Topic Identification: Mozambique opposition assassinations
10
Identify the events mentioned in the article
11
Build a mind map for contextualization
12
Reading many news articles and understanding a collection is time-consuming
15
The Challenge of Understanding News Collections
Understanding news collections is time-consuming; we need rapid and reliable contextualization through thematic understanding.
16
One way to get contextualization is to identify the topics
Latent Dirichlet Allocation (LDA) [1]
BERTopic [2]
Advantage
Limitations
[1] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
[2] Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.
[3] Kapoor, S., Gil, A., Bhaduri, S., Mittal, A., & Mulkar, R. (2024). Qualitative Insights Tool (QualIT): LLM Enhanced Topic Modeling. arXiv preprint arXiv:2409.15626.
[4] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[5] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
17
How can we use LLMs to extract and classify topics from news collections for rapid contextualization?
Summarization & Interpretation:
LLMs streamline news summarization, extracting key events and contextual details rapidly [1-2].
Entity & Event Extraction:
Deep language models excel at named entity recognition (NER) and identifying event triggers in large text corpora [3-4].
Domain-Specific Insights:
Tailored LLM approaches can capture nuanced relationships (e.g., socio-political factors) from news archives [5].
[1] Chew, R., Bollenbacher, J., Wenger, M., Speer, J., & Kim, A. (2023). LLM-assisted content analysis: Using large language models to support deductive coding. arXiv preprint arXiv:2306.14924.
[2] Dolphin, R., Dursun, J., Chow, J., Blankenship, J., Adams, K., & Pike, Q. (2024). Extracting structured insights from financial news: An augmented LLM driven approach. arXiv preprint arXiv:2407.15788.
[3] Tong, Z., Ding, Z., & Wei, W. (2025, January). EvoPrompt: Evolving Prompts for Enhanced Zero-Shot Named Entity Recognition with Large Language Models. In Proceedings of the 31st International Conference on Computational Linguistics (pp. 5136-5153).
[4] Nakshatri, N., Liu, S., Chen, S., Roth, D., Goldwasser, D., & Hopkins, D. (2023, December). Using LLM for improving key event discovery: Temporal-guided news stream clustering with event summaries. In Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 4162-4173).
[5] López, A. B., Pastor-Galindo, J., & Ruipérez-Valiente, J. A. (2024). LLM-assisted topic modeling for hate speech characterization.
18
RAG-TEC: Retrieval-Augmented Generation for Topic Extraction and Classification
A five-stage framework
19
We manually curated four collections
21
From the top-k relevant documents in the collection, we use gpt-4o-mini to extract topics
The LLM serves as an information extraction model
We do not need to train a natural language processing model
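The retrieval step on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the RAG-TEC implementation: the slides do not say how relevance is scored, so the sketch assumes plain TF-IDF cosine similarity. The function names (`top_k_documents`, `build_extraction_prompt`) and the prompt wording are hypothetical, and the call to gpt-4o-mini is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(token_lists):
    """Build one sparse TF-IDF vector (a dict) per token list."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Smoothed IDF keeps every weight positive.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_documents(query, documents, k=3):
    """Return the k documents most similar to the query, best first."""
    # The query is vectorized together with the documents (an assumption
    # made here for simplicity), so it also contributes to the IDF counts.
    vectors = tfidf_vectors([tokenize(d) for d in documents] + [tokenize(query)])
    query_vec, doc_vecs = vectors[-1], vectors[:-1]
    ranked = sorted(
        range(len(documents)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return [documents[i] for i in ranked[:k]]


def build_extraction_prompt(retrieved, n_topics=6):
    """Assemble a hypothetical topic-extraction prompt from retrieved documents."""
    context = "\n\n".join(
        f"[Document {i}]\n{doc}" for i, doc in enumerate(retrieved, start=1)
    )
    return (
        f"Read the news articles below and list up to {n_topics} topics "
        "that describe the collection, one per line.\n\n" + context
    )
```

The string returned by `build_extraction_prompt` would then be sent to the LLM. Because the model only reads and summarizes the retrieved text, no model training is involved at this stage.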
22
Six resulting topics from the Mozambique Collection
23
We use gpt-4o-mini to classify each document into one main topic and multiple sub-topics
The LLM serves as a classification model
We do not need to train a machine learning model
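One way to picture the classification step is a prompt that lists the extracted topics, plus a parser that checks the model's answer against that list. The sketch below is illustrative only: the JSON schema (`main_topic`, `sub_topics`), the prompt wording, and both function names are assumptions, and the LLM call itself is not shown. The topic names are the six reported for the Mozambique collection.

```python
import json

TOPICS = [
    "Election Violence",
    "Climate Challenges",
    "Insurgency Crisis",
    "Social Media Restrictions",
    "Economic Disparities",
    "Humanitarian Response",
]


def build_classification_prompt(document, topics):
    """Assemble a hypothetical prompt asking for one main topic and sub-topics."""
    topic_lines = "\n".join(f"- {t}" for t in topics)
    return (
        "Classify the news article into exactly one main topic and any number "
        "of sub-topics, using only the topics listed.\n"
        'Answer with JSON: {"main_topic": "...", "sub_topics": ["..."]}\n\n'
        f"Topics:\n{topic_lines}\n\nArticle:\n{document}"
    )


def parse_classification(raw_response, topics):
    """Validate the model's JSON answer against the known topic list."""
    data = json.loads(raw_response)
    main_topic = data.get("main_topic")
    if main_topic not in topics:
        raise ValueError(f"Unknown main topic: {main_topic!r}")
    sub_topics = []
    for topic in data.get("sub_topics", []):
        # Keep only known topics, drop the main topic and any repeats.
        if topic in topics and topic != main_topic and topic not in sub_topics:
            sub_topics.append(topic)
    return main_topic, sub_topics


# Hand-written stand-in for a model response (not real model output).
example_response = (
    '{"main_topic": "Election Violence", '
    '"sub_topics": ["Social Media Restrictions", "Election Violence", "Weather"]}'
)
print(parse_classification(example_response, TOPICS))
# -> ('Election Violence', ['Social Media Restrictions'])
```

Validating against the fixed topic list keeps every document inside the same label set, which matters because LLM output can drift from the topics it was given.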
25
To assess the topic quality for the collection, we use Topic Diversity and Coherence
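As a rough illustration of the two metrics, the sketch below computes topic diversity as the share of unique words across all topic word lists, and coherence as the mean normalized pointwise mutual information (NPMI) of word pairs within a topic. The slides do not state which coherence measure or word lists were used, so NPMI over document co-occurrence, the whitespace tokenization, and the function names are all assumptions.

```python
import math
from itertools import combinations


def topic_diversity(topics):
    """Fraction of unique words among all topic words (1.0 means no overlap)."""
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)


def npmi_coherence(topic, documents):
    """Mean NPMI over all word pairs of one topic, using document co-occurrence."""
    if len(topic) < 2:
        raise ValueError("A topic needs at least two words to form a pair.")
    doc_sets = [set(doc.lower().split()) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Share of documents that contain every given word.
        return sum(all(w in doc for w in words) for doc in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the pair never co-occurs
        elif p12 == 1:
            scores.append(1.0)   # limiting value when the pair is in every document
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)
```

Diversity near 1 means the topics share few words; coherence ranges from -1 (words never appear together) to 1 (words always appear together).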
28
Finally, we can understand the collection from a topic perspective without relying on manual inspection
We illustrate the process with the Mozambique collection.
Its six topics are Election Violence (EV), Climate Challenges (CC), Insurgency Crisis (IC), Social Media Restrictions (SMR), Economic Disparities (ED), and Humanitarian Response (HR).
30
Limitations
Small, manually curated micro-collections introduce selection bias and limit scalability.
Topic output is sensitive to prompt design and model behavior.
Evaluation was done without gold-standard labels.
The current approach uses only gpt-4o-mini; other prompting strategies and open-source LLMs remain unexplored.
31
Future Work
Experiment with diverse prompt styles and open-source models (e.g., Mistral, LLaMA).
Automate micro-collection building to reduce manual selection bias.
Extend the method to multilingual and cross-national corpora for comparative analysis.
32
Conclusion
RAG-TEC bridges interpretability and automation for topic modeling.
Emphasizes actors, agency, and mechanisms, which are vital for qualitative and mixed-method research.
Produces topics usable as initial codebooks for framing analysis.
Complements existing models by adding narrative richness and contextual sensitivity.
Demonstrated utility through four collections; extensible to other manually curated collections.
RAG-TEC empowers researchers to move beyond shallow topic labels and uncover the narrative mechanisms embedded in news discourse.