1 of 32

RAG-TEC: Extracting and Classifying Topics in Digital News Collections Using LLMs

WSDL Research Expo — 2026

Department of Computer Science

Old Dominion University, Norfolk, VA

June 23, 2026

Brian Llinas

Advisor: Dr. Michael L. Nelson

2 of 32

I want to learn about 180 countries around the world

WebSciDL.bsky.social

WSDL Research Expo – 2026

3 of 32

I want to learn about 180 countries around the world

4 of 32

I do not know anything about Mozambique

Mozambique is

  • a country located in Southeast Africa.
  • The country was colonized by Portugal in the 15th century and remained under Portuguese rule until 1975.
  • Mozambique gained independence in 1975 after a long and bloody civil war.

5 of 32

But … I could read the news to check what is happening

I do not know anything about Mozambique

6 of 32

I found a news article, what can I do?

7 of 32

Read the title

I found a news article, what can I do?

8 of 32

Read the title

Write a summary

I found a news article, what can I do?

Mozambique is in turmoil following the assassination of two prominent opposition figures, Elvino Dias and Paulo Guambe, escalating tensions ahead of planned protests against the election results. The killings come as the country awaits the contested election results, with opposition parties alleging fraud and calling for a nationwide strike.

9 of 32

Read the title

Write a summary

Mozambique is in turmoil following the assassination of two prominent opposition figures, Elvino Dias and Paulo Guambe, escalating tensions ahead of planned protests against the election results. The killings come as the country awaits the contested election results, with opposition parties alleging fraud and calling for a nationwide strike.

I found a news article, what can I do?

Topic Identification: Mozambique opposition assassinations

10 of 32

Identify the events mentioned in the article

I found a news article, what can I do?

11 of 32

Build a mind map for contextualization

I found a news article, what can I do?

12 of 32

Reading many news articles and understanding a collection is time-consuming

13 of 32

Reading many news articles and understanding a collection is time-consuming

14 of 32

Reading many news articles and understanding a collection is time-consuming

15 of 32

The Challenge of Understanding News Collections

Understanding news collections is time-consuming. We need rapid and reliable contextualization through thematic understanding.

16 of 32

One way to get contextualization is to identify the topics

[1] Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.

[2] Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.

[3] Kapoor, S., Gil, A., Bhaduri, S., Mittal, A., & Mulkar, R. (2024). Qualitative Insights Tool (QualIT): LLM Enhanced Topic Modeling. arXiv preprint arXiv:2409.15626.

[4] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[5] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.

Latent Dirichlet Allocation (LDA) [1]

Advantage

  • Probabilistic topic model that represents each document as a mixture of topics, offering interpretability, scalability, and flexibility for large-scale text analysis. [1]

Limitations

  • Relies on a bag-of-words representation and requires the number of topics to be specified in advance. [2, 3]
  • Fails to capture contextual nuances and relies on predefined rules. [4, 5]

BERTopic [2]

Advantage

  • Leverages deep neural networks to learn rich, contextual representations from large amounts of text data. [2]

Limitations

  • Assumes that each document contains only a single topic, which may not align with real-world cases. [2, 3]
  • Topic representation is still based on bag-of-words, leading to redundancy and less informative topics. [2]
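The bag-of-words limitation listed for both models can be seen in a minimal sketch: once word order is discarded, two headlines with opposite meanings receive identical representations. The two headlines and the `bag_of_words` helper are illustrative assumptions, not taken from the collections.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase, split on whitespace, and count tokens (word order is lost)."""
    return Counter(text.lower().split())


# Two hypothetical headlines with opposite meanings.
headline_a = "police attack protesters in Maputo"
headline_b = "protesters attack police in Maputo"

# Both headlines contain the same words, so their bags of words are equal.
same_representation = bag_of_words(headline_a) == bag_of_words(headline_b)
print(same_representation)  # True
```

A model that only sees these counts cannot tell who attacked whom, which is the kind of contextual nuance the slide refers to.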

17 of 32

How can we use LLMs to extract and classify topics from news collections for rapid contextualization?

[1] Chew, R., Bollenbacher, J., Wenger, M., Speer, J., & Kim, A. (2023). LLM-assisted content analysis: Using large language models to support deductive coding. arXiv preprint arXiv:2306.14924.

[2] Dolphin, R., Dursun, J., Chow, J., Blankenship, J., Adams, K., & Pike, Q. (2024). Extracting structured insights from financial news: An augmented llm driven approach. arXiv preprint arXiv:2407.15788.

[3] Tong, Z., Ding, Z., & Wei, W. (2025, January). EvoPrompt: Evolving Prompts for Enhanced Zero-Shot Named Entity Recognition with Large Language Models. In Proceedings of the 31st International Conference on Computational Linguistics (pp. 5136-5153).

[4] Nakshatri, N., Liu, S., Chen, S., Roth, D., Goldwasser, D., & Hopkins, D. (2023, December). Using LLM for improving key event discovery: Temporal-guided news stream clustering with event summaries. In Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 4162-4173).

[5] López, A. B., Pastor-Galindo, J., & Ruipérez-Valiente, J. A. (2024). LLM-assisted topic modeling for hate speech characterization.

Summarization & Interpretation:

LLMs streamline news summarization, extracting key events and contextual details rapidly [1-2].

Entity & Event Extraction:

Deep language models excel at named entity recognition (NER) and identifying event triggers in large text corpora [3-4].

Domain-Specific Insights:

Tailored LLM approaches can capture nuanced relationships (e.g., socio-political factors) from news archives [5].

18 of 32

RAG-TEC: Retrieval-Augmented Generation for Topic Extraction and Classification

A five-stage framework

19 of 32

We manually curated four collections

20 of 32

We manually curated four collections

21 of 32

From the top-k relevant documents in the collection, we use gpt-4o-mini to extract topics

The LLM serves as an information extraction model

We do not need to train a natural language processing model
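The retrieve-then-extract step can be sketched as follows. This is a minimal illustration under stated assumptions, not the RAG-TEC implementation: the slides do not say how relevance is scored, so TF-IDF cosine similarity stands in for the retriever; the prompt wording and the JSON-array answer format are invented for the example; and `llm` is a hypothetical callable that sends a prompt to a model such as gpt-4o-mini and returns its text reply.

```python
import json
import math
from collections import Counter
from typing import Callable


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokens with surrounding punctuation removed."""
    tokens = (t.strip(".,;:!?\"'()").lower() for t in text.split())
    return [t for t in tokens if t]


def top_k(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Rank docs by TF-IDF cosine similarity to the query (stand-in retriever)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(tokenize(d)))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def vec(text: str) -> dict[str, float]:
        # Terms unseen in the collection get weight 0.
        return {t: f * idf.get(t, 0.0) for t, f in Counter(tokenize(text)).items()}

    def cosine(u: dict[str, float], v: dict[str, float]) -> float:
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    q = vec(query)
    return sorted(docs, key=lambda d: cosine(q, vec(d)), reverse=True)[:k]


def extract_topics(query: str, docs: list[str], llm: Callable[[str], str],
                   k: int = 3, n_topics: int = 6) -> list[str]:
    """Send the top-k documents to the LLM and parse a JSON array of topics."""
    context = "\n\n".join(top_k(query, docs, k))
    prompt = (
        f"Read the news excerpts below and list {n_topics} topics that "
        "describe them. Answer with a JSON array of short topic names.\n\n"
        f"{context}"
    )
    return json.loads(llm(prompt))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a fixed reply for the demo."""
    return '["Election Violence", "Climate Challenges"]'
```

Passing the model as a callable keeps the sketch runnable offline; in practice `fake_llm` would be replaced by a function that calls the chosen model, and no model training is involved at any point.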

22 of 32

Six resulting topics from the Mozambique Collection

From the top-k relevant documents in the collection, we use gpt-4o-mini to extract topics

23 of 32

The LLM serves as a classification model

We do not need to train a machine learning model

We use gpt-4o-mini to classify each document into one main topic and many sub-topics
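A minimal sketch of this classification step, assuming the six Mozambique topics as the label set. The prompt wording, the JSON answer format, and the `llm` callable (a hypothetical function that sends a prompt to a model such as gpt-4o-mini and returns its text reply) are assumptions made for illustration, not the prompts used in the study.

```python
import json
from typing import Callable

# Topic names from the Mozambique collection.
TOPICS = [
    "Election Violence",
    "Climate Challenges",
    "Insurgency Crisis",
    "Social Media Restrictions",
    "Economic Disparities",
    "Humanitarian Response",
]


def build_prompt(document: str, topics: list[str]) -> str:
    """Assemble a zero-shot classification prompt (wording is hypothetical)."""
    options = "\n".join(f"- {t}" for t in topics)
    return (
        "Classify the news article into exactly one main topic and any number "
        "of sub-topics, all chosen from this list:\n"
        f"{options}\n\n"
        'Answer as JSON: {"main_topic": "...", "sub_topics": ["...", "..."]}\n\n'
        f"Article:\n{document}"
    )


def classify(document: str, llm: Callable[[str], str],
             topics: list[str] = TOPICS) -> dict:
    """Ask the LLM for labels and validate them against the known topics."""
    raw = json.loads(llm(build_prompt(document, topics)))
    main = raw["main_topic"]
    if main not in topics:
        raise ValueError(f"unknown main topic: {main!r}")
    # Keep known sub-topics only; drop the main topic and duplicates.
    subs: list[str] = []
    for topic in raw.get("sub_topics", []):
        if topic in topics and topic != main and topic not in subs:
            subs.append(topic)
    return {"main_topic": main, "sub_topics": subs}
```

Validating the reply against the fixed topic list guards against the model inventing labels, which matters because no trained classifier constrains the output space.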

24 of 32

The LLM serves as a classification model

We do not need to train a machine learning model

We use gpt-4o-mini to classify each document into one main topic and many sub-topics

25 of 32

To assess the topic quality for the collection, we use Topic Diversity and Coherence
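The two metrics can be sketched as follows. The slides do not give the exact formulas, so this assumes two common definitions: topic diversity as the share of unique words among the top words of all topics, and coherence as the mean normalized pointwise mutual information (NPMI) of word pairs, with probabilities estimated from document-level co-occurrence. The function names and the toy inputs are illustrative.

```python
import math
from itertools import combinations


def topic_diversity(topics: list[list[str]]) -> float:
    """Share of unique words among all top words (1.0 means no overlap)."""
    words = [w for topic in topics for w in topic]
    return len(set(words)) / len(words) if words else 0.0


def npmi_coherence(topic: list[str], docs: list[set[str]]) -> float:
    """Mean NPMI over all word pairs of one topic, in the range [-1, 1]."""
    if not docs:
        return 0.0
    n = len(docs)

    def p(*words: str) -> float:
        # Fraction of documents that contain every given word.
        return sum(all(w in d for w in words) for d in docs) / n

    scores = []
    for a, b in combinations(topic, 2):
        p_ab = p(a, b)
        if p_ab == 0.0:
            scores.append(-1.0)  # never co-occur: minimum by convention
        elif p_ab == 1.0:
            scores.append(1.0)   # co-occur in every document: maximum
        else:
            pmi = math.log(p_ab / (p(a) * p(b)))
            scores.append(pmi / -math.log(p_ab))
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: two topics sharing one word out of four.
print(topic_diversity([["election", "protest"], ["protest", "police"]]))  # 0.75
```

Higher diversity indicates less redundancy between topics, and higher coherence indicates that a topic's words tend to appear in the same documents.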

26 of 32

To assess the topic quality for the collection, we use Topic Diversity and Coherence

27 of 32

To assess the topic quality for the collection, we use Topic Diversity and Coherence

28 of 32

Finally, we can understand the collection from a topic perspective without relying on manual inspection

29 of 32

Finally, we can understand the collection from a topic perspective without relying on manual inspection

We illustrate the process with the Mozambique collection, whose six topics are Election Violence (EV), Climate Challenges (CC), Insurgency Crisis (IC), Social Media Restrictions (SMR), Economic Disparities (ED), and Humanitarian Response (HR).
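One way to read a collection from a topic perspective is to tally how often each topic appears as a main topic and as a sub-topic. The sketch below assumes per-document labels shaped as a main topic plus a list of sub-topics; the three sample labels are hypothetical and are not results from the study.

```python
from collections import Counter

# Abbreviations as given for the Mozambique collection.
ABBREVIATIONS = {
    "Election Violence": "EV",
    "Climate Challenges": "CC",
    "Insurgency Crisis": "IC",
    "Social Media Restrictions": "SMR",
    "Economic Disparities": "ED",
    "Humanitarian Response": "HR",
}


def topic_distribution(labels: list[dict]) -> dict[str, Counter]:
    """Count main-topic and sub-topic occurrences across labeled documents."""
    main = Counter(ABBREVIATIONS[doc["main_topic"]] for doc in labels)
    sub = Counter(ABBREVIATIONS[s] for doc in labels for s in doc["sub_topics"])
    return {"main": main, "sub": sub}


# Hypothetical labels for three documents (illustration only).
sample_labels = [
    {"main_topic": "Election Violence", "sub_topics": ["Social Media Restrictions"]},
    {"main_topic": "Election Violence", "sub_topics": []},
    {"main_topic": "Climate Challenges", "sub_topics": ["Humanitarian Response"]},
]

distribution = topic_distribution(sample_labels)
for code, count in distribution["main"].most_common():
    print(code, count)
```

The resulting counts give a quick overview of which themes dominate a collection before any article is read in full.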

30 of 32

Limitations

Small, manually curated micro-collections introduce selection bias and limit scalability.

Topic output is sensitive to prompt design and model behavior.

Evaluation was done without gold-standard labels.

Current approach uses only GPT-4o-mini; prompting strategies and open-source LLMs remain unexplored.

31 of 32

Future Work

Experiment with diverse prompt styles and open-source models (e.g., Mistral, LLaMA).

Automate micro-collection building to reduce manual selection bias.

Expand method to multilingual and cross-national corpora for comparative analysis.

32 of 32

Conclusion

RAG-TEC bridges interpretability and automation for topic modeling.

Emphasizes actors, agency, and mechanisms, which are vital for qualitative and mixed-method research.

Produces topics usable as initial codebooks for framing analysis.

Complements existing models by adding narrative richness and contextual sensitivity.

Demonstrated utility through four collections; extensible to other manually curated collections.

RAG-TEC empowers researchers to move beyond shallow topic labels and uncover the narrative mechanisms embedded in news discourse.
