1 of 190

NeuroSymbolic AI

for

Grounding,

Instructibility, and

Explainability

Tutorial 2025

2 of 190

Spread the Word

Making LLMs Explainable, Grounded, and Instructible

3 of 190

4 of 190

Focus of Tutorial

NeuroSymbolic AI and Instructible AI

Vector Symbolic Architectures

Explainability with Knowledge-infused Learning

Grounding with Retrieval Augmented Generation

OpenCHA: Domain Knowledge-driven

LLM-based Conversational Agent for Health

5 of 190

Tutorial’s Central Question

The "black box" nature of AI systems in high-stakes decision-making applications has raised concerns about transparency and reproducibility.

How can we reduce this black-box opacity?

6 of 190

Attention Maps and Feature Visualization

Layer Analysis and Activation Patterns

Attention Analysis and Probing Tasks

Token Attribution and Hidden States with Sparse Autoencoders

Mechanistic Interpretability

Behavioral Testing


7 of 190

Statistical AI is a Blackbox

NeuroSymbolic AI

8 of 190

Why NeuroSymbolic AI

Amit Sheth, Kaushik Roy, Manas Gaur, Neurosymbolic Artificial Intelligence (Why, What, and How), IEEE Intelligent Systems, 38 (3), May-June 2023

9 of 190

Explainability

🡪 Explanations that use terms and connections specific to a particular field or industry are more useful than general words that don't help people take action.

  • Features:
    • Local and Global Explanations
    • Causality
    • User-Appropriate Explanations
  • Features:
    • Symbolic Grounding
    • Functional Grounding

10 of 190

NeuroSymbolic AI

Neuro-symbolic AI techniques incorporate broader forms of knowledge (lexical, domain-specific, common-sense, and constraint-based) into addressing limitations of either symbolic or statistical AI approaches, such as model interpretations and user-level explanations. Compared to powerful statistical AI that exploits data, NeSy benefits from data and knowledge.

11 of 190

Neurosymbolic AI in the Era of Large Language Models

12 of 190

Applications of NeuroSymbolic AI with a focus on Grounding, Explainability, and Instructibility (EGI)

13 of 190

NeuroSymbolic in Machine Learning and Natural Language Processing

Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P., Bakhtin, A., Wu, Y., & Miller, A. (2019, November). Language Models as Knowledge Bases?. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

14 of 190

Neural AI

Data

Hyperparameters

Activation Functions

Loss Functions

Computing Power

and Model

Compression

Optimization

Symbolic AI

Knowledge Graphs

Lexicons

Rules

Workflow or Procedural Knowledge

Constraints

15 of 190

Benchmarking Example

WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions,

In BlackboxNLP @ EMNLP 2024

16 of 190

Wellness Dimension

17 of 190

Wellness Dimension

Wellness Dimension Definitions and Questionnaire

https://store.samhsa.gov/sites/default/files/sma16-4958.pdf

18 of 190

MultiWD and WellXplain Datasets

Content from ~4,000 users

6 Wellness Dimensions

  • Physical
  • Intellectual
  • Vocational
  • Social
  • Spiritual
  • Emotional

Clinical expert explanations

19 of 190

Input

Attention Matrix

Explanation

Prediction

Input

Attention Matrix

Explanation

Prediction

Definitions In-Context Learning

Design 1

Design 2

20 of 190

Input

Attention Matrix

Explanation

Prediction

Questionnaire

Workflow-based

In-Context Learning

Design 4

Input

Attention Matrix

Explanation

Prediction

Chain of Thoughts with Definitions

Design 3

21 of 190

Domain-Specific LLMs

General Purpose LLMs

22 of 190

Hybridized Architectures: NeuroSymbolic AI

23 of 190

Process Knowledge-infused Learning

24 of 190

Simple Text Classification

I am really struggling with my bisexuality, which is causing chaos in my relationship with a girl. Being a fan of the LGBTQ community, I am equal to worthless to her. I’m now starting to get drunk because I can’t cope with the obsessive, intrusive thoughts I need to get out of my head.

Don’t want to live anymore. Sexually assault, ignorant family members, and my never-ending loneliness brights up my path to death. 

I do have the potential to live a decent life, but not with people who abandon me. Hopelessness and feelings of betrayal have turned my nights into days. I am developing insomnia because of my restlessness.

I just can’t take it anymore. Been abandoned yet again by someone I cared about. I've been diagnosed with borderline for a while, and I’m just going to isolate myself and sleep forever.

Y = Suicide Ideation

Process Knowledge Infusion is a better approach than purely data-driven classification

25 of 190

Simple Text Classification

I am really struggling with my bisexuality, which is causing chaos in my relationship with a girl. Being a fan of the LGBTQ community, I am equal to worthless to her. I’m now starting to get drunk because I can’t cope with the obsessive, intrusive thoughts I need to get out of my head.

Don’t want to live anymore. Sexually assault, ignorant family members, and my never-ending loneliness brights up my path to death. 

I do have the potential to live a decent life, but not with people who abandon me. Hopelessness and feelings of betrayal have turned my nights into days. I am developing insomnia because of my restlessness.

I just can’t take it anymore. Been abandoned yet again by someone I cared about. I've been diagnosed with borderline for a while, and I’m just going to isolate myself and sleep forever.

Y = Suicide Ideation

Process Knowledge-based Classification

Has the subject wished he was dead or wished he could go to sleep and not wake up?
YES

Has the subject had any thoughts of killing himself?
YES

Has the subject been thinking about how he might do this?
NO

Has the subject had these thoughts and some intention of acting on them?
NO
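The question cascade above can be sketched as a small rule-based classifier. This is an illustrative sketch of process-knowledge-based classification: the question wording follows the slide, but the severity labels are hypothetical and this is not a clinical tool.

```python
# Process-knowledge-based classification: instead of mapping text directly
# to a label, answers to an ordered clinical question sequence determine
# the outcome. The labels below are illustrative only.

QUESTIONS = [
    "Has the subject wished he was dead or wished he could go to sleep and not wake up?",
    "Has the subject had any thoughts of killing himself?",
    "Has the subject been thinking about how he might do this?",
    "Has the subject had these thoughts and some intention of acting on them?",
]

def classify(answers):
    """Walk the question sequence; severity = index of the deepest 'yes'."""
    depth = 0
    for ans in answers:
        if not ans:          # the process stops at the first "no"
            break
        depth += 1
    if depth == 0:
        return "no ideation indicated"
    if depth <= 2:
        return "suicide ideation"          # matches Y in the slide example
    return "ideation with plan/intent"

print(classify([True, True, False, False]))  # → suicide ideation
```

The YES/YES/NO/NO pattern on the slide lands on the "suicide ideation" label, and the sequence of answers itself serves as the explanation.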

26 of 190

Simple Text Classification

I am really struggling with my bisexuality, which is causing chaos in my relationship with a girl. Being a fan of the LGBTQ community, I am equal to worthless to her. I’m now starting to get drunk because I can’t cope with the obsessive, intrusive thoughts I need to get out of my head.

Don’t want to live anymore. Sexually assault, ignorant family members, and my never-ending loneliness brights up my path to death. 

I do have the potential to live a decent life, but not with people who abandon me. Hopelessness and feelings of betrayal have turned my nights into days. I am developing insomnia because of my restlessness.

I just can’t take it anymore. Been abandoned yet again by someone I cared about. I've been diagnosed with borderline for a while, and I’m just going to isolate myself and sleep forever.

Y = Suicide Ideation

27 of 190

28 of 190

29 of 190

30 of 190

31 of 190

32 of 190

NeuroSymbolic AI in Social Media

33 of 190

NeuroSymbolic AI in Social Media

Symbolic AI

  • Knowledge Representation: LDA, Word2Vec; Maintains a "Constantly Updating Lexicon" as a formal knowledge representation structure
  • Semantic Processing: dedicated Location Extraction showing symbolic representation of geographical entities; Key Phrases/Hashtags as symbolic tokens with defined meanings.
  • Semantic Gap Management: Uses neologisms as a symbolic bridge between new terms and existing knowledge
  • Training Domain-Specific Models: LDA and Word2Vec.

34 of 190

NeuroSymbolic AI in Social Media

Symbolic AI Scoring

  • Semantic Mapping: A lexicon with metadata as fields and W2V model vocabulary as terms, created using cosine similarity.
  • Semantic Proximity: A lexicon with mental health concepts as fields and metadata concepts as terms, created using cosine similarity
  • Index Scoring: A cross-correlation matrix from mental health concepts to model vocabulary words, with metadata as transient terms.

35 of 190

NeuroSymbolic in Machine Learning and Natural Language Processing

Neural AI with Symbolic information

36 of 190

NeuroSymbolic in Machine Learning and Natural Language Processing

Anxiety

Depression, Cognitive distortions, panic attacks, hopelessness, physical sensations.

Depression

Mood swings, weight gain, rapid cycling, depressive episode, Impulsivity, mood swings, antisocial conduct, personality disorder

Addiction

Buying oxycodone, pain management, chronic pain, alienation, crippling alcohol, dependent on crack

37 of 190

NeuroSymbolic in Machine Learning and Natural Language Processing

The tables compare model performance for mental health classification across Precision, Recall, and F1-Score. The left table shows traditional models’ results with and without the Neurosymbolic approach, while the right table contrasts the NeuroSymbolic model with state-of-the-art LLMs like LLaMA, Phi, and Mistral.

The NeuroSymbolic model consistently outperforms both traditional models and state-of-the-art LLMs, achieving higher performance metrics and adaptability in mental health sentiment classification.

NB: Naïve Bayes

RF: Random Forest

38 of 190

CREST

39 of 190

Neural AI and Symbolic AI for achieving Consistency

Bonagiri, Vamshi Krishna, Sreeram Vennam, Priyanshul Govil, Ponnurangam Kumaraguru, and Manas Gaur. "SaGE: Evaluating Moral Consistency in Large Language Models." In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation LREC-COLING 2024.

40 of 190

Claim: LLMs are not semantically consistent, and can give contradictory answers to paraphrased questions

41 of 190

NeuroSymbolic Empirical Analysis

42 of 190

43 of 190

Semantic Graph-driven Consistent LLM Training (SaGE)

Moral Consistency Corpus (MCC)

44 of 190

Forbes, Maxwell, Jena D. Hwang, Vered Shwartz, Maarten Sap, and Yejin Choi. "SOCIAL CHEMISTRY 101: Learning to Reason about Social and Moral Norms." In EMNLP. 2020.

45 of 190

BLEURT

BLEU

ROUGE-L

BERT Score

SaGE (LLAMA 3)

GPT-4

46 of 190

Forbes, Maxwell, Jena D. Hwang, Vered Shwartz, Maarten Sap, and Yejin Choi. "SOCIAL CHEMISTRY 101: Learning to Reason about Social and Moral Norms." In EMNLP. 2020.

47 of 190

Integrated Mental Health Instruction Dataset (105K Samples)

Yang, Kailai, Tianlin Zhang, Ziyan Kuang, Qianqian Xie, Jimin Huang, and Sophia Ananiadou. "MentaLLaMA: interpretable mental health analysis on social media with large language models." In Proceedings of the ACM Web Conference 2024, pp. 4489-4500. 2024.

48 of 190

NeuroSymbolic AI for Reliability

Reliability

Grounding

Ensemble of Large Language Models

Bias Awareness

Mechanistic Interpretability

49 of 190

NeuroSymbolic AI for Reliability

Reliability

Grounding

Ensemble of Large Language Models

Bias Awareness

Mechanistic Interpretability

Explainability

Knowledge Gap Resolution

Source Attribution

Gorti, Atmika, Aman Chadha, and Manas Gaur. "Unboxing Occupational Bias: Debiasing LLMs with US Labor Data." In Proceedings of the AAAI Symposium Series, 2024.

Instructibility

50 of 190

Grounding

A successful AI teammate requires several cognitive capacities, including situation assessment, task behavior, language comprehension and generation, and knowledge gap resolution. Grounding enables agents with different capabilities to communicate.

51 of 190

Knowledge Gap Resolution

  • Bajaj, Goonmeet, Bortik Bandyopadhyay, Daniel Schmidt, Pranav Maneriker, Christopher Myers, and Srinivasan Parthasarathy. "Understanding knowledge gaps in visual question answering: Implications for gap identification and testing." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 386-387. 2020.

52 of 190

Language Gap: Pay Attention to Important Domain Concepts

53 of 190

Language Gap: Pay Attention to Important Domain Concepts

What Is the Knowledge Gap?

  • The model’s inability to handle tasks requiring factual updates or logical reasoning beyond its training data.
  • In mental health or legal scenarios, LLMs often overlook subtle context cues.

54 of 190

55 of 190

More Details during

Grounding with Retrieval Augmented Generation

56 of 190

NeuroSymbolic AI for Instructibility

Why Instructibility Needs More Than Instruction Tuning

Lin, Bill Yuchen, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, and Yejin Choi. "The unlocking spell on base llms: Rethinking alignment via in-context learning." arXiv preprint arXiv:2312.01552

57 of 190

NeuroSymbolic AI for Instructibility

Why Instructibility Needs More Than Instruction Tuning

Lin, Bill Yuchen, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, and Yejin Choi. "The unlocking spell on base llms: Rethinking alignment via in-context learning." arXiv preprint arXiv:2312.01552

  • Token-Level Impact Is Minimal
    • Instruction tuning modifies behavior on only a small fraction of tokens
    • Base and instruction-tuned models generate identical top token choices in most positions
  • Surface-Level Modifications
    • The tokens instruction tuning changes represent a tiny fraction of the model's overall output
    • Changes concentrate on discourse markers and transitional words
  • Base Model Capability
    • Given appropriate prompting, base models can match instruction-tuned performance
    • This indicates instruction tuning surfaces existing capabilities rather than teaching new ones

58 of 190

Most Importantly …

🚫 Why This Is Not Enough

  • Instruction tuning doesn’t fill the knowledge gap; it simply redirects existing behavior.
  • Complex domains need deeper reasoning capabilities that go beyond token-level adjustments.

How Instructibility Relates to Knowledge Gaps

  • A genuinely instructible LLM must dynamically recognize gaps in its knowledge.
  • Requires active reasoning mechanisms to retrieve and apply external knowledge.
  • The model must collaborate with external systems (symbolic reasoners, knowledge bases) to fill these gaps effectively.

59 of 190

Weighted Contextual Mutual Information

Knowledge gap

60 of 190

Gou, Tian, Boyao Zhang, Zhenglie Sun, Jing Wang, Fang Liu, Yangang Wang, and Jue Wang. "Rationality of thought improves reasoning in large language models." In International Conference on Knowledge Science, Engineering and Management, pp. 343-358. Singapore: Springer Nature Singapore, 2024.

61 of 190

62 of 190

LLM Generator

LLM Evaluator

63 of 190

64 of 190

Summary

Definition Integration: The WellDunn framework formalizes the incorporation of clinical definitions into mental health assessment systems, enabling a more accurate understanding of psychological conditions

Rule of Thumb Extraction and Contextualization: SaGE extracts clinical heuristics from mental health knowledge bases as rules of thumb, making LLM agents more empathetic and grounded.

Semantic Encoding and Decoding Optimization: The SEDO framework preserves nuanced psychological semantics when integrating expert knowledge into mental health assessment systems.

Process Knowledge-infused Learning demonstrates how therapeutic processes and intervention sequences can be incorporated into AI systems to provide ethically sound mental health support.

Knowledge Gaps Assessment enables LLMs to dynamically measure intrinsic contextual uncertainty during conversations, strategically resolving persona knowledge gaps through targeted questions rather than producing hallucinated responses when information is incomplete.

65 of 190

Handoff

Vector Symbolic Architectures

66 of 190

Vector Symbolic Architectures in Deep Learning

Edward Raff

AAAI TUTORIAL: NEUROSYMBOLIC AI FOR EGI | 24 FEB 2025

Innovation center, Washington, D.C.

67 of 190

Vector Symbolic Architectures

This document is confidential and intended solely for the client to whom it is addressed.

68 of 190

VSA operations


69 of 190

Holographic Reduced Representations: a primer


Booz Allen Hamilton

Consider a d = 3 dimensional space where we wish to compute c† ⊗ (c ⊗ x). Since c† is the approximate inverse of c under circular convolution, the result is approximately x plus a small noise term: unbinding with c† recovers the item that was bound with c.
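A minimal NumPy sketch of the standard HRR operations described on this slide: binding is circular convolution (computed via the FFT), and unbinding with the approximate inverse c† (the involution, which reverses all elements but the first) recovers the bound item up to noise.

```python
import numpy as np

def bind(a, b):
    """HRR binding: circular convolution via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def inv(c):
    """Approximate inverse (involution): reverse all but the first element."""
    return np.concatenate(([c[0]], c[1:][::-1]))

rng = np.random.default_rng(0)
d = 1024
c = rng.normal(0, 1 / np.sqrt(d), d)   # HRR vectors: elements ~ N(0, 1/d)
x = rng.normal(0, 1 / np.sqrt(d), d)

retrieved = bind(inv(c), bind(c, x))   # c† ⊗ (c ⊗ x)
cos = retrieved @ x / (np.linalg.norm(retrieved) * np.linalg.norm(x))
print(cos > 0.5)  # True: x is recovered well above chance
```

The retrieved vector is noisy, so VSA systems typically follow unbinding with a clean-up step that snaps the result to the nearest known item.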

70 of 190

Extreme Multi-label classification (XML)


Eli Chien, Jiong Zhang, Cho-Jui Hsieh, Jyun-Yu Jiang, Wei-Cheng Chang, Olgica Milenkovic, and Hsiang-Fu Yu. 2023. PINA: leveraging side information in extreme multi-label classification via predicted instance neighborhood aggregation. In Proceedings of the 40th International Conference on Machine Learning (ICML'23), Vol. 202. JMLR.org, Article 224, 5616–5630.

71 of 190

A symbolic version of XML


72 of 190

Smaller models


73 of 190

Deploying Convolutional Networks on Untrusted Platforms Using 2D Holographic Reduced Representations

  • Say you want to deploy your model on AWS to reduce local compute costs.
    • But you don’t fully trust AWS not to peek at your data/model and try to infer what you are doing.
  • Homomorphic encryption is cool, but far too slow to be practical here.
  • We can exploit the fact that the result of binding is dissimilar to both of its inputs.
  • HRR is defined via the FFT, so we can generalize to images via a 2D FFT.
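A sketch of the underlying mechanism, assuming a toy 32×32 "image" and an exact frequency-domain inverse for clarity (HRR work typically uses the approximate involution inverse instead): the untrusted server only ever sees the bound array, while the secret holder can unbind.

```python
import numpy as np

def bind2d(a, b):
    # 2D HRR binding: circular convolution via the 2D FFT
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.fft.fft2(b)))

def inv2d(s):
    # Exact inverse in the frequency domain (assumes no zero FFT bins);
    # shown for illustration only.
    return np.real(np.fft.ifft2(1.0 / np.fft.fft2(s)))

rng = np.random.default_rng(1)
img = rng.random((32, 32))                 # stand-in for an input image
secret = rng.normal(0, 1 / 32, (32, 32))   # the locally held secret

hidden = bind2d(img, secret)               # what the untrusted server sees
recovered = bind2d(hidden, inv2d(secret))  # the secret holder unbinds

print(np.allclose(recovered, img, atol=1e-6))  # True: exact recovery up to FFT round-off
```

Because the bound array is dissimilar to the original image, an observer without the secret sees something close to noise.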


74 of 190

Connectionist Symbolic Pseudo Secrets (CSPS)


75 of 190

CSPS classification accuracy

  • Training is done entirely locally on one device; it is prediction we are concerned with.
  • Accuracy is reasonable, but there is some degradation as images get larger.
  • You can improve accuracy by running the prediction k times with k different secrets and averaging the results.
  • This is far faster than FHE!


76 of 190

Do you get any data saving?

  • CSPS can offload at least 65% of computation to a remote machine effectively
    • Some work needs to be done locally for unbind/binding, and we need a small classification head to run locally (too hard for a “linear probe”)


77 of 190

Are we giving away any secrets?

  • Utility is predicated on obscurity!
    • Can the adversary eavesdrop on the input or the output and tell what is happening?
  • Use clustering algorithms, given the true number of classes, to see if they can find the signal. The Adjusted Rand Index should be 0 for purely random clustering; the observed scores are far below 100%!


Network’s Input

Network’s Output

78 of 190

What if the adversary was even smarter

  • Training a model directly on CSPS data with true labels would require an impossibly strong adversary.
  • That even this adversary barely beats random guessing is a good sign for heuristic security.
    • MNIST, SVHN, and CIFAR-10 accuracies are at best 2.1× random guessing
    • For CIFAR-100 and Mini-ImageNet they are 2.6× and 4.7× random guessing


Random-guess baselines: 10% (MNIST), 10% (SVHN), 10% (CIFAR-10), 1% (CIFAR-100), 1% (Mini-ImageNet)

79 of 190

Adversary has access to some of the original images, can it learn the secret using projected gradient descent (PGD)?

WHAT ABOUT USING ADVERSARIAL ML?

80 of 190

Adversary has access to some of the original images, can it learn by training an Autoencoder?

WHAT ABOUT TRAINING AN AUTO-ENCODER?

81 of 190

Self Attention


82 of 190

Self Attention with HRRs


 

We can think of Self-Attention as a fuzzy dictionary. We are finding the match between query and key and returning the corresponding value – made fuzzy by averaging based on similarity.

So we can perform querying against this “dictionary”, using HRRs as an inductive bias toward key/value lookups!

83 of 190

Self Attention with HRRs: Implementation


 

84 of 190

Self Attention with HRRs: Noise?


Our self-attention works without Gaussian i.i.d. coefficients. How? Consider the H-dimensional vectors a, b, c, d, and z. If each element of these vectors is sampled from N(0, 1/H), then querying the bundle a ⊗ b + c ⊗ d for a bound item such as a yields a similarity ≈ 1, while querying for the absent item z yields ≈ 0. Now let's pretend we have 2D data: we can query for a + z, or for c + z, and in either case retrieve the corresponding value plus noise.

In both queries the noise terms share many coefficients and so have similar magnitude. We can therefore interpret the noise as an additional constant ε added to each score. When we then apply the softmax, we benefit from the fact that softmax is invariant to constant shifts of its input, i.e., ∀ε ∈ ℝ, softmax(x + ε) = softmax(x). Thus our softmax effectively acts as a clean-up operation over the original values!
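The shift-invariance property is easy to verify numerically; this minimal check assumes the standard max-subtracting softmax implementation.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return z / z.sum()

x = np.array([1.0, 2.0, 3.0])
eps = 7.5  # any constant shift, standing in for the shared noise term
print(np.allclose(softmax(x), softmax(x + eps)))  # True
```

Since the constant shift cancels inside the softmax, the shared noise term is washed out of the attention weights.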

85 of 190

Long Range Arena Results

  • We use each of the LRA datasets. Feature vectors were available for all datasets, so all have FC results.
  • Most methods are worse than a naïve Transformer, but Hrrformer is always better!


86 of 190

Interpretability

  • We can visualize the attention weights that our model uses for each prediction, and see if they correspond with the content of the image.
  • In doing so, we see that the attention maps precisely to informative outlines/content of the image.
  • Remember: the task is linearized images! So the model is learning 2D structure from 1D representations!

87 of 190

Fast & Low Memory Training

  • Transformers are being used in ever larger and more expensive models. Are we fighting that trend? Yes!
  • Hrrformer is the fastest by far compared to all prior methods, up to 2 orders of magnitude.
  • The Hrrformer uses less memory to train by an order of magnitude or more, depending on what baseline we compare against
  • It is nearly the most accurate on the LRA benchmarks.

88 of 190

Fast Predictions

  • The FFT function has better numerical behavior as a function of batch size than matrix multiplication.
  • The gap between time/sample is smaller for varying batch sizes
  • Even our worst case time is better than a transformer’s best-case time!

89 of 190

Malware results


90 of 190

A Walsh Hadamard Derived Linear Vector Symbolic Architecture


91 of 190

Properties of the HLB


92 of 190

Good at classical VSA tasks

  • When binding with random or repeated VSA vectors, HLB’s similarity score remains constant.
  • The magnitude of the vector does not change either as more items are bound
  • Better to design around something that has a known response


93 of 190

Better at XML classification


94 of 190

Does better at CSPS

  • HLB is more accurate at the CSPS task than prior VSAs
  • Also better at hiding its information for CSPS too!
  • Because CSPS is purely elementwise operations, no extra work for 2D/n-D generalization


95 of 190

Questions?


Edward Raff

EdwardRaff.com

Raff_Edward@bah.com

We can use VSAs to create neuro-symbolic AI methods

  • We can be clever with the loss function to impose symbolic constraints
  • We can design layers with symbolic interpretations as a way to express priors
  • We can design and simplify our approach to VSAs to achieve better results

96 of 190

Handoff

97 of 190

Grounding Blackbox Language Models with Retrieval Augmented Generation of Diverse Knowledge Forms

Deepa Tilwani, PhD Candidate, University of South Carolina

98 of 190

Introduction and Motivation (Part 1)

99 of 190

Progress in Language Modelling

Symbolic Era

Pre - 1990

  • Rule Based
  • Expert Systems
  • Limited Generalization

Statistical Era

1990 - 2006

  • Data-driven Approaches
  • Probabilistic Models

Scale Era

2006 onwards

  • Deep Learning and neural nets
  • General Purpose LMs
  • Massive Datasets and Computation

Turing Test (1950) → ELIZA (1966) → ChatGPT (2022)

100 of 190

ELIZA (1966) : THE FIRST CHATBOT

Early NLP program developed by Joseph Weizenbaum at MIT. Created illusion of a conversation by rephrasing user statements as questions using pattern matching and substitution methodology. One of the first programs capable of attempting the Turing test.

101 of 190

The LLM Era: How Do They Work?

102 of 190

Word Embeddings

Represent each word using a “vector” of numbers.

  • Converts a “discrete” representation to “continuous”.
  • Many benefits:
    • More “fine-grained” representations of words.
    • Useful computations such as cosine and Euclidean distance.
    • Visualization and mapping of words onto a semantic space.
    • Can be learned in a self-supervised manner from a large corpus.
  • Examples:
    • Word2Vec (2013), GloVe, BERT, ELMo
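A toy illustration of these benefits with hand-picked 3-d vectors (real embeddings are learned and have hundreds of dimensions; the words and values here are illustrative only):

```python
import numpy as np

# Toy 3-d "embeddings"; real Word2Vec/GloVe vectors have 100-300 dims
# and are learned from a corpus rather than hand-picked.
emb = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.7, 0.7, 0.1]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine(u, v):
    """Cosine similarity: one of the useful computations on continuous vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Continuous vectors support fine-grained similarity comparisons:
print(cosine(emb["king"], emb["queen"]) > cosine(emb["king"], emb["apple"]))  # True
```

A discrete (one-hot) representation cannot express this: every pair of distinct words would have similarity exactly zero.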

103 of 190

Seq2Seq Models

Recurrent Neural Networks (RNNs)

● Long Short-Term Memory Networks (LSTMs)

● Capture dependencies between input tokens

● Gates control the flow of information

A simple RNN shown unrolled in time. Network layers are recalculated for each time step, while weights U, V and W are shared across all time steps.

  • The inputs to each unit consist of the current input xt, the previous hidden state ht-1, and the previous context ct-1.
  • The outputs are a new hidden state ht and an updated context ct.

104 of 190

Transformers

  • Attention allows the model to “focus” on particular aspects of the input while generating the output.
  • This is done using a set of parameters, called “weights,” that determine how much attention should be paid to each input token at each time step.
  • These weights are computed from the input and the current hidden state of the model.

In encoding the word "it", one attention head is focusing most on "the animal", while another is focusing on "tired". The model's representation of the word "it" thus bakes in some of the representation of both "animal" and "tired".

https://jalammar.github.io/illustrated-transformer/
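The weight computation described above is realized in transformers as scaled dot-product attention; here is a minimal NumPy sketch (the shapes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: the weights say how much each
    input token contributes to each output position."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # shape (n_queries, n_keys)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 query positions, dim 4
K = rng.normal(size=(5, 4))   # 5 key positions
V = rng.normal(size=(5, 4))   # one value per key
out, w = attention(Q, K, V)
print(out.shape, np.allclose(w.sum(axis=1), 1.0))  # (3, 4) True
```

Each row of the weight matrix is a distribution over input tokens, which is exactly what the attention-head visualizations (e.g., for the word "it") display.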

105 of 190

Pre-Training: Data Preparation

A typical data preparation pipeline for pre-training LLMs:

W. Zhao et al. A Survey of Large Language Models. 2023.

106 of 190

What Can LLMs Do?

Evolution of LMs from Perspective of Task-Solving Capacity

W. Zhao et al. A Survey of Large Language Models. 2023.

107 of 190

Few-Shot Prompting

T. Brown et al. Language Models are Few-Shot Learners. NeurIPS 2020.

Instruction:

Classify the sentiment of the given text as either positive or negative based on the examples provided.

Few-shot examples:

"Great product, 10/10": {"label": "positive"}

"Didn't work very well": {"label": "negative"}

"Super helpful, worth it": {"label": "positive"}

Input: "Amazing quality and fast shipping!"

LLM

Ideal Output:

{"label": "positive"}

108 of 190

Chain-of-Thought Prompting

Instruction:
Classify the sentiment of the given text as either positive or negative. Follow a step-by-step reasoning process to determine the sentiment.

Reasoning:

  • "Wow!" expresses excitement and enthusiasm, indicating a positive reaction.
  • "This is fantastic quality" suggests high satisfaction with the product’s quality.
  • "Fast shipping" is another positive aspect, showing appreciation for timely delivery.
  • All elements in the sentence convey strong positivity.

Output: {"label": "positive"}

LLM

Input: "Wow! This is fantastic quality and fast shipping!"

Examples:

  1. Input: "Great product, 10/10"
     Reasoning: The phrase "Great product" expresses strong approval, and "10/10" indicates a perfect rating, showing high satisfaction.
     Output: {"label": "positive"}
  2. Input: "Didn't work very well"
     Reasoning: The phrase "Didn't work" suggests malfunction or failure, and "very well" implies that it performed below expectations. This conveys dissatisfaction.
     Output: {"label": "negative"}
  3. Input: "Super helpful, worth it"
     Reasoning: "Super helpful" indicates a high level of usefulness, and "worth it" suggests that the person finds the product valuable. This implies strong satisfaction.
     Output: {"label": "positive"}

J. Wei et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.

109 of 190

From Prompting to Fine-Tuning

Unlike prompting, fine-tuning actually changes the model under the hood, giving better domain- or task-specific performance.

https://x.com/karpathy/status/1655994367033884672

Fine-Tuning

110 of 190

Custom Trained Model in Law: Harvey AI

  • Startup building a custom-trained case law model for drafting documents, answering questions about complex litigation scenarios, and identifying material discrepancies between hundreds of contracts.
  • Added 10 billion tokens worth of data to power the model, starting with case law from Delaware and then expanding to include all of U.S. case law.
  • Attorneys from 10 large law firms preferred the custom model’s output versus GPT-4’s 97% of the time. The main benefit was reduced hallucinations!

Open AI Customer Stories: Harvey. April 2024.

111 of 190

Parameter Efficient Fine-Tuning (PEFT)

Techniques like LoRA construct a low-rank parameterization for parameter efficiency during training. For inference, the model can be converted to its original weight parameterization to ensure unchanged inference speed.

E. Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.

GPT-3 175B validation accuracy vs. number of trainable parameters of several adaptation methods on WikiSQL. LoRA exhibits better scalability and task performance.
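A minimal numerical sketch of the LoRA idea (the dimensions, rank, and scaling below are illustrative): only the small matrices A and B are trained, and their product can be merged back into W for inference.

```python
import numpy as np

# LoRA sketch: the frozen weight W gets a trainable low-rank update
# B @ A scaled by alpha / r; only A and B are updated during training.
d_out, d_in, r, alpha = 64, 64, 4, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(0, 0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                  # trainable, zero init: no change at start

def lora_forward(x):
    return W @ x + (alpha / r) * (B @ (A @ x))

# For inference the update can be merged back: W' = W + (alpha/r) * B @ A,
# so there is no extra latency versus the original layer.
W_merged = W + (alpha / r) * B @ A
x = rng.normal(size=d_in)
print(np.allclose(lora_forward(x), W_merged @ x))  # True

# Parameter savings: full fine-tuning trains d_out*d_in weights,
# LoRA trains only r*(d_in + d_out).
print(d_out * d_in, r * (d_in + d_out))  # 4096 512
```

The merge step is why LoRA keeps inference speed unchanged, as the slide notes: the low-rank factors exist only during training.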

112 of 190

Why the need for Trustworthiness in Generative AI?

113 of 190

Unreliable Reasoning Even On Simple Tasks

Probably due to tokenization!

Generated by gpt-4o’s tokenizer.

Try it out at:

https://tiktokenizer.vercel.app/

Easy reasoning, Sure!

Got confused ??

114 of 190

Jailbreaking Can Bypass Safety

Jailbreaking is the process of altering prompts to evade an LLM’s safeguards, resulting in harmful outputs.

PAIR, influenced by social engineering attacks, involves an attacker LLM that autonomously generates jailbreaks for a targeted LLM. The attacker LLM repeatedly interacts with the target LLM, refining and improving a jailbreak—often within twenty queries.

P. Chao et al. Jailbreaking Black Box Large Language Models in Twenty Queries. 2023.

115 of 190

The Story of a Lawyer Who Employed ChatGPT … trust issues remain

A lawyer, representing a client against an airline, turned to AI assistance for drafting legal documents. The results were less than ideal. https://www.nytimes.com/2023/05/27/nyregion/avianca-airline-lawsuit-chatgpt.html

Legal Consequences for Attorneys Using ChatGPT

Lawyer Acknowledges AI Misuse in Court: During court session, an attorney admitted excessively relying on AI, resulting in a legal motion filled with artificial legal references. https://www.nytimes.com/2023/06/08/nyregion/lawyer-chatgpt-sanctions.html

116 of 190

Neuro Symbolic Legal AI

OpenAI Customer Stories: Harvey. April 2024.

117 of 190


118 of 190


119 of 190

More Challenges for Generative AI

  • Nearly impossible to explain or reason about generative answers
  • Prompt injections can leak data
  • Context windows are and will remain limited
  • Bias in large language models that supervised learning cannot reduce
  • Reliability issues: different large language models yield different outcomes
  • Inconsistency in prompts for completeness in outcomes

120 of 190

Grounding (Part 2)

121 of 190

Grounding

Grounding is defined as ensuring that an LLM generates verifiable and well-grounded responses to any prompt, relying solely on information from a user-specified knowledge base.

Grounded means that every claim in the response is attributable to a document in the knowledge base

Verifiably-grounded means that every claim is backed by an appropriate citation

A knowledge base may be a private corpus, a public-domain corpus, or the entire Web

E.g., a healthcare customer may specify a set of journals they trust

122 of 190

Two Core Approaches to Grounded AI

Grounded Generation – Enhancing AI with Verified Knowledge

Method:

  • Retrieve relevant facts from a trusted knowledge base.
  • Augment LLM prompts with verified context before generating responses.
  • Intrinsic phenomena

Grounding Verification – Ensuring AI’s Responses Are Factually Correct

Method:

  • Cross-check AI-generated claims with authoritative sources.
  • Generate citations to improve transparency and accountability.
  • Apply fact-checking models to filter unverified claims.
  • Extrinsic phenomena
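The two approaches compose naturally: generate from retrieved context, then verify each claim against that context. A minimal sketch, where `retrieve`, `generate`, and `supports` are user-supplied stand-ins for a real retriever, LLM, and fact-checking model:

```python
def grounded_generation(query, knowledge_base, retrieve, generate, k=3):
    """Grounded generation: retrieve verified context, then condition the LLM on it."""
    docs = retrieve(query, knowledge_base, k)          # step 1: fetch trusted facts
    prompt = "Answer using ONLY the context below.\n"
    prompt += "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    prompt += f"\nQuestion: {query}"
    return generate(prompt), docs                      # step 2: augmented generation

def grounding_verification(claims, docs, supports):
    """Grounding verification: keep only claims backed by a document, with citations."""
    verified = []
    for c in claims:
        cites = [i + 1 for i, d in enumerate(docs) if supports(c, d)]
        if cites:                                      # unverified claims are filtered
            verified.append((c, cites))
    return verified
```

In practice `supports` would be an NLI or fact-checking model rather than the exact-match toy used in testing.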

123 of 190


Why Grounded Generation?

LLAMA

“Grounded generation retrieves latest clinical guidelines and provides an evidence-based response”

Not grounded but a generic answer!

124 of 190

Why Grounding Verification?

INPUT: What is the target blood pressure for men?

Not according to 2017 guidelines

It should first verify who the intended audience is before ensuring factual accuracy.

125 of 190

Types of Grounding in AI & LLMs

1) Symbolic Grounding

AI must retrieve, recognize, and structure information correctly before using it (i.e., LLMs should understand and link symbols like words, phrases, and numbers to their real-world meanings).

  1. Retrieval-Augmented Generation Based Grounding
    • Method: AI retrieves external documents before responding.
    • Example: a QA system fetching Wikipedia articles before answering a historical question; a chatbot retrieving product manuals before explaining a feature.
  2. Knowledge Graph-Based Grounding
    • Method: AI structures information in graphs to improve contextual understanding.
    • Example: a search engine linking related topics (e.g., connecting “COVID-19” with “vaccines” and “pandemics”); legal NLP models linking case law, statutes, and judicial precedents to provide structured responses.

2) Functional Grounding

LLMs should reason, verify, and adapt responses based on context (i.e., apply knowledge correctly in context).

  1. Attribution-Based Grounding
    • Method: AI justifies responses with references and citations.
    • Example: a fact-checking AI cross-referencing news claims with verified sources before publishing; AI writing research papers by citing scientific studies instead of generating unsupported claims.
  2. Interactive & Reinforcement-Based Grounding
    • Method: AI learns from real-time feedback and improves responses over time.
    • Example: customer support chatbots adapting based on user corrections (e.g., learning new slang or product updates); language models fine-tuned through user interactions to generate more accurate, personalized responses.

126 of 190

Symbolic Grounding

127 of 190

  1. Retrieval-Augmented Generation (RAG)

LLMs lack knowledge beyond their training date, and frequent model updates are impractical.

Idea: Enhance LLMs with a retrieval system!

Lewis, Patrick, et al. "Retrieval-augmented generation for knowledge-intensive nlp tasks." Advances in Neural Information Processing Systems 33 (2020): 9459-9474.
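A minimal sketch of the retrieval step, with a toy bag-of-words embedding standing in for a neural encoder (the `embed`/`retrieve` names are illustrative, not from the paper):

```python
import numpy as np

def embed(text, vocab):
    """Toy bag-of-words embedding; a real RAG system uses a neural encoder."""
    v = np.array([text.lower().count(w) for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(query, corpus, vocab, k=2):
    """Rank documents by cosine similarity to the query and return the top-k."""
    q = embed(query, vocab)
    return sorted(corpus, key=lambda d: -float(embed(d, vocab) @ q))[:k]
```

The retrieved documents are then prepended to the LLM prompt, so the model can answer with knowledge beyond its training date without retraining.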

128 of 190

Advantages of RAG

  • Fact-checking
  • Safe
  • Custom training
  • Cost-effective
  • Continuous updates
  • Accessible and affordable
  • Domain knowledge
  • Easier to customize

129 of 190

Symbolic Retrieval based Grounding

  • RAG enabled system
  • Domain-Specific Training
  • Enhanced Accuracy and Relevance
  • Customization for Business Needs
  • Business Alignment

Source Attribution: Retrieve, recognize, and attribute

Grounded and targeted for generating citations with structured metadata

Image: Tilwani, Deepa, et al. "REASONS: A benchmark for REtrieval and Automated citationS Of scieNtific Sentences using Public and Proprietary LLMs." arXiv preprint arXiv:2405.02228 (2024).

An Evaluation study of Citation Generation on Recent LLMs

130 of 190

How do we do Symbolic Retrieval-based Grounding?

131 of 190

But a Few Limitations of RAG…

  • Needs an existing database
  • Context length limitations
  • Latency issues
  • Dependent on semantic search
  • Hallucination still exists
  • At scale, sensitive to choices of:
    1) Chunking strategy,
    2) Embedding model, and
    3) Generation model.

132 of 190

Flaws in RAG from REASONS Dataset

Only Adv. RAG was able to correctly generate author names

133 of 190

Latency Issues

134 of 190

Latency per domain and method:

| Domain       | OpenAI | M     | L     | D     | RM     | RL     | P     | AdvRAG(L) | AdvRAG(M) |
|--------------|--------|-------|-------|-------|--------|--------|-------|-----------|-----------|
| AI           | 34:25  | 26:03 | 11:10 | 34:11 | 74:49  | 73:09  | 34:31 | 156:24    | 163:28    |
| CV           | 47:45  | 18:35 | 19:24 | 50:22 | 189:20 | 198:45 | 42:05 | 259:32    | 302:14    |
| Cryptography | 03:50  | 02:18 | 04:59 | 32:21 | 83:28  | 89:21  | 13:23 | 190:19    | 194:25    |
| Graphics     | 07:08  | 08:55 | 06:08 | 58:43 | 108:08 | 127:48 | 16:52 | 214:25    | 227:23    |
| HCI          | 03:01  | 01:10 | 00:42 | 21:56 | 48:32  | 50:51  | 02:47 | 95:56     | 98:44     |
| IR           | 20:31  | 11:40 | 06:52 | 33:34 | 91:30  | 99:43  | 19:50 | 193:37    | 202:23    |
| NLP          | 28:26  | 11:42 | 05:09 | 47:24 | 91:07  | 88:40  | 13:06 | 175:58    | 156:49    |

135 of 190

2. Knowledge Graphs (KG) Based Grounding

  • A machine-readable, structured representation of knowledge
  • Consisting of entities, entity types, and relationships in various forms (e.g., ontologies, lexicons, labeled property graphs, and RDF)

Speer et al. AAAI’17

Vrandečić et al. ACM Comm’14

Gaur et al. ICSC’19

Miller, ACM Comm’95

ConceptNet

Example triple: subject = World War I, predicate = fought_with, object = Poisonous Gas

“KG-based grounding structures information in graphs, linking concepts to improve AI systems' ability to retrieve and generate meaningful responses.”
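A toy sketch of the idea, storing the slide's example as (subject, predicate, object) triples and linking a mention to its graph neighbors (`neighbors` is an illustrative helper, not a standard KG API):

```python
# A knowledge graph as (subject, predicate, object) triples.
KG = [
    ("World War I", "fought_with", "Poisonous Gas"),
    ("COVID-19", "related_to", "vaccines"),
    ("COVID-19", "related_to", "pandemics"),
]

def neighbors(entity, kg=KG):
    """Ground a mention by linking it to connected entities in the graph."""
    out = []
    for s, p, o in kg:
        if s == entity:
            out.append((p, o))       # outgoing edge
        elif o == entity:
            out.append((p, s))       # incoming edge
    return out
```

A retrieval or generation system can use such neighbor lookups to attach real-world context to symbols appearing in a query.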

136 of 190

Illustration of ISEEQ

Figure: ISEEQ pipeline. A query (Title: “Economy and Employment Statistics”; Description: “Learn information about key economic concepts including gdp, inflation, and the influence on employment”) is constituency-parsed into key phrases: Information + {economy, employment statistics, employment, influence employment, inflation influence employment, gdp, gdp influence employment, key economic concepts}. A ConceptNet graph (economics, economy, inflation, employment, gdp, gross domestic product, unemployment, gnp, gross national product, national income, cost of living, income, personal income, income tax) provides semantic query expansion, and Sentence-BERT encoders with generative adversarial reinforcement learning generate information-seeking questions (ISQs) such as:

1. What is gross_domestic_product?
2. What is the measure of gross_domestic product?
3. What is the reason nation income relations gross_domestic_product?
4. What is the influence of inflation to gross_domestic_product?
5. What is the meaning of unemployment in inflation?
6. What is the influence of inflation on cost_of_living?

137 of 190

  • ISEEQ could be used in a medical triaging system, where a knowledge graph links symptoms, diseases, and treatments.
  • A patient query like "chest pain" could be expanded using KG-based knowledge to generate Information seeking questions (ISQs) such as:
    • "Do you also experience shortness of breath?"
    • "Have you had previous heart conditions?"
    • "Are the symptoms triggered by physical activity?"
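The KG-based expansion described above can be approximated with a template-based sketch (`RELATED` is a toy graph and the templates are illustrative; ISEEQ itself uses ConceptNet and a learned generator):

```python
# Toy concept graph in the spirit of ISEEQ's ConceptNet expansion.
RELATED = {
    "chest pain": ["shortness of breath", "heart conditions", "physical activity"],
}

def expand_query(query, graph=RELATED):
    """Expand a patient query with its KG neighbors."""
    return [query] + graph.get(query, [])

def generate_isqs(query, graph=RELATED):
    """Turn each expanded concept into an information-seeking question.

    A fixed template stands in for ISEEQ's generative adversarial RL model.
    """
    return [f"Do you also experience {c}?" for c in graph.get(query, [])]
```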

(A) An example of curiosity-driven ISQs generated by

ISEEQ. (B) overview of ISEEQ

Gaur, M., Gunaratna, K., Srinivasan, V., & Jin, H. (2022). ISEEQ: Information Seeking Question Generation Using Dynamic Meta-Information Retrieval and Knowledge Graphs. AAAI 2022

138 of 190

Functional Grounding

139 of 190

Attribution-Based Functional Grounding

A fact-checking approach for cross-referencing news claims with verified sources before publishing. Helps with:

  • Factual verification
  • Reducing hallucinations
  • External knowledge integration
  • Error categorization

Evaluating attribution and identifying specific types of errors with AttrScore, which explores two approaches: (1) prompting LLMs, and (2) fine-tuning LMs on simulated and repurposed datasets from related tasks.

Yue, Xiang et al. “Automatic Evaluation of Attribution by Large Language Models.” EMNLP (2023).

140 of 190

Reinforcement Learning for Grounding

What constitutes a good response for a given query and context is quite nuanced.

Idea: capture this with a reward model that scores each <query, context, response> triple on the appropriateness of the response. The reward model may be trained on a dataset specifying preferences between response pairs.

We can then use reinforcement learning to tune the model to maximize reward while staying within a bounded KL-divergence from the initial model.
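A sketch of the two ingredients, assuming scalar rewards and log-probabilities are available (function names are illustrative): the reward model is trained on preference pairs with a Bradley–Terry loss, and the policy maximizes reward minus a KL penalty that keeps it near the initial model.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry loss for training the reward model on response pairs:
    -log sigmoid(r_chosen - r_rejected). Lower when the chosen response scores higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def kl_regularized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Per-sample RL objective: reward minus beta times a sample estimate of
    KL(policy || reference), i.e. log pi(y|x) - log pi_ref(y|x)."""
    return reward - beta * (logp_policy - logp_ref)
```

The beta coefficient controls how far the tuned policy may drift from the initial model while chasing reward.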


141 of 190

Interactive & Reinforcement-Based Grounding

Interactive & Reinforcement-Based Grounding ensures that LLMs do not just generate blindly but engage in a feedback-driven, iterative process to reason, verify, and adapt responses based on context.

CodeRL treats the code-generating language model as an actor network and introduces a critic network trained to predict the functional correctness of generated programs, providing dense feedback signals to the actor.

Le, H., Wang, Y., Gotmare, A. D., Savarese, S., & Hoi, S. C. H. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Neurips 2022

142 of 190

Check out our Dataset for Interactive & Reinforcement-Based Grounding at AAAI 2025

POSTER PRESENTATION ON 28TH FEB

Can LLMs Obfuscate Code? A Systematic Analysis of Large Language Models into Assembly Code Obfuscation

Seyedreza Mohseni, Seyedali Mohammadi, Deepa Tilwani, Yash Saxena, Gerald Ketu Ndawula, Sriram Vema, Edward Raff, Manas Gaur

143 of 190

Grounding Verification

Despite progress in generating grounded responses, post-hoc verification of generated responses is still indispensable

  • Especially in domains like healthcare where we may want 100% grounding
  • Especially when the query is complex and/or the retrieval quality is not good
  • Especially if verifiable-correct citations are required for each claim

144 of 190


Symbolic and Functional Grounding Together

LLAMA

Domain Knowledge: PHQ9 Depression ontology

LLAMA + Domain Knowledge Output

S. Dalal, D. Tilwani, M. Gaur, S. Jain, V. L. Shalin and A. P. Sheth, "A Cross Attention Approach to Diagnostic Explainability Using Clinical Practice Guidelines for Depression," in IEEE Journal of Biomedical and Health Informatics

“Grounded generation retrieves latest clinical guidelines and provides an evidence-based response”

“Knowledge Graphs (symbolic grounding) and adapting to domain (functional grounding)”

145 of 190

How to do Symbolic and Functional Grounding Together?

“Grounded generation retrieves latest clinical guidelines and provides an evidence-based response”

146 of 190

Original Text:

Why do i have sudden bursts of depression know the title probably doesn't make sense but stopped working for a while to peruse business idea had which failed and now i'm about go back into work force only 19 these moments where just feel lost like my family friends as is what dedicated life past 6 months most that time was me sitting in room trying get it off ground floor. really nervous getting job again haven't real one entire am overthinking or will be not bad think.

147 of 190

Self Attention Text (No Highlighting)

(Don’t know Why?)

Why do i have sudden bursts of depression know the title probably doesn't make sense but stopped working for a while to peruse business idea had which failed and now i'm about go back into work force only 19 these moments where just feel lost like my family friends as is what dedicated life past 6 months most that time was me sitting in room trying get it off ground floor. really nervous getting job again haven't real one entire am overthinking or will be not bad think.

148 of 190

Attention Over PHQ 1:How often have you been bothered by little interest or pleasure in doing things? (No Highlighting)

Why do i have sudden bursts of depression know the title probably doesn't make sense but stopped working for a while to peruse business idea had which failed and now i'm about go back into work force only 19 these moments where just feel lost like my family friends as is what dedicated life past 6 months most that time was me sitting in room trying get it off ground floor. really nervous getting job again haven't real one entire am overthinking or will be not bad think.

149 of 190

Why do i have sudden bursts of depression know the title probably doesn't make sense but stopped working for a while to peruse business idea had which failed and now i'm about go back into work force only 19 these moments where just feel lost like my family friends as is what dedicated life past 6 months most that time was me sitting in room trying get it off ground floor. really nervous getting job again haven't real one entire am overthinking or will be not bad think.

Attention Over PHQ 2 : How often are you bothered by feeling down, depressed, or hopeless?

150 of 190

Attention Over PHQ 9: How often have you been bothered by thoughts that you would be better off dead or of hurting yourself in some way ?

Why do i have sudden bursts of depression know the title probably doesn't make sense but stopped working for a while to peruse business idea had which failed and now i'm about go back into work force only 19 these moments where just feel lost like my family friends as is what dedicated life past 6 months most that time was me sitting in room trying get it off ground floor. really nervous getting job again haven't real one entire am overthinking or will be not bad think.

151 of 190

PHQ-1, PHQ-5, and PHQ-6 are unanswered questions. These are the relevant questions to be asked.

Similarity score between phrases highlighted in Self-Attention and PHQ-9 questions.

(equal attention, confused, and Unexplainable)

Cumulative Cross-Attention Scores

(PHQ-9 infusion explains model attention)

152 of 190

Check Grounding API [Google Cloud]

Check Grounding determines how grounded a given response is in a given set of facts (context)

Returns:

  • Grounding scores (a support score, and a contradiction score)
  • Citations
  • Anti-Citations

Based on custom NLI model

Generally available at: https://cloud.google.com/generative-ai-app-builder/docs/check-grounding
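The exact model behind the API is not public; the following is only a hypothetical sketch of the mechanics, with `entail` and `contradict` as user-supplied NLI scorers returning values in [0, 1]:

```python
def check_grounding(claims, facts, entail, contradict, tau=0.5):
    """Sketch of a check-grounding-style verifier: for each claim, collect
    supporting facts (citations) and contradicting facts (anti-citations),
    then report the fraction of cleanly supported claims as a support score."""
    citations, anti_citations, supported = [], [], 0
    for claim in claims:
        cites = [j for j, f in enumerate(facts) if entail(f, claim) >= tau]
        antis = [j for j, f in enumerate(facts) if contradict(f, claim) >= tau]
        citations.append(cites)
        anti_citations.append(antis)
        if cites and not antis:
            supported += 1
    support_score = supported / len(claims) if claims else 0.0
    return support_score, citations, anti_citations
```

In a real deployment the claims would come from sentence-splitting the LLM response and the facts from the user-specified context documents.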

153 of 190

Open Questions..

  • What mechanisms can be implemented for LLMs to flag uncertain or unverifiable information in their responses?
  • How should LLMs handle conflicting information when processing text from multiple sources?
  • Can LLMs dynamically cross-reference their outputs with primary sources or citations before finalizing a response?

154 of 190

Handoff

155 of 190

TH10: Neurosymbolic AI for EGI:

Explainable and Grounded Generations

Feb 25th 25

Ali Mohammadi, M294@umbc.edu

Ph.D. student at UMBC

University of Maryland, Baltimore County (UMBC), Knowledge Infused AI and Inference (KAI2) Lab

156 of 190

Why NeuroSymbolic explainable AI?

Safety in High-Stakes Applications

Alignment with Human Values

User Adoption and Confidence

Debugging and Improving Models

Ethical and Fair AI

Trust and Transparency

157 of 190

Key Focus Areas

Large Language Models (LLMs)

Explainability

Wellness Dimension

External Knowledge

158 of 190

159 of 190

160 of 190

Wellness Dimension Datasets

6 wellness dimensions:

  • Physical
  • Intellectual
  • Vocational
  • Social
  • Spiritual
  • Emotional
  1. MultiWD dataset
  2. WellXplain dataset

161 of 190

Explanation and Prediction

Fine-tuned LMs

Fine-tuned/Prompting LLMS

on WD Datasets

The fall semester was one of the worst experiences of my life, and I barely passed my four classes.

Textual Post

Explanation & Label (Expected/Predicted)

Intellectual and Vocational Aspect

fall semester was one of the worst experiences

barely passed

Wellness dimension sample

162 of 190

External Knowledge

163 of 190

External Knowledge

Findings:

General task (e.g. e-snli)

  • Models already know the definitions and rely more on their internal knowledge.

Domain specific task (e.g. WellXplain and HateXplain)

  • Models benefit from the provided external knowledge.

164 of 190

Mohammadi, S., Raff, E., Malekar, J., Palit, V., Ferraro, F., & Gaur, M. (2024). WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 364–388, Miami, Florida, US. ACL.

Instruction

Post: They make me feel unhappy and miserable (SpEA). What should I do?

Output:

SpEA (PA:0, IVA:0, SA:0, SpEA:1)

Explanation: unhappy, miserable

WELLXPLAIN Training Examples

Post: My mum, dad and step-mum (SA) won't leave me alone and they constantly make choices for me and it's starting to get to me.

Output: SA(PA:0, IVA:0, SA:1, SpEA:0)

Explanation: My mum, dad, step-mum

Evaluation uses WELLXPLAIN test examples.

WellDunn Benchmarking

Figure: a tokenizer feeds posts into a fine-tuned LM (stacked encoders or decoders) with an FFNN head that outputs the prediction and explanation; robustness is assessed with the SCE and GL losses, and explainability with SVD ranking and the AO score.

165 of 190

  • Robustness Assessment
    • Sigmoid Cross-Entropy (SCE): the standard multi-label classification loss.
    • Gambler’s Loss (GL): a loss that allows the model to abstain on uncertain examples (Liu et al., NeurIPS 2019).

166 of 190

The fall semester was one of the worst experiences of my life, and I barely passed my four classes.

Mohammadi, S., Raff, E., Malekar, J., Palit, V., Ferraro, F., & Gaur, M. (2024). WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 364–388, Miami, Florida, US. ACL.

[Ground Truth Explanation]

[Generated Explanation]

  • Explainability Assessment
    • SVD (Singular Value Decomposition) ranking: measures the focus of a model’s attention by analyzing its attention matrix. Lower ranks suggest that the model focuses on fewer, more relevant parts of the input text, aligning closely with concise explanations.
    • Attention-Overlap (AO) Score: the number of common (purple) words between the ground-truth and generated explanations divided by the number of ground-truth words.

The fall semester was one of the worst experiences of my life, and I barely passed my four classes.
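Both metrics can be sketched as follows, assuming whitespace tokenization and a spectral-energy definition of rank (WellDunn's exact implementation may differ):

```python
import numpy as np

def attention_overlap(generated, ground_truth):
    """AO score: |common words| / |ground-truth words|."""
    gen = set(generated.lower().split())
    gt = set(ground_truth.lower().split())
    return len(gen & gt) / len(gt) if gt else 0.0

def effective_rank(attn, energy=0.99):
    """SVD ranking: the number of singular values needed to retain `energy`
    of the attention matrix's spectral mass; lower means more focused attention."""
    s = np.linalg.svd(attn, compute_uv=False)
    cumulative = np.cumsum(s) / s.sum()
    return int(np.searchsorted(cumulative, energy) + 1)
```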

167 of 190

168 of 190

169 of 190

SCE vs GL attention (Post 1)

With SCE Loss:

I don’t cry anymore. want to be around anyone, do anything. Work keeps me getting up everyday. Without it would probably stare at my ceiling until passed back out again m so tired know if there is a question in this, There just isn else tell.

With GL:

I don't cry anymore. want to be around anyone do anything Work keeps me getting up everyday Without it would probably stare at my ceiling until passed back out again so tired know if there is a question in this, There just isn't else tell.

170 of 190

Future Directions

Developing a Transparent Classifier Rooted in Clinical Understanding – Addressing the disparities between prediction accuracy and attention.

Improving Attention Alignment with Ground Truth – Enhancing attention explanations to better reflect actual outcomes.

Exploring Different Prompting and Retrieval-Augmented Generation (RAG) Strategies – Testing alternative methods to improve LLM performance.

Developing a Suitable Dataset for Mental Health Applications – Curating knowledge and constructing a well-suited dataset for retrieval-augmented methods.

171 of 190

Wrap up!

Large Language Models (LLMs)

Explainability

Wellness Dimension

External Knowledge

172 of 190

173 of 190

Reference

  1. Seyedali Mohammadi, Edward Raff, Jinendra Malekar, Vedant Palit, Francis Ferraro, and Manas Gaur. 2024. WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 364–388, Miami, Florida, US. Association for Computational Linguistics.
  2. Liu, Ziyin, et al. "Deep gamblers: Learning to abstain with portfolio theory." Advances in Neural Information Processing Systems 32 (2019).
  3. Alsentzer, Emily, et al. "Publicly available clinical BERT embeddings." arXiv preprint arXiv:1904.03323 (2019).
  4. WELLDUNN: An Annotated Dataset to identify affected Wellness Dimensions in Reddit Posts, Under review.
  5. DUNN HL. High-level wellness for man and society. Am J Public Health Nations Health. 1959 Jun;49(6):786-92. doi: 10.2105/ajph.49.6.786. PMID: 13661471; PMCID: PMC1372807.
  6. Merikangas KR, He JP, Burstein M, Swanson SA, Avenevoli S, Cui L, Benjet C, Georgiades K, Swendsen J. Lifetime prevalence of mental disorders in U.S. adolescents: results from the National Comorbidity Survey Replication--Adolescent Supplement (NCS-A). J Am Acad Child Adolesc Psychiatry. 2010 Oct;49(10):980-9. doi: 10.1016/j.jaac.2010.05.017. Epub 2010 Jul 31. PMID: 20855043; PMCID: PMC2946114.
  7. Garg, M. Mental Health Analysis in Social Media Posts: A Survey. Arch Computat Methods Eng 30, 1819–1842 (2023). https://doi.org/10.1007/s11831-022-09863-z
  8. Garg, Muskan. "WellXplain: Wellness concept extraction and classification in Reddit posts for mental health analysis." Knowledge-Based Systems 284 (2024): 111228.
  9. Sathvik, M. S. V. P. J., and Muskan Garg. "Multiwd: Multiple wellness dimensions in social media posts." Authorea Preprints (2023)

174 of 190

175 of 190

openCHA:

Building Explainable

and Personalized Conversational Agent

Iman Azimi, PhD

February 25, 2025

176 of 190

Healthcare chatbots or Conversational Health Agents

Chatbots have the potential to play a crucial role in healthcare by assisting patients and healthcare providers:

    • Clinical decision support
    • Patient monitoring and follow-up
    • Chronic health management
    • Patient’s self-awareness


Bedi, Suhana, et al. "A Systematic Review of Testing and Evaluation of Healthcare Applications of Large Language Models (LLMs)." medRxiv (2024): 2024-04.

177 of 190

Why are healthcare chatbots not widely used?

Existing chatbots are not able to provide:

  1. Trustworthiness
  2. Personalization (no access to user’s data)
  3. Data analysis
    • No access to established multimodal data analysis tools
  4. Access to up-to-date information
    • Ignoring well-established healthcare research
  5. Explainability


Abbasian, M., Khatibi, E., Azimi, I., Oniani, D., Shakeri Hossein Abad, Z., Thieme, A., ... & Rahmani, A. M. (2024). Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI. npj Digital Medicine, 7(1), 82.

178 of 190

openCHA (Conversational Health Agents)

A holistic LLM-powered framework to integrate health data, knowledge, and analytical tools into healthcare chatbots.


  • Abbasian, M., Azimi, I., Rahmani, A.M. and Jain, R., 2023. Conversational Health Agents: A Personalized LLM-Powered Agent Framework. arXiv preprint arXiv:2310.02374.
  • GitHub repo: https://github.com/Institute4FutureHealth/CHA

179 of 190

openCHA framework


180 of 190

Interface

Acts as a bridge between the users and agents

  • Users’ queries
  • Metadata


181 of 190

Orchestrator

Responsible for problem solving, decision making, and response generation

  • Input data are aggregated, transformed into structured data, and then analyzed to plan and execute actions
  • Interacts with external sources to acquire the required information, perform data integration and analysis, and extract insights, among other functions.
  • Converts the information into an understandable format and infers the appropriate response.


182 of 190

External sources

Obtain essential information from the broader world

  • Datasets
  • Knowledge bases
  • Analytical tools
  • Translators


183 of 190

Demo:

Nutrition causal effects

Tasks involved:

  • Get data
  • Causal graph (personal info)
  • Food’s nutritional content (general info)


Z. Yang, E. Khatibi, N. Nagesh, M. Abbasian, I. Azimi, R. Jain, and A. Rahmani, “ChatDiet: Empowering Personalized Nutrition-Oriented Food Recommender Chatbots through an LLM-Augmented Framework,” Elsevier Smart Health, IEEE/ACM CHASE, 2024.

184 of 190

Patient health record reporting

(1)

Tasks involved:

  • Get data
  • Statistical analysis
  • Internet search
  • Extract text


Dataset: S. Labbaf, et al. "Physiological and Emotional Assessment of College Students Using Wearable and Mobile Devices During the 2020 Covid-19 Lockdown: An Intensive, Longitudinal Dataset." Longitudinal Dataset (2023).

185 of 190

Patient health record reporting

(2)

Tasks involved:

  • Get data
  • Statistical analysis
  • SerpAPI
  • Extract Text


Dataset: S. Labbaf, et al. "Physiological and Emotional Assessment of College Students Using Wearable and Mobile Devices During the 2020 Covid-19 Lockdown: An Intensive, Longitudinal Dataset." Longitudinal Dataset (2023).

186 of 190

Objective stress level estimation

Tasks involved:

  • Get data
  • PPG analysis (HRV extraction)
  • HRV analysis (stress estimation)


Dataset: S. Labbaf, et al. "Physiological and Emotional Assessment of College Students Using Wearable and Mobile Devices During the 2020 Covid-19 Lockdown: An Intensive, Longitudinal Dataset." Longitudinal Dataset (2023).

187 of 190

Use cases

  • Yang, Zhongqi, et al. "ChatDiet: Empowering personalized nutrition-oriented food recommender chatbots through an LLM-augmented framework." Smart Health 32 (2024): 100465.
  • Abbasian, Mahyar, et al. "Knowledge-Infused LLM-Powered Conversational Health Agent: A Case Study for Diabetes Patients." 46th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), (2024).
  • Park, Jung In, et al. "Building Trust in Mental Health Chatbots: Safety Metrics and LLM-Based Evaluation Tools." arXiv preprint arXiv:2408.04650 (2024).
  • Abbasian, Mahyar, et al. "Empathy Through Multimodality in Conversational Interfaces." arXiv preprint arXiv:2405.04777 (2024).


188 of 190

Future directions

We invite contributions from diverse communities: contribute your ideas and connect your tools to CHA, leading to more precise user responses.

  • Safety of the responses
    • Benchmarking
  • Connecting open datasets, knowledge graphs, etc.
    • Planning and decision making
    • Retrieve information
  • Chronic Health Management


189 of 190

Thank You

Questions?

More info about openCHA:

arxiv.org/abs/2310.02374

GitHub repository:

github.com/Institute4FutureHealth/CHA

User guide and quick start:

opencha.com

Should you be interested, please reach out to me at


  • Tutorial website: https://nesy-egi.github.io

Slides Available :

https://nesy-egi.github.io

190 of 190

Thanks! Questions?

  • Feedback most welcome :-)

  • Tutorial website: https://nesy-egi.github.io