1 of 190

NeuroSymbolic AI

for

Grounding,

Instructibility, and

Explainability

Tutorial 2025

2 of 190

Spread the Word

Making LLMs Explainable, Grounded, and Instructible

3 of 190

4 of 190

Focus of Tutorial

NeuroSymbolic AI and Instructible AI

Vector Symbolic Architectures

Explainability with Knowledge-infused Learning

Grounding with Retrieval Augmented Generation

OpenCHA: Domain Knowledge-driven

LLM-based Conversational Agent for Health

5 of 190

Tutorial’s Central Question

The "black box" nature of AI systems in high-stakes decision-making applications has raised concerns about transparency and reproducibility.

How can we reduce this black-box opacity?

6 of 190

Attention Maps and Feature Visualization

Layer Analysis and Activation Patterns

Attention Analysis and Probing Tasks

Token Attribution and Hidden States with Sparse Autoencoders

Mechanistic Interpretability

Behavioral Testing


7 of 190

Statistical AI is a Blackbox

NeuroSymbolic AI

8 of 190

Why NeuroSymbolic AI

Amit Sheth, Kaushik Roy, Manas Gaur, Neurosymbolic Artificial Intelligence (Why, What, and How), IEEE Intelligent Systems, 38 (3), May-June 2023

9 of 190

Explainability

🡪 Explanations that use terms and connections specific to a particular field or industry are more useful than general words that don't help people take action.

  • Features:
    • Local and Global Explanations
    • Causality
    • User-Appropriate Explanations
  • Features:
    • Symbolic Grounding
    • Functional Grounding

10 of 190

NeuroSymbolic AI

Neuro-symbolic AI techniques incorporate broader forms of knowledge (lexical, domain-specific, common-sense, and constraint-based) into addressing limitations of either symbolic or statistical AI approaches, such as model interpretations and user-level explanations. Compared to powerful statistical AI that exploits data, NeSy benefits from data and knowledge.

11 of 190

Neurosymbolic AI in the Era of Large Language Models

12 of 190

Applications of NeuroSymbolic AI with a focus on Grounding, Explainability, and Instructibility (EGI)

13 of 190

NeuroSymbolic in Machine Learning and Natural Language Processing

Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P., Bakhtin, A., Wu, Y., & Miller, A. (2019, November). Language Models as Knowledge Bases?. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

14 of 190

Neural AI

Data

Hyperparameters

Activation Functions

Loss Functions

Computing Power

and Model

Compression

Optimization

Symbolic AI

Knowledge Graphs

Lexicons

Rules

Workflow or Procedural Knowledge

Constraints

15 of 190

Benchmarking Example

WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions,

In BlackboxNLP @ EMNLP 2024

16 of 190

Wellness Dimension

17 of 190

Wellness Dimension

Wellness Dimension Definitions and Questionnaire

https://store.samhsa.gov/sites/default/files/sma16-4958.pdf

18 of 190

MultiWD and WellXplain Datasets

Content from ~4,000 users

6 Wellness Dimensions

  • Physical
  • Intellectual
  • Vocational
  • Social
  • Spiritual
  • Emotional

Clinical expert explanations

19 of 190

Input

Attention Matrix

Explanation

Prediction

Input

Attention Matrix

Explanation

Prediction

Definitions In-Context Learning

Design 1

Design 2

20 of 190

Input

Attention Matrix

Explanation

Prediction

Questionnaire

Workflow-based

In-Context Learning

Design 4

Input

Attention Matrix

Explanation

Prediction

Chain of Thoughts with Definitions

Design 3

21 of 190

Domain-Specific LLMs

General Purpose LLMs

22 of 190

Hybridized Architectures: NeuroSymbolic AI

23 of 190

Process Knowledge-infused Learning

24 of 190

Simple Text Classification

I am really struggling with my bisexuality, which is causing chaos in my relationship with a girl. Being a fan of the LGBTQ community, I am equal to worthless to her. I’m now starting to get drunk because I can’t cope with the obsessive, intrusive thoughts I need to get out of my head.

Don’t want to live anymore. Sexually assault, ignorant family members, and my never-ending loneliness brights up my path to death. 

I do have the potential to live a decent life, but not with people who abandon me. Hopelessness and feelings of betrayal have turned my nights into days. I am developing insomnia because of my restlessness.

I just can’t take it anymore. Been abandoned yet again by someone I cared about. I've been diagnosed with borderline for a while, and I’m just going to isolate myself and sleep forever.

Y = Suicide Ideation

Process Knowledge Infusion is a better approach than purely data-driven classification

25 of 190

Simple Text Classification

I am really struggling with my bisexuality, which is causing chaos in my relationship with a girl. Being a fan of the LGBTQ community, I am equal to worthless to her. I’m now starting to get drunk because I can’t cope with the obsessive, intrusive thoughts I need to get out of my head.

Don’t want to live anymore. Sexually assault, ignorant family members, and my never-ending loneliness brights up my path to death. 

I do have the potential to live a decent life, but not with people who abandon me. Hopelessness and feelings of betrayal have turned my nights into days. I am developing insomnia because of my restlessness.

I just can’t take it anymore. Been abandoned yet again by someone I cared about. I've been diagnosed with borderline for a while, and I’m just going to isolate myself and sleep forever.

Y = Suicide Ideation

Process Knowledge-based Classification

Has the subject wished he was dead or wished he could go to sleep and not wake up?
YES

Has the subject had any thoughts of killing himself?
YES

Has the subject been thinking about how he might do this?
NO

Has the subject had these thoughts and some intention of acting on them?
NO
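The question cascade above can be sketched as a small rule-based classifier. This is an illustrative sketch of process-knowledge-based classification: the question wording follows the slide, but the severity labels are hypothetical and this is not a clinical tool.

```python
# Process-knowledge-based classification: instead of mapping text directly
# to a label, answers to an ordered clinical question sequence determine
# the outcome. The labels below are illustrative only.

QUESTIONS = [
    "Has the subject wished he was dead or wished he could go to sleep and not wake up?",
    "Has the subject had any thoughts of killing himself?",
    "Has the subject been thinking about how he might do this?",
    "Has the subject had these thoughts and some intention of acting on them?",
]

def classify(answers):
    """Walk the question sequence; severity = index of the deepest 'yes'."""
    depth = 0
    for ans in answers:
        if not ans:          # the process stops at the first "no"
            break
        depth += 1
    if depth == 0:
        return "no ideation indicated"
    if depth <= 2:
        return "suicide ideation"          # matches Y in the slide example
    return "ideation with plan/intent"

print(classify([True, True, False, False]))  # → suicide ideation
```

The YES/YES/NO/NO pattern on the slide lands on the "suicide ideation" label, and the sequence of answers itself serves as the explanation.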

26 of 190

Simple Text Classification

I am really struggling with my bisexuality, which is causing chaos in my relationship with a girl. Being a fan of the LGBTQ community, I am equal to worthless to her. I’m now starting to get drunk because I can’t cope with the obsessive, intrusive thoughts I need to get out of my head.

Don’t want to live anymore. Sexually assault, ignorant family members, and my never-ending loneliness brights up my path to death. 

I do have the potential to live a decent life, but not with people who abandon me. Hopelessness and feelings of betrayal have turned my nights into days. I am developing insomnia because of my restlessness.

I just can’t take it anymore. Been abandoned yet again by someone I cared about. I've been diagnosed with borderline for a while, and I’m just going to isolate myself and sleep forever.

Y = Suicide Ideation

27 of 190

28 of 190

29 of 190

30 of 190

31 of 190

32 of 190

NeuroSymbolic AI in Social Media

33 of 190

NeuroSymbolic AI in Social Media

Symbolic AI

  • Knowledge Representation: LDA, Word2Vec; Maintains a "Constantly Updating Lexicon" as a formal knowledge representation structure
  • Semantic Processing: dedicated Location Extraction showing symbolic representation of geographical entities; Key Phrases/Hashtags as symbolic tokens with defined meanings.
  • Semantic Gap Management: Uses neologisms as a symbolic bridge between new terms and existing knowledge
  • Training Domain-Specific Models: LDA and Word2Vec.

34 of 190

NeuroSymbolic AI in Social Media

Symbolic AI Scoring

  • Semantic Mapping: A lexicon with metadata as fields and W2V model vocabulary as terms, created using cosine similarity.
  • Semantic Proximity: A lexicon with mental health concepts as fields and metadata concepts as terms, created using cosine similarity
  • Index Scoring: A cross-correlation matrix from mental health concepts to model vocabulary words, with metadata as transient terms.

35 of 190

NeuroSymbolic in Machine Learning and Natural Language Processing

Neural AI with Symbolic information

36 of 190

NeuroSymbolic in Machine Learning and Natural Language Processing

Anxiety

Depression, Cognitive distortions, panic attacks, hopelessness, physical sensations.

Depression

Mood swings, weight gain, rapid cycling, depressive episode, Impulsivity, mood swings, antisocial conduct, personality disorder

Addiction

Buying oxycodone, pain management, chronic pain, alienation, crippling alcohol, dependent on crack

37 of 190

NeuroSymbolic in Machine Learning and Natural Language Processing

The tables compare model performance for mental health classification across Precision, Recall, and F1-Score. The left table shows traditional models’ results with and without the Neurosymbolic approach, while the right table contrasts the NeuroSymbolic model with state-of-the-art LLMs like LLaMA, Phi, and Mistral.

The NeuroSymbolic model consistently outperforms both traditional models and state-of-the-art LLMs, achieving higher performance metrics and adaptability in mental health sentiment classification.

NB: Naïve Bayes

RF: Random Forest

38 of 190

CREST

39 of 190

Neural AI and Symbolic AI for achieving Consistency

Bonagiri, Vamshi Krishna, Sreeram Vennam, Priyanshul Govil, Ponnurangam Kumaraguru, and Manas Gaur. "SaGE: Evaluating Moral Consistency in Large Language Models." In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation LREC-COLING 2024.

40 of 190

Claim: LLMs are not semantically consistent, and can give contradictory answers to paraphrased questions

41 of 190

NeuroSymbolic Empirical Analysis

42 of 190

43 of 190

Semantic Graph-driven Consistent LLM Training (SaGE)

Moral Consistency Corpus (MCC)

44 of 190

Forbes, Maxwell, Jena D. Hwang, Vered Shwartz, Maarten Sap, and Yejin Choi. "SOCIAL CHEMISTRY 101: Learning to Reason about Social and Moral Norms." In EMNLP. 2020.

45 of 190

BLEURT

BLEU

ROUGE-L

BERT Score

SaGE (LLAMA 3)

GPT-4

46 of 190

Forbes, Maxwell, Jena D. Hwang, Vered Shwartz, Maarten Sap, and Yejin Choi. "SOCIAL CHEMISTRY 101: Learning to Reason about Social and Moral Norms." In EMNLP. 2020.

47 of 190

Integrated Mental Health Instruction Dataset (105K Samples)

Yang, Kailai, Tianlin Zhang, Ziyan Kuang, Qianqian Xie, Jimin Huang, and Sophia Ananiadou. "MentaLLaMA: interpretable mental health analysis on social media with large language models." In Proceedings of the ACM Web Conference 2024, pp. 4489-4500. 2024.

48 of 190

NeuroSymbolic AI for Reliability

Reliability

Grounding

Ensemble of Large Language Models

Bias Awareness

Mechanistic Interpretability

49 of 190

NeuroSymbolic AI for Reliability

Reliability

Grounding

Ensemble of Large Language Models

Bias Awareness

Mechanistic Interpretability

Explainability

Knowledge Gap Resolution

Source Attribution

Gorti, Atmika, Aman Chadha, and Manas Gaur. "Unboxing Occupational Bias: Debiasing LLMs with US Labor Data." In Proceedings of the AAAI Symposium Series, 2024.

Instructibility

50 of 190

Grounding

A successful AI teammate requires several cognitive capacities, including situation assessment, task behavior, language comprehension and generation, and knowledge gap resolution. Grounding enables agents with different capabilities to communicate.

51 of 190

Knowledge Gap Resolution

  • Bajaj, Goonmeet, Bortik Bandyopadhyay, Daniel Schmidt, Pranav Maneriker, Christopher Myers, and Srinivasan Parthasarathy. "Understanding knowledge gaps in visual question answering: Implications for gap identification and testing." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 386-387. 2020.

52 of 190

Language Gap: Pay Attention to Important Domain Concepts

53 of 190

Language Gap: Pay Attention to Important Domain Concepts

What Is the Knowledge Gap?

  • The model’s inability to handle tasks requiring factual updates or logical reasoning beyond its training data.
  • In mental health or legal scenarios, LLMs often overlook subtle context cues.

54 of 190

55 of 190

More Details during

Grounding with Retrieval Augmented Generation

56 of 190

NeuroSymbolic AI for Instructibility

Why Instructibility Needs More Than Instruction Tuning

Lin, Bill Yuchen, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, and Yejin Choi. "The unlocking spell on base llms: Rethinking alignment via in-context learning." arXiv preprint arXiv:2312.01552

57 of 190

NeuroSymbolic AI for Instructibility

Why Instructibility Needs More Than Instruction Tuning

Lin, Bill Yuchen, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, and Yejin Choi. "The unlocking spell on base llms: Rethinking alignment via in-context learning." arXiv preprint arXiv:2312.01552

  • Token-Level Impact Is Minimal
    • Instruction tuning modifies behavior on only a small fraction of tokens
    • Base and instruction-tuned models generate identical top token choices in most positions
  • Surface-Level Modifications
    • The tokens instruction tuning changes represent a tiny fraction of the model's overall output
    • Changes concentrate on discourse markers and transitional words
  • Base Model Capability
    • Given appropriate prompting, base models can match instruction-tuned performance
    • This indicates instruction tuning surfaces existing capabilities rather than teaching new ones

58 of 190

Most Importantly …

🚫 Why This Is Not Enough

  • Instruction tuning doesn’t fill the knowledge gap; it simply redirects existing behavior.
  • Complex domains need deeper reasoning capabilities that go beyond token-level adjustments.

How Instructibility Relates to Knowledge Gaps

  • A genuinely instructible LLM must dynamically recognize gaps in its knowledge.
  • Requires active reasoning mechanisms to retrieve and apply external knowledge.
  • The model must collaborate with external systems (symbolic reasoners, knowledge bases) to fill these gaps effectively.

59 of 190

Weighted Contextual Mutual Information

Knowledge gap

60 of 190

Gou, Tian, Boyao Zhang, Zhenglie Sun, Jing Wang, Fang Liu, Yangang Wang, and Jue Wang. "Rationality of thought improves reasoning in large language models." In International Conference on Knowledge Science, Engineering and Management, pp. 343-358. Singapore: Springer Nature Singapore, 2024.

61 of 190

62 of 190

LLM Generator

LLM Evaluator

63 of 190

64 of 190

Summary

Definition Integration: The WellDunn framework formalizes the incorporation of clinical definitions into mental health assessment systems, enabling a more accurate understanding of psychological conditions

Rule of Thumb Extraction and Contextualization: SaGE extracts clinical heuristics from mental health knowledge bases as rules of thumb, making LLM agents more empathetic and grounded.

Semantic Encoding and Decoding Optimization: The SEDO framework preserves nuanced psychological semantics when integrating expert knowledge into mental health assessment systems.

Process Knowledge-infused Learning demonstrates how therapeutic processes and intervention sequences can be incorporated into AI systems to provide ethically sound mental health support.

Knowledge Gaps Assessment enables LLMs to dynamically measure intrinsic contextual uncertainty during conversations, strategically resolving persona knowledge gaps through targeted questions rather than producing hallucinated responses when information is incomplete.

65 of 190

Handoff

Vector Symbolic Architectures

66 of 190

Vector Symbolic Architectures in Deep Learning

Edward Raff

AAAI TUTORIAL: NEUROSYMBOLIC AI FOR EGI | 24 FEB 2025

Innovation center, Washington, D.C.

67 of 190

Vector Symbolic Architectures

This document is confidential and intended solely for the client to whom it is addressed.

68 of 190

VSA operations


69 of 190

Holographic Reduced Representations: a primer


Booz Allen Hamilton

Consider a d = 3 dimensional space where we wish to compute c† ⊗ (c ⊗ x). Since c† is the approximate inverse of c under circular convolution, the result is approximately x plus a small noise term: unbinding with c† recovers the item that was bound with c.
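A minimal NumPy sketch of the standard HRR operations described on this slide: binding is circular convolution (computed via the FFT), and unbinding with the approximate inverse c† (the involution, which reverses all elements but the first) recovers the bound item up to noise.

```python
import numpy as np

def bind(a, b):
    """HRR binding: circular convolution via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def inv(c):
    """Approximate inverse (involution): reverse all but the first element."""
    return np.concatenate(([c[0]], c[1:][::-1]))

rng = np.random.default_rng(0)
d = 1024
c = rng.normal(0, 1 / np.sqrt(d), d)   # HRR vectors: elements ~ N(0, 1/d)
x = rng.normal(0, 1 / np.sqrt(d), d)

retrieved = bind(inv(c), bind(c, x))   # c† ⊗ (c ⊗ x)
cos = retrieved @ x / (np.linalg.norm(retrieved) * np.linalg.norm(x))
print(cos > 0.5)  # True: x is recovered well above chance
```

The retrieved vector is noisy, so VSA systems typically follow unbinding with a clean-up step that snaps the result to the nearest known item.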

70 of 190

Extreme Multi-label classification (XML)


Eli Chien, Jiong Zhang, Cho-Jui Hsieh, Jyun-Yu Jiang, Wei-Cheng Chang, Olgica Milenkovic, and Hsiang-Fu Yu. 2023. PINA: leveraging side information in extreme multi-label classification via predicted instance neighborhood aggregation. In Proceedings of the 40th International Conference on Machine Learning (ICML'23), Vol. 202. JMLR.org, Article 224, 5616–5630.

71 of 190

A symbolic version of XML


72 of 190

Smaller models


73 of 190

Deploying Convolutional Networks on Untrusted Platforms Using 2D Holographic Reduced Representations

  • Say you want to deploy your model on AWS to reduce local compute costs.
    • But you don’t fully trust AWS not to peek at your data/model and try to infer what you are doing.
  • Homomorphic encryption is cool, but far too slow to be practical here.
  • We can exploit the fact that the result of binding is dissimilar to both of its inputs.
  • HRR is defined via the FFT, so we can generalize to images via a 2D FFT.
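A sketch of the underlying mechanism, assuming a toy 32×32 "image" and an exact frequency-domain inverse for clarity (HRR work typically uses the approximate involution inverse instead): the untrusted server only ever sees the bound array, while the secret holder can unbind.

```python
import numpy as np

def bind2d(a, b):
    # 2D HRR binding: circular convolution via the 2D FFT
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.fft.fft2(b)))

def inv2d(s):
    # Exact inverse in the frequency domain (assumes no zero FFT bins);
    # shown for illustration only.
    return np.real(np.fft.ifft2(1.0 / np.fft.fft2(s)))

rng = np.random.default_rng(1)
img = rng.random((32, 32))                 # stand-in for an input image
secret = rng.normal(0, 1 / 32, (32, 32))   # the locally held secret

hidden = bind2d(img, secret)               # what the untrusted server sees
recovered = bind2d(hidden, inv2d(secret))  # the secret holder unbinds

print(np.allclose(recovered, img, atol=1e-6))  # True: exact recovery up to FFT round-off
```

Because the bound array is dissimilar to the original image, an observer without the secret sees something close to noise.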


74 of 190

Connectionist Symbolic Pseudo Secrets (CSPS)


75 of 190

CSPS classification accuracy

  • Training is done entirely locally on one device; it is prediction we are concerned with.
  • Accuracy is reasonable, but there is some degradation as images get larger.
  • You can improve accuracy by running the prediction k times with k different secrets and averaging the results.
  • This is far faster than FHE!


76 of 190

Do you get any data saving?

  • CSPS can offload at least 65% of computation to a remote machine effectively
    • Some work needs to be done locally for unbind/binding, and we need a small classification head to run locally (too hard for a “linear probe”)


77 of 190

Are we giving away any secrets?

  • Utility is predicated on obscurity!
    • Can the adversary eavesdrop on the input or the output and tell what is happening?
  • Use clustering algorithms, given the true number of classes, to see if they can find the signal. The Adjusted Rand Index should be 0 for purely random clustering; the observed scores are far below 100%!


Network’s Input

Network’s Output

78 of 190

What if the adversary was even smarter

  • Training a model directly on CSPS data with true labels would require an impossibly strong adversary.
  • That even this adversary barely beats random guessing is a good sign for heuristic security.
    • MNIST, SVHN, and CIFAR-10 accuracies are at best 2.1× random guessing
    • For CIFAR-100 and Mini-ImageNet they are 2.6× and 4.7× random guessing


Random-guess baselines: 10% (MNIST), 10% (SVHN), 10% (CIFAR-10), 1% (CIFAR-100), 1% (Mini-ImageNet)

79 of 190

Adversary has access to some of the original images, can it learn the secret using projected gradient descent (PGD)?

WHAT ABOUT USING ADVERSARIAL ML?

80 of 190

Adversary has access to some of the original images, can it learn by training an Autoencoder?

WHAT ABOUT TRAINING AN AUTO-ENCODER?

81 of 190

Self Attention


82 of 190

Self Attention with HRRs


 

We can think of Self-Attention as a fuzzy dictionary. We are finding the match between query and key and returning the corresponding value – made fuzzy by averaging based on similarity.

So we can perform querying against this “dictionary”, using HRRs as an inductive bias toward key/value lookups!

83 of 190

Self Attention with HRRs: Implementation


 

84 of 190

Self Attention with HRRs: Noise?


Our self-attention works without Gaussian i.i.d. coefficients. How? Consider the H-dimensional vectors a, b, c, d, and z. If each element of these vectors is sampled from N(0, 1/H), then querying the bundle a ⊗ b + c ⊗ d for a bound item such as a yields a similarity ≈ 1, while querying for the absent item z yields ≈ 0. Now let's pretend we have 2D data: we can query for a + z, or for c + z, and in either case retrieve the corresponding value plus noise.

In both queries the noise terms share many coefficients and so have similar magnitude. We can therefore interpret the noise as an additional constant ε added to each score. When we then apply the softmax, we benefit from the fact that softmax is invariant to constant shifts of its input, i.e., ∀ε ∈ ℝ, softmax(x + ε) = softmax(x). Thus our softmax effectively acts as a clean-up operation over the original values!
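The shift-invariance property is easy to verify numerically; this minimal check assumes the standard max-subtracting softmax implementation.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return z / z.sum()

x = np.array([1.0, 2.0, 3.0])
eps = 7.5  # any constant shift, standing in for the shared noise term
print(np.allclose(softmax(x), softmax(x + eps)))  # True
```

Since the constant shift cancels inside the softmax, the shared noise term is washed out of the attention weights.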

85 of 190

Long Range Arena Results

  • We use each of the LRA datasets. Feature vectors were available for all datasets, so all have FC results.
  • Most methods are worse than a naïve Transformer, but Hrrformer is always better!


86 of 190

Interpretability

  • We can visualize the attention weights that our model uses for each prediction, and see if they correspond with the content of the image.
  • In doing so, we see that the attention maps precisely to informative outlines/content of the image.
  • Remember: the task is linearized images! So the model is learning 2D structure from 1D representations!

87 of 190

Fast & Low Memory Training

  • Transformers are being used in ever larger and more expensive models. Are we fighting that trend? Yes!
  • Hrrformer is the fastest by far compared to all prior methods, up to 2 orders of magnitude.
  • The Hrrformer uses less memory to train by an order of magnitude or more, depending on what baseline we compare against
  • It is nearly the most accurate on the LRA benchmarks.

88 of 190

Fast Predictions

  • The FFT function has better numerical behavior as a function of batch size than matrix multiplication.
  • The gap between time/sample is smaller for varying batch sizes
  • Even our worst case time is better than a transformer’s best-case time!

89 of 190

Malware results


90 of 190

A Walsh Hadamard Derived Linear Vector Symbolic Architecture


91 of 190

Properties of the HLB


92 of 190

Good at classical VSA tasks

  • When binding with random or repeated VSA vectors, HLB’s similarity score remains constant.
  • The magnitude of the vector does not change either as more items are bound
  • Better to design around something that has a known response


93 of 190

Better at XML classification


94 of 190

Does better at CSPS

  • HLB is more accurate at the CSPS task than prior VSAs
  • Also better at hiding its information for CSPS too!
  • Because CSPS is purely elementwise operations, no extra work for 2D/n-D generalization


95 of 190

Questions?


Edward Raff

EdwardRaff.com

Raff_Edward@bah.com

We can use VSAs to create neuro-symbolic AI methods

  • We can be clever with the loss function to impose symbolic constraints
  • We can design layers with symbolic interpretations as a way to express priors
  • We can design and simplify our approach to VSAs to achieve better results

96 of 190

Handoff

97 of 190

Grounding Blackbox Language Models with Retrieval Augmented Generation of Diverse Knowledge Forms

Deepa Tilwani, PhD Candidate, University of South Carolina

98 of 190

Introduction and Motivation (Part 1)

99 of 190

Progress in Language Modelling

Symbolic Era

Pre - 1990

  • Rule Based
  • Expert Systems
  • Limited Generalization

Statistical Era

1990 - 2006

  • Data-driven Approaches
  • Probabilistic Models

Scale Era

2006 onwards

  • Deep Learning and neural nets
  • General Purpose LMs
  • Massive Datasets and Computation

Turing Test (1950) → ELIZA (1966) → ChatGPT (2022)

100 of 190

ELIZA (1966) : THE FIRST CHATBOT

Early NLP program developed by Joseph Weizenbaum at MIT. Created illusion of a conversation by rephrasing user statements as questions using pattern matching and substitution methodology. One of the first programs capable of attempting the Turing test.

101 of 190

The LLM Era: How Do They Work?

102 of 190

Word Embeddings

Represent each word using a “vector” of numbers.

  • Converts a “discrete” representation to “continuous”.
  • Many benefits:
    • More “fine-grained” representations of words.
    • Useful computations such as cosine and Euclidean distance.
    • Visualization and mapping of words onto a semantic space.
    • Can be learned in a self-supervised manner from a large corpus.
  • Examples:
    • Word2Vec (2013), GloVe, BERT, ELMo
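A toy illustration of these benefits with hand-picked 3-d vectors (real embeddings are learned and have hundreds of dimensions; the words and values here are illustrative only):

```python
import numpy as np

# Toy 3-d "embeddings"; real Word2Vec/GloVe vectors have 100-300 dims
# and are learned from a corpus rather than hand-picked.
emb = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.7, 0.7, 0.1]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine(u, v):
    """Cosine similarity: one of the useful computations on continuous vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Continuous vectors support fine-grained similarity comparisons:
print(cosine(emb["king"], emb["queen"]) > cosine(emb["king"], emb["apple"]))  # True
```

A discrete (one-hot) representation cannot express this: every pair of distinct words would have similarity exactly zero.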

103 of 190

Seq2Seq Models

Recurrent Neural Networks (RNNs)

● Long Short-Term Memory Networks (LSTMs)

● Capture dependencies between input tokens

● Gates control the flow of information

A simple RNN shown unrolled in time. Network layers are recalculated for each time step, while weights U, V and W are shared across all time steps.

  • The inputs to each unit consist of the current input xt, the previous hidden state ht-1, and the previous context ct-1.
  • The outputs are a new hidden state ht and an updated context ct.

104 of 190

Transformers

  • Attention allows the model to “focus” on particular aspects of the input while generating the output.
  • This is done using a set of parameters, called “weights,” that determine how much attention should be paid to each input token at each time step.
  • These weights are computed from the input and the current hidden state of the model.

In encoding the word "it", one attention head is focusing most on "the animal", while another is focusing on "tired". The model's representation of the word "it" thus bakes in some of the representation of both "animal" and "tired".

https://jalammar.github.io/illustrated-transformer/
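The weight computation described above is realized in transformers as scaled dot-product attention; here is a minimal NumPy sketch (the shapes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: the weights say how much each
    input token contributes to each output position."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # shape (n_queries, n_keys)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 query positions, dim 4
K = rng.normal(size=(5, 4))   # 5 key positions
V = rng.normal(size=(5, 4))   # one value per key
out, w = attention(Q, K, V)
print(out.shape, np.allclose(w.sum(axis=1), 1.0))  # (3, 4) True
```

Each row of the weight matrix is a distribution over input tokens, which is exactly what the attention-head visualizations (e.g., for the word "it") display.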

105 of 190

Pre-Training: Data Preparation

A typical data preparation pipeline for pre-training LLMs:

W. Zhao et al. A Survey of Large Language Models. 2023.

106 of 190

What Can LLMs Do?

Evolution of LMs from Perspective of Task-Solving Capacity

W. Zhao et al. A Survey of Large Language Models. 2023.

107 of 190

Few-Shot Prompting

T. Brown et al. Language Models are Few-Shot Learners. NeurIPS 2020.

Instruction:

Classify the sentiment of the given text as either positive or negative based on the examples provided.

Few-shot examples:

"Great product, 10/10": {"label": "positive"}

"Didn't work very well": {"label": "negative"}

"Super helpful, worth it": {"label": "positive"}

Input: "Amazing quality and fast shipping!"

LLM

Ideal Output:

{"label": "positive"}

108 of 190

Chain-of-Thought Prompting

Instruction:
Classify the sentiment of the given text as either positive or negative. Follow a step-by-step reasoning process to determine the sentiment.

Reasoning:

  • "Wow!" expresses excitement and enthusiasm, indicating a positive reaction.
  • "This is fantastic quality" suggests high satisfaction with the product’s quality.
  • "Fast shipping" is another positive aspect, showing appreciation for timely delivery.
  • All elements in the sentence convey strong positivity.

Output: {"label": "positive"}

LLM

Input: "Wow! This is fantastic quality and fast shipping!"

Examples:

  1. Input: "Great product, 10/10"
     Reasoning: The phrase "Great product" expresses strong approval, and "10/10" indicates a perfect rating, showing high satisfaction.
     Output: {"label": "positive"}
  2. Input: "Didn't work very well"
     Reasoning: The phrase "Didn't work" suggests malfunction or failure, and "very well" implies that it performed below expectations. This conveys dissatisfaction.
     Output: {"label": "negative"}
  3. Input: "Super helpful, worth it"
     Reasoning: "Super helpful" indicates a high level of usefulness, and "worth it" suggests that the person finds the product valuable. This implies strong satisfaction.
     Output: {"label": "positive"}

J. Wei et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.

109 of 190

From Prompting to Fine-Tuning

Unlike prompting, fine-tuning actually changes the model under the hood, giving better domain- or task-specific performance.

https://x.com/karpathy/status/1655994367033884672

Fine-Tuning

110 of 190

Custom Trained Model in Law: Harvey AI

  • Startup building a custom-trained case law model for drafting documents, answering questions about complex litigation scenarios, and identifying material discrepancies between hundreds of contracts.
  • Added 10 billion tokens worth of data to power the model, starting with case law from Delaware and then expanding to include all of U.S. case law.
  • Attorneys from 10 large law firms preferred the custom model’s output versus GPT-4’s 97% of the time. The main benefit was reduced hallucinations!

Open AI Customer Stories: Harvey. April 2024.

111 of 190

Parameter Efficient Fine-Tuning (PEFT)

Techniques like LoRA construct a low-rank parameterization for parameter efficiency during training. For inference, the model can be converted to its original weight parameterization to ensure unchanged inference speed.

E. Hu et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.

GPT-3 175B validation accuracy vs. number of trainable parameters of several adaptation methods on WikiSQL. LoRA exhibits better scalability and task performance.
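A minimal numerical sketch of the LoRA idea (the dimensions, rank, and scaling below are illustrative): only the small matrices A and B are trained, and their product can be merged back into W for inference.

```python
import numpy as np

# LoRA sketch: the frozen weight W gets a trainable low-rank update
# B @ A scaled by alpha / r; only A and B are updated during training.
d_out, d_in, r, alpha = 64, 64, 4, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(0, 0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                  # trainable, zero init: no change at start

def lora_forward(x):
    return W @ x + (alpha / r) * (B @ (A @ x))

# For inference the update can be merged back: W' = W + (alpha/r) * B @ A,
# so there is no extra latency versus the original layer.
W_merged = W + (alpha / r) * B @ A
x = rng.normal(size=d_in)
print(np.allclose(lora_forward(x), W_merged @ x))  # True

# Parameter savings: full fine-tuning trains d_out*d_in weights,
# LoRA trains only r*(d_in + d_out).
print(d_out * d_in, r * (d_in + d_out))  # 4096 512
```

The merge step is why LoRA keeps inference speed unchanged, as the slide notes: the low-rank factors exist only during training.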

112 of 190

Why the need for Trustworthiness in Generative AI?

113 of 190

Unreliable Reasoning Even On Simple Tasks

Probably due to tokenization!

Generated by gpt-4o’s tokenizer.

Try it out at:

https://tiktokenizer.vercel.app/

Easy reasoning, Sure!

Got confused ??

114 of 190

Jailbreaking Can Bypass Safety

Jailbreaking is the process of altering prompts to evade an LLM’s safeguards, resulting in harmful outputs.

PAIR, influenced by social engineering attacks, involves an attacker LLM that autonomously generates jailbreaks for a targeted LLM. The attacker LLM repeatedly interacts with the target LLM, refining and improving a jailbreak—often within twenty queries.

P. Chao et al. Jailbreaking Black Box Large Language Models in Twenty Queries. 2023.

115 of 190

The Story of a Lawyer Who Employed ChatGPT … trust issues remain

A lawyer, representing a client against an airline, turned to AI assistance for drafting legal documents. The results were less than ideal. https://www.nytimes.com/2023/05/27/nyregion/avianca-airline-lawsuit-chatgpt.html

Legal Consequences for Attorneys Using ChatGPT

Lawyer Acknowledges AI Misuse in Court: During court session, an attorney admitted excessively relying on AI, resulting in a legal motion filled with artificial legal references. https://www.nytimes.com/2023/06/08/nyregion/lawyer-chatgpt-sanctions.html

116 of 190

Neuro Symbolic Legal AI

OpenAI Customer Stories: Harvey. April 2024.

117 of 190


118 of 190


119 of 190

More Challenges for Generative AI

  • Nearly impossible to explain or reason about generative answers
  • Prompt injections can leak data
  • Context windows are and will remain limited
  • Bias in large language models that supervised learning cannot reduce
  • Reliability issues: different large language models yield different outcomes
  • Inconsistency in prompts for completeness in outcomes

120 of 190

Grounding (Part 2)

121 of 190

Grounding

Grounding is defined as ensuring that an LLM generates verifiable and well-grounded responses to any prompt, relying solely on information from a user-specified knowledge base.

Grounded means that every claim in the response is attributable to a document in the knowledge base

Verifiably-grounded means that every claim is backed by an appropriate citation

A knowledge base may be a private corpus, a public-domain corpus, or the entire Web

E.g., a healthcare customer may specify a set of journals they trust

122 of 190

Two Core Approaches to Grounded AI

Grounded Generation – Enhancing AI with Verified Knowledge

Method:

  • Retrieve relevant facts from a trusted knowledge base.
  • Augment LLM prompts with verified context before generating responses.
  • Intrinsic phenomena

Grounding Verification – Ensuring AI’s Responses Are Factually Correct

Method:

  • Cross-check AI-generated claims with authoritative sources.
  • Generate citations to improve transparency and accountability.
  • Apply fact-checking models to filter unverified claims.
  • Extrinsic phenomena
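The two approaches compose naturally: generate from retrieved context, then verify each claim against that context. A minimal sketch, where `retrieve`, `generate`, and `supports` are user-supplied stand-ins for a real retriever, LLM, and fact-checking model:

```python
def grounded_generation(query, knowledge_base, retrieve, generate, k=3):
    """Grounded generation: retrieve verified context, then condition the LLM on it."""
    docs = retrieve(query, knowledge_base, k)          # step 1: fetch trusted facts
    prompt = "Answer using ONLY the context below.\n"
    prompt += "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    prompt += f"\nQuestion: {query}"
    return generate(prompt), docs                      # step 2: augmented generation

def grounding_verification(claims, docs, supports):
    """Grounding verification: keep only claims backed by a document, with citations."""
    verified = []
    for c in claims:
        cites = [i + 1 for i, d in enumerate(docs) if supports(c, d)]
        if cites:                                      # unverified claims are filtered
            verified.append((c, cites))
    return verified
```

In practice `supports` would be an NLI or fact-checking model rather than the exact-match toy used in testing.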

123 of 190


Why Grounded Generation?

LLAMA

“Grounded generation retrieves latest clinical guidelines and provides an evidence-based response”

Not grounded but a generic answer!

124 of 190

Why Grounding Verification?

INPUT: What is the target blood pressure for men?

Not according to 2017 guidelines

It should first verify who the intended audience is before ensuring factual accuracy.

125 of 190

Types of Grounding in AI & LLMs

1) Symbolic Grounding

AI must retrieve, recognize, and structure information correctly before using it (i.e., LLMs should understand and link symbols like words, phrases, and numbers to their real-world meanings).

  1. Retrieval-Augmented Generation Based Grounding
    • Method: AI retrieves external documents before responding.
    • Example: a QA system fetching Wikipedia articles before answering a historical question; a chatbot retrieving product manuals before explaining a feature.
  2. Knowledge Graph-Based Grounding
    • Method: AI structures information in graphs to improve contextual understanding.
    • Example: a search engine linking related topics (e.g., connecting “COVID-19” with “vaccines” and “pandemics”); legal NLP models linking case law, statutes, and judicial precedents to provide structured responses.

2) Functional Grounding

LLMs should reason, verify, and adapt responses based on context (i.e., apply knowledge correctly in context).

  1. Attribution-Based Grounding
    • Method: AI justifies responses with references and citations.
    • Example: a fact-checking AI cross-referencing news claims with verified sources before publishing; AI writing research papers by citing scientific studies instead of generating unsupported claims.
  2. Interactive & Reinforcement-Based Grounding
    • Method: AI learns from real-time feedback and improves responses over time.
    • Example: customer support chatbots adapting based on user corrections (e.g., learning new slang or product updates); language models fine-tuned through user interactions to generate more accurate, personalized responses.

126 of 190

Symbolic Grounding

127 of 190

  1. Retrieval-Augmented Generation (RAG)

LLMs lack knowledge beyond their training date, and frequent model updates are impractical.

Idea: Enhance LLMs with a retrieval system!

Lewis, Patrick, et al. "Retrieval-augmented generation for knowledge-intensive nlp tasks." Advances in Neural Information Processing Systems 33 (2020): 9459-9474.
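A minimal sketch of the retrieval step, with a toy bag-of-words embedding standing in for a neural encoder (the `embed`/`retrieve` names are illustrative, not from the paper):

```python
import numpy as np

def embed(text, vocab):
    """Toy bag-of-words embedding; a real RAG system uses a neural encoder."""
    v = np.array([text.lower().count(w) for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(query, corpus, vocab, k=2):
    """Rank documents by cosine similarity to the query and return the top-k."""
    q = embed(query, vocab)
    return sorted(corpus, key=lambda d: -float(embed(d, vocab) @ q))[:k]
```

The retrieved documents are then prepended to the LLM prompt, so the model can answer with knowledge beyond its training date without retraining.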

128 of 190

Advantages of RAG

  • Fact-checking
  • Safe
  • Custom training
  • Cost-effective
  • Continuous updates
  • Accessible and affordable
  • Domain knowledge
  • Easier to customize

129 of 190

Symbolic Retrieval based Grounding

  • RAG enabled system
  • Domain-Specific Training
  • Enhanced Accuracy and Relevance
  • Customization for Business Needs
  • Business Alignment

Source Attribution: Retrieve, recognize, and attribute

Grounded and targeted for generating citations with structured metadata

Image: Tilwani, Deepa, et al. "REASONS: A benchmark for REtrieval and Automated citationS Of scieNtific Sentences using Public and Proprietary LLMs." arXiv preprint arXiv:2405.02228 (2024).

An Evaluation study of Citation Generation on Recent LLMs

130 of 190

How do we do Symbolic Retrieval-based Grounding?

131 of 190

But a Few Limitations of RAG…

  • Needs an existing database
  • Context length limitations
  • Latency issues
  • Dependent on semantic search
  • Hallucination still exists
  • At scale, sensitive to choices of:
    1) Chunking strategy,
    2) Embedding model, and
    3) Generation model.

132 of 190

Flaws in RAG from REASONS Dataset

Only Adv. RAG was able to correctly generate author names

133 of 190

Latency Issues

134 of 190

Latency per domain and method:

| Domain       | OpenAI | M     | L     | D     | RM     | RL     | P     | AdvRAG(L) | AdvRAG(M) |
|--------------|--------|-------|-------|-------|--------|--------|-------|-----------|-----------|
| AI           | 34:25  | 26:03 | 11:10 | 34:11 | 74:49  | 73:09  | 34:31 | 156:24    | 163:28    |
| CV           | 47:45  | 18:35 | 19:24 | 50:22 | 189:20 | 198:45 | 42:05 | 259:32    | 302:14    |
| Cryptography | 03:50  | 02:18 | 04:59 | 32:21 | 83:28  | 89:21  | 13:23 | 190:19    | 194:25    |
| Graphics     | 07:08  | 08:55 | 06:08 | 58:43 | 108:08 | 127:48 | 16:52 | 214:25    | 227:23    |
| HCI          | 03:01  | 01:10 | 00:42 | 21:56 | 48:32  | 50:51  | 02:47 | 95:56     | 98:44     |
| IR           | 20:31  | 11:40 | 06:52 | 33:34 | 91:30  | 99:43  | 19:50 | 193:37    | 202:23    |
| NLP          | 28:26  | 11:42 | 05:09 | 47:24 | 91:07  | 88:40  | 13:06 | 175:58    | 156:49    |

135 of 190

2. Knowledge Graphs (KG) Based Grounding

  • A machine-readable, structured representation of knowledge
  • Consisting of entities, entity types, and relationships in various forms (e.g., ontologies, lexicons, labeled property graphs, and RDF)

Speer et al. AAAI’17

Vrandečić et al. ACM Comm’14

Gaur et al. ICSC’19

Miller, ACM Comm’95

ConceptNet

Example triple: subject = World War I, predicate = fought_with, object = Poisonous Gas

“KG-based grounding structures information in graphs, linking concepts to improve AI systems' ability to retrieve and generate meaningful responses.”
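A toy sketch of the idea, storing the slide's example as (subject, predicate, object) triples and linking a mention to its graph neighbors (`neighbors` is an illustrative helper, not a standard KG API):

```python
# A knowledge graph as (subject, predicate, object) triples.
KG = [
    ("World War I", "fought_with", "Poisonous Gas"),
    ("COVID-19", "related_to", "vaccines"),
    ("COVID-19", "related_to", "pandemics"),
]

def neighbors(entity, kg=KG):
    """Ground a mention by linking it to connected entities in the graph."""
    out = []
    for s, p, o in kg:
        if s == entity:
            out.append((p, o))       # outgoing edge
        elif o == entity:
            out.append((p, s))       # incoming edge
    return out
```

A retrieval or generation system can use such neighbor lookups to attach real-world context to symbols appearing in a query.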

136 of 190

Illustration of ISEEQ

Figure: ISEEQ pipeline. A query (Title: “Economy and Employment Statistics”; Description: “Learn information about key economic concepts including gdp, inflation, and the influence on employment”) is constituency-parsed into key phrases: Information + {economy, employment statistics, employment, influence employment, inflation influence employment, gdp, gdp influence employment, key economic concepts}. A ConceptNet graph (economics, economy, inflation, employment, gdp, gross domestic product, unemployment, gnp, gross national product, national income, cost of living, income, personal income, income tax) provides semantic query expansion, and Sentence-BERT encoders with generative adversarial reinforcement learning generate information-seeking questions (ISQs) such as:

1. What is gross_domestic_product?
2. What is the measure of gross_domestic product?
3. What is the reason nation income relations gross_domestic_product?
4. What is the influence of inflation to gross_domestic_product?
5. What is the meaning of unemployment in inflation?
6. What is the influence of inflation on cost_of_living?

137 of 190

  • ISEEQ could be used in a medical triaging system, where a knowledge graph links symptoms, diseases, and treatments.
  • A patient query like "chest pain" could be expanded using KG-based knowledge to generate Information seeking questions (ISQs) such as:
    • "Do you also experience shortness of breath?"
    • "Have you had previous heart conditions?"
    • "Are the symptoms triggered by physical activity?"
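The KG-based expansion described above can be approximated with a template-based sketch (`RELATED` is a toy graph and the templates are illustrative; ISEEQ itself uses ConceptNet and a learned generator):

```python
# Toy concept graph in the spirit of ISEEQ's ConceptNet expansion.
RELATED = {
    "chest pain": ["shortness of breath", "heart conditions", "physical activity"],
}

def expand_query(query, graph=RELATED):
    """Expand a patient query with its KG neighbors."""
    return [query] + graph.get(query, [])

def generate_isqs(query, graph=RELATED):
    """Turn each expanded concept into an information-seeking question.

    A fixed template stands in for ISEEQ's generative adversarial RL model.
    """
    return [f"Do you also experience {c}?" for c in graph.get(query, [])]
```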

(A) An example of curiosity-driven ISQs generated by

ISEEQ. (B) overview of ISEEQ

Gaur, M., Gunaratna, K., Srinivasan, V., & Jin, H. (2022). ISEEQ: Information Seeking Question Generation Using Dynamic Meta-Information Retrieval and Knowledge Graphs. AAAI 2022

138 of 190

Functional Grounding

139 of 190

Attribution-Based Functional Grounding

A fact-checking approach for cross-referencing news claims with verified sources before publishing. Helps with:

  • Factual verification
  • Reducing hallucinations
  • External knowledge integration
  • Error categorization

Evaluating attribution and identifying specific types of errors with AttrScore, which explores two approaches: (1) prompting LLMs, and (2) fine-tuning LMs on simulated and repurposed datasets from related tasks.

Yue, Xiang et al. “Automatic Evaluation of Attribution by Large Language Models.” EMNLP (2023).

140 of 190

Reinforcement Learning for Grounding

What constitutes a good response for a given query and context is quite nuanced.

Idea: capture this with a reward model that scores each <query, context, response> triple on the appropriateness of the response. The reward model may be trained on a dataset specifying preferences between response pairs.

We can then use reinforcement learning to tune the model to maximize reward while staying within a bounded KL-divergence from the initial model.
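A sketch of the two ingredients, assuming scalar rewards and log-probabilities are available (function names are illustrative): the reward model is trained on preference pairs with a Bradley–Terry loss, and the policy maximizes reward minus a KL penalty that keeps it near the initial model.

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry loss for training the reward model on response pairs:
    -log sigmoid(r_chosen - r_rejected). Lower when the chosen response scores higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def kl_regularized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Per-sample RL objective: reward minus beta times a sample estimate of
    KL(policy || reference), i.e. log pi(y|x) - log pi_ref(y|x)."""
    return reward - beta * (logp_policy - logp_ref)
```

The beta coefficient controls how far the tuned policy may drift from the initial model while chasing reward.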


141 of 190

Interactive & Reinforcement-Based Grounding

Interactive & Reinforcement-Based Grounding ensures that LLMs do not just generate blindly but engage in a feedback-driven, iterative process to reason, verify, and adapt responses based on context.

CodeRL treats the code-generating language model as an actor network and introduces a critic network trained to predict the functional correctness of generated programs, providing dense feedback signals to the actor.

Le, H., Wang, Y., Gotmare, A. D., Savarese, S., & Hoi, S. C. H. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Neurips 2022

142 of 190

Check out our Dataset for Interactive & Reinforcement-Based Grounding at AAAI 2025

POSTER PRESENTATION ON 28TH FEB

Can LLMs Obfuscate Code? A Systematic Analysis of Large Language Models into Assembly Code Obfuscation

Seyedreza Mohseni, Seyedali Mohammadi, Deepa Tilwani, Yash Saxena, Gerald Ketu Ndawula, Sriram Vema, Edward Raff, Manas Gaur

143 of 190

Grounding Verification

Despite progress in generating grounded responses, post-hoc verification of generated responses is still indispensable

  • Especially in domains like healthcare where we may want 100% grounding
  • Especially when the query is complex and/or the retrieval quality is not good
  • Especially if verifiable-correct citations are required for each claim

144 of 190


Symbolic and Functional Grounding Together

LLAMA

Domain Knowledge: PHQ9 Depression ontology

LLAMA + Domain Knowledge Output

S. Dalal, D. Tilwani, M. Gaur, S. Jain, V. L. Shalin and A. P. Sheth, "A Cross Attention Approach to Diagnostic Explainability Using Clinical Practice Guidelines for Depression," in IEEE Journal of Biomedical and Health Informatics

“Grounded generation retrieves latest clinical guidelines and provides an evidence-based response”

“Knowledge Graphs (symbolic grounding) and adapting to domain (functional grounding)”

145 of 190

How to do Symbolic and Functional Grounding Together?

“Grounded generation retrieves latest clinical guidelines and provides an evidence-based response”

146 of 190

Original Text:

Why do i have sudden bursts of depression know the title probably doesn't make sense but stopped working for a while to peruse business idea had which failed and now i'm about go back into work force only 19 these moments where just feel lost like my family friends as is what dedicated life past 6 months most that time was me sitting in room trying get it off ground floor. really nervous getting job again haven't real one entire am overthinking or will be not bad think.

147 of 190

Self Attention Text (No Highlighting)

(Don’t know Why?)

Why do i have sudden bursts of depression know the title probably doesn't make sense but stopped working for a while to peruse business idea had which failed and now i'm about go back into work force only 19 these moments where just feel lost like my family friends as is what dedicated life past 6 months most that time was me sitting in room trying get it off ground floor. really nervous getting job again haven't real one entire am overthinking or will be not bad think.

148 of 190

Attention Over PHQ 1:How often have you been bothered by little interest or pleasure in doing things? (No Highlighting)

Why do i have sudden bursts of depression know the title probably doesn't make sense but stopped working for a while to peruse business idea had which failed and now i'm about go back into work force only 19 these moments where just feel lost like my family friends as is what dedicated life past 6 months most that time was me sitting in room trying get it off ground floor. really nervous getting job again haven't real one entire am overthinking or will be not bad think.

149 of 190

Why do i have sudden bursts of depression know the title probably doesn't make sense but stopped working for a while to peruse business idea had which failed and now i'm about go back into work force only 19 these moments where just feel lost like my family friends as is what dedicated life past 6 months most that time was me sitting in room trying get it off ground floor. really nervous getting job again haven't real one entire am overthinking or will be not bad think.

Attention Over PHQ 2 : How often are you bothered by feeling down, depressed, or hopeless?

150 of 190

Attention Over PHQ 9: How often have you been bothered by thoughts that you would be better off dead or of hurting yourself in some way ?

Why do i have sudden bursts of depression know the title probably doesn't make sense but stopped working for a while to peruse business idea had which failed and now i'm about go back into work force only 19 these moments where just feel lost like my family friends as is what dedicated life past 6 months most that time was me sitting in room trying get it off ground floor. really nervous getting job again haven't real one entire am overthinking or will be not bad think.

151 of 190

PHQ-1, PHQ-5, and PHQ-6 are unanswered questions. These are the relevant questions to be asked.

Similarity score between phrases highlighted in Self-Attention and PHQ-9 questions.

(equal attention, confused, and Unexplainable)

Cumulative Cross-Attention Scores

(PHQ-9 infusion explains model attention)

152 of 190

Check Grounding API [Google Cloud]

Check Grounding determines how grounded a given response is in a given set of facts (context)

Returns:

  • Grounding scores (a support score, and a contradiction score)
  • Citations
  • Anti-Citations

Based on custom NLI model

Generally available at: https://cloud.google.com/generative-ai-app-builder/docs/check-grounding
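The exact model behind the API is not public; the following is only a hypothetical sketch of the mechanics, with `entail` and `contradict` as user-supplied NLI scorers returning values in [0, 1]:

```python
def check_grounding(claims, facts, entail, contradict, tau=0.5):
    """Sketch of a check-grounding-style verifier: for each claim, collect
    supporting facts (citations) and contradicting facts (anti-citations),
    then report the fraction of cleanly supported claims as a support score."""
    citations, anti_citations, supported = [], [], 0
    for claim in claims:
        cites = [j for j, f in enumerate(facts) if entail(f, claim) >= tau]
        antis = [j for j, f in enumerate(facts) if contradict(f, claim) >= tau]
        citations.append(cites)
        anti_citations.append(antis)
        if cites and not antis:
            supported += 1
    support_score = supported / len(claims) if claims else 0.0
    return support_score, citations, anti_citations
```

In a real deployment the claims would come from sentence-splitting the LLM response and the facts from the user-specified context documents.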

153 of 190

Open Questions..

  • What mechanisms can be implemented for LLMs to flag uncertain or unverifiable information in their responses?
  • How should LLMs handle conflicting information when processing text from multiple sources?
  • Can LLMs dynamically cross-reference their outputs with primary sources or citations before finalizing a response?

154 of 190

Handoff

155 of 190

TH10: Neurosymbolic AI for EGI:

Explainable and Grounded Generations

Feb 25th 25

Ali Mohammadi, M294@umbc.edu

Ph.D. student at UMBC

University of Maryland, Baltimore County (UMBC), Knowledge Infused AI and Inference (KAI2) Lab

156 of 190

Why NeuroSymbolic explainable AI?

Safety in High-Stakes Applications

Alignment with Human Values

User Adoption and Confidence

Debugging and Improving Models

Ethical and Fair AI

Trust and Transparency

157 of 190

Key Focus Areas

Large Language Models (LLMs)

Explainability

Wellness Dimension

External Knowledge

158 of 190

159 of 190

160 of 190

Wellness Dimension Datasets

6 wellness dimensions:

  • Physical
  • Intellectual
  • Vocational
  • Social
  • Spiritual
  • Emotional
  1. MultiWD dataset
  2. WellXplain dataset

161 of 190

Explanation and Prediction

Fine-tuned LMs

Fine-tuned/Prompting LLMS

on WD Datasets

The fall semester was one of the worst experiences of my life, and I barely passed my four classes.

Textual Post

Explanation & Label (Expected/Predicted)

Intellectual and Vocational Aspect

fall semester was one of the worst experiences

barely passed

Wellness dimension sample

162 of 190

External Knowledge

163 of 190

External Knowledge

Findings:

General task (e.g. e-snli)

  • Models already know the definitions and rely more on their internal knowledge.

Domain specific task (e.g. WellXplain and HateXplain)

  • Models benefit from the provided external knowledge.

164 of 190

Mohammadi, S., Raff, E., Malekar, J., Palit, V., Ferraro, F., & Gaur, M. (2024). WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 364–388, Miami, Florida, US. ACL.

Instruction

Post: They make me feel unhappy and miserable (SpEA). What should I do?

Output:

SpEA (PA:0, IVA:0, SA:0, SpEA:1)

Explanation: unhappy, miserable

WELLXPLAIN Training Examples

Post: My mum, dad and step-mum (SA) won't leave me alone and they constantly make choices for me and it's starting to get to me.

Output: SA(PA:0, IVA:0, SA:1, SpEA:0)

Explanation: My mum, dad, step-mum

Evaluation uses WELLXPLAIN test examples.

WellDunn Benchmarking

Figure: a tokenizer feeds posts into a fine-tuned LM (stacked encoders or decoders) with an FFNN head that outputs the prediction and explanation; robustness is assessed with the SCE and GL losses, and explainability with SVD ranking and the AO score.

165 of 190

  • Robustness Assessment
    • Sigmoid Cross-Entropy (SCE): the standard multi-label classification loss.
    • Gambler’s Loss (GL): a loss that allows the model to abstain on uncertain examples (Liu et al., NeurIPS 2019).

166 of 190

The fall semester was one of the worst experiences of my life, and I barely passed my four classes.

Mohammadi, S., Raff, E., Malekar, J., Palit, V., Ferraro, F., & Gaur, M. (2024). WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 364–388, Miami, Florida, US. ACL.

[Ground Truth Explanation]

[Generated Explanation]

  • Explainability Assessment
    • SVD (Singular Value Decomposition) ranking: measures the focus of a model’s attention by analyzing its attention matrix. Lower ranks suggest that the model focuses on fewer, more relevant parts of the input text, aligning closely with concise explanations.
    • Attention-Overlap (AO) Score: the number of common (purple) words between the ground-truth and generated explanations divided by the number of ground-truth words.

The fall semester was one of the worst experiences of my life, and I barely passed my four classes.
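Both metrics can be sketched as follows, assuming whitespace tokenization and a spectral-energy definition of rank (WellDunn's exact implementation may differ):

```python
import numpy as np

def attention_overlap(generated, ground_truth):
    """AO score: |common words| / |ground-truth words|."""
    gen = set(generated.lower().split())
    gt = set(ground_truth.lower().split())
    return len(gen & gt) / len(gt) if gt else 0.0

def effective_rank(attn, energy=0.99):
    """SVD ranking: the number of singular values needed to retain `energy`
    of the attention matrix's spectral mass; lower means more focused attention."""
    s = np.linalg.svd(attn, compute_uv=False)
    cumulative = np.cumsum(s) / s.sum()
    return int(np.searchsorted(cumulative, energy) + 1)
```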

167 of 190

168 of 190

169 of 190

SCE vs GL attention (Post 1)

With SCE Loss:

I don’t cry anymore. want to be around anyone, do anything. Work keeps me getting up everyday. Without it would probably stare at my ceiling until passed back out again m so tired know if there is a question in this, There just isn else tell.

With GL:

I don't cry anymore. want to be around anyone do anything Work keeps me getting up everyday Without it would probably stare at my ceiling until passed back out again so tired know if there is a question in this, There just isn't else tell.

170 of 190

Future Directions

Developing a Transparent Classifier Rooted in Clinical Understanding – Addressing the disparities between prediction accuracy and attention.

Improving Attention Alignment with Ground Truth – Enhancing attention explanations to better reflect actual outcomes.

Exploring Different Prompting and Retrieval-Augmented Generation (RAG) Strategies – Testing alternative methods to improve LLM performance.

Developing a Suitable Dataset for Mental Health Applications – Curating knowledge and constructing a well-suited dataset for retrieval-augmented methods.

171 of 190

Wrap up!

Large Language Models (LLMs)

Explainability

Wellness Dimension

External Knowledge

172 of 190

173 of 190

Reference

  1. Seyedali Mohammadi, Edward Raff, Jinendra Malekar, Vedant Palit, Francis Ferraro, and Manas Gaur. 2024. WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 364–388, Miami, Florida, US. Association for Computational Linguistics.
  2. Liu, Ziyin, et al. "Deep gamblers: Learning to abstain with portfolio theory." Advances in Neural Information Processing Systems 32 (2019).
  3. Alsentzer, Emily, et al. "Publicly available clinical BERT embeddings." arXiv preprint arXiv:1904.03323 (2019).
  4. WELLDUNN: An Annotated Dataset to identify affected Wellness Dimensions in Reddit Posts, Under review.
  5. DUNN HL. High-level wellness for man and society. Am J Public Health Nations Health. 1959 Jun;49(6):786-92. doi: 10.2105/ajph.49.6.786. PMID: 13661471; PMCID: PMC1372807.
  6. Merikangas KR, He JP, Burstein M, Swanson SA, Avenevoli S, Cui L, Benjet C, Georgiades K, Swendsen J. Lifetime prevalence of mental disorders in U.S. adolescents: results from the National Comorbidity Survey Replication--Adolescent Supplement (NCS-A). J Am Acad Child Adolesc Psychiatry. 2010 Oct;49(10):980-9. doi: 10.1016/j.jaac.2010.05.017. Epub 2010 Jul 31. PMID: 20855043; PMCID: PMC2946114.
  7. Garg, M. Mental Health Analysis in Social Media Posts: A Survey. Arch Computat Methods Eng 30, 1819–1842 (2023). https://doi.org/10.1007/s11831-022-09863-z
  8. Garg, Muskan. "WellXplain: Wellness concept extraction and classification in Reddit posts for mental health analysis." Knowledge-Based Systems 284 (2024): 111228.
  9. Sathvik, M. S. V. P. J., and Muskan Garg. "Multiwd: Multiple wellness dimensions in social media posts." Authorea Preprints (2023)

174 of 190

175 of 190

openCHA:

Building Explainable

and Personalized Conversational Agent

Iman Azimi, PhD

February 25, 2025

176 of 190

Healthcare chatbots or Conversational Health Agents

Chatbots have the potential to play a crucial role in healthcare by assisting patients and healthcare providers:

    • Clinical decision support
    • Patient monitoring and follow-up
    • Chronic health management
    • Patient’s self-awareness


Bedi, Suhana, et al. "A Systematic Review of Testing and Evaluation of Healthcare Applications of Large Language Models (LLMs)." medRxiv (2024): 2024-04.

177 of 190

Why are healthcare chatbots not widely used?

Existing chatbots are not able to provide:

  1. Trustworthiness
  2. Personalization (no access to user’s data)
  3. Data analysis
    • No access to established multimodal data analysis tools
  4. Access to up-to-date information
    • Ignoring well-established healthcare research
  5. Explainability


Abbasian, M., Khatibi, E., Azimi, I., Oniani, D., Shakeri Hossein Abad, Z., Thieme, A., ... & Rahmani, A. M. (2024). Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI. npj Digital Medicine, 7(1), 82.

178 of 190

openCHA (Conversational Health Agents)

A holistic LLM-powered framework to integrate health data, knowledge, and analytical tools into healthcare chatbots.


  • Abbasian, M., Azimi, I., Rahmani, A.M. and Jain, R., 2023. Conversational Health Agents: A Personalized LLM-Powered Agent Framework. arXiv preprint arXiv:2310.02374.
  • GitHub repo: https://github.com/Institute4FutureHealth/CHA

179 of 190

openCHA framework


180 of 190

Interface

Acts as a bridge between the users and agents

  • Users’ queries
  • Metadata


181 of 190

Orchestrator

Responsible for problem solving, decision making, and response generation

  • Input data are aggregated, transformed into structured data, and then analyzed to plan and execute actions
  • Interacts with external sources to acquire the required information, perform data integration and analysis, and extract insights, among other functions.
  • Converts the information into an understandable format and infers the appropriate response.


182 of 190

External sources

Obtain essential information from the broader world

  • Datasets
  • Knowledge bases
  • Analytical tools
  • Translators


183 of 190

Demo:

Nutrition causal effects

Tasks involved:

  • Get data
  • Causal graph (personal info)
  • Food’s nutritional content (general info)


Z. Yang, E. Khatibi, N. Nagesh, M. Abbasian, I. Azimi, R. Jain, and A. Rahmani, “ChatDiet: Empowering Personalized Nutrition-Oriented Food Recommender Chatbots through an LLM-Augmented Framework,” Elsevier Smart Health, IEEE/ACM CHASE, 2024.

184 of 190

Patient health record reporting

(1)

Tasks involved:

  • Get data
  • Statistical analysis
  • Internet search
  • Extract text


Dataset: S. Labbaf, et al. "Physiological and Emotional Assessment of College Students Using Wearable and Mobile Devices During the 2020 Covid-19 Lockdown: An Intensive, Longitudinal Dataset." Longitudinal Dataset (2023).

185 of 190

Patient health record reporting

(2)

Tasks involved:

  • Get data
  • Statistical analysis
  • SerpAPI
  • Extract Text


Dataset: S. Labbaf, et al. "Physiological and Emotional Assessment of College Students Using Wearable and Mobile Devices During the 2020 Covid-19 Lockdown: An Intensive, Longitudinal Dataset." Longitudinal Dataset (2023).

186 of 190

Objective stress level estimation

Tasks involved:

  • Get data
  • PPG analysis (HRV extraction)
  • HRV analysis (stress estimation)


Dataset: S. Labbaf, et al. "Physiological and Emotional Assessment of College Students Using Wearable and Mobile Devices During the 2020 Covid-19 Lockdown: An Intensive, Longitudinal Dataset." Longitudinal Dataset (2023).

187 of 190

Use cases

  • Yang, Zhongqi, et al. "ChatDiet: Empowering personalized nutrition-oriented food recommender chatbots through an LLM-augmented framework." Smart Health 32 (2024): 100465.
  • Abbasian, Mahyar, et al. "Knowledge-Infused LLM-Powered Conversational Health Agent: A Case Study for Diabetes Patients." 46th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), (2024).
  • Park, Jung In, et al. "Building Trust in Mental Health Chatbots: Safety Metrics and LLM-Based Evaluation Tools." arXiv preprint arXiv:2408.04650 (2024).
  • Abbasian, Mahyar, et al. "Empathy Through Multimodality in Conversational Interfaces." arXiv preprint arXiv:2405.04777 (2024).


188 of 190

Future directions

We invite contributions from diverse communities: contribute your ideas and connect your tools to CHA, leading to more precise user responses.

  • Safety of the responses
    • Benchmarking
  • Connecting open datasets, knowledge graphs, etc.
    • Planning and decision making
    • Retrieve information
  • Chronic Health Management


189 of 190

Thank You

Questions?

More info about openCHA:

arxiv.org/abs/2310.02374

GitHub repository:

github.com/Institute4FutureHealth/CHA

User guide and quick start:

opencha.com

Should you be interested, please reach out to me at


  • Tutorial website: https://nesy-egi.github.io

Slides Available :

https://nesy-egi.github.io

190 of 190

Thanks! Questions?

  • Feedback most welcome :-)

  • Tutorial website: https://nesy-egi.github.io