1 of 191

The Blue Sky Research in LLM and Alignment

Soujanya Poria

On behalf of DeCLaRe Lab

2 of 191

[Slide diagram: evolution of NLP research. Pre-2023 fundamentals (POS tagging, HMM/CRF, NER, MT, tagging, NLI, text classification) and applications (finance, search/IR, social media, chatbots/assistants, anomaly detection, medical). 2023: LLMs with PPO, DPO, CoT, and prompting; applications in chatbots, maths, coding, search/IR. 2024: LLMs with merging, distillation, k-bit quantization, reasoning, and alignment; applications in navigation, agents, auto-evaluation, search/IR.]

3 of 191

Why is Alignment Needed?

  • Consider the current-day risks/harms of today’s AI systems
    • misinformation
    • fairness/biases
    • privacy
  • One thing these items have in common is that the risks directly relate to the things the systems were designed to do.

4 of 191

Preference Alignment: RLHF

5 of 191

Alignment is Often Defined as

Helpfulness – all types of capabilities required to help the user.

  • Good at reasoning.
  • Good at perception.
  • RAG etc.

Harmlessness – all types of resistance against harmful intent, protecting users from being exposed to harmful information.

  • Trustworthiness.

  • Safety.

  • Privacy.

6 of 191

Understanding Reasoning Bottlenecks

Hong et al. Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions.

Helpfulness

7 of 191

Are LLMs Math Marvels?

  1. There are a lot of existing mathematical reasoning datasets:
    1. GSM8k (92%, five shot)
    2. MMLU math (87.5%)
    3. MATH (50.4%)
  2. Does GPT-4 already perform well on arithmetic reasoning?
  3. Is the performance drop due to the complexity or diversity of the underlying context? Or something else?

8 of 191

Testing LLMs’ Math and Coding Competency

You have perhaps heard about GSM-Symbolic but did not hear about this work 😓 This work came 10 months before GSM-Symbolic.

9 of 191

Original

Logic Alteration

Concept Analysis

Format Constraint

Original Question

Math

Question:

John has 3 boxes. Each box is 5 inches by 6 inches by 4 inches. The walls are 1 inch thick. What is the inner volume of all 3 boxes?

Answer: Walls are 1 inch thick, reducing each dimension by 2 inches.

Thus, the internal dimensions become 3x4x2=24 cubic inches, making the total volume for all 3 boxes 3×24=72 cubic inches.

Question - Variable Relationship:

John has X boxes. Each box is Y inches by 6 inches by 4 inches. …

If the total inner volume of all the boxes is 72 cubic inches, find the equation that relates X and Y?

Answer: Walls are 1 inch thick, reducing each dimension by 2 inches. Thus, the internal dimensions become (Y-2)x4x2 cubic inches for one box. Given the total volume for all X boxes is 72 cubic inches, the equation relating X and Y simplifies to X*(Y-2)=9.

Question - Step Necessity:

…. Suppose we want to find out the total inner volume of all 3 boxes. To solve this math question,is there a way to determine the total inner volume of all 3 boxes without calculating the inner volume of one box?

One possible answer: Yes. Calculate by using:

total inner volume = total outer volume - volume of the walls.

Question - Reasoning Format:

Answer the following question with only base-2 coded values.

… What is the total inner volume of all 3 boxes?

Answer (in binary): Walls are 1(bin) inch thick, reducing each dimension by 10(bin) inches. Thus, the internal dimensions become:

11x100x10=11000(bin) cubic inches.

making the total volume for all 11(bin) boxes:

11x11000=1001000 (bin) cubic inches.

Original Question

Multimodal

Question:

Observing the kitchen scenario depicted, what might be the most likely reason for the water spills and mess on the floor?

Answer:

The most likely reason for the water spills and mess on the floor in the depicted kitchen scenario could be due to the pot boiling over on the stove. The visible steam and the lid slightly ajar suggest that the contents of the pot may have bubbled up and spilled out, contributing to the water on the countertop and subsequently on the floor. Additionally, the running faucet could imply that water was left running, possibly due to a hurried or distracted attempt to address the boiling over, leading to further spills. The presence of cleaning supplies like a sponge and a brush on the wet floor indicates an effort to clean up the resultant mess, which supports the idea that an accident occurred during cooking.

Question - Variable Relationship:

Given the current state of the kitchen, if the presence of an overflowing pot correlates with water spills and a sponge and brush are found on the floor, what might the relationship be between these observed items and the resultant mess?

Answer:

The correlation could suggest that the water spills are due to the overflowing pot, and the presence of cleaning tools (sponge and brush) indicates an attempt to manage the mess. Therefore, the relationship is that the overflowing pot is the source of the water spills, and the tools are there for cleanup efforts.

Question - Step Necessity:

In the depicted kitchen scenario, is it necessary to observe the overflowing pot to deduce that the water spills are a result of someone cooking?

One possible answer:

No, it is not strictly necessary to observe the overflowing pot to deduce that someone was cooking. Inductive reasoning from the presence of a pot on the stove, along with other cooking utensils, and the mess associated with cooking activities can lead to the conclusion that the water spills are a result of cooking activities.

Question - Reasoning Format:

Analyze the depicted kitchen scene using deductive reasoning to determine the cause of the water spills. Explain the reasoning process and conclusion.

Answer: From the image, we see an overflowing pot on the stove, a running faucet, and cleaning tools on the floor. Using deductive reasoning:

  1. If a pot is overflowing and the stove is on, then water will spill onto the floor.
  2. The stove is on, and a pot is overflowing.
  3. Therefore, the overflowing pot is the cause of water on the floor. Next, we deduce:
  4. If cleaning tools are present on a wet floor, then someone is likely attempting to clean a spill.
  5. There are cleaning tools on the wet floor.
  6. Therefore, someone was likely attempting to clean the spill when the mess occurred.

10 of 191

Research Question

  1. How robust is the arithmetic reasoning capability of LLMs?
    1. Through perturbations. (Large performance drop)
    2. Do they succeed because of data contamination? (Almost certain)

  2. Under which conditions or across which dimensions do these models show limitations in reasoning?
    1. Can GPT-4 perform only limited ways of reasoning? (Yes)
    2. Can LLMs understand unusual mathematical questions? (No)

We choose to study these questions using the easiest dataset: GSM8k.

If we make it harder, we succeed!

11 of 191

Question Decomposition

12 of 191

Ontology

Picked 5 Random Questions

13 of 191

Curation

  1. GPT-4 generation (for variability)
  2. GPT-4 filtering
  3. Manual filtering (⅓ correct, ⅓ minor modification, ⅓ fails)

Manual effort is needed when creating a high-quality dataset! GPT-4 cannot evaluate itself.

14 of 191

Examples - Logic Alteration

Original:

Variable Relationship:

John has X boxes. Each box is Y inches by 6 inches by 4 inches. The walls are 1 inch thick. If the total inner volume of all the boxes is 72 cubic inches, then find the equation that relates X and Y?
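A quick arithmetic check of this perturbation (a throwaway Python snippet, not part of the benchmark) confirms that the original answer and the derived relation X*(Y-2) = 9 are consistent:

```python
def inner_volume(n_boxes, dims, wall=1):
    """Total inner volume: each dimension shrinks by 2 * wall (walls on both sides)."""
    inner = 1
    for d in dims:
        inner *= d - 2 * wall
    return n_boxes * inner

# Original question: 3 boxes of 5 x 6 x 4 with 1-inch walls -> 3 * (3 * 4 * 2) = 72 cubic inches
assert inner_volume(3, (5, 6, 4)) == 72

# Perturbed question: X boxes of Y x 6 x 4 -> X * (Y - 2) * 4 * 2 = 72, i.e. X * (Y - 2) = 9
for x, y in [(1, 11), (3, 5), (9, 3)]:  # integer solutions of X * (Y - 2) = 9
    assert inner_volume(x, (y, 6, 4)) == 72
```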

15 of 191

Examples - Concept Analysis

Original:

Step Necessity:

John has 3 boxes. Each box is 5 inches by 6 inches by 4 inches. The walls are 1 inch thick. Suppose we want to find out the total inner volume of all 3 boxes. To solve this math question, is there a way to determine the total inner volume of all 3 boxes without calculating the inner volume of one box?

16 of 191

Examples - Format Change

Original:

Question:

John has X boxes. Each box is Y inches by 6 inches by 4 inches. The walls are 1 inch thick. If the total inner volume of all the boxes is 72 cubic inches, then find the equation that relates X and Y?

17 of 191

Results - General

  1. Pronounced drop in every model
  2. Concept analysis is the most difficult category, as it requires a high-level understanding of math
  3. Closed-source models are affected more by format change.

18 of 191

Multimodal bottlenecks

Chia et al. PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns. ACL Findings 2024.

19 of 191

PuzzleVQA Ontology

20 of 191

The Struggle of LLMs on PuzzleVQA

21 of 191

How about Algorithmic Puzzles?

22 of 191

The Story does not Change….

23 of 191

How about Planning?

24 of 191

Can Do Dataset

25 of 191

Planning Bottlenecks

26 of 191

Learning to Reason

Chia et al. Learning to Reason and Explore From Diverse Paths. EMNLP 2024

Helpfulness

27 of 191

Motivations

  1. Prompting techniques are not robust.
  2. Having identified the reasoning bottlenecks, can we tune LLMs to address them?
  3. What other techniques are possible?
    1. Improved tokenization.
      1. Left-to-right prediction may not work well for math.

28 of 191

Reasoning Paths Optimization: A Framework For Exploring And Learning From Diverse Reasoning Paths

  • Motivation: Reasoning in language models may easily diverge into errors

29 of 191

Framework: Reasoning Paths Optimization

  • 1. Generation
    • Leverage CoT to obtain reference paths
  • 2. Exploration
    • From each step in the path, expand with favorable and unfavorable branches
  • 3. Optimization
    • Provide contrastive feedback to enhance LLM reasoning (see the sketch below)
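A minimal sketch of what such contrastive feedback could look like, phrased as a DPO-style objective over a favorable and an unfavorable branch expanded from the same reasoning step (assumes a Hugging Face-style causal LM; the pairing scheme and hyperparameters are illustrative, not the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def branch_logprob(model, prompt_ids, branch_ids):
    """Sum of token log-probs of a reasoning branch given the prompt (teacher forcing)."""
    input_ids = torch.cat([prompt_ids, branch_ids]).unsqueeze(0)
    logits = model(input_ids).logits[0, :-1]                    # next-token logits
    logps = F.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, input_ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logps[prompt_ids.numel() - 1:].sum()           # keep only the branch tokens

def contrastive_branch_loss(model, ref_model, prompt_ids, good_branch, bad_branch, beta=0.1):
    """Prefer the favorable branch over the unfavorable one, relative to a frozen reference model."""
    pi_good = branch_logprob(model, prompt_ids, good_branch)
    pi_bad = branch_logprob(model, prompt_ids, bad_branch)
    with torch.no_grad():
        ref_good = branch_logprob(ref_model, prompt_ids, good_branch)
        ref_bad = branch_logprob(ref_model, prompt_ids, bad_branch)
    margin = (pi_good - ref_good) - (pi_bad - ref_bad)
    return -F.logsigmoid(beta * margin)
```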

30 of 191

Main Results

  • We observe consistent benefits across math and science reasoning tasks

31 of 191

Analysis

  • Further experiments show that RPO scales well to longer reasoning solutions

32 of 191

Takeaways

  • Reasoning in language models can easily diverge into errors
  • Our approach addresses this with contrastive feedback over the favorable and unfavorable branches
  • Unlike previous works, we do not require human-annotated reasoning paths
  • Thus, we believe this is a scalable and effective method to improve reasoning

33 of 191

Improving Helpfulness with Verification

Yu et al. Reward Steering with Evolutionary Heuristics for Inference-time Alignment. Arxiv 2024.

Helpfulness

34 of 191

Not All Votes Count!

Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning

Motivations

  1. LLMs often make arithmetic errors.
  2. Majority-voted answers can still be incorrect.

Key Idea

  • Use translated programs, derived from natural language solutions, as a verification mechanism to identify and filter out incorrect reasoning paths before aggregating final answers.

35 of 191

Framework

  1. Generating plan and solution: using Plan-and-Solve prompting.
  2. Translation: convert the plan and solution to a Python program.
  3. Verification: verify the solution using the program output.
  4. Selection: select the final answer using majority voting over all verified answers (see the sketch below).
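A rough sketch of the verify-then-vote loop (all callables are placeholders for LLM prompting and sandboxed execution; this illustrates the idea rather than the paper's exact implementation):

```python
from collections import Counter

def prove_answer(question, generate_solution, translate_to_program, run_program,
                 n_samples=16, tol=1e-6):
    """Sample reasoning paths, keep only those whose translated program agrees, then majority-vote."""
    verified = []
    for _ in range(n_samples):
        solution, answer = generate_solution(question)        # Plan-and-Solve CoT -> numeric answer
        program = translate_to_program(question, solution)     # natural-language solution -> program
        try:
            program_answer = run_program(program)               # execute in a sandbox
        except Exception:
            continue                                            # unexecutable program: discard this path
        if abs(program_answer - answer) < tol:                  # program must agree with the CoT answer
            verified.append(answer)
    if not verified:                                            # fall back to plain self-consistency
        verified = [generate_solution(question)[1] for _ in range(n_samples)]
    return Counter(verified).most_common(1)[0][0]
```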

36 of 191

Main Results

  • PROVE consistently outperforms baselines across all model sizes and datasets, achieving improvements of up to 18% on GSM8K and 8% on MATH-500.

37 of 191

Analysis

  • The improvement over baselines remains consistent as the number of samples increases.

38 of 191

Analysis

  • PROVE reduces calculation errors.

39 of 191

Takeaways

  • We demonstrate that using translated programs for verification can effectively filter out low-quality reasoning paths.
  • Our approach is model-agnostic and does not require fine-tuning or few-shot exemplars for prompting.
  • PROVE consistently outperforms baseline methods across 13 LLMs and eight mathematical reasoning datasets.

40 of 191

Inference-time Alignment

Yu et al. Reward Steering with Evolutionary Heuristics for Inference-time Alignment. Arxiv 2024.

Helpfulness

41 of 191

Problem and Challenge

Problem: Ensuring LLMs operate in a way that aligns with human-intended goals. Involves guiding the model's behavior to be safe, reliable, and aligned with the desired outcomes of its users, avoiding harmful or biased outputs.

Challenges: Current preference optimization methods interfere with the model's prior training and risk failing to keep up with evolving user expectations.

Inference-Time Alignment: Aligning models without explicit weight updates, by modifying the LLM's decoding method.

42 of 191

Reward Steering with Evolutionary Heuristics for Inference-time Alignment

Darwin approaches the inference-time alignment problem as a reward-guided tree search problem.

Decouples exploration and exploitation of tree search

Can be used on top of preference tuning methods

Uses an off-the-shelf reward model

Outperforms strong baselines on AlpacaEval 2 and MT-Bench

43 of 191

Exploration and Exploitation

Exploration

  • Sample N: sample N independent continuations from a given prompt
  • Instruction Mutation: prompt the LLM to modify the original instruction into N mutated instructions, and generate an output with each instruction

Exploitation

  • Best of N: select the highest-reward sequences
  • Reward-guided beam replacement: periodically replace low-reward generated sequences with the top-k rewarded sequences (see the sketch below)
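A rough sketch of one exploration/exploitation round viewed as reward-guided search (the generation, mutation, and reward calls are placeholders; Darwin's actual scheduling of mutation and beam replacement is more involved):

```python
def darwin_round(instruction, generate, mutate_instruction, reward, n=8, k=2):
    """One round: explore N + N candidates, exploit by keeping the top-k under the reward model."""
    # Exploration 1: N independent continuations of the original instruction
    candidates = [generate(instruction) for _ in range(n)]
    # Exploration 2: N mutated instructions, one continuation per mutation
    candidates += [generate(m) for m in mutate_instruction(instruction, n)]
    # Exploitation: best-of-N under an off-the-shelf reward model; keeping only the
    # top-k beams amounts to reward-guided beam replacement across rounds
    candidates.sort(key=lambda response: reward(instruction, response), reverse=True)
    return candidates[:k]
```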

44 of 191

45 of 191

Darwin Workflow

46 of 191

Darwin Improves LLM Performance during Inference

47 of 191

Darwin Improves SIMPO and DPO

48 of 191

Motivations

  • Prompting techniques are not robust.
  • Having identified the reasoning bottlenecks, can we tune LLMs to address them?
  • What other techniques are possible?
    • Improved tokenization.
      • Left-to-right prediction may not work well for math.

49 of 191

Model Merging

50 of 191

DELLA-Merging:

Reducing Interference in Model Merging through Magnitude-Based Sampling

Model Merging:

  • Computationally efficient compared to training multi-task models
  • Enables the model to use information from relevant tasks to improve task performance
  • Improves out-of-distribution generalisation
  • Reduces biases arising from single-task training

Problem:

Maintaining separate fine-tuned models for different tasks presents several limitations, e.g., memory footprint, cost, and the inability to leverage transfer learning across tasks.

51 of 191

DELLA-Merging

  • Step 1 Drop:
    • Assigns drop probability pi to delta parameters inversely proportional to their magnitudes.
  • Step 2 Elect:
    • Reduces interference by electing parameters with dominant sign.
  • Step 3 Fuse:
    • Perform weighted averaging of the retained delta parameters.

Drop → Elect → Fuse

52 of 191

DELLA-Merging: MagPrune (Drop Step)

  • Assigns a drop probability pi to each delta parameter and rescales the retained parameters by 1/(1-pi).

  • Probabilities are assigned based on magnitude: parameters with higher magnitudes have a lower drop rate (see the sketch below).
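A minimal sketch of the three steps on flat delta-parameter tensors (the probability schedule and rescaling below are simplified assumptions, not DELLA's exact formulation):

```python
import torch

def della_merge(deltas, keep_min=0.1, keep_max=0.9):
    """deltas: list of task vectors (fine-tuned weights minus base weights), one tensor per model."""
    pruned = []
    for d in deltas:
        mag = d.abs()
        # Drop: keep-probability grows with magnitude, so small deltas are dropped more often
        keep_p = keep_min + (keep_max - keep_min) * (mag - mag.min()) / (mag.max() - mag.min() + 1e-12)
        mask = torch.bernoulli(keep_p)
        pruned.append(d * mask / keep_p)       # rescale retained deltas by 1 / keep_p = 1 / (1 - p_drop)
    stacked = torch.stack(pruned)
    # Elect: keep only parameters whose sign matches the dominant sign across models
    dominant_sign = torch.sign(stacked.sum(dim=0))
    elected = stacked * (torch.sign(stacked) == dominant_sign).float()
    # Fuse: average the elected deltas per parameter
    counts = (elected != 0).sum(dim=0).clamp(min=1)
    return elected.sum(dim=0) / counts
```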

53 of 191

Results

  • Della outperforms SOTA methods in 3/4 merge combinations of Instruct, Math and Code models.
  • Della shows 2.4% improvement over SOTA merging methods.

Deep, Pala Tej, Rishabh Bhardwaj, and Soujanya Poria. "DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling." arXiv preprint arXiv:2406.11617 (2024).

54 of 191

Multimodal RAG

55 of 191

Overview

  • Motivation
    • Understanding documents can involve multimodal content such as texts, figures, and tables
    • But question answering over documents can involve hundreds of pages and detailed analysis
    • While leveraging retrieval (RAG) can improve efficiency, models may still be distracted by irrelevant content
  • Contributions
    • M-LongDoc: A benchmark and automatic evaluation framework on multimodal long documents
    • Retrieval-aware tuning framework for multimodal document understanding
  • Findings
    • Most models struggle with figure and table-based questions compared to text-based questions, revealing their multimodal bias
    • Experiments show that our tuning approach achieves a relative improvement of 4.6%

56 of 191

Data Example

  • In M-LongDoc, we focus on questions with longer explanations or analysis, rather than extracting short answers or counting objects

57 of 191

Data Overview

  • M-LongDoc covers diverse topics in academic, financial, and product domains, with much longer documents

58 of 191

Data Construction

  • Given each {text, table, figure}, we generate and verify open-ended questions

59 of 191

Evaluation Framework

  • To assess open-ended answers, we leverage a detailed evaluation guide with multi-judge scoring

60 of 191

Preliminary Study

  • Current models tend to be weaker in processing visual contents (figures and tables) than texts
  • Increasing the retrieval context length is expensive and may hurt performance
  • Multimodal models may be easily distracted by irrelevant content in the retrieved context

61 of 191

Training Framework

  • To effectively leverage retrieval, the model must distinguish between irrelevant and relevant multimodal content

62 of 191

Results

  • Through our retrieval-aware tuning, we improve Qwen2-VL by 4.6% (3.84 -> 4.02)

63 of 191

Vision Language Action Models

LLM for Robotics

Sun, Qi, Pengfei Hong, Tej Deep Pala, Vernon Toh, U. Tan, Deepanway Ghosal, and Soujanya Poria. "Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning." arXiv preprint arXiv:2412.11974 (2024).

Helpfulness

64 of 191

Meet Emma-X: An Embodied Multimodal Action Model

  • Existing models have "muscle memory"
  • Not capable of reasoning
  • Not generalizable
  • Hallucinate

65 of 191

Meet Emma-X: An Embodied Multimodal Action Model

  • Emma-X employs grounded chain of thought and look-ahead spatial reasoning
  • Outperforms SOTA by 25%
  • On out-of-domain tasks, performance improved by 30%
  • On spatial reasoning tasks, performance improved by 40%

66 of 191

Overview of Emma-X

67 of 191

Overview of Data Construction

68 of 191

Training and Inference with Emma-X

  • Trained with visually grounded chain-of-thought
  • Knows how to reason spatially
  • Easy human intervention

69 of 191

Emma-X is the new SOTA

70 of 191

Emma-X in Action

Open the microwave

Pick up an object that is a kind of vegetable

71 of 191

NORA

VLA Trained from Scratch

  • VLM → VLA

72 of 191

NORA

73 of 191

Demo (Kitchen)

With object distraction: Put the pink toy in pot

With human distraction: Put the carrot in pot

With human + object distraction: Put the pink toy in pot

74 of 191

NORA

Community Reception

🚀 4K+ downloads in just two weeks

75 of 191

Tango Model Family

Text to Audio Generation

Majumder et al. Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization. ACM MM 2024.

Helpfulness

76 of 191

Background of Tango

  • LDM with UNet backbone
  • Flan-T5 encoder as text encoder
  • 63 times fewer training samples than AudioLDM

77 of 191

Observations on Tango Outputs

  • Missing acoustic event
  • Temporally misaligned acoustic events

78 of 191

Alignment to the Rescue

79 of 191

Alignment Dataset

Strategy 1: prompt → four audio samples; vary the denoising steps

80 of 191

Alignment Dataset

Strategy 1: prompt → four audio samples; vary the denoising steps

Strategy 2: prompt → perturbed prompts → audio samples

Strategy 3: prompt → temporally perturbed prompts → audio samples
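One simple way to turn such strategies into DPO-ready pairs is to score each generated audio against the original prompt (for example with a CLAP-style text-audio similarity) and pair the most faithful sample with the least faithful one; a rough sketch under that assumption (all callables are placeholders):

```python
def build_preference_pair(prompt, generate_audio, text_audio_similarity, n_samples=4):
    """Generate several audios for one prompt and pair the best vs. worst by prompt faithfulness."""
    audios = [generate_audio(prompt) for _ in range(n_samples)]    # e.g. varying the denoising steps
    scores = [text_audio_similarity(prompt, a) for a in audios]    # CLAP-style alignment score
    best = audios[scores.index(max(scores))]
    worst = audios[scores.index(min(scores))]
    return {"prompt": prompt, "chosen": best, "rejected": worst}   # preference pair for DPO
```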

81 of 191

Perturbed Prompts

82 of 191

Alignment Dataset

83 of 191

Audio Alpaca Stats

84 of 191

Results

85 of 191

TangoFlux

  • Powered by rectified flow matching.
  • SOTA results thanks to online iterative DPO training.

Hung, Chia-Yu, et al. "TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization." arXiv preprint arXiv:2412.21037 (2024).

86 of 191

Online Iterative Training

87 of 191

Results

88 of 191

Community Response

89 of 191

Multimodal Representation Learning

Helpfulness

90 of 191

Why Multimodal? — Human Communication is Multimodal

Introduction

  • Each modality provides complementary information
  • Key applications in behavior understanding
    • Analyse interviews.
    • Deception detection.
        • Applications in the legal domain.
    • Tele-medicine for mental health
    • Brain-computer interfaces
      • Multi-sensory inputs

91 of 191

Major Challenge in Multimodal Analysis

    • Develop techniques to fuse multiple modalities.
    • Modalities are heterogeneous, which makes them hard to fuse.
    • Handling large amounts of data.

Introduction

92 of 191

Blueprint of Multimodal Fusion

[Figure: unimodal representations (audio, visual, text) → intermediate representations → joint multimodal representation]

  • Improving intermediate representations
    • Lead to better joint multimodal representations.
    • Apply inductive bias using constraints.

Introduction

Hazarika, Devamanyu, Roger Zimmermann, and Soujanya Poria. "Misa: Modality-invariant and-specific representations for multimodal sentiment analysis." Proceedings of the 28th ACM international conference on multimedia. 2020.

93 of 191

MISA vs Rest

Introduction

94 of 191

Why Disentangle Features?

  • Modality-invariant features may not be helpful as modalities do not agree w.r.t. label
  • An inductive bias to be faithful to the input modality composition

Introduction

95 of 191

Introduction

96 of 191

Task Setup

97 of 191

Overall Framework

Method

98 of 191

Combining Modality-invariant and -specific Features

Method

99 of 191

Distributional Similarity for Invariant Features

Method

100 of 191

Modeling Orthogonal Modality-specific Features

Method

101 of 191

Preventing the Learning of Trivial Representations

Method

102 of 191

Combining Modality-invariant and -specific Features

Method

One of the first few works showing attention can be used for multimodal fusion
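A rough PyTorch sketch of attention-based fusion over the six vectors (three modality-invariant plus three modality-specific); the dimensions, head count, and prediction head are illustrative, not the exact MISA architecture:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse 3 invariant + 3 specific modality vectors with self-attention, then predict."""
    def __init__(self, dim=128, heads=4, out_dim=1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(6 * dim, out_dim)

    def forward(self, invariant, specific):
        # invariant, specific: lists of 3 tensors of shape (batch, dim) for text / audio / video
        tokens = torch.stack(invariant + specific, dim=1)   # (batch, 6, dim)
        fused, _ = self.attn(tokens, tokens, tokens)         # each vector attends to the others
        return self.head(fused.flatten(start_dim=1))         # joint multimodal prediction

# Example: batch of 8, 128-d vectors per modality
fusion = AttentionFusion()
inv = [torch.randn(8, 128) for _ in range(3)]
spec = [torch.randn(8, 128) for _ in range(3)]
prediction = fusion(inv, spec)                               # shape (8, 1)
```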

103 of 191

The Overall Loss Function

Method

104 of 191

Datasets

Experiments

105 of 191

Baselines

Temporal Fusion: MFN, MARN, MV-LSTM, RMFN

Attention Transformer: RAVEN, MulT

Graph-based: Graph-MFN

Tensor-Fusion: TFN, LMF, LMFN, HFFN

Common Representations: MCTN, ARGF, MFM

Inter-utterance Joint Models: BC-LSTM, CH-FUSION, CIA, CIM-MTL, DFF-ATMF

Experiments

106 of 191

State of the Art

Interaction Canonical Correlation Network

Sun, Z. et al., Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. AAAI 2020

Contextual Memory Fusion Network

Hasan et al. UR-FUNNY: A Multimodal Language Dataset for Understanding Humor. EMNLP-IJCNLP 2019

Experiments

107 of 191

Low-level Features

Experiments

Language: GloVe token embeddings or BERT sentence embeddings

Audio: COVAREP (12 Mel-frequency cepstral coefficients, pitch, voiced/unvoiced segments, …)

Visual: Facial Action Coding System for MOSI/MOSEI; OpenFace for UR_FUNNY

108 of 191

Results

CMU-MOSI

MAE — Lower is better

  • Consistent improvement over state-of-the-art
  • Model improves more when the task is difficult
    • Sentiment/emotion intensity prediction.
    • 7-way sentiment classification.
    • Indicates the importance of modality invariant and specific features.

109 of 191

Results

CMU-MOSEI

UR_FUNNY

Similar Trend of Results on CMU-MOSEI and UR_Funny

110 of 191

Results

Ablations

  • Every modality is important.
    • Language modality is the most crucial.
      • Due to clean transcriptions.
  • All three losses are vital.
    • is the least important.
    • Non trivial representations learned by modality encoders.

111 of 191

Analysis

t-SNE Projections

Final Loss:

indicates no similarity and difference loss

  • Presence of
    • Invariant features are grouped together.

112 of 191

Analysis

Contribution of Learned Vectors

  • Plot of penultimate multihead attention scores.
  • Text modality contributes the most.
  • High scores for both modality-specific and invariant.

113 of 191

Analysis

Learning Curve

  • The plot indicates all three losses decrease on the validation data during training.
    • The model is learning as per our hypothesis.

114 of 191

Improvements

Multimodal-Infomax: Overall Idea

  1. Maximising mutual information can replace the CMD loss in MISA for improved multimodal representation learning.
  2. Maximise MI between each pair of modalities: the textual modality is denoted by x, and the visual and speech modalities are denoted by y. Maximise MI between the fused output and each individual modality hierarchically: Z is the modality-fused representation, and we incorporate this score function into the Noise-Contrastive Estimation framework by treating all other representations of that modality in the same batch as negative samples (a sketch of the contrastive estimator follows).
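A minimal InfoNCE-style sketch of the MI-maximization term with in-batch negatives (MMIM's actual score functions and hierarchical setup are more elaborate; this only illustrates the contrastive estimator):

```python
import torch
import torch.nn.functional as F

def infonce_loss(x, y, temperature=0.1):
    """Contrastive lower bound on MI between paired representations x, y of shape (batch, dim).
    Matching indices are positives; all other samples in the batch act as negatives."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature                       # (batch, batch) similarity scores
    targets = torch.arange(x.size(0), device=x.device)
    return F.cross_entropy(logits, targets)
```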

115 of 191

Improvements

Multimodal-Infomax: Overall Idea

Modalities are correlated → capture modality correlation in the representations → maximize mutual information:

  • Between each pair of modalities
  • Between the modalities and the fused representation

This helps in learning better intermediate representations.

116 of 191

Improvements

Multimodal-Infomax: Results

117 of 191

Improvements

Multimodal-Infomax: Results

CMU-MOSEI

CMU-MOSI

118 of 191

Trustworthiness

Song et al. Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse. Arxiv 2024.

Harmlessness

119 of 191

LLMs Hallucinate

120 of 191

Do LLMs Know what they Know?

121 of 191

Jokes Apart: The Problem is Really Critical

122 of 191

Problem

123 of 191

Problem

124 of 191

Problem

  1. Realistic but incorrect information - hallucination!
  2. How to verify information? Source?

125 of 191

Retrieval-Augmented Generation (RAG) as a Solution to Hallucination

  • Want LLM to rely only on the external documents
  • Need LLM to ground response in the documents rather than rely on parametric knowledge

126 of 191

Retrieval-Augmented Generation (RAG) as a Solution to Hallucination

  • Generate more factually reliable answers when provided external documents
  • Easy verification of statements due to citations

127 of 191

LLM Groundedness

Grounded response:

  • Refuse to answer questions it does not have adequate information for
  • Correctly answers question using only information from the documents
  • Inline citations to the documents to support generated answers

128 of 191

Previous Works

Evaluation

  • Current RAG evaluations (ALCE, ALiiCE) focus on overall system performance, conflating the effects of retriever quality and LLM performance in the metric scores.
    • There is a need for new ways to measure LLM effectiveness in RAG systems without the influence of the retriever.
  • NoMIRACL analyzes the refusal capabilities of LLMs in a RAG context but lacks holistic evaluation, as it does not account for both response and citation groundedness.

Mitigation

  • AGREE, CaLM, FRONT propose frameworks to improve LLM response groundedness but overlook refusal behaviors in their metrics.
    • Ignoring refusal behaviors, retriever influence, citation and answer groundedness weakens the ability of current metrics to effectively measure LLM performance in RAG.

129 of 191

Key Contributions

  • TRUST-SCORE comprehensively evaluates LLM performance, including refusal, citation, and answer groundedness
  • TRUST-ALIGN creates a corresponding alignment dataset, making the metric and approach more holistic for LLM evaluation and alignment in RAG

130 of 191

TRUST-SCORE

Assesses an LLM across multiple dimensions:

1) Grounded Refusals: is the model able to discern which questions can be answered or refused based on the provided documents?

2) Exact Match scores: For the answerable questions, is the response correct?

3) Citation recall: Are the generated statements supported by the corresponding citations?

4) Citation precision: Are the citations to the statements relevant?
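A rough sketch of how these components could be tracked per example (field names and the handling of refusals below are assumptions; the actual TRUST-SCORE definition and aggregation are given in the paper):

```python
def trust_score_components(pred, gold):
    """pred: {'refused': bool, 'answer': str | None, 'citation_recall': float, 'citation_precision': float}
    gold: {'should_refuse': bool, 'answer': str | None}"""
    scores = {
        # 1) Grounded refusal: refuse exactly when the documents cannot support an answer
        "grounded_refusal": float(pred["refused"] == gold["should_refuse"]),
        # 3) / 4) Citation quality of the generated statements
        "citation_recall": pred["citation_recall"],
        "citation_precision": pred["citation_precision"],
    }
    # 2) Exact match is only defined for answerable questions
    if not gold["should_refuse"]:
        scores["exact_match"] = float(pred["answer"] == gold["answer"])
    return scores
```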

131 of 191

TRUST-SCORE

132 of 191

TRUST-ALIGN

  • We propose an alignment dataset consisting of 19K questions, documents, positive (preferred) responses, and negative (unpreferred) responses to enhance the groundedness of LLMs
  • Dataset covers a range of five LLM hallucination types—Inaccurate Answer, Over-Responsiveness, Excessive Refusal, Over-Citation, and Improper Citation

133 of 191

TRUST-ALIGN

134 of 191

Collecting Quality Questions

135 of 191

Collecting D’s

136 of 191

Augmenting (q,D) Set

137 of 191

Answerability Labelling

138 of 191

Details on Claim Document Mapping

139 of 191

Augmenting (q,D) Set

140 of 191

Obtaining r+ and r−

141 of 191

Obtaining r+ and r−

142 of 191

Effectiveness of Our Data Construction Approach

143 of 191

TRUST-ALIGN Boosts Trustworthiness of Models

144 of 191

TRUST-ALIGN Improves Models’ Refusal Capability

145 of 191

TRUST-ALIGN Enhances Models’ Citation Quality

146 of 191

Mixed Results on Exact Match Recall due to Models’ Usage of Parametric Knowledge

147 of 191

Models Aligned with DPO Generally Outperform those Trained with SFT

148 of 191

TRUST-ALIGN Generalizes across Model Families and Sizes

149 of 191

Importance Of Refusal Samples In Trust-Align

150 of 191

Improvements Generalize to Out-of-Domain Data

151 of 191

Studying Parametric Knowledge Access

Quantify how many unanswerable questions were answered correctly

152 of 191

Revised Metrics Are Less Biased

Reduction in performance gap

153 of 191

Revised Metrics Are Less Biased

Revealing our model’s stronger performance as compared to baseline

154 of 191

Key Findings

  • TRUST-ALIGN boosts trustworthiness of models.
  • TRUST-ALIGN improves models’ refusal capability
  • TRUST-ALIGN enhances models’ citation quality.
  • Models aligned with DPO generally outperform those trained with SFT
  • TRUST-ALIGN generalizes across model families and sizes.

155 of 191

Paper:

Codebase:

156 of 191

157 of 191

Introduction: Knowledge-Intensive Tasks

158 of 191

Introduction: Chain-of-Thought

159 of 191

Introduction: Retrieval-Augmented LLMs

160 of 191

Introduction: Synergizing Reasoning, Retrieval, Correction

161 of 191

Overview

  • Motivation
    • Reduce LLM hallucination through knowledge retrieval
    • Diverse knowledge sources, both general and domain-specific, structured and unstructured
    • Error propagation of factual mistakes is common in multi-hop questions
  • Proposal
    • Chain-of-knowledge (CoK): Framework to augment LLMs with knowledge sources
    • Adaptive query generator: Module to help LLMs retrieve from diverse sources
    • CoK employs iterative retrieval and correction to mitigate error propagation
  • Findings
    • Outperforms previous methods in multiple domains (factual, medical, physical etc)

162 of 191

Framework: Reasoning Generation & Domain Selection

Relevant domains: Factual (Wikidata, Wikipedia)

163 of 191

Framework: Iterative Retrieval and Correction

164 of 191

Framework: Adaptive Query Generation

165 of 191

Diverse Query Examples

166 of 191

Main Results

167 of 191

Analysis: Effect of Multiple Knowledge Sources

168 of 191

Analysis: Factuality of Rationales

169 of 191

Takeaways

  • Heterogenous knowledge retrieval
    • Is promising to improve factuality in multiple domains
    • Requires query generators for specialized domains
  • Chain-of-knowledge
    • Synergizes reasoning, retrieval, and correction
    • Employs iterative correction to mitigate error propagation
  • Future work
    • Can integrate new modalities such as images
    • Can further improve the effectiveness of domain-specific retrieval

170 of 191

Safety

Harmlessness

171 of 191

Safety issues with LLMs

  1. Aligned LLMs can be red-teamed through just prompting.
  2. Training LLMs on just 100 samples makes them vulnerable.

172 of 191

Ferret: Motivation

  • Can we automatically test how vulnerable LLMs are?
  • Can prompts be diverse?
  • Can we make automated red-teaming faster?
  • Can we adapt to mixture of adversarial styles?

173 of 191

Ferret: Methodology

Pala et al. Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique. Arxiv 2024

174 of 191

Ferret: Main Results

  • All Ferret variants outperform baselines on Llama Guard 2 ASR.
  • The reward-model scoring function shows consistent performance across risk categories.
  • The reward-model scoring function shows greater alignment with Llama Guard 2 and GPT-4 ASR.

175 of 191

Ferret: Analysis

  • Ferret (RM) achieves the ASR threshold faster than Rainbow Teaming (+CF)

176 of 191

Language Models are Homer Simpson!

177 of 191

178 of 191

The solution is quite simple!

Bhardwaj et al. Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic. ACL 2024.

179 of 191

180 of 191

Summary of the results

181 of 191

182 of 191

Side-effects

183 of 191

More Generalized Version

184 of 191

Safety Arithmetic

Hazra et al. Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations. EMNLP 2024.

185 of 191

Understanding Safe Align

  • Start with a few exemplars
  • The solution to this equation is the PCA of …
  • Add the ICV to the latent states
  • Take the latent vectors from unsafe toward safe (see the sketch below)
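A minimal sketch of the in-context vector (ICV) steering idea described above: take latent states for paired unsafe/safe exemplars, use the dominant direction of their differences (via PCA/SVD) as the steering vector, and add it to latent states at inference. Which layer is steered and how the scale alpha is chosen are assumptions, not the exact Safety Arithmetic implementation:

```python
import torch

def safety_icv(hidden_unsafe, hidden_safe):
    """hidden_*: (n_exemplars, dim) latent states from paired unsafe/safe exemplars.
    Returns the dominant direction of the (safe - unsafe) differences."""
    diffs = hidden_safe - hidden_unsafe                      # one difference vector per exemplar pair
    _, _, vh = torch.linalg.svd(diffs, full_matrices=False)  # top right-singular vector ~ leading PC
    return vh[0]

def steer(hidden_states, icv, alpha=1.0):
    """Add the in-context (safety) vector to latent states during decoding."""
    return hidden_states + alpha * icv
```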

186 of 191

Simple yet Effective Solution!

[Results figure: Base vs. SFT models; WM for WizardMath, LM for LlamaMath, and EC for EvolCodeAlpaca; lower is better]

187 of 191

WalledEval

A Comprehensive Safety Evaluation Toolkit for Large Language Models

188 of 191

Safety vs Refusal

(exag-safety)

189 of 191

Multilingual Safety

190 of 191

LLM Benchmarking: Numbers on the left for the first four datasets indicate the percentage of safe responses to unsafe prompts, referred to as harmful behavior (Judge: LlamaGuard 2). Numbers on the right represent the percentage of instances where the LLM correctly chooses to refuse (for unsafe prompts) or accept (for safe prompts), referred to as refusal behavior (Judge: MCQJudge). Green, yellow, and red colors denote the highest, second highest, and lowest scores in the columns, respectively. XSTest (Mutated) refers to XSTest_m.

Judge Benchmarking

191 of 191

Thank you!