1 of 67

LLM Observability, Evaluation, & Guardrails

Making AI Work

| We Make Models Work

2 of 67

Bad LLM Responses Lead to Real Business Impact

| We Make Models Work

3 of 67

ARIZE DEPLOYED AT THE WORLD’S TOP ENTERPRISES

| We Make Models Work

4 of 67

The Emerging LLM Toolchain

AI Memory

Orchestration

Protect & Monitor

Proprietary Data

and LLM Memory

LLM App Frameworks

Production Guardrails & Troubleshooting

LLMs

3rd Party

Models

Evaluation

Testing & Debugging

Google

Vertex AI

Microsoft AutoGen

OSS by Arize

OBSERVABILITY

Azure AI Studio

| We Make Models Work

5 of 67

Common Pains faced bringing

LLM Apps to Production

Early Development

Get something working quickly

Evaluate & Iterate

Benchmark evals, iterate on performance

Production

App is launched, actively collect telemetry and online evaluation data

Improve

Curate data and feedback to refine performance

Don’t know effects of making a change

No proactive protection against jailbreaks, PII leaks, bad performance

Where to focus improvements for performance once live

No visibility into what’s happening behind opaque calls

CI/CD

| We Make Models Work

Early development

A single line of code in a framework can have 100s of calls underneath, unclear what they are doing

iterate

You got it working on an example or two, now you want to run prompts on more data because you don’t trust it to be deterministic/work on every case you might see in the real world

You need to collect example/ (trace data) from development stage and store it somewhere
You need to create different iterations of your prompts and experiment with them. You’ll want to store this too.

Figure out how to evaluate and this is really hard, you don’t know what evaluation metric / how to benchmark

Not just prompt templates, but also on eval templates

Production:

Jailbreaks can be costly for the business

Prevent prompt injections and jailbreaks - guardrails

Evaluating online is hard

Run evals online at scale and monitor

Improve:

Figuring out what data to focus on
Need ways to collect feedback

Human, llm, code

6 of 67

From Development through Production: Build Better AI with Arize

Early Development

Get something working quickly

Evaluate & Iterate

Benchmark evals, iterate on performance

Production

App is launched, actively collect telemetry and online evaluation data

Improve

Curate data and feedback to refine performance

CI/CD

Evaluation

Prompt Playground

Guardrails

Online Evals

Similarity Search

Annotations

Traces

Datasets

Experiments

Monitoring & Dashboards

Copilot

Embeddings Analyzer

| We Make Models Work

Early development

A single line of code in a framework can have 100s of calls underneath, unclear what they are doing

iterate

You got it working on an example or two, now you want to run prompts on more data because you don’t trust it to be deterministic/work on every case you might see in the real world

You need to collect example/ (trace data) from development stage and store it somewhere
You need to create different iterations of your prompts and experiment with them. You’ll want to store this too.

Figure out how to evaluate and this is really hard, you don’t know what evaluation metric / how to benchmark

Not just prompt templates, but also on eval templates

Production:

Jailbreaks can be costly for the business

Prevent prompt injections and jailbreaks - guardrails

Evaluating online is hard

Run evals online at scale and monitor

Improve:

Figuring out what data to focus on
Need ways to collect feedback

Human, llm, code

7 of 67

The LLM Application Development Lifecycle

Copilots

Computer vision

Search/

recomm.

RAG

Develop & Test

Build & experiment

Evaluate

Measure

Trace

Track system

Monitor

Surface issues

Agents

Voice

assistants

Copilots

Computer vision

Search/

recomm.

RAG

Develop, Evaluate, Iterate

Agents

Voice

assistants

| We Make AI Work

8 of 67

LLM Observability

9 of 67

Observability, Traces, and Spans

Observability - complete visibility into every layer of an LLM-based software system: the application, the prompt, and the response.
Traces - telemetry data on full calls to an LLM app or pipeline. Made up of a series of spans.
Spans - telemetry data captured on individual steps in an LLM app or pipeline.

| We Make Models Work

10 of 67

Examples - Traces

| We Make Models Work

11 of 67

Examples - Spans

| We Make Models Work

12 of 67

How Traces are Captured

OpenTelemetry�High-quality, ubiquitous, and portable telemetry to enable effective observability

Used across all kinds of applications

Open-source, vendor-agnostic

| We Make Models Work

13 of 67

How Traces are Captured

OpenInference

�LLM-specific schema and conventions built on top of OpenTelemetry

Used specifically for LLM pipelines

Open-source, vendor agnostic

| We Make Models Work

14 of 67

How Traces are Captured

OTEL Endpoint

UI to view and perform actions with your traces�

| We Make Models Work

15 of 67

How Traces are Captured

COLLECTOR

OTEL PROCESSOR

| We Make Models Work

16 of 67

Agnostic of Frameworks

Edit 1 file, 1 line of code: Change collector destination

Proprietary

instrumentation

@otel

Standardized

@otel

Proprietary

@tracer

Potentially instrumenting 100s–1,000s files to change

Proprietary instrumentation is framework lock in

| We Make Models Work

17 of 67

Arize Phoenix: Open-Source LLM Tracing & Evaluation

Give Us a Star 🌟

| We Make Models Work

18 of 67

LLM Evaluation

19 of 67

Evaluation

| We Make Models Work

20 of 67

LLM Evaluations

LLM as a Judge Evals

Hallucination, summarization, qa

Ground Truth Evals

F1, rouge scores, similarity

Dataset Analysis

Embedding visualizations, annotations

| We Make Models Work

21 of 67

Assertion-based & Ground-truth Evals

Comparing responses against a ground truth set of data:

Accuracy, Precision, Recall, F1 Scores
AUC-ROC
Rouge 1 scores
Similarity scores

Computed metrics:

Perplexity
Entropy
KL-Divergence

| We Make Models Work

22 of 67

LLM as a Judge Evaluations

EVAL PROMPT TEMPLATE

<<template: hallucination>>

Is the {response} using the {context} to answer the {query}

APP IN

PRODUCTION

RESPONSE

{response} Hi, I’m happy to help you plan…

EVALUATOR LLM

PROMPT

QUERY I want to go…

CONTEXT User booking history, hotel inventory…

INSTRUCTIONS You are a friendly travel assistant…

EVAL

No

Did it hallucinate?

EVAL

| We Make Models Work

23 of 67

How do Evals work? (LLM-As-A-Judge)

Eval

Example

“relevant”

“irrelevant”

span

retrieval span

span

Phoenix Library

Model Params

Eval LLM

Eval Template

Example: Retrieval

retrieval span

Span we want to evaluate

Output

User Query

Input

Documents

Eval Template

You are comparing a reference text to a question and trying to determine if the reference text contains information relevant to answering the question. Here is the data:

[BEGIN DATA]

************

[Question]: {query}

************

[Reference text]: {reference}

[END DATA]

Compare the Question above to the Reference text. Determine whether the Reference text contains information that can answer the Question.

| We Make Models Work

24 of 67

Arize Evals Differentiation: LLM as a Judge Framework

Most Comprehensive Library – fast (parallel calls, rate limiting, backoff over batch data), LLM-as-a-judge, works with all common LLMs
Explanations – generate with single flag
Custom evals – use your own templates
Rails – control and standardize with your criteria
RAG support – most extensive available in market
Tracing - Tight tie to tracing

from phoenix.experimental.evals import (

HallucinationEvaluator,

QAEvaluator,

RelevanceEvaluator,

)

| We Make Models Work

25 of 67

Only the Best Models are Good Judges

Of 9 LLM Judges, only GPT-4 Turbo and Llama-3 70B showed very high alignment with humans

GPT-4 Turbo is 12 points behind human judgment

| We Make Models Work

26 of 67

Dataset Analysis

I want to plan a family vacation…

I need accomodations…

Find me flights for…

Plan a 10 day trip to…

I want a refund on…

House rental for 15 guests…

Best Italian restaurants in…

Booking with reward points…

Going to London for…

This didn’t work for me…

Local restaurants in..

I want my money back…

AI Search

Spans

Embeddings Similarity Search

OR

Search Based on Human Understanding

Search Based on Similar Examples

“Find examples where the user is frustrated and mentions refund or return”

“Find queries with attempted jail breaks”

Refund my account…

I’m not satisfied with this…

This is not acceptable…

I shouldn’t be charged for this…

Human / Automated Annotations

Refund my account…

I’m not satisfied with this…

This is not acceptable…

I shouldn’t be charged for this…

Refund requested

Frustrated user

Refund requested

Frustrated user

| We Make Models Work

27 of 67

Embedding Analysis

| We Make Models Work

28 of 67

Visualize Traces in Phoenix

| We Make Models Work

29 of 67

Guardrails

30 of 67

Input

DETECTIONS

ACTIONS

Output

Arize Guardrails Use Dynamic Data to Protect

01

Embeddings guards

02

Few shot LLM prompt

03

LLM Evals

• PII

• User Frustration

• Hallucination

Your datasets

(PII, Jailbreaks, User frustration)

Block

Retry

Default answer

| We Make Models Work

31 of 67

LLM

Evaluation

Embedding

Dataset

Options for Guardrails

Few Shot LLM Prompt From Dataset

Advantage: Block on Evals

Advantage: Continuous Iteration on Breaks

Advantages: Completely Custom

| We Make Models Work

32 of 67

Evaluating RAG

33 of 67

Private Knowledge Sources

Unstructured & Structured sources: Pdfs, Google docs, MQL data, Mp4, Slack, etc

Retrieval-Augmented Generation (RAG)

Prompt

LLM

Output �Specific, well informed answer

Indexing

No context user query

Retrieval

Pre retrieval process

Post

retrieval process

| We Make Models Work

34 of 67

Retrieval-Augmented Generation (RAG)

Retrieval Evals

Are the right chunks retrieved?

Common evals:

Relevancy
Semantic similarity
Ranking (nDCG)
Hit Rate
Precision

Response Evals

Does the LLM generate the right response?

Common evals:

Faithfulness
Relevancy
Hallucination

QA Correctness

| We Make Models Work

35 of 67

Q&A is an End to End Evaluation Metric

| We Make Models Work

36 of 67

Deeper Retrieval Evaluation Metrics

To measure the effectiveness of your top ranked documents.

Takes into account the position of relevant docs.

% of queries that have relevant context.

Hit is a binary metric (relevant document was or wasn’t retrieved)

Precision = % relevant documents, up to ‘K’ retrieved documents.

Precision@3 = 33%, if 1 out 3 docs is relevant.

Assess the accuracy and relevance of the documents that were retrieved

nDCG

Hit Rate

Precision @K

| We Make Models Work

37 of 67

Evaluating Relevance of Each Chunk

| We Make Models Work

38 of 67

Context versus Relevance

Query	Context Returned 1	Context Returned 2	Relevance 1	Relevance 2	Response	Response Eval
Do you need a prediction ID for the training set?	A `prediction ID` is an ID that indicates a unique prediction event. A prediction ID is required to connect predictions with delayed actuals (ground truth). Learn how to send delayed (latent) actuals here.	Ensure Training and Validation records must include both prediction and actual columns	Relevant	Relevant	Yes, a prediction ID is required for the training set.
How do I configure permissions for GBQ?	There are two ways to setup access permissions with Arize Configure An Indiviaual Bucket PolicyGive Arize permission to access individual buckets#configure-an-individual-bucket-policyConfigure Multiple Buckets Via Role Based PermissionsAssign Arize a role to access multiple buckets using external IDs#configure-multiple-buckets-via-role-based-permissions	1. In Google Cloud console: Navigate to the BigQuery SQL Workspace 2. Select the desired table or view, navigate to the Details tab and click "Edit Details". Under the Labels section, click "Add Labels". Add the following label: * Key as "arize-ingestion-key_"_ …	Irrelevant	Relevant	To configure permissions for Google BigQuery (GBQ), you can either configure an individual bucket policy or assign Arize a role to access multiple buckets using external IDs.
Can I copy a dashboard?	Templates are designed as starting points for dashboard and model analysis. Once a dashboard is created from a template, it can be edited and customized as desired.	To add a widget, simply: * Click the Edit Dashboard icon in the top right corner. * Select or drag and drop the widget onto an area of the dashboard.	Irrelevant	Irrelevant	Yes, you can copy a dashboard. To do so, click the Edit Dashboard icon in the top right corner and select the Copy Dashboard option.

| We Make Models Work

39 of 67

Improving your RAG application

Is the response correct?

Great!

Are the retrieved documents relevant?

Response Issue

How to fix:

Prompt engineer
Swap models
Fine-tune

Are there relevant documents in the corpus?

Database is missing information

Retrieval Issue

How to fix:

Experiment with different chunking techniques
Test different embedding models
Improve text extraction techniques

| We Make Models Work

40 of 67

How to Run Sweep Experiments for Retrieval Setup

K size

Retrieval Approach

Script: https://github.com/Arize-ai/phoenix/tree/main/scripts/rag

Chunk Size

| We Make Models Work

41 of 67

Evaluating Agents

42 of 67

Components of an Agent

SKILLS / EXECUTION BRANCHES

The logic chains that do the actual work

E.g. an SQL query skill or RAG retriever

ROUTER

Optional component that decides which next step the agent will take

MEMORY

A shared memory state that can be accessed by each different component

| We Make Models Work

43 of 67

Form – LLM, simple NLP classifier, or even rules-based code
Not used by all agents

ROUTER

SKILLS BRANCHES

MEMORY

SKILL 1

ROUTER

USER QUERY

SKILL 2

Determines which skill or function to call to respond to the user’s query

| We Make Models Work

44 of 67

Made up of Components
Each agent will have one or more

ROUTER

SKILLS BRANCHES

MEMORY

Individual logic blocks and chains that can complete a task

Product�Search

Unstructured to Structured

Search API

Compare�Products

LLM calls

API calls

Application code

| We Make Models Work

45 of 67

Importance – Many LLM APIs rely on receiving each agent step to decide on the next step
Used to store retrieved context, config variables, previous execution steps

messages = []

messages.append({"role": "system", "content": "You are a helpful customer support assistant. Use the supplied tools to assist the user."})

messages.append({"role": "user", "content": "Hi, can you tell me the delivery date for my order?"})

messages.append({"role": "assistant", "content": "Hi there! I can help with that. Can you please provide your order ID?"})

messages.append({"role": "user", "content": "i think it is order_12345"})

ROUTER

SKILLS BRANCHES

MEMORY

Shared state that can be accessed by each component in the agent

| We Make Models Work

46 of 67

Example Agent: Ecommerce Chat to Purchase

USER INPUT

Item Search

LLM

47 of 67

Getting Agents to Work is HARD

→ This is why LLM Evals are critical

LLMs are non-deterministic
A bad response upstream leads to a strange response downstream
Agents can take inefficient paths and still get to the right solution
Frameworks make building easier, but debugging harder

| We Make Models Work

48 of 67

Parts of an Agent You Need to Evaluate

The Router

Function choice and parameter extraction

The Skills / Functions

Can use standard LLM evaluations

The Path

The most challenging to evaluate at scale

| We Make Models Work

49 of 67

Evaluating a Router

Am I using the right skill correctly?

Skills

AGENT ROUTER

A

Parameter Extraction

Function Generation

B

C

| We Make Models Work

50 of 67

Evaluating Skills with Standard LLM & Code Evals

For RAG Skills:

Retrieval Relevance
QA Correctness
Hallucination
Reference / Citation

For Code-Gen Skills:

Code readability
Code correctness

For API Skills:

Code-based integration tests and unit tests

For All skills:

Comparison against ground truth data

| We Make Models Work

51 of 67

Evaluating Convergence

Calculating Convergence:

Run your agent on a set of similar queries
Record the number of steps taken for each
Calculate the convergence score:

avg(minimum steps taken for this query / steps in the run)

Agent

step 1

Step count++

Problematic Node?

Agent

step 2

Agent

step 3

Agent

step 100

Step count++

Problematic Node?

Step count++

Problematic Node?

Step count++

Problematic Node?

| We Make Models Work

52 of 67

Visit us at arize.com

Thank you!

Give Phoenix a star! 🤩

phoenix.arize.com

| We Make Models Work

53 of 67

Appendix

| We Make Models Work

54 of 67

Phoenix is:

Open-source
Built on top of open telemetry and open inference protocols
Self-hostable or accessible via cloud
100% free

Phoenix provides:

Dozens of one-line automatic tracing integrations with popular frameworks
Evaluation-agnostic scaffolding - run any types of evals you want, analyze them in Phoenix
Experimentation, human annotation, prompt versioning, and much more

| We Make Models Work

55 of 67

Evaluation Driven Development

56 of 67

Evaluation Driven Development is a Continuous Process

Examples

Curate Dataset

Track Changes as an Experiment

(Model, Prompt, Retriever)

Evaluate the Experiment

New Output

Score

0.8

You’re a helpful assistant. When user asks about return policy respond with {vars} …

LLM APPS REQUIRE ITERATIVE PERFORMANCE IMPROVEMENTS

| We Make Models Work

57 of 67

Curate a Dataset of Test Cases

| We Make Models Work

58 of 67

Define Evaluators

LLM-as-a-Judge

Assertion

| We Make Models Work

59 of 67

Iterate on your task

| We Make Models Work

60 of 67

Run your experiment, iterate, repeat

| We Make Models Work

61 of 67

Why this Approach?

Eval runs – benchmark and test in development prior to production
Cheaper, can sample – more cost and time efficient than human evaluation
Build confidence – better than going into production with no metrics at all

3. Evaluate performance on benchmark

I want to go…

Hotel inventory, local sites…

No

QUERY

CONTEXT

GROUND TRUTH

Did it hallucinate?

LLM-GENERATED LABEL

I’m planning…

Booking history, hotel inventory…

No

Where should…

Local restaurants…

Yes

My wife and I…

Hotel inventory, local sites…

No

I want to see…

Local event calendar…

Yes

No

Yes

No

2. Run through eval prompt template

QUERY

CONTEXT

GROUND TRUTH

I want to go…

Hotel inventory, local sites…

No

Did it hallucinate?

1. Create golden dataset

HALLUCINATION EVAL

PRECISION

RECALL

F1

0.94

0.75

0.83

RESPONSE

Hi, happy to share a few…

The beaches of Positano are…

The beaches of Narnia are…

Couples often love visiting…

There’s a historical site…

| We Make Models Work

62 of 67

Testing As You Build

Use cases

Experiment traces

Eval traces

You are an assistant debugging RAG, investigate the retrieved results and evals…

PROMPT CHANGE

GITHUB ACTION TEST

| We Make Models Work

63 of 67

Document Chunk Embedding	Document Chunk ID	Document Chunk
<1, 1, 2, 4>	1
<100, 309, 4, 7>	2
<59, 71, 73, 95>	3

Knowledge base of articles

Do you support international calling?

User query

Query embedding

<1, 2, 3, 4>

0.8

Cosine Similarity

From Lookup

0.4

0.1

How LLM Search

& Retrieval Works

Prompt

User is asking “Do you support international calling?”

Here’s relevant content. Can you answer?

1

2

3

4

| We Make Models Work

Let’s just start off how search and retrieval works. I’m sure you’re familiar with a lot of this, or at least starting to familiarize yourself with it, so let’s quickly cover the necessary steps.

First off you’ve got your documents stored in a knowledge base or vector DB.

The user asks a question, which is turned into a query embedding.

Then the lookup or retrieval happens. Each document you have stored is chunked and has an embedding associated with it.

During retrieval, it finds the document chunk embedding closest / most similar to the query embedding. With cosine similarity, the closer to it is to 1, the more similar it is.

That document or context is then added to the prompt, in order to produce a response.

While all these steps happen quickly, we often see issues occurring at each step of the process.

64 of 67

LLM Judges align well with human rankings

Judge models may not assign identical scores as humans, but are more aligned in ranking the exam taker models

All judges struggle to distinguish between the poor-performing exam-taker models

contains demonstrates the highest alignment with the human ranking, swapping the ranks of only 2 of 9 models

| We Make Models Work

65 of 67

Position bias

Shuffle the order of references in LLM Judge input [question, references, model response]

LLM Judge is more likely to evaluate an answer as correct if the corresponding reference appears early in the list of references

Larger judge models consistently maintain their judgments regardless of the reference order

| We Make Models Work

66 of 67

Small models ignore the reference

The smaller judge models use their own knowledge rather than going by the references

Fail to capture all the information in the prompt

| We Make Models Work

67 of 67

LLM Judges have a leniency bias

LLMs tend to judge positively when in doubt

This is more pronounced for small models than for larger ones

Definitions:

Judge assigns the correct judgment with a probability of Pc
Judge assigns the rest of the samples to be “correct” with a probability P+ (leniency bias)

Estimate the values of Pc and P+ from the benchmark results

| We Make Models Work