LLM Observability, Evaluation, & Guardrails
Making AI Work
© All Rights Reserved
| We Make Models Work
Bad LLM Responses Lead to Real Business Impact
© All Rights Reserved
| We Make Models Work
ARIZE DEPLOYED AT THE WORLD’S TOP ENTERPRISES
© All Rights Reserved
| We Make Models Work
The Emerging LLM Toolchain
AI Memory
Orchestration
Protect & Monitor
Proprietary Data
and LLM Memory
LLM App Frameworks
Production Guardrails & Troubleshooting
LLMs
3rd Party
Models
Evaluation
Testing & Debugging
Vertex AI
Microsoft AutoGen
OSS by Arize
OBSERVABILITY
Azure AI Studio
© All Rights Reserved
| We Make Models Work
Common Pains faced bringing
LLM Apps to Production
Early Development
Get something working quickly
Evaluate & Iterate
Benchmark evals, iterate on performance
Production
App is launched, actively collect telemetry and online evaluation data
Improve
Curate data and feedback to refine performance
Don’t know effects of making a change
No proactive protection against jailbreaks, PII leaks, bad performance
Where to focus improvements for performance once live
No visibility into what’s happening behind opaque calls
CI/CD
© All Rights Reserved
| We Make Models Work
From Development through Production: Build Better AI with Arize
Early Development
Get something working quickly
Evaluate & Iterate
Benchmark evals, iterate on performance
Production
App is launched, actively collect telemetry and online evaluation data
Improve
Curate data and feedback to refine performance
CI/CD
Evaluation
Prompt Playground
Guardrails
Online Evals
Similarity Search
Annotations
Traces
Datasets
Experiments
Monitoring & Dashboards
Copilot
Embeddings Analyzer
© All Rights Reserved
| We Make Models Work
The LLM Application Development Lifecycle
Copilots
Computer vision
Search/
recomm.
RAG
Develop & Test
Build & experiment
Evaluate
Measure
Trace
Track system
Monitor
Surface issues
Agents
Voice
assistants
Copilots
Computer vision
Search/
recomm.
RAG
Develop, Evaluate, Iterate
Agents
Voice
assistants
© All Rights Reserved
| We Make AI Work
LLM Observability
Observability, Traces, and Spans
© All Rights Reserved
| We Make Models Work
Examples - Traces
© All Rights Reserved
| We Make Models Work
Examples - Spans
© All Rights Reserved
| We Make Models Work
How Traces are Captured
OpenTelemetry�High-quality, ubiquitous, and portable telemetry to enable effective observability
Used across all kinds of applications
Open-source, vendor-agnostic
© All Rights Reserved
| We Make Models Work
How Traces are Captured
OpenInference
�LLM-specific schema and conventions built on top of OpenTelemetry
Used specifically for LLM pipelines
Open-source, vendor agnostic
© All Rights Reserved
| We Make Models Work
How Traces are Captured
OTEL Endpoint
UI to view and perform actions with your traces�
© All Rights Reserved
| We Make Models Work
How Traces are Captured
COLLECTOR
OTEL PROCESSOR
© All Rights Reserved
| We Make Models Work
Agnostic of Frameworks
Edit 1 file, 1 line of code: Change collector destination
Proprietary
instrumentation
@otel
Standardized
@otel
@otel
@otel
@otel
@otel
@otel
Proprietary
@tracer
@tracer
@tracer
@tracer
@tracer
@tracer
@tracer
Potentially instrumenting 100s–1,000s files to change
Proprietary instrumentation is framework lock in
© All Rights Reserved
| We Make Models Work
Arize Phoenix: Open-Source LLM Tracing & Evaluation
Give Us a Star 🌟
© All Rights Reserved
| We Make Models Work
LLM Evaluation
Evaluation
© All Rights Reserved
| We Make Models Work
LLM Evaluations
LLM as a Judge Evals
Hallucination, summarization, qa
Ground Truth Evals
F1, rouge scores, similarity
Dataset Analysis
Embedding visualizations, annotations
© All Rights Reserved
| We Make Models Work
Assertion-based & Ground-truth Evals
Comparing responses against a ground truth set of data:
Computed metrics:
© All Rights Reserved
| We Make Models Work
LLM as a Judge Evaluations
EVAL PROMPT TEMPLATE
<<template: hallucination>>
Is the {response} using the {context} to answer the {query}
APP IN
PRODUCTION
RESPONSE
{response} Hi, I’m happy to help you plan…
EVALUATOR LLM
PROMPT
QUERY I want to go…
CONTEXT User booking history, hotel inventory…
INSTRUCTIONS You are a friendly travel assistant…
EVAL
No
Did it hallucinate?
EVAL
© All Rights Reserved
| We Make Models Work
How do Evals work? (LLM-As-A-Judge)
Eval
Example
“relevant”
“irrelevant”
span
span
retrieval span
span
Phoenix Library
Model Params
Eval LLM
Eval Template
Example: Retrieval
retrieval span
Span we want to evaluate
Output
User Query
Input
Documents
Eval Template
You are comparing a reference text to a question and trying to determine if the reference text contains information relevant to answering the question. Here is the data:
[BEGIN DATA]
************
[Question]: {query}
************
[Reference text]: {reference}
[END DATA]
Compare the Question above to the Reference text. Determine whether the Reference text contains information that can answer the Question.
© All Rights Reserved
| We Make Models Work
Arize Evals Differentiation: LLM as a Judge Framework
from phoenix.experimental.evals import (
HallucinationEvaluator,
QAEvaluator,
RelevanceEvaluator,
)
© All Rights Reserved
| We Make Models Work
Only the Best Models are Good Judges
Of 9 LLM Judges, only GPT-4 Turbo and Llama-3 70B showed very high alignment with humans
GPT-4 Turbo is 12 points behind human judgment
© All Rights Reserved
| We Make Models Work
Dataset Analysis
I want to plan a family vacation…
I need accomodations…
Find me flights for…
Plan a 10 day trip to…
I want a refund on…
House rental for 15 guests…
Best Italian restaurants in…
Booking with reward points…
Going to London for…
This didn’t work for me…
Local restaurants in..
I want my money back…
AI Search
Spans
Embeddings Similarity Search
OR
Search Based on Human Understanding
Search Based on Similar Examples
“Find examples where the user is frustrated and mentions refund or return”
“Find queries with attempted jail breaks”
Refund my account…
I’m not satisfied with this…
This is not acceptable…
I shouldn’t be charged for this…
Human / Automated Annotations
Refund my account…
I’m not satisfied with this…
This is not acceptable…
I shouldn’t be charged for this…
Refund requested
Frustrated user
Refund requested
Frustrated user
© All Rights Reserved
| We Make Models Work
Embedding Analysis
© All Rights Reserved
| We Make Models Work
Visualize Traces in Phoenix
© All Rights Reserved
| We Make Models Work
Guardrails
Input
DETECTIONS
ACTIONS
Output
Arize Guardrails Use Dynamic Data to Protect
01
Embeddings guards
02
Few shot LLM prompt
03
LLM Evals
• PII
• User Frustration
• Hallucination
Your datasets
(PII, Jailbreaks, User frustration)
Block
Retry
Default answer
© All Rights Reserved
| We Make Models Work
LLM
Evaluation
Embedding
Dataset
Options for Guardrails
Few Shot LLM Prompt From Dataset
Advantage: Block on Evals
Advantage: Continuous Iteration on Breaks
Advantages: Completely Custom
© All Rights Reserved
| We Make Models Work
Evaluating RAG
Private Knowledge Sources
Unstructured & Structured sources: Pdfs, Google docs, MQL data, Mp4, Slack, etc
Retrieval-Augmented Generation (RAG)
Prompt
LLM
Output �Specific, well informed answer
Indexing
No context user query
Retrieval
Pre retrieval process
Post
retrieval process
© All Rights Reserved
| We Make Models Work
Retrieval-Augmented Generation (RAG)
Retrieval Evals
Are the right chunks retrieved?
Common evals:
Response Evals
Does the LLM generate the right response?
Common evals:
QA Correctness
© All Rights Reserved
| We Make Models Work
Q&A is an End to End Evaluation Metric
© All Rights Reserved
| We Make Models Work
Deeper Retrieval Evaluation Metrics
To measure the effectiveness of your top ranked documents.
Takes into account the position of relevant docs.
% of queries that have relevant context.
Hit is a binary metric (relevant document was or wasn’t retrieved)
Precision = % relevant documents, up to ‘K’ retrieved documents.
Precision@3 = 33%, if 1 out 3 docs is relevant.
Assess the accuracy and relevance of the documents that were retrieved
nDCG
Hit Rate
Precision @K
© All Rights Reserved
| We Make Models Work
Evaluating Relevance of Each Chunk
© All Rights Reserved
| We Make Models Work
Context versus Relevance
Query | Context Returned 1 | Context Returned 2 | Relevance 1 | Relevance 2 | Response | Response Eval |
Do you need a prediction ID for the training set? | A `prediction ID` is an ID that indicates a unique prediction event. A prediction ID is **required** to connect predictions with delayed actuals (ground truth). Learn how to send delayed (latent) actuals here. | Ensure Training and Validation records must include **both** prediction and actual columns | Relevant | Relevant | Yes, a prediction ID is required for the training set. | |
How do I configure permissions for GBQ? | There are two ways to setup access permissions with Arize Configure An Indiviaual Bucket PolicyGive Arize permission to access individual buckets#configure-an-individual-bucket-policyConfigure Multiple Buckets Via Role Based PermissionsAssign Arize a role to access multiple buckets using external IDs#configure-multiple-buckets-via-role-based-permissions | 1. **In Google Cloud console**: Navigate to the BigQuery SQL Workspace 2. Select the desired table or view, navigate to the **Details** tab and click "Edit Details". Under the **Labels** section, click "Add Labels". Add the following label: * Key as "**arize-ingestion-key**_**"**_ … | Irrelevant | Relevant | To configure permissions for Google BigQuery (GBQ), you can either configure an individual bucket policy or assign Arize a role to access multiple buckets using external IDs. | |
Can I copy a dashboard? | Templates are designed as starting points for dashboard and model analysis. Once a dashboard is created from a template, it can be edited and customized as desired. | To add a widget, simply: * Click the Edit Dashboard icon in the top right corner. * Select or drag and drop the widget onto an area of the dashboard. | Irrelevant | Irrelevant | Yes, you can copy a dashboard. To do so, click the Edit Dashboard icon in the top right corner and select the Copy Dashboard option. | |
© All Rights Reserved
| We Make Models Work
Improving your RAG application
Is the response correct?
Great!
Are the retrieved documents relevant?
Response Issue
How to fix:
Are there relevant documents in the corpus?
Database is missing information
Retrieval Issue
How to fix:
© All Rights Reserved
| We Make Models Work
How to Run Sweep Experiments for Retrieval Setup
K size
Retrieval Approach
Script: https://github.com/Arize-ai/phoenix/tree/main/scripts/rag
Chunk Size
© All Rights Reserved
| We Make Models Work
Evaluating Agents
Components of an Agent
SKILLS / EXECUTION BRANCHES
The logic chains that do the actual work
E.g. an SQL query skill or RAG retriever
ROUTER
Optional component that decides which next step the agent will take
MEMORY
A shared memory state that can be accessed by each different component
© All Rights Reserved
| We Make Models Work
ROUTER
SKILLS BRANCHES
MEMORY
SKILL 1
ROUTER
USER QUERY
SKILL 2
Determines which skill or function to call to respond to the user’s query
© All Rights Reserved
| We Make Models Work
ROUTER
SKILLS BRANCHES
MEMORY
Individual logic blocks and chains that can complete a task
Product�Search
Unstructured to Structured
Search API
Compare�Products
LLM calls
API calls
Application code
© All Rights Reserved
| We Make Models Work
messages = []
messages.append({"role": "system", "content": "You are a helpful customer support assistant. Use the supplied tools to assist the user."})
messages.append({"role": "user", "content": "Hi, can you tell me the delivery date for my order?"})
messages.append({"role": "assistant", "content": "Hi there! I can help with that. Can you please provide your order ID?"})
messages.append({"role": "user", "content": "i think it is order_12345"})
ROUTER
SKILLS BRANCHES
MEMORY
Shared state that can be accessed by each component in the agent
© All Rights Reserved
| We Make Models Work
Example Agent: Ecommerce Chat to Purchase
USER INPUT
Item Search
LLM
Recommended Item
Deals
LLM
Query�Response
Are there any current discounts on Kindle e-Readers?
ROUTER
Skill Branch 1: Purchase
Skill Branch 2: Deals
MEMORY / STATE
© All Rights Reserved
| We Make Models Work
Getting Agents to Work is HARD
→ This is why LLM Evals are critical
© All Rights Reserved
| We Make Models Work
Parts of an Agent You Need to Evaluate
The Router
Function choice and parameter extraction
The Skills / Functions
Can use standard LLM evaluations
The Path
The most challenging to evaluate at scale
© All Rights Reserved
| We Make Models Work
Evaluating a Router
Am I using the right skill correctly?
Skills
AGENT ROUTER
A
Parameter Extraction
Function Generation
B
C
© All Rights Reserved
| We Make Models Work
Evaluating Skills with Standard LLM & Code Evals
For RAG Skills:
For Code-Gen Skills:
For API Skills:
For All skills:
© All Rights Reserved
| We Make Models Work
Evaluating Convergence
Calculating Convergence:
Agent
step 1
Step count++
Problematic Node?
Agent
step 2
Agent
step 3
Agent
step 100
Step count++
Problematic Node?
Step count++
Problematic Node?
Step count++
Problematic Node?
© All Rights Reserved
| We Make Models Work
Visit us at arize.com
Thank you!
Give Phoenix a star! 🤩
phoenix.arize.com
© All Rights Reserved
| We Make Models Work
Appendix
© All Rights Reserved
| We Make Models Work
Phoenix is:
Phoenix provides:
© All Rights Reserved
| We Make Models Work
Evaluation Driven Development
Evaluation Driven Development is a Continuous Process
Examples
Curate Dataset
Track Changes as an Experiment
(Model, Prompt, Retriever)
Evaluate the Experiment
New Output
Score
0.8
You’re a helpful assistant. When user asks about return policy respond with {vars} …
LLM APPS REQUIRE ITERATIVE PERFORMANCE IMPROVEMENTS
© All Rights Reserved
| We Make Models Work
Curate a Dataset of Test Cases
© All Rights Reserved
| We Make Models Work
Define Evaluators
LLM-as-a-Judge
Assertion
© All Rights Reserved
| We Make Models Work
Iterate on your task
© All Rights Reserved
| We Make Models Work
Run your experiment, iterate, repeat
© All Rights Reserved
| We Make Models Work
Why this Approach?
3. Evaluate performance on benchmark
I want to go…
Hotel inventory, local sites…
No
QUERY
CONTEXT
GROUND TRUTH
Did it hallucinate?
LLM-GENERATED LABEL
I’m planning…
Booking history, hotel inventory…
No
Where should…
Local restaurants…
Yes
My wife and I…
Hotel inventory, local sites…
No
I want to see…
Local event calendar…
Yes
No
No
Yes
No
No
2. Run through eval prompt template
QUERY
CONTEXT
GROUND TRUTH
I want to go…
Hotel inventory, local sites…
No
Did it hallucinate?
1. Create golden dataset
HALLUCINATION EVAL
PRECISION
RECALL
F1
0.94
0.75
0.83
RESPONSE
RESPONSE
Hi, happy to share a few…
Hi, happy to share a few…
The beaches of Positano are…
The beaches of Narnia are…
Couples often love visiting…
There’s a historical site…
© All Rights Reserved
| We Make Models Work
Testing As You Build
Use cases
Experiment traces
Eval traces
You are an assistant debugging RAG, investigate the retrieved results and evals…
PROMPT CHANGE
GITHUB ACTION TEST
© All Rights Reserved
| We Make Models Work
Document Chunk Embedding | Document Chunk ID | Document Chunk |
<1, 1, 2, 4> | 1 | |
<100, 309, 4, 7> | 2 | |
<59, 71, 73, 95> | 3 | |
Knowledge base of articles
Do you support international calling?
User query
Query embedding
<1, 2, 3, 4>
0.8
Cosine Similarity
From Lookup
0.4
0.1
How LLM Search
& Retrieval Works
Prompt
User is asking “Do you support international calling?”
Here’s relevant content. Can you answer?
1
2
3
4
© All Rights Reserved
| We Make Models Work
LLM Judges align well with human rankings
Judge models may not assign identical scores as humans, but are more aligned in ranking the exam taker models
All judges struggle to distinguish between the poor-performing exam-taker models
contains demonstrates the highest alignment with the human ranking, swapping the ranks of only 2 of 9 models
© All Rights Reserved
| We Make Models Work
Position bias
Shuffle the order of references in LLM Judge input [question, references, model response]
LLM Judge is more likely to evaluate an answer as correct if the corresponding reference appears early in the list of references
Larger judge models consistently maintain their judgments regardless of the reference order
© All Rights Reserved
| We Make Models Work
Small models ignore the reference
The smaller judge models use their own knowledge rather than going by the references
Fail to capture all the information in the prompt
© All Rights Reserved
| We Make Models Work
LLM Judges have a leniency bias
LLMs tend to judge positively when in doubt
This is more pronounced for small models than for larger ones
Definitions:
Estimate the values of Pc and P+ from the benchmark results
© All Rights Reserved
| We Make Models Work