1 of 67

LLM Observability, Evaluation, & Guardrails

Making AI Work

© All Rights Reserved

| We Make Models Work

2 of 67

Bad LLM Responses Lead to Real Business Impact

© All Rights Reserved

| We Make Models Work

3 of 67

ARIZE DEPLOYED AT THE WORLD’S TOP ENTERPRISES

© All Rights Reserved

| We Make Models Work

4 of 67

The Emerging LLM Toolchain

AI Memory

Orchestration

Protect & Monitor

Proprietary Data

and LLM Memory

LLM App Frameworks

Production Guardrails & Troubleshooting

LLMs

3rd Party

Models

Evaluation

Testing & Debugging

Google

Vertex AI

Microsoft AutoGen

OSS by Arize

OBSERVABILITY

Azure AI Studio

© All Rights Reserved

| We Make Models Work

5 of 67

Common Pains faced bringing

LLM Apps to Production

Early Development

Get something working quickly

Evaluate & Iterate

Benchmark evals, iterate on performance

Production

App is launched, actively collect telemetry and online evaluation data

Improve

Curate data and feedback to refine performance

Don’t know effects of making a change

No proactive protection against jailbreaks, PII leaks, bad performance

Where to focus improvements for performance once live

No visibility into what’s happening behind opaque calls

CI/CD

© All Rights Reserved

| We Make Models Work

6 of 67

From Development through Production: Build Better AI with Arize

Early Development

Get something working quickly

Evaluate & Iterate

Benchmark evals, iterate on performance

Production

App is launched, actively collect telemetry and online evaluation data

Improve

Curate data and feedback to refine performance

CI/CD

Evaluation

Prompt Playground

Guardrails

Online Evals

Similarity Search

Annotations

Traces

Datasets

Experiments

Monitoring & Dashboards

Copilot

Embeddings Analyzer

© All Rights Reserved

| We Make Models Work

7 of 67

The LLM Application Development Lifecycle

Copilots

Computer vision

Search/

recomm.

RAG

Develop & Test

Build & experiment

Evaluate

Measure

Trace

Track system

Monitor

Surface issues

Agents

Voice

assistants

Copilots

Computer vision

Search/

recomm.

RAG

Develop, Evaluate, Iterate

Agents

Voice

assistants

© All Rights Reserved

| We Make AI Work

8 of 67

LLM Observability

9 of 67

Observability, Traces, and Spans

  • Observability - complete visibility into every layer of an LLM-based software system: the application, the prompt, and the response.
  • Traces - telemetry data on full calls to an LLM app or pipeline. Made up of a series of spans.
  • Spans - telemetry data captured on individual steps in an LLM app or pipeline.

© All Rights Reserved

| We Make Models Work

10 of 67

Examples - Traces

© All Rights Reserved

| We Make Models Work

11 of 67

Examples - Spans

© All Rights Reserved

| We Make Models Work

12 of 67

How Traces are Captured

OpenTelemetry�High-quality, ubiquitous, and portable telemetry to enable effective observability

Used across all kinds of applications

Open-source, vendor-agnostic

© All Rights Reserved

| We Make Models Work

13 of 67

How Traces are Captured

OpenInference

�LLM-specific schema and conventions built on top of OpenTelemetry

Used specifically for LLM pipelines

Open-source, vendor agnostic

© All Rights Reserved

| We Make Models Work

14 of 67

How Traces are Captured

OTEL Endpoint

UI to view and perform actions with your traces�

© All Rights Reserved

| We Make Models Work

15 of 67

How Traces are Captured

COLLECTOR

OTEL PROCESSOR

© All Rights Reserved

| We Make Models Work

16 of 67

Agnostic of Frameworks

Edit 1 file, 1 line of code: Change collector destination

Proprietary

instrumentation

@otel

Standardized

@otel

@otel

@otel

@otel

@otel

@otel

Proprietary

@tracer

@tracer

@tracer

@tracer

@tracer

@tracer

@tracer

Potentially instrumenting 100s–1,000s files to change

Proprietary instrumentation is framework lock in

© All Rights Reserved

| We Make Models Work

17 of 67

Arize Phoenix: Open-Source LLM Tracing & Evaluation

Give Us a Star 🌟

© All Rights Reserved

| We Make Models Work

18 of 67

LLM Evaluation

19 of 67

Evaluation

© All Rights Reserved

| We Make Models Work

20 of 67

LLM Evaluations

LLM as a Judge Evals

Hallucination, summarization, qa

Ground Truth Evals

F1, rouge scores, similarity

Dataset Analysis

Embedding visualizations, annotations

© All Rights Reserved

| We Make Models Work

21 of 67

Assertion-based & Ground-truth Evals

Comparing responses against a ground truth set of data:

  • Accuracy, Precision, Recall, F1 Scores
  • AUC-ROC
  • Rouge 1 scores
  • Similarity scores

Computed metrics:

  • Perplexity
  • Entropy
  • KL-Divergence

© All Rights Reserved

| We Make Models Work

22 of 67

LLM as a Judge Evaluations

EVAL PROMPT TEMPLATE

<<template: hallucination>>

Is the {response} using the {context} to answer the {query}

APP IN

PRODUCTION

RESPONSE

{response} Hi, I’m happy to help you plan…

EVALUATOR LLM

PROMPT

QUERY I want to go…

CONTEXT User booking history, hotel inventory…

INSTRUCTIONS You are a friendly travel assistant…

EVAL

No

Did it hallucinate?

EVAL

© All Rights Reserved

| We Make Models Work

23 of 67

How do Evals work? (LLM-As-A-Judge)

Eval

Example

“relevant”

“irrelevant”

span

span

retrieval span

span

Phoenix Library

Model Params

Eval LLM

Eval Template

Example: Retrieval

retrieval span

Span we want to evaluate

Output

User Query

Input

Documents

Eval Template

You are comparing a reference text to a question and trying to determine if the reference text contains information relevant to answering the question. Here is the data:

[BEGIN DATA]

************

[Question]: {query}

************

[Reference text]: {reference}

[END DATA]

Compare the Question above to the Reference text. Determine whether the Reference text contains information that can answer the Question.

© All Rights Reserved

| We Make Models Work

24 of 67

Arize Evals Differentiation: LLM as a Judge Framework

  • Most Comprehensive Library – fast (parallel calls, rate limiting, backoff over batch data), LLM-as-a-judge, works with all common LLMs
  • Explanations – generate with single flag
  • Custom evals – use your own templates
  • Rails – control and standardize with your criteria
  • RAG support – most extensive available in market
  • Tracing - Tight tie to tracing

from phoenix.experimental.evals import (

HallucinationEvaluator,

QAEvaluator,

RelevanceEvaluator,

)

© All Rights Reserved

| We Make Models Work

25 of 67

Only the Best Models are Good Judges

Of 9 LLM Judges, only GPT-4 Turbo and Llama-3 70B showed very high alignment with humans

GPT-4 Turbo is 12 points behind human judgment

© All Rights Reserved

| We Make Models Work

26 of 67

Dataset Analysis

I want to plan a family vacation…

I need accomodations…

Find me flights for…

Plan a 10 day trip to…

I want a refund on…

House rental for 15 guests…

Best Italian restaurants in…

Booking with reward points…

Going to London for…

This didn’t work for me…

Local restaurants in..

I want my money back…

AI Search

Spans

Embeddings Similarity Search

OR

Search Based on Human Understanding

Search Based on Similar Examples

“Find examples where the user is frustrated and mentions refund or return”

“Find queries with attempted jail breaks”

Refund my account…

I’m not satisfied with this…

This is not acceptable…

I shouldn’t be charged for this…

Human / Automated Annotations

Refund my account…

I’m not satisfied with this…

This is not acceptable…

I shouldn’t be charged for this…

Refund requested

Frustrated user

Refund requested

Frustrated user

© All Rights Reserved

| We Make Models Work

27 of 67

Embedding Analysis

© All Rights Reserved

| We Make Models Work

28 of 67

Visualize Traces in Phoenix

© All Rights Reserved

| We Make Models Work

29 of 67

Guardrails

30 of 67

Input

DETECTIONS

ACTIONS

Output

Arize Guardrails Use Dynamic Data to Protect

01

Embeddings guards

02

Few shot LLM prompt

03

LLM Evals

PII

• User Frustration

• Hallucination

Your datasets

(PII, Jailbreaks, User frustration)

Block

Retry

Default answer

© All Rights Reserved

| We Make Models Work

31 of 67

LLM

Evaluation

Embedding

Dataset

Options for Guardrails

Few Shot LLM Prompt From Dataset

Advantage: Block on Evals

Advantage: Continuous Iteration on Breaks

Advantages: Completely Custom

© All Rights Reserved

| We Make Models Work

32 of 67

Evaluating RAG

33 of 67

Private Knowledge Sources

Unstructured & Structured sources: Pdfs, Google docs, MQL data, Mp4, Slack, etc

Retrieval-Augmented Generation (RAG)

Prompt

LLM

OutputSpecific, well informed answer

Indexing

No context user query

Retrieval

Pre retrieval process

Post

retrieval process

© All Rights Reserved

| We Make Models Work

34 of 67

Retrieval-Augmented Generation (RAG)

Retrieval Evals

Are the right chunks retrieved?

Common evals:

  • Relevancy
  • Semantic similarity
  • Ranking (nDCG)
  • Hit Rate
  • Precision

Response Evals

Does the LLM generate the right response?

Common evals:

  • Faithfulness
  • Relevancy
  • Hallucination

QA Correctness

© All Rights Reserved

| We Make Models Work

35 of 67

Q&A is an End to End Evaluation Metric

© All Rights Reserved

| We Make Models Work

36 of 67

Deeper Retrieval Evaluation Metrics

To measure the effectiveness of your top ranked documents.

Takes into account the position of relevant docs.

% of queries that have relevant context.

Hit is a binary metric (relevant document was or wasn’t retrieved)

Precision = % relevant documents, up to ‘K’ retrieved documents.

Precision@3 = 33%, if 1 out 3 docs is relevant.

Assess the accuracy and relevance of the documents that were retrieved

nDCG

Hit Rate

Precision @K

© All Rights Reserved

| We Make Models Work

37 of 67

Evaluating Relevance of Each Chunk

© All Rights Reserved

| We Make Models Work

38 of 67

Context versus Relevance

Query

Context Returned 1

Context Returned 2

Relevance 1

Relevance 2

Response

Response Eval

Do you need a prediction ID for the training set?

A `prediction ID` is an ID that indicates a unique prediction event. A prediction ID is **required** to connect predictions with delayed actuals (ground truth). Learn how to send delayed (latent) actuals here.

Ensure Training and Validation records must include **both** prediction and actual columns

Relevant

Relevant

Yes, a prediction ID is required for the training set.

How do I configure permissions for GBQ?

There are two ways to setup access permissions with Arize

Configure An Indiviaual Bucket PolicyGive Arize permission to access individual buckets#configure-an-individual-bucket-policyConfigure Multiple Buckets Via Role Based PermissionsAssign Arize a role to access multiple buckets using external IDs#configure-multiple-buckets-via-role-based-permissions

1. **In Google Cloud console**: Navigate to the BigQuery SQL Workspace

2. Select the desired table or view, navigate to the **Details** tab and click "Edit Details". Under the **Labels** section, click "Add Labels". Add the following label:

* Key as "**arize-ingestion-key**_**"**_

Irrelevant

Relevant

To configure permissions for Google BigQuery (GBQ), you can either configure an individual bucket policy or assign Arize a role to access multiple buckets using external IDs.

Can I copy a dashboard?

Templates are designed as starting points for dashboard and model analysis. Once a dashboard is created from a template, it can be edited and customized as desired.

To add a widget, simply:

* Click the Edit Dashboard icon in the top right corner.

* Select or drag and drop the widget onto an area of the dashboard.

Irrelevant

Irrelevant

Yes, you can copy a dashboard. To do so, click the Edit Dashboard icon in the top right corner and select the Copy Dashboard option.

© All Rights Reserved

| We Make Models Work

39 of 67

Improving your RAG application

Is the response correct?

Great!

Are the retrieved documents relevant?

Response Issue

How to fix:

  1. Prompt engineer
  2. Swap models
  3. Fine-tune

Are there relevant documents in the corpus?

Database is missing information

Retrieval Issue

How to fix:

  1. Experiment with different chunking techniques
  2. Test different embedding models
  3. Improve text extraction techniques

© All Rights Reserved

| We Make Models Work

40 of 67

How to Run Sweep Experiments for Retrieval Setup

K size

Retrieval Approach

Script: https://github.com/Arize-ai/phoenix/tree/main/scripts/rag

Chunk Size

© All Rights Reserved

| We Make Models Work

41 of 67

Evaluating Agents

42 of 67

Components of an Agent

SKILLS / EXECUTION BRANCHES

The logic chains that do the actual work

E.g. an SQL query skill or RAG retriever

ROUTER

Optional component that decides which next step the agent will take

MEMORY

A shared memory state that can be accessed by each different component

© All Rights Reserved

| We Make Models Work

43 of 67

  • Form – LLM, simple NLP classifier, or even rules-based code
  • Not used by all agents

ROUTER

SKILLS BRANCHES

MEMORY

SKILL 1

ROUTER

USER QUERY

SKILL 2

Determines which skill or function to call to respond to the user’s query

© All Rights Reserved

| We Make Models Work

44 of 67

  • Made up of Components
  • Each agent will have one or more

ROUTER

SKILLS BRANCHES

MEMORY

Individual logic blocks and chains that can complete a task

Product�Search

Unstructured to Structured

Search API

Compare�Products

LLM calls

API calls

Application code

© All Rights Reserved

| We Make Models Work

45 of 67

  • Importance – Many LLM APIs rely on receiving each agent step to decide on the next step
  • Used to store retrieved context, config variables, previous execution steps

messages = []

messages.append({"role": "system", "content": "You are a helpful customer support assistant. Use the supplied tools to assist the user."})

messages.append({"role": "user", "content": "Hi, can you tell me the delivery date for my order?"})

messages.append({"role": "assistant", "content": "Hi there! I can help with that. Can you please provide your order ID?"})

messages.append({"role": "user", "content": "i think it is order_12345"})

ROUTER

SKILLS BRANCHES

MEMORY

Shared state that can be accessed by each component in the agent

© All Rights Reserved

| We Make Models Work

46 of 67

Example Agent: Ecommerce Chat to Purchase

USER INPUT

Item Search

LLM

Recommended Item

Deals

LLM

Query�Response

Are there any current discounts on Kindle e-Readers?

ROUTER

Skill Branch 1: Purchase

Skill Branch 2: Deals

MEMORY / STATE

© All Rights Reserved

| We Make Models Work

47 of 67

Getting Agents to Work is HARD

→ This is why LLM Evals are critical

  • LLMs are non-deterministic
  • A bad response upstream leads to a strange response downstream
  • Agents can take inefficient paths and still get to the right solution
  • Frameworks make building easier, but debugging harder

© All Rights Reserved

| We Make Models Work

48 of 67

Parts of an Agent You Need to Evaluate

The Router

Function choice and parameter extraction

The Skills / Functions

Can use standard LLM evaluations

The Path

The most challenging to evaluate at scale

© All Rights Reserved

| We Make Models Work

49 of 67

Evaluating a Router

Am I using the right skill correctly?

Skills

AGENT ROUTER

A

Parameter Extraction

Function Generation

B

C

© All Rights Reserved

| We Make Models Work

50 of 67

Evaluating Skills with Standard LLM & Code Evals

For RAG Skills:

  • Retrieval Relevance
  • QA Correctness
  • Hallucination
  • Reference / Citation

For Code-Gen Skills:

  • Code readability
  • Code correctness

For API Skills:

  • Code-based integration tests and unit tests

For All skills:

  • Comparison against ground truth data

© All Rights Reserved

| We Make Models Work

51 of 67

Evaluating Convergence

Calculating Convergence:

  1. Run your agent on a set of similar queries
  2. Record the number of steps taken for each
  3. Calculate the convergence score:
    1. avg(minimum steps taken for this query / steps in the run)

Agent

step 1

Step count++

Problematic Node?

Agent

step 2

Agent

step 3

Agent

step 100

Step count++

Problematic Node?

Step count++

Problematic Node?

Step count++

Problematic Node?

© All Rights Reserved

| We Make Models Work

52 of 67

Visit us at arize.com

Thank you!

Give Phoenix a star! 🤩

phoenix.arize.com

© All Rights Reserved

| We Make Models Work

53 of 67

Appendix

© All Rights Reserved

| We Make Models Work

54 of 67

Phoenix is:

  • Open-source
  • Built on top of open telemetry and open inference protocols
  • Self-hostable or accessible via cloud
  • 100% free

Phoenix provides:

  • Dozens of one-line automatic tracing integrations with popular frameworks
  • Evaluation-agnostic scaffolding - run any types of evals you want, analyze them in Phoenix
  • Experimentation, human annotation, prompt versioning, and much more

© All Rights Reserved

| We Make Models Work

55 of 67

Evaluation Driven Development

56 of 67

Evaluation Driven Development is a Continuous Process

Examples

Curate Dataset

Track Changes as an Experiment

(Model, Prompt, Retriever)

Evaluate the Experiment

New Output

Score

0.8

You’re a helpful assistant. When user asks about return policy respond with {vars}

LLM APPS REQUIRE ITERATIVE PERFORMANCE IMPROVEMENTS

© All Rights Reserved

| We Make Models Work

57 of 67

Curate a Dataset of Test Cases

© All Rights Reserved

| We Make Models Work

58 of 67

Define Evaluators

LLM-as-a-Judge

Assertion

© All Rights Reserved

| We Make Models Work

59 of 67

Iterate on your task

© All Rights Reserved

| We Make Models Work

60 of 67

Run your experiment, iterate, repeat

© All Rights Reserved

| We Make Models Work

61 of 67

Why this Approach?

  • Eval runs – benchmark and test in development prior to production
  • Cheaper, can sample – more cost and time efficient than human evaluation
  • Build confidence – better than going into production with no metrics at all

3. Evaluate performance on benchmark

I want to go…

Hotel inventory, local sites…

No

QUERY

CONTEXT

GROUND TRUTH

Did it hallucinate?

LLM-GENERATED LABEL

I’m planning…

Booking history, hotel inventory…

No

Where should…

Local restaurants…

Yes

My wife and I…

Hotel inventory, local sites…

No

I want to see…

Local event calendar…

Yes

No

No

Yes

No

No

2. Run through eval prompt template

QUERY

CONTEXT

GROUND TRUTH

I want to go…

Hotel inventory, local sites…

No

Did it hallucinate?

1. Create golden dataset

HALLUCINATION EVAL

PRECISION

RECALL

F1

0.94

0.75

0.83

RESPONSE

RESPONSE

Hi, happy to share a few…

Hi, happy to share a few…

The beaches of Positano are…

The beaches of Narnia are…

Couples often love visiting…

There’s a historical site…

© All Rights Reserved

| We Make Models Work

62 of 67

Testing As You Build

Use cases

Experiment traces

Eval traces

You are an assistant debugging RAG, investigate the retrieved results and evals…

PROMPT CHANGE

GITHUB ACTION TEST

© All Rights Reserved

| We Make Models Work

63 of 67

Document

Chunk Embedding

Document

Chunk

ID

Document Chunk

<1, 1, 2, 4>

1

<100, 309, 4, 7>

2

<59, 71, 73, 95>

3

Knowledge base of articles

Do you support international calling?

User query

Query embedding

<1, 2, 3, 4>

0.8

Cosine Similarity

From Lookup

0.4

0.1

How LLM Search

& Retrieval Works

Prompt

User is asking “Do you support international calling?

Here’s relevant content. Can you answer?

1

2

3

4

© All Rights Reserved

| We Make Models Work

64 of 67

LLM Judges align well with human rankings

Judge models may not assign identical scores as humans, but are more aligned in ranking the exam taker models

All judges struggle to distinguish between the poor-performing exam-taker models

contains demonstrates the highest alignment with the human ranking, swapping the ranks of only 2 of 9 models

© All Rights Reserved

| We Make Models Work

65 of 67

Position bias

Shuffle the order of references in LLM Judge input [question, references, model response]

LLM Judge is more likely to evaluate an answer as correct if the corresponding reference appears early in the list of references

Larger judge models consistently maintain their judgments regardless of the reference order

© All Rights Reserved

| We Make Models Work

66 of 67

Small models ignore the reference

The smaller judge models use their own knowledge rather than going by the references

Fail to capture all the information in the prompt

© All Rights Reserved

| We Make Models Work

67 of 67

LLM Judges have a leniency bias

LLMs tend to judge positively when in doubt

This is more pronounced for small models than for larger ones

Definitions:

  • Judge assigns the correct judgment with a probability of Pc
  • Judge assigns the rest of the samples to be “correct” with a probability P+ (leniency bias)

Estimate the values of Pc and P+ from the benchmark results

© All Rights Reserved

| We Make Models Work