1 of 16

How to test AI using AI

Leveraging OSS for Testing AI-Based Applications

Home

www.robin-gupta.com

@smilinrobin

/polymorphicrobin

SLIDESMANIA.COM

2 of 16

About me

Recent Tabs

Explorer, open source contributor, maker, author, international speaker and Dad at home.

More boring stuff about me here:

3 of 16

Agenda...

Introduction

🙏

Understand the Why

🤔

When and How of Evals

🧮

Application under Test

💻

Rule-based Evals

📐

Using AI for Evals

🤖

AI-based reasoning

🤖🧠

CI/CT pipeline for AI apps

💀

QnA and Chocolates 🍫

Icebreaker

❄️

4 of 16

But first, what is AI?

Problem domain

“Artificial intelligence (AI) refers to computer systems capable of performing complex tasks that historically only a human could do, such as reasoning, making decisions, or solving problems.” - Coursera

GenAI

Our focus today

5 of 16

Our case study

Solution domain

We built a RAG application that serves as a customer-facing chatbot.
Testing took longer than development because we had to check for:

Context adherence
Accuracy
Bias

The app is live in production and has served 1000+ answers so far.
Automated testing is as important as manual testing for AI.

6 of 16

Which is easier to test?

Problem domain

7 of 16

Hallucinations and human-in-the-loop

Problem domain

Note: Observed on 25 Feb 2024

8 of 16

Changing needs for software testing

Problem domain

Criterion	Traditional software	LLM-based apps
Behaviour	Predefined rules	Probability + prediction
Output	Deterministic (1 input = 1 output)	Non-deterministic (1 input = many possible outputs)
Testing strategy	Evaluate as right or wrong	Evaluate on: accuracy, quality, bias, consistency, toxicity
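
The difference in testing strategy can be sketched in code. This is a minimal illustration, not code from the talk: `fake_llm` is a hypothetical stand-in for a real model call, and the property being checked (the answer mentions "Paris") is an assumption chosen for the example. The deterministic function is tested with an exact match, while the non-deterministic one is sampled several times and scored on a property.

```python
import random


def add(a: int, b: int) -> int:
    """Traditional software: one input always maps to one output."""
    return a + b


# Hypothetical stand-in for an LLM: several phrasings are all acceptable answers.
_PHRASINGS = [
    "Paris is the capital of France.",
    "The capital of France is Paris.",
    "France's capital city is Paris.",
]


def fake_llm(prompt: str, rng: random.Random) -> str:
    """Return one of several valid phrasings, mimicking non-deterministic output."""
    return rng.choice(_PHRASINGS)


def mentions_expected_fact(answer: str) -> bool:
    """Property check: the wording may vary, but the key fact must be present."""
    return "paris" in answer.lower()


def pass_rate(prompt: str, runs: int, seed: int = 0) -> float:
    """Sample the model several times and report the fraction of passing answers."""
    rng = random.Random(seed)
    answers = [fake_llm(prompt, rng) for _ in range(runs)]
    return sum(mentions_expected_fact(a) for a in answers) / runs


# Exact-match assertion works for deterministic code.
assert add(2, 3) == 5
# For the LLM stand-in, we assert on a property across many samples instead.
print(pass_rate("What is the capital of France?", runs=20))
```

An exact-match assertion on `fake_llm` would be flaky by design, which is why the eval scores a property over several samples.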

9 of 16

LLM Testing Eval

Solution domain

We can evaluate LLM outputs on:

• Performance (speed and functionality)

• Effectiveness (accuracy and utility)

• Quality (user experience and reliability)

Benchmark	Eval Criteria
MMLU (Massive Multitask Language Understanding)	Performance
HellaSwag	Performance/effectiveness
HumanEval	Quality
ARC (AI2 Reasoning Challenge)	Effectiveness
WinoGrande	Effectiveness/quality

10 of 16

Automated Evals: What and When

Solution domain

What should you evaluate?

Context adherence
Context relevance
Correctness
Bias and Toxicity

When should you evaluate?

After every change (bug fixes, feature updates, data changes)
Pre-deployment (merges to prod branch, end of sprint, before shipping a hotfix)
Post-deployment (on demand, driven by business needs)

11 of 16

Demo Application under test: Quiz generator

Solution domain

Quiz generator

LLM

Data

User

Input

Output

Input: Write a quiz about science

Output: Sure! Here are three science questions for you:

True or False: Water slows down the speed of light.
What did Marie and Pierre Curie discover in Paris?
Where were the first refracting telescopes invented?

12 of 16

Test Strategy for AI systems

Solution domain

Start

Quiz Generator

Quiz Bank

Prompt Template

User Input

Contains words

Fail

Contains formatting

Fail

Check Hallucination

Fail

Stop

Yes

Rule-based Evals

AI-based automated Evals
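
The flow above (contains words, then contains formatting, then hallucination check) can be sketched as a short pipeline. This is a minimal sketch under stated assumptions, not the code from the demo: the `Question N:` format rule, the function names, and `stub_judge` are all hypothetical. In a real setup, the `judge` callable would wrap a call to an LLM; here it is a simple placeholder so the example runs on its own.

```python
import re
from typing import Callable, List


def contains_expected_words(output: str, expected_words: List[str]) -> bool:
    """Rule-based eval: the quiz must mention at least one expected topic word."""
    text = output.lower()
    return any(word.lower() in text for word in expected_words)


def contains_formatting(output: str) -> bool:
    """Rule-based eval: at least one line starts with 'Question N:' (assumed format)."""
    return re.search(r"^Question \d+:", output, flags=re.MULTILINE) is not None


def stub_judge(output: str, quiz_bank: List[str]) -> bool:
    """Placeholder for an LLM judge: passes if any quiz bank entry appears in the output."""
    return any(entry.lower() in output.lower() for entry in quiz_bank)


def run_evals(
    output: str,
    quiz_bank: List[str],
    expected_words: List[str],
    judge: Callable[[str, List[str]], bool],
) -> str:
    """Run cheap rule-based evals first, then the AI-based hallucination check."""
    if not contains_expected_words(output, expected_words):
        return "fail: contains words"
    if not contains_formatting(output):
        return "fail: contains formatting"
    if not judge(output, quiz_bank):
        return "fail: hallucination"
    return "pass"


quiz_bank = ["refracting telescopes"]          # hypothetical quiz bank entry
expected_words = ["telescope", "light", "Curie"]  # hypothetical topic words
good = "Question 1: Where were the first refracting telescopes invented?"
print(run_evals(good, quiz_bank, expected_words, stub_judge))
```

Running the rule-based checks first keeps the more expensive AI-based check for outputs that already pass the cheap ones.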

13 of 16

CI/CT pipeline for AI apps

Solution domain

Change

Per-Commit evals

Merge to main

Pre-release evals

Deploy

Report

Dev/feature branch

Main/Prod branch

Human review
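
One way to picture the two eval stages in this pipeline is as gates with different pass thresholds. The sketch below is illustrative only: the threshold values, the `EvalReport` structure, and the eval names are assumptions, not part of the talk. Each stage runs a set of evals, produces a report, and the gate decides whether the change may move on; the report itself would still go to human review.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical thresholds: lenient on dev/feature branches, strict before release.
STAGE_THRESHOLDS: Dict[str, float] = {"per-commit": 0.7, "pre-release": 0.9}


@dataclass
class EvalReport:
    stage: str
    pass_rate: float
    failures: List[str]


def run_stage(stage: str, evals: Dict[str, Callable[[], bool]]) -> EvalReport:
    """Run every eval for a stage and collect the names of those that failed."""
    if not evals:
        raise ValueError("at least one eval is required")
    failures = [name for name, check in evals.items() if not check()]
    pass_rate = (len(evals) - len(failures)) / len(evals)
    return EvalReport(stage=stage, pass_rate=pass_rate, failures=failures)


def gate(report: EvalReport) -> bool:
    """Allow the pipeline to continue only if the stage threshold is met."""
    return report.pass_rate >= STAGE_THRESHOLDS[report.stage]


# Hypothetical evals; real ones would call the application under test.
evals = {
    "context_adherence": lambda: True,
    "correctness": lambda: True,
    "formatting": lambda: True,
    "bias": lambda: False,
}

for stage in ("per-commit", "pre-release"):
    report = run_stage(stage, evals)
    print(stage, report.pass_rate, report.failures, "continue" if gate(report) else "blocked")
```

With these assumed thresholds, the same eval results let a feature-branch commit through but block the release.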

14 of 16

Key takeaways-1

Solution domain

Problem

Evaluating LLM-based apps is hard because their outputs are non-deterministic.
LLM hallucinations need to be checked for every feature.

Solution

Use rule-based evals for string and pattern matching.
Use LLMs to evaluate non-deterministic outputs and detect hallucinations in LLM-based apps.

Tip

Ensure that all eval results pass through a human in the loop for review.
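
A minimal sketch of using an LLM as the evaluator, with results queued for human review. The prompt wording, the PASS/FAIL protocol, and the `ask_llm` parameter are assumptions made for illustration; `ask_llm` is any function that takes a prompt string and returns the model's reply, so no particular vendor API is implied.

```python
from typing import Callable, Dict, List

# Hypothetical judge prompt; the wording is an assumption, not taken from the talk.
JUDGE_PROMPT = """You are checking a generated quiz for hallucinations.
Allowed facts:
{context}

Generated quiz:
{output}

Reply with exactly one word: PASS if every question is grounded in the allowed facts, otherwise FAIL."""


def build_judge_prompt(context: str, output: str) -> str:
    """Fill the judge prompt with the reference context and the output under test."""
    return JUDGE_PROMPT.format(context=context, output=output)


def parse_verdict(reply: str) -> bool:
    """Treat any reply that starts with PASS (case-insensitive) as a pass."""
    return reply.strip().upper().startswith("PASS")


def judge_output(context: str, output: str, ask_llm: Callable[[str], str]) -> Dict[str, object]:
    """Ask the judge model for a verdict and record it for later human review."""
    reply = ask_llm(build_judge_prompt(context, output))
    return {
        "output": output,
        "verdict": "pass" if parse_verdict(reply) else "fail",
        "judge_reply": reply,
        "human_reviewed": False,  # every result still goes to a human reviewer
    }


# Usage with a canned reply standing in for a real model call.
review_queue: List[Dict[str, object]] = []
review_queue.append(judge_output("Telescopes, the Curies", "Question 1: ...", lambda prompt: "PASS"))
print(review_queue[0]["verdict"])
```

Keeping `ask_llm` injectable makes the judge easy to test with canned replies and leaves the choice of model open.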

15 of 16

Key takeaways-2

Solution domain

Basics

Advanced

God mode

Colab notebooks presented in this session:

16 of 16

QnA and prizes

Solution domain
