How to test AI using AI
Leveraging OSS for Testing AI Based Applications
www.robin-gupta.com
@smilinrobin
/polymorphicrobin
SLIDESMANIA.COM
About me
Explorer, open source contributor, maker, author, international speaker and Dad at home.
More boring stuff about me here:
Agenda...
Introduction
🙏
Understand the Why
🤔
When and How of Evals
🧮
Application under Test
💻
Rule based Evals
📐
Using AI for Evals
🤖
AI based reasoning
🤖🧠
CI/CT pipeline for AI apps
💀
QnA and Chocolates 🍫
Icebreaker
❄️
But first, what is AI?
Problem domain
“Artificial intelligence (AI) refers to computer systems capable of performing complex tasks that historically only a human could do, such as reasoning, making decisions, or solving problems.” - Coursera
GenAI
Our focus today
Our case study
Solution domain
Which is easier to test?
Problem domain
A
B
Hallucinations and human-in-the-loop
Problem domain
Note: Observed on 25 Feb 2024
Changing needs for software testing
Problem domain
Criterion | Traditional software | LLM-based apps |
Behaviour | Predefined rules | Probability + prediction |
Output | Deterministic (1 input = 1 output) | Non-deterministic (1 input = many possible outputs) |
Testing strategy | Evaluate as right or wrong | Evaluate on accuracy, quality, bias, consistency, toxicity |
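The shift from "1 input = 1 output" to "1 input = many possible outputs" can be illustrated with a small sketch. Instead of asserting an exact expected string, it runs the same input several times and measures how often a property of the output holds. Everything here is an assumption for illustration: `fake_quiz_generator` is a stand-in for a real LLM call, and `mentions_topic` is one hypothetical property among many.

```python
import random
from typing import Callable


def fake_quiz_generator(topic: str, rng: random.Random) -> str:
    """Stand-in for an LLM (assumption): wording may vary between calls."""
    openers = ["Sure! Here is a quiz", "Here are some questions", "Quiz time"]
    return f"{rng.choice(openers)} about {topic}: What is an atom?"


def mentions_topic(output: str, topic: str = "science") -> bool:
    """Property check: the output refers to the requested topic."""
    return topic in output.lower()


def pass_rate(
    generate: Callable[[str, random.Random], str],
    check: Callable[[str], bool],
    topic: str,
    runs: int = 20,
    seed: int = 0,
) -> float:
    """Run the same input several times; return the share of outputs passing the check."""
    rng = random.Random(seed)
    results = [check(generate(topic, rng)) for _ in range(runs)]
    return sum(results) / runs
```

A threshold on the pass rate (for example, failing the eval below some agreed value) is a design choice that depends on the application and is not specified in the slides.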
LLM Testing Eval
Solution domain
We can evaluate LLM outputs on:
• Performance (speed and functionality)
• Effectiveness (accuracy and utility)
• Quality (user experience and reliability)
Benchmark | Eval Criteria |
MMLU (Massive Multitask Language Understanding) | Performance |
HellaSwag | Performance/effectiveness |
HumanEval | Quality |
ARC (AI2 Reasoning Challenge) | Effectiveness |
WinoGrande | Effectiveness/quality |
Automated Evals: What and When
Solution domain
What should you evaluate?
When should you evaluate?
Demo Application under test: Quiz generator
Solution domain
Quiz generator
LLM
Data
User
Input
Output
Input: Write a quiz about science
Output: Sure! Here are three science questions for you:
Test Strategy for AI systems
Solution domain
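[Flowchart: Start → Quiz Generator (inputs: Quiz Bank, Prompt Template, User Input) → three checks in sequence: 1. Contains words, 2. Contains formatting, 3. Check Hallucination. Each check has Yes/No branches, with Fail as a possible outcome, ending at Stop. Groupings: Rule based Evals, AI based automated Evals.]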
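The three checks in the flowchart can be sketched as a short chain. This is a minimal illustration, not the code from the session. It assumes a failed check leads to Fail, that the first two checks are the rule based ones and the hallucination check is the AI based one. The topic list, the `Question N:` format, and the `judge` callable (any function that sends a prompt to a model and returns its text reply) are all hypothetical.

```python
import re
from typing import Callable

# Hypothetical values: the slides do not give the quiz bank or the output format.
QUIZ_BANK_TOPICS = ["science", "geography", "history"]
QUESTION_LINE = re.compile(r"^Question \d+:", re.MULTILINE)


def contains_words(output: str) -> bool:
    """Check 1 (rule based): the output mentions at least one known topic."""
    lowered = output.lower()
    return any(topic in lowered for topic in QUIZ_BANK_TOPICS)


def contains_formatting(output: str, min_questions: int = 3) -> bool:
    """Check 2 (rule based): the output has enough 'Question N:' lines."""
    return len(QUESTION_LINE.findall(output)) >= min_questions


def check_hallucination(output: str, context: str, judge: Callable[[str], str]) -> bool:
    """Check 3 (AI based): ask a judge model whether the quiz sticks to the context."""
    prompt = (
        "Answer only YES or NO. Does the quiz below use only facts "
        f"found in the context?\n\nContext:\n{context}\n\nQuiz:\n{output}"
    )
    return judge(prompt).strip().upper().startswith("YES")


def evaluate(output: str, context: str, judge: Callable[[str], str]) -> str:
    """Run the checks in order and stop at the first failure."""
    if not contains_words(output):
        return "fail: contains_words"
    if not contains_formatting(output):
        return "fail: contains_formatting"
    if not check_hallucination(output, context, judge):
        return "fail: check_hallucination"
    return "pass"
```

Running the cheap rule based checks first means the judge model is only called for outputs that already look structurally valid, which keeps the AI based step for the cases that need it.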
CI/CT pipeline for AI apps
Solution domain
[Pipeline diagram. Stages: Change, Per-commit evals, Merge to main, Pre-release evals, Deploy, Report. Lanes: Dev/feature branch, Main/Prod branch. Human review appears twice.]
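One way to picture the two eval stages is a runner that picks a suite by stage name. This is a sketch under assumptions: the stage names mirror the diagram labels, but which checks belong to which stage, the check functions themselves, and the shape of the report are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[str], bool]


def not_empty(output: str) -> bool:
    """Cheap check: the app returned some text."""
    return bool(output.strip())


def has_question(output: str) -> bool:
    """Cheap check: the output contains at least one question mark."""
    return "?" in output


# Hypothetical split: a small, fast suite per commit and a fuller one before release.
EVAL_SUITES: Dict[str, List[Check]] = {
    "per-commit": [not_empty],
    "pre-release": [not_empty, has_question],
}


def run_evals(stage: str, outputs: List[str]) -> dict:
    """Run every check of the stage's suite on every output and build a report."""
    failures: List[Tuple[int, str]] = [
        (index, check.__name__)
        for index, output in enumerate(outputs)
        for check in EVAL_SUITES[stage]
        if not check(output)
    ]
    return {
        "stage": stage,
        "passed": not failures,
        "failures": failures,
        # The report is meant to be read by a person before merge or deploy.
        "human_review": "pending",
    }
```

In a CI job, the `passed` flag could decide the exit code, while the report itself is handed to the human reviewer.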
Key takeaways-1
Solution domain
Problem
Solution
Tip
Ensure that all eval results are reviewed by a human in the loop.
Key takeaways-2
Solution domain
Basics
Advanced
God mode
Colab notebooks presented in this session:
QnA and prizes
Solution domain