1 of 16

How to test AI using AI

Leveraging OSS for Testing AI Based Applications

Home

www.robin-gupta.com

@smilinrobin

/polymorphicrobin

SLIDESMANIA.COM

2 of 16

About me

Explorer, open source contributor, maker, author, international speaker and Dad at home.

More boring stuff about me here:

3 of 16

Agenda...

  • Introduction 🙏
  • Understand the Why 🤔
  • When and How of Evals 🧮
  • Application under Test 💻
  • Rule based Evals 📐
  • Using AI for Evals 🤖
  • AI based reasoning 🤖🧠
  • CI/CT pipeline for AI apps 💀
  • QnA and Chocolates 🍫
  • Icebreaker ❄️

4 of 16

But first, what is AI?

Problem domain

“Artificial intelligence (AI) refers to computer systems capable of performing complex tasks that historically only a human could do, such as reasoning, making decisions, or solving problems.” - Coursera

Our focus today: GenAI

5 of 16

Our case study

Solution domain

  • We built a RAG application that serves as a customer-facing chatbot.
  • Testing took longer than development because we had to check:
    • Context adherence
    • Accuracy
    • Bias
  • It is live in production and has served 1,000+ answers so far.
  • Automated testing is as important as manual testing for AI applications.

6 of 16

Which is easier to test?

Problem domain

[Figure: two options, A and B]

7 of 16

Hallucinations and human-in-the-loop

Problem domain

Note: Observed on 25 Feb 2024

8 of 16

Changing needs for software testing

Problem domain

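| Criterion        | Traditional Software               | LLM based apps                                              |
|------------------|------------------------------------|-------------------------------------------------------------|
| Behaviour        | Predefined rules                   | Probability + prediction                                    |
| Output           | Deterministic (1 input = 1 output) | Non-deterministic (1 input = many possible outputs)         |
| Testing strategy | Evaluate as right or wrong         | Evaluate on accuracy, quality, bias, consistency, toxicity  |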
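
One way to put a number on the "Consistency" criterion is to send the same input several times and measure how similar the outputs are. The sketch below is only an illustration under assumptions: `consistency_score` is a hypothetical helper, the three sample outputs are made up, and character-level similarity from Python's standard `difflib` is a crude stand-in for a semantic comparison.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Sequence


def consistency_score(outputs: Sequence[str]) -> float:
    """Mean pairwise similarity (0.0 to 1.0) of several outputs for one input."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # zero or one output is trivially consistent
    ratios = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(ratios) / len(ratios)


# Hypothetical outputs from three runs of the same prompt (not real model output).
runs = [
    "Water slows down the speed of light.",
    "Light travels more slowly in water.",
    "Water slows down the speed of light.",
]
score = consistency_score(runs)
print(f"consistency: {score:.2f}")
```

A traditional, deterministic function would always score 1.0 here. For an LLM based app, a team would likely agree on a minimum acceptable score rather than expect identical outputs.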

9 of 16

LLM Testing Eval

Solution domain

We can evaluate LLM outputs on:

• Performance (speed and functionality)

• Effectiveness (accuracy and utility)

• Quality (user experience and reliability)

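| Benchmark                                        | Eval criteria             |
|--------------------------------------------------|---------------------------|
| MMLU (Massive Multitask Language Understanding)  | Performance               |
| HellaSwag                                        | Performance/effectiveness |
| HumanEval                                        | Quality                   |
| ARC (AI2 Reasoning Challenge)                    | Effectiveness             |
| WinoGrande                                       | Effectiveness/quality     |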

10 of 16

Automated Evals: What and When

Solution domain

What should you evaluate?

  • Context adherence
  • Context relevance
  • Correctness
  • Bias and Toxicity

When should you evaluate?

  • After every change (bug fixes, feature updates, data changes)
  • Pre-deployment (merges to the prod branch, end of sprint, before shipping a hotfix)
  • Post-deployment (on demand, driven by business needs)

11 of 16

Demo Application under test: Quiz generator

Solution domain

[Diagram: the User sends Input to the Quiz generator (LLM + Data) and receives Output]

Input: Write a quiz about science

Output: Sure! Here are three science questions for you:

  1. True or False: Water slows down the speed of light.
  2. What did Marie and Pierre Curie discover in Paris?
  3. Where were the first refracting telescopes invented?
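
An application like this can be sketched as a prompt template filled with a small quiz bank and the user's input, then sent to a model. Everything named below (`QUIZ_BANK`, `PROMPT_TEMPLATE`, `build_prompt`, `generate_quiz`, `stub_llm`) is a hypothetical illustration, not the code used in the demo. The quiz bank facts are sample data chosen to match the example output above, and the model is replaced by a stub so the sketch runs offline.

```python
from typing import Callable

# Illustrative quiz bank (assumption): subject -> fact the quiz may draw on.
QUIZ_BANK = {
    "physics": "Light travels more slowly through water than through a vacuum.",
    "paris": "Marie and Pierre Curie discovered radium and polonium in Paris.",
    "telescopes": "The first refracting telescopes were invented in the Netherlands.",
}

# Illustrative prompt template (assumption): restricts the model to the quiz bank.
PROMPT_TEMPLATE = (
    "You write short quizzes using ONLY the facts below.\n"
    "If the topic is not covered by the facts, say you cannot help.\n\n"
    "Facts:\n{facts}\n\nRequest: {user_input}\n"
)


def build_prompt(user_input: str) -> str:
    """Combine the quiz bank and the user's request into one prompt."""
    facts = "\n".join(f"- {subject}: {fact}" for subject, fact in QUIZ_BANK.items())
    return PROMPT_TEMPLATE.format(facts=facts, user_input=user_input)


def generate_quiz(user_input: str, llm: Callable[[str], str]) -> str:
    """Send the assembled prompt to any callable that wraps an LLM."""
    return llm(build_prompt(user_input))


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call so the example runs without network access."""
    return "Question 1: True or False: Water slows down the speed of light."


print(generate_quiz("Write a quiz about science", stub_llm))
```

Passing the model in as a callable keeps the generator testable: evals can swap the stub for a real client without touching the prompt logic.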

12 of 16

Test Strategy for AI systems

Solution domain

[Flowchart: Start → Quiz Generator (Quiz Bank, Prompt Template, User Input) → 1. Contains words? → 2. Contains formatting? → 3. Check hallucination → Stop. A failed check at any step leads to Fail.]

Checks 1 and 2: Rule based Evals

Check 3: AI based automated Evals
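
The three checks in this strategy can be sketched as a short chain that stops at the first failure. This is a minimal sketch under assumptions, not the session's notebook code: the function names, the numbered-line formatting rule, the judge prompt and its Y/N answer convention are all hypothetical, and `judge` stands for any callable that wraps an LLM.

```python
import re
from typing import Callable, Iterable


def contains_words(output: str, expected_words: Iterable[str]) -> bool:
    """Rule based eval 1: at least one expected keyword appears in the output."""
    lowered = output.lower()
    return any(word.lower() in lowered for word in expected_words)


def contains_formatting(output: str) -> bool:
    """Rule based eval 2: the output has at least one numbered question line."""
    return re.search(r"^\s*\d+\.\s+\S", output, flags=re.MULTILINE) is not None


def is_grounded(output: str, context: str, judge: Callable[[str], str]) -> bool:
    """AI based eval 3: ask a judge model whether the output sticks to the context.

    `judge` is a hypothetical callable; it receives a prompt and is expected
    to answer "Y" (unsupported facts found) or "N".
    """
    prompt = (
        "Context:\n" + context + "\n\nResponse:\n" + output + "\n\n"
        "Does the response state facts that are not in the context? Answer Y or N."
    )
    return judge(prompt).strip().upper().startswith("N")


def run_evals(output: str, expected_words: Iterable[str], context: str,
              judge: Callable[[str], str]) -> str:
    """Run the three checks in order and stop at the first failure."""
    if not contains_words(output, expected_words):
        return "fail: expected words missing"
    if not contains_formatting(output):
        return "fail: formatting missing"
    if not is_grounded(output, context, judge):
        return "fail: possible hallucination"
    return "pass"


quiz = "1. True or False: Water slows down the speed of light."
context = "Light travels more slowly through water."
# The lambda is a stub judge that always answers "N" (no unsupported facts).
print(run_evals(quiz, ["light", "telescope"], context, judge=lambda prompt: "N"))
```

Running the cheap rule based checks first means the slower, costlier judge call only happens for outputs that already look plausible.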

13 of 16

CI/CT pipeline for AI apps

Solution domain

[Pipeline diagram: Change → Per-commit evals (dev/feature branch) → Merge to main → Pre-release evals (main/prod branch) → Deploy. Each eval stage produces a Report that goes to Human review.]
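
The gating idea in this pipeline can be sketched in a few lines: each eval stage produces a report, and the next step (merge or deploy) goes ahead only when the evals pass and a human has reviewed the report. The names `EvalResult`, `build_report` and `may_proceed` are hypothetical, and which evals run at which stage is an assumption made for the example, not something the pipeline prescribes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalResult:
    name: str
    passed: bool


def build_report(stage: str, results: List[EvalResult]) -> dict:
    """Summarise one eval stage so a human reviewer can sign off on it."""
    failed = [r.name for r in results if not r.passed]
    return {
        "stage": stage,
        "total": len(results),
        "failed": failed,
        "all_passed": not failed,
    }


def may_proceed(report: dict, human_approved: bool) -> bool:
    """Gate the next step (merge or deploy) on passing evals AND human review."""
    return report["all_passed"] and human_approved


# Assumed split: fast rule based evals per commit, AI based evals before release.
per_commit = build_report("per-commit", [
    EvalResult("contains_words", True),
    EvalResult("contains_formatting", True),
])
pre_release = build_report("pre-release", [
    EvalResult("hallucination_check", False),
])

print(may_proceed(per_commit, human_approved=True))   # merge to main allowed
print(may_proceed(pre_release, human_approved=True))  # deploy blocked by failed eval
```

In a real CI system the same gate would typically be expressed as a job that exits non-zero on failure plus a manual approval step.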

14 of 16

Key takeaways-1

Solution domain

Problem

  • Evaluating LLM based apps is hard because they are non-deterministic
  • LLM hallucinations need to be checked for every feature

Solution

  • We can use rule based evals for string/pattern matching
  • We can use LLMs to test the non-deterministic behaviour and hallucinations of LLM based apps

Tip

  • Ensure that all eval results pass through a human in the loop for review

15 of 16

Key takeaways-2

Solution domain

Basics

Advanced

God mode

Colab notebooks presented in this session:

  • LangChain
  • promptfoo

16 of 16

QnA and prizes

Solution domain
