1 of 15

2024 Cloud Forum

Rapid Evaluation of AI Tools

John Bailey

Asst. Director, Cloud Systems

jwbailey@wustl.edu

2 of 15

So, You Want to Test AI Services?

  • What are the AI services we should test/compare?
  • What version of that AI service should we test?
  • How can we evaluate and score the AI capabilities?
  • Where do we even start?!?

AI Artwork by: Adobe Firefly 3 (preview)

Prompt: “A person looking out across a vast landscape made of circuit boards and other technology parts.”

Settings: Aspect Ratio: Landscape, Type: Art

Date: 4/25/2024

3 of 15

AI Is at the Peak of Inflated Expectations

Credit: Gartner

Date: 4/25/2024

4 of 15

Where to Start: At Play

  • Human beings are designed to learn by doing (or by playing) - particularly as a group.
  • Take the time to gain access to as many AI tools as possible.
  • Schedule time in-person with a diverse team and play with the AI tools in a totally unstructured way.
    • Take turns submitting prompts and adjusting settings.
    • Laugh together at how spectacularly the AI fails at some tasks.
    • Marvel together at how well the AI succeeds at some tasks.

AI Artwork by: Adobe Firefly 3 (preview)

Prompt: “A group of people in a conference room laughing together.”

Settings: Aspect Ratio: Landscape, Type: Art

Date: 4/25/2024

5 of 15

Bringing Order to the AI Chaos

  • Identify use cases – talk to your internal customers.
  • Decide (guess) which of the AI tools you have already played with may meet the customer’s need.
  • Consider a scoring rubric: 1=failed, 5=exceeded expectations.
  • Plan where and how to document:
    • Use cases.
    • AI prompts and results.
    • AI test scores.
    • Important metadata (name of tester, date of test, etc.)
  • Be as detailed as possible with your AI test result records!
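The rubric and the documentation checklist above could be captured in a small record type. The following is a minimal sketch, not a prescribed format: the class name `AITestRecord`, its field names, and the sample values are all hypothetical, and only the two rubric endpoints (1 and 5) come from the slide.

```python
from dataclasses import dataclass
from datetime import date

# Only the two endpoints are defined on the slide; labels for 2-4 are left
# for the team to agree on.
RUBRIC_ENDPOINTS = {1: "failed", 5: "exceeded expectations"}


@dataclass
class AITestRecord:
    """One documented test of one AI tool against one use case."""
    use_case: str
    tool: str
    prompt: str
    result: str
    score: int        # 1-5 on the rubric above
    tester: str
    test_date: date

    def __post_init__(self):
        # Reject scores outside the rubric so bad entries are caught early.
        if not 1 <= self.score <= 5:
            raise ValueError(f"score must be between 1 and 5, got {self.score}")


# Placeholder values for illustration only.
record = AITestRecord(
    use_case="Example use case",
    tool="Example tool",
    prompt="Example prompt",
    result="Example response",
    score=3,
    tester="Example tester",
    test_date=date(2024, 4, 25),
)
print(record)
```

Validating the score when the record is created keeps out-of-range entries from slipping into the results, which matters once several testers are adding records.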

6 of 15

Example: AI Test Record Table

| AI Tool | Version | Category | Tester | Date | Prompt | Settings | Response | Score | Retries | Notes |
|---------|---------|----------|--------|------|--------|----------|----------|-------|---------|-------|
| Gemini | 1.5 | Text generation | John Bailey | 2024-04-25 | Tell me a story about… | Default | There once was… | 3 | 0 | Didn’t understand requested story structure. |
| ChatGPT | 4 | Text generation | John Bailey | 2024-04-25 | Tell me a story about… | Default | Once upon a time… | 4 | 0 | Good story, but still felt AI generated. |
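One way to keep records like these in a shareable form is a CSV file with the same columns. The sketch below assumes that approach: the helper name `records_to_csv` is hypothetical, the column names and sample values are taken from this slide, and the elided prompt and response text is kept exactly as shown.

```python
import csv
import io

# Column names as they appear on the slide.
COLUMNS = ["AI Tool", "Version", "Category", "Tester", "Date", "Prompt",
           "Settings", "Response", "Score", "Retries", "Notes"]


def records_to_csv(rows):
    """Serialize test records (dicts keyed by COLUMNS) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"AI Tool": "Gemini", "Version": "1.5", "Category": "Text generation",
     "Tester": "John Bailey", "Date": "2024-04-25",
     "Prompt": "Tell me a story about…", "Settings": "Default",
     "Response": "There once was…", "Score": 3, "Retries": 0,
     "Notes": "Didn't understand requested story structure."},
    {"AI Tool": "ChatGPT", "Version": "4", "Category": "Text generation",
     "Tester": "John Bailey", "Date": "2024-04-25",
     "Prompt": "Tell me a story about…", "Settings": "Default",
     "Response": "Once upon a time…", "Score": 4, "Retries": 0,
     "Notes": "Good story, but still felt AI generated."},
]

csv_text = records_to_csv(rows)
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of file paths; in practice the same writer could target a file or a shared spreadsheet export.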

7 of 15

Example – AI Tool Comparison

AI Artwork by: Adobe Firefly 3 (preview)

Prompt: “A frog leaping off of a lily pad, nighttime, photo realistic, with fireflies in the background.”

Settings: Aspect Ratio: Landscape, Type: Photo

Date: 4/25/2024

AI Artwork by: Google Gemini (Public version)

Prompt: “A frog leaping off of a lily pad, nighttime, photo realistic, with fireflies in the background.”

Settings: N/A (none)

Date: 4/25/2024

8 of 15

Example Use Case: Training Narration

  • Professionally narrate internal WashU self-guided training exercises using the instruction text from the training.
    • This task is often done by our training staff, but it is time-consuming and tedious.
    • The resulting AI-generated audio must exceed the quality of Siri/Google assistants and be as good as or better than human narration.
    • Sample Text: “Faculty and Staff who work with or may be exposed to Controlled Unclassified Information (C U I) in the course of their job duties are required to complete Insider Threat training annually.”

Audio by: Siri (macOS built-in text-to-speech)

Settings: N/A

Date: 4/25/2024

Audio by: Google Vertex AI Text-To-Speech

Settings: Voice: English - Female

Date: 4/25/2024

Audio by: Google Vertex AI Text-To-Speech (After using Google Translate to produce Spanish text.)

Settings: Voice: Spanish - Male

Date: 4/25/2024

9 of 15

What if There is No Comparable Tool?

  • Develop and save a scoring rubric that you can re-use later when new AI tools emerge.
  • Carefully document the shortcomings of the tool so you can see if they are overcome in future tests.
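A saved rubric and a record of shortcomings could look something like the sketch below. Everything here is an assumption for illustration: the JSON layout, the criteria names, the helper names, and the sample scores are hypothetical. Only the 1 and 5 scale labels come from this deck.

```python
import json

# Hypothetical rubric; the criteria are placeholders a team would replace.
rubric = {
    "use_case": "Example use case",
    "scale": {"1": "failed", "5": "exceeded expectations"},
    "criteria": ["criterion_a", "criterion_b"],
}


def save_rubric(rubric_dict):
    """Serialize the rubric so it can be stored and re-used later."""
    return json.dumps(rubric_dict, indent=2)


def load_rubric(text):
    """Restore a rubric that was saved with save_rubric."""
    return json.loads(text)


def improved_criteria(baseline, retest):
    """Return the criteria whose score rose between two test runs."""
    return sorted(c for c in baseline if retest.get(c, 0) > baseline[c])


saved = save_rubric(rubric)
restored = load_rubric(saved)

# Hypothetical per-criterion scores from an early test and a later retest.
baseline_scores = {"criterion_a": 2, "criterion_b": 3}
retest_scores = {"criterion_a": 4, "criterion_b": 3}
print(improved_criteria(baseline_scores, retest_scores))
```

Scoring each criterion separately is what makes it possible to see later which specific shortcomings a newer tool or version has overcome.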

AI Artwork by: Adobe Firefly 3 (preview)

Prompt: “A young woman at a desk evaluating the results of her work.”

Settings: Aspect Ratio: Landscape, Type: Art

Date: 4/25/2024

10 of 15

What Not to Do

  • Attempt to research the technical details of the AI model.
    • The vendors won’t tell you, and even if they would, the truth is that they do not fully understand what they have built.
  • Assume that you can hand out an AI model to your campus and they will find it immediately useful.
  • Use publicly available AI models to analyze institutional data.
    • This may leak your data to the model owners.
    • It also provides free training data to tech companies (they should partner with us to get our data for training).

11 of 15

What to Do Instead

  • Research the company building the AI model – especially their track record for acknowledging problems with their past and current AI models.
  • Talk to your internal customers about what problems they are hoping AI can help with and then test the models against those problems before rolling out the AI model.
  • Use isolated, static AI models deployed within a University-owned cloud subscription to leverage AI with your institutional data.

12 of 15

Challenges

  • Gaining access to AI tools.
    • Payment gates.
    • Limited beta programs.
  • Ever-changing labyrinth of models.
  • Lack of understanding of model capabilities and limitations.

AI Artwork by: Adobe Firefly 3 (preview)

Prompt: “A man walking toward a labyrinth.”

Settings: Aspect Ratio: Landscape, Type: Art

Date: 4/25/2024

13 of 15

Opportunities

  • Reach out to your cloud reps and tell them you want access to all the AI betas and previews!
  • Pick a few key partners and focus on their AI models, rather than trying to boil the AI ocean.
  • Move fast and play with the AI models to understand them!

AI Artwork by: Adobe Firefly 3 (preview)

Prompt: “A diverse group of young male runners at the starting line of a 100m dash, wide angle, viewed from the side.”

Settings: Aspect Ratio: Landscape, Type: Art

Date: 4/25/2024

14 of 15

Key Takeaways

  • Just because your test of AI didn’t yield anything useful doesn’t mean you did something wrong – AI fails a lot.
  • Don’t give up testing a use case after testing the first AI tool.
    • Example: Google Gemini may succeed at a task that ChatGPT fails at.
  • Things change quickly in AI land. If a test fails, schedule a future date to retry the test against the same tool.
  • Document, Document, Document.
    • Prompts, results, scores, settings, dates, etc.
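Scheduling a retest can be as simple as adding a fixed interval to the date of a failed test. This is a rough sketch under stated assumptions: the function name, the passing threshold of 4, and the 90-day wait are hypothetical defaults, not values from this talk.

```python
from datetime import date, timedelta


def next_retest_date(test_date, score, passing_score=4, wait_days=90):
    """Return when to retry a failed test, or None if the test passed.

    passing_score and wait_days are assumed defaults; tune them per use case.
    """
    if score >= passing_score:
        return None
    return test_date + timedelta(days=wait_days)


# A test scored 2 on 2024-04-25 is scheduled for a retry 90 days later.
retry_on = next_retest_date(date(2024, 4, 25), score=2)
print(retry_on)
```

Storing the retest date alongside the original record keeps the follow-up from being forgotten as tools and versions change.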

15 of 15

Discussion

  • Question for the audience: how do you evaluate and compare AI tools at your institution?

AI Artwork by: Adobe Firefly 3 (preview)

Prompt: “A large question mark floating above rolling hills of green grass.”

Settings: Aspect Ratio: Landscape, Type: Art

Date: 4/25/2024