1 of 21

AI4LAM Community Call

January 21, 2025

2 of 21

  • Welcome & Announcements (15 mins)

  • Kicking off Evaluation of AI in LAMs series (15 mins)

  • Discussion - small group (20 mins) - Please help us design the series program

  • Share out (10 mins)

3 of 21

Evaluation of AI in LAMs - introduction

Why evaluate, when to evaluate, what to evaluate, who evaluates and how to evaluate?

4 of 21

Why evaluate AI?

Jan. 3, 2025

https://www.nytimes.com/2025/01/03/us/fable-ai-books-racism.html

Jan. 16, 2025

https://www.nytimes.com/2025/01/16/technology/apple-ai-news-notifications.html

  • tools are untested on LAM data and use cases
  • performance is often overstated or overestimated
  • need for performance monitoring
  • need for upskilling, expertise and to understand labor implications
  • legal and ethical uncertainties
  • emerging policies and regulations
  • harmful environmental impacts
  • long term costs are unknown
  • do benefits outweigh the costs?

Apr 24, 2024

https://www.theguardian.com/australia-news/2024/apr/24/queensland-state-library-ai-war-veteran-chatbot-charlie

5 of 21

When to evaluate AI?

6 of 21

What to evaluate?

  1. Use Case
  2. Models
  3. Output
  4. Implementation
  5. Impacts

7 of 21

What to evaluate: Use Case

Use Case Risk Assessment example: LC Labs Use Case Risk Assessment Worksheet

8 of 21

What to evaluate: Data Readiness/Ground Truth

The importance of ground truth data

9 of 21

What to evaluate: Models

  • License/openness of tool/model
    1. Transparency/black box
    2. Restrictions on reuse of models

  • Public metrics & leaderboards

  • Benchmarks w/ use case specific data, metrics & evaluations
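
A minimal sketch (not from the slides) of what a use-case-specific benchmark can look like: score a tool's predictions against a small hand-labelled sample drawn from your own collections. The predict() function, item identifiers, and subject headings below are placeholder assumptions, not a real API.

```python
# Minimal sketch: benchmark a model/tool on a small, use-case-specific
# labelled sample. Everything named here (predict, item ids, labels) is a
# placeholder assumption, not a real API.
from collections import Counter

# Hypothetical ground truth: item identifier -> expected subject heading
ground_truth = {
    "item-001": "Maps",
    "item-002": "Photographs",
    "item-003": "Manuscripts",
}

def predict(item_id: str) -> str:
    """Stand-in for a call to the model or tool under evaluation."""
    canned = {"item-001": "Maps", "item-002": "Prints", "item-003": "Manuscripts"}
    return canned[item_id]

def evaluate(gold: dict) -> dict:
    """Exact-match accuracy plus a tally of which labels the tool gets wrong."""
    errors = Counter()
    correct = 0
    for item_id, expected in gold.items():
        if predict(item_id) == expected:
            correct += 1
        else:
            errors[expected] += 1
    return {"accuracy": correct / len(gold), "errors_by_label": dict(errors)}

print(evaluate(ground_truth))  # e.g. {'accuracy': 0.666..., 'errors_by_label': {'Photographs': 1}}
```

The same structure works for other use cases: swap the labels for transcription lines, entity links, or classification codes, and swap exact match for whichever metric fits the task.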

10 of 21

What and How to evaluate: Output

  • Output from tests, prototypes, or experiments
    • Quality review
      • Example: Evaluation Matrix at UC Davis (Ensberg)

  • Output at scale at the program or production level
    • HITL (human-in-the-loop) workflows; human review/intervention
      • Example: Exploring Computational Description (LC Labs experiment)

  • Monitoring output at the production level over time (see the sketch after this list)
    • Feedback mechanisms
    • Protocols
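
A minimal sketch, under assumed field names and a 5% sample rate, of how a human-in-the-loop review step and ongoing monitoring might fit together: sample a share of machine-generated output for manual review and log the observed error rate so it can be tracked over time. This is an illustration, not a prescribed protocol.

```python
# Minimal sketch of a human-in-the-loop review step: sample a fraction of
# machine-generated output for manual review and log the outcome so the
# error rate can be monitored over time. Field names and the 5% sample
# rate are illustrative assumptions.
import csv
import random
from datetime import date

def sample_for_review(records: list, rate: float = 0.05) -> list:
    """Randomly select a share of machine-generated records for human review."""
    k = max(1, int(len(records) * rate))
    return random.sample(records, k)

def log_review_outcome(batch_id: str, reviewed: int, rejected: int,
                       path: str = "review_log.csv") -> None:
    """Append one monitoring row per reviewed batch (date, volume, error rate)."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([date.today().isoformat(), batch_id,
                         reviewed, rejected, rejected / reviewed])

# Example: 200 machine-generated descriptions; reviewers rejected 3 of the sample.
records = [{"id": f"rec-{i}", "description": "..."} for i in range(200)]
batch = sample_for_review(records)
log_review_outcome("batch-2025-01", reviewed=len(batch), rejected=3)
```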

11 of 21

What to evaluate: Implementation

12 of 21

What to evaluate: Impacts

Does the AI tool/process/program

    • Support organizational strategic goals?
    • Uphold organizational values and principles?
      • Examples: public trust, fairness, environmental impact
    • Meet legal requirements?
    • Satisfy safety and security requirements?
    • Support trusting relationships with impacted communities and users?
      • What are the programs of outreach and mechanisms for integrating feedback?

Do the costs and risks outweigh the benefits?

  1. How are benefits understood and measured?
  2. How are costs understood and measured?

13 of 21

Who Evaluates?

A broad, cross-cutting group of people:

  1. People whose work is implicated
    1. People who already perform quality review and assurance work
    2. People who know what is “good enough”
  2. People who are represented in the data
  3. People who are responsible for quality outcomes
  4. People who are responsible for developing, implementing and maintaining systems

14 of 21

Help design the AI Evaluation series

  • Small Group Discussion (20 mins)
    • We will put you into (at least 6) small groups
    • Each group will discuss all of the questions in your slide and fill in answers
    • Please introduce yourselves and add your names to the “speaker notes” area of your slide
    • Choose one or two people to share the highlights of the discussion, focusing on your highlighted question

15 of 21

  1. How is your team/org approaching evaluation of AI?

Rutgers: break into functions, research, services, product by product (structured); how to apply to x, y, z. Half volunteer/assigned, reporting up to a central Library/report; policy and research aren’t connected to faculty research.

ENC: computer vision evaluation for a museum; want to design an interface that allows non-specialists to evaluate the output of CV models.

U of T: personal research around AI assessment for use in technology services. I am slowly collecting AI frameworks in this Zotero folder: https://www.zotero.org/groups/5199888/ai_tools_for_archives/collections/XQGTKRRF/items/IN5HE2UF/item-list (should be public)

In Ontario there are also consortial initiatives: https://ocul.on.ca/ai-machine-learning-final-report-strategy

2. On the topic of Evaluating AI, what do you want more information/experience with?

3. What else should AI4LAM explore related to evaluation?

4. What are evaluation models we could emulate/learn from?

5. What can you share related to this topic?

https://www.comparia.beta.gouv.fr/ - for French-language materials; creates datasets to improve LLMs and compares quality and other characteristics such as environmental impact and number of parameters (LLM)

6. What other questions do you have?

How to approach evaluation when understaffed and under-resourced?

Metrics and tools for evaluating commercial AI products

16 of 21

  1. How is your team/org approaching evaluation of AI?

2. On the topic of Evaluating AI, what do you want more information/experience with?

–how to do fine-tuning

3. What else should AI4LAM explore related to evaluation?

–benchmarks for evaluation (e.g. energy used to produce an answer?)

–how to use them to decide whether to use the tech

4. What are evaluation models we could emulate/learn from?

5. What can you share related to this topic?

–2 stages of evaluation: models (NorEval for lm-eval-harness) and production systems

– 1) Create a proof-of-concept application with a nice interface using one model; 2) then try other models and get help from domain experts who can better evaluate the results/answers.
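
A minimal sketch of that second stage, assuming two placeholder model functions (not real APIs) and a CSV review sheet for domain experts to fill in; swap the placeholders for real API or local-model calls.

```python
# Minimal sketch: run the same prompts through several candidate models and
# export the answers side by side so domain experts can rate them.
# The model functions are placeholders, not real APIs.
import csv

def model_a(prompt: str) -> str:
    return f"[model A answer to: {prompt}]"

def model_b(prompt: str) -> str:
    return f"[model B answer to: {prompt}]"

candidate_models = {"model-a": model_a, "model-b": model_b}

prompts = [
    "Summarise the provenance of this collection item.",
    "Suggest three subject headings for this record.",
]

with open("expert_review_sheet.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt", "model", "answer", "expert_rating", "expert_notes"])
    for prompt in prompts:
        for name, model in candidate_models.items():
            # Experts fill in the last two columns during review.
            writer.writerow([prompt, name, model(prompt), "", ""])
```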

6. What other questions do you have?

17 of 21

  1. How is your team/org approaching evaluation of AI?

Room 3

Evaluating embedding models using leaderboards, then creating a dataset for “relevancy” benchmarking.
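
A minimal sketch of such a “relevancy” benchmark, assuming a small hand-labelled set of query-to-relevant-document pairs and a placeholder embed() function standing in for whichever embedding model is being compared; the metric shown is recall@k over cosine similarity.

```python
# Minimal sketch of a relevancy benchmark for embedding models: given a
# hand-labelled set of query -> relevant-document pairs, measure how often
# the relevant document appears in the top-k nearest neighbours.
# embed() is a placeholder for the embedding model under evaluation.
import numpy as np

def embed(texts: list) -> np.ndarray:
    """Stand-in for the embedding model being compared."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))  # assumed embedding size

def recall_at_k(queries: list, docs: list, relevant_idx: list, k: int = 3) -> float:
    """Share of queries whose labelled relevant document is in the top-k results."""
    q = embed(queries)
    d = embed(docs)
    # Cosine similarity: normalise rows, then take dot products.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sims = q @ d.T
    hits = 0
    for i, rel in enumerate(relevant_idx):
        topk = np.argsort(-sims[i])[:k]
        hits += int(rel in topk)
    return hits / len(queries)

queries = ["postcards of the seafront", "first world war service records"]
docs = ["Seafront postcard album, 1910s", "Service record, 1914-1918", "Museum committee minutes, 1975"]
print(recall_at_k(queries, docs, relevant_idx=[0, 1]))
```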

Bexhill Museum UK: Looking at ways to evaluate responses based on very specific inputs of information: the paper original, the digital CMS record, and supporting information from the museum’s own library of books and papers. We have built an OpenAI-enhanced platform that reviews the records and the additional submitted information. We then plan to ask experts to review the outputs for accuracy and authenticity.

2. On the topic of Evaluating AI, what do you want more information/experience with?

3. What else should AI4LAM explore related to evaluation?

4. What are evaluation models we could emulate/learn from?

5. What can you share related to this topic?

6. What other questions do you have?

18 of 21

  1. How is your team/org approaching evaluation of AI?

Room 4

Just starting out; sense that there isn’t a lot of time for evaluation. Dealing with a lack of language expertise on the team.

Speech-to-text

Data evaluation and generation for catalog and web crawling

2. On the topic of Evaluating AI, what do you want more information/experience with?

Tools for evaluating output in a language you don’t speak

Questions/criteria to guide evaluation by subject/language specialists

3. What else should AI4LAM explore related to evaluation?

4. What are evaluation models we could emulate/learn from?

The LOC approach seems like a heavyweight evaluation method. We don’t feel we have a lot of time for evaluation, but at the same time we think it is important, because we don’t want to depend on tools that are not reliable.

5. What can you share related to this topic?

6. What other questions do you have?

19 of 21

  1. How is your team/org approaching evaluation of AI?

For small groups: evaluation while in development, to improve the work quickly.

Evaluation embedded in the process, approved and written into quality assurance procedures; a fixed process.

Risk-based evaluation: without a focus on the risks, you don’t see them.

2. On the topic of Evaluating AI, what do you want more information/experience with?

3. What else should AI4LAM explore related to evaluation?

4. What are evaluation models we could emulate/learn from?

5. What can you share related to this topic?

Evaluation embedded in the process, approved and written into quality assurance procedures; a fixed process.

Cross-disciplinary cooperation in development and evaluation.

Risk-based evaluation: without a focus on the risks, you don’t see them.

Workflows for keeping humans in the loop, which make informal evaluation more viable.

6. What other questions do you have?

20 of 21

  1. How is your team/org approaching evaluation of AI?

2. On the topic of Evaluating AI, what do you want more information/experience with?

3. What else should AI4LAM explore related to evaluation?

- Share resources made by other partners (spreadsheets, etc.)

- Explore what has been done by the EU on this topic and how it could be applied specifically to our needs

4. What are evaluation models we could emulate/learn from?

5. What can you share related to this topic?

German National Library (project about to start): training an LLM from scratch on constrained data (in order to respect copyright law) and evaluating that model

From the Library of Congress Labs – AI risk evaluation framework: https://libraryofcongress.github.io/labs-ai-bframework/

6. What other questions do you have?

21 of 21

Share out (10 mins) - Thank you!

  1. Groups sharing feedback on question 1
  2. Groups sharing feedback on question 2
  3. … question 3
  4. … question 4
  5. … question 5
  6. … question 6