1 of 21

AI4LAM Community Call

January 21, 2025

2 of 21

  • Welcome & Announcements (15 mins)

  • Kicking off Evaluation of AI in LAMs series (15 mins)

  • Discussion - small group (20 mins) - Please help us design the series program

  • Share out (10 mins)

3 of 21

Evaluation of AI in LAMs - introduction

Why evaluate, when to evaluate, what to evaluate, who evaluates and how to evaluate?

4 of 21

Why evaluate AI?

Jan. 3, 2025

https://www.nytimes.com/2025/01/03/us/fable-ai-books-racism.html

Jan. 16, 2025

https://www.nytimes.com/2025/01/16/technology/apple-ai-news-notifications.html

  • tools are untested on LAM data and use cases
  • performance is often overstated or overestimated
  • need for performance monitoring
  • need for upskilling, expertise and to understand labor implications
  • legal and ethical uncertainties
  • emerging policies and regulations
  • harmful environmental impacts
  • long term costs are unknown
  • do benefits outweigh the costs?

Apr 24, 2024

https://www.theguardian.com/australia-news/2024/apr/24/queensland-state-library-ai-war-veteran-chatbot-charlie

5 of 21

When to evaluate AI?

6 of 21

What to evaluate?

  1. Use Case
  2. Models
  3. Output
  4. Implementation
  5. Impacts

7 of 21

What to evaluate: Use Case

Use Case Risk Assessment example: LC Labs Use Case Risk Assessment Worksheet

8 of 21

What to evaluate: Data Readiness/Ground Truth

The importance of ground truth data

9 of 21

What to evaluate: Models

  • License/openness of tool/model
    1. Transparency/black box
    2. Restrictions on reuse of models

  • Public metrics & leaderboards

  • Benchmarks w/ use case specific data, metrics & evaluations
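
A minimal sketch (not from the slides) of what a use-case-specific benchmark can look like: score a tool's predictions against a small hand-labelled sample drawn from your own collections. The predict() function, item identifiers, and subject headings below are placeholder assumptions, not a real API.

```python
# Minimal sketch: benchmark a model/tool on a small, use-case-specific
# labelled sample. Everything named here (predict, item ids, labels) is a
# placeholder assumption, not a real API.
from collections import Counter

# Hypothetical ground truth: item identifier -> expected subject heading
ground_truth = {
    "item-001": "Maps",
    "item-002": "Photographs",
    "item-003": "Manuscripts",
}

def predict(item_id: str) -> str:
    """Stand-in for a call to the model or tool under evaluation."""
    canned = {"item-001": "Maps", "item-002": "Prints", "item-003": "Manuscripts"}
    return canned[item_id]

def evaluate(gold: dict) -> dict:
    """Exact-match accuracy plus a tally of which labels the tool gets wrong."""
    errors = Counter()
    correct = 0
    for item_id, expected in gold.items():
        if predict(item_id) == expected:
            correct += 1
        else:
            errors[expected] += 1
    return {"accuracy": correct / len(gold), "errors_by_label": dict(errors)}

print(evaluate(ground_truth))  # e.g. {'accuracy': 0.666..., 'errors_by_label': {'Photographs': 1}}
```

The same structure works for other use cases: swap the labels for transcription lines, entity links, or classification codes, and swap exact match for whichever metric fits the task.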

10 of 21

What and How to evaluate: Output

  • Output from tests, prototypes, or experiments
    • Quality review
      • Example: Evaluation Matrix at UC Davis (Ensberg)

  • Output at scale at the program or production level
    • HITL (human-in-the-loop) workflows; human review/intervention
      • Example: Exploring Computational Description (LC Labs experiment)

  • Monitoring output at the production level over time (see the sketch after this list)
    • Feedback mechanisms
    • Protocols
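
A minimal sketch, under assumed field names and a 5% sample rate, of how a human-in-the-loop review step and ongoing monitoring might fit together: sample a share of machine-generated output for manual review and log the observed error rate so it can be tracked over time. This is an illustration, not a prescribed protocol.

```python
# Minimal sketch of a human-in-the-loop review step: sample a fraction of
# machine-generated output for manual review and log the outcome so the
# error rate can be monitored over time. Field names and the 5% sample
# rate are illustrative assumptions.
import csv
import random
from datetime import date

def sample_for_review(records: list, rate: float = 0.05) -> list:
    """Randomly select a share of machine-generated records for human review."""
    k = max(1, int(len(records) * rate))
    return random.sample(records, k)

def log_review_outcome(batch_id: str, reviewed: int, rejected: int,
                       path: str = "review_log.csv") -> None:
    """Append one monitoring row per reviewed batch (date, volume, error rate)."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([date.today().isoformat(), batch_id,
                         reviewed, rejected, rejected / reviewed])

# Example: 200 machine-generated descriptions; reviewers rejected 3 of the sample.
records = [{"id": f"rec-{i}", "description": "..."} for i in range(200)]
batch = sample_for_review(records)
log_review_outcome("batch-2025-01", reviewed=len(batch), rejected=3)
```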

11 of 21

What to evaluate: Implementation

12 of 21

What to evaluate: Impacts

Does the AI tool/process/program

    • Support organizational strategic goals?
    • Uphold organizational values and principles?
      • Examples: public trust, fairness, environmental impact
    • Meet legal requirements?
    • Satisfy safety and security requirements?
    • Support trusting relationships with impacted communities and users?
      • What are the programs of outreach and mechanisms for integrating feedback?

Do the costs and risks outweigh the benefits?

  1. How are benefits understood and measured?
  2. How are costs understood and measured?

13 of 21

Who Evaluates?

A broad, cross-cutting group of people:

  1. People whose work is implicated
    1. People who already perform quality review and assurance work
    2. People who know what is “good enough”
  2. People who are represented in the data
  3. People who are responsible for quality outcomes
  4. People who are responsible for developing, implementing and maintaining systems

14 of 21

Help design the AI Evaluation series

  • Small Group Discussion (20 mins)
    • We will put you into (at least 6) small groups
    • Each group will discuss all of the questions in your slide and fill in answers
    • Please introduce yourselves and add your names to the “speaker notes” area of your slide
    • Choose one or two people to share the highlights of the discussion, focusing on your highlighted question

15 of 21

  1. How is your team/org approaching evaluation of AI?

Rutgers: break into functions, research, services, product by product (structured); how to apply to x, y, z. Half volunteer/assigned, reporting up to a central Library/report; policy and research aren’t connected to faculty research.

ENC: computer vision evaluation for a museum; want to design an interface that allows non-specialists to evaluate the output of CV models.

U of T: personal research around AI assessment for use in technology services. I am slowly collecting AI frameworks in this Zotero folder: https://www.zotero.org/groups/5199888/ai_tools_for_archives/collections/XQGTKRRF/items/IN5HE2UF/item-list (should be public)

In Ontario there are also consortial initiatives: https://ocul.on.ca/ai-machine-learning-final-report-strategy

2. On the topic of Evaluating AI, what do you want more information/experience with?

3. What else should AI4LAM explore related to evaluation?

4. What are evaluation models we could emulate/learn from?

5. What can you share related to this topic?

https://www.comparia.beta.gouv.fr/ - for French-language materials; creates datasets to improve LLMs and compares quality and other characteristics such as environmental impact and number of parameters (LLM)

6. What other questions do you have?

How to approach evaluation when understaffed and under-resourced?

Metrics and tools for evaluating commercial AI products

16 of 21

  1. How is your team/org approaching evaluation of AI?

2. On the topic of Evaluating AI, what do you want more information/experience with?

–how to do fine-tuning

3. What else should AI4LAM explore related to evaluation?

–benchmarks for evaluation (e.g. energy used to produce an answer?)

–how to use them to decide whether to use the tech

4. What are evaluation models we could emulate/learn from?

5. What can you share related to this topic?

–2 stages of evaluation: models (NorEval for lm-eval-harness) and production systems

– 1) Create a proof-of-concept application with a nice interface using one model; 2) then try other models and get help from domain experts who can better evaluate the results/answers.
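
A minimal sketch of that second stage, assuming two placeholder model functions (not real APIs) and a CSV review sheet for domain experts to fill in; swap the placeholders for real API or local-model calls.

```python
# Minimal sketch: run the same prompts through several candidate models and
# export the answers side by side so domain experts can rate them.
# The model functions are placeholders, not real APIs.
import csv

def model_a(prompt: str) -> str:
    return f"[model A answer to: {prompt}]"

def model_b(prompt: str) -> str:
    return f"[model B answer to: {prompt}]"

candidate_models = {"model-a": model_a, "model-b": model_b}

prompts = [
    "Summarise the provenance of this collection item.",
    "Suggest three subject headings for this record.",
]

with open("expert_review_sheet.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["prompt", "model", "answer", "expert_rating", "expert_notes"])
    for prompt in prompts:
        for name, model in candidate_models.items():
            # Experts fill in the last two columns during review.
            writer.writerow([prompt, name, model(prompt), "", ""])
```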

6. What other questions do you have?

17 of 21

  1. How is your team/org approaching evaluation of AI?

Room 3

Evaluating embedding models using leaderboards, then creating a dataset for “relevancy” benchmarking.
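
A minimal sketch of such a “relevancy” benchmark, assuming a small hand-labelled set of query-to-relevant-document pairs and a placeholder embed() function standing in for whichever embedding model is being compared; the metric shown is recall@k over cosine similarity.

```python
# Minimal sketch of a relevancy benchmark for embedding models: given a
# hand-labelled set of query -> relevant-document pairs, measure how often
# the relevant document appears in the top-k nearest neighbours.
# embed() is a placeholder for the embedding model under evaluation.
import numpy as np

def embed(texts: list) -> np.ndarray:
    """Stand-in for the embedding model being compared."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))  # assumed embedding size

def recall_at_k(queries: list, docs: list, relevant_idx: list, k: int = 3) -> float:
    """Share of queries whose labelled relevant document is in the top-k results."""
    q = embed(queries)
    d = embed(docs)
    # Cosine similarity: normalise rows, then take dot products.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sims = q @ d.T
    hits = 0
    for i, rel in enumerate(relevant_idx):
        topk = np.argsort(-sims[i])[:k]
        hits += int(rel in topk)
    return hits / len(queries)

queries = ["postcards of the seafront", "first world war service records"]
docs = ["Seafront postcard album, 1910s", "Service record, 1914-1918", "Museum committee minutes, 1975"]
print(recall_at_k(queries, docs, relevant_idx=[0, 1]))
```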

Bexhill Museum UK: Looking at ways to evaluate responses based on very specific inputs of information: the paper original, the digital CMS record, and supporting information from the museum’s own library of books and papers. We have built an OpenAI-enhanced platform that reviews the records and the additional submitted information. We then plan to ask experts to review the outputs for accuracy and authenticity.

2. On the topic of Evaluating AI, what do you want more information/experience with?

3. What else should AI4LAM explore related to evaluation?

4. What are evaluation models we could emulate/learn from?

5. What can you share related to this topic?

6. What other questions do you have?

18 of 21

  1. How is your team/org approaching evaluation of AI?

Room 4

Just starting out; sense that there isn’t a lot of time for evaluation. Dealing with a lack of language expertise on the team.

Speech-to-text

Data evaluation and generation for catalog and web crawling

2. On the topic of Evaluating AI, what do you want more information/experience with?

Tools for evaluating output in a language you don’t speak

Questions/criteria to guide evaluation by subject/language specialists

3. What else should AI4LAM explore related to evaluation?

4. What are evaluation models we could emulate/learn from?

The LOC approach seems like a heavyweight evaluation method. We don’t feel we have a lot of time for evaluation, but at the same time we think it is important, because we don’t want to depend on tools that are not reliable.

5. What can you share related to this topic?

6. What other questions do you have?

19 of 21

  1. How is your team/org approaching evaluation of AI?

For small groups: evaluation while in development, to improve the work quickly.

Evaluation embedded in the process, approved and written into quality assurance procedures; a fixed process.

Risk-based evaluation: without a focus on the risks, you don’t see them.

2. On the topic of Evaluating AI, what do you want more information/experience with?

3. What else should AI4LAM explore related to evaluation?

4. What are evaluation models we could emulate/learn from?

5. What can you share related to this topic?

Evaluation embedded in the process, approved and written into quality assurance procedures; a fixed process.

Cross-disciplinary cooperation in development and evaluation.

Risk-based evaluation: without a focus on the risks, you don’t see them.

Workflows for keeping humans in the loop, which make informal evaluation more viable.

6. What other questions do you have?

20 of 21

  1. How is your team/org approaching evaluation of AI?

2. On the topic of Evaluating AI, what do you want more information/experience with?

3. What else should AI4LAM explore related to evaluation?

- Share resources made by other partners (spreadsheets, etc.)

- Explore what has been done by the EU on this topic and how it could be applied specifically to our needs

4. What are evaluation models we could emulate/learn from?

5. What can you share related to this topic?

German National Library (project about to start): training an LLM from scratch on constrained data (in order to respect copyright law) and evaluating that model

From the Library of Congress Labs – AI risk evaluation framework: https://libraryofcongress.github.io/labs-ai-bframework/

6. What other questions do you have?

21 of 21

Share out (10 mins) - Thank you!

  1. Groups sharing feedback on question 1
  2. Groups sharing feedback on question 2
  3. … question 3
  4. … question 4
  5. … question 5
  6. … question 6