AI4LAM Community Call
January 21, 2025
Evaluation of AI in LAMs - introduction
Why evaluate, when to evaluate, what to evaluate, who evaluates and how to evaluate?
Why evaluate AI?
Jan. 3, 2025
https://www.nytimes.com/2025/01/03/us/fable-ai-books-racism.html
Jan. 16, 2025
https://www.nytimes.com/2025/01/16/technology/apple-ai-news-notifications.html
Apr. 24, 2024
https://www.theguardian.com/australia-news/2024/apr/24/queensland-state-library-ai-war-veteran-chatbot-charlie
When to evaluate AI?
What to evaluate?
What to evaluate: Use Case
Use Case Risk Assessment example: LC Labs Use Case Risk Assessment Worksheet
The importance of ground truth data
From BHL Blog: "The Power of Community Science: How Smithsonian Volunpeers Transform Scientific Field Notes"
What to evaluate: Data Readiness/Ground Truth
What to evaluate: Models
What and How to evaluate: Output
What to evaluate: Implementation
What to evaluate: Impacts
Does the AI tool/process/program
Do the costs and risks outweigh the benefits?
Who Evaluates?
Cross-cutting and wide group of people
Help design the AI Evaluation series
Room 1
- Rutgers: break evaluation into functions (research, services), product by product (structured), and how to apply it to x, y, z. Half volunteer/assigned, reporting up to a central library report; policy and research aren't connected to faculty research.
- ENC: evaluating computer vision evaluation for museums; wants to design an interface that allows non-specialists to evaluate the output of CV models.
- U of T: personal research around AI assessment for use in technology services. "I am slowly collecting AI frameworks in this Zotero folder: https://www.zotero.org/groups/5199888/ai_tools_for_archives/collections/XQGTKRRF/items/IN5HE2UF/item-list (should be public)." In Ontario there are also consortial initiatives: https://ocul.on.ca/ai-machine-learning-final-report-strategy
5. What can you share related to this topic?
- https://www.comparia.beta.gouv.fr/ (for French-language materials): create datasets to improve LLMs; compare quality and other characteristics such as environmental impact and number of parameters (LLM).
6. What other questions do you have?
- How do you approach evaluation when understaffed and under-resourced?
- Metrics and tools for evaluating commercial AI products.
Room 2
2. On the topic of evaluating AI, what do you want more information/experience with?
- How to do fine-tuning.
3. What else should AI4LAM explore related to evaluation?
- Benchmarks for evaluation (e.g., energy used to produce an answer?).
- How to use them to decide whether to use the technology.
5. What can you share related to this topic?
- Two stages of evaluation: models (e.g., NorEval for lm-eval-harness) and production systems.
- 1) Create a proof-of-concept application with a nice interface and one model; 2) try other models and get help from domain experts who can better evaluate the results/answers. (See the harness sketch after this block.)
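For the model-evaluation stage mentioned above, here is a minimal sketch of driving EleutherAI's lm-evaluation-harness from Python. The model and task names are generic placeholders (a NorEval run would list NorEval's own task names instead); treat this as an assumed setup, not a workflow endorsed on the call.

```python
# Minimal sketch: score one candidate model with lm-evaluation-harness.
# The model and task below are placeholders, not from the call notes;
# NorEval provides its own task configs that plug into the same harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-160m",  # placeholder model
    tasks=["hellaswag"],                             # placeholder task
    num_fewshot=0,
)
print(results["results"])  # per-task metrics, e.g. accuracy
```

Swapping in another candidate model is then a one-line change, which keeps the "try with other models" step cheap to repeat.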
Room 3
- Evaluating embedding models using leaderboards, then creating a dataset for "relevancy" benchmarking (a sketch of the scoring step follows below).
- Bexhill Museum UK: looking at ways to evaluate responses based on very specific inputs of information: the paper original, the digital CMS record, and supporting information from the museum's own library of books and papers. We have built an OpenAI-enhanced platform that reviews the records and the additional submitted information. We then plan to ask experts to review the outputs for accuracy and authenticity.
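Once a small relevancy dataset exists, comparing embedding models can be as simple as a ranking metric over hand-labeled query-document pairs. Below is a hedged sketch using sentence-transformers and mean reciprocal rank (MRR); the model name and toy catalog records are invented for illustration, not drawn from the call.

```python
# Hedged sketch: score an embedding model on a tiny hand-labeled
# relevancy set using mean reciprocal rank (MRR). The model name and
# the toy records are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any candidate model

queries = ["field notebook from a 1920s bird survey"]
corpus = [
    "Ornithological field notes, 1921-1925",   # the known-relevant record
    "Minutes of the library board, 1980",
    "Herbarium specimen labels, undated",
]
relevant = {0: 0}  # query index -> index of its known-relevant document

q_emb = model.encode(queries, convert_to_tensor=True)
c_emb = model.encode(corpus, convert_to_tensor=True)
scores = util.cos_sim(q_emb, c_emb)  # (n_queries, n_docs) cosine matrix

mrr = 0.0
for qi, rel in relevant.items():
    ranking = scores[qi].argsort(descending=True).tolist()
    mrr += 1.0 / (ranking.index(rel) + 1)  # reciprocal rank of relevant doc

print(f"MRR: {mrr / len(relevant):.3f}")  # 1.0 = relevant doc ranked first
```

Rerunning the same script with a different model name yields directly comparable numbers, which is the point of a fixed benchmark set.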
Room 4
- Just starting out; the sense is that there isn't a lot of time for evaluation.
- Dealing with a lack of language expertise on the team.
- Speech-to-text.
- Data evaluation and generation for the catalog and web crawling.
2. On the topic of evaluating AI, what do you want more information/experience with?
- Tools for evaluating output in a language you don't speak (a small sketch follows below).
- Questions/criteria to guide evaluation by subject/language specialists.
4. What are evaluation models we could emulate/learn from?
- LOC seems like a heavyweight evaluation method. We don't feel we have a lot of time for evaluation, but at the same time we think it is important, because we don't want to depend on tools that are not reliable.
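For the speech-to-text case, one lightweight option: if a language specialist (or vendor) can supply even a handful of reference transcripts, word error rate (WER) gives a quantitative first check that does not require the evaluator to speak the language. A minimal sketch with the jiwer package; the transcript strings are invented placeholders.

```python
# Hedged sketch: word error rate (WER) as a first-pass check on
# speech-to-text output against specialist-provided reference
# transcripts. The strings below are placeholders.
from jiwer import wer

reference = "reference transcript provided by a language specialist"
hypothesis = "reference transcript provided by language specialist"

error_rate = wer(reference, hypothesis)  # fraction of word errors
print(f"WER: {error_rate:.2%}")
```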
Room 5
- Small-group evaluation while in development, to improve the work quickly.
5. What can you share related to this topic?
- Evaluation as an embedded process: approved and written into quality-assurance procedures; a fixed process.
- Cross-disciplinary cooperation in development and evaluation.
- Risk-based evaluation: without a focus on the risks, you don't see them.
- Workflows for keeping humans in the loop, which makes informal evaluation more viable.
Room 6
3. What else should AI4LAM explore related to evaluation?
- Share resources made by other partners (spreadsheets, etc.).
- Explore what has been done by the EU on this topic and how it could be applied specifically to our needs.
5. What can you share related to this topic?
- German National Library (project about to start): training an LLM from scratch on constrained data (i.e., in order to respect copyright law) and evaluating that model.
- From Library of Congress Labs, an AI risk evaluation framework: https://libraryofcongress.github.io/labs-ai-framework/
Share out (10 mins) - Thank you!