AI4LAM Look Book
A Growing Knowledge Base of AI Projects in Libraries, Archives and Museums
GallicaSNOOP (2018-2020)
An R&D project based on the visual similarity search engine “SNOOP”, developed by INRIA (the French national computer science lab) and an INA research team (the French audiovisual heritage agency)
GallicaSNOOP experiments
on a dataset of 1M Gallica images harvested via IIIF
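Harvesting images over IIIF boils down to composing Image API URLs from a handful of path parameters. A minimal sketch of that URL composition (the ark identifier below is a placeholder, not a real Gallica document; the actual harvesting code of the project is not shown here):

```python
# Compose a IIIF Image API URL from its five path parameters:
# /{identifier}/{region}/{size}/{rotation}/{quality}.{format}
def iiif_image_url(server: str, identifier: str,
                   region: str = "full", size: str = "full",
                   rotation: str = "0", quality: str = "native",
                   fmt: str = "jpg") -> str:
    return f"{server}/{identifier}/{region}/{size}/{rotation}/{quality}.{fmt}"

# Placeholder ark identifier for illustration only.
url = iiif_image_url("https://gallica.bnf.fr/iiif",
                     "ark:/12148/btv1b0000000000/f1")
print(url)
```

Bulk scraping is then a loop over identifiers, downloading each composed URL; the `region`/`size` parameters let a harvester request thumbnails instead of full-resolution scans.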
SNOOP
SNOOP is the search engine behind the Pl@ntNet app
https://plantnet.org/
GallicaSNOOP proof of concept
Query = press agency photo (Gallica, 1912)
Results = reproductions in newspapers (1912)
“Human in the loop” query: iterating on results
First query = user photos (2020)
Results = heritage photos (Gallica, 1910-1920)
Transkribus
A platform for the transcription, recognition and searching of historical documents
Günter Mühlberger
Digitisation and Digital Preservation group
University of Innsbruck
Dirc Jansz Schiouwer
Op hHuijden den 18en. Decemb. @ 1638.
Compareerde voor mij Hendrick Schaef
Notaris pPub. etc. Trijn Barents huijsvr: van
Rules of thumb
Google's Cloud Offering
Generic OCR API for 232 languages across 31 scripts
HTR support for languages written in Latin, Japanese, and Korean scripts
Unified model recognizing both handwriting and printed text
OCR for Bangla – The Challenge
bl.uk/early-indian-printed-books @BL_IndianPrint tom.derrick@bl.uk
Typical page from BL Bengali Books
www.bl.uk
Initiatives to find a Bangla OCR solution
ICDAR Competitions 2017 & 2019
Transkribus
primaresearch.org/datasets/REID2019 | ICDAR dataset: doi.org/10.23536/505 | Transkribus dataset: doi.org/10.23636/506
Wikisource Transcriptions
https://commons.wikimedia.org/wiki/Category:Two_Centuries_of_Indian_Print
http://www.robots.ox.ac.uk/~vgg/
Role of the VGG Digital Humanities Ambassadorship
“To disseminate VGG research to appropriate communities…” and to feed back research questions, interesting datasets, ideas for new development, bug reports, feature requests etc…
VGG implementations: searching by instance (left) and category (right)
Other VGG demos - http://www.robots.ox.ac.uk/~vgg/demo
SARAH system (2018-ongoing)
Computer vision for automated tag creation in archives
SARAH system (2018-ongoing) - overview
SARAH system (2018-ongoing) - historical figures identification
Common Crawl News
Common Crawl Stories
Open WebText
The Colossal Norwegian Corpus
Lessons Learned
https://github.com/NBAiLab/notram/
AI-Lab (from North to South)
Norwegian Transformer Model
&
Colossal Norwegian Corpus
per.kummervold@nb.no
Project State (15.12.20)
pit.schneider@bnl.etat.lu
The Netherlands Institute for Sound and Vision is a use case partner on AI for Social Sciences and Humanities research into issues of bias, framing and representation in media.

This results in user requirements research, and in the validation and demonstration of AI tooling that is trustworthy, interoperable and enables multimodal media analysis in a configurable manner.
2020 - 2024
Philo van Kemenade
pvkemenade@beeldengeluid.nl
Distributed Annotation ‘n’ Enrichment (DANE)
The Distributed Annotation ‘n’ Enrichment (DANE) system handles compute task assignment and file storage for the automatic annotation of content.
The use case for which DANE was designed centres around the issue that the compute resources and the collection of source media are not on the same device. Due to limited resources or policy choices, it might not be possible or desirable to bulk-transfer all source media to the compute resources; alternatively, the source collection might be continuously growing or require on-demand processing.
Development by Nanne van Noord
https://dane.readthedocs.io/en/latest/index.html
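The core pattern described above can be sketched as a small task broker: documents are registered by reference (a URL or path), and workers pull task records and fetch content on demand rather than requiring a bulk transfer up front. This is an illustrative sketch only, not DANE's actual API; the annotator names are hypothetical.

```python
# Sketch of decoupling the media collection from compute workers:
# tasks carry references to media, never the media itself.
from collections import deque
from dataclasses import dataclass

@dataclass
class Task:
    doc_ref: str    # reference to the source media (URL/path)
    annotator: str  # hypothetical annotator name, e.g. "shot-detection"
    state: str = "queued"

class TaskBroker:
    def __init__(self):
        self.queue = deque()
        self.results = []

    def assign(self, doc_ref, annotator):
        """Register an annotation task against a document reference."""
        self.queue.append(Task(doc_ref, annotator))

    def work(self, run_annotator):
        """Drain the queue; run_annotator fetches the doc by reference."""
        while self.queue:
            task = self.queue.popleft()
            self.results.append(run_annotator(task))

broker = TaskBroker()
broker.assign("https://archive.example/media/123.mp4", "shot-detection")
broker.work(lambda t: (t.doc_ref, t.annotator, "done"))
print(broker.results)
```

In a real deployment the queue would be a message broker and the results a shared index, but the separation of task assignment from file storage is the same.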
Text detection
Object detection
Classification
Indexing process automation
Project description
In 2014, the French National Audiovisual Institute started to redesign its whole information system, centralizing databases and harmonizing data models in order to provide and maintain data consistency.
In 2019, the architecture of this new information system was completed. This allowed us to work on AI-based solutions to fully or partially automate the segmentation and indexing of TV programs.
Institut National de l’Audiovisuel - 94360 Bry-Sur-Marne, France
A toolbox for setting up AI workflows
What have we learned?
Some takeaways from the work:
Always think use case first
Build transversal teams
Involve final users
Build human-computer interfaces
Combine tools and stay flexible
Centralize systems and models
Keep data consistency
Segmentation and indexing of daily broadcast on news channels
ReTV: Bringing Archival Content to Audiences Online
https://www.visualcapitalist.com/media-consumption-covid-19/
Rasa Bocyte rbocyte@beeldengeluid.nl @rasa_bocyte
👉 Online Demo for Video Summarisation: http://multimedia2.iti.gr/videosummarization/service/start.html 👈
Content Wizard Prototype
AV Archives in Your Pocket | 4u2 Messenger Prototype
AMP: Audiovisual Metadata Platform
Challenge: Abundance of digitized and born-digital AV media
Proposed solution: Leverage automation / machine learning together with human expertise to produce more efficient workflows
AMP workflow: Media Content + Existing Metadata → Workflow system → MGMs → Enriched Metadata → Target System → Users
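The workflow diagram can be sketched as a chain of metadata generation steps, each enriching a metadata record before it reaches the target system. This is assumed behaviour for illustration, not AMP's real code; the MGM functions below (transcription, entity tagging) are hypothetical stand-ins.

```python
# Sketch: media plus existing metadata flow through a chain of MGMs,
# each returning new fields that are merged into the record.
def run_workflow(media_ref, metadata, mgms):
    enriched = dict(metadata)
    for mgm in mgms:
        enriched.update(mgm(media_ref, enriched))
    return enriched

# Hypothetical MGMs standing in for speech-to-text, NER, etc.
def transcribe(media_ref, meta):
    return {"transcript": f"<transcript of {media_ref}>"}

def tag_entities(media_ref, meta):
    # Downstream MGMs can build on fields added by earlier ones.
    return {"entities": ["<entity>"] if "transcript" in meta else []}

result = run_workflow("tape-001.wav", {"title": "Oral history interview"},
                      [transcribe, tag_entities])
print(sorted(result))
```

The ordering matters: the entity tagger only finds anything because the transcription MGM ran first, which is why a workflow system (rather than independent batch jobs) sits at the centre of the diagram.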
Current Phase: AMP Pilot Development (AMPPD)
More information at https://go.iu.edu/amppd
Twitter: @AVMetadata
Contact: Jon Dunn, jwd@iu.edu
Dr. Oonagh Murphy, Goldsmiths, University of London
Dr. Elena Villaespesa, Pratt Institute
The Museums and Artificial Intelligence Network brought together a range of senior museum professionals and prominent academics to develop the conversation around AI, ethics and museums. This project was funded by the AHRC.
Through a series of industry workshops in London, New York and San Diego, the network facilitated in-depth discussions designed to open up debate around the key parameters, methods and paradigms of AI in a museum context.
AI + Visitor data
AI + Collection data
Do museums have the necessary data governance and processes in place to manage AI?
How do the current museum sector codes of ethics and regulations cover the rapidly growing AI field?
What are the best ethical practices to collect and analyze data with AI?
What skills might museum workers need to have to work with AI to get visitor insights?
What are the opportunities and challenges to apply AI technologies to collections data?
How can museums minimize algorithm biases to interpret their collections?
Would the lack of diversity in the museum and AI fields be reflected in the outcomes of using these technologies?
What are the implications of museums engaging with big tech companies?
Toolkit
Discovering Pathways Through Collections:
A Museum Recommender System
Recommender system using digital collections:
Museum Recommender Engine for Collections
Why might all this ML be useful?
Lukas Noehrer [@LukasNoehrer] [lukas.noehrer@manchester.ac.uk]
Data collection and entry define outcome of computational process
Reverse collecting
Questioning the dataset is pivotal to critically address:
- “history of elitism and exclusion”
- bias, racism, and inequalities in collections
The ML community needs fair and ethical approaches: avoid ‘blind ingestion of data’ → downstream harm
ML can help to explore and create new knowledge
Beware of fallacies: is there a ‘true’ interpretation? Algorithms are only as unbiased as the data used
Reproducibility and openness are key
Most commercial systems trim data to suit certain audiences
Define scope according to the data available rather than trimming the data: avoid data cosmetics
Consider different needs of audiences and their information seeking behaviour
‘Outside the search box’: exploration beyond authoritative narratives and fictitious ‘neutral’ displays
Develop together with audiences from the beginning
Data can hardly ever be used in its original state
Feature engineering presents challenges and complexities
Interdisciplinary effort
Curatorial trap
Documentation of what was added/manipulated/deleted
Choice of model and algorithm needs careful consideration and reflection
Power and limitations of ML techniques
Aim:
(i) relevant objects,
(ii) novel data,
(iii) serendipitous effects, and
(iv) diverse range
Not all models feasible due to data requirements, scalability, and affordability
The system can be evaluated with statistical methods, e.g. accuracy and model performance
User experience might be a better indicator for museum applications
Trade-offs (model vs. user-centric)
User-centric methods
Content-based: TF-IDF and convolutional autoencoder
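A minimal sketch of the content-based half of such a recommender, using TF-IDF weighting with cosine similarity over toy catalogue descriptions (the object records below are invented; the project's convolutional autoencoder over images is not shown):

```python
# Toy content-based recommender: TF-IDF vectors + cosine similarity.
import math
from collections import Counter

docs = {
    "vase":     "ceramic vase glazed pottery",
    "bowl":     "ceramic bowl glazed pottery kitchen",
    "painting": "oil painting canvas portrait",
}

def tfidf(text, df, n_docs):
    """Weight each term by its count times inverse document frequency."""
    tf = Counter(text.split())
    return {w: c * math.log(n_docs / df[w]) for w, c in tf.items()}

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Document frequency of each term, then a TF-IDF vector per object.
df = Counter(w for t in docs.values() for w in set(t.split()))
vecs = {k: tfidf(t, df, len(docs)) for k, t in docs.items()}

def recommend(item, k=1):
    """Return the k most similar other objects."""
    scores = {o: cosine(vecs[item], vecs[o]) for o in docs if o != item}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("vase"))  # the bowl shares ceramic/glazed/pottery terms
```

Real collection data would need the feature-engineering and documentation steps listed above before any of this is meaningful; the model choice (TF-IDF vs. learned embeddings) trades interpretability against coverage of visual similarity.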
MAKING SMITHSONIAN OPEN ACCESS ACCESSIBLE WITH PYTHON AND DASK
SI OPEN ACCESS RELEASE
Of the Smithsonian’s 155 million objects, 2.1 million library volumes and 156,000 cubic feet of archival collections:
3 WAYS TO ACCESS