Page 1 of 44

AI4LAM Look Book

A Growing Knowledge Base of AI Projects in Libraries, Archives and Museums

Page 2 of 44

GallicaSNOOP (2018-2020)

R&D project based on the visual similarity search engine “SNOOP”, developed by INRIA (the French national computer science lab) and an INA research team (the French audiovisual heritage agency)

GallicaSNOOP experiments

  • visual similarity search,
  • “human in the loop” feature,

on a dataset of 1M Gallica images harvested via IIIF (see the sketch below)
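As an illustration of IIIF-based harvesting, a minimal sketch that builds an IIIF Image API URL for a Gallica document and downloads one page (the ark identifier below is a placeholder, not a real document):

```python
import urllib.request

# IIIF Image API URI template:
#   {server}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
ark = "ark:/12148/btv1b00000000"  # placeholder document identifier
url = f"https://gallica.bnf.fr/iiif/{ark}/f1/full/full/0/native.jpg"
urllib.request.urlretrieve(url, "page1.jpg")  # first page, full size
```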

Page 3 of 44

SNOOP

SNOOP is the Pl@ntNet app search engine

https://plantnet.org/

Page 4 of 44

GallicaSNOOP proof of concept

Query = press agency photo (Gallica, 1912)

Results = reproductions in newspapers (1912)

“Human in the loop” query: iterating on results

First query = user photos (2020)

Results = heritage photos (Gallica, 1910-1920)

Page 5 of 44

Transkribus: A platform for the transcription, recognition and searching of historical documents

Günter Mühlberger

Digitisation and Digital Preservation group

University of Innsbruck

Page 6 of 44

Dirc Jansz Schiouwer
Op Huijden den 18en. Decemb. Ao 1638.
Compareerde voor mij Hendrick Schaef
Notaris Pub. etc. Trijn Barents huijsvr: van

Page 7 of 44

Rules of thumb

  • 300 ppi scans, good quality
  • Printed text
    • 0.5-2% Character Error Rate
  • Single hand – simple writing
    • 2-4% Character Error Rate / 10,000+ words of training data
  • Several hands – all seen during training
    • 4-6% Character Error Rate / several tens of thousands of words
  • Many hands from the same period and region – not all seen during training
    • 6-8% Character Error Rate / several hundred thousand words
  • Doubling the amount of training data gives a 20-25% decrease in error rate (see the worked example below)
  • Hands not seen in any way, or concept writing → much worse results
  • For the next few years, specialized models can be expected to remain necessary
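To make the doubling rule concrete, a small worked example; the 8% starting point and three doublings are invented, while the 20-25% relative reduction per doubling is the rule of thumb above:

```python
# Illustrative projection of the doubling rule of thumb.
# Starting CER and number of doublings are hypothetical.
cer_low = cer_high = 8.0  # starting Character Error Rate (%)
for doubling in (1, 2, 3):
    cer_low *= 0.75   # 25% relative reduction (optimistic end of the rule)
    cer_high *= 0.80  # 20% relative reduction (conservative end)
    print(f"after doubling #{doubling}: {cer_low:.1f}-{cer_high:.1f}% CER")
```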

Page 8 of 44

Page 9 of 44

Google's Cloud Offering

Generic OCR API for 232 languages across 31 scripts

HTR support for languages written in Latin, Japanese, and Korean scripts

Unified model recognizing both handwriting and printed text
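For reference, a minimal sketch of calling the generic OCR API described above with the google-cloud-vision Python client (v2+); the file name is illustrative and credentials are assumed to be configured in the environment:

```python
from google.cloud import vision  # pip install google-cloud-vision

client = vision.ImageAnnotatorClient()
with open("scan.jpg", "rb") as f:          # illustrative input file
    image = vision.Image(content=f.read())

# document_text_detection targets dense text, including handwriting
response = client.document_text_detection(image=image)
print(response.full_text_annotation.text)
```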



Page 10 of 44

READ-COOP

  • One of the first European Cooperative Societies
  • Collaboration among all user groups – shared ownership
  • Direct benefit for members – no shareholder value
  • Monetisation of the Transkribus platform
  • 5 FTEs currently working in READ-COOP to run Transkribus
  • Already 70+ members from archives, libraries, universities, companies, family historian associations and private persons have joined the coop

Page 11 of 44

Google's Cloud Offering

  • Generic public OCR API
  • Enterprise-focused private Document AI API
    • Specialized OCR for forms, invoices, etc.
    • Additional features like key-value pair extraction
    • Higher price
  • Long term
    • AutoML for OCR [Maybe. We hope.]



Page 12 of 44

OCR for Bangla – The Challenge

  • Since 2016, digitised 1,600 early Indian printed books (1713-1914)

  • Find the optimal solution for automating recognition of printed Bangla

  • Apply the method to Bengali books, provide open datasets, enable digital research, and collaborate to further research in this area

  • Extensive alphabet with complex character forms, including floating diacritics


bl.uk/early-indian-printed-books @BL_IndianPrint tom.derrick@bl.uk

Typical page from BL Bengali Books


Page 13 of 44

Initiatives to find a Bangla OCR solution

ICDAR Competitions 2017 & 2019

  • Small training set of 51 pages, ground truth transcriptions by Jadavpur University
  • Character / bag-of-words / layout analysis by PRIMA
  • Google Cloud Vision was the best performer

Transkribus

  • 94% character accuracy: a big improvement using the HTR+ engine
  • Independently verified; now processing Bangla books


primaresearch.org/datasets/REID2019 | ICDAR dataset: doi.org/10.23536/505 | Transkribus dataset: doi.org/10.23636/506


Page 14 of 44


Wikisource Transcriptions

  • Bengali books uploaded to Wikisource with a side-by-side transcription view
  • Integrated OCR using the Google Vision API
  • 2021 competition with the Bengali Wikisource community to proofread the OCR
  • Export transcriptions as an open dataset and enable keyword searching

https://commons.wikimedia.org/wiki/Category:Two_Centuries_of_Indian_Print


Page 15 of 44

http://www.robots.ox.ac.uk/~vgg/

Page 16 of 44

Role of the VGG Digital Humanities Ambassadorship

“To disseminate VGG research to appropriate communities…” and to feed back research questions, interesting datasets, ideas for new development, bug reports, feature requests etc…

Page 17 of 44

VGG implementations: searching by instance (left) and category (right)

Page 18 of 44

SARAH system (2018-ongoing)

Computer vision for automated tag creation in archives

  • zuzana.bukovcikova@stuba.sk
  • Slovak University of Technology in Bratislava
  • New Age Factory

Page 19 of 44

SARAH system (2018-ongoing) - overview

  • Input: digitised historical images (ca. 1880-1920) from the Slovak National Archives, with only filenames
  • Goal: make them searchable (create tags)
  • Research:
    • Detection of faces in images (see the sketch after this list)
    • Recognition of similar faces
    • Basic object detection:
      • Filter images containing people, animals, etc.
  • Functionalities:
    • Interface for searching and viewing content
    • Interface for smart face tagging
    • User tag creation
  • Ongoing topics/research/discussion:
    • Insufficient amount of (historical) training data with tags
    • Discussion about the desired outcome (tags): collection-specific?
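As an illustration of the face-detection step listed above, a minimal sketch using an off-the-shelf OpenCV Haar cascade; this is not SARAH's actual pipeline, and the image paths are invented:

```python
import cv2

# Off-the-shelf frontal-face Haar cascade shipped with opencv-python
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("archive_photo.jpg")            # illustrative input
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:                       # one box per detected face
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("tagged.jpg", img)
```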

Page 20 of 44

SARAH system (2018-ongoing) - historical figures identification

Page 21 of 44

Common Crawl News

Common Crawl Stories

Open WebText

The Colossal Norwegian Corpus

Page 22 of 44

Lessons Learned

  • You need to understand the structure of the ALTO files to be able to extract the text (see the sketch after this list)

  • OCR quality
    • After 2018, OCR is close to perfect.
    • We filtered out a lot of older text, especially text OCRed prior to 2009.
    • The scans themselves are perfect, so a re-OCR will make more text usable.

  • It takes some time. We have spent roughly 6 man-months on this work.

  • Some processes need large computing resources. Extracting text from Common Crawl is almost impossible for us.
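A minimal sketch of pulling plain text out of one ALTO file by reading the CONTENT attribute of each String element; the ALTO v3 namespace and the file name are assumptions (the namespace URI varies by ALTO version):

```python
import xml.etree.ElementTree as ET

# ALTO v3 namespace assumed; adjust for the version your files use.
ALTO = "{http://www.loc.gov/standards/alto/ns-v3#}"

def alto_to_text(path):
    root = ET.parse(path).getroot()
    lines = []
    for line in root.iter(ALTO + "TextLine"):
        # each String element carries one recognized word in CONTENT
        words = [s.get("CONTENT", "") for s in line.iter(ALTO + "String")]
        lines.append(" ".join(words))
    return "\n".join(lines)

print(alto_to_text("page.alto.xml"))  # illustrative file name
```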

Page 23 of 44

Norwegian Transformer Model & Colossal Norwegian Corpus

https://github.com/NBAiLab/notram/

AI-Lab (from North to South):

  • Per Egil Kummervold
  • Freddy Wetjen
  • Svein Arne Brygfjeld
  • Javier de la Rosa

per.kummervold@nb.no

Page 24 of 44

Project State (15.12.20)

  • Improve Optical Character Recognition
    • Accuracy: improvement for an estimated 30% of text blocks
    • Production: OCR to be applied to the entire corpus in the coming weeks
  • Improve Newspaper Exploration
    • NER: ground truth generation finished; observing first results
    • New User Interface: front-end development started for the new interactive interface

pit.schneider@bnl.etat.lu

Page 25 of 44

  • Quality Evaluation
    • kNN algorithm
    • Features: dictionary matching, trigram similarity, garbage string detection and publication year
  • Binarization (see the sketch after this list)
    • OpenCV
    • Cleaning, dilation, padding, inversion and white-on-black detection
  • Segmentation
    • OpenCV
    • Combination of morphology operations and horizontal histogram projection
  • Font Recognition
    • Keras & TensorFlow
    • Binary CNN classifier distinguishing between Fraktur and other fonts
  • OCR
    • kraken OCR
    • Models trained on custom ground truth data
  • ALTO
    • Output is stored in ALTO XML files
  • NER
    • spaCy
    • Models trained on ground truth data annotated using the Prodigy annotation tool
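For illustration, a sketch of a binarization step in the spirit of the one above (OpenCV); the thresholds, kernel size, and file paths are invented, not BnL's production values:

```python
import cv2
import numpy as np

def binarize(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Otsu thresholding separates ink from paper
    _, binary = cv2.threshold(img, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # White-on-black detection: invert pages that are mostly dark
    if np.mean(binary) < 127:
        binary = cv2.bitwise_not(binary)
    # Erosion expands the dark (ink) strokes on a white background
    binary = cv2.erode(binary, np.ones((2, 2), np.uint8), iterations=1)
    # Padding leaves a white margin around the text for the OCR engine
    return cv2.copyMakeBorder(binary, 10, 10, 10, 10,
                              cv2.BORDER_CONSTANT, value=255)

cv2.imwrite("binarized.png", binarize("scan.png"))  # illustrative paths
```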

pit.schneider@bnl.etat.lu

Page 26 of 44

The Netherlands Institute for Sound and Vision is a use case partner on AI for Social Sciences and Humanities research into issues of bias, framing and representation in media. This results in user requirements research, and in the validation and demonstration of AI tooling that is Trustworthy, Interoperable and enables multimodal media analysis in a configurable manner.


Philo van Kemenade

pvkemenade@beeldengeluid.nl

Page 27 of 44

Distributed Annotation ‘n’ Enrichment (DANE)

The Distributed Annotation ‘n’ Enrichment (DANE) system handles compute task assignment and file storage for the automatic annotation of content.

The use case for which DANE was designed centres on the issue that the compute resources and the collection of source media are not on the same device. Due to limited resources or policy choices, it might not be possible or desirable to bulk-transfer all source media to the compute resources; alternatively, the source collection might be continuously growing or require on-demand processing.
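To illustrate that design (this is not DANE's actual API; see the documentation linked just below), a minimal worker that fetches each source file on demand rather than bulk-transferring the collection:

```python
import json
import urllib.request

def analyse(path):
    # Stand-in for an MGM-style analysis step (text/object detection, ...)
    with open(path, "rb") as f:
        return {"size_bytes": len(f.read())}

def run_worker(tasks, results):
    # Fetch each source file on demand instead of bulk-transferring the
    # whole collection to the compute side.
    for task in tasks:
        local_path, _ = urllib.request.urlretrieve(task["source_url"])
        results[task["id"]] = json.dumps(analyse(local_path))

results = {}
run_worker([{"id": "t1",                       # invented task record
             "source_url": "https://example.org/clip.mp4"}], results)
print(results)
```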

Development by Nanne van Noord
https://dane.readthedocs.io/en/latest/index.html

Example tasks: text detection, object detection, classification

Page 28 of 44

Indexing process automation

Project description

In 2014, the French National Audiovisual Institute (INA) started to redesign its whole information system, centralizing databases and harmonizing data models in order to provide and maintain data consistency.

In 2019, the architecture of this new information system was completed. This allowed us to work on AI-based solutions to fully or partially automate the segmentation and indexing of TV programs.

Olivio Segura - osegura@ina.fr

Project manager - Collection Management Department

Gautier Poupeau - gpoupeau@ina.fr

Data architect - Information system division

Institut National de l’Audiovisuel - 94360 Bry-Sur-Marne, France

Page 29 of 44

A toolbox built for setting up AI workflows

Page 30 of 44

What have we learned?

Some takeaways from the work:

  • Always think use case first
  • Build transversal teams
  • Involve final users
  • Build human-computer interfaces
  • Combine tools and stay flexible
  • Centralize systems and models
  • Keep data consistency

Segmentation and indexing of daily broadcasts on news channels

Page 31 of 44

ReTV: Bringing Archival Content to Audiences Online

https://retv-project.eu/

  • if you build it, that doesn't mean audiences will come looking for you, or stay
  • carefully consider the benefits & risks of digital channels, platforms and devices that bring content directly to your audiences
  • use AI to monitor engagement and create personalised & generous experiences at scale

https://www.visualcapitalist.com/media-consumption-covid-19/

Rasa Bocyte rbocyte@beeldengeluid.nl @rasa_bocyte

Page 32 of 44

  • Goal: adapt full-length AV content for publication on social media
  • Video summarisation encapsulates the flow of the story and its essential parts in a short trailer
  • Easily editable & customisable to support creative processes
  • Supportive role of AI in creative processes - no need to automate everything


👉 Online Demo for Video Summarisation: http://multimedia2.iti.gr/videosummarization/service/start.html 👈

Content Wizard Prototype

Rasa Bocyte rbocyte@beeldengeluid.nl @rasa_bocyte

Page 33 of 44

AV Archives in Your Pocket | 4u2 Messenger Prototype

  • Personalised video feed based on explicit user reactions (video emojis)
  • Bringing archival content into your information stream
  • Leverage the long-tail of collections - matching niche content with individual interests
  • Balance between personalisation & serendipity
  • Adapting to consumption habits on commercial platforms but also influencing new habits by bringing contextualised & trustworthy content online

Rasa Bocyte rbocyte@beeldengeluid.nl @rasa_bocyte

Page 34 of 44

AMP: Audiovisual Metadata Platform

Challenge: Abundance of digitized and born-digital AV media

  • Including material from mass digitization projects such as Indiana University's MDPI
  • Lack of metadata for Discovery, Identification, Navigation, Rights, Accessibility
  • Institutions lack resources for large cataloging/transcription/inventory/rights clearance projects

Proposed solution: Leverage automation / machine learning together with human expertise to produce more efficient workflows

  • Workflow pipeline for MGMs (metadata generation mechanisms); see the sketch below
  • Integration of automated/AI-based MGMs: speech-to-text, video OCR, NLP, segmentation, object detection, music IR, …
  • Integration of human MGMs
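A minimal sketch of the pipeline idea: each MGM takes the media file plus the metadata accumulated so far and contributes new fields. Function names and inputs are invented for illustration; AMP's real workflows run on the Galaxy engine, as noted on a later slide.

```python
# Hypothetical MGM pipeline sketch (names and inputs invented).
def speech_to_text(media, metadata):
    return {"transcript": "..."}        # stand-in for a real STT MGM

def named_entities(media, metadata):
    # a real MGM could run NER over metadata["transcript"]
    return {"entities": ["Indiana University"]}

def run_pipeline(media, metadata, mgms):
    for mgm in mgms:                    # each MGM enriches the metadata
        metadata.update(mgm(media, metadata))
    return metadata

enriched = run_pipeline("event.wav", {"title": "Campus event"},
                        [speech_to_text, named_entities])
print(enriched)
```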

[Architecture diagram: Media Content and Existing Metadata feed the AMP workflow system, which runs a pipeline of MGMs and delivers Enriched Metadata to a Target System and its Users.]

Page 35 of 44

Current Phase: AMP Pilot Development (AMPPD)

  • Andrew W. Mellon Foundation, October 2018 - June 2021
  • Build and pilot AMP system using three test collections of ~100 hours each:
    • 2 from Indiana University: University Archives events, School of Music performances
    • 1 from New York Public Library: Gay Men’s Health Crisis Collection
  • Develop workflow engine (using Galaxy), user interface
  • Evaluate and integrate both commercial and open source MGM tools
  • Test proposed approach, including use of metadata in target systems (e.g. Avalon Media System)
  • Create foundation for future development and deployment

More information at https://go.iu.edu/amppd
Twitter: @AVMetadata
Contact: Jon Dunn, jwd@iu.edu



Page 38 of 44

Dr. Oonagh Murphy, Goldsmiths, University of London

Dr. Elena Villaespesa, Pratt Institute

themuseumsai.network

The Museums and Artificial Intelligence Network brought together a range of senior museum professionals and prominent academics to develop the conversation around AI, ethics and museums. This project was funded by the AHRC.

Through a series of industry workshops in London, New York and San Diego, the network facilitated in-depth discussions designed to open up debate around the key parameters, methods and paradigms of AI in a museum context.


Page 39 of 44

AI + Visitor data

  • Do museums have the necessary data governance and processes in place to manage AI?
  • How do the current museum sector codes of ethics and regulations cover the rapidly growing AI field?
  • What are the best ethical practices for collecting and analyzing data with AI?
  • What skills might museum workers need in order to work with AI to get visitor insights?

AI + Collection data

  • What are the opportunities and challenges in applying AI technologies to collections data?
  • How can museums minimize algorithmic biases in interpreting their collections?
  • Would the lack of diversity in the museum and AI fields be reflected in the outcomes of using these technologies?
  • What are the implications of museums engaging with big tech companies?

Page 40 of 44

Toolkit

Page 41 of 44

Discovering Pathways Through Collections:

A Museum Recommender System

Recommender system using digital collections:

Museum Recommender Engine for Collections

  • Creates individually curated pathways
  • Potential to capture the tastes and interests of users
  • Uses metadata and images (extendable)
  • Born out of the idea to support co-curation of community-generated exhibitions and facilitate exploration of collections on learning platforms
  • System acts as an intermediary and facilitator rather than a mere access endpoint

Why might all this ML be useful?

  • Helps to explore collections in a different way, beyond classic collection search interfaces
  • Helps to engage with the full breadth of collections in a personalised and more meaningful way
  • Helps professionals to retrieve information in a new way
  • Gives us new ways of seeing collections and meaning-making
  • Engages communities and can be used as a democratic way of curating

Lukas Noehrer [@LukasNoehrer] [lukas.noehrer@manchester.ac.uk]

Page 42 of 44

Discovering Pathways Through Collections:

A Museum Recommender System

  • Data collection and entry define the outcome of the computational process
  • “Reverse collecting”
  • Questioning the dataset is pivotal to critically address:
    • the “history of elitism and exclusion”
    • bias, racism, and inequalities in collections
  • The ML community needs fair and ethical approaches: avoid ‘blind ingestions of data’ → downstream harm
  • ML can help to explore and create new knowledge
  • Caveat of fallacies: is there a true interpretation? Algorithms are just as biased as the data used
  • Reproducibility and openness are key
  • Most commercial systems trim data to suit certain audiences
  • Define according to the data available rather than trimming the data: avoid data cosmetics
  • Consider the different needs of audiences and their information-seeking behaviour
  • “Outside the search box”: exploration beyond authoritative narratives and fictitious ‘neutral’ displays
  • Develop together with audiences from the beginning
  • Data can hardly ever be used in its original state
  • Feature engineering presents challenges and complexities: an interdisciplinary effort; beware the curatorial trap
  • Document what was added/manipulated/deleted
  • The choice of model and algorithm needs careful consideration and reflection: the power and limitations of ML techniques
  • Aim for: (i) relevant objects, (ii) novel data, (iii) serendipitous effects, and (iv) a diverse range
  • Not all models are feasible, due to data requirements, scalability, and affordability
  • The system can be evaluated with statistical methods, e.g. accuracy and model performance
  • User experience might be a better indicator for museum applications: trade-offs (model- vs. user-centric); user-centric methods
  • Content-based approach: TF-IDF and a convolutional autoencoder (see the sketch below)
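As a sketch of the content-based approach named in the last bullet, TF-IDF over object descriptions with cosine similarity (scikit-learn); the toy records are invented, not museum data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy metadata records standing in for museum object descriptions
objects = [
    "bronze age axe head found in a river bed",
    "victorian portrait painting of a textile merchant",
    "bronze statuette of a horse, roman period",
    "hand loom used in 19th century textile manufacture",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(objects)
similarity = cosine_similarity(vectors)

# Recommend the most similar other object for each one
for i, desc in enumerate(objects):
    best = max((j for j in range(len(objects)) if j != i),
               key=lambda j: similarity[i, j])
    print(f"{desc!r} -> {objects[best]!r}")
```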

Page 43 of 44

Discovering Pathways Through Collections:

A Museum Recommender System

Page 44 of 44

MAKING SMITHSONIAN OPEN ACCESS ACCESSIBLE WITH PYTHON AND DASK

SI OPEN ACCESS RELEASE

Of the Smithsonian's 155 million objects, 2.1 million library volumes and 156,000 cubic feet of archival collections, the release includes:

  • 2.8 million 2-D and 3-D images
  • Over 17 million collection metadata objects

3 WAYS TO ACCESS

  • Web API
  • GitHub
  • AWS S3
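A minimal sketch of the kind of processing the title refers to: scanning a large set of newline-delimited JSON metadata files in parallel with Dask. The path and the "unit_code" field are hypothetical, not the actual layout of the Smithsonian S3 bucket:

```python
import json
import dask.bag as db

# Load newline-delimited JSON metadata records in parallel;
# the glob pattern and field name are illustrative only.
records = db.read_text("metadata/*.jsonl").map(json.loads)

# Count records per (hypothetical) holding-unit code
counts = records.pluck("unit_code", None).frequencies()
print(sorted(counts.compute(), key=lambda kv: -kv[1])[:10])
```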