Workshop on “AI in the Archives”
Thursday, 16 November 2023
European Parliament - Luxembourg - Adenauer Building – Kirchberg
TRUST AI CU05 “THE USE OF AI IN IDENTIFYING OR RECREATING ARCHIVAL AGGREGATIONS”: THE CASE OF NATO ARCHIVES AND USE OF AI TECHNOLOGY
Samir Musa
Digital Archivist
Historical Archives of the European Union (HAEU)
The CU05 Study main research question
Can we use AI tools to build or recreate archival aggregations and to enrich metadata schemas for them?
The role of AI in identifying or reconstituting archival aggregations of digital records and enriching metadata schemas
CU05
The CU05 Study main research question
In many public administrations and private companies, documents are neither classified nor aggregated.
In other cases, aggregations of documents are not well created, resulting in an uncontrolled number of documents that are not sorted, not placed in the correct folder and difficult to find.
In many cases metadata - necessary to ensure the reliability, trustworthiness, quality and sustainability of appraisal and acquisition - are missing.
Despite progress on various technologies to support document management, software support for those activities remains limited.
The CU05 Study main research question
Email management has become one of the most time-consuming activities in the public sector, in private companies and in personal life.
Emails are often managed as single records, with no bond to other emails. They are neither classified nor filed in archival aggregations (folders), nor are they connected to and classified in the creator's records management system.
[Diagram labels: Inbox, Outbox; Subject 1, Subject 2, Subject 3, ..., Subject n]
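As an illustration only, the idea of filing Inbox and Outbox messages into per-subject aggregations could be sketched as follows. This is a minimal rule-based sketch, not a method reported by the study: the function names, the sample messages and the list of reply prefixes are all assumptions.

```python
import re
from collections import defaultdict
from email import message_from_string

# Reply/forward prefixes to strip. This list is an assumption, not exhaustive.
_PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)

def normalize_subject(subject):
    """Lower-case the subject and drop leading Re:/Fw:/Fwd: markers."""
    return _PREFIX.sub("", subject or "").strip().lower()

def aggregate_by_subject(raw_messages):
    """Group raw email messages into folders keyed by normalized subject."""
    folders = defaultdict(list)
    for raw in raw_messages:
        msg = message_from_string(raw)
        folders[normalize_subject(msg["Subject"])].append(msg)
    return dict(folders)

# Hypothetical sample messages standing in for an Inbox and an Outbox.
inbox = [
    "From: a@example.org\nSubject: Budget 2024\n\nDraft attached.",
    "From: c@example.org\nSubject: Meeting room\n\nRoom 3 is free.",
]
outbox = ["From: b@example.org\nSubject: RE: Budget 2024\n\nThanks."]

folders = aggregate_by_subject(inbox + outbox)
for subject, messages in folders.items():
    print(subject, len(messages))
```

Subject matching alone cannot recover the archival bond. It only shows the kind of grouping the slide depicts.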
The overall research team
Mariella Guercio (co-chair) (Associazione nazionale archivistica italiana - ANAI)
Stefano Allegrezza (co-chair) (Università di Bologna - Research Institute for Human-Centered Artificial Intelligence-ALMA AI)
Georgia Barloura (European Free Trade Association - EFTA)
Ineke Deserno (North Atlantic Treaty Organization - NATO)
Nicola Di Matteo (Halifax University, Canada)
Georg Gaenser (European Free Trade Association - EFTA)
Massimiliano Grandi (Associazione nazionale archivistica italiana - ANAI)
Bruna La Sorda (Associazione nazionale archivistica italiana - ANAI)
Francesca Magnoni (North Atlantic Treaty Organization - NATO)
Maria Mata Caravaca (International Centre for the Study of the Preservation and Restoration of Cultural Property - ICCROM)
Leonardo Mineo (Associazione nazionale archivistica italiana - ANAI)
Samir Musa (Historical Archives of the European Union – HAEU)
Luís-Esteve Casellas Serra, Municipality of Girona – Spain (connection with AA01 “Employing AI for Retention & Disposition in Digital Information and Recordkeeping Systems (DIRS)”)
What AI technologies might be useful under what conditions
Which AI technologies could be useful for this purpose for the automatic or semi-automatic management of emails, for example:
Identification of AI companies
Initially 300 companies of interest to the study were identified.
Companies that develop IT products and:
Tools for building the list:
Identification of AI companies
Since it was not possible to interview all companies, from the initial list we selected a list of 28 companies on the basis of:
We tried to contact in particular information management specialists, software engineers, and DMOs and archivists (if any).
No. | Company | Location
1 | Microsoft | Washington, DC, USA
2 | Iron Mountain | Boston, Massachusetts, USA
3 | Adlib | Burlington, Ontario, Canada
4 | Castlepoint | Canberra, Australia
5 | Gimmal | Texas, USA
6 | Quest-it | Siena, Italy
7 | Grupo Adapting | Valencia, Spain
8 | Hyland | Westlake, Ohio, USA
9 | Stratagem | Aurora, Colorado, USA
10 | Aluma | Cambridge, UK and New York, USA
11 | Collabware | Washington, DC, USA
12 | Ephesoft | Irvine, California, USA
13 | Read-Coop | Innsbruck, Austria
14 | RecordPoint | Sydney, Australia
15 | Prism Software | California, USA
16 | Expert System | Modena, Italy
17 | GRM Document Management | New Jersey, USA
18 | Grooper | Oklahoma, USA
19 | Ripcord | Hayward, California, USA
20 | Cortical | New York, USA
21 | AmyGB.ai | Mumbai, India
22 | Bizamica | Pune, India
23 | Docxflow | Popayán, Colombia
24 | Gleematic AI | Singapore
25 | SBK Business Solutions | São Bernardo do Campo, São Paulo, Brazil
26 | Datacentrix | Johannesburg, South Africa
Plus: Anzyz (Norway) and DXC (Italy)
Rating of relevance to CU05 - Grading
The companies developing AI-based applications were divided into 4 groups according to their rating of relevance to CU05:
CU05 rating by relevance
Only 10 companies were assigned a rating of 3:
100 companies – geographical distribution
USA: 45
UK: 10
Germany: 5
Australia: 4
Italy: 4
Netherlands: 4
Austria: 3
Spain: 3
Switzerland: 3
Belgium: 2
Canada: 2
France: 2
Ireland: 2
Singapore: 2
Brazil: 1
Bulgaria: 1
Colombia: 1
Cyprus: 1
Czech Republic: 1
Finland: 1
Lithuania: 1
New Zealand: 1
Portugal: 1
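As a quick consistency check, the per-country counts can be tallied to confirm that they add up to the 100 companies in the slide title. The sketch below uses only the figures listed above; the variable names are illustrative.

```python
# Per-country counts exactly as listed on the slide.
counts = {
    "USA": 45, "UK": 10, "Germany": 5, "Australia": 4, "Netherlands": 4,
    "Austria": 3, "Spain": 3, "Switzerland": 3, "Belgium": 2, "France": 2,
    "Ireland": 2, "Singapore": 2, "Brazil": 1, "Bulgaria": 1, "Colombia": 1,
    "Cyprus": 1, "Czech Republic": 1, "Finland": 1, "Lithuania": 1,
    "New Zealand": 1, "Portugal": 1, "Italy": 4, "Canada": 2,
}

total = sum(counts.values())

# Percentage share of the total for each country.
shares = {country: 100 * n / total for country, n in counts.items()}

# Combined share of the three countries with the most companies.
top3 = sorted(counts, key=counts.get, reverse=True)[:3]
top3_share = sum(shares[country] for country in top3)

print(total, len(counts), top3, top3_share)
```

The counts sum to 100 across 23 countries, and the USA, UK and Germany together account for 60 of the 100 companies.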
Which services are delivered
They may be grouped in 5 categories:
Questionnaire and interviews
To gather more precise information, we prepared a very detailed questionnaire designed to collect systematically the information needed for an adequate assessment of the applications.
We sent the 28 companies an official invitation letter (in English, Spanish or Portuguese, according to each company's preferred language) to take part in the survey.
The questionnaire was explained orally during a preliminary meeting with information management staff and software engineers.
Subsequently, the companies filled out the questionnaire, available on Google Forms:
https://docs.google.com/forms/d/e/1FAIpQLSc8US3a89JbVjhfdma2EqYm1Xo_LVqP3qh_7kM4CJptKQStTg/viewform
The questionnaire: the sections
25 questions in four sections (Section I to Section IV), covering:
achievements
specific capabilities (for recordkeeping and email systems)
audit checks and key performance indicators
technologies and methods used in the AI applications
Companies that replied to the survey
13 companies replied:
Iron Mountain (USA)
Bis (USA)
Castlepoint Systems (Australia)
RecordPoint (Australia)
Cortical (Austria)
Read-Coop (Austria)
expert.ai (Italy)
Quest-it (Italy)
Collabware (Canada)
Grupo Adapting (Spain)
Bizamica (India)
Aluma (UK)
Anzyz Technologies AS (Norway)
The portfolio of the companies
All the market players interviewed have developed AI-based solutions for indexing and/or classifying structured, semi-structured and unstructured data and records, using machine learning techniques and automatic data extraction.
The range of specific services listed is large, detailed and diverse:
Involvement with records management and archives
The companies most involved in archives and records management underlined their capabilities in processes such as document classification, indexing, and managing the whole life cycle of documents and records, including accession to archives.
The vendors declare that their applications comply with the following standards (or have been designed to support them):
Survey outcome from the archival perspective
The majority of the market players interviewed have proved:
not filed or lost records
Survey outcome from the archival perspective
Automatic classification:
Survey outcome from the archival perspective
Survey outcome from the archival perspective
Filing / Aggregation:
Inferences on records grouping:
Inference on organisation or person:
Survey outcome from the archival perspective
In terms of records aggregation or re-aggregation, the prospects for automation are not very encouraging: this capability is confirmed only for very specific cases, such as
Automatic or semi-automatic aggregation based on document content is only suggested, and it usually relies on user validation, a human-in-the-loop workflow, or rules available at the time of creation.
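A "suggestion plus user validation" workflow of the kind described above might look like the sketch below. It is an illustration under stated assumptions, not any surveyed vendor's implementation: the keyword rules, folder names and function names are hypothetical, and simple keyword overlap stands in for whatever model a real product would use.

```python
from typing import Callable, Optional

# Hypothetical filing rules: folder name -> keywords expected in the content.
RULES = {
    "Procurement": {"invoice", "tender", "supplier"},
    "Personnel": {"contract", "leave", "salary"},
}

def suggest_folder(text: str) -> Optional[str]:
    """Return the folder whose keywords overlap most with the text, if any.

    Words are split on whitespace only, so punctuation attached to a word
    prevents a match. That is good enough for a sketch.
    """
    words = set(text.lower().split())
    best = max(RULES, key=lambda folder: len(RULES[folder] & words))
    return best if RULES[best] & words else None

def file_record(
    text: str,
    validate: Callable[[str, Optional[str]], Optional[str]],
) -> Optional[str]:
    """The system only suggests; the validator's decision is what is recorded."""
    return validate(text, suggest_folder(text))

def accept_suggestion(text: str, suggestion: Optional[str]) -> Optional[str]:
    """Stand-in for an archivist who accepts whatever is suggested."""
    return suggestion

print(file_record("Invoice from supplier for tender 12", accept_suggestion))
print(file_record("Lunch on Friday?", accept_suggestion))
```

The design point is that `file_record` never files anything by itself: a record with no matching rule stays unfiled, and the validator can always override the suggestion.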
Survey outcome from the archival perspective
Even provenance information does not seem easy for AI solutions to recognize when it has to be inferred and very specific requirements are not met, such as
The reconstitution of the archival bond, when lost or not explicitly defined, is also recognized as a complex activity unless users provide significant help and/or consistent descriptive information is available. In any case, it requires further investment that the market does not yet support.
Technology Solutions: Techniques and Analysis Models
which is unsurprising as the companies were selected because of their expertise at least in document management
However, 18 different types of techniques were reported overall.
Technology Solutions: Training Strategies
Supervised Learning: 11 companies
Semi-Supervised Learning: 4 companies
Unsupervised Learning: 6 companies
Self-Supervised Learning: 2 companies
Rule-based Learning: 2 companies
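The counts above add up to more than the number of respondents, which is consistent with companies combining several training strategies. A small check, assuming the counts refer to the 13 companies that replied to the survey:

```python
# Number of companies reporting each training strategy, as listed on the slide.
strategies = {
    "Supervised": 11,
    "Semi-Supervised": 4,
    "Unsupervised": 6,
    "Self-Supervised": 2,
    "Rule-based": 2,
}

# Assumption: the counts cover the 13 companies that replied to the survey.
respondents = 13

mentions = sum(strategies.values())
average_per_company = mentions / respondents

print(mentions, round(average_per_company, 2))
```

With 25 mentions across 13 respondents, each company reports just under two strategies on average.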
We can also see which combinations of different training strategies the companies use:
Lessons learnt
Our research is only in its first phase, but we have already recognized that we can and must accept the challenges without being intimidated:
Thank you!
Any comments are welcome!