1 of 27

Workshop on “AI in the Archives”
Thursday, 16 November 2023
European Parliament - Luxembourg - Adenauer Building – Kirchberg

TRUST AI CU05 “THE USE OF AI IN IDENTIFYING OR RECREATING ARCHIVAL AGGREGATIONS”: THE CASE OF NATO ARCHIVES AND USE OF AI TECHNOLOGY

Samir Musa
Digital Archivist
Historical Archives of the European Union (HAEU)

2 of 27

The CU05 Study main research question

Can we use AI tools to build or recreate archival aggregations and to enrich metadata schemas for them?

CU05: The role of AI in identifying or reconstituting archival aggregations of digital records and enriching metadata schemas

3 of 27

The CU05 Study main research question

In many public administrations and private companies, documents are neither classified nor aggregated.

In other cases, aggregations of documents are poorly formed, resulting in an uncontrolled number of documents that are unsorted, not placed in the correct folder and difficult to find.

In many cases metadata - necessary to ensure the reliability, trustworthiness, quality and sustainability of appraisal and acquisition - are missing.

Despite progress on various technologies to support document management, software support for those activities remains limited.


4 of 27

The CU05 Study main research question

Email management has become one of the most time-consuming activities in the public sector, in private companies and in personal life.

Emails are often managed as single records with no bond to other emails; they are neither classified or filed in archival aggregations (folders) nor connected to and classified in the records management system of the creator.


[Diagram: Inbox and Outbox; Subject 1, Subject 2, Subject 3, ... Subject n]
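As a rough illustration of what filing emails by subject could look like, the sketch below groups messages whose subject lines differ only by reply or forward prefixes. It is not a method taken from the study: the field names `id` and `subject`, and the idea of using the subject line as the grouping key, are assumptions made for this example only.

```python
import re
from collections import defaultdict

# Matches one or more leading "Re:", "Fw:" or "Fwd:" prefixes (any letter case).
_PREFIX = re.compile(r"^(?:\s*(?:re|fw|fwd)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject: str) -> str:
    """Strip reply/forward prefixes and normalize case and outer whitespace."""
    return _PREFIX.sub("", subject).strip().lower()


def group_by_subject(emails):
    """Group email ids under their normalized subject line.

    `emails` is a list of dicts with the hypothetical keys "id" and "subject".
    """
    groups = defaultdict(list)
    for email in emails:
        groups[normalize_subject(email["subject"])].append(email["id"])
    return dict(groups)


sample = [
    {"id": "m1", "subject": "Budget 2023"},
    {"id": "m2", "subject": "RE: Budget 2023"},
    {"id": "m3", "subject": "Fwd: Re: budget 2023"},
    {"id": "m4", "subject": "Meeting minutes"},
]
groups = group_by_subject(sample)
```

A subject line alone is a weak signal (subjects get reused or changed mid-conversation), so groups built this way would still need review before being treated as archival aggregations.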

5 of 27

The overall research team

Mariella Guercio (co-chair) (Associazione nazionale archivistica italiana - ANAI)

Stefano Allegrezza (co-chair) (Università di Bologna - Research Institute for Human-Centered Artificial Intelligence-ALMA AI)

Georgia Barloura (European Free Trade Association - EFTA)

Ineke Deserno (North Atlantic Treaty Organization - NATO)

Nicola Di Matteo (Halifax University, Canada)

Georg Gaenser (European Free Trade Association - EFTA)

Massimiliano Grandi (Associazione nazionale archivistica italiana - ANAI)

Bruna La Sorda (Associazione nazionale archivistica italiana - ANAI)

Francesca Magnoni (North Atlantic Treaty Organization - NATO)

Maria Mata Caravaca (International Centre for the Study of the Preservation and Restoration of Cultural Property - ICCROM)

Leonardo Mineo (Associazione nazionale archivistica italiana - ANAI)

Samir Musa (Historical Archives of the European Union – HAEU)

Luís-Esteve Casellas Serra (Municipality of Girona, Spain; connection with AA01 “Employing AI for Retention & Disposition in Digital Information and Recordkeeping Systems (DIRS)”)


6 of 27

What AI technologies might be useful under what conditions

Which AI technologies could be useful for this purpose for the automatic or semi-automatic management of emails, for example:

for automatic classification?
for aggregating the records?
for filtering emails?
for integrating metadata for describing the creation context and use?
for automatic appraisal and disposal?

7 of 27

Identification of AI companies

Initially 100 companies of interest to the study were identified.

Companies that develop IT products and:

are based on AI-related technologies
are relevant to the scope of the CU05 study

Tools for building the list:

direct Internet searches using keywords and text strings;
resources and knowledge made available by professionals�(Alan Pelz-Sharpe, Andrew Warland, James Lappin, Jenny Bunn and Paul Young)


8 of 27

Identification of AI companies

Since it was not possible to interview all companies, from the initial list we selected a list of 28 companies on the basis of:

their portfolio
their direct involvement in the record field
their compliance with regulatory frameworks and standards relevant in the domain
the general reputation of the company.

In particular, we tried to contact information management specialists, software engineers, DMOs and archivists (if any).


No.	Company	Location	Website
1	Microsoft	Redmond, Washington, USA	https://www.microsoft.com/en-gb
2	Iron Mountain	Boston, Massachusetts, USA	www.ironmountain.com
3	Adlib	Burlington, Ontario, Canada	www.adlibsoftware.com
4	Castlepoint	Canberra, Australia	www.castlepoint.systems
5	Gimmal	Texas, USA	https://www.gimmal.com/
6	Quest-it	Siena, Italy	www.quest-it.com
7	Grupo Adapting	Valencia, Spain	https://www.adapting.com/en/
8	Hyland	Westlake, Ohio, USA	https://www.hyland.com/en
9	Stratagem	Aurora, Colorado, USA	www.stratagemgroup.com
10	Aluma	Cambridge, UK and New York, USA	https://aluma.io/
11	Collabware	Washington, DC, USA	collabware.com
12	Ephesoft	Irvine, California, USA	https://ephesoft.com/
13	Read-Coop	Innsbruck, Austria	https://readcoop.eu/transkribus/
14	RecordPoint	Sydney, Australia	www.recordpoint.com
15	Prism Software	California, USA	https://prismsoftware.com/
16	Expert System	Modena, Italy	https://www.expert.ai/
17	GRM Document Management	New Jersey, USA	https://www.grmdocumentmanagement.com/
18	Grooper	Oklahoma, USA	https://www.bisok.com/intelligent-document-processing/
19	Ripcord	Hayward, California, USA	www.ripcord.com
20	Cortical	New York, USA	www.cortical.io
21	AmyGB.ai	Mumbai, India	www.amygb.ai
22	Bizamica	Pune, India	www.bizamica.com
23	Docxflow	Popayán, Colombia	https://www.docxflow.com/
24	Gleematic AI	Singapore	https://gleematic.com/
25	SBK Business Solutions	São Bernardo do Campo, São Paulo, Brazil	www.sbkbs.com.br
26	Datacentrix	Johannesburg, South Africa	www.datacentrix.co.za

Plus: Anzyz (Norway) and DXC (Italy)

9 of 27

Rating of relevance to CU05 - Grading

The companies developing AI-based applications were divided into 4 groups according to their rating of relevance to CU05:

Value “3” – the highest. The company states their products support archives and/or RM and furthermore either pledges compliance with a recognized RM or archives standard / good practice or has been endorsed by some reputable archival or RM institution (e.g. UK TNA)
Value “2”. The company states their products support archives and/or RM (but no pledge of compliance with domain-related standards/rules and no endorsement by archival or RM institution)
Value “1”. The company just states their product supports document management (no reference to archives or RM)
Value “0”. The company does not openly include document management among the objectives of its mission, but some features of its product(s) might be of interest to CU05
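The grading rule above can be read as a small decision function. The sketch below is only an illustration of that reading: the function name and the four boolean flags are hypothetical labels for the criteria on this slide, not part of the study's own tooling.

```python
def cu05_rating(supports_rm: bool, pledges_standard: bool,
                endorsed: bool, supports_dm: bool) -> int:
    """Return the CU05 relevance rating (0 to 3) for one company.

    All parameter names are illustrative assumptions:
      supports_rm      - products are stated to support archives and/or RM
      pledges_standard - compliance with a recognized RM/archives standard
                         or good practice is pledged
      endorsed         - endorsed by a reputable archival or RM institution
      supports_dm      - products are stated to support document management
    """
    if supports_rm and (pledges_standard or endorsed):
        return 3  # archives/RM support plus a standard pledge or an endorsement
    if supports_rm:
        return 2  # archives/RM support only
    if supports_dm:
        return 1  # document management only
    return 0      # document management not stated, but features may be of interest


# Example: a vendor that supports RM and has an institutional endorsement.
example_rating = cu05_rating(supports_rm=True, pledges_standard=False,
                             endorsed=True, supports_dm=True)
```

The checks run from the strictest condition down, so a company is always given the highest rating whose conditions it meets.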


10 of 27

CU05 rating by relevance

Only 10 companies were assigned a rating of 3:

Adlib – Ontario, Canada - www.adlibsoftware.com
Castlepoint – Canberra, Australia - www.castlepoint.systems
DocuNav – Texas, US - www.docunav.com
Docxflow – Popayán, Colombia - www.docxflow.com
Gimmal – Texas, US - www.gimmal.com
Grupo Adapting – Valencia, Spain - www.adapting.com
Hyland – Ohio, US - www.hyland.com
Quest-IT – Siena, Italy - www.quest-it.com
RecordPoint – Sydney, Australia - www.recordpoint.com
Stratagem – Colorado, US - www.stratagemgroup.com


11 of 27

100 companies – geographical distribution

USA: 45
UK: 10
Germany: 5
Australia: 4
Italy: 4
Netherlands: 4
Austria: 3
Spain: 3
Switzerland: 3
Belgium: 2
Canada: 2
France: 2
Ireland: 2
Singapore: 2
Brazil: 1
Bulgaria: 1
Colombia: 1
Cyprus: 1
Czech Republic: 1
Finland: 1
Lithuania: 1
New Zealand: 1
Portugal: 1
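As a quick arithmetic check, the per-country figures on this slide can be summed to confirm that they add up to the 100 companies in the slide title. The snippet is just that check; the variable names are arbitrary.

```python
# Figures copied from the slide above.
distribution = {
    "USA": 45, "UK": 10, "Germany": 5, "Australia": 4, "Netherlands": 4,
    "Austria": 3, "Spain": 3, "Switzerland": 3, "Belgium": 2, "France": 2,
    "Ireland": 2, "Singapore": 2, "Brazil": 1, "Bulgaria": 1, "Colombia": 1,
    "Cyprus": 1, "Czech Republic": 1, "Finland": 1, "Lithuania": 1,
    "New Zealand": 1, "Portugal": 1, "Italy": 4, "Canada": 2,
}

total = sum(distribution.values())        # number of companies overall
usa_share = distribution["USA"] / total   # fraction based in the USA

print(total)               # 100
print(f"{usa_share:.0%}")  # 45%
```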

12 of 27

Which services are delivered

They may be grouped in 5 categories:

Automatic indexation / classification - this seems to be by far the most advertised function;
Automatic data extraction - for paper documents, often at the same time as the document is scanned;
Intelligent processing - i.e. the application starts and advances processes automatically on the basis of features detected in the document, e.g. routing documents to specific people or implementing retention schedules;
Intelligent discovery – information retrieval by e.g. comparing documents or analyzing concepts;
Automatic redaction (relating to data protection)


13 of 27

Questionnaire and interviews

To gather more precise information, we prepared a very detailed questionnaire aimed at systematically collecting the information needed for an adequate assessment of the applications.

We sent the 28 companies an official invitation letter (in English, Spanish or Portuguese, according to the preferred language of the company) to take part in the survey.

The questionnaire was explained orally during a preliminary meeting with information management staff and software engineers.

Subsequently, the companies filled out the questionnaire, available on Google Forms:

https://docs.google.com/forms/d/e/1FAIpQLSc8US3a89JbVjhfdma2EqYm1Xo_LVqP3qh_7kM4CJptKQStTg/viewform

14 of 27

The questionnaire: the sections

25 questions in four sections (I to IV):

achievements
specific capabilities (for recordkeeping and email systems)
audit-checks -- key performance indicators
technologies and methods used in the AI applications

15 of 27

Companies that replied to the survey

13 companies replied:

Iron Mountain (USA)

Bis (USA)

Castlepoint Systems (Australia)

RecordPoint (Australia)

Cortical (Austria)

Read-Coop (Austria)

expert.ai (Italy)

Quest-it (Italy)

Collabware (Canada)

Grupo Adapting (Spain)

Bizamica (India)

Aluma (UK)

Anzyz Technologies AS (Norway)

16 of 27

The portfolio of the companies

All the market players interviewed have developed AI-based solutions for indexing and/or classifying structured, semi-structured and unstructured data/records, based on machine learning techniques and automatic data extraction.

The range of specific services listed is huge, detailed and diversified:

these peculiarities do not necessarily reflect really different approaches;
could the variety of creative solutions be the consequence of the complex tasks required to respect the peculiarities of archival requirements? or
does it reflect the intrinsic nature of dynamic technologies that are still evolving and transforming?

17 of 27

Involvement with records management and archives

The companies most involved in archives and records management underlined their capabilities in processes such as document classification, indexing and managing the whole life cycle of documents and records, including accession to archives.

The companies declare that their applications comply with the following standards (or have been designed to support them):

ISO 15489 (Records management);
ISO 16175 (Information and documentation — Processes and functional requirements for software for managing records);
ISO 23081-1:2017 (Information and documentation — Records management processes - Metadata for records);
ISO 30301:2019 (Information and documentation — Management systems for records — Requirements);
ISO/IEC 27001 (Information security management systems);
MoReq2010 (Modular Requirements for Records Systems).


18 of 27

Survey outcome from the archival perspective

The majority of the market players interviewed have shown that they:

understand the complexity and the relevance of the archival environment and functions
are aware of the uniqueness of the original metadata acquired in the creator’s current activities, whether the issue is the automatic classification of records or the creation of archival aggregations.


[Diagram label: not filed or lost records]

19 of 27

Survey outcome from the archival perspective

Automatic classification:

analysis of metadata elements available both in the records and aggregations (case-folder specifications)
identification of document type
if the available metadata prove insufficient for classification, classification is based on the record content
generation of labels and tags belonging to any record classification scheme (taxonomy or term ontology)
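The "metadata first, content as fallback" order described above can be sketched as follows. This is a toy illustration under stated assumptions: the class codes, the `document_type` metadata key and the keyword lists are all invented for the example, and plain keyword counting stands in for whatever AI model a real product would use.

```python
from typing import Optional

# Hypothetical mapping from a document-type metadata value to a class code.
TYPE_TO_CLASS = {"invoice": "FIN-01", "contract": "LEG-02"}

# Hypothetical keywords, consulted only when the metadata is insufficient.
CONTENT_KEYWORDS = {
    "FIN-01": ("invoice", "payment"),
    "LEG-02": ("agreement", "contract"),
}


def classify(metadata: dict, content: str) -> Optional[str]:
    """Return a class code, preferring metadata over record content."""
    # Step 1: use the document type found in the metadata, if it is known.
    doc_type = metadata.get("document_type")
    if doc_type in TYPE_TO_CLASS:
        return TYPE_TO_CLASS[doc_type]

    # Step 2: fall back to the content and count keyword occurrences per class.
    text = content.lower()
    scores = {
        code: sum(text.count(keyword) for keyword in keywords)
        for code, keywords in CONTENT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # No keyword found at all: leave the record unclassified for a human.
    return best if scores[best] > 0 else None
```

For example, `classify({"document_type": "invoice"}, "")` is resolved from the metadata alone, while `classify({}, "This agreement is signed")` has to fall back to the content.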


20 of 27

Survey outcome from the archival perspective

Metadata fields, whether found or inferred, are central to every reply.

The records typology – when available – is often considered another crucial component for the successful application of AI techniques to the records.

In terms of records classification, only one company pointed out that its platform can be trained by users on a specific data set so that it autonomously generates labels and tags for any records classification scheme based on a taxonomy or term ontology.

In the other cases, human intermediation is considered irreplaceable for achieving consistent results.


21 of 27

Survey outcome from the archival perspective

Filing / Aggregation:

by document type
original structure of the content source
generation of labels and tags from any record classification scheme, based on the record content

Inferences on records grouping:

Based on content and/or context
If there is metadata to represent those processes (e.g. a case file number)

Inference on organisation or person:

If the involved entities are stated in the content of the document
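The simplest of the grouping inferences above, grouping by a metadata element such as a case file number, can be sketched as below. The record structure and the key name `case_file_number` are assumptions made for this example; records without that element are set aside rather than guessed at.

```python
from collections import defaultdict


def aggregate_by_case_file(records):
    """Group record ids by case file number.

    `records` is a list of dicts with the hypothetical keys "id" and,
    optionally, "case_file_number". Returns (folders, unfiled), where
    `unfiled` lists the ids that carry no usable case file number.
    """
    folders = defaultdict(list)
    unfiled = []
    for record in records:
        case_number = record.get("case_file_number")
        if case_number:
            folders[case_number].append(record["id"])
        else:
            unfiled.append(record["id"])
    return dict(folders), unfiled


records = [
    {"id": "r1", "case_file_number": "2023/015"},
    {"id": "r2", "case_file_number": "2023/015"},
    {"id": "r3"},
    {"id": "r4", "case_file_number": "2023/022"},
]
folders, unfiled = aggregate_by_case_file(records)
```

The `unfiled` list is where content-based or context-based inference would have to take over.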


22 of 27

Survey outcome from the archival perspective

In terms of records aggregation or re-aggregation, the prospects for automation are not very encouraging, as this possibility is confirmed only for very specific cases such as

defining records types, when the users’ specifications are already in place, or
establishing functional relations among records when the original structure of the content source already provides basic intelligent information.

Automatic or semi-automatic aggregation based on document content is only suggested, and is usually supported by user validation, a human-in-the-loop workflow or rules available at creation.

In several cases even these limited capacities are not yet available but are still being developed.
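A human-in-the-loop workflow of the kind mentioned above can be pictured as follows: the system only proposes a folder, and nothing is filed until a person approves it. The `Suggestion` class, its fields and the `decisions` mapping are hypothetical names chosen for this sketch, not a description of any vendor's product.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    record_id: str     # record the system wants to file
    folder: str        # aggregation proposed by the system
    confidence: float  # system's own score, between 0 and 1


def review_queue(suggestions):
    """Order suggestions so the least confident ones are reviewed first."""
    return sorted(suggestions, key=lambda s: s.confidence)


def apply_validated(suggestions, decisions):
    """Return only the filings that a human reviewer explicitly approved.

    `decisions` maps record_id to True (approve) or False (reject);
    records with no decision yet stay unfiled.
    """
    return {
        s.record_id: s.folder
        for s in suggestions
        if decisions.get(s.record_id) is True
    }


suggestions = [
    Suggestion("r1", "Procurement/2023", 0.91),
    Suggestion("r2", "HR/Recruitment", 0.55),
    Suggestion("r3", "Procurement/2023", 0.73),
]
decisions = {"r1": True, "r2": False}  # r3 has not been reviewed yet
filed = apply_validated(suggestions, decisions)
```

Note that the confidence score only orders the review queue here; it never files a record on its own.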


23 of 27

Survey outcome from the archival perspective

Even provenance information does not seem easily recognizable by AI solutions when it has to be inferred, without very specific requirements such as

the identification of the right case-folder,
the presence of a stamp or a statement clearly expressed in the record,
specific metadata and/or classification elements.

The reconstitution of the archival bond – when lost or not explicitly defined – is also recognized as a complex activity without significant help from users and/or consistent descriptive information; in any case, it requires further investment, which the market does not yet support.


24 of 27

Technology Solutions: Techniques and Analysis Models

The 13 companies listed a wide range of analysis models (24 different entries). The 5 most frequently cited are:

Neural Network Models (4 companies)
Support Vector Machines (4 companies)
Decision Trees (3 companies)
Random Forests (3 companies)
LSTM - Long Short-Term Memory (3 companies)

As to the types of techniques featured in the companies' products, the 2 most frequently cited are:

Classification (9 companies)
Clustering (5 companies)

This is unsurprising, as the companies were selected for their expertise in document management at least.

Overall, however, 18 different types of techniques were reported.


25 of 27

Technology Solutions: Training Strategies

Supervised Learning: 11 companies

Semi-Supervised Learning: 4 companies

Unsupervised Learning: 5 companies

Self-Supervised Learning: 2 companies

Rule-based Learning: 2 companies

We can also see which combinations of different training strategies the companies use:

6 companies use only one strategy: 4 Supervised Learning; 1 Unsupervised Learning; 1 Rule-based Learning;
5 companies use two strategies: 2 Supervised Learning + Unsupervised Learning; 2 Supervised Learning + Semi-Supervised Learning; 1 Supervised Learning + Self-Supervised Learning
1 company uses three strategies: Supervised, Unsupervised and Semi-Supervised Learning
1 company uses five strategies (i.e., everything): Supervised, Unsupervised, Semi-Supervised, Self-Supervised and Rule-based Learning
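The combinations listed above can be tallied to obtain the number of companies per strategy, which can then be compared with the per-strategy figures at the top of this slide. The snippet below does only that; the data are copied from the list above and the variable names are arbitrary.

```python
from collections import Counter

# (number of companies, strategies they combine), as listed on the slide.
combinations = [
    (4, ["Supervised"]),
    (1, ["Unsupervised"]),
    (1, ["Rule-based"]),
    (2, ["Supervised", "Unsupervised"]),
    (2, ["Supervised", "Semi-Supervised"]),
    (1, ["Supervised", "Self-Supervised"]),
    (1, ["Supervised", "Unsupervised", "Semi-Supervised"]),
    (1, ["Supervised", "Unsupervised", "Semi-Supervised",
         "Self-Supervised", "Rule-based"]),
]

# Count each company once for every strategy it uses.
totals = Counter()
for company_count, strategies in combinations:
    for strategy in strategies:
        totals[strategy] += company_count

# Each company appears in exactly one combination, so this is the survey size.
companies = sum(company_count for company_count, _ in combinations)

for strategy, count in totals.most_common():
    print(f"{strategy}: {count}")
print(f"Companies: {companies}")
```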


26 of 27

Lessons learnt

Our research is only in its first phase, but we have already recognized that we can and must accept the challenges without being intimidated:

by the pressure of top management asking for archival miracles based on new disruptive technologies,
by AI market players' promises, which have to be checked very carefully and, last but not least,
by the complexity of AI technologies, because the solutions they offer require our knowledge and experience more than in the past.

27 of 27

Thank you!

Any comments are welcome!
