2 of 13

Matching Algorithms Task Force Charge

Began work in Q2 2022

Reported to the Infrastructure Working Group of the Partnership for Shared Book Collections (now the Shared Print Partnership)

Investigate and document current algorithms used by vendors, service providers, and open source tools
Gather input from experts
Examine different levels of matching needs for different use cases (e.g. shared print, ILL, library analytics)
Report back to the community

Early in the process, commercial services were interested in the work but were mostly unwilling to share their approaches to bibliographic matching, citing proprietary concerns

3 of 13

Task Force Members

Current Members

Sara Amato - EAST
Ian Bogus (Chair) - ReCAP
Barbara Cormack - California Digital Library (CDL)
Andy Hart - University of Pennsylvania
George Machovec - Colorado Alliance of Research Libraries
Steve Smith - UMass Boston
Karla Strieb - BTAA/Ohio State University
Raiden van Bronkhorst - CDL

Former Members

David Almovodar - Pace University
Claudia Conrad (Chair) - CDL
Judy Dobry - CDL
Dana Jemison - CDL

Visitors

People from a variety of other organizations also visited, including representatives of several commercial library vendors

4 of 13

Why Do We Care?

Shared Print Programs

Ensuring that retention commitments are not over- or under-represented

Resource Sharing

Is the patron getting the item they requested? Bad clustering is a disaster.

Collection Analytics

Determining overlap for cooperative collection development, deselection, space analysis, etc.

Cooperative Programs

Digitization, building communities, etc.

5 of 13

Rationale for What Algorithms to Study

Algorithmic Diversity: Each chosen algorithm represents a fundamentally different approach

*Organizational Transparency: The research prioritized algorithms from organizations willing to provide detailed insights into their approaches

*Vendor algorithms were not directly included for this reason

Algorithm types compared in study

Three types of algorithms were examined:

“Match keys” based on bibliographic data
Machine Learning/AI based matching
Matching on standard control numbers
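
As an illustration of the "match key" idea only, here is a minimal sketch in Python. It is not the Gold Rush algorithm or any other tool's actual key; the choice of fields, the truncation lengths, and the function names normalize and match_key are all assumptions made for this example.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip diacritics and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]", " ", no_marks.lower())
    return " ".join(cleaned.split())


def match_key(title: str, author: str, year: str, publisher: str) -> str:
    """Build a hypothetical match key from a few bibliographic fields.

    Field choice and truncation lengths are assumptions for illustration.
    """
    title_part = normalize(title).replace(" ", "")[:40]
    author_part = normalize(author).replace(" ", "")[:10]
    year_part = re.sub(r"\D", "", year)[:4]
    publisher_part = normalize(publisher).replace(" ", "")[:5]
    return "|".join([title_part, author_part, year_part, publisher_part])


# Invented sample data: same book, catalogued with small differences.
a = match_key("The Café Society: a history.", "Smith, Jane", "2015.",
              "Oxford University Press")
b = match_key("The cafe society : a history", "Smith, Jane,", "c2015",
              "Oxford University Press,")
c = match_key("The cafe society : a history", "Smith, Jane", "2015",
              "Published for the British Academy by Oxford University Press")

print(a == b)  # punctuation, case and diacritics are normalized away
print(a == c)  # a differently worded publisher statement changes the key
```

Records with equal keys would be treated as a match. The third record shows how a key of this kind can miss a match when one field is worded differently.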

6 of 13

MARC21 Data Sets Used for Analysis

English Monographs (2013-2017)

Current cataloging selected from two Ivy Plus libraries (with their approval)

Recent non-Roman Monographs (2013-2017)

Current cataloging selected from two Ivy Plus libraries (with their approval)

Older English Language Monographs (pre-1950)

Selected from two EAST libraries (with their approval)

7 of 13

Algorithms Overview

Gold Rush - Bibliographic Match Key Algorithm

Shared Collection Service Bus (SCSB) - Control Number Dependent Matching

MARC-AI - Machine Learning Matching Algorithms

OCLC Primary (benchmark #1) – First instance of the 035$a

OCLC Reconciled (benchmark #2) – Leveraged the WorldCat API to find merged OCLC numbers
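
The two OCLC benchmarks can be sketched roughly as follows. This is an illustration under stated assumptions, not the task force's code: a record is modeled as a plain list of (tag, subfield, value) tuples, the prefix handling in normalize_oclc is a simplification, and merged_to_current is a hypothetical hand-built lookup table standing in for whatever the WorldCat API returned.

```python
import re
from typing import Optional

# Assumed record model: a list of (tag, subfield_code, value) tuples.
Record = list[tuple[str, str, str]]


def normalize_oclc(value: str) -> Optional[str]:
    """Return the bare digits of an OCLC number, or None if not one."""
    m = re.match(r"^\(OCoLC\)\s*(?:ocm|ocn|on)?0*(\d+)\s*$", value.strip())
    return m.group(1) if m else None


def first_oclc_number(record: Record) -> Optional[str]:
    """Look only at the first 035$a in the record (assumed reading of
    'first instance of the 035$a')."""
    for tag, code, value in record:
        if tag == "035" and code == "a":
            return normalize_oclc(value)
    return None


def control_number_match(r1: Record, r2: Record) -> bool:
    """Benchmark #1 style: equal first-035$a OCLC numbers."""
    n1, n2 = first_oclc_number(r1), first_oclc_number(r2)
    return n1 is not None and n1 == n2


def reconciled_number(record: Record, merged_to_current: dict) -> Optional[str]:
    """Benchmark #2 style: map a merged number to its current number."""
    n = first_oclc_number(record)
    return merged_to_current.get(n, n) if n is not None else None


# Invented sample records.
rec1 = [("035", "a", "(OCoLC)ocm00012345"), ("245", "a", "Title")]
rec2 = [("035", "a", "(OCoLC)12345"), ("035", "a", "(OCoLC)99999")]
rec3 = [("035", "a", "(NjP)777"), ("035", "a", "(OCoLC)12345")]
rec4 = [("035", "a", "(OCoLC)555")]
merged = {"555": "12345"}  # hypothetical: 555 was merged into 12345

print(control_number_match(rec1, rec2))  # same number, different prefixes
print(control_number_match(rec1, rec3))  # first 035$a is a local number
print(reconciled_number(rec4, merged) == reconciled_number(rec1, merged))
```

The third record shows why a first-035$a rule can miss matches when a local system number comes first, and the fourth shows the kind of gap that reconciling merged numbers is meant to close.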

8 of 13

Methodology

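[Diagram: Bib records from Library 1 and Library 2 are run through each of the five approaches (Gold Rush, MARC-AI, SCSB, OCLC Primary, OCLC Reconciled), each producing its own set of matches. The results are sorted into three outcomes: None of the Five Matched, Matched All Five, and the Island of Uncertainty.]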
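
The sorting of results into the three outcomes can be sketched as below. One assumption is made loudly here: the Island of Uncertainty is taken to mean candidate pairs that some, but not all, of the five approaches matched. The slides do not spell out that definition, and the pair identifiers and votes are invented.

```python
from collections import Counter

ALGORITHMS = ["Gold Rush", "MARC-AI", "SCSB", "OCLC Primary", "OCLC Reconciled"]


def classify(votes: dict) -> str:
    """Bucket one candidate pair by how many approaches matched it.

    ASSUMPTION: partial agreement is what lands a pair in the
    'Island of Uncertainty'.
    """
    hits = sum(1 for name in ALGORITHMS if votes[name])
    if hits == 0:
        return "None of the Five Matched"
    if hits == len(ALGORITHMS):
        return "Matched All Five"
    return "Island of Uncertainty"


# Invented votes for three candidate pairs (Library 1 id, Library 2 id).
results = {
    ("L1-001", "L2-417"): dict.fromkeys(ALGORITHMS, True),
    ("L1-002", "L2-090"): dict.fromkeys(ALGORITHMS, False),
    ("L1-003", "L2-233"): {**dict.fromkeys(ALGORITHMS, True), "Gold Rush": False},
}

buckets = {pair: classify(votes) for pair, votes in results.items()}
summary = Counter(buckets.values())
print(summary)
```

Under this reading, only the pairs in the partial-agreement bucket need closer review.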

9 of 13

Summary Results

	English (2013-2017)		Non-Roman (2013-2017)		English (pre-1950)
	# Records	# Match Groups	# Records	# Match Groups	# Records	# Match Groups
Library 1	62,276		228,403		50,655
Library 2	54,402		108,423		18,706
All Five Matched	36,120	18,051	59,410	29,676	3,411	1,705
Island of Uncertainty	6,051	2,967	90,796	44,886	8,129	4,021

10 of 13

Scenario 1

Recent English-language monographs (2013-2017)

Common Record Issues

“First Edition” vs. [blank]
Minor variations in title
Inconsistency in the 008 dates
Minor variations in publisher (e.g. "Oxford University Press" vs. "Published for the British Academy by Oxford University Press")

	True Positives	False Positives
Gold Rush	91.32%	0.61%
SCSB	99.27%	0.97%
MARC-AI	96.46%	0.45%
OCLC Primary	95.05%	0.10%
OCLC Reconciled	97.76%	0.18%
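
The slides do not state the denominators behind these percentages. As one possible reading only, the sketch below computes a true-positive rate as the share of verified matches an approach found, and a false-positive rate as the share of the approach's reported matches that review rejected. Both definitions, the function name rates, and the sample pairs are assumptions for illustration.

```python
def rates(reported: set, truth: set) -> tuple:
    """Return (true_positive_pct, false_positive_pct) under assumed definitions.

    ASSUMPTION: TP% = found true matches / all true matches,
                FP% = wrongly reported matches / all reported matches.
    """
    tp = len(reported & truth)
    fp = len(reported - truth)
    tp_pct = 100 * tp / len(truth) if truth else 0.0
    fp_pct = 100 * fp / len(reported) if reported else 0.0
    return round(tp_pct, 2), round(fp_pct, 2)


# Invented data: four verified matches, one approach's reported matches.
truth = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
reported = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a9", "b9")}

tp_pct, fp_pct = rates(reported, truth)
print(tp_pct, fp_pct)  # 75.0 25.0
```

Other denominators (for example, all records compared) would give different figures, so the numbers in the table should be read against the study's own definitions.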

11 of 13

Scenario 2

Recent non-Roman-language monographs (2013-2017)

Common Record Issues

Differing preferences in how records handle diacritics and other markings can cause problems
Differences in where transliterated and vernacular text are placed within records

	True Positives	False Positives
Gold Rush	46.75%	0.33%
SCSB	94.99%	2.57%
MARC-AI	91.70%	2.82%
OCLC Primary	87.16%	1.59%
OCLC Reconciled	89.04%	1.56%

12 of 13

Scenario 3

English-language monographs (pre-1950)

Common Record Issues

Date problems in the 008 persist.
Missing data (e.g. records without authors)
Minor variations in publisher (e.g. Wiley vs. John Wiley and Sons)

	True Positives	False Positives
Gold Rush	76.05%	10.62%
SCSB	93.58%	6.13%
MARC-AI	87.21%	2.67%
OCLC Primary	21.85%	3.52%
OCLC Reconciled	91.19%	1.72%
