The Partnership’s Matching Algorithms Task Force Final Report

Ian Bogus & George Machovec

Matching Algorithms Task Force Charge

Began work in Q2 2022

Reported to the Infrastructure Working Group of the Partnership for Shared Book Collections (now the Shared Print Partnership)

  • Investigate and document current algorithms used by vendors, service providers, and open source tools
  • Gather input from experts
  • Examine different levels of matching needs for different use cases (e.g. shared print, ILL, library analytics)
  • Report back to the community

Early in the process, commercial services were interested in the work but were mostly unwilling to share their approaches to bibliographic matching because those approaches are proprietary.

Task Force Members

Current Members

  • Sara Amato - EAST
  • Ian Bogus (Chair) - ReCAP
  • Barbara Cormack - California Digital Library (CDL)
  • Andy Hart - University of Pennsylvania
  • George Machovec - Colorado Alliance of Research Libraries
  • Steve Smith - UMass Boston
  • Karla Strieb - BTAA/Ohio State University
  • Raiden van Bronkhorst - CDL

Former Members

  • David Almovodar - Pace University
  • Claudia Conrad (Chair) - CDL
  • Judy Dobry - CDL
  • Dana Jemison - CDL

Visitors

People from various other organizations also visited, including representatives of several commercial library vendors.

Why Do We Care?

Shared Print Programs

  • Ensuring that retention commitments are not over- or under-represented

Resource Sharing

  • Is the patron getting the item they requested? Bad clustering is a disaster.

Collection Analytics

  • Determining overlap for cooperative collection development, deselection, space analysis, etc.

Cooperative Programs

  • Digitization, building communities, etc.

Rationale for What Algorithms to Study

Algorithmic Diversity: Each chosen algorithm represents a fundamentally different approach

Organizational Transparency: The research prioritized algorithms from organizations willing to provide detailed insights into their approaches. Vendor algorithms were not directly included for this reason.

Algorithm types compared in study

Three types of algorithms were examined:

  • “Match keys” based on bibliographic data
  • Machine Learning/AI based matching
  • Matching on standard control numbers
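To make the "match key" idea concrete, here is a minimal sketch in Python. It is not the Gold Rush algorithm or any other algorithm from the study; the chosen fields (title, year, publisher), the truncation lengths, and the sample records are all hypothetical and serve only to show how a key is built from normalized bibliographic data and compared.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip diacritics and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", no_marks.lower())
    return " ".join(cleaned.split())


def match_key(record: dict) -> str:
    """Build a hypothetical match key from title, year, and publisher."""
    title = normalize(record.get("title", ""))[:40]
    year = re.sub(r"\D", "", record.get("year", ""))[:4]
    publisher = normalize(record.get("publisher", ""))[:10]
    return "|".join([title, year, publisher])


# Hypothetical records: a and b differ only in punctuation and case,
# while c carries a longer publisher statement.
a = {"title": "An Example Title.", "year": "c2015",
     "publisher": "Oxford University Press"}
b = {"title": "an example title", "year": "2015",
     "publisher": "Oxford University Press,"}
c = {"title": "An Example Title", "year": "2015",
     "publisher": "Published for the British Academy by Oxford University Press"}

print(match_key(a) == match_key(b))  # records a and b share a key
print(match_key(a) == match_key(c))  # publisher variation breaks the key
```

Because this sketch keeps only ASCII letters and digits, text in non-Roman scripts would be reduced to an empty string. A real key would need a deliberate policy for vernacular and transliterated fields.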

MARC21 Data Sets Used for Analysis

English Monographs (2013-2017)

  • Current cataloging selected from two Ivy Plus libraries (with their approval)

Recent non-Roman Monographs (2013-2017)

  • Current cataloging selected from two Ivy Plus libraries (with their approval)

Older English Language Monographs (pre-1950)

  • Selected from two EAST libraries (with their approval)

Algorithms Overview

Gold Rush - Bibliographic Match Key Algorithm

Shared Collection Service Bus (SCSB) - Control Number Dependent Matching

MARC-AI - Machine Learning Matching Algorithms

OCLC Primary (benchmark #1) – First instance of the 035$a

OCLC Reconciled (benchmark #2) – Leveraged the WorldCat API to find merged OCLC numbers
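The two OCLC benchmarks can be illustrated with a short sketch. It assumes that "first instance of the 035$a" means only the first 035$a value in a record is examined, and that OCLC numbers use the common "(OCoLC)" prefix with optional "ocm", "ocn", or "on" and leading zeros. The `merged_into` dictionary is a hypothetical stand-in for whatever the WorldCat API returns; no real API call is shown.

```python
import re
from typing import Dict, List, Optional

# Matches values such as "(OCoLC)ocm00012345" or "(OCoLC)12345".
OCLC_PATTERN = re.compile(r"^\(OCoLC\)\s*(?:ocm|ocn|on)?0*(\d+)\s*$", re.IGNORECASE)


def primary_oclc_number(fields_035a: List[str]) -> Optional[str]:
    """Return the normalized OCLC number from the first 035$a, or None."""
    if not fields_035a:
        return None
    found = OCLC_PATTERN.match(fields_035a[0].strip())
    return found.group(1) if found else None


def reconciled_oclc_number(number: str, merged_into: Dict[str, str]) -> str:
    """Follow merge links (hypothetical lookup table) to the surviving number."""
    seen = set()
    while number in merged_into and number not in seen:
        seen.add(number)  # guard against accidental cycles in the table
        number = merged_into[number]
    return number


# Hypothetical data: record B was catalogued under a number later merged into A's.
record_a = ["(OCoLC)ocn000067890"]
record_b = ["(OCoLC)ocm00012345"]
merged_into = {"12345": "67890"}

num_a = primary_oclc_number(record_a)
num_b = primary_oclc_number(record_b)
print(num_a == num_b)  # primary numbers differ
print(reconciled_oclc_number(num_a, merged_into)
      == reconciled_oclc_number(num_b, merged_into))  # reconciled numbers agree
```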

Methodology

[Diagram: bibliographic records from Library 1 and Library 2 are run through Gold Rush, MARC-AI, SCSB, OCLC Primary, and OCLC Reconciled. The resulting matches are sorted into three groups: None of the Five Matched, Matched All Five, and the Island of Uncertainty.]
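The three-way split in the methodology can be sketched as a simple vote count. This sketch assumes that the Island of Uncertainty holds candidate matches found by at least one, but not all, of the five approaches; the slides do not state the definition explicitly. The record IDs and match sets are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]  # (Library 1 record ID, Library 2 record ID)


def partition_pairs(candidates: Iterable[Pair],
                    matches: Dict[str, Set[Pair]]) -> Dict[str, Set[Pair]]:
    """Sort candidate pairs by how many approaches reported them as a match."""
    buckets: Dict[str, Set[Pair]] = {
        "all_matched": set(), "none_matched": set(), "island": set(),
    }
    for pair in candidates:
        votes = sum(pair in matched for matched in matches.values())
        if votes == len(matches):
            buckets["all_matched"].add(pair)
        elif votes == 0:
            buckets["none_matched"].add(pair)
        else:
            buckets["island"].add(pair)  # the approaches disagree
    return buckets


# Hypothetical match results for three candidate pairs.
matches = {
    "Gold Rush": {("a1", "b1")},
    "MARC-AI": {("a1", "b1"), ("a2", "b2")},
    "SCSB": {("a1", "b1"), ("a2", "b2")},
    "OCLC Primary": {("a1", "b1")},
    "OCLC Reconciled": {("a1", "b1"), ("a2", "b2")},
}
candidates = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
result = partition_pairs(candidates, matches)
for name, pairs in result.items():
    print(name, sorted(pairs))
```

Under this assumption, only the pairs in the island need closer review, since the other two groups are cases where the approaches agree.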

Summary Results

                         English (2013-2017)          Non-Roman (2013-2017)        English (pre-1950)
                         # Records  # Match Groups    # Records  # Match Groups    # Records  # Match Groups
Library 1                62,276                       228,403                      50,655
Library 2                54,402                       108,423                      18,706
All Five Matched         36,120     18,051            59,410     29,676            3,411      1,705
Island of Uncertainty    6,051      2,967             90,796     44,886            8,129      4,021
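One quick check on the summary figures is the average number of records per match group. The sketch below uses only the counts reported above; the dictionary layout and variable names are illustrative.

```python
# (records, match groups) taken from the summary results above.
summary = {
    ("English (2013-2017)", "All Five Matched"): (36_120, 18_051),
    ("English (2013-2017)", "Island of Uncertainty"): (6_051, 2_967),
    ("Non-Roman (2013-2017)", "All Five Matched"): (59_410, 29_676),
    ("Non-Roman (2013-2017)", "Island of Uncertainty"): (90_796, 44_886),
    ("English (pre-1950)", "All Five Matched"): (3_411, 1_705),
    ("English (pre-1950)", "Island of Uncertainty"): (8_129, 4_021),
}

ratios = {key: records / groups for key, (records, groups) in summary.items()}
for (dataset, row), ratio in ratios.items():
    print(f"{dataset:22} {row:22} {ratio:.2f} records per match group")
```

Every ratio comes out just over two. That would be consistent with most match groups being a pair of records, one from each library, though the slides do not state this directly.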

Scenario 1

Recent English-language monographs (2013-2017)

Common Record Issues

  • “First Edition” vs. [blank]
  • Minor variations in title
  • Inconsistency in the 008 dates
  • Minor variations in publisher (e.g. “Oxford University Press” vs. “Published for the British Academy by Oxford University Press”)

                   True Positives   False Positives
Gold Rush          91.32%           0.61%
SCSB               99.27%           0.97%
MARC-AI            96.46%           0.45%
OCLC Primary       95.05%           0.10%
OCLC Reconciled    97.76%           0.18%

Scenario 2

Recent non-Roman-language monographs (2013-2017)

Common Record Issues

  • Differences in how records handle diacritics and other markings can cause problems
  • Differences in where transliterated and vernacular text appear within records

                   True Positives   False Positives
Gold Rush          46.75%           0.33%
SCSB               94.99%           2.57%
MARC-AI            91.70%           2.82%
OCLC Primary       87.16%           1.59%
OCLC Reconciled    89.04%           1.56%

Scenario 3

English-language monographs (pre-1950)

Common Record Issues

  • Date problems in the 008 persist
  • Missing data (e.g. records without authors)
  • Minor variations in publisher (e.g. “Wiley” vs. “John Wiley and Sons”)

                   True Positives   False Positives
Gold Rush          76.05%           10.62%
SCSB               93.58%           6.13%
MARC-AI            87.21%           2.67%
OCLC Primary       21.85%           3.52%
OCLC Reconciled    91.19%           1.72%

Where to find the report

https://sharedprint.org/2025/06/13/the-matching-algorithms-task-force-releases-its-final-report/