1 of 52

ARKs in the Open: 3.2 Billion Persistent Identifiers

  1. John Kunze, California Digital Library
  2. Bess Missell, Smithsonian Libraries
  3. Karen Hanson, Portico, ITHAKA
  4. Tom Creighton, Family Search International

April 2020

2 of 52

Why care about ARK identifiers?

Because persistent, reliable web links are lacking.

  • The average URL lifetime is 44 days
  • Maybe ok for the rest of the world, but not for archives and libraries
  • URLs in vendor content management tools (a) can break between major system releases and (b) aren’t generally portable to another vendor’s system

Wanted: a flexible, low cost, vendor- and software-independent persistent identifier

3 of 52

ARK (Archival Resource Key)

  • ARK: a persistent link for any kind of thing
  • 3.2 billion ARKs created by 600 institutions – libraries, archives, museums, publishers, educators, etc. A sample ...

University of California Berkeley; Smithsonian National Museum; National Library of France; University of Chicago; Musée du Louvre; Family Search; British Library; Google

Internet Archive; Bodleian Libraries; Berkeley Law Library; Bibliothèque Mazarine; New York Public Library; French National Archives; National Library of Austria; Library and Archives Canada

4 of 52

ARK anatomy

A labelled URL with a globally unique identity inside it

https://n2t.net/ark:/12345/fk1234

makes ARK actionable (the resolver)

core globally unique identity (independent of web and hostname)
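The anatomy above can be sketched in a few lines of Python. This is a minimal illustration, not an official parser: the names `ArkParts`, `split_ark`, and `core_identity` are hypothetical, and the pattern assumes the simple shape shown on this slide (resolver, then `ark:/`, then the NAAN and the assigned name).

```python
import re
from typing import NamedTuple


class ArkParts(NamedTuple):
    resolver: str  # scheme + hostname that makes the ARK actionable
    naan: str      # Name Assigning Authority Number
    name: str      # the name assigned by that authority


# Assumed shape: <resolver>/ark:/<NAAN>/<name>
ARK_URL = re.compile(
    r"^(?P<resolver>https?://[^/]+)/ark:/?(?P<naan>[^/]+)/(?P<name>.+)$"
)


def split_ark(url: str) -> ArkParts:
    """Split an ARK URL into its resolver and its core identity parts."""
    m = ARK_URL.match(url)
    if m is None:
        raise ValueError(f"not an ARK URL: {url!r}")
    return ArkParts(m["resolver"], m["naan"], m["name"])


def core_identity(url: str) -> str:
    """Return the hostname-independent part of the ARK."""
    parts = split_ark(url)
    return f"ark:/{parts.naan}/{parts.name}"


parts = split_ark("https://n2t.net/ark:/12345/fk1234")
print(parts.resolver)   # the resolver
print(core_identity("https://n2t.net/ark:/12345/fk1234"))  # the core identity
```

Because the core identity is independent of the hostname, the same `ark:/12345/fk1234` string could be served from n2t.net or from an institution's own server.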

5 of 52

What are ARKs used for?

  • genealogical records (3 billion FamilySearch)
  • publisher content (100 million Portico)
  • scientific datasets and records (22 million INIST)
  • scanned books and texts (23 million Internet Archive)
  • bibliographic records (15 million BnF main catalog)
  • museum specimens (15 million Smithsonian Institution)
  • public health documents (14 million UCSF IDL)
  • historical documents (21 million CDL, 5 million BnF Gallica)
  • historical authors and scholars (4 million SNAC)
  • vocabulary terms (9,000 Periodo, YAMZ)

6 of 52

Why ARKs and not DOIs

(or Handles or PURLs or URNs)?

  • G**gle for “ten persistent myths about persistent identifiers”
  • Flexible resolution: centralized (n2t.net) or via your own server
  • University of California’s history of open
    • 1964 Free speech movement
    • 1982 Open source Berkeley UNIX – FreeBSD – Mac OS X
    • 2013 Groundbreaking open access research policy
    • 2019 Termination of Elsevier journal subscription
    • 2001-2020 Non-paywalled, decentralized persistent identifiers – ARKs

7 of 52

ARKs and DOIs

DOIs (Digital Object Identifiers) – publishing industry solution

  • requires membership, per-identifier fees, rigid metadata requirements

ARKs - cultural heritage solution

  • no fees or membership, and highly flexible creation and metadata policies
  • like DOIs, ARKs are also stable, linked to metadata, and found in the Data Citation Index (linked to the Web of Science), HathiTrust, Wikipedia articles, Wikidata records, Internet Archive collections, ORCID researcher profiles, etc.

8 of 52

The Covenant of the ARK

The ARK scheme

will not charge fees to create or use ARKs

will not limit the number of ARKs you assign

will not limit the kind of content you identify

will not require metadata, nor even persistence

will not mandate use of any particular resolver

9 of 52

Getting involved in ARKs

  • Learn more: “ARK Identifiers FAQ”
  • Start assigning: n2t.net/e/naan_request

Support open infrastructure

  • Join us at ARKsInTheOpen.org
  • Community owned infrastructure
  • Collaboration between CDL and LYRASIS

10 of 52

ARKs at the Smithsonian Institution

Bess Missell

Metadata Librarian

Smithsonian Libraries

missellb@si.edu

11 of 52

Overview

  • Who we are:

The Smithsonian Libraries & The Smithsonian Institution.

  • What we assign ARKs to:

Collection metadata & multimedia objects.

  • When did we start and how many are assigned so far:
    • 2015: The National Museum of Natural History began assigning ARKs to their collections (over 10 million metadata & 3 million multimedia records).
    • January 2020: new datasetIDs for collection systems were registered in EZID, and ARK assignment in our collection systems began.
    • February 26, 2020: Smithsonian Open Access launch with 11,486,102 ARKs on CC0 metadata records and 2,794,786 ARKs on CC0 multimedia records.
    • We have assigned over 15 million ARKs and counting ….
  • Why we chose ARKs
    • Project size.
    • Cost.
    • Ease of implementation.
    • Permanence.
  • Why we are involved in ARKsInTheOpen
    • ARKs are a perfect fit for the Smithsonian collections.

12 of 52

Smithsonian Libraries is a network of 21 specialized research libraries, as well as central support services which include Smithsonian Research Online, a bibliography of Smithsonian publication citations and the Institution’s repository. library.si.edu

The Smithsonian Institution is the world’s largest museum, education, and research complex, with 19 museums and the National Zoo. www.si.edu

19 Museums + 1 Zoo

23.2M Visits by Public

155.5M Museum Objects & Specimens

2.2M Library Volumes

2,633 Scholarly Publications

154M Website Visitors

16.6M Social Media Followers

21 Libraries

2.2M Library Volumes

80K Smithsonian Research Online

772K Website visitors

239K Social Media Followers

13 of 52

The Smithsonian is Assigning ARKs to our Collection Systems

Examples include records and images for:

Scientific specimens from the National Museum of Natural History

http://n2t.net/ark:/65665/381440f27-3f74-4eb9-ac11-b4d633a7da3d

Cultural artifacts from the National Museum of American History http://n2t.net/ark:/65665/ng49ca746b2-42dc-704b-e053-15f76fa0b4fa

Sculpture from the Freer Gallery of Art & Arthur M. Sackler Gallery http://n2t.net/ark:/65665/ye3080ce305-a705-49cc-a70d-99aff8cb65da

Photographs from the National Museum of African American History and Culture

http://n2t.net/ark:/65665/fd5ad97cb86-caaf-4209-8fde-98d70f52f072

Paintings from the Smithsonian American Art Museum http://n2t.net/ark:/65665/vk7a466371d-0413-451f-bd76-ca0becc46f94

14 of 52

National Museum of Natural History

2015

  • The National Museum of Natural History began assigning ARKs to their metadata collections.

  • The National Museum of Natural History later began assigning ARKs to their multimedia collections.

  • Over 10 million ARKs have been assigned to NMNH metadata records.

  • Over 3 million ARKs have been assigned to NMNH multimedia records.

Smithsonian Open Access Project

February 26th, 2020

  • The Smithsonian released 11.5 million metadata records and 2.8 million multimedia records into the Public Domain.
  • The Smithsonian chose ARKs to be the global unique identifier (GUID) for these open access images and records.
  • https://www.si.edu/OpenAccess
  • #SmithsonianOpenAccess

Over 15 million ARKs and counting ...

15 of 52

ARKs were chosen because…

  • A large number of ARKs will be needed: over 15 million ARKs and growing;
  • cost;
  • ease of implementation;
  • the growth of the ARKs in the Open project encouraged the Smithsonian to choose ARKs as a viable, sustainable identifier.

Courtesy of the Smithsonian Libraries

Alexandre, Arsène. Noé dans son arche. Combet et Cie, 1902, https://doi.org/10.5479/sil.720005.39088010288199

https://library.si.edu/digital-library/book/noeydanssonarch00alex

16 of 52

Smithsonian Libraries…

  • issues, registers, maintains ARKs for Smithsonian collection systems;
  • registers DOIs for Smithsonian publications and research;
  • maintains a Smithsonian GUID webpage for SI staff and researchers https://library.si.edu/research/guids-help-make-your-data-findable

17 of 52

http://n2t.net/ is the resolver that passes the web call to the EZID service, which then uses the Name Assigning Authority Number (NAAN) to identify the registered naming authority. The Smithsonian has also registered datasetIDs (or shoulders) so that EZID passes the web traffic to a specific Smithsonian collection system.

vk7 in the ARK above is registered to metadata records in the Smithsonian American Art Museum (SAAM) collection management system. If vk7 were replaced with bj9 the call would change and go to the image delivery server for SAAM.

Each Smithsonian collection system is configured to receive the web call from EZID, read the datasetID, and direct the call to the correct server for metadata records or multimedia.

vc9 resolves to the Cooper Hewitt image server https://collection.cooperhewitt.org/ark/vc9

ye3 resolves to the Freer Sackler metadata server https://collections.si.edu/search/record/ark:/65665/ye3

jy5 resolves to the Freer Sackler image server https://ids.si.edu/ids/deliveryService?id=ark:/65665/jy5
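The routing described on this slide can be sketched as a lookup table keyed by datasetID. This is only an illustration under stated assumptions: the three target URLs are the ones listed above, but the function name `route`, the fixed three-character datasetID length, and the idea that the rest of the ARK is simply appended to the registered URL are assumptions, not a description of how EZID or the Smithsonian systems are actually implemented.

```python
SI_NAAN = "65665"  # the NAAN that appears in the Smithsonian ARKs in this deck

# datasetID (shoulder) -> registered target URL, copied from the slide
SHOULDER_TARGETS = {
    "vc9": "https://collection.cooperhewitt.org/ark/vc9",
    "ye3": "https://collections.si.edu/search/record/ark:/65665/ye3",
    "jy5": "https://ids.si.edu/ids/deliveryService?id=ark:/65665/jy5",
}


def route(ark: str) -> str:
    """Map an ARK to a collection-system URL via its datasetID.

    Assumption: the datasetID is the first three characters of the name,
    and the remainder of the name is appended to the registered URL.
    """
    prefix = f"ark:/{SI_NAAN}/"
    if not ark.startswith(prefix):
        raise ValueError(f"not an ARK under NAAN {SI_NAAN}: {ark!r}")
    name = ark[len(prefix):]
    shoulder, rest = name[:3], name[3:]
    if shoulder not in SHOULDER_TARGETS:
        raise LookupError(f"no collection system registered for {shoulder!r}")
    return SHOULDER_TARGETS[shoulder] + rest


print(route("ark:/65665/ye3080ce305-a705-49cc-a70d-99aff8cb65da"))
```

Swapping the datasetID for another one registered to the same museum changes only the table row that is selected, which matches the metadata server versus image server split described above.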

18 of 52

Using EZID, I register each Smithsonian collection system under our NAAN, together with a datasetID and the URL to which that datasetID should resolve.

The Smithsonian wrote a datasetID schema which I follow when I create and register new collection systems:

Two randomly selected lowercase letters (no lowercase L, rm, nm, or fu)

+

One randomly selected number (2-9)

Image from the website: https://ezid.cdlib.org/
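The datasetID schema above can be sketched as a small generator. This is a hedged reading of the rule, not the Smithsonian's actual tool: it assumes "no lowercase L" excludes the letter itself, that "rm", "nm", and "fu" are excluded as two-letter pairs, and that already registered datasetIDs must be skipped. The function name `new_dataset_id` and the `taken` parameter are hypothetical.

```python
import random
import string

# Lowercase letters without "l" (assumed reading of "no lowercase L")
LETTERS = string.ascii_lowercase.replace("l", "")
# Letter pairs the schema rules out (assumed to be excluded as pairs)
EXCLUDED_PAIRS = {"rm", "nm", "fu"}
DIGITS = "23456789"  # one number from 2 to 9


def new_dataset_id(taken=frozenset(), rng=random):
    """Return a datasetID such as 'vk7' that is not already in `taken`."""
    while True:
        pair = rng.choice(LETTERS) + rng.choice(LETTERS)
        if pair in EXCLUDED_PAIRS:
            continue
        candidate = pair + rng.choice(DIGITS)
        if candidate not in taken:
            return candidate


# Avoid the datasetIDs already mentioned in these slides
already_registered = {"vk7", "bj9", "vc9", "ye3", "jy5"}
print(new_dataset_id(taken=already_registered))
```

The result is random, so the printed value differs from run to run; registration with EZID remains a separate, manual step.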

19 of 52

Each Smithsonian collection system is now configured to automatically generate an ARK when a metadata or multimedia record is saved. The ARK includes the SI NAAN and the datasetID assigned to the collection system.
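A save hook of this kind might look like the sketch below. It is an assumption-heavy illustration: the example ARKs on the earlier slide look like a datasetID followed by a UUID-shaped string, so the sketch appends a random UUID, but the slides do not state how each collection system actually generates the unique part. The names `mint_ark` and `actionable_url` are hypothetical.

```python
import uuid

SI_NAAN = "65665"  # the NAAN that appears in the Smithsonian ARKs in this deck


def mint_ark(dataset_id: str) -> str:
    """Build an ARK from the NAAN, the system's datasetID, and a new UUID.

    Assumption: the unique part is a random UUID, as the example ARKs suggest.
    """
    return f"ark:/{SI_NAAN}/{dataset_id}{uuid.uuid4()}"


def actionable_url(ark: str) -> str:
    """Prefix the resolver used in the examples to make the ARK a link."""
    return f"http://n2t.net/{ark}"


ark = mint_ark("vk7")  # vk7: SAAM metadata records, per the earlier slide
print(actionable_url(ark))
```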

20 of 52

Challenges for the ARK implementation included…

  • tight schedule to meet the February 2020 SI Open Access launch;

  • multiple collection management systems, administrators, and IT support;

  • encountering IT problems such as:

    • identifying the correct syntax for the URL which needs to be registered with EZID:

The datasetID needs to be included in the URL: https://collections.si.edu/search/record/ark:/65665/ye3

    • how to configure each system to receive the URL with ARK datasetID.

21 of 52

Phase II of Open Access

  • Implementing ARKs with archival management systems
  • What if collections are split between two collection systems?
  • What if objects get moved from one collecting unit to another?
  • Implementing ARK inflections

Image from the website: https://n2t.net/e/ark_ids.html

22 of 52

SI media server

National Postal Museum SI TMS

AHM media server

CH media server

NH media server

National Museum of African Art SI TMS

American History Museum Mimsy XG

Cooper Hewitt Smithsonian Design Museum TMS

Natural History Museum EMU

Plus 12 more systems …

Commercial resolver: n2t.net

23 of 52

Thank you!

ARKs at the Smithsonian Institution

Bess Missell

Metadata Librarian

Smithsonian Libraries

missellb@si.edu

24 of 52

ARKs in the Portico Archive

April 23rd 2020

Karen Hanson, Senior Research Developer

25 of 52

Overview

  • Who we are
    • Portico - a community supported preservation archive. Work with libraries and publishers to preserve electronic scholarly publications
  • What we assign ARKs to
    • Every package going into the archive, and a lot of other things (more on that later)
  • When did we start and how many are assigned so far
    • Started ~2006; Assigned >2 billion ARK IDs
  • Why we chose ARKs
    • Flexible, opaque, unique, easy to generate, recognized by the community
  • Why we are involved in ARKsInTheOpen
    • Use ARKs extensively; may adopt some of the new specifications

26 of 52

Portico workflow

  • Batch of files received e.g. PDF and XML version of articles in a journal issue
  • Files checked, normalized, and packaged to prepare for preservation
  • Resulting “archival units” deposited into archive. Each unit = e.g. 1 article

28 of 52

Archival unit content structure

  • Archival Unit: Article A
  • Content Units: Article A: Version 1; Article A: Version 2
  • Functional Units: Marked up full text; Page images rendition; Figure graphic component
  • Storage Units: Publisher supplied XML; Normalized XML (JATS); PDF; JPEG (high resolution); PNG (low resolution)
30 of 52

Structure described in metadata

  • Preservation Metadata
  • Archival unit: phc5qbrw2a.zip
  • Open BagIt “Bag”
  • Storage Units: Publisher supplied XML; Normalized XML (JATS); PDF; JPEG (high resolution); PNG (low resolution)

32 of 52

Use of ARKs supports a self describing archive

  • The files are the archive
  • The system manages the archive, but the files can exist independently
  • ARKs are assigned to abstract concepts and sections of metadata, as well as digital objects

33 of 52

Full text XML with image references

ark:/12345/rmkd92kd
ark:/12345/rmkp3zr8
ark:/12345/rmk7fzqk
ark:/12345/rmk2kdjq

<fig id="fig1" position="float">
  <label>Fig. 1</label>
  <caption>
    <p>Example figure!</p>
  </caption>
  <graphic position="anchor" xlink:href="ark:/12345/rmkp3zr8" alt-version="no" xlink:type="simple"/>
</fig>

[Diagram: arrows labelled “references” connect the ARKs in the figure.]
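To illustrate how ARK references inside full text XML could be followed, the sketch below pulls every `xlink:href` that starts with `ark:/` out of a document. It is a minimal example, not Portico's tooling: the function name `ark_references` is hypothetical, and the `xmlns:xlink` declaration is added to the sample because a standalone parser needs it (in a complete JATS file it would normally be declared on an enclosing element).

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# The figure markup from the slide, with the xlink namespace declared
SNIPPET = f"""<fig xmlns:xlink="{XLINK}" id="fig1" position="float">
  <label>Fig. 1</label>
  <caption><p>Example figure!</p></caption>
  <graphic position="anchor" xlink:href="ark:/12345/rmkp3zr8"
           alt-version="no" xlink:type="simple"/>
</fig>"""


def ark_references(xml_text: str) -> list:
    """Return every xlink:href value in the XML that is an ARK."""
    root = ET.fromstring(xml_text)
    href = f"{{{XLINK}}}href"  # ElementTree's {namespace}name form
    return [
        el.get(href)
        for el in root.iter()
        if el.get(href, "").startswith("ark:/")
    ]


print(ark_references(SNIPPET))
```

Because the reference is an ARK rather than a file path, the XML stays valid wherever the referenced file is stored.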

34 of 52

What did we assign billions of ARKs to?

  • Archival units (~110 million) – the “interesting” ones
  • Versions of the content (~121 million)
  • Archived files… including metadata files (~1.8 billion)
  • Sections of metadata (technical metadata, event metadata)

over 2 billion ARKs

35 of 52

ARK resolver use case: Enhanced Monographs

  • “Enhancing Services to Preserve New Forms of Scholarship” – Mellon funded project, a collaboration with NYU Libraries, CLOCKSS and university presses
  • Looks at monographs that go beyond text and images (embedded multimedia, interactive features etc.)
  • Identify what can be preserved at scale

36 of 52

EPUB Challenge: Remote Resources

remote resource

visually embedded or linked

37 of 52

Problem of external content embedded in EPUBs

39 of 52

What if we could resolve an ARK to the video?

https://ids.portico.org/ark:/12345/rmkrq29x8

40 of 52

Thank you!

karen.hanson@ithaka.org

Thanks also to my colleague Amy Kirchhoff for helping me put together this presentation.

41 of 52

ARK Identifiers In Genealogy

FamilySearch International

Presented at CNI, Spring 2020

N. Thomas Creighton

tc@familysearch.org

42 of 52

  • Who we are
    • FamilySearch International
  • What we assign ARKs to
    • Digital images of genealogically significant documents (e.g. census records)
    • Transcriptions of the data from the digital images
    • ‘Persona’ data from those transcriptions
    • Genealogies collected from patrons and interviews
  • When did we start and how many are assigned so far
    • We started minting ARKs in 2012, but it took several months to switch over in full.
    • We have minted several billion so far.
  • Why we chose ARKs
    • We chose ARK because we wanted something recognizable as an industry effort to standardize long-lived URIs.
    • Minting so many identifiers made other options cost prohibitive.
    • We needed to control URI, resolution, redirects, etc.
  • Why we are involved in ARKsInTheOpen
    • We are involved in ARKsInTheOpen to contribute our experience.

43 of 52

FamilySearch International - A Brief Introduction

  • Originally The Genealogical Society of Utah
  • We help people connect with their families through:
    • Providing engaging discovery experiences
    • Publishing guidance on how to do family history research
    • Acquiring and publishing billions of source records from around the world
    • Creating software systems to aid in researching and collaborating on family history
    • Providing all of this at no cost to our patrons - It’s free!
  • We also maintain significant long-term digital preservation systems
    • Two independently implemented and maintained systems each holding two copies of all artifacts
    • One system in the public cloud; one system literally in a cave in the mountains
    • Tens of petabytes and billions of artifacts
  • Open to anyone at no cost; Fully supported by the Church of Jesus Christ of Latter-day Saints

www.familysearch.org

44 of 52

Artifact Processing Abstraction

45 of 52

Searching For Ancestors

46 of 52

https://www.familysearch.org/ark:/61903/1:1:K98H-2G2

Maudie M. Creighton -- .../ark:/61903/1:1:K98H-2GL

David M. Creighton -- .../ark:/61903/1:1:K98H-2GG

Robert T. Creighton -- .../ark:/61903/1:1:K98H-2GP

Thomas Percy Creighton Details -- .../ark:/61903/4:1:25V8-3J5

Census page with context -- .../ark:/61903/3:1:3QSQ-G9MT-N9ZF?personaUrl=%2Fark%3A%2F61903%2F1%3A1%3AK98H-2G2

47 of 52

https://www.familysearch.org/

ark:/61903/3:1:3QSQ-G9MT-N9ZF

?i=35&personaUrl=%2Fark%3A%2F61903%2F1%3A1%3AK98H-2G2
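The three pieces of this URL can be pulled apart with Python's standard library, which also decodes the percent-encoded ARK carried in `personaUrl`. This is only a sketch: the function name `describe` is hypothetical, and the slide does not say what the `i` parameter means, so it is returned as-is.

```python
from urllib.parse import parse_qs, urlsplit

URL = (
    "https://www.familysearch.org/ark:/61903/3:1:3QSQ-G9MT-N9ZF"
    "?i=35&personaUrl=%2Fark%3A%2F61903%2F1%3A1%3AK98H-2G2"
)


def describe(url: str) -> dict:
    """Split an ARK URL into host, ARK, and decoded query parameters."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)  # parse_qs also undoes the %-encoding
    return {
        "host": parts.netloc,
        "ark": parts.path.lstrip("/"),
        "i": query.get("i", [None])[0],
        "personaUrl": query.get("personaUrl", [None])[0],
    }


for key, value in describe(URL).items():
    print(f"{key}: {value}")
```

Once decoded, `personaUrl` is itself an ARK path (`/ark:/61903/1:1:K98H-2G2`), so the image ARK carries a pointer back to the persona record it was reached from.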

48 of 52

https://www.familysearch.org/ark:/61903/3:1:3QSQ-G9MT-N9ZF

49 of 52

A small snippet of 3438 lines:

50 of 52

Organization and Volume of Minting

Namespace or Name Assigning Authority   Description                    Approximate Count In Millions   Annual Increase In Millions
1:1                                     Historical record persona      8800                            1,511
1:2                                     Historical record              5300                            452.27
2 (1-3)                                 Pedigree data                  1500                            73.05
3 (1-4)                                 Digital images of documents    4300                            344.64
4                                       FamilyTree person records      1400                            43.2
Total:                                                                 21300                           2,424

51 of 52

Managing Access and Routing

  • Routing of ARKs is basically the same as all other resources managed at familysearch.org.
  • https://www.familysearch.org/ark:/61903/1:1:K98H-2G2 is seen by DTM. If Accept header specifies html, forward to appropriate application (typically in Heroku); if json, forward to the 1:1 resolver.
  • https://www.familysearch.org/ark:/61903/3:1:3QSQ-G9MT-N9ZF is seen by DTM. Based on Accept header it forwards to the image viewer app (Heroku) or the 3:1 resolver. The 3:1 resolver will authorize and redirect to a temporary signed S3 URL.
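The Accept-header routing described above can be sketched as follows. This is a simplified illustration, not FamilySearch's DTM configuration: the backend labels returned by `choose_backend` are hypothetical, real Accept parsing would also honour q-values and wildcards, and the slides do not say what happens for other media types, so the sketch returns `None` in that case.

```python
import re

# Assumed path shape: /ark:/61903/<namespace>:<local id>, e.g. 1:1:K98H-2G2
ARK_PATH = re.compile(
    r"^/ark:/61903/(?P<namespace>\d+:\d+):(?P<local>[A-Z0-9-]+)$"
)


def choose_backend(path: str, accept: str):
    """Pick a backend for an ARK request from its Accept header."""
    m = ARK_PATH.match(path)
    if m is None:
        raise ValueError(f"not a recognised ARK path: {path!r}")
    # Keep only the media types, dropping parameters such as ";q=0.9"
    media_types = [p.split(";")[0].strip().lower() for p in accept.split(",")]
    if "text/html" in media_types:
        return "html-application"              # e.g. the record or image viewer app
    if "application/json" in media_types:
        return f"{m['namespace']}-resolver"    # e.g. the 1:1 or 3:1 resolver
    return None                                # not described in the slides


print(choose_backend("/ark:/61903/1:1:K98H-2G2", "application/json"))
print(choose_backend("/ark:/61903/3:1:3QSQ-G9MT-N9ZF", "text/html,application/xhtml+xml"))
```

In the flow described above, the 3:1 resolver would then authorize the request and redirect to a temporary signed S3 URL; that step is outside this sketch.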

52 of 52

FamilySearch International

www.familysearch.org

N. Thomas Creighton

tc@familysearch.org