1 of 27

David Rosenthal, LOCKSS | Stanford (http://blog.dshr.org/)

Jefferson Bailey, Internet Archive (@jefferson_bail)

Nicholas Taylor, Stanford University (@nullhandle)

Building API-Based Web Archiving Systems and Services

2 of 27

Why do we need APIs?

David S. H. Rosenthal

LOCKSS Program

Stanford University Libraries

http://www.lockss.org/

http://blog.dshr.org/

© 2016 David S. H. Rosenthal

IIPC GA 2016

3 of 27

4 of 27

5 of 27

Internet APIs + Uses

Credit: Alexis Rossi

6 of 27

7 of 27

The Big Picture

  • Ingest:
    • Sharing Crawlers, Capturing renderings, Deduplication
    • Divide/Conquer crawling, Soft Errors, Metadata extraction
    • Crawl management
  • Preservation:
    • Detect/repair damage, advertise holdings
  • Dissemination:
    • Memento, federated browsing, text & metadata search
    • Bulk access, format migration, data mining
    • Emulation
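
Memento (RFC 7089) appears above as a dissemination interface. Under that protocol a client asks a TimeGate for the capture closest to a given time by sending an `Accept-Datetime` request header. A minimal sketch of building that header follows; the function name is illustrative and no particular archive's TimeGate is assumed.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when):
    """Build the Memento Accept-Datetime request header for a timezone-aware datetime.

    RFC 7089 expects an HTTP-date in GMT, so the value is converted to UTC first.
    The function name is an illustrative choice, not part of any library.
    """
    if when.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    utc_value = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc_value, usegmt=True)}


headers = accept_datetime_header(datetime(2016, 4, 14, 12, 0, tzinfo=timezone.utc))
print(headers)
```

The resulting dictionary can be passed as request headers to whichever HTTP client is in use when contacting a TimeGate.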

8 of 27

Jefferson Bailey, Internet Archive (@jefferson_bail)

Landscapes & WASAPI

9 of 27

Growth in Web Archiving (NDSA & Archive-It)

10 of 27

Local Preservation of Web Archives

Recent Surveys of local preservation of web data

  • NDSA: 18%-20% (2011, 2013, 2016)
  • AIT: 20% of respondents (2016)
  • Reasons include
    • No local preservation plan
    • Trust in service
    • Doesn’t integrate with existing workflows
    • Too much data

11 of 27

Community Involvement in WA Development

  • Few coordinated efforts on shared tools
  • Historical reliance on few providers
  • Uneven coordination on emergent efforts and uneven foresight on interoperability
  • Few on-ramps for non-developer participation
  • Yet some collaborative digital library efforts have proven successful
  • Emergence of broader web archiving community of practice

12 of 27

Other Challenges

  • Web Archiving often still a niche collecting activity
  • Use largely TBD or not measured
  • Convenience of end-to-end services diminishes tech needs
  • Little familiarity with formats, software, or processes
  • Nascent community impetus to join or advise on broad technical development activities

13 of 27

  • Wayback APIs
  • Archive-It Partner Metadata APIs
  • Data Analytics APIs (crawl logs and reports)
  • Index (CDX) APIs
  • Upload APIs (non-web)
  • Internal APIs

14 of 27

WASAPI: Web Archiving Systems APIs

  • “Systems Interoperability and Collaborative Development for Web Archives”
  • National Leadership Grant, National Digital Platform, R&D
  • IA/AIT (PI), Stanford, UNT, Rutgers
  • 2-year project started January 2016
  • National Symposium Early 2017

15 of 27

WASAPI: Web Archiving Systems APIs

Three Key Areas of R&D:

  1. What are the attributes of a community model that can support sustainable and broad-based collaborative web archiving technology development?
  2. What are the community needs and downstream uses for the planned Export APIs (by AIT & LOCKSS) to facilitate transfer of web archive data between distributed systems, and what other prospective APIs do they point to?
  3. How can better interoperability of web archiving systems support new forms of access and research use?

16 of 27

WASAPI: Web Archiving Systems APIs

Outcomes:

  1. Seed & launch a community modeled on the characteristics of successful development and participation communities identified by the project
  2. Build WARC & derivative dataset APIs (AIT & LOCKSS) and test via transfer to partners (SUL, UNT, Rutgers) to enable better distributed preservation and access
  3. Sketch a blueprint and technical model for future web archiving APIs informed by R&D
  4. Seed a technical infrastructure that will facilitate more computational and distributed research use of web archive collections

17 of 27

WASAPI Technical Working Group

and Current Progress

Nicholas Taylor (@nullhandle)

Web Archiving Service Manager

Stanford University Libraries

18 of 27

Technical Working Group

Stephen Abrams

California Digital Library

Andy Jackson

British Library

David S.H. Rosenthal

Stanford University

Tom Cramer

Stanford University

Nicholas Taylor

Stanford University

Courtney Mumma

Internet Archive

Vinay Goel

Internet Archive

Jefferson Bailey

Internet Archive

Mark Phillips

University of North Texas

Matt Weber

Rutgers University

19 of 27

related API work

  • CDX Server API (IA, IIPC)
  • derivative formats (Archive-It, BL)
  • crawl logs/partner data (Archive-It)
  • Wayback Machine APIs (IA)
  • proliferating capture tools (GWU, IA, Rhizome)
  • Cobweb (CDL, Harvard, UCLA)
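
As an illustration of the first item, a CDX Server query can be composed as a URL and its JSON output turned into records. This is a rough sketch that assumes the Internet Archive's publicly documented endpoint and its `url`, `output`, `limit`, `from`, and `to` parameters; the helper names and the sample row are made up for illustration, and no request is sent.

```python
from urllib.parse import urlencode

# Assumed endpoint: the Internet Archive's public Wayback CDX server.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def cdx_query_url(url, limit=None, from_ts=None, to_ts=None):
    """Compose a CDX query URL asking for JSON output (helper name is illustrative)."""
    params = {"url": url, "output": "json"}
    if from_ts is not None:
        params["from"] = from_ts
    if to_ts is not None:
        params["to"] = to_ts
    if limit is not None:
        params["limit"] = limit
    return CDX_ENDPOINT + "?" + urlencode(params)


def rows_to_records(rows):
    """Convert JSON output (a header row followed by data rows) into dicts."""
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


# Made-up sample in the header-plus-rows shape; not real capture data.
sample = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20160411000000", "http://example.com/", "text/html", "200", "EXAMPLEDIGEST", "1234"],
]
query = cdx_query_url("example.com", limit=2)
records = rows_to_records(sample)
```

Keeping URL construction separate from parsing means the same parsing step could be reused against any index server that returns the same header-plus-rows shape.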

20 of 27

use cases

  • Archive-It →
    • partner IR/local use
    • DPN
    • LOCKSS (PLN)
  • CDL → Archive-It (migration)
  • DLSS → IA (WebBase)
  • [EoT partners] ← → [EoT partners]
  • IA global Wayback →
    • LOCKSS (OA content)
    • national libraries
  • LOCKSS (.gov) → IA
  • [any web archive] →
    • researcher
    • original publisher

21 of 27

data exchange between repositories

service provider

preservation network

local repository

22 of 27

standardizing researcher data access

service provider

preservation network

local repository

researcher workspace

23 of 27

data exchange within repositories

capture tools

ingest workflows

24 of 27

candidate features discussed

  • content negotiation for W/ARC or derivatives
  • protocol negotiation for transfer handoff
  • ability to specify parameters for custom export
  • metadata for provenance, crawler configuration, crawl logs, description
  • request custom data extraction
  • authentication + privileges management
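
The first candidate feature, content negotiation, could work along the lines of standard HTTP `Accept` handling. The sketch below is only one possible reading of the slide: `application/warc` is the media type named in the WARC specification, while the derivative media type, the `OFFERED` list, and the function names are invented for illustration. It handles only exact matches and `*/*`, not partial wildcards.

```python
# Representations a hypothetical export endpoint offers, in server preference order.
OFFERED = ["application/warc", "application/vnd.example.derivative+json"]


def parse_accept(header):
    """Parse an HTTP Accept header into (media_type, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparseable q as "not acceptable"
        prefs.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def choose_representation(accept_header, offered):
    """Return the offered media type the client prefers most, or None."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type == "*/*" and offered:
            return offered[0]
        if media_type in offered:
            return media_type
    return None


choice = choose_representation(
    "application/vnd.example.derivative+json;q=0.9, application/warc;q=0.5", OFFERED
)
```

A `None` result corresponds to the case where a server would normally answer with HTTP 406 (Not Acceptable).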

25 of 27

export API example

  • authentication
    • (system tracks permissions)
  • submit institution ID
    • return associated collection IDs
  • submit collection ID(s)
    • return associated job IDs
  • submit job ID(s)
    • return associated W/ARC files
  • submit candidate W/ARC files
    • return supported protocols
  • initiate transfer
    • (transfer files)
    • (acknowledge transfer completion status)
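
The sequence on this slide can be sketched as a small in-memory model. Everything in it is hypothetical: the `ExportService` class, its method names, the IDs, the file names, and the protocol list are placeholders for illustration, not the WASAPI design.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExportService:
    """Hypothetical in-memory stand-in for an export API provider."""
    collections: Dict[str, List[str]]  # institution ID -> collection IDs
    jobs: Dict[str, List[str]]         # collection ID -> crawl job IDs
    files: Dict[str, List[str]]        # job ID -> W/ARC file names
    tokens: Dict[str, str]             # auth token -> institution ID it may act for
    # transfer protocols the provider supports, in its order of preference
    protocols: List[str] = field(default_factory=lambda: ["https", "rsync"])

    def authenticate(self, token):
        """Return the institution the token is permitted to access."""
        if token not in self.tokens:
            raise PermissionError("unknown token")
        return self.tokens[token]

    def list_collections(self, institution_id):
        return self.collections.get(institution_id, [])

    def list_jobs(self, collection_ids):
        return [j for c in collection_ids for j in self.jobs.get(c, [])]

    def list_files(self, job_ids):
        return [f for j in job_ids for f in self.files.get(j, [])]

    def negotiate_protocol(self, client_protocols):
        """Pick the first provider-preferred protocol the client also supports."""
        for protocol in self.protocols:
            if protocol in client_protocols:
                return protocol
        raise ValueError("no common transfer protocol")

    def transfer(self, file_names, protocol):
        """Pretend to transfer each file and report a per-file completion status."""
        return {name: "complete" for name in file_names}


def export_walkthrough(service, token, client_protocols):
    """Run the slide's steps in order: auth, collections, jobs, files, protocol, transfer."""
    institution = service.authenticate(token)
    collection_ids = service.list_collections(institution)
    job_ids = service.list_jobs(collection_ids)
    warc_files = service.list_files(job_ids)
    protocol = service.negotiate_protocol(client_protocols)
    return protocol, service.transfer(warc_files, protocol)


service = ExportService(
    collections={"inst-1": ["coll-a"]},
    jobs={"coll-a": ["job-1", "job-2"]},
    files={"job-1": ["a.warc.gz"], "job-2": ["b.warc.gz"]},
    tokens={"secret": "inst-1"},
)
protocol, status = export_walkthrough(service, "secret", ["rsync", "sftp"])
```

In a real service each method would presumably be a separate authenticated request, and the transfer step would report status asynchronously rather than returning it inline.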

26 of 27

THANKS! (discussion is next)

Nicholas Taylor, Stanford University (@nullhandle)

Jefferson Bailey, Internet Archive (jefferson@archive.org)

David Rosenthal, LOCKSS | Stanford University

27 of 27

Discussion Questions

  • What APIs have attendees built, or are currently using, in their web archiving activities?
  • Are these APIs RESTful? If not, why not?
  • What frameworks/languages were they built with? What are other notable characteristics of their development and maintenance?
  • What part of the web archiving lifecycle would most benefit from next-stage API development, post-WASAPI?