1 of 27

David Rosenthal, LOCKSS | Stanford (http://blog.dshr.org/)

Jefferson Bailey, Internet Archive (@jefferson_bail)

Nicholas Taylor, Stanford University (@nullhandle)

Building API-Based Web Archiving Systems and Services

2 of 27

Why do we need APIs?

David S. H. Rosenthal

LOCKSS Program

Stanford University Libraries

http://www.lockss.org/

http://blog.dshr.org/

IIPC GA 2016

3 of 27

4 of 27

5 of 27

Internet APIs + Uses

Credit: Alexis Rossi

6 of 27

7 of 27

The Big Picture

Ingest:

Sharing Crawlers, Capturing renderings, Deduplication
Divide/Conquer crawling, Soft Errors, Metadata extraction
Crawl management

Preservation:

Detect/repair damage, advertise holdings

Dissemination:

Memento, federated browsing, text & metadata search
Bulk access, format migration, data mining
Emulation

8 of 27

Jefferson Bailey, Internet Archive (@jefferson_bail)

Landscapes & WASAPI

9 of 27

Growth in Web Archiving (NDSA & Archive-It)

10 of 27

Local Preservation of Web Archives

Recent Surveys of local preservation of web data

NDSA: 18%-20% (2011, 2013, 2016)
AIT: 20% of respondents (2016)
Reasons given for not preserving locally include:

No local preservation plan
Trust in service
Doesn’t integrate with existing workflows
Too much data

11 of 27

Community Involvement in WA Development

Few coordinated efforts on shared tools
Historical reliance on few providers
Uneven coordination on emergent efforts and uneven foresight on interoperability
Few on-ramps for non-developer participation
Yet some collaborative digital library efforts have proven successful
Emergence of broader web archiving community of practice

12 of 27

Other Challenges

Web Archiving often still a niche collecting activity
Use of web archives largely undetermined or unmeasured
Convenience of end-to-end services diminishes tech needs
Little familiarity with formats, software, or processes
Nascent community impetus to join or advise on broad technical development activities

13 of 27

Wayback APIs
Archive-It Partner Metadata APIs
Data Analytics APIs (crawl logs and reports)
Index (CDX) APIs
Upload APIs (non-web)
Internal APIs

https://github.com/ArchiveLabs/api.archive.org

14 of 27

WASAPI: Web Archiving Systems APIs

“Systems Interoperability and Collaborative Development for Web Archives”
National Leadership Grant, National Digital Platform, R&D
IA/AIT (PI), Stanford, UNT, Rutgers
2-year project started January 2016
National Symposium Early 2017

15 of 27

WASAPI: Web Archiving Systems APIs

Three Key Areas of R&D:

What are the attributes of a community model that can support sustainable and broad-based collaborative web archiving technology development?
What are the community needs and downstream uses for the planned Export APIs (by AIT & LOCKSS) to facilitate transfer of web archive data between distributed systems, and what other prospective APIs do those needs point to?
How can better interoperability of web archiving systems support new forms of access and research use?

16 of 27

WASAPI: Web Archiving Systems APIs

Outcomes:

Seed & launch a community modeled on the characteristics of successful development and participation communities identified by the project
Build WARC & derivative dataset APIs (AIT & LOCKSS) and test via transfer to partners (SUL, UNT, Rutgers) to enable better distributed preservation and access
Sketch a blueprint and technical model for future web archiving APIs informed by R&D
Seed a technical infrastructure that will facilitate more computational and distributed research use of web archive collections

17 of 27

WASAPI Technical Working Group and Current Progress

Nicholas Taylor (@nullhandle)

Web Archiving Service Manager

Stanford University Libraries

18 of 27

Technical Working Group

Stephen Abrams

California Digital Library

Andy Jackson

British Library

David S.H. Rosenthal

Stanford University

Tom Cramer

Stanford University

Nicholas Taylor

Stanford University

Courtney Mumma

Internet Archive

Vinay Goel

Internet Archive

Jefferson Bailey

Internet Archive

Mark Phillips

University of North Texas

Matt Weber

Rutgers University

19 of 27

related API work

CDX Server API (IA, IIPC)
derivative formats (Archive-It, BL)
crawl logs/partner data (Archive-It)
Wayback Machine APIs (IA)
proliferating capture tools (GWU, IA, Rhizome)
Cobweb (CDL, Harvard, UCLA)

20 of 27

use cases

Archive-It →
    partner IR/local use
    DPN
    LOCKSS (PLN)

CDL → Archive-It (migration)
DLSS → IA (WebBase)
[EoT partners] ← → [EoT partners]

IA global Wayback →
    LOCKSS (OA content)
    national libraries

LOCKSS (.gov) → IA

[any web archive] →
    researcher
    original publisher

21 of 27

data exchange between repositories

service provider

preservation network

local repository

22 of 27

standardizing researcher data access

service provider

preservation network

local repository

researcher workspace

23 of 27

data exchange within repositories

→

capture tools

ingest workflows

24 of 27

candidate features discussed

content negotiation for W/ARC or derivatives
protocol negotiation for transfer handoff
ability to specify parameters for custom export
metadata for provenance, crawler configuration, crawl logs, description
request custom data extraction
authentication + privileges management
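
The first candidate feature, content negotiation for W/ARC or derivatives, could work roughly as sketched below: the client states its preferences in an HTTP Accept header and the service picks the best representation it can supply. This is only an illustration, not a WASAPI design. The AVAILABLE tuple and the choice of application/json to stand in for a derivative format are assumptions, and the parser handles only plain media types and q-values.

```python
# Hypothetical list of representations a service might offer, in its own
# order of preference. "application/json" stands in for a derivative format.
AVAILABLE = ("application/warc", "application/json")


def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, best first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when the client gives no weight
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable weight as "not acceptable"
        prefs.append((media_type, q))
    # sorted() is stable, so equal weights keep the client's original order.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def negotiate(header, available=AVAILABLE):
    """Pick the representation to serve, or None if nothing is acceptable."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        if media_type == "*/*":
            return available[0]
        if media_type in available:
            return media_type
    return None  # a real service would answer 406 Not Acceptable here


choice = negotiate("application/json;q=0.9, application/warc")
```

Here the client weights the derivative at 0.9 and leaves the W/ARC type at the default weight of 1, so the W/ARC representation is chosen.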

25 of 27

export API example

authentication

(system tracks permissions)

submit institution ID

return associated collection IDs

submit collection ID(s)

return associated job IDs

submit job ID(s)

return associated W/ARC files

submit candidate W/ARC files

return supported protocols

initiate transfer

(transfer files)
(acknowledge transfer completion status)
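
The sequence on this slide can be walked through in code. The sketch below uses an in-memory stand-in for the service, so every class, method, and identifier in it (ExportService, export_all, the token format, the protocol names) is hypothetical and chosen only for illustration. It is not the WASAPI export API.

```python
from dataclasses import dataclass, field


@dataclass
class ExportService:
    """In-memory stand-in for a hypothetical export API (no real endpoints)."""
    collections: dict  # institution ID -> collection IDs
    jobs: dict         # collection ID -> crawl job IDs
    files: dict        # job ID -> W/ARC filenames
    protocols: tuple = ("https", "rsync")
    tokens: set = field(default_factory=set)

    def authenticate(self, institution_id):
        # "System tracks permissions": a token records who has logged in.
        if institution_id not in self.collections:
            raise PermissionError(f"unknown institution: {institution_id}")
        token = f"token-{institution_id}"
        self.tokens.add(token)
        return token

    def _check(self, token):
        if token not in self.tokens:
            raise PermissionError("not authenticated")

    def list_collections(self, token, institution_id):
        self._check(token)
        return list(self.collections[institution_id])

    def list_jobs(self, token, collection_ids):
        self._check(token)
        return [j for c in collection_ids for j in self.jobs.get(c, [])]

    def list_files(self, token, job_ids):
        self._check(token)
        return [f for j in job_ids for f in self.files.get(j, [])]

    def supported_protocols(self, token, filenames):
        self._check(token)
        return list(self.protocols)

    def transfer(self, token, filenames, protocol):
        self._check(token)
        if protocol not in self.protocols:
            raise ValueError(f"unsupported protocol: {protocol}")
        # A real service would move bytes here; the sketch only
        # acknowledges a completion status for each file.
        return {name: "complete" for name in filenames}


def export_all(service, institution_id, preferred=("rsync", "https")):
    """Follow the slide's steps in order; return per-file completion status."""
    token = service.authenticate(institution_id)
    collections = service.list_collections(token, institution_id)
    jobs = service.list_jobs(token, collections)
    files = service.list_files(token, jobs)
    offered = service.supported_protocols(token, files)
    protocol = next(p for p in preferred if p in offered)
    return service.transfer(token, files, protocol)


service = ExportService(
    collections={"inst-1": ["coll-a"]},
    jobs={"coll-a": ["job-1", "job-2"]},
    files={"job-1": ["a.warc.gz"], "job-2": ["b.warc.gz"]},
)
status = export_all(service, "inst-1")
```

Each call narrows the scope (institution, then collections, then jobs, then files) before the two sides agree on a transfer protocol, which mirrors the submit/return pairs listed above.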

26 of 27

THANKS! (discussion is next)

Nicholas Taylor, Stanford University (@nullhandle)

Jefferson Bailey, Internet Archive (jefferson@archive.org)

David Rosenthal, LOCKSS | Stanford University

WASAPI

https://groups.google.com/forum/#!forum/wasapi-community

https://github.com/WASAPI-Community

https://www.imls.gov/sites/default/files/proposal_narritive_lg-71-15-0174_internet_archive.pdf

27 of 27

Discussion Questions

What APIs have attendees built, or are currently using, in their web archiving activities?
Are these APIs RESTful? If not, why not?
What frameworks/languages were they built with? What are other notable characteristics of their development and maintenance?
What part of the web archiving lifecycle would most benefit from next-stage API development, post-WASAPI?