1 of 34

Jefferson Bailey, Internet Archive (jefferson@archive.org | @jefferson_bail)

Naomi Dushay, Stanford University (ndushay@stanford.edu)

IIPC WAC 2017 | https://github.com/WASAPI-Community/data-transfer-apis

WASAPI: (Web Archiving Systems APIs)

Project Updates & Data Transfer APIs Specifications and Demonstrations

2 of 34

WASAPI: Web Archiving Systems APIs

Outline:

Jefferson

WASAPI Project Background
Work to Date: Research & Community Building
Work to Date: Data Transfer APIs
Archive-It API: Specification & Demonstration

Naomi

Stanford Use of Archive-It API
Work outline
Engineering Approach
Demo (Video)

Questions & Discussion (please ask them!)

3 of 34

WASAPI: Web Archiving Systems APIs

“Systems Interoperability and Collaborative Development for Web Archives”
National Leadership Grant, National Digital Platform, R&D
IA/AIT (PI), Stanford, LOCKSS, UNT, Rutgers
2-year project (Jan 2016 - Dec 2017)
R&D + National Symposium + APIs

4 of 34

WASAPI: Web Archiving Systems APIs

Goals & Outcomes:

Build WARC & derivative dataset APIs (AIT & LOCKSS) and test via transfer to partners (SUL, UNT, Rutgers) to enable better distributed preservation and access
Seed & launch a community modeled on the characteristics of successful development and participation from communities ID’ed by project
Sketch a blueprint and technical model for future web archiving APIs informed by project R&D
Seed a technical infrastructure to facilitate more computational research use of web archives

5 of 34

WASAPI: Web Archiving Systems APIs

Work to date -- Research, Education & Community

WARC & Digital Preservation Surveys
Online Webinars & Demos
National Symposium
Presentations & Working Groups

6 of 34

WASAPI: Web Archiving Systems APIs

WASAPI & AIT “State of the WARC” Preservation Surveys (plus NDSA Web Archiving Survey)

15%-20% of respondents are downloading their WARCs for local preservation (33% plan/hope to)
Institution, collection, crawl, seed, date-range all identified as main data points for transfer/access
Broad interest in streamlining process, but systems for local preservation remain disparate
Transfer of WARCs/datasets for researcher access small but growing

7 of 34

WASAPI: Web Archiving Systems APIs

Work to date -- Research, Education & Community

Video trainings/demos on WA APIs

SAA WA Section + SUL demos

Presentations & Working Groups

Mucho preso + TWG notes

8 of 34

National Symposium on Web Archiving Interoperability

@ Internet Archive, Feb 22-23, 2017

40+ Institutions from United States and Canada
Orgs included custodial, research, and engineering reps
Presentations focused on local uses of existing APIs (search, CDX, etc) and emerging tools
Affiliated Archives Unleashed event
Agenda, docs, presos in GitHub

Breakout Groups
Community fractured among conferences, travel challenges
Desire for more interaction between practitioners & developers
WA still needs broader institutional understanding & buy-in
Need a marquee event and unified public & member comms channels

10 of 34

WASAPI: Web Archiving Systems APIs

Work to date -- Development

General Specification (on GitHub)
LOCKSS Implementation (on GitHub)
Archive-It Implementation (on GitHub)
Archive-It API documentation (on GitHub)
Testing and utilities (in progress)

11 of 34

WASAPI: Archive-It Transfer API

Written in Python, meets all gen-spec criteria, Swagger YAML in the repos

Auth: Uses AIT Django framework (same as web app) -- Auth is not defined in the gen spec

Browser cookies OR HTTP basic auth (log in or pass creds via CLI)

Basic endpoint: https://partner.archive-it.org/wasapi/v1/webdata (in production!)

Base path returns all WARCs for that account; base/all results are paginated

Query parameters:

filename -- limited use but knowable via AIT CDX/C API
filetype -- currently just WARCs, but others (derivatives) in dev
collection -- ID designating a specific AIT collection [repeatable param]
crawl -- ID designating a specific AIT crawl job
crawl-time -- uses WARC creation date; crawl-time-before / crawl-time-after
crawl-start -- uses crawl job start date; crawl-start-before / crawl-start-after
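
As a rough illustration of how these parameters combine, the Python sketch below assembles a webdata query URL with the standard library. The helper name `build_webdata_url` and its argument names are hypothetical; only the endpoint and the `collection`, `crawl`, `crawl-time-after` and `crawl-time-before` parameter names come from the slides.

```python
from urllib.parse import urlencode

# Endpoint as given on the slide.
WEBDATA_ENDPOINT = "https://partner.archive-it.org/wasapi/v1/webdata"


def build_webdata_url(collections=(), crawl=None, time_after=None, time_before=None):
    """Assemble a webdata query URL (hypothetical helper, not an official client)."""
    # "collection" is repeatable, so emit one key/value pair per collection ID.
    params = [("collection", str(c)) for c in collections]
    if crawl is not None:
        params.append(("crawl", str(crawl)))
    if time_after is not None:
        params.append(("crawl-time-after", time_after))
    if time_before is not None:
        params.append(("crawl-time-before", time_before))
    query = urlencode(params)
    # With no parameters, the base path lists every WARC for the account.
    return f"{WEBDATA_ENDPOINT}?{query}" if query else WEBDATA_ENDPOINT


if __name__ == "__main__":
    print(build_webdata_url(collections=[1068],
                            time_after="2016-12-31",
                            time_before="2017-04-01"))
```

Passing a list of pairs to `urlencode` (rather than a dict) is what allows the same key to appear more than once.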

Some caveats!

12 of 34

WASAPI: Archive-It Transfer API

Response:

JSON object has: pagination, count, request-url, includes-extra

Files fields:

account: the numeric Archive-It account identifier
checksums: an object with md5 and sha1 keys and hexadecimal checksum values
collection: the numeric Archive-It collection identifier
crawl: the numeric Archive-It crawl job identifier
crawl-time: an RFC3339 date stamp of the time the webdata file was created
crawl-start: an optional RFC3339 date stamp of the time the crawl job started
filename: the name of the webdata file (without any path of directories)
filetype: the format of the webdata file, e.g. warc, wat, wane, cdx
locations: a list of sources from which to retrieve the webdata file
size: the size in bytes of the webdata file
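
One possible way to consume these fields from Python is sketched below. The `WebdataFile` class and `parse_files` helper are hypothetical, and the sample response is invented for illustration (the identifiers, filename, URL and checksum strings are placeholders, not real Archive-It data). The sketch assumes the file entries sit under a top-level `files` key, as the jq example in this deck suggests.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class WebdataFile:
    """One entry of the files array, using the field names listed above."""
    account: int
    collection: int
    crawl: int
    crawl_time: str
    crawl_start: Optional[str]  # optional in the response
    filename: str
    filetype: str
    checksums: Dict[str, str]
    locations: List[str]
    size: int


def parse_files(response_text: str) -> List[WebdataFile]:
    """Turn a webdata JSON response into typed records (hypothetical helper)."""
    payload = json.loads(response_text)
    return [
        WebdataFile(
            account=entry["account"],
            collection=entry["collection"],
            crawl=entry["crawl"],
            crawl_time=entry["crawl-time"],
            crawl_start=entry.get("crawl-start"),
            filename=entry["filename"],
            filetype=entry["filetype"],
            checksums=entry["checksums"],
            locations=entry["locations"],
            size=entry["size"],
        )
        for entry in payload.get("files", [])
    ]


# Invented sample data: every value below is a placeholder.
SAMPLE_RESPONSE = json.dumps({
    "count": 1,
    "files": [{
        "account": 1,
        "collection": 2,
        "crawl": 3,
        "crawl-time": "2017-01-01T00:00:00Z",
        "filename": "EXAMPLE-00000.warc.gz",
        "filetype": "warc",
        "checksums": {"md5": "0" * 32, "sha1": "0" * 40},
        "locations": ["https://example.org/EXAMPLE-00000.warc.gz"],
        "size": 1024,
    }],
})

if __name__ == "__main__":
    for record in parse_files(SAMPLE_RESPONSE):
        print(record.filename, record.size, record.locations[0])
```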

13 of 34

WASAPI: Archive-It Transfer API

Sample queries!

Gimme all my WARCs for the #blacklivesmatter collection (2950)

https://partner.archive-it.org/wasapi/v1/webdata?collection=2950&format=json

Gimme all my WARCs for a specific crawl (300208)

https://partner.archive-it.org/wasapi/v1/webdata?crawl=300208&format=json

Gimme all my WARCs from Q1 of 2017 and collection 1068

https://partner.archive-it.org/wasapi/v1/webdata?collection=1068&crawl-time-after=2016-12-31&crawl-time-before=2017-04-01

WARRRRRRCs:

curl --user username:password 'https://partner.archive-it.org/wasapi/v1/webdata?collection=2950&format=json' | jq -r '.files[].filename' > WARRRRRRCs.txt
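
Because results are paginated, a single curl call like the one above may return only the first page. The Python sketch below walks every page and collects filenames. It assumes the response carries a `next` key holding the URL of the following page (or null on the last page); that key name is an assumption to verify against the API documentation. The function names are hypothetical.

```python
import base64
import json
import urllib.request


def basic_auth_fetcher(username, password):
    """Return a fetch_json(url) callable that sends HTTP basic auth credentials."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()

    def fetch_json(url):
        request = urllib.request.Request(
            url, headers={"Authorization": f"Basic {token}"})
        with urllib.request.urlopen(request) as response:
            return json.load(response)

    return fetch_json


def iter_filenames(first_url, fetch_json):
    """Yield every filename across all result pages."""
    url = first_url
    while url:
        page = fetch_json(url)
        for entry in page.get("files", []):
            yield entry["filename"]
        url = page.get("next")  # assumed pagination key; None ends the loop


if __name__ == "__main__":
    # Offline demonstration with two invented pages instead of a live request.
    fake_pages = {
        "page-1": {"files": [{"filename": "a.warc.gz"}], "next": "page-2"},
        "page-2": {"files": [{"filename": "b.warc.gz"}], "next": None},
    }
    print(list(iter_filenames("page-1", fake_pages.__getitem__)))
```

For a live run, `iter_filenames(url, basic_auth_fetcher("username", "password"))` would replace the fake pages.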

14 of 34

WASAPI: Archive-It Transfer API

GET A JOB!

Supports submitting jobs for generation of derivative datasets re WASAPI goal of expanding researcher / analytic access and use

Submit by HTTP POST to https://partner.archive-it.org/wasapi/v1/jobs
curl --user username:password -H 'Content-Type: application/json' -d '{"function": "build-wat","query": "collection=4783&crawl-time-after=2016-01-01&crawl-time-before=2017-01-01"}' https://partner.archive-it.org/wasapi/v1/jobs
Functions:
build-wat: build WAT (Web Archive Transformation) files
build-wane: build WANE (Web Archive Named Entities) files
build-cdx: build CDX (Capture Index) files
Use existing API query syntax to specify content targeted for job
Receive token for checking job status and use API to poll for status, a la https://partner.archive-it.org/wasapi/v1/jobs/136
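
The same submission can be sketched in Python. The block below only builds the POST request and does not send it. `build_job_request` is a hypothetical helper, and authentication is left out: a real call would also need the basic auth credentials or session cookie described earlier.

```python
import json
import urllib.request

# Endpoint and function names as given on the slide.
JOBS_ENDPOINT = "https://partner.archive-it.org/wasapi/v1/jobs"
KNOWN_FUNCTIONS = {"build-wat", "build-wane", "build-cdx"}


def build_job_request(function, query):
    """Build (but do not send) a job submission request (hypothetical helper)."""
    if function not in KNOWN_FUNCTIONS:
        raise ValueError(f"unknown job function: {function!r}")
    # "query" reuses the webdata query-string syntax to select the input files.
    body = json.dumps({"function": function, "query": query}).encode("utf-8")
    return urllib.request.Request(
        JOBS_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_job_request(
        "build-wat",
        "collection=4783&crawl-time-after=2016-01-01&crawl-time-before=2017-01-01",
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Sending it would be a matter of passing the request to `urllib.request.urlopen` once an `Authorization` header has been added.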

15 of 34

WASAPI: Archive-It Transfer API

GET A JOB! (Done)

{

"account": 1177,

"function": "build-wat",

"jobtoken": "136",

"query": "collection=4783&crawl-time-after=2016-01-01&crawl-time-before=2017-01-01",

"state": "complete",

"submit-time": "2017-06-03T22:49:13Z",

"termination-time": "2017-06-06T01:37:54Z"

}

GET A JOB! (Results)

same as file fields array, with relevant changes to hash, location, size, filetype/name, etc
query by filetype or job, a la https://partner.archive-it.org/wasapi/v1/jobs/{jobtoken}/result
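
Polling for completion might look like the Python sketch below. Only the `complete` state appears in the example above; any other state value (the demo uses an invented `running`) is simply treated as "not done yet", which is an assumption, and a real client would also want to handle failure states. `wait_for_job` and its parameters are hypothetical.

```python
import time


def wait_for_job(jobtoken, fetch_status, poll_seconds=60, max_polls=100,
                 sleep=time.sleep):
    """Poll a job until its state is "complete" and return the status document.

    fetch_status(jobtoken) must return the parsed JSON status for that job,
    e.g. by requesting .../wasapi/v1/jobs/{jobtoken}.
    """
    for _ in range(max_polls):
        status = fetch_status(jobtoken)
        if status.get("state") == "complete":
            return status
        sleep(poll_seconds)  # wait before asking again
    raise TimeoutError(f"job {jobtoken} not complete after {max_polls} polls")


if __name__ == "__main__":
    # Offline demonstration: two invented "running" replies, then completion.
    replies = iter([
        {"jobtoken": "136", "state": "running"},
        {"jobtoken": "136", "state": "running"},
        {"jobtoken": "136", "state": "complete"},
    ])
    final = wait_for_job("136", lambda token: next(replies),
                         poll_seconds=0, sleep=lambda seconds: None)
    print(final["state"])
```

Injecting `fetch_status` and `sleep` keeps the loop testable without network access or real waiting.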

16 of 34

WASAPI: Web Archiving Systems APIs

Work remaining

Minor AIT API features
Recipes and utilities (testers welcome!)
Community building research & report
A few papers on WA APIs
Ongoing surveys and research
Other APIs in WASAPI (past & future)

18 of 34

Poll for Latest Captures ...

1. Login to Archive-It

2. Select collection

3. Copy WARC urls from list

19 of 34

… Download and Validate

4. Format url list into BagIt holey bag fetch manifest

5. Retrieve WARCs via BagIt library to “fill” bag

6. Validate checksums; rerun until ok

20 of 34

… Accessioning into IR

7. Prepare list of crawls

8. Register for ingest

9. Accession

10. Validate crawls in Stanford Digital Repository

11. Validate crawls in Stanford Wayback

12. Validate seed thumbnail creation

13. Clean up after accessioning

21 of 34

Poll, Download, Accession

1. Login to Archive-It

2. Select collection

3. Copy desired WARC urls from list

4. Format url list into BagIt holey bag fetch manifest

5. Retrieve WARCs via BagIt library to “fill” bag

6. Validate checksums; rerun until ok

7. Prepare list of crawls

8. Register for ingest

9. Accession

10. Validate crawls in Stanford Digital Repository

11. Validate crawls in Stanford Wayback

12. Validate seed thumbnail creation

13. Clean up after accessioning

24 of 34

Grant Deliverable

1. Login to Archive-It

2. Select collection

3. Copy desired WARC urls from list

4. Format url list into BagIt holey bag fetch manifest

5. Retrieve WARCs via BagIt library to “fill” bag

6. Validate checksums; rerun until ok

Given collection ID and [dates|lower limit crawl id]:

Determine new crawl id(s) from WASAPI
Get WARC filenames from WASAPI
Download WARCs from WASAPI
Validate checksums; re-download until ok

WASAPI Download Utility:

(non-institution specific)

Login
Which Data?
Get and Validate Data
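
The "validate checksums; re-download until ok" step can be sketched in Python as below. This is an illustration of the idea only and does not reflect how wasapi-downloader itself is implemented. `file_checksum` and `download_until_valid` are hypothetical names, and `entry` is assumed to be one item of the API's files array with its `checksums` (md5, sha1) and `locations` fields.

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="md5", chunk_size=1024 * 1024):
    """Hex digest of a file, read in chunks so large WARCs fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_until_valid(entry, download, dest_path, max_attempts=3):
    """Download entry's first location until every listed checksum matches.

    download(url, dest_path) is supplied by the caller. Returns the number of
    attempts used; raises IOError if no attempt produced a valid file.
    """
    for attempt in range(1, max_attempts + 1):
        download(entry["locations"][0], dest_path)
        if all(file_checksum(dest_path, name) == expected.lower()
               for name, expected in entry["checksums"].items()):
            return attempt
    raise IOError(f"checksum mismatch after {max_attempts} attempts: {dest_path}")


if __name__ == "__main__":
    # Offline demonstration: the first fake download is corrupt, the second is good.
    good = b"pretend WARC bytes"
    entry = {
        "locations": ["https://example.org/EXAMPLE.warc.gz"],  # placeholder URL
        "checksums": {"md5": hashlib.md5(good).hexdigest(),
                      "sha1": hashlib.sha1(good).hexdigest()},
    }
    attempts_seen = []

    def fake_download(url, dest_path):
        attempts_seen.append(url)
        with open(dest_path, "wb") as handle:
            handle.write(b"corrupt" if len(attempts_seen) == 1 else good)

    with tempfile.TemporaryDirectory() as folder:
        target = os.path.join(folder, "EXAMPLE.warc.gz")
        print(download_until_valid(entry, fake_download, target))
```

Capping the retries avoids looping forever on a file that is corrupt at the source.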

25 of 34

Grant Deliverable

Production Quality
Open Source
Stanford Libraries (DLSS) best software practices:

Maintainable

Continuous Integration
Excellent Test Coverage
Deployment: Automate-able

Documented
Versioned
“Agile”

Team based
1 week sprints (total: 5 weeks)

John Martin

Tommy Ingulfsen

26 of 34

Grant Deliverable

Choices Made:

Leverage LOCKSS engineering knowledge

[Tool logos not reproduced] Tools chosen for: testing, test coverage, build, static analysis and style checking, deployment

Download directly; do not use BagIt

27 of 34

Grant Deliverable

https://github.com/sul-dlss/wasapi-downloader

31 of 34

Grant Deliverable

https://github.com/sul-dlss/wasapi-downloader

https://www.youtube.com/watch?v=hrI1U6VDB7c

32 of 34

wasapi-downloader

Work remaining

Merge Nick Ruest’s pull request
Archive-It API change communication
GitHub “Release” v1.0.0
Use in Production at Stanford
Broader Testing
Updates as Archive-It implements API changes
https://github.com/sul-dlss/wasapi-downloader/issues
Update Stanford Internal Docs

33 of 34

THANKS!

Jefferson Bailey, Internet Archive (jefferson@archive.org)

Naomi Dushay, Stanford University (ndushay@stanford.edu)

WASAPI on the webs

https://github.com/WASAPI-Community

https://archive.org/details/wasapi

https://github.com/sul-dlss/wasapi-downloader

https://www.youtube.com/watch?v=hrI1U6VDB7c

https://wasapi.slack.com/ (We can add you)

https://groups.google.com/forum/#!forum/wasapi-community

34 of 34

Discussion Questions

What APIs have attendees built, or are currently using, in their web archiving activities?
Are these APIs RESTful? If not, why not?
What frameworks/languages were they built with? What are other notable characteristics of their development and maintenance?
What part of the web archiving lifecycle would most benefit from next-stage API development, post-WASAPI?