
Jefferson Bailey, Internet Archive (jefferson@archive.org | @jefferson_bail)

Naomi Dushay, Stanford University (ndushay@stanford.edu)

IIPC WAC 2017 | https://github.com/WASAPI-Community/data-transfer-apis

WASAPI: (Web Archiving Systems APIs)

Project Updates & Data Transfer APIs Specifications and Demonstrations


WASAPI: Web Archiving Systems APIs

Outline:

  • Jefferson
    • WASAPI Project Background
    • Work to Date: Research & Community Building
    • Work to Date: Data Transfer APIs
    • Archive-It API: Specification & Demonstration
  • Naomi
    • Stanford Use of Archive-It API
    • Work outline
    • Engineering Approach
    • Demo (Video)
  • Questions & Discussion (please ask them!)


WASAPI: Web Archiving Systems APIs

  • “Systems Interoperability and Collaborative Development for Web Archives”
  • National Leadership Grant, National Digital Platform, R&D
  • IA/AIT (PI), Stanford, LOCKSS, UNT, Rutgers
  • 2-year project (Jan 2016 - Dec 2017)
  • R&D + National Symposium + APIs


WASAPI: Web Archiving Systems APIs

Goals & Outcomes:

  1. Build WARC & derivative dataset APIs (AIT & LOCKSS) and test via transfer to partners (SUL, UNT, Rutgers) to enable better distributed preservation and access
  2. Seed & launch a community modeled on the characteristics of successful development and participation in communities identified by the project
  3. Sketch a blueprint and technical model for future web archiving APIs informed by project R&D
  4. Seed a technical infrastructure to facilitate more computational research use of web archives


WASAPI: Web Archiving Systems APIs

Work to date -- Research, Education & Community

  1. WARC & Digital Preservation Surveys
  2. Online Webinars & Demos
  3. National Symposium
  4. Presentations & Working Groups


WASAPI: Web Archiving Systems APIs

  • WASAPI & AIT “State of the WARC” Preservation Surveys (plus NDSA Web Archiving Survey)
    1. 15%-20% of respondents are downloading their WARCs for local preservation (33% plan/hope to)
    2. Institution, collection, crawl, seed, date-range all identified as main data points for transfer/access
    3. Broad interest in streamlining process, but systems for local preservation remain disparate
    4. Transfer of WARCs/datasets for researcher access small but growing


WASAPI: Web Archiving Systems APIs

Work to date -- Research, Education & Community

  • Video trainings/demos on WA APIs
    • SAA WA Section + SUL demos
  • Presentations & Working Groups
    • Mucho preso + TWG notes


National Symposium on Web Archiving Interoperability

@ Internet Archive, Feb 22-23, 2017

  • 40+ Institutions from United States and Canada
  • Orgs included custodial, research, and engineering reps
  • Presentations focused on local uses of existing APIs (search, CDX, etc) and emerging tools
  • Affiliated Archives Unleashed event
  • Agenda, docs, presos in GitHub

  • Breakout Groups
  • Community fractured among conferences, travel challenges
  • Desire for more interaction between practitioners & developers
  • WA still needs broader institutional understanding & buy-in
  • Need a marquee event and unified public & member comms channels


WASAPI: Web Archiving Systems APIs

Work to date -- Development

  • General Specification (on GitHub)
  • LOCKSS Implementation (on GitHub)
  • Archive-It Implementation (on GitHub)
  • Archive-It API documentation (on GitHub)
  • Testing and utilities (in progress)


WASAPI: Archive-It Transfer API

Written in Python; meets all general-specification criteria; Swagger YAML in the repos

Auth: Uses AIT Django framework (same as web app) -- auth is not defined in the general specification

  • Browser cookies OR HTTP basic auth (log in, or pass credentials via CLI)

Basic endpoint: https://partner.archive-it.org/wasapi/v1/webdata (in production!)

  • Base path returns all WARCs for that account; base/all results are paginated

Query parameters:

  • filename -- limited use but knowable via AIT CDX/C API
  • filetype -- currently just WARCs, but others (derivatives) in dev
  • collection -- ID designating a specific AIT collection [repeatable param]
  • crawl -- ID designating a specific AIT crawl job
  • crawl-time -- uses WARC creation date; crawl-time-before / crawl-time-after
  • crawl-start -- uses crawl job start date; crawl-start-before / crawl-start-after
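The query parameters above can be combined into a single request URL. The following is a minimal sketch, assuming only the endpoint and parameter names shown on this slide; the function name `build_webdata_url` and its keyword arguments are hypothetical and not part of any Archive-It client.

```python
from urllib.parse import urlencode

# Endpoint taken from the slide; everything else here is illustrative.
BASE_URL = "https://partner.archive-it.org/wasapi/v1/webdata"


def build_webdata_url(collections=(), crawl=None, time_after=None, time_before=None):
    """Return a webdata query URL (hypothetical helper, not an official client)."""
    params = []
    # "collection" is a repeatable parameter, so emit one pair per ID.
    for collection_id in collections:
        params.append(("collection", str(collection_id)))
    if crawl is not None:
        params.append(("crawl", str(crawl)))
    if time_after is not None:
        params.append(("crawl-time-after", time_after))
    if time_before is not None:
        params.append(("crawl-time-before", time_before))
    if not params:
        # The bare endpoint lists every file for the account (paginated).
        return BASE_URL
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_webdata_url(collections=[2950, 1068], time_after="2016-12-31"))
```

Passing the parameters as a list of pairs rather than a dict is what allows `collection` to appear more than once.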

Some caveats!


WASAPI: Archive-It Transfer API

Response:

  • JSON object has: pagination, count, request-url, includes-extra

Files fields:

  • account: the numeric Archive-It account identifier
  • checksums: an object with md5 and sha1 keys and hexadecimal checksum values
  • collection: the numeric Archive-It collection identifier
  • crawl: the numeric Archive-It crawl job identifier
  • crawl-time: an RFC3339 date stamp of the time the webdata file was created
  • crawl-start: an optional RFC3339 date stamp of the time the crawl job started
  • filename: the name of the webdata file (without any directory path)
  • filetype: the format of the webdata file, e.g. warc, wat, wane, cdx
  • locations: a list of sources from which to retrieve the webdata file
  • size: the size in bytes of the webdata file
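One way to consume the fields listed above is to map each entry of the response's `files` array onto a small record type. This is a sketch under stated assumptions: the `WebdataFile` class and `parse_files` function are hypothetical names, and the sample payload is invented placeholder data rather than a real Archive-It response.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class WebdataFile:
    """Subset of the per-file fields needed to download and verify a file."""
    filename: str
    filetype: str
    size: int
    checksums: Dict[str, str]
    locations: List[str]


def parse_files(response_text):
    """Turn a webdata JSON response into a list of WebdataFile records."""
    payload = json.loads(response_text)
    return [
        WebdataFile(
            filename=entry["filename"],
            filetype=entry["filetype"],
            size=int(entry["size"]),
            checksums=dict(entry["checksums"]),
            locations=list(entry["locations"]),
        )
        for entry in payload.get("files", [])
    ]


# Invented sample: every value below is a placeholder, not real data.
SAMPLE = json.dumps({
    "count": 1,
    "files": [{
        "account": 1,
        "collection": 2,
        "crawl": 3,
        "crawl-time": "2017-01-01T00:00:00Z",
        "filename": "EXAMPLE-0001.warc.gz",
        "filetype": "warc",
        "checksums": {"md5": "0" * 32, "sha1": "0" * 40},
        "locations": ["https://example.org/EXAMPLE-0001.warc.gz"],
        "size": 1024,
    }],
})

if __name__ == "__main__":
    for record in parse_files(SAMPLE):
        print(record.filename, record.size, record.locations[0])
```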


WASAPI: Archive-It Transfer API

Sample queries!

Gimme all my WARCs for the #blacklivesmatter collection (2950)

https://partner.archive-it.org/wasapi/v1/webdata?collection=2950&format=json

Gimme all my WARCs for a specific crawl (300208)

https://partner.archive-it.org/wasapi/v1/webdata?crawl=300208&format=json

Gimme all my WARCs from Q1 of 2017 and collection 1068

https://partner.archive-it.org/wasapi/v1/webdata?collection=1068&crawl-time-after=2016-12-31&crawl-time-before=2017-04-01

WARRRRRRCs:

curl --user username:password 'https://partner.archive-it.org/wasapi/v1/webdata?collection=2950&format=json' | jq -r '.files[].filename' > WARRRRRRCs.txt
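For readers who prefer a script to a shell pipeline, the same filename listing can be sketched with the Python standard library. This assumes HTTP basic auth as described earlier in the deck; `build_request`, `filenames`, and `fetch_filenames` are hypothetical helper names, and `fetch_filenames` needs network access and valid credentials, so it is defined here but not called.

```python
import base64
import json
import urllib.request


def build_request(url, username, password):
    """Build a GET request carrying HTTP basic auth, like `curl --user`."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def filenames(payload):
    """Collect the filename of every entry under the response's "files" key."""
    return [entry["filename"] for entry in payload["files"]]


def fetch_filenames(url, username, password):
    """Perform the request and return the filenames (requires network access)."""
    with urllib.request.urlopen(build_request(url, username, password)) as response:
        return filenames(json.load(response))
```

Only the first page of results is read here; following pagination is left out of this sketch.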


WASAPI: Archive-It Transfer API

GET A JOB!

Supports submitting jobs for generation of derivative datasets, in line with the WASAPI goal of expanding researcher / analytic access and use

  • Submit by HTTP POST to https://partner.archive-it.org/wasapi/v1/jobs
  • curl --user username:password -H 'Content-Type: application/json' -d '{"function": "build-wat","query": "collection=4783&crawl-time-after=2016-01-01&crawl-time-before=2017-01-01"}' https://partner.archive-it.org/wasapi/v1/jobs
  • Functions:
    • build-wat: build WAT (Web Archive Transformation) files
    • build-wane: build WANE (Web Archive Named Entities) files
    • build-cdx: build CDX (Capture Index) files
  • Use existing API query syntax to specify content targeted for job
  • Receive token for checking job status and use API to poll for status, a la https://partner.archive-it.org/wasapi/v1/jobs/136
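The job submission described above amounts to POSTing a small JSON document whose `query` field reuses the webdata query syntax. The following sketch builds that document and the matching request object without sending it; `job_payload` and `job_request` are hypothetical helper names, and the convention of mapping underscores in keyword names to hyphens is an assumption made here for convenience.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Endpoint and function names are the ones listed on the slide.
JOBS_URL = "https://partner.archive-it.org/wasapi/v1/jobs"
FUNCTIONS = {"build-wat", "build-wane", "build-cdx"}


def job_payload(function, **query):
    """Build the JSON body for a derivative-generation job."""
    if function not in FUNCTIONS:
        raise ValueError(f"unknown job function: {function}")
    # Keyword names use underscores; the API parameters use hyphens.
    pairs = [(key.replace("_", "-"), str(value)) for key, value in query.items()]
    return {"function": function, "query": urlencode(pairs)}


def job_request(payload, auth_header=None):
    """Wrap the payload in a POST request (not sent by this function)."""
    headers = {"Content-Type": "application/json"}
    if auth_header:
        headers["Authorization"] = auth_header
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(JOBS_URL, data=data, headers=headers, method="POST")


if __name__ == "__main__":
    payload = job_payload(
        "build-wat",
        collection=4783,
        crawl_time_after="2016-01-01",
        crawl_time_before="2017-01-01",
    )
    print(json.dumps(payload))
```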


WASAPI: Archive-It Transfer API

GET A JOB! (Done)

{
  "account": 1177,
  "function": "build-wat",
  "jobtoken": "136",
  "query": "collection=4783&crawl-time-after=2016-01-01&crawl-time-before=2017-01-01",
  "state": "complete",
  "submit-time": "2017-06-03T22:49:13Z",
  "termination-time": "2017-06-06T01:37:54Z"
}
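A client can poll the job record until its state is "complete" and then read the timestamps. The sketch below works on the record shown on this slide; `parse_utc`, `elapsed`, and `wait_for_job` are hypothetical helpers, the status fetcher is supplied by the caller, and "complete" is the only state value the slide confirms, so any other state is simply treated as "not finished yet".

```python
from datetime import datetime, timezone


def parse_utc(stamp):
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' stamp like those in the job record."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def elapsed(job):
    """Return termination-time minus submit-time, or None if not terminated."""
    if not job.get("termination-time"):
        return None
    return parse_utc(job["termination-time"]) - parse_utc(job["submit-time"])


def wait_for_job(fetch_status, max_polls=10, pause=lambda: None):
    """Call fetch_status() until the job reports "complete".

    fetch_status is a caller-supplied function returning the job record as a
    dict; pause runs between polls (for example lambda: time.sleep(60)).
    """
    for _ in range(max_polls):
        job = fetch_status()
        if job["state"] == "complete":
            return job
        pause()
    raise TimeoutError("job did not complete within the polling budget")


# The job record from the slide.
JOB = {
    "account": 1177,
    "function": "build-wat",
    "jobtoken": "136",
    "query": "collection=4783&crawl-time-after=2016-01-01&crawl-time-before=2017-01-01",
    "state": "complete",
    "submit-time": "2017-06-03T22:49:13Z",
    "termination-time": "2017-06-06T01:37:54Z",
}

if __name__ == "__main__":
    print(elapsed(wait_for_job(lambda: JOB)))
```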



WASAPI: Web Archiving Systems APIs

Work remaining

  • Minor AIT API features
  • Recipes and utilities (testers welcome!)
  • Community building research & report
  • A few papers on WA APIs
  • Ongoing surveys and research
  • Other APIs in WASAPI (past & future)


Poll for Latest Captures ...

1. Login to Archive-It

2. Select collection

3. Copy WARC urls from list


… Download and Validate

4. Format url list into BagIt holey bag fetch manifest

5. Retrieve WARCs via BagIt library to “fill” bag

6. Validate checksums; rerun until ok


… Accessioning into IR

7. Prepare list of crawls

8. Register for ingest

9. Accession

10. Validate crawls in Stanford Digital Repository

11. Validate crawls in Stanford Wayback

12. Validate seed thumbnail creation

13. Clean up after accessioning



Grant Deliverable

1. Login to Archive-It

2. Select collection

3. Copy desired WARC urls from list

4. Format url list into BagIt holey bag fetch manifest

5. Retrieve WARCs via BagIt library to “fill” bag

6. Validate checksums; rerun until ok


Given collection ID and [dates|lower limit crawl id]:

  • Determine new crawl id(s) from WASAPI
  • Get WARC filenames from WASAPI
  • Download WARCs from WASAPI
  • Validate checksums; re-download until ok

WASAPI Download Utility:

(non-institution specific)

  • Login
  • Which Data?
  • Get and Validate Data
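The "validate checksums; re-download until ok" step can be sketched as a retry loop around a checksum comparison. This is an illustration only, not the wasapi-downloader implementation: `file_checksum` and `download_until_valid` are hypothetical names, the download step is a caller-supplied function, and the retry limit of three is an arbitrary assumption.

```python
import hashlib


def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large WARCs need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_until_valid(download, path, expected, algorithm="md5", max_attempts=3):
    """Run download(path) until the file matches the expected checksum.

    download is any callable that writes the file to path. Returns the number
    of attempts used; raises RuntimeError once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        download(path)
        if file_checksum(path, algorithm) == expected.lower():
            return attempt
    raise RuntimeError(f"checksum mismatch for {path} after {max_attempts} attempts")
```

The expected value would come from the `checksums` object in the API response, which carries both `md5` and `sha1` keys.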


Grant Deliverable

  • Production Quality
  • Open Source
  • Stanford Libraries (DLSS) best software practices:
    • Maintainable
      • Continuous Integration
      • Excellent Test Coverage
      • Deployment: Automate-able
    • Documented
    • Versioned
    • “Agile”
      • Team based
      • 1 week sprints (total: 5 weeks)


John Martin

Tommy Ingulfsen


Grant Deliverable

Choices Made:

  • Leverage LOCKSS engineering knowledge

  • Tools chosen for testing, test coverage, build, static analysis and style checking, and deployment

  • Download directly; do not use BagIt


wasapi-downloader

Work remaining

  • Merge Nick Ruest’s pull request
  • Archive-It API change communication
  • GitHub “Release” v1.0.0
  • Use in Production at Stanford
  • Broader Testing
  • Updates as Archive-It implements API changes
  • https://github.com/sul-dlss/wasapi-downloader/issues
  • Update Stanford Internal Docs


THANKS!

Jefferson Bailey, Internet Archive (jefferson@archive.org)

Naomi Dushay, Stanford University (ndushay@stanford.edu)


Discussion Questions

  • What APIs have attendees built, or are currently using, in their web archiving activities?
  • Are these APIs RESTful? If not, why not?
  • What frameworks/languages were they built with? What are other notable characteristics of their development and maintenance?
  • What part of the web archiving lifecycle would most benefit from next-stage API development, post-WASAPI?