Jefferson Bailey, Internet Archive (jefferson@archive.org | @jefferson_bail)
Naomi Dushay, Stanford University (ndushay@stanford.edu)
IIPC WAC 2017 | https://github.com/WASAPI-Community/data-transfer-apis
WASAPI: (Web Archiving Systems APIs)
Project Updates & Data Transfer APIs Specifications and Demonstrations
Outline:
Goals & Outcomes:
Work to date -- Research, Education & Community
National Symposium on Web Archiving Interoperability
@ Internet Archive, Feb 22-23, 2017
Work to date -- Development
WASAPI: Archive-It Transfer API
Written in Python; meets all general-specification criteria; Swagger YAML is in the repo
Auth: uses the Archive-It (AIT) Django framework (same as the web app); auth is not defined in the general specification
Basic endpoint: https://partner.archive-it.org/wasapi/v1/webdata (in production!)
Query parameters:
Some caveats!
Response:
Files fields:
Sample queries!
Gimme all my WARCs for the #blacklivesmatter collection (2950)
https://partner.archive-it.org/wasapi/v1/webdata?collection=2950&format=json
Gimme all my WARCs for a specific crawl (300208)
https://partner.archive-it.org/wasapi/v1/webdata?crawl=300208&format=json
Gimme all my WARCs from Q1 of 2017 and collection 1068
WARRRRRRCs:
curl --user username:password 'https://partner.archive-it.org/wasapi/v1/webdata?collection=2950&format=json' | jq -r '.files[].filename' > WARRRRRRCs.txt
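The sample queries above can also be assembled programmatically. A minimal Python sketch, under these assumptions: `build_webdata_url` and `filenames` are hypothetical helper names; the `crawl-time-after` / `crawl-time-before` parameters are borrowed from the job query shown on the "GET A JOB! (Done)" slide; the Q1 date bounds, and whether the API treats them as inclusive, are guesses; `sample_response` is made-up illustration data, not real API output.

```python
from urllib.parse import urlencode

BASE_URL = "https://partner.archive-it.org/wasapi/v1/webdata"


def build_webdata_url(params):
    """Return a webdata query URL; params is a dict of query parameters."""
    return f"{BASE_URL}?{urlencode(params)}"


def filenames(response):
    """List the filename of every entry under "files" in a parsed JSON response."""
    return [entry["filename"] for entry in response.get("files", [])]


# Assumed date bounds for "Q1 of 2017 and collection 1068".
q1_url = build_webdata_url({
    "collection": 1068,
    "crawl-time-after": "2017-01-01",
    "crawl-time-before": "2017-04-01",
    "format": "json",
})

# Made-up response body, only to show the shape the helper expects.
sample_response = {
    "files": [
        {"filename": "EXAMPLE-1.warc.gz"},
        {"filename": "EXAMPLE-2.warc.gz"},
    ]
}

print(q1_url)
print(filenames(sample_response))
```

No request is sent here; the URL would still need the same credentials as the curl example.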
GET A JOB!
Supports submitting jobs that generate derivative datasets, in line with the WASAPI goal of expanding researcher and analytic access and use
GET A JOB! (Done)
{
"account": 1177,
"function": "build-wat",
"jobtoken": "136",
"query": "collection=4783&crawl-time-after=2016-01-01&crawl-time-before=2017-01-01",
"state": "complete",
"submit-time": "2017-06-03T22:49:13Z",
"termination-time": "2017-06-06T01:37:54Z"
}
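A client polling for job status could read a record like the one above as follows. This is a sketch only: `parse_time` and `summarize_job` are hypothetical names, and "complete" is the only state value this deck shows, so other states are simply treated as "not done".

```python
import json
from datetime import datetime, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_time(value):
    """Parse a UTC timestamp such as 2017-06-03T22:49:13Z."""
    return datetime.strptime(value, TIME_FORMAT).replace(tzinfo=timezone.utc)


def summarize_job(job):
    """Return (is_complete, elapsed) for one job status record.

    elapsed is the time between submission and termination, or None
    if the record has no termination-time yet.
    """
    submitted = parse_time(job["submit-time"])
    terminated = job.get("termination-time")
    elapsed = parse_time(terminated) - submitted if terminated else None
    return job["state"] == "complete", elapsed


# The job record from the slide, copied verbatim.
job = json.loads("""
{
  "account": 1177,
  "function": "build-wat",
  "jobtoken": "136",
  "query": "collection=4783&crawl-time-after=2016-01-01&crawl-time-before=2017-01-01",
  "state": "complete",
  "submit-time": "2017-06-03T22:49:13Z",
  "termination-time": "2017-06-06T01:37:54Z"
}
""")

done, elapsed = summarize_job(job)
print(done, elapsed)
```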
GET A JOB! (Results)
WASAPI: Web Archiving Systems APIs
Work remaining
: Poll for Latest Captures ...
1. Log in to Archive-It
2. Select collection
3. Copy WARC URLs from list
: … Download and Validate
4. Format URL list into BagIt holey bag fetch manifest
5. Retrieve WARCs via BagIt library to “fill” bag
6. Validate checksums; rerun until ok
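Step 4 can be sketched in a few lines. In a BagIt holey bag, each `fetch.txt` line gives a URL, a length in octets (or `-` when unknown), and the file's path inside the bag. The helper names below are hypothetical, the `locations`, `size`, and `filename` keys are assumptions about the file records, and the example.org URLs are placeholders.

```python
def fetch_line(url, filename, size=None):
    """Format one fetch.txt entry: URL, length (or '-'), path under data/."""
    length = str(size) if size is not None else "-"
    return f"{url} {length} data/{filename}"


def fetch_manifest(files):
    """Build the full fetch.txt text from a list of file records.

    Uses the first listed location of each record (an arbitrary choice).
    """
    lines = [
        fetch_line(entry["locations"][0], entry["filename"], entry.get("size"))
        for entry in files
    ]
    return "\n".join(lines) + "\n"


# Placeholder records, not real Archive-It data.
example_files = [
    {
        "filename": "EXAMPLE-1.warc.gz",
        "size": 1024,
        "locations": ["https://example.org/EXAMPLE-1.warc.gz"],
    },
    {
        "filename": "EXAMPLE-2.warc.gz",
        "locations": ["https://example.org/EXAMPLE-2.warc.gz"],
    },
]

manifest = fetch_manifest(example_files)
print(manifest, end="")
```

A BagIt library would then read this manifest to retrieve the files (step 5).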
: … Accessioning into IR
7. Prepare list of crawls
8. Register for ingest
9. Accession
10. Validate crawls in Stanford Digital Repository
11. Validate crawls in Stanford Wayback
12. Validate seed thumbnail creation
13. Clean up after accessioning
: Grant Deliverable
1. Log in to Archive-It
2. Select collection
3. Copy desired WARC URLs from list
4. Format URL list into BagIt holey bag fetch manifest
5. Retrieve WARCs via BagIt library to “fill” bag
6. Validate checksums; rerun until ok
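The "validate checksums; rerun until ok" step amounts to a retry loop around a digest comparison. A minimal sketch, assuming SHA-1 digests and a caller-supplied `fetch` function; `file_digest`, `fetch_until_valid`, and `flaky_fetch` are hypothetical names, and `flaky_fetch` only simulates a download that fails once.

```python
import hashlib
import os
import tempfile


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in chunks so large WARCs need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_until_valid(fetch, path, expected, algorithm="sha1", max_attempts=3):
    """Call fetch(path) until the file's digest matches; return attempts used."""
    for attempt in range(1, max_attempts + 1):
        fetch(path)
        if file_digest(path, algorithm) == expected:
            return attempt
    raise RuntimeError(f"checksum mismatch for {path} after {max_attempts} attempts")


attempts_log = []


def flaky_fetch(path):
    """Stand-in for a real download: the first attempt writes truncated data."""
    attempts_log.append(path)
    data = b"WARC/1.0" if len(attempts_log) > 1 else b"WAR"
    with open(path, "wb") as handle:
        handle.write(data)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "example.warc")
    expected = hashlib.sha1(b"WARC/1.0").hexdigest()
    attempts = fetch_until_valid(flaky_fetch, target, expected)

print(attempts)
```

Capping the retries keeps a permanently corrupt file from looping forever; the limit of 3 is arbitrary.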
Given collection ID and [dates|lower limit crawl id]:
WASAPI Download Utility:
(non-institution specific)
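The inputs named above (a collection ID plus either dates or a lower-limit crawl ID) can be illustrated with a small client-side filter. This is not the wasapi-downloader implementation, which may instead pass these values as query parameters. `select_files` is a hypothetical name, the `collection`, `crawl`, and `crawl-time` keys are assumed record fields, and both limits are treated as inclusive by assumption.

```python
def select_files(files, collection, crawl_lower_limit=None, crawl_time_after=None):
    """Return filenames in a collection at or above the given limits.

    crawl_time_after is compared as a string, which works only when all
    timestamps share one ISO 8601 UTC format.
    """
    selected = []
    for entry in files:
        if entry["collection"] != collection:
            continue
        if crawl_lower_limit is not None and entry["crawl"] < crawl_lower_limit:
            continue
        if crawl_time_after is not None and entry["crawl-time"] < crawl_time_after:
            continue
        selected.append(entry["filename"])
    return selected


# Placeholder records, not real Archive-It data.
records = [
    {"filename": "A.warc.gz", "collection": 1, "crawl": 100, "crawl-time": "2017-01-15T00:00:00Z"},
    {"filename": "B.warc.gz", "collection": 1, "crawl": 200, "crawl-time": "2017-03-01T00:00:00Z"},
    {"filename": "C.warc.gz", "collection": 2, "crawl": 300, "crawl-time": "2017-03-02T00:00:00Z"},
]

print(select_files(records, collection=1, crawl_lower_limit=200))
print(select_files(records, collection=1, crawl_time_after="2017-01-01T00:00:00Z"))
```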
: Grant Deliverable
John Martin
Tommy Ingulfsen
: Grant Deliverable
Choices Made:
wasapi-downloader
Work remaining
THANKS!
Jefferson Bailey, Internet Archive (jefferson@archive.org)
Naomi Dushay, Stanford University (ndushay@stanford.edu)
Discussion Questions