Tools For Personal Web Archiving
WARCreate, WAIL, and Webrecorder
John Berlin (@johnaberlin, N0taN3rd)
Who Am I?
John Berlin, an MS student (thesis option)
Who is also a member of @WebSciDL
I have written some WSDL blog posts
CNN.com has been unarchivable since November 1st, 2016
A State Of Replay or Location, Location, Location
Replacing Heritrix with Chrome in WAIL...
These blog posts helped shape the topic of my master's thesis (hint hint)
I love JavaScript, browser-based preservation, and crawlers, and I used to be a Java guy
What do I listen to besides Powerviolence:
Crust Punk, Hardcore Punk, Punk, Doom, Sludge, Tech Death, Post Hardcore, Noise/Drone, Grindcore, Slam, Progressive Metal, Deep|Tech|Progressive Trance, GOA, Techno, Bass, EuroTrash, Drum N Bass / Dubstep, Experimental/Ambient, Vapor|Ritual|Synth Wave, Witch|Acid|Tech|Deep|Future|Bass House, ......
It Begins With Hoarding Files
WARC Files (Web ARChive) ISO 28500:2017
Flat UTF-8 files that
Store both the HTTP payload content and control information
May contain metadata about the data in the WARC or another WARC
Support compression and record integrity
Whose general format is
warc-file = 1*warc-record
warc-record = header CRLF block CRLF CRLF
header = version warc-fields
version = "WARC/1.0" CRLF
warc-fields = *named-field CRLF
block = *OCTET
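To make the grammar concrete, here is a minimal sketch in Python (standard library only) that serializes a record as version line, named fields, CRLF, block, CRLF CRLF, and then reads records back using Content-Length. The helper names `build_record` and `parse_records` are made up for illustration. The sketch assumes uncompressed input and one field per line, and it omits fields that real request and response records normally carry (for example Content-Type).

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_record(rec_type: str, target_uri: str, block: bytes) -> bytes:
    """Serialize one record: version, named fields, CRLF, block, CRLF CRLF."""
    fields = [
        ("WARC-Type", rec_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Length", str(len(block))),
    ]
    header = b"WARC/1.0" + CRLF
    header += b"".join(f"{name}: {value}".encode("utf-8") + CRLF for name, value in fields)
    return header + CRLF + block + CRLF + CRLF


def parse_records(data: bytes):
    """Yield (version, fields, block) for each record in uncompressed WARC bytes."""
    pos = 0
    while pos < len(data):
        # The header ends at the first empty line after the version line.
        header_end = data.index(CRLF + CRLF, pos)
        lines = data[pos:header_end].decode("utf-8").split("\r\n")
        version, field_lines = lines[0], lines[1:]
        fields = dict(line.split(": ", 1) for line in field_lines)
        # The block is located by Content-Length, since it may itself contain CRLF CRLF.
        block_start = header_end + len(CRLF) * 2
        block_end = block_start + int(fields["Content-Length"])
        yield version, fields, data[block_start:block_end]
        pos = block_end + len(CRLF) * 2  # skip the CRLF CRLF that closes the record


warc = build_record("request", "http://example.com/", b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n")
warc += build_record("response", "http://example.com/", b"HTTP/1.1 200 OK\r\n\r\nhello")
for version, fields, block in parse_records(warc):
    print(version, fields["WARC-Type"], fields["Content-Length"])
```

Locating the block by Content-Length rather than by searching for the trailing delimiter matters because an HTTP message inside the block contains its own blank line.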
WARC Record Structure
Two main content blocks
Content Block
WARC Header
Delimiters
WARC Info Record
This record type describes the records that follow it, up through the end of the file or until the next warcinfo record
Provides a parent ID that all other records are “concurrent” to; that is, it provides a parent ID for its child records
Considered a descriptive record
WARC Metadata Record
Contains additional information pertaining to the archived resource
Such as all the outlinks (links to other pages) contained in the HTML of the target URI
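As a rough illustration of where outlink metadata could come from, the sketch below collects the `href` targets of `<a>` tags from a page's HTML and resolves them against the target URI. It uses only the Python standard library. The names `OutlinkParser` and `extract_outlinks` are hypothetical, and this is not necessarily how any particular archiving tool gathers or formats its outlinks.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OutlinkParser(HTMLParser):
    """Collect absolute URLs for every <a href="..."> seen in the document."""

    def __init__(self, base_uri):
        super().__init__()
        self.base_uri = base_uri
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URI of the archived page.
                self.outlinks.append(urljoin(self.base_uri, value))


def extract_outlinks(base_uri, html):
    parser = OutlinkParser(base_uri)
    parser.feed(html)
    return parser.outlinks


page = '<p><a href="/about">About</a> <a href="https://other.example/x">Elsewhere</a></p>'
for link in extract_outlinks("http://example.com/docs/", page):
    print(link)
```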
WARC Request Record
Contains, you guessed it, HTTP request information
WARC Response Record
Contains an HTTP response, including the response body
Tools That Archive
Archiving With WGET
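A minimal sketch of driving wget from Python, assuming wget is installed and was built with WARC support. `--warc-file` and `--page-requisites` are real wget options. The helper name `wget_warc_command` and the output name `example` are placeholders for illustration.

```python
import shlex


def wget_warc_command(url: str, warc_name: str) -> list[str]:
    """Build a wget invocation that records the fetch into a WARC file."""
    return [
        "wget",
        f"--warc-file={warc_name}",  # wget appends .warc.gz to this name
        "--page-requisites",         # also fetch images, CSS, and scripts the page references
        url,
    ]


cmd = wget_warc_command("http://example.com/", "example")
print(shlex.join(cmd))
# To actually run the crawl: import subprocess; subprocess.run(cmd, check=True)
```

wget does not execute JavaScript, so anything a page loads from script at runtime will not end up in the WARC.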
Viewing Archived example.com
Viewing Archived React Router Documentation
A Thing Called JavaScript
WARCreate: WARC Creating Chrome Extension
Step 1: Add Extension To Chrome
Step 2: Archive
Step 3: Profit
What To Do With My WARCs
Reading WARCs: Node.js With Node-WARC
Reading WARCs: Python With warcio
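A small sketch of reading a WARC with warcio, assuming it is installed (`pip install warcio`). `ArchiveIterator`, `rec_type`, and `rec_headers.get_header` are part of warcio. The helpers `summarize` and `summarize_warc` are names invented here. The warcio import is placed inside `summarize_warc` so the counting logic can be tried without the library present.

```python
from collections import Counter


def summarize(records):
    """Count records by WARC-Type and collect the target URIs of response records."""
    counts = Counter()
    response_uris = []
    for record in records:
        counts[record.rec_type] += 1
        if record.rec_type == "response":
            response_uris.append(record.rec_headers.get_header("WARC-Target-URI"))
    return counts, response_uris


def summarize_warc(path):
    """Open a .warc or .warc.gz file and summarize its records."""
    # Imported here so summarize() can be exercised without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    with open(path, "rb") as stream:
        # ArchiveIterator handles gzip-compressed input transparently.
        return summarize(ArchiveIterator(stream))
```

Calling `summarize_warc("example.warc.gz")` (a placeholder filename) would return the per-type record counts and the list of archived response URIs.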
WAIL (Web Archiving Interface Layer)
Allows anyone, regardless of technical prowess, to build collections of web archives and to preserve and replay web pages from their personal computers (also a Mat Kelly original)
Tools included and made accessible through WAIL are Heritrix, Pywb, and WAIL Archiver (a not-yet-released, Electron-based, high-fidelity archival crawler)
In other words, Archive-It++
Open source and welcomes contributors
Distributed as a standalone desktop application for Linux, macOS, and Windows
Maintaining Collections Of “Archived” Web Pages Before WAIL
Collections Before WAIL
Collections In WAIL
Name of each of your collections
How many seeds (pages) per collection
Size of collection on disk
When you last added to or archived a seed in the collection
Viewing A Collection
Per seed information
Time added to collection
Last time archived
How you archived the seed
View Seed In Wayback
Viewing Your Archived Seed (Pages)
1. Click
Opens up your browser to the collection view for the page
Click
Enjoy
Adding WARC From Local Machine
Drag and drop (W)ARC
Attempts to determine seed
You choose correct seed
Adds (W)ARC to collection and ensures Wayback ingestion
Archiving Page On The Web
Enter URL
Check liveness and quick static analysis of seed
One click Heritrix crawl configuration and launch
Or start a browser-based crawl, depending on configuration
Twitter Monitoring And Archiving
Step 1
Required Fields
Handle (Screen Name)
Which collection
How Long
Twitter Monitoring And Archiving
Step 2
Optional Tweet Filtering
Words in tweet
Hashtags
Webrecorder
A web archiving platform that allows you to create high-fidelity, interactive web archives of any web site you browse (a Rhizome and Ilya Kreymer production; Ilya is the author of Pywb)
Free Service (must sign up) and comes with 5GB of storage.
Open source as well
Handles YouTube and video streaming services like a champ
Cool fact: Webrecorder is using my JavaScript overriding scheme in production. A Pywb beta featuring this is coming soon (among other things, also thanks to me)
Live Demo
Fin
Willy Wonka https://imgflip.com/i/1x01ip
Batman https://imgflip.com/i/1x02xb
Arnold Thumbs Up https://imgflip.com/i/1x05p0
404 https://imgflip.com/i/1x068k
Files Everywhere https://imgflip.com/i/1x1ge6
Use WARCS https://imgflip.com/i/1x1gsk
Supplementary Material
https://gist.github.com/N0taN3rd/1fc65512ef2572ea8c94f68c4583284a