1 of 32

Tools For Personal Web Archiving

WARCreate, WAIL, and Webrecorder

John Berlin (@johnaberlin, N0taN3rd)

2 of 32

Who Am I?

These blog posts helped shape the topic of my master's thesis (hint hint)

I love JavaScript, browser-based preservation, and crawlers, and I used to be a Java guy

What do I listen to besides Powerviolence:

Crust Punk, Hardcore Punk, Punk, Doom, Sludge, Tech Death, Post Hardcore, Noise/Drone, Grindcore, Slam, Progressive Metal, Deep|Tech|Progressive Trance, GOA, Techno, Bass, EuroTrash, Drum N Bass / Dubstep, Experimental/Ambient, Vapor|Ritual|Synth Wave, Witch|Acid|Tech|Deep|Future|Bass House, ......

3 of 32

It Begins With Hoarding Files

4 of 32

WARC Files (Web ARChive) ISO 28500:2017

Flat UTF-8 files that

Store both the HTTP payload content and control information

May contain metadata about the data in the WARC or another WARC

Support compression and record integrity

Whose general format is

warc-file   = 1*warc-record
warc-record = header CRLF
              block CRLF CRLF
header      = version warc-fields
version     = "WARC/1.0" CRLF
warc-fields = *named-field CRLF
block       = *OCTET
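A minimal sketch of serializing one record according to the grammar above, using only the Python standard library. The helper name `build_record`, the sample URI, and the sample HTTP response are illustrative assumptions, not part of any tool covered in these slides.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_record(record_type: str, target_uri: str, block: bytes,
                 content_type: str = "application/http; msgtype=response") -> bytes:
    """Serialize one record: version line, named fields, blank line, block, CRLF CRLF."""
    fields = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(block))),
    ]
    header = b"WARC/1.0" + CRLF                      # version
    for name, value in fields:                       # warc-fields
        header += f"{name}: {value}".encode("utf-8") + CRLF
    # header CRLF block CRLF CRLF
    return header + CRLF + block + CRLF + CRLF


# Hypothetical captured HTTP response used as the record block.
http_response = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
record = build_record("response", "http://example.com/", http_response)
```

Because a warc-file is just one or more records back to back, concatenating several `build_record` results yields a (tiny, uncompressed) WARC file.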

5 of 32

WARC Record Structure

Two main parts

WARC Header

  • WARC fields: info about the WARC itself and/or the current record
  • Content-Type: “type” of the content block
  • Content-Length: length of the content block in bytes

Content Block

  • The actual contents of the WARC record

Delimiters

  • A single CRLF (a blank line) ends the header: the content block comes next
  • Two CRLFs after the content block mark the end of the record
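A rough sketch of reading one record back, following the grammar shown earlier (header CRLF, then block CRLF CRLF). The function name `parse_record` is an assumption for illustration, and the sample record is trimmed: a real record also carries WARC-Record-ID and WARC-Date.

```python
def parse_record(raw: bytes):
    """Split one serialized record into (version, fields, block)."""
    # The first blank line separates the header from the content block.
    header, _, rest = raw.partition(b"\r\n\r\n")
    lines = header.decode("utf-8").split("\r\n")
    version, field_lines = lines[0], lines[1:]

    fields = {}
    for line in field_lines:
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()

    # Content-Length says how many bytes belong to the block; the block
    # itself may contain CRLFs, so the length is what gets trusted.
    length = int(fields["Content-Length"])
    block = rest[:length]
    if rest[length:length + 4] != b"\r\n\r\n":
        raise ValueError("record is not terminated by two CRLFs")
    return version, fields, block


sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)
version, fields, block = parse_record(sample)
```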

6 of 32

WARC Info Record

This record type describes the records that follow it, up to the end of the file or until the next warcinfo record

Provides a parent ID that all other records are “concurrent” to; that is, it provides a parent ID for its child records

Considered a descriptive record
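A small sketch of the parent/child link, assuming the child records point back at the warcinfo record through the WARC-Warcinfo-ID field. The helper names `new_record_id` and `child_fields` are made up for this example, and the field sets are trimmed to the ones relevant here.

```python
import uuid


def new_record_id() -> str:
    """Record IDs are URIs in angle brackets; a UUID URN is a common choice."""
    return f"<urn:uuid:{uuid.uuid4()}>"


# The warcinfo record comes first and gets its own ID.
warcinfo_id = new_record_id()
warcinfo_fields = {
    "WARC-Type": "warcinfo",
    "WARC-Record-ID": warcinfo_id,
    "Content-Type": "application/warc-fields",
}


def child_fields(record_type: str, target_uri: str) -> dict:
    """Header fields for a record that belongs to the warcinfo above."""
    return {
        "WARC-Type": record_type,
        "WARC-Record-ID": new_record_id(),
        "WARC-Warcinfo-ID": warcinfo_id,   # link back to the parent
        "WARC-Target-URI": target_uri,
    }


request = child_fields("request", "http://example.com/")
response = child_fields("response", "http://example.com/")
```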

7 of 32

WARC Metadata Record

Contains additional information pertaining to the archived resource, such as all the outlinks (links to other pages) contained in the HTML of the target URI
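One way the outlinks could be gathered, sketched with the Python standard library. This is not how any particular crawler does it; the class name `OutlinkCollector`, the sample HTML, and the base URI are assumptions for illustration, and only `<a href>` links are considered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OutlinkCollector(HTMLParser):
    """Collects absolute URLs of <a href="..."> links found in a page."""

    def __init__(self, base_uri: str):
        super().__init__()
        self.base_uri = base_uri
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the target URI.
                self.outlinks.append(urljoin(self.base_uri, value))


html = '<p><a href="/about">About</a> <a href="http://other.example/">Other</a></p>'
collector = OutlinkCollector("http://example.com/index.html")
collector.feed(html)
```

The resulting `collector.outlinks` list is the kind of information a metadata record could carry alongside the response record for the same target URI.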

8 of 32

WARC Request Record

Contains, you guessed it, HTTP request information

9 of 32

WARC Response Record

Contains an HTTP response, including the response body

10 of 32

Tools That Archive

11 of 32

Archiving With WGET
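A hedged sketch of driving Wget from Python. It relies on Wget's `--warc-file` and `--page-requisites` options; the function names, the URL, and the output name are placeholders chosen for this example, and Wget must be installed for `archive` to work.

```python
import subprocess


def wget_warc_command(url: str, warc_name: str) -> list:
    """Build a Wget command line that also writes the traffic to a WARC."""
    return [
        "wget",
        "--page-requisites",          # also fetch images, CSS, and scripts
        f"--warc-file={warc_name}",   # Wget adds the .warc.gz extension
        url,
    ]


def archive(url: str, warc_name: str) -> int:
    """Run Wget and return its exit code."""
    return subprocess.run(wget_warc_command(url, warc_name), check=False).returncode


command = wget_warc_command("http://example.com/", "example")
```

Calling `archive("http://example.com/", "example")` would then leave an `example.warc.gz` next to the downloaded files.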

12 of 32

Viewing Archived example.com

13 of 32

Viewing Archived React Router Documentation

14 of 32

A Thing Called JavaScript

15 of 32

WARCreate: WARC Creating Chrome Extension

Step 1: Add Extension To Chrome

Step 2: Archive

Step 3: Profit

A Mat Kelly (@machawk1) original

http://warcreate.com/

16 of 32

What To Do With My WARCs?

17 of 32

Reading WARCs: Node.js With Node-WARC

18 of 32

Reading WARCs: Python With warcio

19 of 32

WAIL (Web Archiving Interface Layer)

Allows anyone, regardless of technical prowess, to build collections of web archives and to preserve and replay web pages from their personal computers (also a Mat Kelly original)

Tools included and made accessible through WAIL are Heritrix, Pywb, and WAIL Archiver (a not-yet-released, Electron-based, high-fidelity archival crawler)

In other words Archive-It++

Open source and welcomes contributors

Distributed as a standalone desktop application for Linux, macOS, and Windows

20 of 32

Maintaining Collections Of “Archived” Web Pages Before WAIL

21 of 32

Collections Before WAIL

22 of 32

Collections In WAIL

Name of each collection

Number of seeds (pages) per collection

Size of each collection on disk

Last time you added to or archived a seed in the collection

23 of 32

Viewing A Collection

Per seed information

Time added to collection

Last time archived

How you archived the seed

View Seed In Wayback

24 of 32

Viewing Your Archived Seed (Pages)

1. Click

Opens up your browser to the collection view for the page

Click

25 of 32

Enjoy

26 of 32

Adding WARC From Local Machine

Drag and drop (W)ARC

Attempts to determine seed

You choose correct seed

Adds (W)ARC to collection and ensures Wayback ingestion

27 of 32

Archiving Page On The Web

Enter URL

Check liveness and quick static analysis of seed

One-click Heritrix crawl configuration and launch, or start of a browser-based crawl, depending on configuration

28 of 32

Twitter Monitoring And Archiving

Step 1

Required Fields

Handle (Screen Name)

Which collection

How Long

29 of 32

Twitter Monitoring And Archiving

Step 2

Optional Tweet Filtering

Words in tweet

Hashtags

30 of 32

Webrecorder

A web archiving platform that lets you create high-fidelity, interactive web archives of any website you browse (a Rhizome and Ilya Kreymer production; Ilya is the author of Pywb)

Free service (sign-up required) that comes with 5GB of storage

Open source as well

Handles YouTube and video streaming services like a champ

Cool fact: Webrecorder is using my JavaScript overriding scheme in production. A Pywb beta featuring this is coming soon (among other things, also thanks to me)

31 of 32

Live Demo

32 of 32

Fin