1 of 16

Client-side Reconstruction of Composite Mementos Using ServiceWorker

Sawood Alam, Mat Kelly, Michele C. Weigle, and Michael L. Nelson

Web Science and Digital Libraries Research Group

Old Dominion University, Norfolk, VA, 23529

@ibnesayeed

@WebSciDL

Supported in part by NSF III 1526700

JCDL 2017, June 19-23, 2017, Toronto, Ontario, Canada


2 of 16

2008 Memento Seen in 2017


Sawood Alam <@ibnesayeed>

3 of 16

2008 Memento Seen in 2012


4 of 16

XenLand @ Alpha Centauri


5 of 16

Zombies in Archive


6 of 16

Zombies in Archive


<img src="http://xenland.alpha/images/map.png">
// Is rewritten on replay to become:
<img src="http://archive.example.org/1998/http://xenland.alpha/images/map.png">

// URLs constructed by JavaScript are harder to rewrite on replay, e.g.:
var base = 'http://xenland.alpha';
var imgdir = '/images/';
var img = document.createElement('img');
img.src = base + imgdir + 'ruler.png';
document.getElementById('ruler').appendChild(img);
// => http://xenland.alpha/images/ruler.png (requested from the live web, not the archive)
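To illustrate why script-built URLs escape rewriting, here is a minimal sketch of a naive attribute rewriter. It is not how any particular replay system works; the function name rewriteAttributes and the regular expression are assumptions made for this example, and the prefix is the one used in the snippet above.

```javascript
// Replay prefix taken from the slide's example; any real archive will differ.
const REPLAY_PREFIX = 'http://archive.example.org/1998/';

// Hypothetical naive rewriter: prefixes absolute http(s) URLs that appear
// literally inside src="..." or href="..." attributes.
function rewriteAttributes(html, prefix = REPLAY_PREFIX) {
  return html.replace(
    /\b(src|href)=(["'])(https?:\/\/[^"']+)\2/g,
    (match, attr, quote, url) => `${attr}=${quote}${prefix}${url}${quote}`
  );
}

// A URL written directly in markup is found and rewritten.
const staticMarkup = '<img src="http://xenland.alpha/images/map.png">';
const rewrittenMarkup = rewriteAttributes(staticMarkup);

// A URL assembled at runtime never appears as a literal attribute value,
// so the rewriter leaves the script untouched and the request leaks.
const scripted = "img.src = base + imgdir + 'ruler.png';";
const rewrittenScript = rewriteAttributes(scripted);

console.log(rewrittenMarkup);
console.log(rewrittenScript === scripted);
```

The second case is the zombie: the browser only learns the final URL when the script runs, long after any server-side rewriting has finished.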


7 of 16

Replay URL Resolution & Rewriting


Reference type    Example                                      Resolution after relocation
Relative path     images/logo.png                              Potentially correct
Absolute path     /public/images/logo.png                      Potentially incorrect
Absolute URL      http://example.com/public/images/logo.png    Potentially live leakage

Original page: http://example.com/public/index.html

...

<img src="/public/images/logo.png">

...

Archived page, rewritten on replay: http://archive.example.org/<datetime>/http://example.com/public/index.html

...

<img src="/<datetime>/http://example.com/public/images/logo.png">

...
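The three outcomes can be reproduced with the standard URL constructor by resolving each unrewritten reference against the memento's replay URL. This is only a sketch: the 14-digit datetime 20080101000000 is a made-up stand-in for <datetime>, and resolveAll is a helper named for this example.

```javascript
// Replay URL of the archived page; the datetime value is hypothetical.
const replayBase =
  'http://archive.example.org/20080101000000/http://example.com/public/index.html';

// The three reference types, exactly as they might appear unrewritten in a page.
const references = {
  relativePath: 'images/logo.png',
  absolutePath: '/public/images/logo.png',
  absoluteUrl: 'http://example.com/public/images/logo.png',
};

// Resolve every reference the way a browser would, relative to the base URL.
function resolveAll(refs, base) {
  const out = {};
  for (const [kind, ref] of Object.entries(refs)) {
    out[kind] = new URL(ref, base).href;
  }
  return out;
}

const resolved = resolveAll(references, replayBase);
console.log(resolved);
// relativePath stays inside the replay URL space (potentially correct),
// absolutePath lands on the archive host but loses the datetime and original
// host (potentially incorrect), and absoluteUrl points at the live site
// (potential live leakage).
```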


8 of 16

Avoiding Zombies

  • Ahead-of-time rendering and JS execution
    • http://archive.is/
  • Archival replay proxy
  • Browser extension
    • MementoFox (deprecated)
  • JS override
    • wombat.js in PyWB
  • ServiceWorker


9 of 16

ServiceWorker

  • New web API (still a working draft)
  • A standalone JavaScript file
  • Persists in the browser independent of the window
  • Acts as a proxy
  • Installed by a web page under its own domain at a specific path (called the scope)
  • Intercepts all requests in scope
    • Resources under the scope path (at any depth)
    • Secondary resource requests originating from any resource under the scope
  • Allows modification of requests and responses
  • Primarily used in web applications for offline access and notification support
  • Requires HTTPS
  • Growing browser support (73.61% as of June 8, 2017)
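The points above can be sketched in a few lines. This is an illustration under assumptions, not production code: the file name /serviceworker.js, the / scope, and the helper names are invented here, and controlsClient is a simplified version of the browser's scope matching (a string prefix test on the page URL). Page-side and worker-side code are shown together only for brevity; in practice the worker part lives in its own file.

```javascript
// Simplified scope check (assumption: plain prefix match on the serialized URL).
function controlsClient(clientUrl, scopeUrl) {
  return new URL(clientUrl).href.startsWith(new URL(scopeUrl).href);
}

// Page side: register the worker file under a scope.
// '/serviceworker.js' and '/' are hypothetical values for this sketch.
async function registerWorker() {
  if (typeof navigator === 'undefined' || !('serviceWorker' in navigator)) {
    return null; // not in a browser, or ServiceWorker is unsupported
  }
  return navigator.serviceWorker.register('/serviceworker.js', { scope: '/' });
}

// Worker side: act as a transparent proxy for every intercepted request.
// A real handler could alter the request or the response at this point.
function handleFetch(event) {
  event.respondWith(fetch(event.request));
}

// Attach the listener only when actually running inside a service worker.
if (typeof ServiceWorkerGlobalScope !== 'undefined' &&
    self instanceof ServiceWorkerGlobalScope) {
  self.addEventListener('fetch', handleFetch);
}

// Register only when running in a page.
if (typeof window !== 'undefined') {
  registerWorker().catch(console.error);
}
```

Once registered, the worker keeps intercepting requests from pages in scope on later visits, independent of the window that installed it.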


10 of 16

reconstructive.js

  • A ServiceWorker script written for archival replay
  • Plug-in for web archives or Memento aggregators
  • Intercepts all network requests originating from a memento
  • Reroutes requests to an archive (prevents live leakage & incorrect references)
  • Optionally rewrites the content to add a banner & fix hyperlinks
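The rerouting idea can be sketched as follows. This is not the actual reconstructive.js source: the replay URL layout (archive origin, then a 14-digit datetime, then the original URL), the use of the request's referrer to find the memento, and the names parseReplayUrl and reroute are all assumptions made for illustration.

```javascript
// Assumed replay layout: <archive origin>/<14-digit datetime>/<original URL>.
const REPLAY_PATTERN = /^\/(\d{14})\/(https?:\/\/.+)$/;

// Split a replay URL into its parts; returns null for anything else.
function parseReplayUrl(url) {
  let u;
  try {
    u = new URL(url);
  } catch {
    return null; // empty or malformed URL (e.g. a missing referrer)
  }
  const m = REPLAY_PATTERN.exec(u.pathname + u.search);
  return m ? { origin: u.origin, datetime: m[1], original: m[2] } : null;
}

// Decide where a request issued by a memento should really go.
function reroute(requestUrl, referrerUrl) {
  const memento = parseReplayUrl(referrerUrl);
  if (!memento) return requestUrl; // not issued by a memento: leave alone
  const already = parseReplayUrl(requestUrl);
  if (already && already.origin === memento.origin) return requestUrl;

  const req = new URL(requestUrl);
  // A request to the archive host without a datetime is an absolute path that
  // resolved incorrectly; anything else is live leakage.
  const original = req.origin === memento.origin
    ? new URL(req.pathname + req.search, memento.original).href
    : req.href;
  return `${memento.origin}/${memento.datetime}/${original}`;
}

// Worker hookup, active only inside a service worker.
if (typeof ServiceWorkerGlobalScope !== 'undefined' &&
    self instanceof ServiceWorkerGlobalScope) {
  self.addEventListener('fetch', (event) => {
    const target = reroute(event.request.url, event.request.referrer);
    if (target !== event.request.url) event.respondWith(fetch(target));
  });
}
```

Because the decision is made per request at fetch time, URLs assembled by JavaScript are handled the same way as URLs written in markup, and the archived content itself does not need to be rewritten.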


11 of 16

Zombies, No More!


12 of 16

Rewriting Mementos is Expensive

In our experiment over 500 home pages we observed:

  • One-fifth mean data overhead
  • One-third mean time overhead


[Figure: original capture (without any rewriting) compared with rewritten replay: 15% more data in twice the time]


13 of 16

Archival Capture Replay Test Suite (ACRTS)


reconstructive.js


14 of 16

Reconstruction Winners: PyWB & reconstructive.js

  1. OpenWayback
  2. PyWB
  3. Memento Reconstruct
  4. Memento for Chrome
  5. reconstructive.js


15 of 16

Future Work

  • Use “Prefer” header for original content (when archives support it)
  • Add a customizable archival banner
  • Add click handler for lazy rewriting of hyperlinks
  • Handle archived ServiceWorkers
  • Write a 404-combat ServiceWorker script for webmasters


16 of 16

Conclusions

  • reconstructive.js => no zombies!
  • Rerouting instead of rewriting (lazy rewriting)
  • Mean overhead reduction
    • one-fifth data
    • one-third time
  • 73.61% (and growing) browser support for ServiceWorker
  • reconstructive.js
  • Archival Capture Replay Test Suite


  • In-depth recap: WADL 2017 Thursday, June 22, 3:45pm (https://fox.cs.vt.edu/wadl2017.html)
