1 of 30

Web Archiving in the Year eaee1902f186819154789ee22ca30035

Michael L. Nelson

@phonedude_mln

with: Scott Ainsworth, Sawood Alam, Mohamed Aturban, John Berlin, Justin Brunelle, Kritika Garg, Hussam Hallak, Himarsha Jayanetti, Mat Kelly, Michele C. Weigle

@WebSciDL

Trust in Web Archives Panel, 2021 Web Archiving Conference

2021-06-16

$ echo "2025" | md5

eaee1902f186819154789ee22ca30035

$ # I read somewhere that hashes

$ # were better than datetime

Web Archiving in the Year eaee1902f186819154789ee22ca30035

Web Archiving Conference 2021-06-16 @phonedude_mln, @WebSciDL

2 of 30

My Vision for Trustworthy Web Archiving in 2025

Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.

#Disclaimer: “…both the live Web and the Wayback Machine [...] are reasonably reliable for everyday use”

3 of 30

My Vision for Trustworthy Web Archiving in 2025

Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.

This is doable by 2025. But let’s look further at the challenges that could stop us from achieving this goal.

4 of 30

Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.

5 of 30

IA: the Walter Cronkite of web archives?

Cable news now offers greater diversity, representation, and POV. However, few anchors offer the gravitas of “Uncle Walter”, “the most trusted man in America”, and some intentionally deceive.

6 of 30

Are we close to 100s of archives?

IIPC has 60+ members!

Members are not 1:1 with archives.

OTOH, there are many archives that are not IIPC members.

We certainly have “dozens” of archives.

7 of 30

Will the number of archives continue to grow?

Maybe not -- innumerable examples point toward centralization / consolidation

IA has admirably supported the Decentralized Web movement.

https://blog.archive.org/tag/decentralized-web/

But centralization is about economics, not technologies:

DSHR: “Unless decentralized technologies specifically address the issue of how to avoid increasing returns to scale they will not, of themselves, fix this economic problem. Their increasing returns to scale will drive layering centralized businesses on top of decentralized infrastructure, replicating the problem we face now, just on different infrastructure.”

https://blog.dshr.org/2017/08/why-is-web-centralized.html

8 of 30

Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.

9 of 30

We estimated that ~2/3 of web traffic is not publicly archivable

10 of 30

Tools for archiving the private web exist, but the practice, at least as we might think of it, is not yet widespread

11 of 30

Commercial private (web) archives largely uninformed by IIPC, Wayback, Heritrix, pywb, Brozzler et al.

12 of 30

Dark web archives :-(

$ curl -I https://www.webarchive.org.uk/wayback/archive/20150930064233mp_/http://sigbi.org/

HTTP/1.1 451 Unavailable For Legal Reasons

Server: nginx/1.20.1

Date: Tue, 08 Jun 2021 16:46:14 GMT

Content-Type: text/html

Content-Length: 3947

Connection: keep-alive

$
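A minimal Python sketch of how a client might recognize this case, assuming it already has the response head as text (as captured above). It makes no live request, so it says nothing about how webarchive.org.uk answers today. The helper names `parse_response_head` and `is_legally_blocked` are hypothetical and not part of any archive's API.

```python
from http import HTTPStatus

# Response head copied from the curl output on this slide.
SAMPLE_HEAD = """HTTP/1.1 451 Unavailable For Legal Reasons
Server: nginx/1.20.1
Date: Tue, 08 Jun 2021 16:46:14 GMT
Content-Type: text/html
Content-Length: 3947
Connection: keep-alive
"""


def parse_response_head(raw: str) -> dict:
    """Split a raw HTTP response head into status code, reason, and headers."""
    lines = [line for line in raw.splitlines() if line.strip()]
    version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {
        "version": version,
        "status": int(code),
        "reason": reason,
        "headers": headers,
    }


def is_legally_blocked(response: dict) -> bool:
    """True when the archive holds the capture but will not serve it (HTTP 451)."""
    return response["status"] == HTTPStatus.UNAVAILABLE_FOR_LEGAL_REASONS


response = parse_response_head(SAMPLE_HEAD)
print(response["status"], response["reason"])
print("dark (held but not served):", is_legally_blocked(response))
```

Distinguishing 451 from 404 matters for the "publicly available" criterion: the first means the capture exists but is dark, the second that it may not exist at all.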

13 of 30

Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.

14 of 30

Three copies archived at exactly the same time -- What are the chances?!

Actually, there are three copies of the same observation, not three independent observations.

$ curl -iLs memgator.cs.odu.edu/timemap/link/https://blog.reidreport.com | grep 20051213063757

<https://webarchive.loc.gov/all/20051213063757/http://blog.reidreport.com/>; rel="memento"; datetime="Tue, 13 Dec 2005 06:37:57 GMT",

<http://archive.md/20051213063757/http://blog.reidreport.com/>; rel="memento"; datetime="Tue, 13 Dec 2005 06:37:57 GMT",

<https://web.archive.org/web/20051213063757/http://blog.reidreport.com/>; rel="memento"; datetime="Tue, 13 Dec 2005 06:37:57 GMT",

It will never be 2005 again, so hosting IA’s WARC files from 2005 is the best we can do.

Going forward, it would be nice to have 3+ independent observations, which could all be different because of GeoIP, personalization, CDN status, etc.

Then it’s up to the reader to determine if the differences are semantically meaningful.
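The grep above can be generalized. Here is a rough Python sketch, assuming a link-format TimeMap whose memento entries look like the three lines shown on this slide (URI, then `rel`, then `datetime`). It groups mementos by datetime and reports datetimes held by more than one archive host. An identical datetime across archives is only a hint of a shared observation, not proof. The function names are hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# The three TimeMap entries shown on this slide.
TIMEMAP = '''
<https://webarchive.loc.gov/all/20051213063757/http://blog.reidreport.com/>; rel="memento"; datetime="Tue, 13 Dec 2005 06:37:57 GMT",
<http://archive.md/20051213063757/http://blog.reidreport.com/>; rel="memento"; datetime="Tue, 13 Dec 2005 06:37:57 GMT",
<https://web.archive.org/web/20051213063757/http://blog.reidreport.com/>; rel="memento"; datetime="Tue, 13 Dec 2005 06:37:57 GMT",
'''

# Assumes the attribute order seen above: <uri>; rel="..."; datetime="..."
ENTRY_RE = re.compile(r'<([^>]+)>;\s*rel="([^"]*)";\s*datetime="([^"]+)"')


def hosts_by_datetime(timemap_text: str) -> dict:
    """Map each memento datetime to the set of archive hosts reporting it."""
    groups = defaultdict(set)
    for uri, rel, datetime_value in ENTRY_RE.findall(timemap_text):
        if "memento" in rel.split():
            groups[datetime_value].add(urlsplit(uri).netloc)
    return groups


def possibly_shared_observations(timemap_text: str, min_archives: int = 2) -> dict:
    """Datetimes reported by at least `min_archives` distinct hosts."""
    return {
        datetime_value: sorted(hosts)
        for datetime_value, hosts in hosts_by_datetime(timemap_text).items()
        if len(hosts) >= min_archives
    }


for when, hosts in possibly_shared_observations(TIMEMAP).items():
    print(when, "->", ", ".join(hosts))
```

Three hosts sharing one second-resolution datetime is what prompts the "what are the chances?!" question. Independent observations would normally carry different datetimes.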

15 of 30

Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.

16 of 30

Homogeneity is not true interoperability

I don’t fault the staff who converge on popular, high-quality tech stacks & services, but I do lament the loss of heterogeneity.

True interoperability comes through the hard work of protocols and standards.

17 of 30

Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.

18 of 30

2017: First published works about robustness vs. malicious .html/.js?

Prior to these works, our group (@WebSciDL) had observed: Zombies (live web leakage into the archive), Temporal Violations (replaying web pages that never existed), Cookie Violations, Twitter replay problems, etc., but we never considered ingesting malicious .html/.js until these groundbreaking pubs.

19 of 30

2018: Web IDL & Client-side rewriting

2020: Analysis of attacks on rehosting sites

I signed off on John’s thesis 3 years ago, but I’m only now really understanding it.

Key contribution: web archives as a subclass of rehosting sites.

20 of 30

Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.

21 of 30

“No man ever steps in the same river twice, for it's not the same river and he's not the same man”

For third party playback, we are far from being able to do meaningful audits: replaying the same archived page over and over produces different results.

Left: 1566 archived pages reloaded 39 times over 1 year. Green = resource loaded, gray = resource not loaded, black line = baseline download.

https://github.com/oduwsdl/mementos-fixity

Conventional fixity-based approaches will not work.

https://www.slideshare.net/phonedude/blockchain-can-not-be-used-to-verify-replayed-archived-web-pages-125618706

We can’t depend on the archive for fixity; archives change and/or die.

Cf. “Where did the archive go?” (parts 1, 2, 3, 4) & “Archive Assisted Archival Fixity Verification Framework”

https://arxiv.org/abs/1905.12565
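One way to read the fixity problem is as a toy model in Python. It is a sketch under loud assumptions: a "replay" is modeled as a dict from resource URI to bytes, with `None` meaning the resource did not load. The data and the names `composite_hash` and `compare_replays` are made up for illustration. This is not the method used in the mementos-fixity study or in the cited framework.

```python
import hashlib


def composite_hash(replay: dict) -> str:
    """One page-level digest over every resource, in the spirit of a single fixity value."""
    digest = hashlib.sha256()
    for uri in sorted(replay):
        body = replay[uri]
        digest.update(uri.encode("utf-8"))
        digest.update(b"\0")
        digest.update(b"<missing>" if body is None else hashlib.sha256(body).digest())
    return digest.hexdigest()


def compare_replays(baseline: dict, later: dict) -> dict:
    """Per-resource status: 'same', 'changed', or 'missing' in either replay."""
    report = {}
    for uri in sorted(set(baseline) | set(later)):
        before, after = baseline.get(uri), later.get(uri)
        if before is None or after is None:
            report[uri] = "missing"
        elif before == after:
            report[uri] = "same"
        else:
            report[uri] = "changed"
    return report


# Hypothetical replays of the same memento: the second time, one script fails to load.
replay_1 = {"page.html": b"<html>hello</html>", "style.css": b"body{}", "app.js": b"init()"}
replay_2 = {**replay_1, "app.js": None}

print("page-level digests equal:", composite_hash(replay_1) == composite_hash(replay_2))
print(compare_replays(replay_1, replay_2))
```

In this toy model a single unloaded resource flips the page-level digest even though the archived HTML is byte-identical. That is one reason a lone hash per replayed page is too blunt for auditing.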

22 of 30

Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.

23 of 30

That archives don’t ingest the output of other archives is a lack of interoperability.

That we’re not more concerned about this is a lack of cooperation.

24 of 30

Kudos to archive.today for preserving machine-readable source metadata and including it in the UI

n.b. source tracking is built into NNTP, SMTP, Atom, etc.
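As a small illustration of what built-in source tracking looks like, SMTP relays each prepend a `Received` header, so the path a message took can be read back out of the message itself. The Python sketch below uses the standard `email` module on a made-up message. The hostnames and the `relay_path` helper are hypothetical.

```python
from email import message_from_string

# Made-up message: each relay prepends its own Received header,
# so the newest hop appears first.
SAMPLE_MESSAGE = (
    "Received: from relay.example.net by mx.example.org; Wed, 16 Jun 2021 12:00:02 +0000\n"
    "Received: from sender.example.com by relay.example.net; Wed, 16 Jun 2021 12:00:01 +0000\n"
    "From: alice@example.com\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)


def relay_path(raw_message: str) -> list:
    """Return the relay hops in origin-first order, without their timestamps."""
    message = message_from_string(raw_message)
    hops = message.get_all("Received", [])
    return [hop.split(";")[0].strip() for hop in reversed(hops)]


for hop in relay_path(SAMPLE_MESSAGE):
    print(hop)
```

An archive that ingests another archive's holdings could record provenance in a similar append-only way.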

25 of 30

APIs are necessary but not sufficient.

We must be able to preserve/audit the data (e.g., WARC, HAR) as rendered through software (e.g., pywb), not just the data.

26 of 30

Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.

27 of 30

These apps probably* use HTTP, JSON, etc., but what’s their URL? Are they even still web?

* I really don’t know (WebRTC?). And if they don’t, that further proves my point.

28 of 30

Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.

29 of 30

More than just Wayback Machines: we must accommodate any system that supports rehosting and/or revisions

30 of 30

Web Archiving in the Year 312351bff07989769097660a56395065

$ echo -n "2025" | md5

312351bff07989769097660a56395065

$ # oh no - the hash changed from slide 1

$ # is this content drift?!
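A minimal Python sketch of why the digest differs from slide 1, assuming the only difference between the two commands is the trailing newline that `echo` appends and `echo -n` suppresses. The helper `md5_hex` is a hypothetical name. If the slides' digests are transcribed faithfully, the first line printed should match slide 1 and the second should match this slide.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Hex MD5 digest of the UTF-8 bytes of `text`."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


with_newline = md5_hex("2025\n")   # bytes piped by: echo "2025"
without_newline = md5_hex("2025")  # bytes piped by: echo -n "2025"

print(with_newline)
print(without_newline)
print("digests differ:", with_newline != without_newline)
```

One invisible byte is enough to change the whole digest, which is the same brittleness that makes hash-based fixity hard to apply to replayed pages.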

Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.

Can we achieve this by 2025? Yes.

Will we achieve this by 2025? Maybe.

Will we “solve” trust? No.

Technical definitions (e.g., ISO 16363) notwithstanding, “trust” in web archives might be better understood as analogous to “relevance” in info retrieval: defined by a user’s information need.
