with: Scott Ainsworth, Sawood Alam, Mohamed Aturban, John Berlin, Justin Brunelle, Kritika Garg, Hussam Hallak, Himarsha Jayanetti, Mat Kelly, Michele C. Weigle
Trust in Web Archives Panel, 2021 Web Archiving Conference
$ echo "2025" | md5
eaee1902f186819154789ee22ca30035
$ # I read somewhere that hashes
$ # were better than datetime
Web Archiving in the Year eaee1902f186819154789ee22ca30035
Web Archiving Conference 2021-06-16 @phonedude_mln, @WebSciDL
My Vision for Trustworthy
Web Archiving in 2025
Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.
#Disclaimer: “…both the live Web and the Wayback Machine [...] are reasonably reliable for everyday use”
This is doable by 2025. But let’s look further at the challenges that could stop us from achieving this goal.
Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.
IA: the Walter Cronkite of web archives?
https://www.britannica.com/biography/Walter-Cronkite
https://medium.com/tvnewsanalyzer/visualizing-the-who-and-what-of-cable-tv-news-f51d314b4c2d
Cable news now offers greater diversity, representation, and POV. However, few anchors offer the gravitas of “Uncle Walter”, “the most trusted man in America”, and some intentionally deceive.
Are we close to 100s of archives?
IIPC has 60+ members!
Members are not 1:1 with archives.
OTOH, there are many archives that are not IIPC members.
We certainly have “dozens” of archives.
Will the number of archives continue to grow?
Maybe not -- innumerable examples point toward centralization / consolidation
https://www.currentware.com/the-state-of-the-web-browser-in-2020/
IA has admirably supported the Decentralized Web movement.
https://blog.archive.org/tag/decentralized-web/
But centralization is about economics, not technologies:
DSHR: “Unless decentralized technologies specifically address the issue of how to avoid increasing returns to scale they will not, of themselves, fix this economic problem. Their increasing returns to scale will drive layering centralized businesses on top of decentralized infrastructure, replicating the problem we face now, just on different infrastructure.”
Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.
We estimated that ~2/3 of web traffic is not publicly archivable
Tools for archiving the private web exist, but the practice, at least as we might think of it, is not yet widespread
https://oduwsdl.github.io/nehdhig2017/
https://ws-dl.blogspot.com/2019/09/2019-09-02-so-long-and-thanks-for-all.html
Commercial private (web) archives are largely uninformed by IIPC, Wayback, Heritrix, pywb, Brozzler et al.
Dark web archives :-(
$ curl -I https://www.webarchive.org.uk/wayback/archive/20150930064233mp_/http://sigbi.org/
HTTP/1.1 451 Unavailable For Legal Reasons
Server: nginx/1.20.1
Date: Tue, 08 Jun 2021 16:46:14 GMT
Content-Type: text/html
Content-Length: 3947
Connection: keep-alive
$
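The response above can also be read programmatically. The following is a minimal sketch, not any archive's actual tooling: it parses the captured `curl -I` output offline and attaches a label to the status code. The function names `parse_head` and `access_label` and the label strings are hypothetical, chosen only for illustration.

```python
# Captured response headers, copied from the curl output above.
SAMPLE_HEAD = """\
HTTP/1.1 451 Unavailable For Legal Reasons
Server: nginx/1.20.1
Date: Tue, 08 Jun 2021 16:46:14 GMT
Content-Type: text/html
Content-Length: 3947
Connection: keep-alive
"""


def parse_head(raw):
    """Split a raw HTTP response head into (status code, reason, headers)."""
    lines = [ln for ln in raw.splitlines() if ln.strip()]
    _version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for ln in lines[1:]:
        # Split on the first colon only; header values may contain colons.
        name, _, value = ln.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers


def access_label(status):
    """Hypothetical labels: 451 means the memento is held but not served."""
    if status == 451:
        return "dark (legally restricted)"
    if 200 <= status < 400:
        return "accessible"
    return "unavailable"


status, reason, headers = parse_head(SAMPLE_HEAD)
print(status, reason, "->", access_label(status))
```

A 451 differs from a 404 in an important way: it suggests the archive holds the capture but cannot serve it, which is what makes the archive "dark" rather than empty.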
Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.
Three copies archived at exactly the same time -- What are the chances?!
Actually, there are three copies of the same observation, not three independent observations.
$ curl -iLs memgator.cs.odu.edu/timemap/link/https://blog.reidreport.com | grep 20051213063757
<https://webarchive.loc.gov/all/20051213063757/http://blog.reidreport.com/>; rel="memento"; datetime="Tue, 13 Dec 2005 06:37:57 GMT",
<http://archive.md/20051213063757/http://blog.reidreport.com/>; rel="memento"; datetime="Tue, 13 Dec 2005 06:37:57 GMT",
<https://web.archive.org/web/20051213063757/http://blog.reidreport.com/>; rel="memento"; datetime="Tue, 13 Dec 2005 06:37:57 GMT",
It will never be 2005 again, so hosting IA’s WARC files from 2005 is the best we can do.
Going forward, it would be nice to have 3+ independent observations, which could all be different because of GeoIP, personalization, CDN status, etc.
Then it’s up to the reader to determine if the differences are semantically meaningful.
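One way to spot such shared observations is to group the mementos in a TimeMap by datetime and list the archives that report each one. This is a rough sketch under stated assumptions: it works offline on the three link-format lines shown above, the helper names are hypothetical, and an identical datetime across archives is only a hint of a copied capture, not proof.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# The three TimeMap entries from the grep output above.
SAMPLE_TIMEMAP = """\
<https://webarchive.loc.gov/all/20051213063757/http://blog.reidreport.com/>; rel="memento"; datetime="Tue, 13 Dec 2005 06:37:57 GMT",
<http://archive.md/20051213063757/http://blog.reidreport.com/>; rel="memento"; datetime="Tue, 13 Dec 2005 06:37:57 GMT",
<https://web.archive.org/web/20051213063757/http://blog.reidreport.com/>; rel="memento"; datetime="Tue, 13 Dec 2005 06:37:57 GMT",
"""

# Matches link-format entries whose rel value contains "memento".
LINK_RE = re.compile(
    r'<([^>]+)>;\s*rel="([^"]*memento[^"]*)";\s*datetime="([^"]+)"'
)


def archives_by_datetime(timemap_text):
    """Map each memento datetime to the set of archive hostnames reporting it."""
    groups = defaultdict(set)
    for uri, _rel, dt in LINK_RE.findall(timemap_text):
        groups[dt].add(urlsplit(uri).hostname)
    return groups


def shared_datetimes(timemap_text, min_archives=2):
    """Datetimes reported by at least min_archives distinct archives."""
    return {
        dt: sorted(hosts)
        for dt, hosts in archives_by_datetime(timemap_text).items()
        if len(hosts) >= min_archives
    }


for dt, hosts in shared_datetimes(SAMPLE_TIMEMAP).items():
    print(dt, hosts)
```

Confirming that the captures really are the same observation would still require comparing the archived payloads themselves.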
Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.
Homogeneity is not true interoperability
https://netpreserveblog.wordpress.com/2020/12/16/openwayback-to-pywb-transition-guide/
https://ws-dl.blogspot.com/2019/09/2019-09-10-where-did-archive-go-part-2.html
I don’t fault the staff who converge on popular, high-quality tech stacks & services, but I do lament the loss of heterogeneity.
True interoperability comes through the hard work of protocols and standards.
Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.
2017: First published works about robustness vs. malicious .html/.js?
http://labs.rhizome.org/presentations/security.html#/
https://blog.dshr.org/2017/06/wac2017-security-issues-for-web-archives.html
https://acmccs.github.io/papers/p1741-lernerAT3.pdf
https://blog.dshr.org/2017/09/attacking-users-of-wayback-machine.html
Prior to these works, our group (@WebSciDL) had observed: Zombies (live web leakage into the archive), Temporal Violations (replaying web pages that never existed), Cookie Violations, Twitter replay problems, etc., but we never considered ingesting malicious .html/.js until these groundbreaking pubs.
2018: Web IDL & Client-side rewriting
2020: Analysis of attacks on rehosting sites
I signed off on John’s thesis 3 years ago, but I’m only now really understanding it.
Key contribution: web archives as a subclass of rehosting sites.
Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.
“No man ever steps in the same river twice, for it's not the same river and he's not the same man”
For third party playback, we are far from being able to do meaningful audits: replaying the same archived page over and over produces different results.
Left: Reload 1566 archived pages 39 times over 1 year. Green = resource loaded, gray = resource not loaded, black line = baseline download.
https://github.com/oduwsdl/mementos-fixity
Conventional fixity-based approaches will not work.
We can’t depend on the archive for fixity; archives change and/or die.
Cf. “Where did the archive go?”
“Archive Assisted Archival Fixity Verification Framework”
https://arxiv.org/abs/1905.12565
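To make the replay problem concrete, here is a minimal sketch of comparing two replays of the same archived page, resource by resource. It is not the method of the mementos-fixity study or the cited framework. The `manifest` and `compare_replays` helpers, the example.org URIs, and the payloads are all hypothetical stand-ins, and each replay is assumed to be available as a mapping from resource URI to payload bytes (`None` when the resource failed to load).

```python
import hashlib


def manifest(replay):
    """Map each resource URI to the SHA-256 of its payload (None if not loaded)."""
    return {
        uri: hashlib.sha256(body).hexdigest() if body is not None else None
        for uri, body in replay.items()
    }


def compare_replays(baseline, later):
    """Classify every resource seen in either replay against the baseline."""
    base, new = manifest(baseline), manifest(later)
    report = {"same": [], "changed": [], "missing": [], "extra": []}
    for uri in sorted(set(base) | set(new)):
        b, n = base.get(uri), new.get(uri)
        if b is None and n is None:
            continue  # never loaded in either replay
        if b is None:
            report["extra"].append(uri)
        elif n is None:
            report["missing"].append(uri)
        elif b == n:
            report["same"].append(uri)
        else:
            report["changed"].append(uri)
    return report


# Hypothetical replays of one memento, months apart.
baseline = {
    "https://example.org/": b"<html>page</html>",
    "https://example.org/style.css": b"body{}",
    "https://example.org/app.js": b"init()",
}
later = {
    "https://example.org/": b"<html>page</html>",
    "https://example.org/style.css": None,  # failed to load this time
    "https://example.org/app.js": b"init(); track()",
}

print(compare_replays(baseline, later))
```

A single hash over the whole replayed page would flag both replays as "different" without saying why; a per-resource manifest at least localizes the drift, though it still cannot tell a benign change from tampering.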
Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.
That archives don’t ingest the output of other archives is a lack of interoperability.
That we’re not more concerned about this is a lack of cooperation.
https://www.slideshare.net/phonedude/web-archives-at-the-nexus-of-good-fakes-and-flawed-originals/87
Kudos to archive.today for preserving machine-readable source metadata and including it in the UI
APIs are necessary but not sufficient.
We must be able to preserve/audit the data (e.g., WARC, HAR) as rendered through software (e.g., pywb), not just the data.
Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.
These apps probably* use HTTP, JSON, etc., but what’s their URL? Are they even still web?
* I really don’t know (WebRTC?). And if they don’t, that further proves my point.
Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.
More than just Wayback Machines: we must accommodate any system that supports rehosting and/or revisions
see also: https://www.slideshare.net/ibnesayeed/readying-web-archives-to-consume-and-leverage-web-bundles
Web Archiving in the Year 312351bff07989769097660a56395065
$ echo -n "2025" | md5
312351bff07989769097660a56395065
$ # oh no - the hash changed from slide 1
$ # is this content drift?!
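The "drift" has a mundane cause: plain `echo` appends a trailing newline, while `echo -n` does not, so the two commands hash different byte strings. A small sketch using Python's standard `hashlib` shows the same effect; the helper name `md5_hex` is just for illustration.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of the exact bytes given."""
    return hashlib.md5(data).hexdigest()


# Bytes hashed by `echo "2025" | md5`: the text plus echo's trailing newline.
with_newline = md5_hex(b"2025\n")

# Bytes hashed by `echo -n "2025" | md5`: the text alone.
without_newline = md5_hex(b"2025")

print("with newline:   ", with_newline)
print("without newline:", without_newline)
print("same digest?    ", with_newline == without_newline)
```

The same lesson applies to fixity in general: a digest identifies an exact byte sequence, so any difference in how the bytes are produced or delivered changes the hash even when the "content" looks identical to a reader.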
Hundreds of publicly available, independent, interoperable, robust, auditable, cooperating web archives.
Can we achieve this by 2025? Yes.
Will we achieve this by 2025? Maybe.
Will we “solve” trust? No.
Technical definitions (e.g., ISO 16363) notwithstanding, “trust” in web archives might be better understood as analogous to “relevance” in information retrieval: defined by a user’s information need.