Comparison of Access Patterns of Robots and Humans in Web Archives
Himarsha R. Jayanetti¹, Kritika Garg¹, Sawood Alam², Michael L. Nelson¹, and Michele C. Weigle¹
¹ Web Science & Digital Libraries Research Group
Old Dominion University, Norfolk VA, USA
@WebSciDL
² Wayback Machine, Internet Archive
San Francisco, California, USA
@internetarchive
Web Archiving and Digital Libraries (WADL), 24 June 2022
@HimarshaJ @kritika_garg @ibnesayeed @phonedude_mln @weiglemc
@HimarshaJ @kritika_garg @ibnesayeed @phonedude_mln @weiglemc WADL, 2022
Motivation
Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson, “Access Patterns for Robots and Humans in Web Archives,” in Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), Indianapolis, IN, July 2013, pp. 339-348.
https://www.cs.odu.edu/~mln/pubs/jcdl-2013/fp105-AlNoamany.pdf
Datasets
Full day sample of access logs:
Internet Archive - 2012 (February 2, 2012)
Internet Archive - 2019 (February 7, 2019)
Arquivo.pt - 2019 (February 7, 2019)
Feature | IA 2012 | IA 2019 | PT 2019 |
No. of Requests | 99,173,542 (100%) | 308,194,916 (100%) | 1,046,855 (100%) |
GET | 97,987,295 (98.80%) | 304,125,661 (98.68%) | 1,025,132 (97.92%) |
HEAD | 1,109,810 (1.12%) | 2,578,735 (0.84%) | 14,330 (1.37%) |
Status Code 2xx | 32,460,590 (32.73%) | 148,742,768 (48.26%) | 272,467 (26.03%) |
Status Code 3xx | 52,131,835 (52.57%) | 131,729,104 (42.74%) | 211,709 (20.22%) |
Status Code 4xx | 11,614,387 (11.71%) | 27,099,599 (8.79%) | 560,913 (53.58%) |
Status Code 5xx | 2,964,146 (2.99%) | 614,502 (0.20%) | 1,764 (0.17%) |
Embedded Resources | 43,260,926 (43.62%) | 195,287,060 (63.36%) | 205,976 (19.68%) |
SI Bot | 8,867 (0.01%) | 476,367 (0.15%) | 3,602 (0.34%) |
Data Cleaning
Dataset | Requests Before Cleaning | Requests After Stage 1 | Requests After Stage 2 |
IA 2012 | 99,173,542 | 84,512,394 (85.22%) | 18,432,398 (18.58%) |
IA 2019 | 308,194,916 | 237,901,926 (77.19%) | 35,015,776 (11.36%) |
PT 2019 | 1,046,855 | 904,515 (86.40%) | 604,762 (57.77%) |
Session Identification
1.1.0.100_0_1 - - [02/Feb/2012:04:36:43 +0000] "GET http://web.archive.org/web/*/DEFI-METRAX.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
0.0.0.100_0_1 - - [02/Feb/2012:04:36:44 +0000] "GET http://wayback.archive.org/web/*/DEFI-METRAX.RU HTTP/1.0" 404 2164 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
0.0.0.100_0_2 - - [02/Feb/2012:04:58:34 +0000] "GET http://web.archive.org/web/*/LETSMILK.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
0.0.0.100_0_2 - - [02/Feb/2012:04:58:37 +0000] "GET http://wayback.archive.org/web/*/LETSMILK.RU HTTP/1.0" 404 2162 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
0.0.0.100_0_2 - - [02/Feb/2012:05:00:21 +0000] "GET http://web.archive.org/web/*/CHISTKALICA.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
0.0.0.100_0_2 - - [02/Feb/2012:05:00:49 +0000] "GET http://wayback.archive.org/web/*/CHISTKALICA.RU HTTP/1.0" 503 2197 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"
The IP addresses are anonymized.
The duration between the second and third requests (04:36:44 and 04:58:34) is greater than 10 minutes, so they are assigned to different sessions (_0_1 and _0_2).
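The session split shown in the log excerpt can be sketched as follows. This is a minimal sketch, assuming one client's requests have already been grouped together; the function name `split_sessions` and its input shape are hypothetical, and only the 10-minute threshold comes from the slide.

```python
from datetime import datetime, timedelta

# Threshold from the slide: a gap of more than 10 minutes starts a new session.
SESSION_TIMEOUT = timedelta(minutes=10)
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"  # e.g. 02/Feb/2012:04:36:43 +0000


def split_sessions(timestamps):
    """Group one client's request timestamps (log strings) into sessions."""
    times = sorted(datetime.strptime(t, LOG_TIME_FORMAT) for t in timestamps)
    sessions = []
    for t in times:
        # Continue the current session only if the gap to the previous
        # request does not exceed the timeout.
        if sessions and t - sessions[-1][-1] <= SESSION_TIMEOUT:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions
```

Applied to the six timestamps in the excerpt, the sketch yields two sessions, matching the `_0_1` and `_0_2` suffixes in the log.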
Type of request: HEAD Request
199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
Known Bots
199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100
https://pypi.org/project/device-detector/ (DeviceDetector, User-Agent Parser)
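A known-bot check on the User-Agent string can be sketched as follows. This is a standard-library stand-in, not the DeviceDetector parser linked above; the token list `KNOWN_BOT_TOKENS` and the function name `is_known_bot` are hypothetical and far less complete than a real User-Agent database.

```python
# Hypothetical, deliberately tiny list of bot markers; a real User-Agent
# parser covers far more agents than these substrings.
KNOWN_BOT_TOKENS = ("bot", "crawler", "spider")


def is_known_bot(user_agent: str) -> bool:
    """Return True if the User-Agent contains one of the bot markers."""
    ua = user_agent.lower()
    return any(token in ua for token in KNOWN_BOT_TOKENS)
```

With this list, `Twitterbot/1.0` from the excerpt is flagged, while the Firefox User-Agent from the session example is not.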
Number of User-Agents per IP
x0.77.87.100 - - [02/Feb/2012:03:46:54 +0000] "POST http://web.archive.org/web/20070211155651/http://212.227.83.57/cproc.aspx HTTP/1.0" 302 0 "http://www.vbleisure.co.uk/guest_book.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)" 00101000
x0.77.87.100 - - [02/Feb/2012:04:06:29 +0000] "POST http://web.archive.org/web/20070211155651/http://212.227.83.57/cproc.aspx HTTP/1.0" 302 - "http://www.vbleisure.co.uk/guest_book.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)" 00101000
x0.77.87.100 - - [02/Feb/2012:05:09:30 +0000] "POST http://web.archive.org/web/20070211155651/http://212.227.83.57/cproc.aspx HTTP/1.0" 302 - "http://www.vbleisure.co.uk/guest_book.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" 00101000
x0.77.87.100 - - [02/Feb/2012:07:59:43 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 302 0 "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; ru) Opera 8.50" 00101000
. . .
. . .
. . .
x0.77.87.100 - - [02/Feb/2012:22:04:57 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Creative)" 00101000
x0.77.87.100 - - [02/Feb/2012:22:08:02 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)" 00101000
x0.77.87.100 - - [02/Feb/2012:23:40:31 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.0" 00101000
x0.77.87.100 - - [02/Feb/2012:23:40:32 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; MRA 4.6 (build 01425))" 00101000
x0.77.87.100 - - [02/Feb/2012:23:59:34 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Opera/7.60 (Windows NT 5.2; U) [en] (IBM EVV/3.0/EAK01AG9/LE)" 00101000
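Counting distinct User-Agents per IP, as in the excerpt above where one address cycles through many browser strings, can be sketched as follows. The function names and the `max_user_agents` parameter are hypothetical; the slides do not state the threshold that was used.

```python
from collections import defaultdict


def user_agents_per_ip(requests):
    """Map each IP to the set of distinct User-Agent strings it sent.

    `requests` is an iterable of (ip, user_agent) pairs.
    """
    agents = defaultdict(set)
    for ip, user_agent in requests:
        agents[ip].add(user_agent)
    return dict(agents)


def flag_ips(requests, max_user_agents):
    """Return IPs that used more than `max_user_agents` distinct User-Agents.

    The threshold is an assumption supplied by the caller.
    """
    per_ip = user_agents_per_ip(requests)
    return {ip for ip, uas in per_ip.items() if len(uas) > max_user_agents}
```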
Robots.txt
0.139.100.213_2_2 - - [02/Feb/2012:17:03:22 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2" 00011000
0.139.100.213_2_2 - - [02/Feb/2012:17:06:30 +0000] "GET http://web.archive.org/web/*/http://c00lbookmarks.com/story.php?title=best-door-blinds-inside HTTP/1.1" 302 0 "-" "RSS Scout 0.9.2" 00001000
0.139.100.213_2_2 - - [02/Feb/2012:17:06:32 +0000] "GET http://wayback.archive.org/web/*/http://c00lbookmarks.com/story.php?title=best-door-blinds-inside HTTP/1.1" 404 2409 "http://web.archive.org/web/*/http://c00lbookmarks.com/story.php?title=best-door-blinds-inside" "RSS Scout 0.9.2" 00001000
0.139.100.213_2_2 - - [02/Feb/2012:17:07:38 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2" 00011000
0.139.100.213_2_2 - - [02/Feb/2012:17:10:44 +0000] "GET http://web.archive.org/web/*/http://www.goloco.org/users/D5EWwXI HTTP/1.1" 302 0 "-" "RSS Scout 0.9.2" 00001000
0.139.100.213_2_2 - - [02/Feb/2012:17:10:45 +0000] "GET http://wayback.archive.org/web/*/http://www.goloco.org/users/D5EWwXI HTTP/1.1" 404 2385 "http://web.archive.org/web/*/http://www.goloco.org/users/D5EWwXI" "RSS Scout 0.9.2" 00001000
0.139.100.213_2_2 - - [02/Feb/2012:17:14:50 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2" 00011000
0.139.100.213_2_2 - - [02/Feb/2012:17:19:54 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2" 00011000
0.139.100.213_2_2 - - [02/Feb/2012:17:27:26 +0000] "GET http://wayback.archive.org/web/*/http://epicbookmarks.com/story.php?title=door-blinds-inside-installation HTTP/1.1" 404 2416 "http://web.archive.org/web/*/http://epicbookmarks.com/story.php?title=door-blinds-inside-installation" "RSS Scout 0.9.2" 00001000
0.139.100.213_2_2 - - [02/Feb/2012:17:27:26 +0000] "GET http://web.archive.org/web/*/http://epicbookmarks.com/story.php?title=door-blinds-inside-installation HTTP/1.1" 302 0 "-" "RSS Scout 0.9.2" 00001000
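Checking whether a session requested robots.txt, as the RSS Scout session above does repeatedly, can be sketched as follows. The function name `requests_robots_txt` is hypothetical, and the sketch assumes the request targets have already been extracted from the log lines.

```python
from urllib.parse import urlsplit


def requests_robots_txt(request_targets):
    """Return True if any request target in the session is /robots.txt.

    Targets may be absolute URLs (as in the IA 2012 log) or bare paths.
    """
    return any(urlsplit(target).path == "/robots.txt" for target in request_targets)
```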
Image to HTML ratio
Downloaded using cURL
Accessed in the Web Browser
Browsing Speed
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190205174131/https://connect.facebook.net/signals/config/225699104785488?v=2.8.40&r=stable HTTP/1.1" 200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004025/https://connect.facebook.net/en_US/fbevents.js HTTP/1.1" 200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004025/https://connect.facebook.net/signals/config/225699104785488?v=2.8.40&r=stable HTTP/1.1" 302
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004026/https://embed.tawk.to/59cc85aec28eca75e4622ccd/default HTTP/1.1" 200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004026/https://embed.tawk.to/59cc85aec28eca75e4622ccd/default HTTP/1.1" 200
0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004026/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 302
. . .
. . .
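The burst in the excerpt above, where many requests share one timestamp, suggests a simple rate measure. The following is a minimal sketch, assuming log timestamps with one-second resolution; the function names and the `max_per_second` parameter are hypothetical, and the study's actual speed criterion may count only certain request types.

```python
from collections import Counter


def max_requests_per_second(timestamps):
    """Largest number of requests that share one log timestamp."""
    return max(Counter(timestamps).values(), default=0)


def is_too_fast(timestamps, max_per_second):
    """Flag a session whose peak rate exceeds a caller-supplied threshold."""
    return max_requests_per_second(timestamps) > max_per_second
```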
Results: Bot Detection
Heuristics | IA 2012 Sessions | IA 2012 Requests | IA 2019 Sessions | IA 2019 Requests | PT 2019 Sessions | PT 2019 Requests |
Total | 1.53M | 22.3M | 2.7M | 42.9M | 3.7k | 614k |
Known Bots | 21k (1%) | 398k (2%) | 322k (12%) | 4.97M (12%) | 1k (24%) | 67k (11%) |
#UA per IP | 5k (0.3%) | 757k (3%) | 5k (0.2%) | 1.4M (3.4%) | 3 (0.1%) | 3k (0.4%) |
Robots.txt | 2k (0.1%) | 11k (0.1%) | 9k (0.4%) | 31k (0.1%) | 404 (11%) | 4k (0.7%) |
Image to HTML ratio | 1.33M (87%) | 19.89M (89%) | 1.75M (66%) | 24M (56%) | 3k (79%) | 589k (96%) |
Browsing Speed | 237k (16%) | 4.56M (20%) | 515k (19%) | 21M (49%) | 2k (46%) | 162k (26%) |
Total Robots | 1.34M (88%) | 20.28M (91%) | 1.85M (70%) | 30M (70%) | 4k (97%) | 604k (98%) |
Each heuristic row gives the number of sessions/requests labeled as robots by that heuristic alone.
Image-to-HTML ratio had the largest effect on detecting robots
The Total Robots row gives the number of sessions/requests detected as robots after applying all the heuristics together.
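Because one session can trip several heuristics, the combined total is smaller than the sum of the per-heuristic counts. A minimal sketch, assuming the combination is a union (a session is a robot if any heuristic flags it); the slides do not spell out the combination rule, and the function name `total_robots` and the input shape are hypothetical.

```python
def total_robots(flags_by_heuristic):
    """Union of session ids flagged by any heuristic.

    `flags_by_heuristic` maps a heuristic name to the set of session ids
    that heuristic labeled as robots.
    """
    flagged = set()
    for session_ids in flags_by_heuristic.values():
        flagged |= session_ids
    return flagged
```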
Key Takeaways
Out of requests,
IA 2012: 91%
IA 2019: 70%
PT 2019: 98%
Out of sessions,
IA 2012: 88%
IA 2019: 70%
PT 2019: 97%
The percentage of web archive accesses that were detected as robots.
Backup Slides …
Methodology
Access Patterns
Dip: The user accesses only one URI (URI-M or URI-T).
Slide: The user accesses the same URI-R at different Memento-Datetimes.
Dive: The user accesses different URI-Rs at nearly the same Memento-Datetime (i.e., dives deeply into a memento by browsing hyperlinks of URI-Ms).
Skim: The user accesses different TimeMaps (URI-Ts).
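The four patterns can be sketched as a small session classifier. This is a minimal sketch under stated assumptions: it uses Wayback-style paths (`/web/<14-digit datetime>/<URI-R>` for a URI-M, `/web/*/<URI-R>` for a URI-T), the names `parse_request` and `classify_session` are hypothetical, and it does not check how close the Memento-Datetimes are for a Dive because the slide does not quantify "nearly the same".

```python
import re

# Assumed Wayback-style request paths.
_URI_M = re.compile(r"/web/(\d{14})/(.+)$")   # memento: datetime + URI-R
_URI_T = re.compile(r"/web/\*/(.+)$")         # TimeMap for a URI-R


def parse_request(path):
    """Return (kind, uri_r, datetime) or None if the path is neither form."""
    memento = _URI_M.search(path)
    if memento:
        return ("M", memento.group(2), memento.group(1))
    timemap = _URI_T.search(path)
    if timemap:
        return ("T", timemap.group(1), None)
    return None


def classify_session(paths):
    """Label a session as Dip, Slide, Dive, Skim, Mixed, or Unknown."""
    unique = {req for req in map(parse_request, paths) if req}
    if not unique:
        return "Unknown"
    if len(unique) == 1:
        return "Dip"                      # only one URI-M or URI-T
    kinds = {kind for kind, _, _ in unique}
    uri_rs = {uri_r for _, uri_r, _ in unique}
    if kinds == {"T"}:
        return "Skim"                     # different TimeMaps
    if kinds == {"M"}:
        # Same URI-R at different datetimes is a Slide; different URI-Rs
        # are treated as a Dive without checking datetime proximity.
        return "Slide" if len(uri_rs) == 1 else "Dive"
    return "Mixed"                        # both URI-Ms and URI-Ts; not defined on the slide
```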
Temporal Preference
The majority of the requests are for mementos whose Memento-Datetime is close to the date of the corresponding access log sample.
Image to HTML ratio
0.0.122.100_1_0 web.archive.org - [07/Feb/2019:16:55:22 +0000] "GET /web/*/http://maestro.haarp.alaska.edu/ HTTP/2.0" 200 9002 "https://archive.org/search.php?query=http%3A%2F%2Fmaestro.haarp.alaska.edu%2F" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 OPR/57.0.3098.116" 0.192 MISS 0.192 "text/html; charset=utf-8" - "-" "-" "wwwb-app31" "-" 00001000
0.0.122.100_1_0 web.archive.org - [07/Feb/2019:16:56:15 +0000] "GET /web/20130304102141/http://maestro.haarp.alaska.edu/ HTTP/2.0" 404 0 "https://web.archive.org/web/20130715000000*/http://maestro.haarp.alaska.edu/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 OPR/57.0.3098.116" 10.859 MISS 10.856 "text/html; charset=utf-8" - "-" "-" "wwwb-app104" "-" 00001000
0.0.122.100_1_0 web.archive.org - [07/Feb/2019:16:56:15 +0000] "GET /web/20130304102141/http://maestro.haarp.alaska.edu/ HTTP/2.0" 404 0 "https://web.archive.org/web/20130715000000*/http://maestro.haarp.alaska.edu/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 OPR/57.0.3098.116" 10.926 MISS 10.928 "text/html; charset=utf-8" - "-" "-" "wwwb-app58" "-" 00001000
0.0.122.100_1_0 web.archive.org - [07/Feb/2019:16:56:15 +0000] "GET /web/20130304102141/http://maestro.haarp.alaska.edu/ HTTP/2.0" 404 0 "https://web.archive.org/web/20130715000000*/http://maestro.haarp.alaska.edu/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 OPR/57.0.3098.116" 11.453 MISS 11.456 "text/html; charset=utf-8" - "-" "-" "wwwb-app57" "-" 00001000
. . .
. . .
. . .
0.0.122.100_1_0 web.archive.org - [07/Feb/2019:16:56:23 +0000] "GET /web/20130304102141/http://maestro.haarp.alaska.edu/ HTTP/2.0" 404 8274 "https://web.archive.org/web/20130715000000*/http://maestro.haarp.alaska.edu/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 OPR/57.0.3098.116" 0.000 HIT - "text/html; charset=utf-8" - "-" "-" "wwwb-app43" "-" 00001000
0.0.122.100_1_0 web.archive.org - [07/Feb/2019:16:56:23 +0000] "GET /web/20130304102141/http://maestro.haarp.alaska.edu/ HTTP/2.0" 404 8274 "https://web.archive.org/web/20130715000000*/http://maestro.haarp.alaska.edu/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 OPR/57.0.3098.116" 0.227 MISS 0.224 "text/html; charset=utf-8" - "-" "-" "wwwb-app43" "-" 00001000
0.0.122.100_1_0 web.archive.org - [07/Feb/2019:16:56:29 +0000] "GET /web/*/http://maestro.haarp.alaska.edu/* HTTP/2.0" 200 8341 "https://web.archive.org/web/20130304102141/http://maestro.haarp.alaska.edu/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 OPR/57.0.3098.116" 0.087 MISS 0.088 "text/html; charset=utf-8" - "-" "-" "wwwb-app57" "-" 00001000
Results - Internet Archive 2012 Dataset
Heuristics | Sessions: 1,527,340 | Requests: 22,302,090 |
Known Bots | 21,423 (1.40%) | 398,053 (1.78%) |
#UA per IP | 5,050 (0.33%) | 756,801 (3.39%) |
Robots.txt | 1,958 (0.13%) | 11,074 (0.05%) |
Image to HTML ratio | 1,327,896 (86.94%) | 19,893,394 (89.20%) |
Browsing Speed | 237,271 (15.53%) | 4,563,851 (20.46%) |
Total Robots | 1,340,318 (87.76%) | 20,281,301 (90.94%) |
Results - Internet Archive 2019 Dataset
Heuristics | Sessions: 2,658,637 | Requests: 42,868,048 |
Known Bots | 322,379 (12.13%) | 4,969,187 (11.59%) |
#UA per IP | 5,475 (0.21%) | 1,442,574 (3.37%) |
Robots.txt | 9,296 (0.35%) | 31,452 (0.07%) |
Image to HTML ratio | 1,746,989 (65.71%) | 24,056,112 (56.12%) |
Browsing Speed | 514,878 (19.37%) | 21,176,163 (49.40%) |
Total Robots | 1,854,282 (69.75%) | 29,968,059 (69.91%) |
Results - Arquivo.pt 2019 Dataset
Heuristics | Sessions: 3,680 | Requests: 613,672 |
Known Bots | 884 (24.02%) | 67,453 (10.99%) |
#UA per IP | 3 (0.08%) | 2,636 (0.43%) |
Robots.txt | 404 (10.98%) | 4,236 (0.69%) |
Image to HTML ratio | 2,916 (79.24%) | 589,363 (96.04%) |
Browsing Speed | 1,694 (46.03%) | 162,068 (26.41%) |
Total Robots | 3,584 (97.39%) | 603,654 (98.37%) |
Percentage of Robots Detected in Each Dataset
Number of Sessions
Number of Requests