1 of 28

Comparison of Access Patterns of Robots and Humans in Web Archives

Himarsha R. Jayanetti (1), Kritika Garg (1), Sawood Alam (2), Michael L. Nelson (1), and Michele C. Weigle (1)

(1) Web Science & Digital Libraries Research Group, Old Dominion University, Norfolk VA, USA (@WebSciDL)

(2) Wayback Machine, Internet Archive, San Francisco, California, USA (@internetarchive)

Web Archiving and Digital Libraries (WADL), 24 June 2022

@HimarshaJ @kritika_garg @ibnesayeed @phonedude_mln @weiglemc

@HimarshaJ @kritika_garg @ibnesayeed @phonedude_mln @weiglemc WADL, 2022

2 of 28

Motivation

  • Although user access patterns on the live web are well understood, there have been only a few studies of how users, both humans and robots, access web archives.

  • We present an examination of user accesses to web archives by using three different datasets from anonymized server access logs.

  • Our goal is to determine which accesses are likely to be from humans (web browsers) and which are from bots.

  • This work extends a previous study of access patterns for robots and humans in web archives, which was based on samples from the Wayback Machine from 2012.

Yasmin AlNoamany, Michele C. Weigle, and Michael L. Nelson, “Access Patterns for Robots and Humans in Web Archives,” In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL). Indianapolis, IN, July 2013, pp. 339-348.

https://www.cs.odu.edu/~mln/pubs/jcdl-2013/fp105-AlNoamany.pdf

3 of 28

Datasets

Full-day samples of access logs:

  • IA 2012: Internet Archive, February 2, 2012
  • IA 2019: Internet Archive, February 7, 2019
  • PT 2019: Arquivo.pt, February 7, 2019

Feature              IA 2012                IA 2019                 PT 2019
No. of Requests      99,173,542 (100%)      308,194,916 (100%)      1,046,855 (100%)
GET                  97,987,295 (98.80%)    304,125,661 (98.68%)    1,025,132 (97.92%)
HEAD                 1,109,810 (1.12%)      2,578,735 (0.84%)       14,330 (1.37%)
Status Code 2xx      32,460,590 (32.73%)    148,742,768 (48.26%)    272,467 (26.03%)
Status Code 3xx      52,131,835 (52.57%)    131,729,104 (42.74%)    211,709 (20.22%)
Status Code 4xx      11,614,387 (11.71%)    27,099,599 (8.79%)      560,913 (53.58%)
Status Code 5xx      2,964,146 (2.99%)      614,502 (0.20%)         1,764 (0.17%)
Embedded Resources   43,260,926 (43.62%)    195,287,060 (63.36%)    205,976 (19.68%)
SI Bot               8,867 (0.01%)          476,367 (0.15%)         3,602 (0.34%)

7 of 28

Data Cleaning

Stage 1

  • Remove log entries that were either invalid or irrelevant to the analysis.

    • Kept only requests to Mementos and TimeMaps; everything else was removed.
    • Also kept the requests to the robots.txt of the web archive.

Stage 2

  • Remove log entries that were irrelevant in terms of user behavior.

    • Everything except GET requests
    • Everything except 200, 404, and 503 response codes
    • Embedded resources

Requests remaining after each cleaning stage (percentage of the original count):

Dataset    Before         After Stage 1           After Stage 2
IA 2012    99,173,542     84,512,394 (85.22%)     18,432,398 (18.58%)
IA 2019    308,194,916    237,901,926 (77.19%)    35,015,776 (11.36%)
PT 2019    1,046,855      904,515 (86.40%)        604,762 (57.77%)
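The two cleaning stages can be sketched as a pair of filters. This is only an illustration: it assumes each log entry has already been parsed into a dict with the hypothetical keys `method`, `status`, `kind` ("memento", "timemap", "robots", or "other"), and `embedded`. How requests are classified into those kinds is not described on the slide.

```python
from typing import Iterable, Iterator

# Hypothetical record layout (not from the slides):
# {"method": str, "status": int,
#  "kind": "memento" | "timemap" | "robots" | "other", "embedded": bool}
KEPT_KINDS = {"memento", "timemap", "robots"}
KEPT_STATUSES = {200, 404, 503}


def stage1(entries: Iterable[dict]) -> Iterator[dict]:
    """Keep only requests to mementos, TimeMaps, and the archive's robots.txt."""
    return (e for e in entries if e["kind"] in KEPT_KINDS)


def stage2(entries: Iterable[dict]) -> Iterator[dict]:
    """Keep GET requests with status 200, 404, or 503 that are not embedded resources."""
    return (
        e for e in entries
        if e["method"] == "GET"
        and e["status"] in KEPT_STATUSES
        and not e["embedded"]
    )
```

Chaining the filters as `stage2(stage1(entries))` mirrors the order of the stages above.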

8 of 28

Session Identification

  • Dividing the access logs into different sessions.

    • Grouped the requests based on the IP and User-Agent.

    • Divided the requests of each user into individual sessions (timeout threshold: 10 minutes)

1.1.0.100_0_1 - - [02/Feb/2012:04:36:43 +0000] "GET http://web.archive.org/web/*/DEFI-METRAX.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"

0.0.0.100_0_1 - - [02/Feb/2012:04:36:44 +0000] "GET http://wayback.archive.org/web/*/DEFI-METRAX.RU HTTP/1.0" 404 2164 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"

0.0.0.100_0_2 - - [02/Feb/2012:04:58:34 +0000] "GET http://web.archive.org/web/*/LETSMILK.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"

0.0.0.100_0_2 - - [02/Feb/2012:04:58:37 +0000] "GET http://wayback.archive.org/web/*/LETSMILK.RU HTTP/1.0" 404 2162 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"

0.0.0.100_0_2 - - [02/Feb/2012:05:00:21 +0000] "GET http://web.archive.org/web/*/CHISTKALICA.RU HTTP/1.0" 302 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"

0.0.0.100_0_2 - - [02/Feb/2012:05:00:49 +0000] "GET http://wayback.archive.org/web/*/CHISTKALICA.RU HTTP/1.0" 503 2197 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; ru; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6"

The IP addresses are anonymized.

A new session starts when the duration between two consecutive requests is greater than 10 minutes.
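A minimal sketch of this session split, assuming requests are available as `(ip, user_agent, timestamp)` tuples. The tuple layout and the function name `split_sessions` are hypothetical; the 10-minute timeout and the grouping by IP and User-Agent come from the slide.

```python
from collections import defaultdict
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=10)  # threshold stated on the slide


def split_sessions(requests):
    """Group (ip, user_agent, timestamp) tuples into sessions.

    Requests sharing an IP and User-Agent belong to one user; a gap of more
    than TIMEOUT between consecutive requests starts a new session.
    """
    by_user = defaultdict(list)
    for ip, ua, ts in requests:
        by_user[(ip, ua)].append(ts)

    sessions = []
    for user, stamps in by_user.items():
        stamps.sort()
        current = [stamps[0]]
        for prev, ts in zip(stamps, stamps[1:]):
            if ts - prev > TIMEOUT:
                sessions.append((user, current))
                current = []
            current.append(ts)
        sessions.append((user, current))
    return sessions


# Two requests from the same user, 22 minutes apart, form two sessions.
example = [
    ("0.0.0.100", "Firefox/3.6.6", datetime(2012, 2, 2, 4, 36, 44)),
    ("0.0.0.100", "Firefox/3.6.6", datetime(2012, 2, 2, 4, 58, 34)),
]
print(len(split_sessions(example)))  # 2
```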

10 of 28

Type of request: HEAD Request

199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100

199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100

199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100

199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100

  • Web browsers issue GET requests for web pages.

  • We flagged requests that use the HEAD method as bot requests.
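As an illustration, the HEAD check can be done directly on a raw log line. The regular expression below is an assumption about the log layout, based only on the sample lines shown on this slide; `is_head_request` is a hypothetical helper name.

```python
import re

# Matches the quoted request field, e.g. "HEAD /wayback/... HTTP/1.1"
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<target>\S+) HTTP/[\d.]+"')


def is_head_request(log_line: str) -> bool:
    """Return True if the log line records a HEAD request."""
    match = REQUEST_RE.search(log_line)
    return bool(match) and match.group("method") == "HEAD"
```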

11 of 28

Known Bots

  • A manually compiled User-Agent list of known bots.

  • User-Agents with keywords such as bot, crawler, spider, etc.

  • The Python module "DeviceDetector", a User-Agent parser that helps determine whether or not a User-Agent is a bot.

199.16.157.100_0_0 - - [07/Jul/2019:14:00:01 +0100] "GET /robots.txt HTTP/1.1" 200 1414 "-" "Twitterbot/1.0" 01011100

199.16.157.100_0_0 - - [07/Jul/2019:14:00:02 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100

199.16.157.100_0_0 - - [07/Jul/2019:14:00:05 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100

199.16.157.100_0_0 - - [07/Jul/2019:14:00:07 +0100] "HEAD /wayback/20170625001353/http://www.fabricadochocolate.com/ HTTP/1.1" 200 - "-" "Twitterbot/1.0" 11001100

https://pypi.org/project/device-detector/ (DeviceDetector, User-Agent Parser)
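The list and keyword checks could look roughly like the sketch below. It covers only the first two bullets and does not call DeviceDetector. The entry in `KNOWN_BOT_AGENTS` is a made-up placeholder, not taken from the authors' compiled list, and the keyword pattern includes only the three keywords named on the slide.

```python
import re

# Keywords named on the slide; the "etc." there implies the real list is longer.
BOT_KEYWORDS = re.compile(r"bot|crawler|spider", re.IGNORECASE)

# Hypothetical stand-in for the manually compiled list of known bot User-Agents.
KNOWN_BOT_AGENTS = {"ExampleFetcher/1.0"}


def is_known_bot(user_agent: str) -> bool:
    """Return True if the User-Agent is listed or contains a bot keyword."""
    return user_agent in KNOWN_BOT_AGENTS or bool(BOT_KEYWORDS.search(user_agent))
```

With this sketch, `is_known_bot("Twitterbot/1.0")` is true because the string contains "bot".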

12 of 28

Number of User-Agents per IP

x0.77.87.100 - - [02/Feb/2012:03:46:54 +0000] "POST http://web.archive.org/web/20070211155651/http://212.227.83.57/cproc.aspx HTTP/1.0" 302 0 "http://www.vbleisure.co.uk/guest_book.html" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)" 00101000

x0.77.87.100 - - [02/Feb/2012:04:06:29 +0000] "POST http://web.archive.org/web/20070211155651/http://212.227.83.57/cproc.aspx HTTP/1.0" 302 - "http://www.vbleisure.co.uk/guest_book.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)" 00101000

x0.77.87.100 - - [02/Feb/2012:05:09:30 +0000] "POST http://web.archive.org/web/20070211155651/http://212.227.83.57/cproc.aspx HTTP/1.0" 302 - "http://www.vbleisure.co.uk/guest_book.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" 00101000

x0.77.87.100 - - [02/Feb/2012:07:59:43 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 302 0 "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; ru) Opera 8.50" 00101000

. . .

. . .

. . .

x0.77.87.100 - - [02/Feb/2012:22:04:57 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90; Creative)" 00101000

x0.77.87.100 - - [02/Feb/2012:22:08:02 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)" 00101000

x0.77.87.100 - - [02/Feb/2012:23:40:31 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.0" 00101000

x0.77.87.100 - - [02/Feb/2012:23:40:32 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; MRA 4.6 (build 01425))" 00101000

x0.77.87.100 - - [02/Feb/2012:23:59:34 +0000] "POST http://web.archive.org/web/20070501120942/http://www.ibcmemorial.org.way_back_stub/formmailer.php HTTP/1.0" 503 - "http://ibcmemorial.org/sign-guestbook.html" "Opera/7.60 (Windows NT 5.2; U) [en] (IBM EVV/3.0/EAK01AG9/LE)" 00101000

  • Some bots keep changing their User-Agent between requests to avoid being detected.
  • We flagged requests from IPs that changed their User-Agent field more than 20 times as bot requests.
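One way to approximate this heuristic is to count distinct User-Agent strings per IP. Note the assumption: the slide speaks of the number of times the User-Agent changes, whereas this sketch counts distinct values, which is a simplification. The function name and the `(ip, user_agent)` input layout are hypothetical.

```python
from collections import defaultdict

MAX_USER_AGENTS = 20  # threshold stated on the slide


def ips_with_rotating_agents(requests, limit=MAX_USER_AGENTS):
    """Return IPs seen with more than `limit` distinct User-Agent strings.

    `requests` is an iterable of (ip, user_agent) pairs.
    """
    agents = defaultdict(set)
    for ip, ua in requests:
        agents[ip].add(ua)
    return {ip for ip, seen in agents.items() if len(seen) > limit}
```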

13 of 28

Robots.txt

  • Legitimate bots will typically request robots.txt to determine what they are allowed to crawl.
  • We considered a request for the robots.txt file as an indication of a bot.

0.139.100.213_2_2 - - [02/Feb/2012:17:03:22 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2" 00011000

0.139.100.213_2_2 - - [02/Feb/2012:17:06:30 +0000] "GET http://web.archive.org/web/*/http://c00lbookmarks.com/story.php?title=best-door-blinds-inside HTTP/1.1" 302 0 "-" "RSS Scout 0.9.2" 00001000

0.139.100.213_2_2 - - [02/Feb/2012:17:06:32 +0000] "GET http://wayback.archive.org/web/*/http://c00lbookmarks.com/story.php?title=best-door-blinds-inside HTTP/1.1" 404 2409 "http://web.archive.org/web/*/http://c00lbookmarks.com/story.php?title=best-door-blinds-inside" "RSS Scout 0.9.2" 00001000

0.139.100.213_2_2 - - [02/Feb/2012:17:07:38 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2" 00011000

0.139.100.213_2_2 - - [02/Feb/2012:17:10:44 +0000] "GET http://web.archive.org/web/*/http://www.goloco.org/users/D5EWwXI HTTP/1.1" 302 0 "-" "RSS Scout 0.9.2" 00001000

0.139.100.213_2_2 - - [02/Feb/2012:17:10:45 +0000] "GET http://wayback.archive.org/web/*/http://www.goloco.org/users/D5EWwXI HTTP/1.1" 404 2385 "http://web.archive.org/web/*/http://www.goloco.org/users/D5EWwXI" "RSS Scout 0.9.2" 00001000

0.139.100.213_2_2 - - [02/Feb/2012:17:14:50 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2" 00011000

0.139.100.213_2_2 - - [02/Feb/2012:17:19:54 +0000] "GET http://web.archive.org/robots.txt HTTP/1.1" 200 125 "-" "RSS Scout 0.9.2" 00011000

0.139.100.213_2_2 - - [02/Feb/2012:17:27:26 +0000] "GET http://wayback.archive.org/web/*/http://epicbookmarks.com/story.php?title=door-blinds-inside-installation HTTP/1.1" 404 2416 "http://web.archive.org/web/*/http://epicbookmarks.com/story.php?title=door-blinds-inside-installation" "RSS Scout 0.9.2" 00001000

0.139.100.213_2_2 - - [02/Feb/2012:17:27:26 +0000] "GET http://web.archive.org/web/*/http://epicbookmarks.com/story.php?title=door-blinds-inside-installation HTTP/1.1" 302 0 "-" "RSS Scout 0.9.2" 00001000
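A small sketch of the robots.txt check, assuming the request target is available as a string (either a full URL, as in the IA 2012 sample, or a path). Flagging at the session level is an assumption here, and both function names are hypothetical. Comparing the whole path avoids matching archived copies of other sites' robots.txt files.

```python
from urllib.parse import urlsplit


def requests_robots_txt(target: str) -> bool:
    """Return True if the request target is the archive's own /robots.txt."""
    return urlsplit(target).path == "/robots.txt"


def session_requests_robots_txt(targets) -> bool:
    """Return True if any request in the session asks for /robots.txt."""
    return any(requests_robots_txt(t) for t in targets)
```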


14 of 28

Image to HTML ratio

  • Image-to-HTML is the ratio between the number of image files and the number of HTML files per session.
  • Robots tend to retrieve only HTML pages (ignoring images and other embedded resources). Therefore, human sessions should have more images than robot sessions.
  • We flagged a session with fewer than one image file for every 10 HTML files as a robot session.

[Figure panels: "Downloaded using cURL" and "Accessed in the Web Browser"]
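The threshold can be written as a short check on per-session counts. This is a sketch: the function name is hypothetical, and treating sessions with no HTML requests as unflagged is an assumption, since the slide does not say how they are handled.

```python
def is_robot_by_image_ratio(num_images: int, num_html: int) -> bool:
    """Flag a session with fewer than one image per 10 HTML files."""
    if num_html == 0:
        # Assumption: the ratio is undefined without HTML requests, so do not flag.
        return False
    # Integer form of num_images / num_html < 1 / 10, avoiding float rounding.
    return num_images * 10 < num_html
```

For example, a session with 1 image and 10 HTML files sits exactly at the ratio and is not flagged, while 1 image and 11 HTML files is flagged.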

15 of 28

Browsing Speed

  • We used a browsing speed of 0.5 requests per second or higher as the threshold for detecting robot sessions.

0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190205174131/https://connect.facebook.net/signals/config/225699104785488?v=2.8.40&r=stable HTTP/1.1" 200

0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200

0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200

0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200

0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207001831/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 200

0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004025/https://connect.facebook.net/en_US/fbevents.js HTTP/1.1" 200

0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004025/https://connect.facebook.net/signals/config/225699104785488?v=2.8.40&r=stable HTTP/1.1" 302

0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004026/https://embed.tawk.to/59cc85aec28eca75e4622ccd/default HTTP/1.1" 200

0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004026/https://embed.tawk.to/59cc85aec28eca75e4622ccd/default HTTP/1.1" 200

0.0.100.100_0_0 web.archive.org - [07/Feb/2019:00:46:30 +0000] "GET /web/20190207004026/https://fonts.googleapis.com/css?family=Lato:100,100i,300,300i,400,400i,700,700i,900,900i&subset=latin-ext HTTP/1.1" 302

. . .

. . .

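A rough sketch of the speed check over a session's timestamps. The slide gives only the 0.5 requests-per-second threshold; how the session duration is measured is not stated, so the choices below (span from first to last request, floored at one second, and single-request sessions left unflagged) are assumptions.

```python
from datetime import datetime, timedelta

SPEED_THRESHOLD = 0.5  # requests per second, from the slide


def browsing_speed(timestamps):
    """Requests per second over a session's time span.

    Assumptions: the span is last minus first timestamp, floored at one
    second so bursts within a single second do not divide by zero, and a
    session with fewer than two requests has speed 0.
    """
    if len(timestamps) < 2:
        return 0.0
    span = (max(timestamps) - min(timestamps)).total_seconds()
    return len(timestamps) / max(span, 1.0)


def is_robot_by_speed(timestamps) -> bool:
    return browsing_speed(timestamps) >= SPEED_THRESHOLD


# Ten requests logged in the same second, as in the sample above.
burst = [datetime(2019, 2, 7, 0, 46, 30)] * 10
print(is_robot_by_speed(burst))  # True
```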

16 of 28

Results: Bot Detection

Heuristics             IA 2012                      IA 2019                      PT 2019
                       Sessions      Requests       Sessions      Requests       Sessions     Requests
(Total in dataset)     1.53M         22.3M          2.7M          42.9M          3.7k         614k
Known Bots             21k (1%)      398k (1%)      322k (12%)    4.96M (12%)    1k (24%)     67k (11%)
#UA per IP             5k (0.3%)     757k (3%)      5k (0.2%)     1.4M (3.4%)    3 (0.1%)     3k (0.4%)
Robots.txt             2k (0.1%)     11k (0.1%)     9k (0.4%)     31k (0.1%)     404 (11%)    4k (0.7%)
Image to HTML ratio    1.33M (87%)   19.89M (89%)   1.75M (66%)   24M (56%)      3k (79%)     589k (96%)
Browsing Speed         237k (16%)    4.56M (20%)    515k (19%)    21M (49%)      2k (46%)     162k (26%)
Total Robots           1.34M (88%)   20.28M (91%)   1.85M (70%)   30M (70%)      4k (97%)     604k (98%)

Each heuristic row gives the number of requests/sessions labeled as robots by that heuristic separately.

Image-to-HTML ratio had the largest effect on detecting robots.

Total Robots: the number of requests/sessions detected after applying all the heuristics together.
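The Total Robots row is smaller than the sum of the individual rows, which is consistent with sessions being flagged by more than one heuristic. A minimal sketch of combining the heuristics follows, under the assumption that "applying all the heuristics together" means a session counts as a robot if any heuristic flags it. The function name and the input mapping are hypothetical.

```python
def total_robots(flags_by_heuristic):
    """Union of session IDs flagged by any heuristic.

    `flags_by_heuristic` maps a heuristic name to the set of session IDs
    that heuristic flagged.
    """
    return set().union(*flags_by_heuristic.values())
```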

19 of 28

Key Takeaways

  • In 2013, AlNoamany et al. used a 30-minute sample of the Wayback Machine's anonymized server access logs from 2012 to investigate the access patterns of humans and robots in the Internet Archive.

  • We have extended this study to use three samples of server access logs for a full day.
    • IA's Wayback Machine - 2012 and 2019
    • Arquivo.pt - 2019

  • We used a variety of heuristics to classify sessions as robot or human, including the type of request made (HEAD bot), known bots, User-Agents per IP, requests to robots.txt, image-to-HTML ratio, and browsing speed. The image-to-HTML ratio had the largest effect on detecting robots.

The percentage of web archive accesses that were detected as robots:

  • Out of requests: IA 2012: 91%, IA 2019: 70%, PT 2019: 98%
  • Out of sessions: IA 2012: 88%, IA 2019: 70%, PT 2019: 97%

20 of 28

Backup Slides …


21 of 28

Methodology


22 of 28

Access Patterns

Dip: The user accesses only one URI (URI-M or URI-T).

Slide: The user accesses the same URI-R at different Memento-Datetimes.

Dive: The user accesses different URI-Rs at nearly the same Memento-Datetime (i.e., dives deeply into a memento by browsing hyperlinks of URI-Ms).

Skim: The user accesses different TimeMaps (URI-T).
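The four definitions can be turned into a simple session labeler. This is a sketch under several assumptions that are not in the slides: each request is a `(kind, uri_r, memento_datetime)` tuple, "nearly the same" is taken as within a hypothetical one-day `tolerance`, and sessions that fit none of the four definitions are labeled "Mixed".

```python
from datetime import datetime, timedelta


def access_pattern(requests, tolerance=timedelta(days=1)):
    """Label a session Dip, Slide, Dive, Skim, or Mixed.

    `requests` is a list of (kind, uri_r, memento_datetime) tuples, where
    kind is "memento" or "timemap" and memento_datetime is None for TimeMaps.
    """
    if len(set(requests)) == 1:
        return "Dip"  # only one URI accessed
    kinds = {kind for kind, _, _ in requests}
    uri_rs = {uri_r for _, uri_r, _ in requests}
    if kinds == {"timemap"}:
        return "Skim"  # several different TimeMaps
    if kinds == {"memento"}:
        datetimes = [dt for _, _, dt in requests]
        if len(uri_rs) == 1:
            return "Slide"  # same URI-R, different Memento-Datetimes
        if max(datetimes) - min(datetimes) <= tolerance:
            return "Dive"  # different URI-Rs, nearly the same Memento-Datetime
    return "Mixed"
```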

23 of 28

Temporal Preference

The majority of the requests are for mementos that are close to the datetime of each access log sample.

24 of 28

Image to HTML ratio

0.0.122.100_1_0 web.archive.org - [07/Feb/2019:16:55:22 +0000] "GET /web/*/http://maestro.haarp.alaska.edu/ HTTP/2.0" 200 9002 "https://archive.org/search.php?query=http%3A%2F%2Fmaestro.haarp.alaska.edu%2F" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 OPR/57.0.3098.116" 0.192 MISS 0.192 "text/html; charset=utf-8" - "-" "-" "wwwb-app31" "-" 00001000

0.0.122.100_1_0 web.archive.org - [07/Feb/2019:16:56:15 +0000] "GET /web/20130304102141/http://maestro.haarp.alaska.edu/ HTTP/2.0" 404 0 "https://web.archive.org/web/20130715000000*/http://maestro.haarp.alaska.edu/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 OPR/57.0.3098.116" 10.859 MISS 10.856 "text/html; charset=utf-8" - "-" "-" "wwwb-app104" "-" 00001000

0.0.122.100_1_0 web.archive.org - [07/Feb/2019:16:56:15 +0000] "GET /web/20130304102141/http://maestro.haarp.alaska.edu/ HTTP/2.0" 404 0 "https://web.archive.org/web/20130715000000*/http://maestro.haarp.alaska.edu/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 OPR/57.0.3098.116" 10.926 MISS 10.928 "text/html; charset=utf-8" - "-" "-" "wwwb-app58" "-" 00001000

0.0.122.100_1_0 web.archive.org - [07/Feb/2019:16:56:15 +0000] "GET /web/20130304102141/http://maestro.haarp.alaska.edu/ HTTP/2.0" 404 0 "https://web.archive.org/web/20130715000000*/http://maestro.haarp.alaska.edu/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 OPR/57.0.3098.116" 11.453 MISS 11.456 "text/html; charset=utf-8" - "-" "-" "wwwb-app57" "-" 00001000

. . .

. . .

. . .

0.0.122.100_1_0 web.archive.org - [07/Feb/2019:16:56:23 +0000] "GET /web/20130304102141/http://maestro.haarp.alaska.edu/ HTTP/2.0" 404 8274 "https://web.archive.org/web/20130715000000*/http://maestro.haarp.alaska.edu/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 OPR/57.0.3098.116" 0.000 HIT - "text/html; charset=utf-8" - "-" "-" "wwwb-app43" "-" 00001000

0.0.122.100_1_0 web.archive.org - [07/Feb/2019:16:56:23 +0000] "GET /web/20130304102141/http://maestro.haarp.alaska.edu/ HTTP/2.0" 404 8274 "https://web.archive.org/web/20130715000000*/http://maestro.haarp.alaska.edu/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 OPR/57.0.3098.116" 0.227 MISS 0.224 "text/html; charset=utf-8" - "-" "-" "wwwb-app43" "-" 00001000

0.0.122.100_1_0 web.archive.org - [07/Feb/2019:16:56:29 +0000] "GET /web/*/http://maestro.haarp.alaska.edu/* HTTP/2.0" 200 8341 "https://web.archive.org/web/20130304102141/http://maestro.haarp.alaska.edu/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 OPR/57.0.3098.116" 0.087 MISS 0.088 "text/html; charset=utf-8" - "-" "-" "wwwb-app57" "-" 00001000


25 of 28

Results - Internet Archive 2012 Dataset

Heuristics             Sessions: 1,527,340     Requests: 22,302,090
Known Bots             21,423 (1.40%)          398,053 (1.78%)
#UA per IP             5,050 (0.33%)           756,801 (3.39%)
Robots.txt             1,958 (0.13%)           11,074 (0.05%)
Image to HTML ratio    1,327,896 (86.94%)      19,893,394 (89.20%)
Browsing Speed         237,271 (15.53%)        4,563,851 (20.46%)
Total Robots           1,340,318 (87.76%)      20,281,301 (90.94%)

26 of 28

Results - Internet Archive 2019 Dataset

Heuristics             Sessions: 2,658,637     Requests: 42,868,048
Known Bots             322,379 (12.13%)        4,969,187 (11.59%)
#UA per IP             5,475 (0.21%)           1,442,574 (3.37%)
Robots.txt             9,296 (0.35%)           31,452 (0.07%)
Image to HTML ratio    1,746,989 (65.71%)      24,056,112 (56.12%)
Browsing Speed         514,878 (19.37%)        21,176,163 (49.40%)
Total Robots           1,854,282 (69.75%)      29,968,059 (69.91%)

27 of 28

Results - Arquivo.pt 2019 Dataset

Heuristics             Sessions: 3,680         Requests: 613,672
Known Bots             884 (24.02%)            67,453 (10.99%)
#UA per IP             3 (0.08%)               2,636 (0.43%)
Robots.txt             404 (10.98%)            4,236 (0.69%)
Image to HTML ratio    2,916 (79.24%)          589,363 (96.04%)
Browsing Speed         1,694 (46.03%)          162,068 (26.41%)
Total Robots           3,584 (97.39%)          603,654 (98.37%)

28 of 28

Percentage of Robots Detected in Each Dataset

[Figure panels: "Number of Sessions" and "Number of Requests"]