Internet Archive Level 1 Diagnosis Guide
Archive.org Level 1 Service Diagnosis Guide
When escalating:
First: post a description of the problem and the steps checked in the Slack #ops channel.
Second: send an SMS.
Third: call (phone numbers in column H).
Phones:
John G.: (415) 250-1308, (415) 482-7971
Brewster: (415) 533-5593
Andrew: (612) 817-6918
Jonah: (415) 763-8676
Sam: (415) 425-7739
Aaron: (415) 637-5243
Gio: (415) 568-1457
Hank: (912) 508-3579
Mark: (917) 697-0110
Kenji: (650) 863-7586
MEK: (415) 690-8033
James: (215) 292-2830

Draft 2: 8/9/2018

Core Team (in call order): John, Sam, Andrew, Jonah, Management (below)
Server Admin Team (in call order): John, Andrew, Sam, Jonah, Management (below)
Search Services Team (in order): John, Aaron X., Sam, Gio, Management
Archive-It Team: James
OpenLibrary Team: MEK
Wayback Team: Mark, David van Duzer (dvd), Kenji
Management (in call order): John, Brewster
See "after escalation checklist" tab (below)....

Situation:
"Site is Down" notice comes in (via Slack or phone call) to someone on the Level-1 Response Team

ARCHIVE.ORG SITE OUTAGE
Each step lists: Step Number, Question to Answer, Do this, Look at these things, What indicates a problem?, Confirmation Step, Problem Resolution or Next Action, Service Documentation

Step 1: Is it down for EVERYONE, or just the notifier?
  Do this: Check "down for everyone"
  Look at these things: When you click on this link, it will indicate whether the site is down for everyone or just you.
  What indicates a problem: The report indicates that Archive.org is down for everyone.
  Confirmation step: If the report indicates that only you are unable to connect to the site, please try to find someone to corroborate your experience...
  Next action: Go to Step 2
1.1: HOW is it "Down"?
  Do this: Go to Archive.org
  Look at these things: What comes up in 2 desktop browsers? Does it come up on a phone browser?
  What indicates a problem: "Scheduled Maintenance", a 503 Error, or "Server Unavailable"
  Confirmation step: ping archive.org
  Next action: Go to Step 2
  Service documentation: https://wiki.archive.org/twiki/bin/view/PetaBox/WebHome
1.2:
  What indicates a problem: Certificate Error
  Next action: Send e-mail, not an emergency.
1.3:
  What indicates a problem: Site loads, but is very slow or home page does not load any collection or image tiles
  Next action: Go to Step 5
1.4:
  What indicates a problem: Site loads, but home page does not load any collection or image tiles
  Next action: Go to Step 6.3

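The symptoms listed in Step 1.1 can be sorted mechanically once a page has been fetched. The sketch below is only an illustration and is not part of the Archive's tooling: the function name `classify_response` and the marker strings are assumptions based on the wording of this guide, not on the real error pages. It takes a status code and page text that you have already obtained (for example from a browser or `curl`).

```python
def classify_response(status_code, body):
    """Map an HTTP status code and page text to a Step 1.1 symptom.

    Hypothetical helper: the marker strings are assumptions taken from
    the wording of this guide, not from the actual error pages.
    """
    text = body.lower()
    if "scheduled maintenance" in text:
        return "scheduled maintenance"
    if status_code == 503 or "server unavailable" in text:
        return "server unavailable"
    if 200 <= status_code < 400:
        return "loads"
    return "other error"
```

Either of the first two results corresponds to "What indicates a problem" in Step 1.1, so the next action would be Step 2.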
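Step 2: Is ONLY Archive.org down, or are all Internet Archive Services down?
  Do this: Go to OpenLibrary.org
  Look at these things: What comes up in your browser?
  What indicates a problem: 503 Error or Server Unavailable
  Confirmation step: ping openlibrary.org
  Next action: Go to Step 3
2.1:
  Do this: Go to Archive-it.org
  Look at these things: What comes up in your browser?
  What indicates a problem: 503 Error or Server Unavailable
  Confirmation step: ping archive-it.org
  Next action: Go to Step 3
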
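Step 2 asks whether only Archive.org is down or all services are. As a minimal sketch, assuming you have already noted for each site whether it loads, the hypothetical helper `outage_scope` below turns those observations into a one-line summary suitable for the Slack #ops post. The function name and the summary wording are illustrative assumptions.

```python
def outage_scope(site_up):
    """Summarize which services are down.

    site_up maps a hostname to True (loads) or False (503 / unavailable).
    """
    down = sorted(host for host, up in site_up.items() if not up)
    if not down:
        return "nothing down"
    if len(down) == len(site_up):
        return "all services down"
    return "down: " + ", ".join(down)
```

For example, `outage_scope({"archive.org": False, "openlibrary.org": True, "archive-it.org": True})` reports that only archive.org is down. In either case the guide continues at Step 3.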
Step 3: Is an entire datacenter down?
  Do this: Go to the "Weathermap"
  Look at these things: Look at the color (traffic levels) for arrows in and out of "300 Funston" and "2512 Florida" (center of right cluster of arrows)
  What indicates a problem: All arrows are gray for more than 5 minutes.
  Next action: ESCALATE TO CORE TEAM !!! AND ESCALATE TO MANAGEMENT!!!
  Service documentation: https://wiki.archive.org/twiki/bin/view/PetaBox/WebHome

Step 4: Is there a major network issue?
  Do this: Go to the "Weathermap"
  Look at these things: Look at the color (traffic levels) for arrows between "200 Paul" and each of its connection points
  What indicates a problem: Many arrows are gray for more than 5 minutes.
  Next action: ESCALATE TO CORE TEAM !!! AND ESCALATE TO MANAGEMENT!!!
  Service documentation: https://wiki.archive.org/twiki/bin/view/PetaBox/WebHome

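Steps 3 and 4 both depend on arrows staying gray "for more than 5 minutes". The sketch below shows one way to apply that rule to a series of observations. It assumes the observations are recorded by hand or by some other means (the Weathermap itself is not queried here), and the name `gray_too_long` is hypothetical. Each sample's flag should be True when the condition of the step holds (all arrows gray for Step 3, many arrows gray for Step 4).

```python
from datetime import datetime, timedelta

def gray_too_long(samples, min_duration=timedelta(minutes=5)):
    """Return True if the most recent unbroken gray run exceeds min_duration.

    samples: list of (timestamp, is_gray) tuples, oldest first.
    """
    run_start = None
    for ts, is_gray in samples:
        if is_gray:
            if run_start is None:
                run_start = ts  # a new gray run begins here
        else:
            run_start = None    # any non-gray sample resets the run
    if run_start is None:
        return False
    return samples[-1][0] - run_start > min_duration
```

A single non-gray reading resets the clock, so a brief recovery followed by two gray minutes would not trigger escalation under this reading of the rule.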
Step 5: Has the site service been "wobbly" recently?
  Do this: Go to pageview history
  Look at these things: Look at the 3-Hour graphs
  What indicates a problem: Do any of the 4 graphs show gaps (drop to zero), a sudden drop, or wild fluctuations (swings of more than 30%) within the last 30 minutes?
  Next action: Go to Step 6
5.1:
  What indicates a problem: Do any of the 4 charts show a recent significant increase in the amount of purple or black at the top of the "flame"?
  Next action: Go to Step 6
5.2:
  Do this: Go to Sorry Server Stats
  Look at these things: Look at 3-hour graphs
  What indicates a problem: Are there recent spikes of activity?
  Next action: Go to Step 6

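The "wobbly" test in Step 5 (gaps, sudden drops, swings of more than 30% within the last 30 minutes) can be expressed as a small check over a list of readings. This is a sketch under assumptions: the guide does not say how a swing is measured, so the code compares each reading with the one before it, and the name `looks_wobbly` is hypothetical. The readings would have to be copied from the graph or exported separately.

```python
def looks_wobbly(values, max_swing=0.30):
    """Return the reasons a series of recent readings looks unhealthy.

    values: readings covering the last 30 minutes, oldest first.
    A swing is measured between consecutive readings (an assumption).
    """
    reasons = []
    if any(v == 0 for v in values):
        reasons.append("gap (drop to zero)")
    for prev, cur in zip(values, values[1:]):
        if prev > 0 and abs(cur - prev) / prev > max_swing:
            reasons.append("large swing")
            break
    return reasons
```

An empty result means none of the Step 5 conditions were seen in the readings supplied; any non-empty result matches "What indicates a problem" and leads to Step 6.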
Step 6: Are the databases healthy?
  Do this: Link to Grafana graph
  Look at these things: Look at 3-hour graphs
  What indicates a problem: Is there any recent drop in transactions per second or spike in the replication delay?
6.1: Is the DB variance normal?
  Do this: Click on the suspicious graph
  Look at these things: Look at Daily, Weekly, and Monthly detail
  What indicates a problem: Is recent activity inconsistent with past patterns?
  Next action: Escalate to Server Admin Team

Step 7: Are the web-heads healthy?
  Do this: Web Head Nagios Alerts (you will need the standard login and password)
  Look at these things: Look at the small table at the top middle center of the web page that comes up.
  What indicates a problem: How many web-head servers are up? Down? Unreachable? If more than 5 servers are down or unreachable, that is bad.
  Next action: Escalate to Server Admin Team

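The Step 7 threshold (more than 5 web-heads down or unreachable) is simple enough to check in code. The sketch below assumes the host states have already been read off the Nagios page and typed into a list. It does not query Nagios, and the function name `web_heads_need_escalation` is hypothetical.

```python
def web_heads_need_escalation(states, limit=5):
    """Return True if more than `limit` hosts are DOWN or UNREACHABLE.

    states: list of host state strings such as "UP", "DOWN", "UNREACHABLE".
    """
    bad = sum(1 for s in states if s.upper() in ("DOWN", "UNREACHABLE"))
    return bad > limit
```

Exactly 5 bad hosts does not trigger escalation, because the guide says "more than 5".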
Step 8: Is the Search Engine healthy?
  Do this: Look at SE history
  Look at these things: Look at 3-hour graphs
  What indicates a problem: Does the link not work at all, or does the left-hand graph show gaps (drop to zero), a sudden drop, or wild fluctuations (swings of more than 30%) within the last 30 minutes?
  Next action: Escalate to Search Services Team
8.1:
  What indicates a problem: Does the left-hand graph show a recent significant increase in the amount of purple or black at the top of the "flame"?
  Next action: Escalate to Search Services Team
8.2:
  What indicates a problem: Does the right-hand graph show a recent significant increase?
  Next action: Escalate to Search Services Team
8.3: (Advanced): Are the actual ES VMs healthy?
  Do this: Pick and view the ES-related VMs here
  Look at these things: IF IT IS MEANINGFUL TO YOU, use the VM select drop-down in upper left to examine suspected VMs
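
The escalation rules at the top of this guide (Slack post first, then SMS, then phone calls in team order, then Management) can be laid out as an ordered checklist. The sketch below is illustrative only. The names come from the call orders listed in this guide, while the function `escalation_plan` and the choice to skip anyone already called when appending Management are assumptions.

```python
CALL_ORDER = {
    "Core Team": ["John", "Sam", "Andrew", "Jonah"],
    "Server Admin Team": ["John", "Andrew", "Sam", "Jonah"],
    "Search Services Team": ["John", "Aaron X.", "Sam", "Gio"],
    "Management": ["John", "Brewster"],
}

def escalation_plan(team):
    """Return the ordered escalation actions for one team."""
    people = list(CALL_ORDER[team])
    if team != "Management":
        # Management follows the team; skip names already on the list.
        for name in CALL_ORDER["Management"]:
            if name not in people:
                people.append(name)
    steps = [
        "Post problem description and steps checked in Slack #ops",
        "Send SMS",
    ]
    steps += ["Call " + name for name in people]
    return steps
```

For a Step 8 escalation, `escalation_plan("Search Services Team")` yields the Slack post, the SMS, and then calls to John, Aaron X., Sam, Gio, and Brewster in that order.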