Ops Checklist
Check | Customer API | Partner API | Admin | Detail description
Last Reviewed | Jul-15 | Jul-15 | Jul-15

Certificates
Monitor third-party issued certificates and notify 30 days before expiration | Yes | No | Yes | e.g. Verisign informs us 30 days before expiration
Renewal of external certificates takes less than an hour | No | No | No | e.g. Verisign certificate renewal is fully automated
Monitor all self-signed certificates and notify 30 days before expiration | e.g. Some job or monitoring tool checks all self-signed certificates on each server
Self-signed certificates are recorded in an inventory and expiration dates are regularly checked | e.g. All certificate names, domains, and expiration dates are maintained in the support wiki
Renewed certificates can be deployed within 10 mins across all servers, fully automatically | e.g. AWS certificate validation can take up to 72 hours. That's too slow for an emergency

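The 30-day expiration notice above can be sketched as a small check. This is only an illustration: the function names and the NOTIFY_DAYS constant are made up for this example, and the notAfter string is assumed to be in the format Python's ssl module reports for a certificate (for example from getpeercert()). How the certificates are collected from each server is left out.

```python
import ssl
from datetime import datetime, timezone

NOTIFY_DAYS = 30  # notification window taken from the checklist


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter timestamp.

    `not_after` is assumed to look like "Jan 15 00:00:00 2025 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_renewal_notice(not_after: str, now: datetime,
                         notify_days: int = NOTIFY_DAYS) -> bool:
    """True when the certificate expires within notify_days or has expired."""
    return days_until_expiry(not_after, now) <= notify_days
```

A scheduled job could run needs_renewal_notice over every entry in the certificate inventory and raise a notification for each True result.
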
Traps/Alerts
Trap for too many errors in webserver log | N/A | No | Yes | e.g. If more than 10 HTTP 500 responses are recorded in the access log per minute, raise trap.
Trap on too many slow responses in webserver log | N/A | e.g. If more than 10 HTTP requests took more than 10 seconds in a minute, raise trap.
Trap for ERROR, WARNING, CRITICAL, ALERT, EMERGENCY in WebLogic log | N/A | Yes | Yes | Trap for each of these keywords. If there are already too many ERROR and WARNING entries, suppress these two for a while, until we fix them. But the rest of the keywords must be in hunter trap. In IIS, there's the HTTPERR log: \windows\system32\inetsrv\LogFiles
Trap for hogging threads | N/A | Yes | Yes | Trap on HoggingThreads >= 5. Windows: Check the perf counter for ASP.NET Requests in Application Queue
Trap for Request Queue length | N/A | Yes | No | Trap on QueueLength > 0. Windows: Check the perf counter for ASP.NET Requests in Application Queue
Trap on health state of website/API process | N/A | Yes | Yes | Trap on Health_State != HEALTH_OK. Windows: Check App Pool enabled, Website started status, W3SVC service running status
Trap on thread issues | N/A | Yes | No | Trap on Stuck threads, IdleThreads = 0, Throughput >= 30, ActiveConnectionsHighCount > 30
Trap on files not housekept | N/A | No | Yes | If log files are older than 3 months, alert
No stdout file continuously growing and not getting rotated | N/A | No | Yes | For example, WebLogic stdout getting written to a file that cannot be rotated unless WebLogic is restarted
Trap when webserver is down | N/A | Yes | Yes | Monitor web server process availability
Trap when important pages/services are not responding within time | N/A | No | Yes | Hit important URLs and expect a response within 5 sec. If there is no response or an error, raise trap

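The "more than 10 HTTP 500 per minute" rule above can be sketched as a log scan. This is a minimal sketch that assumes the access log is in Common Log Format; IIS and other servers write different formats and would need a different pattern. The function name and defaults are hypothetical.

```python
import re
from collections import Counter

# Captures the timestamp truncated to the minute and the status code
# from a Common Log Format line (assumed format, see lead-in).
LINE_RE = re.compile(
    r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2} [^\]]+\] "[^"]*" (\d{3})\b'
)


def minutes_over_threshold(lines, status="500", threshold=10):
    """Return {minute: count} for minutes with more than `threshold` hits of `status`."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group(2) == status:
            counts[match.group(1)] += 1
    return {minute: n for minute, n in counts.items() if n > threshold}
```

Any minute returned by minutes_over_threshold would be a candidate for raising the trap. The slow-response rule could follow the same shape if the log also records request duration.
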
Database
Resilience
RAC/Mirroring | Yes | Yes | Yes | Oracle RAC or SQL Server Mirroring for all critical databases
All nodes evenly loaded | No | ?? | Yes | Check Oracle AWR report to see if all nodes have more or less the same average sessions, CPU, and IO load
DR failover tested | No | Yes | No | Was the DR database ever brought online, and did the DR app talk to it for a while?
Have up-to-date checklist for various failover scenarios | No | Yes | No | You have a detailed checklist that says who needs to do what when we go through a failover scenario. For example, when a single DB node is down, who does what? When all DB nodes are down, who does what?
Use TNS Service Name not SID and SCAN IP not Node IP everywhere | N/A | No | Yes | No cron, Java properties file, or TNS config may use the Oracle SID or an individual node IP. They must use the Service Name and RAC cluster IP.

Storage
Proper storage tier | Yes | Yes | Yes | Ensure all data files, redo, archive are on Tier 2 storage. Ask Storage team.
High storage latency trap (>30ms on any tablespace) | Yes | Yes | Yes | Schedule a cron to check latency recorded by Oracle and raise trap if not within limit. SQL Server Audit records storage latency.
Temp tablespace alert | No | Yes | Yes | Schedule cron to check temp tablespace free space; if less than 40%, raise trap. SQL Server: Check tempdb max size and current utilization.
Tablespace over 80% trap | Yes | Yes | Yes | Schedule cron to check for tablespaces over 80% and raise trap. SQL Server: Any MDF, LDF file having <20% free space.
Autogrowth of tablespaces is 100MB or more | Yes | Yes | Yes | There should be no tablespace with autogrowth smaller than 100MB. SQL Server: MDF, LDF autogrowth must be >100MB, not a %
Monitor that UNDO space has enough free space | Yes | Yes | Yes | Ensure UNDO space never fills up more than 80%; if it does, fix SQL or increase UNDO. For SQL Server, this is the LDF.

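The 80% tablespace trap above comes down to a threshold comparison once usage figures are available. The sketch below shows only that comparison. The Tablespace record and the over_threshold name are hypothetical, and the query that fetches used and maximum sizes from Oracle or SQL Server is deliberately left out.

```python
from typing import NamedTuple


class Tablespace(NamedTuple):
    """Usage figures for one tablespace (hypothetical record for this sketch)."""
    name: str
    used_mb: float
    max_mb: float


def over_threshold(tablespaces, limit_pct=80.0):
    """Return (name, used %) for every tablespace strictly above limit_pct."""
    flagged = []
    for ts in tablespaces:
        pct = 100.0 * ts.used_mb / ts.max_mb
        if pct > limit_pct:
            flagged.append((ts.name, round(pct, 1)))
    return flagged
```

The same function could cover the UNDO check by passing only the UNDO tablespace, since both rows use the 80% limit.
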
Memory
SGA and PGA set as per Advisor report in last one month | Yes | Yes | Yes | Check a recent AWR report and see if the SGA, PGA advisory section says we need to increase SGA, PGA
Linux HugePages configured properly for current SGA size | Yes | Yes | Yes | Check with Prod DBA if Oracle is using HugePages and the correct value of HugePages is configured based on SGA size.
ASMM configured (not ASM) | Yes | No | No | Is sga_target > 0?
Trap to detect Linux memory fragmentation | N/A | No | No | Is there any monitoring script that calculates fragmented pages in Linux?
Trap for too low free shared pool memory | N/A | No | Yes | Cron to check free shared pool and, if less than 500MB is free or there is too much fragmentation, raise trap.

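The HugePages row above can be checked with simple arithmetic: the configured page count must cover at least the SGA size divided by the huge page size. The sketch below parses the text of /proc/meminfo and computes that lower bound. The function names are hypothetical, and the real target value should still come from the Prod DBA, since this covers only the SGA figure passed in.

```python
import re


def _meminfo_value(meminfo_text, key):
    """Read one integer field from /proc/meminfo text."""
    match = re.search(rf"^{re.escape(key)}:\s+(\d+)", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError(f"{key} not found in meminfo")
    return int(match.group(1))


def hugepages_needed(sga_bytes, page_kb):
    """Smallest number of huge pages that can hold sga_bytes (integer ceiling)."""
    page_bytes = page_kb * 1024
    return -(-sga_bytes // page_bytes)


def hugepages_sufficient(meminfo_text, sga_bytes):
    """True when HugePages_Total covers the SGA size."""
    page_kb = _meminfo_value(meminfo_text, "Hugepagesize")
    total = _meminfo_value(meminfo_text, "HugePages_Total")
    return total >= hugepages_needed(sga_bytes, page_kb)
```

For example, with 2048 kB pages an 8 GiB SGA needs at least 4096 huge pages.
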
Jobs
Job queries are optimized and running within expected time | No | No | Yes | Are all jobs running within an hour?
No job running during online hours, peak traffic | No | Yes | Yes | No expensive job is scheduled between 8AM and 6PM?
Jobs do not start if previous instance is already running | No | ?? | Yes | Job has a check to ensure a previous instance isn't already running.
Trap if job fails | No | ?? | No | If a job fails, email alert goes to ASG
Overrunning job trap, exceeding max allowed time | No | ?? | No | If a job has exceeded 3 hours, email alert to ASG

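One common way to implement the "do not start if previous instance is already running" check above is an exclusive, non-blocking file lock. The sketch below assumes a Unix-like host (it uses Python's fcntl module, which is not available on Windows). The function names, the AlreadyRunning exception, and the lock path are hypothetical.

```python
import fcntl
import os


class AlreadyRunning(Exception):
    """Raised when another instance of the job holds the lock."""


def acquire_job_lock(lock_path):
    """Take an exclusive lock on lock_path or raise AlreadyRunning.

    The lock is released automatically if the process exits or crashes,
    so a dead job cannot block the next run.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        raise AlreadyRunning(lock_path) from None
    return fd


def release_job_lock(fd):
    """Release the lock taken by acquire_job_lock."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

A job would call acquire_job_lock at startup, exit quietly on AlreadyRunning, and call release_job_lock when it finishes.
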
Query
Application queries are optimized | No | No | Yes | e.g. No online query taking more than 5 seconds. SQL Server has Activity Monitor
Limit max SELECT records | No | No | No | No one can select, say, 10,000 records via some API call and blow up the app/DB

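The record limit above is usually enforced in the API layer before the query is built. The sketch below shows one way to clamp a caller-supplied limit. The cap of 1000 and the default of 100 are made-up values for illustration, not figures from this checklist.

```python
MAX_RECORDS = 1000   # hypothetical hard cap per API call
DEFAULT_LIMIT = 100  # hypothetical page size when the caller sends none


def effective_limit(requested, default=DEFAULT_LIMIT, max_records=MAX_RECORDS):
    """Return the row limit to apply to a query, never above max_records."""
    if requested is None:
        return default
    if requested < 1:
        raise ValueError("limit must be a positive integer")
    return min(requested, max_records)
```

The returned value would then be bound into the query's row limit clause, so a request for 10,000 records is served at most MAX_RECORDS rows.
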
Housekeeping
Regular online index rebuild for fragmented indexes | No | No | No | Scheduled online rebuild of important indexes to keep index performance good
Regular table fragmentation tuning | No | No | No | Defrag highly fragmented tables regularly.
Large tables are partitioned | No | Yes | Yes | All multi-million-record tables are partitioned.
Unused partitions are dropped | No | No | Yes | Partitions containing old records are dropped regularly via scheduled job on large tables.
Weekly stale statistics report | Yes | No | Yes | Report to show stale statistics (last analyzed more than 10 days ago).
Weekly fragmentation report | No | No | No | Report to show heavily fragmented tables and indexes.
Weekly job run time report | No | No | No | Report to show cron and job runtimes and their trend over several weeks.
Weekly/Monthly large table housekeeping | No | No | Yes | Scheduled purging of old data from large tables.
Purge ceased customer data | No | No | No | EU GDPR requires that BT not hold data for a customer who left 2 years ago
Specific stats gather job, not automatic | No | No | Yes | Do not use auto gather stats on the entire database. Gather stats for one specific schema at a time.
Automatic purge of logging tables | No | No | Yes | Any audit or logging type table is purged regularly via automated job.
Cleaned all backup/temp tables | No | No | No | No useless tables left in DB.
Any tablespace over 80% allocated | Yes | Yes | No tablespace over 80% allocated on DB.
Fixed object stats gathered monthly | No | Yes | Yes | Due to an Oracle bug, fixed object stats gather is scheduled to run on a weekend, once a month.
Stats are locked on temp, backup tables | No | No | Yes | Useless tables have their stats locked so that they aren't gathered automatically, wasting DB resources.
Are the backups running successfully without any failure | Yes | Yes | No error reported in NetBackup, Commvault, etc. Check with backup team.

Security
DB links have their own user account, not using schema | Yes | Yes | DB links aren't using unlimited schema user.
Each client has their own user account, not using schema | Yes | No | Applications have their own user accounts and do not use a privileged account, e.g. sa
DBAs have resource limit on their user account | No | No | No | DBA cannot run a query that can blow up the database.
DBAs use their own user ID and do not share user accounts | Yes | No | Yes | All logins are employee ID based, not using sa/sys or any privileged account to log in
DBAs use own ID for regular work, privileged user only for deployment | Yes | No | No | Non-privileged account for day-to-day work.
Latest database version and patch | Yes | No | No | Oracle 12c or SQL Server 2018
Alert on privileged user login | No | No | No | Email alert whenever someone logs into DB using schema user.

Misc
Trap to detect individual node issue, not the cluster | No | No | Yes | Every 10 mins, cron to hit each and every node individually and run a SELECT query on an important table. If the SELECT query takes more than 1 sec, raise alert. If it fails, raise trap.
Sequence reaching max limit trap | Yes | Yes | Yes | Every 10 mins, cron to check if sequence is over 80% of max value.
Trap for invalid object | Yes | No | Yes | Every 10 mins, cron to check if any object has become invalid.
All nodes have same SGA, PGA etc. DB parameters | Yes | Yes | Compare all DB parameters across all nodes.
Backup runs during out of hours | Yes | No
Trap when more than X sessions are blocked for more than Y minutes | No | Yes | Cron/job to monitor for blocking sessions every 10 mins and send alert
Trap when one or more TNS Listener/Database is down | Yes | Yes

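The per-node check in the first Misc row (run a SELECT on each node, alert if it takes more than 1 sec, trap if it fails) can be sketched without tying it to a database driver. In the sketch below each node is represented by a zero-argument callable that runs the test query against that node. The function name, the result labels, and the way the callables connect to a node are all assumptions.

```python
import time


def probe_nodes(probes, slow_after=1.0):
    """Run each node's test query and classify the outcome.

    probes: mapping of node name -> zero-argument callable that runs the
    test SELECT on that node only (connection details are left out here).
    Returns {node: (status, detail)} with status "ok", "alert" or "trap".
    """
    results = {}
    for node, run_query in probes.items():
        start = time.perf_counter()
        try:
            run_query()
        except Exception as exc:  # any failure on this node raises a trap
            results[node] = ("trap", f"query failed: {exc}")
            continue
        elapsed = time.perf_counter() - start
        if elapsed > slow_after:
            results[node] = ("alert", f"slow: {elapsed:.2f}s")
        else:
            results[node] = ("ok", f"{elapsed:.2f}s")
    return results
```

Because every node is probed through its own callable, a slow or failed node shows up even when the cluster address still answers normally.
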
Crons / Windows Task
Cron failure monitoring | No | Yes | Yes | If a business critical cron script fails to run, raise trap.
Crons have exception logging | No | Yes | Yes | All business critical crons record errors in log file.
Detects when cron does not run on time? | No | Yes | No | A cron to check if other business critical crons have run on time.
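
The "has it run on time" check above can be sketched as a watchdog that compares each cron's last successful run against its expected interval. This assumes every business critical cron records a timestamp when it finishes; where that timestamp is stored is left open. The function name, the grace period, and the data layout are hypothetical.

```python
from datetime import datetime, timedelta


def overdue_crons(last_runs, schedule, now, grace=timedelta(minutes=5)):
    """Return the names of crons that have not run within interval + grace.

    last_runs: {name: datetime of last successful run, or None if never run}
    schedule:  {name: timedelta between expected runs}
    """
    overdue = []
    for name, interval in schedule.items():
        last = last_runs.get(name)
        if last is None or now - last > interval + grace:
            overdue.append(name)
    return overdue
```

The watchdog itself would run as a cron and raise a trap for each name returned.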