Ops Checklist
Check | System A, System B, System C | Detail description
Last Reviewed | Jul-15, Jul-15, Jul-15 |

Traps/Alerts
Trap for too many errors in webserver log | N/A, No, Yes | e.g. If more than 10 HTTP 500 responses are recorded in the access log per minute, raise trap.
Trap on too many slow responses in webserver log | N/A | e.g. If more than 10 HTTP requests took more than 10 seconds in a minute, raise trap.
Trap for ERROR, WARNING, CRITICAL, ALERT, EMERGENCY in WebLogic log | N/A, Yes, Yes | Hunter trap for each of these keywords. If there are already too many ERROR and WARNING entries, suppress those two for a while, until we fix them. The rest of the keywords must be in a Hunter trap. In IIS, there's the HTTPERR log: \windows\system32\inetsrv\LogFiles
Trap for hogging threads | N/A, Yes, Yes | Trap on HoggingThreads >= 5. Windows: Check the perf counter for ASP.NET Requests in Application Queue.
Trap for queue length | N/A, Yes, No | Trap on QueueLength > 0. Windows: Check the perf counter for ASP.NET Requests in Application Queue.
Trap on health state | N/A, Yes, Yes | Trap on Health_State != HEALTH_OK. Windows: Check App Pool enabled, Website started status, W3SVC service running status.
Trap on thread issues | N/A, Yes, No | Trap on stuck threads, IdleThreads = 0, Throughput >= 30, ActiveConnectionsHighCount > 30.
Trap on files not housekept | N/A, No, Yes | If log files are older than 3 months, alert.
No stdout file continuously growing and not getting rotated | N/A, No, Yes | For example, WebLogic stdout written to a file that cannot be rotated unless WebLogic is restarted.
Trap when webserver is down | N/A, Yes, Yes | Monitor web server process availability.
Trap when important pages/services are not responding within time | N/A, No, Yes | Hit important URLs and expect a response within 5 sec. If there is no response or an error, raise trap.
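
The first trap above (more than 10 HTTP 500 responses in a minute) could be sketched roughly as follows. This is an illustration only, not the checklist's actual tooling: it assumes the access log uses the Common Log Format layout for the timestamp and status code, and the name `minutes_over_limit` is hypothetical.

```python
import re
from collections import Counter

# Assumed line layout (Common Log Format):
#   1.2.3.4 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 500 123
# The "minute" group keeps the timestamp up to the minute, dropping seconds.
LINE_RE = re.compile(
    r'\[(?P<minute>[^\]:]+:\d{2}:\d{2}):\d{2} [^\]]*\] "[^"]*" (?P<status>\d{3}) '
)


def minutes_over_limit(lines, status="500", limit=10):
    """Return {minute: count} for minutes with more than `limit` hits of `status`."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("status") == status:
            counts[match.group("minute")] += 1
    return {minute: n for minute, n in counts.items() if n > limit}
```

A cron could pass in the most recent log lines and raise the trap for every minute the function returns. Lines that do not match the assumed layout are ignored rather than counted.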

Database
Resilience
RAC/Mirroring | Yes, Yes, Yes | Oracle RAC or SQL Server Mirroring for all critical databases.
All nodes evenly loaded | No, ??, Yes | Check AWR to see if all nodes have more or less the same avg sessions, CPU, IO load.
DR failover tested | No, Yes, No | Was the DR database ever brought online, and did the DR app talk to it for a while?
Have up-to-date checklist for various failover scenarios | No, Yes, No | You have a detailed checklist that says who needs to do what when we go through a failover scenario. For example: when a single DB node is down, who does what? When all DB nodes are down, who does what?
Use TNS Service name not SID, and SCAN IP not node IP, everywhere | N/A, No, Yes | No cron, Java properties file, or TNS config may use the Oracle SID or an individual node IP. They must use the Service Name and the RAC cluster IP.

Storage
Proper storage tier | Yes, Yes, Yes | Ensure all data files, redo, archive are on Tier 2 storage. Ask Storage team.
High storage latency trap (>30ms on any tablespace) | Yes, Yes, Yes | Schedule a cron to check the latency recorded by Oracle and raise trap if not within limit. SQL Server Audit records storage latency.
Temp tablespace alert | No, Yes, Yes | Schedule cron to check temp tablespace free; if less than 40%, raise trap. SQL Server: Check tempdb max size and current utilization.
Tablespace over 80% trap | Yes, Yes, Yes | Schedule cron to check for any tablespace over 80% and raise trap. SQL Server: Any MDF, LDF file having <20% free space.
Autogrowth of tablespaces is 100MB or more | Yes, Yes, Yes | There should be no tablespace with auto growth smaller than 100MB. SQL Server: MDF, LDF must be >100MB, not a %.
Monitor that UNDO space has enough free space | Yes, Yes, Yes | Ensure UNDO space never fills up more than 80%; if it does, fix SQL or increase UNDO. For SQL Server, this is the LDF.
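
The 80% tablespace trap reduces to a simple threshold check once the usage figures are available. The sketch below assumes each tablespace is reported as a hypothetical `(name, used_mb, total_mb)` tuple; how those figures are fetched from Oracle or SQL Server is deliberately left out, and `over_threshold` is an illustrative name.

```python
def over_threshold(tablespaces, limit_pct=80.0):
    """Return [(name, used_pct)] for tablespaces filled above `limit_pct`.

    `tablespaces` is an iterable of (name, used_mb, total_mb) tuples.
    Entries with a non-positive total are skipped to avoid dividing by zero.
    """
    alerts = []
    for name, used_mb, total_mb in tablespaces:
        if total_mb <= 0:
            continue
        used_pct = 100.0 * used_mb / total_mb
        if used_pct > limit_pct:
            alerts.append((name, round(used_pct, 1)))
    return alerts
```

The same function could cover the UNDO check (80% full) by passing only the UNDO tablespace, or the temp tablespace alert (less than 40% free) by calling it with `limit_pct=60.0`.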

Memory
SGA and PGA set as per Advisor report in last one month | Yes, Yes, Yes | Check a recent AWR report and see if the SGA, PGA advisory section says we need to increase SGA, PGA.
Linux HugePages configured properly for current SGA size | Yes, Yes, Yes | Check with Prod DBA whether Oracle is using HugePages and the correct value of HugePages is configured based on SGA size.
ASMM configured (not ASM) | Yes, No, No | Is sga_target > 0?
Trap to detect Linux memory fragmentation | N/A, No, No | Is there any monitoring script that calculates fragmented pages in Linux?
Trap for too low free shared pool memory | N/A, No, Yes | Cron to check free shared pool; if less than 500MB is free, or there is too much fragmentation, raise trap.

DBMS Jobs
Job queries are optimized and running within expected time | No, No, Yes | Are all jobs running within an hour?
No job running during online hours, peak traffic | No, Yes, Yes | No expensive job is scheduled between 8AM and 6PM?
Jobs do not start if previous instance is already running | No, ??, Yes | Job has a check to ensure the previous instance isn't already running.
Trap if job fails | No, ??, No | If a job fails, email alert goes to ASG.
Overrunning job trap, exceeding max allowed time | No, ??, No | If a job has exceeded 3 hours, email alert to ASG.
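
The overrunning-job trap above can be sketched as a comparison of each unfinished job's start time against the 3-hour limit. This assumes job runs are available as hypothetical `(name, started_at, finished_at)` tuples, with `finished_at` set to `None` while the job is still running; sending the email alert is out of scope here.

```python
from datetime import timedelta


def overrunning_jobs(jobs, now, max_runtime=timedelta(hours=3)):
    """Return names of jobs still running after more than `max_runtime`.

    `jobs` is an iterable of (name, started_at, finished_at) tuples, where
    `finished_at` is None for a job that has not completed yet.
    """
    return [
        name
        for name, started_at, finished_at in jobs
        if finished_at is None and now - started_at > max_runtime
    ]
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the check easy to test against fixed timestamps.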

Query
Application queries are optimized | No, No, Yes | e.g. No online query taking more than 5 seconds. SQL Server has Activity Monitor.
Limit max SELECT records | No, No, No | No one can select, say, 10,000 records via some API call and blow up the app/DB.

Housekeeping
Regular online index rebuild for fragmented indexes | No, No, No | Scheduled online rebuild of important indexes to keep index performance good.
Regular table fragmentation tuning | No, No, No | Defrag highly fragmented tables regularly.
Large tables are partitioned | No, Yes, Yes | All multi-million-record tables are partitioned.
Unused partitions are dropped | No, No, Yes | Partitions containing old records are dropped regularly via scheduled job on large tables.
Weekly stale statistics report | Yes, No, Yes | Report to show stale statistics last analyzed 10 days ago.
Weekly fragmentation report | No, No, No | Report to show heavily fragmented tables and indexes.
Weekly job run time report | No, No, No | Report to show cron and job runtime and the trend over several weeks.
Weekly/monthly large table housekeeping | No, No, Yes | Scheduled purging of old data from large tables.
Purge ceased customer data | No, No, No | EU GDPR requires that BT not hold data for a customer who left 2 years ago.
Specific stats gather job, not automatic | No, No, Yes | Do not use auto gather stats on the entire database. Gather one specific schema at a time.
Automatic purge of logging tables | No, No, Yes | Any audit or logging type table is purged regularly via automated job.
Cleaned all backup/temp tables | No, No, No | No useless tables left in DB.
Any tablespace over 80% allocated | Yes, Yes | No tablespace over 80% allocated on DB.
Fixed object stats gathered monthly | No, Yes, Yes | Due to an Oracle bug, fixed object stats gather is scheduled to run on a weekend, once a month.
Stats are locked on temp, backup tables | No, No, Yes | Useless tables have their stats locked so that they aren't gathered automatically, wasting DB resources.
Are the backups running successfully without any failure | Yes, Yes | No error reported in NetBackup, Commvault etc. Check with backup team.

Security
DB links have their own user account, not using schema | Yes, Yes | DB links aren't using the unlimited schema user.
Each client has their own user account, not using schema | Yes, No | Applications have their own user accounts and do not use a privileged account, e.g. sa.
ASGs have resource limit on their user account | No, No, No | ASG cannot run a query that can blow up the database.
ASGs use their own user ID and do not share user accounts | Yes, No, Yes | All logins are EIN/BOATID based, not using sa/sys or any privileged account.
ASGs use own ID for regular work, schema user only for deployment | Yes, No, No | Non-privileged account for day-to-day work.
Latest Oracle/SQL Server version and patch | Yes, No, No | Oracle 12c or SQL Server 2018
Alert on schema user login | No, No, No | Email alert whenever someone logs into the DB using the schema user.

Misc
Trap to detect individual node issue, not the cluster | No, No, Yes | Every 10 mins, cron to hit each and every node individually and run a SELECT query on an important table. If the SELECT query takes more than 1 sec, raise alert. If it fails, raise trap.
Sequence reaching max limit trap | Yes, Yes, Yes | Every 10 mins, cron to check if any sequence is over 80% of its max value.
Trap for invalid object | Yes, No, Yes | Every 10 mins, cron to check if any object has become invalid.
All nodes have same SGA, PGA etc. DB parameters | Yes, Yes | Compare all DB parameters across all nodes.
Backup runs during out of hours | Yes, No |
Trap when more than X sessions are blocked for more than Y minutes | No, Yes | Cron/job to monitor for blocking sessions every 10 mins and send alert.
Trap when one or more TNS Listener/Database is down | Yes, Yes |
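
The per-node check described above (slow query raises an alert, failed query raises a trap) might look like the sketch below. The `probe` callable is a stand-in for "connect to this node and run the SELECT"; it is an assumption of this example, as are the function name and the returned status strings.

```python
import time


def check_nodes(nodes, probe, slow_after=1.0):
    """Probe each node individually and classify the outcome.

    `probe(node)` should run the test query against that single node and
    raise an exception on failure. Returns {node: "ok" | "alert" | "trap"}:
    "alert" when the probe took longer than `slow_after` seconds,
    "trap" when it raised.
    """
    results = {}
    for node in nodes:
        start = time.perf_counter()
        try:
            probe(node)
        except Exception:
            results[node] = "trap"
            continue
        elapsed = time.perf_counter() - start
        results[node] = "alert" if elapsed > slow_after else "ok"
    return results
```

Probing nodes one at a time, instead of going through the cluster address, is what lets the check catch a single sick node that the cluster would otherwise mask.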

Crons / Windows Task
Cron failure monitoring | No, Yes, Yes | If a business critical cron script fails to run, raise trap.
Crons have exception logging | No, Yes, Yes | All business critical crons record errors in a log file.
Detects when cron does not run on time? | No, Yes, No | A cron to check if other business critical crons have run on time.
Trap on overrunning cron | No, Yes, No | If a cron is running longer than it should, raise trap. This requires the cron to record start and stop timestamps in a log file.
Prevent duplicate run | No, Yes, No | Cron checks if there's another instance of the same cron already running. If yes, exit.
Trap on 0 records processed | No, No | If cron is producing a file, ensure the file has valid content. If not, raise trap.
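
One common way to implement the "prevent duplicate run" check is a lock file created atomically at start-up. The sketch below is one possible approach, not necessarily what any of the listed systems do; the `single_instance` name and the lock path passed to it are hypothetical.

```python
import os
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path):
    """Yield True if this process obtained the lock, False if another run holds it.

    os.O_EXCL makes the create fail when the file already exists, so two
    runs starting at the same moment cannot both obtain the lock.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Another instance is (or appears to be) running: do not touch its lock.
        yield False
        return
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        os.close(fd)
        yield True
    finally:
        os.unlink(lock_path)
```

A cron script would wrap its work in `with single_instance(path) as acquired:` and exit immediately when `acquired` is false. One caveat of this approach: if the process is killed before the `finally` clause runs, the stale lock file has to be cleaned up by some other means.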

All servers
Trap for high CPU, memory using sar | No, Yes | Run cron to check if avg CPU, RAM is over 80%.
Trap for high disk busy using iostat | No, Yes | Run iostat to check if any disk I/O use is over 80% (this is not disk space).
Trap for high swap | No, Yes | Run cron to check if avg swap usage is over 80%.
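
As a rough illustration of the swap check, the sketch below computes swap usage from the text of Linux's `/proc/meminfo`. Note that this is a swapped-in technique: it reads a single snapshot, whereas the checklist refers to averages (e.g. from sar), so a real cron would need to sample repeatedly and average. The function name `swap_used_pct` is hypothetical.

```python
def swap_used_pct(meminfo_text):
    """Return swap usage as a percentage, given the contents of /proc/meminfo.

    Returns 0.0 when the host has no swap configured.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # first token is the number
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return 100.0 * (total - fields["SwapFree"]) / total
```

On a Linux host the input could be obtained with `open("/proc/meminfo").read()`, and the trap raised when the result exceeds 80.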