A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | ||||||||||||||||||||||
2 | Ops Checklist | |||||||||||||||||||||
3 | Check | Customer API | Partner API | Admin | Detailed description | |||||||||||||||||
4 | Last Reviewed | Jul-15 | Jul-15 | Jul-15 | ||||||||||||||||||
5 | Certificates | |||||||||||||||||||||
6 | Monitor third-party-issued certificates and notify 30 days before expiration | Yes | No | Yes | e.g. Verisign informs us 30 days before expiration | |||||||||||||||||
7 | Renewal of external certificates takes less than an hour | No | No | No | e.g. Verisign certificate renewal is fully automated | |||||||||||||||||
8 | Monitor all self-signed certificates and notify 30 days before expiration | | | | e.g. A job or monitoring tool checks all self-signed certificates on each server | |||||||||||||||||
9 | Self-signed certificates are recorded in an inventory and expiration dates are checked regularly | | | | e.g. All certificate names, domains, and expiration dates are maintained in the support wiki | |||||||||||||||||
10 | Renewed certificates can be deployed within 10 mins across all servers, fully automatically | | | | e.g. AWS certificate validation can take up to 72 hours. That's too slow for an emergency | |||||||||||||||||
11 | ||||||||||||||||||||||
12 | ||||||||||||||||||||||
13 | Traps/Alerts | |||||||||||||||||||||
14 | Trap for too many errors in webserver log | N/A | No | Yes | e.g. If more than 10 HTTP 500 responses are recorded in the access log per minute, raise trap. | |||||||||||||||||
15 | Trap on too many slow responses in webserver log | N/A | | | e.g. If more than 10 HTTP requests take more than 10 seconds each within a minute, raise trap. | |||||||||||||||||
16 | Trap for ERROR, WARNING, CRITICAL, ALERT, EMERGENCY in WebLogic log | N/A | Yes | Yes | Trap for each of these keywords. If there are already too many ERROR and WARNING entries, suppress these two for a while, until we fix them. But the rest of the keywords must be in the hunter trap. In IIS, there's the HTTPERR log. \windows\system32\inetsrv\LogFiles | |||||||||||||||||
17 | Trap for hogging threads | N/A | Yes | Yes | Trap on HoggingThreads >= 5. Windows: Check the perf counter for ASP.NET Requests in Application Queue | |||||||||||||||||
18 | Trap for Request Queue length | N/A | Yes | No | Trap on QueueLength > 0. Windows: Check the perf counter for ASP.NET Requests in Application Queue | |||||||||||||||||
19 | Trap on health state of website/api process | N/A | Yes | Yes | Trap on Health_State != HEALTH_OK. Windows: Check App Pool enabled, Website started status, W3SVC service running status | |||||||||||||||||
20 | Trap on thread issues | N/A | Yes | No | Trap on Stuck threads, IdleThreads = 0, Throughput >= 30, ActiveConnectionsHighCount > 30 | |||||||||||||||||
21 | Trap on files not housekept | N/A | No | Yes | If log files are older than 3 months, alert | |||||||||||||||||
22 | No stdout file continuously growing and not getting rotated | N/A | No | Yes | e.g. WebLogic stdout is written to a file that cannot be rotated unless WebLogic is restarted | |||||||||||||||||
23 | Trap when webserver is down | N/A | Yes | Yes | Monitor web server process availability | |||||||||||||||||
24 | Trap when important pages/services are not responding in time | N/A | No | Yes | Hit important URLs and expect a response within 5 sec. If there is no response or an error, raise trap | |||||||||||||||||
25 | ||||||||||||||||||||||
26 | ||||||||||||||||||||||
27 | Database | |||||||||||||||||||||
28 | Resilience | |||||||||||||||||||||
29 | RAC/Mirroring | Yes | Yes | Yes | Oracle RAC or SQL Server Mirroring for all critical databases | |||||||||||||||||
30 | All nodes evenly loaded | No | ?? | Yes | Check Oracle AWR report to see if all nodes have more or less the same Avg sessions, CPU, IO load | |||||||||||||||||
31 | DR failover tested | No | Yes | No | Was the DR database ever brought online and DR app talked to it for a while? | |||||||||||||||||
32 | Have up-to-date checklist for various failover scenarios | No | Yes | No | You have a detailed checklist, which says who needs to do what when we go through a failover scenario. e.g. When a single DB node is down, who does what? When all DB nodes are down, who does what? | |||||||||||||||||
33 | Use TNS Service name not SID and SCAN IP not Node IP everywhere | N/A | No | Yes | No cron, Java properties file, or TNS config may use the Oracle SID or an individual node IP. They must use the Service Name and RAC cluster IP. | |||||||||||||||||
34 | ||||||||||||||||||||||
35 | Storage | |||||||||||||||||||||
36 | Proper storage tier | Yes | Yes | Yes | Ensure all data files, redo, archive are on Tier 2 storage. Ask Storage team. | |||||||||||||||||
37 | High Storage latency Trap (>30ms on any tablespace) | Yes | Yes | Yes | Schedule a cron to check latency recorded by Oracle and raise trap if not within limit. SQL Server Audit records storage latency | |||||||||||||||||
38 | Temp tablespace alert | No | Yes | Yes | Schedule cron to check temp tablespace free space; if less than 40%, raise trap. SQL Server: Check tempdb max size and current utilization. | |||||||||||||||||
39 | Tablespace over 80% trap | Yes | Yes | Yes | Schedule cron to check for any tablespace over 80% full and raise trap. SQL Server: Any MDF, LDF file having <20% free space. | |||||||||||||||||
40 | Autogrowth of tablespaces is 100MB or more | Yes | Yes | Yes | There should be no tablespace having auto growth smaller than 100MB. SQL Server: MDF, LDF autogrowth must be >100MB, not a % | |||||||||||||||||
41 | Monitor UNDO space has enough free space | Yes | Yes | Yes | Ensure UNDO space never fills up more than 80%. If it does, fix SQL or increase UNDO. SQL Server equivalent is the LDF | |||||||||||||||||
42 | ||||||||||||||||||||||
43 | Memory | |||||||||||||||||||||
44 | SGA and PGA set as per Advisor report in the last month | Yes | Yes | Yes | Check recent AWR report and see if the SGA, PGA advisory section says we need to increase SGA, PGA | |||||||||||||||||
45 | Linux HugePages configured properly for current SGA size | Yes | Yes | Yes | Check with Prod DBA if Oracle is using HugePages and the correct value of HugePages is configured based on SGA size. | |||||||||||||||||
46 | ASMM configured (not ASM) | Yes | No | No | Is sga_target > 0? | |||||||||||||||||
47 | Trap to detect Linux memory fragmentation | N/A | No | No | Is there any monitoring script that calculates fragmented pages in Linux? | |||||||||||||||||
48 | Trap for too low free shared pool memory | N/A | No | Yes | Cron to check free shared pool and if less than 500MB free, or too much fragmentation, raise trap. | |||||||||||||||||
49 | ||||||||||||||||||||||
50 | Jobs | |||||||||||||||||||||
51 | Job queries are optimized and running within expected time | No | No | Yes | Are all jobs running within an hour? | |||||||||||||||||
52 | No job running during online hours, peak traffic | No | Yes | Yes | No expensive job is scheduled between 8AM and 6PM. | |||||||||||||||||
53 | Jobs do not start if previous instance is already running | No | ?? | Yes | Job has a check to ensure previous instance isn't already running. | |||||||||||||||||
54 | Trap if job fails | No | ?? | No | If a job fails, email alert goes to ASG | |||||||||||||||||
55 | Overrunning job trap, exceeding max allowed time | No | ?? | No | If a job has exceeded 3 hours, email alert to ASG | |||||||||||||||||
56 | ||||||||||||||||||||||
57 | Query | |||||||||||||||||||||
58 | Application queries are optimized | No | No | Yes | e.g. No online query taking more than 5 seconds. SQL Server has Activity Monitor | |||||||||||||||||
59 | Limit max SELECT records | No | No | No | No one can select, say, 10,000 records via some API call and blow up the app/DB | |||||||||||||||||
60 | ||||||||||||||||||||||
61 | Housekeeping | |||||||||||||||||||||
62 | Regular online index rebuild for fragmented indexes | No | No | No | Scheduled online rebuild of important indexes to keep index performance good | |||||||||||||||||
63 | Regular Table fragmentation tuning | No | No | No | Defrag highly fragmented tables regularly. | |||||||||||||||||
64 | Large tables are partitioned | No | Yes | Yes | All multi-million-record tables are partitioned. | |||||||||||||||||
65 | Unused partitions are dropped | No | No | Yes | Partitions containing old records are dropped regularly via scheduled job on large tables. | |||||||||||||||||
66 | Weekly stale statistics report | Yes | No | Yes | Report to show stale statistics last analyzed 10 days ago. | |||||||||||||||||
67 | Weekly fragmentation report | No | No | No | Report to show heavily fragmented tables and indexes. | |||||||||||||||||
68 | Weekly job run time report | No | No | No | Report to show cron and job runtime and show trend over several weeks. | |||||||||||||||||
69 | Weekly/Monthly large table housekeeping | No | No | Yes | Scheduled purging of old data from large tables. | |||||||||||||||||
70 | Purge ceased customer data | No | No | No | EU GDPR requires that BT not hold data for a customer who left 2 years ago | |||||||||||||||||
71 | Specific stats gather job, not automatic | No | No | Yes | Do not use auto gather stats on entire database. Use specific schema gather at a time. | |||||||||||||||||
72 | Automatic Purge logging tables | No | No | Yes | Any audit, logging type table is purged regularly via automated job. | |||||||||||||||||
73 | Cleaned all backup/temp tables | No | No | No | No useless tables left in DB. | |||||||||||||||||
74 | Any tablespace over 80% allocated | Yes | Yes | No tablespace over 80% allocated on DB. | ||||||||||||||||||
75 | Fixed object stats gathered monthly | No | Yes | Yes | Due to Oracle bug, fixed object stats gather is scheduled to run on a weekend, once a month. | |||||||||||||||||
76 | Stats are locked on temp, backup tables | No | No | Yes | Useless tables have their stats locked, so that they aren't gathered automatically, wasting DB resource. | |||||||||||||||||
77 | Are the backups running successfully without any failure | Yes | Yes | No error reported in NetBackup, Commvault etc. Check with backup team. | ||||||||||||||||||
78 | ||||||||||||||||||||||
79 | Security | |||||||||||||||||||||
80 | DB Links have their own user account, not using schema | Yes | Yes | DB links aren't using unlimited schema user. | ||||||||||||||||||
81 | Each client has their own user account, not using schema | Yes | No | Applications have their own user accounts and do not use a privileged account, e.g. sa | ||||||||||||||||||
82 | DBAs have resource limit on their user account | No | No | No | DBA cannot run query that can blow up database. | |||||||||||||||||
83 | DBAs use their own user ID and do not share user accounts | Yes | No | Yes | All Employee ID based login, not using sa/sys or any privileged account to log in | |||||||||||||||||
84 | DBAs use own ID for regular work, privileged user only for deployment | Yes | No | No | Non-privileged account for day-to-day work. | |||||||||||||||||
85 | Latest database version and patch | Yes | No | No | Oracle 12c or SQL Server 2018 | |||||||||||||||||
86 | Alert on privileged user login | No | No | No | Email alert whenever someone logs into DB using schema user. | |||||||||||||||||
87 | ||||||||||||||||||||||
88 | Misc | |||||||||||||||||||||
89 | Trap to detect individual node issue, not the cluster | No | No | Yes | Every 10 mins, cron hits each node individually and runs a SELECT query on an important table. If the SELECT query takes more than 1 sec, raise alert. If it fails, raise trap. | |||||||||||||||||
90 | Sequence reaching max limit trap | Yes | Yes | Yes | Every 10 mins Cron to check if sequence is over 80% of max value. | |||||||||||||||||
91 | Trap for invalid object | Yes | No | Yes | Every 10 mins Cron to check if any object has become invalid. | |||||||||||||||||
92 | All Nodes have same SGA, PGA etc DB parameters | Yes | Yes | Compare all DB parameters across all nodes. | ||||||||||||||||||
93 | Backup runs out of hours | Yes | No | |||||||||||||||||||
94 | Trap when more than X sessions are blocked for more than Y minutes | No | Yes | Cron/Job to monitor for blocking sessions every 10 mins and send alert | ||||||||||||||||||
95 | Trap when one or more TNS Listener/Database is down | Yes | Yes | |||||||||||||||||||
96 | ||||||||||||||||||||||
97 | Crons / Windows Task | |||||||||||||||||||||
98 | Cron failure monitoring | No | Yes | Yes | If a business critical cron script fails to run, raise trap. | |||||||||||||||||
99 | Crons have exception logging | No | Yes | Yes | All business critical crons record errors in log file. | |||||||||||||||||
100 | Detects when cron does not run on time? | No | Yes | No | A cron to check if other business critical crons have run on time. |
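
The webserver log trap in row 14 (more than 10 HTTP 500 responses per minute) could be sketched roughly as below. This is only an illustration under stated assumptions: the access log is assumed to be in Common Log Format, and `minutes_over_threshold` and `LOG_PATTERN` are hypothetical names, not part of any existing monitoring tool. Raising the actual trap is left to whatever alerting system is in use.

```python
import re
from collections import Counter

# Assumption: Common Log Format lines such as
# 10.0.0.1 - - [15/Jul/2019:10:01:02 +0000] "GET /api HTTP/1.1" 500 12
# The first 17 characters of the timestamp identify the minute.
LOG_PATTERN = re.compile(
    r'\[(?P<minute>[^\]]{17})[^\]]*\] "[^"]*" (?P<status>\d{3}) '
)

def minutes_over_threshold(lines, threshold=10):
    """Return {minute: count} for minutes with more than `threshold` HTTP 500s."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match and match.group("status") == "500":
            counts[match.group("minute")] += 1
    # Keep only the minutes that breach the limit; these would raise a trap.
    return {minute: n for minute, n in counts.items() if n > threshold}
```

A cron or scheduled task could feed the most recent log lines to this function and raise a trap for every minute it returns. The same counting approach would apply to the slow-response check in row 15 if the log format also records response time.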