A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Description | Ubuntu | CentOS/RHEL | ||||||||||||||||||||||
2 | Avoid dangerous configuration | Use: fsync=on, full_page_writes=on | |||||||||||||||||||||||
3 | Enable data checksums | As of 2020, must be enabled at initdb time. There are methods to enable for existing cluster (since Postgres 12, it's "official"), but it requires cluster to be shut down. For existing cluster without downtime, two options: - use logical replication, initdb with checksums enabled, then perform switchover; - on a replica: shut Postgres down, apply pg_checksums (12) or https://github.com/credativ/pg_checksums (pre-12), start it, wait until the replication lag is gone, perform switchover (recipe from https://paquier.xyz/postgresql-2/postgres-12-pg-checksums/ and A.Kukushkin) Decide on what value of ignore_checksum_failure is most applicable in your case – if downtime costs a lot, you may prefer keeping it "off" (but keep in mind that this behavior may cause crashes, propagate or hide corruption, or other serious problems). If data corruption is absolutely unacceptable, turn it on. https://postgresqlco.nf/en/doc/param/ignore_checksum_failure/ | |||||||||||||||||||||||
4 | Monitor for errors | 1) Check Postgres log for: PANIC, "memory allocation invalid size", "invalid offset", ... 2) Check syslog for "Segmentation fault", kernel erors associated with Postgres activity 3) use https://github.com/munakoiso/logerrors in monitoring. Watch for ERRCODE_INTERNAL_ERROR, ERRCODE_DATA_CORRUPTED, ERRCODE_INDEX_CORRUPTED | |||||||||||||||||||||||
5 | Prepare to get details from coredump | 1) Be prepared to enable coredumps when needed (do not keep in enable not to waste disk space; ensure that when needed, there will be enough disk space) 2) Install debug symbols to see more details when needed 3) Learn how to use gdb to see full backtrace from a coredump (gdb). Practice it on a non-prod machine (kill -11) See also https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD | E.g., on old 16.04: 1) install&enable "apport" (https://askubuntu.com/questions/966407/where-do-i-find-the-core-dump-in-ubuntu-16-04lts), 2) install gdb: sudo apt-get install gdb 3) install debug symbols (e.g. for 11: sudo apt-get install postgresql-11-dbg) Then on a non-production(!) machine, test it: - kill -11 to a Postgres process - see a crash file in /var/crash. Unpack it (apport-unpack .. ..; watch the free disk space) - analyze with gdb (gdb -q /path/to/postgres /path/to/file; then in gdb type "bt") | 1) See https://www.thegeekdiary.com/how-to-enable-core-dump-for-applications-on-centos-rhel/ 2) install gdb: sudo yum install gdb 3) install debug symbols: e.g., for 9.6., PGDG repo: sudo yum install postgresql96-debuginfo.x86_64 (alternative: abrt) Then test obtaining a full backtrace on a non-production(!) machine (kill -11 to a postgres process; then gdb -q /path/to/postgres /path/to/coredump-ungzipped; and then type "bt") | |||||||||||||||||||||
6 | Check indexes using amcheck | 1) Periodically, use amcheck on *all* nodes (both primary and secondary nodes) 2) Use more often on PGDATA that was restored from a physical backup, right after restoration Important notes: - if issues found on one node, it doesn't mean that they will or will not be found on another - errors found by am check needs to be carefully investigated, - reindex could be not enough in some cases. pageinspect can be useful in such cases | |||||||||||||||||||||||
7 | Check heap using heapcheck | https://www.postgresql.org/message-id/flat/12ED3DA8-25F0-4B68-937D-D907CFBF08E7%40enterprisedb.com | |||||||||||||||||||||||
8 | Dump/restore periodically | This is a basic and time-consuming check that is helpful to ensure that there are no logical data inconsistencies such as uniquness constraint violation, FK violation, etc. Use periodically on a machine where backup verification is happening, after restore | |||||||||||||||||||||||
9 | Dump to /dev/null | This is useful when data checksums are *enabled*. After restoration, perform full dump to /dev/null -- this leads to reading all data pages, verifying checksums. Observe logs for errors. | |||||||||||||||||||||||
10 | Learn how to report bugs | tbd | |||||||||||||||||||||||
11 | Develop a DR plan for the cases of data corruption | tbd | |||||||||||||||||||||||
12 | Learn how to use pg_resetwal / pg_resetxlog | This should be a "last resort", when you don't have other options and need to reach get data (or at least some of it). As the first step (always!) copy PGDATA to non-usual place before you go this route. Learn params. | |||||||||||||||||||||||
13 | |||||||||||||||||||||||||
14 | |||||||||||||||||||||||||
15 | |||||||||||||||||||||||||
16 | |||||||||||||||||||||||||
17 | |||||||||||||||||||||||||
18 | Useful materials | ||||||||||||||||||||||||
19 | (RU) how to keep stats during major upgrades -- a talk about data corruption during major upgrades, mentions pg_dirty_hands | https://pgconf.ru/2018/110829; https://github.com/dsarafan/pg_dirty_hands | |||||||||||||||||||||||
20 | (RU) slidedeck on corruption from Andrey Borodin, Yandex.Cloud (July 2020) | https://docviewer.yandex.ru/view/1134989814/?page=24&*=UMLRHmpUD3HSzseFTaVfgJgmDht7InVybCI6InlhLWRpc2stcHVibGljOi8vSUFSb0hSV3VUTjltQWVvZSs0akQ2M0FmTHFEM0c4ajFRaEdtS1pERjRkYVpxRnVoNmk3Y2JvcVlPckprMnlHWXEvSjZicG1SeU9Kb25UM1ZvWG5EYWc9PSIsInRpdGxlIjoiQ29ycnVwdGlvbjIwMjAucHB0eCIsIm5vaWZyYW1lIjpmYWxzZSwidWlkIjoiMTEzNDk4OTgxNCIsInRzIjoxNTk2MTcxMjY5MDE1LCJ5dSI6Ijc2MTg3NzE3ODE1NjE3Mjk3NzAifQ%3D%3D | |||||||||||||||||||||||
21 | Postgres Wiki – Corruption (a lot of useful materials) | https://wiki.postgresql.org/wiki/Corruption | |||||||||||||||||||||||
22 | Troubleshooting Database Corruption, Robert Haas (2014) | http://rhaas.blogspot.com/2014/05/troubleshooting-database-corruption.html | |||||||||||||||||||||||
23 | Avoiding, Detecting, and Recovering From Corruption, Robert Haas (slide deck, 19 pages) | https://pgconf.in/files/presentations/2020/Robert_Haas_2020_Avoiding_Detecting_and_Recovery_From_Corruption.pdf | |||||||||||||||||||||||
24 | PGConf India 2020 - Avoiding, Detecting, and Recovering From Data Corruption - Robert Haas - EDB (video, 46 min) | https://www.youtube.com/watch?v=ITxMmqEfzT8 | |||||||||||||||||||||||
25 | How to enable core dumps on CentOS/RHEL | https://www.thegeekdiary.com/how-to-enable-core-dump-for-applications-on-centos-rhel/ | |||||||||||||||||||||||
26 | amcheck (Peter Geoghegan) | https://github.com/petergeoghegan/amcheck/ | |||||||||||||||||||||||
27 | amcheck (part of core distribution since version 10) | https://www.postgresql.org/docs/10/amcheck.html | |||||||||||||||||||||||
28 | pg_check (Tomas Vondra) | https://github.com/tvondra/pg_check | |||||||||||||||||||||||
29 | pg_catcheck (EnterpriseDB) | https://github.com/EnterpriseDB/pg_catcheck | |||||||||||||||||||||||
30 | pg_hexedit (Peter Geoghegan) | https://github.com/petergeoghegan/pg_hexedit | |||||||||||||||||||||||
31 | pageinspect | https://www.postgresql.org/docs/current/pageinspect.html | |||||||||||||||||||||||
32 | pg_visibility (see pg_check_frozen, pg_check_visible – thanks Adrien Nayrat for the tip) | https://www.postgresql.org/docs/current/pgvisibility.html | |||||||||||||||||||||||
33 | -hackers discusion “new heapcheck contrib module” | https://www.postgresql.org/message-id/flat/12ED3DA8-25F0-4B68-937D-D907CFBF08E7%40enterprisedb.com | |||||||||||||||||||||||
34 | tool for restoration corrupted WALs (patch) | https://github.com/x4m/postgres_g/commit/af4b17fb214990dc14772610e75394dcb1bcf838 | |||||||||||||||||||||||
35 | patch for amcheck to detect "lost updates" on replicas | https://commitfest.postgresql.org/24/2254/ | |||||||||||||||||||||||
36 | Postgres 12 highlight - pg_checksums (Michael Paquier; describes the recipe to enable checksums for existing large cluster with minimal downtime; the same is possible in pre-12 with pg_checksums from Credativ) | https://paquier.xyz/postgresql-2/postgres-12-pg-checksums/ | |||||||||||||||||||||||
37 | pg_toolkit (by Bertrand Drouvot) – a set of scripts for Postgres to inspect data, compare data in memory and on disk, btree stats, etc | https://github.com/bdrouvot/pg_toolkit | |||||||||||||||||||||||
38 | PostgreSQL Invalid Page and Checksum Verification Failed (2019; by Jeremy Schneider) | https://ardentperf.com/2019/11/08/postgresql-invalid-page-and-checksum-verification-failed/ | |||||||||||||||||||||||
39 | Getting a stack trace of a running PostgreSQL backend on Linux/BSD | https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD | |||||||||||||||||||||||
40 | |||||||||||||||||||||||||
41 | |||||||||||||||||||||||||
42 | |||||||||||||||||||||||||
43 | |||||||||||||||||||||||||
44 | |||||||||||||||||||||||||
45 | TODO/unprocessed: | ||||||||||||||||||||||||
46 | corruption due to XID wraparound | https://www.cybertec-postgresql.com/en/transaction-id-wraparound-a-walk-on-the-wild-side/ | |||||||||||||||||||||||
47 | |||||||||||||||||||||||||
48 | Data Corruption – Sebastian Webber (talk; video) | https://www.youtube.com/watch?v=OYSFWzdSQrs | |||||||||||||||||||||||
49 | Tomáš Vondra: PostgreSQL vs. data corruption (talk; video) | https://www.youtube.com/watch?v=SgPQZWCj48k | |||||||||||||||||||||||
50 | Fixing up a corrupted toast table (demo, video) | https://www.youtube.com/watch?v=4jcC-lYGM0k | |||||||||||||||||||||||
51 | Thomas Munro: Don’t let collation versions corrupt your PostgreSQL indexes (article) | https://www.citusdata.com/blog/2020/12/12/dont-let-collation-versions-corrupt-your-postgresql-indexes/ | https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/don-t-let-collation-versions-corrupt-your-postgresql-indexes/ba-p/1978394 | ||||||||||||||||||||||
52 | Laurenz Albe: How to corrupt your PostgreSQL database | https://www.cybertec-postgresql.com/en/how-to-corrupt-your-postgresql-database/ | |||||||||||||||||||||||
53 | Daniel Westermann: Dealing with corrupted system indexes in PostgreSQL | https://www.dbi-services.com/blog/dealing-with-corrupted-system-indexes-in-postgresql/ | |||||||||||||||||||||||
54 | A bug example that feels like corruption | https://www.postgresql.org/message-id/flat/fe9b3722df94f7bdb08768f50ee8fe59%40postgrespro.ru | |||||||||||||||||||||||
55 | BUG #15285: Query used index over field with ICU collation in some cases wrongly return 0 rows | https://www.postgresql.org/message-id/flat/153201618542.1404.3611626898935613264%40wrigleys.postgresql.org | |||||||||||||||||||||||
56 | How to save statistics during major update, and what can be the consequences | https://pgconf.ru/en/2018/110829 | |||||||||||||||||||||||
57 | commitfest: Add Amcheck option for checking unique constraints in btree indexes | https://commitfest.postgresql.org/41/3464/ | |||||||||||||||||||||||
58 | commitfest: Amcheck verification of GiST and GIN | https://commitfest.postgresql.org/41/3733/ | |||||||||||||||||||||||
59 | Andrey Borodin: corruption 2020 (slides) | https://disk.yandex.com/i/HFKSB62_q3ppoQ | |||||||||||||||||||||||
60 | |||||||||||||||||||||||||
61 | |||||||||||||||||||||||||
62 | |||||||||||||||||||||||||
63 | |||||||||||||||||||||||||
64 | |||||||||||||||||||||||||
65 | |||||||||||||||||||||||||
66 | |||||||||||||||||||||||||
67 | |||||||||||||||||||||||||
68 | |||||||||||||||||||||||||
69 | |||||||||||||||||||||||||
70 | |||||||||||||||||||||||||
71 | |||||||||||||||||||||||||
72 | |||||||||||||||||||||||||
73 | |||||||||||||||||||||||||
74 | |||||||||||||||||||||||||
75 | |||||||||||||||||||||||||
76 | |||||||||||||||||||||||||
77 | |||||||||||||||||||||||||
78 | |||||||||||||||||||||||||
79 | |||||||||||||||||||||||||
80 | |||||||||||||||||||||||||
81 | |||||||||||||||||||||||||
82 | |||||||||||||||||||||||||
83 | |||||||||||||||||||||||||
84 | |||||||||||||||||||||||||
85 | |||||||||||||||||||||||||
86 | |||||||||||||||||||||||||
87 | |||||||||||||||||||||||||
88 | |||||||||||||||||||||||||
89 | |||||||||||||||||||||||||
90 | |||||||||||||||||||||||||
91 | |||||||||||||||||||||||||
92 | |||||||||||||||||||||||||
93 | |||||||||||||||||||||||||
94 | |||||||||||||||||||||||||
95 | |||||||||||||||||||||||||
96 | |||||||||||||||||||||||||
97 | |||||||||||||||||||||||||
98 | |||||||||||||||||||||||||
99 | |||||||||||||||||||||||||
100 |