How it is diagnosed:

Source Analysis


bug report:



Postgres crash under load.

Root cause:
This is a concurrent bug. Invalid memory access.
While postgres is building a new relation entry (the RelationBuildDesc call in RelationClearRelation) for a locally-created relation, it received an sinval reset event caused by sinval queue overflow.  (That could only happen with a lot of concurrent catalog update activity, which is why there's a significant number of concurrent "job1" clients needed to provoke the problem.) The sinval reset will be serviced by RelationCacheInvalidate, which will blow away any relcache entries with refcount zero, including the one that the outer instance of RelationClearRelation is trying to rebuild. So when control returns the next thing that happens is we try to do the equalTupleDescs() comparison against a trashed pointer, as seen in the
stack trace.

Stack trace:

#0  equalTupleDescs (tupdesc1=0x7f7f7f7f, tupdesc2=0x966508c4) at
#1  0x0832451b in RelationClearRelation (relation=0x96654178, rebuild=<value
optimized out>) at relcache.c:1901
#2  0x08324b9f in RelationFlushRelation () at relcache.c:1991
#3  RelationCacheInvalidateEntry (relationId=322615) at relcache.c:2042
#4  0x0831dd89 in LocalExecuteInvalidationMessage (msg=0xa218b54) at
#5  0x0831d341 in ProcessInvalidationMessages (hdr=0xa2010c4, func=0x831dc50
<LocalExecuteInvalidationMessage>) at inval.c:397
#6  0x0831d3dc in CommandEndInvalidationMessages () at inval.c:1006
#7  0x080c9035 in AtCommit_LocalCache () at xact.c:1009
#8  CommandCounterIncrement () at xact.c:634
#9  0x0826dcc9 in finish_xact_command () at postgres.c:2363
#10 0x0826ed4d in exec_simple_query (
   query_string=0xa1e88d4 "insert into product_refer_urls (item_id,
destination_url) select product_info.oid,  product_info_static.affiliate_url
from real_products, product_parts, product_info left join
product_info_static on ("...) at postgres.c:1022
#11 0x0827042c in PostgresMain (argc=4, argv=0xa17c5d0, username=0xa17c590
"gb") at postgres.c:3614
#12 0x0823a4d3 in BackendRun () at postmaster.c:3449
#13 BackendStartup () at postmaster.c:3063
#14 ServerLoop () at postmaster.c:1387
#15 0x0823b503 in PostmasterMain (argc=4, argv=0xa179678) at
#16 0x081dc0a6 in main (argc=4, argv=0xa179678) at main.c:188

Per the developer, the stack-trace is enough for diagnosing the failure:

I have not been able to reproduce the crash using Rusty's script on my
own machine, but after contemplating his stack trace for awhile I have a theory about what is happening.


Is there any log message?:


Can we automatically anticipate?

Yes. Signal handler + stack trace.