[redis]github-754 Report

1. Symptom

The Redis server crashes because of a bug in the jemalloc library.

1.1 Severity

Severe, but the crash affects only one server. As long as a high-availability (HA) solution is in place, it should not be a big problem.

1.2 Was there exception thrown?

Yes. The server crashes.

1.2.1 Were there multiple exceptions?

No

1.3 Scope of the failure

Single node.

2. How to reproduce this failure

2.0 Version

2.6.2

2.1 Configuration

Standard configuration, but running on a 32-bit OS.

2.2 Reproduction procedure

1. Create a database large enough to exceed 4 GB (the 32-bit address space)

2. Observe the crash
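
The procedure above can be sketched as a small load script. This is a minimal sketch, not the script used in the original report: the key prefix "filler:" and the 1 MiB value size are arbitrary choices, and `client` is assumed to be any object with a `set(key, value)` method, for example a `redis.Redis` instance from redis-py connected to a 32-bit server.

```python
ADDRESS_SPACE_BYTES = 2 ** 32  # 4 GiB, the limit of a 32-bit address space


def keys_needed(value_size, target_bytes=ADDRESS_SPACE_BYTES):
    """Return how many values of value_size bytes are needed to exceed target_bytes."""
    if value_size <= 0:
        raise ValueError("value_size must be positive")
    return target_bytes // value_size + 1


def fill(client, value_size=1024 * 1024, target_bytes=ADDRESS_SPACE_BYTES):
    """Write filler keys until the payload total exceeds target_bytes.

    Returns the number of keys written. On an affected 32-bit server the
    connection is expected to fail before the loop completes.
    """
    payload = b"x" * value_size
    count = keys_needed(value_size, target_bytes)
    for i in range(count):
        client.set(f"filler:{i}", payload)  # hypothetical key naming
    return count
```

With the defaults, `fill` writes 4097 values of 1 MiB each, which is just over the 4 GiB address space. Redis adds per-key overhead on top of the payload, so the server will likely reach the limit earlier.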

2.2.1 Timing order

Irrelevant. jemalloc randomly corrupts memory once the usage exceeds the address space.

2.2.2 Events order externally controllable? (Deterministic?)

Yes.

2.3 Can the logs tell how to reproduce the failure?

Yes, but the logs are noisy.

2.4 How many machines needed?

1

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The server crashes.

3.2 Backward inference

This is very hard to diagnose. The developer ended up asking users for all of their logs.

4. Root cause

jemalloc (the FreeBSD malloc library) randomly corrupts memory once the usage exceeds the address space.

4.1 Category:

Incorrect error handling (in the jemalloc library)

5. Fix

The issue is fixed through a configuration change.
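
The report does not say which setting was changed. One plausible reading, shown here purely as an assumption, is capping Redis memory below the 32-bit limit with the `maxmemory` directive. The sketch below uses redis-py-style calls (`info()` and `config_set()`); the 3 GiB limit and the helper name `cap_memory_on_32bit` are illustrative choices, not values taken from the report.

```python
DEFAULT_LIMIT_BYTES = 3 * 1024 ** 3  # assumed cap, kept below the 4 GiB address space


def cap_memory_on_32bit(client, limit_bytes=DEFAULT_LIMIT_BYTES):
    """Set maxmemory on a 32-bit server; leave other servers untouched.

    Returns True if a limit was applied, False otherwise.
    """
    if not 0 < limit_bytes < 2 ** 32:
        raise ValueError("limit must be positive and below 4 GiB")
    # INFO reports the build's pointer width as arch_bits (32 or 64).
    if client.info().get("arch_bits") != 32:
        return False
    client.config_set("maxmemory", limit_bytes)
    return True
```

With redis-py this might be called as `cap_memory_on_32bit(redis.Redis())`. A limit set this way lasts only until the server restarts; a `maxmemory` line in redis.conf would make it permanent.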