[redis]github-322 Report

https://github.com/antirez/redis/issues/322

1. Symptom

The Redis server crashes because of a jemalloc library bug when slowlog is enabled.

1.1 Severity

Severe, but the crash affects only one server. As long as a high-availability (HA) solution is in place, it should not be a big problem.

1.2 Was there exception thrown?

Yes. The server crashes.

1.2.1 Were there multiple exceptions?

No

1.3 Scope of the failure

Single node.

2. How to reproduce this failure

2.0 Version

2.4.4

2.1 Configuration

Standard (but run on a 32-bit OS)

2.2 Reproduction procedure

1. Enable slowlog.

2. Create a large database whose memory usage approaches 4GB (the size of the 32-bit address space).
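
The two steps above could be scripted roughly as follows. This is only a sketch under stated assumptions: it assumes a Redis server listening on 127.0.0.1:6379, it treats setting `slowlog-log-slower-than` to 0 as the way to enable slowlog (the report does not give the exact setting used), and the helper names `encode_command`, `keys_needed`, and `fill` are hypothetical.

```python
import socket

ADDRESS_SPACE_32BIT = 2 ** 32  # 4GB, the address space of a 32-bit process


def encode_command(*args):
    """Encode one command in the Redis multi-bulk request format."""
    out = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg if isinstance(arg, bytes) else str(arg).encode()
        out.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(out)


def keys_needed(value_size, target_bytes=ADDRESS_SPACE_32BIT):
    """Upper bound on the number of values needed to reach target_bytes.

    Per-key overhead is ignored, so the server gets there sooner.
    """
    return -(-target_bytes // value_size)  # ceiling division


def fill(host="127.0.0.1", port=6379, value_size=1024 * 1024):
    """Enable slowlog, then write values until the server stops answering.

    Returns the index of the first failed write, or None if all succeeded.
    """
    payload = b"x" * value_size
    with socket.create_connection((host, port)) as conn:
        # A threshold of 0 microseconds makes slowlog record every command.
        conn.sendall(encode_command("CONFIG", "SET", "slowlog-log-slower-than", 0))
        conn.recv(4096)
        for i in range(keys_needed(value_size)):
            try:
                conn.sendall(encode_command("SET", "key:%d" % i, payload))
                if not conn.recv(4096):  # empty read: connection closed
                    return i
            except OSError:
                return i
    return None
```

Calling `fill()` against a 32-bit build of the affected version is expected to take a while, and because the corruption is random the point at which the server dies may differ between runs.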

2.2.1 Timing order

Irrelevant. jemalloc randomly corrupts memory once usage approaches the size of the address space.

2.2.2 Events order externally controllable? (Deterministic?)

Yes.

2.3 Can the logs tell how to reproduce the failure?

Yes, but the logs are noisy.

2.4 How many machines needed?

1

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

Crash

3.2 Backward inference

This is very hard to diagnose. The developer ended up asking for all the user logs.

4. Root cause

jemalloc (the FreeBSD malloc library) randomly corrupts memory once usage gets close to the size of the address space.

4.1 Category:

Incorrect error handling (not anticipated)

5. Fix

The developers changed the code to warn about the potential problem of large memory usage.
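
A check of this kind might look like the following. This is an illustrative Python sketch, not the actual Redis patch (Redis itself is written in C): the function names `pointer_bits` and `memory_limit_warning` and the message text are assumptions made for the example, and `maxmemory == 0` is taken to mean that no limit is configured.

```python
import struct


def pointer_bits():
    """Width of a native pointer in bits for the running interpreter."""
    return struct.calcsize("P") * 8


def memory_limit_warning(arch_bits, maxmemory):
    """Return a warning for a 32-bit build with no memory limit, else None.

    maxmemory == 0 is treated as "no limit configured".
    """
    if arch_bits == 32 and maxmemory == 0:
        return ("Warning: 32-bit instance detected but no memory limit set; "
                "memory usage may approach the 4GB address space.")
    return None


if __name__ == "__main__":
    message = memory_limit_warning(pointer_bits(), maxmemory=0)
    if message:
        print(message)
```

The check only warns and does not prevent the crash; an operator would still need to cap memory usage below the address space on 32-bit hosts.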