[hadoop]MAPREDUCE-4467 Report

1. Symptom

IndexCache failures due to missing synchronization

1.1 Severity


1.2 Was there exception thrown?


1.2.1 Were there multiple exceptions?


1.3 Scope of the failure

Single computation node

2. How to reproduce this failure

2.0 Version


2.1 Configuration

1 NM

2.2 Reproduction procedure

1. The MR client submits a job (feature start)

2.2.1 Timing order


2.2.2 Events order externally controllable?

No (multi-thread, lock contention)

2.3 Can the logs tell how to reproduce the failure?


2.4 How many machines needed?

1 (1 NM)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

TestMRJobs.testSleepJob fails with an IllegalMonitorStateException: the thread that calls wait() does not hold the lock (monitor) of the object it waits on.

3.2 Backward inference

The code that calls wait() is not wrapped in a “synchronized” block. When a thread reaches that wait() call without holding the object's lock, the JVM throws the exception.
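
The behavior described above can be reproduced in isolation. The following is a minimal sketch, not the IndexCache code itself: the class name WaitWithoutLock and the variable info are hypothetical stand-ins chosen for illustration. It only shows that Object.wait() throws IllegalMonitorStateException when the calling thread does not own the object's monitor.

```java
public class WaitWithoutLock {

    // Returns true if calling wait() without owning the monitor throws
    // IllegalMonitorStateException.
    static boolean waitWithoutMonitorThrows() throws InterruptedException {
        // Hypothetical stand-in for a shared cache entry.
        Object info = new Object();
        try {
            // Deliberately NOT inside synchronized (info).
            // The timeout only guards against blocking if no exception occurs.
            info.wait(10);
            return false;
        } catch (IllegalMonitorStateException e) {
            return true;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(waitWithoutMonitorThrows()); // prints "true"
    }
}
```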

4. Root cause

The thread calling the “wait” method may not hold the lock on the object, because the “synchronized” keyword is missing.

4.1 Category:


5. Fix

5.1 How?

Wrapping the code that calls wait() in a synchronized block.
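
The general shape of such a fix can be sketched as follows. This is an illustration under assumptions, not the actual patch: the names WaitWithLock, Entry, ready and waitUntilReady are hypothetical. The waiting thread acquires the entry's monitor before calling wait(), and the thread that completes the entry calls notifyAll() while holding the same monitor.

```java
public class WaitWithLock {

    // Hypothetical stand-in for a cache entry that another thread fills in.
    static class Entry {
        boolean ready = false; // guarded by the Entry's own monitor
    }

    // Waits until a second thread marks the entry as ready.
    static boolean waitUntilReady() throws InterruptedException {
        final Entry info = new Entry();

        // The "builder" thread finishes the entry and wakes up any waiters.
        Thread builder = new Thread(() -> {
            synchronized (info) {
                info.ready = true;
                info.notifyAll();
            }
        });
        builder.start();

        // wait() is now called while holding info's monitor, so no
        // IllegalMonitorStateException is thrown. The loop re-checks the
        // condition to tolerate spurious wakeups.
        synchronized (info) {
            while (!info.ready) {
                info.wait();
            }
        }

        builder.join();
        return info.ready;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(waitUntilReady()); // prints "true"
    }
}
```

Checking the condition in a loop inside the synchronized block also covers the case where the builder thread finishes before the waiter starts: the waiter sees ready already set and never calls wait().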

5.2 Exception behavior?

No more exceptions