[hadoop]MAPREDUCE-4833 Report

1. Symptom

Task can get stuck in FAIL_CONTAINER_CLEANUP

1.1 Severity

critical

1.2 Was there exception thrown?

No.

1.2.1 Were there multiple exceptions?

No.

1.3 Scope of the failure

single request

2. How to reproduce this failure

2.0 Version

0.23.5

2.1 Configuration

1 AM, 1RM and 1 NM

2.2 Reproduction procedure

2.2.1 Timing order

must be in that order

2.2.2 Events order externally controllable?

no (multinode)

2.3 Can the logs tell how to reproduce the failure?

2.4 How many machines needed?

3 (1 NM, 1AM and 1 RM)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The task get stuck when there’s a NM goes down and AM tries to launch a container on it.

3.2 Backward inference

The TA_FAILMSG arrives before the TA_CONTAINER_LAUNCH_FAILED message. Then the task attempt tries to kill the container. But the ContainerLauncherImpl will not send back a TA_CONTAINER_CLEANED event, causing the attempt to be stuck.

4. Root cause

Not sending the TA_CONTAINER_CLEANED event when killing the container.

4.1 Category:

Wrong exception handling

5. Fix

5.1 How?

Always send a TA_CONTAINER_CLEANED event in all cases, even when the container is failed and going to be killed.