[hadoop]MAPREDUCE-4425 Report

1. Symptom

hang

Speculation + Fetch failures can lead to a hung job

1.1 Severity

critical

1.2 Was there exception thrown?

No

1.2.1 Were there multiple exceptions?

No

1.3 Scope of the failure

Single Job

2. How to reproduce this failure

2.0 Version

0.23.1

2.1 Configuration

2 MR Tasks (1 attempt and 1 speculative attempt)

2.2 Reproduction procedure

1. attempt 1 starts (feature start)

2. speculative attempt starts (feature start)

2.2.1 Timing order

Attempt 1 completes and the task moves to the SUCCEEDED state. The speculative attempt is then KILLED, but the T_ATTEMPT_KILLED event is ignored.

Attempt 1 then fails with TOO_MANY_FETCH_FAILURES, and the job hangs.

2.2.2 Events order externally controllable?

No (multinode)

2.3 Can the logs tell how to reproduce the failure?

Yes

2.4 How many machines needed?

2 (2 tasks)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The job hangs after an attempt fails with TOO_MANY_FETCH_FAILURES.

3.2 Backward inference

Because the killed speculative attempt is never removed from the in-progress list, the task still believes it has an unfinished attempt when the failure arrives, and so it does not start a new attempt.
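The inference above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: the class TaskBookkeepingSketch, its methods, and the attempt IDs are hypothetical and are not the real TaskImpl code. It only mimics the bookkeeping described in this report, where a kill event that arrives in the SUCCEEDED state is dropped.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of a task's attempt bookkeeping.
// It is NOT the real TaskImpl; all names here are illustrative assumptions.
class TaskBookkeepingSketch {
    enum State { RUNNING, SUCCEEDED }

    State state = State.RUNNING;
    final Set<String> inProgress = new LinkedHashSet<>();
    final List<String> launched = new ArrayList<>();
    private int nextId = 0;

    // Start a new attempt and track it as in progress.
    String launchAttempt() {
        String id = "attempt_" + nextId++;
        inProgress.add(id);
        launched.add(id);
        return id;
    }

    // An attempt finished successfully: the task moves to SUCCEEDED.
    void attemptSucceeded(String id) {
        inProgress.remove(id);
        state = State.SUCCEEDED;
    }

    // Models the reported bug: in the SUCCEEDED state the kill event is ignored,
    // so the killed attempt is never removed from the in-progress set.
    void attemptKilled(String id) {
        if (state == State.SUCCEEDED) {
            return;
        }
        inProgress.remove(id);
    }

    // The previously successful attempt is failed (e.g. too many fetch failures).
    // Returns true only if a replacement attempt was launched.
    boolean succeededAttemptFailed(String id) {
        state = State.RUNNING;
        if (inProgress.isEmpty()) {
            launchAttempt();
            return true;
        }
        return false; // the task waits on a stale attempt that will never report
    }

    public static void main(String[] args) {
        TaskBookkeepingSketch task = new TaskBookkeepingSketch();
        String attempt1 = task.launchAttempt();
        String speculative = task.launchAttempt();

        task.attemptSucceeded(attempt1);   // task -> SUCCEEDED
        task.attemptKilled(speculative);   // ignored in SUCCEEDED
        boolean relaunched = task.succeededAttemptFailed(attempt1);

        System.out.println("stale in-progress attempts: " + task.inProgress);
        System.out.println("replacement launched: " + relaunched);
    }
}
```

In this model the speculative attempt stays in the in-progress set, so no replacement is launched and the task never makes progress, which matches the hang described in 3.1.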

4. Root cause

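Incorrect bookkeeping in TaskImpl: once one attempt succeeds, events for the speculative attempt are ignored, so the killed speculative attempt stays in the in-progress list. When the successful attempt later fails with TOO_MANY_FETCH_FAILURES, the task sees that stale entry and does not launch a new attempt.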

4.1 Category

Incorrect handling

5. Fix

5.1 How?

Handle the T_ATTEMPT_KILLED and T_ATTEMPT_SUCCEEDED events in the SUCCEEDED state. The handler removes the attempt from the in-progress list and moves it to the finishedAttempts list.
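The effect of the fix can be sketched with a simplified model. This is an illustration under stated assumptions, not the actual patch: the class FixedTaskSketch, its methods, and the attempt IDs are hypothetical, and the real change lives in the TaskImpl state machine. The sketch only shows the idea from 5.1, which is that kill and success events arriving in the SUCCEEDED state still move the attempt from the in-progress set to the finished set.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical, simplified model of the corrected bookkeeping.
// It is NOT the real TaskImpl; all names here are illustrative assumptions.
class FixedTaskSketch {
    enum State { RUNNING, SUCCEEDED }

    State state = State.RUNNING;
    final Set<String> inProgress = new LinkedHashSet<>();
    final Set<String> finishedAttempts = new LinkedHashSet<>();
    private int nextId = 0;

    // Start a new attempt and track it as in progress.
    String launchAttempt() {
        String id = "attempt_" + nextId++;
        inProgress.add(id);
        return id;
    }

    // Move an attempt from the in-progress set to the finished set.
    private void markFinished(String id) {
        if (inProgress.remove(id)) {
            finishedAttempts.add(id);
        }
    }

    // Handled in every state, including a late success while already SUCCEEDED.
    void attemptSucceeded(String id) {
        markFinished(id);
        state = State.SUCCEEDED;
    }

    // Handled in every state, including SUCCEEDED (the case that was ignored before).
    void attemptKilled(String id) {
        markFinished(id);
    }

    // The previously successful attempt is failed (e.g. too many fetch failures).
    // Returns true only if a replacement attempt was launched.
    boolean succeededAttemptFailed(String id) {
        state = State.RUNNING;
        if (inProgress.isEmpty()) {
            launchAttempt();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        FixedTaskSketch task = new FixedTaskSketch();
        String attempt1 = task.launchAttempt();
        String speculative = task.launchAttempt();

        task.attemptSucceeded(attempt1);   // task -> SUCCEEDED
        task.attemptKilled(speculative);   // now recorded as finished
        boolean relaunched = task.succeededAttemptFailed(attempt1);

        System.out.println("finished attempts: " + task.finishedAttempts);
        System.out.println("in-progress attempts: " + task.inProgress);
        System.out.println("replacement launched: " + relaunched);
    }
}
```

With the kill event handled, the in-progress set is empty when the successful attempt fails, so the model launches a replacement attempt instead of waiting indefinitely.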