[hadoop]MAPREDUCE-4292 Report

Dup. of 3927

1. Symptom

The job hangs forever when some maps are failing.

1.1 Severity

critical

1.2 Was there exception thrown?

No

1.2.1 Were there multiple exceptions?

No

1.3 Scope of the failure

Single Task

2. How to reproduce this failure

2.0 Version

2.0.0-alpha

2.1 Configuration

1 MR

2.2 Reproduction procedure

1. Submit a MR job (feature start)

2. Some of the mappers failed (job fail)

2.2.1 Timing order

Yes

2.2.2 Events order externally controllable?

Yes (We can control that some of the mappers always fail)

2.3 Can the logs tell how to reproduce the failure?

Yes

2.4 How many machines needed?

3 (2 for mapper tasks and 1 for reducer task)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The map-reduce job hangs when there’re some mappers fail.

3.2 Backward inference

MR will not correctly calculate the progress when there’re mappers fail in the job. So the user will never see the job completed in the progress report.

4. Root cause

When a map task fails, it will not be counted in the progress calculating.

4.1 Category:

Semantic

5. Fix

5.1 How?

Let the progress be calculated as 1f when the task ends with status FAILED or KILLED.