[hadoop]MAPREDUCE-4832 Report

1. Symptom

The RM may conclude that an AM has gone down while that AM is still up and running. The old AM can then go on to commit the job even though the RM considers it dead.

1.1 Severity

Critical

1.2 Was there an exception thrown?

No

1.2.1 Were there multiple exceptions?

No

1.3 Scope of the failure

Single AM

2. How to reproduce this failure

2.0 Version

2.0.2-alpha, 0.23.5

2.1 Configuration

1 AM

1 RM

2.2 Reproduction procedure

1. Due to network problems, the RM thinks that an AM has gone down although the AM is still alive (disconnect)

2. The previous AM, which does not need any more resources from the RM, tries to commit the job (feature start)

2.2.1 Timing order

No

2.2.2 Events order externally controllable?

No

2.3 Can the logs tell how to reproduce the failure?

Yes

2.4 How many machines needed?

2

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

Duplicate job commits: the same job is committed more than once.

3.2 Backward inference

The AM does not check its heartbeat status with the RM when committing a job, so the old AM keeps committing the job even though the RM thinks it is dead.

4. Root cause

The RM thinks the AM has gone down while the AM is actually still working, and nothing on the AM side stops it from committing in that state.

4.1 Category:

Semantic

5. Fix

5.1 How?

Add a commit window to the AppMaster. Any task committed outside the commit window is regarded as invalid.
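
The commit-window idea can be illustrated with a minimal Java sketch. This is not the code from the actual patch: the class name CommitWindowSketch, its methods, and the injected clock are all hypothetical, and the sketch assumes the window is measured from the last successful heartbeat between the AM and the RM.

```java
import java.util.function.LongSupplier;

// Hypothetical illustration of a commit window; not Hadoop's implementation.
class CommitWindowSketch {
    private final long windowMs;          // assumed window length in milliseconds
    private final LongSupplier clock;     // injected time source, eases testing
    private volatile long lastHeartbeatMs;

    CommitWindowSketch(long windowMs, LongSupplier clock) {
        this.windowMs = windowMs;
        this.clock = clock;
        // Treat construction time as the first successful contact with the RM.
        this.lastHeartbeatMs = clock.getAsLong();
    }

    // Called by the AM after each successful heartbeat to the RM.
    void onHeartbeatSuccess() {
        lastHeartbeatMs = clock.getAsLong();
    }

    // A commit is allowed only if the RM was heard from within the window.
    boolean canCommit() {
        return clock.getAsLong() - lastHeartbeatMs <= windowMs;
    }
}
```

Under these assumptions, an AM would call onHeartbeatSuccess() after every successful heartbeat and check canCommit() before committing. An AM that has lost contact with the RM for longer than the window would refuse to commit instead of racing with another AM.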