[hadoop]MAPREDUCE-5466 Report

1. Symptom

Wrong computation

After the RM restarts, the History Server (HS) does not return the results of restarted jobs.

1.1 Severity

Blocker

1.2 Was there exception thrown?

1.2.1 Were there multiple exceptions?

1.3 Scope of the failure

Users who restart their jobs, or who have jobs running when the RM restarts.

2. How to reproduce this failure

2.0 Version

None?

target-version: 2.1.1-beta

2.1 Configuration

1 HS

2.2 Reproduction procedure

1. Restart a job. (restart)

2. Try to get job status from HS when the job finishes. (file read)

2.2.1 Timing order

Yes

2.2.2 Events order externally controllable?

Yes

2.3 Can the logs tell how to reproduce the failure?

Yes

2.4 How many machines needed?

1 node (1 HS)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

Wrong Computation

The HS does not report the correct status of the job.

3.2 Backward inference

When the AM starts, the job history file inside the staging directory is flushed to the done_intermediate dir. So when the History Server is queried after a job restart, it gives back the old AM’s info.

4. Root cause

When the AM starts, it copies all logs from the staging directory to the done_intermediate dir.

After the job completes, the JobHistoryEventHandler scans the logs and outputs the old AM’s log.
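
The failure mode can be illustrated with a toy model. This is a sketch only, not Hadoop code: the class StaleHistoryDemo, the map standing in for the done_intermediate dir, the job states used, and the assumption that the first history published for a job is the one later returned are all hypothetical simplifications.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model (hypothetical, not Hadoop code) of how an early AM attempt's
// history can shadow the result of the restarted attempt.
final class StaleHistoryDemo {
    // Stands in for the done_intermediate dir: job id -> history summary.
    private final Map<String, String> doneIntermediate = new HashMap<>();

    // Buggy behaviour: every AM attempt publishes its history.
    // Assumption: the first history published for a job is kept.
    public void publish(String jobId, int attempt, String state) {
        doneIntermediate.putIfAbsent(jobId, "attempt=" + attempt + " state=" + state);
    }

    // What a status query against the HS would see in this model.
    public String query(String jobId) {
        return doneIntermediate.get(jobId);
    }

    public static void main(String[] args) {
        StaleHistoryDemo hs = new StaleHistoryDemo();
        hs.publish("job_1", 1, "FAILED");     // old AM attempt, cut short by the restart
        hs.publish("job_1", 2, "SUCCEEDED");  // restarted attempt completes the job
        System.out.println(hs.query("job_1")); // prints "attempt=1 state=FAILED"
    }
}
```

In this model the query returns the old attempt's record even though the restarted attempt succeeded, which matches the symptom in section 3.1.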

4.1 Category:

Incorrect exception handling

5. Fix

5.1 How?

Add a JOB_AM_REBOOT event to the event pool between the RM and the AM.

Write out the history log only for the last AM attempt; skip it for every earlier attempt.
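
The decision rule described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the actual patch: the class HistoryFlushPolicy, the enum StopReason, and the method shouldPublishHistory are hypothetical names, and only the JOB_AM_REBOOT idea comes from the fix description.

```java
// Hypothetical sketch of the fix's decision rule; names are not from Hadoop.
final class HistoryFlushPolicy {
    // Why the current AM attempt is stopping (illustrative values only).
    public enum StopReason { JOB_FINISHED, AM_REBOOT }

    // Returns true if this attempt should write its history log out.
    // An attempt stopped by a reboot publishes only when it is the last
    // attempt, so an earlier attempt cannot shadow the final result.
    public static boolean shouldPublishHistory(StopReason reason, boolean isLastAttempt) {
        if (reason == StopReason.AM_REBOOT) {
            return isLastAttempt;
        }
        return true; // a job that actually finished always publishes
    }

    public static void main(String[] args) {
        System.out.println(shouldPublishHistory(StopReason.AM_REBOOT, false));    // false
        System.out.println(shouldPublishHistory(StopReason.AM_REBOOT, true));     // true
        System.out.println(shouldPublishHistory(StopReason.JOB_FINISHED, false)); // true
    }
}
```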