[hadoop]MAPREDUCE-4819 Report

1. Symptom

data loss

The AM can rerun a job after it has already reported the final job status to the client.

1.1 Severity

blocker

1.2 Was there exception thrown?

1.2.1 Were there multiple exceptions?

1.3 Scope of the failure

Single Job

2. How to reproduce this failure

2.0 Version

0.23.3, 2.0.1-alpha

2.1 Configuration

1 RM, 1 AM

2.2 Reproduction procedure

1. The user starts a job inside an AM (feature start)

2. The AM crashes after reporting the job status to the RM, without unregistering from the RM (disconnect)

2.2.1 Timing order

The AM must crash after reporting the job status to the RM, but before unregistering itself from the RM.
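The timing window above can be sketched as a small model. This is only an illustration: the step names and both functions are hypothetical, not Hadoop code, and the model assumes the RM starts a new AM attempt whenever the previous AM disappears without unregistering.

```python
from enum import IntEnum


class Step(IntEnum):
    """Hypothetical, simplified AM lifecycle steps, in order."""
    JOB_RUNNING = 0
    STATUS_REPORTED = 1  # final job status has been reported
    UNREGISTERED = 2     # AM has unregistered from the RM


def rm_relaunches_am(last_completed: Step) -> bool:
    # Assumption: an AM that vanishes before unregistering is treated as
    # failed, so the RM starts a new attempt.
    return last_completed < Step.UNREGISTERED


def hits_bug(last_completed: Step) -> bool:
    # The failure needs both conditions: the status was already reported,
    # and the RM still relaunches the AM.
    return last_completed >= Step.STATUS_REPORTED and rm_relaunches_am(last_completed)


for step in Step:
    print(step.name, hits_bug(step))
```

In this model only a crash with `STATUS_REPORTED` as the last completed step falls inside the window.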

2.2.2 Events order externally controllable?

No (multinode)

2.3 Can the logs tell how to reproduce the failure?

2.4 How many machines needed?

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The AM reruns the job when it recovers from a crash, even though the job had already finished before the crash.

3.2 Backward inference

Because the AM had not flushed the log from the intermediate directory to the done directory, the next AM, on startup, concludes that the previous attempt has not finished. It therefore reruns the job, which may cause data loss.
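The inference above can be illustrated with a minimal sketch of the recovery decision. Everything here is assumed for illustration: the file name `job_1.log`, the `JOB_FINISHED` record, and the helper functions are hypothetical, and the real AM recovery logic is more involved.

```python
import tempfile
from pathlib import Path


def previous_attempt_finished(done_dir: Path, job_id: str) -> bool:
    # Hypothetical check: the job counts as finished only if its log
    # has reached the done directory.
    return (done_dir / f"{job_id}.log").exists()


def recover(done_dir: Path, job_id: str) -> str:
    return "skip" if previous_attempt_finished(done_dir, job_id) else "rerun"


with tempfile.TemporaryDirectory() as root:
    intermediate = Path(root) / "intermediate"
    done = Path(root) / "done"
    intermediate.mkdir()
    done.mkdir()

    # The first attempt finished the job, but crashed before the log was
    # flushed from the intermediate directory to the done directory.
    (intermediate / "job_1.log").write_text("JOB_FINISHED\n")

    decision = recover(done, "job_1")
    print(decision)  # rerun
```

The new attempt looks only at the done directory, so the finished job in the intermediate directory is invisible to it.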

4. Root cause

The new AM assumes that the previous attempt did not finish, and reruns the job.

4.1 Category:

Incorrect Handling

5. Fix

5.1 How?

Before the AM completes the job, flush only the last action of the AM log from the intermediate dir to the done dir, so that the next AM instance takes the correct recovery action.
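A rough sketch of the idea behind the fix, under stated assumptions: `flush_last_action`, `recovery_action`, the log file layout, and the `JOB_FINISHED` record are all hypothetical names chosen for illustration and do not reflect the actual patch.

```python
import tempfile
from pathlib import Path

FINISHED = "JOB_FINISHED"  # hypothetical final log record


def flush_last_action(intermediate_dir: Path, done_dir: Path, job_id: str) -> None:
    # Copy only the last log record to the done directory before the
    # job is reported as complete.
    lines = (intermediate_dir / f"{job_id}.log").read_text().splitlines()
    (done_dir / f"{job_id}.log").write_text(lines[-1] + "\n")


def recovery_action(done_dir: Path, job_id: str) -> str:
    # A restarted AM reruns the job only if no final record was flushed.
    marker = done_dir / f"{job_id}.log"
    if marker.exists() and marker.read_text().strip() == FINISHED:
        return "report_only"
    return "rerun"


with tempfile.TemporaryDirectory() as root:
    intermediate = Path(root) / "intermediate"
    done = Path(root) / "done"
    intermediate.mkdir()
    done.mkdir()
    (intermediate / "job_1.log").write_text("TASK_DONE\n" + FINISHED + "\n")

    before_flush = recovery_action(done, "job_1")   # crash before the flush
    flush_last_action(intermediate, done, "job_1")
    after_flush = recovery_action(done, "job_1")    # crash after the flush
    print(before_flush, after_flush)
```

With the flush ordered before job completion, a crash in the window between reporting and unregistering leaves enough state for the next attempt to avoid the rerun.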