[hadoop] MAPREDUCE-5468 Report

1. Symptom

Map-only jobs restart from the beginning after the RM restarts; the progress of the previous attempt should have been recovered.

https://issues.apache.org/jira/browse/mapreduce-5468

1.1 Severity

Blocker

1.2 Was there exception thrown?

Yes: the RM failed (restart), and the job then restarted.

1.2.1 Were there multiple exceptions?

Yes

1.3 Scope of the failure

Catastrophic: all map-only jobs

2. How to reproduce this failure

2.0 Version

0.23.8

2.1 Configuration

standard

2.2 Reproduction procedure

0. submit map job (feature start)

1. restart RM (restart)

2.2.1 Timing order

In this order

2.2.2 Events order externally controllable?

Yes

2.3 Can the logs tell how to reproduce the failure?

Yes

2.4 How many machines needed?

2: RM + AM

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

Notice that the job restarts from the beginning after the RM restart.

4. Root cause

The AM recovery code did not recover the state of previously running map-only jobs: the recovery check required a valid shuffle key even when the job had no reducers.

processRecovery() {

   ← This code handles recovery after an RM restart
    // attempt will generate one.  However that disables recovery if there
    // are reducers as the shuffle secret would be app attempt specific.
    int numReduceTasks = getConfig().getInt(MRJobConfig.NUM_REDUCES, 0);
    .. ..

 
    if (recoveryEnabled && recoverySupportedByCommitter
-          && shuffleKeyValidForRecovery) {
+        && (numReduceTasks <= 0 || shuffleKeyValidForRecovery)) {
      LOG.info("Recovery is enabled. "
          + "Will try to recover from previous life on best effort basis.");

4.1 Category:

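Incorrect error handling (missed error subtype): the recovery check did not consider the case where numReduceTasks <= 0, i.e. a map-only job, which has no reducers and therefore does not depend on the shuffle key.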
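
The effect of the one-line change can be illustrated with a small stand-alone model of the condition. This is only a sketch: the class RecoveryGate and the methods beforeFix and afterFix are hypothetical names introduced here, not part of Hadoop; the boolean and int parameters mirror the variables shown in the diff above.

```java
// Hypothetical stand-alone model of the recovery condition from the diff.
// It is not the actual AM code; names other than the parameters are made up.
final class RecoveryGate {
    private RecoveryGate() {}

    // Condition before the patch: a valid shuffle key is always required.
    static boolean beforeFix(boolean recoveryEnabled,
                             boolean recoverySupportedByCommitter,
                             boolean shuffleKeyValidForRecovery) {
        return recoveryEnabled && recoverySupportedByCommitter
                && shuffleKeyValidForRecovery;
    }

    // Condition after the patch: the shuffle key only matters if there are reducers.
    static boolean afterFix(boolean recoveryEnabled,
                            boolean recoverySupportedByCommitter,
                            int numReduceTasks,
                            boolean shuffleKeyValidForRecovery) {
        return recoveryEnabled && recoverySupportedByCommitter
                && (numReduceTasks <= 0 || shuffleKeyValidForRecovery);
    }

    public static void main(String[] args) {
        // Map-only job (0 reducers) whose shuffle key is not valid for recovery.
        System.out.println(beforeFix(true, true, false));    // false: job restarts
        System.out.println(afterFix(true, true, 0, false));  // true: recovery proceeds
        // A job with reducers still needs a valid shuffle key after the patch.
        System.out.println(afterFix(true, true, 2, false));  // false
    }
}
```

In this model the old condition rejects recovery for a map-only job whenever the shuffle key is invalid, while the patched condition skips the shuffle-key requirement only when there are no reducers.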