[hadoop]MAPREDUCE-4444 Report

1. Symptom

The NodeManager fails to start when one of the configured local-dirs is already bad (for example, on a read-only file system) at startup.

1.1 Severity

blocker

1.2 Was there an exception thrown?

Yes

[main]org.apache.hadoop.yarn.YarnException: Failed to initialize LocalizationService

Caused by: EROFS: Read-only file system

1.2.1 Were there multiple exceptions?

No

1.3 Scope of the failure

The NodeManager cannot be initialized.

2. How to reproduce this failure

2.0 Version

0.23.3, 2.0.0-alpha, 3.0.0

2.1 Configuration

1 Namenode

2.2 Reproduction procedure

1. Configure the local dir and the log dir (config change)

2. Corrupt the local dir and the log dir (data corruption)

3. Start a service within the NodeManager (feature start)

2.2.1 Timing order

Yes. The event order matters: the directories must already be bad before the NodeManager starts.

2.2.2 Events order externally controllable?

Yes.

2.3 Can the logs tell how to reproduce the failure?

Yes.

2.4 How many machines needed?

1 node (NM)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The LocalizationService fails to initialize when starting the node manager.

3.2 Backward inference

The initialization function did not weed out the corrupted dirs, so the service failed to start when it tried to access them.

4. Root cause

During init, LocalDirsHandlerService does not check for bad dirs, so subsequent init code (such as the LocalizationService) fails when it tries to access those bad dirs.
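
To make "checking for bad dirs" concrete, here is a minimal sketch of a directory health probe. It is an assumption-laden illustration, not Hadoop's implementation: the class DirHealthCheck and its methods isUsable and goodDirs are hypothetical names, and the probe relies only on java.io.File permission checks, which approximate rather than guarantee that a later write will succeed.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: separates usable dirs from bad ones.
final class DirHealthCheck {

    /** True if the path is (or can be created as) a directory we can read, write, and traverse. */
    static boolean isUsable(File dir) {
        if (!dir.exists() && !dir.mkdirs()) {
            return false; // creation failed, e.g. the parent is on a read-only file system
        }
        return dir.isDirectory() && dir.canRead() && dir.canWrite() && dir.canExecute();
    }

    /** Returns only the usable dirs, preserving the configured order. */
    static List<String> goodDirs(List<String> configured) {
        List<String> good = new ArrayList<>();
        for (String path : configured) {
            if (isUsable(new File(path))) {
                good.add(path);
            }
        }
        return good;
    }

    // Usage: java DirHealthCheck /path/one /path/two
    public static void main(String[] args) {
        for (String path : args) {
            System.out.println(path + " -> " + (isUsable(new File(path)) ? "good" : "bad"));
        }
    }
}
```

Running such a probe before any other service reads the directory list means a bad dir is dropped from the list and never reaches the code that would otherwise fail on it.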

4.1 Category:

Incorrect exception handling (Not even checked)

Fix:

@@ -135,6 +119,10 @@ public class LocalDirsHandlerService extends AbstractService {
        YarnConfiguration.DEFAULT_NM_MIN_HEALTHY_DISKS_FRACTION);
    lastDisksCheckTime = System.currentTimeMillis();
    super.init(conf);
+
+    // Check the disk health immediately to weed out bad directories
+    // before other init code attempts to use them.
+    checkDirs();
  }
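
The effect of the patch is an ordering guarantee: the dirs are checked inside init, before any dependent service reads the list. The sketch below illustrates that ordering with toy classes. ToyDirsHandler, ToyLocalizer, the checkOnInit flag, and the pluggable health predicate are all hypothetical stand-ins for illustration and do not reflect the real Hadoop classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Toy stand-in for a dirs-handler service; illustrative only, not Hadoop's code.
final class ToyDirsHandler {
    private final List<String> configuredDirs;
    private final Predicate<String> healthy; // hypothetical pluggable health probe
    private final boolean checkOnInit;       // true models the patched behaviour
    private List<String> goodDirs = new ArrayList<>();

    ToyDirsHandler(List<String> configuredDirs, Predicate<String> healthy, boolean checkOnInit) {
        this.configuredDirs = configuredDirs;
        this.healthy = healthy;
        this.checkOnInit = checkOnInit;
    }

    void init() {
        goodDirs = new ArrayList<>(configuredDirs); // initially trusts every configured dir
        if (checkOnInit) {
            checkDirs(); // mirrors the patch: weed out bad dirs before anyone reads the list
        }
    }

    void checkDirs() {
        goodDirs.removeIf(healthy.negate());
    }

    List<String> getGoodDirs() {
        return new ArrayList<>(goodDirs);
    }

    public static void main(String[] args) {
        Predicate<String> probe = dir -> !dir.endsWith("bad");
        ToyDirsHandler handler = new ToyDirsHandler(Arrays.asList("/data/1", "/data/bad"), probe, true);
        handler.init();
        ToyLocalizer.init(handler, probe);
        System.out.println("good dirs: " + handler.getGoodDirs());
    }
}

// Toy dependent service: fails hard if it is handed a bad dir.
final class ToyLocalizer {
    static void init(ToyDirsHandler handler, Predicate<String> healthy) {
        for (String dir : handler.getGoodDirs()) {
            if (!healthy.test(dir)) {
                throw new IllegalStateException("Failed to initialize on bad dir " + dir);
            }
        }
    }
}
```

With checkOnInit set to false, the toy localizer is handed the bad dir and throws, which corresponds to the reported symptom. With it set to true, the bad dir is filtered out first and initialization proceeds on the remaining dirs.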