[hadoop]MAPREDUCE-3272 Report

1. Symptom

If a temporary network error occurs between the NM (NodeManager) and the RM (ResourceManager), the NM aborts after it rejoins the network.

1.1 Severity

Critical

1.2 Was there an exception thrown?

Yes.

1.2.1 Were there multiple exceptions?

Yes.

1. The RM detects that it cannot connect to the NM.

2. The NM shuts down.

1.3 Scope of the failure

If a temporary network error affects many NMs, all of them will die.

2. How to reproduce this failure

2.0 Version

0.23.0

2.1 Configuration

Standard

2.2 Reproduction procedure

1. Stop the NM (disconnect).

2. Start the NM (add node).

2.2.1 Timing order

The two steps must occur in the order listed above.

2.2.2 Is the event order externally controllable?

Yes

2.3 Can the logs tell how to reproduce the failure?

Yes

2.4 How many machines needed?

2 (RM + NM)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The NM's error stack shows that it received a reboot exception, which was not handled correctly.

3.2 Backward inference

  @Override
  public void stateChanged(Service service) {
    // Shutdown the Nodemanager when the NodeStatusUpdater is stopped.
    if (NodeStatusUpdaterImpl.class.getName().equals(service.getName())
        && STATE.STOPPED.equals(service.getServiceState())) {
      stop();
      // <-- This branch does not handle the case where the RM sent the NM a
      //     reboot command (the fix is to handle that case here).
    }
  }

  public static void main(String[] args) {
    StringUtils.startupShutdownMessage(NodeManager.class, args, LOG);
    try {
      NodeManager nodeManager = new NodeManager();
      Runtime.getRuntime().addShutdownHook(
          new CompositeServiceShutdownHook(nodeManager));
      YarnConfiguration conf = new YarnConfiguration();
      nodeManager.init(conf);
      nodeManager.start();
    } catch (Throwable t) {
      LOG.fatal("Error starting NodeManager", t);
      // <-- The code overlooks that the RM might send a "REBOOT" command.
      //     Because that case is not handled, execution eventually ends up
      //     here and the NM exits.
      System.exit(-1);
    }
  }

3.3 Are the printed logs sufficient for diagnosis?

Yes

4. Root cause

The NM handled the generic state-change signals from the RM, but it did not handle the “Reboot” signal case.

The fix extends the error handling (stateChanged acts as an error handler here) so that it also checks whether the new state is Reboot and handles that case.
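
The difference between the two behaviors can be illustrated with a small, self-contained model. This is only a sketch, not the actual patch: ToyNodeManager, StopReason and statusUpdaterStopped are hypothetical names that do not exist in Hadoop, and the sketch assumes that "handling" a reboot means stopping, re-initializing and starting again instead of exiting.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the NM's reaction to a stopped status updater (all names hypothetical). */
final class RebootHandlingSketch {

    /** Why the status updater stopped; an assumption made for this sketch only. */
    enum StopReason { SHUTDOWN, REBOOT_REQUESTED_BY_RM }

    /** Minimal stand-in for a NodeManager that records the actions it takes. */
    static class ToyNodeManager {
        final List<String> actions = new ArrayList<>();
        private final boolean handleReboot;

        ToyNodeManager(boolean handleReboot) {
            this.handleReboot = handleReboot;
        }

        /** Invoked when the status updater reaches the STOPPED state. */
        void statusUpdaterStopped(StopReason reason) {
            if (handleReboot && reason == StopReason.REBOOT_REQUESTED_BY_RM) {
                // Assumed fixed path: restart and rejoin instead of dying.
                actions.add("stop");
                actions.add("reinit");
                actions.add("start");
                return;
            }
            // Original path: every STOPPED transition ends the process.
            actions.add("stop");
            actions.add("exit");
        }
    }

    public static void main(String[] args) {
        ToyNodeManager unfixed = new ToyNodeManager(false);
        unfixed.statusUpdaterStopped(StopReason.REBOOT_REQUESTED_BY_RM);
        System.out.println("without reboot handling: " + unfixed.actions);

        ToyNodeManager fixed = new ToyNodeManager(true);
        fixed.statusUpdaterStopped(StopReason.REBOOT_REQUESTED_BY_RM);
        System.out.println("with reboot handling: " + fixed.actions);
    }
}
```

In this model, the unfixed variant treats a reboot request like any other stop and exits, which mirrors the symptom in section 1. The variant with reboot handling keeps the node alive, and an ordinary SHUTDOWN still exits in both variants.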

4.1 Category:

Incorrect error handling.