[hadoop]MAPREDUCE-3963 Report

1. Symptom

When the NMs are started before the RM starts, NM will all die.

1.1 Severity

Critical

1.2 Was there exception thrown?

Yes. Error starting service. Failure.

1.3 Scope of the failure

Potentially all NM --- entire cluster.

2. How to reproduce this failure

2.0 Version

0.23.1

2.1 Configuration

Standard

2.2 Reproduction procedure

1. Start NM (node start)

2. Start RM (node start)

2.2.1 Timing order

In this order

2.2.2 Events order externally controllable?

Yes

2.3 Can the logs tell how to reproduce the failure?

Yes

2.4 How many machines needed?

2 (RM + NM)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

NM Shutdown with the stacktrace.

ERROR org.apache.hadoop.yarn.service.CompositeService: Error starting services org.apache.hadoop.yarn.server.nodemanager.NodeManager
org.apache.avro.AvroRuntimeException: java.lang.reflect.UndeclaredThrowableException
        at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:149)
        at org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:68)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.start(NodeManager.java:167)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:242)
Caused by: java.lang.reflect.UndeclaredThrowableException
        at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:66)
        at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:182)
        at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.start(NodeStatusUpdaterImpl.java:145)
        ... 3 more
Caused by: com.google.protobuf.ServiceException: java.net.ConnectException: Call From HOST-10-18-52-230/10.18.52.230 to HOST-10-18-52-250:8025 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http:
//wiki.apache.org/hadoop/ConnectionRefused
        at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:131)
        at $Proxy23.registerNodeManager(Unknown Source)
        at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:59)
        ... 5 more
Caused by: java.net.ConnectException: Call From HOST-10-18-52-230/10.18.52.230 to HOST-10-18-52-250:8025 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http:
//wiki.apache.org/hadoop/ConnectionRefused
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:857)
        at org.apache.hadoop.ipc.Client.call(Client.java:1141)
        at org.apache.hadoop.ipc.Client.call(Client.java:1100)
        at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:128)
        ... 7 more
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:659)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:469)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:563)
        at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:211)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1247)
        at org.apache.hadoop.ipc.Client.call(Client.java:1117)
        ... 9 more
2012-01-16 15:04:13,336 WARN org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher thread interrupted
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1899)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1934)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:76)
        at java.lang.
Thread.run(Thread.java:619)
2012-01-16 15:04:13,337 INFO org.apache.hadoop.yarn.service.AbstractService: Service:Dispatcher is stopped.
2012-01-16 15:04:13,392 INFO org.mortbay.log: Stopped SelectChannelConnector@0.0.0.0:9999
2012-01-16 15:04:13,493 INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer is stopped.
2012-01-16 15:04:13,493 INFO org.apache.hadoop.ipc.Server: Stopping server on 24290
2012-01-16 15:04:13,494 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 24290
2012-01-16 15:04:13,495 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2012-01-16 15:04:13,496 INFO org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler is stopped.
2012-01-16 15:04:13,496 WARN org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher thread interrupted
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1899)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1934)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:76)
        at java.lang.
Thread.run(Thread.java:619)

3.2 Backward inference

It is not hard to infer from the stack trace where the error was handled:

public void start() {

    // NodeManager is the last service to start, so NodeId is available.

    this.nodeId = this.context.getNodeId();

    String httpBindAddressStr =

      getConfig().get(YarnConfiguration.NM_WEBAPP_ADDRESS,

          YarnConfiguration.DEFAULT_NM_WEBAPP_ADDRESS);

    InetSocketAddress httpBindAddress =

      NetUtils.createSocketAddr(httpBindAddressStr,

        YarnConfiguration.DEFAULT_NM_WEBAPP_PORT,

        YarnConfiguration.NM_WEBAPP_ADDRESS);

    try {

      //      this.hostName = InetAddress.getLocalHost().getCanonicalHostName();

      this.httpPort = httpBindAddress.getPort();

      // Registration has to be in start so that ContainerManager can get the

      // perNM tokens needed to authenticate ContainerTokens.

      registerWithRM();

      super.start();

      startStatusUpdater();

    } catch (Exception e) {

      throw new AvroRuntimeException(e);

    }

  }

Inside registerWithRM:

 private void registerWithRM() throws YarnRemoteException {

    this.resourceTracker = getRMClient();

    LOG.info("Connected to ResourceManager at " + this.rmAddress);

   

    RegisterNodeManagerRequest request = recordFactory.newRecordInstance(RegisterNodeManagerRequest.class);

    request.setHttpPort(this.httpPort);

    request.setResource(this.totalResource);

    request.setNodeId(this.nodeId);

-    RegistrationResponse regResponse =
-        this.resourceTracker.registerNodeManager(request).getRegistrationResponse();

 -- basically registerNodeManager throws an exception, and was just thrown from the caller, which clearly will take-down the NM.

4. Root cause

When NM cannot connect to RM, instead of directly abort, it should have tried multiple times (standard network retrying). So this is really a trivial mistake.

In the fix, they simply added the retrying mechanism for network error between NM to RM.

4.1 Category:

Incorrect error handling (handled) (stmt coverage)