HDFS-1540 Report

https://issues.apache.org/jira/browse/HDFS-1540

1. Symptom

Datanodes quit when the RPC call namenode.register(...) (from DN to NN) does not succeed. Because of this, a temporary network problem on the NN side can bring down a large number of DNs. The DN's log recorded a ConnectException as follows:

2013-07-08 13:37:48,952 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:

/************************************************************

STARTUP_MSG: Starting DataNode

STARTUP_MSG:   host = slave2/192.168.103.151

STARTUP_MSG:   args = []

STARTUP_MSG:   version = 0.20.3-dev

STARTUP_MSG:   build =  -r ; compiled by 'renna' on Thu Jul  4 16:07:40 EDT 2013

************************************************************/

2013-07-08 13:37:49,168 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Registered FSDatasetStatusMBean

2013-07-08 13:37:49,169 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened info server at 50010

2013-07-08 13:37:49,170 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwith is 1048576 bytes/s

2013-07-08 13:37:54,232 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog

2013-07-08 13:37:54,286 INFO org.apache.hadoop.http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener on 50075

2013-07-08 13:37:54,286 INFO org.apache.hadoop.http.HttpServer: listener.getLocalPort() returned 50075 webServer.getConnectors()[0].getLocalPort() returned 50075

2013-07-08 13:37:54,286 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 50075

2013-07-08 13:37:54,286 INFO org.mortbay.log: jetty-6.1.14

2013-07-08 13:37:54,715 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:50075

2013-07-08 13:37:54,725 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=DataNode, sessionId=null

2013-07-08 13:37:59,750 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=DataNode, port=50020

2013-07-08 13:37:59,757 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting

2013-07-08 13:37:59,758 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50020: starting

2013-07-08 13:37:59,758 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50020: starting

2013-07-08 13:37:59,758 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50020: starting

2013-07-08 13:37:59,758 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: dnRegistration = DatanodeRegistration(slave2:50010, storageID=DS-1742617519-192.168.103.151-50010-1373303967568, infoPort=50075, ipcPort=50020)

2013-07-08 13:37:59,761 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 50020: starting

2013-07-08 13:38:00,761 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.103.146:54310. Already tried 0 time(s).

2013-07-08 13:38:01,762 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.103.146:54310. Already tried 1 time(s).

2013-07-08 13:38:02,764 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.103.146:54310. Already tried 2 time(s).

2013-07-08 13:38:03,765 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.103.146:54310. Already tried 3 time(s).

2013-07-08 13:38:04,766 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.103.146:54310. Already tried 4 time(s).

2013-07-08 13:38:05,768 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.103.146:54310. Already tried 5 time(s).

2013-07-08 13:38:06,770 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.103.146:54310. Already tried 6 time(s).

2013-07-08 13:38:07,771 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.103.146:54310. Already tried 7 time(s).

2013-07-08 13:38:08,772 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.103.146:54310. Already tried 8 time(s).

2013-07-08 13:38:09,774 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.103.146:54310. Already tried 9 time(s).

2013-07-08 13:38:09,776 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.net.ConnectException: Call to master/192.168.103.146:54310 failed on connection exception: java.net.ConnectException: Connection refused

    at org.apache.hadoop.ipc.Client.wrapException(Client.java:767)

    at org.apache.hadoop.ipc.Client.call(Client.java:743)

    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)

    at com.sun.proxy.$Proxy4.register(Unknown Source)

    at org.apache.hadoop.hdfs.server.datanode.DataNode.register(DataNode.java:531)

    at org.apache.hadoop.hdfs.server.datanode.DataNode.runDatanodeDaemon(DataNode.java:1208)

    at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1247)

    at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)

Caused by: java.net.ConnectException: Connection refused

    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)

    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)

    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)

    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:304)

    at org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)

    at org.apache.hadoop.ipc.Client.getConnection(Client.java:860)

    at org.apache.hadoop.ipc.Client.call(Client.java:720)

    ... 6 more

2013-07-08 13:38:09,778 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:

/************************************************************

SHUTDOWN_MSG: Shutting down DataNode at slave2/192.168.103.151

************************************************************/

1.1 Severity

Major.

Datanodes would quit and have to be restarted manually, but no harm was done to the data or to the rest of the cluster.

1.2 Was there an exception thrown?

Yes.

java.net.ConnectException, as shown above.

1.3 Were there multiple exceptions?

Yes.

The NN shut down first, and then the DN hit a connection exception while retrying the register RPC.

2. How to reproduce this failure

Start up a cluster with one of the affected versions and, once it is running properly, shut down the datanode. Then start the datanode again and shut down the namenode after handshake() has finished but before namenode.register(...) is called. Timing is critical here. To make the window easier to hit, inserting extra log messages (or sleeps) into the source code helps a lot, as sketched below.
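For illustration, a hypothetical instrumentation sketch (not part of the original source; the method body is abridged): in the 0.20.x code base, handshake() runs while the DataNode object is being constructed, and register() is called later from runDatanodeDaemon(), as the stack trace in section 1 shows. Adding a marker message and a sleep there widens the race window:

public static void runDatanodeDaemon(DataNode dn) throws IOException {
  if (dn != null) {
    LOG.info("REPRO: handshake done, about to register()"); // added marker message
    try {
      Thread.sleep(10000); // added: a 10-second window in which to stop the NN
    } catch (InterruptedException ie) {}
    dn.register(); // the call that hits the unhandled ConnectException
    // ... start the datanode service threads as in the original code ...
  }
}

With a window this wide, stopping the namenode at the right moment no longer requires sub-second timing.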

2.0 Version

Three versions were reported to be affected by the bug: 0.20.1, 0.20.2 and 0.21.0. The report appears to use 0.21.0, but I used 0.20.2. For that, I had to make some slight changes to the procedure described on the JIRA page. One of them was removing the handling of one of the three expected exceptions, UnregisteredNodeException, which creates a vulnerability in the system but was irrelevant here, since the goal was only to reproduce the bug.

2.1 Environment settings and configuration

Theoretically, only one machine is needed: one namenode and one datanode (though two datanodes were used to reproduce this failure). Set up a normal Hadoop cluster using any of the three affected versions, with a namenode and a datanode; a minimal configuration sketch follows.
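For reference, a sketch of the relevant configuration, with the namenode address inferred from the log messages above (master:54310); adjust the hostname to your environment:

<!-- conf/core-site.xml on every node (0.20.x property name assumed) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
  </property>
</configuration>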

2.2 Reproduction procedure

- Start the DN.

- Shut down the NN right away (after the handshake function returns on the DN, but before the register function is called).

NOTE: timing is important here!

In my case, I used the following command on the datanode machine:

hadoop-daemon.sh start datanode

as can be seen in the following log message:


2013-07-10 18:15:45,191 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:

/************************************************************

STARTUP_MSG: Starting DataNode

STARTUP_MSG:   host = research-virtual-machine/127.0.1.1

STARTUP_MSG:   args = []

STARTUP_MSG:   version = 0.20.3-dev

STARTUP_MSG:   build =  -r ; compiled by 'renna' on Thu Jul  4 16:07:40 EDT 2013

************************************************************/

and after half a second, I used the following command on the namenode machine:

hadoop-daemon.sh stop namenode

as we can see in the following log message.

2013-07-10 18:15:45,361 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:

/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at master/192.168.103.146

************************************************************/

2.3 Can the logs tell how to reproduce this failure?

Yes. A good job was done on the event log. The datanode itself logs the exceptions thrown, with timestamps, which already points strongly to the root cause.

If the namenode's log is checked and its shutdown is detected, it becomes almost obvious that the problem was the connection between the NN and the DN.

2.4 How many machines were needed?

One datanode and one namenode were needed.

One machine would be enough to reproduce this failure.

3. Diagnosis procedure

3.1 Detailed Symptom

During the startup, the datanode dies (automatically shuts itself down) due to an unsuccessful connection to the server.

2013-07-10 18:15:45,191 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:

/************************************************************

STARTUP_MSG: Starting DataNode

STARTUP_MSG:   host = research-virtual-machine/127.0.1.1

STARTUP_MSG:   args = []

STARTUP_MSG:   version = 0.20.3-dev

STARTUP_MSG:   build =  -r ; compiled by 'renna' on Thu Jul  4 16:07:40 EDT 2013

************************************************************/

(...)

2013-07-10 18:16:05,939 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.103.146:54310. Already tried 9 time(s).

2013-07-10 18:16:05,942 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.net.ConnectException: Call to master/192.168.103.146:54310 failed on connection exception: java.net.ConnectException: Connection refused

    at org.apache.hadoop.ipc.Client.wrapException(Client.java:767)

    at org.apache.hadoop.ipc.Client.call(Client.java:743)

    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)

    at com.sun.proxy.$Proxy4.register(Unknown Source)

    at org.apache.hadoop.hdfs.server.datanode.DataNode.register(DataNode.java:531)

    at org.apache.hadoop.hdfs.server.datanode.DataNode.runDatanodeDaemon(DataNode.java:1208)

    at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1247)

    at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)

Caused by: java.net.ConnectException: Connection refused

    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)

    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)

    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)

    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:304)

    at org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)

    at org.apache.hadoop.ipc.Client.getConnection(Client.java:860)

    at org.apache.hadoop.ipc.Client.call(Client.java:720)

    ... 6 more

2013-07-10 18:16:05,943 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:

/************************************************************

SHUTDOWN_MSG: Shutting down DataNode at research-virtual-machine/127.0.1.1

************************************************************/

The namenode was shut down between the datanode's startup message and its shutdown message, as the namenode's own log shows:

2013-07-10 18:15:45,361 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:

/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at master/192.168.103.146

************************************************************/

3.2 Backward inference

From the log messages, we can tell with certainty that there is a problem with the connection between the namenode and the datanode (java.net.ConnectException).

Also, from the exception's stack trace, we can see that it occurs inside the DN's register() function. In the startup process, almost all methods perform local operations; only register() and handshake() make remote calls during datanode startup. So, since the exception is a ConnectException, these two are the top suspects and the first to have their code inspected. The source code of the handshake() method follows.

private NamespaceInfo handshake() throws IOException {
  NamespaceInfo nsInfo = new NamespaceInfo();
  while (shouldRun) {
    try {
      nsInfo = namenode.versionRequest();
      break;
    } catch(SocketTimeoutException e) {  // NN is busy
      LOG.info("Problem connecting to server: " + getNameNodeAddr());
      try {
        Thread.sleep(1000);
      } catch (InterruptedException ie) {}
    }
  }
  (...)
}

And the code of the register() method, too.

private void register() throws IOException {
  if (dnRegistration.getStorageID().equals("")) {
    setNewStorageID(dnRegistration);
  }
  while (shouldRun) {
    try {
      // reset name to machineName. For web interface.
      dnRegistration.name = machineName + ":" + dnRegistration.getPort();
      dnRegistration = namenode.register(dnRegistration);
      break;
    } catch(SocketTimeoutException e) {  // NN is busy
      /* Note: this log msg is NOT printed, and only "SocketTimeoutException" is caught... */
      LOG.info("Problem connecting to server: " + getNameNodeAddr());
      try {
        Thread.sleep(1000);
      } catch (InterruptedException ie) {}
    }
  }

(...)

By looking at the register() function, which threw the ConnectException, we can infer that the exception was NOT caught because:

A) the log message

      LOG.info("Problem connecting to server: " + getNameNodeAddr());

   was not printed, and

B) it only tried to catch SocketTimeoutException (not ConnectException); a minimal demonstration of why that clause cannot match follows below.

By analyzing the source code, the methods those two functions call, and the methods that lead to their invocation, we can draw a call diagram (not reproduced here).

Basically, the developers did not handle this ConnectException!
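The following self-contained sketch (class and method names are hypothetical) demonstrates the point: ConnectException and SocketTimeoutException sit on different branches under IOException, so a handler for one never catches the other.

import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

// Exception hierarchy involved:
//   ConnectException       -> SocketException        -> IOException
//   SocketTimeoutException -> InterruptedIOException -> IOException
public class CatchDemo {
  // Stand-in for namenode.register(...): declared to throw IOException so both
  // catch clauses below are legal, but it actually throws a ConnectException.
  static void register() throws IOException {
    throw new ConnectException("Connection refused");
  }

  public static void main(String[] args) {
    try {
      register();
    } catch (SocketTimeoutException e) {
      // The pre-patch handler: never entered for a ConnectException.
      System.out.println("would retry: " + e);
    } catch (IOException e) {
      // The broader handler the patch introduces: catches ConnectException too.
      System.out.println("caught and can retry: " + e);
    }
  }
}

Running it prints "caught and can retry: java.net.ConnectException: Connection refused", i.e. only the broader IOException handler matches.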

4. Root Cause

Finally, we take a look at the patch the developers made available on the report page and confirm these ideas. Applying the patch, we can verify that it actually works: under the same conditions as before, datanodes on the patched system no longer quit because of a connection problem.

@@ -683,12 +683,21 @@
        dnRegistration.name = machineName + ":" + dnRegistration.getPort();
        dnRegistration = namenode.registerDatanode(dnRegistration);
        break;
-      } catch(SocketTimeoutException e) {  // namenode is busy
-        LOG.info("Problem connecting to server: " + getNameNodeAddr());
-        try {
-          Thread.sleep(1000);
-        } catch (InterruptedException ie) {}
+      } catch(RemoteException re) {
+        IOException ue = re.unwrapRemoteException(
+                           UnregisteredNodeException.class,
+                           DisallowedDatanodeException.class,
+                           IncorrectVersionException.class);
+        if (ue != re) {
+          LOG.warn("DataNode is shutting down: ", re);
+          throw ue;
+        }
+      } catch(IOException e) {  // namenode cannot be contacted
+        LOG.info("Problem connecting to server: " + getNameNodeAddr(), e);
      }
+      try {
+        Thread.sleep(1000);
+      } catch (InterruptedException ie) {}
    }

(...)

2013-07-10 18:16:04,998 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.103.146:54310. Already tried 9 time(s).

2013-07-10 18:16:05,000 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: master/192.168.103.146:54310
java.net.ConnectException: Call to master/192.168.103.146:54310 failed on connection exception: java.net.ConnectException: Connection refused

    at org.apache.hadoop.ipc.Client.wrapException(Client.java:767)

    at org.apache.hadoop.ipc.Client.call(Client.java:743)

    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)

    at com.sun.proxy.$Proxy4.register(Unknown Source)

    at org.apache.hadoop.hdfs.server.datanode.DataNode.register(DataNode.java:534)

    at org.apache.hadoop.hdfs.server.datanode.DataNode.runDatanodeDaemon(DataNode.java:1219)

    at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1258)

    at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1379)

Caused by: java.net.ConnectException: Connection refused

    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)

    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)

    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)

    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:304)

    at org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)

    at org.apache.hadoop.ipc.Client.getConnection(Client.java:860)

    at org.apache.hadoop.ipc.Client.call(Client.java:720)

    ... 6 more


So now the datanode no longer quits. Instead, it keeps trying to reconnect to the namenode until it finally succeeds.

4.1 Category

Incorrect exception handling: the exception was not handled. The code caught the wrong exception (SocketTimeoutException) and did not handle ConnectException.

5. Fix

5.1 How?

Fixed by catching RemoteException and the more general IOException (which covers ConnectException). Fatal remote exceptions are unwrapped and rethrown so the DN still shuts down when the NN explicitly rejects it; any other connection problem is logged and the registration is retried after a one-second sleep, as the reconstruction below shows.
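For readability, here is the patched retry loop reconstructed from the diff in section 4 (a sketch; identifiers follow the trunk code the patch targets, e.g. registerDatanode instead of register):

while (shouldRun) {
  try {
    // reset name to machineName. For web interface.
    dnRegistration.name = machineName + ":" + dnRegistration.getPort();
    dnRegistration = namenode.registerDatanode(dnRegistration);
    break;
  } catch (RemoteException re) {
    // Fatal remote errors are unwrapped and rethrown: the DN must still shut
    // down when the NN explicitly rejects it.
    IOException ue = re.unwrapRemoteException(
                       UnregisteredNodeException.class,
                       DisallowedDatanodeException.class,
                       IncorrectVersionException.class);
    if (ue != re) {
      LOG.warn("DataNode is shutting down: ", re);
      throw ue;
    }
  } catch (IOException e) {  // namenode cannot be contacted
    // ConnectException is an IOException, so it is now logged here and retried
    // instead of propagating up and killing the DN.
    LOG.info("Problem connecting to server: " + getNameNodeAddr(), e);
  }
  try {
    Thread.sleep(1000);  // back off before the next registration attempt
  } catch (InterruptedException ie) {}
}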