HBase-5883 Report

https://issues.apache.org/jira/browse/HBASE-5883?attachmentSortBy=dateTime#attachmentmodule

1. Symptom

When the active master node was down, the backup node got notification and started to become active. Immediately backup node got aborted.

1.1 Severity

Major

1.2 Was there exception thrown?

Yes

1.2.1 Were there multiple exceptions?

Yes (Active master cannot be connected + backup node abort)

1.3 Scope of the failure

Entire cluster will be down

2. How to reproduce this failure

2.0 Version

0.94.0

2.1 Configuration

Standard

2.2 Reproduction procedure

1. Shutdown master (disconnect)

2. Shutdown ZK (disconnect)

3. Shutdown NN (disconnect)

2.2.1 Timing order

In this order

2.2.2 Events order externally controllable?

Yes

2.3 Can the logs tell how to reproduce the failure?

Yes

2.4 How many machines needed?

3 (HMaster, ZK, NM)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The exception msg:

2012-04-09 10:42:24,270 INFO org.apache.hadoop.hbase.master.SplitLogManager: finished splitting (more than or equal to) 861248320 bytes in 4 log files in [hdfs://192.168.47.205:9000/hbase/.logs/HOST-192-168-47-202,60020,1333715537172-splitting] in 26374ms
2012-04-09 10:42:24,316 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
2012-04-09 10:42:24,333 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
java.io.IOException: java.net.ConnectException: Connection refused
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:375)
        at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1045)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:897)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
        at $Proxy13.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:183)
        at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:303)
        at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:280)
        at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:332)
        at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:236)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1276)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1233)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1220)
        at org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:569)
        at org.apache.hadoop.hbase.catalog.CatalogTracker.getRootServerConnection(CatalogTracker.java:369)
        at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForRootServerConnection(CatalogTracker.java:353)
        at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRootRegionLocation(CatalogTracker.java:660)
        at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:616)
        at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:540)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:363)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:488)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupConnection(HBaseClient.java:328)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:362)
        ... 20 more
2012-04-09 10:42:24,336 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
2012-04-09 10:42:24,336 DEBUG org.apache.hadoop.hbase.master.HMaster: Stopping service threads

3.2 Backward inference

We can locate this piece of code in HMaster.java:

 public void run() {

    try {

      /*

       * Block on becoming the active master.

       *

       * We race with other masters to write our address into ZooKeeper.  If we

       * succeed, we are the primary/active master and finish initialization.

       *

       * If we do not succeed, there is another active master and we should

       * now wait until it dies to try and become the next active master.  If we

       * do not succeed on our first attempt, this is no longer a cluster startup.

       */

      this.activeMasterManager = new ActiveMasterManager(zooKeeper, address, this);

      this.zooKeeper.registerListener(activeMasterManager);

      stallIfBackupMaster(this.conf, this.activeMasterManager);

      this.activeMasterManager.blockUntilBecomingActiveMaster();

      // We are either the active master or we were asked to shutdown

      if (!this.stopped) {

        finishInitialization();

        loop();

      }

    } catch (Throwable t) {

      abort("Unhandled exception. Starting shutdown.", t);

   }

We further notice the following code in HBaseRPC.java:

   while (true) {

      try {

        return getProxy(protocol, clientVersion, addr, conf, rpcTimeout);

      } catch(ConnectException se) {  // namenode has not been started

        ioe = se;

        if (maxAttempts >= 0 && ++reconnectAttempts >= maxAttempts) {

          LOG.info("Server at " + addr + " could not be reached after " +

            reconnectAttempts + " tries, giving up.");

          throw new RetriesExhaustedException("Failed setting up proxy " +

            protocol + " to " + addr.toString() + " after attempts=" +

            reconnectAttempts, se);

         }

      }

So it appears the exception is handled, however, it was actually not. The exception was thrown as follow:

new IOException(new ConnecException("Connection refused"));

This won’t fall into the “catch (ConnectException se)” block.

4. Root cause

Handled a wrong exception

4.1 Category:

Incorrect error handling (not handled, handled wrong exception)