HBASE-5883 Report
https://issues.apache.org/jira/browse/HBASE-5883?attachmentSortBy=dateTime#attachmentmodule
When the active master node went down, the backup node received the notification and started to become the active master. The backup node then immediately aborted.
Major
Yes
Yes (active master cannot be connected to + backup node aborts)
Entire cluster will be down
0.94.0
Standard
1. Shut down the active HMaster (disconnect)
2. Shut down ZooKeeper (disconnect)
3. Shut down the NameNode (disconnect)
(in this order)
Yes
Yes
3 (HMaster, ZK, NN)
The exception message:
2012-04-09 10:42:24,270 INFO org.apache.hadoop.hbase.master.SplitLogManager: finished splitting (more than or equal to) 861248320 bytes in 4 log files in [hdfs://192.168.47.205:9000/hbase/.logs/HOST-192-168-47-202,60020,1333715537172-splitting] in 26374ms
2012-04-09 10:42:24,316 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
2012-04-09 10:42:24,333 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
java.io.IOException: java.net.ConnectException: Connection refused
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:375)
at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1045)
at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:897)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
at $Proxy13.getProtocolVersion(Unknown Source)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:183)
at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:303)
at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:280)
at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:332)
at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:236)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1276)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1233)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1220)
at org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:569)
at org.apache.hadoop.hbase.catalog.CatalogTracker.getRootServerConnection(CatalogTracker.java:369)
at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForRootServerConnection(CatalogTracker.java:353)
at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRootRegionLocation(CatalogTracker.java:660)
at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:616)
at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:540)
at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:363)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:488)
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupConnection(HBaseClient.java:328)
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:362)
... 20 more
2012-04-09 10:42:24,336 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
2012-04-09 10:42:24,336 DEBUG org.apache.hadoop.hbase.master.HMaster: Stopping service threads
We can locate this piece of code in HMaster.run() in HMaster.java: the IOException thrown while assigning root and meta propagates out of finishInitialization() into the catch (Throwable t) block below, which aborts the newly active master:
public void run() {
  try {
    /*
     * Block on becoming the active master.
     *
     * We race with other masters to write our address into ZooKeeper. If we
     * succeed, we are the primary/active master and finish initialization.
     *
     * If we do not succeed, there is another active master and we should
     * now wait until it dies to try and become the next active master. If we
     * do not succeed on our first attempt, this is no longer a cluster startup.
     */
    this.activeMasterManager = new ActiveMasterManager(zooKeeper, address, this);
    this.zooKeeper.registerListener(activeMasterManager);
    stallIfBackupMaster(this.conf, this.activeMasterManager);
    this.activeMasterManager.blockUntilBecomingActiveMaster();
    // We are either the active master or we were asked to shutdown
    if (!this.stopped) {
      finishInitialization();
      loop();
    }
  } catch (Throwable t) {
    abort("Unhandled exception. Starting shutdown.", t);
  }
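For illustration, here is a minimal, self-contained sketch of that control flow. The class and method names below only mimic HMaster for readability and are not the real HBase code: any exception escaping finishInitialization() reaches the catch (Throwable t) block and the would-be active master aborts unconditionally.

import java.io.IOException;
import java.net.ConnectException;

// Simplified stand-in for the HMaster startup flow quoted above; names mirror
// HMaster only for illustration, this is not the actual HBase source.
public class MasterStartupSketch {

  // Mimics finishInitialization(): the root-region check fails because the
  // region server cannot be reached, and the ConnectException surfaces
  // wrapped inside an IOException (as in the stack trace above).
  static void finishInitialization() throws IOException {
    throw new IOException(new ConnectException("Connection refused"));
  }

  static void abort(String why, Throwable t) {
    System.out.println("FATAL " + why + " : " + t);
  }

  public static void main(String[] args) {
    try {
      // The backup master has just become active and now initializes.
      finishInitialization();
    } catch (Throwable t) {
      // Any failure during initialization takes exactly this path:
      // the newly active master aborts, which is the behavior reported.
      abort("Unhandled exception. Starting shutdown.", t);
    }
  }
}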
We further notice the following retry loop in HBaseRPC.waitForProxy() (HBaseRPC.java), which appears in the stack trace above:
while (true) {
  try {
    return getProxy(protocol, clientVersion, addr, conf, rpcTimeout);
  } catch (ConnectException se) {  // namenode has not been started
    ioe = se;
    if (maxAttempts >= 0 && ++reconnectAttempts >= maxAttempts) {
      LOG.info("Server at " + addr + " could not be reached after " +
        reconnectAttempts + " tries, giving up.");
      throw new RetriesExhaustedException("Failed setting up proxy " +
        protocol + " to " + addr.toString() + " after attempts=" +
        reconnectAttempts, se);
    }
  }
So it appears the ConnectException is handled; however, it actually is not. The exception was thrown as follows:
new IOException(new ConnectException("Connection refused"));
This does not fall into the “catch (ConnectException se)” block, because Java matches catch clauses against the thrown type (here IOException), not against the cause chain; the exception therefore bypasses the retry logic, propagates up to HMaster.run(), and aborts the master.
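A minimal, self-contained sketch of this mismatch (the class name and connect() helper are hypothetical, not HBase code): the ConnectException is only the cause of the IOException, so the typed catch never fires.

import java.io.IOException;
import java.net.ConnectException;

// Illustrates the catch-clause mismatch described above; names are hypothetical.
public class WrappedExceptionDemo {

  static void connect() throws IOException {
    // This is the shape in which the exception actually arrives at the
    // retry loop: an IOException whose *cause* is the ConnectException.
    throw new IOException(new ConnectException("Connection refused"));
  }

  public static void main(String[] args) {
    try {
      connect();
    } catch (ConnectException ce) {
      // Never reached: the thrown type is IOException, not ConnectException.
      System.out.println("retryable: " + ce);
    } catch (IOException ioe) {
      // This is the branch that actually runs, so the retry logic in
      // waitForProxy() is bypassed and the exception propagates upward.
      System.out.println("not retried, propagates: " + ioe);
      System.out.println("underlying cause: " + ioe.getCause());
    }
  }
}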
Handled the wrong exception
Incorrect error handling (not handled; the wrong exception type was caught)
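For completeness, a hedged sketch of one way the retry loop could recognize the wrapped exception, by walking the cause chain before deciding whether to retry. This is only an illustration of the idea; it is not the patch attached to HBASE-5883.

import java.io.IOException;
import java.net.ConnectException;

// Hedged sketch of a cause-aware check the retry loop could use.
public class CauseAwareRetrySketch {

  // Returns true if the exception is a ConnectException or wraps one
  // anywhere in its cause chain.
  static boolean isConnectionRefused(Throwable t) {
    while (t != null) {
      if (t instanceof ConnectException) {
        return true;
      }
      t = t.getCause();
    }
    return false;
  }

  public static void main(String[] args) {
    // The shape of the exception seen in the log above.
    IOException wrapped = new IOException(new ConnectException("Connection refused"));
    // A retry loop that tests the cause chain would treat this as retryable
    // instead of letting it propagate and abort the master.
    System.out.println("retryable = " + isConnectionRefused(wrapped)); // prints true
  }
}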