HBase-5063 Report

1. Symptom

RegionServers fail to report to the backup HMaster after the primary goes down

Category (in the spreadsheet):

Hang

1.1 Severity

Critical

1.2 Was there exception thrown? (Exception column in the spreadsheet)

Yes.

WARN regionserver.HRegionServer: Unable to connect to master. Retrying. Error was:
java.net.ConnectException: Connection refused
       at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
       at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
       at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
       at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
       at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupConnection(HBaseClient.java:328)
       at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:362)
       at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1024)
       at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:876)
       at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
       at $Proxy8.getProtocolVersion(Unknown Source)
       at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:183)
       at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:303)
       at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:280)
       at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:332)
       at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:236)
       at org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:1616)
       at org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:787)
       at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:674)
       at java.lang.Thread.run(Thread.java:619)

1.2.1 Were there multiple exceptions?

Yes. An exception should also have been thrown when the primary HMaster goes down.

1.3 Scope of the failure (e.g., single client, all clients, single file, entire fs, etc.)

All clients (specifically, new connections made after the HMaster fails over to the backup)

Catastrophic? (spreadsheet column)

Yes. This failure is catastrophic: after it occurs, any new connection to HBase fails because the RegionServers are unreachable. Downtime can be shortened because HBase can be recovered from the failure manually: start a replacement HMaster at the original primary's location and kill the backup HMaster. The cluster then functions normally again, since the RegionServers are no longer stuck trying to talk to the original primary HMaster.

2. How to reproduce this failure

2.0 Version

0.92.0

2.1 Configuration

HA must be enabled: a backup HMaster must be running so that it can take over when the primary HMaster goes down.

# of Nodes?

1 (minimum). At least one RegionServer is needed to reproduce this failure.

2.2 Reproduction procedure

1. Make sure the primary HMaster, the backup HMaster, and a RegionServer are started; then trigger (or wait for) a disconnect of the primary HMaster (disconnect)

2. Wait for ZooKeeper to time out the primary HMaster's session (disconnect)

3. Let the backup HMaster take over via HA fail-over (feature start); a hedged Java sketch of this sequence follows the list
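
The steps above can be approximated in a single Java driver using HBase's mini-cluster test utilities. This is a hedged sketch, not a verified test: startMiniCluster(numMasters, numSlaves), abortMaster(int), waitForActiveAndReadyMaster(), and shutdownMiniCluster() are assumed to be available and should be checked against the exact 0.92 test jars.

import org.apache.hadoop.hbase.HBaseTestingUtility;
import org.apache.hadoop.hbase.MiniHBaseCluster;

// Hedged sketch of the reproduction sequence (assumed mini-cluster APIs):
// start a primary + backup HMaster and one RegionServer, crash the active
// master, then let the backup take over after the ZooKeeper timeout.
public class Hbase5063Repro {
  public static void main(String[] args) throws Exception {
    HBaseTestingUtility util = new HBaseTestingUtility();
    // Step 1: two masters (primary + backup) and one RegionServer.
    util.startMiniCluster(2, 1);
    MiniHBaseCluster cluster = util.getHBaseCluster();

    // Step 1 (cont.): abort the active master to simulate its crash (disconnect).
    cluster.abortMaster(0); // assumes master 0 is the active one in a fresh cluster

    // Steps 2-3: wait for the ZooKeeper session to expire and for the
    // backup master to become active (fail-over).
    cluster.waitForActiveAndReadyMaster();

    // Before the fix, the RegionServer now loops forever logging
    // "Unable to connect to master. Retrying." against the old address.
    util.shutdownMiniCluster();
  }
}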

2.2.1 Timing order (Order important column)

Yes; the events must occur in exactly this order.

2.2.2 Events order externally controllable? (Order externally controllable? column)

Yes

2.3 Can the logs tell how to reproduce the failure?

Yes, the logs record all of the input events.

2.4 How many machines needed?

1

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

Tables may show up, but the RegionServers never appear on the master's web page. Existing connections are fine; new connections cannot find the RegionServers. A client-side check is sketched below.
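
One way to observe this from a client is to ask the active master how many RegionServers have reported in. This is a hedged sketch; the HBaseAdmin/ClusterStatus calls are assumed from the 0.92 client API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Hedged sketch: after the fail-over, a new client can still reach the
// (backup) master, but the master shows no live RegionServers because
// they never report to it.
public class SymptomCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    // getClusterStatus().getServers() lists the RegionServers known to the
    // active master; with this bug the list stays empty.
    System.out.println("Live region servers: "
        + admin.getClusterStatus().getServers());
  }
}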

3.2 Backward inference

From the logs, we noticed that the primary HMaster had failed and that the backup HMaster had taken over; the RegionServers, however, kept retrying the old master's address (the repeated ConnectException shown above), which points to the retry logic in HRegionServer.getMaster().

4. Root cause

4.1 Category:

Incorrect error handling: the RegionServer is stuck inside a retry loop against the old master's address instead of giving up on it and opening a new connection to the backup HMaster. See the fix below.

5. Fix

-    while ((masterServerName = this.masterAddressManager.getMasterAddress()) == null) {

           <-- No master found; this is the error-handling code
-      if (!keepLooping()) return null;
-      LOG.debug("No master found; retry");
-      sleeper.sleep();
-    }
-    InetSocketAddress isa =
-      new InetSocketAddress(masterServerName.getHostname(), masterServerName.getPort());
    HMasterRegionInterface master = null;
    while (keepLooping() && master == null) {
+      masterServerName = this.masterAddressManager.getMasterAddress();
+      if (masterServerName == null) {
+        if (!keepLooping()) {
+          // give up with no connection.
+          LOG.debug("No master found and cluster is stopped; bailing out");
+          return null;
+        }
+        LOG.debug("No master found; retry");
+        sleeper.sleep();
+        continue;
+      }
+
+      InetSocketAddress isa =
+        new InetSocketAddress(masterServerName.getHostname(), masterServerName.getPort());
+

5.1 How?

In the RegionServer's error-handling code, instead of retrying against the old primary HMaster's address, the fix re-reads the current master address on every retry, so the RegionServer gives up on the stale address and opens a new connection to the backup HMaster.
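
The essence of the change is that the master address is re-resolved on every retry instead of once before the loop. A minimal, self-contained illustration of that pattern (plain Java, not HBase code; the class and method names here are hypothetical) looks like this:

import java.net.InetSocketAddress;
import java.util.function.Supplier;

// Illustration of the retry pattern applied by the fix (hypothetical names):
// re-resolve the current master address on every retry, so a fail-over to a
// backup master is picked up instead of retrying a stale address forever.
public class MasterLookupRetry {

  // Re-resolve the address on each attempt; give up only when asked to stop.
  static InetSocketAddress findMaster(Supplier<InetSocketAddress> addressSource,
                                      Supplier<Boolean> keepLooping,
                                      long sleepMillis) throws InterruptedException {
    while (keepLooping.get()) {
      // Re-read the currently registered master (e.g. from ZooKeeper) each time,
      // rather than caching the value before the loop as the buggy code did.
      InetSocketAddress current = addressSource.get();
      if (current != null) {
        return current;
      }
      Thread.sleep(sleepMillis);
    }
    return null; // cluster is stopping; bail out with no connection
  }

  public static void main(String[] args) throws InterruptedException {
    // Toy address source that "fails over" to a new master after a few lookups.
    final int[] calls = {0};
    Supplier<InetSocketAddress> source = () ->
        ++calls[0] < 3 ? null : new InetSocketAddress("backup-master", 60000);
    System.out.println(findMaster(source, () -> true, 10));
  }
}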