HBase-3627 Report

Version:

0.90.1

1. Symptom

Null Pointer Exception in RS’s open region thread

1.1 Severity

Critical

1.2 Was there exception thrown?

Yes.

NPE

1.2.1 Were there multiple exceptions?

Yes.

1. The openregion was slow (no log though)

2. Znode is down (cannot be connected):

DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020-0x22dc571dde04ca7 Unable to get data of znode /hbase/unassigned/0519dc3b62a569347526875048c37faa because node does not exist (not necessarily an error)

3. NPE:

ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_RS_OPEN_REGION

java.lang.NullPointerException

at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)

at org.apache.hadoop.hbase.executor.RegionTransitionData.fromBytes(RegionTransitionData.java:198)

at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:672)

at org.apache.hadoop.hbase.zookeeper.ZKAssign.retransitionNodeOpening(ZKAssign.java:585)

at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.tickleOpening(OpenRegionHandler.java:322)

at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:97)

at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151)

at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

1.3 Scope of the failure

Affect one RS’s open for one region.

2. How to reproduce this failure

2.0 Version

0.90.1

2.1 Configuration

STandard

2.2 Reproduction procedure

1. open a large region

2. remove the znode

I reversed the fix & run the test:

mvn test -Dtest=TestOpenRegionHandler

2.2.1 Timing order

2 occurs before 1 finishes

2.2.2 Events order externally controllable?

Yes

2.3 Can the logs tell how to reproduce the failure?

Yes

2.4 How many machines needed?

1. RS + Znode

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

NPE:

DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020-0x22dc571dde04ca7 Unable to get data of znode /hbase/unassigned/0519dc3b62a569347526875048c37faa because node does not exist (not necessarily an error)

ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_RS_OPEN_REGION

java.lang.NullPointerException

at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)

at org.apache.hadoop.hbase.executor.RegionTransitionData.fromBytes(RegionTransitionData.java:198)

at org.apache.hadoop.hbase.zookeeper.ZKAssign.transitionNode(ZKAssign.java:672)

at org.apache.hadoop.hbase.zookeeper.ZKAssign.retransitionNodeOpening(ZKAssign.java:585)

at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.tickleOpening(OpenRegionHandler.java:322)

at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:97)

at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151)

at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

3.2 Backward inference

We can first locate the msg:

DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020-0x22dc571dde04ca7 Unable to get data of znode /hbase/unassigned/0519dc3b62a569347526875048c37faa because node does not exist (not necessarily an error)

to:

  public static byte [] getDataNoWatch(ZooKeeperWatcher zkw, String znode,

      Stat stat)

  throws KeeperException {

    try {

      byte [] data = zkw.getZooKeeper().getData(znode, zkw, stat);

      logRetrievedMsg(zkw, znode, data, false);

      return data;

    } catch (KeeperException.NoNodeException e) {

      LOG.debug(zkw.prefix("Unable to get data of znode " + znode + " " +

          "because node does not exist (not necessarily an error)"));

      return null;

    } catch (KeeperException e) {

      LOG.warn(zkw.prefix("Unable to get data of znode " + znode), e);

      zkw.keeperException(e);

      return null;

    } catch (InterruptedException e) {

      LOG.warn(zkw.prefix("Unable to get data of znode " + znode), e);

      zkw.interruptedException(e);

      return null;

    }

  }

Notice it returned NULL.

In its caller:

 public static int transitionNode(..)

  throws KeeperException {

    String encoded = region.getEncodedName();

    .. ..

    // Read existing data of the node
    Stat stat = new Stat();
   byte [] existingBytes =
     ZKUtil.
getDataNoWatch(zkw, node, stat);
    RegionTransitionData existingData =
      RegionTransitionData.
fromBytes(existingBytes);

        ---> NullPointerException

4. Root cause

transitionNode didn’t handle the case where getDataNoWatch returned null (caused by znode down).

4.1 Category:

Incorrect error handling. Stmt coverage on the above highlighted exception handling code will expose the bug.

In fact, this is a typical NullPointerDereference bug, that should have been detected by tools like coverity.

5. Fix

5.1 How?

return -1 when getDataNoWatch returned null:

-    byte [] existingBytes =
-      ZKUtil.getDataNoWatch(zkw, node, stat);
+    byte [] existingBytes = ZKUtil.getDataNoWatch(zkw, node, stat);
+    if (existingBytes == null) {
+      // Node no longer exists.  Return -1. It means unsuccessful transition.
+      return -1;
+    }
    RegionTransitionData existingData =
      RegionTransitionData.fromBytes(existingBytes);