HBase-7259 Report

https://issues.apache.org/jira/browse/HBASE-7259

1. Symptom

Deadlock on Client. Client stalls.

HBaseClient was running after a period of time, all of get operation became too slow.

1.1 Severity

Critical.

1.2 Was there exception thrown?

Yes.

1.2.1 Were there multiple exceptions?

No. 1 exception is enough.

1.3 Scope of the failure

1 node + 1 server, single client

2. How to reproduce this failure

2.0 Version

0.94.2

2.1 Configuration

standard

2.2 Reproduction procedure

1. Start master

2.2.1 Timing order

1. an KeeperException btw client and ZooKeeper;

2. the particular dependencies btw threads

2.2.2 Events order externally controllable?

No. Concurrency bug. At least, event thread could not be controlled.

2.3 Can the logs tell how to reproduce the failure?

Yes

2.4 How many machines needed?

2

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

If hang, we should use jstack, and it is not hard to notice a deadlock in the jstack:

Catalina.out found one Java-level deadlock:

=============================

"catalina-exec-800":
 waiting to lock monitor 0x000000005f1f6530 (object 0x0000000731902200, a java.lang.Object),
 which is held by "catalina-exec-710"
"catalina-exec-710":
 waiting to lock monitor 0x00002aaab9a05bd0 (object 0x00000007321f8708, a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation),
 which is held by "catalina-exec-29-EventThread"
"catalina-exec-29-EventThread":
 waiting to lock monitor 0x000000005f9f0af0 (object 0x0000000732a9c7e0, a org.apache.hadoop.hbase.zookeeper.RootRegionTracker),
 which is held by "catalina-exec-710"
Java stack information for the threads listed above:

===================================================

"catalina-exec-710":
       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:599)
       - waiting to lock <0x00000007321f8708> (a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation)
       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1660)
       at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.getData(ZooKeeperNodeTracker.java:158)
       - locked <0x0000000732a9c7e0> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
       at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.getRootRegionLocation(RootRegionTracker.java:62)
       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:821)
       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:933)
       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:832)
       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
       at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:234)
       at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:174)
       at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:150)
       at org.apache.hadoop.hbase.client.MetaScanner.access$000(MetaScanner.java:48)
       at org.apache.hadoop.hbase.client.MetaScanner$1.connect(MetaScanner.java:126)
       at org.apache.hadoop.hbase.client.MetaScanner$1.connect(MetaScanner.java:123)
       at org.apache.hadoop.hbase.client.HConnectionManager.execute(HConnectionManager.java:359)
       at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:123)
       at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:99)
       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:894)
       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:948)
       - locked <0x0000000731902200> (a java.lang.Object)
       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725)
       at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82)
       at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162)
       at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
       at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
"catalina-exec-29-EventThread":
       at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.stop(ZooKeeperNodeTracker.java:98)
       - waiting to lock <0x0000000732a9c7e0> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:604)
       - locked <0x00000007321f8708> (a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation)
       at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1660)
       at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:374)
       at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:271)
       at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:521)
       at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497)
Found 1 deadlock.

3.2 Backward inference

We can notice the deadlock from jstack (see the above graph).

4. Root cause

Lock order is wrong.

4.1 Category:

Concurrency

5. Fix

5.1 How?

Avoid duplicated abort().