HBase-5606 Report

1. Symptom

HMaster’s log-splitting thread hangs in some corner cases.

1.1 Severity

Critical.

1.2 Was there an exception thrown?

Yes. Many intermediate exceptions.

E1: RS down (which triggered a log split).

E2: ZooKeeper connection loss:

19:32:24,657 WARN org.apache.hadoop.hbase.master.SplitLogManager$GetDataAsyncCallback: getdata rc = CONNECTIONLOSS /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020.1331752316170 retry=0
(This warning occurs multiple times.)

Finally, the log-splitting thread hangs.
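The znode name in the warning above appears to be the URL-encoded path of the HDFS log being split. A minimal sketch for decoding it while reading such logs follows; the class and method names are hypothetical helpers, not part of HBase.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Hypothetical helper: recovers the log path from a split-log znode path. */
final class SplitLogZnodeName {
    /** Takes the last path component of the znode and URL-decodes it. */
    static String toLogPath(String znodePath) {
        String encoded = znodePath.substring(znodePath.lastIndexOf('/') + 1);
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a compliant JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String znode = "/hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000"
                + "%2Fhbase%2F.logs%2Flinux-114.site%2C60020.1331752316170";
        System.out.println(toLogPath(znode));
    }
}
```

For the znode in E2 this yields hdfs://192.168.47.205:9000/hbase/.logs/linux-114.site,60020.1331752316170, i.e. the log directory of the RS that went down.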

1.2.1 Were there multiple exceptions?

Yes. E1 & E2 repeat multiple times, as shown above.

1.3 Scope of the failure

The regions from the failed RS are never recovered.

2. How to reproduce this failure

2.0 Version

0.92.0

2.1 Configuration

Standard.

2.2 Reproduction procedure

1. Shut down an RS.

2. Wait until the log split starts, then shut down ZK for a while.

3. Bring ZK back up.

2.2.1 Timing order

The timing of steps 2 and 3 is critical: ZK has to be shut down (step 2) after HMaster has created the znode, but before the data query on that znode has been answered.

2.2.2 Events order externally controllable?

Yes.

2.3 Can the logs tell how to reproduce the failure?

Yes.

2.4 How many machines needed?

1 (HMaster + RS + ZooKeeper).

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

When the hang occurs, a jstack dump of HMaster shows the log-splitting thread waiting:

"MASTER_META_SERVER_OPERATIONS-HOST-192-168-47-204,60000,1331719909985-1" prio=10 tid=0x0000000040d7c000 nid=0x624b in Object.wait() [0x00007ff090482000]
  java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.hadoop.hbase.master.SplitLogManager.waitTasks(SplitLogManager.java:316)
        - locked <0x000000078e6c4258> (a org.apache.hadoop.hbase.master.SplitLogManager$TaskBatch)
        at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:262)
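
The TIMED_WAITING frame in waitTasks, holding the TaskBatch monitor, is consistent with a loop that waits until every task in the batch is accounted for. The following is a toy model of that pattern, not the HBase source: the class, the field names, and the 100 ms wait slice are assumptions made for illustration.

```java
import java.util.concurrent.TimeUnit;

/** Toy model: a waiter blocks until done + error == installed. */
final class TaskBatchSketch {
    private int installed; // tasks submitted for this batch
    private int done;      // tasks that finished successfully
    private int error;     // tasks that finished with a failure

    synchronized void install()   { installed++; }
    synchronized void markDone()  { done++;  notifyAll(); }
    synchronized void markError() { error++; notifyAll(); }

    /**
     * Waits in timed slices on the batch monitor.
     * Returns true only if all installed tasks were accounted for in time.
     */
    synchronized boolean waitTasks(long maxWaitMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
        while (done + error != installed) {
            long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remainingMs <= 0) {
                return false; // counters never balanced
            }
            try {
                wait(Math.min(remainingMs, 100L));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        TaskBatchSketch batch = new TaskBatchSketch();
        batch.install();
        batch.install();
        batch.markDone(); // the first task completes
        // The second task's outcome is lost, so neither counter is bumped.
        System.out.println("completed=" + batch.waitTasks(300));
    }
}
```

In this model the waiter gives up after a deadline only so that the example terminates. A loop without such a deadline would stay in Object.wait() indefinitely once a task's outcome is lost, which matches the reported hang.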

3.2 Backward inference

See the graph above.

4. Root cause

See above.

4.1 Category:

Semantic

5. Fix

5.1 How?

Set the proper install/done/error values.
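
One way to keep such counters consistent is to make every terminal path of a task, including giving up after a failed ZK read, update exactly one counter exactly once. The sketch below illustrates that idea only. It is not the actual patch, and all names in it (BatchAccountingSketch, onReadFailure, and so on) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative accounting: each task is counted once as done or error. */
final class BatchAccountingSketch {
    enum State { IN_PROGRESS, SUCCESS, FAILURE }

    private final Map<String, State> tasks = new HashMap<>();
    int installed;
    int done;
    int error;

    /** Registers a task; re-installing a known path does not double count. */
    synchronized void install(String path) {
        if (tasks.putIfAbsent(path, State.IN_PROGRESS) == null) {
            installed++;
        }
    }

    /** Records a terminal state at most once per task and wakes any waiter. */
    synchronized void finish(String path, boolean success) {
        if (tasks.get(path) != State.IN_PROGRESS) {
            return; // unknown task, or already accounted for
        }
        tasks.put(path, success ? State.SUCCESS : State.FAILURE);
        if (success) {
            done++;
        } else {
            error++;
        }
        notifyAll();
    }

    /**
     * Hypothetical callback for a failed async read of the task znode.
     * Returns true if the caller should retry; otherwise the task is
     * recorded as an error so the batch can still complete.
     */
    synchronized boolean onReadFailure(String path, int retriesLeft) {
        if (retriesLeft > 0) {
            return true;
        }
        finish(path, false);
        return false;
    }

    synchronized boolean balanced() {
        return done + error == installed;
    }

    public static void main(String[] args) {
        BatchAccountingSketch batch = new BatchAccountingSketch();
        batch.install("log-a");
        batch.install("log-b");
        batch.finish("log-a", true);
        batch.onReadFailure("log-b", 0); // retries exhausted: count as error
        System.out.println("balanced=" + batch.balanced());
    }
}
```

With this accounting, a task whose read fails for good is surfaced as an error rather than being left pending, so a waiter on the batch can return and the caller can decide whether to retry the split.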