HBase-3332 Report

https://issues.apache.org/jira/browse/Hbase-3332

1. Symptom

After the RS holding ROOT and META goes down and the balancer is run, some regions get stuck in transition forever, causing data loss for those regions.

1.1 Severity

Blocker

1.2 Was there an exception thrown?

Yes:

1. RS goes down

2. Final data loss

1.2.1 Were there multiple exceptions?

Yes

1.3 Scope of the failure

Multiple regions: significant data loss

2. How to reproduce this failure

2.0 Version

0.90.0

2.1 Configuration

Standard

2.2 Reproduction procedure

1. The RS holding ROOT and META goes down (disconnect)

2. The balancer is started (feature start)

2.2.1 Timing order

In this order

2.2.2 Events order externally controllable?

No. Event 2 must occur before the HMaster detects event 1.

2.3 Can the logs tell how to reproduce the failure?

Yes

2.4 How many machines needed?

2 (HMaster + RS)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The log shows the events above. The final symptom also shows that the regions were left in pending states, and the log contains messages such as the following:

Regions in transition timed out:  usertable,user876353700,1291843173343.182d193c6317cab6486eab93bf95b6a7. state=PENDING_CLOSE, ts=1292005206699
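
When scanning a large master log for this symptom, it can help to pull the region name, state, and timestamp out of each such message. The following is a minimal sketch, assuming the message layout is exactly as in the line above; the class name RitTimeoutLine and the regular expression are illustrative and are not part of HBase.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RitTimeoutLine {
    // Assumed layout: "Regions in transition timed out:  <region> state=<STATE>, ts=<millis>"
    private static final Pattern LINE = Pattern.compile(
        "Regions in transition timed out:\\s+(\\S+)\\s+state=(\\w+), ts=(\\d+)");

    final String region;
    final String state;
    final long ts;

    private RitTimeoutLine(String region, String state, long ts) {
        this.region = region;
        this.state = state;
        this.ts = ts;
    }

    // Returns the parsed fields, or null if the line does not contain the message.
    static RitTimeoutLine parse(String logLine) {
        Matcher m = LINE.matcher(logLine);
        if (!m.find()) {
            return null;
        }
        return new RitTimeoutLine(m.group(1), m.group(2), Long.parseLong(m.group(3)));
    }

    public static void main(String[] args) {
        String line = "Regions in transition timed out:  "
            + "usertable,user876353700,1291843173343.182d193c6317cab6486eab93bf95b6a7. "
            + "state=PENDING_CLOSE, ts=1292005206699";
        RitTimeoutLine parsed = parse(line);
        System.out.println(parsed.region + " -> " + parsed.state + " @ " + parsed.ts);
    }
}
```

Filtering the parsed lines on state PENDING_CLOSE gives the set of regions to follow in the backward inference.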

3.2 Backward inference

The message above shows that the regions are in the PENDING_CLOSE state.

In ServerShutdownHandler.java:

-    // Remove regions that were in transition
-    for (HRegionInfo rit : regionsInTransition) hris.remove(rit);
-    LOG.info("Reassigning the " + hris.size() + " region(s) that " + serverName
+    // Skip regions that were in transition unless CLOSING or PENDING_CLOSE
+    for (RegionState rit : regionsInTransition) {
+      if (!rit.isClosing() && !rit.isPendingClose()) {
+        hris.remove(rit.getRegion());
+      }
+    }
+

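The removed lines above (prefixed with "-", shown in red in the original patch) handle regions in transition when a server shuts down, and they are wrong: every region in transition is removed from hris, the list of regions to reassign, regardless of its state.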
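
The difference between the old and the patched filtering can be reproduced outside HBase. The following is a minimal sketch, assuming simplified stand-ins for the real classes: State, RegionInTransition, buggyFilter, and fixedFilter are hypothetical names for illustration, and regions are represented as plain strings rather than HRegionInfo objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShutdownReassignSketch {
    // Hypothetical stand-in for the region states relevant to this report.
    enum State { PENDING_OPEN, OPENING, PENDING_CLOSE, CLOSING }

    // Hypothetical stand-in for HBase's RegionState: a region name plus its state.
    static final class RegionInTransition {
        final String region;
        final State state;

        RegionInTransition(String region, State state) {
            this.region = region;
            this.state = state;
        }
    }

    // Old behavior: every region in transition is dropped from the reassignment list.
    static List<String> buggyFilter(List<String> hris, List<RegionInTransition> rits) {
        List<String> toReassign = new ArrayList<>(hris);
        for (RegionInTransition rit : rits) {
            toReassign.remove(rit.region);
        }
        return toReassign;
    }

    // Patched behavior: regions that are CLOSING or PENDING_CLOSE stay in the list.
    static List<String> fixedFilter(List<String> hris, List<RegionInTransition> rits) {
        List<String> toReassign = new ArrayList<>(hris);
        for (RegionInTransition rit : rits) {
            if (rit.state != State.CLOSING && rit.state != State.PENDING_CLOSE) {
                toReassign.remove(rit.region);
            }
        }
        return toReassign;
    }

    public static void main(String[] args) {
        // Regions that were hosted on the dead RS.
        List<String> hris = Arrays.asList("r1", "r2", "r3");
        // r1 was being closed by the balancer; r2 was being opened.
        List<RegionInTransition> rits = Arrays.asList(
            new RegionInTransition("r1", State.PENDING_CLOSE),
            new RegionInTransition("r2", State.OPENING));

        System.out.println("old:     " + buggyFilter(hris, rits)); // prints "old:     [r3]"
        System.out.println("patched: " + fixedFilter(hris, rits)); // prints "patched: [r1, r3]"
    }
}
```

In the old version r1 is never reassigned, which corresponds to the region that stays in PENDING_CLOSE in the log.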

4. Root cause

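Regions in the PENDING_CLOSE state are blindly removed from the list of regions to reassign. Because the RS that was asked to close them is already down, the close never completes, and since the shutdown handler skips them, they are never reassigned.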

4.1 Category:

Incorrect error handling (handled, statement coverage): once the removed ("red") code above is executed, it removes the regions and the failure appears.