HBASE-4695 Report

1. Symptom

Massive data loss: the WAL is deleted before the region server has finished flushing its regions.

1.1 Severity

Blocker

1.2 Was there an exception thrown?

Yes: the region server could not be reached for the flush, followed by a final error

1.2.1 Were there multiple exceptions?

Yes

1.3 Scope of the failure

Massive data loss

2. How to reproduce this failure

2.0 Version

0.90.4

2.1 Configuration

Standard

2.2 Reproduction procedure

1. Shut down a region server (shutdown)

2.2.1 Timing order

2.2.2 Events order externally controllable?

No. The shutdown has to occur while the region server is in a particular state

2.3 Can the logs tell how to reproduce the failure?

Yes

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

1. The log shows the WAL being deleted (or you notice the data loss):

09:26:41,607 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root ip=/10.101.1.5 cmd=delete src=/hbase/.logs/rdaa5.prod.imageshack.com,60020,1319749

2. You notice the RS was shutting down and still waiting on regions to close:

09:36:54,665 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 489 regions to close

09:56:35,779 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 116 regions to close
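
What ties the two observations together is their order: the audit entry that deletes the WAL directory is timestamped before the region server lines that still report regions waiting to close. A minimal Java sketch of that comparison follows. It assumes both logs are from the same day and that the NameNode and region server clocks are in sync, which the excerpts do not confirm. The class and method names are hypothetical; the timestamps are copied from the excerpts above.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

public class WalDeleteTimeline {
    // Matches the time-of-day prefix used in the log excerpts, e.g. "09:26:41,607".
    private static final DateTimeFormatter LOG_TIME =
        DateTimeFormatter.ofPattern("HH:mm:ss,SSS");

    static LocalTime parse(String stamp) {
        return LocalTime.parse(stamp, LOG_TIME);
    }

    // True if the WAL delete was logged before the given region server log line.
    static boolean deletedBefore(String walDeleteStamp, String regionServerStamp) {
        return parse(walDeleteStamp).isBefore(parse(regionServerStamp));
    }

    public static void main(String[] args) {
        String walDelete = "09:26:41,607";     // NameNode audit: cmd=delete on the WAL directory
        String stillClosing = "09:56:35,779";  // HRegionServer: "Waiting on 116 regions to close"

        System.out.println("WAL deleted before regions finished closing: "
            + deletedBefore(walDelete, stillClosing));
        System.out.println("Gap in minutes: "
            + Duration.between(parse(walDelete), parse(stillClosing)).toMinutes());
    }
}
```

If the check is true, any edits still held in memory by the regions that had not closed yet no longer had a WAL to be replayed from.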

3.2 Backward inference

The handling code is easy to locate. It is shown here as the patch that fixes it:

--- src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java (revision 1195765)
+++ src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java (working copy)
@@ -654,12 +654,10 @@
       } else if (abortRequested) {
         if (this.fsOk) {
           closeAllRegions(abortRequested); // Don't leave any open file handles
-          closeWAL(false);
         }
         LOG.info("aborting server at: " + this.serverInfo.getServerName());
       } else {
         closeAllRegions(abortRequested);
-        closeWAL(true);
         closeAllScanners();
         LOG.info("stopping server at: " + this.serverInfo.getServerName());
       }
@@ -668,7 +666,11 @@
     if (this.catalogTracker != null) this.catalogTracker.stop();
     if (this.fsOk)
       waitOnAllRegionsToClose(abortRequested);
-
+
+    //fsOk flag may be changed when closing region throws exception.
+    if (!this.killed && this.fsOk) {
+      closeWAL(abortRequested ? false : true);
+    }
     // Make sure the proxy is down.
     if (this.hbaseMaster != null) {
       HBaseRPC.stopProxy(this.hbaseMaster);

4. Root cause

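A failure during RS shutdown was not properly handled. The shutdown path called closeWAL() immediately after closeAllRegions(), before waitOnAllRegionsToClose() had confirmed that every region was closed and flushed. When closing the regions was slow or failed, the WAL was already gone, so the unflushed edits could not be recovered.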

4.1 Category:

Incorrect error handling (statement coverage)

5. Fix

5.1 How?

Wait until the region server has finished closing its regions, and check that the file system is still OK, before closing the WAL:

if (!this.killed && this.fsOk) {
  closeWAL(abortRequested ? false : true);
}
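
The ordering that the patch enforces can be sketched outside HBase. The Java below is a simplified model, not the HRegionServer code: the class `ShutdownOrderSketch`, its `shutdown` method, and the step labels it records are hypothetical. It also assumes that the boolean passed to closeWAL selects whether the log is deleted on close, which this report does not state, and it treats fsOk as fixed, whereas the patch comment notes the real flag may change while regions are closing.

```java
import java.util.ArrayList;
import java.util.List;

public class ShutdownOrderSketch {

    // Records the order of shutdown steps for the patched control flow.
    static List<String> shutdown(boolean abortRequested, boolean killed, boolean fsOk) {
        List<String> steps = new ArrayList<>();

        if (abortRequested) {
            if (fsOk) {
                steps.add("closeAllRegions");
            }
        } else {
            steps.add("closeAllRegions");
            steps.add("closeAllScanners");
        }

        if (fsOk) {
            steps.add("waitOnAllRegionsToClose");
        }

        // The WAL is touched only after the wait, and only if the server
        // was not killed and the file system is still usable.
        if (!killed && fsOk) {
            // Assumption: false keeps the log for recovery, true deletes it.
            steps.add(abortRequested ? "closeWAL(keep)" : "closeWAL(delete)");
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println("clean stop:     " + shutdown(false, false, true));
        System.out.println("abort:          " + shutdown(true, false, true));
        System.out.println("file system bad: " + shutdown(false, false, false));
    }
}
```

In every case the WAL step, when present, is the last one recorded, and it is absent when the file system check fails, so the log stays available for recovery.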