Massive data loss caused by WAL deleted before region server can fully flush
Blocker
Yes: region server cannot be reached for flush + final error
Yes
Massive data loss
0.90.4
Standard
1. Shut down the region server (shutdown)
NA
No. Has to occur at a particular state
Yes
1. The log shows that the WAL was deleted (or you notice the data loss):
09:26:41,607 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root ip=/10.101.1.5 cmd=delete src=/hbase/.logs/rdaa5.prod.imageshack.com,60020,1319749
2. You notice the RS was down:
09:36:54,665 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 489 regions to close
09:56:35,779 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 116 regions to close
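The two symptoms above can be cross-checked by comparing timestamps: if the WAL delete is logged before a "Waiting on N regions to close" line, the log was removed while regions were still closing. The following is a minimal sketch of that check, assuming both lines come from the same day and start with an HH:mm:ss,SSS timestamp. The class and method names are hypothetical, and the sample lines are abbreviated from the excerpts above.

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

public class WalTimelineCheck {
    // Timestamp layout used at the start of the log lines quoted above.
    private static final DateTimeFormatter TS =
        DateTimeFormatter.ofPattern("HH:mm:ss,SSS");

    /** Parses the leading 12-character timestamp, e.g. "09:26:41,607". */
    static LocalTime timestampOf(String logLine) {
        return LocalTime.parse(logLine.substring(0, 12), TS);
    }

    /**
     * Returns true if the WAL delete was logged before a line showing that
     * regions were still closing (assumption: both lines are from the same day).
     */
    static boolean walDeletedWhileRegionsClosing(String deleteLine, String waitingLine) {
        return timestampOf(deleteLine).isBefore(timestampOf(waitingLine));
    }

    public static void main(String[] args) {
        // Abbreviated sample lines; the full paths are elided on purpose.
        String deleteLine = "09:26:41,607 INFO FSNamesystem.audit: cmd=delete src=/hbase/.logs/...";
        String waitingLine = "09:36:54,665 INFO HRegionServer: Waiting on 489 regions to close";
        System.out.println(walDeletedWhileRegionsClosing(deleteLine, waitingLine));
    }
}
```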
Easy to locate the following handling code:
--- src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java (revision 1195765)
+++ src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java (working copy)
@@ -654,12 +654,10 @@
} else if (abortRequested) {
if (this.fsOk) {
closeAllRegions(abortRequested); // Don't leave any open file handles
- closeWAL(false);
}
LOG.info("aborting server at: " + this.serverInfo.getServerName());
} else {
closeAllRegions(abortRequested);
- closeWAL(true);
closeAllScanners();
LOG.info("stopping server at: " + this.serverInfo.getServerName());
}
@@ -668,7 +666,11 @@
if (this.catalogTracker != null) this.catalogTracker.stop();
if (this.fsOk)
waitOnAllRegionsToClose(abortRequested);
-
+
+ //fsOk flag may be changed when closing region throws exception.
+ if (!this.killed && this.fsOk) {
+ closeWAL(abortRequested ? false : true);
+ }
// Make sure the proxy is down.
if (this.hbaseMaster != null) {
HBaseRPC.stopProxy(this.hbaseMaster);
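The RS shutdown failure was not properly handled: closeWAL was called right after closeAllRegions, before waitOnAllRegionsToClose had confirmed that the regions were closed, so the WAL could be deleted while regions were still closing.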
Incorrect error handling (stmt coverage)
5. Fix
5.1 How?
Wait until the region server has finished closing all regions and the file system is OK before closing the WAL.
    if (!this.killed && this.fsOk) {
      closeWAL(abortRequested ? false : true);
    }
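
The ordering that the fix enforces can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: Server is a hypothetical stand-in, not the HBase HRegionServer class; the regionCloseFails flag and the events list exist purely for illustration; and the boolean passed to closeWAL is assumed to mean "delete the log", matching how the patch passes false on abort.

```java
import java.util.ArrayList;
import java.util.List;

public class WalShutdownSketch {
    /** Hypothetical stand-in for a region server; not the HBase class. */
    static class Server {
        final List<String> events = new ArrayList<>();
        boolean killed = false;
        boolean fsOk = true;
        boolean abortRequested = false;
        boolean regionCloseFails = false; // illustrative failure switch

        void closeAllRegions() {
            events.add("closeAllRegions");
        }

        void waitOnAllRegionsToClose() {
            events.add("waitOnAllRegionsToClose");
            // Models the patch comment: fsOk may change when closing a region fails.
            if (regionCloseFails) {
                fsOk = false;
            }
        }

        // Assumption: 'delete' controls whether the WAL is removed on close.
        void closeWAL(boolean delete) {
            events.add(delete ? "closeWAL(delete)" : "closeWAL(keep)");
        }

        /** Shutdown in the patched order: the WAL is handled last. */
        void shutdown() {
            closeAllRegions();
            if (fsOk) {
                waitOnAllRegionsToClose();
            }
            // Only touch the WAL once regions are closed and the file system is OK.
            if (!killed && fsOk) {
                closeWAL(!abortRequested);
            }
        }
    }

    public static void main(String[] args) {
        Server clean = new Server();
        clean.shutdown();
        System.out.println(clean.events);

        Server failing = new Server();
        failing.regionCloseFails = true;
        failing.shutdown();
        System.out.println(failing.events);
    }
}
```

In this model a clean shutdown ends with the WAL being closed and deleted, while a shutdown in which region closing fails never reaches closeWAL, so the log stays available for recovery.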