HBase-3446 Report
https://issues.apache.org/jira/browse/HBASE-3446?attachmentSortBy=dateTime#attachmentmodule
When region servers are shut down and, at the same time, the RS serving the META region becomes unreachable, there is massive data loss: all the regions served by the shut-down RSs are lost.
Blocker
Yes
Yes:
1. RS shutdown,
2. Cannot connect to the META region server
3. Data loss
Massive data loss
0.90.0
Standard
1. Shutdown data RS (shutdown)
2. Shutdown META-region RS (disconnect)
In this order
No. Multi-node: the META-region RS disconnect must occur after “ServerShutdownHandler.process” has checked its liveness.
Yes. The two events are well documented.
3. 2 RS (1 holding the META region, the other holding only data regions) + 1 HMaster
2011-01-16 18:03:26,164 DEBUG org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Offlined and split region usertable,user136857679,1295149082811.9f2822a04028c86813fe71264da5c167.; checking daughter presence
2011-01-16 18:03:26,169 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Server not running
at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2360)
at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1754)
...
at $Proxy6.openScanner(Unknown Source)
at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:260)
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.isDaughterMissing(ServerShutdownHandler.java:256)
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.fixupDaughter(ServerShutdownHandler.java:214)
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.fixupDaughters(ServerShutdownHandler.java:196)
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.processDeadRegion(ServerShutdownHandler.java:181)
at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:151)
It’s easy to locate the code:
ServerShutdownHandler.process():
    // Wait on meta to come online; we need it to progress.
    // TODO: Best way to hold strictly here? We should build this retry logic
    // into the MetaReader operations themselves.
    NavigableMap<HRegionInfo, Result> hris = null;
    while (!this.server.isStopped()) {
      try {
        this.server.getCatalogTracker().waitForMeta();
        hris = MetaReader.getServerUserRegions(this.server.getCatalogTracker(),
          this.hsi);
        break;
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IOException("Interrupted", e);
      } catch (IOException ioe) {
        LOG.info("Received exception accessing META during server shutdown of " +
          serverName + ", retrying META read");
      }
    }
<- This loop waits until the META RS can be found!
<- Inject the META RS disconnect here!!!
.. .. ..
    LOG.info("Reassigning " + hris.size() + " region(s) that " + serverName +
      " was carrying (skipping " + regionsInTransition.size() +
      " regions(s) that are already in transition)");
    // Iterate regions that were on this server and assign them
    for (Map.Entry<HRegionInfo, Result> e: hris.entrySet()) {
      if (processDeadRegion(e.getKey(), e.getValue(),
          this.services.getAssignmentManager(),
          this.server.getCatalogTracker())) {
        this.services.getAssignmentManager().assign(e.getKey(), true);
      }
    }
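The retry loop above only protects the initial META read; the later fullScan-dependent phase has no such protection. A minimal, self-contained sketch of that retry pattern (hypothetical names such as MetaRead and retryMetaRead are illustrative, not HBase APIs), showing how a META-dependent read keeps retrying instead of letting the IOException escape:

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class MetaRetrySketch {

    // Hypothetical stand-in for a META-dependent read such as MetaReader.fullScan.
    interface MetaRead<T> {
        T read() throws IOException;
    }

    // Retry a META-dependent read until it succeeds or we are told to stop,
    // mirroring the while (!this.server.isStopped()) loop in process().
    static <T> T retryMetaRead(MetaRead<T> op, BooleanSupplier stopped)
            throws IOException {
        while (!stopped.getAsBoolean()) {
            try {
                return op.read();
            } catch (IOException ioe) {
                // META RS unreachable: keep retrying, as the loop before
                // getServerUserRegions does, instead of letting the exception
                // escape to EventHandler.run() where it is only logged.
            }
        }
        throw new IOException("Server stopped before META became available");
    }

    public static void main(String[] args) throws IOException {
        AtomicInteger attempts = new AtomicInteger();
        // Simulated META read: fails twice (META RS disconnected), then succeeds.
        String regions = retryMetaRead(() -> {
            if (attempts.incrementAndGet() < 3) {
                throw new IOException("Server not running");
            }
            return "regions";
        }, () -> false);
        System.out.println(regions + " after " + attempts.get() + " attempts");
    }
}
```

The TODO comment in the original code already hints at this direction: building the retry logic into the MetaReader operations themselves would cover every call site, including the one reached through processDeadRegion.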
process() eventually calls MetaReader.fullScan to find the META information about the regions on the dead RS. But the META RS cannot be located, so an exception is thrown, which is eventually observed in:
hbase.executor.EventHandler:
  public void run() {
    try {
      if (getListener() != null) getListener().beforeProcess(this);
      process();
      if (getListener() != null) getListener().afterProcess(this);
    } catch(Throwable t) {
      LOG.error("Caught throwable while processing event " + eventType, t);
    }
  }
And it is not handled further…
So basically, from processDeadRegion onward, the META-RS-unreachable event is not handled at all.
The developers simply did not anticipate that the META server could become unreachable inside processDeadRegion: they assumed it must be up, since process() had already checked the META server’s liveness before reaching that point.
Incorrect error handling (not handled)
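One way the dropped event could be handled, sketched below under assumed, simplified names (Event, runWithResubmit; this is not the actual HBase fix): instead of only logging in EventHandler.run(), a failed M_SERVER_SHUTDOWN event could be re-queued so the dead server’s regions are still reassigned once META comes back.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Queue;

public class EventResubmitSketch {
    // Hypothetical stand-in for an executor event such as M_SERVER_SHUTDOWN.
    interface Event { void process() throws Exception; }

    // Run queued events, re-queueing any that fail instead of dropping them.
    static int runWithResubmit(Queue<Event> queue, int maxAttempts) {
        int attempts = 0;
        while (!queue.isEmpty() && attempts < maxAttempts) {
            Event e = queue.poll();
            attempts++;
            try {
                e.process();
            } catch (Throwable t) {
                // Unlike EventHandler.run(), which only logs the Throwable,
                // requeue the event so shutdown processing is retried.
                queue.add(e);
            }
        }
        return attempts;
    }

    public static void main(String[] args) {
        int[] remainingFailures = {2}; // fail twice (META unreachable), then succeed
        Queue<Event> q = new ArrayDeque<>();
        q.add(() -> {
            if (remainingFailures[0]-- > 0) {
                throw new IOException("Server not running");
            }
        });
        System.out.println("processed after " + runWithResubmit(q, 10) + " attempts");
    }
}
```

The maxAttempts bound matters: unbounded resubmission of an event that can never succeed (e.g., META permanently gone) would spin forever, so a real fix would also need a stop condition tied to cluster shutdown.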