HBASE-3446 Report

https://issues.apache.org/jira/browse/HBASE-3446?attachmentSortBy=dateTime#attachmentmodule

1. Symptom

When region servers are being shut down, if the RS serving the META region becomes unreachable at the same time, there will be massive data loss (all the regions served by the shut-down RSs are lost, because they are never reassigned).

1.1 Severity

Blocker

1.2 Was there an exception thrown?

Yes

1.2.1 Were there multiple exceptions?

Yes:

1. RS shutdown

2. Cannot connect to the META region server

3. Data loss

1.3 Scope of the failure

Massive data loss

2. How to reproduce this failure

2.0 Version

0.90.0

2.1 Configuration

Standard

2.2 Reproduction procedure

1. Shut down the RS serving only data regions (clean shutdown)

2. Disconnect the RS serving the META region (network disconnect)

2.2.1 Timing order

In this order

2.2.2 Events order externally controllable?

No. This is a multi-node timing issue: the META-region RS disconnect must occur after ServerShutdownHandler.process has already checked its liveness (i.e., after waitForMeta() returns in the retry loop).
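This ordering cannot be forced from outside the cluster, but a minimal, self-contained Java sketch (all names hypothetical; this is not HBase test code) shows how a latch could pin the disconnect into the window after the liveness check:

import java.util.concurrent.CountDownLatch;

public class OrderingHarness {
  public static void main(String[] args) throws InterruptedException {
    final CountDownLatch livenessChecked = new CountDownLatch(1);

    // Stand-in for ServerShutdownHandler.process() on the HMaster.
    Thread handler = new Thread(new Runnable() {
      public void run() {
        // ... waitForMeta() succeeds here (liveness check passes) ...
        livenessChecked.countDown();   // open the race window
        // ... MetaReader.fullScan() would run next and fail ...
      }
    });

    // Stand-in for the META-region RS disconnect.
    Thread fault = new Thread(new Runnable() {
      public void run() {
        try {
          livenessChecked.await();     // fire only after the check
          // ... cut connectivity to the META-region RS here ...
        } catch (InterruptedException ignored) { }
      }
    });

    handler.start();
    fault.start();
    handler.join();
    fault.join();
  }
}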

2.3 Can the logs tell how to reproduce the failure?

Yes. Both events are clearly visible in the logs.

2.4 How many machines needed?

3 machines: 2 RSs (one holding the META region, the other holding only data regions) + 1 HMaster.

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

2011-01-16 18:03:26,164 DEBUG org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Offlined and split region usertable,user136857679,1295149082811.9f2822a04028c86813fe71264da5c167.; checking daughter presence

2011-01-16 18:03:26,169 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Server not running
    at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2360)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1754)
    ...
    at $Proxy6.openScanner(Unknown Source)
    at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:260)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.isDaughterMissing(ServerShutdownHandler.java:256)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.fixupDaughter(ServerShutdownHandler.java:214)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.fixupDaughters(ServerShutdownHandler.java:196)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.processDeadRegion(ServerShutdownHandler.java:181)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:151)

3.2 Backward inference

The stack trace makes it easy to locate the relevant code:

ServerShutdownHandler.process():

    // Wait on meta to come online; we need it to progress.
    // TODO: Best way to hold strictly here?  We should build this retry logic
    //       into the MetaReader operations themselves.
    NavigableMap<HRegionInfo, Result> hris = null;
    while (!this.server.isStopped()) {
      try {
        this.server.getCatalogTracker().waitForMeta();
        hris = MetaReader.getServerUserRegions(this.server.getCatalogTracker(),
            this.hsi);
        break;
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IOException("Interrupted", e);
      } catch (IOException ioe) {
        LOG.info("Received exception accessing META during server shutdown of " +
            serverName + ", retrying META read");
      }
    }

   <- This retry loop waits until the META RS is reachable.

   <- A META RS disconnect introduced after this point triggers the failure.

    .. .. ..

    LOG.info("Reassigning " + hris.size() + " region(s) that " + serverName +
      " was carrying (skipping " + regionsInTransition.size() +
      " regions(s) that are already in transition)");
    // Iterate regions that were on this server and assign them
    for (Map.Entry<HRegionInfo, Result> e: hris.entrySet()) {
      if (processDeadRegion(e.getKey(), e.getValue(),
          this.services.getAssignmentManager(),
          this.server.getCatalogTracker())) {
        this.services.getAssignmentManager().assign(e.getKey(), true);
      }
    }
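To make the window concrete, here is a minimal, self-contained sketch (all names are hypothetical stand-ins) of the check-then-act race: the liveness check in the retry loop passes, the META RS then disconnects, and the later read fails outside any retry logic:

import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

public class CheckThenActRace {
  static final AtomicBoolean metaUp = new AtomicBoolean(true);

  // Stand-in for catalogTracker.waitForMeta() inside the retry loop.
  static void waitForMeta() throws IOException {
    if (!metaUp.get()) throw new IOException("META not available");
  }

  // Stand-in for MetaReader.fullScan() reached from processDeadRegion().
  static void fullScan() throws IOException {
    if (!metaUp.get()) throw new IOException("Server not running");
  }

  public static void main(String[] args) {
    try {
      waitForMeta();       // liveness check passes...
      metaUp.set(false);   // ...META RS disconnects in the window...
      fullScan();          // ...and the read fails outside the retry loop
    } catch (IOException e) {
      // In 0.90.0 this propagates to EventHandler.run(), which only logs it.
      System.out.println("Escapes process(): " + e.getMessage());
    }
  }
}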

processDeadRegion eventually calls MetaReader.fullScan (via fixupDaughters, fixupDaughter, and isDaughterMissing, as the stack trace shows) to look up the META entries for the regions of the dead RS. Since the META RS can no longer be reached, an IOException is thrown; it propagates out of process() and is finally caught in:

hbase.executor.EventHandler:

  public void run() {
    try {
      if (getListener() != null) getListener().beforeProcess(this);
      process();
      if (getListener() != null) getListener().afterProcess(this);
    } catch(Throwable t) {
      LOG.error("Caught throwable while processing event " + eventType, t);
    }
  }

The Throwable is only logged here; it is not handled any further.

So from processDeadRegion onward, the META-RS-unreachable condition is not handled at all: the M_SERVER_SHUTDOWN event is dropped after the log message, and the regions of the dead RS are never reassigned, which is the data loss described in the symptom.
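One way to close the gap (a sketch under assumptions; this is not the actual HBASE-3446 patch) is to wrap every META-dependent step in the same retry-until-stopped loop that process() already uses for getServerUserRegions, so that a META RS disconnect during processDeadRegion is retried instead of escaping to EventHandler.run():

import java.io.IOException;
import java.util.concurrent.Callable;

public class RetryMetaRead {
  // Hypothetical stand-in for the master's stop flag.
  interface ServerStub {
    boolean isStopped();
  }

  // Hypothetical helper: retry a META read until it succeeds or the
  // server is stopping, mirroring the existing loop in process().
  static <T> T retryMetaOperation(Callable<T> metaOp, ServerStub server)
      throws IOException, InterruptedException {
    while (!server.isStopped()) {
      try {
        return metaOp.call();          // e.g. MetaReader.fullScan(...)
      } catch (IOException ioe) {
        System.out.println("META read failed, retrying: " + ioe);
        Thread.sleep(1000);            // back off before the next attempt
      } catch (Exception e) {
        throw new IOException("Unexpected failure", e);
      }
    }
    throw new IOException("Server stopped before META became reachable");
  }
}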

4. Root cause

The developers simply did not anticipate that the META server could become unreachable inside processDeadRegion. They assumed it must still be up, since its liveness had already been checked (by waitForMeta() in the retry loop) before processDeadRegion was called; a classic time-of-check-to-time-of-use assumption.

4.1 Category:

Incorrect error handling (not handled)