HBase-3403 Report

https://issues.apache.org/jira/browse/Hbase-3403

1. Symptom

The entire region is orphaned after a split failure: it can no longer be accessed, which to the client amounts to data loss.

1.1 Severity

Blocker

1.2 Was there exception thrown?

Yes:

1. RS crashed

2. Final data loss

1.2.1 Were there multiple exceptions?

Yes

2. How to reproduce this failure

2.0 Version

0.90.0

2.1 Configuration

Standard

2.2 Reproduction procedure

1. Split region (feature start, long running)

2. Shut down the RS (disconnect)

2.2.1 Timing order

In this order

2.2.2 Events order externally controllable?

No (multi-node): step 2 must occur right after the parent region’s info is removed from META and before the daughters’ info is added to META.

2.3 Can the logs tell how to reproduce the failure?

Yes

2.4 How many machines needed?

2 (HMaster + RS)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

1. Data loss

2. Notice that a split occurred

3. Notice that the RS failed

3.2 Backward inference

From the log, we noticed that the split process calls offlineParentInMeta:

  /**
   * Offline parent in meta.
   * Used when splitting.
   * @param catalogTracker
   * @param parent
   * @param a Split daughter region A
   * @param b Split daughter region B
   * @throws NotAllMetaRegionsOnlineException
   * @throws IOException
   */
  public static void offlineParentInMeta(CatalogTracker catalogTracker,
      HRegionInfo parent, final HRegionInfo a, final HRegionInfo b)
  throws NotAllMetaRegionsOnlineException, IOException {
    HRegionInfo copyOfParent = new HRegionInfo(parent);
    copyOfParent.setOffline(true);
    copyOfParent.setSplit(true);
    Put put = new Put(copyOfParent.getRegionName());
    addRegionInfo(put, copyOfParent);
-    put.add(HConstants.CATALOG_FAMILY, HConstants.SERVER_QUALIFIER,
-        HConstants.EMPTY_BYTE_ARRAY);
-    put.add(HConstants.CATALOG_FAMILY, HConstants.STARTCODE_QUALIFIER,
-        HConstants.EMPTY_BYTE_ARRAY);
    put.add(HConstants.CATALOG_FAMILY, HConstants.SPLITA_QUALIFIER,
      Writables.getBytes(a));
    put.add(HConstants.CATALOG_FAMILY, HConstants.SPLITB_QUALIFIER,
      Writables.getBytes(b));
    catalogTracker.waitForMetaServerConnectionDefault().put(CatalogTracker.META_REGION, put);
    LOG.info("Offlined parent region " + parent.getRegionNameAsString() +
      " in META");
  }

The two statements marked with "-" (removed in the later fix) clear the parent region’s server and start code info in META.

4. Root cause

1. Split occurs

2. In “offlineParentInMeta” --- while marking the parent region as offline in META, it clears out SERVER_QUALIFIER and STARTCODE_QUALIFIER

3. The RS crashes --- at this moment, the daughter info has not yet been put into META, but the parent info is already cleared

4. When HMaster recovers from the failure (ServerShutdownHandler), it cannot find region info in META for either the parent or the daughters.
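
The four steps above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions, not HBase code: the class SplitCrashModel, its methods, the plain-string column names and the server name "rs1" are all hypothetical, and recovery is reduced to "find the META rows still attributed to the dead server".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy model of the failure; none of these names exist in HBase.
public class SplitCrashModel {

    /** Toy META table: row key (region name) -> column name -> value. */
    public static Map<String, Map<String, String>> newMeta() {
        Map<String, Map<String, String>> meta = new TreeMap<>();
        Map<String, String> row = new HashMap<>();
        row.put("server", "rs1"); // the parent region is hosted on rs1
        meta.put("parent", row);
        return meta;
    }

    /**
     * Models offlineParentInMeta. With clearServer=true it mimics the
     * pre-fix behavior (the two put.add lines marked "-" in the excerpt).
     */
    public static void offlineParent(Map<String, Map<String, String>> meta,
                                     boolean clearServer) {
        Map<String, String> row = meta.get("parent");
        row.put("offline", "true");
        row.put("split", "true");
        if (clearServer) {
            row.put("server", ""); // server info blanked out
        }
        row.put("splitA", "daughterA");
        row.put("splitB", "daughterB");
        // Daughter rows would be added to META after this point.
    }

    /** Models recovery: collect the rows still attributed to the dead server. */
    public static List<String> regionsOn(Map<String, Map<String, String>> meta,
                                         String deadServer) {
        List<String> found = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : meta.entrySet()) {
            if (deadServer.equals(e.getValue().get("server"))) {
                found.add(e.getKey());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Pre-fix: the RS "crashes" right after offlineParent returns.
        Map<String, Map<String, String>> buggy = newMeta();
        offlineParent(buggy, true);
        System.out.println("pre-fix:  " + regionsOn(buggy, "rs1"));

        // With the server column left intact, the parent row is still found.
        Map<String, Map<String, String>> fixed = newMeta();
        offlineParent(fixed, false);
        System.out.println("post-fix: " + regionsOn(fixed, "rs1"));
    }
}
```

In this model the pre-fix run finds no rows for rs1, so nothing ties the orphaned key range back to the crashed server, while the post-fix run still finds the parent row together with its splitA and splitB columns.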

4.1 Category:

Incorrect error handling (not handled at all)

 --- the operation was not made atomic to guard against a failure in between.

But this is a hard one to test --- you will need to