HBase-5611 Report

https://issues.apache.org/jira/browse/HBASE-5611

1. Symptom

A region server’s memstore flusher thread keeps emitting exception, and cannot flush the in-memory edit log. Could cause data-loss if the log is not flushed.

2012-03-21 00:33:39,303 ERROR org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flusher failed for entry null
java.lang.IllegalStateException
       at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
       at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushOneForGlobalPressure(MemStoreFlusher.java:199)
       at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:223)
       at java.lang.Thread.run(Thread.java:662)

1.1 Severity

Critical

1.2 Was there exception thrown?

Yes.

1.2.1 Were there multiple exceptions?

Yes.

1) RS1 died (triggered log split)

5). recovery time out

8). Final exception.

1.3 Scope of the failure

Data loss potentially.

2. How to reproduce this failure

2.0 Version

0.90.6

2.1 Configuration

Timeout monitor;

Low/high watermark;

these better be set at small numbers to have better chance to see.

2.2 Reproduction procedure

1. split region in RS1

2. set small timeout, therefore the recovery of RS2 will quickly timeout

3. keep adding data, and RS2’s memory usage reaches watermark

4. when it flushes, see the exception

2.2.1 Timing order

This order

2.2.2 Events order externally controllable?

Yes.

2.3 Can the logs tell how to reproduce the failure?

Yes. Key events are all logged.

2.4 How many machines needed?

1. RS1 + RS2 + HMaster

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The exception above.

3.2 Backward inference

It is relatively easy to locate the code that thrown this exception, but the question is why the HRegion is null...

 /**

   * Return true if we're above the high watermark

   */

  private boolean isAboveLowWaterMark() {

    return server.getGlobalMemStoreSize() >= globalMemStoreLimitLowMark;

  }

  public void run() {

    while (!this.server.isStopped()) {

      FlushQueueEntry fqe = null;

      try {

        wakeupPending.set(false); // allow someone to wake us up again

        fqe = flushQueue.poll(threadWakeFrequency, TimeUnit.MILLISECONDS);

        if (fqe == null || fqe instanceof WakeupFlushThread) {

          if (isAboveLowWaterMark()) {

            LOG.info("Flush thread woke up with memory above low water.");

            if (!flushOneForGlobalPressure()) {

             --- Throws exception!

             .. ..

            }

          }

        }

        .. ..

      } catch (InterruptedException ex) {

        continue;

      } catch (ConcurrentModificationException ex) {

        continue;

      } catch (Exception ex) {

        --- control-flow comes here...

        LOG.error("Cache flusher failed for entry " + fqe, ex);

        if (!server.checkFileSystem()) {

          break;

        }

      }

    }

HRegionServer.java:

  /**

   * Return the total size of all memstores in every region.

   *

   * @return memstore size in bytes

   */

  public long getGlobalMemStoreSize() {

    long total = 0;

    for (HRegion region : onlineRegions.values()) {

      total += region.memstoreSize.get();

    }

    return total;

  }

From the above, we can see the exception is thrown in “flushOneForGlobalPressure”. Within it, the “regionToFlush” is null... But why is this NULL? from here, there is little connection to carry this diagnosis forward...

4. Root cause

Didn’t handle the time-out exception (hence the HRegion drop-off) correctly.

4.1 Category:

Incorrect exception handling.

5. Fix

5.1 How?

When dropping the HRegion, simply also decrease the account for memstore:

https://issues.apache.org/jira/secure/attachment/12524810/HBASE-5611-trunk-v2-minorchange.patch

   } catch (Throwable t) {

       // We failed open. Our caller will see the 'null' return value
      // and transition the node back to FAILED_OPEN. If that fails,
      // we rely on the Timeout Monitor in the master to reassign.
-      LOG.error("Failed open of region=" +
-        this.regionInfo.getRegionNameAsString(), t);
+      LOG.error(
+          "Failed open of region=" + this.regionInfo.getRegionNameAsString()
+              + ", starting to roll back the global memstore size.", t);
+      // Decrease the global memstore size.
+      if (this.rsServices != null) {
+        RegionServerAccounting rsAccounting =
+          this.rsServices.getRegionServerAccounting();
+        if (rsAccounting != null) {
+          rsAccounting.rollbackRegionReplayEditsSize(this.regionInfo.getRegionName());
+        }
+      }
    }