HBase-2461 Report

https://issues.apache.org/jira/browse/HBASE-2461

1. Symptom

After an HDFS exception is thrown during a region split, the entire region is gone (massive data loss).

1.1 Severity

Blocker

1.2 Was an exception thrown?

Yes

1.2.1 Were there multiple exceptions?

Yes (HDFS exception + HBase exception + later write exception)

1.3 Scope of the failure

Entire region, all users; catastrophic

2. How to reproduce this failure

2.0 Version

0.20.5

2.1 Configuration

Standard

2.2 Reproduction procedure

1. Trigger a region split (feature start)

2. Trigger an NPE from HDFS: disconnect HDFS nodes (disconnect)

2.2.1 Timing order

In the order listed: the region split must already be under way when the HDFS failure is triggered.

2.2.2 Events order externally controllable?

Yes.

2.3 Can the logs tell how to reproduce the failure?

Yes. In the exception log:

2010-04-16 19:18:20,727 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction failed for region TestTable,-1945465867<1271449232310>,1271453785648
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.enqueueCurrentPacket(DFSClient.java:3124)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.flushInternal(DFSClient.java:3220)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3306)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3255)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
    at org.apache.hadoop.fs.FileSystem.createNewFile(FileSystem.java:560)
    at org.apache.hadoop.hbase.util.FSUtils.create(FSUtils.java:95)
    at org.apache.hadoop.hbase.io.Reference.write(Reference.java:129)
    at org.apache.hadoop.hbase.regionserver.StoreFile.split(StoreFile.java:498)
    at org.apache.hadoop.hbase.regionserver.HRegion.splitRegion(HRegion.java:682)
    at org.apache.hadoop.hbase.regionserver.CompactSplitThread.split(CompactSplitThread.java:162)
    at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:95)

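The stack trace shows that the exception occurred during the region split (CompactSplitThread.split, HRegion.splitRegion, StoreFile.split) and that it originated in HDFS (DFSClient$DFSOutputStream).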

2.4 How many machines needed?

2 (region server + HDFS client)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The exception message in the region server log (see 2.3).

3.2 Backward inference

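Following the stack trace backward, the NullPointerException raised in DFSOutputStream propagates through Reference.write, StoreFile.split and HRegion.splitRegion up to CompactSplitThread.split. It is finally caught in CompactSplitThread.run, which only logs it ("Compaction failed for region ..."). Nothing on this path undoes the partially completed split, so the exception is not handled correctly.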

4. Root cause

The exception causes CompactSplitThread.split to return midway through the split. At this point the parent region has already been closed and is gone, and nothing restores it, which causes the data loss.

4.1 Category:

Incorrect error handling (handled) (stmt coverage)

CompactSplitThread.java:

  public void run() {
    while (!this.server.isStopped()) {
      HRegion r = null;
      try {
        .. ..
        split(r, midKey);
      } catch (InterruptedException ex) {
        continue;
      } catch (IOException ex) {
        LOG.error("Compaction/Split failed for region " +
            r.getRegionNameAsString(),
          RemoteExceptionHandler.checkIOException(ex));
        if (!server.checkFileSystem()) {
          break;
        }
      } catch (Exception ex) {
        // The NullPointerException from HDFS is not an IOException, so it
        // lands here: it is only logged, and the half-finished split is
        // never undone.
        LOG.error("Compaction failed" +
            (r != null ? (" for region " + r.getRegionNameAsString()) : ""),
            ex);
      }
    }
    compactionQueue.clear();
    LOG.info(getName() + " exiting");
  }

If split throws an exception, run only logs it and moves on, leaving the split half done; this is when the symptom appears.
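The failure pattern can be illustrated with a small toy model. This is only a sketch and not HBase code: ToyRegion, NonTransactionalSplitDemo and the failWhileWritingReferences flag are hypothetical names made up for the illustration, and the simulated NullPointerException stands in for the HDFS failure seen in the log.

```java
// Toy model of the failure, NOT HBase code. All names are hypothetical.
class ToyRegion {
    final String name;
    boolean online = true;

    ToyRegion(String name) {
        this.name = name;
    }

    void close() {
        online = false;
    }
}

class NonTransactionalSplitDemo {
    // Mimics a split that closes the parent first and then fails while
    // writing the reference files for the daughter regions.
    static void split(ToyRegion parent, boolean failWhileWritingReferences) {
        parent.close(); // step 1: the parent goes offline
        if (failWhileWritingReferences) {
            // step 2: stand-in for the NPE thrown by the HDFS client
            throw new NullPointerException("simulated HDFS failure");
        }
        // step 3 (never reached on failure): daughters would come online here
    }

    // Mimics the catch-all handler in run(): log the error and carry on.
    static boolean runOnce(ToyRegion parent, boolean fail) {
        try {
            split(parent, fail);
            return true;
        } catch (Exception ex) {
            System.out.println("Compaction failed for region " + parent.name + ": " + ex);
            return false;
        }
    }

    public static void main(String[] args) {
        ToyRegion region = new ToyRegion("TestTable");
        boolean ok = runOnce(region, true);
        // The split failed, yet the parent stays offline: nobody reopened it.
        System.out.println("split ok=" + ok + ", parent online=" + region.online);
    }
}
```

In this model the handler swallows the exception, so the caller never learns that the parent was left closed, which mirrors the behaviour described above.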

5. Fix

5.1 How?

Add transactional semantics to the region split: when the split function encounters an exception, it no longer returns directly and leaves the split half done, so a failed split can be undone.
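One common way to get this kind of transactional behaviour is an undo journal: each completed step records how to reverse itself, and on failure the recorded steps are undone in reverse order. The sketch below only illustrates that idea under stated assumptions; it is not the actual HBASE-2461 patch, and SplitJournalSketch, its fields and the failWhileWritingReferences flag are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a split with rollback, NOT the actual HBase patch.
class SplitJournalSketch {
    boolean parentOnline = true;
    boolean daughtersCreated = false;

    // Undo actions for the steps completed so far, most recent first.
    private final Deque<Runnable> undoJournal = new ArrayDeque<>();

    void split(boolean failWhileWritingReferences) {
        undoJournal.clear();
        try {
            // Step 1: take the parent offline and record how to reopen it.
            parentOnline = false;
            undoJournal.push(() -> parentOnline = true);

            // Step 2: stand-in for writing reference files to HDFS.
            if (failWhileWritingReferences) {
                throw new NullPointerException("simulated HDFS failure");
            }
            daughtersCreated = true;
            undoJournal.push(() -> daughtersCreated = false);

            // Commit point: the split is complete, nothing left to undo.
            undoJournal.clear();
        } catch (RuntimeException ex) {
            // Roll back the completed steps in reverse order, then rethrow
            // so the caller still sees the original failure.
            while (!undoJournal.isEmpty()) {
                undoJournal.pop().run();
            }
            throw ex;
        }
    }

    public static void main(String[] args) {
        SplitJournalSketch region = new SplitJournalSketch();
        try {
            region.split(true);
        } catch (RuntimeException ex) {
            System.out.println("split failed: " + ex.getMessage());
        }
        System.out.println("parent online after failed split: " + region.parentOnline);
    }
}
```

With the journal in place, a failed split in this model leaves the parent online again, while a successful split clears the journal and keeps the new state.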