HBase-3367 Report

https://issues.apache.org/jira/browse/Hbase-3367

1. Symptom

If a region server (RS) fails in the middle of an HLog (WAL) split, the HLog is lost completely, and all the updates it recorded are lost with it.

1.1 Severity

blocker

1.2 Was there exception thrown?

Yes. The master logs the following error, and data is lost later:

ERROR [MASTER_SERVER_OPERATIONS-h17.sfo.stumble.net:58644-0] master.MasterFileSystem(197):
Failed splitting hdfs://localhost:58631/user/jdcryans/.logs/h17.sfo.stumble.net,58647,1292464631034
java.io.IOException: Discovered orphan hlog after split. Maybe HRegionServer was not dead when we started

1.2.1 Were there multiple exceptions?

Yes

1.3 Scope of the failure

Large scale data loss

2. How to reproduce this failure

2.0 Version

0.90.0

2.1 Configuration

standard

2.2 Reproduction procedure

1. Start an HLog split (feature start; the split is long running).

2. Shut down the RS hosting the HLog while the split is still in progress (disconnect).

2.2.1 Timing order

In this order

2.2.2 Events order externally controllable?

Yes

2.3 Can the logs tell how to reproduce the failure?

Yes

2.4 How many machines needed?

2 (HMaster + RS)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The exception:

2010-12-15 17:58:33,642 ERROR [MASTER_SERVER_OPERATIONS-h17.sfo.stumble.net:58644-0] master.MasterFileSystem(197):
Failed splitting hdfs://localhost:58631/user/jdcryans/.logs/h17.sfo.stumble.net,58647,1292464631034
java.io.IOException: Discovered orphan hlog after split. Maybe HRegionServer was not dead when we started
       at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:290)
       at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:151)
       at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:193)
       at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:96)
       at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151)
       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
       at java.lang.Thread.run(Thread.java:680)

3.2 Backward inference

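The stack trace points to MasterFileSystem.splitLog (MasterFileSystem.java:193), which invokes the splitter:

  public void splitLog(final String serverName) {
    this.splitLogLock.lock();
    long splitTime = 0, splitLogSize = 0;
    Path logDir = new Path(this.rootdir, HLog.getHLogDirectoryName(serverName));
    try {
      HLogSplitter splitter = HLogSplitter.createLogSplitter(
        conf, rootdir, logDir, oldLogDir, this.fs);
      splitter.splitLog();
    } catch (IOException e) {
      // The failure is only logged here; nothing recovers from it.
      LOG.error("Failed splitting " + logDir.toString(), e);
    }
    // (rest of the method is not shown in this excerpt)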

The IOException is caught and logged, but nothing is done to recover from it, so the failed split is silently abandoned.

4. Root cause

A failed log split (which leaves the HLog orphaned) is not handled correctly. The fix is to retry splitLog when it throws an exception.
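
The retry idea can be illustrated with a minimal, self-contained Java sketch. It is not the actual HBase patch: SplitLogRetrySketch, SplitTask, splitWithRetry and the maxAttempts bound are all hypothetical names introduced here, and the task passed in merely stands in for the call to splitter.splitLog().

```java
import java.io.IOException;

public class SplitLogRetrySketch {

    /** Hypothetical stand-in for the work done by splitter.splitLog(). */
    public interface SplitTask {
        void run() throws IOException;
    }

    /**
     * Runs the task and retries it when it throws an IOException,
     * up to maxAttempts times (an assumed bound, not taken from HBase).
     * Returns the number of attempts used, or rethrows the last
     * IOException if every attempt failed.
     */
    public static int splitWithRetry(SplitTask task, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                task.run();
                return attempt; // split succeeded
            } catch (IOException e) {
                // Log the failure, then fall through to the next attempt
                // instead of abandoning the split.
                last = e;
                System.err.println("Failed splitting (attempt " + attempt
                        + " of " + maxAttempts + "): " + e.getMessage());
            }
        }
        throw last; // all attempts failed; surface the error to the caller
    }

    public static void main(String[] args) throws IOException {
        // Simulated split that fails twice with the error from this report,
        // then succeeds on the third attempt.
        final int[] calls = {0};
        int attempts = splitWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("Discovered orphan hlog after split");
            }
        }, 5);
        System.out.println("split succeeded after " + attempts + " attempts");
        // prints: split succeeded after 3 attempts
    }
}
```

Unlike the catch block in section 3.2, this sketch rethrows the exception once the attempts are exhausted, so the caller can still see that the split never completed.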

4.1 Category

Incorrect error handling (handled, statement coverage)