HBASE-8099 Report

1. Symptom

If one or more Region Servers die after the cluster has ZooKeeper's multi-update functionality enabled (hbase.zookeeper.useMulti) and is set up as a replication source, all Region Servers abort within a few seconds.

1.1 Severity

Blocker

1.2 Was there exception thrown?

Yes

1.2.1 Were there multiple exceptions?

No

1.3 Was there a long propagation of the failure?

Yes

1.4 Scope of the failure

All Region Servers in the cluster abort.

2. How to reproduce this failure

2.0 Version

0.94.5

2.1 Configuration

Set hbase.replication and hbase.zookeeper.useMulti to true.
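
A minimal hbase-site.xml excerpt for this configuration could look like the following (only the two property names and their values come from this report; the surrounding XML is the standard Hadoop/HBase configuration layout):

<property>
  <name>hbase.replication</name>
  <value>true</value>
</property>
<property>
  <name>hbase.zookeeper.useMulti</name>
  <value>true</value>
</property>

Both must be set on the replication source cluster before the restart in step 2 of the reproduction procedure below.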

2.2 Reproduction procedure

1 Enable hbase.zookeeper.useMulti and the replication feature (config change)

2 Restart the replication source cluster (restart)

3 Abort one Region Server (shutdown)

2.2.1 Timing order

In the above order

2.2.2 Events order externally controllable?

Yes

2.3 Can the logs tell how to reproduce the failure?

No

2.4 How many machines needed?

2 (1 RS + 1 ZK)

2.5 How hard is the reproduction?

Easy

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

2013-03-14 00:11:37,993 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Atomically moving dnds1-4,60020,1363219866063's hlogs to my queue

2013-03-14 00:11:37,997 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper: Creating dnds1-4%2C60020%2C1363219866063.1363219868868 with data

2013-03-14 00:11:37,997 DEBUG org.apache.hadoop.hbase.replication.ReplicationZookeeper:  The multi list size is: 5

2013-03-14 00:11:38,068 WARN org.apache.hadoop.hbase.replication.ReplicationZookeeper: Got exception in copyQueuesFromRSUsingMulti:

org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode

        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)

        at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)

        at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)

        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:531)

        at org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1436)

        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.copyQueuesFromRSUsingMulti(ReplicationZookeeper.java:705)

        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:590)

        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

        at java.lang.Thread.run(Thread.java:662)

2013-03-14 00:11:38,070 INFO org.apache.hadoop.hbase.metrics: new MBeanInfo

2013-03-14 00:11:38,077 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting 2 rs from peer cluster # 1

2013-03-14 00:11:38,077 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Choosing peer ist6-dnds1-6,60020,1359508078741

2013-03-14 00:11:38,077 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Choosing peer ist6-dnds1-3,60020,1359508079058

2013-03-14 00:11:39,080 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicating 4f3d5435-898c-47c2-8821-aeb01f9e87cc -> 74c750a5-4254-4a3b-ab12-063869759edd

2013-03-14 00:11:39,081 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication dnds1-4%2C60020%2C1363219866063.1363219868868 at 0

2013-03-14 00:11:39,090 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: currentNbOperations:0 and seenEntries:3 and size: 0

2013-03-14 00:11:39,090 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Going to report log #dnds1-4%2C60020%2C1363219866063.1363219868868 for position 1004 in hdfs://cluster/hbase/.oldlogs/dnds1-4%2C60020%2C1363219866063.1363219868868

2013-03-14 00:11:39,139 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server dnds1-13,60020,1363219887385: Writing replication status

org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/replication/rs/dnds1-13,60020,1363219887385/1-dnds1-4,60020,1363219866063/dnds1-4%2C60020%2C1363219866063.1363219868868

        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)

        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)

        at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1266)

        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.setData(RecoverableZooKeeper.java:349)

        at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:848)

        at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:900)

        at org.apache.hadoop.hbase.zookeeper.ZKUtil.setData(ZKUtil.java:894)

        at org.apache.hadoop.hbase.replication.ReplicationZookeeper.writeReplicationStatus(ReplicationZookeeper.java:558)

        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:155)

        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:379)

2013-03-14 00:11:39,140 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []

-------------------

3.2 Backward inference

Working backwards from the FATAL "Writing replication status" abort: setData() fails with NoNodeException on the log-position znode for the hlog this RegionServer believes it took over from the dead server, meaning that znode was never created. One second earlier in the same log, the WARN from copyQueuesFromRSUsingMulti shows that the atomic multi which should have created it had already failed with NoNode, because another RegionServer had atomically moved the dead server's hlogs first. The worker nevertheless continued with a non-empty queue, which leads directly to the failed setData() and the abort.

3.3 Are the printed logs sufficient for diagnosis?

Yes

3.4 Are logs misleading?

No

3.5 Do we need to examine different component’s log for diagnosis?

No

3.6 Is it a multi-components failure?

No

3.7 How hard is the diagnosis?

Hard

4. Root cause

If a RegionServer dies unexpectedly, only one NodeFailoverWorker is supposed to claim the dead RegionServer's znode and start moving the log znodes under it. When the other NodeFailoverWorkers try to copy the same logs, their ZooKeeper multi call fails with a KeeperException, and they should not take over the logs. However, this exception is not handled correctly: copyQueuesFromRSUsingMulti still returns a non-empty queue to those workers, so they also try to move the znodes, report log positions, and clean up the logs later, even though the first NodeFailoverWorker has already done all of this. Those later znode operations fail with NoNodeException, and every RegionServer running one of these jobs aborts, taking down the whole cluster.
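
The faulty control flow can be sketched as follows. This is a simplified, hypothetical reconstruction based on the stack trace in section 3.1 and the patch in section 5 (flattened znode layout, invented names such as QueueClaimSketch and claimQueue); it is not the actual 0.94.5 source.

import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Every NodeFailoverWorker races to move the dead RS's hlog znodes under its
// own znode with a single atomic multi(); only one worker's multi can succeed.
class QueueClaimSketch {
  static SortedMap<String, String> claimQueue(ZooKeeper zk, String deadRsZnode,
      String myRsZnode, List<String> hlogs) throws InterruptedException {
    SortedMap<String, String> queues = new TreeMap<String, String>();
    List<Op> ops = new ArrayList<Op>();
    for (String hlog : hlogs) {
      ops.add(Op.create(myRsZnode + "/" + hlog, new byte[0],
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT));
      ops.add(Op.delete(deadRsZnode + "/" + hlog, -1));
      queues.put(hlog, myRsZnode + "/" + hlog);
    }
    try {
      zk.multi(ops);  // atomic: this worker gets every log or none of them
    } catch (KeeperException e) {
      // Another worker already deleted the dead RS's znodes, so the multi
      // fails with NoNode and none of the creates above took effect.
      // 0.94.5 bug: queues is returned non-empty anyway, so the caller starts
      // replication sources for znodes that were never created and later
      // aborts when setData() on them throws NoNodeException. The patch in
      // section 5 adds queues.clear() here.
    }
    return queues;
  }
}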

4.1 Category:

Incorrect error handling (statement coverage); see the patch in section 5.

4.2 Are there multiple faults?

No

4.3 Can we automatically test it?

Yes

5. Fix

diff --git src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java

index f23c31c..290741e 100644

--- src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java

+++ src/main/java/org/apache/hadoop/hbase/replication/ReplicationZookeeper.java

@@ -672,7 +672,7 @@ public class ReplicationZookeeper {

         } catch (KeeperException e) {

           // Multi call failed; it looks like some other regionserver took away the logs.

           LOG.warn("Got exception in copyQueuesFromRSUsingMulti: ", e);

+          queues.clear();

         }

         return queues;

   }
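
Clearing queues makes a failed multi equivalent to "this worker claimed nothing": the losing NodeFailoverWorker gets an empty map back, creates no replication sources, and therefore never calls setData() on log znodes that belong to the winner. In terms of the hypothetical sketch from section 4 (the caller below is modeled loosely on NodeFailoverWorker.run and is an assumption, not the actual code):

// Could live next to claimQueue in the QueueClaimSketch class above.
static void failoverDeadRegionServer(ZooKeeper zk, String deadRsZnode,
    String myRsZnode, List<String> hlogs) throws InterruptedException {
  SortedMap<String, String> claimed =
      QueueClaimSketch.claimQueue(zk, deadRsZnode, myRsZnode, hlogs);
  if (claimed.isEmpty()) {
    return;  // some other regionserver took away the logs; nothing to do
  }
  for (String newZnode : claimed.values()) {
    // start a ReplicationSource that ships this hlog and records its
    // position in newZnode via setData()
  }
}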