HBASE-8325 (duplicates HBASE-7122 and HBASE-6446)

1. Symptom

After a master cluster is started with no further operations (an HLog is created with size zero), starting replication causes an EOFException to be thrown. The exception reappears every 10 seconds until an entry is inserted into the HLog.

1.1 Severity

Major

1.2 Was there an exception thrown?

Yes

1.2.1 Were there multiple exceptions?

No

1.3 Scope of the failure

Only affects performance

2 How to reproduce the failure

2.0 Version

0.94.2

2.1 Configuration

<property>

<name>hbase.replication</name>

<value>true</value>

</property>

Refer to http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/replication/package-summary.html#requirements

2.2 Reproduction procedure

0 Start a new master cluster

(make sure no HLog is left over before starting it)

1 Start replication (feature start)

2.2.1 Timing order

Single event

2.2.2 Events order externally controllable?

Yes

2.3 Can the logs tell how to reproduce the failure?

Yes

2.4 How many machines needed?

2 (2 Masters)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

Log messages:

2012-11-07 15:47:40,926 WARN regionserver.ReplicationSource (ReplicationSource.java:openReader(543)) - 1 Got:

java.io.EOFException

at java.io.DataInputStream.readFully(DataInputStream.java:180)

at java.io.DataInputStream.readFully(DataInputStream.java:152)

at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)

at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)

at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)

at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)

at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)

at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)

at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:716)

at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:491)

at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:290)

2012-11-07 15:47:40,927 WARN regionserver.ReplicationSource (ReplicationSource.java:openReader(547)) - Waited too long for this file, considering dumping

4. Root cause

When replication starts, HBase tries to open the HLog and read it. Because the cluster has just started, the HLog file contains no data, so opening the reader throws an EOFException. When the reader cannot be opened, the replication source retries after a few seconds. If no entry has been inserted into the HLog by then, the EOFException is thrown again.
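
The stack trace shows the exception coming from DataInputStream.readFully while the SequenceFile reader is being initialized. The following is a minimal sketch of that behavior using only the JDK, not the actual SequenceFile code: the class name EmptyLogHeaderDemo, the HEADER_BYTES constant, and the sample header bytes are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

class EmptyLogHeaderDemo {
    // Hypothetical header length; the real SequenceFile header layout is not reproduced here.
    static final int HEADER_BYTES = 4;

    /** Returns true if HEADER_BYTES bytes can be read from the start of the given file contents. */
    static boolean canReadHeader(byte[] fileContents) {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(fileContents));
        try {
            // readFully throws EOFException when fewer than HEADER_BYTES bytes remain.
            in.readFully(new byte[HEADER_BYTES]);
            return true;
        } catch (EOFException eof) {
            return false;
        } catch (IOException other) {
            throw new UncheckedIOException(other);
        }
    }

    public static void main(String[] args) {
        // A zero-length file, like the HLog of a freshly started idle cluster.
        System.out.println("empty file readable: " + canReadHeader(new byte[0]));
        // A file that already holds at least a header (sample bytes are illustrative).
        System.out.println("non-empty file readable: " + canReadHeader(new byte[] {'S', 'E', 'Q', 6}));
    }
}
```

In this sketch the zero-length input fails on the very first read, which is the situation the replication source keeps hitting on each retry.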

4.1 Category:

Semantic

5. Fix

src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSource.java

@@ -611,6 +611,7 @@ public class ReplicationSource extends Thread

             }

           }

         } catch (IOException ioe) {

+          if (ioe instanceof EOFException && isCurrentLogEmpty()) return true;

           LOG.warn(peerClusterZnode + " Got: ", ioe);

           this.reader = null;

           if (ioe.getCause() instanceof NullPointerException) {

@@ -628,6 +629,16 @@ public class ReplicationSource extends Thread

         return true;

   }

 

+  /*

+   * Checks whether the current log file is empty, and it is not a recovered queue. This is to

+   * handle scenario when in an idle cluster, there is no entry in the current log and we keep on

+   * trying to read the log file and get EOFEception. In case of a recovered queue the last log file

+   * may be empty, and we don't want to retry that.

+   */

+  private boolean isCurrentLogEmpty() {

+        return (this.repLogReader.getPosition() == 0 && !queueRecovered && queue.size() == 0);

+  }

+

5.1 How?

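Add a check for whether the current HLog is empty. If the caught IOException is an EOFException and the log is empty (and the queue is not a recovered one), the function returns true immediately instead of logging the exception as a warning and resetting the reader.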
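
The effect of the guard can be sketched in isolation as follows. This is a simplified simulation, not the ReplicationSource class itself: EmptyLogGuardSketch, onOpenFailure, readerPosition, queuedLogs, and warningsLogged are hypothetical stand-ins for the fields and logging call that appear in the patch, and the rest of the original catch block is omitted.

```java
import java.io.EOFException;
import java.io.IOException;

class EmptyLogGuardSketch {
    // Hypothetical stand-ins for repLogReader.getPosition(), queueRecovered and queue.size().
    long readerPosition = 0;
    boolean queueRecovered = false;
    int queuedLogs = 0;
    // Counts how often the warning path is taken; stands in for LOG.warn(...).
    int warningsLogged = 0;

    /** Same condition as isCurrentLogEmpty() in the patch. */
    boolean isCurrentLogEmpty() {
        return readerPosition == 0 && !queueRecovered && queuedLogs == 0;
    }

    /** Simulates the catch block for one failed attempt to open the log reader. */
    void onOpenFailure(IOException ioe) {
        if (ioe instanceof EOFException && isCurrentLogEmpty()) {
            return; // idle cluster with an empty current log: no warning
        }
        warningsLogged++;
    }

    public static void main(String[] args) {
        EmptyLogGuardSketch source = new EmptyLogGuardSketch();
        // Three retries against an empty log on an idle cluster.
        for (int i = 0; i < 3; i++) {
            source.onOpenFailure(new EOFException());
        }
        System.out.println("warnings on idle cluster: " + source.warningsLogged);

        // A recovered queue is excluded from the guard, so the warning path is taken.
        source.queueRecovered = true;
        source.onOpenFailure(new EOFException());
        System.out.println("warnings after recovered-queue failure: " + source.warningsLogged);
    }
}
```

The recovered-queue case is deliberately excluded from the guard, matching the patch comment: the last log of a recovered queue may be empty, and retrying it silently is not wanted.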