HDFS-4617 (duplicates HDFS-4298) Report

1. Symptom

When HDFS is configured for High Availability using the Quorum Journal Manager, once the active namenode has written more edit log transactions than dfs.namenode.num.extra.edits.retained, a warning message and exceptions appear every time old edit logs are purged or a checkpoint image is uploaded, and the JournalNodes do not record these checkpoint operations.

1.1 Severity

Major

1.2 Was there exception thrown?

Yes.

1.2.1 Were there multiple exceptions?

Yes: QuorumException and IllegalStateException.

1.3 Scope of the failure

It affects the JournalNodes' ability to keep the record of the namenode's modifications.

2. How to reproduce this failure

2.0 Version

2.0.3-alpha

2.1 Configuration

1 Enable HA using QJM

2 Put a cap on the number of completed edit files retained by the namenode (dfs.namenode.num.extra.edits.retained)

3 Set the checkpoint period to a small value (dfs.namenode.checkpoint.period)

core-site.xml:

<configuration>

<property>

  <name>fs.defaultFS</name>

  <value>hdfs://mycluster</value>

</property>

<property>

  <name>dfs.journalnode.edits.dir</name>

  <value>/home/research/tmp/hadoop-2.0.3/journal</value>

</property>

<property>

  <name>hadoop.tmp.dir</name>

  <value>/home/research/tmp/hadoop-2.0.3/</value>

  <description>A base for other temporary directories.</description>

</property>

</configuration>

hdfs-site.xml:

<configuration>

  <property>

        <name>dfs.nameservices</name>

        <value>mycluster</value>

  </property>

  <property>

        <name>dfs.ha.namenodes.mycluster</name>

        <value>nn1,nn2</value>

  </property>

  <property>

        <name>dfs.namenode.rpc-address.mycluster.nn1</name>

        <value>master:8020</value>

  </property>

  <property>

        <name>dfs.namenode.http-address.mycluster.nn1</name>

        <value>master:52000</value>

  </property>

  <property>

        <name>dfs.namenode.rpc-address.mycluster.nn2</name>

        <value>slave1:8020</value>

  </property>

  <property>

        <name>dfs.namenode.http-address.mycluster.nn2</name>

        <value>slave1:52000</value>

  </property>

  <property>

        <name>dfs.namenode.shared.edits.dir</name>

        <value>qjournal://master:8485;slave1:8485;slave2:8485/mycluster</value>

  </property>

  <property>

        <name>dfs.client.failover.proxy.provider.mycluster</name>

        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>

  </property>

<property>

  <name>dfs.ha.fencing.methods</name>

  <value>sshfence</value>

</property>

 

<property>

  <name>dfs.ha.fencing.ssh.private-key-files</name>

  <value>/home/research/.ssh/id_rsa</value>

</property>

  <property>

        <name>dfs.datanode.data.dir</name>

        <value>file:/home/research/tmp/hadoop-2.0.3/data</value>

        <final>true</final>

  </property>

  <property>

        <name>dfs.namenode.name.dir</name>

        <value>file:/home/research/tmp/hadoop-2.0.3/name</value>

        <final>true</final>

  </property>

<property>

        <name>dfs.namenode.num.extra.edits.retained</name>

        <value>15</value>

  </property>

<property>

        <name>dfs.namenode.checkpoint.period</name>

        <value>150</value>

  </property>

</configuration>

2.2 Reproduction procedure

1 Start journalnodes

Run hadoop-daemon.sh start journalnode on each JournalNode machine (in my case, I have 3 JournalNodes: master, slave1, and slave2).

2 Start namenodes

2.1 Format one namenode by running hadoop namenode -format and then start it by running hadoop-daemon.sh start namenode.

2.2 Copy over the contents of the formatted namenode's metadata directories to the other namenode by running hadoop namenode -bootstrapStandby on the unformatted namenode machine. Then start this namenode by running hadoop-daemon.sh start namenode.

2.3 Because I did not configure the automatic failover feature, both namenodes are initially in the Standby state. Transition one namenode to Active by running hdfs haadmin -transitionToActive <serviceId> (nn1 in my case).

3 Start datanodes

2.2.1 Timing order

No

2.2.2 Events order externally controllable?

Not applicable; no specific event order needs to be controlled.

2.3 Can the logs tell how to reproduce the failure?

Yes

2.4 How many machines needed?

3 machines, because QJM requires at least 3 JournalNodes.

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

… …

--- the previous checkpoints succeeded and no error messages were printed

2013-08-19 18:00:20,725 INFO org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer: Triggering checkpoint because it has been 180 seconds since the last checkpoint, which exceeds the configured interval 150

2013-08-19 18:00:20,726 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Saving image file /home/research/tmp/hadoop-2.0.3/name/current/fsimage.ckpt_0000000000000000020 using no compression

2013-08-19 18:00:20,728 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file of size 123 saved in 0 seconds.

2013-08-19 18:00:20,731 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 16

2013-08-19 18:00:20,731 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/home/research/tmp/hadoop-2.0.3/name/current/fsimage_0000000000000000014, cpktTxId=0000000000000000014)

2013-08-19 18:00:21,066 WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Unable to determine input streams from QJM to [192.168.59.179:8485, 192.168.59.172:8485, 192.168.59.170:8485]. Skipping.

org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:

192.168.59.170:8485: Asked for firstTxId 2 which is in the middle of file /home/research/tmp/hadoop-2.0.3/journal/mycluster/current/edits_0000000000000000001-0000000000000000002

    at org.apache.hadoop.hdfs.server.namenode.FileJournalManager.getRemoteEditLogs(FileJournalManager.java:189)

    at org.apache.hadoop.hdfs.qjournal.server.Journal.getEditLogManifest(Journal.java:630)

    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:182)

    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:203)

    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:14028)

    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)

    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)

    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1735)

    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1731)

    at java.security.AccessController.doPrivileged(Native Method)

    at javax.security.auth.Subject.doAs(Subject.java:396)

    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1441)

    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1729)

192.168.59.179:8485: Asked for firstTxId 2 which is in the middle of file /home/research/tmp/hadoop-2.0.3/journal/mycluster/current/edits_0000000000000000001-0000000000000000002

    at org.apache.hadoop.hdfs.server.namenode.FileJournalManager.getRemoteEditLogs(FileJournalManager.java:189)

    at org.apache.hadoop.hdfs.qjournal.server.Journal.getEditLogManifest(Journal.java:630)

    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:182)

    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:203)

    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:14028)

    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)

    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)

    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1735)

    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1731)

    at java.security.AccessController.doPrivileged(Native Method)

    at javax.security.auth.Subject.doAs(Subject.java:396)

    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1441)

    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1729)

192.168.59.172:8485: Asked for firstTxId 2 which is in the middle of file /home/research/tmp/hadoop-2.0.3/journal/mycluster/current/edits_0000000000000000001-0000000000000000002

    at org.apache.hadoop.hdfs.server.namenode.FileJournalManager.getRemoteEditLogs(FileJournalManager.java:189)

    at org.apache.hadoop.hdfs.qjournal.server.Journal.getEditLogManifest(Journal.java:630)

    at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:182)

    at org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:203)

    at org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:14028)

    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)

    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)

    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1735)

    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1731)

    at java.security.AccessController.doPrivileged(Native Method)

    at javax.security.auth.Subject.doAs(Subject.java:396)

    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1441)

    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1729)

    at org.apache.hadoop.hdfs.qjournal.client.QuorumException.create(QuorumException.java:81)

    at org.apache.hadoop.hdfs.qjournal.client.QuorumCall.rethrowException(QuorumCall.java:213)

    at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:142)

    at org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager.selectInputStreams(QuorumJournalManager.java:457)

    at org.apache.hadoop.hdfs.server.namenode.JournalSet.selectInputStreams(JournalSet.java:252)

    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.selectInputStreams(FSEditLog.java:1133)

    at org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager.purgeOldStorage(NNStorageRetentionManager.java:114)

    at org.apache.hadoop.hdfs.server.namenode.FSImage.purgeOldStorage(FSImage.java:940)

    at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:925)

    at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:862)

    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.doCheckpoint(StandbyCheckpointer.java:165)

    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer.access$1100(StandbyCheckpointer.java:53)

    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.doWork(StandbyCheckpointer.java:297)

    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.access$300(StandbyCheckpointer.java:210)

    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread$1.run(StandbyCheckpointer.java:230)

    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:455)

    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$CheckpointerThread.run(StandbyCheckpointer.java:226)

2013-08-19 18:00:21,083 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Opening connection to http://master:52000/getimage?putimage=1&txid=20&port=52000&storageInfo=-40:499919378:0:CID-ae149561-ec63-4e80-bdff-b4d858c58bf0

2013-08-19 18:00:21,171 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Transfer took 0.09s at 0.00 KB/s

2013-08-19 18:00:21,171 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Uploaded image with txid 20 to namenode at master:52000

3.2 Backward inference

The log messages show the call stacks of the exceptions. We just need to follow these call stacks and check each caller function.

4. Root cause

When the namenode wants to purge old edit log files, purgeOldStorage calls the selectInputStreams function with a transaction ID (purgeLogsFrom). Once the namenode has written more edit log transactions than dfs.namenode.num.extra.edits.retained, this transaction ID falls in the middle of an edit log segment, so the JournalNodes reject the request and throw an exception. A worked sketch of this arithmetic follows the code excerpts below.

Namenode (NNStorageRetentionManager.purgeOldStorage):

  public void purgeOldStorage() throws IOException {

        FSImageTransactionalStorageInspector inspector =

          new FSImageTransactionalStorageInspector();

        storage.inspectStorageDirs(inspector);

        long minImageTxId = getImageTxIdToRetain(inspector);

        purgeCheckpointsOlderThan(inspector, minImageTxId);

        long minimumRequiredTxId = minImageTxId + 1;

        long purgeLogsFrom = Math.max(0, minimumRequiredTxId - numExtraEditsToRetain);

---- minimumRequiredTxId is initially 1 and increases by 2 after each checkpoint

---- numExtraEditsToRetain = the value of dfs.namenode.num.extra.edits.retained (15 here)

---- purgeLogsFrom stays 0 while minimumRequiredTxId is no larger than numExtraEditsToRetain;

     after that it moves forward and can land in the middle of an edit log segment

        ArrayList<EditLogInputStream> editLogs = new ArrayList<EditLogInputStream>();

        purgeableLogs.selectInputStreams(editLogs, purgeLogsFrom, false);

        Collections.sort(editLogs, new Comparator<EditLogInputStream>() {

JournalNode (FileJournalManager.getRemoteEditLogs):

  public List<RemoteEditLog> getRemoteEditLogs(long firstTxId) throws IOException {

---- firstTxId equals purgeLogsFrom during a checkpoint

        … …

          } else if ((firstTxId > elf.getFirstTxId()) &&

                     (firstTxId <= elf.getLastTxId())) {

            // Note that this behavior is different from getLogFiles below.

            throw new IllegalStateException("Asked for firstTxId " + firstTxId

                + " which is in the middle of file " + elf.file);

          }

        … …
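
To make this concrete, the following standalone sketch replays the retention arithmetic above. It is not Hadoop code: it assumes an otherwise idle HA cluster in which each checkpoint finalizes a two-transaction edit log segment (edits_1-2, edits_3-4, ...) and dfs.namenode.num.extra.edits.retained = 15, matching the configuration and log excerpts in sections 2.1 and 3.1.

public class PurgeTxIdSketch {
  public static void main(String[] args) {
    final long numExtraEditsToRetain = 15;   // dfs.namenode.num.extra.edits.retained
    // minImageTxId is the txid of the oldest retained fsimage; on an otherwise
    // idle cluster it advances by 2 per checkpoint.
    for (long minImageTxId = 0; minImageTxId <= 20; minImageTxId += 2) {
      long minimumRequiredTxId = minImageTxId + 1;
      long purgeLogsFrom = Math.max(0, minimumRequiredTxId - numExtraEditsToRetain);
      // Finalized segments cover txids [1,2], [3,4], [5,6], ... so any non-zero even
      // purgeLogsFrom (2, 4, ...) is the second transaction of a segment, i.e.
      // "in the middle of" that edits file.
      boolean midSegment = purgeLogsFrom > 0 && purgeLogsFrom % 2 == 0;
      System.out.println("minImageTxId=" + minImageTxId
          + " purgeLogsFrom=" + purgeLogsFrom
          + (midSegment ? "   <-- JournalNodes throw IllegalStateException" : ""));
    }
  }
}

At minImageTxId = 16 (the "Going to retain 2 images with txid >= 16" line in the log above), purgeLogsFrom works out to 17 - 15 = 2, which is exactly the firstTxId the JournalNodes reject as being in the middle of edits_0000000000000000001-0000000000000000002.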

4.1 Category:

Semantic

5. Fix

diff --git hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/test/GenericTestUtils.java hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/test/GenericTestUtils.java

index 1625625..06376e5 100644

--- hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/test/GenericTestUtils.java

+++ hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/test/GenericTestUtils.java

@@ -95,8 +95,8 @@ public static void assertGlobEquals(File dir, String pattern,

         Set<String> expectedSet = Sets.newTreeSet(

             Arrays.asList(expectedMatches));

         Assert.assertEquals("Bad files matching " + pattern + " in " + dir,

-            Joiner.on(",").join(found),

-            Joiner.on(",").join(expectedSet));

+            Joiner.on(",").join(expectedSet),

+            Joiner.on(",").join(found));

   }

   

   public static void assertExceptionContains(String string, Throwable t) {

diff --git hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/BookKeeperJournalManager.java hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/BookKeeperJournalManager.java

index 2baf4dc..ed57eea 100644

--- hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/BookKeeperJournalManager.java

+++ hadoop-hdfs-project/hadoop-hdfs/src/contrib/bkjournal/src/main/java/org/apache/hadoop/contrib/bkjournal/BookKeeperJournalManager.java

@@ -500,9 +500,15 @@ public void finalizeLogSegment(long firstTxId, long lastTxId)

         }

   }

 

-  @Override

   public void selectInputStreams(Collection<EditLogInputStream> streams,

           long fromTxId, boolean inProgressOk) throws IOException {

+        selectInputStreams(streams, fromTxId, inProgressOk, true);

+  }

+

+  @Override

+  public void selectInputStreams(Collection<EditLogInputStream> streams,

+          long fromTxId, boolean inProgressOk, boolean forReading)

+          throws IOException {

         List<EditLogLedgerMetadata> currentLedgerList = getLedgerList(fromTxId,

             inProgressOk);

         try {

diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/AsyncLogger.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/AsyncLogger.java

index 99b9d6b..dda1de1 100644

--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/AsyncLogger.java

+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/AsyncLogger.java

@@ -109,7 +109,7 @@ AsyncLogger createLogger(Configuration conf, NamespaceInfo nsInfo,

        * Fetch the list of edit logs available on the remote node.

        */

   public ListenableFuture<RemoteEditLogManifest> getEditLogManifest(

-          long fromTxnId);

+          long fromTxnId, boolean forReading);

 

   /**

        * Prepare recovery. See the HDFS-3077 design document for details.

diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/AsyncLoggerSet.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/AsyncLoggerSet.java

index 16cd548..3beff86 100644

--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/AsyncLoggerSet.java

+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/AsyncLoggerSet.java

@@ -263,13 +263,13 @@ void appendHtmlReport(StringBuilder sb) {

   }

 

   public QuorumCall<AsyncLogger, RemoteEditLogManifest>

-          getEditLogManifest(long fromTxnId) {

+          getEditLogManifest(long fromTxnId, boolean forReading) {

         Map<AsyncLogger,

             ListenableFuture<RemoteEditLogManifest>> calls

             = Maps.newHashMap();

         for (AsyncLogger logger : loggers) {

           ListenableFuture<RemoteEditLogManifest> future =

-              logger.getEditLogManifest(fromTxnId);

+              logger.getEditLogManifest(fromTxnId, forReading);

           calls.put(logger, future);

         }

         return QuorumCall.create(calls);

diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/IPCLoggerChannel.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/IPCLoggerChannel.java

index c64acd8..9115804 100644

--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/IPCLoggerChannel.java

+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/IPCLoggerChannel.java

@@ -519,12 +519,12 @@ public Void call() throws Exception {

 

   @Override

   public ListenableFuture<RemoteEditLogManifest> getEditLogManifest(

-          final long fromTxnId) {

+          final long fromTxnId, final boolean forReading) {

         return executor.submit(new Callable<RemoteEditLogManifest>() {

           @Override

           public RemoteEditLogManifest call() throws IOException {

             GetEditLogManifestResponseProto ret = getProxy().getEditLogManifest(

-                journalId, fromTxnId);

+                journalId, fromTxnId, forReading);

             // Update the http port, since we need this to build URLs to any of the

             // returned logs.

             httpPort = ret.getHttpPort();

diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumJournalManager.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumJournalManager.java

index 6b3503d..3852001 100644

--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumJournalManager.java

+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumJournalManager.java

@@ -445,13 +445,18 @@ public void recoverUnfinalizedSegments() throws IOException {

   public void close() throws IOException {

         loggers.close();

   }

+  

+  public void selectInputStreams(Collection<EditLogInputStream> streams,

+          long fromTxnId, boolean inProgressOk) throws IOException {

+        selectInputStreams(streams, fromTxnId, inProgressOk, true);

+  }

 

   @Override

   public void selectInputStreams(Collection<EditLogInputStream> streams,

-          long fromTxnId, boolean inProgressOk) throws IOException {

+          long fromTxnId, boolean inProgressOk, boolean forReading) throws IOException {

 

         QuorumCall<AsyncLogger, RemoteEditLogManifest> q =

-            loggers.getEditLogManifest(fromTxnId);

+            loggers.getEditLogManifest(fromTxnId, forReading);

         Map<AsyncLogger, RemoteEditLogManifest> resps =

             loggers.waitForWriteQuorum(q, selectInputStreamsTimeoutMs,

                 "selectInputStreams");

diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/QJournalProtocol.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/QJournalProtocol.java

index 769d084..63d7a75 100644

--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/QJournalProtocol.java

+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocol/QJournalProtocol.java

@@ -123,10 +123,12 @@ public void purgeLogsOlderThan(RequestInfo requestInfo, long minTxIdToKeep)

   /**

        * @param jid the journal from which to enumerate edits

        * @param sinceTxId the first transaction which the client cares about

+   * @param forReading whether or not the caller intends to read from the edit

+   *            logs

        * @return a list of edit log segments since the given transaction ID.

        */

   public GetEditLogManifestResponseProto getEditLogManifest(

-          String jid, long sinceTxId) throws IOException;

+          String jid, long sinceTxId, boolean forReading) throws IOException;

   

   /**

        * Begin the recovery process for a given segment. See the HDFS-3077

diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolServerSideTranslatorPB.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolServerSideTranslatorPB.java

index 653c069..bdebb38 100644

--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolServerSideTranslatorPB.java

+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolServerSideTranslatorPB.java

@@ -202,7 +202,8 @@ public GetEditLogManifestResponseProto getEditLogManifest(

         try {

           return impl.getEditLogManifest(

               request.getJid().getIdentifier(),

-              request.getSinceTxId());

+              request.getSinceTxId(),

+              request.getForReading());

         } catch (IOException e) {

           throw new ServiceException(e);

         }

diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolTranslatorPB.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolTranslatorPB.java

index 290a62a..7b36ff5 100644

--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolTranslatorPB.java

+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/protocolPB/QJournalProtocolTranslatorPB.java

@@ -228,12 +228,13 @@ public void purgeLogsOlderThan(RequestInfo reqInfo, long minTxIdToKeep)

 

   @Override

   public GetEditLogManifestResponseProto getEditLogManifest(String jid,

-          long sinceTxId) throws IOException {

+          long sinceTxId, boolean forReading) throws IOException {

         try {

           return rpcProxy.getEditLogManifest(NULL_CONTROLLER,

               GetEditLogManifestRequestProto.newBuilder()

                 .setJid(convertJournalId(jid))

                 .setSinceTxId(sinceTxId)

+                .setForReading(forReading)

                 .build());

         } catch (ServiceException e) {

           throw ProtobufHelper.getRemoteException(e);

diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java

index 58965ce..16b5694 100644

--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java

+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java

@@ -627,14 +627,14 @@ private void purgePaxosDecision(long segmentTxId) throws IOException {

   /**

        * @see QJournalProtocol#getEditLogManifest(String, long)

        */

-  public RemoteEditLogManifest getEditLogManifest(long sinceTxId)

-          throws IOException {

+  public RemoteEditLogManifest getEditLogManifest(long sinceTxId,

+          boolean forReading) throws IOException {

         // No need to checkRequest() here - anyone may ask for the list

         // of segments.

         checkFormatted();

         

         RemoteEditLogManifest manifest = new RemoteEditLogManifest(

-            fjm.getRemoteEditLogs(sinceTxId));

+            fjm.getRemoteEditLogs(sinceTxId, forReading));

         return manifest;

   }

 

diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNodeRpcServer.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNodeRpcServer.java

index 05a4956..d00ba2d 100644

--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNodeRpcServer.java

+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/JournalNodeRpcServer.java

@@ -175,10 +175,10 @@ public void purgeLogsOlderThan(RequestInfo reqInfo, long minTxIdToKeep)

 

   @Override

   public GetEditLogManifestResponseProto getEditLogManifest(String jid,

-          long sinceTxId) throws IOException {

+          long sinceTxId, boolean forReading) throws IOException {

         

         RemoteEditLogManifest manifest = jn.getOrCreateJournal(jid)

-            .getEditLogManifest(sinceTxId);

+            .getEditLogManifest(sinceTxId, forReading);

         

         return GetEditLogManifestResponseProto.newBuilder()

             .setManifest(PBHelper.convert(manifest))

diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/BackupJournalManager.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/BackupJournalManager.java

index 5420b12..46f4309 100644

--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/BackupJournalManager.java

+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/BackupJournalManager.java

@@ -77,7 +77,7 @@ public void purgeLogsOlderThan(long minTxIdToKeep)

 

   @Override

   public void selectInputStreams(Collection<EditLogInputStream> streams,

-          long fromTxnId, boolean inProgressOk) {

+          long fromTxnId, boolean inProgressOk, boolean forReading) {

         // This JournalManager is never used for input. Therefore it cannot

         // return any transactions

   }

diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java

index 7b08b77..c848614 100644

--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java

+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java

@@ -277,7 +277,7 @@ synchronized void openForWrite() throws IOException {

         // Safety check: we should never start a segment if there are

         // newer txids readable.

         List<EditLogInputStream> streams = new ArrayList<EditLogInputStream>();

-        journalSet.selectInputStreams(streams, segmentTxId, true);

+        journalSet.selectInputStreams(streams, segmentTxId, true, true);

         if (!streams.isEmpty()) {

           String error = String.format("Cannot start writing at txid %s " +

             "when there is a stream available for read: %s",

@@ -939,7 +939,7 @@ void setMetricsForTests(NameNodeMetrics metrics) {

        */

   public synchronized RemoteEditLogManifest getEditLogManifest(long fromTxId)

           throws IOException {

-        return journalSet.getEditLogManifest(fromTxId);

+        return journalSet.getEditLogManifest(fromTxId, true);

   }

 

   /**

@@ -1233,8 +1233,8 @@ synchronized void recoverUnclosedStreams() {

   }

   

   public void selectInputStreams(Collection<EditLogInputStream> streams,

-          long fromTxId, boolean inProgressOk) {

-        journalSet.selectInputStreams(streams, fromTxId, inProgressOk);

+          long fromTxId, boolean inProgressOk, boolean forReading) {

+        journalSet.selectInputStreams(streams, fromTxId, inProgressOk, forReading);

   }

 

   public Collection<EditLogInputStream> selectInputStreams(

@@ -1253,7 +1253,7 @@ public void selectInputStreams(Collection<EditLogInputStream> streams,

           long fromTxId, long toAtLeastTxId, MetaRecoveryContext recovery,

           boolean inProgressOk) throws IOException {

         List<EditLogInputStream> streams = new ArrayList<EditLogInputStream>();

-        selectInputStreams(streams, fromTxId, inProgressOk);

+        selectInputStreams(streams, fromTxId, inProgressOk, true);

 

         try {

           checkForGaps(streams, fromTxId, toAtLeastTxId, inProgressOk);

diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FileJournalManager.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FileJournalManager.java

index 0f63192..435216c 100644

--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FileJournalManager.java

+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FileJournalManager.java

@@ -164,10 +164,13 @@ public void purgeLogsOlderThan(long minTxIdToKeep)

   /**

        * Find all editlog segments starting at or above the given txid.

        * @param fromTxId the txnid which to start looking

+   * @param forReading whether or not the caller intends to read from the edit

+   *            logs

        * @return a list of remote edit logs

        * @throws IOException if edit logs cannot be listed.

        */

-  public List<RemoteEditLog> getRemoteEditLogs(long firstTxId) throws IOException {

+  public List<RemoteEditLog> getRemoteEditLogs(long firstTxId,

+          boolean forReading) throws IOException {

         File currentDir = sd.getCurrentDir();

         List<EditLogFile> allLogFiles = matchEditLogs(currentDir);

         List<RemoteEditLog> ret = Lists.newArrayListWithCapacity(

@@ -177,11 +180,15 @@ public void purgeLogsOlderThan(long minTxIdToKeep)

           if (elf.hasCorruptHeader() || elf.isInProgress()) continue;

           if (elf.getFirstTxId() >= firstTxId) {

             ret.add(new RemoteEditLog(elf.firstTxId, elf.lastTxId));

-          } else if ((firstTxId > elf.getFirstTxId()) &&

-                     (firstTxId <= elf.getLastTxId())) {

-            // Note that this behavior is different from getLogFiles below.

-            throw new IllegalStateException("Asked for firstTxId " + firstTxId

-                + " which is in the middle of file " + elf.file);

+          } else if (elf.getFirstTxId() < firstTxId && firstTxId <= elf.getLastTxId()) {

+            // If the firstTxId is in the middle of an edit log segment

+            if (forReading) {

+              // Note that this behavior is different from getLogFiles below.

+              throw new IllegalStateException("Asked for firstTxId " + firstTxId

+                  + " which is in the middle of file " + elf.file);

+            } else {

+              ret.add(new RemoteEditLog(elf.firstTxId, elf.lastTxId));

+            }

           }

         }

         

@@ -242,7 +249,7 @@ public void purgeLogsOlderThan(long minTxIdToKeep)

   @Override

   synchronized public void selectInputStreams(

           Collection<EditLogInputStream> streams, long fromTxId,

-          boolean inProgressOk) throws IOException {

+          boolean inProgressOk, boolean forReading) throws IOException {

         List<EditLogFile> elfs = matchEditLogs(sd.getCurrentDir());

         LOG.debug(this + ": selecting input streams starting at " + fromTxId +

             (inProgressOk ? " (inProgress ok) " : " (excluding inProgress) ") +

diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java

index 8ed073d..396524d 100644

--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java

+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/JournalSet.java

@@ -235,10 +235,12 @@ public void apply(JournalAndStream jas) throws IOException {

        *                             may not be sorted-- this is up to the caller.

        * @param fromTxId             The transaction ID to start looking for streams at

        * @param inProgressOk         Should we consider unfinalized streams?

+   * @param forReading           Whether or not the caller intends to read from

+   *                             the returned streams.

        */

   @Override

   public void selectInputStreams(Collection<EditLogInputStream> streams,

-          long fromTxId, boolean inProgressOk) {

+          long fromTxId, boolean inProgressOk, boolean forReading) {

         final PriorityQueue<EditLogInputStream> allStreams =

             new PriorityQueue<EditLogInputStream>(64,

                 EDIT_LOG_INPUT_STREAM_COMPARATOR);

@@ -248,7 +250,8 @@ public void selectInputStreams(Collection<EditLogInputStream> streams,

             continue;

           }

           try {

-            jas.getManager().selectInputStreams(allStreams, fromTxId, inProgressOk);

+            jas.getManager().selectInputStreams(allStreams, fromTxId, inProgressOk,

+                forReading);

           } catch (IOException ioe) {

             LOG.warn("Unable to determine input streams from " + jas.getManager() +

                 ". Skipping.", ioe);

@@ -587,14 +590,15 @@ public void apply(JournalAndStream jas) throws IOException {

        * @param fromTxId Starting transaction id to read the logs.

        * @return RemoteEditLogManifest object.

        */

-  public synchronized RemoteEditLogManifest getEditLogManifest(long fromTxId) {

+  public synchronized RemoteEditLogManifest getEditLogManifest(long fromTxId,

+          boolean forReading) {

         // Collect RemoteEditLogs available from each FileJournalManager

         List<RemoteEditLog> allLogs = Lists.newArrayList();

         for (JournalAndStream j : journals) {

           if (j.getManager() instanceof FileJournalManager) {

             FileJournalManager fjm = (FileJournalManager)j.getManager();

             try {

-              allLogs.addAll(fjm.getRemoteEditLogs(fromTxId));

+              allLogs.addAll(fjm.getRemoteEditLogs(fromTxId, forReading));

             } catch (Throwable t) {

               LOG.warn("Cannot list edit logs in " + fjm, t);

             }

diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LogsPurgeable.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LogsPurgeable.java

index d644ed5..4261825 100644

--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LogsPurgeable.java

+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/LogsPurgeable.java

@@ -42,12 +42,13 @@

        *

        * @param fromTxId the first transaction id we want to read

        * @param inProgressOk whether or not in-progress streams should be returned

+   * @param forReading whether or not the caller intends to read from the edit logs

        *

        * @return a list of streams

        * @throws IOException if the underlying storage has an error or is otherwise

        * inaccessible

        */

   void selectInputStreams(Collection<EditLogInputStream> streams,

-          long fromTxId, boolean inProgressOk) throws IOException;

+          long fromTxId, boolean inProgressOk, boolean forReading) throws IOException;

   

 }

diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NNStorageRetentionManager.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NNStorageRetentionManager.java

index 75d1bd8..6b545a6 100644

--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NNStorageRetentionManager.java

+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NNStorageRetentionManager.java

@@ -108,7 +108,7 @@ public void purgeOldStorage() throws IOException {

         long purgeLogsFrom = Math.max(0, minimumRequiredTxId - numExtraEditsToRetain);

         

         ArrayList<EditLogInputStream> editLogs = new ArrayList<EditLogInputStream>();

-        purgeableLogs.selectInputStreams(editLogs, purgeLogsFrom, false);

+        purgeableLogs.selectInputStreams(editLogs, purgeLogsFrom, false, false);

         Collections.sort(editLogs, new Comparator<EditLogInputStream>() {

           @Override

           public int compare(EditLogInputStream a, EditLogInputStream b) {

diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SecondaryNameNode.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SecondaryNameNode.java

index e602b76..4384126 100644

--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SecondaryNameNode.java

+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SecondaryNameNode.java

@@ -820,7 +820,7 @@ public void purgeLogsOlderThan(long minTxIdToKeep) throws IOException {

 

           @Override

           public void selectInputStreams(Collection<EditLogInputStream> streams,

-              long fromTxId, boolean inProgressOk) {

+              long fromTxId, boolean inProgressOk, boolean forReading) {

             Iterator<StorageDirectory> iter = storage.dirIterator();

             while (iter.hasNext()) {

               StorageDirectory dir = iter.next();        

5.1 How?

The patch addresses the issue by plumbing through a "forReading" parameter to selectInputStreams/getEditLogManifest so that the namenode edit log purger can indicate to the journalnodes that it won't be reading from the transaction ID it's asking for, and thus the journalnodes shouldn't error out if the txid isn't on an edit log segment boundary.
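
For illustration, here is a minimal, self-contained sketch of the patched decision logic in getRemoteEditLogs(). It is not the real FileJournalManager class: the Segment helper and the sample transaction ranges are made up for the example, and only the if/else structure mirrors the FileJournalManager hunk in the patch above.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ForReadingSketch {

  // Stand-in for an edit log segment (hypothetical; not a Hadoop class).
  static class Segment {
    final long firstTxId, lastTxId;
    Segment(long first, long last) { firstTxId = first; lastTxId = last; }
    public String toString() { return "edits_" + firstTxId + "-" + lastTxId; }
  }

  static List<Segment> getRemoteEditLogs(List<Segment> segments, long firstTxId,
                                         boolean forReading) {
    List<Segment> ret = new ArrayList<Segment>();
    for (Segment elf : segments) {
      if (elf.firstTxId >= firstTxId) {
        ret.add(elf);                                  // segment starts at or after firstTxId
      } else if (elf.firstTxId < firstTxId && firstTxId <= elf.lastTxId) {
        // firstTxId falls in the middle of this segment.
        if (forReading) {
          // Readers keep the old strict behavior.
          throw new IllegalStateException("Asked for firstTxId " + firstTxId
              + " which is in the middle of file " + elf);
        } else {
          // The purger only needs to know the segment exists, so include it.
          ret.add(elf);
        }
      }
    }
    return ret;
  }

  public static void main(String[] args) {
    List<Segment> segs = Arrays.asList(
        new Segment(1, 2), new Segment(3, 4), new Segment(5, 6));
    // Purging path (forReading = false): no exception; the segment containing
    // txid 2 and all later segments are returned.
    System.out.println(getRemoteEditLogs(segs, 2, false));
    // The reading path (forReading = true) with the same txid would still throw
    // IllegalStateException, exactly as before the patch.
  }
}

With this change, NNStorageRetentionManager.purgeOldStorage() passes forReading = false (see the patch hunk above), so the purger receives a complete manifest instead of tripping the IllegalStateException, while code paths that actually read edits keep the strict segment-boundary check.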