HDFS-3415 Report

https://issues.apache.org/jira/browse/HDFS-3415

1. Symptom

The NameNode fails to start. In the NN's log, we see the following exception:

2013-07-10 16:38:02,304 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join

java.lang.NullPointerException

        at org.apache.hadoop.hdfs.server.namenode.NNStorage.getStorageFile(NNStorage.java:686)

        at org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getEditsInStorageDir(FSImagePreTransactionalStorageInspector.java:244)

        at org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getLatestEditsFiles(FSImagePreTransactionalStorageInspector.java:262)

        at org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getEditLogStreams(FSImagePreTransactionalStorageInspector.java:279)

        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:596)

        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:247)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:498)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:390)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:354)

        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:369)

        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:403)

        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:570)

        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:551)

        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1114)

        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1170)

1.1 Severity

Major.

1.2 Was there exception thrown?

Yes.

2. How to reproduce this failure

2.0 Version

2.0.0-alpha

2.1 Configuration

One node is enough; only the NN is needed.

The only non-default setting is in hdfs-site.xml, where dfs.namenode.name.dir lists two name directories (so that their layout versions can later be made to diverge):

  <property>

    <name>dfs.namenode.name.dir</name>

<value>file:/home/ding/research/hadoop/hadoop-2.0.0-alpha-src/hadoop-dist/target/hadoop-2.0.0-alpha/dfs/name,file:/home/ding/research/hadoop/hadoop-2.0.0-alpha-src/hadoop-dist/target/hadoop-2.0.0-alpha/dfs/name1</value>

    <final>true</final>

  </property>

2.2 Reproduction procedure

$ hdfs namenode -format

$ hadoop-daemon.sh start namenode

$ hadoop-daemon.sh stop namenode

Edit dfs/name1/current/VERSION and change the layoutVersion value from “-40” to “123” (an illustrative VERSION file is shown after this procedure).

$ hadoop-daemon.sh start namenode

  The NameNode fails to start.

2.3 Can the logs tell how to reproduce the failure?

Hard. This step:

Edit dfs/name1/current/VERSION and change layoutVersion from “-40” to “123”

leaves no direct trace in the logs.

But there are clues:

2013-07-10 16:38:02,295 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Ignoring one or more storage directories with old layouts

  --- This message is never printed in a correct execution, so its presence is a clue that a storage directory has an unexpected layout version.

NNStorage.java (the “Ding: ...” warnings are debug statements we added while investigating, not part of the original code):

    if (LayoutVersion.supports(Feature.TXID_BASED_LAYOUT, minLayoutVersion)) {

      FSImage.LOG.warn("Ding: Creating FSImageTransactionalStorageInspector");

      inspector = new FSImageTransactionalStorageInspector();

      if (!LayoutVersion.supports(Feature.TXID_BASED_LAYOUT, maxLayoutVersion)) {

        --- maxLayoutVersion in the buggy case is 123. Had VERSION not been modified, it would be -40, and this message would not be printed.

        FSImage.LOG.warn("Ignoring one or more storage directories with old layouts");

      }

    } else {

      FSImage.LOG.warn("Ding: Creating FSImagePreTransactionalStorageInspector");

      inspector = new FSImagePreTransactionalStorageInspector();

    }

2.4 How many machines needed?

One is enough.

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

2013-07-10 16:38:02,304 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join

java.lang.NullPointerException

        at org.apache.hadoop.hdfs.server.namenode.NNStorage.getStorageFile(NNStorage.java:686)

        at org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getEditsInStorageDir(FSImagePreTransactionalStorageInspector.java:244)

        at org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getLatestEditsFiles(FSImagePreTransactionalStorageInspector.java:262)

        at org.apache.hadoop.hdfs.server.namenode.FSImagePreTransactionalStorageInspector.getEditLogStreams(FSImagePreTransactionalStorageInspector.java:279)

        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:596)

        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:247)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:498)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:390)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:354)

        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:369)

        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:403)

        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:570)

        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:551)

        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1114)

        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1170)

3.2 Backward inference

From the first stack frame (NNStorage.java:686), we can locate the code:

/**

   * Get a storage file for one of the files that doesn't need a txid associated

   * (e.g version, seen_txid)

   */

  static File getStorageFile(StorageDirectory sd, NameNodeFile type) {

    return new File(sd.getCurrentDir(), type.getName());

  }

--- Challenge: from this one-line body we cannot tell which reference is null: the StorageDirectory sd or the NameNodeFile type. A debugging sketch follows.

Going up the call stack, we arrive at this function (FSImage.loadFSImage):

 /**

   * Choose latest image from one of the directories,

   * load it and merge with the edits from that directory.

   *

   * Saving and loading fsimage should never trigger symlink resolution.

   * The paths that are persisted do not have *intermediate* symlinks

   * because intermediate symlinks are resolved at the time files,

   * directories, and symlinks are created. All paths accessed while

   * loading or saving fsimage should therefore only see symlinks as

   * the final path component, and the functions called below do not

   * resolve symlinks that are the final path component.

   *

   * @return whether the image should be saved

   * @throws IOException

   */

  boolean loadFSImage(FSNamesystem target, MetaRecoveryContext recovery)

      throws IOException {

    .. ..

    if (LayoutVersion.supports(Feature.TXID_BASED_LAYOUT,

                               getLayoutVersion())) {

      // If we're open for write, we're either non-HA or we're the active NN, so

      // we better be able to load all the edits. If we're the standby NN, it's

      // OK to not be able to read all of edits right now.

      long toAtLeastTxId = editLog.isOpenForWrite() ? inspector.getMaxSeenTxId() : 0;

      editStreams = editLog.selectInputStreams(imageFile.getCheckpointTxId() + 1,

          toAtLeastTxId, false);

    } else {

      editStreams = FSImagePreTransactionalStorageInspector

        .getEditLogStreams(storage);

    }

 

--- Here getLayoutVersion() returns the storage-wide layoutVersion field, which was last set when the modified directory (name1, version 123) was read. Layout version 123 does not support TXID_BASED_LAYOUT, so the else branch is taken and a FSImagePreTransactionalStorageInspector is used, even though the directories were actually formatted with the new transaction-ID-based layout. That inspector never finds any old-format files, and it apparently ends up passing a null StorageDirectory to getStorageFile(), producing the NullPointerException above.

In fact, this layout version (123) should be rejected outright. The fix is simply to reject it the first time it is detected, in NNStorage.readAndInspectDirs below:

 /**

   * Iterate over all of the storage dirs, reading their contents to determine

   * their layout versions. Returns an FSImageStorageInspector which has

   * inspected each directory.

   *

   * <b>Note:</b> this can mutate the storage info fields (ctime, version, etc).

   * @throws IOException if no valid storage dirs are found

   */

  FSImageStorageInspector readAndInspectDirs()

      throws IOException {

    int minLayoutVersion = Integer.MAX_VALUE; // the newest

    int maxLayoutVersion = Integer.MIN_VALUE; // the oldest

   

    // First determine what range of layout versions we're going to inspect

    for (Iterator<StorageDirectory> it = dirIterator();

         it.hasNext();) {

      StorageDirectory sd = it.next();

      if (!sd.getVersionFile().exists()) {

        FSImage.LOG.warn("Storage directory " + sd + " contains no VERSION file. Skipping...");

        continue;

      }

      readProperties(sd); // sets layoutVersion

      minLayoutVersion = Math.min(minLayoutVersion, getLayoutVersion());

      maxLayoutVersion = Math.max(maxLayoutVersion, getLayoutVersion());

    }

   

    if (minLayoutVersion > maxLayoutVersion) {

      throw new IOException("No storage directories contained VERSION information");

    }

    assert minLayoutVersion <= maxLayoutVersion;

   

    // If we have any storage directories with the new layout version

    // (ie edits_<txnid>) then use the new inspector, which will ignore

    // the old format dirs.

    FSImageStorageInspector inspector;

    if (LayoutVersion.supports(Feature.TXID_BASED_LAYOUT, minLayoutVersion)) {

      inspector = new FSImageTransactionalStorageInspector();

      if (!LayoutVersion.supports(Feature.TXID_BASED_LAYOUT, maxLayoutVersion)) {

        FSImage.LOG.warn("Ignoring one or more storage directories with old layouts");

   --- Invalid case: in the buggy run minLayoutVersion is -40 and maxLayoutVersion is 123, so execution reaches this point yet only logs a warning. An exception should have been thrown here to reject the inconsistent directories early.

      }

    } else {

      FSImage.LOG.warn("Ding: Creating FSImagePreTransactionalStorageInspector");

      inspector = new FSImagePreTransactionalStorageInspector();

    }

   

    inspectStorageDirs(inspector);

    return inspector;

  }

Patch explained:

diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NNStorage.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NNStorage.java
index d5061b7..01c951a 100644
--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NNStorage.java
+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NNStorage.java
@@ -31,6 +31,7 @@ import java.util.Collection;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
+import java.util.Map;
import java.util.Properties;
import java.util.UUID;
import java.util.concurrent.CopyOnWriteArrayList;
@@ -58,6 +59,7 @@ import org.apache.hadoop.net.DNS;
import com.google.common.base.Preconditions;
import com.google.common.annotations.VisibleForTesting;
import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;

/**
 * NNStorage is responsible for management of the StorageDirectories used by
@@ -1076,13 +1078,14 @@ public class NNStorage extends Storage implements Closeable {
   * inspected each directory.
   *
   * <b>Note:</b> this can mutate the storage info fields (ctime, version, etc).
-   * @throws IOException if no valid storage dirs are found
+   * @throws IOException if no valid storage dirs are found or no valid layout version
   */
  FSImageStorageInspector readAndInspectDirs()
      throws IOException {
-    int minLayoutVersion = Integer.MAX_VALUE; // the newest
-    int maxLayoutVersion = Integer.MIN_VALUE; // the oldest
-    
+    Integer layoutVersion = null;
+    boolean multipleLV = false;
+    StringBuilder layoutVersions = new StringBuilder();
+
    // First determine what range of layout versions we're going to inspect
    for (Iterator<StorageDirectory> it = dirIterator();
         it.hasNext();) {
@@ -1092,24 +1095,29 @@ public class NNStorage extends Storage implements Closeable {
        continue;
      }
      readProperties(sd); // sets layoutVersion
-      minLayoutVersion = Math.min(minLayoutVersion, getLayoutVersion());
-      maxLayoutVersion = Math.max(maxLayoutVersion, getLayoutVersion());
+      int lv = getLayoutVersion();
+      if (layoutVersion == null) {
+        layoutVersion = Integer.valueOf(lv);
+      } else if (!layoutVersion.equals(lv)) {
+        multipleLV = true;
+      }
+      layoutVersions.append("(").append(sd.getRoot()).append(", ").append(lv).append(") ");
    }
   
-    if (minLayoutVersion > maxLayoutVersion) {
+    if (layoutVersion == null) {
      throw new IOException("No storage directories contained VERSION information");
    }
-    assert minLayoutVersion <= maxLayoutVersion;
-    
-    // If we have any storage directories with the new layout version
+    if (multipleLV) {            
+      throw new IOException(
+          "Storage directories containe multiple layout versions: "
+              + layoutVersions);
+    }
+    // If the storage directories are with the new layout version
    // (ie edits_<txnid>) then use the new inspector, which will ignore
    // the old format dirs.
    FSImageStorageInspector inspector;
-    if (LayoutVersion.supports(Feature.TXID_BASED_LAYOUT, minLayoutVersion)) {
+    if (LayoutVersion.supports(Feature.TXID_BASED_LAYOUT, getLayoutVersion())) {
      inspector = new FSImageTransactionalStorageInspector();
-      if (!LayoutVersion.supports(Feature.TXID_BASED_LAYOUT, maxLayoutVersion)) {
-        FSImage.LOG.warn("Ignoring one or more storage directories with old layouts");
-      }
    } else {
      inspector = new FSImagePreTransactionalStorageInspector();
    }

Basically, the patch rejects mismatched layout versions at the very beginning instead of allowing them to propagate. With this patch, the behavior becomes:

2013-07-10 17:52:26,391 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NameNode metrics system...

2013-07-10 17:52:26,392 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system stopped.

2013-07-10 17:52:26,392 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.

2013-07-10 17:52:26,393 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join

java.io.IOException: Storage directories contain multiple layout versions: (/home/ding/research/hadoop/hadoop-2.0.2-alpha-src/hadoop-dist/target/hadoop-2.0.2-alpha/dfs/name, -40) (/home/ding/research/hadoop/hadoop-2.0.2-alpha-src/hadoop-dist/target/hadoop-2.0.2-alpha/dfs/name1, 123)

        at org.apache.hadoop.hdfs.server.namenode.NNStorage.readAndInspectDirs(NNStorage.java:988)

        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:561)

        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:241)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:544)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:423)

        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:385)

        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:402)

        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:436)

        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:612)

        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:593)

        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1176)

        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1240)

2013-07-10 17:52:26,395 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1

2013-07-10 17:52:26,396 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:

4. Root cause

4.1 Category:

Mishandled error condition --- upon detecting mismatching layout versions, the NameNode should bail out immediately instead of letting the bad value propagate.