HDFS-4182 Report

https://issues.apache.org/jira/browse/HDFS-4182

1. Symptom

Memory leak. The secondary namenode leaks NameCache entries.

1.1 Severity

Critical

1.2 Was there exception thrown?

No (at least not until the JVM runs out of memory).

2. How to reproduce this failure

2.0 Version

2.0.2-alpha

2.1 Configuration

No special configuration is required. Just start one DataNode together with the NameNode (NN) and the secondary NameNode (2NN).

2.2 Reproduction procedure

Just start HDFS; the secondary NameNode will begin leaking NameCache entries. To speed up the process, write a large number of small files into the filesystem (a sketch of one way to do this follows).
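
For illustration, here is a minimal sketch that generates many uniquely named empty files through the standard Hadoop FileSystem API. The fs.defaultFS URI, the /tmp/leaktest path, and the file count are arbitrary choices for this example, not taken from the report.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallFileWriter {
  public static void main(String[] args) throws Exception {
    // Point fs.defaultFS at the NameNode under test (URI assumed here).
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://localhost:8020");
    FileSystem fs = FileSystem.get(conf);

    // Each create() adds a new, uniquely named INode; once the 2NN
    // checkpoints, each of these names can be retained in its NameCache.
    for (int i = 0; i < 100000; i++) {
      Path p = new Path("/tmp/leaktest/file-" + i);
      fs.create(p).close();   // a zero-byte file is enough
    }
    fs.close();
  }
}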

2.3 Can the logs tell how to reproduce the failure?

Reproducing this is really straightforward.

2.4 How many machines needed?

Just one.

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

Really hard. There are no logs. Diagnosis relies on a heap dump, inspecting which entries went into the NameCache; Java VisualVM can be used for this. The Java VisualVM output observed on baker221 is summarized in 3.2 below.
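
For reference, one way to capture such a heap dump is jmap -dump:live,format=b,file=2nn.hprof <2NN pid>, then open the .hprof file in Java VisualVM. The sketch below shows the equivalent in-process API (HotSpotDiagnosticMXBean); it dumps the heap of the JVM it runs in, so it is only useful if added to the 2NN process itself. The output path is an arbitrary example.

import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDumpHelper {
  /** Write a .hprof snapshot of the current JVM's heap for offline analysis. */
  public static void dump(String path) throws Exception {
    HotSpotDiagnosticMXBean mxBean = ManagementFactory.newPlatformMXBeanProxy(
        ManagementFactory.getPlatformMBeanServer(),
        "com.sun.management:type=HotSpotDiagnostic",
        HotSpotDiagnosticMXBean.class);
    // live=true restricts the dump to reachable objects, which is what we
    // want when looking for retained NameCache entries.
    mxBean.dumpHeap(path, true);
  }

  public static void main(String[] args) throws Exception {
    dump("/tmp/2nn-heap.hprof");
  }
}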

3.2 Backward inference

In fact, Java VisualVM shows that most of the secondary NameNode's memory went to NameCache entries.

Beyond this there is not much evidence, so we need to look into the source code to debug. Fortunately, NameCache.java is a short file:

/**
 * Caches frequently used names to facilitate reuse.
 * (example: byte[] representation of the file name in {@link INode}).
 *
 * This class is used by initially adding all the file names. Cache
 * tracks the number of times a name is used in a transient map. It promotes
 * a name used more than {@code useThreshold} to the cache.
 *
 * Once all the names are added, {@link #initialized()} should be called to
 * finish initialization. The transient map where use count is tracked is
 * discarded and cache is ready for use.
 *
 * <p>
 * This class must be synchronized externally.
 *
 * @param <K> name to be added to the cache
 */
class NameCache<K> {
  /** Indicates whether initialization has finished */
  private boolean initialized = false;

  /** Cached names */
  final HashMap<K, K> cache = new HashMap<K, K>();

  /** Names and their use counts, tracked during initialization */
  Map<K, UseCount> transientMap = new HashMap<K, UseCount>();

  /**
   * Add a given name to the cache or track its use count.
   * If the name already exists, the internal value is returned.
   *
   * @param name name to be looked up
   * @return internal value for the name if found; otherwise null
   */
  K put(final K name) {
    LOG.info("Ding: namecache.put " + name);
    K internal = cache.get(name);
    if (internal != null) {
      lookups++;
      return internal;
    }

    // Track the usage count only during initialization
    if (!initialized) {
      UseCount useCount = transientMap.get(name);
      if (useCount != null) {
        useCount.increment();
        if (useCount.get() >= useThreshold) {
          promote(name);
        }
        return useCount.value;
      }
      useCount = new UseCount(name);
      transientMap.put(name, useCount);
    }
    return null;
  }

  /**
   * Mark the name cache as initialized. The use count is no longer tracked
   * and the transient map used for initializing the cache is discarded to
   * save heap space.
   */
  void initialized() {
    LOG.info("initialized with " + size() + " entries " + lookups + " lookups");
    this.initialized = true;
    transientMap.clear();
    transientMap = null;
  }

  /** Promote a frequently used name to the cache */
  private void promote(final K name) {
    LOG.info("Ding: namecache.promote " + name);
    transientMap.remove(name);
    cache.put(name, name);
    lookups += useThreshold;
  }
}

Basically, NameCache keeps two data structures: cache and transientMap. If an entry's use count reaches useThreshold, promote() moves it from “transientMap” to “cache”. However, this counting and promotion happens only while “initialized == false” -- in other words, during the initialization process.

Now in this case, the secondary NN never calls “initialized()” to finish initializing the NameCache. As a result, every call to “put” adds the new entry to either “transientMap” or “cache”, and nothing ever deletes them!
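
To make the leak concrete, here is a minimal standalone sketch (not the real HDFS class; names, types, and the threshold are simplified) that mimics the put()/initialized() logic above. It shows that without a call to initialized() every distinct name is retained forever, and that calling initialized() stops the growth.

import java.util.HashMap;
import java.util.Map;

/** Simplified stand-in for NameCache, keeping just the logic relevant to the leak. */
class MiniNameCache {
  private static final int USE_THRESHOLD = 10;
  private boolean initialized = false;
  private final Map<String, String> cache = new HashMap<>();
  private Map<String, Integer> transientMap = new HashMap<>();

  String put(String name) {
    String internal = cache.get(name);
    if (internal != null) {
      return internal;
    }
    if (!initialized) {                       // counting only before initialized()
      Integer count = transientMap.get(name);
      if (count != null) {
        if (count + 1 >= USE_THRESHOLD) {     // promote hot names into the cache
          transientMap.remove(name);
          cache.put(name, name);
        } else {
          transientMap.put(name, count + 1);
        }
        return name;
      }
      transientMap.put(name, 1);              // every new name is retained here
    }
    return null;
  }

  void initialized() {                        // what the 2NN forgot to call
    initialized = true;
    transientMap = null;                      // drop the transient use counts
  }

  int retained() {
    return cache.size() + (transientMap == null ? 0 : transientMap.size());
  }

  public static void main(String[] args) {
    MiniNameCache leaky = new MiniNameCache();
    for (int i = 0; i < 1_000_000; i++) {
      leaky.put("file-" + i);                 // unique names, as on a busy cluster
    }
    System.out.println("without initialized(): " + leaky.retained() + " entries retained");

    MiniNameCache fixed = new MiniNameCache();
    fixed.initialized();                      // what the patch ensures happens
    for (int i = 0; i < 1_000_000; i++) {
      fixed.put("file-" + i);
    }
    System.out.println("after initialized(): " + fixed.retained() + " entries retained");
  }
}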

The correct logic is to make sure “NameCache.initialized()” is called every time the 2NN finishes loading an image:

--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SecondaryNameNode.java
+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/SecondaryNameNode.java
@@ -886,6 +886,7 @@ static void doMerge(
            "just been downloaded");
      }
      dstImage.reloadFromImageFile(file, dstNamesystem);
+      dstNamesystem.dir.imageLoadComplete();
    }
   

This “doMerge” is called from “doCheckpoint”, i.e., whenever the secondary NN performs a checkpoint (reads the edit log and merges it into a new image). “imageLoadComplete” in turn triggers the following call chain:

SecondaryNameNode.doMerge
   \--> FSDirectory.imageLoadComplete
          \--> FSDirectory.setReady
                 \--> NameCache.initialized()

This way, transientMap is cleared, “initialized” is set to true, and put() stops adding new entries to “transientMap” or promoting them into “cache”.
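
For orientation, here is a rough, simplified sketch of how that call chain could be wired. Only the method names imageLoadComplete, setReady, and NameCache.initialized come from the chain above; the bodies are assumptions, and the real FSDirectory also takes locks and maintains other state.

/** Hedged, simplified sketch of the glue between image loading and NameCache. */
class FSDirectorySketch {
  /** Trivial stand-in for NameCache, just enough to show the call. */
  static class NameCacheStub {
    private boolean initialized = false;
    void initialized() { initialized = true; }  // real version also discards transientMap
  }

  private final NameCacheStub nameCache = new NameCacheStub();
  private boolean ready = false;

  /** Called once an fsimage has been fully loaded (the line the patch adds). */
  void imageLoadComplete() {
    setReady();
  }

  void setReady() {
    if (ready) {
      return;
    }
    ready = true;
    // Freeze the name cache: drop the transient use-count map and stop
    // retaining new entries, which is what plugs the 2NN leak.
    nameCache.initialized();
  }
}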

In addition:

--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java
+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSDirectory.java

  void reset() {
-    rootDir = new INodeDirectoryWithQuota(INodeDirectory.ROOT_NAME,
-        getFSNamesystem().createFsOwnerPermissions(new FsPermission((short)0755)),
-        Integer.MAX_VALUE, -1);
+    writeLock();
+    try {
+      setReady(false);
+      rootDir = new INodeDirectoryWithQuota(INodeDirectory.ROOT_NAME,
+          getFSNamesystem().createFsOwnerPermissions(new FsPermission((short)0755)),
+          Integer.MAX_VALUE, -1);
+      nameCache.reset();
+    } finally {
+      writeUnlock();
+    }
  }

--- whenever the 2NN calls “reset”, it will call “setReady”, which further calls “initialized” to clear transientMap and set “initialized” to true; it also calls “nameCache.reset()”, which clears the cache, sets initialized to “false”, and clears the “transientMap”. The new “NameCache.reset()” is:

+
+  public void reset() {
+    initialized = false;
+    cache.clear();
+    if (transientMap == null) {
+      transientMap = new HashMap<K, UseCount>();
+    } else {
+      transientMap.clear();
+    }
+  }

--- reset() is called every time the 2NN reloads an image file (i.e., on every checkpoint). In my running log, doMerge and reset are in fact called together...

4. Root cause

The NameCache was never initialized in the secondary NN.

4.1 Category:

Semantic bug. Missed operation.