HDFS-2938 Report

1. Symptom

When HDFS is removing a large directory (with millions of sub-files/sub-dirs), the system becomes unresponsive and any other write request hangs until the removal finishes.

1.1 Severity

Major.

1.2 Was there an exception thrown?

No.

1.3 Were there multiple exceptions?

No.

2. How to reproduce this failure

2.0 Version

0.23.3, with the patch manually reverted.

2.1 Configuration

1NN + 1DN.

2.2 Reproduction procedure

Create a directory with millions of sub-dirs/sub-files:

$ cd /home/ding/research/hadoop/testing_progs/create_large_dir

$ java -cp `hadoop classpath`:build/classes/ create_large_dir.CREATE_DIR

 --- This program will create a large directory; its source is attached at the end (a rough sketch of such a program is also shown after these reproduction commands).

$ hdfs dfs -rm -r /root

In the meantime:

$ hdfs dfs -put any_local_file /

  --- This will get stuck.
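
For reference, here is a rough, hypothetical sketch of what such a directory generator could look like (this is not the actual create_large_dir.CREATE_DIR source; the target path /root matches the rm command above, and the file count is an assumption):

package create_large_dir;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CREATE_DIR {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml via `hadoop classpath`
    FileSystem fs = FileSystem.get(conf);
    Path base = new Path("/root");                 // the directory that will later be deleted
    fs.mkdirs(base);
    for (int i = 0; i < 1000000; i++) {            // millions of entries; the exact count is an assumption
      // Write a little data so each file allocates a block; block removal is the slow part of the delete.
      FSDataOutputStream out = fs.create(new Path(base, "file-" + i));
      out.writeBytes("x");
      out.close();
    }
    fs.close();
  }
}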

Another (simpler) way is to run the unit test:

$ cd ~/research/hadoop/hadoop-0.23.3-src

$ mvn test -Dtest=TestLargeDirectoryDelete#largeDelete

Then look at the test output (in hadoop-hdfs-project/hadoop-hdfs/target/surefire-reports):

 createOperations 1

 lockOperations 0

What this test does is delete a large directory while, in the meantime, two other threads run: one creating lots of files and the other competing for the write lock. In the buggy case, the two threads barely succeed at creating files or acquiring the write lock; after the fix, both counters are large (a conceptual sketch of this setup is shown after the numbers below):

 createOperations 49

 lockOperations 2060
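
For illustration, here is a conceptual client-side sketch of the same idea (this is not the actual TestLargeDirectoryDelete source; it only counts create operations through the public FileSystem API, and the paths are assumptions):

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConcurrentDeleteProbe {
  public static void main(String[] args) throws Exception {
    final FileSystem fs = FileSystem.get(new Configuration());
    final AtomicBoolean deleting = new AtomicBoolean(true);
    final AtomicInteger createOperations = new AtomicInteger(0);

    // Probe thread: keep creating small files while the delete is in progress.
    Thread creator = new Thread(new Runnable() {
      public void run() {
        int i = 0;
        while (deleting.get()) {
          try {
            fs.create(new Path("/probe-" + (i++))).close();
            createOperations.incrementAndGet();
          } catch (Exception e) {
            // ignore and retry
          }
        }
      }
    });
    creator.start();

    // Main thread: delete the huge directory created earlier.
    fs.delete(new Path("/root"), true);
    deleting.set(false);
    creator.join();

    // With the buggy NameNode this counter stays near 0 until the delete returns.
    System.out.println("createOperations " + createOperations.get());
  }
}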

2.2.1 Timing order?

The write requests must be issued before the “rm” finishes.

2.3 Can the logs tell how to reproduce the failure?

Well, yes (but hard).

When deleting the file:

The DN log:

2013-07-17 16:34:15,686 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block BP-2131643711-127.0.1.1-1374092833647 blk_-9222608782257262799_6923 a

 --- Many of them!

The NN log:

2013-07-17 16:34:12,669 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* addToInvalidates: blk_5014222504967780398_10997 to 128.100.23.4:50010

  --- Many of them

But it is hard to tell from the logs alone that the other write request is unresponsive...

2.4 How many machines needed?

One machine, with 2 components: NN + DN.

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

If you have a jstack dump of the NameNode, diagnosis is easy. It will show that:

A. the thread serving the unresponsive request is blocked in writeLock();

B. the remove thread holds the write lock.
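
A thread dump can be captured with the standard JDK tools: use jps to find the NameNode pid and jstack to dump its threads, for example:

$ jps | grep NameNode

$ jstack <NameNode_pid>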

3.2 Backward inference

Look at the code that holds the write lock:

  /**
   * From the given list, incrementally remove the blocks from blockManager
   */
  private void removeBlocks(List<Block> blocks) {
    assert hasWriteLock(); // By Ding..
    int start = 0;
    int end = 0;
    while (start < blocks.size()) {
      end = BLOCK_DELETION_INCREMENT + start;
          --- This mechanism is to incrementally remove the blocks, BLOCK_DELETION_INCREMENT (1000) at a time!
      end = end > blocks.size() ? blocks.size() : end;
      for (int i=start; i<end; i++) {
        blockManager.removeBlock(blocks.get(i));
      }
      start = end;
    }
  }

  private boolean deleteInternal(..) {
    writeLock();
      --- critical section too large!!!
    try {
      removeBlocks(collectedBlocks); // Incremental deletion of blocks
    } finally {
      writeUnlock();
    }
  }

--- The interesting thing is that the developers already had a mechanism in place to remove the blocks incrementally (presumably intending to make the critical region finer-grained by releasing and reacquiring the lock), but they never actually dropped the lock between increments....
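
To make the intended pattern concrete, here is a standalone sketch using plain java.util.concurrent (not HDFS code): the write lock is released and reacquired after every chunk, so other writers can get in between chunks. This is the structure the fix below gives to removeBlocks().

import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ChunkedRemoval {
  // Same constant name as FSNamesystem uses; 1000 blocks per chunk.
  private static final int BLOCK_DELETION_INCREMENT = 1000;
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  void removeAll(List<Long> blocks) {
    int start = 0;
    while (start < blocks.size()) {
      int end = Math.min(start + BLOCK_DELETION_INCREMENT, blocks.size());
      lock.writeLock().lock();       // hold the lock for one chunk only...
      try {
        for (int i = start; i < end; i++) {
          remove(blocks.get(i));
        }
      } finally {
        lock.writeLock().unlock();   // ...then let waiting writers run before the next chunk
      }
      start = end;
    }
  }

  private void remove(Long block) {
    // stand-in for blockManager.removeBlock(block)
  }
}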

So the fix:

-  /** From the given list, incrementally remove the blocks from blockManager */
+  /**
+   * From the given list, incrementally remove the blocks from blockManager
+   * Writelock is dropped and reacquired every BLOCK_DELETION_INCREMENT to
+   * ensure that other waiters on the lock can get in. See HDFS-2938
+   */
  private void removeBlocks(List<Block> blocks) {
-    assert hasWriteLock();
    int start = 0;
    int end = 0;
    while (start < blocks.size()) {
      end = BLOCK_DELETION_INCREMENT + start;
      end = end > blocks.size() ? blocks.size() : end;
-      for (int i=start; i<end; i++) {
-        blockManager.removeBlock(blocks.get(i));
+      writeLock();
+      try {
+        for (int i = start; i < end; i++) {
+          blockManager.removeBlock(blocks.get(i));
+        }
+      } finally {
+        writeUnlock();
      }
      start = end;
    }

And in deleteInternal(), get rid of the outer lock around removeBlocks():
-    getEditLog().logSync();
-
-    writeLock();
-    try {
-      removeBlocks(collectedBlocks); // Incremental deletion of blocks
-    } finally {
-      writeUnlock();
-    }
+    getEditLog().logSync();
+    removeBlocks(collectedBlocks); // Incremental deletion of blocks
    collectedBlocks.clear();
    if (NameNode.stateChangeLog.isDebugEnabled()) {
      NameNode.stateChangeLog.debug("DIR* Namesystem.delete: "

4. Root cause

Critical region too large: deleteInternal() held the FSNamesystem write lock across the removal of every block in the deleted directory, so all other write requests were blocked until the whole delete finished.

4.1 Category:

Semantic.