HDFS-4328 Report

1. Symptom

Performance problem: the DataNode shutdown makes very slow progress and almost appears to be a deadlock.

1.1 Severity

Major.

1.2 Was there exception thrown?

No.

2. How to reproduce this failure

I reproduced it simply by running the test case TestLargeBlock#testLargeBlockSize. I also describe below how to reproduce it without the test case.

2.0 Version

2.0.3-alpha: the fix patch needs to be manually reverted.

2.1 Configuration

With the test case: no configuration needed.

Without the test case: 1 NN + 1 DN. No unusual configuration required.

2.2 Reproduction procedure

Testcase:

mvn test -Dtest=TestLargeBlock#testLargeBlockSize

Without the test case (a sketch of these steps as code follows the list):

- Create a large file (> 2 GB).

- Upload it to the DN.

- Wait until the block scanner starts verifying the block.

- During the scan, shut down the DN: the shutdown will take a very long time to complete.
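
For reference, here is a minimal sketch of those steps as a standalone program built on MiniDFSCluster and DFSTestUtil (the standard HDFS test utilities from the hadoop-hdfs test artifact). It is only an approximation, not the actual TestLargeBlock code; the file size, the sleep, and the assumption that the block scanner picks up the new block quickly are all illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DFSTestUtil;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class ScannerShutdownRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
        .numDataNodes(1).build();
    FileSystem fs = cluster.getFileSystem();

    // Steps 1 and 2: create and upload a file larger than 2 GB.
    Path file = new Path("/largeFile");
    DFSTestUtil.createFile(fs, file, 3L * 1024 * 1024 * 1024, (short) 1, 0L);

    // Step 3: give the DataNode's block scanner a chance to start verifying
    // the new blocks (illustrative wait; real timing depends on the scanner).
    Thread.sleep(60 * 1000);

    // Step 4: shut down. This eventually calls DataNode.shutdown(), which
    // joins the DataBlockScanner thread; without the fix the join can take
    // a very long time because the throttler ignores the interrupt.
    cluster.shutdown();
  }
}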

2.3 Can the logs tell how to reproduce the failure?

Hard. The logs do not tell us that the file is large; they only show that a block is in the middle of verification and that shutdown has been called.

2.4 How many machines needed?

One machine: 1 NN + 1 DN.

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The shutdown procedure appears to hang (in fact, it makes very slow progress).

If you take a thread dump with jstack, you will see the following:

"main" prio=5 tid=7fd86d800800 nid=0x10efc1000 in Object.wait() [10efbe000]
  java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1210)
        - locked <7c3965cd8> (a java.lang.Thread)
        at java.lang.Thread.join(Thread.java:1263)
        at org.apache.hadoop.hdfs.server.datanode.DataBlockScanner.shutdown(DataBlockScanner.java:251)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.shutdownDataBlockScanner(DataNode.java:490)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.shutdownPeriodicScanners(DataNode.java:462)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.shutdown(DataNode.java:1104)
       ...

60" daemon prio=5 tid=7fd86c1a6800 nid=0x11c378000 in Object.wait() [11c377000]
  java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.hadoop.hdfs.util.DataTransferThrottler.throttle(DataTransferThrottler.java:98)
        - locked <7c3c841a0> (a org.apache.hadoop.hdfs.util.DataTransferThrottler)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:526)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:653)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.verifyBlock(BlockPoolSliceScanner.java:397)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.verifyFirstBlock(BlockPoolSliceScanner.java:476)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.scan(BlockPoolSliceScanner.java:633)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.scanBlockPoolSlice(BlockPoolSliceScanner.java:599)
        at org.apache.hadoop.hdfs.server.datanode.DataBlockScanner.run(DataBlockScanner.java:101)
       ...

These two stack traces are the key to the diagnosis.

3.2 Backward inference

Let’s first look at the first “WAITING” thread:

"main" prio=5 tid=7fd86d800800 nid=0x10efc1000 in Object.wait() [10efbe000]
  java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1210)
        - locked <7c3965cd8> (a java.lang.Thread)
        at java.lang.Thread.join(Thread.java:1263)
        at org.apache.hadoop.hdfs.server.datanode.DataBlockScanner.shutdown(DataBlockScanner.java:251)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.shutdownDataBlockScanner(DataNode.java:490)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.shutdownPeriodicScanners(DataNode.java:462)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.shutdown(DataNode.java:1104)
       …

  public void shutdown() {
    synchronized (this) {
      if (blockScannerThread != null) {
        blockScannerThread.interrupt();
      }
    }
    // We cannot join within the synchronized block, because it would create a
    // deadlock situation.  blockScannerThread calls other synchronized methods.
    if (blockScannerThread != null) {
      try {
        blockScannerThread.join();
      } catch (InterruptedException e) {
        // shutting down anyway
      }
    }
  }

So during shutdown, the DN tries to stop the “blockScannerThread”: it first sends the thread an interrupt, and then waits for it to finish (join).
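
To see concretely why that join can drag on, here is a small standalone Java sketch (generic code, not HDFS code) of the same interrupt-then-join pattern: interrupting the worker only helps if the worker reacts to the interrupt; if the worker swallows the InterruptedException, join() simply blocks until the worker's loop ends on its own.

public class InterruptJoinDemo {
  public static void main(String[] args) throws InterruptedException {
    Thread worker = new Thread(() -> {
      for (int i = 0; i < 60; i++) {
        try {
          Thread.sleep(1000);  // stand-in for throttled I/O
        } catch (InterruptedException ignored) {
          // Swallowed, like DataTransferThrottler.throttle() before the fix:
          // the loop keeps running as if nothing happened.
        }
      }
    });
    worker.start();

    Thread.sleep(2000);
    worker.interrupt();  // "shutdown" sends the interrupt ...
    worker.join();       // ... and then blocks here until the loop finishes.
    System.out.println("joined only after the worker ran its full loop");
  }
}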

Now, let’s look at the other stack dump:

60" daemon prio=5 tid=7fd86c1a6800 nid=0x11c378000 in Object.wait() [11c377000]
  java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.hadoop.hdfs.util.DataTransferThrottler.throttle(DataTransferThrottler.java:98)
        - locked <7c3c841a0> (a org.apache.hadoop.hdfs.util.DataTransferThrottler)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:526)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:653)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.verifyBlock(BlockPoolSliceScanner.java:397)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.verifyFirstBlock(BlockPoolSliceScanner.java:476)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.scan(BlockPoolSliceScanner.java:633)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.scanBlockPoolSlice(BlockPoolSliceScanner.java:599)
        at org.apache.hadoop.hdfs.server.datanode.DataBlockScanner.run(DataBlockScanner.java:101)
       ...

This stack is in the block scanner code. From the discussion above, we can infer that the DN sent this thread an interrupt and expects it to exit soon.

Let’s look at the code to see where it waits:

  /** Given the numOfBytes sent/received since last time throttle was called,
   * make the current thread sleep if I/O rate is too fast
   * compared to the given bandwidth.
   *
   * @param numOfBytes
   *     number of bytes sent/received since last time throttle was called
   */
  public synchronized void throttle(long numOfBytes) {
    if ( numOfBytes <= 0 ) {
      return;
    }

    curReserve -= numOfBytes;
    bytesAlreadyUsed += numOfBytes;

    while (curReserve <= 0) {
      long now = monotonicNow();
      long curPeriodEnd = curPeriodStart + period;

      if ( now < curPeriodEnd ) {
        // Wait for next period so that curReserve can be increased.
        try {
          wait( curPeriodEnd - now );
        } catch (InterruptedException ignored) {}
        // ^^^ The interrupt is ignored here! The fix is to handle this interrupt.
      } else if ( now <  (curPeriodStart + periodExtension)) {
        curPeriodStart = curPeriodEnd;
        curReserve += bytesPerPeriod;
      } else {
        // discard the prev period. Throttler might not have
        // been used for a long time.
        curPeriodStart = now;
        curReserve = bytesPerPeriod - bytesAlreadyUsed;
      }
    }

    bytesAlreadyUsed -= numOfBytes;
  }

So the scanner thread was waiting inside throttle(). throttle() is called from BlockSender.sendPacket, which limits network usage when sending packets. In fact, if you look at the entire call chain:

DataBlockScanner.run
  \-> .. ..
       \-> BlockSender.sendBlock
            \-> BlockSender.sendPacket
                 \-> DataTransferThrottler.throttle

None of the code along this chain handles the received interrupt. In other words, the interrupt from “shutdown” has no effect at all, and the block scanner keeps making its usual progress. In this case that progress is extremely slow because:

A) the file is large;

B) the throttling is still in effect (a rough estimate of the resulting scan time follows).
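
To get a feel for the time scale, here is a back-of-the-envelope sketch. The scan bandwidth used below is an assumed, illustrative number (the scanner throttles verification to a low rate); it is not a value quoted from the HDFS source.

public class ScanTimeEstimate {
  public static void main(String[] args) {
    long fileBytes = 3L * 1024 * 1024 * 1024;        // a > 2 GB file
    long assumedScanBytesPerSec = 4L * 1024 * 1024;  // assumed throttle rate, illustrative only
    long seconds = fileBytes / assumedScanBytesPerSec;
    // shutdown() joins the scanner thread, so it stays blocked for roughly this long.
    System.out.println("verification takes ~" + seconds + " s (~" + (seconds / 60) + " min)");
  }
}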

The fix:

https://issues.apache.org/jira/secure/attachment/12564314/HDFS-4328.1.patch

Add logic in throttle() to honor the interrupt: if an interrupt is delivered, stop throttling and finish as soon as possible.

In particular:

--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/DataTransferThrottler.java
+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/DataTransferThrottler.java
@@ -96,7 +96,12 @@ public synchronized void throttle(long numOfBytes) {
        // Wait for next period so that curReserve can be increased.
        try {
          wait( curPeriodEnd - now );
-        } catch (InterruptedException ignored) {}
+        } catch (InterruptedException e) {
+          // Abort throttle and reset interrupted status to make sure other
+          // interrupt handling higher in the call stack executes.
+          Thread.currentThread().interrupt();
+          break;
+        }
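
The key detail in the patch is Thread.currentThread().interrupt(): catching InterruptedException clears the thread's interrupted status, so the catch block must re-assert it for callers higher in the stack to notice. The following standalone sketch (generic Java, not HDFS code) shows the idiom: the low-level wait aborts on interrupt and restores the flag, and the caller's loop can then stop promptly.

public class RestoreInterruptDemo {
  private static final Object lock = new Object();

  // Stand-in for the patched throttle(): wait a bit, and on interrupt
  // re-assert the interrupted status instead of swallowing it.
  static void throttleLike() {
    synchronized (lock) {
      try {
        lock.wait(1000);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();  // restore the flag and return
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Thread worker = new Thread(() -> {
      // The caller observes the restored flag and exits its loop.
      while (!Thread.currentThread().isInterrupted()) {
        throttleLike();
      }
      System.out.println("worker saw the interrupt and exited");
    });
    worker.start();

    Thread.sleep(1500);
    worker.interrupt();  // like DataNode.shutdown() interrupting the scanner
    worker.join();       // returns promptly now
  }
}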

4. Root cause

The interrupt is never handled: DataTransferThrottler.throttle() swallows the InterruptedException, so the interrupt sent by shutdown() is never acted upon.

4.1 Category:

Mishandling of exception.

   --- An exception (InterruptedException) is thrown, but it is mishandled: the handler silently swallows it.