HDFS-3541 Report

1. Symptom

Deadlock among three DataNode threads: the block recovery thread, the DataXceiver (writer) thread, and the PacketResponder thread. The client simply hangs.

1.1 Severity

Major.

1.2 Was there exception thrown?

No.

1.3 Scope of failure

All accesses to this particular DN will hang.

2. How to reproduce this failure

2.0 Version

0.23.3, 2.0.0-alpha, 2.0.2-alpha, 3.0.0

2.1 Configuration

1 NN + 1 DN. It is important to have just one DN so that we can be sure the recovery thread and the writer thread are working on the same DN.

2.2 Reproduction procedure

We simulated the deadlock by modifying the code (inserting a sleep) to force the problematic thread interleaving.

Essentially, to trigger the bug our inputs need to:

A) Write to a file,

B) Kill & restart the DN (to trigger the recovery), and

C) Continue the file write after the DN restart.

1.         Change DataNode.java’s source code: add the “public” modifier to the line

“volatile FsDatasetSpi<? extends FsVolumeSpi> data = null;”, making it “public volatile FsDatasetSpi<? extends FsVolumeSpi> data = null;”.

2.         cd hadoop-2.0.0-alpha-src/hadoop-dist // cd into hdfs project’s base dir.

3.         mvn package -Pdist -DskipTests -Dtar // Compile the src code.

4.         start-dfs.sh.

5.         jps // find the DataNode process’s <pid>.

6.         kill -9 <pid>

7.         Run the reproduction Java program (as below).

To compile and run the client code:

1.     cd  ~/hadoop-2.0.0-alpha-src/files/HDFS_3541_client_package

2.     javac -cp `hadoop classpath` -d build/classes src/HDFS_3541_Reproduction/HDFS_3541_Reproduction.java

3.     java -cp `hadoop classpath`:build/classes LogCor.HDFS_3541_Reproduction

 

8.         jps // find the HDFS_3541_Reproduction process’s <pid>.

9.         jstack <pid>.

What the client code does (a minimal sketch of such a client follows the list):

1.          Create a DFS system

2.          Create a DN

3.          Create a new path

4.          Write out a block of data to the path

5.          Hsync – flush the data in the client user’s buffer all the way to disk

6.          Get the blocks from the same path

7.          For the block, create a recoveringBlock (increment the generation stamp)

8.          Start the block recovering thread
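
Based on this description, a minimal sketch of such a client is shown below. This is not the original HDFS_3541_Reproduction program: the specific APIs (DataNode.createDataNode, DistributedFileSystem.getClient, DFSClient.getLocatedBlocks, DataNode.recoverBlocks), the path name and the write size are assumptions against the 2.0.0-alpha code base and may need adjusting.

// HDFS_3541_Reproduction.java -- minimal sketch of the reproduction client (assumptions noted above).
package LogCor;

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;
import org.apache.hadoop.hdfs.server.datanode.DataNode;
import org.apache.hadoop.hdfs.server.protocol.BlockRecoveryCommand.RecoveringBlock;

public class HDFS_3541_Reproduction {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // 1. Create a DFS handle to the running NameNode.
    DistributedFileSystem fs = (DistributedFileSystem) FileSystem.get(conf);

    // 2. Create a DN in this JVM (the original DN was killed in step 6 above).
    DataNode dn = DataNode.createDataNode(new String[]{}, conf);

    // 3.-5. Create a new path, write out a block of data, and hsync it to disk.
    Path p = new Path("/hdfs3541_test");
    FSDataOutputStream out = fs.create(p, (short) 1);
    out.write(new byte[64 * 1024]);
    out.hsync();                               // block is now RBW on the DN, writer still open

    // 6.-7. Get the block of the same path and wrap it in a RecoveringBlock
    //       with an incremented generation stamp.
    LocatedBlock lb = fs.getClient()
        .getLocatedBlocks(p.toString(), 0).getLocatedBlocks().get(0);
    RecoveringBlock rb = new RecoveringBlock(lb.getBlock(), lb.getLocations(),
        lb.getBlock().getGenerationStamp() + 1);

    // 8. Start the block recovery thread on the DN; recovery calls initReplicaRecovery,
    //    which calls stopWriter on the replica being written.
    dn.recoverBlocks(Arrays.asList(rb));

    // Finish the write. With the injected sleep forcing the orderings in 2.2.1,
    // the DataXceiver, PacketResponder and recovery threads deadlock and this call hangs.
    out.close();
  }
}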

 

2.2.1 Timing:

Since this is a concurrency bug, in addition to the external events we also need certain internal orderings that are not deterministic given only the external events (a small sketch of forcing such an ordering follows the list):

A) stopWriter()’s writer.interrupt() must occur before DataXceiver’s wait()

  --- Otherwise writer.interrupt() would interrupt DataXceiver’s wait() and break the deadlock.

B) initReplicaRecovery must acquire the dataset lock BEFORE PacketResponder calls finalizeBlock

  --- Both methods are synchronized on the same FsDatasetImpl object, so they are mutually exclusive.
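
For illustration (plain Java, not HDFS code; the exact point where the sleep is inserted into HDFS is an implementation detail not recorded here), the sketch below shows how a short delay can force the lock-acquisition order required by B): the “recovery” thread grabs the shared monitor first, and the “responder” is forced to block on it, just as PacketResponder.finalizeBlock blocks on the FsDatasetImpl lock.

// LockOrderDemo.java -- illustration only; "Dataset" stands in for FsDatasetImpl.
public class LockOrderDemo {
  static class Dataset {
    synchronized void initReplicaRecovery() throws InterruptedException {
      System.out.println("recovery thread holds the dataset lock");
      Thread.sleep(5000);                    // hold the monitor long enough ...
    }
    synchronized void finalizeBlock() {
      System.out.println("responder finally got the dataset lock");
    }
  }

  public static void main(String[] args) throws Exception {
    final Dataset data = new Dataset();
    Thread recovery = new Thread(new Runnable() {
      public void run() {
        try { data.initReplicaRecovery(); } catch (InterruptedException ignored) { }
      }
    });
    Thread responder = new Thread(new Runnable() {
      public void run() {
        try { Thread.sleep(100); } catch (InterruptedException ignored) { }
        data.finalizeBlock();                // ... that the responder is forced to block here
      }
    });
    recovery.start();
    responder.start();
    recovery.join();
    responder.join();
  }
}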

2.2.2 Order externally controllable?

No. The dependencies in 2.2.1 are not externally controllable --- that is why we need to insert a sleep.

2.3 Can the logs tell how to reproduce the failure?

- With the jstack log: straightforward to reproduce.

- Without the jstack log: hard to reproduce.

From the application log alone it is hard to tell how to reproduce the failure. However, the jstack dump is very helpful and shows the problem clearly; once it is available, reproducing the failure is straightforward.

2.4 How many machines needed?

2 nodes: 1 NN + 1 DN.

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

1.              The DataXceiver thread is waiting for the PacketResponder

2.              The PacketResponder is waiting for a lock (the FSDataSet/FsDatasetImpl lock) that is held by DN.initReplicaRecovery

3.              DN.initReplicaRecovery is waiting for stopWriter

4.              stopWriter is waiting to join the writer (the DataXceiver thread), but interrupt() does not actually interrupt the writer

3.2 Backward inference

 

1.     Focus on the threads whose state is BLOCKED or WAITING in the jstack dump:

Recovery:

"org.apache.hadoop.hdfs.server.datanode.DataNode$2@3b3b15cf" daemon prio=10 tid=0x00007f8a61216800 nid=0x49d2 in Object.wait() [0x00007f8a4b0ad000]

   java.lang.Thread.State: WAITING (on object monitor)

at java.lang.Object.wait(Native Method)

at java.lang.Thread.join(Thread.java:1186)

- locked <0x000000077cc64660> (a org.apache.hadoop.util.Daemon)

at java.lang.Thread.join(Thread.java:1239)

at org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.stopWriter(ReplicaInPipeline.java:157)

at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:1422)

at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:1402)

- locked <0x000000077c0fd9a8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)

at org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:1739)

at org.apache.hadoop.hdfs.server.datanode.DataNode.callInitReplicaRecovery(DataNode.java:1750)

at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1819)

at org.apache.hadoop.hdfs.server.datanode.DataNode.access$300(DataNode.java:206)

at org.apache.hadoop.hdfs.server.datanode.DataNode$2.run(DataNode.java:1724)

at java.lang.Thread.run(Thread.java:662)

 

"PacketResponder: BP-81833983-xx.xx.xx.203-1339769296155:blk_-3264545246283992184_10318, type=LAST_IN_PIPELINE, downstreams=0:[]" daemon prio=10 tid=0x00000000436a9800 nid=0x4783 waiting for monitor entry [0x00007f8a4a9a6000]

   java.lang.Thread.State: BLOCKED (on object monitor)

at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeBlock(FsDatasetImpl.java:841)

- waiting to lock <0x000000077c0fd9a8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)

at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1055)

at java.lang.Thread.run(Thread.java:662)

"DataXceiver for client DFSClient_NONMAPREDUCE_-1037599878_15 at /xx.xx.xx.201:41945 [Receiving block BP-81833983-xx.xx.xx.203-1339769296155:blk_-3264545246283992184_10318]" daemon prio=10 tid=0x00007f8a61203000 nid=0x4782 in Object.wait() [0x00007f8a4aba8000]

   java.lang.Thread.State: WAITING (on object monitor)

at java.lang.Object.wait(Native Method)

at java.lang.Object.wait(Object.java:485)

at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.close(BlockReceiver.java:962)

- locked <0x000000077cb681f0> (a org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder)

at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:754)

at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:464)

at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)

at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:66)

at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:189)

at java.lang.Thread.run(Thread.java:662)

 

The thread interleaving can be clearly seen from the jstack dump. But it does require tracing the locks to get to the root cause of the deadlock.

2.     The DataXceiver thread is waiting for the PacketResponder

3.     The PacketResponder is waiting for a lock that is held by DN.initReplicaRecovery

a.     It is the FSDataSet (FsDatasetImpl) lock

4.     DN.initReplicaRecovery is waiting for stopWriter

5.     So let us look at stopWriter:

public class ReplicaInPipeline {   // the replica being written; block recovery must stop its writer

  public void stopWriter() throws IOException {
    if (writer != null && writer != Thread.currentThread() && writer.isAlive()) {
      writer.interrupt();
      try {
        writer.join();   // <-- blocks here, waiting to join the writer (DataXceiver) thread
      } catch (InterruptedException e) {
        throw new IOException("Waiting for writer thread is interrupted.");
      }
    }
  }
}

Here writer.interrupt() will only break the writer out of a blocking call if the writer is already in a waiting/sleeping state; otherwise it merely sets the interrupt flag, which intervening code may consume and clear before the writer next blocks. In this case, when writer.interrupt() was called the DataXceiver was not yet waiting; it began to wait between writer.interrupt() and writer.join(), so the interrupt had no effect and writer.join() blocks until the writer thread completes execution.
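
The lost-interrupt behaviour is easy to demonstrate in isolation. The following stand-alone demo (our own illustration in plain Java, not HDFS code) mimics the pattern: the interrupt lands while the writer is blocked in other work, is swallowed there, and the subsequent wait() and join() then hang just like PacketResponder.close() and stopWriter() do in this bug.

// LostInterruptDemo.java -- illustration only, not HDFS code.
public class LostInterruptDemo {
  private static final Object monitor = new Object();   // nobody ever notifies this

  public static void main(String[] args) throws Exception {
    Thread writer = new Thread(new Runnable() {
      public void run() {
        try {
          Thread.sleep(200);                 // "other work"; the interrupt lands here ...
        } catch (InterruptedException e) {
          // ... and is caught, which clears the interrupt status.
        }
        synchronized (monitor) {
          try {
            monitor.wait();                  // now waits forever -- the interrupt is gone
          } catch (InterruptedException ignored) { }
        }
      }
    });
    writer.start();

    Thread.sleep(50);
    writer.interrupt();                      // corresponds to stopWriter()'s writer.interrupt()
    writer.join(2000);                       // corresponds to writer.join(); timeout added for the demo
    System.out.println("writer still alive after join timeout? " + writer.isAlive());  // prints true
  }
}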

 

6.     When we look at stopWriter we find it is waiting to join the writer (the DataXceiver thread)

a.     interrupt() does not actually stop the writer

7.     That is where the failure manifests

 

8.     So PacketResponder.close, ReplicaInPipeline.stopWriter and FsDatasetImpl.finalizeBlock are in deadlock, which is exactly what the jstack dump shows

9.     Our failure is at stopWriter; to find the root cause, observe that stopWriter joins the writer thread

10.  The writer thread is the DataXceiver thread, which calls receiveBlock; receiveBlock starts the PacketResponder thread (PacketResponder.run) and later calls PacketResponder.close. PacketResponder.close waits for the PacketResponder thread to finish, while PacketResponder.run calls finalizeBlock, which is waiting for the FsDatasetImpl lock (see the sketch of the close() loop after this list).

11.  To fix the bug, we have to make sure that the interrupt passes through and actually stops the writer, the PacketResponder and finalizeBlock, because we do not want anything to change the state of the blocks asynchronously.
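
For reference, the following is a simplified sketch (reconstructed from the 2.0.x BlockReceiver source, not a verbatim copy) of the shape of PacketResponder.close(). It shows why the DataXceiver can only be woken by the responder thread itself, or by an interrupt that arrives while it is already in wait():

// BlockReceiver.PacketResponder.close() -- simplified sketch, not verbatim.
synchronized void close() {
  // Wait until the responder thread has acked and removed every queued packet.
  while (running && ackQueue.size() != 0 && datanode.shouldRun) {
    try {
      wait();                  // only the responder thread will notify
    } catch (InterruptedException e) {
      running = false;         // an interrupt arriving *while waiting* breaks the loop
    }
  }
  running = false;
  notifyAll();
}

Here the responder thread is stuck in finalizeBlock waiting for the FsDatasetImpl lock (held by initReplicaRecovery), so it never drains the ack queue and never notifies; and the interrupt from stopWriter has already been lost, so the loop never exits.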

4. Root cause

stopWriter() does not actually stop the writer. In the patch, the interrupt is propagated all the way to finalizeBlock, which can be seen in the logs below.
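
Judging from the post-fix log message “Cannot finalize block from Interrupted Thread” below, the fix adds a guard of roughly the following shape to FsDatasetImpl.finalizeBlock. This is a sketch inferred from the log, not the verbatim patch:

// FsDatasetImpl.finalizeBlock -- sketch inferred from the post-fix log, not the verbatim patch.
@Override // FsDatasetSpi
public synchronized void finalizeBlock(ExtendedBlock b) throws IOException {
  if (Thread.currentThread().isInterrupted()) {
    // The writer is being stopped (e.g. by stopWriter() during block recovery);
    // refuse to change the replica state from an interrupted thread.
    throw new IOException("Cannot finalize block from Interrupted Thread");
  }
  // ... normal finalization path ...
}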

Log AFTER the fix:

 

13/07/07 22:34:59 ERROR datanode.DataNode: research-virtual-machine:50010:DataXceiver error processing WRITE_BLOCK operation  src: /192.168.103.149:37575 dest: /192.168.103.149:50010

java.io.IOException: Interrupted receiveBlock

at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:804)

at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:462)

at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:98)

at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:66)

at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:189)

at java.lang.Thread.run(Thread.java:662)

13/07/07 22:34:59 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block BP-1589935905-127.0.1.1-1373250817958:blk_-1253963624270240810_1002

java.io.EOFException: Premature EOF: no length prefix available

at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)

at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:95)

at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:656)

With the fix, the write now fails instead of hanging:

13/07/07 22:34:59 INFO datanode.DataNode: 3541 Waiting to join writer in stopWriter

13/07/07 22:34:59 INFO datanode.DataNode: 3541 Joined writer in stopWriter

13/07/07 22:34:59 INFO impl.FsDatasetImpl: initReplicaRecovery: changing replica state for blk_-1253963624270240810_1002 from RBW to RUR

13/07/07 22:34:59 WARN datanode.DataNode: IOException in BlockReceiver.run():

java.io.IOException: Cannot finalize block from Interrupted Thread

at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.finalizeBlock(FsDatasetImpl.java:843)

at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1077)

at java.lang.Thread.run(Thread.java:662)

13/07/07 22:34:59 INFO datanode.DataNode: PacketResponder: BP-1589935905-127.0.1.1-1373250817958:blk_-1253963624270240810_1002, type=LAST_IN_PIPELINE, downstreams=0:[] terminating

 

The PacketResponder gets the interrupt and is successfully terminated.

4.1 Category:

Concurrency Bug

Acknowledgement:

This bug was initially reproduced and studied by Dongcai Shen. Our group has reproduced it independently thereafter and provided additional analysis and findings.