HDFS-2359 Report

1. Symptom

A NullPointerException appears in the Datanode log when a disk fails during a dfs -copyFromLocal/-put/-cp/-rm or distcp operation, and it forces the executing thread to exit.

1.1 Severity

Major

1.2 Was there exception thrown?

Yes

1.2.1 Were there multiple exceptions?

Yes: an IOException ("Error in deleting blocks.") and a NullPointerException.

1.3 Scope of the failure

The scan operation being executed terminates (the DataXceiver thread exits).

2. How to reproduce this failure

2.0 Version

0.20.205.0

2.1 Configuration

In the hdfs-site.xml file:

<property>
  <name>dfs.datanode.failed.volumes.tolerated</name>
  <value>3</value>
  <description>The number of volumes that are allowed to fail before a
  datanode stops offering service. By default any volume failure will
  cause a datanode to shutdown.
  </description>
</property>

<property>
  <name>dfs.data.dir</name>
  <value>/app/hadoop-0.20.205/data1,/app/hadoop-0.20.205/data2,/app/hadoop-0.20.205/data3,/app/hadoop-0.20.205/data4,/app/hadoop-0.20.205/data5</value>
  <final>true</final>
</property>

To use the distcp command, add the following to the mapred-site.xml file:

<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

2.2 Reproduction procedure

1. Format the namenode.

2. Start DFS and MapReduce.

3. Copy a file to DFS:

hadoop dfs -copyFromLocal ~/random.txt /tmp1

4. Copy a directory from HDFS to HDFS:

hadoop distcp /tmp1 /tmp2

5. While the copy is running, simulate disk failures by changing the permissions of the data directories to 000:

chmod 000 data{1,2,3}

2.2.2 Events order externally controllable?

No. This is a concurrency bug that requires a particular thread interleaving.

2.3 Can the logs tell how to reproduce the failure?

No. This is a concurrency bug. However, the exception stack trace clearly shows where the root cause is.

2.4 How many machines needed?

One

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

2011-09-19 12:43:40,314 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Unexpected error trying to delete block blk_7065198814142552283_62557. BlockInfo not found in volumeMap.
2011-09-19 12:43:40,314 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Unexpected error trying to delete block blk_7066946313092770579_39189. BlockInfo not found in volumeMap.
2011-09-19 12:43:40,314 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Unexpected error trying to delete block blk_7070305189404753930_49359. BlockInfo not found in volumeMap.
2011-09-19 12:43:40,327 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Error processing datanode Command
java.io.IOException: Error in deleting blocks.
    at org.apache.hadoop.hdfs.server.datanode.FSDataset.invalidate(FSDataset.java:1820)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.processCommand(DataNode.java:1074)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.processCommand(DataNode.java:1036)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:891)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1419)
    at java.lang.Thread.run(Thread.java:619)
2011-09-19 12:43:41,304 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(xx.xxx.xxx.xxx:xxxx, storageID=xx-xxxxxxxxxxxx-xx.xxx.xxx.xxx-xxxx-xxxxxxxxxxx, infoPort=1006, ipcPort=8020):DataXceiver
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.datanode.DataBlockScanner$LogFileHandler.appendLine(DataBlockScanner.java:788)
    at org.apache.hadoop.hdfs.server.datanode.DataBlockScanner.updateScanStatusInternal(DataBlockScanner.java:365)
    at org.apache.hadoop.hdfs.server.datanode.DataBlockScanner.verifiedByClient(DataBlockScanner.java:308)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:205)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:99)
    at java.lang.Thread.run(Thread.java:619)
2011-09-19 12:43:43,313 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Unexpected error trying to delete block blk_7071818644980664768_40827. BlockInfo not found in volumeMap.
2011-09-19 12:43:43,313 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Unexpected error trying to delete block blk_7073840977856837621_62108. BlockInfo not found in volumeMap.

3.2 Backward inference (how do you infer from the symptom to the root cause)

From the log, the NullPointerException is thrown inside the appendLine() function at the out.println() call, so the out variable must be null at that point. The next step is to search the code for every place where out is set to null; the only one is in the close() function, shown below. Therefore, the failure is triggered when one thread calls close() right before out.println() runs in another thread.

synchronized void close() {
  if (out != null) {
    out.close();
    out = null;
  }
}
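
Note that both close() and appendLine() are synchronized, so the two calls cannot interleave in the middle of a method body; the race is at method granularity: close() completes and nulls out, and the next appendLine() call dereferences the null stream. The following minimal standalone sketch (a hypothetical LogSketch class, not the HDFS source) forces this order and reproduces the same NullPointerException:

import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

// Hypothetical stand-in for DataBlockScanner.LogFileHandler.
class LogSketch {
  private PrintStream out = new PrintStream(new ByteArrayOutputStream());

  synchronized void close() {
    if (out != null) {
      out.close();
      out = null;                 // later appendLine() calls now see null
    }
  }

  synchronized boolean appendLine(String line) {
    out.println(line);            // NPE here once close() has run
    return true;
  }

  public static void main(String[] args) throws InterruptedException {
    LogSketch log = new LogSketch();
    Thread closer = new Thread(log::close);  // plays the disk-failure path
    closer.start();
    closer.join();                // force the failing order: close first
    // Plays the DataXceiver thread updating the verification log:
    Thread writer = new Thread(() -> log.appendLine("blk_1234"));
    writer.start();               // throws NullPointerException in appendLine()
    writer.join();
  }
}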

4. Root cause

Each of the dfs -copyFromLocal/-put/-cp/-rm and distcp operations changes the number of blocks, so the block-verification log must be updated. When a block is added to HDFS, the appendLine() function is called to append that block's verification information to the verification log via the out stream. If a disk on the datanode fails during this operation, the original verification log can no longer be opened, and the close() function is called, setting out to null. Consequently, if one thread calls close() right before out.println() executes in another thread, the NullPointerException is thrown.

4.1 Category

Concurrency bug

5. Fix

   synchronized boolean appendLine(String line) {
+    if (out == null) {
+      return false;
+    }
     out.println();

5.1 How?

If out is null (the verification log has already been closed), the appendLine() function returns false to signal that appending the line failed, instead of dereferencing the null stream.
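
Applied to the hypothetical LogSketch demo in section 3.2 (again a sketch, not the HDFS source), the same guard makes the close-then-append order benign:

// Patched appendLine() for the LogSketch demo above; mirrors the HDFS fix.
synchronized boolean appendLine(String line) {
  if (out == null) {
    return false;   // log already closed, e.g. after a disk failure
  }
  out.println(line);
  return true;
}

With this check, the writer thread in the demo receives false instead of crashing; in the DataNode, a disk failure closes the verification log and later appends simply report failure rather than killing the DataXceiver thread.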