HDFS-1907 Report

1. Symptom

BlockMissingException is thrown under this test scenario:

Two different processes perform concurrent file operations on the same file: one reads while the other writes.

The reader is basically doing:

byteRead = in.read(currentPosition, buffer, 0, byteToReadThisRound);

where currentPosition = 0, buffer is a byte array, and byteToReadThisRound = 1024*10000;

Usually it does not fail right away. The same file needs to be read, closed, and re-opened a few times to trigger the problem.
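
A minimal sketch of the reader side described above, assuming fs is an already-opened FileSystem and path is the file being written concurrently by the other process; the method name and structure are illustrative, not the actual TestWriteRead code:

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative reader: read from position 0, close, and re-open the file a
    // few times while another process is assumed to be writing to the same file.
    static void readRepeatedly(FileSystem fs, Path path, int rounds) throws Exception {
        byte[] buffer = new byte[1024 * 10000];   // byteToReadThisRound from the report
        for (int i = 0; i < rounds; i++) {
            FSDataInputStream in = fs.open(path);
            long currentPosition = 0;
            // positional read, as in the symptom description above
            int byteRead = in.read(currentPosition, buffer, 0, buffer.length);
            in.close();                            // close and re-open on the next round
        }
    }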

If the file is committed, an exception appears to be thrown when the client attempts to read past the end of the file. However, when the file still has an uncompleted block, there seems to be a secondary bug: no exception is thrown if the client reads past the end of that uncompleted block. A final bounds check should be added.
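
For illustration only, the "final bounds check" being suggested could take roughly this shape on the client read path (the names position and visibleLength are assumptions, not actual DFSInputStream fields):

    // Sketch of the suggested final bounds check: reject a positional read that
    // starts at or beyond the currently visible length, even when the last block
    // of the file is still under construction.
    static void checkReadBounds(long position, long visibleLength) throws java.io.IOException {
        if (position >= visibleLength) {
            throw new java.io.IOException("Read past end of file: position=" + position
                + ", visibleLength=" + visibleLength);
        }
    }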

What is the root cause? (See Section 4.)

1.1 Severity

Major

1.2 Was there an exception thrown?

Yes: BlockMissingException and IOException.

1.2.1 Were there multiple exceptions?

Yes, there are multiple exceptions. BlockMissingException is a subclass of IOException, so both appear in the trace: the test reports an IOException that is caused by the underlying BlockMissingException.
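
As a small illustration (not taken from the test code), a handler for IOException also catches BlockMissingException, which is why the failure can be reported at either level:

    // Illustration only: BlockMissingException extends IOException, so the second
    // handler would also catch it if the first were removed.
    static void readWithHandling(org.apache.hadoop.fs.FSDataInputStream in,
                                 long position, byte[] buffer) {
        try {
            in.read(position, buffer, 0, buffer.length);
        } catch (org.apache.hadoop.hdfs.BlockMissingException bme) {
            // the specific exception named in the symptom
        } catch (java.io.IOException ioe) {
            // the broader type under which the failure is ultimately reported
        }
    }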

2. How to reproduce this failure

2.0 Version

0.23.0. However, both the failure and the fix exist in version 0.23.0 at different commit levels, so a small portion of the 0.23.0 source code must be reverted to its 0.22.0 state in order to reproduce the failure.

2.1 Configuration

No special configuration is needed to run the test case.

2.2 Reproduction procedure

2.2.1 Timing order

No particular timing order is required; only concurrent read and write operations on the same file are needed.

2.2.2 Is the event order externally controllable?

Yes. In fact, no event order is needed to reproduce this error.

2.3 Can the logs tell how to reproduce the failure?

The log shows that the error originates in DFSInputStream.java and that multiple read and write operations occurred before the failure. This helped narrow down the source of the problem, but a detailed, non-trivial analysis of the source code is still needed.

2.4 How many machines needed?

Only one machine is needed to reproduce this failure.

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

A BlockMissingException is thrown and propagates up wrapped in an IOException. The offset used for the read becomes negative, which is what triggers the BlockMissingException.

Ding: How does the "offset" become negative? And if the offset becomes negative, why is the exception thrown? You might need to show the code to explain this point...
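
To sketch one plausible way the offset arithmetic goes wrong (the names below are assumptions for illustration, not the real DFSInputStream code): the reader computes how many bytes to fetch from a block using cached block metadata, and for the last, not-yet-finalized block that metadata can be stale while the writer keeps appending.

    // Illustrative arithmetic only. With a stale cachedBlockLen for an
    // under-construction block, (cachedBlockLen - offsetInBlock) can be zero or
    // negative, and the subsequent block fetch fails, surfacing as the
    // BlockMissingException seen in the log.
    static long bytesToFetch(long cachedBlockStart, long cachedBlockLen,
                             long position, int requestedLen) {
        long offsetInBlock = position - cachedBlockStart;   // where the read starts inside the block
        return Math.min(requestedLen, cachedBlockLen - offsetInBlock);
    }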

3.2 Backward inference (how do you infer from the symptom to the root cause)

The log messages mention that the offset is negative, and the stack trace shows a BlockMissingException raised in DFSInputStream.read:

Caused by: java.io.IOException: #### Exception caught in readUntilEnd: reader  currentOffset = 0 ; totalByteRead =0 ; latest byteRead = 0; visibleLen= 930000 ; bufferLen = 819200 ; Filename = /tmp/fileX1
        at org.apache.hadoop.hdfs.TestWriteRead.readUntilEnd(TestWriteRead.java:238)
        at org.apache.hadoop.hdfs.TestWriteRead.readData(TestWriteRead.java:167)
        ... 32 more
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-978087030-127.0.1.1-1374887260510:blk_1244383136209518898_1001 file=/tmp/fileX1
        at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:639)
        at org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:688)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:867)
        at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:72)
        at org.apache.hadoop.hdfs.TestWriteRead.readUntilEnd(TestWriteRead.java:219)

4. Root cause

The block offset becomes negative under concurrent read/write on a non-finalized (under-construction) block, which leads to the BlockMissingException seen by the reader.
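
The report does not reproduce the patch itself, but as an illustration of the direction the root cause points at, a fetch-path guard would reject an invalid range instead of letting it reach the block lookup (names are assumptions, not the actual fix in DFSInputStream):

    // Illustrative guard: fail fast with a clear message instead of letting a
    // negative offset or non-positive length propagate into the block fetch,
    // where it shows up as a missing block.
    static void checkFetchRange(long offsetInBlock, long bytesToRead) throws java.io.IOException {
        if (offsetInBlock < 0 || bytesToRead <= 0) {
            throw new java.io.IOException("Invalid fetch range: offsetInBlock=" + offsetInBlock
                + ", bytesToRead=" + bytesToRead
                + " (block metadata may be stale for an under-construction block)");
        }
    }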

4.1 Category

Concurrency bug.