BlockMissingException is thrown under this test scenario:
Two processes perform concurrent operations on the same file: one reads while the other writes.
The reader essentially does:
byteRead = in.read(currentPosition, buffer, 0, byteToReadThisRound);
where currentPosition = 0, buffer is a byte-array buffer, and byteToReadThisRound = 1024*10000.
Usually it does not fail right away. The same file needs to be read, closed, and re-opened several times to trigger the problem.
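The reader loop above can be sketched as follows. This is a hedged simulation, not the actual test code: it runs against an in-memory byte array instead of a real DFSInputStream (so no cluster is needed), and the helper `pread` is an illustrative stand-in with the same shape as Hadoop's positional `read(long position, byte[] buffer, int offset, int length)`.

```java
// Hedged sketch of the reader loop described above, simulated against an
// in-memory byte array instead of a real DFSInputStream.
public class ReaderLoopSketch {
    // Illustrative positional read with the same shape as
    // FSDataInputStream.read(long position, byte[] buffer, int offset,
    // int length): returns bytes copied, or -1 at end of file.
    static int pread(byte[] file, long position, byte[] buffer, int offset, int length) {
        if (position >= file.length) return -1;           // past EOF
        int n = (int) Math.min(length, file.length - position);
        System.arraycopy(file, (int) position, buffer, offset, n);
        return n;
    }

    public static void main(String[] args) {
        byte[] file = new byte[930000];          // visibleLen from the log excerpt
        byte[] buffer = new byte[1024 * 10000];  // byteToReadThisRound
        long currentPosition = 0;
        long totalByteRead = 0;
        int byteRead;
        // Read until EOF, advancing currentPosition each round, the way
        // the test's readUntilEnd loop is described.
        while ((byteRead = pread(file, currentPosition, buffer, 0, buffer.length)) > 0) {
            currentPosition += byteRead;
            totalByteRead += byteRead;
        }
        System.out.println(totalByteRead); // prints 930000
    }
}
```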
If the file is committed (finalized), an exception is thrown when the client attempts to read past the end of the file. However, when the file ends in an uncompleted block, there appears to be a secondary bug: no exception is thrown when the client reads past the end of that uncompleted block. A final bounds check should be added.
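The missing "final bounds check" argued for above could look like the following. This is a minimal sketch under assumptions: the names (`position`, `visibleLen`, `clampReadLength`) are illustrative and do not correspond to actual DFSInputStream fields; the numeric values are taken from the exception message in the log excerpt in this report.

```java
// Hedged sketch of the clamp this report argues is missing: never let a
// read extend past the bytes known to be readable (visibleLen).
public class BoundsCheckSketch {
    // Returns the number of bytes that may safely be read, or -1 when
    // the position is at or past the end of the visible data (EOF).
    static int clampReadLength(long position, long visibleLen, int requested) {
        if (position >= visibleLen) {
            return -1; // at or past end of visible data: signal EOF
        }
        long available = visibleLen - position;
        return (int) Math.min(requested, available);
    }

    public static void main(String[] args) {
        // visibleLen = 930000 and bufferLen = 819200 come from the log.
        System.out.println(clampReadLength(0, 930000, 819200));      // 819200: full buffer fits
        System.out.println(clampReadLength(819200, 930000, 819200)); // 110800: clamped to remainder
        System.out.println(clampReadLength(930000, 930000, 819200)); // -1: at EOF
    }
}
```

With such a clamp in place, a read against an uncompleted block could never be issued with a length that extends past the visible end, and the offset arithmetic downstream could not go negative.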
What is the root cause?
Yes, there are multiple exceptions. BlockMissingException is a subclass of IOException, so the BlockMissingException surfaces as an IOException.
The failure occurs in 0.23.0. However, both the failure and the fix landed in the same version at different commit levels, so a small portion of the 0.23.0 source code must be reverted to its 0.22.0 state in order to reproduce the failure.
No special configuration is needed to run the testcase.
No particular timing order is required; only concurrent reading and writing.
Yes. In fact, no event order is needed to reproduce this error.
The log tells us the error occurred in DFSInputStream.java. It also shows multiple read and write operations prior to the failure, which helped narrow down the source of the problem. However, a detailed, non-trivial analysis of the source code is still needed.
Only one machine is needed to reproduce this failure.
BlockMissingException propagates up as an IOException. The block offset becomes negative during the read, which causes the BlockMissingException.
Ding: How does the “Offset” become negative? And if the offset becomes negative, why is the exception thrown? You might need to show the code to explain this point...
The log messages mention that the offset is negative and that a BlockMissingException is raised in DFSInputStream.read.
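One way the offset can go negative, sketched as a simplified, hypothetical illustration (this is not the actual DFSInputStream code, and the variable names and values are invented for the example): under a concurrent write, the reader's cached block metadata can be stale, so "bytes remaining in this block" arithmetic against a stale block end produces a negative number.

```java
// Hypothetical illustration only: stale cached block metadata under a
// concurrent write makes the remaining-bytes arithmetic go negative.
public class NegativeOffsetSketch {
    public static void main(String[] args) {
        // Invented values: the reader cached block metadata before the
        // writer extended the file, so its view is stale.
        long cachedBlockEnd = 524287; // last byte of the block as the reader last saw it
        long pos = 720896;            // reader position after the writer appended more data

        // Simplified "bytes left in this block" arithmetic; with a stale
        // cachedBlockEnd it yields a negative offset.
        long remainingInBlock = cachedBlockEnd - pos + 1;
        System.out.println(remainingInBlock); // prints -196608
    }
}
```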
Caused by: java.io.IOException: #### Exception caught in readUntilEnd: reader currentOffset = 0 ; totalByteRead =0 ; latest byteRead = 0; visibleLen= 930000 ; bufferLen = 819200 ; Filename = /tmp/fileX1
... 32 more
Caused by: org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-978087030-127.0.1.1-1374887260510:blk_1244383136209518898_1001 file=/tmp/fileX1
The block offset becomes negative during concurrent read/write on a non-finalized block.