HDFS-4660 Report

1. Symptom

Blindly copying the checksum from the upstream DN (DN1) instead of recomputing it can cause an inconsistency between the checksum and the data stored on the downstream DN (DN2).

A later verification on that DN detects the mismatch and throws a ChecksumException.

1.1 Severity

Critical.

1.2 Was there an exception thrown?

Yes.

2013-07-20 10:12:52,750 WARN  datanode.BlockPoolSliceScanner (BlockPoolSliceScanner.java:verifyBlock(444)) - 127.0.0.1:32950 - First Verification failed for BP-2087127045-127.0.1.1-1374329565029:blk_7809964727882653017_1001

org.apache.hadoop.fs.ChecksumException: Checksum failed at 391680

        at org.apache.hadoop.hdfs.server.datanode.BlockSender.verifyChecksum(BlockSender.java:555)

        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:460)

        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:622)

        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.verifyBlock(BlockPoolSliceScanner.java:408)

        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.verifyFirstBlock(BlockPoolSliceScanner.java:489)

        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.scan(BlockPoolSliceScanner.java:644)

        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.scanBlockPoolSlice(BlockPoolSliceScanner.java:624)

        at org.apache.hadoop.hdfs.server.datanode.DataBlockScanner.run(DataBlockScanner.java:94)

        at java.lang.Thread.run(Thread.java:662)

2013-07-20 10:12:52,759 WARN  datanode.BlockPoolSliceScanner (BlockPoolSliceScanner.java:verifyBlock(444)) - 127.0.0.1:32950 - Second Verification failed for BP-2087127045-127.0.1.1-1374329565029:blk_7809964727882653017_1001

org.apache.hadoop.fs.ChecksumException: Checksum failed at 391680

        at org.apache.hadoop.hdfs.server.datanode.BlockSender.verifyChecksum(BlockSender.java:555)

        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:460)

        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:622)

        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.verifyBlock(BlockPoolSliceScanner.java:408)

        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.verifyFirstBlock(BlockPoolSliceScanner.java:489)

        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.scan(BlockPoolSliceScanner.java:644)

        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.scanBlockPoolSlice(BlockPoolSliceScanner.java:624)

        at org.apache.hadoop.hdfs.server.datanode.DataBlockScanner.run(DataBlockScanner.java:94)

        at java.lang.Thread.run(Thread.java:662)

1.4 Multiple exceptions?

Yes. One of them is the exception triggered when DN2 shuts down.

2. How to reproduce this failure

In sum, this failure requires the following special events to be triggered:

1. The middle datanode in a 3-DN pipeline (DN1 -> *DN2* -> DN3) shuts down while the replica is being written. Another DN (DN4) is brought in to replace DN2 (or we can simply restart DN2, which then plays the role of DN4).

2. The replica keeps being written to during the above event.

2.0 Version

2.0.3-alpha

2.1 Configuration

Normal configuration; replication factor = 3.

2.2 Reproduction procedure

1. Write to a file and find the pipeline of its block: DN1 -> DN2 -> DN3.

2. Restart DN2.

3. Keep writing more data into the replica.

Steps 2 and 3 must occur in parallel; a hedged sketch of the writer side is shown below.

2.3 Can the logs tell how to reproduce the failure?

Yes.

2.4 How many machines needed?

2 machines are enough: 1NN + 2DN.

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

2013-07-20 10:12:52,750 WARN  datanode.BlockPoolSliceScanner (BlockPoolSliceScanner.java:verifyBlock(444)) - 127.0.0.1:32950 - First Verification failed for BLOCK_ID

org.apache.hadoop.fs.ChecksumException: Checksum failed at XXX

        at org.apache.hadoop.hdfs.server.datanode.BlockSender.verifyChecksum(BlockSender.java:555)

        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:460)

        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:622)

        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.verifyBlock(BlockPoolSliceScanner.java:408)

        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.verifyFirstBlock(BlockPoolSliceScanner.java:489)

        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.scan(BlockPoolSliceScanner.java:644)

        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.scanBlockPoolSlice(BlockPoolSliceScanner.java:624)

        at org.apache.hadoop.hdfs.server.datanode.DataBlockScanner.run(DataBlockScanner.java:94)

        at java.lang.Thread.run(Thread.java:662)

2013-07-20 10:12:52,759 WARN  datanode.BlockPoolSliceScanner (BlockPoolSliceScanner.java:verifyBlock(444)) - 127.0.0.1:32950 - Second Verification failed for BLOCK_ID

org.apache.hadoop.fs.ChecksumException: Checksum failed at XXX

        at org.apache.hadoop.hdfs.server.datanode.BlockSender.verifyChecksum(BlockSender.java:555)

        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:460)

        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:622)

        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.verifyBlock(BlockPoolSliceScanner.java:408)

        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.verifyFirstBlock(BlockPoolSliceScanner.java:489)

        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.scan(BlockPoolSliceScanner.java:644)

        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner.scanBlockPoolSlice(BlockPoolSliceScanner.java:624)

        at org.apache.hadoop.hdfs.server.datanode.DataBlockScanner.run(DataBlockScanner.java:94)

        at java.lang.Thread.run(Thread.java:662)

3.2 Backward inference

From the log we can infer that DN2 restarted and that the pipeline of the block changed. But what is the cause of this exception?

If we look carefully, we can find the following log messages:

DN4's (i.e., the restarted DN2's) log after recovering the replica:

2013-04-01 21:02:31,570 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1004

2013-04-01 21:02:31,570 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW

getNumBytes() = 134144

getBytesOnDisk() = 134144

getVisibleLength()= 134144

DN3's log after recovering the RBW replica:

2013-04-01 21:02:31,575 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-325305253-10.2.201.14-1364820083462:blk_-9076133543772600337_1004

2013-04-01 21:02:31,575 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_-9076133543772600337_1004, RBW

getNumBytes() = 134028

getBytesOnDisk() = 134028

getVisibleLength()= 134028

We can see that the two replicas disagree on the number of bytes (134144 vs. 134028, a difference of 116 bytes).

Although the above two log messages are subtle (it is hard to notice anything suspicious in them), they are the only available evidence...

In fact, the root cause is that during the recovery the newly added DN (DN4) may temporarily hold different data content from its upstream DN; however, it blindly copies the checksum from the upstream DN instead of computing its own. As a result, the stored checksum can differ from the actual on-disk contents.
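
To make the failure mode concrete, here is a small, self-contained illustration (not the actual BlockReceiver code) of why a blindly copied checksum later fails verification. It uses java.util.zip.CRC32 in place of HDFS's per-chunk DataChecksum machinery, and all values are made up.

import java.util.zip.CRC32;

public class ChecksumCopyDemo {
  // CRC over one chunk (HDFS checksums every bytesPerChecksum bytes, 512 by default).
  static long chunkCrc(byte[] data) {
    CRC32 crc = new CRC32();
    crc.update(data, 0, data.length);
    return crc.getValue();
  }

  public static void main(String[] args) {
    byte[] upstreamChunk = new byte[512];   // last partial chunk as the upstream DN sees it
    byte[] localChunk    = new byte[512];   // the same chunk as actually on the new DN's disk
    localChunk[0] = 1;                      // contents temporarily differ during recovery

    long copied     = chunkCrc(upstreamChunk);   // what the buggy path stores (blind copy)
    long recomputed = chunkCrc(localChunk);      // what actually matches the local data

    // A later block scan recomputes the CRC from local data and compares it with the
    // stored value; since copied != recomputed, it raises a ChecksumException.
    System.out.println("stored checksum matches local data? " + (copied == recomputed));
  }
}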

4. Root cause

The checksum is copied from the upstream DN instead of being recalculated over the data actually written locally, which can result in a mismatch between the stored checksum and the on-disk data.

Here is a proposed fix:

--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java        (revision 1463398)
+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java        (working copy)
@@ -528,10 +528,18 @@
            computePartialChunkCrc(onDiskLen, offsetInChecksum, bytesPerChecksum);
          }

-          int startByteToDisk = (int)(onDiskLen-firstByteInBlock)
+          int skippedDataBytes = (int)(onDiskLen-firstByteInBlock);
+          int startByteToDisk = skippedDataBytes
              + dataBuf.arrayOffset() + dataBuf.position();

          int numBytesToDisk = (int)(offsetInBlock-onDiskLen);
+          // Skip checksum bytes corresponding to skippedDataBytes
+          int skippedChecksumBytes = ((skippedDataBytes) / bytesPerChecksum) * checksumSize;
+
+          if(skippedDataBytes > 0 && LOG.isDebugEnabled()) {
+            LOG.debug("Skipped data bytes: " + skippedDataBytes +
+                ", skipped checksum bytes: " + skippedChecksumBytes);
+          }
         
          // Write data to disk.
          out.write(dataBuf.array(), startByteToDisk, numBytesToDisk);
@@ -562,8 +570,8 @@
       
checksumOut.write(checksumBuf.array(),
-           checksumBuf.arrayOffset() + checksumBuf.position(),
-           checksumLen);
+           checksumBuf.arrayOffset() + checksumBuf.position() + skippedChecksumBytes,
+           checksumLen - skippedChecksumBytes);
          }
          /// flush entire packet, sync if requested
          flushOrSync(syncBlock);

Basically, when writing a packet's checksums, skip the checksum bytes that correspond to data already on disk, rather than blindly writing the entire checksum buffer copied from the upstream DN.
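
As a worked example of the offset arithmetic in the patch (the concrete numbers are hypothetical, chosen only to match the HDFS defaults of 512 bytes per checksum chunk and a 4-byte CRC):

public class OffsetMath {
  public static void main(String[] args) {
    int bytesPerChecksum  = 512;        // dfs.bytes-per-checksum default
    int checksumSize      = 4;          // one CRC32 value per chunk
    long firstByteInBlock = 133632L;    // where this packet starts in the block (hypothetical)
    long onDiskLen        = 134144L;    // bytes this replica already has on disk (hypothetical)

    int skippedDataBytes     = (int) (onDiskLen - firstByteInBlock);                  // 512
    int skippedChecksumBytes = (skippedDataBytes / bytesPerChecksum) * checksumSize;  // 4

    // The first 512 data bytes of the packet are already on disk, so the matching
    // 4-byte checksum entry in checksumBuf must be skipped as well; writing the whole
    // checksum buffer would shift every stored checksum by one chunk, and the block
    // scanner would later fail verification.
    System.out.println(skippedDataBytes + " data bytes skipped, "
        + skippedChecksumBytes + " checksum bytes skipped");
  }
}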

4.1 Category:

Semantic. The developers simply did not anticipate this corner case and the behavior needed to handle it.