HDFS-1371 Report

https://issues.apache.org/jira/browse/HDFS-1371

1. Symptom

If a single client (incorrectly) reports all the replicas of a block as corrupted, the NameNode marks the block as corrupt, and all clients will subsequently see the block as “corrupted” even if it is not.

Note: while users might think this is a data loss, there is in fact NO data loss. Clients can still read the data even though the NN marked it as corrupted.

1.1 Severity

Major.

1.2 Was there exception thrown?

Yes. Checksum error (client side).

2013-07-11 22:51:53,006 INFO  fs.FSInputChecker (FSInputChecker.java:readChecksumChunk(282)) - Found checksum error: b[0, 4096]=00010..
org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_3904726247370429212:BP-1090729792-127.0.1.1-1373597509617:of:/tmp/testClientReportBadBlock/testCorruptAllReplicas0 at 0 exp: -311970034 got: 476132726
        at org.apache.hadoop.fs.FSInputChecker.verifySums(FSInputChecker.java:320)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:276)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:225)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:193)
        at org.apache.hadoop.hdfs.RemoteBlockReader.read(RemoteBlockReader.java:141)
        at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:504)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:576)
        .. ..

1.2.1 Were there multiple exceptions?

Yes. There are multiple ChecksumExceptions, followed by an IOException: “No live nodes contain current block”.

2. How to reproduce this failure

This failure requires 1 NN and 3 DNs. I used the test case to reproduce it.

2.0 Version

0.23.0, with the patch manually reversed.

2.1 Configuration

Since I ran the test case, no configuration was required.

To really reproduce it, we can just set up a 3-node HDFS cluster and set the replication factor to 3. The key is to have the client encounter corrupted blocks. The corruption logic (modeled on the testing program) is sketched below.
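The test program's corruption routine is not reproduced in this report; the idea is to overwrite part of an on-disk replica file so the data no longer matches its stored checksums. A minimal sketch, assuming the replica file in the DN's data directory has already been located (the class and helper names here are illustrative, not the test's actual code):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class BlockCorruptor {
  /**
   * Flip the first bytes of an on-disk replica file so the data no longer
   * matches the checksums in the accompanying .meta file. The next client
   * read of this replica will then fail with a ChecksumException.
   */
  static void corruptReplica(File blockFile) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(blockFile, "rw")) {
      byte[] buf = new byte[(int) Math.min(64, raf.length())];
      raf.readFully(buf);
      for (int i = 0; i < buf.length; i++) {
        buf[i] ^= 0xff;          // invert every bit
      }
      raf.seek(0);
      raf.write(buf);            // write the garbage back in place
    }
  }
}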

2.2 Reproduction procedure

Reproduce via testing:

mvn test -Dtest=TestClientReportBadBlock,TestClientReportBadBlock#testCorruptAllOfThreeReplicas

 -- This test creates three DNs (on different ports) and corrupts the file's replicas.

To really reproduce it (a complete sketch follows this list):

- start the NN and the 3 DNs

- corrupt all three replicas on disk (using the corruption logic above) and have one client read the block, so that it reports every replica as corrupt to the NN

- use another client to read the block: it fails even though the data on disk is intact.
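Putting the steps together: a sketch of the whole scenario using MiniDFSCluster, in the style of the test. Here corruptReplica() is the sketch from Section 2.1, and blockFileOnDataNode() is a hypothetical helper, since locating a replica's on-disk file is version-specific:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DFSTestUtil;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.protocol.ExtendedBlock;

public class ReproduceHdfs1371 {
  public static void main(String[] args) throws Exception {
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(new Configuration())
        .numDataNodes(3).build();
    try {
      FileSystem fs = cluster.getFileSystem();
      Path path = new Path("/tmp/testCorruptAllReplicas");
      DFSTestUtil.createFile(fs, path, 4096L, (short) 3, 0L);
      ExtendedBlock blk = DFSTestUtil.getFirstBlock(fs, path);

      // Corrupt the replica on every DN so each read attempt fails.
      for (int i = 0; i < 3; i++) {
        // blockFileOnDataNode() is hypothetical; see lead-in above.
        BlockCorruptor.corruptReplica(blockFileOnDataNode(cluster, i, blk));
      }

      // Client 1: hits a ChecksumException on each replica and reports
      // every one of them to the NN.
      try {
        DFSTestUtil.readFile(fs, path);
      } catch (java.io.IOException expected) {
        // "Could not obtain block ...: No live nodes contain current block"
      }

      // Client 2 (a fresh FileSystem instance): the NN now considers the
      // block corrupt, so this read fails too, although the raw bytes on
      // the DNs are still readable.
      DFSTestUtil.readFile(
          FileSystem.newInstance(fs.getUri(), fs.getConf()), path);
    } finally {
      cluster.shutdown();
    }
  }
}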

2.3 Can the logs tell how to reproduce the failure?

The key is to know:

A. the block was marked as corrupted.

B. all the replicas of a block are marked as corrupted.

 

We can easily get A from the logs. If we have the client log, this is clear: it will contain the multiple ChecksumExceptions already shown in Section 1.2.


Even without the client log, the NN log still shows:

2013-07-11 22:51:53,191 INFO  hdfs.StateChange (NameNodeRpcServer.java:reportBadBlocks(456)) - *DIR* NameNode.reportBadBlocks

2013-07-11 22:51:53,192 INFO  hdfs.StateChange (CorruptReplicasMap.java:addToCorruptReplicasMap(56)) - BLOCK NameSystem.addToCorruptReplicasMap: blk_3904726247370429212 added as corrupt on 127.0.0.1:58013 by /127.0.0.1

So condition A (“the block is marked as corrupted”) is easy to check from the logs. Condition B (“all replicas of the block are marked as corrupted”) is a bit trickier, but we can still get it:

Client log:

2013-07-11 22:51:55,648 INFO  hdfs.DFSClient (DFSInputStream.java:chooseDataNode(651)) - Could not obtain block BP-1090729792-127.0.1.1-1373597509617:blk_3904726247370429212_1001 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
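This message is the give-away for condition B. It comes from the client-side replica selection: once every known location of the block has been ruled out, the client gives up. A simplified illustration of that logic (condensed from the DFSInputStream/bestNode code of that era; names and signatures are approximate):

  // Pick a replica location that has not been ruled out yet.
  private DatanodeInfo bestNode(DatanodeInfo[] nodes,
                                Set<DatanodeInfo> deadNodes) throws IOException {
    for (DatanodeInfo node : nodes) {
      if (!deadNodes.contains(node)) {
        return node;
      }
    }
    // Every location is unusable. Since the NN stops offering replicas it
    // believes are corrupt, a block whose replicas were ALL reported
    // corrupt ends up here even though the bytes on disk may be readable.
    throw new IOException("No live nodes contain current block");
  }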

2.4 How many machines needed?

3.

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

All clients throw the ChecksumException shown above.

3.2 Backward inference

Following the exception, we find this warning, which is printed by DFSInputStream.readBuffer():

2013-07-11 22:51:53,189 WARN  hdfs.DFSClient (DFSInputStream.java:readBuffer(512)) - Found Checksum error for BP-1090729792-127.0.1.1-1373597509617:blk_3904726247370429212_1001 from 127.0.0.1:58013 at 0

  private synchronized int readBuffer(byte buf[], int off, int len)
                                                  throws IOException {
    .. ..
    while (true) {
      // retry as many times as seekToNewSource allows.
      try {
        return blockReader.read(buf, off, len);
      } catch (ChecksumException ce) {
        DFSClient.LOG.warn("Ding: checksumException in readBuffer ");
        DFSClient.LOG.warn("Found Checksum error for " + currentBlock + " from " +
                 currentNode.getName() + " at " + ce.getPos());
        // reportChecksumFailure() reports this checksum failure to the NN immediately!!!
        dfsClient.reportChecksumFailure(src, currentBlock, currentNode);
        ioe = ce;
        retryCurrentNode = false;
      }
    }
  }
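reportChecksumFailure() is the client-to-NN path that poisons the block. A simplified sketch of it (condensed from the DFSClient source; not verbatim): it wraps the bad replica in a LocatedBlock and sends it to the NN in a single reportBadBlocks RPC, with no verification step anywhere:

  // Inside DFSClient (simplified sketch):
  public void reportChecksumFailure(String file, ExtendedBlock blk,
                                    DatanodeInfo dn) {
    DatanodeInfo[] dnArr = { dn };
    LocatedBlock[] lblocks = { new LocatedBlock(blk, dnArr) };
    reportChecksumFailure(file, lblocks);
  }

  public void reportChecksumFailure(String file, LocatedBlock[] lblocks) {
    try {
      // One unverified RPC is all it takes: the NN trusts the client
      // and marks the replica as corrupt.
      namenode.reportBadBlocks(lblocks);
    } catch (IOException ie) {
      LOG.info("Failed to report bad block(s) of " + file, ie);
    }
  }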

But what happens next? On the NameNode side, the log shows:

2013-07-11 22:51:53,191 INFO  hdfs.StateChange (NameNodeRpcServer.java:reportBadBlocks(456)) - *DIR* NameNode.reportBadBlocks

2013-07-11 22:51:53,192 INFO  hdfs.StateChange (CorruptReplicasMap.java:addToCorruptReplicasMap(56)) - BLOCK NameSystem.addToCorruptReplicasMap: blk_3904726247370429212 added as corrupt on 127.0.0.1:58013 by /127.0.0.1

/**
 * Stores information about all corrupt blocks in the File System.
 * A Block is considered corrupt only if all of its replicas are
 * corrupt. While reporting replicas of a Block, we hide any corrupt
 * copies. These copies are removed once Block is found to have
 * expected number of good replicas.
 * Mapping: Block -> TreeSet<DatanodeDescriptor>
 */
@InterfaceAudience.Private
public class CorruptReplicasMap {

  /**
   * Mark the block belonging to datanode as corrupt.
   *
   * @param blk Block to be added to CorruptReplicasMap
   * @param dn DatanodeDescriptor which holds the corrupt replica
   */
  public void addToCorruptReplicasMap(Block blk, DatanodeDescriptor dn) {
    Collection<DatanodeDescriptor> nodes = getNodes(blk);
    if (nodes == null) {
      nodes = new TreeSet<DatanodeDescriptor>();
      corruptReplicasMap.put(blk, nodes);
    }
    if (!nodes.contains(dn)) {
      nodes.add(dn);
      NameNode.stateChangeLog.info("BLOCK NameSystem.addToCorruptReplicasMap: " +
                                   blk.getBlockName() +
                                   " added as corrupt on " + dn.getName() +
                                   " by " + Server.getRemoteIp());
    } else {
      NameNode.stateChangeLog.info("BLOCK NameSystem.addToCorruptReplicasMap: " +
                                   "duplicate requested for " +
                                   blk.getBlockName() + " to add as corrupt " +
                                   "on " + dn.getName() +
                                   " by " + Server.getRemoteIp());
    }
  }

Apparently, the NN adds this replica to the corruptReplicasMap. Per the javadoc above, once every replica of the block has been added, the block itself is considered corrupt, and the NN reports it as corrupt to every client that later tries to read it, even though the data on the DNs may still be perfectly readable.
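To see why this poisons the block for everyone, consider how the NN decides a block is corrupt. The following is a hypothetical simplification of the NN-side bookkeeping (not actual BlockManager code), just to make the trust model explicit:

  // Hypothetical simplification: a block becomes corrupt once corrupt
  // reports cover all of its replicas. Nothing on this path re-verifies
  // the replicas against their on-disk checksums.
  boolean isBlockCorrupt(Block blk) {
    int corrupt = corruptReplicasMap.numCorruptReplicas(blk);
    int total   = blocksMap.numNodes(blk);
    return total > 0 && corrupt >= total;
  }

So three bogus reports from one client are enough to flip the block to “corrupt” for every other client.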

4. Root cause

4.1 Category:

Incorrect handling of an exception (checksum exception). This is an example of a bug that can only be caught by anticipating the real-world error scenario: the original error-handling logic was fine when checksum errors are occasional and independent, but not in this particular error mode, where a single client reports every replica of a block as corrupt. This is a protocol design error.