HDFS-4022 Report

https://issues.apache.org/jira/browse/HDFS-4022

1. Symptom

Incomplete computation (data loss)

One block (the last block) of a file is ALWAYS under-replicated.

1.1 Severity

Blocker

1.2 Was there an exception thrown?

No. There was an INFO message:

2013-06-12 23:56:16,536 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Error report from DatanodeRegistration(192.168.103.137, storageID=DS-76757462-127.0.1.1-51010-1370991989963, infoPort=50075, ipcPort=50020, storageInfo=lv=-40;cid=CID-a2140a70-f13e-47de-809e-f7a17b559e1e;nsid=1675131065;c=0): Can't send invalid block BP-904826208-127.0.1.1-1370991955105:blk_103180142309369584_1100

But there were no exceptions.

2. How to reproduce this failure

2.0 Version

2.0.0-alpha

2.1 Configuration

Two nodes (VMs): one acting as NN + DN, the other as a DN only. The only non-standard configuration:

In hdfs-site.xml on both nodes, add:

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

Also, do not list the 2nd DN in the NN's slaves file (it will be started on the fly).

2.2 Reproduction procedure

- start-dfs.sh

- hdfs dfs -put random.txt /

- java -cp `hadoop classpath`:build/classes mypackage.HDFS_4022   (the test program, which presumably appends to /random.txt; a sketch follows this list)

- Add the 2nd DN on the fly (for example, by running sbin/hadoop-daemon.sh start datanode on the 2nd node)

- hdfs fsck / -files -locations -blocks
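
The source of mypackage.HDFS_4022 is not included in this report; judging from the key events in 2.3, it presumably appends a small amount of data to /random.txt so that the last block gets a new generation stamp. A minimal sketch of such an append client (the class name comes from the command above; everything else is an assumption) could look like:

package mypackage;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical reconstruction of the test program: append a few bytes to /random.txt.
// After the append, the last block's generation stamp is bumped (1100 -> 1101 in the logs).
public class HDFS_4022 {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();            // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.append(new Path("/random.txt"));
    out.write("appended bytes".getBytes("UTF-8"));
    out.close();                                         // finalizes the appended last block
    fs.close();
  }
}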

2.3 Can the logs tell how to reproduce the failure?

Yes.

Key events:

Replication factor > 1;

   --- can be inferred from the NN's log:

2013-06-12 23:56:13,739 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.103.137:50010 to replicate blk_103180142309369584_1100 to datanode(s) 192.168.103.135:50010

File is appended;

2013-06-12 23:55:40,855 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: updatePipeline(BP-904826208-127.0.1.1-1370991955105:blk_103180142309369584_1100) successfully to BP-904826208-127.0.1.1-1370991955105:blk_103180142309369584_1101

 

DN added on the fly;

2013-06-12 23:56:11,261 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.103.135:50010

2.4 How many machines needed?

2 nodes are enough.

3. Diagnosis procedure

3.1 Detailed Symptom

You notice the following in the fsck output:

ing@research-virtual-machine:~/research/hadoop/hadoop-2.0.0-alpha-src/hadoop-dist/target/hadoop-2.0.0-alpha/LogCor$ hadoop fsck / -locations -blocks -files

/random.txt 671907881 bytes, 11 block(s):  Under replicated BP-904826208-127.0.1.1-1370991955105:blk_103180142309369584_1101. Target Replicas is 2 but found 1 replica(s).

0. BP-904826208-127.0.1.1-1370991955105:blk_508106224367822418_1090 len=67108864 repl=2 [192.168.103.137:50010, 192.168.103.135:50010]

1. BP-904826208-127.0.1.1-1370991955105:blk_-1963754189664148146_1091 len=67108864 repl=2 [192.168.103.137:50010, 192.168.103.135:50010]

2. BP-904826208-127.0.1.1-1370991955105:blk_-5095818792531264431_1092 len=67108864 repl=2 [192.168.103.137:50010, 192.168.103.135:50010]

3. BP-904826208-127.0.1.1-1370991955105:blk_1200156498847246030_1093 len=67108864 repl=2 [192.168.103.137:50010, 192.168.103.135:50010]

4. BP-904826208-127.0.1.1-1370991955105:blk_-6155561162201431463_1094 len=67108864 repl=2 [192.168.103.137:50010, 192.168.103.135:50010]

5. BP-904826208-127.0.1.1-1370991955105:blk_4070565714773300023_1095 len=67108864 repl=2 [192.168.103.137:50010, 192.168.103.135:50010]

6. BP-904826208-127.0.1.1-1370991955105:blk_3286896170678018968_1096 len=67108864 repl=2 [192.168.103.137:50010, 192.168.103.135:50010]

7. BP-904826208-127.0.1.1-1370991955105:blk_2924397875907395816_1097 len=67108864 repl=2 [192.168.103.137:50010, 192.168.103.135:50010]

8. BP-904826208-127.0.1.1-1370991955105:blk_7821028780172106405_1098 len=67108864 repl=2 [192.168.103.137:50010, 192.168.103.135:50010]

9. BP-904826208-127.0.1.1-1370991955105:blk_-2989912312310514571_1099 len=67108864 repl=2 [192.168.103.137:50010, 192.168.103.135:50010]

10. BP-904826208-127.0.1.1-1370991955105:blk_103180142309369584_1101 len=819241 repl=1 [192.168.103.137:50010]

 

Status: HEALTHY

Total size:    671907881 B

Total dirs:    2

Total files:   1

Total blocks (validated):      11 (avg. block size 61082534 B)

Minimally replicated blocks:   11 (100.0 %)

Over-replicated blocks:        0 (0.0 %)

Under-replicated blocks:       1 (9.090909 %)

Mis-replicated blocks:         0 (0.0 %)

Default replication factor:    2

Average block replication:     1.9090909

Corrupt blocks:                0

Missing replicas:              1 (4.7619047 %)

Number of data-nodes:          2

Number of racks:               1

FSCK ended at Wed Jun 12 23:56:52 EDT 2013 in 2 milliseconds

 

The filesystem under path '/' is HEALTHY

The last block is persistently under-replicated. The fsck output also gives its block ID, which we can use to search the logs.

3.2 Backward inference

A natural next step is to search for this block ID from the NN/DN logs.

Note, in the following context, 192.168.103.137 is the NN and the 1st DN; 192.168.103.135 is the 2nd DN that was added on the fly.

Now let's take a look at the NameNode's log and search for that particular block.

Here are all the records for this block in the NN's log (over this short run of the NameNode):

 

2013-06-12 23:53:52,639 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /random.txt._COPYING_. BP-904826208-127.0.1.1-1370991955105 blk_103180142309369584_1100{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[192.168.103.137:50010|RBW]]}

 

2013-06-12 23:53:52,650 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* addStoredBlock: blockMap updated: 192.168.103.137:50010 is added to blk_103180142309369584_1100{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[192.168.103.137:50010|RBW]]} size 0

 

2013-06-12 23:55:40,854 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: updatePipeline(block=BP-904826208-127.0.1.1-1370991955105:blk_103180142309369584_1100, newGenerationStamp=1101, newLength=819200, newNodes=[192.168.103.137:50010], clientName=DFSClient_NONMAPREDUCE_-685319194_1)

 

2013-06-12 23:55:40,855 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: updatePipeline(BP-904826208-127.0.1.1-1370991955105:blk_103180142309369584_1100) successfully to BP-904826208-127.0.1.1-1370991955105:blk_103180142309369584_1101

  --- This is the append operation. After the append, the block's generation stamp changes from 1100 to 1101.

 

2013-06-12 23:56:13,739 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.103.137:50010 to replicate blk_103180142309369584_1100 to datanode(s) 192.168.103.135:50010

  --- This is the replication request. However, it still refers to the old generation stamp (1100). It also asks the DN on 137 to replicate the block to 135.

 

2013-06-12 23:56:16,536 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Error report from DatanodeRegistration(192.168.103.137, storageID=DS-76757462-127.0.1.1-51010-1370991989963, infoPort=50075, ipcPort=50020, storageInfo=lv=-40;cid=CID-a2140a70-f13e-47de-809e-f7a17b559e1e;nsid=1675131065;c=0): Can't send invalid block BP-904826208-127.0.1.1-1370991955105:blk_103180142309369584_1100

  --- This is reported by DN@137 (the first DN): it did not try to send the block, because it considered the block invalid.

 

So at this point there are already log messages clearly indicating the problem.

 

Now, let's see whether the DN logs can tell us something interesting. In the slave DN's log, for every other block we found records like:

2013-06-12 23:56:22,580 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block BP-904826208-127.0.1.1-1370991955105:blk_-5095818792531264431_1092 src: /192.168.103.137:52811 dest: /192.168.103.135:50010

 

2013-06-12 23:56:23,160 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block BP-904826208-127.0.1.1-1370991955105:blk_1200156498847246030_1093 src: /192.168.103.137:52812 dest: /192.168.103.135:50010 of size 67108864

 

However, the under-replicated block blk_103180142309369584_1100/1101 was not there. A natural suspicion is therefore that this block was never replicated to the remote DN.

 

Now let’s see the master’s DN’s log:

2013-06-12 23:56:16,535 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Can't send invalid block BP-904826208-127.0.1.1-1370991955105:blk_103180142309369584_1100

 

 

OK. The next question is why this happened.

Now, let's summarize what we can infer from the logs:

A. The append operation succeeded:

In NN’s log:

2013-06-12 23:55:40,855 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: updatePipeline(BP-904826208-127.0.1.1-1370991955105:blk_103180142309369584_1100) successfully to BP-904826208-127.0.1.1-1370991955105:blk_103180142309369584_1101

 

In DN@137 (master)’s log:

2013-06-12 23:55:40,867 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.103.137:55322, dest: /192.168.103.137:50010, bytes: 819241, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-685319194_1, offset: 0, srvID: DS-76757462-127.0.1.1-51010-1370991989963, blockid: BP-904826208-127.0.1.1-1370991955105:blk_103180142309369584_1101, duration: 16080711

 

2013-06-12 23:55:40,867 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-904826208-127.0.1.1-1370991955105:blk_103180142309369584_1101, type=LAST_IN_PIPELINE, downstreams=0:[] terminating

 

 

B. The DN@135 successfully registered itself to NN.

2013-06-12 23:56:11,260 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from DatanodeRegistration(192.168.103.135, storageID=DS-465293080-127.0.1.1-51010-1371003107527, infoPort=50075, ipcPort=50020, storageInfo=lv=-40;cid=CID-a2140a70-f13e-47de-809e-f7a17b559e1e;nsid=1675131065;c=0) storage DS-465293080-127.0.1.1-51010-1371003107527

 

2013-06-12 23:56:11,261 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.103.135:50010

 

2013-06-12 23:56:11,511 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: BLOCK* processReport: Received first block report from 192.168.103.135:50010 after becoming active. Its block contents are no longer considered stale.

 

C. The NN asked the DN@137 to replicate the blocks to DN@135.

2013-06-12 23:56:13,739 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.103.137:50010 to replicate blk_103180142309369584_1100 to datanode(s) 192.168.103.135:50010

 

Note that the generation stamp of this block (1100) is stale.

 

D. DN@137 refused to send it to 135 for replication:

2013-06-12 23:56:16,535 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Can't send invalid block BP-904826208-127.0.1.1-1370991955105:blk_103180142309369584_1100

 

2013-06-12 23:56:16,536 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Error report from DatanodeRegistration(192.168.103.137, storageID=DS-76757462-127.0.1.1-51010-1370991989963, infoPort=50075, ipcPort=50020, storageInfo=lv=-40;cid=CID-a2140a70-f13e-47de-809e-f7a17b559e1e;nsid=1675131065;c=0): Can't send invalid block BP-904826208-127.0.1.1-1370991955105:blk_103180142309369584_1100

 

The suspicion at this point is that the NN somehow asked the DN to replicate this block under a stale generation stamp...

 

Next, we need to go to the source to understand:

1. Why does DN@137 think the block is "invalid"?

DataNode.java:

  private void transferBlock( ExtendedBlock block, DatanodeInfo xferTargets[] ) {

    BPOfferService bpos = getBPOSForBlock(block);  // get a worker thread

    DatanodeRegistration bpReg = getDNRegistrationForBP(block.getBlockPoolId());

    if (!data.isValidBlock(block)) {  // → FsDatasetImpl.isValidBlock (ExtendedBlock)

      // block does not exist or is under-construction

      String errStr = "Can't send invalid block " + block;

      LOG.info(errStr);

      bpos.trySendErrorReport(DatanodeProtocol.INVALID_BLOCK, errStr);

      return;

    }

    …

  }

FsDatasetImpl.java: isValidBlock returns true only for a replica that exists locally and is finalized.

  /**

   * Check whether the given block is a valid one.

   * valid means finalized

   */

  public boolean isValidBlock(ExtendedBlock b) {

    return isValid(b, ReplicaState.FINALIZED);  // →FsDatasetImpl.isValid (ExtendedBlock, ReplicaState)

  }

Still FsDatasetImpl.java:

  /** Does the block exist and have the given state? */

  private boolean isValid(final ExtendedBlock b, final ReplicaState state) {

    final ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(),

        b.getLocalBlock());  // →ReplicaMap.get(String bpid, Block)

    return replicaInfo != null

        && replicaInfo.getState() == state

        && replicaInfo.getBlockFile().exists();

  }

 --- Now there is an ambiguity: apparently this "isValid" returned false, but it could be for any one of the following reasons:

A) replicaInfo == null (cannot find the block in volumeMap)

B) replicaInfo.getState() != state (not finalized)

C) replicaInfo.getBlockFile().exists() returns false (the block file is missing on disk)

What are the semantics of these three?

So the “replicaInfo” is from

volumeMap.get(b.getBlockPoolId(), b.getLocalBlock())

/**

 * Identifies a Block uniquely across the block pools

 */

public class ExtendedBlock {

  private String poolId;

  private Block block;

  public String getBlockPoolId() {

    return poolId;

  }

  public Block getLocalBlock() {

    return block;

  }

}

public class Block {

  public static final String BLOCK_FILE_PREFIX = "blk_";

  public static final String METADATA_EXTENSION = ".meta";

  private long blockId;

  private long numBytes;

  private long generationStamp;

}

Now the first question: what is a block pool (and a pool ID)?

Each block pool corresponds to one namespace. In HDFS (with federation) there can be multiple namespaces, and thus multiple NameNodes; a pool ID identifies one namespace (and thus one NN).

Then the line:

volumeMap.get(b.getBlockPoolId(), b.getLocalBlock())

is given the namespace and the particular block, and tries to fetch the replica info:

 /**

   * Get the meta information of the replica that matches both block id

   * and generation stamp

   * @param bpid block pool id

   * @param block block with its id as the key

   * @return the replica's meta information

   * @throws IllegalArgumentException if the input block or block pool is null

   */

  ReplicaInfo get(String bpid, Block block) {

    checkBlockPool(bpid);

    checkBlock(block);

    ReplicaInfo replicaInfo = get(bpid, block.getBlockId());

    if (replicaInfo != null &&

        block.getGenerationStamp() == replicaInfo.getGenerationStamp()) {

          // ← 1100 (from NN) != 1101 (local replica): this test fails!

      return replicaInfo;

    }

    return null;

  }

Here, 'replicaInfo' describes the replica stored on the local DN (generation stamp 1101), while 'block' comes from the NN's replication request (generation stamp 1100); the generation stamps don't match, so get() returns null.
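
To make the failing check concrete, here is a tiny standalone illustration (a toy snippet only, not the real HDFS classes; the numbers come from the logs above):

// Toy illustration: the block id matches, but the generation stamps differ,
// so the lookup behaves as if the replica did not exist at all.
public class GenStampMismatchDemo {
  public static void main(String[] args) {
    long blockId       = 103180142309369584L;
    long nnGenStamp    = 1100;  // carried in the NN's replication request
    long localGenStamp = 1101;  // generation stamp of the local replica after the append

    // ReplicaMap.get() requires both the block id and the generation stamp to match.
    boolean found = (nnGenStamp == localGenStamp);

    // Prints "false": isValid() therefore returns false and the DN logs "Can't send invalid block".
    System.out.println("replica found for blk_" + blockId + "_" + nnGenStamp + "? " + found);
  }
}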

So the stale generation stamp (1100) sent by the NN is the real suspect. Therefore the next question is:

2. Why NN asked DN to replicate the block with gen_stamp = 1100?

To answer this question, let’s first look at the logs from NN to see where are the relevant log messages.

Probably the most relevant log message is:

2013-06-12 23:56:13,739 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.103.137:50010 to replicate blk_103180142309369584_1100 to datanode(s) 192.168.103.135:50010

BlockManager.java:

/**

 * Keeps information related to the blocks stored in the Hadoop cluster.

 */

public class BlockManager {

  /** Replicate a set of blocks

   *

   * @param blocksToReplicate blocks to be replicated, for each priority

   * @return the number of blocks scheduled for replication

   */

int computeReplicationWorkForBlocks(List<List<Block>> blocksToReplicate) {

   synchronized (neededReplications) {

        for (int priority = 0; priority < blocksToReplicate.size(); priority++) {

          for (Block block : blocksToReplicate.get(priority)) {

            .. ..

            if (numReplicas.liveReplicas() < requiredReplication) {

              additionalReplRequired = requiredReplication

                  - numEffectiveReplicas;

            } else {

              additionalReplRequired = 1; // Needed on a new rack

            }

            work.add(new ReplicationWork(block, fileINode, srcNode,

                containingNodes, liveReplicaNodes, additionalReplRequired,

                priority));

          }

        }

      }

    if (NameNode.stateChangeLog.isInfoEnabled()) {

      // log which blocks have been scheduled for replication

      for(ReplicationWork rw : work){

        DatanodeDescriptor[] targets = rw.targets;

        if (targets != null && targets.length != 0) {

          StringBuilder targetList = new StringBuilder("datanode(s)");

          for (int k = 0; k < targets.length; k++) {

            targetList.append(' ');

            targetList.append(targets[k]);

          }

          NameNode.stateChangeLog.info(

                  "BLOCK* ask "

                  + rw.srcNode + " to replicate "

                  + rw.block + " to " + targetList);

        }

      }

    }

  -- This 'block' comes from the list "work", which in turn is built from the list "blocksToReplicate" (the parameter passed into computeReplicationWorkForBlocks). Therefore, we need to look at the caller of this function.

 

  /**

   * Scan blocks in {@link #neededReplications} and assign replication

   * work to data-nodes they belong to.

   *

   * The number of process blocks equals either twice the number of live

   * data-nodes or the number of under-replicated blocks whichever is less.

   *

   * @return number of blocks scheduled for replication during this iteration.

   */

  int computeReplicationWork(int blocksToProcess) {

    List<List<Block>> blocksToReplicate = null;

    namesystem.writeLock();

    try {

      // Choose the blocks to be replicated

      blocksToReplicate = neededReplications

          .chooseUnderReplicatedBlocks(blocksToProcess);

    } finally {

      namesystem.writeUnlock();

    }

    return computeReplicationWorkForBlocks(blocksToReplicate);

  }

  -- So this "blocksToReplicate" comes from "neededReplications". The block was added to "neededReplications" (still with generation stamp 1100) before it was appended, and even after the append that entry was never refreshed or removed.
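
As a self-contained illustration (a toy model only, not the real BlockManager data structures), an entry queued before the append keeps carrying the old generation stamp until something removes it:

import java.util.ArrayDeque;
import java.util.Queue;

// Toy model: a replication queue that stores the block as it looked when it was enqueued.
// If the append does not remove or refresh the entry, the replication monitor later hands
// out the stale generation stamp (1100) even though the replica has moved on to 1101.
public class StaleReplicationQueueDemo {
  static class Blk {
    final long id;
    final long genStamp;
    Blk(long id, long genStamp) { this.id = id; this.genStamp = genStamp; }
    public String toString() { return "blk_" + id + "_" + genStamp; }
  }

  public static void main(String[] args) {
    Queue<Blk> neededReplications = new ArrayDeque<Blk>();

    // Enqueued while the block was still at generation stamp 1100.
    neededReplications.add(new Blk(103180142309369584L, 1100));

    // The append bumps the generation stamp to 1101 in the block map,
    // but nothing removes or updates the queued entry.

    // Later the replication monitor dequeues the stale block and asks the DN to send it,
    // which the DN then rejects as an "invalid block".
    System.out.println("ask DN to replicate " + neededReplications.poll());
  }
}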

At this point, I believe a developer with a good understanding of HDFS can already see the root cause clearly and fix it.

Now the natural question is: shouldn't the "neededReplications" list have been refreshed when the block was appended?

Here is the function that is called when the append occurred:

 /**

   * Convert the last block of the file to an under construction block.<p>

   * The block is converted only if the file has blocks and the last one

   * is a partial block (its size is less than the preferred block size).

   * The converted block is returned to the client.

   * The client uses the returned block locations to form the data pipeline

   * for this block.<br>

   * The methods returns null if there is no partial block at the end.

   * The client is supposed to allocate a new block with the next call.

   *

   * @param fileINode file

   * @return the last block locations if the block is partial or null otherwise

   */

  public LocatedBlock convertLastBlockToUnderConstruction(

      MutableBlockCollection fileINode) throws IOException {

    BlockInfo oldBlock = fileINode.getLastBlock();

    if(oldBlock == null ||

        fileINode.getPreferredBlockSize() == oldBlock.getNumBytes())

      return null;

    assert oldBlock == getStoredBlock(oldBlock) :

      "last block of the file is not in blocksMap";

    DatanodeDescriptor[] targets = getNodes(oldBlock);

    BlockInfoUnderConstruction ucBlock =

      fileINode.setLastBlock(oldBlock, targets);

    blocksMap.replaceBlock(ucBlock);

    // Remove block from replication queue.

    updateNeededReplications(oldBlock, 0, 0);

    // remove this block from the list of pending blocks to be deleted.

    for (DatanodeDescriptor dd : targets) {

      String datanodeId = dd.getStorageID();

      invalidateBlocks.remove(datanodeId, oldBlock);

    }

   

    // Adjust safe-mode totals, since under-construction blocks don't

    // count in safe-mode.

    namesystem.adjustSafeModeBlockTotals(

        // decrement safe if we had enough

        targets.length >= minReplication ? -1 : 0,

        // always decrement total blocks

        -1);

    final long fileLength = fileINode.computeContentSummary().getLength();

    final long pos = fileLength - ucBlock.getNumBytes();

    return createLocatedBlock(ucBlock, pos, AccessMode.WRITE);

  }

 --- From the comment and the call to "updateNeededReplications", it seems that the developers did try to remove this block from the "neededReplications" list. Let's zoom into this function.

  private void updateNeededReplications(final Block block,

      final int curReplicasDelta, int expectedReplicasDelta) {

    namesystem.writeLock();

    try {

      if (!namesystem.isPopulatingReplQueues()) {

        return;

      }

      NumberReplicas repl = countNodes(block);

      int curExpectedReplicas = getReplication(block);

      if (isNeededReplication(block, curExpectedReplicas, repl.liveReplicas())) {

        --- When this is called on the appended block, this branch is taken (the block still needs replication, since only 1 of the 2 required replicas exists). So instead of removing the block, the call merely updates its entry. Presumably the developers intended the other branch (which removes the block) to be taken here.

        neededReplications.update(block, repl.liveReplicas(), repl

            .decommissionedReplicas(), curExpectedReplicas, curReplicasDelta,

            expectedReplicasDelta);

      } else {

        int oldReplicas = repl.liveReplicas()-curReplicasDelta;

        int oldExpectedReplicas = curExpectedReplicas-expectedReplicasDelta;

        neededReplications.remove(block, oldReplicas, repl.decommissionedReplicas(),

                                  oldExpectedReplicas);

      }

    } finally {

      namesystem.writeUnlock();

    }

  }

At this point, everything becomes clear:

When the block was appended, convertLastBlockToUnderConstruction did try to remove the block from the 'neededReplications' list by calling 'updateNeededReplications'. However, the developers made a mistake: that call does not actually remove the entry, so the stale block (with generation stamp 1100) remained in the replication queue.

The fix: https://issues.apache.org/jira/secure/attachment/12549972/HDFS-4022.patch

Index: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
===================================================================
--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java        (revision 1400024)
+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java        (working copy)
@@ -624,7 +624,10 @@
    blocksMap.replaceBlock(ucBlock);

    // Remove block from replication queue.
-    updateNeededReplications(oldBlock, 0, 0);
+    NumberReplicas replicas = countNodes(ucBlock);
+    neededReplications.remove(ucBlock, replicas.liveReplicas(),
+        replicas.decommissionedReplicas(), getReplication(ucBlock));
+    pendingReplications.remove(ucBlock);

    // remove this block from the list of pending blocks to be deleted.
    for (DatanodeDescriptor dd : targets) {

4. Root cause

4.1 Category:

Semantic bug.

Acknowledgement:

This bug was initially reproduced and studied by Dongcai Shen. Our group reproduced it independently thereafter and provided additional analysis and findings.