[hadoop] HADOOP-6498 Report

1. Symptom

A network problem may cause some DataNodes (DN) to hang, which in turn causes clients to hang.

Reported in Baidu’s production Hadoop cluster.

1.1 Severity

Blocker

1.2 Was there exception thrown?

Yes.

IOException: network connection problem (not logged, though).

NotReplicatedYetException

1.2.1 Were there multiple exceptions?

Yes.

2. How to reproduce this failure

2.0 Version

0.18.3

2.1 Configuration

Standard

2.2 Reproduction procedure

1) Upload a large file to HDFS.

2) Introduce a network connection error on the DN’s RPC to the NameNode (NN).

2.2.1 Timing order

The timing of the network exception is important: it must occur while the DN’s RPC to the NN is in progress.
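
One way to control this timing in a test harness is to wrap the stream that carries the RPC response and make it fail after a chosen number of bytes. The sketch below only illustrates that idea; `FailingInputStream`, `failAfterBytes`, and `bytesReadBeforeFailure` are hypothetical names and are not part of Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical test helper: passes bytes through, then simulates a network error. */
class FailingInputStream extends FilterInputStream {
    private long remaining; // bytes still allowed through before the injected failure

    FailingInputStream(InputStream in, long failAfterBytes) {
        super(in);
        this.remaining = failAfterBytes;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            throw new IOException("injected network connection problem");
        }
        int b = super.read();
        if (b >= 0) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (remaining <= 0) {
            throw new IOException("injected network connection problem");
        }
        // Never hand out more bytes than the failure budget allows.
        int n = super.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }

    /** Reads up to want bytes one at a time; returns how many arrived before the failure. */
    static int bytesReadBeforeFailure(byte[] payload, long failAfterBytes, int want) {
        InputStream in = new FailingInputStream(new ByteArrayInputStream(payload), failAfterBytes);
        int count = 0;
        try {
            while (count < want && in.read() >= 0) {
                count++;
            }
        } catch (IOException injected) {
            // The simulated network error surfaces here, in the middle of the response.
        }
        return count;
    }
}
```

Choosing `failAfterBytes` so that the failure lands inside a response, rather than before the request is sent, is what makes the ordering externally controllable.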

2.2.2 Events order externally controllable?

Yes.

2.3 Can the logs tell how to reproduce the failure?

Not really. The biggest challenge is that the “network connection exception” is not logged by default.

2.4 How many machines needed?

One (client, NN, and DN on the same machine).

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

Both the client and the NN log the following exception:

2013-07-25 16:37:05,159 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 54310, call addBlock(/400MB.txt, DFSClient_607772976) from 128.100.23.4:43024: error: org.apache.hadoop.dfs.NotReplicatedYetException: Not replicated yet:/400MB.txt
org.apache.hadoop.dfs.NotReplicatedYetException: Not replicated yet:/400MB.txt
        at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1109)
        at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:330)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodA...

It is thrown in the NN’s code, here:

  /**
   * The client would like to obtain an additional block for the indicated
   * filename (which is being written-to).  Return an array that consists
   * of the block, plus a set of machines.  The first on this list should
   * be where the client writes data.  Subsequent items in the list must
   * be provided in the connection to the first datanode.
   *
   * Make sure the previous blocks have been reported by datanodes and
   * are replicated.  Will return an empty 2-elt array if we want the
   * client to "try again later".
   */
  public LocatedBlock getAdditionalBlock(String src,
                                         String clientName
                                         ) throws IOException {
      // ... (elided)
      //
      // If we fail this, bad things happen!
      //
      if (!checkFileProgress(pendingFile, false)) {
        throw new NotReplicatedYetException("Not replicated yet:" + src);
      }
      // ... (elided)
  }

  synchronized boolean checkFileProgress(INodeFile v, boolean checkall) {
    if (checkall) {
      //
      // check all blocks of the file.
      //
      for (Block block: v.getBlocks()) {
        if (blocksMap.numNodes(block) < this.minReplication) {
          return false;
        }
      }
    }
    // ... (elided)
  }

So we can infer that a previously written block has not yet been reported as replicated. But why? Look into the DN’s log.
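
To make this inference concrete, here is a toy model of the check. It is not the Hadoop implementation. It assumes the NN counts a replica only once a DN’s report for that block has arrived. `ReplicationCheckSketch`, `reportReplica`, `allBlocksReplicated`, and the string block IDs are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the NN-side check: a block counts only once DNs have reported it. */
class ReplicationCheckSketch {
    // Assumption: block ID -> number of replicas that DNs have reported so far.
    private final Map<String, Integer> reportedReplicas = new HashMap<>();
    private final int minReplication;

    ReplicationCheckSketch(int minReplication) {
        this.minReplication = minReplication;
    }

    /** Called when a DN's report for a block reaches the NN. */
    void reportReplica(String blockId) {
        reportedReplicas.merge(blockId, 1, Integer::sum);
    }

    /** True only if every listed block has at least minReplication reported replicas. */
    boolean allBlocksReplicated(List<String> blockIds) {
        for (String id : blockIds) {
            if (reportedReplicas.getOrDefault(id, 0) < minReplication) {
                return false;
            }
        }
        return true;
    }
}
```

In this model, if the DN’s report is lost and never retried, `allBlocksReplicated` keeps returning false no matter how long the client waits, which matches the repeated NotReplicatedYetException seen above.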

There is not much at the default verbosity. But a careful look shows that the normal heartbeat message:

2013-07-25 16:36:43,370 INFO org.apache.hadoop.dfs.DataNode: DatanodeRegistration(128.100.23.4:50010, storageID=DS-2123293469-127.0.1.1-50010-1374784603321, infoPort=50075, ipcPort=50020)In DataNode.run,

disappears at some point (because of the network error).
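
Spotting a message that stops appearing is easy to miss by eye. A small scanner can flag the silence instead. This is a sketch only: it assumes every log line starts with a timestamp in the format shown above, and `HeartbeatGapFinder`, `findGaps`, the marker string, and the gap threshold are hypothetical choices, not Hadoop tooling.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds where a recurring log message goes silent. */
class HeartbeatGapFinder {
    // Matches the leading timestamp of lines such as "2013-07-25 16:36:43,370 INFO ...".
    private static final DateTimeFormatter TS =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");
    private static final int TS_LENGTH = 23;

    /**
     * Returns the timestamp of the last matching line before each silence longer
     * than maxGap. logEnd marks the end of the observed window, so a message that
     * never comes back is reported too.
     */
    static List<LocalDateTime> findGaps(List<String> lines, String marker,
                                        Duration maxGap, LocalDateTime logEnd) {
        List<LocalDateTime> gaps = new ArrayList<>();
        LocalDateTime previous = null;
        for (String line : lines) {
            if (line.length() < TS_LENGTH || !line.contains(marker)) {
                continue;
            }
            LocalDateTime current;
            try {
                current = LocalDateTime.parse(line.substring(0, TS_LENGTH), TS);
            } catch (DateTimeParseException notALogLine) {
                continue; // e.g. a stack-trace line that happens to contain the marker
            }
            if (previous != null && Duration.between(previous, current).compareTo(maxGap) > 0) {
                gaps.add(previous);
            }
            previous = current;
        }
        if (previous != null && Duration.between(previous, logEnd).compareTo(maxGap) > 0) {
            gaps.add(previous);
        }
        return gaps;
    }
}
```

Calling `findGaps` with the marker "In DataNode.run" and a threshold well above the normal message period returns the time of the last heartbeat message before the DN went quiet.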

3.2 Backward inference

Without DEBUG-level logs, not much more can be inferred.

4. Root cause

A network error results in an RPC call object, sent from the DN to the NN, being permanently lost.

See the figure at the beginning.

4.1 Category:

Incorrect error handling.

5. Fix

5.1 How?

Remove the “call” object from the “calls” queue only after the response has been received.
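
The difference between the two orderings can be sketched as follows. This is an illustrative model, not the actual patch. It assumes that, when a connection fails, cleanup marks every call still in the table as failed so that waiting callers wake up. `PendingCallsSketch`, `ResponseReader`, `receiveRemoveFirst`, `receiveRemoveLast`, and `closeConnection` are hypothetical names.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative model of an RPC client's table of outstanding calls. */
class PendingCallsSketch {
    /** One outstanding RPC; its caller waits until done becomes true. */
    static final class Call {
        final int id;
        volatile boolean done;
        volatile String value;
        volatile IOException error;
        Call(int id) { this.id = id; }
    }

    /** Stand-in for reading a response body off the network. */
    interface ResponseReader {
        String read() throws IOException;
    }

    final Map<Integer, Call> calls = new ConcurrentHashMap<>();

    Call send(int id) {
        Call call = new Call(id);
        calls.put(id, call);
        return call;
    }

    /** Problematic ordering: the call leaves the table before its response is read. */
    void receiveRemoveFirst(int id, ResponseReader reader) {
        Call call = calls.remove(id);
        if (call == null) {
            return;
        }
        try {
            call.value = reader.read();
            call.done = true;
        } catch (IOException e) {
            closeConnection(e); // the call is already gone from the table, so cleanup misses it
        }
    }

    /** Fixed ordering: look the call up, read the response, and only then remove it. */
    void receiveRemoveLast(int id, ResponseReader reader) {
        Call call = calls.get(id);
        if (call == null) {
            return;
        }
        try {
            call.value = reader.read();
            call.done = true;
            calls.remove(id);
        } catch (IOException e) {
            closeConnection(e); // the call is still in the table, so cleanup completes it
        }
    }

    /** Assumed cleanup: fail every call still in the table so that no caller waits forever. */
    void closeConnection(IOException cause) {
        for (Call call : calls.values()) {
            call.error = cause;
            call.done = true;
        }
        calls.clear();
    }
}
```

With `receiveRemoveFirst`, a read failure leaves the call with `done == false` forever, which is the hang described in the symptom. With `receiveRemoveLast`, the same failure ends with `done == true` and `error` set, so the caller can report or retry.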

Acknowledgement:

This bug was initially reproduced and studied by Dongcai Shen. Our group subsequently reproduced it independently and provided additional analysis and findings.