HDFS-5465-5479 Report

There are two JIRAs fixing two bugs; this failure requires both bugs to occur at the same time.

1. Symptom

Some blocks remain permanently under-replicated.

1.1 Severity

Blocker

1.2 Was there exception thrown?

NO

1.3 Multiple exceptions?

No

2. How to reproduce this failure

This is a concurrency error, so it is very hard to reproduce. I pseudo-reproduced it by hardcoding “xmitsInProgress” to 1 (in a real failure, this value would be left at 1 by the data race).

2.0 Version

0.18.3

2.1 Configuration

- dfs.replication = 3.

- 1NN + 3DN.

2.2 Reproduction procedure

- Hardcode “xmitsInProgress” to 1 to simulate the effect of the data race (see the sketch after this list).

- Start NN + 1st DN

- Upload a large (1 GB) file.

- Start the other 2 DNs.

- Use fsck -files -locations -blocks to observe under-replicated blocks.
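Below is a minimal sketch of the kind of hardcode used in the first step. The offerService() context and the exact sendHeartbeat argument list in 0.18.3 are abbreviated and approximate here (treat the argument order and return type as assumptions); the only point is to report a constant 1 instead of the live counter.

  // In DataNode.offerService() (sketch; argument list abbreviated and approximate):
  // report a constant 1 instead of the real counter, simulating the stale value
  // that the data race would otherwise leave behind.
  DatanodeCommand cmd = namenode.sendHeartbeat(dnRegistration,
      data.getCapacity(), data.getDfsUsed(), data.getRemaining(),
      1 /* hardcoded xmitsInProgress */, getXceiverCount());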

2.2.1 Timing order

The two DataTransfer threads have to race so that xmitsInProgress is left at 1 when it should be 0 (no ongoing transmission). Then upload a file to trigger the replication process.
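A minimal, self-contained sketch (plain Java, not HDFS code) of the assumed interleaving: two DataTransfer-style threads update a plain int counter without synchronization, so an update can be lost and the counter can end at 1 even though no transfer is in progress.

  public class LostUpdateSketch {
    static int xmitsInProgress = 0;           // plain int, as in the buggy DataNode

    public static void main(String[] args) throws InterruptedException {
      Runnable transfer = new Runnable() {
        public void run() {
          xmitsInProgress++;                  // read-modify-write, not atomic
          // ... pretend to stream a block replica to another DN ...
          xmitsInProgress--;                  // can interleave badly with the other thread
        }
      };
      Thread t1 = new Thread(transfer);
      Thread t2 = new Thread(transfer);
      t1.start(); t2.start();
      t1.join();  t2.join();
      // Expected 0; under an unlucky interleaving the counter can be left at 1.
      System.out.println("xmitsInProgress = " + xmitsInProgress);
    }
  }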

2.2.2 Order externally controllable?

No. There is a race condition which is not controlled by external events.

2.3 Can the logs tell how to reproduce the failure?

No. The data race is very hard to reproduce.

2.4 How many machines needed?

3 machines (4 components: 1 NN + 3 DNs).

3. Root cause

Two bugs:

A. DN’s concurrency bug (data race on xmitsInProgress)

B. NN’s semantic bug.

The first bug: in the DN, “xmitsInProgress” is a variable shared between multiple DataTransfer threads. Its intended meaning is the number of ongoing data transmission (replication) processes on this DN (as the source); HDFS uses it to limit the number of such concurrent replications (at most 2).

Because the DataTransfer threads increment and decrement this counter without synchronization, an update can be lost. As a result, xmitsInProgress can be left with the value “1” even though no replication is actually in progress.

This xmitsInProgress value is then passed to the NN as a parameter of “sendHeartbeat”.

Once the NN receives the sendHeartbeat RPC from the DN, it examines whether there are any jobs for this DN to do. With correct behavior, the NN should ask the DN to replicate the under-replicated block.

But with the incorrect “xmitsInProgress = 1”, the NN eventually uses it in “getReplicationCommand(maxReplicationStreams - xmitsInProgress)”. maxReplicationStreams is set to 2, meaning there are at most 2 concurrent replications per DN, so maxReplicationStreams - xmitsInProgress = 1 in this case (this value becomes maxTransfers in the body of getReplicationCommand).
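A minimal sketch, assuming the values described above, of the transfer budget the NN computes (not the actual NameNode code):

  int maxReplicationStreams = 2;   // limit on concurrent outbound replications per DN
  int xmitsInProgress = 1;         // stale counter reported by the buggy DN heartbeat
  int maxTransfers = maxReplicationStreams - xmitsInProgress;   // = 1, passed to poll() as numTargets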

This “1” is further passed down to “poll” (numTargets = 1). Here is the 2nd bug: this parameter means the number of replication streams that can occur on this DN; however, in “poll” it is mistakenly interpreted as “the number of remaining targets that still need to be sent”. See the statement:

numTargets -= blockq.peek().targets.length;

In this case, blockq.peek().targets.length = 2, because we need 2 more replicas for this block. Therefore, numTargets becomes -1 after the above statement, with the result that

results.add(blockq.poll());

is never executed. This is why the NN never asks the DN to replicate the data.
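To make this concrete, here is a self-contained sketch of the faulty accounting (BlockTargetPair here is a simplified stand-in, not the real HDFS inner class): with numTargets = 1 and a single queued block that needs 2 targets, the loop exits without polling anything, so no replication command is produced.

  import java.util.ArrayList;
  import java.util.LinkedList;
  import java.util.List;
  import java.util.Queue;

  public class PollBugSketch {
    // Simplified stand-in for the real BlockTargetPair inner class.
    static class BlockTargetPair {
      final String[] targets;
      BlockTargetPair(String... targets) { this.targets = targets; }
    }

    // Mirrors the buggy poll() accounting described above.
    static List<BlockTargetPair> buggyPoll(Queue<BlockTargetPair> blockq, int numTargets) {
      List<BlockTargetPair> results = new ArrayList<BlockTargetPair>();
      for (; !blockq.isEmpty() && numTargets > 0; ) {
        numTargets -= blockq.peek().targets.length;   // 1 - 2 = -1
        if (numTargets >= 0) {
          results.add(blockq.poll());                 // never reached in this scenario
        }
      }
      return results;
    }

    public static void main(String[] args) {
      Queue<BlockTargetPair> blockq = new LinkedList<BlockTargetPair>();
      blockq.add(new BlockTargetPair("dn2", "dn3"));  // block still needs 2 more replicas
      int maxTransfers = 2 - 1;                       // maxReplicationStreams - stale xmitsInProgress
      List<BlockTargetPair> cmds = buggyPoll(blockq, maxTransfers);
      System.out.println("blocks handed out: " + cmds.size());   // prints 0
    }
  }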

3.1 Category:

Semantic + concurrency

4. Fix

4.1 How?

1. Change “xmitsInProgress” to an AtomicInteger (fixing the data race):

https://issues.apache.org/jira/secure/attachment/12402100/xmitsSync2.patch
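The idea behind this fix, in a minimal sketch (the actual patch is at the URL above; this is not its exact code):

  import java.util.concurrent.atomic.AtomicInteger;

  class XmitsCounterSketch {
    final AtomicInteger xmitsInProgress = new AtomicInteger(0);

    void runTransfer(Runnable sendBlock) {
      xmitsInProgress.incrementAndGet();     // atomic ++, no lost updates
      try {
        sendBlock.run();                     // stream the replica to the target DN
      } finally {
        xmitsInProgress.decrementAndGet();   // atomic --, always paired with the increment
      }
    }
  }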

2. Change the semantics in poll:

-      else {
-        List<BlockTargetPair> results = new ArrayList<BlockTargetPair>();
-        for(; !blockq.isEmpty() && numTargets > 0; ) {
-          numTargets -= blockq.peek().targets.length;
-          if (numTargets >= 0) {
-            results.add(blockq.poll());
-          }
-        }
-        return results;
+
+      List<BlockTargetPair> results = new ArrayList<BlockTargetPair>();
+      for(; !blockq.isEmpty() && numBlocks > 0; numBlocks--) {
+        results.add(blockq.poll());
      }
+      return results;
    }

 

Acknowledgement:

This bug was initially reproduced and studied by Dongcai Shen. Our group reproduced it independently thereafter and provided additional analysis and findings.