HDFS-15 Report (relates to HDFS-1480)

0.  Background

Rack awareness:

http://archive.cloudera.com/cdh/3/hadoop/hdfs_user_guide.html#Rack+Awareness

1. Symptom

HDFS's replica placement policy guarantees that the replicas of a block are spread over at least two racks when its replication factor is greater than one. However, after some datanodes are decommissioned, fsck reports that all replicas of some blocks end up on a single rack.

research@research-virtual-machine:~/hadoop-0.20.2/conf$ hadoop fsck / -files -blocks -racks

/movie.rmvb 1399166007 bytes, 21 block(s):  Replica placement policy is violated for blk_3393386534003038363_1001. Block should be additionally replicated on 1 more rack(s).

 Replica placement policy is violated for blk_-5167404889573090398_1001. Block should be additionally replicated on 1 more rack(s).

 Replica placement policy is violated for blk_8403198505648914824_1001. Block should be additionally replicated on 1 more rack(s).

 Replica placement policy is violated for blk_8313130570223714082_1001. Block should be additionally replicated on 1 more rack(s).

 Replica placement policy is violated for blk_-4430021563224916640_1001. Block should be additionally replicated on 1 more rack(s).

 Replica placement policy is violated for blk_-1397182476854856790_1001. Block should be additionally replicated on 1 more rack(s).

0. blk_-4269969750606103620_1001 len=67108864 repl=2 [/rack2/192.168.59.163:50010, /rack1/192.168.59.164:50010]

1. blk_-7347010677420903618_1001 len=67108864 repl=2 [/rack2/192.168.59.162:50010, /rack1/192.168.59.164:50010]

2. blk_3536165086332283966_1001 len=67108864 repl=2 [/rack2/192.168.59.163:50010, /rack1/192.168.59.164:50010]

3. blk_3393386534003038363_1001 len=67108864 repl=2 [/rack1/192.168.59.164:50010, /rack1/192.168.59.160:50010]

4. blk_3933164045607256528_1001 len=67108864 repl=2 [/rack2/192.168.59.162:50010, /rack1/192.168.59.164:50010]

5. blk_-5167404889573090398_1001 len=67108864 repl=2 [/rack1/192.168.59.160:50010, /rack1/192.168.59.164:50010]

6. blk_8403198505648914824_1001 len=67108864 repl=2 [/rack1/192.168.59.159:50010, /rack1/192.168.59.164:50010]

7. blk_-2622481880735692995_1001 len=67108864 repl=2 [/rack2/192.168.59.163:50010, /rack1/192.168.59.164:50010]

8. blk_8313130570223714082_1001 len=67108864 repl=2 [/rack1/192.168.59.159:50010, /rack1/192.168.59.164:50010]

9. blk_-4430021563224916640_1001 len=67108864 repl=2 [/rack1/192.168.59.159:50010, /rack1/192.168.59.164:50010]

10. blk_-4330506324564558625_1001 len=67108864 repl=2 [/rack2/192.168.59.162:50010, /rack1/192.168.59.164:50010]

11. blk_9081262152634020753_1001 len=67108864 repl=2 [/rack2/192.168.59.162:50010, /rack1/192.168.59.164:50010]

12. blk_-2412332177670527575_1001 len=67108864 repl=2 [/rack2/192.168.59.163:50010, /rack1/192.168.59.164:50010]

13. blk_-1397182476854856790_1001 len=67108864 repl=2 [/rack1/192.168.59.160:50010, /rack1/192.168.59.164:50010]

14. blk_-4956387649422932613_1001 len=67108864 repl=2 [/rack2/192.168.59.163:50010, /rack1/192.168.59.164:50010]

15. blk_-6406790172218170147_1001 len=67108864 repl=2 [/rack2/192.168.59.163:50010, /rack1/192.168.59.164:50010]

16. blk_8173994191488328703_1001 len=67108864 repl=2 [/rack2/192.168.59.163:50010, /rack1/192.168.59.164:50010]

17. blk_7064087614226424057_1001 len=67108864 repl=2 [/rack2/192.168.59.162:50010, /rack1/192.168.59.164:50010]

18. blk_6472904931386172792_1001 len=67108864 repl=2 [/rack2/192.168.59.163:50010, /rack1/192.168.59.164:50010]

19. blk_-2787784504894711317_1001 len=67108864 repl=2 [/rack2/192.168.59.163:50010, /rack1/192.168.59.164:50010]

20. blk_-9140572403532982815_1001 len=56988727 repl=2 [/rack2/192.168.59.163:50010, /rack1/192.168.59.164:50010]

 

Status: HEALTHY

 Total size:               1399166007 B

 Total dirs:               0

 Total files:              1

 Total blocks (validated):                   21 (avg. block size 66626952 B)

 Minimally replicated blocks:           21 (100.0 %)

 Over-replicated blocks:                    0 (0.0 %)

 Under-replicated blocks:                 0 (0.0 %)

 Mis-replicated blocks:                       6 (28.571428 %)

 Default replication factor:               2

 Average block replication:               2.0

 Corrupt blocks:                                    0

 Missing replicas:                                  0 (0.0 %)

 Number of data-nodes:                                   5

 Number of racks:                                2

 

 

The filesystem under path '/' is HEALTHY
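
For reference, the per-block rule behind the "Replica placement policy is violated" messages and the "Mis-replicated blocks" count above is roughly the following. This is my own sketch of the invariant in Java, not fsck's actual code:

// Sketch of the invariant fsck reports on (not fsck's code): a block with
// more than one replica should have its replicas on at least two distinct
// racks, provided the cluster itself spans more than one rack.
static boolean placementPolicySatisfied(int replicaCount,
                                        int distinctRacksHoldingReplicas,
                                        int racksInCluster) {
    if (replicaCount <= 1 || racksInCluster <= 1) {
        return true;   // nothing to enforce with a single replica or a single rack
    }
    return distinctRacksHoldingReplicas >= 2;
}

The six flagged blocks above fail this check: both of their replicas are on /rack1.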

1.1 Severity

Critical

1.2 Was there exception thrown?

No

2. How to reproduce this failure

2.0 Version

I used version 0.20.2.

2.1 Environment setting and configuration

Using more datanodes increases the likelihood of seeing the symptom, so I used one master and five slaves (one namenode and six datanodes) running on VMs.

Namenode conf directory:

Beyond the basic configuration, I added the following lines to core-site.xml to enable rack awareness, which ensures that replicas of a block are placed on multiple racks.

<property>

<name>topology.script.file.name</name>

<value>/home/research/hadoop-0.20.2/conf/rack-script.sh</value>

</property>

The rack-script.sh file is invoked by the namenode with a datanode's IP address and prints the rack that datanode belongs to. I placed the datanodes on two racks, rack1 and rack2.

The content of /home/research/hadoop-0.20.2/conf/rack-script.sh:

#!/bin/bash

 

if [ $1 = "192.168.59.165" ]; then

        echo -n "/rack1 "

elif [ $1 = "192.168.59.166" ]; then

        echo -n "/rack1 "

elif [ $1 = "192.168.59.170" ]; then

        echo -n "/rack1 "

elif [ $1 = "192.168.59.168" ]; then

        echo -n "/rack2 "

elif [ $1 = "192.168.59.169" ]; then

        echo -n "/rack2 "

elif [ $1 = "192.168.59.167" ]; then

        echo -n "/rack2 "

else

        echo -n "/default-rack "

fi

 

Additionally, I added the following lines to hdfs-site.xml so that I could decommission datanodes later on.

<property>

        <name>dfs.hosts.exclude</name>

    <value>/home/research/hadoop-0.20.2/conf/nodeDecommission</value>

        <final>true</final>

</property>

 

Before starting HDFS, I left the nodeDecommission file empty.

I set the default replication factor to 2 (dfs.replication in hdfs-site.xml).

2.2 Reproduction procedure

1. Start hdfs: start-dfs.sh

2. Copy a file to hdfs: hadoop dfs -copyFromLocal file /

3. Decommission one datanode:

1) Add the datanode's IP address (192.168.59.168) to the nodeDecommission file and save it

2) Update the namenode with the new set of permitted datanodes: hadoop dfsadmin -refreshNodes

3) Check whether the datanode has been decommissioned: hadoop dfsadmin -report, or check the web UI

4. Check the block states and you will see the symptom: hadoop fsck / -files -blocks -racks

2.3 Can the logs tell how to reproduce the failure?

2013-07-14 14:12:40,823 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.59.170:50010 storage DS-1748349471-127.0.1.1-50010-1373825291202

2013-07-14 14:12:40,830 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /rack1/192.168.59.170:50010

2013-07-14 14:12:40,876 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.59.166:50010 storage DS-1414684751-127.0.1.1-50010-1373825282705

2013-07-14 14:12:40,880 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /rack1/192.168.59.166:50010

2013-07-14 14:12:40,950 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.59.168:50010 storage DS-1117822516-127.0.1.1-50010-1373825235131

2013-07-14 14:12:40,955 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /rack2/192.168.59.168:50010

2013-07-14 14:12:40,972 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.59.167:50010 storage DS-829542580-127.0.1.1-50010-1373825250975

2013-07-14 14:12:40,977 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /rack2/192.168.59.167:50010

2013-07-14 14:12:57,439 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.59.169:50010 storage DS-1898478619-127.0.1.1-50010-1373825249971

2013-07-14 14:12:57,444 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /rack2/192.168.59.169:50010

2013-07-14 14:13:10,627 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.59.165:50010 storage DS-2036453321-127.0.1.1-50010-1373825351309

2013-07-14 14:13:10,633 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /rack1/192.168.59.165:50010

2013-07-14 14:14:08,862 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions: 2 Total time for transactions(ms): 1Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0

...

2013-07-14 14:19:06,285 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=null                 ip=null                    cmd=open              src=/movie.rmvb                 dst=null                    perm=null

2013-07-14 14:19:16,644 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=null                 ip=null                    cmd=open              src=/movie.rmvb                 dst=null                    perm=null

2013-07-14 14:22:36,590 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Start Decommissioning node 192.168.59.168:50010

2013-07-14 14:22:37,322 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.59.168:50010 to replicate blk_-5900818750483032152_1001 to datanode(s) 192.168.59.167:50010

2013-07-14 14:22:37,322 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.59.168:50010 to replicate blk_-5255163811496748544_1001 to datanode(s) 192.168.59.169:50010

2013-07-14 14:22:37,322 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.59.165:50010 to replicate blk_-4728954200112444496_1001 to datanode(s) 192.168.59.166:50010

2013-07-14 14:22:37,323 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.59.165:50010 to replicate blk_-1775408102023223072_1001 to datanode(s) 192.168.59.166:50010

2013-07-14 14:22:40,323 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.59.168:50010 to replicate blk_106175886183835827_1001 to datanode(s) 192.168.59.169:50010

2013-07-14 14:22:40,324 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.59.168:50010 to replicate blk_159981631687714598_1001 to datanode(s) 192.168.59.167:50010

2013-07-14 14:22:40,324 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.59.165:50010 to replicate blk_2194819916468615786_1001 to datanode(s) 192.168.59.166:50010

2013-07-14 14:22:40,324 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 192.168.59.165:50010 to replicate blk_8972449473978345561_1001 to datanode(s) 192.168.59.166:50010

2013-07-14 14:23:06,434 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 192.168.59.169:50010 is added to blk_-5255163811496748544_1001 size 67108864

2013-07-14 14:23:11,982 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 192.168.59.167:50010 is added to blk_-5900818750483032152_1001 size 67108864

2013-07-14 14:23:25,652 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 192.168.59.166:50010 is added to blk_-1775408102023223072_1001 size 67108864

2013-07-14 14:23:28,373 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 192.168.59.166:50010 is added to blk_-4728954200112444496_1001 size 67108864

2013-07-14 14:23:35,329 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 192.168.59.169:50010 is added to blk_106175886183835827_1001 size 67108864

2013-07-14 14:23:45,671 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 192.168.59.167:50010 is added to blk_159981631687714598_1001 size 67108864

2013-07-14 14:23:57,368 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 192.168.59.166:50010 is added to blk_8972449473978345561_1001 size 67108864

2013-07-14 14:23:58,646 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 192.168.59.166:50010 is added to blk_2194819916468615786_1001 size 67108864

2013-07-14 14:24:26,103 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Decommission complete for node 192.168.59.168:50010

  .. .. ..

 

1. The log tells us which rack each datanode is on:

Rack1- 192.168.59.165, 192.168.59.166, 192.168.59.170

Rack2- 192.168.59.168, 192.168.59.169, 192.168.59.167

2. The log shows that HDFS starts decommissioning node 192.168.59.168:50010. The 192.168.59.168:50010 datanode is then asked to replicate its blocks to other datanodes, including 192.168.59.167:50010 and 192.168.59.169:50010, which are on the same rack (rack2) as 192.168.59.168:50010. Because the replication factor is 2 and the replicas of a block should exist on two racks, the blocks stored on datanode 192.168.59.168:50010 should have been replicated to rack1 (one of 192.168.59.165, 192.168.59.166, 192.168.59.170) whenever their remaining replica is also on rack2.

2.4 How many machines are needed?

At least 3 machines.

3. Diagnosis procedure

3.1 Detailed symptom

Same as described in 2.3.

3.2 Backward inference

2013-07-14 14:22:36,590 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Start Decommissioning node 192.168.59.168:50010

From the log we can see that blocks are replicated to the wrong rack after a datanode is decommissioned, so the bug should be in the code that handles block replication after datanodes are decommissioned.

In fact, the root cause is simply that the developers did NOT implement rack-aware re-replication for the case where datanodes are decommissioned.

4. Root cause

The cause of the problem is that decommission and corrupt-replica handling only check a block's replica count, not the rack requirement. When an over-replicated block loses a replica due to decommissioning, the namenode does not take any action to guarantee that the remaining replicas are on different racks. In other words, HDFS simply did not implement the rack-awareness requirement in these code paths.

The patch simply adds the implementation of this feature.

4.1 Category:

Semantic

5. Fix

5.1 How?

The patch against branch-20 makes substantial changes to FSNamesystem.java and smaller changes to UnderReplicatedBlocks.java. Based on my understanding of the patch, it mainly makes two changes:

1. underReplicatedBlocks queue: adds one more priority level so that blocks that do not satisfy the HDFS rack requirement are still queued for replication.

2. boolean blockNeedsReplication() function: tells whether a given block needs further replication, either because it needs more replicas in total or because its replicas need to be spread across more racks.
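
As a rough illustration of the second change, the decision the patched namenode has to make looks something like the sketch below. This is my own sketch of the idea, not code copied from the patch; the method name and signature are assumptions:

// Sketch of the idea behind blockNeedsReplication() (name/signature assumed).
// Before the fix, only the first condition (replica count) was ever checked.
boolean blockNeedsReplication(int liveReplicas, int expectedReplicas,
                              int racksHoldingReplicas, int racksInCluster) {
    if (liveReplicas < expectedReplicas) {
        return true;                 // under-replicated in the usual sense
    }
    // Enough replicas, but they may all sit on one rack; that also needs work.
    return racksInCluster > 1 && expectedReplicas > 1 && racksHoldingReplicas < 2;
}

Blocks that fail only the rack condition would presumably be queued in underReplicatedBlocks under the newly added priority so that they eventually get a replica on a second rack.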