[hadoop] MAPREDUCE-3728 Report

1. Symptom

MapReduce job fails at the reduce phase with “DiskErrorException” on a secure cluster.

1.1 Severity

Critical

1.2 Was there exception thrown?

Yes. DiskErrorException

1.2.1 Were there multiple exceptions?

No.

1.3 Scope of the failure?

Any job submitted by a user whose username differs from the one that started the NodeManager (NM).

2. How to reproduce this failure

2.0 Version

0.23.0

2.1 Configuration

Security enabled. See HDFS-3083 for how to enable security.

Also, this failure requires the “container-executor”, which in turn requires compiling Hadoop locally with “-Pnative”.

2.2 Reproduction procedure

The key to reproducing this failure is to submit an MR job as a user/group different from the user that started the cluster.

Here, assume yarn is the user that starts the cluster (the nodemanager); it belongs to the group hadoop. A different user, ding, submits the job.

1. yarn$ start-yarn.sh

nodemanager log:

2013-07-30 11:01:44,600 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: Security is enabled on NodeManager. Creating ContainerTokenSecretManager

2. ding$ hadoop jar example.jar wordcount...

2012-01-19 08:35:32,544 ERROR org.apache.hadoop.mapred.ShuffleHandler: Shuffle error
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find usercache/ding/appcache/application_1326928483038_0001/output/attempt_1326928483038_0001_m_000003_0/file.out.index in any of the configured local directories

2.2.1 Timing order

Follow the steps in the order listed above.

2.2.2 Events order externally controllable?

Yes.

2.3 Can the logs tell how to reproduce the failure?

Yes.

2.4 How many machines needed?

2 nodes: client + nodemanager.

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

2012-01-19 08:35:32,544 ERROR org.apache.hadoop.mapred.ShuffleHandler: Shuffle error
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find usercache/ding/appcache/application_1326928483038_0001/output/attempt_1326928483038_0001_m_000003_0/file.out.index in any of the configured local directories

With only the log, this is hard to debug: you need to be able to examine the local file system.

The user provided this:

$ ls -l usercache/ding/appcache/application_1327102703969_0001/output/attempt_1327102703969_0001_m_000001_0
-rw-r----- 1 ding ding 28 Jan 20 15:41 file.out
-rw-r----- 1 ding ding 32 Jan 20 15:41 file.out.index

 --- So the file is there!

3.2 Backward inference

Another critical piece of evidence is the fact that the file cannot be found because of permissions: the file is owned by ding:ding, but the nodemanager is running as yarn:hadoop. This is subtle, and cannot be inferred from the logs.
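As a sanity check, the permission arithmetic can be worked through directly (a hypothetical sketch using Python's stat constants, not Hadoop code):

```python
import stat

# The ls output above shows file.out.index with mode rw-r----- (0o640) and
# ownership ding:ding. The NodeManager runs as yarn:hadoop, so it matches
# neither the owner ("user" bits) nor the group ("group" bits); only the
# "other" bits apply -- and they grant nothing.
mode = stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP   # rw-r----- == 0o640
other_bits = mode & (stat.S_IROTH | stat.S_IWOTH | stat.S_IXOTH)
print(oct(other_bits))   # 0o0 -- yarn:hadoop has no read access at all
```

So even though the file exists on disk, the shuffle handler’s lookup fails as if it were absent.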

If we can infer this, a developer can realize the bug: the files “file.out” and “file.out.index” are indeed created as ding:ding, and are later supposed to be read by “yarn:hadoop”. But by design the containing directory:

usercache/ding/appcache/application_1327102703969_0001/output/

which is created by the nodemanager and is thus owned yarn:hadoop, should have the setgid bit set --- so that the files created inside it inherit the group and end up as “ding:hadoop” instead of “ding:ding”. This is the root cause.

However, none of this is visible in the log.

4. Root cause

Explicitly creating the directory “usercache/ding/appcache/application_1327102703969_0001/output/” clears the setgid bit, so files created inside it do not inherit the hadoop group.
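The root cause can be sketched with plain POSIX calls (a minimal illustration under Linux semantics, not Hadoop's actual FileContext code; the directory names are hypothetical):

```python
import os
import stat
import tempfile

base = tempfile.mkdtemp()
appdir = os.path.join(base, "application_0001")   # stand-in for the appcache dir
os.mkdir(appdir)

# NodeManager-style setup: setgid on the app directory, so entries created
# inside inherit the directory's group (hadoop) rather than the creator's.
os.chmod(appdir, 0o2770)

# The buggy code path amounted to explicitly creating "output" and applying
# a plain permission mask -- which leaves the setgid bit off, breaking the
# group-inheritance chain for the map output files written underneath.
outdir = os.path.join(appdir, "output")
os.mkdir(outdir)          # on Linux, mkdir under a setgid parent inherits the bit
os.chmod(outdir, 0o770)   # explicit plain mode: the setgid bit is cleared again
```

After this sequence, the parent still carries setgid but “output” does not, which is exactly the state described above.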

4.1 Category:

Semantic.

5. Fix

5.1 How?

Do not explicitly create the “output” directory; it is created on demand under the setgid appcache directory and thus inherits the correct group. The fix removes these lines:

-      // $x/usercache/$user/appcache/$appId/output
-      lfs.mkdir(new Path(appBase, OUTPUTDIR), null, false);