[hadoop]MAPREDUCE-3124 Report

1. Symptom

The MapReduce job fails with multiple exceptions, and its container is killed.

1.1 Severity

Blocker

1.2 Was there an exception thrown?

Yes.

1.2.1 Were there multiple exceptions?

Yes.

1st:  WARN [main] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2nd: Failure: “Container killed by the ApplicationMaster. Container killed on request. Exit code is 137 Too Many fetch failures.Failing the attempt”

 --- Note: the exact failure you see is platform-specific.
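
The exit code 137 in the second message follows the usual convention of 128 plus the signal number, so it corresponds to signal 9 (SIGKILL), which is consistent with a container being killed on request. A minimal sketch of that arithmetic; the function name signal_from_exit_code is hypothetical and not part of Hadoop:

```shell
# Hypothetical helper: map a process exit code to the signal that killed it.
# By convention, exit codes above 128 mean "terminated by signal (code - 128)".
signal_from_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    echo $((code - 128))
  else
    echo 0   # not terminated by a signal
  fi
}

signal_from_exit_code 137   # prints 9, i.e. SIGKILL
```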

2. How to reproduce this failure

2.0 Version

0.23.0 --- with the fix for this issue manually reverted.

2.1 Configuration

Standard configuration, plus debug logging. The properties below raise the log level of the map and reduce tasks to DEBUG:

mapred-site.xml:

  <property>
    <name>mapreduce.map.log.level</name>
    <value>DEBUG</value>
  </property>
  <property>
    <name>mapreduce.reduce.log.level</name>
    <value>DEBUG</value>
  </property>

2.2 Reproduction procedure

1. Generate the sort input:

hadoop jar hadoop-*-examples.jar randomwriter rand

 --- This generates 10 GB of random data in the directory rand (/user/ding/rand).

2. Sort it:

hadoop jar hadoop-mapreduce-examples-*.jar sort \
  -D mapreduce.job.acl-view-job=* \
  -D mapreduce.map.output.compress=true \
  -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
  -D mapreduce.output.fileoutputformat.compress=true \
  -D mapreduce.output.fileoutputformat.compression.type=NONE \
  -D mapreduce.output.fileoutputformat.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text \
  rand output

2.2.1 Timing order

Single event.

2.2.2 Events order externally controllable?

Yes.

2.3 Can the logs tell how to reproduce the failure?

Yes.

2.4 How many machines needed?

1 (the AM; single-component failure).

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The exceptions reported when the job finally fails (platform-specific; see 1.2.1).

3.2 Backward inference

Working backward from the failure, the log contains this warning:

2013-07-30 21:32:19,966 WARN [main] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
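
One way to locate this warning is a recursive search over the collected task logs. A minimal sketch, assuming the logs have been copied into a local directory; the function name find_native_warnings and the directory argument are hypothetical:

```shell
# Hypothetical helper: list the log files under a directory that contain the
# NativeCodeLoader warning, i.e. processes that could not load the native library.
find_native_warnings() {
  log_dir=$1
  # -r: recurse, -l: print matching file names only, -F: fixed-string match
  grep -rlF "Unable to load native-hadoop library" "$log_dir"
}

# Usage (the path is an assumption; point it at your collected logs):
# find_native_warnings /path/to/logs
```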

3.3 Are the printed logs sufficient for diagnosis?

Yes. The developers had little trouble diagnosing it.

3.4 Are logs misleading?

No.

3.5 Do we need to examine different components’ logs for diagnosis?

No.

3.6 Is it a multi-component failure?

No.

4. Root cause

The developers initially set the “LD_LIBRARY_PATH” environment variable to the wrong value, which is why the native Hadoop library could not be loaded (the warning in 3.2).

It can also be classified as a configuration error.
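
A quick way to check for this kind of misconfiguration is to see whether any directory on LD_LIBRARY_PATH actually contains the native library. A minimal sketch, assuming Linux, where the library file is named libhadoop.so; the function name find_libhadoop is hypothetical and not part of Hadoop:

```shell
# Hypothetical helper: print the first directory on a colon-separated search
# path (LD_LIBRARY_PATH by default) that contains libhadoop.so.
# Returns non-zero if no directory on the path contains it.
find_libhadoop() {
  search_path=${1:-${LD_LIBRARY_PATH:-}}
  old_ifs=$IFS
  IFS=:
  for dir in $search_path; do
    if [ -n "$dir" ] && [ -f "$dir/libhadoop.so" ]; then
      IFS=$old_ifs
      echo "$dir"
      return 0
    fi
  done
  IFS=$old_ifs
  return 1
}

# Report a clear error instead of failing later in a less obvious way.
if ! find_libhadoop; then
  echo "libhadoop.so not found on LD_LIBRARY_PATH" >&2
fi
```

Running a check like this in the environment where the containers are launched, before submitting the job, would surface a wrong LD_LIBRARY_PATH directly.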

4.1 Category:

Incorrect error handling (environment coverage). The configuration error was not handled correctly, which results in catastrophic symptoms in some specific environments.