[hadoop]MAPREDUCE-4817 Report

1. Symptom

Hardcoded task ping timeout kills tasks localizing large amounts of data

early termination

1.1 Severity

critical

1.2 Was there exception thrown?

No

1.2.1 Were there multiple exceptions?

No

1.3 Scope of the failure

Single request

Tasks need to localizing large amounts of data

2. How to reproduce this failure

2.0 Version

0.23.3, 2.0.3-alpha

2.1 Configuration

1 AM, 1 Task Attempt

2.2 Reproduction procedure

1. Start a MR task. Localization lasts for more than 5 min(feature start)

2. AM Ping Timeout (disconnect)

2.2.1 Timing order

Yes

2.2.2 Events order externally controllable?

Yes

2.3 Can the logs tell how to reproduce the failure?

Yes

2.4 How many machines needed?

2

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

AM kills the MR task when the localization process takes too long time.

3.2 Backward inference

The ping timeout is hardcoded. When it takes more than 5 mins to finish the localization process, AM will try to kill the task.

4. Root cause

The ping timeout is hardcoded to 5 minutes. When the AM pings time out the task, AM will try to kill the task.

4.1 Category:

Incorrect exception handling

5. Fix

5.1 How?

Remove the ping timeout check from the AM task heartbeat handler.