[hadoop] MAPREDUCE-4691 Report

1. Symptom

The history server (HS) reports “Unknown job” after the RM says the job has completed.

1.1 Severity

critical

1.2 Was there exception thrown?

Yes.

YarnRemoteExceptionPBImpl: Unknown Job

1.2.1 Were there multiple exceptions?

No

1.3 Scope of the failure

single request

2. How to reproduce this failure

2.0 Version

0.23.3, 2.0.1-alpha

2.1 Configuration

1 RM

1 HS

2.2 Reproduction procedure

2.2.1 Timing order

1. Two threads scan the same user's done_intermediate directory at the same time (feature start)

2. One thread wins the race and updates the timestamp in the data structure (file write)

2.2.2 Events order externally controllable?

No

2.3 Can the logs tell how to reproduce the failure?

No

2.4 How many machines needed?

2

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The HS reports “Unknown job” and fails to return the map task report.

3.2 Backward inference

The thread that lost the race sees the updated timestamp of the folder and concludes there is no point in doing a scan. It then looks up the job, does not find it, and returns the “Unknown job” exception.
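The interleaving described above can be replayed deterministically in a small sketch. This is a simplified model, not the actual history server code: the class ScanRaceSketch, the methods claimScan, finishScan and lookup, the user "alice" and the job id "job_1" are all hypothetical names chosen for illustration. It assumes the scan records the directory timestamp before the scan results become visible, which is the window the losing thread falls into.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the race; not the actual Hadoop code.
class ScanRaceSketch {
    // Last directory modification time recorded for each user.
    static final Map<String, Long> recordedModTime = new HashMap<>();
    // Jobs that a completed scan has made visible.
    static final Set<String> knownJobs = new HashSet<>();

    // First half of a scan: decide whether a scan is needed and,
    // if so, record the new timestamp right away.
    static boolean claimScan(String user, long dirModTime) {
        Long seen = recordedModTime.get(user);
        if (seen != null && seen >= dirModTime) {
            return false; // timestamp already recorded, so assume nothing is new
        }
        recordedModTime.put(user, dirModTime);
        return true;
    }

    // Second half of a scan: publish the job found in the directory.
    static void finishScan(String jobId) {
        knownJobs.add(jobId);
    }

    static String lookup(String jobId) {
        return knownJobs.contains(jobId) ? "found" : "Unknown job";
    }

    // Replays the bad interleaving in a single thread: the winner has
    // recorded the timestamp but has not yet finished its scan when
    // the loser checks the timestamp and looks up the job.
    static String replayLoserResult() {
        recordedModTime.clear();
        knownJobs.clear();
        boolean winnerScans = claimScan("alice", 100L); // winner updates timestamp
        boolean loserScans = claimScan("alice", 100L);  // loser sees it and skips
        if (loserScans) {
            finishScan("job_1");
        }
        String loserResult = lookup("job_1");           // job not visible yet
        if (winnerScans) {
            finishScan("job_1");                        // winner finishes too late
        }
        return loserResult;
    }

    public static void main(String[] args) {
        System.out.println(replayLoserResult());
    }
}
```

In this model the loser's lookup fails even though the same lookup succeeds once the winner's scan has finished, which matches the single-request scope of the failure.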

4. Root cause

4.1 Category:

concurrency

5. Fix

5.1 How?

Store the user's done_intermediate directory in a ConcurrentHashMap so that threads do not race on it.
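One way such a fix could look is sketched below. It is a minimal illustration under stated assumptions, not the actual patch: the class PerUserScanSketch, the nested UserDir type and the method names are hypothetical. The sketch assumes each user gets one entry in a ConcurrentHashMap and that the timestamp check, the scan and the timestamp update happen together under that entry's lock, so a second thread cannot observe the new timestamp before the scan results are visible.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch of a per-user entry guarded against the race.
class PerUserScanSketch {
    // Per-user state for one done_intermediate directory (assumed shape).
    static final class UserDir {
        private long recordedModTime = -1L;

        // Check, scan and timestamp update form one critical section.
        synchronized void scanIfNeeded(long dirModTime, String jobInDir,
                                       Set<String> jobs) {
            if (recordedModTime >= dirModTime) {
                return; // a completed scan already covered this timestamp
            }
            jobs.add(jobInDir);            // stands in for the directory scan
            recordedModTime = dirModTime;  // recorded only after the scan
        }
    }

    static final ConcurrentHashMap<String, UserDir> userDirs =
            new ConcurrentHashMap<>();
    static final Set<String> knownJobs = ConcurrentHashMap.newKeySet();

    static String lookupAfterScan(String user, long dirModTime, String jobId) {
        // computeIfAbsent ensures both threads share the same entry.
        UserDir dir = userDirs.computeIfAbsent(user, u -> new UserDir());
        dir.scanIfNeeded(dirModTime, jobId, knownJobs);
        return knownJobs.contains(jobId) ? "found" : "Unknown job";
    }

    // Runs two threads against the same user directory at the same time.
    static boolean bothThreadsFindJob() {
        userDirs.clear();
        knownJobs.clear();
        String[] results = new String[2];
        CountDownLatch start = new CountDownLatch(1);
        Thread[] threads = new Thread[2];
        for (int i = 0; i < 2; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                try {
                    start.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                results[id] = lookupAfterScan("alice", 100L, "job_1");
            });
            threads[i].start();
        }
        start.countDown();
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return "found".equals(results[0]) && "found".equals(results[1]);
    }

    public static void main(String[] args) {
        System.out.println(bothThreadsFindJob());
    }
}
```

The thread that arrives second blocks until the first thread's scan is complete, so its lookup runs against a populated job set rather than returning “Unknown job”.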