History server reports “Unknown job” after the RM says the job has completed.
YarnRemoteExceptionPBImpl: Unknown Job
1. Two threads scan the same user's done_intermediate directory at the same time (feature start)
2. One thread wins the race and updates the timestamp in the data structure (file write)
HS reports “Unknown job” and fails to get the map task report.
The thread that lost the race sees the updated timestamp of the folder, concludes there is no point in doing a scan, and returns the exception that the job was not found.
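The interleaving described above can be replayed deterministically on a single thread. The following is a minimal sketch, not the history server code: the class ScanRaceSketch, its methods, and the job id "job_1" are hypothetical names chosen for illustration, and it assumes the timestamp is recorded before the directory listing is published.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model of the race; not the actual history server code.
public class ScanRaceSketch {
    // Last directory modification time that some scan has recorded.
    private long recordedModTime = 0L;
    // Jobs that a completed scan has made visible to lookups.
    private final Set<String> knownJobs = new HashSet<>();

    // Start of a scan: compare and record the timestamp.
    // Returns true if the caller should go on to list the directory.
    boolean beginScan(long dirModTime) {
        if (dirModTime <= recordedModTime) {
            return false; // directory looks unchanged, so the scan is skipped
        }
        recordedModTime = dirModTime; // assumed: recorded BEFORE results are published
        return true;
    }

    // End of a scan: publish what was found in the directory.
    void finishScan(Set<String> jobsInDir) {
        knownJobs.addAll(jobsInDir);
    }

    boolean isKnown(String jobId) {
        return knownJobs.contains(jobId);
    }

    // Replays the problematic interleaving of two callers, A and B.
    static boolean loserFindsJob() {
        ScanRaceSketch dir = new ScanRaceSketch();
        long modTime = 100L;
        boolean winnerScans = dir.beginScan(modTime); // A wins and records the timestamp
        boolean loserScans = dir.beginScan(modTime);  // B sees the new timestamp and skips
        boolean found = dir.isKnown("job_1");         // B looks up the job before A finishes
        if (winnerScans) {
            dir.finishScan(Collections.singleton("job_1")); // A publishes too late for B
        }
        if (loserScans) {
            dir.finishScan(Collections.singleton("job_1")); // not reached in this replay
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println("loser finds job: " + loserFindsJob()); // prints "loser finds job: false"
    }
}
```

In this model the losing caller gets a miss even though the job file is present in the directory, which corresponds to the “Unknown job” symptom.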
Store the user's done_intermediate directory in a ConcurrentHashMap so that threads will not race on it.
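One possible reading of this fix is sketched below. It is an assumption, not the actual patch: UserDirCacheSketch, UserDirState, scanIfNeeded and the user name "alice" are hypothetical. The map hands every thread the same per-user state object, and the timestamp check, the scan, and the publishing of results happen under that object's lock, so a second caller cannot observe the new timestamp before the results are visible.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch of the fix; names and structure are assumptions.
public class UserDirCacheSketch {
    static final class UserDirState {
        private long recordedModTime = 0L;
        private final Set<String> knownJobs = ConcurrentHashMap.newKeySet();

        // Check, scan and publish under one lock, so a second caller blocks
        // until the first scan has made its results visible.
        synchronized void scanIfNeeded(long dirModTime, Set<String> jobsInDir) {
            if (dirModTime > recordedModTime) {
                knownJobs.addAll(jobsInDir);
                recordedModTime = dirModTime; // recorded only after publishing
            }
        }

        boolean isKnown(String jobId) {
            return knownJobs.contains(jobId);
        }
    }

    // One shared state object per user.
    private final ConcurrentMap<String, UserDirState> userDirs = new ConcurrentHashMap<>();

    UserDirState stateFor(String user) {
        // computeIfAbsent is atomic: concurrent callers get the same instance.
        return userDirs.computeIfAbsent(user, u -> new UserDirState());
    }

    // Two threads scan the same user's directory; both should find the job.
    static boolean bothThreadsFindJob() {
        UserDirCacheSketch cache = new UserDirCacheSketch();
        boolean[] found = new boolean[2];
        Thread[] threads = new Thread[2];
        for (int i = 0; i < threads.length; i++) {
            final int id = i;
            threads[i] = new Thread(() -> {
                UserDirState state = cache.stateFor("alice");
                state.scanIfNeeded(100L, Collections.singleton("job_1"));
                found[id] = state.isKnown("job_1");
            });
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join(); // join makes each thread's write to found[] visible here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return found[0] && found[1];
    }

    public static void main(String[] args) {
        System.out.println("both find job: " + bothThreadsFindJob());
    }
}
```

Whichever thread enters scanIfNeeded second either performs the scan itself or finds the results already published, so neither thread reports the job as unknown.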