After the RM restarts, the History Server does not return the correct results for restarted jobs.
Users who restart their jobs, or who have jobs running when the RM restarts.
1. Restart a job. (restart)
2. Try to get the job status from the HS after the job finishes. (file read)
1 node (1 HS)
The HS does not report the correct status of the job.
When the AM starts, the jobhistory file in the staging directory is flushed to the done_intermediate directory. As a result, when the Job History Server is queried after a job restart, it returns the old AM’s info.
When the AM starts, it copies all logs from the staging directory to the done_intermediate directory.
After the job completes, the JobHistoryEventHandler scans the log and outputs the old AM’s log.
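The failure sequence can be illustrated with a toy model. This is a sketch, not Hadoop code: the function names and the dictionaries standing in for the staging and done_intermediate directories are hypothetical, and the model assumes that the first history file published for a job is the one that ends up being served.

```python
# Toy model of the failure. None of these names are Hadoop APIs.

def buggy_am_attempt(attempt_id, job_id, staging, done_intermediate):
    """Simulate one AM attempt writing its history and flushing it."""
    staging[job_id] = f"history written by AM attempt {attempt_id}"
    # Assumption: every attempt flushes, and the first published file wins.
    done_intermediate.setdefault(job_id, staging[job_id])


def query_history_server(job_id, done_intermediate):
    """Simulate the History Server reading the published history file."""
    return done_intermediate.get(job_id)


staging, done_intermediate = {}, {}
buggy_am_attempt(1, "job_1", staging, done_intermediate)  # AM rebooted after the RM restart
buggy_am_attempt(2, "job_1", staging, done_intermediate)  # restarted AM finishes the job
buggy_result = query_history_server("job_1", done_intermediate)
print(buggy_result)  # history written by AM attempt 1
```

In this model the query returns the first attempt's history even though the second attempt is the one that completed the job, which matches the symptom described above.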
Incorrect exception handling
Add a JOB_AM_REBOOT event to the event pool between the RM and the AM.
Skip writing out the history log for every AM attempt except the last, so that only the last attempt’s log is written.
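The effect of the fix can be sketched with a small model. The names below are hypothetical and are not taken from the actual patch. How an attempt learns that it is not the last one (for example, by receiving JOB_AM_REBOOT) is assumed here and reduced to a boolean flag.

```python
# Toy model of the fix. None of these names are Hadoop APIs.

def publish_history(attempt_id, is_last_attempt, job_id, staging, done_intermediate):
    """Write the attempt's history; publish it only for the last attempt."""
    staging[job_id] = f"history written by AM attempt {attempt_id}"
    if not is_last_attempt:
        # A rebooted (non-final) attempt leaves done_intermediate untouched.
        return False
    done_intermediate[job_id] = staging[job_id]
    return True


staging, done_intermediate = {}, {}
first = publish_history(1, False, "job_1", staging, done_intermediate)  # rebooted attempt
second = publish_history(2, True, "job_1", staging, done_intermediate)  # final attempt
fixed_result = done_intermediate.get("job_1")
print(fixed_result)  # history written by AM attempt 2
```

Because the rebooted attempt never publishes, the only history file available to the History Server is the one written by the attempt that actually finished the job.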