data loss
AM can rerun job after reporting final job status to the client.
blocker
No
No
Single Job
0.23.3, 2.0.1-alpha
1 RM, 1 AM
1. The user starts a job inside an AM (feature start)
2. The AM crashes after reporting the status to the RM, without unregistering from the RM (disconnect)
The AM must crash after reporting the status to the RM but before unregistering itself from the RM.
No (multinode)
No
2
The AM reruns the job when it recovers from a crash, even though the job had already finished before the crash.
Because the AM has not flushed the log from the intermediate directory to the done directory, the next AM attempt believes on startup that the previous attempt has not finished. It therefore reruns the job, which may cause data loss.
The new AM assumes that the previous attempt is not done and reruns the job.
Incorrect Handling
Before the AM completes the job, flush only the last action of the AM log from the intermediate dir to the done dir, so that the next AM instance takes the correct recovery action.
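
The recovery decision described above can be illustrated with a minimal sketch. This is not the Hadoop implementation: the marker file name, the function names, and the returned action strings are all hypothetical, and a local directory stands in for the done dir. It shows only the ordering idea: persist the final state where the next attempt will look, and have the next attempt check that record before deciding to rerun.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical marker file name; not the actual Hadoop history file layout.
FINAL_MARKER = "job_final_state.json"


def record_final_state(done_dir: Path, job_id: str, state: str) -> None:
    """Persist the job's final state before the AM reports it or exits."""
    done_dir.mkdir(parents=True, exist_ok=True)
    tmp = done_dir / (FINAL_MARKER + ".tmp")
    tmp.write_text(json.dumps({"job_id": job_id, "state": state}))
    # Rename into place so a reader never sees a half-written marker.
    tmp.replace(done_dir / FINAL_MARKER)


def recover(done_dir: Path, job_id: str) -> str:
    """Decide what a new AM attempt should do on startup."""
    marker = done_dir / FINAL_MARKER
    if marker.exists():
        recorded = json.loads(marker.read_text())
        if recorded["job_id"] == job_id:
            # The previous attempt finished: report its state, do not rerun.
            return "REPORT_" + recorded["state"]
    # No record of a finished attempt: rerunning is the only safe option.
    return "RERUN"
```

With no marker present, `recover` returns `"RERUN"`, which corresponds to the failure in this report when the previous attempt had in fact finished. After `record_final_state(done_dir, job_id, "SUCCEEDED")` has run, the same call returns `"REPORT_SUCCEEDED"` and the job is not executed a second time.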