The RM may think an AM has gone down when the previous AM is still up and running.
1. Due to network problems, the RM thinks that an AM has gone down although the AM is still alive (disconnected).
2. The previous AM tries to commit a job without requesting resources from the RM (feature start).
Duplicate job commits
Because the AM does not check its heartbeat with the RM when committing a job, the old AM keeps committing the job even though the RM considers it dead.
The RM thinks the AM has gone down while the AM is actually working.
Add a commit window to the AppMaster. Any task committed outside the commit window is regarded as invalid.
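
A minimal sketch of how such a commit window might look, not the actual AppMaster code. It assumes the window is measured from the AM's last successful heartbeat to the RM, which the description above does not spell out. The class name CommitWindow, its methods, and the windowMs parameter are all hypothetical and exist only for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/**
 * Hypothetical commit-window guard for an AppMaster (illustration only).
 * Assumption: a commit is valid only if the AM has heard from the RM
 * within the last windowMs milliseconds.
 */
final class CommitWindow {
    private final long windowMs;
    private final LongSupplier clock;       // injectable time source, in milliseconds
    private final AtomicLong lastHeartbeatMs;

    CommitWindow(long windowMs, LongSupplier clock) {
        this.windowMs = windowMs;
        this.clock = clock;
        // Assumption: the AM has just registered with the RM when this is created.
        this.lastHeartbeatMs = new AtomicLong(clock.getAsLong());
    }

    /** Call after every heartbeat that the RM acknowledged. */
    void onHeartbeatSuccess() {
        lastHeartbeatMs.set(clock.getAsLong());
    }

    /** True while the last acknowledged heartbeat is still inside the window. */
    boolean isCommitAllowed() {
        return clock.getAsLong() - lastHeartbeatMs.get() <= windowMs;
    }

    /**
     * Runs commitAction only inside the commit window.
     * Returns false, without committing, when the window has expired.
     */
    boolean tryCommit(Runnable commitAction) {
        if (!isCommitAllowed()) {
            return false;
        }
        commitAction.run();
        return true;
    }
}
```

With this guard, an AM that has lost contact with the RM for longer than the window refuses to commit, so a replacement AM started by the RM would not race against it. The sketch only checks the window when the commit starts; a commit that runs longer than the window would need additional handling, and the window would presumably have to be shorter than the time the RM waits before declaring the AM dead.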