Task can get stuck in FAIL_CONTAINER_CLEANUP
critical
No.
No.
single request
0.23.5
1 AM, 1 RM and 1 NM
must be in that order
no (multinode)
no
3 (1 NM, 1 AM and 1 RM)
The task gets stuck when an NM goes down and the AM tries to launch a container on it.
The TA_FAILMSG message arrives before the TA_CONTAINER_LAUNCH_FAILED message, so the task attempt tries to kill the container. However, ContainerLauncherImpl never sends back a TA_CONTAINER_CLEANED event, which leaves the attempt stuck in FAIL_CONTAINER_CLEANUP.
The TA_CONTAINER_CLEANED event is not sent when the container is killed.
Wrong exception handling
Always send a TA_CONTAINER_CLEANED event in all cases, even when the container has failed and is about to be killed.
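
The idea behind the fix can be illustrated with a minimal, self-contained sketch. This is not the actual ContainerLauncherImpl code: the class CleanupSketch, the NodeManager interface, the EventSink helper and the cleanup method are all hypothetical names introduced here for illustration. Only the event name TA_CONTAINER_CLEANED comes from the report. The sketch assumes that a failed stop request (for example because the NM is down) surfaces as an exception, and it sends the cleanup event from a finally block so that the task attempt is notified on every path.

```java
import java.util.ArrayList;
import java.util.List;

public class CleanupSketch {
    // Event name taken from the report; the enum itself is hypothetical.
    enum Event { TA_CONTAINER_CLEANED }

    // Hypothetical stand-in for the connection to a node manager.
    interface NodeManager {
        void stopContainer(String containerId) throws Exception;
    }

    // Hypothetical sink that records events sent back to the task attempt.
    static final class EventSink {
        final List<Event> events = new ArrayList<>();
        void send(Event e) { events.add(e); }
    }

    // Tries to stop the container and reports cleanup in all cases,
    // including when the stop request itself fails.
    static void cleanup(NodeManager nm, String containerId, EventSink sink) {
        try {
            nm.stopContainer(containerId);
        } catch (Exception e) {
            // The NM may be down or the container may never have launched;
            // there is nothing left to stop, so only log the failure.
            System.err.println("stopContainer failed for " + containerId
                    + ": " + e.getMessage());
        } finally {
            // Sent on both the success and the failure path, so the attempt
            // can always leave its cleanup state.
            sink.send(Event.TA_CONTAINER_CLEANED);
        }
    }

    public static void main(String[] args) {
        EventSink sink = new EventSink();
        // Simulate an NM that has gone down.
        NodeManager deadNm = id -> {
            throw new java.io.IOException("connection refused");
        };
        cleanup(deadNm, "container_1", sink);
        System.out.println(sink.events);
    }
}
```

Running main simulates the failure scenario from this report: the stop request throws, yet the sink still receives one TA_CONTAINER_CLEANED event. Whether the real fix uses a finally block or sends the event explicitly in each branch is not stated here; the sketch only shows the "always send" behavior.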