Speculation + Fetch failures can lead to a hung job
2 MR Tasks (1 attempt and 1 speculative attempt)
1. Attempt 1 starts (feature start)
2. The speculative attempt starts (feature start)
3. Attempt 1 completes and the task moves to the SUCCEEDED state. The speculative attempt is then KILLED, but the T_ATTEMPT_KILLED event is ignored.
4. Attempt 1 then fails with TOO_MANY_FETCH_FAILURES, and the job hangs.
2 (2 tasks)
The job hangs when an attempt fails with TOO_MANY_FETCH_FAILURES.
Because the killed speculative attempt is never removed from the InProgress list, the task still believes there are unfinished attempts when the failure arrives, and therefore does not start a new attempt.
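The bookkeeping problem can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: the class SpeculationHangDemo, its methods, and the attempt names are hypothetical and are not taken from TaskImpl. It assumes that a task launches a replacement attempt only when its in-progress set is empty.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical, simplified model of per-task attempt bookkeeping.
// It is NOT the real TaskImpl; all names are illustrative only.
final class SpeculationHangDemo {
    private final Set<String> inProgress = new LinkedHashSet<>();
    private boolean succeeded = false;

    void attemptStarted(String id) {
        inProgress.add(id);
    }

    // The first successful attempt moves the task to SUCCEEDED.
    void attemptSucceeded(String id) {
        inProgress.remove(id);
        succeeded = true;
    }

    // Buggy behavior: once the task is SUCCEEDED, a kill event is dropped,
    // so the killed attempt is never removed from inProgress.
    void attemptKilled(String id) {
        if (succeeded) {
            return; // event ignored
        }
        inProgress.remove(id);
    }

    // A fetch failure on the successful attempt takes the task out of
    // SUCCEEDED. In this model a replacement is launched only if the task
    // believes that no other attempt is still running.
    boolean fetchFailureLaunchesNewAttempt() {
        succeeded = false;
        return inProgress.isEmpty();
    }

    public static void main(String[] args) {
        SpeculationHangDemo task = new SpeculationHangDemo();
        task.attemptStarted("attempt_1");
        task.attemptStarted("attempt_1_speculative");
        task.attemptSucceeded("attempt_1");
        task.attemptKilled("attempt_1_speculative"); // dropped
        // prints "new attempt launched: false"
        System.out.println("new attempt launched: "
                + task.fetchFailureLaunchesNewAttempt());
    }
}
```

In this model the stale entry for the speculative attempt is what blocks the relaunch: nothing is actually running, yet the task keeps waiting.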
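Incorrect bookkeeping in TaskImpl: once one attempt succeeds, events from the remaining speculative attempt are ignored, so that attempt stays recorded as in progress. When the successful attempt later fails with a fetch failure, the task does not launch a new attempt.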
Handle the T_ATTEMPT_KILLED and T_ATTEMPT_SUCCEEDED events in the SUCCEEDED state. The handler removes the attempt from the InProgress list and adds it to the finishedAttempts list.
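A minimal sketch of the fix described above, not the actual patch. The class SucceededStateFixSketch, its fields, and the helper markFinished are hypothetical. T_ATTEMPT_FAILED is used here as a stand-in for the TOO_MANY_FETCH_FAILURES failure of the successful attempt, and the relaunch rule (launch only when nothing is in progress) is an assumption of the sketch.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the fix; NOT the actual TaskImpl code.
final class SucceededStateFixSketch {
    enum State { RUNNING, SUCCEEDED }
    enum Event { T_ATTEMPT_SUCCEEDED, T_ATTEMPT_KILLED, T_ATTEMPT_FAILED }

    State state = State.RUNNING;
    final Set<String> inProgressAttempts = new HashSet<>();
    final Set<String> finishedAttempts = new HashSet<>();
    int launchedAttempts = 0;

    String launchAttempt() {
        String id = "attempt_" + (launchedAttempts++);
        inProgressAttempts.add(id);
        return id;
    }

    // Move an attempt from the in-progress set to the finished set.
    private void markFinished(String attemptId) {
        inProgressAttempts.remove(attemptId);
        finishedAttempts.add(attemptId);
    }

    void handle(Event event, String attemptId) {
        switch (event) {
            case T_ATTEMPT_SUCCEEDED:
            case T_ATTEMPT_KILLED:
                // The fix: these events are processed in every state,
                // including SUCCEEDED, instead of being ignored there.
                markFinished(attemptId);
                if (event == Event.T_ATTEMPT_SUCCEEDED) {
                    state = State.SUCCEEDED;
                }
                break;
            case T_ATTEMPT_FAILED:
                // Stand-in for the fetch failure of the successful attempt.
                markFinished(attemptId);
                state = State.RUNNING;
                if (inProgressAttempts.isEmpty()) {
                    launchAttempt(); // nothing else is running, so retry
                }
                break;
        }
    }

    public static void main(String[] args) {
        SucceededStateFixSketch task = new SucceededStateFixSketch();
        String first = task.launchAttempt();        // attempt_0
        String speculative = task.launchAttempt();  // attempt_1
        task.handle(Event.T_ATTEMPT_SUCCEEDED, first);
        task.handle(Event.T_ATTEMPT_KILLED, speculative);
        task.handle(Event.T_ATTEMPT_FAILED, first);
        // prints "in progress: [attempt_2]"
        System.out.println("in progress: " + task.inProgressAttempts);
    }
}
```

Because the killed speculative attempt is now moved to the finished set while the task is SUCCEEDED, the later failure finds an empty in-progress set and a replacement attempt is launched. The sketch covers only this scenario; rescheduling after a kill while the task is still RUNNING is deliberately left out.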