As you can see, the first task’s status is still “Running” (even though it has in fact been killed), yet its progress already shows 100%.
If we click on that task, we have the following:
Critical
Sure. The job failed with a “ShuffleError”. That was caused by 3124 (the AM cannot load the native Java classes), but that exception is not specific to this particular error.
No.
0.23.0
Standard
1. start a MR job:
hadoop jar ~/research/hadoop/hadoop-2.0.2-alpha-src/hadoop-dist/target/hadoop-2.0.2-alpha/share/hadoop/mapreduce/hadoop-*-examples.jar randomwriter rand
2. kill the job
3. check the web interface
Killing a running job will do.
Sure.
Yes. The unique event is killing a job:
./container_1375285695285_0001_01_000001/syslog:2013-07-31 11:53:21,982 ERROR [TaskCleaner Event Handler] org.apache.hadoop.mapreduce.v2.app.taskclean.TaskCleanerImpl: Returning, interrupted : java.lang.InterruptedException
1. (AM + RM)
Very straightforward once the symptom is observed.
The screenshot above.
The key observation is that the job has been killed, yet its progress is still reported. It is not hard to locate the code in the AM that returns the progress:
TaskImpl.java:
  public float getProgress() {
    readLock.lock();
    try {
      TaskAttempt bestAttempt = selectBestAttempt();
      if (bestAttempt == null) {
        return 0;
      }
      return bestAttempt.getProgress();
    } finally {
      readLock.unlock();
    }
  }
  //select the nextAttemptNumber with best progress
  // always called inside the Read Lock
  private TaskAttempt selectBestAttempt() {
    float progress = 0f;
    TaskAttempt result = null;
    for (TaskAttempt at : attempts.values()) {
      if (result == null) {
        result = at; //The first time around
      }
      // calculate the best progress
      if (at.getProgress() > progress) {
        result = at;
        progress = at.getProgress();
      }
    }
    return result;
  }
TaskAttemptImpl.java:
  public float getProgress() {
    readLock.lock();
    try {
      return reportedStatus.progress;
    } finally {
      readLock.unlock();
    }
  }
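The interaction between the two methods above can be reproduced with a small standalone sketch. The enum, class, and field names here are simplified stand-ins for illustration, not the real Hadoop types; the selection loop mirrors the buggy selectBestAttempt():

```java
import java.util.ArrayList;
import java.util.List;

class ProgressBugSketch {
    // Simplified stand-in for a task attempt: only its state and the
    // last progress value it reported (reportedStatus.progress).
    enum State { RUNNING, KILLED }

    static class Attempt {
        final State state;
        final float progress;

        Attempt(State state, float progress) {
            this.state = state;
            this.progress = progress;
        }
    }

    // Mirrors the buggy selectBestAttempt(): picks the attempt with the
    // highest progress, regardless of its state.
    static Attempt selectBestAttempt(List<Attempt> attempts) {
        float progress = 0f;
        Attempt result = null;
        for (Attempt at : attempts) {
            if (result == null) {
                result = at; // the first time around
            }
            if (at.progress > progress) {
                result = at;
                progress = at.progress;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Attempt> attempts = new ArrayList<>();
        // A killed attempt keeps whatever progress it last reported...
        attempts.add(new Attempt(State.KILLED, 1.0f));
        // ...and beats a live attempt that has barely started.
        attempts.add(new Attempt(State.RUNNING, 0.1f));

        Attempt best = selectBestAttempt(attempts);
        // The dead attempt wins, so the UI shows 100% for a killed task.
        System.out.println(best.state + " " + best.progress);
    }
}
```

Running this prints KILLED 1.0: the killed attempt's stale progress is selected over the live one.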
The problem: the selection loop above does not consider whether the task attempt is already dead!
--- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
+++ hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskImpl.java
@@ -441,10 +441,20 @@ public abstract class TaskImpl implements Task, EventHandler<TaskEvent> {
     float progress = 0f;
     TaskAttempt result = null;
     for (TaskAttempt at : attempts.values()) {
+      switch (at.getState()) {
+
+      // ignore all failed task attempts
+      case FAIL_CONTAINER_CLEANUP:
+      case FAIL_TASK_CLEANUP:
+      case FAILED:
+      case KILL_CONTAINER_CLEANUP:
+      case KILL_TASK_CLEANUP:
+      case KILLED:
+        continue;
+      }
       if (result == null) {
         result = at; //The first time around
       }
-      //TODO: consider the nextAttemptNumber only if it is not failed/killed ?
       // calculate the best progress
       if (at.getProgress() > progress) {
         result = at;
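The effect of the patched loop can be illustrated with a standalone sketch. The types are simplified stand-ins, not the actual Hadoop classes; the skipped state names follow the patch:

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;

class FixedSelectionSketch {
    // Simplified stand-ins for the TaskAttemptState values the patch skips.
    enum State {
        RUNNING, SUCCEEDED,
        FAIL_CONTAINER_CLEANUP, FAIL_TASK_CLEANUP, FAILED,
        KILL_CONTAINER_CLEANUP, KILL_TASK_CLEANUP, KILLED
    }

    // The terminal states that the patched switch falls through to `continue`.
    static final EnumSet<State> DEAD = EnumSet.of(
            State.FAIL_CONTAINER_CLEANUP, State.FAIL_TASK_CLEANUP, State.FAILED,
            State.KILL_CONTAINER_CLEANUP, State.KILL_TASK_CLEANUP, State.KILLED);

    static class Attempt {
        final State state;
        final float progress;

        Attempt(State state, float progress) {
            this.state = state;
            this.progress = progress;
        }
    }

    // Patched selection: failed/killed attempts are skipped, so a dead
    // attempt's stale progress can no longer win.
    static Attempt selectBestAttempt(List<Attempt> attempts) {
        float progress = 0f;
        Attempt result = null;
        for (Attempt at : attempts) {
            if (DEAD.contains(at.state)) {
                continue; // ignore all failed/killed task attempts
            }
            if (result == null) {
                result = at; // the first time around
            }
            if (at.progress > progress) {
                result = at;
                progress = at.progress;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Attempt> attempts = new ArrayList<>();
        attempts.add(new Attempt(State.KILLED, 1.0f));  // stale 100%
        attempts.add(new Attempt(State.RUNNING, 0.1f)); // live attempt
        Attempt best = selectBestAttempt(attempts);
        System.out.println(best.state + " " + best.progress);
    }
}
```

With the dead states filtered out, the live attempt is selected and the reported progress reflects reality (this prints RUNNING 0.1).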
Incorrect error handling: the kill of the job was not handled.
See above.
Handle killed jobs.