Cron sent an e-mail every time the HDFS Balancer finished successfully, although it was supposed to send e-mails only when an execution failed.
No, but e-mails were sent (wrongly).
1.3 Scope of the failure
No harm was done to the cluster, but the failure flooded the developers' inboxes: an e-mail was sent every time the Balancer ran, and Cron ran the Balancer often.
A simple, standard cluster with 1 namenode and 1 datanode. No files or extra settings are needed.
There are several ways to reproduce this failure.
A quick look at the source code already reveals a conceptual error: the success return value was defined as 1 when it should be 0. Returning 0 on success is the convention followed both in Hadoop and in Linux environments.
To see this result, add a simple log statement to the source code that prints the value returned by the tool just before it returns. Then compile the source code.
Finally, run the Balancer and check the value in your new log message.
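The logging step described above could look roughly like the following sketch. It is not the Hadoop Balancer source: the class `BalancerExitLogging`, the method names, and the constant `SUCCESS_BEFORE_FIX` are hypothetical stand-ins, and the placeholder `runBalancer()` simply returns the success value the report describes (1).

```java
import java.util.logging.Logger;

// Hypothetical stand-in for the balancer entry point; NOT the Hadoop source.
final class BalancerExitLogging {
    private static final Logger LOG = Logger.getLogger("BalancerExitLogging");

    // Success value used before the fix, as described in this report.
    static final int SUCCESS_BEFORE_FIX = 1;

    // Placeholder for the real balancing work (assumption: it succeeds).
    static int runBalancer() {
        return SUCCESS_BEFORE_FIX;
    }

    // Log the return value just before handing it back to the caller.
    static int runAndLog() {
        int exitCode = runBalancer();
        LOG.info("Balancer returning exit code " + exitCode);
        return exitCode;
    }

    public static void main(String[] args) {
        runAndLog();
    }
}
```

With a log line like this in place, a successful run makes the non-zero return value visible without having to wait for a Cron e-mail.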
No. The original logs do not show the return value, but the e-mail should.
1 machine with at least 1 namenode and 1 datanode.
This error propagates only a short distance. The balancer tool finishes correctly but returns 1 to Cron, which interprets the non-zero value as a failure and sends a failure report.
Cron was sending e-mails to the developers whenever the balancer tool succeeded (it should report only the tool's failures).
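The propagation can be illustrated with a small sketch of the check a caller applies to an exit status. It assumes, as this report describes, that the Cron job was set up so that any non-zero exit status triggers a failure report. The class `ExitStatusCheck` and the method `isFailure` are hypothetical names used only for illustration.

```java
// Hypothetical model of the "non-zero means failure" convention; not Cron itself.
final class ExitStatusCheck {
    // By convention, only exit status 0 means success.
    static boolean isFailure(int exitStatus) {
        return exitStatus != 0;
    }

    public static void main(String[] args) {
        int successBeforeFix = 1; // value the balancer returned on success
        int successAfterFix = 0;  // value it returns after the fix

        System.out.println("before fix, report sent: " + isFailure(successBeforeFix));
        System.out.println("after fix, report sent: " + isFailure(successAfterFix));
    }
}
```

Under this assumption, a successful run that returns 1 is indistinguishable from a failure for the caller, which is why every successful run produced a report.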
The e-mails should contain the details of the event that triggered the report. Among these details, there should be log messages showing that the tool ran successfully but returned a non-zero value. Finally, a look at the corresponding source code shows that this return value is defined there explicitly.
Figure 1 - Change in the source code.
No. The original log messages didn’t show what was being returned to Cron, but the e-mails probably had it.
The balancer was written to return 1 on success instead of 0, so Cron reported it every time it succeeded. This is a conceptual error.
Just swap the values of SUCCESS and IN_PROGRESS: SUCCESS changes from 1 to 0, and IN_PROGRESS changes from 0 to 1. Figure 1 shows the changes.
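As a rough illustration of the swap, the sketch below models the two statuses as a Java enum. Only the names SUCCESS and IN_PROGRESS and their values come from this report. The enum name `ReturnStatusSketch` and the `code` field are assumptions, and the actual Hadoop source may declare these values differently.

```java
// Hypothetical sketch of the status codes after the fix; not the Hadoop source.
enum ReturnStatusSketch {
    SUCCESS(0),      // was 1 before the fix
    IN_PROGRESS(1);  // was 0 before the fix

    // Numeric value handed back to the caller (e.g. as the process exit status).
    final int code;

    ReturnStatusSketch(int code) {
        this.code = code;
    }
}
```

With these values, a successful run yields exit status 0, so a caller that treats non-zero as failure no longer reports it.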