HDFS-4451 Report

1. Symptom

Cron sends an e-mail every time the HDFS Balancer finishes successfully, although it is supposed to send e-mails only when a run fails.

1.1 Severity

Major.

1.2 Was there an exception thrown?

No, but e-mails were sent (wrongly).

1.2.1 Were there multiple exceptions?

No.

1.3 Scope of the failure

No harm was done to the cluster, but the failure flooded the developers' inboxes: Cron invoked the Balancer frequently, and every run produced an e-mail.

2. How to reproduce this failure

2.0 Version

2.0.2-alpha

2.1 Configuration

A standard minimal cluster with 1 namenode and 1 datanode. No files or extra settings are needed.

2.2 Reproduction procedure

There are several ways to reproduce this failure.

A quick look at the source code already reveals the conceptual error: the return value for success is defined as 1 when it should be 0. Returning 0 on success is the convention followed by both Hadoop and Linux.

To see this result, put a simple log message into the source code to print the value returned by the tool before it actually returns. Then, compile the source code.

Finally, run the Balancer and check the value in the new log message.
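The instrumentation described above might look like the following minimal sketch. The names PreFixStatus, ExitCodeLogging, describe and logAndReturn are hypothetical and are not taken from the Balancer source; only the two status values (SUCCESS = 1, IN_PROGRESS = 0) come from this report.

```java
// Hypothetical stand-in for the Balancer's exit statuses as described in this
// report (before the fix). This is not the actual Hadoop source.
enum PreFixStatus {
    SUCCESS(1),      // the reported bug: success maps to a non-zero code
    IN_PROGRESS(0);

    final int code;

    PreFixStatus(int code) {
        this.code = code;
    }
}

class ExitCodeLogging {
    // Builds the message to log just before the tool returns.
    static String describe(PreFixStatus status) {
        return "Balancer exiting with status " + status + " (code " + status.code + ")";
    }

    // Logs the status to stderr, then returns the integer the process would exit with.
    static int logAndReturn(PreFixStatus status) {
        System.err.println(describe(status));
        return status.code;
    }
}
```

With values like these, a successful run would log a code of 1, which is the value to look for when reproducing the failure.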

2.2.1 Timing order

Irrelevant.

2.2.2 Events order externally controllable?

Yes.

2.3 Can the logs tell how to reproduce the failure?

No. The original logs do not show the return value, but the e-mail should.

2.4 How many machines needed?

1 machine with at least 1 namenode and 1 datanode.

3. Diagnosis procedure

The error propagates only a short distance. The balancer tool finishes correctly but passes a non-zero return value to Cron, which interprets it as a failure and sends a failure report.
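The caller's side of this propagation can be illustrated with a small sketch. It assumes the Cron job is set up so that mail is sent only when the command exits with a non-zero status; the class and method names are hypothetical and do not model Cron's actual implementation.

```java
// Hypothetical model of a caller that reports only failed runs.
class CronLikeCaller {
    // POSIX convention: exit status 0 means success, anything else is a failure.
    static boolean isFailure(int exitCode) {
        return exitCode != 0;
    }

    // Assumed policy for this sketch: send mail only when the job failed.
    static boolean shouldSendMail(int exitCode) {
        return isFailure(exitCode);
    }
}
```

Under this policy, a tool that returns 1 on success triggers a mail on every successful run, which matches the symptom.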

3.1 Detailed Symptom (where you start)

Cron was sending e-mails to the developers whenever the balancer tool succeeded; it should report only the tool's failures.

3.2 Backward inference

The e-mails should contain the details of the event that triggered the report. Among these details, there should be log messages showing that the tool ran successfully but returned a non-zero value. A look at the corresponding piece of source code then makes it easy to see that this return value is defined in the source code itself.

Figure 1 - Change in the source code.

3.3 Are the printed logs sufficient for diagnosis?

No. The original log messages didn’t show what was being returned to Cron, but the e-mails probably had it.

4. Root cause

The balancer was written to return 1 on success, instead of 0. So, it would be reported by Cron every time it succeeded. This is a conceptual error.

4.1 Category:

Conceptual error.

5. Fix

5.1 How?

Swap the values of SUCCESS and IN_PROGRESS: SUCCESS changes from 1 to 0, and IN_PROGRESS changes from 0 to 1. Figure 1 shows the changes.
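As a rough illustration of the swap, the status values after the fix could be written as below. This is a sketch, not the actual patch: the enum name FixedStatus and its layout are assumptions, and only the two values described in this report are shown.

```java
// Hypothetical sketch of the status values after the fix described above.
enum FixedStatus {
    SUCCESS(0),      // was 1: now follows the zero-on-success convention
    IN_PROGRESS(1);  // was 0

    final int code;

    FixedStatus(int code) {
        this.code = code;
    }
}
```

With SUCCESS mapped to 0, a caller that treats any non-zero exit status as a failure no longer reports successful runs.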