HBase-5003 Report

https://issues.apache.org/jira/browse/Hbase-5003

1. Symptom

The master hangs at startup when hbase.rootdir is misconfigured.

1.1 Severity

Critical

1.2 Was there an exception thrown?

Yes.

1.2.1 Were there multiple exceptions?

Yes. The same warning is logged over and over:

WARN org.apache.hadoop.hbase.util.FSUtils: Unable to create version file at file:/bin/hbase, retrying: Mkdirs failed to create file:/bin/hbase

1.3 Scope of the failure

Entire server.

2. How to reproduce this failure

2.0 Version

0.90.4

2.1 Configuration

Set hbase.rootdir to an invalid (non-existent) path.

2.2 Reproduction procedure

1. Set hbase.rootdir to an invalid (non-existent) path (config change)

2. Start the server (add node)

2.2.1 Timing order

The two steps must happen in the order listed.

2.2.2 Is the event order externally controllable?

Yes

2.3 Can the logs tell how to reproduce the failure?

Yes

2.4 How many machines are needed?

1

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The repeated warning in the master log, and the master hang.

3.2 Backward inference

The warning message points directly at the code that logs it. It comes from the following method in FSUtils:

  public static void setVersion(FileSystem fs, Path rootdir, String version,
      int wait) throws IOException {
    Path versionFile = new Path(rootdir, HConstants.VERSION_FILE_NAME);
    while (true) {
      try {
        FSDataOutputStream s = fs.create(versionFile);
        s.writeUTF(version);
        LOG.debug("Created version file at " + rootdir.toString() +
            " set its version at:" + version);
        s.close();
        return;
      } catch (IOException e) {
        if (wait > 0) {
          LOG.warn("Unable to create version file at " + rootdir.toString() +
              ", retrying: " + e.getMessage());
          fs.delete(versionFile, false);
          try {
            Thread.sleep(wait);
          } catch (InterruptedException ex) {
            // ignore
          }
        }
      }
    }
  }

4. Root cause

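The error handling logic is wrong. The catch block sits inside while (true) and never rethrows, so a persistent IOException (such as a root directory that cannot be created) makes setVersion retry forever and the master never finishes starting. Instead of retrying indefinitely, it should give up after a bounded number of attempts and propagate the exception.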
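
To make the control flow concrete, here is a minimal, self-contained model of the original loop. It is only a sketch: createVersionFile is a hypothetical stand-in for fs.create on an uncreatable root directory, the sleep is omitted, and the cap parameter is an artificial limit added so the demonstration terminates (the real loop has none).

```java
import java.io.IOException;

class UnboundedRetryModel {
    /** Result of a capped run: attempts made, warnings logged, and whether it succeeded. */
    static final class Outcome {
        final int attempts;
        final int warnings;
        final boolean succeeded;

        Outcome(int attempts, int warnings, boolean succeeded) {
            this.attempts = attempts;
            this.warnings = warnings;
            this.succeeded = succeeded;
        }
    }

    // Hypothetical stand-in for fs.create(versionFile) when rootdir cannot be created.
    static void createVersionFile() throws IOException {
        throw new IOException("Mkdirs failed to create file:/bin/hbase");
    }

    // Mirrors the shape of the original loop; `cap` is an assumption for the demo only.
    static Outcome run(int wait, int cap) {
        int attempts = 0;
        int warnings = 0;
        while (attempts < cap) { // the original uses while (true)
            attempts++;
            try {
                createVersionFile();
                return new Outcome(attempts, warnings, true);
            } catch (IOException e) {
                if (wait > 0) {
                    warnings++; // stands in for LOG.warn(...) followed by the sleep
                }
                // No rethrow and no retry limit: control returns to the loop head.
            }
        }
        return new Outcome(attempts, warnings, false);
    }

    public static void main(String[] args) {
        Outcome o = run(1000, 5);
        System.out.println(o.attempts + " attempts, " + o.warnings
            + " warnings, succeeded=" + o.succeeded);
    }
}
```

Every attempt fails and nothing ever leaves the loop except the artificial cap. The model also shows that with wait <= 0 the loop would spin without logging anything at all.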

4.1 Category:

Incorrect error handling (handled) (statement coverage)

5. Fix

5.1 How?

The patch bounds the retry loop. The catch block retries only while retries > 0, decrementing the counter each time, and rethrows the IOException once the retries are used up. The sleep is skipped when wait is not positive. The declaration of retries lies outside the hunk shown here.

      } catch (IOException e) {
-        if (wait > 0) {
+        if (retries > 0) {
          LOG.warn("Unable to create version file at " + rootdir.toString() +
              ", retrying: " + e.getMessage());
          fs.delete(versionFile, false);
          try {
-            Thread.sleep(wait);
+            if (wait > 0) {
+              Thread.sleep(wait);
+            }
          } catch (InterruptedException ex) {
            // ignore
          }
+          retries--;
+        } else {
+          throw e;
        }
      }
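
The bounded-retry pattern the patch introduces can be sketched outside HBase as follows. This is an illustration under assumptions, not the project's code: IOAction and runWithRetries are hypothetical names, the action stands in for the create/write/close sequence, and, unlike the patch, this sketch stops retrying when the sleeping thread is interrupted.

```java
import java.io.IOException;

class BoundedRetry {
    /** Hypothetical stand-in for the fs.create / writeUTF / close sequence. */
    @FunctionalInterface
    interface IOAction {
        void run() throws IOException;
    }

    /**
     * Runs the action, retrying up to `retries` more times after a failure.
     * Returns the number of attempts made, or rethrows the last IOException
     * once the retries are used up.
     */
    static int runWithRetries(IOAction action, int retries, long waitMillis)
            throws IOException {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                action.run();
                return attempts;
            } catch (IOException e) {
                if (retries <= 0) {
                    throw e; // give up instead of looping forever
                }
                retries--;
                if (waitMillis > 0) {
                    try {
                        Thread.sleep(waitMillis);
                    } catch (InterruptedException ex) {
                        // Deviation from the patch: restore the flag and stop retrying.
                        Thread.currentThread().interrupt();
                        throw e;
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        try {
            runWithRetries(() -> {
                throw new IOException("Mkdirs failed to create file:/bin/hbase");
            }, 3, 10);
        } catch (IOException e) {
            System.out.println("Gave up: " + e.getMessage());
        }
    }
}
```

With retries set to 3, a permanently failing action is attempted four times (the first try plus three retries) before the exception reaches the caller, which can then abort startup with a clear error rather than hang.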