HBase-2337 Report

https://issues.apache.org/jira/browse/hbase-2337

1. Symptom

The HLog split process permanently loses old logs if the Master cannot be reached during the split

1.1 Severity

Blocker

1.2 Was there exception thrown?

Yes: 1. RS goes down (to trigger the split); 2. Master goes down; 3. final data loss

1.2.1 Were there multiple exceptions?

Yes

1.3 Scope of the failure

Massive data loss

2. How to reproduce this failure

2.0 Version

0.20.4

2.1 Configuration

Standard

2.2 Reproduction procedure

1. RS goes down (disconnect), which triggers the log split

2. Master goes down (disconnect)

2.2.1 Timing order

In this order

2.2.2 Events order externally controllable?

No. Step 2 (Master goes down) must occur after the log split has started

2.3 Can the logs tell how to reproduce the failure?

Yes

2.4 How many machines needed?

2 (HM + RS)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

Error message:

Exception processing -- continuing. Possible DATA LOSS!

3.2 Backward inference

The patch:

            } catch (IOException e) {
-              LOG.debug("IOE Pushed=" + count + " entries from " +
-                logfiles[i].getPath());
+              LOG.debug("IOE Pushed=" + count + " entries from " + curLogFile);
              e = RemoteExceptionHandler.checkIOException(e);
              if (!(e instanceof EOFException)) {
-                LOG.warn("Exception processing " + logfiles[i].getPath() +
-                    " -- continuing. Possible DATA LOSS!", e);
+                String msg = "Exception processing " + curLogFile +
+                             " -- continuing. Possible DATA LOSS!";
+                if (corruptDir.length() > 0) {
+                  msg += "  Storing in hlog corruption directory.";
+                }
+                LOG.warn(msg, e);
              }
            }
          } catch (IOException e) {
            if (length <= 0) {
-              LOG.warn("Empty hlog, continuing: " + logfiles[i] + " count=" + count, e);
+              LOG.warn("Empty hlog, continuing: " + logfiles[i]);
+              cleanRead = true;
              continue;
            }

4. Root cause

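The error handling shown above is wrong. When an IOException occurs during the split (here because the HM dies while splitting), the handler only logs "Possible DATA LOSS!" and continues, so the old log is permanently lost.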
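The direction the patch points in (keeping an unreadable log in a corruption directory instead of continuing past it) can be illustrated with a minimal, self-contained sketch. This is not the HBase implementation: the class HLogSplitSketch, the LogReader interface, and the splitAll method are hypothetical names, local files stand in for HDFS paths, and both the deletion of cleanly read logs and the treatment of EOFException as a clean end of log are assumptions, the latter modelled on the instanceof check in the patch.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from HBase. */
class HLogSplitSketch {

  /** Hypothetical stand-in for whatever parses the entries of one log file. */
  interface LogReader {
    void readAll(Path logFile) throws IOException;
  }

  /**
   * Processes each log file. A file that fails with a non-EOF IOException is
   * moved into corruptDir rather than skipped, so its contents stay available.
   * Returns the new locations of the quarantined files.
   */
  static List<Path> splitAll(List<Path> logFiles, Path corruptDir, LogReader reader)
      throws IOException {
    List<Path> quarantined = new ArrayList<>();
    for (Path log : logFiles) {
      boolean cleanRead = false;
      try {
        reader.readAll(log);
        cleanRead = true;
      } catch (EOFException e) {
        // Assumption: a truncated tail counts as a normal end of the log.
        cleanRead = true;
      } catch (IOException e) {
        System.err.println("Exception processing " + log
            + " -- storing in corruption directory: " + e);
      }
      if (cleanRead) {
        // Assumption: a fully processed log is no longer needed.
        Files.deleteIfExists(log);
      } else {
        // Keep the unprocessed log instead of losing it.
        Files.createDirectories(corruptDir);
        Path target = corruptDir.resolve(log.getFileName());
        Files.move(log, target, StandardCopyOption.REPLACE_EXISTING);
        quarantined.add(target);
      }
    }
    return quarantined;
  }
}
```

In this sketch, only a log that was read cleanly is removed; any other failure leaves the file recoverable under corruptDir, which is the property the original handler lacked.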

4.1 Category:

Incorrect error handling (handled, statement coverage)