HBase-4078 Report


1. Symptom

Data loss (column family loss) when an HDFS error occurs just as a column family has been compacted.

1.1 Severity


1.2 Was there exception thrown?

Yes: an IOException, followed later by data not being found.

1.2.1 Were there multiple exceptions?


1.3 Scope of the failure

One or a few column families.

2. How to reproduce this failure

2.0 Version


2.1 Configuration


2.2 Reproduction procedure

1. Compact a memstore (feature start, long running)

2. HDFS failure (data corruption)

2.2.1 Timing order

In this order.

2.2.2 Events order externally controllable?

No. The HDFS failure must occur right after the compaction finishes and before “completeCompaction” is called.
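The timing window can be illustrated with a small local simulation. This is only a sketch under stated assumptions: it uses java.nio.file on the local file system rather than HDFS, and the class CompactionWindowDemo, the Fault hook, and the file names are hypothetical, not part of HBase.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class CompactionWindowDemo {
    /** Hypothetical fault hook, invoked between the two steps. */
    interface Fault {
        void inject(Path tmpFile) throws IOException;
    }

    /** Writes the compacted data, lets the fault fire, then installs the file. */
    static Path run(Path storeDir, byte[] compacted, Fault fault) throws IOException {
        Path tmpDir = Files.createDirectories(storeDir.resolve(".tmp"));
        Path tmpFile = tmpDir.resolve("compacted");
        Files.write(tmpFile, compacted);   // step 1: compaction output lands in .tmp
        fault.inject(tmpFile);             // the window in which the failure must occur
        Path finalFile = storeDir.resolve("storefile");
        // step 2: local stand-in for the rename done in completeCompaction
        return Files.move(tmpFile, finalFile, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("store");
        byte[] data = {1, 2, 3, 4};
        // Simulated storage fault: the temporary file is truncated to zero bytes.
        Path installed = run(dir, data, tmp -> Files.write(tmp, new byte[0]));
        System.out.println("installed size = " + Files.size(installed));
    }
}
```

A fault injected before step 1 or after step 2 would not exercise the same path, which is why the ordering is hard to control from outside.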

2.3 Can the logs tell how to reproduce the failure?


2.4 How many machines needed?

2 (RS + HDFS)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The exception:

java.io.IOException: java.lang.IllegalArgumentException: Invalid HFile version: 2162721 (expected to be between 1 and 2)
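To make the message easier to read, here is a minimal sketch of the kind of range check that produces it. The bounds 1 and 2 come from the message itself. Everything else is an assumption for illustration: the class and method names are hypothetical, the version is assumed to be a big-endian int in the last four bytes of the file, and the sample bytes were chosen only because they decode to the reported number.

```java
import java.nio.ByteBuffer;

class HFileVersionCheck {
    // Bounds taken from the exception message ("between 1 and 2").
    static final int MIN_FORMAT_VERSION = 1;
    static final int MAX_FORMAT_VERSION = 2;

    /** Hypothetical helper: decodes the last four bytes as a big-endian int. */
    static int readTrailingVersion(byte[] fileBytes) {
        if (fileBytes.length < 4) {
            throw new IllegalArgumentException("File too short: " + fileBytes.length + " bytes");
        }
        return ByteBuffer.wrap(fileBytes, fileBytes.length - 4, 4).getInt();
    }

    /** Rejects any version outside the supported range. */
    static void checkFormatVersion(int version) {
        if (version < MIN_FORMAT_VERSION || version > MAX_FORMAT_VERSION) {
            throw new IllegalArgumentException("Invalid HFile version: " + version
                + " (expected to be between " + MIN_FORMAT_VERSION
                + " and " + MAX_FORMAT_VERSION + ")");
        }
    }

    public static void main(String[] args) {
        // Illustrative bytes only: 0x00210021 == 2162721.
        byte[] garbage = {0x00, 0x21, 0x00, 0x21};
        int version = readTrailingVersion(garbage);
        try {
            checkFormatVersion(version);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A value as large as 2162721 suggests that the reader decoded bytes that were never a valid version field, which is consistent with a corrupt or incomplete file.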

3.2 Backward inference

You can observe from the stack trace,

4. Root cause

During completeCompaction, the rename from .tmp/REGION_ID to colfamily11/8dc5109d70a240e7887c81bd934dbc16 failed. The resulting IOException was caught but only logged: the method returned null without validating the file.

StoreFile completeCompaction(final Collection<StoreFile> compactedFiles,
                                       final StoreFile.Writer compactedFile)
      throws IOException {
    // 1. Moving the new files into place -- if there is a new file (may not
    // be if all cells were expired or deleted).
    StoreFile result = null;
    if (compactedFile != null) {
      Path p = null;
      try {
        p = StoreFile.rename(this.fs, compactedFile.getPath(),
      } catch (IOException e) {
        LOG.error("Failed move of compacted file " + compactedFile.getPath(), e);
        return null;

Note: they should have re-validated the file path.
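One way to express that idea is to validate the file before and after the move and to propagate any failure instead of returning null. The following is a minimal sketch, not the actual HBase fix: it uses java.nio.file on the local file system, and ValidatedMove, Validator, and requireNonEmpty are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class ValidatedMove {
    /** Hypothetical validity check; a real one would open the file with its reader. */
    interface Validator {
        void check(Path file) throws IOException;
    }

    /**
     * Moves src to dst, validating before and after. Any failure surfaces as an
     * IOException rather than being swallowed, so the caller can keep the old files.
     */
    static Path moveAndValidate(Path src, Path dst, Validator validator) throws IOException {
        validator.check(src);               // refuse to install a file that is already bad
        Path moved = Files.move(src, dst);
        validator.check(moved);             // re-validate at the final path
        return moved;
    }

    /** Simplest possible check for the demo: the file must exist and be non-empty. */
    static void requireNonEmpty(Path file) throws IOException {
        if (!Files.isRegularFile(file) || Files.size(file) == 0) {
            throw new IOException("Store file missing or empty: " + file);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("store");
        Path tmp = Files.write(dir.resolve("compacted.tmp"), new byte[] {1, 2, 3});
        Path installed = moveAndValidate(tmp, dir.resolve("storefile"),
            ValidatedMove::requireNonEmpty);
        System.out.println("installed " + installed.getFileName());
    }
}
```

The design choice here is that a failed validation aborts the operation with an exception, so the pre-compaction files are never discarded on the strength of an unverified replacement.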

4.1 Category:

Incorrect error handling: the exception was handled, but the file should also have been validated.