HBase-2545 Report

1. Symptom

Unresponsive region server; the hang resembles a deadlock.

The region server spins in an infinite loop (hangs) while serving a client’s Get or Scan request.

1.1 Severity

Blocker

1.2 Was there an exception thrown?

No

1.2.1 Were there multiple exceptions?

No

1.3 Scope of the failure

Client Get or Scan requests against certain tables.

2. How to reproduce this failure

2.0 Version

0.20.4

2.1 Configuration

Standard

2.2 Reproduction procedure

The test case “TestExplicitColumnTracker” contains a dataset that builds a table which triggers this infinite loop.
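
Because the failing input makes the call spin forever, a reproduction harness needs a time limit so that a hang shows up as a failed check rather than a stuck run. The following is a minimal sketch of such a guard, assuming only java.util.concurrent and no HBase classes. HangGuard and finishesWithin are hypothetical names, and the spinning task merely stands in for the looping call.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of HBase: runs a task under a time limit.
class HangGuard {
    /** Returns true if the task finishes within the timeout, false if it is still running. */
    static boolean finishesWithin(Runnable task, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "hang-guard");
            t.setDaemon(true); // a stuck task must not keep the JVM alive
            return t;
        });
        Future<?> result = pool.submit(task);
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            result.cancel(true); // interrupts the worker; only stops tasks that check interrupts
            return false;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A task that returns immediately.
        System.out.println(finishesWithin(() -> { }, 1000));
        // A stand-in for the looping call: spins until interrupted.
        System.out.println(finishesWithin(() -> {
            while (!Thread.currentThread().isInterrupted()) { }
        }, 200));
    }
}
```

A real infinite loop does not check for interrupts, which is why the worker is a daemon thread: the guard can report the hang and let the JVM exit even though the stuck thread never stops.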

2.2.1 Timing order

Single event

2.2.2 Events order externally controllable?

Yes

2.3 Can the logs tell how to reproduce the failure?

Hard -- This will be an interesting case to test our failure repro tool.

2.4 How many machines needed?

1 (region server + client)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

Users took jstack:

"IPC Server handler 10 on 60020" daemon prio=10 tid=0x00002aacb6844000 nid=0xcc2 runnable [0x0000000042f56000]

   java.lang.Thread.State: RUNNABLE
        at org.apache.hadoop.hbase.regionserver.ExplicitColumnTracker.checkColumn(ExplicitColumnTracker.java:128)
        at org.apache.hadoop.hbase.regionserver.ScanQueryMatcher.match(ScanQueryMatcher.java:165)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:176)
        - locked <0x00002aaacd690ef0> (a org.apache.hadoop.hbase.regionserver.StoreScanner)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:106)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.nextInternal(HRegion.java:1923)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.next(HRegion.java:1887)
        at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2507)
        at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2493)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1742)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)

Many threads are stuck at this same point.
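
When a dump contains many threads, tallying the RUNNABLE ones by their top frame makes a hot spot like this stand out. The following is a minimal sketch, not part of HBase or jstack. The class JstackTopFrames and its method are hypothetical, the sample dump is shortened and its second and third threads are made up for illustration, and the parser assumes the usual jstack text layout (quoted thread header, a Thread.State line, then "at" frames).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: counts RUNNABLE threads by top stack frame in jstack-style text.
class JstackTopFrames {
    static Map<String, Integer> runnableTopFrames(String dump) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        boolean runnable = false; // is the current thread RUNNABLE?
        boolean needTop = false;  // has the current thread's top frame not been seen yet?
        for (String raw : dump.split("\n")) {
            String line = raw.trim();
            if (line.startsWith("\"")) {
                // A quoted name starts a new thread entry.
                runnable = false;
                needTop = true;
            } else if (line.startsWith("java.lang.Thread.State:")) {
                runnable = line.contains("RUNNABLE");
            } else if (line.startsWith("at ") && needTop) {
                // The first "at" line after the header is the top frame.
                needTop = false;
                if (runnable) {
                    counts.merge(line.substring(3), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String hot = "org.apache.hadoop.hbase.regionserver.ExplicitColumnTracker"
                + ".checkColumn(ExplicitColumnTracker.java:128)";
        String dump = String.join("\n",
            "\"IPC Server handler 10 on 60020\" daemon prio=10 runnable",
            "   java.lang.Thread.State: RUNNABLE",
            "        at " + hot,
            "        at org.apache.hadoop.hbase.regionserver.ScanQueryMatcher.match(ScanQueryMatcher.java:165)",
            "\"IPC Server handler 11 on 60020\" daemon prio=10 runnable", // made-up second thread
            "   java.lang.Thread.State: RUNNABLE",
            "        at " + hot,
            "\"idle worker\" daemon prio=10 waiting",                     // made-up idle thread
            "   java.lang.Thread.State: WAITING (parking)",
            "        at sun.misc.Unsafe.park(Native Method)");
        // Prints one line per hot frame with the number of RUNNABLE threads in it.
        runnableTopFrames(dump).forEach((frame, n) -> System.out.println(n + "  " + frame));
    }
}
```

A frame that many RUNNABLE threads share across repeated dumps points to a busy loop rather than a lock wait, which is the distinction that matters for this report.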

3.2 Backward inference

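The jstack output is the key clue: the stuck threads are RUNNABLE inside ExplicitColumnTracker.checkColumn, which led developers straight to the code (shown as a diff with the fix applied):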

   public MatchCode checkColumn(byte [] bytes, int offset, int length) {
+    boolean recursive;
     do {
+      recursive = false;
       // No more columns left, we are done with this query
       if(this.columns.size() == 0) {
         return MatchCode.DONE; // done_row
       }
       // No more columns to match against, done with storefile
       if(this.column == null) {
         return MatchCode.NEXT; // done_row
       }
       // Compare specific column to current column
       int ret = Bytes.compareTo(column.getBuffer(), column.getOffset(),
           column.getLength(), bytes, offset, length);
       // Matches, decrement versions left and include
       if(ret == 0) {
         if(this.column.decrement() == 0) {
           // Done with versions for this column
           this.columns.remove(this.index);
           if(this.columns.size() == this.index) {
             // Will not hit any more columns in this storefile
             this.column = null;
           } else {
             this.column = this.columns.get(this.index);
           }
         }
         return MatchCode.INCLUDE;
       }
       // Specified column is bigger than current column
       // Move down current column and check again
       if(ret <= -1) {
         if(++this.index == this.columns.size()) {
           // No more to match, do not include, done with storefile
           return MatchCode.NEXT; // done_row
         }
         this.column = this.columns.get(this.index);
         recursive = true;
         continue;
       }
-    } while(true);
+    } while(recursive);
+    return MatchCode.SKIP; // skip to next column, with hint?
   }

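  --- With the original while(true), the loop never exits when ret > 0 (the tracked column sorts after the incoming column): neither branch matches, no state changes, and the same comparison repeats forever.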
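
The control flow of checkColumn can be reduced to a small stand-alone model. This is a sketch under stated assumptions, not the HBase class: ColumnTrackerSketch is hypothetical, columns are plain strings, version counting is omitted, and checkColumnOriginal adds an iteration cap (absent from the real code) so that the demonstration terminates.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of the column-matching loop.
class ColumnTrackerSketch {
    enum MatchCode { INCLUDE, SKIP, NEXT, DONE }

    private final List<String> columns; // tracked columns, sorted ascending
    private int index = 0;              // position of the current tracked column

    ColumnTrackerSketch(String... sortedColumns) {
        this.columns = Arrays.asList(sortedColumns);
    }

    /** Patched shape: repeat only after advancing to the next tracked column. */
    MatchCode checkColumn(String incoming) {
        boolean recursive;
        do {
            recursive = false;
            if (columns.isEmpty()) return MatchCode.DONE;
            if (index >= columns.size()) return MatchCode.NEXT;
            int ret = columns.get(index).compareTo(incoming);
            if (ret == 0) return MatchCode.INCLUDE;
            if (ret < 0) {
                // Incoming column is bigger: move to the next tracked column and retry.
                if (++index == columns.size()) return MatchCode.NEXT;
                recursive = true;
            }
        } while (recursive);
        // ret > 0: incoming column is smaller than the tracked one, so skip it.
        return MatchCode.SKIP;
    }

    /** Original shape (loop forever), capped here; returns null when the cap is hit. */
    MatchCode checkColumnOriginal(String incoming, int maxIterations) {
        for (int i = 0; i < maxIterations; i++) {
            if (columns.isEmpty()) return MatchCode.DONE;
            if (index >= columns.size()) return MatchCode.NEXT;
            int ret = columns.get(index).compareTo(incoming);
            if (ret == 0) return MatchCode.INCLUDE;
            if (ret < 0) {
                if (++index == columns.size()) return MatchCode.NEXT;
                continue;
            }
            // ret > 0: no branch applies and no state changes, so the loop repeats.
        }
        return null;
    }

    public static void main(String[] args) {
        ColumnTrackerSketch patched = new ColumnTrackerSketch("b", "d");
        System.out.println(patched.checkColumn("a"));                    // prints SKIP
        ColumnTrackerSketch original = new ColumnTrackerSketch("b", "d");
        System.out.println(original.checkColumnOriginal("a", 1_000_000)); // prints null
    }
}
```

In this model any incoming column that sorts before the current tracked column exhausts the cap in the original shape, while the patched shape returns SKIP on the first pass.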

4. Root cause

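Infinite loop in ExplicitColumnTracker.checkColumn on certain data: when the incoming column sorts before the current tracked column, no branch of the loop makes progress.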

4.1 Category:

Infinite loop

5. Fix

5.1 How?

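Exit the loop unless the tracker has just advanced to the next column: the new recursive flag is reset at the top of each iteration and set only before continue, so every other case falls out of the loop and returns MatchCode.SKIP.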