HBase-2174 Report

1. Symptom

Entire cluster goes down.

1.1 Severity

Critical

1.2 Was there exception thrown? (Exception column in the spreadsheet)

No.

HMaster treats detecting an unknown Region Server as a normal event: it does some logging and then tells the RS to restart.
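A minimal, self-contained sketch of this pattern (hypothetical names, not the verbatim HBase 0.20 ServerManager code): the master keys Region Servers by server name, and a report whose name is not in the map is merely logged and answered with a restart message.

    import java.util.HashSet;
    import java.util.Set;

    class ServerManagerSketch {
      enum Msg { CALL_SERVER_STARTUP, OK }

      private final Set<String> knownServers = new HashSet<>();

      // Called when the master receives an RS start message.
      void registerServer(String serverName) {
        knownServers.add(serverName);
      }

      // Called on each periodic report from an RS.
      Msg regionServerReport(String serverName) {
        if (!knownServers.contains(serverName)) {
          // Corresponds to "received server report from unknown server: ..."
          System.out.println("received server report from unknown server: " + serverName);
          return Msg.CALL_SERVER_STARTUP; // tell the RS to restart itself
        }
        return Msg.OK; // proceed with normal report processing
      }
    }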

1.2.1 Were there multiple exceptions?

Yes.

NotServingRegionException (mainly), remote exceptions decoded via RemoteExceptionHandler.decodeRemoteException, KeeperException, etc.

1.3 Scope of the failure (e.g., single client, all clients, single file, entire fs, etc.)

all clients

2. How to reproduce this failure

2.0 Version

0.20.0

2.1 Configuration

standard

# of Nodes?

2 (RS + HMaster)

2.2 Reproduction procedure

Start the cluster, then make DNS flaky (for example, make reverse lookups fail intermittently).
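Real DNS flakiness is hard to trigger on demand; one way to approximate it deterministically in a test is a wrapper that intermittently fails reverse lookups. The sketch below is hypothetical (FlakyResolver is not an HBase class) and mimics java.net.InetAddress#getHostName, which falls back to the textual IP address when the reverse lookup fails.

    import java.util.Random;

    class FlakyResolver {
      private final Random random = new Random();
      private final double failureRate;

      FlakyResolver(double failureRate) {
        this.failureRate = failureRate;
      }

      // Returns the real hostname on success; on a simulated failure,
      // returns the raw IP string, like InetAddress#getHostName does.
      String reverseLookup(String ip, String realHostname) {
        return random.nextDouble() < failureRate ? ip : realHostname;
      }
    }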

2.2.1 Timing order (Order important column)

NA

2.2.2 Events order externally controllable? (Order externally controllable? column)

Yes

2.3 Can the logs tell how to reproduce the failure?

Yes.

2.4 How many machines needed?

2

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

The entire cluster shut down, and all region servers were detected as unknown servers. (Some repeated log entries have been removed.)

2009-07-17 16:59:44,958 DEBUG org.apache.hadoop.hbase.master.ServerManager: received server report from unknown server: 10.16.39.18,60020,1247875177274

2009-07-17 16:59:45,058 INFO org.apache.hadoop.hbase.master.ServerManager: Received start message from: ha06.socialmedia.com,60020,1247875185045

2009-07-17 16:59:45,061 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Updated ZNode /hbase/rs/1247875185045 with data 10.16.39.18:60020

2009-07-17 16:59:45,079 DEBUG org.apache.hadoop.hbase.master.RegionManager: Server is overloaded. Server load: 70 avg: 57.07142857142857, slop: 0.1

2009-07-17 16:59:45,079 DEBUG org.apache.hadoop.hbase.master.RegionManager: Choosing to reassign 12 regions. mostLoadedRegions has 10 regions in it.

2009-07-17 16:59:45,079 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to close region users,facebook:1289561688,1245991121341

2009-07-17 16:59:45,079 INFO org.apache.hadoop.hbase.master.RegionManager: Skipped 0 region(s) that are in transition states

2009-07-17 16:59:45,105 INFO org.apache.hadoop.hbase.master.ServerManager: Received start message from: ha09.socialmedia.com,60020,1247875185098

2009-07-17 16:59:45,107 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Updated ZNode /hbase/rs/1247875185098 with data 10.16.39.22:60020

2009-07-17 16:59:45,431 DEBUG org.apache.hadoop.hbase.master.RegionManager: Server is overloaded. Server load: 68 avg: 47.0, slop: 0.1

2009-07-17 16:59:45,431 DEBUG org.apache.hadoop.hbase.master.RegionManager: Choosing to reassign 21 regions. mostLoadedRegions has 10 regions in it.

2009-07-17 16:59:45,431 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to close region users,facebook:1425001306,1246002548333

2009-07-17 16:59:46,538 DEBUG org.apache.hadoop.hbase.master.HMaster: Processing todo: ProcessRegionClose of users,myspace:82740718,1246002074682, false, reassign: true

2009-07-17 16:59:46,538 INFO org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_CLOSE: users,facebook:61413610,1246002297929 from ha04.socialmedia.com,60020,1247873619950; 3 of 8

2009-07-17 16:59:49,574 DEBUG org.apache.hadoop.hbase.master.RegionManager: Going to close region users,facebook:672476541,1246086598149

2009-07-17 16:59:49,574 INFO org.apache.hadoop.hbase.master.RegionManager: Skipped 0 region(s) that are in transition states

2009-07-17 16:59:49,971 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Removed .META.,,1 for tableName=.META. from cache because of users,myspace:20187449,1245990636752

2009-07-17 16:59:50,017 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$TableServers: locateRegionInMeta attempt 0 of 3 failed; retrying after sleep of 2000

org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: -ROOT-,,0
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2251)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getClosestRowBefore(HRegionServer.java:1737)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:643)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:913)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:621)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:526)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.relocateRegion(HConnectionManager.java:499)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocation(HConnectionManager.java:346)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionLocationForRowWithRetries(HConnectionManager.java:981)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1066)
    at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:583)
    at org.apache.hadoop.hbase.client.HTable.put(HTable.java:449)
    at org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:239)
    at org.apache.hadoop.hbase.RegionHistorian.add(RegionHistorian.java:217)
    at org.apache.hadoop.hbase.RegionHistorian.addRegionAssignment(RegionHistorian.java:143)
    at org.apache.hadoop.hbase.master.RegionManager$1.process(RegionManager.java:338)
    at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:492)
    at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:426)

3.2 Backward inference

Directly connected. The root cause is that the DNS failure made HMaster fail to recognize the Region Servers, so HMaster told each RS to restart.

4. Root cause

Incorrectly formatted host information returned by DNS is not properly handled by HMaster, which causes HMaster to shut down all of the existing Region Servers.

An RS periodically sends a regionServerReport to the master, and the master contacts DNS to look up the RS's host information. When DNS is flaky, the master gets back incorrectly formatted server information, which makes it conclude that the RS is unrecognized, so it tells the RS to restart. This repeats and eventually shuts down the entire cluster.
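The mechanism can be shown in isolation. In this self-contained sketch (all names hypothetical), the master derives the RS key from a reverse DNS lookup on every report, so a single flaky lookup changes the key and the RS becomes "unknown", matching the IP-based name in the log above.

    import java.util.HashMap;
    import java.util.Map;

    public class UnknownServerDemo {
      // Stand-in for the reverse lookup the master performs; a failed
      // lookup falls back to the raw IP address.
      static String reverseLookup(String ip, boolean dnsHealthy) {
        return dnsHealthy ? "ha06.socialmedia.com" : ip;
      }

      // Server names have the form host,port,startCode.
      static String serverName(String host, int port, long startCode) {
        return host + "," + port + "," + startCode;
      }

      public static void main(String[] args) {
        Map<String, Boolean> knownServers = new HashMap<>();

        // Registration: DNS is healthy, so the recorded key is hostname-based.
        String atStartup = serverName(reverseLookup("10.16.39.18", true), 60020, 1247875185045L);
        knownServers.put(atStartup, true);

        // A later report: DNS is flaky, so the derived key is IP-based.
        String atReport = serverName(reverseLookup("10.16.39.18", false), 60020, 1247875185045L);

        // Prints "known? false": the master no longer recognizes the RS
        // and tells it to restart.
        System.out.println("known? " + knownServers.containsKey(atReport));
      }
    }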

4.1 Category:

Incorrect error handling. The code handled DNS server lookup errors, but did not handle the case where the DNS mapping was already cached locally. In the patch below, the lines prefixed with "+" supply the missing handling:

    if (this.serverName == null) {
+     // if we have the hostname of the RS, use it
+     if (this.name != null) {
+       this.serverName = getServerName(this.name, this.serverAddress.getPort(), this.startCode);
+     }
+     // go to DNS name resolution only if we don't have the name of the RS
+     else {
        this.serverName = getServerName(this.serverAddress, this.startCode);
+     }
    }

5. Fix

5.1 How?

Instead of asking DNS for the information every time, HMaster first checks whether the RS's hostname is already stored locally and uses it when available, falling back to DNS resolution only when it is not. This prevents HMaster from shutting down existing Region Servers whenever DNS misbehaves.
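Assembled from the patch fragment in section 4.1, the fixed method would read roughly as follows (a sketch: the enclosing class and the getServerName(...) overloads it calls are omitted).

    public String getServerName() {
      if (this.serverName == null) {
        // If we already have the hostname of the RS, use it directly.
        if (this.name != null) {
          this.serverName = getServerName(this.name, this.serverAddress.getPort(), this.startCode);
        } else {
          // Fall back to DNS name resolution only when no hostname is cached.
          this.serverName = getServerName(this.serverAddress, this.startCode);
        }
      }
      return this.serverName;
    }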