HDFS-3921 Report

1. Symptom

Blocks are reported missing after an NN enters the active state in an HA deployment.

1.1 Severity

Major

1.2 Was there an exception thrown?

IOException

1.2.1 Were there multiple exceptions?

Yes.

1. The original active NN failed (so failover occurred).

2. The NN did not detect the DN’s heartbeat.

3. A block was reported missing.

2. How to reproduce this failure

2.0 Version

2.0.2-alpha

2.1 Configuration

A basic HA configuration is needed.

2.2 Reproduction procedure

1. Stop the DN.

2. Restart the NN (or do nothing if there is already a second, standby NN).

3. Transition the restarted NN or the standby NN to the active state.

4. Start the DN.

5. A block is reported missing because the DN did not have time to connect to the active NN.

2.2.1 Timing order

1. A working HA HDFS must be in place.

2. Then either switch to a standby NN or restart the NN.

3. The NN must be switched to active before the DN connects or sends a heartbeat.

2.2.2 Events order externally controllable?

Yes

2.3 Can the logs tell how to reproduce the failure?

Yes

2.4 How many machines needed?

1 (2 nodes, 1 DN and 1 NN)

3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

HDFS entered safe mode when the original NN went down. While HDFS was still in safe mode, a missing block was reported.

3.2 Backward inference

From the symptom, we know that when HA HDFS switches NN, the DN ownership needs to be transferred to a “backup” NN. There will be a delay because the new NN needs to wait for the heartbeat signal from the connected DN. Without this heartbeat, the NN assumes the DN went down and its data has been lost.

From the logs, we can see that before the NN detected the DN’s heartbeat, it assumed that the DN had gone down (i.e. alive DN = 0). Thus the data on the DN was considered lost. The data itself was in fact intact, so this was a semantic error.

The correct procedure would be to wait for the DN to connect, or at least not to assume a block is missing in safe mode before the DN has finished connecting to the NN. At this point we have found the root cause.
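The inference above can be illustrated with a small toy model. This is only a sketch of the event ordering, not HDFS code: ToyNameNode, receiveBlockReport, and countMissingBlocks are hypothetical names invented for this example. It shows how a naive check counts every block as missing when it runs before any DN has reported in.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of the race; none of these names are real HDFS classes.
class ToyNameNode {
    // Blocks the namespace expects to exist.
    private final Set<String> expectedBlocks = new HashSet<>();
    // Block id -> number of replicas reported by DNs that have connected so far.
    private final Map<String, Integer> reportedReplicas = new HashMap<>();
    private int liveDataNodes = 0;

    void addExpectedBlock(String blockId) {
        expectedBlocks.add(blockId);
    }

    // Called when a DN connects to this NN and reports the blocks it holds.
    void receiveBlockReport(Set<String> blocksOnDataNode) {
        liveDataNodes++;
        for (String blockId : blocksOnDataNode) {
            reportedReplicas.merge(blockId, 1, Integer::sum);
        }
    }

    int liveDataNodes() {
        return liveDataNodes;
    }

    // Naive check: any expected block with no reported replica counts as missing,
    // regardless of whether the DNs have had a chance to connect yet.
    int countMissingBlocks() {
        int missing = 0;
        for (String blockId : expectedBlocks) {
            if (reportedReplicas.getOrDefault(blockId, 0) == 0) {
                missing++;
            }
        }
        return missing;
    }
}

class FailoverRaceDemo {
    public static void main(String[] args) {
        ToyNameNode newActive = new ToyNameNode();
        newActive.addExpectedBlock("blk_1");
        newActive.addExpectedBlock("blk_2");

        // The NN is made active before the DN has connected: alive DN = 0.
        System.out.println("missing before DN report: " + newActive.countMissingBlocks()); // 2

        // The DN connects and reports its blocks; the data was never actually lost.
        newActive.receiveBlockReport(new HashSet<>(Arrays.asList("blk_1", "blk_2")));
        System.out.println("missing after DN report: " + newActive.countMissingBlocks()); // 0
    }
}
```

In this model the blocks are intact the whole time; only the timing of the check makes them appear missing.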

4. Root cause

4.1 Category:

Incorrect error handling.

Errors:

1. Failover (the original NN failed, so there was a failover).

2. The new active NN did not detect the DN’s heartbeat.

Testing this requires an understanding of the real-world error scenario; it cannot be covered by statement coverage.

4.2 Are there multiple faults?

Yes

5. Fix

5.1 How?

The fix is simple: add an if (!isInSafeMode) check to the code so that missing blocks are not reported while in safe mode. Only perform the check once the NN is out of safe mode and all DNs have connected to the now-active NN.
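A minimal sketch of such a guard follows. It is not the actual HDFS-3921 patch: the class GuardedMissingBlockCheck and its methods are hypothetical names used only to illustrate the idea, and the assumption that a newly active NN starts in safe mode is part of the model.

```java
// Hypothetical sketch of the safe-mode guard; not the real HDFS implementation.
class GuardedMissingBlockCheck {
    // Assumption for this model: the newly active NN starts out in safe mode.
    private boolean inSafeMode = true;
    // Number of expected blocks for which no replica has been reported yet.
    private int blocksWithoutReplica;

    GuardedMissingBlockCheck(int blocksWithoutReplica) {
        this.blocksWithoutReplica = blocksWithoutReplica;
    }

    boolean isInSafeMode() {
        return inSafeMode;
    }

    // Called once enough DNs have connected and reported their blocks.
    void leaveSafeMode() {
        inSafeMode = false;
    }

    // Called when a connecting DN reports a replica of a previously unseen block.
    void replicaReported() {
        if (blocksWithoutReplica > 0) {
            blocksWithoutReplica--;
        }
    }

    // The guard: missing blocks are only reported outside safe mode.
    int getMissingBlocksCount() {
        if (!isInSafeMode()) {
            return blocksWithoutReplica;
        }
        return 0;
    }
}

class SafeModeGuardDemo {
    public static void main(String[] args) {
        GuardedMissingBlockCheck check = new GuardedMissingBlockCheck(2);

        // Still in safe mode, no DN connected yet: nothing is reported as missing.
        System.out.println(check.getMissingBlocksCount()); // 0

        // One of the two blocks is reported by a DN, then safe mode ends.
        check.replicaReported();
        check.leaveSafeMode();

        // Only the block that is still unaccounted for is reported.
        System.out.println(check.getMissingBlocksCount()); // 1
    }
}
```

With the guard in place, a block that is merely unreported during failover is not flagged, while a block that is still unaccounted for after safe mode ends is.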