Block missing after NN enters active state in HA.
1. The original active NN failed, so a failover occurred.
2. The new active NN did not detect the DN’s heartbeat.
3. Blocks were reported missing.
Basic HA configuration needed.
1. Stop the DN.
2. Restart the NN (or do nothing if there is already a second, standby NN).
3. Transition the restarted NN or the standby NN to the active state.
4. Start the DN.
5. A block is reported missing because the DN did not have time to connect to the active NN.
1. A working HA HDFS deployment is required.
2. A switch to a standby NN, or a restart of the NN, must then occur.
3. The NN must become active before the DN connects and sends a heartbeat.
1 (2 nodes, 1 DN and 1 NN)
HDFS entered safe mode when the original NN went down. While HDFS was still in safe mode, a missing block was reported.
From the symptom, we know that when HA HDFS switches NNs, ownership of the DN needs to be transferred to a “backup” NN. There is a delay because the new NN must wait for a heartbeat from the connected DN. Without this heartbeat, the NN assumes the DN went down and the data has been lost.
From the logs, we can see that before the NN detected the DN’s heartbeat, it assumed that the DN was down (i.e. alive NN = 0) and therefore treated the data on the DN as lost. The data itself was still intact on the DN, so the missing-block report was a semantic error.
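The behaviour described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyNameNode`, `receiveBlockReport`, `missingBlocks`, and `FailoverDemo` are hypothetical names invented for illustration and are not HDFS source code. The model assumes the newly active NN knows which blocks should exist but learns replica locations only once a DN connects and reports.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Toy model only: these classes are hypothetical and are not HDFS source code.
class ToyNameNode {
    // Block id -> ids of the DNs that have reported a replica to THIS NameNode.
    private final Map<String, Set<String>> locations = new HashMap<>();

    // Assumption: the new active NN knows which blocks should exist,
    // but starts with no replica locations until a DN reports them.
    ToyNameNode(List<String> blocksInNamespace) {
        for (String block : blocksInNamespace) {
            locations.put(block, new HashSet<>());
        }
    }

    // Called when a DN connects and sends its block report.
    void receiveBlockReport(String dnId, List<String> blocks) {
        for (String block : blocks) {
            Set<String> dns = locations.get(block);
            if (dns != null) {
                dns.add(dnId);
            }
        }
    }

    // Naive check: any block with no known replica is counted as missing.
    List<String> missingBlocks() {
        return locations.entrySet().stream()
                .filter(e -> e.getValue().isEmpty())
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }
}

class FailoverDemo {
    public static void main(String[] args) {
        List<String> blocks = List.of("blk_1", "blk_2");
        ToyNameNode newActive = new ToyNameNode(blocks);

        // The NN became active before the DN connected: every block looks missing.
        System.out.println("before DN connects: " + newActive.missingBlocks());

        // Once the DN reports, the same check finds nothing missing.
        newActive.receiveBlockReport("dn1", blocks);
        System.out.println("after DN connects: " + newActive.missingBlocks());
    }
}
```

Running `FailoverDemo` prints `[blk_1, blk_2]` before the DN connects and `[]` afterwards, which mirrors the false alarm: the data never left the DN, only the NN's view of it was incomplete.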
The correct procedure would be to wait for the DN to connect, or at least not to assume that a block is missing while in safe mode before the DN has finished connecting to the NN. At this point we have found the root cause.
Incorrect error handling.
Error: 1. Failover (the original NN failed, so a failover occurred).
2. The new active NN did not detect the DN’s heartbeat.
Testing this requires an understanding of the real-world error scenario; it cannot be covered by statement coverage...
The fix is simple. Put an if(!isInSafeMode) check in the code to prevent missing blocks from being reported while in safe mode. Only perform the check once the NN is out of safe mode and all DNs have connected to the now active NN.
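A minimal sketch of such a guard might look like the following. It is not the actual HDFS patch: `MissingBlockReporter` and all of its methods are hypothetical names, and the sketch assumes that leaving safe mode implies the DNs have already connected and reported.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the guard; the names are illustrative, not the actual HDFS patch.
class MissingBlockReporter {
    // Assumption: a newly active NN starts in safe mode.
    private boolean inSafeMode = true;
    private final List<String> blocksWithoutReplicas = new ArrayList<>();

    boolean isInSafeMode() {
        return inSafeMode;
    }

    // Assumption: called only after the DNs have connected and reported.
    void leaveSafeMode() {
        inSafeMode = false;
    }

    void markWithoutReplica(String blockId) {
        blocksWithoutReplicas.add(blockId);
    }

    void markReplicaFound(String blockId) {
        blocksWithoutReplicas.remove(blockId);
    }

    // The guard: report missing blocks only once safe mode has been left.
    List<String> reportMissingBlocks() {
        if (!isInSafeMode()) {
            return new ArrayList<>(blocksWithoutReplicas);
        }
        return new ArrayList<>();
    }
}

class SafeModeGuardDemo {
    public static void main(String[] args) {
        MissingBlockReporter reporter = new MissingBlockReporter();
        reporter.markWithoutReplica("blk_1");

        // Still in safe mode: nothing is reported, even though no replica is known yet.
        System.out.println("in safe mode: " + reporter.reportMissingBlocks());

        // The DN connects and reports blk_1, then the NN leaves safe mode.
        reporter.markReplicaFound("blk_1");
        reporter.leaveSafeMode();
        System.out.println("after safe mode: " + reporter.reportMissingBlocks());
    }
}
```

With the guard in place, a block that is genuinely without replicas after safe mode ends is still reported; only the premature report during safe mode is suppressed.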