Cassandra node A runs RepairSession , and Cassandra node B is involved
node B fails during the RepairSession
Observed on a 2 DC cluster, during a repair that spanned the DC's.
INFO [AntiEntropyStage:1] 2011-11-28 06:22:56,225 StreamingRepairTask.java (line 136) [streaming task #69187510-1989-11e1-0000-5ff37d368cb6] Forwarding streaming repair of 8602
ranges to /10.6.130.70 (to be streamed with /10.37.114.10)
...
INFO [AntiEntropyStage:66] 2011-11-29 11:20:57,109 StreamingRepairTask.java (line 253) [streaming task #69187510-1989-11e1-0000-5ff37d368cb6] task succeeded
ERROR [AntiEntropyStage:66] 2011-11-29 11:20:57,109 AbstractCassandraDaemon.java (line 133) Fatal exception in thread Thread[AntiEntropyStage:66,5,main]
java.lang.NullPointerException
at org.apache.cassandra.service.AntiEntropyService$RepairSession.completed(AntiEntropyService.java:712)
at org.apache.cassandra.service.AntiEntropyService$RepairSession$Differencer$1.run(AntiEntropyService.java:912)
at org.apache.cassandra.streaming.StreamingRepairTask$2.run(StreamingRepairTask.java:186)
at org.apache.cassandra.streaming.StreamingRepairTask$StreamingRepairResponse.doVerb(StreamingRepairTask.java:255)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:679)
“One of the nodes involved in the repair session failed: (Not sure if this is from the same repair session as the streaming task above, but it illustrates the issue)”
ERROR [AntiEntropySessions:1] 2011-11-28 19:39:52,507 AntiEntropyService.java (line 688) [repair #2bf19860-197f-11e1-0000-5ff37d368cb6] session completed with the following error
java.io.IOException: Endpoint /10.29.60.10 died
at org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(AntiEntropyService.java:725)
at org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(AntiEntropyService.java:762)
at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:192)
at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:559)
at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:62)
at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:167)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:679)
(2.1) For ease of description, we suppose node A issues the Repair, node B is one of the nodes which are involved in the repair session, and node B fails during the repair. All the logs above come from node A.
Based on the callstacks from the two exceptions NPE & IOException, we can get the following control flow.
· When node A receives the streaming repair response from the other node, it calls StreamingRepairResponse.doVerb() to process the response. Then Differencer.run() will finally call completed(Differencer.this). The NPE happens when executing job.completedSynchronization(), since the “job” we get from activeJobs.get() is null.
· On node A, the GossipTask is running periodically. When a node is detected as failed (its PHI value exceeds a certain threshold), RepairSession.convict() will call failedNode() to stop all the jobs in this repair session, and activeJobs.clear() is executed.
So there is a race between activeJobs.get() and activeJobs.clear(): If node A fails node B before activeJob.get(), the object job is null. Then NPE happens when calling job.completedSynchronization().
How does the IOException print?
When node A issues a repair session, RepairSession.runMayThrow() will be called:
During the repair session issued by node A, node B fails. Node A fails node B and will stop all the jobs by calling activeJobs.clear(). At this time, node A receives a stream repair response from a certain node, and will call activeJobs.get(). If the activeJobs are cleared already, the result of activeJobs.get will be null. Then calling the method of this null object throw NPE.
In this bug, the IOException is allowed, and NPE is not expected.
The patch will eliminate the NPE by checking for null.
• src/java/org/apache/cassandra/service/AntiEntropyService.java
RepairSession.completed()
Function completed() is on the callstack of NPE.
Cassandra node(s)