CASSANDRA-5432

1. Symptom

After upgrading from Cassandra 1.2.3 to 1.2.4, the nodetool repair command hangs on every node of a Cassandra cluster running in a sandboxed EC2 environment (TCP connections from the external network are blocked).

Background:

nodetool repair synchronizes the replicas in a Cassandra cluster to resolve missing or inconsistent data.

 

Category (in the spreadsheet):

Wrong Computation

1.1 Severity

critical

1.2 Was there exception thrown? (Exception column in the spreadsheet)

No. No error or exception is reported. Only the following log messages are printed, and none of them indicates an error.

INFO [Thread-42214] 2013-04-05 23:30:27,785 StorageService.java (line 2379) Starting repair command #4, repairing 1 ranges for keyspace cardspring_production

INFO [AntiEntropySessions:7] 2013-04-05 23:30:27,789 AntiEntropyService.java (line 652) repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242 new session: will sync /X.X.X.190, /X.X.X.43, /X.X.X.56 on range (1808575600,42535295865117307932921825930779602032] for keyspace_production.[comma separated list of CFs]

INFO [AntiEntropySessions:7] 2013-04-05 23:30:27,790 AntiEntropyService.java (line 858) repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242 requesting merkle trees for BusinessConnectionIndicesEntries (to [/X.X.X.43, /X.X.X.56, /X.X.X.190])

INFO [AntiEntropyStage:1] 2013-04-05 23:30:28,086 AntiEntropyService.java (line 214) repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242 Received merkle tree for ColumnFamilyName from /X.X.X.43

INFO [AntiEntropyStage:1] 2013-04-05 23:30:28,147 AntiEntropyService.java (line 214) repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242 Received merkle tree for ColumnFamilyName from /X.X.X.56

 

1.3 Scope of the failure (e.g., single client, all clients, single file, entire fs, etc.)

entire fs

 

Catastrophic? (spreadsheet column)

no (because it’s down for upgrade anyway)

2. How to reproduce this failure

2.0 Version

1.2.4

2.1 Configuration

Standard configuration with AWS EC2 sandboxed network environment (TCP connection from external network is blocked)

 

# of Nodes?

1

2.2 Reproduction procedure

1) Run nodetool repair (feature start)
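Because the failure is a silent hang rather than an error, a scripted reproduction needs a timeout to tell "stuck" from "still running". The following is a minimal Python sketch of that idea, not an official procedure: the helper names `run_with_timeout` and `repair_command`, and the choice of timeout, are illustrative assumptions. Only the `nodetool -h <host> repair <keyspace>` invocation is taken from the tool itself.

```python
import subprocess


def repair_command(host, keyspace):
    """Build a nodetool repair invocation for one keyspace on one host."""
    return ["nodetool", "-h", host, "repair", keyspace]


def run_with_timeout(cmd, timeout_s):
    """Run cmd and classify the outcome.

    Returns "hung" if the command does not finish within timeout_s seconds
    (the child process is killed), "ok" on exit status 0, "failed" otherwise.
    """
    try:
        completed = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "hung"
    return "ok" if completed.returncode == 0 else "failed"
```

A call such as `run_with_timeout(repair_command(host, keyspace), timeout_s)` would then report "hung" in the sandboxed setup; a suitable timeout depends on the data size and has to be chosen per cluster.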

Num triggering events

1

 

2.2.1 Timing order (Order important column)

NA

2.2.2 Events order externally controllable? (Order externally controllable? column)

yes

2.3 Can the logs tell how to reproduce the failure?

No. The cause was found only by manually guessing, testing, and reconfiguring the network environment.

2.4 How many machines needed?

2

2.5 How hard is the reproduction?

easy

 

3. Diagnosis procedure

Error msg?

no

3.1 Detailed Symptom (where you start)

After upgrading from Cassandra 1.2.3 to 1.2.4, the nodetool repair command hangs on every node of a sandboxed Cassandra cluster (TCP connections from the external network are blocked). No configuration changes were made before or after the upgrade. The repair command simply gets stuck while the machine sits idle. There are no errors or warnings. The last log message before the repair command gets stuck is:

INFO [AntiEntropyStage:1] 2013-04-05 23:30:28,147 AntiEntropyService.java (line 214) repair #cc5a9aa0-9e48-11e2-98ba-11bde7670242 Received merkle tree for ColumnFamilyName from /X.X.X.56
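One way to see how far a stalled session got is to compare the endpoints a session requested Merkle trees from with the endpoints that replied. The sketch below does this in Python. It assumes the log line shapes quoted in this report; the function name `pending_endpoints` and the regular expressions are illustrative and not part of Cassandra. Column family names are ignored because they are anonymized inconsistently in the excerpt.

```python
import re
from collections import defaultdict

# Patterns assume the AntiEntropyService log lines quoted in this report.
REQUESTED = re.compile(
    r"repair #(\S+) requesting merkle trees for \S+ \(to \[([^\]]*)\]\)"
)
RECEIVED = re.compile(r"repair #(\S+) Received merkle tree for \S+ from (\S+)")


def pending_endpoints(log_lines):
    """Map each repair session id to the endpoints whose tree has not arrived."""
    requested = defaultdict(set)
    received = defaultdict(set)
    for line in log_lines:
        match = REQUESTED.search(line)
        if match:
            session, endpoints = match.groups()
            for endpoint in endpoints.split(","):
                if endpoint.strip():
                    requested[session].add(endpoint.strip())
            continue
        match = RECEIVED.search(line)
        if match:
            session, endpoint = match.groups()
            received[session].add(endpoint)
    pending = {}
    for session, endpoints in requested.items():
        missing = sorted(endpoints - received[session])
        if missing:
            pending[session] = missing
    return pending
```

Applied to the log excerpt in section 1.2, this reports /X.X.X.190 as the only endpoint whose Merkle tree never arrived. That narrows down where the session stalled, but it does not by itself reveal the network cause.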

3.2 Backward inference

While testing and debugging, the user opened all TCP ports to the external network (so the cluster was no longer sandboxed, which posed a security risk). After narrowing the problem down, the user learned that TCP port 7100 was the culprit: once TCP port 7100 was made accessible from the external network, the nodetool repair command worked flawlessly. Finally, a developer with domain knowledge realized that the bug was caused by the CASSANDRA-5171 optimization (Save EC2Snitch topology information in system table). After that optimization, OutboundTcpConnection tries to connect to the node's own public IP (XXX.YY.98.11). Since the Cassandra cluster is in a sandboxed environment, such a connection cannot be established. This optimization therefore caused nodetool repair to stop working in sandboxed EC2 environments.
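The port-by-port narrowing described above can be approximated with a plain TCP connection probe run from the node itself, once against its private address and once against its public address. This is a minimal Python sketch using only the standard library; the function name `can_connect` and the default timeout are illustrative assumptions, and the addresses to probe have to be supplied by the operator.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout seconds."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

In the sandboxed setup of this report, a probe of port 7100 on the node's private address would be expected to succeed, while the same probe against the node's own public address would be expected to fail until the port is opened externally.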

3.3 Are the printed log sufficient for diagnosis?

no

3.4 Are logs misleading?

No; they simply give no useful information.

3.5 Do we need to examine different component’s log for diagnosis?

no

3.6 Is it a multi-components failure?

no

3.7 How hard is the diagnosis?

Very hard. Need lots of domain knowledge. 

4. Root cause

The optimization in CASSANDRA-5171 changes the behavior of EC2Snitch. As a result, external TCP access on port 7100 is required, but the sandboxed environment disallows such access.
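The interaction between address choice and firewall can be illustrated with a toy model. This is not Cassandra's code: the `Node` type, the `reachable` firewall rule, and the `self_connect_ok` helper are all hypothetical, and the IP addresses are documentation placeholders. The model only captures the point made above, that a node dialing its own public IP depends on the port being open externally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    private_ip: str
    public_ip: str


def reachable(address, node, external_port_open):
    """Toy firewall: the private address always connects; the public
    address connects only if the port is open to the external network."""
    if address == node.private_ip:
        return True
    if address == node.public_ip:
        return external_port_open
    return False


def self_connect_ok(node, use_public_ip, external_port_open):
    """Would a node's connection to itself succeed under this toy model?"""
    address = node.public_ip if use_public_ip else node.private_ip
    return reachable(address, node, external_port_open)


# Hypothetical node with placeholder addresses.
node = Node("node-a", "10.0.0.1", "203.0.113.10")
before_change = self_connect_ok(node, use_public_ip=False, external_port_open=False)
after_change = self_connect_ok(node, use_public_ip=True, external_port_open=False)
port_opened = self_connect_ok(node, use_public_ip=True, external_port_open=True)
```

The three cases mirror the report: connecting via the private address works in the sandbox, connecting via the public address does not, and opening the port externally (the user's workaround) makes it work again at the cost of the sandbox.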

4.1 Category:

Semantic

4.2 Are there multiple faults?

no

4.3 Can we automatically test it?

yes

5. Fix

5.1 How?

CASSANDRA-5171 was rolled back, “de-optimizing” the change that broke the nodetool repair command in sandboxed environments. The optimization in CASSANDRA-5171 was not a big performance gain anyway.