CASSANDRA-6707

1. Symptom

AIOOBE (ArrayIndexOutOfBoundsException) when running select count(*), which counts the total number of rows.

Category (in the spreadsheet):

Must be one of the following:

early termination,

1.1 Severity

critical

1.2 Was there exception thrown? (Exception column in the spreadsheet)

yes, java.lang.ArrayIndexOutOfBoundsException

1.2.1 Were there multiple exceptions?

no

1.3 Was there a long propagation of the failure?

no

1.4 Scope of the failure (e.g., single client, all clients, single file, entire fs, etc.)

Clients running this command.

Catastrophic? (spreadsheet column)

no

2. How to reproduce this failure

0. upgrade one node from 1.2.16 to 2.0.5
1. cqlsh> select count(*) from cfs.sblocks; (issued with the 2.0.5 node as coordinator)

With the fix applied, the coordinator doesn't send a paging query to the 1.2 node.

2.0 Version

2.0.5

2.1 Configuration

2-node configuration: one node on version 1.2.16 and the other on 2.0.5. The 2.0.5 node serves as coordinator when issuing the select count(*) command. We also have to make sure the actual data is on the 1.2.16 node (see “Reproduction procedure” below for how).

# of Nodes?

2

2.2 Reproduction procedure

  1. Start two 1.2 nodes
  2. Insert 1m rows, verify select count(*) returns 1m
  3. Upgrade one node to cassandra-2.0 plus the patch, verify that select count(*) with the 2.0 node as the coordinator doesn't send a paging query to the 1.2 node

Events column in the spreadsheet.

0. upgrade from 1.2.16 to 2.0.5

0. insert data

1. cqlsh> select count(*) from cfs.sblocks;

Num triggering events

1

2.2.1 Timing order (Order important column)

NA

2.2.2 Events order externally controllable? (Order externally controllable? column)

yes

2.3 Can the logs tell how to reproduce the failure?

yes. The log shows that the paging query caused the problem.

2.4 How many machines needed?

2

3. Diagnosis procedure

Error msg?

yes. java.io.IOException, java.lang.RuntimeException, java.lang.IndexOutOfBoundsException

3.1 Detailed Symptom (where you start)

After upgrading one node from 1.2 to 2.0, the following query fails with a timeout:

[cqlsh 4.1.0 | Cassandra 2.0.5.1-SNAPSHOT | CQL spec 3.1.1 | Thrift protocol 19.39.0]

cqlsh> select count(*) from cfs.sblocks;

Request did not complete within rpc_timeout

The 1.2 node reports the following error:

java.io.IOException: java.io.IOException: java.lang.RuntimeException: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0

        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)

        at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)

        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:244)

        at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:538)

        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:197)

        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:418)

        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)

        at org.apache.hadoop.mapred.Child$4.run(Child.java:266)

        at java.security.AccessController.doPrivileged(Native Method)

        at javax.security.auth.Subject.doAs(Subject.java:415)

        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)

        at org.apache.hadoop.mapred.Child.main(Child.java:260)

Caused by: java.io.IOException: java.lang.RuntimeException: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0

        at org.apache.hadoop.hive.cassandra.cql3.input.HiveCqlInputFormat.getRecordReader(HiveCqlInputFormat.java:102)

        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:241)

        ... 9 more

Caused by: java.lang.RuntimeException: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0

        at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader.initialize(CqlPagingRecordReader.java:161)

        at org.apache.hadoop.hive.cassandra.cql3.input.CqlHiveRecordReader.initialize(CqlHiveRecordReader.java:91)

        at org.apache.hadoop.hive.cassandra.cql3.input.HiveCqlInputFormat.getRecordReader(HiveCqlInputFormat.java:96)

        ... 10 more

Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0

        at java.util.ArrayList.rangeCheck(ArrayList.java:604)

        at java.util.ArrayList.get(ArrayList.java:382)

        at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader.retrieveKeys(CqlPagingRecordReader.java:710)

        at org.apache.cassandra.hadoop.cql3.CqlPagingRecordReader.initialize(CqlPagingRecordReader.java:155)

        ... 12 more

3.2 Backward inference

The 2.0 coordinator blindly uses the new pagers. For a range query, this means sending a PagedRangeCommand, which 1.2 nodes don't have.

4. Root cause

This is a backward-compatibility issue. In the newer version, this command is handled internally by the paging function, which sends a paged query that the old-version node actually holding the data does not recognize.
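
One plausible way an unrecognized command can surface as an ArrayIndexOutOfBoundsException on the old node is sketched below. This is an illustration only, not Cassandra's actual code: the enum names, the verb lists, and the decoding method are all hypothetical, and the sketch assumes the receiver maps a numeric command id to a table of the commands it knows.

```java
public class VerbLookupSketch {
    // Hypothetical: commands known to the "old" (1.2-style) node.
    enum OldVerb { READ, RANGE_SLICE }

    // Hypothetical: the "new" (2.0-style) node adds a command the old node has never seen.
    enum NewVerb { READ, RANGE_SLICE, PAGED_RANGE }

    /** Decodes a numeric command id the way a node that only knows OldVerb would. */
    static OldVerb decodeOnOldNode(int verbId) {
        // An id beyond the known table throws ArrayIndexOutOfBoundsException.
        return OldVerb.values()[verbId];
    }

    /** Returns true if the old node can decode the command sent by the new node. */
    static boolean oldNodeUnderstands(NewVerb verb) {
        try {
            decodeOnOldNode(verb.ordinal());
            return true;
        } catch (ArrayIndexOutOfBoundsException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (NewVerb v : NewVerb.values()) {
            System.out.println(v + " understood by old node: " + oldNodeUnderstands(v));
        }
    }
}
```

In this sketch the commands shared by both versions decode normally, and only the newly added one fails, which matches the observation that the failure appears only when the 2.0 coordinator issues a paged query to a 1.2 node.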

4.1 Category:

semantic

4.2 Are there multiple faults?

no

4.3 Can we automatically test it?

yes

5. Fix

5.1 How?

Added a check on the Cassandra version of the nodes being queried. If any of them is older than 2.0, the coordinator doesn't use the paging function.
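
The shape of such a version gate can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual patch: the class, the method names, and the numeric version constants are hypothetical, and the command names are used only as string labels for the paged and non-paged paths.

```java
import java.util.Arrays;
import java.util.List;

public class PagingVersionGate {
    // Hypothetical protocol version numbers, chosen only for this sketch.
    static final int VERSION_12 = 6;
    static final int VERSION_20 = 7;

    /** Returns true only if every replica speaks at least the 2.0 protocol. */
    static boolean canUsePaging(List<Integer> replicaVersions) {
        for (int version : replicaVersions) {
            if (version < VERSION_20) {
                return false; // an older node would not recognize a paged query
            }
        }
        return true;
    }

    /** Picks the label of the command the coordinator would send. */
    static String chooseCommand(List<Integer> replicaVersions) {
        return canUsePaging(replicaVersions) ? "PagedRangeCommand" : "RangeSliceCommand";
    }

    public static void main(String[] args) {
        // Mixed cluster: one 1.2 node and one 2.0 node, so paging is skipped.
        System.out.println(chooseCommand(Arrays.asList(VERSION_12, VERSION_20)));
        // Fully upgraded cluster: paging is allowed.
        System.out.println(chooseCommand(Arrays.asList(VERSION_20, VERSION_20)));
    }
}
```

The check is made per query against the replicas involved, so paging resumes on its own once every node has been upgraded.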