CEP-37 Auto Repairs
Streamlining Repair Operations
Motivation
Why Auto-Repairs in Cassandra?
Proposal Overview
CEP-37 Auto-Repair Proposal
Design of the Repair Scheduler
control planes.
Maintaining a Global Repair View
Two key tables: repair_history and repair_priority.
Scheduler tracks node availability and repair progress.
Ensures globally consistent repair views across nodes.
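One way to picture how a shared repair view lets every node reach the same scheduling decision is the minimal Python sketch below. It is an illustration only, not the CEP-37 implementation: the row fields (`host_id`, `last_finished_ts`, `alive`), the function names, and the "least recently repaired live node goes next, priority hosts first" rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass(frozen=True)
class RepairHistoryRow:
    """Hypothetical per-node row, loosely modelled on a repair_history table."""
    host_id: str
    last_finished_ts: int  # epoch seconds of the last finished repair, 0 if never
    alive: bool            # node availability as seen by the scheduler


def next_node_to_repair(history: Sequence[RepairHistoryRow],
                        priority_hosts: Iterable[str] = ()) -> Optional[str]:
    """Return the host that should repair next, or None if no node is available.

    Assumed rule: skip unavailable nodes, prefer hosts listed as priority,
    then pick the one whose last repair finished longest ago.
    """
    priority = set(priority_hosts)
    candidates = [row for row in history if row.alive]
    if not candidates:
        return None
    prioritized = [row for row in candidates if row.host_id in priority]
    pool = prioritized or candidates
    # Tie-break on host_id so every node computes the same answer
    # from the same table contents.
    return min(pool, key=lambda r: (r.last_finished_ts, r.host_id)).host_id


def is_my_turn(my_host_id: str,
               history: Sequence[RepairHistoryRow],
               priority_hosts: Iterable[str] = ()) -> bool:
    """Each node runs the same check locally against the shared view."""
    return next_node_to_repair(history, priority_hosts) == my_host_id
```

Because the decision is a pure function of the shared table contents, no separate coordinator is needed in this sketch: nodes that read the same rows agree on whose turn it is.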
Support for Multiple Repair Types
Cassandra Node
Auto-Repair Config
Default Scheduler Config
Full Repair Config Overrides
Incremental Config Overrides
Auto-Repair System Tables
repair_history
repair_priority
Full Repair
Inc Repair
Full Repair
Inc Repair
Auto-Repair Scheduler
Full Repairs
Incremental Repairs
Repair Flow
Configuration
Configuration - Table Properties
CREATE TABLE test.test (
    key text PRIMARY KEY,
    value blob
) WITH additional_write_policy = '99p'
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND cdc = false
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '16', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND memtable = 'default'
    AND crc_check_chance = 1.0
    AND default_time_to_live = 0
    AND extensions = {}
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair = 'BLOCKING'
    AND speculative_retry = '99p'
    AND automated_repair_full = {'enabled': 'true'}
    AND automated_repair_incremental = {'enabled': 'true'};
Configuration - YAML
auto_repair:
  enabled: true
  repair_type_overrides:
    full:
      enabled: true
      number_of_repair_threads: 2
      repair_max_retries: 2
      repair_primary_token_range_only: true
    incremental:
      enabled: true
      number_of_repair_threads: 1
      repair_max_retries: 3
      repair_primary_token_range_only: false
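The per-type overrides above sit on top of a default scheduler configuration. A rough Python sketch of how such layering could be resolved follows. The `DEFAULTS` values, the `effective_config` helper, and the rule that the top-level `enabled` flag gates every repair type are assumptions for illustration, not the documented Cassandra behaviour.

```python
# The YAML settings from the slide, written as a plain dict so the
# example needs no YAML library.
AUTO_REPAIR = {
    "enabled": True,
    "repair_type_overrides": {
        "full": {
            "enabled": True,
            "number_of_repair_threads": 2,
            "repair_max_retries": 2,
            "repair_primary_token_range_only": True,
        },
        "incremental": {
            "enabled": True,
            "number_of_repair_threads": 1,
            "repair_max_retries": 3,
            "repair_primary_token_range_only": False,
        },
    },
}

# Hypothetical defaults, chosen only for this example.
DEFAULTS = {
    "enabled": False,
    "number_of_repair_threads": 1,
    "repair_max_retries": 3,
    "repair_primary_token_range_only": True,
}


def effective_config(auto_repair: dict, repair_type: str,
                     defaults: dict = DEFAULTS) -> dict:
    """Overlay one repair type's overrides on the defaults."""
    merged = dict(defaults)
    overrides = auto_repair.get("repair_type_overrides", {}).get(repair_type, {})
    merged.update(overrides)
    # Assumption: the top-level switch must be on for any type to run.
    merged["enabled"] = bool(auto_repair.get("enabled", False)) and bool(merged["enabled"])
    return merged


if __name__ == "__main__":
    for repair_type in ("full", "incremental"):
        print(repair_type, effective_config(AUTO_REPAIR, repair_type))
```

A repair type with no override block simply inherits the defaults in this sketch, which keeps each override section limited to the settings that actually differ.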
Observability & Metrics
Monitoring Repair Progress
> nodetool getautorepairconfig
repair scheduler configuration:
    repair eligibility check interval: 5m
    TTL for repair history for dead nodes: 2h
    max retries for repair: 3
    retry backoff: 30s
configuration for repair type: incremental
    enabled: true
    minimum repair interval: 15m
    repair threads: 1
    number of repair subranges: 16
    priority hosts:
    sstable count higher threshold: 10000
    table max repair time in sec: 6h
    ignore datacenters:
    repair primary token-range: true
    number of parallel repairs within group: 3
    percentage of parallel repairs within group: 3
    mv repair enabled: false
    initial scheduler delay: 5m
    repair session timeout: 3h
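For monitoring, output like the listing above can be turned into structured data. The Python sketch below is one possible approach: it assumes the output keeps the `key: value` layout shown on the slide, and the function name `parse_autorepair_config` and the section name `scheduler` are hypothetical choices made for this example.

```python
def parse_autorepair_config(text: str) -> dict:
    """Parse `nodetool getautorepairconfig`-style text into nested dicts.

    Returns {"scheduler": {...}, "<repair type>": {...}, ...}.
    All values are kept as the strings that appear in the output.
    """
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and the echoed shell prompt.
        if not line or line.startswith(">"):
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "repair scheduler configuration":
            current = sections.setdefault("scheduler", {})
        elif key == "configuration for repair type":
            current = sections.setdefault(value, {})
        elif current is not None:
            current[key] = value
    return sections


SAMPLE = """\
> nodetool getautorepairconfig
repair scheduler configuration:
    repair eligibility check interval: 5m
    retry backoff: 30s
configuration for repair type: incremental
    enabled: true
    repair threads: 1
    priority hosts:
    repair session timeout: 3h
"""

if __name__ == "__main__":
    parsed = parse_autorepair_config(SAMPLE)
    print(parsed["incremental"]["repair session timeout"])
```

Keeping the values as strings avoids guessing at units such as `5m` or `3h`; a caller can convert the fields it cares about.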
Incremental Repair
Reliable Incremental Repair Onboarding
Migration
Ship It! Currently in Use
Questions