TiKV Fast Tune:
Laying Down a Stepping Stone
on Our Path
刘聪 (Liu Cong) 2020-11
Content Index (HOME)
The Goal and The Road
of TiDB Cluster Tuning and Maintenance
We are Drifting From the Correct Path Now
TiDB as A Product
TiDB
TiDB
PD
TiKV
TiDB Community
TiDB Cloud
DBaaS
K8S toolset
...
Data Migrating
BR
TiCDC
...
TiDB Tuning and Maintaining
Dashboard
Metrics
...
...
As TiDB dev engineers, we (inappropriately) extended our scope of responsibility to keep almost every TiDB cluster in a good state.
That needs to be addressed, not just to reserve our energy for building better products, but because it's the only way we gain the ability to expand our user base.
Must use
Daily use
The Most Painful: TiKV Cluster Maintenance
All the work of maintaining TiDB
POC
Oncall
Maintaining TiKV is the hardest and most resource-intensive work.
Work grouped by module
TiDB
PD
TiKV
CDC
...
...
WHY?
Stateful
High expectations
Needs to be handled case by case
Hardware Env
Data Status
Compared to CDC, ..
Compared to TiDB
Lowest Convergence Rate
It doesn't just need to be good enough; it must be in the Best Status
Complicated
The Promised Land
TiDB User Community
TiDB Clusters
TiDB Dev Community
On-premise Deployments
Magic Maintenance Toolbox
Extremely powerful: it tells us (the users) what happened in these clusters, and what we should do
Community Deployments
Community Engineers
Non-engineer Users
Enterprise Engineers
On-cloud Deployments
A few issues that magic can't handle
Design, maintain, improve TiDB, and the magic box
Here is our Ultimate Goal.
In this model, the number of users can scale without increasing the maintenance pressure.
Community engineers
PingCAP Engineers
HOW to gain this power?
+-+-+-+-+-+
Where We Are
Easy-to-use Toolset
Pro Toolset
Can handle almost all SQL-related issues.
But it can't handle TiKV issues, especially performance-related ones.
One needs to be an expert to solve a TiKV issue: to know the architecture, the (changing) implementation, the source code.
This is the bleeding point.
The Unsatisfied TiDB User Community
A lot of issues
The Exhausted TiDB Dev Community
Community Engineers
Enterprise Engineers
PingCAP Engineers
Logs
Metrics
...
The Missing Piece: Easy-to-use TiKV Toolset
TiDB Tuning and Maintaining Toolbox
Easy-to-use toolset
Pro tools
Dashboard
Metrics in Grafana
TiKV-related
...
...
...
TiKV-related
For:
Non-engineer users
Non-dev engineers
Dev engineers
For:
Dev engineers only
Flames
...
The missing piece
Logs
Metrics stored in TiDB
Why It's Missing
TiDB Design: pluggable storage engine, does not have to be TiKV
TiDB
Storage Layer
PD
TiKV
Too complicated (relatively)
TiKV
TiKV
The simple answer is that we were always too busy along the way.
It needs to be improved before we can go further.
TiKV
A Feasible Approach
Now
Pro tools
Easy tools
TiKV tools
Improved
Pro tools
Easy tools
TiKV tools
Dev Engineers
Semi-pro TiKV tools
Dev engineers
Non-engineer Users
Non-engineer Users
Goal
Pro tools
Easy tools
TiKV tools
Semi-pro TiKV tools
Dev engineers
Non-engineer Users
Non-pro TiKV tools
The challenge is that building effective non-pro TiKV tools is a hard job; the better way to do it is step by step.
First, we provide semi-pro tools, based on the pro tools.
Then we build the target non-pro tools.
1
2
3
A Feasible Approach: Toolset Lifting
Semi-pro toolset
Pro toolset
Non-pro toolset
With Pro-core, Pro UI
With Pro-core, Non-pro UI
Dev Engineers Only
Anyone with a little training
Anyone
Lift the Pro tools to Non-pro
This is what we are already doing,
now we apply it to the TiKV toolset.
How Magic (may) Happen
Data (metrics) auto scan, auto troubleshooting
Expert-system knowledge
Improved
Pro tools
Easy tools
TiKV tools
Semi-pro TiKV tools
Dev engineers
Non-engineer Users
2
Goal
Pro tools
Easy tools
TiKV tools
Semi-pro TiKV tools
Dev engineers
Non-engineer Users
Non-pro TiKV tools
3
Easy to use
(With a little training)
Easy to develop
Become (part of) TiKV Tuning & Troubleshooting Standard
Accumulate cases and handling approaches, etc
Rapidly improving
Any TiKV-dev can easily add some contents
Maybe a bit shabby-looking, but that's totally OK
ML?
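To make "metrics auto scan + expert-system knowledge" a bit more concrete, here is a loose Python sketch (purely illustrative, not an existing tool): accumulated cases become rules, each pairing a Result series with a Potential Cause series and the advice an expert would give when the two clash. Every name, series and threshold below is made up.

from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Rule:
    statement: str                                   # like a Fast Tune panel title
    result: Callable[[], Sequence[float]]            # e.g. Write-RPC QPS over time
    cause: Callable[[], Sequence[float]]             # e.g. compaction flow over time
    clash: Callable[[Sequence[float], Sequence[float]], bool]
    advice: str                                      # the accumulated expert knowledge

def scan(rules: List[Rule]) -> None:
    # Run every rule and report the ones whose cause "clashes" with the result.
    for rule in rules:
        if rule.clash(rule.result(), rule.cause()):
            print(f"[spark] {rule.statement} -> {rule.advice}")

# One example rule with made-up data: QPS dips whenever the compaction flow spikes.
qps  = [100, 100, 60, 100, 100, 55, 100]
flow = [10, 10, 80, 10, 10, 90, 10]
scan([Rule(
    statement="Compaction flow drags Write-RPC QPS down",
    result=lambda: qps,
    cause=lambda: flow,
    clash=lambda r, c: all((ci > 50) == (ri < 80) for ri, ci in zip(r, c)),
    advice="tune compaction / consider the IO limiter",
)])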
+-+-+-+-+-+
TiKV Fast Tune: An Experiment, A Stepping Stone
With minimal training, anyone could tell what happened in a TiKV cluster.
It won't replace the pro metrics; it's just another view built on the original metrics.
Improved
Pro tools
Easy tools
TiKV tools
Semi-pro TiKV tools
Dev engineers
Non-engineer Users
2
Fast Tune Panels
We could keep doing more experiments until we find the way leading to our goal.
Fast Tune is one of many.
The Design Concept of Fast Tune
How Fast Tune Works
TiKV Details Panels
TiKV Details Panels
TiKV Details Panels
TiKV Details Panels
...
TiKV Fast Tune Panels
...
Anyone can browse the panels and find the most common issues in plain sight
Compress a large number of metric panels into a few
Panel count: N * 100
Reduce the info and improve the display in each panel
Panel count: N * 10
Easy to Develop, Easy to Use
Fast Tune
Easy to dev
Based on:
Existing metrics,
The existing platform (Grafana)
A list of potential issue causes
Hide all the details,
Hide the irrelevant data (metrics)
Use the Clashing Spark to check whether the potential issue is real
Easy to use
Could be easily checked one by one
The Clashing Spark needs a little explanation, but once we understand it, it's intuitive and convenient
What is Clashing
(Dev) Draw the Potential Cause and the Result into the same panel
A Panel
Potential Cause
Has Spark
No Spark
(User) Check the next panel
(User) An issue cause had been detected!
Result
(User) Observe if there is a Clashing Spark or not
A Panel
Potential Cause
Result
An Example of a Clashing Spark
Result: the TiKV Write-RPC QPS jitter
Potential Cause: block-cache miss rate
Flip it (just one of the techniques, not always necessary)
Draw them together
Draw them together
The curves match perfectly; that's the Spark.
So we can say: the block-cache miss rate is (one of) the causes of the Write-RPC QPS jitter.
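To make "matching curves" concrete, here is a tiny Python sketch with made-up numbers: the cause series is flipped (negated) as in the screenshots above, and a plain Pearson correlation stands in for the eyeball judgment. In Fast Tune itself the check is visual; nothing below is part of the actual dashboards.

def pearson(xs, ys):
    # Plain Pearson correlation, enough to quantify "the curves match".
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

write_rpc_qps   = [1000, 980, 700, 990, 1010, 650, 1005]      # the Result
cache_miss_rate = [0.02, 0.03, 0.35, 0.02, 0.02, 0.40, 0.02]  # the Potential Cause

flipped_cause = [-v for v in cache_miss_rate]  # flip it, so both curves dip together

corr = pearson(write_rpc_qps, flipped_cause)
print("spark" if corr > 0.8 else "no spark", round(corr, 2))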
Fast Tune: A Set of Clashing Panels
One Panel
Fast Tune Panels
Anyone can browse the panels and find the most common issues in plain sight
Has a statement as its title, proposing a (potential) cause that could lead to the Result.
A Clashing Spark (which may or may not exist); with a tip or a little explanation, we can easily tell whether it's there.
If the Spark exists, the statement in the title is true; otherwise it is false.
The Result
The Potential Cause
What's in A Fast Tune Panel
One Panel
Normally the Result will be a QPS trend change, or QPS jitter.
The most common Spark is matching curves between the Result and the Cause.
Has a statement as its title
Has a Result
Has a Potential Cause
A Spark of Having Matching Trends
Has Spark
No Spark
A Spark of Having Matching Jitters
Has Spark
No Spark
The values are too small to be the root cause
A Spark of Having Abnormal Instances in a Group
Has Spark
No Spark
A Spark of Members' Trends affecting the Group's Trend
Has Spark
Zoom Out to Find the Matching Trends
No Spark ?
Has Spark
Use the Grafana time range tool to select a larger range
Zoom out
Find the Matching Trends
Zoom In to Find the Matching Jitters
Zoom in
No Spark
Zoom in
Has Spark
Use Zoom In for Better Auto Scaling
The panel setting (by developer)
Zoom in
When we zoom in a bit and avoid selecting a time range that includes QPS=0, the Y-axis will be auto-scaled, and the QPS jitter should become more obvious.
If the cause metric may have no data, set max to 0 to avoid odd rendering.
TiDB and TiKV Architecture
TiDB Architecture
TiKV Instance
Disk
RocksDBs
TiKV Instance
TiKV Instance
TiDB Instance
PD Cluster
(connected with TiKV instances)
Client
read
write
gRPC Server
TXN Scheduler
MVCC Storage
RaftDB
Disk
Coprocessor
KVDB
RaftStore
w + r
TiKV Architecture
RocksDBs
Requests
gRPC Server
TXN Scheduler
MVCC Storage
RaftStore
RaftDB
Disk
Coprocessor
KVDB
read
write
w + r
Critical Operations in TiKV Read
RocksDBs
TXN Scheduler
Get/Seek/Next
Memtable-Read
BlockCache-Read
SST-Read => Disk-IO
Coprocessor-RPC
Storage-Snapshot
Storage-Read
Read
Engine-Snapshot
Engine-Get/Seek/Next
Read
Out-Lease-Snapshot
Raft-Write => In-Lease-Read
In-Lease-Read
KVDB-Snapshot
KVDB-Get/Seek/Next
Read-RPC
Get-RPC (including batched)
Storage-Snapshot
Storage-Read
Coprocessor-RPC
Requests
RaftDB
Disk
RaftStore
KVDB
gRPC Server
MVCC Storage
Coprocessor
read
w + r
Critical Operations in TiKV Write
RocksDBs
Read/Seek/Next
Memtable-Read
BlockCache-Read
SST-Read => Disk-Read
Write
Memtable-Write
WAL-Write => Disk-IO
Write
Engine-Snapshot
Engine-Read/Seek/Next
Engine-Write
Write-RPC
Prewrite-RPC
Commit-RPC
PessimisticLock-RPC
Requests
Read
Non-Lease-Read
Raft-Write => Lease-Read
Lease-Read
KVDB-Snapshot
KVDB-Read/Seek/Next
Write
StoreLoop
Message-Dispatch * 3
RaftDB-Write
KVDB-Write (a few)
ApplyLoop
Message-Dispatch
KVDB-Write
Write
Memtable-Write
WAL-Write => Disk-IO
Acquire-Latch
RaftDB
Disk
RaftStore
KVDB
gRPC Server
TXN Scheduler
MVCC Storage
Coprocessor
read
write
w + r
TiKV Latency Source
RocksDBs
Requests
If MVCC GC is too slow,
it may scan a lot of TXN versions and cause high latency
If written rows conflict, it will cause high latency due to latch waiting.
The latch waiting is affected by:
1. Client workload
2. Processing speed of modules below
If MVCC GC is too slow,
it may scan a lot of TXN versions and cause high latency
1. If the Disk is slow, RocksDB will be slow
2. If compaction is slow, reads may be slow due to reading too many SSTs
3. Some (infrequent) operations (eg: deleting files) may hold the inner Mutex for a while and cause performance jitter
4. Too much deleted but not-yet-GCed data will slow down Read
When reaching one of the limits, it will cause high latency:
1. Throughput BandWidth
2. IOPS
3. fsync/s
The disk latency is related to
current-flow : throughput-BW
A mixed IO-size workload also slows down disk performance
gRPC Server
TXN Scheduler
RaftStore
RaftDB
Disk
Coprocessor
KVDB
MVCC Storage
read
write
w + r
TiKV Performance Routine Tuning
RocksDBs
Requests
Make sure coprocessor thread-pool is big enough
Compaction tuning, and use IO-limiter when disk is not good
Use good Disk and proper mounting
Solve conflicts on the client side
(or let it be if the application can't be modified)
Make sure scheduler thread-pool is big enough
Make sure thread-pool sizes are neither too big nor too small
gRPC Server
TXN Scheduler
MVCC Storage
RaftDB
Disk
Coprocessor
KVDB
RaftStore
read
write
w + r
TiKV Performance Map
Workload Pattern and Balancing
Write-RPC
Prewrite-RPC
Commit-RPC
PessimisticLock-RPC
Read-RPC
Get-RPC (including batched)
Coprocessor-RPC
TiKV Instance
TiKV Instance
TiDB Instance
PD Cluster
(connected with TiKV instances)
Client
TiKV Instance
Requests
Find Out Which RPC has Problems
gRPC Server
TXN Scheduler
MVCC Storage
RaftStore
RaftDB
Disk
Coprocessor
KVDB
Data codec, CPU usage,
not much latency.
tikv.server.grpc-{...}
No need to tune; some values could perhaps be turned down a bit to save resources.
Configs
Requests
Read Performance: Coprocessor, MVCC Storage
Requests
gRPC Server
MVCC Storage
RaftStore
Disk
Coprocessor
KVDB
TXN Scheduler
RaftDB
Read Performance: GC, RaftStore
Requests
TXN Scheduler
RaftDB
Disk
MVCC Storage
Coprocessor
KVDB
gRPC Server
RaftStore
Read Performance: KVDB, Disk
Coprocessor
Requests
TXN Scheduler
RaftDB
KVDB seek and read
gRPC Server
MVCC Storage
RaftStore
Disk
KVDB
Write Performance: Scheduler, MVCC Storage
TXN Scheduler
Coprocessor
Requests
MVCC Storage
RaftDB
Disk
KVDB
gRPC Server
TODO: Store loop duration
TODO: Apply loop duration
read
write
RaftStore
Write Performance: RocksDB
RocksDB
Coprocessor
TXN Scheduler
Requests
MVCC Storage
RaftStore
RaftDB
Disk
KVDB
RocksDB compaction related
gRPC Server
read
write
w + r
Fast Tune Manual
Summary of Fast Tune
Fast Tune is a Grafana page; these are 3 of its rows (a row is a set of panels).
To check a TiKV cluster, we can look at the Summary row first to learn its basic status.
Normally the workload mixes Writes and Reads, and both can affect the Write performance.
We get lots of on-call issues about TiDB Writes being too slow (latency or throughput), so the current version of Fast Tune focuses on Write performance.
1
3
2
For those clusters that only have Read performance issues, Fast Tune is not able to help yet.
It’s on the schedule and will be in new rows (new panel sets)
TODO
Fast Tune
Summary
The most popular Result (of Clashing)
In almost all panels, we use Write-RPC QPS as the Clashing Result.
That's because we want to know whether the cause leads to the Write QPS slowing down or to Write jitter.
The Write-RPC QPS is the sum of gRPC:
Whether we're using pessimistic TXN or optimistic TXN, these three will always represent the Write performance.
Using these kinds of tactics, users of Fast Tune can skip a lot of detailed info and quickly get a big picture of the current status.
Still, once we find something wrong, we want to check the detailed metrics on Page: TiKV-details and any other pages.
It’s always the green part (the lower part) in the panel.
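For reference, the same aggregate can be pulled straight from Prometheus. A minimal Python sketch, assuming Prometheus listens on localhost:9090 and the gRPC counter is named tikv_grpc_msg_duration_seconds_count as in the standard TiKV dashboards; adjust both for your deployment.

import json
import urllib.parse
import urllib.request

PROM = "http://localhost:9090/api/v1/query"   # assumed Prometheus address
WRITE_RPC_QPS = (
    'sum(rate(tikv_grpc_msg_duration_seconds_count'
    '{type=~"kv_prewrite|kv_commit|kv_pessimistic_lock"}[1m]))'
)

def instant_query(expr):
    # Ask Prometheus for the current value of a PromQL expression.
    url = PROM + "?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["data"]["result"]

print(instant_query(WRITE_RPC_QPS))  # the cluster's current Write-RPC QPS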
Fast Tune
Summary
More Results (of Clashing)
Get-RPC QPS is the sum of gRPC:
Read-RPC QPS is the sum of gRPC:
Fast Tune
Summary
The Flipped Metric and The Negative Value
Most of the metrics are flipped, to make the Result and the Potential Cause easier to compare.
For flipped metrics, note that the lower the curve, the greater the value.
The flipped metrics show as negative values.
It's a bit unfriendly; we haven't figured out how to display positive values in flipped mode yet.
But worry not: the mouse hover hints still show the positive values.
Fast Tune
Summary
Jitter Highlighting in Fast Tune
A linear scale can highlight jitters
Setting min/max to auto will auto-scale the curves, which highlights jitters
Most TiDB/TiKV metrics are shown with resolution = 1/2 in Grafana; Fast Tune uses 1/1 to highlight jitters
Fast Tune
Summary
When More Than One Panel Has Sparks
If these panels are parent and child in the call stack, the innermost (lower) one is the root cause (a toy sketch follows below)
If the panels have no relation, some analysis is needed to conclude which one is the cause and which one is the result; for example, they could use the same resource (eg: Disk, CPU), so the hidden relationship is resource competition
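As a toy illustration of the first case: if we model the write path as a simplified chain of modules, "the innermost sparking panel wins" is just a walk down the chain. The sparking set below is made up.

# parent -> child along the TiKV write path (simplified)
CALL_STACK = {
    "gRPC Server": "TXN Scheduler",
    "TXN Scheduler": "MVCC Storage",
    "MVCC Storage": "RaftStore",
    "RaftStore": "KVDB",
    "KVDB": "Disk",
}

def innermost(sparking):
    # Walk top-down and remember the last (deepest) sparking module we pass.
    module, last = "gRPC Server", None
    while module is not None:
        if module in sparking:
            last = module
        module = CALL_STACK.get(module)
    return last

print(innermost({"MVCC Storage", "KVDB"}))  # -> KVDB, the root cause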
Fast Tune
Summary
Requests
gRPC Server
TXN Scheduler
MVCC Storage
Coprocessor
TXN Scheduler
MVCC Storage
RaftStore
RaftDB
read
write
w + r
Row: Summary
Row:
Summary
Find Out the Imbalanced Requests
We can easily spot imbalanced Writes between TiKV instances at a glance, and hover the mouse pointer over the graph to see the details.
The Background Fill represents the total Write-RPC QPS
The Colored Lines represent the Write-RPC QPS of each single instance
The Imbalanced Read panel is similar
Row:
Summary
kv_prewrite
kv_commit
kv_pessimistic_lock
Find Out the Basic Workload Pattern
An easy way to observe how many Reads and Writes are called: check the numbers on the right Y-axis.
The proportion of QPS represents the basic Workload.
If those RPC calls depend on each other (for example, TXNs with mixed Reads and Writes), their curves will match.
This sample comes from the TPCC benchmark, so we can see that all those curves match.
Row:
Summary
(Not important) There are some repeating routine coprocessor jobs from TiDB; when there are no client-side coprocessor calls, these routine jobs become visible.
Find Out the Abnormal Instances
We can easily tell whether only one or a few TiKV instances are Read Too Slow.
The Background Fill represents the total Write-RPC QPS.
The Flipped Fills represent the Write-RPC latency of every single instance.
The Panel: Some instances write too slow is similar.
If this happens, we can focus on just the problematic instances.
Row:
Summary
An Example of One Instance Write Too Slow
One instance has Write latency jitter, leading to cluster-wide Write jitter.
Row:
Summary
Mouse hover
Which RPC has Problems: Read or Write, or Both
By:
We can tell whether the problem is on Read-RPC or Write-RPC.
Then we can go straight to Row: Read Performance or Row: Write Performance.
Row:
Summary
Find out Which Type of Read has Problems
By comparing the Get and Coprocessor QPS and latency, we can tell which type of Read has problems.
For the detailed WHY, we need to check Row: Read Performance, and maybe even the original metrics on Page: TiKV-details.
If we see a Spark, it might mean this type of Read is the main workload.
Row:
Summary
?
More Examples of Read latency
Row:
Summary
?
Find out if there are Write Stall Events
This red line combines the Write Stall metrics from KVDB and RaftDB.
Write Stall should never happen. If the red line is not flat, go check the RocksDB:KV and RocksDB:Raft rows in Page: TiKV-details to see which RocksDB (mostly KVDB) has Write Stall, and why (there is a panel showing the reason).
Normally it is because the Flush (memtable -> SST) or the Compaction is too slow.
If there is not enough CPU or the disk load is too high (check the w_await metric on Page: Disk Performance, ignore the util% metric), we need to lower the pressure from the client side.
If CPU and Disk are both OK, increase the compaction thread-pool size to solve the Write Stall.
Row:
Summary
tikv.rocksdb.max-background-jobs = 8
tikv.rocksdb.max-sub-compactions = 3
tikv.rocksdb.max-background-flushes = 2
tikv.raftdb.max-background-jobs = 4
tikv.raftdb.max-sub-compactions = 2
Configs
The Latch Time Panel
When client-side Writes conflict, the latch duration increases. When the conflict is heavy, the latch duration gets significantly longer.
When that happens, if we are using optimistic TXN, we can switch to pessimistic TXN first, to avoid lots of rollbacks. But the latch duration will still be long; that's inevitable.
This panel's latch value combines kv_prewrite, kv_commit and kv_pessimistic_lock.
These metrics have a bug in the collecting process: they incorrectly include the scheduler's wait-for-available-thread time.
So when the latch duration is high and we are sure there is not much conflict on the client side, the high "latch time" may actually come from a shortage of scheduler threads, and we should increase the pool size.
Row:
Summary
tikv.storage.scheduler-worker-pool-size = 4
Configs
The PD Scheduling Panels
If there is a lot of balancing, check out why in the panel below
These panels are for checking whether PD scheduling operations are related to the Write QPS.
When scheduling does affect QPS, we can slow down the balancing (with pd-ctl).
A common scenario is adding (or removing) TiKV instances in an existing cluster. That causes rebalancing and might cause performance issues.
This shows the relation between the stores' used-space ratio and the balancing event count.
Under some (inappropriate) configs, unnecessary balancing may happen; this is where we can catch it.
Row:
Summary
The PD client Panel
Like Panel: Write Stall, this is a panel for monitoring critical events.
It is abnormal for the PD client to have a lot of pending tasks; it may be caused by a lack of CPU resources or other reasons, and can lead to heartbeat report failures and then to the TiKV instance disconnecting from PD.
Row:
Summary
RocksDB Compaction Pending Bytes Panel
Row: Read Performance
Increasing the compaction thread pool can help reduce pending bytes if CPU and Disk IO resources are plentiful.
tikv.rocksdb.rate-bytes-per-sec = ""
tikv.rocksdb.auto-tuned = false
tikv.raftdb.rate-bytes-per-sec = ""
tikv.raftdb.auto-tuned = false
Configs
Too many compaction pending bytes can have a huge impact on Disk IO or CPU usage, which affects TiKV performance and can even cause a RocksDB Write Stall.
Excessive client-side write pressure may cause that; in this situation, we need to reduce the pressure.
Ingesting SSTs can also trigger lots of compaction; deleting range data or PD scheduling can cause that.
Increasing the memtable size can help a little by lowering the write amplification (WA) of RocksDB (it needs more memory).
If the IO-limit config causes too many pending bytes, it may not impact Disk IO and CPU, but it will increase the SST read count and then lead to low performance.
tikv.rocksdb.max-background-jobs = 8
tikv.rocksdb.max-sub-compactions = 3
tikv.raftdb.max-background-jobs = 4
tikv.raftdb.max-sub-compactions = 2
Configs
tikv.rocksdb.defaultcf.write-buffer-size = "128MB"
tikv.rocksdb.defaultcf.max-bytes-for-level-base = "512MB"
tikv.rocksdb.writecf.write-buffer-size = "128MB"
tikv.rocksdb.writecf.max-bytes-for-level-base = "512MB"
tikv.rocksdb.lockcf.write-buffer-size = "32MB"
tikv.rocksdb.lockcf.max-bytes-for-level-base = "128MB"
tikv.raftdb.defaultcf.write-buffer-size = "128MB"
tikv.raftdb.defaultcf.max-bytes-for-level-base = "512MB"
(Other compaction configs could be useful too)
Configs
If ingesting SSTs causes high pending bytes, try lowering the scheduling rate with pd-ctl.
RocksDB Compaction Pending Bytes by Instance
Row: Read Performance
Some instances' pending bytes may reach the threshold without us being able to tell from the summary panel, so we also need another panel showing it by instance.
By instance
TiKV has a soft limit on pending bytes; reaching it will trigger a Write Stall.
When pending bytes reach 1/4 of the soft limit (16G), RocksDB will add threads to speed up compaction, which may lead to performance jitter.
In this by-instance panel, we can see when that has already happened.
tikv.rocksdb.defaultcf.soft-pending-compaction-bytes-limit = "64G"
Configs
?
Row: TiKV Write Performance
Row: Write Performance
Scheduler Thread-Waiting Panels
These panels are for judging whether the scheduler thread pool size is appropriate.
When the waiting time is too long (much longer than the MVCC Storage async-write duration; check Page: TiKV-details), it will amplify the latency jitter.
In that case, we can increase the pool size to reduce latency. But a pool size that is too big wastes CPU (on context switches).
This is the ideal panel for observing the scheduler-thread waiting duration, BUT in some versions of TiDB you may not see any data here.
Until the upper panel can be used, you can use this panel to make a rough guess, by observing the waiting queue size together with the latch panel.
So if conditions permit, do benchmarks when other configs are settled, using a relatively big pool size, then reduce it until performance drops.
Row: Write Performance
?
tikv.storage.scheduler-worker-pool-size = 4
Configs
RaftStore Thread Pool Size Panels
RaftStore uses two message loops (the Store Loop and the Apply Loop) to handle messages; each loop uses a set of threads (a thread pool):
RaftStore plays the most important part in TiKV performance; see "TiKV Write Latency and the Loop Speed" for more details. Optimization is ongoing.
The best pool size is small, but big enough. We can compare the loop duration and the waiting duration; if the latter is much greater than the former, the pool size is too small.
We don't have loop duration metrics in the current version, so we can use the corresponding RocksDB Write latency instead to make a rough guess.
This metric's collection is not implemented properly right now, but we can still use it as a rough value.
Row: Write Performance
tikv.raftstore.store-pool-size = 2
tikv.raftstore.apply-pool-size = 2
Configs
More Examples of RaftStore Thread Pool Size Panels
Row: Write Performance
RocksDB Write Latency Panels
Store Loop writes data (raft log) to RaftDB and occasionally to KVDB.
Apply Loop writes data in the final form to KVDB.
Since a Write involves both loops, the two panels both play important roles.
Row: Write Performance
Normally we only need to show duration-99% (could disable duration-max by editing the panels).
But for some cloud disks like AWS gp2, the write latency is stable while BW and IOPS are below the quotas, and suddenly rises to a high value when one of the quotas is reached.
For this cloud disk, the duration-99% may be stable even when duration-max is not, so we show them both.
?
RocksDB Compaction Flow Panels
Compaction IO flow is the most common reason for performance fluctuation, mostly on KVDB compaction.
The underlying reason is that disk latency increases when IO flow grows.
The read flow may hit the OS page cache, resulting in less real IO on the Disk.
So the pink part in these panels can be ignored to a certain extent; it depends on how much memory the node has and how fast data is written to Disk.
Row: Write Performance
tikv.rocksdb.rate-bytes-per-sec = ""
tikv.rocksdb.auto-tuned = false
tikv.raftdb.rate-bytes-per-sec = ""
tikv.raftdb.auto-tuned = false
Configs
The IO-limit config can smooth the flow; be careful not to set it too low, or it will cause too many pending bytes and then lead to Write Stalls.
More Compaction Flow Panel Examples
Row: Write Performance
RocksDB Write Batch Size Panels
Changes in the RocksDB Write batch size cause changes in RocksDB Write latency: the bigger, the longer.
If we can't find any reason why RocksDB Write latency is increasing, the answer may be here, and the batch size change may come from the client side.
Row: Write Performance
The Loop Speed affects the batch size: if the Loop Speed becomes slow, the messages in the queue accumulate and lead to a big batch size.
This might happen:
This degeneration continues until the batch size reaches the limit configured by max-batch-size.
tikv.raftstore.store-max-batch-size = 256
tikv.raftstore.apply-max-batch-size = 256
Configs
RocksDB Write Mutex Panels
If RocksDB Write is slow but Disk latency is normal, it is mostly caused by the PD scheduling or GC worker occupying the RocksDB Mutex for too long.
So before checking these panels, we should check Disk latency first.
Row: Write Performance
?
?
RaftDB Sync WAL Latency Panel
If a Disk has high-performance specs for latency, BW and IOPS but is slow at fsync operations, it can still hurt performance (this happens a lot with non-enterprise disks).
When that happens, the RocksDB Write latency panels will have no Spark, but in this panel the Spark should appear.
Row: Write Performance
There is not much work besides the IO operation in Sync WAL. When fsync performance is not the bottleneck, we can roughly use this metric as a Disk Write metric.
RocksDB Frontend Flow Panels
RaftDB frontend flow is normally caused by PD scheduling.
If the flow is high, that's something we should notice.
KVDB frontend Write-Flow is the data written by the client-side workload.
Similarly, Read-Flow reflects the Read workload from the client side.
If there is a spark here, it may mean that workload changes affected the performance.
Row: Write Performance
RocksDB Total IO Flow Panels
The KVDB compaction flow panel is the most direct place to observe how compaction affects performance.
But the jitter of the total IO flow hitting the disk is the root cause of Disk latency jitter.
So we have these two panels.
Row: Write Performance
More Examples of Total IO Flow Panels
RocksDB CPU usage
Compaction not only causes IO latency jitter, it also causes CPU usage jitter.
When the IO flow panels show jitter but the IO flow is not high, we can use this panel to check the RocksDB CPU usage jitter of each node.
Row: Write Performance
This instance has RocksDB CPU usage jitter,
which caused TiKV performance jitter.
Mouse hover
Zoom in
?
MVCC Storage Async-Write Latency Panel
Row: Write Performance
Storage async-write latency is the core metric to measure TiKV Write performance.
If Write-RPC has Sparks but there is no Spark here, it means something is wrong in the modules above Storage:
If there are Sparks here, then something inside Storage may have a problem:
Disk Write Latency Panel
Row: Write Performance
A panel to observe Disks' Write latency:
There will be lots of instances and Disks here; what we need to do is as simple as listed above.
Row: TiKV Read Performance
Row: Read Performance
RaftStore Lease-Read Rate Panel
Row: Read Performance
A Read request needs to read data from the Storage layer; the request is then split by region and goes to RaftStore.
If the region is an in-lease leader in RaftStore, data can be read directly from this leader by getting a snapshot from KVDB.
If not, RaftStore needs to process a Raft Write (without data persistence) to make sure this region is a leader and in lease, and then read the data.
Direct in-lease Reads are always faster, so if a lot of Reads fall out of lease, performance will be bad.
Notice that sometimes this is the result, not the reason: when the gRPC TPS changes, the fall rate also changes. If the fall rate is the only thing that changes and a Spark is found here, then it may be the reason.
MVCC Storage Async-Snapshot Duration
Row: Read Performance
Read and Write both need to get a snapshot from the engine (implemented by RaftStore) before any operation.
Normally the target region should be in lease, so the get-snapshot duration should be very short. If there are obvious duration jitters or trends, they may be caused by meeting (lots of) out-lease regions.
When that happens, we should find out what causes the abnormal out-lease rate; it may be:
If the network is OK, we can try adjusting PD scheduling with pd-ctl to solve the issue.
?
Coprocessor Count and Latency Panels
In Row: Summary we already got a general picture of the counts and latencies of each type of RPC call, so here we just show a little more info about the coprocessor.
Notice that even if the coprocessor latency panel has a Spark, there is a big chance that coprocessor performance is not the cause; it may be just the result.
For example, if compaction flow leads to disk latency jitter then leads to performance jitter, it will also lead to coprocessor jitter.
For observing how the client-side workload affects performance.
Row: Read Performance
Coprocessor Threads Panels
Need some analysis to figure out why we don't have enough threads:
How long the tasks wait in the queue.
How long the waiting queue is.
Row: Read Performance
These panels are for judging if the coprocessor thread pool size is big enough.
This only covers the waiting durations of normal-level tasks (from the client-side workload).
To check the high and low-level tasks' waiting durations, we need to visit the Page: TiKV-details.
Data Scan Count Panels
This panel is for observing whether the RocksDB tombstone count affects Write performance.
This could affect both Write and Read calls, not just coprocessor.
Check coprocessor request types from Page: TiKV-details.
If it scans too much, that may be caused by the client-side workload or TiDB routine work (eg, table analyzing).
Row: Read Performance
If coprocessor scans too much data, it will be slow. This could be caused by:
A Typical Performance Trend Caused by GC
(1) QPS drops as the MVCC-version count grows
(2) Then GC starts; MVCC versions become deleted data in RocksDB, still dragging QPS down
(3) After that, QPS becomes stable
Row: Read Performance
KVDB Seek Count Panel
Row: Read Performance
Many things can cause KVDB Seeks: MVCC GC, client-side Reads, client-side Writes, etc.
Here we can observe whether the Seek count affects TiKV Write performance; if there is a Spark, we can check Page: TiKV-details to find out where those calls come from.
?
KVDB Read Latency Panel
Row: Read Performance
If KVDB Read is slow and affects TiKV Write performance, this panel will tell us.
When a spark is found here, we need to check out other panels to find out why KVDB Read is slow. It may be:
KVDB Read SST Count Panel
Reading too many SSTs in KVDB makes Read or Seek too slow.
Several reasons can lead to reading too many SSTs:
The SST read count is closely related to compaction, so we can see periodic changes here (following the compaction period).
If it affects TiKV Write performance, this panel will show it.
Row: Read Performance
KVDB Read SST Latency and Seek Latency Panel
Row: Read Performance
Mostly related to Disk IO latency.
KVDB Memtable Read Panel
Row: Read Performance
Written data is stored in the memtable before flushing; a Read looks for data in the memtable before the block cache and the SSTs.
If the block-cache Read misses and the memtable Read also misses, it leads to Disk IO (reading SSTs) and slows down RocksDB Read.
It also affects Write-RPC in two ways:
Some inappropriate configs might cause the memtable hit rate to change periodically; if that happens, we can catch it in this panel.
tikv.rocksdb.defaultcf.write-buffer-size = "128MB"
tikv.rocksdb.defaultcf.max-bytes-for-level-base = "512MB"
tikv.rocksdb.defaultcf.max-write-buffer-number = 5
tikv.rocksdb.writecf.write-buffer-size = "128MB"
tikv.rocksdb.writecf.max-bytes-for-level-base = "512MB"
tikv.rocksdb.writecf.max-write-buffer-number = 5
tikv.rocksdb.lockcf.write-buffer-size = "32MB"
tikv.rocksdb.lockcf.max-bytes-for-level-base = "128MB"
tikv.rocksdb.lockcf.max-write-buffer-number = 5
tikv.raftdb.defaultcf.write-buffer-size = "128MB"
tikv.raftdb.defaultcf.max-bytes-for-level-base = "512MB"
(Other compaction configs might affect memtable hit rate too)
Configs
?
KVDB Block Cache Panel
Row: Read Performance
If the block-cache Read misses (and the memtable Read also misses), it leads to Disk IO (reading SSTs) and slows down RocksDB Read.
It also affects Write-RPC in two ways:
This panel only shows the most important block cache metric; there is more info on Page: TiKV-details.
The most common reason for block cache hit rate changes is compaction, so even when there is a Spark here, chances are the block cache hit rate is still not the root cause.
Disk Read Latency Panel
Row: Read Performance
A panel to observe the Disks' Read latency:
There will be lots of instances and Disks here; what we need to do is as simple as listed above.
How To get Fast Tune
Version compatibility
Import
How to get Fast Tune
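For those who prefer scripting the import over clicking through the Grafana UI, here is a minimal Python sketch using Grafana's HTTP API; the dashboard file name and the API key are placeholders, and importing through the UI (Dashboards -> Import) works just as well.

import json
import urllib.request

GRAFANA = "http://localhost:3000/api/dashboards/db"  # assumed Grafana address
API_KEY = "<grafana-api-key>"                        # placeholder
DASHBOARD_JSON = "tikv_fast_tune.json"               # placeholder file name

with open(DASHBOARD_JSON) as f:
    dashboard = json.load(f)

payload = json.dumps({"dashboard": dashboard, "overwrite": True}).encode()
req = urllib.request.Request(
    GRAFANA,
    data=payload,
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())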
The End
With Fast Tune we don't need to do TiKV performance diagnosis step by step; we can just browse it and try to find something.
Still, I wanted to write a Diagnosis Process, to make sure nothing is missed in the Fast Tune panels.
But then I found out I was not the only one writing those things, so I stopped.
Here are some unfinished thoughts if you are interested.
Quick Diagnosis of TiKV Write Performance Issue (1)
Do more research about RocksDB
Compaction uses too much CPU
Cluster Write becomes slow / has jitters
Something unexpected happened, check the OS metrics: memory pages, context switching, total CPU load, etc
Check RocksDB compaction flow
Verify the Disk Write latency
Check Perf Context Mutex
Check write batch size
Verify RocksDB Write latency
Write too slow
Read too slow (Next Page)
Check the RocksDB CPU usage
Check the Disk Write latency
Check RocksDB Write latency
?
PD scheduling occupies the RocksDB mutex for too long.
Client-side changes or something else changes the Loop Speed
Check the Frontend flow
Frontend flow too high
Compaction flow too high
Check whether Read or Write has the issue
?
?
Check more Disk metrics, eg: IOPS, especially when it is a Cloud Disk
Check Perf Context Thread wait
Not enough RocksDB write threads
Use RocksDB Limiter, if disk BW is low, use bigger memtable
Check client-side
Disk is just slow
Disable/reduce scheduling with pd-ctl
Caused by GC
Use Compaction filter
Check PD Scheduling
Check GC in TiKV-details
Check RocksDB WAL latency
Sync too slow
Use fsync-ctrl, increase batch size
If CPU load is not too high, try increasing the memtable count.
Check GC, client-side
Use RocksDB Limiter, less threads in config
Check out Async-Write
Check out Read
Check out RaftStore Threads
Increase Threads
Check out latch
Check write stall
Check pending bytes, check reason, Adjust compaction
Increase Threads
Check out Scheduler Threads
Check client-side
Check balance
Quick Diagnosis of TiKV Write Performance Issue (2)
Try adjusting scheduling anyway
Cluster Write becomes slow / has jitters
Check Which Read is slow
Read too slow
Write too slow (Previous Page)
Check whether Read or Write has the issue
Coprocessor too slow
Get too slow
Check coprocessor threads
Check scanned data count
Check balance
Check in-lease-read rate
Check scanned RocksDB tombstone count
Check RPC count
Client-side
Client-side?
MVCC versions accumulating
Adjust MVCC GC, use compaction filter
Check and adjust compaction
Compaction too slow?
Check KVDB Seek and Get latency
Scheduling causes a low in-lease read rate
Check PD leader scheduling
Adjust scheduling
?
Check memtable hit count and block-cache hit rate
Cache hit rate too low
Check client-side, adjust block-cache
Check SST read count
Read too many SST
Check and adjust compaction
Check SST read latency
Disk is Slow
Adjust block-cache, use better Disk
Check Disk read latency
Check async-snap