1 of 98

TiKV Fast Tune:

Laying Down a Stepping Stone

on Our Path

liucong@pingcap.com

刘聪 (Liu Cong) 2020-11

2 of 98

Content Index (HOME)

3 of 98

The Goal and The Road

of TiDB Cluster Tuning and Maintenance

4 of 98

We are Drifting From the Correct Path Now

TiDB as A Product

TiDB

TiDB

PD

TiKV

TiDB Community

TiDB Cloud

DBaaS

K8S toolset

...

Data Migrating

BR

TiCDC

...

TiDB Tuning and Maintaining

Dashboard

Metrics

...

...

As TiDB dev engineers, we (inappropriately) extended our scope of responsibility to keep almost every TiDB cluster in a good state.

That needs to be addressed, not just to reserve our energy for building better products; it is also the only way we can scale our user base.

Must use

Daily use

5 of 98

The Most Painful: TiKV Cluster Maintenance

All TiDB maintenance work

POC

Oncall

Maintaining TiKV is the hardest and most resource-intensive work.

Work grouped by module

TiDB

PD

TiKV

CDC

...

...

WHY?

Stateful

High expectations

Needs to be handled case by case

Hardware Env

Data Status

Compared to CDC, ..

Compared to TiDB

Lowest Convergence Rate

It doesn't just need to be good enough; it must be in the best possible state

Complicated

6 of 98

The Promised Land

TiDB User Community

TiDB Clusters

TiDB Dev Community

On-premise Deployments

Magic Maintenance Toolbox

Extremely powerful: it tells us (the users) what happened in these clusters, and what we should do

Community Deployments

Community Engineers

Non-engineer Users

Enterprise Engineers

On-cloud Deployments

A few issues that magic can't handle

Design, maintain, improve TiDB, and the magic box

Here is our Ultimate Goal.

In this model, the number of users could scale without increasing the maintenance pressure.

Community engineers

PingCAP Engineers

HOW to gain this power?


7 of 98

Where We Are

Easy-to-use Toolset

Pro Toolset

Can handle almost all SQL-related issues.

But can't handle TiKV issues, especially performance-related ones.

One needs to be an expert to solve a TiKV issue: to know the architecture, the (ever-changing) implementation, and the source code.

This is the bleeding point.

The Unsatisfied TiDB User Community

A lot of issues

The Exhausted TiDB Dev Community

Community Engineers

Enterprise Engineers

PingCAP Engineers

Logs

Metrics

...

8 of 98

The Missing Piece: Easy-to-use TiKV Toolset

TiDB Tuning and Maintaining Toolbox

Easy-to-use toolset

Pro tools

Dashboard

Metrics in Grafana

TiKV-related

...

...

...

TiKV-related

For:

Non-engineer users

Non-dev engineers

Dev engineers

For:

Dev engineers only

Flames

...

The missing piece

Logs

Metrics stored in TiDB

9 of 98

Why It's Missing

TiDB Design: pluggable storage engine, does not have to be TiKV

TiDB

Storage Layer

PD

TiKV

Too complicated (relatively)

TiKV

TiKV

A simple answer is that we have always been busy along the way.

This needs to improve before we can go further.

TiKV

10 of 98

A Feasible Approach

Now

Pro tools

Easy tools

TiKV tools

Improved

Pro tools

Easy tools

TiKV tools

Dev Engineers

Semi-pro TiKV tools

Dev engineers

Non-engineer Users

Non-engineer Users

Goal

Pro tools

Easy tools

TiKV tools

Semi-pro TiKV tools

Dev engineers

Non-engineer Users

Non-pro TiKV tools

The challenge is that building effective non-pro TiKV tools is hard; the better way to do it is step by step.

First, we provide semi-pro tools based on the pro tools.

Then we build the target non-pro tools.

1

2

3

11 of 98

A Feasible Approach: Toolset Lifting

Semi-pro toolset

Pro toolset

Non-pro toolset

With Pro-core, Pro UI

With Pro-core, Non-pro UI

Dev Engineers Only

Anyone with a little training

Anyone

Lift the Pro tools to Non-pro

This is what we are already doing,

now we apply it to the TiKV toolset.

12 of 98

How Magic (may) Happen

Data (metrics) auto scan, auto troubleshooting

Expert-system knowledge

Improved

Pro tools

Easy tools

TiKV tools

Semi-pro TiKV tools

Dev engineers

Non-engineer Users

2

Goal

Pro tools

Easy tools

TiKV tools

Semi-pro TiKV tools

Dev engineers

Non-engineer Users

Non-pro TiKV tools

3

Easy to use

(With a little training)

Easy to develop

Become (part of) TiKV Tuning & Troubleshooting Standard

Accumulate cases and handling approaches, etc

Rapidly improving

Any TiKV dev can easily add content

Maybe a bit shabby-looking, but that's totally OK

ML?


13 of 98

TiKV Fast Tune: An Experiment, A Stepping Stone

With minimal training, anyone could tell what happened in a TiKV cluster.

It won't replace the pro metrics; it's just another view built on the original metrics.

Improved

Pro tools

Easy tools

TiKV tools

Semi-pro TiKV tools

Dev engineers

Non-engineer Users

2

Fast Tune Panels

We can keep doing more experiments until we find the way that leads to our goal.

Fast Tune is one of many.

14 of 98

The Design Concept of Fast Tune

15 of 98

How Fast Tune Works

TiKV Details Panels

TiKV Details Panels

TiKV Details Panels

TiKV Details Panels

...

TiKV Fast Tune Panels

...

Anyone can browse the panels and find the most common issues in plain sight

Compress a large number of metric panels into a few

Panel count: N * 100

Reduce the info and improve the display in each panel

Panel count: N * 10

16 of 98

Easy to Develop, Easy to Use

Fast Tune

Easy to dev

Based on:

Existing metrics,

The existing platform (Grafana)

A list of potential issue causes

Hide all the details,

Hide the irrelevant data (metrics)

Use a Clashing Spark to check whether the potential issue is real

Easy to use

Could be easily checked one by one

The Clashing Spark needs a little explanation, but once we understand it, it's intuitive and convenient

17 of 98

What is Clashing

(Dev) Draw the Potential Cause and the Result into the same panel

A Panel

Potential Cause

Has Spark

No Spark

(User) Check the next panel

(User) An issue cause has been detected!

Result

(User) Observe if there is a Clashing Spark or not

A Panel

Potential Cause

Result

18 of 98

An Example of a Clashing Spark

Result: the TiKV Write-RPC QPS jitter

Potential Cause: block-cache miss rate

Flip it (just one of the techniques; not always necessary)

Draw them together

Draw them together

The curves match perfectly; that's the Spark.

So we could say: the block-cache miss rate is (one of) the causes of the Write-RPC QPS jitter.
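As a hedged sketch of how such a clashing panel might be assembled from the Prometheus data behind these Grafana panels: the gRPC counter name below is the one TiKV dashboards commonly use, while the block-cache series names are purely illustrative placeholders; the cause is negated in the query so that its curve appears flipped (in practice the flip can also be done on the Grafana side, e.g. with a negative-Y series override).

  # Result: total Write-RPC QPS (its composition is described later in this deck)
  sum(rate(tikv_grpc_msg_duration_seconds_count{type=~"kv_prewrite|kv_commit|kv_pessimistic_lock"}[1m]))

  # Potential Cause, flipped by negating it so a higher miss rate dips lower;
  # block_cache_miss / block_cache_access are illustrative series names only
  0 - sum(rate(block_cache_miss[1m])) / sum(rate(block_cache_access[1m]))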

19 of 98

Fast Tune: A Set of Clashing Panels

One Panel

Fast Tune Panels

Anyone can browse the panels and find the most common issues in plain sight

Has a statement as its title, proposing that a (potential) cause could lead to the result

A Clashing Spark (which may or may not exist); we can easily recognize it with a tip or a little explanation.

If the spark exists, the title's statement is true; otherwise it is false.

The Result

The Potential Cause

20 of 98

What's in A Fast Tune Panel

One Panel

Normally the Result will be a QPS trend change, or QPS jitter.

The most common Spark is matching curves of the Result and the Cause

Has a statement as title

Has a Result

Has a Potential Cause

21 of 98

A Spark of Having Matching Trends

Has Spark

No Spark

22 of 98

A Spark of Having Matching Jitters

Has Spark

No Spark

The values are too small to be the root cause

23 of 98

A Spark of Having Abnormal Instances in a Group

Has Spark

No Spark

24 of 98

A Spark of Members' Trends affecting the Group's Trend

Has Spark

25 of 98

Zoom Out to Find the Matching Trends

No Spark ?

Has Spark

Use the Grafana time range tool to select a larger range

Zoom out

Find the Matched Trends

26 of 98

Zoom In to Find the Matching Jitters

Zoom in

No Spark

Zoom in

Has Spark

27 of 98

Use Zoom In for Better Auto Scaling

The panel setting (by developer)

Zoom in

When we zoom in a bit and avoid selecting a time range that includes QPS=0, the Y-axis will be auto-scaled, and the QPS jitter becomes more obvious.

If the cause metric may have no data, set max to 0 to avoid a weird display.

28 of 98

TiDB and TiKV Architecture

29 of 98

TiDB Architecture

TiKV Instance

Disk

RocksDBs

TiKV Instance

TiKV Instance

TiDB Instance

PD Cluster

(connected with TiKV instances)

Client

read

write

gRPC Server

TXN Scheduler

MVCC Storage

RaftDB

Disk

Coprocessor

KVDB

RaftStore

w + r

30 of 98

TiKV Architecture

RocksDBs

Requests

gRPC Server

TXN Scheduler

MVCC Storage

RaftStore

RaftDB

Disk

Coprocessor

KVDB

read

write

w + r

31 of 98

Critical Operations in TiKV Read

RocksDBs

TXN Scheduler

Get/Seek/Next

Memtable-Read

BlockCache-Read

SST-Read => Disk-IO

Coprocessor-RPC

Storage-Snapshot

Storage-Read

Read

Engine-Snapshot

Engine-Get/Seek/Next

Read

Out-Lease-Snapshot

Raft-Write => In-Lease-Read

In-Lease-Read

KVDB-Snapshot

KVDB-Get/Seek/next

Read-RPC

Get-RPC (included batched)

Storage-Snapshot

Storage-Read

Coprocessor-RPC

Requests

RaftDB

Disk

RaftStore

KVDB

gRPC Server

MVCC Storage

Coprocessor

read

w + r

32 of 98

Critical Operations in TiKV Write

RocksDBs

Read/Seek/Next

Memtable-Read

BlockCache-Read

SST-Read => Disk-Read

Write

Memtable-Write

WAL-Write => Disk-IO

Write

Engine-Snapshot

Engine-Read/Seek/Next

Engine-Write

Write-RPC

Prewrite-RPC

Commit-RPC

PessimisticLock-RPC

Requests

Read

Non-Lease-Read

Raft-Write => Lease-Read

Lease-Read

KVDB-Snapshot

KVDB-Read/Seek/Next

Write

StoreLoop

Message-Dispatch * 3

RaftDB-Write

KVDB-Write (a few)

ApplyLoop

Message-Dispatch

KVDB-Write

Write

Memtable-Write

WAL-Write => Disk-IO

Acquire-Latch

RaftDB

Disk

RaftStore

KVDB

gRPC Server

TXN Scheduler

MVCC Storage

Coprocessor

read

write

w + r

33 of 98

TiKV Latency Source

RocksDBs

Requests

If MVCC GC is too slow,

it may scan a lot of TXN versions and cause high latency

If written rows conflict, it will cause high latency from waiting on latches.

The latch waiting is affected by:

1. Client workload

2. Processing speed of modules below

If MVCC GC is too slow,

it may scan a lot of TXN versions and cause high latency

1. If the Disk is slow, RocksDB will be slow

2. If compaction is slow, reads may be slow due to reading too many SSTs

3. Some (infrequent) operations (e.g. deleting files) may hold the inner Mutex for a while and cause performance jitter

4. Having too much deleted but not-yet-GCed data will slow down Reads

When reaching one of the limits, it will cause high latency:

1. Throughput BandWidth

2. IOPS

3. fsync/s

The disk latency is related to

the ratio of current flow to throughput bandwidth

A mixed IO-size workload also slows down disk performance

gRPC Server

TXN Scheduler

RaftStore

RaftDB

Disk

Coprocessor

KVDB

MVCC Storage

read

write

w + r

34 of 98

TiKV Performance Routine Tuning

RocksDBs

Requests

Make sure coprocessor thread-pool is big enough

Compaction tuning, and use IO-limiter when disk is not good

Use good Disk and proper mounting

Solve conflicts on the client side

(or let it be if the application can't be modified)

Make sure scheduler thread-pool is big enough

Make sure thread-pool sizes are neither too big nor too small

gRPC Server

TXN Scheduler

MVCC Storage

RaftDB

Disk

Coprocessor

KVDB

RaftStore

read

write

w + r

35 of 98

TiKV Performance Map

36 of 98

Workload Pattern and Balancing

Write-RPC

Prewrite-RPC

Commit-RPC

PessimisticLock-RPC

Read-RPC

Get-RPC (included batched)

Coprocessor-RPC

TiKV Instance

TiKV Instance

TiDB Instance

PD Cluster

(connected with TiKV instances)

Client

TiKV Instance

Requests

37 of 98

Find Out Which RPC has Problems

gRPC Server

TXN Scheduler

MVCC Storage

RaftStore

RaftDB

Disk

Coprocessor

KVDB

Data codec work; costs CPU usage,

not much latency.

tikv.server.grpc-{...}

No need to tune; some values could perhaps be turned down a bit to save resources.

Configs

Requests

38 of 98

Read Performance: Coprocessor, MVCC Storage

Requests

gRPC Server

MVCC Storage

RaftStore

Disk

Coprocessor

KVDB

TXN Scheduler

RaftDB

39 of 98

Read Performance: GC, RaftStore

Requests

TXN Scheduler

RaftDB

Disk

MVCC Storage

Coprocessor

KVDB

gRPC Server

RaftStore

40 of 98

Read Performance: KVDB, Disk

Coprocessor

Requests

TXN Scheduler

RaftDB

gRPC Server

MVCC Storage

RaftStore

Disk

KVDB

41 of 98

Write Performance: Scheduler, MVCC Storage

TXN Scheduler

Coprocessor

Requests

MVCC Storage

RaftDB

Disk

KVDB

gRPC Server

TODO: Store loop duration

TODO: Apply loop duration

read

write

RaftStore

42 of 98

Write Performance: RocksDB

RocksDB

Coprocessor

TXN Scheduler

Requests

MVCC Storage

RaftStore

RaftDB

Disk

KVDB

gRPC Server

read

write

w + r

43 of 98

Fast Tune Manual

44 of 98

Summary of Fast Tune

Fast Tune is a Grafana page; these are 3 of its rows (a row is a set of panels).

To check a TiKV cluster, we can look at the Summary row first to learn the basic status

Normally the workload is a mix of Writes and Reads, and both can affect the Write performance

We got lots of on-call issues about TiDB Writes being too slow (in latency or throughput), so the current version of Fast Tune focuses on Write performance

1

3

2

For those clusters that only have Read performance issues, Fast Tune is not able to help yet.

It’s on the schedule and will be in new rows (new panel sets)

TODO

Fast Tune

Summary

45 of 98

The Most Common Result (of Clashing)

In almost all panels, we use Write-RPC QPS as the Clashing Result.

Because we want to know whether the cause leads to the Write QPS slowing down or to Write jitters.

The Write-RPC QPS is the sum of gRPC:

  • kv_prewrite
  • kv_commit
  • kv_pessimistic_lock

No matter we’re using pessimistic TXN or optimistic TXN, these three will always represent the Write performance

Using these kinds of tactics, users of Fast Tune can skip a lot of detailed info and quickly get a big picture of the current status.

But still, once we find something wrong, we want to check all the detailed metrics on Page: TiKV-details and any other pages.

It’s always the green part (the lower part) in the panel.

Fast Tune

Summary

46 of 98

More Results (of Clashing)

Get-RPC QPS is the sum of gRPC:

  • kv_batch_get
  • kv_batch_get_command
  • kv_get
  • kv_scan

Read-RPC QPS is the sum of gRPC:

  • Get-RPC
  • coprocessor
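Under the same assumption about the gRPC counter name (tikv_grpc_msg_duration_seconds_count with a type label), the two Read results could be sketched as:

  # Get-RPC QPS
  sum(rate(tikv_grpc_msg_duration_seconds_count{type=~"kv_get|kv_batch_get|kv_batch_get_command|kv_scan"}[1m]))

  # Read-RPC QPS = Get-RPC + coprocessor
  sum(rate(tikv_grpc_msg_duration_seconds_count{type=~"kv_get|kv_batch_get|kv_batch_get_command|kv_scan|coprocessor"}[1m]))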

Fast Tune

Summary

47 of 98

The Flipped Metric and The Negative Value

Most of the metrics are flipped, to make it easier to compare the Result and the Potential Cause.

For flipped metrics, note that a lower position on the curve means a greater value.

The flipped metrics show as negative values.

It's a bit unfriendly; we haven't figured out how to display positive values in flipped mode yet.

But worry not: the mouse hover hints still show the positive values.

Fast Tune

Summary

48 of 98

Jitter Highlighting in Fast Tune

Linear scale could highlight jitters

Setting min/max to auto will auto-scale the curves, which also highlights jitters

Most TiDB/TiKV metrics are shown with resolution = 1/2 in Grafana; Fast Tune uses 1/1 to highlight jitters

Fast Tune

Summary

49 of 98

When More Than One Panel Has Sparks

If these panels are parent-child in the call stack, the innermost (lowest) one is the root cause

If these panels are unrelated, it takes some analysis to conclude which one is the cause and which one is the result; for example, they could use the same resource (e.g. Disk, CPU), where the hidden relationship is resource competition

Fast Tune

Summary

Requests

gRPC Server

TXN Scheduler

MVCC Storage

Coprocessor

TXN Scheduler

MVCC Storage

RaftStore

RaftDB

read

write

w + r

50 of 98

Row: Summary

Row:

Summary

51 of 98

Find Out the Imbalanced Requests

We can easily spot imbalanced Writes between TiKV instances at a glance, and hover the mouse pointer over the graph to find out the details.

The Background Fill represents the total Write-RPC QPS

The Colored Lines represent the Write-RPC QPS of each single instance

The Imbalanced Read panel is similar

Row:

Summary

kv_prewrite

kv_commit

kv_pessimistic_lock

52 of 98

Find Out the Basic Workload Pattern

An easy way to observe how many Reads and Writes are called is to check the numbers on the right Y-axis.

The proportion of QPS represents the basic Workload.

If those RPC calls depend on each other (for example, mixed Reads and Writes in TXNs), their curves will match.

This sample comes from the TPCC benchmark, so we can see all those curves match.

Row:

Summary

(Not important) There are some repeating routine coprocessor jobs from TiDB; if there are no client-side coprocessor calls, these routine jobs become visible.

53 of 98

Find Out the Abnormal Instances

We can easily tell if only one or a few TiKV instances read too slowly.

The Background Fill represents the total Write-RPC QPS.

The Flipped Fills represent the Write-RPC latency of every single instance.

The Panel: Some instances write too slow is similar

If this happens, we can check only the problematic instances.

Row:

Summary

54 of 98

An Example of One Instance Write Too Slow

One instance has Write latency jitter, leading to cluster-wide Write jitter.

Row:

Summary

Mouse hover

55 of 98

Which RPC has Problems: Read or Write, or Both

By:

  • Checking the proportion of QPS
  • Comparing the latency of Read and Write
  • Observing the sparks

We can tell whether the problem is in Read-RPC or Write-RPC.

Then, we could go straight to the Row: Read Performance or Row: Write Performance.

Row:

Summary

56 of 98

Find out Which Type of Read has Problems

By comparing the Get and Coprocessor QPS and latency, we can tell which type of Read has problems.

For the detailed WHY, we need to check Row: Read Performance, and maybe even the original metrics on Page: TiKV-details.

If we see a Spark, that might mean this type of Read is the main workload.

Row:

Summary

?

57 of 98

More Examples of Read latency

Row:

Summary

?

58 of 98

Find out if there are Write Stall Events

This red line combines the Write Stall metrics from KVDB and RaftDB.

Write Stall should never happen. If the red line is not straight, go check the RocksDB:KV and RocksDB:Raft rows on Page: TiKV-details to see which RocksDB (mostly KVDB) has Write Stall, and why (there is a panel showing the reason).

Normally the cause will be that the Flush (memtable -> SST) or the Compaction is too slow.

If there is not enough CPU or the disk load is too high (check the w_await metric on Page: Disk Performance, ignore the util% metric), then we need to lower the pressure from the client side.

If CPU and Disk are both OK, increase the compaction thread-pool size to solve the Write Stall.
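A minimal query sketch for such a red line, assuming the write-stall duration is exposed as tikv_engine_write_stall with a db label (metric and label names may differ between TiKV versions, so treat them as assumptions):

  # Write stall duration per instance and per RocksDB (kv / raft); it should stay at 0
  max(tikv_engine_write_stall{db=~"kv|raft"}) by (instance, db)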

Row:

Summary

tikv.rocksdb.max-background-jobs = 8

tikv.rocksdb.max-sub-compactions = 3

tikv.rocksdb.max-background-flushes = 2

tikv.raftdb.max-background-jobs = 4

tikv.raftdb.max-sub-compactions = 2

Configs

59 of 98

The Latch Time Panel

When Writes from the client side conflict, the latch duration increases. When the conflict is heavy, the latch duration will be significantly longer.

When that happens, if we are using optimistic TXN, we can switch to pessimistic TXN first to avoid lots of rollbacks. But the latch duration will still be long; that's inevitable.

This panel's latch value is combined from kv_prewrite, kv_commit and kv_pessimistic_lock.

These metrics have a bug in the collection process: they incorrectly include the scheduler's wait-for-available-thread time.

So when the latch duration is high and we are sure there is not much conflict on the client side, the high "latch time" may actually come from not having enough scheduler threads, and then we should increase the pool size.
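A hedged sketch for looking at the latch wait itself, assuming a scheduler latch-wait histogram named tikv_scheduler_latch_wait_duration_seconds exists in your TiKV version (the name is an assumption; substitute the actual one):

  # 99th percentile latch wait, per instance
  histogram_quantile(0.99,
    sum(rate(tikv_scheduler_latch_wait_duration_seconds_bucket[1m])) by (le, instance))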

Row:

Summary

tikv.storage.scheduler-worker-pool-size = 4

Configs

60 of 98

The PD Scheduling Panels

If there is a lot of balancing, check out why in the panel below

These panels are for checking whether PD scheduling operations are related to Write QPS.

When scheduling does affect QPS, we can slow down the balancing (with pd-ctl).

A common scenario is adding (or removing) TiKV instances to an existing cluster. That will cause rebalancing, and might cause performance issues.

This shows the relation between the stores' used-space ratio and the balancing event count.

Under some (inappropriate) configs, unnecessary balancing may happen; this is where we can capture that.

Row:

Summary

61 of 98

The PD client Panel

Like Panel: Write Stall, this is a panel for monitoring critical events.

It is abnormal for the PD client to have a lot of pending tasks. It may be caused by a lack of CPU resources or other reasons, and it may lead to heartbeat report failures and then to the TiKV instance disconnecting from PD.

Row:

Summary

62 of 98

RocksDB Compaction Pending Bytes Panel

Row: Read Performance

Increasing the compaction thread-pool size can help reduce pending bytes if CPU and Disk IO resources are plentiful.

tikv.rocksdb.rate-bytes-per-sec = ""

tikv.rocksdb.auto-tuned = false

tikv.raftdb.rate-bytes-per-sec = ""

tikv.raftdb.auto-tuned = false

Configs

Too many compaction pending bytes can have a huge impact on Disk IO or CPU usage, which then affects TiKV performance and can even cause RocksDB Write Stall.

Too-high client-side write pressure may cause this; in that situation, we need to reduce the pressure.

Ingesting SSTs can also trigger lots of compaction; deleting range data or PD scheduling can cause that.

Increasing the memtable size can help a little by lowering the write amplification (WA) of RocksDB (it needs more memory).

If the IO-limit config causes too many pending bytes, it may not impact Disk IO and CPU, but it will increase the SST read count and then lead to low performance.

tikv.rocksdb.max-background-jobs = 8

tikv.rocksdb.max-sub-compactions = 3

tikv.raftdb.max-background-jobs = 4

tikv.raftdb.max-sub-compactions = 2

Configs

tikv.rocksdb.defaultcf.write-buffer-size = "128MB"

tikv.rocksdb.defaultcf.max-bytes-for-level-base = "512MB"

tikv.rocksdb.writecf.write-buffer-size = "128MB"

tikv.rocksdb.writecf.max-bytes-for-level-base = "512MB"

tikv.rocksdb.lockcf.write-buffer-size = "32MB"

tikv.rocksdb.lockcf.max-bytes-for-level-base = "128MB"

tikv.raftdb.defaultcf.write-buffer-size = "128MB"

tikv.raftdb.defaultcf.max-bytes-for-level-base = "512MB"

(Other compaction configs could be useful too)

Configs

If ingesting SSTs causes high pending bytes, try lowering the scheduling with pd-ctl.

63 of 98

RocksDB Compaction Pending Bytes by Instance

Row: Read Performance

Some instances' pending bytes may reach the threshold while we can't tell from the summary panel, so we also need another panel showing it by instance.

By instance

TiKV has a soft limit on pending bytes; reaching this number triggers Write Stall.

When the pending bytes reach 1/4 of the soft limit (16G), RocksDB will add threads to speed up compaction, which may lead to performance jitter.

In this by-instance panel, we can capture when that has already happened.
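A sketch of such a by-instance query, assuming the gauge is exposed as tikv_engine_pending_compaction_bytes (an assumption; check the metric list of your TiKV version):

  # KVDB compaction pending bytes, one series per TiKV instance;
  # compare against the 64G soft limit and the 16G (1/4) speed-up threshold
  sum(tikv_engine_pending_compaction_bytes{db="kv"}) by (instance)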

tikv.rocksdb.defaultcf.soft-pending-compaction-bytes-limit = "64G"

Configs

?

64 of 98

Row: TiKV Write Performance

Row: Write Performance

65 of 98

Scheduler Thread-Waiting Panels

These panels are for judging whether the scheduler thread-pool size is appropriate.

When the waiting time is too long (much longer than the MVCC Storage async-write duration, check Page: TiKV-details), it will amplify the latency jitter.

In that case, we can increase the pool size to reduce latency. But a pool size that is too big wastes CPU (due to context switches).

This is the ideal panel for observing the wait-for-scheduler-thread duration, BUT in some versions of TiDB you may not see any data here.

Until the upper panel can be used, you can get a rough guess from this panel, by observing the waiting queue size together with the latch panel.

So if conditions permit, run benchmarks after the other configs are settled, start with a relatively big pool size, then reduce it until performance drops.

Row: Write Performance

?

tikv.storage.scheduler-worker-pool-size = 4

Configs

66 of 98

RaftStore Thread Pool Size Panels

Raftstore uses two message loops (the Store Loop and the Apply Loop) to handle messages; each loop uses a set of threads (a thread pool):

  • All the threads of a loop take messages from the pool's queue and handle them as tasks.
  • The tasks involve sync IO.
  • If the pool size is too big, the IO writes to RocksDB will be small and lead to low performance.
  • If the pool size is too small, messages will wait in the queue for longer than one loop.

RaftStore plays the most important part in TiKV performance, see "TiKV Write Latency and the Loop Speed" for more details; optimization is on-going.

The best pool size is small, but big enough. We can compare the loop duration with the waiting duration; if the latter is much greater than the former, the pool size is too small.

We don't have loop-duration metrics in the current version, so we can use the corresponding RocksDB Write latency instead to make a rough guess.

This metric's collection is not implemented ideally for now, but we can still use it as a rough value.
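A rough comparison sketch under stated assumptions: both metric names below are stand-ins for the "message wait duration" and the RocksDB-write proxy of the loop duration, so substitute whatever your TiKV version actually exposes.

  # How long apply messages wait before being handled (assumed histogram name)
  histogram_quantile(0.99,
    sum(rate(tikv_raftstore_apply_wait_time_duration_secs_bucket[1m])) by (le, instance))

  # Proxy for the loop's own work: RaftDB write latency (assumed summary-style metric)
  max(tikv_engine_write_micro_seconds{db="raft", type="write_percentile99"}) by (instance)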

Row: Write Performance

tikv.raftstore.store-pool-size = 2

tikv.raftstore.apply-pool-size = 2

Configs

67 of 98

More Examples of RaftStore Thread Pool Size Panels

Row: Write Performance

68 of 98

RocksDB Write Latency Panels

Store Loop writes data (raft log) to RaftDB and occasionally to KVDB.

Apply Loop writes data in the final form to KVDB.

Since a Write involves both loops, both panels play important roles.

Row: Write Performance

Normally we only need to show duration-99% (could disable duration-max by editing the panels).

But for some cloud disks like AWS gp2, the write latency is stable while BW and IOPS are below the quotas, and suddenly rises to a high value when one of the quotas is reached.

For such cloud disks, duration-99% may be stable even when duration-max is not, so we show them both.

?

69 of 98

RocksDB Compaction Flow Panels

Compaction IO flow is the most common reason for performance fluctuation, mostly on KVDB compaction.

The underlying reason is that disk latency increases when IO flow grows.

The read flow may hit the OS page cache, resulting in less real IO on the Disk.

So the pink part in these panels can be ignored to a certain extent; it depends on how much memory the node has and how fast data is written to Disk.

Row: Write Performance

tikv.rocksdb.rate-bytes-per-sec = ""

tikv.rocksdb.auto-tuned = false

tikv.raftdb.rate-bytes-per-sec = ""

tikv.raftdb.auto-tuned = false

Configs

The IO-limit config can smooth the flow; be careful not to set it too low, or it will cause too many pending bytes and then lead to Write Stalls.

70 of 98

More Compaction Flow Panel Examples

Row: Write Performance

71 of 98

RocksDB Write Batch Size Panels

Changes in the RocksDB Write batch size cause changes in RocksDB Write latency: the bigger, the longer.

If we can't find any other reason why RocksDB Write latency is increasing, the answer may be here, and the batch-size change may come from the client side.

Row: Write Performance

The Loop Speed affects the batch size: if the Loop Speed drops, messages accumulate in the queue and lead to a big batch size.

This might happen:

  1. The Disk slows down a bit, e.g. maybe caused by a lot of Reads from the client side
  2. The increasing Disk write latency slows down the Loop Speed
  3. The slower Loop Speed leads to a bigger batch size
  4. The bigger batch size causes an even slower Loop Speed

This degeneration continues until the batch size reaches the limit configured by max-batch-size.

tikv.raftstore.store-max-batch-size = 256

tikv.raftstore.apply-max-batch-size = 256

Configs

72 of 98

RocksDB Write Mutex Panels

If RocksDB Write is slow but Disk latency is normal, it is mostly caused by the PD scheduling or the GC worker occupying the RocksDB Mutex for too long.

So before checking these panels, we should check Disk latency first.

Row: Write Performance

?

?

73 of 98

RaftDB Sync WAL Latency Panel

If a Disk has high-performance specs for latency, BW, and IOPS, but is slow at fsync operations, it can still hurt performance (this happens a lot with non-enterprise disks).

When that happens, RocksDB Write latency panels will have no spark, but in this panel, the spark should appear.

Row: Write Performance

There is not much work besides the IO operation in Sync WAL, so when fsync performance is not the bottleneck, we can roughly use this metric as a Disk Write metric.

74 of 98

RocksDB Frontend Flow Panels

RaftDB frontend flow is normally caused by PD scheduling.

If the flow is high, that's something we should notice.

KVDB frontend Write flow is the data written by the client-side workload.

Similarly, the Read flow reflects the Read workload from the client side.

If there is a spark here, it may mean that workload changes affected the performance.

Row: Write Performance

75 of 98

RocksDB Total IO Flow Panels

The KVDB compaction flow panel is the most direct place to observe how compaction affects performance.

But the jitter of the total IO flow hitting the disk is the root cause of Disk latency jitter.

So we have these two panels.

Row: Write Performance

76 of 98

More Examples of Total IO Flow Panels

77 of 98

RocksDB CPU usage

Compaction not only causes IO latency jitter, it also causes CPU usage jitter.

When the IO flow panels show jitter but the IO flow is not high, we can use this panel to verify the RocksDB CPU usage jitter on each node.

Row: Write Performance

This instance has RocksDB CPU usage jitter

and caused TiKV performance jitter.

Mouse hover

Zoom in

?

78 of 98

MVCC Storage Async-Write Latency Panel

Row: Write Performance

Storage async-write latency is the core metric to measure TiKV Write performance.

If Write-RPC has Sparks but there is no Spark here, something is wrong in the modules above Storage:

  • MVCC Read too slow
  • Latch conflict
  • Lack of Scheduler threads
  • Storage Async-Snapshot too slow

If there are Sparks here, then something inside Storage may have a problem:

  • RaftStore Write too slow
  • RocksDB Write too slow
  • Disk Write too slow

79 of 98

Disk Write Latency Panel

Row: Write Performance

A panel to observe Disks' Write latency:

  • Notice the high latency
  • Notice whether the latency trend matches Write-RPC QPS or not

There will be lots of instances and Disks here; all we need to do is as simple as listed above.
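If such a panel needs to be rebuilt, the disk write latency can be derived from node_exporter counters; a sketch assuming a recent node_exporter (older versions expose node_disk_write_time_ms and node_disk_writes_completed instead):

  # Average time per disk write (seconds), per instance and device
  rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m])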

80 of 98

Row: TiKV Read Performance

Row: Read Performance

81 of 98

RaftStore Lease-Read Rate Panel

Row: Read Performance

A Read request needs to read data from the Storage layer; the request then goes to RaftStore, split by region.

If the region is an in-lease leader in RaftStore, data can be read directly from this leader by getting a snapshot from KVDB.

If not, RaftStore needs to process a Raft Write (without data persistence) to make sure this region is a leader and in-lease, and then read the data.

Direct in-lease Reads are always faster, so if a lot of Reads fall to out-of-lease, performance will be bad.

Notice that sometimes this is the result, not the reason: when the gRPC TPS changes, the fall rate also changes. If the fall rate is the only thing that changes and a spark is found here, then it may be the cause.

82 of 98

MVCC Storage Async-Snapshot Duration

Row: Read Performance

Read and Write both need to get a snapshot from the engine (implemented by RaftStore) before any operation.

Normally the target region should be in-lease, so getting a snapshot should be very fast. If there are obvious duration jitters or trends, it may be caused by hitting (lots of) out-of-lease regions.

When that happens, we should find out what is causing the abnormal out-of-lease rate; it may be:

  • Lots of leaders transferring
  • Lots of region balancing

If the network is OK, we can try adjusting PD scheduling with pd-ctl to solve the issue.

?

83 of 98

Coprocessor Count and Latency Panels

In Row: Summary we already got a general picture of the counts and latencies of each type of RPC call, so here we just show a bit more info about the coprocessor.

Notice that if the coprocessor latency panel has a spark, there is a big chance that coprocessor performance is not the cause but just the result.

For example, if compaction flow leads to disk latency jitter and then to performance jitter, it will also lead to coprocessor jitter.

For observing the effect of the client-side workload.

Row: Read Performance

84 of 98

Coprocessor Threads Panels

It takes some analysis to figure out why we don't have enough threads:

  • The thread pool size is too small.
  • The CPU load of the node is way too high.
  • The coprocessor execution is too slow.

How long tasks wait in the queue.

How long the waiting queue is.

Row: Read Performance

These panels are for judging if the coprocessor thread pool size is big enough.

This only shows the waiting durations of normal-priority tasks (from the client-side workload).

To check the high- and low-priority tasks' waiting durations, we need to visit Page: TiKV-details.

85 of 98

Data Scan Count Panels

This panel is for observing whether the RocksDB tombstone count affects Write performance.

This could affect both Write and Read calls, not just coprocessor.

Check coprocessor request types from Page: TiKV-details.

If too much is scanned, it may be caused by the client-side workload or TiDB routine work (e.g., table analyzing).

Row: Read Performance

If coprocessor scans too much data, it will be slow. This could be caused by:

  • Client-side workload, e.g. a full table scan.
  • Scanning too many MVCC versions, which may be caused by MVCC GC not working well or in time.
  • Scanning too many KVDB tombstones, which may be caused by compaction not keeping up.

86 of 98

A Typical Performance Trend Caused by GC

(3) After that, QPS becomes stable

(2) Then GC starts; MVCC versions become deleted data in RocksDB, still dragging QPS down

(1) QPS drops as the MVCC version count grows

Row: Read Performance

87 of 98

KVDB Seek Count Panel

Row: Read Performance

Many things can cause KVDB Seeks: MVCC GC, client-side Reads, client-side Writes, etc.

Here we can observe whether the Seek count affects TiKV Write performance; if there is a spark, we can check Page: TiKV-details to find out where those calls come from.

?

88 of 98

KVDB Read Latency Panel

Row: Read Performance

If KVDB Read is slow and affects TiKV Write performance, this panel will tell us.

When a spark is found here, we need to check out other panels to find out why KVDB Read is slow. It may be:

  • Reading too much deleted data in KVDB.
  • The block cache hit rate is low, so too many SSTs are read.
  • The compaction status is not good, so too many SSTs are read.
  • Some other thread holds the Mutex a bit too long.
  • Disk IO is slow.

89 of 98

KVDB Read SST Count Panel

Reading too many SSTs in KVDB makes Reads or Seeks too slow.

Several reasons can lead to too many SST reads:

  • Block cache hit-rate is too low.
  • Compaction status is too bad.

The SST read count is closely related to compaction, so we can see periodic changes here (following the compaction period).

If it affects TiKV Write performance, this panel should show it.

Row: Read Performance

90 of 98

KVDB Read SST Latency and Seek Latency Panel

Row: Read Performance

Mostly related to Disk IO latency.

91 of 98

KVDB Memtable Read Panel

Row: Read Performance

Written data is stored in the memtable before flushing; a Read looks for data in the memtable before looking in the block cache and SSTs.

If both the block-cache Read and the memtable Read miss, it leads to Disk IO (reading SSTs) and causes RocksDB Reads to become slow.

It also affects Write-RPC in two ways:

  • In a TiDB TXN, Write requests may depend on successful Read requests
  • There are RocksDB Reads in MVCC Writes

Some inappropriate configs might cause the memtable hit rate to change periodically; if that happens, we can capture it in this panel.

tikv.rocksdb.defaultcf.write-buffer-size = "128MB"

tikv.rocksdb.defaultcf.max-bytes-for-level-base = "512MB"

tikv.rocksdb.defaultcf.max-write-buffer-number = 5

tikv.rocksdb.writecf.write-buffer-size = "128MB"

tikv.rocksdb.writecf.max-bytes-for-level-base = "512MB"

tikv.rocksdb.writecf.max-write-buffer-number = 5

tikv.rocksdb.lockcf.write-buffer-size = "32MB"

tikv.rocksdb.lockcf.max-bytes-for-level-base = "128MB"

tikv.rocksdb.lockcf.max-write-buffer-number = 5

tikv.raftdb.defaultcf.write-buffer-size = "128MB"

tikv.raftdb.defaultcf.max-bytes-for-level-base = "512MB"

(Other compaction configs might affect memtable hit rate too)

Configs

?

92 of 98

KVDB Block Cache Panel

Row: Read Performance

If the block-cache Read misses (and the memtable Read also misses), it leads to Disk IO (reading SSTs) and causes RocksDB Reads to become slow.

It also affects Write-RPC in two ways:

  • In a TiDB TXN, Write requests may depend on successful Read requests
  • There are RocksDB Reads in MVCC Writes

This panel only shows the most important block cache metric; there is more info on Page: TiKV-details.

The most common reason for block cache hit rate changes is compaction, so there is a large chance that even if a spark shows here, the block cache hit rate is still not the root cause.
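A sketch of a hit-rate query, assuming the cache counters are exposed under tikv_engine_cache_efficiency with type values block_cache_hit / block_cache_miss (names are assumptions; adjust to your version):

  # KVDB block cache hit rate
  sum(rate(tikv_engine_cache_efficiency{db="kv", type="block_cache_hit"}[1m]))
    /
  sum(rate(tikv_engine_cache_efficiency{db="kv", type=~"block_cache_hit|block_cache_miss"}[1m]))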

93 of 98

Disk Read Latency Panel

Row: Read Performance

A panel to observe the Disks' Read latency:

  • Notice the high latency
  • Notice whether the latency trend matches Write-RPC QPS or not

There will be lots of instances and Disks here; all we need to do is as simple as listed above.

94 of 98

How To get Fast Tune

Version compatibility

  • We try to make the newest Fast Tune work with all TiKV versions, old or new
  • So some panels in some versions may look weird (missing some data)

Import

How to get Fast Tune

  • Along with tiup deployment

95 of 98

='_'=

The End

96 of 98

With Fast Tune we don't need to do TiKV performance diagnosis step by step; we can just browse it and try to find something.

Still, I wanted to write a Diagnosis Process, to make sure nothing is missed in the Fast Tune panels.

But then I found out I was not the only one writing those things, so I stopped.

Here are some unfinished thoughts if you are interested.

97 of 98

Quick Diagnosis of TiKV Write Performance Issue (1)

Do more research about RocksDB

Compaction uses too much CPU

Cluster Write becomes slow / has jitters

Something unexpected happened; check the OS metrics: memory pages, context switching, total CPU load, etc.

Check RocksDB compaction flow

Verify the Disk Write latency

Check Perf Context Mutex

Check write batch size

Verify RocksDB Write latency

Write too slow

Read too slow (Next Page)

Check the RocksDB CPU usage

Check the Disk Write latency

Check RocksDB Write latency

?

PD scheduling occupies the RocksDB mutex for too long.

Client-side changes or something else changed the Loop Speed

Check the Frontend flow

Frontend flow too high

Compaction flow too high

Check whether Read or Write has the issue

?

?

Check more Disk metrics, e.g. IOPS, especially on a cloud Disk

Check Perf Context Thread wait

Not enough RocksDB write threads

Use RocksDB Limiter, if disk BW is low, use bigger memtable

Check client-side

Disk is just slow

Disable/reduce scheduling with pd-ctl

Caused by GC

Use Compaction filter

Check PD Scheduling

Check GC in TiKV-details

Check RocksDB WAL latency

Sync too slow

Use fsync-ctrl, increase batch size

If the CPU load is not too high, try increasing the memtable count.

Check GC, client-side

Use the RocksDB limiter, fewer threads in config

Check out Async-Write

Check out Read

Check out RaftStore Threads

Increase Threads

Check out latch

Check write stall

Check pending bytes, check reason, Adjust compaction

Increase Threads

Check out Scheduler Threads

Check client-side

Check balance

98 of 98

Quick Diagnosis of TiKV Write Performance Issue (2)

Try adjusting scheduling anyway

Cluster Write becomes slow / has jitters

Check Which Read is slow

Read too slow

Write too slow (Previous Page)

Check whether Read or Write has the issue

Coprocessor too slow

Get too slow

Check coprocessor threads

Check scanned data count

Check balance

Check in-lease-read rate

Check scanned RocksDB tombstone count

Check RPC count

Client-side

Client-side?

MVCC versions accumulating

Adjust MVCC GC, use compaction filter

Check and adjust compaction

Compaction too slow?

Check KVDB Seek and Get latency

Scheduling causes a low in-lease-read rate

Check PD leader scheduling

Adjust scheduling

?

Check memtable hit count and block-cache hit rate

Cache hit rate too low

Check client-side, adjust block-cache

Check SST read count

Read too many SST

Check and adjust compaction

Check SST read latency

Disk is Slow

Adjust block-cache, use better Disk

Check Disk read latency

Check async-snap