Ceph Crimson Team
1
CEPH CRIMSON
2021 Quarter 3 Project Update
Multiyear effort. How are we doing?
Efficiency is critical for success.
High Performance: How do we get there?
CEPH CRIMSON
A refresh of the Ceph OSD focused on efficient resource consumption, higher performance, and modern designs.
2
Why Reimplement the OSD?
Resource efficiency
Enables new workloads
Future-proof for next generation hardware
Why Seastar? Why Crimson?
Lower CPU Usage
Lower Latency
Potentially Better Memory Allocation
Who is Contributing to Crimson?
5
Red Hat
Intel
Qihoo 360
Samsung
Others
2021 Q3 Project Stats
Started in 2018
30+ Unique Contributors
900+ Pull Requests
2800+ Commits
74K+ Lines of Code
Crimson Commit History
The 2021 Q2 commit rate was 2x higher than in Q1; Q3 had fewer commits than Q2 but still modestly more than Q1.
Note: Only includes work committed to the crimson directory in the Ceph master git repository.
6
2018-2019 Crimson Milestones
Statistics: 29 milestones over 2 years
Initial Prototype
Ephemeral Data Store
Replication
RBD Support
Error Handling
Regression Testing
….
7
2020 Crimson Milestones
Statistics: 26 milestones over 1 year
PGLog Support
Backfill (Mostly)
Recovery (Mostly)
Bluestore Support
Seastore (Initial)
Code Stabilization
....
8
2021 Q1-Q3 Crimson Milestones
Statistics: 34 milestones in 2021 Q1-Q3
Ongoing Seastore Dev
Metrics/Profiling
Seastore optimization
Better debugging
Bug fixes (not all shown)
9
Crimson Release Plans
10
Quincy (2022 Q1 (Q2?))
Post Quincy 1
Post Quincy 2
2021 Goals (WIP)
11
Q1 | Q2 | Q3 | Q4
2022 Crimson Goals (WIP)
12
Q1 | Q2 | Q3 | Q4
13
2021 Q3 Crimson/Bluestore Performance Preview
3 Factors to consider...
14
Memory Allocator: Crimson currently uses some combination of the seastar memory allocator and libc malloc, depending on compilation settings and whether bluestore is used.
Architecture: Crimson employs a different programming model for the OSD code than classic, even when bluestore is utilized. Do some systems favor one or the other?
Multi-Core: Crimson does not yet have multi-core support, though this feature is being actively worked on. We can, however, now run multiple OSDs on a single NVMe device.
Classic vs Crimson: General Test Setup
15
|                  | Classic                               | Crimson                               |
| OSDs             | 1x2core, 1x8core, 4x2core             | 1x2core, 1x8core, 4x2core             |
| Clients          | 4xfio (16GB preallocated librbd each) | 4xfio (16GB preallocated librbd each) |
| iodepth          | 4x32 (32/client)                      | 4x32 (32/client)                      |
| iosize           | 4K, 64K (randread, randwrite)         | 4K, 64K (randread, randwrite)         |
| Objectstore      | Bluestore                             | Bluestore (via alienstore)            |
| Memory allocator | tcmalloc, libc                        | libc + seastar                        |
| Test duration    | 300s                                  | 300s                                  |
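The fio clients drive IO through librbd. As a rough, hedged illustration of that path (not the benchmark harness itself), the sketch below performs a single synchronous 4K write through the librbd C++ API; the pool name ("rbd") and image name ("bench0") are placeholders.

```cpp
// Hedged illustration only: the benchmark uses fio's librbd engine; this
// sketch just shows one synchronous 4K write through librbd.
#include <rados/librados.hpp>
#include <rbd/librbd.hpp>
#include <string>

int main() {
    librados::Rados cluster;
    if (cluster.init(nullptr) < 0) return 1;            // default client id
    cluster.conf_read_file(nullptr);                    // default ceph.conf search
    if (cluster.connect() < 0) return 1;

    librados::IoCtx io;
    if (cluster.ioctx_create("rbd", io) < 0) return 1;  // placeholder pool name

    librbd::RBD rbd;
    librbd::Image image;
    if (rbd.open(io, image, "bench0") < 0) return 1;    // placeholder image name

    librados::bufferlist bl;
    bl.append(std::string(4096, 'x'));                  // one 4K block
    image.write(0, bl.length(), bl);                    // synchronous write at offset 0

    image.close();
    cluster.shutdown();
    return 0;
}
```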
Classic vs Crimson: Memory Allocation
The classic OSD is very sensitive to which memory allocator is used during compilation. Tcmalloc tends to be better than libc malloc at optimizing Ceph’s heavy use of dynamic memory.
Bluestore also relies on tcmalloc for the OSD memory autotuning system.
16
Allocation Overhead in Classic Worker Threads
tp_osd_tp Thread Wallclock Time (2 Cores, 1 OSD, 4GB osd_memory_target, 4K Random Writes)
| Function  | libc   | tcmalloc |
| new       | 9.22%  | 0.00%    |
| tc_new    | 0%     | 3.35%    |
| _int_free | 7.59%  | 0%       |
| cfree     | 1.20%  | 2.83%    |
| Total     | 18.01% | 6.18%    |
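The allocator sensitivity above comes from the OSD making many small, short-lived heap allocations. Below is a minimal, hedged sketch of that allocation pattern, not the Ceph code path; running it as-is under libc malloc and again with tcmalloc preloaded (e.g. LD_PRELOAD=libtcmalloc.so) is one way to observe the allocator overhead in isolation.

```cpp
// Illustration only: churn many small, short-lived allocations from several
// threads, roughly the pattern that makes the classic OSD allocator-sensitive.
#include <chrono>
#include <cstdio>
#include <memory>
#include <thread>
#include <vector>

static void churn(std::size_t iters) {
    std::vector<std::unique_ptr<char[]>> live;
    live.reserve(1024);
    for (std::size_t i = 0; i < iters; ++i) {
        // Allocate a small buffer (64-960 bytes), keeping a rolling window alive.
        live.emplace_back(new char[64 + (i % 15) * 64]);
        if (live.size() > 1024) live.erase(live.begin());
    }
}

int main() {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back(churn, 2'000'000);
    for (auto& w : workers) w.join();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start).count();
    std::printf("allocation churn took %lld ms\n", static_cast<long long>(ms));
    return 0;
}
```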
Classic vs Crimson: Memory Allocation
The reduction in overhead shown in the previous slide also plays out when looking at the difference in cycles/op.
Crimson can’t use tcmalloc yet for bluestore, but is already competitive with the classic OSD.
17
Allocation Overhead in Classic Worker Threads
Classic vs Crimson: Memory Allocation
When limited to 2 cores, Crimson is actually faster with bluestore than the classic OSD even though it’s not able to benefit from tcmalloc yet.
18
Allocation Overhead in Classic Worker Threads
Classic vs Crimson: Memory Allocation
Tcmalloc, however, greatly reduces memory fragmentation and space amplification, given Ceph’s aggressive dynamic allocation of small objects.
Classic with libc malloc and Crimson both use more memory than classic with tcmalloc.
19
Allocation Overhead in Classic Worker Threads
Classic vs Crimson: Architecture Test Nodes
20
|         | Mako (AMD)                 | Officinalis (Intel)            |
| CPU     | 1 x EPYC 7742 (64c/128t)   | 2 x Xeon Plat. 8276M (28c/56t) |
| RAM     | 8 x 16GB 3200 MT/s DDR4    | 12 x 32GB 2666 MT/s DDR4       |
| Network | 100GbE Mellanox ConnectX-6 | 100GbE Intel E810-C            |
| NVMe    | 6 x Samsung 4TB PM983      | 8 x Intel 8TB P4510            |
Classic vs Crimson: Architecture
In these tests, both classic and crimson use bluestore. Surprisingly, the older Xeon-based system appears to be faster than the newer AMD system. The effect is most pronounced with small random writes on classic.
21
Architecture Differences - Performance
Classic vs Crimson: Architecture
The performance differences appear to be proportional to the differences in efficiency: 4K random writes on classic show much lower efficiency on the AMD node than on the Intel node.
22
Architecture Differences - Efficiency
Classic vs Crimson: Architecture
The lower efficiency didn’t always translate into higher latency. Tail latency on Mako was much lower for reads but higher for writes.
Crimson had consistently lower latency than classic across the board. It especially showed lower 99% tail latency under this highly concurrent test.
23
Architecture Differences - Latency
Classic vs Crimson: Architecture
The classic OSD has 3 groups of threads that are usually busy during writes. In this case, comparing the msgr-worker threads shows more time spent in locking code and pthread_cond_broadcast on AMD.
24
Architecture Differences - Classic Messenger
Classic vs Crimson: Architecture
Currently crimson only has a single reactor thread. When used with bluestore, the reactor does all of the work not handled by the bluestore and worker threads. This includes the msgr thread work on the previous slide.
The single reactor is quite busy on both systems even when crimson is limited to 2 cores.
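For intuition, here is a minimal, hedged Seastar sketch (not Crimson code): everything chained off the returned future runs on a reactor thread, so with a single reactor (e.g. `--smp 1`) messenger work, OSD logic, and alienstore submission all share that one event loop.

```cpp
// Minimal Seastar sketch (not Crimson): all continuations below execute on
// the reactor thread, which is why a single busy reactor can become the
// bottleneck feeding bluestore. Run with e.g. `./a.out --smp 1`.
#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>
#include <chrono>
#include <iostream>

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        using namespace std::chrono_literals;
        // Stand-in for request handling work queued on the reactor.
        return seastar::sleep(10ms).then([] {
            std::cout << "handled on the reactor thread\n";
        });
    });
}
```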
25
Architecture Differences - Crimson Reactor
Classic vs Crimson: Architecture
Why does the profile of the bstore_kv_sync thread look so different? In classic, the entire OSD is pinned to shared cores via numactl, while Crimson silently ignores numactl.
Instead, the reactor gets a dedicated core, worker threads share a set of cores, and bluestore threads are placed by the kernel. Our “2 core” test isn’t really 2 cores in crimson (though it’s close).
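As a hedged illustration of that difference (not Crimson's actual placement code), the sketch below pins one thread to a dedicated core with pthread_setaffinity_np while leaving another thread for the kernel to place, roughly the split described above.

```cpp
// Illustration only: per-thread pinning vs. letting the kernel place a thread.
// numactl/taskset constrain the whole process; Seastar-style placement pins
// individual threads, which is why crimson can exceed an intended core budget.
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    std::thread reactor([] {
        pin_to_cpu(0);                       // dedicated "reactor" core
        std::puts("reactor-like thread pinned to cpu 0");
    });
    std::thread bstore_worker([] {
        // No affinity set: the kernel scheduler decides where this runs,
        // like the bluestore threads described on this slide.
        std::puts("worker thread left to the kernel scheduler");
    });
    reactor.join();
    bstore_worker.join();
    return 0;
}
```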
26
Architecture Differences - Bluestore Sync
bstore_kv_sync Thread Wallclock Time (2 Cores, 1 OSD, Bluestore, 4K Random Writes)
| Function                | Classic (Intel) | Classic (AMD) | Crimson (Intel) | Crimson (AMD) |
| pthread_cond_wait       | 23.3%           | 9.8%          | 76.2%           | 73.7%         |
| sync_file_range         | 22.6%           | 30.7%         | 5.4%            | 6.1%          |
| rocksdb::InlineSkipList | 11.9%           | 16.9%         | 2.7%            | 3.6%          |
| BlueFS::_consume_dirty  | 2.2%            | 3.4%          | 1.3%            | 4.3%          |
| __lll_unlock_wait       | 8.2%            | 11.9%         | 0.1%            | < 0.1%        |
| io_submit               | 4.8%            | < 0.1%        | < 0.1%          | 1.5%          |
Classic vs Crimson: Architecture
Comparing profiles from Crimson and Classic is tricky, especially when threads are fighting over cores in different ways.
For now, there are some common themes, such as rocksdb skiplist key comparisons in the bstore_kv_sync thread. The biggest takeaway is that crimson’s reactor thread is very busy and can’t fully feed bluestore.
27
Architecture Differences - Worker Threads
“tp” Worker Thread Wallclock Time (2 Cores, 1 OSD, Bluestore, 4K Random Writes)
| Function          | Classic (Intel) | Classic (AMD) | Crimson (Intel) | Crimson (AMD) |
| do_futex_wait     | 0%              | 0%            | 76.0%           | 78.5%         |
| __lll_lock_wait   | 16.9%           | 17.7%         | 2.9%            | 1.9%          |
| pthread_cond_wait | 13.6%           | 13.8%         | 0%              | 0%            |
| __lll_unlock_wait | 8.4%            | 6.0%          | 0.2%            | 0.4%          |
| _write            | 5.2%            | 3.8%          | 0.7%            | 0.8%          |
| io_submit         | 2.7%            | 3.2%          | 1.9%            | 2.0%          |
Classic vs Crimson: Multi-Core Simulation
To simulate multi-core, multiple OSDs can run on the same NVMe device simultaneously. This can also improve per-device performance of the classic OSD, though typically not when CPU constrained.
Crimson currently requires multiple OSDs to hit similar performance as the classic OSD with more than 2 cores per NVMe device.
28
Multi-Core Simulation - Performance
Classic vs Crimson: Multi-Core Simulation
Previously it was mentioned that when Crimson is used with Bluestore, the core limit can be exceeded due to the pinning strategy and the inability to use numactl/taskset. In the multi-OSD random write tests, crimson actually used a little over 9 cores instead of 8.
29
Multi-Core Simulation - Cores Used
Classic vs Crimson: Multi-Core Simulation
To account for the additional core used on crimson, the IOPS per core can be calculated instead of the total IOPS.
Crimson performance per core is similar to classic except during 4K random reads with multiple OSDs. In that case Crimson is significantly faster, though not quite as fast per core as the single OSD setups.
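A trivial sketch of that normalization follows; the IOPS value is a placeholder (not a benchmark result), and only the "little over 9 cores" figure comes from the test above.

```cpp
// Per-core normalization used on this slide: divide measured IOPS by the
// cores actually consumed rather than the nominal core budget.
#include <cstdio>

int main() {
    const double measured_iops = 100000.0;  // placeholder, not a measurement
    const double cores_used    = 9.1;       // "a little over 9" cores, per the slide
    std::printf("IOPS per core: %.0f\n", measured_iops / cores_used);
    return 0;
}
```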
30
Multi-Core Simulation - Performance per Core
Classic vs Crimson: Multi-Core Simulation
Performance with larger IOs is more limited by the speed of the underlying NVMe device rather than the CPU. Still, multiple OSDs are needed to hit current NVMe throughput capabilities with crimson and bluestore.
Multi-core will be important for more than just small random IO, especially when erasure coding is utilized.
31
Multi-Core Simulation - Throughput
Classic vs Crimson: Sequential Write Test Setup
32
|                  | Classic                          | Crimson                          |
| OSDs             | 1x2core                          | 1x2core                          |
| Clients          | 1xfio (64GB preallocated librbd) | 1xfio (64GB preallocated librbd) |
| iodepth          | 1                                | 1                                |
| iosize           | 4K, 64K (seq write, dsync)       | 4K, 64K (seq write, dsync)       |
| Objectstore      | Bluestore                        | Bluestore (via alienstore)       |
| Memory allocator | tcmalloc, libc                   | libc + seastar                   |
| Test duration    | 300s                             | 300s                             |
Classic vs Crimson: Sequential Write
Some applications use small sequential dsync writes for journaling or other purposes. Network latency can affect this kind of workload. Running clients on localhost can minimize this effect.
RTT on Mako is much lower than on Officinalis. Both, however, have a low enough average RTT that internal OSD machinery should dominate latency.
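As a hedged sketch of the journaling-style pattern the dsync tests model (not the fio harness itself), the snippet below issues small sequential writes where each one must reach stable storage before the next is submitted.

```cpp
// Rough illustration of a dsync journaling pattern: O_DSYNC makes every
// write() wait for the data to hit stable storage, so each 4K block is
// durable before the next is issued. File name is a placeholder.
#include <fcntl.h>
#include <unistd.h>
#include <vector>

int main() {
    int fd = open("journal.bin", O_WRONLY | O_CREAT | O_DSYNC, 0644);
    if (fd < 0) return 1;
    std::vector<char> block(4096, 'j');          // 4K payload, like the 4K test
    for (int i = 0; i < 1000; ++i) {
        if (write(fd, block.data(), block.size()) < 0) break;  // sequential appends
    }
    close(fd);
    return 0;
}
```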
33
Sequential Dsync Write - Sanity Check
Localhost Ping Test (32 samples)
| Metric        | Officinalis (Intel) | Mako (AMD) |
| RTT min (ms)  | 0.013               | 0.002      |
| RTT avg (ms)  | 0.027               | 0.004      |
| RTT max (ms)  | 0.031               | 0.014      |
| RTT mdev (ms) | 0.005               | 0.003      |
Classic vs Crimson: Sequential Write
Crimson is slightly faster than classic in all test configurations. Mako is surprisingly quite a bit faster than Officinalis for 64K sequential dsync writes in both classic and crimson.
34
Sequential Dsync Write - Performance
Classic vs Crimson: Sequential Write
One of the ways that seastar tries to reduce latency is by polling in a tight loop inside the reactor. As a result, workloads that don’t generate a lot of traffic may use more CPU per operation than approaches that poll less aggressively.
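A toy, hedged illustration of that trade-off (not Seastar internals): the polling thread reacts with minimal latency but spins on a core while idle, whereas the blocking thread sleeps cheaply but pays a wake-up cost.

```cpp
// Busy-polling vs. blocking wait: the poller burns a full core while idle but
// notices the event immediately; the condition-variable waiter is cheap when
// idle but must be woken by the kernel.
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

std::atomic<bool> ready{false};

void poller() {
    while (!ready.load(std::memory_order_acquire)) {
        // busy-poll: lowest wake-up latency, ~100% of a core while waiting
    }
}

std::mutex m;
std::condition_variable cv;
bool ready_cv = false;

void waiter() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return ready_cv; });  // sleeps until notified
}

int main() {
    std::thread t1(poller), t2(waiter);
    ready.store(true, std::memory_order_release);
    { std::lock_guard<std::mutex> lk(m); ready_cv = true; }
    cv.notify_one();
    t1.join(); t2.join();
    return 0;
}
```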
35
Sequential Dsync Write - Efficiency
Classic vs Crimson: Sequential Write
Crimson also shows correspondingly lower average latency and generally lower 99% latency. The one exception is the 64K test on Officinalis; given that the average latency was lower for crimson, it’s likely that one or two outliers dragged the 99% latency up.
36
Sequential Dsync Write - Latency
Performance Vision
37
Crimson
Bluestore (via Alienstore)
Performance Vision
38
Seastore
Near-Term Goals
39
Compatibility and Correctness Testing
Better Accessibility
Implementation of Major Components and Features
Future Goals
40
Finer grained optimizations
RadosGW, CephFS, and Other Clients
Erasure Coding and other cost saving features
Thank you!
Ceph Crimson Team
41