1 of 41

CEPH CRIMSON

2021 Quarter 3 Project Update

Ceph Crimson Team

2 of 41

Multiyear effort. How are we doing?

Efficiency is critical for success.

High Performance: How do we get there?

CEPH CRIMSON

A refresh of the Ceph OSD focused on efficient resource consumption, higher performance, and modern designs.

3 of 41

Why Reimplement the OSD?

Resource efficiency

  • Better value for OpenShift, managed services, and edge.

Enables new workloads

  • Low, consistent latency is critical for tier 1 applications.
  • Competitive advantage from combining scale-out and high performance in one platform.

Future-proof for next generation hardware

  • Fast NVMe, persistent memory are here and growing.
  • NVMe-oF and ZNS devices complement Crimson well.
  • ZNS has potential to enable significant cost savings.

4 of 41

Why Seastar? Why Crimson?

Lower CPU Usage

  • Higher performance for dense nodes.
  • Cheaper at scale than classic OSDs.

Lower Latency

  • Sharded “shared-nothing” design.
  • Less lock contention.
  • Faster for sync workloads.

Potentially Better Memory Allocation

  • Existing OSDs fragment memory.
  • Avoid suboptimal cross-thread allocation/free behavior with shared-nothing sharding.
  • Seastar provides its own native, shard-local memory allocator.

5 of 41

Who is Contributing to Crimson?

Contributing organizations: Red Hat, Intel, Qihoo 360, Samsung, and others.

2021 Q3 Project Stats

  • Started in 2018
  • 30+ unique contributors
  • 900+ pull requests
  • 2800+ commits
  • 74K+ lines of code

  • Red Hat leads the Crimson Effort and has donated hardware and lab space for R&D.
  • Intel has a team working on Crimson and has donated hardware for R&D.
  • Qihoo 360 regularly contributes code.
  • Samsung has contributed code and donated hardware for R&D.
  • SuSE, IBM, Huawei, and others have also contributed code to the project.

6 of 41

The 2021 Q2 commit rate was 2x higher than in Q1. Q3 had fewer commits than Q2, but still modestly more than Q1.

Note

Only includes work committed to the crimson directory in the Ceph master git repository.


Crimson Commit History

7 of 41

Statistics

29 milestones over 2 years:

Initial Prototype

Ephemeral Data Store

Replication

RBD Support

Error Handling

Regression Testing

….


2018-2019 Crimson Milestones

8 of 41

Statistics

26 milestones over 1 year:

PGLog Support

Backfill (Mostly)

Recovery (Mostly)

Bluestore Support

Seastore (Initial)

Code Stabilization

....


2020 Crimson Milestones

9 of 41

Statistics

34 milestones in 2021 Q1-Q3:

Ongoing Seastore Dev

Metrics/Profiling

Seastore optimization

Better debugging

Bug fixes (not all shown)


2021 Q1-Q3 Crimson Milestones

10 of 41

Crimson Release Plans

Quincy (2022 Q1, possibly Q2)

  • Experimental preview
  • RBD functionality
  • Replication
  • Peering/Recovery
  • Multi-core

Post Quincy 1

  • Scrub
  • Erasure coding
  • RGW/CephFS
  • Snapshots

Post Quincy 2

  • Seastore

11 of 41

2021 Goals (WIP)

Planned across Q1-Q4:

  • Snapshots
  • Multi-core prototype
  • RGW support
  • Rook integration
  • Seastore subsystem testing

  • RADOS stress testing
  • Seastore prototype

  • Unit testing and code stabilization
  • Seastore subsystem development

12 of 41

2022 Crimson Goals (WIP)

Planned across Q1-Q4:

  • SPDK/io_uring evaluation
  • Further optimization / stabilization / bugfixing

  • Seastore stabilization
  • Multi-core stabilization
  • Migration to crimson

  • EC support
  • CephFS support
  • CephFS stress testing
  • Seastore optimization
  • Quincy release

13 of 41


2021 Q3 Crimson/Bluestore Performance Preview

14 of 41

3 Factors to consider...


Memory Allocator

Crimson currently uses a combination of the seastar memory allocator and libc malloc, depending on compilation settings and whether bluestore is used.

Architecture

Crimson employs a different programming model for the OSD code than the classic OSD does, even when bluestore is utilized. Do some systems favor one or the other?
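As a minimal illustration of that model, here is a seastar-style sketch in C++ (assuming seastar's app_template, future, and sleep APIs; this is not crimson's actual OSD code): work is expressed as chained futures that run to completion on one shard rather than as blocking calls handed to a thread pool.

```cpp
// A minimal seastar-style sketch of the run-to-completion model crimson
// builds on. Illustrative only; requires linking against seastar.
#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>
#include <chrono>
#include <iostream>

seastar::future<> handle_request(int id) {
    using namespace std::chrono_literals;
    // Simulate an asynchronous step (e.g. a disk or network completion)...
    return seastar::sleep(1ms).then([id] {
        // ...then continue on the same shard, with no locks on shared state.
        std::cout << "request " << id << " done\n";
    });
}

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        return handle_request(1).then([] { return handle_request(2); });
    });
}
```

Because each shard owns its own data and runs its own reactor, this style avoids most of the cross-thread locking that shows up in the classic OSD's thread pools.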

Multi-Core

Crimson does not yet have multi-core support, but this feature is being actively worked on. We can, however, now run multiple OSDs on a single NVMe device.

15 of 41

Classic vs Crimson: General Test Setup

Parameter          Classic                                   Crimson
OSDs               1x2core, 1x8core, 4x2core                 1x2core, 1x8core, 4x2core
Clients            4x fio (16GB preallocated librbd each)    4x fio (16GB preallocated librbd each)
IO depth           4x32 (32 per client)                      4x32 (32 per client)
IO size            4K, 64K (randread, randwrite)             4K, 64K (randread, randwrite)
Objectstore        Bluestore                                 Bluestore (via alienstore)
Memory allocator   tcmalloc, libc                            libc + seastar
Test duration      300s                                      300s

16 of 41

Classic vs Crimson: Memory Allocation

The classic OSD is very sensitive to which memory allocator is used at build time. Tcmalloc tends to be better than libc malloc at handling Ceph's heavy use of dynamic memory.

Bluestore also relies on tcmalloc for the OSD memory autotuning system.

Allocation Overhead in Classic Worker Threads

tp_osd_tp Thread Wallclock Time
2 Cores, 1 OSD, 4GB osd_memory_target, 4K Random Writes

Function     libc     tcmalloc
new          9.22%    0.00%
tc_new       0%       3.35%
_int_free    7.59%    0%
cfree        1.20%    2.83%
Total        18.01%   6.18%

17 of 41

Classic vs Crimson: Memory Allocation

The reduction in overhead shown in the previous slide also plays out when looking at the difference in cycles/op.

Crimson can’t use tcmalloc yet for bluestore, but is already competitive with the classic OSD.


Memory Allocation Differences - Cycles per Op

18 of 41

Classic vs Crimson: Memory Allocation

When limited to 2 cores, Crimson is actually faster with bluestore than the classic OSD even though it’s not able to benefit from tcmalloc yet.


Memory Allocation Differences - Performance (2 Cores)

19 of 41

Classic vs Crimson: Memory Allocation

Tcmalloc, however, greatly reduces memory fragmentation and space amplification given Ceph's aggressive dynamic allocation of small objects.

Both classic with libc malloc and Crimson use more memory than classic with tcmalloc.


Memory Allocation Differences - Memory Usage

20 of 41

Classic vs Crimson: Architecture Test Nodes

           Mako (AMD)                    Officinalis (Intel)
CPU        1 x EPYC 7742 (64c/128t)      2 x Xeon Platinum 8276M (28c/56t)
RAM        8 x 16GB 3200 MT/s DDR4       12 x 32GB 2666 MT/s DDR4
Network    100GbE Mellanox ConnectX-6    100GbE Intel E810-C
NVMe       6 x Samsung 4TB PM983         8 x Intel 8TB P4510

21 of 41

Classic vs Crimson: Architecture

In these tests both classic and crimson use bluestore. Surprisingly, the older Xeon-based system appears to be faster than the newer AMD system. The effect is most pronounced with small random writes on classic.


Architecture Differences - Performance

22 of 41

Classic vs Crimson: Architecture

The performance differences appear to be proportional to the differences in efficiency. 4K random writes on classic show much lower efficiency on the AMD node than on the Intel node.


Architecture Differences - Efficiency

23 of 41

Classic vs Crimson: Architecture

The lower efficiency didn't always translate into higher latency. The tail latency on Mako was much lower for reads but higher for writes.

Crimson had consistently lower latency than classic across the board. It especially showed lower 99% tail latency under this highly concurrent test.


Architecture Differences - Latency

24 of 41

Classic vs Crimson: Architecture

The classic OSD has 3 groups of threads that are usually busy during writes. In this case:

  1. 3 x msgr-worker
  2. 1 x bstore-kv-sync
  3. 4 x “tp” workers

Comparing msgr-worker threads shows more time spent in locking code and pthread_cond_broadcast on AMD.
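As background for why broadcast and lock waits dominate these profiles, here is a generic condition-variable work-queue sketch (plain C++, not Ceph's actual messenger or thread-pool code): each notify_all maps to pthread_cond_broadcast on Linux, waking every sleeping worker, which then all contend on the same mutex.

```cpp
// Generic producer/consumer queue illustrating broadcast wake-up contention.
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

std::mutex mtx;
std::condition_variable cv;
std::deque<std::function<void()>> queue;
bool done = false;

void worker() {
    for (;;) {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [] { return done || !queue.empty(); });  // pthread_cond_wait
        if (done && queue.empty()) return;
        auto job = std::move(queue.front());
        queue.pop_front();
        lk.unlock();
        job();
    }
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) workers.emplace_back(worker);
    for (int i = 0; i < 1000; ++i) {
        { std::lock_guard<std::mutex> g(mtx); queue.push_back([] {}); }
        cv.notify_all();  // broadcast: all sleeping workers wake and race for the lock
    }
    { std::lock_guard<std::mutex> g(mtx); done = true; }
    cv.notify_all();
    for (auto& t : workers) t.join();
}
```

The wait side of this pattern is what shows up as pthread_cond_wait in the worker-thread profiles later in this deck.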


Architecture Differences - Classic Messenger

25 of 41

Classic vs Crimson: Architecture

Currently crimson has only a single reactor thread. When used with bluestore, the reactor does all of the work not handled by the bluestore and worker threads, including the messenger work shown on the previous slide.

The single reactor is quite busy on both systems even when crimson is limited to 2 cores.
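Conceptually, the hand-off between the reactor and the bluestore threads looks like the following sketch (plain C++ using std::async as a stand-in; this is not the actual alienstore code): a blocking objectstore-style call is pushed to another thread, and the submitting thread is free to keep dispatching work until it collects the result.

```cpp
// Conceptual sketch of offloading a blocking store operation to a helper
// thread. In crimson, the seastar reactor plays the submitting role and the
// bluestore worker threads play the helper role.
#include <future>
#include <iostream>
#include <string>

// Stand-in for a blocking bluestore-style operation.
std::string blocking_write(const std::string& obj) {
    // ... synchronous I/O would happen here ...
    return "wrote " + obj;
}

int main() {
    // Offload the blocking call so the calling thread can keep dispatching
    // other work (messaging, PG logic, etc.).
    std::future<std::string> result =
        std::async(std::launch::async, blocking_write, std::string("rbd_data.1"));

    // ... a reactor-style thread would poll and dispatch other events here ...

    std::cout << result.get() << "\n";  // collect the completion later
}
```

In crimson today a single reactor feeds those helper threads, which is why it tends to become the bottleneck under load.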


Architecture Differences - Crimson Reactor

26 of 41

Classic vs Crimson: Architecture

Why does the profile of the bstore_kv_sync thread look so different? In classic, the entire OSD is pinned to shared cores via numactl. Crimson silently ignores numactl.

Instead, the reactor gets a dedicated core and the worker threads share a set of cores, but the bluestore threads are placed by the kernel. Our "2 core" test isn't really 2 cores in crimson (though it's close).
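For reference, the sketch below shows the kind of explicit CPU affinity that numactl/taskset apply to the classic OSD, using pthread_setaffinity_np in a small standalone C++ program (illustrative only, not crimson or seastar code). A thread with no explicit mask is placed by the kernel scheduler, which is what happens to crimson's bluestore threads.

```cpp
// Explicit CPU affinity vs. kernel placement. Build with: g++ -pthread
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>

void pin_current_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // Restrict the calling thread to a single core.
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    std::thread pinned([] {
        pin_current_thread(0);  // comparable to running under taskset -c 0
        std::printf("pinned thread on cpu %d\n", sched_getcpu());
    });
    std::thread unpinned([] {
        // No affinity mask set: the kernel may place this thread on any core,
        // analogous to crimson's bluestore worker threads.
        std::printf("unpinned thread on cpu %d\n", sched_getcpu());
    });
    pinned.join();
    unpinned.join();
}
```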

Architecture Differences - Bluestore Sync

bstore_kv_sync Thread Wallclock Time
2 Cores, 1 OSD, Bluestore, 4K Random Writes

                           Classic            Crimson
Function                   Intel    AMD       Intel    AMD
pthread_cond_wait          23.3%    9.8%      76.2%    73.7%
sync_file_range            22.6%    30.7%     5.4%     6.1%
rocksdb::InlineSkipList    11.9%    16.9%     2.7%     3.6%
BlueFS::_consume_dirty     2.2%     3.4%      1.3%     4.3%
__lll_unlock_wait          8.2%     11.9%     0.1%     < 0.1%
io_submit                  4.8%     < 0.1%    < 0.1%   1.5%

27 of 41

Classic vs Crimson: Architecture

Comparing profiles from Crimson and Classic is tricky, especially when threads are fighting over cores in different ways.

For now, there are some common themes, such as rocksdb skiplist key comparisons in the bstore_kv_sync thread. The biggest takeaway is that crimson's reactor thread is very busy and can't fully feed bluestore.

Architecture Differences - Worker Threads

“tp” Worker Thread Wallclock Time
2 Cores, 1 OSD, Bluestore, 4K Random Writes

                     Classic            Crimson
Function             Intel    AMD       Intel    AMD
do_futex_wait        0%       0%        76.0%    78.5%
__lll_lock_wait      16.9%    17.7%     2.9%     1.9%
pthread_cond_wait    13.6%    13.8%     0%       0%
__lll_unlock_wait    8.4%     6.0%      0.2%     0.4%
_write               5.2%     3.8%      0.7%     0.8%
io_submit            2.7%     3.2%      1.9%     2.0%

28 of 41

Classic vs Crimson: Multi-Core Simulation

To simulate multi-core, multiple OSDs can be run on the same NVMe device simultaneously. This can also improve per-device performance of the classic OSD, though typically not when CPU-constrained.

Crimson currently requires multiple OSDs to hit similar performance as the classic OSD with more than 2 cores per NVMe device.


Multi-Core Simulation - Performance

29 of 41

Classic vs Crimson: Multi-Core Simulation

Previously it was mentioned that when Crimson is used with Bluestore, the core limit can be exceeded due to the pinning strategy used and the inability to use numactl/taskset. In the multi-OSD random write tests, crimson actually used a little over 9 cores instead of 8.


Multi-Core Simulation - Cores Used

30 of 41

Classic vs Crimson: Multi-Core Simulation

To account for the additional core used by crimson, IOPS per core (measured IOPS divided by the cores actually used) can be compared instead of total IOPS.

Crimson performance per core is similar to classic except during 4K random reads with multiple OSDs. In that case Crimson is significantly faster, though not quite as fast per core as the single OSD setups.


Multi-Core Simulation - Performance per Core

31 of 41

Classic vs Crimson: Multi-Core Simulation

Performance with larger IOs is more limited by the speed of the underlying NVMe device rather than the CPU. Still, multiple OSDs are needed to hit current NVMe throughput capabilities with crimson and bluestore.

Multi-core will be important for more than just small random IO, especially when erasure coding is utilized.


Multi-Core Simulation - Throughput

32 of 41

Classic vs Crimson: Sequential Write Test Setup

Parameter          Classic                              Crimson
OSDs               1x2core                              1x2core
Clients            1x fio (64GB preallocated librbd)    1x fio (64GB preallocated librbd)
IO depth           1                                    1
IO size            4K, 64K (seq write, dsync)           4K, 64K (seq write, dsync)
Objectstore        Bluestore                            Bluestore (via alienstore)
Memory allocator   tcmalloc, libc                       libc + seastar
Test duration      300s                                 300s

33 of 41

Classic vs Crimson: Sequential Write

Some applications use small sequential dsync writes for journaling or other purposes. Network latency can affect this kind of workload. Running clients on localhost can minimize this effect.

RTT on Mako is much lower than on Officinalis. Both, however, have low enough average RTT that internal OSD machinery should dominate latency.

Sequential Dsync Write - Sanity Check

Localhost Ping Test (32 samples)

                Officinalis (Intel)    Mako (AMD)
RTT min (ms)    0.013                  0.002
RTT avg (ms)    0.027                  0.004
RTT max (ms)    0.031                  0.014
RTT mdev (ms)   0.005                  0.003

34 of 41

Classic vs Crimson: Sequential Write

Crimson is slightly faster than classic in all test configurations. Mako is surprisingly quite a bit faster than Officinalis for 64K sequential dsync writes with both classic and crimson.


Sequential Dsync Write - Performance

35 of 41

Classic vs Crimson: Sequential Write

One of the ways that seastar tries to reduce latency is by polling in a tight loop inside the reactor. As a result, workloads that don't generate a lot of traffic may use more CPU per operation than approaches that poll less aggressively.
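A toy illustration of the trade-off (plain C++, not seastar's actual reactor): a loop that polls for work never blocks, so it picks up a request immediately but keeps its core busy even when requests are rare.

```cpp
// Toy polling event loop: low latency, but the core stays busy while idle.
#include <atomic>
#include <chrono>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

std::mutex mtx;
std::deque<std::function<void()>> tasks;  // pending work items
std::atomic<bool> running{true};

void polling_loop() {
    while (running) {
        std::function<void()> task;
        {
            std::lock_guard<std::mutex> g(mtx);
            if (!tasks.empty()) { task = std::move(tasks.front()); tasks.pop_front(); }
        }
        if (task) task();  // run ready work immediately: low latency
        // No sleep or wait here: the loop keeps spinning when idle, so a
        // lightly loaded shard spends many cycles per operation compared
        // with a thread that blocks until work arrives.
    }
}

int main() {
    std::thread reactor(polling_loop);
    {
        std::lock_guard<std::mutex> g(mtx);
        tasks.push_back([] { /* handle a request */ });
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    running = false;
    reactor.join();
}
```

Blocking designs hand the core back when idle, at the cost of a wake-up on every request; that is the efficiency trade-off shown here.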


Sequential Dsync Write - Efficiency

36 of 41

Classic vs Crimson: Sequential Write

Crimson also shows correspondingly lower average latency and generally lower 99% latency. The one exception is the 64K test on Officinalis. Given that the average latency was lower for crimson, it's likely that one or two outliers dragged the 99% latency up.


Sequential Dsync Write - Latency

37 of 41

Performance Vision


Crimson

  • Multi-core should show better performance and efficiency than the classic OSD.
  • Continue to show lower latency and higher performance with in-memory (cyanstore) vs classic memstore.

Bluestore (via Alienstore)

  • Better or equal performance, efficiency, and memory usage vs classic for highly concurrent workloads.
  • Better or equal performance and memory usage for synchronous sequential workloads.
  • Maintain existing gains while working toward 100% functional test coverage.

38 of 41

Performance Vision


Seastore

  • Still too early to set specific performance goals, but some high-level goals drive the design.
  • Significantly lower write amplification vs bluestore.
  • Lower space amplification vs bluestore.
  • Lower latency and especially tail latency.
  • Reduction in the redundancy of work at all levels.

39 of 41

Near-Term Goals


Compatibility and Correctness Testing

  • Ongoing work to pass existing OSD tests.
  • Code base is being stabilized on multiple fronts.
  • Still need to implement better test coverage.

Better Accessibility

  • Some work done toward better debugging.
  • Some work done toward better profiling.
  • High-impact bugs can still slow down testing and development.

Implementation of Major Components and Features

  • Multi-reactor support is key to improving performance. Work to begin this fall.
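As a rough sketch of what multi-reactor operation looks like at the seastar level (assuming seastar's smp::submit_to and this_shard_id APIs; the actual crimson design will differ), each core runs its own reactor, and work for state owned by another shard is forwarded to that shard instead of being guarded by locks.

```cpp
// Cross-shard submission sketch; requires linking against seastar.
#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>
#include <iostream>

seastar::future<> ping_other_shard() {
    // Pick a neighboring shard and forward work to it as a message,
    // rather than touching its state under a lock.
    unsigned target = (seastar::this_shard_id() + 1) % seastar::smp::count;
    return seastar::smp::submit_to(target, [] {
        std::cout << "running on shard " << seastar::this_shard_id() << "\n";
    });
}

int main(int argc, char** argv) {
    seastar::app_template app;  // run with e.g. --smp 2 to get two reactors
    return app.run(argc, argv, [] { return ping_other_shard(); });
}
```

Message passing between shards replaces shared locks, which is why multi-core crimson is expected to scale more efficiently than adding threads to the classic OSD.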

40 of 41

Future Goals


Finer grained optimizations

  • Bluestore took several years to optimize; Crimson is even more ambitious.
  • SPDK/io_uring show a lot of promise but are also under active development. Watch closely.
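For context, here is a minimal liburing example (C++; an illustration of the io_uring interface, not anything crimson-specific): submission and completion rings are shared with the kernel, so an asynchronous read can be issued and reaped with very few syscalls.

```cpp
// Single asynchronous read via io_uring; error handling trimmed for brevity.
// Build with: g++ example.cpp -luring
#include <liburing.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);  // small submission/completion queues

    int fd = open("/etc/hostname", O_RDONLY);
    char buf[256] = {0};

    struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf) - 1, 0);  // queue an async read
    io_uring_submit(&ring);                                 // one syscall submits it

    struct io_uring_cqe* cqe;
    io_uring_wait_cqe(&ring, &cqe);  // wait for the completion
    std::printf("read %d bytes: %s", cqe->res, buf);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
}
```

SPDK goes a step further by polling NVMe queues entirely in user space, which pairs naturally with seastar's polling reactor.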

RadosGW, CephFS, and Other Clients

  • Basic support needed for testing.
  • RGW and CephFS need CLS classes.
  • Focus on core components and correctness first.

Erasure Coding and other cost saving features

  • Let’s get replication right first, however...
  • Getting EC right is very important in the long run to further improve the value proposition for users.

41 of 41

Thank you!

Ceph Crimson Team
