Ceph Crimson Team
1
CEPH CRIMSON
2021 Quarter 3 Project Update
Multiyear effort. How are we doing?
Efficiency is critical for success.
High Performance: How do we get there?
CEPH CRIMSON
A refresh of the Ceph OSD focused on efficient resource consumption, higher performance, and modern designs.
2
Why Reimplement the OSD?
Resource efficiency
Enables new workloads
Future-proof for next generation hardware
Why Seastar? Why Crimson?
Lower CPU Usage
Lower Latency
Potentially Better Memory Allocation
Who is Contributing to Crimson?
5
Red Hat
Intel
Qihoo 360
Samsung
Others
2021 Q3 Project Stats
Started in 2018
30+ Unique Contributors
900+ Pull Requests
2800+ Commits
74K+ Lines of Code
Crimson Commit History
The 2021 Q2 commit rate was 2x higher than in Q1; Q3 had fewer commits than Q2 but still modestly more than Q1.
Note: Only includes work committed to the crimson directory in the Ceph master git repository.
6
2018-2019 Crimson Milestones
Statistics: 29 milestones over 2 years
Initial Prototype
Ephemeral Data Store
Replication
RBD Support
Error Handling
Regression Testing
….
7
2020 Crimson Milestones
Statistics: 26 milestones over 1 year
PGLog Support
Backfill (Mostly)
Recovery (Mostly)
Bluestore Support
Seastore (Initial)
Code Stabilization
....
8
2021 Q1-Q3 Crimson Milestones
Statistics: 34 milestones in 2021 Q1-Q3
Ongoing Seastore Dev
Metrics/Profiling
Seastore optimization
Better debugging
Bug fixes (not all shown)
9
Crimson Release Plans
10
Quincy (2022 Q1 (Q2?))
Post Quincy 1
Post Quincy 2
2021 Goals (WIP)
11
Q1 | Q2 | Q3 | Q4
2022 Crimson Goals (WIP)
12
Q1 | Q2 | Q3 | Q4
13
2021 Q3 Crimson/Bluestore Performance Preview
3 Factors to consider...
14
Memory Allocator: Crimson currently uses some combination of the seastar memory allocator and libc malloc, depending on compilation settings and whether bluestore is used.
Architecture: Crimson employs a different programming model for the OSD code than classic, even when bluestore is utilized. Do some systems favor one or the other?
Multi-Core: Crimson does not yet have multi-core support, though this feature is being actively worked on. We can, however, now run multiple OSDs on a single NVMe device.
Classic vs Crimson: General Test Setup
15
|                  | Classic                               | Crimson                               |
| OSDs             | 1x2core, 1x8core, 4x2core             | 1x2core, 1x8core, 4x2core             |
| Clients          | 4xfio (16GB preallocated librbd each) | 4xfio (16GB preallocated librbd each) |
| iodepth          | 4x32 (32/client)                      | 4x32 (32/client)                      |
| iosize           | 4K, 64K (randread, randwrite)         | 4K, 64K (randread, randwrite)         |
| Objectstore      | Bluestore                             | Bluestore (via alienstore)            |
| Memory allocator | tcmalloc, libc                        | libc + seastar                        |
| Test duration    | 300s                                  | 300s                                  |
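The fio clients drive IO through librbd. As a rough, hedged illustration of that path (not the benchmark harness itself), the sketch below performs a single synchronous 4K write through the librbd C++ API; the pool name ("rbd") and image name ("bench0") are placeholders.

```cpp
// Hedged illustration only: the benchmark uses fio's librbd engine; this
// sketch just shows one synchronous 4K write through librbd.
#include <rados/librados.hpp>
#include <rbd/librbd.hpp>
#include <string>

int main() {
    librados::Rados cluster;
    if (cluster.init(nullptr) < 0) return 1;            // default client id
    cluster.conf_read_file(nullptr);                    // default ceph.conf search
    if (cluster.connect() < 0) return 1;

    librados::IoCtx io;
    if (cluster.ioctx_create("rbd", io) < 0) return 1;  // placeholder pool name

    librbd::RBD rbd;
    librbd::Image image;
    if (rbd.open(io, image, "bench0") < 0) return 1;    // placeholder image name

    librados::bufferlist bl;
    bl.append(std::string(4096, 'x'));                  // one 4K block
    image.write(0, bl.length(), bl);                    // synchronous write at offset 0

    image.close();
    cluster.shutdown();
    return 0;
}
```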
Classic vs Crimson: Memory Allocation
The classic OSD is very sensitive to which memory allocator is used during compilation. Tcmalloc tends to be better than libc malloc at optimizing Ceph’s heavy use of dynamic memory.
Bluestore also relies on tcmalloc for the OSD memory autotuning system.
16
Allocation Overhead in Classic Worker Threads
tp_osd_tp Thread Wallclock Time (2 Cores, 1 OSD, 4GB osd_memory_target, 4K Random Writes)
| Function  | libc   | tcmalloc |
| new       | 9.22%  | 0.00%    |
| tc_new    | 0%     | 3.35%    |
| _int_free | 7.59%  | 0%       |
| cfree     | 1.20%  | 2.83%    |
| Total     | 18.01% | 6.18%    |
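The allocator sensitivity above comes from the OSD making many small, short-lived heap allocations. Below is a minimal, hedged sketch of that allocation pattern, not the Ceph code path; running it as-is under libc malloc and again with tcmalloc preloaded (e.g. LD_PRELOAD=libtcmalloc.so) is one way to observe the allocator overhead in isolation.

```cpp
// Illustration only: churn many small, short-lived allocations from several
// threads, roughly the pattern that makes the classic OSD allocator-sensitive.
#include <chrono>
#include <cstdio>
#include <memory>
#include <thread>
#include <vector>

static void churn(std::size_t iters) {
    std::vector<std::unique_ptr<char[]>> live;
    live.reserve(1024);
    for (std::size_t i = 0; i < iters; ++i) {
        // Allocate a small buffer (64-960 bytes), keeping a rolling window alive.
        live.emplace_back(new char[64 + (i % 15) * 64]);
        if (live.size() > 1024) live.erase(live.begin());
    }
}

int main() {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back(churn, 2'000'000);
    for (auto& w : workers) w.join();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start).count();
    std::printf("allocation churn took %lld ms\n", static_cast<long long>(ms));
    return 0;
}
```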
Classic vs Crimson: Memory Allocation
The reduction in overhead shown in the previous slide also plays out when looking at the difference in cycles/op.
Crimson can’t use tcmalloc yet for bluestore, but is already competitive with the classic OSD.
17
Allocation Overhead in Classic Worker Threads
Classic vs Crimson: Memory Allocation
When limited to 2 cores, Crimson is actually faster with bluestore than the classic OSD even though it’s not able to benefit from tcmalloc yet.
18
Allocation Overhead in Classic Worker Threads
Classic vs Crimson: Memory Allocation
Tcmalloc, however, greatly reduces memory fragmentation and space amplification, given Ceph’s aggressive dynamic allocation of small objects.
Classic with libc malloc and Crimson both use more memory than classic with tcmalloc.
19
Allocation Overhead in Classic Worker Threads
Classic vs Crimson: Architecture Test Nodes
20
|         | Mako (AMD)                 | Officinalis (Intel)            |
| CPU     | 1 x EPYC 7742 (64c/128t)   | 2 x Xeon Plat. 8276M (28c/56t) |
| RAM     | 8 x 16GB 3200 MT/s DDR4    | 12 x 32GB 2666 MT/s DDR4       |
| Network | 100GbE Mellanox ConnectX-6 | 100GbE Intel E810-C            |
| NVMe    | 6 x Samsung 4TB PM983      | 8 x Intel 8TB P4510            |
Classic vs Crimson: Architecture
In these tests, both classic and crimson use bluestore. Surprisingly, the older Xeon-based system appears to be faster than the newer AMD system. The effect is most pronounced with small random writes on classic.
21
Architecture Differences - Performance
Classic vs Crimson: Architecture
The performance differences appear to be proportional to the differences in efficiency: 4K random writes on classic show much lower efficiency on the AMD node than on the Intel node.
22
Architecture Differences - Efficiency
Classic vs Crimson: Architecture
The lower efficiency didn’t always translate into higher latency. Tail latency on Mako was much lower for reads but higher for writes.
Crimson had consistently lower latency than classic across the board. It especially showed lower 99% tail latency under this highly concurrent test.
23
Architecture Differences - Latency
Classic vs Crimson: Architecture
The classic OSD has 3 groups of threads that are usually busy during writes. In this case, comparing the msgr-worker threads shows more time spent in locking code and pthread_cond_broadcast on AMD.
24
Architecture Differences - Classic Messenger
Classic vs Crimson: Architecture
Currently crimson only has a single reactor thread. When used with bluestore, the reactor does all of the work not handled by the bluestore and worker threads. This includes the msgr thread work on the previous slide.
The single reactor is quite busy on both systems even when crimson is limited to 2 cores.
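For intuition, here is a minimal, hedged Seastar sketch (not Crimson code): everything chained off the returned future runs on a reactor thread, so with a single reactor (e.g. `--smp 1`) messenger work, OSD logic, and alienstore submission all share that one event loop.

```cpp
// Minimal Seastar sketch (not Crimson): all continuations below execute on
// the reactor thread, which is why a single busy reactor can become the
// bottleneck feeding bluestore. Run with e.g. `./a.out --smp 1`.
#include <seastar/core/app-template.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>
#include <chrono>
#include <iostream>

int main(int argc, char** argv) {
    seastar::app_template app;
    return app.run(argc, argv, [] {
        using namespace std::chrono_literals;
        // Stand-in for request handling work queued on the reactor.
        return seastar::sleep(10ms).then([] {
            std::cout << "handled on the reactor thread\n";
        });
    });
}
```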
25
Architecture Differences - Crimson Reactor
Classic vs Crimson: Architecture
Why does the profile of the bstore_kv_sync thread look so different? In classic, the entire OSD is pinned to shared cores via numactl, while Crimson silently ignores numactl.
Instead, the reactor gets a dedicated core, worker threads share a set of cores, and bluestore threads are placed by the kernel. Our “2 core” test isn’t really 2 cores in crimson (though it’s close).
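As a hedged illustration of that difference (not Crimson's actual placement code), the sketch below pins one thread to a dedicated core with pthread_setaffinity_np while leaving another thread for the kernel to place, roughly the split described above.

```cpp
// Illustration only: per-thread pinning vs. letting the kernel place a thread.
// numactl/taskset constrain the whole process; Seastar-style placement pins
// individual threads, which is why crimson can exceed an intended core budget.
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    std::thread reactor([] {
        pin_to_cpu(0);                       // dedicated "reactor" core
        std::puts("reactor-like thread pinned to cpu 0");
    });
    std::thread bstore_worker([] {
        // No affinity set: the kernel scheduler decides where this runs,
        // like the bluestore threads described on this slide.
        std::puts("worker thread left to the kernel scheduler");
    });
    reactor.join();
    bstore_worker.join();
    return 0;
}
```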
26
Architecture Differences - Bluestore Sync
bstore_kv_sync Thread Wallclock Time (2 Cores, 1 OSD, Bluestore, 4K Random Writes)
| Function                | Classic (Intel) | Classic (AMD) | Crimson (Intel) | Crimson (AMD) |
| pthread_cond_wait       | 23.3%           | 9.8%          | 76.2%           | 73.7%         |
| sync_file_range         | 22.6%           | 30.7%         | 5.4%            | 6.1%          |
| rocksdb::InlineSkipList | 11.9%           | 16.9%         | 2.7%            | 3.6%          |
| BlueFS::_consume_dirty  | 2.2%            | 3.4%          | 1.3%            | 4.3%          |
| __lll_unlock_wait       | 8.2%            | 11.9%         | 0.1%            | < 0.1%        |
| io_submit               | 4.8%            | < 0.1%        | < 0.1%          | 1.5%          |
Classic vs Crimson: Architecture
Comparing profiles from Crimson and Classic is tricky, especially when threads are fighting over cores in different ways.
For now, there are some common themes, such as rocksdb skiplist key comparisons in the bstore_kv_sync thread. The biggest takeaway is that crimson’s reactor thread is very busy and can’t fully feed bluestore.
27
Architecture Differences - Worker Threads
“tp” Worker Thread Wallclock Time (2 Cores, 1 OSD, Bluestore, 4K Random Writes)
| Function          | Classic (Intel) | Classic (AMD) | Crimson (Intel) | Crimson (AMD) |
| do_futex_wait     | 0%              | 0%            | 76.0%           | 78.5%         |
| __lll_lock_wait   | 16.9%           | 17.7%         | 2.9%            | 1.9%          |
| pthread_cond_wait | 13.6%           | 13.8%         | 0%              | 0%            |
| __lll_unlock_wait | 8.4%            | 6.0%          | 0.2%            | 0.4%          |
| _write            | 5.2%            | 3.8%          | 0.7%            | 0.8%          |
| io_submit         | 2.7%            | 3.2%          | 1.9%            | 2.0%          |
Classic vs Crimson: Multi-Core Simulation
To simulate multi-core, multiple OSDs can run on the same NVMe device simultaneously. This can also improve per-device performance of the classic OSD, though typically not when CPU constrained.
Crimson currently requires multiple OSDs to hit similar performance as the classic OSD with more than 2 cores per NVMe device.
28
Multi-Core Simulation - Performance
Classic vs Crimson: Multi-Core Simulation
Previously it was mentioned that when Crimson is used with Bluestore, the core limit can be exceeded due to the pinning strategy and the inability to use numactl/taskset. In the multi-OSD random write tests, crimson actually used a little over 9 cores instead of 8.
29
Multi-Core Simulation - Cores Used
Classic vs Crimson: Multi-Core Simulation
To account for the additional core used on crimson, the IOPS per core can be calculated instead of the total IOPS.
Crimson performance per core is similar to classic except during 4K random reads with multiple OSDs. In that case Crimson is significantly faster, though not quite as fast per core as the single OSD setups.
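A trivial sketch of that normalization follows; the IOPS value is a placeholder (not a benchmark result), and only the "little over 9 cores" figure comes from the test above.

```cpp
// Per-core normalization used on this slide: divide measured IOPS by the
// cores actually consumed rather than the nominal core budget.
#include <cstdio>

int main() {
    const double measured_iops = 100000.0;  // placeholder, not a measurement
    const double cores_used    = 9.1;       // "a little over 9" cores, per the slide
    std::printf("IOPS per core: %.0f\n", measured_iops / cores_used);
    return 0;
}
```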
30
Multi-Core Simulation - Performance per Core
Classic vs Crimson: Multi-Core Simulation
Performance with larger IOs is more limited by the speed of the underlying NVMe device rather than the CPU. Still, multiple OSDs are needed to hit current NVMe throughput capabilities with crimson and bluestore.
Multi-core will be important for more than just small random IO, especially when erasure coding is utilized.
31
Multi-Core Simulation - Throughput
Classic vs Crimson: Sequential Write Test Setup
32
|                  | Classic                          | Crimson                          |
| OSDs             | 1x2core                          | 1x2core                          |
| Clients          | 1xfio (64GB preallocated librbd) | 1xfio (64GB preallocated librbd) |
| iodepth          | 1                                | 1                                |
| iosize           | 4K, 64K (seq write, dsync)       | 4K, 64K (seq write, dsync)       |
| Objectstore      | Bluestore                        | Bluestore (via alienstore)       |
| Memory allocator | tcmalloc, libc                   | libc + seastar                   |
| Test duration    | 300s                             | 300s                             |
Classic vs Crimson: Sequential Write
Some applications use small sequential dsync writes for journaling or other purposes. Network latency can affect this kind of workload. Running clients on localhost can minimize this effect.
RTT on Mako is much lower than on Officinalis. Both, however, have a low enough average RTT that internal OSD machinery should dominate latency.
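As a hedged sketch of the journaling-style pattern the dsync tests model (not the fio harness itself), the snippet below issues small sequential writes where each one must reach stable storage before the next is submitted.

```cpp
// Rough illustration of a dsync journaling pattern: O_DSYNC makes every
// write() wait for the data to hit stable storage, so each 4K block is
// durable before the next is issued. File name is a placeholder.
#include <fcntl.h>
#include <unistd.h>
#include <vector>

int main() {
    int fd = open("journal.bin", O_WRONLY | O_CREAT | O_DSYNC, 0644);
    if (fd < 0) return 1;
    std::vector<char> block(4096, 'j');          // 4K payload, like the 4K test
    for (int i = 0; i < 1000; ++i) {
        if (write(fd, block.data(), block.size()) < 0) break;  // sequential appends
    }
    close(fd);
    return 0;
}
```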
33
Sequential Dsync Write - Sanity Check
Localhost Ping Test (32 samples)
| Metric        | Officinalis (Intel) | Mako (AMD) |
| RTT min (ms)  | 0.013               | 0.002      |
| RTT avg (ms)  | 0.027               | 0.004      |
| RTT max (ms)  | 0.031               | 0.014      |
| RTT mdev (ms) | 0.005               | 0.003      |
Classic vs Crimson: Sequential Write
Crimson is slightly faster than classic in all test configurations. Mako is surprisingly quite a bit faster than Officinalis for 64K sequential dsync writes in both classic and crimson.
34
Sequential Dsync Write - Performance
Classic vs Crimson: Sequential Write
One of the ways that seastar tries to reduce latency is by polling in a tight loop inside the reactor. As a result, workloads that don’t generate a lot of traffic may use more CPU per operation than approaches that poll less aggressively.
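A toy, hedged illustration of that trade-off (not Seastar internals): the polling thread reacts with minimal latency but spins on a core while idle, whereas the blocking thread sleeps cheaply but pays a wake-up cost.

```cpp
// Busy-polling vs. blocking wait: the poller burns a full core while idle but
// notices the event immediately; the condition-variable waiter is cheap when
// idle but must be woken by the kernel.
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

std::atomic<bool> ready{false};

void poller() {
    while (!ready.load(std::memory_order_acquire)) {
        // busy-poll: lowest wake-up latency, ~100% of a core while waiting
    }
}

std::mutex m;
std::condition_variable cv;
bool ready_cv = false;

void waiter() {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return ready_cv; });  // sleeps until notified
}

int main() {
    std::thread t1(poller), t2(waiter);
    ready.store(true, std::memory_order_release);
    { std::lock_guard<std::mutex> lk(m); ready_cv = true; }
    cv.notify_one();
    t1.join(); t2.join();
    return 0;
}
```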
35
Sequential Dsync Write - Efficiency
Classic vs Crimson: Sequential Write
Crimson also shows correspondingly lower average latency and generally lower 99% latency. The one exception is the 64K test on Officinalis; given that the average latency was lower for crimson, it’s likely that one or two outliers dragged the 99% latency up.
36
Sequential Dsync Write - Latency
Performance Vision
37
Crimson
Bluestore (via Alienstore)
Performance Vision
38
Seastore
Near-Term Goals
39
Compatibility and Correctness Testing
Better Accessibility
Implementation of Major Components and Features
Future Goals
40
Finer grained optimizations
RadosGW, CephFS, and Other Clients
Erasure Coding and other cost saving features
Thank you!
Ceph Crimson Team
41