1 of 37

rv64ilp32 - The future of 32-bit Linux

Guo Ren <guoren@kernel.org>

2 of 37

https://lwn.net/Articles/838807/

32-bit is waiting for its death.

3 of 37

32ilp32 vs. 64ilp32 vs. 64lp64

Why 32-bit pointer?

Why 64-bit ISA?

	32ilp32 (Traditional 32-bit ABI)	64ilp32 (New 32-bit ABI)	64lp64 (Traditional 64-bit ABI)
pointer	32	32	64
ISA	32	64	64

4 of 37

Why 32-bit pointer?
Why 64-bit ISA?

5 of 37

It is absolutely idiotic to have 64-bit pointers when I compile a program that uses less than 4 gigabytes of RAM. When such pointer values appear inside a struct, they not only waste half the memory, they effectively throw away half of the cache. – Knuth (2008)

https://www-cs-faculty.stanford.edu/~knuth/news08.html (A Flame About 64-bit Pointers)

6 of 37

LP64 wastes 25% more memory

Test Environment:

Total memory is 16 MB (4096 pages)
Using tinyconfig
The dmesg output is aligned line by line.

ilp32 = (4096 - 3406) = 690 pages

lp64 = (4096 - 3231) = 865 pages

(865 - 690)/690 ≈ 25%

sizeof(xxx)	ILP32	LP64
struct page	32	64
list_head	8	16
hlist_node	8	16
vm_area_struct …	68	136
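
The doubling in the table follows directly from pointer width: list_head and hlist_node each consist of two pointers. A minimal sketch, assuming simplified user-space stand-ins for the kernel definitions (the demo_ names are hypothetical and not kernel identifiers), lets any C11 compiler confirm the relationship for the ABI it targets:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel list types (hypothetical demo_ names). */
struct demo_list_head {
    struct demo_list_head *next, *prev;
};

struct demo_hlist_node {
    struct demo_hlist_node *next, **pprev;
};

/* Bytes occupied by n list heads under the ABI this file is compiled for. */
static size_t demo_list_bytes(size_t n)
{
    return n * sizeof(struct demo_list_head);
}

/* Both structs are exactly two pointers wide: 8 bytes on ILP32, 16 on LP64. */
static_assert(sizeof(struct demo_list_head) == 2 * sizeof(void *), "two pointers");
static_assert(sizeof(struct demo_hlist_node) == 2 * sizeof(void *), "two pointers");
```

Compiling the same file for an ILP32 and an LP64 target changes only sizeof(void *), which is the whole difference the table reports for these two structs.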

7 of 37

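Knuth also pointed out that needless use of 64-bit pointers lowers cache efficiency and therefore hurts performance. To verify this, we compared SPEC2006 performance under 64ilp32 and 64lp64. The results are as follows:

Yellow bars show measurements on the SiFive Unmatched board. A bar pointing up is the performance gain of 32-bit pointers over 64-bit pointers; a bar pointing down is a performance loss.
Blue bars show measurements on the Allwinner D1 board.

Note that the rv64ilp32 compiler is still under development and its optimization is not yet mature, so performance drops in the 456.hmmer test case. Our analysis shows that this is a compiler performance issue, which will be fixed in the future.

Overall, on the same rv64 ISA hardware, 32-bit pointers deliver a significant performance gain over 64-bit pointers. This further supports Knuth's view: where 64-bit addressing is not needed, 32-bit pointers are the wiser choice, because they improve cache utilization and thereby performance.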

8 of 37

Unmatched

9 of 37

Why did they choose 64-bit ISA?

WHY?

10 of 37

Why 32-bit pointer?
Why 64-bit ISA?

11 of 37

32-bit ISA for application processors!?

X86-S killed the 32-bit ISA.
Armv9 deprecated the 32-bit ISA.
RISC-V profiles (RVA23/RVB23/RVA22/RVA20) have never included a 32-bit ISA.

12 of 37

32-bit mode throws away half of each register

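[Figure: memory hierarchy pyramid, from "Faster but costlier" at the top to "Slower but cheaper" at the bottom: Registers (0.2ns), Cache (1~40ns), Memory (80~140ns), SSD/HD (> 10us), with capacity labels MB~GB, GB~TB, TB~PB. An inset shows the processor containing the ALU and the registers.]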

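Both Armv8 and x86 64-bit processors retain a 32-bit ISA mode. This mode squeezes the 64-bit registers down to 32-bit use, discarding the upper 32 bits of each register. Knuth called wasting half of the cache and memory "absolutely idiotic". What, then, should we say about wasting half of the registers?

Registers are a precious storage resource in a computer system: they feed the execution units of the processor pipeline directly and are critical to performance. Yet for the sake of compatibility we have wasted half of the register resources for more than a decade. The root cause is that both Arm and x86 have a deeply entrenched legacy 32-bit software ecosystem, which keeps them on the traditional 32-bit ISA.

RISC-V is different: it carries no 32-bit legacy, and its 32-bit software ecosystem is still in its infancy, which makes this the ideal moment to correct course and optimize. The rv64ilp32 ABI targets exactly this opportunity. It aims to avoid the mistakes of earlier architectures and to give embedded RISC-V application processors a better new 32-bit solution that replaces the aging traditional 32-bit architecture, winning on both performance and cost.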

13 of 37

memcpy/memload/memset performance in Linux kernel

lw/sw vs. ld/sd

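So what does throwing away half of each register actually cost? On the same RISC-V 64-bit hardware, we compared the performance of the Linux kernel memcpy/memload/memset functions under rv64ilp32 and rv32ilp32. Blue bars are the Allwinner D1 platform; orange bars are the Sophgo sg2042 platform. rv64ilp32 outperforms rv32ilp32 in every test case. On the sg2042 the average gain is close to 2x, thanks to the ample bandwidth of its memory controller. The D1 is limited by memory bandwidth, so its gain is smaller than on the sg2042 but still very significant. The results show that a 64-bit ISA has a large performance advantage over a 32-bit one because it raises the throughput of the pipeline, much like the caliber of a cannon: the larger the caliber, the greater the power.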
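
The lw/sw versus ld/sd difference can be illustrated with a small sketch. This is not the kernel's memcpy and not a benchmark; it only counts how many load/store pairs a simple copy loop issues at each access width. All demo_ names are hypothetical, and the sketch assumes a length that is a multiple of 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes (n a multiple of 8) with 32-bit accesses, in the style of lw/sw.
 * Returns the number of load/store pairs issued. */
static size_t demo_copy_w32(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t pairs = 0;
    for (size_t off = 0; off < n; off += sizeof(uint32_t)) {
        uint32_t w;
        memcpy(&w, src + off, sizeof w);   /* one 32-bit load  */
        memcpy(dst + off, &w, sizeof w);   /* one 32-bit store */
        pairs++;
    }
    return pairs;
}

/* Same copy with 64-bit accesses, in the style of ld/sd. */
static size_t demo_copy_w64(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t pairs = 0;
    for (size_t off = 0; off < n; off += sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, src + off, sizeof w);   /* one 64-bit load  */
        memcpy(dst + off, &w, sizeof w);   /* one 64-bit store */
        pairs++;
    }
    return pairs;
}

/* Copies a 64-byte buffer both ways, checks both results, and returns
 * pairs(32-bit) / pairs(64-bit). */
static size_t demo_pair_ratio(void)
{
    unsigned char src[64], a[64], b[64];
    for (size_t i = 0; i < sizeof src; i++)
        src[i] = (unsigned char)i;
    size_t p32 = demo_copy_w32(a, src, sizeof src);
    size_t p64 = demo_copy_w64(b, src, sizeof src);
    assert(memcmp(a, src, sizeof src) == 0);
    assert(memcmp(b, src, sizeof src) == 0);
    return p32 / p64;
}
```

Halving the number of memory instructions does not by itself guarantee a 2x speedup. As the measurements above show, the gain actually realized depends on the memory bandwidth of the platform.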

14 of 37

Does a pure 32-bit ISA make the chip area smaller?

ARM said: “Compared to Cortex-A35, the Cortex-A32 offers same 32-bit performance but consumes 10% less power and has a 13% smaller core.”

https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/introducing-cortex-a32-arm-s-smallest-lowest-power-armv8-a-processor-for-next-generation-32-bit-embedded-applications

AArch64 has 31 GPRs, but AArch32 only has 15 GPRs (16 GPRs removed!).
AArch64 has 32 128-bit SIMD Regs, but AArch32 only has 16 (16 SIMD Regs removed!).
Cortex-A35 is “AArch32 + AArch64”, but Cortex-A32 is only “AArch32”.
13% / 4 = 3.25%

For RISC-V:

The numbers of GPRs, FPU registers, and vector registers are the same for RV32 and RV64.
Generally speaking, in embedded application processor scenarios, the CPU core area accounts for no more than 10% of the entire SoC.
3.25% * 10% = 0.325%
Therefore, the benefit is too small to matter.

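Can a pure 32-bit ISA reduce chip area? Our peers at ARM have already tried: the Cortex-A32 was derived from the Cortex-A35 by removing the 64-bit mode and keeping only the 32-bit mode, and the core area dropped by 13%. That sounds good at first, but the 13% is not that simple, because it also includes changes other than the width reduction:

AArch64 has 31 registers but AArch32 has only 15, so 16 general-purpose registers were removed. The total number of bits affected by this change equals that of narrowing the general-purpose registers from 64 bits to 32 bits.
AArch64 has 32 128-bit SIMD registers but AArch32 has only 16, so another 16 SIMD registers were removed. The total number of bits affected by this change is 4 times that of narrowing the general-purpose registers.
The Cortex-A35 combines AArch32 and AArch64 and is not a pure 64-bit ISA processor, while the Cortex-A32 supports only AArch32 and is a pure 32-bit ISA processor, so this is not an AArch64 vs. AArch32 comparison.

So when we carry the 13% over to RISC-V, we discount it for the three factors above: 13% / 4 = 3.25%, and that is only from the perspective of a single core.
Looking at the whole application processor SoC, once the L2 cache, system bus, and various IP blocks are added, the CPU cores account for no more than 10% of the chip area (usually less than 5%).

In summary, the area benefit that a 32-bit ISA brings to the whole SoC is simply too small!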
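
The estimate on this slide can be written out as one worked equation. The divisor 4 is the talk's own discount for the Cortex-A32 changes unrelated to register width, and 10% is its upper bound on the CPU core share of SoC area. Both are taken as given here rather than derived.

```latex
\[
\frac{13\%}{4} = 3.25\% \quad\text{(per core)}, \qquad
3.25\% \times 10\% = 0.325\% \quad\text{(whole SoC)}
\]
```

With the "usually less than 5%" core share mentioned in the notes, the same product would drop below 0.2%.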

15 of 37

64-bit ISA is a visionary and wise choice!

WISE!

16 of 37

Our Solution: rv64ilp32
Run 32-bit pointers on the RISC-V 64-bit ISA

17 of 37

The world’s first 64ilp32 ABI Linux kernel!

PATCH [01 - 11] u64ilp32

PATCH [12 - 36] s64ilp32

https://lore.kernel.org/linux-riscv/20231112061514.2306187-1-guoren@kernel.org/

u64ilp32: User space support is similar to x86-x32, mips-n32, and arm64-ilp32.

s64ilp32 - The world’s first 64ilp32 ABI Linux kernel!

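Implementing rv64ilp32 means introducing a brand-new ABI. This is a huge undertaking that requires substantial investment to move the whole software ecosystem forward. Some people hesitate out of fear of the challenge, and that mood has even spread across the Linux world: x86-x32, mips-n32, and arm64-ilp32 all failed, so why would RISC-V succeed? I disagree. Those efforts failed because the 32-bit ABIs of those architectures were deeply entrenched and hard to change; anyone who wanted 32-bit pointers could simply run existing software with the hardware in 32-bit mode. RISC-V's 32-bit software ecosystem, however, is still in its infancy and carries no historical baggage.

We learned from our predecessors and innovated on that basis. On top of the u64ilp32 user-space ABI, we implemented the first s64ilp32 Linux kernel, and we chose embedded small-memory systems as the entry point, which is closer to real-world use (rather than the scientific computing and benchmarking scenarios of x86-x32).

We implemented 36 patches for the Linux kernel:

The first 11 patches add u64ilp32 user-space support; x86, MIPS, and ARM only got this far.
The remaining 25 patches implement the s64ilp32 Linux kernel, which lets 32-bit Linux run on 64-bit ISA hardware. It is the world's first Linux kernel with a 64ilp32 ABI.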

18 of 37

[RFC PATCH V2] rv64ilp32 patches:

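[Figure: supported ABI combinations by privilege level (M-mode: opensbi; S-mode: kernel; U-mode: userspace) for each ISA. RV64: m64lp64 firmware; s64lp64 kernel with u32ilp32, u64ilp32, u64lp64 userspace; s64ilp32 kernel with u32ilp32, u64ilp32 userspace. RV32: m32ilp32 firmware; s32ilp32 kernel; u32ilp32 userspace.]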

19 of 37

s64ilp32 vs. s32ilp32

memcpy/copy_from(to)_user/get(put)_user ld -> lw
eBPF JIT full -> lite
Native Atomic64 -> GENERIC_ATOMIC64
Native 64-bit algorithm -> GENERIC_LIB_ASHLDI3, GENERIC_LIB_ASHRDI3, GENERIC_LIB_LSHRDI3, GENERIC_LIB_UCMPDI2, …
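
To show what the right-hand side of this comparison costs, here is a minimal sketch of a 64-bit left shift assembled from 32-bit halves, which is the kind of work an ashldi3-style helper does on a 32-bit ISA. It is an illustration, not the kernel's library code, and demo_shl64_via32 is a hypothetical name. On a 64-bit ISA the same operation is a single shift instruction.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit left shift built only from 32-bit operations; valid for 0 <= b <= 63. */
static uint64_t demo_shl64_via32(uint64_t v, unsigned b)
{
    uint32_t lo = (uint32_t)v;          /* low word  */
    uint32_t hi = (uint32_t)(v >> 32);  /* high word */

    if (b >= 32) {
        /* Everything in the low word moves into the high word. */
        hi = lo << (b - 32);
        lo = 0;
    } else if (b != 0) {
        /* Bits shifted out of the low word are carried into the high word. */
        hi = (hi << b) | (lo >> (32 - b));
        lo <<= b;
    }
    return ((uint64_t)hi << 32) | lo;
}
```

The b == 0 case is handled separately because shifting a 32-bit value by 32 is undefined in C.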

20 of 37

Fedora 38 with “s64ilp32 + u32ilp32”

GCC, Binutils, Glibc, Benchmark (PLCT)
Linux kernel (Alibaba)
Fedora (PLCT & Red Hat)

1800+ rv32 Fedora packages

XuanTie C908 (Kendryte K230)
XuanTie C907 (Coming soon)

21 of 37

Next: [RFC PATCH V3] s64ilp32 + u64lp64

s64ilp32 + u64lp64 (2GB)

s64lp64 + u64lp64 (128TB)

Proof of Concept:

22 of 37

Next: [RFC PATCH V3] s64ilp32 + u64lp64

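[Figure: ABI combinations by privilege level (M-mode: opensbi; S-mode: kernel; U-mode: userspace) for each ISA. RV64: m64lp64 firmware; s64ilp32 = s64lp64 at the kernel level, with u32ilp32, u64ilp32, u64lp64 userspace. RV32: m32ilp32 firmware; s32ilp32 kernel; u32ilp32 userspace.]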

23 of 37

Final Goal: s64ilp32 + u64ilp32

Reuse the 64-bit system call table, then delete the 32-bit ISA and its 32-bit system call table from Linux.

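[Figure: ABI combinations by privilege level (M-mode: opensbi; S-mode: kernel; U-mode: userspace) for each ISA. RV64: m64lp64 firmware; s64ilp32 = s64lp64 at the kernel level, with u32ilp32, u64ilp32, u64lp64 userspace. RV32: m32ilp32 firmware; s32ilp32 kernel; u32ilp32 userspace.]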

24 of 37

rv64ilp32 ensures these chips succeed!

SUCCESS!

The future of 32-bit Linux - rv64ilp32

25 of 37

END

26 of 37

Backup

27 of 37

Linux doesn’t like 32-bit ISA

28 of 37

eBPF JIT

The eBPF registers are 64-bit, while ISA registers are 32-bit. BPF registers either map directly to 2 registers, or reside in stack scratch space and are saved and restored when used.
Many 64-bit ALU operations do not trivially map to 32-bit operations. Operations that move bits between high and low words, such as ADD, LSH, MUL, and others must emulate the 64-bit behavior in terms of 32-bit instructions.
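
As an illustration of "emulating the 64-bit behavior in terms of 32-bit instructions", the sketch below performs a 64-bit ADD using only 32-bit arithmetic plus an explicit carry. It is not the JIT's emitted code, only the same idea expressed in C, and demo_add64_via32 is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit addition using only 32-bit adds and an explicit carry. */
static uint64_t demo_add64_via32(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint32_t lo = a_lo + b_lo;        /* wraps modulo 2^32 */
    uint32_t carry = lo < a_lo;       /* 1 if the low-word add overflowed */
    uint32_t hi = a_hi + b_hi + carry;

    return ((uint64_t)hi << 32) | lo;
}
```

One 64-bit register pair therefore needs three adds and a compare where a 64-bit ISA needs a single add.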

ref: https://lore.kernel.org/netdev/20200220041608.30289-1-lukenels@cs.washington.edu/

29 of 37

Native 64-bit ALU instructions improve crypto algorithms

/*
 * On some 32-bit architectures (h8300), GCC ends up using
 * over 1 KB of stack if we inline the round calculation into the loop
 * in keccakf(). On the other hand, on 64-bit architectures with plenty
 * of [64-bit wide] general purpose registers, not inlining it severely
 * hurts performance. So let's use 64-bitness as a heuristic to decide
 * whether to inline or not.
 */
#ifdef CONFIG_64BIT
#define SHA3_INLINE inline
#else
#define SHA3_INLINE noinline
#endif

/* update the state with given number of rounds */
static SHA3_INLINE void keccakf_round(u64 st[25])
{
	u64 t[5], tt, bc[5];

	/* Theta */
	bc[0] = st[0] ^ st[5] ^ st[10] ^ st[15

30 of 37

Native 64-bit load/store improves user space access

static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
{
	…
#ifdef CONFIG_64BIT
	if (get_user(ptr, &t->rseq->rseq_cs))
		return -EFAULT;
#else
	if (copy_from_user(&ptr, &t->rseq->rseq_cs, sizeof(ptr)))
		return -EFAULT;
#endif

31 of 37

Native 64-bit atomics implement CMPXCHG_DOUBLE

mm/slub.c:

	if (s->flags & __CMPXCHG_DOUBLE) {
		ret = __update_freelist_fast(slab, freelist_old, counters_old,
					    freelist_new, counters_new);
	} else {
		ret = __update_freelist_slow(slab, freelist
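
The idea behind the fast path can be sketched in user space with C11 atomics. With 32-bit pointers, a freelist pointer and a 32-bit counters word fit together in one 64-bit word, so a single native 64-bit compare-and-swap updates both. This is an illustration under that assumption, not the slub implementation; demo_pack and demo_update_pair are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Pack a 32-bit freelist value and 32-bit counters into one 64-bit word. */
static uint64_t demo_pack(uint32_t freelist, uint32_t counters)
{
    return ((uint64_t)counters << 32) | freelist;
}

/* Replace (fl_old, cnt_old) with (fl_new, cnt_new) in one atomic step.
 * Returns false, leaving *slot untouched, if either field has changed. */
static bool demo_update_pair(_Atomic uint64_t *slot,
                             uint32_t fl_old, uint32_t cnt_old,
                             uint32_t fl_new, uint32_t cnt_new)
{
    uint64_t expected = demo_pack(fl_old, cnt_old);
    return atomic_compare_exchange_strong(slot, &expected,
                                          demo_pack(fl_new, cnt_new));
}
```

A 32-bit ISA without a native 64-bit atomic has to fall back to a locked slow path for the same update.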

32 of 37

Improvements

33 of 37

Sign-extend addressing

Traditional x86-x32, mips-n32, and arm64-ilp32 all use zero-extend addressing, and the compiler needs to insert additional zero-extend instructions, which causes code size and performance problems.

So rv64ilp32 introduces a new solution called sign-extend addressing.
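
The difference between the two conventions is easiest to see on a 32-bit pointer whose top bit is set. RV64 instructions that produce 32-bit results, such as lw and addw, leave them sign-extended in the 64-bit register, so a pointer kept in that form needs no extra instruction before use. The sketch below only models the two register images in C; the demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Register image of a 32-bit pointer under sign-extend addressing.
 * The uint32_t -> int32_t conversion wraps on mainstream compilers. */
static uint64_t demo_sign_extend(uint32_t p)
{
    return (uint64_t)(int64_t)(int32_t)p;
}

/* Register image of the same pointer under zero-extend addressing. */
static uint64_t demo_zero_extend(uint32_t p)
{
    return (uint64_t)p;
}
```

For pointers below 0x80000000 the two images are identical. They differ only in the upper half of the 32-bit range, where the sign-extended form has all upper 32 bits set.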

34 of 37

Stack size optimization

Traditional x86-x32, mips-n32, and arm64-ilp32 all save callee-saved registers in 64-bit stack slots, which wastes half of that stack space in ILP32 scenarios.

So rv64ilp32 plans to use a 32-bit stack layout for callee-saved registers.
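
A rough sketch of the saving, assuming a worst-case function that spills ra plus all twelve callee-saved registers s0 to s11, and assuming the usual 16-byte stack alignment of the RISC-V psABI. The 4-byte slot size is a reading of the slide's "32-bit stack layout", not a published rv64ilp32 specification, and demo_spill_bytes is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_STACK_ALIGN 16u  /* assumed: sp is kept 16-byte aligned */

/* Frame bytes needed to spill nregs registers in slots of slot_bytes each,
 * rounded up to the stack alignment. */
static size_t demo_spill_bytes(size_t nregs, size_t slot_bytes)
{
    size_t raw = nregs * slot_bytes;
    return (raw + DEMO_STACK_ALIGN - 1) / DEMO_STACK_ALIGN * DEMO_STACK_ALIGN;
}
```

Under these assumptions, 13 registers need 112 bytes with 64-bit slots and 64 bytes with 32-bit slots.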

35 of 37

Current Problems

36 of 37

y2038 problem

Traditional 32-bit Linux will stop working in 2038, when the 32-bit time_t overflows. This is a long-standing historical defect.

For rv64ilp32:

Move to 64-bit system call table
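
The overflow date can be checked with a short sketch. It assumes it is built on a host whose time_t is 64-bit, so that gmtime can represent the boundary value. The demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* UTC calendar year in which a signed 32-bit seconds counter reaches its maximum. */
static int demo_last_year_of_time32(void)
{
    time_t last = (time_t)INT32_MAX;   /* 2^31 - 1 seconds after the epoch */
    struct tm *utc = gmtime(&last);
    return utc ? utc->tm_year + 1900 : -1;
}

/* One second later, a 32-bit counter wraps to its most negative value
 * (the conversion wraps on mainstream compilers). */
static int32_t demo_wrapped_time32(void)
{
    return (int32_t)((uint32_t)INT32_MAX + 1u);
}
```

With a 64-bit time_t, which is what the 64-bit system call table uses, the same counter does not wrap for billions of years.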

37 of 37

GCC problem

https://github.com/Liaoshihua/RV64-ILP32/