1 of 37

rv64ilp32 - The future of 32-bit Linux

Guo Ren <guoren@kernel.org>

2 of 37

https://lwn.net/Articles/838807/

32-bit is waiting for its death.

3 of 37

32ilp32 vs. 64ilp32 vs. 64lp64

  • Why 32-bit pointer?

  • Why 64-bit ISA?

           32ilp32                     64ilp32              64lp64
           (Traditional 32-bit ABI)    (New 32-bit ABI)     (Traditional 64-bit ABI)

pointer    32                          32                   64
ISA        32                          64                   64

4 of 37

  1. Why 32-bit pointer?
  2. Why 64-bit ISA?

5 of 37

It is absolutely idiotic to have 64-bit pointers when I compile a program that uses less than 4 gigabytes of RAM. When such pointer values appear inside a struct, they not only waste half the memory, they effectively throw away half of the cache. – Knuth (2008)

A Flame About 64-bit Pointers

https://www-cs-faculty.stanford.edu/~knuth/news08.html

6 of 37

LP64 wastes 25% more memory than ILP32

Test Environment:

  • Total memory is 16 MB (4096 pages)
  • Using tinyconfig
  • The dmesg outputs are aligned line by line.

ilp32: 4096 - 3406 = 690 pages

lp64: 4096 - 3231 = 865 pages

(865 - 690) / 690 ≈ 25%

sizeof(xxx)       ILP32    LP64

struct page       32       64
list_head         8        16
hlist_node        8        16
vm_area_struct    68       136
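The doubling in the table follows directly from pointer width, and the 25% figure from the page counts above. A minimal sketch, assuming fixed-width integers as stand-ins for pointers so the sizes come out the same on any host; the `*_ilp32` / `*_lp64` struct names and `extra_percent()` are hypothetical models for illustration, not kernel definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical models: integers of fixed width stand in for pointers, so the
 * result does not depend on the pointer width of the machine compiling this. */
struct list_head_ilp32  { uint32_t next, prev; };   /* two 32-bit pointers */
struct list_head_lp64   { uint64_t next, prev; };   /* two 64-bit pointers */
struct hlist_node_ilp32 { uint32_t next, pprev; };
struct hlist_node_lp64  { uint64_t next, pprev; };

/* A pointer-only structure doubles in size when pointers double. */
static_assert(sizeof(struct list_head_lp64) == 2 * sizeof(struct list_head_ilp32),
              "list_head doubles under LP64");
static_assert(sizeof(struct hlist_node_lp64) == 2 * sizeof(struct hlist_node_ilp32),
              "hlist_node doubles under LP64");

/* Extra consumption of `other` relative to `base`, in whole percent
 * (rounded down). */
static unsigned extra_percent(unsigned base, unsigned other)
{
    assert(base > 0 && other >= base);
    return (other - base) * 100u / base;
}
```

With the page counts from this slide, `extra_percent(690, 865)` evaluates to 25.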

7 of 37

8 of 37

Unmatched

9 of 37

Why did they choose 64-bit ISA?

WHY?

10 of 37

  • Why 32-bit pointer?
  • Why 64-bit ISA?

11 of 37

32-bit ISA for application processors!?

  • X86-S killed the 32-bit ISA.
  • Armv9 deprecated the 32-bit ISA.
  • RISC-V profiles (RVA23/RVB23/RVA22/RVA20) have never included a 32-bit ISA.

12 of 37

32-bit mode throws away half of every register

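[Figure: memory hierarchy below the processor (ALU + registers). Access latency: registers 0.2ns, cache 1~40ns, memory 80~140ns, SSD/HD > 10us. Capacity labels in the figure: MB~GB, GB~TB, TB~PB. Levels closer to the ALU are faster but costlier; levels farther away are slower but cheaper.]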

13 of 37

memcpy/memload/memset performance in Linux kernel

lw/sw vs. ld/sd
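The difference between lw/sw and ld/sd is how many bytes one load/store pair moves. A minimal sketch that only counts accesses, assuming a length that is a multiple of the access width; `copy_words32()` and `copy_words64()` are hypothetical helpers, and the real kernel routines are hand-tuned and also handle alignment and tails:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy len bytes with 4-byte accesses, the width one lw/sw pair moves.
 * Returns the number of load/store pairs issued. */
static size_t copy_words32(void *dst, const void *src, size_t len)
{
    assert(len % 4 == 0);
    size_t pairs = 0;
    for (size_t off = 0; off < len; off += 4, pairs++) {
        uint32_t w;
        memcpy(&w, (const char *)src + off, sizeof w);   /* models lw */
        memcpy((char *)dst + off, &w, sizeof w);         /* models sw */
    }
    return pairs;
}

/* Same copy with 8-byte accesses, the width one ld/sd pair moves. */
static size_t copy_words64(void *dst, const void *src, size_t len)
{
    assert(len % 8 == 0);
    size_t pairs = 0;
    for (size_t off = 0; off < len; off += 8, pairs++) {
        uint64_t w;
        memcpy(&w, (const char *)src + off, sizeof w);   /* models ld */
        memcpy((char *)dst + off, &w, sizeof w);         /* models sd */
    }
    return pairs;
}
```

For the same buffer, the 64-bit variant issues half as many load/store pairs, which is the effect the slide's comparison is about.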

14 of 37

Does a pure 32-bit ISA make the chip area smaller?

ARM said: “Compared to Cortex-A35, the Cortex-A32 offers same 32-bit performance but consumes 10% less power and has a 13% smaller core.”

https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/introducing-cortex-a32-arm-s-smallest-lowest-power-armv8-a-processor-for-next-generation-32-bit-embedded-applications

  1. AArch64 has 31 GPRs, but AArch32 has only 15 (16 GPRs removed!).
  2. AArch64 has 32 128-bit SIMD registers, but AArch32 has only 16 (16 SIMD registers removed!).
  3. Cortex-A35 is “AArch32 + AArch64”, but Cortex-A32 is “AArch32” only.

13% / 4 = 3.25%

For RISC-V:

  • RV32 and RV64 have the same number of GPRs, FPU registers, and vector registers.
  • Generally speaking, in embedded application processor scenarios, the CPU core area accounts for no more than 10% of the entire SoC.

3.25% * 10% = 0.325%

Therefore, the benefit is too small to matter.

15 of 37

64-bit ISA is a visionary and wise choice!

WISE!

16 of 37

Our Solution: rv64ilp32

Run 32-bit pointers on the RISC-V 64-bit ISA

17 of 37

The world’s first 64ilp32 ABI Linux kernel!

PATCH [01 - 11] u64ilp32

PATCH [12 - 36] s64ilp32

u64ilp32: User space support is similar to x86-x32, mips-n32, and arm64-ilp32.

s64ilp32 - The world’s first 64ilp32 ABI Linux kernel!

18 of 37

[RFC PATCH V2] rv64ilp32 patches:

                      RV64                                              RV32

M-mode (opensbi)      m64lp64                                           m32ilp32
S-mode (kernel)       s64ilp32             s64lp64                      s32ilp32
U-mode (userspace)    u32ilp32 u64ilp32    u32ilp32 u64ilp32 u64lp64    u32ilp32

19 of 37

s64ilp32 vs. s32ilp32

  • memcpy / copy_from(to)_user / get(put)_user: ld -> lw
  • eBPF JIT: full -> lite
  • Native Atomic64 -> GENERIC_ATOMIC64
  • Native 64-bit algorithms -> GENERIC_LIB_ASHLDI3, GENERIC_LIB_ASHRDI3, GENERIC_LIB_LSHRDI3, GENERIC_LIB_UCMPDI2, …
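To see what a generic 64-bit helper costs on a 32-bit ISA, here is a minimal sketch of a 64-bit left shift built from 32-bit halves. It follows the usual two-case scheme (shift below 32, shift of 32 or more); `struct u64_halves`, `shl64_on_32()`, and `join_halves()` are hypothetical names for illustration, not the kernel's implementation:

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value held as two 32-bit words, as on a 32-bit ISA. */
struct u64_halves { uint32_t lo, hi; };

static uint64_t join_halves(struct u64_halves v)
{
    return ((uint64_t)v.hi << 32) | v.lo;
}

/* 64-bit left shift using only 32-bit shifts and an OR. */
static struct u64_halves shl64_on_32(struct u64_halves v, unsigned shift)
{
    assert(shift < 64);
    struct u64_halves r;
    if (shift == 0) {
        r = v;
    } else if (shift >= 32) {
        /* The low word moves entirely into the high word. */
        r.lo = 0;
        r.hi = v.lo << (shift - 32);
    } else {
        /* Bits leaving the low word are carried into the high word. */
        r.lo = v.lo << shift;
        r.hi = (v.hi << shift) | (v.lo >> (32 - shift));
    }
    return r;
}
```

On a 64-bit ISA the same operation is a single shift instruction, which is why s64ilp32 can use native 64-bit arithmetic instead of helpers of this kind.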

20 of 37

Fedora 38 with “s64ilp32 + u32ilp32”

  • GCC, Binutils, glibc, benchmarks (PLCT)
  • Linux kernel (Alibaba)
  • Fedora (PLCT & Red Hat)

1800+ rv32 Fedora packages

  • XuanTie C908 (Kendryte K230)
  • XuanTie C907 (Coming soon)

21 of 37

Next: [RFC PATCH V3] s64ilp32 + u64lp64

s64ilp32 + u64lp64 (2GB)

s64lp64 + u64lp64 (128TB)

Proof of Concept:

22 of 37

Next: [RFC PATCH V3] s64ilp32 + u64lp64

                      RV64                          RV32

M-mode (opensbi)      m64lp64                       m32ilp32
S-mode (kernel)       s64ilp32 = s64lp64            s32ilp32
U-mode (userspace)    u32ilp32 u64ilp32 u64lp64     u32ilp32

23 of 37

Final Goal: s64ilp32 + u64ilp32

Reuse the 64-bit system call table, then delete the 32-bit ISA and its 32-bit system call table from Linux.

                      RV64                          RV32

M-mode (opensbi)      m64lp64                       m32ilp32
S-mode (kernel)       s64ilp32 = s64lp64            s32ilp32
U-mode (userspace)    u32ilp32 u64ilp32 u64lp64     u32ilp32

24 of 37

rv64ilp32 ensures these chips succeed!

SUCCESS!

The future of 32-bit Linux - rv64ilp32

25 of 37

END

26 of 37

Backup

27 of 37

Linux doesn’t like 32-bit ISA

28 of 37

eBPF JIT

  • The eBPF registers are 64-bit, while ISA registers are 32-bit. BPF registers either map directly to 2 registers, or reside in stack scratch space and are saved and restored when used.
  • Many 64-bit ALU operations do not trivially map to 32-bit operations. Operations that move bits between high and low words, such as ADD, LSH, MUL, and others must emulate the 64-bit behavior in terms of 32-bit instructions.

ref: https://lore.kernel.org/netdev/20200220041608.30289-1-lukenels@cs.washington.edu/
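The emulation described above can be sketched in C. This is a minimal illustration, assuming each 64-bit BPF register is held as a pair of 32-bit words; `struct reg_pair`, `add64_on_32()`, and `mul64_on_32()` are hypothetical names, not code from the JIT:

```c
#include <assert.h>
#include <stdint.h>

/* One 64-bit BPF register mapped onto two 32-bit machine registers. */
struct reg_pair { uint32_t lo, hi; };
static_assert(sizeof(struct reg_pair) == 8, "a pair holds exactly 64 bits");

static struct reg_pair u64_to_pair(uint64_t v)
{
    struct reg_pair r = { (uint32_t)v, (uint32_t)(v >> 32) };
    return r;
}

static uint64_t pair_to_u64(struct reg_pair r)
{
    return ((uint64_t)r.hi << 32) | r.lo;
}

/* 64-bit ADD: add the low words, detect the carry, add it to the high words. */
static struct reg_pair add64_on_32(struct reg_pair a, struct reg_pair b)
{
    struct reg_pair r;
    r.lo = a.lo + b.lo;
    uint32_t carry = r.lo < a.lo;   /* unsigned wrap-around means a carry out */
    r.hi = a.hi + b.hi + carry;
    return r;
}

/* 64-bit MUL (low 64 bits): one 32x32->64 product plus two cross terms. */
static struct reg_pair mul64_on_32(struct reg_pair a, struct reg_pair b)
{
    uint64_t low = (uint64_t)a.lo * b.lo;   /* a mul/mulhu pair on RV32 */
    struct reg_pair r;
    r.lo = (uint32_t)low;
    r.hi = (uint32_t)(low >> 32) + a.lo * b.hi + a.hi * b.lo;
    return r;
}
```

Each of these is a single instruction (add, mul) on a 64-bit ISA, so a JIT targeting RV64 can map BPF registers one-to-one.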

29 of 37

Using native 64-bit ALU instructions improves crypto algorithms

/*
 * On some 32-bit architectures (h8300), GCC ends up using
 * over 1 KB of stack if we inline the round calculation into the loop
 * in keccakf(). On the other hand, on 64-bit architectures with plenty
 * of [64-bit wide] general purpose registers, not inlining it severely
 * hurts performance. So let's use 64-bitness as a heuristic to decide
 * whether to inline or not.
 */
#ifdef CONFIG_64BIT
#define SHA3_INLINE inline
#else
#define SHA3_INLINE noinline
#endif

/* update the state with given number of rounds */

static SHA3_INLINE void keccakf_round(u64 st[25])
{
	u64 t[5], tt, bc[5];

	/* Theta */
	bc[0] = st[0] ^ st[5] ^ st[10] ^ st[15

30 of 37

Using native 64-bit load/store improves user space access

static int rseq_get_rseq_cs(struct task_struct *t, struct rseq_cs *rseq_cs)
{
	u64 ptr;

#ifdef CONFIG_64BIT
	/* A single native 64-bit load from user space. */
	if (get_user(ptr, &t->rseq->rseq_cs))
		return -EFAULT;
#else
	/* No native 64-bit load: fall back to a generic copy. */
	if (copy_from_user(&ptr, &t->rseq->rseq_cs, sizeof(ptr)))
		return -EFAULT;
#endif

31 of 37

Using native 64-bit atomics to implement CMPXCHG_DOUBLE

mm/slub.c:

	if (s->flags & __CMPXCHG_DOUBLE) {
		ret = __update_freelist_fast(slab, freelist_old, counters_old,
					     freelist_new, counters_new);
	} else {
		ret = __update_freelist_slow(slab, freelist

32 of 37

Improvements

33 of 37

Sign-extend addressing

Traditional x86-x32, mips-n32, and arm64-ilp32 all use zero-extended addressing, so the compiler has to insert additional zero-extension instructions, which costs code size and performance.

So rv64ilp32 introduces a new solution called sign-extended addressing.
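The two ways of widening a 32-bit pointer to a 64-bit address can be sketched as follows. On RV64, 32-bit loads and the *W arithmetic instructions already leave their results sign-extended in the 64-bit register, so the sign-extended form needs no extra instruction. The function names are hypothetical, and what rv64ilp32 maps into the upper half of the range is not covered by this sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Zero-extended addressing: the upper 32 bits are always cleared. */
static uint64_t zero_extend_addr(uint32_t p)
{
    return (uint64_t)p;
}

/* Sign-extended addressing: bit 31 is replicated into the upper 32 bits. */
static uint64_t sign_extend_addr(uint32_t p)
{
    return (p & 0x80000000u) ? (0xFFFFFFFF00000000ull | p) : (uint64_t)p;
}

/* For pointers with bit 31 clear, both schemes yield the same address. */
static void check_low_half_agrees(uint32_t p)
{
    assert((p & 0x80000000u) == 0);
    assert(zero_extend_addr(p) == sign_extend_addr(p));
}
```

The schemes differ only for pointers with bit 31 set: `0x80000000` becomes `0x0000000080000000` when zero-extended and `0xFFFFFFFF80000000` when sign-extended.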

34 of 37

Stack size optimization

Traditional x86-x32, mips-n32, and arm64-ilp32 all save callee-saved registers as 64-bit values, which wastes half of that stack space in ILP32 scenarios.

So rv64ilp32 plans to use a 32-bit stack layout for callee-saved registers.
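A rough sketch of the saving, assuming the twelve RISC-V callee-saved integer registers s0 to s11 and a save area rounded up to 16-byte stack alignment. `save_area_bytes()` is a hypothetical helper, and the final rv64ilp32 frame layout may differ from this model:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of stack needed to spill nregs registers in slots of slot_bytes,
 * rounded up to a 16-byte boundary (assumed stack alignment). */
static size_t save_area_bytes(size_t nregs, size_t slot_bytes)
{
    assert(slot_bytes == 4 || slot_bytes == 8);
    size_t raw = nregs * slot_bytes;
    return (raw + 15) & ~(size_t)15;
}
```

Under these assumptions, a function that saves all of s0 to s11 needs `save_area_bytes(12, 8)` = 96 bytes with 64-bit slots and `save_area_bytes(12, 4)` = 48 bytes with 32-bit slots.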

35 of 37

Current Problems

36 of 37

y2038 problem

Traditional 32-bit Linux will stop working correctly in 2038, when the 32-bit time_t overflows. This is a long-standing historical problem.

For rv64ilp32:

  • Move to 64-bit system call table
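The overflow point can be checked directly. A minimal sketch, assuming a host whose `time_t` is 64 bits wide and whose `gmtime()` accepts negative values (as on Linux); `utc_from_seconds()` is a hypothetical helper:

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

static_assert(sizeof(time_t) >= 8, "this sketch assumes a 64-bit time_t host");

/* Convert seconds since the Unix epoch to broken-down UTC time. */
static struct tm utc_from_seconds(int64_t secs)
{
    time_t t = (time_t)secs;
    struct tm *p = gmtime(&t);
    assert(p != NULL);
    return *p;
}
```

`utc_from_seconds(INT32_MAX)` gives 2038-01-19 03:14:07 UTC, the last second a signed 32-bit time_t can represent. One second later such a counter wraps to `INT32_MIN`, which `utc_from_seconds(INT32_MIN)` decodes as 1901-12-13 20:45:52 UTC. A 64-bit time value, as used by the 64-bit system call table, does not hit this limit.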

37 of 37

GCC problem

https://github.com/Liaoshihua/RV64-ILP32/