1 of 23

2022-10-27

Co-champions: Marat Dukhan, Zhi An Ng

2 of 23

Today

  • Recap of Relaxed SIMD proposal
  • Suggestions
    • Drop Consistency
    • Relaxed Instructions as Imports
    • Explicit Tests for Architecture Specifics
  • Phase 4 poll

3 of 23

Why Relaxed SIMD?

4 of 23

From Wasm SIMD to Relaxed SIMD

😁 When the native SIMD ISA (SSE4.1) is close to WebAssembly SIMD, WebML achieves 66-87% of native performance 🚀

⚠️ Newer SIMD ISAs are more capable than Wasm SIMD

  • Fused Multiply-Add
  • Integer and BFloat16 Dot Product instructions
  • Efficient Min/Max, Swizzle, and Float-to-Int conversion instructions

💡 These instructions are not in Wasm SIMD because they can’t be fully portably implemented across all architectures

5 of 23

Relaxed SIMD proposal

Introduce limited non-determinism to close the Web-native gap

Floating-point instructions: limited variability in rounding

  • Relaxed Multiply-Add computes a·b+c with either one or two roundings
  • Relaxed BFloat16 Dot Product allows either ARM-style c+(alo·blo+ahi·bhi) or x86-style (c+alo·blo)+ahi·bhi computation.

Integer instructions: limited variability in out-of-bounds behavior

  • Float→Int conversion either saturates or produces INT_MIN for out-of-bounds inputs

6 of 23

Limiting non-deterministic damage: fallback

All Relaxed SIMD instructions have reference lowering to deterministic WebAssembly SIMD instructions

7 of 23

Limiting non-deterministic damage: portability

Relaxed SIMD instructions provide portability guarantees so long as users stay within the defined boundaries

  • Tolerant to floating-point expression contraction
    • Most code is tolerant because GCC & Clang enable contraction by default.
  • Avoid out-of-bounds inputs to integer operations
    • Guarantee that at least one input to Q15-format Multiplication is not INT16_MIN

💡 Case study: XNNPack (neural network library) was ported to Relaxed SIMD and uses the same code path across all architectures without testing for implementation-specific behaviour.

8 of 23

Limiting non-deterministic damage: consistency

Given the same inputs, Relaxed SIMD instructions must produce the same outputs on the same host

  • Reduce fingerprinting surface
  • Improve testability

💡Case study: expm1(x) for x≤0

8 Relaxed Multiply-Adds, same input

→ 2 possible outputs with consistency

→ 256 possible outputs without consistency

v128_t vn = wasm_f32x4_relaxed_madd(vx, vlog2e, vmagic_bias);
v128_t vs = wasm_i32x4_shl(vn, 23);
vn = wasm_f32x4_sub(vn, vmagic_bias);
v128_t vt = wasm_f32x4_relaxed_madd(vn, vminus_ln2_hi, vx);
vt = wasm_f32x4_relaxed_madd(vn, vminus_ln2_lo, vt);
const v128_t vm = wasm_f32x4_le(vx, vsat_cutoff);
vs = wasm_v128_andnot(vs, vm);
vt = wasm_v128_andnot(vt, vm);
v128_t vp = wasm_f32x4_relaxed_madd(vc6, vt, vc5);
vp = wasm_f32x4_relaxed_madd(vp, vt, vc4);
vp = wasm_f32x4_relaxed_madd(vp, vt, vc3);
vp = wasm_f32x4_relaxed_madd(vp, vt, vc2);
vp = wasm_f32x4_mul(vp, vt);
vt = wasm_f32x4_mul(vt, vs);
const v128_t vsm1 = wasm_f32x4_sub(vs, vone);
vp = wasm_f32x4_relaxed_madd(vp, vt, vt);
const v128_t vf = wasm_f32x4_add(vp, vsm1);

9 of 23

Limiting non-deterministic damage: tooling

Relaxed SIMD instructions are explicitly opt-in:

  • Require a compiler flag to enable: -mrelaxed-simd in Clang
  • Not generated by auto-vectorizer
  • Can be used only via intrinsic functions

With a little help from toolchains, developers can verify expected behavior:

  • V8 offers the --no-enable-fma3 option to disable use of FMA on x86
  • UBSan could support validating that inputs are within bounds

10 of 23

Suggestion: Drop Consistency

Allow instructions with the same inputs to produce different results:

f32x4.relaxed_madd x y z
...
f32x4.relaxed_madd x y z

➕ Avoids introducing a new kind of non-determinism

➖ Software would still rely on consistency and may break if Wasm engines do context-dependent lowering of relaxed instructions

11 of 23

Suggestion: Relaxed Instructions as Imports

Allow instructions with the same inputs to produce different results:

(func $r (import "🆆" "wasm_relaxed_madd") ;; special namespace
  (param v128) (param v128) (param v128) (result v128))

➕ Sidesteps introducing a new kind of non-determinism in Wasm

➖ The difference is mostly theoretical; moving non-determinism from Wasm into the environment doesn’t change anything in practice

➖ Mozilla suggested this does not fit well into SpiderMonkey

12 of 23

Suggestion: Explicit Tests for Arch-Specifics

if (wasm_f32x4_fma_supported)
  // wasm_f32x4_fma is fast
else
  // wasm_f32x4_fma is slow, but can be used nonetheless

➕ Avoids introducing a new kind of non-determinism in Wasm

➖ Makes it impossible to write code that is both fast and portable

➖ Some architecture-specific operations are very hard to emulate (e.g. BFloat16 Dot Product uses a non-IEEE rounding mode on ARM)

13 of 23

Phase 4 poll

14 of 23

Able to get fast code simply

// Inputs are in bounds, validated elsewhere
f32x4.relaxed_min == single instruction on all platforms

15 of 23

imported functions

(module
  (func $r (import "🆆" "relaxed_fmadd") ;; special namespace
    (param v128) (param v128) (param v128) (result v128))
  (func
    (v128.const ...) (v128.const ...) (v128.const ...)
    call $r ;; function call, optimized by compilers to be performant
  )
)

16 of 23

is_fma_supported test

if (fma_supported)
  // use fma fast
else
  // fallback

17 of 23

min/max

18 of 23

min/max

if (MIN_RETURNS_NANS_IF_ANY_INPUT_IS_NAN)
  relaxed_min_first_kind
else if (MIN_RETURNS_FIRST_IF_ANY_INPUT_IS_NAN)
  relaxed_min_second_kind
else if (MIN_RETURNS_SECOND_IF_ANY_INPUT_IS_NAN)
  relaxed_min_third_kind
else if (MIN_RETURNS_OTHER_IF_ANY_INPUT_IS_NAN)
  relaxed_min_fourth_kind

  • code size
  • instruction bloat
  • difficult for engines to implement

19 of 23

bf16 dot product

20 of 23

fpenv

(module
  (func (param v128 v128 v128)
    (f32x4.qfma $fpu (local.get 0) (local.get 1) (local.get 2))
    (f32x4.qfma $fpu (local.get 0) (local.get 1) (local.get 2))
  )
  (fpenv $fpu 0))

21 of 23

fpenv

(module
  (func (param v128 v128 v128)
    (f32x4.qfma $fpu (local.get 0) (local.get 1) (local.get 2))
    (f32x4.qfma $fpu (local.get 0) (local.get 1) (local.get 2))
  )
  (fpenv $fpu 0))

  • New construct
  • No use cases for more than one fpenv in a single engine
  • Lacking realistic use cases
  • Non-determinism encapsulated in fpenv (not sure how to spec this yet)

22 of 23

Phase 4 requirements

✅ Two or more Web VMs implement the feature.

  • V8 and SpiderMonkey

✅ At least one toolchain implements the feature.

  • Emscripten/LLVM/Binaryen

✅ The formalization and the reference interpreter are usually updated (though these two can be done as part of step 3 at the Working Group chair's discretion).

  • PRs uploaded

Community Group has reached consensus in support of the feature.

  • Subgroup has reached consensus

23 of 23

Performance improvements