2022-10-27
Co-champions: Marat Dukhan, Zhi An Ng
Today
Why Relaxed SIMD?
From Wasm SIMD to Relaxed SIMD
😁 When the native SIMD ISA (SSE4.1) is close to WebAssembly SIMD, WebML achieves 66-87% of native performance 🚀
⚠️ Newer SIMD ISAs are more capable than Wasm SIMD
💡 These instructions are not in Wasm SIMD because they can’t be fully portably implemented across all architectures
Relaxed SIMD proposal
Introduce limited non-determinism to close the Web-native gap
Floating-point instructions: limited variability in rounding
Integer instructions: limited variability in out-of-bounds behavior
Limiting non-deterministic damage: fallback
All Relaxed SIMD instructions have reference lowering to deterministic WebAssembly SIMD instructions
Limiting non-deterministic damage: portability
Relaxed SIMD instructions provide portability guarantees so long as users stay within the defined boundaries
💡 Case study: XNNPACK (a neural network library) was ported to Relaxed SIMD and uses the same code path across all architectures without testing for implementation-specific behaviour.
Limiting non-deterministic damage: consistency
Given the same inputs, Relaxed SIMD instructions must produce the same outputs on the same host
💡Case study: expm1(x) for x≤0
8 Relaxed Multiply-Adds, same input
→ 2 possible outputs with consistency
→ 256 possible outputs without
v128_t vn = wasm_f32x4_relaxed_madd(vx, vlog2e, vmagic_bias);
v128_t vs = wasm_i32x4_shl(vn, 23);
vn = wasm_f32x4_sub(vn, vmagic_bias);
v128_t vt = wasm_f32x4_relaxed_madd(vn, vminus_ln2_hi, vx);
vt = wasm_f32x4_relaxed_madd(vn, vminus_ln2_lo, vt);
const v128_t vm = wasm_f32x4_le(vx, vsat_cutoff);
vs = wasm_v128_andnot(vs, vm);
vt = wasm_v128_andnot(vt, vm);
v128_t vp = wasm_f32x4_relaxed_madd(vc6, vt, vc5);
vp = wasm_f32x4_relaxed_madd(vp, vt, vc4);
vp = wasm_f32x4_relaxed_madd(vp, vt, vc3);
vp = wasm_f32x4_relaxed_madd(vp, vt, vc2);
vp = wasm_f32x4_mul(vp, vt);
vt = wasm_f32x4_mul(vt, vs);
const v128_t vsm1 = wasm_f32x4_sub(vs, vone);
vp = wasm_f32x4_relaxed_madd(vp, vt, vt);
const v128_t vf = wasm_f32x4_add(vp, vsm1);
Limiting non-deterministic damage: tooling
Relaxed SIMD instructions are explicitly opt-in:
With a little help from the toolchain, developers can verify the expected behavior:
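As a sketch of what opt-in looks like in practice (assuming a recent LLVM/Emscripten; the exact flag spelling is the toolchain's, not part of the proposal): relaxed instructions sit behind a dedicated target feature, so code that never passes the flag can never emit them.

```shell
# Wasm SIMD alone: wasm_f32x4_relaxed_madd and friends are unavailable.
emcc -O2 -msimd128 expm1.c -o expm1.js

# Explicitly opt in to Relaxed SIMD on top of Wasm SIMD.
emcc -O2 -msimd128 -mrelaxed-simd expm1.c -o expm1.js
```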
Suggestion: Drop Consistency
Allow instructions with the same inputs to produce different results:
f32x4.relaxed_madd x y z
...
f32x4.relaxed_madd x y z
➕ Avoids introducing a new kind of non-determinism
➖ Software would still rely on consistency and may get broken if Wasm engines do context-dependent lowering of relaxed instructions
Suggestion: Relaxed Instructions as Imports
Allow instructions with the same inputs to produce different results:
(func $r (import "🆆" "wasm_relaxed_madd") ;; special namespace
  (param v128) (param v128) (param v128) (result v128))
➕ Sidesteps introducing a new kind of non-determinism in Wasm
➖ The difference is mostly theoretical, moving non-determinism from Wasm into environment doesn’t change anything in practice
➖ Mozilla suggested this does not fit well into SpiderMonkey
Suggestion: Explicit Tests for Arch-Specifics
if (wasm_f32x4_fma_supported)
// wasm_f32x4_fma is fast
else
// wasm_f32x4_fma is slow, but can be used nonetheless
➕ Avoids introducing a new kind of non-determinism in Wasm
➖ Impossible to write fast portable code
➖ Some architecture-specific operations are very hard to emulate (e.g. the BFloat16 dot product uses a non-IEEE rounding mode on Arm)
Phase 4 poll
Able to get fast code simply
// Inputs are in bounds, validated elsewhere
f32x4.relaxed_min == single instruction on all platforms
imported functions
(module
  (func $r (import "🆆" "relaxed_fmadd") ;; special namespace
    (param v128) (param v128) (param v128) (result v128))
  (func
    (v128.const ...) (v128.const ...) (v128.const ...)
    call $r ;; function call, optimized by compilers to be performant
  )
)
is_fma_supported test
if (fma_supported)
// use fma fast
else
// fallback
min/max
if (MIN_RETURNS_NANS_IF_ANY_INPUT_IS_NAN)
relaxed_min_first_kind
else if (MIN_RETURNS_FIRST_IF_ANY_INPUT_IS_NAN)
relaxed_min_second_kind
else if (MIN_RETURNS_SECOND_IF_ANY_INPUT_IS_NAN)
relaxed_min_third_kind
else if (MIN_RETURNS_OTHER_IF_ANY_INPUT_IS_NAN)
relaxed_min_fourth_kind
bf16 dot product
fpenv
(module
  (func (param v128 v128 v128)
    (f32x4.qfma $fpu (local.get 0) (local.get 1) (local.get 2))
    (f32x4.qfma $fpu (local.get 0) (local.get 1) (local.get 2))
  )
  (fpenv $fpu 0))
Phase 4 requirements
✅ Two or more Web VMs implement the feature.
✅ At least one toolchain implements the feature.
✅ The formalization and the reference interpreter are usually updated (though these two can be done as part of step 3 at the Working Group chair's discretion).
Community Group has reached consensus in support of the feature.
Performance improvements