1 of 23

2022-10-27

Co-champions: Marat Dukhan, Zhi An Ng

2 of 23

Today

  • Recap of Relaxed SIMD proposal
  • Suggestions
    • Drop Consistency
    • Relaxed Instructions as Imports
    • Explicit Tests for Architecture Specifics
  • Phase 4 poll

3 of 23

Why Relaxed SIMD?

4 of 23

From Wasm SIMD to Relaxed SIMD

😁 When the native SIMD ISA (SSE4.1) is close to WebAssembly SIMD, WebML achieves 66-87% of native performance 🚀

⚠️ Newer SIMD ISAs are more capable than Wasm SIMD

  • Fused Multiply-Add
  • Integer and BFloat16 Dot Product instructions
  • Efficient Min/Max, Swizzle, and Float-to-Int conversion instructions

💡 These instructions are not in Wasm SIMD because they can’t be fully portably implemented across all architectures

5 of 23

Relaxed SIMD proposal

Introduce limited non-determinism to close the Web-native gap

Floating-point instructions: limited variability in rounding

  • Relaxed Multiply-Add computes a·b+c with either one or two roundings
  • Relaxed BFloat16 Dot Product allows either ARM-style c+(alo·blo+ahi·bhi) or x86-style (c+alo·blo)+ahi·bhi computation.

Integer instructions: limited variability in out-of-bounds behavior

  • Float→Int conversion either saturates or produces INT_MIN for out-of-bounds inputs

6 of 23

Limiting non-deterministic damage: fallback

All Relaxed SIMD instructions have reference lowering to deterministic WebAssembly SIMD instructions

7 of 23

Limiting non-deterministic damage: portability

Relaxed SIMD instructions provide portability guarantees so long as users stay within the defined boundaries

  • Tolerant to floating-point expression contraction
    • Most code is tolerant because GCC & Clang enable contraction by default.
  • Avoid out-of-bounds inputs to integer operations
    • Guarantee that at least one input to Q15-format Multiplication is not INT16_MIN

💡 Case study: XNNPack (neural network library) was ported to Relaxed SIMD and uses the same code path across all architectures without testing for implementation-specific behaviour.

8 of 23

Limiting non-deterministic damage: consistency

Given the same inputs, Relaxed SIMD instructions must produce the same outputs on the same host

  • Reduce fingerprinting surface
  • Improve testability

💡Case study: expm1(x) for x≤0

8 Relaxed Multiply-Adds, same input

→ 2 possible outputs with consistency

→ 256 possible outputs without consistency

v128_t vn = wasm_f32x4_relaxed_madd(vx, vlog2e, vmagic_bias);
v128_t vs = wasm_i32x4_shl(vn, 23);
vn = wasm_f32x4_sub(vn, vmagic_bias);
v128_t vt = wasm_f32x4_relaxed_madd(vn, vminus_ln2_hi, vx);
vt = wasm_f32x4_relaxed_madd(vn, vminus_ln2_lo, vt);
const v128_t vm = wasm_f32x4_le(vx, vsat_cutoff);
vs = wasm_v128_andnot(vs, vm);
vt = wasm_v128_andnot(vt, vm);
v128_t vp = wasm_f32x4_relaxed_madd(vc6, vt, vc5);
vp = wasm_f32x4_relaxed_madd(vp, vt, vc4);
vp = wasm_f32x4_relaxed_madd(vp, vt, vc3);
vp = wasm_f32x4_relaxed_madd(vp, vt, vc2);
vp = wasm_f32x4_mul(vp, vt);
vt = wasm_f32x4_mul(vt, vs);
const v128_t vsm1 = wasm_f32x4_sub(vs, vone);
vp = wasm_f32x4_relaxed_madd(vp, vt, vt);
const v128_t vf = wasm_f32x4_add(vp, vsm1);

9 of 23

Limiting non-deterministic damage: tooling

Relaxed SIMD instructions are explicitly opt-in:

  • Require a compiler flag to enable: -mrelaxed-simd in Clang
  • Not generated by auto-vectorizer
  • Can be used only via intrinsic functions

With a little help from toolchains, developers can verify expected behavior:

  • V8 offers the --no-enable-fma3 option to disable use of FMA on x86
  • UBSan could support validating that inputs are within bounds

10 of 23

Suggestion: Drop Consistency

Allow instructions with the same inputs to produce different results:

f32x4.relaxed_madd x y z
...
f32x4.relaxed_madd x y z

➕ Avoids introducing a new kind of non-determinism

➖ Software would still rely on consistency and may break if Wasm engines do context-dependent lowering of relaxed instructions

11 of 23

Suggestion: Relaxed Instructions as Imports

Allow instructions with the same inputs to produce different results:

(func $r (import "🆆" "wasm_relaxed_madd") ;; special namespace
  (param v128) (param v128) (param v128) (result v128))

➕ Sidesteps introducing a new kind of non-determinism in Wasm

➖ The difference is mostly theoretical; moving non-determinism from Wasm into the environment doesn’t change anything in practice

➖ Mozilla suggested this does not fit well into SpiderMonkey

12 of 23

Suggestion: Explicit Tests for Arch-Specifics

if (wasm_f32x4_fma_supported)
  // wasm_f32x4_fma is fast
else
  // wasm_f32x4_fma is slow, but can be used nonetheless

➕ Avoids introducing a new kind of non-determinism in Wasm

➖ Makes it impossible to write code that is both fast and portable

➖ Some architecture-specific operations are very hard to emulate (e.g. BFloat16 Dot Product uses a non-IEEE rounding mode on ARM)

13 of 23

Phase 4 poll

14 of 23

Able to get fast code simply

// Inputs are in bounds, validated elsewhere
f32x4.relaxed_min == single instruction on all platforms

15 of 23

imported functions

(module
  (func $r (import "🆆" "relaxed_fmadd") ;; special namespace
    (param v128) (param v128) (param v128) (result v128))
  (func
    (v128.const ...) (v128.const ...) (v128.const ...)
    call $r ;; function call, optimized by compilers to be performant
  )
)

16 of 23

is_fma_supported test

if (fma_supported)
  // use fma fast
else
  // fallback

17 of 23

min/max

18 of 23

min/max

if (MIN_RETURNS_NANS_IF_ANY_INPUT_IS_NAN)
  relaxed_min_first_kind
else if (MIN_RETURNS_FIRST_IF_ANY_INPUT_IS_NAN)
  relaxed_min_second_kind
else if (MIN_RETURNS_SECOND_IF_ANY_INPUT_IS_NAN)
  relaxed_min_third_kind
else if (MIN_RETURNS_OTHER_IF_ANY_INPUT_IS_NAN)
  relaxed_min_fourth_kind

  • code size
  • instruction bloat
  • difficult for engines to implement

19 of 23

bf16 dot product

20 of 23

fpenv

(module
  (func (param v128 v128 v128)
    (f32x4.qfma $fpu (local.get 0) (local.get 1) (local.get 2))
    (f32x4.qfma $fpu (local.get 0) (local.get 1) (local.get 2))
  )
  (fpenv $fpu 0))

21 of 23

fpenv

(module
  (func (param v128 v128 v128)
    (f32x4.qfma $fpu (local.get 0) (local.get 1) (local.get 2))
    (f32x4.qfma $fpu (local.get 0) (local.get 1) (local.get 2))
  )
  (fpenv $fpu 0))

  • New construct
  • No use cases for more than one fpenv in a single engine
  • Lacking realistic use cases
  • Non-determinism encapsulated in fpenv (not sure how to spec this yet)

22 of 23

Phase 4 requirements

✅ Two or more Web VMs implement the feature.

  • V8 and SpiderMonkey

✅ At least one toolchain implements the feature.

  • Emscripten/LLVM/Binaryen

✅ The formalization and the reference interpreter are usually updated (though these two can be done as part of step 3 at the Working Group chair's discretion).

  • PRs uploaded

Community Group has reached consensus in support of the feature.

  • Subgroup has reached consensus

23 of 23

Performance improvements