1 of 7

Wasm SIMD bitmask

2 of 7

Instruction proposal

  • iNxK.bitmask
    • i8x16, i16x8, i32x4
  • Returns a K-bit integer with most significant bits of each lane
    • i32x4.bitmask(-1, 0, -1, 0) = 0b1010 = 10 (decimal)
  • Use cases
    • Extracting position of a matching element after lane-wise compare
      • ctz(bitmask(eq(data, splat(element)))) = index of first match
      • Useful for string searches, used in Swiss Hash Tables
    • Testing partial matches after comparisons
      • i32x4.bitmask(mask) == 14 => .xyz matches, .w mismatches
    • Table-driven decompression - emulating AVX512 vexpand*
      • i16x8.bitmask(eq(data, splat(65535))) can be used as a table index

3 of 7

Lowering

  • x86/x64: Direct mapping to 1-2 native instructions
    • i32x4: movmskps, i8x16: pmovmskb, i16x8: unpack + pmovmskb
  • PowerPC: Mapping to 3 native instructions
    • i8x16: lvsl + shift + vbpermq; vbpermq is a more general version of the proposal
  • ARM/AArch64: No direct equivalent, 7-8 instructions
    • Requires emulation using horizontal adds, i8x16 is the most expensive

static const uint8x16_t mask = {1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128};

uint8x16_t masked = vandq_u8(mask, (uint8x16_t)vshrq_n_s8(val, 7));

uint8x16_t maskedhi = vextq_u8(masked, masked, 8);

return vaddvq_u16((uint16x8_t)vzip1q_u8(masked, maskedhi));

4 of 7

Reasons to include

  • The instruction is important for some types of stream processing
    • String matching, decompression, handling of divergent SIMD control flow
    • These algorithms typically require this instruction with no good workarounds
  • Even for ARM, it’s impossible to synthesize a good instruction sequence atm
    • Wasm SIMD doesn’t expose horizontal adds, support for that is less prevalent
    • Alternatives include mostly scalar fallbacks; here’s (perhaps) the best alternative atm

// Note: this is missing one more instruction (arithmetic vector shift) for completeness

v128_t mask_0 = wasm_v32x4_shuffle(mask, mask, 0, 2, 1, 3);

uint64_t mask_1a = wasm_i64x2_extract_lane(mask_0, 0) & 0x0804020108040201ull;

uint64_t mask_1b = wasm_i64x2_extract_lane(mask_0, 1) & 0x8040201080402010ull;

uint64_t mask_2 = mask_1a | mask_1b;

uint64_t mask_4 = mask_2 | (mask_2 >> 16);

uint64_t mask_8 = mask_4 | (mask_4 >> 8);

return uint8_t(mask_8) | (uint8_t(mask_8 >> 32) << 8);

5 of 7

Performance evaluation

Given two algorithms that require a bitmask, we compare “native” bitmask (this proposal) vs “emulated” bitmask (scalar fallback on previous slide) for:

  • Custom decompression algorithm (bitmask is a small part of the workload)
  • String matching algorithm (memmem, returns index of first substring match)

String matching algorithm is evaluated on needle "bbbb!cccc" and haystack (“a” x gap + “!”) x N, where N is 100/(gap+1) MB

Algorithm finds ‘!’ (least frequent letter) in haystack using SIMD comparison, iterates through match positions using bitmask + ctz and tries to confirm the matches.

6 of 7

Performance results

  • Ratio between native throughput and emulated throughput (higher is better)

Algorithm

X64 (Intel Core i7-8700K)

ARM64 (QCOM Snapdragon 845)

vertex decompression

1.08x

1.01x

string search, gap 1

1.17x

0.99x

string search, gap 2

1.16x

1.01x

string search, gap 4

1.44x

1.06x

string search, gap 8

1.50x

1.01x

string search, gap 16

1.62x

1.21x

string search, gap 32

2.08x

1.69x

string search, gap 64

1.99x

1.61x

7 of 7

Performance results - cont

  • String search - gap dependency
    • Bitmask is executed once per 16 input bytes
    • For small gap sizes, there’s many other instructions per 1 bitmask
    • For large gap sizes, there’s few other instructions per 1 bitmask
  • String search - gap 1 regression on ARM
    • Regression is small, and vs emulated bitmask (not vs scalar algorithm)
    • Emulated: 591ms avg, 0.8 std dev
    • Native: 595ms avg, 0.9 std dev
  • Recommendation
    • Given the large wins on x64 and sizeable wins on ARM, recommendation is to add bitmask
    • The only benchmark where emulated code is slightly faster on ARM is one where bitmask is *less* important compared to other instructions