2 of 7

Instruction proposal

iNxK.bitmask

i8x16, i16x8, i32x4

Returns a K-bit integer with most significant bits of each lane

i32x4.bitmask(-1, 0, -1, 0) = 0b1010 = 10 (decimal)

Use cases

Extracting position of a matching element after lane-wise compare

ctz(bitmask(eq(data, splat(element)))) = index of first match
Useful for string searches, used in Swiss Hash Tables

Testing partial matches after comparisons

i32x4.bitmask(mask) == 14 => .xyz matches, .w mismatches

Table-driven decompression - emulating AVX512 vexpand*

i16x8.bitmask(eq(data, splat(65535))) can be used as a table index

3 of 7

Lowering

x86/x64: Direct mapping to 1-2 native instructions

i32x4: movmskps, i8x16: pmovmskb, i16x8: unpack + pmovmskb

PowerPC: Mapping to 3 native instructions

i8x16: lvsl + shift + vbpermq; vbpermq is a more general version of the proposal

ARM/AArch64: No direct equivalent, 7-8 instructions

Requires emulation using horizontal adds, i8x16 is the most expensive

static const uint8x16_t mask = {1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128};

uint8x16_t masked = vandq_u8(mask, (uint8x16_t)vshrq_n_s8(val, 7));

uint8x16_t maskedhi = vextq_u8(masked, masked, 8);

return vaddvq_u16((uint16x8_t)vzip1q_u8(masked, maskedhi));

4 of 7

Reasons to include

The instruction is important for some types of stream processing

String matching, decompression, handling of divergent SIMD control flow
These algorithms typically require this instruction with no good workarounds

Even for ARM, it’s impossible to synthesize a good instruction sequence atm

Wasm SIMD doesn’t expose horizontal adds, support for that is less prevalent
Alternatives include mostly scalar fallbacks; here’s (perhaps) the best alternative atm

// Note: this is missing one more instruction (arithmetic vector shift) for completeness

v128_t mask_0 = wasm_v32x4_shuffle(mask, mask, 0, 2, 1, 3);

uint64_t mask_1a = wasm_i64x2_extract_lane(mask_0, 0) & 0x0804020108040201ull;

uint64_t mask_1b = wasm_i64x2_extract_lane(mask_0, 1) & 0x8040201080402010ull;

uint64_t mask_2 = mask_1a | mask_1b;

uint64_t mask_4 = mask_2 | (mask_2 >> 16);

uint64_t mask_8 = mask_4 | (mask_4 >> 8);

return uint8_t(mask_8) | (uint8_t(mask_8 >> 32) << 8);

5 of 7

Performance evaluation

Given two algorithms that require a bitmask, we compare “native” bitmask (this proposal) vs “emulated” bitmask (scalar fallback on previous slide) for:

Custom decompression algorithm (bitmask is a small part of the workload)
String matching algorithm (memmem, returns index of first substring match)

String matching algorithm is evaluated on needle "bbbb!cccc" and haystack (“a” x gap + “!”) x N, where N is 100/(gap+1) MB

Algorithm finds ‘!’ (least frequent letter) in haystack using SIMD comparison, iterates through match positions using bitmask + ctz and tries to confirm the matches.

6 of 7

Performance results

Ratio between native throughput and emulated throughput (higher is better)

Algorithm	X64 (Intel Core i7-8700K)	ARM64 (QCOM Snapdragon 845)
vertex decompression	1.08x	1.01x
string search, gap 1	1.17x	0.99x
string search, gap 2	1.16x	1.01x
string search, gap 4	1.44x	1.06x
string search, gap 8	1.50x	1.01x
string search, gap 16	1.62x	1.21x
string search, gap 32	2.08x	1.69x
string search, gap 64	1.99x	1.61x

7 of 7

Performance results - cont

String search - gap dependency

Bitmask is executed once per 16 input bytes
For small gap sizes, there’s many other instructions per 1 bitmask
For large gap sizes, there’s few other instructions per 1 bitmask

String search - gap 1 regression on ARM

Regression is small, and vs emulated bitmask (not vs scalar algorithm)
Emulated: 591ms avg, 0.8 std dev
Native: 595ms avg, 0.9 std dev

Recommendation