Wasm SIMD bitmask
Instruction proposal
Lowering
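static const uint8x16_t mask = {1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128};
// Arithmetic shift by 7 turns each lane into 0x00 or 0xFF; the AND keeps one distinct bit per lane
uint8x16_t masked = vandq_u8(mask, (uint8x16_t)vshrq_n_s8(val, 7));
// Rotate by 8 bytes so lanes 8-15 line up with lanes 0-7
uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
// Interleave into 16-bit lanes (low byte: lanes 0-7, high byte: lanes 8-15) and add across the vector
return vaddvq_u16((uint16x8_t)vzip1q_u8(masked, maskedhi));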
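As a reading aid, the NEON sequence above can be checked against a scalar model. The sketch below assumes, as the bit weights in the lowering imply, that the result packs the top bit of byte lane i into bit i. The function names bitmask_reference and bitmask_neon_steps are hypothetical and exist only for this illustration.

```cpp
#include <cstdint>

// Hypothetical scalar model: bit i of the result is the top bit of byte lane i.
uint16_t bitmask_reference(const uint8_t lanes[16]) {
    uint16_t result = 0;
    for (int i = 0; i < 16; ++i)
        result |= static_cast<uint16_t>(lanes[i] >> 7) << i;
    return result;
}

// Scalar walk-through of the same steps as the NEON lowering.
uint16_t bitmask_neon_steps(const uint8_t lanes[16]) {
    static const uint8_t weights[16] = {1, 2, 4, 8, 16, 32, 64, 128,
                                        1, 2, 4, 8, 16, 32, 64, 128};
    uint8_t masked[16];
    for (int i = 0; i < 16; ++i) {
        // Arithmetic shift by 7: the lane becomes 0xFF if its top bit is set, else 0x00.
        const uint8_t spread = (lanes[i] & 0x80) ? 0xFF : 0x00;
        // AND with the weight keeps exactly one bit per lane.
        masked[i] = spread & weights[i];
    }
    // Pair lane j with lane j + 8 in one 16-bit value, then sum the 8 values.
    // The bits never collide, so the sum equals the bitwise OR.
    uint16_t sum = 0;
    for (int j = 0; j < 8; ++j)
        sum += static_cast<uint16_t>(masked[j] | (masked[j + 8] << 8));
    return sum;
}
```

Both functions should agree for any input, which is the property the lowering relies on.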
Reasons to include
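// Note: for completeness this needs one more instruction (an arithmetic vector shift)
// so that every byte lane of mask is 0x00 or 0xFF before the steps below
// Regroup 32-bit lanes: 64-bit lane 0 holds bytes 0-3 and 8-11, lane 1 holds bytes 4-7 and 12-15
v128_t mask_0 = wasm_v32x4_shuffle(mask, mask, 0, 2, 1, 3);
// Keep one distinct bit per byte: bits 0-3 from lane 0, bits 4-7 from lane 1
uint64_t mask_1a = wasm_i64x2_extract_lane(mask_0, 0) & 0x0804020108040201ull;
uint64_t mask_1b = wasm_i64x2_extract_lane(mask_0, 1) & 0x8040201080402010ull;
uint64_t mask_2 = mask_1a | mask_1b;
// Fold the four bytes of each 32-bit half into its lowest byte
uint64_t mask_4 = mask_2 | (mask_2 >> 16);
uint64_t mask_8 = mask_4 | (mask_4 >> 8);
// Byte 0 covers lanes 0-7, byte 4 covers lanes 8-15
return uint8_t(mask_8) | (uint8_t(mask_8 >> 32) << 8);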
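The emulation above can also be written without Wasm intrinsics, which makes it easy to test on any host. This is a minimal sketch, assuming every byte lane is already 0x00 or 0xFF (the state the missing arithmetic shift would produce). The names load_le64 and emulated_bitmask are hypothetical.

```cpp
#include <cstdint>

// Assemble 8 bytes into a little-endian 64-bit value, independent of host byte order.
static uint64_t load_le64(const uint8_t* p) {
    uint64_t v = 0;
    for (int i = 0; i < 8; ++i)
        v |= static_cast<uint64_t>(p[i]) << (8 * i);
    return v;
}

// Hypothetical portable model of the emulated bitmask.
// Assumption: every lane is 0x00 or 0xFF.
uint16_t emulated_bitmask(const uint8_t lanes[16]) {
    const uint64_t lo = load_le64(lanes);      // bytes 0-7
    const uint64_t hi = load_le64(lanes + 8);  // bytes 8-15
    // Same regrouping as the 32-bit shuffle with indices 0, 2, 1, 3.
    const uint64_t lane0 = (lo & 0xFFFFFFFFull) | (hi << 32);          // bytes 0-3, 8-11
    const uint64_t lane1 = (lo >> 32) | (hi & 0xFFFFFFFF00000000ull);  // bytes 4-7, 12-15
    // One distinct bit per byte: bits 0-3 from lane0, bits 4-7 from lane1.
    const uint64_t mask_2 = (lane0 & 0x0804020108040201ull) |
                            (lane1 & 0x8040201080402010ull);
    // Fold the four bytes of each 32-bit half into its lowest byte.
    const uint64_t mask_4 = mask_2 | (mask_2 >> 16);
    const uint64_t mask_8 = mask_4 | (mask_4 >> 8);
    const uint8_t low = static_cast<uint8_t>(mask_8);         // lanes 0-7
    const uint8_t high = static_cast<uint8_t>(mask_8 >> 32);  // lanes 8-15
    return static_cast<uint16_t>(low | (high << 8));
}
```

The instruction count of this sequence is what the proposal would replace with a single operation.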
Performance evaluation
Given two algorithms that require a bitmask, vertex decompression and string search, we compare “native” bitmask (this proposal) vs “emulated” bitmask (scalar fallback on previous slide).
The string matching algorithm is evaluated on the needle "bbbb!cccc" and the haystack (“a” x gap + “!”) x N, where N = 100 MB / (gap+1), so the haystack is about 100 MB for every gap.
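The haystack layout can be sketched as follows. The function name make_haystack is hypothetical, and the total size is passed in bytes so that the interpretation of "100 MB" is left to the caller.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: builds ("a" x gap + "!") repeated N times,
// with N = total_bytes / (gap + 1).
std::string make_haystack(std::size_t gap, std::size_t total_bytes) {
    const std::size_t repeats = total_bytes / (gap + 1);
    std::string unit(gap, 'a');  // gap filler characters
    unit += '!';                 // one candidate position per unit
    std::string haystack;
    haystack.reserve(repeats * unit.size());
    for (std::size_t i = 0; i < repeats; ++i)
        haystack += unit;
    return haystack;
}
```

A larger gap means fewer '!' candidates per 16-byte block, so more blocks produce an empty bitmask.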
The algorithm finds ‘!’ (the needle's least frequent letter) in the haystack using a SIMD comparison, iterates through the match positions using bitmask + ctz, and tries to confirm each match.
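The search loop described above can be sketched in portable C++. This is not the benchmarked implementation: match_mask16 is a hypothetical scalar stand-in for the SIMD comparison followed by bitmask, find_all is a hypothetical name, and __builtin_ctz assumes a GCC or Clang compiler.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Scalar stand-in for "SIMD compare against c, then bitmask":
// bit i is set when byte i of the block (up to 16 bytes) equals c.
static uint32_t match_mask16(const char* p, std::size_t n, char c) {
    uint32_t mask = 0;
    for (std::size_t i = 0; i < n && i < 16; ++i)
        if (p[i] == c) mask |= 1u << i;
    return mask;
}

// Returns the start offsets of all needle occurrences.
// rare_offset is the index of the rare character inside the needle.
std::vector<std::size_t> find_all(const std::string& haystack,
                                  const std::string& needle,
                                  std::size_t rare_offset) {
    std::vector<std::size_t> hits;
    if (needle.empty() || rare_offset >= needle.size() ||
        haystack.size() < needle.size())
        return hits;
    const char rare = needle[rare_offset];
    for (std::size_t block = 0; block < haystack.size(); block += 16) {
        uint32_t mask = match_mask16(haystack.data() + block,
                                     haystack.size() - block, rare);
        while (mask != 0) {
            // ctz gives the lowest candidate position in this block.
            const std::size_t pos = block + static_cast<std::size_t>(__builtin_ctz(mask));
            mask &= mask - 1;  // clear the lowest set bit
            if (pos < rare_offset) continue;  // needle would start before the haystack
            const std::size_t start = pos - rare_offset;
            // Confirm the candidate with a full comparison.
            if (start + needle.size() <= haystack.size() &&
                std::memcmp(haystack.data() + start, needle.data(), needle.size()) == 0)
                hits.push_back(start);
        }
    }
    return hits;
}
```

For the needle "bbbb!cccc" the rare character is at offset 4, so a call would look like find_all(haystack, "bbbb!cccc", 4). The cost of producing mask for every block is the part that the native instruction is meant to reduce.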
Performance results
Algorithm | X64 (Intel Core i7-8700K) | ARM64 (QCOM Snapdragon 845) |
vertex decompression | 1.08x | 1.01x |
string search, gap 1 | 1.17x | 0.99x |
string search, gap 2 | 1.16x | 1.01x |
string search, gap 4 | 1.44x | 1.06x |
string search, gap 8 | 1.50x | 1.01x |
string search, gap 16 | 1.62x | 1.21x |
string search, gap 32 | 2.08x | 1.69x |
string search, gap 64 | 1.99x | 1.61x |
Performance results - cont