1 of 18

WebNN Interop Investigation

Ningxin Hu

9/17/2019

2 of 18

Recap: custom op (#6)

  • Background (by Daniel):
    • The ML field is fast moving: model architectures and their ops evolve quickly. As a result, JS ML frameworks usually carry a big op set (e.g. TF.js has over 200 ops).
    • Today, framework ops are implemented in WebGL, WASM, and WebGPU.
    • WebNN's built-in op set, which focuses on hardware acceleration, will be small and grow slowly.
  • Problem:
    • Library authors need a way to write ops that interop with the built-in ops.
  • Options:
    • WebNN built-in ops interop with framework ops in WASM and WebGL/WebGPU (the focus of this investigation).
    • WebNN provides a way to write custom ops in a domain-specific language (e.g. Kai's proposal) (future exploration).

3 of 18

WebNN-WebGPU Interop

4 of 18

Example: Conv + Add + Relu by TF.js WebGPU

tf.setBackend('webgpu');

// Prepare tensors
const convInput = tf.tensor(inputData, inputDims);
const filter = tf.tensor(filterData, filterDims);
const bias = tf.tensor(biasData, biasDims);

// Execute conv, add bias and relu with the TF.js WebGPU backend
const convOutput = tf.conv2d(convInput, filter, 1, 'same');
const addOutput = tf.add(convOutput, bias);
const reluOutput = tf.relu(addOutput);
const resultData = await reluOutput.data();

5 of 18

Example: compile WebNN op for WebGPU device

// Create a Model object that contains a Conv op
const conv = await createWebNNConvOp(inputDims, filterData, filterDims);

// Use the TF.js WebGPU backend
tf.setBackend('webgpu');

// Create a Compilation object for the constructed model that contains the built-in op.
const conv_compilation = await conv.createCompilation();

// Get the GPUDevice of the WebGPUBackend and set it as the WebNN compilation target.
conv_compilation.setGPUDevice(tf.backend().device);

// Finish the compilation.
await conv_compilation.finish();

// Create an Execution object for the compiled model.
const conv_execution = await conv_compilation.createExecution();
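
The deck does not show createWebNNConvOp. Below is a minimal sketch, assuming the NNAPI-style model-builder API of the WebNN POC (navigator.ml.getNeuralNetworkContext, addOperand, addOperation, identifyInputsAndOutputs, finish); all names, constants, and the operand order are illustrative of that POC shape, not a settled spec. The optional bias/fused-relu parameters anticipate the fused-op slide later.

async function createWebNNConvOp(inputDims, filterData, filterDims,
                                 biasData = null, fuseRelu = false) {
  const nn = navigator.ml.getNeuralNetworkContext();
  const model = await nn.createModel();
  let operandIndex = 0;
  // Add a float32 tensor operand, optionally with constant values.
  const addTensor = (dimensions, values = null) => {
    model.addOperand({type: nn.TENSOR_FLOAT32, dimensions});
    if (values) model.setOperandValue(operandIndex, new Float32Array(values));
    return operandIndex++;
  };
  // Add a constant int32 scalar operand.
  const addScalar = (value) => {
    model.addOperand({type: nn.INT32});
    model.setOperandValue(operandIndex, new Int32Array([value]));
    return operandIndex++;
  };
  const input = addTensor(inputDims);
  const filter = addTensor(filterDims, filterData);
  // The NNAPI-style conv signature takes an explicit bias; use zeros when
  // the caller adds bias separately (as the interop example above does).
  const bias = addTensor([filterDims[0]],
                         biasData || new Array(filterDims[0]).fill(0));
  // Scalars: implicit 'same' padding, 1x1 stride, optional fused relu.
  const scalars = [
    addScalar(nn.PADDING_SAME), addScalar(1), addScalar(1),
    addScalar(fuseRelu ? nn.FUSED_RELU : nn.FUSED_NONE)
  ];
  // 'same' padding with stride 1 preserves spatial dims; output channels
  // equal the filter count (NHWC input / OHWI filter layout assumed).
  const output = addTensor(
      [inputDims[0], inputDims[1], inputDims[2], filterDims[0]]);
  model.addOperation(nn.CONV_2D, [input, filter, bias, ...scalars], [output]);
  model.identifyInputsAndOutputs([input], [output]);
  await model.finish();
  return model;
}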

6 of 18

Example: execute WebNN’s op with WebGPU op

// Create input and output tensors with the TF.js WebGPU backend
const convInput = tf.tensor(inputData, inputDims);
const convOutput = tf.zeros(outputDims);
const bias = tf.tensor(biasData, biasDims);

// Execute the WebNN conv op
conv_execution.setInput(0, tensorToGPUBuffer(convInput));
conv_execution.setOutput(0, tensorToGPUBuffer(convOutput));
conv_execution.startCompute();

// Execute add bias and relu with the TF.js WebGPU backend
const addOutput = tf.add(convOutput, bias);
const reluOutput = tf.relu(addOutput);
const resultData = await reluOutput.data();
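
The tensorToGPUBuffer helper is also left undefined. A minimal sketch, assuming the TF.js WebGPU backend can resolve a tensor's dataId to the GPUBuffer backing it; the getBuffer accessor is an assumption about backend internals, not a public TF.js API.

// Hypothetical: look up the GPUBuffer that backs a TF.js tensor so WebNN
// can read or write it in place, with no round trip through the CPU.
function tensorToGPUBuffer(tensor) {
  const backend = tf.backend();            // the active WebGPUBackend
  return backend.getBuffer(tensor.dataId); // assumed internal accessor
}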

7 of 18

Example: execute WebNN’s fused op

// Create a WebNN Conv op with bias and fused relu
// (model construction elided; see the sketch after this example)

// Create input and output tensors with the TF.js WebGPU backend
const convInput = tf.tensor(inputData, inputDims);
const convOutput = tf.zeros(outputDims);

// Execute the WebNN fused conv op
fused_conv_execution.setInput(0, tensorToGPUBuffer(convInput));
fused_conv_execution.setOutput(0, tensorToGPUBuffer(convOutput));
fused_conv_execution.startCompute();

const resultData = await convOutput.data();
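
The construction of fused_conv_execution is elided on the slide. Under the same illustrative helper sketched earlier, it might look like:

// Hypothetical: pass the real bias values and request a fused relu, so
// conv, bias add and relu run as a single WebNN operation.
const fusedConv = await createWebNNConvOp(
    inputDims, filterData, filterDims, biasData, /*fuseRelu=*/true);
const fused_conv_compilation = await fusedConv.createCompilation();
fused_conv_compilation.setGPUDevice(tf.backend().device);
await fused_conv_compilation.finish();
const fused_conv_execution = await fused_conv_compilation.createExecution();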

8 of 18

Demo

9 of 18

POC Implementation on MPS (Metal Performance Shaders)

  • Reuse the same MTLDevice associated with the WebGPUDevice.
  • Get the MTLBuffers associated with the input and output WebGPUBuffers.
  • Allocate MPSImages for the inputs with the MTLDevice.
  • Create a MTLCommandBuffer from the MTLCommandQueue associated with the WebGPUDevice.
  • Encode a compute shader that copies and reorders data from the input MTLBuffer to the input MPSImage (MPSImage layout).
  • Encode the MPSNNGraph/MPSCNNKernel to the MTLCommandBuffer.
  • Encode a compute shader that copies and reorders data from the output MPSImage to the output MTLBuffer.
  • Commit the MTLCommandBuffer.

10 of 18

Performance Summary

  • WebNN ops give access to vendor-optimized GPU acceleration.
  • Sharing a WebGPUBuffer avoids GPU-CPU memory transfers.
  • WebNN-WebGPU interop introduces copying and reordering overhead.

Test                                                       Inference time (ms)
WebGPU conv/add/relu                                       61.31
WebNN conv interops with WebGPU add/relu via ArrayBuffer   43.42
WebNN conv interops with WebGPU add/relu via WebGPUBuffer  23.06
WebNN conv with fused add/relu                             21.25

11 of 18

Copying/Reordering Optimization

  • Copying/reordering is only needed for interop with WebGPU compute shaders.
  • Copying/reordering is not necessary when executing a chain of WebNN ops.
  • Using opaque operands between WebNN ops removes the copying/reordering (see the sketch below).
  • Opaque operands also open the opportunity to use device-optimized memory formats (e.g. MPSTemporaryImage).
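
A minimal sketch of what opaque operands could look like at the API level; setOutputOperand and setInputOperand are illustrative names for handing an intermediate result between WebNN executions without exporting it to a GPUBuffer.

// Hypothetical: the intermediate tensor stays in the backend's preferred
// device format (e.g. an MPSTemporaryImage) between the two convs.
const intermediate = conv1_execution.setOutputOperand(0); // opaque handle
conv2_execution.setInputOperand(0, intermediate);
// Only the chain's boundary tensors are shared with WebGPU.
conv1_execution.setInput(0, tensorToGPUBuffer(convInput));
conv2_execution.setOutput(0, tensorToGPUBuffer(convOutput));
conv1_execution.startCompute();
conv2_execution.startCompute();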

Test                              Inference time (ms)
WebGPU conv x2                    112.96
WebNN conv + WebGPU conv          67.33
WebNN conv x2 with reordering     24.53
WebNN conv x2 without reordering  23.01

12 of 18

WebNN-WASM Interop

13 of 18

WASM ops and WebNN graph execution

[Diagram: a WASM op → WebNN ops → WASM op pipeline. Tensors 0-3 live in the WASM heap and are passed between WASM ops and WebNN ops as ArrayBufferViews.]
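
A minimal sketch of the sharing the diagram shows, assuming an Emscripten-style module (wasmModule.HEAPF32) and the Execution API from the earlier examples; the pointer and length variables are illustrative.

// The tensors live in the WASM heap; WebNN reads and writes them through
// ArrayBufferViews, so no extra copy is needed to hand data between WASM
// ops and WebNN ops on the JS side.
// inputPtr/outputPtr are byte offsets returned by the WASM allocator.
const input = new Float32Array(wasmModule.HEAPF32.buffer, inputPtr, inputLength);
const output = new Float32Array(wasmModule.HEAPF32.buffer, outputPtr, outputLength);
conv_execution.setInput(0, input);
conv_execution.setOutput(0, output);
await conv_execution.startCompute();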

14 of 18

Workload: MobileNet V1

  • Source: mobilenet_v1_1.0_224
  • Ops: Conv2D ×15, DepthwiseConv2D ×13, AveragePool2D ×1, Softmax ×1, Squeeze ×1

[Chart: per-op execution time, based on WASM op execution; annotated with the 12× repeated Depthwise+Conv2D block.]

15 of 18

Graph partition configurations

  • WebNN-supported ops:
    • None: all WASM ops
    • Conv2D
    • Conv2D+DepthwiseConv2D
    • All supported (without reordering)
    • All supported (with reordering)

[Diagram: WebNN graph partitions for each configuration: Conv2D only, Conv2D+DepthwiseConv2D, and All (one graph per op).]

16 of 18

Results: performance

  • Offloading expensive ops yields significant speedups:
    • Conv2D (90% of the computation): 5X faster on PC, 3X faster on smartphone
    • Conv2D+DepthwiseConv2D (99% of the computation): 33X faster on PC, 7X faster on smartphone
  • Avoiding copying/reordering across WebNN op executions improves performance further:
    • Without opaque operands vs. with: 3.5X slower on PC, 1.5X slower on smartphone

Smartphone: Pixel 3, Android 9 (updated 12/2018), Chromium 70.0.3503

PC: XPS 13 laptop, Intel Core i5-8250U, Ubuntu Linux 16.04, Chromium 70.0.3503

17 of 18

Summary of WebNN-WASM interop

  • WebNN ops give access to vendor-optimized CPU acceleration.
  • Interop between WASM ops and WebNN ops has overhead:
    • Memory copying between the WASM heap and the WebNN backend
    • Memory reordering, e.g. to the MKL-DNN blocked layout
  • Executing a chain of WebNN ops with opaque operands can avoid the unnecessary overhead.

18 of 18

Proposal

  • Support key ops that access hardware acceleration (#17)
    • E.g. conv2d and matmul
  • Support compiling and executing ops for specific devices (new issue?)
    • CPU or GPU
  • Support interop with WebAssembly and WebGPU compute shaders
    • Sharing ArrayBuffers with WASM ops
    • Sharing WebGPUBuffers with WebGPU ops (new issue?)
  • Support executing chains of ops with opaque operands (#11)
    • Leverage device-optimized memory layouts and avoid unnecessary memory reordering
  • Explore custom op support via a DSL (new issue?)