1 of 18

WebNN Interop Investigation

Ningxin Hu

9/17/2019

2 of 18

Recap: custom op (#6)

  • Background (by Daniel):
    • The ML field is fast moving: model architectures and their ops evolve quickly. As a result, JS ML frameworks usually carry a big op set (e.g. TF.js has over 200 ops).
    • Today, framework ops are implemented in WebGL, WASM, and WebGPU.
    • WebNN's built-in op set, which focuses on hardware acceleration, will be small and grow slowly.
  • Problem:
    • Library authors need a way to write ops that interop with the built-in ops.
  • Options:
    • WebNN built-in ops interop with framework ops in WASM and WebGL/WebGPU (the focus of this investigation).
    • WebNN provides a way to write custom ops in a domain-specific language (e.g. Kai's proposal) (future exploration).

3 of 18

WebNN-WebGPU Interop

4 of 18

Example: Conv + Add + Relu by TF.js WebGPU

tf.setBackend('webgpu');

// Prepare tensors
const convInput = tf.tensor(inputData, inputDims);
const filter = tf.tensor(filterData, filterDims);
const bias = tf.tensor(biasData, biasDims);

// Execute conv, add bias and relu with the TF.js WebGPU backend
const convOutput = tf.conv2d(convInput, filter, 1, 'same');
const addOutput = tf.add(convOutput, bias);
const reluOutput = tf.relu(addOutput);
const resultData = await reluOutput.data();

5 of 18

Example: compile WebNN op for WebGPU device

// Create a Model object that contains a Conv op
const conv = await createWebNNConvOp(inputDims, filterData, filterDims);

// Use the TF.js WebGPU backend
tf.setBackend('webgpu');

// Create a Compilation object for the constructed model that contains the built-in op.
const conv_compilation = await conv.createCompilation();

// Get the GPUDevice of the WebGPUBackend and set it as the WebNN compilation target.
conv_compilation.setGPUDevice(tf.backend().device);

// Finish the compilation.
await conv_compilation.finish();

// Create an Execution object for the compiled model.
const conv_execution = await conv_compilation.createExecution();
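
The deck does not show createWebNNConvOp. Below is a minimal sketch, assuming the NNAPI-style model-builder API of the WebNN POC (navigator.ml.getNeuralNetworkContext, addOperand, addOperation, identifyInputsAndOutputs, finish); all names, constants, and the operand order are illustrative of that POC shape, not a settled spec. The optional bias/fused-relu parameters anticipate the fused-op slide later.

async function createWebNNConvOp(inputDims, filterData, filterDims,
                                 biasData = null, fuseRelu = false) {
  const nn = navigator.ml.getNeuralNetworkContext();
  const model = await nn.createModel();
  let operandIndex = 0;
  // Add a float32 tensor operand, optionally with constant values.
  const addTensor = (dimensions, values = null) => {
    model.addOperand({type: nn.TENSOR_FLOAT32, dimensions});
    if (values) model.setOperandValue(operandIndex, new Float32Array(values));
    return operandIndex++;
  };
  // Add a constant int32 scalar operand.
  const addScalar = (value) => {
    model.addOperand({type: nn.INT32});
    model.setOperandValue(operandIndex, new Int32Array([value]));
    return operandIndex++;
  };
  const input = addTensor(inputDims);
  const filter = addTensor(filterDims, filterData);
  // The NNAPI-style conv signature takes an explicit bias; use zeros when
  // the caller adds bias separately (as the interop example above does).
  const bias = addTensor([filterDims[0]],
                         biasData || new Array(filterDims[0]).fill(0));
  // Scalars: implicit 'same' padding, 1x1 stride, optional fused relu.
  const scalars = [
    addScalar(nn.PADDING_SAME), addScalar(1), addScalar(1),
    addScalar(fuseRelu ? nn.FUSED_RELU : nn.FUSED_NONE)
  ];
  // 'same' padding with stride 1 preserves spatial dims; output channels
  // equal the filter count (NHWC input / OHWI filter layout assumed).
  const output = addTensor(
      [inputDims[0], inputDims[1], inputDims[2], filterDims[0]]);
  model.addOperation(nn.CONV_2D, [input, filter, bias, ...scalars], [output]);
  model.identifyInputsAndOutputs([input], [output]);
  await model.finish();
  return model;
}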

6 of 18

Example: execute WebNN’s op with WebGPU op

// Create input and output tensors with the TF.js WebGPU backend
const convInput = tf.tensor(inputData, inputDims);
const convOutput = tf.zeros(outputDims);
const bias = tf.tensor(biasData, biasDims);

// Execute the WebNN conv op
conv_execution.setInput(0, tensorToGPUBuffer(convInput));
conv_execution.setOutput(0, tensorToGPUBuffer(convOutput));
conv_execution.startCompute();

// Execute add bias and relu with the TF.js WebGPU backend
const addOutput = tf.add(convOutput, bias);
const reluOutput = tf.relu(addOutput);
const resultData = await reluOutput.data();
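
The tensorToGPUBuffer helper is also left undefined. A minimal sketch, assuming the TF.js WebGPU backend can resolve a tensor's dataId to the GPUBuffer backing it; the getBuffer accessor is an assumption about backend internals, not a public TF.js API.

// Hypothetical: look up the GPUBuffer that backs a TF.js tensor so WebNN
// can read or write it in place, with no round trip through the CPU.
function tensorToGPUBuffer(tensor) {
  const backend = tf.backend();            // the active WebGPUBackend
  return backend.getBuffer(tensor.dataId); // assumed internal accessor
}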

7 of 18

Example: execute WebNN’s fused op

// Create a WebNN Conv op with bias and fused relu
// (model construction elided; see the sketch after this example)

// Create input and output tensors with the TF.js WebGPU backend
const convInput = tf.tensor(inputData, inputDims);
const convOutput = tf.zeros(outputDims);

// Execute the WebNN fused conv op
fused_conv_execution.setInput(0, tensorToGPUBuffer(convInput));
fused_conv_execution.setOutput(0, tensorToGPUBuffer(convOutput));
fused_conv_execution.startCompute();

const resultData = await convOutput.data();
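
The construction of fused_conv_execution is elided on the slide. Under the same illustrative helper sketched earlier, it might look like:

// Hypothetical: pass the real bias values and request a fused relu, so
// conv, bias add and relu run as a single WebNN operation.
const fusedConv = await createWebNNConvOp(
    inputDims, filterData, filterDims, biasData, /*fuseRelu=*/true);
const fused_conv_compilation = await fusedConv.createCompilation();
fused_conv_compilation.setGPUDevice(tf.backend().device);
await fused_conv_compilation.finish();
const fused_conv_execution = await fused_conv_compilation.createExecution();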

8 of 18

Demo

9 of 18

POC Implementation on MPS (Metal Performance Shaders)

  • Reuse the same MTLDevice associated with the WebGPUDevice.
  • Get the MTLBuffers associated with the input and output WebGPUBuffers.
  • Allocate MPSImages for the inputs with the MTLDevice.
  • Create a MTLCommandBuffer from the MTLCommandQueue associated with the WebGPUDevice.
  • Encode a compute shader that copies and reorders data from the input MTLBuffer to the input MPSImage (MPSImage layout).
  • Encode the MPSNNGraph/MPSCNNKernel to the MTLCommandBuffer.
  • Encode a compute shader that copies and reorders data from the output MPSImage to the output MTLBuffer.
  • Commit the MTLCommandBuffer.

10 of 18

Performance Summary

  • WebNN ops give access to vendor-optimized GPU acceleration.
  • Sharing a WebGPUBuffer avoids GPU-CPU memory transfers.
  • WebNN-WebGPU interop introduces copying and reordering overhead.

Test                                                       Inference time (ms)
WebGPU conv/add/relu                                       61.31
WebNN conv interops with WebGPU add/relu via ArrayBuffer   43.42
WebNN conv interops with WebGPU add/relu via WebGPUBuffer  23.06
WebNN conv with fused add/relu                             21.25

11 of 18

Copying/Reordering Optimization

  • Copying/reordering is only needed for interop with WebGPU compute shaders.
  • Copying/reordering is not necessary when executing a chain of WebNN ops.
  • Using opaque operands between WebNN ops removes the copying/reordering (see the sketch below).
  • Opaque operands also open the opportunity to use device-optimized memory formats (e.g. MPSTemporaryImage).
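
A minimal sketch of what opaque operands could look like at the API level; setOutputOperand and setInputOperand are illustrative names for handing an intermediate result between WebNN executions without exporting it to a GPUBuffer.

// Hypothetical: the intermediate tensor stays in the backend's preferred
// device format (e.g. an MPSTemporaryImage) between the two convs.
const intermediate = conv1_execution.setOutputOperand(0); // opaque handle
conv2_execution.setInputOperand(0, intermediate);
// Only the chain's boundary tensors are shared with WebGPU.
conv1_execution.setInput(0, tensorToGPUBuffer(convInput));
conv2_execution.setOutput(0, tensorToGPUBuffer(convOutput));
conv1_execution.startCompute();
conv2_execution.startCompute();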

Test                              Inference time (ms)
WebGPU conv x2                    112.96
WebNN conv + WebGPU conv          67.33
WebNN conv x2 with reordering     24.53
WebNN conv x2 without reordering  23.01

12 of 18

WebNN-WASM Interop

13 of 18

WASM ops and WebNN graph execution

[Diagram: a WASM op → WebNN ops → WASM op pipeline. Tensors 0-3 live in the WASM heap and are passed between WASM ops and WebNN ops as ArrayBufferViews.]
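
A minimal sketch of the sharing the diagram shows, assuming an Emscripten-style module (wasmModule.HEAPF32) and the Execution API from the earlier examples; the pointer and length variables are illustrative.

// The tensors live in the WASM heap; WebNN reads and writes them through
// ArrayBufferViews, so no extra copy is needed to hand data between WASM
// ops and WebNN ops on the JS side.
// inputPtr/outputPtr are byte offsets returned by the WASM allocator.
const input = new Float32Array(wasmModule.HEAPF32.buffer, inputPtr, inputLength);
const output = new Float32Array(wasmModule.HEAPF32.buffer, outputPtr, outputLength);
conv_execution.setInput(0, input);
conv_execution.setOutput(0, output);
await conv_execution.startCompute();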

14 of 18

Workload: MobileNet V1

  • Source: mobilenet_v1_1.0_224
  • Ops: Conv2D ×15, DepthwiseConv2D ×13, AveragePool2D ×1, Softmax ×1, Squeeze ×1

[Chart: per-op execution time, based on WASM op execution; annotated with the 12× repeated Depthwise+Conv2D block.]

15 of 18

Graph partition configurations

  • WebNN-supported ops:
    • None: all WASM ops
    • Conv2D
    • Conv2D+DepthwiseConv2D
    • All supported (without reordering)
    • All supported (with reordering)

[Diagram: WebNN graph partitions for each configuration: Conv2D only, Conv2D+DepthwiseConv2D, and All (one graph per op).]

16 of 18

Results: performance

  • Offloading expensive ops yields significant speedups:
    • Conv2D (90% of the computation): 5X faster on PC, 3X faster on smartphone
    • Conv2D+DepthwiseConv2D (99% of the computation): 33X faster on PC, 7X faster on smartphone
  • Avoiding copying/reordering across WebNN op executions improves performance further:
    • Without opaque operands vs. with: 3.5X slower on PC, 1.5X slower on smartphone

Smartphone: Pixel 3, Android 9 (updated 12/2018), Chromium 70.0.3503

PC: XPS 13 laptop, Intel Core i5-8250U, Ubuntu Linux 16.04, Chromium 70.0.3503

17 of 18

Summary of WebNN-WASM interop

  • WebNN ops give access to vendor-optimized CPU acceleration.
  • Interop between WASM ops and WebNN ops has overhead:
    • Memory copying between the WASM heap and the WebNN backend
    • Memory reordering, e.g. to the MKL-DNN blocked layout
  • Executing a chain of WebNN ops with opaque operands can avoid the unnecessary overhead.

18 of 18

Proposal

  • Support key ops that access hardware acceleration (#17)
    • E.g. conv2d and matmul
  • Support compiling and executing ops for specific devices (new issue?)
    • CPU or GPU
  • Support interop with WebAssembly and WebGPU compute shaders
    • Sharing ArrayBuffers with WASM ops
    • Sharing WebGPUBuffers with WebGPU ops (new issue?)
  • Support executing chains of ops with opaque operands (#11)
    • Leverage device-optimized memory layouts and avoid unnecessary memory reordering
  • Explore custom op support via a DSL (new issue?)