1 of 64

SYCL Programming and Performance Optimization

Talk for SYCL MODE

Patric Zhao, Intel


2 of 64

Outline

      • Quick Review of Heterogeneous Computing
      • SYCL Programming
          • Basic Parallel Kernels
          • Execution And Memory Model
      • Performance Optimization with Case Study


3 of 64

Heterogeneous Computing

XPU: CPU, GPU, FPGA, and other accelerators


4 of 64

Introduction

Heterogeneous computing: a computing approach in which a system comprises different computing units based on different instruction sets and system architectures.


5 of 64

General Purpose Processor: CPU + GPU for acceleration

CPU

  • General Purpose Tasks
  • Most Common Applications
  • Compute Intensive

GPU

  • Specialized tasks (GEMM, Conv)
  • Graphics and Vision Applications
  • Computation in parallel
  • High memory bandwidth


6 of 64

Execution model: independence

  • Physically separated, interconnected via the mainboard
  • Independent memory space
  • Different computing logic and instructions


7 of 64

Execution model: dependence


8 of 64

SYCL Programming Language


9 of 64

What and Why SYCL?

It is an extension to C++ language, providing:

  • SYCL (pronounced ‘sickle’) is a free, cross-platform abstraction layer.
  • Enables code for heterogeneous and offload processors to be written using modern ISO C++ (at least C++ 17).
  • Provides APIs and abstractions to find devices (e.g. CPUs, GPUs, FPGAs) on which code can be executed, and to manage data resources and code execution on those devices.


10 of 64

Codeplay oneAPI Plug-ins for NVIDIA and AMD

Support for NVIDIA and AMD GPUs added to the Intel® oneAPI Base Toolkit

Codeplay press release

Image courtesy of Codeplay Software Ltd.

Intel®

oneAPI for NVIDIA and AMD GPUs

  • Free download of binary plugins to Intel® oneAPI DPC++/C++ Compiler:
  • NVIDIA GPU
  • AMD beta GPU
  • No need to build from source!
  • Plug-ins updated quarterly in-sync with SYCL 2020 conformance and performance

Priority Support

    • Available through Intel, Codeplay, and our channel
    • Requires Intel Priority Support for Intel oneAPI DPC++/C++ Compiler
    • Intel takes first call, Codeplay delivers backend support
    • Codeplay provides access to older plug-in versions

Other names and brands may be claimed as the property of others. SYCL is a trademark of the Khronos Group Inc.


11 of 64

How does SYCL work?

It is an extension to the C++ language, providing:

  • Abstractions for accessing hardware devices
  • Methods for data access
  • Methods for expressing parallelism


12 of 64

Devices

  • Devices represent the various hardware in a oneAPI system

The device class encapsulates a single piece of hardware and is used for selecting and querying devices

It contains member functions to query device information

It can represent different kinds of hardware: CPU/GPU/FPGA/…

  • The device_selector class supports selecting a specific device at runtime

default_selector, cpu_selector, gpu_selector...
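As a quick illustration (not part of the original deck), a minimal sketch that lists every device the SYCL runtime can see and reports whether it is a CPU, a GPU, or another accelerator:

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    // Enumerate all devices known to the runtime and query a few attributes.
    for (const auto &dev : sycl::device::get_devices()) {
        std::cout << dev.get_info<sycl::info::device::name>() << " : "
                  << (dev.is_gpu() ? "GPU" : dev.is_cpu() ? "CPU" : "other")
                  << "\n";
    }
    return 0;
}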


13 of 64

Code Sample

#include <sycl/sycl.hpp>
#include <iostream>

using namespace sycl;

int main() {
    queue my_gpu_queue( gpu_selector{} );

    std::cout << "Selected GPU device: " <<
        my_gpu_queue.get_device().get_info<info::device::name>() << "\n";

    return 0;
}

patric@gpu:~$ icpx -fsycl gpu_selector.cpp

patric@gpu:~$ ./a.out

Selected GPU device: Intel(R) Iris(R) Xe MAX Graphics [0x4905]


14 of 64

Queue

  • A queue is the mechanism for submitting work to a device
  • The main program pushes tasks into the queue, and the heterogeneous device pulls tasks from the queue for execution
  • A queue maps to a single device, and multiple queues can map to the same device

int main() {
    queue my_gpu_queue( gpu_selector{} );

    std::cout << "Selected GPU device: " <<
        my_gpu_queue.get_device().get_info<info::device::name>() << "\n";

    return 0;
}
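To illustrate the last bullet, a minimal sketch (not from the deck) that creates two queues bound to the same GPU device:

#include <sycl/sycl.hpp>
#include <cassert>
#include <iostream>

int main() {
    // Two independent queues bound to the same GPU; work submitted to either
    // queue is scheduled onto the same piece of hardware.
    sycl::queue q1( sycl::gpu_selector{} );
    sycl::queue q2( q1.get_device() );

    assert(q1.get_device() == q2.get_device());
    std::cout << "Both queues use: "
              << q1.get_device().get_info<sycl::info::device::name>() << "\n";
    return 0;
}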


15 of 64

Data Movement

  • CPU and GPU have independent memory space


16 of 64

  • Programmers need to consider how to transfer data between different devices

- Explicit data transfer: CPU -> GPU, compute, GPU -> CPU

- Implicit data transfer: data is automatically migrated by the system when it is accessed (see the USM sketch below)
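A minimal sketch of the implicit path using a USM shared allocation (assuming a GPU queue as on the earlier slides); the runtime migrates the data automatically:

#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    constexpr int N = 10;
    sycl::queue q( sycl::gpu_selector{} );

    // Shared (USM) allocation: pages migrate between host and device
    // automatically when they are touched.
    int *data = sycl::malloc_shared<int>(N, q);
    for (int i = 0; i < N; i++) { data[i] = i; }           // touched on the host

    q.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
        data[i] *= 2;                                      // touched on the device
    }).wait();

    for (int i = 0; i < N; i++) { printf("%d, ", data[i]); }  // back on the host
    sycl::free(data, q);
    return 0;
}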


17 of 64

  • Explicit memory allocation

Traditional C/C++

void* malloc (size_t size);

Allocate memory block

Allocates a block of size bytes of memory, returning a pointer to the beginning of the block.


18 of 64

  • Explicit memory copy

Traditional C/C++

void * memcpy ( void * destination, const void * source, size_t num );

Copy block of memory

Copies the values of num bytes from the location pointed to by source directly to the memory block pointed to by destination.


19 of 64

Code Sample

constexpr int N = 10;

int *host_mem   = malloc_host<int>(N, my_gpu_queue);
int *device_mem = malloc_device<int>(N, my_gpu_queue);

// Init CPU data
for(int i = 0; i < N; i++) { host_mem[i] = i; }

// Copy from host(CPU) to device(GPU)
my_gpu_queue.memcpy(device_mem, host_mem, N * sizeof(int)).wait();

// do some work on GPU

// Copy back from GPU to CPU
my_gpu_queue.memcpy(host_mem, device_mem, N * sizeof(int)).wait();

printf("\nData Result\n");
for(int i = 0; i < N; i++) { printf("%d, ", host_mem[i]); }

Compilation:

patric@gpu:~/course$ icpx -fsycl data_movement_ex.cpp -o data_move

Run & output:

patric@gpu:~/course$ ./data_move

Selected GPU device: Intel(R) Iris(R) Xe MAX Graphics [0x4905]

Data Result

0, 1, 2, 3, 4, 5, 6, 7, 8, 9,

Task Done!


20 of 64

Kernel

  • Which parts need to be parallelized?

The most computationally intensive and time-consuming parts, usually found inside loops

for(int i = 0; i < 1024; i++) {
    a[i] = b[i] + c[i];
}

Serial execution

launch N kernel instances {
    int id = get_instance_id();
    a[id] = b[id] + c[id];
}

Parallel execution


21 of 64

parallel_for

  • The parallel for-loop is the core in parallel computing.

In this loop, each iteration should be completely independent.

  • Parallel kernels are expressed using the parallel_for.

h.parallel_for(range<1>(1024), [=](id<1> i){
    A[i] = B[i] + C[i];
});

for(int i = 0; i < 1024; i++){
    a[i] = b[i] + c[i];
}

Offload to accelerator using parallel_for

Serial execution


22 of 64

Basic Parallel Kernels

The functions of the basic parallel kernels are provided by the range, id and item classes.

h.parallel_for(range<1>(1024), [=](item<1> item){

    auto idx = item.get_id();
    auto R   = item.get_range();

    // CODE THAT RUNS ON DEVICE

});

  • range describes the size of the task space.
  • item represents a single instance of a kernel function and exposes functions to query attributes of the execution range.
  • Use the item information to map each work-item onto the overall task space, as in the sketch below.
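A self-contained sketch (not from the deck) that uses item to index the task space; it assumes the same GPU setup as the earlier samples:

#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    constexpr int N = 1024;
    sycl::queue q( sycl::gpu_selector{} );

    int *data = sycl::malloc_shared<int>(N, q);
    for (int i = 0; i < N; i++) { data[i] = i; }

    q.submit([&](sycl::handler &h) {
        h.parallel_for(sycl::range<1>(N), [=](sycl::item<1> it) {
            auto idx = it.get_id(0);     // this work-item's position
            // it.get_range(0) would return the task-space size, N
            data[idx] *= 2;
        });
    }).wait();

    printf("data[10] = %d\n", data[10]);  // expect 20
    sycl::free(data, q);
    return 0;
}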


23 of 64

Code Sample

// Copy from host(CPU) to device(GPU)
my_gpu_queue.memcpy(device_mem, host_mem, N * sizeof(int)).wait();

// do some work on GPU
// submit the content to the queue for execution
my_gpu_queue.submit([&](handler& h) {

    // Parallel Computation
    h.parallel_for(range{N}, [=](id<1> item) {
        device_mem[item] *= 2;
    });

}); // end submit

// wait until the computation is done
my_gpu_queue.wait();

// Copy back from GPU to CPU
my_gpu_queue.memcpy(host_mem, device_mem, N * sizeof(int)).wait();

Run & output:

patric@gpu:~/course$ ./basic_parafor

Selected GPU device: Intel(R) Iris(R) Xe MAX Graphics [0x4905]

Data Result

0, 2, 4, 6, 8, 10, 12, 14, 16, 18,

Task Done!


24 of 64

Execution model


25 of 64

Further Thoughts

  • How does the GPU work?
  • How does parallel_for work?
  • How are tasks mapped to the GPU?
  • How do we control the mapping of tasks?


26 of 64

Intel® Xe MAX GPU architecture

Intel SubSlice == NVIDIA SM

Intel Slice == NVIDIA GPC


27 of 64

DPC++ Task Granularity

Work-item:

The most basic computing granularity, which will be mapped to the ALU for execution.

Work-group:

Organizes a number of work-items that execute at the same time; a work-group is scheduled onto the same subslice, and its work-items share certain resources, enabling collaboration within the group

h.parallel_for(range{N}, [=](id<1> item) {
    device_mem[item] *= 2;
});


28 of 64

Explicitly set computing granularity

The basic parallel kernel is an easy approach to parallelize for-loops, but lacks precise control.

The ND-range kernel is an enhanced way to express parallelism:

  • More precise control over task division
  • Maps the execution of tasks onto the corresponding hardware computing units


29 of 64

ND-Range kernel

The function of the ND_Range kernel is represented by nd_range and nd_item

  • The nd_range class represents global execution range and number of work items in each work-group

h.parallel_for(nd_range<1>(N, 64), [=](nd_item<1> item){
    auto idx = item.…;
    device_mem[idx] *= 2;
});

h.parallel_for(range{N}, [=](id<1> item) {
    device_mem[item] *= 2;
});

How many work-items are in each work-group?

How many work-groups will be generated?


30 of 64

The nd_item class represents a single instance of a kernel function and allows querying the workgroup range and index.

h.parallel_for(nd_range<1>(N, 64), [=](nd_item<1> item){
    auto idx = item.___();   // which member function gives the right index?
    device_mem[idx] *= 2;
});

(Figure: the range is split into work-groups of 64 items each.)

get_group() : group id, from 0 to N/64 - 1

get_local_id() : index within a work-group, 0 to 63

get_global_linear_id() : linear index over the whole range, 0 to N - 1

Check if you understand (a runnable sketch follows).
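A runnable sketch (not from the deck; it assumes N is a multiple of 64 and a GPU queue) that prints the three kinds of indices for a few work-items, so the answers above can be checked:

#include <sycl/sycl.hpp>

int main() {
    constexpr size_t N = 256;
    sycl::queue q( sycl::gpu_selector{} );

    q.submit([&](sycl::handler &h) {
        sycl::stream out(8192, 256, h);   // device-side printing
        h.parallel_for(sycl::nd_range<1>(N, 64), [=](sycl::nd_item<1> it) {
            // only print the first work-item of each work-group
            if (it.get_local_id(0) == 0) {
                out << "group " << it.get_group(0)
                    << "  local " << it.get_local_id(0)
                    << "  global " << it.get_global_linear_id() << "\n";
            }
        });
    }).wait();
    return 0;
}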


31 of 64

Example: GEMM

Let's see how strong you are!

C[i][j] += A[i][k] * B[k][j]

for (i = 0; i < M; ++i)
    for (j = 0; j < N; ++j)
        for (k = 0; k < L; ++k)
            c[i][j] += a[i][k] * b[k][j];


32 of 64

GEMM: CPU Version

// Single Thread Computation in CPU
for(int i = 0; i < M; i++) {
    for(int j = 0; j < N; j++) {
        float sum = 0.0f;
        for(int k = 0; k < K; k++) {
            sum += cA[i * K + k] * cB[k * N + j];
        }
        cC[i * N + j] = sum;
    }
}


33 of 64

GEMM: GPU version

// define the workgroup size and mapping
auto grid_rows = (M + block_size - 1) / block_size * block_size;
auto grid_cols = (N + block_size - 1) / block_size * block_size;

auto local_ndrange  = range<2>(block_size, block_size);
auto global_ndrange = range<2>(grid_rows, grid_cols);

auto e = q.submit([&](sycl::handler &h) {
    h.parallel_for<class k_name_t>(
        sycl::nd_range<2>(global_ndrange, local_ndrange),
        [=](sycl::nd_item<2> index) {
            int row = index.get_global_id(0);
            int col = index.get_global_id(1);

            float sum = 0.0f;
            for (int i = 0; i < K; i++) {
                sum += A[row * K + i] * B[i * N + col];
            }
            C[row * N + col] = sum;
        });
});
e.wait();


34 of 64

Performance and Occupancy

How does the size of a work-group relate to performance?
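One way to explore this question is to time the same kernel with different work-group sizes using SYCL event profiling. A hedged sketch (the kernel and the candidate sizes are just placeholders):

#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    constexpr size_t N = 1 << 24;
    sycl::queue q( sycl::gpu_selector{},
                   sycl::property::queue::enable_profiling{} );
    float *data = sycl::malloc_device<float>(N, q);

    for (size_t wg : {32, 64, 128, 256}) {
        // Same trivial kernel, different work-group size each time.
        auto e = q.parallel_for(sycl::nd_range<1>(N, wg),
                                [=](sycl::nd_item<1> it) {
            data[it.get_global_linear_id()] = 1.0f;
        });
        e.wait();
        auto t0 = e.get_profiling_info<sycl::info::event_profiling::command_start>();
        auto t1 = e.get_profiling_info<sycl::info::event_profiling::command_end>();
        printf("work-group %zu: %.3f ms\n", wg, (t1 - t0) * 1e-6);
    }
    sycl::free(data, q);
    return 0;
}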


35 of 64

Memory model


36 of 64

Memory hierarchy

  • Private Memory

Per-work-item resources; each work-item's private memory is independent and not visible to any other work-item

  • (Shared) Local Memory

Visible only to the work-items of the same work-group (see the sketch below)

  • Global Memory

Visible to all work-items
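A toy sketch (not from the deck) showing how (shared) local memory is declared with a local_accessor and shared inside one work-group; here each group of 64 work-items stages a tile and then reads it back reversed:

#include <sycl/sycl.hpp>

int main() {
    constexpr size_t N = 1024, WG = 64;
    sycl::queue q( sycl::gpu_selector{} );
    int *in  = sycl::malloc_shared<int>(N, q);
    int *out = sycl::malloc_shared<int>(N, q);
    for (size_t i = 0; i < N; i++) { in[i] = (int)i; }

    q.submit([&](sycl::handler &h) {
        // One SLM buffer per work-group, shared by its 64 work-items.
        sycl::local_accessor<int, 1> tile(sycl::range<1>(WG), h);

        h.parallel_for(sycl::nd_range<1>(N, WG), [=](sycl::nd_item<1> it) {
            size_t lid = it.get_local_id(0);
            size_t gid = it.get_global_linear_id();

            tile[lid] = in[gid];                     // stage into SLM
            sycl::group_barrier(it.get_group());     // group-wide sync

            // every work-item can now read any element its group staged,
            // e.g. reverse the tile within the group
            out[gid] = tile[WG - 1 - lid];
        });
    }).wait();

    sycl::free(in, q);
    sycl::free(out, q);
    return 0;
}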


37 of 64

auto e = q.submit([&](sycl::handler &h) {
    h.parallel_for<class k_name_t>(
        sycl::nd_range<2>(global_ndrange, local_ndrange),
        [=](sycl::nd_item<2> index) {
            int row = index.get_global_id(0);
            int col = index.get_global_id(1);

            float sum = 0.0f;
            for (int i = 0; i < K; i++) {
                sum += A[row * K + i] * B[i * N + col];
            }
            C[row * N + col] = sum;
        });
});
e.wait();

Private memory

  • Created by the work-item itself;
  • Visible only to the current work-item and inaccessible to other work-items;
  • Each work-item has its own value, even though the variable name is the same.

Global memory

  • Transferred in from outside the kernel;
  • Globally visible: all work-items can access the same data, e.g. A[0,5];
  • Writing to the same address at the same time may lead to nondeterminism (a data race), which requires special attention.


38 of 64

Private memory / registers

  • Fastest, but very limited capacity per EU
  • Keep intermediate results in registers
  • Using too many registers limits the number of work-groups that can run in parallel

Global Memory

  • Slow, with long latency, but very large capacity
  • Used for data loading and storing
  • Reduce redundant reads/writes for better performance


39 of 64

Improve the performance of matrix computation (GEMM)

Data reuse: Global Memory Reuse

  • Elements of arrays A and B are reusable in rows and columns
  • Data are not reusable when a work-item computes only one output
  • It is more efficient for a work-item to compute the output of a tile

Intermediate results: register

  • Read data into register to avoid repeat access to the global memory
  • Store the accumulation result in the register of each work-item to avoid repeat read/write to the global memory
  • Finally, write out the result of the register in a single shot


40 of 64

GEMM: tile version

// Mapping
#define tileX 2
#define tileY 2

// define the workgroup size and mapping
auto grid_rows = M / tileY;
auto grid_cols = N / tileX;

auto local_ndrange  = range<2>(BLOCK, BLOCK);
auto global_ndrange = range<2>(grid_rows, grid_cols);

auto e = q.submit([&](sycl::handler &h) {
    h.parallel_for<class k_name_t>(
        sycl::nd_range<2>(global_ndrange, local_ndrange),
        [=](sycl::nd_item<2> index) {
            int row = tileY * index.get_global_id(0);
            int col = tileX * index.get_global_id(1);
            ….
        });
});


41 of 64

float sum[tileY][tileX] = {0.0f};
float subA[tileY] = {0.0f};
float subB[tileX] = {0.0f};

// core computation
for (int k = 0; k < N; k++) {

    // read data into registers
    for(int m = 0; m < tileY; m++) { subA[m] = A[(row + m) * N + k]; }
    for(int p = 0; p < tileX; p++) { subB[p] = B[k * N + p + col]; }

    // math
    for (int m = 0; m < tileY; m++) {
        for (int p = 0; p < tileX; p++) {
            sum[m][p] += subA[m] * subB[p];
        }
    }
} // end of K

// write results back
for (int m = 0; m < tileY; m++)
    for (int p = 0; p < tileX; p++)
        C[(row + m) * N + col + p] = sum[m][p];


42 of 64

Performance Optimization

Image Source : https://www.constructioninsure.co.uk/high-risk-trade-insurance-most-beneficial/


43 of 64

Case Study: Interaction Layers in DLRM

  • Inputs:
    • Dense features (128-dim)
    • Sparse features (26 one-hot encoded value)

Output:

    • Probability of Click
  • DLRM modules:
    • Bottom MLP (matmul)
    • EMB lookup (memory copy)
    • Interaction layers
    • Top MLP (matmul)


44 of 64

Interaction Module Fusion

How do the interaction layers work?

    • (Con)cat

cat_res = torch.cat(tensor_list, dim=1)
tensor_list: 27 tensors of shape [32768, 128]
cat result shape: [32768, 27, 128]

    • Bmm

bmm_res = torch.bmm(cat_res, cat_resT)   # cat_resT: cat_res with the last two dims transposed
bmm result shape: [32768, 27, 27]

    • Index

index_res = bmm_res[:, k, j]             # k, j: index lists selecting the 351 lower-triangle pairs
index result shape: [32768, 351]

    • (Con)cat

cat_res = torch.cat([tensor1, index_res], dim=1)
cat result shape: [32768, 128+351]
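As a side note on where the 351 comes from: the index step keeps one value per unordered pair of the 27 features, i.e. 27 × 26 / 2 = 351. A tiny sketch (illustration only; the k/j index lists in the real model are built by the framework):

#include <cstdio>
#include <vector>
#include <utility>

int main() {
    const int F = 27;                               // number of concatenated features
    std::vector<std::pair<int, int>> pairs;
    for (int k = 1; k < F; k++)                     // row index
        for (int j = 0; j < k; j++)                 // column index below the diagonal
            pairs.emplace_back(k, j);
    printf("pairs = %zu\n", pairs.size());          // prints 351
    return 0;
}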


45 of 64

  • Analysis: calculation for BMM

OP  = #outputs × K × 2 (FMA) = 32768 × 27 × 27 × 128 × 2 = 6115295232

MEM = read + write = 32768 × (27 × 128 × 2 + 27 × 27) = 250380288 elements

EU time (FP32) = OPs / (EUs × SIMD width × 2 Op/cycle × Freq × EU eff)
               = 6115295232 / (480 × 8 × 2 × 1400e6) × 1000 ms

DPAS time      = OPs / (EUs × Systolic Width × Systolic Depth × 2 × 2 Op/cycle × Freq × EU eff)
               = 6115295232 / (480 × 8 × 8 × 2 × 2 × 1400e6) × 1000 ms

Mem time (FP32) = MEM × data type size / (Bandwidth × BW eff) = 250380288 × 4 / 700

Do we need the systolic array (DPAS)?

Do we need FP16/BF16?

Ideally the time is Max(compute, memory); in reality it falls in [Max, Sum(compute, memory)].

ATS | Shape                                      | OP         | MEM       | OP/Elem | Data type | EU time (ms) | DPAS (ms) | Mem time (ms) under 700 GB/s | Max (ms)
BMM | 32768x27x128 * 32768x128x27 -> 32768x27x27 | 6115295232 | 250380288 | 24.42   | FP32      | 0.5688       |           | 2.044                        | 2.044
    |                                            |            |           |         | FP16      | 0.2844       | 0.0355    | 1.022                        | 1.022
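A small calculator sketch that reproduces the OP/MEM and compute-time estimates above (device parameters taken from the slide: 480 EUs, SIMD-8, 1400 MHz; the DPAS path assumes an 8-wide, 8-deep systolic array):

#include <cstdio>

int main() {
    // BMM shape from the table: 32768 batches of (27x128) * (128x27) -> (27x27)
    const double B = 32768, M = 27, N = 27, K = 128;

    double ops = B * M * N * K * 2;            // 2 ops per FMA         -> 6115295232
    double mem = B * (M * K * 2 + M * N);      // elements read+written -> 250380288

    const double eus = 480, simd = 8, freq = 1400e6;

    double eu_ms_fp32 = ops / (eus * simd * 2 * freq) * 1e3;            // ~0.5688
    double eu_ms_fp16 = eu_ms_fp32 / 2;          // FP16 MAD is 1 cycle -> ~0.2844
    double dpas_ms    = ops / (eus * simd * 8 * 2 * 2 * freq) * 1e3;    // ~0.0355

    printf("OP = %.0f  MEM = %.0f  OP/Elem = %.2f\n", ops, mem, ops / mem);
    printf("EU time: FP32 %.4f ms, FP16 %.4f ms, DPAS %.4f ms\n",
           eu_ms_fp32, eu_ms_fp16, dpas_ms);
    return 0;
}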


46 of 64

Init Design

  • Start from BMM

Because it’s the core part of the whole interaction layers

  • BMM: Batch Matrix Multiplication
    • Input 32768x27x27, FP32 data type
    • Each work-group computes 1 batch of the output
    • The padded (virtual) work-item grid is 32x32, but the data is still 27x27
    • Each work-item computes a 4x4 output sub-matrix (output tiling)
    • Loads, computes, and stores directly through global memory (a sketch of this design follows the figure below)

(Figure: one work-group computes a padded 32x32 (M = 32, N = 32) output batch; each work-item produces a 4x4 sub-matrix of the output.)
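A hedged sketch of this initial design (a fragment, not the author's actual kernel; it assumes USM pointers A, B, C for the padded FP32 batches a[batch][32][128], b[batch][128][32], c[batch][32][32] and a queue q): one work-group per batch, 8x8 work-items, each computing a 4x4 output tile straight from global memory.

constexpr int M = 32, N = 32, K = 128;      // padded per-batch shape
constexpr int TY = 4, TX = 4;               // output tile per work-item
const size_t batch = 32768;

q.submit([&](sycl::handler &h) {
    h.parallel_for(
        sycl::nd_range<3>(sycl::range<3>(batch, M / TY, N / TX),
                          sycl::range<3>(1, M / TY, N / TX)),
        [=](sycl::nd_item<3> it) {
            int b   = it.get_group(0);                    // one work-group per batch
            int row = it.get_global_id(1) * TY;
            int col = it.get_global_id(2) * TX;

            float sum[TY][TX] = {};                       // accumulators in registers
            for (int k = 0; k < K; k++)                   // global-memory loads only
                for (int m = 0; m < TY; m++)
                    for (int p = 0; p < TX; p++)
                        sum[m][p] += A[(b * M + row + m) * K + k] *
                                     B[(b * K + k) * N + col + p];

            for (int m = 0; m < TY; m++)                  // store straight back
                for (int p = 0; p < TX; p++)
                    C[(b * M + row + m) * N + col + p] = sum[m][p];
        });
}).wait();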


47 of 64

Low Precision

FP32 vs. FP16

EU Active improves from 29.0% to 45.3%

Note:

The FP16 instruction is also a MAD (fmad):

  • FP32 mad -- 2 cycles
  • FP16 mad -- 1 cycle

The CUDA instruction is different

  • FP32 fma -- 1 cycle
  • FP16 hfma -- 1 cycle
  • FP16 hfma2 -- 1 cycle but 2 compute inside

FP32

FP16

Begin Optimization


48 of 64

Mem: FP32 vs. FP16

FP32

FP16

Half memory consumption

HBM from GMEM to EU

R : 113.8GB -> 53.8GB

W: 13.8GB -> 6.5 GB

L3 read byte

426.4GB -> 272.2GB


49 of 64

Problem size: batch c(32768,32,32) = a(32768,32,128) * b(32768,128,32)

Performance Params: TILE_X = 8 TILE_Y = 8 TILE_CX = 4 TILE_CY = 4

perf %ops% %prb_bytes% %time% %gflops% %bandwidth%

perf 8589934592 8192 5.2906 1623.6264 0.0014

Number of 33554432 data are verified.

Test Passed

(base) gta@DUT3014-ATS:~/Patric/gpu-perf-show/codex/build$ ./src/sycl-bmm-bmm-fp16-v1-cpp

Problem size: batch c(32768,32,32) = a(32768,32,128) * b(32768,128,32)

Performance Params: THREAD_X = 8 THREAD_Y = 8 TILE_CX = 4 TILE_CY = 4

perf %ops% %prb_bytes% %time% %gflops% %bandwidth%

perf 8589934592 8192 3.0827 2786.5310 0.0025

Number of 33554432 data are verified.

Test Passed


50 of 64

Data Exploration

FP32

FP16

What can we learn from data?

L3 read byte

426.4GB -> 272.2GB

But the L3 read volume is much larger, 3-5X the HBM read volume

FP32: 3.7X -> 426.4GB/113.8GB

FP16: 5.0X -> 272.2GB/53.8GB

What does this mean?


51 of 64

FP16: SLM version

FP16 w/ SLM

Mem:

SLM bandwidth: 448GB/s

HBM bandwidth improved:

from 158GB/s -> 202GB/s

EU:

Active improved from 45.3% to 67.8%

Stalled reduced from 50.9% to 27.2%

FP16 w/o SLM

Why HBM BW is improved?

Why ‘EU Active’ is improved?


52 of 64

FP16, SLM version, L3

FP16 w/ SLM

FP16 w/o SLM

HBM:

Read volume doesn't change

L3:

Read volume is now similar to HBM; we no longer go back to L3 repeatedly!

SLM:

Repeated reads from SLM, as we expected


53 of 64

Problem size: batch c(32768,32,32) = a(32768,32,128) * b(32768,128,32)

Performance Params: THREAD_X = 8 THREAD_Y = 8 TILE_CX = 4 TILE_CY = 4

perf %ops% %prb_bytes% %time% %gflops% %bandwidth%

perf 8589934592 8192 3.0827 2786.5310 0.0025

Problem size: batch c(32768,32,32) = a(32768,32,128) * b(32768,128,32)

Performance Params: THREAD_X = 8 THREAD_Y = 8 TILE_CX = 4 TILE_CY = 4 TILE_K = 32

perf %ops% %prb_bytes% %time% %gflops% %bandwidth%

perf 8589934592 8192 2.3721 3621.2668 0.0032


54 of 64

Do more DLRM-specific changes now ☺

Switch the source A

  • One source from A

the memory read volume is halved

  • Fused output of index

the memory write volume is halved


55 of 64

OP    | FWK  | Benchdnn | DPCPP | DPCPP (one source)
cat0  | 1.49 |          |       |
bmm   | 0.94 | 1.82     | 2.11  | 1.42
index | 0.52 |          |       |
cat1  | 0.30 |          |       |
total | 3.25 |          | 3.75  | 3.21

Where are we going next step?

    • Continue optimizing BMM to achieve better performance
    • Fuse the other ops


56 of 64

Prefix Fusion for Concat


57 of 64

The data access pattern changes

Inputs (batch, m, k) w/ concat (cat)

Inputs (m, batch, k) w/o concat (cat)



58 of 64

  • Bandwidth decreases because of the worse locality

How to improve the GMEM bandwidth?


59 of 64

  • Tiling in the batch dimension

read data along the batch direction first, so loads are contiguous

  • Work-group ordering

reuse data from L3 and switch data less often

(Figure: work-group execution order 1, 2, 3 across columns m, m+1, m+2.)


60 of 64

  • Batch-direction data loading
    • Improves the GMEM BW
    • Improves the SLM read BW

More work-items and more SLM are used, so the SLM read BW improves


61 of 64

Fused all together

Last concat (cat1)

Fused Index


62 of 64

Half Computation

Only half of the work-groups are needed (the index step only consumes the lower triangle of the symmetric BMM output), so we change the work-group mapping

(Figure: the original mapping launches 16 work-groups, numbered 0-15, covering the full output grid; the new mapping launches only 10 work-groups covering the lower triangle, ordered 0, 1, 2, 6, 7, 8, 9, 3, 4, 5.)
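For illustration (the deck's actual ordering above also reorders groups for locality), one simple way to map a linear work-group id onto the lower triangle so that only n(n+1)/2 groups are launched instead of n*n:

#include <cmath>
#include <cstdio>

// Map a linear id in [0, n*(n+1)/2) to (row, col) with row >= col,
// enumerating the triangle row by row: (0,0), (1,0), (1,1), (2,0), ...
static void linear_to_tril(int id, int *row, int *col) {
    int r = (int)((std::sqrt(8.0 * id + 1.0) - 1.0) / 2.0);
    *row = r;
    *col = id - r * (r + 1) / 2;
}

int main() {
    const int n = 4;                                  // 4x4 tiles -> 10 work-groups
    for (int id = 0; id < n * (n + 1) / 2; id++) {
        int r, c;
        linear_to_tril(id, &r, &c);
        printf("wg %2d -> tile (%d, %d)\n", id, r, c);
    }
    return 0;
}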


63 of 64

Final Performance

Columns: (1) FWK, (2) DPCPP, (3) DPCPP (bmm + index), (4) DPCPP (bmm + index + concat), (5) DPCPP (concat + bmm + index), (6) DPCPP (concat + bmm + index + concat), (7) DPCPP: half computation (concat + bmm + index + concat), (8) speedup

OP    |  (1) |  (2) |  (3) |  (4) |  (5) |  (6) |  (7) |  (8)
cat0  | 1.49 |      |      |      | 1.68 | 1.87 | 1.39 | 2.34X
bmm   | 0.94 | 1.44 | 1.43 | 1.5  |      |      |      |
index | 0.52 |      |      |      |      |      |      |
cat1  | 0.30 |      |      |      |      |      |      |
total | 3.25 | 3.75 | 3.22 | 2.99 | 1.98 |      |      |

(In the fused columns a single kernel covers several ops, so its time appears once, in the row of the first fused op.)


64 of 64

Thank you and Questions
