1 of 64

SYCL Programming and Performance Optimization

Talk for SYCL MODE

Patric Zhao, Intel


2 of 64

Outline

      • Quick Review of Heterogeneous Computing
      • SYCL Programming
          • Basic Parallel Kernels
          • Execution And Memory Model
      • Performance Optimization with Case Study


3 of 64

Heterogeneous Computing

XPU: CPU, GPU, FPGA, and other accelerators


4 of 64

Introduction

Heterogeneous computing: a computing approach in which a system comprises different computing units based on different instruction sets and system architectures.


5 of 64

General Purpose Processor: CPU + GPU for acceleration

CPU

  • General Purpose Tasks
  • Most Common Applications
  • Compute Intensive

GPU

  • Specialized tasks (GEMM, Conv)
  • Graphics and Vision Applications
  • Computation in parallel
  • High memory bandwidth


6 of 64

Execution model: independence

  • Physically separated, interconnected via the mainboard
  • Independent memory space
  • Different computing logic and instructions


7 of 64

Execution model: dependence


8 of 64

SYCL Programming Language


9 of 64

What and Why SYCL?

It is an extension to C++ language, providing:

  • SYCL (pronounced ‘sickle’) is a free, cross-platform abstraction layer.
  • Enables code for heterogeneous and offload processors to be written using modern ISO C++ (at least C++ 17).
  • Provides APIs and abstractions to find devices (e.g. CPUs, GPUs, FPGAs) on which code can be executed, and to manage data resources and code execution on those devices.


10 of 64

Codeplay oneAPI Plug-ins for NVIDIA and AMD

Support for NVIDIA and AMD GPUs added to the Intel® oneAPI Base Toolkit

Codeplay press release

Image courtesy of Codeplay Software Ltd.

Intel®

oneAPI for NVIDIA and AMD GPUs

  • Free download of binary plugins to Intel® oneAPI DPC++/C++ Compiler:
  • NVIDIA GPU
  • AMD beta GPU
  • No need to build from source!
  • Plug-ins updated quarterly in-sync with SYCL 2020 conformance and performance

Priority Support

    • Available through Intel, Codeplay, and our channel
    • Requires Intel Priority Support for Intel oneAPI DPC++/C++ Compiler
    • Intel takes first call, Codeplay delivers backend support
    • Codeplay provides access to older plug-in versions

Other names and brands may be claimed as the property of others. SYCL is a trademark of the Khronos Group Inc.


11 of 64

How does SYCL work?

It is an extension to the C++ language, providing:

  • Abstractions for accessing hardware devices
  • Methods for data access
  • Methods for expressing parallelism


12 of 64

Devices

  • Devices represent the various hardware in a oneAPI system

The device class encapsulates a single piece of hardware and is used for selecting and querying devices

It contains member functions to query device information

It can represent different kinds of hardware: CPU/GPU/FPGA/…

  • The device_selector class supports selecting a specific device at runtime

default_selector, cpu_selector, gpu_selector...
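As a quick illustration (not part of the original deck), a minimal sketch that lists every device the SYCL runtime can see and reports whether it is a CPU, a GPU, or another accelerator:

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    // Enumerate all devices known to the runtime and query a few attributes.
    for (const auto &dev : sycl::device::get_devices()) {
        std::cout << dev.get_info<sycl::info::device::name>() << " : "
                  << (dev.is_gpu() ? "GPU" : dev.is_cpu() ? "CPU" : "other")
                  << "\n";
    }
    return 0;
}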


13 of 64

Code Sample

#include <sycl/sycl.hpp>
#include <iostream>

using namespace sycl;

int main() {
    queue my_gpu_queue( gpu_selector{} );

    std::cout << "Selected GPU device: " <<
        my_gpu_queue.get_device().get_info<info::device::name>() << "\n";

    return 0;
}

patric@gpu:~$ icpx -fsycl gpu_selector.cpp

patric@gpu:~$ ./a.out

Selected GPU device: Intel(R) Iris(R) Xe MAX Graphics [0x4905]


14 of 64

Queue

  • A queue is the mechanism for submitting work to a device
  • The main program pushes tasks into the queue, and the heterogeneous device pulls tasks from the queue for execution
  • A queue maps to a single device, and multiple queues can map to the same device

int main() {
    queue my_gpu_queue( gpu_selector{} );

    std::cout << "Selected GPU device: " <<
        my_gpu_queue.get_device().get_info<info::device::name>() << "\n";

    return 0;
}
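To illustrate the last bullet, a minimal sketch (not from the deck) that creates two queues bound to the same GPU device:

#include <sycl/sycl.hpp>
#include <cassert>
#include <iostream>

int main() {
    // Two independent queues bound to the same GPU; work submitted to either
    // queue is scheduled onto the same piece of hardware.
    sycl::queue q1( sycl::gpu_selector{} );
    sycl::queue q2( q1.get_device() );

    assert(q1.get_device() == q2.get_device());
    std::cout << "Both queues use: "
              << q1.get_device().get_info<sycl::info::device::name>() << "\n";
    return 0;
}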


15 of 64

Data Movement

  • CPU and GPU have independent memory space


16 of 64

  • Programmers need to consider how to transfer data between different devices

- Explicit data transfer: CPU -> GPU, compute, GPU -> CPU

- Implicit data transfer: data is automatically migrated by the system when it is accessed (see the USM sketch below)
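A minimal sketch of the implicit path using a USM shared allocation (assuming a GPU queue as on the earlier slides); the runtime migrates the data automatically:

#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    constexpr int N = 10;
    sycl::queue q( sycl::gpu_selector{} );

    // Shared (USM) allocation: pages migrate between host and device
    // automatically when they are touched.
    int *data = sycl::malloc_shared<int>(N, q);
    for (int i = 0; i < N; i++) { data[i] = i; }           // touched on the host

    q.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
        data[i] *= 2;                                      // touched on the device
    }).wait();

    for (int i = 0; i < N; i++) { printf("%d, ", data[i]); }  // back on the host
    sycl::free(data, q);
    return 0;
}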


17 of 64

  • Explicit memory allocation

Traditional C/C++

void* malloc (size_t size);

Allocate memory block

Allocates a block of size bytes of memory, returning a pointer to the beginning of the block.


18 of 64

  • Explicit memory copy

Traditional C/C++

void * memcpy ( void * destination, const void * source, size_t num );

Copy block of memory

Copies the values of num bytes from the location pointed to by source directly to the memory block pointed to by destination.


19 of 64

Code Sample

constexpr int N = 10;

int *host_mem   = malloc_host<int>(N, my_gpu_queue);
int *device_mem = malloc_device<int>(N, my_gpu_queue);

// Init CPU data
for(int i = 0; i < N; i++) { host_mem[i] = i; }

// Copy from host(CPU) to device(GPU)
my_gpu_queue.memcpy(device_mem, host_mem, N * sizeof(int)).wait();

// do some work on GPU

// Copy back from GPU to CPU
my_gpu_queue.memcpy(host_mem, device_mem, N * sizeof(int)).wait();

printf("\nData Result\n");
for(int i = 0; i < N; i++) { printf("%d, ", host_mem[i]); }

Compilation:

patric@gpu:~/course$ icpx -fsycl data_movement_ex.cpp -o data_move

Run & output:

patric@gpu:~/course$ ./data_move

Selected GPU device: Intel(R) Iris(R) Xe MAX Graphics [0x4905]

Data Result

0, 1, 2, 3, 4, 5, 6, 7, 8, 9,

Task Done!


20 of 64

Kernel

  • Which parts need to be parallelized?

The most computationally intensive and time-consuming parts, usually found inside loops

for(int i = 0; i < 1024; i++) {
    a[i] = b[i] + c[i];
}

Serial execution

launch N kernel instances {
    int id = get_instance_id();
    a[id] = b[id] + c[id];
}

Parallel execution


21 of 64

parallel_for

  • The parallel for-loop is the core in parallel computing.

In this loop, each iteration should be completely independent.

  • Parallel kernels are expressed using the parallel_for.

h.parallel_for(range<1>(1024), [=](id<1> i){
    A[i] = B[i] + C[i];
});

for(int i = 0; i < 1024; i++){
    a[i] = b[i] + c[i];
}

Offload to accelerator using parallel_for

Serial execution


22 of 64

Basic Parallel Kernels

The functions of the basic parallel kernels are provided by the range, id and item classes.

h.parallel_for(range<1>(1024), [=](item<1> item){

    auto idx = item.get_id();
    auto R   = item.get_range();

    // CODE THAT RUNS ON DEVICE

});

  • range describes the size of the task space.
  • item represents a single instance of a kernel function and exposes functions to query attributes of the execution range.
  • Use the item information to map each work-item onto the overall task space, as in the sketch below.
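A self-contained sketch (not from the deck) that uses item to index the task space; it assumes the same GPU setup as the earlier samples:

#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    constexpr int N = 1024;
    sycl::queue q( sycl::gpu_selector{} );

    int *data = sycl::malloc_shared<int>(N, q);
    for (int i = 0; i < N; i++) { data[i] = i; }

    q.submit([&](sycl::handler &h) {
        h.parallel_for(sycl::range<1>(N), [=](sycl::item<1> it) {
            auto idx = it.get_id(0);     // this work-item's position
            // it.get_range(0) would return the task-space size, N
            data[idx] *= 2;
        });
    }).wait();

    printf("data[10] = %d\n", data[10]);  // expect 20
    sycl::free(data, q);
    return 0;
}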


23 of 64

Code Sample

// Copy from host(CPU) to device(GPU)
my_gpu_queue.memcpy(device_mem, host_mem, N * sizeof(int)).wait();

// do some work on GPU
// submit the content to the queue for execution
my_gpu_queue.submit([&](handler& h) {

    // Parallel Computation
    h.parallel_for(range{N}, [=](id<1> item) {
        device_mem[item] *= 2;
    });

}); // end submit

// wait until the computation is done
my_gpu_queue.wait();

// Copy back from GPU to CPU
my_gpu_queue.memcpy(host_mem, device_mem, N * sizeof(int)).wait();

Run & output:

patric@gpu:~/course$ ./basic_parafor

Selected GPU device: Intel(R) Iris(R) Xe MAX Graphics [0x4905]

Data Result

0, 2, 4, 6, 8, 10, 12, 14, 16, 18,

Task Done!


24 of 64

Execution model


25 of 64

Further Thoughts

  • How does the GPU work?
  • How does parallel_for work?
  • How are tasks mapped to the GPU?
  • How do we control the mapping of tasks?


26 of 64

Intel® Xe MAX GPU architecture

Intel SubSlice == NVIDIA SM

Intel Slice == NVIDIA GPC


27 of 64

DPC++ Task Granularity

Work-item:

The most basic computing granularity, which will be mapped to the ALU for execution.

Work-group:

Organizes a number of work-items that execute at the same time; a work-group is scheduled onto the same subslice, and its work-items share certain resources, enabling collaboration within the group

h.parallel_for(range{N}, [=](id<1> item) {
    device_mem[item] *= 2;
});


28 of 64

Explicitly set computing granularity

The basic parallel kernel is an easy approach to parallelize for-loops, but lacks precise control.

The ND-range kernel is an enhanced way to express parallelism:

  • More precise control over task division
  • Maps the execution of tasks onto the corresponding hardware computing units


29 of 64

ND-Range kernel

The function of the ND_Range kernel is represented by nd_range and nd_item

  • The nd_range class represents global execution range and number of work items in each work-group

h.parallel_for(nd_range<1>(N, 64), [=](nd_item<1> item){
    auto idx = item.…;
    device_mem[idx] *= 2;
});

h.parallel_for(range{N}, [=](id<1> item) {
    device_mem[item] *= 2;
});

How many work-items are in each work-group?

How many work-groups will be generated?


30 of 64

The nd_item class represents a single instance of a kernel function and allows querying the workgroup range and index.

h.parallel_for(nd_range<1>(N, 64), [=](nd_item<1> item){
    auto idx = item.___();   // which member function gives the right index?
    device_mem[idx] *= 2;
});

(Figure: the range is split into work-groups of 64 items each.)

get_group() : group id, from 0 to N/64 - 1

get_local_id() : index within a work-group, 0 to 63

get_global_linear_id() : linear index over the whole range, 0 to N - 1

Check if you understand (a runnable sketch follows).
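A runnable sketch (not from the deck; it assumes N is a multiple of 64 and a GPU queue) that prints the three kinds of indices for a few work-items, so the answers above can be checked:

#include <sycl/sycl.hpp>

int main() {
    constexpr size_t N = 256;
    sycl::queue q( sycl::gpu_selector{} );

    q.submit([&](sycl::handler &h) {
        sycl::stream out(8192, 256, h);   // device-side printing
        h.parallel_for(sycl::nd_range<1>(N, 64), [=](sycl::nd_item<1> it) {
            // only print the first work-item of each work-group
            if (it.get_local_id(0) == 0) {
                out << "group " << it.get_group(0)
                    << "  local " << it.get_local_id(0)
                    << "  global " << it.get_global_linear_id() << "\n";
            }
        });
    }).wait();
    return 0;
}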


31 of 64

Example: GEMM

Let's see how strong you are!

C[i][j] += A[i][k] * B[k][j]

for (i = 0; i < M; ++i)
    for (j = 0; j < N; ++j)
        for (k = 0; k < L; ++k)
            c[i][j] += a[i][k] * b[k][j];


32 of 64

GEMM: CPU Version

// Single Thread Computation in CPU
for(int i = 0; i < M; i++) {
    for(int j = 0; j < N; j++) {
        float sum = 0.0f;
        for(int k = 0; k < K; k++) {
            sum += cA[i * K + k] * cB[k * N + j];
        }
        cC[i * N + j] = sum;
    }
}


33 of 64

GEMM: GPU version

// define the workgroup size and mapping
auto grid_rows = (M + block_size - 1) / block_size * block_size;
auto grid_cols = (N + block_size - 1) / block_size * block_size;

auto local_ndrange  = range<2>(block_size, block_size);
auto global_ndrange = range<2>(grid_rows, grid_cols);

auto e = q.submit([&](sycl::handler &h) {
    h.parallel_for<class k_name_t>(
        sycl::nd_range<2>(global_ndrange, local_ndrange),
        [=](sycl::nd_item<2> index) {
            int row = index.get_global_id(0);
            int col = index.get_global_id(1);

            float sum = 0.0f;
            for (int i = 0; i < K; i++) {
                sum += A[row * K + i] * B[i * N + col];
            }
            C[row * N + col] = sum;
        });
});
e.wait();


34 of 64

Performance and Occupancy

How does the size of a work-group relate to performance?
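One way to explore this question is to time the same kernel with different work-group sizes using SYCL event profiling. A hedged sketch (the kernel and the candidate sizes are just placeholders):

#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    constexpr size_t N = 1 << 24;
    sycl::queue q( sycl::gpu_selector{},
                   sycl::property::queue::enable_profiling{} );
    float *data = sycl::malloc_device<float>(N, q);

    for (size_t wg : {32, 64, 128, 256}) {
        // Same trivial kernel, different work-group size each time.
        auto e = q.parallel_for(sycl::nd_range<1>(N, wg),
                                [=](sycl::nd_item<1> it) {
            data[it.get_global_linear_id()] = 1.0f;
        });
        e.wait();
        auto t0 = e.get_profiling_info<sycl::info::event_profiling::command_start>();
        auto t1 = e.get_profiling_info<sycl::info::event_profiling::command_end>();
        printf("work-group %zu: %.3f ms\n", wg, (t1 - t0) * 1e-6);
    }
    sycl::free(data, q);
    return 0;
}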


35 of 64

Memory model


36 of 64

Memory hierarchy

  • Private Memory

Per-work-item resources; each work-item's private memory is independent and not visible to any other work-item

  • (Shared) Local Memory

Visible only to the work-items of the same work-group (see the sketch below)

  • Global Memory

Visible to all work-items
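A toy sketch (not from the deck) showing how (shared) local memory is declared with a local_accessor and shared inside one work-group; here each group of 64 work-items stages a tile and then reads it back reversed:

#include <sycl/sycl.hpp>

int main() {
    constexpr size_t N = 1024, WG = 64;
    sycl::queue q( sycl::gpu_selector{} );
    int *in  = sycl::malloc_shared<int>(N, q);
    int *out = sycl::malloc_shared<int>(N, q);
    for (size_t i = 0; i < N; i++) { in[i] = (int)i; }

    q.submit([&](sycl::handler &h) {
        // One SLM buffer per work-group, shared by its 64 work-items.
        sycl::local_accessor<int, 1> tile(sycl::range<1>(WG), h);

        h.parallel_for(sycl::nd_range<1>(N, WG), [=](sycl::nd_item<1> it) {
            size_t lid = it.get_local_id(0);
            size_t gid = it.get_global_linear_id();

            tile[lid] = in[gid];                     // stage into SLM
            sycl::group_barrier(it.get_group());     // group-wide sync

            // every work-item can now read any element its group staged,
            // e.g. reverse the tile within the group
            out[gid] = tile[WG - 1 - lid];
        });
    }).wait();

    sycl::free(in, q);
    sycl::free(out, q);
    return 0;
}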


37 of 64

auto e = q.submit([&](sycl::handler &h) {
    h.parallel_for<class k_name_t>(
        sycl::nd_range<2>(global_ndrange, local_ndrange),
        [=](sycl::nd_item<2> index) {
            int row = index.get_global_id(0);
            int col = index.get_global_id(1);

            float sum = 0.0f;
            for (int i = 0; i < K; i++) {
                sum += A[row * K + i] * B[i * N + col];
            }
            C[row * N + col] = sum;
        });
});
e.wait();

Private memory

  • Created by the work-item itself;
  • Visible only to the current work-item and inaccessible to other work-items;
  • Each work-item has its own value, even though the variable name is the same.

Global memory

  • Transferred in from outside the kernel;
  • Globally visible: all work-items can access the same data, e.g. A[0,5];
  • Writing to the same address at the same time may lead to nondeterminism (a data race), which requires special attention.


38 of 64

Private memory / registers

  • Fastest, but very limited capacity per EU
  • Keep intermediate results in registers
  • Using too many registers limits the number of work-groups that can run in parallel

Global Memory

  • Slow, with long latency, but very large capacity
  • Used for data loading and storing
  • Reduce redundant reads/writes for better performance


39 of 64

Improve the performance of matrix computation (GEMM)

Data reuse: Global Memory Reuse

  • Elements of arrays A and B are reusable in rows and columns
  • Data are not reusable when a work-item computes only one output
  • It is more efficient for a work-item to compute the output of a tile

Intermediate results: register

  • Read data into register to avoid repeat access to the global memory
  • Store the accumulation result in the register of each work-item to avoid repeat read/write to the global memory
  • Finally, write out the result of the register in a single shot


40 of 64

GEMM: tile version

// Mapping
#define tileX 2
#define tileY 2

// define the workgroup size and mapping
auto grid_rows = M / tileY;
auto grid_cols = N / tileX;

auto local_ndrange  = range<2>(BLOCK, BLOCK);
auto global_ndrange = range<2>(grid_rows, grid_cols);

auto e = q.submit([&](sycl::handler &h) {
    h.parallel_for<class k_name_t>(
        sycl::nd_range<2>(global_ndrange, local_ndrange),
        [=](sycl::nd_item<2> index) {
            int row = tileY * index.get_global_id(0);
            int col = tileX * index.get_global_id(1);
            ….
        });
});


41 of 64

float sum[tileY][tileX] = {0.0f};
float subA[tileY] = {0.0f};
float subB[tileX] = {0.0f};

// core computation
for (int k = 0; k < N; k++) {

    // read data into registers
    for(int m = 0; m < tileY; m++) { subA[m] = A[(row + m) * N + k]; }
    for(int p = 0; p < tileX; p++) { subB[p] = B[k * N + p + col]; }

    // math
    for (int m = 0; m < tileY; m++) {
        for (int p = 0; p < tileX; p++) {
            sum[m][p] += subA[m] * subB[p];
        }
    }
} // end of K

// write results back
for (int m = 0; m < tileY; m++)
    for (int p = 0; p < tileX; p++)
        C[(row + m) * N + col + p] = sum[m][p];


42 of 64

Performance Optimization

Image Source : https://www.constructioninsure.co.uk/high-risk-trade-insurance-most-beneficial/


43 of 64

Case Study: Interaction Layers in DLRM

  • Inputs:
    • Dense features (128-dim)
    • Sparse features (26 one-hot encoded value)

Output:

    • Probability of Click
  • DLRM modules:
    • Bottom MLP (matmul)
    • EMB lookup (memory copy)
    • Interaction layers
    • Top MLP (matmul)


44 of 64

Interaction Module Fusion

How do the interaction layers work?

    • (Con)cat

cat_res = torch.cat(tensor_list, dim=1)
tensor_list: 27 tensors of shape [32768, 128]
cat result shape: [32768, 27, 128]

    • Bmm

bmm_res = torch.bmm(cat_res, cat_resT)   # cat_resT: cat_res with the last two dims transposed
bmm result shape: [32768, 27, 27]

    • Index

index_res = bmm_res[:, k, j]             # k, j: index lists selecting the 351 lower-triangle pairs
index result shape: [32768, 351]

    • (Con)cat

cat_res = torch.cat([tensor1, index_res], dim=1)
cat result shape: [32768, 128+351]
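As a side note on where the 351 comes from: the index step keeps one value per unordered pair of the 27 features, i.e. 27 × 26 / 2 = 351. A tiny sketch (illustration only; the k/j index lists in the real model are built by the framework):

#include <cstdio>
#include <vector>
#include <utility>

int main() {
    const int F = 27;                               // number of concatenated features
    std::vector<std::pair<int, int>> pairs;
    for (int k = 1; k < F; k++)                     // row index
        for (int j = 0; j < k; j++)                 // column index below the diagonal
            pairs.emplace_back(k, j);
    printf("pairs = %zu\n", pairs.size());          // prints 351
    return 0;
}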


45 of 64

  • Analysis: calculation for BMM

OP  = #outputs × K × 2 (FMA) = 32768 × 27 × 27 × 128 × 2 = 6115295232

MEM = read + write = 32768 × (27 × 128 × 2 + 27 × 27) = 250380288 elements

EU time (FP32) = OPs / (EUs × SIMD width × 2 Op/cycle × Freq × EU eff)
               = 6115295232 / (480 × 8 × 2 × 1400e6) × 1000 ms

DPAS time      = OPs / (EUs × Systolic Width × Systolic Depth × 2 × 2 Op/cycle × Freq × EU eff)
               = 6115295232 / (480 × 8 × 8 × 2 × 2 × 1400e6) × 1000 ms

Mem time (FP32) = MEM × data type size / (Bandwidth × BW eff) = 250380288 × 4 / 700

Do we need the systolic array (DPAS)?

Do we need FP16/BF16?

Ideally the time is Max(compute, memory); in reality it falls in [Max, Sum(compute, memory)].

ATS | Shape                                      | OP         | MEM       | OP/Elem | Data type | EU time (ms) | DPAS (ms) | Mem time (ms) under 700 GB/s | Max (ms)
BMM | 32768x27x128 * 32768x128x27 -> 32768x27x27 | 6115295232 | 250380288 | 24.42   | FP32      | 0.5688       |           | 2.044                        | 2.044
    |                                            |            |           |         | FP16      | 0.2844       | 0.0355    | 1.022                        | 1.022
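A small calculator sketch that reproduces the OP/MEM and compute-time estimates above (device parameters taken from the slide: 480 EUs, SIMD-8, 1400 MHz; the DPAS path assumes an 8-wide, 8-deep systolic array):

#include <cstdio>

int main() {
    // BMM shape from the table: 32768 batches of (27x128) * (128x27) -> (27x27)
    const double B = 32768, M = 27, N = 27, K = 128;

    double ops = B * M * N * K * 2;            // 2 ops per FMA         -> 6115295232
    double mem = B * (M * K * 2 + M * N);      // elements read+written -> 250380288

    const double eus = 480, simd = 8, freq = 1400e6;

    double eu_ms_fp32 = ops / (eus * simd * 2 * freq) * 1e3;            // ~0.5688
    double eu_ms_fp16 = eu_ms_fp32 / 2;          // FP16 MAD is 1 cycle -> ~0.2844
    double dpas_ms    = ops / (eus * simd * 8 * 2 * 2 * freq) * 1e3;    // ~0.0355

    printf("OP = %.0f  MEM = %.0f  OP/Elem = %.2f\n", ops, mem, ops / mem);
    printf("EU time: FP32 %.4f ms, FP16 %.4f ms, DPAS %.4f ms\n",
           eu_ms_fp32, eu_ms_fp16, dpas_ms);
    return 0;
}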


46 of 64

Init Design

  • Start from BMM

Because it’s the core part of the whole interaction layers

  • BMM: Batch Matrix Multiplication
    • Input 32768x27x27, FP32 data type
    • Each work-group computes 1 batch of the output
    • The padded (virtual) work-item grid is 32x32, but the data is still 27x27
    • Each work-item computes a 4x4 output sub-matrix (output tiling)
    • Loads, computes, and stores directly through global memory (a sketch of this design follows the figure below)

(Figure: one work-group computes a padded 32x32 (M = 32, N = 32) output batch; each work-item produces a 4x4 sub-matrix of the output.)
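A hedged sketch of this initial design (a fragment, not the author's actual kernel; it assumes USM pointers A, B, C for the padded FP32 batches a[batch][32][128], b[batch][128][32], c[batch][32][32] and a queue q): one work-group per batch, 8x8 work-items, each computing a 4x4 output tile straight from global memory.

constexpr int M = 32, N = 32, K = 128;      // padded per-batch shape
constexpr int TY = 4, TX = 4;               // output tile per work-item
const size_t batch = 32768;

q.submit([&](sycl::handler &h) {
    h.parallel_for(
        sycl::nd_range<3>(sycl::range<3>(batch, M / TY, N / TX),
                          sycl::range<3>(1, M / TY, N / TX)),
        [=](sycl::nd_item<3> it) {
            int b   = it.get_group(0);                    // one work-group per batch
            int row = it.get_global_id(1) * TY;
            int col = it.get_global_id(2) * TX;

            float sum[TY][TX] = {};                       // accumulators in registers
            for (int k = 0; k < K; k++)                   // global-memory loads only
                for (int m = 0; m < TY; m++)
                    for (int p = 0; p < TX; p++)
                        sum[m][p] += A[(b * M + row + m) * K + k] *
                                     B[(b * K + k) * N + col + p];

            for (int m = 0; m < TY; m++)                  // store straight back
                for (int p = 0; p < TX; p++)
                    C[(b * M + row + m) * N + col + p] = sum[m][p];
        });
}).wait();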


47 of 64

Low Precision

FP32 vs. FP16

EU Active improves from 29.0% to 45.3%

Note:

The FP16 instruction is also a MAD (fmad):

  • FP32 mad -- 2 cycles
  • FP16 mad -- 1 cycle

The CUDA instruction is different

  • FP32 fma -- 1 cycle
  • FP16 hfma -- 1 cycle
  • FP16 hfma2 -- 1 cycle but 2 compute inside

FP32

FP16

Begin Optimization


48 of 64

Mem: FP32 vs. FP16

FP32

FP16

Half memory consumption

HBM from GMEM to EU

R : 113.8GB -> 53.8GB

W: 13.8GB -> 6.5 GB

L3 read byte

426.4GB -> 272.2GB


49 of 64

Problem size: batch c(32768,32,32) = a(32768,32,128) * b(32768,128,32)

Performance Params: TILE_X = 8 TILE_Y = 8 TILE_CX = 4 TILE_CY = 4

perf %ops% %prb_bytes% %time% %gflops% %bandwidth%

perf 8589934592 8192 5.2906 1623.6264 0.0014

Number of 33554432 data are verified.

Test Passed

(base) gta@DUT3014-ATS:~/Patric/gpu-perf-show/codex/build$ ./src/sycl-bmm-bmm-fp16-v1-cpp

Problem size: batch c(32768,32,32) = a(32768,32,128) * b(32768,128,32)

Performance Params: THREAD_X = 8 THREAD_Y = 8 TILE_CX = 4 TILE_CY = 4

perf %ops% %prb_bytes% %time% %gflops% %bandwidth%

perf 8589934592 8192 3.0827 2786.5310 0.0025

Number of 33554432 data are verified.

Test Passed


50 of 64

Data Exploration

FP32

FP16

What can we learn from data?

L3 read byte

426.4GB -> 272.2GB

But the L3 read volume is much larger, 3-5X the HBM read volume

FP32: 3.7X -> 426.4GB/113.8GB

FP16: 5.0X -> 272.2GB/53.8GB

What does this mean?


51 of 64

FP16: SLM version

FP16 w/ SLM

Mem:

SLM bandwidth: 448GB/s

HBM bandwidth improved:

from 158GB/s -> 202GB/s

EU:

Active improved from 45.3% to 67.8%

Stalled reduced from 50.9% to 27.2%

FP16 w/o SLM

Why HBM BW is improved?

Why ‘EU Active’ is improved?


52 of 64

FP16, SLM version, L3

FP16 w/ SLM

FP16 w/o SLM

HBM:

Read volume doesn't change

L3:

Read volume is now similar to HBM; we no longer go back to L3 repeatedly!

SLM:

Repeated reads from SLM, as we expected


53 of 64

Problem size: batch c(32768,32,32) = a(32768,32,128) * b(32768,128,32)

Performance Params: THREAD_X = 8 THREAD_Y = 8 TILE_CX = 4 TILE_CY = 4

perf %ops% %prb_bytes% %time% %gflops% %bandwidth%

perf 8589934592 8192 3.0827 2786.5310 0.0025

Problem size: batch c(32768,32,32) = a(32768,32,128) * b(32768,128,32)

Performance Params: THREAD_X = 8 THREAD_Y = 8 TILE_CX = 4 TILE_CY = 4 TILE_K = 32

perf %ops% %prb_bytes% %time% %gflops% %bandwidth%

perf 8589934592 8192 2.3721 3621.2668 0.0032


54 of 64

Do more DLRM-specific changes now ☺

Switch the source A

  • One source from A

the memory read volume is halved

  • Fused output of index

the memory write volume is halved


55 of 64

OP    | FWK  | Benchdnn | DPCPP | DPCPP (one source)
cat0  | 1.49 |          |       |
bmm   | 0.94 | 1.82     | 2.11  | 1.42
index | 0.52 |          |       |
cat1  | 0.30 |          |       |
total | 3.25 |          | 3.75  | 3.21

Where are we going next step?

    • Continue optimizing BMM to achieve better performance
    • Fuse the other ops


56 of 64

Prefix Fusion for Concat


57 of 64

The data access pattern changes

Inputs (batch, m, k) w/ concat (cat)

Inputs (m, batch, k) w/o concat (cat)



58 of 64

  • Bandwidth decreases because of the worse locality

How to improve the GMEM bandwidth?


59 of 64

  • Tiling in the batch dimension

read data along the batch direction first, so loads are contiguous

  • Work-group ordering

reuse data from L3 and switch data less often

(Figure: work-group execution order 1, 2, 3 across columns m, m+1, m+2.)


60 of 64

  • Batch-direction data loading
    • Improves the GMEM BW
    • Improves the SLM read BW

More work-items and more SLM are used, so the SLM read BW improves


61 of 64

Fused all together

Last concat (cat1)

Fused Index


62 of 64

Half Computation

Only half of the work-groups are needed (the index step only consumes the lower triangle of the symmetric BMM output), so we change the work-group mapping

(Figure: the original mapping launches 16 work-groups, numbered 0-15, covering the full output grid; the new mapping launches only 10 work-groups covering the lower triangle, ordered 0, 1, 2, 6, 7, 8, 9, 3, 4, 5.)
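For illustration (the deck's actual ordering above also reorders groups for locality), one simple way to map a linear work-group id onto the lower triangle so that only n(n+1)/2 groups are launched instead of n*n:

#include <cmath>
#include <cstdio>

// Map a linear id in [0, n*(n+1)/2) to (row, col) with row >= col,
// enumerating the triangle row by row: (0,0), (1,0), (1,1), (2,0), ...
static void linear_to_tril(int id, int *row, int *col) {
    int r = (int)((std::sqrt(8.0 * id + 1.0) - 1.0) / 2.0);
    *row = r;
    *col = id - r * (r + 1) / 2;
}

int main() {
    const int n = 4;                                  // 4x4 tiles -> 10 work-groups
    for (int id = 0; id < n * (n + 1) / 2; id++) {
        int r, c;
        linear_to_tril(id, &r, &c);
        printf("wg %2d -> tile (%d, %d)\n", id, r, c);
    }
    return 0;
}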


63 of 64

Final Performance

Columns: (1) FWK, (2) DPCPP, (3) DPCPP (bmm + index), (4) DPCPP (bmm + index + concat), (5) DPCPP (concat + bmm + index), (6) DPCPP (concat + bmm + index + concat), (7) DPCPP: half computation (concat + bmm + index + concat), (8) speedup

OP    |  (1) |  (2) |  (3) |  (4) |  (5) |  (6) |  (7) |  (8)
cat0  | 1.49 |      |      |      | 1.68 | 1.87 | 1.39 | 2.34X
bmm   | 0.94 | 1.44 | 1.43 | 1.5  |      |      |      |
index | 0.52 |      |      |      |      |      |      |
cat1  | 0.30 |      |      |      |      |      |      |
total | 3.25 | 3.75 | 3.22 | 2.99 | 1.98 |      |      |

(In the fused columns a single kernel covers several ops, so its time appears once, in the row of the first fused op.)


64 of 64

Thank you and Questions
