SYCL Programming and Performance Optimization
Talk for SYCL MODE
Patric Zhao, Intel
Outline
Heterogeneous Computing
XPU
CPU
GPU
FPGA
Other Accelerators
Introduction
Heterogeneous Computing: a computing system and approach comprising different computing units based on different instruction sets and system architectures.
General-purpose processor (CPU) + GPU for acceleration
Execution model: independence
Execution model: dependence
SYCL Programming Language
What and Why SYCL?
It is an extension to the C++ language, providing:
Codeplay oneAPI Plug-ins for NVIDIA and AMD
Adds support for NVIDIA and AMD GPUs to the Intel® oneAPI Base Toolkit
Codeplay press release
Image courtesy of Codeplay Software Ltd.
Intel® oneAPI for NVIDIA and AMD GPUs: Priority Support
Other names and brands may be claimed as the property of others.
SYCL is a trademark of the Khronos Group Inc.
How does SYCL work?
It is an extension to the C++ language, providing:
Devices
The device class provides the mechanism for selecting and querying devices.
It contains member functions to query device information.
It supports different kinds of hardware: CPU/GPU/FPGA/…
Selectors: default_selector, cpu_selector, gpu_selector...
Code Sample
#include <sycl/sycl.hpp>
#include <iostream>
using namespace sycl;

int main() {
    queue my_gpu_queue( gpu_selector{} );
    std::cout << "Selected GPU device: " <<
        my_gpu_queue.get_device().get_info<info::device::name>() << "\n";
    return 0;
}
patric@gpu:~$ icpx -fsycl gpu_selector.cpp
patric@gpu:~$ ./a.out
Selected GPU device: Intel(R) Iris(R) Xe MAX Graphics [0x4905]
Queue
// A queue connects the host program to a single device and is used to submit work.
int main() {
    queue my_gpu_queue( gpu_selector{} );
    std::cout << "Selected GPU device: " <<
        my_gpu_queue.get_device().get_info<info::device::name>() << "\n";
    return 0;
}
Data Movement
- Explicit data transfer: copy CPU -> GPU, compute, copy GPU -> CPU
- Implicit data transfer: data is automatically migrated by the system when accessed (see the sketch below)
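As a contrast to the explicit path shown in the samples that follow, here is a minimal sketch of the implicit path using a USM shared allocation; the queue name q and the size N are placeholders, not from the original slides:

// Sketch: implicit data transfer with USM shared memory (assumes a sycl::queue q).
constexpr int N = 10;
int *data = sycl::malloc_shared<int>(N, q);   // visible to both host and device
for (int i = 0; i < N; i++) data[i] = i;      // host writes directly
q.parallel_for(sycl::range{N}, [=](sycl::id<1> i) {
    data[i] *= 2;                             // device access; migration is automatic
}).wait();
// The host can read data[] here without any explicit memcpy.
sycl::free(data, q);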
Allocate memory block
Allocates a block of size bytes of memory, returning a pointer to the beginning of the block.
void* malloc_host(size_t size, const sycl::queue& q);
void* malloc_device(size_t size, const sycl::queue& q);
void* malloc_shared(size_t size, const sycl::queue& q);
Copy block of memory
void* memcpy( void* destination, const void* source, size_t num );
Copies the values of num bytes from the location pointed to by source directly to the memory block pointed to by destination.
Code Sample
constexpr int N = 10;
int *host_mem   = malloc_host<int>(N, my_gpu_queue);
int *device_mem = malloc_device<int>(N, my_gpu_queue);

// Init CPU data
for(int i = 0; i < N; i++) { host_mem[i] = i; }

// Copy from host(CPU) to device(GPU)
my_gpu_queue.memcpy(device_mem, host_mem, N * sizeof(int)).wait();

// ... do some work on the GPU ...

// Copy back from GPU to CPU
my_gpu_queue.memcpy(host_mem, device_mem, N * sizeof(int)).wait();

printf("\nData Result\n");
for(int i = 0; i < N; i++) { printf("%d, ", host_mem[i]); }
Compilation:
patric@gpu:~/course$ icpx -fsycl data_movement_ex.cpp -o data_move
Run & output:
patric@gpu:~/course$ ./data_move
Selected GPU device: Intel(R) Iris(R) Xe MAX Graphics [0x4905]
Data Result
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
Task Done!
Kernel
The kernel is the most computationally intensive and time-consuming part of a program, usually found within loops.

for(int i = 0; i < 1024; i++){
    a[i] = b[i] + c[i];
}
Serial execution
launch N kernel instances {
    int id = get_instance_id();
    a[id] = b[id] + c[id];
}
Parallel execution
parallel_for
To use parallel_for, each iteration of the loop must be completely independent.
h.parallel_for(range<1>(1024), [=](id<1> i){
    A[i] = B[i] + C[i];
});

for(int i = 0; i < 1024; i++){
    a[i] = b[i] + c[i];
}
Offload to accelerator using parallel_for
Serial execution
Basic Parallel Kernels
Basic parallel kernels are expressed through the range, id, and item classes.
h.parallel_for(range<1>(1024), [=](item<1> item){
    auto idx = item.get_id();
    auto R = item.get_range();
    // CODE THAT RUNS ON DEVICE
});
Code Sample
// Copy from host(CPU) to device(GPU)
my_gpu_queue.memcpy(device_mem, host_mem, N * sizeof(int)).wait();

// Submit work to the queue for execution on the GPU
my_gpu_queue.submit([&](handler& h) {
    // Parallel computation
    h.parallel_for(range{N}, [=](id<1> item) {
        device_mem[item] *= 2;
    });
}); // end submit

// Wait for the computation to finish
my_gpu_queue.wait();

// Copy back from GPU to CPU
my_gpu_queue.memcpy(host_mem, device_mem, N * sizeof(int)).wait();
Run & output:
patric@gpu:~/course$ ./basic_parafor
Selected GPU device: Intel(R) Iris(R) Xe MAX Graphics [0x4905]
Data Result
0, 2, 4, 6, 8, 10, 12, 14, 16, 18,
Task Done!
Execution model
Image Source: http://alana.io/downloads/rails-3/
Further Thoughts
Intel® Xe MAX GPU architecture
Intel SubSlice == NVIDIA SM
Intel Slice == NVIDIA GPC
DPC++ Task Granularity
Work-item:
The most basic computing granularity; work-items are mapped to the ALUs for execution.
Work-group:
Organizes a number of work-items that execute together; a work-group is scheduled to the same subslice, with its work-items sharing certain resources, enabling collaboration within the group.
h.parallel_for(range{N}, [=](id<1> item) {
    device_mem[item] *= 2;
});
Explicitly setting the computing granularity
The basic parallel kernel is an easy way to parallelize for-loops, but it lacks precise control.
The ND-range kernel is an enhanced parallel representation that provides this control.
ND-Range kernel
The ND-range kernel is expressed through the nd_range and nd_item classes.
h.parallel_for(nd_range<1>(N, 64), [=](nd_item<1> item){
    auto idx = item.get_global_linear_id();
    device_mem[idx] *= 2;
});

h.parallel_for(range{N}, [=](id<1> item) {
    device_mem[item] *= 2;
});

How many work-items are in each work-group? How many work-groups will be generated?
(Here, nd_range<1>(N, 64) gives 64 work-items per work-group and N/64 work-groups.)
The nd_item class represents a single instance of a kernel function and allows querying the workgroup range and index.
h.parallel_for(nd_range<1>(N, 64), [=](nd_item<1> item){
    auto idx = item.get_global_linear_id();
    device_mem[idx] *= 2;
});

The global range is divided into work-groups of 64 items each:
get_group() : work-group id, from 0 to N/64 - 1
get_local_id() : index within the work-group, from 0 to 63
get_global_linear_id() : global index, from 0 to N - 1
Check if you understand; a worked sketch follows.
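Putting the three queries together, a minimal sketch (assuming the queue my_gpu_queue and the device_mem pointer from the earlier samples, and N divisible by 64):

my_gpu_queue.submit([&](handler& h) {
    h.parallel_for(nd_range<1>(N, 64), [=](nd_item<1> item) {
        auto group  = item.get_group(0);            // work-group id: 0 .. N/64 - 1
        auto local  = item.get_local_id(0);         // index within the group: 0 .. 63
        auto global = item.get_global_linear_id();  // flat index: 0 .. N - 1
        device_mem[global] = group * 64 + local;    // equals global by construction
    });
}).wait();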
Example: GEMM
Let's see how strong you are!
C[i][j] += A[i][k] * B[k][j]

for (i = 0; i < M; ++i)
    for (j = 0; j < N; ++j)
        for (k = 0; k < K; ++k)
            c[i][j] += a[i][k] * b[k][j];
GEMM: CPU Version
// Single-thread computation on the CPU
for(int i = 0; i < M; i++) {
    for(int j = 0; j < N; j++) {
        float sum = 0.0f;
        for(int k = 0; k < K; k++) {
            sum += cA[i * K + k] * cB[k * N + j];
        }
        cC[i * N + j] = sum;
    }
}
GEMM: GPU version
// Define the work-group size and mapping
auto grid_rows = (M + block_size - 1) / block_size * block_size;
auto grid_cols = (N + block_size - 1) / block_size * block_size;
auto local_ndrange  = range<2>(block_size, block_size);
auto global_ndrange = range<2>(grid_rows, grid_cols);

auto e = q.submit([&](sycl::handler &h) {
    h.parallel_for<class k_name_t>(
        sycl::nd_range<2>(global_ndrange, local_ndrange),
        [=](sycl::nd_item<2> index) {
            int row = index.get_global_id(0);
            int col = index.get_global_id(1);
            float sum = 0.0f;
            for (int i = 0; i < K; i++) {
                sum += A[row * K + i] * B[i * N + col];
            }
            C[row * N + col] = sum;
        });
});
e.wait();
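One caveat worth noting: grid_rows and grid_cols round M and N up to multiples of block_size, so when they are not exact multiples, out-of-range work-items must be guarded. A minimal sketch of the guard, reusing the names from the kernel above:

// Sketch: bounds check for the kernel body when M or N is not a multiple of block_size.
if (row < M && col < N) {
    float sum = 0.0f;
    for (int i = 0; i < K; i++)
        sum += A[row * K + i] * B[i * N + col];
    C[row * N + col] = sum;
}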
Performance and Occupancy
How does the size of a work-group relate to performance?
Memory model
Image Source: https://www.gocivilairpatrol.com/media/cms/Leadership_Concept__Slides_EBCF53691D635.pdf
Memory hierarchy
Private memory: per-work-item resources, visible only to the owning work-item
Local memory: visible to work-group members only (see the sketch below)
Global memory: visible to all work-items
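Local memory is the one level not yet shown in code; here is a minimal sketch of declaring and using it through a SYCL 2020 local_accessor. The queue q, the float pointer data, and the size N are assumptions carried over from earlier samples, not from the original slides:

// Sketch: staging data in work-group local memory (SLM).
q.submit([&](handler &h) {
    local_accessor<float, 1> tile(range<1>(64), h);   // visible to one work-group only
    h.parallel_for(nd_range<1>(N, 64), [=](nd_item<1> it) {
        tile[it.get_local_id(0)] = data[it.get_global_linear_id()]; // global -> local
        group_barrier(it.get_group());                // synchronize the work-group
        // ... all 64 work-items can now reuse tile[] at SLM speed ...
    });
}).wait();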
auto e = q.submit([&](sycl::handler &h) {
    h.parallel_for<class k_name_t>(
        sycl::nd_range<2>(global_ndrange, local_ndrange),
        [=](sycl::nd_item<2> index) {
            int row = index.get_global_id(0);
            int col = index.get_global_id(1);
            float sum = 0.0f;                    // private memory (per work-item)
            for (int i = 0; i < K; i++) {
                sum += A[row * K + i] * B[i * N + col];   // A, B: global memory
            }
            C[row * N + col] = sum;              // C: global memory
        });
});
e.wait();
(Diagram: private memory/registers vs. global memory.)
Improve the performance of matrix computation (GEMM)
Data reuse: reuse data loaded from global memory.
Intermediate results: keep them in registers.
GEMM: tile version
// Mapping
#define tileX 2
#define tileY 2

// Define the work-group size and mapping
auto grid_rows = M / tileY;
auto grid_cols = N / tileX;
auto local_ndrange  = range<2>(BLOCK, BLOCK);
auto global_ndrange = range<2>(grid_rows, grid_cols);

auto e = q.submit([&](sycl::handler &h) {
    h.parallel_for<class k_name_t>(
        sycl::nd_range<2>(global_ndrange, local_ndrange),
        [=](sycl::nd_item<2> index) {
            int row = tileY * index.get_global_id(0);
            int col = tileX * index.get_global_id(1);
            ...
        }
float sum[tileY][tileX] = {0.0f};
float subA[tileY] = {0.0f};
float subB[tileX] = {0.0f};

// Core computation
for (int k = 0; k < N; k++) {
    // Read data into registers
    for (int m = 0; m < tileY; m++) { subA[m] = A[(row + m) * N + k]; }
    for (int p = 0; p < tileX; p++) { subB[p] = B[k * N + p + col]; }
    // Math
    for (int m = 0; m < tileY; m++) {
        for (int p = 0; p < tileX; p++) {
            sum[m][p] += subA[m] * subB[p];
        }
    }
} // end of K

// Write results back
for (int m = 0; m < tileY; m++)
    for (int p = 0; p < tileX; p++)
        C[(row + m) * N + col + p] = sum[m][p];
Performance Optimization
Image Source: https://www.constructioninsure.co.uk/high-risk-trade-insurance-most-beneficial/
Case Study: Interaction Layers in DLRM
Interaction Module Fusion
How do the interaction layers work?
cat_res = torch.cat(tensor_list, dim=1)                 # tensor_list: 27 tensors of shape [32768, 128]
                                                        # cat result shape: [32768, 27, 128]
bmm_res = torch.bmm(cat_res, cat_res.transpose(1, 2))   # bmm result shape: [32768, 27, 27]
index_res = bmm_res[:, k, j]                            # index result shape: [32768, 351]
cat_res = torch.cat((tensor1, index_res), dim=1)        # cat result shape: [32768, 128 + 351]
OP  = outputs * K * 2 (FMA) = 32768 * 27 * 27 * 128 * 2 = 6115295232
MEM = read + write = 32768 * (27*128*2 + 27*27) = 250380288 elements

EU time (FP32) = OPs / (EUs * SIMD width * 2 ops/cycle * Freq * EU eff)
               = 6115295232 / (480 * 8 * 2 * 1400e6) * 1000 ≈ 0.5688 ms
DPAS time      = OPs / (EUs * systolic width * systolic depth * 2 * 2 ops/cycle * Freq * EU eff)
               = 6115295232 / (480 * 8 * 8 * 2 * 2 * 1400e6) * 1000 ≈ 0.0355 ms
Mem time (FP32) = MEM * data type size / (Bandwidth * BW eff)
               = 250380288 * 4 / (700e9 * 0.7) * 1000 ≈ 2.044 ms (BW eff ≈ 70%, matching the table below)

Do we need the systolic array (DPAS)?
Do we need FP16/BF16?
Ideally, total time = max(compute, memory); in reality it falls between max(compute, memory) and compute + memory.
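As a cross-check of the arithmetic above, a small host-side sketch that reproduces the slide's estimates (the EU count, frequency, and ~70% bandwidth efficiency are the slide's assumptions):

#include <cstdio>
#include <algorithm>

int main() {
    const double ops = 32768.0 * 27 * 27 * 128 * 2;          // 6115295232 FMA ops
    const double mem = 32768.0 * (27 * 128 * 2 + 27 * 27);   // 250380288 elements
    const double eu_ms   = ops / (480 * 8 * 2 * 1400e6) * 1000;          // ~0.5688 ms
    const double dpas_ms = ops / (480 * 8 * 8 * 2 * 2 * 1400e6) * 1000;  // ~0.0355 ms
    const double mem_ms  = mem * 4 / (700e9 * 0.7) * 1000;               // ~2.044 ms (FP32)
    printf("EU %.4f ms, DPAS %.4f ms, Mem %.4f ms, lower bound %.4f ms\n",
           eu_ms, dpas_ms, mem_ms, std::max(eu_ms, mem_ms));
    return 0;
}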
ATS | Shape | OP | MEM | OP/Elem | data type | EU Time (ms) | DPAS (ms) | Mem time (ms, 700 GB/s) | Max (ms)
BMM | 32768x27x128 * 32768x128x27 -> 32768x27x27 | 6115295232 | 250380288 | 24.42 | FP32 | 0.5688 | | 2.044 | 2.044
BMM | (same) | | | | FP16 | 0.2844 | 0.0355 | 1.022 | 1.022
Initial Design
We start from the BMM because it is the core of the whole interaction layer.
(Diagram: the M = 32 x N = 32 output tile, computed in 4 x 4 blocks.)
Low Precision
FP32 vs. FP16
EU Active improved from 29.0% to 45.3%.
Note:
The FP16 instruction is also a multiply-add (mad/fmad).
The corresponding CUDA instruction is different.
Begin Optimization
Mem: FP32 vs. FP16
Half the memory consumption.
HBM traffic (GMEM to EU):
Read: 113.8 GB -> 53.8 GB
Write: 13.8 GB -> 6.5 GB
L3 read bytes:
426.4 GB -> 272.2 GB
Problem size: batch c(32768,32,32) = a(32768,32,128) * b(32768,128,32)
Performance Params: TILE_X = 8 TILE_Y = 8 TILE_CX = 4 TILE_CY = 4
perf %ops% %prb_bytes% %time% %gflops% %bandwidth%
perf 8589934592 8192 5.2906 1623.6264 0.0014
Number of 33554432 data are verified.
Test Passed
(base) gta@DUT3014-ATS:~/Patric/gpu-perf-show/codex/build$ ./src/sycl-bmm-bmm-fp16-v1-cpp
Problem size: batch c(32768,32,32) = a(32768,32,128) * b(32768,128,32)
Performance Params: THREAD_X = 8 THREAD_Y = 8 TILE_CX = 4 TILE_CY = 4
perf %ops% %prb_bytes% %time% %gflops% %bandwidth%
perf 8589934592 8192 3.0827 2786.5310 0.0025
Number of 33554432 data are verified.
Test Passed
Data Exploration
What can we learn from the data?
L3 read bytes: 426.4 GB -> 272.2 GB
But the L3 read volume is much larger than HBM, 3-5X:
FP32: 3.7X (426.4 GB / 113.8 GB)
FP16: 5.0X (272.2 GB / 53.8 GB)
What does this mean? The same data is re-read repeatedly and served from L3, so that reuse is a candidate for SLM.
FP16: SLM version
FP16 with SLM vs. without SLM:
Mem:
SLM bandwidth: 448 GB/s
HBM bandwidth improved: from 158 GB/s -> 202 GB/s
EU:
Active improved from 45.3% to 67.8%
Stalled reduced from 50.9% to 27.2%
Why is the HBM bandwidth improved? Why is 'EU Active' improved?
FP16, SLM version, L3
Comparing FP16 with SLM vs. without SLM:
HBM: the read count doesn't change.
L3: the read count is now similar to HBM; we don't go back to L3 a lot anymore!
SLM: the repeated reads now come from SLM, as we expected.
Problem size: batch c(32768,32,32) = a(32768,32,128) * b(32768,128,32)
Performance Params: THREAD_X = 8 THREAD_Y = 8 TILE_CX = 4 TILE_CY = 4
perf %ops% %prb_bytes% %time% %gflops% %bandwidth%
perf 8589934592 8192 3.0827 2786.5310 0.0025
Problem size: batch c(32768,32,32) = a(32768,32,128) * b(32768,128,32)
Performance Params: THREAD_X = 8 THREAD_Y = 8 TILE_CX = 4 TILE_CY = 4 TILE_K = 32
perf %ops% %prb_bytes% %time% %gflops% %bandwidth%
perf 8589934592 8192 2.3721 3621.2668 0.0032
Now let's make more DLRM-specific changes ☺
Switch the source A:
the memory read count is halved
the memory write count is halved
OP    | FWK  | Benchdnn | DPCPP | DPCPP (one Source)
cat0  | 1.49 |          |       |
bmm   | 0.94 | 1.82     | 2.11  | 1.42
index | 0.52 |          |       |
cat1  | 0.30 |          |       |
total | 3.25 |          | 3.75  | 3.21

Where do we go next?
Prefix Fusion for Concat
The data access pattern is changed:
with concat (cat): inputs laid out as (batch, m, k)
without concat (cat): inputs laid out as (m, batch, k)
How to improve the GMEM bandwidth?
Read data in the batch direction first: contiguous loading.
Reuse the data from L3, with less data switching.
(Diagram: the access pattern steps through rows m, m+1, m+2.)
More work-items and more SLM are used, so the SLM read bandwidth improves.
Fuse it all together:
the last concat (cat1)
the fused index
Half Computation
Only half the work-groups are needed, since the BMM output C = A x A^T is symmetric, and we changed the work-group mapping (a mapping sketch follows the diagram below).
(Diagram: a 4 x 4 grid of 16 work-groups, ids 0-15, is remapped so that only 10 work-groups, ids 0-9, cover the lower triangle.)
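One standard way to implement such a remapping is to convert the linear work-group id into lower-triangle coordinates. A sketch under the assumption that this mirrors the diagram's scheme; the variable names are illustrative, not from the original kernel:

// Sketch: map a linear work-group id t to cell (i, j) of the lower triangle.
int t = wg_linear_id;                       // 0 .. n*(n+1)/2 - 1
int i = (int)((sycl::sqrt(8.0f * t + 1.0f) - 1.0f) / 2.0f);   // row
int j = t - i * (i + 1) / 2;                // column, j <= i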
Final Performance
OP    | FWK  | DPCPP | DPCPP (bmm + index) | DPCPP (bmm + index + concat) | DPCPP (concat + bmm + index) | DPCPP (concat + bmm + index + concat) | DPCPP: half computation (concat + bmm + index + concat)
cat0  | 1.49 | 1.49  | 1.49    | 1.49    | (fused) | (fused) | (fused)
bmm   | 0.94 | 1.44  | 1.43    | 1.50    | 1.68    | 1.87    | 1.39
index | 0.52 | 0.52  | (fused) | (fused) | (fused) | (fused) | (fused)
cat1  | 0.30 | 0.30  | 0.30    | (fused) | 0.30    | (fused) | (fused)
total | 3.25 | 3.75  | 3.22    | 2.99    | 1.98    | 1.87    | 1.39
speedup: 3.25 / 1.39 = 2.34X
Thank You and Questions