Optimizing with OpenCL on Intel Xeon Phi. CGO 2013 Tutorial, February 24th, 2013. Arik Narkis, OpenCL SW Performance Architect, Intel. Ayal Zaks, Manager, OpenCL Xeon Phi Compiler team, Intel.
The tutorial deals with highly efficient programming of state-of-the-art many-core architectures using OpenCL, focusing on the recently announced Intel® Xeon Phi coprocessor, which combines many CPU cores on a single chip. The Intel Xeon Phi coprocessor is targeted at highly parallel, High Performance Computing (HPC) workloads in a variety of fields such as computational physics, chemistry, biology, and financial services. Along with the Intel Xeon Phi coprocessor itself, Intel also provides a C/C++ compiler that supports both an "offload" mode and direct execution on the coprocessor. Core parallelism is achieved through the ubiquitous OpenMP standard, while vectorization is achieved through auto-vectorization, SIMD pragmas, and similar compiler extensions. With this approach, customers can move their Xeon code to Xeon Phi with little effort.
For cross-device (GPU) portability, Intel added Xeon Phi support to its OpenCL SDK (http://software.intel.com/en-us/vcsource/tools/opencl-sdk-xe). While OpenCL is a portable programming model, performance portability is not guaranteed: traditional GPUs and the Intel Xeon Phi coprocessor have different hardware designs, so they benefit differently from particular application optimizations. For example, traditional GPUs rely on fast shared local memory, which the programmer must manage explicitly. In contrast, the Intel Xeon Phi coprocessor includes a fully coherent cache hierarchy, similar to regular CPU caches, which speeds up memory accesses automatically. Another example: while some traditional GPUs are based on hardware scheduling of many tiny threads, the Intel Xeon Phi coprocessor runs Linux on the device and relies on the OS to schedule medium-sized threads across its 60 cores, with 4 hardware threads per core. These and other differences mean that applications usually benefit from tuning to the hardware they are intended to run on.
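To illustrate this difference, a GPU-tuned OpenCL kernel often stages data in explicitly managed __local memory, while on Xeon Phi the same work can simply read global memory and let the coherent caches do the staging. A hypothetical sketch (kernel names and the trivial computation are illustrative, not taken from the tutorial):

```c
// GPU-style: the programmer explicitly stages data in __local memory.
__kernel void scale_gpu_style(__global const float *in,
                              __global float *out,
                              __local float *tile) {
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);
    tile[lid] = in[gid];              // explicit copy into fast local memory
    barrier(CLK_LOCAL_MEM_FENCE);     // synchronize the work-group
    out[gid] = tile[lid] * 2.0f;
}

// Xeon Phi-style: read global memory directly; the coherent cache
// hierarchy keeps recently used lines close to the core automatically.
__kernel void scale_phi_style(__global const float *in,
                              __global float *out) {
    size_t gid = get_global_id(0);
    out[gid] = in[gid] * 2.0f;
}
```

On Xeon Phi the explicit __local staging adds copies and a barrier without a corresponding fast dedicated memory, so the simpler second form is often preferable there.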
In this tutorial we will introduce the Intel Xeon Phi hardware architecture and its associated performance features. We will then briefly describe the main OpenCL constructs and map them onto the 60 cores and 240 hardware threads of the state-of-the-art Intel Xeon Phi chip, and explain how to expose a workload's parallelism to a Many Integrated Core architecture through OpenCL. The Intel Xeon Phi OpenCL compiler exploits the data-parallel nature of OpenCL kernels to vectorize them implicitly. However, some programming patterns vectorize more efficiently than others; for example, non-uniform control flow is vectorized at a cost, and in some cases this cost can be avoided. Data layout and memory access patterns also have a critical impact on application performance, and we will contrast performing with non-performing data layouts and access patterns. Finally, tools help locate and understand performance issues; we will review the available tools and walk through optimization use cases.
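As an example of the control-flow cost, a per-work-item branch forces the implicit vectorizer to execute both arms under masks; when the arms are simple expressions, rewriting the branch with OpenCL's select() built-in maps directly to a SIMD blend. A hypothetical sketch (kernel names and the clamping computation are illustrative):

```c
// Divergent form: the vectorizer must mask and merge both branch arms.
__kernel void clamp_branchy(__global float *a) {
    size_t i = get_global_id(0);
    if (a[i] < 0.0f)
        a[i] = 0.0f;
    else
        a[i] = a[i] * 2.0f;
}

// Branchless form: select(x, y, cond) yields y where cond holds,
// which lowers to a single vector blend instead of masked control flow.
__kernel void clamp_select(__global float *a) {
    size_t i = get_global_id(0);
    a[i] = select(a[i] * 2.0f, 0.0f, a[i] < 0.0f);
}
```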
Tutorial outline (3.5 hours total duration):