1 of 78

GPUs in MLIR

MLIR Open Design Meeting

December 12, 2019

Stephan Herhut, Alex Zinenko

2 of 78

Outline

  • Modeling GPU execution as dialect(s): scope and goals.
  • Kernel-side dialect
    • Host/device separation
    • Operation selection
    • Execution model
    • Memory attribution
  • Host-side dialect

3 of 78

Modeling GPU execution as dialect(s)

4 of 78

5 of 78

6 of 78

7 of 78

Design goals

  • Model computation suitable for execution on GPUs.
  • Common kernel-level abstraction layer above NVVM, ROCm, SPIR-V.
  • Target of choice for generating GPU code in MLIR.
  • Single IR to enable optimization between host and device.
  • (Inherited:) composes well with other dialects.

8 of 78

Design non-goals

  • Not a generic SIMT model. GPU-specific features are exposed: memory hierarchy, grid/block structure, etc.
  • Not a raising target. Not expected to raise from NVVM, ROCm, or SPIR-V to this dialect.

9 of 78

Scope

  • Kernel launch and grid specification.
  • Execution model (threads, synchronization).
  • Memory hierarchy.
  • Computation primitives.

10 of 78

Kernel-side dialect

11 of 78

Kernel launch and grid specification

// ...

gpu.launch

Kernel launches from host code

12 of 78

Kernel launch and grid specification

// ...

%c1 = constant 1 : index

%c32 = constant 32 : index

gpu.launch

blocks(%bx, %by, %bz) in (%nbx = %c1, %nby = %c1, %nbz = %c1)

threads(%tx, %ty, %tz) in (%ntx = %c32, %nty = %c1, %ntz = %c1)

Grid specification

13 of 78

Kernel launch and grid specification

Additional arguments

// ...

%c1 = constant 1 : index

%c32 = constant 32 : index

gpu.launch

blocks(%bx, %by, %bz) in (%nbx = %c1, %nby = %c1, %nbz = %c1)

threads(%tx, %ty, %tz) in (%ntx = %c32, %nty = %c1, %ntz = %c1)

args(%arg0 = %0, %arg1 = %1) : f32, memref<?xf32> {

14 of 78

Kernel launch and grid specification

Work item ID available inside the region

// ...

%c1 = constant 1 : index

%c32 = constant 32 : index

gpu.launch

blocks(%bx, %by, %bz) in (%nbx = %c1, %nby = %c1, %nbz = %c1)

threads(%tx, %ty, %tz) in (%ntx = %c32, %nty = %c1, %ntz = %c1)

args(%arg0 = %0, %arg1 = %1) : f32, memref<?xf32> {

// any dialect here

%2 = index_cast %tx : index to i64

%3 = add %2, %2 : i64

// ...

gpu.return

}

// ...

15 of 78

Kernel launch and grid specification

Any dialect can be used inside (but not another launch*)

*not yet modeling dynamic parallelism

// ...

%c1 = constant 1 : index

%c32 = constant 32 : index

gpu.launch

blocks(%bx, %by, %bz) in (%nbx = %c1, %nby = %c1, %nbz = %c1)

threads(%tx, %ty, %tz) in (%ntx = %c32, %nty = %c1, %ntz = %c1)

args(%arg0 = %0, %arg1 = %1) : f32, memref<?xf32> {

// any dialect here

%2 = index_cast %tx : index to i64

%3 = add %2, %2 : i64

// ...

gpu.return

}

// ...

16 of 78

Kernel launch and grid specification

Any dialect can be used inside (but not another launch*)

*not yet modeling dynamic parallelism

Grid specification

Additional arguments

Work item ID available as region argument

// ...

%c1 = constant 1 : index

%c32 = constant 32 : index

gpu.launch

blocks(%bx, %by, %bz) in (%nbx = %c1, %nby = %c1, %nbz = %c1)

threads(%tx, %ty, %tz) in (%ntx = %c32, %nty = %c1, %ntz = %c1)

args(%arg0 = %0, %arg1 = %1) : f32, memref<?xf32> {

// any dialect here

%2 = index_cast %tx : index to i64

%3 = add %2, %2 : i64

// ...

gpu.return

}

// ...

Kernel launches from host code

17 of 78

“Inline” form simplifies host/device code motion

// ...

%c0 = constant 0 : index

%c1 = constant 1 : index

%c32 = constant 32 : index

gpu.launch

blocks(%bx, %by, %bz) in (%nbx = %c1, %nby = %c1, %nbz = %c1)

threads(%tx, %ty, %tz) in (%ntx = %c32, %nty = %c1, %ntz = %c1)

args(%arg0 = %c0, %arg1 = %0) : index, memref<?xi64> {

%1 = index_cast %arg0 : index to i64

%c0_1 = constant 0 : index

store %1, %arg1[%c0_1] : memref<?xi64>

gpu.return

}

// ...

18 of 78

“Inline” form simplifies host/device code motion

// ...

%c0 = constant 0 : index

%c1 = constant 1 : index

%c32 = constant 32 : index

gpu.launch

blocks(%bx, %by, %bz) in (%nbx = %c1, %nby = %c1, %nbz = %c1)

threads(%tx, %ty, %tz) in (%ntx = %c32, %nty = %c1, %ntz = %c1)

args(%arg0 = %c0, %arg1 = %0) : index, memref<?xi64> {

%1 = index_cast %arg0 : index to i64

%c0_1 = constant 0 : index

store %1, %arg1[%c0_1] : memref<?xi64>

gpu.return

}

// no more uses of %c0 here

19 of 78

“Inline” form simplifies host/device code motion

// ...

%c1 = constant 1 : index

%c32 = constant 32 : index

gpu.launch

blocks(%bx, %by, %bz) in (%nbx = %c1, %nby = %c1, %nbz = %c1)

threads(%tx, %ty, %tz) in (%ntx = %c32, %nty = %c1, %ntz = %c1)

args(%arg1 = %0) : memref<?xi64> {

%c0 = constant 0 : index

%1 = index_cast %c0 : index to i64

%c0_1 = constant 0 : index

store %1, %arg1[%c0_1] : memref<?xi64>

gpu.return

}

// no more uses of %c0 here

E.g., constant propagation can operate across host and device.

20 of 78

“Inline” form simplifies host/device code motion

// ...

%c1 = constant 1 : index

%c32 = constant 32 : index

gpu.launch

blocks(%bx, %by, %bz) in (%nbx = %c1, %nby = %c1, %nbz = %c1)

threads(%tx, %ty, %tz) in (%ntx = %c32, %nty = %c1, %ntz = %c1)

args(%arg1 = %0) : memref<?xi64> {

%c0 = constant 0 : index

%1 = index_cast %c0 : index to i64

store %1, %arg1[%c0] : memref<?xi64>

gpu.return

}

// no more uses of %c0 here

E.g., constant propagation can operate across host and device.

Which in turn enables other transformations.

21 of 78

Outlined form

// ...

%c1 = constant 1 : index

%c32 = constant 32 : index

"gpu.launch_func"(%c1, %c1, %c1,

%c32, %c1, %c1,

%0)

{ kernel_module = "mod",

kernel = "func" }

: (index, index, index, index, index, index, memref<?xi64>) -> ()

// ...

module @mod attributes {gpu.kernel_module} {

gpu.func @func(%arg0: memref<?xi64>) kernel {

%c0 = constant 0 : index

%0 = index_cast %c0 : index to i64

store %0, %arg0[%c0] : memref<?xi64>

gpu.return

}

}

The kernel body can get outlined into

a new kernel function inside a separate module.

22 of 78

Outlined form

// ...

%c1 = constant 1 : index

%c32 = constant 32 : index

"gpu.launch_func"(%c1, %c1, %c1,

%c32, %c1, %c1,

%0)

{ kernel_module = "mod",

kernel = "func" }

: (index, index, index, index, index, index, memref<?xi64>) -> ()

// ...

module @mod attributes {gpu.kernel_module} {

gpu.func @func(%arg0: memref<?xi64>) kernel {

%c0 = constant 0 : index

%0 = index_cast %c0 : index to i64

store %0, %arg0[%c0] : memref<?xi64>

gpu.return

}

}

Since modules are just Ops, they can be nested.

module {

}

23 of 78

Outlined form

// ...

%c1 = constant 1 : index

%c32 = constant 32 : index

"gpu.launch_func"(%c1, %c1, %c1,

%c32, %c1, %c1,

%0)

{ kernel_module = "mod",

kernel = "func" }

: (index, index, index, index, index, index, memref<?xi64>) -> ()

// ...

module @mod attributes {gpu.kernel_module} {

gpu.func @func(%arg0: memref<?xi64>) kernel {

%c0 = constant 0 : index

%0 = index_cast %c0 : index to i64

store %0, %arg0[%c0] : memref<?xi64>

gpu.return

}

}

Since modules are just Ops, they can be nested.

module {

}

Full-fledged module:

  • run module passes on it;
  • refer/reuse external declarations;
  • separate compilation!

24 of 78

Outlined form

// ...

%c1 = constant 1 : index

%c32 = constant 32 : index

"gpu.launch_func"(%c1, %c1, %c1,

%c32, %c1, %c1,

%0)

{ kernel_module = "mod",

kernel = "func" }

: (index, index, index, index, index, index, memref<?xi64>) -> ()

// ...

module @mod attributes {gpu.kernel_module} {

gpu.func @func(%arg0: memref<?xi64>) kernel {

%c0 = constant 0 : index

%0 = index_cast %c0 : index to i64

store %0, %arg0[%c0] : memref<?xi64>

gpu.return

}

}

module {

}

module @mod {

llvm.func @func(%arg0: !llvm<"{i64*, i64*, i64, [1 x i64], [1 x i64]}">)

attributes { nvvm.kernel } {

%0 = llvm.mlir.constant(0 : index) : !llvm.i64

%1 = llvm.mlir.constant(1 : index) : !llvm.i64

%2 = llvm.extractvalue %arg0[1 : index]

: !llvm<"{i64*, i64*, i64, [1 x i64], [1 x i64]}">

%3 = llvm.getelementptr %2[%0] : (!llvm<"i64*">, !llvm.i64) -> !llvm<"i64*">

llvm.store %0, %3 : !llvm<"i64*">

// ...

nvvm.something // ...

// ...

}

}

Lower to NVVM

25 of 78

Outlined form

// ...

%c1 = constant 1 : index

%c32 = constant 32 : index

"gpu.launch_func"(%c1, %c1, %c1,

%c32, %c1, %c1,

%0)

{ kernel_module = "mod",

kernel = "func" }

: (index, index, index, index, index, index, memref<?xi64>) -> ()

// ...

module @mod attributes {gpu.kernel_module} {

gpu.func @func(%arg0: memref<?xi64>) kernel {

%c0 = constant 0 : index

%0 = index_cast %c0 : index to i64

store %0, %arg0[%c0] : memref<?xi64>

gpu.return

}

}

module {

}

module @mod {

llvm.func @func(%arg0: !llvm<"{i64*, i64*, i64, [1 x i64], [1 x i64]}">)

attributes { nvvm.kernel } {

%0 = llvm.mlir.constant(0 : index) : !llvm.i64

%1 = llvm.mlir.constant(1 : index) : !llvm.i64

%2 = llvm.extractvalue %arg0[1 : index]

: !llvm<"{i64*, i64*, i64, [1 x i64], [1 x i64]}">

%3 = llvm.getelementptr %2[%0] : (!llvm<"i64*">, !llvm.i64) -> !llvm<"i64*">

llvm.store %0, %3 : !llvm<"i64*">

// ...

nvvm.something // ...

// ...

}

}

Lower to SPIR-V

module @mod {

spv.func @func(%arg0) {

// ...

spv.something // ...

// ...

}

}

26 of 78

Outlined form

// ...

%c1 = constant 1 : index

%c32 = constant 32 : index

"gpu.launch_func"(%c1, %c1, %c1,

%c32, %c1, %c1,

%0)

{ kernel_module = "mod",

kernel = "func" }

: (index, index, index, index, index, index, memref<?xi64>) -> ()

// ...

module @mod attributes {gpu.kernel_module} {

gpu.func @func(%arg0: memref<?xi64>) kernel {

%c0 = constant 0 : index

%0 = index_cast %c0 : index to i64

store %0, %arg0[%c0] : memref<?xi64>

gpu.return

}

}

module {

}

module @mod {

llvm.func @func(%arg0: !llvm<"{i64*, i64*, i64, [1 x i64], [1 x i64]}">)

attributes { nvvm.kernel } {

%0 = llvm.mlir.constant(0 : index) : !llvm.i64

%1 = llvm.mlir.constant(1 : index) : !llvm.i64

%2 = llvm.extractvalue %arg0[1 : index]

: !llvm<"{i64*, i64*, i64, [1 x i64], [1 x i64]}">

%3 = llvm.getelementptr %2[%0] : (!llvm<"i64*">, !llvm.i64) -> !llvm<"i64*">

llvm.store %0, %3 : !llvm<"i64*">

// ...

nvvm.something // ...

// ...

}

}

Lower to ROCm

module @mod {

spv.func @func(%arg0) {

// ...

spv.something // ...

// ...

}

}

module @mod {

llvm.func @func(%arg0) {

// ...

rocm.something // ...

// ...

}

}

27 of 78

Separate compilation with nested modules

// ...

%c1 = constant 1 : index

%c32 = constant 32 : index

"gpu.launch_func"(%c1, %c1, %c1,

%c32, %c1, %c1,

%0)

{ kernel_module = "mod",

kernel = "func" }

: (index, index, index, index, index, index, memref<?xi64>) -> ()

// ...

module @mod attributes {gpu.kernel_module} {

gpu.func @func(%arg0: memref<?xi64>) kernel {

%c0 = constant 0 : index

%0 = index_cast %c0 : index to i64

store %0, %arg0[%c0] : memref<?xi64>

gpu.return

}

}

module {

}

module @mod {

llvm.func @func(%arg0: !llvm<"{i64*, i64*, i64, [1 x i64], [1 x i64]}">)

attributes { nvvm.kernel } {

%0 = llvm.mlir.constant(0 : index) : !llvm.i64

%1 = llvm.mlir.constant(1 : index) : !llvm.i64

%2 = llvm.extractvalue %arg0[1 : index]

: !llvm<"{i64*, i64*, i64, [1 x i64], [1 x i64]}">

%3 = llvm.getelementptr %2[%0] : (!llvm<"i64*">, !llvm.i64) -> !llvm<"i64*">

llvm.store %0, %3 : !llvm<"i64*">

// ...

nvvm.something // ...

// ...

}

}

28 of 78

Separate compilation with nested modules

// ...

%c1 = constant 1 : index

%c32 = constant 32 : index

"gpu.launch_func"(%c1, %c1, %c1,

%c32, %c1, %c1,

%0)

{ kernel_module = "mod",

kernel = "func" }

: (index, index, index, index, index, index, memref<?xi64>) -> ()

// ...

module @mod attributes {gpu.kernel_module} {

gpu.func @func(%arg0: memref<?xi64>) kernel {

%c0 = constant 0 : index

%0 = index_cast %c0 : index to i64

store %0, %arg0[%c0] : memref<?xi64>

gpu.return

}

}

module {

}

"NVPTX0000_Some_Binary_Blob_Understood_By_The_Device"

Device-specific compilation

29 of 78

Separate compilation with nested modules

// ...

%c1 = constant 1 : index

%c32 = constant 32 : index

"gpu.launch_func"(%c1, %c1, %c1,

%c32, %c1, %c1,

%0)

{ kernel_module = "mod",

kernel = "func" }

: (index, index, index, index, index, index, memref<?xi64>) -> ()

// ...

llvm.mlir.global constant @mod_func("NVPTX0000_Some_Binary_Blob_Understood_By_The_Device") : !llvm<"[51 x i8]">

module {

}

30 of 78

Separate compilation with nested modules

// ...

%c1 = constant 1 : index

%c32 = constant 32 : index

"gpu.launch_func"(%c1, %c1, %c1,

%c32, %c1, %c1,

%0)

{ kernel_module = "mod",

kernel = "func" }

: (index, index, index, index, index, index, memref<?xi64>) -> ()

// ...

llvm.mlir.global constant @mod_func("NVPTX0000_Some_Binary_Blob_Understood_By_The_Device") : !llvm<"[51 x i8]">

module {

}

31 of 78

Separate compilation with nested modules

%0 = llvm.mlir.addressof @mod_func : !llvm<"[51 x i8]*">

%1 = llvm.mlir.constant(0 : index) : !llvm.i64

%2 = llvm.getelementptr %0[%1] // ...

// ...

%42 = llvm.alloca // ... allocate module handle

llvm.call @mcuModuleLoad(%42, %2) : (!llvm<"i8**">, !llvm<"i8*">) -> !llvm.i32

// ... check errors ...

%43 = llvm.alloca // ... allocate function handle

llvm.call @mcuModuleGetFunction // ...

// ... check errors ...

%44 = llvm.mlir.constant(32: index) : !llvm.i64

%45 = llvm.mlir.constant(1 : index) : !llvm.i64

// ... prepare arguments ...

llvm.call @mcuLaunchKernel(%43, %45, %45, %45, %44, %45, ...)

// ... check errors ...

llvm.mlir.global constant @mod_func("NVPTX0000_Some_Binary_Blob_Understood_By_The_Device") : !llvm<"[51 x i8]">

module {

}

Lower, e.g., to LLVM function calls.

Host-side dialect to come.

32 of 78

Separate compilation with nested modules

%0 = llvm.mlir.addressof @mod_func : !llvm<"[51 x i8]*">

%1 = llvm.mlir.constant(0 : index) : !llvm.i64

%2 = llvm.getelementptr %0[%1] // ...

// ...

%42 = llvm.alloca // ... allocate module handle

llvm.call @mcuModuleLoad(%42, %2) : (!llvm<"i8**">, !llvm<"i8*">) -> !llvm.i32

// ... check errors ...

%43 = llvm.alloca // ... allocate function handle

llvm.call @mcuModuleGetFunction // ...

// ... check errors ...

%44 = llvm.mlir.constant(32: index) : !llvm.i64

%45 = llvm.mlir.constant(1 : index) : !llvm.i64

// ... prepare arguments ...

llvm.call @mcuLaunchKernel(%43, %45, %45, %45, %44, %45, ...)

// ... check errors ...

llvm.mlir.global constant @mod_func("NVPTX0000_Some_Binary_Blob_Understood_By_The_Device") : !llvm<"[51 x i8]">

module {

}

Thin wrapper for instrumentation and compatibility
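
For context, these wrappers are ordinary external functions declared in the same host module. A minimal sketch of what such declarations could look like follows; the names match the calls above, while the signatures are approximations inferred from those call sites rather than taken from the actual runtime library.

// Sketch only: signatures are approximations.
llvm.func @mcuModuleLoad(!llvm<"i8**">, !llvm<"i8*">) -> !llvm.i32
llvm.func @mcuModuleGetFunction(!llvm<"i8**">, !llvm<"i8*">, !llvm<"i8*">) -> !llvm.i32
llvm.func @mcuLaunchKernel(...) -> !llvm.i32  // function handle, grid/block sizes, stream, kernel arguments (elided)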

33 of 78

Sample compilation flow

  • A module with inline gpu.launch
  • Code motion or optimization across host and device
  • Outline gpu.launch bodies into separate nested modules
  • Compile nested GPU modules separately and replace them with blobs
  • Lower the top module to platform-specific calls
  • Compile the top module with blobs to one fat binary
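
Tying these steps to the IR shown on the previous slides, a compressed sketch of the module at each stage (everything heavily abbreviated):

// (1)-(2) inline form: host and device code in one region, transformations see both
gpu.launch blocks(...) threads(...) args(...) { ... gpu.return }

// (3) after outlining: a launch by name plus a nested kernel module
"gpu.launch_func"(...) { kernel_module = "mod", kernel = "func" } : (...) -> ()
module @mod attributes {gpu.kernel_module} { gpu.func @func(...) kernel { ... } }

// (4) after device compilation: the nested module is replaced by a binary blob
llvm.mlir.global constant @mod_func("...") : !llvm<"[... x i8]">

// (5)-(6) after host lowering: the launch becomes runtime calls, linked into one fat binary
llvm.call @mcuModuleLoad(...)
llvm.call @mcuLaunchKernel(...)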

34 of 78

Operation Selection

35 of 78

“Guidelines” for adding new operations

  • Is not available from other dialects
    • no gpu.add or gpu.alloc
  • Has a current need
    • Incubate new operations in device dialects
    • Consolidate at gpu dialect level
  • Needs to be common to target dialects
    • Device specific operations should live in a device dialect

36 of 78

Example: Subgroup communication primitives

  • Unique to the SIMT model of GPU execution
    • For SIMD, a vector dialect could have gather/scatter/permute/etc on vectors
  • Required to express interesting algorithms
    • Investigation started to support efficient reductions
  • Incubated in the NVVM dialect
    • nvvm::ShflBflyOp
  • Common to GPU targets

37 of 78

Example: Shuffle operations in targets

  • NVVM: shfl.sync (up, down, permute, swizzle)
  • ROCm: swizzle (swizzle), permute (permute), bpermute (up, down)
  • SPIR-V: subgroupShuffle (permute), subgroupShuffleXor (swizzle), subgroupShuffleUp (up), subgroupShuffleDown (down)

38 of 78

Example: Subgroup communication primitives

  • Unique to the SIMT model of GPU execution
    • For SIMD, a vector dialect could have gather/scatter/permute/etc on vectors
  • Required to express interesting algorithms
    • Investigation started to support efficient reductions
  • Incubated in the NVVM dialect
    • nvvm::ShflBflyOp
  • Common to GPU targets
    • gpu::SwizzleOp
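
As a sketch, such a target-independent op could lower onto the shuffle incubated in the NVVM dialect; both the gpu-level spelling and the exact operand lists below are illustrative, not a settled design.

// Proposed target-independent form (illustrative syntax):
%r = "gpu.swizzle"(%val, %offset) : (f32, i32) -> f32

// Possible lowering to the NVVM-incubated butterfly shuffle (operands approximate):
%r = nvvm.shfl.sync.bfly %active_mask, %val, %offset, %mask_and_clamp : f32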

39 of 78

Example: Workgroup reduction

  • Unique to the SIMT programming model
    • Requires workgroup communication
  • Exists only in one target
    • SPIR-V has a reduce group operation
  • In-dialect lowering can support other targets
    • Reduction can be expressed with existing primitives
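
A sketch of what such a reduction could look like at the gpu level; the op name and generic syntax below are illustrative (the dialect later grew an all_reduce-style op), and the lowering comments describe the intent rather than an implementation.

// Target-independent workgroup reduction (illustrative):
%sum = "gpu.all_reduce"(%val) ({}) { op = "add" } : (f32) -> f32

// SPIR-V: maps directly onto its group reduce operation.
// NVVM/ROCm: lowered in-dialect using subgroup shuffles, workgroup memory and gpu.barrier.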

40 of 78

Execution Model

41 of 78

Execution model

  • Device code is written for one thread (workitem).
  • Threads may execute concurrently.
  • Implicit communication between threads in some operations (barrier, shuffle, ballot, …).

42 of 78

Execution model

  • Device code is written for one thread (workitem).
  • Threads may execute concurrently.
  • Implicit communication between threads in some operations (barrier, shuffle, ballot, …).

Open design!

How do we make transformations aware of this?

43 of 78

Execution model: LLVM approach

44 of 78

Execution model: LLVM approach problems

45 of 78

Execution model: LLVM approach problems

Some operations cannot be made control-dependent on more values.

Some others cannot be made control-dependent on fewer values.
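
A minimal sketch of both constraints inside a kernel; gpu.ballot is a hypothetical op here, standing for anything whose result depends on which threads execute it.

// inside a gpu.launch region; %tx is the thread id
%c16  = constant 16 : index
gpu.barrier                               // must not gain control dependences: sinking it
                                          // into ^bb1 deadlocks the threads that take ^bb2
%cond = cmpi "slt", %tx, %c16 : index
cond_br %cond, ^bb1, ^bb2
^bb1:
  %b = "gpu.ballot"(%cond) : (i1) -> i32  // must not lose control dependences: hoisting it
                                          // above the branch changes which threads vote
  br ^bb2
^bb2:
  // ...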

46 of 78

Execution model: LLVM approach problems for MLIR

Some operations cannot be made control-dependent on more values.

Some others cannot be made control-dependent on fewer values.

Control flow in MLIR regions is defined by the surrounding operation.

We would like to keep MLIR extensible.

47 of 78

Execution model: attributes / traits?

gpu.launch(...) {

some.op()

call @func() // HasTrait: NoMoreControl

another.op() // HasTrait: NoLessControl

}

  • Have “convergent” and its reciprocal as traits (attributes are too easy to discard).
  • An op cannot move into another block or into a child/parent/adjacent region if doing so would modify the set of control-affecting values.
  • Transformations, including GPU-unrelated ones, would have to be made aware of these traits.

48 of 78

Execution model: attributes / traits?

gpu.launch(...) {

some.op()

call @func() // HasTrait: NoMoreControl

another.op() // HasTrait: NoLessControl

}

  • Have “convergent” and its reciprocal as traits (attributes are too easy to discard).
  • An op cannot move into another block or into a child/parent/adjacent region if doing so would modify the set of control-affecting values.
  • Transformations, including GPU-unrelated ones, would have to be made aware of these traits.

Core IR property?

49 of 78

Execution model: structured control flow only?

gpu.launch(...) {

// ...

gpu.barrier

if (%0) {

// ...

} else {

// ...

}

}

50 of 78

Execution model: structured control flow only?

gpu.launch(...) {

// ...

gpu.barrier

if (%0) {

// ...

} else {

// ...

}

}

  • MLIR is unaware of structured control flow; any op with regions is some form of control flow.
  • We can require at most one block in “gpu.launch”, but cannot propagate that requirement down to nested regions unless we only accept ops that impose the same requirement.

51 of 78

Execution model: structured control flow only?

gpu.launch(...) {

// ...

gpu.barrier

if (%0) {

// ...

} else {

// ...

}

}

  • MLIR is unaware of structured control flow; any op with regions is some form of control flow.
  • We can require at most one block in “gpu.launch”, but cannot propagate that requirement down to nested regions unless we only accept ops that impose the same requirement.

Arbitrary dialects can no longer be used inside GPU code.

Potential op duplication (gpu.if?)

52 of 78

Execution model: structured control flow only?

gpu.launch(...) attributes {gpu.region} {

// ...

gpu.barrier

if (%0) attributes {gpu.region} {

// ...

} else {

// ...

}

}

53 of 78

Execution model: structured control flow only?

gpu.launch(...) attributes {gpu.region} {

// ...

gpu.barrier

if (%0) attributes {gpu.region} {

// ...

} else {

// ...

}

}

  • Require operations to have a GPU dialect attribute attached, enabling recursive verification.
  • Transformations on such operations would have to be made aware of this attribute.

54 of 78

Execution model: structured control flow only?

gpu.launch(...) attributes {gpu.region} {

// ...

gpu.barrier

if (%0) attributes {gpu.region} {

// ...

} else {

// ...

}

}

  • Require operations to have a GPU dialect attribute attached, enabling recursive verification.
  • Transformations on such operations would have to be made aware of this attribute.

Core IR property?

55 of 78

Execution model: explicit vector mask?

gpu.launch(...) mask(@m) {

// ...

gpu.barrier @m

%42 = cmpf "olt", %41, %40 : f32

gpu.cond_br @m %42, ^bb1, ^bb2

^bb1:

// ...

gpu.ballot @m

^bb2:

// ...

}

56 of 78

Execution model: explicit vector mask?

gpu.launch(...) mask(@m) {

// ...

gpu.barrier @m

%42 = cmpf "olt", %41, %40 : f32

gpu.cond_br @m %42, ^bb1, ^bb2

^bb1:

// ...

gpu.ballot @m

^bb2:

// ...

}

  • Control flow operations can be seen as modifying a vector mask register -- materialize it.
  • Branch-like operations and inter-thread communication read+update the mask.

57 of 78

Execution model: explicit vector mask?

gpu.launch(...) mask(@m) {

// ...

gpu.barrier @m

%42 = cmpf "olt", %41, %40 : f32

gpu.cond_br @m %42, ^bb1, ^bb2

^bb1:

// ...

gpu.ballot @m

^bb2:

// ...

}

  • Control flow operations can be seen as modifying a vector mask register -- materialize it.
  • Branch-like operations and inter-thread communication read+update the mask.

Arbitrary dialects can no longer be used inside GPU code.

Potential op duplication (gpu.if?)

58 of 78

Execution model dilemma

Restrict composability

Expose the model to core IR

59 of 78

Execution model dilemma

Restrict composability

Expose the model to core IR

Open design!

We don’t have an answer

60 of 78

Memory Attribution

61 of 78

GPU memory hierarchy

  • Global Memory
    • Lifetime managed by host
    • Fully visible
  • Workgroup memory
    • Lifetime fixed to current invocation
    • Visible to workgroup
  • Private memory
    • Lifetime fixed to current invocation
    • Visible to thread

62 of 78

GPU memory hierarchy

  • Global Memory
    • Lifetime managed by host
    • Fully visible
  • Workgroup memory
    • Lifetime fixed to current invocation
    • Visible to workgroup
  • Private memory
    • Lifetime fixed to current invocation
    • Visible to thread

memref<42xf32, 1>

memref<42xf32, 3>

memref<42xf32, 5>

63 of 78

GPU memory hierarchy

  • Global Memory
    • Lifetime managed by host
    • Fully visible
  • Workgroup memory
    • Lifetime fixed to current invocation
    • Visible to workgroup
  • Private memory
    • Lifetime fixed to current invocation
    • Visible to thread
  • Constant Memory
    • Lifetime managed by host
    • Fully visible

memref<42xf32, 1>

memref<42xf32, 3>

memref<42xf32, 5>

memref<42xf32, 4>
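
As a small illustration (a sketch, not taken from the slides), kernel code selects a space simply through the memref's address-space number, following the numbering above:

func @stage(%in: memref<42xf32, 1>, %wg: memref<42xf32, 3>) {
  %c0 = constant 0 : index
  %0 = load %in[%c0] : memref<42xf32, 1>   // global memory
  store %0, %wg[%c0] : memref<42xf32, 3>   // workgroup memory
  return
}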

64 of 78

Modelling Lifetime: Using globals

PTX and SPIR-V model this as global values on the module level.

module attributes {gpu.kernel_module} {

llvm.mlir.global @wg_memory() {addr_space = 3 : i32} : !llvm<"[32 x float]">

llvm.mlir.global @priv_memory() {addr_space = 5 : i32} : !llvm<"[1 x float]">

}

65 of 78

Modelling Lifetime: Using globals

PTX and SPIR-V model this as global values on the module level.

module attributes {gpu.kernel_module} {

llvm.mlir.global @wg_memory() {addr_space = 3 : i32} : !llvm<"[32 x float]">

llvm.mlir.global @priv_memory() {addr_space = 5 : i32} : !llvm<"[1 x float]">

llvm.func @kernel() {

%0 = llvm.mlir.addressof @wg_memory : !llvm<"[32 x float] addrspace(3)*">

// ...

}

}

66 of 78

Modelling Lifetime: Using globals

PTX and SPIR-V model this as global values on the module level.

module attributes {gpu.kernel_module} {

llvm.mlir.global @wg_memory() {addr_space = 3 : i32} : !llvm<"[32 x float]">

llvm.mlir.global @priv_memory() {addr_space = 5 : i32} : !llvm<"[1 x float]">

llvm.func @kernel() {

%0 = llvm.mlir.addressof @wg_memory : !llvm<"[32 x float] addrspace(3)*">

// ...

}

llvm.func @kernelTwo() {

%0 = llvm.mlir.addressof @wg_memory : !llvm<"[32 x float] addrspace(3)*">

// ...

}

}

Using globals suggests a lifetime aligned with the module; in reality it is aligned with the kernel invocation.

This leads to unintuitive aliasing semantics.

67 of 78

Modelling Lifetime: Alloca

We could model this with “dynamic” allocation.

module attributes {gpu.kernel_module} {

func @kernel() {

%0 = gpu.alloca() : memref<32xf32, 3> // allocate in workgroup memory

%1 = gpu.alloca() : memref<1xf32, 5> // allocate in private memory

// ...

}

}

68 of 78

Modelling Lifetime: Alloca

We could model this with “dynamic” allocation.

module attributes {gpu.kernel_module} {

func @kernel() {

%0 = gpu.alloca() : memref<32xf32, 3> // allocate in workgroup memory

%1 = gpu.alloca() : memref<1xf32, 5> // allocate in private memory

// ...

call @kernel() : () -> ()

// ...

}

}

Using alloca suggests stack allocation, while in reality the allocation is static.

Recursion would have unintuitive aliasing semantics.

69 of 78

Modelling Lifetime: Function attribution

Use static memory declarations at the function level.

module attributes {gpu.kernel_module} {

gpu.func @kernel()

workgroup(%0: memref<32xf32, 3>)

private(%1: memref<1xf32, 5>) {

// ...

}

}

70 of 78

Modelling Lifetime: Function attribution

Use static memory declarations at the function level.

module attributes {gpu.kernel_module} {

gpu.func @kernel()

workgroup(%0: memref<32xf32, 3>)

private(%1: memref<1xf32, 5>) {

// ...

}

}

71 of 78

Modelling Lifetime: Function attribution

Use static memory declarations at the function level.

module attributes {gpu.kernel_module} {

gpu.func @kernel()

workgroup(%0: memref<32xf32, 3>)

private(%1: memref<1xf32, 5>) {

// ...

}

}

72 of 78

Modelling Lifetime: Function attribution

Use static memory declarations at the function level.

module attributes {gpu.kernel_module} {

gpu.func @kernel()

workgroup(%0: memref<32xf32, 3>)

private(%1: memref<1xf32, 5>) {

// ...

}

}

Lifetime is coupled with the invocation of the kernel function.

Allocation is static.
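
A sketch of how the attributed buffers might be used inside the kernel body; the thread-id op is shown in generic form and the exact placement of the kernel keyword is illustrative.

gpu.func @kernel(%arg0: memref<32xf32, 1>)
    workgroup(%wg: memref<32xf32, 3>)
    private(%priv: memref<1xf32, 5>) kernel {
  %tid = "gpu.thread_id"() { dimension = "x" } : () -> index
  %0 = load %arg0[%tid] : memref<32xf32, 1>
  store %0, %wg[%tid] : memref<32xf32, 3>   // valid only for this kernel invocation
  gpu.barrier
  // ...
  gpu.return
}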

73 of 78

Host-side dialect

74 of 78

Status Quo

// ...

%c1 = constant 1 : index

%c32 = constant 32 : index

gpu.launch

blocks(%bx, %by, %bz) in (%nbx = %c1, %nby = %c1, %nbz = %c1)

threads(%tx, %ty, %tz) in (%ntx = %c32, %nty = %c1, %ntz = %c1)

Kernel launches are synchronous.

There is only a single device.

No notion of memory allocation.

75 of 78

How about streams?

// ...

%c1 = constant 1 : index

%c32 = constant 32 : index

gpu.launch

blocks(%bx, %by, %bz) in (%nbx = %c1, %nby = %c1, %nbz = %c1)

threads(%tx, %ty, %tz) in (%ntx = %c32, %nty = %c1, %ntz = %c1)

on(%stream)

Kernel launches would be asynchronous.

Can encode many devices.

Could design allocation around streams.
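
A sketch of how a stream could be created and threaded through the launch; every op and type name below is hypothetical, purely to illustrate the design direction.

// hypothetical ops and types, for illustration only
%dev    = "gpu.get_device"(%c0) : (index) -> !gpu.device
%stream = "gpu.create_stream"(%dev) : (!gpu.device) -> !gpu.stream
gpu.launch blocks(...) threads(...) on(%stream) { ... }
"gpu.wait_stream"(%stream) : (!gpu.stream) -> ()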

76 of 78

How about chaining?

// ...

%c1 = constant 1 : index

%c32 = constant 32 : index

%chn1 = gpu.launch

blocks(%bx, %by, %bz) in (%nbx = %c1, %nby = %c1, %nbz = %c1)

threads(%tx, %ty, %tz) in (%ntx = %c32, %nty = %c1, %ntz = %c1)

deps(%chn0)

// ...

%chn2 = gpu.wait(%chn1)

Kernel launches would be asynchronous?

Chains could encode devices?

Could design allocation around chains?

77 of 78

How about futures?

// ...

%c1 = constant 1 : index

%c32 = constant 32 : index

%fut0, %fut1 = gpu.launch

blocks(%bx, %by, %bz) in (%nbx = %c1, %nby = %c1, %nbz = %c1)

threads(%tx, %ty, %tz) in (%ntx = %c32, %nty = %c1, %ntz = %c1)

args(%arg0, %arg1)

// ...

%val = gpu.await(%fut0)

Kernel launches would be asynchronous?

How could we encode devices?

Could design allocation around futures?

78 of 78

mlir@tensorflow.org