Initial Proof of Concept: Data Movement
Goals
  Show forward progress on HiHAT
  Demonstrate the layering of a simple app: the user layer, the common layer, glue code, and the implementation
    One user-layer API makes implementation-specific trade-offs to choose which common-layer API to call
  Demonstrate low performance overheads
  Demonstrate retargetability for CPU in addition to GPU
  Focus on data movement
    What would be enabled for runtime clients: basic data movement among multiple instances of multiple device kinds

Incremental goals
  Illustrate interoperability with respect to user memory allocation
  Introduce basic enumeration, e.g. for devices
  Illustrate generality, e.g. with support for managed memory

Stretch goals
  Demonstrate node-to-node transfers

Non-goals (at least initially)
  Final APIs
  Breadth of functionality

Phase 1: simple data movement of HiHAT-allocated data
Objectives
  Choose the API for a pair of devices, based on address
Caveats
  Use the allocator to associate an address with a device; this introduces buffers
  Assume that the device kinds are <CPU and GPU> and the index order is <CPU0, GPU0, GPU1>
  Assume that there is enough physical memory on the host to pin all host-side memory
Test app
  Batch with a mix of sources and targets across GPU and CPU, with no dependences
Baseline app
  Same, but coded with raw CUDA APIs instead of HiHAT
Runtime client/app
  John Stone @ UIUC: molecular orbital algorithm from quantum chemistry (extracted from VMD)
User layer APIs
  alloc_buffer(void** out_address, size_t size, int device_kind_index, int device_index, int mem_kind)
  move_data(void* target, void* src, size_t size); determines CPU vs. GPU based on the address range and chooses which common-layer API to call based on source and target (see the sketch after the common-layer APIs below)
Common layer APIs
  move_{local_to_remote, remote_to_local, remote_to_remote}(void* dest_addr, void* src_addr, int dest_device_kind_index, int dest_device_index, int src_device_kind_index, int src_device_index, size_t size)
  move_local_to_local(void* dest_addr, void* src_addr, size_t size)
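
A minimal sketch of how the user-layer move_data could dispatch to the common layer. It assumes a hypothetical hihat_find_buffer() lookup that returns the device recorded by alloc_buffer for a given address, and the assumed device-kind ordering from the caveats above ("local" means the host CPU); none of this is the final API.

    #include <stddef.h>

    /* Hypothetical record kept by alloc_buffer: which device owns this range. */
    typedef struct {
        void  *base;
        size_t size;
        int    device_kind_index;   /* assumed: 0 = CPU, 1 = GPU */
        int    device_index;
    } hihat_buffer_t;

    /* Assumed lookup from an address to its owning buffer; NULL if unknown. */
    const hihat_buffer_t *hihat_find_buffer(const void *addr);

    /* Common-layer entry points, as listed above. */
    void move_local_to_local(void *dest_addr, void *src_addr, size_t size);
    void move_local_to_remote(void *dest, void *src, int dk, int di, int sk, int si, size_t size);
    void move_remote_to_local(void *dest, void *src, int dk, int di, int sk, int si, size_t size);
    void move_remote_to_remote(void *dest, void *src, int dk, int di, int sk, int si, size_t size);

    #define HIHAT_CPU 0   /* assumed device-kind index, per the caveats */

    /* User-layer move_data: choose the common-layer API from the source/target devices. */
    void move_data(void *target, void *src, size_t size)
    {
        const hihat_buffer_t *d = hihat_find_buffer(target);
        const hihat_buffer_t *s = hihat_find_buffer(src);
        if (!d || !s)
            return;   /* Phase 1: both addresses come from alloc_buffer */

        int dest_local = (d->device_kind_index == HIHAT_CPU);
        int src_local  = (s->device_kind_index == HIHAT_CPU);

        if (dest_local && src_local)
            move_local_to_local(target, src, size);
        else if (src_local)
            move_local_to_remote(target, src, d->device_kind_index, d->device_index,
                                 s->device_kind_index, s->device_index, size);
        else if (dest_local)
            move_remote_to_local(target, src, d->device_kind_index, d->device_index,
                                 s->device_kind_index, s->device_index, size);
        else
            move_remote_to_remote(target, src, d->device_kind_index, d->device_index,
                                  s->device_kind_index, s->device_index, size);
    }
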
NV-based implementations
  cudaMallocHost(void** address, size_t size), cudaMalloc(void** address, size_t size)
  cudaMemcpy(void* dest, const void* src, size_t size, cudaMemcpyKind kind), with kind one of cudaMemcpy{Host, Device}To{Host, Device}
CPU-based implementations
  memcpy(void* dest, const void* src, size_t size)
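
A rough illustration of how the glue code might map the common layer onto these implementations: the CPU path is a memcpy, and the NV path is a blocking cudaMemcpy on the appropriate device. Error handling is elided, and move_remote_to_local / move_remote_to_remote follow the same pattern with the other cudaMemcpyKind values; this is a sketch, not the actual glue.

    #include <string.h>
    #include <cuda_runtime.h>

    /* CPU-based implementation: host-to-host is a plain memcpy. */
    void move_local_to_local(void *dest_addr, void *src_addr, size_t size)
    {
        memcpy(dest_addr, src_addr, size);
    }

    /* NV-based implementation: host-to-GPU via the blocking copy listed above. */
    void move_local_to_remote(void *dest_addr, void *src_addr,
                              int dest_device_kind_index, int dest_device_index,
                              int src_device_kind_index,  int src_device_index,
                              size_t size)
    {
        (void)dest_device_kind_index; (void)src_device_kind_index; (void)src_device_index;
        cudaSetDevice(dest_device_index);   /* select the destination GPU */
        cudaMemcpy(dest_addr, src_addr, size, cudaMemcpyHostToDevice);
    }
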
Analysis and variations
  Vary the payload size and show how the % overhead from the HiHAT APIs and glue code varies
  Use cudaMemcpyAsync with a range of stream counts, compared against blocking APIs only, for a range of batch sizes
  Make a subset of the transfers form a dependence chain, and compare all-blocking execution against placing dependent subsets in the same stream
  Compare cudaMemcpy(dest, src, size, cudaMemcpyDeviceToDevice) with cudaMemcpyPeer
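
For the cudaMemcpyAsync variation, a sketch of issuing a batch of host-to-device copies round-robin over a configurable number of streams; the host buffers are assumed to be pinned (per the Phase 1 caveat), which is what lets the asynchronous copies actually overlap. The stream count and batch size are the parameters being swept.

    #include <stddef.h>
    #include <cuda_runtime.h>

    /* Issue `batch` host->device copies spread over `num_streams` streams, then wait.
     * Compare against a loop of blocking cudaMemcpy calls on the same batch. */
    void batch_async_copies(void **dev_dst, void **pinned_src, size_t *sizes,
                            int batch, int num_streams)
    {
        cudaStream_t streams[16];             /* assume num_streams <= 16 for this sketch */
        for (int s = 0; s < num_streams; s++)
            cudaStreamCreate(&streams[s]);

        for (int i = 0; i < batch; i++)
            cudaMemcpyAsync(dev_dst[i], pinned_src[i], sizes[i],
                            cudaMemcpyHostToDevice, streams[i % num_streams]);

        for (int s = 0; s < num_streams; s++) {   /* dependent subsets would share a stream */
            cudaStreamSynchronize(streams[s]);
            cudaStreamDestroy(streams[s]);
        }
    }
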
Phase 2: simple data movement of non-HiHAT-allocated data
Objectives
  Enable wrapping existing memory in a buffer without allocating it through HiHAT
User layer APIs
  wrap_with_buffer(void* in_address, size_t size, int device_kind, int device_index)
  [same as Phase 1]
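
A sketch of what wrap_with_buffer might do for host memory that HiHAT did not allocate: record the range for the address-based dispatch and, optionally, pin it with cudaHostRegister so later moves behave like the Phase 1 path. The hihat_register_buffer helper and the return convention are hypothetical.

    #include <stddef.h>
    #include <cuda_runtime.h>

    /* Hypothetical registry shared with alloc_buffer (records address -> device). */
    void hihat_register_buffer(void *addr, size_t size, int device_kind, int device_index);

    #define HIHAT_CPU 0   /* assumed device-kind index, per the Phase 1 caveat */

    /* Wrap user-allocated memory in a buffer without allocating it through HiHAT. */
    int wrap_with_buffer(void *in_address, size_t size, int device_kind, int device_index)
    {
        if (device_kind == HIHAT_CPU) {
            /* Optionally pin the existing host range; relies on the Phase 1 caveat
             * that there is enough physical memory to pin all host-side memory. */
            if (cudaHostRegister(in_address, size, cudaHostRegisterDefault) != cudaSuccess)
                return -1;
        }
        hihat_register_buffer(in_address, size, device_kind, device_index);
        return 0;
    }
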
Phase 3: basic device enumeration
Objectives
  Enumerate and specify devices
User layer APIs
  query_num_device_kinds(int *out_num_device_kinds, int PE_index)
  query_device_kind(int *out_device_kind, int in_device_kind_index, int PE_index)
  query_num_devices_of_kind(int *out_num_devices_of_kind, int in_device_kind_index, int PE_index)
  [same as Phase 1]
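
A sketch of NV-backed glue for these enumeration calls, under the Phase 1 assumption of exactly two device kinds (CPU and GPU) on the local PE; the PE_index argument is ignored until Phase 5, and the return/error convention is assumed.

    #include <cuda_runtime.h>

    #define HIHAT_CPU 0   /* assumed device-kind encoding */
    #define HIHAT_GPU 1

    int query_num_device_kinds(int *out_num_device_kinds, int PE_index)
    {
        (void)PE_index;
        *out_num_device_kinds = 2;              /* CPU and GPU, per the Phase 1 caveat */
        return 0;
    }

    int query_device_kind(int *out_device_kind, int in_device_kind_index, int PE_index)
    {
        (void)PE_index;
        *out_device_kind = (in_device_kind_index == 0) ? HIHAT_CPU : HIHAT_GPU;
        return 0;
    }

    int query_num_devices_of_kind(int *out_num_devices_of_kind, int in_device_kind_index, int PE_index)
    {
        (void)PE_index;
        if (in_device_kind_index == HIHAT_CPU)
            *out_num_devices_of_kind = 1;                 /* treat the host as one device */
        else
            cudaGetDeviceCount(out_num_devices_of_kind);  /* number of visible GPUs */
        return 0;
    }
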
Phase 4: basic managed memory
Objectives
  Demonstrate support for managed memory
User layer APIs
  alloc_buffer(void** out_address, size_t size, int device_kind_index=MANAGED, int device_index=MANAGED) (on this PE)
Test app
  Choose cases that minimize the overhead of managed memory
NV-based implementations
  cudaMallocManaged
Analysis and variations
  Compare managed vs. non-managed memory
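
A sketch of the managed-memory branch of alloc_buffer, assuming MANAGED is a sentinel value for the device arguments; the signature follows the Phase 4 listing above (Phase 1's mem_kind argument is omitted). cudaMallocManaged backs the allocation, and the non-managed comparison cases continue to use cudaMalloc/cudaMallocHost plus explicit moves.

    #include <stddef.h>
    #include <cuda_runtime.h>

    #define MANAGED (-1)   /* assumed sentinel for the device arguments */

    /* Managed branch: one pointer usable on both host and device. */
    int alloc_buffer(void **out_address, size_t size,
                     int device_kind_index, int device_index)
    {
        if (device_kind_index == MANAGED && device_index == MANAGED) {
            if (cudaMallocManaged(out_address, size, cudaMemAttachGlobal) != cudaSuccess)
                return -1;
            return 0;   /* no explicit move_data needed; pages migrate on demand */
        }
        /* ... non-managed kinds fall through to the Phase 1 paths ... */
        return -1;
    }
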
Phase 5 (stretch): node to node transfers
Objectives
  Demonstrate node-to-node transfers
Caveats
  Presume that mpirun is used to create multiple ranks
  The default configuration uses MPI
User layer APIs
  init_communication() - initializes MPI and sets up the default communicator
  query_num_PEs(int *out_num_PEs)
  query_local_PE(int *out_PE_index)
  [same as Phase 1]
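
A sketch of MPI-backed defaults for the Phase 5 calls, assuming one MPI rank per PE (launched by mpirun) and MPI_COMM_WORLD as the default communicator set up by init_communication.

    #include <mpi.h>

    /* Initialize MPI (if the launcher has not already) and adopt MPI_COMM_WORLD. */
    int init_communication(void)
    {
        int initialized = 0;
        MPI_Initialized(&initialized);
        if (!initialized)
            MPI_Init(NULL, NULL);
        return 0;
    }

    int query_num_PEs(int *out_num_PEs)
    {
        return MPI_Comm_size(MPI_COMM_WORLD, out_num_PEs);
    }

    int query_local_PE(int *out_PE_index)
    {
        return MPI_Comm_rank(MPI_COMM_WORLD, out_PE_index);
    }
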
Phase 6 (stretch): node to node transfers on a selected transport
Objectives
  Select the network transport
User layer APIs
  config_network_transport(<transport type>)
Analysis and variations
  Vary the transport type, e.g. among MPI, UCX, and TCP
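
If <transport type> becomes an enum, config_network_transport might look like the sketch below; the enum values and the global selector are hypothetical and only mark the point where later node-to-node moves would dispatch (the actual UCX/TCP backends are out of scope here).

    /* Hypothetical transport selector; only the MPI path exists in the PoC. */
    typedef enum { TRANSPORT_MPI, TRANSPORT_UCX, TRANSPORT_TCP } transport_type_t;

    static transport_type_t g_transport = TRANSPORT_MPI;   /* Phase 5 default */

    int config_network_transport(transport_type_t transport)
    {
        switch (transport) {
        case TRANSPORT_MPI:
        case TRANSPORT_UCX:
        case TRANSPORT_TCP:
            g_transport = transport;   /* node-to-node moves dispatch on this later */
            return 0;
        default:
            return -1;                 /* unknown transport */
        }
    }
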
Future directions
  George Bosilca: transfers of multi-dimensional and non-dense data; the ability to use UCX
  John Stone: cudaMemAdvise for managed memory, cudaMemcpyToSymbol for constant memory
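
For reference, the two CUDA calls mentioned above look roughly like this in use; the managed buffer, the GPU id, and the __constant__ symbol are placeholders, not anything taken from VMD.

    #include <stddef.h>
    #include <cuda_runtime.h>

    __constant__ float coeffs[256];   /* placeholder constant-memory table */

    void tune_buffers(float *managed_buf, size_t bytes, const float *host_coeffs, int gpu)
    {
        /* Hint that the managed buffer will mostly be read from this GPU. */
        cudaMemAdvise(managed_buf, bytes, cudaMemAdviseSetReadMostly, gpu);
        cudaMemAdvise(managed_buf, bytes, cudaMemAdviseSetPreferredLocation, gpu);

        /* Stage a small lookup table into constant memory. */
        cudaMemcpyToSymbol(coeffs, host_coeffs, 256 * sizeof(float), 0,
                           cudaMemcpyHostToDevice);
    }
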
Ideas for lookups and trade-offs, with pros and cons
  Data direction, based on domain
    Requires introducing the notion of domains and APIs to associate addresses with domains
  Pin or not, based on reuse
    Differentiating between frequent and infrequent reuse, and motivating a case for not pinning, are somewhat contrived (see the sketch at the end of this section)
  Data mover, based on size
    CUDA handles this internally and does not offer a choice between DMA and aperture writes
  Multi-D xfers
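
As a concrete illustration of the "pin or not, based on reuse" idea above, a sketch of a heuristic that pins a host range with cudaHostRegister only after it has been used in enough transfers; the per-range use counter and the threshold are hypothetical.

    #include <stddef.h>
    #include <cuda_runtime.h>

    #define PIN_THRESHOLD 4   /* hypothetical reuse count at which pinning pays off */

    typedef struct {
        void  *base;
        size_t size;
        int    use_count;
        int    pinned;
    } host_range_t;

    /* Called before each transfer that touches this host range. */
    void maybe_pin(host_range_t *r)
    {
        r->use_count++;
        if (!r->pinned && r->use_count >= PIN_THRESHOLD) {
            if (cudaHostRegister(r->base, r->size, cudaHostRegisterDefault) == cudaSuccess)
                r->pinned = 1;   /* subsequent async copies to/from this range can overlap */
        }
    }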