Initial Proof of Concept: Data Movement
Goals
  Show forward progress on HiHAT
  Demonstrate the layering of a simple app: the user layer, the common layer, glue code, and the implementation
    One user-layer API makes implementation-specific trade-offs to choose which common-layer API to call
  Demonstrate low performance overheads
  Demonstrate retargetability for CPU in addition to GPU
  Focus on data movement
    What would be enabled for runtime clients: basic data movement among multiple instances of multiple device kinds

Incremental goals
  Illustrate interoperability with respect to user memory allocation
  Introduce basic enumeration, e.g. for devices
  Illustrate generality, e.g. with support for managed memory

Stretch goals
  Demonstrate node-to-node transfers

Non-goals (at least initially)
  Final APIs
  Breadth of functionality

Phase 1: simple data movement of HiHAT-allocated data
Objectives
  Choose the API for a pair of devices, based on address
Caveats
  Use the allocator to associate an address with a device; this introduces buffers
  Assume that the device kinds are <CPU and GPU> and the index order is <CPU0, GPU0, GPU1>
  Assume that there is enough physical memory on the host to pin all host-side memory
Test app
  Batch with a mix of sources and targets across GPU and CPU, with no dependences
Baseline app
  Same, but coded with raw CUDA APIs instead of HiHAT
Runtime client/app
  John Stone @ UIUC: molecular orbital algorithm from quantum chemistry (extracted from VMD)
User layer APIs
  alloc_buffer(void** out_address, size_t size, int device_kind_index, int device_index, int mem_kind)
  move_data(void* target, void* src, size_t size); determines CPU vs. GPU based on the address range and chooses which common-layer API to call based on source and target (see the sketch after the common-layer APIs below)
Common layer APIs
  move_{local_to_remote, remote_to_local, remote_to_remote}(void* dest_addr, void* src_addr, int dest_device_kind_index, int dest_device_index, int src_device_kind_index, int src_device_index, size_t size)
  move_local_to_local(void* dest_addr, void* src_addr, size_t size)
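
A minimal sketch of how the user-layer move_data could dispatch to the common layer. It assumes a hypothetical hihat_find_buffer() lookup that returns the device recorded by alloc_buffer for a given address, and the assumed device-kind ordering from the caveats above ("local" means the host CPU); none of this is the final API.

    #include <stddef.h>

    /* Hypothetical record kept by alloc_buffer: which device owns this range. */
    typedef struct {
        void  *base;
        size_t size;
        int    device_kind_index;   /* assumed: 0 = CPU, 1 = GPU */
        int    device_index;
    } hihat_buffer_t;

    /* Assumed lookup from an address to its owning buffer; NULL if unknown. */
    const hihat_buffer_t *hihat_find_buffer(const void *addr);

    /* Common-layer entry points, as listed above. */
    void move_local_to_local(void *dest_addr, void *src_addr, size_t size);
    void move_local_to_remote(void *dest, void *src, int dk, int di, int sk, int si, size_t size);
    void move_remote_to_local(void *dest, void *src, int dk, int di, int sk, int si, size_t size);
    void move_remote_to_remote(void *dest, void *src, int dk, int di, int sk, int si, size_t size);

    #define HIHAT_CPU 0   /* assumed device-kind index, per the caveats */

    /* User-layer move_data: choose the common-layer API from the source/target devices. */
    void move_data(void *target, void *src, size_t size)
    {
        const hihat_buffer_t *d = hihat_find_buffer(target);
        const hihat_buffer_t *s = hihat_find_buffer(src);
        if (!d || !s)
            return;   /* Phase 1: both addresses come from alloc_buffer */

        int dest_local = (d->device_kind_index == HIHAT_CPU);
        int src_local  = (s->device_kind_index == HIHAT_CPU);

        if (dest_local && src_local)
            move_local_to_local(target, src, size);
        else if (src_local)
            move_local_to_remote(target, src, d->device_kind_index, d->device_index,
                                 s->device_kind_index, s->device_index, size);
        else if (dest_local)
            move_remote_to_local(target, src, d->device_kind_index, d->device_index,
                                 s->device_kind_index, s->device_index, size);
        else
            move_remote_to_remote(target, src, d->device_kind_index, d->device_index,
                                  s->device_kind_index, s->device_index, size);
    }
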
NV-based implementations
  cudaMallocHost(void** address, size_t size), cudaMalloc(void** address, size_t size)
  cudaMemcpy(void* dest, const void* src, size_t size, cudaMemcpyKind kind), with kind one of cudaMemcpy{Host, Device}To{Host, Device}
CPU-based implementations
  memcpy(void* dest, const void* src, size_t size)
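
A rough illustration of how the glue code might map the common layer onto these implementations: the CPU path is a memcpy, and the NV path is a blocking cudaMemcpy on the appropriate device. Error handling is elided, and move_remote_to_local / move_remote_to_remote follow the same pattern with the other cudaMemcpyKind values; this is a sketch, not the actual glue.

    #include <string.h>
    #include <cuda_runtime.h>

    /* CPU-based implementation: host-to-host is a plain memcpy. */
    void move_local_to_local(void *dest_addr, void *src_addr, size_t size)
    {
        memcpy(dest_addr, src_addr, size);
    }

    /* NV-based implementation: host-to-GPU via the blocking copy listed above. */
    void move_local_to_remote(void *dest_addr, void *src_addr,
                              int dest_device_kind_index, int dest_device_index,
                              int src_device_kind_index,  int src_device_index,
                              size_t size)
    {
        (void)dest_device_kind_index; (void)src_device_kind_index; (void)src_device_index;
        cudaSetDevice(dest_device_index);   /* select the destination GPU */
        cudaMemcpy(dest_addr, src_addr, size, cudaMemcpyHostToDevice);
    }
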
Analysis and variations
  Vary the payload size and show how the % overhead from the HiHAT APIs and glue code varies
  Use cudaMemcpyAsync with a range of stream counts, compared against blocking APIs only, for a range of batch sizes
  Make a subset of the transfers form a dependence chain, and compare all-blocking execution against placing dependent subsets in the same stream
  Compare cudaMemcpy(dest, src, size, cudaMemcpyDeviceToDevice) with cudaMemcpyPeer
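
For the cudaMemcpyAsync variation, a sketch of issuing a batch of host-to-device copies round-robin over a configurable number of streams; the host buffers are assumed to be pinned (per the Phase 1 caveat), which is what lets the asynchronous copies actually overlap. The stream count and batch size are the parameters being swept.

    #include <stddef.h>
    #include <cuda_runtime.h>

    /* Issue `batch` host->device copies spread over `num_streams` streams, then wait.
     * Compare against a loop of blocking cudaMemcpy calls on the same batch. */
    void batch_async_copies(void **dev_dst, void **pinned_src, size_t *sizes,
                            int batch, int num_streams)
    {
        cudaStream_t streams[16];             /* assume num_streams <= 16 for this sketch */
        for (int s = 0; s < num_streams; s++)
            cudaStreamCreate(&streams[s]);

        for (int i = 0; i < batch; i++)
            cudaMemcpyAsync(dev_dst[i], pinned_src[i], sizes[i],
                            cudaMemcpyHostToDevice, streams[i % num_streams]);

        for (int s = 0; s < num_streams; s++) {   /* dependent subsets would share a stream */
            cudaStreamSynchronize(streams[s]);
            cudaStreamDestroy(streams[s]);
        }
    }
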
Phase 2: simple data movement of non-HiHAT-allocated data
Objectives
  Enable wrapping existing memory in a buffer without allocating it through HiHAT
User layer APIs
  wrap_with_buffer(void* in_address, size_t size, int device_kind, int device_index)
  [same as Phase 1]
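
A sketch of what wrap_with_buffer might do for host memory that HiHAT did not allocate: record the range for the address-based dispatch and, optionally, pin it with cudaHostRegister so later moves behave like the Phase 1 path. The hihat_register_buffer helper and the return convention are hypothetical.

    #include <stddef.h>
    #include <cuda_runtime.h>

    /* Hypothetical registry shared with alloc_buffer (records address -> device). */
    void hihat_register_buffer(void *addr, size_t size, int device_kind, int device_index);

    #define HIHAT_CPU 0   /* assumed device-kind index, per the Phase 1 caveat */

    /* Wrap user-allocated memory in a buffer without allocating it through HiHAT. */
    int wrap_with_buffer(void *in_address, size_t size, int device_kind, int device_index)
    {
        if (device_kind == HIHAT_CPU) {
            /* Optionally pin the existing host range; relies on the Phase 1 caveat
             * that there is enough physical memory to pin all host-side memory. */
            if (cudaHostRegister(in_address, size, cudaHostRegisterDefault) != cudaSuccess)
                return -1;
        }
        hihat_register_buffer(in_address, size, device_kind, device_index);
        return 0;
    }
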
Phase 3: basic device enumeration
Objectives
  Enumerate and specify devices
User layer APIs
  query_num_device_kinds(int *out_num_device_kinds, int PE_index)
  query_device_kind(int *out_device_kind, int in_device_kind_index, int PE_index)
  query_num_devices_of_kind(int *out_num_devices_of_kind, int in_device_kind_index, int PE_index)
  [same as Phase 1]
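
A sketch of NV-backed glue for these enumeration calls, under the Phase 1 assumption of exactly two device kinds (CPU and GPU) on the local PE; the PE_index argument is ignored until Phase 5, and the return/error convention is assumed.

    #include <cuda_runtime.h>

    #define HIHAT_CPU 0   /* assumed device-kind encoding */
    #define HIHAT_GPU 1

    int query_num_device_kinds(int *out_num_device_kinds, int PE_index)
    {
        (void)PE_index;
        *out_num_device_kinds = 2;              /* CPU and GPU, per the Phase 1 caveat */
        return 0;
    }

    int query_device_kind(int *out_device_kind, int in_device_kind_index, int PE_index)
    {
        (void)PE_index;
        *out_device_kind = (in_device_kind_index == 0) ? HIHAT_CPU : HIHAT_GPU;
        return 0;
    }

    int query_num_devices_of_kind(int *out_num_devices_of_kind, int in_device_kind_index, int PE_index)
    {
        (void)PE_index;
        if (in_device_kind_index == HIHAT_CPU)
            *out_num_devices_of_kind = 1;                 /* treat the host as one device */
        else
            cudaGetDeviceCount(out_num_devices_of_kind);  /* number of visible GPUs */
        return 0;
    }
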
Phase 4: basic managed memory
Objectives
  Demonstrate support for managed memory
User layer APIs
  alloc_buffer(void** out_address, size_t size, int device_kind_index=MANAGED, int device_index=MANAGED) (on this PE)
Test app
  Choose cases that minimize the overhead of managed memory
NV-based implementations
  cudaMallocManaged
Analysis and variations
  Compare managed vs. non-managed memory
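
A sketch of the managed-memory branch of alloc_buffer, assuming MANAGED is a sentinel value for the device arguments; the signature follows the Phase 4 listing above (Phase 1's mem_kind argument is omitted). cudaMallocManaged backs the allocation, and the non-managed comparison cases continue to use cudaMalloc/cudaMallocHost plus explicit moves.

    #include <stddef.h>
    #include <cuda_runtime.h>

    #define MANAGED (-1)   /* assumed sentinel for the device arguments */

    /* Managed branch: one pointer usable on both host and device. */
    int alloc_buffer(void **out_address, size_t size,
                     int device_kind_index, int device_index)
    {
        if (device_kind_index == MANAGED && device_index == MANAGED) {
            if (cudaMallocManaged(out_address, size, cudaMemAttachGlobal) != cudaSuccess)
                return -1;
            return 0;   /* no explicit move_data needed; pages migrate on demand */
        }
        /* ... non-managed kinds fall through to the Phase 1 paths ... */
        return -1;
    }
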
Phase 5 (stretch): node to node transfers
Objectives
  Demonstrate node-to-node transfers
Caveats
  Presume that mpirun is used to create multiple ranks
  The default configuration uses MPI
User layer APIs
  init_communication() - initializes MPI and sets up the default communicator
  query_num_PEs(int *out_num_PEs)
  query_local_PE(int *out_PE_index)
  [same as Phase 1]
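
A sketch of MPI-backed defaults for the Phase 5 calls, assuming one MPI rank per PE (launched by mpirun) and MPI_COMM_WORLD as the default communicator set up by init_communication.

    #include <mpi.h>

    /* Initialize MPI (if the launcher has not already) and adopt MPI_COMM_WORLD. */
    int init_communication(void)
    {
        int initialized = 0;
        MPI_Initialized(&initialized);
        if (!initialized)
            MPI_Init(NULL, NULL);
        return 0;
    }

    int query_num_PEs(int *out_num_PEs)
    {
        return MPI_Comm_size(MPI_COMM_WORLD, out_num_PEs);
    }

    int query_local_PE(int *out_PE_index)
    {
        return MPI_Comm_rank(MPI_COMM_WORLD, out_PE_index);
    }
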
Phase 6 (stretch): node to node transfers on a selected transport
Objectives
  Select the network transport
User layer APIs
  config_network_transport(<transport type>)
Analysis and variations
  Vary the transport type, e.g. among MPI, UCX, and TCP
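
If <transport type> becomes an enum, config_network_transport might look like the sketch below; the enum values and the global selector are hypothetical and only mark the point where later node-to-node moves would dispatch (the actual UCX/TCP backends are out of scope here).

    /* Hypothetical transport selector; only the MPI path exists in the PoC. */
    typedef enum { TRANSPORT_MPI, TRANSPORT_UCX, TRANSPORT_TCP } transport_type_t;

    static transport_type_t g_transport = TRANSPORT_MPI;   /* Phase 5 default */

    int config_network_transport(transport_type_t transport)
    {
        switch (transport) {
        case TRANSPORT_MPI:
        case TRANSPORT_UCX:
        case TRANSPORT_TCP:
            g_transport = transport;   /* node-to-node moves dispatch on this later */
            return 0;
        default:
            return -1;                 /* unknown transport */
        }
    }
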
Future directions
  George Bosilca: transfers of multi-dimensional and non-dense data; the ability to use UCX
  John Stone: cudaMemAdvise for managed memory, cudaMemcpyToSymbol for constant memory
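
For reference, the two CUDA calls mentioned above look roughly like this in use; the managed buffer, the GPU id, and the __constant__ symbol are placeholders, not anything taken from VMD.

    #include <stddef.h>
    #include <cuda_runtime.h>

    __constant__ float coeffs[256];   /* placeholder constant-memory table */

    void tune_buffers(float *managed_buf, size_t bytes, const float *host_coeffs, int gpu)
    {
        /* Hint that the managed buffer will mostly be read from this GPU. */
        cudaMemAdvise(managed_buf, bytes, cudaMemAdviseSetReadMostly, gpu);
        cudaMemAdvise(managed_buf, bytes, cudaMemAdviseSetPreferredLocation, gpu);

        /* Stage a small lookup table into constant memory. */
        cudaMemcpyToSymbol(coeffs, host_coeffs, 256 * sizeof(float), 0,
                           cudaMemcpyHostToDevice);
    }
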
Ideas for lookups and trade-offs, with pros and cons
  Data direction, based on domain
    Requires introducing the notion of domains and APIs to associate addresses with domains
  Pin or not, based on reuse
    Differentiating between frequent and infrequent reuse, and motivating a case for not pinning, are somewhat contrived (see the sketch at the end of this section)
  Data mover, based on size
    CUDA handles this internally and does not offer a choice between DMA and aperture writes
  Multi-D xfers
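
As a concrete illustration of the "pin or not, based on reuse" idea above, a sketch of a heuristic that pins a host range with cudaHostRegister only after it has been used in enough transfers; the per-range use counter and the threshold are hypothetical.

    #include <stddef.h>
    #include <cuda_runtime.h>

    #define PIN_THRESHOLD 4   /* hypothetical reuse count at which pinning pays off */

    typedef struct {
        void  *base;
        size_t size;
        int    use_count;
        int    pinned;
    } host_range_t;

    /* Called before each transfer that touches this host range. */
    void maybe_pin(host_range_t *r)
    {
        r->use_count++;
        if (!r->pinned && r->use_count >= PIN_THRESHOLD) {
            if (cudaHostRegister(r->base, r->size, cudaHostRegisterDefault) == cudaSuccess)
                r->pinned = 1;   /* subsequent async copies to/from this range can overlap */
        }
    }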