Question: What are the biggest challenges for heterogeneous and distributed computing in C++ (DHPC++) for HPC? And how do we approach the solution?
What is the preferred way to expose massive parallelism: directive-based (OpenMP/ACC; in C++, likely through attributes), thread/task-based (TBB, fibers, C++11 threads), an explicit parallel API (C++ AMP/HCC, CUDA, SYCL, Kokkos, C++17 Parallel STL), streams (for DSPs and FPGAs operating over a continuous stream of data), or dispatch (Networking TS, using a runtime executor to run a function object on a remote device over the network)?
Is performance portability across heterogeneous devices really possible? Under what constraints?
Do you see this ever replacing MPI or OpenMP/ACC?
What is the preferred data movement model, if any: user-directed through an appropriate API (explicit, e.g. most models) or runtime-directed (implicit, e.g. SYCL, C++ AMP)?
Should C++ change its memory model from a flat, single cache-coherent address space (OpenMP/ACC, HSA, OpenCL 2.x, C/C++) to a multiple hierarchical memory model (SYCL, C++ AMP, OpenCL 1.x), or to some other form such as a non-coherent single address space (CUDA)?
Is there a preferred way of compiling source code for a sub-architecture/node: separate source (OpenCL C/C++, GLSL), single source (CUDA, OpenMP/ACC, SYCL, C++ AMP), or embedded DSL expressions constructed through C++ operator overloading (RapidMind, Halide)? Should we support separate host and device compilers instead of a monolithic toolchain?
Should we design heterogeneous C++ programming models to accommodate hardware that lacks support for existing C++ features (e.g. virtual functions, unified memory, C++11 atomics, hardware floating point), or simply restrict support to more capable devices?
Parallel patterns: should we make greater use of them? Can they be used to build an intuitive API?
What challenges do you face in adopting modern C++ (C++11/14/17) in HPC? What’s missing, what doesn’t work, what works better in other languages, what works well, and what would you ask the standards committee to do for HPC?
C++ lacks support for affinity and locality of reference. How should we add this support, or should we add it at all? Should an offload node just be expressed as an affinity problem?
Should C++ prioritize standardizing heterogeneous computing, distributed computing, or both? Which is the higher priority?
How important is ISO standardized support for heterogeneous compute, as opposed to third-party extensions such as OpenMP or CUDA?
What about complexity of the language and programmer productivity?
What steps should we take to enable concurrent execution across heterogeneous computing resources, e.g. across a single STL data collection? How optimistic are you that we can accomplish this?
Which features might you want in C++, e.g. channels or pipelines?
How should we describe dimension/shape?
What about data placement, layout, memory space, and execution space?
Is this still important given that Linux will support HMM and Nvidia has Unified Virtual Memory?
How important is exception handling for DHPC++ computing? Or should we use exception-less error handling?
Given that self-driving cars are supercomputers on wheels, do you feel we need to enhance DHPC++ for safety and security, i.e. in a modular fashion (HiHat)?
Could you tell us what you think about HPX as a C++ Standard Library for Concurrency and Parallelism?
Follow-up to (3): Do you see portability layers (e.g. Kokkos) being required as HPC platforms diverge?
How far are we from having a true C++ distributed, parallel runtime and library which could compete with and beat MPI?
How about providing a guideline to help people who are used to writing scientific Fortran code convert it to modern C++ without harming performance?
How do we express vector constructs such as "vector permute" without resorting to (non-portable) intrinsics?
What are your thoughts on embedded domain-specific languages for parallelism?
Is heterogeneous computing a subcase of distributed computing? Can we combine them?
Asynchronous algorithms and futures: how important are they to DHPC++?
What is the feature of your work that you feel is the most important to add to a future ISO C++ DHPC++?
Would you want this work to be done for a TS or straight into ISO C++? How would you feel if your work disappears into ISO C++?
Should asynchrony have first-class support in C++, e.g. for allocation, invocation, data movement, error handling?
Should we support real-time dynamic onlining and offlining of devices (plastic parallelism), and how would we do that? What about selection based on energy consumption, safety, and security requirements?
What is the easiest programming model to learn and use among yours?
Is there a chance that we won't end up with several standards for single-source C++ programming, each locked to a specific vendor (CUDA, AMD HC, SYCL)?
Is SYCL easier than OpenCL?
What future does SYCL have in the HPC market without any support from major GPU vendors?