|#||Question||X10||Swift (/K and /T)*||UPC||Habanero-C (HC)||CAF 2.0||SLEEC||SWARM/SCALE||Legion||Charm++||UPC++||MPI||XPRESS|
|1||Is there support for specific science domains, e.g., high-level array abstractions, other data structures, tensors, stencils, etc.?||Support for multi-dimensional arrays over a variety of regions and distributions. Stencil computations can be described compactly using regions and iterations over regions.||No. Swift is a compact, implicitly parallel, functional dataflow language for programming the outer loops of large-scale apps. It is pointerless: all data atoms are "futures" with write-once synchronization semantics. Arrays are multi-dimensional hash tables and can be sparse (and indexed by strings in /K). Arrays and structures can be passed and returned.||No, UPC is a general-purpose language. However, "global random access" may be of domain-specific interest.||No. Habanero-C is a general-purpose language. However, it offers new features --- data-driven tasks (DDTs) and partitioned global name space (PGNS) --- that open new opportunities for mapping onto exascale and extreme-scale systems. The DDTs are well suited for DAG parallelism and futures, and the PGNS model is well suited for resilience. In addition, Habanero-C can be used to support higher-level programming models, with early demonstrations of CnC on HC and its mapping onto clusters and heterogeneous nodes (CPUs, GPUs, FPGAs).||No. CAF 2.0 is a general-purpose language.||Yes. SLEEC's goal is to turn programs that use domain libraries into programs that "automatically" use DSLs. If the library operates on tensors/matrices, then SLEEC "supports" those.||No. SWARM is a general-purpose runtime. SCALE is a general-purpose language that makes programming to the codelet execution model easier.||No. Legion is a low-level, general-purpose programming model. There is direct support for partitioning of multidimensional arrays, and for arbitrary user-defined partitioning strategies.||Charm++ itself is general purpose. Support for specific science domains is provided only in libraries or frameworks above Charm++ (e.g., for unstructured meshes, structured AMR, etc.).||Support for multidimensional arrays over dense and strided domains. Multidimensional arrays are not distributed but can be accessed remotely. Also supports 1D shared arrays with block-cyclic distribution.|
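The region/stencil style described in the X10 answer can be sketched in plain Python (a stand-in, not X10 syntax; the `Region` class and its methods are invented for this illustration):

```python
from itertools import product

class Region:
    """A dense rectangular index space, loosely modeled on X10-style regions."""
    def __init__(self, *bounds):          # bounds: one (lo, hi) pair per dimension, inclusive
        self.bounds = bounds
    def __iter__(self):
        # Iterate over every index tuple in the region.
        return product(*(range(lo, hi + 1) for lo, hi in self.bounds))
    def inner(self, halo=1):
        """Shrink the region by `halo` on every side (the stencil-safe interior)."""
        return Region(*((lo + halo, hi - halo) for lo, hi in self.bounds))

# A 1-D 3-point averaging stencil written compactly as an iteration over a region.
a = {i: float(i) for (i,) in Region((0, 9))}
b = {i: (a[i - 1] + a[i] + a[i + 1]) / 3.0 for (i,) in Region((0, 9)).inner()}
# b covers only the interior indices 1..8, so no boundary checks are needed.
```

The point of the abstraction is that the stencil body never mentions loop bounds; changing the region (or its distribution, in the real languages) leaves the computation unchanged.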
|2||Does the language support programming both within and between compute nodes on the systems (or, for example, is this a language for X in an MPI+X strategy)? If the language handles both, does it distinguish between on-node and between-node parallelism support?||X10 supports the full scale of programming problems -- within a node, across nodes, and across accelerators (GPUs, others).||Swift programs typically run complete apps (serial or parallel) or high-level functions as "leaf" functions. Swift+MPI, Swift+OpenMP, and Swift+MPI+OpenMP are intended strategies.||Yes, UPC can be run on top of threads within a node (or processes with shared memory support) or across nodes.||Yes, Habanero-C can be run both within and across nodes. The PGNS model supports a range of distributions across nodes for computation and data, with a uniform interface for on-node and between-node parallelism. In addition, the Hierarchical Place Tree (HPT) model provides a range of options for controlling locality and load balance within a node. Finally, HC offers MPI programmers an MPI+X migration path for adding on-node parallelism using HCMPI library calls.||CAF 2.0 is a SPMD model intended for running across compute nodes. One can use function shipping to create multi-threaded parallelism within and across nodes. Supporting parallel loops within compute nodes with a work-stealing runtime system is envisioned.||N/A. SLEEC is not a parallelism language; that must be provided by the original program.||Yes, SWARM supports both. SCALE will have inter-node language constructs soon, but currently can use direct SWARM calls to asynchronously invoke remote codelets.||Yes. Legion has runtime support for parallelism between nodes and between the components of a node, including among CPU cores and between cores and accelerators.||Yes, Charm++ is a complete model that runs on multi-node systems with multiple cores per node. Charm++ objects commonly direct their communication to other Charm++ objects (as logical entities). The runtime does location management and delivers messages within or across nodes, so in this default mode programmers do not distinguish between within-node and across-node parallelism or communication. However, primitives exist that allow programmers to distinguish such communication (e.g., node-group objects for within-node communication). As another option, one can define objects that span an entire node and exploit parallelism within a node using special constructs.||Yes. UPC++ supports computation across nodes and can be used with process shared memory to support on-node parallelism. It can also interoperate with shared memory libraries such as OpenMP.|
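The location-transparent delivery described in the Charm++ answer can be sketched as a toy Python model (the `Runtime`, `register`, and `send` names are invented, and "nodes" here are only labels; a real runtime would marshal the call across the network):

```python
class Runtime:
    """Toy location manager: maps logical object names to (node, object)."""
    def __init__(self):
        self.directory = {}
    def register(self, name, node, obj):
        self.directory[name] = (node, obj)
    def send(self, name, method, *args):
        # The sender names only the logical object; the runtime resolves
        # its location, so local and "remote" invocations look identical.
        node, obj = self.directory[name]
        return getattr(obj, method)(*args)

class Counter:
    def __init__(self):
        self.n = 0
    def add(self, k):
        self.n += k
        return self.n

rt = Runtime()
rt.register("c0", node=0, obj=Counter())
rt.register("c1", node=3, obj=Counter())   # "remote" object, same calling convention
rt.send("c0", "add", 2)
rt.send("c1", "add", 5)
```

Because placement lives only in the directory, the runtime is free to migrate objects between nodes without changing any caller.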
|3||What types of parallelism are supported? Is there data parallelism intended both for on-node SIMD hardware and global data parallelism (spread over nodes)? Is there dynamic task parallelism, and if so, both between and within nodes? Do they rely on some form of static parallelism, e.g., SPMD?||X10 supports fine-grained asynchrony, period. With this one can support task parallelism, fork-join, event-based computation, asynchronous data transfer (RDMAs), one-sided communication, active messaging, etc. SIMDization is not directly supported -- the capabilities of the downstream compiler are relied on.||Swift is pervasively task-parallel: every function call evaluation is implicitly a task (with many low-level task invocations eliminated by the compiler). Swift/T programs run as MPI SPMD programs that use ADLB for task (and data) distribution.||UPC uses SPMD parallelism, with collective communication for data-parallel style programming. Task programming is possible through libraries on top of UPC.||Habanero-C is founded on dynamic task parallelism both within and across nodes. Forasync constructs with barriers (phasers) can also be used to express SPMD parallelism. Further, automatic generation of OpenCL from forasync enables mapping onto on-node data-parallel hardware, including GPUs and SIMD vectors.||CAF 2.0 uses SPMD parallelism. One can create process subsets known as teams. Dynamic parallelism is supported with function shipping.||N/A. See above.||SWARM is based on task parallelism. However, within a task, there may be SIMD parallelism (i.e., an OpenCL/CUDA codelet).||Legion has task-based parallelism that is detected and scheduled dynamically. Legion supports nested parallelism: tasks may spawn parallel subtasks, supporting both hierarchical, divide-and-conquer style computations and MPI- and multithreaded-style parallel computations (where a single task directly spawns a large number of subtasks). The Legion runtime also seeks to overlap data transfers and computation wherever possible. Fine-grain data-parallel computations (i.e., vector processing) are only supported through the use of vector intrinsics; coarse-grain data-parallel computations can still be scheduled with low overhead directly in Legion.||Charm++ supports both data parallelism and task parallelism. The former is supported by object collections (called chare-arrays), and the latter by dynamic creation of singleton objects (which can create other such objects). The latter leads to "tasks" (or "seeds" for chare objects), which are balanced using a seed balancer.||UPC++ primarily uses SPMD parallelism, with collective communication for data-parallel style programming. It also supports X10/Habanero-style asyncs for dynamic task parallelism, with task queues located on each rank.|
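The dynamic, nested task parallelism several of these answers describe -- spawn a child task, then join it, in the spirit of async/finish -- can be sketched with Python's stdlib thread pool (the sequential cutoff is an assumption of this sketch, not a feature of any of the models):

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=16)

def fib(n, cutoff=8):
    """Each call may spawn child tasks ('async') and join them ('finish')."""
    if n < 2:
        return n
    if n < cutoff:                             # below the cutoff, stay sequential
        return fib(n - 1, cutoff) + fib(n - 2, cutoff)
    left = pool.submit(fib, n - 1, cutoff)     # async: dynamically spawned child task
    right = pool.submit(fib, n - 2, cutoff)
    return left.result() + right.result()      # finish: wait for both children

result = fib(10)   # 55
```

The cutoff keeps the number of simultaneously blocked tasks well below the worker count; without some such bound, naive nested blocking joins can exhaust a fixed-size pool.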
|4||What types of synchronization exist in the languages? What features exist to reduce the overhead of synchronization or to avoid over-synchronizing code? Are there assumptions about particular hardware-supported atomic operations or synchronization?||X10 has extensive support for non-blocking operations. It supports conditional atomic blocks. It also supports the use of locks at lower levels in the stack (e.g., the runtime), so X10 programmers have access to them if they need them. A lot of HPC code runs with async/finish parallelism. X10 clocks support multi-phased computations. Each phase may be split into two -- thus a registered activity may signal arrival at a barrier separately from waiting for the arrival of all other activities at the barrier.||Implicit synchronization occurs with data flow; explicit synchronization is rare. Tasks (function calls) can be given a priority to schedule longer tasks earlier. A wait() statement can be used to force synchronization on a future (/T). We have an optimizer that is often able to eliminate data flow synchronization by rearranging code.||Barriers, split-phase barriers, and locks. There are proposals/prototypes for atomics (local and remote), but nothing official (due to lack of consistent hardware support).||Habanero-C currently includes the following kinds of synchronization: 1) finish, 2) async-await (DDTs), and 3) phasers. Additional primitives from Habanero-Java are in the process of being added to Habanero-C, including 4) futures, 5) nested atomic/isolated blocks with "delegated isolation" support, and 6) actors.||Events, locks, cofence. Team synchronization includes barriers, finish, and collectives including broadcast, reduce, allreduce, gather, allgather, scatter, scan, shift, alltoall. Support for atomics is envisioned.||N/A, though we plan to extend SLEEC's domain knowledge to include knowledge of synchronization performed inside libraries.||Within a coherent cache domain: 1) dependences (i.e., when this is satisfied N times, spawn codelet X), 2) codelet chaining (when this codelet is complete, run this other codelet), 3) put/get resources (latches, barriers, locks, futures, etc.). Across nodes, we use remote codelet invocations (with continuation codelets) and implement collectives above that. In SCALE, there is a myproc.enter(args) to spawn a procedure and a '=>' operator to define what to run next. To attach a codelet to a dependence, you can do my_dep.init(5)=>continuation_codelet(args). To just spawn something, a 'do' keyword allows do => my_codelet(args). A myProc.remoteEnter(node_id, args) adds internode support.||Legion relies on dynamic dependence analysis to determine where synchronization is required between tasks. To make this runtime analysis inexpensive, it is done at the granularity of user-specified partitions of the data (called regions). If two tasks conflict on regions that overlap, then the Legion runtime will require that the one that would execute later in the sequential execution order wait until the earlier task has completed. At the lowest level, these dependences are represented by an event system. The primitive units of Legion execution (tasks and data movement operations) can take events as preconditions and can trigger events as postconditions. Synchronization is expressed by operations that must wait for certain events to complete before they can execute. By chaining together sequences of operations using events, the system builds a runtime dataflow graph that is dynamically scheduled onto the hardware. Note that events are internal to Legion and not available to the user, who thinks of synchronization in terms of dependences between tasks. Legion also provides relaxed coherence modes where only atomicity rather than strict ordering of tasks is required; these are implemented with a novel kind of lock that composes well with the event system.||For collectives, Charm++ supports asynchronous reductions: each object contributes to the reduction, and the result is sent to a user-specified callback at a later time. Each object executes methods as soon as invocations are available. (Method invocations are asynchronous, with no return values, and so are similar to messages.) However, an object might specify additional synchronization and ordering using the sdag (structured dagger) notation.||UPC++ includes barriers and other collectives, locks, async-await, and async-finish synchronization. Team collectives are also supported.|
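The event-chaining scheme the Legion answer describes -- operations take events as preconditions and trigger events as postconditions -- can be sketched minimally in Python. The `Event` and `launch` names are invented here; real Legion events are internal to the runtime and never exposed to the user:

```python
class Event:
    """An event that, once triggered, fires its registered continuations."""
    def __init__(self):
        self.fired = False
        self.waiters = []
    def on_trigger(self, fn):
        # Run fn immediately if the event already fired, else defer it.
        if self.fired:
            fn()
        else:
            self.waiters.append(fn)
    def trigger(self):
        self.fired = True
        for fn in self.waiters:
            fn()
        self.waiters = []

def launch(op, precondition):
    """Run `op` only after its precondition fires; return its postcondition event."""
    done = Event()
    precondition.on_trigger(lambda: (op(), done.trigger()))
    return done

# Chaining a data movement operation and a dependent task into a dataflow graph.
log = []
start = Event()
e1 = launch(lambda: log.append("copy A->B"), start)   # data movement
e2 = launch(lambda: log.append("task T(B)"), e1)      # task depending on the copy
start.trigger()
```

Each operation's postcondition becomes another operation's precondition, so the chain of `launch` calls is exactly the runtime dataflow graph the text describes.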
|5||How is communication between tasks handled? Can arbitrary communication be performed, or is it limited by task structure, type constraints, or some hardware features (e.g., shared memory within nodes, but not between)? Is there global communication (e.g., collectives)?||X10 supports arbitrary communication within tasks; that is the point of shared memory. X10 also has elegant support for map-reduce using collecting finish, where tasks spawned within the control of a finish can send results back to the finish, where the results are combined with a reducer. This has been implemented for a few years, and it implements some of the idioms discussed in the recent "communication-avoiding algorithms" work. Collectives implemented in lower levels of the stack (e.g., in PERCS hardware) are also available within the language.||The type system separates private and shared space, and within the shared space any thread may read/write the data. Communication is done with one-sided put/get (or simply as load/store when hardware support exists). There are global collectives (broadcast, reduce, etc.).||Habanero-C is designed to interoperate with different communication runtimes, e.g., MPI, GASNet. HC's focus is on non-blocking communications, though blocking communications can also be supported. The HC runtime includes a "communication worker" to interface with a given communication runtime. The on-node runtime supports asynchronous collectives using phasers and phaser accumulators.||Communication is done with one-sided get/put. CAF 2.0 supports a full range of collectives. Split-phase implementations of all collectives are envisioned; the only split-phase collective implemented at present is broadcast.||N/A||Communication between tasks is handled through the chaining of codelets, or through dataflow futures. Communication across the network is handled through remote codelet invocation with an input parameter: nw_call(node_id, codelet, input, chain_codelet, chain_codelet_context). The codelet will be invoked on the remote node with the given input.||Communication is expressed by passing regions to subtasks; when a subtask and the parent task are scheduled in different places, the data needed by the subtask is moved from the location of the parent to the location of the subtask. While there are no collectives per se, programming idioms achieve the effect of standard collective operations.||Any object can communicate with any other object, as long as it has its name (collection name and index for chare-arrays). Collective communication, including asynchronous broadcasts and reductions, on whole object collections or their defined subsets (called sections), is also supported. Some methods inside an object may be declared as "threaded", which leads to creation of a user-level thread for executing the method when invoked. These threaded methods can access "futures", which can be set remotely (thus providing another form of communication).||Depending on the implementation, UPC++ may have separate private and shared spaces, or everything may be shared. Within the shared space any thread may read/write the data. Communication is done with one-sided put/get or through asyncs, which communicate arguments and return values. There are global and team collectives (broadcast, reduce, etc.).|
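The one-sided put/get style described in the UPC, CAF 2.0, and UPC++ answers can be emulated with a block-distributed array in Python (a single-process toy; the `GlobalArray` class and its block distribution are assumptions of this sketch):

```python
class GlobalArray:
    """Block-distributed 1-D array: rank r owns indices [r*block, (r+1)*block)."""
    def __init__(self, nranks, block):
        self.block = block
        self.store = [[0] * block for _ in range(nranks)]  # one local slab per rank
    def owner(self, i):
        return divmod(i, self.block)        # (owning rank, local offset)
    def put(self, i, v):
        # One-sided: the owner takes no action; the initiator writes directly.
        r, off = self.owner(i)
        self.store[r][off] = v
    def get(self, i):
        r, off = self.owner(i)
        return self.store[r][off]

g = GlobalArray(nranks=4, block=8)
g.put(0, 11)     # local to rank 0
g.put(17, 42)    # lands in rank 2's slab at offset 1, with no action by rank 2
```

The defining property is visible in `put`/`get`: the initiator computes the target's location itself, so no matching receive or handler runs on the owning rank.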
|6||What other novel features exist for managing energy, resilience, reproducibility, or other systems features?||We are working on an Air Force-funded project to introduce resilience in X10. We are also going in the direction of resilient/approximate computing, developing a theoretical framework that will help analyze X10 code for sensitivity (continuity analysis) and hence permit a systematic invocation-time tradeoff between energy, performance, and accuracy.||Failing functions can be automatically retried by the runtime system (in /K). Swift programs are deterministic by construction (/T).||None so far.||Support for resilience and reproducibility is in progress for an implementation of the CnC model on HC. This includes a project on asynchronous checkpointing of CnC programs.||None so far.||We have developed static/dynamic systems to do domain-specific communication optimization for GPU offloading.||None at this point. However, we are designing the framework to be able to handle resiliency and power management.||Legion programs express the structure of parallelism and data, but do not commit to any specific policy about where the computation and data are placed in the machine. This is handled by a separate mapper, which is part of the runtime system. The mapper decides where a task should be placed, where its data should be stored, whether task stealing is permitted, etc. Mappers can be written by the programmer, providing domain-specific knowledge about the best way to execute a particular program.||Charm++, as a programming language, specifies only the objects and their interactions; even load balancing is left to the runtime system. Thus, the extensive support Charm++ provides for energy-related optimizations and resilience is implemented by Charm++'s adaptive runtime system (see the entry in the runtime component).||None so far.|
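The mapper idea in the Legion answer -- placement policy separated from program structure, and swappable by the programmer -- can be sketched as a pair of policy functions (all names, the processor list, and the "hot task" set are invented for this illustration):

```python
def round_robin_mapper(task_id, processors):
    """Default policy: spread tasks evenly across processors."""
    return processors[task_id % len(processors)]

def affinity_mapper(task_id, processors, hot=frozenset({2, 5})):
    """Domain-specific policy: pin 'hot' tasks to the first processor
    (assumed here to be an accelerator), falling back to round-robin."""
    if task_id in hot:
        return processors[0]
    return round_robin_mapper(task_id, processors)

# The program only names tasks 0..5; the mapper alone decides placement,
# so swapping mappers changes performance without changing the program.
procs = ["gpu0", "cpu0", "cpu1"]
placement = {t: affinity_mapper(t, procs) for t in range(6)}
```

Replacing `affinity_mapper` with `round_robin_mapper` in the comprehension changes where every task runs while leaving the task set itself untouched, which is the separation of concerns the Legion entry describes.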
|* /K -> original Swift; /T -> HPC Swift (X-Stack)|