1 of 60

Introduction to Parallel Computing

Mardonbek Jumanazarov

2 of 60

Plan:

  • 1. Why do we need parallel computing?
  • 2. Why Should You Learn Parallel Computing?
  • 3. Fundamental Laws of Parallel Computing
  • 4. How does parallel computing work?
  • 5. Classification of parallel approaches
  • 6. Parallel strategies
  • 7. Parallel Acceleration vs. Comparative Acceleration: Two Different Measures
  • 8. Exercises

3 of 60

Why do we need parallel computing?

In today's world, you will face many challenges that require extensive and efficient use of computing resources. Most of the applications that require performance are traditionally in the scientific field. However, artificial intelligence (AI) and machine learning applications are projected to become the predominant users of large-scale computing. A few examples of such applications include:

  • mega-fire simulations to assist fire brigades and the public;
  • tsunami and storm surge modelling from hurricanes (see Chapter 13 for a simple tsunami model);
  • voice recognition for computer interfaces;
  • modeling the spread of viruses and the development of vaccines;
  • modeling of climatic conditions for decades and centuries;
  • image recognition for autonomous vehicle technology;
  • equipping emergency crews with working simulations of hazards such as flooding;
  • reducing the power consumption of mobile devices.

4 of 60

Why do we need parallel computing?

With the technology described in this book, you can handle larger tasks and data sets while running simulations ten, a hundred, or even a thousand times faster. Typical applications leave much of the computing power of today's computers unused. Parallel computing is the key to unlocking the potential of your computing resources. So, what is parallel computing, and how can you use it to speed up your applications?

Parallel computing is the execution of many operations at a single point in time. Exploiting parallel computing fully is not automatic; it requires some effort from the programmer. First of all, you must identify and expose the potential for concurrency in the application. Exposing concurrency means confirming that it is safe to execute the operations in any order as system resources become available. In parallel computing there is an additional requirement: these operations must occur at the same time. For this to happen, you must also deploy the right resources to execute them simultaneously.

Parallel computing introduces new difficulties that do not exist in the sequential world. We need to change our thought processes to adapt to the additional complexities of parallel execution, but with practice this becomes second nature. This book begins your discovery of how to access the power of parallel computing.

5 of 60

Why do we need parallel computing?

Life presents many examples of parallel processing, and these examples often become the basis for computational strategies. Figure 1.1 shows a supermarket checkout line, where the goal is to get customers to quickly pay for the goods they want to purchase. This can be done by hiring multiple cashiers to check out customers one at a time. Skilled cashiers can complete the checkout process quickly, so customers leave the store faster. Another strategy is to deploy many self-service checkouts and let customers carry out the process themselves. This strategy requires fewer of the supermarket's human resources and opens up more checkout lanes. Customers may not be able to check out as efficiently as a trained cashier, but more customers can check out at once thanks to the increased concurrency, resulting in shorter lines.

6 of 60

Why do we need parallel computing?

7 of 60

Why do we need parallel computing?

We solve computational problems by developing algorithms: a list of steps to achieve the desired result. In the supermarket analogy, the self-service checkout process is an algorithm: it includes unloading items from the cart, scanning the items for prices, and paying for them. If there are many customers, the overall shopping algorithm contains parallelism that can be exploited: in theory, there is no dependency between any two customers going through the checkout process. By using multiple checkout lines or self-checkout stations, supermarkets exploit this parallelism, thereby increasing the rate at which shoppers buy goods and leave the store. Each choice in how we implement this concurrency has different costs and benefits.

DEFINITION Parallel computing is the practice of identifying and uncovering parallelism in algorithms, expressing it in our software, and understanding the costs, benefits, and limitations of the chosen implementation.

Ultimately, parallel computing is all about PERFORMANCE. This includes not only speed but also the size of the task and energy efficiency. Our goal in this book is to give you an understanding of the breadth of the modern field of parallel computing and to introduce enough of the most commonly used languages, techniques, and tools for you to confidently take on a parallel computing project. The decision on how to incorporate concurrency is often made at the beginning of a project, and a well-thought-out design is an important step on the road to success. Shying away from the design step can lead to problems much later. It is equally important to keep expectations realistic and to know the resources available and the nature of the project.

8 of 60

Why Should You Learn Parallel Computing?

The future will be parallel. The growth in sequential performance has reached a plateau as processor designs have hit the limits of miniaturization, clock frequency, power, and even heat. Figure 1.2 shows the trends in clock frequency (the rate at which an instruction can be executed), power consumption, number of compute cores (or just cores), and hardware performance over time for commodity processors.

In 2005, the number of cores increased sharply from one to several. At the same time, clock frequency and power consumption leveled off. Theoretical performance has continued to improve steadily because performance is proportional to the product of clock frequency and number of cores. This shift toward more cores rather than higher clock frequencies means that the best central processing unit (CPU) performance can only be achieved through parallel computing.

9 of 60

Why Should You Learn Parallel Computing?

[Figure 1.2: Trends in single-thread performance, clock frequency (MHz), power (W), and number of cores for commodity processors over time.]

10 of 60

Why Should You Learn Parallel Computing?

Modern consumer computing hardware comes with multiple central processing units (CPUs) and/or graphics processing units (GPUs) that can process many sets of instructions simultaneously. These small systems often rival the computing power of the supercomputers of twenty years ago. Making full use of your computing resources (laptops, workstations, smartphones, and so on) requires you, as a programmer, to have a working knowledge of the tools used to write parallel applications. You also need to understand the hardware features that enable parallelism.

Because there are many different parallel hardware features, they create new challenges for the programmer. One such feature is hyper-threading, introduced by Intel. Two instruction queues that alternate use of the hardware logic units allow a single physical core to appear as two cores to the operating system (OS). Vector processors are another hardware feature that began to appear in commodity processors around 2000; they execute several operations at once. The bit width of a vector unit (also called a vector processing engine) determines how many data values can be operated on at the same time. Thus, a 256-bit vector unit can simultaneously operate on four 64-bit (double-precision) values or eight 32-bit (single-precision) values.

Hyper-threading is Intel's term for simultaneous multithreading (SMT), in which the CPU presents each of its physical cores as two virtual cores. For example, most Intel processors with two physical cores use hyper-threading to provide four hardware threads.

11 of 60

Why Should You Learn Parallel Computing?

12 of 60

Why Should You Learn Parallel Computing?

Parallel computing can reduce time to solution, improve the energy efficiency of your application, and allow you to solve larger problems on your current hardware. Today, parallel computing is no longer the sole domain of the largest computing systems. This technology is now present on all desktop computers or laptops, and even on portable devices. It allows each software developer to create parallel software on their local systems, thereby greatly expanding the possibilities for new applications.

Cutting-edge research in both industry and academia is opening up new areas for parallel computing as interest expands from scientific computing to machine learning, big data, computer graphics, and consumer applications. The emergence of new technologies such as self-driving cars, computer vision, voice recognition, and artificial intelligence requires large computational capabilities both inside the consumer devices and on the development side, where massive training datasets must be processed. And in scientific computing, long the exclusive domain of parallel computing, new and exciting opportunities are emerging. The proliferation of remote sensors and portable devices can feed data into larger, more realistic calculations that better inform decision-making related to natural and man-made disasters.

It should be remembered that parallel computing is not an end in itself. Rather, the goals are what results from parallel computing: reducing execution time, performing larger calculations, or reducing power consumption.

Time to solution is the length of time between pre-processing and post-processing of data when a task is solved.

13 of 60

Why Should You Learn Parallel Computing?

Faster Execution Time with More Compute Cores

Reducing application execution time, or speeding it up, is often considered the primary goal of parallel computing, and indeed it usually has the greatest impact. Parallel computing can accelerate compute-intensive, multimedia-intensive, and big data operations, whether your application takes days or even weeks to run or needs to deliver results in real time.

In the past, a programmer would spend much effort on sequential optimizations to squeeze out improvements of a few percent. Now there is the potential for improvements of orders of magnitude, with numerous options to choose from. This creates a new challenge: there are more possible parallel approaches to explore than any one programmer has time for. But a thorough knowledge of your application and an understanding of the opportunities for concurrency will lead you down a clearer path to reducing its execution time.

14 of 60

Why Should You Learn Parallel Computing?

Larger task sizes with more compute nodes

By exposing parallelism in your application, you can scale the size of your problem up to dimensions that are not accessible to a sequential application. This is because the amount of compute resources determines what can be done, and parallelism lets you work with more resources, opening up possibilities that could not be considered before. Larger sizes are achieved through more main memory, disk storage, network and disk bandwidth, and central processing units (CPUs). As in the supermarket example mentioned earlier, exposing parallelism is equivalent to hiring more cashiers or opening more self-checkout counters to serve a larger, growing number of customers.

Energy Efficiency: Doing More with Less

One of the newer areas where parallel computing has an impact is energy efficiency. With the arrival of parallel resources in handheld devices, parallelism can accelerate applications so the device can return to sleep mode sooner, and it makes it possible to use slower but more parallel processors that consume less power. Offloading heavy media applications to GPUs can therefore significantly improve energy efficiency as well as performance. The net result of exploiting parallelism is reduced power consumption and longer battery life, a strong competitive advantage in this market niche.

15 of 60

Why Should You Learn Parallel Computing?

Another area where energy efficiency matters is remote sensors, network devices, and field-deployed equipment such as remote weather stations. Often lacking large power supplies, these devices must operate in small packages with limited resources. They are part of a trend called edge computing: moving the computation to the very edge of the network makes it possible to process data at its source, compressing it into a smaller result set that is easier to send over the network.

Accurately calculating the energy costs of an application is challenging without direct power measurements. However, you can estimate the cost by multiplying the manufacturer's thermal design power by the runtime of the application and the number of processors used. Thermal design power (TDP) is the rate at which energy is consumed under typical operating loads. The power consumption for your application can be estimated using the formula:

P = (N processors) × (R watts/processor) × (T hours),

where P is the power consumption, N is the number of processors, R is the thermal design power, and T is the application execution time.
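The estimate is easy to compute directly. Below is a minimal C sketch of the formula, with the processor count, TDP, and run time as placeholder values you would replace with your own:

#include <stdio.h>

/* Estimated energy use in kilowatt-hours: P = N processors x R watts x T hours. */
double energy_kwh(int n_processors, double tdp_watts, double hours) {
    return n_processors * tdp_watts * hours / 1000.0;   /* watt-hours -> kWh */
}

int main(void) {
    /* placeholder values: 20 processors at 120 W running for 24 hours */
    printf("estimated energy: %.2f kWh\n", energy_kwh(20, 120.0, 24.0));
    return 0;
}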

16 of 60

Why Should You Learn Parallel Computing?

  • Example

A 16-core Intel Xeon E5-4660 processor has a thermal design power (TDP) of 120 watts. Assume that your application uses 20 of these processors for 24 hours before shutting down. The estimated power consumption for your application is:

P = (20 Processors) × (120 watts/Processors) × (24 hours) = 57.60 kWh.

In general, GPUs have a higher thermal design power than modern CPUs, but can potentially reduce the runtime or require only a few GPUs to achieve the same result. The same formula as before can be applied, where N is now considered the number of GPUs.

17 of 60

Why Should You Learn Parallel Computing?

Example

Let's assume you have ported your application to a multi-GPU platform and can now run it on four NVIDIA Tesla V100 GPUs in the same 24 hours. The Tesla V100 GPU has a thermal design power of 300 watts. The estimated power consumption for your application is:

P = (4 GPUs) × (300 watts/GPUs) × (24 hours) = 28.80 kWh.

In this example, the GPU-accelerated application uses half the energy of the CPU-only version. Note that even though the time to solution remains the same, the energy cost is cut in half!

Achieving these energy savings with accelerators such as GPUs requires identifying enough parallelism in the application to make efficient use of the device's resources.

18 of 60

Why Should You Learn Parallel Computing?

Parallel computing can reduce costs

Actual monetary costs are becoming an increasing concern for software development teams, software users, and researchers alike. As the size of applications and systems grows, we need to perform cost-benefit analyses on the resources we have. For upcoming large high-performance computing (HPC) systems, for example, energy costs are projected to be three times higher than the cost of purchasing the hardware.

Usage costs have also driven the adoption of cloud computing as an alternative, which is increasingly used in academia, start-ups, and industry. Cloud providers typically bill based on the type and amount of resources used and the length of time they are used. Although GPUs are generally more expensive than CPUs per unit of time, for some applications GPU accelerators reduce the execution time enough, relative to the CPU cost, to lower the overall bill.

19 of 60

Why Should You Learn Parallel Computing?

Caveats Related to Parallel Computing

Parallel computing is not a panacea. Many applications are neither large enough nor long-running enough to require it. In addition, porting applications to multi-core and many-core (GPU) hardware requires a dedicated effort that can temporarily divert attention from direct research or product goals. You should first decide whether the time and effort are justified. It is always more important to get an application working and producing the desired result before accelerating it or scaling it up to larger problems.

We strongly recommend that you start your parallel computational project with a plan. It's important to know the options that are available to speed up your application and then choose the best one for your project. After that, it is essential to have a reasonable estimate of the effort expended and the potential benefits (in terms of dollar cost, energy consumption, time to solution and other indicators that may be important). In this chapter, we will begin to provide you with upfront knowledge and skills for decision-making on parallel computing projects.

20 of 60

Fundamental Laws of Parallel Computing

In sequential computing, all operations speed up as the clock frequency increases. In contrast, with parallel computing we must think through and modify our applications to take full advantage of the parallel hardware. Why does the amount of parallelism matter so much? To understand this, let's look at the laws of parallel computing.

The Limit on Parallel Computing: Amdahl's Law

We need a way to calculate the potential speedup of a computation based on the fraction of the code that is parallel. This can be done using Amdahl's law, proposed by Gene Amdahl in 1967. It describes the speedup of a fixed-size problem as the number of processors increases. In the following equation, P is the parallel fraction of the code, S is the sequential fraction (so that P + S = 1), and N is the number of processors:

SpeedUp(N) = 1 / (S + P/N).

Amdahl's law emphasizes that no matter how fast we make the parallel part of the code, we are always limited by the sequential part. Figure 1.3 shows this limit. This scaling of a fixed-size problem is called strong scaling.
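To get a feel for this limit, here is a small C sketch that tabulates Amdahl's law for an assumed parallel fraction of P = 0.9 (so S = 0.1):

#include <stdio.h>

int main(void) {
    double P = 0.9, S = 1.0 - P;                 /* assumed parallel and sequential fractions */
    for (int N = 1; N <= 1024; N *= 2)
        printf("N = %4d   speedup = %6.2f\n", N, 1.0 / (S + P / N));
    /* as N grows, the speedup approaches 1/S = 10 no matter how many processors we add */
    return 0;
}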

21 of 60

Fundamental Laws of Parallel Computing

DEFINITION Strong scaling represents the time to solution as a function of the number of processors for a problem with a fixed aggregate size.

22 of 60

Fundamental Laws of Parallel Computing

Overcoming the Parallel Limit: The Gustafson–Barsis Law

In 1988, Gustafson and Barsis pointed out that parallel code runs should increase the size of the task as more processors are added. This gives us an alternative approach to calculating the potential acceleration of our application. If the size of the task grows in proportion to the number of processors, then the acceleration is now expressed as

SpeedUp(N) = N – S × (N – 1),

where N is the number of processors and S is the sequential fraction, as before. The consequence is that a larger problem can be solved in the same amount of time by using more processors, which provides additional opportunities for parallelism. Increasing the problem size along with the number of processors makes sense, because the user of the application wants to benefit not only from the additional processing power but also from the additional memory. The run-time scaling for this scenario, shown in Figure 1.4, is called weak scaling.
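Here is the same kind of sketch for the Gustafson–Barsis scaled speedup, again assuming a sequential fraction S = 0.1:

#include <stdio.h>

int main(void) {
    double S = 0.1;                              /* assumed sequential fraction */
    for (int N = 1; N <= 1024; N *= 2)
        printf("N = %4d   scaled speedup = %7.1f\n", N, N - S * (N - 1));
    /* the scaled speedup keeps growing with N because the problem size grows too */
    return 0;
}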

23 of 60

Fundamental Laws of Parallel Computing

24 of 60

Fundamental Laws of Parallel Computing

DEFINITION Weak scaling represents the time to solution as a function of the number of processors for a problem whose size per processor is fixed.

Figure 1.5 clearly shows the difference between strong and weak scaling. The weak-scaling argument is that keeping the size of the computational grid constant on each processor makes efficient use of the added processors' resources. From the standpoint of strong scaling, all attention is focused on speeding up the calculation. In practice, both strong and weak scaling are important because they address different user scenarios.

The term "scalability" is often used to refer to the ability to add more parallelism to the hardware or and the existence of a cumulative limit of possible improvement. While the traditional focus is on run-time scaling, we argue that memory scaling is often more important.

25 of 60

Fundamental Laws of Parallel Computing

26 of 60

How does parallel computing work?

Parallel computing requires combining an understanding of the hardware, the software, and the parallelism in your application. It is more than just message passing or threading. Modern hardware and software provide many different ways to parallelize your application, and some of these can be combined for even greater efficiency and speedup.

It is important to understand the parallelism in your application and how the different hardware components allow you to expose it. In addition, you should recognize that there are additional layers between your source code and the hardware, including the compiler and the operating system (Figure 1.7).

As a developer, you are responsible for the application software layer, which contains your source code. In your source code you choose the programming language and the parallel programming interfaces used to exploit the underlying hardware, and you decide how to break your work into parallel units. The compiler translates your source code into instructions that your hardware can execute.

27 of 60

How does parallel computing work?

The operating system then schedules those instructions and controls their execution on the computer hardware.

28 of 60

How does parallel computing work?

We will show you the process of introducing parallelization into an algorithm using a prototype application. This process takes place in the application software layer, but requires an understanding of the computer hardware. For now, we will refrain from discussing the choice of compiler and operating system. We will gradually add each layer of parallelization so that you can see how it works. In each parallel strategy, we will explain the impact of the available hardware on the choices made. The purpose of this is to demonstrate how hardware functionalities affect parallel strategies. We classify the parallel approaches that a developer can adopt into:

  • process-based parallelization;
  • thread-based parallelization;
  • vectorization;
  • stream processing.

As mentioned, we classify the parallel approaches a developer can adopt into process-based parallelization, thread-based parallelization, vectorization, and stream processing. Process-based parallelization uses separate processes with their own memory spaces, which can be distributed across different nodes of a computer cluster or reside within a single node. Stream processing is typically associated with GPUs. A model of modern hardware and application software will help you understand how to plan the port of your application to today's parallel hardware.

29 of 60

How does parallel computing work?

Step-by-step introduction to the sample application

For the purposes of this introduction to parallelization, we will look at the data parallelism approach. This is one of the most common uses of parallel computing. We will perform calculations on a spatial computational grid consisting of a regular two-dimensional (2D) lattice of rectangular elements or cells. The steps (summarized here and described in detail later) to create a spatial computational mesh and prepare for the calculation are as follows:

1. Discretize (subdivide) the problem into smaller cells or elements.

2. Define a computational kernel (operation) to be performed on each element of the computational mesh.

3. Add the following layers of parallelization on CPUs and GPUs to perform the calculation:

– vectorization – working on more than one data item at a time;

– threads – deploying more than one compute pathway to engage more processing cores;

– processes – separate program instances to spread the calculation across separate memory spaces;

– offloading the calculation to GPUs – sending the data to the graphics processor to be computed.

We start with a two-dimensional problem domain covering a region of space. For illustration purposes, we use a 2D image of the Krakatau volcano (Figure 1.8). The goal of our calculation might be to simulate the volcano's ash plume, to model the resulting tsunami, or to detect an eruption early using machine learning. For all of these options, computational speed is critical if the results are to inform decisions in real time.

30 of 60

How does parallel computing work?

31 of 60

How does parallel computing work?

Step 1. Discretize the Problem into Smaller Cells or Elements

For any detailed calculation, we must first break the problem domain into smaller pieces (Figure 1.9). This process is called discretization. In image processing, these are often just the pixels of a bitmapped image; for a simulation, they are cells that cover the region being modeled. The data value for each cell can be an integer, a single-precision floating-point value, or a double-precision floating-point value.

32 of 60

How does parallel computing work?

Step 2. Define the Computational Kernel (Operation) to Be Performed on Each Mesh Element

Calculations on this discretized data often take the form of a stencil operation, so called because it uses a pattern of adjacent cells to calculate the new value for each cell. This can be an average (a blur operation that smooths the image), a gradient (an edge-detection operation that sharpens the edges in the image), or a more complex operation associated with solving physical systems described by partial differential equations (PDEs). Figure 1.10 shows a stencil operation, in the form of a five-point stencil, that performs a blur using a weighted average of the stencil values.

But what are these partial differential equations? Let's go back to our example and imagine that this time it is a color image made up of separate red, green, and blue arrays, the RGB color model. The term "partial" means that there is more than one variable and that we separate the change of red in space and time from the changes of green and blue. We then apply the blur operator separately to each of these colors.
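As a concrete sketch of the five-point blur stencil described above (a minimal, serial C version on a small hypothetical mesh, with boundary cells left untouched):

#include <stdio.h>

#define NX 8
#define NY 8

int main(void) {
    double x[NY][NX] = {{0.0}}, xnew[NY][NX] = {{0.0}};
    x[NY/2][NX/2] = 100.0;                      /* a single bright cell to blur */

    for (int j = 1; j < NY - 1; j++) {
        for (int i = 1; i < NX - 1; i++) {
            /* weighted average of the cell and its four neighbors (weights sum to 1) */
            xnew[j][i] = 0.5 * x[j][i]
                       + 0.125 * (x[j][i-1] + x[j][i+1] + x[j-1][i] + x[j+1][i]);
        }
    }

    printf("blurred center value: %g\n", xnew[NY/2][NX/2]);
    return 0;
}

For a color image, the same loop would be run separately on the red, green, and blue arrays.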

33 of 60

How does parallel computing work?

There is one more ingredient: we need to specify a rate of change in time and space. In other words, red might diffuse at one rate and green and blue at another. This could be done to produce a special effect in the image, or it could describe how real dyes bleed and merge in a photographic image during development. If, instead of red, green, and blue, we had mass and the x and y velocities, and we added a little more physics, we could compute the motion of a wave or an ash plume.

34 of 60

How does parallel computing work?

Step 3. Vectorization: Working on More Than One Data Item at a Time

What is vectorization? Some processors have the ability to operate on more than one piece of data at a time; this capability is called vector operations. The shaded boxes in Figure 1.11 illustrate how multiple data values are processed simultaneously in the vector processing module in the processor with a single instruction in a single clock cycle.

35 of 60

How does parallel computing work?

Step 4. Threads: Deploying More Than One Compute Pathway to Engage More Cores

Since most CPUs today have at least four processing cores, we use threading to engage those cores, operating on four rows of the mesh at a time. This process is shown in Figure 1.12.

Step 5. Processes: Spreading the Calculation Across Separate Memory Spaces

We can further divide the work across two desktop computers, often referred to as nodes in parallel processing. When the work is split across nodes, the memory spaces of the nodes are distinct and separate. This is indicated by the gap between the rows in Figure 1.13.

Even for this rather modest hardware scenario, there is a potential 32x speedup. This is illustrated by the following data:

2 desktops (nodes) × 4 cores × (256-bit wide vector processing engine)/(64-bit double-precision) = 32x potential speedup.

If we take a high-end cluster with 16 nodes, 36 cores per node, and a 512-bit vector unit, the potential theoretical speedup is 4,608-fold compared to the sequential process:

16 nodes × 36 cores × (512-bit wide vector processing engine)/(64-bit double-precision) = 4,608x potential speedup.

36 of 60

How does parallel computing work?

37 of 60

How does parallel computing work?

Step 6. Offloading the Calculation to the GPUs

The GPU is another hardware resource for exploiting parallelism. With GPUs we can use a large number of streaming multiprocessors (SMs). For example, Figure 1.14 shows the work divided into 8×8 tiles. Using the hardware specifications of the NVIDIA Volta GPU, these tiles can run on 32 double-precision cores in each of 84 streaming multiprocessors, giving a total of 2,688 double-precision cores operating simultaneously. If each node of a 16-node cluster has one GPU with 2,688 double-precision cores, that is a 43,008-way parallelism across the 16 GPUs.

38 of 60

How does parallel computing work?

These numbers are impressive, but for now we must temper expectations by acknowledging that the actual acceleration lags far behind this full potential. Now our task is to organize such extreme and disparate layers of parallelization in order to achieve as much acceleration as possible.

In this step-by-step introduction at a high level, we have omitted many important details that we will consider in later chapters. But even this nominal level of detail highlights several strategies for exposing the parallelism in an algorithm. To develop similar strategies for other problems, you need to understand modern hardware and software. We'll now look more closely at the current hardware and software models. These conceptual models are simplified representations of diverse real-world hardware, which lets us avoid complexity and preserve commonality across rapidly evolving systems.

39 of 60

How does parallel computing work?

Hardware Model for Modern Heterogeneous Parallel Systems

To gain a basic understanding of how parallel computing works, we'll walk through the components of modern hardware. To begin with, dynamic random-access memory (DRAM) stores information, or data. A processor core, or simply core, performs arithmetic operations (addition, subtraction, multiplication, division), evaluates logical statements, and loads and stores data from DRAM. When an operation is performed on data, the instructions and data are loaded from memory into the core, processed, and stored back to memory. Modern CPUs, often referred to simply as processors, have many cores capable of performing these operations in parallel. Systems with accelerator hardware, such as GPUs, are also becoming common. GPUs have thousands of cores and a memory space separate from the CPU's DRAM.

The combination of one or two processors, DRAM, and possibly an accelerator makes up a compute node, which might be a single home desktop computer or one rack-mounted unit of a supercomputer. Compute nodes can be connected to each other by one or more networks, sometimes called the interconnect. Conceptually, a node runs a single operating system instance that manages and controls all of the hardware resources. Because the hardware is becoming more complex and heterogeneous, we start with simplified models of the system's components to make each one clearer.

40 of 60

How does parallel computing work?

Distributed Memory Architecture: A Cross-Node Parallel Method

One of the first and most scalable approaches to parallel computing is the distributed memory cluster (Figure 1.15). Each CPU has its own local memory, consisting of DRAM, and is connected to other CPUs by a communication network. The good scalability of distributed memory clusters is due to their seemingly limitless ability to include a larger number of nodes.

This architecture also provides a degree of memory locality by dividing the total addressable memory into smaller subspaces for each node; this forces the programmer to explicitly access different memory regions. The disadvantage of this architecture is that the programmer must manage the subdivision of the memory spaces from the very beginning of application development.

41 of 60

How does parallel computing work?

Shared Memory Architecture: An On-Node Parallel Method

An alternative approach attaches two CPUs directly to the same shared memory (see Figure 1.16). The strength of this approach lies in the fact that processors share the same address space, which simplifies programming. But it introduces potential memory conflicts, which lead to problems with correctness and performance. Synchronizing memory access and values between CPUs or processing cores on a multi-core CPU is complex and expensive.

Adding more CPUs and processing cores does not increase the amount of memory available to the application, and the synchronization overhead limits the scalability of shared memory architectures.

42 of 60

How does parallel computing work?

Vector Units: Multiple Operations with a Single Instruction

Why not just keep increasing the CPU clock frequency to get more performance, as was done in the past? The biggest limitation is that a higher clock frequency requires more power and generates more heat. Whether it is a supercomputing (HPC) center limited by its installed power lines or a mobile phone with a limited battery, every device today has a power limit. This problem is called the power wall.

Instead of increasing the clock frequency, why not perform more than one operation per cycle? That is the idea behind the revival of vectorization in many processors. With vectorization we process more data in a single clock cycle than a sequential processor can. The power required for multiple operations is virtually the same as for a single one, and the reduction in execution time can lower the application's overall energy consumption. Much like a four-lane freeway, which lets four cars move side by side where a single-lane road allows only one, a vector operation provides higher processing throughput. Indeed, the four pathways through the vector unit, shown in different shades in Figure 1.17, are commonly referred to as vector lanes.

Most CPUs and GPUs have some form of vectorization or equivalent operations. The amount of data processed in one clock cycle, the vector length, depends on the size of the vector units on the processor. Currently, the most common vector length is 256 bits. If the discretized data consists of 64-bit double-precision values, we can perform four floating-point operations simultaneously as one vector operation. As shown in Figure 1.17, the hardware vector units load a block of data at once, perform a single operation on that data, and then store the result.

43 of 60

How does parallel computing work?

44 of 60

How does parallel computing work?

Accelerator Device: A Special-Purpose Add-On Processor

An accelerator is a discrete piece of hardware designed to perform specific tasks quickly. The most common accelerator is the GPU. When used for computing, this device is sometimes referred to as a general-purpose GPU (GPGPU). The GPU contains a lot of processing cores, called streaming multiprocessors (SMs). Although SM processors are simpler than a CPU core, they provide a huge amount of processing power. Usually, a small integrated GPU can be found directly on the CPU.

Most modern computers also have a separate discrete GPU connected to the CPU by a peripheral interface bus (PCI bus) (Figure 1.18). This bus increases the cost of transmitting data and commands, but a discrete card is often more powerful than an integrated device. For example, in high-end systems, NVIDIA uses NVLink, and AMD Radeon uses its Infinity Fabric to reduce data transfer costs, but these costs are still significant. We'll discuss this interesting GPU architecture in more detail in Chapters 9-12.

45 of 60

How does parallel computing work?

A common heterogeneous parallel architectural model

Now let's combine all of these different hardware architectures into a single model (Figure 1.19). Two nodes each have two CPUs that share the same DRAM. Each CPU is a dual-core processor with an integrated GPU, and a discrete GPU on the PCI bus is attached to one of the CPUs. Although the CPUs share main memory, they typically occupy separate non-uniform memory access (NUMA) regions. This means that accessing the other CPU's memory is more expensive than accessing its own.

Throughout this discussion of hardware, we have used a simplified model of the memory hierarchy, showing only DRAM, or main memory. We show the cache in the combined model (Figure 1.19), but without details of its composition or how it functions; we leave the complexities of memory management, including the multiple cache levels, to Chapter 3. The purpose of this hardware model is to help you identify the components you have available so that you can choose the parallel strategy best suited to your application and hardware.

46 of 60

How does parallel computing work?

Application/Software Model for Modern Heterogeneous Parallel Systems

The software model for parallel computing is necessarily shaped by the underlying hardware, but it is distinct from it; the operating system provides the interface between the two. Parallel operations do not happen by themselves: the source code must indicate how to parallelize the work, by spawning processes or threads, by offloading data, work, and instructions to a compute device, or by operating on blocks of data at the same time. The programmer must first expose the parallelism, determine the best technique to exploit it, and then explicitly direct its operation in a safe, correct, and efficient manner. The following are the most common parallelization techniques; we look at each of them in turn.

  • Process-based parallelization – message passing.
  • Thread-based parallelization – sharing data through memory.
  • Vectorization – multiple operations with a single instruction.
  • Stream processing – through specialized processors.

Process-Based Parallelization: Message Passing

The message-passing approach was developed for distributed memory architectures and uses explicit messages to move data between processes. In this model, your application spawns separate processes, called ranks in message passing, each with its own memory space and instruction pipeline (see Figure 1.20). The figure also shows that the processes are handed to the OS to be placed on processors. The top part of the diagram is user space, where the user has permission to operate; the part below it is kernel space, which is protected from unsafe operations by the user.

47 of 60

How does parallel computing work?

48 of 60

How does parallel computing work?

Keep in mind that processors – CPUs – have multiple processing cores that are not equivalent to processes. Processes are the concept of the operating system, and processors are the hardware component. Any number of processes that the application spawns are scheduled to run by the operating system for the processing cores. In fact, you can run eight processes on your quad-core laptop and they will simply switch between processing cores. For this reason, mechanisms have been developed to indicate to the operating system how to place processes and whether to "bind" the process to the processing core. Controlling binding is discussed in more detail in Chapter 14.

To move data between processes, you must program explicit messages in your application. These messages can be sent over a network or through shared memory. In 1992, many message-passing libraries were consolidated into a standard called the Message Passing Interface (MPI). MPI has the advantage of being scalable beyond a single node, and you will find many different implementations of the MPI library.
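A minimal C sketch of message passing with MPI follows (assuming an MPI installation; compile with mpicc and launch with, for example, mpirun -n 4). Each rank computes its own value, and the results are combined on rank 0 with an explicit communication call:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* total number of ranks */

    int value = rank * rank;                 /* each rank computes its own piece */
    int sum = 0;

    /* explicit message passing: combine the per-rank results on rank 0 */
    MPI_Reduce(&value, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of squares from %d ranks = %d\n", nprocs, sum);

    MPI_Finalize();
    return 0;
}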

Distributed Computing vs. Parallel Computing

Some parallel applications use a lower-level approach to parallelization called distributed computing. We define distributed computing as a set of loosely coupled processes that communicate using operating system-level calls. Although distributed computing is a subset of parallel computing, the distinction is important to understand. Examples of distributed computing applications include peer-to-peer networks, the World Wide Web, and Internet mail. The Search for Extraterrestrial Intelligence (SETI@home) is just one example of many scientific distributed computing applications. Each process typically resides on a separate node and is created by the operating system using something like a remote procedure call (RPC) or network protocol. The processes then communicate by passing messages between processes using interprocess communication (IPC), of which there are several varieties. Simple parallel applications often use a distributed computing approach, but often with the help of a higher-level language such as Python and specialized parallel modules or libraries.

49 of 60

How does parallel computing work?

Thread-Based Parallelization: Sharing Data Through Memory

A thread-based approach to parallelization spawns separate instruction pointers within the same process (Figure 1.21). As a result, portions of the process's memory are easily shared between threads, but this comes with pitfalls for correctness and performance. It is up to the programmer to determine which sections of the instruction stream and which data are independent and can safely be threaded. These considerations are discussed in more detail in Chapter 7, where we look at OpenMP, one of the leading threading systems. OpenMP provides the ability to spawn threads and distribute work among them.

There are a wide variety of approaches to threading, from heavyweight to lightweight, managed by user space or operating system. Although threading systems are limited to scaling within a single node, they are an attractive option for moderate acceleration. However, the memory limits of a single node have more serious consequences for the application.
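A minimal C sketch of the threading approach with OpenMP is shown below (compile with an OpenMP-capable compiler, for example gcc -fopenmp). The threads share the arrays through memory, and the loop iterations are divided among them:

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* spawn a team of threads; each thread handles a chunk of the iterations,
       and all threads share a, b, and c through memory */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[0] = %g, using up to %d threads\n", c[0], omp_get_max_threads());
    return 0;
}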

50 of 60

How does parallel computing work?

Vectorization: Multiple Operations with a Single Instruction

Vectorizing an application is much more cost-effective than expanding compute resources in an HPC center, and it is absolutely necessary on portable devices such as mobile phones. With vectorization, the work is done in blocks of 2-16 data elements at a time. A more formal term for this classification of operations is "single instruction, multiple data" (SIMD). The term SIMD is often used when referring to vectorization. SIMD is just one of the categories of parallel architectures that will be discussed later in Section 1.4.

Vectorization can be requested through pragmas (hints to the compiler) or left to automatic compiler analysis, and both depend heavily on the compiler's capabilities (Figure 1.22). In addition, without explicitly specified compiler flags, the generated code targets the least capable processor and vector length, which greatly reduces the effectiveness of vectorization.
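As a small illustration, here is a C loop with an OpenMP SIMD pragma hinting that the iterations can be packed into vector lanes (a sketch; compile with, for example, gcc -O3 -fopenmp-simd -march=native so the compiler targets the host's vector length):

#include <stdio.h>

#define N 1024

int main(void) {
    double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 2.0 * i; }

    double a = 3.0;
    #pragma omp simd          /* hint: several iterations per vector instruction */
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[N-1] = %g\n", y[N - 1]);
    return 0;
}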

51 of 60

How does parallel computing work?

Stream Processing Through Specialized Processors

Stream processing is a dataflow concept in which a stream of data is processed by a simpler, special-purpose processor. Long used in embedded computing, the technique was adapted to process large sets of geometric objects for computer displays in a dedicated processor, the GPU. GPUs came to include a broad set of arithmetic operations and many streaming multiprocessors (SMs) to process geometric data in parallel. Scientific programmers soon found ways to adapt stream processing to large sets of simulation data, such as cells, extending the role of the GPU to the GPGPU.

Figure 1.23 shows the data and a compute kernel being offloaded over the PCI bus to the GPU for computation. GPUs are less general-purpose than CPUs, but where their specialized functionality can be exploited, they provide exceptional computing power at lower power requirements. Other specialized processors also fit into this category, but we will focus on GPUs in our discussions.
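One portable way to express this offload from C is with OpenMP target directives (a sketch, assuming OpenMP 4.5+ and a compiler built with GPU offload support; without such support the region simply runs on the host):

#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* map the inputs to the device, run the loop there, and copy the result back */
    #pragma omp target teams distribute parallel for map(to: a, b) map(from: c)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %g\n", c[N - 1]);
    return 0;
}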

52 of 60

Classification of parallel approaches

If you read more about parallel computing, you will come across acronyms such as SIMD (single instruction, multiple data) and MIMD (multiple instruction, multiple data). These terms refer to the categories of computer architecture proposed by Michael Flynn in 1966, in what has since become known as Flynn's taxonomy. The classification is based on whether the instructions and the data are single or multiple (Figure 1.24), and it helps you think about the potential parallelism in an architecture. Keep in mind that while the taxonomy is useful, some architectures and algorithms do not fit neatly into one category. It is useful, for example, for recognizing that SIMD architectures can have difficulty with conditionals, because each data element may need a different block of code while all lanes must execute the same instruction.

53 of 60

Classification of parallel approaches

In the case where there is more than one sequence of instructions, the category is called multiple instruction, single data (MISD). This architecture is not common; the best example is redundant computation on the same data, used in highly fault-tolerant approaches such as spacecraft controllers. Because spacecraft are exposed to high radiation, they often run two copies of each calculation and compare the results.

Vectorization is a prime example of SIMD, in which the same command is executed on numerous data elements. The SIMD variant is "single instruction, multi-thread" (SIMT), which is widely used to describe the working groups of GPUs.

The last category, with parallelism in both instructions and data, is called MIMD. It describes the multi-core parallel architectures that make up the majority of large parallel systems.

54 of 60

Parallel strategies

So far, in our example from Section 1.3.1, we have looked at data parallelism over cells or pixels, but data parallelism can also be applied to particles and other data objects. Data parallelism is the most common approach and often the easiest. In essence, each process executes the same program but operates on a unique subset of the data, as shown in the upper right of Figure 1.25. The data-parallel approach has the advantage that it scales well as the problem size and the number of processors grow.

Another approach is task parallelism. This includes master-worker, pipeline, and bucket-brigade strategies, also shown in Figure 1.25. The pipeline approach is used in superscalar processors, where address and integer calculations are done by a separate logic unit rather than by the floating-point unit, allowing these calculations to proceed in parallel. The bucket brigade (like a chain of people passing buckets of water to a fire) uses each processor to operate on and transform the data in a sequence of operations. In the master-worker approach, one processor schedules and hands out tasks to all the workers, and each worker checks in for the next work item as it returns its completed task. It is also possible to combine different parallel strategies to expose a greater degree of parallelism.

55 of 60

Parallel strategies

56 of 60

Parallel Acceleration vs. Comparative Acceleration: Two Different Measures

Throughout this book, we present a number of performance and acceleration comparisons. The term "acceleration" (speedup) is often used to compare two different execution times, with little explanation or context for understanding what it means. Acceleration is an umbrella term used in many contexts, for example to quantify the effect of an optimization. To clarify the difference between the two main categories of parallel performance measurement, we define two distinct terms.

- Parallel acceleration. More fully, this is parallel acceleration relative to a sequential baseline: the speedup is measured against a baseline sequential run on a standard platform, usually a single CPU. The parallel acceleration might come from running on a GPU, or from running OpenMP or MPI across all the cores of a node of the computer system.

- Comparative acceleration. More fully, this is a comparative acceleration between architectures. Typically, it is a performance comparison between two parallel implementations or between otherwise limited sets of hardware; for example, between a parallel MPI implementation on all the cores of a node and the GPU(s) on that node.

57 of 60

Parallel Acceleration vs. Comparative Acceleration: Two Different Measures

These two categories of performance comparison serve two different purposes. The first, parallel acceleration, tells you how much speedup can be obtained by adding a particular type of parallelism; it is not a meaningful comparison between architectures. For example, comparing a GPU run time to a sequential CPU run is not a fair comparison between a multi-core CPU and a GPU. Comparative accelerations between architectures are more appropriate when comparing a multi-core CPU to the performance of one or more GPUs on a node.

In recent years, such comparisons have increasingly been normalized, so that relative performance is compared at similar power or energy requirements rather than for an arbitrary node. Still, there are so many different architectures and possible combinations that almost any performance number can be produced to support a desired conclusion: you can pair a fast GPU with a slow CPU, or compare a quad-core CPU with a 16-core CPU. We therefore suggest adding one of the following parenthetical qualifiers to performance comparisons to give them more context.

58 of 60

Parallel Acceleration vs. Comparative Acceleration: Two Different Measures

- Add "(best of 2016)" to each term. For example, Parallel Acceleration (best in 2016) and Comparative Acceleration (best in 2016) indicate that the comparison is between the best hardware released in a particular year (in this example, 2016), where you can compare a high-end GPU to a high-end CPU.

- Add "(generally available in 2016)" or "(2016)" if two architectures were issued in 2016 but are not equipped with the highest class. This is relevant for developers and users who have more mass components than in top-end systems.

- Add "(2016 Mac)" if the GPU and CPU were released in a 2016 Mac laptop or desktop, or something similar for other fixed-component brands within a certain period of time (in this example, 2016). This type of performance comparison is useful for users of a public system.

- Add "(GPU 2016:CPU 2013)" to indicate that there is a possible discrepancy in the year of manufacture of the hardware (in this example, 2016 compared to 2013) for the components being compared.

- Add no qualifier to the comparison, and who knows what the numbers mean?

Because of the rapid proliferation of CPU and GPU models, performance comparisons are inevitably more apples-to-oranges than a well-defined metric. But for more formal comparisons, we should at least state the nature of the comparison so that others can better understand what the numbers mean and so that we are fairer to the hardware vendors.

59 of 60

What will you learn in this book?

This book is written with the application code developer in mind, and no prior knowledge of parallel computing is assumed. You just have to want to increase the performance and scalability of your application. Applications include scientific computing, machine learning, and big data analysis in systems ranging from desktops to the largest supercomputers.

In order to fully benefit from this book, readers should be experienced programmers, preferably proficient in a compiled HPC language such as C, C++ or Fortran. We also assume basic knowledge of hardware architectures. In addition, readers should be familiar with computer technology terms such as bits, bytes, operations, cache, RAM, etc. It is also useful to have a basic understanding of the functions of the operating system and how it manages and interacts with hardware components. After reading this book, you will gain several skills, including:

- determining when message passing (MPI) is more appropriate than threading (OpenMP) and vice versa;

- estimating how much speedup is possible with vectorization;

- recognizing which parts of your application have the most potential for acceleration;

- deciding when it is beneficial to use a GPU to accelerate your application;

- establishing the maximum potential performance of your application;

- estimating the energy cost of your application.

Even after this first chapter, you should feel comfortable with different approaches to parallel programming. We invite you to work through the exercises in each chapter that will help you integrate many of the concepts we introduce. If you're starting to feel a little overwhelmed by the complexity of modern parallel architectures, you're not alone. It is difficult to grasp all the opportunities at once. In the following chapters, we will break them down piece by piece to make it easier for you.

60 of 60

Exercises

1 What are other examples of parallel operations in your daily life? How would you classify your example? What do you think the parallel design is optimized for? Can you calculate the parallel acceleration for this example?

2 What is the theoretical parallel processing power of your system (whether desktop, laptop or mobile phone) compared to its sequential processing power? What types of parallel equipment are present in it?

3 What parallel strategies do you see in the example with paying for purchases in the store in Fig. 1.1? Are there any current parallel strategies that are not shown? What about the examples in Exercise 1?

4 You have an image processing application that needs to process 1,000 images, each 4 mebibytes (MiB, 2^20 or 1,048,576 bytes) in size, per day. Your cluster consists of multi-core nodes with 16 cores and 16 gibibytes (GiB, 2^30 bytes, or 1,024 mebibytes) of main memory per node. (Note that we use the correct binary terms MiB and GiB rather than MB and GB, which are metric terms for 10^6 and 10^9 bytes, respectively.)

a Which parallel processing design best handles this workload?

b Consumer demand now increases 10-fold. Can your design handle it? What changes would you have to make?

5 The Intel Xeon processor E5-4660 has a thermal design power of 130W; this is the average power consumption when using all 16 cores. The NVIDIA Tesla V100 GPU and AMD MI25 Radeon GPU have an estimated thermal power of 300 watts. Suppose you port your software to use one of these GPUs. How much faster does your GPU application have to run to be considered more energy efficient than your application with a 16-core CPU?