1 of 41

Multiprocessors and Thread-Level Parallelism (Part 1)

Chapter 5

Appendix F

Appendix I


Prof. Iyad Jafar

https://sites.google.com/view/iyadjafar

iyad.jafar@ju.edu.jo


Outline

  • Introduction (5.1)
  • Multiprocessor Architecture
  • Challenges in Parallel Processing
  • Centralized Shared Memory Architectures (5.2)
  • Performance of SMP (5.3)


Introduction


Introduction


(Diagram: technology improvement and RISC led to new architectures and organizations; power and ILP limitations then drove the move to multi-processor.)

  • The switch to multi-core came because single-core speed-ups had become inefficient (due to heat, power, and ILP limits).

  • Multi-core architectures allow continued performance growth by exploiting parallelism.


Introduction

  • The limitations of single-processor performance were foreseen many years ago.


Introduction

  • Why multiprocessors?
    • Increased costs of silicon and energy to exploit ILP.
    • Increasing single-desktop performance has become less important.
    • Advantage of replication rather than unique design.
    • Improved understanding of how to use multiprocessors effectively, especially in servers.
      • Significant natural parallelism in large datasets, scientific codes, and independent requests.
    • Growing interest in high end servers for cloud computing and SaaS.
    • A growth in data-intensive applications.


Multiprocessors

  • Tightly coupled processors whose coordination and usage are typically controlled by a single operating system and that share memory through a shared address space.
    • 2-32 processors.
  • Single-chip system (multicore) or multiple multicore chips.
  • Multiprocessors exploit thread-level parallelism in two software models
    • Parallel processing
      • Execute tightly-coupled threads that collaborate on a single task.
    • Request-level parallelism (RLP)
      • Execute multiple independent processes.
      • Single program or multiple applications (multiprogramming).
  • Multicomputers (Ch. 6)


Multiprocessor Architectures

  • Classification is based on the number of processors, which directly dictates the memory organization.


  • Symmetric Shared-Memory Multiprocessors (SMPs)
    • Small number of cores (<32 cores).
    • A centralized shared memory that all processors have access to.
    • Share single memory with uniform latency (UMA)
      • all processors have a uniform latency from memory


Multiprocessor Architectures

  • Distributed Shared Memory Multiprocessors (DSMs)
    • Larger number of processors.
    • Memory distributed among processors to increase bandwidth and reduce latency to local memory.
    • Non-uniform memory access/latency (NUMA).
    • Processors connected via direct (switched) and non-direct (multi-hop) interconnection networks.


Multiprocessor Architecture

  • The term shared memory in both architectures implies that threads communicate with each other through the same memory address space.

  • In other words, any processor can reference any memory location as long as it has access rights.

  • In DSM, the distributed memory adds communication complexity and overhead.


Challenges - Limited Parallelism in Programs

  • Example. Suppose you want to achieve a speedup of 80 with 100 processors. What fraction of the original computation can be sequential?

  • How this can be addressed?
    • Using new algorithms that offer better parallel performance.
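Solving Amdahl's law for this example gives at most about 0.25% sequential work; a minimal sketch of the calculation (the function and variable names are my own):

```python
def max_sequential_fraction(target_speedup, n_processors):
    """Solve Amdahl's law, speedup = 1 / ((1 - f) + f / n),
    for the parallel fraction f, and return the sequential
    fraction 1 - f that still permits the target speedup."""
    f_parallel = (1 - 1 / target_speedup) / (1 - 1 / n_processors)
    return 1 - f_parallel

# Speedup of 80 on 100 processors: at most ~0.25% may be sequential.
print(f"{max_sequential_fraction(80, 100):.4%}")  # prints 0.2525%
```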


Check other examples on page 374 and 375


Challenges - Communication Overhead

Example. Suppose we have an application running on a 32-processor multiprocessor, which takes 200 ns to handle a reference to remote memory. Assume that:

    • All references except those involving communication hit in the local memory hierarchy, which is slightly optimistic.
    • Processors are stalled on a remote request, and the processor clock rate is 3.3 GHz.

If the base CPI (assuming that all references hit in the cache) is 0.5, how much faster is the multiprocessor if there is no communication versus if 0.2% of the instructions involve a remote communication reference?


Challenges

  • Communication Overhead
  • Example.

CPIcom = CPIideal + remote request rate × remote access cost

At 3.3 GHz, the cycle time is 1/3.3 ≈ 0.303 ns, so a 200 ns remote reference costs 200 / 0.303 ≈ 660 cycles.

= 0.5 + 0.002 × 660

= 0.5 + 1.32

= 1.82

Speedup = 1.82 / 0.5 ≈ 3.6

The multiprocessor with all references local is about 3.6 times faster
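The arithmetic can be checked with a short sketch (the function name is my own; note that multiplying nanoseconds by GHz yields cycles directly):

```python
def effective_cpi(base_cpi, remote_rate, remote_ns, clock_ghz):
    """CPI when a fraction of instructions stall for the full
    latency of a remote memory reference."""
    penalty_cycles = remote_ns * clock_ghz   # ns * cycles/ns = cycles
    return base_cpi + remote_rate * penalty_cycles

cpi = effective_cpi(0.5, 0.002, 200, 3.3)    # 200 ns is ~660 cycles
print(round(cpi, 2), round(cpi / 0.5, 2))    # prints 1.82 3.64
```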

  • How to address?
    • HW 🡪 caching shared data and multithreading
    • SW 🡪 structure the data to make accesses local


SMP Architectures


Intel Nehalem (Nov 2008)


Centralized Shared Memory Architectures (SMPs)


SMP Architectures

  • SMPs support caching of private (used by a single processor) and shared (used by multiple processors) data to reduce latency, BW and contention.
  • Caching private data is not a problem (same as in a uniprocessor).
    • The block migrates into the cache, reducing latency and BW demands.
  • Caching shared data
    • Shared value may be replicated in multiple caches.
    • Reduce latency and bus contention.
    • However, this introduces cache coherence problem!
  • Cache coherence
    • Processors see memory only through their individual caches.
    • As a result, they may end up seeing different values of shared data.


Cache Coherence Problem

Assume X is a shared variable with value 5 in memory, and write-back private caches:

  1. P1 reads X and caches the value 5.
  2. P2 reads X and caches the value 5.
  3. P3 reads X and caches the value 5.
  4. P1 writes X = 8; with write-back caches, only P1's copy changes.
  5. P2 and P3 still read the stale value X = 5, and memory still holds 5.
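The sequence above can be mimicked with a toy model of write-back private caches with no coherence support (the class and variable names are illustrative only, not from the slides):

```python
# Write-back private caches with no coherence: a write by one
# processor leaves stale copies in the other caches and in memory.
class Core:
    def __init__(self, memory):
        self.memory = memory
        self.cache = {}                      # private cache: addr -> value

    def read(self, addr):
        if addr not in self.cache:           # miss: fetch from memory
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def write(self, addr, value):
        self.cache[addr] = value             # write-back: update cache only

memory = {"X": 5}
p1, p2, p3 = Core(memory), Core(memory), Core(memory)

p1.read("X"); p2.read("X"); p3.read("X")     # everyone caches X = 5
p1.write("X", 8)                             # only P1's cache holds 8

print(p1.read("X"), p2.read("X"), memory["X"])  # prints 8 5 5 (incoherent)
```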


Cache Coherence

  • Coherence defines what values can be returned by a read.
  • A memory system is coherent if:
    • Preserve Program Order: A read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P .
    • Coherent view of memory: Read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses.
    • Write serialization: 2 writes to same location by any 2 processors are seen in the same order by all processors. For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1.


Basic schemes for Enforcing Coherence

  • Directory-based Protocols
    • The sharing status of a shared block is kept in one (or more) locations, i.e., the directory.
      • In SMP, centralized directory.
      • In DSM, distributed directories.
  • Snooping-based Protocols
    • Every cache that has a copy of a shared block keeps track of the sharing status.
    • In SMP
      • Caches are accessible via some broadcast medium.
      • Each cache monitors/snoops the medium to determine whether it has a copy of the requested block.
    • Can be used in multichip multiprocessor on top of directory protocol within each multicore.


Snooping Coherence Protocols

  • Write-update protocol (broadcast/write-through)
    • A write to a cached shared item updates all cached copies via the medium.
    • Less popular; consumes BW!
  • Write-invalidate protocol
    • A write to a shared cached item invalidates all cached copies (exclusive access).

(Illustration: for a shared item X, write-invalidate has A invalidate B's copy of X, while write-update has A send the new value of X to B and to memory.)


Basic Implementation Techniques

  • A bus or broadcast medium:
    • In multi-chip multiprocessor, it is the shared memory bus.
    • In multicore with shared cache, it is the connection between the private caches of the cores and the shared cache.
    • Perform invalidates by acquiring the bus first, then broadcasting the address.
    • Other processors snoop and check their caches for the broadcasted address.
    • Invalidation by different processors is serialized by bus arbitration.
      • When two processors attempt to invalidate or write a block (on a write miss), the sequence is based on which processor acquires the bus.


Basic Implementation Techniques

  • Locating shared items on a miss
    • Simple in write-through (data is in the shared cache/memory).
    • Write-back is more difficult (data could be in a private cache).
    • However, in write-back, caches can snoop for read requests as well and provide the data if they have it in dirty state.
    • Write buffers have the same problem as write-back caches
      • They must be snooped as well, effectively as separate cache lines.


Basic Implementation Techniques

  • Tracking state
    • Use cache tags, valid and dirty bits to implement snooping.
      • Duplicate cache tags to avoid interference with the normal cache operation.
    • 1-bit to track the sharing state of each block
      • Exclusive/Modified state 🡪 The processor has a modified copy of the block. There is no need to send invalidates on successive writes by the same processor.
      • Shared 🡪 the block is cached in one or more private caches.

  • Finite state controller in each core
    • Responds to requests from the core and medium.
    • Change the state of a cached block
      • Invalid, modified or shared
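The state-tracking machinery can be sketched as a toy write-invalidate (MSI-style) cache. This is a simplified illustration with my own names; write misses on absent blocks, bus arbitration, and races are glossed over:

```python
# Each cache tracks one of three states per block; every write
# broadcasts an invalidate on the shared snooping bus.
INVALID, SHARED, MODIFIED = "I", "S", "M"

class MSICache:
    def __init__(self, bus, memory):
        self.bus, self.memory = bus, memory
        self.data, self.state = {}, {}
        bus.append(self)                      # join the snooping bus

    def read(self, addr):
        if self.state.get(addr, INVALID) == INVALID:
            # Read miss: a Modified copy elsewhere must be written
            # back (and downgraded to Shared) before we fetch.
            for other in self.bus:
                if other is not self and other.state.get(addr) == MODIFIED:
                    self.memory[addr] = other.data[addr]
                    other.state[addr] = SHARED
            self.data[addr] = self.memory[addr]
            self.state[addr] = SHARED
        return self.data[addr]

    def write(self, addr, value):
        # Broadcast an invalidate, then write locally in Modified state.
        for other in self.bus:
            if other is not self:
                other.state[addr] = INVALID
        self.data[addr] = value
        self.state[addr] = MODIFIED

bus, memory = [], {"X": 5}
p1, p2 = MSICache(bus, memory), MSICache(bus, memory)
p1.read("X"); p2.read("X")      # both Shared with X = 5
p1.write("X", 8)                # invalidates P2's copy
print(p2.read("X"))             # prints 8: P2 misses, P1 supplies its dirty copy
```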


Example Protocol (Invalidate & WB)


Why write back?



Example Protocol (Invalidate & WB)

  • Overall Operation
    • Black: from processor
    • Gray: from bus
    • Bold: actions


Extensions to MSI Protocol

  • The previous protocol is called MSI. Many extensions exist.
    • Add states and/or transactions to improve performance.
  • MESI protocol (Intel i7 MESIF)
    • Exclusive state added to indicate that the cache block is resident only in a single cache but is clean.
    • If in “E” state, a block can be written without generating invalidates.
  • MOESI protocol (AMD Opteron) (sharing of dirty blocks)
    • MSI and MESI update memory whenever changing the state from Modified to Shared.
    • MOESI adds the Owned state to indicate that a block is owned by that cache: it is shared by other caches and out-of-date in memory.
    • In MOESI, a block can change from Modified to Owned without writing memory. The owner supplies the block on read misses and updates memory only when the block is replaced.
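The MESI transitions described above can be summarized as a lookup table; a condensed sketch (the encoding is my own, and transient states, write-backs, and bus races are omitted):

```python
# (current state, event) -> next state for a write-invalidate
# snooping bus. "bus_read"/"bus_write" are transactions snooped
# from other processors; "local_*" are this core's own accesses.
MESI_NEXT = {
    ("I", "local_read_no_sharers"): "E",   # exclusive clean copy
    ("I", "local_read_sharers"):    "S",
    ("I", "local_write"):           "M",
    ("E", "local_write"):           "M",   # silent upgrade: no invalidate needed
    ("E", "bus_read"):              "S",
    ("E", "bus_write"):             "I",
    ("S", "local_write"):           "M",   # must broadcast an invalidate
    ("S", "bus_write"):             "I",
    ("M", "bus_read"):              "S",   # supply data, write back
    ("M", "bus_write"):             "I",
}

# The key MESI win: a block loaded by a single reader enters E,
# so the first write upgrades E -> M with no bus traffic.
assert MESI_NEXT[("E", "local_write")] == "M"
```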


Limitations

  • Centralized memory can become a bottleneck as the number of processors or their memory demands increase.
  • A high-BW connection to the shared L3 cache has allowed 4 to 8 cores. However, it is not likely to scale! Possible mitigations:
    • Multiple buses and interconnection networks such as crossbars or small point-to-point networks.
    • Banked memory or caches.


Limitations

  • Snooping BW could become a problem.
  • Each processor must examine every miss.

  • Snooping may interfere with cache operation!
    • Duplicate cache tags.
    • Use a centralized directory in the outermost cache to indicate which caches have copies of the blocks in the shared cache.
    • However, this does not eliminate the bottleneck at the bus.


Performance of SMP


Performance of SMPs

  • Performance is determined by:
    • Traffic caused by cache misses of processors.
    • Traffic of communication.
    • Both are affected by processor count, cache size and block size.

  • SMP adds a fourth C (coherence misses) to the 3C miss model.
  • Types of coherence misses
    • True Sharing Misses
      • A write to a shared block invalidates the block's copies in other processors' caches.
      • A subsequent read of the invalidated block misses because the data is truly shared.
    • False Sharing Misses
      • There is a single valid bit per block; hence, writing one word invalidates the entire block, even in caches that use only other words of the block.


Coherence Misses Example

  • Assume that words x1 and x2 are in the same cache block, which is in the shared state in the caches of both P1 and P2.
  • Assuming the following sequence of events, identify each miss as a true sharing miss, a false sharing miss, or a hit:
    1. P1 writes x1
    2. P2 reads x2
    3. P1 writes x1
    4. P2 writes x2
    5. P1 reads x2


1. true sharing miss, since x1 was read by P2 and needs to be invalidated from P2

2. false sharing miss, since x2 was invalidated by the write of x1 in P1, but that value of x1 is not used in P2

3. This event is a false sharing miss, since the block containing x1 is marked shared due to the read in P2, but P2 did not read x1

4. This event is a false sharing miss for the same reason as step 3.

5. This event is a true sharing miss, since the value being read was written by P2.


Performance of SMPs

  • 1998 Study

  • Processor
    • Alpha Server 4100 with four Alpha 21164 processors
    • Four-issue (up to 4 IPC) at 300 MHz

  • Workload
    • TPC-B: online transaction processing (OLTP)
    • TPC-D: Decision support system (DSS)
    • AltaVista: Web index search


Performance of SMPs


OLTP has the poorest performance due to memory hierarchy problems

Consider evaluating the OLTP when varying L3 cache size, block size and number of processors


Performance of SMPs


Biggest improvement when moving from 1 to 2 MB L3?


Performance of SMPs


Instruction and capacity misses drop, but true sharing, false sharing, and compulsory misses are unaffected!


Performance of SMPs


Increase of true sharing misses!


Performance of SMPs


Reduce true sharing misses!
