1 of 20

Multiprocessors and Thread-Level Parallelism (Part 2)

Chapter 5

Appendix F

Appendix I


Prof. Iyad Jafar

https://sites.google.com/view/iyadjafar

iyad.jafar@ju.edu.jo

2 of 20

Outline

  • Distributed Shared Memory (5.4)
  • Synchronization (5.5)
  • Consistency models (5.6)


3 of 20

Distributed Shared Memory


4 of 20

Distributed Shared Memory Multiprocessors (DSMs)

  • Larger number of processors.
  • Memory distributed among processors.
    • Non-uniform memory access/latency (NUMA).
  • Processors connected via direct (switched) and non-direct (multi-hop) interconnection networks.


5 of 20

Coherency in DSM

  • The snooping approach needs no centralized structure to track blocks.
  • However, it must communicate with all caches on every miss.
    • Cache coherence traffic becomes a major portion of total traffic.
    • Scaling up is a problem.

  • Alternative!
    • Distribute memory to separate local from remote accesses.
    • Use a directory to track the state of cached blocks.
      • Records which caches have copies of a block and in which state (dirty or not).
      • Centralized directory (multicore with shared cache).
      • Distributed directory.


6 of 20

Coherency in DSM

  • Distributed Directory DSM
    • A single directory could become a bottleneck.
    • Distribute the directory along with memory, so each node manages the directory for its own memory.
    • The coherence protocol must know where to find the directory information for any cached block.


7 of 20

Coherency in DSM

  • The Basics
    • The physical address space is statically divided between nodes
      • Use higher bits of the address to identify the owner
    • For each block in a directory,
      • A bit vector is used to track which nodes have a copy.
      • Bit vector can identify the owner of the block
    • The directory has three states:
      • Shared, Uncached, Modified
    • Track the state of each cache block in individual caches as well
    • Types of Nodes
      • Home node: the node whose memory holds the address and the corresponding directory entry.
      • Local node: the node issuing the request (could also be the home node).
      • Remote node: a node that has a copy of the cache block, whether in shared or exclusive state.


8 of 20

Example Protocol

  • Messages


P = requesting node number, A = requested address, and D = data contents


9 of 20

DSM and Directory-Based Coherence

  • Example Protocol


State transition diagram for an individual cache block (black: requests by the local processor; gray: requests from the home directory; bold: actions by the local processor).


10 of 20

DSM and Directory-Based Coherence

  • Example Protocol


State transition diagram for the directory


11 of 20

Synchronization


12 of 20

Synchronization

  • The protocols considered so far assume that operations are atomic, i.e., uninterruptible.
    • Atomic: the operation completes with no intervening operation in between.
    • In reality, detecting a write miss, acquiring the bus, and receiving a response cannot be done as a single action.
  • A synchronization mechanism (HW + SW) is needed:
    • HW: an instruction, or instruction sequence, capable of atomically reading and modifying a location, with some way to tell whether the operation was atomic.
    • SW: synchronization libraries that use these instructions to implement user-level synchronization operations (locks and barriers).


13 of 20

Basic Hardware primitives

  • Atomic exchange
    • Interchanges a value in a register with a value in memory.
    • The exchange (read & write) is indivisible, so it can be used for synchronization.
    • A processor can use it to acquire exclusive access to a lock (synchronization variable).
    • The processor that acquires the lock has exclusive access; other processors must wait.
  • Test-and-set
    • Tests a value and sets it if the value passes the test.
  • Fetch and increment
    • It returns the value of a memory location and atomically increments it.
  • Such primitives complicate coherence!
    • HW cannot allow any other operations between the read and the write.


14 of 20

Basic Hardware primitives

  • Alternatively, use a pair of instructions such that the second instruction returns a value that tells whether the pair executed atomically.
  • Load Linked and Store Conditional instructions
    • LL: loads a value into a register and stores the address in a link register (cleared if the address is invalidated).
    • SC: stores a value only if the link register is still set; leaves 1 in the source register on success and 0 on failure.


Exchange:

try:    MOV    R3,R4     ; copy exchange value to R3
        LL     R2,0(R1)  ; load linked
        SC     R3,0(R1)  ; store conditional
        BEQZ   R3,try    ; branch if store fails
        MOV    R4,R2     ; put loaded value in R4

Fetch-and-Inc:

try:    LL     R2,0(R1)  ; load linked
        DADDUI R3,R2,#1  ; increment
        SC     R3,0(R1)  ; store conditional
        BEQZ   R3,try    ; branch if store fails


15 of 20

Locks and Coherence

  • Spin Locks
    • Locks that a processor repeatedly tries to acquire by spinning in a loop until it succeeds.
    • Assuming no coherence support, keep the lock in memory:

    • However, keeping locks in memory hurts performance: every attempt goes to memory!
    • Caching the lock? Each exchange attempt is a write, so spinning causes multiple write misses.


        DADDUI R2,R0,#1  ; locked value
lockit: EXCH   R2,0(R1)  ; atomic exchange
        BNEZ   R2,lockit ; already locked?


16 of 20

Locks and Coherence

  • Cache the lock if there is a coherence mechanism
      • Spin on the locally cached value.
      • Takes advantage of locality.


lockit: LD     R2,0(R1)  ; load the lock
        BNEZ   R2,lockit ; not available, spin
        DADDUI R2,R0,#1  ; locked value
        EXCH   R2,0(R1)  ; swap
        BNEZ   R2,lockit ; branch if lock wasn't 0


17 of 20

Locks and Coherence

  • Using LL and SC → separate reading and writing to avoid bus traffic from failed write attempts


lockit: LL     R2,0(R1)  ; load linked
        BNEZ   R2,lockit ; not available, spin
        DADDUI R2,R0,#1  ; locked value
        SC     R2,0(R1)  ; store conditional
        BEQZ   R2,lockit ; branch if store fails


18 of 20

Consistency models


19 of 20

Consistency Models

  • Cache coherence ensures that multiple processors see a consistent view of memory.
  • However, it does not specify:
    • When must a processor see a value that has been updated by another processor?
    • In what order must a processor observe the writes of another processor?
    • What properties must be enforced among reads and writes to different locations by different processors?

  • Consider the following processes, executed on two processors, and assume both A and B are cached:

    • What if the writes of A and B complete immediately?
    • What if the invalidations of A and B are delayed?


20 of 20

Consistency Models

  • Sequential consistency
    • The simplest approach.
    • Delay the completion of a memory access until all invalidations caused by that access have completed.
    • Example
      • Delay reading B in P1 until the write of B in P2 has completed.
      • Delay reading A in P2 until the write of A in P1 has completed.
    • Programmers use synchronization libraries
      • Accesses to shared data are synchronized.
    • Hurts performance!
  • Alternative: relaxed consistency models
    • Allow reads and writes to complete out of order, but use synchronization operations to enforce ordering where needed.
