1 of 20

Multiprocessors and Thread-Level Parallelism (Part 2)

Chapter 5

Appendix F

Appendix I


Prof. Iyad Jafar

https://sites.google.com/view/iyadjafar

iyad.jafar@ju.edu.jo

2 of 20

Outline

  • Distributed Shared Memory (5.4)
  • Synchronization (5.5)
  • Consistency models (5.6)


3 of 20

Distributed Shared Memory


4 of 20

Distributed Shared Memory Multiprocessors (DSMs)

  • Larger number of processors.
  • Memory distributed among processors.
    • Non-uniform memory access/latency (NUMA).
  • Processors connected via direct (switched) and non-direct (multi-hop) interconnection networks.


5 of 20

Coherency in DSM

  • The snooping approach needs no centralized structure to track blocks.
  • However, it must communicate with all caches on every miss.
    • Cache coherence traffic becomes a major portion of total traffic.
    • Scaling up is a problem.

  • Alternative!
    • Distribute memory to separate local from remote accesses.
    • Use a directory to track the state of cached blocks.
      • Records which caches have copies of a block and in which state (dirty or not).
      • Centralized directory (multicore with shared cache).
      • Distributed directory.


6 of 20

Coherency in DSM

  • Distributed Directory DSM
    • A single directory could become a bottleneck.
    • Distribute the directory along with memory, so each node manages the directory for its own memory.
    • The coherence protocol must know where to find the directory information for any cached block.


7 of 20

Coherency in DSM

  • The Basics
    • The physical address space is statically divided between nodes
      • Use higher bits of the address to identify the owner
    • For each block in a directory,
      • A bit vector is used to track which nodes have a copy.
      • Bit vector can identify the owner of the block
    • The directory has three states:
      • Shared, Uncached, Modified
    • Track the state of each cache block in individual caches as well
    • Types of Nodes
      • Home node: the node whose memory holds the address and the corresponding directory entry.
      • Local node: the node issuing the request (could also be the home node).
      • Remote node: a node that has a copy of the cache block, whether in shared or exclusive state.


8 of 20

Example Protocol

  • Messages


P = requesting node number, A = requested address, and D = data contents


9 of 20

DSM and Directory-Based Coherence

  • Example Protocol


State transition diagram for an individual cache block (black: requests by the local processor; gray: requests from the home directory; bold: actions by the local processor).


10 of 20

DSM and Directory-Based Coherence

  • Example Protocol


State transition diagram for the directory


11 of 20

Synchronization


12 of 20

Synchronization

  • The protocols considered so far assume that operations are atomic, i.e., uninterruptible.
    • Atomic: the operation completes with no intervening operation in between.
    • In reality, detecting a write miss, acquiring the bus, and receiving a response cannot be done as a single action.
  • A synchronization mechanism (HW + SW) is needed:
    • HW: an instruction, or instruction sequence, capable of atomically reading and modifying a location, with some way to tell whether the operation was atomic.
    • SW: synchronization libraries that use these instructions to implement user-level synchronization operations (locks and barriers).


13 of 20

Basic Hardware primitives

  • Atomic exchange
    • Interchanges a value in a register with a value in memory.
    • The exchange (read & write) is indivisible, so it can be used for synchronization.
    • A processor can use it to acquire exclusive access to a lock (synchronization variable).
    • The processor that acquires the lock has exclusive access; other processors must wait.
  • Test-and-set
    • Tests a value and sets it if the value passes the test.
  • Fetch and increment
    • It returns the value of a memory location and atomically increments it.
  • Such primitives complicate coherence!
    • HW cannot allow any other operations between the read and the write.


14 of 20

Basic Hardware primitives

  • Alternatively, use a pair of instructions such that the second instruction returns a value that tells whether the pair executed atomically.
  • Load Linked and Store Conditional instructions
    • LL: loads a value into a register and stores the address in a link register (cleared if the address is invalidated).
    • SC: stores a value only if the link register is still set; leaves 1 in the source register on success and 0 on failure.


Exchange:

try:    MOV    R3,R4     ; copy exchange value to R3
        LL     R2,0(R1)  ; load linked
        SC     R3,0(R1)  ; store conditional
        BEQZ   R3,try    ; branch if store fails
        MOV    R4,R2     ; put loaded value in R4

Fetch-and-Inc:

try:    LL     R2,0(R1)  ; load linked
        DADDUI R3,R2,#1  ; increment
        SC     R3,0(R1)  ; store conditional
        BEQZ   R3,try    ; branch if store fails


15 of 20

Locks and Coherence

  • Spin Locks
    • Locks that a processor repeatedly tries to acquire by spinning in a loop until it succeeds.
    • Assuming no coherence support, keep the lock in memory:

    • However, keeping locks in memory hurts performance: every attempt goes to memory!
    • Caching the lock? Each exchange attempt is a write, so spinning causes multiple write misses.


        DADDUI R2,R0,#1  ; locked value
lockit: EXCH   R2,0(R1)  ; atomic exchange
        BNEZ   R2,lockit ; already locked?


16 of 20

Locks and Coherence

  • Cache the lock if there is a coherence mechanism
      • Spin on the locally cached value.
      • Takes advantage of locality.


lockit: LD     R2,0(R1)  ; load the lock
        BNEZ   R2,lockit ; not available, spin
        DADDUI R2,R0,#1  ; locked value
        EXCH   R2,0(R1)  ; swap
        BNEZ   R2,lockit ; branch if lock wasn't 0


17 of 20

Locks and Coherence

  • Using LL and SC → separate reading and writing to avoid bus traffic from failed write attempts


lockit: LL     R2,0(R1)  ; load linked
        BNEZ   R2,lockit ; not available, spin
        DADDUI R2,R0,#1  ; locked value
        SC     R2,0(R1)  ; store conditional
        BEQZ   R2,lockit ; branch if store fails


18 of 20

Consistency models


19 of 20

Consistency Models

  • Cache coherence ensures that multiple processors see a consistent view of memory.
  • However, it does not specify:
    • When must a processor see a value that has been updated by another processor?
    • In what order must a processor observe the writes of another processor?
    • What properties must be enforced among reads and writes to different locations by different processors?

  • Consider the following processes, executed on two processors, and assume both A and B are cached:

    • What if the writes of A and B complete immediately?
    • What if the invalidations of A and B are delayed?


20 of 20

Consistency Models

  • Sequential consistency
    • The simplest approach.
    • Delay the completion of a memory access until all invalidations caused by that access have completed.
    • Example
      • Delay reading B in P1 until the write of B in P2 has completed.
      • Delay reading A in P2 until the write of A in P1 has completed.
    • Programmers use synchronization libraries
      • Accesses to shared data are synchronized.
    • Hurts performance!
  • Alternative: relaxed consistency models
    • Allow reads and writes to complete out of order, but use synchronization operations to enforce ordering where needed.
