Multiprocessors and Thread-Level Parallelism (Part 1)
Chapter 5
Appendix F
Appendix I
Outline
Prof. Iyad Jafar
Introduction
[Figure labels: RISC; move to multiprocessors; technology improvement; new architectures and organization; power and ILP limitations?]
Multiprocessors
Multiprocessor Architectures
Challenges - Limited Parallelism in Programs
Check the other examples on pages 374 and 375.
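The limit that a sequential portion places on speedup is usually quantified with Amdahl's law. The sketch below is a minimal illustration under that assumption; the function names and the numbers (a target speedup of 80 on 100 processors) are illustrative and are not taken from the slide.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Speedup predicted by Amdahl's law for a fixed-size problem."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def required_parallel_fraction(target_speedup: float, processors: int) -> float:
    """Parallel fraction needed to reach target_speedup on `processors` CPUs."""
    # Solve 1 / ((1 - f) + f / n) = s for f.
    return (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / processors)


if __name__ == "__main__":
    # Illustrative numbers only: a speedup of 80 on 100 processors.
    f = required_parallel_fraction(80, 100)
    print(f"parallel fraction needed: {f:.4f}")
    print(f"sequential fraction allowed: {1 - f:.4%}")
    print(f"check speedup: {amdahl_speedup(f, 100):.1f}")
```

With these inputs the parallel fraction has to be roughly 0.9975, so only about a quarter of one percent of the original computation may be sequential.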
Challenges - Communication Overhead
Example. Suppose an application runs on a 32-processor multiprocessor that takes 200 ns to handle a reference to remote memory. Assume that a processor stalls on a remote request and that the cycle time is 0.3 ns (the value used in the solution).
If the base CPI (assuming that all references hit in the cache) is 0.5, how much faster is the multiprocessor with no communication than with 0.2% of the instructions involving a remote communication reference?
CPIcom = CPIideal + remote request rate × remote request cost
= 0.5 + 0.002 × (200 ns / 0.3 ns)
≈ 0.5 + 0.002 × 667
≈ 0.5 + 1.33
≈ 1.83
Speedup = 1.83 / 0.5 ≈ 3.7
The multiprocessor with all local references is about 3.7 times faster.
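The same arithmetic can be checked with a short script. This is a sketch only: the function and parameter names are hypothetical, and the 0.3 ns cycle time is the value that appears in the calculation. Carried out without intermediate rounding, the remote request costs about 667 cycles, so hand-rounded figures may differ slightly from what the script prints.

```python
def cpi_with_remote(base_cpi: float, remote_fraction: float,
                    remote_latency_ns: float, cycle_time_ns: float) -> float:
    """Effective CPI when a fraction of instructions stall on a remote reference."""
    remote_cost_cycles = remote_latency_ns / cycle_time_ns
    return base_cpi + remote_fraction * remote_cost_cycles


if __name__ == "__main__":
    base = 0.5
    cpi = cpi_with_remote(base, remote_fraction=0.002,
                          remote_latency_ns=200.0, cycle_time_ns=0.3)
    print(f"remote request cost: {200.0 / 0.3:.0f} cycles")
    print(f"CPI with communication: {cpi:.2f}")
    print(f"all-local machine is {cpi / base:.2f}x faster")
```

Note that the processor count (32) does not enter the formula; only the remote reference rate and its cost do.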
SMP Architectures
Intel Nehalem (Nov 2008)
Centralized Shared Memory Architectures (SMPs)
Cache Coherence Problem
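[Figure: processors P1, P2, and P3, each with a private cache, connected to a shared memory. Memory initially holds X = 5; across steps 1 to 5, copies of X = 5 are cached, one cache writes X = 8, and the remaining reads see an undetermined value (X = ?).]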
Assume X is a shared variable and the private caches are write-back.
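The problem can be reproduced with a toy model of write-back caches that deliberately have no coherence mechanism. This is a minimal sketch: the `Cache` class is hypothetical, and the order of the five steps (P1 reads, P3 reads, P3 writes 8, P1 reads, P2 reads) is an assumption chosen for illustration.

```python
class Cache:
    """Private write-back cache with no coherence mechanism (on purpose)."""

    def __init__(self, name, memory):
        self.name = name
        self.memory = memory      # shared dict: address -> value
        self.lines = {}           # locally cached copies: address -> value

    def read(self, addr):
        if addr not in self.lines:            # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]               # hit: return the local copy

    def write(self, addr, value):
        # Write-back: only the local copy changes; memory is not updated.
        self.lines[addr] = value


if __name__ == "__main__":
    memory = {"X": 5}
    p1, p2, p3 = (Cache(n, memory) for n in ("P1", "P2", "P3"))
    # Assumed order of events.
    print("1. P1 reads X ->", p1.read("X"))
    print("2. P3 reads X ->", p3.read("X"))
    p3.write("X", 8)
    print("3. P3 writes X = 8")
    print("4. P1 reads X ->", p1.read("X"))   # stale copy in P1's cache
    print("5. P2 reads X ->", p2.read("X"))   # stale value from memory
```

In this model both later reads return the old value 5, because P3's new value exists only in P3's cache.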
Cache Coherence
Basic Schemes for Enforcing Coherence
Snooping Coherence Protocols
Write invalidate: A invalidates X in B
Write update: A sends X to B and to memory
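The two captions above correspond to the two classic snooping write policies, usually called write invalidate and write update. The sketch below contrasts them for caches A and B; the dictionary-based caches and both function names are hypothetical simplifications, not a real bus model.

```python
def write_invalidate(caches, memory, writer, addr, value):
    """Writer keeps the only valid copy; other cached copies are dropped."""
    for name, cache in caches.items():
        if name != writer:
            cache.pop(addr, None)      # invalidate the stale copy
    caches[writer][addr] = value       # memory is updated later (write-back)


def write_update(caches, memory, writer, addr, value):
    """Writer broadcasts the new value to every sharer and to memory."""
    caches[writer][addr] = value
    for name, cache in caches.items():
        if name != writer and addr in cache:
            cache[addr] = value        # refresh the sharer's copy
    memory[addr] = value


if __name__ == "__main__":
    for policy in (write_invalidate, write_update):
        memory = {"X": 5}
        caches = {"A": {"X": 5}, "B": {"X": 5}}
        policy(caches, memory, "A", "X", 8)
        print(policy.__name__, caches, memory)
```

Invalidation sends one short message per write to a shared block, while update sends the data on every write; that trade-off is one reason invalidate protocols are the common choice.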
Basic Implementation Techniques
Example Protocol (Invalidate & WB)
Why write back?
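A write-back invalidate protocol of this kind is commonly described with three states per block: Modified, Shared, and Invalid (MSI). The sketch below models a single block on an atomic bus; the `Bus` class and its methods are hypothetical names, and real implementations must also handle races, replacements, and multiple blocks. It also suggests an answer to the question above: when a block is Modified, memory is stale, so the owner has to write the block back before another processor can obtain the current value.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    MODIFIED = "M"


class Bus:
    """One shared block, its per-cache MSI states, and backing memory."""

    def __init__(self, n_caches, initial_value=0):
        self.memory = initial_value
        self.states = [State.INVALID] * n_caches
        self.values = [None] * n_caches

    def _flush_owner(self):
        # If some cache holds the block Modified, write it back to memory.
        for i, state in enumerate(self.states):
            if state is State.MODIFIED:
                self.memory = self.values[i]
                return i
        return None

    def read(self, cpu):
        if self.states[cpu] is State.INVALID:         # read miss on the bus
            owner = self._flush_owner()
            if owner is not None:
                self.states[owner] = State.SHARED     # M -> S after write-back
            self.values[cpu] = self.memory
            self.states[cpu] = State.SHARED
        return self.values[cpu]

    def write(self, cpu, value):
        if self.states[cpu] is not State.MODIFIED:    # write miss or upgrade
            self._flush_owner()
            for i in range(len(self.states)):
                if i != cpu:
                    self.states[i] = State.INVALID    # invalidate other copies
            self.states[cpu] = State.MODIFIED
        self.values[cpu] = value


if __name__ == "__main__":
    bus = Bus(n_caches=2, initial_value=5)
    bus.read(0)
    bus.read(1)
    bus.write(0, 8)
    print([s.value for s in bus.states], "memory =", bus.memory)
    print("P1 reads", bus.read(1), [s.value for s in bus.states], "memory =", bus.memory)
```

After the write, cache 0 is Modified and memory still holds the old value; the following read by cache 1 forces the write-back and leaves both caches Shared.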
Extensions to MSI Protocol
Limitations
Performance of SMPs
Coherence Misses Example
1. True sharing miss, since x1 was read by P2 and must be invalidated in P2.
2. False sharing miss, since x2 was invalidated by the write of x1 in P1, but that value of x1 is not used in P2.
3. False sharing miss, since the block containing x1 is marked shared due to the read in P2, but P2 did not read x1.
4. False sharing miss, for the same reason as step 3.
5. True sharing miss, since the value being read was written by P2.
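The five classifications above can be reproduced with a small word-level bookkeeping sketch. Two things here are assumptions, not taken from the slide text: the event sequence (P1 writes x1, P2 reads x2, P1 writes x1, P2 writes x2, P1 reads x2, with x1 and x2 in one block that both processors have already read), and the simplified rule itself, which calls an event true sharing only when the word involved was actually used or written by the other processor.

```python
def classify(events, words=("x1", "x2")):
    """Label each event as a 'true' or 'false' sharing miss, or a 'hit'.

    Simplified two-processor model of one block under an invalidate protocol.
    """
    valid = [True, True]                    # both caches start with the block
    touched = [set(words), set(words)]      # words used since the copy arrived
    written_by_other = [set(), set()]       # words the other CPU wrote since
                                            # this CPU's copy was invalidated
    labels = []
    for cpu, op, word in events:
        other = 1 - cpu
        if valid[cpu] and (op == "read" or not valid[other]):
            labels.append("hit")            # read hit, or write to exclusive copy
            touched[cpu].add(word)
        elif valid[cpu]:
            # Write to a shared copy: true sharing if the other CPU used the word.
            labels.append("true" if word in touched[other] else "false")
            touched[cpu].add(word)
        else:
            # Miss after invalidation: true sharing if the word itself changed.
            labels.append("true" if word in written_by_other[cpu] else "false")
            valid[cpu] = True
            touched[cpu] = {word}
            written_by_other[cpu] = set()
        if op == "write":
            if valid[other]:                # invalidate the other copy
                valid[other] = False
                written_by_other[other] = set()
            written_by_other[other].add(word)
    return labels


if __name__ == "__main__":
    P1, P2 = 0, 1
    # Assumed event sequence for the five steps.
    events = [(P1, "write", "x1"), (P2, "read", "x2"), (P1, "write", "x1"),
              (P2, "write", "x2"), (P1, "read", "x2")]
    for step, label in enumerate(classify(events), start=1):
        print(step, label)
```

Under these assumptions the labels come out as true, false, false, false, true, matching steps 1 to 5. Placing x1 and x2 in separate blocks would remove the three false sharing misses.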
OLTP performs worst because of its poor memory hierarchy behavior.
Consider evaluating OLTP while varying the L3 cache size, block size, and number of processors.
Why does the biggest improvement come when moving from a 1 MB to a 2 MB L3?
Instruction and capacity misses drop, but true sharing, false sharing, and compulsory misses are unaffected!
True sharing misses increase as the processor count grows!
Larger blocks reduce true sharing misses!