2 of 79

Multiple Processors

Single instruction, single data stream - SISD
Single instruction, multiple data stream - SIMD
Multiple instruction, single data stream - MISD
Multiple instruction, multiple data stream - MIMD
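The four categories above follow mechanically from two counts: the number of instruction streams and the number of data streams. A minimal sketch of that mapping is below; the function name `flynn_class` and its parameters are illustrative, not part of the slides.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn category (SISD, SIMD, MISD or MIMD) for the given stream counts."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    instr = "S" if instruction_streams == 1 else "M"   # single or multiple instruction
    data = "S" if data_streams == 1 else "M"           # single or multiple data
    return f"{instr}I{data}D"
```

For example, `flynn_class(1, 8)` describes one instruction stream applied to eight data streams, which is the SIMD case.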

3 of 79

Single Instruction, Single Data Stream - SISD

Single processor
Single instruction stream
Data stored in a single memory (RAM)
Uniprocessor

4 of 79

Single Instruction, Multiple Data Stream - SIMD

Single machine instruction
Controls simultaneous execution
Many processing elements
Lockstep operation, as also used in redundant systems
Each processing element has its own data memory
Each instruction is executed on a different set of data by a different processing element
Vector and array processors

5 of 79

Multiple Instruction, Single Data Stream - MISD

A sequence of data
Transmitted to a set of processors
Each processor executes a different instruction sequence
Not used in practice

6 of 79

Multiple Instruction, Multiple Data Stream - MIMD

Set of processors
Simultaneously execute different instruction sequences
Different sets of data
SMPs, clusters, NUMA systems

7 of 79

Taxonomy of Parallel Architectures

8 of 79

MIMD - Overview

General-purpose processors
Each can process any instruction
Details depend on the processor

9 of 79

Tightly Coupled - SMP (Symmetric Multiprocessor)

Processors share memory
They communicate through that shared memory
Symmetric Multiprocessor (SMP)

Share memory
Share a bus
Access time is approximately the same for every memory location

10 of 79

Tightly Coupled - NUMA

Nonuniform memory access
Access times differ between memory locations

11 of 79

Loosely Coupled - Clusters

Multiple independent processors
Interconnected to form a cluster
Communicate over a network

12 of 79

Parallel Organizations - SISD

13 of 79

Parallel Organizations - SIMD

14 of 79

Parallel Organizations - MIMD Shared Memory

15 of 79

Parallel Organizations - MIMD Distributed Memory

16 of 79

Symmetric Multiprocessors

A single computer with:

Two or more processors of similar capability
Processors share memory and I/O
Processors are connected by a bus
Memory access time is the same for all processors
Processors share access to I/O

Through a single channel or through multiple channels

All processors can perform the same functions
One OS

Responsible for interaction between processors
Interaction at the job, task, file and data levels

17 of 79

Multiprogramming and Multiprocessing

18 of 79

Symmetric Multiprocessor - Advantages

Performance

If the work can be parallelized
Availability
Resilience to failures
Incremental growth
Performance increases as processors are added
Scaling

20 of 79

Classification by Organization

Time-shared bus
Multiport memory
Central control unit

21 of 79

Time Shared Bus

Simplest solution
Similar to single-processor operation
Features

Addressing: distinguishes the memory modules
Arbitration: any module can act as bus master
Time sharing: while one module has access, the others wait for the bus
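The arbitration and time-sharing features above can be sketched as a tiny simulation in which only one module is bus master per cycle and the rest wait. The round-robin policy and the names `simulate_bus` and `requests` are assumptions made for illustration; the slides do not prescribe an arbitration scheme.

```python
from collections import deque


def simulate_bus(requests: dict) -> list:
    """Grant the bus one cycle at a time, round-robin, until all requests are served.

    requests maps a module name to the number of bus cycles it needs.
    Returns the name of the bus master in each cycle.
    """
    pending = deque((name, cycles) for name, cycles in requests.items() if cycles > 0)
    timeline = []
    while pending:
        name, cycles = pending.popleft()        # arbitration: this module becomes master
        timeline.append(name)                   # it owns the bus for exactly one cycle
        if cycles > 1:
            pending.append((name, cycles - 1))  # waiting modules go first before it retries
    return timeline
```

The length of the returned timeline equals the total number of requested cycles, which illustrates why the bus cycle time bounds overall throughput.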

23 of 79

Time Shared Bus - Advantages

Simplicity
Flexibility
Reliability

24 of 79

Time Shared Bus - Disadvantages

Performance limited by bus cycle time
Each processor has its own cache

To reduce the number of bus accesses

This creates a cache coherence problem

25 of 79

OS - Implications

Simultaneous concurrent processes
Scheduling
Synchronization
Memory management
Reliability and fault tolerance

26 of 79

A Mainframe SMP: IBM zSeries

Ranges from a uniprocessor with one main memory card to a high-end system with 48 processors and 8 memory cards
Dual-core processor chip

Each includes two identical central processors (CPs)
CISC superscalar microprocessor
Mostly hardwired, some vertical microcode
256-kB L1 instruction cache and a 256-kB L1 data cache

L2 cache 32 MB

Clusters of five
Each cluster supports eight processors and access to entire main memory space

System control element (SCE)

Arbitrates system communication
Maintains cache coherence

Main store control (MSC)

Interconnect L2 caches and main memory

Memory card

Each 32 GB, maximum 8, total of 256 GB
Interconnect to MSC via synchronous memory interfaces (SMIs)

Memory bus adapter (MBA)

Interface to I/O channels, go directly to L2 cache

27 of 79

IBM z990 Multiprocessor Structure

28 of 79

Cache Coherence and the MESI Protocol

Problem: the same data is held in several caches
Can result in an inconsistent view of memory contents
Write back can cause inconsistency
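The write-back inconsistency can be reproduced with a toy model of two private caches that share one main memory and have no coherence protocol. Everything here (the class `WriteBackCache`, the address `0x10`, the values 1 and 2) is a hypothetical illustration, not a description of real hardware.

```python
class WriteBackCache:
    """A private cache with a write-back policy and no coherence protocol."""

    def __init__(self, memory: dict):
        self.memory = memory   # shared main memory: address -> value
        self.lines = {}        # locally cached copies: address -> value
        self.dirty = set()     # addresses modified locally but not yet written back

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]              # hit: return the local copy, stale or not

    def write(self, addr, value):
        self.lines[addr] = value             # write back: only the local copy changes
        self.dirty.add(addr)

    def flush(self):
        for addr in self.dirty:              # copy dirty lines back to main memory
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()


memory = {0x10: 1}
cache_a, cache_b = WriteBackCache(memory), WriteBackCache(memory)
cache_a.read(0x10)
cache_b.read(0x10)                   # both caches now hold the value 1
cache_a.write(0x10, 2)               # A modifies its private copy
stale_memory = memory[0x10]          # main memory has not been updated yet
stale_b = cache_b.read(0x10)         # B still sees the old value
cache_a.flush()
after_flush_b = cache_b.read(0x10)   # B stays stale: nothing invalidated its copy
```

Even after A writes the line back, B keeps returning its old copy, which is why some mechanism must invalidate or update the other caches.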

29 of 79

Software Solutions

Compiler and OS are responsible for the solution
Overhead is moved to compile time
Added complexity in software!
The software approach has drawbacks

Inefficient use of cache

30 of 79

Hardware Solutions

Protocols
Dynamic recognition of problems
At run time
More efficient use of cache
More transparent to the programmer

31 of 79

MESI State Transition Diagram
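As a rough textual companion to the diagram, the sketch below encodes the commonly described MESI transitions as two functions: one for accesses by the local processor and one for bus transactions snooped from other caches. The function names, the `"read"`/`"write"` event labels and the `others_have_copy` flag are assumptions for illustration, and side effects such as write-back of a modified line are omitted.

```python
STATES = ("M", "E", "S", "I")   # Modified, Exclusive, Shared, Invalid


def local_access(state: str, op: str, others_have_copy: bool = False) -> str:
    """Next state of a cache line after the local processor reads or writes it."""
    if state not in STATES:
        raise ValueError(f"unknown state: {state}")
    if op == "write":
        return "M"               # every local write ends with a modified, exclusive copy
    if op == "read":
        if state == "I":         # read miss: shared if another cache holds the line
            return "S" if others_have_copy else "E"
        return state             # read hit: state is unchanged
    raise ValueError(f"unknown operation: {op}")


def snoop(state: str, bus_op: str) -> str:
    """Next state of a cache line after observing another cache's bus transaction."""
    if state not in STATES:
        raise ValueError(f"unknown state: {state}")
    if bus_op not in ("read", "write"):
        raise ValueError(f"unknown bus operation: {bus_op}")
    if state == "I":
        return "I"               # nothing cached, nothing to do
    if bus_op == "read":
        return "S"               # M, E and S all end up shared after a remote read
    return "I"                   # a remote write (or invalidate) kills the local copy
```

For instance, a line read with no other copies starts in E, moves to M on a local write, and drops to S when another cache later reads it.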

32 of 79

Improving Performance

Instruction execution rate as a measure of performance
MIPS rate = f * IPC

f is the processor clock frequency, in MHz
IPC is the average number of instructions per cycle

Performance is increased by raising the clock frequency and by increasing the number of instructions completed per cycle
Problem: complexity and power dissipation!
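A short worked example of the formula MIPS rate = f * IPC, with f in MHz so that the product is directly in millions of instructions per second. The function name `mips_rate` and the sample figures are illustrative assumptions.

```python
def mips_rate(f_mhz: float, ipc: float) -> float:
    """MIPS rate for a clock frequency in MHz and an average instructions-per-cycle value."""
    return f_mhz * ipc


baseline = mips_rate(400, 1.5)       # 400 MHz clock, 1.5 instructions per cycle
faster_clock = mips_rate(800, 1.5)   # doubling the frequency
wider_issue = mips_rate(400, 3.0)    # doubling IPC instead
```

Doubling either factor doubles the rate, so both options give the same result here; they differ in which cost they raise, power or design complexity.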

33 of 79

Multithreading and Chip Multiprocessors

The instruction stream is divided into smaller streams (threads)
Threads are executed in parallel

34 of 79

Definitions of Threads and Processes

A thread in a multithreaded processor is NOT NECESSARILY the same as a software thread
Process:

35 of 79

Definitions of Threads and Processes

36 of 79

Implicit and Explicit Multithreading

All commercial processors and most experimental ones use explicit multithreading

Concurrently execute instructions from different explicit threads
Interleave instructions from different threads on shared pipelines or parallel execution on parallel pipelines

Implicit multithreading is concurrent execution of multiple threads extracted from single sequential program

Implicit threads defined statically by compiler or dynamically by hardware

38 of 79

Approaches to Explicit Multithreading

Interleaved

Fine-grained
Processor deals with two or more thread contexts at a time
Switching thread at each clock cycle
If thread is blocked it is skipped

Blocked

Coarse-grained
Thread executed until event causes delay
E.g. cache miss
Effective on in-order processor
Avoids pipeline stall

Simultaneous (SMT)

Instructions simultaneously issued from multiple threads to execution units of superscalar processor

Chip multiprocessing

Processor is replicated on a single chip
Each processor handles separate threads
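The difference between the interleaved and blocked approaches can be sketched with a single-issue pipeline model. The encoding of threads as strings ('i' for an ordinary instruction, 'm' for one that misses in cache), the 3-cycle `MISS_PENALTY` and the function `run` are assumptions made for this illustration; SMT and chip multiprocessing are not modeled.

```python
MISS_PENALTY = 3  # assumed stall, in cycles, after an instruction that misses in cache


def run(threads, policy):
    """Return, per cycle, the index of the thread that issued (None = idle cycle).

    threads: list of strings; 'i' is an ordinary instruction, 'm' one that misses.
    policy: 'interleaved' (switch every cycle) or 'blocked' (switch only on a miss).
    """
    if policy not in ("interleaved", "blocked"):
        raise ValueError(f"unknown policy: {policy}")
    n = len(threads)
    pc = [0] * n          # index of the next instruction of each thread
    ready_at = [0] * n    # first cycle in which each thread may issue again
    timeline, current, cycle = [], 0, 0
    while any(pc[t] < len(threads[t]) for t in range(n)):
        # Search for a runnable thread, starting at `current` and wrapping around;
        # blocked or finished threads are skipped.
        order = [(current + k) % n for k in range(n)]
        chosen = next((t for t in order
                       if pc[t] < len(threads[t]) and ready_at[t] <= cycle), None)
        timeline.append(chosen)
        if chosen is not None:
            missed = threads[chosen][pc[chosen]] == "m"
            pc[chosen] += 1
            if missed:
                ready_at[chosen] = cycle + 1 + MISS_PENALTY
            if policy == "interleaved" or missed:
                current = (chosen + 1) % n   # hand the pipeline to the next thread
            else:
                current = chosen             # blocked: keep running the same thread
        cycle += 1
    return timeline
```

With `threads = ["imi", "ii"]`, the interleaved policy alternates between the two threads every cycle, while the blocked policy stays on thread 0 until its miss and only then switches. Running `["imi"]` alone leaves the pipeline idle for the whole miss penalty.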

40 of 79

Scalar Processor Approaches

Single-threaded scalar

Simple pipeline
No multithreading

Interleaved multithreaded scalar

Simple to implement
Threads are switched at every clock cycle
Hardware is designed to switch the thread context

55 of 79

Power5 Instruction Data Flow

56 of 79

Cluster

57 of 79

Cluster, Example 1

58 of 79

Cluster, Example 2

62 of 79

Cluster Architecture

63 of 79

Cluster Middleware

Unified image to user

Single system image

Single point of entry
Single file hierarchy
Single control point
Single virtual networking
Single memory space
Single job management system
Single user interface
Single I/O space
Single process space
Checkpointing
Process migration

64 of 79

Blade Servers

Common implementation of cluster
Server houses multiple server modules (blades) in single chassis

Save space
Improve system management
Chassis provides power supply
Each blade has processor, memory, disk

65 of 79

Example 100-Gbps Ethernet Configuration for Massive Blade Server Site

66 of 79

Cluster v. SMP

Both provide multiprocessor support to high demand applications.
Both available commercially

SMP for longer

SMP:

Easier to manage and control
Closer to single processor systems

Scheduling is main difference
Less physical space
Lower power consumption

Clustering:

Superior incremental & absolute scalability
Superior availability

Redundancy

67 of 79

Nonuniform Memory Access (NUMA)

Alternative to SMP & clustering
Uniform memory access

All processors have access to all parts of memory

Using load & store

Access time to all regions of memory is the same
Access time to memory for different processors same
As used by SMP

Nonuniform memory access

All processors have access to all parts of memory

Using load & store

Access time of processor differs depending on region of memory
Different processors access different regions of memory at different speeds

Cache coherent NUMA

Cache coherence is maintained among the caches of the various processors
Significantly different from SMP and clusters

68 of 79

Motivation

SMP has practical limit to number of processors

Bus traffic limits to between 16 and 64 processors

In clusters each node has own memory

Apps do not see large global memory
Coherence maintained by software not hardware

NUMA retains SMP flavour while giving large scale multiprocessing

e.g. Silicon Graphics Origin: NUMA with up to 1024 MIPS R10000 processors

Objective is to maintain transparent system wide memory while permitting multiprocessor nodes, each with own bus or internal interconnection system

69 of 79

CC-NUMA Organization

70 of 79

CC-NUMA Operation

Each processor has own L1 and L2 cache
Each node has own main memory
Nodes connected by some networking facility
Each processor sees single addressable memory space
Memory request order:

L1 cache (local to processor)
L2 cache (local to processor)
Main memory (local to node)
Remote memory

Delivered to requesting (local to processor) cache

Automatic and transparent
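The memory request order above can be sketched as a lookup that tries each level in turn and then fills the requesting processor's caches. Each level is modeled as a plain dictionary; the function name `load` and the sample addresses are hypothetical.

```python
def load(addr, l1, l2, local_memory, remote_memory):
    """Return (level, value) for the first level holding addr, searched in request order.

    Each level is a dict mapping address -> value. After the lookup, the value
    is placed in the requesting processor's L2 and L1 caches.
    """
    levels = (("L1", l1), ("L2", l2),
              ("local memory", local_memory),
              ("remote memory", remote_memory))
    for level, store in levels:
        if addr in store:
            value = store[addr]
            l2[addr] = value   # deliver to the requesting processor's caches
            l1[addr] = value
            return level, value
    raise KeyError(f"address {addr} is not mapped")
```

The caller never has to say where the data lives, which is the sense in which the process is automatic and transparent; only the latency differs by level.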

71 of 79

Memory Access Sequence

Each node maintains directory of location of portions of memory and cache status
e.g. node 2 processor 3 (P2-3) requests location 798 which is in memory of node 1

P2-3 issues read request on snoopy bus of node 2
Directory on node 2 recognises location is on node 1
Node 2 directory requests node 1’s directory
Node 1 directory requests contents of 798
Node 1 memory puts data on (node 1 local) bus
Node 1 directory gets data from (node 1 local) bus
Data transferred to node 2’s directory
Node 2 directory puts data on (node 2 local) bus
Data picked up, put in P2-3’s cache and delivered to processor
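The access sequence above can be traced with a small directory model. This is a simplified sketch: the class `Node`, the function `read`, the trace strings and the stored value 42 are all assumptions for illustration, and the local buses and caches are reduced to log messages.

```python
class Node:
    """One NUMA node: its share of main memory plus a directory of remote sharers."""

    def __init__(self, node_id, memory):
        self.node_id = node_id
        self.memory = memory    # address -> value, for locations homed on this node
        self.sharers = {}       # address -> set of remote node ids holding a copy


def read(nodes, requester_id, addr):
    """Return (value, trace) for a read of addr issued by a processor on node requester_id."""
    home = next((n for n in nodes.values() if addr in n.memory), None)
    if home is None:
        raise KeyError(f"address {addr} is not mapped")
    trace = [f"node {requester_id}: read request for {addr} on local bus"]
    if home.node_id == requester_id:
        trace.append(f"node {requester_id}: memory supplies {addr} locally")
    else:
        trace.append(f"node {requester_id} directory: {addr} is homed on node {home.node_id}")
        trace.append(f"node {home.node_id} directory: fetch {addr} from local memory")
        home.sharers.setdefault(addr, set()).add(requester_id)   # remember the remote copy
        trace.append(f"node {requester_id} directory: put data on local bus for the cache")
    return home.memory[addr], trace
```

Recording the requester in the home node's directory is what later lets that directory know which remote copies exist when the location is modified.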

72 of 79

Cache Coherence

Node 1 directory keeps note that node 2 has copy of data
If data modified in cache, this is broadcast to other nodes
Local directories monitor and purge local cache if necessary
Local directory monitors changes to local data in remote caches and marks memory invalid until writeback
Local directory forces writeback if memory location requested by another processor

73 of 79

NUMA Pros & Cons

Effective performance at higher levels of parallelism than SMP
No major software changes
Performance can break down if there is too much access to remote memory

Can be avoided by:

L1 & L2 cache design reducing all memory access

Need good temporal locality of software

Good spatial locality of software
Virtual memory management moving pages to nodes that are using them most

Not transparent

Page allocation, process allocation and load balancing changes needed

Availability?
