
WES 237B: Master of Advanced Study in Wireless Embedded Systems

Pat Pannuto

Department of Computer Science and Engineering

University of California, San Diego

Summer Session 2023


WELCOME to WES 237B!

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai


Designing and Optimizing Embedded Systems (Signal/Image/Video Processing) Applications on SoC


What is an embedded system?


  • Specialized
  • Low power
  • Real-time
  • High-performance

Properties of embedded systems


A domain-specific + high-performance system


Achieving high performance while reducing power consumption is the key challenge. How do we do that?

There is no single answer! It usually involves trading off multiple factors: latency, throughput, power, cost, …


And what is “performance” anyway??


How to design an embedded system for this app?


void matrix_vector_multiply(int in1[SIZE][SIZE], int in2[SIZE], int out[SIZE]) {
    for (int i = 0; i < SIZE; i++) {
        int acc = 0;                   // accumulator must be declared and reset per row
        for (int j = 0; j < SIZE; j++) {
            acc = acc + in1[i][j] * in2[j];
        }
        out[i] = acc;
    }
}

void prefix_sum(int in[SIZE], int out[SIZE]) {
    out[0] = in[0];                    // seed the running sum
    for (int i = 1; i < SIZE; ++i) {
        out[i] = out[i-1] + in[i];
    }
}

int main() {
    matrix_vector_multiply(A, B, C);
    prefix_sum(C, D);
    return 0;
}
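A self-contained variant of the slide's application can be compiled and experimented with directly. This is a hedged sketch: `SIZE`, the array names, and the `static` helpers are ours, not the slides'; the logic matches the (corrected) kernels above.

```c
#include <assert.h>

#define SIZE 4

/* Matrix-vector product: out = in1 * in2, one accumulator per row. */
static void matrix_vector_multiply(int in1[SIZE][SIZE], int in2[SIZE], int out[SIZE]) {
    for (int i = 0; i < SIZE; i++) {
        int acc = 0;
        for (int j = 0; j < SIZE; j++)
            acc += in1[i][j] * in2[j];
        out[i] = acc;
    }
}

/* Inclusive prefix sum, seeded from the first element. */
static void prefix_sum(int in[SIZE], int out[SIZE]) {
    out[0] = in[0];
    for (int i = 1; i < SIZE; ++i)
        out[i] = out[i-1] + in[i];
}
```

With `A` as the identity matrix and `B = {1,2,3,4}`, the chain produces `C = {1,2,3,4}` and `D = {1,3,6,10}`.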

Example (prefix sum): in = [0, 2, 1, 6, 3, 1, 2, 5] → out = [0, 2, 3, 9, 12, 13, 15, 20]


Embedded systems design 🡪 Trade-offs

  • Performance, power, cost, programming choice,…etc.

[Mapping: both Matrix Vector Multiplication and Prefix sum run on the CPU]

[Mapping (FPGA + CPU): Matrix Vector Multiplication (parallel) on the FPGA; Prefix sum (sequential) on the CPU]


What if memory is the bottleneck?

[Mapping (GPU?): Matrix Vector Multiplication (parallel) on the GPU; Prefix sum (sequential) on the CPU]


Can you rewrite the sequential prefix-sum code to suit a parallel architecture?
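One possible answer, as a hedged sketch (the function name and fixed N are ours, not the course's solution): recompute the scan in log2(N) passes. Within each pass every element can be updated independently, which is exactly what a parallel architecture (GPU threads, FPGA lanes) wants; the loop-carried dependence of the sequential version is gone, at the cost of O(N log N) total work instead of O(N).

```c
#include <string.h>

#define N 8

/* Step-efficient inclusive scan (Hillis-Steele style), written in plain C.
 * The inner loop over i is data-parallel: every i reads from tmp and
 * writes to out, with no dependence between iterations of the same pass. */
void prefix_sum_parallel_friendly(const int in[N], int out[N]) {
    int tmp[N];
    memcpy(out, in, sizeof(tmp));
    for (int stride = 1; stride < N; stride *= 2) {
        memcpy(tmp, out, sizeof(tmp));     /* snapshot of the previous pass */
        for (int i = 0; i < N; i++) {
            out[i] = (i >= stride) ? tmp[i] + tmp[i - stride] : tmp[i];
        }
    }
}
```

This trades extra work for a shorter critical path: log2(N) dependent steps instead of N, a classic work-versus-span tradeoff.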



Why study signal/image/video processing?


  • Emerging applications such as ADAS, AR, and VR use vision/image processing
  • These require high performance and low power




What is a System-on-Chip (SoC)?


System on a Chip (SoC)



SoC


[SoC block diagram: CPU, Softcore, DSP, DMA, Memory Controller, UART, USB, ADC/DAC]


Vendors license ARM IP cores and develop SoCs

    • Xilinx Zynq/Pynq
    • NVIDIA Jetson

The interconnect protocol

    • Must be standard
    • Maintenance
    • Modularity (re-use)


Different protocols…

  • CoreConnect (IBM)
  • Wishbone (OpenCores/OSH)
  • AXI ← ARM’s
    • Xilinx has bought in the most here
  • Avalon (Altera, now Intel)

  • MIPI… [~mobile phones]
    • SLIMbus, LLI, UniPro, I3C, …



[SoC roles: CPU/GPU (covered in WES 237B), Co-processor / Accelerator Design (covered in WES 237C), Peripherals]


ARM

  • Introduced in the 1980s into consumer electronics
  • De facto embedded processor technology
    • Not usually manufactured as a standalone chip
    • Usually part of a System-on-Chip that is customized for a specific application
  • ARM is everywhere, including your phone, TV, and iPad, in the form of an SoC
  • An SoC is a collection of interconnected hardware modules
  • A typical SoC includes one or more ARM processor cores
  • The first major customer was Apple (Newton PDA)


Apple A8 (iPhone 6) … Apple A15 (iPhone 13)



ARM Technology

  • ARM (Advanced RISC Machine) is controlled by Arm (the company)
  • ARM writes the specifications of ARM processors
  • Specifications
    • Instruction set architecture
    • MMU
    • Interrupt and exception handling
    • Caches
    • Virtualization
  • CPUs based on the ARM1, ARM2, and ARM3 instruction set architectures (ISAs) were introduced around the late 1980s.
  • Due to competition from the Intel x86 and Motorola 68000 CPUs, ARM switched its business model
    • Selling the rights to use its processor designs or ISA (specification)
  • ARMv4, ARMv5, ARMv6, ARMv7, ARMv8


ARM Technology

  • ARM Holdings creates IP cores based on the specifications
  • IP core = VHDL or Verilog implementation of a specification
  • Examples:
    • ARM976 = implementation of ARMv5
    • ARM1176 = implementation of ARMv6
    • Cortex-A15 = implementation of ARMv7-A
    • Cortex-A53 = implementation of ARMv8-A
  • Multiple implementations of the same specification are possible
    • Cortex-A5, A7, …, A15
    • Cortex-A5 is low power; Cortex-A15 is high performance
    • They differ in pipeline depth, out-of-order execution, cache sizes, …


ARM Processor Families

  • Three ‘profiles’ of cores:
    • Cortex-A  Application
    • Cortex-R  Real-time
    • Cortex-M  Microcontroller
  • … get it?


ARM Technology

  • ARM processors are sold as reusable IP cores
  • A consistent instruction set allows
    • A common toolchain: compilers, debuggers, and code libraries
  • Most ARM processors are used with co-processors (FPGA, GPU, …)


ARM Technology

Specification (e.g., ARMv7) → ARM IP core (e.g., Cortex-A9) → SoC (e.g., Zynq) → Hardware Platform (e.g., Pynq, Zedboard)


WES237B Hardware

  • Xilinx PYNQ-Z1 (~2016)
    • Open source project to showcase python + FPGA
    • Xilinx Zynq Chip
      • 650MHz dual-core ARMv7-A Cortex-A9 processor
      • Programmable logic equivalent to Artix-7 FPGA
      • 512 MB DDR3 (peak bw: 2.1 GB/s)
    • Programmable Logic (PL) & Processing System (PS) part
    • Allows use of python/libraries to program PS/PL


  • NVIDIA Jetson TX2 Dev Kit (~2017)
    • Quick prototyping platform to highlight new core
    • NVIDIA TX2
      • Denver2 ARMv8 (64-bit) dual-core + ARMv8 ARM Cortex-A57 quad-core (64-bit)
      • NVIDIA Pascal GPU with 256 CUDA Cores
      • 8 GB DDR4 (peak bw: 59.7 GB/s)
    • RIP…



WES 237B

  • Course web:
  • Instructor: Pat Pannuto
    • Primary research thrusts on ‘more embedded’ applications

  • Amazing TA: Christopher Crutchfield
    • Teaching experience across many courses and significant experience with embedded platforms
    • Technically very strong – your hands-on expert!



What to expect from WES 237B?

  • Goal 1: Optimization and performance measurement of emerging applications on an SoC
    • Software optimization for SoC (pipelining, unrolling, multi-threading), blocking, memory optimizations
  • Goal 2: Understanding parallel programming paradigms and parallel patterns
    • Understanding parallel programming using emerging applications (covers some case studies using computer vision/image processing applications)



Grading and Grading Policy

  • Grading
    • Student group research and participation: 30%
      • Class participation: 10%
      • Group research and presentation: 20%
    • Assignments: 70%
      • Assignment 1: 10%
      • Assignments 2 to 5: 15% each

    • No-show/very late: −5% from your grade on each occasion (exceptions happen; preferably let us know in advance, but life happens)


Group Research Presentation

  • Content: how well does your presentation cover the given topic? 🡺 5%
  • Presentation: how well do you present the material, to an audience of wide background, in the allocated time? 🡺 5%
  • Discussion and future directions/questions: lead a discussion (pros/cons) of the technology, including benefits, current technological challenges/limitations and opportunities, and how this work differs from previous work (what are the main contributions)? 🡺 10%
    • Instructors/TAs/students will ask questions & participate in the discussion
  • Logistics
    • Send me the outline of your presentation at least a week before the class
    • Discuss/send course staff the (close-to-final) presentation 2 days before the class.


Student Group Presentation Topics (Old list…)

  • Emerging Memory Technologies & in-memory computing
  • Emerging Processing Technologies
  • Efficient Deep Learning Hardware/Software Architectures
  • Emerging Parallel Programming Systems of Parallel Architectures
    • Lai, Yi-Hsiang, et al. "HeteroCL: A multi-paradigm programming infrastructure for software-defined reconfigurable computing." Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2019.
    • Li, Tzu-Mao, et al. "Differentiable programming for image processing and deep learning in Halide." ACM Transactions on Graphics (TOG) 37.4 (2018): 1-13.
  • Emerging Sensor Technologies and Enabling Applications



Assignments

  • Each week has an associated assignment
    • Due ~2 weeks later
  • Assignments have in-lab and take-home parts
    • If you don’t finish the lab part in class, finish at home
    • If you finish the lab early, feel free to start on the take-home parts
  • Collaboration?
    • In-lab
      • As much as you like (hands on your own keyboards please)
    • Take-home
      • More limited – general questions, specific library/technical corner cases okay, but largely your own work please



WES 237B does not teach you C!

  • We do not teach you the basics of C!
    • Although we do cover advanced concepts and optimizations
  • We will try to help you
  • You are expected to catch up if you do not have a C programming background
    • Assignment 1 (part two) is about C programming


Questions?


Let’s take a short break

Stretch your legs, get some water, meet your neighbors…



Embedded Compute


Processors

  • ISA
    • A definition of the instructions that a processor can execute
    • x86 is an ISA; there are many realizations of x86
    • ARMv1, ARMv2, …, ARMv7 are ISAs
  • Realization, or chip (“soft” or “hard” cores)


[Diagram: the ISA is the boundary between Software (above) and Hardware (below)]


An Instruction Set Architecture is an abstraction of a computational machine

  • An ISA is “the agreed-upon interface between all the software that runs on the machine and the hardware that executes it.”

[Layer diagram: Application → Operating System → Compiler → Instruction Set Architecture → Instruction Set Processor / I/O system → Digital Design → Circuit Design]


Computers do not speak English. And they do not speak C or Java or Python or Haskell (or …) either.

Specification (Programmer): “Swap two array elements.”

High Level Language (C, Java, Rust, etc):

    int temp = array[index];
    array[index] = array[index + 1];
    array[index + 1] = temp;

↓ Compiler

Assembly Language:

    lw $15, 0($2)
    lw $16, 4($2)
    sw $16, 0($2)
    sw $15, 4($2)

↓ Assembler

Machine Language:

    1000110001100010000000000000000
    1000110011110010000000000000100
    1010110011110010000000000000000
    1010110001100010000000000000100

↓ Machine Interpretation

Control Signal Spec:

    ALUOP[0:3] <= InstReg[9:11] & MASK


The Instruction Set Architecture

  • The part of the architecture that is visible to the programmer
    • available instructions (“opcodes”)
    • number and types of registers
    • instruction formats
    • storage access, addressing modes
    • exceptional conditions


Embedded Compute Units


Processor        Description                                          Applications
Microcontroller  A small computer (CPU, memory, I/O)                  Applications that require small memory and less computation
DSP              Processors specifically designed for signal          Audio (coding/decoding); Video/Image/Computer
                 processing: repetitive numeric computations,         Vision (convolution); Communications (FIR, FFT)
                 high memory bandwidth (arrays), real-time
                 processing. Objectives: power, development
                 time, throughput (processing)
GPU              Graphics processors (GPGPU)                          General (embedded 🡪 data center)
FPGA             Programmable custom hardware                         General (embedded 🡪 data center)


Design tradeoffs…


CPU – DSP – GPU – FPGA – ASIC

Toward ASIC: efficiency, performance (for a given workload). Toward CPU: generality, flexibility.


CPU (Latency) vs. GPU (Throughput)

CPU (latency-oriented):
  • Powerful ALU
    • Reduced operation latency
  • Large caches
    • Convert long-latency memory accesses to short-latency cache accesses
  • Sophisticated control
    • Branch prediction for reduced branch latency
    • Data forwarding for reduced data latency

GPU (throughput-oriented):
  • Small caches
    • To boost memory throughput
  • Simple control
    • No branch prediction
    • No data forwarding
  • Energy-efficient ALUs
    • Many; long latency, but heavily pipelined for high throughput
  • Requires a massive number of threads to tolerate latencies
    • Threading logic
    • Thread state


What is an FPGA?


A field-programmable gate array (FPGA)


FPGA

  • Flexibility to implement any logic
    • F(x,y,z) = xy + z’
    • F(x,y,z) = xyz
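That flexibility can be illustrated in software. This is a hedged sketch (the `lut3` type and index convention `x<<2 | y<<1 | z` are ours): an FPGA look-up table (LUT) implements any 3-input Boolean function as an 8-entry truth table, and "reprogramming" the table re-implements the logic.

```c
/* A 3-input LUT: 8 stored output bits, one per input combination. */
typedef struct { unsigned char table[8]; } lut3;

/* "Program" the LUT by evaluating a Boolean function at every input. */
static lut3 lut3_program(int (*f)(int, int, int)) {
    lut3 l;
    for (int x = 0; x <= 1; x++)
        for (int y = 0; y <= 1; y++)
            for (int z = 0; z <= 1; z++)
                l.table[x << 2 | y << 1 | z] = (unsigned char)f(x, y, z);
    return l;
}

/* "Hardware" evaluation is just a table lookup, whatever f was. */
static int lut3_eval(const lut3 *l, int x, int y, int z) {
    return l->table[x << 2 | y << 1 | z];
}

/* The two example functions from the slide. */
static int f1(int x, int y, int z) { return (x && y) || !z; }  /* xy + z' */
static int f2(int x, int y, int z) { return x && y && z; }     /* xyz    */
```

The same lookup structure evaluates both F(x,y,z) = xy + z' and F(x,y,z) = xyz; only the table contents change.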

63 of 152

Why FPGAs?


Vision (image) 🡪 high performance and low(er) power


Motivation: Applications of FPGA

Data centers[1] · Database[2] · CNN[3] · Big Data[4]

[1] Andrew Putnam et al., “A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services,” ISCA, 2014

[2] Casper et al., “Hardware Acceleration of Database Operations,” FPGA ’14

[3] Zhang et al., “Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks,” FPGA 2015

[4] Rouhani et al., “SSketch: An Automated Framework for Streaming Sketch-based Analysis of Big Data on FPGA,” FCCM 2015


Designing an FPGA?

You will learn in WES 237C!

VHDL, Verilog, SystemVerilog, …

VHDL???

High-Level Synthesis: a way to program an FPGA from C


C/C++ → [Magic] → VHDL (Verilog)

High-Level Synthesis

C/C++ + Directives → HLS → VHDL (Verilog)

  • Directives: pipeline, partition, dataflow, inline, reshape, resource, stream, top, interface, expression balance, reset, allocation, dependence, function_initiate, …

High-Level Synthesis

    void HelloWorld(int A[8], int B[8]) {
        for (int i = 0; i < 8; i++) {
            #pragma HLS unroll
            B[i] = A[i] * 3;
        }
    }
High-Level Synthesis

C/C++ + Directives → Vivado HLS → VHDL (Verilog) / IP → Vivado → Bitstream


Summary of Embedded Systems and Trends

  • Embedded systems 🡪 getting complex
  • High-performance computing & embedded systems 🡪 merging
  • GPUs and FPGAs are two key players
    • GPUs 🡪 easy to program, high power consumption
      • Good for bulk processing
    • FPGAs 🡪 hard(er) to program, low power consumption
      • Good for streaming applications
      • Flexibility
    • ASICs 🡪 expensive and not flexible
      • Best power/performance
  • FPGAs 🡪 moving towards SoC
    • Programming from high-level languages such as C


System on a Chip (SoC): Zynq

  • Processing System (PS)
    • Application processor unit (APU)
    • Memory interfaces
    • I/O peripherals (IOP)
    • Interconnect
  • Programmable Logic (PL)



System on a Chip (SoC)

  • Lower cost
  • Faster and more secure data transfer between components
  • Smaller physical size
  • Lower power consumption
  • Higher overall speed



PL Special Resources: BRAM & DSP48E1

  • Block RAMs (BRAMs)
    • RAM, ROM, FIFO
    • Each block RAM can store up to 36 Kb of data
      • Can be configured as one 36 Kb or two 18 Kb blocks
  • DSP48E1
    • Pre-adder/subtractor, multiplier, post-adder/subtractor
  • IOBs
    • Input/output blocks connecting the PL to external pins
  • XADC
    • A dedicated set of ADC blocks
  • Clocks
    • 4 clocks from the PS, plus other PL clocks


PS and PL interface

  • Advanced eXtensible Interface (AXI)
    • Version AXI4
    • Part of the ARM AMBA® open standard
    • AMBA was first released in 1996 for microcontrollers
    • De facto standard for on-chip communication


SoC Design Flow

Defining Requirements → Defining Specification → Software/Hardware Partition → (Software Development & Testing ∥ Hardware Development & Testing) → System Integration & Testing → Application



Compression: Huffman Encoding


Compression


[Figure: data flows through an Encoder, then a Decoder]


Huffman Encoding

  • Greedy algorithm
  • Input: text data (symbols)
    • Find the histogram of the symbols
    • L = symbols sorted by frequency (histogram)
    • Create a Huffman tree
      • Select the two symbols with minimum frequencies
      • Create a parent node whose frequency is their sum
      • Add the new node to L in sorted order
    • Assign a 0 to each left edge and a 1 to each right edge
    • Create each codeword by walking from the root of the Huffman tree to its leaf
    • Return the codewords
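The tree-building loop can be sketched compactly in C. This is a hedged sketch, not the course's reference code: the array-based node pool and linear minimum search are our simplifications (a real implementation would use a priority queue), and the frequencies baked into the test are the slide example AABACDCDDDDEEEEEF. It computes each symbol's code length (its leaf depth), which is enough to verify the tree shape.

```c
#define NSYM 6
#define MAXN (2 * NSYM)     /* leaves plus internal nodes */

/* Build a Huffman tree by repeatedly merging the two lowest-weight live
 * nodes, then recover each symbol's code length by walking parent links. */
void huffman_code_lengths(const int freq[NSYM], int len[NSYM]) {
    int weight[MAXN], parent[MAXN], alive[MAXN];
    int n = NSYM;
    for (int i = 0; i < NSYM; i++) {
        weight[i] = freq[i]; parent[i] = -1; alive[i] = 1;
    }
    for (int merges = 0; merges < NSYM - 1; merges++) {
        int a = -1, b = -1;                     /* two smallest live nodes */
        for (int i = 0; i < n; i++) {
            if (!alive[i]) continue;
            if (a < 0 || weight[i] < weight[a]) { b = a; a = i; }
            else if (b < 0 || weight[i] < weight[b]) { b = i; }
        }
        weight[n] = weight[a] + weight[b];      /* parent weight = sum of children */
        parent[n] = -1; alive[n] = 1;
        parent[a] = parent[b] = n;
        alive[a] = alive[b] = 0;
        n++;
    }
    for (int i = 0; i < NSYM; i++) {            /* leaf depth = code length */
        int d = 0;
        for (int p = parent[i]; p != -1; p = parent[p]) d++;
        len[i] = d;
    }
}
```

For the slide's frequencies (A:3, B:1, C:2, D:5, E:5, F:1) this yields lengths 2, 4, 3, 2, 2, 4, matching the codes derived later (A=00, B=0101, C=011, D=10, E=11, F=0100).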


Data Compression: Huffman Encoding

  • David Albert Huffman
    • Invented during his PhD at MIT
  • “A Method for the Construction of Minimum-Redundancy Codes”, 1952


  • Data: AAABB
  • Fixed-length encoding:
    • A = 001
    • B = 010
    • 🡪 001001001010010 🡪 15 bits
  • Huffman encoding (variable-length):
    • A = 1
    • B = 010
    • 🡪 111010010 🡪 9 bits


Histogram


Input:

AABACDCDDDDEEEEEF

Symbol  Frequency
A       3
B       1
C       2
D       5
E       5
F       1
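The histogram step itself is a short loop. A hedged sketch (the function name and 256-bin layout are ours): count how often each byte value occurs in the input string.

```c
#include <string.h>

/* Count occurrences of each symbol; 256 bins cover every byte value. */
void histogram(const char *data, int counts[256]) {
    memset(counts, 0, 256 * sizeof(int));
    for (size_t i = 0; data[i] != '\0'; i++) {
        counts[(unsigned char)data[i]]++;
    }
}
```

Running it on the slide's input AABACDCDDDDEEEEEF reproduces the frequency table above.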


Huffman Key Idea: Use the fewest number of bits to encode the most common symbols


Sorting by frequencies

  • Radix sort, merge sort, quick sort, etc.
  • We will learn how to design parallel sorting algorithms

Radix Sort

[Example: sort {123, 002, 999, 609, 111} one digit at a time, least-significant digit first; each pass is a stable sort on one digit:
  pass 1 (ones):     111, 002, 123, 999, 609
  pass 2 (tens):     002, 609, 111, 123, 999
  pass 3 (hundreds): 002, 111, 123, 609, 999]
[Each pass (stage 1, 2, 3) is built from Histogram (H) → Prefix sum (P) → Counting Sort (CS); chaining one counting-sort stage per digit yields Radix Sort (RS)]

  • Option 1: HW 🡪 pipeline the stages
  • Option 2: parallel hardware

There are a number of ways of designing parallel algorithms.
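The stage structure above can be sketched directly in C. This is a hedged sketch (array size `N` and decimal digits are our assumptions): each pass runs histogram, exclusive prefix sum (the offset table), and a stable copy-out, exactly the H → P → CS chain.

```c
#include <string.h>

#define N 5

/* Least-significant-digit radix sort over decimal digits. */
void radix_sort(int a[N], int digits) {
    int tmp[N];
    int div = 1;
    for (int d = 0; d < digits; d++, div *= 10) {
        int count[10] = {0};
        for (int i = 0; i < N; i++)             /* histogram of this digit */
            count[(a[i] / div) % 10]++;
        int offset[10], sum = 0;
        for (int b = 0; b < 10; b++) {          /* exclusive prefix sum */
            offset[b] = sum;
            sum += count[b];
        }
        for (int i = 0; i < N; i++)             /* stable copy-out */
            tmp[offset[(a[i] / div) % 10]++] = a[i];
        memcpy(a, tmp, sizeof(tmp));
    }
}
```

Because each pass touches every element with only per-bucket counters, the histogram and copy-out steps are natural candidates for the parallel implementations discussed above.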


Counting Sort


Input array: [0, 3, 2, 2, 3, 2, 0, 3, 2, 1]

(C0, C1, C2, C3) = [2, 1, 4, 3]: the count of occurrences of each value (C0 is the count of 0s)

(O0, O1, O2, O3) = (0, 2, 3, 7): taking the exclusive prefix sum of the counts transforms them into an offset table

Output: [0, 0, 1, 2, 2, 2, 2, 3, 3, 3]: the re-order (copy-out) step copies each value into the right location using the offset table

Counting sort and radix sort are (parallel-)architecture-friendly algorithms

🡺 Suitable for parallel architectures such as GPUs (including multi-core parallelization)
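The three steps can be written out directly. A hedged sketch (`N` and the value range `K` are fixed here for the slide's data): count, exclusive prefix sum into offsets, then copy each value to its slot; the result is stable and O(N + K).

```c
#define N 10
#define K 4   /* values are in 0..K-1 */

/* Counting sort: histogram, offset table, stable copy-out. */
void counting_sort(const int in[N], int out[N]) {
    int count[K] = {0};
    for (int i = 0; i < N; i++) count[in[i]]++;          /* histogram */
    int offset[K], sum = 0;
    for (int v = 0; v < K; v++) { offset[v] = sum; sum += count[v]; }
    for (int i = 0; i < N; i++) out[offset[in[i]]++] = in[i];
}
```

On the slide's input it reproduces the counts [2, 1, 4, 3], offsets (0, 2, 3, 7), and sorted output shown above.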

Huffman Tree Creation

Sort → Create Tree

[Figure: the sorted symbol/frequency list F:1, B:1, C:2, A:3, D:5, E:5 is merged into a tree with internal nodes of weight 2, 4, 7, and 10 and a root of weight 17; each left edge is labeled 0 and each right edge 1]


Huffman’s Algorithm:

  1. Pop lowest two nodes
  2. Create new node as parent
    1. Value is sum of children
    2. Insert new node into (sorted) work list
  3. Repeat

Caveat: Huffman is underspecified!

(Indeed, the ‘real’ algorithm only says make-and-merge trees)

103 of 152

Huffman Tree Creation

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

S

F

F

1

B

1

C

2

A

3

D

5

E

5

B

1

n1

2

F

1

Huffman’s Algorithm:

  1. Pop lowest two nodes
  2. Create new node as parent
    1. Value is sum of children
    2. Insert new node into (sorted) work list
  3. Repeat

104 of 152

Huffman Tree Creation

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

S

F

F

1

B

1

C

2

A

3

D

5

E

5

n1

2

n1

2

B

1

F

1

Huffman’s Algorithm:

  1. Pop lowest two nodes
  2. Create new node as parent
    1. Value is sum of children
    2. Insert new node into (sorted) work list
  3. Repeat

105 of 152

Huffman Tree Creation

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

S

F

F

1

B

1

C

2

A

3

D

5

E

5

n1

2

n1

2

B

1

F

1

Huffman’s Algorithm:

  1. Pop lowest two nodes
  2. Create new node as parent
    1. Value is sum of children
    2. Insert new node into (sorted) work list
  3. Repeat

106 of 152

Huffman Tree Creation

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

S

F

F

1

B

1

C

2

A

3

D

5

E

5

n1

2

B

1

n1

2

F

1

n2

4

C

2

Huffman’s Algorithm:

  1. Pop lowest two nodes
  2. Create new node as parent
    1. Value is sum of children
    2. Insert new node into (sorted) work list
  3. Repeat

107 of 152

Huffman Tree Creation

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

S

F

F

1

B

1

C

2

A

3

D

5

E

5

n1

2

n2

4

n1

2

n2

4

B

1

F

1

C

2

Huffman’s Algorithm:

  1. Pop lowest two nodes
  2. Create new node as parent
    1. Value is sum of children
    2. Insert new node into (sorted) work list
  3. Repeat

108 of 152

Huffman Tree Creation

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

S

F

F

1

B

1

C

2

A

3

D

5

E

5

n1

2

n2

4

B

1

n1

2

F

1

n2

4

C

2

A

3

Huffman’s Algorithm:

  1. Pop lowest two nodes
  2. Create new node as parent
    1. Value is sum of children
    2. Insert new node into (sorted) work list
  3. Repeat

109 of 152

Huffman Tree Creation

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

S

F

F

1

B

1

C

2

A

3

D

5

E

5

n1

2

n2

4

n3

7

n1

2

n2

4

n3

7

B

1

F

1

C

2

A

3

Huffman’s Algorithm:

  1. Pop lowest two nodes
  2. Create new node as parent
    1. Value is sum of children
    2. Insert new node into (sorted) work list
  3. Repeat

110 of 152

Huffman Tree Creation

[Diagram: pop D:5 and E:5, create parent n4:10; new list: n3:7, n4:10.]

111 of 152

Huffman Tree Creation

[Diagram: pop n3:7 and n4:10, create parent n5:17 (the root); the work list is empty and the tree is complete.]

112 of 152

Huffman Tree Creation

[Diagram: Sort the frequencies (F:1, B:1, C:2, A:3, D:5, E:5), then Create Tree: the finished Huffman tree rooted at n5:17.]

113 of 152

Huffman Tree Creation

[Diagram: the same tree with each edge labeled: 0 on left edges, 1 on right edges.]
114 of 152

Huffman Tree Creation

[Diagram: reading the edge labels from the root to each leaf yields that symbol's code.]

Symbol   Code

A        00
F        0100
B        0101
C        011
D        10
E        11

Huffman Tree Creation is a Sequential Process

🡺 Suitable for CPU architecture.
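The pop-lowest-two / merge loop above maps naturally onto a min-heap. A minimal Python sketch (the helper name is illustrative; it uses this deck's example frequencies). Note that Huffman tie-breaking is unspecified, so individual bit patterns may differ from the worked example, but the code lengths agree:

```python
import heapq
import itertools

def huffman_codes(freqs):
    """Build a Huffman tree from a sorted work list (a min-heap here),
    then read codes off the tree: 0 for left edges, 1 for right edges."""
    counter = itertools.count()  # tie-breaker so heapq never compares subtrees
    heap = [(f, next(counter), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # pop lowest two nodes
        f2, _, right = heapq.heappop(heap)
        # new parent's value is the sum of its children; re-insert
        heapq.heappush(heap, (f1 + f2, next(counter), (left, right)))
    _, _, root = heap[0]
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf symbol
            codes[node] = prefix
    walk(root, "")
    return codes

codes = huffman_codes({"F": 1, "B": 1, "C": 2, "A": 3, "D": 5, "E": 5})
```

The loop is inherently sequential: each merge depends on the list produced by the previous one, which is why this step suits a CPU.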

115 of 152

Huffman Trees Must Save Chosen Encoding!


Caveat: Huffman is underspecified!

(Indeed, the ‘real’ algorithm only says make-and-merge trees)

116 of 152

Text compression with Huffman Encoding

[Diagram: the encoder produces a codebook and encoded data, both of which are shipped to the decoder.]

117 of 152

Canonical Huffman Encoding

  • Memory efficient
  • Faster decoding

[Diagram: the encoder sends only each symbol's bit length; the decoder rebuilds the codebook from the lengths alone.]
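Why bit lengths suffice: canonical Huffman assigns codes deterministically from (symbol, length) pairs, so encoder and decoder reconstruct identical codebooks without exchanging the codes themselves. A sketch, assuming the code lengths from the earlier worked example (the function name is illustrative):

```python
def canonical_codes(bit_lengths):
    """Assign canonical Huffman codes from {symbol: bit length} alone.
    Sort by (length, symbol), then count upward, shifting left each
    time the code length grows."""
    codes = {}
    code = 0
    prev_len = 0
    for sym, length in sorted(bit_lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)                  # grow to the new length
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes

codes = canonical_codes({"A": 2, "D": 2, "E": 2, "C": 3, "B": 4, "F": 4})
```

Because the assignment order is fixed, the decoder needs no explicit codebook, only the lengths: this is what makes canonical Huffman memory efficient and fast to decode.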

118 of 152

Parallel Bit Length Calculation

[Diagram: pipeline Sort 🡺 Create Tree 🡺 Compute Bit Len. The frequencies F:1, B:1, C:2, A:3, D:5, E:5 are sorted, the tree is built, and nodes are stored in a flat table with Address, Left, Right, and Parent Address columns; the computed bit lengths are A:2, D:2, E:2, C:3, B:4, F:4.]

  • This storage (Parent Address, Left, Right) allows parallel bit-length calculation
  • Partitioning the Huffman tree horizontally at any point yields independent data in the Parent Address, Left, and Right columns
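A sketch of the idea (the `parents` dictionary below is a hypothetical rendering of the Parent Address column for the example tree, not the slide's exact table): a symbol's bit length is simply its depth, each lookup walks only its own parent chain, and no chain depends on another, so all symbols can be processed in parallel.

```python
# Hypothetical parent-pointer table for the example tree; None marks the root.
parents = {"F": "n1", "B": "n1", "n1": "n2", "C": "n2", "n2": "n3",
           "A": "n3", "n3": "n5", "D": "n4", "E": "n4", "n4": "n5", "n5": None}

def bit_length(symbol):
    """Code length = depth of the leaf: follow parent pointers to the root.
    Each call touches only its own chain, so calls are independent."""
    depth = 0
    node = parents[symbol]
    while node is not None:
        depth += 1
        node = parents[node]
    return depth

lengths = {s: bit_length(s) for s in "ABCDEF"}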


119 of 152

Software Optimization


120 of 152

Optimizations

121 of 152

Optimize 🡺 Performance

122 of 152

Measuring Performance

123 of 152

The bottom line: Performance

  • Time to do the task
    • execution time, response time, latency
  • Tasks per day, hour, week, sec, ns, …
    • throughput, bandwidth

                    Ferrari      Bus

Speed               160 mph      65 mph

Time to Bay Area    3.1 hours    7.7 hours

Passengers          2            60

Throughput (pmph)   320          3900
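The table's throughput column is just passengers times speed (passenger-miles per hour), a quick check of the arithmetic:

```python
def throughput_pmph(passengers, mph):
    # Passenger-miles per hour: useful work moved per unit time.
    return passengers * mph

ferrari = throughput_pmph(2, 160)   # low latency, low throughput
bus = throughput_pmph(60, 65)       # high latency, high throughput
```

The Ferrari wins on latency; the bus wins on throughput by more than 10x.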

124 of 152

Measures of “Performance”

  • Execution Time
  • Throughput (operations/time)
    • Transactions/sec, queries/day, etc.
  • Frame Rate
  • Responsiveness
  • Performance / Cost
  • Performance / Power
  • Performance / Energy

125 of 152

Why Study Computer Performance?

  • In order to say X is n times faster than Y
  • Identify the benefits of running a program on computer X, or the bottlenecks of a program on computer X
  • Embedded systems are complex (CPU + FPGA + GPU) 🡺 How do we measure the performance of a system consisting of sequential and parallel parts?
  • To understand common techniques to measure performance
  • To know what to measure & how to measure


126 of 152

Performance is Affected by,…


127 of 152

Performance is Affected by,…

  • Technology
    • Clock speed
    • Processor technology


128 of 152

Performance is Affected by,…

  • Technology
    • Clock speed
    • Processor technology
  • Organization
    • Single core vs. multi core
    • Processor implementation (SIMD, out-of-order, …)
    • Heterogeneous (SoC)
    • Memory hierarchy & memory type
    • Type of I/O devices
    • …


129 of 152

Performance is Affected by,…

  • Technology
    • Clock speed
    • Processor technology
  • Organization
    • Single core vs. multi core
    • Heterogeneous (SoC)
    • Types of processors (SIMD, out-of-order, …)
    • Memory hierarchy & memory type
    • Type of I/O devices
    • …
  • Software
    • Quality of compilers & libraries & parallel processing frameworks
    • Operating system


130 of 152

Principles of Measuring Performance

  • Well defined metrics
    • Execution time = Time spent running a program on computer X
  • Reproducibility
    • Should get the same result when running the same program on the same machine under the same conditions (OS, compiler settings, …)
  • Standard Benchmarks


131 of 152

Performance Metrics

  • Raw speed = peak performance
    • In practice, never attainable
    • Clock rate * # of floating-point operations per cycle (e.g., 10 Gflops)
  • Execution time = time to execute a program from beginning to end
    • The sooner the better
    • Wall clock time/elapsed time: Time to complete task as seen by the user. Might include operating system overhead or interference with other applications
    • CPU Time: does not include time spent for IO, external sources
      • User CPU time: CPU time spent in the program
      • System CPU time: CPU time spent in the OS performing tasks requested by the program
  • Throughput = total amount of work completed in a given time
    • Frames per second (fps), e.g., 30 fps
    • Transactions or packets per second
    • Indication of how well hardware resources are being used
  • Memory: Bandwidth & Latency


132 of 152

There are many ways to measure program execution time

  • Program-reported time?
  • Wall-clock time?
  • user CPU time?
  • user + kernel CPU time?

$ time make # cargo build

Compiling hail v0.1.0 (/tock/boards/hail)

Finished release [optimized + debuginfo] target(s) in 19.96s

real 0m21.146s

user 0m30.388s

sys 0m2.032s
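The same distinction can be probed from inside a program. A sketch using Python's standard timers, where `perf_counter` tracks wall-clock ("real") time and `process_time` tracks user + system CPU time; the sleep shows up only in the wall-clock figure:

```python
import time

def spin_and_sleep():
    """Busy-work plus a sleep: the sleep accrues wall-clock time
    but no CPU time, mirroring `real` vs `user` + `sys`."""
    t_wall = time.perf_counter()
    t_cpu = time.process_time()
    sum(i * i for i in range(200_000))  # CPU-bound work
    time.sleep(0.2)                     # waiting: wall-clock only
    return time.perf_counter() - t_wall, time.process_time() - t_cpu

wall, cpu = spin_and_sleep()
```

Note the `time make` output above: `user` exceeds `real` because the build ran on multiple cores, something a single wall-clock number would hide.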

133 of 152

Execution Time

  • Processor A is faster than processor B


  • Relative performance: performance of A is n times greater than that of B

134 of 152

How architects define Performance

  • Only has meaning in the context of a program or workload
  • Not very intuitive as an absolute measure, but most of the time we’re more interested in relative performance

Performance_X = 1 / Execution Time_X,  for program X

135 of 152

Relative Performance

  • Can be confusing…

A runs in 12 seconds

B runs in 20 seconds

    • A/B = 0.6, so A is 40% faster, or 1.4X faster, or B is 40% slower
    • B/A = 1.67, so A is 67% faster, or 1.67X faster, or B is 67% slower

  • Needs a precise definition

136 of 152

Relative Performance (Speedup), the Definition

Speedup (X/Y) = n = Performance_X / Performance_Y = Execution Time_Y / Execution Time_X

137 of 152

Example

  • Machine A runs program C in 9 seconds.
  • Machine B runs the same program in 6 seconds.
  • What is the speedup we see if we move to Machine B from Machine A?
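Working the example with the speedup definition (old execution time over new execution time):

```python
def speedup(time_old, time_new):
    # Speedup = Execution Time_old / Execution Time_new
    return time_old / time_new

s = speedup(9.0, 6.0)  # program C moved from machine A to machine B
```

Machine B gives a speedup of 1.5 over machine A.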

138 of 152

What is Time?

CPU Execution Time = CPU clock cycles * Clock cycle time

    • Every conventional processor has a clock with an associated clock cycle time or clock rate

    • Every program runs in an integral number (whole number) of clock cycles

Cycle Time

MHz = millions of cycles/second, GHz = billions of cycles/second

X MHz = 1000/X nanoseconds cycle time

Y GHz = 1/Y nanoseconds cycle time
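The MHz and GHz rules above are both instances of cycle time = 1 / frequency, scaled to nanoseconds; a quick check:

```python
def cycle_time_ns(freq_hz):
    # Cycle time in nanoseconds = 1 / frequency, scaled to ns.
    return 1e9 / freq_hz

mhz_100 = cycle_time_ns(100e6)  # 100 MHz -> 10 ns
ghz_1 = cycle_time_ns(1e9)      # 1 GHz  -> 1 ns
```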

139 of 152

CPU time

  • CPU time is the time CPU is computing
    • Not including the following:
      • Memory access time
      • IO access time
  • CPU time = Clock cycles * Clock Period
  • Clock period = 1/Clock frequency

  • 100 MHz of clock frequency = 10 ns of clock period
  • 1GHz of clock frequency = 1 ns of clock period


140 of 152

How many clock cycles?

Number of CPU clock cycles =

[Instruction count] * [Average Clock Cycles per Instruction (CPI)]

Exercise:

Computer A runs program C in 3.6 billion cycles.

Program C requires 2 billion dynamic instructions.

What is the CPI?
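Working the exercise with the formula above, CPI is total cycles over dynamic instruction count:

```python
def cpi(clock_cycles, instruction_count):
    # Average cycles per instruction = total cycles / dynamic instructions
    return clock_cycles / instruction_count

answer = cpi(3.6e9, 2e9)  # computer A running program C
```

3.6 billion cycles over 2 billion instructions gives a CPI of 1.8.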

141 of 152

Clock Cycles Per Instruction (CPI)

  • CPI is the average number of clock cycles per instruction
    • Throughput metric
    • Component metric (not a measure of performance)
  • Different classes of instructions can have different CPIs
    • CPI for floating point operations
    • CPI for adders, branch instructions
  • CPU Clock Cycles = Number of Instructions (NI) * CPI


With multiple instruction classes:

CPU Clock Cycles = Σ (CPI_i * NI_i)

where CPI_i is the CPI of the i-th class and NI_i is the number of instructions of the i-th class.
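A worked example of the per-class sum (the instruction mix below is invented for illustration):

```python
def average_cpi(classes):
    """Weighted-average CPI over instruction classes:
    total cycles = sum(CPI_i * NI_i), divided by total instructions."""
    cycles = sum(cpi_i * ni for cpi_i, ni in classes)
    instructions = sum(ni for _, ni in classes)
    return cycles / instructions

# (CPI_i, NI_i) pairs: e.g. ALU ops, loads, branches -- a hypothetical mix
avg = average_cpi([(1, 50), (2, 30), (3, 20)])
```

Here 50 + 60 + 60 = 170 cycles over 100 instructions, so the average CPI is 1.7.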

142 of 152

Putting it all together

CPU Execution Time = [CPU clock cycles] * [Clock cycle time]

CPU clock cycles = [Instruction count] * [Average Clock Cycles per Instruction (CPI)]

CPU Execution Time  =  Instruction Count  X  CPI  X  Clock Cycle Time

(seconds)              (instructions)        (cycles/instruction)  (seconds/cycle)
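A numeric check of the units (the instruction count, CPI, and clock rate below are hypothetical):

```python
def cpu_execution_time(instruction_count, cpi, clock_hz):
    # seconds = instructions * (cycles/instruction) * (seconds/cycle)
    return instruction_count * cpi * (1.0 / clock_hz)

# Hypothetical program: 2 billion instructions, CPI 1.8, 2 GHz clock
t = cpu_execution_time(2e9, 1.8, 2e9)
```

2e9 instructions at 1.8 cycles each is 3.6e9 cycles; at 0.5 ns per cycle, that is 1.8 seconds.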

143 of 152

CPU Execution Time

  • CPU Execution Time = CPU Clock Cycles * Clock Period

= NI * CPI * Clock Period

  • Execution time depends on
    • Number of instructions : Determined by the ISA
      • Programmable hardware counters can be used to find NI
      • Profiling tools such as gprof, valgrind
    • CPI: determined by the ISA & Compiler & Organization (Parallel execution)
      • Can be determined in software using a simulator
    • Clock period: Determined by the process technology and implementation


144 of 152

Cycle Time/Clock Rate is no longer fixed

  • Increasingly, modern processors can execute at multiple clock rates (cycle times).
  • Why?
  • However, the granularity at which we can change the cycle time tends to be fairly coarse, so all of these principles and formulas still apply.

145 of 152

Who Affects Performance? How?

  • programmer
  • compiler
  • instruction-set architect
  • machine architect
  • hardware designer
  • materials scientist/physicist/silicon engineer

CPU Execution Time = Instruction Count X CPI X Clock Cycle Time

146 of 152

Which programs are best, or “most fair”, to run when measuring performance?

  • peak throughput measures (simple programs)?
  • synthetic benchmarks (whetstone, dhrystone,...)?
  • Real applications
  • SPEC (best of both worlds, but with problems of their own)
    • System Performance Evaluation Cooperative
    • Provides a common set of real applications
      • Along with strict guidelines for how to run them
    • Provides a relatively unbiased means to compare machines.

147 of 152

Amdahl’s Law

  • The impact of a performance improvement is limited by the percent of execution time affected by the improvement

  • Make the common case fast!!

Execution time after improvement = (Execution Time Affected / Amount of Improvement) + Execution Time Unaffected
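Amdahl's Law as a small helper (names are illustrative), applied to a program whose affected part is 90% of a 1.0 s run:

```python
def amdahl_time(t_total, fraction_affected, improvement):
    """Execution time after improvement: only the affected fraction shrinks;
    the unaffected remainder is untouched."""
    affected = t_total * fraction_affected
    unaffected = t_total - affected
    return affected / improvement + unaffected

t2 = amdahl_time(1.0, 0.9, 2)  # 90% of the program sped up 2x
```

Halving 90% of the work yields 0.45 + 0.1 = 0.55 s, a speedup of only about 1.82x, not 2x.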

148 of 152

Amdahl’s Law and Massive Parallelism

[Diagram: execution time normalized to 1.0: a 0.9 parallelizable part plus a 0.1 serial part.]

149 of 152

Amdahl’s Law and Massive Parallelism

[Diagram: speeding up the 0.9 parallel part 2X leaves 0.1 + 0.45 = 0.55.]

Speedup: 1.0 (baseline); with 2X: 1/0.55 = 1.82

150 of 152

Amdahl’s Law and Massive Parallelism

[Diagram: speeding up the parallel part 4X leaves 0.1 + 0.225 = 0.325.]

Speedup: 1.0; 1/0.55 = 1.82; with 4X: 1/0.325 = 3.07

151 of 152

Amdahl’s Law and Massive Parallelism

[Diagram: no matter how far the parallel part is sped up, the 0.1 serial part remains.]

Speedup: 1.0; 1/0.55 = 1.82; 1/0.325 = 3.07; with NX: always < 10 (the limit is 1/0.1 = 10)
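Amdahl's Law caps the speedup from parallelism; a sketch with a 0.9 parallel fraction showing the sequence 1.82, 3.07, … never reaching 10:

```python
def amdahl_speedup(parallel_fraction, n):
    # Speedup when the parallel fraction runs n ways; the serial part is fixed.
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

speedups = [amdahl_speedup(0.9, n) for n in (1, 2, 4, 1_000_000)]
```

Even a million-way parallel machine tops out just below 10x when 10% of the program stays serial: make the common case fast.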

152 of 152

Key Points

  • Be careful how you specify performance
  • Execution time = instructions * CPI * cycle time
  • Use real applications
  • Use standards, if possible
  • Make the common case fast