
WES 237B: Master of Advanced Study in Wireless Embedded Systems

Pat Pannuto

Department of Computer Science and Engineering

University of California, San Diego

Summer Session 2023


WELCOME to WES 237B!

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai


Designing and Optimizing Embedded Systems (Signal/Image/Video Processing) Applications on SoC


What is an embedded system?


  • Specialized
  • Low power
  • Real-time
  • High-performance

Properties of embedded systems


A domain-specific + high-performance system


Achieving high performance while reducing power consumption is the key challenge. How do we do that?

There is no single answer! It usually involves trading off multiple factors: latency, throughput, power, cost, …


And what is “performance” anyway??


How to design an embedded system for this app?


void matrix_vector_multiply(int in1[SIZE][SIZE], int in2[SIZE], int out[SIZE]) {
    for (int i = 0; i < SIZE; i++) {
        int acc = 0;                   // accumulator must be declared and reset per row
        for (int j = 0; j < SIZE; j++) {
            acc = acc + in1[i][j] * in2[j];
        }
        out[i] = acc;
    }
}

void prefix_sum(int in[SIZE], int out[SIZE]) {
    out[0] = in[0];                    // seed the running sum
    for (int i = 1; i < SIZE; ++i) {
        out[i] = out[i-1] + in[i];
    }
}

int main() {
    matrix_vector_multiply(A, B, C);
    prefix_sum(C, D);
    return 0;
}
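A self-contained variant of the slide's application can be compiled and experimented with directly. This is a hedged sketch: `SIZE`, the array names, and the `static` helpers are ours, not the slides'; the logic matches the (corrected) kernels above.

```c
#include <assert.h>

#define SIZE 4

/* Matrix-vector product: out = in1 * in2, one accumulator per row. */
static void matrix_vector_multiply(int in1[SIZE][SIZE], int in2[SIZE], int out[SIZE]) {
    for (int i = 0; i < SIZE; i++) {
        int acc = 0;
        for (int j = 0; j < SIZE; j++)
            acc += in1[i][j] * in2[j];
        out[i] = acc;
    }
}

/* Inclusive prefix sum, seeded from the first element. */
static void prefix_sum(int in[SIZE], int out[SIZE]) {
    out[0] = in[0];
    for (int i = 1; i < SIZE; ++i)
        out[i] = out[i-1] + in[i];
}
```

With `A` as the identity matrix and `B = {1,2,3,4}`, the chain produces `C = {1,2,3,4}` and `D = {1,3,6,10}`.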

Example (prefix sum): in = [0, 2, 1, 6, 3, 1, 2, 5] → out = [0, 2, 3, 9, 12, 13, 15, 20]


Embedded systems design 🡪 Trade-offs

  • Performance, power, cost, programming choice,…etc.

[Mapping: both Matrix Vector Multiplication and Prefix sum run on the CPU]

[Mapping (FPGA + CPU): Matrix Vector Multiplication (parallel) on the FPGA; Prefix sum (sequential) on the CPU]


What if memory is the bottleneck?

[Mapping (GPU?): Matrix Vector Multiplication (parallel) on the GPU; Prefix sum (sequential) on the CPU]


Can you rewrite the sequential prefix-sum code to suit a parallel architecture?
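One possible answer, as a hedged sketch (the function name and fixed N are ours, not the course's solution): recompute the scan in log2(N) passes. Within each pass every element can be updated independently, which is exactly what a parallel architecture (GPU threads, FPGA lanes) wants; the loop-carried dependence of the sequential version is gone, at the cost of O(N log N) total work instead of O(N).

```c
#include <string.h>

#define N 8

/* Step-efficient inclusive scan (Hillis-Steele style), written in plain C.
 * The inner loop over i is data-parallel: every i reads from tmp and
 * writes to out, with no dependence between iterations of the same pass. */
void prefix_sum_parallel_friendly(const int in[N], int out[N]) {
    int tmp[N];
    memcpy(out, in, sizeof(tmp));
    for (int stride = 1; stride < N; stride *= 2) {
        memcpy(tmp, out, sizeof(tmp));     /* snapshot of the previous pass */
        for (int i = 0; i < N; i++) {
            out[i] = (i >= stride) ? tmp[i] + tmp[i - stride] : tmp[i];
        }
    }
}
```

This trades extra work for a shorter critical path: log2(N) dependent steps instead of N, a classic work-versus-span tradeoff.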



Why study signal/image/video processing?


  • Emerging applications such as ADAS, AR, and VR use vision/image processing
  • These require high performance and low power




What is a System-on-Chip (SoC)?


System on a Chip (SoC)



SoC


[SoC block diagram: CPU, Softcore, DSP, DMA, Memory Controller, UART, USB, ADC/DAC]


Vendors license ARM IP cores and develop SoCs

    • Xilinx Zynq/Pynq
    • NVIDIA Jetson

The interconnect protocol

    • Must be standard
    • Maintenance
    • Modularity (re-use)


Different protocols…

  • CoreConnect (IBM)
  • Wishbone (OpenCores/OSH)
  • AXI ← ARM’s
    • Xilinx has bought in the most here
  • Avalon (Altera, now Intel)

  • MIPI… [~mobile phones]
    • SLIMbus, LLI, UniPro, I3C, …



[SoC roles: CPU/GPU (covered in WES 237B), Co-processor / Accelerator Design (covered in WES 237C), Peripherals]


ARM

  • Introduced in the 1980s into consumer electronics
  • De facto embedded processor technology
    • Not usually manufactured as a standalone chip
    • Usually part of a System-on-Chip that is customized for a specific application
  • ARM is everywhere, including your phone, TV, and iPad, in the form of an SoC
  • An SoC is a collection of interconnected hardware modules
  • A typical SoC includes one or more ARM processor cores
  • The first major customer was Apple (Newton PDA)


Apple A8 (iPhone 6) … Apple A15 (iPhone 13)



ARM Technology

  • ARM (Advanced RISC Machine) is controlled by Arm (the company)
  • ARM writes the specifications of ARM processors
  • Specifications
    • Instruction set architecture
    • MMU
    • Interrupt and exception handling
    • Caches
    • Virtualization
  • CPUs based on the ARM1, ARM2, and ARM3 instruction set architectures (ISAs) were introduced around the late 1980s.
  • Due to competition from the Intel x86 and Motorola 68000 CPUs, ARM switched its business model
    • Selling the rights to use its processor designs or ISA (specification)
  • ARMv4, ARMv5, ARMv6, ARMv7, ARMv8


ARM Technology

  • ARM Holdings creates IP cores based on the specifications
  • IP core = VHDL or Verilog implementation of a specification
  • Examples:
    • ARM976 = implementation of ARMv5
    • ARM1176 = implementation of ARMv6
    • Cortex-A15 = implementation of ARMv7-A
    • Cortex-A53 = implementation of ARMv8-A
  • Multiple implementations of the same specification are possible
    • Cortex-A5, A7, …, A15
    • Cortex-A5 is low power; Cortex-A15 is high performance
    • They differ in pipeline depth, out-of-order execution, cache sizes, …


ARM Processor Families

  • Three ‘profiles’ of cores:
    • Cortex-A  Application
    • Cortex-R  Real-time
    • Cortex-M  Microcontroller
  • … get it?


ARM Technology

  • ARM processors are sold as reusable IP cores
  • A consistent instruction set allows
    • A common toolchain: compilers, debuggers, and code libraries
  • Most ARM processors are used with co-processors (FPGA, GPU, …)


ARM Technology

Specification (e.g., ARMv7) → ARM IP core (e.g., Cortex-A9) → SoC (e.g., Zynq) → Hardware Platform (e.g., Pynq, Zedboard)


WES237B Hardware

  • Xilinx PYNQ-Z1 (~2016)
    • Open source project to showcase python + FPGA
    • Xilinx Zynq Chip
      • 650MHz dual-core ARMv7-A Cortex-A9 processor
      • Programmable logic equivalent to Artix-7 FPGA
      • 512 MB DDR3 (peak bw: 2.1 GB/s)
    • Programmable Logic (PL) & Processing System (PS) part
    • Allows use of python/libraries to program PS/PL


  • NVIDIA Jetson TX2 Dev Kit (~2017)
    • Quick prototyping platform to highlight new core
    • NVIDIA TX2
      • Denver2 ARMv8 (64-bit) dual-core + ARMv8 ARM Cortex-A57 quad-core (64-bit)
      • NVIDIA Pascal GPU with 256 CUDA Cores
      • 8 GB DDR4 (peak bw: 59.7 GB/s)
    • RIP…



WES 237B

  • Course web:
  • Instructor: Pat Pannuto
    • Primary research thrusts on ‘more embedded’ applications

  • Amazing TA: Christopher Crutchfield
    • Teaching experience across many courses and significant experience with embedded platforms
    • Technically very strong – your hands-on expert!



What to expect from WES 237B?

  • Goal 1: Optimization and performance measurement of emerging applications on an SoC
    • Software optimization for SoC (pipelining, unrolling, multi-threading), blocking, memory optimizations
  • Goal 2: Understanding parallel programming paradigms and parallel patterns
    • Understanding parallel programming using emerging applications (covers some case studies using computer vision/image processing applications)



Grading and Grading Policy

  • Grading
    • Student group research and participation: 30%
      • Class participation: 10%
      • Group research and presentation: 20%
    • Assignments: 70%
      • Assignment 1: 10%
      • Assignments 2 to 5: 15% each

    • No-show/very late: −5% from your grade on each occasion (exceptions happen; preferably let us know in advance, but life happens)


Group Research Presentation

  • Content: how well does your presentation cover the given topic? 🡺 5%
  • Presentation: how well do you present the material, to an audience of wide background, in the allocated time? 🡺 5%
  • Discussion and future directions/questions: lead a discussion (pros/cons) of the technology, including benefits, current technological challenges/limitations and opportunities, and how this work differs from previous work (what are the main contributions)? 🡺 10%
    • Instructors/TAs/students will ask questions & participate in the discussion
  • Logistics
    • Send me the outline of your presentation at least a week before the class
    • Discuss/send course staff the (close-to-final) presentation 2 days before the class.


Student Group Presentation Topics (Old list…)

  • Emerging Memory Technologies & in-memory computing
  • Emerging Processing Technologies
  • Efficient Deep Learning Hardware/Software Architectures
  • Emerging Parallel Programming Systems of Parallel Architectures
    • Lai, Yi-Hsiang, et al. "HeteroCL: A multi-paradigm programming infrastructure for software-defined reconfigurable computing." Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 2019.
    • Li, Tzu-Mao, et al. "Differentiable programming for image processing and deep learning in Halide." ACM Transactions on Graphics (TOG) 37.4 (2018): 1-13.
  • Emerging Sensor Technologies and Enabling Applications



Assignments

  • Each week has an associated assignment
    • Due ~2 weeks later
  • Assignments have in-lab and take-home parts
    • If you don’t finish the lab part in class, finish at home
    • If you finish the lab early, feel free to start on the take-home parts
  • Collaboration?
    • In-lab
      • As much as you like (hands on your own keyboards please)
    • Take-home
      • More limited – general questions, specific library/technical corner cases okay, but largely your own work please



WES 237B does not teach you C!

  • We do not teach you the basics of C!
    • Although we do cover advanced concepts and optimizations
  • We will try to help you
  • You are expected to catch up if you do not have a C programming background
    • Assignment 1 (part two) is about C programming


Questions?


Let’s take a short break

Stretch your legs, get some water, meet your neighbors…



Embedded Compute


Processors

  • ISA
    • A definition of the instructions that a processor can execute
    • x86 is an ISA; there are many realizations of x86
    • ARMv1, ARMv2, …, ARMv7 are ISAs
  • Realization, or chip (“soft” or “hard” cores)


[Diagram: the ISA is the boundary between Software (above) and Hardware (below)]


An Instruction Set Architecture is an abstraction of a computational machine

  • An ISA is “the agreed-upon interface between all the software that runs on the machine and the hardware that executes it.”

[Layer diagram: Application → Operating System → Compiler → Instruction Set Architecture → Instruction Set Processor / I/O system → Digital Design → Circuit Design]


Computers do not speak English. And they do not speak C or Java or Python or Haskell (or …) either.

Specification (Programmer): “Swap two array elements.”

High Level Language (C, Java, Rust, etc):

    int temp = array[index];
    array[index] = array[index + 1];
    array[index + 1] = temp;

↓ Compiler

Assembly Language:

    lw $15, 0($2)
    lw $16, 4($2)
    sw $16, 0($2)
    sw $15, 4($2)

↓ Assembler

Machine Language:

    1000110001100010000000000000000
    1000110011110010000000000000100
    1010110011110010000000000000000
    1010110001100010000000000000100

↓ Machine Interpretation

Control Signal Spec:

    ALUOP[0:3] <= InstReg[9:11] & MASK


The Instruction Set Architecture

  • The part of the architecture that is visible to the programmer
    • available instructions (“opcodes”)
    • number and types of registers
    • instruction formats
    • storage access, addressing modes
    • exceptional conditions


Embedded Compute Units


Processor        Description                                          Applications
Microcontroller  A small computer (CPU, memory, I/O)                  Applications that require small memory and less computation
DSP              Processors specifically designed for signal          Audio (coding/decoding); Video/Image/Computer
                 processing: repetitive numeric computations,         Vision (convolution); Communications (FIR, FFT)
                 high memory bandwidth (arrays), real-time
                 processing. Objectives: power, development
                 time, throughput (processing)
GPU              Graphics processors (GPGPU)                          General (embedded 🡪 data center)
FPGA             Programmable custom hardware                         General (embedded 🡪 data center)


Design tradeoffs…


CPU – DSP – GPU – FPGA – ASIC

Toward ASIC: efficiency, performance (for a given workload). Toward CPU: generality, flexibility.


CPU (Latency) vs. GPU (Throughput)

CPU (latency-oriented):
  • Powerful ALU
    • Reduced operation latency
  • Large caches
    • Convert long-latency memory accesses to short-latency cache accesses
  • Sophisticated control
    • Branch prediction for reduced branch latency
    • Data forwarding for reduced data latency

GPU (throughput-oriented):
  • Small caches
    • To boost memory throughput
  • Simple control
    • No branch prediction
    • No data forwarding
  • Energy-efficient ALUs
    • Many; long latency, but heavily pipelined for high throughput
  • Requires a massive number of threads to tolerate latencies
    • Threading logic
    • Thread state


What is an FPGA?


A field-programmable gate array (FPGA)


FPGA

  • Flexibility to implement any logic
    • F(x,y,z) = xy + z’
    • F(x,y,z) = xyz
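That flexibility can be illustrated in software. This is a hedged sketch (the `lut3` type and index convention `x<<2 | y<<1 | z` are ours): an FPGA look-up table (LUT) implements any 3-input Boolean function as an 8-entry truth table, and "reprogramming" the table re-implements the logic.

```c
/* A 3-input LUT: 8 stored output bits, one per input combination. */
typedef struct { unsigned char table[8]; } lut3;

/* "Program" the LUT by evaluating a Boolean function at every input. */
static lut3 lut3_program(int (*f)(int, int, int)) {
    lut3 l;
    for (int x = 0; x <= 1; x++)
        for (int y = 0; y <= 1; y++)
            for (int z = 0; z <= 1; z++)
                l.table[x << 2 | y << 1 | z] = (unsigned char)f(x, y, z);
    return l;
}

/* "Hardware" evaluation is just a table lookup, whatever f was. */
static int lut3_eval(const lut3 *l, int x, int y, int z) {
    return l->table[x << 2 | y << 1 | z];
}

/* The two example functions from the slide. */
static int f1(int x, int y, int z) { return (x && y) || !z; }  /* xy + z' */
static int f2(int x, int y, int z) { return x && y && z; }     /* xyz    */
```

The same lookup structure evaluates both F(x,y,z) = xy + z' and F(x,y,z) = xyz; only the table contents change.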

63 of 152

Why FPGAs?


Vision (image) 🡪 high performance and low(er) power


Motivation: Applications of FPGA

Data centers[1] · Database[2] · CNN[3] · Big Data[4]

[1] Andrew Putnam et al., “A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services,” ISCA, 2014

[2] Casper et al., “Hardware Acceleration of Database Operations,” FPGA ’14

[3] Zhang et al., “Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks,” FPGA 2015

[4] Rouhani et al., “SSketch: An Automated Framework for Streaming Sketch-based Analysis of Big Data on FPGA,” FCCM 2015


Designing an FPGA?

You will learn in WES 237C!

VHDL, Verilog, SystemVerilog, …

VHDL???

High-Level Synthesis: a way to program an FPGA from C


C/C++ → [Magic] → VHDL (Verilog)

High-Level Synthesis

C/C++ + Directives → HLS → VHDL (Verilog)

  • Directives: pipeline, partition, dataflow, inline, reshape, resource, stream, top, interface, expression balance, reset, allocation, dependence, function_initiate, …

High-Level Synthesis

    void HelloWorld(int A[8], int B[8]) {
        for (int i = 0; i < 8; i++) {
            #pragma HLS unroll
            B[i] = A[i] * 3;
        }
    }
High-Level Synthesis

C/C++ + Directives → Vivado HLS → VHDL (Verilog) / IP → Vivado → Bitstream


Summary of Embedded Systems and Trends

  • Embedded systems 🡪 getting complex
  • High-performance computing & embedded systems 🡪 merging
  • GPUs and FPGAs are two key players
    • GPUs 🡪 easy to program, high power consumption
      • Good for bulk processing
    • FPGAs 🡪 hard(er) to program, low power consumption
      • Good for streaming applications
      • Flexibility
    • ASICs 🡪 expensive and not flexible
      • Best power/performance
  • FPGAs 🡪 moving towards SoC
    • Programming from high-level languages such as C


System on a Chip (SoC): Zynq

  • Processing System (PS)
    • Application processor unit (APU)
    • Memory interfaces
    • I/O peripherals (IOP)
    • Interconnect
  • Programmable Logic (PL)



System on a Chip (SoC)

  • Lower cost
  • Faster and more secure data transfer between components
  • Smaller physical size
  • Lower power consumption
  • Higher overall speed



PL Special Resources: BRAM & DSP48E1

  • Block RAMs (BRAMs)
    • RAM, ROM, FIFO
    • Each block RAM can store up to 36 Kb of data
      • Can be configured as one 36 Kb or two 18 Kb blocks
  • DSP48E1
    • Pre-adder/subtractor, multiplier, post-adder/subtractor
  • IOBs
    • Input/output blocks connecting the PL to external pins
  • XADC
    • A dedicated set of ADC blocks
  • Clocks
    • 4 clocks from the PS, plus other PL clocks


PS and PL interface

  • Advanced eXtensible Interface (AXI)
    • Version AXI4
    • Part of the ARM AMBA® open standard
    • AMBA was first released in 1996 for microcontrollers
    • De facto standard for on-chip communication


SoC Design Flow

Defining Requirements → Defining Specification → Software/Hardware Partition → (Software Development & Testing ∥ Hardware Development & Testing) → System Integration & Testing → Application



Compression: Huffman Encoding


Compression


[Figure: data flows through an Encoder, then a Decoder]


Huffman Encoding

  • Greedy algorithm
  • Input: text data (symbols)
    • Find the histogram of the symbols
    • L = symbols sorted by frequency (histogram)
    • Create a Huffman tree
      • Select the two symbols with minimum frequencies
      • Create a parent node whose frequency is their sum
      • Add the new node to L in sorted order
    • Assign a 0 to each left edge and a 1 to each right edge
    • Create each codeword by walking from the root of the Huffman tree to its leaf
    • Return the codewords
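The tree-building loop can be sketched compactly in C. This is a hedged sketch, not the course's reference code: the array-based node pool and linear minimum search are our simplifications (a real implementation would use a priority queue), and the frequencies baked into the test are the slide example AABACDCDDDDEEEEEF. It computes each symbol's code length (its leaf depth), which is enough to verify the tree shape.

```c
#define NSYM 6
#define MAXN (2 * NSYM)     /* leaves plus internal nodes */

/* Build a Huffman tree by repeatedly merging the two lowest-weight live
 * nodes, then recover each symbol's code length by walking parent links. */
void huffman_code_lengths(const int freq[NSYM], int len[NSYM]) {
    int weight[MAXN], parent[MAXN], alive[MAXN];
    int n = NSYM;
    for (int i = 0; i < NSYM; i++) {
        weight[i] = freq[i]; parent[i] = -1; alive[i] = 1;
    }
    for (int merges = 0; merges < NSYM - 1; merges++) {
        int a = -1, b = -1;                     /* two smallest live nodes */
        for (int i = 0; i < n; i++) {
            if (!alive[i]) continue;
            if (a < 0 || weight[i] < weight[a]) { b = a; a = i; }
            else if (b < 0 || weight[i] < weight[b]) { b = i; }
        }
        weight[n] = weight[a] + weight[b];      /* parent weight = sum of children */
        parent[n] = -1; alive[n] = 1;
        parent[a] = parent[b] = n;
        alive[a] = alive[b] = 0;
        n++;
    }
    for (int i = 0; i < NSYM; i++) {            /* leaf depth = code length */
        int d = 0;
        for (int p = parent[i]; p != -1; p = parent[p]) d++;
        len[i] = d;
    }
}
```

For the slide's frequencies (A:3, B:1, C:2, D:5, E:5, F:1) this yields lengths 2, 4, 3, 2, 2, 4, matching the codes derived later (A=00, B=0101, C=011, D=10, E=11, F=0100).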


Data Compression: Huffman Encoding

  • David Albert Huffman
    • Invented during his PhD at MIT
  • “A Method for the Construction of Minimum-Redundancy Codes”, 1952


  • Data: AAABB
  • Fixed-length encoding:
    • A = 001
    • B = 010
    • 🡪 001001001010010 🡪 15 bits
  • Huffman encoding (variable-length):
    • A = 1
    • B = 010
    • 🡪 111010010 🡪 9 bits


Histogram


Input:

AABACDCDDDDEEEEEF

Symbol  Frequency
A       3
B       1
C       2
D       5
E       5
F       1
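The histogram step itself is a short loop. A hedged sketch (the function name and 256-bin layout are ours): count how often each byte value occurs in the input string.

```c
#include <string.h>

/* Count occurrences of each symbol; 256 bins cover every byte value. */
void histogram(const char *data, int counts[256]) {
    memset(counts, 0, 256 * sizeof(int));
    for (size_t i = 0; data[i] != '\0'; i++) {
        counts[(unsigned char)data[i]]++;
    }
}
```

Running it on the slide's input AABACDCDDDDEEEEEF reproduces the frequency table above.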


Huffman Key Idea: Use the fewest number of bits to encode the most common symbols


Sorting by frequencies

  • Radix sort, merge sort, quick sort, etc.
  • We will learn how to design parallel sorting algorithms

Radix Sort

[Example: sort {123, 002, 999, 609, 111} one digit at a time, least-significant digit first; each pass is a stable sort on one digit:
  pass 1 (ones):     111, 002, 123, 999, 609
  pass 2 (tens):     002, 609, 111, 123, 999
  pass 3 (hundreds): 002, 111, 123, 609, 999]
[Each pass (stage 1, 2, 3) is built from Histogram (H) → Prefix sum (P) → Counting Sort (CS); chaining one counting-sort stage per digit yields Radix Sort (RS)]

  • Option 1: HW 🡪 pipeline the stages
  • Option 2: parallel hardware

There are a number of ways of designing parallel algorithms.
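The stage structure above can be sketched directly in C. This is a hedged sketch (array size `N` and decimal digits are our assumptions): each pass runs histogram, exclusive prefix sum (the offset table), and a stable copy-out, exactly the H → P → CS chain.

```c
#include <string.h>

#define N 5

/* Least-significant-digit radix sort over decimal digits. */
void radix_sort(int a[N], int digits) {
    int tmp[N];
    int div = 1;
    for (int d = 0; d < digits; d++, div *= 10) {
        int count[10] = {0};
        for (int i = 0; i < N; i++)             /* histogram of this digit */
            count[(a[i] / div) % 10]++;
        int offset[10], sum = 0;
        for (int b = 0; b < 10; b++) {          /* exclusive prefix sum */
            offset[b] = sum;
            sum += count[b];
        }
        for (int i = 0; i < N; i++)             /* stable copy-out */
            tmp[offset[(a[i] / div) % 10]++] = a[i];
        memcpy(a, tmp, sizeof(tmp));
    }
}
```

Because each pass touches every element with only per-bucket counters, the histogram and copy-out steps are natural candidates for the parallel implementations discussed above.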


Counting Sort


Input array: [0, 3, 2, 2, 3, 2, 0, 3, 2, 1]

(C0, C1, C2, C3) = [2, 1, 4, 3]: the count of occurrences of each value (C0 is the count of 0s)

(O0, O1, O2, O3) = (0, 2, 3, 7): taking the exclusive prefix sum of the counts transforms them into an offset table

Output: [0, 0, 1, 2, 2, 2, 2, 3, 3, 3]: the re-order (copy-out) step copies each value into the right location using the offset table

Counting sort and radix sort are (parallel-)architecture-friendly algorithms

🡺 Suitable for parallel architectures such as GPUs (including multi-core parallelization)
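The three steps can be written out directly. A hedged sketch (`N` and the value range `K` are fixed here for the slide's data): count, exclusive prefix sum into offsets, then copy each value to its slot; the result is stable and O(N + K).

```c
#define N 10
#define K 4   /* values are in 0..K-1 */

/* Counting sort: histogram, offset table, stable copy-out. */
void counting_sort(const int in[N], int out[N]) {
    int count[K] = {0};
    for (int i = 0; i < N; i++) count[in[i]]++;          /* histogram */
    int offset[K], sum = 0;
    for (int v = 0; v < K; v++) { offset[v] = sum; sum += count[v]; }
    for (int i = 0; i < N; i++) out[offset[in[i]]++] = in[i];
}
```

On the slide's input it reproduces the counts [2, 1, 4, 3], offsets (0, 2, 3, 7), and sorted output shown above.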

Huffman Tree Creation

Sort → Create Tree

[Figure: the sorted symbol/frequency list F:1, B:1, C:2, A:3, D:5, E:5 is merged into a tree with internal nodes of weight 2, 4, 7, and 10 and a root of weight 17; each left edge is labeled 0 and each right edge 1]


Huffman’s Algorithm:

  1. Pop lowest two nodes
  2. Create new node as parent
    1. Value is sum of children
    2. Insert new node into (sorted) work list
  3. Repeat

Caveat: Huffman is underspecified!

(Indeed, the ‘real’ algorithm only says make-and-merge trees)

103 of 152

Huffman Tree Creation

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

S

F

F

1

B

1

C

2

A

3

D

5

E

5

B

1

n1

2

F

1

Huffman’s Algorithm:

  1. Pop lowest two nodes
  2. Create new node as parent
    1. Value is sum of children
    2. Insert new node into (sorted) work list
  3. Repeat

104 of 152

Huffman Tree Creation

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

S

F

F

1

B

1

C

2

A

3

D

5

E

5

n1

2

n1

2

B

1

F

1

Huffman’s Algorithm:

  1. Pop lowest two nodes
  2. Create new node as parent
    1. Value is sum of children
    2. Insert new node into (sorted) work list
  3. Repeat

105 of 152

Huffman Tree Creation

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

S

F

F

1

B

1

C

2

A

3

D

5

E

5

n1

2

n1

2

B

1

F

1

Huffman’s Algorithm:

  1. Pop lowest two nodes
  2. Create new node as parent
    1. Value is sum of children
    2. Insert new node into (sorted) work list
  3. Repeat

106 of 152

Huffman Tree Creation

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

S

F

F

1

B

1

C

2

A

3

D

5

E

5

n1

2

B

1

n1

2

F

1

n2

4

C

2

Huffman’s Algorithm:

  1. Pop lowest two nodes
  2. Create new node as parent
    1. Value is sum of children
    2. Insert new node into (sorted) work list
  3. Repeat

107 of 152

Huffman Tree Creation

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

S

F

F

1

B

1

C

2

A

3

D

5

E

5

n1

2

n2

4

n1

2

n2

4

B

1

F

1

C

2

Huffman’s Algorithm:

  1. Pop lowest two nodes
  2. Create new node as parent
    1. Value is sum of children
    2. Insert new node into (sorted) work list
  3. Repeat

108 of 152

Huffman Tree Creation

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

S

F

F

1

B

1

C

2

A

3

D

5

E

5

n1

2

n2

4

B

1

n1

2

F

1

n2

4

C

2

A

3

Huffman’s Algorithm:

  1. Pop lowest two nodes
  2. Create new node as parent
    1. Value is sum of children
    2. Insert new node into (sorted) work list
  3. Repeat

109 of 152

Huffman Tree Creation

CC BY-NC-ND Pat Pannuto – Many slides adapted from Janarbek Matai

S

F

F

1

B

1

C

2

A

3

D

5

E

5

n1

2

n2

4

n3

7

n1

2

n2

4

n3

7

B

1

F

1

C

2

A

3

Huffman’s Algorithm:

  1. Pop lowest two nodes
  2. Create new node as parent
    1. Value is sum of children
    2. Insert new node into (sorted) work list
  3. Repeat

110 of 152

Huffman Tree Creation

[Diagram: pop D:5 and E:5, create parent n4:10; new list: n3:7, n4:10.]

111 of 152

Huffman Tree Creation

[Diagram: pop n3:7 and n4:10, create parent n5:17 (the root); the work list is empty and the tree is complete.]

112 of 152

Huffman Tree Creation

[Diagram: Sort the frequencies (F:1, B:1, C:2, A:3, D:5, E:5), then Create Tree: the finished Huffman tree rooted at n5:17.]

113 of 152

Huffman Tree Creation

[Diagram: the same tree with each edge labeled: 0 on left edges, 1 on right edges.]
114 of 152

Huffman Tree Creation

[Diagram: reading the edge labels from the root to each leaf yields that symbol's code.]

Symbol   Code

A        00
F        0100
B        0101
C        011
D        10
E        11

Huffman Tree Creation is a Sequential Process

🡺 Suitable for CPU architecture.
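The pop-lowest-two / merge loop above maps naturally onto a min-heap. A minimal Python sketch (the helper name is illustrative; it uses this deck's example frequencies). Note that Huffman tie-breaking is unspecified, so individual bit patterns may differ from the worked example, but the code lengths agree:

```python
import heapq
import itertools

def huffman_codes(freqs):
    """Build a Huffman tree from a sorted work list (a min-heap here),
    then read codes off the tree: 0 for left edges, 1 for right edges."""
    counter = itertools.count()  # tie-breaker so heapq never compares subtrees
    heap = [(f, next(counter), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # pop lowest two nodes
        f2, _, right = heapq.heappop(heap)
        # new parent's value is the sum of its children; re-insert
        heapq.heappush(heap, (f1 + f2, next(counter), (left, right)))
    _, _, root = heap[0]
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf symbol
            codes[node] = prefix
    walk(root, "")
    return codes

codes = huffman_codes({"F": 1, "B": 1, "C": 2, "A": 3, "D": 5, "E": 5})
```

The loop is inherently sequential: each merge depends on the list produced by the previous one, which is why this step suits a CPU.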

115 of 152

Huffman Trees Must Save Chosen Encoding!


Caveat: Huffman is underspecified!

(Indeed, the ‘real’ algorithm only says make-and-merge trees)

116 of 152

Text compression with Huffman Encoding

[Diagram: the encoder produces a codebook and encoded data, both of which are shipped to the decoder.]

117 of 152

Canonical Huffman Encoding

  • Memory efficient
  • Faster decoding

[Diagram: the encoder sends only each symbol's bit length; the decoder rebuilds the codebook from the lengths alone.]
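Why bit lengths suffice: canonical Huffman assigns codes deterministically from (symbol, length) pairs, so encoder and decoder reconstruct identical codebooks without exchanging the codes themselves. A sketch, assuming the code lengths from the earlier worked example (the function name is illustrative):

```python
def canonical_codes(bit_lengths):
    """Assign canonical Huffman codes from {symbol: bit length} alone.
    Sort by (length, symbol), then count upward, shifting left each
    time the code length grows."""
    codes = {}
    code = 0
    prev_len = 0
    for sym, length in sorted(bit_lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)                  # grow to the new length
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes

codes = canonical_codes({"A": 2, "D": 2, "E": 2, "C": 3, "B": 4, "F": 4})
```

Because the assignment order is fixed, the decoder needs no explicit codebook, only the lengths: this is what makes canonical Huffman memory efficient and fast to decode.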

118 of 152

Parallel Bit Length Calculation

[Diagram: pipeline Sort 🡺 Create Tree 🡺 Compute Bit Len. The frequencies F:1, B:1, C:2, A:3, D:5, E:5 are sorted, the tree is built, and nodes are stored in a flat table with Address, Left, Right, and Parent Address columns; the computed bit lengths are A:2, D:2, E:2, C:3, B:4, F:4.]

  • This storage (Parent Address, Left, Right) allows parallel bit-length calculation
  • Partitioning the Huffman tree horizontally at any point yields independent data in the Parent Address, Left, and Right columns
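A sketch of the idea (the `parents` dictionary below is a hypothetical rendering of the Parent Address column for the example tree, not the slide's exact table): a symbol's bit length is simply its depth, each lookup walks only its own parent chain, and no chain depends on another, so all symbols can be processed in parallel.

```python
# Hypothetical parent-pointer table for the example tree; None marks the root.
parents = {"F": "n1", "B": "n1", "n1": "n2", "C": "n2", "n2": "n3",
           "A": "n3", "n3": "n5", "D": "n4", "E": "n4", "n4": "n5", "n5": None}

def bit_length(symbol):
    """Code length = depth of the leaf: follow parent pointers to the root.
    Each call touches only its own chain, so calls are independent."""
    depth = 0
    node = parents[symbol]
    while node is not None:
        depth += 1
        node = parents[node]
    return depth

lengths = {s: bit_length(s) for s in "ABCDEF"}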


119 of 152

Software Optimization


120 of 152

Optimizations

121 of 152

Optimize 🡺 Performance

122 of 152

Measuring Performance

123 of 152

The bottom line: Performance

  • Time to do the task
    • execution time, response time, latency
  • Tasks per day, hour, week, sec, ns, …
    • throughput, bandwidth

                    Ferrari      Bus

Speed               160 mph      65 mph

Time to Bay Area    3.1 hours    7.7 hours

Passengers          2            60

Throughput (pmph)   320          3900
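The table's throughput column is just passengers times speed (passenger-miles per hour), a quick check of the arithmetic:

```python
def throughput_pmph(passengers, mph):
    # Passenger-miles per hour: useful work moved per unit time.
    return passengers * mph

ferrari = throughput_pmph(2, 160)   # low latency, low throughput
bus = throughput_pmph(60, 65)       # high latency, high throughput
```

The Ferrari wins on latency; the bus wins on throughput by more than 10x.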

124 of 152

Measures of “Performance”

  • Execution Time
  • Throughput (operations/time)
    • Transactions/sec, queries/day, etc.
  • Frame Rate
  • Responsiveness
  • Performance / Cost
  • Performance / Power
  • Performance / Energy

125 of 152

Why Study Computer Performance?

  • In order to say X is n times faster than Y
  • Identify the benefits of running a program on computer X, or the bottlenecks of a program on computer X
  • Embedded systems are complex (CPU + FPGA + GPU) 🡺 How do we measure the performance of a system consisting of sequential and parallel parts?
  • To understand common techniques to measure performance
  • To know what to measure & how to measure


126 of 152

Performance is Affected by,…


127 of 152

Performance is Affected by,…

  • Technology
    • Clock speed
    • Processor technology


128 of 152

Performance is Affected by,…

  • Technology
    • Clock speed
    • Processor technology
  • Organization
    • Single core vs. multi core
    • Processor implementation (SIMD, out-of-order, …)
    • Heterogeneous (SoC)
    • Memory hierarchy & memory type
    • Type of I/O devices
    • …


129 of 152

Performance is Affected by,…

  • Technology
    • Clock speed
    • Processor technology
  • Organization
    • Single core vs. multi core
    • Heterogeneous (SoC)
    • Types of processors (SIMD, out-of-order, …)
    • Memory hierarchy & memory type
    • Type of I/O devices
    • …
  • Software
    • Quality of compilers & libraries & parallel processing frameworks
    • Operating system


130 of 152

Principles of Measuring Performance

  • Well defined metrics
    • Execution time = Time spent running a program on computer X
  • Reproducibility
    • Should get the same result when running the same program on the same machine under the same conditions (OS, compiler settings, …)
  • Standard Benchmarks


131 of 152

Performance Metrics

  • Raw speed = peak performance
    • In practice, never attainable
    • Clock rate * # of floating-point operations per cycle (e.g., 10 Gflops)
  • Execution time = time to execute a program from beginning to end
    • The sooner the better
    • Wall clock time/elapsed time: Time to complete task as seen by the user. Might include operating system overhead or interference with other applications
    • CPU Time: does not include time spent for IO, external sources
      • User CPU time: CPU time spent in the program
      • System CPU time: CPU time spent in the OS performing tasks requested by the program
  • Throughput = total amount of work completed in a given time
    • Frames per second (fps), e.g., 30 fps
    • Transactions or packets per second
    • Indication of how well hardware resources are being used
  • Memory: Bandwidth & Latency


132 of 152

There are many ways to measure program execution time

  • Program-reported time?
  • Wall-clock time?
  • user CPU time?
  • user + kernel CPU time?

$ time make # cargo build

Compiling hail v0.1.0 (/tock/boards/hail)

Finished release [optimized + debuginfo] target(s) in 19.96s

real 0m21.146s

user 0m30.388s

sys 0m2.032s
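The same distinction can be probed from inside a program. A sketch using Python's standard timers, where `perf_counter` tracks wall-clock ("real") time and `process_time` tracks user + system CPU time; the sleep shows up only in the wall-clock figure:

```python
import time

def spin_and_sleep():
    """Busy-work plus a sleep: the sleep accrues wall-clock time
    but no CPU time, mirroring `real` vs `user` + `sys`."""
    t_wall = time.perf_counter()
    t_cpu = time.process_time()
    sum(i * i for i in range(200_000))  # CPU-bound work
    time.sleep(0.2)                     # waiting: wall-clock only
    return time.perf_counter() - t_wall, time.process_time() - t_cpu

wall, cpu = spin_and_sleep()
```

Note the `time make` output above: `user` exceeds `real` because the build ran on multiple cores, something a single wall-clock number would hide.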

133 of 152

Execution Time

  • Processor A is faster than processor B


  • Relative performance: performance of A is n times greater than that of B

134 of 152

How architects define Performance

  • Only has meaning in the context of a program or workload
  • Not very intuitive as an absolute measure, but most of the time we’re more interested in relative performance

Performance_X = 1 / Execution Time_X,  for program X

135 of 152

Relative Performance

  • Can be confusing…

A runs in 12 seconds

B runs in 20 seconds

    • A/B = 0.6, so A is 40% faster, or 1.4X faster, or B is 40% slower
    • B/A = 1.67, so A is 67% faster, or 1.67X faster, or B is 67% slower

  • Needs a precise definition

136 of 152

Relative Performance (Speedup), the Definition

Speedup (X/Y) = n = Performance_X / Performance_Y = Execution Time_Y / Execution Time_X

137 of 152

Example

  • Machine A runs program C in 9 seconds.
  • Machine B runs the same program in 6 seconds.
  • What is the speedup we see if we move to Machine B from Machine A?
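Working the example with the speedup definition (old execution time over new execution time):

```python
def speedup(time_old, time_new):
    # Speedup = Execution Time_old / Execution Time_new
    return time_old / time_new

s = speedup(9.0, 6.0)  # program C moved from machine A to machine B
```

Machine B gives a speedup of 1.5 over machine A.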

138 of 152

What is Time?

CPU Execution Time = CPU clock cycles * Clock cycle time

    • Every conventional processor has a clock with an associated clock cycle time or clock rate

    • Every program runs in an integral number (whole number) of clock cycles

Cycle Time

MHz = millions of cycles/second, GHz = billions of cycles/second

X MHz = 1000/X nanoseconds cycle time

Y GHz = 1/Y nanoseconds cycle time
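The MHz and GHz rules above are both instances of cycle time = 1 / frequency, scaled to nanoseconds; a quick check:

```python
def cycle_time_ns(freq_hz):
    # Cycle time in nanoseconds = 1 / frequency, scaled to ns.
    return 1e9 / freq_hz

mhz_100 = cycle_time_ns(100e6)  # 100 MHz -> 10 ns
ghz_1 = cycle_time_ns(1e9)      # 1 GHz  -> 1 ns
```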

139 of 152

CPU time

  • CPU time is the time CPU is computing
    • Not including the following:
      • Memory access time
      • IO access time
  • CPU time = Clock cycles * Clock Period
  • Clock period = 1/Clock frequency

  • 100 MHz of clock frequency = 10 ns of clock period
  • 1GHz of clock frequency = 1 ns of clock period


140 of 152

How many clock cycles?

Number of CPU clock cycles =

[Instruction count] * [Average Clock Cycles per Instruction (CPI)]

Exercise:

Computer A runs program C in 3.6 billion cycles.

Program C requires 2 billion dynamic instructions.

What is the CPI?
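Working the exercise with the formula above, CPI is total cycles over dynamic instruction count:

```python
def cpi(clock_cycles, instruction_count):
    # Average cycles per instruction = total cycles / dynamic instructions
    return clock_cycles / instruction_count

answer = cpi(3.6e9, 2e9)  # computer A running program C
```

3.6 billion cycles over 2 billion instructions gives a CPI of 1.8.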

141 of 152

Clock Cycles Per Instruction (CPI)

  • CPI is the average number of clock cycles per instruction
    • Throughput metric
    • Component metric (not a measure of performance)
  • Different classes of instructions can have different CPIs
    • CPI for floating point operations
    • CPI for adders, branch instructions
  • CPU Clock Cycles = Number of Instructions (NI) * CPI


With multiple instruction classes:

CPU Clock Cycles = Σ (CPI_i * NI_i)

where CPI_i is the CPI of the i-th class and NI_i is the number of instructions of the i-th class.
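A worked example of the per-class sum (the instruction mix below is invented for illustration):

```python
def average_cpi(classes):
    """Weighted-average CPI over instruction classes:
    total cycles = sum(CPI_i * NI_i), divided by total instructions."""
    cycles = sum(cpi_i * ni for cpi_i, ni in classes)
    instructions = sum(ni for _, ni in classes)
    return cycles / instructions

# (CPI_i, NI_i) pairs: e.g. ALU ops, loads, branches -- a hypothetical mix
avg = average_cpi([(1, 50), (2, 30), (3, 20)])
```

Here 50 + 60 + 60 = 170 cycles over 100 instructions, so the average CPI is 1.7.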

142 of 152

Putting it all together

CPU Execution Time = [CPU clock cycles] * [Clock cycle time]

CPU clock cycles = [Instruction count] * [Average Clock Cycles per Instruction (CPI)]

CPU Execution Time  =  Instruction Count  X  CPI  X  Clock Cycle Time

(seconds)              (instructions)        (cycles/instruction)  (seconds/cycle)
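A numeric check of the units (the instruction count, CPI, and clock rate below are hypothetical):

```python
def cpu_execution_time(instruction_count, cpi, clock_hz):
    # seconds = instructions * (cycles/instruction) * (seconds/cycle)
    return instruction_count * cpi * (1.0 / clock_hz)

# Hypothetical program: 2 billion instructions, CPI 1.8, 2 GHz clock
t = cpu_execution_time(2e9, 1.8, 2e9)
```

2e9 instructions at 1.8 cycles each is 3.6e9 cycles; at 0.5 ns per cycle, that is 1.8 seconds.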

143 of 152

CPU Execution Time

  • CPU Execution Time = CPU Clock Cycles * Clock Period

= NI * CPI * Clock Period

  • Execution time depends on
    • Number of instructions : Determined by the ISA
      • Programmable hardware counters can be used to find NI
      • Profiling tools such as gprof, valgrind
    • CPI: determined by the ISA & Compiler & Organization (Parallel execution)
      • Can be determined in software using a simulator
    • Clock period: Determined by the process technology and implementation


144 of 152

Cycle Time/Clock Rate is no longer fixed

  • Increasingly, modern processors can execute at multiple clock rates (cycle times).
  • Why?
  • However, the granularity at which we can change the cycle time tends to be fairly coarse, so all of these principles and formulas still apply.

145 of 152

Who Affects Performance? How?

  • programmer
  • compiler
  • instruction-set architect
  • machine architect
  • hardware designer
  • materials scientist/physicist/silicon engineer

CPU Execution Time = Instruction Count X CPI X Clock Cycle Time

146 of 152

Which programs are best, or “most fair”, to run when measuring performance?

  • peak throughput measures (simple programs)?
  • synthetic benchmarks (whetstone, dhrystone,...)?
  • Real applications
  • SPEC (best of both worlds, but with problems of their own)
    • System Performance Evaluation Cooperative
    • Provides a common set of real applications
      • Along with strict guidelines for how to run them
    • Provides a relatively unbiased means to compare machines.

147 of 152

Amdahl’s Law

  • The impact of a performance improvement is limited by the percent of execution time affected by the improvement

  • Make the common case fast!!

Execution time after improvement = (Execution Time Affected / Amount of Improvement) + Execution Time Unaffected
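Amdahl's Law as a small helper (names are illustrative), applied to a program whose affected part is 90% of a 1.0 s run:

```python
def amdahl_time(t_total, fraction_affected, improvement):
    """Execution time after improvement: only the affected fraction shrinks;
    the unaffected remainder is untouched."""
    affected = t_total * fraction_affected
    unaffected = t_total - affected
    return affected / improvement + unaffected

t2 = amdahl_time(1.0, 0.9, 2)  # 90% of the program sped up 2x
```

Halving 90% of the work yields 0.45 + 0.1 = 0.55 s, a speedup of only about 1.82x, not 2x.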

148 of 152

Amdahl’s Law and Massive Parallelism

[Diagram: execution time normalized to 1.0: a 0.9 parallelizable part plus a 0.1 serial part.]

149 of 152

Amdahl’s Law and Massive Parallelism

[Diagram: speeding up the 0.9 parallel part 2X leaves 0.1 + 0.45 = 0.55.]

Speedup: 1.0 (baseline); with 2X: 1/0.55 = 1.82

150 of 152

Amdahl’s Law and Massive Parallelism

[Diagram: speeding up the parallel part 4X leaves 0.1 + 0.225 = 0.325.]

Speedup: 1.0; 1/0.55 = 1.82; with 4X: 1/0.325 = 3.07

151 of 152

Amdahl’s Law and Massive Parallelism

[Diagram: no matter how far the parallel part is sped up, the 0.1 serial part remains.]

Speedup: 1.0; 1/0.55 = 1.82; 1/0.325 = 3.07; with NX: always < 10 (the limit is 1/0.1 = 10)
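Amdahl's Law caps the speedup from parallelism; a sketch with a 0.9 parallel fraction showing the sequence 1.82, 3.07, … never reaching 10:

```python
def amdahl_speedup(parallel_fraction, n):
    # Speedup when the parallel fraction runs n ways; the serial part is fixed.
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

speedups = [amdahl_speedup(0.9, n) for n in (1, 2, 4, 1_000_000)]
```

Even a million-way parallel machine tops out just below 10x when 10% of the program stays serial: make the common case fast.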

152 of 152

Key Points

  • Be careful how you specify performance
  • Execution time = instructions * CPI * cycle time
  • Use real applications
  • Use standards, if possible
  • Make the common case fast