1 of 37

Reconfigurable Computing

Prof. Dr. Vanderlei Bonato

The University of Sao Paulo (USP)

Institute of Mathematical and Computing Sciences (ICMC)

2 of 37

Agenda

  • Introduction

  • Processor types

  • Transistor technology

  • Modern FPGAs

3 of 37

(VAHID; GIVARGIS, 2002)

4 of 37

Introduction

  • ... communication also.

5 of 37

Processor types

  • General Purpose processor
    • Von Neumann paradigm

  • Domain specific processor
    • Ex. Digital Signal Processor

  • Application specific processor (very specific)
    • Application Specific Integrated Circuit
    • Ex. Tensor Processing Unit (TPU)
    • Ex. Neural Processing Unit (NPU)

  • Reconfigurable processor
    • Ex. Field-Programmable Gate Array (FPGA)
    • Ex. Coarse-Grained Reconfigurable Architecture (CGRA)

6 of 37

The von Neumann Computer

  • Principle

In 1945, the mathematician John von Neumann (vN) showed in a study of computation that a computer could have a simple, fixed structure capable of executing any kind of program, given a properly programmed control unit, without any hardware modification.

7 of 37

Structure of von Neumann machine

8 of 37

The von Neumann Computer

  • Structure
    • A memory for storing program and data
      • The memory consists of words of the same length

    • A control unit (control path) featuring a program counter for controlling program execution

    • An arithmetic and logic unit (ALU), also called the data path, for program execution
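
The three parts listed above can be sketched as a toy machine. This is a minimal illustration only: the four-instruction set (LOAD, ADD, STORE, HALT) and the memory layout are hypothetical and not taken from any real processor.

```python
# Toy von Neumann machine: one memory holds both instructions and data.
# The instruction set (LOAD/ADD/STORE/HALT) is hypothetical, chosen for brevity.

def run(memory):
    pc = 0   # program counter, owned by the control unit
    acc = 0  # accumulator, the ALU's working register
    while True:
        op, addr = memory[pc]  # fetch the instruction word
        pc += 1
        if op == "LOAD":
            acc = memory[addr]
        elif op == "ADD":
            acc = acc + memory[addr]  # ALU operation
        elif op == "STORE":
            memory[addr] = acc
        elif op == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode {op!r}")

# Addresses 0-3 hold the program, addresses 4-6 hold the data.
program = [
    ("LOAD", 4),
    ("ADD", 5),
    ("STORE", 6),
    ("HALT", 0),
    2,  # address 4: operand a
    3,  # address 5: operand b
    0,  # address 6: result
]

result = run(list(program))
print(result[6])  # 5
```

Changing the behaviour only requires writing different words into `program`; the `run` loop, which stands in for the hardware, stays untouched.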

9 of 37

The von Neumann Computer

  • Advantage:
    • Flexibility: any well coded program can be executed

  • Drawbacks
    • Speed efficiency: not efficient, due to sequential program execution (temporal resource sharing)
    • Resource efficiency: only part of the hardware resources is needed to execute an instruction; the rest remains idle
    • Memory access: memories are about 10 times slower than the processor
    • Drawbacks are compensated for using high clock speeds, pipelining, caches, instruction prefetching, etc.
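
As a rough illustration of how a cache offsets the slow memory, the standard average-access-time formula (hit time plus miss rate times miss penalty) can be evaluated directly. The 1-cycle hit time and 5% miss rate below are assumed values; only the 10x memory penalty comes from the slide.

```python
def average_access_cycles(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time: hit time plus the expected miss cost."""
    return hit_cycles + miss_rate * miss_penalty_cycles

# Assumed figures: 1-cycle cache hit, 10-cycle main-memory penalty, 5% miss rate.
with_cache = average_access_cycles(1, 0.05, 10)
without_cache = 10  # every access goes to main memory
print(with_cache, without_cache)
```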

10 of 37

Domain specific processors

  • Introduced by Texas Instruments over thirty years ago

  • A digital signal processor (DSP), by design, performs math more efficiently, particularly for applications with computation-intensive functions such as analytics, FFTs and matrix math in a constrained environment.

  • The DSP has evolved in its implementation from a standalone processor to a multicore processing element

11 of 37

Application examples from TI

12 of 37

Application specific processor

  • Device is manufactured to design specifications

  • An Integrated Circuit is called an Application Specific Integrated Circuit (ASIC) if it is designed for a specific application.
  • ASIC Types
    • Full-Custom
      • Design is done from scratch
    • Standard Cell
      • Based on predesigned logic gates
    • Gate Array
      • Transistors are predefined on the silicon wafer

13 of 37

Application specific processor

  • Implementation on a vN computer

    if (a < b) then {
        d = a+b;
        c = a*b;
    } else {
        d = a+1;
        c = b-1;
    }

    • At least 3 instructions
    • run-time >= 3 * t_instruction

  • ASIC implementation
    • The complete execution is done in parallel in one clock cycle
    • run-time = t_clock = delay of the longest path from input to output
    • The vN computer needs to be clocked at least 5 times faster
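
The contrast can be modelled in a few lines. This is a sketch under stated assumptions: `asic_datapath` mimics hardware that computes all four results at once and selects with multiplexers, and the path delays in nanoseconds are hypothetical.

```python
def asic_datapath(a, b):
    # All four results exist at once, as separate hardware units would produce them.
    add_ab, mul_ab, inc_a, dec_b = a + b, a * b, a + 1, b - 1
    sel = a < b  # comparator output drives two multiplexers
    d = add_ab if sel else inc_a
    c = mul_ab if sel else dec_b
    return d, c

def vn_runtime(n_instructions, t_instruction):
    """Sequential execution: instructions are issued one after another."""
    return n_instructions * t_instruction

def asic_runtime(path_delays):
    """Combinational execution: one clock cycle as long as the slowest path."""
    return max(path_delays)

# Hypothetical delays in ns for the comparator, adder and multiplier paths.
t_clock = asic_runtime([1.0, 2.0, 6.0])
# Lower bound from the slide (3 instructions), assuming t_instruction == t_clock.
t_vn = vn_runtime(3, t_instruction=t_clock)
print(asic_datapath(2, 3), t_clock, t_vn)  # (5, 6) 6.0 18.0
```

With equal cycle times the sequential version is slower by the instruction count, so the vN computer has to make up the gap with a faster clock.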

14 of 37

Reconfigurable Computing

  • Reconfigurable computing can be defined as the study of computations involving reconfigurable devices. This includes architectures, algorithms and applications [Bobda 2007].

15 of 37

Flexibility vs Efficiency

[Figure: flexibility versus performance of computing platforms; surviving labels: GPU, Performance]

Source: [Adapted from Bobda 2007]

16 of 37

Temporal vs. spatial based computing

Temporal-based execution (software)

Spatial-based execution (reconfigurable computing)

  • Ability to extract parallelism (or concurrency) from algorithm descriptions is the key to acceleration using reconfigurable computing
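
One way to picture the difference is to count steps when summing a list. In this sketch (function names are illustrative), the temporal version reuses one adder over time, while the spatial version models an adder tree in which every addition on a level happens in the same step.

```python
def temporal_sum(values):
    """One adder reused over time: one addition per step."""
    total, steps = values[0], 0
    for v in values[1:]:
        total += v
        steps += 1
    return total, steps

def spatial_sum(values):
    """Adder tree laid out in space: each level's additions happen together."""
    level, steps = list(values), 0
    while len(level) > 1:
        pairs = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            pairs.append(level[-1])  # odd element passes through to the next level
        level, steps = pairs, steps + 1
    return level[0], steps

data = [1, 2, 3, 4, 5, 6, 7, 8]
print(temporal_sum(data))  # (36, 7)
print(spatial_sum(data))   # (36, 3)
```

The spatial version needs more adders but only about log2(n) steps, which is the parallelism a reconfigurable device can exploit.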

17 of 37

Reconfigurable devices

  • Field-Programmable Gate Arrays (FPGAs) are one example of reconfigurable devices
  • An FPGA consists of an array of programmable logic blocks whose functionality is determined by programmable configuration bits
  • The logic blocks are connected by a set of routing resources that are also programmable
  • Custom logic circuits can be mapped to the reconfigurable fabric
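
A common way to picture a programmable logic block is as a lookup table (LUT) whose configuration bits form its truth table. The class below is a simplified model for illustration; it assumes a LUT-based block and leaves out flip-flops, carry logic and the programmable routing.

```python
class LUT:
    """k-input lookup table: 2**k configuration bits define its truth table."""

    def __init__(self, k, config_bits):
        if len(config_bits) != 2 ** k:
            raise ValueError("need 2**k configuration bits")
        self.k = k
        self.config_bits = list(config_bits)

    def evaluate(self, *inputs):
        if len(inputs) != self.k:
            raise ValueError(f"expected {self.k} inputs")
        # The inputs form an address into the configuration memory.
        index = 0
        for bit in inputs:
            index = (index << 1) | int(bool(bit))
        return self.config_bits[index]

# The same block becomes AND or XOR purely by changing its configuration bits.
and_gate = LUT(2, [0, 0, 0, 1])
xor_gate = LUT(2, [0, 1, 1, 0])
print(and_gate.evaluate(1, 1), xor_gate.evaluate(1, 1))  # 1 0
```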

18 of 37

Configuring FPGAs

[Maxfield’04]

FPGAs can be programmed before runtime or reprogrammed dynamically during runtime (virtual hardware). Reconfiguration can be:

    • full
    • partial

19 of 37

FPGA vs. ASIC Design Advantages

  • FPGA Design
    • Faster time-to-market
      • No layout, masks or other manufacturing steps are needed
    • No upfront non-recurring expenses (NRE)
      • Costs typically associated with an ASIC design
    • Simpler design cycle
      • Due to software that handles much of the routing, placement, and timing
    • More predictable project cycle
      • Due to elimination of potential re-spins, wafer capacities, etc.
    • Field reprogrammability
      • A new bitstream can be uploaded remotely

  • ASIC Design
    • Full custom capability
      • Since the device is manufactured to design specs
    • Lower unit costs
      • For very high volume designs
    • Smaller form factor
      • Since the device is manufactured to design specs
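
The trade-off between NRE and unit cost can be made concrete with a break-even calculation. This is a minimal sketch: it models total cost as NRE plus unit cost times volume, and every dollar figure is hypothetical.

```python
def total_cost(nre, unit_cost, volume):
    """Total project cost: one-time NRE plus per-unit cost times volume."""
    return nre + unit_cost * volume

def break_even_volume(asic_nre, asic_unit, fpga_unit, fpga_nre=0.0):
    """Volume above which the ASIC becomes cheaper than the FPGA."""
    if fpga_unit <= asic_unit:
        raise ValueError("ASIC never breaks even if its unit cost is not lower")
    return (asic_nre - fpga_nre) / (fpga_unit - asic_unit)

# Hypothetical figures in USD, for illustration only.
volume = break_even_volume(asic_nre=2_000_000, asic_unit=5.0, fpga_unit=45.0)
print(volume)  # 50000.0
```

Below the break-even volume the FPGA is the cheaper option; above it, the lower ASIC unit cost pays back the NRE.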

20 of 37

Current problems with conventional computing

  • Technology scaling doubled the number of devices in an IC (processors, FPGAs, etc.) every 2-3 years

  • Scaling also provided devices with reduced delay
    • frequency doubling (with aggressive pipelining)
    • increased power density

  • Increases in clock frequency have slowed down (or stopped)
    • the available devices are instead used to build multi-core processors
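
How much extra cores help depends on how much of the program can run in parallel, which Amdahl's law (listed in the references) quantifies. The sketch below assumes a program whose parallel fraction is 90%; that figure is illustrative only.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's law: speedup when only part of the work scales with cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)

# Assumed parallel fraction of 0.9; the serial 10% caps the speedup below 10x.
for cores in (2, 4, 8, 64):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
```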

21 of 37

3D/FinFET transistor technology

Synopsys, “FinFET Technology – Understanding and Productizing a New Transistor”, white paper, 2013

22 of 37

Nanosheet FET transistor technology (next generation)

Fig. 1: In nanosheet transistors, the gate contacts the channel on all sides (gate all around) and multiple sheets enable higher drive current than in finFETs. Silicon orientation differences (110 to 100) change the carrier mobilities in the channel. Source: K. Zhao, IBM/IEDM Tutorial 2021

23 of 37

Extreme Ultraviolet (EUV) technology

  • Finer resolution
  • Enables devices such as IBM's 2 nm nanosheet FET transistor

24 of 37

Semiconductor development chain

25 of 37

IC foundry concentration (16/14 nm: cost of ~12-15B USD)

Source: Global Semiconductor Alliance

26 of 37

27 of 37

Report showing current state of the semiconductor industry

  • STATE OF THE U.S. SEMICONDUCTOR INDUSTRY

https://www.semiconductors.org/wp-content/uploads/2022/11/SIA_State-of-Industry-Report_Nov-2022.pdf

  • From Fabless to Fabs Everywhere? Semiconductor Global Value Chains in Transition

https://www.wto.org/english/res_e/booksp_e/07_gvc23_ch4_dev_report_e.pdf

28 of 37

Current Transistors/chip

29 of 37

30 of 37

FPGAs with 3D/FinFET transistors

  • Stratix 10 (Altera/Intel)
    • 14 nm (Intel)
    • 70% reduction in power consumption
    • 4 million logic elements (LEs)
  • Virtex UltraScale (Xilinx/AMD)
    • 20 nm (TSMC), also extended to 16 nm as UltraScale+
    • 50% reduction in power consumption
    • 4.4 million logic cells
  • Everest (Xilinx/AMD)
    • 7 nm
    • Adaptable Computing Acceleration Platform (ACAP)
  • Speedster22i HD (Achronix)
    • 22 nm (Intel)
    • 1.7 million LUTs
    • E-FPGA
      • Speedcore IP (eFPGA)
      • Allows proven FPGA technology to be embedded into an ASIC

31 of 37

Xilinx Everest: Enabling FPGA Acceleration With ACAP

  • The scope of the Everest project is impressive: 1,500 engineers working for four years to build a chip with fifty billion transistors, with total costs running over $1B.
  • The chip turns out 20 times the performance of the company’s existing FPGAs for AI applications.
  • The performance over a CPU ranges from 10X for image processing, to 90X for data analytics, to 100X for genomic sequencing.

Source: https://www.forbes.com/sites/moorinsights/2018/03/26/xilinx-everest-enabling-fpga-acceleration-with-acap/#9a88c52342ea

32 of 37

ACAP TECHNICAL DETAILS

  • An ACAP has – at its core – a new generation of FPGA fabric with distributed memory and hardware-programmable DSP blocks, a multicore SoC, and one or more software programmable, yet hardware adaptable, compute engines, all connected through a network on chip (NoC). An ACAP also has highly integrated programmable I/O functionality, ranging from integrated hardware programmable memory controllers, advanced SerDes technology and leading edge RF-ADC/DACs, to integrated High Bandwidth Memory (HBM) depending on the device variant.
  • Software developers will be able to target ACAP-based systems using tools like C/C++, OpenCL and Python. An ACAP can also be programmed at the RTL level using FPGA tools.

33 of 37

Discrete and Integrated platform

Source: PK Gupta

34 of 37

Zynq UltraScale+ EG (Xilinx)

35 of 37

ESL: system-level development, integration, and testing

  • SDK for OpenCL (Altera)

  • Vivado (Xilinx)

  • HDL Coder (MathWorks)

  • MARTE, SysML (OMG)

  • BSV (BlueSpec)

www.forteds.com

36 of 37

Applications

  • Embedded robotics
  • Image processing
  • Smart sensors
  • Accelerators for high-performance computing
  • Green computing
  • …

37 of 37

References

  • Amdahl, G. “Validity of the Single-Processor Approach to Achieving Large-Scale Computing Capability”, Proceedings of the AFIPS Conference, 1967.
  • Bobda, C. “Introduction to Reconfigurable Computing: Architectures, Algorithms, and Applications”, Springer, 2007, 359 p.
  • Stallings, W. “Computer Organization and Architecture”, 8th Edition, 2010, 792 p.