1 of 171

Wakey-Wakey

A Low-Power Reconfigurable Wake Word Accelerator
{eldrick, mjpauly}@stanford.edu

Overview

2 of 171

Wake Word Recognition

Wake words are certain words or phrases that activate a system

“Hey Siri”, “Ok Google”, “Alexa”

Can be a standalone interface or paired with automatic speech recognition systems

Enables:

    • simple command parsing
    • isolation of commands
    • power-saving

“Cooper, go fetch!”

3 of 171

Motivation

Human speech interfaces enable...

  • Accessible, hands-free interaction
  • Use of natural speech and language
  • Many IoT / smart home applications

But face challenges in...

  • Power-consumption
  • Latency
  • Accuracy
  • Configurability

4 of 171

System Goals

We’re building a wake-word recognition system targeting:

01 Months-Long Longevity

  • on AA / AAA / Coin Cell
  • Nominal Usage (Toggling Lights)

02 Conversational Robustness

  • < 1s Response Time
  • Works in “office noise”

03 Accuracy & Reconfigurability

  • Can change wake word
  • > 90% accuracy on test set

5 of 171

System Architecture

6 of 171

Signal Processing Pipeline

Digital Front End

Conditions incoming digital audio signal with FIR filter. Contains gating controller.

Acoustic Featurization

Transforms conditioned audio signal into Mel-Frequency Cepstral Coefficients (MFCC), a type of acoustic feature

Word Recognition

Accelerates a 1D CNN network for recognizing the chosen wake word

DFE

ACO

WRD

7 of 171

External I/O

  1. I2C to/from Caravel or Arduino
     Read/Write Model Parameters

  2. I2C / SPI / PDM to ADC + Microphone
     Digitized Audio (VM3011: “Zero-Power Listening” Microphone + ADC)

8 of 171

Software Backend

Edge Impulse

  • Targets Embedded Platforms
  • Full ML Pipeline
  • Exports as C++ Gold Model

9 of 171

Neural Network Architecture

Estimate for Cortex-M4F @ 80MHz

1347 Parameters

10 of 171

11 of 171

Potential Extensions

01 Additional NN arch or MFCC parameter configurability

02 Audio passthrough after wake word detection

03 TBD! Drop us a suggestion :)

12 of 171

Timeline

Spring Break - Software Modeling (Gold Model), Existing IP Search

Spring Week 1 - WRD RTL + TB, ACO RTL + TB, Initial flow up to Simulation and Compilation

Spring Week 2 - WRD RTL + TB, ACO RTL + TB, Initial flow up to Synthesis, Order ADC/MIC

Spring Week 3 - WRD RTL + TB, ACO Verified, DFE RTL + TB, ADC/MIC Physical Testing

Spring Week 4 - WRD Verified, CFG RTL + TB, DFE Verified, ADC/MIC Verified

Spring Week 5 - CFG Verified, Initial Full System Synthesis, initial flow up to floorplan

Spring Week 6 - Full system verified, full system synthesis, initial flow up to signoff

Spring Week 7 - Floorplanning, power design, clocking and STA, physical design iteration

Spring Week 8 - Physical design iteration, RTL improvements

Spring Week 9 - Pre-sign off checks and improvements

Spring Week 10 - Sign-off, Send to Foundry, Tapeout!

13 of 171

Questions?

{eldrick, mjpauly}@stanford.edu

14 of 171

Motivation

- One slide recap of the motivation for your project, application areas

Your Design

- This is your complete verilog/schematic design

- Show top level block diagram and component level diagrams

- Explain how each component works

Functional Verification

- Describe how you created your gold model

- Show a list of tests you ran and a plot/screenshot showing that each test passed. This includes component level basic unit tests, end to end application tests, tests considering noise, PVT corners, variation etc.

- What further tests will you run in the remaining weeks, such as post layout validation, integration testing with caravel, more applications etc.

Design Space Exploration

Show any experiments you did to make design choices, such as sweeps of component values, parameters etc.

Evaluation

Evaluation metrics include estimates of

- area (must fit in caravel user project area)

- max frequency (with no timing violations), throughput, latency

- power, energy

- circuit specific figures of merit (efficiency, accuracy/error, temperature/voltage sensitivity etc)

For digital designs, post synthesis numbers are okay.

Plan

Plan for the remaining 5 weeks.

We will be happy to review your presentations during office hours!

15 of 171

Wakey-Wakey

A Low-Power Reconfigurable Wake Word Accelerator
{eldrick, mjpauly}@stanford.edu

Design Review 2021-05-03

16 of 171

Motivation

Human speech interfaces enable...

  • Accessible, hands-free interaction
  • Use of natural speech and language
  • Many IoT / smart home applications

But face challenges in...

  • Power-consumption
  • Latency
  • Accuracy

17 of 171

Wake Word Recognition

Wake words are certain words or phrases that activate a system

“Hey Siri”, “Ok Google”, “Alexa”

Can be a standalone interface or paired with automatic speech recognition systems

Enables:

    • simple command parsing
    • isolation of commands
    • power-saving

“Cooper, go fetch!”

18 of 171

System Goals

We’re building a wake-word recognition system targeting:

01 Months-Long Longevity

  • on AA Battery
  • Nominal Usage (Toggling Lights)

02 Conversational Robustness

  • < 250 ms Response Time
  • Works in “office noise”

03 Accuracy & Reconfigurability

  • Can change wake word
  • > 90% accuracy on test set

19 of 171

System Goals - Targets

| Metric | Target | Achieved | Notes |
|---|---|---|---|
| Area | < 10 mm² | | MPW-TWO User Project Area Constraint |
| Latency | < 250 ms | | Word Utterance to Wake Pin Assert |
| Freq. | 4 MHz | | Determined by sampling rate needed for PDM |
| Inference Energy Efficiency | ? pJ/Op | | |
| Idle Power Consumption | 0.89 mW | | 6 Months on Alkaline AA Battery (3.9 Watt Hrs) |
| Test Set Accuracy | > 90% | | Google Speech Commands Dataset + Microsoft Scalable Noisy Speech Dataset (Yes vs No/Unknown/Noise - 25 mins. per class) |
| Model Size | | | Target set by Area & Test Set Accuracy |
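The 0.89 mW idle-power target can be reproduced from the battery figures quoted above (3.9 Wh over 6 months). A minimal sketch of that arithmetic, assuming a month is 365/12 days; the function name is illustrative and not from the project:

```python
def average_power_budget_mw(battery_wh: float, months: float) -> float:
    """Average power (mW) that drains `battery_wh` in `months` months."""
    hours = months * (365.0 / 12.0) * 24.0  # assumption: mean month length
    return battery_wh / hours * 1000.0


budget_mw = average_power_budget_mw(battery_wh=3.9, months=6)
print(f"{budget_mw:.2f} mW")
```

Any always-on block has to fit inside this budget on average, which is why idle (leakage) power matters more here than peak inference power.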

20 of 171

Dev Notes

  • cocotb - testbenches in Python
  • DFFRAM memories
  • Analog Frontend (ADC, Voice Activity Detector) scrapped due to time limits. Off-chip microphone with voice activity detector used instead
  • AXI-Stream Interfaces for Datapath
    • Data / Valid / Last / Ready
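To illustrate the Data / Valid / Last / Ready handshake, here is a small cycle-level Python model. It is only a sketch: the `transfer` function and its arguments are hypothetical, the source is assumed to always have data (Valid held high), and a beat moves only on cycles where Valid and Ready are both high.

```python
def transfer(beats, ready_pattern):
    """Stream `beats` to a sink whose Ready follows `ready_pattern` (repeated).

    Returns (received, cycles) where received is a list of (data, last).
    """
    if not any(ready_pattern):
        raise ValueError("sink is never ready; transfer would not finish")
    received = []
    index = 0
    cycle = 0
    while index < len(beats):
        valid = True  # assumption: source always has a beat to offer
        ready = ready_pattern[cycle % len(ready_pattern)]
        if valid and ready:
            last = index == len(beats) - 1  # Last marks the final beat
            received.append((beats[index], last))
            index += 1
        cycle += 1
    return received, cycle


received, cycles = transfer([1, 2, 3], ready_pattern=[True, False])
```

With the sink ready every other cycle, the three beats take five cycles, and only the final beat carries Last.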

cocotb block diagram

21 of 171

System Architecture

22 of 171

Design Progress

RTL Complete and Verified

RTL Work In Progress

  • Revised schedule at end of presentation
  • PD flow is simple: all standard cells

Software Gold Models Complete

23 of 171

DFE: The Digital Front-End!

24 of 171

DFE Architecture

Pulse Density Modulation (PDM) Signal

Microphone output: 1b data @ 4MHz

DFE output: 8b data @ 16kHz

Vesper VM3011
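The rates above imply a decimation factor of 4 MHz / 16 kHz = 250. The sketch below models that conversion in NumPy under loud assumptions: the microphone is approximated by a first-order sigma-delta modulator, and the DFE's FIR filter is replaced by the simplest possible one (a 250-tap boxcar average). The actual filter design is not given in the slides.

```python
import numpy as np

PDM_RATE_HZ = 4_000_000
PCM_RATE_HZ = 16_000
DECIMATION = PDM_RATE_HZ // PCM_RATE_HZ  # 250 PDM bits per PCM sample


def pdm_modulate(signal):
    """First-order sigma-delta model: floats in [-1, 1] -> bits {0, 1}."""
    bits = np.empty(len(signal), dtype=np.uint8)
    integrator = 0.0
    feedback = 0.0
    for i, x in enumerate(signal):
        integrator += x - feedback
        bits[i] = 1 if integrator >= 0 else 0
        feedback = 1.0 if bits[i] else -1.0
    return bits


def pdm_to_pcm8(bits):
    """Boxcar-average each block of 250 bits and scale to signed 8-bit."""
    n = len(bits) // DECIMATION
    blocks = bits[: n * DECIMATION].reshape(n, DECIMATION).astype(np.int32)
    centered = 2 * blocks.sum(axis=1) - DECIMATION  # range -250..250
    scaled = np.round(centered * 127 / DECIMATION)
    return np.clip(scaled, -128, 127).astype(np.int8)


# A constant input of 0.5 should come back as roughly 0.5 * 127.
pcm = pdm_to_pcm8(pdm_modulate(np.full(16 * DECIMATION, 0.5)))
```

A boxcar has poor stop-band rejection, which is one plausible source of the "higher frequency noise" a real design would address with a better FIR.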

25 of 171

DFE Model

Software model

    • PDM microphone
    • Filtering
    • Wake word detection accuracy impact: -1.0%

Processing Quality

    • Audible difference in quality is small
    • Some higher frequency noise

Original Audio Sample

PDM modeled + filtered

26 of 171

Microphone output: 1b data @ 4MHz

DFE output: 8b data @ 16kHz

Vesper VM3011

DFE Architecture Detail

27 of 171

DFE Verification Plan

    • RTL work in progress
    • Verify RTL accuracy to gold model
    • Capture waveform from mic sample boards, test DFE pipeline on it using FPGAs

28 of 171

DFE: The Digital Front-End!

29 of 171

ACO: The Acoustic Featurizer!

30 of 171

ACO Architecture

    • Mel-frequency cepstral coefficients (MFCCs)
    • Collect 1-second of audio features
      • 20ms long frames (50 per sec)
      • 13 coefficients per frame

Mel Scale
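The frame numbers above fix the shape of the feature tensor. A rough sketch, assuming the 16 kHz DFE output rate, non-overlapping 20 ms frames (which is what 50 frames per second implies), and the commonly used mel formula m = 2595 log10(1 + f/700); the slides do not say which mel variant the ACO implements.

```python
import numpy as np

SAMPLE_RATE_HZ = 16_000  # DFE output rate
FRAME_MS = 20
N_COEFFS = 13


def frame_audio(samples):
    """Split audio into non-overlapping 20 ms frames (assumed: no overlap)."""
    frame_len = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 320 samples
    n_frames = len(samples) // frame_len
    return samples[: n_frames * frame_len].reshape(n_frames, frame_len)


def hz_to_mel(f_hz):
    """One common Hz-to-mel mapping (an assumption, not the ACO's spec)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)


frames = frame_audio(np.zeros(SAMPLE_RATE_HZ, dtype=np.int8))  # 1 second
features_per_second = frames.shape[0] * N_COEFFS
```

One second of audio therefore yields 50 frames of 320 samples, and 50 x 13 = 650 MFCC values feed the word recognizer.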

31 of 171

ACO Model

    • Started with SpeechPy MFCCs
    • Then quantized pipeline to get our bit-accurate gold model
    • Impact on detection accuracy: -1%

32 of 171

ACO Architecture Detail

33 of 171

ACO Architecture Detail

34 of 171

ACO Architecture Detail

35 of 171

ACO Verification

    • RTL work in progress
    • Verify RTL accuracy to gold model
    • Test with mic sample waveform

36 of 171

ACO: The Acoustic Featurizer!

37 of 171

WRD: The DNN Accelerator!

38 of 171

WRD Model

  • NN Architecture Exploration using Edge Impulse
  • Built our own PyTorch/NumPy model for bit accuracy

  • Google Speech Commands Dataset + Microsoft Scalable Noisy Speech Dataset
    • (Yes vs No/Unknown/Noise - 25 mins. per class)
  • Self-Quantized
  • Google Colab Notebook is publicly available! Train your own wake word :)
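The slides do not describe the quantization scheme itself. As a generic illustration of what quantizing weights to 8 bits can look like, here is a symmetric per-tensor int8 sketch; the scheme, the function names, and the scale choice are all assumptions rather than the project's method.

```python
import numpy as np


def quantize_int8(weights):
    """Symmetric per-tensor quantization: float weights -> (int8, scale)."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to approximate float weights."""
    return q.astype(np.float32) * scale


w = np.array([-1.0, -0.25, 0.0, 0.3, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
max_error = float(np.max(np.abs(dequantize(q, scale) - w)))
```

The round-trip error is bounded by half a quantization step, which is the kind of loss a bit-accurate gold model lets you measure against the float model.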

39 of 171

WRD Architecture

  • 3 Layer NN
    • 2 Conv Layers
    • 1 Fully Connected Layer
  • 2 Output Classes
    • Wake Word
    • Not Wake Word
      • Noise
      • Other Words
      • Silence

40 of 171

WRD Architecture

41 of 171

WRD Architecture

42 of 171

WRD Input

43 of 171

WRD Zero Pad

44 of 171

WRD 1D Convolution

45 of 171

WRD 1D Convolution

46 of 171

WRD 1D Convolution

47 of 171

WRD 1D Convolution

48 of 171

WRD 1D Convolution

49 of 171

WRD 1D Convolution

50 of 171

WRD 1D Convolution

51 of 171

WRD 1D Convolution

52 of 171

WRD Convolution Layer HW

53 of 171

WRD Convolution Layer HW

54 of 171

WRD Convolution Layer HW

55 of 171

WRD Convolution Layer HW

56 of 171

WRD Convolution Layer HW

57 of 171

WRD Convolution Layer HW

58 of 171

WRD Convolution Layer HW

59 of 171

WRD Convolution Layer HW

60 of 171

WRD Convolution Layer HW

61 of 171

WRD Convolution Layer HW

62 of 171

WRD Convolution Layer HW

63 of 171

WRD Convolution Layer HW

Cycles through 8 filters to complete

64 of 171

WRD Convolution Layer HW

Cycles through 8 filters to complete

“Recycles” the input MFCCs 8 times

65 of 171

WRD Convolution Layer HW

Cycles through 8 filters to complete

“Recycles” the input MFCCs 8 times

8 Filters * 50 Frames = 400 total cycles
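The cycle count above can be mirrored in a small software model: the outer loop steps through the 8 filters, and for each filter the 50 input frames are streamed again. This is a sketch only. The kernel width of 3 and the bias term are assumptions (the slides give 13 weights per memory row and zero padding, but not the kernel width), and one "cycle" here means one output value.

```python
import numpy as np

N_FRAMES, N_MFCC, N_FILTERS = 50, 13, 8
KERNEL = 3  # assumption: kernel width is not stated in the slides


def conv1d_same(x, w, b):
    """x: (frames, channels), w: (filters, kernel, channels), b: (filters,)."""
    pad = w.shape[1] // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # zero pad along the frame axis
    out = np.zeros((x.shape[0], w.shape[0]), dtype=np.int32)
    cycles = 0
    for f in range(w.shape[0]):        # cycle through the filters
        for t in range(x.shape[0]):    # "recycle" the input frames
            window = xp[t : t + w.shape[1]].astype(np.int32)
            out[t, f] = np.sum(window * w[f].astype(np.int32)) + int(b[f])
            cycles += 1
    return out, cycles


x = np.ones((N_FRAMES, N_MFCC), dtype=np.int8)
w = np.ones((N_FILTERS, KERNEL, N_MFCC), dtype=np.int8)
b = np.zeros(N_FILTERS, dtype=np.int32)
out, cycles = conv1d_same(x, w, b)
```

Accumulating in int32 avoids overflow when summing many int8 products, which a hardware MAC would handle with a wider accumulator.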

66 of 171

WRD 1D Convolution

67 of 171

WRD Max Pool

68 of 171

WRD Max Pool

Output of Conv -> Max Pool is serial data

Next conv layer expects 25 frames, each frame with 8 output channels - need to reshape data from serial to parallel
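The reshaping problem can be sketched in NumPy. Assumptions, none confirmed by the slides: the serial stream arrives filter by filter (all frames of filter 0, then filter 1, and so on), and the pool size is 2, inferred from 50 frames shrinking to 25.

```python
import numpy as np


def max_pool_serial(serial, n_filters=8, n_frames=50, pool=2):
    """Max-pool a filter-major serial stream along the frame axis."""
    x = np.asarray(serial).reshape(n_filters, n_frames)
    pooled = x.reshape(n_filters, n_frames // pool, pool).max(axis=2)
    return pooled.reshape(-1)  # still serial, still filter-major


def serial_to_parallel(serial, n_filters=8, n_frames=25):
    """Serial-in, parallel-out: one row per frame, one column per channel."""
    return np.asarray(serial).reshape(n_filters, n_frames).T


conv_out = np.arange(8 * 50)            # stand-in for conv layer output
pooled = max_pool_serial(conv_out)      # 200 serial values
frames = serial_to_parallel(pooled)     # 25 frames x 8 channels
```

In hardware the transpose corresponds to buffering the serial stream until every channel of a frame is available, which is the job of the serial-in-parallel-out block.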

69 of 171

WRD Architecture

70 of 171

WRD Serial-In-Parallel-Out

71 of 171

WRD Architecture

72 of 171

WRD Conv 2

73 of 171

WRD Conv 2

74 of 171

WRD Fully Connected Layer

75 of 171

WRD Fully Connected Layer

Output of conv2 -> max_pool2 is 208 values, serial.

76 of 171

WRD Fully Connected Layer

2 classes = 2 weight banks

Class 1: Wake Word
Class 2: Not Wake Word

208 Weights Each

77 of 171

WRD Fully Connected Layer

208 Cycles to Complete

78 of 171

WRD Fully Connected Layer

Upon completion, compare 2 class values.

Class 1 > Class 2 = WAKE!

79 of 171

WRD Fully Connected Layer

Sustain WAKE for N cycles
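The fully connected layer and the wake decision can be modeled as two multiply-accumulate banks that each consume one serial input per cycle, followed by a compare and a pulse stretcher. A sketch under assumptions: bias terms default to zero, and the function names and hold length are illustrative rather than taken from the RTL.

```python
N_INPUTS = 208


def fully_connected_wake(features, w_wake, w_not_wake, b_wake=0, b_not=0):
    """Accumulate both class scores serially; return (wake, cycles)."""
    acc_wake, acc_not = int(b_wake), int(b_not)
    cycles = 0
    for x, ww, wn in zip(features, w_wake, w_not_wake):
        acc_wake += int(x) * int(ww)   # bank 1: wake word
        acc_not += int(x) * int(wn)    # bank 2: not wake word
        cycles += 1
    return acc_wake > acc_not, cycles


def sustain_wake(detect, hold_cycles):
    """Stretch each detect pulse so WAKE stays high for `hold_cycles`."""
    out, counter = [], 0
    for d in detect:
        if d:
            counter = hold_cycles
        out.append(counter > 0)
        if counter > 0:
            counter -= 1
    return out


wake, cycles = fully_connected_wake([1] * N_INPUTS, [1] * N_INPUTS, [0] * N_INPUTS)
pin = sustain_wake([False, True, False, False, False, False], hold_cycles=3)
```

Holding WAKE for several cycles gives a slower external device time to sample the pin.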

80 of 171

WRD Detailed Architecture

81 of 171

WRD Detailed Architecture

82 of 171

WRD Architecture

83 of 171

WRD Verification

  • Significant unit and full WRD testing with random inputs, directed tests, and real audio examples
  • Direct integration with our Python model using cocotb!

85 of 171

WRD: The DNN Accelerator!

86 of 171

CFG: The DNN Configurator!

87 of 171

CFG Architecture

    • Wishbone Compliant Interface
    • Controlled via Caravel SoC
    • Considered external interfaces (I2C / SPI / UART)
    • 6 Wishbone Addressable 32b Registers
      • 4 Data
      • 1 Address
      • 1 Control

    • 4 Data Words Needed due to Conv 1 Memory Size (13 weights x 8b -> 104b -> 4 32b words)

    • Load / Store Operation via Control Register (self clearing)
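The "13 weights x 8b -> 104b -> 4 32b words" arithmetic can be made concrete with a packing helper. This is a sketch: placing weight 0 in the least significant byte of data word 0 is an assumption, since the slides do not give the bit ordering.

```python
N_WEIGHTS = 13


def pack_weights(weights):
    """Pack 13 signed 8-bit weights (104 bits) into four 32-bit words."""
    if len(weights) != N_WEIGHTS:
        raise ValueError("expected 13 weights")
    value = 0
    for i, w in enumerate(weights):
        if not -128 <= w <= 127:
            raise ValueError("weight does not fit in 8 bits")
        value |= (w & 0xFF) << (8 * i)  # assumed order: weight 0 in the LSB
    return [(value >> (32 * k)) & 0xFFFFFFFF for k in range(4)]


def unpack_words(words):
    """Inverse of pack_weights: four 32-bit words -> 13 signed weights."""
    value = 0
    for k, word in enumerate(words):
        value |= (word & 0xFFFFFFFF) << (32 * k)
    raw = [(value >> (8 * i)) & 0xFF for i in range(N_WEIGHTS)]
    return [b - 256 if b >= 128 else b for b in raw]


weights = list(range(-6, 7))
words = pack_weights(weights)
```

Only the low byte of the fourth word is used, because 104 bits is three full words plus 8 bits.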

88 of 171

CFG Write Transaction

Write to Conv 1 Filter 0 Weight 0

  1. Write Addr Register [0x3000_0000]
    • Parameter Address: 0x0000
  2. Write Data Register 0 [0x3000_0008]
  3. Write Data Register 1 [0x3000_000C]
  4. Write Data Register 2 [0x3000_0010]
  5. Write Data Register 3 [0x3000_0014]
  6. Write Ctrl Register [0x3000_0004]
    • Store Command: 0x1

89 of 171

CFG Read Transaction

Read Conv 1 Filter 0 Weight 0

  1. Write Addr Register [0x3000_0000]
    • Parameter Address: 0x0000
  2. Write Ctrl Register [0x3000_0004]
    • Load Command: 0x02
  3. Read Data Register 0 [0x3000_0008]
  4. Read Data Register 1 [0x3000_000C]
  5. Read Data Register 2 [0x3000_0010]
  6. Read Data Register 3 [0x3000_0014]
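The write and read sequences can be exercised against a tiny behavioral model of the six registers. The register addresses and command codes come from the slides; the `CfgModel` class, its methods, and the dictionary-backed parameter memory are hypothetical stand-ins, not the RTL or the firmware API.

```python
ADDR_REG = 0x3000_0000
CTRL_REG = 0x3000_0004
DATA_REGS = (0x3000_0008, 0x3000_000C, 0x3000_0010, 0x3000_0014)
CMD_STORE, CMD_LOAD = 0x1, 0x2


class CfgModel:
    """Behavioral stand-in for the CFG block (illustrative only)."""

    def __init__(self):
        self.regs = {addr: 0 for addr in (ADDR_REG, CTRL_REG, *DATA_REGS)}
        self.param_mem = {}

    def wb_write(self, addr, value):
        if addr == CTRL_REG:
            self._execute(value & 0xFFFFFFFF)
        else:
            self.regs[addr] = value & 0xFFFFFFFF

    def wb_read(self, addr):
        return self.regs[addr]

    def _execute(self, cmd):
        param = self.regs[ADDR_REG]
        if cmd == CMD_STORE:
            self.param_mem[param] = [self.regs[a] for a in DATA_REGS]
        elif cmd == CMD_LOAD:
            for a, word in zip(DATA_REGS, self.param_mem.get(param, [0] * 4)):
                self.regs[a] = word
        self.regs[CTRL_REG] = 0  # control register is self clearing


def store_param(bus, param_addr, words):
    bus.wb_write(ADDR_REG, param_addr)
    for reg, word in zip(DATA_REGS, words):
        bus.wb_write(reg, word)
    bus.wb_write(CTRL_REG, CMD_STORE)


def load_param(bus, param_addr):
    bus.wb_write(ADDR_REG, param_addr)
    bus.wb_write(CTRL_REG, CMD_LOAD)
    return [bus.wb_read(reg) for reg in DATA_REGS]


bus = CfgModel()
store_param(bus, 0x0000, [1, 2, 3, 4])
for reg in DATA_REGS:
    bus.wb_write(reg, 0)  # clobber the data registers before reading back
readback = load_param(bus, 0x0000)
```

Writing the address first and the command last means a single control write commits all four data words at once.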

90 of 171

CFG Model

  • Synthetic wishbone transaction testing is ongoing
  • Requires in-depth integration testing with Caravel and the PicoRV32

91 of 171

CFG: The DNN Configurator!

92 of 171

Let’s Zoom Back Out...

93 of 171

System Goals - Achieved

| Metric | Target | Achieved | Notes |
|---|---|---|---|
| Area | < 10 mm² | WRD = 2 mm², Total = 4 mm² (est) | MPW-TWO User Project Area Constraint |
| Latency | < 250 ms | ~ 152 µs | Word Utterance to Wake Pin Assert, ~ 2447 Cycles |
| Freq. | 4 MHz | 16 MHz | Determined by sampling rate needed for PDM |
| Inference Energy Efficiency | ? pJ/Op | | TODO: GL Sim via PrimeTime |
| Idle Power Consumption | 0.89 mW | WRD = 0.794 mW | 6 Months on Alkaline AA Battery (3.9 Watt Hrs), OpenSTA Leakage Estimate |
| Test Set Accuracy | > 90% | 96% | Google Speech Commands Dataset + Microsoft Scalable Noisy Speech Dataset (Yes vs No/Unknown/Noise - 25 mins. per class) |
| Model Size | | 1,140 Parameters (1,168 Bytes) | Target set by Area & Test Set Accuracy |
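The latency entry can be cross-checked from the cycle count and the achieved clock. A minimal sketch of that arithmetic, using the table's 2447 cycles, 16 MHz clock, and 250 ms target; the helper name is illustrative.

```python
def cycles_to_us(cycles: int, clock_hz: float) -> float:
    """Convert a cycle count at a given clock frequency to microseconds."""
    return cycles / clock_hz * 1e6


latency_us = cycles_to_us(2447, 16_000_000)
target_us = 250_000.0  # 250 ms target
margin = target_us / latency_us
print(f"{latency_us:.1f} us, about {margin:.0f}x under the target")
```

The result is close to the quoted ~152 µs, so the processing latency is small compared with the 250 ms response-time target.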

94 of 171

Microcontroller Comparison

Processing time on a Cortex-M4F @ 80MHz

MFCCs:

DNN:

Power comparison coming

95 of 171

System Goals - Post Route Power for WRD

15mW 4.8mW 0.79mW 20mW

96 of 171

  • Fri, May 7 - CFG RTL Completed & Verified
  • Wed, May 12 - DFE RTL Completed & Verified
  • Fri, May 21 - ACO RTL Completed & Verified
  • Wed, May 26 - Caravel Integration Complete
    • CFG Wishbone Transactions Verified
    • Clock Gating Implementation
    • Logic Analyzer Probe Integration
  • Fri, May 28 - GDS Streaming of Full Design
  • Fri, Jun 4 - Tapeout
  • ???, Dec ? - Post-Silicon Testing and Debug
    • Compare to microcontroller-only solution
    • Deploy to various use cases

Upcoming Development Schedule

WRD, post-route!

97 of 171

{eldrick, mjpauly}@stanford.edu

Questions?

98 of 171

Wakey-Wakey

A Low-Power Reconfigurable Wake Word Accelerator
{eldrick, mjpauly}@stanford.edu

Final Presentation 2021-05-25

99 of 171

High Level Overview

100 of 171

Recap: Where we were 3 weeks ago...

RTL Complete and Verified

RTL Work In Progress

  • Revised schedule at end of presentation
  • PD flow is simple: all standard cells

Software Gold Models Complete

101 of 171

Now

102 of 171

Our Progress to Date

103 of 171

Architecture Progress To Date

Custom Model Training + Quantization

104 of 171

Architecture Progress To Date

Custom Training + Quantization Pipeline

So Many Block Diagrams!

105 of 171

Architecture Progress To Date

Custom Training + Quantization Pipeline

So Many Block Diagrams!

Microphone Part Selection + Acquisition

106 of 171

Implementation Progress to Date

39 custom verilog modules

107 of 171

Implementation Progress to Date

6,362 lines of custom RTL code

39 custom verilog modules

108 of 171

Verification Progress to Date

5,596 lines of test bench code

734 lines of software model code

All modules have passing unit tests

All subsystems have passing integration tests

109 of 171

Physical Design Progress to Date

Fully open-source flow using OpenLane

110 of 171

Physical Design Results

Flow on DFE + CFG + WRD runs to completion

Meets Timing, Area (16 MHz, 2.5 mm²)

Some Antenna Violations to Resolve

111 of 171

Physical Design Results

Flow on full user design runs to placement, CTS

112 of 171

Physical Design Results

Flow on full user design runs to placement, CTS

Meets Timing, Area (16 MHz, 4.5 mm²)

hold slack - 0.19 ns

setup slack - 17.53 ns

critical path in ACO

working on getting a per-module area breakdown

113 of 171

Physical Design Results

Flow on full user design runs to placement, CTS

Meets Timing, Area (16 MHz, 4.5 mm²)

Failed at 6 AM today at Global Route

114 of 171

Recap: System Goals

| Metric | Target | Achieved | Notes |
|---|---|---|---|
| Area | < 10 mm² | WRD = 2 mm², Total = 4 mm² (est) | MPW-TWO User Project Area Constraint |
| Latency | < 250 ms | ~ 152 µs | Word Utterance to Wake Pin Assert, ~ 2447 Cycles |
| Freq. | 4 MHz | 16 MHz | Determined by sampling rate needed for PDM |
| Inference Energy Efficiency | ? pJ/Op | | TODO: GL Sim via PrimeTime |
| Idle Power Consumption | 0.89 mW | WRD = 0.794 mW | 6 Months on Alkaline AA Battery (3.9 Watt Hrs), OpenSTA Leakage Estimate |
| Test Set Accuracy | > 90% | 96% | Google Speech Commands Dataset + Microsoft Scalable Noisy Speech Dataset (Yes vs No/Unknown/Noise - 25 mins. per class) |
| Model Size | | 1,140 Parameters (1,168 Bytes) | Target set by Area & Test Set Accuracy |

115 of 171

System Goals

| Metric | Target | Achieved | Notes |
|---|---|---|---|
| Area | < 10 mm² | 4.5 mm² | MPW-TWO User Project Area Constraint |
| Latency | < 250 ms | ~ 152 µs | Word Utterance to Wake Pin Assert, ~ 2447 Cycles |
| Freq. | 4 MHz | 16 MHz | Determined by sampling rate needed for PDM |
| Inference Energy Efficiency | ? pJ/Op | | TODO: GL Sim via PrimeTime |
| Idle Power Consumption | 0.89 mW | | 6 Months on Alkaline AA Battery (3.9 Watt Hrs), OpenSTA Leakage Estimate |
| Test Set Accuracy | > 90% | 96% | Google Speech Commands Dataset + Microsoft Scalable Noisy Speech Dataset (Yes vs No/Unknown/Noise - 25 mins. per class) |
| Model Size | | 1,140 Parameters (1,168 Bytes) | Target set by Area & Test Set Accuracy |

116 of 171

  • Fri, May 7 - CFG RTL Completed & Verified
  • Wed, May 12 - DFE RTL Completed & Verified
  • Fri, May 21 - ACO RTL Completed & Verified
  • Wed, May 26 - Caravel Integration Complete
    • CFG Wishbone Transactions Verified
    • Clock Gating Implementation
    • Logic Analyzer Probe Integration (DFT)
  • Fri, May 28 - GDS Streaming of Full Design
  • Fri, Jun 4 - Tapeout
  • ???, Dec ? - Post-Silicon Testing and Debug
    • Compare to microcontroller-only solution
    • Deploy to various use cases

Recap: Development Schedule

118 of 171

What’s Next

119 of 171

What’s Next

Test classification accuracy of RTL end-to-end

120 of 171

What’s Next

Test writing parameters via the management core and the wishbone interface

121 of 171

What’s Next

Design for test

122 of 171

What’s Next

Improve idle power consumption

123 of 171

What’s Next

Asynchronous assert, synchronous deassert resets

124 of 171

What’s Next

Capture microphone test samples

125 of 171

What’s Next

Estimate power and timing more accurately with PrimeTime

126 of 171

{eldrick, mjpauly}@stanford.edu

Questions?

127 of 171

1. Start with a brief recap of what your project was about, show the system block diagram of the chip you taped out, and how it works.

2. Describe your chip bringup process: what software you wrote, tests you ran and how you debugged the board and the chip. Include some pictures of the working board + chip showing signs of life.

3. Present your measured results, and compare them with the results from your pre-tapeout simulations. If you were not able to test your chip fully, describe the partial testing you were able to do, and what issues you are running into.

4. Finally, summarize the key contributions and takeaways of the project, including things you think you should have done differently pre-tapeout to make bringup easier.

128 of 171

Wakey-Wakey

A Low-Power Reconfigurable Wake Word Accelerator
{eldrick, mjpauly}@stanford.edu

Bring-Up Presentation 2022-06-01

129 of 171

Wake Word Recognition

Wake words are certain words or phrases that activate a system

“Hey Siri”, “Ok Google”, “Alexa”

Can be a standalone interface or paired with automatic speech recognition systems

Enables:

    • simple command parsing
    • isolation of commands
    • power-saving

“Cooper, go fetch!”

130 of 171

Motivation

Human speech interfaces enable...

  • Accessible, hands-free interaction
  • Use of natural speech and language
  • Many IoT / smart home applications

But face challenges in...

  • Power-consumption
  • Latency
  • Accuracy
  • Configurability

131 of 171

System Goals

We’re building a wake-word recognition system targeting:

01 Months-Long Longevity

  • on AA / AAA / Coin Cell
  • Nominal Usage (Toggling Lights)

02 Conversational Robustness

  • < 1s Response Time
  • Works in “office noise”

03 Accuracy & Reconfigurability

  • Can change wake word
  • > 90% accuracy on test set

132 of 171

Hardware <> Software Co-Design

HW Conscious Quantization and Network Design

Fully Custom Verilog Modules

Low Power Microphone HW

133 of 171

High Level Overview

134 of 171

The Bring Up Process - Challenges

Only 1 Copy of Testboard

Received Test Board on Feb 1

Distributed / Remote Bring Up

Reliable GPIO Config

135 of 171

The Bring Up Process

Feb 1 - Received chips and test board

136 of 171

The Bring Up Process

Feb 28 - Successfully flashed caravel firmware!

And successfully tested wishbone parameter writes to internal DFFRAM!!!

137 of 171

https://github.com/eldrickm/caravel_board/blob/main/firmware/wakey/wakey.c

May 31 - Fully tested and verified all memory locations in the DFFRAM!!!!!

The Bring Up Process

138 of 171

It’s Alive!

139 of 171

CFG Architecture

    • Wishbone Compliant Interface
    • Controlled via Caravel SoC
    • Considered external interfaces (I2C / SPI / UART)
    • 6 Wishbone Addressable 32b Registers
      • 4 Data
      • 1 Address
      • 1 Control

    • 4 Data Words Needed due to Conv 1 Memory Size (13 weights x 8b -> 104b -> 4 32b words)

    • Load / Store Operation via Control Register (self clearing)

140 of 171

CFG Write Transaction

Write to Conv 1 Filter 0 Weight 0

  1. Write Addr Register [0x3000_0000]
    • Parameter Address: 0x0000
  2. Write Data Register 0 [0x3000_0008]
  3. Write Data Register 1 [0x3000_000C]
  4. Write Data Register 2 [0x3000_0010]
  5. Write Data Register 3 [0x3000_0014]
  6. Write Ctrl Register [0x3000_0004]
    • Store Command: 0x1

141 of 171

CFG Read Transaction

Read Conv 1 Filter 0 Weight 0

  1. Write Addr Register [0x3000_0000]
    • Parameter Address: 0x0000
  2. Write Ctrl Register [0x3000_0004]
    • Load Command: 0x02
  3. Read Data Register 0 [0x3000_0008]
  4. Read Data Register 1 [0x3000_000C]
  5. Read Data Register 2 [0x3000_0010]
  6. Read Data Register 3 [0x3000_0014]

142 of 171

CFG Model

  • Synthetic wishbone transaction testing is ongoing
  • Requires in-depth integration testing with Caravel and the PicoRV32

143 of 171

It’s Alive!

144 of 171

The Bring Up Process

Vesper Microphone voice activity detection

145 of 171

Current Challenge: MPRJ GPIO Config

    • We have 2 places where we should be able to verify successful GPIO configuration
      • VAD input -> triggers PDM clock output
      • Write WAKE output using our logic analyzer / debug utilities
    • However, we haven’t been able to configure the IOs such that either of these can be confirmed to work

146 of 171

DFE Architecture

Pulse Density Modulation (PDM) Signal

Microphone output: 1b data @ 4MHz

DFE output: 8b data @ 16kHz

Vesper VM3011

147 of 171

Current Challenge: GPIO Config

    • Our attempted slow clock workaround:
      • Used a 5V Arduino Uno with a voltage divider running a sketch that toggles a GPIO pin at ½ MHz (Eldrick doesn’t have a lot of equipment at his apartment)
    • Boots but still can’t configure IO - suggests that slower system clock speed will not fix our issue
    • Another board?

148 of 171

Current Challenge: GPIO Config

149 of 171

System Goals - ALL TBD!

| Metric | Target | Achieved | Observed In Practice | Notes |
|---|---|---|---|---|
| Area | < 10 mm² | 4.5 mm² | N/A | MPW-TWO User Project Area Constraint |
| Latency | < 250 ms | ~152 µs | TBD | Word Utterance to Wake Pin Assert, ~2447 Cycles |
| Freq. | 4 MHz | 16 MHz | TBD | Determined by sampling rate needed for PDM |
| Inference Energy Efficiency | ? pJ/Op | | TBD | |
| Idle Power Consumption | 0.89 mW | | TBD | 6 Months on Alkaline AA Battery (3.9 Wh), OpenSTA Leakage Estimate |
| Test Set Accuracy | > 90% | 96% | TBD | Google Speech Commands Dataset + Microsoft Scalable Noisy Speech Dataset (Yes vs No/Unknown/Noise, 25 min per class) |
| Model Size | | 1,140 Parameters (1,168 Bytes) | N/A | Target set by Area & Test Set Accuracy |
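
The latency and idle-power figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes the ~2447 cycles are counted at the 16 MHz clock, and that 0.89 mW is the average-power budget implied by running for six months on a 3.9 Wh cell. Both readings are inferences from the numbers, not statements from the authors, and the function names are made up for illustration.

```python
def latency_us(cycles: int, clock_hz: float) -> float:
    """Convert a cycle count at a given clock frequency to microseconds."""
    return cycles / clock_hz * 1e6


def average_power_budget_mw(capacity_wh: float, months: float) -> float:
    """Average power (mW) that drains `capacity_wh` in `months` (365-day year)."""
    hours = months / 12 * 365 * 24
    return capacity_wh / hours * 1e3


if __name__ == "__main__":
    lat = latency_us(2447, 16e6)
    print(f"latency: {lat:.1f} us")
    print(f"margin vs 250 ms target: {250e3 / lat:.0f}x")
    print(f"6-month budget on 3.9 Wh: {average_power_budget_mw(3.9, 6):.2f} mW")
```

Under these assumptions the cycle count works out to about 152.9 µs and the battery budget to about 0.89 mW, consistent with the values quoted above.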

150 of 171

Key Contributions and Takeaways

    • Big endeavor in hw/sw co-design
      • Neural network architecture adapted to hw constraints
      • Fully custom neural network accelerator in Verilog
    • Experimenting with new tools to make dev experience better (cocotb)
    • Tried to push through the full open source flow
      • Takeaway: large gap between open source and proprietary tools
    • Bring-up takeaway: the importance of being able to see inside the chip
      • Did a pretty good job of making GPIO config testable, but could have made it even easier

Codesign

Hardware

Software

151 of 171

1 Year Rewind: What to Do Differently

    • Make GPIO config even easier to test
      • simple input/output loopback
    • More floor planning to make the layout prettier

152 of 171

Glamour Shots

153 of 171

Glamour Shots

Matthew’s housewarming gift for Eldrick :)

154 of 171

eldrick@alumni.stanford.edu

mjpauly@stanford.edu

Questions?

155 of 171

Wakey-Wakey
