1 of 90

CS-773 Paper Presentation

Hardware-Software Co-Design for Brain-Computer Interfaces

Prakhar Diwan

Misfits (#2)

180100083@iitb.ac.in

Pictures adapted from "Hardware-Software Co-Design for Brain-Computer Interfaces" unless otherwise mentioned

2 of 90

Outline

  1. Introduction
  2. Background & BCI tasks
  3. HALO
  4. Evaluation
  5. Conclusion & Discussion

3 of 90

Introduction

4 of 90

What are BCIs?

  • Brain-Computer Interfaces (BCIs) are devices that capture and influence the activity of neurons in brain tissue.

  • They are classified into 2 categories:
    • Invasive BCIs
    • Non-invasive BCIs

5 of 90

Non-invasive BCIs

  • Low fidelity of sensed signals

6 of 90

Invasive BCIs

  • Higher fidelity of sensed signals

  • But stricter power constraints

7 of 90

Advancing research on Brain Functions

8 of 90

Prosthesis Control using Motor Cortex

11 of 90

Treatment of Neurological Diseases

The FDA has approved BCI-based treatment for:

  • Epilepsy
  • Parkinson’s disease
  • Dystonia and more

And this list is growing quickly

13 of 90

BCI Research: A promising field

  • Number of patients: ~ 200,000 (2 lakh)
  • The number grows every year

  • Various firms have shown interest

16 of 90

Relevance for CS 773

Software

Architecture

Ultra low-power multi-accelerator SoC

17 of 90

Background & BCI Tasks

19 of 90

BCI: High-Level View

~ 1 cm

~ 1 cm

20 of 90

BCI: High-Level View

~ 1 mm

22 of 90

BCI: High-Level View

Microelectrode: record/stimulate 5-10 neurons

24 of 90

BCI: High-Level View

Microelectrode array: record/stimulate hundreds of neurons

25 of 90

BCI: High-Level View

8-16 bits/sample at 20-50 kHz

26 of 90

BCI: High-Level View

2.4 GHz

29 of 90

BCI: High-Level View

Non-rechargeable

Rechargeable

15 years

Wireless, low power

30 of 90

Supported BCI Tasks

HALO

Movement intent

Seizure prediction

Spike detection

Compression

Encryption

31 of 90

Supported BCI Tasks

HALO

Seizure prediction

Analysis of neuronal firing patterns

FFTs, cross-correlation, and bandpass filters

32 of 90

Supported BCI Tasks

HALO

Movement intent

Neuronal firing pattern indicates use of limb

Stimulate motor cortex

33 of 90

Supported BCI Tasks

HALO

Spike detection

Detect spikes on the BCI itself

Less data to transmit
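
One of the spike detectors this deck lists for HALO is NEO (the nonlinear energy operator). The following is a minimal sketch, assuming the common definition psi[n] = x[n]^2 - x[n-1] * x[n+1] and a threshold set to a multiple of the mean energy. The function names and the `scale` factor are illustrative choices, not taken from the paper.

```python
def neo(samples):
    """Nonlinear energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    return [
        samples[n] ** 2 - samples[n - 1] * samples[n + 1]
        for n in range(1, len(samples) - 1)
    ]


def detect_spikes(samples, scale=8.0):
    """Return sample indices whose NEO energy exceeds scale * mean energy.

    `scale` is a hypothetical tuning knob chosen for this example.
    """
    energy = neo(samples)
    if not energy:
        return []
    threshold = scale * sum(energy) / len(energy)
    # energy[i] corresponds to samples[i + 1] because NEO skips both edges.
    return [i + 1 for i, e in enumerate(energy) if e > threshold]


# A flat, slightly noisy trace with one sharp spike at index 10.
trace = [0.1, -0.1] * 5 + [5.0] + [0.1, -0.1] * 5
spike_indices = detect_spikes(trace)
```

Only the detected indices (and, if needed, a short window around each) would have to leave the implant, which is where the reduction in transmitted data comes from.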

34 of 90

Supported BCI Tasks

HALO

Compression

Lossless compression

LZ4, LZMA, DWT

35 of 90

Supported BCI Tasks

HALO

Encryption

Future-proof

AES, 128-bit

38 of 90

BCI Requirements

  • Power consumption < 15 mW: safety critical
  • Sensor data rate ~ 46 Mbps: high processing rate
  • Response time ~ 5-10 ms: real-time
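
The ~46 Mbps figure can be reproduced as channels x sampling rate x bits per sample. The sketch below assumes a 96-channel array sampled at 30 kHz with 16-bit samples. The channel count is an assumption that does not appear on the slides; the other two values fall inside the ranges quoted earlier (8-16 bits/sample, 20-50 kHz).

```python
def sensor_data_rate_bps(channels, sample_rate_hz, bits_per_sample):
    """Raw sensor data rate in bits per second."""
    return channels * sample_rate_hz * bits_per_sample


# Assumed configuration (the channel count is not stated on the slides).
rate_bps = sensor_data_rate_bps(channels=96, sample_rate_hz=30_000, bits_per_sample=16)
rate_mbps = rate_bps / 1e6

# Energy available per sensed bit if the whole 15 mW budget were spent on it.
POWER_BUDGET_W = 15e-3
energy_per_bit_nj = POWER_BUDGET_W / rate_bps * 1e9
```

Under these assumptions the budget works out to roughly a third of a nanojoule per sensed bit, which gives a feel for why a general-purpose core struggles here.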

39 of 90

Current BCIs: Low Flexibility for High Data BW

40 of 90

Why is flexibility a concern for BCIs?

  • A single device for multiple solutions
  • Personalized treatment is more effective
  • Reduction in cost
  • Quicker FDA approval for a modular design

41 of 90

Tradeoff between Data BW and Flexibility

42 of 90

Ideal Scenario

43 of 90

HALO

44 of 90

HW Architecture for LOw Power BCIs

Flexibility

Performance

Power-Efficiency

HALO

45 of 90

Supporting common workloads

  • Compression (LZ4, LZMA, DWT)
  • Spike Detection (NEO, DWT)
  • Movement Intent
  • Seizure Prediction
  • Encryption

Hence a total of 8 distinct flows

46 of 90

Observations on two ends

  • General-purpose: low-power RISC-V microcontroller (µC)
    • Exceeds the power budget by a wide margin

  • Application-specific: one monolithic ASIC (m-ASIC) per BCI task
    • Poor area scaling
    • Exceeds the power budget for some tasks

  • Need for a finer-grained design than m-ASICs

47 of 90

Key Observations

  • Algorithms for BCI tasks can be split into distinct phases

  • A monolithic ASIC for a BCI task runs all of its phases at the same frequency

  • So split the phases into separate processing elements (PEs)

48 of 90

Two ideas to meet requirements

  • Run each PE at the minimum clock frequency that still meets its performance needs

  • Share and reuse common PEs among different BCI tasks
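
The first idea can be illustrated with a back-of-the-envelope calculation: a PE that needs c cycles per sample must run at no less than c times the incoming sample rate. The sketch below uses hypothetical cycle counts and an assumed 96-channel, 30 kHz input. It also treats dynamic power as proportional to clock frequency with equal capacitance and voltage for every PE, which is a deliberate simplification.

```python
SAMPLE_RATE_HZ = 30_000  # assumed per-channel rate (within the 20-50 kHz range)
CHANNELS = 96            # assumed channel count, not stated on the slides

# Hypothetical cycles each PE spends per incoming sample.
CYCLES_PER_SAMPLE = {"FFT": 4, "XCOR": 8, "SVM": 1}


def min_clock_hz(cycles_per_sample, samples_per_second):
    """Lowest clock that keeps up with the incoming sample stream."""
    return cycles_per_sample * samples_per_second


samples_per_second = SAMPLE_RATE_HZ * CHANNELS
per_pe_clock = {
    pe: min_clock_hz(cycles, samples_per_second)
    for pe, cycles in CYCLES_PER_SAMPLE.items()
}

# A monolithic design clocks every phase at the fastest rate any phase needs.
monolithic_clock = max(per_pe_clock.values())

# Relative dynamic power, taking power as proportional to frequency.
relative_power_split = sum(per_pe_clock.values())
relative_power_monolithic = monolithic_clock * len(per_pe_clock)
```

With these made-up numbers the per-PE clocks sum to about half of what the single shared clock would cost; the actual savings depend on the real kernels.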

49 of 90

HALO Architecture

50 of 90

HALO: Compression (LZ4)

51 of 90

HALO: Compression (LZMA)

52 of 90

HALO: Compression (DWTMA)

53 of 90

HALO: Spike Detection (DWT)

54 of 90

HALO: Spike Detection (NEO)

55 of 90

HALO: Movement Intent

56 of 90

HALO: Seizure Prediction

57 of 90

HALO: Encryption (AES)

58 of 90

HW-SW Codesign Techniques Applied

Technique                  Direction
Kernel PE Decomposition    SW -> HW
PE Reuse Generalization    SW -> HW
PE Locality Refactoring    HW -> SW
Spatial Reprogramming      SW -> HW
Counter Saturation         HW <-> SW
NoC Route Selection        SW -> HW

60 of 90

Kernel PE Decomposition: Two Ends

Coarsest granularity: monolithic ASICs

Finest granularity: each operator receives its own PE

HALO sits between the two ends

61 of 90

Example: Seizure Prediction Task

function seizure_prediction(input):
    fft_out = fft(input, NUM_FFT_POINTS)
    bbf_out = bbf(input, LOW_FREQ, HIGH_FREQ)
    xcorr_out = xcorr(input)
    p1 = svm_prediction(fft_out)
    p2 = svm_prediction(bbf_out)
    p3 = svm_prediction(xcorr_out)
    return ((p1 + p2 + p3) > 0)

63 of 90

Sig-proc kernels -> natural boundaries

function seizure_prediction(input):
    fft_out = fft(input, NUM_FFT_POINTS)
    bbf_out = bbf(input, LOW_FREQ, HIGH_FREQ)
    xcorr_out = xcorr(input)
    p1 = svm_prediction(fft_out)
    p2 = svm_prediction(bbf_out)
    p3 = svm_prediction(xcorr_out)
    return ((p1 + p2 + p3) > 0)

65 of 90

Benefit: 3x lower power consumption

66 of 90

PE Locality Refactoring

  • Some workloads don’t have natural boundaries, so they need to be refactored

  • Let’s understand this with the LZMA task

67 of 90

Example: LZMA Compression Task

function LZMA_COMPRESSION(input):
    output = list(lzma_header)
    while data = input.get() do
        best_match = find_best_match(data)
        match_prob = count(match_table, best_match) / count_total(match_table)
        r1 = range_encode(match_prob)
        output.push(r1)
        increment_counter(match_table, best_match)
    end while
    return output

69 of 90

Data locality works for choosing kernel boundaries

function LZMA_COMPRESSION(input):
    output = list(lzma_header)
    while data = input.get() do
        best_match = find_best_match(data)
        match_prob = count(match_table, best_match) / count_total(match_table)
        r1 = range_encode(match_prob)
        output.push(r1)
        increment_counter(match_table, best_match)
    end while
    return output
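
The `match_table` steps in the pseudocode (count, count_total, increment_counter) all touch the same state, which is what makes them a candidate for grouping into one PE. The sketch below shows such a table with fixed-width saturating counters. The 8-bit width and the halve-on-saturation policy are assumptions made for illustration and are not claimed to match HALO's counter saturation technique.

```python
class MatchTable:
    """Adaptive frequency table with saturating counters (illustrative)."""

    def __init__(self, num_symbols, counter_bits=8):
        self.max_count = (1 << counter_bits) - 1
        # Start every counter at 1 so no symbol has zero probability.
        self.counts = [1] * num_symbols

    def probability(self, symbol):
        """Equivalent of count(...) / count_total(...) in the pseudocode."""
        return self.counts[symbol] / sum(self.counts)

    def increment(self, symbol):
        if self.counts[symbol] == self.max_count:
            # Assumed policy: halve every counter so the values stay within
            # the fixed width while roughly preserving relative frequencies.
            self.counts = [max(1, c // 2) for c in self.counts]
        self.counts[symbol] += 1


table = MatchTable(num_symbols=4)
for _ in range(300):
    table.increment(2)
```

After 300 updates to one symbol, no counter exceeds 255 and the symbol's estimated probability stays close to 1.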

70 of 90

PE Reuse Generalization

  • BCI workloads share many common kernels, such as FFT and DWT

  • Seizure prediction ∩ Movement intent = FFT

71 of 90

PE Reuse Generalization

function seizure_prediction(input):
    fft_out = fft(input, NUM_POINTS=1024)
    …………
    return ……

function movement_intent(input):
    fft_out = fft(input, NUM_POINTS=25)
    …………
    return ……
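
The point of the two snippets is that one kernel, parameterized by its point count, can serve both tasks. The sketch below is a software stand-in: it uses a naive O(N^2) DFT from the standard library so that the 25-point case (not a power of two) works without extra machinery. A hardware PE would use a proper FFT, and the wrapper function names here are hypothetical.

```python
import cmath


def dft(samples, num_points):
    """Naive DFT over the first `num_points` samples, zero-padded if short."""
    x = list(samples[:num_points])
    x += [0.0] * (num_points - len(x))
    # Precompute the N twiddle factors once and index them modulo N.
    twiddle = [cmath.exp(-2j * cmath.pi * k / num_points) for k in range(num_points)]
    return [
        sum(x[n] * twiddle[(k * n) % num_points] for n in range(num_points))
        for k in range(num_points)
    ]


def seizure_prediction_features(samples):
    return dft(samples, num_points=1024)  # same kernel, 1024 points


def movement_intent_features(samples):
    return dft(samples, num_points=25)    # same kernel, 25 points


# A constant signal has all of its energy in bin 0.
spectrum = movement_intent_features([1.0] * 25)
```

Generalizing the kernel over `num_points` is what lets a single FFT PE be shared instead of building one per task.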

72 of 90

Summary

  • Algorithms are refactored into separate PEs
  • Each PE operates at its minimum required frequency
  • An asynchronous, circuit-switched network handles inter-PE communication
  • Low-power embedded microcontroller
    • Configures PEs into pipelines
    • Handles computation that the PEs do not support

73 of 90

Evaluation

74 of 90

“Lower the better”

HALO stays within the 15 mW power budget, unlike RISC-V and ASICs.

Safe and chronic use is possible.

75 of 90

Impact of HW-SW Co-Design Techniques on XCOR and LZMA workloads

“Lower the better”

76 of 90

For LZMA & DWTMA compression, the optimal block size is chosen

“Higher the better”

77 of 90

LZMA gives a better compression ratio but also consumes more power

78 of 90

Conclusion

79 of 90

Conclusion

  • Modular, flexible & configurable approach
  • Power-efficient, with adequate performance for safe and chronic use of BCIs
  • Easily scalable to incorporate future BCI tasks
  • Inspires ultra-low-power multi-accelerator SoC design

80 of 90

HALO outperforms other devices

81 of 90

Discussion

82 of 90

Discussion

  • How will multiple HALO implants fare?
    • Can a network of them be realised?
    • Will it stay within the power budget? ☹
    • How to mitigate it? Any ideas?
  • Running BCI tasks in parallel -> power budget ☹
  • Effect of shared memory between PEs on power?
    • Application to multi-accelerator SoCs may require this
    • Network complexity increases ☹

83 of 90

BACKUP

84 of 90

Spatial Reprogramming Helps! E.g., XCOR

Block-based processing leads to lethal burst-mode computation

85 of 90

Spatial Reprogramming Helps! E.g., XCOR

Burst-mode computation is avoided by processing data as it arrives, i.e. by spatially reprogramming the algorithm
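
The contrast between the two XCOR slides can be sketched in software. The block version waits for a full buffer and then does all the work at once; the streaming version keeps one accumulator per lag and does a small, fixed amount of work for each arriving sample. Class and function names are illustrative, and this is a simplified cross-correlation over non-negative lags, not HALO's actual PE design.

```python
from collections import deque


def xcorr_block(a, b, max_lag):
    """Block version: all lags computed in one burst after the data is buffered."""
    return [
        sum(a[n] * b[n - lag] for n in range(lag, len(a)))
        for lag in range(max_lag + 1)
    ]


class StreamingXcorr:
    """Streaming version: at most max_lag + 1 multiply-adds per sample pair."""

    def __init__(self, max_lag):
        self.acc = [0.0] * (max_lag + 1)
        # Most recent samples of b, newest first; older ones fall off the end.
        self.history = deque(maxlen=max_lag + 1)

    def push(self, a_sample, b_sample):
        self.history.appendleft(b_sample)
        for lag, past_b in enumerate(self.history):
            self.acc[lag] += a_sample * past_b


a = [1, 2, 3, 4]
b = [4, 3, 2, 1]
stream = StreamingXcorr(max_lag=2)
for a_sample, b_sample in zip(a, b):
    stream.push(a_sample, b_sample)
```

Both versions produce the same sums; the difference is only in when the work happens, which is what flattens the power profile.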

87 of 90

Detailed Task-wise power split into PEs

90 of 90

PE Data-Locality based Refactoring