1 of 20

Cholesky Decomposition

July 2024

ABC

1

OCP ODSA HPC & AI Modularity

2 of 20

References

  1. Cholesky Factorization on SIMD multi-core architectures, Florian Lemaitre, Benjamin Counturier, Lionel Lacassagne, HAL,2017. https://hal.science/hal-01550129

  • A Reconfigurable Processing Element for Cholesky Decomposition and Matrix Inversion, Aki Happonen, Adrian Burian, Erwin Hemming, 2005, https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=684fdee32b3dc8c9d8a888e032f7e6c64bfa17f4

NOTE: This is not a comprehensive lit search/survey

2

3 of 20

Cholesky Factorization

3

Source:[1]

Source:[1]

4 of 20

Cholesky Factorization

4

Floating points operation[1]

5 of 20

Cholesky Factorization FPGA architecture Implementation

  • 12 bit FPGA implementation[2]

5

6 of 20

Cholesky Factorization FPGA architecture Implementation

  • 12 bit FPGA implementation[2]

6

LUT Log2Dec Architecture

Control command

7 of 20

Cholesky Factorization architecture Implementation

  • Vector architecture [3]

7

8 of 20

Cholesky Factorization architecture Implementation

  • Cmod and Cdiv Processing Elements

8

Cmod Processing Elements [3]

Cdiv Processing Elements [3]

9 of 20

Cholesky Factorization architecture Implementation

  • Vector Version of Cholesky Algorithm[3]

9

Algorithm 4: Updating algorithm of front matrix Fj

Algorithm 3: Multifrontal Cholesky algorithm in vector version

10 of 20

Choesky Factorization Architecture

  • Execution of Cdiv with Mlane in PE

10

11 of 20

Performance Analysis

  • The analysis of sparse Cholesky factorization performance[3]

11

12 of 20

Evaluation dataset

  • nd24k

12

13 of 20

Evaluation dataset

  • nd3k

13

14 of 20

Evaluation dataset

  • shipsec1

14

15 of 20

Evaluation dataset

  • Trefethen_20000b

15

16 of 20

Performance Analysis

  • M=1;k=1 [3]

16

m is the number of PEs and k is the number of modules in one PE.

17 of 20

Performance Analysis

  • M=1;k=2 [3]

17

18 of 20

Performance Analysis

  • M=2;k=1 [3]

18

19 of 20

Performance Analysis

  • M=2;k=2 [3]

19

20 of 20

Performance Comparison

  • HSL_MA87 solver for multicore CPU platforms ; the GPU version of CHOLMOD solver for CPU-GPU platforms; subtree-based parallel methods for CPU-GPU platforms; Ours denotes prior art[3].
  • HSL_MA87, CHOLMOD, and GPU is 2.5 GHz, 3.2 GHz, and 2.93 GHz . Prior Art[3] frequency is 200 MHz. 

20