1 of 16

Parallel GPU-Based Algorithms for Matrix Computations

Team 1

Jiayi Yang (jy23273)

Yating Wu (yw23374)

2 of 16

Addition, Subtraction, Scalar Multiplication, Transposition

int ROW = blockIdx.y * blockDim.y + threadIdx.y;
int COL = blockIdx.x * blockDim.x + threadIdx.x;

C[ROW * N + COL] = A[ROW * N + COL] + B[ROW * N + COL];  // addition
C[ROW * N + COL] = A[ROW * N + COL] - B[ROW * N + COL];  // subtraction
C[ROW * N + COL] = A[ROW * N + COL] * scalar;            // scalar multiplication
C[COL * N + ROW] = A[ROW * N + COL];                     // transposition

Time: O(1)
Work: O(N^2)
∈ NC? Yes
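A minimal sketch of how these one-line bodies fit into a full kernel (element-wise addition shown; the other three operations differ only in the assignment line). The kernel name, the N parameter, the bounds check, and the launch shape are our additions:

__global__ void matAdd(const float* A, const float* B, float* C, int N) {
    int ROW = blockIdx.y * blockDim.y + threadIdx.y;
    int COL = blockIdx.x * blockDim.x + threadIdx.x;
    if (ROW < N && COL < N) {   // guard threads that fall outside the matrix
        C[ROW * N + COL] = A[ROW * N + COL] + B[ROW * N + COL];
    }
}

// one thread per element:
// dim3 block(16, 16);
// dim3 grid((N + 15) / 16, (N + 15) / 16);
// matAdd<<<grid, block>>>(dA, dB, dC, N);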

3 of 16

Matrix Multiplication

int ROW = blockIdx.y * blockDim.y + threadIdx.y;
int COL = blockIdx.x * blockDim.x + threadIdx.x;
// each thread computes one output element as a length-N dot product

Time: O(N)
Work: O(N^3)
∈ NC? No (but it can be brought into NC with O(n^3) threads and a tree reduction for each dot product)
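A minimal sketch of the per-element kernel implied above, assuming one thread per output element; the kernel name and bounds check are our additions:

__global__ void matMul(const float* A, const float* B, float* C, int N) {
    int ROW = blockIdx.y * blockDim.y + threadIdx.y;
    int COL = blockIdx.x * blockDim.x + threadIdx.x;
    if (ROW < N && COL < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++)          // the O(N) dot-product loop
            sum += A[ROW * N + k] * B[k * N + COL];
        C[ROW * N + COL] = sum;
    }
}

With O(n^2) threads each running an O(n) loop, this gives the O(N) time / O(N^3) work figures above; replacing the loop by a log-depth tree reduction over O(n^3) threads is what the NC remark refers to.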

4 of 16

Right-Looking

[Figure: right-looking LU — at step k, row and column k are referenced and the trailing submatrix is updated]

5 of 16

for (k = 0; k < n; k++) {  // host loop: updating L and U takes two draws
                           // (kernel launches) and two copies from the CPU per step
    // Pass 1: scale the pivot column (i is pid_i)
    for (i = k+1; i < n; i++) {
        A[i][k] = A[i][k] / A[k][k];
    }
    // Pass 2: rank-1 update of the trailing submatrix (j is pid_j, i is pid_i)
    for (j = k+1; j < n; j++) {
        for (i = k+1; i < n; i++) {
            A[j][i] = A[j][i] - A[j][k] * A[k][i];
        }
    }
}

Right-Looking details (two pass)

[Figure: per-step sweep — the two operations run as separate kernels, so the branch between them is taken on the CPU, which issues two draws; column k is referenced and the trailing block is updated]

Time: O(N)
Work: O(N^3)
∈ NC? No
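A minimal CUDA sketch of the two-pass scheme, assuming the matrix is already on the device in row-major order; kernel names and launch shapes are our additions:

__global__ void scaleColumn(float* A, int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + k + 1;
    if (i < n) A[i * n + k] /= A[k * n + k];          // pass 1
}

__global__ void updateTrailing(float* A, int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + k + 1;
    int j = blockIdx.y * blockDim.y + threadIdx.y + k + 1;
    if (i < n && j < n)
        A[j * n + i] -= A[j * n + k] * A[k * n + i];  // pass 2
}

// host loop, two launches ("draws") per step; launches on one stream serialize:
// for (int k = 0; k < n; k++) {
//     int rem = n - k - 1;
//     scaleColumn<<<(rem + 255) / 256, 256>>>(dA, n, k);
//     dim3 block(16, 16), grid((rem + 15) / 16, (rem + 15) / 16);
//     updateTrailing<<<grid, block>>>(dA, n, k);
// }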

6 of 16

for (k = 0; k < n; k++) {  // host loop: L and U are updated together, so each
                           // step needs only one draw and one copy from the CPU
    for (j = k+1; j < n; j++) {      // j is pid_j
        for (i = k; i < n; i++) {    // i is pid_i
            if (i == k) {
                A[j][k] = A[j][k] / A[k][k];
            } else {
                // A[j][k] still holds the un-scaled value here,
                // hence the extra division by A[k][k]
                A[j][i] = A[j][i] - A[j][k] * A[k][i] / A[k][k];
            }
        }
    }
}

Right-Looking details (one pass)

[Figure: fused update — the if (i == k) / else branch executes on the GPU, so one draw per step suffices; column k is referenced and the trailing block is updated]

Time: O(N)
Work: O(N^3)
∈ NC? No
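A minimal CUDA sketch of the one-pass scheme. Because every thread must read the un-scaled A[j][k], an in-place update would race; the sketch therefore ping-pongs between an input and an output buffer (the analogue of the separate input/output textures in the render-pass model of [1][2]), with unchanged elements copied through. Names are our additions:

__global__ void luOnePass(const float* Ain, float* Aout, int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (i >= n || j >= n) return;
    float v = Ain[j * n + i];
    if (j > k && i == k)
        v /= Ain[k * n + k];                        // scale the pivot column
    else if (j > k && i > k)
        v -= Ain[j * n + k] * Ain[k * n + i] / Ain[k * n + k];  // fused update
    Aout[j * n + i] = v;                            // everything else copies through
}

// host: for (int k = 0; k < n; k++) {
//     luOnePass<<<grid, block>>>(dIn, dOut, n, k);  // one draw per step
//     std::swap(dIn, dOut);                         // ping-pong the buffers
// }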

7 of 16

Left-Looking

[Figure: left-looking LU — at step i, the columns to the left are referenced, column i is updated, and part of column i is both updated and referenced]

8 of 16

for (k = 0; k < n; k++) {  // host loop
    // Step 1: apply the previous columns to column k. The j loop must stay
    // sequential: the writes to A[i][k] and the read of A[j][k] overlap,
    // so each j iteration depends on the ones before it.
    for (j = 0; j < k; j++) {
        for (i = j+1; i < n; i++) {    // i is pid_i
            A[i][k] = A[i][k] - A[i][j] * A[j][k];
        }
    }
    // Step 2: scale column k (i is pid_i)
    for (i = k+1; i < n; i++) {
        A[i][k] = A[i][k] / A[k][k];
    }
}

Left-Looking details

[Figure: per-step sweep — the columns to the left of k are referenced, column k is updated, and the pivot segment of column k is both updated and referenced]

Time: O(N^2)
Work: O(N^3)
∈ NC? No
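A minimal CUDA sketch matching the split above: the j loop runs on the host (it is sequential, per the overlap note), and only the i loop is parallel. Kernel names and launch shapes are our additions:

__global__ void applyColumn(float* A, int n, int k, int j) {
    // A[i][k] -= A[i][j] * A[j][k] for i = j+1 .. n-1 (one thread per i)
    int i = blockIdx.x * blockDim.x + threadIdx.x + j + 1;
    if (i < n) A[i * n + k] -= A[i * n + j] * A[j * n + k];
}

__global__ void scalePivot(float* A, int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + k + 1;
    if (i < n) A[i * n + k] /= A[k * n + k];
}

// host:
// for (int k = 0; k < n; k++) {
//     for (int j = 0; j < k; j++)        // sequential: O(N^2) launches in total
//         applyColumn<<<(n - j + 254) / 256, 256>>>(dA, n, k, j);
//     scalePivot<<<(n - k + 254) / 256, 256>>>(dA, n, k);
// }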

9 of 16

Matrix Inversion (LU Factorization-based)

1. Invert the upper triangular matrix U (a sketch follows below)

2. Invert the lower triangular matrix L

3. Multiply the two inverses: A^-1 = U^-1 * L^-1

Time: O(N)
Work: O(N^3) (including the work of LU factorization)
∈ NC? No
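One way to realize step 1, as a minimal sketch: one thread per column of U^-1, each doing a sequential back substitution (row-major storage; names ours). Per thread this is O(N^2) time, so reaching the O(N) figure above requires parallelizing the inner loops as well:

__global__ void upperTriInverse(const float* U, float* Uinv, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per column
    if (col >= n) return;
    // Solve U * x = e_col by back substitution; x is column `col` of U^-1.
    for (int i = n - 1; i >= 0; i--) {
        float s = (i == col) ? 1.0f : 0.0f;
        for (int t = i + 1; t < n; t++)
            s -= U[i * n + t] * Uinv[t * n + col];
        Uinv[i * n + col] = s / U[i * n + i];
    }
}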

10 of 16

Matrix Determinant Calculation

Since L has a unit diagonal, det(A) = det(L) * det(U) = the product of the diagonal entries of U, so after the LU factorization only a product over the diagonal of U remains.

Time: O(N) (including the time of LU factorization)
Work: O(N^3) (including the work of LU factorization)
∈ NC? No
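The remaining product over the diagonal parallelizes as a tree reduction. A minimal single-block sketch, assuming blockDim.x is a power of two (names ours):

__global__ void diagProduct(const float* A, float* det, int n) {
    extern __shared__ float partial[];
    int t = threadIdx.x;
    float p = 1.0f;
    for (int i = t; i < n; i += blockDim.x)   // strided slice of the diagonal
        p *= A[i * n + i];
    partial[t] = p;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // log-depth tree reduction
        if (t < s) partial[t] *= partial[t + s];
        __syncthreads();
    }
    if (t == 0) *det = partial[0];
}

// launch: diagProduct<<<1, 256, 256 * sizeof(float)>>>(dA, dDet, n);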

11 of 16

Solver for systems of equations

Ax = b

-> (LU)x = b

-> L(Ux) = b

Set Ux = y:

-> Ly = b    lowerTriangleSolver(L, y, b)

-> Ux = y    upperTriangleSolver(U, x, y)

Time: O(N^2) (including the time of LU factorization)
Work: O(N^3) (including the work of LU factorization)
∈ NC? No
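A minimal sketch of lowerTriangleSolver as column-oriented forward substitution (upperTriangleSolver is the mirror image, running the row loop backward). Kernel names and launch shapes are our additions:

__global__ void divideByDiag(const float* L, float* y, int n, int i) {
    y[i] /= L[i * n + i];                     // finish y[i]
}

__global__ void eliminateBelow(const float* L, float* y, int n, int i) {
    // y[j] -= L[j][i] * y[i] for j > i (one thread per j)
    int j = blockIdx.x * blockDim.x + threadIdx.x + i + 1;
    if (j < n) y[j] -= L[j * n + i] * y[i];
}

// host: y starts as a copy of b
// for (int i = 0; i < n; i++) {
//     divideByDiag<<<1, 1>>>(dL, dY, n, i);
//     int rem = n - i - 1;
//     if (rem > 0)
//         eliminateBelow<<<(rem + 255) / 256, 256>>>(dL, dY, n, i);
// }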

12 of 16

Size \ Computation | Addition | Subtraction | Multiplication | Scalar Multiplication | Transposition
32*32 (CPU)   | 0.001312 | 0.001280 | 0.001440  | 0.001312 | 0.001216
32*32 (GPU)   | 1.860224 | 0.735264 | 3.414784  | 0.577920 | 0.519808
64*64 (CPU)   | 0.001216 | 0.001248 | 0.001344  | 0.001184 | 0.001120
64*64 (GPU)   | 4.376352 | 4.523104 | 11.208000 | 0.205568 | 0.257216
128*128 (CPU) | 0.001152 | 0.001152 | 0.001184  | 0.001184 | 0.001280
128*128 (GPU) | 0.360768 | 0.432480 | 0.382688  | 0.296512 | 0.659616
256*256 (CPU) | 0.001152 | 0.001184 | 0.001152  | 0.001216 | 0.001696
256*256 (GPU) | 0.966464 | 1.260192 | 0.898944  | 0.677568 | 0.529184
512*512 (CPU) | 0.001184 | 0.001408 | 0.001152  | 0.002272 | 0.001184
512*512 (GPU) | 2.370432 | 1.867552 | 1.753984  | 1.518848 | 2.351072

13 of 16

Size \ LU method | Right-looking (two pass) | Right-looking (one pass) | Left-looking | Inversion | Determinant
32*32 (CPU)   | 0.001184   | 0.001184   | 0.001152     | 0.001152   | 0.001248
32*32 (GPU)   | 1.453440   | 3.832480   | 20.917055    | 4.291712   | 3.142336
64*64 (CPU)   | 0.001504   | 0.001952   | 0.001120     | 0.001280   | 0.001376
64*64 (GPU)   | 4.968864   | 8.242400   | 162.619385   | 15.256960  | 7.821504
128*128 (CPU) | 0.001312   | 0.001184   | 0.001120     | 0.002112   | 0.001248
128*128 (GPU) | 17.878080  | 26.706047  | 1217.227661  | 56.154430  | 28.437599
256*256 (CPU) | 0.001184   | 0.001248   | 0.001120     | 0.001760   | 0.001216
256*256 (GPU) | 70.392258  | 105.709503 | 9735.912109  | 207.210144 | 109.569214
512*512 (CPU) | 0.001120   | 0.001152   | 0.001120     | 0.001536   | 0.001152
512*512 (GPU) | 275.576447 | 402.126495 | 13246.125234 | 819.849243 | 402.243286

14 of 16

Results Analysis

1. For small matrices, the host-to-device and device-to-host copies account for most of the total time.

2. We do not have enough blocks × threads to reach the processor counts that the NC bounds assume.

3. The grid and block sizes could be tuned better, and staging data from global GPU memory into on-chip shared memory would further improve performance (see the sketch after this list).

4. The LU decomposition could be refined further.
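As an illustration of point 3, a minimal shared-memory tiling sketch for the multiplication kernel. For brevity it assumes N is a multiple of TILE; a full version would add bounds checks:

#define TILE 16

__global__ void matMulTiled(const float* A, const float* B, float* C, int N) {
    __shared__ float sA[TILE][TILE];   // tiles staged in on-chip shared memory
    __shared__ float sB[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < N / TILE; t++) {
        sA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        sB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();               // wait until the tile is fully loaded
        for (int k = 0; k < TILE; k++)
            sum += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();               // wait until everyone is done reading
    }
    C[row * N + col] = sum;
}

Each global-memory element is now loaded once per tile and reused TILE times from fast shared memory.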

15 of 16

Live Code Demo

16 of 16

References:

[1] F. Ino, M. Matsui, K. Goda and K. Hagihara, "Performance Study of LU Decomposition on the Programmable GPU," Proc. High Performance Computing (HiPC 2005), pp. 83-94, 2005, doi: 10.1007/11602569_13.

[2] N. Galoppo, N. K. Govindaraju, M. Henson and D. Manocha, "LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware," SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, Seattle, WA, USA, 2005, pp. 3-3.

[3] K. He, S. X.-D. Tan, H. Wang and G. Shi, "GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 3, pp. 1140-1150, March 2016.

[4] https://www.cspp.cc.u-tokyo.ac.jp/hanawa/class/spc2016s/