1 of 16

Parallel GPU-Based Algorithms for Matrix Computations

Team 1

Jiayi Yang (jy23273)

Yating Wu (yw23374)

2 of 16

Addition, Subtraction, Scalar Multiplication, Transposition

int ROW = blockIdx.y * blockDim.y + threadIdx.y;
int COL = blockIdx.x * blockDim.x + threadIdx.x;

C[ROW * N + COL] = A[ROW * N + COL] + B[ROW * N + COL];  // addition
C[ROW * N + COL] = A[ROW * N + COL] - B[ROW * N + COL];  // subtraction
C[ROW * N + COL] = A[ROW * N + COL] * scalar;            // scalar multiplication
C[COL * N + ROW] = A[ROW * N + COL];                     // transposition

Time: O(1)
Work: O(N^2)
∈ NC? Yes
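A minimal sketch of how these one-line bodies fit into a full kernel (element-wise addition shown; the other three operations differ only in the assignment line). The kernel name, the N parameter, the bounds check, and the launch shape are our additions:

__global__ void matAdd(const float* A, const float* B, float* C, int N) {
    int ROW = blockIdx.y * blockDim.y + threadIdx.y;
    int COL = blockIdx.x * blockDim.x + threadIdx.x;
    if (ROW < N && COL < N) {   // guard threads that fall outside the matrix
        C[ROW * N + COL] = A[ROW * N + COL] + B[ROW * N + COL];
    }
}

// one thread per element:
// dim3 block(16, 16);
// dim3 grid((N + 15) / 16, (N + 15) / 16);
// matAdd<<<grid, block>>>(dA, dB, dC, N);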

3 of 16

Matrix Multiplication

int ROW = blockIdx.y * blockDim.y + threadIdx.y;
int COL = blockIdx.x * blockDim.x + threadIdx.x;
// each thread computes one output element as a length-N dot product

Time: O(N)
Work: O(N^3)
∈ NC? No (but it can be brought into NC with O(n^3) threads and a tree reduction for each dot product)
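A minimal sketch of the per-element kernel implied above, assuming one thread per output element; the kernel name and bounds check are our additions:

__global__ void matMul(const float* A, const float* B, float* C, int N) {
    int ROW = blockIdx.y * blockDim.y + threadIdx.y;
    int COL = blockIdx.x * blockDim.x + threadIdx.x;
    if (ROW < N && COL < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; k++)          // the O(N) dot-product loop
            sum += A[ROW * N + k] * B[k * N + COL];
        C[ROW * N + COL] = sum;
    }
}

With O(n^2) threads each running an O(n) loop, this gives the O(N) time / O(N^3) work figures above; replacing the loop by a log-depth tree reduction over O(n^3) threads is what the NC remark refers to.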

4 of 16

Right-Looking

[Figure: right-looking LU — at step k, row and column k are referenced and the trailing submatrix is updated]

5 of 16

for (k = 0; k < n; k++) {  // host loop: updating L and U takes two draws
                           // (kernel launches) and two copies from the CPU per step
    // Pass 1: scale the pivot column (i is pid_i)
    for (i = k+1; i < n; i++) {
        A[i][k] = A[i][k] / A[k][k];
    }
    // Pass 2: rank-1 update of the trailing submatrix (j is pid_j, i is pid_i)
    for (j = k+1; j < n; j++) {
        for (i = k+1; i < n; i++) {
            A[j][i] = A[j][i] - A[j][k] * A[k][i];
        }
    }
}

Right-Looking details (two pass)

[Figure: per-step sweep — the two operations run as separate kernels, so the branch between them is taken on the CPU, which issues two draws; column k is referenced and the trailing block is updated]

Time: O(N)
Work: O(N^3)
∈ NC? No
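A minimal CUDA sketch of the two-pass scheme, assuming the matrix is already on the device in row-major order; kernel names and launch shapes are our additions:

__global__ void scaleColumn(float* A, int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + k + 1;
    if (i < n) A[i * n + k] /= A[k * n + k];          // pass 1
}

__global__ void updateTrailing(float* A, int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + k + 1;
    int j = blockIdx.y * blockDim.y + threadIdx.y + k + 1;
    if (i < n && j < n)
        A[j * n + i] -= A[j * n + k] * A[k * n + i];  // pass 2
}

// host loop, two launches ("draws") per step; launches on one stream serialize:
// for (int k = 0; k < n; k++) {
//     int rem = n - k - 1;
//     scaleColumn<<<(rem + 255) / 256, 256>>>(dA, n, k);
//     dim3 block(16, 16), grid((rem + 15) / 16, (rem + 15) / 16);
//     updateTrailing<<<grid, block>>>(dA, n, k);
// }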

6 of 16

for (k = 0; k < n; k++) {  // host loop: L and U are updated together, so each
                           // step needs only one draw and one copy from the CPU
    for (j = k+1; j < n; j++) {      // j is pid_j
        for (i = k; i < n; i++) {    // i is pid_i
            if (i == k) {
                A[j][k] = A[j][k] / A[k][k];
            } else {
                // A[j][k] still holds the un-scaled value here,
                // hence the extra division by A[k][k]
                A[j][i] = A[j][i] - A[j][k] * A[k][i] / A[k][k];
            }
        }
    }
}

Right-Looking details (one pass)

[Figure: fused update — the if (i == k) / else branch executes on the GPU, so one draw per step suffices; column k is referenced and the trailing block is updated]

Time: O(N)
Work: O(N^3)
∈ NC? No
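A minimal CUDA sketch of the one-pass scheme. Because every thread must read the un-scaled A[j][k], an in-place update would race; the sketch therefore ping-pongs between an input and an output buffer (the analogue of the separate input/output textures in the render-pass model of [1][2]), with unchanged elements copied through. Names are our additions:

__global__ void luOnePass(const float* Ain, float* Aout, int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (i >= n || j >= n) return;
    float v = Ain[j * n + i];
    if (j > k && i == k)
        v /= Ain[k * n + k];                        // scale the pivot column
    else if (j > k && i > k)
        v -= Ain[j * n + k] * Ain[k * n + i] / Ain[k * n + k];  // fused update
    Aout[j * n + i] = v;                            // everything else copies through
}

// host: for (int k = 0; k < n; k++) {
//     luOnePass<<<grid, block>>>(dIn, dOut, n, k);  // one draw per step
//     std::swap(dIn, dOut);                         // ping-pong the buffers
// }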

7 of 16

Left-Looking

[Figure: left-looking LU — at step i, the columns to the left are referenced, column i is updated, and part of column i is both updated and referenced]

8 of 16

for (k = 0; k < n; k++) {  // host loop
    // Step 1: apply the previous columns to column k. The j loop must stay
    // sequential: the writes to A[i][k] and the read of A[j][k] overlap,
    // so each j iteration depends on the ones before it.
    for (j = 0; j < k; j++) {
        for (i = j+1; i < n; i++) {    // i is pid_i
            A[i][k] = A[i][k] - A[i][j] * A[j][k];
        }
    }
    // Step 2: scale column k (i is pid_i)
    for (i = k+1; i < n; i++) {
        A[i][k] = A[i][k] / A[k][k];
    }
}

Left-Looking details

[Figure: per-step sweep — the columns to the left of k are referenced, column k is updated, and the pivot segment of column k is both updated and referenced]

Time: O(N^2)
Work: O(N^3)
∈ NC? No
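A minimal CUDA sketch matching the split above: the j loop runs on the host (it is sequential, per the overlap note), and only the i loop is parallel. Kernel names and launch shapes are our additions:

__global__ void applyColumn(float* A, int n, int k, int j) {
    // A[i][k] -= A[i][j] * A[j][k] for i = j+1 .. n-1 (one thread per i)
    int i = blockIdx.x * blockDim.x + threadIdx.x + j + 1;
    if (i < n) A[i * n + k] -= A[i * n + j] * A[j * n + k];
}

__global__ void scalePivot(float* A, int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + k + 1;
    if (i < n) A[i * n + k] /= A[k * n + k];
}

// host:
// for (int k = 0; k < n; k++) {
//     for (int j = 0; j < k; j++)        // sequential: O(N^2) launches in total
//         applyColumn<<<(n - j + 254) / 256, 256>>>(dA, n, k, j);
//     scalePivot<<<(n - k + 254) / 256, 256>>>(dA, n, k);
// }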

9 of 16

Matrix Inversion (LU Factorization-based)

1. Invert the upper triangular matrix U (a sketch follows below)

2. Invert the lower triangular matrix L

3. Multiply the two inverses: A^-1 = U^-1 * L^-1

Time: O(N)
Work: O(N^3) (including the work of LU factorization)
∈ NC? No
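One way to realize step 1, as a minimal sketch: one thread per column of U^-1, each doing a sequential back substitution (row-major storage; names ours). Per thread this is O(N^2) time, so reaching the O(N) figure above requires parallelizing the inner loops as well:

__global__ void upperTriInverse(const float* U, float* Uinv, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per column
    if (col >= n) return;
    // Solve U * x = e_col by back substitution; x is column `col` of U^-1.
    for (int i = n - 1; i >= 0; i--) {
        float s = (i == col) ? 1.0f : 0.0f;
        for (int t = i + 1; t < n; t++)
            s -= U[i * n + t] * Uinv[t * n + col];
        Uinv[i * n + col] = s / U[i * n + i];
    }
}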

10 of 16

Matrix Determinant Calculation

Since L has a unit diagonal, det(A) = det(L) * det(U) = the product of the diagonal entries of U, so after the LU factorization only a product over the diagonal of U remains.

Time: O(N) (including the time of LU factorization)
Work: O(N^3) (including the work of LU factorization)
∈ NC? No
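The remaining product over the diagonal parallelizes as a tree reduction. A minimal single-block sketch, assuming blockDim.x is a power of two (names ours):

__global__ void diagProduct(const float* A, float* det, int n) {
    extern __shared__ float partial[];
    int t = threadIdx.x;
    float p = 1.0f;
    for (int i = t; i < n; i += blockDim.x)   // strided slice of the diagonal
        p *= A[i * n + i];
    partial[t] = p;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // log-depth tree reduction
        if (t < s) partial[t] *= partial[t + s];
        __syncthreads();
    }
    if (t == 0) *det = partial[0];
}

// launch: diagProduct<<<1, 256, 256 * sizeof(float)>>>(dA, dDet, n);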

11 of 16

Solver for systems of equations

Ax = b

-> (LU)x = b

-> L(Ux) = b

Set Ux = y:

-> Ly = b    lowerTriangleSolver(L, y, b)

-> Ux = y    upperTriangleSolver(U, x, y)

Time: O(N^2) (including the time of LU factorization)
Work: O(N^3) (including the work of LU factorization)
∈ NC? No
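A minimal sketch of lowerTriangleSolver as column-oriented forward substitution (upperTriangleSolver is the mirror image, running the row loop backward). Kernel names and launch shapes are our additions:

__global__ void divideByDiag(const float* L, float* y, int n, int i) {
    y[i] /= L[i * n + i];                     // finish y[i]
}

__global__ void eliminateBelow(const float* L, float* y, int n, int i) {
    // y[j] -= L[j][i] * y[i] for j > i (one thread per j)
    int j = blockIdx.x * blockDim.x + threadIdx.x + i + 1;
    if (j < n) y[j] -= L[j * n + i] * y[i];
}

// host: y starts as a copy of b
// for (int i = 0; i < n; i++) {
//     divideByDiag<<<1, 1>>>(dL, dY, n, i);
//     int rem = n - i - 1;
//     if (rem > 0)
//         eliminateBelow<<<(rem + 255) / 256, 256>>>(dL, dY, n, i);
// }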

12 of 16

Size \ Computation | Addition | Subtraction | Multiplication | Scalar Multiplication | Transposition
32*32 (CPU)   | 0.001312 | 0.001280 | 0.001440  | 0.001312 | 0.001216
32*32 (GPU)   | 1.860224 | 0.735264 | 3.414784  | 0.577920 | 0.519808
64*64 (CPU)   | 0.001216 | 0.001248 | 0.001344  | 0.001184 | 0.001120
64*64 (GPU)   | 4.376352 | 4.523104 | 11.208000 | 0.205568 | 0.257216
128*128 (CPU) | 0.001152 | 0.001152 | 0.001184  | 0.001184 | 0.001280
128*128 (GPU) | 0.360768 | 0.432480 | 0.382688  | 0.296512 | 0.659616
256*256 (CPU) | 0.001152 | 0.001184 | 0.001152  | 0.001216 | 0.001696
256*256 (GPU) | 0.966464 | 1.260192 | 0.898944  | 0.677568 | 0.529184
512*512 (CPU) | 0.001184 | 0.001408 | 0.001152  | 0.002272 | 0.001184
512*512 (GPU) | 2.370432 | 1.867552 | 1.753984  | 1.518848 | 2.351072

13 of 16

Size \ LU method | Right-looking (two pass) | Right-looking (one pass) | Left-looking | Inversion | Determinant
32*32 (CPU)   | 0.001184   | 0.001184   | 0.001152     | 0.001152   | 0.001248
32*32 (GPU)   | 1.453440   | 3.832480   | 20.917055    | 4.291712   | 3.142336
64*64 (CPU)   | 0.001504   | 0.001952   | 0.001120     | 0.001280   | 0.001376
64*64 (GPU)   | 4.968864   | 8.242400   | 162.619385   | 15.256960  | 7.821504
128*128 (CPU) | 0.001312   | 0.001184   | 0.001120     | 0.002112   | 0.001248
128*128 (GPU) | 17.878080  | 26.706047  | 1217.227661  | 56.154430  | 28.437599
256*256 (CPU) | 0.001184   | 0.001248   | 0.001120     | 0.001760   | 0.001216
256*256 (GPU) | 70.392258  | 105.709503 | 9735.912109  | 207.210144 | 109.569214
512*512 (CPU) | 0.001120   | 0.001152   | 0.001120     | 0.001536   | 0.001152
512*512 (GPU) | 275.576447 | 402.126495 | 13246.125234 | 819.849243 | 402.243286

14 of 16

Results Analysis

1. For small matrices, the host-to-device and device-to-host copies account for most of the total time.

2. We do not have enough blocks × threads to reach the processor counts that the NC bounds assume.

3. The grid and block sizes could be tuned better, and staging data from global GPU memory into on-chip shared memory would further improve performance (see the sketch after this list).

4. The LU decomposition could be refined further.
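As an illustration of point 3, a minimal shared-memory tiling sketch for the multiplication kernel. For brevity it assumes N is a multiple of TILE; a full version would add bounds checks:

#define TILE 16

__global__ void matMulTiled(const float* A, const float* B, float* C, int N) {
    __shared__ float sA[TILE][TILE];   // tiles staged in on-chip shared memory
    __shared__ float sB[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < N / TILE; t++) {
        sA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        sB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();               // wait until the tile is fully loaded
        for (int k = 0; k < TILE; k++)
            sum += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();               // wait until everyone is done reading
    }
    C[row * N + col] = sum;
}

Each global-memory element is now loaded once per tile and reused TILE times from fast shared memory.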

15 of 16

Live Code Demo

16 of 16

References:

[1] F. Ino, M. Matsui, K. Goda and K. Hagihara, "Performance Study of LU Decomposition on the Programmable GPU," Proc. High Performance Computing (HiPC 2005), pp. 83-94, 2005, doi: 10.1007/11602569_13.

[2] N. Galoppo, N. K. Govindaraju, M. Henson and D. Manocha, "LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware," SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, Seattle, WA, USA, 2005, pp. 3-3.

[3] K. He, S. X.-D. Tan, H. Wang and G. Shi, "GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 3, pp. 1140-1150, March 2016.

[4] https://www.cspp.cc.u-tokyo.ac.jp/hanawa/class/spc2016s/