Parallel GPU-Based Algorithms for Matrix Computations
Team 1
Jiayi Yang (jy23273)
Yating Wu (yw23374)
Addition, Subtraction, Scalar Multiplication, Transposition
int ROW = blockIdx.y * blockDim.y + threadIdx.y;
int COL = blockIdx.x * blockDim.x + threadIdx.x;
C[ROW * N + COL] = A[ROW * N + COL] + B[ROW * N + COL];  // addition
C[ROW * N + COL] = A[ROW * N + COL] - B[ROW * N + COL];  // subtraction
C[ROW * N + COL] = A[ROW * N + COL] * scalar;            // scalar multiplication
C[COL * N + ROW] = A[ROW * N + COL];                     // transposition
Time | O(1) |
Work | O(N^2) |
In NC? | Yes |
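Each of these kernels writes one output element per thread, which is why the time is O(1) given N*N threads. A sequential C sketch (function names are ours) amounts to looping the kernel body over every (ROW, COL) pair with the same flat row-major indexing:

```c
/* Sequential sketch of the elementwise kernels; matrices are flat
 * row-major arrays of length N*N, matching A[ROW * N + COL]. */
void mat_add(const float *A, const float *B, float *C, int N) {
    for (int ROW = 0; ROW < N; ROW++)
        for (int COL = 0; COL < N; COL++)
            C[ROW * N + COL] = A[ROW * N + COL] + B[ROW * N + COL];
}

void mat_sub(const float *A, const float *B, float *C, int N) {
    for (int i = 0; i < N * N; i++)   /* same pattern, flat index */
        C[i] = A[i] - B[i];
}

void mat_scale(const float *A, float scalar, float *C, int N) {
    for (int i = 0; i < N * N; i++)
        C[i] = A[i] * scalar;
}

void mat_transpose(const float *A, float *C, int N) {
    for (int ROW = 0; ROW < N; ROW++)
        for (int COL = 0; COL < N; COL++)
            C[COL * N + ROW] = A[ROW * N + COL];
}
```

On the GPU, the two nested loops disappear: each (ROW, COL) pair is one thread, and every thread does O(1) work.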
Matrix Multiplication
int ROW = blockIdx.y * blockDim.y + threadIdx.y;
int COL = blockIdx.x * blockDim.x + threadIdx.x;
Time | O(N) |
Work | O(N^3) |
In NC? | No (but convertible to NC with O(N^3) threads and the parallel reduction method) |
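A sequential sketch (function name ours): on the GPU, each thread owns one (ROW, COL) output and runs only the innermost k loop, so the parallel time is O(N) with N^2 threads; with O(N^3) threads the k loop can be replaced by a tree reduction of depth O(log N), which is the NC conversion mentioned above.

```c
/* Sequential sketch of the multiplication kernel; on the GPU only the
 * k loop runs per thread, the outer two loops become the thread grid. */
void mat_mul(const float *A, const float *B, float *C, int N) {
    for (int ROW = 0; ROW < N; ROW++) {
        for (int COL = 0; COL < N; COL++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += A[ROW * N + k] * B[k * N + COL];
            C[ROW * N + COL] = sum;
        }
    }
}
```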
Right-Looking
[Diagram: at step k, row k and column k are referenced; the trailing submatrix is updated]
for(k=0; k<n; k++){ // to update L and U we need two draws (kernel launches) and two copies from the CPU
    // 1. scale column k below the diagonal (i is pid_i)
    for(i=k+1; i<n; i++){
        A[i][k] = A[i][k]/A[k][k];
    }
    // 2. update the trailing submatrix (j is pid_j, i is pid_i)
    for(j=k+1; j<n; j++){
        for(i=k+1; i<n; i++){
            A[j][i] = A[j][i] - A[j][k]*A[k][i];
        }
    }
}
Right-Looking details (two pass)
Branching handled on the CPU side: two draws (kernel launches) per step
Time | O(N) |
Work | O(N^3) |
In NC? | No |
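The two passes can be checked with a sequential in-place version (a sketch; no pivoting, and the name `lu_right_two_pass` is ours). Afterwards U sits in the upper triangle of A and L in the strict lower triangle, with an implicit unit diagonal:

```c
/* Sequential, in-place right-looking LU (Doolittle, no pivoting).
 * Each k iteration corresponds to two draws on the GPU: one kernel
 * scales column k, the next updates the trailing submatrix. */
void lu_right_two_pass(float *A, int n) {
    for (int k = 0; k < n; k++) {
        for (int i = k + 1; i < n; i++)      /* pass 1: scale column k */
            A[i * n + k] /= A[k * n + k];
        for (int j = k + 1; j < n; j++)      /* pass 2: trailing update */
            for (int i = k + 1; i < n; i++)
                A[j * n + i] -= A[j * n + k] * A[k * n + i];
    }
}
```

For A = {{4, 3}, {6, 3}} this leaves {4, 3, 1.5, -1.5}, i.e. L21 = 1.5 and U22 = -1.5.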
for(k=0; k<n; k++){ // update L and U together: one draw, one copy from the CPU
    for(j=k+1; j<n; j++){ // j is pid_j
        for(i=k; i<n; i++){ // i is pid_i
            if(i==k){
                A[j][k] = A[j][k]/A[k][k];
            }else{
                // each thread reads the original (unscaled) A[j][k],
                // hence the extra division by A[k][k]
                A[j][i] = A[j][i] - A[j][k]*A[k][i]/A[k][k];
            }
        }
    }
}
[Diagram: referenced vs. updated regions of A at step k]
Right-Looking details (one pass)
if(i == k) / else branch inside the kernel (GPU branch): one draw per step
[Diagram: referenced vs. updated regions of A at step k]
Time | O(N) |
Work | O(N^3) |
In NC? | No |
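Sequentially, the fused version must read the unscaled A[j][k] for the whole i loop, just as the GPU threads all do before the i == k write lands; a sketch (function name ours) that snapshots that value first:

```c
/* One-pass variant: division and update fused into one kernel with the
 * i == k branch. We snapshot A[j][k] before the i == k iteration
 * overwrites it; otherwise the later divisions by A[k][k] would be
 * applied to an already-scaled entry and the result would be wrong. */
void lu_right_one_pass(float *A, int n) {
    for (int k = 0; k < n; k++)
        for (int j = k + 1; j < n; j++) {
            float ajk = A[j * n + k];        /* value all "threads" read */
            for (int i = k; i < n; i++) {
                if (i == k)
                    A[j * n + k] = ajk / A[k * n + k];
                else
                    A[j * n + i] -= ajk * A[k * n + i] / A[k * n + k];
            }
        }
}
```

It produces the same factors as the two-pass version, with one kernel launch per step instead of two.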
Left-Looking
[Diagram: at step k, columns j < k are referenced; column k is both updated and referenced]
for(k=0; k<n; k++){
    // 1. apply the updates from the finished columns j < k
    for(j=0; j<k; j++){
        // the writes to A[i][k] overlap across j, so the j loop is sequential
        for(i=j+1; i<n; i++){
            A[i][k] = A[i][k] - A[i][j]*A[j][k];
        }
    }
    // 2. scale column k below the diagonal (i is pid_i)
    for(i=k+1; i<n; i++){
        A[i][k] = A[i][k]/A[k][k];
    }
}
Left-Looking details
[Diagram: referenced vs. updated regions of A at step k]
Time | O(N^2) |
Work | O(N^3) |
In NC? | No |
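A sequential sketch (name `lu_left` ours): only the inner i loop parallelizes, because the j loop carries a true dependency through column k, which is why the table reports O(N^2) time rather than the right-looking O(N).

```c
/* Sequential left-looking LU (Doolittle, no pivoting). For each column
 * k, first fold in the contributions of all finished columns j < k,
 * then scale below the diagonal. */
void lu_left(float *A, int n) {
    for (int k = 0; k < n; k++) {
        for (int j = 0; j < k; j++)          /* sequential: updates overlap */
            for (int i = j + 1; i < n; i++)
                A[i * n + k] -= A[i * n + j] * A[j * n + k];
        for (int i = k + 1; i < n; i++)      /* scale column k */
            A[i * n + k] /= A[k * n + k];
    }
}
```

It yields the same in-place factors as the right-looking variants.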
Matrix Inversion (LU Factorization-based)
1. Factor A = LU
2. Invert the lower triangular matrix L
3. Invert the upper triangular matrix U
4. Compute A^-1 = U^-1 * L^-1
Time | O(N) |
Work | O(N^3) (including the work of LU Factorization) |
In NC? | No |
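Inverting the triangular factors reduces to substitution against the identity, and the columns of the inverse are independent, so each column can go to its own thread or block on the GPU. A sketch (function name ours) for the lower factor, assuming L stores explicit 1s on its diagonal as Doolittle LU implies:

```c
/* Invert a unit lower triangular n*n matrix L by forward substitution,
 * solving L * X = I one column at a time. The columns are independent,
 * which is the parallelism the GPU version exploits. */
void invert_unit_lower(const float *L, float *X, int n) {
    for (int col = 0; col < n; col++)
        for (int i = 0; i < n; i++) {
            float s = (i == col) ? 1.0f : 0.0f;   /* identity RHS */
            for (int j = 0; j < i; j++)
                s -= L[i * n + j] * X[j * n + col];
            X[i * n + col] = s;                   /* L[i][i] == 1 */
        }
}
```

The upper factor is inverted the same way with back substitution, and step 4 is an ordinary matrix multiplication.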
Matrix Determinant Calculation
Time | O(N) (including the time of LU Factorization) |
Work | O(N^3) (including the work of LU Factorization) |
In NC? | No |
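With A = LU and a unit-diagonal L, det(A) = det(L) * det(U) = the product of the diagonal entries left in place by the in-place factorization; on the GPU that product is a parallel reduction. A sketch (function name ours), taking the array after in-place LU:

```c
/* Determinant from an in-place LU factorization: det(L) = 1, so the
 * determinant is the product of the diagonal of U, i.e. A[k][k]. */
float det_from_lu(const float *A, int n) {
    float det = 1.0f;
    for (int k = 0; k < n; k++)
        det *= A[k * n + k];
    return det;
}
```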
Solver for systems of equations
Ax = b
-> (LU)x = b
-> L(Ux) = b
set Ux = y
-> Ly = b : lowerTriangleSolver(L, y, b)
-> Ux = y : upperTriangleSolver(U, x, y)
Time | O(N^2) (including the time of LU Factorization) |
Work | O(N^3) (including the work of LU Factorization) |
In NC? | No |
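A sequential sketch of the two substitution routines (signatures ours, assuming a unit-diagonal L as produced by the Doolittle factorization above):

```c
/* Forward substitution: solve L y = b with unit-diagonal L. */
void lowerTriangleSolver(const float *L, float *y, const float *b, int n) {
    for (int i = 0; i < n; i++) {
        float s = b[i];
        for (int j = 0; j < i; j++)
            s -= L[i * n + j] * y[j];
        y[i] = s;                          /* L[i][i] == 1 */
    }
}

/* Back substitution: solve U x = y. */
void upperTriangleSolver(const float *U, float *x, const float *y, int n) {
    for (int i = n - 1; i >= 0; i--) {
        float s = y[i];
        for (int j = i + 1; j < n; j++)
            s -= U[i * n + j] * x[j];
        x[i] = s / U[i * n + i];
    }
}
```

Each solve is inherently sequential across rows (row i needs all earlier results), which is why the combined solver remains O(N^2) time even after an O(N)-time factorization.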
Size \ Computation | Addition | Subtraction | Multiplication | Scalar Multiplication | Transposition |
32*32(CPU) | 0.001312 | 0.001280 | 0.001440 | 0.001312 | 0.001216 |
32*32(GPU) | 1.860224 | 0.735264 | 3.414784 | 0.577920 | 0.519808 |
64*64(CPU) | 0.001216 | 0.001248 | 0.001344 | 0.001184 | 0.001120 |
64*64(GPU) | 4.376352 | 4.523104 | 11.208000 | 0.205568 | 0.257216 |
128*128(CPU) | 0.001152 | 0.001152 | 0.001184 | 0.001184 | 0.001280 |
128*128(GPU) | 0.360768 | 0.432480 | 0.382688 | 0.296512 | 0.659616 |
256*256(CPU) | 0.001152 | 0.001184 | 0.001152 | 0.001216 | 0.001696 |
256*256(GPU) | 0.966464 | 1.260192 | 0.898944 | 0.677568 | 0.529184 |
512*512(CPU) | 0.001184 | 0.001408 | 0.001152 | 0.002272 | 0.001184 |
512*512(GPU) | 2.370432 | 1.867552 | 1.753984 | 1.518848 | 2.351072 |
Size \ LU method | Right-looking (two pass) | Right-looking (one pass) | Left-looking | Inversion | Determinant |
32*32(CPU) | 0.001184 | 0.001184 | 0.001152 | 0.001152 | 0.001248 |
32*32(GPU) | 1.453440 | 3.832480 | 20.917055 | 4.291712 | 3.142336 |
64*64(CPU) | 0.001504 | 0.001952 | 0.001120 | 0.001280 | 0.001376 |
64*64(GPU) | 4.968864 | 8.242400 | 162.619385 | 15.256960 | 7.821504 |
128*128(CPU) | 0.001312 | 0.001184 | 0.001120 | 0.002112 | 0.001248 |
128*128(GPU) | 17.878080 | 26.706047 | 1217.227661 | 56.154430 | 28.437599 |
256*256(CPU) | 0.001184 | 0.001248 | 0.001120 | 0.001760 | 0.001216 |
256*256(GPU) | 70.392258 | 105.709503 | 9735.912109 | 207.210144 | 109.569214 |
512*512(CPU) | 0.001120 | 0.001152 | 0.001120 | 0.001536 | 0.001152 |
512*512(GPU) | 275.576447 | 402.126495 | 13246.125234 | 819.849243 | 402.243286 |
Results Analysis
1. For small matrices, copying from host to device and back takes most of the measured GPU time.
2. We do not have enough blocks × threads to reach the theoretical NC bounds.
3. The GridSize and BlockSize settings could be tuned, and staging data from global GPU memory into shared (on-chip) memory would also improve performance.
4. The LU decomposition could be refined further.
Live Code Demo
References:
[1] Ino, Fumihiko & Matsui, Manabu & Goda, Keigo & Hagihara, Kenichi. (2005). Performance Study of LU Decomposition on the Programmable GPU. 83-94. 10.1007/11602569_13.
[2] N. Galoppo, N. K. Govindaraju, M. Henson and D. Manocha, "LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware," SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, Seattle, WA, USA, 2005, pp. 3-3.
[3] K. He, S. X.-D. Tan, H. Wang and G. Shi, "GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 3, pp. 1140-1150, March 2016.
[4] https://www.cspp.cc.u-tokyo.ac.jp/hanawa/class/spc2016s/