[Figure: GEMM compared with mGEMM. Panels: (A) GEMM, (B) mGEMM. Step panels: K=0, K=1; Cin=0 with Wf=0, 1, 2; Cin=1 with Wf=0, 1, 2, each showing MAC operations. Labels: Filter, Input, Output; dimensions Cin, co, wf, wo.]
[Figure: (a) Im2col + GEMM, (b) GEMM Plus; also labeled "Im2col + GEMM" and "mGEMM". Step panels: k=0, k=1, k=2, k=3; ci=0 with wf=0, 1, 2; ci=1 with wf=0, each showing MAC operations. Labels: Input, Filter, Output, "Duplication & Shift", "Duplicated & Shifted"; dimensions m, n, k, ci, co, wf, wo + wf - 1.]
[Figure: Roofline plot. x-axis: Operational Intensity (Flop/Byte); y-axis: Throughput (Flop/s). Series: Max Flop/s, Ours, XNNPACK, OpenBLAS, ARMNN.]
[Figure: Convolution tensors and the memory hierarchy (Main Memory, L2, L1, Register). Labels: Filter, Input, Output, "Wout Iteration"; dimensions Cin, Cout, Hfil, Wfil, Hin, Win, Hout, Wout.]
[Figure: GEMM. Matrices: Filter (A), Input (B), Output (C); dimensions M, N, K; indices (i,k), (k,j), (i,j); "N Iteration".]
[Figure: Data movement through the cache hierarchy: Memory, Shared Last Level Cache, L1 Cache (CPU0). Labels: Input, Filter, Output, ci, Hout x Wout, "Streamed from memory", "Streamed into memory".]
[Figure: Mapping convolution to GEMM with im2col and kn2col; panels (A), (B), (C). Matrices: Filter (A), Input (B), Output (C). GEMM dimensions: M = Cout, N = B x Hout x Wout, K = Cin x Hfil x Wfil. Other labels: Cin, Hfil, Wfil, Wout, Hfil x Wfil, Cout x Hfil x Wfil.]