MultiQueue-Based FPGA Routing: Relaxed A* Priority Ordering for Improved Parallelism
Alexandre Singer (speaker),* Hang Yan,* Guozheng Zhang,*
Mark C. Jeffrey,* Mirjana Stojilović,§ Vaughn Betz*
*
§
FPT 2024
FPGA CAD is A Large Graph Problem
2
SYNTH
PACK & PLACE
ROUTE
Netlist
Arch
Compile Time Forecast: From Bad to Worse
3
Serial CPU perf.
≈ FPGA growth
Sources: Intel (Altera) / AMD (Xilinx) / SPEC CPU2017 Benchmark / CPU Specs Database
Compile Time Forecast: From Bad to Worse
4
10× Gap
Sources: Intel (Altera) / AMD (Xilinx) / SPEC CPU2017 Benchmark / CPU Specs Database
But We Have More and More Cores!
5
Sources: Intel (Altera) / AMD (Xilinx) / SPEC CPU2017 Benchmark / CPU Specs Database
Could Using Multiple Cores Bridge this Gap?
6
Parallel opportunity
Sources: Intel (Altera) / AMD (Xilinx) / SPEC CPU2017 Benchmark / CPU Specs Database
UB
FPGA Routing
7
PathFinder: High-Quality, Many Path Searches
8
while no overlapping nets:
for each net n:
route(n)
update costs
Serial Speedup: Incremental (AIR)
9
while no overlapping nets:
for each marked net n:
route(n) // Only cong. part
update costs
mark nets if overlap
Parallel Speedup: Challenging
10
Scalable Parallel Path Search
11
Path Search Using A*
12
4
5
6
7
8
9
Total Estimated Cost to Target
A
B
C
D
F
D
E
Priority
Queue
B
Global Info:
C
D
E
F
A
A
A
D
B
3
3
6
7
9
Node:
Prev:
BCost:
A
(4)
C
(4)
B
(2)
D
(3)
E
(1)
F
(0)
3
3
6
1
3
2
1
6
Source
Target
A
B
B
4
B
C
C
D
D
E
E
D
D
F
F
A
(4)
A
(4)
B
(2)
B
(2)
B
(2)
C
(4)
C
(4)
C
(4)
D
(3)
D
(3)
D
(3)
E
(1)
E
(1)
E
(1)
F
(0)
F
(0)
F
(0)
How do we Speed Up A*?
13
Explore in parallel?
A
(4)
C
(4)
B
(2)
D
(3)
E
(1)
F
(0)
3
3
6
1
3
2
1
6
Source
Target
1
2
3
4
5
7
Pop Order
A
B
C
D
F
D
E
Priority
Queue
6
Priority
Queue
Priority Queue Bottleneck
Serial Priority Queue
14
Priority
Queue
2
1
2
1
1
2
2
MultiQueue: More Concurrency But Less Sorting
Serial Priority Queue
MultiQueue
15
Priority
Queue
MultiQueue
Priority
Queue
Priority
Queue
Priority
Queue
Priority
Queue
1
2
2
1
Parallel A*
16
A
(4)
C
(4)
B
(2)
D
(3)
E
(1)
F
(0)
3
3
6
1
3
2
1
6
Source
Target
4
5
6
7
8
9
Total Estimated Cost to Target
A
B
C
D
F
D
E
Multi
Queue
B
Global Info:
C
D
E
F
A
A
A
D
B
3
3
6
7
9
Node:
Prev:
BCost:
A
(4)
B
(2)
C
(4)
D
(3)
A
A
(4)
B
C
B
(2)
C
(4)
Race!
B
4
B
(2)
C
(4)
B
C
F
(0)
D
F
D
D
(3)
F
(0)
E
(1)
D
(3)
F
(0)
D
F
D
E
(1)
E
E
(1)
E
Duplicate!
Problem #1: Race Conditions
Same neighbor explored concurrently
Solution:
Per-node locking + stable tie-breaking
Stop?
F
Problem #2: Early Exploration
Node explored too early / duplicate
Solution:
Intelligent pruning of nodes
Problem #3: When to Stop?
Cannot stop when target is first reached
Solution:
Drain the queue + intelligent pruning
Serial vs. Parallel A*
17
A
(4)
C
(4)
B
(2)
D
(3)
E
(1)
F
(0)
3
3
6
1
3
2
1
6
Source
Target
1
2
3
4
5
7
Time
A
B
C
D
F
D
E
Multi
Queue
6
1
2
3
4
5
7
A
B
C
D
F
D
E
Priority
Queue
6
Implementation & Results
18
Integrated Into Most Recent VTR Router
19
VTR 8+ (March 2024)
AIR
VTR 8+ (March 2024)
AIR
Serial Connection Router
Our Parallel Connection Router
| Largest VTR Circuits | Largest Koios Circuits | Titan Circuits |
Type of workload | General | Deep Learning | Large |
Architecture | VTR’s Flagship Arch | 22nm Intel-Like Arch (Arora et al.) | Stratix IV Arch Capture |
Path Search Variations
20
A*
Underestimating heuristic
Optimal path
Medium work
Directed
Overestimating heuristic
Suboptimal path
Less Work
Heuristic:
Guess of remaining cost to T
Dijkstra’s
No heuristic
Optimal paths to many T
More work
Titan Benchmark Results
21
Directed
A*
Geomean Route Time (mins)
0
60
120
13.2×
1.0×
1.7×
Parallel 12T
Parallel 1T
Serial Router
1.0×
1.0×
2.0×
Directed
A*
Normalized to Default VTR
0
1
2
0.99×
0.99×
Parallel CPD
Parallel WL
0.98×
0.99×
Koios Benchmark Results
22
Directed
A*
Geomean Route Time (mins)
0
60
120
11.0×
1.0×
1.5×
Parallel 12T
Parallel 1T
Serial Router
1.0×
1.0×
1.6×
Directed
A*
Normalized to Default VTR
0
1
2
0.99×
0.96×
Parallel CPD
Parallel WL
1.02×
0.98×
Scalable Path Search!
23
Geomean Path Search Time on Titan Benchmarks
Scalable Parallel Path Search
24
Dijkstra’s | 18.7× speedup | Deterministic |
A* | 13× speedup | Deterministic |
Directed | 2× speedup | Nondeterministic |
Thank You!
25
GitHub Repo: vtr-verilog-to-routing Branch: mq-parallel-router
Geomean Path Search Time on Titan Benchmarks