1 of 25

MultiQueue-Based FPGA Routing: Relaxed A* Priority Ordering for Improved Parallelism

Alexandre Singer (speaker),* Hang Yan,* Guozheng Zhang,*

Mark C. Jeffrey,* Mirjana Stojilović,§ Vaughn Betz*

*

§

FPT 2024

2 of 25

FPGA CAD is A Large Graph Problem

2

SYNTH

PACK & PLACE

ROUTE

Netlist

Arch

3 of 25

Compile Time Forecast: From Bad to Worse

3

Serial CPU perf.

≈ FPGA growth

4 of 25

Compile Time Forecast: From Bad to Worse

4

10× Gap

5 of 25

But We Have More and More Cores!

5

6 of 25

Could Using Multiple Cores Bridge this Gap?

6

Parallel opportunity

UB

7 of 25

FPGA Routing

7

8 of 25

PathFinder: High-Quality, Many Path Searches

8

while no overlapping nets:

for each net n:

route(n)

update costs

9 of 25

Serial Speedup: Incremental (AIR)

9

while no overlapping nets:

for each marked net n:

route(n) // Only cong. part

update costs

mark nets if overlap

10 of 25

Parallel Speedup: Challenging

  • Coarse-grained: route non-overlapping nets in parallel
    • Large nets lead to less scalability
    • Path search as a subroutine
  • Fine-grained: search for paths in parallel
  • Parallelize neighbor exploration
    • Limited scalability (1.2× on 2T) [Gort & Anderson]
  • Software speculation of nodes
    • High overhead (3.7× on 8T) [Moctar & Brisk]

10

11 of 25

Scalable Parallel Path Search

11

12 of 25

Path Search Using A*

12

4

5

6

7

8

9

Total Estimated Cost to Target

A

B

C

D

F

D

E

Priority

Queue

B

Global Info:

C

D

E

F

A

A

A

D

B

3

3

6

7

9

Node:

Prev:

BCost:

A

(4)

C

(4)

B

(2)

D

(3)

E

(1)

F

(0)

3

3

6

1

3

2

1

6

Source

Target

A

B

B

4

B

C

C

D

D

E

E

D

D

F

F

A

(4)

A

(4)

B

(2)

B

(2)

B

(2)

C

(4)

C

(4)

C

(4)

D

(3)

D

(3)

D

(3)

E

(1)

E

(1)

E

(1)

F

(0)

F

(0)

F

(0)

13 of 25

How do we Speed Up A*?

13

Explore in parallel?

A

(4)

C

(4)

B

(2)

D

(3)

E

(1)

F

(0)

3

3

6

1

3

2

1

6

Source

Target

1

2

3

4

5

7

Pop Order

A

B

C

D

F

D

E

Priority

Queue

6

Priority

Queue

14 of 25

Priority Queue Bottleneck

Serial Priority Queue

  • Fully sorted
  • Lots of lock contention
  • Low concurrency

14

Priority

Queue

2

1

2

1

1

2

2

15 of 25

MultiQueue: More Concurrency But Less Sorting

Serial Priority Queue

  • Relaxed sorting
  • Low lock contention
  • High concurrency
  • Fully sorted
  • Lots of lock contention
  • Low concurrency

MultiQueue

15

Priority

Queue

MultiQueue

Priority

Queue

Priority

Queue

Priority

Queue

Priority

Queue

1

2

2

1

16 of 25

Parallel A*

16

A

(4)

C

(4)

B

(2)

D

(3)

E

(1)

F

(0)

3

3

6

1

3

2

1

6

Source

Target

4

5

6

7

8

9

Total Estimated Cost to Target

A

B

C

D

F

D

E

Multi

Queue

B

Global Info:

C

D

E

F

A

A

A

D

B

3

3

6

7

9

Node:

Prev:

BCost:

A

(4)

B

(2)

C

(4)

D

(3)

A

A

(4)

B

C

B

(2)

C

(4)

Race!

B

4

B

(2)

C

(4)

B

C

F

(0)

D

F

D

D

(3)

F

(0)

E

(1)

D

(3)

F

(0)

D

F

D

E

(1)

E

E

(1)

E

Duplicate!

Problem #1: Race Conditions

Same neighbor explored concurrently

Solution:

Per-node locking + stable tie-breaking

Stop?

F

Problem #2: Early Exploration

Node explored too early / duplicate

Solution:

Intelligent pruning of nodes

Problem #3: When to Stop?

Cannot stop when target is first reached

Solution:

Drain the queue + intelligent pruning

17 of 25

Serial vs. Parallel A*

17

A

(4)

C

(4)

B

(2)

D

(3)

E

(1)

F

(0)

3

3

6

1

3

2

1

6

Source

Target

1

2

3

4

5

7

Time

A

B

C

D

F

D

E

Multi

Queue

6

1

2

3

4

5

7

A

B

C

D

F

D

E

Priority

Queue

6

18 of 25

Implementation & Results

18

19 of 25

Integrated Into Most Recent VTR Router

19

VTR 8+ (March 2024)

AIR

VTR 8+ (March 2024)

AIR

Serial Connection Router

Our Parallel Connection Router

Largest VTR Circuits

Largest Koios Circuits

Titan Circuits

Type of workload

General

Deep Learning

Large

Architecture

VTR’s Flagship Arch

22nm Intel-Like Arch (Arora et al.)

Stratix IV Arch Capture

20 of 25

Path Search Variations

20

A*

Underestimating heuristic

Optimal path

Medium work

Directed

Overestimating heuristic

Suboptimal path

Less Work

Heuristic:

Guess of remaining cost to T

Dijkstra’s

No heuristic

Optimal paths to many T

More work

21 of 25

Titan Benchmark Results

21

Directed

A*

Geomean Route Time (mins)

0

60

120

13.2×

1.0×

1.7×

Parallel 12T

Parallel 1T

Serial Router

1.0×

1.0×

2.0×

Directed

A*

Normalized to Default VTR

0

1

2

0.99×

0.99×

Parallel CPD

Parallel WL

0.98×

0.99×

22 of 25

Koios Benchmark Results

22

Directed

A*

Geomean Route Time (mins)

0

60

120

11.0×

1.0×

1.5×

Parallel 12T

Parallel 1T

Serial Router

1.0×

1.0×

1.6×

Directed

A*

Normalized to Default VTR

0

1

2

0.99×

0.96×

Parallel CPD

Parallel WL

1.02×

0.98×

23 of 25

Scalable Path Search!

23

Geomean Path Search Time on Titan Benchmarks

24 of 25

Scalable Parallel Path Search

  • MultiQueue unlocks concurrency
  • Algorithm upgrades: reorder tolerance + deterministic tie breaking
    • Path quality maintained
    • Deterministic parallel A* and Dijkstra’s
  • Integrated into AIR (VTR 8+)

24

Dijkstra’s

18.7× speedup

Deterministic

A*

13× speedup

Deterministic

Directed

2× speedup

Nondeterministic

25 of 25

Thank You!

25

GitHub Repo: vtr-verilog-to-routing Branch: mq-parallel-router

Geomean Path Search Time on Titan Benchmarks