2 of 20

FPU and FastInvSqrt Circuit Specifications

Multi-cycle FPU performs floating-point add, subtract, multiply, and divide

Opcodes 0 to 3 select add, subtract, multiply, and divide, in that order
Asserts a Done signal when the operation completes
Indicates whether overflow or underflow has occurred

FastInvSqrt approximates 1/sqrt(x) without using division

Utilizes the FPU
Pipelined version improves performance by roughly 20% at the cost of more area

3 of 20

Work Division

Long:

Arithmetic Logic
Diagrams
RTL Verilog
Synthesis

Uriel:

Controls Logic
Diagrams
RTL Verilog

4 of 20

Theoretical Background

16-bit Composition:

0 | 0 1 0 1 1 | 0 0 0 0 1 1 0 0 0 1

Sign bit (1 bit): 0

Exponent (5 bits): 01011

Mantissa (10 bits): 0000110001

Conversion: (-1)^s * 2^(11-15) * [1+(49/1024)] = 0.06549072265625

Sign bit is 0: the value is positive
Exponent field reads 11, bias is 15

11 - 15 = -4

Mantissa field is 0000110001 = 49, so the fraction is 49/1024
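
The conversion above can be sketched in a few lines of Python. This is an illustrative decoder, not the project's RTL: the function name decode_half is hypothetical, only normalized numbers are handled, and the cross-check assumes the 16-bit layout matches IEEE 754 half precision (1 sign, 5 exponent, 10 mantissa bits, bias 15).

```python
import struct

BIAS = 15  # exponent bias stated on the slide


def decode_half(bits: int) -> float:
    """Decode a normalized 16-bit word: 1 sign, 5 exponent, 10 mantissa bits.

    Hypothetical helper for illustration. Exponent fields 0 (subnormal)
    and 31 (infinity/NaN) are not handled.
    """
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F
    mantissa = bits & 0x3FF
    return (-1) ** sign * 2.0 ** (exponent - BIAS) * (1 + mantissa / 1024)


word = 0b0010110000110001  # the bit pattern from the slide
value = decode_half(word)
print(value)  # 0.06549072265625

# Cross-check against Python's built-in half-precision codec
# (assumes the slide's format is IEEE 754 binary16).
reference = struct.unpack("<e", word.to_bytes(2, "little"))[0]
print(reference == value)  # True
```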

5 of 20

Addition/Subtraction

Increment the smaller exponent until both match, right-shifting its mantissa accordingly
Add or subtract the mantissas
Normalize
Use the appropriate sign
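
The steps above can be sketched in software as follows. This is a simplified model under stated assumptions, not the project's Verilog: half_add, to_bits, and from_bits are hypothetical names, inputs are assumed to be normalized IEEE 754 half-precision words, bits shifted out during alignment are truncated rather than rounded, and overflow and underflow are not detected.

```python
import struct


def to_bits(value: float) -> int:
    """Encode a float as a 16-bit word (IEEE 754 half precision assumed)."""
    return int.from_bytes(struct.pack("<e", value), "little")


def from_bits(bits: int) -> float:
    """Decode a 16-bit word back to a Python float."""
    return struct.unpack("<e", bits.to_bytes(2, "little"))[0]


def half_add(a: int, b: int) -> int:
    """Add two normalized 16-bit floats (truncating, no overflow checks)."""
    sa, ea, fa = (a >> 15) & 1, (a >> 10) & 0x1F, a & 0x3FF
    sb, eb, fb = (b >> 15) & 1, (b >> 10) & 0x1F, b & 0x3FF

    # Restore the hidden leading 1 to get 11-bit significands.
    ma, mb = fa | 0x400, fb | 0x400

    # Step 1: align. The operand with the smaller exponent is raised to the
    # larger exponent, so its significand is shifted right by the difference.
    exp = max(ea, eb)
    ma >>= exp - ea
    mb >>= exp - eb

    # Step 2: add or subtract the significands according to the signs.
    total = (-ma if sa else ma) + (-mb if sb else mb)
    if total == 0:
        return 0  # exact cancellation gives +0

    # Step 4 (sign) is decided by the sign of the signed sum.
    sign = 1 if total < 0 else 0
    mag = abs(total)

    # Step 3: normalize so the leading 1 sits at bit 10 again.
    while mag >= 0x800:
        mag >>= 1
        exp += 1
    while mag < 0x400:
        mag <<= 1
        exp -= 1

    return (sign << 15) | (exp << 10) | (mag & 0x3FF)


print(from_bits(half_add(to_bits(1.5), to_bits(2.25))))           # 3.75
# Subtraction is addition with the second operand's sign bit flipped.
print(from_bits(half_add(to_bits(2.25), to_bits(1.5) ^ 0x8000)))  # 0.75
```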

7 of 20

Multiplication/Division

Add the exponents (multiply) or subtract them (divide), correcting for the bias (15)
Multiply or divide mantissas
Normalize
Use appropriate sign bit
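
A minimal software sketch of these steps, under the same kind of simplifying assumptions: half_mul and half_div are hypothetical names, inputs are normalized IEEE 754 half-precision words, results are truncated rather than rounded, and overflow, underflow, and division by zero are not handled. It is a model of the arithmetic, not the project's Verilog.

```python
import struct

BIAS = 15


def to_bits(value: float) -> int:
    """Encode a float as a 16-bit word (IEEE 754 half precision assumed)."""
    return int.from_bytes(struct.pack("<e", value), "little")


def from_bits(bits: int) -> float:
    """Decode a 16-bit word back to a Python float."""
    return struct.unpack("<e", bits.to_bytes(2, "little"))[0]


def fields(bits: int):
    """Return (sign, biased exponent, 11-bit significand with hidden 1)."""
    return (bits >> 15) & 1, (bits >> 10) & 0x1F, (bits & 0x3FF) | 0x400


def half_mul(a: int, b: int) -> int:
    sa, ea, ma = fields(a)
    sb, eb, mb = fields(b)
    # Both exponent fields carry the bias, so one bias is subtracted.
    exp = ea + eb - BIAS
    prod = ma * mb            # value in [1, 4), scaled by 2^20
    if prod >= 1 << 21:       # product in [2, 4): normalize
        prod >>= 1
        exp += 1
    frac = (prod >> 10) & 0x3FF  # truncate low bits, drop hidden 1
    return ((sa ^ sb) << 15) | (exp << 10) | frac


def half_div(a: int, b: int) -> int:
    sa, ea, ma = fields(a)
    sb, eb, mb = fields(b)
    # The biases cancel in the subtraction, so one bias is added back.
    exp = ea - eb + BIAS
    quot = (ma << 11) // mb   # value in (0.5, 2), scaled by 2^11
    if quot < 1 << 11:        # quotient below 1.0: normalize
        exp -= 1
    else:
        quot >>= 1
    return ((sa ^ sb) << 15) | (exp << 10) | (quot & 0x3FF)


print(from_bits(half_mul(to_bits(1.5), to_bits(2.25))))    # 3.375
print(from_bits(half_div(to_bits(3.375), to_bits(2.25))))  # 1.5
```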

9 of 20

Comparison (<,>,=)

Check the sign bit first
Next, compare the exponents
Then compare the mantissas
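
The comparison order above can be sketched as a three-way compare on the raw 16-bit words. The name half_compare is hypothetical, and the sketch assumes normalized inputs: it treats +0 and -0 as different values and does not handle NaN, which a full implementation would need to address.

```python
import struct


def to_bits(value: float) -> int:
    """Encode a float as a 16-bit word (IEEE 754 half precision assumed)."""
    return int.from_bytes(struct.pack("<e", value), "little")


def half_compare(a: int, b: int) -> int:
    """Return -1 if a < b, 0 if a == b, 1 if a > b."""
    sa, sb = (a >> 15) & 1, (b >> 15) & 1

    # 1. Sign bit: a negative value is always below a positive one.
    if sa != sb:
        return -1 if sa else 1

    ea, eb = (a >> 10) & 0x1F, (b >> 10) & 0x1F
    fa, fb = a & 0x3FF, b & 0x3FF

    # 2. Exponent, then 3. mantissa, decide which magnitude is larger.
    if ea != eb:
        magnitude = 1 if ea > eb else -1
    elif fa != fb:
        magnitude = 1 if fa > fb else -1
    else:
        return 0

    # For two negative values the larger magnitude is the smaller number.
    return -magnitude if sa else magnitude


print(half_compare(to_bits(1.5), to_bits(2.25)))    # -1
print(half_compare(to_bits(-1.5), to_bits(-2.25)))  # 1
```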

12 of 20

FPU Implementation and Demo

The three arithmetic modules are independent Moore circuits, each with an asynchronous reset

Better input/output stability and clock synchronization
Output may be delayed by 1 cycle since it is a Moore circuit

Later integrated into FPU module with control signals selecting which output gets transmitted
Generates a Done signal when computation completes or there is an overflow or underflow

13 of 20

Newton’s Method for Root Approximation:

let f(y) = 1/y^2 - x

for any constant x > 0

finding the root of f(y) = 0 yields y = 1/sqrt(x)
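
A minimal sketch of the resulting iteration, assuming the usual choice f(y) = 1/y^2 - x, whose positive root is y = 1/sqrt(x). With f'(y) = -2/y^3, the Newton update y - f(y)/f'(y) simplifies to y * (1.5 - 0.5 * x * y^2), which needs only multiplication and subtraction. The names newton_step and inv_sqrt and the starting guess are illustrative, not taken from the slides.

```python
def newton_step(x: float, y: float) -> float:
    """One Newton update for f(y) = 1/y^2 - x; note there is no division."""
    return y * (1.5 - 0.5 * x * y * y)


def inv_sqrt(x: float, guess: float, iterations: int = 4) -> float:
    """Refine an initial guess for 1/sqrt(x).

    The guess must be positive and reasonably close to the answer
    for the iteration to converge.
    """
    y = guess
    for _ in range(iterations):
        y = newton_step(x, y)
    return y


# Starting from 0.4, the estimate for 1/sqrt(4) approaches 0.5.
print(abs(inv_sqrt(4.0, 0.4) - 0.5) < 1e-6)  # True
```

Each step roughly squares the error, which is why a good initial guess followed by one or two iterations is usually enough at low precision.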

18 of 20

FastInverseSqrtComparison Testbench Demo

Moore circuit approach for better input/output synchronization between modules
Non-pipelined version with a single FPU only computes one arithmetic operation at a time
Pipelined version computes parallel terms in the expression simultaneously

Saves approximately 5 clock cycles, which is about 20% faster than the non-pipelined implementation

Fast inverse square root is heavily used in computer graphics:

Vector normalization to calculate shading and light angles
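
As an illustration of the graphics use case, the sketch below normalizes a vector with the widely known 32-bit software formulation of fast inverse square root (bit-level initial guess with the constant 0x5F3759DF, followed by one Newton step). This is an assumption-laden stand-in: the circuit in this presentation works on 16-bit values, and its initial-guess constant is not given here. The names fast_inv_sqrt32 and normalize are hypothetical.

```python
import struct


def fast_inv_sqrt32(x: float) -> float:
    """Approximate 1/sqrt(x) for x > 0 using the classic 32-bit trick."""
    # Reinterpret the single-precision bits of x as an unsigned integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Shift and subtract from the magic constant to get an initial guess.
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # One Newton step; only multiplication and subtraction are used.
    return y * (1.5 - 0.5 * x * y * y)


def normalize(vector):
    """Scale a vector to approximately unit length without dividing."""
    scale = fast_inv_sqrt32(sum(c * c for c in vector))
    return tuple(c * scale for c in vector)


unit = normalize((3.0, 4.0, 0.0))
length = sum(c * c for c in unit) ** 0.5
print(abs(length - 1.0) < 0.01)  # True
```

The result is only approximately unit length, which is typically acceptable for shading and lighting calculations.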

19 of 20

Synopsys Synthesis Report

FPU:

Timing:

Clock Period: 1.5 ns (0.32 ns slack)

Power:

Cell Internal Power = 1.3119 mW
Net Switching Power = 655.5172 uW
Total Dynamic Power = 1.9675 mW
Cell Leakage Power = 217.2053 uW

Area:

Combinational area: 2337.874010
Buf/Inv area: 204.022002
Noncombinational area: 937.117998
Total cell area: 3274.992009

FastInvSqrtPipelined:

Timing:

Clock Period: 1.5 ns (0.32 ns slack)

Power:

Cell Internal Power = 1.4501 mW
Net Switching Power = 27.9948 uW
Total Dynamic Power = 1.4781 mW
Cell Leakage Power = 326.7899 uW

Area:

Combinational area: 3386.978015
Buf/Inv area: 281.162003
Noncombinational area: 1645.209982
Total cell area: 5032.187998

20 of 20

Conclusion

What was learned/discovered:

State Graphs can get messy as complexity increases: flowcharts better model the circuit functions
Behavioral Verilog design is the most efficient way to generate complicated logic
Synopsys generates the power report from random inputs, so it may not reflect actual usage

FPU has higher total dynamic power than the FastInvSqrt circuit because it is more sensitive to input level switching

Fast Inverse Square Root software algorithm avoids costly division hardware

This implementation combines the multiplier and divider into the same module so the benefit may not be as significant

Pipelining increases instruction throughput; however, individual instruction execution time remains the same
