1 of 20

16-bit Half Precision Floating Point Unit

By: Long Nguyen, Uriel Lopez

2 of 20

FPU and FastInvSqrt Circuit Specifications

  • Multi-Cycle FPU performs floating point add, subtract, multiply, divide
    • Opcodes 0 to 3 select add, subtract, multiply, divide, in that order
    • Outputs Done signal when operation complete
    • Indicates if overflow or underflow has occurred
  • FastInvSqrt approximates 1/sqrt(x) without using division
    • Utilizes the FPU
    • Pipelined version improves performance by roughly 20% at the cost of more area
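
A rough behavioral model of the interface listed above, as a sketch only. It assumes the opcodes map to the operations in the order they are listed. The names `fpu_model` and `OPS` are hypothetical, and the overflow/underflow thresholds are the standard half-precision limits, not values taken from this design. Division by zero is not handled.

```python
import operator

# Assumed mapping: opcode order follows the order the operations are listed.
OPS = {0: operator.add, 1: operator.sub, 2: operator.mul, 3: operator.truediv}

HALF_MAX = 65504.0            # largest finite half-precision value
HALF_MIN_NORMAL = 2.0 ** -14  # smallest positive normal half-precision value

def fpu_model(opcode, a, b):
    """Return (result, overflow, underflow) for one operation on Python floats."""
    result = OPS[opcode](a, b)
    overflow = abs(result) > HALF_MAX
    underflow = result != 0 and abs(result) < HALF_MIN_NORMAL
    return result, overflow, underflow

print(fpu_model(2, 1.5, 2.25))
```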

3 of 20

Work Division

Long:

  • Arithmetic Logic
  • Diagrams
  • RTL Verilog
  • Synthesis

Uriel:

  • Control Logic
  • Diagrams
  • RTL Verilog

4 of 20

Theoretical Background

16-bit Composition:

Bit 15 (MSB) to bit 0 (LSB): 0 01011 0000110001

  • Sign bit (bit 15): 0
  • Exponent (bits 14-10): 01011
  • Mantissa (bits 9-0): 0000110001

Conversion: (-1)^s * 2^(11-15) * [1+(49/1024)] = 0.06549072265625

  • Sign bit is 0, so the value is positive
  • Exponent reads 11, bias is 15
    • 11-15 = -4
  • Mantissa is 0000110001 (49), so the fraction is 49/1024
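
The conversion above can be checked with a short sketch. The function name `decode_half` is hypothetical, and the sketch assumes a normal number (it does not handle zero, subnormals, infinity, or NaN).

```python
def decode_half(bits: int) -> float:
    """Decode a normal 16-bit half-precision pattern into a Python float."""
    sign = (bits >> 15) & 0x1      # bit 15
    exponent = (bits >> 10) & 0x1F # bits 14-10, biased by 15
    mantissa = bits & 0x3FF        # bits 9-0, fraction out of 1024
    return (-1) ** sign * 2.0 ** (exponent - 15) * (1 + mantissa / 1024)

# The example pattern from this slide: 0 01011 0000110001
value = decode_half(0b0010110000110001)
print(value)  # 0.06549072265625
```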

5 of 20

Addition/Subtraction

  • Increment the smaller exponent until both match, right-shifting its mantissa accordingly
  • ADD/SUBTRACT mantissas
  • Normalize
  • Use Appropriate Sign
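
The four steps can be sketched on raw bit patterns as follows. This is an illustration, not the RTL from this project: the names `unpack` and `half_add` are hypothetical, only normal numbers are handled, results are truncated rather than rounded, and overflow/underflow are ignored. Subtraction is the same routine with the second operand's sign bit flipped.

```python
def unpack(bits):
    """Split a 16-bit pattern into sign, biased exponent, and 11-bit significand."""
    sign = (bits >> 15) & 1
    exp = (bits >> 10) & 0x1F
    sig = (bits & 0x3FF) | 0x400  # restore the hidden leading 1 (normal numbers only)
    return sign, exp, sig

def half_add(a, b):
    """Add two normal half-precision bit patterns (truncating, no range checks)."""
    sa, ea, ma = unpack(a)
    sb, eb, mb = unpack(b)
    # Step 1: align by raising the smaller exponent and right-shifting its significand.
    if ea < eb:
        ma >>= eb - ea
        ea = eb
    else:
        mb >>= ea - eb
    exp = ea
    # Step 2: add significands if the signs match, otherwise subtract the smaller one.
    if sa == sb:
        sig, sign = ma + mb, sa
    elif ma >= mb:
        sig, sign = ma - mb, sa
    else:
        sig, sign = mb - ma, sb
    if sig == 0:
        return 0
    # Step 3: normalize so the leading 1 sits at bit 10.
    while sig >= 0x800:
        sig >>= 1
        exp += 1
    while sig < 0x400:
        sig <<= 1
        exp -= 1
    # Step 4: pack the sign, exponent, and 10 fraction bits.
    return (sign << 15) | (exp << 10) | (sig & 0x3FF)

print(hex(half_add(0x3E00, 0x4080)))  # 1.5 + 2.25 = 3.75 -> 0x4380
```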

7 of 20

Multiplication/Division

  • Add the exponents (multiply) or subtract them (divide), correcting for the bias (15)
  • Multiply or divide mantissas
  • Normalize
  • Use appropriate sign bit
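
A minimal sketch of these steps on raw bit patterns. The names `half_mul` and `half_div` are hypothetical, and the sketch assumes normal operands, truncation instead of rounding, and no overflow, underflow, or divide-by-zero handling. It also assumes the sign is the XOR of the two sign bits.

```python
def fields(bits):
    """Return sign, biased exponent, and 11-bit significand with the hidden 1."""
    return (bits >> 15) & 1, (bits >> 10) & 0x1F, (bits & 0x3FF) | 0x400

def pack(sign, exp, sig):
    """Assemble a 16-bit pattern, dropping the hidden leading 1."""
    return (sign << 15) | (exp << 10) | (sig & 0x3FF)

def half_mul(a, b):
    sa, ea, ma = fields(a)
    sb, eb, mb = fields(b)
    exp = ea + eb - 15        # sum of biased exponents carries the bias twice
    prod = ma * mb            # 22-bit product, value in [1, 4) scaled by 2^20
    if prod & (1 << 21):      # product in [2, 4): shift right and bump the exponent
        prod >>= 1
        exp += 1
    return pack(sa ^ sb, exp, prod >> 10)

def half_div(a, b):
    sa, ea, ma = fields(a)
    sb, eb, mb = fields(b)
    exp = ea - eb + 15        # difference of biased exponents loses the bias
    quot = (ma << 10) // mb   # quotient in (0.5, 2) scaled by 2^10
    if quot < 0x400:          # quotient below 1: take one more bit, lower the exponent
        quot = (ma << 11) // mb
        exp -= 1
    return pack(sa ^ sb, exp, quot)

print(hex(half_mul(0x3E00, 0x4080)))  # 1.5 * 2.25 = 3.375 -> 0x42c0
print(hex(half_div(0x42C0, 0x4080)))  # 3.375 / 2.25 = 1.5 -> 0x3e00
```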

9 of 20

Comparison (<,>,=)

  • Compare the sign bits first
  • Then the exponents
  • Then the mantissas
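
The comparison order can be sketched as below. The function name `half_compare` and the -1/0/1 return convention are assumptions for illustration. NaN is not handled, and -0 is treated as smaller than +0.

```python
def half_compare(a, b):
    """Return -1 if a < b, 0 if a == b, 1 if a > b, for 16-bit half-precision patterns."""
    sa, ea, ma = (a >> 15) & 1, (a >> 10) & 0x1F, a & 0x3FF
    sb, eb, mb = (b >> 15) & 1, (b >> 10) & 0x1F, b & 0x3FF
    # Sign bit first: a negative value is always smaller than a positive one.
    if sa != sb:
        return -1 if sa else 1
    # Same sign: compare magnitudes by exponent, then by mantissa.
    if ea != eb:
        order = 1 if ea > eb else -1
    elif ma != mb:
        order = 1 if ma > mb else -1
    else:
        return 0
    # For two negative values the larger magnitude is the smaller number.
    return -order if sa else order

print(half_compare(0x3E00, 0x4080))  # 1.5 vs 2.25 -> -1
```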

12 of 20

FPU Implementation and Demo

  • The three arithmetic modules are independent Moore circuits, each with an asynchronous reset
    • Better input/output stability and clock synchronization
    • Output may be delayed by 1 cycle since it is a Moore circuit
  • Later integrated into FPU module with control signals selecting which output gets transmitted
  • Generates a Done signal when computation completes or there is an overflow or underflow

13 of 20

Newton’s Method for Root Approximation:

let f(y) = 1/y^2 - x

for any constant x

finding the root of f(y) = 0 yields y = 1/sqrt(x)
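
A sketch of the resulting iteration, assuming the usual choice f(y) = 1/y^2 - x, whose root is y = 1/sqrt(x). Newton's update y - f(y)/f'(y) with f'(y) = -2/y^3 simplifies to y * (1.5 - 0.5 * x * y^2), which needs only multiplies and a subtract. The name `inv_sqrt_newton`, the initial guess, and the step count are hypothetical; how this design seeds the first guess is not shown here.

```python
def inv_sqrt_newton(x, y0, steps=3):
    """Refine an initial guess y0 toward 1/sqrt(x) without any division."""
    y = y0
    for _ in range(steps):
        # Newton step for f(y) = 1/y^2 - x, simplified so no divide is needed.
        y = y * (1.5 - 0.5 * x * y * y)
    return y

# Rough guess 0.4 for 1/sqrt(4) = 0.5; each step roughly squares the error.
print(inv_sqrt_newton(4.0, 0.4))
```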

18 of 20

FastInverseSqrtComparison Testbench Demo

  • Moore circuit approach for better input/output synchronization between modules
  • Non-pipelined version with a single FPU only computes one arithmetic operation at a time
  • Pipelined version computes parallel terms in the expression simultaneously
    • Saves approximately 5 clock cycles, which is about 20% faster than the non-pipelined implementation
  • Fast inverse square root is heavily used in computer graphics:
    • Vector normalization to calculate shading and light angles
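
To illustrate the graphics use case: normalizing a vector multiplies each component by 1/sqrt of its squared length, so an inverse square root unit replaces a square root and a divide. The function `normalize` and its `inv_sqrt` parameter are hypothetical. The default uses the math library as a stand-in for the hardware approximation.

```python
import math

def normalize(v, inv_sqrt=lambda s: 1.0 / math.sqrt(s)):
    """Scale v to unit length by multiplying with 1/sqrt(v . v)."""
    scale = inv_sqrt(sum(c * c for c in v))
    return [c * scale for c in v]

unit = normalize([3.0, 4.0, 0.0])
print(unit)
```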

19 of 20

Synopsys Synthesis Report

FPU:

  • Timing:
    • Clock Period: 1.5 ns (0.32 slack)
  • Power:
    • Cell Internal Power = 1.3119 mW
    • Net Switching Power = 655.5172 uW
    • Total Dynamic Power = 1.9675 mW
    • Cell Leakage Power = 217.2053 uW
  • Area:
    • Combinational area: 2337.874010
    • Buf/Inv area: 204.022002
    • Noncombinational area: 937.117998
    • Total cell area: 3274.992009

FastInvSqrtPipelined:

  • Timing:
    • Clock Period: 1.5 ns (0.32 slack)
  • Power:
    • Cell Internal Power = 1.4501 mW
    • Net Switching Power = 27.9948 uW
    • Total Dynamic Power = 1.4781 mW
    • Cell Leakage Power = 326.7899 uW
  • Area:
    • Combinational area: 3386.978015
    • Buf/Inv area: 281.162003
    • Noncombinational area: 1645.209982
    • Total cell area: 5032.187998
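
A quick arithmetic check of the reported areas, using only the numbers above. It confirms that combinational plus noncombinational area matches each total, and gives the area ratio of the pipelined circuit to the FPU alone. The variable names are hypothetical.

```python
# Area figures copied from the synthesis report above (units as reported).
fpu = {"comb": 2337.874010, "noncomb": 937.117998, "total": 3274.992009}
pipelined = {"comb": 3386.978015, "noncomb": 1645.209982, "total": 5032.187998}

for report in (fpu, pipelined):
    # Combinational + noncombinational should reproduce the total cell area.
    assert abs(report["comb"] + report["noncomb"] - report["total"]) < 1e-3

ratio = pipelined["total"] / fpu["total"]
print(f"{ratio:.2f}")  # 1.54
```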

20 of 20

Conclusion

What was learned/discovered:

  • State Graphs can get messy as complexity increases: flowcharts better model the circuit functions
  • Behavioral Verilog design is the most efficient way to generate complicated logic
  • Synopsys generates the power report based on random inputs, so it may not reflect actual usage
    • FPU has higher total dynamic power than the FastInvSqrt circuit because it is more sensitive to input level switching
  • Fast Inverse Square Root software algorithm avoids costly division hardware
    • This implementation combines the multiplier and divider into the same module so the benefit may not be as significant
  • Pipelining increases instruction throughput, but individual instruction execution time remains the same