1 of 64

IEEE 754 Floating Point

EECS department undergrad survey open through 3/3 (join EECS 101 Ed)

CS61C: Great Ideas in Computer Architecture (a.k.a. Machine Structures)

cs61c.org

Assistant Teaching Professor Lisa Yan

Yan, SP25

07-Floating Point (1)

2 of 64

Generics wrap-up: Pointer Arithmetic

  • Generics wrap-up
  • Strawman: Fixed Point
  • Floating Point
  • Example, Float Step Size
  • Special Numbers, Overflow, and Underflow
  • [Reference]
    • Other Floating Point Reps
    • Floats under the hood
    • More Examples

Agenda


3 of 64

How to use swap?

  • Use the swap function to swap the first and last elements in an array:

void swap(void *ptr1, void *ptr2, size_t nbytes);

void *arr;              // pointer to array arr
size_t nelems, nbytes;  // # elements in array, element size

/* initialize */

swap(arr, ???, nbytes);

Example values: arr = 0x100, nelems = 5, nbytes = 4

A. arr + nelems - 1

B. arr + (nelems - 1)*nbytes

C. (char *) arr + (nelems - 1)*nbytes

D. (char *) (arr + (nelems - 1)*nbytes)

E. Something else

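[Figure: five 4-byte elements 1, 2, 3, 4, 5 stored at addresses 0x100, 0x104, 0x108, 0x10C, 0x110 (the array ends at 0x114); the swap exchanges the first and last elements, 1 5 → 5 1.]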


4 of 64

Generic Swap in Action

swap(arr, (char *) arr + (nelems - 1)*nbytes, nbytes);

void swap(void *ptr1, void *ptr2, size_t nbytes) {
    char temp[nbytes];           // scratch buffer for one element
    memcpy(temp, ptr1, nbytes);  // temp  <- *ptr1
    memcpy(ptr1, ptr2, nbytes);  // *ptr1 <- *ptr2
    memcpy(ptr2, temp, nbytes);  // *ptr2 <- temp
}

[Figure: array elements 1, 2, 3, 4, 5 at addresses 0x100 to 0x110; inside swap, ptr1 = 0x100, ptr2 = 0x110, nbytes = 4, and temp holds 4 bytes; afterwards the first and last elements are exchanged, 1 5 → 5 1.]

Pointer arithmetic in generics must be bytewise arithmetic:

1. Cast void * pointer to char *.
2. Pointer arithmetic is then effectively byte-wise!
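As a minimal sketch of the byte-wise pointer arithmetic described here: swap is the function from the slide, while swap_ends is a hypothetical helper (not from the lecture) that computes the address of the last element and passes it along.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Generic swap from the slide: exchanges nbytes bytes between ptr1 and ptr2. */
void swap(void *ptr1, void *ptr2, size_t nbytes) {
    char temp[nbytes];               /* variable-length array, as on the slide */
    memcpy(temp, ptr1, nbytes);
    memcpy(ptr1, ptr2, nbytes);
    memcpy(ptr2, temp, nbytes);
}

/* Hypothetical helper (not from the lecture): swap first and last elements. */
void swap_ends(void *arr, size_t nelems, size_t nbytes) {
    assert(arr != NULL && nbytes > 0);
    if (nelems < 2) {
        return;                      /* nothing to swap */
    }
    /* Cast to char * so the offset is counted in bytes, not elements. */
    void *last = (char *) arr + (nelems - 1) * nbytes;
    swap(arr, last, nbytes);
}
```

For an int array a of five elements, swap_ends(a, 5, sizeof a[0]) would exchange a[0] and a[4]; the same call shape works for any element type because only the element size is passed.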


5 of 64

(change of gears)


6 of 64

Great Idea #1: Abstraction (Levels of Representation/Interpretation)

  • Week 1
    • Welcome
    • Number 1: Bit ops, Integer num rep
  • Week 2
    • C 1: Intro
    • C 2: Pointers, Arrays, Strings
    • C 3: Memory (Stack, Heap)
  • Week 3
    • C 4: Generics, Function Pointers
    • Number 2: Floating Point
    • RISC-V 1: Intro to Assembly

Learning C means learning binary abstraction! Let’s revisit number representation, but now for numbers like 4.25, 5.17, ….


7 of 64

Quote of the Day

“95% of the folks out there are completely clueless about floating-point.”

– James Gosling, 1998-02-28

  • Sun Fellow
  • Java Inventor


8 of 64

Strawman: Fixed Point


9 of 64

Review of Integer Number Representations

Computers are made to process numbers, which are represented as bits.

What can we represent in N bits?

  • 2^N things, and no more! We’ve covered a few conventions…
  • Unsigned integers:
    • Range: 0 to 2^N - 1
    • (N=32: 2^N - 1 = 4,294,967,295)
  • Signed integers (two’s complement):
    • Range: -2^(N-1) to 2^(N-1) - 1
    • (N=32: 2^(N-1) - 1 = 2,147,483,647)


10 of 64

What about other numbers?

Very large numbers? (sec/millennium)

  • 31,556,926,000ten = 3.1556926ten × 10^10 (scientific notation)

Very small numbers? (Bohr radius)

  • 0.0000000000529177ten = 5.29177ten × 10^-11 (scientific notation)

#s with both integer & fractional parts?

  • 2.625

Let’s start with representing this number…


11 of 64

“Fixed Point” Binary Representation

The “binary point,” like the decimal point, signifies the boundary between integer and fractional parts.

Example 6-bit representation, positive numbers only:

  bit pattern:   x     x    .   y      y      y      y
  place value:   2^1   2^0      2^-1   2^-2   2^-3   2^-4


12 of 64

“Fixed Point” Binary Representation

The “binary point,” like the decimal point, signifies the boundary between integer and fractional parts.

Example 6-bit representation, positive numbers only: 101010 read as 10.1010two

  bit:           1     0    .   1      0      1      0
  place value:   2^1   2^0      2^-1   2^-2   2^-3   2^-4

  1 × 2^1 + 1 × 2^-1 + 1 × 2^-3
  = 2 + 0.5 + 0.125
  = 2.625ten

With the above 6-bit “fixed binary point” representation:

  • Range: 0 to 3.9375 (11.1111two = 4 - 1 × 2^-4)
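The place-value reading above can be sketched in C. This is a minimal illustration, assuming the 6-bit xx.yyyy layout from the slide; the function name fixed6_to_double is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Interpret the low 6 bits as xx.yyyy: 2 integer bits, 4 fraction bits.
 * Hypothetical helper for illustration only. */
double fixed6_to_double(uint8_t bits) {
    assert(bits < 64);      /* only 6 bits are meaningful */
    return bits / 16.0;     /* 4 fraction bits: scale by 2^-4 */
}
```

With this sketch, the pattern 101010 (0x2A) decodes to 2.625 and the largest pattern 111111 (0x3F) decodes to 3.9375, matching the range stated above.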


13 of 64

Arithmetic with Fixed Point

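Addition is straightforward:

     01.1000    1.5ten
  +  00.1000    0.5ten
  ----------
     10.0000    2.0ten

Multiplication is a bit more complex:

           01.1000    1.5ten
        ×  00.1000    0.5ten
        ----------
            000000
           000000
          000000
         011000
        000000
       000000
  ----------------
      000011000000    = 0000.11000000two = 0.75ten

Need to remember where the point is: each operand has 4 fractional bits, so the product has 8.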
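A minimal sketch of the two operations, assuming the same 6-bit xx.yyyy format stored in the low bits of a byte; the names fixed6_add and fixed6_mul are hypothetical, and results that do not fit in 6 bits are simply masked off.

```c
#include <assert.h>
#include <stdint.h>

#define FRAC_BITS 4   /* xx.yyyy: 4 bits to the right of the binary point */

/* Addition: the binary points already line up, so add the raw patterns. */
uint8_t fixed6_add(uint8_t a, uint8_t b) {
    assert(a < 64 && b < 64);
    return (uint8_t) ((a + b) & 0x3F);
}

/* Multiplication: the raw product has 8 fraction bits, so shift right by 4
 * to move the point back. Low-order bits are truncated in this sketch. */
uint8_t fixed6_mul(uint8_t a, uint8_t b) {
    assert(a < 64 && b < 64);
    unsigned wide = (unsigned) a * b;
    return (uint8_t) ((wide >> FRAC_BITS) & 0x3F);
}
```

Here 01.1000 is 0x18 and 00.1000 is 0x08, so the sum is 0x20 (10.0000) and the product is 0x0C (00.1100, i.e. 0.75).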


14 of 64

What about other numbers?

Very large numbers? (sec/millennium)

  • 31,556,926,000ten = 3.1556926ten × 10^10
  • ⚠️ 11···0.0 (34 bits)

Very small numbers? (Bohr radius)

  • 0.0000000000529177ten = 5.29177ten × 10^-11
  • ⚠️ 0.0···11 (58 bits)

#s with both integer & fractional parts?

  • 2.625

To store all these numbers, we’d need a fixed-point rep with at least 92 bits. There must be a better way!


15 of 64

Floating Point


16 of 64

Floating Point

A “floating binary point” uses our limited bits most effectively (and thus gives more accuracy in our number representation).

Example:

  • Suppose we wanted to represent 0.1640625ten.
  • Binary representation: …000000.001010100000…

  1. Store these “significant bits” (10101).
  2. Keep track of the binary point 2 places to the left of the MSB (“exponent field”).

The binary point is stored separately from the significant bits, so very large and small numbers can be represented.


17 of 64

Enter Scientific Notation (in Decimal)

1.640625ten × 10^-1

  • mantissa: 1.640625 (contains the decimal point)
  • radix (base): 10
  • exponent: -1

“Normalized form”: no leading 0s (exactly one nonzero digit to the left of the point).

  • To represent 3/1,000,000,000:
    • Normalized: 3.0 × 10^-9
    • Not normalized: 0.3 × 10^-8, 30.0 × 10^-10


18 of 64

Enter Scientific Notation (in Binary)

  • Decimal: 1.640625ten × 10^-1 (mantissa 1.640625 with a decimal point, radix 10, exponent -1)
  • Binary: 1.0101two × 2^-3 (mantissa 1.0101 with a “binary point”, radix 2, exponent -3)

“Normalized form”: no leading 0s (exactly one nonzero digit to the left of the point).

  • To represent 3/1,000,000,000:
    • Normalized: 3.0 × 10^-9
    • Not normalized: 0.3 × 10^-8, 30.0 × 10^-10

“Floating point”: Computer arithmetic that supports a binary normalized form.

  • Represents numbers where the binary point is not fixed (as opposed to integers).
  • C variable types: float, double

Note: Normalized binary representation always leads with a 1! (Why? The only nonzero binary digit is 1.)


19 of 64

IEEE 754 Single-Precision Floating Point (1/4)

  • Single precision standard for 32-bit word. In C, float.
    • Each field is NOT interpreted as two’s complement!
  • Instead, designed to maximize accuracy of representing many numbers with a limited precision of 32 bits.
    • Precision is a count of the number of bits used to represent a value.
    • Accuracy is the difference between the actual value of a number and its computer representation.

  bits:    31      30–23      22–0
  field:   s       exponent   significand
  width:   1 bit   8 bits     23 bits

Normalized format: ±1.xxx···xtwo × 2^(yyy···ytwo)


20 of 64

IEEE 754 Single-Precision Floating Point (2/4)

  bits:    31      30–23      22–0
  field:   s       exponent   significand
  width:   1 bit   8 bits     23 bits

Normalized format: ±1.xxx···xtwo × 2^(yyy···ytwo)

Sign bit, contributing (-1)^S:

  • 1 is negative
  • 0 is positive

Significand field, contributing (1 + significand):

  • Mantissa = 1 + 23-bit significand field
  • In normalized form, the mantissa always leads with 1
  • Therefore, to pack more bits, IEEE 754:
    • Treats the mantissa’s leading 1 as implicit
    • Significand explicitly represents the other bits!
    • 0 ≤ significand field < 1


21 of 64

IEEE 754 Single-Precision Floating Point (3/4)

  bits:    31      30–23      22–0
  field:   s       exponent   significand
  width:   1 bit   8 bits     23 bits

Normalized format: ±1.xxx···xtwo × 2^(yyy···ytwo)

Exponent field uses “bias notation”, contributing 2^(exponent-127):

  • Bias of 127: Subtract 127 from exponent field to get exponent value.
  • Designers wanted FP numbers to be used even without specialized FP hardware, e.g., sort records with FP numbers using integer compares.
  • Idea: for the same sign, a bigger exponent field should represent bigger numbers.
    • 2’s complement poses a problem (because negative numbers look bigger)
    • We’ll see numbers ordered EXACTLY as in sign-magnitude:
      • Bit patterns 00···00 -> 11···11 run through 0 to +MAX, then -0 to -MAX
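The integer-compare idea can be sketched in C. This is a rough illustration that assumes float is a 32-bit IEEE 754 single and only handles non-negative, non-NaN inputs; float_bits and less_than_via_bits are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the raw 32-bit pattern out of a float (assumes IEEE 754 single). */
uint32_t float_bits(float f) {
    uint32_t u;
    assert(sizeof f == sizeof u);
    memcpy(&u, &f, sizeof u);
    return u;
}

/* For two non-negative floats, comparing the bit patterns as unsigned
 * integers gives the same answer as comparing the floats themselves,
 * because the biased exponent sits above the significand. */
int less_than_via_bits(float a, float b) {
    return float_bits(a) < float_bits(b);
}
```

For example, 0.5 (0x3F000000) compares below 2.0 (0x40000000) as an integer, even though its true exponent is negative; with a two’s complement exponent field that ordering would break.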


22 of 64

IEEE 754 Single-Precision Floating Point (4/4)

Single precision standard for 32-bit word. In C, float.

  bits:    31      30–23      22–0
  field:   s       exponent   significand
  width:   1 bit   8 bits     23 bits

Sign

  • 1 is negative
  • 0 is positive

Exponent

  • Bias of 127
  • Subtract 127 from exponent field to get exponent value

Significand

  • To pack more bits, mantissa leads w/implicit 1

[Summary] Normalized format: (-1)^S × (1 + significand) × 2^(exponent-127)
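The summary formula can be applied mechanically. The sketch below is one possible way to do it, assuming a 32-bit single-precision pattern with an exponent field in 1..254; decode_normalized is a made-up name, and the result is returned as a double so that the scaling by powers of two stays exact.

```c
#include <assert.h>
#include <stdint.h>

/* Evaluate (-1)^S x (1 + significand) x 2^(exponent - 127) for a
 * normalized single-precision bit pattern. Hypothetical helper. */
double decode_normalized(uint32_t bits) {
    uint32_t sign = bits >> 31;
    uint32_t exponent = (bits >> 23) & 0xFF;
    uint32_t fraction = bits & 0x7FFFFF;
    assert(exponent >= 1 && exponent <= 254);   /* 0 and 255 are reserved */

    double value = 1.0 + fraction / 8388608.0;  /* implicit 1 + field / 2^23 */
    int e = (int) exponent - 127;               /* remove the bias */
    while (e > 0) { value *= 2.0; e--; }
    while (e < 0) { value /= 2.0; e++; }
    return sign ? -value : value;
}
```

As a check against the earlier example, 0.1640625 = 1.0101two × 2^-3 has sign 0, exponent field 124, and significand field 0101000…0, which is the pattern 0x3E280000.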


23 of 64

“Father” of the Floating Point Standard

IEEE Standard 754 for Binary Floating-Point Arithmetic.

William Kahan, Professor Emeritus, UC Berkeley

On most systems, C float and double are specified by the IEEE 754 floating point format (32b float: single precision; 64b double: double precision).


24 of 64

Example, Float Step Size


25 of 64

Normalized Example

What is the decimal equivalent of the following IEEE 754 single-precision binary floating point number?

  bits:    31   30–23       22–0
  value:   1    1000 0001   111 0000 0000 0000 0000 0000

A. -7 * 2^129

B. -3.5

C. -3.75

D. 7

E. -7.5

F. Something else

(-1)^S × (1 + significand) × 2^(exponent-127)


27 of 64

[Solution] Normalized Example

What is the decimal equivalent of the following IEEE 754 single-precision binary floating point number?

  bits:    31   30–23       22–0
  value:   1    1000 0001   111 0000 0000 0000 0000 0000

  (-1)^S × (1 + significand) × 2^(exponent-127)
  = (-1)^1 × (1 + .111)two × 2^(129-127)
  = -1 × (1.111)two × 2^2
  = -111.1two
  = -7.5ten

A. -7 * 2^129

B. -3.5

C. -3.75

D. 7

E. -7.5 (correct)

F. Something else


28 of 64

Step Size with Limited Precision

What is the next representable number after y? Before y?

  bits:                 31   30–23       22–0
  field:                s    exponent    significand
  y:                    0    1000 0001   111 0000 0000 0000 0000 0000
  next float after y:   0    1000 0001   111 0000 0000 0000 0000 0001   (significand +1)

  next float after y = y + ((.0…01)two × 2^(129-127))
                     = y + (2^-23 × 2^2)
                     = y + 2^-21   (“step size”)

Because we have a fixed # of bits (precision), we cannot represent all numbers in a range.

Step size is the spacing between consecutive floats with a given exponent.

  • Bigger exponents → bigger step size.
  • Smaller exponents → smaller step size!
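The “+1 on the significand” view of step size can be tried directly. This sketch assumes float is a 32-bit IEEE 754 single and only handles positive, finite values below the maximum; next_float_up and step_above are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Next representable float above y, obtained by adding 1 to the bit pattern. */
float next_float_up(float y) {
    uint32_t bits;
    assert(sizeof y == sizeof bits);
    memcpy(&bits, &y, sizeof bits);
    assert((bits >> 31) == 0 && bits < 0x7F7FFFFFu);  /* positive, below max */
    bits += 1;
    memcpy(&y, &bits, sizeof y);
    return y;
}

/* Step size just above y: distance to the next representable float. */
float step_above(float y) {
    volatile float next = next_float_up(y);  /* volatile: keep float precision */
    return next - y;
}
```

For y = 7.5 (exponent field 129) the step is 2^-21, while for y = 1.0 (exponent field 127) it is 2^-23, in line with the bullets above.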


29 of 64

Computing in the News(?) Climate Change

INT_MAX: 2147483647

UINT_MAX: 4294967295

FLOAT_MAX: 3.40282e+38

  s | 1111 1110 | 1…1 (23 bits)

  (1.1…1)two × 2^(254-127) ≈ 3.4 × 10^38

This is the maximum normalized float rep in IEEE 754, where biased exponents are restricted to the range 1 to 254 (2^-126 to 2^127).

Exponent fields 0 and 255 are reserved for special numbers. More now!


30 of 64

Special Numbers, Overflow, and Underflow

[FOR NEXT TIME]


31 of 64

Special Numbers

Normalized numbers are only a fraction (heh) of floating point representations. For single-precision (32-bit):

Professor Kahan had clever ideas: “Waste not, want not.”

  Biased Exponent   Significand field   Object
  0                 all zeros           ±0
  0                 nonzero             ???
  1 – 254           anything            Normalized floating point
  255               all zeros           ???
  255               nonzero             ???

The fields 0 and 255 accommodate overflow, underflow, zero, infinity, and arithmetic errors.

Read reference slides for zero, infinity, and NaN!


32 of 64

Overflow and Underflow (1/2)

Because 0 and 255 are reserved exponent fields, the range of normalized single-precision floating point is:

  Largest magnitude:    s | 1111 1110 | 1…1 (23 bits)   (1.1…1)two × 2^(254-127) ≈ 3.4 × 10^38
  Smallest magnitude:   s | 0000 0001 | 0…0 (23 bits)   (1.0)two × 2^(1-127) ≈ 1.2 × 10^-38

Normalized range: [–3.4×10^38, –1.2×10^-38] and [1.2×10^-38, 3.4×10^38]

[Figure: number line marking -3.4×10^38, -1, -1.2×10^-38, 0, 1.2×10^-38, 1, 3.4×10^38]


33 of 64

Overflow and Underflow (2/2)

Because 0 and 255 are reserved exponent fields, the range of normalized single-precision floating point is:

  Largest magnitude:    s | 1111 1110 | 1…1 (23 bits)   (1.1…1)two × 2^(254-127) ≈ 3.4 × 10^38
  Smallest magnitude:   s | 0000 0001 | 0…0 (23 bits)   (1.0)two × 2^(1-127) ≈ 1.2 × 10^-38

What if a number falls outside the representable range?

  • Too large: overflow. Magnitude of value is too large to represent!
  • Too small: underflow. Magnitude of value is too small to represent!

[Figure: number line marking -3.4×10^38, -1, -1.2×10^-38, 0, 1.2×10^-38, 1, 3.4×10^38; overflow lies beyond ±3.4×10^38, underflow lies between -1.2×10^-38 and 1.2×10^-38]
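Both cases can be provoked with the limits from float.h. This is a small sketch assuming IEEE 754 single-precision float with the default rounding mode; mul_f and is_pos_inf are hypothetical helpers, and FLT_MAX and FLT_MIN are the standard macros for the largest finite and smallest normalized float.

```c
#include <assert.h>
#include <float.h>

/* Multiply and force the result back into a float. */
float mul_f(float a, float b) {
    assert(FLT_MAX_EXP == 128);   /* sketch assumes IEEE 754 single precision */
    volatile float r = a * b;     /* volatile: no extra intermediate precision */
    return r;
}

/* Anything strictly greater than FLT_MAX can only be +infinity. */
int is_pos_inf(float x) {
    return x > FLT_MAX;
}
```

Under these assumptions mul_f(FLT_MAX, 2.0f) overflows to +∞, and mul_f(FLT_MIN, FLT_MIN), whose exact value is 2^-252, underflows to 0.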


34 of 64

Denorms: Gradual Underflow (1/3)

Problem:

  • There’s a gap among representable FP numbers around zero!

  bits:                             31   30–23       22–0
  Smallest normalized number:       0    0000 0001   000…000       (1.0)two × 2^(1-127) = 2^-126
  2nd smallest normalized number:   0    0000 0001   00000…00001   (1.000…1)two × 2^(1-127) = (1 + 2^-23) × 2^-126 = 2^-126 + 2^-149

Because of the mantissa’s implicit 1, the step from 0 to the smallest normalized number (2^-126) is huge compared with the step from the smallest to the 2nd smallest (2^-149)!

[Figure: number line from -∞ to +∞ showing the gap of 2^-126 next to 0, followed by steps of 2^-149]


35 of 64

Denorms: Gradual Underflow (2/3)

Solution:

  • We still haven’t used Exponent = 0, Significand nonzero.
  • DEnormalized number:
    • no (implied) leading 1!!
    • One implicit exponent for all denorms: -126, the exponent of the smallest normalized number (2^-126)

(-1)^s × (0.x…x)two × 2^-126

  bits:    31   30–23       22–0
  value:   s    0000 0000   00000…00001


36 of 64

Denorms: Gradual Underflow (3/3)

Solution:

  • We still haven’t used Exponent = 0, Significand nonzero.
  • DEnormalized number:
    • no (implied) leading 1
    • One implicit exponent for all denorms: -126, the exponent of the smallest normalized number (2^-126)

(-1)^s × (0.x…x)two × 2^-126

  bits:                            31   30–23       22–0
  Smallest denormalized number:    0    0000 0000   000…001   = 2^-149
  2nd smallest denorm:             0    0000 0000   000…010   = 2^-148
  Largest denorm:                  0    0000 0000   111…111   = 2^-126 - 2^-149
  Smallest normalized number:      0    0000 0001   000…000   = 2^-126

[Figure: number line from -∞ to +∞; the denorms fill the gap between 0 and 2^-126 in even steps of 2^-149]
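The denorm formula can be evaluated the same way as the normalized one. The sketch below assumes a 32-bit single-precision pattern whose exponent field is 0 and hardware that does not flush denorms to zero; decode_denorm and bits_to_float are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float (assumes IEEE 754 single). */
float bits_to_float(uint32_t bits) {
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Evaluate (-1)^s x (0.x...x)two x 2^-126 for an exponent field of 0. */
double decode_denorm(uint32_t bits) {
    uint32_t exponent = (bits >> 23) & 0xFF;
    uint32_t fraction = bits & 0x7FFFFF;
    assert(exponent == 0);
    double value = fraction / 8388608.0;   /* no implicit 1: 0.xxx...x */
    for (int i = 0; i < 126; i++) {
        value /= 2.0;                      /* implicit exponent of -126 */
    }
    return (bits >> 31) ? -value : value;
}
```

In this sketch the pattern 0x00000001 decodes to 2^-149, and adding that step to the largest denorm (0x007FFFFF) lands exactly on the smallest normalized number (0x00800000).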


37 of 64

Special Numbers, Summary

The fields 0 and 255 accommodate overflow, underflow, zero, infinity, and arithmetic errors.

  Biased Exponent   Significand field   Object
  0                 all zeros           ±0
  0                 nonzero             Denormalized numbers
  1 – 254           anything            Normalized floating point
  255               all zeros           ±∞
  255               nonzero             NaNs
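The table translates almost line for line into a classifier on the two fields. This is an illustrative sketch for 32-bit single-precision patterns; the enum fp_kind and the function classify are made-up names, not a standard API.

```c
#include <assert.h>
#include <stdint.h>

enum fp_kind { KIND_ZERO, KIND_DENORM, KIND_NORMALIZED, KIND_INFINITY, KIND_NAN };

/* Classify a single-precision bit pattern using the table's two columns. */
enum fp_kind classify(uint32_t bits) {
    uint32_t exponent = (bits >> 23) & 0xFF;   /* biased exponent field */
    uint32_t fraction = bits & 0x7FFFFF;       /* significand field */
    if (exponent == 0) {
        return fraction == 0 ? KIND_ZERO : KIND_DENORM;
    }
    if (exponent == 255) {
        return fraction == 0 ? KIND_INFINITY : KIND_NAN;
    }
    return KIND_NORMALIZED;                    /* exponent field 1..254 */
}

/* A few spot checks, one per table row. */
void classify_examples(void) {
    assert(classify(0x80000000u) == KIND_ZERO);        /* -0 */
    assert(classify(0x00000001u) == KIND_DENORM);
    assert(classify(0x3F800000u) == KIND_NORMALIZED);  /* 1.0 */
    assert(classify(0xFF800000u) == KIND_INFINITY);    /* -infinity */
    assert(classify(0x7FC00000u) == KIND_NAN);
}
```

The sign bit never affects the classification, which is why ±0 and ±∞ each share one row.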

Read reference slides for zero, infinity, and NaN!


38 of 64

BTW: IEEE 754 Double-Precision Floating Point (double)

binary64: Next Multiple of Word Size (64 bits)

Double Precision (vs. Single Precision)

  • C variable declared as double
  • Exponent bias now 1023.
  • Represent numbers from ~2.0 × 10^-308 to ~2.0 × 10^308.
  • The primary advantage is greater accuracy due to larger significand.

  first word, bits:   31      30–20      19–0
  field:              s       exponent   significand
  width:              1 bit   11 bits    20 bits

  second word, bits 31–0: significand (cont’d), 32 bits (52 significand bits in total)


39 of 64

For you: IEEE-754 Floating Point Converter


40 of 64

Representing Zero

IEEE 754 represents ±0:

  • Reserve exponent value 0000 0000; signals to hardware to not implicitly add 1.
  • Keep significand all zeroes.
  • What about sign? Both cases valid! Why? (Hint: Math)

Note: Zero has no normalized representation (i.e., no leading 1).

  +0:   0 | 0000 0000 | 000…000
  -0:   1 | 0000 0000 | 000…000

[READ]


41 of 64

Representation for ±∞

In FP, divide by ±0 should produce ±∞, not overflow.

  • Why?
  • OK to do further computations with ∞
  • E.g., X/0 > Y may be a valid comparison.
  • Ask math majors

IEEE 754 represents ±∞:

  • Reserve exponent value 1111 1111.
  • And keep significand all zeroes.

  +∞:   0 | 1111 1111 | 000…000
  -∞:   1 | 1111 1111 | 000…000

[READ]


42 of 64

Representation for Not a Number

What do I get if I calculate sqrt(-4.0) or 0/0?

  • If ∞ is not an error, these shouldn’t be either!
  • Called Not a Number (NaN).
  • Exponent = 255, Significand nonzero.

Why is this useful?

  • They contaminate: op(NaN, X) = NaN
  • Hope NaNs help with debugging?
  • Can use the significand to encode/identify where errors occurred! (proprietary; not defined in standard)

  NaN:   1 | 1111 1111 | xxx…xxxx (nonzero)
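The “contaminating” behavior is easy to observe. The sketch below builds a NaN straight from the bit layout (exponent 255, nonzero significand) and assumes IEEE 754 single-precision float compiled without fast-math style options; make_nan, is_nan, and bits_to_float are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float (assumes IEEE 754 single). */
float bits_to_float(uint32_t bits) {
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Exponent field 1111 1111 with a nonzero significand field. */
float make_nan(void) {
    return bits_to_float(0x7FC00000u);
}

/* A NaN is the only value that compares unequal to itself. */
int is_nan(float x) {
    return x != x;
}
```

With these helpers, is_nan stays true after adding to or multiplying a NaN, which is the op(NaN, X) = NaN rule in action.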

[READ]


43 of 64

[reference] Other Floating Point Representations


44 of 64

Precision and Accuracy

Precision is a count of the number of bits used to represent a value.

Accuracy is the difference between the actual value of a number and its computer representation.

High precision permits high accuracy but doesn’t guarantee it.

It is possible to have high precision but low accuracy.

  • Example: float pi = 3.14;
  • pi will be represented using all 23 bits of the significand (highly precise), but it is only an approximation (not accurate).


46 of 64

Other Floating Point Representations

Quad-Precision? Yep! (128 bits) “binary128”

  • Unbelievable range, precision (accuracy)
  • 15 exponent bits, 112 significand bits

Oct-Precision? Yep! “binary256”

  • 19 exponent bits, 236 significand bits

Half-Precision? Yep! “binary16” or “fp16”

  • 1/5/10 bits (sign/exponent/significand)

Half-Precision? Yep! “bfloat16”

  • Competing with fp16
  • Same range as fp32!
  • Used for faster ML


47 of 64

Floating Point Soup

  Format     Sign    Exponent   Significand   Total
  FP32       1 bit   8 bits     23 bits       32 bits
  FP16       1 bit   5 bits     10 bits       16 bits
  BFLOAT16   1 bit   8 bits     7 bits        16 bits
  TF32       1 bit   8 bits     10 bits       19 bits


48 of 64

Who Uses What in Domain Accelerators?

[Table: rows are accelerators; columns are int4, int8, int16, fp16, bf16, fp32, tf32; an x marks each supported format. Number of formats marked per accelerator:]

  • Google TPU v1: 1
  • Google TPU v2: 1
  • Google TPU v3: 1
  • Nvidia Volta TensorCore: 3
  • Nvidia Ampere TensorCore: 7 (all)
  • Nvidia DLA: 3
  • Intel AMX: 2
  • Amazon AWS Inferentia: 3
  • Qualcomm Hexagon: 1
  • Huawei Da Vinci: 2
  • MediaTek APU 3.0: 3
  • Samsung NPU: 1
  • Tesla NPU: 1


49 of 64

Unum

Everything so far has had a fixed set of bits for Exponent and Significand.

  • What if the field widths were variable?
  • Also, add a “u-bit” to tell whether number is exact or in-between unums.
  • “Promises to be to floating point what floating point is to fixed point.”

Claims to save power!

Dr. John Gustafson


50 of 64

[Reference] Floating Point Discussion


51 of 64

Floating Point Fallacy

Let x = -1.5 × 10^38, y = 1.5 × 10^38, z = 1.0.

  • FP add associative?
    • x + (y + z) = -1.5 × 10^38 + (1.5 × 10^38 + 1.0) = -1.5 × 10^38 + (1.5 × 10^38) = 0.0
    • (x + y) + z = (-1.5 × 10^38 + 1.5 × 10^38) + 1.0 = (0.0) + 1.0 = 1.0
  • Therefore, Floating Point add is not associative!
    • Why? FP result approximates real result!
    • With bigger exponents, step size between floats gets bigger, too!
    • This example: 1.5 × 10^38 is so much larger than 1.0 that 1.5 × 10^38 + 1.0 in floating point representation rounds to 1.5 × 10^38.
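The two groupings can be compared in a few lines of C. This sketch assumes IEEE 754 single-precision float with the default rounding mode; add_f and the two sum_* functions are hypothetical names, and the volatile store is there to force each intermediate result back into float precision.

```c
#include <assert.h>
#include <float.h>

/* Add two floats and round the result to float precision. */
float add_f(float a, float b) {
    assert(FLT_MANT_DIG == 24);   /* sketch assumes IEEE 754 single precision */
    volatile float r = a + b;
    return r;
}

/* x + (y + z) */
float sum_right_first(float x, float y, float z) {
    return add_f(x, add_f(y, z));
}

/* (x + y) + z */
float sum_left_first(float x, float y, float z) {
    return add_f(add_f(x, y), z);
}
```

With x = -1.5e38f, y = 1.5e38f, z = 1.0f, the first grouping loses z inside y + z and returns 0.0, while the second cancels x and y exactly and returns 1.0.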


52 of 64

Rounding

  • When we perform math on real numbers, we have to worry about rounding to fit the result in the significand field.
  • The FP hardware carries two extra bits of precision, and then rounds to get the proper value
  • Rounding also occurs when converting:
    • double to a single precision value, or
    • floating point number to an integer

Under the hood


53 of 64

IEEE FP’s Four Rounding Modes

  • Round towards +∞
    • ALWAYS round “up”: 2.001 → 3, -2.001 → -2
  • Round towards –∞
    • ALWAYS round “down”: 1.999 → 1, -1.999 → -2
  • Truncate
    • Just drop the last bits (round towards 0)
  • (default) Unbiased. If midway, round to even
    • Normal rounding, almost: 2.4 → 2, 2.6 → 3, 2.5 → 2, 3.5 → 4
    • Round like you learned in grade school (nearest representable number)…
    • …except if the value is right on the borderline, in which case we round to the nearest EVEN number
    • Ensures fairness on calculation, balances out inaccuracies.
      • On a tie, half the time we round up; the other half of the time we round down.

Under the hood

Examples in decimal (but applies to IEEE 754 binary)
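The default round-to-even rule can also be seen in binary by converting integers just above 2^24, where floats are spaced 2 apart and every odd integer is an exact tie. This sketch only illustrates the default mode and assumes the implementation performs int-to-float conversion with IEEE 754 round-to-nearest-even (as on typical hardware); to_float is a hypothetical helper.

```c
#include <assert.h>
#include <float.h>

/* Convert an int to float and force the result into float precision. */
float to_float(int i) {
    assert(FLT_MANT_DIG == 24);   /* sketch assumes IEEE 754 single precision */
    volatile float f = (float) i;
    return f;
}
```

Under these assumptions 16777217 (2^24 + 1) rounds down to 16777216 and 16777219 (2^24 + 3) rounds up to 16777220: in both ties the neighbor with an even significand wins, so ties go down about as often as up.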


54 of 64

FP Addition

  • More difficult than with integers
  • Can’t just add significands…
  • How do we do it?
    • De-normalize to match exponents
    • Add significands together
    • Keep the matched exponent
    • Normalize (possibly changing exponent)
  • Note: If signs differ, just perform a subtract instead.
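The steps above can be sketched on raw bit fields. This is a deliberately simplified illustration, not how any particular FPU is built: it assumes two positive, normalized IEEE 754 singles whose sum does not overflow, and it truncates shifted-out bits instead of rounding. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

uint32_t float_bits(float f) {
    uint32_t u;
    assert(sizeof f == sizeof u);
    memcpy(&u, &f, sizeof u);
    return u;
}

float bits_to_float(uint32_t u) {
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Simplified FP add for positive normalized inputs (truncates, no rounding). */
float fp_add_sketch(float a, float b) {
    uint32_t ua = float_bits(a), ub = float_bits(b);
    assert((ua >> 31) == 0 && (ub >> 31) == 0);     /* positive only */
    if (ua < ub) { uint32_t t = ua; ua = ub; ub = t; }  /* ua has larger exponent */
    uint32_t ea = ua >> 23, eb = ub >> 23;
    assert(eb >= 1 && ea <= 253);                   /* normalized, no overflow */

    uint32_t ma = (ua & 0x7FFFFF) | 0x800000;       /* restore implicit 1 */
    uint32_t mb = (ub & 0x7FFFFF) | 0x800000;

    uint32_t shift = ea - eb;                       /* 1. de-normalize smaller */
    mb = (shift < 24) ? (mb >> shift) : 0;

    uint32_t sum = ma + mb;                         /* 2. add significands */
    if (sum & 0x1000000) {                          /* 4. normalize on carry */
        sum >>= 1;
        ea += 1;
    }
    return bits_to_float((ea << 23) | (sum & 0x7FFFFF));  /* 3. keep exponent */
}
```

For instance, 1.5 + 0.5 aligns 0.5 to exponent 0, adds to 10.0two, and renormalizes to 1.0two × 2^1 = 2.0. Real hardware additionally keeps extra bits so the result can be rounded rather than truncated.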

Under the hood


55 of 64

Casting floats to ints and vice versa

  • (int) floating_point_expression
    • Coerces and converts floating point to an integer. C uses truncation (the fractional part is dropped), not rounding to the nearest integer.
    • i = (int) (3.14159 * f);

  • (float) integer_expression
    • Converts integer to nearest floating point.
    • f = f + (float) i;


56 of 64

Double Casting Doesn’t Always Work

int i = …;

if (i == (int)((float) i)) {
    printf("true\n");
}

  • Will not always print “true”!
  • Most large values of integers don’t have exact floating point representations.
    • Recall: only 2^23 floats per exponent
  • What about double?

float f = …;

if (f == (float)((int) f)) {
    printf("true\n");
}

  • Will not always print “true”!
  • Small floating point numbers (<1) don’t have integer representations.
  • For other numbers, rounding errors…
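Concrete values make both failures visible. The sketch below assumes a 32-bit int and IEEE 754 float and double; the three function names are hypothetical, and the range asserts are there because converting an out-of-range floating point value to int is undefined in C.

```c
#include <assert.h>

/* Does i survive int -> float -> int? */
int int_survives_float(int i) {
    assert(i > -1000000000 && i < 1000000000);  /* keep result inside int range */
    volatile float f = (float) i;
    return i == (int) f;
}

/* Does i survive int -> double -> int? */
int int_survives_double(int i) {
    volatile double d = (double) i;
    return i == (int) d;
}

/* Does f survive float -> int -> float? */
int float_survives_int(float f) {
    assert(f > -1.0e9f && f < 1.0e9f);          /* keep value inside int range */
    volatile int i = (int) f;
    return f == (float) i;
}
```

Under these assumptions 16777217 (2^24 + 1) does not survive the trip through float but does survive through double, whose 52-bit significand field can hold any 32-bit int exactly; 0.5f and 3.75f do not survive the trip through int because the cast truncates.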


57 of 64

Now you can make float jokes too…

Saturday Morning Breakfast Cereal

www.smbc-comics.com/comic/2013-06-05

Side note: the robot is using IEEE 754 double-precision.

  (double) 0.3:                  0.29999999999999999
  (double) 0.1 + (double) 0.2:   0.30000000000000004
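The two values can be reproduced in C. This is a small sketch assuming double is IEEE 754 binary64; the function names are made up, and the volatile variables keep the compiler from evaluating the sum at a different precision.

```c
#include <assert.h>
#include <float.h>
#include <stdio.h>

/* Returns nonzero if 0.1 + 0.2 compares equal to 0.3 in double arithmetic. */
int sum_equals_three_tenths(void) {
    assert(DBL_MANT_DIG == 53);   /* sketch assumes IEEE 754 double precision */
    volatile double a = 0.1, b = 0.2;
    volatile double sum = a + b;
    return sum == 0.3;
}

/* Print both values with 17 digits after the decimal point. */
void print_both(void) {
    volatile double a = 0.1, b = 0.2;
    printf("%.17f\n", 0.3);
    printf("%.17f\n", a + b);
}
```

On an IEEE 754 system the comparison is false, and print_both should show the two values from the slide, since neither 0.1, 0.2, nor 0.3 has an exact binary representation.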


58 of 64

[Reference] More Examples


59 of 64

Ex 1: Convert Binary Floating Point to Decimal

  bits:    31     30–23       22–0
  field:   sign   exponent    significand
  value:   0      0110 1000   101 0101 0100 0011 0100 0010

  • Sign: 0 → positive
  • Exponent: 0110 1000two = 104ten; bias adjustment: 104 - 127 = -23
  • Significand:
    1 + 1×2^-1 + 0×2^-2 + 1×2^-3 + 0×2^-4 + 1×2^-5 + …
    = 1 + 2^-1 + 2^-3 + 2^-5 + 2^-7 + 2^-9 + 2^-14 + 2^-15 + 2^-17 + 2^-22
    = 1.0 + 0.666115

Result: 1.666115ten × 2^-23 ≈ 1.986 × 10^-7 (about 2/10,000,000)


60 of 64

Ex 2: Convert Decimal to Binary Floating Point

Convert -2.340625 × 10^1:

  1. Denormalize: -23.40625
  2. Convert integer part: 23 = 16 + ( 7 = 4 + ( 3 = 2 + ( 1 ) ) ) = 10111two
  3. Convert fraction part: .40625 = .25 + ( .15625 = .125 + ( .03125 ) ) = .01101two
  4. Put parts together + normalize: 10111.01101 = 1.011101101 × 2^4
  5. Convert exponent: 127 + 4 = 10000011two

  bits:    31   30–23       22–0
  value:   1    1000 0011   011 1011 0100 0000 0000 0000
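The hand conversion can be checked by reading the fields back out of a float. This sketch assumes float is a 32-bit IEEE 754 single; float_bits and the three field accessors are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the raw 32-bit pattern out of a float (assumes IEEE 754 single). */
uint32_t float_bits(float f) {
    uint32_t u;
    assert(sizeof f == sizeof u);
    memcpy(&u, &f, sizeof u);
    return u;
}

uint32_t sign_field(float f)        { return float_bits(f) >> 31; }
uint32_t exponent_field(float f)    { return (float_bits(f) >> 23) & 0xFF; }
uint32_t significand_field(float f) { return float_bits(f) & 0x7FFFFF; }
```

For -23.40625f the fields come out as sign 1, exponent 131 (1000 0011), and significand 0x3B4000 (011 1011 0100 0000 0000 0000), i.e. the full pattern 0xC1BB4000.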


61 of 64

Ex 3: Represent 1/3

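  • 1/3
    • = 0.33333…ten
    • = 0.25 + 0.0625 + 0.015625 + 0.00390625 + …
    • = 1/4 + 1/16 + 1/64 + 1/256 + …
    • = 2^-2 + 2^-4 + 2^-6 + 2^-8 + …
    • = 0.0101010101…two × 2^0
    • = 1.0101010101…two × 2^-2
  • Sign: 0
  • Exponent = -2 + 127 = 125 = 01111101two
  • Significand = 0101010101…

  bits:    31   30–23       22–0
  value:   0    0111 1101   010 1010 1010 1010 1010 1010

Note: the significand shown is the repeating pattern cut off at 23 bits. Under the default round-to-nearest mode the last bit rounds up, so hardware stores …1010 1011.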


62 of 64

Understanding the Significand (1/2)

  • Method 1 (Fractions):
    • In decimal:
      • 0.340ten
      • ⇒ 340ten/1000ten
      • ⇒ 34ten/100ten
    • In binary:
      • 0.110two
      • ⇒ 110two/1000two = 6ten/8ten
      • ⇒ 11two/100two = 3ten/4ten
  • Advantage: less purely numerical, more thought oriented; this method usually helps people understand the meaning of the significand better


63 of 64

Understanding the Significand (2/2)

  • Method 2 (Place Values):
    • Convert from scientific notation
    • In decimal:
      • 1.6732 = (1×10^0) + (6×10^-1) + (7×10^-2) + (3×10^-3) + (2×10^-4)
    • In binary:
      • 1.1001 = (1×2^0) + (1×2^-1) + (0×2^-2) + (0×2^-3) + (1×2^-4)
    • Interpretation of value in each position extends beyond the decimal/binary point
    • Advantage: good for quickly calculating significand value; use this method for translating FP numbers


64 of 64

Fractional Powers of 2

  i    2^-i                 fraction
  0    1.0                  1
  1    0.5                  1/2
  2    0.25                 1/4
  3    0.125                1/8
  4    0.0625               1/16
  5    0.03125              1/32
  6    0.015625
  7    0.0078125
  8    0.00390625
  9    0.001953125
  10   0.0009765625
  11   0.00048828125
  12   0.000244140625
  13   0.0001220703125
  14   0.00006103515625
  15   0.000030517578125

Mark Lu’s “Binary Float Displayer”
