1 of 64

IEEE 754 Floating Point

PDF: link

EECS department undergrad survey open through 3/3 (join EECS 101 Ed)

CS61C

Great Ideas

in

Computer Architecture

(a.k.a. Machine Structures)

cs61c.org

Assistant

Teaching Professor

Lisa Yan

Yan, SP25

07-Floating Point (1)

2 of 64

Generics wrap-up: Pointer Arithmetic

Generics wrap-up
Strawman: Fixed Point
Floating Point
Example, Float Step Size
Special Numbers, Overflow, and Underflow
[Reference]

Other Floating Point Reps
Floats under the hood
More Examples

Agenda

Yan, SP25

07-Floating Point (2)

3 of 64

How to use swap?

Use the swap function to swap the first and last elements in an array:

void swap(void *ptr1, void *ptr2, size_t nbytes)

void *arr; // pointer to array arr

size_t nelems, nbytes; // # elements in array, element size

… /* initialize */

swap(arr, ??? , nbytes);

arr

0x100

nelems

5

nbytes

4

A. arr + nelems - 1

B. arr + (nelems - 1)*nbytes

C. (char *) arr + (nelems - 1)*nbytes

D. (char *) (arr + (nelems - 1)*nbytes)

E. Something else

1	2	3	4	5

0x100

0x110

0x114

0x10C

0x108

0x104

1 5

5 1

Yan, SP25

07-Floating Point (3)

4 of 64

Generic Swap in Action

swap(arr, __________________________________, nbytes);

1

2

3

4

5

6

1	2	3	4	5

0x100

0x110

0x114

0x10C

0x108

0x104

1 5

5 1

void swap(void *ptr1,� void *ptr2,� size_t nbytes) {

char temp[nbytes];

memcpy(temp, ptr1, nbytes);

memcpy(ptr1, ptr2, nbytes);

memcpy(ptr2, temp, nbytes);

}

temp

??	??	??	??

Pointer arithmetic in generics must be bytewise arithmetic:

1. Cast void * pointer to char *.

2. Pointer arithmetic is then effectively byte-wise!

nbytes

0x110

ptr1

ptr2

0x100

4

1

(char *) arr + (nelems - 1)*nbytes

Yan, SP25

07-Floating Point (4)

5 of 64

(change of gears)

Yan, SP25

07-Floating Point (5)

6 of 64

Great Idea #1: Abstraction (Levels of Representation/Interpretation)

Week 1	–	Welcome	Number 1: Bit ops, Integer num rep
Week 2	C 1: Intro	C 2: Pointers, Arrays, Strings	C 3: Memory (Stack, Heap)
Week 3	C 4: Generics, Function Pointers	Number 2: Floating Point	RISC-V 1: Intro to Assembly

Learning C means learning binary abstraction! Let’s revisit number representation, but now for numbers like 4.25, 5.17, ….

Yan, SP25

07-Floating Point (6)

7 of 64

Quote of the Day

“95% of the folks out there are completely clueless about floating-point.”

– James Gosling, 1998-02-28

Sun Fellow
Java Inventor

Yan, SP25

07-Floating Point (7)

8 of 64

Strawman:�Fixed Point

[Reference] FA22 videos: link

Generics wrap-up
Strawman: Fixed Point
Floating Point
Example, Float Step Size
Special Numbers, Overflow, and Underflow
[Reference]

Other Floating Point Reps
Floats under the hood
More Examples

Agenda

Yan, SP25

07-Floating Point (8)

9 of 64

Review of Integer Number Representations

Computers made to process numbers, represented as bits.

What can we represent in N bits?

2N things, and no more! We’ve covered a few conventions…
Unsigned integers:

Range: 0 to 2^N - 1
N=32: 2^N - 1 = 4,294,967,295)

Signed Integers (Two’s Complement)

Range: -2^(N-1) to 2^(N-1) - 1
N=32: 2^(N-1) - 1 = 2,147,483,647)

Yan, SP25

07-Floating Point (9)

10 of 64

What about other numbers?

Very large numbers (sec/millennium)

31,556,926,010 = 3.155692610 × 10¹⁰

Very small numbers? (Bohr radius)

0.000000000052917710 = 5.2917710 × 10^-11

#s with both integer & fractional parts?

2.625

Let’s start with representing�this number…

Scientific Notation

Yan, SP25

07-Floating Point (10)

11 of 64

“Fixed Point” Binary Representation

“Binary Point,” like decimal point, signifies the boundary between integer and fractional parts.

Example 6-bit representation, positive numbers only:

x

2¹

x

2⁰

y

2^-1

y

2^-2

y

2^-3

y

2^-4

xxyyyy

.

Yan, SP25

07-Floating Point (11)

12 of 64

“Fixed Point” Binary Representation

“Binary Point,” like decimal point, signifies the boundary between integer and fractional parts.

Example 6-bit representation, positive numbers only:

With the above 6-bit ”fixed binary point” representation:

Range: 0 to 3.9375

101010

1

2¹

0

2⁰

1

2^-1

0

2^-2

1

2^-3

0

2^-4

1 × 2¹ + 1 × 2^-1+ 1 × 2^-3

= 2 + 0.5 + 0.125

= 2.625_ten

11.1111_two = 4 – 1 × 2^-4

.

Yan, SP25

07-Floating Point (12)

13 of 64

Arithmetic with Fixed Point

01.1000 1.5₁₀

+ 00.1000 0.5₁₀

10.0000 2.0₁₀

Addition is straightforward:

Multiplication is a bit more complex:

01.1000 1.5₁₀

× 00.1000 0.5₁₀

00 0000

000 000

0000 00

01100 0

000000

0001100 0000 → 00.1100

Need to remember where point is…

.

Yan, SP25

07-Floating Point (13)

14 of 64

What about other numbers?

Very large numbers (sec/millennium)

31,556,926,010 = 3.155692610 × 10¹⁰

Very small numbers? (Bohr radius)

0.000000000052917710 = 5.2917710 × 10^-11

#s with both integer & fractional parts?

2.625

✅

⚠️

11···0.0

0.0···11

34 bits

58 bits

⚠️

To store all these numbers, we’d need a fixed-point rep with at least 92 bits. There must be a better way!

Yan, SP25

07-Floating Point (14)

15 of 64

Floating Point

Generics wrap-up
Strawman: Fixed Point
Floating Point
Example, Float Step Size
Special Numbers, Overflow, and Underflow
[Reference]

Other Floating Point Reps
Floats under the hood
More Examples

Agenda

Yan, SP25

07-Floating Point (15)

16 of 64

Floating Point

A “floating binary point” most effectively uses of our limited bits (and thus more accuracy in our number representation).

Example:

Suppose we wanted to represent 0.1640625_ten.
Binary representation: …000000.001010100000…

The binary point is stored separately from the significant bits, so very large and small numbers can be represented.

2. Keep track of the binary point 2 places to the left of the MSB (“exponent field”).

1. Store these “significant bits.”

Yan, SP25

07-Floating Point (16)

17 of 64

Enter Scientific Notation (in Decimal)

“Normalized form”: no leadings 0s (exactly one nonzero digit to left of point).

To represent 3/1,000,000,000:

Normalized: 3.0 x 10^-9
Not normalized: 0.3 x 10^-8, 30.0 x 10^-10

1.640625_ten × 10^-1

radix (base 10)

decimal point

mantissa

exponent

Yan, SP25

07-Floating Point (17)

18 of 64

Enter Scientific Notation (in Binary)

“Normalized form”: no leadings 0s (exactly one nonzero digit to left of point).

To represent 3/1,000,000,000:

Normalized: 3.0 x 10^-9
Not normalized: 0.3 x 10^-8, 30.0 x 10^-10

“Floating point”: Computer arithmetic that supports a binary normalized form.

Represents numbers where the binary point is not fixed (as opposed to integers).
C variable types: float, double

1.640625_ten × 10^-1

decimal point

radix (base 10)

mantissa

exponent

1.0101_two × 2^-3

radix (base 2)

“binary point”

Note: Normalized binary representation always leads with a 1! (why?)

Yan, SP25

07-Floating Point (18)

19 of 64

IEEE 754 Single-Precision Floating Point (1/4)

Single precision standard for 32-bit word.In C, float.

Each field is NOT interpreted as two’s complement!

Instead, designed to maximize accuracy of representing many numbers with a limited precision of 32 bits.

Precision is a count of the number of bits used to represent a value.
Accuracy is the difference between the actual value of a number and its computer representation.

s	exponent						significand

1 bit

8 bits

23 bits

31

30

23

22

0

±1.xxx···x_two× 2^yyy···y^two

(normalized format)

Yan, SP25

07-Floating Point (19)

20 of 64

IEEE 754 Single-Precision Floating Point (2/4)

s	exponent						significand

1 bit

8 bits

23 bits

31

30

23

22

0

±1.xxx···x_two× 2^yyy···y^two

Sign bit:

1 is negative
0 is positive

Significand field:

Mantissa = 1 + 23-bit significand field
In normalized, mantissa always leads with 1
Therefore, to pack more bits, IEEE 754:

Treat mantissa leading 1 as implicit
Significand explicitly represents other bits!
0 < significand field < 1

(normalized format)

(-1)^S

(1 + significand)

Yan, SP25

07-Floating Point (20)

21 of 64

IEEE 754 Single-Precision Floating Point (3/4)

s	exponent						significand

1 bit

8 bits

23 bits

31

30

23

22

0

±1.xxx···x_two× 2^yyy···y^two

Exponent field uses “bias notation”:

Bias of 127: Subtract 127 from exponent field to get exponent value.
Designers wanted FP numbers to be used even without specialized FP hardware, e.g., sort records with FP numbers using integer compares.
Idea: for the same sign, bigger exponent field should represent bigger numbers.

2’s complement poses a problem (because negative numbers look bigger)
We’ll see numbers ordered EXACTLY as in sign-magnitude:

00···00 -> 11···11�0 to +MAX -> -0 to -MAX

2⁽^exponent^-127)

Yan, SP25

07-Floating Point (21)

22 of 64

IEEE 754 Single-Precision Floating Point (4/4)

Single precision standard for 32-bit word. In C, float.

s	exponent						significand

1 bit

8 bits

23 bits

31

30

23

22

0

±1.xxx···x_two× 2^yyy···y^two

Sign

1 is negative
0 is positive

Exponent

Bias of 127
Subtract 127 from exponent field to get exponent value

Significand

To pack more bits, mantissa leads w/implicit 1

(-1)^S × (1 + significand) × 2⁽^exponent^-127)

[Summary]

(normalized format)

Yan, SP25

07-Floating Point (22)

23 of 64

“Father” of the Floating Point Standard

IEEE Standard 754 for Binary Floating-Point Arithmetic.

William Kahan�Professor Emeritus�UC Berkeley

www.cs.berkeley.edu/~wkahan/ieee754status/754story.html

On most systems, C float and double are specified by IEEE 754 floating point format (32b float: single-precision, 64b double precision)

Yan, SP25

07-Floating Point (23)

24 of 64

Example,�Float Step Size

Generics wrap-up
Strawman: Fixed Point
Floating Point
Example, Float Step Size
Special Numbers, Overflow, and Underflow
[Reference]

Other Floating Point Reps
Floats under the hood
More Examples

Agenda

Yan, SP25

07-Floating Point (24)

25 of 64

Normalized Example

What is the decimal equivalent of the following IEEE 754 single-precision binary floating point number?

31	30					23	22								0
1	1000 0001						111 0000 0000 0000 0000 0000

A. -7 * 2^129

B. -3.5

C. -3.75

D. 7

E. -7.5

F. Something else

(-1)^S × (1 + significand) × 2⁽^exponent^-127)

Yan, SP25

07-Floating Point (25)

26 of 64

Yan, SP25

07-Floating Point (26)

27 of 64

[Solution] Normalized Example

What is the decimal equivalent of the following IEEE 754 single-precision binary floating point number?

31	30					23	22								0
1	1000 0001						111 0000 0000 0000 0000 0000

(-1)^S × (1 + significand) × 2⁽^exponent^-127)

(-1)¹ × (1 + .111)_two × 2⁽¹²⁹^-127)

-1 × (1.111)_two × 2⁽²⁾

-111.1_two

-7.5_ten

A. -7 * 2^129

B. -3.5

C. -3.75

D. 7

E. -7.5

F. Something else

-1.111

Yan, SP25

07-Floating Point (27)

28 of 64

Step Size with Limited Precision

What is the next representable number after y? Before y?

0	1000 0001						111 0000 0000 0000 0000 0001

exponent

s

significand

y

+1

next float after y

y + ((.0…01)_two × 2⁽¹²⁹^-127))

y + (2^-23 × 2⁽²⁾)

y + 2^-21

“step size”

Because we have a fixed # of bits (precision), we cannot represent all numbers in a range.

Step size is the spacing between consecutive floats with a given exponent.

Bigger exponents → bigger step size.
Smaller exponents → smaller step size!

31	30					23	22								0
0	1000 0001						111 0000 0000 0000 0000 0000

Yan, SP25

07-Floating Point (28)

29 of 64

Computing in the News(?) Climate Change

INT_MAX: 2147483647

UINT_MAX: 4294967295

FLOAT_MAX: 3.40282e+38

(1.1…1) × 2⁽²⁵⁴^-127) ≈ 3.4 × 10³⁸

This is the maximum normalized float rep in IEEE 754, where biased exponents are restricted to the range 1 to 254 (2^-126 to 2¹²⁷).

s	1111 1110	1…1 (23 bits)

Exponent fields 0, 255 are reserved for special numbers. More now!

Yan, SP25

07-Floating Point (29)

30 of 64

Special Numbers, Overflow, and Underflow

[FOR NEXT TIME]

Generics wrap-up
Strawman: Fixed Point
Floating Point
Example, Float Step Size
Special Numbers, Overflow, and Underflow
[Reference]

Other Floating Point Reps
Floats under the hood
More Examples

Agenda

Yan, SP25

07-Floating Point (30)

31 of 64

Special Numbers

Normalized numbers are only a fraction (heh) of floating point representations. For single-precision (32-bit):

Professor Kahan had clever ideas;�“Waste not, want not.”




1 – 254	anything	Normalized floating point

Biased Exponent	Significand field	Object
0	all zeros	±0
0	nonzero	???
1 – 254	anything	Normalized floating point
255	all zeros	???
255	nonzero	???

The fields 0 and 255 accommodate overflow, underflow, zero, infinity, and arithmetic errors.

Read reference slides for zero, infinity, and NaN!

Yan, SP25

07-Floating Point (31)

32 of 64

Overflow and Underflow (1/2)

Because 0 and 255 are reserved exponent fields,�the range of normalized single-precision floating point is:

Largest magnitude:

Smallest magnitude:

(1.1…1) × 2⁽²⁵⁴^-127) ≈ 3.4 × 10³⁸

(1.0) × 2⁽¹^-127) ≈ 1.2 × 10^-38

Normalized range: [–3.4×10³⁸, –1.2×10^-38] and [1.2×10^-38, 3.4×10³⁸]

s	1111 1110	1…1 (23 bits)

s	0000 0001	0…0 (23 bits)

0

-1

1

-3.4×10³⁸

3.4×10³⁸

-1.2×10^-38

1.2×10^-38

Yan, SP25

07-Floating Point (32)

33 of 64

Overflow and Underflow (2/2)

Because 0 and 255 are reserved exponent fields,�the range of normalized single-precision floating point is:

Largest magnitude:

Smallest magnitude:

(1.1…1) × 2⁽²⁵⁴^-127) ≈ 3.4 × 10³⁸

(1.0) × 2⁽¹^-127) ≈ 1.2 × 10^-38

What if a number falls outside the representable range?

Too large: overflow. Magnitude of value is too large to represent!
Too small: underflow. Magnitude of value is too small to represent!

s	1111 1110	1…1 (23 bits)

s	0000 0001	0…0 (23 bits)

0

-1.2×10^-38

1.2×10^-38

-1

1

-3.4×10³⁸

3.4×10³⁸

underflow

overflow

Yan, SP25

07-Floating Point (33)

34 of 64

Denorms: Gradual Underflow (1/3)

Problem:

There’s a gap among representable FP numbers around zero!

Smallest normalized number:

2^nd smallest normalized number:

(1.0) × 2⁽¹^-127) = 2^-126

(1.000……1₂) × 2⁽¹^-127)

= (1 + 2^-23) x 2^-126 = 2^-126 + 2^-149

0	0000 0001	000…000

31	30	23	22	10
0	0000 0001		00000…00001

Because of the mantissa’s implicit 1, there is a huge step size between 0 and smallest vs. smallest and 2nd smallest!

2^-149

2^-126

-∞

+∞

0

Yan, SP25

07-Floating Point (34)

35 of 64

Denorms: Gradual Underflow (2/3)

Solution:

We still haven’t used Exponent = 0, Significand nonzero.
DEnormalized number:

no (implied) leading 1!!
One implicit exponent for all denorms:�the smallest norm step size, 2^-126

(-1)^s × (0.x……x₂) × 2^-126

31	30	23	22	10
s	0000 0000		00000…00001

Yan, SP25

07-Floating Point (35)

36 of 64

Denorms: Gradual Underflow (3/3)

Solution:

We still haven’t used Exponent = 0, Significand nonzero.
DEnormalized number:

no (implied) leading 1
One implicit exponent for all denorms: �the smallest norm step size, 2^-126

(-1)^s × (0.x……x₂) × 2^-126

31	30	23	22	10
s	0000 0000		00000…00001

0	0000 0000	000…001

0	0000 0001	000…000

-∞

+∞

0	0000 0000	000…010

0	0000 0000	111…111

0

2^-126

Smallest normalized number:

Smallest denormalized number:

2^-149

2^nd smallest denorm

Largest denorm

Yan, SP25

07-Floating Point (36)

37 of 64

Special Numbers, Summary

The fields 0 and 255 accommodate overflow, underflow, zero, infinity, and arithmetic errors.

Biased Exponent	Significand field	Object
0	all zeros	±0
0	nonzero	Denormalized numbers
1 – 254	anything	Normalized floating point
255	all zeros	±∞
255	nonzero	NaNs

Read reference slides for zero, infinity, and NaN!

Yan, SP25

07-Floating Point (37)

38 of 64

BTW: IEEE 754 Double-Precision Floating Point (double)

binary64: Next Multiple of Word Size (64 bits)

Double Precision (vs. Single Precision)

C variable declared as double
Exponent bias now 1023.
Represent numbers from ~2.0 × 10^-308 to ~2.0 × 10³⁰⁸.
The primary advantage is greater accuracy due to larger significand.

31	30	23	22	0
s	0000 0001		Significand
1 bit	11 bits		20 bits

Significand (cont’d)

Yan, SP25

07-Floating Point (38)

39 of 64

For you: IEEE-754 Floating Point Converter

https://www.h-schmidt.net/FloatConverter/IEEE754.html

Yan, SP25

07-Floating Point (39)

40 of 64

Representing Zero

IEEE 754 represents ±0:

Reserve exponent value 0000 0000; �signals to hardware to not implicitly add 1.
Keep significand all zeroes.
What about sign? Both cases valid! Why? (Hint: Math)

Note: Zero has no normalized representation (i.e., no leading 1).

0	0000 0000	000…000

1	0000 0000	000…000

+0:

-0:

[READ]

Yan, SP25

07-Floating Point (40)

41 of 64

Representation for ±∞

In FP, divide by ±0 should produce ±∞, not overflow.

Why?
OK to do further computations with ∞
E.g., X/0 > Y may be a valid comparison.
Ask math majors

IEEE 754 represents ±∞:

Reserve exponent value 1111 1111.
And keep significand all zeroes.

0	1111 1111	111…111

1	1111 1111	111…111

+∞:

-∞:

[READ]

Yan, SP25

07-Floating Point (41)

42 of 64

Representation for Not a Number

What do I get if I calculate sqrt(-4.0) or 0/0?

If ∞ is not an error, these shouldn’t be either!
Called Not a Number (NaN).
Exponent = 255, Significand nonzero.

Why is this useful?

They contaminate: op(NaN, X) = NaN
Hope NaNs help with debugging?
Can use the significand to encode/identify where errors occurred! (proprietary; not defined in standard)

1	1111 1111	xxx…xxxx

[READ]

Yan, SP25

07-Floating Point (42)

43 of 64

[reference] Other Floating Point Representations

[Reference] FA22 videos: link

Generics wrap-up
Strawman: Fixed Point
Floating Point
Example, Float Step Size
Special Numbers, Overflow, and Underflow
[Reference]

Other Floating Point Reps
Floats under the hood
More Examples

Agenda

Yan, SP25

07-Floating Point (43)

44 of 64

Precision and Accuracy

Precision is a count of the number of bits used to represent a value.

Accuracy is the difference between the actual value of a number and its computer representation.

High precision permits high accuracy but doesn’t guarantee it.

It is possible to have high precision but low accuracy.

Example: float pi = 3.14;
pi will be represented using all 23 bits of the significand (highly precise),�but it is only an approximation (not accurate).

Yan, SP25

07-Floating Point (44)

45 of 64

IEEE 754 Double-Precision Floating Point (double)

binary64: Next Multiple of Word Size (64 bits)

Double Precision (vs. Single Precision)

C variable declared as double
Exponent bias now 1023.
Represent numbers from ~2.0 × 10^-308 to ~2.0 × 10³⁰⁸.
The primary advantage is greater accuracy due to larger significand.

31	30	23	22	0
s	0000 0001		Significand
1 bit	11 bits		20 bits

Significand (cont’d)

Yan, SP25

07-Floating Point (45)

46 of 64

Other Floating Point Representations

Quad-Precision? Yep! (128 bits) “binary128”

Unbelievable range, precision (accuracy)
15 exponent bits, 112 significand bits

Oct-Precision? Yep! “binary256”

19 exponent bits, 236 significant bits

Half-Precision? Yep! “binary16” or “fp16”

1/5/10 bits

Half-Precision? Yep! “bfloat16”

Competing with fp16
Same range as fp32!
Used for faster ML

https://cloud.google.com/tpu/docs/bfloat16

https://en.wikipedia.org/wiki/Floating-point_arithmetic

Yan, SP25

07-Floating Point (46)

47 of 64

Floating Point Soup

31	30	23	22
S	Exponent		Significand
1 bit	8 bits		23 bits

15	14	10	9	0
S	Exponent		Significand
1 bit	5 bits

15	14	7	6	0
S	Exponent		Significand
1 bit	8 bits		7 bits

18	17	10	9	0
s	0000 0001		Significand
1 bit	8 bits		10 bits

FP32

FP16

BFLOAT16

TF32

Yan, SP25

07-Floating Point (47)

48 of 64

Who Uses What in Domain Accelerators?

Accelerator	int4	int8	int16	fp16	bf16	fp32	tf32
Google TPU v1		x
Google TPU v2					x
Google TPU v3					x
Nvidia Volta TensorCore	x	x		x
Nvidia Ampere TensorCore	x	x	x	x	x	x	x
Nvidia DLA		x	x	x
Intel AMX		x			x
Amazon AWS Inferentia		x		x	x
Qualcomm Hexagon		x
Huawei Da Vinci		x		x
MediaTek APU 3.0		x	x	x
Samsung NPU		x
Tesla NPU		x

Yan, SP25

07-Floating Point (48)

49 of 64

Unum

Everything so far has had a fixed set of bits for Exponent and Significand.

What if the field widths were variable?
Also, add a “u-bit” to tell whether number is exact or in-between unums.
“Promises to be to floating point what floating point is to fixed point.”

Claims to save power!

https://en.wikipedia.org/wiki/Unum_(number_format)

Dr. John Gustafson

Yan, SP25

07-Floating Point (49)

50 of 64

[Reference] Floating Point Discussion

[Reference] FA22 videos: link

Generics wrap-up
Strawman: Fixed Point
Floating Point
Example, Float Step Size
Special Numbers, Overflow, and Underflow
[Reference]

Other Floating Point Reps
Floats under the hood
More Examples

Agenda

Yan, SP25

07-Floating Point (50)

51 of 64

Floating Point Fallacy

FP add associative?

x + (y + z) = –1.5 x 10³⁸ + (1.5 x 10³⁸ + 1.0)� = –1.5 x 10³⁸ + (1.5 x 10³⁸) = 0.0
(x + y) + z = (–1.5 x 10³⁸ + 1.5 x 10³⁸) + 1.0� = (0.0) + 1.0 = 1.0

Therefore, Floating Point add is not associative!

Why? FP result approximates real result!
With bigger exponents, step size between floats gets bigger, too!
This example: 1.5 x 10³⁸ is so much larger than 1.0 that 1.5 x 10³⁸ + 1.0 in floating point representation rounds to 1.5 x 10³⁸.

x

– 1.5 x 10³⁸

y

1.5 x 10³⁸

z

1.0

Yan, SP25

07-Floating Point (51)

52 of 64

Rounding

When we perform math on real numbers, we have to worry about rounding to fit the result in the significand field.
The FP hardware carries two extra bits of precision, and then rounds to get the proper value
Rounding also occurs when converting:

double to a single precision value, or
floating point number to an integer

Under the hood

Yan, SP25

07-Floating Point (52)

53 of 64

IEEE FP’s Four Rounding Modes

Round towards +∞

ALWAYS round “up”: 2.001 → 3, -2.001 → -2

Round towards –∞

ALWAYS round “down”: 1.999 → 1, -1.999 → -2

Truncate

Just drop the last bits (round towards 0)

(default) Unbiased. If midway, round to even

Normal rounding, almost: 2.4 → 2, 2.6 → 3, 2.5 → 2, 3.5 → 4
Round like you learned in grade school (nearest representable number)…
…except if the value is right on the borderline, in which case we round to the nearest EVEN number
Ensures fairness on calculation, balances out inaccuracies.

On a tie, half the time we round up; the other half time we round down.

Under the hood

Examples in decimal

(but applies to IEEE754 binary)

Yan, SP25

07-Floating Point (53)

54 of 64

FP Addition

More difficult than with integers
Can’t just add significands…
How do we do it?

De-normalize to match exponents
Add significands together
Keep the matched exponent
Normalize (possibly changing exponent)

Note: If signs differ, just perform a subtract instead.

Under the hood

Yan, SP25

07-Floating Point (54)

55 of 64

Casting floats to ints and vice versa

(int) floating_point_expression

Coerces and converts floating point to nearest integer�(C uses truncation).
i = (int) (3.14159 * f);

(float) integer_expression

Converts integer to nearest floating point.
f = f + (float) i;

Yan, SP25

07-Floating Point (55)

56 of 64

Double Casting Doesn’t Always Work

Will not always print “true”!
Most large values of integers don’t have exact floating point representations.

Recall: only 2²³ floats per exponent

What about double?

Will not always print “true”!
Small floating point numbers (<1) don’t have integer representations.
For other numbers, rounding errors…

int i = …;

if (i == (int)((float) i)) {

printf("true\n");

}

float f = …;

if (f == (float)((int) f)) {

printf("true\n");

}

Yan, SP25

07-Floating Point (56)

57 of 64

Now you can make float jokes too…

Saturday Morning Breakfast Comics

www.smbc-comics.com/comic/2013-06-05

Side note: the robot is using IEEE 754 double-precision.

(double) 0.3:

0.29999999999999999

(double) 0.1 + (double) 0.2:

0.30000000000000004

Yan, SP25

07-Floating Point (57)

58 of 64

[Reference] More Examples

[Reference] FA22 videos: link

Generics wrap-up
Strawman: Fixed Point
Floating Point
Example, Float Step Size
Special Numbers, Overflow, and Underflow
[Reference]

Other Floating Point Reps
Floats under the hood
More Examples

Agenda

Yan, SP25

07-Floating Point (58)

59 of 64

Ex 1: Convert Binary Floating Point to Decimal

31	30	23	22	0
0	0110 1000		101 0101 0100 0011 0100 0010
	exponent		significand

1 + 1x2^-1+ 0x2^-2 + 1x2^-3 + 0x2^-4 + 1x2^-5+ … �= 1+2^-1+2^-3+2^-5+2^-7+2^-9+2^-14+2^-15+2^-17+2^-22�= 1.0 + 0.666115

0 → positive

0110 1000_two = 104_ten

Bias adjustment:

104 - 127 = -23

1.666115_ten*2^-23≈ 1.986*10^-7

(about 2/10,000,000)

Yan, SP25

07-Floating Point (59)

60 of 64

Ex 2: Convert Decimal to Binary Floating Point

Denormalize.

-23.40625

Convert integer part.�23 = 16 + ( 7 = 4 + ( 3 = 2 + ( 1 ) ) )
Convert fraction part.

.40625 = .25 + ( .15625 = .125 + ( .03125 ) )

Put parts together + normalize.

10111.01101

Convert exponent.

127 + 4

-23.40625

= 10111₂

= .01101₂

= 1.011101101 x 2⁴

= 10000011₂

-2.340625 x 10¹

31	30					23	22								0
1	1000 0011						011 1011 0100 0000 0000 0000

Yan, SP25

07-Floating Point (60)

61 of 64

Ex 3: Represent 1/3

⅓

= 0.33333…₁₀
= 0.25 + 0.0625 + 0.015625 + 0.00390625 + …
= 1/4 + 1/16 + 1/64 + 1/256 + …
= 2^-2 + 2^-4 + 2^-6 + 2^-8 + …
= 0.0101010101…₂ * 2⁰
= 1.0101010101…₂ * 2^-2

Sign: 0
Exponent = -2 + 127 = 125 = 01111101
Significand = 0101010101…

31	30					23	22								0
0	01111101						010 1010 1010 1010 1010 1010

Yan, SP25

07-Floating Point (61)

62 of 64

Understanding the Significand (1/2)

Method 1 (Fractions):

In decimal:

0.340₁₀
⇒ 340₁₀/1000₁₀
⇒ 34₁₀/100₁₀

In binary:

0.110₂
⇒ 110₂/1000₂ = 6₁₀/8₁₀
⇒ 11₂/100₂ = 3₁₀/4₁₀

Advantage: less purely numerical, more thought oriented; this method usually helps people understand the meaning of the significand better

Yan, SP25

07-Floating Point (62)

63 of 64

Understanding the Significand (2/2)

Method 2 (Place Values):

Convert from scientific notation
In decimal:

1.6732 = (1x10⁰) + (6x10^-1) + (7x10^-2) + (3x10^-3) + (2x10^-4)

In binary:

1.1001 = (1x2⁰) + (1x2^-1) + (0x2^-2) + (0x2^-3) + (1x2^-4)

Interpretation of value in each position extends beyond the decimal/binary point
Advantage: good for quickly calculating significand value; use this method for translating FP numbers

Yan, SP25

07-Floating Point (63)

64 of 64

Fractional Powers of 2

0 1.0 1

0.5 1/2
0.25 1/4
0.125 1/8
0.0625 1/16
0.03125 1/32
0.015625
0.0078125
0.00390625
0.001953125
0.0009765625
0.00048828125
0.000244140625
0.0001220703125
0.00006103515625
0.000030517578125

Mark Lu’s “Binary Float Displayer”

Yan, SP25

07-Floating Point (64)