IEEE 754 Floating Point
PDF: link
EECS department undergrad survey open through 3/3 (join EECS 101 Ed)
CS61C: Great Ideas in Computer Architecture (a.k.a. Machine Structures)
cs61c.org
Assistant Teaching Professor Lisa Yan
Yan, SP25
07-Floating Point (1)
Generics wrap-up: Pointer Arithmetic
Agenda
How to use swap?
void swap(void *ptr1, void *ptr2, size_t nbytes)
void *arr; // pointer to array arr
size_t nelems, nbytes; // # elements in array, element size
… /* initialize */
swap(arr, ??? , nbytes);
arr = 0x100, nelems = 5, nbytes = 4
A. arr + nelems - 1
B. arr + (nelems - 1)*nbytes
C. (char *) arr + (nelems - 1)*nbytes
D. (char *) (arr + (nelems - 1)*nbytes)
E. Something else
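Memory: elements 1 | 2 | 3 | 4 | 5 | at addresses 0x100, 0x104, 0x108, 0x10C, 0x110 (the array ends at 0x114).
Goal: swap the first and last elements, so 1 … 5 becomes 5 … 1.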
Generic Swap in Action
swap(arr, (char *) arr + (nelems - 1)*nbytes, nbytes);
void swap(void *ptr1, void *ptr2, size_t nbytes) {
    char temp[nbytes];
    memcpy(temp, ptr1, nbytes);
    memcpy(ptr1, ptr2, nbytes);
    memcpy(ptr2, temp, nbytes);
}
In this call: ptr1 = 0x100, ptr2 = 0x110, nbytes = 4, and temp is a 4-byte buffer.
Memory: elements 1 | 2 | 3 | 4 | 5 | at addresses 0x100 through 0x110; after the swap, 1 … 5 becomes 5 … 1.
Pointer arithmetic in generics must be bytewise arithmetic:
1. Cast void * pointer to char *.
2. Pointer arithmetic is then effectively byte-wise!
(change of gears)
Great Idea #1: Abstraction (Levels of Representation/Interpretation)
Week 1 | – | Welcome | Number 1: Bit ops, Integer num rep |
Week 2 | C 1: Intro | C 2: Pointers, Arrays, Strings | C 3: Memory (Stack, Heap) |
Week 3 | C 4: Generics, Function Pointers | Number 2: Floating Point | RISC-V 1: Intro to Assembly |
Learning C means learning binary abstraction! Let’s revisit number representation, but now for numbers like 4.25, 5.17, ….
Quote of the Day
“95% of the folks out there are completely clueless about floating-point.”
– James Gosling, 1998-02-28
Strawman: Fixed Point
[Reference] FA22 videos: link
Agenda
Review of Integer Number Representations
Computers are made to process numbers, which are represented as bits.
What can we represent in N bits?
What about other numbers?
Very large numbers (sec/millennium)
Very small numbers? (Bohr radius)
#s with both integer & fractional parts?
Let’s start with representing this number…
Scientific Notation
“Fixed Point” Binary Representation
“Binary Point,” like decimal point, signifies the boundary between integer and fractional parts.
Example 6-bit representation, positive numbers only:
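Format: xx.yyyy (the “.” is the binary point)
Bit weights, left to right: $2^{1}$, $2^{0}$ for the x bits; $2^{-1}$, $2^{-2}$, $2^{-3}$, $2^{-4}$ for the y bits.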
With the above 6-bit “fixed binary point” representation:
$10.1010_{\text{two}} = 1 \times 2^{1} + 1 \times 2^{-1} + 1 \times 2^{-3} = 2 + 0.5 + 0.125 = 2.625_{\text{ten}}$
$11.1111_{\text{two}} = 4 - 1 \times 2^{-4}$
Arithmetic with Fixed Point
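Addition is straightforward:
    01.1000    ($1.5_{\text{ten}}$)
  + 00.1000    ($0.5_{\text{ten}}$)
  = 10.0000    ($2.0_{\text{ten}}$)
Multiplication is a bit more complex:
    01.1000    ($1.5_{\text{ten}}$)
  × 00.1000    ($0.5_{\text{ten}}$)
Partial products (multiply as integers, one row per multiplier bit):
         000000
        000000
       000000
      011000
     000000
    000000
    000011000000
Each operand has 4 fraction bits, so the product has 8: 0000.11000000 → 00.1100 ($0.75_{\text{ten}}$).
Need to remember where point is…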
What about other numbers?
Very large numbers (sec/millennium) ⚠️ 11···0.0 (34 bits)
Very small numbers? (Bohr radius) ⚠️ 0.0···11 (58 bits)
#s with both integer & fractional parts? ✅
To store all these numbers, we’d need a fixed-point rep with at least 92 bits. There must be a better way!
Floating Point
Agenda
Floating Point
A “floating binary point” uses our limited bits most effectively (and thus gives more accuracy in our number representation).
Example:
1. Store these “significant bits.”
2. Keep track of the binary point 2 places to the left of the MSB (“exponent field”).
The binary point is stored separately from the significant bits, so very large and small numbers can be represented.
Enter Scientific Notation (in Decimal)
“Normalized form”: no leading 0s (exactly one nonzero digit to left of point).
$1.640625_{\text{ten}} \times 10^{-1}$
Parts: mantissa 1.640625 (with its decimal point), radix (base) 10, exponent -1.
Enter Scientific Notation (in Binary)
“Normalized form”: no leading 0s (exactly one nonzero digit to left of point).
“Floating point”: Computer arithmetic that supports a binary normalized form.
Decimal: $1.640625_{\text{ten}} \times 10^{-1}$ (mantissa 1.640625 with its decimal point, radix 10, exponent -1)
Binary: $1.0101_{\text{two}} \times 2^{-3}$ (mantissa 1.0101 with its “binary point,” radix 2, exponent -3)
Note: Normalized binary representation always leads with a 1! (why?)
IEEE 754 Single-Precision Floating Point (1/4)
Bits | 31 | 30–23 | 22–0 |
Field | s | exponent | significand |
Width | 1 bit | 8 bits | 23 bits |
$\pm 1.xxx\cdots x_{\text{two}} \times 2^{yyy\cdots y_{\text{two}}}$ (normalized format)
IEEE 754 Single-Precision Floating Point (2/4)
Bits | 31 | 30–23 | 22–0 |
Field | s | exponent | significand |
Width | 1 bit | 8 bits | 23 bits |
$\pm 1.xxx\cdots x_{\text{two}} \times 2^{yyy\cdots y_{\text{two}}}$ (normalized format)
Sign bit: $(-1)^{S}$
Significand field: $(1 + \text{significand})$
IEEE 754 Single-Precision Floating Point (3/4)
Bits | 31 | 30–23 | 22–0 |
Field | s | exponent | significand |
Width | 1 bit | 8 bits | 23 bits |
$\pm 1.xxx\cdots x_{\text{two}} \times 2^{yyy\cdots y_{\text{two}}}$
Exponent field uses “bias notation”: $2^{\text{exponent} - 127}$
IEEE 754 Single-Precision Floating Point (4/4)
Single precision standard for 32-bit word. In C, float.
Bits | 31 | 30–23 | 22–0 |
Field | s | exponent | significand |
Width | 1 bit | 8 bits | 23 bits |
$\pm 1.xxx\cdots x_{\text{two}} \times 2^{yyy\cdots y_{\text{two}}}$ (normalized format)
[Summary] $(-1)^{S} \times (1 + \text{significand}) \times 2^{\text{exponent} - 127}$ (sign, significand, and exponent terms, in that order)
“Father” of the Floating Point Standard
IEEE Standard 754 for Binary Floating-Point Arithmetic.
William Kahan, Professor Emeritus, UC Berkeley
On most systems, C float and double use the IEEE 754 formats (32-bit float: single precision; 64-bit double: double precision).
Example, Float Step Size
Agenda
Normalized Example
What is the decimal equivalent of the following IEEE 754 single-precision binary floating point number?
Bits | 31 | 30–23 | 22–0 |
Value | 1 | 1000 0001 | 111 0000 0000 0000 0000 0000 |
A. -7 * 2^129
B. -3.5
C. -3.75
D. 7
E. -7.5
F. Something else
$(-1)^{S} \times (1 + \text{significand}) \times 2^{\text{exponent} - 127}$
[Solution] Normalized Example
What is the decimal equivalent of the following IEEE 754 single-precision binary floating point number?
Bits | 31 | 30–23 | 22–0 |
Value | 1 | 1000 0001 | 111 0000 0000 0000 0000 0000 |
$(-1)^{S} \times (1 + \text{significand}) \times 2^{\text{exponent} - 127}$
$= (-1)^{1} \times (1 + .111)_{\text{two}} \times 2^{129 - 127}$
$= -1 \times (1.111)_{\text{two}} \times 2^{2}$
$= -111.1_{\text{two}}$
$= -7.5_{\text{ten}}$
A. -7 * 2^129
B. -3.5
C. -3.75
D. 7
E. -7.5
F. Something else
Step Size with Limited Precision
What is the next representable number after y? Before y?
Bits | 31 | 30–23 | 22–0 |
y | 0 | 1000 0001 | 111 0000 0000 0000 0000 0000 |
Next float after y (significand + 1) | 0 | 1000 0001 | 111 0000 0000 0000 0000 0001 |
Next float after y $= y + (.0\ldots01)_{\text{two}} \times 2^{129 - 127} = y + 2^{-23} \times 2^{2} = y + 2^{-21}$, the “step size.”
Because we have a fixed # of bits (precision), we cannot represent all numbers in a range.
Step size is the spacing between consecutive floats with a given exponent.
Computing in the News(?) Climate Change
INT_MAX: 2147483647
UINT_MAX: 4294967295
FLOAT_MAX: 3.40282e+38
$(1.1\ldots1)_{\text{two}} \times 2^{254 - 127} \approx 3.4 \times 10^{38}$
This is the maximum normalized float rep in IEEE 754, where biased exponents are restricted to the range 1 to 254 ($2^{-126}$ to $2^{127}$).
s | 1111 1110 | 1…1 (23 bits) |
Exponent fields 0, 255 are reserved for special numbers. More now!
Special Numbers, Overflow, and Underflow
[FOR NEXT TIME]
Agenda
Special Numbers
Normalized numbers are only a fraction (heh) of floating point representations. For single-precision (32-bit):
Professor Kahan had clever ideas; “Waste not, want not.”
Biased Exponent | Significand field | Object |
0 | all zeros | ±0 |
0 | nonzero | ??? |
1 – 254 | anything | Normalized floating point |
255 | all zeros | ??? |
255 | nonzero | ??? |
The fields 0 and 255 accommodate overflow, underflow, zero, infinity, and arithmetic errors.
Read reference slides for zero, infinity, and NaN!
Overflow and Underflow (1/2)
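Because 0 and 255 are reserved exponent fields, the range of normalized single-precision floating point is:
Largest magnitude: s | 1111 1110 | 1…1 (23 bits) | $= (1.1\ldots1)_{\text{two}} \times 2^{254 - 127} \approx 3.4 \times 10^{38}$
Smallest magnitude: s | 0000 0001 | 0…0 (23 bits) | $= (1.0)_{\text{two}} \times 2^{1 - 127} \approx 1.2 \times 10^{-38}$
Normalized range: $[-3.4 \times 10^{38}, -1.2 \times 10^{-38}]$ and $[1.2 \times 10^{-38}, 3.4 \times 10^{38}]$
Number line, left to right: $-3.4 \times 10^{38}$, $-1$, $-1.2 \times 10^{-38}$, $0$, $1.2 \times 10^{-38}$, $1$, $3.4 \times 10^{38}$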
Overflow and Underflow (2/2)
Because 0 and 255 are reserved exponent fields, the range of normalized single-precision floating point is:
Largest magnitude: s | 1111 1110 | 1…1 (23 bits) | $= (1.1\ldots1)_{\text{two}} \times 2^{254 - 127} \approx 3.4 \times 10^{38}$
Smallest magnitude: s | 0000 0001 | 0…0 (23 bits) | $= (1.0)_{\text{two}} \times 2^{1 - 127} \approx 1.2 \times 10^{-38}$
What if a number falls outside the representable range?
Overflow: magnitude larger than $3.4 \times 10^{38}$ (past either end of the number line).
Underflow: magnitude between $0$ and $1.2 \times 10^{-38}$ (too close to zero on either side).
Denorms: Gradual Underflow (1/3)
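Problem: Because of the mantissa’s implicit 1, there is a huge step size between 0 and smallest vs. smallest and 2nd smallest!
Bits | 31 | 30–23 | 22–0 |
Smallest normalized number | 0 | 0000 0001 | 000…000 |
2nd smallest normalized number | 0 | 0000 0001 | 00000…00001 |
Smallest normalized number: $(1.0)_{\text{two}} \times 2^{1 - 127} = 2^{-126}$
2nd smallest normalized number: $(1.00\ldots01)_{\text{two}} \times 2^{1 - 127} = (1 + 2^{-23}) \times 2^{-126} = 2^{-126} + 2^{-149}$
Number line from $-\infty$ to $+\infty$: the gap from 0 to the smallest is $2^{-126}$; the gap from the smallest to the 2nd smallest is $2^{-149}$.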
Denorms: Gradual Underflow (2/3)
Solution:
$(-1)^{s} \times (0.xx\ldots x)_{\text{two}} \times 2^{-126}$
Bits | 31 | 30–23 | 22–0 |
Denorm | s | 0000 0000 | 00000…00001 |
Denorms: Gradual Underflow (3/3)
Solution:
$(-1)^{s} \times (0.xx\ldots x)_{\text{two}} \times 2^{-126}$
Bits | 31 | 30–23 | 22–0 |
Smallest denormalized number ($2^{-149}$) | 0 | 0000 0000 | 000…001 |
2nd smallest denorm | 0 | 0000 0000 | 000…010 |
Largest denorm | 0 | 0000 0000 | 111…111 |
Smallest normalized number ($2^{-126}$) | 0 | 0000 0001 | 000…000 |
Special Numbers, Summary
The fields 0 and 255 accommodate overflow, underflow, zero, infinity, and arithmetic errors.
Biased Exponent | Significand field | Object |
0 | all zeros | ±0 |
0 | nonzero | Denormalized numbers |
1 – 254 | anything | Normalized floating point |
255 | all zeros | ±∞ |
255 | nonzero | NaNs |
Read reference slides for zero, infinity, and NaN!
BTW: IEEE 754 Double-Precision Floating Point (double)
binary64: Next Multiple of Word Size (64 bits)
Double Precision (vs. Single Precision)
Bits (upper word) | 31 | 30–20 | 19–0 |
Field | s | exponent | significand |
Width | 1 bit | 11 bits | 20 bits |
Bits (lower word) | 31–0 |
Field | significand (cont’d) |
Width | 32 bits |
For you: IEEE-754 Floating Point Converter
Representing Zero
IEEE 754 represents ±0:
Note: Zero has no normalized representation (i.e., no leading 1).
+0: 0 | 0000 0000 | 000…000 |
-0: 1 | 0000 0000 | 000…000 |
[READ]
Representation for ±∞
In FP, divide by ±0 should produce ±∞, not overflow.
IEEE 754 represents ±∞:
+∞: 0 | 1111 1111 | 000…000 |
-∞: 1 | 1111 1111 | 000…000 |
[READ]
Representation for Not a Number
What do I get if I calculate sqrt(-4.0) or 0/0?
Why is this useful?
1 | 1111 1111 | xxx…xxxx |
[READ]
[reference] Other Floating Point Representations
[Reference] FA22 videos: link
Agenda
Precision and Accuracy
Precision is a count of the number of bits used to represent a value.
Accuracy is the difference between the actual value of a number and its computer representation.
High precision permits high accuracy but doesn’t guarantee it.
It is possible to have high precision but low accuracy.
Other Floating Point Representations
Quad-Precision? Yep! (128 bits) “binary128”
Oct-Precision? Yep! “binary256”
Half-Precision? Yep! “binary16” or “fp16”
Half-Precision? Yep! “bfloat16”
Floating Point Soup
Format | S | Exponent | Significand | Bit positions (S, exponent, significand) |
FP32 | 1 bit | 8 bits | 23 bits | 31, 30–23, 22–0 |
FP16 | 1 bit | 5 bits | 10 bits | 15, 14–10, 9–0 |
BFLOAT16 | 1 bit | 8 bits | 7 bits | 15, 14–7, 6–0 |
TF32 | 1 bit | 8 bits | 10 bits | 18, 17–10, 9–0 |
Who Uses What in Domain Accelerators?
Accelerator | int4 | int8 | int16 | fp16 | bf16 | fp32 | tf32 |
Google TPU v1 | | x | | | | | |
Google TPU v2 | | | | | x | | |
Google TPU v3 | | | | | x | | |
Nvidia Volta TensorCore | x | x | | x | | | |
Nvidia Ampere TensorCore | x | x | x | x | x | x | x |
Nvidia DLA | | x | x | x | | | |
Intel AMX | | x | | | x | | |
Amazon AWS Inferentia | | x | | x | x | | |
Qualcomm Hexagon | | x | | | | | |
Huawei Da Vinci | | x | | x | | | |
MediaTek APU 3.0 | | x | x | x | | | |
Samsung NPU | | x | | | | | |
Tesla NPU | | x | | | | | |
Unum
Everything so far has had a fixed set of bits for Exponent and Significand.
Claims to save power!
Dr. John Gustafson
[Reference] Floating Point Discussion
[Reference] FA22 videos: link
Agenda
Floating Point Fallacy
x = $-1.5 \times 10^{38}$
y = $1.5 \times 10^{38}$
z = 1.0
Rounding
Under the hood
IEEE FP’s Four Rounding Modes
Under the hood
Examples in decimal
(but applies to IEEE754 binary)
FP Addition
Under the hood
Casting floats to ints and vice versa
Double Casting Doesn’t Always Work
int i = …;
if (i == (int)((float) i)) {
    printf("true\n");
}
float f = …;
if (f == (float)((int) f)) {
    printf("true\n");
}
Now you can make float jokes too…
Saturday Morning Breakfast Cereal
Side note: the robot is using IEEE 754 double-precision.
(double) 0.3: 0.29999999999999999
(double) 0.1 + (double) 0.2: 0.30000000000000004
[Reference] More Examples
[Reference] FA22 videos: link
Agenda
Ex 1: Convert Binary Floating Point to Decimal
Bits | 31 | 30–23 | 22–0 |
Field | s | exponent | significand |
Value | 0 | 0110 1000 | 101 0101 0100 0011 0100 0010 |
Sign: 0 → positive
Exponent: $0110\,1000_{\text{two}} = 104_{\text{ten}}$; bias adjustment: $104 - 127 = -23$
Significand: $1 + 1 \times 2^{-1} + 0 \times 2^{-2} + 1 \times 2^{-3} + 0 \times 2^{-4} + 1 \times 2^{-5} + \ldots = 1 + 2^{-1} + 2^{-3} + 2^{-5} + 2^{-7} + 2^{-9} + 2^{-14} + 2^{-15} + 2^{-17} + 2^{-22} = 1.0 + 0.666115$
Result: $1.666115_{\text{ten}} \times 2^{-23} \approx 1.986 \times 10^{-7}$ (about 2/10,000,000)
Ex 2: Convert Decimal to Binary Floating Point
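$-23.40625_{\text{ten}} = -2.340625 \times 10^{1}$
Integer part: $23 = 10111_{\text{two}}$
Fractional part: $.40625 = .25 + (.15625 = .125 + (.03125)) = .01101_{\text{two}}$
Combined: $10111.01101_{\text{two}} = 1.011101101_{\text{two}} \times 2^{4}$
Exponent field: $127 + 4 = 131 = 1000\,0011_{\text{two}}$
Bits | 31 | 30–23 | 22–0 |
Value | 1 | 1000 0011 | 011 1011 0100 0000 0000 0000 |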
Ex 3: Represent 1/3
Bits | 31 | 30–23 | 22–0 |
Value | 0 | 0111 1101 | 010 1010 1010 1010 1010 1010 |
Understanding the Significand (1/2)
Understanding the Significand (2/2)
Fractional Powers of 2
0 1.0 1
Mark Lu’s “Binary Float Displayer”