Fixed and Floating Point Representation

So far we have represented whole numbers. But real-world quantities — voltages, temperatures, money, the results of division — are fractional. A finite register cannot hold every real number, so we must choose a scheme that trades range against precision. This lesson develops fixed-point binary fractions and then the far more flexible floating-point representation: mantissa and exponent (both in two's complement), normalisation, two-way conversion between denary and normalised floating point, the range-versus-precision trade-off, absolute and relative error, rounding, and underflow/overflow. Floating point is the single hardest sub-topic in data representation, so every conversion below is shown in full.

Spec Mapping

This lesson covers the real-number strand of the AQA A-Level Computer Science (7517) Fundamentals of data representation area:

Fixed-point binary — representing fractional values with binary place values $\tfrac12, \tfrac14, \tfrac18, \ldots$ and an implied binary point at a fixed column.
Floating-point form — a normalised mantissa and an exponent, both stored in two's complement, and how moving the point trades range for precision.
Normalisation — putting a mantissa into its standard form and why this maximises precision.
Conversion — denary ↔ normalised floating point in both directions.
Errors — absolute error, relative error, rounding, and the conditions producing underflow and overflow.

This material links tightly to the rounding-error and precision content elsewhere in the specification and to the two's complement arithmetic from the previous lesson.

Fixed-Point Binary Fractions

In fixed-point representation the binary point sits at a fixed, agreed column. Bits to the left have positive powers of two; bits to the right have negative powers:

$\begin{array}{cccc|cccc} 2^3 & 2^2 & 2^1 & 2^0 & 2^{-1} & 2^{-2} & 2^{-3} & 2^{-4} \\ 8 & 4 & 2 & 1 & \tfrac12 & \tfrac14 & \tfrac18 & \tfrac1{16} \end{array}$

The fractional place values in denary are $0.5, 0.25, 0.125, 0.0625, \ldots$

Fixed-point binary → denary

With the point after 4 integer bits, evaluate $\texttt{0101.1010}$ :

Bit	0	1	0	1	.	1	0	1	0
Value	8	4	2	1		$\tfrac12$	$\tfrac14$	$\tfrac18$	$\tfrac1{16}$

$4 + 1 + 0.5 + 0.125 = \mathbf{5.625_{10}}$
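The column-weight method above can be sketched in Python. This is a minimal illustration for unsigned fixed-point strings only; the function name `fixed_to_denary` is a hypothetical helper, not something from the specification.

```python
from fractions import Fraction

def fixed_to_denary(bits: str) -> Fraction:
    """Evaluate an unsigned fixed-point bit string such as '0101.1010'."""
    int_part, _, frac_part = bits.partition(".")
    total = Fraction(0)
    # Integer bits carry weights 2^(n-1) down to 2^0, read left to right.
    for i, bit in enumerate(int_part):
        if bit == "1":
            total += Fraction(2) ** (len(int_part) - 1 - i)
    # Fraction bits carry weights 1/2, 1/4, 1/8, ... read left to right.
    for k, bit in enumerate(frac_part, start=1):
        if bit == "1":
            total += Fraction(1, 2 ** k)
    return total

print(float(fixed_to_denary("0101.1010")))
```

Exact fractions are used so that no rounding is introduced by the check itself.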

Denary → fixed-point binary

Convert $6.75_{10}$ . Handle the integer and fractional parts separately. Integer part: $6 = \texttt{0110}$ . Fractional part by repeated multiplication by 2 — read the carried integer parts top-to-bottom:

Step	Remaining fraction	Carry (bit)
$0.75 \times 2 = 1.5$	0.5	1
$0.5 \times 2 = 1.0$	0.0	1
stop — fraction is 0

So $0.75 = \texttt{.11}$ and $6.75 = \texttt{0110.1100}_2$ (padding the fraction field). Check: $4 + 2 + 0.5 + 0.25 = 6.75$ . Correct.
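The repeated-multiplication method can be sketched as follows. It assumes a non-negative value that fits the chosen field widths; the names `fraction_to_bits` and `denary_to_fixed` are illustrative only.

```python
def fraction_to_bits(fraction: float, max_bits: int = 8) -> str:
    """Repeatedly double the fraction, collecting each carried integer part."""
    bits = []
    while fraction != 0 and len(bits) < max_bits:
        fraction *= 2
        carry = int(fraction)      # 1 if the product reached 1.0, otherwise 0
        bits.append(str(carry))
        fraction -= carry          # keep only the remaining fraction
    return "".join(bits)

def denary_to_fixed(value: float, int_bits: int = 4, frac_bits: int = 4) -> str:
    """Convert a non-negative denary value to a fixed-point bit string."""
    whole = int(value)
    integer = format(whole, f"0{int_bits}b")
    # Pad the fraction field on the right with zeros, as in the worked example.
    fraction = fraction_to_bits(value - whole, frac_bits).ljust(frac_bits, "0")
    return f"{integer}.{fraction}"

print(denary_to_fixed(6.75))
```

If the fraction has not reached zero after `max_bits` doublings, the loop stops and the result is only an approximation of the original value.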

The limitation of fixed point

Fixed point is simple and fast, but the position of the point is fixed — so the range and the smallest representable step are both locked. With 4 integer + 4 fraction bits the largest value is just under $16$ and the finest resolution is $\tfrac1{16} = 0.0625$ . You cannot represent a very large number and a very small one in the same format. Floating point solves this by letting the point move.

To make the limitation concrete, consider the same 8 bits split as fixed-point. The total range and resolution depend entirely on where we fix the point:

Point position	Largest value	Smallest step (resolution)
7 integer, 1 fraction	$\approx 127.5$	$\tfrac12 = 0.5$
4 integer, 4 fraction	$\approx 15.94$	$\tfrac1{16} = 0.0625$
1 integer, 7 fraction	$\approx 1.99$	$\tfrac1{128} \approx 0.0078$

Every row uses all 8 bits, yet each forces a stark compromise: a wide range (top row) buys coarse resolution, while fine resolution (bottom row) buys a tiny range. With the point fixed you must commit to one row in advance, even though real data may need both large and small magnitudes. Floating point sidesteps this by storing the point's position (the exponent) with each number, so the same format can represent $12{,}000$ and $0.0003$ — the freedom that makes it the standard for scientific and general-purpose real arithmetic.
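The figures in the table can be reproduced with a short sketch. It assumes unsigned fixed point (no sign bit), and `fixed_point_limits` is a hypothetical helper name.

```python
def fixed_point_limits(int_bits: int, frac_bits: int):
    """Largest value and step size of an unsigned fixed-point format."""
    step = 2.0 ** -frac_bits            # weight of the last fraction bit
    largest = 2.0 ** int_bits - step    # value with every bit set to 1
    return largest, step

for int_bits, frac_bits in [(7, 1), (4, 4), (1, 7)]:
    largest, step = fixed_point_limits(int_bits, frac_bits)
    print(f"{int_bits} integer, {frac_bits} fraction: largest {largest}, step {step}")
```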

Floating-Point Form

A floating-point number, like scientific notation in denary ( $6.02 \times 10^{23}$ ), has two parts:

$\text{value} = \text{mantissa} \times 2^{\,\text{exponent}}$

The mantissa (also called the significand) holds the significant digits, stored as a two's complement fixed-point fraction with the binary point immediately after the sign bit.
The exponent is a two's complement integer saying how many places — and in which direction — to shift the point.

The format used in this lesson. Unless a question states otherwise we use a 6-bit mantissa (1 sign bit + 5 fraction bits, point straight after the sign bit) and a 4-bit two's complement exponent. AQA questions always state the bit allocation; read it carefully because the answer depends on it.

So a stored pair like mantissa $\texttt{0.10110}$ , exponent $\texttt{0011}$ means $0.10110_2 \times 2^{3}$ — shift the point 3 places right to get $\texttt{0101.10}_2 = 5.5_{10}$ .

Reading the mantissa's sign

Because the mantissa is two's complement, its leading bit is a sign bit with weight $-1$ (i.e. $-2^0$ with the point right after it). The remaining bits have weights $2^{-1}, 2^{-2}, \ldots$ :

$\begin{array}{c|ccccc} -2^0 & 2^{-1} & 2^{-2} & 2^{-3} & 2^{-4} & 2^{-5} \\ -1 & \tfrac12 & \tfrac14 & \tfrac18 & \tfrac1{16} & \tfrac1{32} \end{array}$

A positive mantissa therefore begins $\texttt{0.1...}$ and a negative mantissa begins $\texttt{1.0...}$ once normalised — a fact we use constantly.
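As a rough sketch of the weight table, the function below sums the column weights of the set bits, treating the leading column as $-1$. The name `mantissa_value` is an assumption made for illustration.

```python
from fractions import Fraction

def mantissa_value(bits: str) -> Fraction:
    """Value of a two's complement mantissa with the point after the sign bit."""
    # Leading column weighs -1; the rest weigh 1/2, 1/4, 1/8, ...
    weights = [Fraction(-1)] + [Fraction(1, 2 ** k) for k in range(1, len(bits))]
    return sum((w for w, b in zip(weights, bits) if b == "1"), Fraction(0))

print(float(mantissa_value("010110")))   # a positive mantissa
print(float(mantissa_value("101101")))   # a negative mantissa
```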

Normalisation

A number can be written in floating-point form in many ways ( $0.011_2 \times 2^2 = 0.11_2 \times 2^1 = 1.1_2 \times 2^0$ , all equal to $1.5_{10}$ ). Normalisation picks the one form that uses the mantissa bits most efficiently, giving maximum precision.

The rule for a two's complement mantissa:

Positive numbers: the mantissa must start with $\texttt{0}$ followed by $\texttt{1}$ — i.e. the pattern $\texttt{0.1...}$ (the first two bits are 01).
Negative numbers: the mantissa must start with $\texttt{1}$ followed by $\texttt{0}$ — i.e. the pattern $\texttt{1.0...}$ (the first two bits are 10).

In both cases the two most-significant bits differ. The idea is to place the first significant bit immediately after the sign bit, so that no mantissa bits are wasted on leading digits that carry no information. Each move of the point one place to the left increases the exponent by 1; each move one place to the right decreases it by 1 (moving the point divides or multiplies the mantissa by 2, and the exponent compensates so the value is unchanged).

Exam Tip: "Normalise this number" almost always means shift until the first two mantissa bits differ, adjusting the exponent to keep the value the same. Show the shift and the matching exponent change explicitly — both attract marks.

Is this mantissa normalised? — quick tests

A common short question gives a mantissa and asks whether it is normalised. Apply the two-bits-differ rule by inspection:

Mantissa	First two bits	Normalised?	Why
$\texttt{010110}$	01	Yes	positive, bits differ
$\texttt{001011}$	00	No	leading 0s waste precision — shift left, decrease exponent
$\texttt{101101}$	10	Yes	negative, bits differ
$\texttt{110100}$	11	No	leading 1s waste precision — shift left, decrease exponent

The rule of thumb: an un-normalised mantissa always begins with two identical bits (00 or 11). To normalise it you shift the mantissa left until the first two bits differ, decreasing the exponent by 1 for each shift (because each left shift doubles the mantissa, so the exponent must drop to keep the value constant). For example $\texttt{001011}$ with exponent $5$ becomes $\texttt{010110}$ with exponent $4$ after one left shift — same value, now normalised, one more significant bit retained.
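The two-bits-differ test and the shift-left procedure might be sketched like this. The helpers `is_normalised` and `normalise` are hypothetical, the exponent is held as a plain Python integer, and the sketch does not check whether the new exponent still fits its field.

```python
def is_normalised(mantissa: str) -> bool:
    """A two's complement mantissa is normalised when its first two bits differ."""
    return mantissa[0] != mantissa[1]

def normalise(mantissa: str, exponent: int):
    """Shift the mantissa left until normalised, lowering the exponent each time."""
    if set(mantissa) == {"0"}:
        raise ValueError("zero has no normalised form")
    while not is_normalised(mantissa):
        mantissa = mantissa[1:] + "0"   # left shift: drop the duplicated lead bit
        exponent -= 1                   # compensate for doubling the mantissa
    return mantissa, exponent

print(normalise("001011", 5))
```

Dropping the leading bit is safe here because an un-normalised mantissa starts with two identical bits, so the sign is preserved by the bit that moves into the first position.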

Worked Conversion: Denary → Normalised Floating Point

Example 1 — a positive value, $9.5_{10}$

Step 1 — convert to fixed-point binary. $9 = \texttt{1001}$ and $0.5 = \texttt{.1}$ , so $9.5 = \texttt{1001.1}_2$ .

Step 2 — write in the form mantissa $\times 2^e$ with the point after the sign bit. The point currently sits after 4 integer bits; to reach the form $\texttt{0.10011}$ we move it 4 places left, so the exponent is $+4$ :

$9.5 = \texttt{1001.1}_2 = 0.10011_2 \times 2^{4}$

Step 3 — write the mantissa with its sign bit (positive → leading 0) and the exponent in two's complement. Mantissa (6 bits) $= \texttt{0.10011}$ ; exponent $4 = \texttt{0100}$ in 4-bit two's complement. Check normalisation: first two mantissa bits are 01 ✓ (positive, normalised).

$\boxed{\text{mantissa } \texttt{010011}, \quad \text{exponent } \texttt{0100}}$

Example 2 — a small fraction, $0.40625_{10}$

Step 1 — fixed-point binary. Repeated multiplication: $0.40625 \times 2 = 0.8125$ (carry 0); $0.8125 \times 2 = 1.625$ (carry 1); $0.625 \times 2 = 1.25$ (carry 1); $0.25 \times 2 = 0.5$ (carry 0); $0.5 \times 2 = 1.0$ (carry 1). Reading the carries top-to-bottom: $0.40625 = \texttt{0.01101}_2$ . (Check: $\tfrac14 + \tfrac18 + \tfrac1{32} = 0.25 + 0.125 + 0.03125 = 0.40625$ ✓.)

Step 2 — normalise. The first significant bit is in the $2^{-2}$ column, so we must shift the point one place right to make the pattern $\texttt{0.1101}$ , decreasing the exponent by 1 to $-1$ :

$0.40625 = 0.01101_2 \times 2^{0} = 0.1101_2 \times 2^{-1}$

Step 3 — encode. Mantissa $= \texttt{0.11010}$ (positive, pattern 01 ✓); exponent $-1 = \texttt{1111}$ in 4-bit two's complement.

$\boxed{\text{mantissa } \texttt{011010}, \quad \text{exponent } \texttt{1111}}$
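The three steps can be combined into one sketch for the lesson's 6-bit mantissa and 4-bit exponent. It assumes the value is non-zero and fits the mantissa exactly, raising an error otherwise; `encode` and `to_twos` are illustrative names.

```python
from fractions import Fraction

def to_twos(value: int, width: int) -> str:
    """Two's complement bit string of a signed integer in the given width."""
    return format(value & ((1 << width) - 1), f"0{width}b")

def encode(value, mant_bits: int = 6, exp_bits: int = 4):
    """Return (mantissa, exponent) bit strings in normalised form."""
    m = Fraction(value)
    if m == 0:
        raise ValueError("zero has no normalised form")
    exponent = 0
    # Too large: move the point left (halve the mantissa, raise the exponent).
    while m >= 1 or m < -1:
        m /= 2
        exponent += 1
    # Too small: move the point right (double the mantissa, lower the exponent).
    while Fraction(-1, 2) <= m < Fraction(1, 2):
        m *= 2
        exponent -= 1
    scaled = m * 2 ** (mant_bits - 1)
    if scaled.denominator != 1:
        raise ValueError("value needs more mantissa bits than available")
    if not -(2 ** (exp_bits - 1)) <= exponent < 2 ** (exp_bits - 1):
        raise ValueError("exponent does not fit the exponent field")
    return to_twos(int(scaled), mant_bits), to_twos(exponent, exp_bits)

print(encode(9.5))
print(encode(0.40625))
```

The two loops correspond to the normalised ranges $\tfrac12 \le m < 1$ for positive mantissas and $-1 \le m < -\tfrac12$ for negative ones.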

Example 3 — a negative value, $-9.5_{10}$

Start from the positive mantissa $\texttt{0.10011}$ with exponent $4$ , and negate the mantissa in two's complement (the exponent is unchanged). Invert and add one to $\texttt{010011}$ :

Step	Mantissa bits
$+$ mantissa	0 1 0 0 1 1
invert	1 0 1 1 0 0
add 1	1 0 1 1 0 1

So $-9.5$ has mantissa $\texttt{101101}$ , exponent $\texttt{0100}$ . Check normalisation: first two bits 10 ✓ (negative, normalised). Decode to verify: the set bits sit in the $-1$ , $\tfrac14$ , $\tfrac18$ and $\tfrac1{32}$ columns, so the mantissa is $-1 + 0.25 + 0.125 + 0.03125 = -0.59375$ , and $-0.59375 \times 2^4 = -9.5$ ✓. A common slip is to misread a column weight (for example, taking the $\tfrac18$ bit as $\tfrac1{16}$ gives $-0.65625 \times 2^4 = -10.5$ ), which is why you should always decode your answer with the weights written above the bits.
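The invert-and-add-one step can be sketched on bit strings. The function name `negate` is hypothetical, and any carry out of the top bit is discarded, as in fixed-width two's complement arithmetic.

```python
def negate(bits: str) -> str:
    """Two's complement negation: invert every bit, then add 1."""
    width = len(bits)
    inverted = "".join("1" if b == "0" else "0" for b in bits)
    # Add 1 and discard any carry beyond the fixed width.
    result = (int(inverted, 2) + 1) % (1 << width)
    return format(result, f"0{width}b")

print(negate("010011"))
```

Negating twice returns the original pattern, which is a quick way to check the working.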

Worked Conversion: Normalised Floating Point → Denary

Decode mantissa $\texttt{010110}$ , exponent $\texttt{0011}$ .

Step 1 — value of the exponent. $\texttt{0011}_2 = +3$ .

Step 2 — value of the mantissa (two's complement, point after sign bit). Leading bit 0 → positive: $0 \cdot(-1) + \tfrac12 + 0 + \tfrac18 + \tfrac1{16} + 0 = 0.5 + 0.125 + 0.0625 = 0.6875$ .

Step 3 — apply the exponent (shift the point 3 places right):

$0.6875 \times 2^{3} = 0.6875 \times 8 = \mathbf{5.5_{10}}$

Now a negative example: mantissa $\texttt{100110}$ , exponent $\texttt{0010}$ .

Mantissa value (leading 1 → the $-1$ column contributes; the other set bits are in the $\tfrac18$ and $\tfrac1{16}$ columns): $-1 + \tfrac18 + \tfrac1{16} = -1 + 0.125 + 0.0625 = -0.8125$ . Exponent $= +2$ . So:

$-0.8125 \times 2^{2} = -0.8125 \times 4 = \mathbf{-3.25_{10}}$

Exam Tip: Decode the exponent first, then the mantissa as a two's complement fraction, then shift. Writing the mantissa column weights ( $-1, \tfrac12, \tfrac14, \ldots$ ) above the bits prevents the most common slip — forgetting that the leading column is negative.
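The decoding order in the tip can be sketched as follows: read both fields as two's complement integers, scale the mantissa so the point sits after its sign bit, then apply the exponent. The names `twos_to_int` and `decode` are assumptions for illustration.

```python
def twos_to_int(bits: str) -> int:
    """Interpret a bit string as a two's complement integer."""
    value = int(bits, 2)
    if bits[0] == "1":
        value -= 1 << len(bits)   # the leading column has negative weight
    return value

def decode(mantissa: str, exponent: str) -> float:
    """Decode a floating-point pair with the point after the mantissa sign bit."""
    exp = twos_to_int(exponent)                                # exponent first
    m = twos_to_int(mantissa) / 2 ** (len(mantissa) - 1)       # then the mantissa
    return m * 2.0 ** exp                                      # then shift

print(decode("010110", "0011"))
print(decode("101101", "0100"))
```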

Range versus Precision

A floating-point format has a fixed total of bits to split between mantissa and exponent, and this split is a direct trade-off:

More exponent bits → greater range (you can shift the point further, reaching much larger and much smaller magnitudes), but fewer mantissa bits → lower precision (fewer significant figures, coarser steps).
More mantissa bits → greater precision, but fewer exponent bits → smaller range.

graph LR
    A["Fixed total bits"] --> B["More exponent bits"]
    A --> C["More mantissa bits"]
    B --> D["Larger range (very big and very small)"]
    B --> E["Lower precision (fewer significant figures)"]
    C --> F["Higher precision (more significant figures)"]
    C --> G["Smaller range"]

You cannot improve both at once with a fixed word length: increasing one necessarily shrinks the other. This is why real formats choose the split deliberately for their use-case (IEEE 754 single precision: 1 sign bit, 8 exponent bits, 23 stored fraction bits; double precision: 1 sign bit, 11 exponent bits, 52 stored fraction bits).
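To see the trade-off numerically, the sketch below compares three ways of splitting 10 bits, using this lesson's convention (two's complement mantissa with the point after the sign bit, two's complement exponent). The helper `format_limits` is hypothetical and considers positive normalised values only.

```python
def format_limits(mant_bits: int, exp_bits: int):
    """Largest and smallest positive normalised values for a given bit split."""
    max_exp = 2 ** (exp_bits - 1) - 1
    min_exp = -(2 ** (exp_bits - 1))
    largest_mantissa = 1 - 2.0 ** -(mant_bits - 1)   # pattern 0.111...1
    largest = largest_mantissa * 2.0 ** max_exp
    smallest = 0.5 * 2.0 ** min_exp                  # pattern 0.100...0
    return largest, smallest

for mant_bits, exp_bits in [(8, 2), (6, 4), (4, 6)]:
    largest, smallest = format_limits(mant_bits, exp_bits)
    print(f"{mant_bits}-bit mantissa, {exp_bits}-bit exponent: "
          f"largest {largest}, smallest positive {smallest}")
```

Moving bits from the mantissa to the exponent widens the gap between the largest and smallest values while leaving fewer significant bits for each one.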

Errors: Absolute, Relative, and Rounding

Because most real numbers cannot be represented exactly in a finite mantissa, the stored value is an approximation. We quantify the error two ways.

Absolute error = $|\,\text{stored value} - \text{true value}\,|$ . It is expressed in the same units as the quantity.
Relative error = $\dfrac{|\,\text{stored value} - \text{true value}\,|}{|\,\text{true value}\,|}$ , often given as a percentage. It expresses the error as a proportion of the value.

Worked example

Suppose a format can only store $\tfrac13$ as the rounded value $0.333251953125$ (a particular 12-bit mantissa). The true value is $0.3333\overline{3}$ .

$\text{absolute error} = |0.333251953125 - 0.333333\ldots| \approx 0.0000814$

$\text{relative error} = \dfrac{0.0000814}{0.333333\ldots} \approx 0.000244 = 0.0244\%$

The same absolute error matters far more for a small value than a large one, which is exactly why relative error is the more meaningful measure for floating point: a given mantissa width gives roughly constant relative precision across the whole range, because normalisation always packs the same number of significant bits behind the point.
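The worked example can be checked with exact fractions. The stored value $0.333251953125$ equals $\tfrac{1365}{4096}$ ; the function name `errors` is a hypothetical helper.

```python
from fractions import Fraction

def errors(stored: Fraction, true: Fraction):
    """Return (absolute error, relative error) of a stored approximation."""
    absolute = abs(stored - true)
    relative = absolute / abs(true)
    return absolute, relative

stored = Fraction(1365, 4096)   # 0.333251953125
true = Fraction(1, 3)
absolute, relative = errors(stored, true)
print(float(absolute))
print(f"{float(relative) * 100:.4f}%")
```

Here the relative error comes out as exactly $\tfrac{1}{4096} = 2^{-12}$ , one part in $2^{12}$ .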

Rounding

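When a value needs more fraction bits than the mantissa provides, the surplus bits must be discarded. Truncation simply drops them, so the stored value is never greater than the true value (for a positive number the magnitude is biased toward zero); rounding to nearest chooses the closer representable value, which gives a smaller average error. Rounding up can occasionally carry out of the fraction bits (for example $\texttt{0.111111}$ rounded to five fraction bits becomes $1.0$ , which no longer fits the $\texttt{0.1...}$ pattern), in which case the result must be re-normalised and the exponent increased by 1.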
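A small sketch comparing the two policies on positive mantissa values follows. The name `quantise` and its `mode` argument are assumptions for illustration; exact halves are rounded up here, which is only one possible tie-breaking rule.

```python
from fractions import Fraction
import math

def quantise(value, frac_bits: int, mode: str) -> Fraction:
    """Reduce a value to the given number of fraction bits."""
    scaled = Fraction(value) * 2 ** frac_bits
    if mode == "truncate":
        n = math.floor(scaled)                    # drop the surplus bits
    else:
        n = math.floor(scaled + Fraction(1, 2))   # nearest, halves rounded up
    return Fraction(n, 2 ** frac_bits)

value = Fraction(79, 128)                  # 0.1001111 in binary
print(quantise(value, 5, "truncate"))      # keeps 0.10011
print(quantise(value, 5, "round"))         # moves up to 0.10100
print(quantise(Fraction(63, 64), 5, "round"))
```

The last call rounds $\texttt{0.111111}$ up to $1$ , which cannot be stored as a positive mantissa without moving the point, so it illustrates the re-normalisation case.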
