Normal Distribution (Further)

The normal distribution is the keystone of statistical inference, and almost every test and interval in the rest of this course rests on one fact: any linear combination of independent normal variables is itself normal. This lesson develops that result carefully — sums, differences and scalar multiples — then specialises it to the distribution of the sample mean $\bar X$ , and closes with the Central Limit Theorem, which extends normal-based inference even to populations that are not themselves normal.

Where this sits in AQA 7367

This is Paper 3 Statistics (7367/3S) content (Paper 3: 2 h, 100 marks, AO1 40% / AO2 25% / AO3 35%). It sits at the heart of the option because the results here are the engine of every later inference lesson: the t-test, confidence intervals and hypothesis tests all begin "since $\bar X \sim N(\mu, \sigma^2/n)$ …". The work is predominantly AO1 (combine distributions, standardise, read normal tables) with strong AO2 (state the correct combined distribution and justify the variance-addition rule) and AO3 in multi-stage worded problems. It builds on the A-Level Maths Statistics normal distribution and standardisation $Z = (X-\mu)/\sigma$ , and on the algebra of $E$ and $\operatorname{Var}$ from Statistics 1.

Core theory: recap and the combination rules

If $X \sim N(\mu, \sigma^2)$ then its density is the familiar bell curve

$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x - \mu)^2}{2\sigma^2}},$

and we standardise with $Z = \dfrac{X - \mu}{\sigma} \sim N(0,1)$ , reading probabilities as $\Phi(z) = P(Z \le z)$ from tables. Two algebraic facts about any random variables drive everything below:

$E(aX + bY) = aE(X) + bE(Y) \quad (\text{always}),$

$\operatorname{Var}(aX + bY) = a^2\operatorname{Var}(X) + b^2\operatorname{Var}(Y) \quad (\text{if } X, Y \text{ independent}).$

The squared coefficients in the variance rule are the source of the topic's defining trap. The extra ingredient for normal variables is the closure result: a linear combination of independent normals is again normal, so we know not just its mean and variance but its whole distribution. This is special to the normal family: the sum of two independent uniform variables, for example, is not uniform (its density is triangular or trapezoidal). The normal is "closed" under addition, which is why it is the natural distribution for aggregates and averages, and why normal-based inference is so widely usable: once we know a quantity is a sum or average of independent normals, finding any probability about it reduces to a single standardisation.

Combination (independent $X, Y$ )	Distribution
$aX + b$	$N(a\mu_X + b,\ a^2\sigma_X^2)$
$X + Y$	$N(\mu_X + \mu_Y,\ \sigma_X^2 + \sigma_Y^2)$
$X - Y$	$N(\mu_X - \mu_Y,\ \sigma_X^2 + \sigma_Y^2)$
$aX + bY$	$N(a\mu_X + b\mu_Y,\ a^2\sigma_X^2 + b^2\sigma_Y^2)$

The critical line: for $X - Y$ the variances are added, not subtracted. Subtracting variables widens the spread (two sources of uncertainty combine), so the variance grows. Using $b = -1$ in the rule gives $(-1)^2\sigma_Y^2 = +\sigma_Y^2$ — the minus sign squares away.
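A short simulation can make the "variances add" rule concrete. The sketch below is illustrative only: the parameters $N(68, 16)$ and $N(50, 9)$, the number of draws and the seed are all assumed values chosen for the demonstration, and the sample variance of $X - Y$ is compared with $16 + 9 = 25$.

```python
import random
import statistics

def simulate_difference_variance(n_draws=50_000, seed=1):
    """Estimate the mean and variance of X - Y for independent
    X ~ N(68, 16) and Y ~ N(50, 9) (assumed illustrative parameters)."""
    rng = random.Random(seed)
    # random.gauss takes the standard deviation, not the variance.
    diffs = [rng.gauss(68, 4) - rng.gauss(50, 3) for _ in range(n_draws)]
    return statistics.mean(diffs), statistics.variance(diffs)

mean_d, var_d = simulate_difference_variance()
print(f"mean of X - Y     is about {mean_d:.2f} (theory: 18)")
print(f"variance of X - Y is about {var_d:.2f} (theory: 16 + 9 = 25, not 16 - 9 = 7)")
```

The simulated variance should land close to $25$ and nowhere near $7$, which is what the squared coefficient $(-1)^2$ predicts.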

Why does the variance carry squared coefficients while the mean does not? Recall that for any random variable $V$ , $\operatorname{Var}(V) = E[(V - E(V))^2]$ . For $V = aX + bY$ , expanding the square gives, for any $X$ and $Y$ ,

$\operatorname{Var}(aX + bY) = a^2\operatorname{Var}(X) + b^2\operatorname{Var}(Y) + 2ab\,\operatorname{Cov}(X,Y),$

and independence forces $\operatorname{Cov}(X,Y) = 0$ , leaving $a^2\sigma_X^2 + b^2\sigma_Y^2$ . The squaring is structural — it comes from squaring the deviation — which is exactly why a sign on $b$ cannot make a variance shrink. (If $X$ and $Y$ were not independent, the covariance term would survive; every result in this lesson assumes independence so that it vanishes.)
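For readers who want the expansion written out, here is one way to set it down, using $\mu_X = E(X)$ and $\mu_Y = E(Y)$ as in the table above. Since $E(aX + bY) = a\mu_X + b\mu_Y$ , the deviation of $aX + bY$ from its mean is $a(X - \mu_X) + b(Y - \mu_Y)$ , so

$\operatorname{Var}(aX + bY) = E\!\left[\big(a(X - \mu_X) + b(Y - \mu_Y)\big)^2\right] = a^2E\!\left[(X - \mu_X)^2\right] + b^2E\!\left[(Y - \mu_Y)^2\right] + 2ab\,E\!\left[(X - \mu_X)(Y - \mu_Y)\right].$

The three expectations on the right are, by definition, $\operatorname{Var}(X)$ , $\operatorname{Var}(Y)$ and $\operatorname{Cov}(X,Y)$ , which gives the formula quoted above.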

Sum of $n$ copies versus scaling one variable

A second trap is conflating two operations that share a mean but differ in variance:

Operation	Distribution
$X_1 + \cdots + X_n$ (sum of $n$ independent copies of $X$ )	$N(n\mu,\ n\sigma^2)$
$nX$ (one variable multiplied by $n$ )	$N(n\mu,\ n^2\sigma^2)$

The sum adds $n$ independent variances ( $n\sigma^2$ ); scaling multiplies a single variable by $n$ , and the constant comes out squared ( $n^2\sigma^2$ ). Concretely, for $X \sim N(5,3)$ : $X_1 + X_2 \sim N(10, 6)$ but $2X \sim N(10, 12)$ .
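The same contrast can be checked with Python's standard-library `statistics.NormalDist`, whose arithmetic happens to mirror the two operations: adding two `NormalDist` objects treats them as independent, while multiplying by a constant rescales a single variable. This is a minimal sketch using the $N(5,3)$ example above; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# NormalDist takes (mean, standard deviation), so N(5, 3) needs sqrt(3).
X = NormalDist(5, sqrt(3))

# Adding two NormalDist objects assumes independence, so this models
# X1 + X2 (two independent copies), not 2X.
sum_of_two = X + X

# Multiplying by a constant rescales one variable: this is 2X.
doubled = 2 * X

print(f"X1 + X2: mean {sum_of_two.mean:.0f}, variance {sum_of_two.variance:.0f}")
print(f"2X     : mean {doubled.mean:.0f}, variance {doubled.variance:.0f}")
```

Both have mean $10$ , but the variances come out as $6$ and $12$ respectively, matching $n\sigma^2$ and $n^2\sigma^2$ with $n = 2$ .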

Worked examples (with mark scheme)

Example 1 — difference of two independent normals

The weight of a large egg is $X \sim N(68, 16)$ g and of a small egg $Y \sim N(50, 9)$ g, independently. Find the probability that a large egg weighs more than $15$ g more than a small egg.

$X - Y \sim N(68 - 50,\ 16 + 9) = N(18, 25). \quad (\text{M1 mean; M1 variance added; A1 distribution})$

$P(X - Y > 15) = P\!\left(Z > \frac{15 - 18}{\sqrt{25}}\right) = P(Z > -0.6). \quad (\text{M1 standardise})$

$= P(Z < 0.6) = \Phi(0.6) = 0.7257. \quad (\text{A1, 4 s.f.})$

(M1 for $\mu_X - \mu_Y = 18$ ; M1 for adding variances to get $25$ ; A1 for stating $N(18,25)$ ; M1 standardising; A1 for $0.7257$ . By symmetry $P(Z>-0.6) = \Phi(0.6)$ .)
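As a numerical cross-check of Example 1 (not a substitute for the written method an examiner expects), the calculation can be reproduced with `statistics.NormalDist`. Subtracting two `NormalDist` objects assumes independence and adds the variances, exactly as in the table of combination rules.

```python
from statistics import NormalDist

large = NormalDist(68, 4)   # N(68, 16): standard deviation 4
small = NormalDist(50, 3)   # N(50, 9): standard deviation 3

diff = large - small        # independent difference, so variances add
p = 1 - diff.cdf(15)        # P(X - Y > 15)

print(f"X - Y ~ N({diff.mean:.0f}, {diff.variance:.0f})")
print(f"P(X - Y > 15) = {p:.4f}")
```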

Example 2 — sum of identical normals

Packets of flour weigh $X \sim N(1005, 100)$ g independently. A box holds $12$ packets. Find $P(\text{total} > 12100\,\text{g})$ .

$T = X_1 + \cdots + X_{12} \sim N(12 \times 1005,\ 12 \times 100) = N(12060, 1200). \quad (\text{M1 mean }12\mu;\ \text{M1 variance }12\sigma^2;\ \text{A1})$

$P(T > 12100) = P\!\left(Z > \frac{12100 - 12060}{\sqrt{1200}}\right) = P\!\left(Z > \frac{40}{34.641}\right) = P(Z > 1.155). \quad (\text{M1 standardise})$

$= 1 - \Phi(1.155) = 1 - 0.8759 = 0.1241. \quad (\text{A1})$

(M1 for $12 \times 1005$ ; M1 for $12 \times 100$ — not $12^2 \times 100$ ; A1 for $N(12060,1200)$ ; M1 standardising with $\sqrt{1200}$ ; A1 for $0.124$ . The classic slip is squaring the $12$ on the variance.)
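To see how much the classic slip matters, the sketch below computes the tail probability twice: once with the correct variance $12 \times 100$ and once with the mistaken $12^2 \times 100$ . The helper name `upper_tail` is an illustrative choice, not part of any library.

```python
from math import sqrt
from statistics import NormalDist

mu, var, n, threshold = 1005, 100, 12, 12100

def upper_tail(mean, variance, x):
    """P(T > x) for T ~ N(mean, variance)."""
    return 1 - NormalDist(mean, sqrt(variance)).cdf(x)

p_correct = upper_tail(n * mu, n * var, threshold)     # variance 12 * 100
p_slip = upper_tail(n * mu, n**2 * var, threshold)     # variance 12^2 * 100 (wrong)

print(f"correct (variance {n * var}):  {p_correct:.4f}")
print(f"slip (variance {n**2 * var}): {p_slip:.4f}")
```

The mistaken variance inflates the answer from about $0.124$ to about $0.369$ , so the error is far from cosmetic.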

Example 3 — a mixed linear combination

A recipe uses $2$ eggs and $1$ cup of flour. With $X \sim N(68,16)$ (egg, g) and $W \sim N(120, 25)$ (a cup of flour, g), all independent, find the distribution of the combined weight $C = X_1 + X_2 + W$ and $P(C < 250)$ .

$C \sim N(68 + 68 + 120,\ 16 + 16 + 25) = N(256, 57). \quad (\text{M1 mean; M1 variance; A1})$

$P(C < 250) = P\!\left(Z < \frac{250 - 256}{\sqrt{57}}\right) = P(Z < -0.7947) = 1 - \Phi(0.795) = 0.2134. \quad (\text{M1; A1})$

(M1/M1/A1 for the combined $N(256,57)$ — two egg variances plus the flour variance, all added; M1 standardising; A1 for $0.213$ .)
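The general rule for $aX + bY$ extends to any number of independent terms, and it can be sketched as a small helper. The function `linear_combination` and its `(coefficient, mean, variance)` input format are hypothetical conveniences introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist

def linear_combination(terms):
    """Distribution of sum(c_i * X_i) for independent normal X_i.

    `terms` is a list of (coefficient, mean, variance) triples.
    """
    mean = sum(c * m for c, m, _ in terms)
    variance = sum(c**2 * v for c, _, v in terms)   # coefficients are squared
    return NormalDist(mean, sqrt(variance))

egg = (1, 68, 16)      # one egg: N(68, 16)
flour = (1, 120, 25)   # one cup of flour: N(120, 25)

# Two independent eggs plus one cup of flour, as in Example 3.
C = linear_combination([egg, egg, flour])
p = C.cdf(250)

print(f"C ~ N({C.mean:.0f}, {C.variance:.0f}), P(C < 250) = {p:.4f}")
```

Passing a coefficient of $-1$ reproduces the difference rule: `linear_combination([(1, 68, 16), (-1, 50, 9)])` has mean $18$ and variance $25$ .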

Example 4 — distinguishing a sum from a scalar multiple

A bag of sugar weighs $X \sim N(1000, 25)$ g. (a) A pallet holds $50$ bags; find the distribution of the total weight. (b) A wholesaler quotes weights in units of $50$ bags by scaling one bag's weight by $50$ ; find the distribution of $50X$ . (c) Explain why the variances differ.

(a) Sum of $50$ independent bags:

$T = X_1 + \cdots + X_{50} \sim N(50 \times 1000,\ 50 \times 25) = N(50000,\ 1250). \quad (\text{M1 mean; A1 variance } 50\sigma^2)$

(b) Scaling a single bag:

$50X \sim N(50 \times 1000,\ 50^2 \times 25) = N(50000,\ 62500). \quad (\text{M1 mean; A1 variance } 50^2\sigma^2)$

(c) The means agree ( $50000$ g), but the variances differ by a factor of $50$ : the sum pools $50$ independent fluctuations that partly cancel, whereas $50X$ magnifies one bag's fluctuation fifty-fold with no cancellation. (M1/A1 each part; B1 for the cancellation explanation. The real pallet is the sum, with the much smaller variance $1250$ — modelling it as $50X$ would wildly overstate the variability.)

The distribution of the sample mean

If $X_1, \ldots, X_n$ are independent with each $X_i \sim N(\mu, \sigma^2)$ , then the sample mean $\bar X = \frac1n\sum X_i$ is a scaled sum of normals, hence normal. Its parameters follow from the rules:

$E(\bar X) = \frac1n\sum E(X_i) = \frac{n\mu}{n} = \mu, \qquad \operatorname{Var}(\bar X) = \frac{1}{n^2}\sum \operatorname{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.$

$\boxed{\ \bar X \sim N\!\left(\mu,\ \frac{\sigma^2}{n}\right)\ }$

The variance $\sigma^2/n$ shrinks as $n$ grows: larger samples give more reliable estimates of $\mu$ . The quantity $\sigma/\sqrt n$ is the standard error of the mean — the standard deviation of $\bar X$ , and the denominator in every $z$ - and $t$ -statistic to come.

The $\sqrt n$ in the standard error, rather than $n$ , is one of the most important features of statistics and deserves emphasis. Because the variance falls as $1/n$ , the standard deviation of $\bar X$ falls only as $1/\sqrt n$ : to halve the spread of the sample mean you must quadruple the sample. This "diminishing returns" law explains why very precise estimates are expensive — doubling precision costs four times the data — and it recurs in the width of confidence intervals and the power of tests in the lessons that follow. It is worth distinguishing the standard error $\sigma/\sqrt n$ (the variability of the mean) from the population standard deviation $\sigma$ (the variability of a single observation): the two are constantly confused, yet they answer entirely different questions.
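The diminishing-returns law is easy to tabulate. The sketch below assumes an illustrative population standard deviation of $\sigma = 8$ and prints the standard error for sample sizes that grow fourfold each time.

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard deviation of the mean of n independent observations."""
    return sigma / sqrt(n)

sigma = 8  # assumed population standard deviation, for illustration only
for n in (1, 4, 16, 64):
    print(f"n = {n:2d}: SE = {standard_error(sigma, n):.1f}")
```

Each fourfold increase in $n$ halves the standard error ( $8, 4, 2, 1$ ), which is the "quadruple the sample to halve the spread" rule in numbers.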

Worked example — probability for a sample mean

Adult heights follow $N(170, 64)$ cm. A random sample of $16$ is taken. Find $P(\bar X > 173)$ .

$\bar X \sim N\!\left(170,\ \frac{64}{16}\right) = N(170, 4), \quad \text{so SE} = \sqrt 4 = 2.$

$P(\bar X > 173) = P\!\left(Z > \frac{173 - 170}{2}\right) = P(Z > 1.5) = 1 - 0.9332 = 0.0668.$

The same threshold for a single adult would give $P(X > 173) = P(Z > 3/8) = P(Z > 0.375) = 0.354$ — much larger, because a single observation is far more variable than an average of sixteen.
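The comparison between a single observation and a mean of sixteen can be checked numerically. This is a minimal sketch of the heights example; only the variable names are new.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 170, 8, 16

single = NormalDist(mu, sigma)                 # one adult: N(170, 64)
mean_of_16 = NormalDist(mu, sigma / sqrt(n))   # sample mean: N(170, 4)

p_mean = 1 - mean_of_16.cdf(173)
p_single = 1 - single.cdf(173)

print(f"P(sample mean > 173) = {p_mean:.4f}")
print(f"P(single adult > 173) = {p_single:.4f}")
```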

Worked example — a reverse sample-size problem

The diameters of ball bearings are $N(10, 0.04)$ mm. A quality engineer wants the sample mean of $n$ bearings to lie within $0.05$ mm of $10$ with probability at least $0.95$ . Find the smallest $n$ .

The mean satisfies $\bar X \sim N(10, 0.04/n)$ , with standard error $0.2/\sqrt n$ . Standardising, "within $0.05$ with probability at least $0.95$ " requires the half-width $0.05$ to be at least $1.96$ standard errors:

$P(|\bar X - 10| < 0.05) \ge 0.95 \;\Rightarrow\; \frac{0.05}{0.2/\sqrt n} \ge 1.96. \quad (\text{M1 set-up; M1 use } 1.96)$

$\frac{0.05\sqrt n}{0.2} \ge 1.96 \;\Rightarrow\; \sqrt n \ge \frac{1.96 \times 0.2}{0.05} = 7.84 \;\Rightarrow\; n \ge 61.47. \quad (\text{M1 rearrange})$

So the smallest sample size is $n = 62$ (round up). (M1 relating the half-width to the SE; M1 the $1.96$ for $95\%$ ; M1 solving for $n$ ; A1 for $62$ . This is the sampling-distribution counterpart of the confidence-interval sample-size formula in the next lessons.)
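The rearrangement generalises to any $\sigma$ , half-width and probability. The helper below is a sketch under that assumption; `smallest_n` is a hypothetical name, and it uses the exact critical value from `NormalDist().inv_cdf` rather than the tabulated $1.96$ .

```python
from math import ceil
from statistics import NormalDist

def smallest_n(sigma, half_width, probability=0.95):
    """Smallest n with P(|Xbar - mu| < half_width) >= probability,
    for independent N(mu, sigma^2) observations."""
    # Two-sided critical value: about 1.96 when probability = 0.95.
    z = NormalDist().inv_cdf((1 + probability) / 2)
    # Require half_width >= z * sigma / sqrt(n), then round up.
    return ceil((z * sigma / half_width) ** 2)

n = smallest_n(sigma=0.2, half_width=0.05)
print(f"smallest n = {n}")
```

For the ball-bearing problem this returns $62$ , agreeing with the hand calculation. Doubling the allowed half-width to $0.1$ mm cuts the requirement to $16$ , roughly a quarter.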

The Central Limit Theorem (CLT)

The combination rules above assumed the $X_i$ were already normal. The Central Limit Theorem removes that assumption for the sample mean: for a random sample of $n$ independent observations from any population with finite mean $\mu$ and variance $\sigma^2$ ,

$\bar X \;\approx\; N\!\left(\mu,\ \frac{\sigma^2}{n}\right) \quad \text{for large } n,$

the approximation improving as $n$ increases (a working rule of thumb is $n \ge 30$ , sooner if the population is roughly symmetric). This is why normal-based inference is so widely applicable: even when the underlying data are skewed or discrete, the average behaves normally once the sample is reasonably large.
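A simulation gives a feel for the theorem. The sketch below assumes an Exponential(1) population (strongly right-skewed, with $\mu = 1$ and $\sigma = 1$ ), a sample size of $n = 30$ , and arbitrary choices of seed and number of repetitions; none of these come from the lesson itself.

```python
import random
import statistics
from math import sqrt

def sample_means(n, reps, seed=0):
    """Means of `reps` independent samples of size n from Exponential(1)."""
    rng = random.Random(seed)
    return [statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(reps)]

n, reps = 30, 20_000
means = sample_means(n, reps)

centre = statistics.fmean(means)   # CLT predicts about mu = 1
spread = statistics.stdev(means)   # CLT predicts about sigma / sqrt(n)

# Proportion of sample means within 1.96 standard errors of mu;
# a normal approximation predicts roughly 0.95.
limit = 1.96 / sqrt(n)
within = sum(abs(m - 1.0) < limit for m in means) / reps

print(f"mean of sample means: {centre:.3f} (theory 1)")
print(f"sd of sample means:   {spread:.3f} (theory {1 / sqrt(n):.3f})")
print(f"within 1.96 SE:       {within:.3f} (normal approximation 0.95)")
```

Even though individual exponential observations are far from normal, the simulated sample means should centre near $1$ with spread near $1/\sqrt{30}$ , and about $95\%$ of them should fall within $1.96$ standard errors of $\mu$ .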
