The normal distribution is the keystone of statistical inference, and almost every test and interval in the rest of this course rests on one fact: any linear combination of independent normal variables is itself normal. This lesson develops that result carefully (sums, differences and scalar multiples), then specialises it to the distribution of the sample mean $\bar{X}$, and closes with the Central Limit Theorem, which extends normal-based inference even to populations that are not themselves normal.
This is Paper 3 Statistics (7367/3S) content (Paper 3: 2 h, 100 marks, AO1 40% / AO2 25% / AO3 35%). It sits at the heart of the option because the results here are the engine of every later inference lesson: the t-test, confidence intervals and hypothesis tests all begin "since $\bar{X}\sim N(\mu,\sigma^2/n)$ …". The work is predominantly AO1 (combine distributions, standardise, read normal tables) with strong AO2 (state the correct combined distribution and justify the variance-addition rule) and AO3 in multi-stage worded problems. It builds on the A-Level Maths Statistics normal distribution and standardisation $Z=(X-\mu)/\sigma$, and on the algebra of $E$ and $\operatorname{Var}$ from Statistics 1.
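If $X\sim N(\mu,\sigma^2)$ then its density is the familiar bell curve
$$f(x)=\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}},$$
and we standardise with $Z=\dfrac{X-\mu}{\sigma}\sim N(0,1)$, reading probabilities as $\Phi(z)=P(Z\le z)$ from tables. Two algebraic facts about any random variables drive everything below:
$$E(aX+bY)=aE(X)+bE(Y)\quad\text{(always)},\qquad \operatorname{Var}(aX+bY)=a^2\operatorname{Var}(X)+b^2\operatorname{Var}(Y)\quad\text{(if $X,Y$ independent)}.$$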
The squared coefficients in the variance rule are the source of the topic's defining trap. The extra ingredient for normal variables is the closure result: a linear combination of independent normals is again normal, so we know not just its mean and variance but its whole distribution. This is special to the normal family — a linear combination of, say, two uniform variables is not uniform (it is triangular or trapezoidal). The normal is "closed" under addition, which is precisely why it is the natural distribution for aggregates and averages, and why the entire edifice of normal-based inference is so widely usable: once we know a quantity is a sum or average of independent normals, finding any probability about it reduces to a single standardisation.
| Combination (independent $X$, $Y$) | Distribution |
|---|---|
| $aX+b$ | $N(a\mu_X+b,\ a^2\sigma_X^2)$ |
| $X+Y$ | $N(\mu_X+\mu_Y,\ \sigma_X^2+\sigma_Y^2)$ |
| $X-Y$ | $N(\mu_X-\mu_Y,\ \sigma_X^2+\sigma_Y^2)$ |
| $aX+bY$ | $N(a\mu_X+b\mu_Y,\ a^2\sigma_X^2+b^2\sigma_Y^2)$ |
The critical line: for $X-Y$ the variances are added, not subtracted. Subtracting variables widens the spread (two sources of uncertainty combine), so the variance grows. Using $b=-1$ in the rule gives $(-1)^2\sigma_Y^2=+\sigma_Y^2$: the minus sign squares away.
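The rules in the table can be checked numerically. The sketch below is only an illustration: the helper `combine` and the values $X\sim N(10,4)$, $Y\sim N(7,9)$ are assumptions made up for the example, not part of the lesson. Note that Python's `statistics.NormalDist` takes the standard deviation, not the variance, as its second argument.

```python
from statistics import NormalDist


def combine(a, x, b, y):
    """Distribution of a*X + b*Y for independent normal X and Y.

    Hypothetical helper written for this illustration.
    """
    mean = a * x.mean + b * y.mean
    variance = a**2 * x.variance + b**2 * y.variance  # coefficients are squared
    return NormalDist(mean, variance ** 0.5)


# Illustrative values (assumed): X ~ N(10, 4), Y ~ N(7, 9)
X = NormalDist(10, 2)  # second argument is the standard deviation
Y = NormalDist(7, 3)

total = combine(1, X, 1, Y)   # X + Y
diff = combine(1, X, -1, Y)   # X - Y

# The means differ (17 and 3) but the variances agree (both about 13)
print(total.mean, total.variance)
print(diff.mean, diff.variance)
```

Both combinations end up with the same variance because the coefficient $-1$ is squared before it multiplies $\sigma_Y^2$.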
Why does the variance carry squared coefficients while the mean does not? Recall $\operatorname{Var}(Z)=E[(Z-E(Z))^2]$. For $Z=aX+bY$, expanding the square gives
$$\operatorname{Var}(aX+bY)=a^2\operatorname{Var}(X)+b^2\operatorname{Var}(Y)+2ab\operatorname{Cov}(X,Y),$$
and independence forces $\operatorname{Cov}(X,Y)=0$, leaving $a^2\sigma_X^2+b^2\sigma_Y^2$. The squaring is structural (it comes from squaring the deviation), which is exactly why a sign on $b$ cannot make a variance shrink. (If $X$ and $Y$ were not independent, the covariance term would survive; every result in this lesson assumes independence so that it vanishes.)
A second trap is conflating two operations that share a mean but differ in variance:
| Operation | Distribution |
|---|---|
| $X_1+\cdots+X_n$ (sum of $n$ independent copies of $X$) | $N(n\mu,\ n\sigma^2)$ |
| $nX$ (one variable multiplied by $n$) | $N(n\mu,\ n^2\sigma^2)$ |
The sum adds $n$ independent variances ($n\sigma^2$); scaling multiplies a single variable by $n$, and the constant comes out squared ($n^2\sigma^2$). Concretely, for $X\sim N(5,3)$: $X_1+X_2\sim N(10,6)$ but $2X\sim N(10,12)$.
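The contrast between a sum and a multiple can be reproduced with Python's `statistics.NormalDist`, whose `+` operator assumes the two operands are independent. This is a minimal sketch; the variable names are chosen for the illustration.

```python
from math import sqrt
from statistics import NormalDist

# Two independent copies of X ~ N(5, 3); NormalDist takes the standard deviation
X1 = NormalDist(5, sqrt(3))
X2 = NormalDist(5, sqrt(3))

sum_of_two = X1 + X2  # adding two NormalDist objects treats them as independent
doubled = 2 * X1      # scaling one variable multiplies its standard deviation by 2

print(sum_of_two.mean, round(sum_of_two.variance, 6))  # mean 10, variance 6
print(doubled.mean, round(doubled.variance, 6))        # mean 10, variance 12
```

Because `+` always assumes independence, writing `X1 + X1` would also return variance 6; the object has no way of knowing both operands are the same variable, so a multiple has to be written as `2 * X1`.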
The weight of a large egg is $X\sim N(68,16)$ g and of a small egg $Y\sim N(50,9)$ g, independently. Find the probability that a large egg weighs more than 15 g more than a small egg.
$$X-Y\sim N(68-50,\ 16+9)=N(18,25).\quad\text{(M1 mean; M1 variance added; A1 distribution)}$$
$$P(X-Y>15)=P\!\left(Z>\frac{15-18}{\sqrt{25}}\right)=P(Z>-0.6).\quad\text{(M1 standardise)}$$
$$=P(Z<0.6)=\Phi(0.6)=0.7257.\quad\text{(A1, 4 s.f.)}$$
(M1 for $\mu_X-\mu_Y=18$; M1 for adding variances to get 25; A1 for stating $N(18,25)$; M1 standardising; A1 for 0.7257. By symmetry $P(Z>-0.6)=\Phi(0.6)$.)
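As a cross-check on the table lookup, the same probability can be computed with `statistics.NormalDist`. This is a sketch rather than part of the mark scheme, and the variable names are assumptions.

```python
from statistics import NormalDist

large = NormalDist(68, 4)   # variance 16, so standard deviation 4
small = NormalDist(50, 3)   # variance 9, so standard deviation 3

difference = large - small  # independent normals: N(18, 25)
p = 1 - difference.cdf(15)  # P(X - Y > 15)

print(difference.mean, difference.stdev)  # 18 and 5.0
print(round(p, 4))                        # 0.7257
```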
Packets of flour weigh $X\sim N(1005,100)$ g independently. A box holds 12 packets. Find $P(\text{total}>12100\text{ g})$.
$$T=X_1+\cdots+X_{12}\sim N(12\times1005,\ 12\times100)=N(12060,1200).\quad\text{(M1 mean $12\mu$; M1 variance $12\sigma^2$; A1)}$$
$$P(T>12100)=P\!\left(Z>\frac{12100-12060}{\sqrt{1200}}\right)=P\!\left(Z>\frac{40}{34.641}\right)=P(Z>1.155).\quad\text{(M1 standardise)}$$
$$=1-\Phi(1.155)=1-0.8759=0.1241.\quad\text{(A1)}$$
(M1 for $12\times1005$; M1 for $12\times100$, not $12^2\times100$; A1 for $N(12060,1200)$; M1 standardising with $\sqrt{1200}$; A1 for 0.124. The classic slip is squaring the 12 on the variance.)
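A short numerical sketch of this example, including the effect of the classic slip, is given below. The variable names are assumptions for the illustration; the "wrong" model is included only to show how far the answer moves.

```python
from math import sqrt
from statistics import NormalDist

n, mu, var = 12, 1005, 100

# Correct model: sum of 12 independent packets, variance 12 * 100
total = NormalDist(n * mu, sqrt(n * var))
p = 1 - total.cdf(12100)

# Classic slip: squaring the 12, which treats the box as 12 * X
wrong_total = NormalDist(n * mu, sqrt(n**2 * var))
wrong_p = 1 - wrong_total.cdf(12100)

print(round(p, 3))        # about 0.124, matching the worked answer
print(round(wrong_p, 3))  # noticeably larger, because the spread is overstated
```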
A recipe uses 2 eggs and 1 cup of flour. With $X\sim N(68,16)$ (egg, g) and $W\sim N(120,25)$ (a cup of flour, g), all independent, find the distribution of the combined weight $C=X_1+X_2+W$ and $P(C<250)$.
$$C\sim N(68+68+120,\ 16+16+25)=N(256,57).\quad\text{(M1 mean; M1 variance; A1)}$$
$$P(C<250)=P\!\left(Z<\frac{250-256}{\sqrt{57}}\right)=P(Z<-0.7947)=1-\Phi(0.795)=0.2134.\quad\text{(M1; A1)}$$
(M1/M1/A1 for the combined $N(256,57)$: two egg variances plus the flour variance, all added; M1 standardising; A1 for 0.213.)
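The three-term combination can be checked the same way. In this sketch the two eggs are created as separate `NormalDist` objects to make their independence explicit; the names are assumptions.

```python
from statistics import NormalDist

egg_1 = NormalDist(68, 4)    # variance 16
egg_2 = NormalDist(68, 4)    # a second, independent egg
flour = NormalDist(120, 5)   # variance 25

combined = egg_1 + egg_2 + flour  # each + assumes independence
p = combined.cdf(250)             # P(C < 250)

print(combined.mean, round(combined.variance, 6))  # 256 and 57
print(round(p, 3))                                 # 0.213
```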
A bag of sugar weighs $X\sim N(1000,25)$ g. (a) A pallet holds 50 bags; find the distribution of the total weight. (b) A wholesaler quotes weights in units of 50 bags by scaling one bag's weight by 50; find the distribution of $50X$. (c) Explain why the variances differ.
(a) Sum of 50 independent bags:
$$T=X_1+\cdots+X_{50}\sim N(50\times1000,\ 50\times25)=N(50000,\ 1250).\quad\text{(M1 mean; A1 variance $50\sigma^2$)}$$
(b) Scaling a single bag:
$$50X\sim N(50\times1000,\ 50^2\times25)=N(50000,\ 62500).\quad\text{(M1 mean; A1 variance $50^2\sigma^2$)}$$
(c) The means agree (50000 g), but the variances differ by a factor of 50: the sum pools 50 independent fluctuations that partly cancel, whereas $50X$ magnifies one bag's fluctuation fifty-fold with no cancellation. (M1/A1 each part; B1 for the cancellation explanation. The real pallet is the sum, with the much smaller variance 1250; modelling it as $50X$ would wildly overstate the variability.)
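The cancellation argument in (c) can also be seen empirically. The following is a rough simulation sketch, not an exact calculation: the seed, the number of trials and the variable names are arbitrary assumptions, and the sample variances will only be close to 1250 and 62500, not equal to them.

```python
import random
from statistics import variance

rng = random.Random(0)  # fixed seed so the run is repeatable
MU, SIGMA, BAGS, TRIALS = 1000, 5, 50, 5000

# (a) total weight of 50 independently drawn bags
sums = [sum(rng.gauss(MU, SIGMA) for _ in range(BAGS)) for _ in range(TRIALS)]

# (b) one bag's weight multiplied by 50
scaled = [BAGS * rng.gauss(MU, SIGMA) for _ in range(TRIALS)]

var_sum = variance(sums)       # should be near 50 * 25 = 1250
var_scaled = variance(scaled)  # should be near 50**2 * 25 = 62500

print(round(var_sum), round(var_scaled))
```

In the summed totals a heavy bag is usually offset by a light one, which is why the first figure is so much smaller than the second.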
If $X_1,\dots,X_n$ are independent with each $X_i\sim N(\mu,\sigma^2)$, then the sample mean $\bar{X}=\frac{1}{n}\sum X_i$ is a scaled sum of normals, hence normal. Its parameters follow from the rules:
$$E(\bar{X})=\frac{1}{n}\sum E(X_i)=\frac{n\mu}{n}=\mu,\qquad \operatorname{Var}(\bar{X})=\frac{1}{n^2}\sum\operatorname{Var}(X_i)=\frac{n\sigma^2}{n^2}=\frac{\sigma^2}{n}.$$
$$\bar{X}\sim N\!\left(\mu,\ \frac{\sigma^2}{n}\right)$$
The variance $\sigma^2/n$ shrinks as $n$ grows: larger samples give more reliable estimates of $\mu$. The quantity $\sigma/\sqrt{n}$ is the standard error of the mean, that is, the standard deviation of $\bar{X}$, and the denominator in every z- and t-statistic to come.
The $\sqrt{n}$ in the standard error, rather than $n$, is one of the most important features of statistics and deserves emphasis. Because the variance falls as $1/n$, the standard deviation of $\bar{X}$ falls only as $1/\sqrt{n}$: to halve the spread of the sample mean you must quadruple the sample. This "diminishing returns" law explains why very precise estimates are expensive (doubling precision costs four times the data), and it recurs in the width of confidence intervals and the power of tests in the lessons that follow. It is worth distinguishing the standard error $\sigma/\sqrt{n}$ (the variability of the mean) from the population standard deviation $\sigma$ (the variability of a single observation): the two are constantly confused, yet they answer entirely different questions.
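The "quadruple the sample to halve the spread" rule is easy to tabulate. The sketch below uses an assumed population standard deviation of 8 purely for illustration; `standard_error` is a hypothetical helper name.

```python
from math import sqrt


def standard_error(sigma, n):
    """Standard deviation of the sample mean of n independent observations."""
    return sigma / sqrt(n)


sigma = 8  # assumed population standard deviation, for illustration only
errors = {n: standard_error(sigma, n) for n in (16, 64, 256)}

for n, se in errors.items():
    print(n, se)  # 16 -> 2.0, 64 -> 1.0, 256 -> 0.5
```

Each fourfold increase in $n$ halves the standard error, while the population standard deviation stays at 8 throughout.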
Adult heights follow $N(170,64)$ cm. A random sample of 16 is taken. Find $P(\bar{X}>173)$.
$$\bar{X}\sim N\!\left(170,\ \frac{64}{16}\right)=N(170,4),\quad\text{so } \text{SE}=\sqrt{4}=2.$$
$$P(\bar{X}>173)=P\!\left(Z>\frac{173-170}{2}\right)=P(Z>1.5)=1-0.9332=0.0668.$$
The same threshold for a single adult would give $P(X>173)=P(Z>3/8)=P(Z>0.375)=0.354$, which is much larger because a single observation is far more variable than an average of sixteen.
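Both probabilities in this example can be reproduced side by side. This is a minimal sketch with assumed variable names; the only difference between the two models is the standard deviation.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 170, 8, 16  # population variance 64, so sigma = 8

single = NormalDist(mu, sigma)                 # one adult
sample_mean = NormalDist(mu, sigma / sqrt(n))  # mean of 16 adults, SE = 2

p_mean = 1 - sample_mean.cdf(173)
p_single = 1 - single.cdf(173)

print(round(p_mean, 4))    # 0.0668
print(round(p_single, 3))  # 0.354
```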
The diameters of ball bearings are $N(10,0.04)$ mm. A quality engineer wants the sample mean of $n$ bearings to lie within 0.05 mm of 10 with probability 0.95. Find the smallest $n$.
The mean satisfies $\bar{X}\sim N(10,\ 0.04/n)$, with standard error $0.2/\sqrt{n}$. "Within 0.05 with probability 0.95" means the half-width 0.05 must cover at least 1.96 standard errors:
$$P(|\bar{X}-10|<0.05)\ge 0.95\ \Rightarrow\ \frac{0.05}{0.2/\sqrt{n}}\ge 1.96.\quad\text{(M1 set-up; M1 use 1.96)}$$
$$\frac{0.05\sqrt{n}}{0.2}\ge 1.96\ \Rightarrow\ \sqrt{n}\ge\frac{1.96\times 0.2}{0.05}=7.84\ \Rightarrow\ n\ge 61.47.\quad\text{(M1 rearrange)}$$
So the smallest sample size is $n=62$ (round up). (M1 relating the half-width to the SE; M1 the 1.96 for 95%; M1 solving for $n$; A1 for 62. This is the sampling-distribution counterpart of the confidence-interval sample-size formula in the next lessons.)
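The rearrangement generalises to $n\ge (z\sigma/d)^2$, where $d$ is the required half-width. The function below is a sketch under that assumption; `min_sample_size` is a hypothetical name, and it uses `NormalDist().inv_cdf` for the critical value (about 1.96 for 95%) instead of a table.

```python
from math import ceil
from statistics import NormalDist


def min_sample_size(sigma, half_width, probability):
    """Smallest n with P(|sample mean - mu| < half_width) >= probability."""
    z = NormalDist().inv_cdf((1 + probability) / 2)  # two-sided critical value
    return ceil((z * sigma / half_width) ** 2)       # round up to a whole bearing


n = min_sample_size(sigma=0.2, half_width=0.05, probability=0.95)
print(n)  # 62
```

Rounding up rather than to the nearest integer matters: with $n=61$ the half-width covers slightly fewer than 1.96 standard errors, so the probability falls just short of 0.95.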
The combination rules above assumed the $X_i$ were already normal. The Central Limit Theorem removes that assumption for the sample mean: for any population with finite mean $\mu$ and variance $\sigma^2$,
$$\bar{X}\approx N\!\left(\mu,\ \frac{\sigma^2}{n}\right)\quad\text{for large } n,$$
the approximation improving as n increases (a working rule of thumb is n≥30, sooner if the population is roughly symmetric). This is why normal-based inference is so widely applicable: even when the underlying data are skewed or discrete, the average behaves normally once the sample is reasonably large.
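A small simulation can make the theorem concrete. The sketch below is illustrative only: it assumes an exponential population with mean 1 and variance 1 (strongly skewed), samples of size 30, and an arbitrary seed and trial count, none of which come from the lesson. If the approximation is reasonable, the sample means should have mean near 1, variance near $1/30$, and roughly 95% of them should fall within 1.96 standard errors of 1.

```python
import random
from math import sqrt
from statistics import mean, variance

rng = random.Random(1)  # fixed seed so the run is repeatable
N, TRIALS = 30, 10000   # sample size and number of simulated samples

# Each entry is the mean of N draws from a skewed exponential population
sample_means = [
    sum(rng.expovariate(1.0) for _ in range(N)) / N for _ in range(TRIALS)
]

centre = mean(sample_means)      # should be near mu = 1
spread = variance(sample_means)  # should be near sigma^2 / n = 1/30

se = 1 / sqrt(N)
coverage = sum(abs(m - 1) < 1.96 * se for m in sample_means) / TRIALS

print(round(centre, 3), round(spread, 4), round(coverage, 3))
```

The individual draws are far from normal, yet the averages behave approximately as $N(1,\ 1/30)$ predicts.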