Real problems rarely involve a single random quantity in isolation: a total assembly time is a sum of stage times, a profit is a difference of revenue and cost, a sample mean is a scaled sum of observations. This lesson establishes the rules for the mean and variance of linear combinations aX+bY, the crucial role of independence (variances add, with squared coefficients), the special closure of the normal family under linear combination, and the distribution of the sample mean and sample variance — the bridge to all of statistical inference.
This is Paper 3 optional content: Statistics (7367/3S), taken with Mechanics (7367/3M) or Discrete (7367/3D). Paper 3 is 2 hours, 100 marks, AO1 40% / AO2 25% / AO3 35%. Applying $E(aX+bY)$ and $\operatorname{Var}(aX+bY)$ is AO1; the proofs (e.g. why $\operatorname{Var}(X-Y)$ adds variances, why $S^2$ needs $n-1$) are AO2; a multi-stage modelling problem is AO3. It builds directly on A-Level Maths Statistics (mean and variance of a single variable; the normal distribution) and dovetails with the previous lesson (PGFs prove the distributional additivity; this lesson does the moment bookkeeping).
For any random variables X and Y (not necessarily independent):
| Result | Formula |
|---|---|
| $E(aX+bY)$ | $aE(X)+bE(Y)$ |
| $\operatorname{Var}(aX+bY)$ | $a^2\operatorname{Var}(X)+b^2\operatorname{Var}(Y)+2ab\operatorname{Cov}(X,Y)$ |
If X and Y are independent, then Cov(X,Y)=0:
$$\operatorname{Var}(aX+bY)=a^2\operatorname{Var}(X)+b^2\operatorname{Var}(Y).$$
Why the mean is linear but the variance is not. The expectation is a linear operator: E(aX+bY)=aE(X)+bE(Y) holds for any X,Y, independent or not, because integration/summation is linear. The variance is quadratic: expanding the definition,
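$$\operatorname{Var}(aX+bY)=E\left[(aX+bY-a\mu_X-b\mu_Y)^2\right]=E\left[\bigl(a(X-\mu_X)+b(Y-\mu_Y)\bigr)^2\right],$$
and multiplying out the square gives $a^2\operatorname{Var}(X)+b^2\operatorname{Var}(Y)+2ab\operatorname{Cov}(X,Y)$. The cross term $2ab\operatorname{Cov}(X,Y)$ vanishes only when $\operatorname{Cov}(X,Y)=0$ (e.g. independence), and the coefficients appear squared because the square of $aX$ is $a^2X^2$. This single expansion explains every special case below, including the crucial $\operatorname{Var}(X-Y)=\operatorname{Var}(X)+\operatorname{Var}(Y)$: here $a=1$, $b=-1$, so $b^2=(-1)^2=+1$ multiplies $\operatorname{Var}(Y)$, and the minus sign is squared away.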
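For readers who want the multiplying-out step written in full, the standard expansion can be sketched as follows. It uses only the symbols already defined in this lesson ($\mu_X=E(X)$, $\mu_Y=E(Y)$) and the linearity of expectation to take $E$ term by term:

$$
\begin{aligned}
\bigl(a(X-\mu_X)+b(Y-\mu_Y)\bigr)^2 &= a^2(X-\mu_X)^2+b^2(Y-\mu_Y)^2+2ab(X-\mu_X)(Y-\mu_Y),\\
\operatorname{Var}(aX+bY) &= a^2E\left[(X-\mu_X)^2\right]+b^2E\left[(Y-\mu_Y)^2\right]+2ab\,E\left[(X-\mu_X)(Y-\mu_Y)\right]\\
&= a^2\operatorname{Var}(X)+b^2\operatorname{Var}(Y)+2ab\operatorname{Cov}(X,Y).
\end{aligned}
$$

Setting $a=1$, $b=-1$ with $\operatorname{Cov}(X,Y)=0$ recovers $\operatorname{Var}(X-Y)=\operatorname{Var}(X)+\operatorname{Var}(Y)$.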
| Combination | E | Var (independent) |
|---|---|---|
| X+Y | E(X)+E(Y) | Var(X)+Var(Y) |
| X−Y | E(X)−E(Y) | Var(X)+Var(Y) |
| 3X | 3E(X) | 9Var(X) |
| X+5 | E(X)+5 | Var(X) |
Exam Tip: The variance of X−Y is Var(X)+Var(Y) (plus, not minus) when X and Y are independent. This is tested frequently and is a common source of errors.
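If $X_1, X_2, \dots, X_n$ are i.i.d. (independent and identically distributed) with mean $\mu$ and variance $\sigma^2$, then by linearity of expectation and the additivity of variance for independent variables:
$$E\left(\sum_{i=1}^{n}X_i\right)=\sum_{i=1}^{n}E(X_i)=n\mu,\qquad \operatorname{Var}\left(\sum_{i=1}^{n}X_i\right)=\sum_{i=1}^{n}\operatorname{Var}(X_i)=n\sigma^2.$$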
The mean scales by $n$ and so does the variance, but the standard deviation scales only by $\sqrt{n}$ (since $\sqrt{n\sigma^2}=\sigma\sqrt{n}$). Contrast this sharply with the single scaled variable $nX$, where $\operatorname{Var}(nX)=n^2\sigma^2$ and the standard deviation scales by the full factor $n$. The difference is that $\sum X_i$ adds $n$ independent fluctuations (which partly cancel), whereas $nX$ magnifies a single fluctuation $n$-fold.
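The contrast between a sum of independent copies and a single scaled variable can be checked exactly on a small case. The sketch below assumes a fair six-sided die as the underlying variable (an illustrative choice, not part of the lesson) and uses exact fractions so that no rounding is involved; the helper name `variance` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Fair six-sided die: each face has probability 1/6 (an illustrative choice).
FACES = range(1, 7)
P_FACE = Fraction(1, 6)


def variance(outcomes):
    """Exact variance of a list of (value, probability) pairs."""
    mean = sum(p * v for v, p in outcomes)
    return sum(p * (v - mean) ** 2 for v, p in outcomes)


# Single roll X.
var_x = variance([(x, P_FACE) for x in FACES])

# X1 + X2: two independent rolls, so each ordered pair has probability 1/36.
var_sum = variance(
    [(x1 + x2, P_FACE * P_FACE) for x1, x2 in product(FACES, repeat=2)]
)

# 2X: one roll, doubled.
var_scaled = variance([(2 * x, P_FACE) for x in FACES])

print(var_x, var_sum, var_scaled)  # 35/12 35/6 35/3
```

The sum of two rolls has twice the single-roll variance, while the doubled roll has four times it, matching $n\sigma^2$ against $n^2\sigma^2$ with $n=2$.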
The sample mean is:
$$\bar{X}=\frac{1}{n}\sum_{i=1}^{n}X_i$$
| Property | Value |
|---|---|
| $E(\bar{X})$ | $\mu$ |
| $\operatorname{Var}(\bar{X})$ | $\sigma^2/n$ |
| $\operatorname{SD}(\bar{X})$ | $\sigma/\sqrt{n}$ (standard error) |
These follow from the sum results by dividing by $n$, i.e. taking $a=1/n$ on each $X_i$:
$$E(\bar{X})=\frac{1}{n}E\left(\sum X_i\right)=\frac{n\mu}{n}=\mu,\qquad \operatorname{Var}(\bar{X})=\frac{1}{n^2}\operatorname{Var}\left(\sum X_i\right)=\frac{n\sigma^2}{n^2}=\frac{\sigma^2}{n}.$$
Note the $1/n^2$: scaling by $1/n$ squares to $1/n^2$, and multiplying the sum's variance $n\sigma^2$ gives $\sigma^2/n$. The sample mean is therefore an unbiased estimator of the population mean ($E(\bar{X})=\mu$), and as $n$ increases its variance $\sigma^2/n$ shrinks, so $\bar{X}$ clusters ever more tightly around $\mu$ and becomes a more precise estimator. The standard error $\sigma/\sqrt{n}$ decreases only like $1/\sqrt{n}$, which is why quadrupling the sample size only halves the standard error: this is the inverse-square-root law that governs the cost of precision in all of statistics.
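The inverse-square-root behaviour of the standard error is easy to see numerically. This is a minimal sketch; the function name `standard_error` and the value $\sigma=2$ are illustrative assumptions rather than anything fixed by the lesson.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard deviation of the sample mean of n i.i.d. observations."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return sigma / math.sqrt(n)


sigma = 2.0  # illustrative population standard deviation
for n in (25, 100, 400):
    print(n, standard_error(sigma, n))
# 25 0.4
# 100 0.2
# 400 0.1
```

Each fourfold increase in $n$ halves the standard error, so a tenfold gain in precision would need a hundredfold larger sample.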
A toy is assembled in two stages. Stage 1 takes time X∼N(10,4) minutes and Stage 2 takes Y∼N(15,9) minutes, independently. (a) Find P(total>30). (b) Find the probability that Stage 2 takes longer than Stage 1.
(a) Total. Sum of independent normals is normal; means add, variances add:
$$T=X+Y\sim N(10+15,\ 4+9)=N(25,13).$$
(M1 mean; M1 variance add; A1 distribution)
$$P(T>30)=P\left(Z>\frac{30-25}{\sqrt{13}}\right)=P(Z>1.387)=1-0.9173=0.0827.$$
(M1 standardise; A1)
(b) Difference. "Stage 2 longer than Stage 1" means Y−X>0. For the difference, variances still add:
$$D=Y-X\sim N(15-10,\ 9+4)=N(5,13).$$
(M1 variance still adds)
$$P(D>0)=P\left(Z>\frac{0-5}{\sqrt{13}}\right)=P(Z>-1.387)=0.9173.$$
(A1)
(M1s for adding means and variances and for standardising; A1s for the distribution and the probabilities. The key teaching point: Var(Y−X)=Var(Y)+Var(X)=13, never 9−4=5.)
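The two answers in this worked example can be reproduced with Python's standard-library `statistics.NormalDist`, which supports adding and subtracting independent normal variables. This is a checking sketch, not an exam method; the variable names are illustrative, and note that `NormalDist` is parameterised by the standard deviation rather than the variance.

```python
from statistics import NormalDist

# NormalDist takes the standard deviation, not the variance.
stage1 = NormalDist(mu=10, sigma=4 ** 0.5)   # X ~ N(10, 4)
stage2 = NormalDist(mu=15, sigma=9 ** 0.5)   # Y ~ N(15, 9)

# Adding or subtracting two NormalDist objects treats them as independent.
total = stage1 + stage2        # T = X + Y
difference = stage2 - stage1   # D = Y - X

p_total_over_30 = 1 - total.cdf(30)
p_stage2_longer = 1 - difference.cdf(0)

print(round(total.variance, 6), round(difference.variance, 6))
print(round(p_total_over_30, 4), round(p_stage2_longer, 4))
```

Both variances come out as 13, and the probabilities agree with the table-based values 0.0827 and 0.9173 up to the rounding of $z$ to three decimal places.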
Component lengths from machine A are $N(100,25)$; from machine B, $N(95,36)$. Samples of sizes $n_A=10$ and $n_B=15$ are taken independently. Find $P(\bar{X}_A-\bar{X}_B>8)$.
$$\bar{X}_A\sim N\left(100,\tfrac{25}{10}\right)=N(100,2.5),\qquad \bar{X}_B\sim N\left(95,\tfrac{36}{15}\right)=N(95,2.4).$$
(M1 each sample mean)
$$\bar{X}_A-\bar{X}_B\sim N(100-95,\ 2.5+2.4)=N(5,4.9).$$
(M1 combine; A1)
$$P(\bar{X}_A-\bar{X}_B>8)=P\left(Z>\frac{8-5}{\sqrt{4.9}}\right)=P(Z>1.355)=1-0.9123=0.0877.$$
(M1 standardise; A1)
(M1 for each $\operatorname{Var}(\bar{X})=\sigma^2/n$; M1/A1 for the difference distribution; M1/A1 for the probability. Note that the variances of the two sample means, i.e. the squared standard errors, add because the samples are independent.)
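The same calculation can be sketched in code. The helper `sample_mean_dist` is a hypothetical name introduced here; it simply builds the distribution of a sample mean from a normal population with the variance divided by $n$, then relies on `NormalDist` subtraction, which assumes independence.

```python
from statistics import NormalDist


def sample_mean_dist(mu: float, variance: float, n: int) -> NormalDist:
    """Distribution of the mean of n i.i.d. N(mu, variance) observations."""
    return NormalDist(mu=mu, sigma=(variance / n) ** 0.5)


mean_a = sample_mean_dist(100, 25, 10)   # N(100, 2.5)
mean_b = sample_mean_dist(95, 36, 15)    # N(95, 2.4)

# Subtraction assumes the two sample means are independent.
diff = mean_a - mean_b                   # N(5, 4.9)

p = 1 - diff.cdf(8)
print(round(diff.mean, 6), round(diff.variance, 6), round(p, 4))
```

The printed variance should be 4.9 (2.5 + 2.4), and the probability should agree with 0.0877 to within table rounding.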
Let X∼N(10,4).
| Quantity | Distribution | Note |
|---|---|---|
| $X_1+X_2$ (sum of 2 independent copies) | $N(20,8)$ | Var multiplied by 2 |
| $2X$ (single variable scaled) | $N(20,16)$ | Var multiplied by 4 |
| $X_1+X_2+X_3$ | $N(30,12)$ | Var multiplied by 3 |
| $3X$ | $N(30,36)$ | Var multiplied by 9 |
The distinction between $X_1+X_2+\cdots+X_n$ and $nX$ is crucial, and the table shows it starkly. Both $X_1+X_2$ and $2X$ have mean 20, but $X_1+X_2$ has variance 8 (two independent variances added) while $2X$ has variance 16 (the coefficient 2 squared). The sum of independent copies grows in spread only like $\sqrt{n}$, the single scaled variable like $n$. In an exam, the phrase "n independent observations" signals $\sum X_i$ (variance $n\sigma^2$), whereas "n times a single observation" signals $nX$ (variance $n^2\sigma^2$). Read the wording carefully, as the two give different answers.
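The four rows of the table can be reproduced with `statistics.NormalDist`, where `+` between two distributions means an independent sum and `*` by a constant means scaling a single variable. The variable names below are illustrative assumptions.

```python
import math
from statistics import NormalDist

# Three independent copies of X ~ N(10, 4); sigma is the standard deviation.
x1 = NormalDist(mu=10, sigma=2)
x2 = NormalDist(mu=10, sigma=2)
x3 = NormalDist(mu=10, sigma=2)

sum_of_two = x1 + x2          # independent sum
doubled = x1 * 2              # one variable scaled by 2
sum_of_three = x1 + x2 + x3   # independent sum
tripled = x1 * 3              # one variable scaled by 3

for name, dist in [
    ("X1+X2", sum_of_two),
    ("2X", doubled),
    ("X1+X2+X3", sum_of_three),
    ("3X", tripled),
]:
    print(name, round(dist.mean, 6), round(dist.variance, 6))
```

The means agree within each pair (20 and 30), while the variances come out as 8 against 16 and 12 against 36.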
The sample variance is:
$$S^2=\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2$$
| Property | Value |
|---|---|
| $E(S^2)$ | $\sigma^2$ (unbiased) |
| Division by $n-1$ | Bessel's correction ensures unbiasedness |
If we divided by $n$ instead, the estimator would be biased: one can show
$$E\left(\sum_{i=1}^{n}(X_i-\bar{X})^2\right)=(n-1)\sigma^2,\quad\text{so}\quad E\left(\frac{1}{n}\sum_{i=1}^{n}(X_i-\bar{X})^2\right)=\frac{n-1}{n}\sigma^2<\sigma^2.$$
Dividing by $n-1$ (not $n$) exactly cancels this shortfall, giving $E(S^2)=\sigma^2$: this is Bessel's correction. The intuition: the deviations are taken about the sample mean $\bar{X}$, which is itself pulled towards the data, so the squared deviations systematically under-estimate spread about the true $\mu$; the $n-1$ compensates. The deeper reason is degrees of freedom: the $n$ deviations $X_i-\bar{X}$ satisfy the single constraint $\sum(X_i-\bar{X})=0$, so only $n-1$ of them are free to vary, the same $n-1$ that fixes the parameter of the t-distribution.
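The bias of the divide-by-$n$ estimator can be verified exactly by enumerating every possible sample from a small population. The sketch below assumes a fair die as the population and a sample size of 3, both hypothetical choices made only to keep the enumeration small; exact fractions avoid any rounding.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)   # fair die, used here as a hypothetical population
N = 3                 # sample size

pop_mean = Fraction(sum(FACES), 6)
pop_var = sum((x - pop_mean) ** 2 for x in FACES) / 6   # sigma^2 = 35/12

samples = list(product(FACES, repeat=N))   # all 6^3 equally likely samples
weight = Fraction(1, len(samples))

# Expected value of the sum of squared deviations about the sample mean.
expected_ss = Fraction(0)
for sample in samples:
    xbar = Fraction(sum(sample), N)
    expected_ss += weight * sum((x - xbar) ** 2 for x in sample)

print(expected_ss / N, expected_ss / (N - 1), pop_var)  # 35/18 35/12 35/12
```

Dividing by $n$ gives $\tfrac{2}{3}\sigma^2$ on average, whereas dividing by $n-1$ recovers $\sigma^2$ exactly.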
The covariance measures how two variables move together:
$$\operatorname{Cov}(X,Y)=E\left[(X-\mu_X)(Y-\mu_Y)\right]=E(XY)-E(X)E(Y),$$
the second (computational) form following by expanding the brackets. A positive covariance means X and Y tend to be large together; negative means one tends to be large when the other is small. But covariance carries the units of X times Y, so its size is hard to interpret; dividing by the standard deviations gives the dimensionless correlation coefficient
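$$\rho=\frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}=\frac{\operatorname{Cov}(X,Y)}{\sigma_X\sigma_Y},\qquad -1\le\rho\le 1.$$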
This is the population analogue of Pearson's sample r from the correlation-and-regression lesson.
| Property | Detail |
|---|---|
| −1≤ρ≤1 | always (Cauchy–Schwarz) |
| ρ=0 | uncorrelated (but not necessarily independent) |
| Independence ⟹ ρ=0 | the converse is not generally true |
| ρ=±1 | Y is an exact linear function of X |
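A standard counterexample shows why zero correlation does not imply independence. The sketch below assumes $X$ is uniform on $\{-1,0,1\}$ and $Y=X^2$ (a hypothetical pair chosen for illustration): $Y$ is completely determined by $X$, yet the covariance is zero.

```python
from fractions import Fraction

# Hypothetical example: X is uniform on {-1, 0, 1} and Y = X^2.
support = [-1, 0, 1]
p = Fraction(1, 3)

e_x = sum(p * x for x in support)
e_y = sum(p * x ** 2 for x in support)
e_xy = sum(p * x * x ** 2 for x in support)

cov = e_xy - e_x * e_y          # computational form of the covariance

# Independence would require P(X=0, Y=0) = P(X=0) * P(Y=0).
p_joint = p                      # X = 0 forces Y = 0
p_product = p * p                # P(X=0) = 1/3 and P(Y=0) = 1/3

print(cov, p_joint, p_product)   # 0 1/3 1/9
```

The covariance, and hence $\rho$, is 0, but the joint probability differs from the product of the marginals, so the variables are dependent.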
| Property | Formula |
|---|---|
| Cov(X,X) | Var(X) |
| Cov(X,Y)=Cov(Y,X) | Symmetric |
| Cov(aX,bY) | abCov(X,Y) |
| Cov(X+a,Y+b) | Cov(X,Y) (shifts don't affect spread) |
| Var(X+Y) | Var(X)+Var(Y)+2Cov(X,Y) |
| Var(X−Y) | Var(X)+Var(Y)−2Cov(X,Y) |
Two features deserve note. Adding constants leaves covariance (and variance) unchanged — only the spread, not the location, matters. Scaling pulls constants out multiplicatively: Cov(aX,bY)=abCov(X,Y). For a dependent pair the cross term 2Cov(X,Y) is essential: positive covariance inflates Var(X+Y) and deflates Var(X−Y); negative covariance does the reverse (the basis of the risk reduction in the portfolio example below).
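The effect of the cross term can be confirmed by brute force on a dependent pair. The joint probabilities below are a made-up illustration with positive covariance; `expect` and `var` are hypothetical helper names that compute moments directly from the definition, so the formulas for $\operatorname{Var}(X\pm Y)$ are checked rather than assumed.

```python
from fractions import Fraction

# Hypothetical joint pmf of a dependent pair (X, Y): keys are (x, y).
joint = {
    (0, 0): Fraction(4, 10),
    (0, 1): Fraction(1, 10),
    (1, 0): Fraction(1, 10),
    (1, 1): Fraction(4, 10),
}


def expect(g):
    """E[g(X, Y)] under the joint pmf."""
    return sum(p * g(x, y) for (x, y), p in joint.items())


def var(g):
    """Var[g(X, Y)] computed directly from the definition."""
    m = expect(g)
    return expect(lambda x, y: (g(x, y) - m) ** 2)


var_x = var(lambda x, y: x)
var_y = var(lambda x, y: y)
cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

var_sum = var(lambda x, y: x + y)    # should equal var_x + var_y + 2*cov
var_diff = var(lambda x, y: x - y)   # should equal var_x + var_y - 2*cov

print(cov, var_sum, var_diff)  # 3/20 4/5 1/5
```

With $\operatorname{Var}(X)+\operatorname{Var}(Y)=\tfrac12$, the positive covariance pushes the variance of the sum up to $\tfrac45$ and the variance of the difference down to $\tfrac15$.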
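The sampling-distribution result $\bar{X}\sim N(\mu,\sigma^2/n)$ was stated above for a normal population. The Central Limit Theorem (CLT) extends it to any population with finite mean $\mu$ and variance $\sigma^2$: for large $n$,
$$\bar{X}\overset{\text{approx}}{\sim}N\left(\mu,\frac{\sigma^2}{n}\right),\quad\text{equivalently}\quad\sum_{i=1}^{n}X_i\overset{\text{approx}}{\sim}N(n\mu,n\sigma^2).$$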
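To see the approximation at work on a clearly non-normal population, one can compare an exact probability for a sum of dice with its normal approximation. Everything specific here is an assumption made for illustration: the fair-die population, the choice of 30 rolls, the threshold of 100, and the use of a continuity correction (a standard refinement when approximating a discrete sum, not something stated in the text above).

```python
from statistics import NormalDist

N_DICE = 30          # hypothetical sample size
THRESHOLD = 100      # we compare P(sum <= 100)

# Exact pmf of the sum of N_DICE fair dice, built by repeated convolution.
pmf = {0: 1.0}
for _ in range(N_DICE):
    new_pmf = {}
    for total, prob in pmf.items():
        for face in range(1, 7):
            new_pmf[total + face] = new_pmf.get(total + face, 0.0) + prob / 6
    pmf = new_pmf

exact = sum(prob for total, prob in pmf.items() if total <= THRESHOLD)

# CLT approximation: sum ~ N(n*mu, n*sigma^2) with mu = 3.5, sigma^2 = 35/12.
mu, var = 3.5, 35 / 12
approx_dist = NormalDist(mu=N_DICE * mu, sigma=(N_DICE * var) ** 0.5)
approx = approx_dist.cdf(THRESHOLD + 0.5)   # continuity correction

print(round(exact, 4), round(approx, 4))
```

The two printed values should be close, which illustrates that the sum of even 30 uniform discrete observations is already well described by $N(n\mu, n\sigma^2)$.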