Combinations of Random Variables

Real problems rarely involve a single random quantity in isolation: a total assembly time is a sum of stage times, a profit is a difference of revenue and cost, a sample mean is a scaled sum of observations. This lesson establishes the rules for the mean and variance of linear combinations $aX + bY$ , the crucial role of independence (variances add, with squared coefficients), the special closure of the normal family under linear combination, and the distribution of the sample mean and sample variance — the bridge to all of statistical inference.

Where this sits in AQA 7367

This is Paper 3 optional content — Statistics (7367/3S), taken with Mechanics (7367/3M) or Discrete (7367/3D). Paper 3 is 2 hours, 100 marks, AO1 40% / AO2 25% / AO3 35%. Applying $E(aX+bY)$ and $\operatorname{Var}(aX+bY)$ is AO1; the proofs (e.g. why $\operatorname{Var}(X-Y)$ adds variances, why $S^2$ needs $n-1$ ) are AO2; a multi-stage modelling problem is AO3. It builds directly on A-Level Maths Statistics (mean and variance of a single variable; the normal distribution) and dovetails with the previous lesson (PGFs prove the distributional additivity; this lesson does the moment bookkeeping).

General Results (Any Distribution)

For any random variables $X$ and $Y$ (not necessarily independent) and any constants $a$ and $b$:

Result	Formula
$E(aX + bY)$	$aE(X) + bE(Y)$
$\text{Var}(aX + bY)$	$a^2\text{Var}(X) + b^2\text{Var}(Y) + 2ab\text{Cov}(X,Y)$

If $X$ and $Y$ are independent, then $\text{Cov}(X,Y) = 0$ :

$\text{Var}(aX + bY) = a^2\text{Var}(X) + b^2\text{Var}(Y).$

Why the mean is linear but the variance is not. The expectation is a linear operator: $E(aX + bY) = aE(X) + bE(Y)$ holds for any $X, Y$ , independent or not, because integration/summation is linear. The variance is quadratic: expanding the definition,

$\operatorname{Var}(aX + bY) = E\big[(aX + bY - a\mu_X - b\mu_Y)^2\big] = E\big[(a(X-\mu_X) + b(Y-\mu_Y))^2\big],$

and multiplying out the square gives $a^2\operatorname{Var}(X) + b^2\operatorname{Var}(Y) + 2ab\operatorname{Cov}(X,Y)$ . The cross term $2ab\operatorname{Cov}(X,Y)$ vanishes only when $\operatorname{Cov}(X,Y) = 0$ (e.g. independence), and the coefficients appear squared because the square of $aX$ is $a^2X^2$ . This single expansion explains every special case below, including the crucial $\operatorname{Var}(X - Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$ : here $a = 1, b = -1$ , so $b^2 = (-1)^2 = +1$ multiplies $\operatorname{Var}(Y)$ — the minus sign is squared away.
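As a quick numerical check of this expansion, the sketch below computes both sides exactly for a small dependent pair. The joint distribution and the coefficients $a = 2$, $b = -3$ are made up for illustration and do not come from the lesson.

```python
from fractions import Fraction as F

# Hypothetical joint pmf P(X = x, Y = y) for a dependent pair (illustrative only).
joint = {
    (0, 0): F(3, 10),
    (0, 1): F(1, 10),
    (1, 0): F(2, 10),
    (1, 1): F(4, 10),
}

def expect(g):
    """E[g(X, Y)] under the joint pmf."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

def var(g):
    """Var[g(X, Y)] = E[g^2] - (E[g])^2."""
    return expect(lambda x, y: g(x, y) ** 2) - expect(g) ** 2

a, b = 2, -3
var_x = var(lambda x, y: x)
var_y = var(lambda x, y: y)
cov_xy = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

lhs = var(lambda x, y: a * x + b * y)                   # straight from the definition
rhs = a**2 * var_x + b**2 * var_y + 2 * a * b * cov_xy  # expanded formula

print(lhs, rhs, cov_xy)
```

Because the covariance here is non-zero, dropping the cross term would give a different (wrong) value, which is the point of the check.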

Special Cases

Combination	$E$	$\text{Var}$ (independent)
$X + Y$	$E(X) + E(Y)$	$\text{Var}(X) + \text{Var}(Y)$
$X - Y$	$E(X) - E(Y)$	$\text{Var}(X) + \text{Var}(Y)$
$3X$	$3E(X)$	$9\text{Var}(X)$
$X + 5$	$E(X) + 5$	$\text{Var}(X)$

Exam Tip: The variance of $X - Y$ is $\text{Var}(X) + \text{Var}(Y)$ (plus, not minus) when $X$ and $Y$ are independent. This is tested frequently and is a common source of errors.
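The rows of the table can be verified exactly on a small case. The sketch below assumes $X$ and $Y$ are two independent fair dice (an illustrative choice, not part of the lesson) and enumerates every outcome with exact fractions.

```python
from fractions import Fraction as F
from itertools import product

faces = range(1, 7)  # one fair die; assumed example

def var(values):
    """Variance of a variable taking each listed value with equal probability."""
    values = [F(v) for v in values]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

var_x = var(faces)                          # variance of a single die
pairs = list(product(faces, repeat=2))      # 36 equally likely (x, y) outcomes

var_sum = var(x + y for x, y in pairs)      # Var(X + Y)
var_diff = var(x - y for x, y in pairs)     # Var(X - Y)
var_3x = var(3 * x for x in faces)          # Var(3X)
var_shift = var(x + 5 for x in faces)       # Var(X + 5)

print(var_x, var_sum, var_diff, var_3x, var_shift)
```

The sum and the difference come out with the same variance, twice that of a single die, while the shift leaves the variance untouched.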

Sum of Independent Identically Distributed (i.i.d.) Variables

If $X_1, X_2, \ldots, X_n$ are i.i.d. (independent and identically distributed) with mean $\mu$ and variance $\sigma^2$ , then by linearity of expectation and the additivity of variance for independents:

$E\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n E(X_i) = n\mu, \qquad \operatorname{Var}\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \operatorname{Var}(X_i) = n\sigma^2.$

The mean scales by $n$ and so does the variance — but the standard deviation scales only by $\sqrt n$ (since $\sqrt{n\sigma^2} = \sigma\sqrt n$ ). Contrast this sharply with the single scaled variable $nX$ , where $\operatorname{Var}(nX) = n^2\sigma^2$ and the standard deviation scales by the full factor $n$ . The difference is that $\sum X_i$ adds $n$ independent fluctuations (which partly cancel), whereas $nX$ magnifies a single fluctuation $n$ -fold.

The Sample Mean

$\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$

Property	Value
$E(\bar{X})$	$\mu$
$\text{Var}(\bar{X})$	$\sigma^2/n$
$\text{SD}(\bar{X})$	$\sigma/\sqrt{n}$ (standard error)

These follow from the sum results by dividing by $n$ , i.e. taking $a = 1/n$ on each $X_i$ :

$E(\bar X) = \frac1n E\!\left(\sum X_i\right) = \frac{n\mu}{n} = \mu, \qquad \operatorname{Var}(\bar X) = \frac{1}{n^2}\operatorname{Var}\!\left(\sum X_i\right) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.$

Note the $1/n^2$ : scaling by $1/n$ squares to $1/n^2$ , and multiplying the sum's variance $n\sigma^2$ gives $\sigma^2/n$ . The sample mean is therefore an unbiased estimator of the population mean ( $E(\bar X) = \mu$ ), and as $n$ increases its variance $\sigma^2/n$ shrinks, so $\bar X$ clusters ever more tightly around $\mu$ — it becomes a more precise estimator. The standard error $\sigma/\sqrt n$ decreases only like $1/\sqrt n$ , which is why quadrupling the sample size only halves the standard error — the inverse-square-root law that governs the cost of precision in all of statistics.
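The inverse-square-root law is easy to see numerically. In the sketch below, the population standard deviation of 6 and the sample sizes are assumed values chosen only to make the arithmetic clean.

```python
import math

def standard_error(sigma, n):
    """Standard deviation of the sample mean of n i.i.d. observations."""
    return sigma / math.sqrt(n)

sigma = 6.0  # assumed population standard deviation, for illustration
for n in (25, 100, 400):
    # each step multiplies n by 4, so the standard error should halve
    print(n, standard_error(sigma, n))
```

Each fourfold increase in $n$ halves the printed standard error, so gaining one extra decimal place of precision costs a hundredfold increase in sample size.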

Worked Example 1: Two Stages (with mark scheme)

A toy is assembled in two stages. Stage 1 takes time $X \sim N(10, 4)$ minutes and Stage 2 takes $Y \sim N(15, 9)$ minutes, independently. (a) Find $P(\text{total} > 30)$ . (b) Find the probability that Stage 2 takes longer than Stage 1.

(a) Total. Sum of independent normals is normal; means add, variances add:

$T = X + Y \sim N(10 + 15,\ 4 + 9) = N(25, 13). \quad (\text{M1 mean; M1 variance add; A1 distribution})$

$P(T > 30) = P\!\left(Z > \frac{30 - 25}{\sqrt{13}}\right) = P(Z > 1.387) = 1 - 0.9173 = 0.0827. \quad (\text{M1 standardise; A1})$

(b) Difference. "Stage 2 longer than Stage 1" means $Y - X > 0$ . For the difference, variances still add:

$D = Y - X \sim N(15 - 10,\ 9 + 4) = N(5, 13). \quad (\text{M1 variance still adds})$

$P(D > 0) = P\!\left(Z > \frac{0 - 5}{\sqrt{13}}\right) = P(Z > -1.387) = 0.9173. \quad (\text{A1})$

(M1s for adding means and variances and for standardising; A1s for the distribution and the probabilities. The key teaching point: $\operatorname{Var}(Y - X) = \operatorname{Var}(Y) + \operatorname{Var}(X) = 13$ , never $9 - 4 = 5$ .)
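Both parts can be checked with Python's standard-library `statistics.NormalDist`, whose `+` and `-` operators combine two distributions as if they were independent. This is only a sketch for checking the arithmetic; note that `NormalDist` takes the standard deviation, not the variance, and that software gives slightly more decimal places than four-figure tables.

```python
from statistics import NormalDist

# NormalDist(mu, sigma) expects the standard deviation, so take square roots.
X = NormalDist(mu=10, sigma=4 ** 0.5)   # Stage 1 ~ N(10, 4)
Y = NormalDist(mu=15, sigma=9 ** 0.5)   # Stage 2 ~ N(15, 9)

T = X + Y   # independent sum: means add, variances add
D = Y - X   # independent difference: means subtract, variances still add

p_total_over_30 = 1 - T.cdf(30)   # part (a)
p_stage2_longer = 1 - D.cdf(0)    # part (b)

print(T.mean, T.variance, p_total_over_30)
print(D.mean, D.variance, p_stage2_longer)
```

The probabilities agree with the table-based answers to about three decimal places; any difference in the fourth place comes from rounding $z$ to 1.387 before using tables.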

Worked Example 2: A Difference of Sample Means (with mark scheme)

Component lengths from machine A are $N(100, 25)$ ; from machine B, $N(95, 36)$ . Samples of sizes $n_A = 10$ and $n_B = 15$ are taken independently. Find $P(\bar X_A - \bar X_B > 8)$ .

$\bar X_A \sim N\!\left(100, \tfrac{25}{10}\right) = N(100, 2.5), \quad \bar X_B \sim N\!\left(95, \tfrac{36}{15}\right) = N(95, 2.4). \quad (\text{M1 each sample mean})$

$\bar X_A - \bar X_B \sim N(100 - 95,\ 2.5 + 2.4) = N(5, 4.9). \quad (\text{M1 combine; A1})$

$P(\bar X_A - \bar X_B > 8) = P\!\left(Z > \frac{8 - 5}{\sqrt{4.9}}\right) = P(Z > 1.355) = 1 - 0.9123 = 0.0877. \quad (\text{M1 standardise; A1})$

(M1 for each $\operatorname{Var}(\bar X) = \sigma^2/n$; M1/A1 for the difference distribution; M1/A1 for the probability. Note that the two variances $\sigma_A^2/n_A = 2.5$ and $\sigma_B^2/n_B = 2.4$ add because the samples are independent; the standard errors themselves do not add.)
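The same calculation can be sketched in Python with the standard-library `statistics.NormalDist`. The helper name `sample_mean_dist` is hypothetical, introduced here only for illustration.

```python
from statistics import NormalDist

def sample_mean_dist(mu, var, n):
    """Distribution of the mean of n i.i.d. N(mu, var) observations."""
    return NormalDist(mu, (var / n) ** 0.5)   # variance var/n, so sd sqrt(var/n)

mean_a = sample_mean_dist(100, 25, 10)   # N(100, 2.5)
mean_b = sample_mean_dist(95, 36, 15)    # N(95, 2.4)

# '-' treats the two sample means as independent: variances 2.5 and 2.4 add.
diff = mean_a - mean_b

p = 1 - diff.cdf(8)
print(diff.mean, diff.variance, p)
```

The printed variance is 4.9 (up to floating-point rounding), the sum of the two variances, not the sum or difference of the two standard deviations.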

Sums and Scalar Multiples Revisited

Let $X \sim N(10, 4)$ .

Quantity	Distribution	Note
$X_1 + X_2$ (sum of 2 independent copies)	$N(20, 8)$	Var multiplied by 2
$2X$ (single variable scaled)	$N(20, 16)$	Var multiplied by 4
$X_1 + X_2 + X_3$	$N(30, 12)$	Var multiplied by 3
$3X$	$N(30, 36)$	Var multiplied by 9

The distinction between $X_1 + X_2 + \cdots + X_n$ and $nX$ is crucial — and the table shows it starkly. Both $X_1 + X_2$ and $2X$ have mean $20$ , but $X_1 + X_2$ has variance $8$ (two independent variances added) while $2X$ has variance $16$ (the coefficient $2$ squared). The sum of independent copies grows in spread only like $\sqrt n$ , the single scaled variable like $n$ . In an exam, the phrase " $n$ independent observations" signals $\sum X_i$ (variance $n\sigma^2$ ), whereas " $n$ times a single observation" signals $nX$ (variance $n^2\sigma^2$ ) — read the wording carefully, as the two give different answers.
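The table can be reproduced with `statistics.NormalDist`, which happens to encode exactly this distinction: adding two `NormalDist` objects models a sum of independent variables, while multiplying one by a constant models a single scaled variable. The sketch below relies on that behaviour, so `X + X` stands for $X_1 + X_2$ and not for $2X$.

```python
from statistics import NormalDist

X = NormalDist(10, 2)            # N(10, 4): standard deviation 2

two_copies = X + X               # X1 + X2, two independent copies
doubled = X * 2                  # 2X, one observation scaled by 2
three_copies = X + X + X         # X1 + X2 + X3
tripled = X * 3                  # 3X

for name, d in [("X1+X2", two_copies), ("2X", doubled),
                ("X1+X2+X3", three_copies), ("3X", tripled)]:
    print(name, d.mean, round(d.variance, 6))
```

The means agree within each pair, but the variances are 8 against 16 and 12 against 36, matching the table.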

Sample Variance

The sample variance is:

$S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$

Property	Value
$E(S^2)$	$\sigma^2$ (unbiased)
Division by $n-1$	Bessel's correction ensures unbiasedness

If we divided by $n$ instead, the estimator would be biased: one can show

$E\!\left(\sum_{i=1}^n (X_i - \bar X)^2\right) = (n-1)\sigma^2, \quad \text{so} \quad E\!\left(\frac{1}{n}\sum(X_i - \bar X)^2\right) = \frac{n-1}{n}\sigma^2 < \sigma^2.$

Dividing by $n - 1$ (not $n$ ) exactly cancels this shortfall, giving $E(S^2) = \sigma^2$ — Bessel's correction. The intuition: the deviations are taken about the sample mean $\bar X$ , which is itself pulled towards the data, so the squared deviations systematically under-estimate spread about the true $\mu$ ; the $n - 1$ compensates. The deeper reason is degrees of freedom: the $n$ deviations $X_i - \bar X$ satisfy the single constraint $\sum (X_i - \bar X) = 0$ , so only $n - 1$ of them are free to vary — the same $n - 1$ that fixes the parameter of the $t$ -distribution.
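The bias can be confirmed exactly on a small case by enumerating every possible sample. The sketch below assumes the population is a fair die and the sample size is $n = 3$; both are illustrative choices, not part of the lesson.

```python
from fractions import Fraction as F
from itertools import product

population = [1, 2, 3, 4, 5, 6]   # fair die, assumed for illustration
n = 3

mu = F(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)  # population variance

samples = list(product(population, repeat=n))   # all 6**3 equally likely samples

def sum_sq_dev(sample):
    """Sum of squared deviations about the sample mean."""
    xbar = F(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)

mean_ss = sum(sum_sq_dev(s) for s in samples) / len(samples)  # E[sum (Xi - Xbar)^2]

e_divide_by_n = mean_ss / n                  # expected value when dividing by n
e_divide_by_n_minus_1 = mean_ss / (n - 1)    # expected value of S^2

print(sigma2, e_divide_by_n, e_divide_by_n_minus_1)
```

Dividing by $n$ gives an expected value of $\frac{n-1}{n}\sigma^2$, here two thirds of the true variance, while dividing by $n - 1$ recovers $\sigma^2$ exactly.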

Covariance and Correlation

The covariance measures how two variables move together:

$\operatorname{Cov}(X, Y) = E\big[(X - \mu_X)(Y - \mu_Y)\big] = E(XY) - E(X)E(Y),$

the second (computational) form following by expanding the brackets. A positive covariance means $X$ and $Y$ tend to be large together; negative means one tends to be large when the other is small. But covariance carries the units of $X$ times $Y$ , so its size is hard to interpret; dividing by the standard deviations gives the dimensionless correlation coefficient

$\rho = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}} = \frac{\operatorname{Cov}(X,Y)}{\sigma_X\sigma_Y}, \qquad -1 \le \rho \le 1.$

This is the population analogue of Pearson's sample $r$ from the correlation-and-regression lesson.

Property	Detail
$-1 \leq \rho \leq 1$	always (Cauchy–Schwarz)
$\rho = 0$	uncorrelated (but not necessarily independent)
Independence $\implies$ $\rho = 0$	the converse is not generally true
$\rho = \pm 1$	$Y$ is an exact linear function of $X$
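A standard counterexample shows why $\rho = 0$ does not imply independence. The sketch below assumes $X$ is uniform on $\{-1, 0, 1\}$ and $Y = X^2$ (an illustrative choice): $Y$ is completely determined by $X$, yet the covariance is zero.

```python
from fractions import Fraction as F

# Assumed example: X uniform on {-1, 0, 1}, and Y = X**2.
pmf_x = {-1: F(1, 3), 0: F(1, 3), 1: F(1, 3)}

e_x = sum(p * x for x, p in pmf_x.items())            # E(X)
e_y = sum(p * x**2 for x, p in pmf_x.items())         # E(Y) = E(X^2)
e_xy = sum(p * x * x**2 for x, p in pmf_x.items())    # E(XY) = E(X^3)

cov = e_xy - e_x * e_y           # zero, so rho = 0

# Independence would require P(X=0, Y=0) = P(X=0) * P(Y=0).
p_joint = pmf_x[0]               # X = 0 forces Y = 0, so the joint probability is P(X=0)
p_product = pmf_x[0] * pmf_x[0]  # P(Y=0) equals P(X=0) here

print(cov, p_joint, p_product)
```

The covariance vanishes because the dependence is symmetric rather than linear; correlation only detects linear association.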

Properties of Covariance

Property	Formula
$\text{Cov}(X, X)$	$\text{Var}(X)$
$\text{Cov}(X, Y) = \text{Cov}(Y, X)$	Symmetric
$\text{Cov}(aX, bY)$	$ab\text{Cov}(X, Y)$
$\text{Cov}(X + a, Y + b)$	$\text{Cov}(X, Y)$ (shifts don't affect spread)
$\text{Var}(X + Y)$	$\text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$
$\text{Var}(X - Y)$	$\text{Var}(X) + \text{Var}(Y) - 2\text{Cov}(X, Y)$

Three features deserve note. Adding constants leaves covariance (and variance) unchanged: only the spread, not the location, matters. Scaling pulls constants out multiplicatively: $\operatorname{Cov}(aX, bY) = ab\operatorname{Cov}(X, Y)$. For a dependent pair the cross term $2\operatorname{Cov}(X,Y)$ is essential: positive covariance inflates $\operatorname{Var}(X+Y)$ and deflates $\operatorname{Var}(X-Y)$; negative covariance does the reverse (the basis of the risk reduction in the portfolio example below).
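The effect of the cross term can be seen with a few lines of arithmetic. In the sketch below, the variances 4 and 9 and the correlations $\pm 0.5$ are assumed values, and the function name `var_sum_and_diff` is hypothetical.

```python
import math

def var_sum_and_diff(var_x, var_y, rho):
    """Return (Var(X+Y), Var(X-Y)) for a pair with correlation rho."""
    cov = rho * math.sqrt(var_x * var_y)   # Cov(X, Y) = rho * sigma_X * sigma_Y
    return var_x + var_y + 2 * cov, var_x + var_y - 2 * cov

for rho in (0.5, 0.0, -0.5):
    print(rho, var_sum_and_diff(4, 9, rho))
```

With zero correlation both variances equal 13; a positive correlation pushes the sum's variance up and the difference's variance down by the same amount, and a negative correlation swaps the two.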

The Central Limit Theorem

The sampling-distribution result $\bar X \sim N(\mu, \sigma^2/n)$ was stated above for a normal population. The Central Limit Theorem (CLT) extends it to any population with finite mean $\mu$ and variance $\sigma^2$ : for large $n$ ,

$\bar X \;\overset{\text{approx}}{\sim}\; N\!\left(\mu, \frac{\sigma^2}{n}\right), \qquad \text{equivalently} \qquad \sum_{i=1}^n X_i \;\overset{\text{approx}}{\sim}\; N(n\mu, n\sigma^2).$
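To see the approximation at work on a decidedly non-normal population, the sketch below compares the exact distribution of the total of $n$ fair dice with $N(n\mu, n\sigma^2)$. The choice of dice, $n = 30$, and the continuity correction of $0.5$ are assumptions made for illustration.

```python
from statistics import NormalDist

def pmf_of_dice_sum(n):
    """Exact pmf of the total of n fair dice, built by repeated convolution."""
    pmf = {0: 1.0}
    for _ in range(n):
        nxt = {}
        for total, p in pmf.items():
            for face in range(1, 7):
                nxt[total + face] = nxt.get(total + face, 0.0) + p / 6
        pmf = nxt
    return pmf

n = 30
mu, sigma2 = 3.5, 35 / 12                        # mean and variance of one fair die
approx = NormalDist(n * mu, (n * sigma2) ** 0.5) # N(n*mu, n*sigma^2)

pmf = pmf_of_dice_sum(n)
running, max_gap = 0.0, 0.0
for total in sorted(pmf):
    running += pmf[total]                        # exact P(sum <= total)
    # the + 0.5 is a continuity correction for the integer-valued sum
    max_gap = max(max_gap, abs(running - approx.cdf(total + 0.5)))

print(max_gap)
```

The printed value is the largest gap between the exact and approximate cumulative probabilities; rerunning with a smaller `n` shows the approximation getting worse.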
