This lesson extends your understanding of discrete random variables beyond what is covered in A-Level Mathematics. You will learn to compute expectations of functions of random variables, derive variance rigorously via the $E(X^2)$ method, manipulate the algebra of expectation and variance, and combine independent variables. These skills are the load-bearing foundation for the entire Further Statistics module: the Poisson, the continuous distributions, moment generating functions and the chi-squared tests all rest on a fluent command of E and Var.
This topic belongs to the Paper 3 Statistics option (7367/3S). Paper 3 carries the more problem-solving-weighted assessment profile (AO1 40% / AO2 25% / AO3 35%), so although the mechanics of computing E(X) and Var(X) are AO1, examiners deliberately wrap them in unfamiliar contexts and multi-step parameter-finding (AO3) and ask you to justify or interpret results (AO2). The A-Level Mathematics prerequisite is the discrete-distribution work in the applied paper (computing E(X) and Var(X) from a table); Further Maths adds functions of a random variable, the linear-combination algebra, and sums of independent variables.
Students choose two of Mechanics (3M), Statistics (3S) and Discrete (3D). If you are reading this you are taking Statistics — this lesson is the gateway to the rest of 3S.
A discrete random variable $X$ takes a countable set of values $x_1, x_2, \ldots$, each with probability $P(X = x_i) = p_i$. For the distribution to be valid:
| Property | Requirement |
|---|---|
| Non-negativity | $p_i \ge 0$ for all $i$ |
| Normalisation | $\sum_i p_i = 1$ |
We use this running example throughout the lesson:
| x | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| P(X=x) | 0.1 | 0.3 | 0.4 | 0.2 |
Check: 0.1+0.3+0.4+0.2=1. Valid.
The expected value (mean) is the probability-weighted average of the values:
$$E(X) = \mu = \sum_x x\,P(X = x).$$
It is the long-run mean of X over many repetitions. For the running example,
E(X)=1(0.1)+2(0.3)+3(0.4)+4(0.2)=0.1+0.6+1.2+0.8=2.7.
E(X) need not be an attainable value of X. It is a balance point, not a mode or median.
For any function g,
$$E(g(X)) = \sum_x g(x)\,P(X = x).$$
This is the law of the unconscious statistician: you never need the distribution of $g(X)$. Apply $g$ to each value and weight by the original probabilities. The most important case is $g(x) = x^2$:
$$E(X^2) = 1^2(0.1) + 2^2(0.3) + 3^2(0.4) + 4^2(0.2) = 0.1 + 1.2 + 3.6 + 3.2 = 8.1.$$
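The two sums above can be checked mechanically. The following is a minimal sketch in Python (a language choice made here, not something the lesson prescribes); the helper name `expect` is hypothetical, and exact fractions are used so that no rounding creeps in.

```python
from fractions import Fraction

# Running example from the lesson: value -> probability, held as exact fractions.
pmf = {1: Fraction(1, 10), 2: Fraction(3, 10), 3: Fraction(4, 10), 4: Fraction(2, 10)}

def expect(pmf, g=lambda x: x):
    """Return E(g(X)): apply g to each value, weight by P(X = x), and sum."""
    return sum(g(x) * p for x, p in pmf.items())

assert sum(pmf.values()) == 1            # the probabilities must total 1

mean = expect(pmf)                       # E(X)
mean_sq = expect(pmf, lambda x: x ** 2)  # E(X^2), no distribution of X^2 needed
print(float(mean), float(mean_sq))       # 2.7 8.1
```

Passing a different `g` gives any other expectation of a function of $X$ from the same table.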
The variance measures spread. By definition it is the expected squared deviation from the mean,
$$\operatorname{Var}(X) = E((X-\mu)^2) = \sum_x (x-\mu)^2\,P(X = x),$$
but the computational form is far quicker. Expanding the square and using linearity:
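$$\operatorname{Var}(X) = E(X^2 - 2\mu X + \mu^2) = E(X^2) - 2\mu E(X) + \mu^2 = E(X^2) - 2\mu^2 + \mu^2 = E(X^2) - \mu^2 = E(X^2) - (E(X))^2.$$
For the running example,
$$\operatorname{Var}(X) = 8.1 - 2.7^2 = 8.1 - 7.29 = 0.81, \qquad \sigma = \sqrt{0.81} = 0.9.$$
Because variance is an expected square, $\operatorname{Var}(X) \ge 0$ always, and the identity forces $E(X^2) \ge (E(X))^2$, with equality only when $X$ is constant.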
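To see that the definition and the computational form really agree, here is a short sketch (Python and the variable names are assumptions of this example) that evaluates both on the running example with exact fractions.

```python
from fractions import Fraction
import math

# Running example: value -> probability.
pmf = {1: Fraction(1, 10), 2: Fraction(3, 10), 3: Fraction(2, 5), 4: Fraction(1, 5)}

mu = sum(x * p for x, p in pmf.items())                               # E(X)
var_by_definition = sum((x - mu) ** 2 * p for x, p in pmf.items())    # E((X - mu)^2)
var_by_shortcut = sum(x ** 2 * p for x, p in pmf.items()) - mu ** 2   # E(X^2) - (E(X))^2

assert var_by_definition == var_by_shortcut   # exact agreement with fractions
sigma = math.sqrt(var_by_shortcut)            # standard deviation, as a float
print(float(var_by_shortcut))                 # 0.81
```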
Expectation is linear; variance is not (it is quadratic in the scaling constant and blind to shifts):
| Operation | Expectation | Variance |
|---|---|---|
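| Scale by $a$ | $E(aX) = aE(X)$ | $\operatorname{Var}(aX) = a^2\operatorname{Var}(X)$ |
| Shift by $b$ | $E(X+b) = E(X)+b$ | $\operatorname{Var}(X+b) = \operatorname{Var}(X)$ |
| Linear map | $E(aX+b) = aE(X)+b$ | $\operatorname{Var}(aX+b) = a^2\operatorname{Var}(X)$ |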
A shift of $b$ slides the whole distribution along the axis without changing its shape, so the spread, and hence the variance, is untouched. A scaling by $a$ multiplies every deviation from the mean by $a$, so squared deviations are multiplied by $a^2$.
We can confirm linearity numerically. Computing E(3X+2) directly:
E(3X+2)=5(0.1)+8(0.3)+11(0.4)+14(0.2)=0.5+2.4+4.4+2.8=10.1,
which matches $3E(X)+2 = 3(2.7)+2 = 10.1$. And $\operatorname{Var}(3X+2) = 3^2(0.81) = 7.29$.
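The same confirmation can be done for the variance by building the distribution of $Y = 3X+2$ explicitly. This is a sketch only; `mean_var` is a hypothetical helper, and the relabelling trick assumes $a \ne 0$ so that distinct values of $X$ stay distinct.

```python
from fractions import Fraction

pmf = {1: Fraction(1, 10), 2: Fraction(3, 10), 3: Fraction(2, 5), 4: Fraction(1, 5)}

def mean_var(pmf):
    """Return (E(X), Var(X)) for a dict mapping value -> probability."""
    mean = sum(x * p for x, p in pmf.items())
    var = sum(x ** 2 * p for x, p in pmf.items()) - mean ** 2
    return mean, var

a, b = 3, 2
# Distribution of Y = aX + b: relabel each value, keep its probability (needs a != 0).
pmf_y = {a * x + b: p for x, p in pmf.items()}

mean_x, var_x = mean_var(pmf)
mean_y, var_y = mean_var(pmf_y)

assert mean_y == a * mean_x + b      # E(aX + b) = aE(X) + b
assert var_y == a ** 2 * var_x       # Var(aX + b) = a^2 Var(X)
print(float(mean_y), float(var_y))   # 10.1 7.29
```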
A great deal of Further Statistics — and almost every multi-source modelling problem — rests on how E and Var behave when variables are combined. Expectation is always additive, whatever the dependence between the variables:
$$E(X+Y) = E(X)+E(Y), \qquad E(aX+bY) = aE(X)+bE(Y).$$
This holds because expectation is a sum (an integral in the continuous case), and sums distribute over addition regardless of any relationship between X and Y. Variance is more delicate. For independent X and Y,
$$\operatorname{Var}(X+Y) = \operatorname{Var}(X)+\operatorname{Var}(Y), \qquad \operatorname{Var}(X-Y) = \operatorname{Var}(X)+\operatorname{Var}(Y).$$
Note the second identity carefully: even for a difference, the variances add. Intuitively, subtracting an uncertain quantity injects just as much uncertainty as adding it, so under independence the spread of $X-Y$ equals the spread of $X+Y$. More generally, for independent variables and constants $a, b$,
$$\operatorname{Var}(aX+bY) = a^2\operatorname{Var}(X)+b^2\operatorname{Var}(Y),$$
where each coefficient is squared, exactly as in the single-variable rule $\operatorname{Var}(aX) = a^2\operatorname{Var}(X)$. A frequent source of confusion deserves emphasis: the doubled variable $2X$ and the sum $X_1+X_2$ of two independent copies of $X$ are not the same. The first scales a single outcome,
$$\operatorname{Var}(2X) = 2^2\operatorname{Var}(X) = 4\operatorname{Var}(X),$$
while the second adds two independent outcomes,
$$\operatorname{Var}(X_1+X_2) = \operatorname{Var}(X)+\operatorname{Var}(X) = 2\operatorname{Var}(X).$$
The means agree (E(2X)=E(X1+X2)=2E(X)), but the variances differ by a factor of two — averaging independent measurements reduces relative spread, which is precisely why repeating an experiment and taking a mean improves precision.
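The contrast between doubling one outcome and adding two independent copies can be made concrete on the running example. The sketch below (Python, with a hypothetical helper `mean_var`) builds both distributions exactly; for the sum, independence is what licenses multiplying the two probabilities.

```python
from fractions import Fraction
from itertools import product

pmf = {1: Fraction(1, 10), 2: Fraction(3, 10), 3: Fraction(2, 5), 4: Fraction(1, 5)}

def mean_var(pmf):
    """Return (E(X), Var(X)) for a dict mapping value -> probability."""
    mean = sum(x * p for x, p in pmf.items())
    return mean, sum(x ** 2 * p for x, p in pmf.items()) - mean ** 2

# 2X: a single outcome, doubled.
doubled = {2 * x: p for x, p in pmf.items()}

# X1 + X2: two independent copies, so joint probabilities are products.
summed = {}
for (x1, p1), (x2, p2) in product(pmf.items(), repeat=2):
    summed[x1 + x2] = summed.get(x1 + x2, 0) + p1 * p2

print(tuple(map(float, mean_var(doubled))))   # (5.4, 3.24)
print(tuple(map(float, mean_var(summed))))    # (5.4, 1.62)
```

The means coincide while the variance of the sum is half that of the doubled variable, matching $4\operatorname{Var}(X)$ against $2\operatorname{Var}(X)$.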
Worked illustration. Suppose $X$ (the running example) and an independent $Y$ have $E(X)=2.7$, $\operatorname{Var}(X)=0.81$ and $E(Y)=4$, $\operatorname{Var}(Y)=2$. Then $E(X+Y)=6.7$ and, by independence, $\operatorname{Var}(X+Y)=0.81+2=2.81$. For $T=2X-3Y$: $E(T)=2(2.7)-3(4)=-6.6$ and $\operatorname{Var}(T)=2^2(0.81)+3^2(2)=3.24+18=21.24$. Note that the minus sign in $T$ has no effect on the variance, which uses $(-3)^2=9$.
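The illustration quotes only the mean and variance of $Y$. To check the arithmetic by brute force, the sketch below invents a table for $Y$ (values 2, 4, 6 with probabilities 1/4, 1/2, 1/4) purely because it has $E(Y)=4$ and $\operatorname{Var}(Y)=2$; that table is an assumption of this example and does not come from the lesson.

```python
from fractions import Fraction
from itertools import product

pmf_x = {1: Fraction(1, 10), 2: Fraction(3, 10), 3: Fraction(2, 5), 4: Fraction(1, 5)}
# Hypothetical Y, chosen only so that E(Y) = 4 and Var(Y) = 2.
pmf_y = {2: Fraction(1, 4), 4: Fraction(1, 2), 6: Fraction(1, 4)}

def mean_var(pmf):
    """Return (E, Var) for a dict mapping value -> probability."""
    mean = sum(v * p for v, p in pmf.items())
    return mean, sum(v ** 2 * p for v, p in pmf.items()) - mean ** 2

# Distribution of T = 2X - 3Y under independence: P(X=x, Y=y) = P(X=x) P(Y=y).
pmf_t = {}
for (x, px), (y, py) in product(pmf_x.items(), pmf_y.items()):
    t = 2 * x - 3 * y
    pmf_t[t] = pmf_t.get(t, 0) + px * py

mean_t, var_t = mean_var(pmf_t)
print(float(mean_t), float(var_t))   # -6.6 21.24
```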
These combination rules reappear throughout 3S: the additivity of the Poisson (Lesson 2), the variance of a sample mean, and the chained approximations of Lesson 3 are all consequences of them.
The independence assumption is doing real work in the variance rules, and it is worth seeing what happens without it. For any two variables,
$$\operatorname{Var}(X+Y) = \operatorname{Var}(X)+\operatorname{Var}(Y)+2\operatorname{Cov}(X,Y), \qquad \operatorname{Cov}(X,Y) = E(XY)-E(X)E(Y).$$
The covariance measures how the two variables move together: it is positive when high X tends to accompany high Y, negative when high X accompanies low Y, and zero when there is no linear association. Independence forces E(XY)=E(X)E(Y), hence Cov(X,Y)=0, which is exactly why the cross-term vanishes and the variances simply add for independent variables.
A concrete computation makes this tangible. Suppose X and Y each take the values 0 and 1, with the joint distribution
| | Y=0 | Y=1 |
|---|---|---|
| X=0 | 0.1 | 0.3 |
| X=1 | 0.4 | 0.2 |
The marginal for X is P(X=0)=0.4, P(X=1)=0.6, so E(X)=0.6; the marginal for Y is P(Y=0)=0.5, P(Y=1)=0.5, so E(Y)=0.5. The only term contributing to E(XY) is the cell where both equal 1: E(XY)=1⋅1⋅0.2=0.2. Therefore
Cov(X,Y)=E(XY)−E(X)E(Y)=0.2−(0.6)(0.5)=0.2−0.3=−0.1.
The negative covariance says that, in this joint distribution, X=1 tends to coincide with Y=0 — and indeed the largest probability, 0.4, sits in exactly that cell. Because the covariance is non-zero, X and Y are not independent, and you could not add their variances directly: you would need the 2Cov(X,Y)=−0.2 correction. (A cleaner check of independence: under independence the top-left cell would be P(X=0)P(Y=0)=0.4×0.5=0.2, but the table gives 0.1 — they differ, confirming dependence.) Although full two-variable distributions are at the edge of the A-Level Further specification, recognising when the additive variance rule applies — and being able to articulate that it requires independence — is squarely examinable and is precisely the reasoning the variance algebra is built on.
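The covariance computation, and the corrected variance of $X+Y$, can be reproduced from the joint table in a few lines. This is a sketch with a hypothetical helper `expect`; it also computes $\operatorname{Var}(X+Y)$ directly so the $2\operatorname{Cov}(X,Y)$ correction can be seen to be necessary.

```python
from fractions import Fraction

# Joint table from the text: (x, y) -> P(X = x, Y = y).
joint = {(0, 0): Fraction(1, 10), (0, 1): Fraction(3, 10),
         (1, 0): Fraction(4, 10), (1, 1): Fraction(2, 10)}

def expect(g):
    """E(g(X, Y)) taken over the joint distribution."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

e_x, e_y = expect(lambda x, y: x), expect(lambda x, y: y)
cov = expect(lambda x, y: x * y) - e_x * e_y
var_x = expect(lambda x, y: x ** 2) - e_x ** 2
var_y = expect(lambda x, y: y ** 2) - e_y ** 2
var_sum = expect(lambda x, y: (x + y) ** 2) - (e_x + e_y) ** 2   # Var(X+Y) directly

assert var_sum == var_x + var_y + 2 * cov   # general rule holds
assert var_sum != var_x + var_y             # naive addition fails here
print(float(cov), float(var_sum))           # -0.1 0.29
```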
Two further summaries round out the description of a discrete distribution. The mode is the value of $x$ carrying the largest probability; a distribution is bimodal (or multimodal) when two or more values tie for the maximum. The median is a value $m$ with $P(X \le m) \ge 0.5$ and $P(X \ge m) \ge 0.5$, the "middle" value once the probability is accumulated. For the parameter-finding worked example in this lesson (values $0, 1, 2, 3$ carrying probabilities $k, 2k, 3k, 4k$ with $k = 0.1$, that is $0.1, 0.2, 0.3, 0.4$), the mode is $x=3$ (probability 0.4). Accumulating probability gives $P(X \le 0)=0.1$, $P(X \le 1)=0.3$, $P(X \le 2)=0.6$, so the median is $m=2$, the first value at which the running total reaches or passes 0.5.
That running total is the cumulative distribution function (CDF),
$$F(x) = P(X \le x) = \sum_{x_i \le x} P(X = x_i),$$
a step function that climbs from 0 to 1, jumping at each value of X by the size of its probability. The CDF and the probability mass function carry the same information: you recover an individual probability as the size of the jump,
$$P(X = x) = F(x) - F(x^-),$$
where $F(x^-)$ is the value of $F$ just to the left of $x$. For the example, $F$ jumps by $0.1, 0.2, 0.3, 0.4$ at $x = 0, 1, 2, 3$ respectively, reaching exactly 1 at $x=3$. The CDF is the natural tool for "at most" and "more than" questions, for instance $P(X>2) = 1-F(2) = 1-0.6 = 0.4$, and it is the discrete shadow of the continuous CDF that dominates the second half of this course.
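For the distribution with probabilities 0.1, 0.2, 0.3, 0.4 on the values 0, 1, 2, 3, the running total, mode and median can be read off programmatically. A minimal sketch, assuming Python and taking the median as the first value whose cumulative probability reaches 0.5:

```python
from fractions import Fraction
from itertools import accumulate

values = [0, 1, 2, 3]
probs = [Fraction(n, 10) for n in (1, 2, 3, 4)]

# F(x) at each jump point: running total of the probabilities.
cdf = dict(zip(values, accumulate(probs)))

# Mode: value with the largest probability (max returns only the first if there is a tie).
mode = max(zip(values, probs), key=lambda vp: vp[1])[0]

# Median: first value at which the running total reaches or passes 1/2.
median = next(x for x in values if cdf[x] >= Fraction(1, 2))

p_more_than_2 = 1 - cdf[2]                  # P(X > 2) = 1 - F(2)
print(mode, median, float(p_more_than_2))   # 3 2 0.4
```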
A random variable X has the distribution below, where k is constant.
| x | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| P(X=x) | k | 2k | 3k | 4k |
Find $k$, $E(X)$, $\operatorname{Var}(X)$ and $E(2X^2-3X+1)$.
k+2k+3k+4k=1⟹10k=1⟹k=0.1. (M1 set sum of probabilities =1; A1 k=0.1.)
E(X)=0(0.1)+1(0.2)+2(0.3)+3(0.4)=2.0. (M1 ∑xP(X=x); A1 E(X)=2.0.)
$E(X^2) = 0(0.1)+1(0.2)+4(0.3)+9(0.4) = 0+0.2+1.2+3.6 = 5.0$. (M1 $\sum x^2 P(X=x)$; A1 $E(X^2)=5.0$.)
$\operatorname{Var}(X) = 5.0-2.0^2 = 1.0$. (M1 apply $E(X^2)-(E(X))^2$; A1 $\operatorname{Var}(X)=1.0$.)
$E(2X^2-3X+1) = 2E(X^2)-3E(X)+1 = 2(5.0)-3(2.0)+1 = 5$. (M1 expand using linearity; A1 $=5$.)
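The whole solution can be replayed as a check. In this sketch (Python; the names `weights` and `expect` are hypothetical) the normalisation step fixes $k$, and the final expectation is computed both by linearity and directly from the polynomial.

```python
from fractions import Fraction

weights = {0: 1, 1: 2, 2: 3, 3: 4}        # P(X = x) = weight * k
k = Fraction(1, sum(weights.values()))     # normalisation: 10k = 1
pmf = {x: w * k for x, w in weights.items()}

def expect(g):
    """E(g(X)) for the table above."""
    return sum(g(x) * p for x, p in pmf.items())

e_x = expect(lambda x: x)
e_x2 = expect(lambda x: x ** 2)
var_x = e_x2 - e_x ** 2
direct = expect(lambda x: 2 * x ** 2 - 3 * x + 1)   # apply the polynomial to each value

assert direct == 2 * e_x2 - 3 * e_x + 1             # linearity gives the same answer
print(k, e_x, e_x2, var_x, direct)                  # 1/10 2 5 1 5
```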
The number of faults X on a metre of cable has E(X)=1.5 and Var(X)=0.75. The inspection cost in pounds is C=4X+10. Find E(C), Var(C) and the standard deviation of C.
E(C)=4E(X)+10=4(1.5)+10=16. (M1 E(aX+b)=aE(X)+b; A1 £16.)
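$\operatorname{Var}(C) = 4^2\operatorname{Var}(X) = 16(0.75) = 12$. (M1 $\operatorname{Var}(aX+b) = a^2\operatorname{Var}(X)$, with the $+10$ ignored; A1 12.)
$\operatorname{SD}(C) = \sqrt{12} = 2\sqrt{3} \approx 3.46$. (A1 $\approx$ £3.46.)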
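When only $E(X)$ and $\operatorname{Var}(X)$ are known, the linear-map rules are all that is needed. A small sketch, with a hypothetical function `linear_moments` wrapping the three formulas:

```python
import math

def linear_moments(mean, var, a, b):
    """Return (E(aX+b), Var(aX+b), SD(aX+b)) from E(X) and Var(X)."""
    new_var = a ** 2 * var                # the shift b plays no part
    return a * mean + b, new_var, math.sqrt(new_var)

# Cable example: C = 4X + 10 with E(X) = 1.5 and Var(X) = 0.75.
e_c, var_c, sd_c = linear_moments(mean=1.5, var=0.75, a=4, b=10)
print(e_c, var_c, round(sd_c, 2))         # 16.0 12.0 3.46
```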
A variable X takes the values 0,1,2 with P(X=1)=0.5. Given E(X)=1, find the full distribution and Var(X).
Let P(X=0)=a and P(X=2)=c. Then
a+0.5+c=1⟹a+c=0.5. (M1 normalisation equation.)
E(X)=0⋅a+1(0.5)+2c=1⟹0.5+2c=1⟹c=0.25. (M1 expectation equation; A1 c=0.25, hence a=0.25.)
So P(X=0)=0.25, P(X=1)=0.5, P(X=2)=0.25. Then
$E(X^2) = 0+1(0.5)+4(0.25) = 1.5$, so $\operatorname{Var}(X) = 1.5-1^2 = 0.5$. (M1 $E(X^2)$; A1 $\operatorname{Var}(X)=0.5$.)
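The two equations used above (normalisation and the given mean) can be solved in the same order in code. This is an illustrative sketch; the variable names are assumptions.

```python
from fractions import Fraction

p1 = Fraction(1, 2)          # given P(X = 1)
target_mean = Fraction(1)    # given E(X)

# E(X) = 0*a + 1*p1 + 2*c, so c = (E(X) - p1) / 2; then a follows from normalisation.
c = (target_mean - p1) / 2
a = 1 - p1 - c
pmf = {0: a, 1: p1, 2: c}

# A valid distribution needs every probability in [0, 1] and a total of 1.
assert all(0 <= p <= 1 for p in pmf.values()) and sum(pmf.values()) == 1

var_x = sum(x ** 2 * p for x, p in pmf.items()) - target_mean ** 2
print(a, c, var_x)           # 1/4 1/4 1/2
```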
(Specimen-style — not from any past paper.) The discrete random variable X has probability distribution
| x | −1 | 0 | 1 | 2 |
|---|---|---|---|---|
| P(X=x) | 0.2 | a | 0.3 | b |

Given that $E(X)=0.6$:

(a) show that $a+b=0.5$ and find a second equation in $a$ and $b$; hence find $a$ and $b$;

(b) find $\operatorname{Var}(X)$;

(c) find $\operatorname{Var}(5-2X)$.
Model solution.
(a) Probabilities sum to 1: 0.2+a+0.3+b=1⇒a+b=0.5. Expectation:
E(X)=(−1)(0.2)+0⋅a+1(0.3)+2b=0.1+2b=0.6⟹b=0.25,
so a=0.25. Both lie in [0,1], so the distribution is valid.
(b) $E(X^2) = (-1)^2(0.2)+0+1^2(0.3)+2^2(0.25) = 0.2+0.3+1.0 = 1.5$, so
$$\operatorname{Var}(X) = 1.5-0.6^2 = 1.5-0.36 = 1.14.$$
(c) Variance ignores the shift and squares the scale: $\operatorname{Var}(5-2X) = (-2)^2\operatorname{Var}(X) = 4(1.14) = 4.56$.
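As a check on the model solution, the sketch below solves for $a$ and $b$ and then computes $\operatorname{Var}(5-2X)$ from the transformed table rather than from the rule, so the two routes can be compared. The helper `variance` and the variable names are hypothetical.

```python
from fractions import Fraction

p_minus1, p_1, mean = Fraction(1, 5), Fraction(3, 10), Fraction(3, 5)

# E(X) = -p_minus1 + p_1 + 2b = mean gives b; normalisation then gives a.
b = (mean + p_minus1 - p_1) / 2
a = 1 - p_minus1 - p_1 - b
pmf = {-1: p_minus1, 0: a, 1: p_1, 2: b}

def variance(pmf):
    """Var via E(X^2) - (E(X))^2 for a dict mapping value -> probability."""
    m = sum(x * p for x, p in pmf.items())
    return sum(x ** 2 * p for x, p in pmf.items()) - m ** 2

var_x = variance(pmf)
var_t = variance({5 - 2 * x: p for x, p in pmf.items()})   # Var(5 - 2X) computed directly

assert var_t == (-2) ** 2 * var_x
print(a, b, float(var_x), float(var_t))   # 1/4 1/4 1.14 4.56
```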
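Question. The random variable $X$ has $E(X)=4$ and $\operatorname{Var}(X)=9$. Find $E(X^2)$ and $\operatorname{Var}(2X-1)$, and explain why $\operatorname{Var}(2X-1) \ne 2\operatorname{Var}(X)$.
Mid-band response. $E(X^2) = 9+16 = 25$. $\operatorname{Var}(2X-1) = 4 \times 9 = 36$. It is not $2 \times 9$ because you square the 2.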
Examiner-style commentary: The two numerical answers are correct and would earn the method and accuracy marks. The explanation is too thin for the AO2 mark — "you square the 2" states the rule rather than the reason, and the −1 is not addressed.
Stronger response. Rearranging $\operatorname{Var}(X) = E(X^2)-(E(X))^2$ gives $E(X^2) = \operatorname{Var}(X)+(E(X))^2 = 9+16 = 25$. For the transformation, $\operatorname{Var}(2X-1) = 2^2\operatorname{Var}(X) = 4(9) = 36$; the $-1$ is a shift and does not affect spread. It is not $2\operatorname{Var}(X)$ because variance scales with the square of the multiplier.
Examiner-style commentary: Fully correct with the identity quoted and rearranged, and the shift correctly dismissed. The final sentence earns the reasoning mark. To reach the top band the explanation could connect the $a^2$ factor to squared deviations.
Top-band response. From $\operatorname{Var}(X) = E(X^2)-(E(X))^2$, $E(X^2) = 9+4^2 = 25$. Writing $Y = 2X-1$, $\operatorname{Var}(Y) = E((Y-E(Y))^2)$; since $Y-E(Y) = 2(X-E(X))$, the deviation is scaled by 2 and the squared deviation by $2^2 = 4$, giving $\operatorname{Var}(Y) = 4\operatorname{Var}(X) = 36$. The additive constant $-1$ cancels in $Y-E(Y)$, so it cannot change the variance. Hence $\operatorname{Var}(2X-1) = 36 \ne 18 = 2\operatorname{Var}(X)$: variance is quadratic in the scale factor, not linear.
Examiner-style commentary: This is exemplary. The candidate derives the $a^2$ factor from first principles via $Y-E(Y) = 2(X-E(X))$, explains why the shift cancels, and states the general principle. Every available AO1 and AO2 mark is secured.
The identity $\operatorname{Var}(X) = E(X^2)-(E(X))^2$ is the second instance of a deeper pattern. Define the raw moments $\mu_k = E(X^k)$ (so $\mu_0 = 1$ and $\mu_1 = \mu$) and the central moments $\nu_k = E((X-\mu)^k)$. Then $\nu_2$ is the variance, $\nu_3$ governs skewness and $\nu_4$ governs kurtosis, and each $\nu_k$ expands into the raw moments by the binomial theorem:
$$\nu_k = E((X-\mu)^k) = \sum_{j=0}^{k} \binom{k}{j} (-\mu)^{k-j} \mu_j.$$
Setting $k=2$ recovers exactly $\nu_2 = \mu_2-\mu^2$. A short STEP-flavoured challenge: show that the central third moment is
$$E((X-\mu)^3) = E(X^3)-3\mu E(X^2)+2\mu^3,$$
and verify it on the running example (where μ=2.7). A second, genuinely useful result is the variance of a sum: for any X,Y,
$$\operatorname{Var}(X+Y) = \operatorname{Var}(X)+\operatorname{Var}(Y)+2\operatorname{Cov}(X,Y), \qquad \operatorname{Cov}(X,Y) = E(XY)-E(X)E(Y),$$
which collapses to additive variances precisely when Cov(X,Y)=0 — guaranteed by independence, since then E(XY)=E(X)E(Y). The converse is false: uncorrelated does not imply independent, a favourite Oxbridge interview trap.
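Both remarks in this passage lend themselves to a numerical check. The sketch below first verifies the third-central-moment identity on the running example, then illustrates "uncorrelated does not imply independent" with a standard textbook pair ($X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$); that pair is an assumption of this example and is not taken from the lesson.

```python
from fractions import Fraction

pmf = {1: Fraction(1, 10), 2: Fraction(3, 10), 3: Fraction(2, 5), 4: Fraction(1, 5)}

def raw_moment(pmf, k):
    """Return E(X^k) for a dict mapping value -> probability."""
    return sum(x ** k * p for x, p in pmf.items())

# Third central moment: direct definition versus the expanded identity.
mu = raw_moment(pmf, 1)
third_direct = sum((x - mu) ** 3 * p for x, p in pmf.items())
third_formula = raw_moment(pmf, 3) - 3 * mu * raw_moment(pmf, 2) + 2 * mu ** 3
assert third_direct == third_formula
print(float(third_direct))   # -0.144

# Uncorrelated but dependent: X uniform on {-1, 0, 1}, Y = X^2, so E(XY) = E(X^3).
u = {x: Fraction(1, 3) for x in (-1, 0, 1)}
cov = raw_moment(u, 3) - raw_moment(u, 1) * raw_moment(u, 2)
p_x0_and_y0 = u[0]            # Y = 0 happens exactly when X = 0
p_x0_times_p_y0 = u[0] * u[0]
assert cov == 0 and p_x0_and_y0 != p_x0_times_p_y0   # zero covariance, yet not independent
```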
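P1. $X$ has $P(X=x) = \frac{x}{15}$ for $x = 1, 2, 3, 4, 5$. Find $E(X)$ and $\operatorname{Var}(X)$.
$E(X) = \frac{1}{15}(1+4+9+16+25) = \frac{55}{15} = \frac{11}{3} \approx 3.667$. $E(X^2) = \frac{1}{15}(1+8+27+64+125) = \frac{225}{15} = 15$. $\operatorname{Var}(X) = 15-\left(\frac{11}{3}\right)^2 = 15-\frac{121}{9} = \frac{14}{9} \approx 1.556$.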
P2. Given $E(X)=3$ and $E(X^2)=13$, find (a) $\operatorname{Var}(X)$; (b) $E((X-3)^2)$; (c) $\operatorname{Var}(4-3X)$.
(a) $13-9 = 4$. (b) $E((X-\mu)^2) = \operatorname{Var}(X) = 4$. (c) $(-3)^2(4) = 36$.
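P3. A fair four-sided die shows 1, 2, 3, 4. Let $X$ be the score and $Y = X^2$. Find $E(Y)$ and $\operatorname{Var}(Y)$.
$E(Y) = E(X^2) = \frac{1}{4}(1+4+9+16) = \frac{30}{4} = 7.5$. $E(Y^2) = E(X^4) = \frac{1}{4}(1+16+81+256) = \frac{354}{4} = 88.5$. $\operatorname{Var}(Y) = 88.5-7.5^2 = 88.5-56.25 = 32.25$.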
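P3 is a direct use of $E(g(X))$ with $g(x)=x^2$ and $g(x)=x^4$, so the distribution of $Y$ is never needed. A brief sketch in Python (variable names are assumptions):

```python
from fractions import Fraction

die = {x: Fraction(1, 4) for x in (1, 2, 3, 4)}    # fair four-sided die

e_y = sum(x ** 2 * p for x, p in die.items())       # E(Y) = E(X^2)
e_y2 = sum(x ** 4 * p for x, p in die.items())      # E(Y^2) = E(X^4)
var_y = e_y2 - e_y ** 2
print(float(e_y), float(e_y2), float(var_y))        # 7.5 88.5 32.25
```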
P4. X takes values 1,2,3 with probabilities p,q,p (symmetric). Given Var(X)=0.5, find p and q.
By symmetry $E(X)=2$. Normalisation: $2p+q=1$. $E(X^2) = p+4q+9p = 10p+4q$. Then $\operatorname{Var}(X) = 10p+4q-4 = 0.5$, so $10p+4q = 4.5$. Subtract $4(2p+q) = 4$: $2p = 0.5 \Rightarrow p = 0.25$, $q = 0.5$.
P5. The number of heads X in two tosses of a biased coin (P(head)=0.6) has distribution P(0)=0.16, P(1)=0.48, P(2)=0.36. Verify E(X)=1.2 and find Var(X); compare with np and np(1−p).
$E(X) = 0(0.16)+1(0.48)+2(0.36) = 1.2 = np = 2(0.6)$. $E(X^2) = 0+0.48+4(0.36) = 1.92$. $\operatorname{Var}(X) = 1.92-1.44 = 0.48 = np(1-p) = 2(0.6)(0.4)$. The table reproduces the binomial moments exactly.
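The comparison in P5 can be reproduced by generating the table from the binomial formula and computing the moments from it. A sketch using `math.comb` from the Python standard library; the variable names are assumptions.

```python
from fractions import Fraction
from math import comb

n, p = 2, Fraction(3, 5)   # two tosses, P(head) = 0.6

# Binomial probabilities P(X = r) = C(n, r) p^r (1 - p)^(n - r).
pmf = {r: comb(n, r) * p ** r * (1 - p) ** (n - r) for r in range(n + 1)}

mean = sum(r * q for r, q in pmf.items())
var = sum(r ** 2 * q for r, q in pmf.items()) - mean ** 2

assert mean == n * p               # E(X) = np
assert var == n * p * (1 - p)      # Var(X) = np(1 - p)
print([float(q) for q in pmf.values()], float(mean), float(var))   # [0.16, 0.48, 0.36] 1.2 0.48
```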
Aligned to AQA A-Level Further Mathematics 7367, Paper 3 Statistics (7367/3S): discrete random variables, $E(X)$, $E(g(X))$, $\operatorname{Var}(X) = E(X^2)-(E(X))^2$, and the linear-transformation algebra. The same content appears in Edexcel Further Statistics 1 (discrete random variables, E(X)/Var(X) of linear functions) and OCR(A)/OCR(MEI) Statistics modules; the formulae and conventions are identical across boards.
| Quantity | Discrete formula | Key behaviour under aX+b |
|---|---|---|
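| $E(X)$ | $\sum x\,P(X=x)$ | $aE(X)+b$ (linear) |
| $E(g(X))$ | $\sum g(x)\,P(X=x)$ | apply $g$ then weight |
| $\operatorname{Var}(X)$ | $E(X^2)-(E(X))^2$ | $a^2\operatorname{Var}(X)$ (shift ignored) |
| $\sigma$ | $\sqrt{\operatorname{Var}(X)}$ | $\lvert a \rvert \sigma$ |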
graph LR
A["Distribution table x, P(X=x)"] --> B["E(X) = sum x P"]
A --> C["E(X^2) = sum x^2 P"]
B --> D["Var(X) = E(X^2) - (E(X))^2"]
C --> D
D --> E["SD = sqrt(Var)"]
B --> F["Transform: E(aX+b)=aE(X)+b"]
D --> G["Transform: Var(aX+b)=a^2 Var(X)"]
Recap: build a clean table, compute $E(X)$ and $E(X^2)$, then $\operatorname{Var}(X) = E(X^2)-(E(X))^2$. Expectation is linear; variance squares the scale and ignores shifts. Master this and every distribution in 7367/3S becomes a special case.