Every hypothesis test is a decision rule, and every decision rule can be wrong in two distinct ways: it can raise a false alarm (reject a true null) or it can miss a real effect (fail to reject a false null). These are the Type I and Type II errors, with probabilities α and β. Their interplay — and the power 1−β that measures a test's ability to detect what is really there — is the heart of experimental design. This lesson defines the two errors precisely, derives β and the power from the sampling distribution, exposes the α–β trade-off, and shows how to size a study for a target power.
This is Paper 3 optional content: Statistics (7367/3S), the applied option a Further-Maths student may choose alongside one of Mechanics (7367/3M) or Discrete (7367/3D). Paper 3 is 2 hours, 100 marks, weighted AO1 40% / AO2 25% / AO3 35%. This topic is unusually AO2/AO3-heavy: defining the errors and interpreting power in context is reasoning and communication (AO2), while computing β for a stated alternative is genuine problem-solving (AO3). It builds directly on the hypothesis-testing machinery of the previous lesson (critical regions, significance levels) and on A-Level Maths Statistics (the normal model for $\bar{X}$ and the standardising transformation $Z = (\bar{X} - \mu)/(\sigma/\sqrt{n})$).
A test decides between a null hypothesis H0 and an alternative H1 on the basis of a sample. Reality is one of two states (H0 true or false) and our decision is one of two actions (reject or do not reject), giving the decision table:
| | H0 is true | H0 is false |
|---|---|---|
| Do not reject H0 | Correct decision (prob. 1−α) | Type II error (prob. β) |
| Reject H0 | Type I error (prob. α) | Correct decision — power (prob. 1−β) |
A Type I error is rejecting H0 when it is in fact true — a "false positive." Because we reject precisely when the statistic falls in the critical region, and we design the critical region to have probability α under H0,
P(Type I error)=P(reject H0∣H0 true)=α=significance level.
The Type I error rate is therefore something we choose: setting α=0.05 is a deliberate decision to tolerate a 5% false-alarm rate. For a discrete test statistic the actual size may be a little below the nominal α (you cannot always hit 5% exactly), but the principle is unchanged.
A Type II error is failing to reject H0 when it is in fact false — a "false negative":
P(Type II error)=P(do not reject H0∣H0 false)=β.
Crucially, β is not a single number we choose. "H0 false" is not one situation but a whole family — the true parameter could be anywhere in the range covered by H1. So β depends on how false H0 is (the true value of the parameter), on the sample size n, on the population variability σ, and on the chosen α. To compute a numerical β you must be told (or must assume) a specific true value.
Exam Tip: Type I error = rejecting a true H0; Type II error = not rejecting a false H0. A robust mnemonic: Type I = "crying wolf" (a false alarm when there is no wolf); Type II = "missing the wolf" (the wolf is there but you fail to sound the alarm).
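The two definitions can be checked empirically. The sketch below is a minimal simulation, not part of the lesson's method: it assumes an upper-tailed z-test with illustrative values (μ0 = 50, σ = 4, n = 16, α = 0.05, and a true mean of 52 for the "H0 false" case), and the helper name `rejection_rate` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def rejection_rate(true_mu, mu0=50.0, sigma=4.0, n=16, alpha=0.05,
                   trials=20_000, seed=1):
    """Estimate P(reject H0) by simulation for an upper-tailed z-test of H0: mu = mu0.

    All default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    se = sigma / sqrt(n)                                   # standard error of the sample mean
    critical = mu0 + NormalDist().inv_cdf(1 - alpha) * se  # boundary fixed under H0
    rejections = 0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(true_mu, sigma) for _ in range(n)) / n
        if sample_mean > critical:
            rejections += 1
    return rejections / trials


type_i_rate = rejection_rate(true_mu=50.0)   # H0 true: every rejection is a Type I error
power_est = rejection_rate(true_mu=52.0)     # H0 false: every rejection is a correct detection
type_ii_rate = 1 - power_est                 # H0 false: every non-rejection is a Type II error
print(f"Type I rate ~ {type_i_rate:.3f}, Type II rate ~ {type_ii_rate:.3f}")
```

The simulated Type I rate should sit close to the chosen α, while the Type II rate depends on which true mean is assumed.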
For a fixed sample size the two error rates pull against each other. Making the test more stringent — shrinking α by pushing the critical value further out — makes it harder to reject H0, so a false null is missed more often and β rises. Conversely, a generous α catches more true effects (small β) but raises the false-alarm rate. You cannot drive both to zero by tuning the critical value alone.
The escape from this trade-off is more information: increasing n shrinks the standard error $\sigma/\sqrt{n}$, so the sampling distributions under H0 and under the true value overlap less, and both α (held fixed) and β (now smaller) can be controlled simultaneously. This is why "collect a larger sample" is the universal remedy for a test that lacks discriminating power.
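Both halves of this argument can be seen numerically. The following is a hedged sketch for an upper-tailed z-test with known σ; the numbers (μ0 = 50, true mean 52, σ = 4) and the helper name `beta` are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal


def beta(alpha, n, mu0=50.0, mu1=52.0, sigma=4.0):
    """Type II error probability for an upper-tailed z-test when the true mean is mu1."""
    se = sigma / sqrt(n)
    critical = mu0 + Z.inv_cdf(1 - alpha) * se   # boundary set under H0
    return NormalDist(mu1, se).cdf(critical)     # chance of NOT rejecting at the true mean


# Fixed n = 16: tightening alpha raises beta.
betas_by_alpha = {a: beta(a, n=16) for a in (0.10, 0.05, 0.01)}

# Fixed alpha = 0.05: increasing n lowers beta.
betas_by_n = {n: beta(0.05, n) for n in (16, 32, 64)}

for a, b in betas_by_alpha.items():
    print(f"n=16, alpha={a:.2f}: beta={b:.3f}")
for n, b in betas_by_n.items():
    print(f"alpha=0.05, n={n}: beta={b:.3f}")
```

The first sweep moves only the critical value, so one error rate is traded for the other; the second shrinks the standard error, which is what lets β fall without touching α.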
The power is the probability of correctly rejecting H0 when it is false:
Power=1−β=P(reject H0∣H0 false).
High power means the test reliably detects a genuine effect. Like β, power is a function of the true parameter value, not a single number.
| Factor | Effect on power | Why |
|---|---|---|
| Larger sample size n | Increases power | smaller standard error $\sigma/\sqrt{n}$ ⇒ less overlap |
| Larger significance level α | Increases power | critical value closer in ⇒ easier to reject (but more Type I error) |
| Larger true effect size ∣μ1−μ0∣ | Increases power | distributions pulled further apart |
| Smaller population variance $\sigma^2$ | Increases power | tighter sampling distribution ⇒ less overlap |
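For the one-tailed normal test with known σ, all four rows can be read off a single expression. This is a sketch under stated assumptions: $H_0\colon \mu = \mu_0$ against $H_1\colon \mu > \mu_0$, true mean $\mu_1 > \mu_0$, and $z_\alpha$ denoting the upper-α point of $N(0,1)$.

```latex
\text{Power}
  = P\!\left(\bar{X} > \mu_0 + z_\alpha \frac{\sigma}{\sqrt{n}} \,\middle|\, \mu = \mu_1\right)
  = 1 - \Phi\!\left(z_\alpha - \frac{(\mu_1 - \mu_0)\sqrt{n}}{\sigma}\right)
```

Increasing n or the effect size, or decreasing σ, makes the subtracted term larger and so raises the power; increasing α lowers $z_\alpha$, which has the same effect.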
The recipe never changes: (1) find the critical region using the distribution under H0 (this fixes the boundary at significance level α); (2) find the probability of not landing in that region using the distribution at the true parameter value; that probability is β; (3) the power is 1−β. The two distributions in steps (1) and (2) are different — one centred on θ0, the other on the true θ — and β measures their overlap. Picturing two bell curves, one for H0 and one for the true value, with the critical boundary between them, makes every β calculation transparent: β is the area of the true-value curve that falls on the "do not reject" side of the boundary.
Test H0:μ=50 against H1:μ>50 with σ=4, n=16, α=0.05. Find β and the power when the true mean is μ=52.
Step 1: critical region under H0. The standard error is $\sigma/\sqrt{n} = 4/\sqrt{16} = 1$, so under H0, $\bar{X} \sim N(50, 1^2)$. For a one-tailed test at 5% we reject when
$$\bar{X} > 50 + 1.645 \times 1 = 51.645.$$
(M1 standard error; M1 critical value; A1 51.645)
Step 2: β at the true mean. If μ=52 then $\bar{X} \sim N(52, 1^2)$, and a Type II error is failing to reject, i.e. $\bar{X} \le 51.645$:
$$\beta = P(\bar{X} \le 51.645 \mid \mu = 52) = P\left(Z \le \frac{51.645 - 52}{1}\right) = P(Z \le -0.355)$$
(M1 standardise at true μ)
$$= 1 - \Phi(0.355) = 1 - 0.6387 = 0.3613.$$
(A1)
Step 3: power.
$$\text{Power} = 1 - \beta = 1 - 0.3613 = 0.6387.$$
(A1)
There is about a 63.9% chance of detecting the shift from 50 to 52. (M1 marks for the standard error, the critical value and standardising at the true mean; A1 marks for 51.645, β=0.361 and power 0.639. The single most common error is to standardise the critical value using μ0=50 again — you must use the true μ=52.)
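The three steps can be reproduced with Python's standard library as a check on the arithmetic. This is a minimal sketch rather than an exam method; the variable names are illustrative, and `wrong_beta` is included only to show what the common slip produces.

```python
from math import sqrt
from statistics import NormalDist

mu0, mu_true, sigma, n, alpha = 50.0, 52.0, 4.0, 16, 0.05

se = sigma / sqrt(n)                                    # step 1: standard error (= 1 here)
critical = mu0 + NormalDist().inv_cdf(1 - alpha) * se   # reject H0 when the sample mean exceeds this
beta = NormalDist(mu_true, se).cdf(critical)            # step 2: standardise at the TRUE mean
power = 1 - beta                                        # step 3

# The common slip: standardising at mu0 again simply returns 1 - alpha.
wrong_beta = NormalDist(mu0, se).cdf(critical)

print(f"critical value = {critical:.3f}")
print(f"beta = {beta:.3f}, power = {power:.3f}")
print(f"mistaken 'beta' using mu0 = {wrong_beta:.3f}")
```

The computed values agree with the worked solution to three decimal places, and the mistaken calculation gives 0.95, which is just 1−α and says nothing about the true mean.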
A coin is tested for fairness by tossing it n=20 times and counting heads X. Under H0:p=0.5, X∼B(20,0.5); the critical region is X≤5 or X≥15. (a) Find the size (actual significance level) of the test. (b) Find β and the power when in truth p=0.7.
(a) Using B(20,0.5): P(X≤5)=0.02069 and by symmetry P(X≥15)=0.02069, so
α=P(X≤5)+P(X≥15)=2(0.02069)=0.0414.(M1 both tails; A1 0.0414)
(b) With p=0.7, X∼B(20,0.7). A Type II error is landing outside the critical region, i.e. 6≤X≤14:
β=P(6≤X≤14∣p=0.7)=P(X≤14)−P(X≤5).(M1 set-up at true p)
From B(20,0.7) tables, P(X≤14)=0.5836 and P(X≤5)=0.00004 (negligible), so
β=0.5836−0.0000=0.5836,Power=1−0.5836=0.4164.(A1 β; A1 power)
The power is only about 0.42: with just 20 tosses this test misses a substantial bias (p=0.7) more often than it detects it — a vivid argument for a larger sample. (M1 for both tails in (a); M1 for evaluating the complementary region at the true p in (b); A1s for the numerical values. Note α=0.0414, below the nominal 5%, because X is discrete.)
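The table look-ups can be cross-checked by summing binomial probabilities directly. The sketch below uses only `math.comb`; the helper names `binom_pmf` and `prob_in` are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_in(values, n, p):
    """P(X takes one of the given values) for X ~ B(n, p)."""
    return sum(binom_pmf(k, n, p) for k in values)


n = 20
critical_region = [k for k in range(n + 1) if k <= 5 or k >= 15]
acceptance_region = [k for k in range(n + 1) if 6 <= k <= 14]

size = prob_in(critical_region, n, 0.5)     # (a) actual significance level under H0
beta = prob_in(acceptance_region, n, 0.7)   # (b) Type II error at the true p
power = 1 - beta

print(f"size = {size:.4f}, beta = {beta:.4f}, power = {power:.4f}")
```

Part (a) uses p = 0.5 because the size is defined under H0, while part (b) switches to p = 0.7, mirroring the two-distribution recipe.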
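A power function π(θ) plots the power against the true value θ of the parameter. Its shape is diagnostic: the curve takes the value α at θ0 and rises towards 1 as the true value moves further from θ0 in the direction of H1.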
The operating-characteristic (OC) curve plots β(θ)=1−π(θ), so it is simply the power curve flipped vertically: it starts at 1−α near θ0 and falls towards 0 as the effect grows. The two carry the same information.
These curves make the factors affecting power visible. A larger n (smaller standard error) makes the power curve climb more steeply, so even a modest departure from θ0 is detected with high probability — the test becomes more discriminating. A larger α lifts the whole power curve (including its value α at θ0), buying detection power at the price of more false alarms. In industrial acceptance sampling the OC curve is the working tool: it shows, for each true defect rate, the probability a batch is accepted, and the supplier (producer's risk α) and customer (consumer's risk β) negotiate the sampling plan by shaping this curve.
```mermaid
graph LR
    A["True mean μ"] --> B["Compute P(reject | μ) = power"]
    B --> C["Plot power vs μ → power curve<br/>passes through (μ₀, α), rises to 1"]
    B --> D["Plot β = 1 − power vs μ → OC curve<br/>starts at 1−α, falls to 0"]
```
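The flow in the diagram can be sketched in a few lines: compute the power at each candidate true mean, then flip it to get the OC curve. The test parameters below (μ0 = 50, σ = 4, n = 16, α = 0.05, upper-tailed) and the function name `power_at` are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_at(mu, mu0=50.0, sigma=4.0, n=16, alpha=0.05):
    """P(reject H0 | true mean mu) for an upper-tailed z-test with known sigma."""
    se = sigma / sqrt(n)
    critical = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    return 1 - NormalDist(mu, se).cdf(critical)


true_means = [50.0, 50.5, 51.0, 51.5, 52.0, 53.0, 54.0]
power_curve = [(mu, power_at(mu)) for mu in true_means]   # passes through (mu0, alpha)
oc_curve = [(mu, 1 - pw) for mu, pw in power_curve]       # beta = 1 - power

for (mu, pw), (_, b) in zip(power_curve, oc_curve):
    print(f"mu={mu:4.1f}  power={pw:.3f}  beta={b:.3f}")
```

Feeding these pairs to any plotting tool gives the two curves; rerunning with a larger `n` shows the steeper climb described above.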
To detect a difference δ=μ1−μ0 with power 1−β at significance level α (one-tailed, known σ), the two distributions must be separated so that the critical value sits zα standard errors above μ0 and simultaneously zβ standard errors below μ1. Setting these equal,
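The remainder of the derivation is not reproduced here. As a hedged sketch of where the stated condition leads, equating the two expressions for the critical value, μ0 + zα·σ/√n = μ1 − zβ·σ/√n, and solving for n gives n = ((zα + zβ)σ/δ)², rounded up to a whole number. The function names and the example values (δ = 2, σ = 4, target power 0.90) are assumptions for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal


def required_n(delta, sigma, alpha=0.05, target_power=0.90):
    """Smallest n meeting the target power for a one-tailed z-test with known sigma.

    Assumes the critical value is z_alpha standard errors above mu0 and
    z_beta standard errors below mu1, as described in the text.
    """
    z_alpha = Z.inv_cdf(1 - alpha)       # upper-alpha point
    z_beta = Z.inv_cdf(target_power)     # upper-beta point, with beta = 1 - target_power
    return ceil(((z_alpha + z_beta) * sigma / delta) ** 2)


def achieved_power(n, delta, sigma, alpha=0.05):
    """Power actually delivered by a sample of size n."""
    return 1 - Z.cdf(Z.inv_cdf(1 - alpha) - delta * sqrt(n) / sigma)


n_needed = required_n(delta=2.0, sigma=4.0)
print(f"n = {n_needed}, achieved power = {achieved_power(n_needed, 2.0, 4.0):.3f}")
```

Rounding up matters: the sample size one below the returned value falls just short of the target power.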