Type I and Type II Errors, Power of a Test

Every hypothesis test is a decision rule, and every decision rule can be wrong in two distinct ways: it can raise a false alarm (reject a true null) or it can miss a real effect (fail to reject a false null). These are the Type I and Type II errors, with probabilities $\alpha$ and $\beta$. Their interplay, together with the power $1-\beta$ that measures a test's ability to detect what is really there, is the heart of experimental design. This lesson defines the two errors precisely, derives $\beta$ and the power from the sampling distribution, exposes the $\alpha$–$\beta$ trade-off, and shows how to size a study for a target power.

Where this sits in AQA 7367

This is Paper 3 optional content — Statistics (7367/3S), the applied option a Further-Maths student may choose alongside one of Mechanics (7367/3M) or Discrete (7367/3D). Paper 3 is 2 hours, 100 marks, weighted AO1 40% / AO2 25% / AO3 35%. This topic is unusually AO2/AO3-heavy: defining the errors and interpreting power in context is reasoning and communication (AO2), while computing $\beta$ for a stated alternative is genuine problem-solving (AO3). It builds directly on the hypothesis-testing machinery of the previous lesson (critical regions, significance levels) and on A-Level Maths Statistics (the normal model for $\bar X$ and the standardising transformation $Z = (\bar X - \mu)/(\sigma/\sqrt n)$ ).

Core theory: the two errors

A test decides between a null hypothesis $H_0$ and an alternative $H_1$ on the basis of a sample. Reality is one of two states ( $H_0$ true or false) and our decision is one of two actions (reject or do not reject), giving the decision table:

	$H_0$ is true	$H_0$ is false
Do not reject $H_0$	Correct decision (prob. $1-\alpha$ )	Type II error (prob. $\beta$ )
Reject $H_0$	Type I error (prob. $\alpha$ )	Correct decision — power (prob. $1-\beta$ )

Type I error

A Type I error is rejecting $H_0$ when it is in fact true — a "false positive." Because we reject precisely when the statistic falls in the critical region, and we design the critical region to have probability $\alpha$ under $H_0$ ,

$P(\text{Type I error}) = P(\text{reject } H_0 \mid H_0 \text{ true}) = \alpha = \text{significance level}.$

The Type I error rate is therefore something we choose: setting $\alpha = 0.05$ is a deliberate decision to tolerate a $5\%$ false-alarm rate. For a discrete test statistic the actual size may be a little below the nominal $\alpha$ (you cannot always hit $5\%$ exactly), but the principle is unchanged.

Type II error

A Type II error is failing to reject $H_0$ when it is in fact false — a "false negative":

$P(\text{Type II error}) = P(\text{do not reject } H_0 \mid H_0 \text{ false}) = \beta.$

Crucially, $\beta$ is not a single number we choose. "$H_0$ false" is not one situation but a whole family: the true parameter could be anywhere in the range covered by $H_1$. So $\beta$ depends on how false $H_0$ is (the true value of the parameter), on the sample size $n$, on the population variability $\sigma$, and on the chosen $\alpha$. To compute a numerical $\beta$ you must be told (or must assume) a specific true value.

Exam Tip: Type I error = rejecting a true $H_0$ ; Type II error = not rejecting a false $H_0$ . A robust mnemonic: Type I = "crying wolf" (a false alarm when there is no wolf); Type II = "missing the wolf" (the wolf is there but you fail to sound the alarm).

The $\alpha$–$\beta$ trade-off

For a fixed sample size the two error rates pull against each other. Making the test more stringent (shrinking $\alpha$ by pushing the critical value further out) makes it harder to reject $H_0$, so a false null is missed more often and $\beta$ rises. Conversely, a generous $\alpha$ catches more true effects (small $\beta$) but raises the false-alarm rate. You cannot drive both to zero by tuning the critical value alone.

The escape from this trade-off is more information: increasing $n$ shrinks the standard error $\sigma/\sqrt n$, so the sampling distributions under $H_0$ and under the true value overlap less. With $\alpha$ held fixed, $\beta$ then falls, so both error rates can be controlled simultaneously. This is why "collect a larger sample" is the universal remedy for a test that lacks discriminating power.
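The trade-off can be tabulated directly. The sketch below is illustrative only: the function name `error_rates` and the values used ($\mu_0 = 50$, true mean $52$, $\sigma = 4$) are assumptions chosen for the demonstration. It slides the critical value at fixed $n$ to show $\alpha$ falling as $\beta$ rises, then keeps the boundary fixed and increases $n$ to show both shrinking together.

```python
from statistics import NormalDist

def error_rates(critical, mu0, mu_true, sigma, n):
    """Return (alpha, beta) for the rule 'reject H0 when the sample mean exceeds critical'."""
    se = sigma / n ** 0.5                           # standard error of the sample mean
    alpha = 1 - NormalDist(mu0, se).cdf(critical)   # reject although H0 is true
    beta = NormalDist(mu_true, se).cdf(critical)    # fail to reject although mu = mu_true
    return alpha, beta

# Illustrative values (assumed): H0 mean 50, true mean 52, sigma 4.
# Fixed n: moving the boundary outwards lowers alpha but raises beta.
for c in (50.5, 51.0, 51.5, 52.0):
    a, b = error_rates(c, 50, 52, 4, n=16)
    print(f"n=16, c={c}: alpha={a:.3f}, beta={b:.3f}")

# Fixed boundary: a larger sample shrinks both error rates.
for n in (16, 64):
    a, b = error_rates(51.0, 50, 52, 4, n=n)
    print(f"c=51.0, n={n}: alpha={a:.3f}, beta={b:.3f}")
```

The boundary $51.0$ sits midway between the two means, so $\alpha$ and $\beta$ are equal there; this is a convenience of the example rather than a general rule.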

Power of a test

The power is the probability of correctly rejecting $H_0$ when it is false:

$\text{Power} = 1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ false}).$

High power means the test reliably detects a genuine effect. Like $\beta$ , power is a function of the true parameter value, not a single number.
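These definitions can be checked empirically. The simulation below is a rough sketch under assumed values ($\mu_0 = 50$, $\sigma = 4$, $n = 16$, $\alpha = 0.05$, upper-tailed $z$-test); the helper `rejection_rate` and the seed are hypothetical choices made for reproducibility. It estimates the rejection probability once with $H_0$ true and once with a true mean of $52$.

```python
import random
from statistics import NormalDist, fmean

def rejection_rate(mu_true, mu0=50.0, sigma=4.0, n=16, alpha=0.05,
                   trials=20_000, seed=1):
    """Estimate P(reject H0) for an upper-tailed z-test by simulation."""
    rng = random.Random(seed)
    se = sigma / n ** 0.5
    critical = mu0 + NormalDist().inv_cdf(1 - alpha) * se  # boundary fixed under H0
    rejections = 0
    for _ in range(trials):
        # Draw one sample of size n from the TRUE population and test it.
        sample_mean = fmean(rng.gauss(mu_true, sigma) for _ in range(n))
        if sample_mean > critical:
            rejections += 1
    return rejections / trials

type_i_rate = rejection_rate(mu_true=50.0)   # H0 true: estimates alpha
power_at_52 = rejection_rate(mu_true=52.0)   # H0 false: estimates the power at mu = 52
print(f"Type I rate ~ {type_i_rate:.3f}, power at mu=52 ~ {power_at_52:.3f}")
```

The first estimate should sit close to the chosen $\alpha$, while the second depends entirely on the assumed true mean, which is the sense in which power is a function rather than a single number.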

Factors affecting power

Factor	Effect on power	Why
Larger sample size $n$	Increases power	smaller standard error $\sigma/\sqrt n$ ⇒ less overlap
Larger significance level $\alpha$	Increases power	critical value closer in ⇒ easier to reject (but more Type I error)
Larger true effect size $\lvert\mu_1-\mu_0\rvert$	Increases power	distributions pulled further apart
Smaller population variance $\sigma^2$	Increases power	tighter sampling distribution ⇒ less overlap
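For the upper-tailed $z$-test with known $\sigma$, all four rows can be read off a single closed form. This is a sketch under assumed notation: $\Phi$ is the standard normal distribution function, $z_\alpha$ is the upper-$\alpha$ point (so $\Phi(z_\alpha) = 1-\alpha$), and $\mu_1 > \mu_0$ is the true mean.

$\text{Power}(\mu_1) = P\!\left(\bar X > \mu_0 + z_\alpha \frac{\sigma}{\sqrt n} \,\Big|\, \mu = \mu_1\right) = 1 - \Phi\!\left(z_\alpha - \frac{(\mu_1 - \mu_0)\sqrt n}{\sigma}\right) = \Phi\!\left(\frac{(\mu_1 - \mu_0)\sqrt n}{\sigma} - z_\alpha\right).$

Since $\Phi$ is increasing, anything that enlarges the argument raises the power: a larger $n$, a larger $\mu_1 - \mu_0$, a smaller $\sigma$, or a larger $\alpha$ (which lowers $z_\alpha$).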

Computing $\beta$ and the power

The recipe never changes:

(1) find the critical region using the distribution under $H_0$ (this fixes the boundary at significance level $\alpha$);
(2) find the probability of not landing in that region using the distribution at the true parameter value; that probability is $\beta$;
(3) the power is $1-\beta$.

The two distributions in steps (1) and (2) are different, one centred on $\theta_0$ and the other on the true $\theta$, and $\beta$ measures their overlap. Picturing two bell curves, one for $H_0$ and one for the true value, with the critical boundary between them, makes every $\beta$ calculation transparent: $\beta$ is the area of the true-value curve that falls on the "do not reject" side of the boundary.
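The three steps translate almost line for line into code. The function below is a minimal sketch for an upper-tailed $z$-test with known $\sigma$; the name `z_test_beta_and_power` and the input values are hypothetical, chosen only to illustrate the recipe.

```python
from statistics import NormalDist

def z_test_beta_and_power(mu0, mu_true, sigma, n, alpha):
    """Follow the three-step recipe for an upper-tailed z-test (H1: mu > mu0)."""
    se = sigma / n ** 0.5
    # Step 1: critical value from the distribution under H0.
    critical = NormalDist(mu0, se).inv_cdf(1 - alpha)
    # Step 2: probability of NOT exceeding it, using the distribution at the true mean.
    beta = NormalDist(mu_true, se).cdf(critical)
    # Step 3: the power is the complement of beta.
    return critical, beta, 1 - beta

# Hypothetical inputs chosen for illustration.
critical, beta, power = z_test_beta_and_power(mu0=100, mu_true=107, sigma=15, n=25, alpha=0.05)
print(f"reject when sample mean > {critical:.3f}; beta = {beta:.4f}; power = {power:.4f}")
```

Note that `mu0` appears only in step 1 and `mu_true` only in step 2, mirroring the point that the two steps use different distributions.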

Worked Example 1 — Type II error and power for a $z$ -test (with mark scheme)

Test $H_0: \mu = 50$ against $H_1: \mu > 50$ with $\sigma = 4$ , $n = 16$ , $\alpha = 0.05$ . Find $\beta$ and the power when the true mean is $\mu = 52$ .

Step 1 — critical region under $H_0$ . The standard error is $\sigma/\sqrt n = 4/\sqrt{16} = 1$ , so under $H_0$ , $\bar X \sim N(50, 1^2)$ . For a one-tailed test at $5\%$ we reject when

$\bar X > 50 + 1.645\times 1 = 51.645. \quad (\text{M1 standard error}; \ \text{M1 critical value}; \ \text{A1 } 51.645)$

Step 2 — $\beta$ at the true mean. If $\mu = 52$ then $\bar X \sim N(52, 1^2)$ , and a Type II error is failing to reject, i.e. $\bar X \le 51.645$ :

$\beta = P(\bar X \le 51.645 \mid \mu = 52) = P\!\left(Z \le \frac{51.645 - 52}{1}\right) = P(Z \le -0.355) \quad (\text{M1 standardise at true } \mu)$

$\phantom{\beta} = 1 - \Phi(0.355) = 1 - 0.6387 = 0.3613. \quad (\text{A1})$

Step 3 — power.

$\text{Power} = 1 - \beta = 1 - 0.3613 = 0.6387. \quad (\text{A1})$

There is about a $63.9\%$ chance of detecting the shift from $50$ to $52$ . (M1 marks for the standard error, the critical value and standardising at the true mean; A1 marks for $51.645$ , $\beta = 0.361$ and power $0.639$ . The single most common error is to standardise the critical value using $\mu_0 = 50$ again — you must use the true $\mu = 52$ .)

Worked Example 2 — a two-tailed Type I / Type II calculation (with mark scheme)

A coin is tested for fairness by tossing it $n = 20$ times and counting heads $X$ . Under $H_0: p = 0.5$ , $X \sim B(20, 0.5)$ ; the critical region is $X \le 5$ or $X \ge 15$ . (a) Find the size (actual significance level) of the test. (b) Find $\beta$ and the power when in truth $p = 0.7$ .

(a) Using $B(20,0.5)$ : $P(X\le 5) = 0.02069$ and by symmetry $P(X\ge 15) = 0.02069$ , so

$\alpha = P(X\le 5) + P(X\ge 15) = 2(0.02069) = 0.0414. \quad (\text{M1 both tails}; \ \text{A1 } 0.0414)$

(b) With $p = 0.7$ , $X \sim B(20, 0.7)$ . A Type II error is landing outside the critical region, i.e. $6 \le X \le 14$ :

$\beta = P(6 \le X \le 14 \mid p = 0.7) = P(X\le 14) - P(X\le 5). \quad (\text{M1 set-up at true } p)$

From $B(20,0.7)$ tables, $P(X\le 14) = 0.5836$ and $P(X\le 5) = 0.00004$ (negligible), so

$\beta = 0.5836 - 0.0000 = 0.5836, \qquad \text{Power} = 1 - 0.5836 = 0.4164. \quad (\text{A1 } \beta; \ \text{A1 power})$

The power is only about $0.42$ : with just $20$ tosses this test misses a substantial bias ( $p = 0.7$ ) more often than it detects it — a vivid argument for a larger sample. (M1 for both tails in (a); M1 for evaluating the complementary region at the true $p$ in (b); A1s for the numerical values. Note $\alpha = 0.0414$ , below the nominal $5\%$ , because $X$ is discrete.)
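The table look-ups in this example can be reproduced from the binomial probability function. The sketch below uses only the standard library; the helper names `binom_pmf` and `reject_probability` are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def reject_probability(n, p, lower, upper):
    """P(X <= lower or X >= upper) for X ~ B(n, p): the chance of landing in the critical region."""
    return sum(binom_pmf(k, n, p) for k in range(n + 1) if k <= lower or k >= upper)

size = reject_probability(20, 0.5, lower=5, upper=15)   # actual significance level under H0
power = reject_probability(20, 0.7, lower=5, upper=15)  # P(reject | p = 0.7)
beta = 1 - power
print(f"size = {size:.4f}, beta = {beta:.4f}, power = {power:.4f}")
```

Evaluating the same critical region at two different values of $p$ is the whole calculation: once at the null value for the size, once at the true value for the power.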

Power curves and the operating-characteristic curve

A power function $\pi(\theta)$ plots the power against the true value $\theta$ of the parameter. Its shape is diagnostic:

at $\theta = \theta_0$ (the null value) the "power" is just the probability of rejecting a true $H_0$ , which is $\alpha$ — so the curve passes through $(\theta_0, \alpha)$ ;
as $\theta$ moves away from $\theta_0$ the power rises towards $1$ (a bigger effect is easier to detect);
increasing $n$ makes the curve steeper, climbing to $1$ sooner — a more sensitive test.

The operating-characteristic (OC) curve plots $\beta(\theta) = 1 - \pi(\theta)$, so it is simply the power curve flipped vertically: it takes the value $1-\alpha$ at $\theta_0$ and falls towards $0$ as the effect grows. The two carry the same information.

These curves make the factors affecting power visible. A larger $n$ (smaller standard error) makes the power curve climb more steeply, so even a modest departure from $\theta_0$ is detected with high probability — the test becomes more discriminating. A larger $\alpha$ lifts the whole power curve (including its value $\alpha$ at $\theta_0$ ), buying detection power at the price of more false alarms. In industrial acceptance sampling the OC curve is the working tool: it shows, for each true defect rate, the probability a batch is accepted, and the supplier (producer's risk $\alpha$ ) and customer (consumer's risk $\beta$ ) negotiate the sampling plan by shaping this curve.
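A power curve is just the power calculation repeated over a grid of true values. The sketch below assumes an upper-tailed $z$-test with illustrative defaults ($\mu_0 = 50$, $\sigma = 4$, $\alpha = 0.05$) that are not prescribed by the text; `power_function` is a hypothetical helper. It tabulates the power for two sample sizes alongside the OC value $1 - \pi(\mu)$.

```python
from statistics import NormalDist

def power_function(mu, mu0=50.0, sigma=4.0, n=16, alpha=0.05):
    """pi(mu) = P(reject H0 | true mean mu) for an upper-tailed z-test."""
    se = sigma / n ** 0.5
    critical = mu0 + NormalDist().inv_cdf(1 - alpha) * se  # fixed under H0
    return 1 - NormalDist(mu, se).cdf(critical)

grid = [50.0, 50.5, 51.0, 51.5, 52.0, 53.0, 54.0]
print("   mu  power(n=16)  power(n=64)  OC(n=16)")
for mu in grid:
    p16 = power_function(mu, n=16)
    p64 = power_function(mu, n=64)
    # The OC value is the complement of the power at the same true mean.
    print(f"{mu:5.1f}  {p16:11.3f}  {p64:11.3f}  {1 - p16:8.3f}")
```

The first row sits at the null value, where both curves equal $\alpha$; further down the grid the $n = 64$ column climbs faster, which is the steeper curve described above.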

Diagram (flow, left to right): true mean $\mu$ → compute $P(\text{reject} \mid \mu)$ = power → either plot power against $\mu$ to get the power curve, which passes through $(\mu_0, \alpha)$ and rises to $1$, or plot $\beta = 1 - \text{power}$ against $\mu$ to get the OC curve, which starts at $1-\alpha$ and falls to $0$.

Sample size for a specified power

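To detect a difference $\delta = \mu_1 - \mu_0$ with power $1-\beta$ at significance level $\alpha$ (one-tailed, known $\sigma$), the two distributions must be separated so that the critical value sits $z_\alpha$ standard errors above $\mu_0$ and simultaneously $z_\beta$ standard errors below $\mu_1$. Setting these equal,

$\mu_0 + z_\alpha \frac{\sigma}{\sqrt n} = \mu_1 - z_\beta \frac{\sigma}{\sqrt n} \quad\Longrightarrow\quad n = \left(\frac{(z_\alpha + z_\beta)\,\sigma}{\delta}\right)^2,$

rounded up to the next whole number, where $z_\alpha$ and $z_\beta$ are the upper-tail standard normal points (for example $z_{0.05} = 1.645$).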
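A minimal Python sketch of that sizing calculation, assuming the standard closed form $n \ge \left((z_\alpha + z_\beta)\sigma/\delta\right)^2$ obtained by equating the two expressions for the critical value. The function name `sample_size` and the inputs ($\delta = 2$, $\sigma = 4$, target power $0.90$) are hypothetical choices for illustration.

```python
from math import ceil
from statistics import NormalDist

def sample_size(delta, sigma, alpha, power):
    """Smallest n giving at least the target power for an upper-tailed z-test with known sigma."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)   # upper alpha point of N(0, 1)
    z_beta = z.inv_cdf(power)        # upper beta point, since power = 1 - beta
    # n must be a whole number, so round up to guarantee the target power.
    return ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

# Illustrative inputs (assumed): detect a shift of 2 when sigma = 4, at alpha = 0.05.
n_required = sample_size(delta=2, sigma=4, alpha=0.05, power=0.90)
print(n_required)
```

Because $n$ scales with $(\sigma/\delta)^2$, halving the difference to be detected roughly quadruples the required sample size.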
