Hypothesis Testing (Further)

Hypothesis testing is the formal machinery for deciding whether sample evidence is strong enough to overturn a default belief about a population. You met the single-mean and proportion tests in A-Level Maths; this lesson consolidates them into one rigorous framework and extends it across the full toolkit of Further Statistics 2 — the z-test and t-test for a mean, the test for a proportion, and tests for the difference of two means — alongside p-values, one- versus two-tailed reasoning, and the exact duality with confidence intervals.

Where this sits in AQA 7367

This is Paper 3 Statistics (7367/3S) content (Paper 3: 2 h, 100 marks, AO1 40% / AO2 25% / AO3 35%). It is the capstone of the inferential strand, drawing together the sampling distribution, the t-distribution and confidence intervals from the preceding lessons. The mechanics of computing a test statistic are AO1, but the high-tariff marks are AO2/AO3: choosing the right test and tail, stating assumptions, and — above all — writing a conclusion in context that neither over-claims nor under-claims. It builds directly on the A-Level Maths hypothesis-testing framework and on this course's earlier statistics lessons.

Core theory: the framework

Every hypothesis test follows the same six-step logic. The null hypothesis $H_0$ is the status-quo claim (an equality); the alternative $H_1$ is what we test for (one- or two-sided). We assume $H_0$, measure how surprising the data are under it, and reject $H_0$ only if that surprise exceeds a pre-set threshold $\alpha$.

Step	Action
1	State $H_0$ (an equality, e.g. $\mu = \mu_0$ ) and $H_1$ (the effect sought)
2	Fix the significance level $\alpha$ (commonly $0.05$ or $0.01$ )
3	Compute the test statistic from the sample
4	Find the critical value (or the p-value) for the chosen tail(s)
5	Decide: reject $H_0$ if the statistic is in the critical region (or $p < \alpha$ )
6	State the conclusion in context, hedged at the chosen level

The significance level $\alpha$ is the probability of rejecting a true $H_0$ — the risk of a false alarm we are willing to accept. It must be fixed before seeing the data; choosing the tail or level to suit the sample invalidates the test.
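One way to see what $\alpha$ means is to simulate many samples from a population in which $H_0$ really is true and count how often a two-tailed z-test rejects it. The sketch below is illustrative only: the population values (mean $500$, $\sigma = 10$, $n = 36$) are borrowed from the bar example later in this lesson, while the function name `false_alarm_rate`, the seed and the number of trials are arbitrary choices made for the demonstration.

```python
import random
from statistics import NormalDist

def false_alarm_rate(mu0=500.0, sigma=10.0, n=36, alpha=0.05,
                     trials=20_000, seed=1):
    """Estimate how often a two-tailed z-test rejects a TRUE H0."""
    rng = random.Random(seed)                      # fixed seed: repeatable run
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.960 when alpha = 0.05
    se = sigma / n ** 0.5
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population in which H0 (mu = mu0) holds.
        xbar = sum(rng.gauss(mu0, sigma) for _ in range(n)) / n
        z = (xbar - mu0) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

rate = false_alarm_rate()
print(f"Observed false-alarm rate: {rate:.3f}")
```

The observed rate should settle close to the chosen $\alpha$, which is the sense in which $\alpha$ is "the risk of a false alarm".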

There are two equivalent ways to make the decision, and it is worth being fluent in both. The critical-value method computes the test statistic and compares it with a fixed cut-off (e.g. $\pm 1.96$ ); the statistic is "in the critical region" or not. The p-value method computes the probability of a result at least as extreme as observed and compares it directly with $\alpha$ . They always give the same verdict — the critical value is simply the statistic whose p-value equals $\alpha$ — but the p-value carries more information, reporting how strong the evidence is rather than a bare reject/do-not-reject. Examiners accept either method; what loses marks is mixing them (e.g. comparing a statistic with a probability).

A third presentation, the critical region for the sample statistic, is sometimes asked for explicitly: instead of standardising, you state the range of $\bar x$ (or of the count) that would trigger rejection. For the bar example below, rejecting when $|Z| > 1.96$ is the same as rejecting when $\bar x$ falls outside $500 \pm 1.96 \times \tfrac{10}{6} = 500 \pm 3.27$, i.e. the critical region is $\bar x < 496.73$ or $\bar x > 503.27$.
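The same critical region can be computed directly. This is a minimal sketch for a two-tailed test with known $\sigma$; the function name `critical_region_for_mean` is made up for illustration, and the inputs are the bar-example values quoted above.

```python
from statistics import NormalDist

def critical_region_for_mean(mu0, sigma, n, alpha=0.05):
    """Return (lower, upper): reject H0 if xbar < lower or xbar > upper."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed cut-off
    margin = z_crit * sigma / n ** 0.5             # z_crit times the standard error
    return mu0 - margin, mu0 + margin

lower, upper = critical_region_for_mean(mu0=500, sigma=10, n=36)
print(f"Reject H0 if xbar < {lower:.2f} or xbar > {upper:.2f}")
```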

The z-test for a mean ( $\sigma$ known)

When $\sigma$ is known and the population is normal (or $n$ is large), use

$Z = \frac{\bar x - \mu_0}{\sigma/\sqrt n} \sim N(0,1),$

comparing with $z_\alpha$ (one-tailed) or $z_{\alpha/2}$ (two-tailed).

Worked example — two-tailed z-test

A manufacturer claims the mean weight of its bars is $500$ g, with $\sigma = 10$ g. A random sample of $36$ bars has $\bar x = 497$ g. Test the claim at the $5\%$ level.

$H_0:\ \mu = 500, \qquad H_1:\ \mu \ne 500 \ \text{(two-tailed)}. \quad (\text{B1 hypotheses})$

$Z = \frac{497 - 500}{10/\sqrt{36}} = \frac{-3}{1.6667} = -1.80. \quad (\text{M1 statistic; A1})$

Critical values: $\pm z_{0.025} = \pm 1.960$. Since $|Z| = 1.80 < 1.960$, $Z$ is not in the critical region.

Do not reject $H_0$ : there is insufficient evidence at the $5\%$ level that the mean weight differs from $500$ g. (B1; M1/A1 statistic; M1 compare with $\pm 1.96$ ; A1 contextual conclusion.)
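As a check on the arithmetic, the worked example can be reproduced in a few lines. This is a sketch rather than a prescribed method: `z_test_mean` is a hypothetical helper name, and the critical value is taken from the standard normal distribution in Python's `statistics` module instead of from tables.

```python
from statistics import NormalDist

def z_test_mean(xbar, mu0, sigma, n, alpha=0.05):
    """Two-tailed z-test for a mean with known sigma.

    Returns (z, z_crit, reject).
    """
    z = (xbar - mu0) / (sigma / n ** 0.5)          # standardise under H0
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 point
    return z, z_crit, abs(z) > z_crit

z, z_crit, reject = z_test_mean(xbar=497, mu0=500, sigma=10, n=36)
print(f"Z = {z:.2f}, critical values = ±{z_crit:.3f}, reject H0: {reject}")
```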

The t-test for a mean ( $\sigma$ unknown)

When $\sigma$ is unknown, estimate it by $s$ and switch to the t-distribution:

$T = \frac{\bar x - \mu_0}{s/\sqrt n} \sim t_{n-1}.$

Worked example — one-tailed t-test

A random sample of $15$ batteries has mean lifetime $48.5$ hours and $s = 3.2$ hours. The maker claims $\mu = 50$ hours. Test at the $5\%$ level whether the true mean is less than $50$ hours, assuming lifetimes are normally distributed.

$H_0:\ \mu = 50, \qquad H_1:\ \mu < 50 \ \text{(one-tailed)}. \quad (\text{B1})$

$T = \frac{48.5 - 50}{3.2/\sqrt{15}} = \frac{-1.5}{0.8262} = -1.815. \quad (\text{M1; A1})$

Critical value: $-t_{14,\,0.05} = -1.761$ (lower tail, $14$ degrees of freedom). Since $-1.815 < -1.761$, $T$ is in the critical region.

Reject $H_0$ : there is evidence at the $5\%$ level that the mean lifetime is less than $50$ hours. (B1; M1/A1 statistic; M1 compare; A1 contextual conclusion. Contrast with the previous lesson's borderline non-rejection — here the statistic just clears the critical value.)
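The battery example can be checked numerically. Python's standard library has no t-distribution, so this sketch takes the critical value $1.761$ from tables, exactly as quoted in the worked example; the names `t_statistic` and `T_CRIT_14_DF_5PCT` are illustrative only.

```python
from math import sqrt

def t_statistic(xbar, mu0, s, n):
    """One-sample t statistic; compare it with t on n - 1 degrees of freedom."""
    return (xbar - mu0) / (s / sqrt(n))

# Tabulated value for 14 degrees of freedom, 5% in one tail (quoted in the text).
T_CRIT_14_DF_5PCT = 1.761

t = t_statistic(xbar=48.5, mu0=50, s=3.2, n=15)
reject = t < -T_CRIT_14_DF_5PCT    # lower-tailed test: reject for large negative t
print(f"T = {t:.4f}, reject H0: {reject}")
```

Because the statistic is so close to the critical value, it is worth carrying full calculator precision through the working and rounding only at the end.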

The test for a proportion

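For a large sample (a common rule of thumb is that $np_0$ and $n(1 - p_0)$ are both at least $5$, so that the normal approximation to the binomial is reasonable), test $H_0:\ p = p_0$ with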

$Z = \frac{\hat p - p_0}{\sqrt{p_0(1 - p_0)/n}} \sim N(0,1).$

The standard error uses the hypothesised $p_0$ , not the observed $\hat p$ , because the whole calculation is performed under $H_0$ (this is the key difference from the proportion confidence interval, which uses $\hat p$ ).

Worked example — test for a proportion

A coin is tossed $200$ times and lands heads $115$ times. Test at the $5\%$ level whether the coin is biased.

$H_0:\ p = 0.5, \qquad H_1:\ p \ne 0.5 \ \text{(two-tailed)}; \qquad \hat p = \frac{115}{200} = 0.575. \quad (\text{B1; M1 } \hat p)$

$Z = \frac{0.575 - 0.5}{\sqrt{0.5 \times 0.5/200}} = \frac{0.075}{0.035355} = 2.121. \quad (\text{M1 SE with } p_0;\ \text{A1})$

Since $2.121 > 1.960$ , reject $H_0$ : there is significant evidence at the $5\%$ level that the coin is biased. (B1 hypotheses; M1 $\hat p$ ; M1 SE using $p_0 = 0.5$ ; A1 statistic; A1 conclusion. Using $\hat p$ in the SE would give a slightly different — and incorrect — $Z$ .)
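The effect of using $p_0$ rather than $\hat p$ in the standard error can be seen by computing both. This is a sketch with a hypothetical helper `z_test_proportion`; the second statistic, `z_wrong`, is included only to show the size of the discrepancy and is not a valid test statistic under $H_0$.

```python
from math import sqrt
from statistics import NormalDist

def z_test_proportion(successes, n, p0, alpha=0.05):
    """Two-tailed large-sample test of H0: p = p0.

    Returns (p_hat, z, reject). The standard error uses p0, not p_hat.
    """
    p_hat = successes / n
    se_null = sqrt(p0 * (1 - p0) / n)              # SE computed under H0
    z = (p_hat - p0) / se_null
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return p_hat, z, abs(z) > z_crit

p_hat, z, reject = z_test_proportion(successes=115, n=200, p0=0.5)

# For comparison only: the statistic obtained by (wrongly) using p_hat in the SE.
z_wrong = (p_hat - 0.5) / sqrt(p_hat * (1 - p_hat) / 200)

print(f"p_hat = {p_hat:.3f}, Z = {z:.3f}, reject H0: {reject}")
print(f"Z with p_hat in the SE (incorrect for a test) = {z_wrong:.3f}")
```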

Tests for the difference of two means

Known variances (z-test). For $H_0:\ \mu_1 = \mu_2$ ,

$Z = \frac{\bar x_1 - \bar x_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0,1).$

Unknown but equal variances (pooled t-test).

$T = \frac{\bar x_1 - \bar x_2}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}, \qquad s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}.$

In both cases the two samples are assumed independent, so the variances of the two sample means add: this is the same rule as for the variance of a difference of independent normal variables. For paired data, do not use these; reduce to differences and run a one-sample test (previous lesson).
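The lesson gives no worked example for the pooled test, so the sketch below uses made-up summary statistics purely to show how the formula is evaluated; the numbers, the function name `pooled_t_statistic` and the variable names are all hypothetical. It returns the statistic, the degrees of freedom and the pooled variance, leaving the critical value to be read from t tables.

```python
from math import sqrt

def pooled_t_statistic(xbar1, s1, n1, xbar2, s2, n2):
    """Pooled two-sample t statistic, assuming equal population variances.

    Returns (t, df, sp2), where sp2 is the pooled variance estimate.
    """
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se = sqrt(sp2) * sqrt(1 / n1 + 1 / n2)
    return (xbar1 - xbar2) / se, df, sp2

# Hypothetical summary data, invented for this illustration only.
t, df, sp2 = pooled_t_statistic(xbar1=52.1, s1=4.0, n1=10,
                                xbar2=48.3, s2=5.0, n2=12)
print(f"pooled variance = {sp2:.2f}, T = {t:.3f} on {df} degrees of freedom")
```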

Worked example — difference of two means (known variances)

Two machines produce rods, and the population standard deviations are known. Machine A: $n_1 = 40$, $\bar x_1 = 25.4$ mm, $\sigma_1 = 1.2$ mm. Machine B: $n_2 = 50$, $\bar x_2 = 24.9$ mm, $\sigma_2 = 1.0$ mm. Test at the $5\%$ level whether the machines differ in mean length.

$H_0:\ \mu_1 = \mu_2, \qquad H_1:\ \mu_1 \ne \mu_2 \ \text{(two-tailed)}. \quad (\text{B1})$

$\text{SE} = \sqrt{\frac{1.2^2}{40} + \frac{1.0^2}{50}} = \sqrt{0.036 + 0.02} = \sqrt{0.056} = 0.2366. \quad (\text{M1 SE; variances added})$

$Z = \frac{25.4 - 24.9}{0.2366} = \frac{0.5}{0.2366} = 2.113. \quad (\text{M1; A1})$

Since $2.113 > 1.960$ , reject $H_0$ : there is evidence at the $5\%$ level that the machines' mean lengths differ. (B1 hypotheses; M1 SE adding the two variance terms; M1/A1 statistic; A1 conclusion.)
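The rod example can be reproduced as follows. This is a sketch under the same assumptions as the worked example (independent samples, known standard deviations); `z_test_two_means` is an illustrative name, not an established function.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_means(xbar1, sigma1, n1, xbar2, sigma2, n2, alpha=0.05):
    """Two-tailed z-test of H0: mu1 = mu2 with known standard deviations.

    Returns (se, z, reject).
    """
    se = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)   # variances add
    z = (xbar1 - xbar2) / se
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return se, z, abs(z) > z_crit

se, z, reject = z_test_two_means(25.4, 1.2, 40, 24.9, 1.0, 50)
print(f"SE = {se:.4f}, Z = {z:.3f}, reject H0: {reject}")
```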

Worked example — computing and using a p-value

For the rod example above, find the p-value and confirm the decision.

The statistic is $Z = 2.113$ . For a two-tailed test the p-value is the probability of a $|Z|$ this large or larger in either tail:

$p = 2\,P(Z > 2.113) = 2(1 - 0.9827) = 2(0.0173) = 0.0346. \quad (\text{M1 one tail; M1 double})$

Since $0.0346 < 0.05$ , reject $H_0$ — the same verdict as the critical-value method, as it must be. (M1 reading $P(Z > 2.113)$ ; M1 doubling for two tails; A1 $p = 0.0346$ and the decision. Reporting $p = 0.035$ tells the reader the evidence is moderate — significant at $5\%$ but not at $1\%$ — which a bare "reject" would hide.)
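A p-value for any z statistic can be computed from the standard normal distribution function. The helper below is a sketch; the name `p_value_from_z` and its `tail` argument are hypothetical, chosen to mirror the three forms of $H_1$ used in this lesson.

```python
from statistics import NormalDist

def p_value_from_z(z, tail="two"):
    """p-value for a z statistic; tail is 'two', 'upper' or 'lower'."""
    std_normal = NormalDist()
    if tail == "two":
        return 2 * (1 - std_normal.cdf(abs(z)))   # double the one-tail area
    if tail == "upper":
        return 1 - std_normal.cdf(z)              # H1: parameter is larger
    if tail == "lower":
        return std_normal.cdf(z)                  # H1: parameter is smaller
    raise ValueError("tail must be 'two', 'upper' or 'lower'")

alpha = 0.05
z_rods = 2.113
p = p_value_from_z(z_rods)

# The two decision methods should agree.
reject_by_p = p < alpha
reject_by_critical_value = abs(z_rods) > NormalDist().inv_cdf(1 - alpha / 2)
print(f"p = {p:.4f}; reject by p-value: {reject_by_p}; "
      f"reject by critical value: {reject_by_critical_value}")
```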

p-values

The p-value is the probability, assuming $H_0$ is true, of obtaining a test statistic at least as extreme as the one observed (in the direction(s) of $H_1$ ). It is the strength of evidence against $H_0$ on a continuous scale.

p-value	Strength of evidence against $H_0$
$p < 0.01$	strong
$0.01 \le p < 0.05$	moderate
$p \ge 0.05$	insufficient (at the $5\%$ level)

Decision rule: reject $H_0$ when $p < \alpha$ . For a two-tailed test, double the one-tail probability. In the coin example, $Z = 2.121$ gives a two-tailed $p = 2\,P(Z > 2.121) = 2(0.0170) = 0.0339$ ; since $0.0339 < 0.05$ , reject $H_0$ — agreeing with the critical-value verdict.
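The bands in the table can be written as a small lookup and applied to the coin example. This is only a sketch: `evidence_label` is a made-up helper, and the wording of the labels simply copies the table above.

```python
from statistics import NormalDist

def evidence_label(p):
    """Map a p-value to the strength-of-evidence bands in the table."""
    if p < 0.01:
        return "strong"
    if p < 0.05:
        return "moderate"
    return "insufficient (at the 5% level)"

z_coin = 2.121                                    # statistic from the coin example
p_coin = 2 * (1 - NormalDist().cdf(z_coin))       # two-tailed p-value
print(f"p = {p_coin:.4f}: {evidence_label(p_coin)} evidence against H0")
```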

A subtle but important warning: the p-value is not the probability that $H_0$ is true. It is the probability of data this extreme assuming $H_0$ — a conditional probability in the opposite direction. Confusing $P(\text{data} \mid H_0)$ with $P(H_0 \mid \text{data})$ is the most common conceptual error in all of inference. Nor does a small p-value measure the size of an effect: with a very large sample, a trivially small departure from $H_0$ can yield a tiny p-value, so "statistically significant" is not the same as "practically important." Conversely, a large p-value does not prove $H_0$ ; it merely shows the data are consistent with it, which is weaker. Good practice reports the p-value alongside the estimate and its confidence interval, so the reader sees both the strength of evidence and the magnitude of the effect.
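The point about sample size can be made concrete with an idealised calculation. The set-up below is hypothetical and not taken from the lesson: $H_0$ says $\mu = 500$ with $\sigma = 10$, and the sample mean is taken to be exactly $500.1$ whatever the sample size, so the departure from $H_0$ is fixed at a practically negligible $0.1$ g. The two-tailed p-value is written with `math.erfc`, which equals $2\,P(Z > |z|)$ and avoids rounding very small tail areas to zero.

```python
from math import erfc, sqrt

def two_tailed_p(z):
    """Two-tailed normal p-value, 2 * P(Z > |z|), computed via erfc."""
    return erfc(abs(z) / sqrt(2))

# Hypothetical values for illustration only.
mu0, sigma, xbar = 500.0, 10.0, 500.1

results = {}
for n in (100, 10_000, 1_000_000):
    z = (xbar - mu0) / (sigma / sqrt(n))   # same departure, shrinking SE
    results[n] = two_tailed_p(z)
    print(f"n = {n:>9,}: Z = {z:6.2f}, p = {results[n]:.3g}")
```

The departure from $H_0$ is identical in every row; only the sample size changes, yet the p-value falls from clearly non-significant to vanishingly small.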

One-tailed versus two-tailed tests

Test	$H_1$	Critical region
Two-tailed	$\mu \ne \mu_0$	both tails, area $\alpha/2$ each
Upper one-tailed	$\mu > \mu_0$	right tail only, area $\alpha$
Lower one-tailed	$\mu < \mu_0$	left tail only, area $\alpha$

Choose the direction from the research question set in advance, never from the observed data. A one-tailed test puts the whole $\alpha$ in one tail, so its critical value is closer to zero (e.g. $1.645$ vs $1.960$ at $5\%$ ) — it is more powerful in the predicted direction but blind to an effect the other way.
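The critical values quoted above come straight from the inverse of the standard normal distribution function. In this sketch, `z_critical` is a hypothetical helper, and the statistic $1.80$ is an invented value used only to show how the two cut-offs can lead to different verdicts; it does not license picking the tail after seeing the data.

```python
from statistics import NormalDist

def z_critical(alpha, tails):
    """Positive critical value of a z-test with 1 or 2 tails."""
    if tails == 2:
        return NormalDist().inv_cdf(1 - alpha / 2)   # alpha/2 in each tail
    if tails == 1:
        return NormalDist().inv_cdf(1 - alpha)       # all of alpha in one tail
    raise ValueError("tails must be 1 or 2")

one_tailed = z_critical(0.05, tails=1)
two_tailed = z_critical(0.05, tails=2)
print(f"5% one-tailed: {one_tailed:.3f}, 5% two-tailed: {two_tailed:.3f}")

# Hypothetical statistic in the direction predicted by an upper one-tailed H1.
z_example = 1.80
print("Upper one-tailed test rejects:", z_example > one_tailed)
print("Two-tailed test rejects:      ", abs(z_example) > two_tailed)
```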
