Hypothesis testing is the formal machinery for deciding whether sample evidence is strong enough to overturn a default belief about a population. You met the single-mean and proportion tests in A-Level Maths; this lesson consolidates them into one rigorous framework and extends it across the full toolkit of Further Statistics 2 — the z-test and t-test for a mean, the test for a proportion, and tests for the difference of two means — alongside p-values, one- versus two-tailed reasoning, and the exact duality with confidence intervals.
This is Paper 3 Statistics (7367/3S) content (Paper 3: 2 h, 100 marks, AO1 40% / AO2 25% / AO3 35%). It is the capstone of the inferential strand, drawing together the sampling distribution, the t-distribution and confidence intervals from the preceding lessons. The mechanics of computing a test statistic are AO1, but the high-tariff marks are AO2/AO3: choosing the right test and tail, stating assumptions, and — above all — writing a conclusion in context that neither over-claims nor under-claims. It builds directly on the A-Level Maths hypothesis-testing framework and on this course's earlier statistics lessons.
Every hypothesis test follows the same six-step logic. The null hypothesis H0 is the status-quo claim (an equality); the alternative H1 is what we test for (one- or two-sided). We assume H0, measure how surprising the data are under it, and reject H0 only if that surprise exceeds a pre-set threshold α.
| Step | Action |
|---|---|
| 1 | State H0 (an equality, e.g. μ=μ0) and H1 (the effect sought) |
| 2 | Fix the significance level α (commonly 0.05 or 0.01) |
| 3 | Compute the test statistic from the sample |
| 4 | Find the critical value (or the p-value) for the chosen tail(s) |
| 5 | Decide: reject H0 if the statistic is in the critical region (or p<α) |
| 6 | State the conclusion in context, hedged at the chosen level |
The significance level α is the probability of rejecting a true H0 — the risk of a false alarm we are willing to accept. It must be fixed before seeing the data; choosing the tail or level to suit the sample invalidates the test.
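The meaning of α as a false-alarm rate can be checked by simulation. The sketch below is illustrative only: it assumes a hypothetical normal population in which H0 is exactly true (mean 500, σ = 10, samples of 36), runs a two-tailed z-test many times, and counts how often H0 is wrongly rejected. The seed, trial count, and variable names are arbitrary choices, not part of the lesson.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

random.seed(1)  # fixed seed so the run is repeatable

MU0, SIGMA, N = 500.0, 10.0, 36  # hypothetical population in which H0 is true
ALPHA = 0.05
TRIALS = 4000

z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)  # two-tailed cut-off, about 1.96
se = SIGMA / sqrt(N)

false_alarms = 0
for _ in range(TRIALS):
    sample = [random.gauss(MU0, SIGMA) for _ in range(N)]
    z = (fmean(sample) - MU0) / se
    if abs(z) > z_crit:
        false_alarms += 1  # H0 is true here, so every rejection is a false alarm

false_alarm_rate = false_alarms / TRIALS
print(f"Observed false-alarm rate: {false_alarm_rate:.3f} (target {ALPHA})")
```

The observed rate should sit close to 0.05, varying slightly with the seed and the number of trials.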
There are two equivalent ways to make the decision, and it is worth being fluent in both. The critical-value method computes the test statistic and compares it with a fixed cut-off (e.g. ±1.96); the statistic is "in the critical region" or not. The p-value method computes the probability of a result at least as extreme as observed and compares it directly with α. They always give the same verdict — the critical value is simply the statistic whose p-value equals α — but the p-value carries more information, reporting how strong the evidence is rather than a bare reject/do-not-reject. Examiners accept either method; what loses marks is mixing them (e.g. comparing a statistic with a probability).
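The equivalence of the two methods can be seen numerically. This is a minimal sketch for a two-tailed z-test at the 5% level, using Python's standard-library `statistics.NormalDist`; the list of test statistics is made up purely to exercise both sides of the cut-off.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
ALPHA = 0.05

def two_tailed_p(z):
    """Two-tailed p-value for a standard normal test statistic."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))

# The critical value is the statistic whose p-value equals alpha.
z_crit = STD_NORMAL.inv_cdf(1 - ALPHA / 2)
p_at_crit = two_tailed_p(z_crit)
print(f"critical value = {z_crit:.3f}, p-value at that point = {p_at_crit:.3f}")

# Illustrative statistics on either side of the cut-off (not from the lesson).
for z in (-2.5, -1.8, 0.4, 1.9, 2.2):
    by_critical_value = abs(z) > z_crit   # compare statistic with statistic
    by_p_value = two_tailed_p(z) < ALPHA  # compare probability with probability
    assert by_critical_value == by_p_value
    print(f"z = {z:+.1f}: reject H0? {by_critical_value}")
```

Each method compares like with like, which is exactly the discipline the mark scheme rewards.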
A third presentation, the critical region for the sample statistic, is sometimes asked for explicitly: instead of standardising, you state the range of $\bar{x}$ (or of the count) that would trigger rejection. For the bar example below, rejecting when $|Z| > 1.96$ is the same as rejecting when $\bar{x}$ falls outside $500 \pm 1.96 \times \frac{10}{\sqrt{36}} = 500 \pm 3.27$, i.e. the critical region is $\bar{x} < 496.73$ or $\bar{x} > 503.27$.
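The critical region for the sample mean can be reproduced in a few lines. The sketch below uses the bar figures quoted in the lesson (claimed mean 500 g, σ = 10 g, n = 36, 5% two-tailed); the helper `in_critical_region` is a hypothetical name introduced only for illustration.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, alpha = 500.0, 10.0, 36, 0.05

z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
margin = z_crit * sigma / sqrt(n)             # half-width of the acceptance region
lower, upper = mu0 - margin, mu0 + margin
print(f"Reject H0 if x-bar < {lower:.2f} or x-bar > {upper:.2f}")

def in_critical_region(x_bar):
    """True when the sample mean falls in either rejection tail."""
    return x_bar < lower or x_bar > upper

print(in_critical_region(497.0))  # the sample mean observed in the bar example
```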
When σ is known and the population is normal (or n is large), use
$$Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \sim N(0,1),$$
comparing with $z_{\alpha}$ (one-tailed) or $z_{\alpha/2}$ (two-tailed).
A manufacturer claims the mean weight of its bars is 500 g, with $\sigma = 10$ g. A random sample of 36 bars has $\bar{x} = 497$ g. Test the claim at the 5% level.
$H_0: \mu = 500$, $H_1: \mu \neq 500$ (two-tailed). (B1 hypotheses)
$$Z = \frac{497 - 500}{10/\sqrt{36}} = \frac{-3}{1.6667} = -1.80. \quad \text{(M1 statistic; A1)}$$
Critical values: $z_{0.025} = \pm 1.960$. Since $|-1.80| = 1.80 < 1.960$, $Z$ is not in the critical region.
Do not reject H0: there is insufficient evidence at the 5% level that the mean weight differs from 500 g. (B1; M1/A1 statistic; M1 compare with ±1.96; A1 contextual conclusion.)
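The six steps for the bar example translate directly into code. This is a sketch using only the figures given in the example; the variable names are illustrative, and the p-value is computed as an extra check that is not part of the written solution above.

```python
from math import sqrt
from statistics import NormalDist

# Steps 1-2: H0: mu = 500, H1: mu != 500, significance level 5%
mu0, sigma, n, x_bar, alpha = 500.0, 10.0, 36, 497.0, 0.05

# Step 3: test statistic
z = (x_bar - mu0) / (sigma / sqrt(n))

# Step 4: two-tailed critical value and p-value
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
p_value = 2 * NormalDist().cdf(-abs(z))

# Step 5: decision (both rules must agree)
reject = abs(z) > z_crit
assert reject == (p_value < alpha)

# Step 6: conclusion in context
verdict = "evidence" if reject else "insufficient evidence"
print(f"Z = {z:.2f}, critical value = {z_crit:.3f}, p = {p_value:.4f}")
print(f"There is {verdict} at the 5% level that the mean weight differs from 500 g.")
```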
When σ is unknown, estimate it by s and switch to the t-distribution:
$$T = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \sim t_{n-1}.$$
A sample of 15 batteries has mean lifetime 48.5 hours and s=3.2 hours. The maker claims μ=50. Test at the 5% level whether the true mean is less than 50.
$H_0: \mu = 50$, $H_1: \mu < 50$ (one-tailed). (B1)
$$T = \frac{48.5 - 50}{3.2/\sqrt{15}} = \frac{-1.5}{0.8262} = -1.816. \quad \text{(M1; A1)}$$
Critical value: $t_{14,\,0.05} = -1.761$ (lower tail). Since $-1.816 < -1.761$, $T$ is in the critical region.
Reject H0: there is evidence at the 5% level that the mean lifetime is less than 50 hours. (B1; M1/A1 statistic; M1 compare; A1 contextual conclusion. Contrast with the previous lesson's borderline non-rejection — here the statistic just clears the critical value.)
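The battery test can be checked numerically. Python's standard library has no t-distribution, so this sketch takes the critical value 1.761 from the tables quoted above and, as an optional extra, approximates the lower-tail p-value by integrating the t density with Simpson's rule. The helper names `t_pdf` and `t_upper_tail` are hypothetical, introduced only for this illustration.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, by Simpson's rule on [0, t] and symmetry (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

n, x_bar, s, mu0 = 15, 48.5, 3.2, 50.0
df = n - 1
t_stat = (x_bar - mu0) / (s / math.sqrt(n))

T_CRIT = 1.761  # tabulated 5% one-tailed value for 14 degrees of freedom
reject = t_stat < -T_CRIT

# Lower-tail p-value: by symmetry P(T < t_stat) = P(T > |t_stat|) when t_stat < 0.
p_lower = t_upper_tail(abs(t_stat), df)
print(f"T = {t_stat:.3f}, reject H0: {reject}, approximate p = {p_lower:.3f}")
```

The p-value lands just under 0.05, matching the observation that the statistic only just clears the critical value.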
For a large sample, test H0: p=p0 with
$$Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} \sim N(0,1).$$
The standard error uses the hypothesised $p_0$, not the observed $\hat{p}$, because the whole calculation is performed under $H_0$ (this is the key difference from the proportion confidence interval, which uses $\hat{p}$).
A coin is tossed 200 times and lands heads 115 times. Test at the 5% level whether the coin is biased.
$H_0: p = 0.5$, $H_1: p \neq 0.5$ (two-tailed); $\hat{p} = \frac{115}{200} = 0.575$. (B1; M1 $\hat{p}$)
$$Z = \frac{0.575 - 0.5}{\sqrt{0.5 \times 0.5/200}} = \frac{0.075}{0.035355} = 2.121. \quad \text{(M1 SE with } p_0\text{; A1)}$$
Since $2.121 > 1.960$, reject $H_0$: there is significant evidence at the 5% level that the coin is biased. (B1 hypotheses; M1 $\hat{p}$; M1 SE using $p_0 = 0.5$; A1 statistic; A1 conclusion. Using $\hat{p}$ in the SE would give a slightly different, and incorrect, $Z$.)
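The effect of the standard-error choice in the coin example is easy to see side by side. This sketch computes the statistic with the hypothesised proportion (the correct choice for a test) and, for comparison only, with the observed proportion; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

heads, n, p0, alpha = 115, 200, 0.5, 0.05
p_hat = heads / n

# Correct for a test: the standard error is evaluated under H0, so it uses p0.
se_null = sqrt(p0 * (1 - p0) / n)
z_null = (p_hat - p0) / se_null

# For comparison only: the SE built from p_hat belongs to the confidence interval.
se_hat = sqrt(p_hat * (1 - p_hat) / n)
z_hat = (p_hat - p0) / se_hat

z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject = abs(z_null) > z_crit
print(f"Z using p0:    {z_null:.3f}  (reject H0: {reject})")
print(f"Z using p-hat: {z_hat:.3f}  (not the test statistic)")
```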
Known variances (z-test). For H0: μ1=μ2,
$$Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \sim N(0,1).$$
Unknown but equal variances (pooled t-test).
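$$T = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \sim t_{n_1+n_2-2}, \qquad s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}.$$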
In both cases the two standard errors combine by adding variances — the same rule as the difference of normals. For paired data, do not use these; reduce to differences and run a one-sample test (previous lesson).
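The pooled statistic can be sketched as a small function. The lesson gives no worked pooled example, so the summary figures passed in below are invented purely to show the calculation; `pooled_t` is a hypothetical helper name, and the resulting statistic would still need comparing with the tabulated t value for the stated degrees of freedom.

```python
from math import sqrt

def pooled_t(n1, mean1, s1, n2, mean2, s2):
    """Pooled two-sample t statistic for H0: mu1 = mu2, assuming equal variances.

    Returns (t statistic, degrees of freedom, pooled standard deviation).
    """
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df  # pooled variance
    se = sqrt(sp2) * sqrt(1 / n1 + 1 / n2)                # variances add
    return (mean1 - mean2) / se, df, sqrt(sp2)

# Hypothetical summary data, not taken from the lesson.
t_stat, df, sp = pooled_t(n1=10, mean1=12.0, s1=2.0, n2=10, mean2=10.0, s2=2.0)
print(f"T = {t_stat:.3f} on {df} degrees of freedom (s_p = {sp:.3f})")
```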
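Two machines produce rods. Machine A: $n_1 = 40$, $\bar{x}_1 = 25.4$ mm, $\sigma_1 = 1.2$. Machine B: $n_2 = 50$, $\bar{x}_2 = 24.9$ mm, $\sigma_2 = 1.0$. Test at the 5% level whether the machines differ in mean length.
$H_0: \mu_1 = \mu_2$, $H_1: \mu_1 \neq \mu_2$ (two-tailed). (B1)
$$\text{SE} = \sqrt{\frac{1.2^2}{40} + \frac{1.0^2}{50}} = \sqrt{0.036 + 0.02} = \sqrt{0.056} = 0.2366. \quad \text{(M1 SE; variances added)}$$
$$Z = \frac{25.4 - 24.9}{0.2366} = \frac{0.5}{0.2366} = 2.113. \quad \text{(M1; A1)}$$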
Since 2.113>1.960, reject H0: there is evidence at the 5% level that the machines' mean lengths differ. (B1 hypotheses; M1 SE adding the two variance terms; M1/A1 statistic; A1 conclusion.)
For the rod example above, find the p-value and confirm the decision.
The statistic is Z=2.113. For a two-tailed test the p-value is the probability of a ∣Z∣ this large or larger in either tail:
$$p = 2P(Z > 2.113) = 2(1 - 0.9827) = 2(0.0173) = 0.0346. \quad \text{(M1 one tail; M1 double)}$$
Since 0.0346<0.05, reject H0 — the same verdict as the critical-value method, as it must be. (M1 reading P(Z>2.113); M1 doubling for two tails; A1 p=0.0346 and the decision. Reporting p=0.035 tells the reader the evidence is moderate — significant at 5% but not at 1% — which a bare "reject" would hide.)
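The rod example, from standard error through to p-value, can be reproduced as follows. This is a sketch using only the figures given for the two machines; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Machine A and Machine B summary figures from the rod example
n1, mean1, sigma1 = 40, 25.4, 1.2
n2, mean2, sigma2 = 50, 24.9, 1.0
alpha = 0.05

# Standard error of the difference: the two variance terms add.
se = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
z = (mean1 - mean2) / se

# Two-tailed p-value: double the upper-tail probability.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject = p_value < alpha
print(f"SE = {se:.4f}, Z = {z:.3f}, p = {p_value:.4f}, reject H0: {reject}")
```

Raising the bar to the 1% level would flip the decision, since the p-value lies between 0.01 and 0.05.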
The p-value is the probability, assuming H0 is true, of obtaining a test statistic at least as extreme as the one observed (in the direction(s) of H1). It is the strength of evidence against H0 on a continuous scale.
| p-value | Strength of evidence against H0 |
|---|---|
| p<0.01 | strong |
| 0.01≤p<0.05 | moderate |
| p≥0.05 | insufficient (at the 5% level) |
Decision rule: reject H0 when p<α. For a two-tailed test, double the one-tail probability. In the coin example, Z=2.121 gives a two-tailed p=2P(Z>2.121)=2(0.0170)=0.0339; since 0.0339<0.05, reject H0 — agreeing with the critical-value verdict.
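The decision rule and the evidence scale in the table can be wrapped into two small helpers. This is a sketch for z-tests only; `p_value` and `strength` are hypothetical names, and the thresholds are the ones in the table above.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def p_value(z, tail):
    """p-value of a z statistic; tail is 'two', 'upper' or 'lower'."""
    if tail == "two":
        return 2 * STD_NORMAL.cdf(-abs(z))  # double the one-tail probability
    if tail == "upper":
        return 1 - STD_NORMAL.cdf(z)
    if tail == "lower":
        return STD_NORMAL.cdf(z)
    raise ValueError("tail must be 'two', 'upper' or 'lower'")

def strength(p):
    """Evidence label following the lesson's table."""
    if p < 0.01:
        return "strong"
    if p < 0.05:
        return "moderate"
    return "insufficient (at the 5% level)"

# Coin example: 115 heads in 200 tosses, H0: p = 0.5, two-tailed.
z_coin = (115 / 200 - 0.5) / sqrt(0.5 * 0.5 / 200)
p_coin = p_value(z_coin, "two")
print(f"Z = {z_coin:.3f}, p = {p_coin:.4f}, evidence: {strength(p_coin)}")
```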
A subtle but important warning: the p-value is not the probability that H0 is true. It is the probability of data this extreme assuming H0 — a conditional probability in the opposite direction. Confusing P(data∣H0) with P(H0∣data) is the most common conceptual error in all of inference. Nor does a small p-value measure the size of an effect: with a very large sample, a trivially small departure from H0 can yield a tiny p-value, so "statistically significant" is not the same as "practically important." Conversely, a large p-value does not prove H0; it merely shows the data are consistent with it, which is weaker. Good practice reports the p-value alongside the estimate and its confidence interval, so the reader sees both the strength of evidence and the magnitude of the effect.
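The gap between statistical and practical significance shows up clearly in a toy calculation. The figures below are invented for illustration: a sample mean just 0.1 g away from a hypothesised 500 g (with σ = 10 g) is tested at two very different sample sizes.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical scenario, not from the lesson: the same tiny departure from H0.
mu0, sigma, x_bar = 500.0, 10.0, 500.1

def two_tailed_p(n):
    """Two-tailed z-test p-value for the fixed departure above at sample size n."""
    z = (x_bar - mu0) / (sigma / sqrt(n))
    return 2 * NormalDist().cdf(-abs(z))

p_small_n = two_tailed_p(100)
p_large_n = two_tailed_p(1_000_000)
print(f"n = 100:       p = {p_small_n:.3f}")
print(f"n = 1,000,000: p = {p_large_n:.3g}")
```

The effect size is identical in both runs; only the sample size changes, yet the verdict moves from "no evidence" to "overwhelming evidence". This is why the estimate and its interval should be reported alongside the p-value.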
| Test | H1 | Critical region |
|---|---|---|
| Two-tailed | μ≠μ0 | both tails, area α/2 each |
| Upper one-tailed | μ>μ0 | right tail only, area α |
| Lower one-tailed | μ<μ0 | left tail only, area α |
Choose the direction from the research question set in advance, never from the observed data. A one-tailed test puts the whole α in one tail, so its critical value is closer to zero (e.g. 1.645 vs 1.960 at 5%) — it is more powerful in the predicted direction but blind to an effect the other way.
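The trade-off between the two kinds of test can be made concrete. The sketch below computes the 5% critical values for each row of the table and applies them to an illustrative statistic of 1.80, a value chosen only because it falls between 1.645 and 1.960.

```python
from statistics import NormalDist

alpha = 0.05
z_one_tail = NormalDist().inv_cdf(1 - alpha)      # whole alpha in one tail
z_two_tail = NormalDist().inv_cdf(1 - alpha / 2)  # alpha/2 in each tail
print(f"one-tailed cut-off: {z_one_tail:.3f}, two-tailed cut-off: {z_two_tail:.3f}")

z = 1.80  # illustrative test statistic, not from the lesson

reject_two_tailed = abs(z) > z_two_tail  # H1: mu != mu0
reject_upper = z > z_one_tail            # H1: mu > mu0
reject_lower = z < -z_one_tail           # H1: mu < mu0

print(f"two-tailed: {reject_two_tailed}, upper: {reject_upper}, lower: {reject_lower}")
```

The upper one-tailed test rejects where the two-tailed test does not, which is the extra power in the predicted direction. The lower one-tailed test cannot reject for any positive statistic, however large, which is the blindness to an effect the other way.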