Further Hypothesis Testing

This lesson extends the hypothesis testing framework to include tests for the mean of a normal distribution and the correlation coefficient. These are key topics at A-Level and bring together the concepts of the normal distribution, sampling, and statistical inference.

Hypothesis Test for the Mean of a Normal Distribution

If $X \sim N(\mu, \sigma^2)$ and $\sigma^2$ is known, we can test hypotheses about $\mu$ using the sample mean $\bar{X}$ .

Distribution of the Sample Mean

If $X \sim N(\mu, \sigma^2)$ , then the sample mean of a random sample of size $n$ follows:

$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$

The standard error of the mean is:

$\text{SE} = \frac{\sigma}{\sqrt{n}}$
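As a quick numerical check of the formula above, the following minimal sketch tabulates the standard error for a few sample sizes. The function name `standard_error` and the value $\sigma = 5$ are assumptions chosen for illustration only.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the sample mean: sigma / sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return sigma / math.sqrt(n)


# Illustrative values only: sigma = 5 is an assumed population s.d.
for n in (1, 4, 25, 100):
    print(n, standard_error(5.0, n))
```

Quadrupling the sample size halves the standard error, because $n$ enters through its square root.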

Test Procedure

1. State hypotheses:
   - $H_0: \mu = \mu_0$
   - $H_1: \mu > \mu_0$ or $\mu < \mu_0$ or $\mu \neq \mu_0$
2. Assume $H_0$ is true: $\bar{X} \sim N\left(\mu_0, \frac{\sigma^2}{n}\right)$
3. Calculate the test statistic: $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$
4. Compare with the critical value or find the p-value.
5. Make a decision and state the conclusion in context.
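The procedure above can be sketched as a single function. This is one possible illustration, not a prescribed method: the name `z_test`, its parameters, and the `tail` labels are all assumptions, and the numbers in the usage line are made up for the demo.

```python
import math
from statistics import NormalDist


def z_test(xbar, mu0, sigma, n, alpha=0.05, tail="two"):
    """Return (z, reject) for a test of the mean with known sigma.

    tail: "lower" for H1: mu < mu0, "upper" for H1: mu > mu0,
          "two" for H1: mu != mu0.
    """
    # Test statistic under H0: Xbar ~ N(mu0, sigma^2 / n)
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    std_normal = NormalDist()  # standard normal, mean 0 and s.d. 1
    if tail == "lower":
        reject = z < std_normal.inv_cdf(alpha)
    elif tail == "upper":
        reject = z > std_normal.inv_cdf(1 - alpha)
    elif tail == "two":
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("tail must be 'lower', 'upper' or 'two'")
    return z, reject


# Hypothetical data: sample mean 10.3 from n = 36, testing H1: mu > 10
z, reject = z_test(xbar=10.3, mu0=10.0, sigma=1.2, n=36, tail="upper")
print(round(z, 2), reject)
```

The function only makes the statistical decision; the conclusion in context still has to be written in words.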

Example: A machine fills bottles with a mean of 500 ml ( $\sigma = 5$ ml). A sample of 25 bottles has a mean of 498.2 ml. Test at 5% whether the mean has decreased.

$H_0: \mu = 500$ , $H_1: \mu < 500$ (one-tailed)

$z = \frac{498.2 - 500}{5 / \sqrt{25}} = \frac{-1.8}{1} = -1.8$

Critical value at 5% (one-tailed): $z = -1.6449$

Since $-1.8 < -1.6449$ , reject $H_0$ . There is sufficient evidence at the 5% level to suggest the mean volume has decreased.
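The same example can be checked with the p-value approach instead of a critical value. The sketch below uses Python's standard-library `NormalDist`; the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu0, sigma, n, xbar = 500.0, 5.0, 25, 498.2

# Test statistic under H0: Xbar ~ N(500, 5^2 / 25)
z = (xbar - mu0) / (sigma / math.sqrt(n))

# Lower-tail p-value, because H1 is mu < 500
p_value = NormalDist().cdf(z)

reject = p_value < 0.05
print(round(z, 2), round(p_value, 4), reject)
```

The p-value is about 0.036, which is below 0.05, so the decision agrees with the critical-value comparison: reject $H_0$.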

Exam Tip: Always state the distribution of $\bar{X}$ under $H_0$ before calculating the test statistic. This shows the examiner you understand the sampling distribution, which is worth method marks.

Large Samples and the Central Limit Theorem

The Central Limit Theorem states that for a large random sample (typically $n \geq 30$ ) from a population with mean $\mu$ and finite variance $\sigma^2$ , the distribution of $\bar{X}$ is approximately normal whatever the shape of the underlying distribution:

$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{(approximately)}$

This allows us to apply normal-based hypothesis tests even when the population is not normally distributed, provided the sample is large enough.
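A small simulation can make this concrete. The sketch below draws repeated samples of size 30 from a strongly skewed exponential population (mean 1, variance 1) and looks at the resulting sample means. The choice of population, the seed, and the number of repetitions are all assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 30                  # size of each sample
num_samples = 2000      # number of repeated samples

# Exponential population with rate 1: mean 1, variance 1, clearly not normal
sample_means = [
    statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# Theory predicts mean 1 and standard error 1 / sqrt(30)
print(mean_of_means, sd_of_means, 1 / n ** 0.5)
```

The simulated mean and spread of $\bar{X}$ should sit close to $\mu$ and $\sigma/\sqrt{n}$ ; plotting a histogram of `sample_means` would show the roughly bell-shaped distribution.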

Hypothesis Test for the Correlation Coefficient

To test whether a population correlation coefficient $\rho$ is significantly different from zero:

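1. State hypotheses:
   - $H_0: \rho = 0$ (no linear correlation)
   - $H_1: \rho > 0$ or $\rho < 0$ or $\rho \neq 0$
2. Calculate the sample PMCC $r$ .
3. Compare $r$ with the critical value from the PMCC table for the appropriate $n$ and $\alpha$ . For $H_1: \rho > 0$ , reject $H_0$ if $r$ is greater than the critical value; for $H_1: \rho < 0$ , reject $H_0$ if $r$ is less than the negative of the critical value; for $H_1: \rho \neq 0$ , reject $H_0$ if $|r|$ is greater than the two-tailed critical value.
4. State conclusion in context.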

Example: A sample of 12 pairs of data gives $r = 0.65$ . Test at 5% whether there is positive correlation.

$H_0: \rho = 0$ , $H_1: \rho > 0$ (one-tailed)

Critical value for $n = 12$ at 5% (one-tailed): 0.4973

Since $0.65 > 0.4973$ , reject $H_0$ . There is sufficient evidence at the 5% level to suggest positive correlation in the population.
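The comparison step can be written as a short decision rule. In this sketch the function name `pmcc_test` and the `tail` labels are assumptions; the critical value is not computed but passed in as the positive value read from the PMCC table.

```python
def pmcc_test(r: float, critical_value: float, tail: str) -> bool:
    """Return True if H0: rho = 0 is rejected.

    critical_value is the positive tabulated value for the given n and
    significance level; tail is "upper", "lower" or "two".
    """
    if not 0 < critical_value < 1:
        raise ValueError("tabulated critical value must lie between 0 and 1")
    if tail == "upper":   # H1: rho > 0
        return r > critical_value
    if tail == "lower":   # H1: rho < 0, so the tabulated value changes sign
        return r < -critical_value
    if tail == "two":     # H1: rho != 0
        return abs(r) > critical_value
    raise ValueError("tail must be 'upper', 'lower' or 'two'")


# Example above: n = 12, r = 0.65, one-tailed 5% critical value 0.4973
print(pmcc_test(0.65, 0.4973, "upper"))
```

Note that a strongly negative $r$ would not lead to rejection in an upper-tailed test, which is why the direction of $H_1$ matters and not just the size of $r$ .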

Confidence and Significance

Significance Level	Confidence Level	Meaning
10%	90%	Weak evidence required to reject $H_0$
5%	95%	Moderate evidence required (most common)
1%	99%	Strong evidence required

Lower significance levels require stronger evidence to reject $H_0$ , reducing the risk of a Type I error but increasing the risk of a Type II error.
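To see how the bar for rejection rises as the significance level falls, the sketch below computes standard normal critical values for the three levels in the table. The helper name `critical_values` is an assumption for illustration.

```python
from statistics import NormalDist


def critical_values(alpha: float) -> tuple[float, float]:
    """Return (one-tailed, two-tailed) positive critical z-values."""
    std_normal = NormalDist()
    one_tailed = std_normal.inv_cdf(1 - alpha)      # all of alpha in one tail
    two_tailed = std_normal.inv_cdf(1 - alpha / 2)  # alpha split across tails
    return one_tailed, two_tailed


for alpha in (0.10, 0.05, 0.01):
    one, two = critical_values(alpha)
    print(f"{alpha:.2f}  one-tailed {one:.4f}  two-tailed {two:.4f}")
```

At the 5% level this gives the familiar 1.6449 and 1.9600; at 1% the test statistic has to be further from zero before $H_0$ is rejected.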

Common Mistakes in Hypothesis Testing

Not stating hypotheses in terms of the population parameter (use $p$ , $\mu$ , or $\rho$ , not $\hat{p}$ or $\bar{x}$ ).
Confusing one-tailed and two-tailed tests — read the question carefully.
Failing to state the conclusion in context — always refer back to the original question.
Using the wrong distribution — check whether binomial or normal is appropriate.
Incorrect inequality direction when calculating probabilities.

Summary

Test for the mean $\mu$ using $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$ when $\sigma$ is known.
$\bar{X} \sim N(\mu, \sigma^2/n)$ — the standard error decreases as $n$ increases.
The Central Limit Theorem justifies normal approximation for large samples.
Test for correlation using the PMCC table.
Always state hypotheses, significance level, and conclusion in context.

Exam Tip: The most common reason students lose marks in hypothesis testing questions is failing to give a contextual conclusion. After making your statistical decision, translate it into plain English: "There is sufficient evidence at the 5% significance level to suggest that the average weight of apples has decreased."

A-Level Deep Dive: Further Hypothesis Testing

Spec mapping

AQA 7357 specification, Paper 3 (Statistics), Section S: Statistical Hypothesis Testing (Year 2 content) covers conducting a statistical hypothesis test for the mean of a normal distribution with known, given or assumed variance and interpreting the results in context (refer to the official specification document for exact wording). It also covers: "Understand and apply the language of statistical hypothesis testing developed through a binomial model: null hypothesis, alternative hypothesis, significance level, test statistic, 1-tail test, 2-tail test, critical value, critical region, acceptance region, p-value; extend to correlation coefficients as measures of how close data points lie to a straight line and be able to interpret a given correlation coefficient using a given p-value or critical value (calculation of correlation coefficients is excluded)." Section S sits alongside Section O (probability), Section P (statistical distributions, including the normal distribution), and Section R (sampling), and is examined principally on Paper 3; Mechanics is examined on Paper 2. Critical values for the standard normal distribution are provided in the AQA formulae and statistical tables booklet; critical values of the product-moment correlation coefficient are also tabulated.

Worked example with full mark scheme

Question (7 marks):

A machine fills bottles with mineral water. The volume $X$ ml dispensed per bottle is modelled as $X \sim N(\mu, 4^2)$ , where the standard deviation $\sigma = 4$ ml is known from the manufacturer's calibration data. The machine is set so that $\mu = 500$ ml. After a maintenance visit, a quality engineer suspects the mean volume has changed. A random sample of $n = 25$ bottles gives a sample mean of $\bar{x} = 498.4$ ml.

Test, at the 5% significance level, whether the mean volume has changed.

Solution with mark scheme:

Step 1 — state hypotheses.

Let $\mu$ denote the population mean volume after maintenance. Then:

$H_0: \mu = 500, \qquad H_1: \mu \neq 500$

B1 — both hypotheses correct, with $\mu$ defined as a population parameter (not sample mean) and $H_1$ two-tailed (the engineer suspects a change in either direction).

Step 2 — state distribution of the sample mean under $H_0$ .

Under $H_0$ , since $X \sim N(500, 16)$ with $\sigma$ known:

$\bar{X} \sim N\!\left(500,\ \dfrac{16}{25}\right) = N(500, 0.64)$

so the standard error is $\sigma/\sqrt{n} = 4/5 = 0.8$ .

M1 — quoting the sampling distribution of $\bar{X}$ with the correct variance $\sigma^2/n$ . The single most common error is using $\sigma^2$ instead of $\sigma^2/n$ — that costs the M1 and propagates a wrong z-value through the rest of the question.

Step 3 — compute the test statistic.

$z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \dfrac{498.4 - 500}{0.8} = \dfrac{-1.6}{0.8} = -2.00$

M1 — correct standardisation formula applied.

A1 — $z = -2.00$ to at least 2 d.p. Sign matters: an answer of $+2.00$ (forgetting which way round subtraction goes) loses this A1 and may invalidate the conclusion.

Step 4 — compare with critical value.

For a two-tailed test at the 5% level, the critical values are $z = \pm 1.96$ . The critical region is $|z| > 1.96$ .

B1 — correct critical value $\pm 1.96$ for a two-tailed 5% test. Candidates who quote $\pm 1.645$ (the one-tailed 5% value) lose this mark.

Step 5 — make a decision.

Since $|z| = 2.00 > 1.96$ , the test statistic falls inside the critical region. Reject $H_0$ .

M1 — explicit comparison of test statistic with critical value, stating decision in symbolic form.

Step 6 — conclude in context.

There is sufficient evidence at the 5% significance level to suggest that the population mean volume has changed from 500 ml after the maintenance visit. The sample evidence (mean 498.4 ml) is consistent with a slight under-fill.

A1 — conclusion stated in context of the original problem (mineral-water filling), referring to the population mean, with appropriate tentative language ("evidence to suggest", not "proves"). This is the most frequently lost mark on hypothesis-test questions.

Total: 7 marks (B1 M1 M1 A1 B1 M1 A1, with the second A1 carrying the contextual conclusion).
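The worked example can also be checked by expressing the critical region in terms of $\bar{x}$ rather than $z$ , and by finding the two-tailed p-value. This is a minimal sketch using the standard-library `NormalDist`; the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu0, sigma, n, xbar, alpha = 500.0, 4.0, 25, 498.4, 0.05

se = sigma / math.sqrt(n)                      # standard error, 0.8
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for two tails

# Critical region for the sample mean: below `lower` or above `upper`
lower = mu0 - z_crit * se
upper = mu0 + z_crit * se
in_critical_region = xbar < lower or xbar > upper

# Two-tailed p-value: double the tail area beyond |z|
z = (xbar - mu0) / se
p_value = 2 * NormalDist().cdf(-abs(z))

print(round(lower, 3), round(upper, 3), in_critical_region, round(p_value, 4))
```

The lower boundary is about 498.43 ml, so a sample mean of 498.4 ml only just falls in the critical region; the p-value of about 0.046 tells the same story.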

Specimen question modelled on the AQA 7357 Paper 3 format

Question (6 marks): A psychologist measures the product-moment correlation coefficient between hours of sleep and reaction-time score for a random sample of $n = 20$ adults. The sample value is $r = -0.524$ . The critical value of the PMCC at the 1% one-tailed significance level for $n = 20$ is given in the table as $0.5155$ .

Test, at the 1% significance level, whether there is evidence of negative correlation between hours of sleep and reaction-time score in the underlying population. (6)

Mark scheme decomposition by AO:

B1 (AO2.5) — defining $\rho$ as the population PMCC and stating $H_0: \rho = 0$ , $H_1: \rho < 0$ (one-tailed, negative direction matching the question stem).
B1 (AO1.1a) — identifying the correct critical value as $-0.5155$ for a one-tailed test in the negative tail, recognising that the tabulated value $0.5155$ is positive and must have its sign flipped for a lower-tail test.
M1 (AO1.1b) — comparing the sample value $r = -0.524$ with the critical value $-0.5155$ , e.g. "since $-0.524 < -0.5155$ " or equivalently "since $|r| = 0.524 > 0.5155$ ".
A1 (AO2.2a) — correct decision: reject $H_0$ .
A1 (AO3.5b) — contextual conclusion referring to population correlation, sleep, and reaction time, with appropriate hedging.
B1 (AO2.4) — critical commentary: noting that the test indicates evidence of correlation but not causation, or that the result is sample-dependent.

Total: 6 marks split AO1 = 2, AO2 = 3, AO3 = 1. PMCC tests are unusual in that calculation of $r$ is excluded from the AQA specification: the entire question rests on interpretation, comparison, and contextual reasoning, which pushes the AO2/AO3 weighting unusually high.

Synoptic links

Connects to:

Section P — The normal distribution: the Z-test for a mean is an application of the normal distribution to the sampling distribution of $\bar{X}$ . Confidence in standardising $Z = (X - \mu)/\sigma$ extends directly to $Z = (\bar{X} - \mu)/(\sigma/\sqrt{n})$ — same formula, different denominator.
Section R — Statistical sampling and the sampling distribution of $\bar{X}$ : the result $\bar{X} \sim N(\mu, \sigma^2/n)$ when $X$ is normal (or by the Central Limit Theorem when $n$ is large) is the engine of the Z-test. Without the CLT, the Z-test would only apply to genuinely normal populations.
Section S — Hypothesis test for a binomial proportion (Year 1): the Year 1 binomial test introduces the language of $H_0$ , $H_1$ , significance level, critical region, p-value. The Z-test re-uses every word, swapping the discrete binomial test statistic for the continuous $Z$ .
Section O — Probability: the significance level $\alpha$ is $P(\text{reject } H_0 \mid H_0 \text{ true})$ — a conditional probability. Misinterpreting $\alpha$ as $P(H_0 \text{ true})$ is the most common conceptual error and lies in elementary probability, not statistics.
PMCC / regression (Section T): the PMCC test connects hypothesis testing to bivariate data. Although calculation of $r$ is excluded at A-level, its interpretation synthesises Section P (normality assumptions on each variable), Section S (hypothesis-testing language), and the broader framing of inference about populations.
Mechanics (Paper 2): quality-control problems in a manufacturing context (filling machines, packaging weights, component lengths) are ubiquitous in Paper 3 statistics questions but routinely reference physical contexts students meet in mechanics — units, tolerances, and engineering-meaningful conclusions matter.
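The Section O link above, that $\alpha = P(\text{reject } H_0 \mid H_0 \text{ true})$ , can be illustrated by simulation. The sketch below repeatedly samples from a population in which $H_0$ really is true and counts how often a two-tailed 5% test rejects it. The parameter values, seed, and number of trials are assumptions chosen for the demo.

```python
import math
import random
from statistics import NormalDist, fmean

rng = random.Random(1)  # fixed seed so the run is repeatable
mu0, sigma, n, alpha = 500.0, 4.0, 25, 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
trials = 10_000

rejections = 0
for _ in range(trials):
    # H0 is true here: every sample is drawn with mean exactly mu0
    sample = [rng.gauss(mu0, sigma) for _ in range(n)]
    z = (fmean(sample) - mu0) / (sigma / math.sqrt(n))
    if abs(z) > z_crit:
        rejections += 1

rate = rejections / trials
print(rate)
```

The rejection rate should come out close to 0.05. It is the long-run proportion of true null hypotheses wrongly rejected, not the probability that $H_0$ is true.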

Mark-scheme literacy

Hypothesis-test questions on AQA 7357 Paper 3 split AO marks more evenly than typical pure-mathematics questions:

AO	Typical share	Earned by
AO1 (knowledge / procedure)	35–45%	Stating hypotheses, computing test statistic, looking up critical value, making the formal comparison
AO2 (reasoning / interpretation)	35–45%	Choosing one- vs two-tailed correctly, interpreting significance level, writing the conclusion in context, defending the decision
AO3 (problem-solving / modelling)	10–25%	Critiquing modelling assumptions (was $\sigma$ really known? is the population normal? was sampling random?), commenting on sample size adequacy

Examiner-rewarded phrasing: "Let $\mu$ denote the population mean ..."; "Under $H_0$ , $\bar{X} \sim N(\mu_0, \sigma^2/n)$ "; "Since $|z| = 2.00 > 1.96$ , the test statistic lies in the critical region"; "There is sufficient evidence at the 5% significance level to suggest ...". Phrases that lose marks: "the mean is 498.4" (confuses sample with population); "we accept $H_0$ " (statisticians do not accept $H_0$ — they fail to reject it); "the test proves ..." (statistical tests provide evidence, not proof); writing the conclusion only in symbolic terms with no contextual statement.

A specific AQA pattern to watch: questions phrased "test whether there is evidence that ..." with a direction in the stem (e.g. "evidence that the mean has increased") demand a one-tailed test. Questions phrased "test whether the mean has changed" demand a two-tailed test. Misreading this distinction changes the critical value (for example $1.645$ instead of $1.96$ at the 5% level) and can reverse the decision, putting every mark from Step 4 onwards at risk.
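To see how the choice of tails can alter the outcome, the sketch below applies both a one-tailed and a two-tailed 5% test to the same test statistic. The function name `rejects` and the value $z = -1.8$ are assumptions chosen for illustration.

```python
from statistics import NormalDist


def rejects(z: float, alpha: float, tail: str) -> bool:
    """Decide whether z lies in the critical region for the given tail."""
    std_normal = NormalDist()
    if tail == "lower":
        return z < std_normal.inv_cdf(alpha)
    if tail == "upper":
        return z > std_normal.inv_cdf(1 - alpha)
    if tail == "two":
        return abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    raise ValueError("tail must be 'lower', 'upper' or 'two'")


z = -1.8
print(rejects(z, 0.05, "lower"))  # compared with about -1.645
print(rejects(z, 0.05, "two"))    # compared with about 1.96 in size
```

With $z = -1.8$ the one-tailed test rejects $H_0$ but the two-tailed test does not, so the wording of the question stem decides the conclusion.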

Grade-band model answers

3-mark question

Question: State, with reasons, whether each of the following hypothesis-test setups uses a one-tailed or a two-tailed test:

(i) A teacher tests whether the mean exam score has changed since last year.

(ii) An engineer tests whether a new alloy has greater tensile strength than the standard alloy.

(iii) A pharmacist tests whether a drug's mean dissolution time differs from the manufacturer's stated value.

Grade C response (~150 words):

(i) Two-tailed because "changed" goes either way.

(ii) One-tailed because we want greater than.

(iii) Two-tailed because "differs" means up or down.

Examiner commentary: Full marks (3/3) for correct identifications. The reasoning is brief but the key trigger words ("changed", "greater", "differs") are correctly mapped to the right tails. This is what a 3-mark question of this kind looks like at Grade C — efficient, accurate, no decoration.

Grade A* response (~190 words):

(i) Two-tailed. "Changed" admits both increase and decrease, so $H_1: \mu \neq \mu_0$ . The critical region is split: $|z| > z_{\alpha/2}$ .

(ii) One-tailed (upper). "Greater" specifies a single direction, so $H_1: \mu > \mu_0$ . The full $\alpha$ is in the upper tail; the critical value is $z_\alpha$ , not $z_{\alpha/2}$ .

(iii) Two-tailed. "Differs" carries the same logic as "changed" — the manufacturer's value could be low or high. $H_1: \mu \neq \mu_0$ .
