Correlation and Regression (Further)

Correlation asks how strongly two variables move together; regression asks what line best summarises that relationship and lets us predict. This lesson takes both beyond GCSE/AS: it derives and interprets Pearson's product-moment correlation coefficient (PMCC) $r$, introduces Spearman's rank correlation coefficient $r_s$ for monotonic (not necessarily linear) association, tests each against the null hypothesis of no correlation in the population ($\rho = 0$) using critical-value tables, and builds the least-squares regression line from the bivariate sums, with full attention to when prediction is valid.

Where this sits in AQA 7367

This is Paper 3 optional content: Statistics (7367/3S), chosen alongside Mechanics (7367/3M) or Discrete (7367/3D). Paper 3 is 2 hours, 100 marks, AO1 40% / AO2 25% / AO3 35%. The mechanics of computing $r$, $r_s$ and the regression coefficients are AO1; choosing the right coefficient for the data, interpreting $r^2$, and warning against extrapolation are AO2; a multi-step worded test is AO3. It builds on A-Level Maths bivariate data (scatter diagrams, the PMCC, the $y$-on-$x$ regression line) and on the hypothesis-testing framework from earlier in this option.

Core theory: Pearson's PMCC

For paired data $(x_i, y_i)$ , define the bivariate sums

$S_{xx} = \sum (x_i - \bar x)^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}, \quad S_{yy} = \sum (y_i - \bar y)^2 = \sum y_i^2 - \frac{(\sum y_i)^2}{n},$

$S_{xy} = \sum (x_i - \bar x)(y_i - \bar y) = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}.$

The right-hand "computational" forms are the ones to use in practice — they avoid subtracting the mean from every value. The PMCC is

$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}, \qquad -1 \le r \le 1.$

It measures the strength and direction of a linear relationship: it is the covariance of $x$ and $y$ divided by the product of their standard deviations, so it is dimensionless and unchanged by a change of origin or a positive rescaling of either variable (changing units from cm to m leaves $r$ fixed; multiplying one variable by a negative constant reverses the sign of $r$ but not its size). The bound $|r|\le 1$ is the Cauchy–Schwarz inequality applied to the centred data.
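As a quick numerical check of how $r$ behaves under rescaling, the following minimal Python sketch computes $r$ from the computational forms of the sums and compares it across a change of units. The function name `pmcc` and the height and mass figures are illustrative assumptions, not data from this lesson.

```python
import math

def pmcc(xs, ys):
    """Pearson's r from raw pairs, using the computational forms of the sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical readings (made up for illustration): heights in cm, masses in kg.
heights_cm = [150, 160, 165, 172, 181]
masses_kg = [52, 58, 63, 70, 77]

r_cm = pmcc(heights_cm, masses_kg)
r_m = pmcc([h / 100 for h in heights_cm], masses_kg)    # cm -> m: same r
r_flipped = pmcc([-h for h in heights_cm], masses_kg)   # negative factor: sign flips

print(r_cm, r_m, r_flipped)
```

The first two values agree up to floating-point rounding, and the third has the same size with the opposite sign.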

Geometrically, $r$ is the cosine of the angle between the two centred data vectors $(x_i - \bar x)$ and $(y_i - \bar y)$: when they point the same way $r = 1$, when opposite $r = -1$, and when orthogonal $r = 0$. This is why $r$ detects only the linear component of a relationship; it is blind to any structure perpendicular to a straight-line trend. A small $|r|$ therefore indicates little or no linear association but says nothing about curved or other non-linear patterns, a point we return to under misconceptions.
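The cosine interpretation can be checked directly. This sketch, with helper names `centred` and `cosine` chosen purely for illustration, centres the raw pairs used in Worked Example 1b and takes the cosine of the angle between the two resulting vectors; the result should match $S_{xy}/\sqrt{S_{xx}S_{yy}}$.

```python
import math

def centred(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def cosine(u, v):
    """Cosine of the angle between vectors u and v (dot product over lengths)."""
    dot = sum(a * b for a, b in zip(u, v))
    length_u = math.sqrt(sum(a * a for a in u))
    length_v = math.sqrt(sum(b * b for b in v))
    return dot / (length_u * length_v)

xs = [2, 4, 6, 8, 10]
ys = [3, 7, 8, 12, 15]

r = cosine(centred(xs), centred(ys))

# An exactly linear y = 2x + 1 gives centred vectors pointing the same way.
r_line = cosine(centred(xs), centred([2 * x + 1 for x in xs]))

print(round(r, 3), round(r_line, 3))
```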

Value of $r$	Interpretation
$r = 1$	perfect positive linear correlation (all points on a line of positive gradient)
$r = -1$	perfect negative linear correlation
$r = 0$	no linear correlation (but a curved relationship may still exist)
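To illustrate the last row of the table, here is a small sketch with a made-up data set (an assumption for illustration) in which $y = x^2$ exactly, so $y$ is completely determined by $x$, yet $r = 0$ because the points are symmetric about $x = 0$.

```python
import math

def pmcc(xs, ys):
    """Pearson's r via the computational forms of S_xx, S_yy and S_xy."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]   # perfect quadratic relationship

r = pmcc(xs, ys)
print(r)   # S_xy = 0 here, so r = 0 despite the exact curved relationship
```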

Worked Example 1 — PMCC and the regression line from sums (with mark scheme)

Five pairs of readings give $\sum x = 30$ , $\sum y = 45$ , $\sum x^2 = 220$ , $\sum y^2 = 491$ , $\sum xy = 328$ , with $n = 5$ . Find $r$ , and the equation of the regression line of $y$ on $x$ .

Bivariate sums.

$S_{xx} = 220 - \frac{30^2}{5} = 220 - 180 = 40, \quad S_{yy} = 491 - \frac{45^2}{5} = 491 - 405 = 86. \quad (\text{M1; A1})$

$S_{xy} = 328 - \frac{30\times 45}{5} = 328 - 270 = 58. \quad (\text{A1})$

PMCC.

$r = \frac{58}{\sqrt{40\times 86}} = \frac{58}{\sqrt{3440}} = \frac{58}{58.65} = 0.989. \quad (\text{M1 formula; A1 to 3 s.f.})$

Regression line. With $\bar x = 30/5 = 6$ and $\bar y = 45/5 = 9$ ,

$b = \frac{S_{xy}}{S_{xx}} = \frac{58}{40} = 1.45, \qquad a = \bar y - b\bar x = 9 - 1.45(6) = 0.3. \quad (\text{M1 } b; \ \text{A1 } a)$

$\therefore\ y = 0.3 + 1.45x \quad \text{equivalently} \quad y - 9 = 1.45(x - 6). \quad (\text{A1})$

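(M1/A1/A1 for the sums; M1/A1 for $r$; M1/A1/A1 for $b$, $a$ and the equation. The very high $r = 0.989$ signals a strong positive linear relationship. Its sign agrees with the positive gradient $b = 1.45$, but the size of the gradient depends on the units of $x$ and $y$ and says nothing about the strength of the correlation.)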
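The arithmetic of Worked Example 1 can be reproduced from the summary statistics alone. The following is a minimal sketch; the function name `from_sums` and its dictionary output are illustrative choices rather than any standard API.

```python
import math

def from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Return S_xx, S_yy, S_xy, r and the y-on-x regression coefficients a, b."""
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    r = s_xy / math.sqrt(s_xx * s_yy)
    b = s_xy / s_xx                    # gradient
    a = sum_y / n - b * sum_x / n      # intercept: a = y_bar - b * x_bar
    return {"s_xx": s_xx, "s_yy": s_yy, "s_xy": s_xy, "r": r, "a": a, "b": b}

result = from_sums(n=5, sum_x=30, sum_y=45, sum_x2=220, sum_y2=491, sum_xy=328)

for name, value in result.items():
    print(name, round(value, 3))
```

Only the printed values are rounded; the intermediate sums are kept at full accuracy.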

Worked Example 1b: building the sums from raw data

To see where those summary statistics come from, consider the raw pairs $(2,3), (4,7), (6,8), (8,12), (10,15)$ . Tabulating the products:

$x$	$y$	$x^2$	$y^2$	$xy$
2	3	4	9	6
4	7	16	49	28
6	8	36	64	48
8	12	64	144	96
10	15	100	225	150
30	45	220	491	328

The column totals are exactly the summary statistics used in Worked Example 1: $\sum x = 30$ , $\sum y = 45$ , $\sum x^2 = 220$ , $\sum y^2 = 491$ , $\sum xy = 328$ . In an exam you would build this table first, then substitute into the $S_{xx}, S_{yy}, S_{xy}$ formulae — laying out the table earns the AO1 method marks even if a single arithmetic slip costs an accuracy mark. Always keep the totals to full accuracy; round only the final $r$ .
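The same table can be built programmatically. This short sketch (variable names are illustrative) forms one row per pair and then totals each column, which is the layout recommended above.

```python
pairs = [(2, 3), (4, 7), (6, 8), (8, 12), (10, 15)]

# One table row per pair: x, y, x^2, y^2, xy.
rows = [(x, y, x * x, y * y, x * y) for x, y in pairs]

# Column totals, in the same column order as the rows.
totals = tuple(sum(column) for column in zip(*rows))

print("x\ty\tx^2\ty^2\txy")
for row in rows:
    print("\t".join(str(v) for v in row))
print("\t".join(str(v) for v in totals))
```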

Hypothesis test for the PMCC

A sample $r$ is only an estimate of the population correlation $\rho$ . To test whether there is genuine linear correlation in the population:

$H_0: \rho = 0 \ (\text{no linear correlation}), \qquad H_1: \rho \neq 0 \ (\text{two-tailed}) \ \text{or} \ \rho > 0,\ \rho < 0 \ (\text{one-tailed}).$

Compare the sample $r$ with the PMCC critical value read from tables for the given $n$ and significance level. Reject $H_0$ if $|r|$ exceeds the critical value (two-tailed) or if $r$ is beyond the one-tailed critical value in the stated direction.

The choice of tail must come from the context, set before seeing the data. If the question asks whether there is any association ("is there correlation?"), use a two-tailed test ( $H_1: \rho \neq 0$ ); if it predicts a direction ("do taller people weigh more?"), use a one-tailed test ( $H_1: \rho > 0$ or $< 0$ ). The one-tailed critical value is smaller (easier to reach) because the whole significance level sits in one tail, so choosing the tail after seeing the data would inflate the true Type I error rate — exactly the malpractice flagged in the hypothesis-testing lesson. As ever in a "test", the population parameter $\rho$ appears in the hypotheses; the sample $r$ is only the evidence weighed against the table.

Worked Example 2 — testing $\rho = 0$ (with mark scheme)

For $n = 10$ pairs the sample PMCC is $r = 0.65$ . Test at the $5\%$ level whether there is positive correlation in the population.

$H_0: \rho = 0, \quad H_1: \rho > 0 \ (\text{one-tailed}). \quad (\text{B1 hypotheses in terms of } \rho)$

From PMCC tables, the one-tailed $5\%$ critical value for $n = 10$ is $0.5494$ .

$r = 0.65 > 0.5494 = \text{critical value} \;\Rightarrow\; \text{reject } H_0. \quad (\text{M1 compare; A1 reject})$

There is evidence at the $5\%$ level of positive linear correlation in the population. (B1 for hypotheses stated with the population parameter $\rho$ — not $r$ ; M1 for the correct one-tailed comparison; A1 for a contextual conclusion. A two-tailed test would compare against $0.6319$ ; $0.65 > 0.6319$ still rejects.)
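The decision rule can be written as a small function. This is a sketch only: the function name `reject_h0` and the labels for the alternative hypothesis are assumptions made for illustration, and the critical value must still be read from the PMCC tables (the two values below are the ones quoted in this example for $n = 10$).

```python
def reject_h0(r, critical_value, alternative):
    """Compare a sample r with a (positive) table critical value.

    alternative: "two-sided" for H1: rho != 0,
                 "greater"   for H1: rho > 0,
                 "less"      for H1: rho < 0.
    """
    if alternative == "two-sided":
        return abs(r) > critical_value
    if alternative == "greater":
        return r > critical_value
    if alternative == "less":
        return r < -critical_value
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")

# Worked Example 2: n = 10, r = 0.65, 5% significance level.
one_tailed = reject_h0(0.65, 0.5494, "greater")
two_tailed = reject_h0(0.65, 0.6319, "two-sided")

print(one_tailed, two_tailed)
```

Passing the alternative as an explicit argument reflects the requirement that the tail is chosen from the context before the data are examined.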

Spearman's Rank Correlation Coefficient

Spearman's coefficient $r_s$ measures monotonic association — whether $y$ tends to increase (or decrease) as $x$ increases, even if the trend is curved. It is simply the PMCC computed on the ranks of the data. When there are no ties, this reduces to the convenient formula

$r_s = 1 - \frac{6\sum d_i^2}{n(n^2 - 1)}, \qquad d_i = \operatorname{rank}(x_i) - \operatorname{rank}(y_i).$
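A sketch of why this shortcut agrees with the PMCC of the ranks, assuming no ties so that each set of ranks is exactly $1, 2, \dots, n$ (here $S_{xx}$, $S_{yy}$, $S_{xy}$ denote the bivariate sums computed on the ranks). Both sets of ranks have the same mean $(n+1)/2$ and the same sum of squares about it:

$S_{xx} = S_{yy} = \sum_{k=1}^{n} k^2 - \frac{1}{n}\left(\sum_{k=1}^{n} k\right)^2 = \frac{n(n+1)(2n+1)}{6} - \frac{n(n+1)^2}{4} = \frac{n(n^2-1)}{12}.$

Because the means are equal, $d_i$ is also the difference of the centred ranks, so

$\sum d_i^2 = S_{xx} + S_{yy} - 2S_{xy} = 2S_{xx} - 2S_{xy},$

and therefore

$r_s = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}} = \frac{S_{xy}}{S_{xx}} = 1 - \frac{\sum d_i^2}{2S_{xx}} = 1 - \frac{6\sum d_i^2}{n(n^2-1)}.$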

Procedure: rank each variable separately (rank $1$ = smallest, say); for tied values assign the average of the positions they share; compute each $d_i$ and $d_i^2$ ; then apply the formula.

Worked Example 3 — Spearman's coefficient and test (with mark scheme)

Seven products are scored by two judges, and each judge's scores are ranked with rank $1$ = highest score. The data and ranks:

Pair	$x$	$y$	Rank $x$	Rank $y$	$d$	$d^2$
1	56	44	6	6	0	0
2	75	70	2	2	0	0
3	45	52	7	5	2	4
4	71	58	3	4	-1	1
5	62	67	4	3	1	1
6	80	82	1	1	0	0
7	58	41	5	7	-2	4

$\sum d^2 = 0+0+4+1+1+0+4 = 10. \quad (\text{M1 ranking}; \ \text{A1 } \textstyle\sum d^2)$

$r_s = 1 - \frac{6\times 10}{7(7^2 - 1)} = 1 - \frac{60}{336} = 1 - 0.1786 = 0.821. \quad (\text{M1 formula; A1})$

Test $H_0$: no association, against $H_1$: positive association, at the $5\%$ level. The one-tailed critical value for $n = 7$ is $0.7143$:

$r_s = 0.821 > 0.7143 \;\Rightarrow\; \text{reject } H_0; \ \text{evidence of positive monotonic association.} \quad (\text{A1 conclusion})$

(M1 for ranking both variables consistently; A1 for $\sum d^2 = 10$ ; M1/A1 for the coefficient; A1 for a contextual conclusion against the table value. A common slip is ranking $x$ ascending but $y$ descending — always rank both the same way.)
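The ranking and the $\sum d^2$ calculation in Worked Example 3 can be reproduced as follows. This is a minimal sketch that assumes there are no tied values; the function names are illustrative, and rank $1$ is given to the largest score to match the table.

```python
def ranks_descending(values):
    """Rank 1 = largest value. Assumes all values are distinct (no ties)."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman_no_ties(xs, ys):
    """Return (r_s, sum of d^2) using the shortcut formula, valid without ties."""
    rank_x = ranks_descending(xs)
    rank_y = ranks_descending(ys)   # both variables ranked the same way
    n = len(xs)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n * n - 1)), sum_d2

x_scores = [56, 75, 45, 71, 62, 80, 58]
y_scores = [44, 70, 52, 58, 67, 82, 41]

r_s, sum_d2 = spearman_no_ties(x_scores, y_scores)
print(ranks_descending(x_scores), ranks_descending(y_scores))
print(sum_d2, round(r_s, 3))
```

Ranking both lists with the same helper guards against the slip of ranking one variable ascending and the other descending.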

When to Use Spearman's vs Pearson's

Feature	Pearson's $r$	Spearman's $r_s$
Measures	linear correlation	monotonic correlation
Data type	continuous (ideally bivariate normal)	ordinal, or non-normal continuous
Sensitivity to outliers	high	low (uses ranks)
Curved monotonic trend	underestimates strength	captures it (can be $\pm 1$ )
Tied values	not an issue	need average ranks

Exam Tip: Choose Spearman's when the data are already ranks, when the relationship is monotonic but visibly non-linear, or when an outlier would distort Pearson's $r$ . Choose Pearson's when a linear model is appropriate and you also want the regression line.
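To see the difference on a curved but monotonic trend, the sketch below uses a made-up data set with $y = x^3$ (an assumption for illustration). Every increase in $x$ comes with an increase in $y$, so Spearman's coefficient is $1$, while Pearson's $r$ falls short of $1$ because the points do not lie on a straight line. The function names are illustrative, and the ranking helper assumes no ties.

```python
import math

def pmcc(xs, ys):
    """Pearson's r using the centred (definition) form of the sums."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)

def ranks_ascending(values):
    """Rank 1 = smallest value. Assumes all values are distinct (no ties)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman's coefficient as the PMCC of the ranks."""
    return pmcc(ranks_ascending(xs), ranks_ascending(ys))

xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]   # monotonic increasing, but curved

r = pmcc(xs, ys)
r_s = spearman(xs, ys)
print(round(r, 3), round(r_s, 3))
```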

Handling Tied Ranks

When two or more values are equal, assign each the average of the ranks they would have occupied. For example the data $10, 15, 15, 20$ receive ranks $1, 2.5, 2.5, 4$ (the two $15$ s share positions $2$ and $3$ , averaging to $2.5$ ). With ties present, the shortcut $r_s = 1 - 6\sum d^2/(n(n^2-1))$ is only an approximation; for accuracy with several ties, compute the PMCC of the ranks directly using the $S_{xy}/\sqrt{S_{xx}S_{yy}}$ formula.
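The averaging rule and the effect of ties on the shortcut can be sketched as follows. The function names are illustrative, and the $y$ values are hypothetical, chosen only so that the $x$ data from the example above ($10, 15, 15, 20$) have something to be paired with.

```python
import math

def average_ranks(values):
    """Rank 1 = smallest; tied values share the mean of the positions they occupy."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1     # first position occupied by v
        count = ordered.count(v)         # number of values tied with v
        ranks.append(first + (count - 1) / 2)
    return ranks

def pmcc(xs, ys):
    """Pearson's r using the centred form of the sums."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)

xs = [10, 15, 15, 20]
ys = [3, 5, 4, 8]            # hypothetical paired values, no ties

rank_x = average_ranks(xs)   # the two 15s share positions 2 and 3
rank_y = average_ranks(ys)

n = len(xs)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
shortcut = 1 - 6 * sum_d2 / (n * (n * n - 1))
exact = pmcc(rank_x, rank_y)

print(rank_x, rank_y)
print(round(shortcut, 4), round(exact, 4))
```

With one tie in four values the two answers already differ slightly, which is why the PMCC of the ranks is preferred when ties are numerous.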

Coefficient of Determination $r^2$

For the linear model, $r^2$ (the square of Pearson's $r$ ) is the proportion of the variation in $y$ explained by the linear relationship with $x$ :

$r^2 = \frac{\text{explained variation}}{\text{total variation}} = 1 - \frac{\sum (y_i - \hat y_i)^2}{S_{yy}}.$

For Worked Example 1, $r = 0.989$ gives $r^2 = 0.978$ : about $97.8\%$ of the variation in $y$ is explained by the linear fit — an excellent model. A value $r = 0.8$ gives $r^2 = 0.64$ , i.e. $64\%$ explained, leaving $36\%$ to other factors or noise.
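The two expressions for $r^2$ can be compared numerically. The sketch below (variable names are illustrative) fits the $y$-on-$x$ line to the raw pairs behind Worked Example 1, sums the squared residuals, and checks that $1 - \sum (y_i - \hat y_i)^2 / S_{yy}$ agrees with the square of the PMCC.

```python
pairs = [(2, 3), (4, 7), (6, 8), (8, 12), (10, 15)]
n = len(pairs)

mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n

s_xx = sum((x - mean_x) ** 2 for x, _ in pairs)
s_yy = sum((y - mean_y) ** 2 for _, y in pairs)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)

# Least-squares regression line of y on x: y = a + b x.
b = s_xy / s_xx
a = mean_y - b * mean_x

# Unexplained variation: sum of squared residuals about the fitted line.
sse = sum((y - (a + b * x)) ** 2 for x, y in pairs)

r_squared_from_r = s_xy ** 2 / (s_xx * s_yy)
r_squared_from_residuals = 1 - sse / s_yy

print(round(sse, 3), round(r_squared_from_r, 3), round(r_squared_from_residuals, 3))
```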
