Correlation & Regression

This lesson covers correlation and regression at A-Level. These techniques are used to explore and quantify the relationship between two variables. Understanding when and how to apply regression analysis and interpret correlation coefficients is essential for the statistics component of A-Level Mathematics.

Scatter Diagrams

A scatter diagram plots pairs of data $(x, y)$ to visualise the relationship between two variables. The explanatory (independent) variable is plotted on the $x$ -axis, and the response (dependent) variable on the $y$ -axis.

Types of Correlation

Type	Description
Positive correlation	As $x$ increases, $y$ tends to increase
Negative correlation	As $x$ increases, $y$ tends to decrease
No correlation	No linear relationship between $x$ and $y$
Strong correlation	Points lie close to a straight line
Weak correlation	Points are widely scattered but show a general trend

Exam Tip: Correlation does not imply causation. Even if two variables are strongly correlated, one does not necessarily cause the other. Always consider confounding variables and the context.

The Product Moment Correlation Coefficient (PMCC)

The PMCC, denoted $r$ , measures the strength and direction of the linear relationship between two variables:

$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$

where: $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}, \quad S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}, \quad S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$

Properties of $r$ :

$-1 \leq r \leq 1$
$r = 1$ : perfect positive linear correlation
$r = -1$ : perfect negative linear correlation
$r = 0$ : no linear correlation
$|r|$ close to 1 indicates strong linear correlation
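The formulae above can be checked numerically. The following is a minimal sketch in Python, assuming a small made-up data set (`hours` and `scores` are hypothetical and not taken from any exam question); the function names are illustrative only.

```python
from math import sqrt

def summary_sums(xs, ys):
    """Return (S_xx, S_yy, S_xy) using the summation formulae."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return s_xx, s_yy, s_xy

def pmcc(xs, ys):
    """Product moment correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    s_xx, s_yy, s_xy = summary_sums(xs, ys)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical data, for illustration only.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

print(summary_sums(hours, scores))     # (10.0, 6.0, 6.0)
print(round(pmcc(hours, scores), 3))   # 0.775
```

Here $S_{xx} = 10$, $S_{yy} = 6$ and $S_{xy} = 6$, so $r = 6 / \sqrt{60} \approx 0.775$, a fairly strong positive linear correlation.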

Linear Regression

The least squares regression line of $y$ on $x$ is:

$y = a + bx$

where: $b = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad a = \bar{y} - b\bar{x}$

This line minimises the sum of the squared vertical distances from the data points to the line.
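As a rough illustration of the least squares property, the sketch below fits the line to a small hypothetical data set and then checks that nudging the intercept or gradient makes the sum of squared vertical distances larger. The data and function names are assumptions made for this example.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b

def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared vertical distances from the points to y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = regression_line(xs, ys)
best = sum_squared_residuals(xs, ys, a, b)
print(round(a, 3), round(b, 3), round(best, 3))  # 2.2 0.6 2.4

# Any other line gives a larger sum of squares.
print(sum_squared_residuals(xs, ys, a + 0.1, b) > best)  # True
print(sum_squared_residuals(xs, ys, a, b - 0.1) > best)  # True
```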

Interpretation

$b$ (the gradient) represents the change in $y$ for a one-unit increase in $x$ .
$a$ (the $y$ -intercept) represents the predicted value of $y$ when $x = 0$ .

Exam Tip: When interpreting the regression equation, always relate the gradient and intercept to the context of the data. State what $x$ and $y$ represent. If $x = 0$ is outside the range of data, note that the intercept may not have a practical interpretation.

Interpolation and Extrapolation

Method	Definition	Reliability
Interpolation	Estimating within the range of the observed data	Generally reliable
Extrapolation	Estimating beyond the range of the observed data	Unreliable — the relationship may not continue
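The distinction in the table can be sketched as a simple range check. The line $y = 10 + 3x$ and the observed range $2 \leq x \leq 8$ used below are hypothetical values chosen for illustration, and `predict` is an illustrative helper rather than a standard function.

```python
def predict(a, b, x, x_min, x_max):
    """Return (estimate, kind) for the line y = a + b*x.

    kind is 'interpolation' if x lies within the observed range
    [x_min, x_max], and 'extrapolation' otherwise.
    """
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return a + b * x, kind

# Hypothetical regression line y = 10 + 3x fitted to data with 2 <= x <= 8.
print(predict(10, 3, 5, 2, 8))    # (25, 'interpolation')
print(predict(10, 3, 12, 2, 8))   # (46, 'extrapolation')
```

The second estimate can still be calculated, but it should be reported with a comment that it lies outside the observed data and may be unreliable.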

Coding

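If data is transformed using a linear coding such as $X' = \frac{X - a}{b}$ , the regression line can be calculated using the coded data and then decoded by substituting the coding back into the equation. (The constants $a$ and $b$ here belong to the coding and are not the regression coefficients.) Linear coding with a positive scale factor does not change the value of $r$ (the PMCC); a negative scale factor applied to one variable reverses the sign of $r$ .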
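A short numerical check can make the effect of coding concrete. The sketch below assumes hypothetical data and the coding $x' = \frac{x - 100}{10}$; the coding constants are written `shift` and `scale` in the code to avoid confusion with the regression coefficients.

```python
from math import sqrt

def sums(xs, ys):
    """Return (S_xx, S_yy, S_xy)."""
    n = len(xs)
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return s_xx, s_yy, s_xy

def pmcc(xs, ys):
    s_xx, s_yy, s_xy = sums(xs, ys)
    return s_xy / sqrt(s_xx * s_yy)

def regression_line(xs, ys):
    """Return (a, b) for y = a + b*x."""
    s_xx, _, s_xy = sums(xs, ys)
    b = s_xy / s_xx
    return sum(ys) / len(ys) - b * sum(xs) / len(xs), b

# Hypothetical data and coding x' = (x - shift) / scale.
xs = [110, 120, 130, 140, 150]
ys = [2, 4, 5, 4, 5]
shift, scale = 100, 10
coded = [(x - shift) / scale for x in xs]

# r is the same for the raw and the coded data.
print(round(pmcc(xs, ys), 4), round(pmcc(coded, ys), 4))

# Fit y = a' + b'x' on coded data, then decode:
# y = a' + b'(x - shift)/scale = (a' - b'*shift/scale) + (b'/scale) x
a_coded, b_coded = regression_line(coded, ys)
a_decoded = a_coded - b_coded * shift / scale
b_decoded = b_coded / scale
print(round(a_decoded, 3), round(b_decoded, 3))  # -3.8 0.06
```

The decoded line agrees with the line fitted directly to the uncoded data, which is the point of the technique: the arithmetic is done on smaller numbers without changing the result.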

Hypothesis Test for Correlation

To test whether a correlation coefficient is significantly different from zero:

$H_0: \rho = 0$ (no correlation in the population)
$H_1: \rho > 0$ or $\rho < 0$ (one-tailed), or $\rho \neq 0$ (two-tailed)

Compare the sample $r$ with the critical value from the PMCC table for the given sample size $n$ and significance level $\alpha$ .

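For a two-tailed test, reject $H_0$ if $|r|$ exceeds the critical value. For a one-tailed test, reject $H_0$ only if $r$ lies beyond the critical value in the direction stated in $H_1$ . Rejecting $H_0$ means there is evidence of correlation in the population at the chosen significance level.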
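The decision rule can be sketched as a small function. This is an illustration under stated assumptions: the critical value must still be read from the PMCC table for the correct $n$ , significance level and number of tails, and the function name and the labels `"greater"`, `"less"` and `"two-sided"` are choices made for this example. The numbers used are those of the tree-height question in this lesson ( $n = 25$ , 5% one-tailed, critical value 0.3365).

```python
def reject_h0(r, critical_value, alternative):
    """Return True if H0: rho = 0 is rejected.

    alternative is 'greater' (H1: rho > 0), 'less' (H1: rho < 0)
    or 'two-sided' (H1: rho != 0). critical_value is the positive
    table value for the matching tail and significance level.
    """
    if alternative == "greater":
        return r > critical_value
    if alternative == "less":
        return r < -critical_value
    if alternative == "two-sided":
        return abs(r) > critical_value
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")

# r = 0.412, one-tailed test for positive correlation.
print(reject_h0(0.412, 0.3365, "greater"))   # True: reject H0

# The same size of r in the wrong direction does not support H1: rho > 0.
print(reject_h0(-0.412, 0.3365, "greater"))  # False: do not reject H0
```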

Summary

Use scatter diagrams to visualise relationships and identify the type and strength of correlation.
The PMCC ( $r$ ) measures linear correlation: $-1 \leq r \leq 1$ .
The regression line $y = a + bx$ is used for prediction; interpret $a$ and $b$ in context.
Interpolation is generally reliable; extrapolation is unreliable because the relationship may not continue outside the observed data.
Correlation does not imply causation.
Hypothesis tests for $\rho$ use the PMCC table with the appropriate significance level.

Exam Tip: When asked to comment on correlation, state the type (positive/negative), strength (strong/moderate/weak), and relate it to the context. Then state whether it implies a causal relationship (usually it does not).

A-Level Deep Dive: Correlation and Regression

Spec mapping

AQA 7357 specification, Paper 3 — Statistics, Section P: "Use and interpret scatter diagrams for bivariate data; recognise correlation and know that it does not imply causation; interpret the product moment correlation coefficient (PMCC) $r$ ; understand and use the equation of a regression line $y = a + bx$ ; understand the concepts of interpolation and extrapolation; carry out a hypothesis test for zero correlation using a critical-value table for $\rho$ ." This material is examined in 7357/3 alongside Probability, Statistical Distributions and Hypothesis Testing. Although it is a Paper 3 topic, regression and correlation interact heavily with Section O (data presentation, the large data set) and with the logarithmic linearisation techniques from Pure Paper 1, where non-linear relationships $y = ax^n$ or $y = ab^x$ are reduced to linear form before regression is applied. The AQA formula booklet provides PMCC critical values but not the formula for $r$ ; candidates are expected to obtain $r$ using the calculator's statistical mode.

Worked example with full mark scheme

Question (8 marks): A researcher records the number of hours $x$ spent revising and the percentage score $y$ achieved by 10 students. Summary statistics are $\sum x = 50$ , $\sum y = 620$ , $\sum x^2 = 310$ , $\sum y^2 = 39{,}794$ , $\sum xy = 3{,}340$ , $n = 10$ . The PMCC is calculated as $r = 0.842$ (3 s.f.).

(a) Interpret the value of $r$ in context. (2)

(b) Find the equation of the regression line of $y$ on $x$ in the form $y = a + bx$ . (4)

(c) Use your line to estimate the score for a student who revised for 7 hours, and comment on the reliability of estimating the score for a student who revised for 20 hours. (2)

Solution with mark scheme:

(a) B1 — quantitative description: $r = 0.842$ indicates strong positive linear correlation between revision hours and score.

B1 — context: as revision hours increase, the score tends to increase (linear association). Common error: writing "as $x$ increases, $y$ increases" without anchoring to the variables in context loses the contextual mark.

(b) Step 1 — compute the means.

$\bar{x} = \dfrac{\sum x}{n} = \dfrac{50}{10} = 5, \qquad \bar{y} = \dfrac{\sum y}{n} = \dfrac{620}{10} = 62$

M1 — correct means.

Step 2 — compute $S_{xy}$ and $S_{xx}$ .

$S_{xy} = \sum xy - \dfrac{(\sum x)(\sum y)}{n} = 3340 - \dfrac{50 \cdot 620}{10} = 3340 - 3100 = 240$

$S_{xx} = \sum x^2 - \dfrac{(\sum x)^2}{n} = 310 - \dfrac{2500}{10} = 310 - 250 = 60$

M1 — correct $S_{xy}$ and $S_{xx}$ formulae applied. A common slip is using $\sum y^2$ here by mistake: that sum is needed only for $S_{yy}$ , which appears in the PMCC but never in the gradient of the regression line of $y$ on $x$ .

Step 3 — gradient.

$b = \dfrac{S_{xy}}{S_{xx}} = \dfrac{240}{60} = 4$

A1 — gradient $b = 4$ .

Step 4 — intercept and final equation.

$a = \bar{y} - b\bar{x} = 62 - 4 \cdot 5 = 42$

$y = 42 + 4x$

A1 — equation in the requested form. Writing $y = 4x + 42$ (slope-intercept order) is mathematically identical but examiners may penalise if the question stem demands $y = a + bx$ exactly.

(c) B1 — for $x = 7$ : $y = 42 + 4 \cdot 7 = 70$ , so the predicted score is approximately 70%. Provided 7 hours lies within the observed range of revision times, this is interpolation and the estimate is reliable.

B1 — for $x = 20$ : this lies far outside the observed range (the mean revision time is only 5 hours, and $S_{xx} = 60$ indicates a modest spread about that mean). Extrapolating to 20 hours is unreliable: the linear model may not hold, and the predicted $y = 42 + 4 \cdot 20 = 122$ exceeds 100%, which is impossible for a percentage score.

Total: 8 marks (B2 + M2 A2 + B2).
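The arithmetic in parts (b) and (c) can be reproduced directly from the summary statistics. The sketch below uses only the sums the regression line needs ( $\sum y^2$ is not required for the line of $y$ on $x$ ); the variable names are illustrative.

```python
# Summary statistics from the question.
n = 10
sum_x, sum_y = 50, 620
sum_x2, sum_xy = 310, 3340

# Means, then S_xy and S_xx.
mean_x, mean_y = sum_x / n, sum_y / n
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_x2 - sum_x ** 2 / n

# Gradient and intercept of y = a + b*x.
b = s_xy / s_xx
a = mean_y - b * mean_x
print(s_xy, s_xx, b, a)   # 240.0 60.0 4.0 42.0

# Part (c): estimates at 7 hours and at 20 hours.
print(a + b * 7)    # 70.0
print(a + b * 20)   # 122.0, above 100%, so the extrapolation is not credible
```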

Specimen question modelled on the AQA 7357/3 format

Question (6 marks): A scientist tests whether tree height $h$ (m) and trunk diameter $d$ (cm) are correlated for a random sample of 25 trees. The PMCC is $r = 0.412$ . Test, at the 5% level, whether there is evidence of positive correlation between $h$ and $d$ in the population. The critical value of $\rho$ for $n = 25$ at the 5% one-tailed level is 0.3365.

Mark scheme decomposition by AO:

B1 (AO1.2) — hypotheses: $H_0: \rho = 0$ , $H_1: \rho > 0$ (one-tailed because the question asks about positive correlation).
B1 (AO1.2) — significance level and critical value: 5% one-tailed, critical value 0.3365.
M1 (AO1.1b) — comparison: $0.412 > 0.3365$ .
A1 (AO2.2b) — conclusion in symbols: reject $H_0$ .
A1 (AO3.2a) — conclusion in context: there is evidence at the 5% level of positive correlation between tree height and trunk diameter in the population.
B1 (AO3.5b) — assumption acknowledged: for the PMCC critical-value table to be valid, the test assumes the sample is drawn from a population in which the two variables jointly follow a bivariate normal distribution.

Total: 6 marks split AO1 = 3, AO2 = 1, AO3 = 2. Hypothesis-test questions on Paper 3 deliberately load AO3 marks onto the conclusion-in-context and assumption-checking steps, and these are the marks that weaker candidates most often miss.

Synoptic links

Connects to:

Section O — Data presentation and the large data set: scatter diagrams are the visual prerequisite for any $r$ calculation. The AQA-prescribed large data set is routinely used as the source of bivariate samples in exam questions; familiarity with its variables (and their typical units, ranges and outliers) is assumed without reminder.
Pure Paper 1 — Logarithmic linearisation: when $y = a x^n$ , taking logs gives $\log y = \log a + n \log x$ , which is linear in $\log x$ . Plotting $\log y$ against $\log x$ and fitting a regression line recovers $n$ as the gradient and $\log a$ as the intercept. Similarly $y = a b^x$ linearises to $\log y = \log a + x \log b$ — linear in $x$ . This is how regression is extended to non-linear bivariate data without leaving the A-Level toolkit.
Section R — Hypothesis testing: the test for $\rho = 0$ is structurally identical to the binomial / normal tests learned earlier — $H_0$ versus $H_1$ , significance level, critical value, conclusion in context. Only the test statistic (PMCC $r$ ) and the critical-value table change.
Section S — Probability distributions: the PMCC critical-value table presupposes that $(X, Y)$ is bivariate normal in the population. For non-normal bivariate data, Spearman's rank correlation $r_s$ is the appropriate alternative — non-examinable at A-Level but a natural undergraduate extension.
Modelling cycle (cross-paper): every regression question is implicitly a modelling question — fit, interpret, predict, criticise. Extrapolation criticism is the dominant AO3 source on Paper 3 regression.
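The logarithmic linearisation link above can be illustrated with a short sketch. The data are hypothetical and generated exactly from $y = 2x^3$ and $y = 3 \cdot 2^x$ , so the fitted lines recover the constants up to rounding error; real data would scatter about the line. Base-10 logs are an arbitrary choice here, and the function name is illustrative.

```python
from math import log10

def regression_line(xs, ys):
    """Return (intercept, gradient) of the least squares line of ys on xs."""
    n = len(xs)
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    gradient = s_xy / s_xx
    intercept = sum(ys) / n - gradient * sum(xs) / n
    return intercept, gradient

xs = [1, 2, 4, 8]

# Power model y = a * x^n: regress log y on log x.
power_ys = [2 * x ** 3 for x in xs]            # hypothetical, a = 2, n = 3
c, m = regression_line([log10(x) for x in xs],
                       [log10(y) for y in power_ys])
power_a, power_n = 10 ** c, m                  # intercept is log a, gradient is n
print(round(power_a, 6), round(power_n, 6))    # 2.0 3.0

# Exponential model y = a * b^x: regress log y on x.
exp_ys = [3 * 2 ** x for x in xs]              # hypothetical, a = 3, b = 2
c2, m2 = regression_line(xs, [log10(y) for y in exp_ys])
exp_a, exp_b = 10 ** c2, 10 ** m2              # intercept is log a, gradient is log b
print(round(exp_a, 6), round(exp_b, 6))        # 3.0 2.0
```

Note which variable is logged in each case: both $x$ and $y$ for the power model, but only $y$ for the exponential model.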

Mark-scheme literacy

Correlation and regression questions on 7357/3 split AO marks more evenly than pure topics:

AO	Typical share	Earned by
AO1 (knowledge / procedure)	40–50%	Computing $\bar{x}$ , $\bar{y}$ , $S_{xy}$ , $S_{xx}$ , $b$ , $a$ ; stating hypotheses; reading critical-value tables
AO2 (reasoning / interpretation)	25–35%	Interpreting $r$ in context; choosing between one- and two-tailed tests; justifying rejection of $H_0$
AO3 (problem-solving / modelling)	20–30%	Commenting on extrapolation; criticising the linear model; stating bivariate-normal assumption; commenting on causation

Examiner-rewarded phrasing: "evidence at the 5% level"; "do not reject $H_0$ " (never "accept $H_0$ " — a hypothesis test cannot prove the null); "extrapolation is unreliable because $x = 20$ lies outside the observed range $[0, 12]$ "; "correlation does not imply causation — a confounding variable may explain the association". Phrases that lose marks: "accept $H_0$ " (reserved language — examiners deduct); "this proves there is correlation" (proof versus evidence); " $r$ is high so $x$ causes $y$ " (causal claim from correlational data).

A specific AQA pattern: questions that ask "comment on the suitability of the linear model" expect two points — one about the strength of $r$ (does the data look linear?) and one about the residual pattern or scatter spread (any curvature, fanning, outliers?).

Grade-band model answers

3-mark question

Question: A scatter diagram of 30 paired observations gives $r = -0.78$ . Interpret this value in the context of monthly heating bill $y$ (£) and average outdoor temperature $x$ (°C).

Grade C response (~150 words):

$r = -0.78$ shows a strong negative correlation. As temperature increases, heating bill decreases. The relationship is linear and the points on the scatter diagram would lie close to a straight line with negative gradient.

Examiner commentary: Full marks (3/3). The candidate correctly identifies strength (strong), direction (negative) and links the variables in context. Concise but complete. Many candidates write only "negative correlation" without "strong" — that loses the strength mark. Others omit the contextual link and just describe the abstract relationship between $x$ and $y$ — that loses the context mark.

Grade A response (~210 words):
