This lesson covers correlation and regression in depth — the tools used to quantify and model the relationship between two variables. These techniques are central to the A-Level Mathematics statistics specification and are frequently tested using data from the large data set.
Correlation measures the strength and direction of the linear relationship between two variables. It does not imply causation.
The product-moment correlation coefficient (PMCC), denoted $r$, quantifies the linear correlation between variables $x$ and $y$:
$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$$
where:
$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}, \qquad S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}, \qquad S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$$
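These sums can be evaluated step by step on a calculator, or sketched in a few lines of code. The following Python snippet (with illustrative data, not LDS values) is one way to check a hand calculation of $r$:

```python
from math import sqrt

def pmcc(xs, ys):
    """Product-moment correlation coefficient via the summary sums."""
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return sxy / sqrt(sxx * syy)

# Perfectly linear data (y = 1 + 2x) gives r = 1
print(pmcc([1, 2, 3, 4], [3, 5, 7, 9]))
```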
| Value of r | Interpretation |
|---|---|
| r=1 | Perfect positive linear correlation |
| 0.7≤r<1 | Strong positive correlation |
| 0.4≤r<0.7 | Moderate positive correlation |
| 0<r<0.4 | Weak positive correlation |
| r=0 | No linear correlation |
| −1<r<0 | Negative correlation (use same thresholds) |
| r=−1 | Perfect negative linear correlation |
Calculating r for daily mean temperature and daily total sunshine at Heathrow in May might give r=0.72, indicating a strong positive correlation. This is physically reasonable: warmer days tend to be sunnier.
Spearman's rank correlation coefficient, rs, measures the strength and direction of the monotonic relationship between two variables. It is calculated by ranking each variable and then applying the PMCC formula to the ranks, or using the shortcut:
$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$
where d is the difference between the ranks for each pair and n is the number of pairs.
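As a check on a hand calculation, the shortcut can be sketched in Python. The `ranks` helper below is a hypothetical convenience, and the snippet assumes no tied values (ties require averaged ranks, which the shortcut formula does not handle exactly):

```python
def spearman(xs, ys):
    """Spearman's rank correlation via the d^2 shortcut (assumes no ties)."""
    n = len(xs)

    def ranks(vs):
        order = sorted(range(n), key=lambda i: vs[i])
        r = [0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Monotonic but non-linear data (y = x^3) still gives r_s = 1
print(spearman([1, 2, 3, 4, 5], [1, 8, 27, 64, 125]))
```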
| Feature | PMCC (r) | Spearman's rank (rs) |
|---|---|---|
| Data type | Interval/ratio (numerical) | Ordinal (ranked) or numerical |
| Relationship type | Linear | Monotonic |
| Sensitivity to outliers | High | Low |
| Affected by linear coding | No | No |
| When to use | Normally distributed data, linear relationship | Non-normal data, non-linear monotonic relationship |
Regression analysis fits a straight line to the data, which can then be used for prediction.
The least squares regression line of y on x minimises the sum of the squared vertical distances from each data point to the line:
$$y = a + bx$$
where:
$$b = \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\bar{x}$$
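A hand calculation of $a$ and $b$ can be checked with a short sketch like the following (illustrative data):

```python
def regression_line(xs, ys):
    """Least-squares y-on-x line: returns (a, b) with y = a + b*x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

# Data generated from y = 1 + 2x, so the fit recovers a = 1, b = 2
a, b = regression_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)
```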
Example: suppose the regression line for daily total sunshine ($y$, hours) on daily mean temperature ($x$, °C) at Hurn is $y = -4.2 + 0.65x$. The gradient means that each 1 °C increase in daily mean temperature is associated with an extra 0.65 hours of sunshine; the intercept $-4.2$ has no sensible physical meaning (sunshine cannot be negative), a reminder that the model only describes the observed range of the data.
| Term | Definition | Reliability |
|---|---|---|
| Interpolation | Predicting y for an x-value within the range of the data | Generally reliable — the linear model has been validated in this region |
| Extrapolation | Predicting y for an x-value outside the range of the data | Unreliable — the linear relationship may not hold beyond the observed data |
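To make the distinction concrete, here is a sketch using the Hurn line from the example above; the temperatures 14 °C and 2 °C are illustrative choices, one plausibly inside the data range and one outside it:

```python
def sunshine(temp_c):
    """Hurn model from the example above: y = -4.2 + 0.65x."""
    return -4.2 + 0.65 * temp_c

print(sunshine(14))  # about 4.9 hours: a plausible in-range prediction
print(sunshine(2))   # about -2.9 hours: impossible, the model breaks down
```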
The regression line is a model fitted to the observed data. Outside this range, the relationship between the variables may change. For example, the Hurn line $y = -4.2 + 0.65x$ predicts negative sunshine for temperatures below about 6.5 °C, which is physically impossible.
A correlation between two variables does not mean that one causes the other. There may be:

- a third (confounding) variable that influences both
- coincidence, particularly in small samples
- causation acting in the opposite direction to the one assumed
The PMCC and least squares regression assume a linear relationship. If the true relationship is curved, these measures will be misleading. Always examine the scatter diagram before relying on r or the regression line.
A single outlier can dramatically change the value of r and the position of the regression line. Always check for and investigate outliers before drawing conclusions.
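A small illustration of this sensitivity, using made-up data: five nearly collinear points give $r$ close to 1, and adding a single outlier collapses it:

```python
from math import sqrt

def pmcc(xs, ys):
    """Product-moment correlation coefficient via the summary sums."""
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return sxy / sqrt(sxx * syy)

clean = ([1, 2, 3, 4, 5], [2.1, 3.9, 6.0, 8.1, 9.9])
with_outlier = ([1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.0, 8.1, 9.9, 0.5])

r_clean = pmcc(*clean)
r_out = pmcc(*with_outlier)
print(r_clean, r_out)  # r drops sharply once the outlier is included
```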
The correlation and regression results are only valid for the range of data from which they were calculated. Making predictions beyond this range (extrapolation) is unreliable.
If the spread of the data changes across the range of x (heteroscedasticity), the regression model may not be appropriate.
When answering exam questions that involve regression and the large data set:

- quote $r$ and interpret its strength and sign in context
- interpret the gradient with its units, in context
- state whether a prediction uses interpolation or extrapolation, and comment on its reliability
- remember that correlation does not imply causation
Data for daily mean temperature (x, °C) and daily total rainfall (y, mm) at Leeming for 8 days in April:
| x | 8.2 | 9.1 | 10.5 | 7.8 | 11.2 | 12.0 | 9.5 | 10.0 |
|---|---|---|---|---|---|---|---|---|
| y | 5.1 | 3.8 | 2.2 | 6.0 | 1.5 | 0.8 | 3.5 | 2.9 |
$\bar{x} = 9.7875$, $\bar{y} = 3.225$; computing from the table, $S_{xx} = 14.47$, $S_{xy} = -17.57$, $S_{yy} = 21.64$ (2 d.p.)

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{-17.57}{\sqrt{14.47 \times 21.64}} = \frac{-17.57}{17.69} \approx -0.993$$

$$b = \frac{S_{xy}}{S_{xx}} = \frac{-17.57}{14.47} \approx -1.214, \qquad a = \bar{y} - b\bar{x} = 3.225 - (-1.214)(9.7875) \approx 15.11$$

Regression line: $y = 15.11 - 1.21x$

Interpretation: There is a very strong negative correlation ($r \approx -0.99$) between daily mean temperature and daily rainfall at Leeming in April. For each 1 °C increase in temperature, the model predicts that daily rainfall decreases by approximately 1.21 mm. Predicting rainfall for a temperature of 9 °C (interpolation) gives $y = 15.11 - 1.21(9) \approx 4.2$ mm. Predicting for 20 °C (extrapolation) would give a negative rainfall, which is impossible — this shows the danger of extrapolation.
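It is worth recomputing the summary statistics directly from the raw data table before trusting quoted values, since a slip in $S_{xy}$ propagates into $r$, $b$ and $a$. A minimal Python check:

```python
from math import sqrt

# Raw Leeming data from the table above
x = [8.2, 9.1, 10.5, 7.8, 11.2, 12.0, 9.5, 10.0]
y = [5.1, 3.8, 2.2, 6.0, 1.5, 0.8, 3.5, 2.9]
n = len(x)

sxx = sum(v * v for v in x) - sum(x) ** 2 / n
syy = sum(v * v for v in y) - sum(y) ** 2 / n
sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

r = sxy / sqrt(sxx * syy)
b = sxy / sxx
a = sum(y) / n - b * sum(x) / n
print(round(sxx, 2), round(syy, 2), round(sxy, 2), r, a, b)
```

If the recomputed values disagree with quoted summary statistics, trust the raw data.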
Exam Tip: When interpreting the gradient b, always use the context. Do not say "for each unit increase in x, y increases by b." Instead say: "for each 1°C increase in daily mean temperature, the model predicts that daily total sunshine increases by 0.65 hours." This contextual interpretation is where marks are awarded.
AQA 7357 specification, Paper 3 — Statistics, sub-strands O1, O2, O3 and N (Large Data Set context) covers: interpreting diagrams for single-variable data, including understanding that area in a histogram represents frequency; interpreting scatter diagrams and regression lines for bivariate data, including recognising scatter diagrams that contain distinct sections of the population; informal interpretation of correlation; and understanding that correlation does not imply causation (refer to the official specification document for exact wording). Although bivariate analysis is examined in Paper 3, the LDS-flavoured contexts that AQA favours also appear in Paper 1 (modelling) and Paper 2 (mechanics-style data). The AQA formula booklet provides the product-moment correlation coefficient formula and the least-squares regression coefficients; students are expected to interpret values, not derive them.
Question (8 marks):
A student investigates the relationship between mean monthly maximum daily temperature x (°C) and mean monthly daily total rainfall y (mm) for a single LDS-style location across 12 months. Summary statistics are: ∑x=168, ∑y=720, ∑x2=2616, ∑y2=47400, ∑xy=9648, n=12.
(a) Calculate the product-moment correlation coefficient r, giving your answer to 3 s.f. (3)
(b) Find the equation of the regression line of y on x in the form y=a+bx. (3)
(c) The student uses the line to predict the rainfall in a month with mean maximum temperature 30°C. Comment on the reliability of this prediction. (2)
Solution with mark scheme:
(a) Step 1 — compute the sums of squares and products.
$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 2616 - \frac{168^2}{12} = 2616 - 2352 = 264$$

$$S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 47400 - \frac{720^2}{12} = 47400 - 43200 = 4200$$

$$S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 9648 - \frac{168 \times 720}{12} = 9648 - 10080 = -432$$
M1 — correct method for at least one of Sxx, Syy, Sxy. The minus sign in Sxy is the substantive content here; candidates who write Sxy=432 lose the eventual sign of r and b.
Step 2 — apply the PMCC formula.
$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{-432}{\sqrt{264 \times 4200}} = \frac{-432}{\sqrt{1108800}} = \frac{-432}{1052.99...}$$
M1 — substituting into the PMCC formula correctly.
A1 — r=−0.410 (3 s.f.). The negative value indicates weak-to-moderate negative linear association; warmer months in this LDS sample are associated with lower rainfall.
(b) Step 1 — compute the gradient.
$$b = \frac{S_{xy}}{S_{xx}} = \frac{-432}{264} = -1.6363...$$
M1 — correct formula for b.
Step 2 — compute the intercept using $\bar{y} = a + b\bar{x}$.

$$\bar{x} = 168/12 = 14, \qquad \bar{y} = 720/12 = 60$$

$$a = \bar{y} - b\bar{x} = 60 - (-1.6363...)(14) = 60 + 22.909... = 82.9$$

M1 — using $a = \bar{y} - b\bar{x}$.
A1 — final equation y=82.9−1.64x (3 s.f.).
(c) B1 — 30°C lies outside the range of the sample data. This is extrapolation, and the linear relationship is not guaranteed to hold beyond the observed range.
B1 — additionally, $|r| = 0.410$ is not strong, so the linear model itself accounts for only $r^2 \approx 16.8\%$ of the variation in $y$. Even within the data range, the prediction is unreliable; outside it, doubly so.
Total: 8 marks (M2 A1, M2 A1, B2).
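The arithmetic in parts (a) and (b) can be verified from the summary sums with a short script (a calculator check, not part of the required working):

```python
from math import sqrt

# Summary statistics from the question
n, sum_x, sum_y = 12, 168, 720
sum_x2, sum_y2, sum_xy = 2616, 47400, 9648

sxx = sum_x2 - sum_x ** 2 / n        # 264
syy = sum_y2 - sum_y ** 2 / n        # 4200
sxy = sum_xy - sum_x * sum_y / n     # -432

r = sxy / sqrt(sxx * syy)            # -0.410 to 3 s.f.
b = sxy / sxx                        # -1.636...
a = sum_y / n - b * sum_x / n        # 82.9 to 3 s.f.
print(r, a, b)
```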
Question (6 marks): A scatter diagram of LDS-style daily mean windspeed w (kn) against daily mean cloud cover c (oktas) for a single station shows positive correlation. The regression line of w on c is w=4.2+0.85c.
(a) Interpret the value 0.85 in this context. (2)
(b) Explain why it would be inappropriate to use this line to predict cloud cover from a measured windspeed. (2)
(c) Comment on whether this evidence supports the claim that increased cloud cover causes higher windspeeds. (2)
Mark scheme decomposition by AO:
(a) B2 (AO2) — "for each additional okta of cloud cover, the model predicts an increase of 0.85 kn in daily mean windspeed": one mark for the per-unit interpretation, one for stating it in context with units.

(b) B1 (AO2) + B1 (AO3) — the $w$-on-$c$ line minimises residuals in the $w$ direction, so it is fitted for predicting $w$ from $c$; predicting $c$ from $w$ requires the regression line of $c$ on $w$, which is in general a different line.

(c) B2 (AO3) — correlation does not imply causation; both variables may be driven by a third factor (for example, the passage of weather systems), so the data alone cannot support the causal claim.
Total: 6 marks split AO1 = 0, AO2 = 3, AO3 = 3. This is unusually AO2/AO3-heavy — bivariate questions on Paper 3 are designed to test interpretation, not arithmetic. Interpretation marks dominate.
Connects to:
Paper 3 — Data presentation (scatter diagrams, outliers): the PMCC is calculated from sums-of-products, but its interpretation relies on visual inspection of the scatter. A single outlier can drag r from 0.9 to 0.3 or vice versa. AQA explicitly tests recognition of "scatter diagrams which include distinct sections of the population" — bimodal scatter where two sub-populations exist invalidates a single regression line.
Paper 1 — Logarithmic linearisation: non-linear relationships of the form y=axn or y=abx become linear when log-transformed. Plotting logy against logx gives a line with gradient n and intercept loga for the power model; plotting logy against x gives gradient logb for the exponential model. Regression and PMCC then apply to the transformed data. Synoptic LDS questions often exploit this — temperature-versus-time data is genuinely linear, but rainfall-versus-elevation is closer to exponential and requires transformation before r is meaningful.
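A sketch of the exponential case, using synthetic data (the values are illustrative, not from the LDS): regressing $\ln y$ on $x$ recovers the model parameters exactly when the data is exactly exponential:

```python
from math import log, exp

# Synthetic exponential data y = 3 * 1.5**x (illustrative values)
xs = [0, 1, 2, 3, 4, 5]
ys = [3 * 1.5 ** x for x in xs]

# Regress ln(y) on x: gradient = ln(b), intercept = ln(a)
ls = [log(y) for y in ys]
n = len(xs)
sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
sxy = sum(x * l for x, l in zip(xs, ls)) - sum(xs) * sum(ls) / n
grad = sxy / sxx
intercept = sum(ls) / n - grad * sum(xs) / n

# Undo the log transform to recover the original parameters
b_model, a_model = exp(grad), exp(intercept)
print(a_model, b_model)  # recovers a = 3, b = 1.5 for exact data
```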
Paper 3 — Hypothesis test for ρ: the PMCC $r$ is a sample statistic estimating the population correlation $\rho$. AQA tests $H_0\colon \rho = 0$ against $H_1\colon \rho \neq 0$ (or a one-tailed alternative) using critical values from the formula booklet at the 5% / 1% significance levels for given $n$. A non-zero $r$ in a small sample may not be statistically significant; a small $r$ in a large sample may be highly significant.
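The booklet's critical-value tables are the method AQA expects; an equivalent check (stated here as an aside, not the booklet method) converts $r$ to a $t$-statistic with $n-2$ degrees of freedom:

```python
from math import sqrt

def t_statistic(r, n):
    """t = r * sqrt((n-2)/(1-r^2)); compare with t_{n-2} critical values."""
    return r * sqrt((n - 2) / (1 - r * r))

# Values from the worked exam question above: r = -0.410, n = 12
t = t_statistic(-0.410, 12)
print(t)  # about -1.42: smaller in magnitude than the two-tailed 5%
          # critical value for 10 degrees of freedom (~2.23), so not
          # significant, consistent with the "not strong" comment above
```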
Paper 1 — Modelling cycle: regression is the canonical example of "fit, predict, criticise". The AQA modelling cycle requires students to (i) propose a model, (ii) collect data, (iii) fit the model, (iv) make predictions, (v) evaluate fit, (vi) refine. Linear regression slots into stages (iii)-(v); recognising when to abandon the linear assumption is the (vi) refinement step that A* candidates handle explicitly.
Paper 3 — Probability and conditional probability: if cloud cover and windspeed are both treated as random variables, the regression coefficient b=Sxy/Sxx is the empirical analogue of Cov(X,Y)/Var(X), which appears in conditional-expectation calculations. A* candidates connect the descriptive bivariate machinery to the underlying probability structure.
Correlation and regression questions on 7357 split AO marks heavily toward AO2 and AO3.