Correlation and Regression

This lesson covers correlation and regression as required by the Edexcel A-Level Mathematics specification (9MA0), Paper 3 Section A -- Statistics. You must be able to draw and interpret scatter diagrams, understand the product moment correlation coefficient (PMCC), calculate and use regression lines, and understand the difference between interpolation and extrapolation.

Scatter Diagrams

A scatter diagram (or scatter plot) is a graph that shows the relationship between two variables. Each data point is plotted as a point with coordinates (x, y).

The explanatory and response variables

The explanatory variable (independent variable) is plotted on the x-axis. This is the variable you think might influence the other.
The response variable (dependent variable) is plotted on the y-axis. This is the variable that may be affected.

Example

If you are investigating whether the number of hours of study affects exam score:

Hours of study = explanatory variable (x-axis)
Exam score = response variable (y-axis)

Types of Correlation

Correlation describes the strength and direction of the linear relationship between two variables.

Type	Description
Strong positive	As x increases, y tends to increase. Points cluster tightly around an upward line.
Weak positive	General upward trend but points are more scattered.
No correlation	No linear relationship between x and y. Points scattered randomly.
Weak negative	General downward trend but points are more scattered.
Strong negative	As x increases, y tends to decrease. Points cluster tightly around a downward line.

Exam Tip: Correlation does not imply causation. Just because two variables are correlated does not mean one causes the other. There may be a lurking (confounding) variable, that is, a third variable that affects both, or the correlation may be coincidental.

The Product Moment Correlation Coefficient (PMCC)

The product moment correlation coefficient (denoted r) is a numerical measure of the strength and direction of the linear relationship between two variables.

Properties of r

r always lies between -1 and +1 (inclusive).
r = +1: perfect positive linear correlation.
r = -1: perfect negative linear correlation.
r = 0: no linear correlation.

Interpreting r

Value of r	Interpretation
0.8 ≤ r ≤ 1	Strong positive correlation
0.5 ≤ r < 0.8	Moderate positive correlation
0 < r < 0.5	Weak positive correlation
r = 0	No linear correlation
-0.5 < r < 0	Weak negative correlation
-0.8 < r ≤ -0.5	Moderate negative correlation
-1 ≤ r ≤ -0.8	Strong negative correlation
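
The bands in the table above can be written as a small lookup. This is only a sketch: the function name `describe_r` is hypothetical, and the cut-offs are copied from this lesson's table (other sources may draw the boundaries slightly differently).

```python
def describe_r(r):
    """Describe a PMCC value using the bands in the lesson's table."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and +1 inclusive")
    if r == 0:
        return "no linear correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size >= 0.8:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction} correlation"


print(describe_r(0.892))  # strong positive correlation
print(describe_r(0.74))   # moderate positive correlation
print(describe_r(-0.3))   # weak negative correlation
```

The two positive values are the ones used in the exam-style questions later in this lesson.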

Exam Tip: The PMCC only measures linear correlation. A strong non-linear relationship (e.g. quadratic) could give r close to 0. Always check the scatter diagram.
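
To see the tip in action, here is a minimal sketch that computes r for points lying exactly on the curve y = x². It assumes the standard formula r = Sxy / sqrt(Sxx × Syy), which this lesson does not write out, together with the summary-statistic formulas from the Regression Lines section. The helper name `pmcc` and the data are made up for illustration.

```python
import math


def pmcc(xs, ys):
    """Product moment correlation coefficient, r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return sxy / math.sqrt(sxx * syy)


# Illustrative data: a perfect quadratic relationship, symmetric about x = 0
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_quadratic = pmcc(xs, ys)
print(r_quadratic)  # 0.0
```

Every point lies exactly on a curve, yet r is zero, because the upward and downward halves of the parabola cancel in Sxy. A scatter diagram would reveal the relationship immediately.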

Regression Lines

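A regression line is the least squares line of best fit through the data points: it minimises the sum of the squared vertical distances between the points and the line. The most commonly used is the regression line of y on x.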

The equation of the regression line of y on x

y = a + bx

where:

b = Sxy / Sxx (the gradient)
a = y-bar - b × x-bar (the y-intercept)

Summary statistics

Sxx = Sigma(xi²) - (Sigma(xi))²/n
Syy = Sigma(yi²) - (Sigma(yi))²/n
Sxy = Sigma(xi × yi) - (Sigma(xi) × Sigma(yi))/n
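
For reference, the summary statistics above are the computational forms of sums of squared deviations. The block below sketches where they come from, and shows the standard formula for r (not written out in this lesson) in which Syy is used.

```latex
\begin{align*}
S_{xx} &= \sum (x_i - \bar{x})^2
        = \sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2
        = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}, \\
S_{xy} &= \sum (x_i - \bar{x})(y_i - \bar{y})
        = \sum x_i y_i - \frac{\sum x_i \sum y_i}{n}, \\
r &= \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}.
\end{align*}
```

The middle step for Sxx uses x-bar = Sigma(xi)/n; the expansion for Sxy follows the same pattern.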

Key property

The regression line always passes through the point (x-bar, y-bar).
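
A short check of this property, substituting x = x-bar into y = a + bx and using a = y-bar - b × x-bar:

```latex
\begin{align*}
y &= a + b\bar{x} \\
  &= (\bar{y} - b\bar{x}) + b\bar{x} \\
  &= \bar{y}.
\end{align*}
```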

Example

Given: n = 8, Sigma(x) = 120, Sigma(y) = 200, Sigma(x²) = 2040, Sigma(xy) = 3300

x-bar = 120/8 = 15, y-bar = 200/8 = 25

Sxx = 2040 - (120²/8) = 2040 - 1800 = 240

Sxy = 3300 - (120 × 200/8) = 3300 - 3000 = 300

b = 300/240 = 1.25

a = 25 - 1.25 × 15 = 25 - 18.75 = 6.25

Regression line: y = 6.25 + 1.25x
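
The arithmetic in this example can be checked with a few lines of Python. This is a sketch only; the function name `regression_from_summaries` is hypothetical and simply applies b = Sxy/Sxx and a = y-bar - b × x-bar to the given totals.

```python
def regression_from_summaries(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (a, b) for the y-on-x line y = a + b*x from summary totals."""
    sxx = sum_x2 - sum_x ** 2 / n
    sxy = sum_xy - sum_x * sum_y / n
    b = sxy / sxx                      # gradient
    a = sum_y / n - b * sum_x / n      # intercept: y_bar - b * x_bar
    return a, b


a, b = regression_from_summaries(n=8, sum_x=120, sum_y=200, sum_x2=2040, sum_xy=3300)
print(a, b)  # 6.25 1.25

# Calculation check: the line should pass through (x_bar, y_bar) = (15, 25)
print(a + b * 15)  # 25.0
```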

Interpreting the regression line

b (gradient): for every 1-unit increase in x, the predicted value of y changes by b units (an increase if b > 0, a decrease if b < 0).
a (y-intercept): the predicted value of y when x = 0. This may or may not be meaningful -- if x = 0 is outside the range of the data, the intercept has no practical interpretation.

Interpolation and Extrapolation

Interpolation

Interpolation means using the regression line to estimate a value of y for a value of x that lies within the range of the observed data. This is generally reliable.

Extrapolation

Extrapolation means using the regression line to estimate a value of y for a value of x that lies outside the range of the observed data. This is unreliable because the linear relationship may not hold beyond the data.

Example

If your data for hours of study (x) ranges from 2 to 10:

Estimating the score for x = 6 is interpolation -- reliable.
Estimating the score for x = 15 is extrapolation -- unreliable.

Exam Tip: If asked to comment on the reliability of an estimate, always check whether it is interpolation or extrapolation. If extrapolation, state that the estimate is unreliable.
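
The check described in this tip can be sketched as a small helper. The function name `predict` is hypothetical. Purely for illustration, the coefficients are borrowed from the earlier worked regression example (a = 6.25, b = 1.25) and the data range 2 to 10 from the example above; the lesson does not say these belong to the same data set.

```python
def predict(a, b, x, x_min, x_max):
    """Return (predicted y, label) for the line y = a + b*x.

    The label is 'interpolation' if x lies within [x_min, x_max],
    otherwise 'extrapolation'.
    """
    y_hat = a + b * x
    label = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, label


# Assumed values for illustration only
A, B = 6.25, 1.25
X_MIN, X_MAX = 2, 10

print(predict(A, B, 6, X_MIN, X_MAX))   # (13.75, 'interpolation')
print(predict(A, B, 15, X_MIN, X_MAX))  # (25.0, 'extrapolation')
```

The line will always return a number; only the range check says whether that number deserves to be trusted.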

Residuals

A residual is the difference between the observed value and the predicted value:

Residual = observed y - predicted y

A positive residual means the observed value is above the regression line.
A negative residual means the observed value is below the regression line.
The sum of all residuals for the regression line of y on x is always zero.

If residuals show a pattern (e.g. they systematically increase), the linear model may not be appropriate.
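
A minimal sketch of computing residuals and confirming that they sum to zero for the least squares line. The data below are made up for illustration and do not come from the lesson; `fit_line` is a hypothetical helper that applies b = Sxy/Sxx and a = y-bar - b × x-bar.

```python
def fit_line(xs, ys):
    """Least squares y-on-x line: returns (a, b) for y = a + b*x."""
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    b = sxy / sxx
    a = sum(ys) / n - b * sum(xs) / n
    return a, b


# Made-up illustrative data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)

# Residual = observed y - predicted y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print([round(e, 2) for e in residuals])
print(abs(sum(residuals)) < 1e-9)  # True: zero up to floating-point rounding
```

Listing the residuals in order of x is also a quick way to spot a pattern, such as a run of positive values followed by a run of negative ones.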

Common Exam Pitfalls

Correlation does not imply causation. Always state this if the question involves a causal claim.
Extrapolation is unreliable. State this clearly when a prediction falls outside the data range.
The regression line passes through (x-bar, y-bar). Use this as a calculation check.
Interpret the gradient in context. Say "for every additional hour of study, the predicted exam score increases by 1.25 marks", not just "b = 1.25".
Only use the y-on-x line to predict y from x. Do not rearrange it to predict x from y.

Summary

A scatter diagram shows the relationship between two variables. The explanatory variable goes on the x-axis.
The PMCC (r) measures the strength and direction of linear correlation: -1 ≤ r ≤ +1.
Correlation does not imply causation.
The regression line of y on x is y = a + bx, where b = Sxy/Sxx and a = y-bar - b(x-bar). It passes through (x-bar, y-bar).
Interpolation (within the data range) is reliable; extrapolation (outside the data range) is unreliable.
Residuals = observed - predicted. The sum of residuals is always zero for the least squares regression line.

A-Level Deep Dive: Correlation and Regression

Spec mapping

Edexcel 9MA0-03 specification section 4 (Statistics), sub-strands 4.1 and 4.2 (Paper 3, Statistics and Mechanics), covers interpreting diagrams for bivariate data, including scatter diagrams, regression lines and the product moment correlation coefficient. It also requires an informal interpretation of correlation and an understanding that correlation does not imply causation (refer to the official specification document for exact wording). Although examined exclusively on Paper 3, regression and correlation appear synoptically with section 1 (data presentation, scatter diagrams), section 6 of the Pure paper (logarithms, used to linearise non-linear models such as $y = ax^n$ or $y = ab^x$ ) and the Year 2 hypothesis-test sub-strand (testing whether $\rho = 0$ for a sample correlation $r$ ). The Edexcel formula booklet provides the regression-line form $y = a + bx$ , but candidates are not asked to compute $r$ from raw $\sum x$ , $\sum y$ , $\sum xy$ totals (the calculator does that), so questions test interpretation and use of given values.

Worked example with full mark scheme

Question (8 marks):

A researcher records the daily mean temperature $x$ ( $\textdegree\text{C}$ ) and the number of ice creams sold $y$ at a kiosk on 12 randomly chosen summer days. Summary statistics give a product moment correlation coefficient of $r = 0.892$ , and the regression line of $y$ on $x$ is $y = -38 + 14.6x$ . The recorded temperatures range from $15\,\textdegree\text{C}$ to $28\,\textdegree\text{C}$ .

(a) Interpret the value of $r$ in context. (2)

(b) Interpret the gradient and intercept of the regression line in context. (3)

(c) Use the regression line to estimate ice-cream sales when the daily mean temperature is $22\,\textdegree\text{C}$ , and comment on the reliability of your estimate. (2)

(d) The researcher wishes to predict sales when $x = 35\,\textdegree\text{C}$ . Comment, with justification, on the validity of using the regression line for this prediction. (1)

Solution with mark scheme:

(a) B1 — $r = 0.892$ indicates a strong positive linear correlation between daily mean temperature and number of ice creams sold.

B1 — in context: as temperature increases, ice-cream sales tend to increase, and the data points lie close to a straight line. Common error: writing "strong positive correlation" alone, without the word linear, loses the second mark — $r$ measures linear association only.

(b) B1 — gradient $b = 14.6$ : for every $1\,\textdegree\text{C}$ increase in daily mean temperature, the model predicts approximately $14.6$ additional ice creams sold.

B1 — intercept $a = -38$ : the model predicts $-38$ ice creams sold when temperature is $0\,\textdegree\text{C}$ .

B1 — the intercept is not meaningful in context (you cannot sell a negative number of ice creams), and $x = 0\,\textdegree\text{C}$ lies far outside the data range $15 \leq x \leq 28$ . Examiners reward this dual observation: meaningless value and outside data range.

(c) M1 — substitute $x = 22$ : $y = -38 + 14.6 \times 22 = -38 + 321.2 = 283.2$ .

A1 — predicted sales approximately $283$ ice creams. Reliability: $x = 22\,\textdegree\text{C}$ lies within the observed data range $[15, 28]$ , so this is interpolation and the prediction is reliable provided the linear relationship holds.

(d) B1 — $x = 35\,\textdegree\text{C}$ lies outside the data range $[15, 28]$ , so using the regression line constitutes extrapolation. The linear relationship may not extend to higher temperatures (saturation, supply limits, or non-linear behaviour at extremes), so the prediction is unreliable.

Total: 8 marks (B6 M1 A1).

Specimen question modelled on the Edexcel 9MA0 Paper 3 format

Question (6 marks): A scatter diagram shows the relationship between hours of revision $x$ and exam mark $y$ for a sample of 20 students. The product moment correlation coefficient is calculated as $r = 0.74$ . The equation of the regression line of $y$ on $x$ is $y = 32 + 4.1x$ .

(a) State what is measured by $r$ , and interpret $r = 0.74$ in context. (2)

(b) A teacher claims, "This shows that doing more revision causes higher exam marks." Comment on this claim. (2)

(c) Use the regression equation to estimate the exam mark for a student who revises for $8$ hours, stating one assumption required for your estimate to be reliable. (2)

Mark scheme decomposition by AO:

(a)

B1 (AO1.2) — $r$ measures the strength and direction of linear association between two variables.
B1 (AO2.2b) — $r = 0.74$ indicates moderately strong positive linear correlation: students who revise more tend to score higher marks.

(b)

B1 (AO2.4) — correlation does not imply causation; the regression line establishes association, not a causal mechanism.
B1 (AO3.5b) — alternative explanations exist: a confounding variable (e.g. prior ability, motivation, study habits) may drive both revision time and exam performance simultaneously.

(c)

M1 (AO1.1a) — substitute $x = 8$ : $y = 32 + 4.1 \times 8 = 32 + 32.8 = 64.8$ .
A1 (AO3.5b) — predicted mark approximately $65$ , valid provided $x = 8$ lies within the original data range (interpolation), and assuming the linear relationship continues to hold.

Total: 6 marks split AO1 = 2, AO2 = 2, AO3 = 2. Paper 3 regression questions deliberately balance procedural calculation (AO1) with interpretation in context (AO2) and critique of modelling assumptions (AO3) — an even split is characteristic.

Synoptic links

Connects to:

Section 1 — Data presentation (scatter diagrams): the scatter diagram is the visual prerequisite for any correlation analysis. Before computing $r$ or fitting a regression line, candidates should sketch (or inspect) the scatter plot to verify that a linear relationship is plausible. A clearly curved pattern with $r \approx 0$ is the textbook trap.
Pure section 6 — Logarithms (linearising non-linear data): if data follow $y = ax^n$ , then $\log y = \log a + n \log x$ — a linear relationship between $\log y$ and $\log x$ . Similarly $y = ab^x$ gives $\log y = \log a + x \log b$ . Plotting $\log y$ against $\log x$ (or $x$ ) and computing $r$ on the transformed data is a standard Paper 3 modelling technique that ties Pure and Statistics together.
Year 2 hypothesis testing for correlation: the sample value $r$ is used to test $H_0: \rho = 0$ (no correlation in the population) against $H_1: \rho \neq 0$ (or one-sided alternatives). Critical values from Edexcel tables depend on sample size $n$ and significance level. This is the inferential bridge from descriptive statistics to formal testing.
Section 1 — Large data set: Edexcel's prescribed large data set (weather data) routinely supplies bivariate context for regression questions — daily mean temperature against rainfall, sunshine hours against pressure. Candidates are expected to know broadly that such relationships exist and to recognise plausible variable choices.
Modelling cycle: regression sits inside the wider modelling framework — formulate hypothesis, collect data, fit model, validate, refine. Critiquing model fit (residuals, extrapolation risk, outliers) is the AO3 reasoning that distinguishes A* answers from procedurally correct A answers.
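
The logarithm link above can be sketched in code. The example generates hypothetical data exactly from $y = 3x^2$ (values chosen purely for illustration), fits a least squares line to $\log y$ against $\log x$ , and recovers $n$ from the gradient and $a$ from the intercept. The helper `fit_line` is an assumption of this sketch and applies $b = S_{xy}/S_{xx}$ and $a = \bar{y} - b\bar{x}$ .

```python
import math


def fit_line(xs, ys):
    """Least squares line through (xs, ys): returns (intercept, gradient)."""
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    gradient = sxy / sxx
    intercept = sum(ys) / n - gradient * sum(xs) / n
    return intercept, gradient


# Hypothetical data generated exactly from y = 3 * x^2
xs = [1, 2, 3, 4, 5]
ys = [3 * x ** 2 for x in xs]

# Linearise: log y = log a + n log x
log_x = [math.log10(x) for x in xs]
log_y = [math.log10(y) for y in ys]

intercept, gradient = fit_line(log_x, log_y)
n_est = gradient          # gradient of the log-log line estimates n
a_est = 10 ** intercept   # intercept is log10(a), so undo the logarithm

print(round(n_est, 6), round(a_est, 6))
```

For a model of the form $y = ab^x$ the same idea applies with $\log y$ plotted against $x$ itself, so only `log_y` would be transformed.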

Mark-scheme literacy

Correlation and regression questions on 9MA0-03 distribute AO marks more evenly than Pure topics, reflecting their interpretation-heavy character:

AO	Typical share	Earned by
AO1 (knowledge / procedure)	30–40%	Substituting into a given regression equation, reading $r$ values, identifying gradient and intercept
AO2 (reasoning / interpretation)	35–45%	Interpreting $r$ , gradient and intercept in context; identifying interpolation vs extrapolation; commenting on linearity
AO3 (problem-solving / modelling)	20–30%	Critiquing causation claims, identifying confounders, judging model validity, suggesting refinements

Examiner-rewarded phrasing: "strong positive linear correlation"; "for every $1$ -unit increase in $x$ , $y$ increases by approximately $b$ units"; " $x = k$ lies outside the data range, so this is extrapolation and the prediction is unreliable"; "correlation does not imply causation — a confounding variable may explain both". Phrases that lose marks: "strong positive correlation" without linear; "as $x$ increases, $y$ increases" without quantifying the rate; "the prediction is reliable" without checking whether the $x$ -value lies within the data range.
