This lesson covers correlation and regression as required by the Edexcel A-Level Mathematics specification (9MA0), Paper 3 Section A -- Statistics. You must be able to draw and interpret scatter diagrams, understand the product moment correlation coefficient (PMCC), calculate and use regression lines, and understand the difference between interpolation and extrapolation.
A scatter diagram (or scatter plot) is a graph that shows the relationship between two variables. Each data point is plotted as a point with coordinates (x, y).
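If you are investigating whether the number of hours of study affects exam score, hours of study is the explanatory (independent) variable, plotted on the x-axis, and exam score is the response (dependent) variable, plotted on the y-axis.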
Correlation describes the strength and direction of the linear relationship between two variables.
| Type | Description |
|---|---|
| Strong positive | As x increases, y tends to increase. Points cluster tightly around an upward line. |
| Weak positive | General upward trend but points are more scattered. |
| No correlation | No linear relationship between x and y. Points scattered randomly. |
| Weak negative | General downward trend but points are more scattered. |
| Strong negative | As x increases, y tends to decrease. Points cluster tightly around a downward line. |
Exam Tip: Correlation does not imply causation. Just because two variables are correlated does not mean one causes the other. There may be a lurking (confounding) variable (a third variable that affects both), or the correlation may be coincidental.
The product moment correlation coefficient (denoted r) is a numerical measure of the strength and direction of the linear relationship between two variables.
| Value of r | Interpretation |
|---|---|
| 0.8 ≤ r ≤ 1 | Strong positive correlation |
| 0.5 ≤ r < 0.8 | Moderate positive correlation |
| 0 < r < 0.5 | Weak positive correlation |
| r = 0 | No linear correlation |
| -0.5 < r < 0 | Weak negative correlation |
| -0.8 < r ≤ -0.5 | Moderate negative correlation |
| -1 ≤ r ≤ -0.8 | Strong negative correlation |
Exam Tip: The PMCC only measures linear correlation. A strong non-linear relationship (e.g. quadratic) could give r close to 0. Always check the scatter diagram.
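The tip above can be checked numerically. The following is a minimal sketch, assuming the standard definition r = Sxy / sqrt(Sxx × Syy); the function name `pmcc` and both data sets are illustrative and do not come from the lesson.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return sxy / sqrt(sxx * syy)

# Illustrative data (an assumption, not from the lesson).
xs = [-3, -2, -1, 0, 1, 2, 3]
quadratic = [x ** 2 for x in xs]    # perfect quadratic relationship
linear = [2 * x + 1 for x in xs]    # perfect linear relationship

r_quadratic = pmcc(xs, quadratic)
r_linear = pmcc(xs, linear)
print(r_quadratic, r_linear)
```

For this symmetric data the quadratic relationship gives r equal to 0 even though y is completely determined by x, while the straight-line data gives r equal to 1. In the exam the calculator's statistics mode would normally produce r; the sketch only illustrates why the scatter diagram should be checked.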
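A regression line is the line of best fit through the data points. The most common is the regression line of y on x, the least-squares line that minimises the sum of the squared vertical distances between the data points and the line. Its equation has the form: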
y = a + bx
where a is the y-intercept and b is the gradient, given by b = Sxy/Sxx and a = y-bar - b × x-bar.
The regression line always passes through the point (x-bar, y-bar).
Given: n = 8, Σx = 120, Σy = 200, Σx² = 2040, Σxy = 3300
x-bar = 120/8 = 15, y-bar = 200/8 = 25
Sxx = Σx² - (Σx)²/n = 2040 - 120²/8 = 2040 - 1800 = 240
Sxy = Σxy - (Σx)(Σy)/n = 3300 - (120 × 200)/8 = 3300 - 3000 = 300
b = Sxy/Sxx = 300/240 = 1.25
a = y-bar - b × x-bar = 25 - 1.25 × 15 = 25 - 18.75 = 6.25
Regression line: y = 6.25 + 1.25x
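The worked example above can be reproduced in a few lines. This is a sketch only: the function name `regression_from_summaries` and its parameter names are illustrative, and the inputs are the summary statistics given in the example.

```python
def regression_from_summaries(n, sum_x, sum_y, sum_xx, sum_xy):
    """Return (a, b) for the regression line y = a + bx of y on x."""
    x_bar = sum_x / n
    y_bar = sum_y / n
    sxx = sum_xx - sum_x ** 2 / n       # Sxx = sum(x^2) - (sum x)^2 / n
    sxy = sum_xy - sum_x * sum_y / n    # Sxy = sum(xy) - (sum x)(sum y) / n
    b = sxy / sxx                       # gradient
    a = y_bar - b * x_bar               # intercept; puts the line through (x-bar, y-bar)
    return a, b

a, b = regression_from_summaries(n=8, sum_x=120, sum_y=200, sum_xx=2040, sum_xy=3300)
print(f"y = {a} + {b}x")  # prints "y = 6.25 + 1.25x"
```

Substituting x-bar = 15 into the result gives 6.25 + 1.25 × 15 = 25 = y-bar, confirming that the line passes through the mean point.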
Interpolation means using the regression line to estimate a value of y for a value of x that lies within the range of the observed data. This is generally reliable.
Extrapolation means using the regression line to estimate a value of y for a value of x that lies outside the range of the observed data. This is unreliable because the linear relationship may not hold beyond the data.
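If your data for hours of study (x) ranges from 2 to 10, then estimating y for any x between 2 and 10 is interpolation, while estimating y for an x below 2 or above 10 is extrapolation.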
Exam Tip: If asked to comment on the reliability of an estimate, always check whether it is interpolation or extrapolation. If extrapolation, state that the estimate is unreliable.
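The reliability check described above amounts to a range test. A minimal sketch follows, using the hours-of-study range of 2 to 10; the function name `classify_estimate` and the test values 6 and 15 are illustrative assumptions.

```python
def classify_estimate(x, x_min, x_max):
    """Label an estimate by whether x lies inside the observed data range."""
    return "interpolation" if x_min <= x <= x_max else "extrapolation"

# Observed hours of study range from 2 to 10; 6 and 15 are hypothetical queries.
for hours in (6, 15):
    print(hours, classify_estimate(hours, x_min=2, x_max=10))
```

A value at either end of the range is treated here as interpolation, since it is itself an observed x-value.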
A residual is the difference between the observed value and the predicted value:
Residual = observed y - predicted y
If residuals show a pattern (e.g. they systematically increase), the linear model may not be appropriate.
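To see what a patterned set of residuals looks like, the sketch below fits a least-squares line to data that actually follow a curve. The data (y = x² for x = 1 to 5) and the helper name `fit_line` are illustrative assumptions, not taken from the lesson.

```python
def fit_line(xs, ys):
    """Least-squares regression line of y on x; returns (a, b) for y = a + bx."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b = sxy / sxx
    a = sum_y / n - b * sum_x / n
    return a, b

# Hypothetical curved data: y = x^2.
xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
# Residual = observed y - predicted y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(residuals)  # prints [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive at both ends and negative in the middle. That U-shape is the kind of systematic pattern suggesting a linear model is not appropriate, even though the residuals still sum to zero.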
Edexcel 9MA0-03 specification section 4 (Statistics), sub-strands 4.1 and 4.2 (Paper 3, Statistics and Mechanics), requires candidates to interpret diagrams for bivariate data, including scatter diagrams, regression lines and the product moment correlation coefficient; to understand informal interpretation of correlation; and to understand that correlation does not imply causation (refer to the official specification document for exact wording). Although examined exclusively on Paper 3, regression and correlation appear synoptically with section 1 (data presentation, scatter diagrams), section 6 of the Pure paper (logarithms, used to linearise non-linear models such as y = ax^n or y = ab^x) and the Year 2 hypothesis-test sub-strand (testing whether ρ = 0 for a sample correlation r). The Edexcel formula booklet provides the regression-line form y = a + bx, but candidates are not asked to compute r from raw ∑x, ∑y, ∑xy totals because the calculator does that, so questions test interpretation and use of given values.
Question (8 marks):
A researcher records the daily mean temperature x (°C) and the number of ice creams sold y at a kiosk on 12 randomly chosen summer days. Summary statistics give a product moment correlation coefficient of r=0.892, and the regression line of y on x is y=−38+14.6x. The recorded temperatures range from 15°C to 28°C.
(a) Interpret the value of r in context. (2)
(b) Interpret the gradient and intercept of the regression line in context. (3)
(c) Use the regression line to estimate ice-cream sales when the daily mean temperature is 22°C, and comment on the reliability of your estimate. (2)
(d) The researcher wishes to predict sales when x=35°C. Comment, with justification, on the validity of using the regression line for this prediction. (1)
Solution with mark scheme:
(a) B1: r=0.892 indicates a strong positive linear correlation between daily mean temperature and number of ice creams sold.
B1: in context, as temperature increases, ice-cream sales tend to increase, and the data points lie close to a straight line. Common error: writing "strong positive correlation" alone, without the word linear, loses the second mark, because r measures linear association only.
(b) B1: gradient b=14.6: for every 1°C increase in daily mean temperature, the model predicts approximately 14.6 additional ice creams sold.
B1: intercept a=−38: the model predicts −38 ice creams sold when temperature is 0°C.
B1: the intercept is not meaningful in context (you cannot sell a negative number of ice creams), and x=0°C lies far outside the data range 15≤x≤28. Examiners reward this dual observation: meaningless value and outside data range.
(c) M1: substitute x=22: y=−38+14.6×22=−38+321.2=283.2.
A1: predicted sales approximately 283 ice creams. Reliability: x=22°C lies within the observed data range [15,28], so this is interpolation and the prediction is reliable provided the linear relationship holds.
(d) B1: x=35°C lies outside the data range [15,28], so using the regression line constitutes extrapolation. The linear relationship may not extend to higher temperatures (saturation, supply limits, or non-linear behaviour at extremes), so the prediction is unreliable.
Total: 8 marks (B6 M1 A1).
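The arithmetic in parts (b) to (d) can be verified with a short script. It is a sketch under the question's stated model y = −38 + 14.6x and data range 15 to 28; the function names `predict_sales` and `is_interpolation` are illustrative.

```python
def predict_sales(temp_c):
    """Regression line from the question: y = -38 + 14.6x."""
    return -38 + 14.6 * temp_c

def is_interpolation(temp_c, low=15, high=28):
    """True when the temperature lies inside the recorded range [15, 28]."""
    return low <= temp_c <= high

for temp in (22, 35, 0):
    estimate = predict_sales(temp)
    kind = "interpolation" if is_interpolation(temp) else "extrapolation"
    print(f"x = {temp}: predicted y = {estimate:.1f} ({kind})")
```

For x = 22 this reproduces the 283.2 found in part (c). For x = 35 and x = 0 the script flags extrapolation, and the negative prediction at x = 0 shows why the intercept has no sensible interpretation in context.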
Question (6 marks): A scatter diagram shows the relationship between hours of revision x and exam mark y for a sample of 20 students. The product moment correlation coefficient is calculated as r=0.74. The equation of the regression line of y on x is y=32+4.1x.
(a) State what is measured by r, and interpret r=0.74 in context. (2)
(b) A teacher claims, "This shows that doing more revision causes higher exam marks." Comment on this claim. (2)
(c) Use the regression equation to estimate the exam mark for a student who revises for 8 hours, stating one assumption required for your estimate to be reliable. (2)
Mark scheme decomposition by AO:
(a)
(b)
(c)
Total: 6 marks split AO1 = 2, AO2 = 2, AO3 = 2. Paper 3 regression questions deliberately balance procedural calculation (AO1) with interpretation in context (AO2) and critique of modelling assumptions (AO3) — an even split is characteristic.
Connects to:
Section 1 — Data presentation (scatter diagrams): the scatter diagram is the visual prerequisite for any correlation analysis. Before computing r or fitting a regression line, candidates should sketch (or inspect) the scatter plot to verify that a linear relationship is plausible. A clearly curved pattern with r≈0 is the textbook trap.
Pure section 6 (logarithms, linearising non-linear data): if data follow y = ax^n, then log y = log a + n log x, which is a linear relationship between log y and log x. Similarly, y = ab^x gives log y = log a + x log b, a linear relationship between log y and x. Plotting log y against log x (or against x) and computing r on the transformed data is a standard Paper 3 modelling technique that ties Pure and Statistics together.
Year 2 hypothesis testing for correlation: the sample value r is used to test H0: ρ = 0 (no correlation in the population) against H1: ρ ≠ 0 (or a one-sided alternative, ρ > 0 or ρ < 0). Critical values from Edexcel tables depend on sample size n and significance level. This is the inferential bridge from descriptive statistics to formal testing.
Section 1 — Large data set: Edexcel's prescribed large data set (weather data) routinely supplies bivariate context for regression questions — daily mean temperature against rainfall, sunshine hours against pressure. Candidates are expected to know broadly that such relationships exist and to recognise plausible variable choices.
Modelling cycle: regression sits inside the wider modelling framework — formulate hypothesis, collect data, fit model, validate, refine. Critiquing model fit (residuals, extrapolation risk, outliers) is the AO3 reasoning that distinguishes A* answers from procedurally correct A answers.
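The logarithmic linearisation mentioned in the connections above can be sketched as follows. The data are generated from a hypothetical model y = 3x² with no noise, so the values a = 3 and n = 2, the helper name `fit_line` and the choice of base-10 logs are all assumptions made for illustration.

```python
from math import log10

def fit_line(us, vs):
    """Least-squares line v = intercept + slope * u; returns (intercept, slope)."""
    n = len(us)
    sum_u, sum_v = sum(us), sum(vs)
    suu = sum(u * u for u in us) - sum_u ** 2 / n
    suv = sum(u * v for u, v in zip(us, vs)) - sum_u * sum_v / n
    slope = suv / suu
    intercept = sum_v / n - slope * sum_u / n
    return intercept, slope

# Hypothetical data following y = a * x^n with a = 3, n = 2.
xs = [1, 2, 3, 4, 5, 6]
ys = [3 * x ** 2 for x in xs]

# log y = log a + n log x, so regress log y on log x.
log_x = [log10(x) for x in xs]
log_y = [log10(y) for y in ys]
intercept, slope = fit_line(log_x, log_y)

n_est = slope            # gradient of the log-log line estimates n
a_est = 10 ** intercept  # intercept is log a, so a = 10^intercept
print(round(n_est, 6), round(a_est, 6))
```

For a model of the form y = ab^x the same approach would regress log y on x instead, with the gradient estimating log b.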
Correlation and regression questions on 9MA0-03 distribute AO marks more evenly than Pure topics, reflecting their interpretation-heavy character:
| AO | Typical share | Earned by |
|---|---|---|
| AO1 (knowledge / procedure) | 30–40% | Substituting into a given regression equation, reading r values, identifying gradient and intercept |
| AO2 (reasoning / interpretation) | 35–45% | Interpreting r, gradient and intercept in context; identifying interpolation vs extrapolation; commenting on linearity |
| AO3 (problem-solving / modelling) | 20–30% | Critiquing causation claims, identifying confounders, judging model validity, suggesting refinements |
Examiner-rewarded phrasing: "strong positive linear correlation"; "for every 1-unit increase in x, y increases by approximately b units"; "x=k lies outside the data range, so this is extrapolation and the prediction is unreliable"; "correlation does not imply causation — a confounding variable may explain both". Phrases that lose marks: "strong positive correlation" without linear; "as x increases, y increases" without quantifying the rate; "the prediction is reliable" without checking whether the x-value lies within the data range.