This lesson covers correlation and regression at A-Level. These techniques are used to explore and quantify the relationship between two variables. Understanding when and how to apply regression analysis and interpret correlation coefficients is essential for the statistics component of A-Level Mathematics.
A scatter diagram plots pairs of data (x,y) to visualise the relationship between two variables. The explanatory (independent) variable is plotted on the x-axis, and the response (dependent) variable on the y-axis.
| Type | Description |
|---|---|
| Positive correlation | As x increases, y tends to increase |
| Negative correlation | As x increases, y tends to decrease |
| No correlation | No linear relationship between x and y |
| Strong correlation | Points lie close to a straight line |
| Weak correlation | Points are widely scattered but show a general trend |
Exam Tip: Correlation does not imply causation. Even if two variables are strongly correlated, one does not necessarily cause the other. Always consider confounding variables and the context.
The product moment correlation coefficient (PMCC), denoted r, measures the strength and direction of the linear relationship between two variables:
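$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
where:
$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}, \quad S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}, \quad S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$$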
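As a rough illustration of how these summary-statistic formulae fit together, the sketch below computes Sxx, Syy, Sxy and r for a small data set. The function name `pmcc_from_raw` and the data values are invented for this example and do not come from the lesson.

```python
from math import sqrt

def pmcc_from_raw(xs, ys):
    """Return (Sxx, Syy, Sxy, r) using the summary-statistic formulae."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    r = sxy / sqrt(sxx * syy)
    return sxx, syy, sxy, r

# Hypothetical data, invented purely for illustration.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

sxx, syy, sxy, r = pmcc_from_raw(hours, scores)
print(f"Sxx = {sxx:g}, Syy = {syy:g}, Sxy = {sxy:g}, r = {r:.4f}")
```

For this data Sxx = 10, Syy = 6 and Sxy = 6, so r = 6/√60 ≈ 0.775, a fairly strong positive correlation.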
Properties of r:
The least squares regression line of y on x is:
y=a+bx
where: $b = \frac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$
This line minimises the sum of the squared vertical distances from the data points to the line.
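The least squares property can be checked numerically. The sketch below fits y = a + bx to a small invented data set and confirms that nudging a or b away from the fitted values does not reduce the sum of squared vertical distances. The helper names and data are assumptions made for illustration; Sxx and Sxy are computed here in their equivalent "deviations from the mean" form.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def sum_squared_residuals(xs, ys, a, b):
    """Sum of squared vertical distances from the points to y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented purely for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = regression_line(xs, ys)
best = sum_squared_residuals(xs, ys, a, b)

# Perturb the intercept and gradient slightly in each direction.
nudged = [
    sum_squared_residuals(xs, ys, a + da, b + db)
    for da in (-0.1, 0.1)
    for db in (-0.1, 0.1)
]
print(all(s >= best for s in nudged))
```

Here the fitted line is y = 2.2 + 0.6x, and every perturbed line gives a larger sum of squares.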
Exam Tip: When interpreting the regression equation, always relate the gradient and intercept to the context of the data. State what x and y represent. If x=0 is outside the range of data, note that the intercept may not have a practical interpretation.
| Method | Definition | Reliability |
|---|---|---|
| Interpolation | Estimating within the range of the observed data | Generally reliable |
| Extrapolation | Estimating beyond the range of the observed data | Unreliable — the relationship may not continue |
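One way to make the distinction concrete is to record the observed range of x alongside the fitted line and label each estimate accordingly. This is only a sketch: the function `predict`, the line y = 2.2 + 0.6x and the range 1 to 5 are all hypothetical.

```python
def predict(a, b, x, x_min, x_max):
    """Return (estimate, kind) for the line y = a + b*x.

    kind is 'interpolation' if x lies within the observed range
    [x_min, x_max], otherwise 'extrapolation'.
    """
    estimate = a + b * x
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return estimate, kind

# Hypothetical line and observed x-range, for illustration only.
print(predict(2.2, 0.6, 3.5, x_min=1, x_max=5))
print(predict(2.2, 0.6, 9, x_min=1, x_max=5))
```

The arithmetic is identical in both cases; only the trust placed in the answer changes.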
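If data are transformed using a linear coding such as $X' = \frac{X - a}{b}$, the regression line can be calculated from the coded data and then decoded by substituting the coding back in. Coding does not affect the magnitude of r (the PMCC); its sign is also unchanged provided the scale factor b is positive.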
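The effect of coding on r can be checked directly. The sketch below codes hypothetical x-values with X' = (X − 1000)/10 and compares the PMCC before and after; it also shows that a negative scale factor reverses the sign of r while leaving its size unchanged. All names and data are invented for illustration.

```python
from math import sqrt, isclose

def pmcc(xs, ys):
    """Product moment correlation coefficient for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

# Hypothetical data, invented purely for illustration.
xs = [1010, 1020, 1030, 1040, 1050]
ys = [2, 4, 5, 4, 5]

coded = [(x - 1000) / 10 for x in xs]       # positive scale factor
reversed_ = [(1000 - x) / 10 for x in xs]   # negative scale factor

r_raw = pmcc(xs, ys)
r_coded = pmcc(coded, ys)
r_reversed = pmcc(reversed_, ys)

print(isclose(r_raw, r_coded))        # same value
print(isclose(r_raw, -r_reversed))    # same size, opposite sign
```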
To test whether a correlation coefficient is significantly different from zero:
Compare the sample r with the critical value from the PMCC table for the given sample size n and significance level α.
If ∣r∣ exceeds the critical value, reject H0 and conclude there is evidence of correlation.
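The decision rule can be written as a short function. This is a sketch under stated assumptions: the function name `correlation_test` and its `alternative` argument are invented here, and the critical value must still be read from the PMCC table for the relevant n and significance level (for a two-tailed test, from the column for half the significance level in each tail). The example values r = 0.412 and 0.3365 are those quoted in the tree-height question in this lesson.

```python
def correlation_test(r, critical_value, alternative="two-sided"):
    """Return True if H0: rho = 0 is rejected in favour of H1.

    alternative: 'positive'  for H1: rho > 0,
                 'negative'  for H1: rho < 0,
                 'two-sided' for H1: rho != 0.
    critical_value is the positive value read from the PMCC table.
    """
    if alternative == "positive":
        return r > critical_value
    if alternative == "negative":
        return r < -critical_value
    if alternative == "two-sided":
        return abs(r) > critical_value
    raise ValueError("alternative must be 'positive', 'negative' or 'two-sided'")

# n = 25, 5% one-tailed critical value 0.3365, sample r = 0.412.
reject = correlation_test(0.412, 0.3365, alternative="positive")
print("Reject H0" if reject else "Do not reject H0")
```

Since 0.412 > 0.3365, H0 is rejected: there is evidence at the 5% level of positive correlation.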
Exam Tip: When asked to comment on correlation, state the type (positive/negative), strength (strong/moderate/weak), and relate it to the context. Then state whether it implies a causal relationship (usually it does not).
AQA 7357 specification, Paper 3 (Statistics), Section P: "Use and interpret scatter diagrams for bivariate data; recognise correlation and know that it does not imply causation; interpret the product moment correlation coefficient (PMCC) r; understand and use the equation of a regression line y=a+bx; understand the concepts of interpolation and extrapolation; carry out a hypothesis test for zero correlation using a critical-value table for ρ." This material is examined in 7357/3 alongside Probability, Statistical Distributions and Hypothesis Testing. Although a Paper 3 topic, regression and correlation interact heavily with Section O (data presentation, the large data set) and with the logarithmic linearisation techniques from Pure Paper 1, where non-linear relationships $y = ax^n$ or $y = ab^x$ are reduced to linear form before regression is applied. The AQA formula booklet provides PMCC critical values but not the formula for r itself; for routine computation, calculator statistical mode is expected.
Question (8 marks): A researcher records the number of hours x spent revising and the percentage score y achieved by 10 students. Summary statistics are $\sum x = 50$, $\sum y = 620$, $\sum x^2 = 310$, $\sum y^2 = 39{,}260$, $\sum xy = 3{,}340$, $n = 10$. The PMCC is calculated as r=0.842 (3 s.f.).
(a) Interpret the value of r in context. (2)
(b) Find the equation of the regression line of y on x in the form y=a+bx. (4)
(c) Use your line to estimate the score for a student who revised for 7 hours, and comment on the reliability of estimating the score for a student who revised for 20 hours. (2)
Solution with mark scheme:
(a) B1 — quantitative description: r=0.842 indicates strong positive linear correlation between revision hours and score.
B1 — context: as revision hours increase, the score tends to increase (linear association). Common error: writing "as x increases, y increases" without anchoring to the variables in context loses the contextual mark.
(b) Step 1 — compute the means.
$$\bar{x} = \frac{\sum x}{n} = \frac{50}{10} = 5, \quad \bar{y} = \frac{\sum y}{n} = \frac{620}{10} = 62$$
M1 — correct means.
Step 2 — compute Sxy and Sxx.
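$$S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 3340 - \frac{50 \cdot 620}{10} = 3340 - 3100 = 240$$
$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 310 - \frac{2500}{10} = 310 - 250 = 60$$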
M1: correct Sxy and Sxx formulae applied. A common slip is using $\sum y^2$ here by mistake. That quantity feeds into Syy, which is needed only for the PMCC, never for the gradient of the regression line of y on x.
Step 3 — gradient.
$$b = \frac{S_{xy}}{S_{xx}} = \frac{240}{60} = 4$$
A1 — gradient b=4.
Step 4 — intercept and final equation.
$$a = \bar{y} - b\bar{x} = 62 - 4 \cdot 5 = 42$$
$$y = 42 + 4x$$
A1 — equation in the requested form. Writing y=4x+42 (slope-intercept order) is mathematically identical but examiners may penalise if the question stem demands y=a+bx exactly.
(c) B1 — for x=7: y=42+4⋅7=70, so the predicted score is approximately 70%. This is interpolation (7 lies inside the data range 0≤x≤ max observed) and is reliable.
B1 — for x=20: this lies far outside the observed range (∑x=50 across 10 students implies a typical maximum near 10–12 hours). Extrapolating to 20 hours is unreliable: the linear model may not hold, and the predicted y=122 exceeds 100%, which is impossible for a percentage score.
Total: 8 marks (B2 + M2 A2 + B2).
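The arithmetic in parts (b) and (c) can be checked with a few lines of code. The sketch below uses only the summary statistics given in the question; the variable names are chosen for this example.

```python
# Summary statistics from the question.
n = 10
sum_x, sum_y = 50, 620
sum_x2, sum_xy = 310, 3340

# Part (b): means, Sxy, Sxx, gradient and intercept.
x_bar = sum_x / n
y_bar = sum_y / n
sxy = sum_xy - sum_x * sum_y / n
sxx = sum_x2 - sum_x ** 2 / n
b = sxy / sxx
a = y_bar - b * x_bar
print(f"y = {a:g} + {b:g}x")   # prints "y = 42 + 4x"

# Part (c): estimates at x = 7 and x = 20.
score_7 = a + b * 7
score_20 = a + b * 20
print(score_7, score_20)       # prints "70.0 122.0"
```

The second estimate exceeds 100, which is the numerical symptom of the unreliable extrapolation discussed in part (c).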
Question (6 marks): A scientist tests whether tree height h (m) and trunk diameter d (cm) are correlated for a random sample of 25 trees. The PMCC is r=0.412. Test, at the 5% level, whether there is evidence of positive correlation between h and d in the population. The critical value of ρ for n=25 at the 5% one-tailed level is 0.3365.
Mark scheme decomposition by AO:
Total: 6 marks split AO1 = 3, AO2 = 1, AO3 = 2. Hypothesis-test questions on Paper 3 deliberately load AO3 marks onto the conclusion-in-context and assumption-checking steps — these are the marks weakest candidates miss.
Connects to:
- Section O (Data presentation and the large data set): scatter diagrams are the visual prerequisite for any r calculation. The AQA-prescribed large data set is routinely used as the source of bivariate samples in exam questions; familiarity with its variables (and their typical units, ranges and outliers) is assumed without reminder.
- Pure Paper 1 (Logarithmic linearisation): when $y = ax^n$, taking logs gives $\log y = \log a + n \log x$, which is linear in $\log x$. Plotting $\log y$ against $\log x$ and fitting a regression line recovers $n$ as the gradient and $\log a$ as the intercept. Similarly $y = ab^x$ linearises to $\log y = \log a + x \log b$, which is linear in $x$. This is how regression is extended to non-linear bivariate data without leaving the A-Level toolkit.
- Section R (Hypothesis testing): the test for ρ=0 is structurally identical to the binomial / normal tests learned earlier: H0 versus H1, significance level, critical value, conclusion in context. Only the test statistic (PMCC r) and the critical-value table change.
- Section S (Probability distributions): the PMCC critical-value table presupposes that (X,Y) is bivariate normal in the population. For non-normal bivariate data, Spearman's rank correlation $r_s$ is the appropriate alternative; it is non-examinable at A-Level but a natural undergraduate extension.
- Modelling cycle (cross-paper): every regression question is implicitly a modelling question: fit, interpret, predict, criticise. Extrapolation criticism is the dominant AO3 source on Paper 3 regression.
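The logarithmic linearisation described above can be sketched in code. The example below generates data from two assumed models, y = 3x² and y = 5(1.5)ˣ, then recovers the constants by fitting a regression line to the logged data. The models, data and helper name are hypothetical and chosen so that the fit is exact.

```python
from math import log10

def regression_line(xs, ys):
    """Return (intercept, gradient) of the least squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    gradient = sxy / sxx
    return y_bar - gradient * x_bar, gradient

# Power model y = a * x**n: regress log y on log x.
xs = [1, 2, 4, 8, 16]
ys = [3 * x ** 2 for x in xs]                   # hypothetical: a = 3, n = 2
c, m = regression_line([log10(x) for x in xs], [log10(y) for y in ys])
n_est, a_est = m, 10 ** c                       # gradient = n, intercept = log a

# Exponential model y = a * b**x: regress log y on x.
ts = [0, 1, 2, 3, 4]
zs = [5 * 1.5 ** t for t in ts]                 # hypothetical: a = 5, b = 1.5
c2, m2 = regression_line(ts, [log10(z) for z in zs])
a2_est, b_est = 10 ** c2, 10 ** m2              # intercept = log a, gradient = log b

print(round(n_est, 6), round(a_est, 6))
print(round(a2_est, 6), round(b_est, 6))
```

With real data the points will not lie exactly on a line, so the recovered constants are estimates rather than exact values.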
Correlation and regression questions on 7357/3 split AO marks more evenly than pure topics:
| AO | Typical share | Earned by |
|---|---|---|
| AO1 (knowledge / procedure) | 40–50% | Computing xˉ, yˉ, Sxy, Sxx, b, a; stating hypotheses; reading critical-value tables |
| AO2 (reasoning / interpretation) | 25–35% | Interpreting r in context; choosing between one- and two-tailed tests; justifying rejection of H0 |
| AO3 (problem-solving / modelling) | 20–30% | Commenting on extrapolation; criticising the linear model; stating bivariate-normal assumption; commenting on causation |
Examiner-rewarded phrasing: "evidence at the 5% level"; "do not reject H0" (never "accept H0" — a hypothesis test cannot prove the null); "extrapolation is unreliable because x=20 lies outside the observed range [0,12]"; "correlation does not imply causation — a confounding variable may explain the association". Phrases that lose marks: "accept H0" (reserved language — examiners deduct); "this proves there is correlation" (proof versus evidence); "r is high so x causes y" (causal claim from correlational data).
A specific AQA pattern: questions that ask "comment on the suitability of the linear model" expect two points — one about the strength of r (does the data look linear?) and one about the residual pattern or scatter spread (any curvature, fanning, outliers?).
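To see why both points matter, consider data that follow a curve but still give a large r. The sketch below fits a line to the hypothetical data y = x² for x = 0, 1, 2, 3, 4 and lists the residuals (observed y minus predicted y). The data and helper names are invented for illustration.

```python
from math import sqrt

def fit(xs, ys):
    """Return (a, b, r) for the least squares line y = a + b*x and the PMCC."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b, sxy / sqrt(sxx * syy)

def residuals(xs, ys, a, b):
    """Observed y minus the value predicted by y = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical curved data: y = x**2.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

a, b, r = fit(xs, ys)
res = residuals(xs, ys, a, b)
print(f"r = {r:.3f}")
print(res)
```

Here r ≈ 0.959, which looks like strong linear correlation, yet the residuals run positive, negative, negative, negative, positive. That systematic pattern signals curvature, so a linear model is questionable despite the high r.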
Question: A scatter diagram of 30 paired observations gives r=−0.78. Interpret this value in the context of monthly heating bill y (£) and average outdoor temperature x (°C).
Grade C response (~150 words):
r=−0.78 shows a strong negative correlation. As temperature increases, heating bill decreases. The relationship is linear and the points on the scatter diagram would lie close to a straight line with negative gradient.
Examiner commentary: Full marks (3/3). The candidate correctly identifies strength (strong), direction (negative) and links the variables in context. Concise but complete. Many candidates write only "negative correlation" without "strong" — that loses the strength mark. Others omit the contextual link and just describe the abstract relationship between x and y — that loses the context mark.
Grade A response (~210 words):