This lesson covers correlation and regression in depth — the tools used to quantify and model the relationship between two variables. These techniques are central to the A-Level Mathematics statistics specification and are frequently tested using data from the large data set.
Correlation measures the strength and direction of the linear relationship between two variables. It does not imply causation.
The product-moment correlation coefficient (PMCC), denoted $r$, quantifies the linear correlation between variables $x$ and $y$:
$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$$
where:
$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}, \qquad S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n}, \qquad S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n}$$
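These sums can be evaluated step by step on a calculator, or sketched in a few lines of code. The following Python snippet (with illustrative data, not LDS values) is one way to check a hand calculation of $r$:

```python
from math import sqrt

def pmcc(xs, ys):
    """Product-moment correlation coefficient via the summary sums."""
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return sxy / sqrt(sxx * syy)

# Perfectly linear data (y = 1 + 2x) gives r = 1
print(pmcc([1, 2, 3, 4], [3, 5, 7, 9]))
```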
| Value of r | Interpretation |
|---|---|
| r=1 | Perfect positive linear correlation |
| 0.7≤r<1 | Strong positive correlation |
| 0.4≤r<0.7 | Moderate positive correlation |
| 0<r<0.4 | Weak positive correlation |
| r=0 | No linear correlation |
| −1<r<0 | Negative correlation (use same thresholds) |
| r=−1 | Perfect negative linear correlation |
Calculating r for daily mean temperature and daily total sunshine at Heathrow in May might give r=0.72, indicating a strong positive correlation. This is physically reasonable: warmer days tend to be sunnier.
Spearman's rank correlation coefficient, rs, measures the strength and direction of the monotonic relationship between two variables. It is calculated by ranking each variable and then applying the PMCC formula to the ranks, or using the shortcut:
$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$
where d is the difference between the ranks for each pair and n is the number of pairs.
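As a check on a hand calculation, the shortcut can be sketched in Python. The `ranks` helper below is a hypothetical convenience, and the snippet assumes no tied values (ties require averaged ranks, which the shortcut formula does not handle exactly):

```python
def spearman(xs, ys):
    """Spearman's rank correlation via the d^2 shortcut (assumes no ties)."""
    n = len(xs)

    def ranks(vs):
        order = sorted(range(n), key=lambda i: vs[i])
        r = [0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Monotonic but non-linear data (y = x^3) still gives r_s = 1
print(spearman([1, 2, 3, 4, 5], [1, 8, 27, 64, 125]))
```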
| Feature | PMCC (r) | Spearman's rank (rs) |
|---|---|---|
| Data type | Interval/ratio (numerical) | Ordinal (ranked) or numerical |
| Relationship type | Linear | Monotonic |
| Sensitivity to outliers | High | Low |
| Affected by linear coding | No | No |
| When to use | Normally distributed data, linear relationship | Non-normal data, non-linear monotonic relationship |
Regression analysis fits a straight line to the data, which can then be used for prediction.
The least squares regression line of y on x minimises the sum of the squared vertical distances from each data point to the line:
$$y = a + bx$$
where:
$$b = \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\bar{x}$$
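A hand calculation of $a$ and $b$ can be checked with a short sketch like the following (illustrative data):

```python
def regression_line(xs, ys):
    """Least-squares y-on-x line: returns (a, b) with y = a + b*x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

# Data generated from y = 1 + 2x, so the fit recovers a = 1, b = 2
a, b = regression_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)
```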
Example: suppose the regression line for daily total sunshine ($y$, hours) on daily mean temperature ($x$, °C) at Hurn is $y = -4.2 + 0.65x$. The gradient means that each 1 °C increase in daily mean temperature is associated with an extra 0.65 hours of sunshine; the intercept $-4.2$ has no sensible physical meaning (sunshine cannot be negative), a reminder that the model only describes the observed range of the data.
| Term | Definition | Reliability |
|---|---|---|
| Interpolation | Predicting y for an x-value within the range of the data | Generally reliable — the linear model has been validated in this region |
| Extrapolation | Predicting y for an x-value outside the range of the data | Unreliable — the linear relationship may not hold beyond the observed data |
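To make the distinction concrete, here is a sketch using the Hurn line from the example above; the temperatures 14 °C and 2 °C are illustrative choices, one plausibly inside the data range and one outside it:

```python
def sunshine(temp_c):
    """Hurn model from the example above: y = -4.2 + 0.65x."""
    return -4.2 + 0.65 * temp_c

print(sunshine(14))  # about 4.9 hours: a plausible in-range prediction
print(sunshine(2))   # about -2.9 hours: impossible, the model breaks down
```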
The regression line is a model fitted to the observed data. Outside this range, the relationship between the variables may change. For example, the Hurn line $y = -4.2 + 0.65x$ predicts negative sunshine for temperatures below about 6.5 °C, which is physically impossible.
A correlation between two variables does not mean that one causes the other. There may be:

- a third (confounding) variable that influences both
- coincidence, particularly in small samples
- causation acting in the opposite direction to the one assumed
The PMCC and least squares regression assume a linear relationship. If the true relationship is curved, these measures will be misleading. Always examine the scatter diagram before relying on r or the regression line.
A single outlier can dramatically change the value of r and the position of the regression line. Always check for and investigate outliers before drawing conclusions.
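A small illustration of this sensitivity, using made-up data: five nearly collinear points give $r$ close to 1, and adding a single outlier collapses it:

```python
from math import sqrt

def pmcc(xs, ys):
    """Product-moment correlation coefficient via the summary sums."""
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return sxy / sqrt(sxx * syy)

clean = ([1, 2, 3, 4, 5], [2.1, 3.9, 6.0, 8.1, 9.9])
with_outlier = ([1, 2, 3, 4, 5, 6], [2.1, 3.9, 6.0, 8.1, 9.9, 0.5])

r_clean = pmcc(*clean)
r_out = pmcc(*with_outlier)
print(r_clean, r_out)  # r drops sharply once the outlier is included
```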
The correlation and regression results are only valid for the range of data from which they were calculated. Making predictions beyond this range (extrapolation) is unreliable.
If the spread of the data changes across the range of x (heteroscedasticity), the regression model may not be appropriate.
When answering exam questions that involve regression and the large data set:

- quote $r$ and interpret its strength and sign in context
- interpret the gradient with its units, in context
- state whether a prediction uses interpolation or extrapolation, and comment on its reliability
- remember that correlation does not imply causation
Data for daily mean temperature (x, °C) and daily total rainfall (y, mm) at Leeming for 8 days in April:
| x | 8.2 | 9.1 | 10.5 | 7.8 | 11.2 | 12.0 | 9.5 | 10.0 |
|---|---|---|---|---|---|---|---|---|
| y | 5.1 | 3.8 | 2.2 | 6.0 | 1.5 | 0.8 | 3.5 | 2.9 |
$\bar{x} = 9.7875$, $\bar{y} = 3.225$; computing from the table, $S_{xx} = 14.47$, $S_{xy} = -17.57$, $S_{yy} = 21.64$ (2 d.p.)

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{-17.57}{\sqrt{14.47 \times 21.64}} = \frac{-17.57}{17.69} \approx -0.993$$

$$b = \frac{S_{xy}}{S_{xx}} = \frac{-17.57}{14.47} \approx -1.214, \qquad a = \bar{y} - b\bar{x} = 3.225 - (-1.214)(9.7875) \approx 15.11$$

Regression line: $y = 15.11 - 1.21x$

Interpretation: There is a very strong negative correlation ($r \approx -0.99$) between daily mean temperature and daily rainfall at Leeming in April. For each 1 °C increase in temperature, the model predicts that daily rainfall decreases by approximately 1.21 mm. Predicting rainfall for a temperature of 9 °C (interpolation) gives $y = 15.11 - 1.21(9) \approx 4.2$ mm. Predicting for 20 °C (extrapolation) would give a negative rainfall, which is impossible — this shows the danger of extrapolation.
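It is worth recomputing the summary statistics directly from the raw data table before trusting quoted values, since a slip in $S_{xy}$ propagates into $r$, $b$ and $a$. A minimal Python check:

```python
from math import sqrt

# Raw Leeming data from the table above
x = [8.2, 9.1, 10.5, 7.8, 11.2, 12.0, 9.5, 10.0]
y = [5.1, 3.8, 2.2, 6.0, 1.5, 0.8, 3.5, 2.9]
n = len(x)

sxx = sum(v * v for v in x) - sum(x) ** 2 / n
syy = sum(v * v for v in y) - sum(y) ** 2 / n
sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

r = sxy / sqrt(sxx * syy)
b = sxy / sxx
a = sum(y) / n - b * sum(x) / n
print(round(sxx, 2), round(syy, 2), round(sxy, 2), r, a, b)
```

If the recomputed values disagree with quoted summary statistics, trust the raw data.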
Exam Tip: When interpreting the gradient b, always use the context. Do not say "for each unit increase in x, y increases by b." Instead say: "for each 1°C increase in daily mean temperature, the model predicts that daily total sunshine increases by 0.65 hours." This contextual interpretation is where marks are awarded.
AQA 7357 specification, Paper 3 — Statistics, sub-strands O1, O2, O3 and N (Large Data Set context) covers: interpreting diagrams for single-variable data, including understanding that area in a histogram represents frequency; interpreting scatter diagrams and regression lines for bivariate data, including recognising scatter diagrams that contain distinct sections of the population; informal interpretation of correlation; and understanding that correlation does not imply causation (refer to the official specification document for exact wording). Although bivariate analysis is examined in Paper 3, the LDS-flavoured contexts that AQA favours also appear in Paper 1 (modelling) and Paper 2 (mechanics-style data). The AQA formula booklet provides the product-moment correlation coefficient formula and the least-squares regression coefficients; students are expected to interpret values, not derive them.
Question (8 marks):
A student investigates the relationship between mean monthly maximum daily temperature x (°C) and mean monthly daily total rainfall y (mm) for a single LDS-style location across 12 months. Summary statistics are: ∑x=168, ∑y=720, ∑x2=2616, ∑y2=47400, ∑xy=9648, n=12.
(a) Calculate the product-moment correlation coefficient r, giving your answer to 3 s.f. (3)
(b) Find the equation of the regression line of y on x in the form y=a+bx. (3)
(c) The student uses the line to predict the rainfall in a month with mean maximum temperature 30°C. Comment on the reliability of this prediction. (2)
Solution with mark scheme:
(a) Step 1 — compute the sums of squares and products.
$$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 2616 - \frac{168^2}{12} = 2616 - 2352 = 264$$

$$S_{yy} = \sum y^2 - \frac{(\sum y)^2}{n} = 47400 - \frac{720^2}{12} = 47400 - 43200 = 4200$$

$$S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 9648 - \frac{168 \times 720}{12} = 9648 - 10080 = -432$$
M1 — correct method for at least one of Sxx, Syy, Sxy. The minus sign in Sxy is the substantive content here; candidates who write Sxy=432 lose the eventual sign of r and b.
Step 2 — apply the PMCC formula.
$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{-432}{\sqrt{264 \times 4200}} = \frac{-432}{\sqrt{1108800}} = \frac{-432}{1052.99...}$$
M1 — substituting into the PMCC formula correctly.
A1 — r=−0.410 (3 s.f.). The negative value indicates weak-to-moderate negative linear association; warmer months in this LDS sample are associated with lower rainfall.
(b) Step 1 — compute the gradient.
$$b = \frac{S_{xy}}{S_{xx}} = \frac{-432}{264} = -1.6363...$$
M1 — correct formula for b.
Step 2 — compute the intercept using $\bar{y} = a + b\bar{x}$.

$$\bar{x} = 168/12 = 14, \qquad \bar{y} = 720/12 = 60$$

$$a = \bar{y} - b\bar{x} = 60 - (-1.6363...)(14) = 60 + 22.909... = 82.9$$

M1 — using $a = \bar{y} - b\bar{x}$.
A1 — final equation y=82.9−1.64x (3 s.f.).
(c) B1 — 30°C lies outside the range of the sample data. This is extrapolation, and the linear relationship is not guaranteed to hold beyond the observed range.
B1 — additionally, $|r| = 0.410$ is not strong, so the linear model itself accounts for only $r^2 \approx 16.8\%$ of the variation in $y$. Even within the data range, the prediction is unreliable; outside it, doubly so.
Total: 8 marks (M2 A1, M2 A1, B2).
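The arithmetic in parts (a) and (b) can be verified from the summary sums with a short script (a calculator check, not part of the required working):

```python
from math import sqrt

# Summary statistics from the question
n, sum_x, sum_y = 12, 168, 720
sum_x2, sum_y2, sum_xy = 2616, 47400, 9648

sxx = sum_x2 - sum_x ** 2 / n        # 264
syy = sum_y2 - sum_y ** 2 / n        # 4200
sxy = sum_xy - sum_x * sum_y / n     # -432

r = sxy / sqrt(sxx * syy)            # -0.410 to 3 s.f.
b = sxy / sxx                        # -1.636...
a = sum_y / n - b * sum_x / n        # 82.9 to 3 s.f.
print(r, a, b)
```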
Question (6 marks): A scatter diagram of LDS-style daily mean windspeed w (kn) against daily mean cloud cover c (oktas) for a single station shows positive correlation. The regression line of w on c is w=4.2+0.85c.
(a) Interpret the value 0.85 in this context. (2)
(b) Explain why it would be inappropriate to use this line to predict cloud cover from a measured windspeed. (2)
(c) Comment on whether this evidence supports the claim that increased cloud cover causes higher windspeeds. (2)
Mark scheme decomposition by AO:
(a) B2 (AO2) — "for each additional okta of cloud cover, the model predicts an increase of 0.85 kn in daily mean windspeed": one mark for the per-unit interpretation, one for stating it in context with units.

(b) B1 (AO2) + B1 (AO3) — the $w$-on-$c$ line minimises residuals in the $w$ direction, so it is fitted for predicting $w$ from $c$; predicting $c$ from $w$ requires the regression line of $c$ on $w$, which is in general a different line.

(c) B2 (AO3) — correlation does not imply causation; both variables may be driven by a third factor (for example, the passage of weather systems), so the data alone cannot support the causal claim.
Total: 6 marks split AO1 = 0, AO2 = 3, AO3 = 3. This is unusually AO2/AO3-heavy — bivariate questions on Paper 3 are designed to test interpretation, not arithmetic. Interpretation marks dominate.
Connects to:
Paper 3 — Data presentation (scatter diagrams, outliers): the PMCC is calculated from sums-of-products, but its interpretation relies on visual inspection of the scatter. A single outlier can drag r from 0.9 to 0.3 or vice versa. AQA explicitly tests recognition of "scatter diagrams which include distinct sections of the population" — bimodal scatter where two sub-populations exist invalidates a single regression line.
Paper 1 — Logarithmic linearisation: non-linear relationships of the form y=axn or y=abx become linear when log-transformed. Plotting logy against logx gives a line with gradient n and intercept loga for the power model; plotting logy against x gives gradient logb for the exponential model. Regression and PMCC then apply to the transformed data. Synoptic LDS questions often exploit this — temperature-versus-time data is genuinely linear, but rainfall-versus-elevation is closer to exponential and requires transformation before r is meaningful.
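A sketch of the exponential case, using synthetic data (the values are illustrative, not from the LDS): regressing $\ln y$ on $x$ recovers the model parameters exactly when the data is exactly exponential:

```python
from math import log, exp

# Synthetic exponential data y = 3 * 1.5**x (illustrative values)
xs = [0, 1, 2, 3, 4, 5]
ys = [3 * 1.5 ** x for x in xs]

# Regress ln(y) on x: gradient = ln(b), intercept = ln(a)
ls = [log(y) for y in ys]
n = len(xs)
sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
sxy = sum(x * l for x, l in zip(xs, ls)) - sum(xs) * sum(ls) / n
grad = sxy / sxx
intercept = sum(ls) / n - grad * sum(xs) / n

# Undo the log transform to recover the original parameters
b_model, a_model = exp(grad), exp(intercept)
print(a_model, b_model)  # recovers a = 3, b = 1.5 for exact data
```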
Paper 3 — Hypothesis test for ρ: the PMCC $r$ is a sample statistic estimating the population correlation $\rho$. AQA tests $H_0\colon \rho = 0$ against $H_1\colon \rho \neq 0$ (or a one-tailed alternative) using critical values from the formula booklet at the 5% / 1% significance levels for given $n$. A non-zero $r$ in a small sample may not be statistically significant; a small $r$ in a large sample may be highly significant.
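The booklet's critical-value tables are the method AQA expects; an equivalent check (stated here as an aside, not the booklet method) converts $r$ to a $t$-statistic with $n-2$ degrees of freedom:

```python
from math import sqrt

def t_statistic(r, n):
    """t = r * sqrt((n-2)/(1-r^2)); compare with t_{n-2} critical values."""
    return r * sqrt((n - 2) / (1 - r * r))

# Values from the worked exam question above: r = -0.410, n = 12
t = t_statistic(-0.410, 12)
print(t)  # about -1.42: smaller in magnitude than the two-tailed 5%
          # critical value for 10 degrees of freedom (~2.23), so not
          # significant, consistent with the "not strong" comment above
```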
Paper 1 — Modelling cycle: regression is the canonical example of "fit, predict, criticise". The AQA modelling cycle requires students to (i) propose a model, (ii) collect data, (iii) fit the model, (iv) make predictions, (v) evaluate fit, (vi) refine. Linear regression slots into stages (iii)-(v); recognising when to abandon the linear assumption is the (vi) refinement step that A* candidates handle explicitly.
Paper 3 — Probability and conditional probability: if cloud cover and windspeed are both treated as random variables, the regression coefficient b=Sxy/Sxx is the empirical analogue of Cov(X,Y)/Var(X), which appears in conditional-expectation calculations. A* candidates connect the descriptive bivariate machinery to the underlying probability structure.
Correlation and regression questions on 7357 split AO marks heavily toward AO2 and AO3.