This lesson covers the calculation and interpretation of summary statistics — measures of location and spread — in the context of real data from the AQA large data set. Being able to calculate these measures is necessary, but at A-Level standard the emphasis is equally on interpreting them in context and comparing distributions meaningfully.
Measures of location describe the central tendency of a data set — where the "middle" of the data lies.
The mean is the sum of all values divided by the number of values:
$$\bar{x} = \frac{\sum x}{n}$$
For grouped data (frequency table):
$$\bar{x} = \frac{\sum fx}{\sum f}$$
where f is the frequency and x is the midpoint of each class.
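As a minimal sketch, both forms of the mean translate directly into Python. The observations, class midpoints and frequencies below are hypothetical values chosen for illustration; they are not taken from the large data set.

```python
# Hypothetical raw observations (illustrative only).
values = [3, 7, 8, 12]
mean_raw = sum(values) / len(values)  # sum of x divided by n

# Hypothetical grouped data: class midpoints x and frequencies f.
midpoints = [2.5, 7.5, 12.5, 17.5]
freqs = [4, 10, 12, 4]

sum_fx = sum(f * x for f, x in zip(freqs, midpoints))  # sum of f*x
mean_grouped = sum_fx / sum(freqs)                     # divided by sum of f

print(f"mean of raw data:     {mean_raw:.2f}")
print(f"mean of grouped data: {mean_grouped:.2f}")
```

The grouped result is only an estimate, because every value in a class is treated as if it sat at the class midpoint.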
Properties of the mean:
The median is the middle value when the data is arranged in order.
Properties of the median:
The mode is the most frequently occurring value (or the modal class for grouped data).
| Situation | Recommended average |
|---|---|
| Symmetrical distribution, no outliers | Mean |
| Skewed distribution or outliers present | Median |
| Categorical data | Mode |
| Grouped data with open-ended classes | Median (mean cannot be accurately calculated) |
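To illustrate why the median is recommended when outliers are present, the sketch below adds one extreme value to a small data set and compares how far each average moves. The numbers are hypothetical rainfall totals invented for the example.

```python
import statistics

# Hypothetical daily rainfall totals in mm (illustrative only).
typical = [2, 3, 3, 4, 5]
with_outlier = typical + [40]  # one unusually wet day added

mean_before = statistics.mean(typical)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(typical)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```

The single extreme value shifts the mean far more than the median, which is the behaviour the table relies on.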
Measures of spread describe how dispersed or variable the data is around the centre.
$$\text{Range} = \text{Maximum} - \text{Minimum}$$
$$\text{IQR} = Q_3 - Q_1$$
where $Q_1$ is the lower quartile (25th percentile) and $Q_3$ is the upper quartile (75th percentile).
For n data values arranged in order:
For grouped data, quartiles can be estimated from a cumulative frequency diagram or by interpolation.
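Interpolation for grouped data can be sketched as follows. This assumes the common convention of locating the quartiles at positions $n/4$, $n/2$ and $3n/4$ in the cumulative frequency and interpolating linearly within the class that contains each position. The class boundaries and frequencies are hypothetical, and `grouped_quantile` is an illustrative helper name, not a library function.

```python
def grouped_quantile(boundaries, freqs, p):
    """Estimate the p-quantile of grouped data by linear interpolation.

    boundaries: list of (lower, upper) class boundaries, in order
    freqs:      matching list of class frequencies
    p:          proportion between 0 and 1 (0.25 for the lower quartile)
    """
    target = p * sum(freqs)  # position in the cumulative frequency
    cumulative = 0
    for (lower, upper), f in zip(boundaries, freqs):
        if f > 0 and cumulative + f >= target:
            # Fraction of the way through this class, scaled by class width.
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("p must lie between 0 and 1")


# Hypothetical grouped data (illustrative only).
boundaries = [(0, 5), (5, 10), (10, 15), (15, 20)]
freqs = [4, 10, 12, 4]

q1 = grouped_quantile(boundaries, freqs, 0.25)
q3 = grouped_quantile(boundaries, freqs, 0.75)
print(f"Q1 = {q1:.2f}, Q3 = {q3:.2f}, IQR = {q3 - q1:.2f}")
```

Reading the same positions off a cumulative frequency diagram drawn with straight line segments gives the same estimates.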
The variance measures the average squared deviation from the mean:
$$\text{Var}(X) = \frac{\sum x^2}{n} - \bar{x}^2 = \frac{\sum (x - \bar{x})^2}{n}$$
The standard deviation is the square root of the variance:
$$\sigma = \sqrt{\text{Var}(X)}$$
For grouped data:
$$\text{Var}(X) = \frac{\sum fx^2}{\sum f} - \left(\frac{\sum fx}{\sum f}\right)^2$$
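The grouped formula can be checked against the definition of variance as an average squared deviation. The sketch below does this for hypothetical class midpoints and frequencies (illustrative values, not large data set figures).

```python
import math

# Hypothetical grouped data: class midpoints x and frequencies f.
midpoints = [2.5, 7.5, 12.5, 17.5]
freqs = [4, 10, 12, 4]

n = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n

# Shortcut form: (sum of f*x^2) / (sum of f) minus the mean squared.
var_shortcut = sum(f * x**2 for f, x in zip(freqs, midpoints)) / n - mean**2

# Definition: frequency-weighted average of squared deviations from the mean.
var_definition = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints)) / n

sd = math.sqrt(var_shortcut)
print(f"variance (shortcut):   {var_shortcut:.4f}")
print(f"variance (definition): {var_definition:.4f}")
print(f"standard deviation:    {sd:.4f}")
```

Both routes give the same variance up to floating-point rounding; the shortcut form is usually quicker by hand because it avoids subtracting the mean from every value.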
Properties:
Suppose the daily mean temperatures (°C) for a UK station over 10 days in July are:
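15.2, 16.1, 17.8, 18.3, 16.5, 19.2, 17.0, 15.8, 20.1, 16.9
$$\bar{x} = \frac{15.2 + 16.1 + 17.8 + 18.3 + 16.5 + 19.2 + 17.0 + 15.8 + 20.1 + 16.9}{10} = \frac{172.9}{10} = 17.29\ ^\circ\text{C}$$
Arrange in order: 15.2, 15.8, 16.1, 16.5, 16.9, 17.0, 17.8, 18.3, 19.2, 20.1
$$\text{Median} = \frac{16.9 + 17.0}{2} = 16.95\ ^\circ\text{C}$$
$$\text{Range} = 20.1 - 15.2 = 4.9\ ^\circ\text{C}$$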
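$Q_1 = 16.1$, $Q_3 = 18.3$ (using the $\frac{n+1}{4}$ method: positions 2.75 and 8.25, rounded here to the 3rd and 8th ordered values)
$$\text{IQR} = 18.3 - 16.1 = 2.2\ ^\circ\text{C}$$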
$$\sum x^2 = 15.2^2 + 16.1^2 + \cdots + 16.9^2 = 3011.13$$
$$\text{Var}(X) = \frac{3011.13}{10} - 17.29^2 = 301.113 - 298.944 = 2.169$$
$$\sigma = \sqrt{2.169} \approx 1.47\ ^\circ\text{C}$$
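A hand calculation like this is easy to check by machine. The sketch below recomputes each statistic from the ten temperatures using Python's standard `statistics` module, and also evaluates the shortcut variance formula directly so the intermediate sum of squares can be compared with the written working.

```python
import statistics

# The ten daily mean temperatures (°C) from the worked example.
temps = [15.2, 16.1, 17.8, 18.3, 16.5, 19.2, 17.0, 15.8, 20.1, 16.9]
n = len(temps)

mean = sum(temps) / n
median = statistics.median(temps)
data_range = max(temps) - min(temps)

# Shortcut formula: (sum of x^2) / n minus the mean squared.
sum_sq = sum(x * x for x in temps)
variance = sum_sq / n - mean**2
sd = variance**0.5

print(f"mean     = {mean:.2f}")
print(f"median   = {median:.2f}")
print(f"range    = {data_range:.1f}")
print(f"sum x^2  = {sum_sq:.2f}")
print(f"variance = {variance:.3f}")
print(f"sd       = {sd:.2f}")
```

Here `statistics.pstdev(temps)` gives the same standard deviation, since it also divides by $n$ rather than $n - 1$.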
Calculating summary statistics is only half the task. At A-Level, you must also interpret them in the context of the data. Here are examples of what good contextual interpretation looks like:
"The mean daily temperature at Heathrow in July was 19.8°C, compared to 14.2°C at Leuchars. This is expected because Heathrow is in south-east England at a lower latitude than Leuchars in Scotland, and also benefits from the urban heat island effect."
"The standard deviation of daily rainfall at Camborne was 8.3 mm, compared to 5.1 mm at Hurn. This indicates that rainfall at Camborne is more variable from day to day, which is consistent with its exposed south-westerly coastal location where weather systems from the Atlantic arrive with varying intensity."
"The mean daily sunshine hours at Leuchars in December (1.2 hours) was greater than the median (0.8 hours), suggesting a slight positive skew. This makes sense because most December days in Scotland have very little sunshine, but occasional clear days can produce several hours, pulling the mean upwards."
When comparing two data sets, always comment on:
Linear coding simplifies calculations without affecting the underlying relationships:
If $y = \frac{x - a}{b}$, then:
$$\bar{x} = b\bar{y} + a, \qquad \sigma_x = b\sigma_y$$
Example: To simplify calculations with temperatures around 17°C, let y=x−17. Calculate summary statistics for y, then convert back.
This technique is frequently tested in exam questions and is useful when working with the large data set.
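The coding relationships can be verified numerically. The sketch below applies the coding $y = \frac{x - a}{b}$ with $a = 17$ and $b = 1$ (the example suggested above) to the ten July temperatures from the worked example, then decodes the mean and standard deviation and compares them with the values computed directly. It assumes $b > 0$.

```python
import statistics

# The ten daily mean temperatures (°C) from the worked example.
temps = [15.2, 16.1, 17.8, 18.3, 16.5, 19.2, 17.0, 15.8, 20.1, 16.9]

a, b = 17, 1  # coding y = (x - a) / b, assuming b > 0

coded = [(x - a) / b for x in temps]
mean_y = statistics.mean(coded)
sd_y = statistics.pstdev(coded)  # population form, dividing by n

# Decode: the mean is scaled and shifted, the spread is only scaled.
mean_x = b * mean_y + a
sd_x = b * sd_y

print(f"coded:   mean = {mean_y:.2f}, sd = {sd_y:.2f}")
print(f"decoded: mean = {mean_x:.2f}, sd = {sd_x:.2f}")
```

Subtracting a constant shifts the mean but leaves the standard deviation unchanged, which is why only $b$ appears in the formula for $\sigma_x$.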
Exam Tip: When asked to compare two distributions, structure your answer as: (1) compare a measure of location, (2) compare a measure of spread, (3) relate both to the context. For example: "The median daily rainfall at Camborne (4.2 mm) is higher than at Hurn (2.8 mm), and the IQR is also larger (6.1 mm vs 3.9 mm), suggesting that Camborne is both wetter and more variable in its rainfall, consistent with its exposed Atlantic coast location."
AQA 7357 specification, Paper 3: Statistics, sub-strands O (statistical sampling) and P (data presentation and interpretation) covers interpret diagrams for single-variable data, including measures of central tendency and variation … select or critique data presentation techniques in the context of a statistical problem (refer to the official specification document for exact wording). The AQA Large Data Set (LDS), published with the specification and used across all three years of the cohort, provides the context in which every statistic must be interpreted. Mean ($\bar{x}$), median, mode, range, interquartile range (IQR), variance ($\sigma^2$ or $s^2$) and standard deviation ($\sigma$ or $s$) are not assessed as bare formula recall; they are assessed as tools for comparing groups within the LDS and answering questions about it. The AQA formula booklet provides $\sigma^2 = \frac{\sum (x - \bar{x})^2}{n} = \frac{\sum x^2}{n} - \bar{x}^2$, so the focus of marks lies firmly on selection, computation discipline and interpretation.
Question (8 marks): A student investigates daily mean temperature (°C) at two LDS weather stations during a single month. Their summary statistics are:
| Station | $n$ | $\bar{x}$ | $s$ | Median | IQR |
|---|---|---|---|---|---|
| Coastal | 30 | 14.2 | 1.8 | 14.0 | 2.4 |
| Inland | 30 | 13.6 | 3.5 | 13.5 | 4.6 |
Compare the daily mean temperatures at the two stations, in context, using both a measure of centre and a measure of spread. (8)
Solution with mark scheme:
Step 1 — choose appropriate measures and justify.
Both medians are reported and both means are reported. The medians are very close to the means at each station, suggesting the distributions are roughly symmetric, so either centre is defensible. The spreads, however, differ markedly: standard deviation s summarises spread relative to the mean, while IQR captures the middle 50%. Reporting both is unnecessary — the question asks for one of each — but choosing the IQR is safer when subsequent questions might involve outliers.
B1 — explicit choice of measures with reason (centre + spread).
Step 2 — compare centres in context.
The mean daily temperature is higher at the Coastal station ($\bar{x} = 14.2$ °C) than at the Inland station ($\bar{x} = 13.6$ °C), a difference of 0.6 °C. The medians give the same direction (14.0 vs 13.5).
B1 — correct numerical comparison of centre. B1 — direction stated with units and in the context of temperature at the two stations (not "the first set is bigger").
Step 3 — compare spreads in context.
The standard deviation at the Inland station (s=3.5) is roughly twice that at the Coastal station (s=1.8). The IQR confirms this (4.6 vs 2.4). Daily mean temperatures inland are therefore much more variable than on the coast.
B1 — correct numerical comparison of spread. B1 — direction stated with units and contextual phrasing ("inland temperatures are more variable").
Step 4 — interpret in context (the AO3 marks).
A plausible explanation is that the sea moderates coastal temperatures: water has a very high specific heat capacity, so coastal air-mass temperatures change less from day to day. Inland sites, lacking this thermal buffer, show wider day-to-day swings. The mean is only slightly higher at the coast, but the consistency (small s) is the dominant feature.
B1 — interpretation linking the difference in spread to a plausible physical reason. B1 — interpretation noting that spread, not centre, is the dominant difference.
Step 5 — caveat.
This is a single month for n=30 days at each station; conclusions about long-term climate would require multiple months and ideally multiple years. A single LDS sample is not the population.
B1 — sample-vs-population caveat or equivalent limitation statement.
Total: 8 marks (B1 × 8, a typical AQA "compare in context" mark allocation).
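The numerical comparisons in Steps 2 and 3 can be reproduced from the summary table alone. The sketch below is one possible way to organise that check; the dictionary layout and variable names are illustrative choices.

```python
# Summary statistics from the question's table (temperatures in °C).
stations = {
    "Coastal": {"n": 30, "mean": 14.2, "sd": 1.8, "median": 14.0, "iqr": 2.4},
    "Inland": {"n": 30, "mean": 13.6, "sd": 3.5, "median": 13.5, "iqr": 4.6},
}

coastal, inland = stations["Coastal"], stations["Inland"]

# Centre: how much warmer is the coast on average?
mean_diff = coastal["mean"] - inland["mean"]
median_diff = coastal["median"] - inland["median"]

# Spread: how many times more variable is the inland station?
sd_ratio = inland["sd"] / coastal["sd"]
iqr_ratio = inland["iqr"] / coastal["iqr"]

print(f"difference in means:   {mean_diff:.1f} °C")
print(f"difference in medians: {median_diff:.1f} °C")
print(f"ratio of sds (inland / coastal):  {sd_ratio:.2f}")
print(f"ratio of IQRs (inland / coastal): {iqr_ratio:.2f}")
```

Both spread ratios come out a little under 2, which supports the "roughly twice" wording in Step 3. The numbers still need the contextual interpretation from Steps 4 and 5 to earn full marks.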
Question (6 marks): Using LDS data on rainfall (mm) for two months, a student computes $\bar{x}_A = 2.1$, $s_A = 1.4$ for month A ($n = 31$) and $\bar{x}_B = 3.8$, $s_B = 4.2$ for month B ($n = 30$).
(a) Explain why the median may be a more appropriate measure of centre than the mean for rainfall data. (2)
(b) Compare the two months' rainfall in context, using both centre and spread. (4)
Mark scheme decomposition by AO:
(a)
(b)
Total: 6 marks split AO1 = 1, AO2 = 2, AO3 = 3. This is an AO3-leaning question — AQA uses LDS comparison questions to push candidates beyond computation into reasoning and limitation-awareness.
Connects to:
Sub-strand O — Statistical sampling: every LDS computation is a sample statistic. A 30-day sample has a sample mean $\bar{x}$, not a population mean $\mu$. The distinction governs how confidently you can generalise. Random vs systematic vs stratified sampling determines whether your sample is representative, a question AQA returns to repeatedly.
Sub-strand P — Data presentation: boxplots visualise median + IQR; histograms suggest the shape that determines whether mean or median is appropriate. A right-skewed histogram immediately tells you to prefer the median. The summary statistics and the visualisations are two faces of the same data.
Sub-strand R — Correlation and regression: before computing $r$ or fitting $y = a + bx$, you summarise each variable's centre and spread. A regression coefficient $b$ is a ratio of covariance to $\sigma_x^2$: variance is the engine of regression. Misunderstanding spread propagates into misunderstanding regression.
Sub-strand S — Probability: the empirical mean from an LDS sample is an estimator of $E[X]$ for the underlying distribution; $s^2$ estimates $\text{Var}(X)$. The Central Limit Theorem (preview content) tells us that the sampling distribution of $\bar{x}$ has mean $\mu$ and variance $\sigma^2/n$, so larger samples give more reliable centre estimates.
Sub-strand T — Statistical hypothesis testing: AQA's hypothesis tests on the population mean (Year 2) use $\bar{x}$ as the test statistic and $s/\sqrt{n}$ as the standard error. Without confident handling of $\bar{x}$ and $s$ from the LDS, hypothesis testing is impossible.
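The $\sigma^2/n$ result behind the standard error can be confirmed exactly on a very small example. The sketch below assumes a hypothetical population of six equally likely values (a fair die, not LDS data), lists every possible sample of size 2 drawn with replacement, and compares the variance of the sample means with $\sigma^2/n$. Fractions are used so the comparison is exact.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population: the six faces of a fair die.
population = [1, 2, 3, 4, 5, 6]
n = 2  # sample size


def pop_variance(values):
    """Population variance (dividing by the number of values), exactly."""
    values = [Fraction(v) for v in values]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


sigma_sq = pop_variance(population)

# Every equally likely sample of size n, drawn with replacement.
sample_means = [Fraction(sum(s), n) for s in product(population, repeat=n)]
mean_of_means = sum(sample_means) / len(sample_means)
var_of_means = pop_variance(sample_means)

print(f"population variance:      {sigma_sq}")
print(f"mean of sample means:     {mean_of_means}")
print(f"variance of sample means: {var_of_means}")
print(f"sigma^2 / n:              {sigma_sq / n}")
```

The variance of the sample means equals $\sigma^2/n$ exactly, so its square root is $\sigma/\sqrt{n}$. Replacing $\sigma$ with the sample value $s$ gives the standard error used in the hypothesis tests.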
LDS comparison questions on 7357 split AO marks across AO1, AO2 and AO3 — they are not AO1-dominated: