Summary Statistics in Context

This lesson covers the calculation and interpretation of summary statistics — measures of location and spread — in the context of real data from the AQA large data set. Being able to calculate these measures is necessary, but at A-Level standard the emphasis is equally on interpreting them in context and comparing distributions meaningfully.

Measures of Location (Averages)

Measures of location describe the central tendency of a data set — where the "middle" of the data lies.

Mean (Arithmetic Mean)

The mean is the sum of all values divided by the number of values:

$\bar{x} = \frac{\sum x}{n}$

For grouped data (frequency table):

$\bar{x} = \frac{\sum f x}{\sum f}$

where $f$ is the frequency and $x$ is the midpoint of each class.

Properties of the mean:

Uses every data value, so is affected by outliers.
If the data is skewed, the mean is pulled towards the tail.
It is the most commonly used average and has useful mathematical properties (e.g., $\sum(x - \bar{x}) = 0$ ).
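
As a minimal sketch in Python, the two mean formulas above and the zero-sum property of the deviations can be checked numerically. The temperature list and the frequency table are made-up illustrative values, not figures from the large data set.

```python
# Hypothetical raw data: daily mean temperatures in °C (illustrative only).
temps = [12.0, 14.0, 15.0, 15.0, 19.0]

# Mean of raw data: sum of the values divided by the number of values.
mean_raw = sum(temps) / len(temps)

# Hypothetical grouped data: (class lower bound, class upper bound, frequency).
classes = [(10, 14, 3), (14, 18, 5), (18, 22, 2)]

# Estimated mean of grouped data: sum of f*x over sum of f,
# where x is the midpoint of each class.
sum_f = sum(f for _, _, f in classes)
sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
mean_grouped = sum_fx / sum_f

# Deviations from the mean add up to zero (up to floating-point rounding).
deviation_sum = sum(x - mean_raw for x in temps)

print(mean_raw, mean_grouped, deviation_sum)
```

The grouped result is only an estimate, because each value is replaced by its class midpoint.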

Median

The median is the middle value when the data is arranged in order.

For $n$ values, the median is the $\left(\frac{n+1}{2}\right)$ th value.
For grouped data, the median can be estimated from a cumulative frequency diagram by reading across from $\frac{n}{2}$ on the cumulative frequency axis.

Properties of the median:

Not affected by outliers or extreme values.
Better than the mean for skewed distributions.
Does not use every data value.
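
The position rule and the robustness of the median can be illustrated with a short Python sketch. The rainfall values are hypothetical, with one deliberately extreme day, and `median_by_position` is an illustrative helper name rather than a standard function.

```python
import statistics

def median_by_position(values):
    """Median as the (n+1)/2-th ordered value.

    When n is even, (n+1)/2 falls halfway between two positions,
    so the two middle values are averaged.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical daily rainfall totals in mm; 40.0 is a deliberate outlier.
rain = [0.0, 0.2, 1.0, 1.4, 40.0]
rain_without_outlier = rain[:-1]

# Compare how far each average moves when the outlier is removed.
print("with outlier:   ", statistics.mean(rain), median_by_position(rain))
print("without outlier:", statistics.mean(rain_without_outlier),
      median_by_position(rain_without_outlier))
```

In this made-up example the mean changes by several millimetres when the extreme day is dropped, while the median moves only slightly.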

Mode

The mode is the most frequently occurring value (or the modal class for grouped data).

A data set can have no mode, one mode, or multiple modes (bimodal, multimodal).
The mode is the only average that can be used for categorical data.
For weather data, the mode is often less informative than the mean or median, but modal class can be useful for grouped data.
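
A small Python sketch shows the three cases (one mode, several modes, no mode), including categorical data. The data lists are invented for illustration, and `modes` is a hypothetical helper that treats a data set with no repeated value as having no mode.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means no mode.

    Assumes `values` is non-empty. A data set in which no value
    repeats is treated here as having no mode.
    """
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return [value for value, count in counts.items() if count == top]

# Hypothetical categorical data: daily wind direction.
wind = ["SW", "W", "SW", "N", "W", "SW"]

# Hypothetical numerical data with two equally common values (bimodal).
cloud_cover = [3, 5, 5, 7, 7, 8]

# Hypothetical continuous readings with no repeated value.
pressure = [1012.4, 1009.8, 1015.1]

print(modes(wind), modes(cloud_cover), modes(pressure))
```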

Choosing the Best Average

Situation	Recommended average
Symmetrical distribution, no outliers	Mean
Skewed distribution or outliers present	Median
Categorical data	Mode
Grouped data with open-ended classes	Median (mean cannot be accurately calculated)

Measures of Spread

Measures of spread describe how dispersed or variable the data is around the centre.

Range

$\text{Range} = \text{Maximum} - \text{Minimum}$

Simple to calculate but heavily influenced by outliers.
Only uses two values from the data set.

Interquartile Range (IQR)

$\text{IQR} = Q_3 - Q_1$

where $Q_1$ is the lower quartile (25th percentile) and $Q_3$ is the upper quartile (75th percentile).

Measures the spread of the middle 50% of the data.
Not affected by outliers.
Used alongside the median for skewed data.

Finding Quartiles

For $n$ data values arranged in order:

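$Q_1$ is at position $\frac{n}{4}$
$Q_2$ (median) is at position $\frac{n}{2}$
$Q_3$ is at position $\frac{3n}{4}$

If the position is a whole number, take the midpoint of that value and the next one; if not, round up to the next whole position. For the median this agrees with the $\left(\frac{n+1}{2}\right)$ th-value rule above. Other quartile conventions exist and can give slightly different answers, so state the method you use.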

For grouped data, quartiles can be estimated from a cumulative frequency diagram or by interpolation.
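
The sketch below implements one common A-Level convention for ungrouped data, assumed here: work out the position $\frac{kn}{4}$; if it is a whole number, average that value and the next one, otherwise round the position up. The function name `quartile` and the data are illustrative, and statistical software often uses different interpolation rules.

```python
import math

def quartile(values, k):
    """k-th quartile (k = 1, 2 or 3) using the k*n/4 position rule.

    Whole-number position: midpoint of that value and the next.
    Otherwise: round the position up to the next whole number.
    """
    ordered = sorted(values)
    n = len(ordered)
    if (k * n) % 4 == 0:
        pos = k * n // 4                 # 1-based position
        return (ordered[pos - 1] + ordered[pos]) / 2
    pos = math.ceil(k * n / 4)           # round up, 1-based position
    return ordered[pos - 1]

# Hypothetical data set with n = 8, so every quartile position is whole.
data = [2, 4, 5, 7, 8, 10, 11, 15]

q1 = quartile(data, 1)
q2 = quartile(data, 2)
q3 = quartile(data, 3)
iqr = q3 - q1

print(q1, q2, q3, iqr)
```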

Variance and Standard Deviation

The variance measures the average squared deviation from the mean:

$\text{Var}(X) = \frac{\sum x^2}{n} - \bar{x}^2 = \frac{\sum(x - \bar{x})^2}{n}$

The standard deviation is the square root of the variance:

$\sigma = \sqrt{\text{Var}(X)}$

For grouped data:

$\text{Var}(X) = \frac{\sum f x^2}{\sum f} - \left(\frac{\sum f x}{\sum f}\right)^2$

Properties of the standard deviation:

Uses every data value.
Measured in the same units as the data (unlike variance, which is in squared units).
Affected by outliers (since deviations are squared, extreme values have a disproportionate effect).
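
As a rough check of the grouped-data formula, the Python sketch below computes the variance both from $\sum f x^2$ and directly from the squared deviations. The midpoints and frequencies are made-up illustrative values.

```python
import math

# Hypothetical grouped data: (class midpoint x, frequency f).
table = [(12, 3), (16, 5), (20, 2)]

sum_f = sum(f for _, f in table)
sum_fx = sum(f * x for x, f in table)
sum_fx2 = sum(f * x ** 2 for x, f in table)

# Shortcut form: mean of the squares minus the square of the mean.
mean = sum_fx / sum_f
variance = sum_fx2 / sum_f - mean ** 2
sd = math.sqrt(variance)

# Definition form: average squared deviation from the mean.
variance_from_deviations = sum(f * (x - mean) ** 2 for x, f in table) / sum_f

print(mean, variance, sd, variance_from_deviations)
```

Both forms divide by $\sum f$, matching the formulas above; they agree up to floating-point rounding.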

Calculating Summary Statistics: Worked Example

Suppose the daily mean temperatures (°C) for a UK station over 10 days in July are:

$15.2,\; 16.1,\; 17.8,\; 18.3,\; 16.5,\; 19.2,\; 17.0,\; 15.8,\; 20.1,\; 16.9$

Mean

$\bar{x} = \frac{15.2 + 16.1 + 17.8 + 18.3 + 16.5 + 19.2 + 17.0 + 15.8 + 20.1 + 16.9}{10} = \frac{172.9}{10} = 17.29\,°C$

Median

Arrange in order: $15.2,\; 15.8,\; 16.1,\; 16.5,\; 16.9,\; 17.0,\; 17.8,\; 18.3,\; 19.2,\; 20.1$

Median = $\frac{16.9 + 17.0}{2} = 16.95\,°C$

Range

$\text{Range} = 20.1 - 15.2 = 4.9\,°C$

IQR

$Q_1 = 16.1$ , $Q_3 = 18.3$ (the positions $\frac{n}{4} = 2.5$ and $\frac{3n}{4} = 7.5$ are not whole numbers, so round up to the 3rd and 8th ordered values)

$\text{IQR} = 18.3 - 16.1 = 2.2\,°C$

Standard Deviation

$\sum x^2 = 15.2^2 + 16.1^2 + \cdots + 16.9^2 = 3011.13$

$\text{Var}(X) = \frac{3011.13}{10} - 17.29^2 = 301.113 - 298.944 = 2.169$

$\sigma = \sqrt{2.169} \approx 1.47\,°C$
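
A hand calculation like this is easy to check by machine. The sketch below recomputes the statistics for the ten temperatures in this worked example using Python's standard `statistics` module, whose `pvariance` and `pstdev` functions divide by $n$, as the formulas in this lesson do.

```python
import statistics

# The ten daily mean temperatures (°C) from the worked example.
temps = [15.2, 16.1, 17.8, 18.3, 16.5, 19.2, 17.0, 15.8, 20.1, 16.9]
n = len(temps)

mean = sum(temps) / n
median = statistics.median(temps)
data_range = max(temps) - min(temps)

# Shortcut form of the variance: (sum of x^2)/n minus the mean squared.
sum_x2 = sum(x * x for x in temps)
variance = sum_x2 / n - mean ** 2
sd = variance ** 0.5

# Cross-check against the library's divide-by-n versions.
library_variance = statistics.pvariance(temps)
library_sd = statistics.pstdev(temps)

print(f"mean={mean:.2f} median={median:.2f} range={data_range:.1f}")
print(f"sum_x2={sum_x2:.2f} variance={variance:.3f} sd={sd:.2f}")
print(f"library variance={library_variance:.3f} sd={library_sd:.2f}")
```

Running it prints the mean, median, range, $\sum x^2$, variance and standard deviation for comparison with the working above.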

Interpreting Summary Statistics in Context

Calculating summary statistics is only half the task. At A-Level, you must also interpret them in the context of the data. Here are examples of what good contextual interpretation looks like:

Example 1: Comparing Two Stations

"The mean daily temperature at Heathrow in July was 19.8°C, compared to 14.2°C at Leuchars. This is expected because Heathrow is in south-east England at a lower latitude than Leuchars in Scotland, and also benefits from the urban heat island effect."

Example 2: Interpreting Spread

"The standard deviation of daily rainfall at Camborne was 8.3 mm, compared to 5.1 mm at Hurn. This indicates that rainfall at Camborne is more variable from day to day, which is consistent with its exposed south-westerly coastal location where weather systems from the Atlantic arrive with varying intensity."

Example 3: Commenting on Skewness

"The mean daily sunshine at Leuchars in December (1.2 hours) was greater than the median (0.8 hours), suggesting a slight positive skew. This makes sense because most December days in Scotland have very little sunshine, but occasional clear days can produce several hours, pulling the mean upwards."

Comparing Distributions

When comparing two data sets, always comment on:

A measure of location (e.g., "the mean temperature is higher at Heathrow than at Leuchars")
A measure of spread (e.g., "the IQR of rainfall is larger at Camborne, indicating greater variability")
Context (e.g., "this is likely because of the geographical differences between the stations")
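
The first two parts of this checklist can be sketched in Python. The station values below are invented for illustration, `compare` is a hypothetical helper, and the contextual explanation still has to come from you. On a tie, this sketch simply names the second data set.

```python
import statistics

def summarise(values):
    """One measure of location and one of spread (divide-by-n sd)."""
    return {"mean": statistics.mean(values), "sd": statistics.pstdev(values)}

def compare(name_a, a, name_b, b):
    """Return (name with the higher mean, name with the larger sd)."""
    stats_a, stats_b = summarise(a), summarise(b)
    higher = name_a if stats_a["mean"] > stats_b["mean"] else name_b
    more_variable = name_a if stats_a["sd"] > stats_b["sd"] else name_b
    return higher, more_variable

# Hypothetical daily mean temperatures (°C) at two stations.
coastal = [13.0, 14.0, 14.5, 15.0, 14.5]
inland = [9.0, 12.0, 14.0, 16.0, 17.0]

higher, more_variable = compare("Coastal", coastal, "Inland", inland)
print(f"Higher mean: {higher}; more variable: {more_variable}")
```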

Coding (Linear Transformation)

Linear coding simplifies calculations without affecting the underlying relationships:

If $y = \frac{x - a}{b}$ , then:

$\bar{x} = b\bar{y} + a$ and $\sigma_x = b \sigma_y$ (taking $b > 0$ )

Example: To simplify calculations with temperatures around 17°C, let $y = x - 17$ . Calculate summary statistics for $y$ , then convert back.

This technique is frequently tested in exam questions and is useful when working with the large data set.
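
The decoding formulas can be verified with a short Python sketch using the ten temperatures from the worked example earlier in this lesson. The helper names `code` and `decode_stats` are illustrative, and the second coding (with $b = 0.1$ ) is a hypothetical choice added to show the effect of scaling.

```python
import statistics

# The ten daily mean temperatures (°C) from the worked example.
temps = [15.2, 16.1, 17.8, 18.3, 16.5, 19.2, 17.0, 15.8, 20.1, 16.9]

def code(values, a, b):
    """Apply the coding y = (x - a) / b to every value."""
    return [(x - a) / b for x in values]

def decode_stats(mean_y, sd_y, a, b):
    """Convert coded statistics back: mean_x = b*mean_y + a, sd_x = |b|*sd_y."""
    return b * mean_y + a, abs(b) * sd_y

# Coding from the text: y = x - 17, i.e. a = 17 and b = 1.
y = code(temps, a=17, b=1)
mean_x, sd_x = decode_stats(statistics.mean(y), statistics.pstdev(y), a=17, b=1)

# Hypothetical second coding that also rescales: y = (x - 17) / 0.1.
z = code(temps, a=17, b=0.1)
mean_x2, sd_x2 = decode_stats(statistics.mean(z), statistics.pstdev(z), a=17, b=0.1)

print(mean_x, sd_x)
print(mean_x2, sd_x2)
```

Subtracting $a$ shifts the mean but leaves the standard deviation unchanged; only the scale factor $b$ affects the spread.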

Summary

Measures of location (mean, median, mode) describe where the data is centred.
Measures of spread (range, IQR, standard deviation) describe how variable the data is.
The mean and standard deviation use all the data but are sensitive to outliers; the median and IQR are more robust.
At A-Level, you must interpret summary statistics in context — relating them to the real-world situation.
When comparing two data sets, always comment on both location and spread, with contextual explanation.
Linear coding simplifies calculations without changing the structure of the data.

Exam Tip: When asked to compare two distributions, structure your answer as: (1) compare a measure of location, (2) compare a measure of spread, (3) relate both to the context. For example: "The median daily rainfall at Camborne (4.2 mm) is higher than at Hurn (2.8 mm), and the IQR is also larger (6.1 mm vs 3.9 mm), suggesting that Camborne is both wetter and more variable in its rainfall, consistent with its exposed Atlantic coast location."

A-Level Deep Dive: Summary Statistics in Context

Spec mapping

AQA 7357 specification, Paper 3 (Statistics): sub-strands O (statistical sampling) and P (data presentation and interpretation) cover the requirement to interpret diagrams for single-variable data, including measures of central tendency and variation … select or critique data presentation techniques in the context of a statistical problem (refer to the official specification document for exact wording). The AQA Large Data Set (LDS), published with the specification and used across all three years of the cohort, provides the context in which every statistic must be interpreted. Mean ( $\bar{x}$ ), median, mode, range, interquartile range (IQR), variance ( $\sigma^2$ or $s^2$ ) and standard deviation ( $\sigma$ or $s$ ) are not assessed as bare formula recall; they are assessed as tools for comparing groups within the LDS and answering questions about it. The AQA formula booklet provides $\sigma^2 = \dfrac{\sum(x - \bar{x})^2}{n} = \dfrac{\sum x^2}{n} - \bar{x}^2$ , so the marks reward selection, careful computation and interpretation.

Worked example with full mark scheme

Question (8 marks): A student investigates daily mean temperature ( $^\circ\mathrm{C}$ ) at two LDS weather stations during a single month. Their summary statistics are:

Station	$n$	$\bar{x}$	$s$	Median	IQR
Coastal	30	14.2	1.8	14.0	2.4
Inland	30	13.6	3.5	13.5	4.6

Compare the daily mean temperatures at the two stations, in context, using both a measure of centre and a measure of spread. (8)

Solution with mark scheme:

Step 1 — choose appropriate measures and justify.

Both the mean and the median are reported for each station. At each station the median is very close to the mean, suggesting the distributions are roughly symmetric, so either measure of centre is defensible. The spreads, however, differ markedly: the standard deviation $s$ measures spread about the mean, while the IQR captures the spread of the middle 50%. Reporting both is unnecessary, since the question asks for one of each, but choosing the IQR is safer when subsequent questions might involve outliers.

B1 — explicit choice of measures with reason (centre + spread).

Step 2 — compare centres in context.

The mean daily temperature is higher at the Coastal station ( $\bar{x} = 14.2^\circ\mathrm{C}$ ) than at the Inland station ( $\bar{x} = 13.6^\circ\mathrm{C}$ ), a difference of $0.6^\circ\mathrm{C}$ . The medians give the same direction ( $14.0$ vs $13.5$ ).

B1 — correct numerical comparison of centre. B1 — direction stated with units and in the context of temperature at the two stations (not "the first set is bigger").

Step 3 — compare spreads in context.

The standard deviation at the Inland station ( $s = 3.5$ ) is roughly twice that at the Coastal station ( $s = 1.8$ ). The IQR confirms this ( $4.6$ vs $2.4$ ). Daily mean temperatures inland are therefore much more variable than on the coast.

B1 — correct numerical comparison of spread. B1 — direction stated with units and contextual phrasing ("inland temperatures are more variable").

Step 4 — interpret in context (the AO3 marks).

A plausible explanation is that the sea moderates coastal temperatures: water has a very high specific heat capacity, so coastal air-mass temperatures change less from day to day. Inland sites, lacking this thermal buffer, show wider day-to-day swings. The mean is only slightly higher at the coast, but the consistency (small $s$ ) is the dominant feature.

B1 — interpretation linking the difference in spread to a plausible physical reason. B1 — interpretation noting that spread, not centre, is the dominant difference.

Step 5 — caveat.

This is a single month for $n = 30$ days at each station; conclusions about long-term climate would require multiple months and ideally multiple years. A single LDS sample is not the population.

B1 — sample-vs-population caveat or equivalent limitation statement.

Total: 8 marks (B1 x 8 — typical AQA "compare in context" mark allocation).

Specimen question modelled on the AQA 7357 Paper 3 format

Question (6 marks): Using LDS data on rainfall (mm) for two months, a student computes $\bar{x}_A = 2.1$ , $s_A = 1.4$ for month A ( $n = 31$ ) and $\bar{x}_B = 3.8$ , $s_B = 4.2$ for month B ( $n = 30$ ).

(a) Explain why the median may be a more appropriate measure of centre than the mean for rainfall data. (2)

(b) Compare the two months' rainfall in context, using both centre and spread. (4)

Mark scheme decomposition by AO:

(a)

B1 (AO2.4) — observation that rainfall distributions are typically skewed (most days have low rainfall, occasional days have very high rainfall).
B1 (AO2.4) — conclusion that the mean is pulled upward by these high values, so the median is more robust / resistant to skew.

(b)

B1 (AO1.1b) — comparison of centres: month B has higher mean rainfall ( $3.8$ vs $2.1$ mm).
B1 (AO2.5) — comparison of spreads: month B is much more variable ( $s_B = 4.2$ vs $s_A = 1.4$ ).
B1 (AO3.1b) — context: month B is both wetter on average and more erratic.
B1 (AO3.5b) — caveat about sample size / single year of LDS data.

Total: 6 marks split AO1 = 1, AO2 = 3, AO3 = 2. Five of the six marks reward reasoning and interpretation rather than computation: AQA uses LDS comparison questions to push candidates beyond computation into reasoning and limitation-awareness.

Synoptic links

Connects to:

Sub-strand O — Statistical sampling: every LDS computation is a sample statistic. A 30-day sample has a sample mean $\bar{x}$ , not a population mean $\mu$ . The distinction governs how confidently you can generalise. Random vs systematic vs stratified sampling determines whether your sample is representative — a question AQA returns to repeatedly.
Sub-strand P — Data presentation: boxplots visualise median + IQR; histograms suggest the shape that determines whether mean or median is appropriate. A right-skewed histogram immediately tells you to prefer the median. The summary statistics and the visualisations are two faces of the same data.
Sub-strand R — Correlation and regression: before computing $r$ or fitting $y = a + bx$ , you summarise each variable's centre and spread. A regression coefficient $b$ is a ratio of covariance to $\sigma_x^2$ — variance is the engine of regression. Misunderstanding spread propagates into misunderstanding regression.
Sub-strand S — Probability: the empirical mean from an LDS sample is an estimator of $E[X]$ for the underlying distribution; $s^2$ estimates $\mathrm{Var}(X)$ . For a random sample, the sampling distribution of $\bar{x}$ has mean $\mu$ and variance $\sigma^2/n$ , and the Central Limit Theorem (preview content) adds that it is approximately normal for large $n$ , so larger samples give more reliable centre estimates.
Sub-strand T — Statistical hypothesis testing: AQA's hypothesis tests on the population mean (Year 2) use $\bar{x}$ as the test statistic and $s/\sqrt{n}$ as the standard error. Without confident handling of $\bar{x}$ and $s$ from the LDS, hypothesis testing is impossible.
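
The claim that the sample mean has variance $\sigma^2/n$ can be explored with a small simulation. This is a sketch under stated assumptions: the population is a hypothetical uniform distribution on $[0, 10]$ (not LDS data), the seed is fixed so the run is repeatable, and the agreement is only approximate because it is based on a finite number of simulated samples.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: uniform on [0, 10], so sigma^2 = 10^2 / 12.
population_variance = 10 ** 2 / 12
n = 30          # sample size, e.g. one 30-day month
repeats = 2000  # number of simulated samples

# Draw many samples of size n and record each sample mean.
sample_means = []
for _ in range(repeats):
    sample = [rng.uniform(0, 10) for _ in range(n)]
    sample_means.append(sum(sample) / n)

# Compare the spread of the sample means with the theoretical value.
observed = statistics.pvariance(sample_means)
theory = population_variance / n
standard_error = theory ** 0.5

print(f"observed variance of sample means: {observed:.4f}")
print(f"theoretical sigma^2 / n:           {theory:.4f}")
print(f"standard error sigma / sqrt(n):    {standard_error:.4f}")
```

Increasing `n` in the sketch shrinks both figures, which is the sense in which larger samples give more reliable estimates of the centre.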

Mark-scheme literacy

LDS comparison questions on 7357 split AO marks across AO1, AO2 and AO3; they are not AO1-dominated.
