This lesson introduces the AQA Large Data Set (LDS) — what it is, why AQA requires students to work with it, how it appears in exam questions, and strategies for effective familiarisation. The large data set is a distinctive feature of the AQA A-Level Mathematics specification (7357), and understanding how to navigate and interpret real data is essential for success in Paper 3: Statistics and Mechanics.
The large data set is a pre-release collection of real-world data that AQA publishes for each examination series. Students are expected to become familiar with this data set before the exam, so that they can answer questions about it efficiently and with genuine understanding.
For AQA A-Level Mathematics, the large data set consists of weather data collected from a selection of weather stations across the United Kingdom and around the world. The data covers several years and includes a range of meteorological variables recorded on a daily or monthly basis.
| Feature | Detail |
|---|---|
| Subject | Weather/meteorological data |
| Source | Met Office and international equivalents |
| Coverage | Multiple UK and overseas weather stations |
| Time period | Several years of recorded data |
| Format | Spreadsheet (typically Excel or CSV) |
| Release | Published by AQA ahead of each exam series |
The data set is not provided in the exam paper in full. Instead, students are expected to have studied it beforehand and may be given small extracts, summaries, or contextual information in the exam.
AQA's rationale for including a large data set is grounded in several pedagogical and practical principles:
Authentic statistical practice: Working with real data mirrors what statisticians actually do. Unlike textbook exercises with small, clean data sets, the LDS contains anomalies, missing values, and the kind of complexity that real data inevitably presents.
Developing data literacy: Students learn to navigate, interrogate, and interpret large quantities of data — a skill that is increasingly important in higher education and the workplace.
Contextual understanding: Questions on the exam paper are set in the context of the data set. Students who have explored the data will understand the variables, their units, and what realistic values look like. This makes it much easier to spot errors, interpret results, and write meaningful conclusions.
Assessment of higher-order skills: The LDS allows AQA to ask questions that go beyond routine calculation. Students may be asked to comment on data quality, suggest reasons for anomalies, or discuss whether a statistical model is appropriate for a particular variable.
Specification requirement: The Ofqual subject content for A-Level Mathematics explicitly requires that students work with a large data set as part of their statistics training.
Questions on the large data set appear in Paper 3: Statistics and Mechanics (Section A: Statistics). These questions are designed so that students who have genuinely familiarised themselves with the data are at an advantage.
| Question type | What is expected |
|---|---|
| Sampling questions | Explain how to take a sample from the LDS using a named method (e.g., stratified, systematic) |
| Data presentation | Construct or interpret charts (box plots, histograms, scatter diagrams) based on data from the LDS |
| Summary statistics | Calculate or interpret mean, standard deviation, quartiles, etc., for variables in the LDS |
| Hypothesis testing | Carry out a test using data or summary statistics derived from the LDS |
| Interpretation and context | Comment on trends, patterns, outliers, or relationships observed in the data |
| Data cleaning | Discuss how missing data or anomalies should be handled |
Effective preparation for LDS questions involves much more than simply downloading the spreadsheet and glancing at it. Below are recommended strategies:
Open the data set in a spreadsheet application and examine it carefully: note which variables are recorded and their units, which locations and time periods are covered, and how missing values are marked. For each station and each variable, calculate summary statistics such as the mean, standard deviation, and quartiles.
Use the spreadsheet's built-in functions (e.g., AVERAGE, STDEV, QUARTILE) to carry out these calculations efficiently.
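The spreadsheet functions above have direct analogues in Python's standard library. A minimal sketch, using made-up illustrative temperatures rather than actual LDS values:

```python
import statistics

# Hypothetical daily mean temperatures (°C) — illustrative values only,
# not taken from the actual AQA Large Data Set.
temps = [9.8, 10.4, 11.2, 8.9, 12.1, 10.7, 9.5, 11.8, 10.2, 10.9]

mean = statistics.mean(temps)        # spreadsheet AVERAGE
sample_sd = statistics.stdev(temps)  # spreadsheet STDEV (divisor n - 1)
# statistics.quantiles with n=4 returns the three quartiles Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(temps, n=4)

print(f"mean = {mean:.2f} °C, sd = {sample_sd:.2f} °C, IQR = {q3 - q1:.2f} °C")
```

Note that `statistics.stdev` uses the divisor n − 1, matching the sample standard deviation expected in exam answers.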
Work through past papers and specimen papers that reference the LDS. This will help you understand the types of questions that appear and the level of contextual knowledge expected.
Create a one-page summary for each weather station, covering its location, the variables recorded there, typical values and ranges for each variable, and any notable features such as gaps or missing data.
Since the AQA large data set is based on weather data, it helps to have a basic understanding of the meteorological context — in particular, the typical ranges of each variable, such as realistic daily temperatures, rainfall totals, and wind speeds for the stations covered.
Understanding these ranges helps you spot unreasonable values in exam questions and provides the background for sensible interpretation.
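One practical use of these typical ranges is an automated sanity check on the data. A sketch in which the plausibility limits are illustrative assumptions, not AQA-published figures:

```python
# Sanity-check sketch: flag values outside plausible UK ranges.
# The ranges below are illustrative assumptions, not AQA-published limits.
PLAUSIBLE = {
    "daily mean temperature (°C)": (-15.0, 35.0),
    "daily total rainfall (mm)": (0.0, 150.0),
}

def flag_implausible(variable, values):
    """Return the values that fall outside the assumed plausible range."""
    lo, hi = PLAUSIBLE[variable]
    return [v for v in values if not lo <= v <= hi]

# A reading of 350.0 among single-digit values is more likely an
# instrument fault or a transcription error than a real datum:
suspect = flag_implausible("daily mean temperature (°C)", [4.2, 5.1, 350.0, 3.8])
print(suspect)  # → [350.0]
```

The same pattern generalises to any variable once you have noted its realistic range during familiarisation.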
| Pitfall | Advice |
|---|---|
| Not studying the LDS at all | Familiarisation is essential — do not leave it to chance |
| Trying to memorise every value | Focus on typical ranges and patterns, not specific numbers |
| Ignoring missing data codes | Learn what n/a, tr, and blank cells mean |
| Failing to interpret in context | Always relate your statistical findings back to the real-world setting |
| Not practising with past papers | Exam-style questions are the best way to prepare |
Exam Tip: In the exam, if a question refers to the large data set, make sure your answer includes specific contextual detail. For example, do not just say "the data shows a positive correlation" — say "the data for Heathrow shows a positive correlation between daily mean temperature and daily total sunshine hours, which is expected because warmer days in the UK tend to have clearer skies and more sunshine."
AQA 7357 specification, Paper 3 — Statistics, Section O (Statistical sampling and the Large Data Set) states that students are required to become familiar with one or more specific large data sets in advance of the final assessment. Students will be expected to demonstrate familiarity with the context, the variables and the units of measurement, and to use this familiarity to interpret real data, identify outliers and anomalies, and apply statistical techniques in context (refer to the official specification document for exact wording). The Large Data Set (LDS) is pre-released by AQA — typically a multi-thousand-row spreadsheet covering weather, transport or socio-economic indicators across several locations and time periods. Although the LDS is named in Section O, it underpins every Statistics topic on Paper 3: sampling (Section N), data presentation and interpretation (Section P), probability (Section Q), the Normal and Binomial distributions (Sections R and S) and hypothesis testing (Section T) all expect candidates to draw on LDS familiarity. Critically, candidates are not asked to memorise specific numbers — they are asked to reason fluently about variables, units, sampling frames and time-period coverage in ways that only sustained familiarity supports.
Question (8 marks):
An extract from the Large Data Set shows daily mean temperature (T, in °C) and daily total rainfall (R, in mm) recorded at a single weather station over a 30-day period in a specific calendar month. The extract is summarised:
∑T = 312.0, ∑T² = 3402.4, ∑R = 84.0, ∑TR = 720.5, n = 30.
(a) Calculate the sample mean and sample standard deviation of the daily mean temperature. (3)
(b) State, with reasons referring explicitly to the LDS context, two limitations of using these 30 observations to describe the climate of the location. (3)
(c) The product moment correlation coefficient between T and R for these 30 days is r = −0.18. Interpret this value, including a comment on whether a linear regression of R on T would be appropriate. (2)
Solution with mark scheme:
(a) Step 1 — sample mean.
T̄ = (∑T)/n = 312.0/30 = 10.4 °C
B1 — correct mean with units.
Step 2 — sample standard deviation.
s_T² = (∑T² − nT̄²)/(n − 1) = (3402.4 − 30 × 10.4²)/29 = (3402.4 − 3244.8)/29 = 157.6/29 ≈ 5.434
M1 — correct use of the sum-of-squares formula with the divisor (n−1) for sample standard deviation.
s_T = √5.434 ≈ 2.33 °C
A1 — correct standard deviation with units, to a sensible number of significant figures.
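The part (a) arithmetic can be checked directly from the summary sums given in the question:

```python
import math

# Reproduce part (a) from the summary statistics: n = 30, ∑T, ∑T².
n, sum_T, sum_T2 = 30, 312.0, 3402.4

mean_T = sum_T / n                          # sample mean
var_T = (sum_T2 - n * mean_T**2) / (n - 1)  # sample variance, divisor n - 1
sd_T = math.sqrt(var_T)

print(f"mean = {mean_T:.1f} °C, sd = {sd_T:.2f} °C")
```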
(b) Marking principle: each limitation must be (i) a genuine statistical limitation and (ii) anchored to the LDS context. Generic answers ("the sample is small") earn no marks unless contextualised. Indicative limitations: the 30 days cover a single calendar month, so seasonal variation at this location is not captured; the data come from a single weather station, which may not represent the wider region; and consecutive days are autocorrelated rather than independent, so the sample carries less information than 30 independent days would. Any two such limitations, properly contextualised, earn full marks.
(c) Interpretation (B1): r = −0.18 indicates a weak negative linear correlation between daily mean temperature and daily total rainfall — warmer days tend (very slightly) to be drier in this sample, but the relationship is weak.
Suitability of regression (B1): because |r| is small, a linear regression line of R on T would have very little predictive power; reporting a regression equation would overstate the strength of the relationship. Any line of best fit should be accompanied by an explicit health warning on r.
Total: 8 marks (B1 M1 A1 B1 B1 B1 B1 B1).
Question (6 marks): A student claims that "in the Large Data Set, daily maximum gust speed and daily mean wind speed measure the same thing, so we may use either interchangeably."
(a) Explain, with reference to the LDS, two ways in which these variables differ. (2)
(b) The student takes a systematic sample of every 10th row from one location's data and reports the sample mean of daily maximum gust speed. State two features of the LDS that the student must check before treating this sample mean as representative of that location's typical wind conditions. (4)
Mark scheme decomposition by AO:
(a) 2 marks — one for each correctly identified difference, for example that the daily maximum gust is an extreme value recorded over the day whereas the daily mean wind speed is an average, so the two need not move together (AO2 = 1, AO3 = 1).
(b) 4 marks — two for each feature checked, for example whether the step of 10 interacts with the recording schedule (a weekday-only record introduces periodicity), and whether missing values reduce the effective sample size (AO3 = 4).
Total: 6 marks split AO2 = 1, AO3 = 5. Section O questions are AO3-dominated because LDS familiarity is, by design, a real-world reasoning skill rather than a procedural one.
The LDS is the connective tissue of Paper 3. Every Statistics topic uses it as the canonical context:
Section N — Sampling: the LDS is the textbook example of a sampling frame. Simple random sampling, stratified sampling (by location, by month), systematic sampling (every kth row) and opportunity sampling (the first 30 rows that load) all have natural LDS instantiations. AQA exploits this by asking candidates to evaluate a proposed sampling scheme against a specific LDS feature — for example, "would systematic sampling be appropriate if rainfall is recorded only on weekdays?"
Section P — Data presentation and interpretation: box plots, histograms, cumulative-frequency curves and scatter diagrams in Paper 3 are routinely drawn from LDS-style data. Outlier identification using the 1.5×IQR rule is asked in LDS context, and candidates must distinguish between statistical outliers (numerical) and contextual outliers (a 35 °C reading in a UK December that is more likely an instrument fault than a real datum).
Section Q — Probability: modelling the probability that a randomly selected day from the LDS exceeds a threshold (e.g. P(R>5 mm)) is a standard application. Empirical probabilities estimated from the LDS feed into the assumptions of the Normal and Binomial models in Sections R and S.
Sections R and S — Normal and Binomial distributions: daily mean temperature is often modelled as Normal; the number of rainy days in a fortnight is modelled as Binomial. The LDS provides the empirical frequencies that justify (or contradict) these distributional assumptions.
Section T — Hypothesis testing: the most synoptic LDS question type asks candidates to formulate hypotheses about an LDS-derived parameter (e.g. "the mean daily temperature in this month at this station is 11 °C") and conduct a one-sample test using LDS summary statistics. The reasoning required is not procedural — it is about whether the LDS data satisfy the test's assumptions of independence and Normality.
Cross-paper synoptic — Paper 1 Pure (Section H, exponentials and logarithms): when LDS data exhibit exponential growth or decay (e.g. transport-LDS bicycle counts versus distance from a city centre), Paper 3 questions can require the Pure-paper technique of linearising via ln before applying Section O regression.
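The weekday-periodicity hazard flagged under Section N can be demonstrated concretely. A sketch assuming a weekday-only recording schedule (an assumption for illustration, not a property of every LDS variable): a systematic step equal to the record's weekly period revisits the same weekday forever.

```python
import datetime

# Build a weekday-only record of dates (Mon–Fri), mimicking a variable
# recorded only on weekdays — an assumption for this illustration.
start = datetime.date(2020, 6, 1)  # a Monday
weekday_dates = []
d = start
while len(weekday_dates) < 60:
    if d.weekday() < 5:            # 0=Mon .. 4=Fri
        weekday_dates.append(d)
    d += datetime.timedelta(days=1)

# Systematic sample: every 5th row. Because the record has period 5
# (one working week), every sampled date lands on the SAME weekday.
sample = weekday_dates[::5]
print({d.strftime("%A") for d in sample})  # a single weekday, not a mix
```

Any weekday-dependent variable (commuter-period gusts, urban rainfall measurement schedules) sampled this way would be systematically biased — exactly the evaluation AQA asks for.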
LDS questions on Paper 3 split AO marks heavily toward AO3:
| AO | Typical share | Earned by |
|---|---|---|
| AO1 (knowledge / procedure) | 10–20% | Stating a definition (sampling frame, outlier criterion), computing a mean or standard deviation from LDS summaries |
| AO2 (reasoning / interpretation) | 20–30% | Interpreting a correlation coefficient in context, distinguishing a statistical outlier from an instrument fault, justifying a sampling-scheme choice |
| AO3 (problem-solving / modelling) | 50–70% | Proposing or critiquing a sampling scheme against LDS features, evaluating whether the LDS supports a given modelling assumption, formulating contextualised hypotheses |
Examiner-rewarded phrasing: "in the context of the Large Data Set …"; "given that the LDS records data only on weekdays …"; "since the LDS contains multiple locations, the sample must be stratified by station before …"; "missing values (recorded as n/a) reduce the effective sample size to …". Phrases that lose marks: generic statements ("the sample is too small", "the data may be biased") with no LDS anchor; computing sample statistics without commenting on units; treating the LDS as if it were a single time series rather than a multi-variable, multi-location, multi-time-period structure.
A specific AQA pattern to watch: questions that say "with reference to the Large Data Set" or "in the context of the data set" require the candidate to name a specific LDS feature (a variable, a location, a time period, a recording convention). A response that omits all LDS specifics, however statistically correct, typically caps at half marks.
Question: State two features of the AQA Large Data Set that a candidate should familiarise themselves with before the examination, and for each feature explain briefly why it matters.
Grade C response (~190 words):
The first feature is the variables — candidates should know what variables are recorded, for example daily mean temperature in °C and daily rainfall in mm. This matters because the units affect how answers should be reported and because units differ between variables (temperature is continuous, rainfall is bounded below by zero).
The second feature is the locations — the LDS records data at several stations, so candidates need to know which stations are included and not assume all data comes from one place. This matters because comparisons between stations require stratified sampling.
Examiner commentary: Earns 3/3. Both features are named correctly (variables with units; locations) and each is given a reason that links to a statistical technique (units affect reporting; multiple stations affect sampling). The answer would be stronger with a third specific (time period, missing values) but the question only asks for two. Grade C work — accurate, brief, sufficient.
Grade A* response (~230 words):
A first essential feature is the time-period coverage: the LDS spans a defined window (typically several months across multiple years) and on a defined recording schedule (often weekdays only, with weekends or holidays missing). This matters because any sample drawn from the LDS inherits this temporal structure — a systematic sample with step size 5 in a Monday–Friday record will revisit the same weekday repeatedly, biasing any wind- or rainfall-related estimate.
A second essential feature is the handling of missing values, which AQA marks explicitly within the LDS (typically as "n/a" or by leaving the cell blank). This matters because sample statistics computed from a row count of n = 30 may be effectively based on n_eff < 30 if some cells are missing; a defensible analysis must report the effective sample size and choose between case-wise deletion, pair-wise deletion or imputation. Conflating the nominal and effective sample sizes inflates statistical confidence and is a marker of inexperience with real data.
Examiner commentary: Full marks (3/3). Beyond procedural correctness, the candidate names a specific recording convention (weekday-only data), explains a specific bias mechanism (systematic sampling interacting with periodicity) and connects missing-value handling to inferential validity. The phrase "effective sample size" is technical vocabulary that signals AO3 fluency. This is examination craft.
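The nominal-versus-effective sample size distinction in the response above can be made concrete. A sketch with illustrative rainfall values, using `None` as a stand-in for AQA's missing-value marker:

```python
# Sketch: nominal vs effective sample size when cells are missing.
# NA stands in for AQA's missing-value marker (e.g. "n/a" or a blank cell);
# the rainfall figures are illustrative, not from the actual LDS.
NA = None
rainfall = [0.0, 2.4, NA, 5.1, 0.0, NA, 1.2, 0.0, 3.8, NA]

n_nominal = len(rainfall)
present = [r for r in rainfall if r is not None]  # case-wise deletion
n_eff = len(present)

mean_casewise = sum(present) / n_eff                      # defensible
mean_naive = sum(r or 0.0 for r in rainfall) / n_nominal  # treats NA as 0: biased low

print(n_nominal, n_eff, round(mean_casewise, 2), round(mean_naive, 2))
```

The naive mean is pulled toward zero because three missing days are silently counted as dry days — precisely the "unsafe practice" penalised in mark schemes.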
Question: A researcher takes a stratified random sample of 60 days from the LDS, with 10 days drawn from each of six months. The researcher then computes the sample mean of daily mean temperature across all 60 days and uses it to estimate the annual mean temperature for the location. (a) State two advantages of stratifying by month rather than taking a simple random sample of 60 days from the whole LDS. (b) Identify one way in which the resulting estimate may still be biased, and suggest a correction.
Grade B response (~270 words):
(a) Stratifying by month ensures that every month is represented in the sample, which avoids a simple random sample by chance picking, say, 50 days from the summer and only 10 from the winter. This gives a fairer sample.
A second advantage is that stratification reduces the variance of the estimator when within-month variation is smaller than between-month variation, which is true for temperature.
(b) The sample may still be biased because the LDS does not cover all 12 months of the year, only the months that AQA selected. So the "annual mean" estimate is really a six-month mean. The correction is to either restrict the claim to those six months or to weight the missing months using external data.
Examiner commentary: Earns 5/6. Part (a) full marks (4/4) — both advantages are correct, the second is at A* depth (variance reduction). Part (b) loses one mark: the candidate identifies a real bias (the LDS covers only some months) but the correction is loose. A precise correction would specify either how the external weighting should be combined or, more cleanly, that the estimand should be redefined as "mean for the six LDS months". Total: 5/6 — Grade B work, undermined by an imprecise final clause.
Grade A* response (~310 words):
(a) Stratifying by month guarantees proportional representation of each calendar month within the LDS, ensuring no month is over- or under-sampled by chance. This matters because daily mean temperature has a strong seasonal cycle, so an unstratified simple random sample would have higher variance: a sample dominated by July days would over-estimate the annual mean, and conversely for January.
A second advantage is variance reduction. Where between-strata variance is large compared to within-strata variance — exactly the case for monthly temperature data — the stratified estimator has strictly lower variance than the simple random estimator with the same total sample size, so confidence intervals are tighter for the same effort.
(b) The estimate is biased because the LDS does not span all twelve months of the calendar year: AQA's LDS typically covers a fixed selection of months (e.g. the same six months across multiple years), so the sample mean estimates the mean for those LDS months, not the true annual mean. The correction is to redefine the estimand explicitly as "mean daily temperature across the LDS months at this location" rather than "annual mean", which makes the inference internally valid; alternatively, augment the LDS sample with external monthly climate normals for the missing months and combine via a weighted mean weighted by month-length.
Examiner commentary: Full marks (6/6). The candidate names between-strata and within-strata variance correctly — examination vocabulary that earns AO2 marks cleanly. The correction in (b) offers two valid alternatives, distinguishing the cleaner option (redefine the estimand) from the more involved option (external weighting). The phrase "internally valid" signals that the candidate understands the difference between estimating the intended parameter and estimating some well-defined parameter. This is graduate-style statistical reasoning.
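The stratified scheme in the question — 10 days drawn at random from each of six months — can be sketched as follows. The monthly temperature populations are simulated (Normal with a seasonal mean), purely for illustration, not real LDS values:

```python
import random
import statistics

# Simulated monthly populations of 30 daily mean temperatures each.
# Means follow a rough seasonal cycle — illustrative assumptions only.
random.seed(42)
monthly_means = {"Jan": 4, "Mar": 7, "May": 12, "Jul": 17, "Sep": 14, "Nov": 6}
population = {
    m: [random.gauss(mu, 2.0) for _ in range(30)]
    for m, mu in monthly_means.items()
}

# Stratified sample: 10 days per month, then pool all 60 values.
stratified = [x for days in population.values() for x in random.sample(days, 10)]
print(f"stratified mean = {statistics.mean(stratified):.2f} °C (n = {len(stratified)})")
```

Because every month contributes exactly 10 days, no season can dominate by chance — the guaranteed-representation advantage cited in part (a).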
Question: A student writes: "I sampled the first 30 rows of the LDS spreadsheet, computed the sample mean rainfall R̄ = 2.8 mm with sample standard deviation s_R = 4.1 mm, and conducted a one-sample z-test of H₀: μ_R = 2.0 versus H₁: μ_R > 2.0 at the 5% significance level. I found z = 1.07 and concluded there is insufficient evidence to reject H₀." Critique this analysis with reference to (a) the sampling scheme, (b) the distributional assumptions, and (c) the conclusion.
Grade A* response (~390 words):
(a) Sampling scheme. Taking the first 30 rows is opportunity sampling, not random sampling. The first 30 rows of the LDS are typically the first 30 calendar dates at a single location — this introduces two biases. First, the rows are temporally clustered: 30 consecutive days are not independent observations, because rainfall exhibits day-to-day autocorrelation (a rainy day often follows a rainy day). The independence assumption underlying the test is violated. Second, the first rows correspond to a single calendar window (e.g. early in the LDS year), so the sample is seasonally restricted — the resulting mean estimates the rainfall mean for that season at that location, not for the LDS overall.
(b) Distributional assumptions. The z-test assumes (i) the sample is drawn from a population whose mean is being tested, (ii) observations are independent, and (iii) either the population is Normal or n is large enough for the Central Limit Theorem to apply. Daily rainfall is a non-negative, right-skewed variable — many days have zero rainfall, and a small number have very high totals. The Normal model fits poorly. With n = 30 the CLT is borderline for a heavily skewed variable, and the z-test using the sample standard deviation s_R in place of the unknown σ is more properly a t-test (though for n = 30 the difference is small).
(c) Conclusion. The numerical conclusion ("insufficient evidence to reject H0") is internally consistent with z=1.07<1.645, but the inferential conclusion is unsafe because (i) and (ii) above are unmet. A defensible reanalysis would: re-sample using stratified random sampling across LDS months and locations; check for skewness via a histogram or box plot; either log-transform rainfall before testing or use a non-parametric alternative such as the sign test; and report the conclusion with the estimand made explicit ("mean rainfall in the LDS months at this location").
Examiner commentary: Full marks (9/9). The candidate critiques on three dimensions exactly as the question demands, names specific statistical concepts (autocorrelation, CLT applicability for skewed variables, z-versus-t), and offers concrete remediations rather than vague gestures. The closing sentence on estimand specification ties the critique back to the AO3 framework. This is publication-grade statistical reasoning at school level.
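The student's test statistic can be reproduced from the quoted summaries, confirming that the arithmetic (if not the inference) is sound:

```python
import math

# Reproduce the student's one-sample z statistic from the summaries:
# n = 30, R̄ = 2.8 mm, s_R = 4.1 mm, H0: μ = 2.0.
n, mean_R, sd_R, mu0 = 30, 2.8, 4.1, 2.0

z = (mean_R - mu0) / (sd_R / math.sqrt(n))
print(f"z = {z:.2f}")  # matches the quoted z = 1.07, below the 1.645 critical value
```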
The errors that distinguish A from A* on LDS questions:
Computing without context. A candidate computes x̄ = 10.4 from LDS summaries and writes "the mean is 10.4". Correct value, no marks: the answer must include units (°C), the variable name (daily mean temperature) and ideally the location and time-period scope. AQA mark schemes routinely award the final A1 only for contextualised statements.
Ignoring units. Treating temperature (°C, interval scale, can be negative) and rainfall (mm, ratio scale, bounded below by zero) interchangeably is a category error. Rainfall cannot naively be modelled by a Normal distribution, because the Normal's support is the whole of ℝ and so allows negative values while rainfall cannot be negative; temperature, which genuinely can be negative, is a better candidate.
Missing time-period reasoning. Conflating "the LDS" with "the climate" or "this calendar year". The LDS covers a defined set of months and years, not all of them — any inference that extrapolates beyond the LDS time window without explicit justification is overreach.
Treating consecutive rows as independent. Statistical tests assume independence; consecutive days in a weather LDS exhibit autocorrelation. Candidates rarely flag this even when the question mentions "consecutive days", and lose AO3 marks accordingly.
Confusing the sample mean with the population mean. Saying "the LDS mean rainfall is 2.8 mm" when 2.8 is a sample statistic from a sub-set of the LDS. The full LDS has its own population mean; the sample mean is an estimate.
Ignoring missing-value conventions. AQA records missing values explicitly. A naive sum of the column treats missing as zero (depending on software), which biases means downward for rainfall and is undefined for temperature. Candidates who compute summaries without inspecting missing values lose AO3 marks for unsafe practice.
Generic "the sample is too small". This phrase appears in thousands of scripts and earns nothing on its own. To earn marks, the candidate must specify what makes the sample size insufficient (variance of the variable, effect size of interest, distributional shape) — i.e. answer the question "small relative to what?".
These patterns repeatedly cost candidates marks on Paper 3 LDS-context questions. They are about discipline of expression, not technique: candidates know the statistics but lose marks on context.
LDS-style data analysis points directly toward several undergraduate trajectories, including statistics, data science, and the quantitative environmental and social sciences.
Oxbridge interview prompt: "You are given a year of daily rainfall data from a single weather station. Frame a precise statistical question that the data could answer, state what assumptions your method requires, and explain how you would check those assumptions using the data itself."
A common A*-level challenge on Paper 3 is an open-ended LDS question — "is there evidence that …?" — which requires the candidate to frame a precise question, choose a method, and execute it. The technique is the same three-step pipeline that underpins all applied statistics: frame, check, infer.
Worked example: Using LDS-style daily mean temperature data for a single station across two consecutive months (June: n_J = 30, T̄_J = 14.6, s_J = 2.1; July: n_Jl = 31, T̄_Jl = 17.9, s_Jl = 2.4), test whether mean daily temperature differs between the two months at this station.
Step 1 — frame. The estimand is the difference μ_Jl − μ_J in true mean daily temperature between July and June at this station, conditional on the LDS years. Hypotheses: H₀: μ_Jl − μ_J = 0 versus H₁: μ_Jl − μ_J ≠ 0.
Step 2 — check. Independence within months is borderline (consecutive days are autocorrelated), but for a school-level test we treat days as approximately independent and acknowledge this in the write-up. Normality of daily temperature is reasonable for monthly windows. Variances differ slightly (s_J = 2.1, s_Jl = 2.4) — Welch's two-sample t-test is appropriate.
Step 3 — infer. The Welch test statistic is:
t = (T̄_Jl − T̄_J)/√(s_Jl²/n_Jl + s_J²/n_J) = (17.9 − 14.6)/√(2.4²/31 + 2.1²/30) = 3.3/√(0.1858 + 0.1470) = 3.3/0.577 ≈ 5.72
With approximately 58 degrees of freedom (Welch–Satterthwaite), the two-sided p-value is far below 0.001. Conclude: strong evidence that mean daily temperature is higher in July than in June at this station, in this LDS window.
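The Welch statistic and the Welch–Satterthwaite degrees of freedom can be verified numerically from the summaries:

```python
import math

# Reproduce the Welch test statistic from the worked example's summaries.
n_J, mean_J, s_J = 30, 14.6, 2.1     # June
n_Jl, mean_Jl, s_Jl = 31, 17.9, 2.4  # July

v_J, v_Jl = s_J**2 / n_J, s_Jl**2 / n_Jl
t = (mean_Jl - mean_J) / math.sqrt(v_J + v_Jl)

# Welch–Satterthwaite approximate degrees of freedom
df = (v_J + v_Jl) ** 2 / (v_J**2 / (n_J - 1) + v_Jl**2 / (n_Jl - 1))

print(f"t ≈ {t:.2f}, df ≈ {df:.0f}")  # ≈ 5.72 on ≈ 58 degrees of freedom
```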
Why A* candidates spot this immediately: the structure "compare two month-means" is the signature of a two-sample t-test. Every time you see this pattern, write the estimand explicitly first, check assumptions second, compute third. The same pipeline answers "does rainfall differ between two stations?", "is the proportion of rainy days different in two years?" (two-proportion z-test) and "is the variance of wind speed higher at coastal versus inland stations?" (F-test). Recognising the pattern across these contexts is exactly the synoptic skill AQA rewards.
A subtlety: when reporting the conclusion, bind it to the estimand. The conclusion "July is hotter than June" is too strong: the data support "mean daily temperature at this station, in the LDS-recorded years, is higher in July than in June". The narrower claim is what the data licence; the broader claim is an over-reach.
This content is aligned with the AQA A-Level Mathematics (7357) specification, Paper 3 — Statistics, Large Data Set context. For the most accurate and up-to-date information, please refer to the official AQA specification document.
graph TD
A["Large Data Set<br/>(pre-released by AQA)"] --> B{"What feature<br/>matters?"}
B -->|"Variables"| C["Names, units,<br/>scales (continuous,<br/>discrete, bounded)"]
B -->|"Observations"| D["Locations, dates,<br/>recording schedule<br/>(weekdays only?)"]
B -->|"Time period"| E["Calendar coverage,<br/>year span,<br/>seasonal scope"]
B -->|"Missing values"| F["n/a entries,<br/>effective sample size"]
C --> G["Sampling scheme:<br/>SRS, stratified,<br/>systematic"]
D --> G
E --> G
F --> G
G --> H{"Statistical<br/>technique"}
H -->|"Summary"| I["Mean, sd,<br/>quartiles, IQR"]
H -->|"Association"| J["Correlation,<br/>regression"]
H -->|"Inference"| K["Hypothesis test,<br/>confidence interval"]
I --> L["Report in context:<br/>variable, units,<br/>location, period"]
J --> L
K --> L
style G fill:#27ae60,color:#fff
style L fill:#3498db,color:#fff