This lesson introduces the AQA Large Data Set (LDS) — what it is, why AQA requires students to work with it, how it appears in exam questions, and strategies for effective familiarisation. The large data set is a distinctive feature of the AQA A-Level Mathematics specification (7357), and understanding how to navigate and interpret real data is essential for success in Paper 3: Statistics and Mechanics.
The large data set is a pre-release collection of real-world data that AQA publishes for each examination series. Students are expected to become familiar with this data set before the exam, so that they can answer questions about it efficiently and with genuine understanding.
For AQA A-Level Mathematics, the large data set consists of weather data collected from a selection of weather stations across the United Kingdom and around the world. The data covers several years and includes a range of meteorological variables recorded on a daily or monthly basis.
| Feature | Detail |
|---|---|
| Subject | Weather/meteorological data |
| Source | Met Office and international equivalents |
| Coverage | Multiple UK and overseas weather stations |
| Time period | Several years of recorded data |
| Format | Spreadsheet (typically Excel or CSV) |
| Release | Published by AQA ahead of each exam series |
The data set is not provided in the exam paper in full. Instead, students are expected to have studied it beforehand and may be given small extracts, summaries, or contextual information in the exam.
AQA's rationale for including a large data set is grounded in several pedagogical and practical principles:
Authentic statistical practice: Working with real data mirrors what statisticians actually do. Unlike textbook exercises with small, clean data sets, the LDS contains anomalies, missing values, and the kind of complexity that real data inevitably presents.
Developing data literacy: Students learn to navigate, interrogate, and interpret large quantities of data — a skill that is increasingly important in higher education and the workplace.
Contextual understanding: Questions on the exam paper are set in the context of the data set. Students who have explored the data will understand the variables, their units, and what realistic values look like. This makes it much easier to spot errors, interpret results, and write meaningful conclusions.
Assessment of higher-order skills: The LDS allows AQA to ask questions that go beyond routine calculation. Students may be asked to comment on data quality, suggest reasons for anomalies, or discuss whether a statistical model is appropriate for a particular variable.
Specification requirement: The Ofqual subject content for A-Level Mathematics explicitly requires that students work with a large data set as part of their statistics training.
Questions on the large data set appear in Paper 3: Statistics and Mechanics (Section A: Statistics). These questions are designed so that students who have genuinely familiarised themselves with the data are at an advantage.
| Question type | What is expected |
|---|---|
| Sampling questions | Explain how to take a sample from the LDS using a named method (e.g., stratified, systematic) |
| Data presentation | Construct or interpret charts (box plots, histograms, scatter diagrams) based on data from the LDS |
| Summary statistics | Calculate or interpret mean, standard deviation, quartiles, etc., for variables in the LDS |
| Hypothesis testing | Carry out a test using data or summary statistics derived from the LDS |
| Interpretation and context | Comment on trends, patterns, outliers, or relationships observed in the data |
| Data cleaning | Discuss how missing data or anomalies should be handled |
Effective preparation for LDS questions involves much more than simply downloading the spreadsheet and glancing at it. Below are recommended strategies:
Open the data set in a spreadsheet application and examine it carefully: note which variables are recorded and their units, which locations and time periods are covered, and how missing values are marked. For each station and each variable, calculate summary statistics such as the mean, standard deviation, and quartiles.
Use the spreadsheet's built-in functions (e.g., AVERAGE, STDEV, QUARTILE) to carry out these calculations efficiently.
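The spreadsheet functions above have direct analogues in Python's standard library. A minimal sketch, using made-up illustrative temperatures rather than actual LDS values:

```python
import statistics

# Hypothetical daily mean temperatures (°C) — illustrative values only,
# not taken from the actual AQA Large Data Set.
temps = [9.8, 10.4, 11.2, 8.9, 12.1, 10.7, 9.5, 11.8, 10.2, 10.9]

mean = statistics.mean(temps)        # spreadsheet AVERAGE
sample_sd = statistics.stdev(temps)  # spreadsheet STDEV (divisor n - 1)
# statistics.quantiles with n=4 returns the three quartiles Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(temps, n=4)

print(f"mean = {mean:.2f} °C, sd = {sample_sd:.2f} °C, IQR = {q3 - q1:.2f} °C")
```

Note that `statistics.stdev` uses the divisor n − 1, matching the sample standard deviation expected in exam answers.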
Work through past papers and specimen papers that reference the LDS. This will help you understand the types of questions that appear and the level of contextual knowledge expected.
Create a one-page summary for each weather station, covering its location, the variables recorded there, typical values and ranges for each variable, and any notable features such as gaps or missing data.
Since the AQA large data set is based on weather data, it helps to have a basic understanding of the meteorological context — in particular, the typical ranges of each variable, such as realistic daily temperatures, rainfall totals, and wind speeds for the stations covered.
Understanding these ranges helps you spot unreasonable values in exam questions and provides the background for sensible interpretation.
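One practical use of these typical ranges is an automated sanity check on the data. A sketch in which the plausibility limits are illustrative assumptions, not AQA-published figures:

```python
# Sanity-check sketch: flag values outside plausible UK ranges.
# The ranges below are illustrative assumptions, not AQA-published limits.
PLAUSIBLE = {
    "daily mean temperature (°C)": (-15.0, 35.0),
    "daily total rainfall (mm)": (0.0, 150.0),
}

def flag_implausible(variable, values):
    """Return the values that fall outside the assumed plausible range."""
    lo, hi = PLAUSIBLE[variable]
    return [v for v in values if not lo <= v <= hi]

# A reading of 350.0 among single-digit values is more likely an
# instrument fault or a transcription error than a real datum:
suspect = flag_implausible("daily mean temperature (°C)", [4.2, 5.1, 350.0, 3.8])
print(suspect)  # → [350.0]
```

The same pattern generalises to any variable once you have noted its realistic range during familiarisation.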
| Pitfall | Advice |
|---|---|
| Not studying the LDS at all | Familiarisation is essential — do not leave it to chance |
| Trying to memorise every value | Focus on typical ranges and patterns, not specific numbers |
| Ignoring missing data codes | Learn what n/a, tr, and blank cells mean |
| Failing to interpret in context | Always relate your statistical findings back to the real-world setting |
| Not practising with past papers | Exam-style questions are the best way to prepare |
Exam Tip: In the exam, if a question refers to the large data set, make sure your answer includes specific contextual detail. For example, do not just say "the data shows a positive correlation" — say "the data for Heathrow shows a positive correlation between daily mean temperature and daily total sunshine hours, which is expected because warmer days in the UK tend to have clearer skies and more sunshine."
AQA 7357 specification, Paper 3 — Statistics, Section O (Statistical sampling and the Large Data Set) states that students are required to become familiar with one or more specific large data sets in advance of the final assessment. Students will be expected to demonstrate familiarity with the context, the variables and the units of measurement, and to use this familiarity to interpret real data, identify outliers and anomalies, and apply statistical techniques in context (refer to the official specification document for exact wording). The Large Data Set (LDS) is pre-released by AQA — typically a multi-thousand-row spreadsheet covering weather, transport or socio-economic indicators across several locations and time periods. Although the LDS is named in Section O, it underpins every Statistics topic on Paper 3: sampling (Section N), data presentation and interpretation (Section P), probability (Section Q), the Normal and Binomial distributions (Sections R and S) and hypothesis testing (Section T) all expect candidates to draw on LDS familiarity. Critically, candidates are not asked to memorise specific numbers — they are asked to reason fluently about variables, units, sampling frames and time-period coverage in ways that only sustained familiarity supports.
Question (8 marks):
An extract from the Large Data Set shows daily mean temperature (T, in °C) and daily total rainfall (R, in mm) recorded at a single weather station over a 30-day period in a specific calendar month. The extract is summarised:
∑T = 312.0, ∑T² = 3402.4, ∑R = 84.0, ∑TR = 720.5, n = 30.
(a) Calculate the sample mean and sample standard deviation of the daily mean temperature. (3)
(b) State, with reasons referring explicitly to the LDS context, two limitations of using these 30 observations to describe the climate of the location. (3)
(c) The product moment correlation coefficient between T and R for these 30 days is r = −0.18. Interpret this value, including a comment on whether a linear regression of R on T would be appropriate. (2)
Solution with mark scheme:
(a) Step 1 — sample mean.
T̄ = (∑T)/n = 312.0/30 = 10.4 °C
B1 — correct mean with units.
Step 2 — sample standard deviation.
s_T² = (∑T² − nT̄²)/(n − 1) = (3402.4 − 30 × 10.4²)/29 = (3402.4 − 3244.8)/29 = 157.6/29 ≈ 5.434
M1 — correct use of the sum-of-squares formula with the divisor (n−1) for sample standard deviation.
s_T = √5.434 ≈ 2.33 °C
A1 — correct standard deviation with units, to a sensible number of significant figures.
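The part (a) arithmetic can be checked directly from the summary sums given in the question:

```python
import math

# Reproduce part (a) from the summary statistics: n = 30, ∑T, ∑T².
n, sum_T, sum_T2 = 30, 312.0, 3402.4

mean_T = sum_T / n                          # sample mean
var_T = (sum_T2 - n * mean_T**2) / (n - 1)  # sample variance, divisor n - 1
sd_T = math.sqrt(var_T)

print(f"mean = {mean_T:.1f} °C, sd = {sd_T:.2f} °C")
```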
(b) Marking principle: each limitation must be (i) a genuine statistical limitation and (ii) anchored to the LDS context. Generic answers ("the sample is small") earn no marks unless contextualised. Indicative limitations: the 30 days cover a single calendar month, so seasonal variation at this location is not captured; the data come from a single weather station, which may not represent the wider region; and consecutive days are autocorrelated rather than independent, so the sample carries less information than 30 independent days would. Any two such limitations, properly contextualised, earn full marks.
(c) Interpretation (B1): r = −0.18 indicates a weak negative linear correlation between daily mean temperature and daily total rainfall — warmer days tend (very slightly) to be drier in this sample, but the relationship is weak.
Suitability of regression (B1): because |r| is small, a linear regression line of R on T would have very little predictive power; reporting a regression equation would overstate the strength of the relationship. Any line of best fit should be accompanied by an explicit health warning on r.
Total: 8 marks (B1 M1 A1 B1 B1 B1 B1 B1).
Question (6 marks): A student claims that "in the Large Data Set, daily maximum gust speed and daily mean wind speed measure the same thing, so we may use either interchangeably."
(a) Explain, with reference to the LDS, two ways in which these variables differ. (2)
(b) The student takes a systematic sample of every 10th row from one location's data and reports the sample mean of daily maximum gust speed. State two features of the LDS that the student must check before treating this sample mean as representative of that location's typical wind conditions. (4)
Mark scheme decomposition by AO:
(a) 2 marks — one for each correctly identified difference, for example that the daily maximum gust is an extreme value recorded over the day whereas the daily mean wind speed is an average, so the two need not move together (AO2 = 1, AO3 = 1).
(b) 4 marks — two for each feature checked, for example whether the step of 10 interacts with the recording schedule (a weekday-only record introduces periodicity), and whether missing values reduce the effective sample size (AO3 = 4).
Total: 6 marks split AO2 = 1, AO3 = 5. Section O questions are AO3-dominated because LDS familiarity is, by design, a real-world reasoning skill rather than a procedural one.
The LDS is the connective tissue of Paper 3. Every Statistics topic uses it as the canonical context:
Section N — Sampling: the LDS is the textbook example of a sampling frame. Simple random sampling, stratified sampling (by location, by month), systematic sampling (every kth row) and opportunity sampling (the first 30 rows that load) all have natural LDS instantiations. AQA exploits this by asking candidates to evaluate a proposed sampling scheme against a specific LDS feature — for example, "would systematic sampling be appropriate if rainfall is recorded only on weekdays?"
Section P — Data presentation and interpretation: box plots, histograms, cumulative-frequency curves and scatter diagrams in Paper 3 are routinely drawn from LDS-style data. Outlier identification using the 1.5×IQR rule is asked in LDS context, and candidates must distinguish between statistical outliers (numerical) and contextual outliers (a 35 °C reading in a UK December that is more likely an instrument fault than a real datum).
Section Q — Probability: modelling the probability that a randomly selected day from the LDS exceeds a threshold (e.g. P(R>5 mm)) is a standard application. Empirical probabilities estimated from the LDS feed into the assumptions of the Normal and Binomial models in Sections R and S.
Sections R and S — Normal and Binomial distributions: daily mean temperature is often modelled as Normal; the number of rainy days in a fortnight is modelled as Binomial. The LDS provides the empirical frequencies that justify (or contradict) these distributional assumptions.
Section T — Hypothesis testing: the most synoptic LDS question type asks candidates to formulate hypotheses about an LDS-derived parameter (e.g. "the mean daily temperature in this month at this station is 11 °C") and conduct a one-sample test using LDS summary statistics. The reasoning required is not procedural — it is about whether the LDS data satisfy the test's assumptions of independence and Normality.
Cross-paper synoptic — Paper 1 Pure (Section H, exponentials and logarithms): when LDS data exhibit exponential growth or decay (e.g. transport-LDS bicycle counts versus distance from a city centre), Paper 3 questions can require the Pure-paper technique of linearising via ln before applying Section O regression.
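The weekday-periodicity hazard flagged under Section N can be demonstrated concretely. A sketch assuming a weekday-only recording schedule (an assumption for illustration, not a property of every LDS variable): a systematic step equal to the record's weekly period revisits the same weekday forever.

```python
import datetime

# Build a weekday-only record of dates (Mon–Fri), mimicking a variable
# recorded only on weekdays — an assumption for this illustration.
start = datetime.date(2020, 6, 1)  # a Monday
weekday_dates = []
d = start
while len(weekday_dates) < 60:
    if d.weekday() < 5:            # 0=Mon .. 4=Fri
        weekday_dates.append(d)
    d += datetime.timedelta(days=1)

# Systematic sample: every 5th row. Because the record has period 5
# (one working week), every sampled date lands on the SAME weekday.
sample = weekday_dates[::5]
print({d.strftime("%A") for d in sample})  # a single weekday, not a mix
```

Any weekday-dependent variable (commuter-period gusts, urban rainfall measurement schedules) sampled this way would be systematically biased — exactly the evaluation AQA asks for.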
LDS questions on Paper 3 split AO marks heavily toward AO3:
| AO | Typical share | Earned by |
|---|---|---|
| AO1 (knowledge / procedure) | 10–20% | Stating a definition (sampling frame, outlier criterion), computing a mean or standard deviation from LDS summaries |
| AO2 (reasoning / interpretation) | 20–30% | Interpreting a correlation coefficient in context, distinguishing a statistical outlier from an instrument fault, justifying a sampling-scheme choice |
| AO3 (problem-solving / modelling) | 50–70% | Proposing or critiquing a sampling scheme against LDS features, evaluating whether the LDS supports a given modelling assumption, formulating contextualised hypotheses |
Examiner-rewarded phrasing: "in the context of the Large Data Set …"; "given that the LDS records data only on weekdays …"; "since the LDS contains multiple locations, the sample must be stratified by station before …"; "missing values (recorded as n/a) reduce the effective sample size to …". Phrases that lose marks: generic statements ("the sample is too small", "the data may be biased") with no LDS anchor; computing sample statistics without commenting on units; treating the LDS as if it were a single time series rather than a multi-variable, multi-location, multi-time-period structure.
A specific AQA pattern to watch: questions that say "with reference to the Large Data Set" or "in the context of the data set" require the candidate to name a specific LDS feature (a variable, a location, a time period, a recording convention). A response that omits all LDS specifics, however statistically correct, typically caps at half marks.
Question: State two features of the AQA Large Data Set that a candidate should familiarise themselves with before the examination, and for each feature explain briefly why it matters.
Grade C response (~190 words):
The first feature is the variables — candidates should know what variables are recorded, for example daily mean temperature in °C and daily rainfall in mm. This matters because the units affect how answers should be reported and because units differ between variables (temperature is continuous, rainfall is bounded below by zero).
The second feature is the locations — the LDS records data at several stations, so candidates need to know which stations are included and not assume all data comes from one place. This matters because comparisons between stations require stratified sampling.
Examiner commentary: Earns 3/3. Both features are named correctly (variables with units; locations) and each is given a reason that links to a statistical technique (units affect reporting; multiple stations affect sampling). The answer would be stronger with a third specific (time period, missing values) but the question only asks for two. Grade C work — accurate, brief, sufficient.
Grade A* response (~230 words):
A first essential feature is the time-period coverage: the LDS spans a defined window (typically several months across multiple years) and on a defined recording schedule (often weekdays only, with weekends or holidays missing). This matters because any sample drawn from the LDS inherits this temporal structure — a systematic sample with step size 5 in a Monday–Friday record will revisit the same weekday repeatedly, biasing any wind- or rainfall-related estimate.
A second essential feature is the handling of missing values, which AQA marks explicitly within the LDS (typically as "n/a" or by leaving the cell blank). This matters because sample statistics computed from a row count of n = 30 may be effectively based on n_eff < 30 if some cells are missing; a defensible analysis must report the effective sample size and choose between case-wise deletion, pair-wise deletion or imputation. Conflating the nominal and effective sample sizes inflates statistical confidence and is a marker of inexperience with real data.
Examiner commentary: Full marks (3/3). Beyond procedural correctness, the candidate names a specific recording convention (weekday-only data), explains a specific bias mechanism (systematic sampling interacting with periodicity) and connects missing-value handling to inferential validity. The phrase "effective sample size" is technical vocabulary that signals AO3 fluency. This is examination craft.
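The nominal-versus-effective sample size distinction in the response above can be made concrete. A sketch with illustrative rainfall values, using `None` as a stand-in for AQA's missing-value marker:

```python
# Sketch: nominal vs effective sample size when cells are missing.
# NA stands in for AQA's missing-value marker (e.g. "n/a" or a blank cell);
# the rainfall figures are illustrative, not from the actual LDS.
NA = None
rainfall = [0.0, 2.4, NA, 5.1, 0.0, NA, 1.2, 0.0, 3.8, NA]

n_nominal = len(rainfall)
present = [r for r in rainfall if r is not None]  # case-wise deletion
n_eff = len(present)

mean_casewise = sum(present) / n_eff                      # defensible
mean_naive = sum(r or 0.0 for r in rainfall) / n_nominal  # treats NA as 0: biased low

print(n_nominal, n_eff, round(mean_casewise, 2), round(mean_naive, 2))
```

The naive mean is pulled toward zero because three missing days are silently counted as dry days — precisely the "unsafe practice" penalised in mark schemes.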
Question: A researcher takes a stratified random sample of 60 days from the LDS, with 10 days drawn from each of six months. The researcher then computes the sample mean of daily mean temperature across all 60 days and uses it to estimate the annual mean temperature for the location. (a) State two advantages of stratifying by month rather than taking a simple random sample of 60 days from the whole LDS. (b) Identify one way in which the resulting estimate may still be biased, and suggest a correction.
Grade B response (~270 words):
(a) Stratifying by month ensures that every month is represented in the sample, which avoids a simple random sample by chance picking, say, 50 days from the summer and only 10 from the winter. This gives a fairer sample.
A second advantage is that stratification reduces the variance of the estimator when within-month variation is smaller than between-month variation, which is true for temperature.
(b) The sample may still be biased because the LDS does not cover all 12 months of the year, only the months that AQA selected. So the "annual mean" estimate is really a six-month mean. The correction is to either restrict the claim to those six months or to weight the missing months using external data.
Examiner commentary: Earns 5/6. Part (a) full marks (4/4) — both advantages are correct, the second is at A* depth (variance reduction). Part (b) loses one mark: the candidate identifies a real bias (the LDS covers only some months) but the correction is loose. A precise correction would specify either how the external weighting should be combined or, more cleanly, that the estimand should be redefined as "mean for the six LDS months". Total: 5/6 — Grade B work, undermined by an imprecise final clause.
Grade A* response (~310 words):
(a) Stratifying by month guarantees proportional representation of each calendar month within the LDS, ensuring no month is over- or under-sampled by chance. This matters because daily mean temperature has a strong seasonal cycle, so an unstratified simple random sample would have higher variance: a sample dominated by July days would over-estimate the annual mean, and conversely for January.
A second advantage is variance reduction. Where between-strata variance is large compared to within-strata variance — exactly the case for monthly temperature data — the stratified estimator has strictly lower variance than the simple random estimator with the same total sample size, so confidence intervals are tighter for the same effort.
(b) The estimate is biased because the LDS does not span all twelve months of the calendar year: AQA's LDS typically covers a fixed selection of months (e.g. the same six months across multiple years), so the sample mean estimates the mean for those LDS months, not the true annual mean. The correction is to redefine the estimand explicitly as "mean daily temperature across the LDS months at this location" rather than "annual mean", which makes the inference internally valid; alternatively, augment the LDS sample with external monthly climate normals for the missing months and combine via a weighted mean weighted by month-length.
Examiner commentary: Full marks (6/6). The candidate names between-strata and within-strata variance correctly — examination vocabulary that earns AO2 marks cleanly. The correction in (b) offers two valid alternatives, distinguishing the cleaner option (redefine the estimand) from the more involved option (external weighting). The phrase "internally valid" signals that the candidate understands the difference between estimating the intended parameter and estimating some well-defined parameter. This is graduate-style statistical reasoning.
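The stratified scheme in the question — 10 days drawn at random from each of six months — can be sketched as follows. The monthly temperature populations are simulated (Normal with a seasonal mean), purely for illustration, not real LDS values:

```python
import random
import statistics

# Simulated monthly populations of 30 daily mean temperatures each.
# Means follow a rough seasonal cycle — illustrative assumptions only.
random.seed(42)
monthly_means = {"Jan": 4, "Mar": 7, "May": 12, "Jul": 17, "Sep": 14, "Nov": 6}
population = {
    m: [random.gauss(mu, 2.0) for _ in range(30)]
    for m, mu in monthly_means.items()
}

# Stratified sample: 10 days per month, then pool all 60 values.
stratified = [x for days in population.values() for x in random.sample(days, 10)]
print(f"stratified mean = {statistics.mean(stratified):.2f} °C (n = {len(stratified)})")
```

Because every month contributes exactly 10 days, no season can dominate by chance — the guaranteed-representation advantage cited in part (a).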
Question: A student writes: "I sampled the first 30 rows of the LDS spreadsheet, computed the sample mean rainfall R̄ = 2.8 mm with sample standard deviation s_R = 4.1 mm, and conducted a one-sample z-test of H₀: μ_R = 2.0 versus H₁: μ_R > 2.0 at the 5% significance level. I found z = 1.07 and concluded there is insufficient evidence to reject H₀." Critique this analysis with reference to (a) the sampling scheme, (b) the distributional assumptions, and (c) the conclusion.
Grade A* response (~390 words):
(a) Sampling scheme. Taking the first 30 rows is opportunity sampling, not random sampling. The first 30 rows of the LDS are typically the first 30 calendar dates at a single location — this introduces two biases. First, the rows are temporally clustered: 30 consecutive days are not independent observations, because rainfall exhibits day-to-day autocorrelation (a rainy day often follows a rainy day). The independence assumption underlying the test is violated. Second, the first rows correspond to a single calendar window (e.g. early in the LDS year), so the sample is seasonally restricted — the resulting mean estimates the rainfall mean for that season at that location, not for the LDS overall.
(b) Distributional assumptions. The z-test assumes (i) the sample is drawn from a population whose mean is being tested, (ii) observations are independent, and (iii) either the population is Normal or n is large enough for the Central Limit Theorem to apply. Daily rainfall is a non-negative, right-skewed variable — many days have zero rainfall, and a small number have very high totals. The Normal model fits poorly. With n = 30 the CLT is borderline for a heavily skewed variable, and the z-test using the sample standard deviation s_R in place of the unknown σ is more properly a t-test (though for n = 30 the difference is small).
(c) Conclusion. The numerical conclusion ("insufficient evidence to reject H0") is internally consistent with z=1.07<1.645, but the inferential conclusion is unsafe because (i) and (ii) above are unmet. A defensible reanalysis would: re-sample using stratified random sampling across LDS months and locations; check for skewness via a histogram or box plot; either log-transform rainfall before testing or use a non-parametric alternative such as the sign test; and report the conclusion with the estimand made explicit ("mean rainfall in the LDS months at this location").
Examiner commentary: Full marks (9/9). The candidate critiques on three dimensions exactly as the question demands, names specific statistical concepts (autocorrelation, CLT applicability for skewed variables, z-versus-t), and offers concrete remediations rather than vague gestures. The closing sentence on estimand specification ties the critique back to the AO3 framework. This is publication-grade statistical reasoning at school level.
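The student's test statistic can be reproduced from the quoted summaries, confirming that the arithmetic (if not the inference) is sound:

```python
import math

# Reproduce the student's one-sample z statistic from the summaries:
# n = 30, R̄ = 2.8 mm, s_R = 4.1 mm, H0: μ = 2.0.
n, mean_R, sd_R, mu0 = 30, 2.8, 4.1, 2.0

z = (mean_R - mu0) / (sd_R / math.sqrt(n))
print(f"z = {z:.2f}")  # matches the quoted z = 1.07, below the 1.645 critical value
```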
The errors that distinguish A from A* on LDS questions:
Computing without context. A candidate computes x̄ = 10.4 from LDS summaries and writes "the mean is 10.4". Correct value, no marks: the answer must include units (°C), the variable name (daily mean temperature) and ideally the location and time-period scope. AQA mark schemes routinely award the final A1 only for contextualised statements.
Ignoring units. Treating temperature (°C, interval scale, can be negative) and rainfall (mm, ratio scale, bounded below by zero) interchangeably is a category error. Rainfall cannot naively be modelled by a Normal distribution, because the Normal's support is the whole of ℝ and so allows negative values while rainfall cannot be negative; temperature, which genuinely can be negative, is a better candidate.
Missing time-period reasoning. Conflating "the LDS" with "the climate" or "this calendar year". The LDS covers a defined set of months and years, not all of them — any inference that extrapolates beyond the LDS time window without explicit justification is overreach.
Treating consecutive rows as independent. Statistical tests assume independence; consecutive days in a weather LDS exhibit autocorrelation. Candidates rarely flag this even when the question mentions "consecutive days", and lose AO3 marks accordingly.
Confusing the sample mean with the population mean. Saying "the LDS mean rainfall is 2.8 mm" when 2.8 is a sample statistic from a sub-set of the LDS. The full LDS has its own population mean; the sample mean is an estimate.
Ignoring missing-value conventions. AQA records missing values explicitly. A naive sum of the column treats missing as zero (depending on software), which biases means downward for rainfall and is undefined for temperature. Candidates who compute summaries without inspecting missing values lose AO3 marks for unsafe practice.
Generic "the sample is too small". This phrase appears in thousands of scripts and earns nothing on its own. To earn marks, the candidate must specify what makes the sample size insufficient (variance of the variable, effect size of interest, distributional shape) — i.e. answer the question "small relative to what?".
These patterns repeatedly cost candidates marks on Paper 3 LDS-context questions. They are about discipline of expression, not technique: candidates know the statistics but lose marks on context.
LDS-style data analysis points directly toward several undergraduate trajectories, including statistics, data science, and the quantitative environmental and social sciences.
Oxbridge interview prompt: "You are given a year of daily rainfall data from a single weather station. Frame a precise statistical question that the data could answer, state what assumptions your method requires, and explain how you would check those assumptions using the data itself."
A common A*-level challenge on Paper 3 is an open-ended LDS question — "is there evidence that …?" — which requires the candidate to frame a precise question, choose a method, and execute it. The technique is the same three-step pipeline that underpins all applied statistics: frame, check, infer.
Worked example: Using LDS-style daily mean temperature data for a single station across two consecutive months (June: n_J = 30, T̄_J = 14.6, s_J = 2.1; July: n_Jl = 31, T̄_Jl = 17.9, s_Jl = 2.4), test whether mean daily temperature differs between the two months at this station.
Step 1 — frame. The estimand is the difference μ_Jl − μ_J in true mean daily temperature between July and June at this station, conditional on the LDS years. Hypotheses: H₀: μ_Jl − μ_J = 0 versus H₁: μ_Jl − μ_J ≠ 0.
Step 2 — check. Independence within months is borderline (consecutive days are autocorrelated), but for a school-level test we treat days as approximately independent and acknowledge this in the write-up. Normality of daily temperature is reasonable for monthly windows. Variances differ slightly (s_J = 2.1, s_Jl = 2.4) — Welch's two-sample t-test is appropriate.
Step 3 — infer. The Welch test statistic is:
t = (T̄_Jl − T̄_J)/√(s_Jl²/n_Jl + s_J²/n_J) = (17.9 − 14.6)/√(2.4²/31 + 2.1²/30) = 3.3/√(0.1858 + 0.1470) = 3.3/0.577 ≈ 5.72
With approximately 58 degrees of freedom (Welch–Satterthwaite), the two-sided p-value is far below 0.001. Conclude: strong evidence that mean daily temperature is higher in July than in June at this station, in this LDS window.
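The Welch statistic and the Welch–Satterthwaite degrees of freedom can be verified numerically from the summaries:

```python
import math

# Reproduce the Welch test statistic from the worked example's summaries.
n_J, mean_J, s_J = 30, 14.6, 2.1     # June
n_Jl, mean_Jl, s_Jl = 31, 17.9, 2.4  # July

v_J, v_Jl = s_J**2 / n_J, s_Jl**2 / n_Jl
t = (mean_Jl - mean_J) / math.sqrt(v_J + v_Jl)

# Welch–Satterthwaite approximate degrees of freedom
df = (v_J + v_Jl) ** 2 / (v_J**2 / (n_J - 1) + v_Jl**2 / (n_Jl - 1))

print(f"t ≈ {t:.2f}, df ≈ {df:.0f}")  # ≈ 5.72 on ≈ 58 degrees of freedom
```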
Why A* candidates spot this immediately: the structure "compare two month-means" is the signature of a two-sample t-test. Every time you see this pattern, write the estimand explicitly first, check assumptions second, compute third. The same pipeline answers "does rainfall differ between two stations?", "is the proportion of rainy days different in two years?" (two-proportion z-test) and "is the variance of wind speed higher at coastal versus inland stations?" (F-test). Recognising the pattern across these contexts is exactly the synoptic skill AQA rewards.
A subtlety: when reporting the conclusion, bind it to the estimand. The conclusion "July is hotter than June" is too strong: the data support "mean daily temperature at this station, in the LDS-recorded years, is higher in July than in June". The narrower claim is what the data licence; the broader claim is an over-reach.
This content is aligned with the AQA A-Level Mathematics (7357) specification, Paper 3 — Statistics, Large Data Set context. For the most accurate and up-to-date information, please refer to the official AQA specification document.
graph TD
A["Large Data Set<br/>(pre-released by AQA)"] --> B{"What feature<br/>matters?"}
B -->|"Variables"| C["Names, units,<br/>scales (continuous,<br/>discrete, bounded)"]
B -->|"Observations"| D["Locations, dates,<br/>recording schedule<br/>(weekdays only?)"]
B -->|"Time period"| E["Calendar coverage,<br/>year span,<br/>seasonal scope"]
B -->|"Missing values"| F["n/a entries,<br/>effective sample size"]
C --> G["Sampling scheme:<br/>SRS, stratified,<br/>systematic"]
D --> G
E --> G
F --> G
G --> H{"Statistical<br/>technique"}
H -->|"Summary"| I["Mean, sd,<br/>quartiles, IQR"]
H -->|"Association"| J["Correlation,<br/>regression"]
H -->|"Inference"| K["Hypothesis test,<br/>confidence interval"]
I --> L["Report in context:<br/>variable, units,<br/>location, period"]
J --> L
K --> L
style G fill:#27ae60,color:#fff
style L fill:#3498db,color:#fff