AQA A-Level Maths: Large Data Set — Complete Revision Guide (7357)
The Large Data Set is one of the most distinctive and most often underprepared parts of AQA A-Level Maths (7357). AQA pre-releases the data set well before the exam, and Paper 3 — the paper whose applied section covers statistics — will reference it directly. Questions do not just test whether you can do statistics; they test whether you understand the specific context, the variables, the units, the time periods and the limitations of this particular data. Walking into the exam without having spent time with the data is a self-inflicted wound.
This guide is a topic-by-topic walkthrough of the LDS content for the 7357 specification. It covers everything AQA can examine: how the LDS fits into the course, how to explore it, how to clean and prepare data, summary statistics and presentations in context, correlation and regression modelling, probability models, hypothesis testing in context, modelling assumptions, and how the LDS appears in actual exam questions. For each topic you will see the core skills, common pitfalls, a short illustration, and a link to the full lesson.
The aim is not to replace working through the data set itself — the only way to feel comfortable with the LDS is to open it and ask questions about what is in it. The aim is to give you a clear map of what AQA expects, in order, so your revision is targeted.
What the AQA 7357 Specification Covers
The AQA 7357 qualification is assessed through three two-hour papers, each worth 100 marks. Paper 1 covers pure content only; Paper 2 covers pure content and mechanics; Paper 3 covers pure content and statistics. The LDS is referenced explicitly on Paper 3, where statistical questions are routinely set in the context of the LDS variables. There is no choice of questions and no coursework, so every mark must be earned in the exam.
The LDS is not a separate content section — it is a context that runs across the statistics topics. AQA can ask you to interpret summary statistics, comment on model suitability, identify outliers, recognise variables and units from memory, or apply hypothesis testing to a sample drawn from the LDS. Familiarity with the data is a multiplier across the whole paper.
The table below shows the sub-topics in this guide, where they sit in the specification, and a realistic estimate of marks from each.
| Topic | Spec Area | Typical Paper 3 mark weighting |
|---|---|---|
| Introduction to the Large Data Set | Statistics A | 2-4 marks |
| Exploring the data | Statistics A | 3-5 marks |
| Data cleaning and preparation | Statistics A | 2-4 marks |
| Summary statistics in context | Statistics B | 4-6 marks |
| Data presentation in context | Statistics B | 4-6 marks |
| Correlation and regression modelling | Statistics C | 4-6 marks |
| Probability models | Statistics D | 3-5 marks |
| Hypothesis testing in context | Statistics E | 6-10 marks |
| Modelling assumptions | Cross-cutting | 3-5 marks |
| Exam questions on the Large Data Set | All statistics | 10-15 marks |
These weights are estimates based on the spread of typical 7357 papers — not guarantees for any single year. What is reliable is that the LDS is consistently a material chunk of Paper 3, and that the same skills underpin every statistics question. Mastering this section is high-leverage revision.
Introduction to the Large Data Set
The Large Data Set is a real-world data set published by AQA before the exam series. Its point is to anchor statistics teaching in genuine, messy data rather than tidy textbook examples. Students are expected to have spent time with the data — opening it in a spreadsheet, sorting columns, plotting graphs, computing summaries — long before the exam. Paper 3 questions assume that familiarity.
The core skills are: knowing what the LDS contains at a high level (the population, the time period, the geographic scope), recognising the names and units of variables, and understanding how the data was collected. You do not need to memorise every value, but you should be able to answer "what units is this variable measured in?" without looking it up.
A common pitfall is treating the LDS as a single uniform block. In practice, AQA's data sets typically contain several sub-tables or sheets — different locations, time periods, or categories — and exam questions can probe whether you have noticed the structure. Another pitfall is confusing variable units (for example, tenths of a degree versus whole degrees), which throws summary statistics off by an order of magnitude.
For a guided tour of the LDS structure and a checklist of what to memorise, see the Introduction to the Large Data Set lesson.
Exploring the Data
Once you know what the LDS contains, the next skill is exploring it. Exploration is what statisticians do before they fit any model: sorting, filtering, plotting, summarising, and asking informal questions. AQA expects you to have done this for yourself, because exam questions often ask you to compare two parts of the data or comment on a feature you would only have noticed by looking.
The core skills are: sorting a column to find extremes, filtering to a subset that shares a property, plotting a quick scatter or histogram, and computing a five-number summary. These are habits of mind rather than formal techniques assessed at this stage — the point is to build intuition for what is typical and what is unusual.
A productive session: sort the daily rainfall column in descending order — what is the largest value, and which location and date? Filter to one location across both years — does the pattern look similar? Plot mean daily maximum temperature against daily total rainfall — is there a visible relationship? Each takes minutes but plants memories that pay off in the exam.
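If you prefer to script the session rather than click through a spreadsheet, a few lines of Python with pandas cover all of it. This is a sketch only: the file name and column headers below are placeholders, not the real LDS headers, so substitute whatever your copy of the data uses.

```python
import pandas as pd

# Load the LDS export; "lds.csv" and the column names are placeholders
lds = pd.read_csv("lds.csv")

# Largest daily rainfall values, with their location and date
top_rain = lds.sort_values("Rainfall_mm", ascending=False)
print(top_rain[["Location", "Date", "Rainfall_mm"]].head())

# Filter to one location and get a five-number summary (plus mean and sd)
one_site = lds[lds["Location"] == "Location A"]
print(one_site["MaxTemp_C"].describe())

# Quick scatter of maximum temperature against rainfall
one_site.plot.scatter(x="MaxTemp_C", y="Rainfall_mm")
```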
A common pitfall is exploring once, three months before the exam, and never returning — the memories fade. Better to do shorter sessions repeatedly. Another is exploring without a question; always have a short list in mind.
For a structured exploration workflow, see the Exploring the Data lesson.
Data Cleaning and Preparation
Real data is messy. The LDS contains missing values and occasional anomalies, and the decisions about how to handle them may need to be defended in writing. Data cleaning at A-Level is not about reformatting spreadsheets; it is about reasoning carefully when something looks wrong.
The core skills are: recognising missing-data codes, deciding whether to exclude or impute, identifying potential outliers using standard rules (values more than 1.5 times the IQR from the nearest quartile, or more than two standard deviations from the mean), and knowing when an outlier is a data error versus a genuine extreme observation.
A short example. Suppose a daily rainfall column contains an entry of "tr" rather than a number. This is a meteorological code for a trace amount — rain fell, but too little to measure precisely. A candidate who knows this writes a sentence explaining how they handled it (treating it as zero, or excluding it from numerical summaries). A candidate who does not know it either includes a non-number in the calculation or quietly drops the row without justification, losing communication marks.
A common pitfall is treating outlier rules as automatic deletion rules. They identify potential outliers; they do not tell you to remove them. A genuine extreme observation should usually be retained, with comment. Another pitfall is failing to mention cleaning decisions at all — examiners want to see that you have thought about the data.
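The same decisions can be scripted. The sketch below assumes the trace code appears literally as "tr" and uses invented column names; it treats traces as zero, coerces the column to numbers, and flags (rather than deletes) potential outliers using the 1.5 × IQR rule.

```python
import pandas as pd

lds = pd.read_csv("lds.csv")  # file and column names are placeholders

# Treat trace amounts ("tr") as zero, then coerce anything else
# non-numeric to NaN so it drops out of numerical summaries
rain = pd.to_numeric(lds["Rainfall_mm"].replace("tr", 0), errors="coerce")

# 1.5 x IQR rule: flag potential outliers beyond the quartile fences
q1, q3 = rain.quantile(0.25), rain.quantile(0.75)
iqr = q3 - q1
flagged = rain[(rain < q1 - 1.5 * iqr) | (rain > q3 + 1.5 * iqr)]

# Flag, don't delete: genuine extremes stay in, with a written comment
print(f"{flagged.count()} potential outliers out of {rain.count()} values")
```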
For practice on identifying outliers and writing defensible cleaning notes, see the Data Cleaning and Preparation lesson.
Summary Statistics in Context
Summary statistics — mean, median, mode, range, IQR, variance, standard deviation — are the bread and butter of statistics. The LDS twist is that you are computing them on real, named variables, and the interpretation must be in context.
The core skills are: choosing the appropriate measure for the data type and the question, computing it correctly (often from a frequency table or grouped data), and interpreting it in a sentence that names the variable and its units. "The mean is $18.4$" is not a good answer; "the mean daily maximum temperature for this location in July was $18.4\,^{\circ}\text{C}$" is.
The choice between mean and median often turns on outliers and distribution shape. The median is robust to extreme values; the mean is not. For a heavily skewed variable like daily rainfall — most days near zero, occasional very wet days — the median is more representative of a typical day, while the mean can be pulled up by a few extremes. Being able to explain your choice is examined.
A short illustration with illustrative numbers. Suppose for an LDS rainfall variable you compute a mean of $2.1$ mm and a median of $0.4$ mm. The gap is consistent with right-skew. A good answer names the skew, chooses the median as the more representative typical value, and notes that the mean is still appropriate for total rainfall comparisons.
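To check the arithmetic yourself, compute both measures on a small sample. The values below are invented to reproduce the illustration above.

```python
import pandas as pd

# Ten invented daily rainfall values (mm), chosen to give mean 2.1, median 0.4
rain = pd.Series([0.0, 0.0, 0.2, 0.3, 0.4, 0.4, 1.0, 2.2, 6.8, 9.7])

print(f"mean = {rain.mean():.1f} mm, median = {rain.median():.1f} mm")
# Mean well above median: consistent with right-skew. Quote the median
# as the typical day; keep the mean for comparing totals.
```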
A common pitfall is computing every statistic in sight without saying which is most useful. Another is reporting to inappropriate precision — match the precision of the answer to the source.
For practice on summary-statistic calculations with model interpretive sentences, see the Summary Statistics in Context lesson.
Data Presentation in Context
Data presentation covers histograms, box plots, cumulative-frequency curves, scatter diagrams, and time-series plots. AQA expects you to draw, read, compare and critique them in the LDS context.
The core skills are: choosing the right diagram for the data type, drawing it with correctly scaled axes and clear labels, and comparing two diagrams in writing. Comparison answers should always mention at least one measure of location (median, mean) and one of spread (range, IQR, standard deviation), each in context.
Box plots are useful for side-by-side comparisons of two locations or time periods. Histograms suit continuous variables when the question is about distribution shape. Scatter diagrams prepare the ground for correlation and regression.
A short example. Suppose two box plots compare daily rainfall at a coastal and an inland location, with the coastal box higher and wider. A model sentence: "The coastal location had a higher median daily rainfall ($1.2$ mm versus $0.5$ mm) and a larger interquartile range ($3.4$ mm versus $1.6$ mm), suggesting both more rain on a typical day and greater day-to-day variability than inland." It names the variable, gives both numbers, and interprets in context.
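Drawing the comparison yourself is good practice. Here is a minimal matplotlib sketch with invented rainfall values for the two locations:

```python
import matplotlib.pyplot as plt

# Invented daily rainfall samples (mm) for two locations
coastal = [0.0, 0.2, 0.8, 1.2, 2.4, 3.6, 5.0, 7.8]
inland = [0.0, 0.0, 0.3, 0.5, 0.9, 1.4, 2.1, 3.0]

fig, ax = plt.subplots()
ax.boxplot([coastal, inland])
ax.set_xticklabels(["Coastal", "Inland"])
ax.set_ylabel("Daily rainfall (mm)")
plt.show()
```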
A common pitfall is drawing a histogram with unequal class widths and forgetting frequency density. Another is comparing without numbers — "it looks higher" does not score. A third is mixing up axes on a scatter diagram, which propagates into the wrong regression line.
For sketching practice and a comparison sentence template, see the Data Presentation in Context lesson.
Correlation and Regression Modelling
Correlation measures the strength and direction of a linear relationship between two variables. Regression is the line of best fit that quantifies that relationship. Both appear regularly in LDS questions, often with two LDS variables plotted against each other and a model fitted.
The core skills are: computing or interpreting the product moment correlation coefficient $r$, recognising what values of $r$ mean in context (with the convention that $|r| \approx 1$ is strong, $|r| \approx 0$ is weak, and the sign indicates direction), fitting a least-squares regression line $y = a + bx$, and using it to predict — but only within the range of the data.
The interpretation is where most marks are won and lost. A correlation of $r = 0.8$ between two LDS variables does not, on its own, mean one causes the other. It means they tend to move together in this sample. The standard caution is that correlation does not imply causation; another is that an apparent correlation can be driven by a lurking variable affecting both. Strong answers state these limits explicitly when interpreting in the LDS context.
Extrapolation — using the regression line outside the range of the data — is risky and often examined. If the data covers temperatures from $10\,^{\circ}\text{C}$ to $25\,^{\circ}\text{C}$, a prediction at $5\,^{\circ}\text{C}$ is an extrapolation and should be flagged as unreliable in the answer. The relationship may not be linear outside the observed range, or the underlying conditions may change.
A short illustration. Suppose the regression line for daily maximum temperature against hours of sunshine at one LDS location comes out as $y = 12.3 + 0.42x$, where $x$ is hours of sunshine and $y$ is temperature in $^{\circ}\text{C}$. The slope says each additional hour of sunshine corresponds to an increase of about $0.42\,^{\circ}\text{C}$ in maximum temperature. A prediction at $x = 8$ hours sits inside the data range; a prediction at $x = 16$ hours is an extrapolation and should be treated cautiously.
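A sketch of the same kind of fit in Python, using scipy. The sunshine and temperature values are invented so the example runs; the useful habit it shows is an explicit guard against extrapolating outside the observed range.

```python
from scipy import stats

# Invented sunshine (hours) and max temperature (deg C) readings
sunshine = [2.0, 4.0, 5.5, 7.0, 8.0, 9.5, 11.0, 12.5]
max_temp = [13.0, 14.2, 14.5, 15.3, 15.5, 16.2, 17.1, 17.5]

fit = stats.linregress(sunshine, max_temp)
print(f"y = {fit.intercept:.2f} + {fit.slope:.2f}x, r = {fit.rvalue:.2f}")

def predict(x):
    """Predict max temperature, warning on extrapolation."""
    if not min(sunshine) <= x <= max(sunshine):
        print(f"warning: x = {x} is outside the data range; unreliable")
    return fit.intercept + fit.slope * x

predict(8)   # inside the observed range: interpolation
predict(16)  # outside it: extrapolation, flagged
```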
A common pitfall is reporting $r$ without context, or stating a relationship is causal when only correlation has been shown.
For practice on $r$, regression equations, and contextual interpretation, see the Correlation and Regression Modelling lesson.
Probability Models
A probability model is a mathematical description of how a random variable behaves. At A-Level the key models are the binomial distribution $X \sim B(n, p)$ for counts of successes in fixed trials, and the normal distribution $X \sim N(\mu, \sigma^2)$ for continuous variables that cluster symmetrically around a mean.
The core skills are: recognising when a model is appropriate, computing probabilities from it (using a calculator with built-in distribution functions, not tables), and critiquing the fit of the model to data. The LDS context tests the third skill especially. AQA can ask you whether the binomial is reasonable for a count derived from the LDS, or whether the normal is reasonable for a continuous LDS variable, and to justify your answer.
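If you want to check your calculator answers, scipy's distribution functions do the same job. All parameters below are illustrative rather than values from the LDS.

```python
from scipy.stats import binom, norm

# Binomial: X ~ B(30, 0.3), e.g. a count of rainy days in a 30-day month
print(binom.pmf(10, 30, 0.3))   # P(X = 10)
print(binom.cdf(10, 30, 0.3))   # P(X <= 10)
print(binom.sf(13, 30, 0.3))    # P(X > 13) = P(X >= 14)

# Normal: T ~ N(18, 3^2), e.g. daily maximum temperature in deg C
print(norm.cdf(20, loc=18, scale=3))  # P(T <= 20)
print(norm.sf(22, loc=18, scale=3))   # P(T > 22)
```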
For the binomial to be appropriate, four conditions must hold: a fixed number of trials, two outcomes per trial, constant probability of success, and independence between trials. In the LDS, daily weather observations often violate independence — a wet day is more likely to be followed by another wet day than by a dry one. Recognising this is exactly the kind of contextual point AQA rewards.
For the normal to be appropriate, the data should cluster symmetrically around a single peak with light tails. Many LDS variables — like daily maximum temperature in a single month at a single location — are roughly symmetric and approximately normal. Others — like daily rainfall — are heavily right-skewed and not normal at all. Sketching a histogram and commenting on the shape is the standard approach.
A short example. Asked whether a binomial is appropriate for the number of rainy days in a 30-day month at one LDS location, a good answer notes: $n = 30$ is fixed and the trial is "rainy or not", but successive days are unlikely to be independent (weather is autocorrelated) and the probability of rain may not be constant across the month. The binomial is at best an approximation, and any probability computed should be treated cautiously.
A common pitfall is checking only one or two binomial conditions and declaring the model "appropriate" without scrutiny. Another is using the normal for visibly skewed data without comment.
For probability calculations and model-fit critique practice, see the Probability Models lesson.
Hypothesis Testing in Context
Hypothesis testing is the formal procedure for using a sample to make a probabilistic claim about a population. AQA examines two main flavours at A-Level: tests for the binomial proportion $p$, and tests for the product moment correlation coefficient $\rho$. Both can appear in LDS-flavoured questions.
The core skills are: stating $H_0$ and $H_1$ correctly (using parameter names, not sample statistics), choosing a one-tailed or two-tailed test based on the wording, computing or looking up the relevant probability or critical value at the chosen significance level, and concluding in context. The conclusion is where many candidates lose marks. A correct conclusion has two parts: a statistical statement ("reject $H_0$ at the 5% level" or "do not reject $H_0$") and a contextual statement that translates back into LDS terms ("there is sufficient evidence to suggest the proportion of rainy days has increased").
The standard structure for a binomial test. Suppose a previous summer at one LDS location had a probability of rain of $p_0$, and a sample of $n$ days from this summer is observed with $x$ rainy days. To test whether $p$ has changed, set $H_0: p = p_0$ and $H_1: p \neq p_0$ (two-tailed). Under $H_0$, $X \sim B(n, p_0)$. Compute $P(X \geq x)$ or $P(X \leq x)$ as appropriate, double it for the two-tailed test, and compare to the significance level. Conclude.
A short illustration with illustrative numbers. Suppose $p_0 = 0.3$, $n = 30$, observed $x = 14$. Under $H_0$, $X \sim B(30, 0.3)$ with mean $9$. The calculator gives $P(X \geq 14) \approx 0.04$. For a two-tailed test the $p$-value is therefore about $0.08$ — inside the 10% significance level, but only just. The conclusion should reflect that borderline status, not collapse it to a confident yes/no.
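The same test can be verified in a few lines of scipy, using the numbers from the illustration. Note the off-by-one: for a discrete distribution, $P(X \geq 14)$ is the survival function evaluated at $13$.

```python
from scipy.stats import binom

p0, n, x = 0.3, 30, 14  # null proportion, sample size, observed count

tail = binom.sf(x - 1, n, p0)  # P(X >= 14) under H0: roughly 0.040
p_value = 2 * tail             # two-tailed, so double: roughly 0.080

print(f"P(X >= {x}) = {tail:.4f}, two-tailed p-value = {p_value:.4f}")
if p_value < 0.10:
    print("reject H0 at the 10% level -- but note how close to the boundary")
else:
    print("do not reject H0 at the 10% level")
```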
Common pitfalls: writing $H_1$ using the sample value (e.g. $H_1: x > 14$) instead of the population parameter ($H_1: p > 0.3$); forgetting to double the tail probability for two-tailed tests; concluding without context — "reject $H_0$" alone, with no mention of what it means for the LDS variable.
For full test workflows on LDS-flavoured scenarios, see the Hypothesis Testing in Context lesson.
Modelling Assumptions
Every statistical model rests on assumptions about the data. AQA repeatedly examines whether you can identify these assumptions, judge whether they hold for the LDS variable in question, and explain the consequences if they do not.
The core skills are: listing the assumptions of common models (binomial: independence, fixed $n$, constant $p$, two outcomes; normal: symmetry, single peak, light tails; linear regression: linear relationship, constant variance of residuals, independent residuals), checking each against the data, and writing a short sentence on whether the model is reasonable.
The LDS makes assumption-checking concrete. Independence of daily weather observations is rarely exact. Constant probability of rain across a month is rarely exact. Linear relationships hold over some ranges and break down outside them. AQA does not expect you to reject every model — the expectation is that you state the assumptions, comment briefly on plausibility, and proceed with appropriate caution.
A short example. Fitting a linear regression of daily maximum temperature on hours of sunshine, the assumptions are: linearity (check the scatter), constant variance of residuals (does the spread look the same across the range?), and independent residuals (autocorrelated weather is a known issue). A strong answer flags the autocorrelation while still using the model.
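A residual plot makes these checks visual. This sketch reuses the invented sunshine and temperature values from the regression section and plots residuals against the explanatory variable:

```python
import matplotlib.pyplot as plt
from scipy import stats

# Same invented sunshine/temperature data as the regression sketch
sunshine = [2.0, 4.0, 5.5, 7.0, 8.0, 9.5, 11.0, 12.5]
max_temp = [13.0, 14.2, 14.5, 15.3, 15.5, 16.2, 17.1, 17.5]

fit = stats.linregress(sunshine, max_temp)
residuals = [y - (fit.intercept + fit.slope * x)
             for x, y in zip(sunshine, max_temp)]

# Fanning spread suggests non-constant variance; long runs of same-sign
# residuals suggest autocorrelation -- both undermine the assumptions
plt.scatter(sunshine, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Hours of sunshine")
plt.ylabel("Residual (deg C)")
plt.show()
```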
A common pitfall is writing a generic assumption list without connecting it to the LDS variable. Another is rejecting the model entirely because one assumption is imperfect — usually the right move is to use it cautiously and note the limitation.
For practice across binomial, normal, and regression models, see the Modelling Assumptions lesson.
Exam Questions on the Large Data Set
The final topic ties everything together: how the LDS actually appears in Paper 3 questions, and how to approach those questions efficiently in the exam.
The core skills are: recognising when a question is drawing on the LDS (often signalled by a familiar variable name, a familiar location, or a phrase like "from the Large Data Set"), choosing the right combination of techniques from the previous topics, and structuring your answer with clear context. Many LDS questions are multi-part, walking from a summary statistic in part (a), through a model-fit comment in part (b), to a hypothesis test in part (c). The marks reward consistent context throughout.
A typical structure for a high-mark LDS question. Part (a): compute or interpret a summary statistic for a named variable. Part (b): critique the appropriateness of a probability model for that variable. Part (c): perform a hypothesis test on a sample drawn from the variable, with a contextual conclusion. The question rewards candidates who carry the same variable name and units through every part rather than computing in isolation.
A common pitfall is treating each part as a separate exercise and forgetting the variable. A clean answer to part (c) names the variable, the location, and the time period at the conclusion stage even if the calculation does not strictly require it. Another pitfall is going beyond what the data supports — claiming causation when only correlation has been shown, or generalising from one location to all of the UK.
Time discipline matters. LDS questions often look long because they are wordy, but the calculations are usually compact. Read every part before starting any of them, so you can plan a consistent thread of context. Write conclusions in full sentences that an examiner can mark quickly.
For full LDS-style question sets with mark-scheme-style solutions, see the Exam Questions on the Large Data Set lesson.
Common Mark-Loss Patterns Across the Large Data Set
Across the whole LDS topic area, a small set of habits accounts for a disproportionate share of lost marks. None of these are about content you do not know. They are all about content you do know, applied carelessly.
- Stating numerical answers without units or context. "The mean is 18.4" is incomplete; the variable, units, location and time period anchor the answer.
- Computing every summary statistic without choosing one. Examiners reward judgement: which statistic is most useful here, and why?
- Treating outlier rules as automatic deletion rules. They identify; they do not delete. Genuine extremes belong in the data.
- Reporting numbers to inappropriate precision. Match the precision of the answer to the precision of the source.
- Comparing two diagrams without numbers. "Looks higher" does not score; cite a measure of location and a measure of spread.
- Stating that correlation implies causation. Always include the standard caveat when interpreting $r$ in context.
- Extrapolating without comment. A prediction outside the data range needs an explicit warning.
- Using sample statistics inside hypothesis statements. $H_1$ is about the population parameter, not the observed sample value.
- Forgetting to double tail probabilities for two-tailed tests.
- Concluding tests without context. "Reject $H_0$" is half an answer; the other half names the variable and what the conclusion means.
- Skipping assumption checks for probability models, then being surprised when a follow-up part asks about them.
Many candidates lose marks here every series. A revision plan that explicitly drills these habits — not just the content — will move your grade more than another pass through the textbook.
Recommended Pre-Exam Familiarisation Plan
This plan is designed for a candidate who has covered the statistics content in lessons but wants to get genuinely familiar with the AQA LDS before the exam. It assumes about 4-5 hours per week on this section, with the spreadsheet open. Pre-exam familiarisation is the single highest-leverage move you can make for LDS questions — there is no substitute for time spent with the actual data.
| Week | Focus | Practice |
|---|---|---|
| 1 | Introduction; exploring the data; cleaning and preparation | Open the LDS, list every variable with its units; sort and filter each column; identify any anomalous values |
| 2 | Summary statistics in context; data presentation in context | Compute means, medians, IQRs and standard deviations for at least three LDS variables; draw and compare box plots for two locations |
| 3 | Correlation and regression modelling; probability models | Plot two scatter diagrams of LDS variable pairs; compute $r$ for each; comment on whether binomial or normal is appropriate for two named variables |
| 4 | Hypothesis testing in context; modelling assumptions | Two full hypothesis tests on LDS-flavoured scenarios; one written assumption critique per model type |
| 5 | Exam questions on the LDS; targeted review | One full multi-part LDS question per study session; review mark-scheme-style solutions for any answer scoring below 60% |
| 6 | Mixed practice; final pre-exam familiarisation | A short LDS familiarisation session every two days, focused on memory of variables, units, structure |
The point of the plan is to keep the data in front of your eyes across the revision period rather than visiting it once and forgetting it. By the end of week 5, every topic in this guide should have had focused contact and a practice round in the LDS context. Week 6 is consolidation and short, frequent re-exposure.
A useful discipline is to treat any LDS question you got wrong not as a mistake but as a diagnostic. Was it a content gap? A method error? A context omission? A units slip? Logging the cause means your next review session targets the right thing.
How LearningBro's AQA A-Level Maths Large Data Set Course Helps
LearningBro's AQA A-Level Maths: Large Data Set course is built around the structure of this guide. Each of the ten lessons covers one part of the LDS topic area, in the order AQA teaches it, with worked examples, practice questions and full mark-scheme-style solutions. Lessons end with a short review and quick-recall questions designed for spaced revisits, and they are paired with prompts that send you into the actual LDS spreadsheet to verify what you have just learned.
The course is designed to be used in two ways. As a first pass, you can work through the lessons in order, building each topic on the last. As a revision tool, you can drop into any lesson and work the practice independently — for example, drilling hypothesis tests for a week before mocks. The AI tutor is available throughout to give targeted hints when you get stuck, without giving away full solutions, and to mark your written working with structured feedback that flags missing context, missing units, or missing caveats — exactly the things AQA examiners reward.
If you want one place to revise this section of the spec well, with realistic practice and clean explanations of every topic, the full course is the right next step. Start with the AQA A-Level Maths: Large Data Set course.