This lesson prepares you for the exam questions that reference the AQA large data set. You will learn about typical question styles, how to use your pre-release familiarity effectively, how to interpret unfamiliar contexts, and how to avoid common pitfalls.
Questions on the large data set appear in Paper 3: Statistics and Mechanics, Section A (Statistics). This section is worth 50 marks out of the total 100 for Paper 3, and the large data set typically features in several questions within this section.
| Section | Content | Marks |
|---|---|---|
| Section A | Statistics | 50 |
| Section B | Mechanics | 50 |
| Total | | 100 |
The LDS questions are integrated into the statistics section alongside standard statistics topics. They are not separated into a distinct "LDS section" — instead, they are woven into questions on sampling, data presentation, summary statistics, correlation, regression, and hypothesis testing.
What to expect: You may be asked to describe how to take a particular type of sample from the large data set.
Example: "Describe how you could use the large data set to take a systematic sample of 10 days' data from the month of July at Heathrow."
How to answer:
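One way to lay out the systematic-sampling procedure is sketched below. It assumes the usual textbook recipe (interval = population size divided by sample size, rounded down, with a random start inside the first interval); the function name `systematic_sample` is illustrative and not part of the lesson.

```python
import random

def systematic_sample(population_size, sample_size, rng=None):
    """Return 1-based positions chosen by systematic sampling."""
    rng = rng or random.Random()
    # Sampling interval k: population size divided by sample size, rounded down.
    k = population_size // sample_size
    # Random starting point among the first k items.
    start = rng.randint(1, k)
    # Take every k-th item from the start until the sample is full.
    return [start + i * k for i in range(sample_size)]

# July has 31 days and the question asks for 10 of them, so k = 31 // 10 = 3.
july_days = systematic_sample(31, 10, rng=random.Random(0))
print(july_days)
```

In a written answer the same steps would be stated in words: number the July days 1 to 31, pick a random start from 1 to 3, then take every third day after it.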
What to expect: You may be given a diagram (box plot, histogram, scatter diagram) constructed from LDS data and asked to interpret it.
Example: "The box plot below shows the daily mean wind speed at Camborne in January. Describe the distribution and comment on any outliers."
How to answer:
What to expect: You may be given summary statistics calculated from the LDS and asked to compare two stations or two time periods.
Example: "The mean daily total sunshine at Leuchars in June is 6.2 hours with a standard deviation of 3.1 hours. The mean at Hurn is 7.8 hours with a standard deviation of 2.5 hours. Compare the sunshine at the two stations."
How to answer:
What to expect: You may be asked to carry out a hypothesis test using data or summary statistics from the LDS.
Example: "A student believes that there is a positive correlation between daily mean temperature and daily total sunshine at Heathrow. Using a sample of 20 days from the large data set, the student calculates r = 0.52. Carry out a test at the 5% significance level."
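The decision step for this example can be sketched as follows, taking $H_0: \rho = 0$ against $H_1: \rho > 0$. The critical value 0.3783 is an assumption taken from standard product moment correlation tables (one-tailed, 5%, n = 20) and should be checked against the booklet you are given; `pmcc_test` is an illustrative helper, not an official method.

```python
def pmcc_test(r, critical_value, alternative):
    """Return True if r lies in the critical region (reject H0).

    alternative: "positive" (H1: rho > 0), "negative" (H1: rho < 0)
                 or "two-sided" (H1: rho != 0).
    """
    if alternative == "positive":
        return r > critical_value
    if alternative == "negative":
        return r < -critical_value
    if alternative == "two-sided":
        return abs(r) > critical_value
    raise ValueError(f"unknown alternative: {alternative}")

# Assumed table value: one-tailed 5% critical value of r for n = 20.
CRITICAL_VALUE_N20_ONE_TAIL_5PCT = 0.3783

reject_h0 = pmcc_test(0.52, CRITICAL_VALUE_N20_ONE_TAIL_5PCT, "positive")
print("Reject H0" if reject_h0 else "Do not reject H0")  # Reject H0
```

Since 0.52 exceeds the critical value, the contextual conclusion would be that there is evidence at the 5% level of a positive correlation between daily mean temperature and daily total sunshine at Heathrow.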
What to expect: You may be asked about missing data, anomalies, or the reliability of conclusions.
Example: "Five values are missing from the daily mean temperature data for Leeming in February. Explain how this might affect the calculation of the mean daily temperature for the month."
How to answer:
What to expect: You may be asked to model data from the LDS using a particular distribution and comment on the suitability of the model.
Example: "State the conditions under which a normal distribution is a suitable model for the daily mean pressure at Camborne. Using the data, assess whether these conditions are met."
Your familiarity with the large data set gives you a significant advantage if used correctly:
| Aspect | How it helps |
|---|---|
| Knowing typical values | You can check whether your answers are reasonable |
| Understanding the variables | You know what each column represents and what the units are |
| Awareness of anomalies | You recognise unusual values and can explain them |
| Station characteristics | You can explain differences between stations in terms of geography |
| Data quality issues | You know where missing data occurs and how it is coded |
Sometimes exam questions will present data in a context that is related to, but not identical to, the large data set you studied. For example:
Problem: Writing "reject H0" without explaining what this means in the context of the data.
Solution: Always finish with a sentence like: "There is sufficient evidence at the 5% level to suggest that the mean daily temperature at Leuchars in October has increased from the long-term average."
Problem: Mixing up daily mean temperature with daily maximum temperature, or confusing rainfall with sunshine.
Solution: Read the question carefully and check which variable is being referred to. Use the exact variable name from the question in your answer.
Problem: Giving an answer in the wrong units or forgetting to convert.
Solution: Always check the units given in the question and ensure your answer is consistent.
Problem: Using a regression line to predict values outside the range of the data without acknowledging the limitation.
Solution: State clearly whether your prediction is interpolation or extrapolation, and note that extrapolation is unreliable.
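The interpolation/extrapolation check is simple enough to express as a short sketch. The temperature values below are hypothetical and only illustrate the comparison against the sampled range; `prediction_type` is an illustrative name.

```python
def prediction_type(x_new, observed_x):
    """Classify a prediction at x_new relative to the sampled x-range."""
    lowest, highest = min(observed_x), max(observed_x)
    if lowest <= x_new <= highest:
        return "interpolation"
    return "extrapolation"

# Hypothetical sampled daily mean temperatures (°C), for illustration only.
sampled_temperatures = [9.5, 11.2, 12.8, 14.1, 16.4, 18.0]

print(prediction_type(13.0, sampled_temperatures))  # interpolation
print(prediction_type(25.0, sampled_temperatures))  # extrapolation
```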
Problem: Calculating a mean or standard deviation without accounting for missing values.
Solution: State how many values are missing and explain the impact on your calculation.
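A minimal sketch of a mean that reports how many values were missing is shown below. The readings are made up for illustration, and the missing-value code `"n/a"` is an assumption about how gaps are recorded; check the coding used in your copy of the data set.

```python
from statistics import mean

# Assumed codes for missing entries; adjust to match the actual spreadsheet.
MISSING_CODES = {"n/a", ""}

def mean_with_missing(values):
    """Return (mean of present values, number used, number missing)."""
    present = [
        float(v) for v in values
        if v is not None and str(v).strip().lower() not in MISSING_CODES
    ]
    n_missing = len(values) - len(present)
    if not present:
        raise ValueError("no usable values")
    return mean(present), len(present), n_missing

# Hypothetical daily mean temperatures (°C) with two gaps.
readings = [4.1, "n/a", 5.3, 3.8, None, 6.0]
avg, n_used, n_missing = mean_with_missing(readings)
print(f"mean = {avg:.2f} from {n_used} values ({n_missing} missing)")
```

Reporting `n_used` alongside the mean makes it clear that the divisor is the number of recorded values, not the number of days in the month.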
Problem: Applying a normal or binomial model without checking whether the conditions are met.
Solution: Always state the conditions and briefly assess whether they are satisfied.
Paper 3 is 2 hours long, with 100 marks available. This gives approximately 1.2 minutes per mark. LDS questions are worth the same as other questions, so do not spend disproportionate time on them.
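The minutes-per-mark figure turns into a rough time budget for any question, as this small sketch shows. The constants come from the paragraph above; `time_budget` is an illustrative name.

```python
TOTAL_MINUTES = 120  # 2-hour paper
TOTAL_MARKS = 100

def time_budget(marks):
    """Approximate minutes to allow for a question worth `marks` marks."""
    return marks * TOTAL_MINUTES / TOTAL_MARKS

print(time_budget(8))  # 9.6
```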
Your calculator can compute summary statistics, regression coefficients, and correlation coefficients. Make sure you know how to:
Many marks are lost because students do not read the question carefully. Pay particular attention to:
To prepare effectively, work through:
Exam Tip: Before the exam, create a one-page summary of each weather station in the large data set, listing its location, typical values for each variable, and any notable features. You cannot take this into the exam, but the act of creating it will consolidate your knowledge and help you answer questions confidently.
AQA 7357 specification, Paper 3 — Statistics (Section A): the prescribed content references "Use of the large data set throughout the course of study" and requires that candidates "become familiar with the large data set in advance of the final assessment and may be examined on it directly". LDS-flavoured items appear specifically in Paper 3 Section A, almost always in the early-to-mid stem of a multi-part Statistics question. They draw across the whole Stats spec — section B (data presentation and interpretation), section C (probability), section D (statistical distributions), section E (hypothesis testing) and section F (statistical sampling). The LDS is also assumed background to Section A's standalone items: even where the printed question doesn't reference the LDS by name, candidates whose mental model of "weather data" is sharp answer faster and lose fewer marks on contextual interpretation.
Question (8 marks):
A meteorologist is investigating whether daily mean temperature T (°C) at a UK weather station is associated with daily total rainfall R (mm) during May. A random sample of n=30 days from the large data set yields summary statistics:
$\sum T = 372$, $\sum T^2 = 4845$, $\sum R = 84$, $\sum R^2 = 392$, $\sum TR = 980$.
(a) Calculate the product moment correlation coefficient r to 3 s.f. (3)
(b) Test, at the 5% significance level, whether there is evidence of a non-zero correlation between T and R in the underlying population. State your hypotheses clearly and your conclusion in context. (5)
Solution with mark scheme:
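(a) Step 1: compute $S_{TT}$, $S_{RR}$ and $S_{TR}$.

$$S_{TT} = \sum T^2 - \frac{\left(\sum T\right)^2}{n} = 4845 - \frac{372^2}{30} = 4845 - 4612.8 = 232.2$$

$$S_{RR} = \sum R^2 - \frac{\left(\sum R\right)^2}{n} = 392 - \frac{84^2}{30} = 392 - 235.2 = 156.8$$

$$S_{TR} = \sum TR - \frac{\sum T \sum R}{n} = 980 - \frac{372 \cdot 84}{30} = 980 - 1041.6 = -61.6$$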
M1 — correct method for at least one S statistic.
Step 2 — combine.
$$r = \frac{S_{TR}}{\sqrt{S_{TT} \cdot S_{RR}}} = \frac{-61.6}{\sqrt{232.2 \cdot 156.8}} = \frac{-61.6}{\sqrt{36408.96}} = \frac{-61.6}{190.81}$$
M1 — substituting into the correlation formula.
r≈−0.323
A1 — r=−0.323 (3 s.f.), with negative sign retained.
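As a check on part (a), the same calculation can be reproduced from the summary statistics. This is a sketch using the standard summary-statistic formulas; the function name `pmcc_from_sums` is illustrative.

```python
from math import sqrt

def pmcc_from_sums(n, sum_x, sum_x2, sum_y, sum_y2, sum_xy):
    """Product moment correlation coefficient from summary statistics."""
    s_xx = sum_x2 - sum_x**2 / n        # corrected sum of squares for x
    s_yy = sum_y2 - sum_y**2 / n        # corrected sum of squares for y
    s_xy = sum_xy - sum_x * sum_y / n   # corrected sum of products
    return s_xy / sqrt(s_xx * s_yy)

# Values from the question: x is T (temperature), y is R (rainfall).
r = pmcc_from_sums(n=30, sum_x=372, sum_x2=4845, sum_y=84, sum_y2=392, sum_xy=980)
print(f"r = {r:.3f}")  # r = -0.323
```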
(b) Step 1 — state hypotheses.
Let ρ denote the population correlation coefficient between T and R.
$H_0: \rho = 0$ (no correlation in the population)

$H_1: \rho \neq 0$ (two-tailed)
B1 — both hypotheses correct, in terms of ρ (not r). Candidates who write H0:r=0 lose this mark — r is the sample statistic, ρ the population parameter.
Step 2 — identify the critical value.
For a two-tailed test at the 5% level with n=30, the critical value of r from the AQA formula booklet is approximately ±0.3610.
B1 — correct critical value (allow ±0.361).
Step 3 — compare and decide.
$|r| = 0.323 < 0.361$, so $r$ does not lie in the critical region.
M1 — valid comparison of ∣r∣ with the critical value.
A1 — "Do not reject H0".
Step 4 — contextual conclusion.
There is insufficient evidence at the 5% level to conclude that daily mean temperature and daily total rainfall are correlated in May at this UK station.
A1 — conclusion expressed in context (mentions temperature, rainfall, station, May), with non-assertive phrasing ("insufficient evidence" rather than "no correlation").
Total: 8 marks.
Question (6 marks): A student claims that the mean daily maximum gust speed at Heathrow in October exceeds 25 knots. Using the large data set, they take a sample of 24 October days and compute $\bar{x} = 26.4$ knots and $s = 5.8$ knots. Assume gust speeds are approximately normally distributed.
(a) State a suitable null and alternative hypothesis. (1)
(b) Carry out the test at the 5% significance level using the normal distribution as an approximation, stating your conclusion in context. (5)
Mark scheme decomposition by AO:
(a)
(b)
Total: 6 marks split AO1 = 3, AO2 = 2, AO3 = 1. Hypothesis-testing items in Paper 3 routinely carry an AO3 mark for the contextual conclusion — bare numerical comparisons cap at AO2.
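The numerical work for part (b) of the gust-speed question can be sketched as below, taking $H_0: \mu = 25$ against $H_1: \mu > 25$. It follows the question's instruction to use a normal approximation, with the sample value $s$ standing in for the population standard deviation; that substitution is an assumption of this sketch, and `one_sample_z_upper` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_upper(sample_mean, mu0, s, n, alpha=0.05):
    """Upper-tail z-test for a mean; returns (z, critical value, reject H0?)."""
    # Test statistic: distance of the sample mean from mu0 in standard errors.
    z = (sample_mean - mu0) / (s / sqrt(n))
    # Upper-tail critical value of the standard normal distribution.
    critical = NormalDist().inv_cdf(1 - alpha)
    return z, critical, z > critical

z, critical, reject_h0 = one_sample_z_upper(26.4, 25, 5.8, 24)
print(f"z = {z:.2f}, critical value = {critical:.3f}, reject H0: {reject_h0}")
# z = 1.18, critical value = 1.645, reject H0: False
```

Because the test statistic is below the critical value, $H_0$ is not rejected: there is insufficient evidence at the 5% level that the mean daily maximum gust speed at Heathrow in October exceeds 25 knots.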
Connects to:
Data presentation and interpretation (section B): LDS items frequently print a histogram, box plot or cumulative frequency diagram derived from the data set and ask candidates to identify outliers, compare distributions, or comment on skew. Recognising whether a back-to-back stem-and-leaf is comparing the same variable across stations, or two variables at one station, is the first reading task — and is regularly mis-read under exam pressure.
Hypothesis testing (section E): the LDS is a natural source of hypothesis-testing contexts — tests on a population mean (with known or estimated σ), tests on ρ for product moment correlation, and binomial/normal tests for proportions of "rainy days" or "warm days". Setting hypotheses in terms of a clearly named population parameter is the single most common mark-loss site.
Modelling assumptions (sections D and E): treating "daily rainfall" as normally distributed is wrong (rainfall is bounded below by zero and heavily right-skewed); treating "temperature deviation from monthly mean" as approximately normal is usually defensible. Examiners reward candidates who name the assumption being made and flag where it is questionable for the given variable.
Correlation and regression (section B): the LDS supports both the calculation of r and the fitting of a regression line of y on x. Candidates need to remember that the regression equation y=a+bx is only meaningful for predicting y from x — using it the other way round, or extrapolating beyond the sampled range, is a classic mark-loser.