Exam Questions on the Large Data Set

This lesson prepares you for the exam questions that reference the AQA large data set. You will learn about typical question styles, how to use your pre-release familiarity effectively, how to interpret unfamiliar contexts, and how to avoid common pitfalls.

Where LDS Questions Appear

Questions on the large data set appear in Paper 3: Statistics and Mechanics, Section A (Statistics). This section is worth 50 marks out of the total 100 for Paper 3, and the large data set typically features in several questions within this section.

Paper 3 Structure

Section	Content	Marks
Section A	Statistics	50
Section B	Mechanics	50
Total		100

The LDS questions are integrated into the statistics section alongside standard statistics topics. They are not separated into a distinct "LDS section" — instead, they are woven into questions on sampling, data presentation, summary statistics, correlation, regression, and hypothesis testing.

Typical Question Styles

Style 1: Sampling from the LDS

What to expect: You may be asked to describe how to take a particular type of sample from the large data set.

Example: "Describe how you could use the large data set to take a systematic sample of 10 days' data from the month of July at Heathrow."

How to answer:

State the sampling frame (all 31 days of July at Heathrow).
Calculate the sampling interval: $k = 31 / 10 = 3.1$, which rounds down to $k = 3$.
Choose a random starting point between 1 and 3 inclusive.
Select every 3rd day from the starting point until 10 days have been chosen.
Note any practical issues (e.g., because $31/10$ is not a whole number, a start on day 1 would reach day 31 as an 11th value, so stop once 10 days have been selected).
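The steps above can be sketched in Python. This is a minimal illustration only: the function name `systematic_sample` and the choice to round the interval down are assumptions made for the example, and in the exam you describe the method in words rather than code.

```python
import random

def systematic_sample(frame, sample_size, rng=random):
    """Return a systematic sample of `sample_size` items from the ordered list `frame`."""
    k = len(frame) // sample_size      # sampling interval, rounded down (31 // 10 = 3)
    start = rng.randint(1, k)          # random starting point between 1 and k inclusive
    # Take every k-th item from the start, stopping once enough items are selected.
    positions = [start + i * k for i in range(sample_size)]
    return [frame[p - 1] for p in positions]

july_days = list(range(1, 32))         # sampling frame: days 1 to 31 of July
sample = systematic_sample(july_days, 10)
print(sample)
```

Whichever starting point is drawn, the last selected day is at most day 30, so the sample always fits inside the month.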

Style 2: Interpreting Diagrams

What to expect: You may be given a diagram (box plot, histogram, scatter diagram) constructed from LDS data and asked to interpret it.

Example: "The box plot below shows the daily mean wind speed at Camborne in January. Describe the distribution and comment on any outliers."

How to answer:

State the median, IQR, and range.
Comment on skewness (is the median closer to $Q_1$ or $Q_3$? A median closer to $Q_1$ suggests positive skew).
Identify outliers and suggest possible explanations (e.g., a storm).
Relate to context: "Camborne is in south-west England, exposed to Atlantic weather systems, so high wind speeds in January are expected."
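To see how the quantities behind a box plot fit together, here is a small Python sketch. The wind speeds are hypothetical values invented for illustration, not LDS data, and the 1.5 × IQR outlier rule is an assumption: use whichever rule the question states. Quartile conventions also differ slightly between calculators and software.

```python
from statistics import quantiles

# HYPOTHETICAL daily mean wind speeds in knots (illustrative, not LDS values).
wind_speeds = [6, 7, 8, 8, 9, 10, 10, 11, 12, 13, 14, 15, 29]

# Quartiles using the "inclusive" convention; other conventions may differ slightly.
q1, median, q3 = quantiles(wind_speeds, n=4, method="inclusive")
iqr = q3 - q1

# ASSUMPTION: an outlier lies more than 1.5 x IQR beyond the nearer quartile.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in wind_speeds if x < lower_fence or x > upper_fence]

# A median nearer Q1 than Q3 suggests positive skew, and vice versa.
if median - q1 < q3 - median:
    skew = "positive"
elif median - q1 > q3 - median:
    skew = "negative"
else:
    skew = "roughly symmetric"

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"Outliers: {outliers}, skew: {skew}")
```

With this made-up data the single large value is flagged as an outlier, which is where a contextual comment (such as a storm) would earn the final mark.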

Style 3: Summary Statistics in Context

What to expect: You may be given summary statistics calculated from the LDS and asked to compare two stations or two time periods.

Example: "The mean daily total sunshine at Leuchars in June is 6.2 hours with a standard deviation of 3.1 hours. The mean at Hurn is 7.8 hours with a standard deviation of 2.5 hours. Compare the sunshine at the two stations."

How to answer:

Compare the means: "Hurn has a higher mean daily sunshine (7.8 hours compared to 6.2 hours at Leuchars)."
Compare the spread: "Leuchars has a larger standard deviation (3.1 hours vs 2.5 hours), indicating more day-to-day variability."
Give context: "Hurn is on the south coast of England, which generally has sunnier and more settled weather in June. Leuchars is further north and so has longer daylight hours in June, but being on the east coast of Scotland it is more susceptible to haar (coastal fog), which reduces sunshine."

Style 4: Hypothesis Testing Using LDS Data

What to expect: You may be asked to carry out a hypothesis test using data or summary statistics from the LDS.

Example: "A student believes that there is a positive correlation between daily mean temperature and daily total sunshine at Heathrow. Using a sample of 20 days from the large data set, the student calculates r = 0.52. Carry out a test at the 5% significance level."
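One way to think about this example is as a comparison of the sample value of r with a tabulated critical value. The sketch below is illustrative: the function name `pmcc_test` is made up for the example, and the critical value 0.3783 (one-tailed, 5%, n = 20) is an assumption taken from a standard PMCC critical value table, so check it against the table supplied in your exam.

```python
def pmcc_test(r, critical_value, alternative):
    """Return True if H0: rho = 0 is rejected for the given alternative hypothesis.

    alternative is 'positive' (H1: rho > 0), 'negative' (H1: rho < 0)
    or 'two-sided' (H1: rho != 0).
    """
    if alternative == "positive":
        return r > critical_value
    if alternative == "negative":
        return r < -critical_value
    if alternative == "two-sided":
        return abs(r) > critical_value
    raise ValueError("alternative must be 'positive', 'negative' or 'two-sided'")

# H0: rho = 0, H1: rho > 0 (the student believes the correlation is positive).
# ASSUMPTION: critical value read from a standard PMCC table for n = 20.
CRITICAL_N20_ONE_TAIL_5PCT = 0.3783

reject_h0 = pmcc_test(0.52, CRITICAL_N20_ONE_TAIL_5PCT, "positive")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

Under that assumed critical value, 0.52 lies in the critical region, so the written conclusion would be that there is evidence at the 5% level of a positive correlation between daily mean temperature and daily total sunshine at Heathrow.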

Style 5: Commenting on Data Quality

What to expect: You may be asked about missing data, anomalies, or the reliability of conclusions.

Example: "Five values are missing from the daily mean temperature data for Leeming in February. Explain how this might affect the calculation of the mean daily temperature for the month."

How to answer:

The sample size is reduced from 28 to 23.
If the missing values are not random (e.g., they correspond to days with extreme cold when the equipment failed), the mean may be biased upwards.
The missing values should be investigated before deciding whether to omit or impute them.
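The effect of non-random missing values can be illustrated with a short sketch. All temperatures below are hypothetical, and the "cold day" value of −2.0 °C is an assumption chosen purely to show the direction of the bias.

```python
from statistics import mean

# HYPOTHETICAL February daily mean temperatures in deg C; None marks a missing reading.
temps = [
    4.1, 3.8, 5.0, 4.6, None, None, 2.9,
    3.3, 4.4, 5.2, 6.0, 5.5, None, 4.8,
    4.0, 3.6, None, None, 3.1, 4.2, 4.9,
    5.7, 6.1, 5.3, 4.7, 4.5, 5.1, 5.8,
]

recorded = [t for t in temps if t is not None]
missing_count = len(temps) - len(recorded)
mean_recorded = mean(recorded)          # mean of the 23 available values only

# ASSUMED scenario: every missing day was very cold because the equipment failed.
ASSUMED_COLD_VALUE = -2.0
filled = [t if t is not None else ASSUMED_COLD_VALUE for t in temps]
mean_if_cold = mean(filled)

print(f"{missing_count} values missing, {len(recorded)} recorded")
print(f"Mean of recorded values: {mean_recorded:.2f}")
print(f"Mean if missing days were cold: {mean_if_cold:.2f}")
```

If the missing days really were unusually cold, the mean of the recorded values overstates the true monthly mean, which is the upward bias described above.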

Style 6: Modelling Questions

What to expect: You may be asked to model data from the LDS using a particular distribution and comment on the suitability of the model.

Example: "State the conditions under which a normal distribution is a suitable model for the daily mean pressure at Camborne. Using the data, assess whether these conditions are met."

Using Pre-Release Familiarity

Your familiarity with the large data set gives you a significant advantage if used correctly:

What Familiarity Helps With

Aspect	How it helps
Knowing typical values	You can check whether your answers are reasonable
Understanding the variables	You know what each column represents and what the units are
Awareness of anomalies	You recognise unusual values and can explain them
Station characteristics	You can explain differences between stations in terms of geography
Data quality issues	You know where missing data occurs and how it is coded

What Familiarity Does NOT Help With

You will not be asked to recall specific data values.
You do not need to memorise exact statistics.
The exam may present the data in a different format from the spreadsheet.

Effective Use in the Exam

Use your knowledge to give contextual answers. Reference specific stations and their geographical features.
Check reasonableness. If your calculation gives a mean July temperature of −5°C at Heathrow, something has gone wrong.
Explain anomalies. If you know that a particular station had missing data in a certain month, mention this.
Compare stations knowledgeably. Explain why temperatures at Heathrow are typically higher than at Leuchars.

Interpreting Unfamiliar Contexts

Sometimes exam questions will present data in a context that is related to, but not identical to, the large data set you studied. For example:

A weather station not in the original LDS
A variable presented in different units
A time period you did not study

How to Handle This

Apply the same statistical techniques — the methods do not change.
Use your general understanding of weather data to interpret the context.
Look for parallels — if the question mentions a coastal station in southern England, draw on your knowledge of Hurn or Camborne.
Read the question carefully — all the information you need for the calculation will be given; your LDS knowledge helps with interpretation and context.

Common Pitfalls

Pitfall 1: Not Giving Context

Problem: Writing "reject $H_0$" without explaining what this means in the context of the data.

Solution: Always finish with a sentence like: "There is sufficient evidence at the 5% level to suggest that the mean daily temperature at Leuchars in October has increased from the long-term average."

Pitfall 2: Confusing Variables

Problem: Mixing up daily mean temperature with daily maximum temperature, or confusing rainfall with sunshine.

Solution: Read the question carefully and check which variable is being referred to. Use the exact variable name from the question in your answer.

Pitfall 3: Incorrect Units

Problem: Giving an answer in the wrong units or forgetting to convert.

Solution: Always check the units given in the question and ensure your answer is consistent.

Pitfall 4: Extrapolating Beyond the Data

Problem: Using a regression line to predict values outside the range of the data without acknowledging the limitation.

Solution: State clearly whether your prediction is interpolation or extrapolation, and note that extrapolation is unreliable.
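A minimal sketch of this check is shown below. The data values are hypothetical, and the helper names `fit_line` and `predict` are invented for the example. The idea is simply to compare the value you are predicting from with the range of the sampled data.

```python
def fit_line(xs, ys):
    """Least squares regression line of y on x, returned as (a, b) for y = a + b x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

def predict(a, b, x, x_min, x_max):
    """Return the predicted y and whether the prediction interpolates or extrapolates."""
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return a + b * x, kind

# HYPOTHETICAL data: daily total sunshine (hours) and daily mean temperature (deg C).
sunshine = [2.0, 4.5, 6.0, 7.5, 9.0, 11.0]
temperature = [14.1, 15.0, 16.2, 16.8, 17.9, 18.6]

a, b = fit_line(sunshine, temperature)
for x in (8.0, 15.0):
    y_hat, kind = predict(a, b, x, min(sunshine), max(sunshine))
    print(f"x = {x}: predicted y = {y_hat:.1f} ({kind})")
```

A prediction labelled as extrapolation should always carry a comment that the linear relationship may not hold outside the sampled range.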

Pitfall 5: Ignoring Missing Data

Problem: Calculating a mean or standard deviation without accounting for missing values.

Solution: State how many values are missing and explain the impact on your calculation.

Pitfall 6: Not Checking Model Suitability

Problem: Applying a normal or binomial model without checking whether the conditions are met.

Solution: Always state the conditions and briefly assess whether they are satisfied.

Exam Technique

Time Management

Paper 3 is 2 hours long, with 100 marks available. This gives approximately 1.2 minutes per mark. LDS questions are worth the same as other questions, so do not spend disproportionate time on them.

Structuring Your Answers

State the method you are using.
Show clear working — even if you use a calculator, write down intermediate values.
State your conclusion in context.
Comment on limitations if the question asks for them.

Using a Calculator

Your calculator can compute summary statistics, regression coefficients, and correlation coefficients. Make sure you know how to:

Enter data into the statistics mode
Calculate mean, standard deviation, PMCC, and regression line
Use the normal distribution function (for hypothesis tests)

Reading the Question

Many marks are lost because students do not read the question carefully. Pay particular attention to:

Which variable is being asked about
Whether a one-tailed or two-tailed test is required
The significance level
Whether the question asks for calculation, interpretation, or both

Practice Questions

To prepare effectively, work through:

AQA past papers — the most valuable source of practice. Focus on Paper 3, Section A.
Specimen papers — these show the intended question style.
Textbook exercises that reference the large data set.
Your own investigations using the data set — formulate a question, carry out the analysis, and write up your conclusions.

Summary

LDS questions appear in Paper 3, Section A, integrated with standard statistics topics.
Typical question styles include sampling, interpreting diagrams, comparing summary statistics, hypothesis testing, commenting on data quality, and modelling.
Pre-release familiarity is an advantage for contextual interpretation, not for recalling specific values.
Avoid common pitfalls: always give context, check units, state assumptions, and acknowledge limitations.
Good exam technique includes clear working, contextual conclusions, and careful reading of the question.

Exam Tip: Before the exam, create a one-page summary of each weather station in the large data set, listing its location, typical values for each variable, and any notable features. You cannot take this into the exam, but the act of creating it will consolidate your knowledge and help you answer questions confidently.

A-Level Deep Dive: Exam Questions on the Large Data Set

Spec mapping

AQA 7357 specification, Paper 3 — Statistics (Section A): the prescribed content references "Use of the large data set throughout the course of study" and requires that candidates "become familiar with the large data set in advance of the final assessment and may be examined on it directly". LDS-flavoured items appear specifically in Paper 3 Section A, almost always in the early-to-mid stem of a multi-part Statistics question. They draw across the whole Stats spec — section B (data presentation and interpretation), section C (probability), section D (statistical distributions), section E (hypothesis testing) and section F (statistical sampling). The LDS is also assumed background to Section A's standalone items: even where the printed question doesn't reference the LDS by name, candidates whose mental model of "weather data" is sharp answer faster and lose fewer marks on contextual interpretation.

Worked example with full mark scheme

Question (8 marks):

A meteorologist is investigating whether daily mean temperature $T$ (°C) at a UK weather station is associated with daily total rainfall $R$ (mm) during May. A random sample of $n = 30$ days from the large data set yields summary statistics:

$\sum T = 372$, $\sum T^2 = 4845$, $\sum R = 84$, $\sum R^2 = 392$, $\sum TR = 980$.

(a) Calculate the product moment correlation coefficient $r$ to 3 s.f. (3)

(b) Test, at the 5% significance level, whether there is evidence of a non-zero correlation between $T$ and $R$ in the underlying population. State your hypotheses clearly and your conclusion in context. (5)

Solution with mark scheme:

(a) Step 1: compute $S_{TT}$, $S_{RR}$, $S_{TR}$.

$S_{TT} = \sum T^2 - \frac{(\sum T)^2}{n} = 4845 - \frac{372^2}{30} = 4845 - 4612.8 = 232.2$

$S_{RR} = \sum R^2 - \frac{(\sum R)^2}{n} = 392 - \frac{84^2}{30} = 392 - 235.2 = 156.8$

$S_{TR} = \sum TR - \frac{\sum T \sum R}{n} = 980 - \frac{372 \cdot 84}{30} = 980 - 1041.6 = -61.6$

M1 — correct method for at least one $S$ statistic.

Step 2 — combine.

$r = \frac{S_{TR}}{\sqrt{S_{TT} \cdot S_{RR}}} = \frac{-61.6}{\sqrt{232.2 \cdot 156.8}} = \frac{-61.6}{\sqrt{36408.96}} = \frac{-61.6}{190.81}$

M1 — substituting into the correlation formula.

$r \approx -0.323$

A1 — $r = -0.323$ (3 s.f.), with negative sign retained.
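The arithmetic in part (a) can be checked with a few lines of Python. The function name `pmcc_from_sums` is an invented helper for this check. It applies the same $S_{TT}$, $S_{RR}$ and $S_{TR}$ formulas to the summary statistics given in the question.

```python
from math import sqrt

def pmcc_from_sums(n, sum_x, sum_x2, sum_y, sum_y2, sum_xy):
    """Product moment correlation coefficient computed from summary statistics."""
    s_xx = sum_x2 - sum_x ** 2 / n
    s_yy = sum_y2 - sum_y ** 2 / n
    s_xy = sum_xy - sum_x * sum_y / n
    return s_xy / sqrt(s_xx * s_yy)

# Summary statistics from the question: x is temperature T, y is rainfall R.
r = pmcc_from_sums(n=30, sum_x=372, sum_x2=4845, sum_y=84, sum_y2=392, sum_xy=980)
print(round(r, 3))  # -0.323
```

A check like this is useful when revising, but in the exam the intermediate values of $S_{TT}$, $S_{RR}$ and $S_{TR}$ should still be written down to secure the method marks.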

(b) Step 1 — state hypotheses.

Let $\rho$ denote the population correlation coefficient between $T$ and $R$.

$H_0: \rho = 0$ (no correlation in the population)

$H_1: \rho \neq 0$ (two-tailed)

B1 — both hypotheses correct, in terms of $\rho$ (not $r$). Candidates who write $H_0: r = 0$ lose this mark: $r$ is the sample statistic, $\rho$ the population parameter.

Step 2 — identify the critical value.

For a two-tailed test at the 5% level with $n = 30$, the critical value of $r$ from the AQA formula booklet is approximately $\pm 0.3610$.

B1 — correct critical value (allow $\pm 0.361$).

Step 3 — compare and decide.

$|r| = 0.323 < 0.361$, so $r$ does not lie in the critical region.

M1 — valid comparison of $|r|$ with the critical value.

A1 — "Do not reject $H_0$".

Step 4 — contextual conclusion.

There is insufficient evidence at the 5% level to conclude that daily mean temperature and daily total rainfall are correlated in May at this UK station.

A1 — conclusion expressed in context (mentions temperature, rainfall, station, May), with non-assertive phrasing ("insufficient evidence" rather than "no correlation").

Total: 8 marks.

Specimen question modelled on the AQA 7357 Paper 3 format

Question (6 marks): A student claims that the mean daily maximum gust speed at Heathrow in October exceeds 25 knots. Using the large data set, they take a sample of 24 October days and compute $\bar{x} = 26.4$ knots and $s = 5.8$ knots. Assume gust speeds are approximately normally distributed.

(a) State a suitable null and alternative hypothesis. (1)

(b) Carry out the test at the 5% significance level using the normal distribution as an approximation, stating your conclusion in context. (5)

Mark scheme decomposition by AO:

(a)

B1 (AO1.2) — $H_0: \mu = 25$ , $H_1: \mu > 25$ (one-tailed, in line with the directional claim).

(b)

M1 (AO1.1a) — standardising: $z = \dfrac{\bar{x} - 25}{s/\sqrt{n}} = \dfrac{26.4 - 25}{5.8/\sqrt{24}}$.
A1 (AO1.1b) — $z \approx 1.183$ (allow 1.18).
M1 (AO2.1) — comparison with critical value $z_{0.05} = 1.6449$ (or computing p-value $\approx 0.119$ and comparing with 0.05).
A1 (AO2.2a) — "Do not reject $H_0$" / "1.183 < 1.645".
A1 (AO3.5a) — contextual conclusion: insufficient evidence at the 5% level that the mean October daily maximum gust speed at Heathrow exceeds 25 knots.

Total: 6 marks split AO1 = 3, AO2 = 2, AO3 = 1. Hypothesis-testing items in Paper 3 routinely carry an AO3 mark for the contextual conclusion — bare numerical comparisons cap at AO2.
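The test in part (b) can be reproduced with Python's standard library as a check on the calculator work. This is a sketch only: the function name `one_sample_z_upper` is invented for the example, and it follows the question's instruction to use the normal distribution with the sample standard deviation in place of $\sigma$.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_upper(x_bar, mu_0, s, n, alpha):
    """Upper-tailed z-test of H0: mu = mu_0 against H1: mu > mu_0."""
    z = (x_bar - mu_0) / (s / sqrt(n))            # test statistic
    p_value = 1 - NormalDist().cdf(z)             # P(Z > z) under H0
    critical = NormalDist().inv_cdf(1 - alpha)    # upper-tail critical value
    return z, p_value, critical, z > critical

z, p_value, critical, reject_h0 = one_sample_z_upper(
    x_bar=26.4, mu_0=25, s=5.8, n=24, alpha=0.05
)
print(f"z = {z:.4f}, p-value = {p_value:.4f}, critical value = {critical:.4f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

Both routes in the mark scheme lead to the same decision here: the test statistic is below the critical value, and equivalently the p-value is greater than 0.05.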

Synoptic links

Connects to:

Data presentation and interpretation (section B): LDS items frequently print a histogram, box plot or cumulative frequency diagram derived from the data set and ask candidates to identify outliers, compare distributions, or comment on skew. Recognising whether a back-to-back stem-and-leaf is comparing the same variable across stations, or two variables at one station, is the first reading task — and is regularly mis-read under exam pressure.
Hypothesis testing (section E): the LDS is a natural source of hypothesis-testing contexts: tests on a population mean (with known or estimated $\sigma$), tests on $\rho$ for product moment correlation, and binomial/normal tests for proportions of "rainy days" or "warm days". Setting hypotheses in terms of a clearly named population parameter is the single most common mark-loss site.
Modelling assumptions (sections D and E): treating "daily rainfall" as normally distributed is wrong (rainfall is bounded below by zero and heavily right-skewed); treating "temperature deviation from monthly mean" as approximately normal is usually defensible. Examiners reward candidates who name the assumption being made and flag where it is questionable for the given variable.
Correlation and regression (section B): the LDS supports both the calculation of $r$ and the fitting of a regression line of $y$ on $x$. Candidates need to remember that the regression equation $y = a + bx$ is only meaningful for predicting $y$ from $x$; using it the other way round, or extrapolating beyond the sampled range, is a classic mark-loser.
