Exploring the Data

This lesson focuses on the structure and content of AQA's large data set. You will learn about the specific variables measured, the weather stations included, the units used, and how to identify anomalies and missing data. A thorough exploration of the data set is the foundation for all subsequent statistical analysis.

Structure of the AQA Weather Data Set

The AQA large data set is typically provided as a multi-sheet spreadsheet. Each sheet corresponds to a different weather station or a different time period. The data set is structured with:

Rows representing individual observations (usually one per day)
Columns representing different weather variables
Headers at the top of each column giving the variable name and/or units

Typical Layout

Column	Variable	Unit	Description
Date	Date of observation	dd/mm/yyyy	The calendar date
Daily mean temperature	Temperature	°C	Average of the maximum and minimum temperature for the day
Daily total rainfall	Rainfall	mm	Total precipitation recorded over 24 hours
Daily total sunshine	Sunshine	hours	Total hours of bright sunshine
Daily mean wind speed	Wind speed	kn (knots)	Average wind speed over 24 hours
Daily mean wind direction	Wind direction	° (degrees) or compass bearing	Prevailing wind direction
Daily maximum gust	Gust	kn (knots)	Highest instantaneous wind speed
Daily mean cloud cover	Cloud cover	oktas	Average cloud cover (0–8 scale)
Daily mean visibility	Visibility	Dm (decametres)	Average visibility
Daily mean pressure	Pressure	hPa (hectopascals)	Average atmospheric pressure at sea level
Daily maximum relative humidity	Humidity	%	Highest relative humidity during the day

Note: The exact variables included may vary slightly between examination series. Always check the current data set from AQA's website.

Weather Stations

The data set includes readings from several UK weather stations and, in some versions, overseas stations. Each station has distinct geographical characteristics that influence its weather patterns.

Typical UK Stations

Station	Region	Notable features
Heathrow	South-east England	Urban heat island effect; relatively dry; moderate temperatures
Hurn	Southern England (Dorset)	Coastal influence; mild winters
Leeming	Northern England (Yorkshire)	Inland; colder winters; frost-prone
Camborne	South-west England (Cornwall)	Maritime climate; windy; mild and wet
Leuchars	Eastern Scotland (Fife)	Coastal; cool summers; variable winds

Overseas Stations (if included)

Overseas stations often represent contrasting climates — for example, stations in Beijing (continental), Jacksonville (subtropical), or Perth (Mediterranean). These allow comparisons between very different weather patterns.

Why Station Selection Matters

Latitude affects temperature and sunshine hours.
Altitude affects temperature (decreasing approximately $1\,°C$ per $150\,\text{m}$ of altitude).
Proximity to the coast moderates temperature extremes and affects wind patterns.
Urban vs rural locations affect temperature (urban heat island effect) and air quality.

Understanding these factors helps you interpret the data in context — for example, explaining why Camborne has milder winters than Leeming, or why Heathrow tends to record higher temperatures than other UK stations.

Understanding Units

Correct interpretation of the data depends on understanding the units used:

Variable	Unit	Notes
Temperature	°C	Celsius; negative values indicate frost
Rainfall	mm	1 mm of rain = 1 litre per square metre; $\text{tr}$ means trace (< 0.05 mm)
Sunshine	hours	Cannot exceed the astronomical maximum for that date and latitude
Wind speed	kn (knots)	1 knot ≈ 1.15 mph ≈ 0.514 m/s
Wind direction	° (degrees)	Measured clockwise from north; 0° or 360° = north, 90° = east, etc.
Cloud cover	oktas	0 = clear sky, 8 = completely overcast
Pressure	hPa	Standard atmospheric pressure ≈ 1013 hPa
Visibility	Dm (decametres)	1 Dm = 10 m; higher values mean better visibility
Humidity	%	100% = air is fully saturated with moisture

Common Conversion Errors

Students frequently make mistakes with units in exam questions. For example:

Confusing knots with km/h or mph
Misinterpreting decametres as metres
Forgetting that $\text{tr}$ is not zero but a very small positive value
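
The conversions above can be sketched in a few lines of Python. This is a minimal illustration using the approximate factors quoted in the units table; the function names and the sample readings are hypothetical, not taken from the data set.

```python
# Approximate conversion factors quoted in the units table.
KNOT_TO_MPH = 1.15
KNOT_TO_MS = 0.514
DM_TO_M = 10  # 1 decametre = 10 metres


def knots_to_mph(knots):
    """Convert a speed in knots to miles per hour (approximate)."""
    return knots * KNOT_TO_MPH


def knots_to_ms(knots):
    """Convert a speed in knots to metres per second (approximate)."""
    return knots * KNOT_TO_MS


def decametres_to_metres(dm):
    """Convert a visibility in decametres to metres."""
    return dm * DM_TO_M


gust_kn = 40          # hypothetical daily maximum gust
visibility_dm = 2500  # hypothetical daily mean visibility

print(f"{gust_kn} kn is about {knots_to_mph(gust_kn):.0f} mph "
      f"or {knots_to_ms(gust_kn):.1f} m/s")
print(f"{visibility_dm} Dm = {decametres_to_metres(visibility_dm)} m")
```

Writing the factor down once and reusing it is a simple guard against reading a value in knots as if it were already in mph, or a visibility in decametres as if it were in metres.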

Identifying Anomalies

An anomaly is a data value that does not fit the expected pattern. In the context of weather data, anomalies may arise from:

Genuine Extreme Weather

An unusually hot day (e.g., daily mean temperature exceeding $30\,°C$ in the UK)
A storm producing exceptionally high rainfall or wind speeds
A cold snap with temperatures well below freezing

Recording Errors

A temperature of $100\,°C$ (clearly impossible for weather data)
A negative rainfall value
Cloud cover recorded as 10 oktas (the scale only goes to 8)

Equipment Failure

The same value repeated for several consecutive days (stuck sensor)
Sudden jumps to implausible values

How to Detect Anomalies

Calculate summary statistics (mean, standard deviation) for each variable and station.
Identify values more than 2 or 3 standard deviations from the mean.
Use box plots: values below $Q_1 - 1.5 \times \text{IQR}$ or above $Q_3 + 1.5 \times \text{IQR}$ are potential outliers and are plotted individually beyond the whiskers.
Plot time series — anomalies often show up as isolated spikes or dips.
Cross-reference with other variables — for example, a day with very high sunshine but also very high rainfall may indicate a recording error.
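
The first three checks in the list above can be sketched in Python as follows. The temperatures are made up for illustration, the function names are hypothetical, and the quartiles come from the standard library's default method, which may differ slightly from the convention used on a calculator or in a spreadsheet.

```python
import statistics


def z_score_flags(values, k=2):
    """Return values more than k sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return [v for v in values if abs(v - mean) > k * sd]


def tukey_fences(q1, q3):
    """Return the (lower, upper) 1.5 x IQR fences for given quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def tukey_flags(values, q1, q3):
    """Return values lying outside the 1.5 x IQR fences."""
    lower, upper = tukey_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]


# Hypothetical daily mean temperatures (°C) with one suspicious entry.
temps = [12.1, 13.4, 12.8, 14.0, 13.1, 12.5, 13.9, 100.0, 13.3, 12.9]
q1, _, q3 = statistics.quantiles(temps, n=4)

print("2 sd rule:", z_score_flags(temps, k=2))
print("3 sd rule:", z_score_flags(temps, k=3))
print("Tukey rule:", tukey_flags(temps, q1, q3))
```

With these made-up values the 2 standard deviation rule and the box plot rule both flag 100.0, but the 3 standard deviation rule does not, because the extreme value itself inflates the standard deviation. This is one reason to use more than one method.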

Missing Data

Real-world data sets almost always contain missing values. In the AQA large data set, missing data may appear as:

Representation	Meaning
Blank cell	No reading was recorded
$\text{n/a}$	Not available — the reading could not be taken
$\text{tr}$	Trace — rainfall too small to measure (not truly missing, but often needs special handling)
Specific codes	Some data sets use numeric codes (e.g., $-99$ ) to indicate missing values

Why Data Might Be Missing

Equipment malfunction — the measuring instrument was broken or offline
Observer absence — at manually operated stations, no one was available to take the reading
Variable not measured — some stations do not record all variables
Data transmission errors — readings were lost in transfer

Impact of Missing Data on Analysis

Missing data can affect statistical analysis in several ways:

Summary statistics may be biased if the missing values are not random (e.g., if readings are more likely to be missing during extreme weather)
Sample sizes will vary between variables and stations, making comparisons harder
Correlation analysis requires paired data — if one value in a pair is missing, that observation must be excluded
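
The effect on paired data can be illustrated with a short Python sketch. The readings are hypothetical, `None` stands in for a missing value, and `complete_pairs` is an illustrative helper name rather than anything defined by the data set.

```python
def complete_pairs(xs, ys):
    """Keep only observations where both values are present (not None)."""
    return [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]


# Hypothetical week of readings; None marks a missing value.
temperature = [14.2, None, 15.1, 13.8, 16.0, None, 14.9]
sunshine = [5.1, 6.3, None, 4.2, 8.0, 7.5, 5.6]

pairs = complete_pairs(temperature, sunshine)

print("temperature readings:", sum(t is not None for t in temperature))
print("sunshine readings:   ", sum(s is not None for s in sunshine))
print("usable pairs:        ", len(pairs))
```

Here each variable has a different sample size, and the number of usable pairs is smaller than either, which is why the sample size should always be reported alongside a correlation.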

Practical Exploration Activities

To build genuine familiarity, carry out the following activities with the data set:

Activity 1: Variable Summary Table

For a single station, create a table showing the minimum, maximum, mean, median, and standard deviation for each numerical variable. Note any values that seem unusual.
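
One possible way to build such a table is sketched below using only the Python standard library. The station extract is invented for illustration, and `summarise` is a hypothetical helper; a spreadsheet or calculator would serve equally well.

```python
import statistics


def summarise(values):
    """Summary measures for a list of numeric readings, ignoring None."""
    present = [v for v in values if v is not None]
    return {
        "n": len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "sd": statistics.stdev(present),  # sample standard deviation
    }


# Hypothetical extract for one station; None marks a missing reading.
station = {
    "Daily mean temperature (°C)": [14.2, 15.1, None, 13.8, 16.0, 14.9],
    "Daily mean wind speed (kn)": [8, 12, 10, None, 9, 11],
}

for variable, readings in station.items():
    row = summarise(readings)
    print(variable, {name: round(value, 2) for name, value in row.items()})
```

Reporting `n` in every row makes it obvious when a variable has fewer usable readings than the others.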

Activity 2: Seasonal Comparison

Select one variable (e.g., daily mean temperature) and compare its distribution across different months or seasons. Use appropriate charts to present your findings.

Activity 3: Station Comparison

Compare two stations on the same variable. Identify and explain any differences in terms of geography (latitude, altitude, proximity to coast).

Activity 4: Missing Data Audit

For each variable at each station, count the number of missing values. Is the missing data concentrated in certain time periods? Could this affect any conclusions you might draw?

Typical Exam Context

In the exam, you might see questions such as:

"The table shows daily mean temperatures for Camborne in June. Explain why one of the values might be considered an outlier."
"State one reason why data might be missing from the large data set."
"A student claims that temperatures are higher at Heathrow than at Leuchars. Using your knowledge of the large data set, explain why this claim is likely to be correct."

These questions reward students who have genuinely explored the data and understood the context.

Summary

The AQA large data set contains weather data from several UK (and possibly overseas) stations.
Variables include temperature, rainfall, sunshine, wind speed, cloud cover, pressure, visibility, and humidity.
Understanding units and what constitutes a realistic range for each variable is essential.
Anomalies may arise from genuine extreme weather, recording errors, or equipment failure.
Missing data is common in real-world data sets and must be handled thoughtfully.
Thorough exploration of the data — including summary statistics, visualisations, and comparisons — builds the familiarity needed for exam success.

Exam Tip: When a question asks you to comment on a data value from the large data set, always refer to what you know about the station and the variable. For example: "A daily mean temperature of 22°C at Camborne in January would be unusually high, as Camborne typically experiences mild but not warm winter temperatures due to its south-westerly maritime location."

A-Level Deep Dive: Exploring the Data

Spec mapping

AQA 7357 specification, section S (Statistics), sub-strands S1 (Statistical sampling) and S2 (Data presentation and interpretation), requires candidates to "become familiar with one or more specific large data sets in advance of the assessment ... use technology to explore data sets, calculate summary statistics, and undertake exploratory data analysis" (refer to the official specification document for exact wording). The AQA Large Data Set (LDS) underpins explicit Paper 3 questions in which candidates must already know the structure of the LDS (variable names, units, calendar coverage, missing-value conventions) so that exam-room time goes on inference, not orientation. Exploratory data analysis (EDA) is examined indirectly throughout S2 (data presentation, outliers, correlation interpretation) and directly via "comment in context" marks. The AQA formula booklet provides standard summary-statistic formulae but does not describe the LDS, so that knowledge must be brought in.

Worked example with full mark scheme

Question (8 marks):

A student is exploring a sample of 30 daily observations drawn from the AQA LDS for a single weather station. The recorded variables include: date, daily mean temperature $(^\circ\text{C})$ , daily total rainfall (mm), daily mean wind direction (compass label), and daily maximum gust (knots). Some cells contain the symbol tr (trace), some are blank, and some are clearly numeric. The student computes (for daily mean temperature, after cleaning) $\bar{x} = 14.2$ , sample standard deviation $s = 3.1$ , median $Q_2 = 14.0$ , lower quartile $Q_1 = 12.1$ , upper quartile $Q_3 = 16.4$ , minimum 7.8, maximum 26.5.

(a) Classify each of the four substantive variables (temperature, rainfall, wind direction, gust) as quantitative discrete, quantitative continuous, or qualitative (categorical), justifying briefly. (3)

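(b) State two distinct ways missing or coded values (tr, blank cells) could affect the summary statistics if mishandled, and describe one defensible cleaning rule for each. (2)

(c) Use the $1.5 \times \text{IQR}$ rule to determine whether the maximum value 26.5 is an outlier, and comment on your conclusion in context. (3)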

Solution with mark scheme:

(a) Step 1 — classify each variable.

Daily mean temperature: quantitative continuous (can take any value in a real interval, measured to one decimal place).
Daily total rainfall: quantitative continuous (real-valued non-negative measurement; the tr code is a recording convention, not a separate type).
Daily mean wind direction (compass label, e.g. "NE", "SSW"): qualitative (categorical); the labels are circular rather than ordered (N follows NNW), so they are treated as nominal in tabular form.
Daily maximum gust (knots, integer-recorded): quantitative discrete (recorded as whole knots, though physically continuous — A* candidates note the recording convention).

B1 — temperature and rainfall correctly identified as continuous.

B1 — wind direction correctly identified as qualitative (categorical).

B1 — gust correctly identified as discrete with justification (continuous is also accepted, provided the justification references the recording convention).

A common error is to call gust "continuous" with no further comment because "wind speed is physically continuous" — this misses the AO2 reasoning mark. The examined skill is recognising that recorded data inherits the granularity of its recording rule.

Step 2 — variable-type summary table (best practice).

A* candidates produce a one-line classification table before computing any summary, because the choice of summary statistic is type-determined: means and standard deviations are meaningful only for quantitative data; for categorical wind direction the appropriate summary is a frequency table or modal category.
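
For the categorical case, a frequency table and modal category might be produced as in the following sketch. The ten compass labels are hypothetical and are there only to show the shape of the summary.

```python
from collections import Counter

# Hypothetical compass labels for ten days at one station.
wind_direction = ["SW", "W", "SW", "NE", "SW", "W", "S", "SW", "W", "N"]

frequency = Counter(wind_direction)
modal_direction, modal_count = frequency.most_common(1)[0]

# Frequency table, most common label first.
for label, count in frequency.most_common():
    print(f"{label:>3}: {count}")
print(f"Modal direction: {modal_direction} ({modal_count} days)")
```

No mean or standard deviation is computed here: averaging labels such as "SW" and "NE" has no meaning, which is the point of classifying the variable first.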

(b) Step 1 — name two distinct effects.

Effect 1: treating tr as zero biases the mean rainfall downward (trace rainfall is a small positive amount, conventionally below 0.05 mm); treating tr as missing reduces the sample size and removes near-zero days, which biases the mean of the remaining values upward.
Effect 2: treating blank cells as zero conflates "no measurement taken" with "measurement of zero" — for rainfall these may coincide accidentally, but for temperature a blank cell does not mean the temperature was $0^\circ\text{C}$ .

B1 — two distinct effects clearly articulated (not paraphrases of one another).

Step 2 — defensible cleaning rules.

For tr: replace with a small positive constant (e.g. 0.025 mm — the midpoint of the trace interval) and document the substitution.
For blank cells: treat as missing, exclude from the variable's summary statistics, and report the reduced sample size $n_{\text{used}}$ alongside the statistics.

B1 — both rules are operationally specific (a number, or "exclude and report $n$ ") rather than vague ("handle carefully").
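
The two cleaning rules can be compared side by side in a short Python sketch. The raw cells are invented, `clean_rainfall` is a hypothetical helper, and the substitution value 0.025 mm is the midpoint suggested above rather than an official convention.

```python
import statistics

TRACE_VALUE = 0.025  # assumed substitute: midpoint of the trace interval, in mm


def clean_rainfall(cells, trace_as=TRACE_VALUE):
    """Convert raw cells to numbers.

    Blank and 'n/a' cells are dropped as missing. 'tr' cells are replaced
    by trace_as, or dropped as missing when trace_as is None.
    """
    cleaned = []
    for cell in cells:
        text = str(cell).strip().lower()
        if text in ("", "n/a"):
            continue  # no measurement taken
        if text == "tr":
            if trace_as is not None:
                cleaned.append(trace_as)
            continue
        cleaned.append(float(text))
    return cleaned


raw = ["0.0", "tr", "2.4", "", "tr", "5.6", "0.0", "1.0"]  # hypothetical cells

rules = [("tr -> 0.025", TRACE_VALUE), ("tr -> 0", 0.0), ("tr dropped", None)]
for label, rule in rules:
    values = clean_rainfall(raw, trace_as=rule)
    mean = statistics.mean(values)
    print(f"{label:<12} n_used = {len(values)}  mean = {mean:.3f} mm")
```

With these cells, replacing tr by zero gives a slightly lower mean than the midpoint rule, while dropping tr entirely shrinks the sample and raises the mean noticeably, matching Effect 1.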

(c) Step 1 — compute the interquartile range.

$\text{IQR} = Q_3 - Q_1 = 16.4 - 12.1 = 4.3$

M1 — correct IQR.

Step 2 — compute the upper Tukey fence.

$Q_3 + 1.5 \times \text{IQR} = 16.4 + 1.5 \times 4.3 = 16.4 + 6.45 = 22.85$

M1 — correct fence calculation.

Step 3 — compare and conclude in context.

Since $26.5 > 22.85$ , the value 26.5 is flagged as an outlier under the Tukey rule. In context: this is a hot-day reading, plausible meteorologically, so the candidate should note that being flagged as an outlier does not mean the value is erroneous — it warrants checking, not deletion.

A1 — comparison and contextual conclusion (the AO3 mark — "outlier in the statistical sense, not necessarily an error").
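
As a supplementary check, not part of the marks listed above, the lower fence can be computed in the same way from the quartiles given in the question:

$Q_1 - 1.5 \times \text{IQR} = 12.1 - 1.5 \times 4.3 = 12.1 - 6.45 = 5.65$

Since the minimum satisfies $7.8 > 5.65$ , no value is flagged at the low end, so 26.5 is the only one of the two extremes that lies outside the fences.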

Total: 8 marks (B3 + B2 + M2A1 = 8).

Specimen question modelled on the AQA 7357 Paper 3 format

Question (6 marks): A student samples 40 daily rainfall values from the LDS and tabulates them after cleaning. Summary: $n = 38$ (two cells discarded as missing), $\sum x = 95.0$ mm, $\sum x^2 = 612.5$ , $Q_1 = 0.4$ , $Q_2 = 1.8$ , $Q_3 = 4.1$ , minimum 0, maximum 14.2.

(a) Calculate the sample mean and the sample standard deviation, stating the sample size you have used. (3)

(b) Apply the Tukey $1.5 \times \text{IQR}$ rule to identify whether the maximum is an outlier, and comment on what the candidate should do if it is. (3)

Mark scheme decomposition by AO:

(a)

B1 (AO1.1a) — stating $n = 38$ explicitly (the cleaning has been done; full-size $n = 40$ would be wrong).
M1 (AO1.1b) — $\bar{x} = 95.0 / 38 = 2.5$ mm (exact).
A1 (AO1.1b) — $s^2 = \dfrac{1}{n - 1}\left(\sum x^2 - n\bar{x}^2\right) = \dfrac{1}{37}(612.5 - 38 \times 2.5^2) = \dfrac{1}{37}(612.5 - 237.5) = \dfrac{375}{37} \approx 10.14$ , so $s \approx 3.18$ mm.

(b)

M1 (AO1.1b) — $\text{IQR} = 4.1 - 0.4 = 3.7$ ; upper fence $= 4.1 + 1.5 \times 3.7 = 9.65$ .
A1 (AO2.1) — $14.2 > 9.65$ , so the maximum is flagged as an outlier under the Tukey rule.
A1 (AO3.5a) — comment that an outlier in this rainfall context could be a genuine high-rainfall day; the candidate should verify the original LDS entry rather than delete the value.

Total: 6 marks split AO1 = 4, AO2 = 1, AO3 = 1. This is a balanced EDA question — AO1 fluency on summary statistics, AO2 reasoning to apply the rule correctly, AO3 contextual judgement on what "outlier" means for rainfall.
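
The arithmetic in this mark scheme can be checked with a few lines of Python. This is a verification sketch only; the function names are hypothetical, and the inputs are the summary values given in the question.

```python
import math


def mean_and_sd_from_sums(n, sum_x, sum_x2):
    """Sample mean and sample standard deviation (n - 1 divisor) from totals."""
    mean = sum_x / n
    variance = (sum_x2 - n * mean**2) / (n - 1)
    return mean, math.sqrt(variance)


def upper_fence(q1, q3):
    """Upper Tukey fence: Q3 + 1.5 x IQR."""
    return q3 + 1.5 * (q3 - q1)


mean, sd = mean_and_sd_from_sums(n=38, sum_x=95.0, sum_x2=612.5)
fence = upper_fence(q1=0.4, q3=4.1)

print(f"mean = {mean:.2f} mm, s = {sd:.2f} mm")
print(f"upper fence = {fence:.2f} mm, maximum flagged: {14.2 > fence}")
```

Using $n = 40$ instead of $n = 38$ in the first call would change both the mean and the standard deviation, which is why the B1 mark is tied to stating the sample size.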

Synoptic links

Connects to:

S2 — Data presentation and interpretation: the boxplot is the visual realisation of the same five-number summary used in the Tukey rule. Computing $Q_1, Q_2, Q_3$ from the LDS, then drawing the box and whiskers with explicit fences, is the standard Paper 3 presentation question. Mismatching the box's whiskers (e.g. extending whiskers to min/max regardless of fences) is a common mark-loss pattern.
S2 — Outlier rules: the Tukey $1.5 \times \text{IQR}$ rule is one of two AQA-accepted outlier criteria; the other is $\bar{x} \pm 2s$ (mean $\pm$ two standard deviations). Different rules can flag different points — A* answers state which rule is being used and why. For approximately symmetric data the rules tend to agree; for heavily skewed data (rainfall is right-skewed) they can disagree noticeably.
S1 — Sampling: the LDS is a population in the exam context, and any 30- or 40-day extract is a sample. The candidate's choice of sampling method (systematic, random, stratified by month) directly affects which extreme values appear and therefore which points get flagged as outliers. EDA findings are sample-dependent.
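
To see how the two outlier rules can disagree on right-skewed data, the following sketch applies both to the summary values from the specimen rainfall question above. The candidate value of 9.0 mm is hypothetical and chosen only to fall between the two thresholds.

```python
def tukey_upper(q1, q3):
    """Upper threshold under the 1.5 x IQR rule."""
    return q3 + 1.5 * (q3 - q1)


def two_sd_upper(mean, sd):
    """Upper threshold under the mean + 2 standard deviations rule."""
    return mean + 2 * sd


# Summary values from the specimen rainfall question.
tukey = tukey_upper(q1=0.4, q3=4.1)
two_sd = two_sd_upper(mean=2.5, sd=3.18)

candidate = 9.0  # hypothetical daily rainfall in mm

print(f"Tukey upper fence: {tukey:.2f} mm, flagged: {candidate > tukey}")
print(f"mean + 2s:         {two_sd:.2f} mm, flagged: {candidate > two_sd}")
```

A 9.0 mm day is flagged by the standard deviation rule but not by the Tukey rule, so an answer should always state which rule has been applied.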
