This lesson focuses on the structure and content of AQA's large data set. You will learn about the specific variables measured, the weather stations included, the units used, and how to identify anomalies and missing data. A thorough exploration of the data set is the foundation for all subsequent statistical analysis.
The AQA large data set is typically provided as a multi-sheet spreadsheet. Each sheet corresponds to a different weather station or a different time period. The data set is structured with:
| Column | Variable | Unit | Description |
|---|---|---|---|
| Date | Date of observation | dd/mm/yyyy | The calendar date |
| Daily mean temperature | Temperature | °C | Average of the maximum and minimum temperature for the day |
| Daily total rainfall | Rainfall | mm | Total precipitation recorded over 24 hours |
| Daily total sunshine | Sunshine | hours | Total hours of bright sunshine |
| Daily mean wind speed | Wind speed | kn (knots) | Average wind speed over 24 hours |
| Daily mean wind direction | Wind direction | ° (degrees) or compass bearing | Prevailing wind direction |
| Daily maximum gust | Gust | kn (knots) | Highest instantaneous wind speed |
| Daily mean cloud cover | Cloud cover | oktas | Average cloud cover (0–8 scale) |
| Daily mean visibility | Visibility | Dm (decametres) | Average visibility |
| Daily mean pressure | Pressure | hPa (hectopascals) | Average atmospheric pressure at sea level |
| Daily maximum relative humidity | Humidity | % | Highest relative humidity during the day |
Note: The exact variables included may vary slightly between examination series. Always check the current data set from AQA's website.
The data set includes readings from several UK weather stations and, in some versions, overseas stations. Each station has distinct geographical characteristics that influence its weather patterns.
| Station | Region | Notable features |
|---|---|---|
| Heathrow | South-east England | Urban heat island effect; relatively dry; moderate temperatures |
| Hurn | Southern England (Dorset) | Coastal influence; mild winters |
| Leeming | Northern England (Yorkshire) | Inland; colder winters; frost-prone |
| Camborne | South-west England (Cornwall) | Maritime climate; windy; mild and wet |
| Leuchars | Eastern Scotland (Fife) | Coastal; cool summers; variable winds |
Overseas stations often represent contrasting climates — for example, stations in Beijing (continental), Jacksonville (subtropical), or Perth (Mediterranean). These allow comparisons between very different weather patterns.
Understanding these factors helps you interpret the data in context — for example, explaining why Camborne has milder winters than Leeming, or why Heathrow tends to record higher temperatures than other UK stations.
Correct interpretation of the data depends on understanding the units used:
| Variable | Unit | Notes |
|---|---|---|
| Temperature | °C | Celsius; negative values are below freezing |
| Rainfall | mm | 1 mm of rain = 1 litre per square metre; tr means trace (< 0.05 mm) |
| Sunshine | hours | Cannot exceed the astronomical maximum for that date and latitude |
| Wind speed | kn (knots) | 1 knot ≈ 1.15 mph ≈ 0.514 m/s |
| Wind direction | ° (degrees) | Measured clockwise from north; 0° or 360° = north, 90° = east, etc. |
| Cloud cover | oktas | 0 = clear sky, 8 = completely overcast |
| Pressure | hPa | Standard atmospheric pressure ≈ 1013 hPa |
| Visibility | Dm (decametres) | 1 Dm = 10 m; higher values mean better visibility |
| Humidity | % | 100% = air is fully saturated with moisture |
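The conversion factors in the table above can be checked with a few lines of code. This is a minimal sketch: the function names are illustrative only, and the factors are the approximate ones quoted in the table (1 kn ≈ 1.15 mph ≈ 0.514 m/s, 1 Dm = 10 m).

```python
# Approximate conversion factors, as quoted in the units table.
MPH_PER_KNOT = 1.15
MS_PER_KNOT = 0.514
METRES_PER_DECAMETRE = 10


def knots_to_mph(knots: float) -> float:
    """Convert a wind speed in knots to miles per hour (approximate)."""
    return knots * MPH_PER_KNOT


def knots_to_ms(knots: float) -> float:
    """Convert a wind speed in knots to metres per second (approximate)."""
    return knots * MS_PER_KNOT


def decametres_to_metres(dm: float) -> float:
    """Convert a visibility reading in decametres to metres."""
    return dm * METRES_PER_DECAMETRE


# Hypothetical readings, chosen only to illustrate the arithmetic.
gust_mph = knots_to_mph(20)                # about 23 mph
visibility_m = decametres_to_metres(2500)  # 25 000 m, i.e. 25 km
```

Converting a reading into a familiar unit before commenting on it is a quick sanity check: a visibility of 2500 Dm is 25 km, not 2.5 km.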
Students frequently make mistakes with units in exam questions. For example:
An anomaly is a data value that does not fit the expected pattern. In the context of weather data, anomalies may arise from:
Real-world data sets almost always contain missing values. In the AQA large data set, missing data may appear as:
| Representation | Meaning |
|---|---|
| Blank cell | No reading was recorded |
| n/a | Not available — the reading could not be taken |
| tr | Trace — rainfall too small to measure (not truly missing, but often needs special handling) |
| Specific codes | Some data sets use numeric codes (e.g., −99) to indicate missing values |
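One way to make these conventions explicit is to normalise every cell before doing any arithmetic. The sketch below assumes cells arrive as text; the function name `parse_cell`, the code −99, and the default substitution for `tr` are all illustrative assumptions, so check them against the actual data set you are using.

```python
from typing import Optional

MISSING_MARKERS = {"", "n/a"}  # blank cell or explicit "not available"
MISSING_CODES = {-99.0}        # example numeric code only; check the real data set
TRACE_MARKER = "tr"


def parse_cell(raw: str, trace_value: float = 0.0) -> Optional[float]:
    """Return the reading as a float, or None if it is missing.

    trace_value is what a 'tr' rainfall entry is replaced with. The choice
    (0, 0.025, ...) is an analysis decision and should be documented.
    """
    text = raw.strip().lower()
    if text in MISSING_MARKERS:
        return None
    if text == TRACE_MARKER:
        return trace_value
    value = float(text.replace("−", "-"))  # tolerate a typographic minus sign
    if value in MISSING_CODES:
        return None
    return value


cleaned = [parse_cell(c) for c in ["3.2", "tr", "", "n/a", "-99"]]
# cleaned is [3.2, 0.0, None, None, None]
```

Keeping missing values as `None` rather than as a number stops them being averaged in by accident.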
Missing data can affect statistical analysis in several ways:
To build genuine familiarity, carry out the following activities with the data set:
For a single station, create a table showing the minimum, maximum, mean, median, and standard deviation for each numerical variable. Note any values that seem unusual.
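A spreadsheet will do this, but the same table can be sketched in a few lines of Python using only the standard library. The temperatures below are made up for illustration and are not taken from the data set; `summarise` is a hypothetical helper name.

```python
import statistics


def summarise(values):
    """Return the five summary measures named in the activity.

    `values` must already be cleaned: numeric only, no missing entries.
    """
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample standard deviation (divisor n - 1)
    }


# Made-up daily mean temperatures (°C), not real readings.
temps = [12.0, 14.0, 13.0, 15.0, 16.0]
summary = summarise(temps)
# mean 14.0, median 14.0, sd = sqrt(2.5), roughly 1.58
```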
Select one variable (e.g., daily mean temperature) and compare its distribution across different months or seasons. Use appropriate charts to present your findings.
Compare two stations on the same variable. Identify and explain any differences in terms of geography (latitude, altitude, proximity to coast).
For each variable at each station, count the number of missing values. Is the missing data concentrated in certain time periods? Could this affect any conclusions you might draw?
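The count itself is easy to automate. The sketch below assumes each day is held as a dictionary of text cells and that blanks and `n/a` are the only missing markers; the rows and dates are invented for illustration. Note that `tr` is deliberately not counted as missing.

```python
def count_missing(rows, missing_markers=("", "n/a")):
    """Count missing entries per column in a list of dict rows."""
    counts = {}
    for row in rows:
        for column, cell in row.items():
            is_missing = cell is None or str(cell).strip().lower() in missing_markers
            counts[column] = counts.get(column, 0) + (1 if is_missing else 0)
    return counts


# Invented rows for illustration; not real readings.
rows = [
    {"date": "01/06/2015", "rainfall": "tr", "sunshine": "7.1"},
    {"date": "02/06/2015", "rainfall": "", "sunshine": "n/a"},
    {"date": "03/06/2015", "rainfall": "2.4", "sunshine": ""},
]
missing = count_missing(rows)
# missing is {"date": 0, "rainfall": 1, "sunshine": 2}
```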
In the exam, you might see questions such as:
These questions reward students who have genuinely explored the data and understood the context.
Exam Tip: When a question asks you to comment on a data value from the large data set, always refer to what you know about the station and the variable. For example: "A daily mean temperature of 22°C at Camborne in January would be unusually high, as Camborne typically experiences mild but not warm winter temperatures due to its south-westerly maritime location."
AQA 7357 specification, section S (Statistics), sub-strands S1 (Statistical sampling) and S2 (Data presentation and interpretation), covers the following (refer to the official specification document for exact wording): become familiar with one or more specific large data sets in advance of the assessment ... use technology to explore data sets, calculate summary statistics, and undertake exploratory data analysis. The AQA Large Data Set (LDS) underpins explicit Paper 3 questions in which candidates must already know the structure of the LDS (variable names, units, calendar coverage, missing-value conventions) so that exam-room time goes on inference, not orientation. Exploratory data analysis (EDA) is examined indirectly throughout S2 (data presentation, outliers, correlation interpretation) and directly via "comment in context" marks. The AQA formula booklet provides standard summary-statistic formulae but does not describe the LDS; that knowledge must be brought in.
Question (8 marks):
A student is exploring a sample of 30 daily observations drawn from the AQA LDS for a single weather station. The recorded variables include: date, daily mean temperature (°C), daily total rainfall (mm), daily mean wind direction (compass label), and daily maximum gust (knots). Some cells contain the symbol tr (trace), some are blank, and some are clearly numeric. The student computes (for daily mean temperature, after cleaning) $\bar{x} = 14.2$, sample standard deviation $s = 3.1$, median $Q_2 = 14.0$, lower quartile $Q_1 = 12.1$, upper quartile $Q_3 = 16.4$, minimum 7.8, maximum 26.5.
(a) Classify each of the four substantive variables (temperature, rainfall, wind direction, gust) as quantitative discrete, quantitative continuous, or qualitative (categorical), justifying briefly. (3)
(b) State two distinct ways missing or coded values (tr, blank cells) could affect the summary statistics if mishandled, and describe one defensible cleaning rule for each. (2)
(c) Using the $Q_3 + 1.5 \times \text{IQR}$ Tukey rule, determine whether the maximum value 26.5 is an outlier. (3)
Solution with mark scheme:
(a) Step 1 — classify each variable.
Rainfall is quantitative continuous (the tr code is a recording convention, not a separate type).
B1 — temperature and rainfall correctly identified as continuous.
B1 — wind direction correctly identified as qualitative (categorical).
B1 — gust correctly identified as discrete with justification (accept continuous with justification, but in either case the justification must reference the recording convention).
A common error is to call gust "continuous" with no further comment because "wind speed is physically continuous" — this misses the AO2 reasoning mark. The examined skill is recognising that recorded data inherits the granularity of its recording rule.
Step 2 — variable-type summary table (best practice).
A* candidates produce a one-line classification table before computing any summary, because the choice of summary statistic is type-determined: means and standard deviations are meaningful only for quantitative data; for categorical wind direction the appropriate summary is a frequency table or modal category.
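For a categorical variable such as wind direction, the frequency table and modal category can be sketched as follows. The compass labels are made up for illustration.

```python
from collections import Counter

# Made-up compass labels, for illustration only.
wind_directions = ["SW", "W", "SW", "NW", "SW", "W", "S"]

# Frequency table: how often each label occurs.
frequency_table = Counter(wind_directions)

# Modal category: the label with the highest frequency.
modal_direction, modal_count = frequency_table.most_common(1)[0]
# modal_direction is "SW", modal_count is 3
```

Taking a mean of these labels would be meaningless, which is exactly why the variable type has to be settled before choosing a summary.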
(b) Step 1 — name two distinct effects.
Treating tr as zero biases the mean rainfall downward (trace rainfall is a small positive amount, conventionally below 0.05 mm); treating tr as missing reduces the sample size and inflates variance estimators if not handled.
B1 — two distinct effects clearly articulated (not paraphrases of one another).
Step 2 — defensible cleaning rules.
tr: replace with a small positive constant (e.g. 0.025 mm, the midpoint of the trace interval) and document the substitution.
B1 — both rules are operationally specific (a number, or "exclude and report n") rather than vague ("handle carefully").
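To see how much the choice of rule matters, the sketch below applies three treatments of tr to the same made-up rainfall column and compares the resulting sample sizes and means. The data and the helper name `clean` are invented for illustration.

```python
import statistics

# Made-up rainfall column: numeric strings, two trace entries and one blank.
raw = ["0.0", "tr", "1.2", "tr", "", "3.4"]


def clean(cells, trace_value):
    """Drop blanks; replace 'tr' with trace_value (None means drop it too)."""
    out = []
    for cell in cells:
        if cell == "":
            continue
        if cell == "tr":
            if trace_value is not None:
                out.append(trace_value)
            continue
        out.append(float(cell))
    return out


as_zero = clean(raw, 0.0)        # n = 5
as_midpoint = clean(raw, 0.025)  # n = 5
as_missing = clean(raw, None)    # n = 3

mean_zero = statistics.mean(as_zero)          # about 0.92
mean_midpoint = statistics.mean(as_midpoint)  # about 0.93
mean_missing = statistics.mean(as_missing)    # about 1.53
```

On this toy column the difference between zero and the midpoint is small, but discarding trace days changes both n and the mean substantially, which is why the rule used should always be reported.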
(c) Step 1 — compute the IQR.
$\text{IQR} = Q_3 - Q_1 = 16.4 - 12.1 = 4.3$
M1 — correct IQR.
Step 2 — compute the upper Tukey fence.
$Q_3 + 1.5 \times \text{IQR} = 16.4 + 1.5 \times 4.3 = 16.4 + 6.45 = 22.85$
M1 — correct fence calculation.
Step 3 — compare and conclude in context.
Since $26.5 > 22.85$, the value 26.5 is flagged as an outlier under the Tukey rule. In context: this is a hot-day reading, plausible meteorologically, so the candidate should note that being flagged as an outlier does not mean the value is erroneous; it warrants checking, not deletion.
A1 — comparison and contextual conclusion (the AO3 mark — "outlier in the statistical sense, not necessarily an error").
Total: 8 marks (B3 + B2 + M2A1 = 8).
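The fence arithmetic in part (c) can be checked with a short script. This is only a sketch using the quartiles given in the question; the function names are illustrative. The lower fence is included for completeness, although the question asks only about the maximum.

```python
def upper_tukey_fence(q1: float, q3: float, k: float = 1.5) -> float:
    """Upper fence: Q3 + k * IQR."""
    return q3 + k * (q3 - q1)


def lower_tukey_fence(q1: float, q3: float, k: float = 1.5) -> float:
    """Lower fence: Q1 - k * IQR."""
    return q1 - k * (q3 - q1)


# Values from the question.
q1, q3 = 12.1, 16.4
minimum, maximum = 7.8, 26.5

upper = upper_tukey_fence(q1, q3)  # about 22.85
lower = lower_tukey_fence(q1, q3)  # about 5.65

max_is_outlier = maximum > upper   # True: 26.5 lies above the upper fence
min_is_outlier = minimum < lower   # False: 7.8 lies inside the fences
```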
Question (6 marks): A student samples 40 daily rainfall values from the LDS and tabulates them after cleaning. Summary: $n = 38$ (two cells discarded as missing), $\sum x = 95.0$ mm, $\sum x^2 = 612.5$, $Q_1 = 0.4$, $Q_2 = 1.8$, $Q_3 = 4.1$, minimum 0, maximum 14.2.
(a) Calculate the sample mean and the sample standard deviation, stating the sample size you have used. (3)
(b) Apply the Tukey $1.5 \times \text{IQR}$ rule to identify whether the maximum is an outlier, and comment on what the candidate should do if it is. (3)
Mark scheme decomposition by AO:
(a)
(b)
Total: 6 marks split AO1 = 4, AO2 = 1, AO3 = 1. This is a balanced EDA question — AO1 fluency on summary statistics, AO2 reasoning to apply the rule correctly, AO3 contextual judgement on what "outlier" means for rainfall.
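The arithmetic for this question can be checked from the summary totals alone. The sketch below is not the official mark scheme. It uses $n = 38$ and reports the standard deviation with both the $n - 1$ and the $n$ divisor, since which one is expected is an assumption you should confirm against the question wording.

```python
import math

# Summary values given in the question.
n = 38
sum_x = 95.0
sum_x2 = 612.5
q1, q3, maximum = 0.4, 4.1, 14.2

mean = sum_x / n                      # 2.5 mm
sxx = sum_x2 - sum_x ** 2 / n         # sum of squared deviations, 375
sd_sample = math.sqrt(sxx / (n - 1))  # divisor n - 1, about 3.18
sd_population = math.sqrt(sxx / n)    # divisor n, about 3.14

upper_fence = q3 + 1.5 * (q3 - q1)    # 4.1 + 1.5 * 3.7, about 9.65
max_is_outlier = maximum > upper_fence  # True: 14.2 is above the fence
```

A maximum of 14.2 mm is flagged by the rule, but a single wet day is entirely plausible for rainfall, so the sensible action is to check the value and keep it unless there is evidence of a recording error.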
Connects to:
S2 — Data presentation and interpretation: the boxplot is the visual realisation of the same five-number summary used in the Tukey rule. Computing $Q_1, Q_2, Q_3$ from the LDS, then drawing the box and whiskers with explicit fences, is the standard Paper 3 presentation question. Mismatching the box's whiskers (e.g. extending whiskers to min/max regardless of fences) is a common mark-loss pattern.
S2 — Outlier rules: the Tukey $1.5 \times \text{IQR}$ rule is one of two AQA-accepted outlier criteria; the other is $\bar{x} \pm 2s$ (mean ± two standard deviations). Different rules can flag different points, so A* answers state which rule is being used and why. For approximately symmetric data the rules tend to agree; for heavily skewed data (rainfall is right-skewed) they can disagree noticeably.
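The disagreement between the two rules is easy to demonstrate on a small right-skewed sample. The values below are made up to resemble rainfall and are not from the LDS. Quartiles are computed with Python's `statistics.quantiles` using its "inclusive" method, which is only one of several quartile conventions, so a calculator or spreadsheet may give slightly different quartiles.

```python
import statistics

# Made-up, right-skewed "rainfall-like" values (mm), sorted for readability.
data = [0.0, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.5, 2.0, 3.0, 4.0, 8.0, 9.0]

# Rule 1: Tukey fences at 1.5 * IQR beyond the quartiles.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
tukey_flags = [x for x in data if x > q3 + 1.5 * iqr or x < q1 - 1.5 * iqr]

# Rule 2: more than two sample standard deviations from the mean.
mean = statistics.mean(data)
s = statistics.stdev(data)
two_sd_flags = [x for x in data if abs(x - mean) > 2 * s]

# tukey_flags is [8.0, 9.0]; two_sd_flags is [9.0]
```

Here the two large values inflate $s$, which widens the $\bar{x} \pm 2s$ interval enough to let 8.0 through, while the quartile-based fence is unaffected by them.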
S1 — Sampling: the LDS is a population in the exam context, and any 30- or 40-day extract is a sample. The candidate's choice of sampling method (systematic, random, stratified by month) directly affects which extreme values appear and therefore which points get flagged as outliers. EDA findings are sample-dependent.
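Two of the sampling methods mentioned can be sketched as follows. The "population" here is just a list of day numbers standing in for LDS rows, and the function names, the seed, and the sample size are illustrative assumptions.

```python
import random


def systematic_sample(values, size, start=0):
    """Take every k-th value, with k = len(values) // size, from index `start`.

    `start` should be less than k so that the last index stays in range.
    """
    step = len(values) // size
    return [values[start + i * step] for i in range(size)]


def simple_random_sample(values, size, seed=None):
    """Draw `size` distinct values at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(values, size)


# Stand-in population: day numbers 1 to 180, not real readings.
population = list(range(1, 181))

sys_sample = systematic_sample(population, 30)            # days 1, 7, 13, ..., 175
rand_sample = simple_random_sample(population, 30, seed=1)
```

Running the random sample with different seeds gives different sets of days, which is the practical meaning of "EDA findings are sample-dependent": an extreme day may or may not be drawn.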