This lesson covers the essential processes of data cleaning and preparation — the steps required to transform raw data into a form suitable for statistical analysis. In real-world statistics (and in A-Level exam questions), data is rarely perfect. Learning to identify and handle problems systematically is a key skill.
Raw data collected from real-world sources almost always contains imperfections. If these are not addressed before analysis, the results may be misleading, inaccurate, or invalid. Data cleaning is the process of detecting and correcting (or removing) errors, inconsistencies, and gaps in a data set.
In the context of the AQA large data set, the most common problems are missing readings, anomalous or physically implausible values, and inconsistent units or recording conventions.
Missing values are one of the most common issues in real data. There are several strategies for dealing with them, each with advantages and disadvantages.
Remove any observation (row) that contains a missing value for the variable(s) of interest.
| Advantages | Disadvantages |
|---|---|
| Simple to implement | Reduces sample size |
| Ensures complete data for analysis | May introduce bias if data is not missing at random |
When to use: When the proportion of missing data is small (say, less than 5%) and there is no pattern to the missingness.
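As a minimal sketch of listwise deletion (the readings below are hypothetical, and the 20% missingness is exaggerated for illustration), the strategy amounts to filtering out the gaps before computing any statistic:

```python
import statistics

# Hypothetical daily mean temperatures (°C); None marks a missing reading.
temps = [18.2, 19.1, None, 20.4, 17.8, 19.6, 18.9, 21.0, 18.4, None]

missing_rate = temps.count(None) / len(temps)   # proportion of missing values
complete = [t for t in temps if t is not None]  # listwise deletion

# Deletion is only defensible when the missing proportion is small and
# there is no pattern to the missingness (here 20% is for illustration).
print(f"missing: {missing_rate:.0%}, mean of remaining: {statistics.mean(complete):.2f}")
```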
Replace missing values with estimated values. Common imputation methods include:
| Method | Effect on mean | Effect on spread |
|---|---|---|
| Mean imputation | Preserves the mean | Reduces the standard deviation (values cluster around the mean) |
| Median imputation | May slightly change the mean | Less distortion than mean imputation |
| Interpolation | Depends on surrounding values | Preserves local trends |
Important: Any imputation method introduces assumptions. In exam questions, you should state the method used and acknowledge that it may affect the reliability of the analysis.
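The effect of mean imputation on spread, noted in the table above, can be demonstrated directly; the sample below is hypothetical:

```python
import statistics

# Hypothetical sample with two missing readings recorded as None.
data = [12.0, 15.0, None, 14.0, 18.0, None, 11.0, 16.0]
observed = [x for x in data if x is not None]
mean_obs = statistics.mean(observed)

# Mean imputation: fill each gap with the observed mean. The overall mean
# is essentially unchanged, but the standard deviation shrinks because
# the imputed values sit exactly on the mean.
imputed = [x if x is not None else mean_obs for x in data]
print(statistics.mean(imputed), statistics.pstdev(observed), statistics.pstdev(imputed))
```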
Analyse only the available data for each variable separately, using different sample sizes as appropriate. This avoids the bias introduced by deletion or imputation but makes comparisons harder.
A typical exam question might ask: "Three daily mean temperature readings are missing from the data for July. Describe how you would deal with these missing values before calculating the mean daily temperature for the month."
A good answer would discuss omitting the missing values and calculating the mean from the remaining observations, or interpolating based on adjacent days, and would note how the chosen method might affect the result.
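Both options can be sketched in a few lines; the July values below are hypothetical and chosen so that no two missing days are adjacent, which is what makes simple interpolation possible:

```python
import statistics

# Hypothetical July run with three missing daily mean temperatures (None).
july = [16.0, 17.2, None, 18.1, 17.5, None, 16.8, None, 18.4, 17.0]

# Option 1: omit the missing days and average the remaining readings.
present = [t for t in july if t is not None]
mean_omit = statistics.mean(present)

# Option 2: fill each gap by linear interpolation from the adjacent days
# (possible here only because no two missing days are consecutive).
filled = july[:]
for i, t in enumerate(filled):
    if t is None and 0 < i < len(filled) - 1:
        filled[i] = (filled[i - 1] + filled[i + 1]) / 2
print(mean_omit, statistics.mean(filled))
```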
An outlier is a data value that is markedly different from the rest of the data. In A-Level Mathematics, outliers are typically identified using one of two rules:
A value is an outlier if it lies more than 1.5 × IQR below the lower quartile Q1 or more than 1.5 × IQR above the upper quartile Q3,
where IQR = Q3 − Q1.
A value is an outlier if it lies more than 2 (or sometimes 3) standard deviations from the mean:
|x − x̄| > 2σ
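Both identification rules are mechanical once the summary statistics are known. A sketch using hypothetical gust readings, with the 150 kn value planted as the anomaly:

```python
import statistics

def iqr_outliers(data):
    """Flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4)   # exclusive method by default
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

def sd_outliers(data, k=2):
    """Flag values more than k standard deviations from the mean."""
    m, s = statistics.mean(data), statistics.stdev(data)
    return [x for x in data if abs(x - m) > k * s]

# Hypothetical daily maximum gusts (kn) with one planted anomaly.
gusts = [22, 25, 19, 30, 27, 24, 150, 21, 26, 23]
print(iqr_outliers(gusts), sd_outliers(gusts))
```

Note that the two rules need not agree in general: a single large outlier inflates the standard deviation, which can mask other moderately extreme values from the 2σ rule while the IQR rule still catches them.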
| Action | When appropriate |
|---|---|
| Keep | The value is genuine and represents real variation in the data |
| Investigate | Check whether the value could be an error (cross-reference with other variables or sources) |
| Remove | The value is confirmed as an error |
| Report separately | The value is genuine but extreme, and you want to show its effect on the analysis |
If the daily maximum gust at Leeming is recorded as 150 kn on a single day, with all other values below 60 kn, this would be flagged as an outlier. You would then investigate whether the reading is a genuine extreme event or a recording error, for example by cross-referencing it with the readings for adjacent days or with other variables (such as the daily mean wind speed) for the same day.
Real data sets may contain inconsistencies in how values are recorded. Common issues include mixed units (for example, knots and metres per second in the same column), different codes for missing or special values (blank cells, "n/a", "-", "tr"), and inconsistent formats for dates or station names.
Once the data has been cleaned, several preparation steps may be needed:
Choose only the variables relevant to your analysis. If you are investigating the relationship between temperature and sunshine, you do not need wind speed or pressure data.
Select only the observations that are relevant — for example, data from a specific month, season, or station.
Sort the data by date, by variable value, or by station, depending on the analysis you plan to carry out.
Sometimes you need to create new variables from existing ones. For example, a daily temperature range can be derived as maximum minus minimum temperature, or wind speeds recorded in different units can be converted onto a single consistent scale.
Group data into classes for frequency tables and histograms. Choose class widths that are appropriate for the data range and sample size.
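A sketch of equal-width classing with hypothetical temperatures; the class width and lower bound here are arbitrary but sensible choices for this range:

```python
# Hypothetical temperatures (°C) grouped into equal-width classes.
temps = [14.2, 15.8, 16.1, 17.5, 18.0, 18.3, 19.9, 20.4, 21.7, 22.1]

width = 2.0    # chosen class width
lower = 14.0   # lower bound of the first class

classes = {}
for t in temps:
    k = int((t - lower) // width)        # index of the class containing t
    lo = lower + k * width
    key = f"{lo:.0f}-{lo + width:.0f}"   # e.g. "14-16" means 14 <= t < 16
    classes[key] = classes.get(key, 0) + 1
print(classes)
```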
Before proceeding with analysis, carry out a systematic quality check:
| Check | Action |
|---|---|
| Are there missing values? | Count and decide on a strategy |
| Are there obvious errors? | Flag and investigate values outside the physically possible range |
| Are units consistent? | Convert if necessary |
| Are there outliers? | Identify and investigate using IQR or standard deviation rules |
| Is the data complete enough for the intended analysis? | Ensure sufficient data for reliable results |
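The checklist can be mechanised for a single numeric column; the quality_check helper, the example values, and the plausible range below are all hypothetical:

```python
# Hypothetical pre-analysis checklist for one numeric column.
def quality_check(values, lo, hi):
    """values: raw entries (numbers, or None for missing);
    lo, hi: the physically possible range for this variable."""
    report = {
        "n": len(values),
        "missing": sum(v is None for v in values),
        "impossible": sum(v is not None and not (lo <= v <= hi) for v in values),
    }
    report["usable"] = report["n"] - report["missing"] - report["impossible"]
    return report

# Daily mean wind speeds (kn): two missing entries, one implausible value.
speeds = [12.0, 9.5, None, 15.2, 185.0, None, 11.8, 13.4]
print(quality_check(speeds, lo=0, hi=120))
```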
In both coursework and exam answers, it is good practice to state clearly what data cleaning steps you have taken and why. For example:
"I removed three observations from the data set because the daily mean temperature values were blank, indicating missing readings. This reduced the sample size from 31 to 28 for July. I chose to omit these values rather than impute them because the proportion of missing data was small (less than 10%) and there was no obvious pattern to the missingness."
This shows the examiner that you have thought critically about data quality and can justify your decisions.
Exam Tip: If a question asks you to "clean" or "prepare" data, do not simply say "remove the outliers." Explain how you would identify them (e.g., using Q1−1.5×IQR), what you would do with them (keep, investigate, or remove), and why your choice is appropriate in the given context.
AQA 7357 Paper 3 — Statistics, Large Data Set context (subject content sections N — Statistical sampling, O — Data presentation and interpretation, and the LDS appendix). The published specification requires students to "become familiar with one or more specific large data sets in advance of the final assessment" and to be able to "interpret real data presented in summary or graphical form" and "use … the data set in the context in which it is presented." The cleaning, preparation and pre-analysis stages — handling missing values, anomaly identification, unit harmonisation, coding via y=ax+b — are not a separate sub-strand but are woven through Paper 3 questions: any LDS item presupposes that the candidate has already wrestled with the messiness of the raw data. The same skills feed forward into Paper 3 hypothesis testing (where outliers can dominate test statistics), Paper 3 probability distributions (where modelling assumptions are sensitive to data quality), and synoptically into Paper 1 / Paper 2 modelling sections wherever a "real-world" dataset is invoked. The AQA formula booklet does not list cleaning protocols — these are working-practice expectations rather than memorisable formulae.
Question (8 marks):
A student is analysing a sample of n = 30 daily mean wind speeds (knots) from a coastal weather station, drawn from the LDS context. The raw extract shows: most values lie between 5 and 25 knots; three observations are recorded as the string "n/a"; one value is recorded as 185 knots; and two values are recorded in metres per second rather than knots (a transcription error). The student computes summary statistics: x̄ = 18.4, s = 31.2.
(a) Identify the issues with the dataset and state, with a one-line justification each, how each should be treated before further analysis. (4)
(b) After appropriate cleaning, the student codes the wind speeds using y = (x − 10)/5. Given that the cleaned mean of x is x̄ = 12.6 knots and the cleaned standard deviation is sx = 4.8 knots, find ȳ and sy. (4)
Solution with mark scheme:
(a) Step 1 — classify the issues.
There are three distinct data-quality issues:
Missing values (the "n/a" strings): three entries carry no numerical value. Treatment: flag as missing rather than silently delete; report the missingness rate (3/30 = 10%) alongside any analysis. If the missingness is plausibly missing completely at random (MCAR), pairwise deletion or mean imputation may be defensible; otherwise the bias must be acknowledged.
Implausible value (185 knots): far outside the stated 5–25 knot range and physically implausible as a daily mean; investigate as a likely transcription error and exclude with justification.
Unit mismatch (two values in m/s): convert to knots so that all observations are on a consistent scale before any summary statistics are computed.
M1 — identifies all three issues by category. A1 — gives a defensible treatment for each. M1 A1 — explains why silent deletion is wrong (introduces bias / loses information about the missingness mechanism).
(b) Step 1 — apply the linear-coding rules.
For y = (x − 10)/5 = (1/5)x − 2:
ȳ = (x̄ − 10)/5 = (12.6 − 10)/5 = 2.6/5 = 0.52
M1 — applies ȳ = ax̄ + b correctly with a = 1/5, b = −2.
A1 — ȳ = 0.52.
Step 2 — standard deviation under linear coding.
sy = |a| × sx = (1/5) × 4.8 = 0.96
M1 — recognises that the additive constant −2 does not affect spread; only the multiplicative factor 1/5 does.
A1 — sy=0.96.
Total: 8 marks (M4 A4). The candidate must also note that the coding result is only meaningful after the cleaning in part (a) — coding the contaminated data would propagate the 185-knot outlier into a wildly inflated sx.
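The coding identities used in part (b) can be verified numerically; the cleaned sample below is hypothetical but chosen to have mean 12.6 knots:

```python
import statistics

# Hypothetical cleaned sample chosen to have mean 12.6 knots, as in part (b).
x = [7.8, 10.2, 12.6, 15.0, 17.4]
a, b = 1 / 5, -2                   # y = (x - 10)/5 = (1/5)x - 2
y = [a * xi + b for xi in x]

# Linear coding: the mean transforms as ybar = a*xbar + b, while the
# standard deviation picks up only the scale factor |a| (the shift b drops out).
print(statistics.mean(y), statistics.stdev(y))
```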
Question (6 marks): A meteorologist downloads daily rainfall totals (mm) for a coastal station for the months in the LDS context. The raw column contains entries: 12.4, 0, tr (the meteorological symbol for "trace"), - (no observation), # (sensor fault), and several blank cells.
(a) Explain why simply replacing every non-numeric entry with 0 would bias the mean rainfall, distinguishing the treatment of tr, -, and #. (4)
(b) Suggest one defensible cleaning rule for each of the three non-numeric symbols, and state which rule preserves the original missingness information for downstream analysis. (2)
Mark scheme decomposition by AO:
(a)
- tr represents a small but non-zero rainfall (typically below the gauge resolution), so replacing it with 0 systematically under-states wet-day frequency.
- "-" represents no observation, not zero rainfall; replacing it with 0 confuses missingness with a true measurement of "no rain."
- "#" represents a sensor fault, so the underlying weather state is unknown; replacing it with 0 asserts knowledge the data does not support.

(b)
- Encode tr as a small positive value (e.g. 0.05 mm, half the gauge resolution); encode "-" as NA (true missing); encode "#" as NA with a "fault" flag retained in a parallel column.
- The NA encoding (rather than 0 or deletion) is the rule that preserves missingness information for any later sensitivity analysis.

Total: 6 marks split AO2 = 2, AO3 = 4. Cleaning questions on Paper 3 are AO3-dominated because the assessment objective is "translate problems in mathematical or non-mathematical contexts into mathematical processes" — exactly the work of deciding what counts as data.
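One possible implementation of rules along these lines; the decode helper and the 0.05 mm trace value are assumptions for illustration, not prescribed by the specification:

```python
# Hypothetical decoding of a raw rainfall column: tr -> small positive value,
# "-" and "#" -> None (missing), with the fault flag kept in a parallel list.
TRACE = 0.05   # assumed half-gauge-resolution value for "tr"

def decode(entry):
    """Return (value, fault_flag) for one raw rainfall entry."""
    if entry in ("-", "", None):
        return None, False          # no observation: true missing, not zero
    if entry == "#":
        return None, True           # sensor fault: missing, flagged
    if entry == "tr":
        return TRACE, False         # trace: small but non-zero rainfall
    return float(entry), False

raw = ["12.4", "0", "tr", "-", "#", ""]
values, faults = zip(*(decode(e) for e in raw))
print(values, faults)
```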
Connects to:
Statistical sampling (subject content N): the choice of cleaning rule interacts with sampling. If a stratified sample over-samples nights (when sensors more often fault), silent deletion of # entries biases the cleaned dataset toward day-time observations. Cleaning is not a neutral pre-processing step — it can un-do careful sampling design.
Data presentation and interpretation (subject content O): outlier identification by the 1.5×IQR rule (boxplots) presupposes a clean dataset. Running the rule on a dataset that still contains transcription errors mis-labels real observations as outliers and vice versa.
Statistical distributions (subject content P) — normal modelling: the normal model is heavily sensitive to extreme values; one un-treated 185-knot reading shifts xˉ and inflates s enough to break any subsequent normality assumption. Cleaning is therefore a prerequisite for distributional modelling, not an optional polish.
Hypothesis testing (subject content Q): the test statistic for a single-sample mean test, Z = (X̄ − μ0)/(s/√n), is dominated by extreme values. A retained outlier can flip a test from "fail to reject" to "reject H0" purely on the strength of contamination, not on the strength of evidence.
Linear coding y=ax+b (formula-booklet identity): the coding rules yˉ=axˉ+b and sy=∣a∣sx assume a clean x. Coding is a presentational convenience, not a cleaning method — it cannot remove anomalies, only re-scale them. Students who hope coding will "smooth out" outliers misunderstand the operation.
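The point that coding re-scales rather than removes an anomaly can be checked directly; the data below are hypothetical, with one planted contaminated reading:

```python
import statistics

# Hypothetical data with one contaminated reading (185): linear coding
# rescales the outlier but cannot remove its influence.
x = [12.0, 14.0, 13.0, 185.0]
y = [(xi - 10) / 5 for xi in x]   # the same coding as the worked example

# The outlier's z-score (distance from the mean in standard deviations)
# is identical before and after coding: coding re-scales, it does not clean.
zx = abs(x[3] - statistics.mean(x)) / statistics.stdev(x)
zy = abs(y[3] - statistics.mean(y)) / statistics.stdev(y)
print(zx, zy)
```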
LDS / data-handling questions on Paper 3 split AO marks unusually toward AO3:
| AO | Typical share | Earned by |
|---|---|---|
| AO1 (knowledge / procedure) | 25–35% | Computing summary statistics correctly post-cleaning, applying coding identities, computing IQR |
| AO2 (reasoning / interpretation) | 25–35% | Distinguishing missingness from zero; justifying choice of cleaning rule; interpreting effect of outliers on summary statistics |
| AO3 (problem-solving / modelling) | 35–45% | Translating real-world data anomalies into mathematical decisions; recognising that cleaning rules carry assumptions |
Examiner-rewarded phrasing: "this entry is missing, not zero — coded as NA"; "the value 185 is physically implausible for this context, suggesting a likely transcription error and so excluded with justification"; "after conversion to consistent units, the cleaned mean is …". Phrases that lose marks: "I deleted the missing rows" (no justification of mechanism); "I replaced the outlier with the mean" (silent imputation); "I assumed the m/s values were knots" (silent unit mismatch).