This lesson covers the essential processes of data cleaning and preparation — the steps required to transform raw data into a form suitable for statistical analysis. In real-world statistics (and in A-Level exam questions), data is rarely perfect. Learning to identify and handle problems systematically is a key skill.
Raw data collected from real-world sources almost always contains imperfections. If these are not addressed before analysis, the results may be misleading, inaccurate, or invalid. Data cleaning is the process of detecting and correcting (or removing) errors, inconsistencies, and gaps in a data set.
In the context of the AQA large data set, the most common problems are missing readings, anomalous or physically implausible values, and inconsistent units or recording conventions.
Missing values are one of the most common issues in real data. There are several strategies for dealing with them, each with advantages and disadvantages.
Remove any observation (row) that contains a missing value for the variable(s) of interest.
| Advantages | Disadvantages |
|---|---|
| Simple to implement | Reduces sample size |
| Ensures complete data for analysis | May introduce bias if data is not missing at random |
When to use: When the proportion of missing data is small (say, less than 5%) and there is no pattern to the missingness.
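As a minimal sketch of listwise deletion (the readings below are hypothetical, and the 20% missingness is exaggerated for illustration), the strategy amounts to filtering out the gaps before computing any statistic:

```python
import statistics

# Hypothetical daily mean temperatures (°C); None marks a missing reading.
temps = [18.2, 19.1, None, 20.4, 17.8, 19.6, 18.9, 21.0, 18.4, None]

missing_rate = temps.count(None) / len(temps)   # proportion of missing values
complete = [t for t in temps if t is not None]  # listwise deletion

# Deletion is only defensible when the missing proportion is small and
# there is no pattern to the missingness (here 20% is for illustration).
print(f"missing: {missing_rate:.0%}, mean of remaining: {statistics.mean(complete):.2f}")
```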
Replace missing values with estimated values. Common imputation methods include:
| Method | Effect on mean | Effect on spread |
|---|---|---|
| Mean imputation | Preserves the mean | Reduces the standard deviation (values cluster around the mean) |
| Median imputation | May slightly change the mean | Less distortion than mean imputation |
| Interpolation | Depends on surrounding values | Preserves local trends |
Important: Any imputation method introduces assumptions. In exam questions, you should state the method used and acknowledge that it may affect the reliability of the analysis.
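The effect of mean imputation on spread, noted in the table above, can be demonstrated directly; the sample below is hypothetical:

```python
import statistics

# Hypothetical sample with two missing readings recorded as None.
data = [12.0, 15.0, None, 14.0, 18.0, None, 11.0, 16.0]
observed = [x for x in data if x is not None]
mean_obs = statistics.mean(observed)

# Mean imputation: fill each gap with the observed mean. The overall mean
# is essentially unchanged, but the standard deviation shrinks because
# the imputed values sit exactly on the mean.
imputed = [x if x is not None else mean_obs for x in data]
print(statistics.mean(imputed), statistics.pstdev(observed), statistics.pstdev(imputed))
```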
Analyse only the available data for each variable separately, using different sample sizes as appropriate. This avoids the bias introduced by deletion or imputation but makes comparisons harder.
A typical exam question might ask: "Three daily mean temperature readings are missing from the data for July. Describe how you would deal with these missing values before calculating the mean daily temperature for the month."
A good answer would discuss omitting the missing values and calculating the mean from the remaining observations, or interpolating based on adjacent days, and would note how the chosen method might affect the result.
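Both options can be sketched in a few lines; the July values below are hypothetical and chosen so that no two missing days are adjacent, which is what makes simple interpolation possible:

```python
import statistics

# Hypothetical July run with three missing daily mean temperatures (None).
july = [16.0, 17.2, None, 18.1, 17.5, None, 16.8, None, 18.4, 17.0]

# Option 1: omit the missing days and average the remaining readings.
present = [t for t in july if t is not None]
mean_omit = statistics.mean(present)

# Option 2: fill each gap by linear interpolation from the adjacent days
# (possible here only because no two missing days are consecutive).
filled = july[:]
for i, t in enumerate(filled):
    if t is None and 0 < i < len(filled) - 1:
        filled[i] = (filled[i - 1] + filled[i + 1]) / 2
print(mean_omit, statistics.mean(filled))
```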
An outlier is a data value that is markedly different from the rest of the data. In A-Level Mathematics, outliers are typically identified using one of two rules:
A value is an outlier if it lies more than 1.5 × IQR below the lower quartile Q1 or more than 1.5 × IQR above the upper quartile Q3,
where IQR = Q3 − Q1.
A value is an outlier if it lies more than 2 (or sometimes 3) standard deviations from the mean:
|x − x̄| > 2σ
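Both identification rules are mechanical once the summary statistics are known. A sketch using hypothetical gust readings, with the 150 kn value planted as the anomaly:

```python
import statistics

def iqr_outliers(data):
    """Flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4)   # exclusive method by default
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

def sd_outliers(data, k=2):
    """Flag values more than k standard deviations from the mean."""
    m, s = statistics.mean(data), statistics.stdev(data)
    return [x for x in data if abs(x - m) > k * s]

# Hypothetical daily maximum gusts (kn) with one planted anomaly.
gusts = [22, 25, 19, 30, 27, 24, 150, 21, 26, 23]
print(iqr_outliers(gusts), sd_outliers(gusts))
```

Note that the two rules need not agree in general: a single large outlier inflates the standard deviation, which can mask other moderately extreme values from the 2σ rule while the IQR rule still catches them.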
| Action | When appropriate |
|---|---|
| Keep | The value is genuine and represents real variation in the data |
| Investigate | Check whether the value could be an error (cross-reference with other variables or sources) |
| Remove | The value is confirmed as an error |
| Report separately | The value is genuine but extreme, and you want to show its effect on the analysis |
If the daily maximum gust at Leeming is recorded as 150 kn on a single day, with all other values below 60 kn, this would be flagged as an outlier. You would then investigate whether the reading is a genuine extreme event or a recording error, for example by cross-referencing it with the readings for adjacent days or with other variables (such as the daily mean wind speed) for the same day.
Real data sets may contain inconsistencies in how values are recorded. Common issues include mixed units (for example, knots and metres per second in the same column), different codes for missing or special values (blank cells, "n/a", "-", "tr"), and inconsistent formats for dates or station names.
Once the data has been cleaned, several preparation steps may be needed:
Choose only the variables relevant to your analysis. If you are investigating the relationship between temperature and sunshine, you do not need wind speed or pressure data.
Select only the observations that are relevant — for example, data from a specific month, season, or station.
Sort the data by date, by variable value, or by station, depending on the analysis you plan to carry out.
Sometimes you need to create new variables from existing ones. For example, a daily temperature range can be derived as maximum minus minimum temperature, or wind speeds recorded in different units can be converted onto a single consistent scale.
Group data into classes for frequency tables and histograms. Choose class widths that are appropriate for the data range and sample size.
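A sketch of equal-width classing with hypothetical temperatures; the class width and lower bound here are arbitrary but sensible choices for this range:

```python
# Hypothetical temperatures (°C) grouped into equal-width classes.
temps = [14.2, 15.8, 16.1, 17.5, 18.0, 18.3, 19.9, 20.4, 21.7, 22.1]

width = 2.0    # chosen class width
lower = 14.0   # lower bound of the first class

classes = {}
for t in temps:
    k = int((t - lower) // width)        # index of the class containing t
    lo = lower + k * width
    key = f"{lo:.0f}-{lo + width:.0f}"   # e.g. "14-16" means 14 <= t < 16
    classes[key] = classes.get(key, 0) + 1
print(classes)
```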
Before proceeding with analysis, carry out a systematic quality check:
| Check | Action |
|---|---|
| Are there missing values? | Count and decide on a strategy |
| Are there obvious errors? | Flag and investigate values outside the physically possible range |
| Are units consistent? | Convert if necessary |
| Are there outliers? | Identify and investigate using IQR or standard deviation rules |
| Is the data complete enough for the intended analysis? | Ensure sufficient data for reliable results |
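The checklist can be mechanised for a single numeric column; the quality_check helper, the example values, and the plausible range below are all hypothetical:

```python
# Hypothetical pre-analysis checklist for one numeric column.
def quality_check(values, lo, hi):
    """values: raw entries (numbers, or None for missing);
    lo, hi: the physically possible range for this variable."""
    report = {
        "n": len(values),
        "missing": sum(v is None for v in values),
        "impossible": sum(v is not None and not (lo <= v <= hi) for v in values),
    }
    report["usable"] = report["n"] - report["missing"] - report["impossible"]
    return report

# Daily mean wind speeds (kn): two missing entries, one implausible value.
speeds = [12.0, 9.5, None, 15.2, 185.0, None, 11.8, 13.4]
print(quality_check(speeds, lo=0, hi=120))
```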
In both coursework and exam answers, it is good practice to state clearly what data cleaning steps you have taken and why. For example:
"I removed three observations from the data set because the daily mean temperature values were blank, indicating missing readings. This reduced the sample size from 31 to 28 for July. I chose to omit these values rather than impute them because the proportion of missing data was small (less than 10%) and there was no obvious pattern to the missingness."
This shows the examiner that you have thought critically about data quality and can justify your decisions.
Exam Tip: If a question asks you to "clean" or "prepare" data, do not simply say "remove the outliers." Explain how you would identify them (e.g., using Q1−1.5×IQR), what you would do with them (keep, investigate, or remove), and why your choice is appropriate in the given context.
AQA 7357 Paper 3 — Statistics, Large Data Set context (subject content sections N — Statistical sampling, O — Data presentation and interpretation, and the LDS appendix). The published specification requires students to "become familiar with one or more specific large data sets in advance of the final assessment" and to be able to "interpret real data presented in summary or graphical form" and "use … the data set in the context in which it is presented." The cleaning, preparation and pre-analysis stages — handling missing values, anomaly identification, unit harmonisation, coding via y=ax+b — are not a separate sub-strand but are woven through Paper 3 questions: any LDS item presupposes that the candidate has already wrestled with the messiness of the raw data. The same skills feed forward into Paper 3 hypothesis testing (where outliers can dominate test statistics), Paper 3 probability distributions (where modelling assumptions are sensitive to data quality), and synoptically into Paper 1 / Paper 2 modelling sections wherever a "real-world" dataset is invoked. The AQA formula booklet does not list cleaning protocols — these are working-practice expectations rather than memorisable formulae.
Question (8 marks):
A student is analysing a sample of n = 30 daily mean wind speeds (knots) from a coastal weather station, drawn from the LDS context. The raw extract shows: most values lie between 5 and 25 knots; three observations are recorded as the string "n/a"; one value is recorded as 185 knots; and two values are recorded in metres per second rather than knots (a transcription error). The student computes summary statistics: x̄ = 18.4, s = 31.2.
(a) Identify the issues with the dataset and state, with a one-line justification each, how each should be treated before further analysis. (4)
(b) After appropriate cleaning, the student codes the wind speeds using y = (x − 10)/5. Given that the cleaned mean of x is x̄ = 12.6 knots and the cleaned standard deviation is sx = 4.8 knots, find ȳ and sy. (4)
Solution with mark scheme:
(a) Step 1 — classify the issues.
There are three distinct data-quality issues:
Missing values (the "n/a" strings): three entries carry no numerical value. Treatment: flag as missing rather than silently delete; report the missingness rate (3/30 = 10%) alongside any analysis. If the missingness is plausibly missing completely at random (MCAR), pairwise deletion or mean imputation may be defensible; otherwise the bias must be acknowledged.
Implausible value (185 knots): far outside the stated 5–25 knot range and physically implausible as a daily mean; investigate as a likely transcription error and exclude with justification.
Unit mismatch (two values in m/s): convert to knots so that all observations are on a consistent scale before any summary statistics are computed.
M1 — identifies all three issues by category. A1 — gives a defensible treatment for each. M1 A1 — explains why silent deletion is wrong (introduces bias / loses information about the missingness mechanism).
(b) Step 1 — apply the linear-coding rules.
For y = (x − 10)/5 = (1/5)x − 2:
ȳ = (x̄ − 10)/5 = (12.6 − 10)/5 = 2.6/5 = 0.52
M1 — applies ȳ = ax̄ + b correctly with a = 1/5, b = −2.
A1 — ȳ = 0.52.
Step 2 — standard deviation under linear coding.
sy = |a| × sx = (1/5) × 4.8 = 0.96
M1 — recognises that the additive constant −2 does not affect spread; only the multiplicative factor 1/5 does.
A1 — sy=0.96.
Total: 8 marks (M4 A4). The candidate must also note that the coding result is only meaningful after the cleaning in part (a) — coding the contaminated data would propagate the 185-knot outlier into a wildly inflated sx.
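The coding identities used in part (b) can be verified numerically; the cleaned sample below is hypothetical but chosen to have mean 12.6 knots:

```python
import statistics

# Hypothetical cleaned sample chosen to have mean 12.6 knots, as in part (b).
x = [7.8, 10.2, 12.6, 15.0, 17.4]
a, b = 1 / 5, -2                   # y = (x - 10)/5 = (1/5)x - 2
y = [a * xi + b for xi in x]

# Linear coding: the mean transforms as ybar = a*xbar + b, while the
# standard deviation picks up only the scale factor |a| (the shift b drops out).
print(statistics.mean(y), statistics.stdev(y))
```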
Question (6 marks): A meteorologist downloads daily rainfall totals (mm) for a coastal station for the months in the LDS context. The raw column contains entries: 12.4, 0, tr (the meteorological symbol for "trace"), - (no observation), # (sensor fault), and several blank cells.
(a) Explain why simply replacing every non-numeric entry with 0 would bias the mean rainfall, distinguishing the treatment of tr, -, and #. (4)
(b) Suggest one defensible cleaning rule for each of the three non-numeric symbols, and state which rule preserves the original missingness information for downstream analysis. (2)
Mark scheme decomposition by AO:
(a)
- tr represents a small but non-zero rainfall (typically below the gauge resolution), so replacing it with 0 systematically under-states wet-day frequency.
- "-" represents no observation, not zero rainfall; replacing it with 0 confuses missingness with a true measurement of "no rain."
- "#" represents a sensor fault, so the underlying weather state is unknown; replacing it with 0 asserts knowledge the data does not support.

(b)
- Encode tr as a small positive value (e.g. 0.05 mm, half the gauge resolution); encode "-" as NA (true missing); encode "#" as NA with a "fault" flag retained in a parallel column.
- The NA encoding (rather than 0 or deletion) is the rule that preserves missingness information for any later sensitivity analysis.

Total: 6 marks split AO2 = 2, AO3 = 4. Cleaning questions on Paper 3 are AO3-dominated because the assessment objective is "translate problems in mathematical or non-mathematical contexts into mathematical processes" — exactly the work of deciding what counts as data.
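One possible implementation of rules along these lines; the decode helper and the 0.05 mm trace value are assumptions for illustration, not prescribed by the specification:

```python
# Hypothetical decoding of a raw rainfall column: tr -> small positive value,
# "-" and "#" -> None (missing), with the fault flag kept in a parallel list.
TRACE = 0.05   # assumed half-gauge-resolution value for "tr"

def decode(entry):
    """Return (value, fault_flag) for one raw rainfall entry."""
    if entry in ("-", "", None):
        return None, False          # no observation: true missing, not zero
    if entry == "#":
        return None, True           # sensor fault: missing, flagged
    if entry == "tr":
        return TRACE, False         # trace: small but non-zero rainfall
    return float(entry), False

raw = ["12.4", "0", "tr", "-", "#", ""]
values, faults = zip(*(decode(e) for e in raw))
print(values, faults)
```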
Connects to:
Statistical sampling (subject content N): the choice of cleaning rule interacts with sampling. If a stratified sample over-samples nights (when sensors more often fault), silent deletion of # entries biases the cleaned dataset toward day-time observations. Cleaning is not a neutral pre-processing step — it can un-do careful sampling design.
Data presentation and interpretation (subject content O): outlier identification by the 1.5×IQR rule (boxplots) presupposes a clean dataset. Running the rule on a dataset that still contains transcription errors mis-labels real observations as outliers and vice versa.
Statistical distributions (subject content P) — normal modelling: the normal model is heavily sensitive to extreme values; one un-treated 185-knot reading shifts xˉ and inflates s enough to break any subsequent normality assumption. Cleaning is therefore a prerequisite for distributional modelling, not an optional polish.
Hypothesis testing (subject content Q): the test statistic for a single-sample mean test, Z = (X̄ − μ0)/(s/√n), is dominated by extreme values. A retained outlier can flip a test from "fail to reject" to "reject H0" purely on the strength of contamination, not on the strength of evidence.
Linear coding y=ax+b (formula-booklet identity): the coding rules yˉ=axˉ+b and sy=∣a∣sx assume a clean x. Coding is a presentational convenience, not a cleaning method — it cannot remove anomalies, only re-scale them. Students who hope coding will "smooth out" outliers misunderstand the operation.
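The point that coding re-scales rather than removes an anomaly can be checked directly; the data below are hypothetical, with one planted contaminated reading:

```python
import statistics

# Hypothetical data with one contaminated reading (185): linear coding
# rescales the outlier but cannot remove its influence.
x = [12.0, 14.0, 13.0, 185.0]
y = [(xi - 10) / 5 for xi in x]   # the same coding as the worked example

# The outlier's z-score (distance from the mean in standard deviations)
# is identical before and after coding: coding re-scales, it does not clean.
zx = abs(x[3] - statistics.mean(x)) / statistics.stdev(x)
zy = abs(y[3] - statistics.mean(y)) / statistics.stdev(y)
print(zx, zy)
```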
LDS / data-handling questions on Paper 3 split AO marks unusually toward AO3:
| AO | Typical share | Earned by |
|---|---|---|
| AO1 (knowledge / procedure) | 25–35% | Computing summary statistics correctly post-cleaning, applying coding identities, computing IQR |
| AO2 (reasoning / interpretation) | 25–35% | Distinguishing missingness from zero; justifying choice of cleaning rule; interpreting effect of outliers on summary statistics |
| AO3 (problem-solving / modelling) | 35–45% | Translating real-world data anomalies into mathematical decisions; recognising that cleaning rules carry assumptions |
Examiner-rewarded phrasing: "this entry is missing, not zero — coded as NA"; "the value 185 is physically implausible for this context, suggesting a likely transcription error and so excluded with justification"; "after conversion to consistent units, the cleaned mean is …". Phrases that lose marks: "I deleted the missing rows" (no justification of mechanism); "I replaced the outlier with the mean" (silent imputation); "I assumed the m/s values were knots" (silent unit mismatch).