This lesson examines the assumptions that underlie statistical models, the limitations of these models, and how to communicate uncertainty when drawing conclusions from data. Understanding when and why models break down is a higher-order skill that distinguishes strong A-Level candidates.
Every statistical model is built on a set of assumptions. If these assumptions are satisfied, the model provides reliable results. If they are violated, the results may be misleading or invalid.
At A-Level, you are expected to:
What it means: The outcome of one observation does not affect the outcome of any other observation.
Where it is required:
When it may be violated in the LDS:
Weather data on consecutive days is typically not independent. A warm day is more likely to be followed by another warm day than by a cold day (weather systems persist). Similarly, rainfall often occurs over several consecutive days as a weather front passes through.
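Consequence: If the observations are positively correlated rather than independent, standard errors are underestimated, so confidence intervals will be too narrow and hypothesis tests may give misleading results. In effect, the sample carries less information than its size suggests: the effective sample size is smaller than the actual sample size.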
Mitigation: Select data from non-consecutive days, or from different months or years, to increase the plausibility of the independence assumption.
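The idea of an effective sample size can be made concrete with a short sketch. It goes beyond what A-Level requires and rests on assumptions not in the lesson: the function names are hypothetical, the temperatures are made-up values (not LDS data), and the formula n(1 − r)/(1 + r) is a common approximation that assumes the series behaves like a simple first-order autoregressive process.

```python
from statistics import mean

def lag1_autocorrelation(values):
    """Sample lag-1 autocorrelation of a sequence given in time order."""
    m = mean(values)
    deviations = [v - m for v in values]
    # Products of each deviation with the next day's deviation
    numerator = sum(a * b for a, b in zip(deviations, deviations[1:]))
    denominator = sum(d * d for d in deviations)
    return numerator / denominator

def effective_sample_size(n, r1):
    """Approximate effective sample size, assuming an AR(1)-like series."""
    return n * (1 - r1) / (1 + r1)

# Made-up daily mean temperatures for ten consecutive days (illustrative only)
temps = [14.0, 15.0, 16.0, 17.0, 18.0, 18.0, 17.0, 16.0, 15.0, 14.0]

r1 = lag1_autocorrelation(temps)
n_eff = effective_sample_size(len(temps), r1)
print(f"lag-1 autocorrelation = {r1:.2f}")
print(f"effective sample size = {n_eff:.1f} (actual n = {len(temps)})")
```

For these illustrative values the lag-1 autocorrelation is 0.6, so ten consecutive days carry roughly as much information about the mean as 2.5 independent days would under this approximation.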
What it means: In a binomial model, the probability of success p is the same for every trial.
Where it is required: Binomial distribution.
When it may be violated in the LDS:
If we model "does it rain today?" as a Bernoulli trial, the probability of rain may change throughout the month (e.g., weather patterns shift) or between months (seasonal variation).
Consequence: The binomial model will give inaccurate probabilities, particularly in the tails of the distribution.
Mitigation: Restrict the analysis to a short, homogeneous time period where the probability is approximately constant, or use the observed proportion as an average estimate of p.
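To see why the choice of p matters most in the tails, the sketch below compares an upper-tail binomial probability for three nearby values of p. The values of n, p and the threshold of 15 rainy days are illustrative assumptions, not figures from the LDS, and the function names are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def upper_tail(k, n, p):
    """P(X >= k) for X ~ B(n, p)."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))

n = 31  # e.g. the number of days in July
# Assumed values of p, chosen only to show sensitivity
tail_probs = {p: upper_tail(15, n, p) for p in (0.25, 0.30, 0.35)}

for p, prob in tail_probs.items():
    print(f"p = {p:.2f}: P(X >= 15) = {prob:.4f}")
```

A modest change in p changes the tail probability by a large factor, which is why an "average" p can be adequate for the centre of the distribution while still being unreliable for judging unusually wet or dry months.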
What it means: The data follows (or approximately follows) a normal distribution.
Where it is required:
When it may be violated in the LDS:
Checking normality:
| Method | How it works |
|---|---|
| Histogram | Should be roughly bell-shaped and symmetric |
| Mean ≈ median | A large difference suggests skewness |
| 68-95-99.7 rule | Check whether the proportions match |
| Box plot | Symmetric box with whiskers of similar length |
Consequence: If the data is not normal and the sample is small, hypothesis tests and confidence intervals may be unreliable.
Mitigation: Use a larger sample (the Central Limit Theorem means the sampling distribution of the mean is approximately normal for large n, even when the data themselves are not), or use non-parametric methods.
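Two of the informal checks in the table above (mean ≈ median, and the 68-95-99.7 rule) are easy to automate. The following is a minimal sketch: the function name `normality_summary` is hypothetical and the temperatures are made-up values chosen to be symmetric, not LDS data. With only a handful of observations the proportions can only ever be a rough guide.

```python
from statistics import mean, median, stdev

def proportion_within(values, centre, spread, k):
    """Proportion of values within k standard deviations of the centre."""
    return sum(abs(v - centre) <= k * spread for v in values) / len(values)

def normality_summary(values):
    """Informal normality checks: compare mean with median and the 68-95-99.7 rule."""
    m, s = mean(values), stdev(values)
    return {
        "mean": m,
        "median": median(values),
        "within_1sd": proportion_within(values, m, s, 1),  # normal: about 0.68
        "within_2sd": proportion_within(values, m, s, 2),  # normal: about 0.95
        "within_3sd": proportion_within(values, m, s, 3),  # normal: about 0.997
    }

# Made-up daily mean temperatures (illustrative only)
temps = [15, 16, 17, 17, 18, 18, 18, 19, 19, 20, 21]
summary = normality_summary(temps)

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Here the mean and median coincide and about 64% of values lie within one standard deviation, which is broadly consistent with a normal shape. A strongly skewed variable such as daily rainfall would typically show a mean well above the median.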
What it means: Every member of the population has an equal chance of being selected for the sample.
Where it is required: All statistical inference (hypothesis tests, confidence intervals, generalising from sample to population).
When it may be violated in the LDS:
If you select all the data from a single month at a single station, this is not a random sample from all possible weather data. It is a specific period at a specific location. Conclusions drawn from this data should be limited to that context.
Consequence: Results may not generalise to other stations, months, or years.
| Assumption | In the LDS context |
|---|---|
| Fixed number of trials n | Determined by the time period selected (e.g., 31 days in July) |
| Two outcomes per trial | Define "success" clearly (e.g., rainfall > 0.2 mm = rain; otherwise = no rain) |
| Constant p | Approximate — may vary within a month |
| Independence | Unlikely for consecutive days; more plausible for randomly selected days |
| Assumption | In the LDS context |
|---|---|
| Data is continuous | Temperature, pressure, etc., are effectively continuous |
| Distribution is symmetric | Check with a histogram — temperature is often approximately symmetric; rainfall is usually skewed |
| Mean and standard deviation describe the distribution | True for normal; not sufficient for skewed distributions |
Models can fail in several ways:
A regression model fitted to summer data may not apply in winter. The relationship between temperature and sunshine is different in different seasons.
The normal distribution assigns very low probability to values far from the mean. In reality, extreme weather events (storms, heatwaves) may be more common than the normal model predicts. This is known as having heavy tails.
The PMCC and least squares regression assume a linear relationship. If the true relationship is curved, these methods will underestimate the strength of the association and produce poor predictions.
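An extreme case makes the point: for a perfectly quadratic relationship that is symmetric about the mean of x, the PMCC is zero even though y is completely determined by x. The sketch below uses made-up data and a hypothetical helper `pmcc`, written out from the usual Sxy, Sxx, Syy definition.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: y depends exactly on x, but not linearly
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x**2 for x in xs]

r = pmcc(xs, ys)
print(f"r = {r:.3f}")
```

A value of r near zero therefore means "no linear association", not "no association", which is why a scatter diagram should be inspected before the PMCC is interpreted.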
A model based on historical data assumes that the underlying process is stationary — i.e., the statistical properties do not change over time. Climate change means that historical weather patterns may not accurately predict future conditions.
With very small samples, it is difficult to:
As a rule of thumb, the Central Limit Theorem needs a reasonably large sample (typically n ≥ 30) before the sampling distribution of the mean can be treated as approximately normal.
Statistical conclusions are never certain. Good practice involves:
"We are 95% confident that the true mean daily temperature at Heathrow in July lies between 18.5°C and 21.2°C."
"This analysis assumes that the daily mean temperatures are normally distributed, which is approximately supported by the histogram but cannot be guaranteed."
"These results apply to Heathrow in July and should not be generalised to other stations or months without further analysis."
"The normal model provides a reasonable approximation for the daily mean temperature data, but the presence of two outliers suggests that the model may underestimate the probability of extreme temperatures."
A typical exam question might ask:
"State two assumptions that must be made for the binomial distribution to be a suitable model for the number of rainy days in a month. Comment on the validity of each assumption."
Model answer:
"The probability of rain is constant from day to day. This is approximately valid if the analysis is restricted to a single month, but in practice the probability of rain may vary as weather systems move through. For a short period, this is a reasonable approximation."
"Each day's weather is independent of every other day's weather. This is unlikely to be fully valid because weather tends to persist — a rainy day is more likely to be followed by another rainy day. However, if we are looking at general long-term proportions rather than specific sequences, the independence assumption is an acceptable simplification."
Statistical modelling is an iterative process:
At A-Level, you are not expected to carry out all of these steps formally, but you should be able to discuss them and evaluate whether a given model is appropriate.
Exam Tip: When asked to "state an assumption" of a model, make sure you are specific and contextual. Do not just say "the data must be independent." Say: "Each day's weather must be independent of every other day's weather, meaning that the outcome (rain or no rain) on one day does not affect the outcome on any other day." This level of detail is what earns full marks.
AQA A-Level Mathematics (7357) Overarching Theme OT3, Mathematical Modelling, applied to the AQA Large Data Set (LDS): "Translate a situation in context into a mathematical model, making simplifying assumptions; use a mathematical model with suitable inputs to engage with and explore situations; interpret the outputs of a mathematical model in context, including evaluating model accuracy and limitations; understand that a mathematical model can be refined by considering its outputs and recognise that this may lead to a refined model." OT3 is not a topic confined to one paper: it is examined across Paper 1 (Pure), Paper 2 (Pure and Mechanics) and especially Paper 3 (Pure and Statistics), and forms the AO3 spine of every LDS-flavoured question. The LDS itself is referenced in section N (Statistics) of the specification, where candidates must demonstrate familiarity with the data, its variables, units, sampling frame, missing-value coding and known irregularities. Modelling assumptions surface whenever probability distributions, regression lines or hypothesis tests are imposed on real LDS samples, and the explicit, examinable skill is stating, justifying and evaluating those assumptions in the precise context of the data.
Question (8 marks):
A student uses the AQA LDS to investigate whether daily mean wind speed at a chosen weather station can be modelled by a normal distribution. They take the wind speed values for one full month and calculate x̄ = 8.4 and s = 2.1 (in knots). They then propose a model W ∼ N(8.4, 2.1²) and use it to estimate P(W > 12) for any future day at that station.
State and evaluate three assumptions the student is making in proposing this model. (8)
Solution with mark scheme:
Assumption 1 — distributional shape.
The student is assuming that daily mean wind speed at the chosen station follows a normal distribution. B1 for stating the assumption explicitly in context (not "the data is normal" but "daily mean wind speed at this station is normally distributed").
Evaluation: Wind speed is a non-negative quantity (W ≥ 0 always), but the proposed normal model assigns positive probability to W < 0, which is physically impossible. Real wind-speed data also tend to be right-skewed (occasional storm days produce a long upper tail), which a symmetric normal model cannot capture. A B1 is awarded for any one well-articulated limitation: non-negativity, skewness, or the fact that a single month's sample is too small to verify the distributional shape reliably (you would want a histogram or normal probability plot from a much larger window).
Assumption 2 — independence of observations.
The student is implicitly assuming that the daily wind speeds in the sample are independent of one another. B1 in context: "the wind speed on one day does not influence the wind speed on the following day."
Evaluation: This is almost certainly false. Weather is autocorrelated on a scale of several days — a windy day is more likely to be followed by another windy day because the same synoptic weather system persists. B1 for identifying autocorrelation as the violation, plus a sketch of consequence: standard errors computed under the independence assumption will be too small, so confidence intervals are too narrow and p-values too low.
Assumption 3 — stationarity (constant mean and variance).
The student is assuming the parameters μ=8.4 and σ=2.1 are constant over time and that future days are drawn from the same distribution as the sample month. B1 for stating stationarity in context.
Evaluation: Wind speed has a strong seasonal component — winter months in the UK have higher mean wind speeds and higher variance than summer months. Using a single month's parameters to predict "any future day" conflates the within-month distribution with the across-year distribution. B1 for the seasonal critique, plus an additional B1 for noting that even within a single month the LDS may span multiple sampling years, so the sample mixes climatologies that may differ.
Total: 8 marks (B1 for each assumption stated, B1 for each evaluation, plus 2 for depth/synthesis).
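For reference, the two probabilities discussed in this question can be evaluated directly from the student's proposed model. This is a sketch using Python's standard-library `NormalDist`; the variable names are hypothetical, and the numbers are only as trustworthy as the model assumptions evaluated above.

```python
from statistics import NormalDist

# The student's proposed model: mean 8.4 knots, standard deviation 2.1 knots
wind_model = NormalDist(mu=8.4, sigma=2.1)

# Probability the student wants to estimate
p_above_12 = 1 - wind_model.cdf(12)

# Probability the model assigns to an impossible negative wind speed
p_below_0 = wind_model.cdf(0)

print(f"P(W > 12) = {p_above_12:.4f}")
print(f"P(W < 0)  = {p_below_0:.6f}")
```

The model gives P(W > 12) of roughly 0.04. The probability of a negative wind speed is positive but tiny (0 lies four standard deviations below the mean), so in this particular case the non-negativity objection is valid in principle but has little numerical effect; skewness and non-stationarity are likely to matter more.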
Question (6 marks): A meteorologist fits a least-squares regression line of daily maximum temperature y (°C) on daily mean cloud cover x (in oktas) using LDS values from one summer month at one station. They obtain y=24.3−1.1x with r=−0.62.
(a) State two assumptions implicit in using this regression line to predict tomorrow's maximum temperature given a forecast cloud cover of 5 oktas. (2)
(b) Evaluate the reliability of the prediction ŷ(5) = 18.8°C in light of these assumptions. (4)
Mark scheme decomposition by AO:
(a)
(b)
Total: 6 marks split AO3 = 6. This is a pure AO3 question — no calculation, all reasoning. AQA examiners use LDS regression questions almost exclusively as AO3 vehicles.
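Although the question itself needs no calculation, it can help to check the quoted prediction and to see how much of the variation the line accounts for. The sketch below reproduces ŷ(5) from the stated regression line and computes r², interpreted in the usual way as the proportion of variation in y explained by the linear relationship with x. The function and variable names are hypothetical.

```python
def predict_max_temp(cloud_oktas):
    """Predicted daily maximum temperature (°C) from the line y = 24.3 - 1.1x."""
    return 24.3 - 1.1 * cloud_oktas

prediction = predict_max_temp(5)

r = -0.62
r_squared = r**2  # proportion of variation explained by the linear fit

print(f"Predicted maximum temperature at 5 oktas: {prediction:.1f} °C")
print(f"r squared = {r_squared:.4f}")
```

With r² of about 0.38, well over half of the day-to-day variation in maximum temperature is not explained by cloud cover, so the point prediction of 18.8°C should be reported with appropriate caution.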
Connects to:
Probability models (Section N5): every probability distribution carries assumptions. The binomial B(n,p) assumes fixed n, constant p, independence and a binary outcome; applying it to LDS rainfall data (rain / no rain on each day) is likely to violate independence, because wet days tend to cluster. The Poisson assumes events occur singly, independently, and at constant rate; applied to daily storm counts it can fail on all three.
Hypothesis testing (Section O): every test rests on assumptions about the null distribution. A one-sample t-test on LDS temperature differences assumes the differences are normally distributed and independent. Failing to check these turns a confident "reject H0 at the 5% level" into a meaningless ritual.
Regression and correlation (Section N6): least-squares regression assumes linearity, homoscedasticity (constant residual variance), independence of residuals and (for inference) normality of residuals. Pearson's r is meaningful only for linear relationships — a perfect quadratic relationship can have r=0.
Sampling (Section N1): the LDS is itself a sample (specific stations, specific years). Generalising from "this month at this station" to "UK weather in general" is a sampling-frame assumption. Stratified, cluster and quota sampling each carry distinct assumption sets.
OT3 (Mathematical Modelling) overarching: the modelling cycle — formulate, solve, interpret, refine — is the meta-skill. Every numerical answer must be interrogated: "what did I assume to get here, and how would the answer change if the assumption fails?"
LDS modelling-assumption questions on Paper 3 split AO marks heavily toward AO3:
| AO | Typical share | Earned by |
|---|---|---|
| AO1 (knowledge / procedure) | 10–20% | Recalling the formal assumptions of named models (binomial, normal, regression) |
| AO2 (reasoning / interpretation) | 20–30% | Justifying why a stated assumption is or is not plausible given the LDS context |
| AO3 (problem-solving / modelling) | 50–70% | Translating a real LDS scenario into mathematical form, choosing an appropriate model, evaluating outputs and refining |