This lesson covers the use of probability distributions — particularly the binomial and normal distributions — as models for real-world data. You will learn how to select an appropriate model, check its suitability, and compare model predictions with observed data from the large data set.
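A probability model is a mathematical description of a random process. It specifies the possible outcomes and the probability of each outcome. In A-Level Mathematics, the two most important probability models are the binomial distribution, for discrete counts of successes, and the normal distribution, for continuous measurements.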
The key idea is that we use these mathematical models to approximate real-world situations. No model is a perfect representation of reality, but a good model captures the essential features of the data and allows us to make useful predictions.
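The random variable X follows a binomial distribution X∼B(n,p) if there is a fixed number n of trials, each trial has exactly two outcomes (success or failure), the probability p of success is the same on every trial, and the trials are independent. In that case:
$$P(X=r)=\binom{n}{r}p^{r}(1-p)^{n-r}$$
$$E(X)=np, \qquad \operatorname{Var}(X)=np(1-p)$$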
Example from the LDS: Suppose historical data shows that it rains on approximately 40% of days in October at Camborne. If we select 10 random October days, we might model the number of rainy days as X∼B(10,0.4).
Checking the conditions:
| Condition | Assessment |
|---|---|
| Fixed number of trials | Yes — 10 days |
| Two outcomes | Yes — rain or no rain (we need a clear definition of "rain", e.g., daily rainfall > 0.2 mm) |
| Constant probability | Approximately — the probability may vary slightly depending on the weather pattern, but 0.4 is a reasonable average |
| Independence | Approximately — weather on consecutive days is not truly independent (weather systems persist), but if the days are randomly selected from different years, independence is more reasonable |
Model predictions vs observed data:
We could calculate P(X=0),P(X=1),…,P(X=10) from the model and compare with the actual frequencies observed in the data set. If the model is a good fit, the predicted and observed frequencies should be similar.
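As a minimal sketch of this comparison, the Python below tabulates P(X=r) for the Camborne model X∼B(10,0.4) and converts each probability into an expected frequency. The helper `binomial_pmf` and the figure of 50 ten-day samples are assumptions made for illustration; they are not taken from the large data set.

```python
from math import comb


def binomial_pmf(r: int, n: int, p: float) -> float:
    """Return P(X = r) for X ~ B(n, p)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


n, p = 10, 0.4      # model from the Camborne example
num_samples = 50    # hypothetical number of 10-day samples (an assumption)

pmf = [binomial_pmf(r, n, p) for r in range(n + 1)]
expected = [num_samples * prob for prob in pmf]

for r, (prob, freq) in enumerate(zip(pmf, expected)):
    print(f"r={r:2d}  P(X=r)={prob:.4f}  expected frequency={freq:.2f}")
```

The expected frequencies can then be set beside the observed counts of samples with 0, 1, ..., 10 rainy days.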
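The normal distribution $X \sim N(\mu, \sigma^2)$ is characterised by two parameters: the mean $\mu$, which locates the centre of the symmetric, bell-shaped curve, and the variance $\sigma^2$, which controls its spread.
Many continuous variables in the large data set, such as daily mean temperature, are approximately normally distributed.
Example: If the daily mean temperatures at Heathrow in July have a mean of 19.5°C and a standard deviation of 2.3°C, we might model the temperature as $X \sim N(19.5, 2.3^2)$.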
To assess whether a normal distribution is a suitable model for a set of data, consider:
Shape of the distribution: Plot a histogram or frequency polygon. Does it look roughly bell-shaped and symmetric?
Mean vs median: For a normal distribution, these should be approximately equal. A large difference suggests skewness.
68-95-99.7 rule: Check whether approximately 68% of the data lies within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3.
Outliers and skewness: A normal distribution has thin tails — if there are many extreme values or the distribution is clearly skewed, the normal model may not be appropriate.
| Check | Normal model suitable | Normal model may not be suitable |
|---|---|---|
| Histogram shape | Roughly symmetric, bell-shaped | Clearly skewed or bimodal |
| Mean ≈ median | Yes | Large difference |
| 68% rule | Approximately 68% within μ±σ | Significantly more or fewer |
| Outliers | Few or none | Many extreme values |
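The numerical checks in the table can be sketched in a few lines of Python. The function name `normality_checks` and the sample values are hypothetical: the temperatures are invented purely to show the calculation and are not real LDS readings.

```python
from statistics import mean, median, stdev


def normality_checks(data):
    """Return simple summary checks for the suitability of a normal model."""
    m = mean(data)
    s = stdev(data)  # sample standard deviation
    n = len(data)
    within_1 = sum(1 for x in data if abs(x - m) <= s) / n
    within_2 = sum(1 for x in data if abs(x - m) <= 2 * s) / n
    return {
        "mean": m,
        "median": median(data),
        "within_1_sd": within_1,  # compare with about 0.68
        "within_2_sd": within_2,  # compare with about 0.95
    }


# Hypothetical illustrative temperatures (°C), NOT taken from the large data set.
sample = [16.1, 17.4, 18.0, 18.6, 19.0, 19.3, 19.5,
          19.8, 20.1, 20.6, 21.2, 22.0, 23.4]

checks = normality_checks(sample)
for name, value in checks.items():
    print(f"{name}: {value:.3f}")
```

With a sample this small the proportions are only rough guides; a histogram should still be inspected alongside them.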
Suppose the model predicts that P(X>24)=0.025 for daily mean temperature at Heathrow in July (approximately 2.5% of days). If there are 31 days in July, the model predicts approximately 31×0.025≈0.8 days with a mean temperature above 24°C. If the actual data shows 2 such days out of 31, the model's prediction is in the right ballpark — but with such a small sample, we would not expect exact agreement.
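The figure of 0.025 can be checked with Python's standard-library `statistics.NormalDist`, assuming the July Heathrow model N(19.5, 2.3²) quoted earlier in the lesson. This is a sketch of the calculation, not a substitute for calculator working in an exam.

```python
from statistics import NormalDist

# Model for July daily mean temperature at Heathrow (parameters from the lesson)
model = NormalDist(mu=19.5, sigma=2.3)

p_above_24 = 1 - model.cdf(24)   # P(X > 24)
expected_days = 31 * p_above_24  # expected number of such days in July

print(f"P(X > 24) = {p_above_24:.4f}")
print(f"Expected days above 24°C in July = {expected_days:.2f}")
```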
The process of comparing model predictions with real data is fundamental to statistical modelling:
Select the appropriate distribution (binomial or normal) based on the type of data and the conditions.
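Use the observed data to estimate the parameters: for the binomial, estimate p by the observed proportion of successes; for the normal, estimate μ and σ by the sample mean $\bar{x}$ and the sample standard deviation s.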
Use the model to predict the expected number of observations in each category or range.
For the binomial: Calculate P(X=r) for each r and multiply by the total number of observations.
For the normal: Calculate the probability of falling in each class interval and multiply by the total frequency.
Compare the expected frequencies with the observed frequencies. A good model will produce expected frequencies that are close to the observed frequencies.
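If the model fits well, it can be used for prediction and inference. If not, consider whether one of the model's conditions fails (for example, independence or constant probability) or whether the data are too skewed for the chosen distribution.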
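The "predict expected frequencies" step for a normal model can be sketched as follows. The helper `expected_frequencies`, the class boundaries and the total of 31 days are assumptions chosen for illustration; the model parameters are those of the Heathrow July example.

```python
from statistics import NormalDist


def expected_frequencies(model, boundaries, total):
    """Expected count in each class interval (a, b) under a normal model."""
    return [
        total * (model.cdf(b) - model.cdf(a))
        for a, b in zip(boundaries, boundaries[1:])
    ]


model = NormalDist(mu=19.5, sigma=2.3)      # Heathrow July model from the lesson
boundaries = [14, 16, 18, 20, 22, 24, 26]   # hypothetical class boundaries (°C)
total_days = 31                             # hypothetical total frequency

expected = expected_frequencies(model, boundaries, total_days)
for (a, b), freq in zip(zip(boundaries, boundaries[1:]), expected):
    print(f"{a}-{b}°C: expected {freq:.2f} days")
```

The expected counts do not quite sum to 31 because the model places a small probability outside the range 14°C to 26°C.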
When n is large and p is not too close to 0 or 1, the binomial distribution can be approximated by the normal distribution:
X∼B(n,p)≈N(np,np(1−p))
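The conditions for this approximation to be valid are that n is large and p is not too close to 0 or 1. A commonly quoted rule of thumb is that np and n(1−p) should both exceed 5, although some texts use a stricter threshold.
When using this approximation, with Y∼N(np,np(1−p)), a continuity correction must be applied because we are approximating a discrete distribution with a continuous one: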
| Binomial probability | Normal approximation (with continuity correction) |
|---|---|
| P(X≤k) | P(Y≤k+0.5) |
| P(X<k) | P(Y<k−0.5) |
| P(X≥k) | P(Y≥k−0.5) |
| P(X>k) | P(Y>k+0.5) |
| P(X=k) | P(k−0.5<Y<k+0.5) |
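To see why the correction matters, the sketch below compares an exact binomial probability with its normal approximation, with and without the continuity correction. The values n=30, p=0.45 and k=12 are illustrative choices, and `binomial_cdf` is a hypothetical helper written for this example.

```python
from math import comb, sqrt
from statistics import NormalDist


def binomial_cdf(k: int, n: int, p: float) -> float:
    """Return P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(k + 1))


n, p, k = 30, 0.45, 12  # illustrative values (assumptions)

# Approximating normal Y ~ N(np, np(1 - p))
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))

exact = binomial_cdf(k, n, p)          # P(X <= 12), exact
with_correction = approx.cdf(k + 0.5)  # P(Y <= 12.5)
without_correction = approx.cdf(k)     # P(Y <= 12), no correction

print(f"Exact binomial:         {exact:.4f}")
print(f"Normal with correction: {with_correction:.4f}")
print(f"Normal, no correction:  {without_correction:.4f}")
```

The corrected value should sit noticeably closer to the exact probability than the uncorrected one.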
Using the large data set, count the number of days with measurable rainfall (> 0.2 mm) at Hurn in September over several years. If the proportion is $\hat{p}=0.45$, model the number of rain days in a random sample of 30 September days as X∼B(30,0.45).
Predicted mean: E(X)=30×0.45=13.5 rain days. Observed mean from the data: compare and evaluate.
Daily mean temperatures at Leuchars in March: sample mean $\bar{x}=5.8$°C, sample standard deviation $s=2.1$°C. Model: $X \sim N(5.8, 2.1^2)$.
Calculate P(X<2) from the model and compare with the proportion of March days in the data set where the temperature was below 2°C.
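A minimal sketch of the model side of this comparison, using the Leuchars parameters above. The observed proportion is deliberately not included, since it has to come from the data set itself.

```python
from statistics import NormalDist

# Leuchars March model: mean 5.8°C, standard deviation 2.1°C
leuchars = NormalDist(mu=5.8, sigma=2.1)

p_below_2 = leuchars.cdf(2)  # P(X < 2) under the model
print(f"P(X < 2) = {p_below_2:.4f}")

# Expected number of such days in a 31-day March
print(f"Expected days below 2°C = {31 * p_below_2:.2f}")
```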
Exam Tip: When a question asks you to "comment on the suitability of a model", do not just say "it is suitable" or "it is not suitable." Explain why by checking the conditions (e.g., "The binomial model may not be fully appropriate because the probability of rain is likely to vary across the month, violating the constant probability condition. However, it provides a reasonable approximation for estimation purposes.")
AQA 7357 specification, Paper 3 (Statistics), sub-strands N (Statistical distributions) and O (Statistical hypothesis testing), set within the Large Data Set context of section M. These cover: the binomial distribution as a model; calculating probabilities using the binomial distribution; understanding and using the Normal distribution as a model; finding probabilities using the Normal distribution; and selecting an appropriate probability distribution for a context, with appropriate reasoning, including recognising when the binomial or Normal model may not be appropriate (refer to the official specification document for exact wording). The LDS (daily weather observations from a number of UK and overseas weather stations) supplies the context: rainfall amounts, daily mean temperatures, sunshine hours, wind directions, and binary events such as "rain on a given day" are all routinely modelled probabilistically. The AQA formula booklet supplies neither the binomial probability mass function nor the Normal density; the binomial pmf must be memorised, and Normal probabilities must be looked up via the standard tables provided in the formula booklet for the standard Normal Z=(X−μ)/σ.
Question (8 marks):
A student investigates the LDS for Heathrow during May to October. Two scenarios:
(a) The student records, for each of the 184 days in the period, whether the daily total rainfall is at least 1 mm. Historical proportion suggests the long-run probability of such a "wet day" is p=0.30. Let X be the number of wet days in a randomly chosen 14-day window from this period. Stating the conditions you assume, find P(X≥5). (5)
(b) The student then considers daily mean temperature T (°C) across the same period and proposes the model $T \sim N(15.4, 3.2^2)$. Using this model, find the probability that T exceeds 18°C on a randomly chosen day, and comment on whether the binomial distribution would have been an appropriate alternative. (3)
Solution with mark scheme:
(a) Step 1 — state the model and conditions.
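Let X be the number of wet days in 14 days. Model X∼B(14,0.30) provided: there is a fixed number of trials (14 days); each day is either wet or not wet; the probability of a wet day is constant at 0.30; and rainfall on different days is independent.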
B1 — identifying the binomial model with parameters n=14, p=0.30 and at least two named conditions (typically "fixed n" and "constant p" or "independence"). Examiners reward the condition explicitly written in context — e.g. "we assume rainfall on different days is independent". Stating only "binomial" earns nothing.
Step 2 — express the required probability.
P(X≥5)=1−P(X≤4)
M1 — using the complement to convert a tail probability into a cumulative probability that can be looked up or computed.
Step 3 — compute P(X≤4).
Using the binomial cdf with n=14,p=0.30:
$$P(X\le 4)=\sum_{k=0}^{4}\binom{14}{k}(0.30)^{k}(0.70)^{14-k}$$
Evaluating term by term and summing gives P(X≤4)≈0.5842.
M1 — correct cumulative-binomial setup (calculator or table). The structural mark is for the sum from k=0 to k=4; numerical accuracy is rewarded separately.
A1 — P(X≤4)≈0.5842 to four decimal places (accept 0.584).
Step 4 — answer.
P(X≥5)=1−0.5842=0.4158
A1 — P(X≥5)≈0.416 (accept anywhere in the range 0.415 to 0.417).
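The arithmetic in part (a) can be checked with a short sketch that sums the binomial terms directly. The variable names are illustrative; in the exam this would be done with a calculator's cumulative binomial function.

```python
from math import comb

n, p = 14, 0.30

# P(X <= 4): sum the pmf from k = 0 to k = 4
p_at_most_4 = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(5))

# P(X >= 5) by the complement
p_at_least_5 = 1 - p_at_most_4

print(f"P(X <= 4) = {p_at_most_4:.4f}")
print(f"P(X >= 5) = {p_at_least_5:.4f}")
```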
(b) Step 1 — standardise.
$$Z=\frac{T-\mu}{\sigma}=\frac{18-15.4}{3.2}=\frac{2.6}{3.2}=0.8125$$
M1 — correct standardisation. Watch the sign and the order: (x−μ)/σ, not (μ−x)/σ.
Step 2 — read the standard Normal table.
P(Z<0.8125)≈0.7917, so P(T>18)=1−0.7917=0.2083.
A1 — P(T>18)≈0.208.
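Part (b) can be verified in the same spirit, here using `statistics.NormalDist` from the Python standard library. The sketch shows both the standardised value and the tail probability under the proposed model.

```python
from statistics import NormalDist

mu, sigma = 15.4, 3.2
temperature = NormalDist(mu=mu, sigma=sigma)

z = (18 - mu) / sigma                 # standardised value
p_exceeds_18 = 1 - temperature.cdf(18)  # P(T > 18)

print(f"z = {z:.4f}")
print(f"P(T > 18) = {p_exceeds_18:.4f}")
```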
Step 3 — comparative comment.
The binomial model is not appropriate for daily mean temperature because temperature is a continuous quantity, not a count of successes in a fixed number of trials. A binomial random variable takes only the integer values 0,1,2,…,n, whereas T varies on a continuum.
E1 — clear statement that temperature is continuous, the binomial models discrete counts, hence the binomial is inappropriate.
Total: 8 marks.
Question (6 marks): The LDS records daily mean wind speed W (knots) at Leeming. A student proposes $W \sim N(11.5, 4.5^2)$.
(a) Find P(8<W<14). (3)
(b) The Beaufort scale defines "moderate breeze" as wind speed strictly between 11 and 16 knots. Estimate the probability that, on a randomly chosen day, the wind speed is a "moderate breeze" according to this Beaufort definition under the proposed model. (3)
Mark scheme decomposition by AO:
(a)
(b)
Total: 6 marks split AO1 = 5, AO2 = 1. AQA reserves the AO2 mark for the recognition that the strict-inequality boundary in (b) is irrelevant for a continuous distribution: P(W=11)=0 exactly under the Normal model.
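As a numerical check on both parts (not a reproduction of the official mark scheme), the sketch below evaluates the two interval probabilities under the proposed model. The helper `interval_probability` is a hypothetical name introduced for this example.

```python
from statistics import NormalDist

wind = NormalDist(mu=11.5, sigma=4.5)  # proposed model for W (knots)


def interval_probability(model, lower, upper):
    """P(lower < W < upper); strict or non-strict bounds give the same value."""
    return model.cdf(upper) - model.cdf(lower)


p_a = interval_probability(wind, 8, 14)   # part (a)
p_b = interval_probability(wind, 11, 16)  # part (b), "moderate breeze"

print(f"P(8 < W < 14)  = {p_a:.4f}")
print(f"P(11 < W < 16) = {p_b:.4f}")
```

Because the normal model is continuous, the same function serves for strict and non-strict inequalities, which is the point the AO2 mark rewards.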
Connects to:
The binomial distribution. The natural binary-event LDS quantities are "rain on a given day", "daily maximum gust exceeds a threshold", "wind direction in a chosen sector". For each, modelling the count of successes in n days as B(n,p) is appropriate provided trials are independent and p is constant. Independence is the assumption most often violated in real meteorological data: weather on consecutive days is correlated.
The Normal distribution. Continuous LDS measurements — daily mean temperature, daily mean wind speed, mean cloud cover, sunshine hours — are candidates for a Normal model when the marginal distribution is approximately symmetric and bell-shaped. Skewed quantities (rainfall amounts, which pile up at zero) usually fail the visual symmetry test.
Sampling and estimation. The LDS provides empirical estimates $\bar{x}$ and s that play the role of μ and σ in the proposed model. Awareness that estimates carry uncertainty is examined in section O (hypothesis testing for the mean of a Normal distribution with known variance).
Modelling assumptions. Both binomial and Normal models depend on stated assumptions; the AO3 (problem-solving) mark in extended LDS questions is awarded for naming and critically evaluating these assumptions.