Descriptive statistics (means, standard deviations, graphs) summarise data, but they cannot tell us whether a difference or relationship is real or merely a fluke of sampling. Suppose one group recalls a mean of 14 words and another 12 — is that a genuine effect, or could it easily arise by chance? Inferential statistics answer exactly this question: they let psychologists infer something about the wider population from a sample by calculating the probability that the results occurred by chance, and on that basis deciding whether to reject the null hypothesis. Without this step, a researcher could only ever describe their particular sample; with it, they can make justified claims about people in general, which is what turns a one-off observation into scientific knowledge. This lesson covers probability and significance, Type I and Type II errors, how to choose the correct statistical test, and how to compare the calculated value with the critical value.
Key Definition: Inferential statistics are statistical tests used to determine whether the results of a study are statistically significant — unlikely to have occurred by chance — and can therefore be generalised from the sample to the wider population.
This lesson addresses the following points in AQA A-Level Psychology (7182), Section 4.2 (Research methods):
Assessment objectives engaged: AO1 (significance, errors, the named tests), AO2 (choosing the correct test for, and interpreting the outcome of, a novel study) and AO3 (evaluating significance decisions and the consequences of error). These questions are strongly application-based and may require you to select a test, read a critical-values table, or state a conclusion.
In psychology, the conventional significance level is p≤0.05, meaning the probability that the observed results are due to chance is 5% or less. More precisely, if the null hypothesis were true, results this extreme would occur 5% of the time or less. If the test shows the probability of a chance result is at or below this threshold, we reject the null hypothesis and accept the alternative (experimental) hypothesis.
| Significance level | Interpretation |
|---|---|
| p ≤ 0.05 | The standard level — results judged statistically significant |
| p ≤ 0.01 | More stringent (1% chance of error) — used when a false positive would be serious, e.g. trialling a drug |
| p ≤ 0.10 | More lenient — occasionally used in exploratory or pilot research |
Why 5%? The figure is a convention, not a law of nature. It represents a pragmatic balance: strict enough that we will not constantly cry "effect!" over random noise (which a 10% level would risk), but lenient enough that genuine effects of reasonable size can be detected (which a 1% level might miss). Probability itself runs on a scale from 0 (an event is impossible) to 1 (it is certain), so p≤0.05 simply marks the point at which a chance explanation becomes implausible enough to discard. Crucially, "significant" never means "certain": there remains up to a 5% chance that we have rejected a true null hypothesis (a Type I error). Replication is therefore essential — a single significant result could always be the 1-in-20 fluke.
Key Definition: The significance level is the probability threshold below which the null hypothesis is rejected. In psychology, p≤0.05 is standard — there is a 5% or smaller probability that the results occurred by chance.
Exam Tip: Note the careful wording: p≤0.05 does not mean "95% certain the hypothesis is true". It means that if the null hypothesis were true, results this extreme would occur 5% of the time or less. We never prove a hypothesis — we reject or retain the null.
Because we work with probability, statistical decisions can be wrong in two ways.
| Error | What happens | When more likely |
|---|---|---|
| Type I error (false positive) | The null hypothesis is rejected when it is actually true — we claim an effect that is not really there | Significance level too lenient (e.g. p≤0.10) |
| Type II error (false negative) | The null hypothesis is retained when it is actually false — we miss a real effect | Significance level too stringent (e.g. p≤0.01) or the sample is too small / power too low |
Exam Tip: Remember Type I as "seeing something that Isn't there" (a false alarm), and Type II as "missing something that is there". Making the level stricter (e.g. moving from 0.05 to 0.01) reduces Type I errors but increases Type II errors — there is always a trade-off, which is why p≤0.05 is the standard compromise.
Worked illustration. A researcher tests whether a new therapy reduces anxiety, setting p≤0.05.
The consequences of each error depend on context, which is why the choice of significance level is not arbitrary. In drug trials, a Type I error (releasing an ineffective or harmful drug) is potentially catastrophic, so researchers adopt a stricter level such as p≤0.01 to guard against false positives — accepting a higher risk of a Type II error in exchange. In exploratory or pilot research, where the aim is simply to decide whether an effect is worth pursuing, a Type II error (abandoning a promising line of enquiry) is the greater worry, so a more lenient level may be justified. Understanding this trade-off — that you cannot minimise both error types at once for a fixed sample — is exactly the kind of evaluative point that distinguishes strong answers.
The probability of correctly rejecting a false null hypothesis (that is, of detecting a real effect) is called statistical power. Power is higher when the sample is larger, the measure is more sensitive, the test is more powerful (parametric), and the true effect is larger. A study with low power (often because the sample is too small) is prone to Type II errors, which is why under-powered studies that report "no significant difference" should be interpreted cautiously: absence of evidence is not evidence of absence.
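The trade-off between the two error types, and the effect of sample size on power, can be checked numerically. The following is a minimal sketch rather than a prescribed method: it assumes a one-tailed sign test, and the value `P_WORSE_IF_EFFECT_REAL = 0.25` (the chance that a participant gets worse when the treatment really works) is a hypothetical figure chosen only for illustration. The function names are also invented for this example.

```python
from math import comb


def binomial_cdf(c, n, p):
    """P(X <= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))


def critical_value(n, alpha):
    """Largest count c of the rarer sign with P(X <= c) <= alpha under the null (p = 0.5)."""
    c = -1  # -1 means no outcome is extreme enough to reach significance
    while c + 1 <= n and binomial_cdf(c + 1, n, 0.5) <= alpha:
        c += 1
    return c


def error_rates(n, alpha, p_worse):
    """Return (Type I rate, Type II rate, power) for a one-tailed sign test."""
    c = critical_value(n, alpha)
    type_1 = binomial_cdf(c, n, 0.5)     # null true, yet rejected
    power = binomial_cdf(c, n, p_worse)  # null false, correctly rejected
    return type_1, 1 - power, power


P_WORSE_IF_EFFECT_REAL = 0.25  # hypothetical effect size, an assumption for illustration

print("Fixed sample (N = 20), varying significance level")
for alpha in (0.10, 0.05, 0.01):
    t1, t2, _ = error_rates(20, alpha, P_WORSE_IF_EFFECT_REAL)
    print(f"  alpha = {alpha:.2f}: Type I = {t1:.3f}, Type II = {t2:.3f}")

print("Fixed significance level (0.05), varying sample size")
for n in (10, 20, 40):
    _, _, power = error_rates(n, 0.05, P_WORSE_IF_EFFECT_REAL)
    print(f"  N = {n}: power = {power:.3f}")
```

Under these assumptions, tightening the level from 0.10 to 0.01 lowers the Type I rate while raising the Type II rate, and enlarging the sample raises power at a fixed level.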
The sign test is the simplest inferential test required at A-Level, and the only one you may be asked to calculate fully. Use it when all three of the following hold: the hypothesis predicts a difference, the design is related (repeated measures or matched pairs), and the data are nominal.
In terms of the decision table, the sign test sits where the related-difference row meets the nominal column, which is exactly why it suits before/after designs in which all we can say about each participant is whether they went up or down. It is the obvious test when the dependent variable is simply a yes/no or improved/worsened judgement.
Procedure:
Worked example. Does a relaxation technique reduce stress ratings? Ten participants rate stress before and after.
| Participant | Before | After | Difference | Sign |
|---|---|---|---|---|
| 1 | 8 | 5 | −3 | − |
| 2 | 7 | 6 | −1 | − |
| 3 | 6 | 6 | 0 | (excluded) |
| 4 | 9 | 4 | −5 | − |
| 5 | 5 | 3 | −2 | − |
| 6 | 8 | 7 | −1 | − |
| 7 | 6 | 5 | −1 | − |
| 8 | 7 | 8 | +1 | + |
| 9 | 9 | 6 | −3 | − |
| 10 | 8 | 5 | −3 | − |
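The calculation for this table can be sketched in a few lines. This is an illustrative sketch, not the exam procedure itself: instead of looking up a printed critical value, it computes the exact one-tailed binomial probability of getting S or fewer of the rarer sign when each direction is equally likely, which is an assumption about how one might check the table by computer. The function names are invented for the example.

```python
from math import comb

# Before/after stress ratings copied from the table above.
before = [8, 7, 6, 9, 5, 8, 6, 7, 9, 8]
after = [5, 6, 6, 4, 3, 7, 5, 8, 6, 5]


def sign_test(before, after):
    """Return (S, N): count of the less frequent sign, and number of non-zero differences."""
    differences = [a - b for b, a in zip(before, after)]
    plus = sum(1 for d in differences if d > 0)
    minus = sum(1 for d in differences if d < 0)
    return min(plus, minus), plus + minus  # zero differences are excluded from N


def one_tailed_probability(s, n):
    """Exact chance of s or fewer of the rarer sign if both directions were equally likely."""
    return sum(comb(n, k) for k in range(s + 1)) / 2**n


s, n = sign_test(before, after)
p = one_tailed_probability(s, n)
print(f"S = {s}, N = {n}, one-tailed p = {p:.4f}")
# A one-tailed conclusion also requires the rarer sign to be the one the
# hypothesis predicts to be rare (here: increases in stress).
print("Significant at p <= 0.05" if p <= 0.05 else "Not significant at p <= 0.05")
```

Here S = 1 (one participant's stress rose) and N = 9 (participant 3 is excluded), giving an exact probability of 10/512, or about 0.02.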
Two features of the sign test are worth pausing on. First, it deliberately throws away information — it uses only the direction of each change, not its size, which is why a −5 counts the same as a −1. This makes the test easy to compute but relatively insensitive; a Wilcoxon test, which also ranks the magnitude of the differences, would extract more from the same data. Second, the one-tailed critical value was used because the hypothesis was directional (the technique would reduce stress); had the prediction merely been that stress would change, the two-tailed critical value would apply and significance would be slightly harder to reach. This worked example therefore illustrates not just the mechanics of one test, but the general principles of significance, tails and the calculated-versus-critical comparison that apply throughout inferential testing.
Exam Tip: For the sign test (and Wilcoxon and Mann-Whitney), the calculated value must be equal to or less than the critical value for significance. For Chi-Squared, Spearman's rho, Pearson's r and the t-tests, the calculated value must be equal to or greater than the critical value. Stating the wrong direction loses the conclusion mark. A handy rule is that tests with an "R" in their name (Chi-SquaRed, SpeaRman's rho, PeaRson's r, Related and unRelated t-tests) need the calculated value to be biggeR.
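The direction rule can be written as a small lookup. This is a sketch with invented names (`is_significant` and the two sets are not from any library), and it assumes that for correlations and t-tests the size of the calculated value is compared, ignoring its sign.

```python
# Tests where the calculated value must be EQUAL TO OR LESS THAN the critical value.
LOWER_IS_SIGNIFICANT = {"sign test", "wilcoxon", "mann-whitney"}

# Tests where the calculated value must be EQUAL TO OR GREATER THAN the critical value.
HIGHER_IS_SIGNIFICANT = {
    "chi-squared",
    "spearman's rho",
    "pearson's r",
    "related t-test",
    "unrelated t-test",
}


def is_significant(test, calculated, critical):
    """Apply the correct direction of comparison for the named test."""
    name = test.lower()
    if name in LOWER_IS_SIGNIFICANT:
        return calculated <= critical
    if name in HIGHER_IS_SIGNIFICANT:
        # Assumption: compare magnitude, so a strong negative correlation still counts.
        return abs(calculated) >= critical
    raise ValueError(f"Unknown test: {test}")


# Illustrative values only.
print(is_significant("sign test", 1, 1))         # equal to the critical value counts
print(is_significant("Spearman's rho", 0.4, 0.6))
```

Note that a value exactly equal to the critical value counts as significant in both directions.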
The correct test is fixed by three questions:
graph TD
A[What does the hypothesis test?] -->|Difference| B[What is the design?]
A -->|Correlation / association| C[Level of measurement?]
B -->|Unrelated<br/>independent groups| D[Level of measurement?]
B -->|Related<br/>repeated measures / matched pairs| E[Level of measurement?]
D -->|Nominal| D1[Chi-Squared]
D -->|Ordinal| D2[Mann-Whitney U]
D -->|Interval| D3[Unrelated t-test]
E -->|Nominal| E1[Sign test]
E -->|Ordinal| E2[Wilcoxon signed-rank]
E -->|Interval| E3[Related t-test]
C -->|Nominal| C0[Chi-Squared<br/>test of association]
C -->|Ordinal| C1[Spearman's rho]
C -->|Interval| C2[Pearson's r]
| | Nominal | Ordinal | Interval |
|---|---|---|---|
| Difference — unrelated (independent groups) | Chi-Squared (χ²) | Mann-Whitney U | Unrelated t-test |
| Difference — related (repeated measures / matched pairs) | Sign test | Wilcoxon signed-rank | Related t-test |
| Correlation / association | Chi-Squared (χ²) | Spearman's rho (rₛ) | Pearson's r |
Working through the three questions in order makes the choice mechanical. First, decide whether the hypothesis concerns a difference or a correlation — this picks the row block. Second, for a difference, decide whether the design is related (the same or matched participants across conditions) or unrelated (different, independent participants) — this picks the row. Third, identify the level of measurement — this picks the column. The cell at the intersection names the test. Chi-Squared appears twice because it serves both as a test of difference between independent categories and as a test of association between two categorical variables; either way it requires nominal (frequency) data.
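Because the choice is mechanical, the three questions can be sketched as a dictionary lookup. The function name `choose_test` and the string labels are invented for this illustration; the mapping itself is taken directly from the decision table.

```python
# Keys are (purpose, design, level of measurement); design is None for correlations.
TESTS = {
    ("difference", "unrelated", "nominal"): "Chi-Squared",
    ("difference", "unrelated", "ordinal"): "Mann-Whitney U",
    ("difference", "unrelated", "interval"): "Unrelated t-test",
    ("difference", "related", "nominal"): "Sign test",
    ("difference", "related", "ordinal"): "Wilcoxon signed-rank",
    ("difference", "related", "interval"): "Related t-test",
    ("correlation", None, "nominal"): "Chi-Squared",
    ("correlation", None, "ordinal"): "Spearman's rho",
    ("correlation", None, "interval"): "Pearson's r",
}


def choose_test(purpose, level, design=None):
    """Answer the three questions in order and return the matching test."""
    if purpose == "correlation":
        design = None  # the design question only applies to tests of difference
    key = (purpose, design, level)
    if key not in TESTS:
        raise ValueError(f"No test for {key}")
    return TESTS[key]


# A repeated-measures study recording only improved/worsened for each participant:
print(choose_test("difference", "nominal", design="related"))
# A study relating two sets of ranked scores:
print(choose_test("correlation", "ordinal"))
```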
Exam Tip: This table is the single most-tested piece of methods knowledge. A popular mnemonic reads it one column at a time (Nominal, then Ordinal, then Interval), working down each column from unrelated to related to correlation: "Carrots Should Come Mashed With Swede Under Roast Potatoes" → [nominal] Chi-Squared, Sign test, Chi-Squared; [ordinal] Mann-Whitney, Wilcoxon, Spearman; [interval] Unrelated t, Related t, Pearson. Always justify: name the type of test, the design and the level of measurement.
| Test | When used | Key feature |
|---|---|---|
| Chi-Squared (χ²) | Difference/association, nominal, unrelated | Compares observed vs expected frequencies in a contingency table |
| Sign test | Difference, nominal, related | Counts the direction (sign) of change |
| Mann-Whitney U | Difference, ordinal, unrelated | Ranks all scores together; compares rank totals |
| Wilcoxon signed-rank | Difference, ordinal, related | Ranks the differences between paired scores |
| Unrelated t-test | Difference, interval, unrelated | Compares the means of two independent groups (parametric) |
| Related t-test | Difference, interval, related | Compares paired mean differences (parametric) |
| Spearman's rho (rₛ) | Correlation, ordinal | Strength/direction of a monotonic relationship between ranks |
| Pearson's r | Correlation, interval | Strength/direction of a linear relationship (parametric) |
The three parametric tests (the two t-tests and Pearson's r) are more powerful but require interval data, a roughly normal distribution, and similar variances; the others are non-parametric and make fewer assumptions.
The distinction matters because it adds a fourth consideration to test choice. A parametric test may be used only when three conditions are met: the data are at the interval level; the populations are approximately normally distributed; and the two samples have similar variances (homogeneity of variance). When these hold, the parametric test (related t-test, unrelated t-test, or Pearson's r) is preferred because it is more powerful — better able to detect a real effect, and so less prone to a Type II error. When the conditions are not met — for example, the data are ordinal, or the distribution is badly skewed — a non-parametric equivalent (Wilcoxon, Mann-Whitney, Spearman's rho) must be used instead. These rank-based tests sacrifice some power in exchange for making far fewer assumptions, which is why they are the workhorses of psychological research on rating scales and rankings.
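This fourth consideration can be sketched as a final check applied after the basic test choice. The function name and its boolean arguments are hypothetical; in practice, judging whether data are "approximately normal" or have "similar variances" involves inspection of the data that this sketch does not attempt.

```python
# Each parametric test paired with its non-parametric equivalent.
NON_PARAMETRIC_EQUIVALENT = {
    "Related t-test": "Wilcoxon signed-rank",
    "Unrelated t-test": "Mann-Whitney U",
    "Pearson's r": "Spearman's rho",
}


def final_test(parametric_test, interval_data, roughly_normal, similar_variances):
    """Keep the parametric test only if all three conditions hold; otherwise fall back."""
    if parametric_test not in NON_PARAMETRIC_EQUIVALENT:
        raise ValueError(f"Not a parametric test: {parametric_test}")
    if interval_data and roughly_normal and similar_variances:
        return parametric_test
    return NON_PARAMETRIC_EQUIVALENT[parametric_test]


# Interval data, but the distribution is badly skewed:
print(final_test("Unrelated t-test", True, False, True))
```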
You are not required to calculate Mann-Whitney or Wilcoxon by hand, but understanding the logic aids interpretation. Mann-Whitney U (unrelated, ordinal) pools all the scores from both groups, ranks them from lowest to highest, and then asks whether the ranks of one group cluster systematically higher than the other; if the two groups were really the same, high and low ranks would be evenly mixed. Wilcoxon signed-rank (related, ordinal) works on the differences between each participant's two scores: it ranks the sizes of those differences and then checks whether the positive and negative differences balance out (as the null predicts) or whether one direction dominates. In both cases the test converts raw scores into ranks, which is precisely why they suit ordinal data and do not assume a normal distribution.
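The ranking logic of both tests can be sketched as follows. This is an illustration of the idea only, not something you would be asked to reproduce. The two independent groups are hypothetical data invented for the example, while the before/after ratings reuse the stress scores from the sign-test example (treated here as ordinal). Tied scores share the average of the ranks they occupy, which is the usual convention.

```python
def average_ranks(values):
    """Rank from lowest (1) upwards; tied values share the mean of the ranks they span."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(ordered):
        first = ordered.index(v) + 1  # lowest rank position occupied by v
        count = ordered.count(v)      # how many tied values share it
        rank_of[v] = first + (count - 1) / 2
    return [rank_of[v] for v in values]


def mann_whitney_u(group_a, group_b):
    """Pool and rank all scores, then return the smaller of the two U values."""
    ranks = average_ranks(group_a + group_b)
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return min(u_a, u_b)


def wilcoxon_t(before, after):
    """Rank the sizes of the non-zero differences; return the smaller signed-rank total."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    t_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(t_plus, t_minus)


# Hypothetical independent groups (invented for illustration).
group_a = [3, 4, 2, 6, 5]
group_b = [7, 8, 9, 10, 6]
print("U =", mann_whitney_u(group_a, group_b))

# Stress ratings from the sign-test worked example.
before = [8, 7, 6, 9, 5, 8, 6, 7, 9, 8]
after = [5, 6, 6, 4, 3, 7, 5, 8, 6, 5]
print("T =", wilcoxon_t(before, after))
```

In both cases a small calculated value means one group or one direction dominates the ranks, which is why these tests need the calculated value to be at or below the critical value.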
The Chi-Squared statistic compares observed frequencies $O$ with the frequencies expected under the null hypothesis $E$:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
The larger the gap between what is observed and what would be expected by chance, the larger $\chi^2$ becomes.
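As a rough illustration, the statistic can be computed for a small contingency table. The observed frequencies below are hypothetical, and the sketch assumes the standard way of obtaining each expected frequency (row total × column total ÷ grand total), which is not spelled out in the text above.

```python
def chi_squared(observed):
    """Sum (O - E)^2 / E over every cell of a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected frequency if the row and column variables were unrelated.
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (o - e) ** 2 / e
    return statistic


# Hypothetical 2 x 2 table: rows are two independent groups, columns are yes/no.
observed = [
    [20, 10],
    [10, 20],
]
print(f"chi-squared = {chi_squared(observed):.2f}")
```

Every expected frequency here is 15, so each cell contributes 25/15 and the statistic is about 6.67.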
Spearman's rho (the rank correlation coefficient) is given by
$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$
where $d$ is the difference between the two ranks for each pair and $n$ is the number of pairs. As with any correlation coefficient, the result lies in the range $-1 \le r_s \le +1$.
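The formula can be tried on a small data set. The scores below are hypothetical and chosen to have no tied values, since this sketch assumes the simple formula is applied to untied ranks.

```python
def ranks_without_ties(scores):
    """Rank scores from lowest (1) to highest; assumes no two scores are equal."""
    position = {score: i + 1 for i, score in enumerate(sorted(scores))}
    return [position[score] for score in scores]


def spearman_rho(x_scores, y_scores):
    """Apply r_s = 1 - 6 * sum(d^2) / (n * (n^2 - 1)) to the two sets of ranks."""
    x_ranks = ranks_without_ties(x_scores)
    y_ranks = ranks_without_ties(y_scores)
    n = len(x_scores)
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(x_ranks, y_ranks))
    return 1 - (6 * sum_d_squared) / (n * (n**2 - 1))


# Hypothetical paired scores for five participants (invented for illustration).
x_scores = [10, 20, 30, 40, 50]
y_scores = [15, 12, 30, 25, 40]
print(f"r_s = {spearman_rho(x_scores, y_scores):.2f}")
```

The rank differences are −1, 1, −1, 1 and 0, so the sum of squared differences is 4 and the coefficient is 1 − 24/120 = 0.8, a strong positive correlation.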