Inferential Statistics and Hypothesis Testing

Descriptive statistics (means, standard deviations, graphs) summarise data, but they cannot tell us whether a difference or relationship is real or merely a fluke of sampling. Suppose one group recalls a mean of 14 words and another 12 — is that a genuine effect, or could it easily arise by chance? Inferential statistics answer exactly this question: they let psychologists infer something about the wider population from a sample by calculating the probability that the results occurred by chance, and on that basis deciding whether to reject the null hypothesis. Without this step, a researcher could only ever describe their particular sample; with it, they can make justified claims about people in general, which is what turns a one-off observation into scientific knowledge. This lesson covers probability and significance, Type I and Type II errors, how to choose the correct statistical test, and how to compare the calculated value with the critical value.

Key Definition: Inferential statistics are statistical tests used to determine whether the results of a study are statistically significant — unlikely to have occurred by chance — and can therefore be generalised from the sample to the wider population.

Spec Mapping

This lesson addresses the following points in AQA A-Level Psychology (7182), Section 4.2 (Research methods):

Introduction to statistical testing; the sign test.
Probability and significance; use of statistical tables and critical values in interpretation of significance; Type I and Type II errors.
Factors affecting the choice of statistical test, including level of measurement and experimental design.
The use of inferential tests: Spearman's rho, Pearson's r, Wilcoxon, Mann-Whitney, related and unrelated t-test, and Chi-Squared.
Reporting of psychological investigations; conventions of reporting statistical findings.

Assessment objectives engaged: AO1 (significance, errors, the named tests), AO2 (choosing the correct test for, and interpreting the outcome of, a novel study) and AO3 (evaluating significance decisions and the consequences of error). These questions are strongly application-based and may require you to select a test, read a critical-values table, or state a conclusion.

Probability and Significance

In psychology, the conventional significance level is

$p \leq 0.05$

meaning that, if the null hypothesis were true, results at least as extreme as those observed would arise by chance 5% of the time or less. If the test shows the probability of a chance result is at or below this threshold, we reject the null hypothesis and accept the alternative (experimental) hypothesis.

Significance level	Interpretation
p ≤ 0.05	The standard level — results judged statistically significant
p ≤ 0.01	More stringent (1% chance of error) — used when a false positive would be serious, e.g. trialling a drug
p ≤ 0.10	More lenient — occasionally used in exploratory or pilot research

Why 5%? The figure is a convention, not a law of nature. It represents a pragmatic balance: strict enough that we will not constantly cry "effect!" over random noise (which a 10% level would risk), but lenient enough that genuine effects of reasonable size can be detected (which a 1% level might miss). Probability itself runs on a scale from 0 (an event is impossible) to 1 (it is certain), so $p \leq 0.05$ simply marks the point at which a chance explanation becomes implausible enough to discard. Crucially, "significant" never means "certain": there remains up to a 5% chance that we have rejected a true null hypothesis (a Type I error). Replication is therefore essential — a single significant result could always be the 1-in-20 fluke.

Key Definition: The significance level is the probability threshold at or below which the null hypothesis is rejected. In psychology, $p \leq 0.05$ is standard: if the null hypothesis were true, results this extreme would occur by chance 5% of the time or less.

Exam Tip: Note the careful wording: $p \leq 0.05$ does not mean "95% certain the hypothesis is true". It means that if the null hypothesis were true, results this extreme would occur 5% of the time or less. We never prove a hypothesis — we reject or retain the null.

Type I and Type II Errors

Because we work with probability, statistical decisions can be wrong in two ways.

Error	What happens	When more likely
Type I error (false positive)	The null hypothesis is rejected when it is actually true — we claim an effect that is not really there	Significance level too lenient (e.g. $p \leq 0.10$ )
Type II error (false negative)	The null hypothesis is retained when it is actually false — we miss a real effect	Significance level too stringent (e.g. $p \leq 0.01$ ) or the sample is too small / power too low

Exam Tip: Remember Type I as "seeing something that Isn't there" (a false alarm), and Type II as "missing something that is there". Making the level stricter (e.g. moving from 0.05 to 0.01) reduces Type I errors but increases Type II errors — there is always a trade-off, which is why $p \leq 0.05$ is the standard compromise.

Worked illustration. A researcher tests whether a new therapy reduces anxiety, setting $p \leq 0.05$ .

A Type I error occurs if the therapy does not actually work but the test happens to produce a significant result by chance — the therapy is wrongly judged effective.
A Type II error occurs if the therapy does work but the test fails to detect it (perhaps the sample was too small) — the therapy is wrongly judged ineffective.

The consequences of each error depend on context, which is why the choice of significance level is not arbitrary. In drug trials, a Type I error (releasing an ineffective or harmful drug) is potentially catastrophic, so researchers adopt a stricter level such as $p \leq 0.01$ to guard against false positives — accepting a higher risk of a Type II error in exchange. In exploratory or pilot research, where the aim is simply to decide whether an effect is worth pursuing, a Type II error (abandoning a promising line of enquiry) is the greater worry, so a more lenient level may be justified. Understanding this trade-off — that you cannot minimise both error types at once for a fixed sample — is exactly the kind of evaluative point that distinguishes strong answers.

The probability of correctly rejecting a false null hypothesis (that is, of detecting a real effect) is called statistical power. Power rises with a larger sample, a more sensitive measure, a more powerful (parametric) test where its assumptions are met, and a larger true effect size. A study with low power (often because the sample is too small) is prone to Type II errors, which is why under-powered studies that report "no significant difference" should be interpreted cautiously: absence of evidence is not evidence of absence.
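The trade-off between the two error types can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a one-tailed sign test on a hypothetical study of 20 participants, and it assumes that when the effect is real each participant has a 0.2 chance of showing the "wrong" sign (both figures are invented for the example). The helper names are also hypothetical.

```python
from math import comb

def binom_cdf(c, n, p):
    """P(X <= c) when X counts 'wrong-direction' signs among n participants."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

def critical_value(n, alpha):
    """Largest S whose probability under the null (p = 0.5) is <= alpha.
    Returns -1 if no value of S can reach significance."""
    c = -1
    while c + 1 <= n and binom_cdf(c + 1, n, 0.5) <= alpha:
        c += 1
    return c

def error_rates(n, alpha, true_p_wrong_sign):
    """Return (critical S, Type I rate, power, Type II rate)."""
    c = critical_value(n, alpha)
    type_1 = binom_cdf(c, n, 0.5)               # null true, yet we reject
    power = binom_cdf(c, n, true_p_wrong_sign)  # effect real, and we detect it
    return c, type_1, power, 1 - power

N = 20             # hypothetical sample size
TRUE_P_WRONG = 0.2  # assumed chance of a 'wrong' sign when the effect is real

for alpha in (0.10, 0.05, 0.01):
    c, type_1, power, type_2 = error_rates(N, alpha, TRUE_P_WRONG)
    print(f"level {alpha:.2f}: critical S = {c}, "
          f"Type I = {type_1:.3f}, power = {power:.3f}, Type II = {type_2:.3f}")
```

Under these assumptions, tightening the level from 0.05 to 0.01 lowers the Type I rate but also lowers power, so the Type II rate goes up, which is the trade-off described above.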

The Sign Test

The sign test is the simplest inferential test required at A-Level, and the only one you may be asked to calculate fully. Use it when all three of the following hold:

the hypothesis predicts a difference (not a correlation);
the design is related (repeated measures or matched pairs);
the data are nominal (or can be reduced to the direction of change).

In the decision table later in this lesson, the sign test sits in the related-difference row and the nominal column, which is why it suits before/after designs in which all we can say about each participant is whether they went up or down. It is the obvious test when the dependent variable is simply a yes/no or improved/worsened judgement.

Procedure:

Record each participant's score in Condition A and Condition B.
Work out the sign of the difference (A − B): $+$ , $-$ , or $0$ .
Discard any participant scoring $0$ (no change).
Let $N$ = the number of remaining participants.
Let $S$ = the count of the less frequent sign — this is the calculated value.
Find the critical value of $S$ for that $N$ and significance level (one- or two-tailed).
If $S \leq$ the critical value, the result is significant — reject the null hypothesis.

Worked example. Does a relaxation technique reduce stress ratings? Ten participants rate stress before and after.

Participant	Before	After	Difference	Sign
1	8	5	−3	−
2	7	6	−1	−
3	6	6	0	(excluded)
4	9	4	−5	−
5	5	3	−2	−
6	8	7	−1	−
7	6	5	−1	−
8	7	8	+1	+
9	9	6	−3	−
10	8	5	−3	−

Participant 3 is excluded (difference $= 0$ ), so $N = 9$ .
There are 8 minus signs and 1 plus sign, so $S = 1$ (the less frequent sign).
The critical value for a one-tailed test at $p \leq 0.05$ with $N = 9$ is $1$ .
Since $S\,(1) \leq$ critical value $(1)$ , the result is significant.
Conclusion: the relaxation technique significantly reduced stress ratings ( $p \leq 0.05$ , one-tailed); the null hypothesis is rejected.
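The worked example above can be checked with a short script. This is a minimal sketch rather than an exam method: the function names are invented for illustration, and the exact probability it prints comes from the binomial distribution (each sign equally likely under the null), which is the reasoning that critical-value tables for the sign test are built on.

```python
from math import comb

# Stress ratings from the worked example (participants 1 to 10).
before = [8, 7, 6, 9, 5, 8, 6, 7, 9, 8]
after = [5, 6, 6, 4, 3, 7, 5, 8, 6, 5]

def sign_test(before_scores, after_scores):
    """Return (N, S, plus_count, minus_count) for paired scores."""
    diffs = [a - b for b, a in zip(before_scores, after_scores)]
    plus = sum(1 for d in diffs if d > 0)
    minus = sum(1 for d in diffs if d < 0)
    n = plus + minus       # zero differences are discarded
    s = min(plus, minus)   # the less frequent sign
    return n, s, plus, minus

def one_tailed_p(n, s):
    """Probability of S or fewer of the rarer sign if + and - are equally likely."""
    return sum(comb(n, k) for k in range(s + 1)) / 2**n

n, s, plus, minus = sign_test(before, after)
print(f"N = {n}, plus = {plus}, minus = {minus}, S = {s}")
print(f"exact one-tailed probability = {one_tailed_p(n, s):.4f}")  # 10/512 = 0.0195
```

The exact probability is below 0.05, which agrees with the table-based conclusion that $S = 1$ is significant for $N = 9$. The one-tailed figure is only appropriate because the rarer sign (plus) is the one the directional hypothesis predicted to be rarer.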

Two features of the sign test are worth pausing on. First, it deliberately throws away information — it uses only the direction of each change, not its size, which is why a $-5$ counts the same as a $-1$ . This makes the test easy to compute but relatively insensitive; a Wilcoxon test, which also ranks the magnitude of the differences, would extract more from the same data. Second, the one-tailed critical value was used because the hypothesis was directional (the technique would reduce stress); had the prediction merely been that stress would change, the two-tailed critical value would apply and significance would be slightly harder to reach. This worked example therefore illustrates not just the mechanics of one test, but the general principles of significance, tails and the calculated-versus-critical comparison that apply throughout inferential testing.

Exam Tip: For the sign test (and Wilcoxon and Mann-Whitney), the calculated value must be equal to or less than the critical value for significance. For Chi-Squared, Spearman's rho, Pearson's r and the t-tests, the calculated value must be equal to or greater than the critical value. Stating the wrong direction loses the conclusion mark. A handy rule: if the test's name contains a letter R (SpeaRman's Rho, PeaRson's R, Chi-SquaRed, Related and unRelated t), the calculated value needs to be gReateR; Sign, Wilcoxon and Mann-Whitney contain no R.
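The direction rule can be written as a small lookup. The sketch below is a hypothetical helper, not part of any statistics library. It assumes that, for correlations and t-tests, the size of the calculated value (ignoring its sign) is compared with the table; for a directional hypothesis you would also need to confirm the effect lies in the predicted direction.

```python
# Tests where the calculated value must be <= the critical value.
LESS_OR_EQUAL = {"sign test", "wilcoxon", "mann-whitney"}

# Tests where the calculated value must be >= the critical value.
GREATER_OR_EQUAL = {
    "chi-squared", "spearman's rho", "pearson's r",
    "related t-test", "unrelated t-test",
}

def is_significant(test, calculated, critical):
    """Apply the correct comparison direction for the named test."""
    name = test.lower()
    if name in LESS_OR_EQUAL:
        return calculated <= critical
    if name in GREATER_OR_EQUAL:
        # Assumption: compare the magnitude, since r or t can be negative.
        return abs(calculated) >= critical
    raise ValueError(f"unknown test: {test}")

# The sign-test example from this lesson: S = 1, critical value = 1.
print(is_significant("Sign test", 1, 1))

# A hypothetical Chi-Squared result compared with 3.84
# (the critical value for 1 degree of freedom at p <= 0.05).
print(is_significant("Chi-Squared", 2.5, 3.84))
```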

Choosing the Right Statistical Test

The correct test is fixed by three questions:

Difference or correlation? Is the hypothesis about a difference between conditions, or an association between two co-variables?
Related or unrelated design? Related = repeated measures or matched pairs; unrelated = independent groups. (For correlations, the data are paired by definition.)
Level of measurement? Nominal, ordinal, or interval.

Decision tree

graph TD
    A[What does the hypothesis test?] -->|Difference| B[What is the design?]
    A -->|Correlation / association| C[Level of measurement?]
    B -->|Unrelated<br/>independent groups| D[Level of measurement?]
    B -->|Related<br/>repeated measures / matched pairs| E[Level of measurement?]
    D -->|Nominal| D1[Chi-Squared]
    D -->|Ordinal| D2[Mann-Whitney U]
    D -->|Interval| D3[Unrelated t-test]
    E -->|Nominal| E1[Sign test]
    E -->|Ordinal| E2[Wilcoxon signed-rank]
    E -->|Interval| E3[Related t-test]
    C -->|Nominal| C0[Chi-Squared<br/>test of association]
    C -->|Ordinal| C1[Spearman's rho]
    C -->|Interval| C2[Pearson's r]

Decision table

	Nominal	Ordinal	Interval
Difference — unrelated (independent groups)	Chi-Squared ( $\chi^2$ )	Mann-Whitney U	Unrelated t-test
Difference — related (repeated measures / matched pairs)	Sign test	Wilcoxon signed-rank	Related t-test
Correlation / association	Chi-Squared ( $\chi^2$ )	Spearman's rho ( $r_s$ )	Pearson's r

Working through the three questions in order makes the choice mechanical. First, decide whether the hypothesis concerns a difference or a correlation — this picks the row block. Second, for a difference, decide whether the design is related (the same or matched participants across conditions) or unrelated (different, independent participants) — this picks the row. Third, identify the level of measurement — this picks the column. The cell at the intersection names the test. Chi-Squared appears twice because it serves both as a test of difference between independent categories and as a test of association between two categorical variables; either way it requires nominal (frequency) data.
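Because the choice is mechanical, the decision table can be expressed as a simple lookup. The sketch below mirrors the table above; the function name and the string labels are assumptions made for illustration.

```python
# Keys are (purpose, design, level of measurement). Correlations have no
# related/unrelated distinction, so their design slot is None.
TESTS = {
    ("difference", "unrelated", "nominal"): "Chi-Squared",
    ("difference", "unrelated", "ordinal"): "Mann-Whitney U",
    ("difference", "unrelated", "interval"): "Unrelated t-test",
    ("difference", "related", "nominal"): "Sign test",
    ("difference", "related", "ordinal"): "Wilcoxon signed-rank",
    ("difference", "related", "interval"): "Related t-test",
    ("correlation", None, "nominal"): "Chi-Squared",
    ("correlation", None, "ordinal"): "Spearman's rho",
    ("correlation", None, "interval"): "Pearson's r",
}

def choose_test(purpose, level, design=None):
    """Answer the three questions in order and return the test name."""
    if purpose == "correlation":
        key = ("correlation", None, level)  # design is irrelevant here
    else:
        key = (purpose, design, level)
    try:
        return TESTS[key]
    except KeyError:
        raise ValueError(f"no test in the table for {key}") from None

# A before/after study where each participant simply improved or worsened:
print(choose_test("difference", "nominal", design="related"))   # Sign test
# Two independent groups rated on a 1-10 scale:
print(choose_test("difference", "ordinal", design="unrelated"))  # Mann-Whitney U
```

In an exam answer the lookup alone is not enough: the three inputs are exactly the justification you need to state.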

Exam Tip: This table is the single most-tested piece of methods knowledge. A popular mnemonic reads down each column of the table in turn (nominal, then ordinal, then interval; within each column, unrelated difference, then related difference, then correlation): "Carrots Should Come Mashed With Swede Under Roast Potatoes" → [nominal] Chi-Squared, Sign test, Chi-Squared; [ordinal] Mann-Whitney, Wilcoxon, Spearman's rho; [interval] Unrelated t, Related t, Pearson's r. Always justify: name the type of test, the design and the level of measurement.

The named tests

Test	When used	Key feature
Chi-Squared ( $\chi^2$ )	Difference/association, nominal, unrelated	Compares observed vs expected frequencies in a contingency table
Sign test	Difference, nominal, related	Counts the direction (sign) of change
Mann-Whitney U	Difference, ordinal, unrelated	Ranks all scores together; compares rank totals
Wilcoxon signed-rank	Difference, ordinal, related	Ranks the differences between paired scores
Unrelated t-test	Difference, interval, unrelated	Compares the means of two independent groups (parametric)
Related t-test	Difference, interval, related	Compares paired mean differences (parametric)
Spearman's rho ( $r_s$ )	Correlation, ordinal	Strength/direction of a monotonic relationship between ranks
Pearson's r	Correlation, interval	Strength/direction of a linear relationship (parametric)

The three parametric tests (the two t-tests and Pearson's r) are more powerful but require interval data, a roughly normal distribution, and similar variances; the others are non-parametric and make fewer assumptions.

Parametric vs Non-Parametric Tests

The distinction matters because it adds a fourth consideration to test choice. A parametric test may be used only when three conditions are met: the data are at the interval level; the populations are approximately normally distributed; and the two samples have similar variances (homogeneity of variance). When these hold, the parametric test (related t-test, unrelated t-test, or Pearson's r) is preferred because it is more powerful — better able to detect a real effect, and so less prone to a Type II error. When the conditions are not met — for example, the data are ordinal, or the distribution is badly skewed — a non-parametric equivalent (Wilcoxon, Mann-Whitney, Spearman's rho) must be used instead. These rank-based tests sacrifice some power in exchange for making far fewer assumptions, which is why they are the workhorses of psychological research on rating scales and rankings.

How the Rank-Based Tests Work (Conceptually)

You are not required to calculate Mann-Whitney or Wilcoxon by hand, but understanding the logic aids interpretation. Mann-Whitney U (unrelated, ordinal) pools all the scores from both groups, ranks them from lowest to highest, and then asks whether the ranks of one group cluster systematically higher than the other; if the two groups were really the same, high and low ranks would be evenly mixed. Wilcoxon signed-rank (related, ordinal) works on the differences between each participant's two scores: it ranks the sizes of those differences and then checks whether the positive and negative differences balance out (as the null predicts) or whether one direction dominates. In both cases the test converts raw scores into ranks, which is precisely why they suit ordinal data and do not assume a normal distribution.
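The ranking logic can be sketched in a few lines. This is an illustration of the idea only, not something you would be asked to do by hand. The function names are invented, the two independent groups are hypothetical scores, and the paired data are the stress ratings from the sign-test example. Tied scores are given the mean of the ranks they span, which is the usual convention.

```python
def average_ranks(values):
    """Rank from lowest (rank 1) upwards; tied values share their mean rank."""
    ordered = sorted(values)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # Positions i..j-1 hold ranks i+1..j, whose mean is (i + 1 + j) / 2.
        rank_of[ordered[i]] = (i + 1 + j) / 2
        i = j
    return [rank_of[v] for v in values]

def mann_whitney_u(group_a, group_b):
    """Pool both groups, rank them, and return the smaller U value."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return min(u_a, u_b)

def wilcoxon_t(first, second):
    """Rank the sizes of the non-zero differences; return the smaller rank sum."""
    diffs = [b - a for a, b in zip(first, second) if b != a]
    ranks = average_ranks([abs(d) for d in diffs])
    positive = sum(r for d, r in zip(diffs, ranks) if d > 0)
    negative = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(positive, negative)

# Hypothetical independent groups: group B's scores cluster higher.
print(mann_whitney_u([3, 5, 6, 8], [7, 9, 10, 12]))  # 1.0

# Stress ratings from the sign-test worked example (before, after).
before = [8, 7, 6, 9, 5, 8, 6, 7, 9, 8]
after = [5, 6, 6, 4, 3, 7, 5, 8, 6, 5]
print(wilcoxon_t(before, after))  # 2.5
```

In both tests a small calculated value means one group or one direction dominates the ranks, which is why the calculated value must be equal to or less than the critical value for significance.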

Two key formulae

The Chi-Squared statistic compares observed frequencies $O$ with the frequencies expected under the null hypothesis $E$ :

$\chi^2 = \sum \frac{(O - E)^2}{E}$

The larger the gap between what is observed and what would be expected by chance, the larger $\chi^2$ becomes.
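As a rough illustration, the formula can be applied to a small contingency table. The frequencies below are hypothetical. The sketch also uses two standard facts that this lesson does not state: each expected frequency is (row total × column total) ÷ overall total, and the degrees of freedom are (rows − 1) × (columns − 1).

```python
def chi_squared(observed):
    """Return (chi-squared statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected frequency if the two variables were unrelated.
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (o - e) ** 2 / e

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df

# Hypothetical 2 x 2 table: rows are two independent groups,
# columns are the counts of "yes" and "no" responses.
observed = [
    [20, 10],
    [10, 20],
]

statistic, df = chi_squared(observed)
print(f"chi-squared = {statistic:.2f}, df = {df}")  # chi-squared = 6.67, df = 1
```

Here every expected frequency is 15, so each cell contributes $25/15$ and the four cells sum to about 6.67. That exceeds 3.84, the critical value for 1 degree of freedom at $p \leq 0.05$, so this invented result would be significant.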

Spearman's rho (the rank correlation coefficient) is given by

$r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$

where $d$ is the difference between the two ranks for each pair and $n$ is the number of pairs. As with any correlation coefficient, the result lies in the range $-1 \leq r_s \leq +1$ .
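A quick calculation shows the formula at work. The ranks below are hypothetical, and the function name is invented for the example. Note that this shortcut formula assumes there are no tied ranks.

```python
def spearman_rho(ranks_x, ranks_y):
    """Spearman's rho from two lists of ranks (no ties assumed)."""
    n = len(ranks_x)
    sum_d_squared = sum((x - y) ** 2 for x, y in zip(ranks_x, ranks_y))
    return 1 - (6 * sum_d_squared) / (n * (n**2 - 1))

# Five hypothetical participants ranked on two co-variables.
ranks_x = [1, 2, 3, 4, 5]
ranks_y = [2, 1, 4, 3, 5]

# d = -1, 1, -1, 1, 0, so the sum of d squared is 4 and
# r_s = 1 - 24 / 120 = 0.8
print(round(spearman_rho(ranks_x, ranks_y), 2))  # 0.8

# Identical rankings give +1; fully reversed rankings give -1.
print(spearman_rho([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # 1.0
print(spearman_rho([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))  # -1.0
```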
