Contingency Tables and Tests of Association

A contingency table (or two-way table) cross-classifies a sample by two categorical variables: gender against subject choice, treatment against outcome, region against voting intention. The chi-squared test of association uses the same $\sum (O-E)^2/E$ machinery as the goodness-of-fit test (Lesson 9), but asks a different question: are the two variables independent, or is there a relationship between them? The key new ingredients are the expected-frequency formula $E_{ij} = R_iC_j/N$ (forced by the independence hypothesis) and the degrees-of-freedom rule $\nu = (r-1)(c-1)$. This lesson is the capstone of 7367/3S hypothesis testing.

1. Where this sits in AQA 7367

This is Paper 3 Statistics option (7367/3S) content (per-paper weighting AO1 40% / AO2 25% / AO3 35%) and the second hypothesis-testing lesson, building directly on Lesson 9. Computing expected frequencies and the test statistic is AO1; deriving $E_{ij}$ from the independence assumption and choosing the degrees of freedom is AO2; interpreting which cells drive an association, and writing a careful "association ≠ causation" conclusion, is AO3. The prerequisites are the $\chi^2$ statistic and pooling rule (Lesson 9) and the multiplication rule for independent events from A-Level Mathematics probability.

This lesson completes the hypothesis-testing strand of the statistics option and is among the most applied topics in the whole qualification: contingency-table tests are the everyday workhorse of medicine, social science, market research and quality control, wherever two categorical classifications are cross-tabulated and the question "are these related?" arises. The underlying machinery (the $\sum (O-E)^2/E$ statistic, the $E \ge 5$ pooling rule, the upper-tail comparison) is identical to the goodness-of-fit test of Lesson 9, so the genuinely new learning is concentrated in two places: the derivation of the expected frequencies from the independence hypothesis, and the degrees-of-freedom rule $\nu = (r-1)(c-1)$. Get those two right and the rest is familiar territory. The interpretive demands, however, are higher here, because a real association invites the tempting but unwarranted leap to causation, a trap the examiners test explicitly.

2. Core theory

Structure of an $r \times c$ table

A table with $r$ rows and $c$ columns records observed frequencies $O_{ij}$, with row totals $R_i$, column totals $C_j$, and grand total $N$:

	Col 1	Col 2	$\cdots$	Col $c$	Row total
Row 1	$O_{11}$	$O_{12}$	$\cdots$	$O_{1c}$	$R_1$
Row 2	$O_{21}$	$O_{22}$	$\cdots$	$O_{2c}$	$R_2$
$\vdots$	$\vdots$	$\vdots$		$\vdots$	$\vdots$
Row $r$	$O_{r1}$	$O_{r2}$	$\cdots$	$O_{rc}$	$R_r$
Col total	$C_1$	$C_2$	$\cdots$	$C_c$	$N$

Hypotheses

$H_0: \text{the two variables are independent (no association)}; \qquad H_1: \text{the two variables are associated}.$

Expected frequencies — derived, not assumed

Under independence, $P(\text{row }i \cap \text{col }j) = P(\text{row }i)\,P(\text{col }j)$. The marginal probabilities are unknown, so estimate them from the data as $P(\text{row }i) = \tfrac{R_i}{N}$ and $P(\text{col }j) = \tfrac{C_j}{N}$. Multiplying the resulting cell probability by the sample size $N$ gives the expected count

$E_{ij} = N\cdot\frac{R_i}{N}\cdot\frac{C_j}{N} = \frac{R_i\,C_j}{N} = \frac{\text{row total}\times\text{column total}}{\text{grand total}}.$

A built-in check: the expected row and column totals automatically equal the observed ones, and $\sum_{ij}E_{ij} = N$. This follows directly from the formula: summing $R_iC_j/N$ over $j$ gives $R_i$, because the column totals sum to $N$, and likewise for columns. It gives you a free arithmetic check. After filling in the expected-frequency table, add along each row and down each column; if any expected margin fails to match the corresponding observed margin, you have made a slip. Because every $E_{ij}$ shares the same denominator $N$, the calculation is also quick: multiply each row total by each column total, then divide. A common time-saver: whenever two rows (or two columns) have equal totals, their expected frequencies are identical, so you need only compute one of them. In Example 1 below, the two gender rows both total 150, so the female expected frequencies simply copy the male ones.
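The formula and the margin check can be sketched in a few lines of Python. This is only an illustration: the function name `expected_frequencies` is made up for this sketch, and the counts are the ones used in Example 1.

```python
def expected_frequencies(observed):
    """Return the table of E_ij = R_i * C_j / N for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Observed counts from Example 1 (rows: Male, Female; columns: Subjects A, B, C).
observed = [[60, 40, 50],
            [30, 50, 70]]
expected = expected_frequencies(observed)

# Margin check: expected row and column totals should match the observed ones.
for obs_row, exp_row in zip(observed, expected):
    assert abs(sum(obs_row) - sum(exp_row)) < 1e-9
for obs_col, exp_col in zip(zip(*observed), zip(*expected)):
    assert abs(sum(obs_col) - sum(exp_col)) < 1e-9

print(expected)  # [[45.0, 45.0, 60.0], [45.0, 45.0, 60.0]]
```

The two rows of the output are identical because both row totals are 150, which is the time-saver described above.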

Test statistic and degrees of freedom

$X^2 = \sum_{\text{all cells}} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \ \sim\ \chi^2_\nu, \qquad \nu = (r-1)(c-1).$

The degrees of freedom arise because, once the marginal totals are fixed, only $(r-1)(c-1)$ of the interior cells are free to vary; the rest are determined by the totals. To see this concretely in a $2 \times 2$ table: if you know the four marginal totals and one interior cell, every other cell is forced (each row and column must sum to its margin), so just one cell is free, hence $\nu = (2-1)(2-1) = 1$. In a $2 \times 3$ table you may choose two interior cells freely before the rest are pinned down, giving $\nu = 2$; and so on. This "how many cells can I fill in before the totals decide the rest?" picture makes the formula memorable and explains why it is a product of the reduced dimensions rather than a simple count of cells.

Table size	$\nu = (r-1)(c-1)$
$2 \times 2$	1
$2 \times 3$	2
$3 \times 3$	4
$3 \times 4$	6
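The "fill in cells until the totals decide the rest" picture can be made concrete with a short Python sketch. The helper name `complete_table` is hypothetical; it takes the margins plus the top-left $(r-1)\times(c-1)$ block of freely chosen cells and reconstructs everything else by subtraction.

```python
def complete_table(free_cells, row_totals, col_totals):
    """Rebuild an r x c table from its margins and its top-left (r-1) x (c-1) block."""
    r, c = len(row_totals), len(col_totals)
    assert len(free_cells) == r - 1
    assert all(len(row) == c - 1 for row in free_cells)
    # In each of the first r-1 rows, the last entry is forced by the row total.
    table = [list(row) + [row_totals[i] - sum(row)]
             for i, row in enumerate(free_cells)]
    # The whole final row is forced by the column totals.
    table.append([col_totals[j] - sum(row[j] for row in table) for j in range(c)])
    return table


# Margins of a 2 x 3 table (those of Example 1) with two freely chosen cells.
free_cells = [[60, 40]]
table = complete_table(free_cells, row_totals=[150, 150], col_totals=[90, 90, 120])

print(table)                                # [[60, 40, 50], [30, 50, 70]]
print(sum(len(row) for row in free_cells))  # 2, matching (2-1)(3-1)
```

Only two numbers were supplied; the other four cells followed from the totals, which is why $\nu = 2$ for a $2 \times 3$ table.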

As in Lesson 9 the test is upper-tailed, all $E_{ij}$ should be $\ge 5$ (pool adjacent rows/columns otherwise), and a $2\times2$ table ($\nu = 1$) calls for Yates' continuity correction, which replaces each $(O-E)^2$ by $(|O-E| - 0.5)^2$.

3. Worked examples with M1/A1 mark scheme

Example 1 — a $2 \times 3$ table

A survey of 300 students records subject preference by gender:

	Subject A	Subject B	Subject C	Row total
Male	60	40	50	150
Female	30	50	70	150
Col total	90	90	120	300

$H_0$: gender and subject preference are independent; $H_1$: they are associated.

Expected frequencies $E_{ij} = \tfrac{R_iC_j}{300}$ (M1 for the formula, A1 for the table):

	Subject A	Subject B	Subject C
Male	$\tfrac{150\cdot 90}{300}=45$	$\tfrac{150\cdot 90}{300}=45$	$\tfrac{150\cdot 120}{300}=60$
Female	45	45	60

All $E \ge 5$, so no pooling. Test statistic:

Cell	$O$	$E$	$(O-E)^2/E$
M, A	60	45	$225/45 = 5.000$
M, B	40	45	$25/45 = 0.556$
M, C	50	60	$100/60 = 1.667$
F, A	30	45	$225/45 = 5.000$
F, B	50	45	$25/45 = 0.556$
F, C	70	60	$100/60 = 1.667$

$X^2 = 5.000 + 0.556 + 1.667 + 5.000 + 0.556 + 1.667 = 14.444 \ \text{(3 d.p., using unrounded terms)}. \quad (\textbf{M1 A1})$

$\nu = (2-1)(3-1) = 2$ (B1); at the 5% level the critical value is $\chi^2_2 = 5.991$. Since $14.444 > 5.991$, reject $H_0$ (M1 A1): there is significant evidence at the 5% level of an association between gender and subject preference. The largest contributions ($5.000$ each) come from Subject A, where males are over-represented and females under-represented; this is the heart of the association.
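As a check on the arithmetic of Example 1, the statistic can be recomputed in Python. This is a sketch only: `chi_squared_statistic` is a name chosen for illustration, and the critical value is simply the 5.991 quoted above rather than being computed. At full precision the statistic is about 14.444 (adding the cell values after rounding each to three decimal places gives 14.446 instead; the decision is unaffected).

```python
def chi_squared_statistic(observed):
    """Return (X^2, per-cell contributions) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    contributions = [
        [(o - r * c / n) ** 2 / (r * c / n) for o, c in zip(row, col_totals)]
        for row, r in zip(observed, row_totals)
    ]
    return sum(map(sum, contributions)), contributions


# Example 1 counts (rows: Male, Female; columns: Subjects A, B, C).
observed = [[60, 40, 50],
            [30, 50, 70]]
statistic, contributions = chi_squared_statistic(observed)

CRITICAL_5PC_2DF = 5.991  # 5% critical value for nu = 2, as quoted in the text
reject_h0 = statistic > CRITICAL_5PC_2DF

print(round(statistic, 3), reject_h0)  # 14.444 True
print([round(x, 3) for x in contributions[0]])  # [5.0, 0.556, 1.667]
```

Printing the contributions row by row makes it easy to see which cells drive the association, here the Subject A column.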

Example 2 — a $2 \times 2$ table with Yates' correction

A trial records recovery by treatment:

	Recovered	Not recovered	Row total
Drug	38	12	50
Placebo	26	24	50
Col total	64	36	100

$H_0$: recovery is independent of treatment; $H_1$: they are associated. Expected frequencies:

$E_{\text{Drug,Rec}} = \tfrac{50\cdot 64}{100} = 32,\quad E_{\text{Drug,Not}} = \tfrac{50\cdot 36}{100} = 18,\quad E_{\text{Pla,Rec}} = 32,\quad E_{\text{Pla,Not}} = 18.$

Here $\nu = (2-1)(2-1) = 1$, so apply Yates' correction. Every $|O - E| = 6$, so each term is $\tfrac{(|O-E| - 0.5)^2}{E} = \tfrac{(6 - 0.5)^2}{E} = \tfrac{30.25}{E}$:

$X^2 = \frac{30.25}{32} + \frac{30.25}{18} + \frac{30.25}{32} + \frac{30.25}{18} = 0.9453 + 1.6806 + 0.9453 + 1.6806 = 5.252. \quad (\textbf{M1}\ \text{Yates};\ \textbf{A1})$

At the 5% level the critical value is $\chi^2_1 = 3.841$. Since $5.252 > 3.841$, reject $H_0$ (A1): there is significant evidence at the 5% level of an association between treatment and recovery. (The single-formula route gives the same value; see §10.)
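The Yates-corrected value can be cross-checked numerically. The sketch below computes it cell by cell and also through the standard closed form for a $2\times2$ table with cells $a, b, c, d$, namely $N(|ad-bc| - N/2)^2 / (R_1R_2C_1C_2)$. That this closed form is the same "single-formula route" as the one the lesson presents in §10 is an assumption, since §10 is not reproduced here; both function names are illustrative.

```python
def yates_per_cell(observed):
    """X^2 for a 2x2 table as the sum of (|O - E| - 0.5)^2 / E over the four cells."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    total = 0.0
    for row, r in zip(observed, row_totals):
        for o, c in zip(row, col_totals):
            e = r * c / n
            total += (abs(o - e) - 0.5) ** 2 / e
    return total


def yates_closed_form(a, b, c, d):
    """Standard closed form: N(|ad - bc| - N/2)^2 / (R1 * R2 * C1 * C2)."""
    n = a + b + c + d
    return n * (abs(a * d - b * c) - n / 2) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Example 2 counts (rows: Drug, Placebo; columns: Recovered, Not recovered).
per_cell = yates_per_cell([[38, 12], [26, 24]])
closed = yates_closed_form(38, 12, 26, 24)

print(round(per_cell, 3), round(closed, 3))  # 5.252 5.252
```

The two routes agree because in a $2\times2$ table every cell has the same $|O - E|$, equal to $|ad - bc|/N$.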

Example 3 — pooling a sparse row

A $3 \times 2$ table of grade by school has a third row, "School C", with a row total so small that its expected frequencies fall below 5. Suppose $E_{\text{C,Pass}} = 4.2$ and $E_{\text{C,Fail}} = 2.8$. Both are $< 5$, so pool row C with the most similar adjacent row (say School B), adding the observed frequencies cell by cell (which also adds the row totals), and then recompute the expected frequencies before computing $X^2$. The table becomes $2 \times 2$, so $\nu = 1$ and Yates' correction then applies. The lesson: always inspect every $E_{ij}$ before computing the statistic, and pool whole rows or columns (never single interior cells).
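The inspect-then-pool routine can be sketched in Python. Example 3 gives no observed counts, so the table below is HYPOTHETICAL: it was chosen only so that the School C expected frequencies come out as 4.2 and 2.8. The helper names are likewise illustrative.

```python
def expected_frequencies(observed):
    """Return the table of E_ij = R_i * C_j / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pool_rows(observed, keep, merge):
    """Add row `merge` into row `keep` cell by cell (keep < merge); return a new table."""
    assert keep < merge
    pooled = [list(row) for k, row in enumerate(observed) if k != merge]
    pooled[keep] = [x + y for x, y in zip(observed[keep], observed[merge])]
    return pooled


# HYPOTHETICAL observed counts (columns: Pass, Fail); not taken from the lesson.
observed = [[32, 18],   # School A
            [25, 18],   # School B
            [3, 4]]     # School C

expected = expected_frequencies(observed)
sparse_rows = [i for i, row in enumerate(expected) if min(row) < 5]
print(expected[2], sparse_rows)  # [4.2, 2.8] [2]

# Pool the whole of School C into School B, then recompute the expected table.
pooled = pool_rows(observed, keep=1, merge=2)
print(pooled)                        # [[32, 18], [28, 22]]
print(expected_frequencies(pooled))  # [[30.0, 20.0], [30.0, 20.0]]
```

After pooling, every expected frequency is at least 5 and the table is $2 \times 2$, so the test proceeds with $\nu = 1$ and Yates' correction.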

4. Specimen-style exam question

(Specimen-style — not from any real paper.)

A researcher classifies 200 adults by exercise level (Low / High) and self-reported sleep quality (Poor / Good):

	Poor	Good	Row total
Low exercise	50	30	80
High exercise	40	80	120
Col total	90	110	200

Test at the 1% level whether sleep quality is associated with exercise level.

Solution. $H_0$: sleep quality and exercise level are independent; $H_1$: they are associated. Expected frequencies:

$E_{\text{L,P}} = \tfrac{80\cdot 90}{200} = 36,\ E_{\text{L,G}} = \tfrac{80\cdot 110}{200} = 44,\ E_{\text{H,P}} = \tfrac{120\cdot 90}{200} = 54,\ E_{\text{H,G}} = \tfrac{120\cdot 110}{200} = 66.$

$\nu = (2-1)(2-1) = 1$, so use Yates' correction. Each $|O - E| = 14$, giving terms $\tfrac{(14-0.5)^2}{E} = \tfrac{182.25}{E}$:

$X^2 = \frac{182.25}{36} + \frac{182.25}{44} + \frac{182.25}{54} + \frac{182.25}{66} = 5.0625 + 4.1420 + 3.3750 + 2.7614 = 15.341.$

At the 1% level, $\chi^2_1 = 6.635$. Since $15.341 > 6.635$, reject $H_0$: there is significant evidence at the 1% level of an association between exercise level and sleep quality (those with high exercise report good sleep more often than independence would predict). This is observational data, so it does not establish that exercise causes better sleep.

5. Synoptic links

Lesson 9 (goodness of fit): identical statistic $\sum (O-E)^2/E$, identical pooling and upper-tail logic; only $E_{ij}$ and $\nu$ change.
A-Level Maths probability: $E_{ij} = R_iC_j/N$ is the multiplication rule for independent events with marginals estimated from the data.
Lesson 8 (MGFs): the $\chi^2_\nu$ reference distribution is $\text{Gamma}(\tfrac\nu2,\tfrac12)$ (shape $\tfrac\nu2$, rate $\tfrac12$) with MGF $(1-2t)^{-\nu/2}$ for $t < \tfrac12$.
Correlation (S2 / A-Level): a $\chi^2$ test of association is the categorical analogue of testing a correlation coefficient for quantitative variables. Both probe relationship; neither proves causation.
