A contingency table (or two-way table) cross-classifies a sample by two categorical variables: gender against subject choice, treatment against outcome, region against voting intention. The chi-squared test of association uses the same $\sum (O-E)^2/E$ machinery as the goodness-of-fit test (Lesson 9), but asks a different question: are the two variables independent, or is there a relationship between them? The key new ingredients are the expected-frequency formula $E_{ij} = R_i C_j / N$ (forced by the independence hypothesis) and the degrees-of-freedom rule $\nu = (r-1)(c-1)$. This lesson is the capstone of 7367/3S hypothesis testing.
This is Paper 3 Statistics option (7367/3S) content (per-paper weighting AO1 40% / AO2 25% / AO3 35%) and the second hypothesis-testing lesson, building directly on Lesson 9. Computing expected frequencies and the test statistic is AO1; deriving $E_{ij}$ from the independence assumption and choosing the degrees of freedom is AO2; interpreting which cells drive an association, and writing a careful "association ≠ causation" conclusion, is AO3. The prerequisites are the $\chi^2$ statistic and pooling rule (Lesson 9) and the multiplication rule for independent events from A-Level Mathematics probability.
This lesson completes the hypothesis-testing strand of the statistics option and is among the most applied topics in the whole qualification: contingency-table tests are the everyday workhorse of medicine, social science, market research and quality control, wherever two categorical classifications are cross-tabulated and the question "are these related?" arises. The underlying machinery (the $\sum (O-E)^2/E$ statistic, the $E \geq 5$ pooling rule, the upper-tail comparison) is identical to the goodness-of-fit test of Lesson 9, so the genuinely new learning is concentrated in two places: the derivation of the expected frequencies from the independence hypothesis, and the degrees-of-freedom rule $\nu = (r-1)(c-1)$. Get those two right and the rest is familiar territory. The interpretive demands, however, are higher here, because a real association invites the tempting but unwarranted leap to causation, a trap the examiners test explicitly.
A table with $r$ rows and $c$ columns records observed frequencies $O_{ij}$, with row totals $R_i$, column totals $C_j$, and grand total $N$:
| | Col 1 | Col 2 | ⋯ | Col $c$ | Row total |
|---|---|---|---|---|---|
| Row 1 | $O_{11}$ | $O_{12}$ | ⋯ | $O_{1c}$ | $R_1$ |
| Row 2 | $O_{21}$ | $O_{22}$ | ⋯ | $O_{2c}$ | $R_2$ |
| ⋮ | ⋮ | ⋮ | | ⋮ | ⋮ |
| Row $r$ | $O_{r1}$ | $O_{r2}$ | ⋯ | $O_{rc}$ | $R_r$ |
| Col total | $C_1$ | $C_2$ | ⋯ | $C_c$ | $N$ |
$H_0$: the two variables are independent (no association); $H_1$: the two variables are associated.
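Under independence, $P(\text{row } i \cap \text{col } j) = P(\text{row } i)\,P(\text{col } j)$. Estimating the marginals from the data, $P(\text{row } i) = R_i/N$ and $P(\text{col } j) = C_j/N$, so the expected count is

$$E_{ij} = N \cdot \frac{R_i}{N} \cdot \frac{C_j}{N} = \frac{R_i C_j}{N} = \frac{\text{row total} \times \text{column total}}{\text{grand total}}.$$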
A built-in check: the expected row and column totals automatically equal the observed ones, and $\sum_{i,j} E_{ij} = N$. This self-consistency is not a coincidence but a direct consequence of the formula, and it gives you a free and powerful arithmetic check: after filling in the expected-frequency table, add along each row and down each column; if any expected margin fails to match the corresponding observed margin, you have made a slip. Because every $E_{ij}$ shares the same denominator $N$, the calculation is also quick: compute each row total times each column total, then divide. A common time-saver is to notice that whenever two rows (or two columns) have equal totals, their expected frequencies are identical, so you need only compute one of them; in Example 1 below, the two gender rows both total 150, so the female expected frequencies simply copy the male ones.
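The formula and its margin check can be sketched in a few lines of Python. This is an illustrative sketch rather than part of the specification: the function name `expected_frequencies` is hypothetical, and the counts are those of the gender-by-subject survey used in this lesson.

```python
def expected_frequencies(observed):
    """Return E[i][j] = R_i * C_j / N for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Observed counts from the gender-by-subject survey in this lesson.
observed = [[60, 40, 50],
            [30, 50, 70]]
expected = expected_frequencies(observed)

# Margin check: expected row and column sums should reproduce the observed ones.
obs_rows = [sum(r) for r in observed]
exp_rows = [sum(r) for r in expected]
obs_cols = [sum(c) for c in zip(*observed)]
exp_cols = [sum(c) for c in zip(*expected)]
margins_match = (
    all(abs(a - b) < 1e-9 for a, b in zip(obs_rows, exp_rows))
    and all(abs(a - b) < 1e-9 for a, b in zip(obs_cols, exp_cols))
)

print(expected)
print(margins_match)
```

Because both rows total 150, the two rows of `expected` come out identical, which is the time-saver described above.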
$$X^2 = \sum_{\text{all cells}} \frac{(O_{ij}-E_{ij})^2}{E_{ij}} \sim \chi^2_\nu, \qquad \nu = (r-1)(c-1).$$
The degrees of freedom arise because, once the marginal totals are fixed, only (r−1)(c−1) of the interior cells are free to vary — the rest are determined by the totals. To see this concretely in a 2×2 table: if you know the four marginal totals and one interior cell, every other cell is forced (each row and column must sum to its margin), so just one cell is free — hence ν=(2−1)(2−1)=1. In a 2×3 table you may choose two interior cells freely before the rest are pinned down, giving ν=2; and so on. This "how many cells can I fill in before the totals decide the rest?" picture makes the formula memorable and explains why it is a product of the reduced dimensions rather than a simple count of cells.
| Table size | ν=(r−1)(c−1) |
|---|---|
| 2×2 | 1 |
| 2×3 | 2 |
| 3×3 | 4 |
| 3×4 | 6 |
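The "how many cells can I fill in before the totals decide the rest?" picture can be made concrete with a short sketch. The helper names `degrees_of_freedom` and `complete_2x2` are hypothetical, chosen for illustration; the margins used are those of the drug-trial table in this lesson.

```python
def degrees_of_freedom(n_rows, n_cols):
    """nu = (r - 1)(c - 1) for an r-by-c contingency table."""
    return (n_rows - 1) * (n_cols - 1)


def complete_2x2(row_totals, col_totals, top_left):
    """Fill a 2x2 table from its margins and one free cell.

    Every other cell is forced because each row and column
    must sum to its margin.
    """
    r1, r2 = row_totals
    c1, _ = col_totals
    top_right = r1 - top_left
    bottom_left = c1 - top_left
    bottom_right = r2 - bottom_left
    return [[top_left, top_right], [bottom_left, bottom_right]]


# Degrees of freedom for the table sizes listed above.
sizes = [(2, 2), (2, 3), (3, 3), (3, 4)]
dof = [degrees_of_freedom(r, c) for r, c in sizes]

# Margins of the drug-trial table: one known cell pins down the other three.
table = complete_2x2(row_totals=(50, 50), col_totals=(64, 36), top_left=38)

print(dof)
print(table)
```

Knowing a single interior cell was enough to rebuild the whole 2×2 table, which is why it has one degree of freedom.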
As in Lesson 9 the test is upper-tailed, all $E_{ij}$ should be $\geq 5$ (pool adjacent rows/columns otherwise), and a 2×2 table ($\nu = 1$) calls for Yates' continuity correction.
A survey of 300 students records subject preference by gender:
| | Subject A | Subject B | Subject C | Row total |
|---|---|---|---|---|
| Male | 60 | 40 | 50 | 150 |
| Female | 30 | 50 | 70 | 150 |
| Col total | 90 | 90 | 120 | 300 |
$H_0$: gender and subject preference are independent; $H_1$: they are associated.
Expected frequencies $E_{ij} = R_i C_j / 300$ (M1 for the formula, A1 for the table):
| | Subject A | Subject B | Subject C |
|---|---|---|---|
| Male | $\frac{150 \cdot 90}{300} = 45$ | $\frac{150 \cdot 90}{300} = 45$ | $\frac{150 \cdot 120}{300} = 60$ |
| Female | 45 | 45 | 60 |
All E≥5, so no pooling. Test statistic:
| Cell | $O$ | $E$ | $(O-E)^2/E$ |
|---|---|---|---|
| M, A | 60 | 45 | 225/45=5.000 |
| M, B | 40 | 45 | 25/45=0.556 |
| M, C | 50 | 60 | 100/60=1.667 |
| F, A | 30 | 45 | 225/45=5.000 |
| F, B | 50 | 45 | 25/45=0.556 |
| F, C | 70 | 60 | 100/60=1.667 |
$$X^2 = 5.000 + 0.556 + 1.667 + 5.000 + 0.556 + 1.667 = 14.446. \quad \text{(M1 A1)}$$

(Without intermediate rounding the sum is 14.444; the conclusion is unaffected.)

$\nu = (2-1)(3-1) = 2$ (B1); at 5%, the critical value is $\chi^2_2 = 5.991$. Since $14.446 > 5.991$, reject $H_0$ (M1 A1): there is significant evidence at the 5% level of an association between gender and subject preference. The largest contributions (5.000 each) come from Subject A, where males are over-represented and females under-represented. That cell pair is the heart of the association.
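As a cross-check on the hand calculation, the whole test can be sketched in Python using only the standard library. The function name `chi_squared_statistic` is hypothetical. The sketch relies on a fact specific to $\nu = 2$: the chi-squared upper tail is $P(X > x) = e^{-x/2}$, so the 5% critical value is $-2\ln 0.05$. For other degrees of freedom you would use tables or a statistics library. The code sums unrounded terms, so it gives about 14.444 rather than the 14.446 obtained by adding terms rounded to three decimal places.

```python
import math


def chi_squared_statistic(observed):
    """Sum of (O - E)^2 / E over all cells, with E = R_i * C_j / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            total += (o - e) ** 2 / e
    return total


# Gender-by-subject survey from the worked example above.
observed = [[60, 40, 50],
            [30, 50, 70]]

x2 = chi_squared_statistic(observed)
nu = (len(observed) - 1) * (len(observed[0]) - 1)

# Valid for nu = 2 only: upper tail is exp(-x/2).
critical_5pct = -2 * math.log(0.05)
p_value = math.exp(-x2 / 2)
reject_h0 = x2 > critical_5pct

print(round(x2, 3), nu, round(critical_5pct, 3), reject_h0)
```

The p-value is well below 0.05, in line with the decision to reject $H_0$.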
A trial records recovery by treatment:
| | Recovered | Not recovered | Row total |
|---|---|---|---|
| Drug | 38 | 12 | 50 |
| Placebo | 26 | 24 | 50 |
| Col total | 64 | 36 | 100 |
$H_0$: recovery is independent of treatment; $H_1$: they are associated. Expected frequencies:

$$E_{\text{Drug,Rec}} = \frac{50 \cdot 64}{100} = 32, \quad E_{\text{Drug,Not}} = \frac{50 \cdot 36}{100} = 18, \quad E_{\text{Pla,Rec}} = 32, \quad E_{\text{Pla,Not}} = 18.$$

Here $\nu = (2-1)(2-1) = 1$, so apply Yates' correction. Every $|O-E| = 6$, so each term is $\frac{(6-0.5)^2}{E} = \frac{30.25}{E}$:

$$X^2 = \frac{30.25}{32} + \frac{30.25}{18} + \frac{30.25}{32} + \frac{30.25}{18} = 0.9453 + 1.6806 + 0.9453 + 1.6806 = 5.252. \quad \text{(M1 Yates; A1)}$$

At 5%, the critical value is $\chi^2_1 = 3.841$. Since $5.252 > 3.841$, reject $H_0$ (A1): there is significant evidence of an association between treatment and recovery. (The single-formula route gives the same value; see §10.)
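The Yates-corrected calculation for the drug trial can be sketched the same way. The function name `yates_chi_squared` is hypothetical and the sketch assumes a 2×2 table. For $\nu = 1$ the upper tail can be written with the complementary error function, $P(X > x) = \operatorname{erfc}\!\left(\sqrt{x/2}\right)$, which the standard library provides as `math.erfc`.

```python
import math


def yates_chi_squared(observed):
    """Yates-corrected statistic: sum of (|O - E| - 0.5)^2 / E over a 2x2 table."""
    if len(observed) != 2 or any(len(row) != 2 for row in observed):
        raise ValueError("Yates' correction is applied here to 2x2 tables only")
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            total += (abs(o - e) - 0.5) ** 2 / e
    return total


# Drug-trial table from the worked example above.
trial = [[38, 12],
         [26, 24]]

x2_yates = yates_chi_squared(trial)

# Valid for nu = 1 only: upper tail is erfc(sqrt(x / 2)).
p_value_yates = math.erfc(math.sqrt(x2_yates / 2))
reject_h0_5pct = x2_yates > 3.841  # 5% critical value quoted in the lesson

print(round(x2_yates, 3), reject_h0_5pct)
```

The statistic agrees with the hand value of 5.252, and the p-value falls between 0.02 and 0.025: significant at 5%, but it would not be at 1%.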
A 3×2 table of grade by school has a third row "School C" with totals so small that one expected cell falls below 5. Suppose $E_{\text{C,Pass}} = 4.2$ and $E_{\text{C,Fail}} = 2.8$. Both are $< 5$, so pool row C with the most similar adjacent row (say School B), summing the observed and the row totals, before computing $X^2$. The table becomes 2×2, so $\nu = 1$ and Yates' correction then applies. The lesson: always inspect every $E_{ij}$ before computing the statistic, and pool whole rows or columns (never single interior cells).
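A minimal sketch of the inspect-then-pool step follows. The observed counts are hypothetical: the lesson gives only the two expected values for School C, so the table below was invented to reproduce $E_{\text{C,Pass}} = 4.2$ and $E_{\text{C,Fail}} = 2.8$. The helper names `expected_frequencies` and `pool_rows` are also illustrative.

```python
def expected_frequencies(observed):
    """Return E[i][j] = R_i * C_j / N for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pool_rows(observed, keep, merge):
    """Add the whole row `merge` into row `keep` and drop `merge`."""
    merged = [a + b for a, b in zip(observed[keep], observed[merge])]
    return [merged if k == keep else list(row)
            for k, row in enumerate(observed) if k != merge]


# HYPOTHETICAL counts (Pass, Fail) for Schools A, B, C; not from the lesson.
grades = [[32, 18],
          [25, 18],
          [3, 4]]

expected_before = expected_frequencies(grades)
needs_pooling = any(e < 5 for row in expected_before for e in row)

# Pool School C (index 2) into the adjacent School B (index 1).
pooled = pool_rows(grades, keep=1, merge=2)
expected_after = expected_frequencies(pooled)
all_ok = all(e >= 5 for row in expected_after for e in row)

print(expected_before[2], needs_pooling)
print(pooled, all_ok)
```

After pooling, the table is 2×2, so the Yates-corrected statistic with $\nu = 1$ would be the next step.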
(Specimen-style — not from any real paper.)
A researcher classifies 200 adults by exercise level (Low / High) and self-reported sleep quality (Poor / Good):
| | Poor | Good | Row total |
|---|---|---|---|
| Low exercise | 50 | 30 | 80 |
| High exercise | 40 | 80 | 120 |
| Col total | 90 | 110 | 200 |
Test at the 1% level whether sleep quality is associated with exercise level.
Solution. $H_0$: sleep quality and exercise are independent; $H_1$: they are associated. Expected:

$$E_{\text{L,P}} = \frac{80 \cdot 90}{200} = 36, \quad E_{\text{L,G}} = \frac{80 \cdot 110}{200} = 44, \quad E_{\text{H,P}} = \frac{120 \cdot 90}{200} = 54, \quad E_{\text{H,G}} = \frac{120 \cdot 110}{200} = 66.$$

$\nu = 1$, so use Yates. Each $|O-E| = 14$, giving terms $\frac{(14-0.5)^2}{E} = \frac{182.25}{E}$:

$$X^2 = \frac{182.25}{36} + \frac{182.25}{44} + \frac{182.25}{54} + \frac{182.25}{66} = 5.0625 + 4.1420 + 3.3750 + 2.7614 = 15.341.$$

At the 1% level, the critical value is $\chi^2_1 = 6.635$. Since $15.341 > 6.635$, reject $H_0$: there is significant evidence at the 1% level of an association between exercise level and sleep quality (those with high exercise report good sleep more often than independence would predict). This is observational data, so it does not establish that exercise causes better sleep.