SQL: Aggregate Functions and GROUP BY

Most of the SELECT queries you have written so far return rows as they are stored — one output row per matching table row. But a huge proportion of real questions are about summaries: "how many students are in each form group?", "what is the average grade per subject?", "which courses have more than 30 enrolments?". These are answered by aggregate functions, which collapse many rows into a single value, and by GROUP BY, which applies those functions to each group of rows separately rather than to the whole table. This lesson builds the complete summarising toolkit and — critically — establishes the difference between WHERE (which filters individual rows before aggregation) and HAVING (which filters whole groups after aggregation), one of the most reliably examined distinctions in the SQL part of the specification. Throughout we reuse the Student / Course / Enrolment schema and show every query with its result table.

Spec Mapping

This lesson covers the AQA A-Level Computer Science (7517) Fundamentals of databases content on aggregating and summarising data with SQL: the aggregate functions COUNT, SUM, AVG, MIN and MAX; grouping rows with GROUP BY; and filtering groups with HAVING as distinct from filtering rows with WHERE. It builds on the SELECT/WHERE and join lessons (aggregates are routinely applied across joined tables) and supports the exam-practice synthesis lesson, where summarising queries appear frequently.

A sample dataset

So that every result is concrete, assume this small Enrolment table joined to Student and Course. The raw Enrolment rows are:

StudentID	CourseID	Subject	Score
1001	CS101	Computing	88
1002	CS101	Computing	74
1001	MA210	Maths	92
1003	MA210	Maths	84
1002	EN105	English	71
1003	EN105	English	73

We will run summarising queries against this data and show the exact output each one produces.

Notice the shift in granularity that aggregation brings. Every SELECT you have written so far returns data at the granularity of the row — one output line per matching enrolment. Aggregation deliberately changes that granularity: a whole-table aggregate returns one line for the entire table, and a GROUP BY returns one line per group. Keeping clear in your mind "what does one output row represent here?" is the single most useful habit when reading or writing these queries — for a plain SELECT it is one enrolment; for GROUP BY Subject it is one subject; for GROUP BY FormGroup, Subject it is one form-group-and-subject combination. Almost every mistake in this topic comes from losing track of that question.

Aggregate Functions

An aggregate function takes a column of many values and returns a single value. The five in the specification are:

Function	Returns	Typical use
`COUNT()`	The number of rows	How many enrolments?
`SUM()`	The total of numeric values	Total revenue
`AVG()`	The arithmetic mean	Average score
`MIN()`	The smallest value	Lowest score / earliest date
`MAX()`	The largest value	Highest score / latest date

Each function in detail

It is worth being precise about what each of the five functions does and where each is appropriate, because exam questions often hinge on choosing the right one:

COUNT answers "how many?". It is the only aggregate that works equally on text, dates and numbers, because it counts rows (or non-NULL values), not magnitudes. COUNT(*) counts every row; COUNT(column) counts non-NULL values; COUNT(DISTINCT column) counts distinct non-NULL values. Use it for "number of students", "number of orders", "number of different cities".
SUM answers "what is the total?" and applies only to numeric columns. A classic use is total revenue: SUM(Quantity * UnitPrice), computing the product per row first, then totalling. Summing a text column is meaningless and is rejected.
AVG answers "what is the mean?" — the sum divided by the count of non-NULL values. Because NULLs are excluded from both the sum and the divisor, AVG over a column with missing values is not the same as treating those values as zero.
MIN and MAX answer "what is the smallest / largest?". They work on numbers (lowest/highest score), on dates (MIN(DateOfBirth) is the oldest student, MAX(LoanDate) is the most recent loan), and even on text (alphabetical first/last). This date behaviour is a frequent exam point: the earliest date is the minimum, which can feel counter-intuitive.

A small worked contrast — the difference between SUM and COUNT on the same column:

SELECT COUNT(Score) AS HowMany, SUM(Score) AS TotalMarks, AVG(Score) AS Mean
FROM Enrolment;

Result:

HowMany	TotalMarks	Mean
6	482	80.33

COUNT(Score) reports there are 6 scores; SUM(Score) adds them to 482; AVG(Score) is 482 ÷ 6 = 80.33. Picking COUNT when the question wants a total — or vice versa — is a common slip that these distinct results guard against.

Aggregates over the whole table

Without GROUP BY, an aggregate condenses the entire table to one row:

SELECT COUNT(*)  AS TotalEnrolments,
       AVG(Score) AS MeanScore,
       MAX(Score) AS TopScore,
       MIN(Score) AS LowestScore
FROM Enrolment;

Result:

TotalEnrolments	MeanScore	TopScore	LowestScore
6	80.33	92	71

One input table of six rows becomes one summary row. Note you can compute several aggregates in a single SELECT.

COUNT(*) vs COUNT(column)

This distinction matters because of NULLs:

COUNT(*) counts every row, including rows where columns are NULL.
COUNT(column) counts only rows where that column is not NULL.
COUNT(DISTINCT column) counts the distinct non-NULL values.

Suppose Student has 6 rows but only 4 have an Email:

SELECT COUNT(*)            AS AllStudents,
       COUNT(Email)        AS WithEmail,
       COUNT(DISTINCT FormGroup) AS FormGroups
FROM Student;

Result:

AllStudents	WithEmail	FormGroups
6	4	2

The two NULL emails are excluded from COUNT(Email). This is a frequent exam trap: COUNT(*) and COUNT(Email) give different answers precisely when NULLs are present. (SUM, AVG, MIN and MAX also ignore NULLs — an important subtlety for AVG, since a NULL is skipped, not treated as zero.)

GROUP BY — Summarising Per Group

GROUP BY partitions the rows into groups that share a value, then applies the aggregate to each group separately, producing one output row per group.

SELECT Subject, AVG(Score) AS AvgScore
FROM Enrolment
GROUP BY Subject;

Result:

Subject	AvgScore
Computing	81.0
Maths	88.0
English	72.0

The six raw rows are grouped into three subjects, and AVG(Score) is computed within each: Computing = (88+74)/2 = 81.0, Maths = (92+84)/2 = 88.0, English = (71+73)/2 = 72.0.

Counting per group

SELECT FormGroup, COUNT(*) AS NumberOfStudents
FROM Student
GROUP BY FormGroup;

Result (for our sample):

FormGroup	NumberOfStudents
10A	3
10B	3

The golden rule of GROUP BY

Every column in the SELECT list must either be inside an aggregate function or be named in the GROUP BY clause.

The reason is logical: each output row represents a whole group, so a bare column that is not grouped has no single value to show. This query is invalid:

-- INVALID: Score is neither aggregated nor grouped
SELECT Subject, Score, AVG(Score)
FROM Enrolment
GROUP BY Subject;

There are two scores in the Computing group (88 and 74) — which would Score display? The DBMS rejects it. Either aggregate Score (e.g. MAX(Score)) or add it to GROUP BY.

Grouping by more than one column

You may group by a combination, producing one row per distinct combination:

SELECT s.FormGroup, e.Subject, AVG(e.Score) AS AvgScore
FROM Student s
JOIN Enrolment e ON s.StudentID = e.StudentID
GROUP BY s.FormGroup, e.Subject
ORDER BY s.FormGroup, e.Subject;

This reports the average score for each (form group, subject) pair — a two-dimensional summary, and a natural place where grouping meets the join skills from the previous lesson.

HAVING — Filtering Groups

A summary query often needs to show only some of the groups — "subjects whose average is above 75", "courses with more than 30 enrolments". You cannot do this with WHERE, because WHERE is evaluated before the rows are grouped and so cannot see a group's aggregate. That is the job of HAVING:

SELECT Subject, AVG(Score) AS AvgScore
FROM Enrolment
GROUP BY Subject
HAVING AVG(Score) > 75;

Result:

Subject	AvgScore
Computing	81.0
Maths	88.0

English (average 72.0) is filtered out after grouping because its group average fails the HAVING test. Computing and Maths remain.

WHERE and HAVING are not interchangeable

This is the single most examined point in the lesson. They filter at different stages:

Clause	Filters	When	Can it use an aggregate?
`WHERE`	Individual rows	Before grouping	No
`HAVING`	Whole groups	After grouping	Yes

A query may use both, and they do different jobs. The following asks: for enrolments scored 70 or above (row filter), what is the average per subject, showing only subjects whose qualifying average exceeds 80 (group filter)?

SELECT Subject, AVG(Score) AS AvgScore
FROM Enrolment
WHERE Score >= 70
GROUP BY Subject
HAVING AVG(Score) > 80;

WHERE Score >= 70 discards individual low rows first; the survivors are grouped; HAVING then drops whole subjects whose average is not above 80. Crucially, the two clauses can change the answer independently — WHERE changes which rows enter a group; HAVING changes which groups survive.

Why you cannot put an aggregate in WHERE

-- INVALID
SELECT Subject, AVG(Score)
FROM Enrolment
WHERE AVG(Score) > 75
GROUP BY Subject;

At the moment WHERE runs, the rows have not yet been grouped, so AVG(Score) does not exist — there is no group to average over. Move the condition to HAVING, which runs after grouping.

A simple decision rule resolves almost every "WHERE or HAVING?" choice: ask whether the condition can be judged by looking at a single row on its own, or whether it needs the whole group. Score >= 70 and Subject = 'Maths' can be decided from one row, so they belong in WHERE. AVG(Score) > 80 and COUNT(*) > 5 can only be decided once the group exists, so they belong in HAVING. When a query needs both kinds of condition — keep certain rows, then keep certain groups — use both clauses together, exactly as in the combined example above. Internalising "single row → WHERE; whole group → HAVING" makes the distinction automatic under exam pressure rather than something to puzzle out each time.

The Logical Order of Clauses

A summarising query is written in one order but processed in another. Knowing the processing order explains every rule above:

SELECT   columns / aggregates   -- 5: choose what to output
FROM     table(s)               -- 1: gather and join rows
WHERE    row_conditions         -- 2: filter individual rows
GROUP BY columns                -- 3: partition survivors into groups
HAVING   group_conditions       -- 4: filter whole groups
ORDER BY columns;               -- 6: sort the final rows

Processing order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY. Because WHERE (step 2) precedes GROUP BY (step 3), it cannot reference an aggregate; because HAVING (step 4) follows it, it can. Because SELECT aliases are created at step 5, some DBMSs forbid using a SELECT alias inside HAVING — repeat the aggregate to be safe.

Worked Example — Course Enrolment Report

Scenario: the school wants management reports from the Student, Course and Enrolment tables.

Task 1 — How many students are enrolled on each course (including courses with none)? A LEFT JOIN keeps courses that have no enrolments:

SELECT c.CourseName, COUNT(e.StudentID) AS EnrolledStudents
FROM Course c
LEFT JOIN Enrolment e ON c.CourseID = e.CourseID
GROUP BY c.CourseName
ORDER BY EnrolledStudents DESC;

Result (illustrative):

CourseName	EnrolledStudents
Intro to Computing	2
Pure Mathematics	2
English Literature	2
Further Maths	0

COUNT(e.StudentID) correctly returns 0 for Further Maths because the unmatched LEFT JOIN row has a NULL StudentID, which COUNT(column) does not count — exactly the NULL behaviour discussed above. COUNT(*) here would have wrongly returned 1.

Task 2 — Which subjects have an average score above 80?

SELECT Subject, ROUND(AVG(Score), 1) AS AvgScore, COUNT(*) AS Entries
FROM Enrolment
GROUP BY Subject
HAVING AVG(Score) > 80;

Result:

Subject	AvgScore	Entries
Computing	81.0	2
Maths	88.0	2

Task 3 — Which courses have more than 30 enrolments (popular courses)?

SELECT c.CourseName, COUNT(e.StudentID) AS Total
FROM Course c
JOIN Enrolment e ON c.CourseID = e.CourseID
GROUP BY c.CourseName
HAVING COUNT(e.StudentID) > 30
ORDER BY Total DESC;

This combines a join, GROUP BY, and a HAVING on an aggregate — the multi-skill pattern examiners favour for higher-mark questions.

Tracing a Grouped Query Mentally

Examiners reward candidates who can predict the output of a grouped query, so it pays to be able to "execute" one by hand. Take this query against our six-row Enrolment sample, asking for the count and average per subject but only for subjects with more than one entry:

SELECT Subject, COUNT(*) AS Entries, AVG(Score) AS AvgScore
FROM Enrolment
WHERE Score >= 70
GROUP BY Subject
HAVING COUNT(*) > 1
ORDER BY AvgScore DESC;

Walk the processing order:

SQL: Aggregate Functions and GROUP BY

SQL: Aggregate Functions and GROUP BY

Spec Mapping

A sample dataset

Aggregate Functions

Each function in detail

Aggregates over the whole table

COUNT(*) vs COUNT(column)

GROUP BY — Summarising Per Group

Counting per group

The golden rule of GROUP BY

Grouping by more than one column

HAVING — Filtering Groups

WHERE and HAVING are not interchangeable

Why you cannot put an aggregate in WHERE

The Logical Order of Clauses

Worked Example — Course Enrolment Report

Tracing a Grouped Query Mentally

More in Computer Science