Evaluating Fitness Tests

This lesson covers how to evaluate the quality and usefulness of fitness tests, as required by the Edexcel GCSE PE specification (1PE0). You must understand the concepts of validity, reliability, practicality and the use of normative data when assessing whether a fitness test is fit for purpose.

Why Evaluate Fitness Tests?

Not all fitness tests are equally useful. Before relying on the results of a test, a coach or performer should consider whether the test actually measures what it claims to, whether the results can be trusted, and whether the test is practical to carry out.

The Four Evaluation Criteria

1. Validity

Definition: The degree to which a test measures what it claims to measure.

A valid test accurately reflects the component of fitness being assessed.

Example	Validity Assessment
Using the Cooper 12-minute run to measure cardiovascular endurance	High validity: running continuously for 12 minutes directly tests the ability of the heart and lungs to supply oxygen to the working muscles
Using a grip dynamometer to measure overall body strength	Low validity: grip strength only measures hand and forearm strength, not whole-body strength
Using the sit and reach test to measure overall flexibility	Moderate validity: it measures hamstring and lower-back flexibility only, not flexibility at other joints

Exam Tip: If a test only measures one aspect of a broad component, its validity for that overall component is reduced. For example, the sit and reach test is valid for hamstring flexibility but not valid for shoulder flexibility.

2. Reliability

Definition: The degree to which a test produces consistent, repeatable results under the same conditions.

A reliable test gives similar results when repeated by the same person under the same conditions.

Factors that affect reliability:

Factor	How It Affects Reliability
Standardised procedures	If the test is carried out the same way every time (same equipment, same instructions), reliability is higher
Environmental conditions	Temperature, wind, surface and time of day should be consistent
Calibrated equipment	Equipment must be checked and standardised (e.g. dynamometer calibrated to zero)
Human error	If a partner operates the stopwatch, reaction time in starting/stopping can vary
Performer's state	Fatigue, motivation, illness, time since last meal — all affect results

Example: If a performer completes the bleep test on Monday and scores Level 9.4, then repeats it on Tuesday under the same conditions and scores Level 9.3, the results are consistent, so the test has high reliability. If the second score instead dropped to Level 6.2, the results would be inconsistent, suggesting that the test procedure or the conditions were not properly controlled.
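The idea of comparing two attempts at the same test can be sketched in a few lines of Python. This is only an illustration: the function names, the 5% tolerance and the distances are assumptions made for the example, not official thresholds or real results. It uses Cooper run distances in metres because bleep test scores are normally read as a level and a shuttle number rather than as plain decimals.

```python
def percent_difference(first: float, second: float) -> float:
    """Absolute gap between two trials as a percentage of their mean."""
    mean = (first + second) / 2
    return abs(first - second) / mean * 100


def is_consistent(first: float, second: float, tolerance_pct: float = 5.0) -> bool:
    """True if the two trials differ by no more than tolerance_pct percent.

    The 5% default is an arbitrary illustration, not an official threshold.
    """
    return percent_difference(first, second) <= tolerance_pct


# Hypothetical Cooper 12-minute run distances (metres) for one performer.
print(is_consistent(2400, 2350))  # small gap between the two trials
print(is_consistent(2400, 1800))  # large gap between the two trials
```

A large gap does not say which factor caused the inconsistency. The tester would still need to check the procedure, the conditions, the equipment and the performer's state.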

3. Practicality

Definition: How easy, affordable and feasible a test is to carry out.

Factor	Questions to Ask
Cost	Is the equipment expensive? Can a school afford it?
Equipment	Is specialist equipment needed? Is it readily available?
Time	How long does the test take? Can it be completed in a single lesson?
Space	Is a large area needed? Is it available?
Expertise	Does the tester need specialist training to administer the test?
Number of participants	Can only one person be tested at a time, or can groups be tested?

Test	Practicality
Ruler drop test	Very high: only needs a ruler, can be done anywhere and takes seconds
Skinfold callipers	Lower: requires a trained tester, specialist callipers and privacy
Bleep test	Moderate: needs a 20 m space and audio equipment, but many performers can be tested at once
VO2 max lab test	Very low: requires expensive lab equipment and trained technicians

4. Normative Data

Definition: A set of average or expected results for a specific population (e.g. by age and gender) against which an individual's test result can be compared.

Normative data allows you to rate a result as excellent, above average, average, below average or poor.

Why Normative Data Is Useful	Explanation
Benchmarking	Tells the performer where they stand compared to others of the same age and gender
Goal setting	Helps set realistic targets (e.g. "move from average to above average")
Monitoring progress	Re-testing and comparing to norms shows improvement over time
Identifying strengths and weaknesses	A performer may score "excellent" for speed but "below average" for flexibility

Limitations of normative data:

Data may be based on a non-representative sample (e.g. only university students, not the general population).
Data may be outdated — fitness norms change over time.
Data does not account for individual differences (genetics, training history, injury status).

Exam Tip: Normative data is only useful if it is relevant to the person being tested. Comparing a 15-year-old's results to data from 25-year-old university students would be misleading.
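Rating a result against normative data amounts to looking up the table for the right population and finding the band the score falls into. The sketch below shows that idea in Python. Every number, age group and name in it is an invented placeholder for illustration, not published normative data, and the test it represents is hypothetical.

```python
# Hypothetical norm tables keyed by (age group, gender).
# Each table lists (minimum score, rating) pairs, highest band first.
# All numbers are invented placeholders, NOT published normative data.
EXAMPLE_NORMS = {
    ("15-16", "male"): [(50, "excellent"), (40, "above average"),
                        (30, "average"), (20, "below average")],
    ("15-16", "female"): [(45, "excellent"), (35, "above average"),
                          (25, "average"), (15, "below average")],
}


def rate(score, age_group, gender, norms=EXAMPLE_NORMS):
    """Return the rating band for a score, using the matching norm table."""
    key = (age_group, gender)
    if key not in norms:
        # No relevant norms: refuse rather than compare against the wrong group.
        raise LookupError(f"no normative data for {key}")
    for minimum, rating in norms[key]:
        if score >= minimum:
            return rating
    return "poor"


print(rate(47, "15-16", "male"))    # rated against the male placeholder table
print(rate(47, "15-16", "female"))  # same score, different table
```

The same score earns a different rating depending on which table is used, which is why norms only help when they match the person being tested.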

Evaluating Specific Tests: Worked Examples

Test	Validity	Reliability	Practicality
Bleep test	High: directly tests CV endurance	High: standardised audio and protocol	Moderate: needs a 20 m space and audio equipment
35 m sprint	High: directly tests speed	Moderate: wind and surface affect results	High: needs only a tape measure and stopwatch
BMI	Low: does not distinguish fat from muscle	High: the calculation is consistent	Very high: only needs scales and a height measure
Skinfold callipers	Moderate-high: measures fat under the skin directly, although total body fat is still an estimate	Moderate: tester skill affects accuracy	Low: requires a trained tester and privacy
Ruler drop test	Moderate: measures reaction time but also involves a motor skill (catching)	Moderate: the partner's release and the performer's attention vary	Very high: needs only a ruler
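The BMI row can be made concrete with the standard formula, body mass in kilograms divided by height in metres squared. The short Python sketch below is illustrative only: the two performers and their measurements are hypothetical.

```python
def bmi(mass_kg: float, height_m: float) -> float:
    """Body mass index: mass in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return mass_kg / height_m ** 2


# Two hypothetical performers with identical mass and height:
# one very muscular, one carrying more body fat.
muscular_performer = bmi(90, 1.80)
higher_fat_performer = bmi(90, 1.80)

print(round(muscular_performer, 1))                 # 27.8
print(muscular_performer == higher_fat_performer)   # True
```

The same inputs always give the same answer, which is why BMI is reliable. The two performers receive identical scores despite different body compositions, which is why its validity as a measure of body composition is low.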

Common Exam Mistakes

Not explaining WHY a test has high/low validity — do not just say "the test is valid." Explain why (e.g. "the Cooper run directly tests cardiovascular endurance because...").
Confusing validity and reliability — validity = does it measure the right thing? Reliability = does it give consistent results?
Ignoring practicality — a test may be valid and reliable but impractical for a school (e.g. VO2 max lab testing).
Not mentioning limitations of normative data — always acknowledge that norms may not be representative or up to date.
Assuming all tests are equally good — some tests are better suited to certain contexts than others.
