Molecular Evidence for Evolution and Phylogeny

By the end of this lesson you should be able to explain and apply each part of this topic (DNA sequence comparison; protein sequence comparison using cytochrome c; molecular clocks; rRNA phylogenies and the three domains) and use these ideas accurately in exam-style questions.

Molecular evidence has revolutionised our understanding of evolutionary relationships over the last sixty years. Before the 1960s, classification depended on morphology, anatomy, and the fossil record; phylogenetic relationships were inferred from shared traits whose homology had to be argued case by case. The advent of protein sequencing in the 1950s (Sanger's insulin sequence, completed in the mid-1950s), DNA sequencing in the 1970s (Sanger and Maxam-Gilbert methods), and high-throughput sequencing from the mid-2000s onward has transformed phylogenetics into a quantitative, sequence-based discipline. Today, the entire tree of life is being re-resolved with molecular data, often overturning classical morphology-based groupings.

Spec mapping: This lesson sits in AQA 7402 Section 3.7.3 (species and taxonomy — DNA sequencing, immunological evidence, and protein-sequence comparison as evidence for evolutionary relationships). Refer to the official AQA specification document for exact wording. It extends the classification framework established in lesson 5 with the molecular data that underpins modern phylogenetics.

Connects to: Classification and taxonomy (Section 3.7.3, lesson 5 — the three-domain reform rests on molecular evidence); DNA structure and replication (Section 3.4.1, course 4 DNA, Genes and Inheritance); protein synthesis (Section 3.4.2, course 4); speciation and divergence (Section 3.7.3, lesson 4); cell recognition and immunology (Section 3.2.4, course 2 Cells, Microscopy and Cell Cycle).

Why Molecules Are Better Phylogenetic Evidence Than Morphology

Morphological characters have served taxonomy well for two centuries, but they suffer from systematic problems that molecular characters largely avoid:

Convergent evolution is common in morphology. Wings have evolved independently in birds, bats, pterosaurs and insects; eyes have evolved independently dozens of times. Convergent traits can fool morphology-based classification into grouping unrelated organisms.
Morphological characters are limited in number. A vertebrate has perhaps a few hundred to a few thousand independent morphological characters available for analysis; a single genome contains 10⁹ to 10¹⁰ base pairs, providing orders of magnitude more data points.
Subjective weighting. Morphologists must decide which features matter most — a judgement call. Molecular data are quantitatively comparable in a way morphological data are not.
Cross-domain comparability. Morphology of a bacterium cannot be compared with morphology of a tree; both can be compared at the level of homologous molecular sequences (ribosomal RNA, ATP synthase, RNA polymerase).
Continuous evolutionary signal. DNA sequence diverges roughly continuously with time, providing a quantitative "molecular clock"; morphology often evolves in spurts and stasis, with limited time-resolution.

Molecular evidence does not entirely replace morphology — fossils require morphological interpretation, and morphological evidence remains important for placing extinct lineages — but it is now the dominant evidence for phylogenetic relationships among living taxa.

DNA Sequence Comparison

The foundational molecular technique is comparative DNA sequencing. The number of differences between homologous DNA sequences (DNA derived from a common ancestor) in two species is approximately proportional to the time since their last common ancestor.

Three main types of DNA used in phylogenetics

Nuclear DNA. Most of the genome. Inherited biparentally and recombines each generation; because different regions evolve at very different rates, it is informative across a wide range of divergence depths.
Mitochondrial DNA (mtDNA). Inherited maternally in most animals, with little or no recombination; evolves roughly 5 to 10 times faster than nuclear DNA in many animals. Useful for resolving recent divergences within and between species. The mitochondrial cytochrome oxidase I (COI) gene is the standard "DNA barcode" for animals (cross-link to lesson 5).
Ribosomal RNA genes (rDNA). Highly conserved across all life — the small-subunit rRNA gene (16S in prokaryotes, 18S in eukaryotes) was the gene Woese used to establish the three-domain system. Useful for the deepest evolutionary relationships, including between domains.

Key Principle: Closely related species (recent common ancestor) have fewer DNA differences; distantly related species have more. Comparing DNA from many species allows reconstruction of the branching pattern that best explains the observed differences.

Building a phylogeny from DNA sequence

The standard workflow:

Extract DNA from each species of interest.
Amplify and sequence a homologous region (e.g. a specific gene, or a barcode region like COI).
Align the sequences using bioinformatic tools (BLAST for pairwise alignment, multiple-sequence alignment for many species).
Count differences — the simplest analysis tabulates pairwise sequence differences.
Build a tree using statistical algorithms (maximum parsimony, neighbour-joining, maximum likelihood, Bayesian inference). Each algorithm produces a tree topology that best explains the observed sequence data.
Assess support for the tree using bootstrapping or posterior probabilities — measuring how stable each branching is to perturbations in the data.
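The "count differences" step of the workflow above can be sketched in a few lines of Python. This is a minimal illustration, assuming the sequences are already aligned to equal length with no gaps; the species names and sequences are hypothetical, not real data.

```python
from itertools import combinations

# Hypothetical aligned sequences (equal length, no gaps); not real data.
alignment = {
    "species_A": "ATGCCGTAAC",
    "species_B": "ATGCCGTTAC",
    "species_C": "ATGTCGTTGC",
    "species_D": "CTGTCATTGC",
}

def count_differences(seq1, seq2):
    """Number of aligned positions at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2))

def pairwise_table(seqs):
    """Map each unordered pair of names to (differences, proportion differing)."""
    table = {}
    for name1, name2 in combinations(sorted(seqs), 2):
        d = count_differences(seqs[name1], seqs[name2])
        table[(name1, name2)] = (d, d / len(seqs[name1]))
    return table

table = pairwise_table(alignment)
for pair, (d, p) in table.items():
    print(pair, d, p)
```

In this toy data set A and B differ at one position and A and D at five, so a tree-building method would be expected to place A and B close together and D further away.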

A typical undergraduate phylogenetics project might involve 1–2 kb of sequence from 20–50 species; modern phylogenomic studies use whole-genome alignments across hundreds of species.

Protein Sequence Comparison — Cytochrome c

Before genomic sequencing was routine, protein sequencing provided the first molecular phylogenies. Cytochrome c is a small (104 amino acids in vertebrates) electron-transport protein found in the mitochondria of all aerobic eukaryotes. It is essential for cellular respiration (cross-link to course 7 Energy Transfers in Organisms) and therefore conserved across the eukaryotic tree.

The differences in cytochrome c amino-acid sequence between species are striking:

Species pair	Approximate amino-acid differences (of ~104 positions)
Human vs chimpanzee	0 (identical)
Human vs rhesus monkey	1
Human vs horse	12
Human vs dog	11
Human vs chicken	13
Human vs rattlesnake	14
Human vs tuna	21
Human vs Drosophila	24
Human vs wheat germ	43
Human vs yeast	44
Human vs Neurospora mould	45

These data, accumulated in the 1960s and 1970s through the pioneering work of Margaret Dayhoff and others, reveal a striking pattern: vertebrate-vertebrate differences are small; vertebrate-plant differences are large; the magnitudes correspond well to the branching depths inferred from morphology and the fossil record. Cytochrome c was one of the first proteins for which a clear molecular phylogeny was constructed (Fitch and Margoliash, 1967), and the resulting tree closely matched the morphological tree.

The Linus Pauling / Emile Zuckerkandl molecular-clock framework, developed in the 1960s, formalised the observation that protein sequences appear to accumulate substitutions at roughly constant rates over geological time. Paraphrasing their proposal (rather than quoting verbatim): proteins under similar functional constraints across lineages should accumulate amino-acid substitutions at approximately constant rates, allowing sequence divergence to be used as a clock to date evolutionary splits.

Molecular Clocks

Key Definition: A molecular clock is the observation that, for a given gene or protein, the rate of molecular substitution per unit time is approximately constant across lineages. By counting substitutions and knowing the rate, one can estimate the time since two lineages diverged from their last common ancestor.

How molecular clocks work

Choose a gene or class of sites with a relatively constant evolutionary rate (rRNA, cytochrome c, or synonymous sites at a coding locus).
Estimate the substitution rate per site per million years, calibrated against fossil-dated divergences.
Count substitutions between two species at the chosen gene.
Divide by the rate to estimate the time since divergence.
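The four steps above reduce to one division, sketched here in Python. The rate constant is an illustrative assumption (a pairwise divergence rate per site per million years), not a measured value from this lesson; the counts of 13 differences in 104 sites are the human and chicken cytochrome c figures from the table earlier.

```python
def divergence_time_myr(differences, sites, pairwise_rate_per_site_per_myr):
    """Estimate time since divergence (million years) from a difference count.

    differences / sites gives substitutions per site between the two species;
    dividing by the rate at which that distance accumulates gives the time.
    """
    distance_per_site = differences / sites
    return distance_per_site / pairwise_rate_per_site_per_myr

# Assumed rate for illustration only: 0.0006 substitutions per site per Myr.
ASSUMED_RATE = 0.0006

estimate = divergence_time_myr(13, 104, ASSUMED_RATE)
print(round(estimate))  # 208
```

The estimate is only as good as the assumed rate: halving the rate doubles the estimated age.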

Worked example

If cytochrome c sequences in two lineages diverge at approximately 0.06 substitutions per site per 100 million years (0.0006 per site per million years), and human and chicken cytochrome c differ at 13 of 104 amino-acid sites (13 / 104 = 0.125 substitutions per site), the estimated divergence time is 0.125 / 0.0006 ≈ 208 million years, or around 200 million years. Equivalently, at this rate one difference accumulates across the 104 sites roughly every 16 million years, and 13 × 16 ≈ 208 million years. The fossil-dated synapsid-sauropsid split is ~310 million years ago, so the estimate is of the right order but too young by about a third; molecular clocks are accurate to within a factor of about 2 for many proteins.

Strengths and limitations

Strengths:

Provides time-resolution where fossils are absent or sparse.
Quantitative and reproducible.
Cross-checks fossil-based dates.

Limitations:

Rates are not perfectly constant across lineages (the "molecular-clock hypothesis" is often relaxed in modern Bayesian phylogenetics).
Rates vary among genes (synonymous sites evolve faster than nonsynonymous; introns faster than exons; mitochondrial DNA faster than nuclear).
Calibration depends on fossil dates, so clock estimates inherit any error in those dates and are not fully independent of the fossil record.
Selection on the chosen gene biases the clock — positively selected genes evolve faster than neutral ones.
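The dependence on calibration can be made concrete with a short Python sketch. It calibrates a rate from one fossil-dated split (human and chicken, 13 differences, ~310 million years, both from this lesson) and then dates a second pair (human and tuna, 21 differences, from the cytochrome c table). The alternative calibration ages of 280 and 340 million years are hypothetical, included only to show how the answer moves.

```python
SITES = 104  # approximate length of vertebrate cytochrome c

def calibrate_rate(differences, sites, fossil_age_myr):
    """Pairwise rate (substitutions per site per Myr) implied by a dated split."""
    return (differences / sites) / fossil_age_myr

def estimate_age(differences, sites, rate):
    """Divergence time (Myr) implied by a difference count and a rate."""
    return (differences / sites) / rate

# 310 Myr is the calibration used in the text; 280 and 340 are hypothetical.
for fossil_age in (280, 310, 340):
    rate = calibrate_rate(13, SITES, fossil_age)
    age = estimate_age(21, SITES, rate)
    print(fossil_age, round(age))
```

Under a strict clock the estimated age is simply the calibration age scaled by the ratio of difference counts (21 / 13), so any error in the fossil date passes straight through to the estimate.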

Modern phylogenomics uses relaxed clocks that allow rate variation across the tree and across genes, providing more accurate divergence-time estimates than the original Zuckerkandl-Pauling constant-rate model.

rRNA Phylogenies and the Three Domains

The most influential application of molecular phylogenetics is Carl Woese's 1977 demonstration that the small-subunit ribosomal RNA gene (16S in prokaryotes, 18S in eukaryotes) divides life into three deep clades: Bacteria, Archaea, and Eukarya. The choice of rRNA was deliberate: rRNA is universal (every cell has ribosomes), functionally essential (so under strong purifying selection that limits sequence change), and slowly evolving (preserving signal back to the deepest splits). Paraphrasing Woese's framework rather than quoting verbatim: a single molecule, ubiquitous and slowly evolving, can serve as a universal yardstick for measuring evolutionary distance across the entire tree of life.

The pre-Woesean picture was a single division between Eukaryotes (with nuclei) and Prokaryotes (without). Woese's rRNA evidence revealed that the prokaryotes contained two distinct lineages — Bacteria and Archaea — that were as distant from each other as either was from Eukarya. Further, the molecular machinery of Archaea (RNA polymerase, ribosomal proteins, histones, DNA replication enzymes) shared more features with Eukarya than with Bacteria, implying that Archaea and Eukarya share a more recent common ancestor than either does with Bacteria.

The reclassification was published as the three-domain system in 1990 (Woese, Kandler & Wheelis) and is now the standard framework for the deepest levels of biological classification (cross-link to lesson 5).

flowchart TD
  LUCA["Last Universal Common Ancestor (LUCA)"] --> Bact["Bacteria (peptidoglycan walls, ester lipids)"]
  LUCA --> Arch_Euk["Common ancestor of Archaea + Eukarya"]
  Arch_Euk --> Arch["Archaea (no peptidoglycan, ether lipids, histones)"]
  Arch_Euk --> Euk["Eukarya (nucleus, mitochondria, complex organelles)"]
  Euk --> Animal["Animalia"]
  Euk --> Plant["Plantae"]
  Euk --> Fungi["Fungi"]
  Euk --> Prot["Protists"]

Modern phylogenomic refinements have further complicated the picture: the Asgard archaea (described in deep-marine sediments from 2015 onward) appear to be the sister group to all eukaryotes, suggesting eukaryotes emerged from within the archaeal lineage rather than as a sister to it. The strict three-domain tree may evolve into a "two-domain" picture in which Eukaryotes nest within Archaea — but the molecular-phylogenetic principle remains intact.

Building a Cladogram — The Parsimony Principle

When constructing a phylogenetic tree from sequence data, the central computational challenge is: among the many possible tree topologies, which one best explains the observed differences?

The classical answer is maximum parsimony: choose the tree that requires the fewest evolutionary changes to explain the data. The intuition is that convergent evolution (the same change happening twice independently) is unlikely to repeat across complex molecules, so the tree topology with fewest required changes is the most probable.

Worked example — a four-taxon problem

Four species (A, B, C, D) are sampled at a hypothetical locus. The sequences differ at five positions. Three possible unrooted tree topologies exist for four taxa: ((A,B),(C,D)), ((A,C),(B,D)), ((A,D),(B,C)). The parsimony algorithm maps the observed character differences onto each topology and counts the minimum changes required.

For instance, suppose that at each of the five variable positions A and B share one base while C and D share a different base. The topology ((A,B),(C,D)) then requires only one substitution per position, on the internal branch separating the two pairs, for a total of 5 substitutions. The alternatives ((A,C),(B,D)) and ((A,D),(B,C)) each require two independent substitutions per position to explain the same data, for a total of 10. Parsimony selects the ((A,B),(C,D)) topology.
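The counting in this four-taxon problem can be checked with a small Python sketch of Fitch's algorithm, one standard way of finding the minimum number of changes on a given tree. The five-site sequences are hypothetical, chosen so that A and B agree with each other and C and D agree with each other at every site. Each topology is written as nested tuples and rooted arbitrarily, which does not affect the parsimony score.

```python
def fitch_site(tree, states):
    """Return (candidate ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):               # a leaf: its observed state, no changes
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:                              # children can agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Minimum total changes over all sites for one tree topology."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )

# Hypothetical data: five variable sites, A = B and C = D at each one.
alignment = {"A": "ACGTA", "B": "ACGTA", "C": "GTACC", "D": "GTACC"}

topologies = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}
scores = {name: parsimony_score(tree, alignment) for name, tree in topologies.items()}
print(scores)
```

With these data the first topology needs 5 changes and the other two need 10 each, so parsimony prefers ((A,B),(C,D)).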

For real phylogenetic problems with dozens of taxa and thousands of sites, exhaustive search is infeasible; heuristic searches that rearrange a starting tree by branch swapping (for example tree-bisection-reconnection) find near-optimal trees rapidly.
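The reason exhaustive search becomes infeasible can be seen by counting topologies. The number of distinct unrooted bifurcating trees for n taxa is the product of the odd numbers from 1 to 2n − 5; the following Python sketch simply evaluates that product for a few values of n.

```python
def unrooted_tree_count(n_taxa):
    """Number of distinct unrooted bifurcating trees for n_taxa labelled tips."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., (2n - 5); the leading 1 is implicit.
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, unrooted_tree_count(n))
```

Four taxa give the three topologies listed in the worked example, ten taxa already give about two million, and twenty taxa give more than 10²⁰, which is why heuristic searches are used.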

The role of outgroups

To root a tree (orient it relative to evolutionary time), an outgroup is included — a taxon known to lie outside the clade of interest. The branch between the outgroup and the ingroup defines the root. For instance, when analysing primate relationships, a rodent might be used as outgroup; when analysing mammalian relationships, a reptile or amphibian; when analysing tetrapod relationships, a fish.

flowchart LR
  Root["Root"] --> Out["Outgroup (e.g. lamprey)"]
  Root --> Ingroup["Ingroup MRCA"]
  Ingroup --> A["Ingroup taxon A"]
  Ingroup --> B_C["MRCA of B + C"]
  B_C --> B["Taxon B"]
  B_C --> C["Taxon C"]

Maximum likelihood and Bayesian inference

Modern phylogenetics has largely shifted from maximum parsimony to maximum likelihood (ML) and Bayesian inference, which explicitly model the probability of substitution under a statistical model (e.g. Jukes-Cantor, Kimura 2-parameter, GTR). These methods are more robust to long-branch attraction and other parsimony pitfalls, and they provide statistical measures of branch support (bootstrap percentages, posterior probabilities). Maximum parsimony remains pedagogically central at A-Level because it embodies the conceptual logic cleanly.
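As a small illustration of what a substitution model does, the Jukes-Cantor model converts the observed proportion of differing sites p into an estimated number of substitutions per site, d = −(3/4) ln(1 − 4p/3), allowing for repeated changes at the same site. The Python sketch below evaluates that formula; the values of p are arbitrary examples, not data from this lesson.

```python
import math

def jukes_cantor_distance(p):
    """Estimated substitutions per site from observed proportion of differences.

    Under the Jukes-Cantor model all four bases are equally frequent and all
    substitutions equally likely, so p saturates at 0.75 for unrelated sequences.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)

# Arbitrary example values of p.
for p in (0.01, 0.10, 0.30, 0.50):
    print(p, round(jukes_cantor_distance(p), 3))
```

For small p the corrected distance is almost equal to p, but as p grows the correction becomes large, which is one reason raw difference counts underestimate deep divergences.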

Comparative Genomics — BLAST and Multiple Sequence Alignment

Two foundational bioinformatic tools enable practical phylogenetics:

BLAST (Basic Local Alignment Search Tool). Given a query DNA or protein sequence, BLAST searches a database (typically NCBI's GenBank) for similar sequences. The output is a list of matches ranked by similarity score, with statistical significance (E-values). BLAST is the standard tool for identifying the species of origin of an unknown sequence and for finding homologues of a gene across species.
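The meaning of an E-value can be illustrated with the standard relationship between a bit score S′ and the expected number of chance hits, E = m × n × 2^(−S′), where m is the query length and n the total database length. The Python sketch below evaluates that formula only; it does not run BLAST, the lengths are hypothetical, and real BLAST reports use adjusted "effective" lengths, so reported values will differ somewhat.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """Expected number of chance alignments scoring at least bit_score."""
    return query_length * database_length * 2.0 ** (-bit_score)

# Hypothetical search: a 300-residue query against a 10**9-residue database.
QUERY_LENGTH = 300
DATABASE_LENGTH = 10**9

for bits in (30, 50, 100):
    print(bits, expected_chance_hits(bits, QUERY_LENGTH, DATABASE_LENGTH))
```

Each extra bit of score halves the expected number of chance matches, so a hit with a very small E-value is unlikely to be a coincidence and is more plausibly a genuine homologue.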
