Molecular evidence has revolutionised our understanding of evolutionary relationships in the last sixty years. Before the 1960s, classification depended on morphology, anatomy, and the fossil record; phylogenetic relationships were inferred from shared traits whose homology had to be argued case by case. The advent of protein sequencing in the 1950s (Sanger's insulin sequence, 1953), DNA sequencing in the 1970s (Sanger and Maxam-Gilbert methods), and high-throughput sequencing from the mid-2000s onward has transformed phylogenetics into a quantitative, sequence-based discipline. Today, the entire tree of life is being re-resolved with molecular data, often overturning classical morphology-based groupings.
Spec mapping: This lesson sits in AQA 7402 Section 3.7.3 (species and taxonomy — DNA sequencing, immunological evidence, and protein-sequence comparison as evidence for evolutionary relationships). Refer to the official AQA specification document for exact wording. It extends the classification framework established in lesson 5 with the molecular data that underpins modern phylogenetics.
Connects to: Classification and taxonomy (Section 3.7.3, lesson 5 — the three-domain reform rests on molecular evidence); DNA structure and replication (Section 3.4.1, course 4 DNA, Genes and Inheritance); protein synthesis (Section 3.4.2, course 4); speciation and divergence (Section 3.7.3, lesson 4); cell recognition and immunology (Section 3.2.4, course 2 Cells, Microscopy and Cell Cycle).
Morphological characters have served taxonomy well for two centuries, but they suffer from systematic problems that molecular characters largely avoid:
Molecular evidence does not entirely replace morphology — fossils require morphological interpretation, and morphological evidence remains important for placing extinct lineages — but it is now the dominant evidence for phylogenetic relationships among living taxa.
The foundational molecular technique is comparative DNA sequencing. The number of differences between homologous DNA sequences (DNA derived from a common ancestor) in two species is approximately proportional to the time since their last common ancestor.
Key Principle: Closely related species (recent common ancestor) have fewer DNA differences; distantly related species have more. Comparing DNA from many species allows reconstruction of the branching pattern that best explains the observed differences.
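The principle above can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the function name `p_distance` and the three toy sequences are invented for this example, and the sequences are assumed to be already aligned to the same length.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore columns where either sequence has an alignment gap
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    differences = sum(1 for a, b in pairs if a != b)
    return differences / len(pairs)


# Hypothetical aligned sequences, invented for illustration (not real data)
toy = {
    "species_1": "ATGCCGTAAGCT",
    "species_2": "ATGCCGTAAGCA",
    "species_3": "ATGACGTTAGGA",
}

names = sorted(toy)
distances = {
    (x, y): p_distance(toy[x], toy[y])
    for i, x in enumerate(names)
    for y in names[i + 1:]
}

# The pair with the fewest differences is inferred to share
# the most recent common ancestor
closest_pair = min(distances, key=distances.get)
print(closest_pair, distances[closest_pair])
```

In this toy data set, species_1 and species_2 differ at one site out of twelve, while species_3 differs from each of them at three or four sites, so the first two would be grouped together.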
The standard workflow:
A typical undergraduate phylogenetics project might involve 1–2 kb of sequence from 20–50 species; modern phylogenomic studies use whole-genome alignments across hundreds of species.
Before genomic sequencing was routine, protein sequencing provided the first molecular phylogenies. Cytochrome c is a small (104 amino acids in vertebrates) electron-transport protein found in the mitochondria of all aerobic eukaryotes. It is essential for cellular respiration (cross-link to course 7 Energy Transfers in Organisms) and therefore conserved across the eukaryotic tree.
The differences in cytochrome c amino-acid sequence between species are striking:
| Species pair | Approximate amino-acid differences (of ~104 positions) |
|---|---|
| Human vs chimpanzee | 0 (identical) |
| Human vs rhesus monkey | 1 |
| Human vs horse | 12 |
| Human vs dog | 11 |
| Human vs chicken | 13 |
| Human vs rattlesnake | 14 |
| Human vs tuna | 21 |
| Human vs Drosophila | 24 |
| Human vs wheat germ | 43 |
| Human vs yeast | 44 |
| Human vs Neurospora mould | 45 |
These data, accumulated in the 1960s and 1970s through the pioneering work of Margaret Dayhoff and others, reveal a striking pattern: vertebrate-vertebrate differences are small; vertebrate-plant differences are large; the magnitudes correspond well to the branching depths inferred from morphology and the fossil record. Cytochrome c was one of the first proteins for which a clear molecular phylogeny was constructed (Fitch and Margoliash, 1967), and the resulting tree closely matched the morphological tree.
The Linus Pauling / Emile Zuckerkandl molecular-clock framework, developed in the 1960s, formalised the observation that protein sequences appear to accumulate substitutions at roughly constant rates over geological time. Paraphrasing their proposal (rather than quoting verbatim): proteins under similar functional constraints across lineages should accumulate amino-acid substitutions at approximately constant rates, allowing sequence divergence to be used as a clock to date evolutionary splits.
Key Definition: A molecular clock is the observation that, for a given gene or protein, the rate of molecular substitution per unit time is approximately constant across lineages. By counting substitutions and knowing the rate, one can estimate the time since two lineages diverged from their last common ancestor.
If cytochrome c sequences in two lineages diverge at approximately 0.06 substitutions per site per 100 million years (0.0006 per site per million years), and human and chicken cytochrome c differ at 13 of 104 amino-acid sites (13 / 104 = 0.125 substitutions per site), the estimated divergence time is approximately 0.125 / 0.0006 ≈ 208 million years, or around 200 million years. This estimate is broadly consistent with the fossil-dated synapsid-sauropsid split at ~310 million years ago, in the sense that simple molecular clocks are accurate to within a factor of about 2 for many proteins.
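The strict-clock arithmetic can be written out as a short Python sketch. The function name and the constant `ASSUMED_RATE` are hypothetical; the rate of 0.0006 substitutions per site per million years is an assumption chosen because it reproduces the figure of around 200 million years, and it is treated here as the rate at which the two sequences diverge from each other.

```python
def divergence_time_my(differences: int, sites: int,
                       rate_per_site_per_my: float) -> float:
    """Estimate time since divergence (million years) under a strict clock.

    The rate is interpreted as pairwise divergence per site per million
    years (an assumption; with a per-lineage rate the result would be halved).
    """
    if sites <= 0 or rate_per_site_per_my <= 0:
        raise ValueError("sites and rate must be positive")
    divergence_per_site = differences / sites
    return divergence_per_site / rate_per_site_per_my


# Assumed illustrative rate, not a measured value
ASSUMED_RATE = 0.0006

# Human vs chicken cytochrome c: 13 differences across 104 sites
human_chicken = divergence_time_my(13, 104, ASSUMED_RATE)
print(round(human_chicken))  # 208
```

Because the raw count ignores sites that have changed more than once, an uncorrected estimate of this kind tends to understate the true divergence.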
Strengths:
Limitations:
Modern phylogenomics uses relaxed clocks that allow rate variation across the tree and across genes, providing more accurate divergence-time estimates than the original Zuckerkandl-Pauling constant-rate model.
The most influential application of molecular phylogenetics is Carl Woese's 1977 demonstration that the small-subunit ribosomal RNA gene (16S in prokaryotes, 18S in eukaryotes) divides life into three deep clades: Bacteria, Archaea, and Eukarya. The choice of rRNA was deliberate: rRNA is universal (every cell has ribosomes), functionally essential (so under strong purifying selection that limits drift), and slowly evolving (preserving signal back to the deepest splits). Paraphrasing Woese's framework rather than quoting verbatim: a single molecule, ubiquitous and slowly evolving, can serve as a universal yardstick for measuring evolutionary distance across the entire tree of life.
The pre-Woesean picture was a single division between Eukaryotes (with nuclei) and Prokaryotes (without). Woese's rRNA evidence revealed that the prokaryotes contained two distinct lineages — Bacteria and Archaea — that were as distant from each other as either was from Eukarya. Further, the molecular machinery of Archaea (RNA polymerase, ribosomal proteins, histones, DNA replication enzymes) shared more features with Eukarya than with Bacteria, implying that Archaea and Eukarya share a more recent common ancestor than either does with Bacteria.
The reclassification was published as the three-domain system in 1990 (Woese, Kandler & Wheelis) and is now the standard framework for the deepest levels of biological classification (cross-link to lesson 5).
flowchart TD
LUCA["Last Universal Common Ancestor (LUCA)"] --> Bact["Bacteria (peptidoglycan walls, ester lipids)"]
LUCA --> Arch_Euk["Common ancestor of Archaea + Eukarya"]
Arch_Euk --> Arch["Archaea (no peptidoglycan, ether lipids, histones)"]
Arch_Euk --> Euk["Eukarya (nucleus, mitochondria, complex organelles)"]
Euk --> Animal["Animalia"]
Euk --> Plant["Plantae"]
Euk --> Fungi["Fungi"]
Euk --> Prot["Protists"]
Modern phylogenomic refinements have further complicated the picture: the Asgard archaea (described in deep-marine sediments from 2015 onward) appear to be the sister group to all eukaryotes, suggesting eukaryotes emerged from within the archaeal lineage rather than as a sister to it. The strict three-domain tree may evolve into a "two-domain" picture in which Eukaryotes nest within Archaea — but the molecular-phylogenetic principle remains intact.
When constructing a phylogenetic tree from sequence data, the central computational challenge is: among the many possible tree topologies, which one best explains the observed differences?
The classical answer is maximum parsimony: choose the tree that requires the fewest evolutionary changes to explain the data. The intuition is that the same change arising independently in separate lineages (convergence) is less likely than a single change inherited from a common ancestor, so the topology requiring the fewest changes is taken as the best-supported hypothesis.
Four species (A, B, C, D) are sampled at a hypothetical locus. The sequences differ at five positions. Three possible unrooted tree topologies exist for four taxa: ((A,B),(C,D)), ((A,C),(B,D)), ((A,D),(B,C)). The parsimony algorithm maps the observed character differences onto each topology and counts the minimum changes required.
For instance, suppose that at four of the five variable positions A and B share one character state while C and D share a different one, and that the fifth position is a change unique to a single species. The topology ((A,B),(C,D)) then requires only one substitution per position, for a total of 5 substitutions. Each alternative, such as ((A,C),(B,D)), requires two independent substitutions at each of the four shared positions plus one at the fifth, for a total of 9. Parsimony selects the ((A,B),(C,D)) topology.
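The counting step can be sketched with Fitch's small-parsimony procedure, which finds the minimum number of changes a given tree needs at each site. The alignment below is hypothetical and was invented to match the pattern described: four positions split A and B from C and D, and one position is unique to A. Trees are written as nested tuples with an arbitrary root, which does not affect the Fitch count.

```python
def fitch_site(tree, states):
    """Return (possible_states, min_changes) for one site on a binary tree."""
    if isinstance(tree, str):            # leaf: the observed state, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:                           # children agree: no extra change
        return shared, left_cost + right_cost
    # children disagree: one substitution is needed on this branch pair
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum changes over every column of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


# Hypothetical five-site alignment, invented for illustration
alignment = {
    "A": "GCTAG",
    "B": "GCTAC",
    "C": "ATCGC",
    "D": "ATCGC",
}

# The three possible unrooted topologies for four taxa
topologies = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}

scores = {name: parsimony_score(t, alignment) for name, t in topologies.items()}
best = min(scores, key=scores.get)
print(scores)  # {'((A,B),(C,D))': 5, '((A,C),(B,D))': 9, '((A,D),(B,C))': 9}
print(best)    # ((A,B),(C,D))
```

With only three topologies every tree can be scored exhaustively; the same scoring function is what a heuristic search would call repeatedly on larger problems.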
For real phylogenetic problems with dozens of taxa and thousands of sites, exhaustive search is infeasible because the number of possible topologies grows explosively with the number of taxa; heuristic branch-swapping algorithms (such as tree-bisection-reconnection) find near-optimal trees rapidly.
To root a tree (orient it relative to evolutionary time), an outgroup is included — a taxon known to lie outside the clade of interest. The branch between the outgroup and the ingroup defines the root. For instance, when analysing primate relationships, a rodent might be used as outgroup; when analysing mammalian relationships, a reptile or amphibian; when analysing tetrapod relationships, a fish.
flowchart LR
Root["Root"] --> Out["Outgroup (e.g. lamprey)"]
Root --> Ingroup["Ingroup MRCA"]
Ingroup --> A["Ingroup taxon A"]
Ingroup --> B_C["MRCA of B + C"]
B_C --> B["Taxon B"]
B_C --> C["Taxon C"]
Modern phylogenetics has largely shifted from maximum parsimony to maximum likelihood (ML) and Bayesian inference, which explicitly model the probability of substitution under a statistical model (e.g. Jukes-Cantor, Kimura 2-parameter, GTR). These methods are more robust to long-branch attraction and other parsimony pitfalls, and they provide statistical measures of branch support (bootstrap percentages, posterior probabilities). Maximum parsimony remains pedagogically central at A-Level because it embodies the conceptual logic cleanly.
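As a small illustration of what a substitution model contributes, the sketch below applies the Jukes-Cantor correction for nucleotide sequences, which converts the observed proportion of differing sites into an estimate of the actual number of substitutions per site, allowing for sites that have changed more than once. The function name is hypothetical, and this is only the distance correction, not a full maximum-likelihood tree search.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor corrected distance from an observed proportion p.

    Uses d = -(3/4) * ln(1 - 4p/3), which assumes all nucleotide
    substitutions are equally likely.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must be in the range [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


# The correction is small for similar sequences and large for divergent ones
for observed in (0.05, 0.25, 0.50):
    corrected = jukes_cantor_distance(observed)
    print(f"observed {observed:.2f} -> corrected {corrected:.3f}")
```

The corrected value is always at least as large as the observed one, and the gap widens as sequences become more divergent, which is one reason raw difference counts understate deep divergences.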
Two foundational bioinformatic tools enable practical phylogenetics:
BLAST (Basic Local Alignment Search Tool). Given a query DNA or protein sequence, BLAST searches a database (typically NCBI's GenBank) for similar sequences. The output is a list of matches ranked by similarity score, with statistical significance (E-values). BLAST is the standard tool for identifying the species of origin of an unknown sequence and for finding homologues of a gene across species.
Multiple Sequence Alignment (MSA). Given several homologous sequences, MSA tools (Clustal, MUSCLE, MAFFT) align them column-by-column so homologous positions are stacked vertically. The aligned matrix is then used as input for phylogenetic reconstruction. Conserved columns indicate functionally important positions under purifying selection; variable columns provide the data points for tree inference.
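The distinction between conserved and variable columns can be sketched as follows. This is a minimal example on a hypothetical, already-aligned toy matrix; the function name and labels are invented here. It also marks columns as "informative" in the parsimony sense, meaning at least two different states each appear in at least two sequences.

```python
from collections import Counter


def classify_columns(aligned):
    """Label each alignment column as conserved, variable, or informative."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    labels = []
    for column in zip(*aligned):
        # Count residues in this column, ignoring gap characters
        counts = Counter(c for c in column if c != "-")
        if len(counts) <= 1:
            labels.append("conserved")
        elif sum(1 for n in counts.values() if n >= 2) >= 2:
            labels.append("informative")
        else:
            labels.append("variable")
    return labels


# Hypothetical aligned sequences, invented for illustration
toy_alignment = [
    "ATGCTAG",
    "ATGCTAC",
    "ATACGGC",
    "AT-CGGC",
]

labels = classify_columns(toy_alignment)
print(labels)
```

Here columns 1, 2 and 4 are conserved, columns 5 and 6 are parsimony-informative, and columns 3 and 7 vary but do not by themselves favour one grouping over another.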