This lesson covers genomics — the study of entire genomes — and bioinformatics — the use of computational tools to analyse biological data. Both are increasingly important topics in the Edexcel A-Level Biology specification (9BI0, Topic 8).
Genomics is the branch of molecular biology that focuses on the structure, function, evolution and mapping of entire genomes. Unlike classical genetics, which studies individual genes, genomics takes a holistic approach — examining all genes and their interactions.
| Type | Focus |
|---|---|
| Structural genomics | Determining the complete DNA sequence and physical organisation of genomes |
| Functional genomics | Understanding what genes do — identifying gene functions, expression patterns and interactions |
| Comparative genomics | Comparing genomes of different species to understand evolution and identify conserved sequences |
| Pharmacogenomics | Studying how genetic variation affects drug response — towards personalised medicine |
| Metagenomics | Analysing DNA from environmental samples containing mixed communities of organisms |
Sanger sequencing (the chain-termination method), developed by Frederick Sanger in 1977, was the method used in the Human Genome Project.
The method uses dideoxynucleotides (ddNTPs) — modified nucleotides that lack the 3'-OH group. When a ddNTP is incorporated into a growing DNA strand, no further nucleotides can be added — the chain is terminated.
Exam Tip: The key principle of Sanger sequencing is that ddNTPs terminate the chain because they lack the 3'-OH group needed to form a phosphodiester bond with the next nucleotide.
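The read-out logic of chain termination can be illustrated with a toy sketch (pure illustration, not the chemistry): each terminated fragment ends at a known position, so reading the terminal base of the fragments in order of length recovers the sequence. The function names `sanger_ladder` and `read_gel` are invented for this example.

```python
# Toy illustration (not real chemistry): in Sanger sequencing, each
# terminated fragment ends in a labelled ddNTP. Reading the terminal
# base of fragments ordered by length recovers the sequence.

def sanger_ladder(template):
    """Return synthesised fragments, one terminating at each position."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    new_strand = "".join(complement[base] for base in template)
    # Every prefix of the new strand is a possible terminated fragment.
    return [new_strand[:i] for i in range(1, len(new_strand) + 1)]

def read_gel(fragments):
    """Read terminal bases of fragments sorted by length (shortest first)."""
    return "".join(frag[-1] for frag in sorted(fragments, key=len))

fragments = sanger_ladder("ATGCCTA")
print(read_gel(fragments))  # reconstructs the complementary strand: TACGGAT
```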
NGS technologies have made sequencing much faster and cheaper than Sanger sequencing:
| Feature | Sanger sequencing | Next-generation sequencing |
|---|---|---|
| Throughput | Low (one fragment at a time) | High (millions of fragments in parallel) |
| Cost per genome | ~$100 million (HGP) | ~$1,000 or less |
| Speed | Years for a whole genome | Hours to days |
| Read length | 700–1000 bp | 50–300 bp (but massively parallel) |
| Applications | Single genes, validation | Whole genomes, transcriptomics, metagenomics |
NGS works by sequencing millions of small DNA fragments simultaneously (massively parallel sequencing) and then using powerful computers to assemble the fragments into a complete sequence.
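The assembly step can be sketched with a deliberately naive greedy algorithm: repeatedly merge the two fragments with the longest suffix-prefix overlap. Real assemblers use far more sophisticated methods (e.g. de Bruijn graphs); the function names and reads below are invented.

```python
# A minimal sketch of the assembly idea: repeatedly merge the pair of
# fragments with the longest suffix-prefix overlap.

def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads):
    reads = list(reads)
    while len(reads) > 1:
        # Find the pair with the maximum overlap and merge it.
        n, i, j = max(((overlap(a, b), i, j)
                       for i, a in enumerate(reads)
                       for j, b in enumerate(reads) if i != j),
                      key=lambda t: t[0])
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads[0]

print(greedy_assemble(["ATGCCT", "CCTAGG", "AGGTTC"]))  # ATGCCTAGGTTC
```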
The Human Genome Project was an international collaborative project (1990–2003) to sequence the entire human genome.
| Finding | Detail |
|---|---|
| Genome size | ~3.2 billion base pairs |
| Protein-coding genes | ~20,000–25,000 |
| Coding DNA | ~1.5% of the genome |
| Repetitive sequences | >50% of the genome |
| Genetic similarity between humans | ~99.9% identical |
| Number of chromosomes | 22 autosome pairs + 1 pair of sex chromosomes |
The project's findings have had wide-ranging applications:
| Application | Impact |
|---|---|
| Medicine | Identification of disease genes; development of genetic tests; pharmacogenomics |
| Forensics | Improved DNA profiling techniques |
| Evolutionary biology | Comparison with other species' genomes reveals evolutionary relationships |
| Agriculture | Identification of genes for crop improvement |
| Ancestry | Consumer genomics services (e.g. 23andMe) trace ancestry and health risks |
Bioinformatics is the application of computer science, statistics and mathematics to the analysis of biological data, particularly DNA, RNA and protein sequences.
| Tool/Database | Purpose |
|---|---|
| GenBank (NCBI) | Public database of DNA sequences from all organisms |
| BLAST (Basic Local Alignment Search Tool) | Compares a query sequence against a database to find similar sequences |
| UniProt | Database of protein sequences and functional information |
| Ensembl | Genome browser for visualising and analysing genome data |
| CLUSTAL | Multiple sequence alignment tool for comparing related sequences |
| Protein Data Bank (PDB) | Database of 3D structures of proteins |
One of the most fundamental tasks in bioinformatics is sequence alignment — comparing two or more DNA, RNA or protein sequences to identify regions of similarity.
Why align sequences?
- To infer the function of an unknown gene from its similarity to characterised genes (e.g. via a BLAST search)
- To identify mutations by comparing a patient's sequence against a reference
- To infer evolutionary relationships between species
- To find conserved regions that are likely to be functionally important
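As a minimal illustration of the numbers alignment tools report, the sketch below computes percent identity over the non-gap columns of two already-aligned sequences (the same statistic BLAST reports as "94% identity"). The function name and sequences are invented; real tools also compute the alignment itself by dynamic programming.

```python
# Toy sketch of "percent identity" for two already-aligned sequences
# (gaps shown as '-'). Gap columns are excluded from the comparison.

def percent_identity(seq_a, seq_b):
    """Percent of aligned (non-gap) columns where the bases match."""
    assert len(seq_a) == len(seq_b), "sequences must be aligned to equal length"
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)

# Example: 1 mismatch and 1 gap across a 10-column alignment.
print(percent_identity("ATGCC-TAGT",
                       "ATGCCATAAT"))  # 8 matches / 9 compared columns
```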
Comparing the genomes of different species provides powerful insights into evolution and gene function.
| Comparison | Finding |
|---|---|
| Human vs chimpanzee | ~98.7% DNA sequence similarity |
| Human vs mouse | ~85% protein-coding genes have clear counterparts |
| Human vs banana | ~60% of genes have counterparts |
| Conserved sequences across species | Likely to have essential functions (strong purifying selection) |
Highly conserved sequences (those that are similar across many species) are likely to have important biological functions, because mutations in these regions would be deleterious and removed by natural selection.
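A toy way to see conservation in a multiple alignment is to score each column by the fraction of sequences agreeing with the majority base. This is an invented, simplified metric (real conservation scores model the phylogeny), and the sequences below are made up.

```python
# Sketch: score each column of a multiple alignment by the fraction of
# sequences sharing the most common base; highly conserved columns hint
# at functional importance.

from collections import Counter

def column_conservation(alignment):
    """Fraction of sequences agreeing with the majority base, per column."""
    n_seqs = len(alignment)
    scores = []
    for column in zip(*alignment):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / n_seqs)
    return scores

aligned = ["ATGCGT",   # e.g. human (sequences are invented)
           "ATGCAT",   # e.g. mouse
           "ATGAAT"]   # e.g. zebrafish
print(column_conservation(aligned))  # first three columns fully conserved
```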
Transcriptomics is the study of the complete set of RNA transcripts (the transcriptome) produced by the genome at a given time, in a given cell type, under given conditions.
A microarray is a glass slide or silicon chip containing thousands of single-stranded DNA probes, each representing a different gene.
RNA-Seq (RNA sequencing) is a newer technique that uses next-generation sequencing to determine which genes are being transcribed and at what level.
Advantages over microarrays:
- No prior knowledge of the sequences is needed, so novel transcripts and splice variants can be detected
- Greater dynamic range: expression levels are quantified by counting reads rather than measuring fluorescence intensity
- Lower background noise, with no cross-hybridisation between similar sequences
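Raw RNA-Seq read counts must be normalised for gene length and sequencing depth before expression levels can be compared; a common unit is TPM (transcripts per million). The sketch below uses invented gene names and counts.

```python
# Sketch of TPM normalisation: divide counts by gene length, then scale
# so the values sum to one million.

def tpm(counts, lengths_kb):
    """counts: reads per gene; lengths_kb: gene length in kilobases."""
    rates = {g: counts[g] / lengths_kb[g] for g in counts}  # reads per kb
    scale = 1e6 / sum(rates.values())
    return {g: rate * scale for g, rate in rates.items()}

counts = {"geneA": 500, "geneB": 500}   # same raw counts...
lengths = {"geneA": 1.0, "geneB": 5.0}  # ...but geneB is 5x longer
print(tpm(counts, lengths))  # geneA's TPM is 5x geneB's
```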
Proteomics is the large-scale study of the complete set of proteins (the proteome) produced by an organism, tissue or cell.
The proteome is more complex than the genome because:
- One gene can produce several different proteins through alternative splicing
- Proteins are chemically altered after translation (post-translational modifications such as phosphorylation and glycosylation)
- The set of proteins present varies between cell types and changes over time and with conditions, whereas the genome is essentially fixed
Key techniques in proteomics include:
- Two-dimensional (2D) gel electrophoresis, which separates proteins first by charge (isoelectric point) and then by mass
- Mass spectrometry, which identifies proteins from the mass-to-charge ratios of their peptide fragments
Genomics and bioinformatics are driving the move towards personalised medicine — tailoring medical treatment to the individual based on their genetic profile.
| Application | Example |
|---|---|
| Pharmacogenomics | Testing for CYP2D6 variants to determine optimal drug dosage |
| Cancer genomics | Sequencing tumour DNA to identify driver mutations and select targeted therapies (e.g. trastuzumab for HER2+ breast cancer) |
| Risk prediction | Identifying BRCA1/BRCA2 mutations to assess breast/ovarian cancer risk |
| Rare disease diagnosis | Whole exome/genome sequencing to identify causative mutations in undiagnosed patients |
| Topic | Key Points |
|---|---|
| Genomics | Study of entire genomes: structural, functional, comparative |
| DNA sequencing | Sanger (chain termination) and NGS (massively parallel) |
| HGP | 3.2 billion bp, ~20,000 genes, 1.5% coding |
| Bioinformatics | Computer analysis of biological data (BLAST, GenBank, alignment) |
| Comparative genomics | Genome comparison reveals evolution and conserved functions |
| Transcriptomics | Microarrays and RNA-Seq measure gene expression |
| Proteomics | Study of all proteins; 2D gels and mass spectrometry |
| Personalised medicine | Genetic information guides treatment decisions |
Exam Tip: Bioinformatics and genomics questions may ask you to explain how sequence comparison can reveal the function of a gene, or how transcriptomics can be used to identify genes involved in disease. Always link your answer to specific techniques (BLAST, microarrays, RNA-Seq) and explain why computational analysis is essential for handling the vast amount of data generated.
This material sits in Edexcel 9BI0 Topic 8 (Grey Matter — Coordination, Response and Gene Technology), which expects candidates to describe a genome as the complete set of genetic material of an organism (for humans, ~3 × 10⁹ base pairs distributed across 22 autosome pairs, the sex chromosomes and the small mitochondrial DNA), to outline the principles of DNA sequencing (chain-termination Sanger and massively parallel next-generation sequencing), and to describe bioinformatics as the computational discipline that turns sequence data into biological insight by alignment, variant calling, assembly and annotation. Synoptic links run backwards to lesson 1 on gene structure and the genetic code (the genome is the substrate that lessons 1–4 dissected — exons, introns, regulatory elements and non-coding sequence are all features that genomic annotation must recover); to lesson 6 on PCR and gel electrophoresis (PCR is integral to short-read library preparation and to allele-specific assays that use the reference genome as their coordinate system); to lesson 7 on recombinant DNA (the cDNA, restriction-mapping and ligation logic of recombinant work is the historical engineering predecessor of modern genome assembly); and to lesson 8 on gene therapy and genetic screening (every modern carrier-screening panel and every prenatal-NIPT pipeline rests on a reference-genome assembly and a curated variant database). 
Synoptic links run sideways to Topic 6 (immunity, vaccines and infection) — pathogen-genome surveillance during the COVID-19 pandemic showed real-time sequencing of SARS-CoV-2 isolates identifying B.1.1.7 (Alpha), B.1.617.2 (Delta) and B.1.1.529 (Omicron) lineages within weeks of emergence and informing vaccine and public-health response — and to Topic 4 (biodiversity and evolution), where DNA barcoding (typically the COI gene in animals or rbcL / matK in plants) identifies species from short DNA fragments, and where comparative genomics across phyla furnishes molecular evidence for common ancestry. Refer to the official Pearson Edexcel 9BI0 specification document for exact wording.
Question (8 marks):
(a) Outline the timeline and methodology by which the cost of sequencing a human-sized genome fell from ~$3 billion in the Human Genome Project era to ~$100 today. Identify the two key methodological transitions that drove the cost reduction. (6)
(b) A bioinformatician runs a BLAST search with a 600-bp query sequence and obtains a top hit with 94% identity over 580 bp to a haemoglobin-β gene in Mus musculus. State two conclusions the bioinformatician may legitimately draw, and one conclusion they may not draw without further work. (2)
Solution with mark scheme:
(a) M1 (AO1) — Sanger sequencing era. The chain-termination method, published by Frederick Sanger in 1977, used dideoxynucleotides (ddNTPs) lacking the 3'-OH group to terminate strand synthesis at random positions; capillary electrophoresis separated the resulting fragment ladder by length and a fluorescence detector read the bases. Per-run reads were ~700 bp, throughput was one sample at a time and per-genome cost was prohibitively high.
A1 (AO1) — Human Genome Project. The Human Genome Project (1990–2003) scaled Sanger sequencing across an international consortium, producing a ~99% coverage human reference at total cost on the order of $3 billion over ~13 years. This established the reference assembly that all later short-read pipelines align against.
A1 (AO2) — first transition: short-read NGS. Illumina sequencing-by-synthesis (commercialised from 2007) generates ~150-bp reads in parallel — a single instrument run produces on the order of 10⁹–10¹⁰ reads. The cost per genome dropped to ~$1,000 by 2014 and ~$100 by 2023. The methodological transition is from serial (one Sanger reaction reads one fragment) to massively parallel (millions of clusters imaged simultaneously on a flow cell).
A1 (AO2) — second transition: long-read sequencing. PacBio (single-molecule real-time / SMRT) and Oxford Nanopore (current-disruption-through-pore) sequencing produce reads of ~10–100 kb at higher per-base cost than Illumina but with the ability to span repetitive regions, resolve structural variants and phase haplotypes that short reads cannot. The methodological transition here is from short reads + complex assembly inference to long reads + direct contig formation.
A1 (AO3) — bioinformatic consequence. The cost transition is not a sequencing-only story — alignment (BWA, Bowtie), variant calling (GATK), de novo assembly (Canu, hifiasm) and annotation (Ensembl, RefSeq) are required to turn raw reads into biological signal. Without bioinformatics, an Illumina run produces only an enormous unstructured FASTQ archive.
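To make the FASTQ point concrete: a FASTQ file stores each read as four lines (identifier, sequence, separator, per-base quality string). A minimal parser, with invented read data:

```python
# Minimal sketch of the FASTQ format: each read is four lines
# (header, sequence, '+', per-base quality). Real pipelines hand this
# data to aligners such as BWA; this only shows the raw structure.

def parse_fastq(lines):
    """Yield (read_id, sequence, quality) triples from FASTQ lines."""
    it = iter(lines)
    for header in it:
        seq, _plus, qual = next(it), next(it), next(it)
        yield header.strip().lstrip("@"), seq.strip(), qual.strip()

example = ["@read1", "ATGCCTA", "+", "IIIIHHG",
           "@read2", "GGCATTA", "+", "IIHHGGF"]
for read_id, seq, qual in parse_fastq(example):
    print(read_id, seq, len(qual))
```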
A1 (AO3) — rate-of-cost-decline framing. The cost-per-genome curve fell faster than Moore's Law from ~2008 onwards because improvements in flow-cell imaging hardware and sequencing chemistry compounded with each instrument generation. The underlying shift was from incremental optimisation of Sanger chemistry to a fundamentally different physical workflow.
(b) M1 (AO3) — legitimate conclusions. The high identity (94% over 580 bp) implies (i) that the query is homologous to Mus musculus haemoglobin-β, likely sharing its conserved haem-binding function, and (ii) that the query is probably mammalian in origin (if a tighter species-level identification is needed, multiple top hits across mammals would refine the inference).
A1 (AO3) — illegitimate conclusion. The bioinformatician may not conclude that the query is haemoglobin-β in their target species without reciprocal best BLAST, synteny checks, or alignment against a non-redundant database — high BLAST identity can also reflect paralogous genes (haemoglobin-α, myoglobin), and a 600-bp window is too short to distinguish ortholog from paralog with confidence in a multigene family.
Total: 8 marks (M2 A6).
Question (6 marks): A public-health laboratory receives respiratory-virus samples during a winter surge and uses next-generation sequencing to characterise circulating SARS-CoV-2 lineages. (i) Describe how short-read NGS data are converted into a lineage call using a bioinformatic pipeline. (ii) Explain why long-read sequencing is sometimes used in parallel, and (iii) discuss two synoptic links between this workflow and other parts of the A-Level Biology specification.
Mark scheme decomposition by AO:
| Mark | AO | Earned by |
|---|---|---|
| 1 | AO1.1 | Naming the pipeline stages: DNA / RNA extraction → reverse transcription → library preparation → sequencing on Illumina (or equivalent) → FASTQ output |
| 2 | AO1.2 | Naming the bioinformatic stages: read alignment to reference (BWA / minimap2) → variant calling → lineage assignment (e.g. against the SARS-CoV-2 reference) |
| 3 | AO2.1 | Explaining that the reference-aligned variants are compared against curated lineage-defining mutation panels to assign a designation; novel variants flagged for inspection |
| 4 | AO2.7 | Explaining that long-read sequencing (PacBio, Nanopore) resolves structural variation and haplotype phasing that short reads cannot, and is useful for novel lineages whose insertion / deletion patterns are not yet captured by the reference |
| 5 | AO3.1 | Synoptic — connecting to Topic 6 (immunity, infection, vaccines) — pathogen-genome surveillance informs vaccine updates, public-health response and outbreak source-tracing |
| 6 | AO3.2 | Synoptic — connecting to lesson 6 (PCR), which underpins library preparation, and to lesson 1 (gene structure), which gives the molecular vocabulary (codons, ORFs, regulatory regions) that lineage-defining mutations are described in |
Total: 6 marks (AO1 = 2, AO2 = 2, AO3 = 2). Edexcel reliably tests genomics through scenario prompts ("a public-health laboratory…", "a clinical-genetics service…") that demand a paired molecular + computational answer. Candidates who answer only the molecular half (extraction, library prep, sequencing) without the bioinformatic half (alignment, variant calling, lineage assignment) lose the AO2 marks; candidates who omit the synoptic link to Topic 6 immunity or to lessons 1 / 6 lose the AO3 marks. A* candidates name both Illumina and a long-read platform, distinguish their use cases, and integrate the pathogen-genome story into the wider biology of infection and immunity.
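The lineage-assignment step in the mark scheme can be sketched as matching a sample's called variants against lineage-defining mutation panels. The panels and sample below are invented for illustration; real SARS-CoV-2 tools use curated lineage designations.

```python
# Toy sketch of lineage assignment: compare a sample's called variants
# against lineage-defining mutation panels and report the best match.
# Lineage names and mutation panels are invented.

def assign_lineage(sample_variants, lineage_panels):
    """Return (best lineage, fraction of its defining mutations present)."""
    def score(name):
        panel = lineage_panels[name]
        return len(panel & sample_variants) / len(panel)
    best = max(lineage_panels, key=score)
    return best, score(best)

panels = {"lineage_X": {"S:N501Y", "S:D614G", "ORF1a:T1001I"},
          "lineage_Y": {"S:L452R", "S:D614G", "S:P681R"}}
sample = {"S:N501Y", "S:D614G", "ORF1a:T1001I", "N:R203K"}
print(assign_lineage(sample, panels))  # ('lineage_X', 1.0)
```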
Lesson 1 (gene structure and the genetic code). Genome annotation is the act of recovering the structures lesson 1 dissected — exons, introns, promoters, untranslated regions, non-coding RNA loci — from raw sequence. Roughly 1.5% of the human genome encodes protein; the remainder includes regulatory elements, ncRNA loci, transposable-element fragments and repetitive sequence. The "junk DNA" label is largely abandoned: much non-coding sequence has documented regulatory or structural function, and the term was always premature.
Lesson 6 (PCR and gel electrophoresis). PCR is integral to short-read library preparation (cluster amplification on Illumina flow cells; bridge amplification) and to allele-specific assays that probe the reference genome at known coordinates. Long-read platforms (PacBio HiFi, Nanopore) avoid PCR for some library preps to reduce amplification bias. The continuity is direct: PCR is the workhorse of every molecular-genetics laboratory the genomic era runs on.
Lesson 7 (recombinant DNA). The cDNA, restriction-enzyme, ligase and vector toolkit of lesson 7 is the historical engineering predecessor of modern genome work. cDNA libraries were the original method for cloning and sequencing expressed transcripts; shotgun cloning into bacterial vectors was the workflow Sanger sequencing scaled across in the Human Genome Project. The conceptual continuity is that today's sequencing technologies are massively parallelised industrial implementations of the recombinant-DNA logic of the 1980s.
Lesson 8 (gene therapy and genetic screening). Population-scale expanded carrier screening, non-invasive prenatal testing (NIPT) and expanded newborn screening all depend on the reference-genome assembly, on curated variant databases (ClinVar, gnomAD) and on bioinformatic variant-calling pipelines. The clinical workflow lesson 8 introduced is the deployment layer of the genomic infrastructure lesson 9 builds.
Topic 6 (immunity, infection and vaccines). Pathogen-genome surveillance showed during the COVID-19 pandemic that real-time sequencing of SARS-CoV-2 isolates can identify emerging lineages (Alpha, Delta, Omicron) within weeks, inform vaccine updates, and trace outbreak transmission chains. The same workflow applies to seasonal influenza (annual vaccine-strain selection), tuberculosis outbreak tracing, and antimicrobial-resistance surveillance. Pathogen genomics is now a routine public-health discipline rather than a research curiosity.
Topic 4 (biodiversity, evolution and DNA barcoding). DNA barcoding uses short standardised loci — most commonly the mitochondrial COI gene in animals and the chloroplast rbcL / matK genes in plants — to identify species from environmental or fragmentary samples. The same logic underpins environmental DNA (eDNA) surveys of biodiversity in rivers and seas. Comparative genomics across phyla furnishes molecular evidence for common ancestry: conserved core machinery (ribosomal RNA, tubulins, histones) recovers the same phylogenetic relationships morphology indicated, while protein-coding sequence divergence quantifies how long lineages have been separate.
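Barcode identification reduces, in sketch form, to finding the reference sequence with the highest identity to the query fragment. The reference library below is invented and far shorter than a real COI barcode (~650 bp).

```python
# Toy sketch of barcode identification: compare a query fragment to a
# small reference library and return the closest match by identity.
# Reference sequences are invented, not real COI barcodes.

def identity(a, b):
    """Fraction of positions that match (assumes equal-length sequences)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def identify(query, reference_library):
    best = max(reference_library,
               key=lambda sp: identity(query, reference_library[sp]))
    return best, identity(query, reference_library[best])

library = {"species_A": "ATGCGTACCT",
           "species_B": "ATGAATACGT"}
print(identify("ATGCGTACGT", library))  # ('species_A', 0.9)
```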
Hardy-Weinberg at genome scale. Population-level allele-frequency datasets like gnomAD (~800,000 exomes / genomes) operationalise Hardy-Weinberg expectations across millions of variants simultaneously. Variants whose observed homozygous frequency is far below Hardy-Weinberg expectation are flagged as likely deleterious; this is the population-genetic signal that filters candidate disease genes from background variation in clinical sequencing.
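The Hardy-Weinberg filter described above can be sketched directly: for a variant with allele frequency q, the expected homozygote count in N individuals is q²N, and a large shortfall of observed homozygotes flags the variant as likely deleterious. The threshold and numbers below are invented.

```python
# Sketch of the genome-scale Hardy-Weinberg filter: given allele
# frequency q, the expected homozygote count is q^2 * N; variants with
# far fewer observed homozygotes than expected are flagged.

def homozygote_deficit(allele_freq, observed_hom, n_individuals, threshold=0.1):
    """Flag a variant if observed homozygotes are well below q^2 * N."""
    expected_hom = allele_freq ** 2 * n_individuals
    return observed_hom < threshold * expected_hom, expected_hom

# q = 0.01 across 800,000 people -> ~80 homozygotes expected under H-W.
flagged, expected = homozygote_deficit(0.01, observed_hom=1, n_individuals=800_000)
print(flagged, round(expected))  # True 80
```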
Personalised medicine — pharmacogenomics and oncology. Pharmacogenomics (CYP2D6 variants for codeine and tamoxifen; TPMT variants for thiopurines; HLA-B*57:01 for abacavir) identifies patients at risk of severe adverse drug reactions before prescribing. Cancer genomics sequences tumour DNA to identify driver mutations (BRAF V600E in melanoma; HER2 amplification in breast cancer; KRAS in colorectal cancer) and matches them to targeted therapies. Both rest on the reference-genome and variant-database infrastructure of the genomic era.