This lesson covers genomics — the study of entire genomes — and bioinformatics — the use of computational tools to analyse biological data. Both are increasingly important topics in the Edexcel A-Level Biology specification (9BI0, Topic 8).
Genomics is the branch of molecular biology that focuses on the structure, function, evolution and mapping of entire genomes. Unlike classical genetics, which studies individual genes, genomics takes a holistic approach — examining all genes and their interactions.
| Type | Focus |
|---|---|
| Structural genomics | Determining the complete DNA sequence and physical organisation of genomes |
| Functional genomics | Understanding what genes do — identifying gene functions, expression patterns and interactions |
| Comparative genomics | Comparing genomes of different species to understand evolution and identify conserved sequences |
| Pharmacogenomics | Studying how genetic variation affects drug response — towards personalised medicine |
| Metagenomics | Analysing DNA from environmental samples containing mixed communities of organisms |
Sanger sequencing (the chain-termination method), developed by Frederick Sanger in 1977, was the method used in the Human Genome Project.
The method uses dideoxynucleotides (ddNTPs) — modified nucleotides that lack the 3'-OH group. When a ddNTP is incorporated into a growing DNA strand, no further nucleotides can be added — the chain is terminated.
Exam Tip: The key principle of Sanger sequencing is that ddNTPs terminate the chain because they lack the 3'-OH group needed to form a phosphodiester bond with the next nucleotide.
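The read-out logic of chain termination can be illustrated with a toy sketch (pure illustration, not the chemistry): each terminated fragment ends at a known position, so reading the terminal base of the fragments in order of length recovers the sequence. The function names `sanger_ladder` and `read_gel` are invented for this example.

```python
# Toy illustration (not real chemistry): in Sanger sequencing, each
# terminated fragment ends in a labelled ddNTP. Reading the terminal
# base of fragments ordered by length recovers the sequence.

def sanger_ladder(template):
    """Return synthesised fragments, one terminating at each position."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    new_strand = "".join(complement[base] for base in template)
    # Every prefix of the new strand is a possible terminated fragment.
    return [new_strand[:i] for i in range(1, len(new_strand) + 1)]

def read_gel(fragments):
    """Read terminal bases of fragments sorted by length (shortest first)."""
    return "".join(frag[-1] for frag in sorted(fragments, key=len))

fragments = sanger_ladder("ATGCCTA")
print(read_gel(fragments))  # reconstructs the complementary strand: TACGGAT
```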
NGS technologies have made sequencing much faster and cheaper than Sanger sequencing:
| Feature | Sanger sequencing | Next-generation sequencing |
|---|---|---|
| Throughput | Low (one fragment at a time) | High (millions of fragments in parallel) |
| Cost per genome | ~$100 million (HGP) | ~$1,000 or less |
| Speed | Years for a whole genome | Hours to days |
| Read length | 700–1000 bp | 50–300 bp (but massively parallel) |
| Applications | Single genes, validation | Whole genomes, transcriptomics, metagenomics |
NGS works by sequencing millions of small DNA fragments simultaneously (massively parallel sequencing) and then using powerful computers to assemble the fragments into a complete sequence.
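The assembly step can be sketched with a deliberately naive greedy algorithm: repeatedly merge the two fragments with the longest suffix-prefix overlap. Real assemblers use far more sophisticated methods (e.g. de Bruijn graphs); the function names and reads below are invented.

```python
# A minimal sketch of the assembly idea: repeatedly merge the pair of
# fragments with the longest suffix-prefix overlap.

def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads):
    reads = list(reads)
    while len(reads) > 1:
        # Find the pair with the maximum overlap and merge it.
        n, i, j = max(((overlap(a, b), i, j)
                       for i, a in enumerate(reads)
                       for j, b in enumerate(reads) if i != j),
                      key=lambda t: t[0])
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads[0]

print(greedy_assemble(["ATGCCT", "CCTAGG", "AGGTTC"]))  # ATGCCTAGGTTC
```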
The Human Genome Project was an international collaborative project (1990–2003) to sequence the entire human genome.
| Finding | Detail |
|---|---|
| Genome size | ~3.2 billion base pairs |
| Protein-coding genes | ~20,000–25,000 |
| Coding DNA | ~1.5% of the genome |
| Repetitive sequences | >50% of the genome |
| Genetic similarity between humans | ~99.9% identical |
| Number of chromosomes | 22 autosome pairs + 1 pair of sex chromosomes |
The project's findings have had wide-ranging applications:
| Application | Impact |
|---|---|
| Medicine | Identification of disease genes; development of genetic tests; pharmacogenomics |
| Forensics | Improved DNA profiling techniques |
| Evolutionary biology | Comparison with other species' genomes reveals evolutionary relationships |
| Agriculture | Identification of genes for crop improvement |
| Ancestry | Consumer genomics services (e.g. 23andMe) trace ancestry and health risks |
Bioinformatics is the application of computer science, statistics and mathematics to the analysis of biological data, particularly DNA, RNA and protein sequences.
| Tool/Database | Purpose |
|---|---|
| GenBank (NCBI) | Public database of DNA sequences from all organisms |
| BLAST (Basic Local Alignment Search Tool) | Compares a query sequence against a database to find similar sequences |
| UniProt | Database of protein sequences and functional information |
| Ensembl | Genome browser for visualising and analysing genome data |
| CLUSTAL | Multiple sequence alignment tool for comparing related sequences |
| Protein Data Bank (PDB) | Database of 3D structures of proteins |
One of the most fundamental tasks in bioinformatics is sequence alignment — comparing two or more DNA, RNA or protein sequences to identify regions of similarity.
Why align sequences?
- To infer the function of an unknown gene from its similarity to characterised genes (e.g. via a BLAST search)
- To identify mutations by comparing a patient's sequence against a reference
- To infer evolutionary relationships between species
- To find conserved regions that are likely to be functionally important
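As a minimal illustration of the numbers alignment tools report, the sketch below computes percent identity over the non-gap columns of two already-aligned sequences (the same statistic BLAST reports as "94% identity"). The function name and sequences are invented; real tools also compute the alignment itself by dynamic programming.

```python
# Toy sketch of "percent identity" for two already-aligned sequences
# (gaps shown as '-'). Gap columns are excluded from the comparison.

def percent_identity(seq_a, seq_b):
    """Percent of aligned (non-gap) columns where the bases match."""
    assert len(seq_a) == len(seq_b), "sequences must be aligned to equal length"
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)

# Example: 1 mismatch and 1 gap across a 10-column alignment.
print(percent_identity("ATGCC-TAGT",
                       "ATGCCATAAT"))  # 8 matches / 9 compared columns
```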
Comparing the genomes of different species provides powerful insights into evolution and gene function.
| Comparison | Finding |
|---|---|
| Human vs chimpanzee | ~98.7% DNA sequence similarity |
| Human vs mouse | ~85% protein-coding genes have clear counterparts |
| Human vs banana | ~60% of genes have counterparts |
| Conserved sequences across species | Likely to have essential functions (strong purifying selection) |
Highly conserved sequences (those that are similar across many species) are likely to have important biological functions, because mutations in these regions would be deleterious and removed by natural selection.
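A toy way to see conservation in a multiple alignment is to score each column by the fraction of sequences agreeing with the majority base. This is an invented, simplified metric (real conservation scores model the phylogeny), and the sequences below are made up.

```python
# Sketch: score each column of a multiple alignment by the fraction of
# sequences sharing the most common base; highly conserved columns hint
# at functional importance.

from collections import Counter

def column_conservation(alignment):
    """Fraction of sequences agreeing with the majority base, per column."""
    n_seqs = len(alignment)
    scores = []
    for column in zip(*alignment):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / n_seqs)
    return scores

aligned = ["ATGCGT",   # e.g. human (sequences are invented)
           "ATGCAT",   # e.g. mouse
           "ATGAAT"]   # e.g. zebrafish
print(column_conservation(aligned))  # first three columns fully conserved
```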
Transcriptomics is the study of the complete set of RNA transcripts (the transcriptome) produced by the genome at a given time, in a given cell type, under given conditions.
A microarray is a glass slide or silicon chip containing thousands of single-stranded DNA probes, each representing a different gene.
RNA-Seq (RNA sequencing) is a newer technique that uses next-generation sequencing to determine which genes are being transcribed and at what level.
Advantages over microarrays:
- No prior knowledge of the sequences is needed, so novel transcripts and splice variants can be detected
- Greater dynamic range: expression levels are quantified by counting reads rather than measuring fluorescence intensity
- Lower background noise, with no cross-hybridisation between similar sequences
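Raw RNA-Seq read counts must be normalised for gene length and sequencing depth before expression levels can be compared; a common unit is TPM (transcripts per million). The sketch below uses invented gene names and counts.

```python
# Sketch of TPM normalisation: divide counts by gene length, then scale
# so the values sum to one million.

def tpm(counts, lengths_kb):
    """counts: reads per gene; lengths_kb: gene length in kilobases."""
    rates = {g: counts[g] / lengths_kb[g] for g in counts}  # reads per kb
    scale = 1e6 / sum(rates.values())
    return {g: rate * scale for g, rate in rates.items()}

counts = {"geneA": 500, "geneB": 500}   # same raw counts...
lengths = {"geneA": 1.0, "geneB": 5.0}  # ...but geneB is 5x longer
print(tpm(counts, lengths))  # geneA's TPM is 5x geneB's
```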
Proteomics is the large-scale study of the complete set of proteins (the proteome) produced by an organism, tissue or cell.
The proteome is more complex than the genome because:
- One gene can produce several different proteins through alternative splicing
- Proteins are chemically altered after translation (post-translational modifications such as phosphorylation and glycosylation)
- The set of proteins present varies between cell types and changes over time and with conditions, whereas the genome is essentially fixed
Key techniques in proteomics include:
- Two-dimensional (2D) gel electrophoresis, which separates proteins first by charge (isoelectric point) and then by mass
- Mass spectrometry, which identifies proteins from the mass-to-charge ratios of their peptide fragments
Genomics and bioinformatics are driving the move towards personalised medicine — tailoring medical treatment to the individual based on their genetic profile.
| Application | Example |
|---|---|
| Pharmacogenomics | Testing for CYP2D6 variants to determine optimal drug dosage |
| Cancer genomics | Sequencing tumour DNA to identify driver mutations and select targeted therapies (e.g. trastuzumab for HER2+ breast cancer) |
| Risk prediction | Identifying BRCA1/BRCA2 mutations to assess breast/ovarian cancer risk |
| Rare disease diagnosis | Whole exome/genome sequencing to identify causative mutations in undiagnosed patients |
| Topic | Key Points |
|---|---|
| Genomics | Study of entire genomes: structural, functional, comparative |
| DNA sequencing | Sanger (chain termination) and NGS (massively parallel) |
| HGP | 3.2 billion bp, ~20,000 genes, 1.5% coding |
| Bioinformatics | Computer analysis of biological data (BLAST, GenBank, alignment) |
| Comparative genomics | Genome comparison reveals evolution and conserved functions |
| Transcriptomics | Microarrays and RNA-Seq measure gene expression |
| Proteomics | Study of all proteins; 2D gels and mass spectrometry |
| Personalised medicine | Genetic information guides treatment decisions |
Exam Tip: Bioinformatics and genomics questions may ask you to explain how sequence comparison can reveal the function of a gene, or how transcriptomics can be used to identify genes involved in disease. Always link your answer to specific techniques (BLAST, microarrays, RNA-Seq) and explain why computational analysis is essential for handling the vast amount of data generated.
This material sits in Edexcel 9BI0 Topic 8 (Grey Matter — Coordination, Response and Gene Technology), which expects candidates to describe a genome as the complete set of genetic material of an organism (for humans, ~3 × 10⁹ base pairs distributed across 22 autosome pairs, the sex chromosomes and the small mitochondrial DNA), to outline the principles of DNA sequencing (chain-termination Sanger and massively parallel next-generation sequencing), and to describe bioinformatics as the computational discipline that turns sequence data into biological insight by alignment, variant calling, assembly and annotation. Synoptic links run backwards to lesson 1 on gene structure and the genetic code (the genome is the substrate that lessons 1–4 dissected — exons, introns, regulatory elements and non-coding sequence are all features that genomic annotation must recover); to lesson 6 on PCR and gel electrophoresis (PCR is integral to short-read library preparation and to allele-specific assays that use the reference genome as their coordinate system); to lesson 7 on recombinant DNA (the cDNA, restriction-mapping and ligation logic of recombinant work is the historical engineering predecessor of modern genome assembly); and to lesson 8 on gene therapy and genetic screening (every modern carrier-screening panel and every prenatal-NIPT pipeline rests on a reference-genome assembly and a curated variant database). 
Synoptic links run sideways to Topic 6 (immunity, vaccines and infection) — pathogen-genome surveillance during the COVID-19 pandemic showed real-time sequencing of SARS-CoV-2 isolates identifying B.1.1.7 (Alpha), B.1.617.2 (Delta) and B.1.1.529 (Omicron) lineages within weeks of emergence and informing vaccine and public-health response — and to Topic 4 (biodiversity and evolution), where DNA barcoding (typically the COI gene in animals or rbcL / matK in plants) identifies species from short DNA fragments, and where comparative genomics across phyla furnishes molecular evidence for common ancestry. Refer to the official Pearson Edexcel 9BI0 specification document for exact wording.
Question (8 marks):
(a) Outline the timeline and methodology by which the cost of sequencing a human-sized genome fell from ~$3 billion in the Human Genome Project era to ~$100 today. Identify the two key methodological transitions that drove the cost reduction. (6)
(b) A bioinformatician runs a BLAST search with a 600-bp query sequence and obtains a top hit with 94% identity over 580 bp to a haemoglobin-β gene in Mus musculus. State two conclusions the bioinformatician may legitimately draw, and one conclusion they may not draw without further work. (2)
Solution with mark scheme:
(a) M1 (AO1) — Sanger sequencing era. The chain-termination method, published by Frederick Sanger in 1977, used dideoxynucleotides (ddNTPs) lacking the 3'-OH group to terminate strand synthesis at random positions; capillary electrophoresis separated the resulting fragment ladder by length and a fluorescence detector read the bases. Per-run reads were ~700 bp, throughput was one sample at a time and per-genome cost was prohibitively high.
A1 (AO1) — Human Genome Project. The Human Genome Project (1990–2003) scaled Sanger sequencing across an international consortium, producing a ~99% coverage human reference at total cost on the order of $3 billion over ~13 years. This established the reference assembly that all later short-read pipelines align against.
A1 (AO2) — first transition: short-read NGS. Illumina sequencing-by-synthesis (commercialised from 2007) generates ~150-bp reads in parallel — a single instrument run produces on the order of 10⁹–10¹⁰ reads. The cost per genome dropped to ~$1,000 by 2014 and ~$100 by 2023. The methodological transition is from serial (one Sanger reaction reads one fragment) to massively parallel (millions of clusters imaged simultaneously on a flow cell).
A1 (AO2) — second transition: long-read sequencing. PacBio (single-molecule real-time / SMRT) and Oxford Nanopore (current-disruption-through-pore) sequencing produce reads of ~10–100 kb at higher per-base cost than Illumina but with the ability to span repetitive regions, resolve structural variants and phase haplotypes that short reads cannot. The methodological transition here is from short reads + complex assembly inference to long reads + direct contig formation.
A1 (AO3) — bioinformatic consequence. The cost transition is not a sequencing-only story — alignment (BWA, Bowtie), variant calling (GATK), de novo assembly (Canu, hifiasm) and annotation (Ensembl, RefSeq) are required to turn raw reads into biological signal. Without bioinformatics, an Illumina run produces only an enormous unstructured FASTQ archive.
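To make the FASTQ point concrete: a FASTQ file stores each read as four lines (identifier, sequence, separator, per-base quality string). A minimal parser, with invented read data:

```python
# Minimal sketch of the FASTQ format: each read is four lines
# (header, sequence, '+', per-base quality). Real pipelines hand this
# data to aligners such as BWA; this only shows the raw structure.

def parse_fastq(lines):
    """Yield (read_id, sequence, quality) triples from FASTQ lines."""
    it = iter(lines)
    for header in it:
        seq, _plus, qual = next(it), next(it), next(it)
        yield header.strip().lstrip("@"), seq.strip(), qual.strip()

example = ["@read1", "ATGCCTA", "+", "IIIIHHG",
           "@read2", "GGCATTA", "+", "IIHHGGF"]
for read_id, seq, qual in parse_fastq(example):
    print(read_id, seq, len(qual))
```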
A1 (AO3) — rate-of-cost-decline framing. The cost-per-genome curve fell faster than Moore's Law from ~2008 onwards because improvements in flow-cell imaging hardware and sequencing chemistry compounded with each instrument generation. The underlying shift was from incremental optimisation of Sanger chemistry to a fundamentally different physical workflow.
(b) M1 (AO3) — legitimate conclusions. The high identity (94% over 580 bp) implies (i) that the query is homologous to Mus musculus haemoglobin-β, likely sharing its conserved haem-binding function, and (ii) that the query is probably mammalian in origin (if a tighter species-level identification is needed, multiple top hits across mammals would refine the inference).
A1 (AO3) — illegitimate conclusion. The bioinformatician may not conclude that the query is haemoglobin-β in their target species without reciprocal best BLAST, synteny checks, or alignment against a non-redundant database — high BLAST identity can also reflect paralogous genes (haemoglobin-α, myoglobin), and a 600-bp window is too short to distinguish ortholog from paralog with confidence in a multigene family.
Total: 8 marks (M2 A6).
Question (6 marks): A public-health laboratory receives respiratory-virus samples during a winter surge and uses next-generation sequencing to characterise circulating SARS-CoV-2 lineages. (i) Describe how short-read NGS data are converted into a lineage call using a bioinformatic pipeline. (ii) Explain why long-read sequencing is sometimes used in parallel, and (iii) discuss two synoptic links between this workflow and other parts of the A-Level Biology specification.
Mark scheme decomposition by AO:
| Mark | AO | Earned by |
|---|---|---|
| 1 | AO1.1 | Naming the pipeline stages: DNA / RNA extraction → reverse transcription → library preparation → sequencing on Illumina (or equivalent) → FASTQ output |
| 2 | AO1.2 | Naming the bioinformatic stages: read alignment to reference (BWA / minimap2) → variant calling → lineage assignment (e.g. against the SARS-CoV-2 reference) |
| 3 | AO2.1 | Explaining that the reference-aligned variants are compared against curated lineage-defining mutation panels to assign a designation; novel variants flagged for inspection |
| 4 | AO2.7 | Explaining that long-read sequencing (PacBio, Nanopore) resolves structural variation and haplotype phasing that short reads cannot, and is useful for novel lineages whose insertion / deletion patterns are not yet captured by the reference |
| 5 | AO3.1 | Synoptic — connecting to Topic 6 (immunity, infection, vaccines) — pathogen-genome surveillance informs vaccine updates, public-health response and outbreak source-tracing |
| 6 | AO3.2 | Synoptic — connecting to lesson 6 (PCR), which underpins library preparation, and to lesson 1 (gene structure), which gives the molecular vocabulary (codons, ORFs, regulatory regions) that lineage-defining mutations are described in |
Total: 6 marks (AO1 = 2, AO2 = 2, AO3 = 2). Edexcel reliably tests genomics through scenario prompts ("a public-health laboratory…", "a clinical-genetics service…") that demand a paired molecular + computational answer. Candidates who answer only the molecular half (extraction, library prep, sequencing) without the bioinformatic half (alignment, variant calling, lineage assignment) lose the AO2 marks; candidates who omit the synoptic link to Topic 6 immunity or to lessons 1 / 6 lose the AO3 marks. A* candidates name both Illumina and a long-read platform, distinguish their use cases, and integrate the pathogen-genome story into the wider biology of infection and immunity.
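The lineage-assignment step in the mark scheme can be sketched as matching a sample's called variants against lineage-defining mutation panels. The panels and sample below are invented for illustration; real SARS-CoV-2 tools use curated lineage designations.

```python
# Toy sketch of lineage assignment: compare a sample's called variants
# against lineage-defining mutation panels and report the best match.
# Lineage names and mutation panels are invented.

def assign_lineage(sample_variants, lineage_panels):
    """Return (best lineage, fraction of its defining mutations present)."""
    def score(name):
        panel = lineage_panels[name]
        return len(panel & sample_variants) / len(panel)
    best = max(lineage_panels, key=score)
    return best, score(best)

panels = {"lineage_X": {"S:N501Y", "S:D614G", "ORF1a:T1001I"},
          "lineage_Y": {"S:L452R", "S:D614G", "S:P681R"}}
sample = {"S:N501Y", "S:D614G", "ORF1a:T1001I", "N:R203K"}
print(assign_lineage(sample, panels))  # ('lineage_X', 1.0)
```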
Lesson 1 (gene structure and the genetic code). Genome annotation is the act of recovering the structures lesson 1 dissected — exons, introns, promoters, untranslated regions, non-coding RNA loci — from raw sequence. Roughly 1.5% of the human genome encodes protein; the remainder includes regulatory elements, ncRNA loci, transposable-element fragments and repetitive sequence. The "junk DNA" label is largely abandoned: much non-coding sequence has documented regulatory or structural function, and the term was always premature.
Lesson 6 (PCR and gel electrophoresis). PCR is integral to short-read library preparation (cluster amplification on Illumina flow cells; bridge amplification) and to allele-specific assays that probe the reference genome at known coordinates. Long-read platforms (PacBio HiFi, Nanopore) avoid PCR for some library preps to reduce amplification bias. The continuity is direct: PCR is the workhorse of every molecular-genetics laboratory the genomic era runs on.
Lesson 7 (recombinant DNA). The cDNA, restriction-enzyme, ligase and vector toolkit of lesson 7 is the historical engineering predecessor of modern genome work. cDNA libraries were the original method for cloning and sequencing expressed transcripts; shotgun cloning into bacterial vectors was the workflow Sanger sequencing scaled across in the Human Genome Project. The conceptual continuity is that today's sequencing technologies are massively parallelised industrial implementations of the recombinant-DNA logic of the 1980s.
Lesson 8 (gene therapy and genetic screening). Population-scale expanded carrier screening, non-invasive prenatal testing (NIPT) and expanded newborn screening all depend on the reference-genome assembly, on curated variant databases (ClinVar, gnomAD) and on bioinformatic variant-calling pipelines. The clinical workflow lesson 8 introduced is the deployment layer of the genomic infrastructure lesson 9 builds.
Topic 6 (immunity, infection and vaccines). Pathogen-genome surveillance showed during the COVID-19 pandemic that real-time sequencing of SARS-CoV-2 isolates can identify emerging lineages (Alpha, Delta, Omicron) within weeks, inform vaccine updates, and trace outbreak transmission chains. The same workflow applies to seasonal influenza (annual vaccine-strain selection), tuberculosis outbreak tracing, and antimicrobial-resistance surveillance. Pathogen genomics is now a routine public-health discipline rather than a research curiosity.
Topic 4 (biodiversity, evolution and DNA barcoding). DNA barcoding uses short standardised loci — most commonly the mitochondrial COI gene in animals and the chloroplast rbcL / matK genes in plants — to identify species from environmental or fragmentary samples. The same logic underpins environmental DNA (eDNA) surveys of biodiversity in rivers and seas. Comparative genomics across phyla furnishes molecular evidence for common ancestry: conserved core machinery (ribosomal RNA, tubulins, histones) recovers the same phylogenetic relationships morphology indicated, while protein-coding sequence divergence quantifies how long lineages have been separate.
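Barcode identification reduces, in sketch form, to finding the reference sequence with the highest identity to the query fragment. The reference library below is invented and far shorter than a real COI barcode (~650 bp).

```python
# Toy sketch of barcode identification: compare a query fragment to a
# small reference library and return the closest match by identity.
# Reference sequences are invented, not real COI barcodes.

def identity(a, b):
    """Fraction of positions that match (assumes equal-length sequences)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def identify(query, reference_library):
    best = max(reference_library,
               key=lambda sp: identity(query, reference_library[sp]))
    return best, identity(query, reference_library[best])

library = {"species_A": "ATGCGTACCT",
           "species_B": "ATGAATACGT"}
print(identify("ATGCGTACGT", library))  # ('species_A', 0.9)
```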
Hardy-Weinberg at genome scale. Population-level allele-frequency datasets like gnomAD (~800,000 exomes / genomes) operationalise Hardy-Weinberg expectations across millions of variants simultaneously. Variants whose observed homozygous frequency is far below Hardy-Weinberg expectation are flagged as likely deleterious; this is the population-genetic signal that filters candidate disease genes from background variation in clinical sequencing.
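The Hardy-Weinberg filter described above can be sketched directly: for a variant with allele frequency q, the expected homozygote count in N individuals is q²N, and a large shortfall of observed homozygotes flags the variant as likely deleterious. The threshold and numbers below are invented.

```python
# Sketch of the genome-scale Hardy-Weinberg filter: given allele
# frequency q, the expected homozygote count is q^2 * N; variants with
# far fewer observed homozygotes than expected are flagged.

def homozygote_deficit(allele_freq, observed_hom, n_individuals, threshold=0.1):
    """Flag a variant if observed homozygotes are well below q^2 * N."""
    expected_hom = allele_freq ** 2 * n_individuals
    return observed_hom < threshold * expected_hom, expected_hom

# q = 0.01 across 800,000 people -> ~80 homozygotes expected under H-W.
flagged, expected = homozygote_deficit(0.01, observed_hom=1, n_individuals=800_000)
print(flagged, round(expected))  # True 80
```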
Personalised medicine — pharmacogenomics and oncology. Pharmacogenomics (CYP2D6 variants for codeine and tamoxifen; TPMT variants for thiopurines; HLA-B*57:01 for abacavir) identifies patients at risk of severe adverse drug reactions before prescribing. Cancer genomics sequences tumour DNA to identify driver mutations (BRAF V600E in melanoma; HER2 amplification in breast cancer; KRAS in colorectal cancer) and matches them to targeted therapies. Both rest on the reference-genome and variant-database infrastructure of the genomic era.