Introduction to Genome Assembly

Last updated on 2025-07-01

Overview

Genome assembly is the process of reconstructing a complete genome sequence by arranging fragmented DNA sequences (reads) into a continuous sequence.

The goal is to produce a high-quality reference genome that accurately represents the structure and sequence of an organism’s DNA.
A reference genome enables a deeper understanding of genes, their functions, and their evolutionary history.
It is vital for studying complex traits, species diversity, and disease mechanisms.

Genome assembly has applications in medicine, agriculture, conservation, and biotechnology:
- Human genetics: Assembling genomes to identify disease-causing mutations.
- Crop improvement: Identifying beneficial traits in plant genomes.
- Conservation biology: Sequencing endangered species to understand genetic diversity.
Major genome sequencing projects include the Human Genome Project and the Vertebrate Genomes Project.

De novo assembly:
- Used when no reference genome exists.
- Requires assembling the genome from scratch using computational methods.
- Example: Assembling a new plant species genome.
Reference-guided assembly:
- Aligns reads to a closely related reference genome.
- Useful for identifying variations but limited by reference bias.
- Example: Human genome resequencing for variant detection.
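
To make the reference-guided idea concrete, here is a minimal sketch in Python. It is a toy, not how real read mappers work: it assumes a single short read, allows no gaps, and simply slides the read along the reference to find the position with the fewest mismatches. The function name `best_placement` and the example sequences are made up for illustration.

```python
def best_placement(reference: str, read: str) -> tuple[int, list[tuple[int, str, str]]]:
    """Slide the read along the reference (no gaps allowed) and keep the
    offset with the fewest mismatches.

    Returns (offset, mismatches), where each mismatch is
    (reference_position, reference_base, read_base).
    """
    best_offset, best_mismatches = -1, None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = [
            (offset + i, ref_base, read_base)
            for i, (ref_base, read_base) in enumerate(zip(window, read))
            if ref_base != read_base
        ]
        # Strict "<" keeps the first of several equally good placements.
        if best_mismatches is None or len(mismatches) < len(best_mismatches):
            best_offset, best_mismatches = offset, mismatches
    if best_mismatches is None:
        raise ValueError("read is longer than the reference")
    return best_offset, best_mismatches


# Hypothetical example data: the read differs from the reference at one base.
reference = "ACGTTAGCCGATAGGCTA"
read = "AGCCGTTAGG"
offset, mismatches = best_placement(reference, read)
print(offset, mismatches)
```

The single mismatch is a candidate variant. The sketch also hints at reference bias: a read from a region that is missing or very different in the reference has no good placement, so that sequence is easily lost.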

Sequencing: Generating raw reads from a genome.
Preprocessing & Quality Control: Filtering and trimming reads.
Assembly: Aligning and merging overlapping reads into contigs.
Scaffolding: Ordering contigs into larger scaffolds using long-range sequencing data or mapping techniques.
Polishing: Correcting errors using additional data.
Quality Assessment: Evaluating assembly completeness and accuracy.
Downstream Analysis: Annotating genes, identifying variants, and studying genome structure.
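
The assembly step can be illustrated with a small greedy overlap-merging sketch in Python. This is a teaching toy under strong assumptions (error-free reads, one strand, no repeats). Real assemblers use overlap or de Bruijn graph methods rather than this simple greedy loop. The names `overlap` and `greedy_assemble` and the reads are illustrative only.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`,
    or 0 if that length is below `min_len`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap
    until no overlap of at least `min_len` bases remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs
        # Join the two sequences, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three hypothetical error-free reads that tile a 14-base sequence.
reads = ["ATGGCGT", "GCGTGCA", "TGCAATCC"]
contigs = greedy_assemble(reads)
print(contigs)
```

With these reads the loop ends with a single contig. With real data, sequencing errors and repeats leave several contigs, which is why scaffolding and polishing follow.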

Illumina

Excels at high-throughput, short-read sequencing with high accuracy.
Uses Sequencing-by-Synthesis (SBS). DNA is fragmented, adapters are attached, and the fragments are immobilized and amplified into clusters on a flowcell. In each cycle, a polymerase incorporates one fluorescently labeled, reversibly terminated nucleotide, and a camera images the emitted signal before the next cycle begins. Each cycle therefore represents one nucleotide added to the growing DNA strand.

PacBio HiFi

Provides accurate long reads, balancing throughput and error correction.
Uses Single Molecule, Real-Time (SMRT) sequencing. Hairpin adapters are ligated to double-stranded DNA to form circular templates, which are loaded onto a chip containing zero-mode waveguides (ZMWs). A polymerase synthesizes the complementary strand, incorporating fluorescently labeled nucleotides, and the system detects the light pulses in real time. Because the template is circular, the polymerase reads the same molecule multiple times, and these passes are combined into a highly accurate HiFi consensus read.
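
Why multiple passes improve accuracy can be shown with a simplified majority vote. The sketch below assumes the passes are already aligned, have equal length, and contain only substitution errors. Actual HiFi consensus calling aligns the passes and uses a statistical model, so treat this purely as an illustration. `majority_consensus` and the example passes are hypothetical.

```python
from collections import Counter


def majority_consensus(passes: list[str]) -> str:
    """Return the most common base at each position across the passes."""
    if not passes:
        raise ValueError("need at least one pass")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("this sketch assumes pre-aligned passes of equal length")
    # zip(*passes) yields one column (all bases at one position) at a time.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))


# Five hypothetical passes over the same molecule, two with a random error.
passes = [
    "ACGTTAGC",
    "ACGATAGC",  # error at position 3
    "ACGTTAGC",
    "ACCTTAGC",  # error at position 2
    "ACGTTAGC",
]
consensus = majority_consensus(passes)
print(consensus)
```

Because random errors rarely hit the same position in most passes, they are outvoted, and the consensus is more accurate than any single pass.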

Oxford Nanopore

Enables ultra-long reads but requires more advanced error correction.
A single DNA strand is passed through a biological nanopore embedded in a membrane. As nucleotides move through the pore, they cause characteristic disruptions in an electrical current, which are interpreted by machine-learning algorithms to determine the sequence.

Repetitive Elements: Identical or similar sequences that occur multiple times in the genome, making it difficult to resolve unique regions.
Heterozygosity: Presence of two or more alleles at a given locus, leading to ambiguous read alignments.
Polyploidy: More than two complete sets of chromosomes, complicating assembly because the copies have very similar sequences.
Genome Size: Large genomes require more computational resources and specialized algorithms.
Error Correction: Addressing sequencing errors and distinguishing true variants from artifacts.
Structural Variants: Large-scale rearrangements, duplications, deletions, and inversions that disrupt contiguity.
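
The effect of repetitive elements can be seen by counting k-mers (substrings of length k). In the Python sketch below, the toy genome and the helper names `kmer_counts` and `repeated_kmers` are invented for illustration. The genome contains one 8-base repeat that occurs twice.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def repeated_kmers(sequence: str, k: int) -> dict[str, int]:
    """Return only the k-mers that occur more than once."""
    return {kmer: n for kmer, n in kmer_counts(sequence, k).items() if n > 1}


# Toy genome: the 8-base repeat "ACGGATCC" appears twice with different flanks.
genome = "TT" + "ACGGATCC" + "AG" + "ACGGATCC" + "TTG"

short_repeats = repeated_kmers(genome, 5)  # k shorter than the repeat
long_repeats = repeated_kmers(genome, 9)   # k longer than the repeat
print(short_repeats)
print(long_repeats)
```

With k = 5, several k-mers occur twice, so a fragment of that length cannot be placed uniquely. With k = 9, every k-mer reaches into the unique flanking sequence and none is repeated. The same logic explains why reads longer than a repeat help to resolve it.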

Data QC:
- NanoPlot: Visualization of sequencing data quality.
- Filtlong: Filtering long reads based on quality and length.
Assembly:
- hifiasm: Haplotype-resolved de novo assembler for PacBio HiFi reads.
- Flye: de novo assembler for long reads.
Post-processing:
- Medaka: Consensus polishing of assemblies built from Oxford Nanopore reads (e.g. a Flye assembly).
- Bionano Solve: Analysis of optical mapping data for scaffolding and validation.
Evaluation:
- QUAST: Quality assessment tool for evaluating assemblies.
- compleasm (BUSCO alternative): Benchmarking tool for assessing genome completeness.
- KAT: k-mer-based evaluation of assembly accuracy and completeness.
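
One of the contiguity statistics reported by evaluation tools such as QUAST is N50: the contig length at which contigs of that length or longer contain at least half of the assembled bases. The short Python sketch below computes it from a list of contig lengths. The function name `n50` and the example lengths are illustrative and not taken from any tool.

```python
def n50(contig_lengths: list[int]) -> int:
    """Return the N50 of a set of contig lengths.

    Contigs are sorted from longest to shortest and their lengths are
    summed until the running total reaches half of the assembly size.
    The length of the contig that crosses this threshold is the N50.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:  # integer-safe test for "at least half"
            return length
    raise AssertionError("unreachable for non-empty input")


# Hypothetical assembly with seven contigs (total length 300).
lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(lengths))
```

A higher N50 means a more contiguous assembly, but it says nothing about correctness, which is why completeness and k-mer-based checks are used alongside it.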

Genome assembly reconstructs complete genome sequences from fragmented DNA reads.
De novo assembly builds genomes without a reference, while reference-guided assembly uses existing genomes.
Sequencing technologies like Illumina, PacBio HiFi, and Oxford Nanopore offer different read lengths and error rates.
Challenges include repetitive elements, heterozygosity, and error correction.
Many tools are available for data QC, assembly, post-processing, and evaluation; the choice depends on data type and research goals.