Introduction to Genome assembly

Last updated on 2025-02-14 | Edit this page

Overview

Questions

  • What is genome assembly, and why is it important?
  • What sequencing technologies can be used for genome assembly?
  • What are de novo and reference-guided assemblies?
  • What challenges arise when generating high-quality assemblies?
  • What software tools are used for assembling genomes?

Objectives

  • Learn key concepts and terminology related to genome assembly.
  • Understand datasets and tools used for genome assembly.
  • Describe major sequencing technologies and their impact on assembly.
  • Identify common computational challenges in genome assembly.
  • Explain fundamental strategies and algorithms used in genome assembly.

What is Genome Assembly?


Genome assembly is the process of reconstructing a complete genome sequence by arranging fragmented DNA sequences (reads) into a continuous sequence.

  • Goal is to achieve a high-quality reference genome that accurately represents the structure and sequence of an organism’s DNA.
  • It enables deeper understanding of genes, their function, and evolutionary history.
  • Vital for studying complex traits, species diversity, and disease mechanisms.
Genome Assembly
Genome Assembly

Imporance of Genome Assembly


  • Applications in medicine, agriculture, conservation, and biotechnology:
    • Human genetics: Assembling genomes to identify disease-causing mutations.
    • Crop improvement: Identifying beneficial traits in plant genomes.
    • Conservation biology: Sequencing endangered species to understand genetic diversity.
  • Examples of major genome sequencing projects (Human Genome Project, Vertebrate Genome Project).
Number of genomes in NCBI
Number of genomes in NCBI

De Novo vs. Reference-Guided Assembly


  • De novo assembly:
    • Used when no reference genome exists.
    • Requires assembling the genome from scratch using computational methods.
    • Example: Assembling a new plant species genome.
  • Reference-guided assembly:
    • Aligns reads to a closely related reference genome.
    • Useful for identifying variations but limited by reference bias.
    • Example: Human genome resequencing for variant detection.
Assembly types

Basic Steps in Genome Assembly


  1. Sequencing: Generating raw reads from a genome.
  2. Preprocessing & Quality Control: Filtering and trimming reads.
  3. Assembly: Aligning and merging overlapping reads into contigs.
  4. Scaffolding: Ordering contigs into larger scaffolds using long-range sequencing data or mapping techniques.
  5. Polishing: Correcting errors using additional data.
  6. Quality Assessment: Evaluating assembly completeness and accuracy.
  7. Downstream Analysis: Annotating genes, identifying variants, and studying genome structure.
Steps Overview
Steps Overview

reference: 10.1016/j.xpro.2022.101506

Data types for Genome Assembly


Illumina

  • Excels at high-throughput, short-read sequencing with high accuracy.

  • Uses Sequencing-by-Synthesis (SBS). DNA is fragmented, adapters are attached, and the fragments are immobilized on a flowcell. A polymerase incorporates fluorescently labeled nucleotides, and a camera captures the emitted signals in real time. Each cycle represents one nucleotide added to the growing DNA strand.

PacBio HiFi

  • provide accurate long reads, balancing throughput and error correction.

  • Uses Single Molecule, Real-Time (SMRT) sequencing. DNA is ligated into circular molecules and loaded onto a chip with zero-mode waveguides (ZMWs). A polymerase synthesizes the complementary strand, incorporating fluorescently labeled nucleotides. The system detects light pulses in real time, capturing multiple passes of the same molecule to generate highly accurate HiFi reads.

Oxford Nanopore

  • enables ultra-long reads but requires more advanced error correction.

  • A single DNA strand is passed through a biological nanopore embedded in a membrane. As nucleotides move through the pore, they cause characteristic disruptions in an electrical current, which are interpreted by machine-learning algorithms to determine the sequence.

Sequencing Instruments
Sequencing Instruments

Challenges in Genome Assembly


  • Repetitive Elements: Identical or similar sequences that occur multiple times in the genome, making it difficult to resolve unique regions.
  • Heterozygosity: Presence of two or more alleles at a given locus, leading to ambiguous read alignments.
  • Polyploidy: Multiple copies of chromosomes, complicating assembly due to similar sequences.
  • Genome Size: Large genomes require more computational resources and specialized algorithms.
  • Error Correction: Addressing sequencing errors and distinguishing true variants from artifacts.
  • Structural Variants: Large-scale rearrangements, duplications, deletions, and inversions that disrupt contiguity.
Challenges in Genome Assembly
Challenges in Genome Assembly

Main programs used for Genome Assembly


  • Data QC:
    • NanoPlot: Visualization of sequencing data quality.
    • FiltLong: Filtering long reads based on quality and length.
  • Assembly:
    • HiFiasm: HiFi assembler for PacBio data.
    • Flye: de novo assembler for long reads.
  • Post-processing:
    • Medaka: Basecaller and consensus polishing for Flye assembly.
    • Bionano Solve: Optical mapping for scaffolding and validation.
  • Evaluation:
    • QUAST: Quality assessment tool for evaluating assemblies.
    • Compleasm (BUSCO alternative): Benchmarking tool for assessing genome completeness.
    • KAT: Kmer-based evaluation of assembly accuracy and completeness.
Programs used in this workshop

Key Points

  • Genome assembly reconstructs complete genome sequences from fragmented DNA reads.
  • De novo assembly builds genomes without a reference, while reference-guided assembly uses existing genomes.
  • Sequencing technologies like Illumina, PacBio HiFi, and Oxford Nanopore offer different read lengths and error rates.
  • Challenges include repetitive elements, heterozygosity, and error correction.
  • Tools many programs are available for data QC, assembly, post-processing, and evaluation - choice depends on data type and research goals.