Introduction to Genome Assembly


  • Genome assembly reconstructs complete genome sequences from fragmented DNA reads.
  • De novo assembly builds genomes without a reference, while reference-guided assembly uses existing genomes.
  • Sequencing technologies like Illumina, PacBio HiFi, and Oxford Nanopore offer different read lengths and error rates.
  • Challenges include repetitive elements, heterozygosity, and sequencing errors that must be corrected.
  • Many tools are available for data QC, assembly, post-processing, and evaluation; the choice depends on data type and research goals.

Assembly Strategies


  • Genome assembly strategy depends on read type, genome complexity, and computational resources, with PacBio HiFi, ONT, and hybrid approaches offering different advantages in accuracy, cost, and contiguity.
  • Assembly evaluation is critical for assessing completeness and accuracy, using tools like BUSCO for gene completeness, QUAST for structural integrity, and Merqury for k-mer-based validation.
  • Scaffolding methods like Bionano OGM and Hi-C improve genome organization, resolving large structural variations and ordering contigs into chromosome-level assemblies.
  • A well-assembled genome is essential for downstream applications such as annotation, comparative genomics, and structural variation analysis, with missing or misassembled regions potentially leading to incorrect biological conclusions.

Data Quality Control


  • Data Quality Control: Assessing and filtering raw sequencing data is essential for accurate genome assembly.
  • NanoPlot: Visualizes read length distributions, quality scores, and other metrics to evaluate sequencing data quality.
  • Filtlong: Filters long-read sequencing data based on length and quality to retain high-quality reads.
  • GenomeScope: Profiles genomes using k-mer frequency distributions to estimate genome size, heterozygosity, and repeat content.
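
The k-mer profiling idea behind genome size estimation can be sketched in a few lines. This is a deliberately simplified illustration, not GenomeScope's actual model (which fits a mixture model to the k-mer spectrum): it counts k-mers, builds the depth histogram, takes the most common depth as the coverage peak, and divides the total k-mer count by that peak. The function names and the min_depth cutoff are assumptions made for this sketch.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer in an iterable of read sequences.

    Simplification: k-mers are not canonicalised (a k-mer and its reverse
    complement are counted separately), unlike most real k-mer counters.
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def kmer_histogram(counts):
    """Map depth -> number of distinct k-mers observed at that depth."""
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Rough genome size: total k-mers divided by the peak depth.

    Depths below min_depth (a hypothetical cutoff) are ignored because
    low-depth k-mers mostly come from sequencing errors.
    """
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # Peak = depth shared by the largest number of distinct k-mers.
    peak_depth = max(solid, key=lambda d: (solid[d], d))
    total_kmers = sum(d * n for d, n in solid.items())
    return total_kmers / peak_depth, peak_depth
```

With real data the histogram would normally come from a dedicated k-mer counter rather than pure Python, and heterozygous genomes show an additional peak at roughly half the main depth, which this sketch does not model.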

PacBio HiFi Assembly using HiFiasm


  • HiFiasm is a specialized assembler for PacBio HiFi reads, providing high-quality, haplotype-resolved genome assemblies.
  • It leverages the high accuracy of HiFi reads to generate phased assembly graphs, preserving haplotype information.
  • HiFiasm is optimized for resolving complex regions and distinguishing haplotypes in diploid or polyploid organisms.
  • The assembler generates primary contigs and haplotype-resolved contigs, offering valuable information for downstream analyses.
  • By adjusting purging levels and using parental k-mer profiles, users can improve haplotype resolution and assembly quality.
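
HiFiasm writes its contigs as assembly graphs in GFA format, so a common first step is to extract the sequences into FASTA. The sketch below shows one way to do that by reading the segment ("S") lines of a GFA file. The function names and the example contig name are hypothetical, and the code assumes the standard GFA layout in which the second and third tab-separated fields of an S line hold the segment name and sequence.

```python
def gfa_segments(gfa_text):
    """Yield (name, sequence) for each segment line in GFA text."""
    for line in gfa_text.splitlines():
        if not line.startswith("S\t"):
            continue  # skip headers, links, and read-alignment lines
        fields = line.split("\t")
        name, seq = fields[1], fields[2]
        if seq == "*":
            continue  # sequence not stored in this GFA record
        yield name, seq


def gfa_to_fasta(gfa_text, width=60):
    """Convert GFA segment lines to FASTA text, wrapped at `width` bases."""
    lines = []
    for name, seq in gfa_segments(gfa_text):
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + ("\n" if lines else "")


if __name__ == "__main__":
    # Tiny made-up GFA; a real file would be read from disk instead.
    example = "H\tVN:Z:1.0\nS\tcontig_1\tACGTACGT\tLN:i:8\n"
    print(gfa_to_fasta(example), end="")
```

The same conversion can be applied separately to the primary and haplotype-resolved graphs, so that each set of contigs can be evaluated on its own.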

Oxford Nanopore Assembly using Flye


  • ONT provides long reads, but with higher error rates than PacBio HiFi or Illumina data.
  • Flye is a long-read assembler optimized for handling ONT data and producing highly contiguous assemblies.
  • The Flye assembly workflow involves read preprocessing, repeat graph construction, graph resolution, polishing, and post-processing.
  • Flye output includes the final assembly sequence, assembly graph, and summary information for evaluation.
  • Polishing the assembly can improve base-level accuracy and overall assembly quality.
  • Flye provides built-in polishing capabilities, and other tools like Racon, Nanopolish, and Medaka can be used for further refinement.
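
A Flye run is usually launched from the command line. The sketch below only builds the argument list so that the choices (read mode, thread count, number of built-in polishing iterations) are explicit; it does not run Flye. The helper name, file names, and default values are assumptions for illustration, and the options should be checked against `flye --help` for the installed version.

```python
import shlex

# Read-type options accepted by Flye; exactly one is passed per run.
FLYE_READ_MODES = {
    "--nano-raw", "--nano-corr", "--nano-hq",
    "--pacbio-raw", "--pacbio-corr", "--pacbio-hifi",
}


def flye_command(reads, out_dir, read_mode="--nano-raw", threads=8,
                 polish_iterations=1, genome_size=None):
    """Build (but do not execute) a Flye command as an argument list."""
    if read_mode not in FLYE_READ_MODES:
        raise ValueError(f"unknown Flye read mode: {read_mode}")
    if not reads:
        raise ValueError("at least one read file is required")
    cmd = ["flye", read_mode, *reads,
           "--out-dir", out_dir,
           "--threads", str(threads),
           "--iterations", str(polish_iterations)]  # built-in polishing rounds
    if genome_size is not None:
        cmd += ["--genome-size", str(genome_size)]
    return cmd


if __name__ == "__main__":
    # Hypothetical input and output names.
    cmd = flye_command(["ont_reads.fastq.gz"], "flye_out", threads=16)
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Returning a list rather than a single string keeps file names with spaces safe when the command is later passed to `subprocess.run`.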

Hybrid Long Read Assembly (optional)


  • Hybrid assembly with Flye combines ONT and PacBio reads to leverage long-read continuity and high-accuracy sequencing, with separate polishing steps to refine base-level errors.
  • Assembly quality assessment using QUAST and Compleasm provides critical insights into contiguity, completeness, and potential misassemblies, ensuring reliability before scaffolding.
  • Bionano Optical Genome Mapping (OGM) improves hybrid assemblies by scaffolding contigs, resolving misassemblies, and enhancing genome continuity, leading to chromosome-scale assemblies.
  • Final scaffolding validation and quality assessment ensure the integrity of the genome assembly, with QUAST and Compleasm used to confirm improvements after Bionano integration.

Scaffolding using Optical Genome Mapping


  • Bionano optical genome mapping (OGM) provides long-range structural information for scaffolding genome assemblies.
  • Bionano Solve hybrid scaffolding integrates optical maps with sequence assemblies to improve contiguity and accuracy.
  • The Bionano Solve pipeline involves in silico map generation, conflict resolution, hybrid scaffolding, and final alignment.
  • The output of Bionano Solve includes scaffolded genome assemblies in AGP and FASTA formats, alignment results, and conflict resolution information.
  • Quality assessment of hybrid scaffolds involves evaluating alignment statistics, conflict resolution, scaffold N50 values, and completeness of the assembly.
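
Because the scaffolds are described in AGP format, basic scaffold statistics can be read directly from that file. The following is a minimal sketch assuming the standard tab-separated AGP layout (object, start, end, part number, component type, then either component details or gap length); the function name and the example records are made up for illustration.

```python
def summarize_agp(agp_text):
    """Return per-scaffold length, contig count, gap count, and gap bases."""
    scaffolds = {}
    for line in agp_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and AGP header comments
        fields = line.split("\t")
        obj, obj_end, comp_type = fields[0], int(fields[2]), fields[4]
        stats = scaffolds.setdefault(
            obj, {"length": 0, "contigs": 0, "gaps": 0, "gap_bases": 0})
        # Scaffold length is the largest end coordinate seen for the object.
        stats["length"] = max(stats["length"], obj_end)
        if comp_type in ("N", "U"):  # gap of known or unknown size
            stats["gaps"] += 1
            stats["gap_bases"] += int(fields[5])
        else:  # any sequence component, e.g. "W" for a WGS contig
            stats["contigs"] += 1
    return scaffolds


if __name__ == "__main__":
    example = (
        "scaffold_1\t1\t1000\t1\tW\tcontig_1\t1\t1000\t+\n"
        "scaffold_1\t1001\t1100\t2\tN\t100\tscaffold\tyes\tmap\n"
        "scaffold_1\t1101\t1600\t3\tW\tcontig_2\t1\t500\t-\n"
    )
    for name, stats in summarize_agp(example).items():
        print(name, stats)
```

The resulting scaffold lengths can be used to compute scaffold N50, and the gap counts give a quick view of how many contigs were joined in each scaffold.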

Assembly Assessment


  • QUAST evaluates assembly contiguity and quality metrics.
  • Compleasm assesses gene-space completeness in genome assemblies.
  • Merqury provides reference-free evaluation based on k-mer analysis.
  • Bandage visualizes assembly graphs for structural assessment.
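
Two of the numbers reported by these tools are simple enough to compute by hand, which helps when interpreting them. The sketch below computes N50 and L50 from a list of contig lengths, and a consensus quality value (QV) from k-mer counts following the published Merqury formulation: the fraction of assembly k-mers also found in the reads is converted to a per-base accuracy, then to a Phred-scaled score. Function and parameter names are assumptions made for this sketch.

```python
import math


def nx(lengths, fraction=0.5):
    """Return (Nx, Lx): with fraction=0.5 these are N50 and L50.

    Nx is the length of the shortest contig in the smallest set of longest
    contigs that together cover at least `fraction` of the assembly;
    Lx is the number of contigs in that set.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    threshold = sum(ordered) * fraction
    running = 0
    for rank, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, rank


def kmer_qv(asm_only_kmers, total_asm_kmers, k):
    """Phred-scaled consensus quality from k-mer counts.

    asm_only_kmers: assembly k-mers absent from the read set (likely errors)
    total_asm_kmers: all k-mers in the assembly
    """
    if total_asm_kmers <= 0:
        raise ValueError("total_asm_kmers must be positive")
    if asm_only_kmers == 0:
        return math.inf  # no unsupported k-mers observed
    shared_fraction = 1 - asm_only_kmers / total_asm_kmers
    per_base_accuracy = shared_fraction ** (1 / k)
    return -10 * math.log10(1 - per_base_accuracy)


if __name__ == "__main__":
    n50, l50 = nx([100, 80, 60, 40, 20])
    print(f"N50={n50} L50={l50}")
```

N50 describes contiguity only; a misjoined assembly can have a high N50, which is why it is read alongside completeness and k-mer-based accuracy.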