Introduction to Genome Assembly
- Genome assembly reconstructs complete genome sequences from fragmented DNA reads.
- De novo assembly builds genomes without a reference, while reference-guided assembly uses existing genomes.
- Sequencing technologies like Illumina, PacBio HiFi, and Oxford Nanopore offer different read lengths and error rates.
- Challenges include repetitive elements, heterozygosity, and sequencing errors that need correction.
- Many tools are available for data QC, assembly, post-processing, and evaluation; the choice depends on data type and research goals.
Assembly Strategies
- Genome assembly strategy depends on read type, genome complexity,
and computational resources, with PacBio HiFi, ONT, and hybrid
approaches offering different advantages in accuracy, cost, and
contiguity.
- Assembly evaluation is critical for assessing completeness and
accuracy, using tools like BUSCO for gene completeness, QUAST
for structural integrity, and Merqury for k-mer-based
validation.
- Scaffolding methods like Bionano OGM and Hi-C improve genome
organization, resolving large structural variations and ordering contigs
into chromosome-level assemblies.
- A well-assembled genome is essential for downstream applications such as annotation, comparative genomics, and structural variation analysis, with missing or misassembled regions potentially leading to incorrect biological conclusions.
Data Quality Control
- Data Quality Control: Assessing and filtering raw sequencing data is essential for accurate genome assembly.
- NanoPlot: Visualizes read length distributions, quality scores, and other metrics to evaluate sequencing data quality.
- Filtlong: Filters long-read sequencing data based on length and quality to retain high-quality reads.
- GenomeScope: Profiles genomes using k-mer frequency distributions to estimate genome size, heterozygosity, and repeat content.
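
The k-mer profiling idea behind GenomeScope can be illustrated with a small sketch. This is not GenomeScope's method (which fits a statistical model to the histogram); it only counts k-mers, builds the multiplicity histogram, and applies the naive "total k-mers divided by peak coverage" size estimate. The function names, the `min_multiplicity` cutoff, and the toy reads are assumptions made for illustration, and reverse complements are ignored for simplicity.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Return {multiplicity: number of distinct k-mers seen that many times}."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return Counter(counts.values())


def estimate_genome_size(spectrum, min_multiplicity=2):
    """Naive estimate: total solid k-mer occurrences / peak multiplicity.

    K-mers below `min_multiplicity` are treated as likely sequencing errors
    (an illustrative cutoff, not a recommended value).
    """
    solid = {m: n for m, n in spectrum.items() if m >= min_multiplicity}
    if not solid:
        raise ValueError("no k-mers at or above the multiplicity cutoff")
    # The peak is the multiplicity shared by the most distinct k-mers.
    peak = max(solid, key=solid.get)
    total_occurrences = sum(m * n for m, n in solid.items())
    return total_occurrences // peak


if __name__ == "__main__":
    # Toy data: a 20 bp "genome" sequenced four times without errors.
    toy_reads = ["ACGTTGCAAGGCTTACGGAT"] * 4
    spectrum = kmer_spectrum(toy_reads, k=5)
    print(dict(spectrum))
    print(estimate_genome_size(spectrum))
```

On real data the histogram has an error tail at low multiplicity plus heterozygous and homozygous peaks, which is why a model-based tool is used rather than a single-peak estimate like this one.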
PacBio HiFi Assembly using HiFiasm
- HiFiasm is a specialized assembler for PacBio HiFi reads, providing high-quality, haplotype-resolved genome assemblies.
- It leverages the high accuracy of HiFi reads to generate phased assembly graphs, preserving haplotype information.
- HiFiasm is optimized for resolving complex regions and distinguishing haplotypes in diploid or polyploid organisms.
- The assembler generates primary contigs and haplotype-resolved contigs, offering valuable information for downstream analyses.
- By adjusting purging levels and using parental k-mer profiles, users can improve haplotype resolution and assembly quality.
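
HiFiasm writes its contigs as assembly graphs in GFA format, so a common first step before evaluation is extracting the sequences to FASTA. The sketch below shows one way to do that by reading the segment (`S`) records; the function name, the line width, and the example contig name are illustrative assumptions, not part of the assembler.

```python
def gfa_to_fasta(gfa_lines, width=60):
    """Yield FASTA lines for every segment (S) record in a GFA file.

    In GFA, an S record is tab-separated: 'S', segment name, sequence,
    followed by optional tags. Other record types (H, L, A, ...) are skipped.
    """
    for line in gfa_lines:
        if not line.startswith("S\t"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, seq = fields[1], fields[2]
        if seq == "*":  # sequence not stored in the graph file
            continue
        yield f">{name}"
        for i in range(0, len(seq), width):
            yield seq[i:i + width]


if __name__ == "__main__":
    # Tiny in-memory example; with a real file, pass an open file handle.
    example = [
        "H\tVN:Z:1.0\n",
        "S\tcontig_1\tACGTACGTAC\tLN:i:10\n",
        "L\tcontig_1\t+\tcontig_1\t+\t0M\n",
    ]
    for fasta_line in gfa_to_fasta(example, width=4):
        print(fasta_line)
```

Streaming the file line by line keeps memory use low, which matters because contig sequences in real graphs can be many megabases long.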
Oxford Nanopore Assembly using Flye
- ONT produces long reads whose per-base error rate is typically higher than that of PacBio HiFi reads.
- Flye is a long-read assembler optimized for handling ONT data and producing highly contiguous assemblies.
- The Flye assembly workflow involves read preprocessing, repeat graph construction, graph resolution, polishing, and post-processing.
- Flye output includes the final assembly sequence, assembly graph, and summary information for evaluation.
- Polishing the assembly can improve base-level accuracy and overall assembly quality.
- Flye provides built-in polishing capabilities, and other tools like Racon, Nanopolish, and Medaka can be used for further refinement.
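
As a rough illustration of how a Flye run with built-in polishing might be launched from a script, the sketch below assembles the command line as a list of arguments. The helper name, the read-type keys, and the file paths are hypothetical; the flags shown (`--nano-raw`, `--nano-hq`, `--pacbio-hifi`, `--out-dir`, `--threads`, `--iterations`) should be checked against the help output of the installed Flye version before use.

```python
# Map of illustrative read-type labels to Flye input flags.
READ_TYPE_FLAGS = {
    "ont_raw": "--nano-raw",
    "ont_hq": "--nano-hq",
    "pacbio_hifi": "--pacbio-hifi",
}


def build_flye_command(reads, out_dir, read_type="ont_raw",
                       threads=8, polishing_iterations=1):
    """Return the argument list for a Flye run (does not execute anything)."""
    if read_type not in READ_TYPE_FLAGS:
        raise ValueError(f"unknown read type: {read_type}")
    return [
        "flye",
        READ_TYPE_FLAGS[read_type], reads,
        "--out-dir", out_dir,
        "--threads", str(threads),
        # Number of built-in polishing rounds.
        "--iterations", str(polishing_iterations),
    ]


if __name__ == "__main__":
    # Hypothetical paths, shown only to illustrate the call.
    cmd = build_flye_command("reads.fastq.gz", "flye_out", threads=16)
    print(" ".join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.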
Hybrid Long Read Assembly (optional)
- Hybrid assembly with Flye combines ONT and PacBio reads to leverage
long-read continuity and high-accuracy sequencing, with separate
polishing steps to correct base-level errors.
- Assembly quality assessment using QUAST and Compleasm provides
critical insights into contiguity, completeness, and potential
misassemblies, ensuring reliability before scaffolding.
- Bionano Optical Genome Mapping (OGM) improves hybrid assemblies by scaffolding contigs, resolving misassemblies, and enhancing genome continuity, leading to chromosome-scale assemblies.
- Final scaffolding validation and quality assessment ensure the integrity of the genome assembly, with QUAST and Compleasm used to confirm improvements after Bionano integration.
Scaffolding using Optical Genome Mapping
- Bionano optical genome mapping (OGM) provides long-range structural information for scaffolding genome assemblies.
- Bionano Solve hybrid scaffolding integrates optical maps with sequence assemblies to improve contiguity and accuracy.
- The Bionano Solve pipeline involves in silico map generation, conflict resolution, hybrid scaffolding, and final alignment.
- The output of Bionano Solve includes scaffolded genome assemblies in AGP and FASTA formats, alignment results, and conflict resolution information.
- Quality assessment of hybrid scaffolds involves evaluating alignment statistics, conflict resolution, scaffold N50 values, and completeness of the assembly.
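
The scaffold N50 mentioned above is the length L such that sequences of length L or longer contain at least half of the total assembled bases. A minimal sketch of the calculation follows; the contig and scaffold lengths are made-up values for illustration only.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    # Hypothetical lengths before and after scaffolding.
    contigs = [40, 30, 20, 10]
    scaffolds = [70, 30]  # e.g. the 40 and 30 contigs joined into one scaffold
    print(n50(contigs), n50(scaffolds))
```

Comparing the value before and after scaffolding gives a quick view of the contiguity gain, although N50 alone says nothing about whether the joins are correct.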
Assembly Assessment
- QUAST evaluates assembly contiguity and quality metrics.
- Compleasm assesses gene-space completeness in genome assemblies.
- Merqury provides reference-free evaluation based on k-mer analysis.
- Bandage visualizes assembly graphs for structural assessment.
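
To make the k-mer-based evaluation more concrete, the sketch below computes a consensus quality value (QV) in the way Merqury describes it: k-mers found in the assembly but absent from the read set are treated as errors, and the per-base error rate is derived from the fraction of assembly k-mers that are supported by reads. The function name and the example counts are assumptions for illustration; real counts would come from a k-mer counting tool.

```python
import math


def kmer_qv(assembly_only_kmers, total_assembly_kmers, k):
    """Estimate a Phred-scaled consensus QV from k-mer counts.

    assembly_only_kmers:  k-mers in the assembly that never occur in the reads
    total_assembly_kmers: all k-mers in the assembly
    """
    if total_assembly_kmers <= 0:
        raise ValueError("total_assembly_kmers must be positive")
    if not 0 <= assembly_only_kmers <= total_assembly_kmers:
        raise ValueError("assembly_only_kmers out of range")
    if assembly_only_kmers == 0:
        return math.inf  # no unsupported k-mers observed
    # Probability that a single base is correct, assuming a k-mer is
    # supported only when all k of its bases are correct.
    supported_fraction = 1 - assembly_only_kmers / total_assembly_kmers
    p_base_correct = supported_fraction ** (1 / k)
    error_rate = 1 - p_base_correct
    return -10 * math.log10(error_rate)


if __name__ == "__main__":
    # Made-up counts: 21 unsupported k-mers out of one million, k = 21.
    print(round(kmer_qv(21, 1_000_000, 21), 2))
```

Because the estimate depends only on the reads and the assembly, it needs no reference genome, but it is only as reliable as the read set used to build the k-mer database.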