Genome Assembly: Hands-on Training: All in One View

Content from Introduction to Genome assembly

Last updated on 2025-07-01

Overview

Questions

What is genome assembly, and why is it important?
What sequencing technologies can be used for genome assembly?
What are de novo and reference-guided assemblies?
What challenges arise when generating high-quality assemblies?
What software tools are used for assembling genomes?

Objectives

Learn key concepts and terminology related to genome assembly.
Understand datasets and tools used for genome assembly.
Describe major sequencing technologies and their impact on assembly.
Identify common computational challenges in genome assembly.
Explain fundamental strategies and algorithms used in genome assembly.

What is Genome Assembly?

Genome assembly is the process of reconstructing a complete genome sequence by arranging fragmented DNA sequences (reads) into a continuous sequence.

The goal is to produce a high-quality reference genome that accurately represents the structure and sequence of an organism’s DNA.
It enables deeper understanding of genes, their function, and evolutionary history.
Vital for studying complex traits, species diversity, and disease mechanisms.

Importance of Genome Assembly

Applications in medicine, agriculture, conservation, and biotechnology:
- Human genetics: Assembling genomes to identify disease-causing mutations.
- Crop improvement: Identifying beneficial traits in plant genomes.
- Conservation biology: Sequencing endangered species to understand genetic diversity.
Examples of major genome sequencing projects include the Human Genome Project and the Vertebrate Genomes Project.

De Novo vs. Reference-Guided Assembly

De novo assembly:
- Used when no reference genome exists.
- Requires assembling the genome from scratch using computational methods.
- Example: Assembling a new plant species genome.
Reference-guided assembly:
- Aligns reads to a closely related reference genome.
- Useful for identifying variations but limited by reference bias.
- Example: Human genome resequencing for variant detection.

Basic Steps in Genome Assembly

Sequencing: Generating raw reads from a genome.
Preprocessing & Quality Control: Filtering and trimming reads.
Assembly: Aligning and merging overlapping reads into contigs.
Scaffolding: Ordering contigs into larger scaffolds using long-range sequencing data or mapping techniques.
Polishing: Correcting errors using additional data.
Quality Assessment: Evaluating assembly completeness and accuracy.
Downstream Analysis: Annotating genes, identifying variants, and studying genome structure.

reference: 10.1016/j.xpro.2022.101506

Data types for Genome Assembly

Illumina

Excels at high-throughput, short-read sequencing with high accuracy.
Uses Sequencing-by-Synthesis (SBS). DNA is fragmented, adapters are attached, and the fragments are immobilized on a flowcell. A polymerase incorporates fluorescently labeled nucleotides, and a camera captures the emitted signals in real time. Each cycle represents one nucleotide added to the growing DNA strand.

PacBio HiFi

Provides accurate long reads, balancing throughput against built-in error correction.
Uses Single Molecule, Real-Time (SMRT) sequencing. DNA is ligated into circular molecules and loaded onto a chip with zero-mode waveguides (ZMWs). A polymerase synthesizes the complementary strand, incorporating fluorescently labeled nucleotides. The system detects light pulses in real time, capturing multiple passes of the same molecule to generate highly accurate HiFi reads.

Oxford Nanopore

Enables ultra-long reads, but the reads require more error correction.
A single DNA strand is passed through a biological nanopore embedded in a membrane. As nucleotides move through the pore, they cause characteristic disruptions in an electrical current, which are interpreted by machine-learning algorithms to determine the sequence.

Challenges in Genome Assembly

Repetitive Elements: Identical or similar sequences that occur multiple times in the genome, making it difficult to resolve unique regions.
Heterozygosity: Presence of two or more alleles at a given locus, leading to ambiguous read alignments.
Polyploidy: Multiple copies of chromosomes, complicating assembly due to similar sequences.
Genome Size: Large genomes require more computational resources and specialized algorithms.
Error Correction: Addressing sequencing errors and distinguishing true variants from artifacts.
Structural Variants: Large-scale rearrangements, duplications, deletions, and inversions that disrupt contiguity.

Main programs used for Genome Assembly

Data QC:
- NanoPlot: Visualization of sequencing data quality.
- Filtlong: Filtering long reads based on quality and length.
Assembly:
- HiFiasm: HiFi assembler for PacBio data.
- Flye: de novo assembler for long reads.
Post-processing:
- Medaka: Consensus polishing of ONT assemblies (used here on the Flye assembly).
- Bionano Solve: Optical mapping for scaffolding and validation.
Evaluation:
- QUAST: Quality assessment tool for evaluating assemblies.
- Compleasm (BUSCO alternative): Benchmarking tool for assessing genome completeness.
- KAT: K-mer-based evaluation of assembly accuracy and completeness.

Key Points

Genome assembly reconstructs complete genome sequences from fragmented DNA reads.
De novo assembly builds genomes without a reference, while reference-guided assembly uses existing genomes.
Sequencing technologies like Illumina, PacBio HiFi, and Oxford Nanopore offer different read lengths and error rates.
Challenges include repetitive elements, heterozygosity, and error correction.
Many programs are available for data QC, assembly, post-processing, and evaluation; the choice depends on data type and research goals.

Content from Assembly Strategies

Last updated on 2025-07-01

Overview

Questions

What factors influence the choice of genome assembly strategy?
How do different assembly methods compare in terms of read length, accuracy, and computational requirements?
What are the key steps in evaluating genome assemblies using BUSCO and QUAST?
How do Bionano OGM and Hi-C sequencing improve genome continuity and organization?

Objectives

Understand factors influencing the choice of genome assembly strategy.
Compare different assembly methods based on read length, accuracy, and computational requirements.
Learn how to evaluate genome assemblies using BUSCO and QUAST.
Explore the role of Bionano OGM and Hi-C sequencing in improving genome continuity.

Assembly Strategies

Genome assembly involves choosing the right approach based on sequencing technology, read type, genome complexity, and research objectives. This chapter introduces key factors influencing assembly strategy selection, from read length and coverage requirements to computational trade-offs. We will explore different methods—PacBio HiFi with HiFiasm, ONT with Flye, and hybrid assemblies—along with scaffolding techniques like Bionano Optical Genome Mapping (OGM) and Hi-C, which help improve genome continuity and organization. Finally, we’ll discuss assembly evaluation tools such as BUSCO and QUAST to assess the completeness and quality of assembled genomes.

Factors Influencing the Choice of Strategy

Read length: Affects the ability to resolve repeats; short reads struggle with complex regions, while long reads improve contiguity.
Coverage depth: Crucial for assembly accuracy, with HiFi requiring 20-30x, ONT needing 50-100x, and hybrid assemblies depending on short-read support for polishing.
Genome complexity: Includes repeat content, heterozygosity, and polyploidy, which influence assembly success and determine whether specialized tools or additional scaffolding methods are needed.
Computational resources: Vary across assembly strategies, with HiFiasm being RAM-intensive, Flye being more lightweight, and hybrid assemblies requiring additional processing for polishing.
Sequencing budget: Plays a role, as HiFi sequencing is costlier but highly accurate, ONT is cheaper but requires more data for error correction, and hybrid approaches balance cost and quality.
Downstream analyses: Structural variation detection, gene annotation, or chromosome-level assemblies influence the choice of assembler and the need for scaffolding methods like Hi-C or Bionano OGM.
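
The coverage figures above translate directly into a sequencing yield target: required bases = genome size x target depth. The following is a minimal shell sketch of that arithmetic. The function name `bases_for_coverage` is hypothetical, and the 135 Mb genome size is the A. thaliana value used for the hands-on data in this lesson.

```shell
# Hypothetical helper: bases needed to reach a target depth of coverage.
# Usage: bases_for_coverage <genome_size_bp> <coverage>
bases_for_coverage() {
    awk -v g="$1" -v c="$2" 'BEGIN { printf "%.0f\n", g * c }'
}

genome_size=135000000   # ~135 Mb (A. thaliana), assumed for illustration

# Print the yield needed for the depth ranges quoted for HiFi and ONT
for depth in 20 30 50 100; do
    printf '%sx\t%s bases\n' "$depth" "$(bases_for_coverage "$genome_size" "$depth")"
done
```

Running the same calculation before ordering sequencing helps check whether a planned run is likely to reach the depth an assembler needs.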

Comparative Assembly Strategies

Factor	PacBio HiFi	ONT	Hybrid (ONT + Illumina)
Read length	~15-20kb	10-100kb+	Mix of short & long
Read accuracy	High (~99%)	Moderate (~90%)	High (after polishing)
Coverage needed	20-30x	50-100x	ONT: 50x + Illumina: 30x
Cost per Gb	Expensive	Lower	Medium
Error profile	Random errors, low indels	Higher error rate, systematic errors	ONT errors corrected by Illumina
Computational requirements	High RAM required	Moderate RAM required	Moderate
Best for	High-accuracy assemblies	Ultra-long contigs	Combines advantages of both
Repeat resolution	Good	Very good	Very good
Scaffolding needed	Rarely needed	May be needed	Sometimes needed
Polishing required	Not required	Required (Racon + Medaka)	Required (Pilon)
Structural variant detection	Good	Excellent	Good
Haplotype phasing	Excellent	Good	Moderate
Genome size suitability	Suitable for large and small genomes	Best for large genomes	Best for complex genomes
Downstream applications	Reference-quality genome assembly, annotation	Structural variation analysis, de novo assembly	Genome correction, variant calling, scaffolding

Contig vs. Scaffold vs. Chromosome-Level Assembly

Genome assemblies progress through different levels of completeness and organization:

Contig-level assembly: The raw output of assemblers, consisting of contiguous sequences without known order or orientation. Longer contigs indicate better assembly continuity.
Scaffold-level assembly: Contigs linked together using additional data (e.g., long-range mate-pair reads, optical maps, or Hi-C). Gaps (represented as ’N’s) remain where connections exist but sequence information is missing.
Chromosome-level assembly: The highest-quality assembly, where scaffolds are further ordered and oriented into full chromosomes using genetic maps, Hi-C data, or synteny with a reference genome.

Higher levels of assembly provide better genome context, but require additional scaffolding methods beyond de novo assembly.
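
One practical way to tell contigs from scaffolds in a FASTA file is to look for gaps, that is, runs of N. The sketch below is illustrative only: the helper name `fasta_gap_summary` is hypothetical, and it reports the length and number of N-runs for each record (zero gaps suggests a contig, one or more suggests a scaffold).

```shell
# Hypothetical helper: for each FASTA record, report its length and the
# number of gaps (runs of N/n) it contains.
# Output columns: name, length_bp, gap_count
fasta_gap_summary() {
    awk '
        function report() {
            if (name != "") {
                # gsub returns the number of replacements, i.e. the number of N-runs
                gaps = gsub(/[Nn]+/, "N", seq)
                printf "%s\t%d\t%d\n", name, len, gaps
            }
        }
        /^>/ { report(); name = substr($1, 2); seq = ""; len = 0; next }
        { seq = seq $0; len += length($0) }
        END { report() }
    ' "$@"
}

# Toy input: one gap-free contig and one scaffold with two gaps
printf '>contig_1\nACGTACGT\n>scaffold_1\nACGTNNNNAC\nGTNNACGT\n' | fasta_gap_summary
```

The function also accepts a file name, for example `fasta_gap_summary assembly.fasta`, where `assembly.fasta` is a placeholder.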

Workflow for Various Assemblies

In this workshop, we will use HiFiasm for PacBio HiFi assemblies, Flye for ONT assemblies, and Flye for hybrid ONT + PacBio assemblies, followed by quality assessment using BUSCO/Compleasm and QUAST to evaluate completeness and accuracy.

PacBio HiFi Assembly with HiFiasm

PacBio HiFi reads: Highly accurate long reads with low error rates, suitable for de novo assembly without polishing.
HiFiasm: A specialized assembler for HiFi data, leveraging read accuracy and length to resolve complex regions and produce high-quality contigs.
Workflow: Run HiFiasm with HiFi reads, adjust parameters based on genome size and complexity, and evaluate the assembly using Compleasm and QUAST.
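
HiFiasm writes its assemblies as graphs in GFA format, so evaluation tools that expect FASTA need a conversion step first. The sketch below shows one common way to do this with awk. The helper name `gfa_to_fasta`, the output prefix, and the file names in the comments are assumptions for illustration and are not taken from this lesson.

```shell
# Hypothetical helper: write the segment (S) lines of a GFA file as FASTA.
# In GFA, column 2 of an S line is the segment name and column 3 its sequence.
gfa_to_fasta() {
    awk -F'\t' '$1 == "S" { print ">" $2; print $3 }' "$@"
}

# Typical use (placeholder names; requires hifiasm to be installed):
#   hifiasm -o At_hifi -t 16 At_pacbio-hifi-filtered.fastq
#   gfa_to_fasta At_hifi.bp.p_ctg.gfa > At_hifi.p_ctg.fasta

# Toy GFA with two segments and one link, to show the conversion
printf 'H\tVN:Z:1.0\nS\tptg000001l\tACGTACGT\nS\tptg000002l\tGGCC\nL\tptg000001l\t+\tptg000002l\t+\t0M\n' | gfa_to_fasta
```

Only S lines carry sequence; header (H) and link (L) lines describe the graph and are skipped.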

ONT Assembly with Flye

ONT reads: Ultra-long reads with higher error rates, requiring additional error correction and polishing steps.
Flye: A de novo assembler optimized for long reads, capable of resolving complex repeats and generating high-quality assemblies.
Workflow: Run Flye with ONT reads, adjust parameters based on genome size and complexity, and polish the assembly with Medaka to improve consensus accuracy.
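
The ONT workflow can be sketched as two commands: assemble with Flye, then polish with Medaka. This is a hedged outline, not the exact commands used in the workshop: the output directory names and thread count are hypothetical, `--nano-hq` assumes recent high-accuracy basecalls, and the Medaka model option is left at its default. The `run` wrapper only prints the commands unless `DRY_RUN=0` is set, so the sketch can be read without either tool installed.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reads=At_ont-reads-filtered.fastq   # filtered ONT reads
outdir=flye_ont                     # hypothetical output directory
threads=16                          # hypothetical thread count

# 1. Assemble with Flye; contigs are written to <outdir>/assembly.fasta
run flye --nano-hq "$reads" --out-dir "$outdir" --threads "$threads"

# 2. Polish the Flye contigs with Medaka using the same reads
run medaka_consensus -i "$reads" -d "$outdir/assembly.fasta" -o medaka_out -t "$threads"
```

For older, noisier ONT data, Flye's `--nano-raw` mode would be the more appropriate choice.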

Hybrid (ONT + PacBio) Assembly with Flye

Hybrid assembly: Combines the strengths of both technologies for improved accuracy and contiguity.
Workflow: Run Flye with both ONT and PacBio reads, adjust parameters for hybrid mode, and polish the assembly using ONT or PacBio reads for error correction and consensus polishing.

Assembly Evaluation

Assessing genome assembly quality is crucial before downstream analyses.
The metrics below help determine assembly completeness, accuracy, and contiguity.

1. BUSCO (Benchmarking Universal Single-Copy Orthologs) / Compleasm

Compleasm is a faster alternative to BUSCO for assessing genome completeness based on single-copy orthologs. It evaluates genome completeness by checking for highly conserved, single-copy genes expected to be present in nearly all members of a given lineage. These genes are essential for basic cellular functions, making them reliable markers for assessing genome assembly quality.
We expect these genes to be present in our organism because they are evolutionarily conserved and critical for survival. If many BUSCO genes are missing or fragmented, it suggests gaps, misassemblies, or sequencing errors, which can compromise downstream analyses like gene annotation and functional studies. A high BUSCO completeness score indicates a well-assembled genome with minimal missing data.
The BUSCO reports provide detailed statistics on the number of complete, fragmented, and missing BUSCO genes, as well as the percentage of genome completeness. The output helps identify areas for improvement and guide further optimization steps in the assembly process.
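
To make the reported categories concrete, the sketch below converts gene counts into the percentages that a BUSCO-style summary line shows. The helper name `busco_percent` is hypothetical and the counts are made-up numbers chosen for easy arithmetic, not results from this workshop.

```shell
# Hypothetical helper: turn BUSCO-style gene counts into percentages.
# Complete (C) = single-copy (S) + duplicated (D).
# Usage: busco_percent <single_copy> <duplicated> <fragmented> <missing>
busco_percent() {
    awk -v s="$1" -v d="$2" -v f="$3" -v m="$4" 'BEGIN {
        n = s + d + f + m
        if (n == 0) { print "no genes counted"; exit 1 }
        printf "C:%.1f%%[S:%.1f%%,D:%.1f%%],F:%.1f%%,M:%.1f%%,n:%d\n",
               100 * (s + d) / n, 100 * s / n, 100 * d / n,
               100 * f / n, 100 * m / n, n
    }'
}

# Made-up counts for a lineage set of 100 genes
busco_percent 90 5 3 2
```

A high duplicated fraction can be as informative as a high missing fraction, since it may point to uncollapsed haplotypes in the assembly.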

2. QUAST (Quality Assessment Tool for Genome Assemblies)

Comprehensive Assembly Statistics: QUAST provides detailed metrics beyond N50 and L50, including total number of contigs, GC content, genome size estimation, and misassembly rates, allowing in-depth evaluation of genome continuity and structure.
Reference-Based and Reference-Free Evaluation: It can assess assemblies against a reference genome (identifying misassemblies, inversions, and duplications) or work in reference-free mode, making it useful for de novo assemblies without a known genome sequence.
Structural Error Detection and Gene Feature Analysis: QUAST can run gene-finding tools such as GeneMark and completeness checks with BUSCO, highlights misassemblies based on alignment breaks, and detects gaps, relocations, and translocations, making it particularly useful for validating scaffolding approaches and hybrid assemblies.
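
N50 and L50 are the contiguity metrics most often quoted from a QUAST report. As a sketch of what they mean: sort contigs from longest to shortest and add up lengths until half of the total assembly size is reached; N50 is the length of the contig that crosses that point and L50 is the number of contigs needed. The helper name `n50_l50` is hypothetical.

```shell
# Hypothetical helper: read one contig length per line, print N50 and L50.
n50_l50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            if (NR == 0) { print "no contigs"; exit 1 }
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                if (cum >= total / 2) {
                    printf "N50=%.0f L50=%d total=%.0f\n", len[i], i, total
                    exit
                }
            }
        }'
}

# Toy assembly with five contigs
printf '%s\n' 100 80 60 40 20 | n50_l50
```

Contig lengths can be taken, for example, from the second column of a `samtools faidx` index: `cut -f2 assembly.fasta.fai | n50_l50` (the file name is a placeholder).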

3. Merqury: K-mer based assembly evaluation

Assembly Accuracy Check: Merqury compares k-mers from sequencing reads to the assembled genome, identifying mismatches, missing k-mers, and sequencing errors without requiring a reference genome.
Haplotype Purity and Phasing: It calculates QV (quality value) scores and provides completeness metrics for haplotypes, helping assess whether an assembly accurately represents both parental haplotypes or contains chimeric sequences.
Consensus and Read Support Validation: By analyzing k-mer spectra, Merqury detects underrepresented or overrepresented regions, highlighting assembly errors, collapsed repeats, or sequencing biases that may impact downstream analyses.
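
The QV score can be reproduced by hand from k-mer counts. The sketch below assumes the commonly described k-mer formulation: if K_asm_only is the number of assembly k-mers never seen in the reads and K_total is the number of k-mers in the assembly, the per-base error rate is E = 1 - (1 - K_asm_only / K_total)^(1/k) and QV = -10 * log10(E). The helper name `kmer_qv` and the input numbers are hypothetical.

```shell
# Hypothetical helper: k-mer based consensus quality value.
# Usage: kmer_qv <kmers_only_in_assembly> <total_kmers_in_assembly> <k>
kmer_qv() {
    awk -v a="$1" -v t="$2" -v k="$3" 'BEGIN {
        if (t <= 0 || k <= 0) { print "invalid input"; exit 1 }
        if (a == 0)  { print "inf"; exit }    # no unsupported k-mers observed
        if (a >= t)  { print "0.00"; exit }   # every k-mer is unsupported
        e = 1 - exp(log(1 - a / t) / k)       # estimated per-base error rate
        printf "%.2f\n", -10 * log(e) / log(10)
    }'
}

# Made-up example: 21 unsupported 21-mers out of one million
kmer_qv 21 1000000 21
```

As a rule of thumb, QV 30 corresponds to about one error per 1,000 bases and QV 40 to about one per 10,000.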

Bionano and Hi-C Reads in Genome Assembly

Bionano Optical Genome Mapping (OGM) provides ultra-long, label-based maps of DNA molecules, helping to scaffold contigs, detect misassemblies, and resolve large structural variations. It improves genome continuity by linking fragmented sequences, especially in repeat-rich or complex genomes.

Hi-C sequencing captures chromatin interactions, allowing scaffolding of contigs into chromosome-scale assemblies based on physical proximity in the nucleus. It helps in ordering and orienting scaffolds, identifying misassemblies, and resolving haplotypes, making it essential for generating chromosome-level genome assemblies.

Bionano and HiC for assembly improvement

Key Points

Genome assembly strategy depends on read type, genome complexity, and computational resources, with PacBio HiFi, ONT, and hybrid approaches offering different advantages in accuracy, cost, and contiguity.
Assembly evaluation is critical for assessing completeness and accuracy, using tools like BUSCO for gene completeness, QUAST for structural integrity, and Merqury for k-mer-based validation.
Scaffolding methods like Bionano OGM and Hi-C improve genome organization, resolving large structural variations and ordering contigs into chromosome-level assemblies.
A well-assembled genome is essential for downstream applications such as annotation, comparative genomics, and structural variation analysis, with missing or misassembled regions potentially leading to incorrect biological conclusions.

Content from Data Quality Control

Last updated on 2025-07-01

Overview

Questions

What is data quality checking and filtering?
Why is it necessary to assess the quality of raw sequencing data?
What are the key steps in filtering long-read sequencing data?
How can visualization tools like NanoPlot help in quality assessment?

Objectives

Understand the importance of data quality checking and filtering in genome assembly.
Learn how to assess raw sequencing data quality using NanoPlot.
Gain hands-on experience in filtering long-read sequencing data with Filtlong.
Evaluate the impact of filtering on data quality using NanoPlot.
Explore k-mer analysis for quality assessment (optional).

Data Quality Check and Filtering

What is data quality checking and filtering?
- Assessing the quality of raw sequencing data before genome assembly.
- Identifying and removing low-quality or problematic reads before assembly.
- Ensuring that the data is suitable for downstream analysis.
- Better quality ingredients make a better quality cake!
Why is it necessary?
- Poor quality data can lead to errors in genome assembly.
- Low-quality reads can introduce gaps, misassemblies, or incorrect base calls.
- Filtering out low-quality reads can improve the accuracy and efficiency of assembly.
- It is a critical step to ensure the success of downstream analyses.
What will we do today?
- Learn the process of long-read generation (PacBio HiFi and ONT) - a brief overview.
- Assess raw data quality using NanoPlot for PacBio HiFi and ONT reads - hands-on.
- Filter reads based on quality and length using Filtlong - hands-on.
- Re-evaluate data post-filtering using NanoPlot to confirm improvements - hands-on.
- K-mer analysis to ensure data quality (optional hands-on).

PacBio HiFi reads from Subreads

PacBio HiFi reads can be generated using the Circular Consensus Sequencing (CCS) program from PacBio.

PacBio circular consensus sequencing (CCS) generates HiFi reads by sequencing the same DNA molecule multiple times
The more passes over the same molecule, the higher the consensus accuracy
Two main stages:
- Generate a draft consensus from multiple subreads
- Iteratively polish the consensus using all subreads to refine accuracy

What does CCS do?

Filter low-quality subreads based on length and signal-to-noise ratio
Generate an initial consensus using overlapping subreads
Align all subreads to the draft consensus for refinement
Divide sequence into small overlapping windows to optimize polishing
Identify and correct errors, heteroduplex artifacts, and large insertions
Apply polishing algorithms to refine the sequence, removing ambiguities
Compute read accuracy based on error likelihood
Output the final HiFi read if accuracy meets the threshold
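
The intuition behind consensus calling can be shown with a toy example. The sketch below takes several passes over the same molecule that are already aligned and of equal length, and picks the most frequent base in each column. This is a deliberate simplification: the real CCS program aligns subreads and polishes with a statistical model, and the helper name `majority_consensus` is hypothetical.

```shell
# Toy consensus: majority vote per column over equal-length, pre-aligned passes.
# Ties are resolved in A, C, G, T order; columns with no A/C/G/T become N.
majority_consensus() {
    awk '
        {
            if (NR == 1) width = length($0)
            for (i = 1; i <= width; i++) count[i, substr($0, i, 1)]++
        }
        END {
            n = split("A C G T", bases, " ")
            for (i = 1; i <= width; i++) {
                best = "N"; top = 0
                for (b = 1; b <= n; b++)
                    if (count[i, bases[b]] > top) { top = count[i, bases[b]]; best = bases[b] }
                consensus = consensus best
            }
            print consensus
        }'
}

# Five passes; two of them carry a single random error at different positions
printf '%s\n' ACGTTGCA ACGATGCA ACGTTGCA AGGTTGCA ACGTTGCA | majority_consensus
```

Because random errors rarely hit the same position in most passes, each additional pass makes it less likely that an error survives into the consensus.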

Why is this important?

Produces highly accurate long reads (99 percent or higher)
Enables better genome assembly, especially in complex or repetitive regions
Reduces the need for additional error correction

How was the data generated?

Subreads (in bam format) were converted to ccs fastq as follows:

BASH

ccs \
   --hifi-kinetics \
   --num-threads $SLURM_CPUS_ON_NODE \
   input.subreads.bam \
   output.hifi.bam
samtools fastq \
   output.hifi.bam > output.hifi.fastq

What is the source of this data?

The PacBio HiFi reads are from the project PRJEB50694.
Data is for Arabidopsis thaliana ecotype Col-0, sequenced using PacBio HiFi technology (they also sequenced CLR data for this project).
The data is publicly available on the European Nucleotide Archive (ENA), and the 9994.q20.CCS.fastq.gz reads were used for analysis.
Data has been filtered to include only HiFi reads with a Q-score of 20 or higher.
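
The Q20 threshold maps to accuracy through the Phred scale, Q = -10 * log10(P_error), so Q20 means an expected error probability of 1 in 100. A small sketch of the conversion follows; the helper name `phred_to_accuracy` is hypothetical.

```shell
# Convert a Phred quality score to the expected per-base accuracy (percent).
# accuracy = 100 * (1 - 10^(-Q/10))
phred_to_accuracy() {
    awk -v q="$1" 'BEGIN { printf "%.2f\n", 100 * (1 - 10 ^ (-q / 10)) }'
}

for q in 10 20 30 40; do
    printf 'Q%s\t%s%%\n' "$q" "$(phred_to_accuracy "$q")"
done
```

This is why HiFi reads, defined by a minimum of Q20, are described as 99 percent accurate or better.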

ONT Reads from MinION Sequencing

Oxford Nanopore Technologies (ONT) sequencing generates long reads by passing DNA through a biological nanopore

ONT reads are generated in real-time as the DNA strand moves through the nanopore.
Dorado is Oxford Nanopore Technologies’ basecaller that converts raw electrical signals from nanopores into nucleotide sequences using machine learning.

Key steps in Dorado basecalling

Signal detection
- DNA or RNA molecules pass through a nanopore, disrupting an electrical current.
- These disruptions create a unique signal pattern called a squiggle.
Real-time data processing
- The MinKNOW software captures and processes the squiggle into sequencing reads.
- Reads include standard nucleotides and potential base modifications like methylation.
Basecalling with machine learning
- Neural networks, including transformer models, predict the nucleotide sequence from raw signals.
- Models continuously improve by training on diverse sequencing data.
Error correction and refinement
- Dorado refines base predictions to reduce errors, especially in homopolymer regions.
- Models are optimized for accuracy across various DNA/RNA types.
High-speed processing
- Basecalling can be performed during sequencing for real-time analysis or after sequencing for higher accuracy.
- GPUs accelerate computation, enabling rapid basecalling and simultaneous modification detection.

Why this matters

Enables real-time sequencing and analysis for quick decision-making.
Uses advanced machine learning to improve sequence accuracy over time.
Supports epigenetic modification detection without extra processing steps.

How was the data generated?

pod5 ONT reads were basecalled using Dorado as follows:

BASH

# download
wget ftp.sra.ebi.ac.uk/vol1/run/ERR791/ERR7919757/Arabidopsis-pass.tar.gz
# extract
tar -xvf Arabidopsis-pass.tar.gz # fast5_pass is the extracted directory 
# convert to pod5 format
pod5 convert \
   fast5 fast5_pass --output pod5_pass
# download model
dorado download \
   --model "dna_r10.4.1_e8.2_400bps_hac@v3.5.2" \
   --models-directory models_dir/
# basecall
dorado basecaller \
   --emit-fastq \
   --output-dir dorado_output_dir \
   models_dir/dna_r10.4.1_e8.2_400bps_hac@v3.5.2 \
   input_pass.pod5

What is the source of this data?

The ONT reads are from the project PRJEB49840.
Data is for Arabidopsis thaliana ecotype Col-0, sequenced using R10.4/Q20+ chemistry on a MinION flow cell.
The data is publicly available on the European Nucleotide Archive (ENA), and the fast5_pass reads were used for basecalling (with the commands above).

A. NanoPlot for Quality Assessment

NanoPlot is a visualization tool designed for quality assessment of long-read sequencing data. It generates a variety of plots, including read length histograms, cumulative yield plots, violin plots of read length and quality over time, and bivariate plots that compare read lengths, quality scores, reference identity, and mapping quality. By providing both single-variable and density-based visualizations, NanoPlot helps users quickly assess sequencing run quality and detect potential issues. The tool also allows downsampling, length and quality filtering, and barcode-specific analysis for multiplexed experiments.

1. Quality Assessment of PacBio HiFi Reads

Assess the quality of PacBio HiFi reads using NanoPlot. Create a Slurm script to run NanoPlot on the HiFi reads.

BASH

ml --force purge
ml biocontainers
ml nanoplot
NanoPlot \
   --threads ${SLURM_CPUS_ON_NODE} \
   --verbose \
   --outdir nanoplot_pacbio_pre \
   --prefix At_PacBio_ \
   --plots kde \
   --N50 \
   --dpi 300 \
   --fastq At_pacbio-hifi.fastq.gz

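The stdout from the NanoPlot run will look like this (this example log comes from a run that used a different output directory and input file name than the command above):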

2025-02-13 12:13:19,155 NanoPlot 1.44.1 started with arguments Namespace(threads=16, verbose=True, store=False, raw=False, huge=False, outdir='nanoplot_pacbio_output', no_static=False, prefix='At_PacBio_', tsv_stats=False, only_report=False, info_in_report=False, maxlength=None, minlength=None, drop_outliers=False, downsample=None, loglength=False, percentqual=False, alength=False, minqual=None, runtime_until=None, readtype='1D', barcoded=False, no_supplementary=False, color='#4CB391', colormap='Greens', format=['png'], plots=['kde'], legacy=None, listcolors=False, listcolormaps=False, no_N50=False, N50=True, title=None, font_scale=1, dpi=300, hide_stats=False, fastq=['9994.q20.CCS.fastq.gz'], fasta=None, fastq_rich=None, fastq_minimal=None, summary=None, bam=None, ubam=None, cram=None, pickle=None, feather=None, path='nanoplot_pacbio_output/At_PacBio_')
2025-02-13 12:13:19,156 Python version is: 3.9.21 | packaged by conda-forge | (main, Dec  5 2024, 13:51:40)  [GCC 13.3.0]
2025-02-13 12:13:19,186 Nanoget: Starting to collect statistics from plain fastq file.
2025-02-13 12:13:19,187 Nanoget: Decompressing gzipped fastq 9994.q20.CCS.fastq.gz
2025-02-13 12:29:10,170 Reduced DataFrame memory usage from 12.780670166015625Mb to 12.780670166015625Mb
2025-02-13 12:29:10,194 Nanoget: Gathered all metrics of 837586 reads
2025-02-13 12:29:10,538 Calculated statistics
2025-02-13 12:29:10,539 Using sequenced read lengths for plotting.
2025-02-13 12:29:10,556 NanoPlot:  Valid color #4CB391.
2025-02-13 12:29:10,557 NanoPlot:  Valid colormap Greens.
2025-02-13 12:29:10,582 NanoPlot:  Creating length plots for Read length.
2025-02-13 12:29:10,583 NanoPlot: Using 837586 reads with read length N50 of 22587bp and maximum of 57055bp.
2025-02-13 12:29:11,933 Saved nanoplot_pacbio_output/At_PacBio_WeightedHistogramReadlength  as png (or png for --legacy)
2025-02-13 12:29:12,443 Saved nanoplot_pacbio_output/At_PacBio_WeightedLogTransformed_HistogramReadlength  as png (or png for --legacy)
2025-02-13 12:29:12,899 Saved nanoplot_pacbio_output/At_PacBio_Non_weightedHistogramReadlength  as png (or png for --legacy)
2025-02-13 12:29:13,371 Saved nanoplot_pacbio_output/At_PacBio_Non_weightedLogTransformed_HistogramReadlength  as png (or png for --legacy)
2025-02-13 12:29:13,372 NanoPlot: Creating yield by minimal length plot for Read length.
2025-02-13 12:29:14,465 Saved nanoplot_pacbio_output/At_PacBio_Yield_By_Length  as png (or png for --legacy)
2025-02-13 12:29:14,466 Created length plots
2025-02-13 12:29:14,474 NanoPlot: Creating Read lengths vs Average read quality plots using 837586 reads.
2025-02-13 12:29:15,012 Saved nanoplot_pacbio_output/At_PacBio_LengthvsQualityScatterPlot_kde  as png (or png for --legacy)
2025-02-13 12:29:15,013 Created LengthvsQual plot
2025-02-13 12:29:15,013 Writing html report.
2025-02-13 12:29:15,029 Finished!

Evaluate the quality of HiFi reads:

Examine the At_PacBio_NanoPlot-report.html file.

Read length distribution: Histogram of read lengths, showing the distribution of read lengths in the dataset.
Read length vs. Quality: Scatter plot showing the relationship between read length and quality score.
Yield (number of bases) by read length: Plot showing the cumulative yield of reads based on their length.
Log-Transformed histograms: Histogram of read lengths with a log-transformed scale for better visualization.
KDE plots: Kernel Density Estimation plots for read length and quality score distributions.
Summary statistics: N50 value, maximum read length, and other key metrics.
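
The headline numbers in the report can be cross-checked directly from the FASTQ file. The following sketch computes read count, total bases, mean and maximum read length, and read N50 with standard tools. The helper name `fastq_length_stats` is hypothetical, and it assumes an uncompressed FASTQ with exactly four lines per record.

```shell
# Hypothetical helper: basic length statistics from an uncompressed FASTQ
# (4 lines per record). For .gz input, pipe through `gzip -dc` first.
fastq_length_stats() {
    awk 'NR % 4 == 2 { print length($0) }' "$@" | sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            if (NR == 0) { print "no reads"; exit 1 }
            # Read N50: length at which the cumulative sum reaches half the bases
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                if (cum >= total / 2) { n50 = len[i]; break }
            }
            printf "reads=%d bases=%.0f mean=%.1f max=%d N50=%d\n",
                   NR, total, total / NR, len[1], n50
        }'
}

# Toy FASTQ with three reads of 10, 6, and 4 bases
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nACGTAC\n+\nIIIIII\n@r3\nACGT\n+\nIIII\n' | fastq_length_stats
```

For the workshop data this could be run as `gzip -dc At_pacbio-hifi.fastq.gz | fastq_length_stats`, although NanoPlot remains the more complete check.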

Callout

What filtering should be applied to the PacBio HiFi reads based on the quality assessment?

Our genome (A. thaliana) has a genome size of ~135 Mb. Our target coverage is ~40x. Currently we have ~18Gb of HiFi reads (~138X depth of coverage). We need to filter the reads to ensure we have good quality reads of desired length and coverage.

2. Quality Assessment of ONT Reads

Assess the quality of ONT reads using NanoPlot. Create a Slurm script to run NanoPlot on the basecalled ONT reads.

BASH

ml --force purge
ml biocontainers
ml nanoplot
NanoPlot \
   --threads ${SLURM_CPUS_ON_NODE} \
   --verbose \
   --outdir nanoplot_ont_pre \
   --prefix At_ONT_ \
   --readtype 1D \
   --plots kde \
   --N50 \
   --dpi 300 \
   --fastq At_ont-reads.fastq.gz

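The stdout from the NanoPlot run will look like this (this example log comes from a run that used a different output directory and input file name than the command above):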

2025-02-13 12:15:51,066 NanoPlot 1.44.1 started with arguments Namespace(threads=8, verbose=True, store=False, raw=False, huge=False, outdir='nanoplot_pacbio_output', no_static=False, prefix='At_ONT_', tsv_stats=False, only_report=False, info_in_report=False, maxlength=None, minlength=None, drop_outliers=False, downsample=None, loglength=False, percentqual=False, alength=False, minqual=None, runtime_until=None, readtype='1D', barcoded=False, no_supplementary=False, color='#4CB391', colormap='Greens', format=['png'], plots=['kde'], legacy=None, listcolors=False, listcolormaps=False, no_N50=False, N50=True, title=None, font_scale=1, dpi=300, hide_stats=False, fastq=['basecalled_2025-02-12.fastq'], fasta=None, fastq_rich=None, fastq_minimal=None, summary=None, bam=None, ubam=None, cram=None, pickle=None, feather=None, path='nanoplot_pacbio_output/At_ONT_')
2025-02-13 12:15:51,067 Python version is: 3.9.21 | packaged by conda-forge | (main, Dec  5 2024, 13:51:40)  [GCC 13.3.0]
2025-02-13 12:15:51,096 Nanoget: Starting to collect statistics from plain fastq file.
2025-02-13 12:25:11,429 Reduced DataFrame memory usage from 8.842315673828125Mb to 8.842315673828125Mb
2025-02-13 12:25:11,455 Nanoget: Gathered all metrics of 579482 reads
2025-02-13 12:25:11,692 Calculated statistics
2025-02-13 12:25:11,693 Using sequenced read lengths for plotting.
2025-02-13 12:25:11,707 NanoPlot:  Valid color #4CB391.
2025-02-13 12:25:11,707 NanoPlot:  Valid colormap Greens.
2025-02-13 12:25:11,725 NanoPlot:  Creating length plots for Read length.
2025-02-13 12:25:11,725 NanoPlot: Using 579482 reads with read length N50 of 36292bp and maximum of 298974bp.
2025-02-13 12:25:13,096 Saved nanoplot_pacbio_output/At_ONT_WeightedHistogramReadlength  as png (or png for --legacy)
2025-02-13 12:25:13,571 Saved nanoplot_pacbio_output/At_ONT_WeightedLogTransformed_HistogramReadlength  as png (or png for --legacy)
2025-02-13 12:25:14,971 Saved nanoplot_pacbio_output/At_ONT_Non_weightedHistogramReadlength  as png (or png for --legacy)
2025-02-13 12:25:15,440 Saved nanoplot_pacbio_output/At_ONT_Non_weightedLogTransformed_HistogramReadlength  as png (or png for --legacy)
2025-02-13 12:25:15,441 NanoPlot: Creating yield by minimal length plot for Read length.
2025-02-13 12:25:16,485 Saved nanoplot_pacbio_output/At_ONT_Yield_By_Length  as png (or png for --legacy)
2025-02-13 12:25:16,486 Created length plots
2025-02-13 12:25:16,495 NanoPlot: Creating Read lengths vs Average read quality plots using 579482 reads.
2025-02-13 12:25:17,029 Saved nanoplot_pacbio_output/At_ONT_LengthvsQualityScatterPlot_kde  as png (or png for --legacy)
2025-02-13 12:25:17,030 Created LengthvsQual plot
2025-02-13 12:25:17,030 Writing html report.
2025-02-13 12:25:17,047 Finished!

Evaluate the quality of ONT reads:

Examine the At_ONT_NanoPlot-report.html file.

Read length distribution: Histogram of read lengths, showing the distribution of read lengths in the dataset.
Read length vs. Quality: Scatter plot showing the relationship between read length and quality score.
Yield (number of bases) by read length: Plot showing the cumulative yield of reads based on their length.
Log-Transformed histograms: Histogram of read lengths with a log-transformed scale for better visualization.
KDE plots: Kernel Density Estimation plots for read length and quality score distributions.
Summary statistics: N50 value, maximum read length, and other key metrics.

Callout

What filtering should be applied to the ONT reads based on the quality assessment?

Our genome (A. thaliana) has a genome size of ~135 Mb. Our target coverage is ~40x. Currently we have ~14Gb of ONT reads (104X depth of coverage). We need to filter the reads to ensure we have good quality reads of desired length and coverage.
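
The depth figure and the amount of data to keep follow from simple arithmetic: depth = total bases / genome size, and the number of bases to retain = genome size x target depth. The sketch below works this through for the ONT numbers quoted above; the helper name `depth_and_keep_fraction` is hypothetical.

```shell
# Hypothetical helper: current depth, the fraction of bases to keep to reach
# a target depth, and the corresponding number of bases.
# Usage: depth_and_keep_fraction <total_bases> <genome_size_bp> <target_depth>
depth_and_keep_fraction() {
    awk -v b="$1" -v g="$2" -v t="$3" 'BEGIN {
        if (g <= 0) { print "invalid genome size"; exit 1 }
        depth = b / g
        keep = (depth > t) ? t / depth : 1    # keep everything if below target
        printf "depth=%.1fx keep=%.1f%% target_bases=%.0f\n", depth, 100 * keep, g * t
    }'
}

# ~14 Gb of ONT reads, ~135 Mb genome, 40x target
depth_and_keep_fraction 14000000000 135000000 40
```

The resulting number of bases is the kind of value a read filter needs when it is asked to downsample to a fixed yield.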

B. Filtering Sequencing Reads

Filtlong is a tool designed to filter long-read sequencing data by selecting a smaller, higher-quality subset of reads based on length and identity. It prioritizes longer reads with higher sequence identity while discarding shorter or lower-quality reads, ensuring that the retained data contributes to more accurate genome assemblies. This filtering step is crucial for improving assembly contiguity, reducing errors, and optimizing computational efficiency by removing excess low-quality data.

1. Filtering PacBio HiFi Reads

Filter the PacBio HiFi reads using Filtlong to retain only high-quality reads.

BASH

ml --force purge
ml biocontainers
ml filtlong
filtlong \
   --target_bases 5400000000 \
   --keep_percent 90 \
   --min_length 1000 \
     At_pacbio-hifi.fastq.gz > At_pacbio-hifi-filtered.fastq

2. Filtering ONT Reads

Filter the ONT reads using Filtlong to retain only high-quality reads.

BASH

ml --force purge
ml biocontainers
ml filtlong
filtlong \
   --target_bases 5400000000 \
   --keep_percent 90 \
   --min_length 1000 \
     At_ont-reads.fastq.gz > At_ont-reads-filtered.fastq

Callout

What does this command do?

--target_bases 5400000000: Target number of bases to retain in the filtered dataset (5.4 Gb).
--keep_percent 90: Keep only the best 90% of the reads, measured in bases rather than read count (that is, discard the worst 10%).
--min_length 1000: Minimum read length to keep in the filtered dataset (1000 bp).
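
To confirm how much data a filtering step retained, you can count reads and bases in the output. The sketch below is a minimal example that assumes an uncompressed FASTQ with exactly four lines per record; the function name `fastq_stats` and the toy reads are made up for illustration.

```bash
# Count reads and bases in an uncompressed FASTQ file.
# Assumes the common 4-line-per-record layout (sequence not wrapped).
fastq_stats() {
    awk 'NR % 4 == 2 { n++; b += length($0) }
         END { printf "%d\t%d\n", n, b }' "$1"
}

# Toy input with three made-up reads of 4, 8 and 2 bases
toy_fastq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n@r3\nAC\n+\nII\n' > "$toy_fastq"

toy_stats=$(fastq_stats "$toy_fastq")
printf 'reads\tbases\n%s\n' "$toy_stats"
rm -f "$toy_fastq"
```

On real data, running `fastq_stats` on a filtered file would be expected to report a base count near the requested target, provided the input contained more bases than the target to begin with.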

C. Evaluating Data Quality After Filtering

We will re-run NanoPlot on the filtered HiFi and ONT reads to assess the quality of the filtered datasets.

1. For PacBio HiFi Reads:

BASH

ml --force purge
ml biocontainers
ml nanoplot
NanoPlot \
   --threads ${SLURM_CPUS_ON_NODE} \
   --verbose \
   --outdir nanoplot_pacbio_post \
   --prefix At_PacBio_post_ \
   --plots kde \
   --N50 \
   --dpi 300 \
   --fastq At_pacbio-hifi-filtered.fastq

2. For ONT Reads:

BASH

ml --force purge
ml biocontainers
ml nanoplot
NanoPlot \
   --threads ${SLURM_CPUS_ON_NODE} \
   --verbose \
   --outdir nanoplot_ont_post \
   --prefix At_ONT_post_ \
   --readtype 1D \
   --plots kde \
   --N50 \
   --dpi 300 \
   --fastq At_ont-reads-filtered.fastq

Now, examine the At_PacBio_post_NanoPlot-report.html and At_ONT_post_NanoPlot-report.html files to assess the quality of the filtered HiFi and ONT reads. Do you observe any improvements in read quality after filtering? We will use these filtered reads for downstream genome assembly.

D. K-mer Based Quality Checks (Optional)

GenomeScope is a k-mer-based tool used to profile genomes without requiring a reference, providing estimates of genome size, heterozygosity, and repeat content. It uses k-mer frequency distributions from raw sequencing reads to model genome characteristics, making it especially useful for detecting sequencing artifacts and assessing data quality before assembly. In this optional section, we will use GenomeScope to evaluate the quality of our Oxford Nanopore and PacBio reads by identifying potential errors, biases, and coverage issues, helping to refine filtering strategies and improve downstream assembly results.

1. For PacBio HiFi Reads:

BASH

ml --force purge
ml biocontainers
ml kmc
mkdir -p tmp
ls At_pacbio-hifi-filtered.fastq > FILES
kmc -k21 -t10 -m64 -ci1 -cs10000 @FILES reads tmp/
kmc_tools transform reads histogram reads-pacbio.histo -cx10000

2. For ONT Reads:

BASH

ml --force purge
ml biocontainers
ml kmc
mkdir -p tmp
ls At_ont-reads-filtered.fastq > FILES
kmc -k21 -t10 -m64 -ci1 -cs10000 @FILES reads tmp/
kmc_tools transform reads histogram reads-ont.histo -cx10000

Now you can visualize the k-mer frequency distributions using GenomeScope to assess the quality of the HiFi and ONT reads. This analysis can help identify potential issues and guide further filtering or processing steps to improve data quality.

To visualize the k-mer frequency distributions:

Visit the GenomeScope website and upload the reads-pacbio.histo OR reads-ont.histo files (drag and drop).
If you uploaded the reads-pacbio.histo file, enter description with PacBio HiFi. If you uploaded the reads-ont.histo file, enter description with Oxford Nanopore.
Click on the Submit button to generate the k-mer frequency distribution plots.
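
To make the contents of a `.histo` file concrete, the sketch below builds a k-mer frequency histogram for a single made-up sequence using awk. It is a simplified illustration, not a replacement for KMC: it counts k-mers on one strand only (no canonical k-mers), and the sequence and `k` value are toy assumptions.

```bash
# Toy sequence (made up); real data would be millions of reads
toy_seq="ACGTACGTAC"
k=4

# Slide a window of length k over the sequence, count each k-mer,
# then count how many distinct k-mers occur 1x, 2x, ... times.
toy_histo=$(
  awk -v s="$toy_seq" -v k="$k" 'BEGIN {
      for (i = 1; i + k - 1 <= length(s); i++) count[substr(s, i, k)]++
      for (kmer in count) histo[count[kmer]]++
      for (m in histo) printf "%d\t%d\n", m, histo[m]
  }' | sort -n
)
printf '%s\n' "$toy_histo"
```

Each output line holds a multiplicity and the number of distinct k-mers seen that many times, which is the two-column layout that k-mer profiling tools take as input.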

Challenge

Q: Why did the ONT k-mer analysis fail?

A: High error rates in ONT reads introduce many erroneous k-mers, which distort the k-mer counts and cause the analysis to fail. K-mer analyses rely on accurate reads to generate reliable frequency distributions. Only reads with quality above Q20 are recommended for k-mer analysis.
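
A rough calculation shows why read accuracy matters so much here. The sketch below estimates the fraction of 21-mers expected to be error-free at a given Phred quality, under the simplifying assumption that errors are independent and uniform along the read (real error profiles are not). The function name `kmer_error_free` is hypothetical.

```bash
# Probability that a k-mer contains no sequencing error, assuming
# independent errors at the per-base rate implied by Phred score q.
kmer_error_free() {
    awk -v q="$1" -v k="$2" 'BEGIN {
        p_err = 10 ^ (-q / 10)          # Phred definition: Q = -10 * log10(p)
        printf "%.3f\n", (1 - p_err) ^ k
    }'
}

q20=$(kmer_error_free 20 21)
q10=$(kmer_error_free 10 21)
echo "Q20: ${q20} of 21-mers expected error-free"
echo "Q10: ${q10} of 21-mers expected error-free"
```

Under this simple model, about 81% of 21-mers are clean at Q20, but only about 11% at Q10, so most k-mers from low-accuracy reads would be erroneous.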

Callout

What insights can you gain from the k-mer frequency distributions?

Look for peaks and patterns in the k-mer frequency distributions.
Identify potential issues such as heterozygosity, repeat content, or sequencing errors.
Did your models converge? What does this indicate about the quality of your data?
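
As a rough intuition for how a genome size estimate comes out of a k-mer histogram, the sketch below divides the total number of counted k-mers by the multiplicity of the main peak. This is a crude back-of-the-envelope approximation, not GenomeScope's model; the histogram values and the error cutoff (`min=5`) are made-up assumptions.

```bash
# Toy histogram: multiplicity <TAB> number of distinct k-mers (made-up values)
toy_histo_file=$(mktemp)
printf '1\t1000\n2\t50\n9\t20\n10\t100\n11\t20\n' > "$toy_histo_file"

# Ignore low-multiplicity k-mers (likely errors), find the main peak,
# and divide the total k-mer count by the peak multiplicity.
est_size=$(
  awk -v min=5 '$1 >= min {
      total += $1 * $2
      if ($2 > best) { best = $2; peak = $1 }
  } END { printf "%d\n", total / peak }' "$toy_histo_file"
)
echo "estimated genome size (toy units): ${est_size}"
rm -f "$toy_histo_file"
```

When a model fails to converge, a simple estimate like this can serve as a sanity check against the expected genome size, although it ignores heterozygosity and repeats.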

Key Points

Data Quality Control: Assessing and filtering raw sequencing data is essential for accurate genome assembly.
NanoPlot: Visualizes read length distributions, quality scores, and other metrics to evaluate sequencing data quality.
Filtlong: Filters long-read sequencing data based on length and quality to retain high-quality reads.
GenomeScope: Profiles genomes using k-mer frequency distributions to estimate genome size, heterozygosity, and repeat content.

Content from PacBio HiFi Assembly using HiFiasm

Last updated on 2025-07-01

Overview

Questions

What is HiFiasm, and how does it improve genome assembly using PacBio HiFi reads?
What are the key steps in running HiFiasm for haplotype-resolved assembly?
How does HiFiasm handle haplotype resolution and purging of duplications?
What are the benefits of using HiFiasm for assembling complex and heterozygous genomes?

Objectives

Understand the purpose and function of HiFiasm for haplotype-resolved genome assembly.
Learn to set up and run HiFiasm for assembling genomes using PacBio HiFi reads.
Gain hands-on experience with haplotype resolution and purging of duplications in HiFiasm.
Analyze and interpret HiFiasm output to assess assembly quality and completeness.

Introduction to HiFiasm

HiFiasm is a specialized de novo assembler designed for PacBio HiFi reads, providing high-quality, haplotype-resolved genome assemblies. Unlike traditional assemblers that collapse heterozygous regions into a consensus sequence, HiFiasm preserves haplotype information using a phased assembly graph approach. This enables more accurate representation of genetic variations and structural differences.

Leveraging the low error rate of HiFi reads, HiFiasm constructs phased assembly graphs that allow for haplotype separation without requiring external polishing or duplication-purging tools. It significantly improves assembly contiguity, resolving complex regions more effectively than alternative methods. HiFiasm is widely used in genome projects, including the Human Pangenome Project, and has been successfully applied to large and highly heterozygous genomes such as Sequoia sempervirens (~30 Gb).

With its ability to generate fast and accurate assemblies, HiFiasm has become the preferred tool for haplotype-resolved genome assembly, especially when parental reads or Hi-C data are available.

Latest version of HiFiasm

The latest version of HiFiasm supports assembling ONT reads as well. It has also added support to integrate ultra-long ONT reads for improved contiguity, as well as hybrid assembly (using both ONT and HiFi reads). Apart from ONT data, HiFiasm can handle Hi-C data for scaffolding, as well as kmer profiles from parents to resolve haplotypes. The latest version includes several bug fixes and performance improvements, making it more efficient and user-friendly.

Installation and Setup

HiFiasm is available as a module on RCAC clusters. You can load the module using the following command:

BASH

ml --force purge
ml biocontainers
ml hifiasm
hifiasm --version

You can also use the Apptainer (formerly Singularity) container for HiFiasm, which provides a consistent environment across different systems. The container can be pulled from the BioContainers registry using the following command:

BASH

apptainer pull docker://quay.io/biocontainers/hifiasm:0.24.0--h5ca1c30_0
apptainer exec hifiasm_0.24.0--h5ca1c30_0.sif hifiasm --version

Overview of HiFiasm Read Assembly

HiFiasm can assemble high-quality, contiguous genome sequences from PacBio High-Fidelity (HiFi) reads. HiFi reads are long and highly accurate (99%+), making them ideal for assembling complex genomes, resolving repetitive regions, and distinguishing haplotypes in diploid or polyploid organisms.

The assembly workflow typically involves:

Preprocessing reads – filtering and quality-checking raw HiFi reads
Read overlap detection – identifying how reads align to each other
Error correction – resolving sequencing errors while maintaining true haplotype differences
Graph construction – building an assembly graph to represent contig relationships
Contig generation – extracting the final set of contiguous sequences
Post-processing – refining assemblies by purging duplications or scaffolding

HiFiasm is optimized for this process, leveraging the high accuracy of HiFi reads to generate contigs with minimal fragmentation and greater haplotype resolution compared to traditional assemblers.

HiFiasm: basic workflow

To run HiFiasm, you need to provide the input HiFi reads in FASTA or FASTQ format. The basic command structure is as follows:

BASH

ml --force purge
ml biocontainers
ml hifiasm
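# create the output directory first; hifiasm writes files using the -o prefix
mkdir -p hifiasm_default
hifiasm \
    -t ${SLURM_CPUS_ON_NODE} \
    -o hifiasm_default/At_hifiasm_default.asm \
    At_pacbio-hifi-filtered.fastq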

Callout

In this command:

-t specifies the number of threads to use
-o specifies the output prefix for the assembly
last argument is the input HiFi reads file (fastq format)

The input can be either FASTQ or FASTA, compressed or uncompressed. The output files are written using the specified prefix, here inside the hifiasm_default directory.

Understanding HiFiasm Output

The run generates several output files. Here are all the files and their descriptions:

filename	description
`At_hifiasm_default.asm.ec.bin`	error-corrected reads stored in binary format
`At_hifiasm_default.asm.ovlp.source.bin`	source overlap data between reads in binary format
`At_hifiasm_default.asm.ovlp.reverse.bin`	reverse overlap data between reads in binary format
`At_hifiasm_default.asm.bp.r_utg.noseq.gfa`	assembly graph of raw unitigs (without sequence)
`At_hifiasm_default.asm.bp.r_utg.gfa`	assembly graph of raw unitigs (with sequence)
`At_hifiasm_default.asm.bp.r_utg.lowQ.bed`	low-quality regions in raw unitigs
`At_hifiasm_default.asm.bp.p_utg.noseq.gfa`	assembly graph of processed unitigs (without sequence)
`At_hifiasm_default.asm.bp.p_utg.gfa`	assembly graph of processed unitigs (with sequence)
`At_hifiasm_default.asm.bp.p_utg.lowQ.bed`	low-quality regions in processed unitigs
`At_hifiasm_default.asm.bp.p_ctg.noseq.gfa`	assembly graph of primary contigs (without sequence)
`At_hifiasm_default.asm.bp.p_ctg.gfa`	assembly graph of primary contigs (with sequence)
`At_hifiasm_default.asm.bp.p_ctg.lowQ.bed`	low-quality regions in primary contigs
`At_hifiasm_default.asm.bp.hap1.p_ctg.noseq.gfa`	haplotype 1 primary contigs (without sequence)
`At_hifiasm_default.asm.bp.hap1.p_ctg.gfa`	haplotype 1 primary contigs (with sequence)
`At_hifiasm_default.asm.bp.hap2.p_ctg.noseq.gfa`	haplotype 2 primary contigs (without sequence)
`At_hifiasm_default.asm.bp.hap2.p_ctg.gfa`	haplotype 2 primary contigs (with sequence)
`At_hifiasm_default.asm.bp.hap1.p_ctg.lowQ.bed`	low-quality regions in haplotype 1 primary contigs

Where are the assembly/contig sequences?

The *_ctg.gfa files contain the contigs (primary and haplotype-resolved) in GFA (Graphical Fragment Assembly) format. You can extract the sequences from these files using awk: each sequence is stored on a line starting with S, followed by the contig ID and the sequence. Run the loop below from the directory that contains the GFA files (hifiasm_default in this example).

BASH

for ctg in *_ctg.gfa; do
    awk '/^S/{print ">"$2"\n"$3}' ${ctg} > ${ctg%.gfa}.fasta
done
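
To see what the awk one-liner does, here is the same conversion applied to a tiny hand-written GFA file. The contig names, sequences, and the `A` line are made up for illustration; only lines starting with `S` contribute to the FASTA output.

```bash
# Toy GFA: a header line, two segment (S) lines, and one non-segment (A) line
toy_gfa=$(mktemp)
printf 'H\tVN:Z:1.0\n' > "$toy_gfa"
printf 'S\tptg000001l\tACGTACGT\tLN:i:8\n' >> "$toy_gfa"
printf 'A\tptg000001l\t0\t+\tread_1\t0\t8\n' >> "$toy_gfa"
printf 'S\tptg000002l\tGGCC\tLN:i:4\n' >> "$toy_gfa"

# Field 2 of each S line is the contig ID, field 3 is the sequence
toy_fasta=$(awk '/^S/{print ">"$2"\n"$3}' "$toy_gfa")
printf '%s\n' "$toy_fasta"
rm -f "$toy_gfa"
```

The result is a two-record FASTA in which each header is the contig ID taken from the corresponding `S` line.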

Let’s take a look at the stats of this assembly:

BASH

ml --force purge
ml biocontainers
ml quast
quast.py \
    --fast \
    --threads ${SLURM_CPUS_ON_NODE} \
    -o quast_basic_stats \
    *p_ctg.fasta

Quast metrics

Key metrics for assembly quality assessment

Metric	Description & Importance
# Contigs	The number of contiguous sequences in the assembly. Fewer, larger contigs indicate a more contiguous assembly.
Largest Contig	The length of the longest assembled sequence. A larger value suggests better resolution of large genomic regions.
Total Length	The sum of all contig lengths. Should approximately match the expected genome size.
N50	The length of the shortest contig among the longest contigs that together cover at least 50% of the assembly. Higher values indicate a more contiguous assembly.
N90	The same measure at 90% of the assembly. Provides insight into the distribution of smaller contigs.
L50	The number of contigs that make up 50% of the assembly. Lower values indicate higher contiguity.
L90	The number of contigs that make up 90% of the assembly. Lower values suggest fewer, larger contigs.
auN	Weighted average of contig lengths, emphasizing longer contigs. Higher values indicate better continuity.
# N/100 kbp	Measures the presence of gaps (`N`s) in the assembly. Ideally should be 0, meaning no unresolved bases.

How to interpret

High N50 and low L50 suggest a well-assembled genome with fewer, larger contigs.
Total Length should be close to the estimated genome size, ensuring completeness.
Low # of contigs indicates better continuity, meaning fewer breaks in the genome.
No N bases means the assembly is gap-free and doesn’t contain unresolved regions.
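
The N50 and L50 definitions are easier to follow with a worked example. The sketch below computes both from a list of contig lengths; the function name `n50_l50` and the seven toy lengths are assumptions for illustration, and QUAST remains the tool to use on real assemblies.

```bash
# Read contig lengths (one per line) on stdin and report N50 and L50.
n50_l50() {
    sort -nr | awk '
        { len[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                # first contig at which the running sum reaches half the total
                if (cum >= half) { printf "N50=%d L50=%d\n", len[i], i; exit }
            }
        }'
}

# Toy contig lengths (total = 300, so half = 150)
toy_n50=$(printf '%s\n' 80 70 50 40 30 20 10 | n50_l50)
echo "$toy_n50"
```

Here the two longest contigs (80 + 70 = 150) already cover half of the assembly, so N50 is 70 and L50 is 2.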

Handling haplotype-resolved contigs

HiFiasm generates haplotype-resolved contigs. With the default options above, it produced hap1.p_ctg, hap2.p_ctg, and p_ctg GFA files, which correspond to haplotype 1, haplotype 2, and primary contigs, respectively. Although HiFiasm separates the haplotypes, it cannot phase them (that is, assign the regions of hap1 and hap2 to their respective haplotypes consistently across the genome) without additional data. The haplotype-resolved contigs are still valuable as they are and can be used for downstream analyses that require haplotype-specific information. The primary contigs represent the consensus sequence and are usually more complete than either haplotype-only assembly.

Hifiasm purges haplotig duplications by default (to produce two sets of partially phased contigs)
For inbred or homozygous genomes, you may disable purging with option -l 0 (hifiasm -o prefix.asm -l 0 -t ${SLURM_CPUS_ON_NODE} input.fq.gz)
To get primary/alternate assemblies, the option --primary should be set (hifiasm -o prefix.asm --primary -t ${SLURM_CPUS_ON_NODE} input.fq.gz)
For heterozygous genomes, you can set -l 1, -l 2, or -l 3, to adjust purging of haplotigs
- -l 1 to only purge contained haplotigs
- -l 2 to purge all types of haplotigs
- -l 3 to purge all types of haplotigs in the most aggressive way
If you have parental kmer profiles, you can use them to resolve haplotypes

We can try running HiFiasm with various -l options to see how it affects the assembly quality.

BASH

ml --force purge
ml biocontainers
ml hifiasm
# purge level 0
mkdir -p hifiasm_purge-0
hifiasm \
  -o hifiasm_purge-0/At_hifiasm_purge-0.asm \
  -l 0 \
  -t ${SLURM_CPUS_ON_NODE} \
  At_pacbio-hifi-filtered.fastq

BASH

ml --force purge
ml biocontainers
ml hifiasm
# purge level 1
mkdir -p hifiasm_purge-1
hifiasm \
  -o hifiasm_purge-1/At_hifiasm_purge-1.asm \
  -l 1 \
  -t ${SLURM_CPUS_ON_NODE} \
  At_pacbio-hifi-filtered.fastq

BASH

ml --force purge
ml biocontainers
ml hifiasm
# purge level 2
mkdir -p hifiasm_purge-2
hifiasm \
  -o hifiasm_purge-2/At_hifiasm_purge-2.asm \
  -l 2 \
  -t ${SLURM_CPUS_ON_NODE} \
  At_pacbio-hifi-filtered.fastq

BASH

ml --force purge
ml biocontainers
ml hifiasm
# purge level 3
mkdir -p hifiasm_purge-3
hifiasm \
  -o hifiasm_purge-3/At_hifiasm_purge-3.asm \
  -l 3 \
  -t ${SLURM_CPUS_ON_NODE} \
  At_pacbio-hifi-filtered.fastq

Callout

Each of these will run in about 15 minutes with 32 cores. You can run them in parallel or sequentially, or request more cores to run them faster.
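
Because the four runs differ only in the purge level, the commands can also be generated from a small helper. The sketch below only prints the commands (a dry run) so they can be reviewed or placed into separate job scripts; the function name `purge_cmd` and the thread count of 32 are illustrative assumptions.

```bash
# Print (do not run) the hifiasm command for one purge level.
# Arguments: purge level, thread count, reads file.
purge_cmd() {
    lvl=$1; nthreads=$2; reads=$3
    outdir="hifiasm_purge-${lvl}"
    echo "mkdir -p ${outdir} && hifiasm -o ${outdir}/At_hifiasm_purge-${lvl}.asm -l ${lvl} -t ${nthreads} ${reads}"
}

for level in 0 1 2 3; do
    purge_cmd "$level" 32 At_pacbio-hifi-filtered.fastq
done
```

Deriving the output directory and prefix from the same variable keeps the names consistent across runs.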

Comparing assemblies

Convert GFA files to FASTA format

BASH

for dir in hifiasm_purge-{0..3}; do 
    cd ${dir}
    for ctg in *_ctg.gfa; do
        awk '/^S/{print ">"$2"\n"$3}' ${ctg} > ${ctg%.gfa}.fasta
    done
    cd ..
done

Run QUAST to compare the assemblies

BASH

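ml --force purge
ml biocontainers
ml quast
mkdir -p quast_stats
# directories are named hifiasm_purge-0 ... hifiasm_purge-3
for fasta in hifiasm_purge-{0..3}/*p_ctg.fasta; do
    # use absolute paths so the symlinks resolve from inside quast_stats
    ln -s "$(realpath "${fasta}")" quast_stats/
done
cd quast_stats
quast.py \
    --fast \
    --threads ${SLURM_CPUS_ON_NODE} \
    -o quast_purge_level_stats \
    *p_ctg.fasta
cd ..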

Run Compleasm to compare the assemblies

BASH

ml --force purge
ml biocontainers
ml compleasm
mkdir -p compleasm_stats
for fasta in hifiasm_purge-{0..3}/*p_ctg.fasta; do
    # use absolute paths so the symlinks resolve from inside compleasm_stats
    ln -s "$(realpath "${fasta}")" compleasm_stats/
done
cd compleasm_stats
for fasta in *p_ctg.fasta; do
    compleasm run \
       -a ${fasta} \
       -o ${fasta%.*} \
       -l brassicales_odb10 \
       -t ${SLURM_CPUS_ON_NODE}
done

Examining the results from QUAST and Compleasm, compare the assembly statistics and assess the impact of different purging levels on the assembly quality. Look for metrics like N50, L50, and total assembly size to evaluate the contiguity and completeness of the assemblies.

Improving Assembly Quality

After the first round of assembly, you will have the files *.ec.bin, *.ovlp.source.bin, and *.ovlp.reverse.bin. Keep these files in place: they hold the error-corrected reads and overlaps from the first run, and they are what allows a re-run with the same output prefix to finish quickly. Then try various options to see if you can improve the assembly. First, make a folder and move the .gfa, .fasta, and .bed files into it; these are the results from the first round of assembly. Second, adjust the parameters in the hifiasm command and run the assembler again. Third, move the new results to a second folder. Finally, generate statistics for each folder and compare them to see whether the changes improved the assembly.
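
The file shuffling described above can be sketched as follows. The example works on empty placeholder files in a temporary directory so it is safe to run anywhere; the file names and the `round1` folder name are made up for illustration.

```bash
# Toy working directory with empty placeholder files (names are made up)
workdir=$(mktemp -d)
for f in toy.asm.ec.bin toy.asm.ovlp.source.bin toy.asm.ovlp.reverse.bin \
         toy.asm.bp.p_ctg.gfa toy.asm.bp.p_ctg.fasta toy.asm.bp.p_ctg.lowQ.bed; do
    : > "$workdir/$f"
done

# Move first-round results aside; leave the reusable .bin files in place
mkdir -p "$workdir/round1"
mv "$workdir"/*.gfa "$workdir"/*.fasta "$workdir"/*.bed "$workdir/round1/"

n_moved=$(ls "$workdir/round1" | wc -l | tr -d ' ')
n_bin=$(ls "$workdir" | grep -c '\.bin$')
echo "moved: ${n_moved}, .bin files kept: ${n_bin}"
```

After each new run, the same pattern with a new folder name (for example `round2`) keeps the results of each parameter set separate for comparison.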

Alternative Assembler: Flye for HiFi

Flye is another popular long-read assembler. It was originally developed for error-prone reads such as ONT and takes a different, repeat-graph-based approach to assembly. Recent versions can also use HiFi reads to generate high-quality assemblies. We will explore Flye in a separate episode to compare its performance with HiFiasm, but in this optional section you can try running Flye with HiFi reads to see how it performs.

Running Flye with HiFi Reads

To run Flye with HiFi reads, you can use the following command structure:

BASH

ml --force purge
ml biocontainers
ml flye
flye \
  --pacbio-hifi At_pacbio-hifi-filtered.fastq \
  --genome-size 135m \
  --out-dir flye_default \
  --threads ${SLURM_CPUS_ON_NODE}

Options used

In this command:

--pacbio-hifi specifies the input HiFi reads file
--genome-size provides an estimate of the genome size (optional)
--out-dir specifies the output directory for Flye results
--threads specifies the number of threads to use

The output will be stored in the specified directory, containing the assembly graph, contigs, and other relevant files.

With 64 cores, this will run in about 40 minutes. It needs about 80-90 GB of memory.

Quality metrics

Run quality metrics on Flye assembly:

BASH

ml --force purge
ml biocontainers
ml quast
quast.py \
    --fast \
    --threads ${SLURM_CPUS_ON_NODE} \
    -o quast_flye_stats \
    flye_default/assembly.fasta
ml compleasm
compleasm run \
  -a flye_default/assembly.fasta \
  -o flye_default \
  -l brassicales_odb10 \
  -t ${SLURM_CPUS_ON_NODE}

Which assembler did a better job at assembling the genome? Compare the statistics from QUAST and Compleasm for Flye and HiFiasm assemblies to evaluate their performance.

Key Points

HiFiasm is a specialized assembler for PacBio HiFi reads, providing high-quality, haplotype-resolved genome assemblies.
It leverages the high accuracy of HiFi reads to generate phased assembly graphs, preserving haplotype information.
HiFiasm is optimized for resolving complex regions and distinguishing haplotypes in diploid or polyploid organisms.
The assembler generates primary contigs and haplotype-resolved contigs, offering valuable information for downstream analyses.
By adjusting purging levels and using parental kmer profiles, users can improve haplotype resolution and assembly quality.

Content from Oxford Nanopore Assembly using Flye

Last updated on 2025-07-01

Overview

Questions

What are the key features of ONT reads?
Why is Flye good for assembling ONT reads?
What are the main steps in the Flye assembly workflow?
How can you evaluate the quality of a Flye assembly?

Objectives

Understand the characteristics of ONT reads.
Learn about the Flye assembler and its advantages for ONT data.
Explore the key steps in the Flye assembly workflow.
Evaluate the quality of a Flye assembly using common metrics.

Introduction to ONT reads and Flye Assembler

Oxford Nanopore Technologies (ONT) has revolutionized sequencing by providing long-read data, enabling the resolution of complex genomic structures that were previously intractable with short-read technologies. However, ONT reads are error-prone, necessitating specialized assembly algorithms that can handle high sequencing error rates while maximizing contiguity and accuracy.

Traditional assemblers designed for short reads rely on de Bruijn graph approaches, which break sequences into fixed k-mers and struggle with error-rich long reads. In contrast, modern long-read assemblers like Flye use alternative graph-based strategies to overcome these limitations. Flye specifically constructs repeat graphs to accurately reconstruct genomes while addressing challenges posed by structural variations and repeats. This makes it particularly well-suited for ONT data, producing high-quality, contiguous assemblies for small microbial genomes to large eukaryotic genomes.

The latest ultra-long ONT reads, such as those generated by the PromethION platform, have further improved assembly quality and contiguity. Flye can leverage these ultra-long reads to generate even more accurate and contiguous assemblies, making it a powerful tool for a wide range of genomic analyses.

Installation and Setup

Flye is available as a module on RCAC clusters. You can load the module using the following command:

BASH

ml --force purge
ml biocontainers
ml flye
flye --version

You can also pull the Flye container from the BioContainers registry and run it with Apptainer:

BASH

apptainer pull docker://quay.io/biocontainers/flye:2.9.5--py311h2de2dd3_2
apptainer exec flye_2.9.5--py311h2de2dd3_2.sif flye --version

Overview of Flye Assembler

Flye is a de novo assembler designed for high-error, long-read sequencing data from Oxford Nanopore Technologies (ONT) and PacBio. It is optimized to handle the inherent noise in single-molecule sequencing (SMS) reads while producing highly contiguous assemblies. Flye is particularly well-suited for assembling complex genomes, resolving repetitive regions, and reconstructing structural variations that short-read assemblers struggle with.

The Flye assembly workflow typically involves:

Read preprocessing – filtering and quality-checking raw ONT reads
Disjointig generation – constructing long, error-prone sequences from overlapping reads
Repeat graph construction – building a repeat-aware assembly graph to represent genome structure
Graph resolution – disentangling repeats and structural variations to produce accurate contigs
Polishing – refining assemblies to improve base-level accuracy using read alignment
Post-processing – assessing assembly quality and generating final output

Flye is optimized for this process, leveraging repeat graph-based assembly to generate longer, more contiguous sequences than many traditional long-read assemblers. Its ability to handle highly repetitive regions, coupled with its fast runtime and efficient memory usage, makes it a powerful choice for ONT genome assembly.

Flye: basic workflow

To run Flye, you need to provide the input long reads in FASTA or FASTQ format and specify the read type, the estimated genome size, the output directory, and the number of threads to use. The basic command structure is as follows:

BASH

ml --force purge
ml biocontainers
ml flye
flye \
  --nano-raw At_ont-reads-filtered.fastq \
  --genome-size 135m \
  --out-dir flye_ont \
  --threads ${SLURM_CPUS_ON_NODE}

Options used

--nano-raw specifies the input ONT long reads in FASTQ format
--genome-size provides an estimate of the genome size to guide assembly
--out-dir specifies the output directory for Flye results
--threads specifies the number of CPU threads to use for assembly

The input can either be fastq or fasta, compressed or uncompressed. The output will be stored in the directory provided.

Understanding Flye Output

The output of Flye includes several files and directories that provide information about the assembly process and results. Key components of the Flye output include:

File/Folder	Description
00-assembly/	Initial draft assembly output.
10-consensus/	Consensus refinement step.
20-repeat/	Repeat graph construction and analysis.
30-contigger/	Final contig generation step.
40-polishing/	Final polishing step for improving assembly quality.
assembly.fasta	Final polished assembly sequence.
assembly_graph.gfa	Final assembly graph in GFA format.
assembly_graph.gv	Visualization of final assembly graph.
assembly_info.txt	Summary information about the assembly.
flye.log	Log file detailing the Flye run.

Which file should I use as my final assembly?

The assembly.fasta file contains the final polished assembly sequence and is typically used as the primary output for downstream analyses. This file represents the best estimate of the assembled genome based on the input data and the Flye assembly process. You can use this file for further analyses, such as gene prediction, variant calling, or comparative genomics studies.

Quick look at metrics for this assembly:

BASH

ml --force purge
ml biocontainers
ml quast
quast.py \
    --fast \
    --threads ${SLURM_CPUS_ON_NODE} \
    -o quast_basic_stats \
    flye_ont/assembly.fasta

Which of these assemblies looks better?

Check the quast_basic_stats/report.txt file to check assembly statistics. Based on your previous assembly using hifiasm, what assembly do you think is better? What metrics are you using to make this decision? Discuss which assembly has better contiguity and completeness based on these statistics.
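
When comparing several assemblies, it can help to pull a single metric out of QUAST's tab-separated report. The sketch below assumes the `report.tsv` layout with metric names in the first column and one column per assembly; the function name `get_metric` is hypothetical, and the numbers in the toy report are made up, not results from this workshop.

```bash
# Toy report with made-up values, mimicking a tab-separated QUAST report
toy_report=$(mktemp)
printf 'Assembly\thifiasm\tflye\n' > "$toy_report"
printf '# contigs\t120\t45\n' >> "$toy_report"
printf 'Total length\t135000000\t119000000\n' >> "$toy_report"
printf 'N50\t9000000\t11000000\n' >> "$toy_report"

# Print the values of one metric (exact match on the first column)
get_metric() {
    awk -F'\t' -v m="$1" '$1 == m {
        for (i = 2; i <= NF; i++) printf "%s%s", $i, (i < NF ? "\t" : "\n")
    }' "$2"
}

n50_row=$(get_metric "N50" "$toy_report")
printf 'N50 per assembly: %s\n' "$n50_row"
rm -f "$toy_report"
```

Matching on the exact metric name avoids accidentally picking up related rows such as NG50 or N90.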

Other important parameters

Flye provides several additional parameters that can be used to customize the assembly process and improve results. Some key parameters include:

Pick the right input type (--nano-hq, --pacbio-hifi, etc.) → Incorrect selection affects accuracy.
Always specify --out-dir and --threads for faster and organized runs.
Use --keep-haplotypes if you don’t want a collapsed assembly.
For metagenomes, use --meta to handle variable coverage.
If the assembly fails, use --resume to avoid losing progress.

Interested in exploring more about Flye?

Check out the Flye FAQ for answers to common questions and troubleshooting tips. You can also explore the Flye GitHub repository for the latest updates, documentation, and discussions about the assembler.

Improving Assembly Quality with Polishing (Optional)

After generating the initial assembly, it is often beneficial to polish the assembly to improve base-level accuracy. Polishing involves aligning the raw reads back to the assembly and correcting errors to produce a more accurate consensus sequence. This step can significantly enhance the quality of the assembly, especially for error-prone long-read data like ONT reads.

Flye provides built-in polishing capabilities. By default, Flye performs one round of polishing to refine the assembly. However, you can customize the polishing process by running polishing separately after the initial assembly.

An example command to polish assembly with accurate PacBio HiFi reads:

BASH

ml --force purge
ml biocontainers
ml flye
flye \
  --polish-target flye_ont/assembly.fasta \
  --pacbio-raw At_pacbio-hifi-filtered.fastq \
  --genome-size 135m \
  --iterations 1 \
  --out-dir flye_ont_polished \
  --threads ${SLURM_CPUS_ON_NODE}

*You can also provide a BAM file as input instead of reads.

There are many other polishing tools available, such as Racon, Nanopolish, and Medaka, which can be used to further refine the ONT assembly. Each tool has its strengths and limitations, so it is recommended to try different polishing strategies to achieve the best results for your specific dataset. Medaka is a popular choice for polishing ONT assemblies due to its accuracy and efficiency.

HiFiasm for ONT Data (Optional)

Since HiFiasm also supports ONT reads for assembly, we can test it out to assess the quality of the assembly. The basic command structure is as follows:

BASH

ml --force purge
ml biocontainers
ml hifiasm
hifiasm \
    -t ${SLURM_CPUS_ON_NODE} \
    -o athaliana_ont.asm \
    --ont \
    At_ont-reads-filtered.fastq

Post processing

Once the run completes (~30 mins with 32 threads), you can convert GFA to FASTA using the following command:

BASH

for ctg in *_ctg.gfa; do
    awk '/^S/{print ">"$2"\n"$3}' ${ctg} > ${ctg%.gfa}.fasta
done

Get the basic stats using QUAST:

BASH

ml --force purge
ml biocontainers
ml quast
quast.py \
    --fast \
    --threads ${SLURM_CPUS_ON_NODE} \
    -o quast_basic_stats \
    *_ctg.fasta

and run compleasm to get the assembly completeness:

BASH

ml --force purge
ml biocontainers
ml compleasm
for fasta in *_ctg.fasta; do
    compleasm run \
       -a ${fasta} \
       -o ${fasta%.*} \
       -l brassicales_odb10 \
       -t ${SLURM_CPUS_ON_NODE}
done

Key Points

ONT provides long-read sequencing data with high error rates.
Flye is a long-read assembler optimized for handling ONT data and producing highly contiguous assemblies.
The Flye assembly workflow involves read preprocessing, repeat graph construction, graph resolution, polishing, and post-processing.
Flye output includes the final assembly sequence, assembly graph, and summary information for evaluation.
Polishing the assembly can improve base-level accuracy and overall assembly quality.
Flye provides built-in polishing capabilities, and other tools like Racon, Nanopolish, and Medaka can be used for further refinement.

Content from Hybrid Long Read Assembly (optional)

Last updated on 2025-07-01

Overview

Questions

What is hybrid assembly, and how does it combine different sequencing technologies?
How can you perform hybrid assembly using both types of long-read data?
What are the key steps in hybrid assembly, including polishing and scaffolding?
How do you evaluate the quality of a hybrid assembly using bioinformatics tools?

Objectives

Understand the concept of hybrid assembly and its advantages in genome sequencing.
Learn how to perform hybrid assembly using long-read sequencing data from PacBio and ONT platforms.
Explore the key steps involved in hybrid assembly, including polishing and scaffolding.
Evaluate the quality of a hybrid assembly using bioinformatics tools such as QUAST and Compleasm.

Flye for Hybrid Assembly

For hybrid assembly using Flye, first run the pipeline with all your reads in --pacbio-raw mode (you can specify multiple files; there is no need to merge all your reads into one). Also add --iterations 0 to stop the pipeline before polishing. Once the assembly finishes, run polishing using either PacBio or ONT reads only. Use the same assembly options, but add --resume-from polishing. Here is an example of a script that should do the job:

BASH

ml --force purge
ml biocontainers
ml flye
# reads
PBREADS="At_pacbio-hifi-filtered.fastq"
ONTREADS="At_ont-reads-filtered.fastq"
# round 1
flye \
    --pacbio-raw $PBREADS $ONTREADS \
    --iterations 0 \
    --out-dir hybrid_flye_out \
    --genome-size 135m \
    --threads ${SLURM_CPUS_ON_NODE}
# round 2
flye \
   --pacbio-raw $PBREADS \
   --resume-from polishing \
   --out-dir hybrid_flye_out  \
   --genome-size 135m \
   --threads ${SLURM_CPUS_ON_NODE}

Evaluating Assembly Quality

For quick evaluation, we will run quast and compleasm on the hybrid assembly output.

BASH

ml --force purge
ml biocontainers
ml quast
ml compleasm
fasta="hybrid_flye_out/assembly.fasta"
quast.py \
  --fast \
  --threads ${SLURM_CPUS_ON_NODE} \
  -o quast_basic \
    ${fasta}
compleasm run \
   -a ${fasta} \
   -o compleasm_out \
   -l brassicales_odb10  \
   -t ${SLURM_CPUS_ON_NODE}

Scaffolding with Bionano

To scaffold the assembly using Bionano data, we will use Bionano Solve. We can run the following script to scaffold the assembly:

BASH

ml --force purge
export PATH=$PATH:/apps/biocontainers/exported-wrappers/bionano/3.8.0
fasta="hybrid_flye_out/assembly.fasta"
run_hybridscaffold.sh \
  -c /opt/Solve3.7_10192021_74_1/HybridScaffold/1.0/hybridScaffold_DLE1_config.xml \
  -b workshop_assembly/col-0_bionano/Evry.OpticalMap.Col-0.cmap \
  -n ${fasta} \
  -u CTTAAG \
  -z results_bionano_hybrid_scaffolding.zip \
  -w log.txt \
  -B 2 \
  -N 2 \
  -g \
  -f \
  -r /opt/Solve3.7_10192021_74_1/RefAligner/1.0/sse/RefAligner \
  -p /opt/Solve3.7_10192021_74_1/Pipeline/1.0 \
  -o bionano_hybrid_scaffolding

Once this completes, you can generate the final scaffold-level assembly by merging placed and unplaced contigs in the bionano_hybrid_scaffolding/hybrid_scaffolds directory.

BASH

cd bionano_hybrid_scaffolding/hybrid_scaffolds
cat *HYBRID_SCAFFOLD.fasta *_HYBRID_SCAFFOLD_NOT_SCAFFOLDED.fasta \
   > ../../assembly_scaffolds.fasta
cd ../..
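
A quick sanity check after merging is to confirm that the combined file contains as many sequences as the two inputs together. The sketch below illustrates the idea on made-up miniature files; count_seqs is a hypothetical helper and the file names are placeholders, not actual Bionano Solve output names.

```shell
# Hypothetical helper: count sequences (header lines) in a FASTA file.
count_seqs() {
  grep -c '^>' "$1"
}

# Made-up miniature inputs standing in for the scaffolded and unplaced files
work=$(mktemp -d)
printf '>Super-Scaffold_1\nACGTNNNNACGT\n' > "$work/placed.fasta"
printf '>ctg5\nACGT\n>ctg9\nGGCC\n' > "$work/unplaced.fasta"
cat "$work/placed.fasta" "$work/unplaced.fasta" > "$work/merged.fasta"

expected=$(( $(count_seqs "$work/placed.fasta") + $(count_seqs "$work/unplaced.fasta") ))
observed=$(count_seqs "$work/merged.fasta")
if [ "$observed" -eq "$expected" ]; then
  echo "OK: merged file has $observed sequences"
else
  echo "Mismatch: expected $expected, found $observed" >&2
fi
rm -rf "$work"
```

A mismatch usually means that the first input did not end with a newline, so the first header of the second file was glued onto the preceding sequence line.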

You can evaluate the final assembly using quast and compleasm as before.

BASH

fasta="assembly_scaffolds.fasta"
quast.py \
  --fast \
  --threads ${SLURM_CPUS_ON_NODE} \
  -o quast_scaffolds \
    ${fasta}
compleasm run \
   -a ${fasta} \
   -o compleasm_scaffolds_out \
   -l brassicales_odb10  \
   -t ${SLURM_CPUS_ON_NODE}

Hybrid Assembly Summary

In this section you have learned how to perform a hybrid assembly using Flye, polish the assembly, scaffold it using Bionano, and evaluate the final assembly quality. This workflow combines the advantages of ONT and PacBio sequencing, improves structural accuracy with Bionano scaffolding, and ensures a high-quality genome assembly. The steps involved are:

Run Hybrid Assembly with Flye
- Use Flye in --pacbio-raw mode to assemble both PacBio and ONT reads.
- Set --iterations 0 to stop before polishing.
Polishing the Assembly
- Polish the assembly using either PacBio or ONT reads by resuming Flye with --resume-from polishing.
Evaluate Assembly Quality
- Run QUAST for basic assembly metrics (contig count, N50, genome size, misassemblies).
- Use Compleasm to assess genome completeness based on conserved single-copy genes.
Scaffold Assembly with Bionano Optical Mapping
- Use Bionano Solve to integrate optical maps and scaffold the assembly.
- Run run_hybridscaffold.sh with the reference .cmap optical map file.
Generate Final Scaffold-Level Assembly
- Merge placed and unplaced contigs from the Bionano scaffolding output to create the final genome assembly.
Final Evaluation of Scaffolds
- Re-run QUAST and Compleasm to validate improvements and ensure genome completeness after scaffolding.

Key Points

Hybrid assembly with Flye combines ONT and PacBio reads to leverage long-read continuity and high-accuracy sequencing, with separate polishing steps to refine base-level errors.
Assembly quality assessment using QUAST and Compleasm provides critical insights into contiguity, completeness, and potential misassemblies, ensuring reliability before scaffolding.
Bionano Optical Genome Mapping (OGM) improves hybrid assemblies by scaffolding contigs, resolving misassemblies, and enhancing genome continuity, leading to chromosome-scale assemblies.
Final scaffolding validation and quality assessment ensure the integrity of the genome assembly, with QUAST and Compleasm used to confirm improvements after Bionano integration.

Content from Scaffolding using Optical Genome Mapping

Last updated on 2025-07-01

Overview

Questions

What is Bionano optical genome mapping (OGM) and how does it improve genome assembly?
How does Bionano Solve hybrid scaffolding integrate optical maps with sequence assemblies?
What are the key steps involved in running the Bionano Solve pipeline for hybrid scaffolding?
How can you assess the quality of hybrid scaffolds generated by Bionano Solve?

Objectives

Understand the principles of Bionano optical genome mapping (OGM) and its role in genome assembly.
Learn how to run the Bionano Solve hybrid scaffolding pipeline to improve genome assemblies.
Explore the key steps involved in scaffolding HiFiasm and Flye assemblies using Bionano Solve.
Evaluate the quality of hybrid scaffolds generated by Bionano Solve and interpret the results.

Introduction to Bionano optical genome mapping (OGM)

Bionano optical mapping is a high-resolution genome analysis technique that generates long-range structural information by labeling and imaging ultra-long DNA molecules. It provides genome-wide maps that can be used to scaffold contigs from sequencing-based assemblies, significantly improving contiguity and structural accuracy. By integrating Bionano maps with assemblies from PacBio HiFi and Oxford Nanopore Technologies (ONT), misassemblies can be corrected, chimeric contigs resolved, and scaffold N50s increased by orders of magnitude. This approach is particularly valuable for complex genomes, where repetitive sequences and structural variations pose challenges for traditional sequencing methods. Bionano hybrid scaffolding has become a standard for enhancing genome assemblies, enabling researchers to achieve high-quality, chromosome-level assemblies efficiently.

Bionano Solve Hybrid Scaffolding

Bionano Solve improves genome assembly by integrating optical genome mapping data with sequence assemblies, generating ultra-long hybrid scaffolds that enhance contiguity and accuracy. The pipeline identifies and resolves assembly conflicts, orders and orients sequence contigs, and estimates gap sizes between adjacent sequences.

The scaffolding workflow involves:

In Silico Map Generation – Converting sequence assembly into a map format for alignment.
Conflict Resolution – Aligning in silico maps to Bionano genome maps and identifying misassemblies.
Hybrid Scaffolding – Merging high-confidence sequence and optical maps into a refined scaffold.
Final Alignment – Mapping sequence contigs back to the hybrid scaffold for consistency validation.
Output Generation – Producing final AGP and FASTA files with corrected genome structures.
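
To make the in silico map generation step more concrete, the sketch below lists where a recognition motif occurs in a sequence, which is the kind of information a label map encodes. It is only an illustration and not how Bionano Solve implements the conversion; label_sites is a hypothetical helper and the sequence is made up. Because CTTAAG is its own reverse complement, scanning the forward strand is sufficient for this motif.

```shell
# Hypothetical helper: print 1-based start positions of a recognition motif
# in a single sequence string (forward strand only).
label_sites() {
  awk -v seq="$1" -v motif="$2" 'BEGIN {
    offset = 0
    while ((i = index(seq, motif)) > 0) {
      print offset + i             # position in the original sequence
      offset += i
      seq = substr(seq, i + 1)     # continue searching after this hit
    }
  }'
}

# Made-up 20 bp sequence with two CTTAAG sites
label_sites "AACTTAAGGGGGCTTAAGTT" "CTTAAG"   # prints 3 and 13
```

The distances between consecutive positions form the label pattern that can be aligned against the pattern observed in the optical maps.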

What is the source of this data?

The optical genome mapping data was obtained from the project PRJEB50694.
Data corresponds to Arabidopsis thaliana ecotype Col-0, generated using Bionano optical genome mapping technology.
The dataset is publicly available on the European Nucleotide Archive (ENA).
The .cmap file contains high-resolution optical maps used for scaffolding and structural validation of genome assemblies.

To download:

BASH

wget ftp://ftp.sra.ebi.ac.uk/vol1/analysis/ERZ227/ERZ2272299/Evry.OpticalMap.Col-0.cmap.gz
gunzip Evry.OpticalMap.Col-0.cmap.gz
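
After downloading, you may want to check how many genome maps the file contains and their combined length. The sketch below assumes the usual CMAP layout, in which header lines start with # and the first two data columns are the map ID (CMapId) and the map length in bp (ContigLength); cmap_summary is a hypothetical helper and the demo file is made up.

```shell
# Hypothetical helper: count genome maps and sum their lengths in a CMAP file.
# Assumes column 1 = CMapId and column 2 = ContigLength (bp); lines starting
# with "#" are treated as header lines.
cmap_summary() {
  awk '
    /^#/ { next }                                    # skip header lines
    !($1 in seen) { seen[$1] = 1; maps++; total += $2 }
    END { printf "maps=%d total_bp=%.0f\n", maps, total }
  ' "$1"
}

# Made-up miniature CMAP with two maps (1000 bp and 500 bp)
demo=$(mktemp)
printf '# CMAP File Version:\t0.1\n' > "$demo"
printf '1\t1000.0\t2\t1\t1\t100.0\n1\t1000.0\t2\t2\t1\t600.0\n' >> "$demo"
printf '2\t500.0\t1\t1\t1\t250.0\n' >> "$demo"
cmap_summary "$demo"   # prints maps=2 total_bp=1500
rm -f "$demo"
```

Run on Evry.OpticalMap.Col-0.cmap, the result can be compared with the "Original BioNano Genome Map" rows of the hybrid scaffold report.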

Installation and Setup

Bionano Solve is available on Bionano.com and can be installed on Linux-based systems. The software requires a valid license and access to Bionano data files for processing.

A custom container with just the hybrid scaffolding tools can be used to run the Bionano Solve pipeline. On Negishi, you can add it to your PATH using the command below:

BASH

export PATH=$PATH:/apps/biocontainers/exported-wrappers/bionano/3.8.0

Running Bionano Solve

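To scaffold a genome using Bionano Solve, you need to provide an optical map (.cmap), a sequence assembly (.fasta), and a hybrid scaffolding configuration file (.xml). A template command looks like this: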

BASH

export PATH=$PATH:/apps/biocontainers/exported-wrappers/bionano/3.8.0
run_hybridscaffold.sh \
  -c /opt/Solve3.7_10192021_74_1/HybridScaffold/1.0/hybridScaffold_DLE1_config.xml \
  -b input.cmap \
  -n genome.fasta \
  -u CTTAAG \
  -z results_output.zip \
  -w log.txt \
  -B 2 \
  -N 2 \
  -g \
  -f \
  -r /opt/Solve3.7_10192021_74_1/RefAligner/1.0/sse/RefAligner \
  -p /opt/Solve3.7_10192021_74_1/Pipeline/1.0 \
  -o output_dir

Options used

Option	Argument	Description
`-c`	`/opt/Solve3.7_10192021_74_1/HybridScaffold/1.0/hybridScaffold_DLE1_config.xml`	Specifies the hybrid scaffolding configuration file required for the pipeline.
`-b`	`input.cmap`	Input Bionano CMAP file, which contains the optical genome map data.
`-n`	`genome.fasta`	Input genome sequence in FASTA format from NGS assembly.
`-u`	`CTTAAG`	Specifies the sequence of the enzyme recognition site, overriding the one in the config XML file.
`-z`	`results_output.zip`	Generates a ZIP archive containing essential output files.
`-w`	`log.txt`	Defines the name of the status text file needed for IrysView.
`-B`	`2`	Conflict filter level for Bionano genome maps: `2` means cut the map at conflict points (required if not using `-M`).
`-N`	`2`	Conflict filter level for sequence contigs: `2` means cut the contig at conflict points (same levels as `-B`, applied to the NGS sequences).
`-g`	(No argument)	Enables trimming of overlapping NGS sequences during AGP and FASTA export.
`-f`	(No argument)	Forces output generation and overwrites any existing files in the output directory.
`-r`	`/opt/Solve3.7_10192021_74_1/RefAligner/1.0/sse/RefAligner`	Specifies the path to the RefAligner program, which is required for scaffolding.
`-p`	`/opt/Solve3.7_10192021_74_1/Pipeline/1.0`	Specifies the directory for the de novo assembly pipeline (optional, required for `-x`).
`-o`	`output_dir`	Defines the output folder where scaffolded results will be stored.

Scaffolding HiFiasm assembly with Bionano Solve

For HiFiasm assembly, you need to provide the HiFiasm assembly FASTA file as input to the Bionano Solve pipeline. The command structure remains the same, with the only change being the input sequence file.

BASH

export PATH=$PATH:/apps/biocontainers/exported-wrappers/bionano/3.8.0
run_hybridscaffold.sh \
  -c /opt/Solve3.7_10192021_74_1/HybridScaffold/1.0/hybridScaffold_DLE1_config.xml \
  -b Evry.OpticalMap.Col-0.cmap \
  -n hifiasm_60x/athaliana_hifi.asm.bp.p_ctg.fasta \
  -u CTTAAG \
  -z results_bionano_hifiasm_scaffolding.zip \
  -w log.txt \
  -B 2 \
  -N 2 \
  -g \
  -f \
  -r /opt/Solve3.7_10192021_74_1/RefAligner/1.0/sse/RefAligner \
  -p /opt/Solve3.7_10192021_74_1/Pipeline/1.0 \
  -o bionano_hifiasm_scaffolding

Scaffolding Flye assembly with Bionano Solve

For Flye assembly, the process is similar to HiFiasm, but you need to provide the Flye assembly FASTA file instead of the HiFiasm assembly. The command structure remains the same, with the only change being the input sequence file.

BASH

export PATH=$PATH:/apps/biocontainers/exported-wrappers/bionano/3.8.0
run_hybridscaffold.sh \
  -c /opt/Solve3.7_10192021_74_1/HybridScaffold/1.0/hybridScaffold_DLE1_config.xml \
  -b workshop_assembly/col-0_bionano/Evry.OpticalMap.Col-0.cmap \
  -n flye_ont_60x/assembly.fasta \
  -u CTTAAG \
  -z results_bionano_flye_scaffolding.zip \
  -w log.txt \
  -B 2 \
  -N 2 \
  -g \
  -f \
  -r /opt/Solve3.7_10192021_74_1/RefAligner/1.0/sse/RefAligner \
  -p /opt/Solve3.7_10192021_74_1/Pipeline/1.0 \
  -o bionano_flye_scaffolding

Understanding Hybrid Scaffolding Output

The output of the Bionano Solve pipeline includes scaffolded genome assemblies in AGP and FASTA formats, along with alignment and conflict resolution information. The hybrid scaffolds provide a more accurate representation of the genome structure, with improved contiguity and reduced misassemblies. The output files can be visualized using genome browsers or alignment viewers to assess the quality and completeness of the assembly.

Folder	Contents
agp_fasta/	Final scaffolded genome assembly in FASTA, AGP format, alignment results, gap information, and logs.
align0/	Initial alignment of optical maps to the sequence assembly, including XMAP, CMAP, and error logs.
align1/	Secondary alignment refinement, similar to align0 but after resolving initial inconsistencies.
align_final/	Final alignment results of hybrid scaffolds to optical maps, including mapping rates and statistics.
assignAlignType/	Tracks conflicts between NGS contigs and optical maps, includes exclusion and trimming decisions.
cut_conflicts/	Stores files related to contig trimming and conflict resolution between NGS and optical maps.
fa2cmap/	Converts the sequence assembly into Bionano’s CMAP format before integration with optical maps.
hybrid_scaffolds/	Contains final scaffolded genome with CMAP, XMAP, AGP files, and a scaffolding report.
mergeNGS_BN/	Stores intermediate files merging NGS contigs with Bionano maps, including hybrid scaffold progress.
results_output.zip	Compressed archive containing essential scaffolding results for easy sharing.

Within hybrid_scaffolds, the files ending with HYBRID_SCAFFOLD.fasta and HYBRID_SCAFFOLD_NOT_SCAFFOLDED.fasta represent the final scaffolded genome and unplaced contigs, respectively. You will need to merge these files to obtain your final scaffolded genome assembly.

Quality Assessment of Hybrid Scaffolds

The hybrid scaffold report file is written to the hybrid_scaffolds directory and summarizes the scaffolding process, including alignment statistics, conflict resolution, and scaffold N50 values. This report is essential for evaluating the quality and completeness of the hybrid scaffolds and identifying any potential issues that need further investigation. The table below compares the reports for the HiFiasm and Flye assemblies.

Category	PacBio HiFi (hifiasm)	ONT (Flye)
Original BioNano Genome Map
Count	18	18
Min length (Mbp)	0.342	0.342
Median length (Mbp)	3.956	3.956
Mean length (Mbp)	7.396	7.396
N50 length (Mbp)	15.529	15.529
Max length (Mbp)	17.518	17.518
Total length (Mbp)	133.124	133.124
Original NGS Sequences
Count	152	43
Min length (Mbp)	0.027	0.008
Median length (Mbp)	0.050	0.276
Mean length (Mbp)	0.896	2.797
N50 length (Mbp)	7.981	9.261
Max length (Mbp)	13.758	14.609
Total length (Mbp)	136.156	120.259
Conflict Resolution (BNG-NGS Alignment)
Conflict cuts made to Bionano maps	2	0
Conflict cuts made to NGS sequences	30	0
Bionano maps to be cut	2	0
NGS sequences to be cut	18	0
NGS FASTA Sequence in Hybrid Scaffold
Count	40	26
Min length (Mbp)	0.033	0.065
Median length (Mbp)	0.945	1.681
Mean length (Mbp)	2.689	4.558
N50 length (Mbp)	8.437	9.261
Max length (Mbp)	13.484	14.609
Total length (Mbp)	107.558	118.508
Hybrid Scaffold FASTA
Count	11	12
Min length (Mbp)	0.104	0.524
Median length (Mbp)	11.824	12.426
Mean length (Mbp)	10.518	9.891
N50 length (Mbp)	14.479	14.886
Max length (Mbp)	15.227	16.188
Total length (Mbp)	115.698	118.689
Hybrid Scaffold FASTA + Not Scaffolded NGS
Count	161	33
Min length (Mbp)	0.024	0.006
Median length (Mbp)	0.051	0.159
Mean length (Mbp)	0.896	3.650
N50 length (Mbp)	14.083	14.886
Max length (Mbp)	15.227	16.188
Total length (Mbp)	144.295	120.440

Which assembler and data performed better?

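In the hybrid scaffolds alone, the two assemblies are very similar: the Flye assembly with ONT data has a slightly higher N50 (14.886 vs. 14.479 Mbp) and total length (118.689 vs. 115.698 Mbp) than the HiFiasm assembly with PacBio HiFi data.
Conflict resolution made 30 cuts to NGS sequences and 2 cuts to Bionano maps for the HiFiasm assembly and none for the Flye assembly, indicating more disagreements between the HiFiasm contigs and the optical maps.
When unscaffolded sequences are included, the HiFiasm output is larger but far more fragmented (144.295 Mbp in 161 sequences vs. 120.440 Mbp in 33 sequences for Flye), so these scaffolding metrics alone do not show a clear winner and should be weighed together with completeness and accuracy checks.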

Key Points

Bionano optical genome mapping (OGM) provides long-range structural information for scaffolding genome assemblies.
Bionano Solve hybrid scaffolding integrates optical maps with sequence assemblies to improve contiguity and accuracy.
The Bionano Solve pipeline involves in silico map generation, conflict resolution, hybrid scaffolding, and final alignment.
The output of Bionano Solve includes scaffolded genome assemblies in AGP and FASTA formats, alignment results, and conflict resolution information.
Quality assessment of hybrid scaffolds involves evaluating alignment statistics, conflict resolution, scaffold N50 values, and completeness of the assembly.

Content from Assembly Assessment

Last updated on 2025-07-01

Overview

Questions

Why is evaluating genome assembly quality important?
What tools can be used to assess assembly completeness, accuracy, and structural integrity?
How do you interpret key metrics from assembly evaluation tools?
What are the main steps in evaluating a genome assembly using bioinformatics tools?

Objectives

Understand the importance of evaluating genome assembly quality.
Learn about tools for assessing assembly completeness, accuracy, and structural integrity.
Interpret key metrics from assembly evaluation tools to guide further analysis.
Evaluate a genome assembly using bioinformatics tools such as QUAST, Compleasm, Merqury, and Bandage.

Evaluating Assembly Quality

Assessing genome assembly quality is essential to ensure completeness, accuracy, and structural integrity before downstream analyses. Different tools provide complementary insights—QUAST evaluates assembly contiguity, Compleasm assesses gene-space completeness, Merqury validates k-mer consistency, and Bandage visualizes assembly graphs for structural assessment. Together, these methods help identify errors, improve genome reconstruction, and ensure high-quality results.

Why is Assembly Evaluation Important?

Detects misassemblies and structural errors: Identifies fragmented, misjoined, or incorrectly placed contigs that can impact genome interpretation.
Measures completeness and accuracy: Ensures that essential genes and expected genome regions are properly assembled and not missing or duplicated.
Validates sequencing data quality: Confirms whether sequencing errors, biases, or artifacts affect the final assembly.
Guides further refinement: Helps decide whether additional polishing, scaffolding, or reassembly is needed for better genome reconstruction.

Quast for quality metrics

You can run quast to evaluate the quality of your genome assembly. It is also useful for comparing multiple assemblies to identify the best one based on key metrics such as contig count, N50, and misassemblies.

BASH

ml --force purge
ml biocontainers
ml quast
mkdir -p quast_evaluation
# ln -s ../assembly1.fasta all_assemblies/assembly1.fasta
# ln -s ../assembly2.fasta all_assemblies/assembly2.fasta
# ln -s ../assembly3.fasta all_assemblies/assembly3.fasta
# link any other assemblies you want to compare
# ln -s ../pacbio/9994.q20.CCS-filtered-60x.fastq
# download the reference genome
wget https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-60/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
gunzip Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
quast.py \
   --output-dir quast_complete_stats \
   --no-read-stats \
   -r  Arabidopsis_thaliana.TAIR10.dna.toplevel.fa \
   --threads ${SLURM_CPUS_ON_NODE} \
   --eukaryote \
   --pacbio 9994.q20.CCS_ge20Kb.fasta \
   assembly1.fasta assembly2.fasta assembly3.fasta

This will generate a detailed report in the quast_complete_stats directory, including key metrics for each assembly and a summary of their quality. You can use this information to compare different assemblies and select the best one for downstream analysis.
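
To see what the N50 value in the QUAST report means, it can be recomputed by hand: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The following is a minimal sketch of that definition, not QUAST's own code; fasta_n50 is a hypothetical helper and the demo contigs are made up.

```shell
# Hypothetical helper: compute N50 from a FASTA file.
# N50 is the length L such that sequences of length >= L together cover
# at least half of the total assembly length.
fasta_n50() {
  awk '
    /^>/ { if (len > 0) print len; len = 0; next }   # header: emit previous length
    { len += length($0) }                            # add up wrapped sequence lines
    END { if (len > 0) print len }
  ' "$1" | sort -rn | awk '
    { lens[NR] = $1; total += $1 }
    END {
      running = 0
      for (i = 1; i <= NR; i++) {
        running += lens[i]
        if (running * 2 >= total) { print lens[i]; exit }
      }
    }'
}

# Made-up contigs of length 10, 6, 4 and 2 (total 22)
demo=$(mktemp)
printf '>a\nACGTACGTAC\n>b\nACG\nTAC\n>c\nACGT\n>d\nAC\n' > "$demo"
fasta_n50 "$demo"   # prints 6, because 10 + 6 = 16 is the first total >= 11
rm -f "$demo"
```

QUAST applies its own filters (for example, a minimum contig length), so its reported N50 can differ from this unfiltered value.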

Compleasm for genome completeness (gene-space)

Similarly, you can use compleasm to assess the completeness of your genome assembly in terms of gene-space representation. This tool compares the assembly against a set of conserved genes to estimate the level of completeness and identify missing or fragmented genes.

BASH

ml --force purge
ml biocontainers
ml compleasm
mkdir -p compleasm_evaluation
# ln -s ../assembly1.fasta all_assemblies/assembly1.fasta
# ln -s ../assembly2.fasta all_assemblies/assembly2.fasta
# ln -s ../assembly3.fasta all_assemblies/assembly3.fasta
# link any other assemblies you want to compare
# ln -s ../quast_evaluation/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa # reference
for fasta in *.fasta; do
  compleasm run \
    -a ${fasta} \
    -o ${fasta%.*}_out \
    -l brassicales_odb10  \
    -t ${SLURM_CPUS_ON_NODE}
done

This will generate a detailed report for each assembly, highlighting the completeness of conserved genes and potential gaps in the genome reconstruction. The compleasm assessment is saved in the file summary.txt in the compleasm_evaluation/assemblyN_out folder (set with the -o option). Each BUSCO gene is assigned to one of the following classes:

S (Single Copy Complete Genes): The BUSCO genes that can be entirely aligned in the assembly, with only one copy present.
D (Duplicated Complete Genes): The BUSCO genes that can be completely aligned in the assembly, with more than one copy present.
F (Fragmented Genes, subclass 1): The BUSCO genes for which only a portion of the gene is present in the assembly, and the rest of the gene cannot be aligned.
I (Fragmented Genes, subclass 2): The BUSCO genes in which a section of the gene aligns to one position in the assembly, while the remaining part aligns to another position.
M (Missing Genes): The BUSCO genes with no alignment present in the assembly.
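
A common single-number summary is the percentage of complete genes, which is the sum of the S and D classes. The sketch below assumes that summary.txt contains one line per class in the form S:95.29%, 243; this layout is an assumption you should check against your own output. complete_pct is a hypothetical helper and the numbers are made up.

```shell
# Hypothetical helper: sum the S and D percentages from a compleasm summary.
# Assumes lines such as "S:95.29%, 243" (class, percentage, gene count).
complete_pct() {
  awk -F'[:%]' '
    $1 == "S" || $1 == "D" { sum += $2 }   # complete = single-copy + duplicated
    END { printf "%.2f\n", sum }
  ' "$1"
}

# Made-up summary file for illustration
demo=$(mktemp)
cat > "$demo" <<'EOF'
## lineage: brassicales_odb10
S:95.29%, 243
D:0.39%, 1
F:1.96%, 5
I:0.00%, 0
M:2.35%, 6
N:255
EOF
complete_pct "$demo"   # prints 95.68
rm -f "$demo"
```

Looping this over the summary.txt files of all assemblies gives a compact table for side-by-side comparison.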

Merqury for evaluating genome assembly

Merqury is a tool for reference-free assembly evaluation based on efficient k-mer set operations. It provides insights into various aspects of genome assembly, offering a comprehensive view of genome quality without relying on a reference sequence. Specifically, Merqury can generate the following plots and metrics:

Copy Number Spectrum (Spectra-cn Plot):
- A k-mer-based analysis that detects heterozygosity levels and genome repeats by identifying peaks in k-mer coverage.
- Helps estimate genome size, detect missing regions, and distinguish between homozygous and heterozygous k-mers in an assembly.
Assembly Spectrum (Spectra-asm Plot):
- Compares k-mers between different assemblies or between an assembly and raw sequencing reads.
- Useful for detecting missing sequences, shared regions, and assembly-specific k-mers that may indicate errors or haplotype-specific variations.
K-mer Completeness:
- Measures how many reliable k-mers (those likely to be real and not sequencing errors) are present in both the sequencing reads and the assembly.
- Helps identify missing regions, misassemblies, and sequencing biases affecting genome reconstruction.
Consensus Quality (QV) Estimation:
- Uses k-mer agreement between the assembly and the read set to estimate base-level accuracy.
- Higher QV scores indicate a more accurate consensus sequence, but results depend on read quality and coverage depth.
Misassembly Detection with K-mer Positioning:
- Identifies unexpected k-mers or false duplications in assemblies, reporting their positions in .bed and .tdf files for visualization in genome browsers.
- Helps pinpoint structural errors such as collapsed repeats, chimeric joins, or large insertions/deletions.

This k-mer-based approach in Merqury provides reference-free genome quality evaluation, making it highly effective for de novo assemblies and structural validation.
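
The consensus QV can be understood from two k-mer counts: the number of k-mers found only in the assembly (likely errors) and the total number of k-mers in the assembly. Following the formula described for Merqury, the per-base error rate is E = 1 - (1 - K_asm_only / K_total)^(1/k) and QV = -10 log10(E). The sketch below only illustrates this arithmetic; kmer_qv is a hypothetical helper and the counts are made up.

```shell
# Hypothetical helper: Merqury-style consensus QV from k-mer counts.
#   $1 = k-mers found in the assembly but not in the reads
#   $2 = all k-mers in the assembly
#   $3 = k-mer size
kmer_qv() {
  awk -v asm_only="$1" -v total="$2" -v k="$3" 'BEGIN {
    p_correct = (1 - asm_only / total) ^ (1 / k)   # chance that one base is correct
    error = 1 - p_correct
    printf "%.2f\n", -10 * log(error) / log(10)    # awk log() is the natural log
  }'
}

# Made-up counts: 1,000 assembly-only k-mers out of 100 million, k = 21
kmer_qv 1000 100000000 21
```

An assembly with no assembly-only k-mers has an error estimate of zero, for which the QV is unbounded, so treat very high values as a lower bound set by read depth.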

BASH

ml --force purge
ml biocontainers
ml merqury
ml meryl
mkdir -p merqury_evaluation
# ln -s ../assembly1.fasta all_assemblies/assembly1.fasta
# ln -s ../assembly2.fasta all_assemblies/assembly2.fasta
# ln -s ../assembly3.fasta all_assemblies/assembly3.fasta
# link any other assemblies you want to compare
# ln -s ../quast_evaluation/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa # reference
# ln -s ../pacbio/9994.q20.CCS-filtered-60x.fastq # pacbio reads
# Build a k-mer database from the PacBio HiFi reads
# (adjust the read file name to match the file you linked above)
meryl \
   count k=21 \
   threads=${SLURM_CPUS_ON_NODE} \
   memory=8g \
   output 9994.q20.CCS-filtered.meryl \
   9994.q20.CCS-filtered.fastq
# merqury.sh takes positional arguments:
#   <read-db.meryl> <asm1.fasta> [asm2.fasta] <output-prefix>
# Run it once per assembly so that each one gets its own output prefix
for fasta in assembly1.fasta assembly2.fasta assembly3.fasta; do
   merqury.sh \
      9994.q20.CCS-filtered.meryl \
      ${fasta} \
      merqury_evaluation_${fasta%.*}
done

This will generate numerous files for each assembly, prefixed with merqury_evaluation_ followed by the assembly name (for example, merqury_evaluation_assembly1), including k-mer spectra, completeness metrics, and consensus quality estimates. You can use these results to evaluate the accuracy, completeness, and structural integrity of your genome assemblies.

Assembly graph visualization using Bandage

Bandage is a tool for visualizing assembly graphs, which represent the connections between contigs or scaffolds in a genome assembly. By visualizing the graph structure, you can identify complex regions, repetitive elements, and potential misassemblies that may affect the genome reconstruction.

To visualize the assembly graph using Bandage:

Open a web browser and navigate to desktop.negishi.rcac.purdue.edu.
Log in with your Purdue Career Account username and password, but append “,push” to your password.
Launch the terminal and run the following command:

BASH

ml --force purge
ml biocontainers
ml bandage
Bandage

In the Bandage interface, navigate to your assembly folder (hifiasm or flye) and load your assembly graph, which is a GFA file (e.g., assembly_graph.gfa from Flye or the *.p_ctg.gfa file from hifiasm).
Explore the graph structure, identify complex regions, and visualize connections between contigs or scaffolds.
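
Before opening a graph in Bandage, a quick text summary of the GFA file can tell you how tangled it is likely to be. The sketch below counts segment (S) and link (L) records in a GFA version 1 file and assumes that sequences are stored in the third column of S lines; gfa_summary is a hypothetical helper and the demo graph is made up.

```shell
# Hypothetical helper: summarize a GFA v1 file.
# S lines are segments (contigs), L lines are links between segment ends.
gfa_summary() {
  awk -F'\t' '
    $1 == "S" { segments++; bases += length($3) }
    $1 == "L" { links++ }
    END { printf "segments=%d links=%d bases=%d\n", segments, links, bases }
  ' "$1"
}

# Made-up miniature graph: two segments joined by one link
demo=$(mktemp)
printf 'H\tVN:Z:1.0\nS\tctg1\tACGTACGT\nS\tctg2\tGGCC\nL\tctg1\t+\tctg2\t+\t0M\n' > "$demo"
gfa_summary "$demo"   # prints segments=2 links=1 bases=12
rm -f "$demo"
```

A graph with few links relative to its segments is mostly linear, while many links point to repeats or unresolved regions worth inspecting visually.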

Key Points

QUAST evaluates assembly contiguity and quality metrics.
Compleasm assesses gene-space completeness in genome assemblies.
Merqury provides reference-free evaluation based on k-mer analysis.
Bandage visualizes assembly graphs for structural assessment.