Oxford Nanopore Assembly using Flye

Last updated on 2025-04-22

Overview

Questions

What are the key features of ONT reads?
Why is Flye good for assembling ONT reads?
What are the main steps in the Flye assembly workflow?
How can you evaluate the quality of a Flye assembly?

Objectives

Understand the characteristics of ONT reads.
Learn about the Flye assembler and its advantages for ONT data.
Explore the key steps in the Flye assembly workflow.
Evaluate the quality of a Flye assembly using common metrics.

Introduction to ONT reads and Flye Assembler

Oxford Nanopore Technologies (ONT) has revolutionized sequencing by providing long-read data, enabling the resolution of complex genomic structures that were previously intractable with short-read technologies. However, ONT reads are error-prone, necessitating specialized assembly algorithms that can handle high sequencing error rates while maximizing contiguity and accuracy.

Traditional assemblers designed for short reads rely on de Bruijn graph approaches, which break sequences into fixed-length k-mers and struggle with error-rich long reads. In contrast, modern long-read assemblers like Flye use alternative graph-based strategies to overcome these limitations. Flye constructs repeat graphs to reconstruct genomes accurately while addressing the challenges posed by structural variation and repeats. This makes it well suited to ONT data, producing high-quality, contiguous assemblies for genomes ranging from small microbial to large eukaryotic.

The latest ultra-long ONT reads, such as those generated by the PromethION platform, have further improved assembly quality and contiguity. Flye can leverage these ultra-long reads to generate even more accurate and contiguous assemblies, making it a powerful tool for a wide range of genomic analyses.

Installation and Setup

Flye is available as a module on RCAC clusters. You can load the module using the following commands:

BASH

ml --force purge
ml biocontainers
ml flye
flye --version

You can also use the Apptainer (formerly Singularity) container for Flye, which provides a consistent environment across different systems. The container can be pulled from the BioContainers registry using the following command:

BASH

apptainer pull docker://quay.io/biocontainers/flye:2.9.5--py311h2de2dd3_2
apptainer exec flye_2.9.5--py311h2de2dd3_2.sif flye --version

Overview of Flye Assembler

Flye is a de novo assembler designed for high-error, long-read sequencing data from Oxford Nanopore Technologies (ONT) and PacBio. It is optimized to handle the inherent noise in single-molecule sequencing (SMS) reads while producing highly contiguous assemblies. Flye is particularly well-suited for assembling complex genomes, resolving repetitive regions, and reconstructing structural variations that short-read assemblers struggle with.

The Flye assembly workflow typically involves:

Read preprocessing – filtering and quality-checking raw ONT reads
Disjointig generation – constructing long, error-prone sequences from overlapping reads
Repeat graph construction – building a repeat-aware assembly graph to represent genome structure
Graph resolution – disentangling repeats and structural variations to produce accurate contigs
Polishing – refining assemblies to improve base-level accuracy using read alignment
Post-processing – assessing assembly quality and generating final output

Flye is optimized for this process, leveraging repeat graph-based assembly to generate longer, more contiguous sequences than many traditional long-read assemblers. Its ability to handle highly repetitive regions, coupled with its fast runtime and efficient memory usage, makes it a powerful choice for ONT genome assembly.
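The read preprocessing step listed above usually starts with a quick look at the read count, total bases, and read N50. The following is a minimal sketch, assuming an uncompressed FASTQ with strict four-line records; the function name fastq_stats is a hypothetical helper and not part of Flye.

```bash
# Hypothetical helper (not part of Flye): summarize an uncompressed FASTQ.
# Assumes strict four-line records, so every sequence sits on line 2 of 4.
fastq_stats() {
    local fastq="$1"
    awk 'NR % 4 == 2 { print length($0) }' "$fastq" |
        sort -rn |
        awk '{ len[NR] = $1; total += $1 }
             END {
                 # N50: length of the read at which the running sum of
                 # lengths (longest first) reaches half of all bases.
                 n50 = 0; running = 0
                 for (i = 1; i <= NR; i++) {
                     running += len[i]
                     if (running >= total / 2) { n50 = len[i]; break }
                 }
                 printf "reads=%d bases=%.0f n50=%d\n", NR, total, n50
             }'
}
```

Calling fastq_stats At_ont-reads-filtered.fastq prints a single line with the three values. For compressed input, decompress to a temporary file first or adapt the function to read from a pipe.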

Flye: basic workflow

To run Flye, provide the input long reads in FASTA or FASTQ format with the flag that matches the read type, an estimated genome size, an output directory, and the number of threads to use. The basic command structure is as follows:

BASH

ml --force purge
ml biocontainers
ml flye
flye \
  --nano-raw At_ont-reads-filtered.fastq \
  --genome-size 135m \
  --out-dir flye_ont \
  --threads ${SLURM_CPUS_ON_NODE}

Options used

--nano-raw specifies the input ONT long reads in FASTQ format
--genome-size provides an estimate of the genome size to guide assembly
--out-dir specifies the output directory for Flye results
--threads specifies the number of CPU threads to use for assembly

The input can either be fastq or fasta, compressed or uncompressed. The output will be stored in the directory provided.
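The genome size estimate also gives a quick way to check read depth before assembling: divide the total number of sequenced bases by the genome size. The sketch below assumes the k/m/g suffix convention seen in the command above (for example 135m); both function names are hypothetical helpers, not Flye commands.

```bash
# Hypothetical helper: convert a size such as 135m or 2.6g to a base count.
genome_size_to_bases() {
    awk -v s="$1" 'BEGIN {
        unit = tolower(substr(s, length(s), 1))
        mult = 1
        if (unit == "k") mult = 1e3
        else if (unit == "m") mult = 1e6
        else if (unit == "g") mult = 1e9
        # Strip the suffix only when one was recognized.
        num = (mult == 1) ? s : substr(s, 1, length(s) - 1)
        printf "%.0f\n", num * mult
    }'
}

# Hypothetical helper: expected coverage = total read bases / genome size.
estimate_coverage() {
    local total_bases="$1" size_bases
    size_bases=$(genome_size_to_bases "$2")
    awk -v b="$total_bases" -v g="$size_bases" 'BEGIN { printf "%.1f\n", b / g }'
}
```

For instance, estimate_coverage 4050000000 135m reports 30.0, meaning roughly 30x depth for a 135 Mbp genome.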

Understanding Flye Output

The output of Flye includes several files and directories that provide information about the assembly process and results. Key components of the Flye output include:

File/Folder	Description
00-assembly/	Initial draft assembly output.
10-consensus/	Consensus refinement step.
20-repeat/	Repeat graph construction and analysis.
30-contigger/	Final contig generation step.
40-polishing/	Final polishing step for improving assembly quality.
assembly.fasta	Final polished assembly sequence.
assembly_graph.gfa	Final assembly graph in GFA format.
assembly_graph.gv	Visualization of final assembly graph.
assembly_info.txt	Summary information about the assembly.
flye.log	Log file detailing the Flye run.
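The assembly_info.txt file can be summarized from the command line. This is a minimal sketch, assuming the file is tab-separated, has a header line starting with "#", and begins with the columns seq_name, length, cov., and circ. (with "Y" marking circular contigs); check the header of your own file before relying on it. The function name summarize_assembly_info is hypothetical.

```bash
# Hypothetical helper: summarize Flye's assembly_info.txt.
# Assumes a tab-separated file whose header line starts with "#" and whose
# columns begin with: seq_name, length, cov., circ. (circ. is "Y" or "N").
summarize_assembly_info() {
    awk -F'\t' '
        /^#/ { next }                       # skip the header line
        {
            n++
            len = $2 + 0                    # force numeric comparison
            total += len
            if (len > longest) longest = len
            if ($4 == "Y") circular++
        }
        END {
            printf "contigs=%d total_bp=%.0f longest_bp=%.0f circular=%d\n",
                   n, total, longest, circular
        }' "$1"
}
```

Run it as summarize_assembly_info flye_ont/assembly_info.txt to get the contig count, total length, longest contig, and number of circular contigs on one line.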

Which file should I use as my final assembly?

The assembly.fasta file contains the final polished assembly sequence and is typically used as the primary output for downstream analyses. This file represents the best estimate of the assembled genome based on the input data and the Flye assembly process. You can use this file for further analyses, such as gene prediction, variant calling, or comparative genomics studies.

Quick look at metrics for this assembly:

BASH

ml --force purge
ml biocontainers
ml quast
quast.py \
    --fast \
    --threads ${SLURM_CPUS_ON_NODE} \
    -o quast_basic_stats \
    flye_ont/assembly.fasta

Which of these assemblies looks better?

Check the quast_basic_stats/report.txt file for the assembly statistics. Based on your previous assembly using hifiasm, which assembly do you think is better? What metrics are you using to make this decision? Discuss which assembly has better contiguity and completeness based on these statistics.
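To compare assemblies side by side, it can help to pull a few metrics out of the tab-separated report.tsv that QUAST writes next to report.txt. The sketch below assumes each report describes a single assembly (one value column) and uses the standard QUAST row names; quast_metric and quast_summary are hypothetical helper names.

```bash
# Hypothetical helper: print the value of one metric from a QUAST report.tsv.
# Assumes column 1 is the metric name and column 2 the value for one assembly.
quast_metric() {
    local report="$1" metric="$2"
    awk -F'\t' -v key="$metric" '$1 == key { print $2 }' "$report"
}

# Hypothetical helper: print a short contiguity summary for one report.
quast_summary() {
    local report="$1" metric
    for metric in "# contigs" "Total length" "Largest contig" "N50" "L50"; do
        printf '%s\t%s\n' "$metric" "$(quast_metric "$report" "$metric")"
    done
}
```

Running quast_summary quast_basic_stats/report.tsv for each assembly gives short tables that are easy to compare by eye.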

Other important parameters

Flye provides several additional parameters that can be used to customize the assembly process and improve results. Some key parameters and tips include:

Pick the right input type (--nano-hq, --pacbio-hifi, etc.) → Incorrect selection affects accuracy.
Always specify --out-dir and --threads for faster and organized runs.
Use --keep-haplotypes if you don’t want a collapsed assembly.
For metagenomes, use --meta to handle variable coverage.
If the assembly fails, use --resume to avoid losing progress.
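The options above can be combined in a single command. The sketch below only prints a Flye command instead of running it, so the command can be inspected first. The function flye_cmd and its read-type labels (ont-raw, ont-hq, hifi) are hypothetical names chosen for illustration; the Flye flags are the ones listed above plus --nano-raw from the basic workflow.

```bash
# Hypothetical helper: print (do not run) a Flye command for a read type.
# Usage: flye_cmd <ont-raw|ont-hq|hifi> <reads> <out-dir> [extra Flye options]
flye_cmd() {
    local read_type="$1" reads="$2" outdir="$3"
    shift 3
    local flag
    case "$read_type" in
        ont-raw) flag="--nano-raw" ;;
        ont-hq)  flag="--nano-hq" ;;
        hifi)    flag="--pacbio-hifi" ;;
        *) echo "unknown read type: $read_type" >&2; return 1 ;;
    esac
    # Fall back to 4 threads when not running inside a SLURM job (assumption).
    local cmd=(flye "$flag" "$reads" --out-dir "$outdir"
               --threads "${SLURM_CPUS_ON_NODE:-4}" "$@")
    echo "${cmd[*]}"
}
```

For example, flye_cmd ont-hq reads.fastq flye_hq --keep-haplotypes prints a command for a non-collapsed assembly, and flye_cmd ont-raw At_ont-reads-filtered.fastq flye_ont --resume prints one that continues an interrupted run in the same output directory.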

Interested in exploring more about Flye?

Check out the Flye FAQ for answers to common questions and troubleshooting tips. You can also explore the Flye GitHub repository for the latest updates, documentation, and discussions about the assembler.

Improving Assembly Quality with Polishing (Optional)

After generating the initial assembly, it is often beneficial to polish the assembly to improve base-level accuracy. Polishing involves aligning the raw reads back to the assembly and correcting errors to produce a more accurate consensus sequence. This step can significantly enhance the quality of the assembly, especially for error-prone long-read data like ONT reads.

Flye provides built-in polishing capabilities. By default, Flye performs one round of polishing to refine the assembly. However, you can customize the polishing process by running polishing separately after the initial assembly.

An example command to polish the assembly with accurate PacBio HiFi reads:

BASH

ml --force purge
ml biocontainers
ml flye
flye \
  --polish-target flye_ont/assembly.fasta \
  --pacbio-hifi At_pacbio-hifi-filtered.fastq \
  --genome-size 135m \
  --iterations 1 \
  --out-dir flye_ont_polished \
  --threads ${SLURM_CPUS_ON_NODE}

*You can also provide a BAM file as input instead of reads.

There are many other polishing tools available, such as Racon, Nanopolish, and Medaka, which can be used to further refine the ONT assembly. Each tool has its strengths and limitations, so it is worth trying different polishing strategies to find what works best for your dataset. Medaka is a popular choice for polishing ONT assemblies due to its accuracy and efficiency.

HiFiasm for ONT Data (Optional)

Since HiFiasm also supports ONT reads, we can use it to assemble the same data and assess the quality of the resulting assembly. The basic command structure is as follows:

BASH

ml --force purge
ml biocontainers
ml hifiasm
hifiasm \
    -t ${SLURM_CPUS_ON_NODE} \
    -o athaliana_ont.asm \
    --ont \
    At_ont-reads-filtered.fastq

Post processing

Once the run completes (~30 mins with 32 threads), you can convert GFA to FASTA using the following command:

BASH

for ctg in *_ctg.gfa; do
    awk '/^S/{print ">"$2"\n"$3}' "${ctg}" > "${ctg%.gfa}.fasta"
done
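A quick sanity check after the conversion is to confirm that every segment (S) line in the GFA produced one FASTA record. This is a minimal sketch; check_gfa_fasta is a hypothetical helper name.

```bash
# Hypothetical check: every GFA segment (S) line should yield one FASTA record.
check_gfa_fasta() {
    local gfa="$1" fasta="$2" segments records
    segments=$(grep -c '^S' "$gfa")
    records=$(grep -c '^>' "$fasta")
    if [ "$segments" -eq "$records" ]; then
        echo "ok: $records sequences"
    else
        echo "mismatch: $segments segments vs $records records" >&2
        return 1
    fi
}
```

Call it with a GFA file and the FASTA file generated from it; a non-zero exit status signals a mismatch.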

Get the basic stats using QUAST:

BASH

ml --force purge
ml biocontainers
ml quast
quast.py \
    --fast \
    --threads ${SLURM_CPUS_ON_NODE} \
    -o quast_basic_stats \
    *_ctg.fasta

Then run compleasm to assess assembly completeness:

BASH

ml --force purge
ml biocontainers
ml compleasm
for fasta in *_ctg.fasta; do
    compleasm run \
       -a ${fasta} \
       -o ${fasta%.*} \
       -l brassicales_odb10 \
       -t ${SLURM_CPUS_ON_NODE}
done

Key Points

ONT provides long-read sequencing data with high error rates.
Flye is a long-read assembler optimized for handling ONT data and producing highly contiguous assemblies.
The Flye assembly workflow involves read preprocessing, disjointig generation, repeat graph construction, graph resolution, polishing, and post-processing.
Flye output includes the final assembly sequence, assembly graph, and summary information for evaluation.
Polishing the assembly can improve base-level accuracy and overall assembly quality.
Flye provides built-in polishing capabilities, and other tools like Racon, Nanopolish, and Medaka can be used for further refinement.