Oxford Nanopore Assembly using Flye
Last updated on 2026-02-10
Estimated time: 60 minutes
Overview
Questions
- What are the key features of ONT reads?
- Why is Flye good for assembling ONT reads?
- What are the main steps in the Flye assembly workflow?
- How can you evaluate the quality of a Flye assembly?
Objectives
- Understand the characteristics of ONT reads.
- Learn about the Flye assembler and its advantages for ONT data.
- Explore the key steps in the Flye assembly workflow.
- Evaluate the quality of a Flye assembly using common metrics.
Introduction to ONT reads and Flye Assembler
Oxford Nanopore Technologies (ONT) has revolutionized sequencing by providing long-read data, enabling the resolution of complex genomic structures that were previously intractable with short-read technologies. However, ONT reads are error-prone, necessitating specialized assembly algorithms that can handle high sequencing error rates while maximizing contiguity and accuracy.
Traditional assemblers designed for short reads rely on de Bruijn
graph approaches, which break sequences into fixed k-mers and struggle
with error-rich long reads. In contrast, modern long-read assemblers
like Flye use alternative graph-based strategies to
overcome these limitations. Flye specifically constructs repeat graphs
to accurately reconstruct genomes while addressing challenges posed by
structural variations and repeats. This makes it particularly
well-suited for ONT data, producing high-quality, contiguous assemblies
for genomes ranging from small microbial genomes to large eukaryotic ones.
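To illustrate why fixed k-mers are fragile on error-rich reads, here is a minimal sketch. The helper names `kmers` and `missing_kmers` are hypothetical and are not part of Flye or any other assembler. It counts how many k-mers of a true sequence no longer appear in a read that carries a single substitution.

```shell
# Print every k-mer of a sequence, one per line.
kmers() {
    # $1 = sequence, $2 = k
    awk -v s="$1" -v k="$2" 'BEGIN {
        for (i = 1; i <= length(s) - k + 1; i++) print substr(s, i, k)
    }'
}

# Count the k-mer positions of a reference whose k-mer is absent from a read.
missing_kmers() {
    # $1 = reference sequence, $2 = read sequence, $3 = k
    awk -v a="$1" -v b="$2" -v k="$3" 'BEGIN {
        # Index every k-mer present in the read
        for (i = 1; i <= length(b) - k + 1; i++) seen[substr(b, i, k)] = 1
        # Count reference k-mers that the read does not contain
        n = 0
        for (i = 1; i <= length(a) - k + 1; i++)
            if (!(substr(a, i, k) in seen)) n++
        print n
    }'
}
```

For example, `missing_kmers ACGTACGGTCA ACGTATGGTCA 4` compares a toy reference with a read that differs at one base. A single error away from the sequence ends can invalidate up to k overlapping k-mers, which is why high error rates fragment a de Bruijn graph.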
Ultra-long ONT reads, which can be sequenced at high throughput on platforms such as the PromethION, have further improved assembly quality and contiguity. Flye can use these reads to span longer repeats and generate more contiguous assemblies, making it a powerful tool for a wide range of genomic analyses.
Installation and Setup
Flye is available as a module on RCAC clusters. You can load it with ml biocontainers followed by ml flye, as in the workflow commands later in this episode.
You can also use the Singularity container for Flye, which provides a consistent environment across different systems. The container can be pulled from the BioContainers registry using the following command:
Overview of Flye Assembler
Flye is a de novo assembler designed for high-error, long-read sequencing data from Oxford Nanopore Technologies (ONT) and PacBio. It is optimized to handle the inherent noise in single-molecule sequencing (SMS) reads while producing highly contiguous assemblies. Flye is particularly well-suited for assembling complex genomes, resolving repetitive regions, and reconstructing structural variations that short-read assemblers struggle with.
The Flye assembly workflow typically involves:
- Read preprocessing – filtering and quality-checking raw ONT reads
- Disjointig generation – constructing long, error-prone sequences from overlapping reads
- Repeat graph construction – building a repeat-aware assembly graph to represent genome structure
- Graph resolution – disentangling repeats and structural variations to produce accurate contigs
- Polishing – refining assemblies to improve base-level accuracy using read alignment
- Post-processing – assessing assembly quality and generating final output
Flye is optimized for this process, leveraging repeat graph-based assembly to generate longer, more contiguous sequences than many traditional long-read assemblers. Its ability to handle highly repetitive regions, coupled with its fast runtime and efficient memory usage, makes it a powerful choice for ONT genome assembly.
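As a rough illustration of the read preprocessing step listed above, the sketch below keeps only reads of at least a given length. The function name `filter_fastq_by_length` is hypothetical, and the code assumes an uncompressed FASTQ with exactly four lines per record. Dedicated read-filtering tools do this more robustly, and the reads used in this episode are already filtered.

```shell
# Keep only reads that are at least min_len bases long.
# Assumes an uncompressed FASTQ with exactly four lines per record.
filter_fastq_by_length() {
    # $1 = FASTQ file, $2 = minimum read length
    awk -v min="$2" '
        NR % 4 == 1 { header = $0 }   # @read_id line
        NR % 4 == 2 { seq = $0 }      # sequence line
        NR % 4 == 3 { plus = $0 }     # separator line
        NR % 4 == 0 {                 # quality line completes the record
            if (length(seq) >= min) print header "\n" seq "\n" plus "\n" $0
        }
    ' "$1"
}
```

A call such as `filter_fastq_by_length reads.fastq 1000 > reads.min1k.fastq` would write the surviving reads to a new file; both file names here are placeholders.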
Flye: basic workflow
To run Flye, provide the input long reads in FASTA or FASTQ format and
specify the read type, an estimated genome size, an output directory,
and the number of threads to use. The basic command structure is as
follows:
BASH
ml --force purge
ml biocontainers
ml flye
flye \
--nano-hq At_ont-reads-filtered.fastq \
--genome-size 135m \
--out-dir flye_ont \
--threads ${SLURM_CPUS_ON_NODE}
Options used
- --nano-hq specifies high-quality ONT reads (Dorado HAC/SUP basecalled, R10.4+ chemistry, typically Q20+)
- --genome-size provides an estimate of the genome size to guide assembly
- --out-dir specifies the output directory for Flye results
- --threads specifies the number of CPU threads to use for assembly
The input can be either FASTQ or FASTA, compressed or uncompressed. The output will be stored in the directory given to --out-dir.
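Before launching an assembly it can help to check how much sequencing depth the reads provide relative to the estimated genome size. The sketch below is one simple way to do that. The function name `estimate_coverage` is hypothetical, and it assumes an uncompressed FASTQ with four lines per record.

```shell
# Estimate sequencing depth as (total read bases) / (genome size).
# Assumes an uncompressed FASTQ with exactly four lines per record.
estimate_coverage() {
    # $1 = FASTQ file, $2 = estimated genome size in bases
    awk -v g="$2" '
        NR % 4 == 2 { total += length($0) }      # sum sequence-line lengths
        END { printf "%.1f\n", total / g }       # depth, one decimal place
    ' "$1"
}
```

With the values used in this episode, the call would be `estimate_coverage At_ont-reads-filtered.fastq 135000000`, where 135000000 is the 135m genome size written out in bases.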
Choosing the right Flye input type
Flye provides several input options depending on your read type and quality:
- --nano-hq: For Dorado HAC/SUP basecalled reads with R10.4+ chemistry (Q20+). This is what we use for our data.
- --nano-raw: For older, lower-quality ONT reads (R9, fast basecalling, <Q10)
- --pacbio-hifi: For PacBio HiFi/CCS reads
- --pacbio-raw: For PacBio CLR (continuous long reads)
Using the wrong input type can significantly affect assembly quality.
Since our ONT data was basecalled with Dorado HAC on R10.4.1 chemistry,
--nano-hq is the correct choice.
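The Q values quoted above are Phred scores, which relate to the per-base error probability p by Q = -10 log10(p). A minimal sketch of the conversion, using a hypothetical helper name `phred_to_error`:

```shell
# Convert a Phred quality score Q to the error probability 10^(-Q/10).
phred_to_error() {
    # $1 = Phred quality score
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}
```

`phred_to_error 20` gives a 1% error rate and `phred_to_error 10` gives 10%, which is the scale of the accuracy gap between the read classes that --nano-hq and --nano-raw are meant for.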
While you wait
Flye will take approximately 30-60 minutes with 32 threads on the A. thaliana dataset. While you wait, you can:
- Review the Flye output file descriptions in the table below
- Compare the Flye assembly graph approach to the HiFiasm approach from the previous episode
- Read about Flye’s repeat graph algorithm
Understanding Flye Output
The output of Flye includes several files and directories that provide information about the assembly process and results. Key components of the Flye output include:
| File/Folder | Description |
|---|---|
| 00-assembly/ | Initial draft assembly output. |
| 10-consensus/ | Consensus refinement step. |
| 20-repeat/ | Repeat graph construction and analysis. |
| 30-contigger/ | Final contig generation step. |
| 40-polishing/ | Final polishing step for improving assembly quality. |
| assembly.fasta | Final polished assembly sequence. |
| assembly_graph.gfa | Final assembly graph in GFA format. |
| assembly_graph.gv | Visualization of final assembly graph. |
| assembly_info.txt | Summary information about the assembly. |
| flye.log | Log file detailing the Flye run. |
Which file should I use as my final assembly?
The assembly.fasta file contains the final polished
assembly sequence and is typically used as the primary output for
downstream analyses. This file represents the best estimate of the
assembled genome based on the input data and the Flye assembly process.
You can use this file for further analyses, such as gene prediction,
variant calling, or comparative genomics studies.
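For a quick overview of the contigs, the assembly_info.txt file can be summarized from the command line. The sketch below assumes the file is tab-separated with a header line starting with `#`, the contig length in column 2, and a Y/N circularity flag in column 4. Check the header of your own file before relying on this. The function name `summarize_assembly_info` is hypothetical.

```shell
# Summarize a Flye assembly_info.txt file.
# Assumed layout: tab-separated, '#' header line, column 2 = length,
# column 4 = circular flag (Y/N).
summarize_assembly_info() {
    # $1 = path to assembly_info.txt
    awk -F '\t' '
        /^#/ { next }                               # skip the header line
        {
            n++
            total += $2
            if ($2 + 0 > max) max = $2 + 0          # track the longest contig
            if ($4 == "Y") circ++                   # count circular contigs
        }
        END {
            printf "contigs=%d total_bp=%d longest_bp=%d circular=%d\n", n, total, max, circ
        }
    ' "$1"
}
```

For the run in this episode the call would be `summarize_assembly_info flye_ont/assembly_info.txt`.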
Quick look at metrics for this assembly:
BASH
ml --force purge
ml biocontainers
ml quast
quast.py \
--fast \
--threads ${SLURM_CPUS_ON_NODE} \
-o quast_basic_stats \
flye_ont/assembly.fasta
Which of these assemblies looks better?
Check the quast_basic_stats/report.txt file for assembly
statistics. Compared with your previous hifiasm assembly,
which assembly do you think is better? Which metrics
are you using to make this decision? Discuss which assembly has better
contiguity and completeness based on these statistics.
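N50 is one of the contiguity metrics reported by QUAST. To make its definition concrete, here is a minimal sketch that computes it directly from a FASTA file. The function name `fasta_n50` is hypothetical, and QUAST remains the tool to use for real comparisons.

```shell
# Compute N50: the length of the contig at which the running total of
# contig lengths, sorted longest first, reaches half of the assembly size.
fasta_n50() {
    # $1 = FASTA file (sequences may be wrapped over several lines)
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }   # new record: emit previous length
        { len += length($0) }                            # accumulate sequence length
        END { if (len > 0) print len }                   # emit the last record
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                running += lens[i]
                if (running >= half) { print lens[i]; exit }
            }
        }
    '
}
```

Running `fasta_n50 flye_ont/assembly.fasta` should give a value comparable to the N50 line in the QUAST report, although QUAST applies its own minimum contig length filter by default, so small differences are possible.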
Other important parameters
Flye provides several additional
parameters that can be used to customize the assembly process and
improve results. Some key points:
- Pick the right input type (--nano-hq, --pacbio-hifi, etc.); an incorrect selection affects accuracy.
- Always specify --out-dir and --threads for organized and faster runs.
- Use --keep-haplotypes if you don't want a collapsed assembly.
- For metagenomes, use --meta to handle variable coverage.
- If the assembly fails, use --resume to avoid losing progress.
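One way to put the resume advice into practice is a small wrapper that adds --resume only when the output directory already contains a log from an earlier run. The sketch below is a dry run: it prints the command rather than executing it. The function name `flye_command` and the test for flye.log are assumptions made for illustration, not Flye conventions.

```shell
# Print (not run) a Flye command, adding --resume when the output
# directory already holds a flye.log from an earlier run.
flye_command() {
    # $1 = reads, $2 = genome size, $3 = output directory, $4 = threads
    cmd="flye --nano-hq $1 --genome-size $2 --out-dir $3 --threads $4"
    if [ -f "$3/flye.log" ]; then
        cmd="$cmd --resume"
    fi
    echo "$cmd"
}
```

For example, `flye_command At_ont-reads-filtered.fastq 135m flye_ont 32` prints the command used earlier in this episode, with --resume appended if flye_ont/flye.log exists.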
Interested in exploring more about Flye?
Check out the Flye FAQ for answers to common questions and troubleshooting tips. You can also explore the Flye GitHub repository for the latest updates, documentation, and discussions about the assembler.
Improving Assembly Quality with Polishing (Optional)
After generating the initial assembly, it is often beneficial to polish the assembly to improve base-level accuracy. Polishing involves aligning the raw reads back to the assembly and correcting errors to produce a more accurate consensus sequence. This step can significantly enhance the quality of the assembly, especially for error-prone long-read data like ONT reads.
Flye provides built-in polishing and performs one polishing iteration by default; the number of iterations can be changed with --iterations. You can also polish separately after the initial assembly with a dedicated tool.
A recommended approach for polishing ONT assemblies is to use
Medaka, which uses neural networks trained on ONT data to
correct errors:
BASH
ml --force purge
ml biocontainers
ml medaka
medaka_consensus \
-i At_ont-reads-filtered.fastq \
-d flye_ont/assembly.fasta \
-o medaka_polished \
-t ${SLURM_CPUS_ON_NODE} \
-m r1041_e82_400bps_hac_v5.0.0
Medaka model selection
The -m flag specifies the Medaka model, which should
match your basecalling chemistry and model. For our data (R10.4.1,
Dorado HAC), we use r1041_e82_400bps_hac_v5.0.0. You can
list available models with medaka tools list_models. Using the
wrong model will produce suboptimal results.
There are other polishing tools available, such as Racon
and Nanopolish, which can be used to further refine the ONT
assembly. Each tool has its strengths and limitations, so it is
recommended to try different polishing strategies to achieve the best
results for your specific dataset. Medaka is the current
recommended choice for polishing ONT assemblies due to its accuracy and
efficiency. For the latest basecalling and polishing, ONT’s
Dorado toolkit is also emerging as an integrated
solution.
HiFiasm for ONT Data (Optional)
Since HiFiasm also supports ONT reads, we can run it on the same data to assess how its assembly compares. The basic command structure is as follows:
BASH
ml --force purge
ml biocontainers
ml hifiasm
hifiasm \
-t ${SLURM_CPUS_ON_NODE} \
-o athaliana_ont.asm \
--ont \
At_ont-reads-filtered.fastq
Post processing
Once the run completes (~30 mins with 32 threads), you can convert GFA to FASTA using the following command:
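A common way to do this conversion is to print the name and sequence of every segment (`S`) line in the GFA file. The sketch below wraps that in a function; the name `gfa_to_fasta` is hypothetical.

```shell
# Convert the segment (S) lines of a GFA file to FASTA records.
gfa_to_fasta() {
    # $1 = GFA file; FASTA is written to standard output
    awk -F '\t' '$1 == "S" { print ">" $2; print $3 }' "$1"
}
```

A call such as `gfa_to_fasta athaliana_ont.asm.bp.p_ctg.gfa > athaliana_ont.asm.bp.p_ctg.fasta` would produce the primary contig FASTA, assuming hifiasm wrote its primary contig graph under that name; check the actual file names in your output directory.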
Get the basic stats using quast:
BASH
ml --force purge
ml biocontainers
ml quast
quast.py \
--fast \
--threads ${SLURM_CPUS_ON_NODE} \
-o quast_basic_stats_hifiasm \
*_ctg.fasta
and run compleasm to get the assembly completeness:
- ONT provides long-read sequencing data with high error rates.
- Flye is a long-read assembler optimized for handling ONT data and producing highly contiguous assemblies.
- The Flye assembly workflow involves read preprocessing, repeat graph construction, graph resolution, polishing, and post-processing.
- Flye output includes the final assembly sequence, assembly graph, and summary information for evaluation.
- Polishing the assembly can improve base-level accuracy and overall assembly quality.
- Flye provides built-in polishing capabilities, and other tools like Racon, Nanopolish, and Medaka can be used for further refinement.