Oxford Nanopore Assembly using Flye
Last updated on 2025-02-19
Overview
Questions
- What are the key features of ONT reads?
- Why is Flye good for assembling ONT reads?
- What are the main steps in the Flye assembly workflow?
- How can you evaluate the quality of a Flye assembly?
Objectives
- Understand the characteristics of ONT reads.
- Learn about the Flye assembler and its advantages for ONT data.
- Explore the key steps in the Flye assembly workflow.
- Evaluate the quality of a Flye assembly using common metrics.
Introduction to ONT reads and Flye Assembler
Oxford Nanopore Technologies (ONT) has revolutionized sequencing by providing long-read data, enabling the resolution of complex genomic structures that were previously intractable with short-read technologies. However, ONT reads are error-prone, necessitating specialized assembly algorithms that can handle high sequencing error rates while maximizing contiguity and accuracy.
Traditional assemblers designed for short reads rely on de Bruijn graph approaches, which break sequences into fixed-length k-mers and struggle with error-rich long reads. In contrast, modern long-read assemblers like Flye use alternative graph-based strategies to overcome these limitations. Flye specifically constructs repeat graphs to accurately reconstruct genomes while addressing challenges posed by structural variations and repeats. This makes it particularly well-suited for ONT data, producing high-quality, contiguous assemblies for genomes ranging from small microbial genomes to large eukaryotic ones.
The latest ultra-long ONT reads, such as those generated by the PromethION platform, have further improved assembly quality and contiguity. Flye can leverage these ultra-long reads to generate even more accurate and contiguous assemblies, making it a powerful tool for a wide range of genomic analyses.
Installation and Setup
Flye is available as a module on RCAC clusters. Load it with ml biocontainers followed by ml flye, as in the commands used later in this lesson.
You can also use the Singularity container for Flye, which provides a consistent environment across different systems. The container can be pulled from the BioContainers registry.
Overview of Flye Assembler
Flye is a de novo assembler designed for high-error, long-read sequencing data from Oxford Nanopore Technologies (ONT) and PacBio. It is optimized to handle the inherent noise in single-molecule sequencing (SMS) reads while producing highly contiguous assemblies. Flye is particularly well-suited for assembling complex genomes, resolving repetitive regions, and reconstructing structural variations that short-read assemblers struggle with.
The Flye assembly workflow typically involves:
- Read preprocessing – filtering and quality-checking raw ONT reads
- Disjointig generation – constructing long, error-prone sequences from overlapping reads
- Repeat graph construction – building a repeat-aware assembly graph to represent genome structure
- Graph resolution – disentangling repeats and structural variations to produce accurate contigs
- Polishing – refining assemblies to improve base-level accuracy using read alignment
- Post-processing – assessing assembly quality and generating final output
Flye is optimized for this process, leveraging repeat graph-based assembly to generate longer, more contiguous sequences than many traditional long-read assemblers. Its ability to handle highly repetitive regions, coupled with its fast runtime and efficient memory usage, makes it a powerful choice for ONT genome assembly.
Flye: basic workflow
To run flye, you need to provide the input long reads in FASTA or FASTQ format, specify the long-read type, and provide an estimated genome size, an output directory, and the number of threads to use. The basic command structure is as follows:
BASH
ml --force purge
ml biocontainers
ml flye
flye \
--nano-raw At_ont-reads-filtered.fastq \
--genome-size 135m \
--out-dir flye_ont \
--threads ${SLURM_CPUS_ON_NODE}
Options used
- --nano-raw specifies the input ONT long reads in FASTQ format
- --genome-size provides an estimate of the genome size to guide assembly
- --out-dir specifies the output directory for Flye results
- --threads specifies the number of CPU threads to use for assembly
The input can either be fastq or fasta, compressed or uncompressed. The output will be stored in the directory provided.
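Before launching a long assembly, it can be useful to check how much data the input file holds relative to the estimated genome size. The following is a minimal sketch, not part of Flye: the helper name fastq_coverage is hypothetical, it assumes a standard 4-line-per-record FASTQ (plain or gzip-compressed), and the genome size must be given as a plain number of bases (for example 135000000 rather than 135m).

```shell
# Summarize a FASTQ file: read count, total bases, approximate coverage.
# Usage: fastq_coverage <reads.fastq[.gz]> <genome_size_in_bases>
# Assumption: 4 lines per record (no wrapped sequence lines).
fastq_coverage() {
    reads=$1
    genome_size=$2
    # Decompress on the fly if the file name ends in .gz
    case "$reads" in
        *.gz) reader="gzip -cd" ;;
        *)    reader="cat" ;;
    esac
    # The sequence is the 2nd line of every 4-line record
    $reader "$reads" | awk -v gs="$genome_size" '
        NR % 4 == 2 { n++; bases += length($0) }
        END {
            printf "reads\t%d\n", n
            printf "bases\t%.0f\n", bases
            printf "coverage\t%.1f\n", (gs > 0 ? bases / gs : 0)
        }'
}
```

For the dataset in this lesson, a call might look like fastq_coverage At_ont-reads-filtered.fastq 135000000. The reported coverage is only a rough estimate, since it depends on how accurate the genome size guess is.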
Understanding Flye Output
The output of Flye includes several files and directories that provide information about the assembly process and results. Key components of the Flye output include:
File/Folder | Description |
---|---|
00-assembly/ | Initial draft assembly output. |
10-consensus/ | Consensus refinement step. |
20-repeat/ | Repeat graph construction and analysis. |
30-contigger/ | Final contig generation step. |
40-polishing/ | Final polishing step for improving assembly quality. |
assembly.fasta | Final polished assembly sequence. |
assembly_graph.gfa | Final assembly graph in GFA format. |
assembly_graph.gv | Visualization of final assembly graph. |
assembly_info.txt | Summary information about the assembly. |
flye.log | Log file detailing the Flye run. |
Which file should I use as my final assembly?
The assembly.fasta file contains the final polished assembly sequence and is typically used as the primary output for downstream analyses. This file represents the best estimate of the assembled genome based on the input data and the Flye assembly process. You can use this file for further analyses, such as gene prediction, variant calling, or comparative genomics studies.
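The assembly_info.txt file listed in the table above can be summarized quickly from the command line. The sketch below assumes the tab-separated layout described in the Flye documentation (a header line starting with #, then sequence name, length, coverage, circular flag, and repeat flag in the first five columns). The function name summarize_assembly_info is hypothetical, and the column positions should be checked against the header of your own file.

```shell
# Summarize Flye's assembly_info.txt.
# Assumed columns: 1=seq_name 2=length 3=coverage 4=circular(Y/N) 5=repeat(Y/N)
summarize_assembly_info() {
    awk -F '\t' '
        /^#/ { next }                          # skip the header line
        {
            n++
            total += $2
            if ($2 + 0 > max + 0) max = $2     # track the longest contig
            if ($4 == "Y") circ++              # contigs flagged as circular
            if ($5 == "Y") rep++               # contigs flagged as repeats
        }
        END {
            printf "contigs\t%d\n", n
            printf "total_length\t%.0f\n", total
            printf "longest\t%.0f\n", max
            printf "circular\t%d\n", circ
            printf "repeats\t%d\n", rep
        }' "$1"
}
```

With the output directory used in this lesson, the call would be summarize_assembly_info flye_ont/assembly_info.txt.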
Quick look at metrics for this assembly:
BASH
ml --force purge
ml biocontainers
ml quast
quast.py \
--fast \
--threads ${SLURM_CPUS_ON_NODE} \
-o quast_basic_stats \
flye_ont/assembly.fasta
Which of these assemblies looks better?
Check the quast_basic_stats/report.txt file for assembly statistics. Based on your previous assembly using hifiasm, which assembly do you think is better? What metrics are you using to make this decision? Discuss which assembly has better contiguity and completeness based on these statistics.
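To make the contiguity comparison more concrete, it may help to see how N50 is derived: sort the contig lengths from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The sketch below illustrates that calculation with standard tools. It is not a replacement for quast, and the function name fasta_n50 is hypothetical.

```shell
# Print contig count, total length, longest contig, and N50 for a FASTA file.
fasta_n50() {
    # Step 1: emit one length per sequence (wrapped FASTA lines are summed)
    awk '
        /^>/ { if (len) printf "%.0f\n", len; len = 0; next }
        { len += length($0) }
        END { if (len) printf "%.0f\n", len }' "$1" |
    # Step 2: sort lengths from longest to shortest
    sort -rn |
    # Step 3: walk down the list until half of the total length is covered
    awk '
        { lens[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run >= half) { n50 = lens[i]; break }
            }
            printf "contigs\t%d\n", NR
            printf "total_length\t%.0f\n", total
            printf "longest\t%.0f\n", lens[1]
            printf "N50\t%.0f\n", n50
        }'
}
```

Running fasta_n50 flye_ont/assembly.fasta should give values comparable to the corresponding rows in the quast report, although quast may apply a minimum contig length filter that this sketch does not.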
Other important parameters
flye provides several additional parameters that can be used to customize the assembly process and improve results. Some key parameters include:
- Pick the right input type (--nano-hq, --pacbio-hifi, etc.) → Incorrect selection affects accuracy.
- Always specify --out-dir and --threads for faster and organized runs.
- Use --keep-haplotypes if you don’t want a collapsed assembly.
- For metagenomes, use --meta to handle variable coverage.
- If the assembly fails, use --resume to avoid losing progress.
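As an illustration of how --resume fits into a rerun, the sketch below builds a Flye command line and appends --resume when the output directory already contains a flye.log from an earlier attempt. The helper flye_command is hypothetical and only prints the command, it does not run Flye. Using the presence of flye.log as the signal for a previous run is an assumption made for this example.

```shell
# Print (do not run) a Flye command, adding --resume when the output
# directory already holds a flye.log from an earlier attempt.
# Arguments: <read-type flag> <reads> <genome size> <out dir> <threads>
flye_command() {
    cmd="flye $1 $2 --genome-size $3 --out-dir $4 --threads $5"
    if [ -f "$4/flye.log" ]; then
        # A previous run left a log behind: continue from the last completed stage
        cmd="$cmd --resume"
    fi
    printf '%s\n' "$cmd"
}
```

For example, flye_command --nano-hq At_ont-reads-filtered.fastq 135m flye_ont 16 prints the command so that it can be inspected before being pasted into a job script.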
Interested in exploring more about Flye?
Check out the Flye FAQ for answers to common questions and troubleshooting tips. You can also explore the Flye GitHub repository for the latest updates, documentation, and discussions about the assembler.
Improving Assembly Quality with Polishing (Optional)
After generating the initial assembly, it is often beneficial to polish the assembly to improve base-level accuracy. Polishing involves aligning the raw reads back to the assembly and correcting errors to produce a more accurate consensus sequence. This step can significantly enhance the quality of the assembly, especially for error-prone long-read data like ONT reads.
Flye provides built-in polishing capabilities. By default, Flye performs one round of polishing to refine the assembly. However, you can customize the polishing process by running polishing separately after the initial assembly.
An example command to polish assembly with accurate PacBio HiFi reads:
BASH
ml --force purge
ml biocontainers
ml flye
flye \
--polish-target flye_ont/assembly.fasta \
--pacbio-hifi At_pacbio-hifi-filtered.fastq \
--genome-size 135m \
--iterations 1 \
--out-dir flye_ont_polished \
--threads ${SLURM_CPUS_ON_NODE}
You can also provide a BAM file as input instead of reads.
There are many other polishing tools available, such as Racon, Nanopolish, and Medaka, which can be used to further refine the ONT assembly. Each tool has its strengths and limitations, so it is recommended to try different polishing strategies to achieve the best results for your specific dataset. Medaka is a popular choice for polishing ONT assemblies due to its accuracy and efficiency.
HiFiasm for ONT Data (Optional)
Since HiFiasm also supports ONT reads for assembly, we can test it out to assess the quality of the resulting assembly. The basic command structure is as follows:
BASH
ml --force purge
ml biocontainers
ml hifiasm
hifiasm \
-t ${SLURM_CPUS_ON_NODE} \
-o athaliana_ont.asm \
--ont \
At_ont-reads-filtered.fastq
Post processing
Once the run completes (~30 mins with 32 threads), you can convert GFA to FASTA using the following command:
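One common way to do this conversion is to pull the segment (S) records out of the GFA with awk. The sketch below assumes the standard GFA layout, in which an S line holds the segment name in column 2 and its sequence in column 3. The function name gfa_to_fasta is hypothetical, and the file name in the usage note assumes hifiasm's usual primary-contig output for the -o prefix used above, so check the actual file names in your directory.

```shell
# Convert the segment (S) records of a GFA file into FASTA on stdout.
# GFA S line: S <tab> name <tab> sequence [<tab> optional tags]
gfa_to_fasta() {
    awk -F '\t' '$1 == "S" { print ">" $2; print $3 }' "$1"
}
```

A call such as gfa_to_fasta athaliana_ont.asm.bp.p_ctg.gfa > athaliana_ont.asm.bp.p_ctg.fasta would produce a FASTA file matching the *_ctg.fasta pattern used in the quast step.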
Get the basic stats using quast:
BASH
ml --force purge
ml biocontainers
ml quast
quast.py \
--fast \
--threads ${SLURM_CPUS_ON_NODE} \
-o quast_basic_stats \
*_ctg.fasta
Then run compleasm to assess the assembly completeness.
Key Points
- ONT provides long-read sequencing data with high error rates.
- Flye is a long-read assembler optimized for handling ONT data and producing highly contiguous assemblies.
- The Flye assembly workflow involves read preprocessing, repeat graph construction, graph resolution, polishing, and post-processing.
- Flye output includes the final assembly sequence, assembly graph, and summary information for evaluation.
- Polishing the assembly can improve base-level accuracy and overall assembly quality.
- Flye provides built-in polishing capabilities, and other tools like Racon, Nanopolish, and Medaka can be used for further refinement.