Oxford Nanopore Assembly using Flye

Last updated on 2026-02-18

Estimated time: 60 minutes

Overview

Questions

What are the key features of ONT reads?
Why is Flye good for assembling ONT reads?
What are the main steps in the Flye assembly workflow?
How can you evaluate the quality of a Flye assembly?

Objectives

Understand the characteristics of ONT reads.
Learn about the Flye assembler and its advantages for ONT data.
Explore the key steps in the Flye assembly workflow.
Evaluate the quality of a Flye assembly using common metrics.

Introduction to ONT reads and Flye Assembler

Oxford Nanopore Technologies (ONT) has revolutionized sequencing by providing long-read data, enabling the resolution of complex genomic structures that were previously intractable with short-read technologies. However, ONT reads have historically been more error-prone than short reads, necessitating specialized assembly algorithms that tolerate sequencing errors while maximizing contiguity and accuracy.

Traditional assemblers designed for short reads rely on de Bruijn graph approaches, which break sequences into fixed-length k-mers and struggle with error-rich long reads. In contrast, modern long-read assemblers like Flye use alternative graph-based strategies to overcome these limitations. Flye specifically constructs repeat graphs to accurately reconstruct genomes while addressing challenges posed by structural variations and repeats. This makes it particularly well-suited for ONT data, producing high-quality, contiguous assemblies for genomes ranging from small microbial genomes to large eukaryotic ones.

The latest ultra-long ONT reads, such as those generated by the PromethION platform, have further improved assembly quality and contiguity. Flye can leverage these ultra-long reads to generate even more accurate and contiguous assemblies, making it a powerful tool for a wide range of genomic analyses.

Diagram showing how ultra-long ONT reads enable telomere-to-telomere genome assemblies

Installation and Setup

Flye is available as a module on RCAC clusters. You can load the module using the following commands:

BASH

ml --force purge
ml biocontainers
ml flye
flye --version

You can also use an Apptainer (formerly Singularity) container for Flye, which provides a consistent environment across different systems. The image can be pulled from the BioContainers registry and tested using the following commands:

BASH

apptainer pull docker://quay.io/biocontainers/flye:2.9.5--py311h2de2dd3_2
apptainer exec flye_2.9.5--py311h2de2dd3_2.sif flye --version
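Because the commands in this episode use the --nano-hq option, it can be worth confirming that the loaded Flye is recent enough. The sketch below assumes that --nano-hq was introduced in Flye 2.9 and that `flye --version` prints something like `2.9.5-b1801`. The helper name `version_at_least` is made up for this example, and it relies on `sort -V` from GNU coreutils.

```bash
# Return success (exit code 0) if version $1 is at least version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_at_least() {
    local have="$1" need="$2"
    # If the required version sorts first (or ties), the installed one is new enough.
    [ "$(printf '%s\n%s\n' "$need" "$have" | sort -V | head -n 1)" = "$need" ]
}

# Example (requires flye on PATH). The ${var%%-*} expansion strips a build
# suffix such as "-b1801" before comparing:
#   flye_version="$(flye --version)"
#   version_at_least "${flye_version%%-*}" 2.9 && echo "--nano-hq should be available"
```

If the check fails, load a newer module or use the container shown above.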

Overview of Flye Assembler

Flye is a de novo assembler designed for high-error, long-read sequencing data from Oxford Nanopore Technologies (ONT) and PacBio. It is optimized to handle the inherent noise in single-molecule sequencing (SMS) reads while producing highly contiguous assemblies. Flye is particularly well-suited for assembling complex genomes, resolving repetitive regions, and reconstructing structural variations that short-read assemblers struggle with.

The Flye assembly workflow typically involves:

Read preprocessing – filtering and quality-checking raw ONT reads
Disjointig generation – constructing long, error-prone sequences from overlapping reads
Repeat graph construction – building a repeat-aware assembly graph to represent genome structure
Graph resolution – disentangling repeats and structural variations to produce accurate contigs
Polishing – refining assemblies to improve base-level accuracy using read alignment
Post-processing – assessing assembly quality and generating final output

Flye is optimized for this process, leveraging repeat graph-based assembly to generate longer, more contiguous sequences than many traditional long-read assemblers. Its ability to handle highly repetitive regions, coupled with its fast runtime and efficient memory usage, makes it a powerful choice for ONT genome assembly.

Overview of the Flye assembler workflow from disjointig generation through repeat graph construction to polished contigs

Flye: basic workflow

To run Flye, provide the input long reads in FASTA or FASTQ format using the option that matches the read type, along with an output directory and the number of threads to use. An estimated genome size can also be supplied (it is optional since Flye 2.8). The basic command structure is as follows:

BASH

ml --force purge
ml biocontainers
ml flye
flye \
  --nano-hq ../01_data-qc/At_ont-reads-filtered.fastq \
  --genome-size 135m \
  --out-dir flye_ont \
  --threads ${SLURM_CPUS_PER_TASK}

Callout

Options used

--nano-hq specifies high-quality ONT reads (Dorado HAC/SUP basecalled, R10.4+ chemistry, typically Q20+)
--genome-size provides an estimate of the genome size to guide assembly
--out-dir specifies the output directory for Flye results
--threads specifies the number of CPU threads to use for assembly

The input can either be fastq or fasta, compressed or uncompressed. The output will be stored in the directory provided.
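Before launching the assembly, a rough coverage estimate (total bases divided by the estimated genome size) helps confirm that the reads and the --genome-size value are in the expected range. The function below is a minimal sketch, not part of Flye. It assumes an uncompressed FASTQ with exactly four lines per record, and the name `estimate_coverage` is hypothetical.

```bash
# Print read count, total bases, and estimated coverage for an
# uncompressed FASTQ file, assuming four lines per record.
# Usage: estimate_coverage reads.fastq genome_size_in_bp
estimate_coverage() {
    local fastq="$1" genome_size="$2"
    awk -v g="$genome_size" '
        NR % 4 == 2 { reads++; bases += length($0) }   # line 2 of each record is the sequence
        END {
            printf "reads=%d bases=%.0f coverage=%.1fx\n", reads, bases, bases / g
        }' "$fastq"
}
```

For the dataset used here, this could be called as `estimate_coverage ../01_data-qc/At_ont-reads-filtered.fastq 135000000`. Compressed reads would need to be decompressed first.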

Callout

Choosing the right Flye input type

Flye provides several input options depending on your read type and quality:

--nano-hq: For Dorado HAC/SUP basecalled reads with R10.4+ chemistry (Q20+). This is what we use for our data.
--nano-raw: For older, lower-quality ONT reads (R9, fast basecalling, <Q10)
--pacbio-hifi: For PacBio HiFi/CCS reads
--pacbio-raw: For PacBio CLR (continuous long reads)

Using the wrong input type can significantly affect assembly quality. Since our ONT data was basecalled with Dorado HAC on R10.4.1 chemistry, --nano-hq is the correct choice.
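When the same script is reused for several datasets, the mapping above can be made explicit so that an unrecognized read type stops the script instead of silently using the wrong mode. This is only an illustrative sketch: the labels `ont-hq`, `ont-raw`, `pacbio-hifi`, and `pacbio-clr` and the function name are invented for this example, while the Flye options are the ones listed above.

```bash
# Map a read-type label (hypothetical names) to the matching Flye input option.
flye_read_flag() {
    case "$1" in
        ont-hq)      echo "--nano-hq" ;;      # Dorado HAC/SUP, R10.4+ chemistry
        ont-raw)     echo "--nano-raw" ;;     # older, lower-quality ONT reads
        pacbio-hifi) echo "--pacbio-hifi" ;;  # PacBio HiFi/CCS reads
        pacbio-clr)  echo "--pacbio-raw" ;;   # PacBio CLR reads
        *)           echo "unknown read type: $1" >&2; return 1 ;;
    esac
}
```

A script could then use `flag="$(flye_read_flag ont-hq)" || exit 1` before building the Flye command.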

Callout

While you wait

Flye will take approximately 30-60 minutes with 32 threads on the A. thaliana dataset. While you wait, you can:

Review the Flye output file descriptions in the table below
Compare the Flye assembly graph approach to the HiFiasm approach from the previous episode
Read about Flye’s repeat graph algorithm

Understanding Flye Output

The output of Flye includes several files and directories that provide information about the assembly process and results. Key components of the Flye output include:

File/Folder	Description
00-assembly/	Initial draft assembly output.
10-consensus/	Consensus refinement step.
20-repeat/	Repeat graph construction and analysis.
30-contigger/	Final contig generation step.
40-polishing/	Final polishing step for improving assembly quality.
assembly.fasta	Final polished assembly sequence.
assembly_graph.gfa	Final assembly graph in GFA format.
assembly_graph.gv	Final assembly graph in Graphviz (DOT) format, for visualization.
assembly_info.txt	Summary information about the assembly.
flye.log	Log file detailing the Flye run.
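A quick way to confirm that a run reached the end is to check that every item in the table exists in the output directory. The function below is a small sketch built only from the file and folder names listed above. The name `check_flye_outputs` is hypothetical.

```bash
# Report any expected Flye output that is missing from the given directory.
# Returns 0 when everything is present, 1 otherwise.
check_flye_outputs() {
    local out_dir="$1" missing=0 item
    for item in 00-assembly 10-consensus 20-repeat 30-contigger 40-polishing \
                assembly.fasta assembly_graph.gfa assembly_graph.gv \
                assembly_info.txt flye.log; do
        if [ ! -e "${out_dir}/${item}" ]; then
            echo "missing: ${item}"
            missing=1
        fi
    done
    return "$missing"
}
```

For the run in this episode, `check_flye_outputs flye_ont` should print nothing if the assembly completed.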

Expected assembly_info.txt (top 15 contigs)

Contig	Length (Mb)	Coverage	Circular	Repeat	Multiplicity
contig_5	15.52	38x	N	N	1
contig_19	14.50	39x	N	N	1
contig_20	14.08	37x	N	N	1
contig_7	12.73	40x	N	N	1
contig_8	11.82	32x	N	N	1
contig_36	11.24	38x	N	N	1
contig_6	10.09	35x	N	N	1
contig_23	8.77	39x	N	N	1
contig_102	6.39	40x	N	N	1
contig_12	6.23	34x	N	N	1
contig_58	3.27	31x	N	N	1
contig_11	3.24	27x	N	N	1
contig_13	3.23	33x	N	N	1
contig_10	2.35	29x	N	N	1
contig_15	0.94	22x	N	N	1

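The assembly has 50 contigs in total. The largest contigs (11 to 16 Mb each) are chromosome-scale pieces, but because the five A. thaliana chromosomes are each roughly 18 to 30 Mb long, these contigs most likely represent chromosome arms or other large segments rather than complete chromosomes. Notable entries include contig_104 (84 kb, 2769x coverage, repeat=Y, mult=64) and contig_34 (26 kb, 4808x, repeat=Y, mult=111), which represent collapsed repetitive elements (likely rDNA or centromeric repeats).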
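Collapsed repeats like these can be pulled out of assembly_info.txt with a short awk filter. The sketch below assumes the file is tab-separated with the columns in the order seq_name, length, cov., circ., repeat, mult., and that the header line starts with `#`. The function name and the default multiplicity threshold of 2 are choices made for this example.

```bash
# List contigs flagged as repeats, or with multiplicity at or above a threshold.
# Assumes Flye's tab-separated assembly_info.txt column order:
#   seq_name, length, cov., circ., repeat, mult., ...
# Usage: list_repeat_contigs assembly_info.txt [min_multiplicity]
list_repeat_contigs() {
    local info="$1" min_mult="${2:-2}"
    awk -F '\t' -v m="$min_mult" '
        /^#/ { next }                      # skip the header line
        $5 == "Y" || $6 >= m {
            printf "%s\t%.1f kb\t%sx\tmult=%s\n", $1, $2 / 1000, $3, $6
        }' "$info"
}
```

Running `list_repeat_contigs flye_ont/assembly_info.txt` should list entries such as contig_104 and contig_34 described above.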

Prerequisite

Which file should I use as my final assembly?

The assembly.fasta file contains the final polished assembly sequence and is typically used as the primary output for downstream analyses. This file represents the best estimate of the assembled genome based on the input data and the Flye assembly process. You can use this file for further analyses, such as gene prediction, variant calling, or comparative genomics studies.

Quick look at metrics for this assembly:

BASH

ml --force purge
ml biocontainers
ml quast
quast.py \
    --fast \
    --threads ${SLURM_CPUS_PER_TASK} \
    -o quast_basic_stats \
    flye_ont/assembly.fasta

Prerequisite

Which of these assemblies look better?

Open the quast_basic_stats/report.txt file to review the assembly statistics. Compared with your previous assembly using hifiasm, which assembly do you think is better? What metrics are you using to make this decision? Discuss which assembly has better contiguity and completeness based on these statistics.

Expected QUAST output for Flye ONT assembly

Metric	Flye ONT	HiFiasm HiFi (default)
# Contigs	50	146
Largest contig (Mb)	15.52	13.76
Total length (Mb)	128.74	135.75
N50 (Mb)	11.82	7.98
L50	5	7
auN (Mb)	10.68	7.70
N90 (Mb)	3.24	1.13
# N’s per 100 kbp	0.00	0.00

The Flye ONT assembly is highly contiguous (N50 = 11.82 Mb, only 50 contigs), clearly exceeding the HiFiasm HiFi assembly (N50 = 7.98 Mb, 146 contigs). However, its total size (128.74 Mb) is about 6 Mb below the expected ~135 Mb, suggesting that some genomic regions may be missing or collapsed. Both assemblies are gap-free (0 N's per 100 kbp).
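To make the N50 metric concrete: it is the contig length at which, going from the longest contig downward, the running total first reaches half of the assembly size. The function below is a minimal sketch of that calculation for illustration only. QUAST remains the tool used in this lesson, and the name `fasta_n50` is hypothetical.

```bash
# Print the N50 (in bp) of a FASTA file; sequences may span multiple lines.
fasta_n50() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }   # new record: emit previous length
        { len += length($0) }                            # accumulate sequence length
        END { if (len > 0) print len }
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            # Walk contigs from longest to shortest until half the total is covered.
            for (i = 1; i <= NR; i++) {
                running += lens[i]
                if (running >= total / 2) { print lens[i]; exit }
            }
        }'
}
```

Applied to the numbers above, the five largest Flye contigs sum to about 68.65 Mb, just over half of 128.74 Mb, which is why L50 is 5 and N50 equals the length of the fifth-largest contig (11.82 Mb).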

Expected Compleasm results for Flye ONT assembly

Category	Value
Single (S)	98.93%
Duplicated (D)	1.07%
Fragmented (F)	0.00%
Missing (M)	0.00%

The Flye ONT assembly achieves near-perfect BUSCO completeness with 0% missing genes, indicating comprehensive genome coverage despite the slightly smaller total size (128.74 Mb vs expected ~135 Mb).

Other important parameters

Flye provides several additional parameters that can be used to customize the assembly process and improve results. Some practical guidelines:

Pick the right input type (--nano-hq, --pacbio-hifi, etc.) → Incorrect selection affects accuracy.
Always specify --out-dir and --threads for faster and organized runs.
Use --keep-haplotypes if you don’t want a collapsed assembly.
For metagenomes, use --meta to handle variable coverage.
If the assembly fails, use --resume to avoid losing progress.
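The guidelines above can be combined in a small wrapper that prints the command it would run, which is handy for checking options before submitting a job. This is a hypothetical dry-run helper, not part of Flye. It hard-codes --nano-hq for this dataset and treats an existing flye.log in the output directory as a rough heuristic that a previous run can be resumed.

```bash
# Print (do not run) a Flye command. Hypothetical helper for illustration.
# Usage: flye_command READS OUT_DIR THREADS [extra flye options...]
flye_command() {
    local reads="$1" out_dir="$2" threads="$3"
    shift 3
    local cmd="flye --nano-hq ${reads} --out-dir ${out_dir} --threads ${threads}"
    # Heuristic: add --resume only when an earlier run left a log behind.
    if [ -f "${out_dir}/flye.log" ]; then
        cmd="${cmd} --resume"
    fi
    # Append any optional flags, e.g. --keep-haplotypes or --meta.
    if [ "$#" -gt 0 ]; then
        cmd="${cmd} $*"
    fi
    echo "$cmd"
}
```

For example, `flye_command ../01_data-qc/At_ont-reads-filtered.fastq flye_ont 32 --keep-haplotypes` prints the full command line, which can be inspected and then executed.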

Prerequisite

Interested in exploring more about Flye?

Check out the Flye FAQ for answers to common questions and troubleshooting tips. You can also explore the Flye GitHub repository for the latest updates, documentation, and discussions about the assembler.

Improving Assembly Quality with Polishing (Optional)

After generating the initial assembly, it is often beneficial to polish the assembly to improve base-level accuracy. Polishing involves aligning the raw reads back to the assembly and correcting errors to produce a more accurate consensus sequence. This step can significantly enhance the quality of the assembly, especially for error-prone long-read data like ONT reads.

Flye provides built-in polishing capabilities. By default, Flye performs one round of polishing to refine the assembly; the number of rounds can be changed with --iterations. You can also customize the process by running polishing separately after the initial assembly.

A recommended approach for polishing ONT assemblies is to use Medaka, which uses neural networks trained on ONT data to correct errors:

BASH

ml --force purge
ml biocontainers
ml medaka
medaka_consensus \
   -i ../01_data-qc/At_ont-reads-filtered.fastq \
   -d flye_ont/assembly.fasta \
   -o medaka_polished \
   -t ${SLURM_CPUS_PER_TASK} \
   -m r1041_e82_400bps_hac_v5.0.0

Callout

Medaka model selection

The -m flag specifies the Medaka model, which should match your basecalling chemistry and model. For our data (R10.4.1, Dorado HAC), we use r1041_e82_400bps_hac_v5.0.0. You can list available models with medaka tools list_models. Using the wrong model will produce suboptimal results. The output will be consensus.fasta in the specified output directory.
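To see what a model name encodes, it can be split on underscores. The field labels below (pore, chemistry, speed, basecaller, version) are an interpretation of names such as r1041_e82_400bps_hac_v5.0.0 and should be treated as an assumption rather than an official specification. Older Medaka model names may follow a different pattern, and the function name is hypothetical.

```bash
# Split a Medaka model name into its underscore-separated parts.
# Field labels are an assumed interpretation of names like
# r1041_e82_400bps_hac_v5.0.0, not an official definition.
describe_medaka_model() {
    printf '%s\n' "$1" | awk -F '_' '{
        printf "pore=%s chemistry=%s speed=%s basecaller=%s version=%s\n", $1, $2, $3, $4, $5
    }'
}
```

Checking that the basecaller field (here `hac`) matches how the reads were basecalled is a quick sanity check before starting a long polishing job.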

There are other polishing tools available, such as Racon and Nanopolish, which can be used to further refine the ONT assembly. Each tool has its strengths and limitations, so it is recommended to try different polishing strategies to achieve the best results for your specific dataset. Medaka is the current recommended choice for polishing ONT assemblies due to its accuracy and efficiency. For the latest basecalling and polishing, ONT’s Dorado toolkit is also emerging as an integrated solution.

HiFiasm for ONT Data (Optional)

Since HiFiasm also supports ONT reads for assembly, we can test it to assess the quality of the resulting assembly. The basic command structure is as follows:

BASH

ml --force purge
ml biocontainers
ml hifiasm
mkdir -p hifiasm_ont
hifiasm \
    -t ${SLURM_CPUS_PER_TASK} \
    -o hifiasm_ont/At_hifiasm_ont.asm \
    --ont \
    ../01_data-qc/At_ont-reads-filtered.fastq

Callout

Post processing

Once the run completes (~30 mins with 32 threads), you can convert GFA to FASTA using the following command:

BASH

cd hifiasm_ont
for ctg in *_ctg.gfa; do
    awk '/^S/{print ">"$2"\n"$3}' "${ctg}" > "${ctg%.gfa}.fasta"
done
cd ..
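The conversion works because each segment (S) line in a GFA file holds the sequence name in the second tab-separated field and the sequence itself in the third. The sketch below wraps the same idea in a function so it can be tried on a tiny hand-made GFA. It matches the first field exactly (`$1 == "S"`) instead of using the `/^S/` pattern, and the name `gfa_to_fasta` is hypothetical.

```bash
# Convert the segment (S) lines of a GFA file to FASTA on standard output.
# Field 2 is the segment name, field 3 is the sequence.
gfa_to_fasta() {
    awk -F '\t' '$1 == "S" { print ">" $2; print $3 }' "$1"
}
```

After converting, `grep -c '^>' file.fasta` should match the number of S lines in the GFA (`grep -c '^S' file.gfa`).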

Get the basic stats using quast:

BASH

ml --force purge
ml biocontainers
ml quast
quast.py \
    --fast \
    --threads ${SLURM_CPUS_PER_TASK} \
    -o hifiasm_ont/quast_basic_stats \
    hifiasm_ont/*_ctg.fasta

and run compleasm to get the assembly completeness:

BASH

ml --force purge
ml biocontainers
ml compleasm
for fasta in hifiasm_ont/*_ctg.fasta; do
    compleasm run \
       -a "${fasta}" \
       -o "${fasta%.*}" \
       -l brassicales_odb10 \
       -t "${SLURM_CPUS_PER_TASK}"
done

Expected HiFiasm ONT results

QUAST results

Metric	Primary	Hap1	Hap2
# Contigs	105	104	41
Largest contig (Mb)	13.18	13.18	13.10
Total length (Mb)	127.42	125.76	89.21
N50 (Mb)	11.34	11.34	11.46
L50	6	6	4
auN (Mb)	8.70	8.79	9.90

Compleasm results (primary contigs, brassicales_odb10)

Category	Value
Single (S)	98.61%
Duplicated (D)	1.04%
Fragmented (F)	0.00%
Missing (M)	0.35%

HiFiasm with ONT data produces a highly contiguous assembly (N50 = 11.34 Mb), comparable to Flye ONT. However, HiFiasm generates more contigs (105 vs 50) and has a total size of 127.42 Mb (smaller than Flye’s 128.74 Mb). The BUSCO completeness is excellent at 98.61% single-copy. HiFiasm also provides haplotype-resolved output, with hap2 capturing ~89 Mb of sequence.

Key Points

ONT provides long-read sequencing data with high error rates.
Flye is a long-read assembler optimized for handling ONT data and producing highly contiguous assemblies.
The Flye assembly workflow involves read preprocessing, disjointig generation, repeat graph construction, graph resolution, polishing, and post-processing.
Flye output includes the final assembly sequence, assembly graph, and summary information for evaluation.
Polishing the assembly can improve base-level accuracy and overall assembly quality.
Flye provides built-in polishing capabilities, and other tools like Racon, Nanopolish, and Medaka can be used for further refinement.