Skip to content

Genome assembly using HiFi reads

A short guide to assemble genomes from long reads (PacBio HiFi) using HiFiasm

Manually installing hifiasm is easy. The instructions as listed in the GitHub repository are as follows:

git clone https://github.com/chhylp123/hifiasm
cd hifiasm && make

But in case if you have any problems, you can use the Singularity container. First, download the recipe file from the GitHub gist. Rename the file to hifiasm_v0.20.0-r639.def and build the container

singularity build hifiasm.sif hifiasm_v0.20.0-r639.def

This will create hifiasm.sif file in the current directory.

To run HiFiasm, we need HiFi reads. We will download sample datasets from PacBio website. For this specific tutorial, we will use the PacBio HiFi data for maize B73 genome.

wget https://downloads.pacbcloud.com/public/revio/2023Q1/maize-B73-rep1/m84006_221229_002525_s1.hifi_reads.bam

First step is to convert the HiFi reads in bam format to fasta (or fastq) format. We can use samtools for this.

ml biocontainers
ml samtools
samtools fasta \
-threads ${SLURM_CPUS_ON_NODE} \
m84006_221229_002525_s1.hifi_reads.bam > maize-hifi.fasta

Before proceeding, lets quickly check our input file (read depth and other stats).

ml biocontainers
ml seqkit
seqkit stats \
--all \
--threads ${SLURM_CPUS_ON_NODE} \
--out-file summary-stats.out \
maize-hifi.fasta

This will generate a file summary-stats.out with the statistics of the input file.

Table 1: Summary statistics of the input file maize-hifi.fasta

PropertyValue
filemaize-hifi.fasta
formatFASTA
typeDNA
num_seqs5,142,754
sum_len70,099,174,236
min_len94
avg_len13,630.70
max_len55,776
Q110,374
Q213,501
Q316,921
sum_gap0
N5015,265
Q20(%)0
Q30(%)0
GC(%)45.62

So assuming 2,400Mb genome size for maize, our sequencing depth will be 70,099,174,236 bp / 2,400,000,000 = 29.2x. Ohter stats reported are the number of sequences, minimum, maximum, and average length of the sequences, N50, and GC content. They all reaffirm the quality of the input file.

Now we can run HiFiasm on the fasta file. Be sure to check the documentation for all the options available.

First, we will run hifiasm with default suggested parameters.

singularity exec hifiasm.sif hifiasm \
-t ${SLURM_CPUS_ON_NODE} \
-o maize_B73_hifi.asm \
maize-hifi.fasta

This will generate partially phased contigs with filenames maize_B73_hifi.asm.bp.hap[12].p_ctg.gfa. This pair of files can be thought to represent the two haplotypes in a diploid genome, though with occasional switch errors. The frequency of switches is determined by the heterozygosity of the input sample. Since B73 is an inbred, we can expect most of the genome to be identical to each other in these files.

The other files of importance are .bp.p_ctg.gfa, .bp.p_utg.gfa and .bp.r_utg.gfa (the counterparts of these files have an extension .noseq.gfa instead of .gfa, that contains no sequence and can be used for visualization using programs such as bandage). These files represent the primary contigs, primary unitigs, and redundant unitigs, respectively. The files are usually not necessary and can be ignored for most purposes.

First, let’s check the statistics of the output files. For this, we will need to convert the .gfa files to .fasta format.

for f in *.hap[12].p_ctg.gfa; do
awk '/^S/{print ">"$2"\n"$3}' ${f} > ${f%.gfa}.fasta
done

Now we can check the statistics of the output files.

ml biocontainers
ml quast
quast.py \
--fast \
--threads ${SLURM_CPUS_ON_NODE} \
-o maize_B73_hifi.asm.stats \
*.hap[12].p_ctg.fasta

This will generate a folder maize_B73_hifi.asm.stats with the statistics of the output files.

Table 2: Summary statistics of the assembled contigs

Assemblyhap1.p_ctghap2.p_ctg
# contigs (>= 0 bp)695340
# contigs (>= 1000 bp)695340
# contigs (>= 5000 bp)695340
# contigs (>= 10000 bp)695340
# contigs (>= 25000 bp)648336
# contigs (>= 50000 bp)311245
Total length (>= 0 bp)2,115,818,1972,146,055,694
Total length (>= 1000 bp)2,115,818,1972,146,055,694
Total length (>= 5000 bp)2,115,818,1972,146,055,694
Total length (>= 10000 bp)2,115,818,1972,146,055,694
Total length (>= 25000 bp)2,114,814,2052,145,970,005
Total length (>= 50000 bp)2,102,601,7062,142,223,242
# contigs695340
Largest contig230,900,369229,694,453
Total length2,115,818,1972,146,055,694
N5092,331,732156,748,501
N908,066,86910,922,511
auN110,840,927.5131,176,388.9
L5076
L903923
# N’s per 100 kbp0.000.00

The statistics show that the assembly is of high quality with N50 of 92MB and 156MB for hap1 and hap2, respectively. The largest contig is 230MB and 229MB for hap1 and hap2, respectively, suggest that we probably have an entire chromosome in a single contig. The entire genome (2.1GB) is represented in 695 and 340 contigs for hap1 and hap2!

The batch error file reports various statistics that can be used to further optimize the assembly. Here are some of the important statistics:

Table 3: Log file statistics from the HiFiasm assembler

Stage/MetricEC Round 1EC Round 2EC Round 3Final Statistics
Total bases70,099,174,23670,119,906,16870,120,349,660
Corrected bases103,540,7472,197,978542,087
Distinct minimizers (k-mers)21,419,63655,569,14252,995,420
Total minimizers (k-mers)1,933,199,1401,922,092,1951,921,066,641
Indexed positions1,900,001,0471,919,860,4581,919,551,332
Lowest minimizer count120,08388,08485,356
Highest minimizer count2,173,7282,149,4452,149,330
peak_hom coverage313031
peak_het coverage-1-1-1
Primary contig coverage range---[25, infinity]
Homozygous read coverage threshold---30
Purge duplication coverage threshold---38
Total Overlaps---319,716,290
Strong Overlaps---90,793,983
Weak Overlaps---228,922,307
Exact Overlaps---316,212,759
Inexact Overlaps---3,503,531
Overlaps without Large Indels---319,293,584
Reverse Overlaps---45,529,962
Real Time (seconds)---10,477.790
CPU Time (seconds)---479,529.220
Peak Memory Usage (GB)---68.521

Some ket takeaways from the log file are:

  • The total number of bases in the input file is 70,099,174,236, corresponding to approximately 29.2x coverage of the maize genome.
  • The assembler made significant corrections in the first two rounds, correcting 103,540,747 and 2,197,978 bases, respectively, but only 542,087 in the third round. If the third round number is still too high, consider increasing the -r parameter (which sets the rounds of haplotype-aware error corrections, defaulting to 3).
  • The counts of distinct minimizer k-mers illustrate how the complexity of the data evolves throughout the assembly process.
  • The total number of overlaps, categorized by type (strong, weak, exact, and inexact), provides insights into the quality of the assembly.
  • The peak_hom and peak_het values indicate the levels of heterozygosity, while the homozygous read coverage threshold reflects the value used for the --hom-cov parameter. This can serve as a starting point for fine-tuning this option in future runs.
  • The purge duplication coverage threshold was automatically set to 38. Depending on the observed heterozygosity level, you may want to adjust --purge-max value in subsequent assemblies.
  • The peak memory usage of 68.521 GB is a critical metric for assessing the computational resources required for the assembly process. This information can be used to optimize future runs based on the available resources.

To better understand how well the assembly is constructed, we can use the noseq.gfa files generated by hifiasm to visualize the assembly graph. We can use bandage for this purpose. These files are small and bandage can be run from your local machine. Get the required files and install bandage on your local machine.

rsync -avP negishi:/scratch/negishi/$USER/hifasm/maize_B73_hifi.asm.bp.hap[12].p_ctg.noseq.gfa ./

Now you can run bandage and open the *noseq.gfa files.

Fig1

Figure 1: Bandage visualization of the assembly graph (hap1). While most contigs are constructed from a single unitig, there are a few contigs that maybe joined from multiple unitigs. The largest contig appears to be made of multiple unitigs and terminal region may not be fully resolved.

There are likely other contigs that may not be fully resolved but since we plan to use Optical Genome Mapping (OGM) data to scaffold the assembly, we can proceed with the current assembly. The OGM data will help resolve these regions and provide a more contiguous assembly.

Fig2

Figure 2: Bandage visualization of the assembly graph (hap2). Similar to hap1, most contigs are constructed from a single unitig, and largest contig may not be fully resolved.

After the first round of assembly, you will have the files *.ec.bin, *.ovlp.source.bin, and *.ovlp.reverse.bin. Save these files and try various options to see if you can improve the assembly. First, make a folder to move the .gfa, .fasta, and .bed files. These are the results from the first round of assembly. Second, adjust the parameters in the hifiasm command and run the assembler again. Third, move results to a new folder and compare the results of the first folder. You can re-run the assembly quickly and generate statistics for each of these folders and compare them to see if the changes improved the assembly.

The purge level option (-l) in hifiasm controls the level of purging of haplotigs. The default value is 3, which means it will purge all types of haplotigs in the most aggressive way. If you are running it against a diploid inbred line like B73, this might not be an ideal number. Here is how it alters the assembly at -l 1 and -l 2 and -l 3.

Table 4: Summary statistics of the assembled contigs with different options

Assemblya_ctgp_ctg
# contigs (>= 0 bp)837710
# contigs (>= 10000 bp)831709
# contigs (>= 25000 bp)680663
# contigs (>= 50000 bp)33325
Total length (>= 0 bp) (mb)28.142,235.42
Total length (>= 10000 bp) (mb)28.092,235.41
Total length (>= 25000 bp) (mb)25.022,234.42
Total length (>= 50000 bp) (mb)2.372,222.08
# contigs837710
Largest contig (mb)0.21230.90
Total length (mb)28.142,235.42
N50 (mb)0.04156.47
N90 (mb)0.0220.84
auN (mb)0.04140.18
L503266
L9069318
# N’s per 100 kbp00

In this tutorial, we have assembled the maize B73 genome using PacBio HiFi reads. The assembled contigs are of high quality and can be further scaffolded using Optical Genome Mapping data or Hi-C data. These contigs are purely based on the HiFi reads there could be unresolved or incorrect contigs. Using homology approches, we can further validate the assembly and fix those regions.

1. What other options are available in hifiasm that affect the assembly? There are several options available in hifiasm that can affect the assembly. The important assembly options are listed in the manual page here.

2. Why did I get two haplotypes in the output? The two haplotypes generated by hifiasm are controlled by level of purge duplication option -l. The default value is 3, which means it will purge all types of haplotigs in the most aggressive way. You can disable it using -l 0 option or you can only purge heterozygous overlaps using -l 1.

3. I have too many small contigs/fragmented assembly, how can I reduce them? You can use the -n option to remove unitigs with coverage less than the specified value. The default value is 3 and increasing it will remove unitigs with poor coverage and reduce the number of contigs. The manual also suggests raising -D or -N to improve the resolution of repetitive regions, but it may take longer time to run.

4. My assembly as shown in Bandage has a lot of bubbles/loops, how can I resolve them? The bubbles/loops in the assembly graph can be resolved during the scaffolding step, using either Hi-C data or Optical Genome Mapping data. But you can also try -u to disable post-join step for contigs. This may result in more contigs and poor N50 but may help resolve the bubbles/loops. The manual also says to set smaller value for --purge-max, -s and -O to reduce misassemblies.

5. I’ve more questions! Where can I find more information? See the hifiasm FAQs for more information. If you have specific questions, you can ask the developers directly (via GitHub issues) or ask us for help!