Annotation using Helixer

Last updated on 2026-02-24

Estimated time: 22 minutes

Overview

Questions

How to predict genes using Helixer?
How to download trained models for Helixer?
How to run Helixer on the HPC cluster (Gilbreth)?

Objectives

Learn how to predict genes using Helixer.
Learn how to download trained models for Helixer.
Learn how to run Helixer on the HPC cluster (Gilbreth).

Helixer is a deep learning-based gene prediction tool that uses a convolutional neural network (CNN) to predict genes in eukaryotic genomes. Helixer is trained on a wide range of eukaryotic genomes and can predict genes in both plant and animal genomes. Helixer can predict genes without any extrinsic information such as RNA-seq data or homology information, purely based on the sequence of the genome.

Prerequisite

This section should be run on the Gilbreth HPC cluster.

Because Helixer requires a GPU, you need to run this section on the Gilbreth HPC cluster. You do not need to copy the FASTQ or BAM files; only the athaliana.fasta file from the 00_datasets/genome directory is required.

BASH

WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
mkdir -p ${WORKSHOP_DIR}/05_helixer
cp /depot/workshop/data/annotation_workshop/00_datasets/genome/athaliana.fasta ${WORKSHOP_DIR}/05_helixer/athaliana.fasta
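
Before spending GPU time, it can help to confirm that the copied genome is a readable FASTA file. The following is a minimal sketch, not part of the original workshop: fasta_stats is a hypothetical helper name, and the toy input only exists so the example runs anywhere. On the cluster, point it at ${WORKSHOP_DIR}/05_helixer/athaliana.fasta instead.

```shell
# Summarise a FASTA file: number of records and total bases.
# fasta_stats is a hypothetical helper, not part of Helixer.
fasta_stats() {
    awk '/^>/ { n++; next }                              # header line: one record
         { gsub(/[ \t\r]/, ""); bases += length($0) }    # sequence line: add its length
         END { printf "%d\t%d\n", n, bases }' "$1"
}

# Toy input so the sketch runs anywhere.
demo=$(mktemp)
printf '>chr1\nACGTACGT\nACGT\n>chr2\nGGCC\n' > "$demo"
stats=$(fasta_stats "$demo")
echo "$stats"    # records<TAB>bases
rm -f "$demo"
```

A record count of zero, or a base count far from the expected genome size, suggests the copy was incomplete.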

Setup

Helixer is available as a Singularity/Apptainer container, which you can pull with the apptainer pull command. See the Helixer Docker repository for the image names and more information.

Helixer is installed as a module on the Gilbreth cluster, and can be loaded using the following commands:

BASH

ml --force purge
ml biocontainers
ml helixer
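
A quick way to confirm that the module load worked is to check that the two commands used in this lesson are on your PATH. This is a sketch under the assumption that the module exposes Helixer.py and fetch_helixer_models.py as plain commands, as the rest of this page implies; check_tools is a hypothetical helper name.

```shell
# Report whether each named command is on PATH.
# Returns non-zero if any command is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools Helixer.py fetch_helixer_models.py \
    || echo "Load the helixer module first (ml biocontainers; ml helixer)."
```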

Downloading trained models

Helixer requires a trained model to predict genes. With the included script fetch_helixer_models.py you can download models for specific lineages. Currently, models are available for the following lineages:

land_plant
vertebrate
invertebrate
fungi

You can download the models using the following command:

BASH

# all models
# fetch_helixer_models.py --all
# or for a specific lineage
fetch_helixer_models.py --lineage land_plant

The command above downloads only the land_plant models. To download the models for every lineage, use the --all option instead (shown commented out above).

By default, files are downloaded to the ~/.local/share/Helixer directory. You should see the following files:

.
└── models
    └── land_plant
        ├── land_plant_v0.3_a_0080.h5
        ├── land_plant_v0.3_a_0090.h5
        ├── land_plant_v0.3_a_0100.h5
        ├── land_plant_v0.3_a_0200.h5
        ├── land_plant_v0.3_a_0300.h5
        ├── land_plant_v0.3_a_0400.h5
        ├── land_plant_v0.3_m_0100.h5
        └── land_plant_v0.3_m_0200.h5

land_plant_v0.3_a_0080.h5 is the smallest model and land_plant_v0.3_m_0200.h5 is the largest model. The model size is determined by the number of parameters in the model. The larger models are more accurate but require more memory and time to run.
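
To see which models were actually downloaded and how large each file is, you can list them by size. This is a minimal sketch: list_models is a hypothetical helper name, and the path is the default download location mentioned above.

```shell
# List Helixer model files (*.h5) under a directory, smallest first,
# with sizes in KiB. list_models is a hypothetical helper.
list_models() {
    model_dir="$1"
    if [ ! -d "$model_dir" ]; then
        echo "No model directory at $model_dir" >&2
        return 1
    fi
    find "$model_dir" -type f -name '*.h5' -exec du -k {} + | sort -n
}

list_models "${HOME}/.local/share/Helixer/models" \
    || echo "Run fetch_helixer_models.py first."
```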

Running Helixer

Helixer requires a GPU for prediction, so you need to request a GPU node to run it. You will also need the genome sequence in FASTA format. For this tutorial, we will use the Arabidopsis thaliana genome and the land_plant model to predict genes.

BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --gpus-per-node=1
#SBATCH --time=04:00:00
#SBATCH --account=standby
#SBATCH --job-name=helixer
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Start from a clean environment and load Helixer
ml --force purge
ml biocontainers
ml helixer
# Input genome, species label, and output file
genome=athaliana.fasta
species="Arabidopsis thaliana"
output="athaliana_helixer.gff"
# Quote the variables: the species name contains a space
Helixer.py \
    --lineage land_plant \
    --fasta-path "${genome}" \
    --species "${species}" \
    --gff-output-path "${output}"

A typical job takes about 40 minutes to finish, depending on which GPU the job is allocated.
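
The job script is submitted with sbatch and can be monitored with squeue. The sketch below wraps that in a small guard so it fails with a clear message when run somewhere without SLURM. Both submit_job and the file name helixer.sh are assumptions made for illustration; use whatever name you saved the script under.

```shell
# Submit a SLURM batch script, or explain why it cannot be submitted here.
# submit_job is a hypothetical helper.
submit_job() {
    script="$1"
    if ! command -v sbatch >/dev/null 2>&1; then
        echo "sbatch not found: run this on a cluster login node." >&2
        return 127
    fi
    if [ ! -f "$script" ]; then
        echo "No such script: $script" >&2
        return 1
    fi
    sbatch "$script"
}

# helixer.sh is a hypothetical file name for the job script above.
if submit_job helixer.sh; then
    squeue -u "$USER"    # show your queued and running jobs
fi
```

With the #SBATCH --output and --error settings used in the script, the log files are named after the job name and job ID.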

You can count the number of predicted genes in your GFF3 file using the following command:

BASH

awk '$3=="gene"' athaliana_helixer.gff | wc -l

Helixer predicted 27,201 genes in this GFF output. You can tabulate the feature types in the GFF file using the following command:

BASH

grep -v "^#" athaliana_helixer.gff |\
   cut -f 3 |\
   sort |\
   uniq -c

To extract the CDS and protein (peptide) sequences as FASTA files, you can use the following commands:

BASH

ml --force purge
ml biocontainers
ml cufflinks
gffread athaliana_helixer.gff \
   -g athaliana.fasta \
   -y helixer_pep.fa \
   -x helixer_cds.fa

As you may have noticed, the numbers of mRNA and gene features are the same. This is because Helixer does not predict isoforms, so there is only one transcript per gene. Exons are identified with high confidence, and alternative isoforms are usually collapsed into a single gene model. This is one of the known limitations of Helixer.
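
The gene and mRNA counts can also be compared directly rather than read off the feature table. The following is a minimal sketch: count_feature is a hypothetical helper, and the toy GFF3 records (with made-up IDs) only exist so the example runs anywhere. On the cluster, pass athaliana_helixer.gff instead.

```shell
# Count features of a given type (column 3) in a GFF3 file.
# count_feature is a hypothetical helper.
count_feature() {
    awk -F '\t' -v feat="$2" '!/^#/ && $3 == feat { n++ } END { print n + 0 }' "$1"
}

# Toy GFF3 with made-up IDs so the sketch runs anywhere.
gff=$(mktemp)
{
    printf '##gff-version 3\n'
    printf 'chr1\tHelixer\tgene\t1\t900\t.\t+\t.\tID=g1\n'
    printf 'chr1\tHelixer\tmRNA\t1\t900\t.\t+\t.\tID=g1.1;Parent=g1\n'
    printf 'chr1\tHelixer\texon\t1\t400\t.\t+\t.\tID=g1.1.exon.1;Parent=g1.1\n'
    printf 'chr1\tHelixer\texon\t600\t900\t.\t+\t.\tID=g1.1.exon.2;Parent=g1.1\n'
    printf 'chr1\tHelixer\tgene\t2000\t2500\t.\t-\t.\tID=g2\n'
    printf 'chr1\tHelixer\tmRNA\t2000\t2500\t.\t-\t.\tID=g2.1;Parent=g2\n'
} > "$gff"

genes=$(count_feature "$gff" gene)
mrnas=$(count_feature "$gff" mRNA)
if [ "$genes" -eq "$mrnas" ]; then
    echo "one transcript per gene ($genes genes, $mrnas mRNAs)"
else
    echo "isoforms present ($genes genes, $mrnas mRNAs)"
fi
rm -f "$gff"
```

An annotation that includes alternative isoforms would report more mRNA features than gene features.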

Challenge

Exercise 1: Understanding Helixer GFF Output

Look at the feature counts from the Helixer GFF output. Why is the number of mRNA features the same as the number of gene features?

Show me the solution

The numbers of mRNA and gene features are the same because Helixer does not predict alternative isoforms. Each gene has exactly one transcript (mRNA) associated with it. Unlike tools that incorporate RNA-seq evidence, Helixer’s deep learning model produces a single gene model per locus, collapsing any potential alternative splicing into one representative transcript.

Challenge

Exercise 2: Helixer vs. BRAKER3

What are the advantages and disadvantages of using Helixer compared to BRAKER3 for gene annotation?

Show me the solution

Advantages of Helixer:

Does not require any extrinsic evidence data (RNA-seq or protein data), making it useful when such data is unavailable.
Runs faster when GPU hardware is available.
Simpler setup with fewer input requirements.

Disadvantages of Helixer:

Cannot predict alternative isoforms (only one transcript per gene).
Requires GPU hardware, which may not be available on all HPC clusters.
May be less accurate for species that are poorly represented in the training data.

Key Points

Helixer is a deep learning-based gene prediction tool that uses a convolutional neural network (CNN) to predict genes in eukaryotic genomes.
Helixer can predict genes without any extrinsic information such as RNA-seq data or homology information, purely based on the sequence of the genome.
Helixer requires a trained model and a GPU for prediction.
Helixer writes its gene predictions in GFF3 format, but does not predict isoforms.