Skip to content

GeMoMa to merge annotations

Prerequisites

  • Target genome assembly (fasta format) - we will use B73.v5 genome
  • ab initio predictions (.gff3 format) - we will use Helixer output
  • Homology predictions (.gff3 format) - Sorghum and B97 annotations

Software

  • GeMoMa (version ≥ 1.9 recommended)
  • Dependencies: java >1.8, blast or mmseqs

Gene Model Mapper (GeMoMa) is a homology-based gene prediction tool that transfers protein-coding gene models from one or more reference genomes to a target genome. It can incorporate RNA-seq evidence, filter predictions using customizable criteria, and merge multiple annotations—including ab initio and transcriptome-based predictions—into a unified gene set.

Helixer is a deep learning-based predictor that provides accurate gene predictions from genome sequence alone. However, it does not predict alternative isoforms and tends to collapse all transcript variants into a single flattened gene model, especially in regions with overlapping exons or alternative splicing.

Homology predictions can be used to recover these missing isoforms and provide additional context for gene structure. GeMoMa excels at this task by leveraging evolutionary conservation across related species, allowing it to transfer annotations from well-annotated genomes to the target genome.

By merging Helixer predictions with GeMoMa:

  • You retain Helixer’s high confidence gene boundaries and structure for primary transcripts.
  • You recover missing isoforms and splice variants that Helixer collapses.
  • You gain orthology-supported annotation refinement using multiple maize lines and related species like Sorghum bicolor.
  • You generate a more complete and biologically realistic gene set, even without long-read or isoform-resolving RNA-seq data.

This hybrid strategy takes advantage of both machine learning accuracy and evolutionary conservation to compensate for the limitations of any single method.

In this example, for the B73.v5 genome, we will merge ab initio predictions obtained from Helixer with the homology predictions of its close relatives (Maize inbred B97 and Sorghum).

  1. Download target genome - this is the genome we will be annotating with GeMoMa:
wget https://download.maizegdb.org/Zm-B73-REFERENCE-NAM-5.0/Zm-B73-REFERENCE-NAM-5.0.fa.gz
  1. Download Helixer predictions - these are the ab initio predictions we will be merging with GeMoMa:
wget
  1. Download homology predictions (genome and gff3 files) - these are the annotations from related species that will be used to inform GeMoMa:
wget https://download.maizegdb.org/Zm-B97-REFERENCE-NAM-1.0/Zm-B97-REFERENCE-NAM-1.0.fa.gz
wget https://download.maizegdb.org/Zm-B97-REFERENCE-NAM-1.0/Zm-B97-REFERENCE-NAM-1.0_Zm00018ab.1.gff3.gz
wget https://ftp.sorghumbase.org/release-9/fasta/sorghum_bicolor/dna/Sorghum_bicolor.Sorghum_bicolor_NCBIv3.dna.toplevel.fa.gz
wget https://ftp.sorghumbase.org/release-9/gff3/sorghum_bicolor/Sorghum_bicolor.Sorghum_bicolor_NCBIv3.gff3.gz
  1. Unzip the downloaded files:
gunzip Zm-B73-REFERENCE-NAM-5.0.fa.gz
gunzip Zm-B97-REFERENCE-NAM-1.0_Zm00018ab.1.gff3.gz
gunzip Sorghum_bicolor.Sorghum_bicolor_NCBIv3.dna.toplevel.fa.gz
gunzip Sorghum_bicolor.Sorghum_bicolor_NCBIv3.gff3.gz

GeMoMa is executed using a Java-based CLI pipeline. Here we will run GeMoMa with the target genome, Helixer predictions, and homology-based annotations from related species. We will generate a merged annotation file in GFF3 format.

We will install GeMoMa using BioConda:

module --force purge
module load conda
conda create -n gemoma # Press 'y' when prompted
conda activate gemoma
conda install -c bioconda gemoma

After installation, the GeMoMa JAR file is located under the environment’s shared directory. You can set the path like this:

gemoma_jar="${CONDA_ENVS_PATH}/share/gemoma-1.9-0/GeMoMa-1.9.jar"
run_gemoma.sh
#!/bin/bash
# activate conda environment
module --force purge
module load conda
conda activate gemoma
# Set GeMoMa JAR path
gemoma_jar="${CONDA_ENVS_PATH}/share/gemoma-1.9-0/GeMoMa-1.9.jar"
# Function to check file existence and resolve full path
check_file() {
if [[ ! -f "$1" ]]; then
echo "ERROR: File not found: $1" >&2
exit 1
fi
realpath "$1"
}
# Check and assign input files
genome=$(check_file Zm-B73-REFERENCE-NAM-5.0.fa)
sorghumFa=$(check_file Sorghum_bicolor.Sorghum_bicolor_NCBIv3.dna.toplevel.fa)
sorghumGFF=$(check_file Sorghum_bicolor.Sorghum_bicolor_NCBIv3.gff3)
maize1=$(check_file Zm-B97-REFERENCE-NAM-1.0.fa)
maize1GFF3=$(check_file Zm-B97-REFERENCE-NAM-1.0_Zm00018ab.1.gff3)
helixer=$(check_file Zm-B73-HELIXER-NAM-1.0.gff3)
# Other config
name="B73_v5_GeMoMa" # Prefix for gene names
cpus=${SLURM_CPUS_ON_NODE:-8} # Default to 8 CPUs if not running on SLURM
TMPDIR=${TMPDIR:-/tmp} # Use system temp directory if not set
dir="gemoma_output" # Output directory
# Run GeMoMa
java -Xms100g -Xmx200g -Djava.io.tmpdir=$TMPDIR -jar ${gemoma_jar} CLI GeMoMaPipeline \
t=$genome \
s=own i=sb a=$sorghumGFF g=$sorghumFa \
s=own i=zm a=$maize1GFF3 g=$maize1 \
ID=helixer e=$helixer \
AnnotationFinalizer.u=NO \
AnnotationFinalizer.r=SIMPLE AnnotationFinalizer.p=$name \
AnnotationFinalizer.n=true \
sc=false o=false p=true pc=true pgr=false \
threads=$cpus outdir=$dir &> ${dir}/gemoma.log

Save this script as run_gemoma.sh and run it in a terminal with the command:

chmod +x run_gemoma.sh
./run_gemoma.sh

The options used in the command above are:

OptionExplanation
t=$genomeTarget genome (FASTA) to annotate.
s=ownReference species data will be extracted by GeMoMa from annotation + genome.
i=sbIdentifier for reference species (Sorghum bicolor).
a=$sorghumGFFGFF3 gene annotation for Sorghum bicolor.
g=$sorghumFaFASTA genome for Sorghum bicolor.
i=zmIdentifier for second reference species (e.g., maize B97).
a=$maize1GFF3GFF3 gene annotation for maize B97.
g=$maize1FASTA genome for maize B97.
ID=helixerUnique label for Helixer-based annotation source.
e=$helixerExternal GFF3 annotation to be merged (Helixer output).
AnnotationFinalizer.u=Disable UTR prediction (no RNA evidence available).
AnnotationFinalizer.r=SIMPLERename gene/transcript IDs using a simple prefix-based system.
AnnotationFinalizer.p=$namePrefix used in renaming genes and transcripts.
AnnotationFinalizer.n=trueAdd new names as Name= attributes in GFF3 output.
sc=falseDisable synteny check.
o=falseDo not write individual predictions per reference species.
p=trueOutput predicted proteins as FASTA.
pc=trueOutput predicted CDSs as FASTA.
pgr=falseDo not output predicted genomic regions.
threads=$cpusNumber of compute threads to use.
outdir=$dirOutput directory for all GeMoMa results.

The GeMoMa output will be stored in the specified gemoma_output directory. The main results include:

  • Directorygemoma_output
    • final_annotation.gff merged gff3 annotation file
    • predicted_proteins.fasta predicted proteins
    • predicted_cds.fasta predicted cds sequences
    • gemoma.log the log file

If you examine the log file (gemoma.log), you will see some important stats, including any warnings or errors encountered during the run.

You can also check the number of genes and transcripts in the final annotation:

grep -v '^#' gemoma_output/final_annotation.gff | cut -f3 | sort | uniq -c

Which should give you the counts for all features.

246097 CDS
58847 gene
60137 mRNA

You can clean up the gff file using the genometools utility:

ml --force purge
ml genometools
ml agat
gff3_file=gemoma_output/final_annotation.gff
gt gff3 \
-sort \
-tidy \
-setsource "gemoma" \
-force \
-o B73-gemoma_gt.gff3 \
$gff3_file
agat_convert_sp_gxf2gxf.pl \
-g B73-gemoma_gt.gff3 \
-o B73-gemoma_v1.0.gff3
gffread B73-gemoma_v1.0.gff3 \
-g Zm-B73-REFERENCE-NAM-5.0.fa \
-x B73-gemoma_v1.0.cds.fasta \
-y B73-gemoma_v1.0.pep.fasta

Run BUSCO to assess the completeness of the merged annotation:

ml --force purge
ml busco
busco \
-i B73-gemoma_v1.0.pep.fasta \
-c $SLURM_CPUS_ON_NODE \
-o B73-gemoma_v1.0_busco \
-m prot \
-l poales_odb10

We can compare the results with the previously generated Helixer and B73.v5 annotations and whole genome assembly BUSCO results.

Busco-results

Figure 1: BUSCO results for Helixer, B73.v5 (MaizeGDB) and merged (GeMoMa) annotations.

Compare Helixer original predictions with GeMoMa:

conda activate mikado
mikado compare \
--protein-coding \
-r Zm-B73-HELIXER-NAM-1.0.gff3 \
-p B73-gemoma_v1.0.gff3 \
-o ref_NAM.B73v5-vs-prediction_HELIXERv1_compared \
--log compare.log

Compare B73.v5 (MaizeGDB) predictions with GeMoMa:

mikado compare \
--protein-coding \
-r Zm-B73-REFERENCE-NAM-5.0_Zm00001eb.1.gff3 \
-p B73-gemoma_v1.0.gff3 \
-o ref_NAM.B73v5-vs-prediction_HELIXERv1_compared \
--log compare.log

The Eukaryotic Non-Model Transcriptome Annotation Pipeline (EnTAP) can be used for functional annotation of the predicted genes.

ml purge
ml anaconda
conda activate entap
cds="B73-gemoma_v1.0.cds.fasta"
EnTAP \
--runP \
-i ${cds} \
-d ${RCAC_SCRATCH}/entap_db/bin/ncbi_refseq_plants.dmnd \
-d ${RCAC_SCRATCH}/entap_db/bin/uniprot_sprot.dmnd \
-t ${SLURM_CPUS_ON_NODE} \
--ini ${RCAC_SCRATCH}/entap_db/entap_config.ini

You can summarize the results using the agat utility:

ml biocontainers
ml agat
agat_sp_statistics.pl \
-g B73-gemoma_v1.0.gff3 \
-o B73-gemoma_v1.0.stats

OMArk can be used to assess the quality of the predicted proteome:

ml --force purge
ml seqtk
omark_sif="${RCAC_SCRATCH}/omark/omark_0.3.0--pyh7cba7a3_0.sif"
peptide="B73-gemoma_v1.0.pep.fasta"
database="${RCAC_SCRATCH}/omark/LUCA.h5"
grep ">" ${peptide} |\
sed -e 's/gene=/\t/g' -e 's/>//g' |\
sort -uk2,2 |\
cut -f 1 > primary.ids
seqtk subseq ${peptide} primary.ids > ${peptide%.*}.primary.pep.fasta
apptainer exec ${omark_sif} omamer search \
--db ${database} \
--query ${peptide%.*}.primary.pep.fasta \
--out ${peptide%.*}.omamer \
--nthreads ${SLURM_CPUS_ON_NODE}
cds="B73-gemoma_v1.0.cds.fasta"
bioawk -c fastx 'BEGIN{OFS="\t"}{
print $name,length($seq),gc($seq),substr($seq,0,3),reverse(substr(reverse($seq),0,3))
}' ${cds} > ${cds%.*}.info

The plots were generated using the following R script: cds_assesment.R

cds-length

Figure 8: Distribution of CDS length for GeMoMa, Helixer and NAM.v5 predictions.