Prerequisites
- Target genome assembly (
fasta
format) - we will use B73.v5 genome - ab initio predictions (
.gff3
format) - we will use Helixer output - Homology predictions (
.gff3
format) - Sorghum and B97 annotations
Prerequisites
fasta
format) - we will use B73.v5 genome.gff3
format) - we will use Helixer output.gff3
format) - Sorghum and B97 annotationsSoftware
java >1.8
, blast
or mmseqs
Gene Model Mapper (GeMoMa) is a homology-based gene prediction tool that transfers protein-coding gene models from one or more reference genomes to a target genome. It can incorporate RNA-seq evidence, filter predictions using customizable criteria, and merge multiple annotations—including ab initio and transcriptome-based predictions—into a unified gene set.
Helixer is a deep learning-based predictor that provides accurate gene predictions from genome sequence alone. However, it does not predict alternative isoforms and tends to collapse all transcript variants into a single flattened gene model, especially in regions with overlapping exons or alternative splicing.
Homology predictions can be used to recover these missing isoforms and provide additional context for gene structure. GeMoMa excels at this task by leveraging evolutionary conservation across related species, allowing it to transfer annotations from well-annotated genomes to the target genome.
By merging Helixer predictions with GeMoMa:
This hybrid strategy takes advantage of both machine learning accuracy and evolutionary conservation to compensate for the limitations of any single method.
In this example, for the B73.v5 genome, we will merge ab initio predictions obtained from Helixer with the homology predictions of its close relatives (Maize inbred B97 and Sorghum).
wget https://download.maizegdb.org/Zm-B73-REFERENCE-NAM-5.0/Zm-B73-REFERENCE-NAM-5.0.fa.gz
wget
wget https://download.maizegdb.org/Zm-B97-REFERENCE-NAM-1.0/Zm-B97-REFERENCE-NAM-1.0.fa.gzwget https://download.maizegdb.org/Zm-B97-REFERENCE-NAM-1.0/Zm-B97-REFERENCE-NAM-1.0_Zm00018ab.1.gff3.gzwget https://ftp.sorghumbase.org/release-9/fasta/sorghum_bicolor/dna/Sorghum_bicolor.Sorghum_bicolor_NCBIv3.dna.toplevel.fa.gzwget https://ftp.sorghumbase.org/release-9/gff3/sorghum_bicolor/Sorghum_bicolor.Sorghum_bicolor_NCBIv3.gff3.gz
gunzip Zm-B73-REFERENCE-NAM-5.0.fa.gzgunzip Zm-B97-REFERENCE-NAM-1.0_Zm00018ab.1.gff3.gzgunzip Sorghum_bicolor.Sorghum_bicolor_NCBIv3.dna.toplevel.fa.gzgunzip Sorghum_bicolor.Sorghum_bicolor_NCBIv3.gff3.gz
GeMoMa is executed using a Java-based CLI pipeline. Here we will run GeMoMa with the target genome, Helixer predictions, and homology-based annotations from related species. We will generate a merged annotation file in GFF3 format.
We will install GeMoMa using BioConda:
module --force purgemodule load condaconda create -n gemoma # Press 'y' when promptedconda activate gemomaconda install -c bioconda gemoma
After installation, the GeMoMa JAR file is located under the environment’s shared directory. You can set the path like this:
gemoma_jar="${CONDA_ENVS_PATH}/share/gemoma-1.9-0/GeMoMa-1.9.jar"
#!/bin/bash# activate conda environmentmodule --force purgemodule load condaconda activate gemoma# Set GeMoMa JAR pathgemoma_jar="${CONDA_ENVS_PATH}/share/gemoma-1.9-0/GeMoMa-1.9.jar"
# Function to check file existence and resolve full pathcheck_file() { if [[ ! -f "$1" ]]; then echo "ERROR: File not found: $1" >&2 exit 1 fi realpath "$1"}
# Check and assign input filesgenome=$(check_file Zm-B73-REFERENCE-NAM-5.0.fa)sorghumFa=$(check_file Sorghum_bicolor.Sorghum_bicolor_NCBIv3.dna.toplevel.fa)sorghumGFF=$(check_file Sorghum_bicolor.Sorghum_bicolor_NCBIv3.gff3)maize1=$(check_file Zm-B97-REFERENCE-NAM-1.0.fa)maize1GFF3=$(check_file Zm-B97-REFERENCE-NAM-1.0_Zm00018ab.1.gff3)helixer=$(check_file Zm-B73-HELIXER-NAM-1.0.gff3)
# Other configname="B73_v5_GeMoMa" # Prefix for gene namescpus=${SLURM_CPUS_ON_NODE:-8} # Default to 8 CPUs if not running on SLURMTMPDIR=${TMPDIR:-/tmp} # Use system temp directory if not setdir="gemoma_output" # Output directory
# Run GeMoMajava -Xms100g -Xmx200g -Djava.io.tmpdir=$TMPDIR -jar ${gemoma_jar} CLI GeMoMaPipeline \ t=$genome \ s=own i=sb a=$sorghumGFF g=$sorghumFa \ s=own i=zm a=$maize1GFF3 g=$maize1 \ ID=helixer e=$helixer \ AnnotationFinalizer.u=NO \ AnnotationFinalizer.r=SIMPLE AnnotationFinalizer.p=$name \ AnnotationFinalizer.n=true \ sc=false o=false p=true pc=true pgr=false \ threads=$cpus outdir=$dir &> ${dir}/gemoma.log
Save this script as run_gemoma.sh
and run it in a terminal with the command:
chmod +x run_gemoma.sh./run_gemoma.sh
The options used in the command above are:
Option | Explanation |
---|---|
t=$genome | Target genome (FASTA) to annotate. |
s=own | Reference species data will be extracted by GeMoMa from annotation + genome. |
i=sb | Identifier for reference species (Sorghum bicolor). |
a=$sorghumGFF | GFF3 gene annotation for Sorghum bicolor. |
g=$sorghumFa | FASTA genome for Sorghum bicolor. |
i=zm | Identifier for second reference species (e.g., maize B97). |
a=$maize1GFF3 | GFF3 gene annotation for maize B97. |
g=$maize1 | FASTA genome for maize B97. |
ID=helixer | Unique label for Helixer-based annotation source. |
e=$helixer | External GFF3 annotation to be merged (Helixer output). |
AnnotationFinalizer.u= | Disable UTR prediction (no RNA evidence available). |
AnnotationFinalizer.r=SIMPLE | Rename gene/transcript IDs using a simple prefix-based system. |
AnnotationFinalizer.p=$name | Prefix used in renaming genes and transcripts. |
AnnotationFinalizer.n=true | Add new names as Name= attributes in GFF3 output. |
sc=false | Disable synteny check. |
o=false | Do not write individual predictions per reference species. |
p=true | Output predicted proteins as FASTA. |
pc=true | Output predicted CDSs as FASTA. |
pgr=false | Do not output predicted genomic regions. |
threads=$cpus | Number of compute threads to use. |
outdir=$dir | Output directory for all GeMoMa results. |
The GeMoMa output will be stored in the specified gemoma_output
directory. The main results include:
If you examine the log file (gemoma.log
), you will see some important stats, including any warnings or errors encountered during the run.
You can also check the number of genes and transcripts in the final annotation:
grep -v '^#' gemoma_output/final_annotation.gff | cut -f3 | sort | uniq -c
Which should give you the counts for all features.
246097 CDS 58847 gene 60137 mRNA
You can clean up the gff
file using the genometools
utility:
ml --force purgeml genometoolsml agatgff3_file=gemoma_output/final_annotation.gffgt gff3 \ -sort \ -tidy \ -setsource "gemoma" \ -force \ -o B73-gemoma_gt.gff3 \ $gff3_fileagat_convert_sp_gxf2gxf.pl \ -g B73-gemoma_gt.gff3 \ -o B73-gemoma_v1.0.gff3gffread B73-gemoma_v1.0.gff3 \ -g Zm-B73-REFERENCE-NAM-5.0.fa \ -x B73-gemoma_v1.0.cds.fasta \ -y B73-gemoma_v1.0.pep.fasta
Run BUSCO to assess the completeness of the merged annotation:
ml --force purgeml buscobusco \ -i B73-gemoma_v1.0.pep.fasta \ -c $SLURM_CPUS_ON_NODE \ -o B73-gemoma_v1.0_busco \ -m prot \ -l poales_odb10
We can compare the results with the previously generated Helixer and B73.v5 annotations and whole genome assembly BUSCO results.
Figure 1: BUSCO results for Helixer, B73.v5 (MaizeGDB) and merged (GeMoMa) annotations.
Compare Helixer original predictions with GeMoMa:
conda activate mikadomikado compare \ --protein-coding \ -r Zm-B73-HELIXER-NAM-1.0.gff3 \ -p B73-gemoma_v1.0.gff3 \ -o ref_NAM.B73v5-vs-prediction_HELIXERv1_compared \ --log compare.log
Compare B73.v5 (MaizeGDB) predictions with GeMoMa:
mikado compare \ --protein-coding \ -r Zm-B73-REFERENCE-NAM-5.0_Zm00001eb.1.gff3 \ -p B73-gemoma_v1.0.gff3 \ -o ref_NAM.B73v5-vs-prediction_HELIXERv1_compared \ --log compare.log
The Eukaryotic Non-Model Transcriptome Annotation Pipeline (EnTAP) can be used for functional annotation of the predicted genes.
ml purgeml anacondaconda activate entapcds="B73-gemoma_v1.0.cds.fasta"EnTAP \ --runP \ -i ${cds} \ -d ${RCAC_SCRATCH}/entap_db/bin/ncbi_refseq_plants.dmnd \ -d ${RCAC_SCRATCH}/entap_db/bin/uniprot_sprot.dmnd \ -t ${SLURM_CPUS_ON_NODE} \ --ini ${RCAC_SCRATCH}/entap_db/entap_config.ini
You can summarize the results using the agat
utility:
ml biocontainersml agatagat_sp_statistics.pl \ -g B73-gemoma_v1.0.gff3 \ -o B73-gemoma_v1.0.stats
OMArk can be used to assess the quality of the predicted proteome:
ml --force purgeml seqtkomark_sif="${RCAC_SCRATCH}/omark/omark_0.3.0--pyh7cba7a3_0.sif"peptide="B73-gemoma_v1.0.pep.fasta"database="${RCAC_SCRATCH}/omark/LUCA.h5"grep ">" ${peptide} |\ sed -e 's/gene=/\t/g' -e 's/>//g' |\ sort -uk2,2 |\ cut -f 1 > primary.idsseqtk subseq ${peptide} primary.ids > ${peptide%.*}.primary.pep.fastaapptainer exec ${omark_sif} omamer search \ --db ${database} \ --query ${peptide%.*}.primary.pep.fasta \ --out ${peptide%.*}.omamer \ --nthreads ${SLURM_CPUS_ON_NODE}
cds="B73-gemoma_v1.0.cds.fasta"bioawk -c fastx 'BEGIN{OFS="\t"}{ print $name,length($seq),gc($seq),substr($seq,0,3),reverse(substr(reverse($seq),0,3)) }' ${cds} > ${cds%.*}.info
The plots were generated using the following R script: cds_assesment.R
Figure 8: Distribution of CDS length for GeMoMa, Helixer and NAM.v5 predictions.
Figure 9: Distribution of GC content for GeMoMa, Helixer and NAM.v5 predictions.
Figure 10: Distribution of start and stop codons for GeMoMa, Helixer and NAM.v5 predictions.