Content from Introduction to Genome Annotation


Last updated on 2026-02-24

Overview

Questions

  • What is genome annotation, and why is it important?
  • What are the different types of genome annotation?
  • What challenges make genome annotation difficult?
  • What data sources are used to improve annotation accuracy?
  • How does the annotation process fit into genomics research?

Objectives

  • Define genome annotation and its role in genomics research
  • Differentiate between structural and functional annotation
  • Identify key challenges in genome annotation and how they impact accuracy
  • Understand the types of data used in annotation and their significance
  • Provide an overview of the annotation workflow used in this workshop

What Is Genome Annotation?


  • Genome annotation is the process of identifying and labeling functional elements in a genome, such as genes, regulatory regions, and non-coding RNAs.
  • It involves predicting gene structures, including exons, introns, and untranslated regions, using computational and experimental data.
  • Functional annotation assigns biological roles to predicted genes based on sequence similarity, protein domains, and pathway associations.
  • Accurate annotation is essential for understanding genome function, evolutionary relationships, and applications in research and biotechnology.
The structure of a eukaryotic protein-coding gene

Callout

Why Is Annotation Important?

  • Annotation provides a functional map of the genome, helping researchers identify genes and regulatory elements.
  • It enables comparative genomics, allowing insights into evolution, gene conservation, and species-specific adaptations.
  • Accurate gene annotation is essential for biomedical and agricultural research, including disease studies and crop improvement.
  • Functional annotations link genes to biological processes, aiding in pathway analysis and experimental validation.

Types of Genome Annotation


  • Structural annotation identifies genes, exon-intron boundaries, untranslated regions (UTRs), and regulatory elements such as promoters and enhancers.
  • Functional annotation assigns biological roles to genes based on sequence similarity, protein domains, gene ontology (GO), and pathway associations.
  • Repeat annotation detects transposable elements, centromeric repeats, knob sequences, and other repetitive sequences, preventing the misannotation of repeats as genes.
  • Epigenomic annotation integrates chromatin accessibility, DNA methylation, and histone modifications to understand gene regulation.
  • Non-coding RNA annotation includes small RNAs (miRNAs, siRNAs, piRNAs), long non-coding RNAs (lncRNAs), ribosomal RNAs (rRNAs), and transfer RNAs (tRNAs).

Key Challenges in Genome Annotation


  • Incomplete or fragmented assemblies: Gaps and misassemblies can lead to missing or incorrect gene predictions.
  • Alternative splicing and isoforms: Complex gene structures make it difficult to define a single gene model.
  • Pseudogenes and transposable elements: Distinguishing functional genes from non-functional sequences is challenging.
  • Low-quality or missing evidence: Incomplete RNA-seq, proteomics, or evolutionary data can reduce annotation accuracy.
  • Annotation discrepancies across tools: Different prediction methods may produce conflicting gene models.
  • Evolutionary divergence: Gene structures and functional elements can vary widely across species, complicating homology-based annotation.

Data Sources for Annotation


Accurate genome annotation relies on multiple data sources to refine gene predictions and validate functional elements. Different types of experimental and computational evidence contribute to identifying gene structures, transcript isoforms, coding potential, and functional assignments.

1. Long-read transcript sequencing (Iso-Seq)
- Captures full-length transcripts, helping define complete exon-intron structures.
- Useful for identifying alternative splicing events and novel isoforms.

2. Ribosome profiling (Ribo-Seq)
- Detects ribosome-protected mRNA fragments to identify actively translated regions.
- Helps differentiate coding from non-coding RNAs and refine start/stop codons.

3. Cap analysis gene expression (CAGE) and RAMPAGE
- Maps transcription start sites by detecting 5’ capped mRNAs.
- Provides precise information on promoter locations and alternative start sites.

4. Proteomics (Mass Spectrometry Data)
- Confirms protein-coding genes by detecting peptide fragments in expressed proteins.
- Helps validate translation and refine gene models.

5. RNA sequencing (Short-read RNA-Seq)
- Provides gene expression levels and transcript coverage across different conditions.
- Essential for identifying intron-exon boundaries and alternative splicing events.

6. Comparative genomics (Synteny & Homology)
- Uses known gene models from related species to guide annotation.
- Helps predict conserved genes and validate gene structures.

7. Additional Data Sources
- Poly(A) site sequencing (PAS-Seq): Defines transcript end sites for more precise UTR annotation.
- Chromatin accessibility data (ATAC-Seq, DNase-Seq): Identifies regulatory regions like enhancers and promoters.
- Epigenomic data (ChIP-Seq, DNA methylation): Provides evidence for active and repressed genes.

Integrating diverse datasets improves the accuracy of gene predictions, ensuring that both structural and functional annotations reflect real biological features.

Scope of This Workshop


In this workshop, we will focus on protein-coding gene annotation in eukaryotic genomes using computational tools and available datasets. We will annotate genes in an Arabidopsis genome using a combination of ab initio gene prediction, homology-based methods, and RNA-seq evidence, and compare the results to a reference annotation. The workflow covers data preparation, gene prediction, functional annotation, and quality assessment steps to provide a comprehensive overview of the annotation process.

Key Points
  • Genome annotation identifies functional elements in a genome, including genes, regulatory regions, and non-coding RNAs.
  • Structural annotation predicts gene structures, while functional annotation assigns biological roles to genes.
  • Challenges in annotation include incomplete assemblies, alternative splicing, pseudogenes, and data quality issues.
  • Data sources like RNA-seq, proteomics, comparative genomics, and epigenomics improve annotation accuracy.
  • This workshop focuses on eukaryotic gene annotation using computational tools and diverse datasets.

Content from Annotation Strategies


Last updated on 2026-02-24

Overview

Questions

  • What are the different strategies used for genome annotation?
  • How do various methods predict genes, and what are their strengths and limitations?
  • What role does evidence play in improving gene predictions?
  • How do deep learning and large models enhance genome annotation accuracy?
  • What tools are used in this workshop, and how do they fit into different strategies?

Objectives

  • Understand the principles behind different genome annotation approaches.
  • Learn how intrinsic, extrinsic, and hybrid methods predict gene structures.
  • Explore the role of multi-evidence approaches, including transcriptomics, proteomics, and functional annotation.
  • Introduce modern machine learning and large model-based methods for gene annotation.
  • Familiarize with annotation tools used in this workshop and their specific applications.

Introduction


  • Genome annotation identifies functional elements in a genome, including protein-coding genes, non-coding RNAs, and regulatory regions
  • Annotation strategies vary between evidence-based, ab initio, and hybrid approaches, each with strengths and limitations
  • Selecting the right method depends on genome complexity, available transcriptomic and proteomic data, and the need for accuracy versus sensitivity

Annotation Methods


Intrinsic (Ab Initio) Methods

  • Predict genes based on sequence features without external evidence.
  • Use statistical models such as hidden Markov models (HMMs) to recognize coding regions, exon-intron boundaries, and gene structures.
  • Rely on nucleotide composition patterns like codon usage, GC content, and splice site signals.
  • Perform well in well-studied genomes but struggle with novel genes and complex gene structures.

Extrinsic (Evidence-Based) Methods

  • Utilize external data such as RNA-seq, ESTs, and protein homology to guide gene predictions.
  • Align transcriptomic and proteomic data to the genome to refine exon-intron structures.
  • Improve annotation accuracy by incorporating evolutionary conservation and known gene models.
  • Require high-quality reference datasets, making them less effective in newly sequenced organisms with little prior data.

Hybrid Methods

  • Combine ab initio predictions with evidence-based approaches to improve accuracy and completeness.
  • Use gene predictions from statistical models and refine them with transcriptomic and proteomic data.
  • Balance sensitivity (detecting novel genes) and specificity (avoiding false positives).
  • Often used in large-scale genome annotation projects where both approaches complement each other.

Multi-Evidence based methods

  • Incorporate diverse biological data sources to improve gene prediction accuracy.
  • Use ribosome profiling (Ribo-seq) and proteomics to confirm protein-coding potential.
  • Leverage poly(A) signal presence to distinguish mature transcripts from non-coding RNA.
  • Integrate RNA-seq, long-read sequencing, and EST data to refine exon-intron structures.
  • Provide direct experimental validation, reducing reliance on purely computational models.

Large Model-Based Methods

  • Utilize deep learning and large language models (LLMs) trained on vast genomic datasets.
  • Predict gene structures by recognizing sequence patterns learned from diverse organisms.
  • Adapt to various genome complexities without requiring manually curated training data.
  • Identify non-canonical gene structures, alternative splicing events, and novel genes.
  • Example: Helixer, a GPU-based model, predicts genes using high-dimensional sequence patterns, improving over traditional HMM-based methods.
Prerequisite

What do we annotate in the genome?

  • Protein-coding genes: Genes that encode proteins, identified based on open reading frames (ORFs) and coding sequence features.
  • Pseudogenes: Non-functional remnants of once-active genes that have accumulated mutations.
  • Non-coding RNAs (ncRNAs): Functional RNAs that do not code for proteins, including tRNAs, rRNAs, and lncRNAs.
  • Upstream Open Reading Frames (uORFs): Small ORFs in the 5’ untranslated region (UTR) that can regulate translation.
  • Alternative Splicing Variants: Different transcript isoforms from the same gene, generated by exon-intron rearrangements.
  • Repeat Elements and Transposons: Mobile DNA sequences that can impact gene function and genome structure.
  • Regulatory Elements: Promoters, enhancers, and other sequences controlling gene expression.

For this workshop, we focus only on protein-coding genes and their annotation.

Annotation Workflows Used


1. BRAKER3: Combines GeneMark-ET and AUGUSTUS to predict genes in eukaryotic genomes.

  • Integrates GeneMark and AUGUSTUS with RNA-Seq and/or protein evidence.
  • Supports various data types or runs without them.
  • Fully automated training and gene prediction.
  • Requires evidence data for high accuracy; performs poorly without it.
  • Struggles with scalability on larger genomes.

2. Helixer: A deep learning-based gene prediction tool that uses a convolutional neural network (CNN) to predict genes in eukaryotic genomes.

  • Utilizes a deep learning model trained on diverse animal and plant genomes, reducing the need for retraining on closely related species.
  • Provides probabilities for every base, enabling detailed predictions for intergenic, untranslated, coding, and intronic regions.
  • Outperforms traditional tools like AUGUSTUS in both base-wise metrics and RNA-Seq consistency.
  • Handles large and complex eukaryotic genomes effectively, even for species with longer genes.
  • Requires no additional data except genome – no repeatmasking!
  • Cannot predict alternative splicing events or spliced start codons.
Helixer

3. EASEL: Efficient, Accurate, Scalable Eukaryotic modeLs: eukaryotic genome annotation tool

  • Combines machine learning, RNA folding, and functional annotations to enhance prediction accuracy.
  • Leverages RNA-Seq, protein alignments, and additional evidence for gene prediction.
  • Built on Nextflow, ensuring portability across HPC systems and compatibility with Docker and Singularity.
  • Features EnTAP integration for streamlined functional annotation.
  • Provides multiple formats (GFF, GTF, CDS, protein sequences) for downstream analysis.
  • Employs BUSCO for quality assessment and automates parameter optimization.
EASEL

4. EnTAP: eukaryotic non-model annotation pipeline (functional annotation)

  • Optimized for functional annotation of transcriptomes without reference genomes, addressing challenges like fragmentation and assembly artifacts.
  • Much faster than traditional annotation pipelines.
  • Integrates taxonomic filtering, contaminant detection, and informativeness metrics for selecting optimal alignments.
  • Enables annotation across diverse repositories (e.g., RefSeq, Swiss-Prot) with customizable database integration.
  • Includes gene family assignments, protein domains, and GO terms for enriched annotation.
  • User-friendly, Unix-based pipeline suitable for HPC environments, combining simplicity with high throughput.
EnTAP
Callout

How do we assess gene predictions?

  • BUSCO and OMArk assess gene set completeness by checking for the presence of conserved orthologs across species, helping identify missing or fragmented genes.
  • GFF3 metrics provide structural statistics on gene models, including exon-intron distribution, gene lengths, and coding sequence properties, highlighting inconsistencies in annotation.
  • Feature assignment using RNA-seq read mapping quantifies how well predicted genes capture transcriptomic evidence, indicating potentially missing or incorrect annotations.
  • Reference comparison with known annotations measures sensitivity and precision, identifying correctly predicted, missed, or falsely annotated genes.
  • Multiple validation methods combine structural, functional, and comparative assessments to improve confidence in gene predictions and refine annotation quality.
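A few of these structural checks can be scripted directly. The sketch below computes two simple GFF3 metrics (gene count and mean gene length) with awk. The file demo.gff3 is a hypothetical two-gene example used only for illustration; on real data you would point awk at your predicted annotation (e.g., a braker.gff3).

```shell
# Build a tiny, hypothetical GFF3 with two gene features (tab-delimited).
printf 'chr1\tsrc\tgene\t1\t1200\t.\t+\t.\tID=g1\nchr1\tsrc\tgene\t2000\t2799\t.\t-\t.\tID=g2\n' > demo.gff3

# Count gene features and report their mean length (end - start + 1).
awk -F'\t' '$3 == "gene" { n++; len += $5 - $4 + 1 } END { printf "genes=%d mean_len=%.0f\n", n, len/n }' demo.gff3
```

The same one-liner extends easily to other feature types (mRNA, exon, CDS) by changing the `$3` filter.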
Challenge

Exercise 1: Matching Tools to Data

You are annotating a newly sequenced fungal genome. You have short-read RNA-seq data from three tissue types but no protein sequences from closely related species. Which annotation strategy (ab initio, evidence-based, hybrid) and which tool from this workshop would you choose? Why?

A hybrid approach using BRAKER3 Case 1 (RNA-seq mode) would be the best choice. BRAKER3 can use the RNA-seq BAM files as evidence to train GeneMark-ET and AUGUSTUS, combining ab initio gene prediction with transcript evidence. This approach leverages your available data without requiring protein homology. If you later obtain protein sequences, you could re-run with Case 3 (combined mode) for improved accuracy.

Challenge

Exercise 2: Evaluating Trade-offs

Consider two annotation scenarios: 1. A well-studied crop species with extensive RNA-seq, protein databases, and a closely related reference annotation 2. A non-model organism with no transcriptomic data and few related species in databases

For each scenario, which annotation method(s) would you prioritize and why?

Scenario 1 (well-studied crop): Use a multi-evidence hybrid approach like BRAKER3 Case 3 (RNA-seq + proteins). The rich evidence data will produce highly accurate gene models. Follow up with EnTAP for functional annotation using the available databases. You could also run Helixer for comparison.

Scenario 2 (non-model organism): Start with Helixer since it requires no external evidence and can predict genes using only the genome sequence. Also run BRAKER3 Case 4 (ab initio mode) as a complement. If you can obtain protein sequences from the nearest available clade via OrthoDB, run BRAKER3 Case 2 as well. Compare all results using BUSCO to identify the best prediction.

Key Points
  • Genome annotation strategies include ab initio, evidence-based, hybrid, multi-evidence, and large model-based methods, each with different strengths and data requirements.
  • Ab initio methods predict genes using sequence patterns, while evidence-based methods use RNA-seq, proteins, and evolutionary conservation for refinement.
  • Multi-evidence approaches integrate diverse biological data such as ribosome profiling, proteomics, and long-read RNA-seq to improve accuracy.
  • Large model-based methods, like Helixer, use deep learning to predict genes without requiring manual training, outperforming traditional tools in complex genomes.
  • The workshop focuses on protein-coding gene annotation using tools like BRAKER, Helixer, EASEL, and EnTAP, followed by assessment with BUSCO, OMARK, and structural metrics.

Content from Annotation Setup


Last updated on 2026-02-24

Overview

Questions

  • How should files and directories be structured for a genome annotation workflow?
  • Why is RNA-seq read mapping important for gene prediction?
  • How does repeat masking improve annotation accuracy?
  • What preprocessing steps are necessary before running gene prediction tools?

Objectives

  • Set up a structured workspace for genome annotation
  • Map RNA-seq reads to the genome to provide evidence for gene prediction
  • Perform repeat masking to prevent false gene annotations
  • Prepare genome and transcriptomic data for downstream annotation analyses

Setup


This section outlines the preprocessing steps required before gene annotation, including organizing directories, mapping RNA-Seq reads, and masking repetitive elements. These steps have already been completed to save time, but the instructions are provided for reference in case you need to run them on custom datasets.

Prerequisite

Optional Section!

Feel free to skip this section, as these steps have already been performed to save time. The instructions below are provided for reference so that you can run them on custom data.

Folder structure


BASH

mkdir -p ${RCAC_SCRATCH}/annotation_workshop
cd ${RCAC_SCRATCH}/annotation_workshop
mkdir -p 00_datasets/{fastq,genome,bamfiles}
mkdir -p 04_braker
mkdir -p 05_helixer
mkdir -p 06_easel
mkdir -p 07_entap
mkdir -p 08_assessment
rsync -avP /depot/workshop/data/annotation_workshop/00_datasets/ 00_datasets/
Organization

RNA-Seq mapping


The RNA-seq dataset used for gene prediction comes from accession E-MTAB-6422, which provides a gene expression atlas of Arabidopsis thaliana (Columbia accession), capturing transcriptional programs across 36 libraries representing different tissues (seedling, root, inflorescence, flower, silique, seed) and developmental stages (2-leaf, 6-leaf, 12-leaf, senescence, dry mature, and imbibed seed), with three biological replicates per condition.

Step 1: Download the reference genome and annotation files

BASH

# Download the reference genome
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
mkdir -p ${WORKSHOP_DIR}/00_datasets/genome
cd ${WORKSHOP_DIR}/00_datasets/genome 
wget https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-60/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
gunzip Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
ml --force purge
ml biocontainers
ml bioawk
bioawk -c fastx '{print ">"$name; print $seq}' \
   Arabidopsis_thaliana.TAIR10.dna.toplevel.fa |\
   fold > athaliana.fasta
rm Arabidopsis_thaliana.TAIR10.dna.toplevel.fa

Step 2: Index the reference genome using STAR

BASH

#!/bin/bash
ml --force purge
ml biocontainers
ml star
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana.fasta"
index="${WORKSHOP_DIR}/00_datasets/genome/star_index"
STAR --runThreadN ${SLURM_CPUS_ON_NODE} \
   --runMode genomeGenerate \
   --genomeDir ${index}  \
   --genomeSAindexNbases 12 \
   --genomeFastaFiles ${genome}
Callout

Options

Option Description
--runThreadN $SLURM_CPUS_ON_NODE Specifies the number of CPU threads to use. In an HPC environment, $SLURM_CPUS_ON_NODE dynamically assigns available CPUs per node.
--runMode genomeGenerate Runs STAR in genome index generation mode, required before alignment.
--genomeDir ${index} Path to the directory where the genome index will be stored.
--genomeSAindexNbases 12 Defines the length of the SA pre-indexing string. A recommended value is min(14, log2(GenomeLength)/2 - 1), but 12 is commonly used for smaller genomes.
--genomeFastaFiles ${genome} Path to the FASTA file containing the reference genome sequences.

Step 3: Map the RNA-Seq reads to the reference genome

We need to set up a script, star-mapping.sh, to map the RNA-Seq reads to the reference genome.

BASH

#!/bin/bash
ml --force purge
ml biocontainers
ml star
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana.fasta"
index="${WORKSHOP_DIR}/00_datasets/genome/star_index"
read1=$1
read2=$2
cpus=${SLURM_CPUS_ON_NODE}
outname=$(basename ${read1} | cut -f 1 -d "_")
STAR \
--runThreadN ${cpus} \
--genomeDir ${index} \
--outSAMtype BAM SortedByCoordinate \
--outFilterScoreMinOverLread 0.3 \
--outFilterMatchNminOverLread 0.3 \
--outFileNamePrefix ${outname}_ \
--readFilesCommand zcat \
--outWigType bedGraph \
--outWigStrand Unstranded \
--outWigNorm RPM \
--readFilesIn ${read1} ${read2}
Callout

Options

Option Description
--runThreadN ${cpus} Specifies the number of CPU threads to use for alignment.
--genomeDir ${index} Path to the pre-generated STAR genome index directory.
--outSAMtype BAM SortedByCoordinate Outputs aligned reads in BAM format, sorted by genomic coordinates.
--outFilterScoreMinOverLread 0.3 Sets the minimum alignment score relative to read length (30% of the max possible score).
--outFilterMatchNminOverLread 0.3 Sets the minimum number of matching bases relative to read length (30% of the read must match).
--outFileNamePrefix ${outname}_ Prefix for all output files generated by STAR.
--readFilesCommand zcat Specifies a command (zcat) to decompress input FASTQ files if they are gzipped.
--outWigType bedGraph Outputs signal coverage in bedGraph format for visualization.
--outWigStrand Unstranded Specifies that the RNA-seq library is unstranded (reads are not strand-specific).
--outWigNorm RPM Normalizes coverage signal to Reads Per Million (RPM).
--readFilesIn ${read1} ${read2} Specifies input paired-end FASTQ files (Read 1 and Read 2).

Step 4: Run the STAR mapping script

BASH

#!/bin/bash
# Define the directory containing FASTQ files
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
FASTQ_DIR="${WORKSHOP_DIR}/00_datasets/fastq"
# Loop through all R1 files and find corresponding R2 files
for R1 in ${FASTQ_DIR}/*_R1.fq.gz; do
    R2="${R1/_R1.fq.gz/_R2.fq.gz}"  # Replace R1 with R2
    if [[ -f "$R2" ]]; then  # Check if R2 exists
        echo "Running STAR mapping for: $R1 and $R2"
        ./star-mapping.sh "$R1" "$R2"
    else
        echo "Error: Matching R2 file for $R1 not found!" >&2
    fi
done

Step 5: Store BAM files

BASH

# Define the directory containing BAM files
BAM_DIR="${WORKSHOP_DIR}/00_datasets/bamfiles"
# Move all BAM files to the BAM directory
mv *_Aligned.sortedByCoord.out.bam ${BAM_DIR}/

RepeatMasker


Some gene prediction tools require repeat-masked genomes to avoid mis-annotation of transposable elements as genes. We will use RepeatMasker to mask the repeats in the genome. But first, we need to download the RepeatMasker database.

BASH

WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
cd ${WORKSHOP_DIR}/00_datasets/genome
# Download the RepeatMasker database
wget https://www.girinst.org/server/RepBase/protected/RepBase30.01.fasta.tar.gz
tar xf RepBase30.01.fasta.tar.gz
replib=$(find $(pwd) -name "athrep.ref")
ml --force purge
ml biocontainers
ml repeatmasker
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana.fasta"
RepeatMasker \
   -e ncbi \
   -pa ${SLURM_CPUS_ON_NODE} \
   -q \
   -xsmall \
   -lib ${replib} \
   -nocut \
   -gff \
   ${genome}

Once done, we will rename the masked output athaliana.fasta.masked to athaliana_softmasked.fasta:

BASH

mv athaliana.fasta.masked athaliana_softmasked.fasta
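As a quick sanity check on the masking step, you can measure what fraction of bases RepeatMasker soft-masked (i.e., converted to lowercase). The sketch below runs on a toy demo.fasta; on real data, run the awk line on athaliana_softmasked.fasta instead.

```shell
# Toy FASTA: 16 bases, 4 of them soft-masked (lowercase).
printf '>chr1\nACGTacgtACGTACGT\n' > demo.fasta

# Sum sequence length and lowercase bases; gsub returns the number of substitutions.
awk '!/^>/ { total += length($0); masked += gsub(/[acgt]/, "") } END { printf "%.1f%% masked\n", 100 * masked / total }' demo.fasta
```

For Arabidopsis, a soft-masked fraction far outside the expected repeat content (~15-20%) would suggest a problem with the repeat library or RepeatMasker run.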
Challenge

Exercise 1: Calculating STAR --genomeSAindexNbases

The STAR aligner requires the --genomeSAindexNbases parameter to be tuned based on genome size. The recommended formula is min(14, log2(GenomeLength)/2 - 1).

The Arabidopsis thaliana genome is approximately 119 Mb (119,000,000 bp). What value should you use for --genomeSAindexNbases?

What about for a larger genome, such as Zea mays (~2,300 Mb)?

For Arabidopsis thaliana (~119 Mb):

log2(119000000) / 2 - 1 = 26.83 / 2 - 1 = 12.4
min(14, 12.4) = 12

So --genomeSAindexNbases 12 is the correct value (as used in this lesson).

For Zea mays (~2,300 Mb):

log2(2300000000) / 2 - 1 = 31.1 / 2 - 1 = 14.55
min(14, 14.55) = 14

So --genomeSAindexNbases 14 (the default) would be appropriate for maize.

As a rule of thumb: for genomes smaller than ~500 Mb, you should calculate and lower this value; for large genomes (over 1 Gb), the default of 14 is usually fine.
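This rule of thumb can also be scripted so the value never has to be computed by hand. The snippet below is a small helper (an assumption for convenience, not part of STAR itself) that applies min(14, log2(GenomeLength)/2 - 1):

```shell
# Compute the recommended --genomeSAindexNbases for a given genome length.
genome_length=119000000   # A. thaliana, ~119 Mb

# awk has no log2, so use log(L)/log(2); cap the result at STAR's default of 14.
awk -v L="$genome_length" 'BEGIN { v = int(log(L)/log(2)/2 - 1); if (v > 14) v = 14; print v }'
```

Running it with `genome_length=2300000000` (maize) prints 14, matching the worked example above.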

Challenge

Exercise 2: Choosing the Right Genome File

Different annotation tools have different requirements for repeat masking. Given the genome files produced in this episode:

  • athaliana.fasta (unmasked)
  • athaliana_softmasked.fasta (soft-masked with RepeatMasker)

Which genome file should you use for each of the following tools, and why?

  1. BRAKER3
  2. Helixer
  3. EASEL
  1. BRAKER3 – Use athaliana_softmasked.fasta. BRAKER3 benefits from soft-masked genomes because its underlying gene finder (Augustus/GeneMark) can recognize repeat regions (in lowercase) and avoid predicting genes within transposable elements.

  2. Helixer – Use athaliana.fasta (unmasked). Helixer is a deep-learning-based tool that has been trained on both genic and repetitive sequences. It does not require or benefit from repeat masking.

  3. EASEL – Use athaliana_softmasked.fasta. EASEL’s internal pipeline expects a soft-masked genome to improve the accuracy of gene predictions by de-prioritizing repetitive regions.

Key Points
  • Organizing files and directories ensures a reproducible and efficient workflow
  • Mapping RNA-seq reads to the genome provides essential evidence for gene prediction
  • Repeat masking prevents transposable elements from being mis-annotated as genes
  • Preprocessing steps are crucial for accurate and high-quality genome annotation

Content from Annotation using BRAKER


Last updated on 2026-02-24

Overview

Questions

  • What is BRAKER3?
  • What are the different scenarios in which BRAKER3 can be used?
  • How to run BRAKER3 with different input requirements?
  • What are the different output files generated by BRAKER3?

Objectives

  • Understand the different scenarios in which BRAKER3 can be used
  • Learn how to run BRAKER3 with different input requirements
  • Learn how to interpret the output files generated by BRAKER3

BRAKER3 is a pipeline that combines GeneMark-ET and AUGUSTUS to predict genes in eukaryotic genomes, and it is particularly useful for annotating newly sequenced genomes. Its flexibility allows users to provide various input datasets to improve gene prediction accuracy. In this example, we will predict genes in an Arabidopsis thaliana genome using BRAKER3 under several scenarios, described in the sections below.

Organizing the Data


Before running BRAKER3, we need to organize the data in the following format:

BASH

WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
mkdir -p ${WORKSHOP_DIR}/04_braker/braker_case{1..5}
Folder organization

Setup


Before running BRAKER3, we need to set up:

  1. GeneMark-ES/ET/EP/ETP license key
  2. The AUGUSTUS_CONFIG_PATH configuration path

The license key for GeneMark-ES/ET/EP/ETP can be obtained from the GeneMark website. Once downloaded, you need to place it in your home directory:

BASH

gunzip gm_key_64.gz
cp gm_key_64 ~/.gm_key

For the AUGUSTUS_CONFIG_PATH, we need to copy the config directory from the Singularity container to the scratch directory. This is required because BRAKER3 needs to write to the config directory, and the Singularity container is read-only. To copy the config directory, run the following command:

BASH

ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
mkdir -p ${WORKSHOP_DIR}/04_braker/augustus
copy_augustus_config ${WORKSHOP_DIR}/04_braker/augustus
export AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"

Case 1: Using RNAseq Reads


To run BRAKER3 with RNAseq reads, we need to provide the RNAseq reads in BAM format. We will use the following command to run BRAKER3 with RNAseq reads:

BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --account=workshop
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case1
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
BAM_DIR="${WORKSHOP_DIR}/00_datasets/bamfiles"
BAM_FILES=$(find "${BAM_DIR}" -name "*.bam" | paste -sd, -)  # comma-separated list, no trailing comma
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case1
cd ${workdir}
braker.pl \
   --AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
   --GENEMARK_PATH=${GENEMARK_PATH} \
   --bam=${BAM_FILES} \
   --genome=${genome} \
   --species=At_$(date +"%Y%m%d").c1b \
   --workingdir=${workdir} \
   --gff3 \
   --threads ${SLURM_CPUS_ON_NODE}

The job takes ~55 mins to finish. The results of the BRAKER3 run will be stored in the braker_case1 directory as the braker.gff3 file.
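As a minimal sanity check on a finished run, you can tally the feature types in the output GFF3. The sketch below uses a tiny hypothetical demo_braker.gff3 so it is self-contained; on real data, point awk at braker.gff3 inside the braker_case1 directory.

```shell
# Hypothetical one-gene GFF3 standing in for BRAKER3 output.
printf 'chr1\tAUGUSTUS\tgene\t100\t900\t.\t+\t.\tID=g1\nchr1\tAUGUSTUS\tmRNA\t100\t900\t.\t+\t.\tID=g1.t1;Parent=g1\n' > demo_braker.gff3

# Count how many gene and mRNA features were predicted.
awk -F'\t' '{ count[$3]++ } END { printf "genes=%d mRNAs=%d\n", count["gene"], count["mRNA"] }' demo_braker.gff3
```

A gene count wildly different from the ~27,000 protein-coding genes expected for Arabidopsis would be an early warning that something went wrong.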

Case 2: Using Conserved proteins only


Using the orthodb-clades tool, we can download protein sequences for a specific clade. Since we are annotating an Arabidopsis genome, we will download the clade-specific Viridiplantae.fa protein set from OrthoDB v12.

BASH

WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/04_braker/braker_case2
mkdir -p ${workdir}
git clone https://github.com/tomasbruna/orthodb-clades.git
cd orthodb-clades
ml --force purge
ml biocontainers
ml snakemake
snakemake --cores ${SLURM_CPUS_ON_NODE} selectViridiplantae

When this is done, you should see a folder named clades with Viridiplantae.fa in the orthodb-clades directory. We will use this as one of the input datasets for BRAKER3.
The following command will run BRAKER3 with the input genome and protein sequences:

BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --account=workshop
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case2
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
proteins="${WORKSHOP_DIR}/04_braker/braker_case2/orthodb-clades/clades/Viridiplantae.fa"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case2
mkdir -p ${workdir}
cd ${workdir}
braker.pl \
   --AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
   --GENEMARK_PATH=${GENEMARK_PATH} \
   --genome=${genome} \
   --prot_seq=${proteins} \
   --species=At_$(date +"%Y%m%d").c2 \
   --workingdir=${workdir} \
   --gff3 \
   --threads ${SLURM_CPUS_ON_NODE}

The results of the BRAKER3 run are written to the braker_case2 directory, with the gene predictions in braker.gff3. The run time for this job is ~170 minutes.

Case 3: Using RNA-seq and Proteins Together


BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --account=workshop
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case3
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
BAM_DIR="${WORKSHOP_DIR}/00_datasets/bamfiles"
BAM_FILES=$(find "${BAM_DIR}" -name "*.bam" | paste -sd, -)  # comma-separated, no trailing comma
proteins="${WORKSHOP_DIR}/04_braker/braker_case2/orthodb-clades/clades/Viridiplantae.fa"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case3
mkdir -p ${workdir}
cd ${workdir}
braker.pl \
   --AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
   --GENEMARK_PATH=${GENEMARK_PATH} \
   --genome=${genome} \
   --prot_seq=${proteins} \
   --bam=${BAM_FILES} \
   --species=At_$(date +"%Y%m%d").c3 \
   --workingdir=${workdir} \
   --gff3 \
   --threads ${SLURM_CPUS_ON_NODE}

The results of the BRAKER3 run are written to the braker_case3 directory, with the gene predictions in braker.gff3. The run time for this job is ~120 minutes.

Case 4: Ab Initio Mode


BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --account=workshop
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case4
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case4
mkdir -p ${workdir}
cd ${workdir}
braker.pl \
   --AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
   --GENEMARK_PATH=${GENEMARK_PATH} \
   --esmode \
   --genome=${genome} \
   --species=At_$(date +"%Y%m%d").c4 \
   --workingdir=${workdir} \
   --gff3 \
   --threads ${SLURM_CPUS_ON_NODE}

The results of the BRAKER3 run are written to the braker_case4 directory, with the gene predictions in braker.gff3. The run time for this job is ~60 minutes.

Case 5: Using a Pre-trained Model


BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --account=workshop
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case5
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case5
mkdir -p ${workdir}
cd ${workdir}
braker.pl \
   --AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
   --GENEMARK_PATH=${GENEMARK_PATH} \
   --skipAllTraining \
   --genome=${genome} \
   --species=arabidopsis \
   --workingdir=${workdir} \
   --gff3 \
   --threads ${SLURM_CPUS_ON_NODE}

The results of this BRAKER3 run are written to the braker_case5/Augustus directory; the main result file, augustus.ab_initio.gff3, will be used for downstream analysis. The run time for this job is ~7 minutes.

Results and Outputs


Each BRAKER3 run produces several output files in the respective case directory. The most important files are:

File Description
braker.gff3 Gene predictions in GFF3 format (Cases 1-4)
braker.aa Predicted protein sequences
braker.codingseq Predicted coding sequences (CDS)
hintsfile.gff Evidence hints used for gene prediction
augustus.ab_initio.gff3 Gene predictions from Case 5 (pretrained model)

You can count the number of predicted genes in each case using:

BASH

for case in braker_case{1..4}; do
    echo "${case}: $(awk '$3=="gene"' ${case}/braker.gff3 | wc -l) genes"
done
echo "braker_case5: $(awk '$3=="gene"' braker_case5/Augustus/augustus.ab_initio.gff3 | wc -l) genes"

The TAIR10 reference annotation for Arabidopsis contains approximately 27,600 protein-coding genes. Your results should be in a similar range:

Case Evidence Used Expected Gene Count Runtime
Case 1 RNA-seq only ~25,000-28,000 ~55 min
Case 2 Proteins only ~26,000-30,000 ~170 min
Case 3 RNA-seq + Proteins ~26,000-29,000 ~120 min
Case 4 Ab initio (no evidence) ~20,000-35,000 ~60 min
Case 5 Pretrained model ~26,000-28,000 ~7 min

Cases that use evidence data (Cases 1-3) generally produce more accurate gene models. Case 3 (combined evidence) typically gives the best results. Case 4 (ab initio) tends to over-predict or under-predict genes depending on the genome. Case 5 uses a pretrained Arabidopsis model and produces results quickly, but may not generalize to other species.
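One way to quantify "closest to the reference" is to compute the signed difference between each run's gene count and TAIR10's ~27,600. The helper below is a hypothetical sketch that assumes a GFF3 file with the feature type in column 3:

```bash
# Print the difference between a GFF3 gene count and a reference count.
gene_count_diff() {
    local gff="$1" ref="$2" n
    n=$(awk '$3 == "gene"' "${gff}" | wc -l)
    echo $(( n - ref ))
}
```

For example, `gene_count_diff braker_case3/braker.gff3 27600` prints a negative number if Case 3 predicted fewer genes than the reference.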

Challenge

Exercise 1: Compare BRAKER3 Cases

After running all five BRAKER3 cases, compare the gene counts from each case to the TAIR10 reference annotation (approximately 27,600 protein-coding genes). Which case produces results closest to the reference? Why do you think that is?

Case 3 (RNA-seq + Proteins) typically produces results closest to the TAIR10 reference, because it leverages both transcriptomic and proteomic evidence to guide gene prediction. The RNA-seq data provides direct evidence of expressed genes and splice sites, while the protein data provides evolutionary conservation information. Combining both evidence types allows BRAKER3 to make more accurate gene models than using either source alone. Case 5 (pretrained model) also performs well for Arabidopsis specifically, since the pretrained AUGUSTUS model was trained on high-quality Arabidopsis annotations. However, this approach would not generalize to organisms without existing pretrained models.

Challenge

Exercise 2: Choosing a BRAKER3 Strategy

Imagine you are annotating a newly sequenced genome for a non-model plant species. You have RNA-seq data from three tissue types but no closely related protein database. Which BRAKER3 case would you choose, and why? What if you had no RNA-seq data at all?

With RNA-seq data from three tissue types but no protein database, you should use Case 1 (RNA-seq only). The RNA-seq evidence from multiple tissues will help BRAKER3 identify expressed genes and accurately predict splice sites across a broad set of genes. You could also consider downloading OrthoDB protein sequences for the relevant clade (as shown in Case 2) and then running Case 3 (RNA-seq + Proteins) for potentially even better results.

If you had no RNA-seq data at all, you could use Case 2 (proteins only) with OrthoDB proteins from the closest available clade, or Case 4 (ab initio) as a last resort. Ab initio prediction does not require any external evidence but typically produces less accurate gene models. It is generally recommended to generate at least some RNA-seq data before attempting genome annotation, as this significantly improves prediction quality.

Key Points
  • BRAKER3 is a pipeline that combines GeneMark-ET and AUGUSTUS to predict genes in eukaryotic genomes
  • BRAKER3 can be used with different input datasets to improve gene prediction accuracy
  • BRAKER3 can be run in different scenarios to predict genes in a genome

Content from Annotation using Helixer


Last updated on 2026-02-24 | Edit this page

Overview

Questions

  • How to predict genes using Helixer?
  • How to download trained models for Helixer?
  • How to run Helixer on the HPC cluster (Gilbreth)?

Objectives

  • Learn how to predict genes using Helixer.
  • Learn how to download trained models for Helixer.
  • Learn how to run Helixer on the HPC cluster (Gilbreth).

Helixer is a deep learning-based gene prediction tool that uses a convolutional neural network (CNN) to predict genes in eukaryotic genomes. It is trained on a wide range of eukaryotic genomes and can predict genes in both plants and animals. Helixer works purely from the genome sequence, without any extrinsic information such as RNA-seq data or homology evidence.

Prerequisite

This section should be run on Gilbreth HPC cluster.

Due to the GPU requirement for Helixer, you need to run this section on the Gilbreth HPC cluster. You do not need to copy the FASTQ or BAM files; only the athaliana.fasta file in the 00_datasets/genome directory is required.

BASH

WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
mkdir -p ${WORKSHOP_DIR}/05_helixer
cp /depot/workshop/data/annotation_workshop/00_datasets/genome/athaliana.fasta ${WORKSHOP_DIR}/05_helixer/athaliana.fasta
Folder organization

Setup


Helixer is distributed as a Docker container that can also be pulled with Apptainer/Singularity; see the Helixer Docker repository for more information.

On the Gilbreth cluster, Helixer is already installed as a module and can be loaded using the following commands:

BASH

ml --force purge
ml biocontainers
ml helixer

Downloading trained models


Helixer requires a trained model to predict genes. With the included script fetch_helixer_models.py you can download models for specific lineages. Currently, models are available for the following lineages:

  • land_plant
  • vertebrate
  • invertebrate
  • fungi

You can download the models using the following command:

BASH

# all models
# fetch_helixer_models.py --all
# or for a specific lineage
fetch_helixer_models.py --lineage land_plant

The --lineage option downloads the models for a single lineage into the models directory; the commented-out --all option downloads the models for every lineage.

By default, files will be downloaded to the ~/.local/share/Helixer directory. You should see the following files:

.
└── models
    └── land_plant
        ├── land_plant_v0.3_a_0080.h5
        ├── land_plant_v0.3_a_0090.h5
        ├── land_plant_v0.3_a_0100.h5
        ├── land_plant_v0.3_a_0200.h5
        ├── land_plant_v0.3_a_0300.h5
        ├── land_plant_v0.3_a_0400.h5
        ├── land_plant_v0.3_m_0100.h5
        └── land_plant_v0.3_m_0200.h5

land_plant_v0.3_a_0080.h5 is the smallest model and land_plant_v0.3_m_0200.h5 is the largest model. The model size is determined by the number of parameters in the model. The larger models are more accurate but require more memory and time to run.
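If you want to check which model file is actually the largest on disk, you can sort the directory listing by size. This is a hypothetical convenience function; adjust the path if your models live elsewhere:

```bash
# Print the largest .h5 model file (by size) in a model directory.
largest_model() {
    ls -S "$1"/*.h5 | head -n 1
}
```

Running `largest_model ~/.local/share/Helixer/models/land_plant` should typically report land_plant_v0.3_m_0200.h5, matching the listing above.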

Running Helixer


Helixer requires a GPU for prediction, so you need to request a GPU node. You will also need the genome sequence in FASTA format. For this tutorial, we will use the Arabidopsis thaliana genome and the land_plant model to predict genes.

BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --gpus-per-node=1
#SBATCH --time=04:00:00
#SBATCH --account=standby
#SBATCH --job-name=helixer
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
ml --force purge
ml biocontainers
ml helixer
genome="athaliana.fasta"
species="Arabidopsis thaliana"
output="athaliana_helixer.gff"
Helixer.py \
    --lineage land_plant \
    --fasta-path "${genome}" \
    --species "${species}" \
    --gff-output-path "${output}"

A typical job takes ~40 minutes to finish, depending on the GPU the job is allocated.

You can count the number of predictions in your gff3 file using the following command:

BASH

awk '$3=="gene"' athaliana_helixer.gff | wc -l

In our run, the GFF output contained 27,201 genes predicted by Helixer. You can view the various features in the gff file using the following command:

BASH

grep -v "^#" athaliana_helixer.gff |\
   cut -f 3 |\
   sort |\
   uniq -c

To extract the protein (pep) and CDS sequences, you can use gffread (provided here by the cufflinks module):

BASH

ml --force purge
ml biocontainers
ml cufflinks
gffread athaliana_helixer.gff \
   -g athaliana.fasta \
   -y helixer_pep.fa \
   -x helixer_cds.fa

As you may have noticed, the number of mRNA features is the same as the number of gene features. This is because Helixer does not predict isoforms, so you get exactly one transcript per gene. Exons are identified with high confidence, but alternative isoforms are collapsed into a single gene model. This is one of the known limitations of Helixer.
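You can confirm the one-transcript-per-gene behaviour directly. The helper below (a hypothetical sketch) counts a given feature type in column 3 of a GFF file:

```bash
# Count occurrences of a feature type (column 3) in a GFF file.
feature_count() {
    local gff="$1" type="$2"
    awk -v t="${type}" '$3 == t' "${gff}" | wc -l | tr -d ' '
}
```

For Helixer output, `feature_count athaliana_helixer.gff gene` and `feature_count athaliana_helixer.gff mRNA` should print the same number.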

Challenge

Exercise 1: Understanding Helixer GFF Output

Look at the feature counts from the Helixer GFF output. Why is the number of mRNA features the same as the number of gene features?

The number of mRNA and gene features are the same because Helixer does not predict alternative isoforms. Each gene has exactly one transcript (mRNA) associated with it. Unlike tools that incorporate RNA-seq evidence, Helixer’s deep learning model produces a single gene model per locus, collapsing any potential alternative splicing into one representative transcript.

Challenge

Exercise 2: Helixer vs. BRAKER3

What are the advantages and disadvantages of using Helixer compared to BRAKER3 for gene annotation?

Advantages of Helixer:

  • Does not require any extrinsic evidence data (RNA-seq or protein data), making it useful when such data is unavailable.
  • Runs faster when GPU hardware is available.
  • Simpler setup with fewer input requirements.

Disadvantages of Helixer:

  • Cannot predict alternative isoforms (only one transcript per gene).
  • Requires GPU hardware, which may not be available on all HPC clusters.
  • May be less accurate for species that are poorly represented in the training data.
Key Points
  • Helixer is a deep learning-based gene prediction tool that uses a convolutional neural network (CNN) to predict genes in eukaryotic genomes.
  • Helixer can predict genes without any extrinsic information such as RNA-seq data or homology information, purely based on the sequence of the genome.
  • Helixer requires a trained model and GPU for prediction.
  • Helixer predicts genes in the GFF3 format, but will not predict isoforms.

Content from Annotation using Easel


Last updated on 2026-02-24 | Edit this page

Overview

Questions

  • What are the steps required to set up and run EASEL on an HPC system?
  • How is Nextflow used to manage and execute the EASEL workflow?
  • What configuration files need to be modified before running EASEL?
  • How do you submit and monitor an EASEL job using Slurm?

Objectives

  • Set up and configure EASEL for execution on an HPC system.
  • Install Nextflow and pull the EASEL workflow from the repository.
  • Modify the necessary configuration files to match the HPC environment.
  • Submit and run EASEL as a Slurm job and verify successful execution.

EASEL (Efficient, Accurate, Scalable Eukaryotic modeLs) is a genome annotation tool that integrates machine learning, RNA folding, and functional annotations to improve gene prediction accuracy. It leverages optimized AUGUSTUS parameters, transcriptome and protein evidence, and a scalable Nextflow-based pipeline for reproducible analysis.

Setup


We will do a custom installation of nextflow (we will not use ml nextflow):

BASH

ml --force purge
ml openjdk
curl -s https://get.nextflow.io | bash
mv nextflow ~/.local/bin

Check the installation:

BASH

nextflow -v
which nextflow

We can organize our folder structure as follows:

Folder organization

Running Easel


Step 1: create a rcac.config file so that jobs are submitted through the Slurm scheduler. After pulling the workflow in Step 2, save this file as ~/.nextflow/assets/PlantGenomicsLab/easel/conf/rcac.config so the rcac profile added in Step 3 can find it.

GROOVY

params {
    config_profile_description = 'Negishi HPC, RCAC Purdue University.'
    config_profile_contact     = 'Arun Seetharam (aseethar@purdue.edu)'
    config_profile_url         = "https://www.rcac.purdue.edu/knowledge/negishi"
    partition                  = 'testpbs'
    project                    = 'your_project_name'  // Replace with a valid Slurm project
    max_memory                 = '256GB'
    max_cpus                   = 128
    max_time                   = '48h'
}
singularity {
    enabled    = true
    autoMounts = true
    singularity.cacheDir = "${System.getenv('RCAC_SCRATCH')}/.singularity"
    singularity.pullTimeout = "1h"
}
process {
    resourceLimits = [
        memory: 256.GB,
        cpus: 128,
        time: 48.h
    ]
    executor       = 'slurm'
    clusterOptions = "-A ${params.project} -p ${params.partition}"
}
// Executor-specific configurations
if (params.executor == 'slurm') {
    process {
        cpus   = { check_max( 16    * task.attempt, 'cpus'   ) }
        memory = { check_max( 32.GB * task.attempt, 'memory' ) }
        time   = { check_max( 8.h  * task.attempt, 'time'   ) }
        clusterOptions = "-A ${params.project} -p ${params.partition}"

        errorStrategy = { task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish' }
        maxRetries    = 1
        maxErrors     = '-1'

        withLabel:process_single {
            cpus   = { check_max( 16                  , 'cpus'    ) }
            memory = { check_max( 32.GB  * task.attempt, 'memory'  ) }
            time   = { check_max( 8.h    * task.attempt, 'time'    ) }
        }
        withLabel:process_low {
            cpus   = { check_max( 32     * task.attempt, 'cpus'    ) }
            memory = { check_max( 64.GB  * task.attempt, 'memory'  ) }
            time   = { check_max( 24.h   * task.attempt, 'time'    ) }
        }
        withLabel:process_medium {
            cpus   = { check_max( 64     * task.attempt, 'cpus'    ) }
            memory = { check_max( 128.GB * task.attempt, 'memory'  ) }
            time   = { check_max( 24.h   * task.attempt, 'time'    ) }
        }
        withLabel:process_high {
            cpus   = { check_max( 128    * task.attempt, 'cpus'    ) }
            memory = { check_max( 256.GB * task.attempt, 'memory'  ) }
            time   = { check_max( 24.h   * task.attempt, 'time'    ) }
        }
        withLabel:process_long {
            time   = { check_max( 48.h  * task.attempt, 'time'    ) }
        }
        withLabel:process_high_memory {
            memory = { check_max( 256.GB * task.attempt, 'memory' ) }
        }
        withLabel:error_ignore {
            errorStrategy = 'ignore'
        }
        withLabel:error_retry {
            errorStrategy = 'retry'
            maxRetries    = 2
        }
    }
}

Step 2: pull the Easel nextflow workflow:

BASH

ml --force purge
ml biocontainers
ml openjdk
nextflow pull -hub gitlab PlantGenomicsLab/easel

This will pull the latest version of the Easel workflow from the PlantGenomicsLab GitLab repository.

Step 3: modify the ~/.nextflow/assets/PlantGenomicsLab/easel/nextflow.config file to include the rcac.config file:

BASH

vim ~/.nextflow/assets/PlantGenomicsLab/easel/nextflow.config

or if you prefer nano:

BASH

nano ~/.nextflow/assets/PlantGenomicsLab/easel/nextflow.config

Add the rcac line to the profiles block of the nextflow.config file:

GROOVY

profiles {
   debug { process.beforeScript = 'echo $HOSTNAME' }
   ...
   ...
   test  { includeConfig 'conf/test.config' }
   rcac  { includeConfig 'conf/rcac.config' }
}

Note that the profiles block is around line 235 of the nextflow.config file; the lines between the debug and test profiles are not shown here.

Now, we are ready to run the Easel workflow.

Step 4: create a params.yaml file to run the Easel workflow:

BASH

WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
mkdir -p ${WORKSHOP_DIR}/06_easel
cd ${WORKSHOP_DIR}/06_easel

Contents of the params.yaml file:

YAML

outdir: easel
genome: /scratch/negishi/YOUR_USERNAME/annotation_workshop/00_datasets/genome/athaliana_softmasked.fasta
bam: /scratch/negishi/YOUR_USERNAME/annotation_workshop/00_datasets/bamfiles/*.bam
busco_lineage: embryophyta
order: Brassicales
prefix: arabidopsis
taxon: arabidopsis
singularity_cache_dir: /scratch/negishi/YOUR_USERNAME/singularity_cache
training_set: plant
executor: slurm
account: testpbs
qos: normal
project: testpbs
Callout

Update Paths in params.yaml

Replace YOUR_USERNAME with your actual username in the params.yaml file. You can find your username by running whoami on the cluster. For example, if your username is train01, the genome path would be: /scratch/negishi/train01/annotation_workshop/00_datasets/genome/athaliana_softmasked.fasta
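Instead of editing every path by hand, you can substitute the placeholder in one pass. The function below is a hypothetical convenience wrapper; it edits the file in place and leaves a .bak backup:

```bash
# Replace the YOUR_USERNAME placeholder with the current user, in place.
fill_username() {
    sed -i.bak "s/YOUR_USERNAME/$(whoami)/g" "$1"
}
```

Running `fill_username params.yaml` updates every path at once; you can verify the result with `grep $(whoami) params.yaml`.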

Now we are ready to run the EASEL workflow. Our Slurm submission script:

BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=04-00:00:00
#SBATCH --job-name=easel
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml openjdk
nextflow run \
   -hub gitlab PlantGenomicsLab/easel \
   -params-file params.yaml \
   -profile rcac \
   --project testpbs

This will run the Easel workflow on the RCAC HPC.

Results and Outputs


EASEL produces output in the easel/ directory (as specified by outdir in params.yaml). The key output files are:

File/Directory Description
easel/annotation/*.gff3 Final gene predictions in GFF3 format
easel/annotation/*.cds.fa Predicted coding sequences
easel/annotation/*.pep.fa Predicted protein sequences
easel/busco/ BUSCO completeness assessment results
easel/entap/ Functional annotation results from EnTAP
easel/pipeline_info/ Nextflow execution reports and timelines

You can check the gene count in the EASEL output:

BASH

awk '$3=="gene"' easel/annotation/*.gff3 | wc -l

For Arabidopsis thaliana, EASEL should predict approximately 25,000-30,000 genes. The pipeline also generates a BUSCO report automatically, which you can find in the easel/busco/ directory.

Check the Nextflow execution report in easel/pipeline_info/ to review runtime, resource usage, and any failed processes.

Challenge

Exercise 1: Adapting EASEL for a Different Organism

If you were annotating a vertebrate genome (e.g., zebrafish) instead of a plant genome, which parameters in the params.yaml file would you need to change? List all the parameters and explain why each would need updating.

You would need to change the following parameters in params.yaml:

  • busco_lineage: Change from embryophyta (plants) to the appropriate vertebrate lineage (e.g., actinopterygii for ray-finned fish, or vertebrata for a broader assessment).
  • order: Change from Brassicales to the correct taxonomic order for your organism (e.g., Cypriniformes for zebrafish).
  • prefix: Change from arabidopsis to an appropriate prefix for your organism (e.g., zebrafish or drerio).
  • taxon: Change from arabidopsis to your target organism name (e.g., zebrafish).
  • training_set: Change from plant to the appropriate AUGUSTUS training set for your organism (e.g., vertebrate or a closely related species like zebrafish).

The genome, bam, and other path-related parameters would also change, but those are specific to your data rather than to the organism type.

Challenge

Exercise 2: Resuming a Failed EASEL Run

Your EASEL Nextflow pipeline failed partway through execution. How would you resume the run without recomputing already completed steps? Where would you look for error logs to diagnose the issue?

To resume a failed Nextflow run, add the -resume flag to your nextflow run command:

BASH

nextflow run \
   -hub gitlab PlantGenomicsLab/easel \
   -params-file params.yaml \
   -profile rcac \
   --project testpbs \
   -resume

Nextflow caches completed tasks, so only failed or pending steps will be re-executed.

To diagnose the failure, check these locations:

  1. .nextflow.log in the working directory: Contains detailed Nextflow execution logs, including error messages and stack traces.
  2. work/ directory: Each task runs in a subdirectory under work/. Navigate to the failed task’s directory (shown in the error output) and check the .command.log, .command.err, and .command.out files for the specific error.
  3. SLURM output files: Check the easel.o* and easel.e* files for SLURM-level errors (e.g., memory limits, time limits).
Key Points
  • EASEL is executed using Nextflow, which simplifies workflow management and ensures reproducibility.
  • Proper configuration of resource settings and HPC parameters is essential for successful job execution.
  • Running EASEL requires setting up input files, modifying configuration files, and submitting jobs via Slurm.
  • Understanding how to monitor and troubleshoot jobs helps ensure efficient pipeline execution.

Content from Functional annotation using EnTAP


Last updated on 2026-02-24 | Edit this page

Overview

Questions

  • What is EnTAP, and how does it improve functional annotation for non-model transcriptomes?
  • How does EnTAP filter, annotate, and assign functional roles to predicted transcripts?
  • What databases and evidence sources does EnTAP integrate for annotation?
  • What are the key steps required to set up and execute EnTAP on an HPC system?

Objectives

  • Understand how EnTAP improves functional annotation for non-model eukaryotes.
  • Learn how EnTAP processes transcript data through filtering, alignment, and functional assignment.
  • Set up and modify EnTAP configuration files for correct execution.
  • Run EnTAP on an HPC system and interpret the generated annotations.

EnTAP Overview


EnTAP is a bioinformatics pipeline designed to enhance the accuracy, speed, and flexibility of functional annotation for de novo assembled transcriptomes in non-model eukaryotic organisms. It mitigates assembly fragmentation, improves annotation rates, and provides extensive functional insights into transcriptomes. You can provide predicted transcripts in FASTA format and a GFF file containing gene models to run EnTAP.

Callout

Key Features

  • Optimized for non-model eukaryotes – Overcomes challenges of fragmented transcriptome assemblies.
  • Fast & efficient – Runs significantly faster than comparable annotation tools.
  • Customizable – Supports optional filtering and analysis steps for user-specific needs.
  • Comprehensive functional insights – Integrates multiple annotation sources for high-confidence gene assignments.
  • Contaminant detection – Helps remove misleading sequences for cleaner datasets.

How EnTAP Works


  1. Transcriptome Filtering – Identifies true coding sequences (CDS) and removes assembly artifacts:
    • Expression Filtering (optional) – Filters transcripts based on gene expression levels using RSEM.
    • Frame Selection (optional) – Further refines CDS predictions using TransDecoder.
  2. Transcriptome Annotation – Assigns functional information to sequences:
    • Similarity Search – Rapid alignment against user-selected databases using DIAMOND.
    • Contaminant Filtering & Best Hit Selection – Identifies optimal annotations and flags potential contaminants.
    • Orthologous Group Assignment – Assigns translated proteins to gene families using eggNOG/eggnog-mapper, including:
      • Protein Domains (SMART/Pfam)
      • Gene Ontology (GO) Terms
      • KEGG Pathway Annotation
    • InterProScan (optional) – Searches InterPro databases for additional domain, GO, and pathway annotations.
    • Horizontal Gene Transfer Analysis (optional) – Detects potential horizontal gene transfer (HGT) events via DIAMOND.

Running EnTAP


EnTAP is available as a module on the HPC cluster. You can load the module using the following commands:

BASH

ml --force purge
ml biocontainers
ml entap
Prerequisite

First-Time Setup

When running for the first time, you will have to set up the databases for EnTAP. This involves downloading files from several databases and can be time-consuming. For this workshop, the setup has already been performed, so you can skip this step; it is included here for reference.

Callout

Pre-staged Databases

For this workshop, the EnTAP databases have been pre-configured and are available at /depot/itap/datasets/entap_db/ on the training cluster. You can copy the configuration files from there:

BASH

cp /depot/itap/datasets/entap_db/entap_config.ini .
cp /depot/itap/datasets/entap_db/entap_run.params .

If you are running EnTAP on your own data outside this workshop, follow the database setup instructions in the spoiler section above.

The entap_run.params file should be set up as follows (be sure to select the correct databases for your organism):

INI

out-dir=entap_dbfiles
overwrite=false
resume=false
input=
database=uniprot_sprot,refseq_plant
no-trim=false
threads=1
output-format=1,3,4,7,
fpkm=0.5
align=
single-end=false
frame-selection=false
transdecoder-m=100
transdecoder-no-refine-starts=false
taxon=
qcoverage=50
tcoverage=50
contam=
e-value=1e-05
uninformative=conserved,predicted,unknown,unnamed,hypothetical,putative,unidentified,uncharacterized,uncultured,uninformative,
diamond-sensitivity=very-sensitive
ontology_source=0,
eggnog-contaminant=true
eggnog-dbmem=true
eggnog-sensitivity=more-sensitive
interproscan-db=
hgt-donor=
hgt-recipient=
hgt-gff=
ncbi-api-key=
ncbi-api-enable=true

The entap_config.ini file should be set up as follows (replace <custom_location> with your desired location):

INI

data-generate=false
data-type=0,
entap-db-bin=<custom_location>/entap_db/bin/entap_database.bin
entap-db-sql=entap_database.db
entap-graph=entap_graphing.py
rsem-calculate-expression=rsem-calculate-expression
rsem-sam-validator=rsem-sam-validator
rsem-prepare-reference=rsem-prepare-reference
convert-sam-for-rsem=convert-sam-for-rsem
transdecoder-long-exe=TransDecoder.LongOrfs
transdecoder-predict-exe=TransDecoder.Predict
diamond-exe=diamond
eggnog-map-exe=emapper.py
eggnog-map-data=<custom_location>/entap_db/databases
eggnog-map-dmnd=<custom_location>/entap_db/bin/eggnog_proteins.dmnd
interproscan-exe=interproscan.sh

Once done, for the first time setup, you can run the following command:

BASH

ml --force purge
ml biocontainers
ml entap
EnTAP \
   --config \
   --run-ini ./entap_run.params \
   --entap-ini ./entap_config.ini \
   --threads ${SLURM_CPUS_ON_NODE}

This will download the databases and set up the configuration files for EnTAP.

Step 1: Prepare files


Your input files should be in the following format:

  • Transcript FASTA file – Contains predicted transcripts in FASTA format.
  • Configuration file – Specifies parameters for EnTAP execution.
    • entap_run.params – Contains runtime parameters for EnTAP.
    • entap_config.ini – Specifies paths to EnTAP binaries and databases. (you can copy the files from /depot/itap/datasets/entap_db/entap_{config.ini,run.params})

Edit the entap_run.params file to specify the output directory for EnTAP results and the correct input file:

INI

out-dir=entap_out
input=input_cds.fasta

Step 2: Run EnTAP


Run EnTAP using the following command:

BASH

ml --force purge
ml biocontainers
ml entap
EnTAP \
   --run \
   --run-ini ./entap_run.params \
   --entap-ini ./entap_config.ini \
   --threads ${SLURM_CPUS_ON_NODE}

Interpreting Results


EnTAP generates several output files, but the key results will be in the entap_out/final_results directory.

EnTAP Output Files Summary

File/Directory Description
Final Results (final_results/)
annotated.fnn FASTA file of annotated transcripts.
annotated.tsv Tab-separated file with functional annotations.
annotated_gene_ontology_terms.tsv GO terms assigned to annotated transcripts.
entap_results.tsv Master summary of all results, including annotations.
unannotated.fnn FASTA file of unannotated transcripts.
unannotated.tsv List of transcripts that failed annotation.
gene_family/ Stores eggNOG gene family assignments, including orthologs and functional annotations.
similarity_search/ Contains results from DIAMOND BLASTX searches against selected databases.
transcriptomes/ Holds the input transcriptome (CDS) and the processed version after filtering.
Challenge

Exercise 1: Annotation Coverage

After running EnTAP, examine the output files in the final_results/ directory. What percentage of your input transcripts received functional annotations?

Hint: Compare the number of entries in annotated.tsv and unannotated.tsv.

You can count the lines (excluding the header) in each file:

BASH

# Count annotated transcripts
tail -n +2 entap_out/final_results/annotated.tsv | wc -l

# Count unannotated transcripts
tail -n +2 entap_out/final_results/unannotated.tsv | wc -l

The annotation rate is calculated as:

annotated / (annotated + unannotated) * 100

This gives you the percentage of transcripts that received at least one functional annotation from the databases used by EnTAP.
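The two counts and the percentage can be combined into one short calculation. This is a sketch; the paths assume the entap_out layout described above:

BASH

```shell
# Count annotated and unannotated transcripts (skipping each header line),
# then compute the annotation rate. Paths assume the entap_out layout above.
annotated=$(tail -n +2 entap_out/final_results/annotated.tsv | wc -l)
unannotated=$(tail -n +2 entap_out/final_results/unannotated.tsv | wc -l)
awk -v a="$annotated" -v u="$unannotated" \
    'BEGIN { if (a + u > 0)
                 printf "Annotation rate: %.1f%% (%d of %d)\n", 100 * a / (a + u), a, a + u
             else
                 print "no transcripts found" }'
```

For example, with 3 annotated and 1 unannotated transcripts, this prints an annotation rate of 75.0%.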

Challenge

Exercise 2: Uninformative Annotations

Look at the uninformative parameter in the entap_run.params file. What does this parameter do and why is it important for functional annotation quality?

The uninformative parameter specifies a list of keywords used to filter out low-quality or non-descriptive annotations. In the default configuration, it includes terms like:

uninformative=conserved,predicted,unknown,unnamed,hypothetical,putative,unidentified,uncharacterized,uncultured,uninformative,

When EnTAP encounters a DIAMOND hit whose description contains any of these keywords, it treats that hit as uninformative and attempts to find a better annotation from other database matches. This matters because many protein database entries carry vague descriptions such as “hypothetical protein” or “predicted protein” that provide no meaningful functional insight. By filtering these out, EnTAP ensures that only informative, descriptive functional assignments are kept in the final results, leading to higher-quality annotations for downstream analysis.
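As a rough sanity check, you can count how often such keywords still appear in the final annotation table. This is a sketch; the path follows the entap_out layout above and the keyword list is illustrative:

BASH

```shell
# Count lines in the final annotation table that still mention selected
# uninformative keywords (case-insensitive). The "|| true" keeps the loop
# going when a keyword has zero matches (grep exits non-zero in that case).
for kw in hypothetical predicted uncharacterized; do
    n=$(grep -ic "$kw" entap_out/final_results/annotated.tsv || true)
    echo "$kw: $n lines"
done
```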

Key Points
  • EnTAP enhances functional annotation by integrating multiple evidence sources, including homology, protein domains, and gene ontology.
  • Proper setup of configuration files and databases is essential for accurate and efficient EnTAP execution.
  • Running EnTAP involves transcript filtering, similarity searches, and functional annotation through automated workflows.
  • The pipeline provides extensive insights into transcript function, improving downstream biological interpretations.

Content from Annotation Assessment


Last updated on 2026-02-24 | Edit this page

Overview

Questions

  • How to assess the quality of a genome annotation?
  • What are the different tools available for assessing the quality of a genome annotation?
  • How to compare the predicted annotation with the reference annotation?
  • How to measure the number of raw reads assigned to the features predicted by the annotation?

Objectives

  • Assess the quality of a genome annotation using different tools.
  • Compare the predicted annotation with the reference annotation.
  • Measure the number of raw reads assigned to the features predicted by the annotation.

Setup


The folder organization is as follows:

08_assessment/
├── braker_case1.gff3
├── braker_case2.gff3
├── braker_case3.gff3
├── braker_case4.gff3
├── braker_case5.gff3
├── athaliana_helixer.gff3
├── easel.gff3
├── busco_results/
├── gff3_stats/
└── featureCounts/
Callout

Running Assessment Tools

The commands in this episode can be run as SLURM batch jobs or interactively on a compute node. For interactive use, request a compute node with:

BASH

salloc -A workshop -n 16 -t 2:00:00

All #!/bin/bash headers in the scripts below can be extended with #SBATCH directives if you prefer batch submission (see the BRAKER episode for examples).

The annotation files (GFF3) come from the previous steps but are collected in this folder for convenience.

We will extract CDS and protein sequences from each GFF3 file before we begin the assessment.

BASH

ml --force purge
ml biocontainers
ml cufflinks
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
for i in *.gff3; do
    base=$(basename ${i%.*})
    gffread \
       -g ${WORKSHOP_DIR}/00_datasets/genome/athaliana.fasta \
       -x ${base}_cds.fa \
       -y ${base}_pep.fa $i
done

GFF3 metrics


We will use agat_sp_statistics.pl to calculate the statistics of the GFF3 file.

BASH

ml --force purge
ml biocontainers
ml agat
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
mkdir -p gff3_stats
for gff3 in *.gff3; do
    base=$(basename $gff3 .gff3)
    agat_sp_statistics.pl \
       --gff ${gff3} \
       -o gff3_stats/${base}_stats.txt
done
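Once the reports are written, a quick way to compare headline numbers is to grep for them across all the stats files. This is a sketch; the exact label text can vary between AGAT versions, so adjust the pattern if needed:

BASH

```shell
# Compare headline gene counts across the AGAT reports. The "-H" flag prefixes
# each match with its filename so the annotations are easy to tell apart.
grep -H "Number of gene" gff3_stats/*_stats.txt 2>/dev/null || true
```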

BUSCO assessment


We will use BUSCO to assess the quality of the annotation.

BASH

#!/bin/bash
ml --force purge
ml biocontainers
ml busco
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
for pep in *_pep.fa; do
    base=$(basename ${pep%.*})
     busco \
        --in ${pep} \
        --cpu ${SLURM_CPUS_ON_NODE} \
        --out ${base}_busco \
        --mode prot \
        --lineage_dataset brassicales_odb10 \
        --force
done

Once the BUSCO assessment is complete, we can collect and plot the results using the following commands:

BASH

ml --force purge
ml biocontainers
ml busco
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
mkdir -p busco_results
for f in *_busco/short_summary*.txt; do
   cp $f busco_results/
done
generate_plot.py --wd busco_results

BUSCO scores reflect how many conserved single-copy orthologs are found in your predicted proteome. A good annotation should have:

  • Complete (C) > 90% – most conserved genes are found as complete single-copy or duplicated
  • Fragmented (F) < 5% – few genes are only partially recovered
  • Missing (M) < 10% – few conserved genes are absent

Compare scores across annotation methods to identify the most complete prediction. Ab initio methods (BRAKER Case 4) typically score lower than evidence-based methods.
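To compare the headline numbers without opening each report, you can pull the one-line completeness summary out of the copied short summaries. This is a sketch; it relies on the standard C/S/D/F/M line that BUSCO writes into every short summary file:

BASH

```shell
# Print the one-line completeness summary (C:..%[S:..%,D:..%],F:..%,M:..%)
# from each collected BUSCO short summary file.
for f in busco_results/short_summary*.txt; do
    [ -e "$f" ] || continue        # skip if the glob did not match anything
    echo "== $f"
    grep -E 'C:[0-9.]+%' "$f" || true
done
```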

OMArk assessment


OMArk is a software for proteome (protein-coding gene repertoire) quality assessment. It provides measures of proteome completeness, characterizes the consistency of all protein coding genes with regard to their homologs, and identifies the presence of contamination from other species.

The proteomes should be filtered to retain only primary isoforms before running OMArk.

BASH

ml --force purge
ml biocontainers
ml seqkit
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
for pep in braker*_pep.fa; do
    base=$(basename ${pep%.*})
    grep "\.t1$" $pep | sed 's/>//g' > ${base}.primary.ids
    seqkit grep -f ${base}.primary.ids $pep > ${base}.primary.pep.fa
    rm ${base}.primary.ids
done
cp athaliana_helixer_pep.fa athaliana_helixer.primary.pep.fa
cp easel_pep.fa easel.primary.pep.fa

Running OMark:

BASH

#!/bin/bash
ml --force purge
ml biocontainers
ml omark
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
database="/depot/itap/datasets/omark/LUCA.h5"
cd ${workdir}
for pep in *.primary.pep.fa; do
    base=$(basename ${pep%.*})
    omamer search \
       --db ${database} \
       --query ${pep} \
       --out ${pep%.*}.omamer \
       --nthreads ${SLURM_CPUS_ON_NODE}
    omark \
       --file ${pep%.*}.omamer \
       --database ${database} \
       --outputFolder ${pep%.*} \
       --og_fasta ${pep}
done

The following OMArk results are for the TAIR10 reference annotation, shown as a baseline for comparison. Run OMArk on your predicted annotations and compare against these values.

Conserved HOGs in Brassicaceae

Category Count Percentage
Total Conserved HOGs 17,996 100%
Single-Copy HOGs (S) 16,237 90.23%
Duplicated HOGs (D) 1,315 7.31%
    Unexpected Duplications (U) 351 1.95%
    Expected Duplications (E) 964 5.36%
Missing HOGs (M) 444 2.47%

Proteome Composition

Category Count Percentage
Total Proteins 26,994 100%
Consistent (A) 25,480 94.39%
    Partial Hits (P) 1,306 4.84%
    Fragmented (F) 630 2.33%
Inconsistent (I) 167 0.62%
    Partial Hits (P) 106 0.39%
    Fragmented (F) 29 0.11%
Likely Contamination (C) 0 0.00%
    Partial Hits (P) 0 0.00%
    Fragmented (F) 0 0.00%
Unknown (U) 1,347 4.99%

HOG Placement - Detected Species

Species NCBI TaxID Associated Proteins % of Total Proteome
Arabidopsis thaliana 3702 25,647 95.01%

Feature assignment


To measure the number of raw reads assigned to the features predicted by the annotation, we will use featureCounts from the subread package.

BASH

#!/bin/bash
ml --force purge
ml biocontainers
ml subread
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
mkdir -p featureCounts
for gff3 in *.gff3; do
    base=$(basename ${gff3%.*})
    featureCounts \
      -T ${SLURM_CPUS_ON_NODE} \
      -a ${gff3} \
      -t exon \
      -g ID \
      -p \
      -B \
      -o featureCounts/${base}_merged_counts.txt \
      --tmpDir /tmp ${WORKSHOP_DIR}/00_datasets/bamfiles/*.bam
done

featureCounts reports how many RNA-seq reads map to predicted gene features. Key metrics to check:

  • Assigned reads – higher percentages indicate that more of the transcriptome is captured by the gene models
  • Unassigned_NoFeatures – reads that do not overlap any predicted feature, suggesting missing genes
  • Unassigned_Ambiguity – reads mapping to multiple overlapping features

A good annotation should assign 60-80% of reads to predicted features. Annotations with low assignment rates may be missing genes in expressed regions.
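The assignment rate can be computed directly from the .summary file that featureCounts writes alongside each counts table. This is a sketch; the path below is hypothetical, so adjust it to wherever your counts were written. The summary file has a Status column followed by one count column per BAM:

BASH

```shell
# Compute per-sample read-assignment rates from a featureCounts .summary file.
# Each column after the first holds counts for one BAM; the "Assigned" row
# divided by the column total gives the assignment rate for that sample.
summary=featureCounts/braker_case1_merged_counts.txt.summary   # hypothetical path
if [ -f "$summary" ]; then
    awk 'NR == 1 { nf = NF }
         NR > 1  { for (i = 2; i <= NF; i++) { total[i] += $i
                                               if ($1 == "Assigned") assigned[i] = $i } }
         END     { for (i = 2; i <= nf; i++)
                       printf "sample %d: %.1f%% assigned\n", i - 1, 100 * assigned[i] / total[i] }' "$summary"
fi
```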

Reference annotation comparison


We will use mikado compare to evaluate the predicted annotation against the reference annotation.

BASH

#!/bin/bash
ml --force purge
ml biocontainers
ml mikado
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
for gff3 in *.gff3; do
    base=$(basename ${gff3%.*})
    mikado compare \
      --protein-coding \
      -r ${WORKSHOP_DIR}/00_datasets/genome/athaliana_TAIR10.gff3 \
      -p ${gff3} \
      -o ref-TAIR10_vs_prediction_${base}_compared \
      --log ${base}_compare.log
done

Mikado compare evaluates how closely your predicted gene models match the reference annotation. Key metrics:

  • Sensitivity – proportion of reference genes that are correctly predicted (higher is better)
  • Precision – proportion of predicted genes that match the reference (higher is better)
  • F1 score – harmonic mean of sensitivity and precision

At the transcript level:

  • Sensitivity > 60% is reasonable for a first-pass annotation
  • Precision > 70% indicates a low false-positive rate

Compare results across your annotation methods to determine which produces the best balance of sensitivity and precision.
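A quick way to pull these metrics for all methods at once is to grep the stats files that mikado compare writes. This is a sketch under the assumption that the stats files follow the -o prefix used above and contain labelled transcript-level rows; inspect one file first and adjust the pattern if your mikado version formats them differently:

BASH

```shell
# Print the transcript-level sensitivity/precision rows from each
# mikado compare stats file (file naming follows the -o prefix used above).
for stats in ref-TAIR10_vs_prediction_*_compared.stats; do
    [ -e "$stats" ] || continue    # skip if the glob did not match anything
    echo "== $stats"
    grep -i "transcript level" "$stats" || true
done
```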

Challenge

Exercise 1: Compare BUSCO Scores

Review the BUSCO results across all annotation methods (BRAKER cases 1-5, Helixer, and EASEL). Rank them from highest to lowest completeness. Which method produced the most complete annotation? Which performed the worst? Can you explain why?

Evidence-based methods (BRAKER Case 1 with RNA-seq, Case 2 with proteins, and Case 3 combining both) typically score highest, with completeness (C) values above 90%. Helixer and EASEL also tend to perform well since they leverage external evidence or pre-trained models. Ab initio prediction (BRAKER Case 4) usually scores the lowest because it relies solely on intrinsic sequence features without external evidence. BRAKER Case 5 (pretrained) should perform better than pure ab initio but may still lag behind evidence-informed runs. The ranking highlights the importance of incorporating experimental evidence into gene prediction.

Challenge

Exercise 2: Most Informative Assessment Metric

Consider the three assessment approaches used in this episode: BUSCO, featureCounts, and mikado compare. Which metric do you think is the most informative for evaluating an annotation, and why? Are there situations where one metric might be misleading?

There is no single “best” metric – each captures a different aspect of annotation quality:

  • BUSCO measures gene completeness using conserved orthologs, but it only assesses a subset of genes (those that are universally conserved). Lineage-specific or novel genes are not evaluated.
  • featureCounts measures how well the annotation captures expressed regions of the genome. However, high assignment rates can occur even with over-predicted gene models, and lowly expressed genes may not be well represented in the RNA-seq data.
  • mikado compare directly evaluates structural accuracy against a reference, but it requires a high-quality reference annotation, which is not always available.

A comprehensive assessment should use all three metrics together. BUSCO and OMArk evaluate gene content completeness, featureCounts checks consistency with expression data, and mikado compare validates structural accuracy. An annotation that scores well across all metrics is more reliable than one that excels in only one.

Summary


Key Points
  • BUSCO and OMArk assess how well conserved genes are represented in the predicted gene set
  • GFF3 metrics provide structural insights and highlight discrepancies compared to known annotations
  • featureCounts assignment quantifies the number of RNA-seq reads aligning to predicted features
  • Reference annotation comparison evaluates how closely the predicted genes match an established reference
  • Multiple assessment methods ensure a comprehensive evaluation of annotation quality