Content from Introduction to Genome Annotation


Last updated on 2025-04-22 | Edit this page

Overview

Questions

  • What is genome annotation, and why is it important?
  • What are the different types of genome annotation?
  • What challenges make genome annotation difficult?
  • What data sources are used to improve annotation accuracy?
  • How does the annotation process fit into genomics research?

Objectives

  • Define genome annotation and its role in genomics research
  • Differentiate between structural and functional annotation
  • Identify key challenges in genome annotation and how they impact accuracy
  • Understand the types of data used in annotation and their significance
  • Provide an overview of the annotation workflow used in this workshop

What Is Genome Annotation?


  • Genome annotation is the process of identifying and labeling functional elements in a genome, such as genes, regulatory regions, and non-coding RNAs.
  • It involves predicting gene structures, including exons, introns, and untranslated regions, using computational and experimental data.
  • Functional annotation assigns biological roles to predicted genes based on sequence similarity, protein domains, and pathway associations.
  • Accurate annotation is essential for understanding genome function, evolutionary relationships, and applications in research and biotechnology.
The structure of a eukaryotic protein-coding gene
The structure of a eukaryotic protein-coding gene

source

Why Is Annotation Important?

  • Annotation provides a functional map of the genome, helping researchers identify genes and regulatory elements.
  • It enables comparative genomics, allowing insights into evolution, gene conservation, and species-specific adaptations.
  • Accurate gene annotation is essential for biomedical and agricultural research, including disease studies and crop improvement.
  • Functional annotations link genes to biological processes, aiding in pathway analysis and experimental validation.

Types of Genome Annotation


  • Structural annotation identifies genes, exon-intron boundaries, untranslated regions (UTRs), and regulatory elements such as promoters and enhancers.
  • Functional annotation assigns biological roles to genes based on sequence similarity, protein domains, gene ontology (GO), and pathway associations.
  • Repeat annotation detects transposable elements/centromere repeats/knob sequences and other repetitive sequences that prevent misannotation of genes.
  • Epigenomic annotation integrates chromatin accessibility, DNA methylation, and histone modifications to understand gene regulation.
  • Non-coding RNA annotation includes small RNAs (miRNAs, siRNAs, piRNAs), long non-coding RNAs (lncRNAs), ribosomal RNAs (rRNAs), and transfer RNAs (tRNAs).

Key Challenges in Genome Annotation


  • Incomplete or fragmented assemblies: Gaps and misassemblies can lead to missing or incorrect gene predictions.
  • Alternative splicing and isoforms: Complex gene structures make it difficult to define a single gene model.
  • Pseudogenes and transposable elements: Distinguishing functional genes from non-functional sequences is challenging.
  • Low-quality or missing evidence: Incomplete RNA-seq, proteomics, or evolutionary data can reduce annotation accuracy.
  • Annotation discrepancies across tools: Different prediction methods may produce conflicting gene models.
  • Evolutionary divergence: Gene structures and functional elements can vary widely across species, complicating homology-based annotation.

Data Sources for Annotation


Accurate genome annotation relies on multiple data sources to refine gene predictions and validate functional elements. Different types of experimental and computational evidence contribute to identifying gene structures, transcript isoforms, coding potential, and functional assignments.

1. Long-read transcript sequencing (Iso-Seq)
- Captures full-length transcripts, helping define complete exon-intron structures.
- Useful for identifying alternative splicing events and novel isoforms.

2. Ribosome profiling (Ribo-Seq)
- Detects ribosome-protected mRNA fragments to identify actively translated regions.
- Helps differentiate coding from non-coding RNAs and refine start/stop codons.

3. Cap analysis gene expression (CAGE) and RAMPAGE
- Maps transcription start sites by detecting 5’ capped mRNAs.
- Provides precise information on promoter locations and alternative start sites.

4. Proteomics (Mass Spectrometry Data)
- Confirms protein-coding genes by detecting peptide fragments in expressed proteins.
- Helps validate translation and refine gene models.

5. RNA sequencing (Short-read RNA-Seq)
- Provides gene expression levels and transcript coverage across different conditions.
- Essential for identifying intron-exon boundaries and alternative splicing events.

6. Comparative genomics (Synteny & Homology)
- Uses known gene models from related species to guide annotation.
- Helps predict conserved genes and validate gene structures.

7. Additional Data Sources
- Poly(A) site sequencing (PAS-Seq): Defines transcript end sites for more precise UTR annotation.
- Chromatin accessibility data (ATAC-Seq, DNase-Seq): Identifies regulatory regions like enhancers and promoters.
- Epigenomic data (ChIP-Seq, DNA methylation): Provides evidence for active and repressed genes.

Integrating diverse datasets improves the accuracy of gene predictions, ensuring that both structural and functional annotations reflect real biological features.
Integrating diverse datasets improves the accuracy of gene predictions, ensuring that both structural and functional annotations reflect real biological features.

Scope of This Workshop


In this workshop, we will focus on protein-coding gene annotation in eukaryotic genomes using computational tools and available datasets. We will annotate genes in a Arabidopsis genome using a combination of ab initio gene prediction, homology-based methods, and RNA-seq evidence methods, can compare the results to a reference annotation. The workflow will cover data preparation, gene prediction, functional annotation, and quality assessment steps to provide a comprehensive overview of the annotation process.

Key Points

  • Genome annotation identifies functional elements in a genome, including genes, regulatory regions, and non-coding RNAs.
  • Structural annotation predicts gene structures, while functional annotation assigns biological roles to genes.
  • Challenges in annotation include incomplete assemblies, alternative splicing, pseudogenes, and data quality issues.
  • Data sources like RNA-seq, proteomics, comparative genomics, and epigenomics improve annotation accuracy.
  • This workshop focuses on eukaryotic gene annotation using computational tools and diverse datasets.

Content from Annotation Strategies


Last updated on 2025-04-22 | Edit this page

Overview

Questions

  • What are the different strategies used for genome annotation?
  • How do various methods predict genes, and what are their strengths and limitations?
  • What role do evidence play in improving gene predictions?
  • How do deep learning and large models enhance genome annotation accuracy?
  • What tools are used in this workshop, and how do they fit into different strategies?

Objectives

  • Understand the principles behind different genome annotation approaches.
  • Learn how intrinsic, extrinsic, and hybrid methods predict gene structures.
  • Explore the role of multi-evidence approaches, including transcriptomics, proteomics, and functional annotation.
  • Introduce modern machine learning and large model-based methods for gene annotation.
  • Familiarize with annotation tools used in this workshop and their specific applications.

Introduction


  • Genome annotation identifies functional elements in a genome, including protein-coding genes, non-coding rnas, and regulatory regions
  • Annotation strategies vary between evidence-based, ab initio, and hybrid approaches, each with strengths and limitations
  • Selecting the right method depends on genome complexity, available transcriptomic and proteomic data, and the need for accuracy versus sensitivity

Annotation Methods


Intrinsic (Ab Initio) Methods

  • Predict genes based on sequence features without external evidence.
  • Use statistical models such as hidden Markov models (HMMs) to recognize coding regions, exon-intron boundaries, and gene structures.
  • Rely on nucleotide composition patterns like codon usage, GC content, and splice site signals.
  • Perform well in well-studied genomes but struggle with novel genes and complex gene structures.

Extrinsic (Evidence-Based) Methods

  • Utilize external data such as RNA-seq, ESTs, and protein homology to guide gene predictions.
  • Align transcriptomic and proteomic data to the genome to refine exon-intron structures.
  • Improve annotation accuracy by incorporating evolutionary conservation and known gene models.
  • Require high-quality reference datasets, making them less effective in newly sequenced organisms with little prior data.

Hybrid Methods

  • Combine ab initio predictions with evidence-based approaches to improve accuracy and completeness.
  • Use gene predictions from statistical models and refine them with transcriptomic and proteomic data.
  • Balance sensitivity (detecting novel genes) and specificity (avoiding false positives).
  • Often used in large-scale genome annotation projects where both approaches complement each other.

Multi-Evidence based methods

  • Incorporate diverse biological data sources to improve gene prediction accuracy.
  • Use ribosome profiling (Ribo-seq) and proteomics to confirm protein-coding potential.
  • Leverage poly(A) signal presence to distinguish mature transcripts from non-coding RNA.
  • Integrate RNA-seq, long-read sequencing, and EST data to refine exon-intron structures.
  • Provide direct experimental validation, reducing reliance on purely computational models.

Large Model-Based Methods

  • Utilize deep learning and large language models (LMMs) trained on vast genomic datasets.
  • Predict gene structures by recognizing sequence patterns learned from diverse organisms.
  • Adapt to various genome complexities without requiring manually curated training data.
  • Identify non-canonical gene structures, alternative splicing events, and novel genes.
  • Example: Helixer, a GPU-based model, predicts genes using high-dimensional sequence patterns, improving over traditional HMM-based methods.

What do we annotate in the genome?

  • Protein-coding genes: Genes that encode proteins, identified based on open reading frames (ORFs) and coding sequence features.
  • Pseudogenes: Non-functional remnants of once-active genes that have accumulated mutations.
  • Non-coding RNAs (ncRNAs): Functional RNAs that do not code for proteins, including tRNAs, rRNAs, and lncRNAs.
  • Upstream Open Reading Frames (uORFs): Small ORFs in the 5’ untranslated region (UTR) that can regulate translation.
  • Alternative Splicing Variants: Different transcript isoforms from the same gene, generated by exon-intron rearrangements.
  • Repeat Elements and Transposons: Mobile DNA sequences that can impact gene function and genome structure.
  • Regulatory Elements: Promoters, enhancers, and other sequences controlling gene expression.

For this workshop, we focus only on protein-coding genes and their annotation.

Annotation Workflows Used


1. BRAKER3: Combines GeneMark-ET and AUGUSTUS to predict genes in eukaryotic genomes.

  • Integrates GeneMark and AUGUSTUS with RNA-Seq and/or protein evidence.
  • Supports various data types or runs without them.
  • Fully automated training and gene prediction.
  • Requires evidence data for high accuracy; performs poorly without it.
  • Struggles with scalability on larger genomes.

2. Helixer: A deep learning-based gene prediction tool that uses a convolutional neural network (CNN) to predict genes in eukaryotic genomes.

  • Utilizes a deep learning model trained on diverse animal and plant genomes, reducing the need for retraining on closely related species.
  • Provides probabilities for every base, enabling detailed predictions for intergenic, untranslated, coding, and intronic regions.
  • Outperforms traditional tools like AUGUSTUS in both base-wise metrics and RNA-Seq consistency.
  • Handles large and complex eukaryotic genomes effectively, even for species with longer genes.
  • Requires no additional data except genome – no repeatmasking!
  • Cannot predict alternative splicing events or spliced start codons.
Helixer
Helixer

3. EASEL: Efficient, Accurate, Scalable Eukaryotic modeLs: eukaryotic genome annotation tool

  • Combines machine learning, RNA folding, and functional annotations to enhance prediction accuracy.
  • Leverages RNA-Seq, protein alignments, and additional evidence for gene prediction.
  • Built on Nextflow, ensuring portability across HPC systems and compatibility with Docker and Singularity.
  • Features EnTAP integration for streamlined functional annotation.
  • Provides multiple formats (GFF, GTF, CDS, protein sequences) for downstream analysis.
  • Employs BUSCO for quality assessment and automates parameter optimization.
EASEL
EASEL

4. EnTAP: eukaryotic non-model annotation pipeline (functional annotation)

  • Optimized for functional annotation of transcriptomes without reference genomes, addressing challenges like fragmentation and assembly artifacts.
  • Much faster than traditional annotation pipelines.
  • Integrates taxonomic filtering, contaminant detection, and informativeness metrics for selecting optimal alignments.
  • Enables annotation across diverse repositories (e.g., RefSeq, Swiss-Prot) with customizable database integration.
  • Includes gene family assignments, protein domains, and GO terms for enriched annotation.
  • User-friendly, Unix-based pipeline suitable for HPC environments, combining simplicity with high throughput.
EnTAP
EnTAP

How to we access gene predictions?

  • Busco and Omark assess gene set completeness by checking the presence of conserved orthologs across species, helping identify missing or fragmented genes
  • Gff3 metrics provide structural statistics on gene models, including exon-intron distribution, gene lengths, and coding sequence properties, highlighting inconsistencies in annotation
  • Feature assignment using RNA-seq read mapping quantifies how well predicted genes capture transcriptomic evidence, indicating potential missing or incorrect annotations
  • Reference comparison with known annotations measures sensitivity and precision, identifying correctly predicted, missed, or falsely annotated genes
  • Multiple validation methods combine structural, functional, and comparative assessments to improve confidence in gene predictions and refine annotation quality

Key Points

  • Genome annotation strategies include ab initio, evidence-based, hybrid, multi-evidence, and large model-based methods, each with different strengths and data requirements.
  • Ab initio methods predict genes using sequence patterns, while evidence-based methods use RNA-seq, proteins, and evolutionary conservation for refinement.
  • Multi-evidence approaches integrate diverse biological data such as ribosome profiling, proteomics, and long-read RNA-seq to improve accuracy.
  • Large model-based methods, like Helixer, use deep learning to predict genes without requiring manual training, outperforming traditional tools in complex genomes.
  • The workshop focuses on protein-coding gene annotation using tools like BRAKER, Helixer, EASEL, and EnTAP, followed by assessment with BUSCO, OMARK, and structural metrics.

Content from Annotation Setup


Last updated on 2025-04-22 | Edit this page

Overview

Questions

  • How should files and directories be structured for a genome annotation workflow?
  • Why is RNA-seq read mapping important for gene prediction?
  • How does repeat masking improve annotation accuracy?
  • What preprocessing steps are necessary before running gene prediction tools?

Objectives

  • Set up a structured workspace for genome annotation
  • Map RNA-seq reads to the genome to provide evidence for gene prediction
  • Perform repeat masking to prevent false gene annotations
  • Prepare genome and transcriptomic data for downstream annotation analyses

Setup


This section outlines the preprocessing steps required before gene annotation, including organizing directories, mapping RNA-Seq reads, and masking repetitive elements. These steps have already been completed to save time, but the instructions are provided for reference in case you need to run them on custom datasets.

Optional Section!

Feel free to skip this section as it is already performed to save time. The steps below are provided for reference - to run on custom data.

Folder structure


BASH

mkdir -p ${RCAC_SCRATCH}/annotation_workshop
cd ${RCAC_SCRATCH}/annotation_workshop
mkdir -p 00_datasets/{fastq,genome,bamfiles}
mkdir -p 03_setup
mkdir -p 04_braker
mkdir -p 05_helixer
mkdir -p 06_easel
mkdir -p 07_entap
mkdir -p 08_assessment
rsync -avP SOURCE/00_datasets/ 00_datasets/
Organization
Organization

RNA-Seq mapping


The RNA-seq dataset used for gene prediction comes from the accession E-MTAB-6422, that provides a gene expression atlas of Arabidopsis thaliana (Columbia accession), capturing transcriptional programs across 36 libraries representing different tissues (seedling, root, inflorescence, flower, silique, seed) and developmental stages (2-leaf, 6-leaf, 12-leaf, senescence, dry mature, and imbibed seed), with three biological replicates per condition.

Step 1: Download the reference genome and annotation files

BASH

# Download the reference genome
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
mkdir -p ${WORKSHOP_DIR}/00_datasets/genome
cd ${WORKSHOP_DIR}/00_datasets/genome 
wget https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-60/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
gunzip Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
ml --force purge
ml biocontainers
ml bioawk
bioaw -c fastx '{print ">"$name; print $seq}' \
   Arabidopsis_thaliana.TAIR10.dna.toplevel.fa |\
   fold > athaliana.fasta
rm Arabidopsis_thaliana.TAIR10.dna.toplevel.fa

Step 2: Index the reference genome using STAR

BASH

#!/bin/bash
ml --force purge
ml biocontainers
ml star
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana.fasta"
index="${WORKSHOP_DIR}/00_datasets/genome/star_index"
STAR --runThreadN ${SLURM_JOB_CPUS_PER_NODE} \
   --runMode genomeGenerate \
   --genomeDir ${index}  \
   --genomeSAindexNbases 12 \
   --genomeFastaFiles ${genome}

Options

Option Description
--runThreadN $SLURM_JOB_CPUS_PER_NODE Specifies the number of CPU threads to use. In an HPC environment, $SLURM_JOB_CPUS_PER_NODE dynamically assigns available CPUs per node.
--runMode genomeGenerate Runs STAR in genome index generation mode, required before alignment.
--genomeDir ${index} Path to the directory where the genome index will be stored.
--genomeSAindexNbases 12 Defines the length of the SA pre-indexing string. A recommended value is min(14, log2(GenomeLength)/2 - 1), but 12 is commonly used for smaller genomes.
--genomeFastaFiles ${genome} Path to the FASTA file containing the reference genome sequences.

Step 3: Map the RNA-Seq reads to the reference genome

We need to setup a script star-mapping.sh to map the RNA-Seq reads to the reference genome.

BASH

#!/bin/bash
ml --force purge
ml biocontainers
ml star
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana.fasta"
index="${WORKSHOP_DIR}/00_datasets/genome/star_index"
read1=$1
read2=$2
cpus=${SLURM_JOB_CPUS_PER_NODE}
outname=$(basename ${read1} | cut -f 1 -d "_")
STAR \
--runThreadN ${cpus} \
--genomeDir ${index} \
--outSAMtype BAM SortedByCoordinate \
--outFilterScoreMinOverLread 0.3 \
--outFilterMatchNminOverLread 0.3 \
--outFileNamePrefix ${outname}_ \
--readFilesCommand zcat \
--outWigType bedGraph \
--outWigStrand Unstranded \
--outWigNorm RPM \
--readFilesIn ${read1} ${read2}

Options

Option Description
--runThreadN ${cpus} Specifies the number of CPU threads to use for alignment.
--genomeDir ${index} Path to the pre-generated STAR genome index directory.
--outSAMtype BAM SortedByCoordinate Outputs aligned reads in BAM format, sorted by genomic coordinates.
--outFilterScoreMinOverLread 0.3 Sets the minimum alignment score relative to read length (30% of the max possible score).
--outFilterMatchNminOverLread 0.3 Sets the minimum number of matching bases relative to read length (30% of the read must match).
--outFileNamePrefix ${outname}_ Prefix for all output files generated by STAR.
--readFilesCommand zcat Specifies a command (zcat) to decompress input FASTQ files if they are gzipped.
--outWigType bedGraph Outputs signal coverage in bedGraph format for visualization.
--outWigStrand Unstranded Specifies that the RNA-seq library is unstranded (reads are not strand-specific).
--outWigNorm RPM Normalizes coverage signal to Reads Per Million (RPM).
--readFilesIn ${read1} ${read2} Specifies input paired-end FASTQ files (Read 1 and Read 2).

Step 4: Run the STAR mapping script

BASH

#!/bin/bash
# Define the directory containing FASTQ files
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
FASTQ_DIR="WORKSHOP_DIR/00_datasets/fastq"
# Loop through all R1 files and find corresponding R2 files
for R1 in ${FASTQ_DIR}/*_R1.fq.gz; do
    R2="${R1/_R1.fq.gz/_R2.fq.gz}"  # Replace R1 with R2
    if [[ -f "$R2" ]]; then  # Check if R2 exists
        echo "Running STAR mapping for: $R1 and $R2"
        ./star-mapping.sh "$R1" "$R2"
    else
        echo "Error: Matching R2 file for $R1 not found!" >&2
    fi
done

Step 5: store BAM files

BASH

# Define the directory containing BAM files
BAM_DIR="WORKSHOP_DIR/00_datasets/bamfiles"
# Move all BAM files to the BAM directory
mv *_Aligned.sortedByCoord.out.bam ${BAM_DIR}/

RepeatMasker


Some gene prediction tools require repeat-masked genomes to avoid mis-annotation of transposable elements as genes. We will use RepeatMasker to mask the repeats in the genome. But first, we need to download the RepeatMasker database.

BASH

WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
cd ${WORKSHOP_DIR}/00_datasets/genome
# Download the RepeatMasker database
wget https://www.girinst.org/server/RepBase/protected/RepBase30.01.fasta.tar.gz
tar xf RepBase30.01.fasta.tar.gz
replib=$(find $(pwd) -name "athrep.ref")
ml --force purge
ml biocontainers
ml repeatmasker
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana.fasta"
RepeatMasker \
   -e ncbi \
   -pa ${SLURM_CPUS_ON_NODE} \
   -q \
   -xsmall \
   -lib ${replib} \ 
   -nocut \
   -gff \
   ${genome}

Once done, we will rename it to athaliana.fasta.masked

BASH

mv athaliana.fasta.masked athaliana_softmasked.fasta

Key Points

  • Organizing files and directories ensures a reproducible and efficient workflow
  • Mapping RNA-seq reads to the genome provides essential evidence for gene prediction
  • Repeat masking prevents transposable elements from being mis-annotated as genes
  • Preprocessing steps are crucial for accurate and high-quality genome annotation

Content from Annotation using BRAKER


Last updated on 2025-04-22 | Edit this page

Overview

Questions

  • What is BRAKER3?
  • What are the different scenarios in which BRAKER3 can be used?
  • How to run BRAKER3 with different input requirements?
  • What are the different output files generated by BRAKER3?

Objectives

  • Understand the different scenarios in which BRAKER3 can be used
  • Learn how to run BRAKER3 with different input requirements
  • Learn how to interpret the output files generated by BRAKER3

BRAKER3 is a pipeline that combines GeneMark-ET and AUGUSTUS to predict genes in eukaryotic genomes. This pipeline is particularly useful for annotating newly sequenced genomes. The flexibility of BRAKER3 allows users to provide various input datasets for improving gene prediction accuracy. In this example, we will use various scenarios to predict genes in a Maize genome using BRAKER3. Following are the scenarios we will cover

Organizing the Data


Before running BRAKER3, we need to organize the data in the following format:

BASH

WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
mkdir -p ${WORKSHOP_DIR}/04_braker/braker_case{1..5}
Folder organization
Folder organization

Setup


Before running BRAKER3, we need to set up:

  1. GeneMark-ES/ET/EP/ETP license key
  2. The AUGUSTUS_CONFIG_PATH configuration path

The license key for GeneMark-ES/ET/EP/ETP can be obtained from the GeneMark website. Once downloaded, you need to place it in your home directory:

BASH

tar xf gm_key_64.gz
cp gm_key_64 ~/.gm_key

For the AUGUSTUS_CONFIG_PATH, we need to copy the config directory from the Singularity container to the scratch directory. This is required because BRAKER3 needs to write to the config directory, and the Singularity container is read-only. To copy the config directory, run the following command:

BASH

ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
mkdir -p ${WORKSHOP_DIR}/04_braker/augustus
copy_augustus_config ${WORKSHOP_DIR}/04_braker/augustus
export AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"

Case 1: Using RNAseq Reads


To run BRAKER3 with RNAseq reads, we need to provide the RNAseq reads in BAM format. We will use the following command to run BRAKER3 with RNAseq reads:

BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --account=workshop
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case1
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
BAM_DIR="${WORKSHOP_DIR}/00_datasets/bamfiles"
BAM_FILES=$(find "${BAM_DIR}" -name "*.bam" | tr '\n' ',')
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case1
cd ${workdir}
braker.pl \
   --AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
   --GENEMARK_PATH=${GENEMARK_PATH} \
   --bam=${BAM_FILES} \
   --genome=${genome} \
   --species=At_$(date +"%Y%m%d").c1b \
   --workingdir=${workdir} \
   --gff3 \
   --threads ${SLURM_CPUS_ON_NODE}

The job takes ~55 mins to finish. The results of the BRAKER3 run will be stored in the respective braker_case1 directory, as braker.gff3 file.

Case 2: Using Conserved proteins only


Using the orthodb-clades tool, we can download protein sequences for a specific clade. In this scenario, since we are using the Arabidopsis genome, we can download the clades specific Viridiplantae.fa OrthoDB v12 protein sets.

BASH

WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/04_braker/braker_case2
mkdir -p ${workdir}
git clone git@github.com:tomasbruna/orthodb-clades.git
cd orthodb-clades
ml --force purge
ml biocontainers
ml snakemake
snakemake --cores ${SLURM_CPUS_ON_NODE} selectViridiplantae

When this is done, you should see a folder named clade with Viridiplantae.fa in the orthodb-clades directory. We will use this as one of the input datasets for BRAKER3.
The following command will run BRAKER3 with the input genome and protein sequences:

BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case2
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
proteins="${WORKSHOP_DIR}/04_braker/braker_case2/orthodb-clades/clades/Viridiplantae.fa"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case2
mkdir -p ${workdir}
cd ${workdir}
braker.pl \
   --AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
   --GENEMARK_PATH=${GENEMARK_PATH} \
   --genome=${genome} \
   --prot_seq=${proteins} \
   --species=At_$(date +"%Y%m%d").c2 \
   --workingdir=${workdir} \
   --gff3 \
   --threads ${SLURM_CPUS_ON_NODE}

The results of the BRAKER3 run will be stored in the respective braker_case2 directory, as braker.gff3 file. The run time for this job is ~170 mins.

Case 3: Using RNAseq and Proteins Together


BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case3
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
BAM_DIR="${WORKSHOP_DIR}/00_datasets/bamfiles"
BAM_FILES=$(find "${BAM_DIR}" -name "*.bam" | tr '\n' ',')
proteins="${WORKSHOP_DIR}/04_braker/braker_case2/orthodb-clades/clade/Viridiplantae.fa"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case3
mkdir -p ${workdir}
cd ${workdir}
braker.pl \
   --AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
   --GENEMARK_PATH=${GENEMARK_PATH} \
   --genome=${genome} \
   --prot_seq=${proteins} \
   --bam=${BAM_FILES} \
   --species=At_$(date +"%Y%m%d").c3 \
   --workingdir=${workdir} \
   --gff3 \
   --threads ${SLURM_CPUS_ON_NODE}

The results of the BRAKER3 run will be stored in the respective braker_case3 directory, as braker.gff3 file. The run time for this job is ~120 mins.

Case 4: ab Initio mode


BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case4
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case4
mkdir -p ${workdir}
cd ${workdir}
braker.pl \
   --AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
   --GENEMARK_PATH=${GENEMARK_PATH} \
   --esmode \
   --genome=${genome} \
   --species=Zm_$(date +"%Y%m%d").c4 \
   --workingdir=${workdir} \
   --gff3 \
   --threads ${SLURM_CPUS_ON_NODE}

The results of the BRAKER3 run will be stored in the respective braker_case4 directory, as braker.gff3 file. The run time for this job is ~60 mins.

Case 5: Using Pre-trained Model


BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case5
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case5
mkdir -p ${workdir}
cd ${workdir}
braker.pl \
   --AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
   --GENEMARK_PATH=${GENEMARK_PATH} \
   --skipAllTraining \
   --genome=${genome} \
   --species=arabidopsis \
   --workingdir=${workdir} \
   --gff3 \
   --threads ${SLURM_CPUS_ON_NODE}

The results of the BRAKER3 run will be stored in the respective braker_case5/Augustus directory. The main result file augustus.ab_initio.gff3 will be used for downstream analysis. Runtime for this job is ~7 mins.

Results and Outputs


coming soon!

Key Points

  • BRAKER3 is a pipeline that combines GeneMark-ET and AUGUSTUS to predict genes in eukaryotic genomes
  • BRAKER3 can be used with different input datasets to improve gene prediction accuracy
  • BRAKER3 can be run in different scenarios to predict genes in a genome

Content from Annotation using Helixer


Last updated on 2025-04-22 | Edit this page

Overview

Questions

  • How to predict genes using Helixer?
  • How to download trained models for Helixer?
  • How to run Helixer on the HPC cluster (Gilbreth)?

Objectives

  • Learn how to predict genes using Helixer.
  • Learn how to download trained models for Helixer.
  • Learn how to run Helixer on the HPC cluster (Gilbreth).

Helixer is a deep learning-based gene prediction tool that uses a convolutional neural network (CNN) to predict genes in eukaryotic genomes. Helixer is trained on a wide range of eukaryotic genomes and can predict genes in both plant and animal genomes. Helixer can predict genes wihtout any extrinisic information such as RNA-seq data or homology information, purely based on the sequence of the genome.

This section should be run on Gilbreth HPC cluster.

Due to the GPU requirement for Helixer, you need to run this section on the Gilbreth HPC cluster. You don’t have to copy fastq, or bamfiles, but only need athaliana.fasta file in the 00_datasets/genome directory.

BASH

WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
mkdir -p ${WORKSHOP_DIR}/05_helixer
cp <depot_location>/genome/athaliana.fasta ${WORKSHOP_DIR}/05_helixer/athaliana.fasta
Folder organization
Folder organization

Setup


Helixer is available as a Singularity/apptainer container. You can pull the container using the following apptainer pull command. See the Helixer Docker repository for more information.

Helixer is installed as a module on the Gilbreth cluster, and can be loaded using the following commands:

BASH

ml --force purge
ml biocontainers
ml helixer

Downloading trained models


Helixer requires a trained model to predict genes. With the included script fetch_helixer_models.py you can download models for specific lineages. Currently, models are available for the following lineages:

  • land_plant
  • vertebrate
  • invertibrate
  • fungi

You can download the models using the following command:

BASH

# all models
# fetch_helixer_models.py --all
# or for a specific lineage
fetch_helixer_models.py --lineage land_plant

This will download all lineage models in the models directory. You can also download models for specific lineages using the --lineage option as shown above.

By default, files will be downloaded to ~/.local/share/Helixer directory. You should see the follwing files:

.
└── models
    └── land_plant
        ├── land_plant_v0.3_a_0080.h5
        ├── land_plant_v0.3_a_0090.h5
        ├── land_plant_v0.3_a_0100.h5
        ├── land_plant_v0.3_a_0200.h5
        ├── land_plant_v0.3_a_0300.h5
        ├── land_plant_v0.3_a_0400.h5
        ├── land_plant_v0.3_m_0100.h5
        └── land_plant_v0.3_m_0200.h5

land_plant_v0.3_a_0080.h5 is the smallest model and land_plant_v0.3_m_0200.h5 is the largest model. The model size is determined by the number of parameters in the model. The larger models are more accurate but require more memory and time to run.

Running Helixer


Helixer requires GPU for prediction. For running Helixer, you need to request a GPU node. You will also need the genome sequence in fasta format. For this tutorial, we will use Maize genome (Zea mays subsp. mays), and use the land_plant model to predict genes.

BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --gpus-per-node=1
#SBATCH --time=04:00:00
#SBATCH --account=standby
#SBATCH --job-name=helixer
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
ml --force purge
ml biocontainers
ml helixer
genome=athaliana.fasta
species="Arabidopsis thaliana"
output="athaliana_helixer.gff"
Helixer.py \
    --lineage land_plant \
    --fasta-path ${genome} \
    --species ${species} \
    --gff-output-path ${output}

A typical job will take ~40 mins to finish depedning on the GPU the job gets allocated.

You can count the nubmer of predictions in your gff3 file using the following command:

BASH

awk '$3=="gene"' athaliana_helixer.gff | wc -l

The GFF format output had 27,201 genes predicted using Helixer. You can view the various features in the gff file using the following command:

BASH

grep -v "^#" athaliana_helixer.gff |\
   cut -f 3 |\
   sort |\
   uniq -c

To get cds and pep files, you can use the following command:

BASH

ml --force purge
ml biocontainers
ml cufflinks
gffread Arabidopsis_thaliana.gff3 \
   -g athaliana.fasta \
   -y helixer_pep.fa \
   -x helixer_cds.fa

As you may have noticed, the number of mRNA and gene features are the same. This is because isoforms aren’t predicted by Helixer and you only have one transcript per gene. Exons are indetified with high confidence and alternative isoforms are usually collapsed into a single gene model. This is one of the known limitations of Helixer.

Key Points

  • Helixer is a deep learning-based gene prediction tool that uses a convolutional neural network (CNN) to predict genes in eukaryotic genomes.
  • Helixer can predict genes wihtout any extrinisic information such as RNA-seq data or homology information, purely based on the sequence of the genome.
  • Helixer requires a trained model and GPU for prediction.
  • Helixer predicts genes in the GFF3 format, but will not predict isoforms.

Content from Annotation using Easel


Last updated on 2025-04-22 | Edit this page

Overview

Questions

  • What are the steps required to set up and run EASEL on an HPC system?
  • How is Nextflow used to manage and execute the EASEL workflow?
  • What configuration files need to be modified before running EASEL?
  • How do you submit and monitor an EASEL job using Slurm?

Objectives

  • Set up and configure EASEL for execution on an HPC system.
  • Install Nextflow and pull the EASEL workflow from the repository.
  • Modify the necessary configuration files to match the HPC environment.
  • Submit and run EASEL as a Slurm job and verify successful execution.

EASEL (Efficient, Accurate, Scalable Eukaryotic modeLs) is a genome annotation tool that integrates machine learning, RNA folding, and functional annotations to improve gene prediction accuracy, leveraging optimized AUGUSTUS parameters, transcriptome and protein evidence, and a Nextflow-based scalable pipeline for reproducible analysis.

Setup


We will do a custom installation of nextflow (we will not use ml nextflow):

BASH

ml --force purge
ml openjdk
curl -s https://get.nextflow.io | bash
mv nextflow ~/.local/bin

check the installation:

BASH

nextflow -v
which nextflow

We can organize our folder structure as follows:

Folder organization
Folder organization

Running Easel


Step 1: create a rcac.config file to run the jobs using slurm scheduler.

BASH

params {
    config_profile_description = 'Negishi HPC, RCAC Purdue University.'
    config_profile_contact     = 'Arun Seetharam (aseethar@purdue.edu)'
    config_profile_url         = "https://www.rcac.purdue.edu/knowledge/negishi"
    partition                  = 'testpbs'
    project                    = 'your_project_name'  // Replace with a valid Slurm project
    max_memory                 = '256GB'
    max_cpus                   = 128
    max_time                   = '48h'
}
singularity {
    enabled    = true
    autoMounts = true
    singularity.cacheDir = "${System.getenv('RCAC_SCRATCH')}/.singularity"
    singularity.pullTimeout = "1h"
}
process {
    resourceLimits = [
        memory: 256.GB,
        cpus: 128,
        time: 48.h
    ]
    executor       = 'slurm'
    clusterOptions = "-A ${params.project} -p ${params.partition}"
}
// Executor-specific configurations
if (params.executor == 'slurm') {
    process {
        cpus   = { check_max( 16    * task.attempt, 'cpus'   ) }
        memory = { check_max( 32.GB * task.attempt, 'memory' ) }
        time   = { check_max( 8.h  * task.attempt, 'time'   ) }
        clusterOptions = "-A ${params.project} -p ${params.partition}"

        errorStrategy = { task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish' }
        maxRetries    = 1
        maxErrors     = '-1'

        withLabel:process_single {
            cpus   = { check_max( 16                  , 'cpus'    ) }
            memory = { check_max( 32.GB  * task.attempt, 'memory'  ) }
            time   = { check_max( 8.h    * task.attempt, 'time'    ) }
        }
        withLabel:process_low {
            cpus   = { check_max( 32     * task.attempt, 'cpus'    ) }
            memory = { check_max( 64.GB  * task.attempt, 'memory'  ) }
            time   = { check_max( 24.h   * task.attempt, 'time'    ) }
        }
        withLabel:process_medium {
            cpus   = { check_max( 64     * task.attempt, 'cpus'    ) }
            memory = { check_max( 128.GB * task.attempt, 'memory'  ) }
            time   = { check_max( 24.h   * task.attempt, 'time'    ) }
        }
        withLabel:process_high {
            cpus   = { check_max( 128    * task.attempt, 'cpus'    ) }
            memory = { check_max( 256.GB * task.attempt, 'memory'  ) }
            time   = { check_max( 24.h   * task.attempt, 'time'    ) }
        }
        withLabel:process_long {
            time   = { check_max( 48.h  * task.attempt, 'time'    ) }
        }
        withLabel:process_high_memory {
            memory = { check_max( 256.GB * task.attempt, 'memory' ) }
        }
        withLabel:error_ignore {
            errorStrategy = 'ignore'
        }
        withLabel:error_retry {
            errorStrategy = 'retry'
            maxRetries    = 2
        }
    }
}

Step 2: pull the Easel nextflow workflow:

BASH

ml --force purge
ml biocontainers
ml nextflow
ml openjdk
nextflow pull -hub gitlab PlantGenomicsLab/easel

This will pull the latest version of the Easel workflow from the PlantGenomicsLab GitLab repository.

Step 3: modify the ~/.nextflow/assets/PlantGenomicsLab/easel/nextflow.config file to include the rcac.config file:

BASH

vim ~/.nextflow/assets/PlantGenomicsLab/easel/nextflow.config

or if you prefer nano:

BASH

nano ~/.nextflow/assets/PlantGenomicsLab/easel/nextflow.config

Add the following line to the nextflow.config file:

BASH

profiles {
   debug { process.beforeScript = 'echo $HOSTNAME' }
   ...
   ...
   test  { includeConfig 'conf/test.config' }
   rcac  { includeConfig 'conf/rcac.config' }
}

Note that this is in line # 235 in the nextflow.config file. and all the lines between debug and test aren’t shown here.

Now, we are ready to run the Easel workflow.

Step 4: create param.yaml file to run the Easel workflow:

BASH

WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
mkdir -p ${WORKSHOP_DIR}/06_easel
cd ${WORKSHOP_DIR}/06_easel

for the param.yaml file:

YAML

outdir: easel
genome: /scratch/negishi/aseethar/annotation_workshop/00_datasets/genome/athaliana_softmasked.fasta
bam: /scratch/negishi/aseethar/annotation_workshop/00_datasets/bamfiles/*.bam
busco_lineage: embryophyta
order: Brassicales
prefix: arabidopsis
taxon: arabidopsis
singularity_cache_dir: /scratch/negishi/aseethar/singularity_cache
training_set: plant
executor: slurm
account: testpbs
qos: normal
project: testpbs

Now we are ready to run the Easel workflow:

Our slurm file to run the Easel workflow:

BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=04-00:00:00
#SBATCH --job-name=easel
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml purge
ml biocontainers
ml openjdk
nextflow run \
   -hub gitlab PlantGenomicsLab/easel \
   -params-file params.yaml \
   -profile rcac \
   --project testpbs

This will run the Easel workflow on the RCAC HPC.

Results and Outputs


Key Points

  • EASEL is executed using Nextflow, which simplifies workflow management and ensures reproducibility.
  • Proper configuration of resource settings and HPC parameters is essential for successful job execution.
  • Running EASEL requires setting up input files, modifying configuration files, and submitting jobs via Slurm.
  • Understanding how to monitor and troubleshoot jobs helps ensure efficient pipeline execution.

Content from Functional annotation using EnTAP


Last updated on 2025-04-22 | Edit this page

Overview

Questions

  • What is EnTAP, and how does it improve functional annotation for non-model transcriptomes?
  • How does EnTAP filter, annotate, and assign functional roles to predicted transcripts?
  • What databases and evidence sources does EnTAP integrate for annotation?
  • What are the key steps required to set up and execute EnTAP on an HPC system?

Objectives

  • Understand how EnTAP improves functional annotation for non-model eukaryotes.
  • Learn how EnTAP processes transcript data through filtering, alignment, and functional assignment.
  • Set up and modify EnTAP configuration files for correct execution.
  • Run EnTAP on an HPC system and interpret the generated annotations.

EnTAP Overview


EnTAP is a bioinformatics pipeline designed to enhance the accuracy, speed, and flexibility of functional annotation for de novo assembled transcriptomes in non-model eukaryotic organisms. It mitigates assembly fragmentation, improves annotation rates, and provides extensive functional insights into transcriptomes. You can provide predicted transcripts in FASTA format and a GFF file containing gene models to run EnTAP.

Key Features

  • Optimized for non-model eukaryotes – Overcomes challenges of fragmented transcriptome assemblies.
  • Fast & efficient – Runs significantly faster than comparable annotation tools.
  • Customizable – Supports optional filtering and analysis steps for user-specific needs.
  • Comprehensive functional insights – Integrates multiple annotation sources for high-confidence gene assignments.
  • Contaminant detection – Helps remove misleading sequences for cleaner datasets.

How EnTAP Works


  1. Transcriptome Filtering – Identifies true coding sequences (CDS) and removes assembly artifacts:
    • Expression Filtering (optional) – Filters transcripts based on gene expression levels using RSEM.
    • Frame Selection (optional) – Further refines CDS predictions using TransDecoder.
  2. Transcriptome Annotation – Assigns functional information to sequences:
    • Similarity Search – Rapid alignment against user-selected databases using DIAMOND.
    • Contaminant Filtering & Best Hit Selection – Identifies optimal annotations and flags potential contaminants.
    • Orthologous Group Assignment – Assigns translated proteins to gene families using eggNOG/eggnog-mapper, including:
      • Protein Domains (SMART/Pfam)
      • Gene Ontology (GO) Terms
      • KEGG Pathway Annotation
    • InterProScan (optional) – Searches InterPro databases for additional domain, GO, and pathway annotations.
    • Horizontal Gene Transfer Analysis (optional) – Detects potential horizontal gene transfer (HGT) events via DIAMOND.

Running EnTAP


EnTAP is available as a module on the HPC cluster. You can load the module using the following commands:

BASH

ml --force purge
ml biocontainers
ml entap

First-Time Setup

When running for the first time, you will have to set up the databases for EnTAP. This includes downloading files from various databases, and can be time consuming. This section is already performed so you can skip this step and is included for reference.

entap_run.params file should be setup as follows (be sure to select the correct databases for your organism):

INI

out-dir=entap_dbfiles
overwrite=false
resume=false
input=
database=uniprot_sprot,refseq_plant
no-trim=false
threads=1
output-format=1,3,4,7,
fpkm=0.5
align=
single-end=false
frame-selection=false
transdecoder-m=100
transdecoder-no-refine-starts=false
taxon=
qcoverage=50
tcoverage=50
contam=
e-value=1e-05
uninformative=conserved,predicted,unknown,unnamed,hypothetical,putative,unidentified,uncharacterized,uncultured,uninformative,
diamond-sensitivity=very-sensitive
ontology_source=0,
eggnog-contaminant=true
eggnog-dbmem=true
eggnog-sensitivity=more-sensitive
interproscan-db=
hgt-donor=
hgt-recipient=
hgt-gff=
ncbi-api-key=
ncbi-api-enable=true

entap_config.ini file should be setup as follows (be sure to modify the paths <custom_location> to your desired location):

INI

data-generate=false
data-type=0,
entap-db-bin=<custom_location>/entap_db/bin/entap_database.bin
entap-db-sql=entap_database.db
entap-graph=entap_graphing.py
rsem-calculate-expression=rsem-calculate-expression
rsem-sam-validator=rsem-sam-validator
rsem-prepare-reference=rsem-prepare-reference
convert-sam-for-rsem=convert-sam-for-rsem
transdecoder-long-exe=TransDecoder.LongOrfs
transdecoder-predict-exe=TransDecoder.Predict
diamond-exe=diamond
eggnog-map-exe=emapper.py
eggnog-map-data=<custom_location>/entap_db/databases
eggnog-map-dmnd=<custom_location>/entap_db/bin/eggnog_proteins.dmnd
interproscan-exe=interproscan.sh

Once done, for the first time setup, you can run the following command:

BASH

ml --force purge
ml biocontainers
ml entap
EnTAP \
   --config \
   --run-ini ./entap_run.params \
   --entap-ini ./entap_config.ini \
   --threads ${SLURM_CPUS_ON_NODE}

This will download the databases and set up the configuration files for EnTAP.

Step 1: Prepare files


Your input files should be in the following format:

  • Transcript FASTA file – Contains predicted transcripts in FASTA format.
  • Configuration file – Specifies parameters for EnTAP execution.
    • entap_run.params – Contains runtime parameters for EnTAP.
    • entap_config.ini – Specifies paths to EnTAP binaries and databases. (you can copy the files from /depot/itap/datasets/entap_db/entap_{config.ini,run.params})

Edit the entap_run.params file to specify the output directory for EnTAP results and the correct input file

INI

out-dir=entap_out
input=input_cds.fasta

Step 2: Run EnTAP


Run EnTAP using the following command:

BASH

ml --force purge
ml biocontainers
ml entap
EnTAP -\
   -run \
   --run-ini ./entap_run.params \
   --entap-ini ./entap_config.ini \
   --threads ${SLURM_CPUS_ON_NODE}

Interpreting Results


EnTAP generates several output files, but the key results will be in the entap_out/final_results directory.

EnTAP Output Files Summary

File/Directory Description
Final Results (final_results/)
annotated.fnn FASTA file of annotated transcripts.
annotated.tsv Tab-separated file with functional annotations.
annotated_gene_ontology_terms.tsv GO terms assigned to annotated transcripts.
entap_results.tsv Master summary of all results, including annotations.
unannotated.fnn FASTA file of unannotated transcripts.
unannotated.tsv List of transcripts that failed annotation.
gene_family/ Stores eggNOG gene family assignments, including orthologs and functional annotations.
similarity_search/ Contains results from DIAMOND BLASTX searches against selected databases.
transcriptomes/ Holds the input transcriptome (CDS) and the processed version after filtering.

Key Points

  • EnTAP enhances functional annotation by integrating multiple evidence sources, including homology, protein domains, and gene ontology.
  • Proper setup of configuration files and databases is essential for accurate and efficient EnTAP execution.
  • Running EnTAP involves transcript filtering, similarity searches, and functional annotation through automated workflows.
  • The pipeline provides extensive insights into transcript function, improving downstream biological interpretations.

Content from Annotation Assesment


Last updated on 2025-04-22 | Edit this page

Overview

Questions

  • How to assess the quality of a genome annotation?
  • What are the different tools available for assessing the quality of a genome annotation?
  • How to compare the predicted annotation with the reference annotation?
  • How to measure the number of raw reads assigned to the features predicted by the annotation?

Objectives

  • Assess the quality of a genome annotation using different tools.
  • Compare the predicted annotation with the reference annotation.
  • Measure the number of raw reads assigned to the features predicted by the annotation.

Setup


The folder organization is as follows:

The annotation files (gff3) are from the previous steps. But are collected in this folder for convenience.

We will extract CDS and Protein sequences from the gff3 file before we begin the assesment.

BASH

ml --force purge
ml biocontainers
ml cufflinks
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
for i in *.gff3; do
    base=$(basename ${i%.*})
    gffread \
       -g ${WORKSHOP_DIR}/00_datasets/genome/athaliana.fa \
       -x ${base}_cds.fa \
       -y ${base}_pep.fa $i
done

GFF3 metrics


We will use agat_sp_statistics.pl to calculate the statistics of the GFF3 file.

BASH

ml --force purge
ml biocontainers
ml agat
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
mkdir -p gff3_stats
for gff3 in *.gff3; do
    base=$(basename $gff3 .gff3)
    agat_sp_statistics.pl \
       --gff ${gff3} \
       -o gff3_stats/${base}_stats.txt
done

BUSCO assesment


We will use BUSCO to asses the quality of the annotation.

BASH

#!/bin/bash
ml --force purge
ml biocontainers
ml busco
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
for pep in *_pep.fa; do
    base=$(basename ${pep%.*})
     busco \
        --in ${pep} \
        --cpu ${SLURM_CPUS_ON_NODE} \
        --out ${base}_busco \
        --mode prot \
        --lineage_dataset brassicales_odb10 \
        --force
done

Once the BUSCO assesment is complete, we can view the results using the following command:

BASH

ml --force purge
ml biocontainers
ml busco
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
mkdir -p busco_results
for f in */*_busco/short_summary*busco.txt; do
   cp $f busco_results/;
done
generate_plot.py –wd busco_results

Omark assesment


OMArk is a software for proteome (protein-coding gene repertoire) quality assessment. It provides measures of proteome completeness, characterizes the consistency of all protein coding genes with regard to their homologs, and identifies the presence of contamination from other species.

The proteomes should be filtered to take only primary isoforms before running OMark.

BASH

ml --force purge
ml biocontainers
ml seqkit
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
for pep in braker*_pep.fa; do
    base=$(basename ${pep%.*})
    grep "\.t1$" $pep | sed 's/>//g' > ${base}.primary.ids
    seqkit grep -f ${base}.primary.ids $pep > ${base}.primary.pep.fa
    rm ${base}.primary.ids
done
cp helixer_pep.fa helixer.primary.pep.fa

Running OMark:

BASH

#!/bin/bash
ml --force purge
ml biocontainers
ml omark
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
database="/depot/itap/datasets/omark/LUCA.h5"
cd ${workdir}
for pep in *.primary.pep.fa; do
    base=$(basename ${pep%.*})
    omamer search \
       --db ${database} \
       --query ${pep} \
       --out ${pep%.*}.omamer \
       --nthreads ${SLURM_CPUS_ON_NODE}
    omark \
       --file ${pep%.*}.omamer \
       --database ${database} \
       --outputFolder ${pep%.*} \
       --og_fasta ${pep}
done

Conserved HOGs in Brassicaceae

Category Count Percentage
Total Conserved HOGs 17,996 100%
Single-Copy HOGs (S) 16,237 90.23%
Duplicated HOGs (D) 1,315 7.31%
Unexpected Duplications (U) 351 1.95%
Expected Duplications (E) 964 5.36%
Missing HOGs (M) 444 2.47%

Proteome Composition

Category Count Percentage
Total Proteins 26,994 100%
Consistent (A) 25,480 94.39%
Partial Hits (P) 1,306 4.84%
Fragmented (F) 630 2.33%
Inconsistent (I) 167 0.62%
Partial Hits (P) 106 0.39%
Fragmented (F) 29 0.11%
Likely Contamination (C) 0 0.00%
Partial Hits (P) 0 0.00%
Fragmented (F) 0 0.00%
Unknown (U) 1,347 4.99%

HOG Placement - Detected Species

Species NCBI TaxID Associated Proteins % of Total Proteome
Arabidopsis thaliana 3702 25,647 95.01%

Feature assignment


To measure the number of raw reads assigned to the features predicted by the annotation, we will use featureCounts from the subread package.

BASH

#!/bin/bash
ml --force purge
ml biocontainers
ml subread
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
mkdir -p featureCounts
for gff3 in *.gff3; do
    base=$(basename ${gff3%.*})
    featureCounts \
      -T ${SLURM_CPUS_ON_NODE} \
      -a ${gff3} \
      -t exon \
      -g ID \
      -p \
      -B \
      -o ${base}_merged_counts.txt \
      --tmpDir /tmp ${RCAC_SCRATCH}/annotation_workshop/00_datasets/bamfiles/*.bam
done

Reference annotation comparison


We will use mikado compare the reference annotation with the predicted annotation.

BASH

#!/bin/bash
ml --force purge
ml biocontainers
ml mikado
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
for gff3 in *.gff3; do
    base=$(basename ${gff3%.*})
    mikado compare \
      --protein-coding \
      -r ${RCAC_SCRATCH}/annotation_workshop/00_datasets/genome/athaliana_TAIR10.gff3 \
      -p ${gff3} \
      -o ref-TAIR10_vs_prediction_${base}_compared \
      --log ${base}_compare.log
done

Summary


Key Points

  • busco and omark assess how well conserved genes are represented in the predicted gene set
  • gff3 metrics provide structural insights and highlight discrepancies compared to known annotations
  • featureCounts assignment quantifies the number of RNA-seq reads aligning to predicted features
  • Reference annotation comparison evaluates how closely the predicted genes match an established reference
  • Multiple assessment methods ensure a comprehensive evaluation of annotation quality