Content from Introduction to Genome Annotation
Last updated on 2025-04-22 | Edit this page
Overview
Questions
- What is genome annotation, and why is it important?
- What are the different types of genome annotation?
- What challenges make genome annotation difficult?
- What data sources are used to improve annotation accuracy?
- How does the annotation process fit into genomics research?
Objectives
- Define genome annotation and its role in genomics research
- Differentiate between structural and functional annotation
- Identify key challenges in genome annotation and how they impact
accuracy
- Understand the types of data used in annotation and their
significance
- Provide an overview of the annotation workflow used in this workshop
What Is Genome Annotation?
- Genome annotation is the process of identifying and labeling
functional elements in a genome, such as genes, regulatory regions, and
non-coding RNAs.
- It involves predicting gene structures, including exons, introns,
and untranslated regions, using computational and experimental
data.
- Functional annotation assigns biological roles to predicted genes
based on sequence similarity, protein domains, and pathway
associations.
- Accurate annotation is essential for understanding genome function, evolutionary relationships, and applications in research and biotechnology.
Why Is Annotation Important?
- Annotation provides a functional map of the genome, helping
researchers identify genes and regulatory elements.
- It enables comparative genomics, allowing insights into evolution,
gene conservation, and species-specific adaptations.
- Accurate gene annotation is essential for biomedical and
agricultural research, including disease studies and crop
improvement.
- Functional annotations link genes to biological processes, aiding in pathway analysis and experimental validation.
Types of Genome Annotation
-
Structural annotation identifies genes, exon-intron
boundaries, untranslated regions (UTRs), and regulatory elements such as
promoters and enhancers.
-
Functional annotation assigns biological roles to
genes based on sequence similarity, protein domains, gene ontology (GO),
and pathway associations.
-
Repeat annotation detects transposable
elements/centromere repeats/knob sequences and other repetitive
sequences that prevent misannotation of genes.
- Epigenomic annotation integrates chromatin accessibility, DNA methylation, and histone modifications to understand gene regulation.
- Non-coding RNA annotation includes small RNAs (miRNAs, siRNAs, piRNAs), long non-coding RNAs (lncRNAs), ribosomal RNAs (rRNAs), and transfer RNAs (tRNAs).
Key Challenges in Genome Annotation
-
Incomplete or fragmented assemblies: Gaps and
misassemblies can lead to missing or incorrect gene predictions.
-
Alternative splicing and isoforms: Complex gene
structures make it difficult to define a single gene model.
-
Pseudogenes and transposable elements:
Distinguishing functional genes from non-functional sequences is
challenging.
-
Low-quality or missing evidence: Incomplete
RNA-seq, proteomics, or evolutionary data can reduce annotation
accuracy.
-
Annotation discrepancies across tools: Different
prediction methods may produce conflicting gene models.
- Evolutionary divergence: Gene structures and functional elements can vary widely across species, complicating homology-based annotation.
Data Sources for Annotation
Accurate genome annotation relies on multiple data sources to refine gene predictions and validate functional elements. Different types of experimental and computational evidence contribute to identifying gene structures, transcript isoforms, coding potential, and functional assignments.
1. Long-read transcript sequencing (Iso-Seq)
- Captures full-length transcripts, helping define complete exon-intron
structures.
- Useful for identifying alternative splicing events and novel
isoforms.
2. Ribosome profiling (Ribo-Seq)
- Detects ribosome-protected mRNA fragments to identify actively
translated regions.
- Helps differentiate coding from non-coding RNAs and refine start/stop
codons.
3. Cap analysis gene expression (CAGE) and
RAMPAGE
- Maps transcription start sites by detecting 5’ capped mRNAs.
- Provides precise information on promoter locations and alternative
start sites.
4. Proteomics (Mass Spectrometry Data)
- Confirms protein-coding genes by detecting peptide fragments in
expressed proteins.
- Helps validate translation and refine gene models.
5. RNA sequencing (Short-read RNA-Seq)
- Provides gene expression levels and transcript coverage across
different conditions.
- Essential for identifying intron-exon boundaries and alternative
splicing events.
6. Comparative genomics (Synteny &
Homology)
- Uses known gene models from related species to guide annotation.
- Helps predict conserved genes and validate gene structures.
7. Additional Data Sources
- Poly(A) site sequencing (PAS-Seq): Defines transcript
end sites for more precise UTR annotation.
- Chromatin accessibility data (ATAC-Seq, DNase-Seq):
Identifies regulatory regions like enhancers and promoters.
- Epigenomic data (ChIP-Seq, DNA methylation): Provides
evidence for active and repressed genes.
Scope of This Workshop
In this workshop, we will focus on protein-coding gene annotation in eukaryotic genomes using computational tools and available datasets. We will annotate genes in a Arabidopsis genome using a combination of ab initio gene prediction, homology-based methods, and RNA-seq evidence methods, can compare the results to a reference annotation. The workflow will cover data preparation, gene prediction, functional annotation, and quality assessment steps to provide a comprehensive overview of the annotation process.
Key Points
- Genome annotation identifies functional elements in a genome, including genes, regulatory regions, and non-coding RNAs.
- Structural annotation predicts gene structures, while functional annotation assigns biological roles to genes.
- Challenges in annotation include incomplete assemblies, alternative splicing, pseudogenes, and data quality issues.
- Data sources like RNA-seq, proteomics, comparative genomics, and epigenomics improve annotation accuracy.
- This workshop focuses on eukaryotic gene annotation using computational tools and diverse datasets.
Content from Annotation Strategies
Last updated on 2025-04-22 | Edit this page
Overview
Questions
- What are the different strategies used for genome annotation?
- How do various methods predict genes, and what are their strengths
and limitations?
- What role do evidence play in improving gene predictions?
- How do deep learning and large models enhance genome annotation
accuracy?
- What tools are used in this workshop, and how do they fit into different strategies?
Objectives
- Understand the principles behind different genome annotation
approaches.
- Learn how intrinsic, extrinsic, and hybrid methods predict gene
structures.
- Explore the role of multi-evidence approaches, including
transcriptomics, proteomics, and functional annotation.
- Introduce modern machine learning and large model-based methods for
gene annotation.
- Familiarize with annotation tools used in this workshop and their specific applications.
Introduction
- Genome annotation identifies functional elements in a genome,
including protein-coding genes, non-coding rnas, and regulatory
regions
- Annotation strategies vary between evidence-based, ab initio, and
hybrid approaches, each with strengths and limitations
- Selecting the right method depends on genome complexity, available transcriptomic and proteomic data, and the need for accuracy versus sensitivity
Annotation Methods
Intrinsic (Ab Initio) Methods
- Predict genes based on sequence features without external
evidence.
- Use statistical models such as hidden Markov models (HMMs) to
recognize coding regions, exon-intron boundaries, and gene
structures.
- Rely on nucleotide composition patterns like codon usage, GC
content, and splice site signals.
- Perform well in well-studied genomes but struggle with novel genes and complex gene structures.
Extrinsic (Evidence-Based) Methods
- Utilize external data such as RNA-seq, ESTs, and protein homology to
guide gene predictions.
- Align transcriptomic and proteomic data to the genome to refine
exon-intron structures.
- Improve annotation accuracy by incorporating evolutionary
conservation and known gene models.
- Require high-quality reference datasets, making them less effective in newly sequenced organisms with little prior data.
Hybrid Methods
- Combine ab initio predictions with evidence-based approaches to
improve accuracy and completeness.
- Use gene predictions from statistical models and refine them with
transcriptomic and proteomic data.
- Balance sensitivity (detecting novel genes) and specificity
(avoiding false positives).
- Often used in large-scale genome annotation projects where both approaches complement each other.
Multi-Evidence based methods
- Incorporate diverse biological data sources to improve gene
prediction accuracy.
- Use ribosome profiling (Ribo-seq) and
proteomics to confirm protein-coding potential.
- Leverage poly(A) signal presence to distinguish
mature transcripts from non-coding RNA.
- Integrate RNA-seq, long-read sequencing, and EST
data to refine exon-intron structures.
- Provide direct experimental validation, reducing reliance on purely computational models.
Large Model-Based Methods
- Utilize deep learning and large language models (LMMs) trained on
vast genomic datasets.
- Predict gene structures by recognizing sequence patterns learned
from diverse organisms.
- Adapt to various genome complexities without requiring manually
curated training data.
- Identify non-canonical gene structures, alternative splicing events,
and novel genes.
- Example: Helixer, a GPU-based model, predicts genes using high-dimensional sequence patterns, improving over traditional HMM-based methods.
What do we annotate in the genome?
-
Protein-coding genes: Genes that encode proteins,
identified based on open reading frames (ORFs) and coding sequence
features.
-
Pseudogenes: Non-functional remnants of once-active
genes that have accumulated mutations.
-
Non-coding RNAs (ncRNAs): Functional RNAs that do
not code for proteins, including tRNAs, rRNAs, and lncRNAs.
-
Upstream Open Reading Frames (uORFs): Small ORFs in
the 5’ untranslated region (UTR) that can regulate translation.
-
Alternative Splicing Variants: Different transcript
isoforms from the same gene, generated by exon-intron
rearrangements.
-
Repeat Elements and Transposons: Mobile DNA
sequences that can impact gene function and genome structure.
- Regulatory Elements: Promoters, enhancers, and other sequences controlling gene expression.
For this workshop, we focus only on protein-coding genes and their annotation.
Annotation Workflows Used
1. BRAKER3: Combines GeneMark-ET and AUGUSTUS to predict genes in eukaryotic genomes.
- Integrates GeneMark and AUGUSTUS with RNA-Seq and/or protein evidence.
- Supports various data types or runs without them.
- Fully automated training and gene prediction.
- Requires evidence data for high accuracy; performs poorly without it.
- Struggles with scalability on larger genomes.
2. Helixer: A deep learning-based gene prediction tool that uses a convolutional neural network (CNN) to predict genes in eukaryotic genomes.
- Utilizes a deep learning model trained on diverse animal and plant genomes, reducing the need for retraining on closely related species.
- Provides probabilities for every base, enabling detailed predictions for intergenic, untranslated, coding, and intronic regions.
- Outperforms traditional tools like AUGUSTUS in both base-wise metrics and RNA-Seq consistency.
- Handles large and complex eukaryotic genomes effectively, even for species with longer genes.
- Requires no additional data except genome – no repeatmasking!
- Cannot predict alternative splicing events or spliced start codons.
3. EASEL: Efficient, Accurate, Scalable Eukaryotic modeLs: eukaryotic genome annotation tool
- Combines machine learning, RNA folding, and functional annotations to enhance prediction accuracy.
- Leverages RNA-Seq, protein alignments, and additional evidence for gene prediction.
- Built on Nextflow, ensuring portability across HPC systems and compatibility with Docker and Singularity.
- Features EnTAP integration for streamlined functional annotation.
- Provides multiple formats (GFF, GTF, CDS, protein sequences) for downstream analysis.
- Employs BUSCO for quality assessment and automates parameter optimization.
4. EnTAP: eukaryotic non-model annotation pipeline (functional annotation)
- Optimized for functional annotation of transcriptomes without reference genomes, addressing challenges like fragmentation and assembly artifacts.
- Much faster than traditional annotation pipelines.
- Integrates taxonomic filtering, contaminant detection, and informativeness metrics for selecting optimal alignments.
- Enables annotation across diverse repositories (e.g., RefSeq, Swiss-Prot) with customizable database integration.
- Includes gene family assignments, protein domains, and GO terms for enriched annotation.
- User-friendly, Unix-based pipeline suitable for HPC environments, combining simplicity with high throughput.
How to we access gene predictions?
-
Busco and Omark assess gene set completeness by
checking the presence of conserved orthologs across species, helping
identify missing or fragmented genes
-
Gff3 metrics provide structural statistics on gene
models, including exon-intron distribution, gene lengths, and coding
sequence properties, highlighting inconsistencies in annotation
-
Feature assignment using RNA-seq read mapping
quantifies how well predicted genes capture transcriptomic evidence,
indicating potential missing or incorrect annotations
-
Reference comparison with known annotations
measures sensitivity and precision, identifying correctly predicted,
missed, or falsely annotated genes
- Multiple validation methods combine structural, functional, and comparative assessments to improve confidence in gene predictions and refine annotation quality
Key Points
- Genome annotation strategies include ab initio,
evidence-based, hybrid, multi-evidence, and large model-based
methods, each with different strengths and data
requirements.
-
Ab initio methods predict genes using sequence
patterns, while evidence-based methods use RNA-seq,
proteins, and evolutionary conservation for refinement.
-
Multi-evidence approaches integrate diverse
biological data such as ribosome profiling, proteomics, and long-read
RNA-seq to improve accuracy.
-
Large model-based methods, like Helixer, use deep
learning to predict genes without requiring manual training,
outperforming traditional tools in complex genomes.
- The workshop focuses on protein-coding gene annotation using tools like BRAKER, Helixer, EASEL, and EnTAP, followed by assessment with BUSCO, OMARK, and structural metrics.
Content from Annotation Setup
Last updated on 2025-04-22 | Edit this page
Overview
Questions
- How should files and directories be structured for a genome
annotation workflow?
- Why is RNA-seq read mapping important for gene prediction?
- How does repeat masking improve annotation accuracy?
- What preprocessing steps are necessary before running gene prediction tools?
Objectives
- Set up a structured workspace for genome annotation
- Map RNA-seq reads to the genome to provide evidence for gene
prediction
- Perform repeat masking to prevent false gene annotations
- Prepare genome and transcriptomic data for downstream annotation analyses
Setup
This section outlines the preprocessing steps required before gene annotation, including organizing directories, mapping RNA-Seq reads, and masking repetitive elements. These steps have already been completed to save time, but the instructions are provided for reference in case you need to run them on custom datasets.
Optional Section!
Feel free to skip this section as it is already performed to save time. The steps below are provided for reference - to run on custom data.
Folder structure
BASH
mkdir -p ${RCAC_SCRATCH}/annotation_workshop
cd ${RCAC_SCRATCH}/annotation_workshop
mkdir -p 00_datasets/{fastq,genome,bamfiles}
mkdir -p 03_setup
mkdir -p 04_braker
mkdir -p 05_helixer
mkdir -p 06_easel
mkdir -p 07_entap
mkdir -p 08_assessment
rsync -avP SOURCE/00_datasets/ 00_datasets/
RNA-Seq mapping
The RNA-seq dataset used for gene prediction comes from the accession E-MTAB-6422, that provides a gene expression atlas of Arabidopsis thaliana (Columbia accession), capturing transcriptional programs across 36 libraries representing different tissues (seedling, root, inflorescence, flower, silique, seed) and developmental stages (2-leaf, 6-leaf, 12-leaf, senescence, dry mature, and imbibed seed), with three biological replicates per condition.
Step 1: Download the reference genome and annotation files
BASH
# Download the reference genome
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
mkdir -p ${WORKSHOP_DIR}/00_datasets/genome
cd ${WORKSHOP_DIR}/00_datasets/genome
wget https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-60/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
gunzip Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
ml --force purge
ml biocontainers
ml bioawk
bioaw -c fastx '{print ">"$name; print $seq}' \
Arabidopsis_thaliana.TAIR10.dna.toplevel.fa |\
fold > athaliana.fasta
rm Arabidopsis_thaliana.TAIR10.dna.toplevel.fa
Step 2: Index the reference genome using STAR
BASH
#!/bin/bash
ml --force purge
ml biocontainers
ml star
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana.fasta"
index="${WORKSHOP_DIR}/00_datasets/genome/star_index"
STAR --runThreadN ${SLURM_JOB_CPUS_PER_NODE} \
--runMode genomeGenerate \
--genomeDir ${index} \
--genomeSAindexNbases 12 \
--genomeFastaFiles ${genome}
Options
Option | Description |
---|---|
--runThreadN $SLURM_JOB_CPUS_PER_NODE |
Specifies the number of CPU threads to use. In an HPC environment,
$SLURM_JOB_CPUS_PER_NODE dynamically assigns available CPUs
per node. |
--runMode genomeGenerate |
Runs STAR in genome index generation mode, required before alignment. |
--genomeDir ${index} |
Path to the directory where the genome index will be stored. |
--genomeSAindexNbases 12 |
Defines the length of the SA pre-indexing string. A recommended
value is min(14, log2(GenomeLength)/2 - 1) , but
12 is commonly used for smaller genomes. |
--genomeFastaFiles ${genome} |
Path to the FASTA file containing the reference genome sequences. |
Step 3: Map the RNA-Seq reads to the reference genome
We need to setup a script star-mapping.sh
to map the
RNA-Seq reads to the reference genome.
BASH
#!/bin/bash
ml --force purge
ml biocontainers
ml star
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana.fasta"
index="${WORKSHOP_DIR}/00_datasets/genome/star_index"
read1=$1
read2=$2
cpus=${SLURM_JOB_CPUS_PER_NODE}
outname=$(basename ${read1} | cut -f 1 -d "_")
STAR \
--runThreadN ${cpus} \
--genomeDir ${index} \
--outSAMtype BAM SortedByCoordinate \
--outFilterScoreMinOverLread 0.3 \
--outFilterMatchNminOverLread 0.3 \
--outFileNamePrefix ${outname}_ \
--readFilesCommand zcat \
--outWigType bedGraph \
--outWigStrand Unstranded \
--outWigNorm RPM \
--readFilesIn ${read1} ${read2}
Options
Option | Description |
---|---|
--runThreadN ${cpus} |
Specifies the number of CPU threads to use for alignment. |
--genomeDir ${index} |
Path to the pre-generated STAR genome index directory. |
--outSAMtype BAM SortedByCoordinate |
Outputs aligned reads in BAM format, sorted by genomic coordinates. |
--outFilterScoreMinOverLread 0.3 |
Sets the minimum alignment score relative to read length (30% of the max possible score). |
--outFilterMatchNminOverLread 0.3 |
Sets the minimum number of matching bases relative to read length (30% of the read must match). |
--outFileNamePrefix ${outname}_ |
Prefix for all output files generated by STAR. |
--readFilesCommand zcat |
Specifies a command (zcat ) to decompress input FASTQ
files if they are gzipped. |
--outWigType bedGraph |
Outputs signal coverage in bedGraph format for visualization. |
--outWigStrand Unstranded |
Specifies that the RNA-seq library is unstranded (reads are not strand-specific). |
--outWigNorm RPM |
Normalizes coverage signal to Reads Per Million (RPM). |
--readFilesIn ${read1} ${read2} |
Specifies input paired-end FASTQ files (Read 1 and Read 2). |
Step 4: Run the STAR mapping script
BASH
#!/bin/bash
# Define the directory containing FASTQ files
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
FASTQ_DIR="WORKSHOP_DIR/00_datasets/fastq"
# Loop through all R1 files and find corresponding R2 files
for R1 in ${FASTQ_DIR}/*_R1.fq.gz; do
R2="${R1/_R1.fq.gz/_R2.fq.gz}" # Replace R1 with R2
if [[ -f "$R2" ]]; then # Check if R2 exists
echo "Running STAR mapping for: $R1 and $R2"
./star-mapping.sh "$R1" "$R2"
else
echo "Error: Matching R2 file for $R1 not found!" >&2
fi
done
Step 5: store BAM files
RepeatMasker
Some gene prediction tools require repeat-masked genomes to avoid mis-annotation of transposable elements as genes. We will use RepeatMasker to mask the repeats in the genome. But first, we need to download the RepeatMasker database.
BASH
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
cd ${WORKSHOP_DIR}/00_datasets/genome
# Download the RepeatMasker database
wget https://www.girinst.org/server/RepBase/protected/RepBase30.01.fasta.tar.gz
tar xf RepBase30.01.fasta.tar.gz
replib=$(find $(pwd) -name "athrep.ref")
ml --force purge
ml biocontainers
ml repeatmasker
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana.fasta"
RepeatMasker \
-e ncbi \
-pa ${SLURM_CPUS_ON_NODE} \
-q \
-xsmall \
-lib ${replib} \
-nocut \
-gff \
${genome}
Once done, we will rename it to
athaliana.fasta.masked
Key Points
- Organizing files and directories ensures a reproducible and
efficient workflow
- Mapping RNA-seq reads to the genome provides essential evidence for
gene prediction
- Repeat masking prevents transposable elements from being
mis-annotated as genes
- Preprocessing steps are crucial for accurate and high-quality genome annotation
Content from Annotation using BRAKER
Last updated on 2025-04-22 | Edit this page
Overview
Questions
- What is BRAKER3?
- What are the different scenarios in which BRAKER3 can be used?
- How to run BRAKER3 with different input requirements?
- What are the different output files generated by BRAKER3?
Objectives
- Understand the different scenarios in which BRAKER3 can be used
- Learn how to run BRAKER3 with different input requirements
- Learn how to interpret the output files generated by BRAKER3
BRAKER3 is a pipeline that combines GeneMark-ET and AUGUSTUS to predict genes in eukaryotic genomes. This pipeline is particularly useful for annotating newly sequenced genomes. The flexibility of BRAKER3 allows users to provide various input datasets for improving gene prediction accuracy. In this example, we will use various scenarios to predict genes in a Maize genome using BRAKER3. Following are the scenarios we will cover
Organizing the Data
Before running BRAKER3, we need to organize the data in the following format:
BASH
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
mkdir -p ${WORKSHOP_DIR}/04_braker/braker_case{1..5}
Setup
Before running BRAKER3, we need to set up:
-
GeneMark-ES/ET/EP/ETP
license key - The
AUGUSTUS_CONFIG_PATH
configuration path
The license key for GeneMark-ES/ET/EP/ETP
can be
obtained from the GeneMark
website. Once downloaded, you need to place it in your home
directory:
For the AUGUSTUS_CONFIG_PATH
, we need to copy the
config
directory from the Singularity container to the
scratch directory. This is required because BRAKER3 needs to write to
the config
directory, and the Singularity container is
read-only. To copy the config
directory, run the following
command:
Case 1: Using RNAseq Reads
To run BRAKER3 with RNAseq reads, we need to provide the RNAseq reads in BAM format. We will use the following command to run BRAKER3 with RNAseq reads:
BASH
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --account=workshop
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case1
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
BAM_DIR="${WORKSHOP_DIR}/00_datasets/bamfiles"
BAM_FILES=$(find "${BAM_DIR}" -name "*.bam" | tr '\n' ',')
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case1
cd ${workdir}
braker.pl \
--AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
--GENEMARK_PATH=${GENEMARK_PATH} \
--bam=${BAM_FILES} \
--genome=${genome} \
--species=At_$(date +"%Y%m%d").c1b \
--workingdir=${workdir} \
--gff3 \
--threads ${SLURM_CPUS_ON_NODE}
The job takes ~55 mins to finish. The results of the BRAKER3 run will
be stored in the respective braker_case1
directory, as
braker.gff3
file.
Case 2: Using Conserved proteins only
Using the orthodb-clades
tool, we can download protein sequences for a specific clade. In this
scenario, since we are using the Arabidopsis genome, we can download the
clades
specific Viridiplantae.fa
OrthoDB v12 protein sets.
BASH
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/04_braker/braker_case2
mkdir -p ${workdir}
git clone git@github.com:tomasbruna/orthodb-clades.git
cd orthodb-clades
ml --force purge
ml biocontainers
ml snakemake
snakemake --cores ${SLURM_CPUS_ON_NODE} selectViridiplantae
When this is done, you should see a folder named clade
with Viridiplantae.fa
in the orthodb-clades
directory. We will use this as one of the input datasets for
BRAKER3.
The following command will run BRAKER3 with the input genome and protein
sequences:
BASH
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case2
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
proteins="${WORKSHOP_DIR}/04_braker/braker_case2/orthodb-clades/clades/Viridiplantae.fa"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case2
mkdir -p ${workdir}
cd ${workdir}
braker.pl \
--AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
--GENEMARK_PATH=${GENEMARK_PATH} \
--genome=${genome} \
--prot_seq=${proteins} \
--species=At_$(date +"%Y%m%d").c2 \
--workingdir=${workdir} \
--gff3 \
--threads ${SLURM_CPUS_ON_NODE}
The results of the BRAKER3 run will be stored in the respective
braker_case2
directory, as braker.gff3
file.
The run time for this job is ~170 mins.
Case 3: Using RNAseq and Proteins Together
BASH
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case3
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
BAM_DIR="${WORKSHOP_DIR}/00_datasets/bamfiles"
BAM_FILES=$(find "${BAM_DIR}" -name "*.bam" | tr '\n' ',')
proteins="${WORKSHOP_DIR}/04_braker/braker_case2/orthodb-clades/clade/Viridiplantae.fa"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case3
mkdir -p ${workdir}
cd ${workdir}
braker.pl \
--AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
--GENEMARK_PATH=${GENEMARK_PATH} \
--genome=${genome} \
--prot_seq=${proteins} \
--bam=${BAM_FILES} \
--species=At_$(date +"%Y%m%d").c3 \
--workingdir=${workdir} \
--gff3 \
--threads ${SLURM_CPUS_ON_NODE}
The results of the BRAKER3 run will be stored in the respective
braker_case3
directory, as braker.gff3
file.
The run time for this job is ~120 mins.
Case 4: ab Initio mode
BASH
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case4
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case4
mkdir -p ${workdir}
cd ${workdir}
braker.pl \
--AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
--GENEMARK_PATH=${GENEMARK_PATH} \
--esmode \
--genome=${genome} \
--species=Zm_$(date +"%Y%m%d").c4 \
--workingdir=${workdir} \
--gff3 \
--threads ${SLURM_CPUS_ON_NODE}
The results of the BRAKER3 run will be stored in the respective
braker_case4
directory, as braker.gff3
file.
The run time for this job is ~60 mins.
Case 5: Using Pre-trained Model
BASH
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case5
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case5
mkdir -p ${workdir}
cd ${workdir}
braker.pl \
--AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
--GENEMARK_PATH=${GENEMARK_PATH} \
--skipAllTraining \
--genome=${genome} \
--species=arabidopsis \
--workingdir=${workdir} \
--gff3 \
--threads ${SLURM_CPUS_ON_NODE}
The results of the BRAKER3 run will be stored in the respective
braker_case5/Augustus
directory. The main result file
augustus.ab_initio.gff3
will be used for downstream
analysis. Runtime for this job is ~7 mins.
Results and Outputs
coming soon!
Key Points
- BRAKER3 is a pipeline that combines GeneMark-ET and AUGUSTUS to predict genes in eukaryotic genomes
- BRAKER3 can be used with different input datasets to improve gene prediction accuracy
- BRAKER3 can be run in different scenarios to predict genes in a genome
Content from Annotation using Helixer
Last updated on 2025-04-22 | Edit this page
Overview
Questions
- How to predict genes using Helixer?
- How to download trained models for Helixer?
- How to run Helixer on the HPC cluster (Gilbreth)?
Objectives
- Learn how to predict genes using Helixer.
- Learn how to download trained models for Helixer.
- Learn how to run Helixer on the HPC cluster (Gilbreth).
Helixer is a deep learning-based gene prediction tool that uses a convolutional neural network (CNN) to predict genes in eukaryotic genomes. Helixer is trained on a wide range of eukaryotic genomes and can predict genes in both plant and animal genomes. Helixer can predict genes wihtout any extrinisic information such as RNA-seq data or homology information, purely based on the sequence of the genome.
This section should be run on Gilbreth HPC cluster.
Due to the GPU requirement for Helixer, you need to run this section
on the Gilbreth HPC cluster. You don’t have to copy fastq, or bamfiles,
but only need athaliana.fasta
file in the
00_datasets/genome
directory.
BASH
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
mkdir -p ${WORKSHOP_DIR}/05_helixer
cp <depot_location>/genome/athaliana.fasta ${WORKSHOP_DIR}/05_helixer/athaliana.fasta
Setup
Helixer is available as a Singularity/apptainer container. You can
pull the container using the following apptainer pull
command. See the Helixer Docker
repository for more information.
Helixer is installed as a module on the Gilbreth cluster, and can be loaded using the following commands:
Downloading trained models
Helixer requires a trained model to predict genes. With the included
script fetch_helixer_models.py
you can download models for
specific lineages. Currently, models are available for the following
lineages:
land_plant
vertebrate
invertibrate
fungi
You can download the models using the following command:
BASH
# all models
# fetch_helixer_models.py --all
# or for a specific lineage
fetch_helixer_models.py --lineage land_plant
This will download all lineage models in the models directory. You
can also download models for specific lineages using the
--lineage
option as shown above.
By default, files will be downloaded to
~/.local/share/Helixer
directory. You should see the
follwing files:
.
└── models
└── land_plant
├── land_plant_v0.3_a_0080.h5
├── land_plant_v0.3_a_0090.h5
├── land_plant_v0.3_a_0100.h5
├── land_plant_v0.3_a_0200.h5
├── land_plant_v0.3_a_0300.h5
├── land_plant_v0.3_a_0400.h5
├── land_plant_v0.3_m_0100.h5
└── land_plant_v0.3_m_0200.h5
land_plant_v0.3_a_0080.h5
is the smallest model and
land_plant_v0.3_m_0200.h5
is the largest model. The model
size is determined by the number of parameters in the model. The larger
models are more accurate but require more memory and time to run.
Running Helixer
Helixer requires GPU for prediction. For running Helixer, you need to
request a GPU node. You will also need the genome sequence in fasta
format. For this tutorial, we will use Maize genome (Zea mays subsp.
mays), and use the land_plant
model to predict genes.
BASH
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --gpus-per-node=1
#SBATCH --time=04:00:00
#SBATCH --account=standby
#SBATCH --job-name=helixer
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
ml --force purge
ml biocontainers
ml helixer
genome=athaliana.fasta
species="Arabidopsis thaliana"
output="athaliana_helixer.gff"
Helixer.py \
--lineage land_plant \
--fasta-path ${genome} \
--species ${species} \
--gff-output-path ${output}
A typical job will take ~40 mins to finish depedning on the GPU the job gets allocated.
You can count the nubmer of predictions in your gff3 file using the following command:
The GFF format output had 27,201 genes predicted using Helixer. You can view the various features in the gff file using the following command:
To get cds
and pep
files, you can use the
following command:
BASH
ml --force purge
ml biocontainers
ml cufflinks
gffread Arabidopsis_thaliana.gff3 \
-g athaliana.fasta \
-y helixer_pep.fa \
-x helixer_cds.fa
As you may have noticed, the number of mRNA and gene features are the same. This is because isoforms aren’t predicted by Helixer and you only have one transcript per gene. Exons are indetified with high confidence and alternative isoforms are usually collapsed into a single gene model. This is one of the known limitations of Helixer.
Key Points
- Helixer is a deep learning-based gene prediction tool that uses a convolutional neural network (CNN) to predict genes in eukaryotic genomes.
- Helixer can predict genes wihtout any extrinisic information such as RNA-seq data or homology information, purely based on the sequence of the genome.
- Helixer requires a trained model and GPU for prediction.
- Helixer predicts genes in the GFF3 format, but will not predict isoforms.
Content from Annotation using Easel
Last updated on 2025-04-22 | Edit this page
Overview
Questions
- What are the steps required to set up and run EASEL on an HPC system?
- How is Nextflow used to manage and execute the EASEL workflow?
- What configuration files need to be modified before running EASEL?
- How do you submit and monitor an EASEL job using Slurm?
Objectives
- Set up and configure EASEL for execution on an HPC system.
- Install Nextflow and pull the EASEL workflow from the repository.
- Modify the necessary configuration files to match the HPC environment.
- Submit and run EASEL as a Slurm job and verify successful execution.
EASEL (Efficient, Accurate, Scalable Eukaryotic modeLs) is a genome annotation tool that integrates machine learning, RNA folding, and functional annotations to improve gene prediction accuracy, leveraging optimized AUGUSTUS parameters, transcriptome and protein evidence, and a Nextflow-based scalable pipeline for reproducible analysis.
Setup
We will do a custom installation of nextflow (we will not use
ml nextflow
):
check the installation:
We can organize our folder structure as follows:
Running Easel
Step 1: create a rcac.config
file to
run the jobs using slurm
scheduler.
BASH
params {
config_profile_description = 'Negishi HPC, RCAC Purdue University.'
config_profile_contact = 'Arun Seetharam (aseethar@purdue.edu)'
config_profile_url = "https://www.rcac.purdue.edu/knowledge/negishi"
partition = 'testpbs'
project = 'your_project_name' // Replace with a valid Slurm project
max_memory = '256GB'
max_cpus = 128
max_time = '48h'
}
singularity {
enabled = true
autoMounts = true
singularity.cacheDir = "${System.getenv('RCAC_SCRATCH')}/.singularity"
singularity.pullTimeout = "1h"
}
process {
resourceLimits = [
memory: 256.GB,
cpus: 128,
time: 48.h
]
executor = 'slurm'
clusterOptions = "-A ${params.project} -p ${params.partition}"
}
// Executor-specific configurations
if (params.executor == 'slurm') {
process {
cpus = { check_max( 16 * task.attempt, 'cpus' ) }
memory = { check_max( 32.GB * task.attempt, 'memory' ) }
time = { check_max( 8.h * task.attempt, 'time' ) }
clusterOptions = "-A ${params.project} -p ${params.partition}"
errorStrategy = { task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish' }
maxRetries = 1
maxErrors = '-1'
withLabel:process_single {
cpus = { check_max( 16 , 'cpus' ) }
memory = { check_max( 32.GB * task.attempt, 'memory' ) }
time = { check_max( 8.h * task.attempt, 'time' ) }
}
withLabel:process_low {
cpus = { check_max( 32 * task.attempt, 'cpus' ) }
memory = { check_max( 64.GB * task.attempt, 'memory' ) }
time = { check_max( 24.h * task.attempt, 'time' ) }
}
withLabel:process_medium {
cpus = { check_max( 64 * task.attempt, 'cpus' ) }
memory = { check_max( 128.GB * task.attempt, 'memory' ) }
time = { check_max( 24.h * task.attempt, 'time' ) }
}
withLabel:process_high {
cpus = { check_max( 128 * task.attempt, 'cpus' ) }
memory = { check_max( 256.GB * task.attempt, 'memory' ) }
time = { check_max( 24.h * task.attempt, 'time' ) }
}
withLabel:process_long {
time = { check_max( 48.h * task.attempt, 'time' ) }
}
withLabel:process_high_memory {
memory = { check_max( 256.GB * task.attempt, 'memory' ) }
}
withLabel:error_ignore {
errorStrategy = 'ignore'
}
withLabel:error_retry {
errorStrategy = 'retry'
maxRetries = 2
}
}
}
Step 2: pull the Easel nextflow workflow:
BASH
ml --force purge
ml biocontainers
ml nextflow
ml openjdk
nextflow pull -hub gitlab PlantGenomicsLab/easel
This will pull the latest version of the Easel workflow from the PlantGenomicsLab GitLab repository.
Step 3: modify the
~/.nextflow/assets/PlantGenomicsLab/easel/nextflow.config
file to include the rcac.config
file:
or if you prefer nano
:
Add the following line to the nextflow.config
file:
BASH
profiles {
debug { process.beforeScript = 'echo $HOSTNAME' }
...
...
test { includeConfig 'conf/test.config' }
rcac { includeConfig 'conf/rcac.config' }
}
Note that this is in line # 235 in the nextflow.config
file. and all the lines between debug
and test
aren’t shown here.
Now, we are ready to run the Easel workflow.
Step 4: create param.yaml
file to run
the Easel workflow:
BASH
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
mkdir -p ${WORKSHOP_DIR}/06_easel
cd ${WORKSHOP_DIR}/06_easel
for the param.yaml
file:
YAML
outdir: easel
genome: /scratch/negishi/aseethar/annotation_workshop/00_datasets/genome/athaliana_softmasked.fasta
bam: /scratch/negishi/aseethar/annotation_workshop/00_datasets/bamfiles/*.bam
busco_lineage: embryophyta
order: Brassicales
prefix: arabidopsis
taxon: arabidopsis
singularity_cache_dir: /scratch/negishi/aseethar/singularity_cache
training_set: plant
executor: slurm
account: testpbs
qos: normal
project: testpbs
Now we are ready to run the Easel workflow:
Our slurm
file to run the Easel workflow:
BASH
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=04-00:00:00
#SBATCH --job-name=easel
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml purge
ml biocontainers
ml openjdk
nextflow run \
-hub gitlab PlantGenomicsLab/easel \
-params-file params.yaml \
-profile rcac \
--project testpbs
This will run the Easel workflow on the RCAC HPC.
Results and Outputs
Key Points
- EASEL is executed using Nextflow, which simplifies workflow management and ensures reproducibility.
- Proper configuration of resource settings and HPC parameters is essential for successful job execution.
- Running EASEL requires setting up input files, modifying configuration files, and submitting jobs via Slurm.
- Understanding how to monitor and troubleshoot jobs helps ensure efficient pipeline execution.
Content from Functional annotation using EnTAP
Last updated on 2025-04-22 | Edit this page
Overview
Questions
- What is EnTAP, and how does it improve functional annotation for non-model transcriptomes?
- How does EnTAP filter, annotate, and assign functional roles to predicted transcripts?
- What databases and evidence sources does EnTAP integrate for annotation?
- What are the key steps required to set up and execute EnTAP on an HPC system?
Objectives
- Understand how EnTAP improves functional annotation for non-model eukaryotes.
- Learn how EnTAP processes transcript data through filtering, alignment, and functional assignment.
- Set up and modify EnTAP configuration files for correct execution.
- Run EnTAP on an HPC system and interpret the generated annotations.
EnTAP Overview
EnTAP is a bioinformatics pipeline designed to enhance the accuracy, speed, and flexibility of functional annotation for de novo assembled transcriptomes in non-model eukaryotic organisms. It mitigates assembly fragmentation, improves annotation rates, and provides extensive functional insights into transcriptomes. You can provide predicted transcripts in FASTA format and a GFF file containing gene models to run EnTAP.
Key Features
- Optimized for non-model eukaryotes – Overcomes challenges of fragmented transcriptome assemblies.
- Fast & efficient – Runs significantly faster than comparable annotation tools.
- Customizable – Supports optional filtering and analysis steps for user-specific needs.
- Comprehensive functional insights – Integrates multiple annotation sources for high-confidence gene assignments.
- Contaminant detection – Helps remove misleading sequences for cleaner datasets.
How EnTAP Works
-
Transcriptome Filtering – Identifies true coding
sequences (CDS) and removes assembly artifacts:
- Expression Filtering (optional) – Filters transcripts based on gene expression levels using RSEM.
- Frame Selection (optional) – Further refines CDS predictions using TransDecoder.
-
Transcriptome Annotation – Assigns functional
information to sequences:
- Similarity Search – Rapid alignment against user-selected databases using DIAMOND.
- Contaminant Filtering & Best Hit Selection – Identifies optimal annotations and flags potential contaminants.
-
Orthologous Group Assignment – Assigns translated
proteins to gene families using eggNOG/eggnog-mapper,
including:
- Protein Domains (SMART/Pfam)
- Gene Ontology (GO) Terms
- KEGG Pathway Annotation
- InterProScan (optional) – Searches InterPro databases for additional domain, GO, and pathway annotations.
- Horizontal Gene Transfer Analysis (optional) – Detects potential horizontal gene transfer (HGT) events via DIAMOND.
Running EnTAP
EnTAP is available as a module on the HPC cluster. You can load the module using the following commands:
First-Time Setup
When running for the first time, you will have to set up the databases for EnTAP. This includes downloading files from various databases, and can be time consuming. This section is already performed so you can skip this step and is included for reference.
entap_run.params
file should be setup as follows (be
sure to select the correct databases
for your
organism):
INI
out-dir=entap_dbfiles
overwrite=false
resume=false
input=
database=uniprot_sprot,refseq_plant
no-trim=false
threads=1
output-format=1,3,4,7,
fpkm=0.5
align=
single-end=false
frame-selection=false
transdecoder-m=100
transdecoder-no-refine-starts=false
taxon=
qcoverage=50
tcoverage=50
contam=
e-value=1e-05
uninformative=conserved,predicted,unknown,unnamed,hypothetical,putative,unidentified,uncharacterized,uncultured,uninformative,
diamond-sensitivity=very-sensitive
ontology_source=0,
eggnog-contaminant=true
eggnog-dbmem=true
eggnog-sensitivity=more-sensitive
interproscan-db=
hgt-donor=
hgt-recipient=
hgt-gff=
ncbi-api-key=
ncbi-api-enable=true
entap_config.ini
file should be setup as follows (be
sure to modify the paths <custom_location>
to your
desired location):
INI
data-generate=false
data-type=0,
entap-db-bin=<custom_location>/entap_db/bin/entap_database.bin
entap-db-sql=entap_database.db
entap-graph=entap_graphing.py
rsem-calculate-expression=rsem-calculate-expression
rsem-sam-validator=rsem-sam-validator
rsem-prepare-reference=rsem-prepare-reference
convert-sam-for-rsem=convert-sam-for-rsem
transdecoder-long-exe=TransDecoder.LongOrfs
transdecoder-predict-exe=TransDecoder.Predict
diamond-exe=diamond
eggnog-map-exe=emapper.py
eggnog-map-data=<custom_location>/entap_db/databases
eggnog-map-dmnd=<custom_location>/entap_db/bin/eggnog_proteins.dmnd
interproscan-exe=interproscan.sh
Once done, for the first time setup, you can run the following command:
BASH
ml --force purge
ml biocontainers
ml entap
EnTAP \
--config \
--run-ini ./entap_run.params \
--entap-ini ./entap_config.ini \
--threads ${SLURM_CPUS_ON_NODE}
This will download the databases and set up the configuration files for EnTAP.
Step 1: Prepare files
Your input files should be in the following format:
- Transcript FASTA file – Contains predicted transcripts in FASTA format.
-
Configuration file – Specifies parameters for EnTAP
execution.
-
entap_run.params
– Contains runtime parameters for EnTAP. -
entap_config.ini
– Specifies paths to EnTAP binaries and databases. (you can copy the files from/depot/itap/datasets/entap_db/entap_{config.ini,run.params}
)
-
Edit the entap_run.params
file to specify the output
directory for EnTAP results and the correct input file
Step 2: Run EnTAP
Run EnTAP using the following command:
Interpreting Results
EnTAP generates several output files, but the key results will be in
the entap_out/final_results
directory.
EnTAP Output Files Summary
File/Directory | Description |
---|---|
Final Results
(final_results/ ) |
|
annotated.fnn |
FASTA file of annotated transcripts. |
annotated.tsv |
Tab-separated file with functional annotations. |
annotated_gene_ontology_terms.tsv |
GO terms assigned to annotated transcripts. |
entap_results.tsv |
Master summary of all results, including annotations. |
unannotated.fnn |
FASTA file of unannotated transcripts. |
unannotated.tsv |
List of transcripts that failed annotation. |
gene_family/ |
Stores eggNOG gene family assignments, including orthologs and functional annotations. |
similarity_search/ |
Contains results from DIAMOND BLASTX searches against selected databases. |
transcriptomes/ |
Holds the input transcriptome (CDS) and the processed version after filtering. |
Key Points
- EnTAP enhances functional annotation by integrating multiple evidence sources, including homology, protein domains, and gene ontology.
- Proper setup of configuration files and databases is essential for accurate and efficient EnTAP execution.
- Running EnTAP involves transcript filtering, similarity searches, and functional annotation through automated workflows.
- The pipeline provides extensive insights into transcript function, improving downstream biological interpretations.
Content from Annotation Assesment
Last updated on 2025-04-22 | Edit this page
Overview
Questions
- How to assess the quality of a genome annotation?
- What are the different tools available for assessing the quality of a genome annotation?
- How to compare the predicted annotation with the reference annotation?
- How to measure the number of raw reads assigned to the features predicted by the annotation?
Objectives
- Assess the quality of a genome annotation using different tools.
- Compare the predicted annotation with the reference annotation.
- Measure the number of raw reads assigned to the features predicted by the annotation.
Setup
The folder organization is as follows:
The annotation files (gff3
) are from the previous steps.
But are collected in this folder for convenience.
We will extract CDS and Protein sequences from the gff3
file before we begin the assesment.
BASH
ml --force purge
ml biocontainers
ml cufflinks
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
for i in *.gff3; do
base=$(basename ${i%.*})
gffread \
-g ${WORKSHOP_DIR}/00_datasets/genome/athaliana.fa \
-x ${base}_cds.fa \
-y ${base}_pep.fa $i
done
GFF3 metrics
We will use agat_sp_statistics.pl
to calculate the
statistics of the GFF3 file.
BUSCO assesment
We will use BUSCO to asses the quality of the annotation.
BASH
#!/bin/bash
ml --force purge
ml biocontainers
ml busco
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
for pep in *_pep.fa; do
base=$(basename ${pep%.*})
busco \
--in ${pep} \
--cpu ${SLURM_CPUS_ON_NODE} \
--out ${base}_busco \
--mode prot \
--lineage_dataset brassicales_odb10 \
--force
done
Once the BUSCO assesment is complete, we can view the results using the following command:
Omark assesment
OMArk is a software for proteome (protein-coding gene repertoire) quality assessment. It provides measures of proteome completeness, characterizes the consistency of all protein coding genes with regard to their homologs, and identifies the presence of contamination from other species.
The proteomes should be filtered to take only primary isoforms before running OMark.
BASH
ml --force purge
ml biocontainers
ml seqkit
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
for pep in braker*_pep.fa; do
base=$(basename ${pep%.*})
grep "\.t1$" $pep | sed 's/>//g' > ${base}.primary.ids
seqkit grep -f ${base}.primary.ids $pep > ${base}.primary.pep.fa
rm ${base}.primary.ids
done
cp helixer_pep.fa helixer.primary.pep.fa
Running OMark:
BASH
#!/bin/bash
ml --force purge
ml biocontainers
ml omark
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
database="/depot/itap/datasets/omark/LUCA.h5"
cd ${workdir}
for pep in *.primary.pep.fa; do
base=$(basename ${pep%.*})
omamer search \
--db ${database} \
--query ${pep} \
--out ${pep%.*}.omamer \
--nthreads ${SLURM_CPUS_ON_NODE}
omark \
--file ${pep%.*}.omamer \
--database ${database} \
--outputFolder ${pep%.*} \
--og_fasta ${pep}
done
Conserved HOGs in Brassicaceae
Category | Count | Percentage |
---|---|---|
Total Conserved HOGs | 17,996 | 100% |
Single-Copy HOGs (S) | 16,237 | 90.23% |
Duplicated HOGs (D) | 1,315 | 7.31% |
Unexpected Duplications (U) | 351 | 1.95% |
Expected Duplications (E) | 964 | 5.36% |
Missing HOGs (M) | 444 | 2.47% |
Proteome Composition
Category | Count | Percentage |
---|---|---|
Total Proteins | 26,994 | 100% |
Consistent (A) | 25,480 | 94.39% |
Partial Hits (P) | 1,306 | 4.84% |
Fragmented (F) | 630 | 2.33% |
Inconsistent (I) | 167 | 0.62% |
Partial Hits (P) | 106 | 0.39% |
Fragmented (F) | 29 | 0.11% |
Likely Contamination (C) | 0 | 0.00% |
Partial Hits (P) | 0 | 0.00% |
Fragmented (F) | 0 | 0.00% |
Unknown (U) | 1,347 | 4.99% |
HOG Placement - Detected Species
Species | NCBI TaxID | Associated Proteins | % of Total Proteome |
---|---|---|---|
Arabidopsis thaliana | 3702 | 25,647 | 95.01% |
Feature assignment
To measure the number of raw reads assigned to the features predicted
by the annotation, we will use featureCounts
from the
subread
package.
BASH
#!/bin/bash
ml --force purge
ml biocontainers
ml subread
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
mkdir -p featureCounts
for gff3 in *.gff3; do
base=$(basename ${gff3%.*})
featureCounts \
-T ${SLURM_CPUS_ON_NODE} \
-a ${gff3} \
-t exon \
-g ID \
-p \
-B \
-o ${base}_merged_counts.txt \
--tmpDir /tmp ${RCAC_SCRATCH}/annotation_workshop/00_datasets/bamfiles/*.bam
done
Reference annotation comparison
We will use mikado compare
the reference annotation with
the predicted annotation.
BASH
#!/bin/bash
ml --force purge
ml biocontainers
ml mikado
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
for gff3 in *.gff3; do
base=$(basename ${gff3%.*})
mikado compare \
--protein-coding \
-r ${RCAC_SCRATCH}/annotation_workshop/00_datasets/genome/athaliana_TAIR10.gff3 \
-p ${gff3} \
-o ref-TAIR10_vs_prediction_${base}_compared \
--log ${base}_compare.log
done
Summary
Key Points
-
busco
andomark
assess how well conserved genes are represented in the predicted gene set
-
gff3
metrics provide structural insights and highlight discrepancies compared to known annotations
-
featureCounts
assignment quantifies the number of RNA-seq reads aligning to predicted features
- Reference annotation comparison evaluates how closely the predicted
genes match an established reference
- Multiple assessment methods ensure a comprehensive evaluation of annotation quality