Genome Annotation: Hands-on Training: Key Points

Introduction to Genome Annotation

Genome annotation identifies functional elements in a genome, including genes, regulatory regions, and non-coding RNAs.
Structural annotation predicts gene structures, while functional annotation assigns biological roles to genes.
Challenges in annotation include incomplete assemblies, alternative splicing, pseudogenes, and data quality issues.
Data sources like RNA-seq, proteomics, comparative genomics, and epigenomics improve annotation accuracy.
This workshop focuses on eukaryotic gene annotation using computational tools and diverse datasets.

Genome annotation strategies include ab initio, evidence-based, hybrid, multi-evidence, and large model-based methods, each with different strengths and data requirements.
Ab initio methods predict genes using sequence patterns, while evidence-based methods use RNA-seq, proteins, and evolutionary conservation for refinement.
Multi-evidence approaches integrate diverse biological data such as ribosome profiling, proteomics, and long-read RNA-seq to improve accuracy.
Large model-based methods, like Helixer, use deep learning with pretrained models to predict genes without species-specific training, and can outperform traditional tools in complex genomes.
The workshop focuses on protein-coding gene annotation using tools like BRAKER, Helixer, EASEL, and EnTAP, followed by assessment with BUSCO, OMArk, and structural metrics.

Organizing files and directories ensures a reproducible and efficient workflow.
Mapping RNA-seq reads to the genome provides essential evidence for gene prediction.
Repeat masking prevents transposable elements from being mis-annotated as genes.
Preprocessing steps are crucial for accurate and high-quality genome annotation.
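
As a quick sanity check after repeat masking, the fraction of masked bases can be counted directly from the FASTA file. The following is a minimal sketch, assuming the genome was soft-masked (repeats written in lowercase, the common convention); the function name and the toy sequence are hypothetical and only for illustration.

```python
from io import StringIO


def soft_masked_fraction(fasta_lines):
    """Return the fraction of bases written in lowercase (soft-masked)."""
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        # Skip blank lines and FASTA header lines.
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


# Toy input (hypothetical): 20 bases, 10 of them lowercase.
example = StringIO(">chr1\nACGTacgtAC\nGTacgtacGT\n")
fraction = soft_masked_fraction(example)
print(f"{fraction:.1%} soft-masked")
```

For a real genome, an open file handle can be passed in place of the `StringIO` object. A value far from the repeat content expected for the species suggests the masking step should be re-examined before gene prediction.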

BRAKER3 is a pipeline that combines GeneMark-ETP and AUGUSTUS to predict genes in eukaryotic genomes.
BRAKER3 can be used with different input datasets to improve gene prediction accuracy.
BRAKER3 can be run in different scenarios to predict genes in a genome.
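
The "different scenarios" above mostly come down to which evidence files are supplied. The sketch below only assembles a command line and does not run anything. It assumes the `--genome`, `--bam`, and `--prot_seq` options of `braker.pl`; the helper function and all file names are hypothetical, and the genome-only mode is deliberately left out.

```python
def braker_command(genome, bam=None, proteins=None):
    """Build a braker.pl argument list for the evidence that is available."""
    if bam is None and proteins is None:
        # Genome-only runs need a separate mode that this sketch does not cover.
        raise ValueError("supply RNA-seq alignments, proteins, or both")
    command = ["braker.pl", f"--genome={genome}"]
    if bam is not None:
        command.append(f"--bam={bam}")  # RNA-seq alignments
    if proteins is not None:
        command.append(f"--prot_seq={proteins}")  # protein sequences
    return command


# Hypothetical file names, used only to show the three evidence scenarios.
rnaseq_only = braker_command("genome.masked.fa", bam="rnaseq.bam")
protein_only = braker_command("genome.masked.fa", proteins="proteins.fa")
both = braker_command("genome.masked.fa", bam="rnaseq.bam", proteins="proteins.fa")
print(" ".join(both))
```

Keeping the command as a list makes it straightforward to pass to a job script or to `subprocess.run` later, once paths and resource settings for the cluster are known.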

Helixer is a deep learning-based gene prediction tool that uses a convolutional neural network (CNN) to predict genes in eukaryotic genomes.
Helixer can predict genes without any extrinsic information such as RNA-seq data or homology information, purely based on the sequence of the genome.
Helixer requires a trained model and GPU for prediction.
Helixer predicts genes in the GFF3 format, but will not predict isoforms.

EASEL is executed using Nextflow, which simplifies workflow management and ensures reproducibility.
Proper configuration of resource settings and HPC parameters is essential for successful job execution.
Running EASEL requires setting up input files, modifying configuration files, and submitting jobs via Slurm.
Understanding how to monitor and troubleshoot jobs helps ensure efficient pipeline execution.

EnTAP enhances functional annotation by integrating multiple evidence sources, including homology, protein domains, and gene ontology.
Proper setup of configuration files and databases is essential for accurate and efficient EnTAP execution.
Running EnTAP involves transcript filtering, similarity searches, and functional annotation through automated workflows.
The pipeline provides extensive insights into transcript function, improving downstream biological interpretations.

BUSCO and OMArk assess how well conserved genes are represented in the predicted gene set.
GFF3 metrics provide structural insights and highlight discrepancies compared to known annotations.
The featureCounts assignment rate quantifies the proportion of RNA-seq reads assigned to predicted features.
Reference annotation comparison evaluates how closely the predicted genes match an established reference.
Multiple assessment methods ensure a comprehensive evaluation of annotation quality.
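
To make the structural metrics concrete, the sketch below computes a few of them from GFF3 records: gene count, transcript count, mean exons per transcript, and the number of mono-exonic versus multi-exonic transcripts. It is a minimal illustration, not the workshop's own script. It assumes transcripts are typed `mRNA`, exons carry a `Parent` attribute, and the tiny in-memory annotation (built with the hypothetical `row` helper) stands in for a real file.

```python
def parse_attributes(column9):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    pairs = (item.split("=", 1) for item in column9.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}


def structural_metrics(gff3_lines):
    """Summarise gene, transcript, and exon counts from GFF3 lines."""
    genes = 0
    exon_counts = {}  # transcript ID -> number of exons
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # skip malformed records
        feature_type = fields[2]
        attrs = parse_attributes(fields[8])
        if feature_type == "gene":
            genes += 1
        elif feature_type == "mRNA":
            exon_counts.setdefault(attrs["ID"], 0)
        elif feature_type == "exon":
            # An exon may list several parents separated by commas.
            for parent in attrs.get("Parent", "").split(","):
                if parent:
                    exon_counts[parent] = exon_counts.get(parent, 0) + 1
    transcripts = len(exon_counts)
    total_exons = sum(exon_counts.values())
    return {
        "genes": genes,
        "transcripts": transcripts,
        "mean_exons_per_transcript": total_exons / transcripts if transcripts else 0.0,
        "mono_exonic": sum(1 for n in exon_counts.values() if n == 1),
        "multi_exonic": sum(1 for n in exon_counts.values() if n > 1),
    }


def row(feature_type, start, end, strand, attributes):
    """Hypothetical helper that builds one tab-separated GFF3 record."""
    return "\t".join(["chr1", "demo", feature_type, str(start), str(end), ".", strand, ".", attributes])


# Toy annotation: one two-exon gene and one single-exon gene.
toy_gff3 = [
    "##gff-version 3",
    row("gene", 1, 1000, "+", "ID=g1"),
    row("mRNA", 1, 1000, "+", "ID=g1.t1;Parent=g1"),
    row("exon", 1, 400, "+", "ID=g1.t1.e1;Parent=g1.t1"),
    row("exon", 600, 1000, "+", "ID=g1.t1.e2;Parent=g1.t1"),
    row("gene", 2000, 2500, "-", "ID=g2"),
    row("mRNA", 2000, 2500, "-", "ID=g2.t1;Parent=g2"),
    row("exon", 2000, 2500, "-", "ID=g2.t1.e1;Parent=g2.t1"),
]
metrics = structural_metrics(toy_gff3)
print(metrics)
```

Running the same function on a predicted annotation and on a reference annotation gives a side-by-side view of the structural differences. An unusually high share of mono-exonic transcripts, for example, is often worth a closer look, since it can point to fragmented or spurious gene models.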