Introduction to Genome Annotation


  • Genome annotation identifies functional elements in a genome, including genes, regulatory regions, and non-coding RNAs.
  • Structural annotation predicts gene structures, while functional annotation assigns biological roles to genes.
  • Challenges in annotation include incomplete assemblies, alternative splicing, pseudogenes, and data quality issues.
  • Data sources like RNA-seq, proteomics, comparative genomics, and epigenomics improve annotation accuracy.
  • This workshop focuses on eukaryotic gene annotation using computational tools and diverse datasets.

Annotation Strategies


  • Genome annotation strategies include ab initio, evidence-based, hybrid, multi-evidence, and large model-based methods, each with different strengths and data requirements.
  • Ab initio methods predict genes using sequence patterns, while evidence-based methods use RNA-seq, proteins, and evolutionary conservation for refinement.
  • Multi-evidence approaches integrate diverse biological data such as ribosome profiling, proteomics, and long-read RNA-seq to improve accuracy.
  • Large model-based methods, such as Helixer, use deep learning to predict genes without requiring users to train a model themselves, and can outperform traditional tools on complex genomes.
  • The workshop focuses on protein-coding gene annotation using tools like BRAKER, Helixer, EASEL, and EnTAP, followed by assessment with BUSCO, OMArk, and structural metrics.

Annotation Setup


  • Organizing files and directories ensures a reproducible and efficient workflow.
  • Mapping RNA-seq reads to the genome provides essential evidence for gene prediction.
  • Repeat masking prevents transposable elements from being mis-annotated as genes.
  • These preprocessing steps are crucial for accurate, high-quality genome annotation.
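
As a quick sanity check on the repeat-masking step, the fraction of soft-masked (lowercase) bases in a genome FASTA can be estimated. The following is a minimal sketch that assumes soft-masking is encoded as lowercase letters; the function name and the toy sequence are illustrative only.

```python
from io import StringIO


def softmasked_fraction(fasta_handle):
    """Return the fraction of bases written in lowercase (soft-masked)."""
    masked = 0
    total = 0
    for line in fasta_handle:
        if line.startswith(">"):
            continue  # skip sequence headers
        seq = line.strip()
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    return masked / total if total else 0.0


# Toy genome (made up for illustration): 12 bases, 4 of them lowercase.
toy_fasta = StringIO(">chr1\nACGTacgtACGT\n")
print(f"soft-masked: {softmasked_fraction(toy_fasta):.2%}")
```

For a real genome, pass an open file handle instead of the `StringIO` object. A value near zero would suggest the assembly has not been soft-masked.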

Annotation using BRAKER


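  • BRAKER3 is a pipeline that combines GeneMark (GeneMark-ET, GeneMark-EP+, or GeneMark-ETP, depending on the evidence supplied) and AUGUSTUS to predict genes in eukaryotic genomes.
  • BRAKER3 can be used with different input datasets, such as RNA-seq alignments and protein sequences, to improve gene prediction accuracy.
  • BRAKER3 can be run in different scenarios, chosen according to the evidence available for the genome.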
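
How the available evidence selects a run scenario can be sketched as a small command builder. This is an illustration rather than the workshop's own script: the helper name and file names are hypothetical, and the `braker.pl` options shown (`--genome`, `--species`, `--bam`, `--prot_seq`, `--esmode`) should be checked against `braker.pl --help` for the installed version.

```python
import shlex


def build_braker_command(genome, species, bam=None, proteins=None):
    """Assemble a braker.pl command line from the evidence that is available."""
    cmd = ["braker.pl", f"--genome={genome}", f"--species={species}"]
    if bam:
        cmd.append(f"--bam={bam}")  # RNA-seq alignments as evidence
    if proteins:
        cmd.append(f"--prot_seq={proteins}")  # protein sequences as evidence
    if not bam and not proteins:
        cmd.append("--esmode")  # genome-only run, no extrinsic evidence
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_braker_command(
    "genome.softmasked.fa", "my_species", bam="rnaseq.bam", proteins="proteins.fa"
)
print(shlex.join(cmd))
```

Building the command as a list keeps each scenario explicit and makes it easy to log exactly what was run.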

Annotation using Helixer


  • Helixer is a deep learning-based gene prediction tool that uses a neural network combining convolutional and recurrent layers to predict genes in eukaryotic genomes.
  • Helixer can predict genes without any extrinsic information, such as RNA-seq data or homology information, based purely on the genome sequence.
  • Helixer requires a trained model and GPU for prediction.
  • Helixer writes its predictions in GFF3 format but does not predict alternative isoforms.
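
The one-transcript-per-gene property can be checked directly on a GFF3 file. The sketch below counts `mRNA` or `transcript` features per parent gene, assuming standard nine-column GFF3 with `ID=` and `Parent=` attributes; the function name and the toy records are made up for illustration.

```python
from collections import Counter


def transcripts_per_gene(gff3_lines):
    """Count mRNA/transcript features per parent gene in GFF3 lines."""
    counts = Counter()
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] not in ("mRNA", "transcript"):
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        parent = attrs.get("Parent")
        if parent:
            counts[parent] += 1
    return counts


# Toy GFF3 records (illustrative, not real output).
toy_gff3 = [
    "##gff-version 3",
    "chr1\tHelixer\tgene\t1\t900\t.\t+\t.\tID=g1",
    "chr1\tHelixer\tmRNA\t1\t900\t.\t+\t.\tID=g1.1;Parent=g1",
    "chr1\tHelixer\tgene\t1500\t2400\t.\t-\t.\tID=g2",
    "chr1\tHelixer\tmRNA\t1500\t2400\t.\t-\t.\tID=g2.1;Parent=g2",
]
counts = transcripts_per_gene(toy_gff3)
multi_isoform = [gene for gene, n in counts.items() if n > 1]
print(len(counts), "genes with transcripts;", len(multi_isoform), "with more than one isoform")
```

The same function can be applied to output from other predictors to see how many genes carry multiple isoforms.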

Annotation using EASEL


  • EASEL is executed using Nextflow, which simplifies workflow management and ensures reproducibility.
  • Proper configuration of resource settings and HPC parameters is essential for successful job execution.
  • Running EASEL requires setting up input files, modifying configuration files, and submitting jobs via Slurm.
  • Understanding how to monitor and troubleshoot jobs helps ensure efficient pipeline execution.

Functional annotation using EnTAP


  • EnTAP enhances functional annotation by integrating multiple evidence sources, including homology, protein domains, and gene ontology.
  • Proper setup of configuration files and databases is essential for accurate and efficient EnTAP execution.
  • Running EnTAP involves transcript filtering, similarity searches, and functional annotation through automated workflows.
  • The pipeline provides extensive insights into transcript function, improving downstream biological interpretations.

Annotation Assessment


  • BUSCO and OMArk assess how well conserved genes are represented in the predicted gene set.
  • GFF3 metrics provide structural insights and highlight discrepancies compared to known annotations.
  • featureCounts assignment quantifies the number of RNA-seq reads aligning to predicted features.
  • Reference annotation comparison evaluates how closely the predicted genes match an established reference.
  • Multiple assessment methods ensure a comprehensive evaluation of annotation quality.
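
The featureCounts assignment rate mentioned above can be computed from the tab-separated `.summary` file that featureCounts writes alongside its counts. This is a minimal sketch that assumes a single BAM column; the function name and the numbers are invented for illustration.

```python
def assignment_rate(summary_lines):
    """Return the fraction of reads assigned to features (first sample column)."""
    counts = {}
    for line in summary_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or fields[0] == "Status":
            continue  # skip the header row and malformed lines
        counts[fields[0]] = int(fields[1])
    total = sum(counts.values())
    return counts.get("Assigned", 0) / total if total else 0.0


# Toy summary content (values made up for illustration).
toy_summary = [
    "Status\trnaseq.bam",
    "Assigned\t800",
    "Unassigned_NoFeatures\t150",
    "Unassigned_Ambiguity\t50",
]
print(f"assigned: {assignment_rate(toy_summary):.1%}")
```

Comparing this rate across annotations of the same genome, using the same alignments, gives a simple measure of how much of the expression evidence each gene set captures.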