Introduction to Genome Annotation
- Genome annotation identifies functional elements in a genome, including genes, regulatory regions, and non-coding RNAs.
- Structural annotation predicts gene structures, while functional annotation assigns biological roles to genes.
- Challenges in annotation include incomplete assemblies, alternative splicing, pseudogenes, and data quality issues.
- Data sources like RNA-seq, proteomics, comparative genomics, and epigenomics improve annotation accuracy.
- This workshop focuses on eukaryotic gene annotation using computational tools and diverse datasets.
Annotation Strategies
- Genome annotation strategies include ab initio,
evidence-based, hybrid, multi-evidence, and large model-based
methods, each with different strengths and data
requirements.
- Ab initio methods predict genes using sequence
patterns, while evidence-based methods use RNA-seq,
proteins, and evolutionary conservation for refinement.
- Multi-evidence approaches integrate diverse
biological data such as ribosome profiling, proteomics, and long-read
RNA-seq to improve accuracy.
- Large model-based methods, like Helixer, use deep
learning to predict genes without requiring manual training,
outperforming traditional tools in complex genomes.
- The workshop focuses on protein-coding gene annotation using tools like BRAKER, Helixer, EASEL, and EnTAP, followed by assessment with BUSCO, OMArk, and structural metrics.
Annotation Setup
- Organizing files and directories ensures a reproducible and
efficient workflow.
- Mapping RNA-seq reads to the genome provides essential evidence for
gene prediction.
- Repeat masking prevents transposable elements from being
mis-annotated as genes.
- Preprocessing steps are crucial for accurate and high-quality genome annotation.
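As a quick sanity check on the repeat-masking step, the sketch below estimates how much of a genome FASTA is masked. It assumes repeats were soft-masked (lowercase bases) or hard-masked (N); the function name `masked_fraction` and the toy sequence are illustrative and not part of the workshop materials.

```python
from io import StringIO


def masked_fraction(fasta_handle):
    """Return the fraction of bases that are lowercase (soft-masked) or N (hard-masked)."""
    masked = 0
    total = 0
    for line in fasta_handle:
        line = line.strip()
        # Skip blank lines and FASTA header lines.
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for base in line if base.islower() or base == "N")
    return masked / total if total else 0.0


# Toy sequence invented for illustration: 10 of 20 bases are masked.
example = StringIO(">chr1\nACGTacgtNNAC\nacgtACGT\n")
print(f"{masked_fraction(example):.2%}")  # prints 50.00%
```

For a real genome, open the masked FASTA file and pass the file handle instead of the `StringIO` object. A masked fraction near zero usually means the masking step did not run or its output was not used.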
Annotation using BRAKER
- BRAKER3 is a pipeline that combines the GeneMark family of gene finders (GeneMark-ETP when both RNA-seq and protein evidence are supplied) with AUGUSTUS to predict genes in eukaryotic genomes.
- BRAKER3 can be used with different input datasets to improve gene prediction accuracy.
- BRAKER3 can be run in different scenarios, depending on the evidence available (for example RNA-seq only, proteins only, or both), to predict genes in a genome.
Annotation using Helixer
- Helixer is a deep learning-based gene prediction tool that uses a convolutional neural network (CNN) to predict genes in eukaryotic genomes.
- Helixer can predict genes without any extrinsic information such as RNA-seq data or homology information, relying purely on the genome sequence.
- Helixer requires a trained model and GPU for prediction.
- Helixer writes its predictions in GFF3 format, but does not predict alternative isoforms.
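One way to confirm the "no isoforms" behaviour in a prediction file is to count mRNA features per gene. The sketch below assumes a standard GFF3 layout with `gene` and `mRNA` feature types linked through `ID` and `Parent` attributes; the function name and the example records are hypothetical.

```python
from collections import Counter


def transcripts_per_gene(gff3_lines):
    """Count mRNA features per gene ID in an iterable of GFF3 lines."""
    counts = Counter()
    for line in gff3_lines:
        # Skip comments, directives, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if cols[2] == "gene":
            counts.setdefault(attrs["ID"], 0)
        elif cols[2] == "mRNA":
            counts[attrs["Parent"]] += 1
    return counts


# Hypothetical records, built with explicit tabs for readability.
example = [
    "##gff-version 3",
    "\t".join(["chr1", "Helixer", "gene", "100", "900", ".", "+", ".", "ID=g1"]),
    "\t".join(["chr1", "Helixer", "mRNA", "100", "900", ".", "+", ".", "ID=g1.1;Parent=g1"]),
    "\t".join(["chr1", "Helixer", "gene", "1500", "2400", ".", "-", ".", "ID=g2"]),
    "\t".join(["chr1", "Helixer", "mRNA", "1500", "2400", ".", "-", ".", "ID=g2.1;Parent=g2"]),
]
counts = transcripts_per_gene(example)
multi = [gene for gene, n in counts.items() if n > 1]
print(len(counts), "genes;", len(multi), "with more than one mRNA")
```

The same function can be run on output from evidence-based tools, where genes with several mRNA children indicate predicted isoforms.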
Annotation using EASEL
- EASEL is executed using Nextflow, which simplifies workflow management and ensures reproducibility.
- Proper configuration of resource settings and HPC parameters is essential for successful job execution.
- Running EASEL requires setting up input files, modifying configuration files, and submitting jobs via Slurm.
- Understanding how to monitor and troubleshoot jobs helps ensure efficient pipeline execution.
Functional annotation using EnTAP
- EnTAP enhances functional annotation by integrating multiple evidence sources, including homology, protein domains, and gene ontology.
- Proper setup of configuration files and databases is essential for accurate and efficient EnTAP execution.
- Running EnTAP involves transcript filtering, similarity searches, and functional annotation through automated workflows.
- The pipeline provides extensive insights into transcript function, improving downstream biological interpretations.
Annotation Assessment
- BUSCO and OMArk assess how well conserved genes are represented in the predicted gene set.
- GFF3 metrics provide structural insights and highlight discrepancies compared to known annotations.
- featureCounts assignment quantifies the number of RNA-seq reads aligning to predicted features.
- Reference annotation comparison evaluates how closely the predicted
genes match an established reference.
- Multiple assessment methods ensure a comprehensive evaluation of annotation quality.
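To make the featureCounts point concrete, the sketch below computes the fraction of reads assigned to predicted features from a featureCounts `.summary` table. That table is tab-separated, with a `Status` column followed by one column per BAM file. The function name and the read counts are invented for illustration.

```python
def assignment_rate(summary_text):
    """Return {sample: fraction of reads assigned} from featureCounts summary text."""
    rows = [line.split("\t") for line in summary_text.strip().splitlines()]
    samples = rows[0][1:]  # header row: Status, then one column per BAM
    assigned = [0] * len(samples)
    total = [0] * len(samples)
    for status, *values in rows[1:]:
        for i, value in enumerate(values):
            total[i] += int(value)
            if status == "Assigned":
                assigned[i] += int(value)
    return {s: (a / t if t else 0.0) for s, a, t in zip(samples, assigned, total)}


# Toy summary with made-up counts (not real results).
toy_summary = "\n".join([
    "Status\tsample.bam",
    "Assigned\t800",
    "Unassigned_NoFeatures\t150",
    "Unassigned_Ambiguity\t50",
])
for sample, rate in assignment_rate(toy_summary).items():
    print(f"{sample}: {rate:.1%} of reads assigned")
```

Comparing this rate across annotations built from the same alignments gives a rough indication of which gene set captures more of the expressed sequence. It complements, rather than replaces, the BUSCO, OMArk, and reference-based checks.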