Reference
Last updated on 2025-04-22
Additional Reading
For those interested in further exploring genome annotation and bioinformatics, here are some recommended resources:
Navigating Eukaryotic Genome Annotation Pipelines: A Route Map to BRAKER, Galba, and TSEBRA - Hoff et al.
A practical guide to using automated genome annotation pipelines, focusing on eukaryotic genomes and providing insights into best practices for gene prediction.
Ten Steps to Get Started in Genome Assembly and Annotation - Keith Bradnam & Ian Korf
A step-by-step approach to genome assembly and annotation, discussing essential methodologies, tools, and quality assessment strategies.
Review on the Computational Genome Annotation of Sequences - Yandell & Ence
A comprehensive review discussing the principles and computational strategies involved in structural and functional genome annotation.
Glossary
annotation
The process of identifying and labeling functional elements within a
genome, including genes, regulatory elements, and other sequence
features.
gene prediction
The computational process of identifying coding and non-coding genes in
a genome.
structural annotation
The process of defining gene models, including exon-intron boundaries,
UTRs, and transcription start sites.
functional annotation
The assignment of biological meaning to predicted genes, typically by
comparing them to known sequences or functional databases.
homology-based annotation
A method of functional annotation that uses similarity to known genes in
related species to infer function.
ab initio gene prediction
A method of gene prediction that relies on statistical and
machine-learning models without requiring homology information.
evidence-based gene prediction
A method that integrates transcriptomic, proteomic, or homologous
sequence data to predict gene structures.
RNA-Seq
A sequencing technique used to capture the transcriptome, aiding in gene
prediction and annotation by providing evidence for expressed genes.
Iso-Seq
A long-read sequencing approach that captures full-length transcripts,
improving annotation accuracy for alternative splicing and isoforms.
Ribo-Seq
A technique that sequences ribosome-protected mRNA fragments to identify
actively translated regions, providing evidence for functional
protein-coding genes.
alternative splicing
A process where different mRNA isoforms are generated from a single
gene, leading to multiple protein products.
transcription start site (TSS)
The location in the genome where RNA polymerase begins transcribing a
gene into RNA.
coding sequence (CDS)
The region of a gene that is translated into protein.
untranslated region (UTR)
Non-coding regions at the ends of mRNA molecules that play roles in
stability and translation regulation.
intergenic region
The DNA sequences located between genes that may contain regulatory
elements.
transposable elements (TEs)
Mobile genetic elements that can move within a genome and affect gene
expression and genome structure.
repeat masking
A process of identifying and masking repetitive sequences in a genome to
prevent misannotation.
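Masking is usually done in one of two ways: soft masking converts repeat bases to lowercase, while hard masking replaces them with `N`. The sketch below illustrates the difference. The function name `mask_repeats` and the 0-based, half-open interval convention are assumptions made for this example, not the interface of any particular repeat-masking tool.

```python
def mask_repeats(seq, intervals, hard=False):
    """Mask repeat intervals in a sequence.

    intervals: list of (start, end) pairs, 0-based and half-open
               (an assumption of this sketch).
    hard=False lowercases the repeat bases (soft masking);
    hard=True replaces them with 'N' (hard masking).
    """
    chars = list(seq)
    for start, end in intervals:
        # Clamp the interval to the sequence length.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N" if hard else chars[i].lower()
    return "".join(chars)


soft = mask_repeats("ACGTACGTAC", [(2, 6)])
hard = mask_repeats("ACGTACGTAC", [(2, 6)], hard=True)
print(soft)  # ACgtacGTAC
print(hard)  # ACNNNNGTAC
```

Soft masking keeps the underlying sequence available to downstream tools, which is why it is often preferred when the masked genome is passed on to a gene predictor.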
gene ontology (GO)
A framework for categorizing gene functions based on biological
processes, molecular functions, and cellular components.
KEGG
The Kyoto Encyclopedia of Genes and Genomes, a database used to assign
functional pathways to genes.
Pfam
A database of protein families used to annotate protein domains and
predict gene function.
InterPro
A comprehensive database that integrates multiple protein signature
databases to provide functional annotations.
eggNOG
A database that groups genes into orthologous clusters for evolutionary
and functional annotation.
OrthoFinder
A tool used to infer orthologous gene relationships across multiple
species.
BUSCO
Benchmarking Universal Single-Copy Orthologs, a tool used to assess
genome and annotation completeness.
MAKER
A genome annotation pipeline that integrates multiple gene prediction
methods and evidence sources.
BRAKER3
An automated tool for training gene prediction models and annotating
genomes using RNA-Seq and protein homology evidence.
Helixer
A deep-learning-based gene annotation tool that predicts gene models
using trained neural networks.
Easel
A genome annotation framework that incorporates machine learning and
statistical modeling for gene prediction.
EnTAP
The Eukaryotic Non-Model Transcriptome Annotation Pipeline, which
provides functional annotations for predicted genes.
RefSeq
A curated, non-redundant database of reference sequences (genomes,
transcripts, and proteins) and their annotations, maintained by NCBI.
GFF3 (Generic Feature Format version 3)
A file format used to store genome annotations, including gene models
and other sequence features.
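Each GFF3 feature line has nine tab-separated columns: seqid, source, type, start, end, score, strand, phase, and attributes. As a minimal sketch, the following parses one such line using only the Python standard library. The function name `parse_gff3_line` and the example record are made up for illustration.

```python
from urllib.parse import unquote

GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line):
    """Parse a single GFF3 feature line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    # Coordinates are 1-based and inclusive in GFF3.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds semicolon-separated key=value pairs (URL-escaped).
    attributes = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[unquote(key)] = unquote(value)
    record["attributes"] = attributes
    return record


# Hypothetical feature line, not taken from a real annotation.
example = "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene0001;Name=demo"
gene = parse_gff3_line(example)
print(gene["type"], gene["start"], gene["end"], gene["attributes"]["ID"])
```

For real annotation files, a dedicated parser is usually a better choice, since this sketch ignores comment lines, directives, and multi-valued attributes.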
FASTA
A text-based format used for storing nucleotide or protein
sequences.
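A FASTA record consists of a header line starting with `>` followed by one or more lines of sequence. The following sketch reads records from any file-like object using only the standard library. The function name `read_fasta` and the demo records are assumptions for this example.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted file object."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence may be wrapped over several lines; join them later.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up records for illustration.
demo = io.StringIO(">seq1 demo record\nACGT\nACGT\n>seq2\nMKV\n")
records = dict(read_fasta(demo))
print(records)
```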
BLAST (Basic Local Alignment Search Tool)
A tool used to compare nucleotide or protein sequences to known
databases for similarity-based functional annotation.
orthologs
Genes in different species that evolved from a common ancestor and
typically retain similar functions.
paralogs
Genes related by a duplication event, typically found within the same
genome, that may have diverged in function.
pseudogene
A non-functional copy of a gene that has lost its ability to encode a
protein due to mutations.
HMMER
A bioinformatics tool that uses hidden Markov models to search for
protein families and domains in sequence databases.
RNA secondary structure
The folded structure of RNA molecules, which can influence gene
regulation and function.
transcription factor binding site (TFBS)
A DNA sequence where transcription factors bind to regulate gene
expression.
ChIP-Seq
A sequencing method used to identify protein-DNA interactions, including
transcription factor binding sites.
CpG island
A region of DNA with a high frequency of CG dinucleotides, often
associated with gene regulatory elements.
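CpG islands are commonly detected from two statistics: GC content and the ratio of observed to expected CpG dinucleotides, computed as (number of CpG × sequence length) / (number of C × number of G). Commonly cited thresholds are a length of at least 200 bp, GC content above 50%, and an observed/expected ratio above 0.6, although exact criteria vary between tools. The sketch below computes both statistics for a single sequence window. The function name `cpg_stats` is an assumption for this example.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_content = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C * G) / N.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


gc_content, obs_exp = cpg_stats("CGCGCGAT")
print(gc_content)          # 0.75
print(round(obs_exp, 3))   # 2.667
```

In practice these statistics are evaluated over sliding windows along the genome rather than on one short sequence.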
epigenetic modification
Chemical modifications to DNA or histones that affect gene expression
without altering the DNA sequence.
methylation
The addition of methyl groups to DNA, often involved in gene silencing
and epigenetic regulation.
histone modification
Chemical modifications to histone proteins that influence chromatin
structure and gene regulation.
non-coding RNA (ncRNA)
RNA molecules that do not encode proteins but play regulatory or
structural roles in the cell.
lncRNA (long non-coding RNA)
A class of non-coding RNA, typically longer than 200 nucleotides, that
regulates gene expression through various mechanisms.
microRNA (miRNA)
Small non-coding RNA molecules that regulate gene expression by
targeting mRNA for degradation or translational repression.
tRNA (transfer RNA)
A type of RNA that carries amino acids to ribosomes for protein
synthesis.
rRNA (ribosomal RNA)
A structural component of ribosomes essential for protein synthesis.
snoRNA (small nucleolar RNA)
A class of RNA involved in the chemical modification of ribosomal
RNA.
polyadenylation
The addition of a poly(A) tail to mRNA, affecting its stability and
translation efficiency.
transcriptome assembly
The process of reconstructing full-length transcripts from RNA-Seq
data.
splice site prediction
A computational method used to identify exon-intron boundaries in gene
models.
alternative polyadenylation
The use of different polyadenylation sites within a gene, leading to
mRNA isoforms with varying stability or regulatory properties.
functional enrichment analysis
A method used to identify biological processes or pathways
overrepresented in a set of annotated genes.
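One common way to test for overrepresentation is a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test). As a minimal sketch using only the standard library, the function below computes the probability of seeing at least `k` genes with a given annotation in a gene set of size `n`, when `K` of the `N` background genes carry that annotation. The function and parameter names are assumptions for this example, and real analyses also need multiple-testing correction across terms.

```python
from math import comb


def enrichment_pvalue(N, K, n, k):
    """One-sided hypergeometric p-value P(X >= k).

    N: number of genes in the background
    K: background genes annotated with the term
    n: number of genes in the set of interest
    k: genes in the set of interest annotated with the term
    """
    total = comb(N, n)
    upper = min(n, K)
    # Sum the probabilities of k, k+1, ..., upper annotated genes.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy numbers: half of 10 background genes are annotated, one gene is drawn.
p = enrichment_pvalue(N=10, K=5, n=1, k=1)
print(p)  # 0.5
```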