Annotation Assesment
Last updated on 2025-04-22 | Edit this page
Overview
Questions
- How to assess the quality of a genome annotation?
- What are the different tools available for assessing the quality of a genome annotation?
- How to compare the predicted annotation with the reference annotation?
- How to measure the number of raw reads assigned to the features predicted by the annotation?
Objectives
- Assess the quality of a genome annotation using different tools.
- Compare the predicted annotation with the reference annotation.
- Measure the number of raw reads assigned to the features predicted by the annotation.
Setup
The folder organization is as follows:
The annotation files (gff3
) are from the previous steps.
But are collected in this folder for convenience.
We will extract CDS and Protein sequences from the gff3
file before we begin the assesment.
BASH
ml --force purge
ml biocontainers
ml cufflinks
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
for i in *.gff3; do
base=$(basename ${i%.*})
gffread \
-g ${WORKSHOP_DIR}/00_datasets/genome/athaliana.fa \
-x ${base}_cds.fa \
-y ${base}_pep.fa $i
done
GFF3 metrics
We will use agat_sp_statistics.pl
to calculate the
statistics of the GFF3 file.
BUSCO assesment
We will use BUSCO to asses the quality of the annotation.
BASH
#!/bin/bash
ml --force purge
ml biocontainers
ml busco
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
for pep in *_pep.fa; do
base=$(basename ${pep%.*})
busco \
--in ${pep} \
--cpu ${SLURM_CPUS_ON_NODE} \
--out ${base}_busco \
--mode prot \
--lineage_dataset brassicales_odb10 \
--force
done
Once the BUSCO assesment is complete, we can view the results using the following command:
Omark assesment
OMArk is a software for proteome (protein-coding gene repertoire) quality assessment. It provides measures of proteome completeness, characterizes the consistency of all protein coding genes with regard to their homologs, and identifies the presence of contamination from other species.
The proteomes should be filtered to take only primary isoforms before running OMark.
BASH
ml --force purge
ml biocontainers
ml seqkit
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
for pep in braker*_pep.fa; do
base=$(basename ${pep%.*})
grep "\.t1$" $pep | sed 's/>//g' > ${base}.primary.ids
seqkit grep -f ${base}.primary.ids $pep > ${base}.primary.pep.fa
rm ${base}.primary.ids
done
cp helixer_pep.fa helixer.primary.pep.fa
Running OMark:
BASH
#!/bin/bash
ml --force purge
ml biocontainers
ml omark
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
database="/depot/itap/datasets/omark/LUCA.h5"
cd ${workdir}
for pep in *.primary.pep.fa; do
base=$(basename ${pep%.*})
omamer search \
--db ${database} \
--query ${pep} \
--out ${pep%.*}.omamer \
--nthreads ${SLURM_CPUS_ON_NODE}
omark \
--file ${pep%.*}.omamer \
--database ${database} \
--outputFolder ${pep%.*} \
--og_fasta ${pep}
done
Conserved HOGs in Brassicaceae
Category | Count | Percentage |
---|---|---|
Total Conserved HOGs | 17,996 | 100% |
Single-Copy HOGs (S) | 16,237 | 90.23% |
Duplicated HOGs (D) | 1,315 | 7.31% |
Unexpected Duplications (U) | 351 | 1.95% |
Expected Duplications (E) | 964 | 5.36% |
Missing HOGs (M) | 444 | 2.47% |
Proteome Composition
Category | Count | Percentage |
---|---|---|
Total Proteins | 26,994 | 100% |
Consistent (A) | 25,480 | 94.39% |
Partial Hits (P) | 1,306 | 4.84% |
Fragmented (F) | 630 | 2.33% |
Inconsistent (I) | 167 | 0.62% |
Partial Hits (P) | 106 | 0.39% |
Fragmented (F) | 29 | 0.11% |
Likely Contamination (C) | 0 | 0.00% |
Partial Hits (P) | 0 | 0.00% |
Fragmented (F) | 0 | 0.00% |
Unknown (U) | 1,347 | 4.99% |
HOG Placement - Detected Species
Species | NCBI TaxID | Associated Proteins | % of Total Proteome |
---|---|---|---|
Arabidopsis thaliana | 3702 | 25,647 | 95.01% |
Feature assignment
To measure the number of raw reads assigned to the features predicted
by the annotation, we will use featureCounts
from the
subread
package.
BASH
#!/bin/bash
ml --force purge
ml biocontainers
ml subread
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
mkdir -p featureCounts
for gff3 in *.gff3; do
base=$(basename ${gff3%.*})
featureCounts \
-T ${SLURM_CPUS_ON_NODE} \
-a ${gff3} \
-t exon \
-g ID \
-p \
-B \
-o ${base}_merged_counts.txt \
--tmpDir /tmp ${RCAC_SCRATCH}/annotation_workshop/00_datasets/bamfiles/*.bam
done
Reference annotation comparison
We will use mikado compare
the reference annotation with
the predicted annotation.
BASH
#!/bin/bash
ml --force purge
ml biocontainers
ml mikado
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
for gff3 in *.gff3; do
base=$(basename ${gff3%.*})
mikado compare \
--protein-coding \
-r ${RCAC_SCRATCH}/annotation_workshop/00_datasets/genome/athaliana_TAIR10.gff3 \
-p ${gff3} \
-o ref-TAIR10_vs_prediction_${base}_compared \
--log ${base}_compare.log
done
Summary
Key Points
-
busco
andomark
assess how well conserved genes are represented in the predicted gene set -
gff3
metrics provide structural insights and highlight discrepancies compared to known annotations -
featureCounts
assignment quantifies the number of RNA-seq reads aligning to predicted features - Reference annotation comparison evaluates how closely the predicted
genes match an established reference
- Multiple assessment methods ensure a comprehensive evaluation of annotation quality