Annotation Assesment

Last updated on 2025-04-22 | Edit this page

Overview

Questions

How to assess the quality of a genome annotation?
What are the different tools available for assessing the quality of a genome annotation?
How to compare the predicted annotation with the reference annotation?
How to measure the number of raw reads assigned to the features predicted by the annotation?

Objectives

Assess the quality of a genome annotation using different tools.
Compare the predicted annotation with the reference annotation.
Measure the number of raw reads assigned to the features predicted by the annotation.

Setup

The folder organization is as follows:

The annotation files (gff3) are from the previous steps. But are collected in this folder for convenience.

We will extract CDS and Protein sequences from the gff3 file before we begin the assesment.

BASH

ml --force purge
ml biocontainers
ml cufflinks
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
for i in *.gff3; do
    base=$(basename ${i%.*})
    gffread \
       -g ${WORKSHOP_DIR}/00_datasets/genome/athaliana.fa \
       -x ${base}_cds.fa \
       -y ${base}_pep.fa $i
done

GFF3 metrics

We will use agat_sp_statistics.pl to calculate the statistics of the GFF3 file.

BASH

ml --force purge
ml biocontainers
ml agat
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
mkdir -p gff3_stats
for gff3 in *.gff3; do
    base=$(basename $gff3 .gff3)
    agat_sp_statistics.pl \
       --gff ${gff3} \
       -o gff3_stats/${base}_stats.txt
done

BUSCO assesment

We will use BUSCO to asses the quality of the annotation.

BASH

#!/bin/bash
ml --force purge
ml biocontainers
ml busco
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
for pep in *_pep.fa; do
    base=$(basename ${pep%.*})
     busco \
        --in ${pep} \
        --cpu ${SLURM_CPUS_ON_NODE} \
        --out ${base}_busco \
        --mode prot \
        --lineage_dataset brassicales_odb10 \
        --force
done

Once the BUSCO assesment is complete, we can view the results using the following command:

BASH

ml --force purge
ml biocontainers
ml busco
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
mkdir -p busco_results
for f in */*_busco/short_summary*busco.txt; do
   cp $f busco_results/;
done
generate_plot.py –wd busco_results

Omark assesment

OMArk is a software for proteome (protein-coding gene repertoire) quality assessment. It provides measures of proteome completeness, characterizes the consistency of all protein coding genes with regard to their homologs, and identifies the presence of contamination from other species.

The proteomes should be filtered to take only primary isoforms before running OMark.

BASH

ml --force purge
ml biocontainers
ml seqkit
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
for pep in braker*_pep.fa; do
    base=$(basename ${pep%.*})
    grep "\.t1$" $pep | sed 's/>//g' > ${base}.primary.ids
    seqkit grep -f ${base}.primary.ids $pep > ${base}.primary.pep.fa
    rm ${base}.primary.ids
done
cp helixer_pep.fa helixer.primary.pep.fa

Running OMark:

BASH

#!/bin/bash
ml --force purge
ml biocontainers
ml omark
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
database="/depot/itap/datasets/omark/LUCA.h5"
cd ${workdir}
for pep in *.primary.pep.fa; do
    base=$(basename ${pep%.*})
    omamer search \
       --db ${database} \
       --query ${pep} \
       --out ${pep%.*}.omamer \
       --nthreads ${SLURM_CPUS_ON_NODE}
    omark \
       --file ${pep%.*}.omamer \
       --database ${database} \
       --outputFolder ${pep%.*} \
       --og_fasta ${pep}
done

OMark Results Summary

Conserved HOGs in Brassicaceae

Category	Count	Percentage
Total Conserved HOGs	17,996	100%
Single-Copy HOGs (S)	16,237	90.23%
Duplicated HOGs (D)	1,315	7.31%
Unexpected Duplications (U)	351	1.95%
Expected Duplications (E)	964	5.36%
Missing HOGs (M)	444	2.47%

Proteome Composition

Category	Count	Percentage
Total Proteins	26,994	100%
Consistent (A)	25,480	94.39%
Partial Hits (P)	1,306	4.84%
Fragmented (F)	630	2.33%
Inconsistent (I)	167	0.62%
Partial Hits (P)	106	0.39%
Fragmented (F)	29	0.11%
Likely Contamination (C)	0	0.00%
Partial Hits (P)	0	0.00%
Fragmented (F)	0	0.00%
Unknown (U)	1,347	4.99%

HOG Placement - Detected Species

Species	NCBI TaxID	Associated Proteins	% of Total Proteome
Arabidopsis thaliana	3702	25,647	95.01%

Feature assignment

To measure the number of raw reads assigned to the features predicted by the annotation, we will use featureCounts from the subread package.

BASH

#!/bin/bash
ml --force purge
ml biocontainers
ml subread
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
mkdir -p featureCounts
for gff3 in *.gff3; do
    base=$(basename ${gff3%.*})
    featureCounts \
      -T ${SLURM_CPUS_ON_NODE} \
      -a ${gff3} \
      -t exon \
      -g ID \
      -p \
      -B \
      -o ${base}_merged_counts.txt \
      --tmpDir /tmp ${RCAC_SCRATCH}/annotation_workshop/00_datasets/bamfiles/*.bam
done

Reference annotation comparison

We will use mikado compare the reference annotation with the predicted annotation.

BASH

#!/bin/bash
ml --force purge
ml biocontainers
ml mikado
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/08_assessment
cd ${workdir}
for gff3 in *.gff3; do
    base=$(basename ${gff3%.*})
    mikado compare \
      --protein-coding \
      -r ${RCAC_SCRATCH}/annotation_workshop/00_datasets/genome/athaliana_TAIR10.gff3 \
      -p ${gff3} \
      -o ref-TAIR10_vs_prediction_${base}_compared \
      --log ${base}_compare.log
done

Summary

Key Points

busco and omark assess how well conserved genes are represented in the predicted gene set
gff3 metrics provide structural insights and highlight discrepancies compared to known annotations
featureCounts assignment quantifies the number of RNA-seq reads aligning to predicted features
Reference annotation comparison evaluates how closely the predicted genes match an established reference
Multiple assessment methods ensure a comprehensive evaluation of annotation quality