Annotation using BRAKER

Last updated on 2025-07-01

Overview

Questions

What is BRAKER3?
What are the different scenarios in which BRAKER3 can be used?
How to run BRAKER3 with different input requirements?
What are the different output files generated by BRAKER3?

Objectives

Understand the different scenarios in which BRAKER3 can be used
Learn how to run BRAKER3 with different input requirements
Learn how to interpret the output files generated by BRAKER3

BRAKER3 is a pipeline that combines GeneMark (GeneMark-ES/ET/EP/ETP, depending on the available evidence) and AUGUSTUS to predict genes in eukaryotic genomes. It is particularly useful for annotating newly sequenced genomes, and it accepts several kinds of input evidence to improve gene prediction accuracy. In this example, we will predict genes in an Arabidopsis thaliana genome under the following scenarios:

Case 1: RNAseq reads only
Case 2: Conserved proteins only
Case 3: RNAseq reads and proteins together
Case 4: ab initio mode (genome only)
Case 5: A pre-trained AUGUSTUS model

Organizing the Data

Before running BRAKER3, we need to organize the data in the following format:

BASH

WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
mkdir -p ${WORKSHOP_DIR}/04_braker/braker_case{1..5}

Setup

Before running BRAKER3, we need to set up:

GeneMark-ES/ET/EP/ETP license key
The AUGUSTUS_CONFIG_PATH configuration path

The license key for GeneMark-ES/ET/EP/ETP can be obtained from the GeneMark website. Once downloaded, you need to place it in your home directory:

BASH

# the key is distributed as a gzip-compressed file, not a tar archive
gunzip gm_key_64.gz
cp gm_key_64 ~/.gm_key
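
Before submitting a long job, it can help to confirm that the key file is actually in place. The following is a minimal sketch, not part of GeneMark or BRAKER3: the function name check_gm_key is hypothetical, and the check only tests that the file is readable and non-empty, not that the license is valid or unexpired.

```shell
# Hypothetical helper (not part of GeneMark or BRAKER3): succeeds only if the
# key file given as $1 exists, is readable, and is not empty.
check_gm_key() {
    [ -r "$1" ] && [ -s "$1" ]
}

# Report the status of the key in the home directory without aborting.
if check_gm_key "${HOME}/.gm_key"; then
    echo "GeneMark key found: ${HOME}/.gm_key"
else
    echo "GeneMark key missing or empty: ${HOME}/.gm_key" >&2
fi
```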

For the AUGUSTUS_CONFIG_PATH, we need to copy the config directory from the Singularity container to the scratch directory. This is required because BRAKER3 needs to write to the config directory, and the Singularity container is read-only. To copy the config directory, run the following command:

BASH

ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
mkdir -p ${WORKSHOP_DIR}/04_braker/augustus
copy_augustus_config ${WORKSHOP_DIR}/04_braker/augustus
export AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
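
Because BRAKER3 writes newly trained species parameters into the AUGUSTUS config directory, a quick writability check can catch a failed copy early. This is a sketch under assumptions: check_augustus_config is a hypothetical helper, the fallback path is a placeholder, and the check relies only on the species subdirectory that the AUGUSTUS config directory contains.

```shell
# Hypothetical helper (not part of AUGUSTUS or BRAKER3): succeeds only if the
# config directory given as $1 has a writable species/ subdirectory.
check_augustus_config() {
    [ -d "$1/species" ] && [ -w "$1/species" ]
}

# Falls back to a placeholder path when AUGUSTUS_CONFIG_PATH is not set.
config_dir="${AUGUSTUS_CONFIG_PATH:-/path/to/augustus/config}"
if check_augustus_config "${config_dir}"; then
    echo "AUGUSTUS config is writable: ${config_dir}"
else
    echo "AUGUSTUS config missing or read-only: ${config_dir}" >&2
fi
```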

Case 1: Using RNAseq Reads

To run BRAKER3 with RNAseq evidence, we provide the reads as alignments to the genome in BAM format. The following job script collects all BAM files in the bamfiles directory and passes them to BRAKER3 as a comma-separated list:

BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --account=workshop
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case1
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
BAM_DIR="${WORKSHOP_DIR}/00_datasets/bamfiles"
BAM_FILES=$(find "${BAM_DIR}" -name "*.bam" | tr '\n' ',')
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case1
cd ${workdir}
braker.pl \
   --AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
   --GENEMARK_PATH=${GENEMARK_PATH} \
   --bam=${BAM_FILES} \
   --genome=${genome} \
   --species=At_$(date +"%Y%m%d").c1b \
   --workingdir=${workdir} \
   --gff3 \
   --threads ${SLURM_CPUS_ON_NODE}

The job takes about 55 minutes to finish. The results are written to the braker_case1 directory; the main output file is braker.gff3.
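
The job script builds the --bam argument with find and tr, which silently yields an empty string if no BAM files are found. A more defensive variant is sketched below; list_bams is a hypothetical helper, and the fallback directory is a placeholder. It sorts the files for a reproducible order, joins them without a trailing comma, and fails when the directory contains no BAM files.

```shell
# Hypothetical helper: print a sorted, comma-separated list of the *.bam files
# under the directory given as $1; return non-zero if none are found.
list_bams() {
    bam_list=$(find "$1" -name "*.bam" | sort | paste -sd, -)
    if [ -z "${bam_list}" ]; then
        echo "ERROR: no BAM files found in $1" >&2
        return 1
    fi
    printf '%s\n' "${bam_list}"
}

# Placeholder directory unless BAM_DIR is already set in the environment.
BAM_FILES=$(list_bams "${BAM_DIR:-/path/to/bamfiles}") \
    || echo "Check the BAM directory before submitting the job" >&2
```

Inside the job script, the same call could replace the BAM_FILES assignment, followed by exit 1 on failure so that the job stops before BRAKER3 starts.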

Case 2: Using Conserved Proteins Only

Using the orthodb-clades tool, we can download protein sequences for a specific clade. Since we are annotating the Arabidopsis genome, we download the clade-specific Viridiplantae protein set (Viridiplantae.fa) from OrthoDB v12.

BASH

WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
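workdir=${WORKSHOP_DIR}/04_braker/braker_case2
mkdir -p ${workdir}
# clone inside braker_case2 so the protein path used in the job scripts resolves
cd ${workdir}
# HTTPS clone works without GitHub SSH keys
git clone https://github.com/tomasbruna/orthodb-clades.git
cd orthodb-clades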
ml --force purge
ml biocontainers
ml snakemake
snakemake --cores ${SLURM_CPUS_ON_NODE} selectViridiplantae

When this is done, you should see a folder named clades containing Viridiplantae.fa in the orthodb-clades directory. We will use this file as the protein input for BRAKER3.
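
A quick way to confirm that the download produced a usable protein set is to count its FASTA records. The sketch below assumes nothing beyond standard FASTA formatting; count_fasta_records is a hypothetical helper, and the default path is a placeholder to be replaced with the location of Viridiplantae.fa.

```shell
# Hypothetical helper: count header lines (records) in the FASTA file $1.
count_fasta_records() {
    awk '/^>/ { n++ } END { print n + 0 }' "$1"
}

# Placeholder path; point this at the downloaded Viridiplantae.fa.
proteins="${proteins:-/path/to/Viridiplantae.fa}"
if [ -f "${proteins}" ]; then
    echo "Protein records: $(count_fasta_records "${proteins}")"
else
    echo "Protein file not found: ${proteins}" >&2
fi
```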
The following command will run BRAKER3 with the input genome and protein sequences:

BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case2
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
proteins="${WORKSHOP_DIR}/04_braker/braker_case2/orthodb-clades/clades/Viridiplantae.fa"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case2
mkdir -p ${workdir}
cd ${workdir}
braker.pl \
   --AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
   --GENEMARK_PATH=${GENEMARK_PATH} \
   --genome=${genome} \
   --prot_seq=${proteins} \
   --species=At_$(date +"%Y%m%d").c2 \
   --workingdir=${workdir} \
   --gff3 \
   --threads ${SLURM_CPUS_ON_NODE}

The results are written to the braker_case2 directory; the main output file is braker.gff3. The run time for this job is about 170 minutes.
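
Once braker.gff3 exists, a first look at the annotation can be as simple as tallying the feature types in column 3. This is a minimal sketch that assumes only the standard nine-column, tab-separated GFF3 layout; summarize_gff3 is a hypothetical helper, and the default path is a placeholder.

```shell
# Hypothetical helper: print "feature_type<TAB>count" for each feature type
# (GFF3 column 3) in the file $1, skipping comment and malformed lines.
summarize_gff3() {
    awk -F'\t' '!/^#/ && NF >= 9 { count[$3]++ }
        END { for (t in count) print t "\t" count[t] }' "$1" | LC_ALL=C sort
}

# Placeholder path; point this at a finished braker.gff3.
gff="${gff:-/path/to/braker.gff3}"
if [ -f "${gff}" ]; then
    summarize_gff3 "${gff}"
else
    echo "GFF3 file not found: ${gff}" >&2
fi
```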

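Case 3: Using RNAseq and Proteins Together

In this scenario, BRAKER3 uses both the RNAseq alignments from Case 1 and the OrthoDB protein set from Case 2. The job script supplies both --bam and --prot_seq: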

BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case3
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
BAM_DIR="${WORKSHOP_DIR}/00_datasets/bamfiles"
BAM_FILES=$(find "${BAM_DIR}" -name "*.bam" | tr '\n' ',')
proteins="${WORKSHOP_DIR}/04_braker/braker_case2/orthodb-clades/clades/Viridiplantae.fa"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case3
mkdir -p ${workdir}
cd ${workdir}
braker.pl \
   --AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
   --GENEMARK_PATH=${GENEMARK_PATH} \
   --genome=${genome} \
   --prot_seq=${proteins} \
   --bam=${BAM_FILES} \
   --species=At_$(date +"%Y%m%d").c3 \
   --workingdir=${workdir} \
   --gff3 \
   --threads ${SLURM_CPUS_ON_NODE}

The results are written to the braker_case3 directory; the main output file is braker.gff3. The run time for this job is about 120 minutes.
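
Since this case depends on files produced in earlier steps, a pre-flight check before submission can save a wasted queue wait. The sketch below is illustrative only: require_file is a hypothetical helper, and the default paths are placeholders standing in for the genome and proteins variables of the job script.

```shell
# Hypothetical helper: $1 is a label, $2 a path. Returns non-zero and prints
# a message if the file is missing or empty.
require_file() {
    if [ ! -s "$2" ]; then
        echo "ERROR: $1 not found or empty: $2" >&2
        return 1
    fi
}

# Placeholder paths unless the variables are already set.
genome="${genome:-/path/to/genome.fasta}"
proteins="${proteins:-/path/to/Viridiplantae.fa}"

missing=0
require_file "genome" "${genome}" || missing=$((missing + 1))
require_file "proteins" "${proteins}" || missing=$((missing + 1))
echo "Missing inputs: ${missing}"
```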

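Case 4: Ab Initio Mode

In ab initio mode (--esmode), BRAKER3 uses no extrinsic evidence: GeneMark-ES is trained on the genome sequence alone, and AUGUSTUS is then trained on the GeneMark-ES predictions. The job script needs only the genome: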

BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case4
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case4
mkdir -p ${workdir}
cd ${workdir}
braker.pl \
   --AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
   --GENEMARK_PATH=${GENEMARK_PATH} \
   --esmode \
   --genome=${genome} \
   --species=At_$(date +"%Y%m%d").c4 \
   --workingdir=${workdir} \
   --gff3 \
   --threads ${SLURM_CPUS_ON_NODE}

The results are written to the braker_case4 directory; the main output file is braker.gff3. The run time for this job is about 60 minutes.
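
With the genome as the only input, it is worth confirming that the assembly really is soft-masked (repeats in lowercase), as its file name suggests. The following sketch estimates the lowercase fraction; softmasked_fraction is a hypothetical helper, the default path is a placeholder, and no particular threshold is implied.

```shell
# Hypothetical helper: print the fraction of lowercase bases (a, c, g, t, n)
# among all sequence characters in the FASTA file $1.
softmasked_fraction() {
    LC_ALL=C awk '!/^>/ {
            total += length($0)             # count bases before gsub edits the line
            lower += gsub(/[acgtn]/, "")    # gsub returns the number of matches
        }
        END {
            if (total > 0) printf "%.4f\n", lower / total
            else print "0.0000"
        }' "$1"
}

# Placeholder path; point this at the soft-masked genome FASTA.
genome="${genome:-/path/to/genome.fasta}"
if [ -f "${genome}" ]; then
    echo "Lowercase fraction: $(softmasked_fraction "${genome}")"
else
    echo "Genome file not found: ${genome}" >&2
fi
```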

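Case 5: Using Pre-trained Model

If AUGUSTUS parameters already exist for the species, training can be skipped with --skipAllTraining. Here we use the arabidopsis parameter set that ships with AUGUSTUS: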

BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case5
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case5
mkdir -p ${workdir}
cd ${workdir}
braker.pl \
   --AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
   --GENEMARK_PATH=${GENEMARK_PATH} \
   --skipAllTraining \
   --genome=${genome} \
   --species=arabidopsis \
   --workingdir=${workdir} \
   --gff3 \
   --threads ${SLURM_CPUS_ON_NODE}

The results are written to the braker_case5/Augustus directory. The main result file, augustus.ab_initio.gff3, will be used for downstream analysis. The run time for this job is about 7 minutes.
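
After all five cases have finished, a rough side-by-side comparison is the number of predicted genes per case. This is only a sketch: count_genes is a hypothetical helper, the scratch fallback path is a placeholder, and the output file locations are the ones stated in the case descriptions above. It assumes gene features are labelled "gene" in column 3.

```shell
# Hypothetical helper: count rows whose feature type (column 3) is "gene".
count_genes() {
    awk -F'\t' '!/^#/ && $3 == "gene" { n++ } END { print n + 0 }' "$1"
}

# Placeholder scratch path unless RCAC_SCRATCH is set.
WORKSHOP_DIR="${RCAC_SCRATCH:-/path/to/scratch}/annotation_workshop"
for case_id in 1 2 3 4 5; do
    gff="${WORKSHOP_DIR}/04_braker/braker_case${case_id}/braker.gff3"
    # Case 5 writes its main result under the Augustus subdirectory.
    if [ "${case_id}" -eq 5 ]; then
        gff="${WORKSHOP_DIR}/04_braker/braker_case5/Augustus/augustus.ab_initio.gff3"
    fi
    if [ -s "${gff}" ]; then
        printf 'case%s\t%s\n' "${case_id}" "$(count_genes "${gff}")"
    else
        printf 'case%s\t(no output yet)\n' "${case_id}"
    fi
done
```

Gene counts alone do not measure accuracy, but large differences between cases are a useful prompt for closer inspection.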

Results and Outputs

coming soon!

Key Points

BRAKER3 is a pipeline that combines GeneMark and AUGUSTUS to predict genes in eukaryotic genomes
BRAKER3 can be used with different input datasets to improve gene prediction accuracy
BRAKER3 can be run in different scenarios to predict genes in a genome