Annotation using BRAKER
Last updated on 2025-04-22
Overview
Questions
- What is BRAKER3?
- What are the different scenarios in which BRAKER3 can be used?
- How to run BRAKER3 with different input requirements?
- What are the different output files generated by BRAKER3?
Objectives
- Understand the different scenarios in which BRAKER3 can be used
- Learn how to run BRAKER3 with different input requirements
- Learn how to interpret the output files generated by BRAKER3
BRAKER3 is a pipeline that combines GeneMark-ES/ET/EP/ETP and AUGUSTUS to predict genes in eukaryotic genomes. It is particularly useful for annotating newly sequenced genomes. BRAKER3 accepts several kinds of input evidence, and the evidence you provide affects gene prediction accuracy. In this example, we will predict genes in the Arabidopsis genome under the following scenarios:
- Case 1: Using RNAseq reads
- Case 2: Using conserved proteins only
- Case 3: Using RNAseq and proteins together
- Case 4: Ab initio mode
- Case 5: Using a pre-trained model
Organizing the Data
Before running BRAKER3, we create one working directory per scenario:
BASH
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
mkdir -p ${WORKSHOP_DIR}/04_braker/braker_case{1..5}
Setup
Before running BRAKER3, we need to set up:
- The GeneMark-ES/ET/EP/ETP license key
- The AUGUSTUS_CONFIG_PATH configuration path
The license key for GeneMark-ES/ET/EP/ETP can be obtained from the GeneMark website. Once downloaded, place it in your home directory.
For the AUGUSTUS_CONFIG_PATH, we need to copy the config directory from the Singularity container to the scratch directory. This is required because BRAKER3 needs to write to the config directory, and the Singularity container is read-only. The job scripts below expect the copy at ${WORKSHOP_DIR}/04_braker/augustus/config.
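A quick pre-flight check can catch both setup problems before a multi-hour job is queued. The sketch below is only an illustration: it assumes the key is stored as `~/.gm_key` (confirm the file name against the instructions that come with your key) and that the copied config directory contains the usual AUGUSTUS `species/` subfolder. `check_braker_setup` is a hypothetical helper name, not part of BRAKER3.

```shell
#!/bin/bash
# Pre-flight check for the two setup requirements described above.
# Assumptions (not from the lesson): the key file name ~/.gm_key and the
# helper name check_braker_setup are illustrative.

check_braker_setup() {
    local key_file="$1"      # GeneMark license key file
    local config_dir="$2"    # writable copy of the AUGUSTUS config directory
    local status=0

    # The key must exist and be non-empty
    if [ ! -s "${key_file}" ]; then
        echo "MISSING: GeneMark key not found or empty: ${key_file}" >&2
        status=1
    fi

    # New species parameters are written under config/species,
    # so that folder must exist and be writable
    if [ ! -d "${config_dir}/species" ]; then
        echo "MISSING: no species/ folder under ${config_dir}" >&2
        status=1
    elif [ ! -w "${config_dir}/species" ]; then
        echo "NOT WRITABLE: ${config_dir}/species" >&2
        status=1
    fi
    return "${status}"
}

# Example call with the paths used in this lesson
# (falls back to the current directory if RCAC_SCRATCH is unset)
WORKSHOP_DIR="${RCAC_SCRATCH:-$PWD}/annotation_workshop"
if check_braker_setup "${HOME}/.gm_key" "${WORKSHOP_DIR}/04_braker/augustus/config"; then
    echo "Setup looks complete"
else
    echo "Setup is incomplete; see the messages above"
fi
```

The function only reports problems; it does not copy or create anything, so it is safe to run repeatedly.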
Case 1: Using RNAseq Reads
To run BRAKER3 with RNAseq evidence, we provide the RNAseq reads aligned to the genome as BAM files. The script collects every BAM file in the bamfiles directory into a comma-separated list and passes it to the --bam option:
BASH
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --account=workshop
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case1
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
BAM_DIR="${WORKSHOP_DIR}/00_datasets/bamfiles"
BAM_FILES=$(find "${BAM_DIR}" -name "*.bam" | tr '\n' ',')
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case1
cd ${workdir}
braker.pl \
--AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
--GENEMARK_PATH=${GENEMARK_PATH} \
--bam=${BAM_FILES} \
--genome=${genome} \
--species=At_$(date +"%Y%m%d").c1b \
--workingdir=${workdir} \
--gff3 \
--threads ${SLURM_CPUS_ON_NODE}
The job takes ~55 mins to finish. The results of the BRAKER3 run will be stored in the braker_case1 directory, as the braker.gff3 file.
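The `--bam` option takes a comma-separated list of files. The `find ... | tr '\n' ','` idiom used in the script leaves a trailing comma and lists files in whatever order `find` returns them. As an optional variant, the sketch below builds the same list in sorted order with no trailing comma; `join_bams` is a hypothetical helper name used only for illustration.

```shell
#!/bin/bash
# Build a comma-separated list of BAM files for --bam.
# join_bams is an illustrative helper, not part of BRAKER3.

join_bams() {
    local bam_dir="$1"
    # sort gives a reproducible order; paste -s joins all lines into one,
    # using "," as the delimiter, without a trailing comma
    find "${bam_dir}" -name "*.bam" | sort | paste -sd, -
}

# Small demonstration on empty placeholder files
demo_dir=$(mktemp -d)
touch "${demo_dir}/rep2.bam" "${demo_dir}/rep1.bam" "${demo_dir}/notes.txt"
BAM_FILES=$(join_bams "${demo_dir}")
echo "--bam=${BAM_FILES}"
rm -rf "${demo_dir}"
```

In the job script, `BAM_FILES=$(join_bams "${BAM_DIR}")` could replace the `find | tr` line.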
Case 2: Using Conserved Proteins Only
Using the orthodb-clades tool, we can download protein sequences for a specific clade. In this scenario, since we are using the Arabidopsis genome, we download the clade-specific OrthoDB v12 protein set, Viridiplantae.fa.
BASH
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
workdir=${WORKSHOP_DIR}/04_braker/braker_case2
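mkdir -p ${workdir}
# clone inside the case 2 directory so the protein path used by the BRAKER3 scripts resolves
cd ${workdir}
# SSH clone: requires a GitHub SSH key on this machine
git clone git@github.com:tomasbruna/orthodb-clades.git
cd orthodb-clades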
ml --force purge
ml biocontainers
ml snakemake
# SLURM_CPUS_ON_NODE is only set inside a Slurm job; fall back to 1 core otherwise
snakemake --cores ${SLURM_CPUS_ON_NODE:-1} selectViridiplantae
When this is done, you should see a folder named clades containing Viridiplantae.fa in the orthodb-clades directory. We will use this file as one of the input datasets for BRAKER3.
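Before submitting the BRAKER3 job, it can help to confirm that the protein file exists and actually contains sequences. The following is a minimal sketch; `count_fasta_records` is a hypothetical helper, and the path passed to it should be whichever location the snakemake run actually produced.

```shell
#!/bin/bash
# Sanity-check a protein FASTA file before using it with --prot_seq.
# count_fasta_records is an illustrative helper name.

count_fasta_records() {
    local fasta="$1"
    # -s: file exists and is not empty
    if [ ! -s "${fasta}" ]; then
        echo "ERROR: FASTA missing or empty: ${fasta}" >&2
        return 1
    fi
    # Every FASTA record starts with a ">" header line
    grep -c '^>' "${fasta}"
}

# Example call; the path is taken from the command line so nothing is hard-coded
if [ -n "${1:-}" ]; then
    count_fasta_records "$1"
fi
```

Saved as a script, this could be run with the path to Viridiplantae.fa as its only argument; a count of zero or an error message means the download step needs another look.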
The following command will run BRAKER3 with the input genome and protein sequences:
BASH
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case2
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
proteins="${WORKSHOP_DIR}/04_braker/braker_case2/orthodb-clades/clades/Viridiplantae.fa"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case2
mkdir -p ${workdir}
cd ${workdir}
braker.pl \
--AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
--GENEMARK_PATH=${GENEMARK_PATH} \
--genome=${genome} \
--prot_seq=${proteins} \
--species=At_$(date +"%Y%m%d").c2 \
--workingdir=${workdir} \
--gff3 \
--threads ${SLURM_CPUS_ON_NODE}
The results of the BRAKER3 run will be stored in the braker_case2 directory, as the braker.gff3 file. The run time for this job is ~170 mins.
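A first look at a finished run is simply to count the feature types in braker.gff3. The sketch below tallies column 3 of a GFF3 file; `summarize_gff3` is a hypothetical helper name, and the feature labels it prints depend on what the file contains.

```shell
#!/bin/bash
# Tally feature types (column 3) in a GFF3 file.
# summarize_gff3 is an illustrative helper, not a BRAKER3 command.

summarize_gff3() {
    local gff="$1"
    # Skip comment/directive lines; keep only complete 9-column records
    awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ }
                END { for (t in n) printf "%s\t%d\n", t, n[t] }' "${gff}" | sort
}

# Example call, run from inside a case directory after the job has finished
if [ -f braker.gff3 ]; then
    summarize_gff3 braker.gff3
fi
```

The output is one line per feature type with its count, which makes it easy to spot an empty or truncated result.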
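Case 3: Using RNAseq and Proteins Together
Here we provide both kinds of evidence at once: the RNAseq BAM files from Case 1 and the Viridiplantae protein set downloaded in Case 2.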
BASH
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case3
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
BAM_DIR="${WORKSHOP_DIR}/00_datasets/bamfiles"
BAM_FILES=$(find "${BAM_DIR}" -name "*.bam" | tr '\n' ',')
proteins="${WORKSHOP_DIR}/04_braker/braker_case2/orthodb-clades/clades/Viridiplantae.fa"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case3
mkdir -p ${workdir}
cd ${workdir}
braker.pl \
--AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
--GENEMARK_PATH=${GENEMARK_PATH} \
--genome=${genome} \
--prot_seq=${proteins} \
--bam=${BAM_FILES} \
--species=At_$(date +"%Y%m%d").c3 \
--workingdir=${workdir} \
--gff3 \
--threads ${SLURM_CPUS_ON_NODE}
The results of the BRAKER3 run will be stored in the braker_case3 directory, as the braker.gff3 file. The run time for this job is ~120 mins.
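Case 4: Ab Initio Mode
In this mode BRAKER3 uses no extrinsic evidence: with the --esmode flag, training relies on the genome sequence alone.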
BASH
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case4
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case4
mkdir -p ${workdir}
cd ${workdir}
braker.pl \
--AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
--GENEMARK_PATH=${GENEMARK_PATH} \
--esmode \
--genome=${genome} \
--species=At_$(date +"%Y%m%d").c4 \
--workingdir=${workdir} \
--gff3 \
--threads ${SLURM_CPUS_ON_NODE}
The results of the BRAKER3 run will be stored in the braker_case4 directory, as the braker.gff3 file. The run time for this job is ~60 mins.
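Case 5: Using Pre-trained Model
Here we skip training entirely: with --skipAllTraining, BRAKER3 predicts genes using an existing AUGUSTUS parameter set, in this case the arabidopsis model named by --species.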
BASH
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=04-00:00:00
#SBATCH --job-name=braker_case5
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml braker3/v3.0.7.5
# directory/file paths
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana_softmasked.fasta"
# runtime parameters
AUGUSTUS_CONFIG_PATH="${WORKSHOP_DIR}/04_braker/augustus/config"
GENEMARK_PATH="/opt/ETP/bin/gmes"
# Run BRAKER3
workdir=${WORKSHOP_DIR}/04_braker/braker_case5
mkdir -p ${workdir}
cd ${workdir}
braker.pl \
--AUGUSTUS_CONFIG_PATH=${AUGUSTUS_CONFIG_PATH} \
--GENEMARK_PATH=${GENEMARK_PATH} \
--skipAllTraining \
--genome=${genome} \
--species=arabidopsis \
--workingdir=${workdir} \
--gff3 \
--threads ${SLURM_CPUS_ON_NODE}
The results of the BRAKER3 run will be stored in the braker_case5/Augustus directory. The main result file, augustus.ab_initio.gff3, will be used for downstream analysis. Runtime for this job is ~7 mins.
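Once several cases have finished, a side-by-side gene count is a simple way to compare them. The sketch below looks for braker.gff3 in each case directory and falls back to Augustus/augustus.ab_initio.gff3, as described for Case 5. `count_genes` and `compare_cases` are hypothetical helper names, and the script assumes gene records are labelled `gene` in column 3.

```shell
#!/bin/bash
# Compare gene counts across the five BRAKER3 case directories.
# count_genes and compare_cases are illustrative helpers.

count_genes() {
    # Count non-comment records whose feature type (column 3) is "gene"
    awk -F'\t' '!/^#/ && $3 == "gene" { n++ } END { print n + 0 }' "$1"
}

compare_cases() {
    local braker_dir="$1"
    local case_dir gff
    for case_dir in "${braker_dir}"/braker_case[1-5]; do
        [ -d "${case_dir}" ] || continue
        gff="${case_dir}/braker.gff3"
        # Case 5 writes its result under Augustus/ instead
        [ -f "${gff}" ] || gff="${case_dir}/Augustus/augustus.ab_initio.gff3"
        if [ -f "${gff}" ]; then
            printf "%s\t%s\n" "$(basename "${case_dir}")" "$(count_genes "${gff}")"
        else
            # No result file yet (job not run or still running)
            printf "%s\t%s\n" "$(basename "${case_dir}")" "NA"
        fi
    done
}

# Example call with the directory layout used in this lesson
compare_cases "${RCAC_SCRATCH:-$PWD}/annotation_workshop/04_braker"
```

Gene counts alone do not say which annotation is better, but large differences between cases are a useful prompt for closer inspection.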
Results and Outputs
coming soon!
Key Points
- BRAKER3 is a pipeline that combines GeneMark-ES/ET/EP/ETP and AUGUSTUS to predict genes in eukaryotic genomes
- BRAKER3 can be used with different input datasets to improve gene prediction accuracy
- BRAKER3 can be run in different scenarios to predict genes in a genome