Annotation Setup
Last updated on 2025-04-22 | Edit this page
Overview
Questions
- How should files and directories be structured for a genome
annotation workflow?
- Why is RNA-seq read mapping important for gene prediction?
- How does repeat masking improve annotation accuracy?
- What preprocessing steps are necessary before running gene prediction tools?
Objectives
- Set up a structured workspace for genome annotation
- Map RNA-seq reads to the genome to provide evidence for gene
prediction
- Perform repeat masking to prevent false gene annotations
- Prepare genome and transcriptomic data for downstream annotation analyses
Setup
This section outlines the preprocessing steps required before gene annotation, including organizing directories, mapping RNA-Seq reads, and masking repetitive elements. These steps have already been completed to save time, but the instructions are provided for reference in case you need to run them on custom datasets.
Optional Section!
Feel free to skip this section as it is already performed to save time. The steps below are provided for reference - to run on custom data.
Folder structure
BASH
mkdir -p ${RCAC_SCRATCH}/annotation_workshop
cd ${RCAC_SCRATCH}/annotation_workshop
mkdir -p 00_datasets/{fastq,genome,bamfiles}
mkdir -p 03_setup
mkdir -p 04_braker
mkdir -p 05_helixer
mkdir -p 06_easel
mkdir -p 07_entap
mkdir -p 08_assessment
rsync -avP SOURCE/00_datasets/ 00_datasets/
RNA-Seq mapping
The RNA-seq dataset used for gene prediction comes from the accession E-MTAB-6422, that provides a gene expression atlas of Arabidopsis thaliana (Columbia accession), capturing transcriptional programs across 36 libraries representing different tissues (seedling, root, inflorescence, flower, silique, seed) and developmental stages (2-leaf, 6-leaf, 12-leaf, senescence, dry mature, and imbibed seed), with three biological replicates per condition.
Step 1: Download the reference genome and annotation files
BASH
# Download the reference genome
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
mkdir -p ${WORKSHOP_DIR}/00_datasets/genome
cd ${WORKSHOP_DIR}/00_datasets/genome
wget https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-60/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
gunzip Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
ml --force purge
ml biocontainers
ml bioawk
bioaw -c fastx '{print ">"$name; print $seq}' \
Arabidopsis_thaliana.TAIR10.dna.toplevel.fa |\
fold > athaliana.fasta
rm Arabidopsis_thaliana.TAIR10.dna.toplevel.fa
Step 2: Index the reference genome using STAR
BASH
#!/bin/bash
ml --force purge
ml biocontainers
ml star
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana.fasta"
index="${WORKSHOP_DIR}/00_datasets/genome/star_index"
STAR --runThreadN ${SLURM_JOB_CPUS_PER_NODE} \
--runMode genomeGenerate \
--genomeDir ${index} \
--genomeSAindexNbases 12 \
--genomeFastaFiles ${genome}
Options
Option | Description |
---|---|
--runThreadN $SLURM_JOB_CPUS_PER_NODE |
Specifies the number of CPU threads to use. In an HPC environment,
$SLURM_JOB_CPUS_PER_NODE dynamically assigns available CPUs
per node. |
--runMode genomeGenerate |
Runs STAR in genome index generation mode, required before alignment. |
--genomeDir ${index} |
Path to the directory where the genome index will be stored. |
--genomeSAindexNbases 12 |
Defines the length of the SA pre-indexing string. A recommended
value is min(14, log2(GenomeLength)/2 - 1) , but
12 is commonly used for smaller genomes. |
--genomeFastaFiles ${genome} |
Path to the FASTA file containing the reference genome sequences. |
Step 3: Map the RNA-Seq reads to the reference genome
We need to setup a script star-mapping.sh
to map the
RNA-Seq reads to the reference genome.
BASH
#!/bin/bash
ml --force purge
ml biocontainers
ml star
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana.fasta"
index="${WORKSHOP_DIR}/00_datasets/genome/star_index"
read1=$1
read2=$2
cpus=${SLURM_JOB_CPUS_PER_NODE}
outname=$(basename ${read1} | cut -f 1 -d "_")
STAR \
--runThreadN ${cpus} \
--genomeDir ${index} \
--outSAMtype BAM SortedByCoordinate \
--outFilterScoreMinOverLread 0.3 \
--outFilterMatchNminOverLread 0.3 \
--outFileNamePrefix ${outname}_ \
--readFilesCommand zcat \
--outWigType bedGraph \
--outWigStrand Unstranded \
--outWigNorm RPM \
--readFilesIn ${read1} ${read2}
Options
Option | Description |
---|---|
--runThreadN ${cpus} |
Specifies the number of CPU threads to use for alignment. |
--genomeDir ${index} |
Path to the pre-generated STAR genome index directory. |
--outSAMtype BAM SortedByCoordinate |
Outputs aligned reads in BAM format, sorted by genomic coordinates. |
--outFilterScoreMinOverLread 0.3 |
Sets the minimum alignment score relative to read length (30% of the max possible score). |
--outFilterMatchNminOverLread 0.3 |
Sets the minimum number of matching bases relative to read length (30% of the read must match). |
--outFileNamePrefix ${outname}_ |
Prefix for all output files generated by STAR. |
--readFilesCommand zcat |
Specifies a command (zcat ) to decompress input FASTQ
files if they are gzipped. |
--outWigType bedGraph |
Outputs signal coverage in bedGraph format for visualization. |
--outWigStrand Unstranded |
Specifies that the RNA-seq library is unstranded (reads are not strand-specific). |
--outWigNorm RPM |
Normalizes coverage signal to Reads Per Million (RPM). |
--readFilesIn ${read1} ${read2} |
Specifies input paired-end FASTQ files (Read 1 and Read 2). |
Step 4: Run the STAR mapping script
BASH
#!/bin/bash
# Define the directory containing FASTQ files
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
FASTQ_DIR="WORKSHOP_DIR/00_datasets/fastq"
# Loop through all R1 files and find corresponding R2 files
for R1 in ${FASTQ_DIR}/*_R1.fq.gz; do
R2="${R1/_R1.fq.gz/_R2.fq.gz}" # Replace R1 with R2
if [[ -f "$R2" ]]; then # Check if R2 exists
echo "Running STAR mapping for: $R1 and $R2"
./star-mapping.sh "$R1" "$R2"
else
echo "Error: Matching R2 file for $R1 not found!" >&2
fi
done
Step 5: store BAM files
RepeatMasker
Some gene prediction tools require repeat-masked genomes to avoid mis-annotation of transposable elements as genes. We will use RepeatMasker to mask the repeats in the genome. But first, we need to download the RepeatMasker database.
BASH
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
cd ${WORKSHOP_DIR}/00_datasets/genome
# Download the RepeatMasker database
wget https://www.girinst.org/server/RepBase/protected/RepBase30.01.fasta.tar.gz
tar xf RepBase30.01.fasta.tar.gz
replib=$(find $(pwd) -name "athrep.ref")
ml --force purge
ml biocontainers
ml repeatmasker
genome="${WORKSHOP_DIR}/00_datasets/genome/athaliana.fasta"
RepeatMasker \
-e ncbi \
-pa ${SLURM_CPUS_ON_NODE} \
-q \
-xsmall \
-lib ${replib} \
-nocut \
-gff \
${genome}
Once done, we will rename it to
athaliana.fasta.masked
Key Points
- Organizing files and directories ensures a reproducible and
efficient workflow
- Mapping RNA-seq reads to the genome provides essential evidence for
gene prediction
- Repeat masking prevents transposable elements from being
mis-annotated as genes
- Preprocessing steps are crucial for accurate and high-quality genome annotation