Annotation using Easel
Last updated on 2026-02-24 | Edit this page
Overview
Questions
- What are the steps required to set up and run EASEL on an HPC system?
- How is Nextflow used to manage and execute the EASEL workflow?
- What configuration files need to be modified before running EASEL?
- How do you submit and monitor an EASEL job using Slurm?
Objectives
- Set up and configure EASEL for execution on an HPC system.
- Install Nextflow and pull the EASEL workflow from the repository.
- Modify the necessary configuration files to match the HPC environment.
- Submit and run EASEL as a Slurm job and verify successful execution.
EASEL (Efficient, Accurate, Scalable Eukaryotic modeLs) is a genome annotation tool that integrates machine learning, RNA folding, and functional annotations to improve gene prediction accuracy, leveraging optimized AUGUSTUS parameters, transcriptome and protein evidence, and a Nextflow-based scalable pipeline for reproducible analysis.
Setup
We will do a custom installation of nextflow (we will not use
ml nextflow):
check the installation:
We can organize our folder structure as follows:
Running Easel
Step 1: create a rcac.config file to
run the jobs using slurm scheduler.
BASH
params {
config_profile_description = 'Negishi HPC, RCAC Purdue University.'
config_profile_contact = 'Arun Seetharam (aseethar@purdue.edu)'
config_profile_url = "https://www.rcac.purdue.edu/knowledge/negishi"
partition = 'testpbs'
project = 'your_project_name' // Replace with a valid Slurm project
max_memory = '256GB'
max_cpus = 128
max_time = '48h'
}
singularity {
enabled = true
autoMounts = true
singularity.cacheDir = "${System.getenv('RCAC_SCRATCH')}/.singularity"
singularity.pullTimeout = "1h"
}
process {
resourceLimits = [
memory: 256.GB,
cpus: 128,
time: 48.h
]
executor = 'slurm'
clusterOptions = "-A ${params.project} -p ${params.partition}"
}
// Executor-specific configurations
if (params.executor == 'slurm') {
process {
cpus = { check_max( 16 * task.attempt, 'cpus' ) }
memory = { check_max( 32.GB * task.attempt, 'memory' ) }
time = { check_max( 8.h * task.attempt, 'time' ) }
clusterOptions = "-A ${params.project} -p ${params.partition}"
errorStrategy = { task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish' }
maxRetries = 1
maxErrors = '-1'
withLabel:process_single {
cpus = { check_max( 16 , 'cpus' ) }
memory = { check_max( 32.GB * task.attempt, 'memory' ) }
time = { check_max( 8.h * task.attempt, 'time' ) }
}
withLabel:process_low {
cpus = { check_max( 32 * task.attempt, 'cpus' ) }
memory = { check_max( 64.GB * task.attempt, 'memory' ) }
time = { check_max( 24.h * task.attempt, 'time' ) }
}
withLabel:process_medium {
cpus = { check_max( 64 * task.attempt, 'cpus' ) }
memory = { check_max( 128.GB * task.attempt, 'memory' ) }
time = { check_max( 24.h * task.attempt, 'time' ) }
}
withLabel:process_high {
cpus = { check_max( 128 * task.attempt, 'cpus' ) }
memory = { check_max( 256.GB * task.attempt, 'memory' ) }
time = { check_max( 24.h * task.attempt, 'time' ) }
}
withLabel:process_long {
time = { check_max( 48.h * task.attempt, 'time' ) }
}
withLabel:process_high_memory {
memory = { check_max( 256.GB * task.attempt, 'memory' ) }
}
withLabel:error_ignore {
errorStrategy = 'ignore'
}
withLabel:error_retry {
errorStrategy = 'retry'
maxRetries = 2
}
}
}
Step 2: pull the Easel nextflow workflow:
BASH
ml --force purge
ml biocontainers
ml nextflow
ml openjdk
nextflow pull -hub gitlab PlantGenomicsLab/easel
This will pull the latest version of the Easel workflow from the PlantGenomicsLab GitLab repository.
Step 3: modify the
~/.nextflow/assets/PlantGenomicsLab/easel/nextflow.config
file to include the rcac.config file:
or if you prefer nano:
Add the following line to the nextflow.config file:
BASH
profiles {
debug { process.beforeScript = 'echo $HOSTNAME' }
...
...
test { includeConfig 'conf/test.config' }
rcac { includeConfig 'conf/rcac.config' }
}
Note that this is in line # 235 in the nextflow.config
file. and all the lines between debug and test
aren’t shown here.
Now, we are ready to run the Easel workflow.
Step 4: create param.yaml file to run
the Easel workflow:
BASH
WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
mkdir -p ${WORKSHOP_DIR}/06_easel
cd ${WORKSHOP_DIR}/06_easel
for the param.yaml file:
YAML
outdir: easel
genome: /scratch/negishi/YOUR_USERNAME/annotation_workshop/00_datasets/genome/athaliana_softmasked.fasta
bam: /scratch/negishi/YOUR_USERNAME/annotation_workshop/00_datasets/bamfiles/*.bam
busco_lineage: embryophyta
order: Brassicales
prefix: arabidopsis
taxon: arabidopsis
singularity_cache_dir: /scratch/negishi/YOUR_USERNAME/singularity_cache
training_set: plant
executor: slurm
account: testpbs
qos: normal
project: testpbs
Update Paths in params.yaml
Replace YOUR_USERNAME with your actual username in the
params.yaml file. You can find your username by running
whoami on the cluster. For example, if your username is
train01, the genome path would be:
/scratch/negishi/train01/annotation_workshop/00_datasets/genome/athaliana_softmasked.fasta
Now we are ready to run the Easel workflow:
Our slurm file to run the Easel workflow:
BASH
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=04-00:00:00
#SBATCH --job-name=easel
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml --force purge
ml biocontainers
ml openjdk
nextflow run \
-hub gitlab PlantGenomicsLab/easel \
-params-file params.yaml \
-profile rcac \
--project testpbs
This will run the Easel workflow on the RCAC HPC.
Results and Outputs
EASEL produces output in the easel/ directory (as
specified by outdir in params.yaml). The key
output files are:
| File/Directory | Description |
|---|---|
easel/annotation/*.gff3 |
Final gene predictions in GFF3 format |
easel/annotation/*.cds.fa |
Predicted coding sequences |
easel/annotation/*.pep.fa |
Predicted protein sequences |
easel/busco/ |
BUSCO completeness assessment results |
easel/entap/ |
Functional annotation results from EnTAP |
easel/pipeline_info/ |
Nextflow execution reports and timelines |
You can check the gene count in the EASEL output:
For Arabidopsis thaliana, EASEL should predict approximately
25,000-30,000 genes. The pipeline also generates a BUSCO report
automatically, which you can find in the easel/busco/
directory.
Check the Nextflow execution report in
easel/pipeline_info/ to review runtime, resource usage, and
any failed processes.
Exercise 1: Adapting EASEL for a Different Organism
If you were annotating a vertebrate genome (e.g., zebrafish) instead
of a plant genome, which parameters in the params.yaml file
would you need to change? List all the parameters and explain why each
would need updating.
You would need to change the following parameters in
params.yaml:
-
busco_lineage: Change fromembryophyta(plants) to the appropriate vertebrate lineage (e.g.,actinopterygiifor ray-finned fish, orvertebratafor a broader assessment). -
order: Change fromBrassicalesto the correct taxonomic order for your organism (e.g.,Cypriniformesfor zebrafish). -
prefix: Change fromarabidopsisto an appropriate prefix for your organism (e.g.,zebrafishordrerio). -
taxon: Change fromarabidopsisto your target organism name (e.g.,zebrafish). -
training_set: Change fromplantto the appropriate AUGUSTUS training set for your organism (e.g.,vertebrateor a closely related species likezebrafish).
The genome, bam, and other path-related parameters would also change, but those are specific to your data rather than to the organism type.
Exercise 2: Resuming a Failed EASEL Run
Your EASEL Nextflow pipeline failed partway through execution. How would you resume the run without recomputing already completed steps? Where would you look for error logs to diagnose the issue?
To resume a failed Nextflow run, add the -resume flag to
your nextflow run command:
BASH
nextflow run \
-hub gitlab PlantGenomicsLab/easel \
-params-file params.yaml \
-profile rcac \
--project testpbs \
-resume
Nextflow caches completed tasks, so only failed or pending steps will be re-executed.
To diagnose the failure, check these locations:
-
.nextflow.login the working directory: Contains detailed Nextflow execution logs, including error messages and stack traces. -
work/directory: Each task runs in a subdirectory underwork/. Navigate to the failed task’s directory (shown in the error output) and check the.command.log,.command.err, and.command.outfiles for the specific error. -
SLURM output files: Check the
easel.o*andeasel.e*files for SLURM-level errors (e.g., memory limits, time limits).
- EASEL is executed using Nextflow, which simplifies workflow management and ensures reproducibility.
- Proper configuration of resource settings and HPC parameters is essential for successful job execution.
- Running EASEL requires setting up input files, modifying configuration files, and submitting jobs via Slurm.
- Understanding how to monitor and troubleshoot jobs helps ensure efficient pipeline execution.