Annotation using Easel

Last updated on 2025-07-01

Overview

Questions

What are the steps required to set up and run EASEL on an HPC system?
How is Nextflow used to manage and execute the EASEL workflow?
What configuration files need to be modified before running EASEL?
How do you submit and monitor an EASEL job using Slurm?

Objectives

Set up and configure EASEL for execution on an HPC system.
Install Nextflow and pull the EASEL workflow from the repository.
Modify the necessary configuration files to match the HPC environment.
Submit and run EASEL as a Slurm job and verify successful execution.

EASEL (Efficient, Accurate, Scalable Eukaryotic modeLs) is a genome annotation tool that integrates machine learning, RNA folding, and functional annotations to improve gene prediction accuracy. It combines optimized AUGUSTUS parameters with transcriptome and protein evidence, and it is distributed as a scalable Nextflow pipeline for reproducible analysis.

Setup

We will install Nextflow ourselves instead of loading the cluster module (we will not use ml nextflow):

BASH

ml --force purge
ml openjdk
curl -s https://get.nextflow.io | bash
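# Make sure the target directory exists; otherwise mv would rename the launcher to a file called ~/.local/bin
mkdir -p ~/.local/bin
mv nextflow ~/.local/bin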

Check the installation:

BASH

nextflow -v
which nextflow
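If which nextflow prints nothing, the likely cause is that ~/.local/bin is not on your PATH. The sketch below is one way to check this; the helper name path_contains is hypothetical and not part of Nextflow or the cluster environment.

```shell
# Return 0 if directory $1 is an entry of the colon-separated list $2.
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

if path_contains "$HOME/.local/bin" "$PATH"; then
    echo "$HOME/.local/bin is on PATH"
else
    # Suggest the line to add to ~/.bashrc
    echo "Not on PATH; add: export PATH=\"\$HOME/.local/bin:\$PATH\""
fi
```

Wrapping the candidate and the list in colons avoids false matches on partial directory names.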

We can organize our folder structure as follows:

Running Easel

Step 1: Create an rcac.config file so that Nextflow submits jobs through the Slurm scheduler.

BASH

params {
    config_profile_description = 'Negishi HPC, RCAC Purdue University.'
    config_profile_contact     = 'Arun Seetharam (aseethar@purdue.edu)'
    config_profile_url         = "https://www.rcac.purdue.edu/knowledge/negishi"
    partition                  = 'testpbs'
    project                    = 'your_project_name'  // Replace with a valid Slurm project
    max_memory                 = '256GB'
    max_cpus                   = 128
    max_time                   = '48h'
}
singularity {
    enabled     = true
    autoMounts  = true
    // Inside the singularity scope, options are set without the "singularity." prefix
    cacheDir    = "${System.getenv('RCAC_SCRATCH')}/.singularity"
    pullTimeout = "1h"
}
process {
    resourceLimits = [
        memory: 256.GB,
        cpus: 128,
        time: 48.h
    ]
    executor       = 'slurm'
    clusterOptions = "-A ${params.project} -p ${params.partition}"
}
// Executor-specific configurations
if (params.executor == 'slurm') {
    process {
        cpus   = { check_max( 16    * task.attempt, 'cpus'   ) }
        memory = { check_max( 32.GB * task.attempt, 'memory' ) }
        time   = { check_max( 8.h  * task.attempt, 'time'   ) }
        clusterOptions = "-A ${params.project} -p ${params.partition}"

        errorStrategy = { task.exitStatus in [143,137,104,134,139] ? 'retry' : 'finish' }
        maxRetries    = 1
        maxErrors     = '-1'

        withLabel:process_single {
            cpus   = { check_max( 16                  , 'cpus'    ) }
            memory = { check_max( 32.GB  * task.attempt, 'memory'  ) }
            time   = { check_max( 8.h    * task.attempt, 'time'    ) }
        }
        withLabel:process_low {
            cpus   = { check_max( 32     * task.attempt, 'cpus'    ) }
            memory = { check_max( 64.GB  * task.attempt, 'memory'  ) }
            time   = { check_max( 24.h   * task.attempt, 'time'    ) }
        }
        withLabel:process_medium {
            cpus   = { check_max( 64     * task.attempt, 'cpus'    ) }
            memory = { check_max( 128.GB * task.attempt, 'memory'  ) }
            time   = { check_max( 24.h   * task.attempt, 'time'    ) }
        }
        withLabel:process_high {
            cpus   = { check_max( 128    * task.attempt, 'cpus'    ) }
            memory = { check_max( 256.GB * task.attempt, 'memory'  ) }
            time   = { check_max( 24.h   * task.attempt, 'time'    ) }
        }
        withLabel:process_long {
            time   = { check_max( 48.h  * task.attempt, 'time'    ) }
        }
        withLabel:process_high_memory {
            memory = { check_max( 256.GB * task.attempt, 'memory' ) }
        }
        withLabel:error_ignore {
            errorStrategy = 'ignore'
        }
        withLabel:error_retry {
            errorStrategy = 'retry'
            maxRetries    = 2
        }
    }
}
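The resource requests in this config grow with task.attempt, so a task that is retried asks for more than it did the first time. The sketch below works through that arithmetic in shell. It assumes that check_max (defined by the workflow, not shown in this lesson) simply caps a request at the configured maximum; scaled_request is a hypothetical helper written only for illustration.

```shell
# Compute base * attempt, capped at max (all values are plain integers, e.g. GB).
scaled_request() {
    sr_value=$(( $1 * $2 ))
    if [ "$sr_value" -gt "$3" ]; then
        sr_value=$3
    fi
    echo "$sr_value"
}

# Default memory: 32 GB per attempt, with max_memory = 256 GB
scaled_request 32 1 256    # first attempt: 32
scaled_request 32 2 256    # retry:         64

# process_high memory: 256 GB per attempt, already at the cap
scaled_request 256 2 256   # retry stays at 256
```

Because maxRetries is 1 for most processes, task.attempt only reaches 2, so a request that already sits at the cap (as in process_high) gains nothing from a retry.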

Step 2: Pull the Easel Nextflow workflow:

BASH

ml --force purge
ml biocontainers
ml openjdk
# Uses the Nextflow installed in ~/.local/bin during Setup, not the cluster module
nextflow pull -hub gitlab PlantGenomicsLab/easel

This will pull the latest version of the Easel workflow from the PlantGenomicsLab GitLab repository.

Step 3: Copy rcac.config into ~/.nextflow/assets/PlantGenomicsLab/easel/conf/, then modify the ~/.nextflow/assets/PlantGenomicsLab/easel/nextflow.config file to include it:

BASH

vim ~/.nextflow/assets/PlantGenomicsLab/easel/nextflow.config

or if you prefer nano:

BASH

nano ~/.nextflow/assets/PlantGenomicsLab/easel/nextflow.config

Add the rcac line to the profiles block of the nextflow.config file:

BASH

profiles {
   debug { process.beforeScript = 'echo $HOSTNAME' }
   ...
   ...
   test  { includeConfig 'conf/test.config' }
   rcac  { includeConfig 'conf/rcac.config' }
}

Note that the test line is at line # 235 in the nextflow.config file; the lines between debug and test are not shown here.
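After editing, it is worth confirming that the profile line was saved as intended. The following sketch greps for it; has_profile_include is a hypothetical helper, and the pattern assumes the line is written in the same single-line form as the test profile above.

```shell
# Return 0 if config file $1 has a line of the form:
#   <name> { includeConfig 'conf/<name>.config' }
has_profile_include() {
    grep -Eq "^[[:space:]]*$2[[:space:]]*\{[[:space:]]*includeConfig[[:space:]]+'conf/$2\.config'" "$1"
}

EASEL_CONFIG="$HOME/.nextflow/assets/PlantGenomicsLab/easel/nextflow.config"
if [ -f "$EASEL_CONFIG" ]; then
    if has_profile_include "$EASEL_CONFIG" rcac; then
        echo "rcac profile is registered"
    else
        echo "rcac profile not found in $EASEL_CONFIG"
    fi
fi
```

Note that includeConfig resolves relative paths against the directory of the including file, so conf/rcac.config must exist next to nextflow.config inside the pulled workflow.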

With the rcac profile registered, the remaining setup step is to provide the run parameters.

Step 4: Create a working directory and a params.yaml file for the Easel run:

BASH

WORKSHOP_DIR="${RCAC_SCRATCH}/annotation_workshop"
mkdir -p "${WORKSHOP_DIR}/06_easel"
cd "${WORKSHOP_DIR}/06_easel"

Contents of the params.yaml file (replace the /scratch/negishi/aseethar paths with your own):

YAML

outdir: easel
genome: /scratch/negishi/aseethar/annotation_workshop/00_datasets/genome/athaliana_softmasked.fasta
bam: /scratch/negishi/aseethar/annotation_workshop/00_datasets/bamfiles/*.bam
busco_lineage: embryophyta
order: Brassicales
prefix: arabidopsis
taxon: arabidopsis
singularity_cache_dir: /scratch/negishi/aseethar/singularity_cache
training_set: plant
executor: slurm
account: testpbs
qos: normal
project: testpbs
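Before submitting a multi-day job, a quick pre-flight check that the genome file exists and that the bam glob matches at least one file can save a failed run. This is a minimal sketch; count_matches is a hypothetical helper, and the paths are expected to be the ones you put in params.yaml.

```shell
# Print how many existing files match the glob pattern given in $1.
# The pattern is deliberately left unquoted so the shell expands it.
count_matches() {
    cm_n=0
    for cm_f in $1; do
        if [ -e "$cm_f" ]; then
            cm_n=$((cm_n + 1))
        fi
    done
    echo "$cm_n"
}

# Example usage with the workshop layout (adjust to your own scratch space)
DATA_DIR="${RCAC_SCRATCH:-/nonexistent}/annotation_workshop/00_datasets"
echo "genome files: $(count_matches "$DATA_DIR/genome/athaliana_softmasked.fasta")"
echo "bam files:    $(count_matches "$DATA_DIR/bamfiles/*.bam")"
```

A count of 0 for either line means the corresponding entry in params.yaml points at nothing and should be corrected first.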

Now we are ready to run the Easel workflow. Save the following as a Slurm submission script:

BASH

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=04-00:00:00
#SBATCH --job-name=easel
#SBATCH --output=%x.o%j
#SBATCH --error=%x.e%j
# Load the required modules
ml purge
ml biocontainers
ml openjdk
nextflow run \
   -hub gitlab PlantGenomicsLab/easel \
   -params-file params.yaml \
   -profile rcac \
   --project testpbs

This will run the Easel workflow on the RCAC HPC.
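To submit the script and follow its progress, something like the sketch below can be used. The file name easel.slurm is an assumption (the lesson does not name the script), and the two helper functions are hypothetical. The log names follow from the --output=%x.o%j and --error=%x.e%j directives, where %x is the job name and %j the job ID.

```shell
# Extract the job ID from sbatch's "Submitted batch job <id>" message.
job_id_from_sbatch() {
    echo "${1##* }"
}

# Build the stdout and stderr log names for a given job name and job ID.
log_names() {
    echo "$1.o$2 $1.e$2"
}

# Only attempt submission where Slurm and the script are actually available.
if command -v sbatch >/dev/null 2>&1 && [ -f easel.slurm ]; then
    msg=$(sbatch easel.slurm)
    jid=$(job_id_from_sbatch "$msg")
    echo "Logs will be written to: $(log_names easel "$jid")"
    squeue -j "$jid"    # show the state of the submitted job
fi
```

While the job runs, Nextflow also writes a .nextflow.log file in the launch directory, which is usually the first place to look when a process fails.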

Results and Outputs

Key Points

EASEL is executed using Nextflow, which simplifies workflow management and ensures reproducibility.
Proper configuration of resource settings and HPC parameters is essential for successful job execution.
Running EASEL requires setting up input files, modifying configuration files, and submitting jobs via Slurm.
Understanding how to monitor and troubleshoot jobs helps ensure efficient pipeline execution.