
Running Bioinformatics Programs on RCAC

Prerequisites

  • RCAC cluster account (apply here)
  • Basic command-line familiarity (cd, ls, mkdir, nano/vim)
  • Access to Negishi, Gautschi, or Bell

What you will learn

  • Find and load bioinformatics software via biocontainers
  • Understand how containerized wrappers work
  • Create and manage Conda environments
  • Write and submit SLURM batch jobs
  • Debug common failures

This guide covers the practical skills you need to run bioinformatics software on RCAC clusters. RCAC deploys bioinformatics tools as BioContainers (pre-built Apptainer containers) accessed through the Lmod module system. For tools not in the RCAC collection, you can use Conda environments or pull your own containers. Most production work runs as batch jobs through the SLURM scheduler.

By the end of this guide you will be able to find any bioinformatics tool on RCAC, run it correctly, and submit efficient batch jobs.

You need a terminal session on the cluster before running any commands. To open one in your browser:

  1. Go to gateway.negishi.rcac.purdue.edu and log in with your Purdue credentials. For other clusters, replace negishi with the cluster name (e.g., gateway.gautschi.rcac.purdue.edu, gateway.bell.rcac.purdue.edu, gateway.gilbreth.rcac.purdue.edu).
  2. Click Clusters in the top menu, then select the cluster shell (e.g., Negishi Shell Access).
  3. A terminal opens in your browser. You are on a login node.

RCAC uses the Lmod module system to manage software. Bioinformatics tools are deployed as pre-built BioContainers (Apptainer containers) and accessed through the biocontainers module.

First, load the biocontainers module to make bioinformatics tools visible:

module --force purge
module load biocontainers
module spider samtools

module spider searches all modules, including those not yet visible. It shows available versions and any prerequisite modules.

To get loading instructions for a specific version:

module spider samtools/1.21

To list all available biocontainer modules:

module avail

To load a specific tool, load biocontainers together with the tool module:

module --force purge
module load biocontainers samtools/1.21

The biocontainers module unlocks all bioinformatics software. You must load it before any tool module becomes visible.

Use module --force purge (not just module purge) to remove sticky modules like xalt that use a newer glibc and conflict with containerized tools. Here is what happens if you skip --force:

module load biocontainers bwa
bwa
/bin/sh: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by /apps/external/apps/xalt3/xalt/xalt/lib64/libxalt_init.so)
/bin/sh: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /apps/external/apps/xalt3/xalt/xalt/lib64/libxalt_init.so)

The fix is to always start with module --force purge:

module --force purge
module load biocontainers bwa
bwa

After loading, run the tool as usual — the output will appear as if the tool is installed natively.

Behind the scenes, RCAC creates shell functions that wrap each tool in an apptainer/singularity container call. When you type bwa, the function runs singularity run <container.sif> bwa for you.

Because the tools are containerized, which and type will show the shell function that wraps the container call, not the actual executable:

which bwa
bwa ()
{
/usr/bin/singularity run /apps/biocontainers/images/quay.io_biocontainers_bwa:0.7.17--h5bf99c6_8.sif env LANG=C.UTF-8 bwa "$@"
}
type bwa
bwa is a function
bwa ()
{
/usr/bin/singularity run /apps/biocontainers/images/quay.io_biocontainers_bwa:0.7.17--h5bf99c6_8.sif env LANG=C.UTF-8 bwa "$@"
}

For most use cases this does not matter — the function handles everything transparently. However, if a pipeline or workflow checks the executable path (e.g., which bwa to verify the installation), it will get the function definition instead of a file path. In that case, you may need to either:

  • Write a custom wrapper script that satisfies the pipeline’s path check
  • Contact rcac-help@purdue.edu for assistance with pipeline-specific setups
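
As a sketch of the first option (the wrapper location, tool version, and .sif path here are illustrative, not RCAC-specific — check $BIOC_IMAGE_DIR for the real image path), you can place a small executable wrapper on your PATH so that path checks resolve to a real file:

```shell
# Hypothetical wrapper: install ~/bin/bwa as a real executable file, so
# pipeline checks like `which bwa` return a path instead of a shell function.
mkdir -p "$HOME/bin"
cat > "$HOME/bin/bwa" <<'EOF'
#!/bin/bash
# The container path below is illustrative; use the image from $BIOC_IMAGE_DIR.
exec singularity run /apps/biocontainers/images/quay.io_biocontainers_bwa:0.7.17--h5bf99c6_8.sif bwa "$@"
EOF
chmod +x "$HOME/bin/bwa"
export PATH="$HOME/bin:$PATH"
```

After this, `command -v bwa` reports a file path under $HOME/bin, which satisfies most pipeline installation checks.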

Loading the biocontainers module sets the $BIOC_IMAGE_DIR environment variable, which points to the directory containing all container images:

echo $BIOC_IMAGE_DIR
# /apps/biocontainers/images

You can use this to run containers directly with singularity run or apptainer exec when you need more control (e.g., custom bind mounts, GPU flags, or piping between containerized tools):

singularity run ${BIOC_IMAGE_DIR}/quay.io_biocontainers_bwa:0.7.17--h5bf99c6_8.sif bwa mem ref.fa reads.fq

To check what is currently loaded, verify a tool responds, and reset your environment:

module list
samtools --version
module --force purge

Use module --force purge at the top of every SLURM script and whenever you hit a module conflict. The --force flag is important because it also removes sticky modules (like xalt) that a plain module purge would leave behind.

If you try to load two modules built with different compiler toolchains, Lmod will refuse with an error. The fix:

module --force purge
module load biocontainers samtools/1.21 bwa-mem2/2.2.1

Loading both in a single command lets Lmod resolve the dependency tree.

All bioinformatics modules on RCAC are already containerized via BioContainers (see Finding Software with Modules above). Pull your own container when:

  • The tool or version is not available via module spider after loading biocontainers
  • You need a specific build variant or tag not deployed by RCAC
  • You are developing or testing a custom pipeline image

If a tool is not in the biocontainers collection, pull it from a container registry:

cd ${RCAC_SCRATCH}
apptainer pull docker://quay.io/biocontainers/bwa:0.7.18--he4a0461_1

This creates a .sif file in the current directory. Run commands inside it with apptainer exec:

apptainer exec bwa_0.7.18--he4a0461_1.sif bwa

RCAC auto-binds /home, /scratch, /depot, and /tmp into containers. For data in non-standard locations, bind manually:

apptainer exec --bind /my/custom/path container.sif <command>

Conda is useful for niche Python or R packages and tools with complex dependency trees that are not available as modules or containers.

module --force purge
module load conda
conda create -n multiqc_env -c bioconda -c conda-forge multiqc=1.25 -y
conda activate multiqc_env
multiqc --version

Conda environments are large (often 2-10 GB). Your Home directory is only ~25 GB. Redirect Conda storage to Scratch by creating a .condarc file:

~/.condarc
pkgs_dirs:
- /scratch/negishi/${USER}/.conda/pkgs
envs_dirs:
- /scratch/negishi/${USER}/.conda/envs
channels:
- conda-forge
- bioconda
- defaults
auto_activate_base: false

Then create the directories:

mkdir -p /scratch/negishi/${USER}/.conda/pkgs
mkdir -p /scratch/negishi/${USER}/.conda/envs

Use this decision process to pick the right method:

  1. Load biocontainers: module --force purge && module load biocontainers
  2. Search for the tool: module spider <toolname>
  3. If found, load it: module load biocontainers <tool>/<version>
  4. If not, search Conda: conda search -c bioconda <toolname>
  5. If found in Conda: conda create -n <env> -c bioconda -c conda-forge <tool>=<ver>
  6. If not found anywhere: pull a Docker/Apptainer container or build from source

Comparison: Biocontainers (Modules) vs Conda vs Custom Container

                | Biocontainers (Modules)     | Conda                          | Custom Container
----------------+-----------------------------+--------------------------------+-----------------------------
Maintained by   | RCAC                        | You                            | You
Install effort  | None                        | Medium                         | High
Reproducibility | Excellent (immutable image) | Fragile (solver can change)    | Excellent
Storage cost    | None                        | High (2-10 GB per env)         | Medium (0.5-2 GB per image)
Speed           | Near-native                 | Native                         | Near-native
Updates         | RCAC manages                | You manage                     | You manage
Best for        | Most bioinformatics tools   | Niche packages, R/Python envs  | Full control, custom builds

Login nodes are for editing files and submitting jobs. All computation should happen on compute nodes through SLURM.

For quick testing, request an interactive session:

sinteractive -A <account-name> -n 4 -N 1 --time=1:00:00

This gives you a shell on a compute node where you can load modules and test commands. Type exit when done.

A SLURM batch script has three parts:

  1. Shebang: #!/bin/bash
  2. #SBATCH directives: resource requests parsed by SLURM (not executed by bash)
  3. Your commands: module loads, tool invocations, file operations
slurm_bwa_align.sh
#!/bin/bash
#SBATCH --job-name=bwa_align
#SBATCH --account=<account-name>
#SBATCH --partition=<partition-name>
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --time=04:00:00
#SBATCH --mem=32G
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
module --force purge
module load biocontainers bwa-mem2/2.2.1 samtools/1.21
WORKDIR=/scratch/negishi/${USER}/alignment_project
REF=${WORKDIR}/ref/genome.fa
R1=${WORKDIR}/fastq/sample_R1.fastq.gz
R2=${WORKDIR}/fastq/sample_R2.fastq.gz
OUTDIR=${WORKDIR}/bam
mkdir -p ${OUTDIR}
bwa-mem2 mem \
-t ${SLURM_CPUS_PER_TASK} \
-R "@RG\tID:sample\tSM:sample\tPL:ILLUMINA\tLB:lib1" \
${REF} ${R1} ${R2} \
| samtools sort -@ 4 -m 2G -o ${OUTDIR}/sample.sorted.bam -
samtools index ${OUTDIR}/sample.sorted.bam
samtools flagstat ${OUTDIR}/sample.sorted.bam
Submit the script, monitor the queue, and cancel if needed:

sbatch slurm_bwa_align.sh
squeue -u ${USER}
scancel <jobid>
Directive        | Description              | Typical value
-----------------+--------------------------+--------------------------------------
--account        | Allocation/account name  | Check with mybalance
--partition      | Queue/partition          | Cluster-specific
--nodes          | Number of nodes          | 1 (almost always for bioinformatics)
--ntasks         | Number of processes      | 1 for single tools
--cpus-per-task  | Threads per process      | Match tool's -t flag (4-32)
--time           | Wall clock limit         | Start generous, tighten after sacct
--mem            | Total memory             | Check tool docs; start with 16-32G
--job-name       | Name shown in squeue     | Short, descriptive
--output         | stdout file              | %x_%j.out (name + job ID)
--error          | stderr file              | %x_%j.err
--array          | Array job indices        | 0-N for batch processing

When running the same tool on multiple input files, use array jobs instead of submitting separate scripts. Each array task gets a unique SLURM_ARRAY_TASK_ID (0, 1, 2, …) that you use to select the input file.

slurm_fastqc.sh
#!/bin/bash
#SBATCH --job-name=fastqc
#SBATCH --account=<account-name>
#SBATCH --partition=<partition-name>
#SBATCH --cpus-per-task=2
#SBATCH --time=01:00:00
#SBATCH --mem=4G
#SBATCH --array=0-5
#SBATCH --output=fastqc_%A_%a.out
#SBATCH --error=fastqc_%A_%a.err
module --force purge
module load biocontainers fastqc/0.12.1
FASTQ_LIST=/scratch/negishi/${USER}/project/fastq_list.txt
OUTDIR=/scratch/negishi/${USER}/project/fastqc_results
mkdir -p ${OUTDIR}
FASTQ=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" ${FASTQ_LIST})
fastqc --outdir ${OUTDIR} --threads ${SLURM_CPUS_PER_TASK} --quiet ${FASTQ}

Create the file list first:

ls /scratch/negishi/${USER}/project/fastq/*.fastq.gz > fastq_list.txt
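
You can sanity-check the index arithmetic locally before submitting. This dry run uses a fake task ID and placeholder file names:

```shell
# Simulate what array task 2 would pick. SLURM array indices start at 0,
# while sed line numbers start at 1 -- hence the +1 in the script.
printf '%s\n' sampleA_R1.fastq.gz sampleB_R1.fastq.gz sampleC_R1.fastq.gz > fastq_list.txt
SLURM_ARRAY_TASK_ID=2
FASTQ=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" fastq_list.txt)
echo "$FASTQ"    # task 2 selects line 3: sampleC_R1.fastq.gz
```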

Conda requires shell initialization inside batch scripts. Without it, conda activate will fail:

slurm_multiqc_conda.sh
#!/bin/bash
#SBATCH --job-name=multiqc
#SBATCH --account=<account-name>
#SBATCH --partition=<partition-name>
#SBATCH --cpus-per-task=2
#SBATCH --time=00:30:00
#SBATCH --mem=4G
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
module --force purge
module load conda
eval "$(conda shell.bash hook)"
conda activate multiqc_env
multiqc /scratch/negishi/${USER}/project/fastqc_results \
--outdir /scratch/negishi/${USER}/project/multiqc_output \
--filename multiqc_report \
--force
conda deactivate

The key line is eval "$(conda shell.bash hook)" — this initializes Conda for the non-interactive bash shell that SLURM uses.

After a job completes, check what it actually used:

sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed,State,ExitCode

  • MaxRSS: peak memory. Use this to right-size --mem next time.
  • Elapsed: actual wall time. Use this to right-size --time.

Start generous, then tighten. Over-requesting wastes allocation and hurts your queue priority, while under-requesting gets your job killed.
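
As a rough sketch of the tightening step (the helper name and the 25% margin are my own choices, not an RCAC convention), you can turn a MaxRSS figure into a new --mem request:

```shell
# suggest_mem: convert a MaxRSS value (sacct reports it in K by default)
# into a --mem request in megabytes with ~25% headroom.
suggest_mem() {
    local kb=${1%K}                      # strip the trailing K
    local mb=$(( (kb + 1023) / 1024 ))   # KiB -> MiB, rounded up
    echo "$(( mb + mb / 4 ))M"           # add ~25% headroom
}

suggest_mem 8388608K    # 8 GiB peak -> 10240M
```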

Problem: You run a tool and get bash: samtools: command not found.

Diagnosis: The module is not loaded, or you loaded biocontainers but forgot the tool module.

Fix:

module --force purge
module load biocontainers samtools/1.21

If module spider cannot find the tool after loading biocontainers, it may not be installed on this cluster. Try Conda or pull a custom container.

Problem: Your job vanishes from squeue but produced no output files.

Diagnosis: Check the .err file and sacct:

cat <jobname>_<jobid>.err
sacct -j <jobid> --format=JobID,State,ExitCode,MaxRSS

Common states:

State          | Meaning
---------------+---------------------------------------
COMPLETED      | Finished successfully (exit code 0:0)
FAILED         | Your script had an error
OUT_OF_MEMORY  | Exceeded --mem request
TIMEOUT        | Exceeded --time request
CANCELLED      | Manually cancelled or preempted

Problem: sacct shows OUT_OF_MEMORY.

Fix: Increase --mem. Check MaxRSS of the failed job to see peak usage, then request 20-30% more.

Problem: Your files on /scratch are gone.

Diagnosis: Scratch is purged after 60 days of inactivity. There is no warning and no recovery.

Fix: Move important results to Home or Depot promptly. For active projects, periodic access resets the clock.
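
A sketch of the archiving step; the source and destination paths are placeholders for your own project layout, and on the cluster the source would already contain real results:

```shell
# Copy finished results from scratch to a safer location (paths illustrative).
SRC="${RCAC_SCRATCH:-/tmp/$USER}/project/results"
DEST="${HOME}/archive/project"
mkdir -p "$SRC" "$DEST"
touch "$SRC/sample.sorted.bam"   # stand-in for a real output file
cp -a "$SRC/." "$DEST/"          # -a preserves timestamps and permissions
```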

Problem: Lmod has detected the following error: ... when loading modules.

Fix: Start fresh:

module --force purge
module load biocontainers <tool1>/<version> <tool2>/<version>

Problem: CommandNotFoundError: Your shell has not been properly configured...

Fix: Add shell initialization before conda activate:

module --force purge
module load conda
eval "$(conda shell.bash hook)"
conda activate myenv

Problem: Script fails because /scratch/negishi/ does not exist on Gautschi.

Fix: Use ${RCAC_SCRATCH} instead of hardcoding the cluster name:

WORKDIR=${RCAC_SCRATCH}/my_project

This resolves to the correct path on any RCAC cluster.
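
One way to keep such scripts testable off-cluster (the fallback path here is illustrative, my own choice) is a parameter-expansion default:

```shell
# Use RCAC_SCRATCH when set (on a cluster); fall back to /tmp for local tests.
WORKDIR="${RCAC_SCRATCH:-/tmp/$USER}/my_project"
mkdir -p "$WORKDIR"
echo "$WORKDIR"
```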

When a job fails, follow these steps:

  1. Check the exit code: echo $? (0 = success, non-zero = failure)
  2. Read the error log: cat <jobname>_<jobid>.err
  3. Check job accounting: sacct -j <jobid> --format=JobID,State,ExitCode,MaxRSS,Elapsed
  4. Reproduce interactively: sinteractive -A <account-name> -n 4 --time=1:00:00, load modules, run the failing command
  5. Search the error message: Google, Biostars, or the tool’s GitHub Issues
  6. Ask for help: Email rcac-help@purdue.edu with your job ID and the error message
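
A related habit (my suggestion, not an RCAC requirement): start batch scripts with set -euo pipefail so the job aborts at the first failing command and the .err log points at the real culprit. A quick local demonstration:

```shell
# Under 'set -euo pipefail' the script exits as soon as 'false' fails,
# so the echo after it never runs and the exit code is non-zero.
status=0
bash -c 'set -euo pipefail; false; echo "never reached"' || status=$?
echo "exit code: $status"
```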

Copy this .condarc to your home directory to redirect Conda storage off of Home:

~/.condarc
pkgs_dirs:
- /scratch/negishi/${USER}/.conda/pkgs
envs_dirs:
- /scratch/negishi/${USER}/.conda/envs
channels:
- conda-forge
- bioconda
- defaults
auto_activate_base: false

Session 6: QC for Genomics — April 7, 2026, 11:00 AM — 12:00 PM ET

Topics: FastQC interpretation, fastp trimming, MultiQC aggregation, quality control strategies for different sequencing platforms.