Downloading SRA Data

The NCBI Sequence Read Archive (SRA) is the largest publicly available repository of high-throughput sequencing data. This guide walks you through downloading FASTQ files from SRA on RCAC clusters using the SRA Toolkit.

Quick Start

If you’re familiar with HPC (Negishi/Gautschi/Bell) and just need the commands:

module load sra-tools
prefetch SRR12345678
fasterq-dump SRR12345678 --split-files -e 8 -p

For detailed instructions and best practices, continue reading below.

Understanding SRA Downloads

Before downloading, it’s important to understand the two-step workflow:

prefetch Downloads the compressed .sra file from NCBI to your local cache
fasterq-dump - Converts the .sra file to FASTQ format

Step-by-Step Guide

Load the SRA Toolkit module

module load biocontainers
module load sra-tools

Verify the installation:

prefetch --version
fasterq-dump --version

Configure your cache directory (first time only)

By default, SRA Toolkit caches files in your home directory, which has limited space. Configure it to use scratch space instead:
```
mkdir -p $RCAC_SCRATCH/ncbi
vdb-config --prefetch-to-user-repo
vdb-config -s /repository/user/main/public/root=$RCAC_SCRATCH/ncbi
```
Find your SRA accession numbers

SRA accessions typically start with:
- SRR - Individual run (most common)
- SRP - Study/Project
- SRX - Experiment
You can find accessions on NCBI SRA or ENA.
Download the SRA file using prefetch

For a single accession:
```
prefetch SRR12345678
```
For multiple accessions from a file:
```
prefetch --option-file accession_list.txt
```
Where accession_list.txt contains one accession per line.
Convert to FASTQ using fasterq-dump

For paired-end data:
```
fasterq-dump SRR12345678 --split-files -e 8 -p
```
For single-end data:
```
fasterq-dump SRR12345678 -e 8 -p
```
Key options:
- --split-files - Separates paired reads into _1.fastq and _2.fastq
- -e 8 - Use 8 threads (adjust based on your allocation)
- -p - Show progress
Compress the FASTQ files

FASTQ files are large. Compress them to save space:
```
pigz -p 8 *.fastq
```

SLURM Batch Script

For large downloads, submit a batch job rather than running interactively.

#!/bin/bash
#SBATCH --job-name=sra_download
#SBATCH --account=your_account
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=04:00:00
#SBATCH --output=sra_%j.out
#SBATCH --error=sra_%j.err

# Load required modules
module purge
module load biocontainers
module load sra-tools

# Set variables
SRR_ID="SRR12345678"
OUTDIR="$RCAC_SCRATCH/fastq_files"
THREADS=8

# Create output directory
mkdir -p ${OUTDIR}
cd ${OUTDIR}

# Step 1: Prefetch the SRA file
echo "Starting prefetch for ${SRR_ID}..."
prefetch ${SRR_ID}

# Step 2: Convert to FASTQ
echo "Converting to FASTQ..."
fasterq-dump ${SRR_ID} --split-files -e ${THREADS} -p

# Step 3: Compress FASTQ files
echo "Compressing FASTQ files..."
pigz -p ${THREADS} ${SRR_ID}*.fastq

# Step 4: Clean up cache
echo "Cleaning up..."
rm -rf $RCAC_SCRATCH/ncbi/sra/${SRR_ID}.sra

echo "Done! Files saved to ${OUTDIR}"
ls -lh ${SRR_ID}*.fastq.gz

#!/bin/bash
#SBATCH --job-name=sra_batch
#SBATCH --account=your_account
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=24:00:00
#SBATCH --output=sra_batch_%j.out
#SBATCH --error=sra_batch_%j.err

# Load required modules
module purge
module load biocontainers
module load sra-tools

# Set variables
ACCESSION_FILE="accession_list.txt"
OUTDIR="$RCAC_SCRATCH/fastq_files"
THREADS=8

# Create output directory
mkdir -p ${OUTDIR}
cd ${OUTDIR}

# Process each accession
while read -r SRR_ID; do
    # Skip empty lines and comments
    [[ -z "$SRR_ID" || "$SRR_ID" =~ ^# ]] && continue

    echo "=========================================="
    echo "Processing: ${SRR_ID}"
    echo "=========================================="

    # Prefetch
    prefetch ${SRR_ID}

    # Convert to FASTQ
    fasterq-dump ${SRR_ID} --split-files -e ${THREADS} -p

    # Compress
    pigz -p ${THREADS} ${SRR_ID}*.fastq

    # Clean up cache
    rm -rf $RCAC_SCRATCH/ncbi/sra/${SRR_ID}.sra

    echo "Completed: ${SRR_ID}"

done < ${ACCESSION_FILE}

echo "All downloads complete!"
ls -lh *.fastq.gz

#!/bin/bash
#SBATCH --job-name=sra_array
#SBATCH --account=your_account
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=04:00:00
#SBATCH --array=1-10%5
#SBATCH --output=sra_%A_%a.out
#SBATCH --error=sra_%A_%a.err

# Load required modules
module purge
module load biocontainers
module load sra-tools

# Set variables
ACCESSION_FILE="accession_list.txt"
OUTDIR="$RCAC_SCRATCH/fastq_files"
THREADS=8

# Get the SRR ID for this array task
SRR_ID=$(sed -n "${SLURM_ARRAY_TASK_ID}p" ${ACCESSION_FILE})

# Create output directory
mkdir -p ${OUTDIR}
cd ${OUTDIR}

echo "Array task ${SLURM_ARRAY_TASK_ID}: Processing ${SRR_ID}"

# Prefetch
prefetch ${SRR_ID}

# Convert to FASTQ
fasterq-dump ${SRR_ID} --split-files -e ${THREADS} -p

# Compress
pigz -p ${THREADS} ${SRR_ID}*.fastq

# Clean up cache
rm -rf $RCAC_SCRATCH/ncbi/sra/${SRR_ID}.sra

echo "Completed: ${SRR_ID}"
ls -lh ${SRR_ID}*.fastq.gz

Submit the job with:

sbatch download_sra.sh

Verification Steps

After downloading, verify your files are complete and uncorrupted:

Check file sizes

FASTQ files should be reasonably sized (typically 1-50 GB for most runs):
```
ls -lh *.fastq.gz
```
Count reads

Count the number of reads in each file:
```
zcat SRR12345678_1.fastq.gz | echo $((`wc -l`/4))
```
For paired-end data, both files should have the same read count.

Check file integrity

Verify gzip compression is intact:

gzip -t SRR12345678_1.fastq.gz && echo "File OK" || echo "File corrupted"

Inspect first few reads

Ensure the FASTQ format looks correct:
```
zcat SRR12345678_1.fastq.gz | head -12
```
You should see blocks of 4 lines: header (@), sequence, separator (+), and quality scores.

Run FastQC (optional)

For comprehensive quality assessment:

module load fastqc
fastqc SRR12345678_1.fastq.gz SRR12345678_2.fastq.gz

Expected Output

After successful download and conversion, you should have:

Directoryfastq_files/
- SRR12345678_1.fastq.gz (forward reads)
- SRR12345678_2.fastq.gz (reverse reads)

Or for single-end data:

Directoryfastq_files/
- SRR12345678.fastq.gz

Troubleshooting

Download fails with network timeout

Try these solutions:

Use prefetch with resume capability: it automatically resumes interrupted downloads
Download during off-peak hours
Check your network connection with ping www.ncbi.nlm.nih.gov

Disk quota exceeded error

Ensure your cache is set to scratch: vdb-config -s /repository/user/main/public/root=$RCAC_SCRATCH/ncbi
Clean up old cached files: rm -rf $RCAC_SCRATCH/ncbi/sra/*.sra
Check your quota with myquota

fasterq-dump runs out of memory

Request more memory in your SLURM script (--mem=32G or higher)
Reduce the number of threads (-e 4 instead of -e 8)
Use the --temp flag to specify a temp directory on scratch

Files are empty or truncated

Re-run prefetch to re-download the SRA file
Verify the accession number is correct
Check if the SRA record is still available on NCBI

FAQs

How long do downloads typically take?

Download times vary based on file size and network conditions:

Small datasets (< 5 GB): 15-30 minutes
Medium datasets (5-20 GB): 1-3 hours
Large datasets (> 20 GB): 3+ hours

The prefetch step is typically the bottleneck as it depends on network speed.

Can I download directly without prefetch?

Technically yes, but it’s not recommended:

# Not recommended - slower and less reliable
fasterq-dump SRR12345678 --split-files -e 8

Using prefetch first is faster, more reliable, and allows resuming interrupted downloads.

How do I download data from ENA instead?

ENA (European Nucleotide Archive) often has faster downloads. Use wget or curl:

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/008/SRR12345678/SRR12345678_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/008/SRR12345678/SRR12345678_2.fastq.gz

The exact URL structure varies. Find the correct URLs on the ENA Browser.

What’s the difference between split-files and split-3?

--split-files: Creates _1.fastq and _2.fastq for paired data
--split-3: Creates _1.fastq, _2.fastq, and an additional file for orphaned reads

For most analyses, --split-files is sufficient.

Additional Resources

NCBI SRA Toolkit Documentation
SRA Run Selector - Find and download accession lists
ENA Browser - Alternative download source