Downloading SRA Data

The NCBI Sequence Read Archive (SRA) is the largest publicly available repository of high-throughput sequencing data. This guide walks you through downloading FASTQ files from SRA on RCAC clusters using the SRA Toolkit.

If you’re familiar with HPC (Negishi/Gautschi/Bell) and just need the commands:

module load biocontainers
module load sra-tools
prefetch SRR12345678
fasterq-dump SRR12345678 --split-files -e 8 -p

For detailed instructions and best practices, continue reading below.

Before downloading, it’s important to understand the two-step workflow:

  1. prefetch - Downloads the compressed .sra file from NCBI to your local cache
  2. fasterq-dump - Converts the .sra file to FASTQ format

Follow the steps below to download and convert your data:

  1. Load the SRA Toolkit module

    module load biocontainers
    module load sra-tools

    Verify the installation:

    prefetch --version
    fasterq-dump --version
  2. Configure your cache directory (first time only)

    By default, SRA Toolkit caches files in your home directory, which has limited space. Configure it to use scratch space instead:

    mkdir -p $RCAC_SCRATCH/ncbi
    vdb-config --prefetch-to-user-repo
    vdb-config -s /repository/user/main/public/root=$RCAC_SCRATCH/ncbi
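
    To confirm the new cache location took effect, you can check the toolkit's settings file (typically ~/.ncbi/user-settings.mkfg, though the location can vary between toolkit versions):

    grep root ~/.ncbi/user-settings.mkfg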
  3. Find your SRA accession numbers

    SRA accessions typically start with:

    • SRR - Individual run (most common)
    • SRP - Study/Project
    • SRX - Experiment

    You can find accessions on NCBI SRA or ENA.
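
    If you have a study accession (e.g., SRP or PRJNA) and need the individual run accessions, one option is ENA's filereport endpoint. The sketch below uses SRP123456 as a placeholder and writes one run accession per line, ready for prefetch --option-file in the next step:

    curl -s "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=SRP123456&result=read_run&fields=run_accession&format=tsv" \
      | tail -n +2 > accession_list.txt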

  4. Download the SRA file using prefetch

    For a single accession:

    prefetch SRR12345678

    For multiple accessions from a file:

    prefetch --option-file accession_list.txt

    Where accession_list.txt contains one accession per line.
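
    For example, accession_list.txt might look like this (placeholder accessions):

    SRR12345678
    SRR12345679
    SRR12345680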

  5. Convert to FASTQ using fasterq-dump

    For paired-end data:

    fasterq-dump SRR12345678 --split-files -e 8 -p

    For single-end data:

    fasterq-dump SRR12345678 -e 8 -p

    Key options:

    • --split-files - Separates paired reads into _1.fastq and _2.fastq
    • -e 8 - Use 8 threads (adjust based on your allocation)
    • -p - Show progress
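
    If you want the FASTQ files and temporary files written somewhere other than the current directory, fasterq-dump also accepts an output directory and a temp directory (the paths below are just examples):

    fasterq-dump SRR12345678 --split-files -e 8 -p \
      --outdir $RCAC_SCRATCH/fastq_files \
      --temp $RCAC_SCRATCH/tmp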
  6. Compress the FASTQ files

    FASTQ files are large. Compress them to save space:

    pigz -p 8 *.fastq
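
    If pigz is not available in your environment (on some clusters it needs its own module load), plain gzip works as a single-threaded fallback:

    gzip *.fastq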

For large downloads, submit a batch job rather than running interactively.

download_sra.sh
#!/bin/bash
#SBATCH --job-name=sra_download
#SBATCH --account=your_account
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=04:00:00
#SBATCH --output=sra_%j.out
#SBATCH --error=sra_%j.err

# Load required modules
module purge
module load biocontainers
module load sra-tools

# Set variables
SRR_ID="SRR12345678"
OUTDIR="$RCAC_SCRATCH/fastq_files"
THREADS=8

# Create output directory
mkdir -p ${OUTDIR}
cd ${OUTDIR}

# Step 1: Prefetch the SRA file
echo "Starting prefetch for ${SRR_ID}..."
prefetch ${SRR_ID}

# Step 2: Convert to FASTQ
echo "Converting to FASTQ..."
fasterq-dump ${SRR_ID} --split-files -e ${THREADS} -p

# Step 3: Compress FASTQ files
echo "Compressing FASTQ files..."
pigz -p ${THREADS} ${SRR_ID}*.fastq

# Step 4: Clean up cache
echo "Cleaning up..."
rm -rf $RCAC_SCRATCH/ncbi/sra/${SRR_ID}.sra

echo "Done! Files saved to ${OUTDIR}"
ls -lh ${SRR_ID}*.fastq.gz

Submit the job with:

sbatch download_sra.sh
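
To download many runs at once, a SLURM job array is a common pattern: each array task handles one accession from accession_list.txt. The script below is a sketch, assuming accession_list.txt (one run accession per line) sits in the directory you submit from; set --array=1-N to match the number of lines.

download_sra_array.sh
#!/bin/bash
#SBATCH --job-name=sra_array
#SBATCH --account=your_account
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=04:00:00
#SBATCH --array=1-10
#SBATCH --output=sra_%A_%a.out
#SBATCH --error=sra_%A_%a.err

# Load required modules
module purge
module load biocontainers
module load sra-tools

# Pick the accession for this array task (line N of accession_list.txt,
# read from the directory the job was submitted from)
SRR_ID=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "${SLURM_SUBMIT_DIR}/accession_list.txt")
OUTDIR="$RCAC_SCRATCH/fastq_files"
THREADS=8

# Download, convert, and compress this run
mkdir -p ${OUTDIR}
cd ${OUTDIR}
prefetch ${SRR_ID}
fasterq-dump ${SRR_ID} --split-files -e ${THREADS} -p
pigz -p ${THREADS} ${SRR_ID}*.fastq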

After downloading, verify your files are complete and uncorrupted:

  1. Check file sizes

    FASTQ files should be reasonably sized (typically 1-50 GB for most runs):

    ls -lh *.fastq.gz
  2. Count reads

    Count the number of reads in each file:

    echo $(( $(zcat SRR12345678_1.fastq.gz | wc -l) / 4 ))

    For paired-end data, both files should have the same read count.
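
    To check both mates in one go, a short loop (same placeholder accession) prints the count for each file:

    for f in SRR12345678_1.fastq.gz SRR12345678_2.fastq.gz; do
      echo "$f: $(( $(zcat "$f" | wc -l) / 4 )) reads"
    done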

  3. Check file integrity

    Verify gzip compression is intact:

    gzip -t SRR12345678_1.fastq.gz && echo "File OK" || echo "File corrupted"
  4. Inspect first few reads

    Ensure the FASTQ format looks correct:

    zcat SRR12345678_1.fastq.gz | head -12

    You should see blocks of 4 lines: header (@), sequence, separator (+), and quality scores.

  5. Run FastQC (optional)

    For comprehensive quality assessment:

    module load fastqc
    fastqc SRR12345678_1.fastq.gz SRR12345678_2.fastq.gz
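
    To keep reports organized when checking several samples at once, FastQC accepts a thread count and an output directory (the directory name here is just an example):

    mkdir -p fastqc_reports
    fastqc -t 8 -o fastqc_reports *.fastq.gz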

After successful download and conversion, you should have:

  • fastq_files/
    • SRR12345678_1.fastq.gz (forward reads)
    • SRR12345678_2.fastq.gz (reverse reads)

Or for single-end data:

  • fastq_files/
    • SRR12345678.fastq.gz
Download fails with network timeout

Try these solutions:

  • Re-run prefetch - it automatically resumes interrupted downloads
  • Download during off-peak hours
  • Check your network connection with ping www.ncbi.nlm.nih.gov
Disk quota exceeded error
  • Ensure your cache is set to scratch: vdb-config -s /repository/user/main/public/root=$RCAC_SCRATCH/ncbi
  • Clean up old cached files: rm -rf $RCAC_SCRATCH/ncbi/sra/*.sra
  • Check your quota with myquota
fasterq-dump runs out of memory
  • Request more memory in your SLURM script (--mem=32G or higher)
  • Reduce the number of threads (-e 4 instead of -e 8)
  • Use the --temp flag to point temporary files at a scratch directory (see the example below)
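
For example (the temp path is only an illustration):

mkdir -p $RCAC_SCRATCH/tmp
fasterq-dump SRR12345678 --split-files -e 4 -p --temp $RCAC_SCRATCH/tmp
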
Files are empty or truncated
  • Re-run prefetch to re-download the SRA file
  • Verify the accession number is correct
  • Check if the SRA record is still available on NCBI
How long do downloads typically take?

Download times vary based on file size and network conditions:

  • Small datasets (< 5 GB): 15-30 minutes
  • Medium datasets (5-20 GB): 1-3 hours
  • Large datasets (> 20 GB): 3+ hours

The prefetch step is typically the bottleneck as it depends on network speed.

Can I download directly without prefetch?

Technically yes, but it’s not recommended:

# Not recommended - slower and less reliable
fasterq-dump SRR12345678 --split-files -e 8

Using prefetch first is faster, more reliable, and allows resuming interrupted downloads.

How do I download data from ENA instead?

ENA (European Nucleotide Archive) often has faster downloads. Use wget or curl:

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/008/SRR12345678/SRR12345678_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/008/SRR12345678/SRR12345678_2.fastq.gz

The exact URL structure varies. Find the correct URLs on the ENA Browser.
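
One way to look up the correct URLs programmatically is ENA's filereport endpoint, which reports the FTP paths for a run. The sketch below uses the same placeholder accession; the field names follow ENA's documented report fields:

curl -s "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=SRR12345678&result=read_run&fields=run_accession,fastq_ftp&format=tsv"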

What’s the difference between split-files and split-3?
  • --split-files: Creates _1.fastq and _2.fastq for paired data
  • --split-3: Creates _1.fastq, _2.fastq, and an additional file for orphaned reads

For most analyses, --split-files is sufficient.
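
If you do want orphaned reads kept in their own file, the call is otherwise identical:

fasterq-dump SRR12345678 --split-3 -e 8 -p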