Downloading SRA Data
The NCBI Sequence Read Archive (SRA) is the largest publicly available repository of high-throughput sequencing data. This guide walks you through downloading FASTQ files from SRA on RCAC clusters using the SRA Toolkit.
Quick Start
If you’re familiar with HPC (Negishi/Gautschi/Bell) and just need the commands:
    module load biocontainers
    module load sra-tools
    prefetch SRR12345678
    fasterq-dump SRR12345678 --split-files -e 8 -p
For detailed instructions and best practices, continue reading below.
Understanding SRA Downloads
Before downloading, it’s important to understand the two-step workflow:
- prefetch - Downloads the compressed .sra file from NCBI to your local cache
- fasterq-dump - Converts the .sra file to FASTQ format
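As a rough illustration of how the two steps chain together, the sketch below wraps them in a single function that stops if the download fails. The function name `download_run` and its return codes are assumptions made for this example; only `prefetch` and `fasterq-dump` come from the SRA Toolkit.

```bash
#!/bin/bash
# Sketch: run the two SRA steps in order and stop early on failure.
# download_run is a hypothetical helper name, not part of SRA Toolkit.
download_run() {
    local acc="$1"
    # Step 1: fetch the compressed .sra file into the local cache
    prefetch "$acc" || { echo "prefetch failed for $acc" >&2; return 1; }
    # Step 2: convert the cached .sra file to FASTQ
    fasterq-dump "$acc" --split-files -e 8 -p || {
        echo "fasterq-dump failed for $acc" >&2
        return 2
    }
}
```

With the module loaded, it could be called as `download_run SRR12345678`. Checking the exit status of step 1 avoids starting a conversion when no `.sra` file was downloaded.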
Step-by-Step Guide
1. Load the SRA Toolkit module
    module load biocontainers
    module load sra-tools
Verify the installation:
    prefetch --version
    fasterq-dump --version
2. Configure your cache directory (first time only)
By default, SRA Toolkit caches files in your home directory, which has limited space. Configure it to use scratch space instead:
    mkdir -p $RCAC_SCRATCH/ncbi
    vdb-config --prefetch-to-user-repo
    vdb-config -s /repository/user/main/public/root=$RCAC_SCRATCH/ncbi
3. Find your SRA accession numbers
SRA accessions typically start with:
- SRR - Individual run (most common)
- SRP - Study/Project
- SRX - Experiment
4. Download the SRA file using prefetch
For a single accession:
    prefetch SRR12345678
For multiple accessions from a file:
    prefetch --option-file accession_list.txt
Where accession_list.txt contains one accession per line.
5. Convert to FASTQ using fasterq-dump
For paired-end data:
    fasterq-dump SRR12345678 --split-files -e 8 -p
For single-end data:
    fasterq-dump SRR12345678 -e 8 -p
Key options:
- --split-files: Separates paired reads into _1.fastq and _2.fastq
- -e 8: Use 8 threads (adjust based on your allocation)
- -p: Show progress
6. Compress the FASTQ files
FASTQ files are large. Compress them to save space:
    pigz -p 8 *.fastq
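The commands in this guide use SRR run accessions, so it can help to screen an accession list before starting a long download. The sketch below labels accessions by the three prefixes listed in step 3 and prints only the runs. The helper names `classify_accession` and `filter_runs` are hypothetical, and the patterns are an assumption that covers only the SRR, SRP, and SRX prefixes mentioned here.

```bash
#!/bin/bash
# Sketch: label an accession by its prefix (SRR, SRP, SRX as listed in step 3).
# classify_accession is a hypothetical helper, not an SRA Toolkit command.
classify_accession() {
    case "$1" in
        SRR[0-9]*) echo "run" ;;
        SRP[0-9]*) echo "study" ;;
        SRX[0-9]*) echo "experiment" ;;
        *)         echo "unknown" ;;
    esac
}

# Print only the run accessions from a list file,
# skipping blank lines and lines that start with '#'.
filter_runs() {
    local line
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in ''|'#'*) continue ;; esac
        if [ "$(classify_accession "$line")" = "run" ]; then
            echo "$line"
        fi
    done < "$1"
}
```

For example, `filter_runs accession_list.txt > runs_only.txt` would write a cleaned list that can be passed to the prefetch step.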
SLURM Batch Script
For large downloads, submit a batch job rather than running interactively.
Single accession:
    #!/bin/bash
    #SBATCH --job-name=sra_download
    #SBATCH --account=your_account
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8
    #SBATCH --time=04:00:00
    #SBATCH --output=sra_%j.out
    #SBATCH --error=sra_%j.err

    # Load required modules
    module purge
    module load biocontainers
    module load sra-tools

    # Set variables
    SRR_ID="SRR12345678"
    OUTDIR="$RCAC_SCRATCH/fastq_files"
    THREADS=8

    # Create output directory
    mkdir -p ${OUTDIR}
    cd ${OUTDIR}

    # Step 1: Prefetch the SRA file
    echo "Starting prefetch for ${SRR_ID}..."
    prefetch ${SRR_ID}

    # Step 2: Convert to FASTQ
    echo "Converting to FASTQ..."
    fasterq-dump ${SRR_ID} --split-files -e ${THREADS} -p

    # Step 3: Compress FASTQ files
    echo "Compressing FASTQ files..."
    pigz -p ${THREADS} ${SRR_ID}*.fastq

    # Step 4: Clean up cache
    echo "Cleaning up..."
    rm -rf $RCAC_SCRATCH/ncbi/sra/${SRR_ID}.sra

    echo "Done! Files saved to ${OUTDIR}"
    ls -lh ${SRR_ID}*.fastq.gz
Multiple accessions, processed one after another:
    #!/bin/bash
    #SBATCH --job-name=sra_batch
    #SBATCH --account=your_account
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8
    #SBATCH --time=24:00:00
    #SBATCH --output=sra_batch_%j.out
    #SBATCH --error=sra_batch_%j.err

    # Load required modules
    module purge
    module load biocontainers
    module load sra-tools

    # Set variables
    # Absolute path to the list, because the script changes into OUTDIR before reading it
    ACCESSION_FILE="$PWD/accession_list.txt"
    OUTDIR="$RCAC_SCRATCH/fastq_files"
    THREADS=8

    # Create output directory
    mkdir -p ${OUTDIR}
    cd ${OUTDIR}

    # Process each accession
    while read -r SRR_ID; do
        # Skip empty lines and comments
        [[ -z "$SRR_ID" || "$SRR_ID" =~ ^# ]] && continue

        echo "=========================================="
        echo "Processing: ${SRR_ID}"
        echo "=========================================="

        # Prefetch
        prefetch ${SRR_ID}

        # Convert to FASTQ
        fasterq-dump ${SRR_ID} --split-files -e ${THREADS} -p

        # Compress
        pigz -p ${THREADS} ${SRR_ID}*.fastq

        # Clean up cache
        rm -rf $RCAC_SCRATCH/ncbi/sra/${SRR_ID}.sra

        echo "Completed: ${SRR_ID}"

    done < ${ACCESSION_FILE}

    echo "All downloads complete!"
    ls -lh *.fastq.gz
Multiple accessions, as a job array:
    #!/bin/bash
    #SBATCH --job-name=sra_array
    #SBATCH --account=your_account
    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8
    #SBATCH --time=04:00:00
    #SBATCH --array=1-10%5
    #SBATCH --output=sra_%A_%a.out
    #SBATCH --error=sra_%A_%a.err
    # --array=1-10%5 runs tasks 1-10 with at most 5 at a time;
    # match the range to the number of lines in accession_list.txt

    # Load required modules
    module purge
    module load biocontainers
    module load sra-tools

    # Set variables
    ACCESSION_FILE="accession_list.txt"
    OUTDIR="$RCAC_SCRATCH/fastq_files"
    THREADS=8

    # Get the SRR ID for this array task (line N of the list for task N)
    SRR_ID=$(sed -n "${SLURM_ARRAY_TASK_ID}p" ${ACCESSION_FILE})

    # Create output directory
    mkdir -p ${OUTDIR}
    cd ${OUTDIR}

    echo "Array task ${SLURM_ARRAY_TASK_ID}: Processing ${SRR_ID}"

    # Prefetch
    prefetch ${SRR_ID}

    # Convert to FASTQ
    fasterq-dump ${SRR_ID} --split-files -e ${THREADS} -p

    # Compress
    pigz -p ${THREADS} ${SRR_ID}*.fastq

    # Clean up cache
    rm -rf $RCAC_SCRATCH/ncbi/sra/${SRR_ID}.sra

    echo "Completed: ${SRR_ID}"
    ls -lh ${SRR_ID}*.fastq.gz
Submit the job with:
    sbatch download_sra.sh
Verification Steps
After downloading, verify your files are complete and uncorrupted:
1. Check file sizes
FASTQ files should be reasonably sized (typically 1-50 GB for most runs):
    ls -lh *.fastq.gz
2. Count reads
Count the number of reads in each file (each read occupies four lines):
    echo $(( $(zcat SRR12345678_1.fastq.gz | wc -l) / 4 ))
For paired-end data, both files should have the same read count.
3. Check file integrity
Verify gzip compression is intact:
    gzip -t SRR12345678_1.fastq.gz && echo "File OK" || echo "File corrupted"
4. Inspect first few reads
Ensure the FASTQ format looks correct:
    zcat SRR12345678_1.fastq.gz | head -12
You should see blocks of 4 lines: header (@), sequence, separator (+), and quality scores.
5. Run FastQC (optional)
For comprehensive quality assessment:
    module load fastqc
    fastqc SRR12345678_1.fastq.gz SRR12345678_2.fastq.gz
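The read-count comparison for paired-end data can be scripted. The following is a minimal sketch, assuming the `_1.fastq.gz` / `_2.fastq.gz` naming used in this guide; the helper names `count_reads` and `check_pair` are hypothetical, and `gzip -dc` is used in place of `zcat` only for portability.

```bash
#!/bin/bash
# Sketch: count reads in a gzipped FASTQ file (4 lines per read).
# count_reads and check_pair are hypothetical helper names.
count_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Compare the read counts of the two mate files for one accession prefix.
check_pair() {
    local n1 n2
    n1=$(count_reads "${1}_1.fastq.gz")
    n2=$(count_reads "${1}_2.fastq.gz")
    if [ "$n1" -eq "$n2" ]; then
        echo "OK: $1 has $n1 read pairs"
    else
        echo "MISMATCH: $1 has $n1 vs $n2 reads" >&2
        return 1
    fi
}
```

Running `check_pair SRR12345678` in the output directory reports either the shared read count or a mismatch. A mismatch usually means one of the files is incomplete and the conversion should be repeated.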
Expected Output
After a successful download and conversion, you should have:
fastq_files/
- SRR12345678_1.fastq.gz (forward reads)
- SRR12345678_2.fastq.gz (reverse reads)
Or for single-end data:
fastq_files/
- SRR12345678.fastq.gz
Troubleshooting
Download fails with network timeout
Try these solutions:
- Re-run prefetch: it automatically resumes interrupted downloads
- Download during off-peak hours
- Check your network connection with ping www.ncbi.nlm.nih.gov
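Because re-running prefetch picks up an interrupted download, a small retry wrapper is one way to ride out transient timeouts in a batch job. This is a sketch only: `retry` is a hypothetical helper, and the attempt count and delay in the usage example are arbitrary choices rather than recommended values.

```bash
#!/bin/bash
# Sketch: retry a command a few times with a pause between attempts.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS command [args...]
# retry is a hypothetical helper, not part of SRA Toolkit or SLURM.
retry() {
    local max="$1" delay="$2" attempt=1
    shift 2
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "Giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "Attempt $attempt failed; retrying in ${delay}s..." >&2
        attempt=$(( attempt + 1 ))
        sleep "$delay"
    done
}
```

For instance, `retry 3 60 prefetch SRR12345678` would try the download up to three times, waiting 60 seconds between attempts.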
Disk quota exceeded error
- Ensure your cache is set to scratch: vdb-config -s /repository/user/main/public/root=$RCAC_SCRATCH/ncbi
- Clean up old cached files: rm -rf $RCAC_SCRATCH/ncbi/sra/*.sra
- Check your quota with myquota
fasterq-dump runs out of memory
- Request more memory in your SLURM script (--mem=32G or higher)
- Reduce the number of threads (-e 4 instead of -e 8)
- Use the --temp flag to specify a temp directory on scratch
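The last two suggestions can be combined in one call. The sketch below assumes a temp directory named `fasterq_tmp` under `$RCAC_SCRATCH` (falling back to `/tmp` if that variable is unset); both the directory name and the helper name `convert_low_mem` are made up for this example.

```bash
#!/bin/bash
# Sketch: run fasterq-dump with fewer threads and a temp directory on scratch.
# convert_low_mem and the fasterq_tmp directory name are hypothetical.
convert_low_mem() {
    local acc="$1"
    local tmpdir="${RCAC_SCRATCH:-/tmp}/fasterq_tmp"
    mkdir -p "$tmpdir"
    # -e 4 lowers the thread count; --temp redirects temporary files
    fasterq-dump "$acc" --split-files -e 4 --temp "$tmpdir" -p
}
```

Calling `convert_low_mem SRR12345678` then replaces the plain fasterq-dump command in the conversion step.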
Files are empty or truncated
- Re-run prefetch to re-download the SRA file
- Verify the accession number is correct
- Check if the SRA record is still available on NCBI
How long do downloads typically take?
Download times vary based on file size and network conditions:
- Small datasets (< 5 GB): 15-30 minutes
- Medium datasets (5-20 GB): 1-3 hours
- Large datasets (> 20 GB): 3+ hours
The prefetch step is typically the bottleneck as it depends on network speed.
Can I download directly without prefetch?
Technically yes, but it’s not recommended:
    # Not recommended - slower and less reliable
    fasterq-dump SRR12345678 --split-files -e 8
Using prefetch first is faster, more reliable, and allows resuming interrupted downloads.
How do I download data from ENA instead?
ENA (European Nucleotide Archive) often has faster downloads. Use wget or curl:
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/078/SRR12345678/SRR12345678_1.fastq.gz
    wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR123/078/SRR12345678/SRR12345678_2.fastq.gz
The directory layout depends on the accession, so look up the exact URLs in the ENA Browser.
What’s the difference between split-files and split-3?
- --split-files: Creates _1.fastq and _2.fastq for paired data
- --split-3: Creates _1.fastq, _2.fastq, and an additional file for orphaned reads
For most analyses, --split-files is sufficient.
Additional Resources
- NCBI SRA Toolkit Documentation
- SRA Run Selector - Find and download accession lists
- ENA Browser - Alternative download source