A. Genome-based quantification (STAR + featureCounts)
Last updated on 2025-11-25 | Edit this page
Estimated time: 70 minutes
Overview
Questions
- How do we map RNA-seq reads to a reference genome?
- How do we determine library strandness before alignment?
- How do we quantify reads per gene using featureCounts?
- How do we submit mapping jobs on RCAC clusters with SLURM?
Objectives
- Check RNA-seq library strandness using Salmon.
- Index a reference genome for STAR.
- Map reads to the genome using STAR.
- Quantify gene level counts with featureCounts.
- Interpret basic alignment and counting statistics.
Introduction
In this episode we move from raw FASTQ files to genome based gene
counts.
This workflow has three major steps:
- Determine library strandness: STAR does not infer strandness
automatically.
featureCountsrequires the correct setting. - Map reads to the genome using STAR: STAR is a splice aware aligner designed for RNA-seq data.
- Count mapped reads per gene using
featureCounts
This represents one of the two standard quantification strategies in
RNA-seq.
The alternative method (B. transcript based quantification with Salmon
or Kallisto) will be covered in Episode 4B.
Step 1: Checking RNA-seq library strandness
Why strandness matters
RNA-seq libraries can be:
- unstranded
- stranded forward
- stranded reverse
Stranded protocols are common in modern kits, but old GEO and SRA
datasets are often unstranded.
featureCounts must be given the correct strand setting. If the wrong
setting is used, most reads will be counted against incorrect genes.
STAR does not infer strandness automatically, so we must determine it before alignment.
Detecting strandness with Salmon
Salmon can infer strandness directly from raw FASTQ files without any
alignment.
We run Salmon with -l A, which tells it to explore all
library types.
BASH
sinteractive -A workshop -q standby -p cpu -N 1 -n 4 --time=1:00:00
cd $SCRATCH/rnaseq-workshop
mkdir -p results/strand_check
module load biocontainers
module load salmon
salmon index --transcripts data/gencode.vM38.transcripts-clean.fa \
--index data/salmon_index \
--threads 4
salmon quant --index data/salmon_index \
--libType A \
--mates1 data/SRR33253286_1.fastq.gz \
--mates2 data/SRR33253286_2.fastq.gz \
--output results/strand_check \
--threads 4
The important output is in
results/strand_check/lib_format_counts.json:
JSON
{
"read_files": "[ data/SRR33253286_1.fastq.gz, data/SRR33253286_2.fastq.gz]",
"expected_format": "IU",
"compatible_fragment_ratio": 1.0,
"num_compatible_fragments": 21088696,
"num_assigned_fragments": 21088696,
"num_frags_with_concordant_consistent_mappings": 19116702,
"num_frags_with_inconsistent_or_orphan_mappings": 2510542,
"strand_mapping_bias": 0.5117980078362889,
"MSF": 0,
"OSF": 0,
"ISF": 9783890,
"MSR": 0,
"OSR": 0,
"ISR": 9332812,
"SF": 1234245,
"SR": 1276297,
"MU": 0,
"OU": 0,
"IU": 0,
"U": 0
}
Interpreting
lib_format_counts.jsonThis JSON file reports Salmon’s automatic library type detection.
"expected_format": "IU": This is Salmon’s conclusion. “I” means Inward (correct for paired-end reads) and “U” means Unstranded."strand_mapping_bias": 0.512: This is the key evidence. A value near 0.5 (50%) indicates that reads mapped equally to both the sense and antisense strands, the definitive sign of an unstranded library."ISF"and"ISR": These are the counts for Inward-Sense-Forward (9.78M) and Inward-Sense-Reverse (9.33M) fragments. Because these values are almost equal, they confirm the ~50/50 split seen in the strand bias.
Common results:
| Code | Meaning |
|---|---|
| IU | unstranded |
| ISR | stranded reverse (paired end) |
| ISF | stranded forward (paired end) |
| SR | stranded reverse (single end) |
| SF | stranded forward (single end) |
Conclusion: The data is unstranded. We will record this and supply it to STAR/featureCounts.
Step 2: Preparing the reference genome for alignment
STAR requires a genome index. Indexing builds a searchable data structure that speeds up alignment.
Required input:
- genome FASTA file
- annotation file (GTF or GFF3)
- output directory for the index
Optionally, annotation improves splice junction detection and yields better alignments.
Why STAR?
- Splice-aware
- Extremely fast
- Produces high-quality alignments
- Well supported and widely used
- Compatible with downstream tools such as featureCounts, StringTie, and RSEM
Optional alternatives include HISAT2 and Bowtie2, but STAR remains the standard for high-depth Illumina RNA-seq.
Exercise: Read the STAR manual
Check the “Generating genome indexes” section of
the
STAR
manual
Which options are required for indexing?
Key options:
-
--runMode genomeGenerate -
--genomeDirdirectory to store the index -
--genomeFastaFilesthe FASTA file -
--sjdbGTFfileannotation (recommended) -
--runThreadNnumber of threads -
--sjdbOverhangread length minus one (100 is a safe default)
Indexing the Genome with STAR
Below is a minimal SLURM job script for building the STAR genome
index. Create a file named index_genome.sh in
$SCRATCH/rcac_rnaseq/scripts with this content:
BASH
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --account=workshop
#SBATCH --partition=cpu
#SBATCH --time=1:00:00
#SBATCH --job-name=star_index
#SBATCH --output=cluster-%x.%j.out
#SBATCH --error=cluster-%x.%j.err
module load biocontainers
module load star
data_dir="${SCRATCH}/rnaseq-workshop/data"
genome="${data_dir}/GRCm39.primary_assembly.genome.fa"
gtf="${data_dir}/gencode.vM38.primary_assembly.basic.annotation.gtf"
indexdir="${data_dir}/star_index"
STAR \
--runMode genomeGenerate \
--runThreadN ${SLURM_CPUS_ON_NODE} \
--genomeDir ${indexdir} \
--genomeFastaFiles ${genome} \
--sjdbGTFfile ${gtf}
Submit the job:
Indexing requires about 30 to 45 minutes.
The index directory is now ready for use in mapping and should have
contents like this: Location: $SCRATCH/rcac_rnaseq/data
Contents of star_index/:
BASH
star_index/
├── chrLength.txt
├── chrNameLength.txt
├── chrName.txt
├── chrStart.txt
├── exonGeTrInfo.tab
├── exonInfo.tab
├── geneInfo.tab
├── Genome
├── genomeParameters.txt
├── Log.out
├── SA
├── SAindex
├── sjdbInfo.txt
├── sjdbList.fromGTF.out.tab
├── sjdbList.out.tab
└── transcriptInfo.tab
Do I need the GTF for indexing?
No, but it is strongly recommended.
Including it improves splice junction detection and increases mapping
accuracy.
Step 3: Mapping reads with STAR
Once the genome index is ready, we can align reads.
For aligning reads, important STAR options are:
--runThreadN--genomeDir--readFilesIn-
--readFilesCommand zcatfor gzipped FASTQ files --outSAMtype BAM SortedByCoordinate--outSAMunmapped Within
STAR’s defaults are tuned for mammalian genomes. If working with plants, fungi, or repetitive genomes, additional tuning may be required.
Exercise: Write a STAR alignment command
Write a basic STAR alignment command using your genome index and paired FASTQ files.
Mapping many FASTQ files with a SLURM array
When having multiple samples, array jobs simplify submission.
First, we will create samples.txt file with just the
sample names
BASH
cd $SCRATCH/rnaseq-workshop/data
ls *_1.fastq.gz | sed 's/_1.fastq.gz//' > ${SCRATCH}/rnaseq-workshop/scripts/samples.txt
BASH
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --account=workshop
#SBATCH --partition=cpu
#SBATCH --time=8:00:00
#SBATCH --job-name=read_mapping
#SBATCH --array=1-6
#SBATCH --output=cluster-%x.%j.out
#SBATCH --error=cluster-%x.%j.err
module load biocontainers
module load star
FASTQ_DIR="$SCRATCH/rnaseq-workshop/data"
GENOME_INDEX="$SCRATCH/rnaseq-workshop/data/star_index"
OUTDIR="$SCRATCH/rnaseq-workshop/results/mapping"
mkdir -p $OUTDIR
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
R1=${FASTQ_DIR}/${SAMPLE}_1.fastq.gz
R2=${FASTQ_DIR}/${SAMPLE}_2.fastq.gz
STAR \
--runThreadN ${SLURM_CPUS_ON_NODE} \
--genomeDir $GENOME_INDEX \
--readFilesIn $R1 $R2 \
--readFilesCommand zcat \
--outFileNamePrefix $OUTDIR/$SAMPLE \
--outSAMunmapped Within \
--outSAMtype BAM SortedByCoordinate
Submit:
Why use an array job?
Think about this:
- We have six samples, each with two FASTQ files
- Having six separate slurm jobs means:
- Copy-pasting the same command six times
- Higher chance of typos or mistakes
- Copy-pasting the same command six times
- if mapping with single slurm file, sequential execution = slower
- Samples can run independently
Why is a job array a good choice?
Arrays allow us to run identical commands across many inputs:
- All samples are processed the same way
- Runs in parallel
- Simplifies submission
- Reduces copy-paste errors
- Minimizes runtime cost on the cluster
Step 4: Assessing alignment quality
STAR produces several useful files, particularly:
Log.final.out
Contains mapping statistics, including % uniquely mapped.Aligned.sortedByCoord.out.bam
Ready for downstream quantification.
To summarize alignment statistics across samples, we use MultiQC:
BASH
cd $SCRATCH/rnaseq-workshop
module load biocontainers
module load multiqc
multiqc results/mapping -o results/qc_alignment
Key metrics to review
-
Uniquely mapped reads — typically 60–90% is
good
-
Multi-mapped reads — depends on genome, but >20%
is concerning
-
Unmapped reads — check reasons (mismatches? too
short?)
- Mismatch rate — high values indicate quality or contamination issues
Exercise: Examine your alignment metrics
Using your MultiQC alignment report:
- Which sample has the highest uniquely mapped percentage?
- Are any samples clear outliers?
- Does mapped/unmapped reads appear consistent across samples?
Example interpretation for this dataset:
- All samples have ~80% uniquely mapped reads.
- No sample is an outlier.
- mapped/unmapped reads consistent across samples
Step 5: Quantifying gene counts with featureCounts
Once the reads have been aligned and sorted, we can quantify how many
read pairs map to each gene. This gives us gene-level counts that we
will later use for differential expression. For alignment-based RNA-seq
workflows, featureCounts is the recommended tool because it
is fast, robust, and compatible with most gene annotation formats.
featureCounts uses two inputs:
- A set of aligned BAM files (sorted by coordinate)
- A GTF annotation file describing gene models
It assigns each read (or read pair) to a gene based on the exon boundaries.
Important: stranded or unstranded?
featureCounts needs to know whether your dataset is
stranded.
-
-s 0unstranded -
-s 1stranded forward -
-s 2stranded reverse
From our Salmon strandness check, this dataset behaves as
unstranded, so we will use -s 0.
Running featureCounts
We will create a new directory for count output:
Below is an example SLURM script to count all BAM files at once.
BASH
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --account=workshop
#SBATCH --partition=cpu
#SBATCH --time=1:00:00
#SBATCH --job-name=featurecounts
#SBATCH --output=cluster-%x.%j.out
#SBATCH --error=cluster-%x.%j.err
module load biocontainers
module load subread
GTF="$SCRATCH/rnaseq-workshop/data/gencode.vM38.annotation.gtf"
BAMS="$SCRATCH/rnaseq-workshop/results/mapping/*.bam"
output_dir="$SCRATCH/rnaseq-workshop/results/counts"
featureCounts \
-T ${SLURM_CPUS_PER_TASK} \
-p \
-s 0 \
-t exon \
-g gene_id \
-a ${GTF} \
-o ${output_dir}/gene_counts.txt \
${BAMS}
Submit the job:
gene_counts.txt will contain:
- one row per gene
- one column per sample
Step 6: QC on counts
Before proceeding to differential expression, it is good practice to perform some QC on the count data. We can again use MultiQC to summarize featureCounts statistics:
BASH
cd $SCRATCH/rnaseq-workshop
mkdir -p results/qc_counts
module load biocontainers
module load multiqc
multiqc results/counts -o results/qc_counts
Exercise: Examine your counting metrics
Using your MultiQC featureCounts report:
- Which sample has the lowest percent assigned, and what are common reasons for lower assignment?
- Which unassigned category is the largest, and what does it indicate?
- Does the sample with the most assigned reads also have the highest percent assigned?
- What does “Unassigned: No Features” mean, and why might it occur?
- Are multi-mapped reads a concern for RNA-seq differential expression?
Example interpretation for this dataset:
- SRR33253289 has the lowest assignment rate. Causes include low complexity, incomplete annotation, or more intronic reads.
- Unmapped or No Features dominate. These reflect reads that do not align or do not overlap annotated exons.
- No. Total assigned reads depend on depth, while percent assigned reflects library quality.
- Reads mapped outside annotated exons, often from intronic regions, unannotated transcripts, or incomplete annotation.
- Multi-mapping is expected for paralogs and repeats. It is usually fine as long as the rate is consistent across samples.
Summary
- STAR is a fast splice-aware aligner widely used for RNA-seq.
- Genome indexing requires FASTA and optionally GTF for improved splice detection.
- Mapping is efficiently performed using SLURM array jobs.
- Alignment statistics (from
Log.final.out+ MultiQC) must be reviewed. - Strandness should be inferred using aligned BAM files.
- Final BAM files are ready for downstream counting.
-
featureCountsis used to obtain gene-level counts for differential expression. - Exons are counted and summed per gene.
- The output is a simple matrix for downstream statistical analysis.