Content from Introduction to RNA-seq
Overview
Questions
- What are the different choices to consider when planning an RNA-seq experiment?
- How does one process the raw fastq files to generate a table with read counts per gene and sample?
- Where does one find information about annotated genes for a given organism?
- What are the typical steps in an RNA-seq analysis?
Objectives
- Explain what RNA-seq is.
- Describe some of the most common design choices that have to be made before running an RNA-seq experiment.
- Provide an overview of the procedure to go from the raw data to the read count matrix that will be used for downstream analysis.
- Show some common types of results and visualizations generated in RNA-seq analyses.
What are we measuring in an RNA-seq experiment?

In order to produce a mature RNA molecule, a stretch of DNA is first transcribed into a precursor mRNA (pre-mRNA). Subsequently, intronic regions are spliced out, and exonic regions are combined into different isoforms of a gene.

(figure adapted from Martin & Wang (2011)).
In a typical RNA-seq experiment, RNA molecules are first collected from a sample of interest. After a potential enrichment for molecules with polyA tails (predominantly mRNA), or depletion of otherwise highly abundant ribosomal RNA, the remaining molecules are fragmented into smaller pieces (there are also long-read protocols that sequence entire molecules, but those are not the focus of this lesson). It is important to keep in mind that because splicing removes intronic sequences, an RNA molecule (and hence a generated fragment) may not correspond to an uninterrupted region of the genome.
The RNA fragments are then reverse transcribed into cDNA, after which sequencing adapters are added to each end. These adapters allow the fragments to attach to the flow cell. Once attached, each fragment is heavily amplified to generate a cluster of identical sequences on the flow cell. The sequencer then determines the sequence of the first 50-200 nucleotides of the cDNA fragments in each such cluster, starting from one end; this sequence constitutes a read. Many data sets are generated with so-called paired-end protocols, in which the fragments are read from both ends.
Millions of such reads (or pairs of reads) are generated in an experiment, and these are stored in a (pair of) FASTQ files. Each read is represented by four consecutive lines in such a file: first a line with a unique read identifier, next the inferred sequence of the read, then another identifier line, and finally a line containing the base quality for each inferred nucleotide, representing the probability that the base in the corresponding position has been correctly identified.
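To make the FASTQ format concrete, here is a sketch of how to inspect the first read record of a (hypothetical) gzipped FASTQ file; the record shown in the comments is made up for illustration:
BASH
# print the first record (4 lines) of a compressed FASTQ file
zcat sample_1.fastq.gz | head -4
# @READ_IDENTIFIER                <- line 1: unique read identifier
# GATTTGGGGTTCAAAGCAGTATCGATCAAAT <- line 2: inferred read sequence
# +                               <- line 3: second identifier line (often just '+')
# !''*((((***+))%%%++)(%%%%).1*** <- line 4: ASCII-encoded per-base quality scores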
Challenge: Discuss the following points with your neighbor
- What are potential advantages and disadvantages of paired-end protocols compared to only sequencing one end of a fragment?
- What quality assessment can you think of that would be useful to perform on the FASTQ files with read sequences?
Experimental design considerations
Before starting to collect data, it is essential to take some time to think about the design of the experiment. Experimental design concerns the organization of experiments with the purpose of making sure that the right type of data, and enough of it, is available to answer the questions of interest as efficiently as possible. Aspects such as which conditions or sample groups to consider, how many replicates to collect, and how to plan the data collection in practice are important questions to consider. Many high-throughput biological experiments (including RNA-seq) are sensitive to ambient conditions, and it is often difficult to directly compare measurements that have been done on different days, by different analysts, in different centers, or using different batches of reagents. For this reason, it is very important to design experiments properly, to make it possible to disentangle different types of (primary and secondary) effects.

(figure from Lazic (2017)).
Challenge: Discuss with your neighbor
- Why is it essential to have replicates?
Importantly, not all replicates are equally useful from a statistical point of view. One common way to classify the different types of replicates is as ‘biological’ and ‘technical’, where the latter are typically used to test the reproducibility of the measurement device, while biological replicates inform about the variability between different samples from a population of interest. Another scheme classifies replicates (or units) into ‘biological’, ‘experimental’ and ‘observational’. Here, biological units are the entities we want to make inferences about (e.g., animals, persons). Replication of biological units is required to make a general statement about the effect of a treatment - we cannot draw a conclusion about the effect of a drug on a population of mice by studying a single mouse only. Experimental units are the smallest entities that can be independently assigned to a treatment (e.g., animal, litter, cage, well). Only replication of experimental units constitutes true replication. Observational units, finally, are the entities at which measurements are made.
To explore the impact of experimental design on the ability to answer questions of interest, we are going to use an interactive application, provided in the ConfoundingExplorer package.
Challenge
Launch the ConfoundingExplorer application and familiarize yourself with the interface.
Challenge
- For a balanced design (equal distribution of replicates between the two groups in each batch), what is the effect of increasing the strength of the batch effect? Does it matter whether one adjusts for the batch effect or not?
- For an increasingly unbalanced design (most or all replicates of one group coming from one batch), what is the effect of increasing the strength of the batch effect? Does it matter whether one adjusts for the batch effect or not?
RNA-seq quantification: from reads to count matrix

The read sequences contained in the FASTQ files from the sequencer are typically not directly useful as they are, since we do not know which gene or transcript they originate from. Thus, the first processing step is to attempt to identify the location of origin for each read, and to use this to obtain an estimate of the number of reads originating from a gene (or another feature, such as an individual transcript). This count can then be used as a proxy for the abundance, or expression level, of the gene. There is a plethora of RNA-seq quantification pipelines; the most common approaches can be categorized into three main types:
1. Align reads to the genome, and count the number of reads that map within the exons of each gene. This is one of the simplest methods. For species for which the transcriptome is poorly annotated, this would be the preferred approach. Example: STAR alignment to GRCm39 + Rsubread featureCounts.
2. Align reads to the transcriptome, quantify transcript expression, and summarize transcript expression into gene expression. This approach can produce accurate quantification results (independent benchmarking), particularly for high-quality samples without DNA contamination. Example: RSEM quantification using rsem-calculate-expression --star on the GENCODE GRCh38 transcriptome + tximport.
3. Pseudoalign reads against the transcriptome, using the corresponding genome as a decoy, quantifying transcript expression in the process, and summarize the transcript-level expression into gene-level expression. The advantages of this approach include computational efficiency, mitigation of the effect of DNA contamination, and GC bias correction. Example: salmon quant --gcBias + tximport.
At typical sequencing read depths, gene expression quantification is often more accurate than transcript expression quantification. However, differential gene expression analyses can be improved by also having access to transcript-level quantifications.
Other tools used in RNA-seq quantification include TopHat2, Bowtie2, kallisto, and HTSeq, among many others.
The choice of an appropriate RNA-seq quantification approach depends on the quality of the transcriptome annotation, the quality of the RNA-seq library preparation, and the presence of contaminating sequences, among many other factors. Often, it can be informative to compare the quantification results of multiple approaches.
The quantification method is species- and experiment-dependent and often requires large amounts of computing resources, so we will give you an overview of how to perform quantification, along with scripts to run it. We will use pre-computed quantifications for the exercises in this workshop.
Challenge: Discuss the following points with your neighbor
- Which of the mentioned RNA-Seq quantification tools have you heard about? Do you know other pros and cons of the methods?
- Have you done your own RNA-Seq experiment? If so, what quantification tool did you use and why did you choose it?
- Do you have access to specific tools / local bioinformatics expert / computational resources for quantification? If you don’t, how might you gain access?
Finding the reference sequences
In order to quantify abundances of known genes or transcripts from RNA-seq data, we need a reference database informing us of the sequence of these features, to which we can then compare our reads. This information can be obtained from different online repositories. It is highly recommended to choose one of these for a particular project, and not mix information from different sources. Depending on the quantification tool you have chosen, you will need different types of reference information. If you are aligning your reads to the genome and investigating the overlap with known annotated features, you will need the full genome sequence (provided in a fasta file) and a file that tells you the genomic location of each annotated feature (typically provided in a gtf file). If you are mapping your reads to the transcriptome, you will instead need a file with the sequence of each transcript (again, a fasta file).
- If you are working with mouse or human samples, the GENCODE project provides well-curated reference files.
- Ensembl provides reference files for a large set of organisms, including plants and fungi.
- UCSC also provides reference files for many organisms.
Challenge
Download the latest mouse transcriptome fasta file from GENCODE. What
do the entries look like? Tip: to read the file into R, consider the
readDNAStringSet() function from the
Biostrings package.
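A possible solution (a minimal sketch; the file name refers to the GENCODE mouse release current at the time of writing and may differ for you):
R
# read the transcriptome FASTA into R; readDNAStringSet() handles gzipped files directly
library(Biostrings)
tx <- readDNAStringSet("gencode.vM38.transcripts.fa.gz")
head(names(tx))  # entry names are '|'-separated: transcript ID, gene ID, symbols, length, biotype
tx[1]            # sequence of the first transcript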
Where are we heading in this workshop?
In this workshop, we will discuss and practice: downloading publicly available RNA-seq data, performing quality assessment of the raw data, quantifying gene expression, and performing differential expression analysis to identify genes that are expressed at different levels between two or more conditions. We will also cover some aspects of data visualization and interpretation of the results.
The outcome of a differential expression analysis is often summarized graphically, for example with MA plots and heatmaps (see below for examples).


In the following episodes we will learn, among other things, how to generate and interpret these plots. It is also common to perform follow-up analyses to investigate whether there is a functional relationship among the top-ranked genes, so called gene set (enrichment) analysis, which will also be covered in a later episode.
- RNA-seq is a technique for measuring the amount of RNA expressed in a cell or tissue, in a given state and at a given time.
- Many choices have to be made when planning an RNA-seq experiment, such as whether to perform poly-A selection or ribosomal depletion, whether to apply a stranded or an unstranded protocol, and whether to sequence the reads in a single-end or paired-end fashion. Each of these choices has consequences for the processing and interpretation of the data.
- Many approaches exist for quantification of RNA-seq data. Some methods align reads to the genome and count the number of reads overlapping gene loci. Other methods map reads to the transcriptome and use a probabilistic approach to estimate the abundance of each gene or transcript.
- Information about annotated genes can be accessed via several sources, including Ensembl, UCSC and GENCODE.
Content from Downloading and organizing files
Overview
Questions
- What files are required to process raw RNA seq reads?
- Where do we obtain raw reads, reference genomes, annotations, and transcript sequences?
- How should project directories be organized to support a smooth workflow?
- What practical considerations matter when downloading and preparing RNA seq data?
Objectives
- Identify the essential inputs for RNA seq analysis: FASTQ files, reference genome, annotation, and transcript sequences.
- Learn where and how to download RNA seq datasets and reference materials.
- Create a clean project directory structure suitable for quality control, alignment, and quantification.
- Understand practical issues such as compressed FASTQ files and adapter trimming.
Introduction
Before we can perform quality control, mapping, or quantification, we must first gather the files required for RNA seq analysis and organize them in a reproducible way. This episode introduces the essential inputs of an RNA seq workflow and demonstrates how to obtain and structure them on a computing system.
We begin by discussing raw FASTQ reads, reference genomes, annotation files, and transcript sequences. We then download the dataset used throughout the workshop and set up the directory structure that supports downstream steps.
Most FASTQ files arrive compressed as .fastq.gz files.
Modern RNA seq tools accept compressed files directly, so uncompressing
them is usually unnecessary.
Should I uncompress FASTQ files?
In almost all RNA seq workflows, you should keep FASTQ files
compressed. Tools such as FastQC, HISAT2, STAR, salmon, and fastp can
read .gz files directly. Keeping files compressed saves
space and reduces input and output overhead.
Many library preparation protocols introduce adapter sequences at the ends of reads. For most alignment based analyses, explicit trimming is not required because aligners soft clip adapter sequences automatically.
Should I trim adapters for RNA seq analysis?
In most cases, no. Aligners such as HISAT2 and STAR soft clip adapter
sequences. Overly aggressive trimming can reduce mapping rates or
distort read length distributions. Trimming is needed only for specific
applications such as transcript assembly where uniform read lengths are
important.
More information: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4766705/
Downloading the data and project organization
For this workshop, we use a publicly available dataset that investigates the effects of cardiomyocyte specific Nat10 knockout on heart development in mice (BioProject PRJNA1254208, GEO GSE295332). The study includes three biological replicates per group, sequenced on an Illumina NovaSeq X Plus in paired end mode.
Before downloading files, we create a reproducible directory layout for raw data, scripts, mapping outputs, and count files.
Create the directories needed for this episode
Create a working directory for the workshop (e.g., rnaseq-workshop). Inside it, create three subdirectories:
- data: raw FASTQ files, reference genome, annotation
- scripts: custom scripts, SLURM job files
- results: alignment, counts and various other outputs
The resulting directory structure should look like this:
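A sketch of the commands (assuming the workshop root lives in $SCRATCH, as in later episodes):
BASH
cd $SCRATCH
# create the workshop root and its three subdirectories in one step
mkdir -p rnaseq-workshop/{data,scripts,results}
# rnaseq-workshop/
# ├── data
# ├── results
# └── scripts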
What files do I need for RNA-seq analysis?
You will need:
- Raw reads (FASTQ): usually .fastq.gz files containing nucleotide sequences and per-base quality scores.
For alignment-based workflows
- Reference genome (FASTA): contains chromosome or contig sequences.
- Annotation file (GTF/GFF): describes gene models (exons, transcripts, coordinates).
For transcript-level quantification workflows
- Transcript sequences (FASTA): required for transcript-based quantification with tools such as salmon, kallisto, and RSEM.
Reference files must be consistent with each other (same genome version). To ensure a smooth analysis workflow, keep these files logically organized. Good directory structure minimizes mistakes, simplifies scripting, and supports reproducibility.
Downloading the data
We now download:
- the GRCm39 reference genome
- the corresponding GENCODE annotation
- transcript sequences (if using transcript based quantification)
- raw FASTQ files from SRA
Reference genome and annotation
We will use GENCODE GRCm39 (M38) as the reference.
Annotation
Visit the GENCODE mouse page: https://www.gencodegenes.org/mouse/.
Which annotation file (GTF/GFF3 files section) should you download, and
why?
Use the basic gene annotation in GTF format (e.g.,
gencode.vM38.primary_assembly.basic.annotation.gtf.gz).
It contains essential gene and transcript models without pseudogenes or
other biotypes not typically used for RNA-seq quantification.

Reference genome
Select the corresponding Genome sequence (GRCm39)
FASTA (e.g.,
GRCm39.primary_assembly.genome.fa.gz).

Transcript sequences
Transcript level quantification requires a transcript FASTA file
consistent with the annotation. GENCODE provides transcript sequences
for all transcripts in GRCm39
(gencode.vM38.transcripts.fa.gz).
Transcript sequence file
Where on the GENCODE page can you find the transcript FASTA file?
The transcript FASTA file is available in the same release directory
as the genome and annotation files (under FASTA files). It is named
gencode.vM38.transcripts.fa.gz.

Downloading all reference files
BASH
cd ${SCRATCH}/rnaseq-workshop
GTFlink="https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M38/gencode.vM38.primary_assembly.basic.annotation.gtf.gz"
GENOMElink="https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M38/GRCm39.primary_assembly.genome.fa.gz"
CDSLink="https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M38/gencode.vM38.transcripts.fa.gz"
# Download to 'data'
wget -P data ${GENOMElink}
wget -P data ${GTFlink}
wget -P data ${CDSLink}
# Uncompress
cd data
gunzip GRCm39.primary_assembly.genome.fa.gz
gunzip gencode.vM38.primary_assembly.basic.annotation.gtf.gz
gunzip gencode.vM38.transcripts.fa.gz
The transcript file header has multiple pieces of information which often interfere with downstream analysis. We create a simplified version containing only the transcript IDs.
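One way to do this is to truncate each header at the first '|' (the sed command below is a sketch we supply; the cleaned file name matches what later episodes expect):
BASH
cd ${SCRATCH}/rnaseq-workshop/data
# GENCODE headers look like '>ENSMUST...|ENSMUSG...|...|gene_name|...';
# keep only the first field, i.e. the transcript ID
sed 's/|.*//' gencode.vM38.transcripts.fa > gencode.vM38.transcripts-clean.fa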
Raw reads
The experiment we use investigates how NAT10 influences mouse heart development.
Study: The effects of NAT10 on heart development in mice
Organism: Mus musculus
Design: Nat10^flox/flox vs Nat10-CKO (n=3 per group), P10 hearts
Sequencing: Illumina NovaSeq X Plus, paired-end
BioProject: PRJNA1254208
GEO: GSE295332

Where do you find FASTQ files?
GEO provides metadata and experimental details, but FASTQ files are stored in the SRA. How do you obtain the complete list of sequencing runs?
Navigate to the SRA page for the BioProject and use Run
Selector to list runs.
On the Run Selector page, click Accession List to
download all SRR accession IDs.

Download FASTQ files with SRA Toolkit
BASH
sinteractive -A workshop -p cpu -N 1 -n 4 --time=1:00:00
cd ${SCRATCH}/rnaseq-workshop/data
module load biocontainers
module load sra-tools
while read SRR; do
  fasterq-dump --threads 4 --progress ${SRR}
  gzip ${SRR}_1.fastq
  gzip ${SRR}_2.fastq
done < SRR_Acc_List.txt
Since downloading may take a long time, we provide a pre-downloaded copy of all FASTQ files. The directory, after downloading, should look like this:
BASH
data
├── annot.tsv
├── gencode.vM38.primary_assembly.basic.annotation.gtf
├── gencode.vM38.transcripts.fa
├── GRCm39.primary_assembly.genome.fa
├── mart.tsv
├── SRR33253285_1.fastq.gz
├── SRR33253285_2.fastq.gz
├── SRR33253286_1.fastq.gz
├── SRR33253286_2.fastq.gz
├── SRR33253287_1.fastq.gz
├── SRR33253287_2.fastq.gz
├── SRR33253288_1.fastq.gz
├── SRR33253288_2.fastq.gz
├── SRR33253289_1.fastq.gz
├── SRR33253289_2.fastq.gz
├── SRR33253290_1.fastq.gz
├── SRR33253290_2.fastq.gz
└── SRR_Acc_List.txt
- RNA-seq requires three main inputs: FASTQ reads, a reference genome (FASTA), and gene annotation (GTF/GFF).
- In lieu of a reference genome and annotation, transcript sequences (FASTA) can be used for transcript-level quantification.
- Keeping files compressed and well-organized supports reproducible analysis.
- Reference files must match in genome build and version.
- Public data repositories such as GEO and SRA provide raw reads and metadata.
- Tools like wget and fasterq-dump enable programmatic, reproducible data retrieval.
Content from Quality control of RNA-seq reads
Overview
Questions
- How do we check the quality of raw RNA-seq reads?
- What information do FastQC and MultiQC provide?
- How do we decide if trimming is required?
- What QC issues are common in Illumina RNA-seq data?
Objectives
- Run FastQC and MultiQC on RNA-seq FASTQ files.
- Understand how to interpret the important FastQC modules.
- Learn which FastQC warnings are expected for RNA-seq.
- Understand why trimming is usually unnecessary for standard RNA-seq.
- Identify samples that may require closer inspection.
Introduction
Before mapping reads to a genome, it is important to evaluate the
quality of the raw RNA-seq data.
Quality control helps reveal problems such as adapter contamination, low
quality cycles, and unexpected technical issues.
In this episode, we will run FastQC and MultiQC on the mouse heart
dataset and interpret the results.
We will also explain when trimming is useful and when it can be harmful
for alignment based RNA-seq workflows.
FastQC evaluates individual FASTQ files. MultiQC summarizes all FastQC reports into one view, which is helpful for comparing replicates.
Why do we perform QC on FASTQ files?
Quality control answers several questions:
- Are sequencing qualities stable across cycles?
- Are adapters present at high levels?
- Do different replicates show similar patterns?
- Are there outliers that indicate failed libraries?
QC provides confidence that downstream steps will be reliable.
Running FastQC
FastQC generates an HTML report for each FASTQ file. For paired-end datasets, you can run it on both R1 and R2 together or separately.
We will create a directory for QC output and then run FastQC on all FASTQ files.
Run FastQC on your FASTQ files
Use the commands below to run FastQC on all reads in the
data directory.
BASH
cd $SCRATCH/rnaseq-workshop
mkdir -p results/qc_fastq/
sinteractive -A workshop -p cpu -N 1 -n 4 --time=1:00:00
module load biocontainers
module load fastqc
fastqc data/*.fastq.gz --outdir results/qc_fastq/ --threads 4
BASH
qc_fastq
├── SRR33253285_1_fastqc.html
├── SRR33253285_1_fastqc.zip
├── SRR33253285_2_fastqc.html
├── SRR33253285_2_fastqc.zip
├── SRR33253286_1_fastqc.html
├── SRR33253286_1_fastqc.zip
├── SRR33253286_2_fastqc.html
├── SRR33253286_2_fastqc.zip
├── SRR33253287_1_fastqc.html
├── SRR33253287_1_fastqc.zip
├── SRR33253287_2_fastqc.html
├── SRR33253287_2_fastqc.zip
├── SRR33253288_1_fastqc.html
├── SRR33253288_1_fastqc.zip
├── SRR33253288_2_fastqc.html
├── SRR33253288_2_fastqc.zip
├── SRR33253289_1_fastqc.html
├── SRR33253289_1_fastqc.zip
├── SRR33253289_2_fastqc.html
├── SRR33253289_2_fastqc.zip
├── SRR33253290_1_fastqc.html
├── SRR33253290_1_fastqc.zip
├── SRR33253290_2_fastqc.html
└── SRR33253290_2_fastqc.zip
Understanding the FastQC report
FastQC produces several diagnostic modules. RNA-seq users often misinterpret certain modules, so we focus on the ones that matter most.
Which FastQC modules are important for RNA-seq?
Most informative modules:
- Per base sequence quality
- Per sequence quality scores
- Per base sequence content
- Per sequence GC content
- Adapter content
- Overrepresented sequences
Modules that commonly show warnings for RNA-seq, even when libraries are fine:
- Per sequence GC content (RNA-seq reflects transcript GC bias, not whole genome)
- K-mer content (poly A tails and priming artifacts often trigger warnings)
Warnings do not always indicate problems. Context and comparison across samples matter more than individual icons.
Key modules explained
Below are guidelines for interpreting the most useful modules.
Per base sequence quality
Shows the distribution of quality scores at each position across all reads. You want high and stable quality across the read. A small quality drop at the end of R2 is common and usually not a problem.

Per tile sequence quality
Reports whether specific regions of the flowcell produced lower quality reads. Uneven tiles can indicate instrument issues or bubble formation, but this module rarely shows problems in modern NovaSeq data.

Per sequence quality scores
Shows how many reads have high or low overall quality. A good dataset has most reads with high scores and very few reads with low quality.

Per base sequence content
Shows the proportion of A, C, G, and T at each cycle. For genomic resequencing, all four lines should be close to 25 percent. RNA-seq usually shows imbalanced composition in early cycles due to biased priming and transcript composition. Mild imbalance at the start of the read is normal and does not require trimming.

Per sequence GC content
Shows the GC distribution across all reads. RNA-seq often fails this test because expressed transcripts are not GC neutral. Only pronounced multi-modal or extremely shifted distributions are concerning, for example when contamination from another organism is suspected.

Per base N content
Reports the percentage of unidentified bases (N) at each position. This should be near zero for modern Illumina runs. High N content indicates failed cycles or poor sequencing chemistry.

Sequence length distribution
Shows the distribution of read lengths. Standard RNA-seq libraries should have a narrow peak at the expected read length. Variable lengths usually appear after trimming or for specialized protocols such as small RNA.

Sequence duplication levels
Reports how many reads are exact duplicates. Some duplication is normal in RNA-seq because highly expressed genes produce many identical fragments. Very high duplication, especially combined with low library size, can indicate low library complexity or overamplification.

Overrepresented sequences
Lists specific sequences that occur more often than expected. Common sources include ribosomal RNA fragments, poly A tails, adapter sequence, or reads from highly expressed transcripts. Some overrepresented sequences are expected, but very high levels may indicate contamination or poor rRNA depletion.
Aggregating reports with MultiQC
FastQC generates a separate HTML file per sample. MultiQC consolidates all reports and allows easy comparison of replicates.
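A minimal run over the FastQC outputs might look like this (the output directory name is our choice):
BASH
cd $SCRATCH/rnaseq-workshop
module load biocontainers
module load multiqc
# collect all FastQC reports into a single combined report
multiqc results/qc_fastq -o results/qc_multiqc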
Trimming adapter sequences
Before deciding whether to trim reads, it is essential to understand what adapters are and why they appear in sequencing data.
Adapters are short synthetic DNA sequences ligated to both ends of each fragment during library preparation. They provide primer binding sites needed for cluster amplification and sequencing. Ideally, the sequenced fragment is long enough that the machine reads only biological DNA. However, when the insert is shorter than the read length, the sequencer runs into the adapter sequence, which produces adapter contamination at the ends of reads.
Many users assume that adapters must always be trimmed, but this is not true for typical RNA-seq workflows.
Do we need to trim reads for standard RNA-seq workflows?
For alignment based RNA-seq, adapter trimming is usually not required. Modern aligners such as HISAT2 and STAR detect and soft clip adapter sequence automatically. Soft-clipped bases do not contribute to mismatches or mapping penalties, so adapters generally do not harm alignment quality.
In contrast, unnecessary trimming can introduce new problems:
- shorter reads and lower mapping power
- uneven trimming across samples and artificial differences
- reduction in read complexity
- broken read pairing for paired-end sequencing
- more parameters to track for reproducibility
Trimming is recommended only in specific situations:
- very short inserts, for example degraded RNA or ancient DNA
- adapter sequence dominating read tails, for example more than 10-15 percent adapter content
- transcript assembly workflows such as StringTie or Scallop, which benefit from uniform read lengths
- small RNA or miRNA libraries, where inserts are intentionally very short and adapter presence is guaranteed
Exercise: evaluate the QC results
Inspect your MultiQC report
Using the MultiQC HTML report, answer:
- Do all replicates show similar per base sequence quality profiles?
- Is adapter content high enough to impact analysis?
- Is any sample behaving differently from the rest, for example in number of reads, quality, duplication, or GC percentage?
Typical interpretation for this dataset (students should compare with their own):
- All replicates show high, stable sequence quality.
- Adapter signal appears only at the extreme ends of reads for selected samples and is less than 4 percent.
- No samples show concerning patterns or biases.
- Trimming is therefore unnecessary for this dataset.
Optional reference: how to trim adapters with fastp (not part of the hands-on)
Sometimes you may encounter datasets where trimming is appropriate, for example short inserts, small RNA, or visibly high adapter content in QC. Although we will not trim reads in this workshop, here is a minimal example you can use in your own projects if trimming becomes necessary.
fastp is a fast, widely used tool that can automatically
detect adapters, trim them, filter low quality reads, and generate QC
reports.
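A minimal sketch for paired-end reads (file names are placeholders):
BASH
module load biocontainers
module load fastp
# auto-detect and trim adapters, filter low-quality reads,
# and write HTML/JSON QC reports
fastp \
  --in1 sample_1.fastq.gz --in2 sample_2.fastq.gz \
  --out1 sample_1.trimmed.fastq.gz --out2 sample_2.trimmed.fastq.gz \
  --detect_adapter_for_pe \
  --thread 4 \
  --html fastp_report.html --json fastp_report.json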
Summary
- FastQC and MultiQC provide essential diagnostics for RNA-seq data.
- Some FastQC warnings are normal for RNA-seq and do not indicate problems.
- Aligners soft clip adapters, so trimming is usually unnecessary for alignment based RNA-seq.
- QC helps detect outliers before alignment and quantification.
Content from A. Genome-based quantification (STAR + featureCounts)
Overview
Questions
- How do we map RNA-seq reads to a reference genome?
- How do we determine library strandness before alignment?
- How do we quantify reads per gene using featureCounts?
- How do we submit mapping jobs on RCAC clusters with SLURM?
Objectives
- Check RNA-seq library strandness using Salmon.
- Index a reference genome for STAR.
- Map reads to the genome using STAR.
- Quantify gene level counts with featureCounts.
- Interpret basic alignment and counting statistics.
Introduction
In this episode we move from raw FASTQ files to genome based gene
counts.
This workflow has three major steps:
- Determine library strandness: STAR does not infer strandness automatically, and featureCounts requires the correct setting.
- Map reads to the genome using STAR: STAR is a splice-aware aligner designed for RNA-seq data.
- Count mapped reads per gene using featureCounts.
This represents one of the two standard quantification strategies in
RNA-seq.
The alternative method (B. transcript based quantification with Salmon
or Kallisto) will be covered in Episode 4B.
Step 1: Checking RNA-seq library strandness
Why strandness matters
RNA-seq libraries can be:
- unstranded
- stranded forward
- stranded reverse
Stranded protocols are common in modern kits, but old GEO and SRA
datasets are often unstranded.
featureCounts must be given the correct strand setting. If the wrong
setting is used, most reads will be counted against incorrect genes.
STAR does not infer strandness automatically, so we must determine it before alignment.
Detecting strandness with Salmon
Salmon can infer strandness directly from raw FASTQ files without any
alignment.
We run Salmon with -l A, which tells it to explore all
library types.
BASH
sinteractive -A workshop -q standby -p cpu -N 1 -n 4 --time=1:00:00
cd $SCRATCH/rnaseq-workshop
mkdir -p results/strand_check
module load biocontainers
module load salmon
salmon index --transcripts data/gencode.vM38.transcripts-clean.fa \
--index data/salmon_index \
--threads 4
salmon quant --index data/salmon_index \
--libType A \
--mates1 data/SRR33253286_1.fastq.gz \
--mates2 data/SRR33253286_2.fastq.gz \
--output results/strand_check \
--threads 4
The important output is in
results/strand_check/lib_format_counts.json:
JSON
{
"read_files": "[ data/SRR33253286_1.fastq.gz, data/SRR33253286_2.fastq.gz]",
"expected_format": "IU",
"compatible_fragment_ratio": 1.0,
"num_compatible_fragments": 21088696,
"num_assigned_fragments": 21088696,
"num_frags_with_concordant_consistent_mappings": 19116702,
"num_frags_with_inconsistent_or_orphan_mappings": 2510542,
"strand_mapping_bias": 0.5117980078362889,
"MSF": 0,
"OSF": 0,
"ISF": 9783890,
"MSR": 0,
"OSR": 0,
"ISR": 9332812,
"SF": 1234245,
"SR": 1276297,
"MU": 0,
"OU": 0,
"IU": 0,
"U": 0
}
Interpreting lib_format_counts.json
This JSON file reports Salmon’s automatic library type detection.
- "expected_format": "IU": This is Salmon’s conclusion. “I” means inward (correct for paired-end reads) and “U” means unstranded.
- "strand_mapping_bias": 0.512: This is the key evidence. A value near 0.5 (50%) indicates that reads mapped equally to the sense and antisense strands, the definitive sign of an unstranded library.
- "ISF" and "ISR": These are the counts for inward-sense-forward (9.78M) and inward-sense-reverse (9.33M) fragments. Because these values are almost equal, they confirm the ~50/50 split seen in the strand bias.
Common results:
| Code | Meaning |
|---|---|
| IU | unstranded |
| ISR | stranded reverse (paired end) |
| ISF | stranded forward (paired end) |
| SR | stranded reverse (single end) |
| SF | stranded forward (single end) |
Conclusion: The data is unstranded. We will record this and supply it to STAR/featureCounts.
Step 2: Preparing the reference genome for alignment
STAR requires a genome index. Indexing builds a searchable data structure that speeds up alignment.
Required input:
- genome FASTA file
- annotation file (GTF or GFF3)
- output directory for the index
Strictly, only the genome FASTA is required; the annotation is optional, but including it improves splice junction detection and yields better alignments.
Why STAR?
- Splice-aware
- Extremely fast
- Produces high-quality alignments
- Well supported and widely used
- Compatible with downstream tools such as featureCounts, StringTie, and RSEM
Optional alternatives include HISAT2 and Bowtie2, but STAR remains the standard for high-depth Illumina RNA-seq.
Exercise: Read the STAR manual
Check the “Generating genome indexes” section of
the
STAR
manual
Which options are required for indexing?
Key options:
- --runMode genomeGenerate
- --genomeDir: directory to store the index
- --genomeFastaFiles: the FASTA file
- --sjdbGTFfile: annotation (recommended)
- --runThreadN: number of threads
- --sjdbOverhang: read length minus one (100 is a safe default)
Indexing the Genome with STAR
Below is a minimal SLURM job script for building the STAR genome index. Create a file named index_genome.sh in $SCRATCH/rnaseq-workshop/scripts with this content:
BASH
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --account=workshop
#SBATCH --partition=cpu
#SBATCH --time=1:00:00
#SBATCH --job-name=star_index
#SBATCH --output=cluster-%x.%j.out
#SBATCH --error=cluster-%x.%j.err
module load biocontainers
module load star
data_dir="${SCRATCH}/rnaseq-workshop/data"
genome="${data_dir}/GRCm39.primary_assembly.genome.fa"
gtf="${data_dir}/gencode.vM38.primary_assembly.basic.annotation.gtf"
indexdir="${data_dir}/star_index"
STAR \
--runMode genomeGenerate \
--runThreadN ${SLURM_CPUS_ON_NODE} \
--genomeDir ${indexdir} \
--genomeFastaFiles ${genome} \
--sjdbGTFfile ${gtf}
Submit the job:
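BASH
cd $SCRATCH/rnaseq-workshop/scripts
sbatch index_genome.sh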
Indexing requires about 30 to 45 minutes.
The index directory is now ready for use in mapping and should have contents like this (location: $SCRATCH/rnaseq-workshop/data).
Contents of star_index/:
BASH
star_index/
├── chrLength.txt
├── chrNameLength.txt
├── chrName.txt
├── chrStart.txt
├── exonGeTrInfo.tab
├── exonInfo.tab
├── geneInfo.tab
├── Genome
├── genomeParameters.txt
├── Log.out
├── SA
├── SAindex
├── sjdbInfo.txt
├── sjdbList.fromGTF.out.tab
├── sjdbList.out.tab
└── transcriptInfo.tab
Do I need the GTF for indexing?
No, but it is strongly recommended.
Including it improves splice junction detection and increases mapping
accuracy.
Step 3: Mapping reads with STAR
Once the genome index is ready, we can align reads.
For aligning reads, important STAR options are:
- --runThreadN
- --genomeDir
- --readFilesIn
- --readFilesCommand zcat for gzipped FASTQ files
- --outSAMtype BAM SortedByCoordinate
- --outSAMunmapped Within
STAR’s defaults are tuned for mammalian genomes. If working with plants, fungi, or repetitive genomes, additional tuning may be required.
Exercise: Write a STAR alignment command
Write a basic STAR alignment command using your genome index and paired FASTQ files.
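One possible solution (a sketch using the first sample; paths assume the layout from earlier episodes):
BASH
cd $SCRATCH/rnaseq-workshop
mkdir -p results/mapping
STAR \
  --runThreadN 4 \
  --genomeDir data/star_index \
  --readFilesIn data/SRR33253285_1.fastq.gz data/SRR33253285_2.fastq.gz \
  --readFilesCommand zcat \
  --outFileNamePrefix results/mapping/SRR33253285_ \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMunmapped Within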
Mapping many FASTQ files with a SLURM array
When you have multiple samples, array jobs simplify submission.
First, we will create a samples.txt file containing just the sample names:
BASH
cd $SCRATCH/rnaseq-workshop/data
ls *_1.fastq.gz | sed 's/_1.fastq.gz//' > ${SCRATCH}/rnaseq-workshop/scripts/samples.txt
BASH
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --account=workshop
#SBATCH --partition=cpu
#SBATCH --time=8:00:00
#SBATCH --job-name=read_mapping
#SBATCH --array=1-6
#SBATCH --output=cluster-%x.%j.out
#SBATCH --error=cluster-%x.%j.err
module load biocontainers
module load star
FASTQ_DIR="$SCRATCH/rnaseq-workshop/data"
GENOME_INDEX="$SCRATCH/rnaseq-workshop/data/star_index"
OUTDIR="$SCRATCH/rnaseq-workshop/results/mapping"
mkdir -p $OUTDIR
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
R1=${FASTQ_DIR}/${SAMPLE}_1.fastq.gz
R2=${FASTQ_DIR}/${SAMPLE}_2.fastq.gz
STAR \
--runThreadN ${SLURM_CPUS_ON_NODE} \
--genomeDir $GENOME_INDEX \
--readFilesIn $R1 $R2 \
--readFilesCommand zcat \
--outFileNamePrefix $OUTDIR/$SAMPLE \
--outSAMunmapped Within \
--outSAMtype BAM SortedByCoordinate
Submit:
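For example (we assume the array script was saved as map_reads.sh in the scripts directory; the name is our choice):
BASH
cd $SCRATCH/rnaseq-workshop/scripts
sbatch map_reads.sh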
Why use an array job?
Think about this:
- We have six samples, each with two FASTQ files.
- Submitting six separate SLURM jobs means copy-pasting the same command six times, with a higher chance of typos or mistakes.
- Mapping all samples from a single SLURM script means sequential execution, which is slower.
- The samples can be processed independently of each other.
Why is a job array a good choice?
Arrays allow us to run identical commands across many inputs:
- All samples are processed the same way
- Runs in parallel
- Simplifies submission
- Reduces copy-paste errors
- Minimizes runtime cost on the cluster
Step 4: Assessing alignment quality
STAR produces several useful files, particularly:
- Log.final.out: mapping statistics, including the percentage of uniquely mapped reads.
- Aligned.sortedByCoord.out.bam: sorted alignments, ready for downstream quantification.
To summarize alignment statistics across samples, we use MultiQC:
BASH
cd $SCRATCH/rnaseq-workshop
module load biocontainers
module load multiqc
multiqc results/mapping -o results/qc_alignment
Key metrics to review
- Uniquely mapped reads: typically 60-90% is good
- Multi-mapped reads: depends on the genome, but >20% is concerning
- Unmapped reads: check the reasons (mismatches? too short?)
- Mismatch rate: high values indicate quality or contamination issues
Exercise: Examine your alignment metrics
Using your MultiQC alignment report:
- Which sample has the highest uniquely mapped percentage?
- Are any samples clear outliers?
- Do the mapped/unmapped read proportions appear consistent across samples?
Example interpretation for this dataset:
- All samples have ~80% uniquely mapped reads.
- No sample is an outlier.
- Mapped/unmapped read proportions are consistent across samples.
Step 5: Quantifying gene counts with featureCounts
Once the reads have been aligned and sorted, we can quantify how many
read pairs map to each gene. This gives us gene-level counts that we
will later use for differential expression. For alignment-based RNA-seq
workflows, featureCounts is the recommended tool because it
is fast, robust, and compatible with most gene annotation formats.
featureCounts uses two inputs:
- A set of aligned BAM files (sorted by coordinate)
- A GTF annotation file describing gene models
It assigns each read (or read pair) to a gene based on the exon boundaries.
Important: stranded or unstranded?
featureCounts needs to know whether your dataset is
stranded.
- -s 0: unstranded
- -s 1: stranded forward
- -s 2: stranded reverse
From our Salmon strandness check, this dataset behaves as
unstranded, so we will use -s 0.
Running featureCounts
We will create a new directory for count output:
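For example, from the workshop root (matching the output_dir used in the script below):
BASH
cd $SCRATCH/rnaseq-workshop
mkdir -p results/counts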
Below is an example SLURM script to count all BAM files at once.
BASH
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --account=workshop
#SBATCH --partition=cpu
#SBATCH --time=1:00:00
#SBATCH --job-name=featurecounts
#SBATCH --output=cluster-%x.%j.out
#SBATCH --error=cluster-%x.%j.err
module load biocontainers
module load subread
GTF="$SCRATCH/rnaseq-workshop/data/gencode.vM38.primary_assembly.basic.annotation.gtf"
BAMS="$SCRATCH/rnaseq-workshop/results/mapping/*.bam"
output_dir="$SCRATCH/rnaseq-workshop/results/counts"
featureCounts \
-T ${SLURM_CPUS_PER_TASK} \
-p \
-s 0 \
-t exon \
-g gene_id \
-a ${GTF} \
-o ${output_dir}/gene_counts.txt \
${BAMS}
Submit the job:
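For example (assuming the script above was saved as featurecounts.sh; the name is our choice):
BASH
cd $SCRATCH/rnaseq-workshop/scripts
sbatch featurecounts.sh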
gene_counts.txt will contain:
- one row per gene
- annotation columns (gene ID, coordinates, strand, length)
- one count column per sample
Step 6: QC on counts
Before proceeding to differential expression, it is good practice to perform some QC on the count data. We can again use MultiQC to summarize featureCounts statistics:
BASH
cd $SCRATCH/rnaseq-workshop
mkdir -p results/qc_counts
module load biocontainers
module load multiqc
multiqc results/counts -o results/qc_counts
Exercise: Examine your counting metrics
Using your MultiQC featureCounts report:
- Which sample has the lowest percent assigned, and what are common reasons for lower assignment?
- Which unassigned category is the largest, and what does it indicate?
- Does the sample with the most assigned reads also have the highest percent assigned?
- What does “Unassigned: No Features” mean, and why might it occur?
- Are multi-mapped reads a concern for RNA-seq differential expression?
Example interpretation for this dataset:
- SRR33253289 has the lowest assignment rate. Causes include low complexity, incomplete annotation, or more intronic reads.
- Unmapped or No Features dominate. These reflect reads that do not align or do not overlap annotated exons.
- No. Total assigned reads depend on depth, while percent assigned reflects library quality.
- Reads mapped outside annotated exons, often from intronic regions, unannotated transcripts, or incomplete annotation.
- Multi-mapping is expected for paralogs and repeats. It is usually fine as long as the rate is consistent across samples.
Summary
- STAR is a fast splice-aware aligner widely used for RNA-seq.
- Genome indexing requires FASTA and optionally GTF for improved splice detection.
- Mapping is efficiently performed using SLURM array jobs.
- Alignment statistics (from Log.final.out + MultiQC) must be reviewed.
- Strandness should be determined before counting; here we inferred it with Salmon directly from the FASTQ files.
- Final BAM files are ready for downstream counting.
- featureCounts is used to obtain gene-level counts for differential expression.
- Reads overlapping exons are counted and summed per gene.
- The output is a simple matrix for downstream statistical analysis.
Content from B. Transcript-based quantification (Salmon)
Overview
Questions
- How do we quantify expression without genome alignment?
- What inputs does Salmon require?
- How do we run Salmon for paired-end RNA-seq data?
- How do we interpret transcript-level outputs (TPM, NumReads)?
- How do we summarize transcripts to gene-level counts using tximport?
Objectives
- Build a transcriptome index for Salmon.
- Quantify transcript abundance directly from FASTQ files.
- Understand key Salmon output files.
- Summarize transcript-level estimates to gene-level counts.
- Prepare counts for downstream differential expression.
Introduction
In the previous episode, we mapped reads to the genome using STAR and generated gene-level counts with featureCounts.
In this episode, we use an alternative quantification strategy:
- Build a transcriptome index
- Quantify expression directly from FASTQ files using
Salmon (without alignment)
- Summarize transcript estimates back to gene-level counts
This workflow is faster, uses less storage, and models transcript-level uncertainty.
When should I use transcript-based quantification?
Salmon is ideal when:
- You do not need alignment files.
- You want transcript-level TPMs.
- You want fast quantification with bias correction.
- Storage is limited (no BAM files generated).
It is not ideal if you need splice junctions, variant calling, or visualization in IGV (all of those depend on alignment files).
Step 1: Preparing the transcriptome reference
Salmon requires a transcriptome FASTA file (all annotated transcripts). We previously downloaded the transcript file from GENCODE for this purpose (gencode.vM38.transcripts.fa) and created a cleaned version with simplified headers (gencode.vM38.transcripts-clean.fa).
Why transcriptome choice matters
Transcript-level quantification inherits all assumptions of
the annotation.
Missing or incorrect transcripts → biased TPM estimates.
Building the Salmon index
A Salmon index is lightweight and quick to build. You may have already built this in Episode 4A.
BASH
sinteractive -A workshop -p cpu -N 1 -n 4 --time=1:00:00
cd $SCRATCH/rnaseq-workshop
mkdir -p data/salmon_index
module load biocontainers
module load salmon
salmon index \
--transcripts data/gencode.vM38.transcripts-clean.fa \
--index data/salmon_index \
--threads 4
This creates a directory containing Salmon’s k-mer index of all transcripts.
salmon_index/
├── complete_ref_lens.bin # IMPORTANT: full transcript lengths used in TPM/effLength calculations
├── ctable.bin # internal k-mer lookup table (not user-relevant)
├── ctg_offsets.bin # internal offsets for reference sequences
├── duplicate_clusters.tsv # internal duplicate k-mer cluster info
├── info.json # IMPORTANT: index metadata (k-mer size, transcripts, build parameters)
├── mphf.bin # minimal perfect hash function for fast lookup
├── pos.bin # internal k-mer position table
├── pre_indexing.log # IMPORTANT: log of preprocessing and FASTA parsing steps
├── rank.bin # internal rank/select bitvector structure
├── refAccumLengths.bin # cumulative reference lengths (internal)
├── ref_indexing.log # IMPORTANT: main indexing log; good for debugging build problems
├── reflengths.bin # IMPORTANT: transcript lengths used in quantification math
├── refseq.bin # IMPORTANT: transcript sequences stored in binary format (core index content)
├── seq.bin # internal sequence encoding structures
└── versionInfo.json # IMPORTANT: Salmon version + environment info for reproducibility
Step 2: Quantifying transcript abundances
The core Salmon command is salmon quant. We use the
transcript index and paired-end FASTQ files.
Example command:
BASH
salmon quant \
--index salmon_index \
--libType IU \
--mates1 data/SRR33253286_1.fastq.gz \
--mates2 data/SRR33253286_2.fastq.gz \
--output salmon_quant/SRR33253286 \
--threads 16 \
--validateMappings \
--seqBias \
--gcBias
Important flags:
- --libType IU: unstranded, inward-oriented paired-end (replace with ISR, ISF, etc. based on your strandness)
- --validateMappings: improves mapping accuracy
- --seqBias, --gcBias: correct sequence- and GC-related biases
Correct library type is essential
A wrong --libType silently distorts transcript
abundances. Use Salmon’s strandness detection from Episode 4A
if unsure.
Step 3: Running Salmon for many samples
Again, we use a SLURM array.
First, create a sample list (or reuse the one from Episode 4A):
BASH
cd $SCRATCH/rnaseq-workshop/data
ls *_1.fastq.gz | sed 's/_1.fastq.gz//' > $SCRATCH/rnaseq-workshop/scripts/samples.txt
Array job:
BASH
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --account=workshop
#SBATCH --partition=cpu
#SBATCH --time=4:00:00
#SBATCH --job-name=salmon_quant
#SBATCH --array=1-6
#SBATCH --output=cluster-%x.%j.out
#SBATCH --error=cluster-%x.%j.err
module load biocontainers
module load salmon
DATA="$SCRATCH/rnaseq-workshop/data"
INDEX="$SCRATCH/rnaseq-workshop/data/salmon_index"
OUT="$SCRATCH/rnaseq-workshop/results/salmon_quant"
mkdir -p $OUT
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
R1=${DATA}/${SAMPLE}_1.fastq.gz
R2=${DATA}/${SAMPLE}_2.fastq.gz
salmon quant \
--index $INDEX \
--libType IU \
--mates1 $R1 \
--mates2 $R2 \
--output $OUT/$SAMPLE \
--threads ${SLURM_CPUS_ON_NODE} \
--validateMappings \
--seqBias \
--gcBias
Submit:
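For example (assuming the array script was saved as salmon_quant.sh; the name is our choice):
BASH
cd $SCRATCH/rnaseq-workshop/scripts
sbatch salmon_quant.sh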
The typical run time is ~5 minutes per sample.
Why use an array job here?
Unlike the STAR workflow in 4A, Salmon runs quickly and produces small output. The advantage of array jobs here is not speed but consistency: all samples are quantified with identical parameters and metadata. This ensures that transcript-level TPM and count estimates are directly comparable across samples.
Inspect results
After Salmon finishes, each sample directory contains a small set of
output files.
Only a few of these are important for routine analysis:
SRR33253285/
├── quant.sf # IMPORTANT: transcript-level TPM, counts, effective lengths
├── lib_format_counts.json # IMPORTANT: inferred library type (IU, ISR, ISF)
├── aux_info
│ ├── meta_info.json # IMPORTANT: mapping rate, fragment counts, bias flags
│ ├── fld.gz # fragment length distribution (binary format)
│ ├── expected_bias.gz # bias model used for correction
│ ├── observed_bias.gz # observed sequence and GC bias
│ └── ambig_info.tsv # ambiguous equivalence class summaries (advanced QC only)
├── libParams
│ └── flenDist.txt # fragment length distribution in a human-readable format
├── cmd_info.json # command and parameters used to run Salmon
└── logs
└── salmon_quant.log # main log file for troubleshooting
What is inside quant.sf?
This file contains the transcript-level quantification results:
- Name: transcript ID
- Length: transcript length
- EffectiveLength: length adjusted for fragment distribution and bias
- TPM: within-sample normalized abundance
- NumReads: estimated number of reads assigned to that transcript
Example:
Name Length EffectiveLength TPM NumReads
ENSMUST00000193812.2 1070 775.000 0.000000 0.000
ENSMUST00000082908.3 110 110.000 0.000000 0.000
ENSMUST00000162897.2 4153 3483.630 0.013994 1.000
ENSMUST00000159265.2 2989 2694.000 0.000000 0.000
ENSMUST00000070533.5 3634 3339.000 0.000000 0.000
ENSMUST00000192857.2 480 185.000 0.000000 0.000
ENSMUST00000195335.2 2819 2524.000 0.000000 0.000
ENSMUST00000192336.2 2233 1938.000 0.000000 0.000
ENSMUST00000194099.2 2309 2014.000 0.000000 0.000
Key QC metrics to review
Open aux_info/meta_info.json and check:
- percent_mapped: proportion of fragments assigned to transcripts
- library_types: Salmon's inferred library orientation, which should match your --libType
- fragment length statistics: mean and standard deviation of inferred fragment lengths
These metrics provide a first-pass sanity check before summarizing the results to gene-level counts.
What does Salmon count?
NumReads is a model-based estimate, not raw counts of
reads. TPM values are within-sample normalized and should not
be directly compared across samples.
Step 4: Summarizing to gene-level counts (tximport)
Most differential expression tools (DESeq2, edgeR) require gene-level counts. We convert transcript estimates → gene-level matrix.
Prepare tx2gene mapping
We will need a mapping file that links the transcript IDs to gene
IDs. The gencode.vM38.primary_assembly.basic.annotation.gtf
file contains this information.
BASH
cd $SCRATCH/rnaseq-workshop/data
awk -F'\t' '$3=="transcript" {
match($9, /transcript_id "([^"]+)"/, tx);
match($9, /gene_id "([^"]+)"/, gene);
print tx[1] "\t" gene[1]
}' gencode.vM38.primary_assembly.basic.annotation.gtf > tx2gene.tsv
Next, we need to process this mapping in R. Load the modules and start the R session, first:
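Module names vary between clusters; on this setup it might look like this (the R module name is our assumption):
BASH
module load biocontainers
module load r
# start an interactive R session
R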
In the R session, read in the tx2gene mapping:
R
setwd(paste0(Sys.getenv("SCRATCH"),"/rnaseq-workshop"))
library(readr)
library(tximport)
tx2gene <- read_tsv("data/tx2gene.tsv", col_types = "cc", col_names = FALSE)
samples <- read_tsv("scripts/samples.txt", col_names = FALSE)
files <- file.path("results/salmon_quant", samples$X1, "quant.sf")
names(files) <- samples$X1
txi <- tximport(files,
type = "salmon",
tx2gene = tx2gene,
countsFromAbundance = "lengthScaledTPM")
saveRDS(txi, file = "results/salmon_quant/txi.rds")
txi$counts now contains gene-level counts suitable for
DESeq2.
R
> head(txi$counts)
SRR33253285 SRR33253286 SRR33253287 SRR33253288
ENSMUSG00000000001.5 473.4426 553.784929 826.996853 676.910126
ENSMUSG00000000003.16 0.0000 0.000000 0.000000 0.000000
ENSMUSG00000000028.16 40.5783 33.966518 55.305501 97.845071
ENSMUSG00000000031.20 0.0000 0.000000 0.000000 0.000000
ENSMUSG00000000037.18 19.2220 9.599673 16.951493 20.226549
ENSMUSG00000000049.12 0.0000 5.022073 1.984551 1.016411
SRR33253289 SRR33253290
ENSMUSG00000000001.5 724.117586 813.848375
ENSMUSG00000000003.16 0.000000 0.000000
ENSMUSG00000000028.16 41.691461 63.812279
ENSMUSG00000000031.20 0.000000 0.000000
ENSMUSG00000000037.18 19.680740 18.945058
ENSMUSG00000000049.12 0.993724 9.186882
Why use lengthScaledTPM?
This method leverages Salmon’s bias correction while generating stable counts for statistical modeling.
Step 5: QC of quantification metrics
Use MultiQC to verify:
BASH
cd $SCRATCH/rnaseq-workshop
module load biocontainers
module load multiqc
multiqc results/salmon_quant -o results/qc_salmon
Inspect:
- mapping rates
- fragment length distributions
- library type consistency
Exercise: examine your Salmon outputs
Using your MultiQC summary and Salmon fragment length plots:
- What is the percent aligned for each sample, and are any samples outliers?
- Is the inferred library type consistent across samples?
- How similar are the fragment length distributions across samples?
- Does the fragment length curve show unusual peaks or multimodality?
- Do CFR and M bias values suggest any strand or positional bias?
Example interpretation for this dataset:
- All samples show about 89 to 91 percent aligned. No sample is an outlier and the range is narrow.
- The inferred library type is consistent across samples and matches an unstranded protocol.
- Fragment length distributions are nearly identical. All samples peak around the same length and have similar spread.
- The fragment length curves are smooth and unimodal. There are no secondary peaks or irregular shoulders.
- CFR is 100 percent for all samples and M bias is about 0.5, indicating balanced, strand-neutral coverage without detectable end bias.
Summary
- Salmon performs fast, alignment-free transcript quantification.
- A transcriptome FASTA is required for building the index.
- salmon quant uses FASTQ files directly, with bias correction.
- Transcript-level outputs include TPM and estimated counts.
- tximport converts transcript estimates to gene-level counts.
- Gene-level counts from Salmon are suitable for DESeq2.
Content from Gene-level QC and differential expression (DESeq2)
Overview
Questions
- How do we explore RNA-seq count data before running DESeq2?
- How do we restrict the analysis to protein-coding genes?
- How do we perform differential expression analysis with DESeq2?
- How do we visualize sample relationships and DE genes?
- How do we save a useful DE results table with annotation and expression values?
Objectives
- Import count and sample metadata into R.
- Attach biotype and gene symbol annotation to Ensembl gene IDs.
- Explore library sizes, variance stabilizing transforms, distances, and PCA.
- Run DESeq2 for a two group comparison.
- Create a volcano plot with gene symbols.
- Export joined DE results including normalized counts and annotation.
Introduction
In this episode we move from count matrices to differential expression analysis and visualization.
The workflow has four parts:
- Import counts and sample metadata.
- Exploratory analysis on protein coding genes (quality checks).
- Differential expression analysis with DESeq2.
- Visualization and export of annotated DE results.
What you need for this episode
- gene_counts_clean.txt generated from featureCounts
- a samples.csv file describing the experimental groups
- an RStudio session via Open OnDemand
While we created the count matrix in the previous episode, we still
need to create a sample metadata file. This file should contain at least
two columns: sample names matching the count matrix column names, and
the experimental condition (e.g., control vs treatment). Simply
copy/paste the following into a text file and save it in your
scripts directory as samples.csv:
sample,condition
SRR33253285,Nat10-CKO
SRR33253286,Nat10-CKO
SRR33253287,Nat10-CKO
SRR33253288,Nat10flox/flox
SRR33253289,Nat10flox/flox
SRR33253290,Nat10flox/flox
Also, create a directory for DESeq2 results:
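A minimal sketch, assuming the same workshop directory used in earlier episodes:
BASH
cd $SCRATCH/rnaseq-workshop
mkdir -p results/deseq2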
All analyses in this episode are performed in R through Open OnDemand.
Step 1: Start Open OnDemand R session and prepare data
We will use Open OnDemand (OOD) to start an interactive session on Scholar: https://gateway.negishi.rcac.purdue.edu
- Login using your Purdue credentials after clicking the above link
- Click on “Interactive Apps” in the top menu, and select “RStudio (Bioconductor)”
- Fill the job submission form as follows:
  - queue: workshop
  - Walltime: 4
  - Number of cores: 4
- Click “Launch” and wait for the RStudio session to start
- Once the session starts, you’ll be able to click on “Connect to RStudio server”, which will open the RStudio interface.

Once in RStudio, load the necessary libraries:
Setup: load packages and data
R
library(RColorBrewer)
library(EnsDb.Mmusculus.v79)
library(ensembldb)
library(tidyverse)
library(DESeq2)
library(ggplot2)
library(pheatmap)
library(readr)
library(dplyr)
library(ComplexHeatmap)
library(ggrepel)
library(vsn)
# Modify the path as needed
setwd("/scratch/negishi/aseethar/rnaseq-workshop")
countsFile <- "results/counts/gene_counts_clean.txt"
# prepare this file first!
groupFile <- "scripts/samples.csv"
coldata <-
read.csv(
groupFile,
row.names = 1,
header = TRUE,
stringsAsFactors = TRUE
)
coldata$condition <- as.factor(coldata$condition)
cts <- as.matrix(read.delim(countsFile, row.names = 1, header = TRUE))
We also load gene level annotation tables prepared earlier.
R
mart <-
read.csv(
"data/mart.tsv",
sep = "\t",
header = TRUE
)
annot <-
read.csv(
"data/annot.tsv",
sep = "\t",
header = TRUE
)
Since the gene IDs in the count matrix are Ensembl IDs, it will be
hard to interpret the results without annotation. The block below shows
how the mart.tsv and annot.tsv tables were generated with
biomaRt; run it once, then reuse the saved files.
R
library(biomaRt)
library(tidyverse)
# select Mart
ensembl = useMart("ENSEMBL_MART_ENSEMBL")
# we need to use the correct dataset for mouse
# we will query the available datasets and filter for mouse (GRCm39)
listDatasets(ensembl) %>%
filter(str_detect(description, "GRCm39"))
# we can use the dataset name to set the dataset
ensembl = useDataset("mmusculus_gene_ensembl", mart = ensembl)
# our counts have Ensembl gene IDs with version numbers
# lets find how it is referred in biomaRt
listFilters(ensembl) %>%
filter(str_detect(name, "ensembl"))
# so we will now set the filter type accordingly
filterType <- "ensembl_gene_id_version"
# from our counts matrix, get the list of gene IDs
filterValues <- rownames(cts)
# take a look at the available attributes (first 20)
listAttributes(ensembl) %>%
head(20)
# we will retrieve the Ensembl gene ID (with and without version), gene name, biotype, and description
attributeNames <- c('ensembl_gene_id',
'ensembl_gene_id_version',
'external_gene_name',
'gene_biotype',
'description')
# get the annotation
annot <- getBM(
attributes = attributeNames,
filters = filterType,
values = filterValues,
mart = ensembl
)
saveRDS(annot, file = "results/counts/gene_annotation.rds")
# also save as tsv for future use
write.table(
annot,
file = "data/mart.tsv",
sep = "\t",
quote = FALSE,
row.names = FALSE
)
# save a simplified annotation file with only essential columns
annot %>%
select(ensembl_gene_id, ensembl_gene_id_version, external_gene_name) %>%
write.table(
file = "data/annot.tsv",
sep = "\t",
quote = FALSE,
row.names = FALSE
)
The objects loaded here form the foundation for the entire workflow.
cts contains raw gene counts from featureCounts.
coldata contains the metadata that describes each sample
and defines the comparison groups. mart and
annot contain functional information for each gene so we
can attach gene symbols and biotypes later in the analysis.
Sample metadata and design
coldata must have row names that match the count matrix
column names. The condition column defines the groups for
DESeq2 (Nat10-CKO vs Nat10flox/flox).
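A quick defensive check (not in the original code) catches mismatches before any DESeq2 object is built:
R
# stop early if metadata rows and count matrix columns are out of sync
stopifnot(all(rownames(coldata) == colnames(cts)))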
Exploratory analysis
Attach annotation and filter to protein coding genes
We first join counts with biotype information from mart
and filter to protein coding genes.
R
cts_tbl <- cts %>%
as.data.frame() %>%
rownames_to_column("ensembl_gene_id_version")
cts_annot <- cts_tbl %>%
left_join(mart, by = "ensembl_gene_id_version")
sum(is.na(cts_annot$gene_biotype))
Joining annotation allows us to move beyond raw Ensembl IDs. This step attaches gene symbols, biotypes, and descriptions so that later plots and tables are interpretable. At this point the dataset is still large and includes many noncoding genes, pseudogenes, and biotypes we won’t analyze further.
Summarize gene biotypes:
R
biotype_df <- cts_annot %>%
dplyr::count(gene_biotype, name = "n") %>%
arrange(desc(n))
ggplot(biotype_df, aes(x = reorder(gene_biotype, n), y = n)) +
geom_col(fill = "steelblue") +
coord_flip() +
theme_minimal() +
labs(
x = "Gene biotype",
y = "Number of genes",
title = "Distribution of gene biotypes in annotation"
)

Visualizing biotype composition gives a sense of how many genes belong to each functional class. RNA-seq quantifies all transcribed loci, but many biotypes (snRNA, TEC, pseudogenes) are not suitable for differential expression with DESeq2 because they tend to be lowly expressed or unstable. This motivates filtering to protein-coding genes for QC and visualization.
Filter to protein coding genes and lowly expressed genes:
R
cts_coding <- cts_annot %>%
filter(gene_biotype == "protein_coding") %>%
filter(rowSums(across(all_of(rownames(coldata)))) > 5) %>%
select(ensembl_gene_id_version, all_of(rownames(coldata))) %>%
column_to_rownames("ensembl_gene_id_version")
dim(cts_coding)
[1] 15839 6
Why filter protein coding genes and low expression
Noncoding biotypes (lncRNA, pseudogenes, etc.) can be biologically important, but here we restrict the exploratory QC to protein coding genes to keep the analysis focused. Filtering lowly expressed genes reduces noise: genes with near-zero counts across all samples inflate variance estimates and obscure real biological structure. The threshold of 5 total counts is a conservative filter that keeps genes with minimal but genuine expression while removing background noise.
Library size differences
We summarize total counts per sample for protein coding genes.
R
libSize <- colSums(cts_coding) %>%
as.data.frame() %>%
rownames_to_column("sample") %>%
rename(total_counts = 2) %>%
mutate(total_millions = total_counts / 1e6)
ggplot(libSize, aes(x = sample, y = total_counts)) +
geom_col(fill = "steelblue") +
theme_minimal() +
labs(
x = "Sample",
y = "Total assigned reads",
title = "Total reads per sample"
) +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
)

Library size differences affect normalization. Although DESeq2 corrects for this through size factors, inspecting raw library sizes ensures no sample has abnormally low sequencing depth, which could compromise downstream analyses.
Create DESeq2 object for exploratory plots
R
dds <- DESeqDataSetFromMatrix(
countData = cts_coding,
colData = coldata,
design = ~ condition
)
dds <- estimateSizeFactors(dds)
Inspect size factors vs library size:
R
ggplot(
data.frame(
libSize = colSums(assay(dds)),
sizeFactor = sizeFactors(dds),
Group = dds$condition
),
aes(x = libSize, y = sizeFactor, col = Group)
) +
geom_point(size = 5) +
theme_bw() +
labs(x = "Library size", y = "Size factor")

Variance stabilization and mean SD plots
Raw count data have a strong mean–variance dependency: variance grows with the mean, so the most highly expressed genes dominate Euclidean distances and PCA computed on raw counts. VST removes this dependency so that downstream clustering reflects biology rather than count scale.
Inspect mean vs standard deviation plot before transformation:
R
meanSdPlot(assay(dds), ranks = FALSE)

Apply variance stabilizing transform (VST):
R
vsd <- DESeq2::vst(dds, blind = TRUE)
meanSdPlot(assay(vsd), ranks = FALSE)

Why use VST
Counts have mean-dependent variance. The variance stabilizing transform (VST) makes the variance roughly constant across the expression range, which improves distance-based methods and PCA.
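For small experiments, DESeq2’s rlog is a common alternative to VST; a sketch (not used downstream in this lesson):
R
# regularized log transform; slower than vst() but can behave better with few samples
rld <- rlog(dds, blind = TRUE)
meanSdPlot(assay(rld), ranks = FALSE)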
Sample to sample distances
The distance heatmap helps answer a key question: Do samples group by biological condition or by an unwanted variable (batch, sequencing run, contamination)? This is one of the fastest ways to detect outliers.
R
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- colnames(vsd)
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)
pheatmap(
sampleDistMatrix,
clustering_distance_rows = sampleDists,
clustering_distance_cols = sampleDists,
col = colors
)

PCA on top variable genes
PCA projects samples into low-dimensional space and reveals dominant sources of variation. Ideally, samples separate by biological condition along one of the top PCs. If not, the signal from the condition may be weak or confounded by other factors.
R
rv <- rowVars(assay(vsd))
select <- order(rv, decreasing = TRUE)[seq_len(min(500, length(rv)))]
pca <- prcomp(t(assay(vsd)[select, ]))
percentVar <- pca$sdev ^ 2 / sum(pca$sdev ^ 2)
intgroup <- "condition"
intgroup.df <- as.data.frame(colData(vsd)[, intgroup, drop = FALSE])
group <- if (length(intgroup) == 1) {
factor(apply(intgroup.df, 1, paste, collapse = " : "))
}
d <- data.frame(
PC1 = pca$x[, 1],
PC2 = pca$x[, 2],
intgroup.df,
name = colnames(vsd)
)
ggplot(d, aes(PC1, PC2, color = condition)) +
scale_color_manual(values = c(
"Nat10-CKO" = "#00b462",
"Nat10flox/flox" = "#980c80"
)) +
theme_bw() +
theme(legend.title = element_blank()) +
geom_point(size = 2, stroke = 2) +
geom_text_repel(aes(label = name)) +
xlab(paste("PC1", round(percentVar[1] * 100, 2), "% variance")) +
ylab(paste("PC2", round(percentVar[2] * 100, 2), "% variance"))

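For quick checks, DESeq2 also ships a one-line helper that produces a comparable plot to the manual PCA above:
R
DESeq2::plotPCA(vsd, intgroup = "condition")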
Exercise: interpret exploratory plots
Using the library size, distance heatmap, and PCA:
- Are there any obvious outlier samples?
- Do replicates from the same condition cluster together?
- Does any sample look inconsistent with its metadata label?
Example interpretation:
- Library sizes are comparable across samples and fall within a narrow range.
- Samples cluster by condition in both distance heatmaps and PCA.
- No sample appears as a clear outlier relative to its group.
Differential gene expression analysis
For DE testing we use the filtered protein coding count matrix
(cts_coding), so the genes tested match those examined in the
exploratory analysis above.
R
dds <- DESeqDataSetFromMatrix(
countData = cts_coding,
colData = coldata,
design = ~ condition
)
You can either run the individual DESeq2 steps:
R
dds <- estimateSizeFactors(dds)
dds <- estimateDispersions(dds)
dds <- nbinomWaldTest(dds)
The three steps above can be run together using the DESeq
function:
R
dds <- DESeq(dds)
estimating size factors
Note: levels of factors in the design contain characters other than
letters, numbers, '_' and '.'. It is recommended (but not required) to use
only letters, numbers, and delimiters '_' or '.', as these are safe characters
for column names in R. [This is a message, not a warning or an error]
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
Note: levels of factors in the design contain characters other than
letters, numbers, '_' and '.'. It is recommended (but not required) to use
only letters, numbers, and delimiters '_' or '.', as these are safe characters
for column names in R. [This is a message, not a warning or an error]
final dispersion estimates
fitting model and testing
Inspect dispersion estimates:
R
plotDispEsts(dds)

Obtain results for the contrast of interest
(Nat10flox/flox vs Nat10-CKO):
R
res <-
results(
dds,
contrast = c(
"condition",
"Nat10flox/flox",
"Nat10-CKO"
)
)
summary(res)
TXT
out of 15839 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 3195, 20%
LFC < 0 (down) : 3398, 21%
outliers [1] : 362, 2.3%
low counts [2] : 308, 1.9%
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
Order the results by p value to see the top DE genes:
R
head(res[order(res$pvalue), ])
log2 fold change (MLE): condition Nat10flox/flox vs Nat10-CKO
Wald test p-value: condition Nat10flox.flox vs Nat10.CKO
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
18512 4645.33 -5.17453 0.1335387 -38.7493 0.00000e+00 0.00000e+00
940 5041.68 -2.88015 0.1039939 -27.6953 7.94364e-169 6.02485e-165
13789 14362.27 -2.40229 0.0951377 -25.2506 1.11514e-140 5.63852e-137
16183 42766.77 -4.68585 0.1923780 -24.3575 4.82857e-131 1.83111e-127
18278 7055.06 -3.48409 0.1535305 -22.6932 5.23170e-114 1.58719e-110
270 1469.83 -2.87951 0.1298316 -22.1788 5.50606e-109 1.39202e-105
Each row represents one gene. Key columns:
- baseMean: average normalized count across all samples
- log2FoldChange: direction and magnitude of change between groups
- pvalue / padj: statistical significance after multiple testing correction
Genes with NA values often have too few reads to estimate dispersion and should not be interpreted.
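A quick count of such genes (a small check, not in the original code):
R
# genes with NA padj were flagged as count outliers or removed by independent filtering
sum(is.na(res$padj))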
Create a summary table:
R
log2fc_cut <- log2(1.5)
res_df <- as.data.frame(res)
summary_table <- tibble(
total_genes = nrow(res_df),
sig = sum(res_df$padj < 0.05, na.rm = TRUE),
up = sum(res_df$padj < 0.05 & res_df$log2FoldChange > log2fc_cut, na.rm = TRUE),
down= sum(res_df$padj < 0.05 & res_df$log2FoldChange < -log2fc_cut, na.rm = TRUE)
)
print(summary_table)
# A tibble: 1 × 4
total_genes sig up down
<int> <int> <int> <int>
1 15839 5658 2180 2363
Keep gene IDs:
R
res_df <- as.data.frame(res)
res_df$ensembl_gene_id_version <- rownames(res_df)
Joining results with expression and annotation
Get normalized counts:
R
norm_counts <- counts(dds, normalized = TRUE)
norm_counts_df <- as.data.frame(norm_counts)
norm_counts_df$ensembl_gene_id_version <- rownames(norm_counts_df)
Join DE results, normalized counts, and annotation:
R
res_annot <- res_df %>%
dplyr::left_join(norm_counts_df,
by = "ensembl_gene_id_version") %>%
dplyr::left_join(mart,
by = "ensembl_gene_id_version")
Joining normalized counts and annotation makes the output
biologically interpretable. The final table becomes a comprehensive
results object containing:
- gene identifiers
- gene symbols
- functional descriptions
- differential expression statistics
- normalized expression values
This is the format most researchers expect when examining results or
importing them into downstream tools.
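A quick peek at the most-used columns (illustrative only):
R
res_annot %>%
dplyr::select(external_gene_name, log2FoldChange, padj, gene_biotype) %>%
head()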
Define significance labels for plotting:
R
log2fc_cut <- log2(1.5)
res_annot <- res_annot %>%
dplyr::mutate(
label = dplyr::coalesce(external_gene_name, ensembl_gene_id_version),
sig = dplyr::case_when(
padj <= 0.05 & log2FoldChange >= log2fc_cut ~ "up",
padj <= 0.05 & log2FoldChange <= -log2fc_cut ~ "down",
TRUE ~ "ns"
)
)
Visualization: volcano plot
The volcano plot summarizes statistical significance
(padj) and effect size (log2FoldChange)
simultaneously. Genes far left or right show strong expression
differences. Genes high on the plot pass stringent significance
thresholds. Labels help identify the most biologically meaningful
genes.
R
ggplot(
res_annot,
aes(
x = log2FoldChange,
y = -log10(padj),
col = sig,
label = label
)
) +
geom_point(alpha = 0.6) +
scale_color_manual(values = c(
"up" = "firebrick",
"down" = "dodgerblue3",
"ns" = "grey70"
)) +
geom_text_repel(
data = dplyr::filter(res_annot, sig != "ns"),
max.overlaps = 12,
min.segment.length = Inf,
box.padding = 0.3,
seed = 42,
show.legend = FALSE
) +
theme_classic() +
xlab("log2 fold change") +
ylab("-log10 adjusted p value") +
ggtitle("Volcano plot: Nat10-CKO vs Nat10flox/flox")

Interpreting the volcano
Points to the right are higher in Nat10flox/flox, to the
left are higher in Nat10-CKO. Points near the top have
stronger statistical support (small adjusted p values). Colored points
mark genes that pass both fold change and padj thresholds.
Saving results
Saving both the full table and the significant subset allows flexibility:
- The full table is needed for pathway analyses and reproducibility.
- The subset table focuses on interpretable and biologically meaningful DE genes.
R
readr::write_tsv(
res_annot,
"results/deseq2/DESeq2_results_joined.tsv"
)
Filter to significant genes (padj ≤ 0.05 and fold change ≥ 1.5):
R
sig_res <- res_annot %>%
dplyr::filter(
padj <= 0.05,
abs(log2FoldChange) >= log2fc_cut
)
readr::write_tsv(
sig_res,
"results/deseq2/DESeq2_results_sig.tsv"
)
Exercise: explore top DE genes
Using sig_res:
- Which genes have the largest positive log2 fold change?
- Which genes have the largest negative log2 fold change?
- Are any of the top genes known to be related to Nat10 function or the studied phenotype?
Example interpretation:
- Several strongly up regulated genes are involved in cell cycle and RNA processing.
- Strongly down regulated genes may include transcription factors and signaling components.
- Cross referencing top hits with prior literature or databases can reveal mechanistic hypotheses.
Summary
- Counts and sample metadata must be aligned before building DESeq2 objects.
- Biotype annotation allows restriction to protein coding genes for exploratory QC.
- Variance stabilizing transform, distance heatmaps, and PCA help detect outliers and batch effects.
- DESeq2 provides a complete pipeline from raw counts to differential expression.
- Joining DE results with normalized counts and gene annotation yields a useful final table.
- Volcano plots summarize differential expression using fold change and adjusted p values.
Content from Gene set enrichment analysis
Last updated on 2025-11-25 | Edit this page
Overview
Questions
- How do we identify functional pathways and biological themes from DE genes?
- How do we run over-representation analysis for GO, KEGG, and MSigDB sets?
- How do we create simple visualizations of enriched terms?
- How do we interpret enrichment results in a biologically meaningful way?
Objectives
- Prepare a gene list from DESeq2 results.
- Run GO enrichment using clusterProfiler.
- Run KEGG pathway enrichment with organism specific data.
- Perform MSigDB Hallmark enrichment.
- Produce barplots and dotplots for enriched terms.
Introduction
Differential expression gives a list of up-regulated and down-regulated genes.
To understand the underlying biology, we examine which
functional gene sets are enriched.
This process is called over-representation analysis (ORA).
ORA identifies pathways or gene sets that contain more DE genes
than expected by chance.
Here we will use clusterProfiler, org.Mm.eg.db, and msigdbr to run GO, KEGG, and Hallmark analyses on mouse data.
What ORA requires
- A background universe of all tested genes (usually all genes in the DESeq2 dataset).
- A subset of genes considered significant (for example padj < 0.05).
- A gene identifier type that matches the annotation database (Ensembl or Entrez).
Step 1: Load DE results and prepare gene lists
Start your R session:
BASH
sinteractive -A rcac -p cpu -N 1 -n 4 --time=02:00:00
module load biocontainers
module load r-rnaseq
R
Load data:
R
library(DESeq2)
library(clusterProfiler)
library(org.Mm.eg.db)
library(msigdbr)
library(dplyr)
library(ggplot2)
library(readr)
Load the annotated DESeq2 results table saved in the previous episode:
R
# run from the rnaseq-workshop directory so the relative path resolves
res <- read_tsv("results/deseq2/DESeq2_results_joined.tsv", show_col_types = FALSE)
Define:
- universe = all genes tested
- significant set = padj < 0.05
R
universe <- res$ensembl_gene_id
sig <- res %>% filter(padj < 0.05) %>% pull(ensembl_gene_id)
Convert Ensembl IDs to Entrez (required for GO and KEGG):
R
id_map <- bitr(universe,
fromType = "ENSEMBL",
toType = "ENTREZID",
OrgDb = org.Mm.eg.db)
Merge IDs into results:
R
res2 <- left_join(res, id_map, by = c("ensembl_gene_id" = "ENSEMBL"))
sig_entrez <- res2 %>% filter(padj < 0.05, !is.na(ENTREZID)) %>% pull(ENTREZID)
universe_entrez <- unique(na.omit(res2$ENTREZID))
Step 2: GO enrichment
Run GO Biological Process ORA:
R
ego <- enrichGO(gene = sig_entrez,
universe = universe_entrez,
OrgDb = org.Mm.eg.db,
ont = "BP",
readable = TRUE)
View top terms:
R
head(ego)
Plot:
R
dotplot(ego, showCategory = 20) +
ggtitle("GO Biological Process enrichment")
Why GO is useful
GO BP terms provide high-level summaries of biological activity.
Strongly enriched terms (for example mitochondrial or translation-related)
often correspond to global shifts in cell state or stress.
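GO BP results are often redundant; clusterProfiler’s simplify() can collapse semantically similar terms (a sketch; the 0.7 cutoff is a tunable similarity threshold):
R
# drop GO terms with high semantic similarity, keeping the most significant of each cluster
ego_simplified <- clusterProfiler::simplify(ego, cutoff = 0.7, by = "p.adjust")
dotplot(ego_simplified, showCategory = 15)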
Step 3: KEGG enrichment
KEGG requires Entrez IDs and organism code "mmu" for
mouse.
R
ekegg <- enrichKEGG(gene = sig_entrez,
universe = universe_entrez,
organism = "mmu")
Plot KEGG enrichment:
R
barplot(ekegg, showCategory = 15)
KEGG caveat
KEGG Entrez annotations may not perfectly match recent Ensembl builds.
If a gene is missing from KEGG, it is silently dropped from the analysis.
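One way to gauge that loss, assuming res2 from Step 1 is still in the session:
R
# fraction of tested genes that lack an Entrez mapping
mean(is.na(res2$ENTREZID))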
Step 4: MSigDB Hallmark enrichment
Load Hallmark gene sets for mouse:
R
hallmark <- msigdbr(species = "Mus musculus", category = "H")
# enricher() expects a two-column data frame: gene set name, then gene ID
hallmark_t2g <- hallmark %>% dplyr::select(gs_name, entrez_gene)
Run ORA:
R
ehall <- enricher(gene = sig_entrez,
universe = universe_entrez,
TERM2GENE = hallmark_t2g)
Plot:
R
dotplot(ehall, showCategory = 20) +
ggtitle("MSigDB Hallmark enrichment")
Step 5: Saving results
R
write_csv(as.data.frame(ego), "GO_enrichment.csv")
write_csv(as.data.frame(ekegg), "KEGG_enrichment.csv")
write_csv(as.data.frame(ehall), "Hallmark_enrichment.csv")
Step 6: Interpretation guidelines
Meaningful enrichment patterns should:
- involve pathways relevant to your experimental design
- contain multiple DE genes in the same direction
- show consistent themes across GO, KEGG, and Hallmark sets
Pathways enriched only by one DE gene or very small sets should be interpreted cautiously.
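One way to screen such terms is to filter the result table by its Count column, as in this sketch using the GO result:
R
# keep only terms supported by at least 3 DE genes
as.data.frame(ego) %>%
dplyr::filter(Count >= 3) %>%
dplyr::select(Description, GeneRatio, Count, p.adjust) %>%
head(10)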
Exercise: interpret enrichment results
Using your GO, KEGG, and Hallmark outputs:
- Identify the pathways with the smallest adjusted p values.
- Are the enriched pathways consistent with the expected phenotype?
- Which pathway collections (GO, KEGG, Hallmark) give the clearest signal?
Example interpretation:
- Several GO terms related to RNA processing and cell cycle appear enriched.
- Hallmark sets provide clearer summaries of biological themes compared to GO.
- KEGG pathways show fewer hits but corroborate the GO findings.
Summary
- ORA requires a universe and a significant gene set.
- Ensembl IDs must be converted to Entrez for GO and KEGG.
- clusterProfiler provides GO BP and KEGG ORA.
- Hallmark sets offer high quality curated pathways for interpretation.
- Dotplots and barplots are effective for visualizing enrichment results.