B. Differential expression using DESeq2 (Kallisto pathway)

Last updated on 2026-02-24 | Edit this page

Estimated time: 85 minutes

Overview

Questions

How do we import transcript-level quantification from Kallisto into DESeq2?
What exploratory analyses should we perform before differential expression testing?
How do we perform differential expression analysis with DESeq2 using tximport data?
How do we visualize and interpret DE results from transcript-based quantification?
What are the key differences between genome-based and transcript-based DE workflows?

Objectives

Load tximport data from Kallisto quantification into DESeq2.
Perform quality control visualizations (boxplots, density plots, PCA).
Apply variance stabilizing transformation for exploratory analysis.
Run differential expression analysis with appropriate contrasts.
Create volcano plots to visualize DE results.
Export annotated DE results for downstream analysis.

Attribution

This section is adapted from materials developed by Michael Gribskov, Professor of Computational Genomics & Systems Biology at Purdue University.

Original and related materials are available via the CGSB Wiki: https://cgsb.miraheze.org/wiki/Main_Page

Introduction

In Episode 04b, we quantified transcript expression using Kallisto and summarized the results to gene-level counts using tximport. In this episode, we use those gene-level estimates for differential expression analysis with DESeq2.

The workflow follows the same general pattern as Episode 05 (genome-based), but with important differences:

Input data: We use the txi object from tximport rather than raw counts from featureCounts.
Count handling: Kallisto estimates are model-based (not integer counts), and DESeq2 handles this appropriately via DESeqDataSetFromTximport().
Bootstrap support: Kallisto’s uncertainty estimates (from bootstraps) can be used with sleuth for transcript-level DE.

Callout

When to use this pathway

Use this transcript-based workflow when:

You quantified with Salmon, Kallisto, or similar tools.
You want to leverage transcript-level bias corrections.
You did not generate BAM files and cannot use featureCounts.

The genome-based workflow (Episode 05) is preferred when you have BAM files and need splice junction information or plan to visualize alignments.

Prerequisite

What you need for this episode

txi.rds file generated from tximport in Episode 04b
A samples.csv file describing the experimental groups
RStudio session via Open OnDemand

If you haven’t created the sample metadata file, create scripts/samples.csv with:

sample,condition
WT_Bcell_mock_rep1,WT_mock
WT_Bcell_mock_rep2,WT_mock
WT_Bcell_mock_rep3,WT_mock
WT_Bcell_mock_rep4,WT_mock
WT_Bcell_IR_rep1,WT_IR
WT_Bcell_IR_rep2,WT_IR
WT_Bcell_IR_rep3,WT_IR
WT_Bcell_IR_rep4,WT_IR

Create a results directory:

BASH

mkdir -p results/deseq2_kallisto

Step 1: Load packages and data

Start your RStudio session via Open OnDemand as described in Episode 05, then load the required packages:

R

library(DESeq2)
library(tximportData)
library(ggplot2)
library(reshape2)
library(pheatmap)
library(RColorBrewer)
library(ggrepel)
library(readr)
library(dplyr)
# Construct the path dynamically
work_dir <- file.path("/scratch/negishi", Sys.getenv("USER"), "rnaseq-workshop")
setwd(work_dir)

Load the tximport object created in Episode 04b:

R

txi <- readRDS("results/kallisto_quant/txi.rds")

Examine the structure of the tximport object:

R

names(txi)

[1] "abundance"           "counts"              "infReps"             "length"              "countsFromAbundance"

R

head(txi$counts)

                      WT_Bcell_IR_rep1 WT_Bcell_IR_rep2 WT_Bcell_IR_rep3 WT_Bcell_IR_rep4 WT_Bcell_mock_rep1 WT_Bcell_mock_rep2 WT_Bcell_mock_rep3
ENSMUSG00000000001.5         387.39815         329.4356     737.97311964        654.17550           687.1189          620.35065          661.35411
ENSMUSG00000000003.16          0.00000           1.0000       0.00000000          0.00000             0.0000            0.00000            0.00000
ENSMUSG00000000028.16         32.81579          20.0000      36.63144821         29.71743            53.5750           36.92104           39.00000
ENSMUSG00000000031.20          0.00000           0.0000       0.09733145          0.00000             0.0000            0.00000            0.00000
ENSMUSG00000000037.18          2.00000           3.0000       1.28018005          8.00000             1.0000            0.00000            1.58711
ENSMUSG00000000049.12          0.00000           0.0000       0.00000000          0.00000             0.0000            0.00000            0.00000
                      WT_Bcell_mock_rep4
ENSMUSG00000000001.5            704.8851
ENSMUSG00000000003.16             0.0000
ENSMUSG00000000028.16            36.0000
ENSMUSG00000000031.20             0.0000
ENSMUSG00000000037.18             0.0000
ENSMUSG00000000049.12             0.0000

Callout

Understanding the tximport object

The txi object contains several components:

counts: Gene-level estimated counts (used by DESeq2).
abundance: Gene-level TPM values (within-sample normalized).
length: Average transcript length per gene (used for length bias correction).
countsFromAbundance: Method used to generate counts (default: "no").

When using DESeqDataSetFromTximport(), the length matrix is automatically used to correct for gene-length bias during normalization. This is why we use the default countsFromAbundance = "no" in Episode 04b—DESeq2 handles length correction internally, so pre-scaling counts would apply the correction twice.

Load sample metadata:

R

coldata <- read.csv(
    "scripts/samples.csv",
    row.names = 1,
    header = TRUE,
    stringsAsFactors = TRUE
)
coldata$condition <- as.factor(coldata$condition)
coldata <- coldata[colnames(txi$counts), , drop = FALSE]
coldata

                   condition
WT_Bcell_IR_rep1       WT_IR
WT_Bcell_IR_rep2       WT_IR
WT_Bcell_IR_rep3       WT_IR
WT_Bcell_IR_rep4       WT_IR
WT_Bcell_mock_rep1   WT_mock
WT_Bcell_mock_rep2   WT_mock
WT_Bcell_mock_rep3   WT_mock
WT_Bcell_mock_rep4   WT_mock

Verify that sample names match between tximport and metadata:

R

all(colnames(txi$counts) == rownames(coldata))

[1] TRUE

Step 2: Create DESeq2 object from tximport

The key difference from the genome-based workflow is using DESeqDataSetFromTximport() instead of DESeqDataSetFromMatrix():

R

dds <- DESeqDataSetFromTximport(
    txi,
    colData = coldata,
    design = ~ condition
)

Callout

Why use DESeqDataSetFromTximport?

This function:

Automatically handles non-integer counts from Salmon/Kallisto.
Preserves transcript length information for accurate normalization.
Incorporates the txi$length matrix to account for gene-length bias.

Using DESeqDataSetFromMatrix() with tximport counts would lose this information.

Filter lowly expressed genes using group-aware filtering:

R

# Group-aware filtering: keep genes with >= 10 counts in at least 4 samples
# (the size of the smallest experimental group)
min_samples <- 4
min_counts <- 10
keep <- rowSums(counts(dds) >= min_counts) >= min_samples
dds <- dds[keep, ]
dim(dds)

[1] 18924     8

Callout

Why use group-aware filtering?

A simple sum filter (rowSums(counts(dds)) >= 10) can be problematic:

A gene with 10 total counts across 8 samples averages ~1.25 counts/sample—too low to be informative.
It may remove genes expressed in only one condition (biologically interesting!).

Group-aware filtering (rowSums(counts >= threshold) >= min_group_size) ensures:

Each kept gene has meaningful expression in at least one experimental group.
Genes with condition-specific expression are retained.
The threshold is interpretable (e.g., “at least 10 counts in at least 4 samples”).

Estimate size factors:

R

dds <- estimateSizeFactors(dds)
head(normalizationFactors(dds))

                      WT_Bcell_IR_rep1 WT_Bcell_IR_rep2 WT_Bcell_IR_rep3 WT_Bcell_IR_rep4 WT_Bcell_mock_rep1 WT_Bcell_mock_rep2 WT_Bcell_mock_rep3
ENSMUSG00000000001.5         0.8578693        0.8520842         1.099577        1.0524525           1.059461          0.9946634          0.9552297
ENSMUSG00000000028.16        0.9455555        0.7479666         1.149805        0.9882356           1.071439          1.0191038          0.9154327
ENSMUSG00000000056.8         0.8580737        0.8528464         1.101744        1.0540656           1.058546          0.9930345          0.9540278
ENSMUSG00000000078.8         0.8580503        0.8527592         1.101496        1.0538811           1.058650          0.9932207          0.9541651
ENSMUSG00000000085.17        0.7941016        0.7697798         1.199288        1.0181546           1.163680          0.9580967          1.0675296
ENSMUSG00000000088.8         0.8537054        0.8364591         1.055812        1.0199231           1.078562          1.0286198          0.9803416
                      WT_Bcell_mock_rep4
ENSMUSG00000000001.5            1.174355
ENSMUSG00000000028.16           1.244899
ENSMUSG00000000056.8            1.173333
ENSMUSG00000000078.8            1.173450
ENSMUSG00000000085.17           1.125634
ENSMUSG00000000088.8            1.195682

Step 3: Exploratory data analysis

Raw count distributions

Visualize the distribution of raw counts across samples:

R

counts_melted <- melt(
    log10(counts(dds) + 1),
    varnames = c("gene", "sample"),
    value.name = "log10_count"
)

ggplot(counts_melted, aes(x = sample, y = log10_count, fill = sample)) +
    geom_boxplot(show.legend = FALSE) +
    theme_minimal() +
    labs(
        x = "Sample",
        y = "log10(count + 1)",
        title = "Raw count distributions (Kallisto)"
    ) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

Density plot of raw counts:

R

ggplot(counts_melted, aes(x = log10_count, fill = sample)) +
    geom_density(alpha = 0.3) +
    theme_minimal() +
    labs(
        x = "log10(count + 1)",
        y = "Density",
        title = "Count density distributions"
    )

Variance stabilizing transformation

Apply VST for exploratory analysis:

R

vsd <- vst(dds, blind = TRUE)

Callout

Why use VST?

Raw counts have a strong mean-variance relationship: highly expressed genes have higher variance. VST removes this dependency, making distance-based methods (PCA, clustering) more reliable.

Setting blind = TRUE ensures the transformation is not influenced by the experimental design, which is appropriate for quality control.

Sample-to-sample distances

R

sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- colnames(vsd)
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)

pheatmap(
    sampleDistMatrix,
    clustering_distance_rows = sampleDists,
    clustering_distance_cols = sampleDists,
    col = colors,
    main = "Sample distance heatmap (Kallisto)"
)

The distance heatmap shows how similar samples are to each other. Samples from the same condition should cluster together.

PCA plot

R

pcaData <- plotPCA(vsd, intgroup = "condition", returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))

ggplot(pcaData, aes(PC1, PC2, color = condition)) +
    geom_point(size = 4) +
    geom_text_repel(aes(label = name)) +
    xlab(paste0("PC1: ", percentVar[1], "% variance")) +
    ylab(paste0("PC2: ", percentVar[2], "% variance")) +
    theme_bw() +
    ggtitle("PCA of Kallisto-quantified samples")

Challenge

Exercise: Interpret exploratory plots

Using the distance heatmap and PCA plot:

Do samples cluster by experimental condition?
Are there any outlier samples that don’t group with their replicates?
How much variance is explained by PC1? What might this represent biologically?

Show me the solution

Example interpretation:

Samples should cluster by condition (WT_mock vs WT_IR) in both the heatmap and PCA.
If all replicates cluster together, there are no obvious outliers.
PC1 typically captures the largest source of variation. If it separates conditions, the experimental treatment is the dominant signal.

Step 4: Differential expression analysis

Run the full DESeq2 pipeline:

R

dds <- DESeq(dds)

estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing

Inspect dispersion estimates:

R

plotDispEsts(dds)

Volcano plot of differential expression results

Callout

Interpreting the dispersion plot

Black dots: Gene-wise dispersion estimates.
Red line: Fitted trend (shrinkage target).
Blue dots: Final shrunken estimates.

A good fit shows the red line passing through the center of the black cloud, with blue dots closer to the line than the original black dots.

Extract results for the contrast of interest:

R

res <- results(
    dds,
    contrast = c("condition", "WT_mock", "WT_IR")
)

summary(res)

out of 19693 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 3482, 18%
LFC < 0 (down)     : 3675, 19%
outliers [1]       : 10, 0.051%
low counts [2]     : 0, 0%
(mean count < 6)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Order results by adjusted p-value:

R

res_ordered <- res[order(res$padj), ]
head(res_ordered)

Output:

log2 fold change (MLE): condition WT_IR vs WT_mock
Wald test p-value: condition WT_IR vs WT_mock
DataFrame with 6 rows and 6 columns
                       baseMean log2FoldChange     lfcSE      stat       pvalue         padj
                      <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
ENSMUSG00000004085.15   795.499        4.10779  0.157888   26.0172 3.16645e-149 6.23252e-145
ENSMUSG00000021701.9    840.251        6.34293  0.253829   24.9889 8.06388e-138 7.93607e-134
ENSMUSG00000072825.13   705.733        4.70979  0.190927   24.6680 2.35896e-134 1.54771e-130
ENSMUSG00000048458.9    820.666        5.79808  0.238203   24.3409 7.24251e-131 3.56386e-127
ENSMUSG00000028893.9    680.824        4.28171  0.179531   23.8495 1.02517e-125 4.03570e-122
ENSMUSG00000075122.6    836.194        3.74860  0.158501   23.6502 1.17375e-123 3.85048e-120

Apply log fold change shrinkage

LFC shrinkage improves estimates for genes with low counts or high dispersion.

First, check the available coefficients:

R

resultsNames(dds)

[1] "Intercept"        "condition_WT_mock_vs_WT_IR"

Apply shrinkage using apeglm (recommended for standard contrasts):

R

res_shrunk <- lfcShrink(
    dds,
    coef = "condition_WT_mock_vs_WT_IR",
    type = "apeglm"
)

Callout

Why shrink log fold changes?

Genes with low counts can have unreliably large fold changes. Shrinkage:

Reduces noise in LFC estimates.
Improves ranking for downstream analyses (e.g., GSEA).
Does not affect p-values or significance calls.

Callout

Choosing a shrinkage method

apeglm (recommended): Fast, well-calibrated, uses coefficient name (coef).
ashr: Required for complex contrasts that can’t be specified with coef (e.g., interaction terms).
normal: Legacy method, generally not recommended.

For simple two-group comparisons like ours, apeglm is preferred.

Step 5: Summarize and visualize results

Create a summary table:

R

log2fc_cut <- log2(1.5)

res_df <- as.data.frame(res_shrunk)
res_df$gene_id <- rownames(res_df)

summary_table <- tibble(
    total_genes = nrow(res_df),
    sig = sum(res_df$padj < 0.05, na.rm = TRUE),
    up = sum(res_df$padj < 0.05 & res_df$log2FoldChange > log2fc_cut, na.rm = TRUE),
    down = sum(res_df$padj < 0.05 & res_df$log2FoldChange < -log2fc_cut, na.rm = TRUE)
)

print(summary_table)

output:

# A tibble: 1 × 4
  total_genes   sig    up  down
        <int> <int> <int> <int>
1       19693  6078  2296  2448

We will attach annotations so that results are interpretable. Load gene annotation data (from Episode 02):

R

mart <-
  read.csv(
    "data/mart.tsv",
    sep = "\t",
    header = TRUE
  )

annot <-
  read.csv(
    "data/annot.tsv",
    sep = "\t",
    header = TRUE
  )

R

res_df <- as.data.frame(res)
res_df$ensembl_gene_id_version <- rownames(res_df)

Join DE results, normalized counts, and annotation:

R

res_annot <- res_df %>%
  dplyr::left_join(mart,
                   by = "ensembl_gene_id_version")

Joining normalized counts and annotation makes the output biologically interpretable. The final table becomes a comprehensive results object containing:

gene identifiers
gene symbols
functional descriptions
differential expression statistics

This is the format most researchers expect when examining results or importing them into downstream tools.

Define significance labels for plotting:

R

log2fc_cut <- log2(1.5)

res_annot <- res_annot %>%
  dplyr::mutate(
    label = dplyr::coalesce(external_gene_name, ensembl_gene_id_version),
    sig = dplyr::case_when(
      padj <= 0.05 & log2FoldChange >=  log2fc_cut ~ "up",
      padj <= 0.05 & log2FoldChange <= -log2fc_cut ~ "down",
      TRUE ~ "ns"
    )
  )

Volcano plot

R

ggplot(
  res_annot,
  aes(
    x     = log2FoldChange,
    y     = -log10(padj),
    col   = sig,
    label = label
  )
) +
  geom_point(alpha = 0.6) +
  scale_color_manual(values = c(
    "up"   = "firebrick",
    "down" = "dodgerblue3",
    "ns"   = "grey70"
  )) +
  geom_text_repel(
    data = dplyr::filter(res_annot, sig != "ns"),
    max.overlaps       = 12,
    min.segment.length = Inf,
    box.padding        = 0.3,
    seed               = 42,
    show.legend        = FALSE
  ) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "grey40") +
  geom_vline(xintercept = c(-log2fc_cut, log2fc_cut), linetype = "dashed", color = "grey40") +
  theme_classic() +
  xlab("log2 fold change") +
  ylab("-log10 adjusted p value") +
  ggtitle("Volcano plot: WT_IR vs WT_mock")

Callout

Interpreting the volcano plot

X-axis: Direction and magnitude of change (positive = higher in WT_IR, i.e., upregulated by radiation).
Y-axis: Statistical significance (-log10 scale, higher = more significant).
Colored points: Genes passing both fold change and significance thresholds.

Challenge

Exercise: Compare with genome-based results

If you also ran Episode 05 (genome-based workflow):

Are the number of DE genes similar between the two approaches?
Do the top DE genes overlap?
Are there systematic differences in fold change estimates?

Show me the solution

Expected observations:

The number of DE genes should be broadly similar, though not identical.
Most top DE genes should overlap between methods.
Kallisto may produce slightly different LFC estimates due to its different quantification algorithm.

Both approaches are valid; consistency between them increases confidence in the results.

Step 6: Save results

Save the full results table:

R

write_tsv(
    res_df,
    "results/deseq2_kallisto/DESeq2_kallisto_results.tsv"
)

Save significant genes only:

R

sig_res <- res_df %>%
    filter(
        padj <= 0.05,
        abs(log2FoldChange) >= log2fc_cut
    )

write_tsv(
    sig_res,
    "results/deseq2_kallisto/DESeq2_kallisto_sig.tsv"
)

Save the DESeq2 object for downstream analysis:

R

saveRDS(dds, "results/deseq2_kallisto/dds_kallisto.rds")

Discussion

Genome-based vs. transcript-based: Which to choose?

Aspect	Genome-based (Ep 05)	Transcript-based (Ep 05b)
Input	BAM files	FASTQ files
Speed	Slower (alignment + counting)	Faster (pseudo-alignment)
Storage	Large (BAM files)	Small (abundance.tsv files)
Bias correction	Limited	Sequence + GC bias
Novel transcripts	Can detect	Cannot detect
Visualization	IGV compatible	No BAM files

For standard differential expression, both methods produce comparable results. Choose based on your specific needs and available resources.

Summary

Key Points

Kallisto output is imported via tximport and loaded with DESeqDataSetFromTximport().
The tximport object preserves transcript length information for accurate normalization.
Exploratory analysis (PCA, distance heatmaps) should precede differential expression testing.
DESeq2 handles the statistical analysis identically to the genome-based workflow.
LFC shrinkage improves fold change estimates for low-count genes.
Results from transcript-based and genome-based workflows should be broadly concordant.