Introduction to RNA-seq
Where are we heading towards in this workshop?
- RNA-seq is a technique for measuring the amount of RNA expressed in a cell or tissue, in a given state and at a given time.
- Many choices have to be made when planning an RNA-seq experiment, such as whether to perform poly-A selection or ribosomal depletion, whether to apply a stranded or an unstranded protocol, and whether to sequence the reads in a single-end or paired-end fashion. Each of these choices has consequences for the processing and interpretation of the data.
- Many approaches exist for quantification of RNA-seq data. Some methods align reads to the genome and count the number of reads overlapping gene loci. Other methods map reads to the transcriptome and use a probabilistic approach to estimate the abundance of each gene or transcript.
- Information about annotated genes can be accessed via several sources, including Ensembl, UCSC and GENCODE.
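To illustrate what a "probabilistic approach" to abundance estimation can look like, here is a minimal sketch of an expectation-maximization (EM) loop that shares ambiguous reads between transcripts. It is a toy model, not the algorithm of any particular tool: the function name `em_abundance` and the input layout are hypothetical, and transcript length and sequence bias are ignored.

```python
def em_abundance(read_compat, n_transcripts, n_iter=200):
    """Estimate relative transcript abundances from read compatibility sets.

    read_compat: list of tuples, each holding the indices of the transcripts
    a read could have come from. Transcript length is ignored (assumption).
    """
    theta = [1.0 / n_transcripts] * n_transcripts  # start from equal abundances
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # E-step: split each read across its compatible transcripts
        # in proportion to the current abundance estimates.
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                expected[t] += theta[t] / total
        # M-step: new abundances are the normalized expected read counts.
        n_reads = sum(expected)
        theta = [e / n_reads for e in expected]
    return theta


# Toy data: 6 reads unique to transcript 0, 2 unique to transcript 1,
# and 4 reads compatible with both.
reads = [(0,)] * 6 + [(1,)] * 2 + [(0, 1)] * 4
abundance = em_abundance(reads, n_transcripts=2)
print([round(a, 3) for a in abundance])  # [0.75, 0.25]
```

In this toy case the ambiguous reads end up shared in the same 3:1 ratio as the uniquely assigned reads, rather than being discarded.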
Downloading and organizing files
- RNA-seq requires three main inputs: FASTQ reads, a reference genome (FASTA), and gene annotation (GTF/GFF).
- In lieu of a reference genome and annotation, transcript sequences (FASTA) can be used for transcript-level quantification.
- Keeping files compressed and well-organized supports reproducible analysis.
- Reference files must match in genome build and version.
- Public data repositories such as GEO and SRA provide raw reads and metadata.
- Tools like `wget` and `fasterq-dump` enable programmatic, reproducible data retrieval.
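One common symptom of mismatched reference files is that the annotation uses sequence names that do not exist in the genome FASTA (for example `1` versus `chr1`). The sketch below shows one way to check this on small text inputs. The helper names are hypothetical, and matching names alone do not prove that build and version agree; it only catches the most obvious mismatch.

```python
def fasta_sequence_names(fasta_text):
    """Return the set of sequence names (first word after '>') in FASTA text."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                names.add(parts[0])
    return names


def gtf_sequence_names(gtf_text):
    """Return the set of sequence names (column 1) in GTF text."""
    names = set()
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        names.add(line.split("\t")[0])
    return names


def missing_from_fasta(fasta_text, gtf_text):
    """List GTF sequence names that are absent from the FASTA."""
    return sorted(gtf_sequence_names(gtf_text) - fasta_sequence_names(fasta_text))


# Toy inputs (made up for illustration).
fasta = ">chr1 toy sequence\nACGT\n>chr2\nGGCC\n"
gtf_ok = 'chr1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g1";\n'
gtf_bad = '1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g1";\n'

print(missing_from_fasta(fasta, gtf_ok))   # []
print(missing_from_fasta(fasta, gtf_bad))  # ['1']
```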
Quality control of RNA-seq reads
- FastQC and MultiQC provide essential diagnostics for RNA-seq data.
- Some FastQC warnings are normal for RNA-seq and do not indicate problems.
- Splice-aware aligners such as STAR soft-clip adapter sequence at read ends, so trimming is usually unnecessary for alignment-based RNA-seq.
- QC helps detect outliers before alignment and quantification.
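As a rough illustration of the kind of per-read quality summary that QC tools report, the following sketch computes the mean Phred score of each read in a FASTQ string. It assumes Phred+33 encoding and unwrapped four-line records, and the function name is hypothetical; it is not how FastQC itself is implemented.

```python
def mean_read_qualities(fastq_text):
    """Return the mean Phred quality of each read in FASTQ text.

    Assumes four lines per record and Phred+33 quality encoding.
    """
    lines = fastq_text.strip().splitlines()
    means = []
    for i in range(0, len(lines), 4):
        quality_string = lines[i + 3]
        scores = [ord(ch) - 33 for ch in quality_string]
        means.append(sum(scores) / len(scores))
    return means


# Toy FASTQ: 'I' encodes Phred 40, '!' encodes Phred 0.
fastq = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!!II\n"
print(mean_read_qualities(fastq))  # [40.0, 20.0]
```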
A. Genome-based quantification (STAR + featureCounts)
- STAR is a fast splice-aware aligner widely used for RNA-seq.
- Genome indexing requires a genome FASTA and, optionally, a GTF annotation to improve splice junction detection.
- Mapping is efficiently performed using SLURM array jobs.
- Alignment statistics (from `Log.final.out` + MultiQC) must be reviewed.
- Strandedness should be inferred using aligned BAM files.
- Final BAM files are ready for downstream counting.
- `featureCounts` is used to obtain gene-level counts for differential expression.
- Exons are counted and summed per gene.
- The output is a simple matrix for downstream statistical analysis.
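The idea of counting reads over exons and summing per gene can be sketched in a few lines. This is a simplified illustration, not the featureCounts implementation: coordinates are treated as closed intervals, strand is ignored, a read is counted at most once per gene, and reads overlapping more than one gene are left unassigned. The function name and tuple layouts are hypothetical.

```python
def count_reads_per_gene(reads, exons):
    """Count reads per gene from exon overlaps.

    reads: list of (chrom, start, end)
    exons: list of (gene_id, chrom, start, end)
    """
    counts = {gene: 0 for gene, _, _, _ in exons}
    for chrom, start, end in reads:
        # Collect every gene with at least one exon overlapping the read.
        hit_genes = {
            gene
            for gene, ex_chrom, ex_start, ex_end in exons
            if ex_chrom == chrom and start <= ex_end and ex_start <= end
        }
        # Assign the read only when the gene is unambiguous.
        if len(hit_genes) == 1:
            counts[hit_genes.pop()] += 1
    return counts


# Toy annotation and reads (made up for illustration).
exons = [
    ("geneA", "chr1", 100, 200),
    ("geneA", "chr1", 300, 400),
    ("geneB", "chr1", 380, 500),
]
reads = [
    ("chr1", 150, 180),  # geneA, first exon
    ("chr1", 190, 310),  # spans both geneA exons, counted once
    ("chr1", 390, 420),  # overlaps geneA and geneB, left unassigned
    ("chr1", 450, 480),  # geneB
    ("chr2", 150, 180),  # no annotated gene
]
print(count_reads_per_gene(reads, exons))  # {'geneA': 2, 'geneB': 1}
```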
B. Transcript-based quantification (Salmon)
- Salmon performs fast, alignment-free transcript quantification.
- A transcriptome FASTA is required for building the index.
- `salmon quant` uses FASTQ files directly with bias correction.
- Transcript-level outputs include TPM and estimated counts.
- `tximport` converts transcript estimates to gene-level counts.
- Gene-level counts from Salmon are suitable for DESeq2.
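To make the relationship between estimated counts, TPM, and gene-level summaries concrete, here is a small sketch. It computes TPM from counts and effective lengths, then sums transcript values per gene using a transcript-to-gene map. The function names and the toy numbers are hypothetical, and the real `tximport` does more than this (for example, it also tracks average transcript length per gene).

```python
def tpm(counts, eff_lengths):
    """Transcripts per million from estimated counts and effective lengths."""
    rates = {tx: counts[tx] / eff_lengths[tx] for tx in counts}
    total = sum(rates.values())
    return {tx: rate / total * 1e6 for tx, rate in rates.items()}


def sum_to_gene(values, tx2gene):
    """Sum transcript-level values into gene-level values."""
    gene_values = {}
    for tx, value in values.items():
        gene = tx2gene[tx]
        gene_values[gene] = gene_values.get(gene, 0.0) + value
    return gene_values


# Toy quantification results (made up for illustration).
est_counts = {"t1": 100, "t2": 300, "t3": 100}
eff_lengths = {"t1": 1000, "t2": 3000, "t3": 500}
tx2gene = {"t1": "geneA", "t2": "geneA", "t3": "geneB"}

tx_tpm = tpm(est_counts, eff_lengths)
gene_counts = sum_to_gene(est_counts, tx2gene)
gene_tpm = sum_to_gene(tx_tpm, tx2gene)
print(gene_counts)  # {'geneA': 400.0, 'geneB': 100.0}
```

Note that `geneA` has four times the counts of `geneB` but the same gene-level TPM in this toy data, because its transcripts are longer.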
Gene-level QC and differential expression (DESeq2)
- Counts and sample metadata must be aligned before building DESeq2 objects.
- Biotype annotation allows restriction to protein coding genes for exploratory QC.
- The variance-stabilizing transformation (VST), distance heatmaps, and PCA help detect outliers and batch effects.
- DESeq2 provides a complete pipeline from raw counts to differential expression.
- Joining DE results with normalized counts and gene annotation yields a useful final table.
- Volcano plots summarize differential expression using fold change and adjusted p-values.
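The logic behind a volcano plot can be sketched without any plotting library: each gene gets an x coordinate (log2 fold change), a y coordinate (negative log10 of the adjusted p-value), and a label. The thresholds used here (|log2FC| of at least 1, adjusted p below 0.05) are illustrative assumptions, not values taken from the workshop, and the function names are hypothetical.

```python
import math


def classify(log2fc, padj, lfc_cut=1.0, alpha=0.05):
    """Label a gene as 'up', 'down', or 'ns' (not significant).

    padj may be None, mirroring genes without an adjusted p-value.
    """
    if padj is None or padj >= alpha or abs(log2fc) < lfc_cut:
        return "ns"
    return "up" if log2fc > 0 else "down"


def volcano_point(log2fc, padj):
    """Return the (x, y) position of a gene on a volcano plot."""
    return (log2fc, -math.log10(padj))


# Toy results table: gene -> (log2 fold change, adjusted p-value).
results = {
    "g1": (2.0, 0.001),
    "g2": (-1.5, 0.01),
    "g3": (0.3, 0.001),  # significant but small effect
    "g4": (3.0, None),   # no adjusted p-value available
}
labels = {gene: classify(lfc, padj) for gene, (lfc, padj) in results.items()}
print(labels)  # {'g1': 'up', 'g2': 'down', 'g3': 'ns', 'g4': 'ns'}
```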
B. Differential expression using DESeq2 (Salmon/Kallisto pathway)
- Salmon/Kallisto output is imported via `tximport` and loaded with `DESeqDataSetFromTximport()`.
- The tximport object preserves transcript length information for accurate normalization.
- Exploratory analysis (PCA, distance heatmaps) should precede differential expression testing.
- DESeq2 handles the statistical analysis identically to the genome-based workflow.
- LFC shrinkage improves fold change estimates for low-count genes.
- Results from transcript-based and genome-based workflows should be broadly concordant.
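One simple way to check concordance between the two workflows is to compare log2 fold changes for the genes they share. The sketch below computes a Pearson correlation and the fraction of genes with the same direction of change. The function names and toy values are hypothetical, and what counts as "broadly concordant" is left to the analyst.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def lfc_concordance(lfc_a, lfc_b):
    """Compare log2 fold changes over the genes present in both workflows."""
    shared = sorted(set(lfc_a) & set(lfc_b))
    r = pearson([lfc_a[g] for g in shared], [lfc_b[g] for g in shared])
    same_sign = sum(1 for g in shared if lfc_a[g] * lfc_b[g] > 0) / len(shared)
    return shared, r, same_sign


# Toy log2 fold changes from the two workflows (made up for illustration).
genome_lfc = {"g1": 2.0, "g2": -1.0, "g3": 0.5, "g4": 3.0}
transcript_lfc = {"g1": 1.8, "g2": -1.2, "g3": 0.4, "g5": 1.0}

shared, r, same_sign = lfc_concordance(genome_lfc, transcript_lfc)
print(shared, round(r, 3), same_sign)
```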
Gene set enrichment analysis
- Over-representation analysis (ORA) tests whether gene sets contain more DE genes than expected by chance, using the hypergeometric test.
- The background universe must include all genes that could have been detected as DE.
- Multiple testing correction (FDR) is essential when testing thousands of gene sets.
- GO provides comprehensive but redundant functional annotation; use `simplify()` to reduce redundancy.
- KEGG provides curated pathway maps but has limited gene coverage and requires Entrez IDs.
- MSigDB Hallmark sets offer 50 high-quality, non-redundant biological signatures.
- Direction-aware analysis (up vs. down separately) often reveals clearer biological signals.
- GSEA avoids arbitrary cutoffs by using the full ranked gene list; it detects subtle but coordinated changes.
- Convergent findings across multiple databases and methods strengthen biological interpretation.
- Always report methods, thresholds, and full results for reproducibility.
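The two statistical ingredients of ORA mentioned above, the hypergeometric test and FDR correction, can be sketched with the standard library alone. This is an illustration under simple assumptions (Benjamini-Hochberg as the FDR method, hypothetical function names and toy numbers), not a replacement for a dedicated enrichment package.

```python
from math import comb


def hypergeom_pvalue(k, K, n, N):
    """Upper-tail probability P(X >= k) of the hypergeometric distribution.

    N: genes in the background universe
    K: universe genes that belong to the gene set
    n: DE genes in the universe
    k: DE genes that belong to the gene set
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example: universe of 20 genes, gene set of 5, 4 DE genes, 2 of them in the set.
p = hypergeom_pvalue(k=2, K=5, n=4, N=20)
print(round(p, 4))  # 0.2487

# Toy p-values for four gene sets.
print([round(q, 4) for q in benjamini_hochberg([0.01, 0.04, 0.03, 0.5])])
# [0.04, 0.0533, 0.0533, 0.5]
```

Note that the p-value depends directly on `N`, which is why the choice of background universe matters.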