RNA-seq in practice: Hands-on workshop on RCAC systems: Key Points

Introduction to RNA-seq: Where are we heading in this workshop?

RNA-seq is a technique for measuring the amount of RNA expressed in a cell or tissue, in a given state and at a given time.
Many choices have to be made when planning an RNA-seq experiment, such as whether to perform poly-A selection or ribosomal depletion, whether to apply a stranded or an unstranded protocol, and whether to sequence the reads in a single-end or paired-end fashion. Each of these choices has consequences for the processing and interpretation of the data.
Many approaches exist for quantification of RNA-seq data. Some methods align reads to the genome and count the number of reads overlapping gene loci. Other methods map reads to the transcriptome and use a probabilistic approach to estimate the abundance of each gene or transcript.
Information about annotated genes can be accessed via several sources, including Ensembl, UCSC and GENCODE.

RNA-seq requires three main inputs: FASTQ reads, a reference genome (FASTA), and gene annotation (GTF/GFF).
In lieu of a reference genome and annotation, transcript sequences (FASTA) can be used for transcript-level quantification.
Keeping files compressed and well-organized supports reproducible analysis.
Reference files must match in genome build and version.
Public data repositories such as GEO and SRA provide raw reads and metadata.
Tools like wget and fasterq-dump enable programmatic, reproducible data retrieval.
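
The point about matching reference files can be checked before any alignment is run. The sketch below is a minimal illustration, not part of the workshop material: it compares the sequence names in a FASTA with the first column of a GTF and reports annotation sequences that the genome lacks (for example `2` versus `chr2`). The function names and the toy records are hypothetical, and matching names alone do not prove that the build and version agree.

```python
def fasta_seqnames(lines):
    """Collect sequence names (first word after '>') from FASTA lines."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_seqnames(lines):
    """Collect sequence names (first tab-separated column) from GTF lines."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        names.add(line.split("\t", 1)[0])
    return names


def missing_from_fasta(fasta_lines, gtf_lines):
    """Return GTF sequence names that do not occur in the FASTA, sorted."""
    return sorted(gtf_seqnames(gtf_lines) - fasta_seqnames(fasta_lines))


# Toy records (made up for illustration): the GTF uses '2' where the FASTA uses 'chr2'.
toy_fasta = [">chr1 example\n", "ACGT\n", ">chr2\n", "GGCC\n"]
toy_gtf = [
    "#!genome-build example\n",
    'chr1\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\n',
    '2\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g2";\n',
]
print(missing_from_fasta(toy_fasta, toy_gtf))
```

With real files, the same functions can be fed open file handles, since they only iterate over lines.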

FastQC and MultiQC provide essential diagnostics for RNA-seq data.
Some FastQC warnings are normal for RNA-seq and do not indicate problems.
Aligners such as STAR soft-clip adapter sequence at read ends, so trimming is usually unnecessary for alignment-based RNA-seq.
QC helps detect outliers before alignment and quantification.
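
As a rough sketch of how "expected" warnings might be separated from ones worth a closer look, the code below reads lines in the layout of a FastQC `summary.txt` (status, module name, file name, tab-separated) and lists WARN/FAIL modules that are not on an allow-list. The allow-list here is an assumption for illustration only; which modules are acceptable depends on the library type and should be decided per project.

```python
# Assumption: modules that commonly warn on RNA-seq libraries; adjust per project.
EXPECTED_FOR_RNASEQ = {"Per base sequence content", "Sequence Duplication Levels"}


def flag_modules(summary_lines, expected=EXPECTED_FOR_RNASEQ):
    """Return (file, module, status) for WARN/FAIL modules not in `expected`."""
    flagged = []
    for line in summary_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # ignore lines that do not have the three expected fields
        status, module, filename = parts
        if status in ("WARN", "FAIL") and module not in expected:
            flagged.append((filename, module, status))
    return flagged


# Toy summary lines (made up for illustration).
toy_summary = [
    "PASS\tBasic Statistics\tsample1.fastq.gz\n",
    "FAIL\tPer base sequence content\tsample1.fastq.gz\n",
    "WARN\tSequence Duplication Levels\tsample1.fastq.gz\n",
    "FAIL\tPer base sequence quality\tsample1.fastq.gz\n",
]
for filename, module, status in flag_modules(toy_summary):
    print(filename, module, status, sep="\t")
```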

STAR is a fast splice-aware aligner widely used for RNA-seq.
Genome indexing requires FASTA and optionally GTF for improved splice detection.
Mapping many samples is done efficiently with SLURM array jobs.
Alignment statistics (from Log.final.out + MultiQC) must be reviewed.
Strandedness should be inferred from the aligned BAM files.
Final BAM files are ready for downstream counting.
featureCounts is used to obtain gene-level counts for differential expression.
Exons are counted and summed per gene.
The output is a simple matrix for downstream statistical analysis.
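
To make "a simple matrix" concrete, here is a minimal sketch that reads featureCounts output into sample names plus a gene-to-counts mapping. It assumes the usual layout: a leading comment line starting with `#`, then a header whose first six columns are Geneid, Chr, Start, End, Strand and Length, followed by one column per BAM file. The function name and the toy table are hypothetical.

```python
ANNOTATION_COLUMNS = 6  # Geneid, Chr, Start, End, Strand, Length


def read_featurecounts(lines):
    """Parse featureCounts output lines into (sample_names, {gene_id: [counts]})."""
    rows = (
        line.rstrip("\n").split("\t")
        for line in lines
        if line.strip() and not line.startswith("#")  # drop the command comment line
    )
    header = next(rows)
    samples = header[ANNOTATION_COLUMNS:]
    counts = {}
    for row in rows:
        # Column 0 is the gene ID; everything after the annotation columns is a count.
        counts[row[0]] = [int(value) for value in row[ANNOTATION_COLUMNS:]]
    return samples, counts


# Toy table (made up for illustration); geneA has two exons summed into one row.
toy_output = [
    "# Program:featureCounts; Command:(elided)\n",
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tctrl_1.bam\ttreat_1.bam\n",
    "geneA\tchr1;chr1\t100;300\t200;400\t+;+\t202\t15\t42\n",
    "geneB\tchr2\t50\t500\t-\t451\t0\t7\n",
]
samples, counts = read_featurecounts(toy_output)
print(samples)
print(counts)
```

The annotation columns are dropped here because downstream tools expect only genes by samples; gene length can be kept separately if it is needed later.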

Counts and sample metadata must be aligned before building DESeq2 objects.
Biotype annotation allows restriction to protein coding genes for exploratory QC.
The variance-stabilizing transformation, distance heatmaps, and PCA help detect outliers and batch effects.
DESeq2 provides a complete pipeline from raw counts to differential expression.
Joining DE results with normalized counts and gene annotation yields a useful final table.
Volcano plots summarize differential expression using fold change and adjusted p values.
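
The requirement that counts and sample metadata be aligned can be sketched independently of DESeq2. The example below is illustrative only: it reorders count columns to follow the metadata row order and refuses to continue if the two sets of sample IDs differ. The function name and the sample IDs are hypothetical.

```python
def align_counts_to_metadata(samples, counts, metadata_ids):
    """Reorder count columns so they follow the order of `metadata_ids`.

    samples      -- column names of the count matrix
    counts       -- {gene_id: [count per sample, in `samples` order]}
    metadata_ids -- sample IDs in the order of the metadata table
    """
    if len(set(samples)) != len(samples) or len(set(metadata_ids)) != len(metadata_ids):
        raise ValueError("duplicate sample IDs")
    if set(samples) != set(metadata_ids):
        only_counts = sorted(set(samples) - set(metadata_ids))
        only_meta = sorted(set(metadata_ids) - set(samples))
        raise ValueError(f"only in counts: {only_counts}; only in metadata: {only_meta}")
    order = [samples.index(sample_id) for sample_id in metadata_ids]
    return {gene: [values[i] for i in order] for gene, values in counts.items()}


# Toy data (made up): count columns are in a different order than the metadata rows.
toy_samples = ["treat_1", "ctrl_1", "ctrl_2"]
toy_counts = {"geneA": [42, 15, 11], "geneB": [7, 0, 3]}
toy_metadata_ids = ["ctrl_1", "ctrl_2", "treat_1"]

aligned = align_counts_to_metadata(toy_samples, toy_counts, toy_metadata_ids)
print(aligned)
```

Failing loudly on a mismatch is a deliberate choice in this sketch: silently dropping or reordering samples is how a treatment label ends up attached to the wrong column.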

Salmon/Kallisto output is imported via tximport and loaded with DESeqDataSetFromTximport().
The tximport object preserves transcript length information for accurate normalization.
Exploratory analysis (PCA, distance heatmaps) should precede differential expression testing.
DESeq2 handles the statistical analysis identically to the genome-based workflow.
LFC shrinkage improves fold change estimates for low-count genes.
Results from transcript-based and genome-based workflows should be broadly concordant.
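
One way to put a number on "broadly concordant" is to compare log2 fold changes for genes reported by both workflows. The sketch below is an assumption about how such a check might look, not the workshop's method: it computes the Pearson correlation and the fraction of shared genes whose fold changes have the same sign. All values are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def lfc_concordance(lfc_a, lfc_b):
    """Return (n_shared, pearson_r, same_sign_fraction) for two {gene: log2FC} maps."""
    shared = sorted(set(lfc_a) & set(lfc_b))
    r = pearson([lfc_a[g] for g in shared], [lfc_b[g] for g in shared])
    same_sign = sum(1 for g in shared if lfc_a[g] * lfc_b[g] > 0) / len(shared)
    return len(shared), r, same_sign


# Made-up log2 fold changes from a genome-based and a transcript-based run.
genome_lfc = {"g1": 2.0, "g2": -1.0, "g3": 0.5, "g4": 3.0}
transcript_lfc = {"g1": 1.8, "g2": -1.2, "g3": 0.4, "g5": 1.0}

n_shared, r, same_sign = lfc_concordance(genome_lfc, transcript_lfc)
print(n_shared, round(r, 3), same_sign)
```

Genes present in only one result table are ignored here; in practice it is worth reporting how many there are, since that number is itself a measure of disagreement.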

Over-representation analysis (ORA) tests whether gene sets contain more DE genes than expected by chance, using the hypergeometric test.
The background universe must include all genes that could have been detected as DE.
Multiple testing correction (FDR) is essential when testing thousands of gene sets.
GO provides comprehensive but redundant functional annotation; use simplify() to reduce redundancy.
KEGG provides curated pathway maps but has limited gene coverage and requires Entrez IDs.
MSigDB Hallmark sets offer 50 high-quality, non-redundant biological signatures.
Direction-aware analysis (up vs. down separately) often reveals clearer biological signals.
GSEA avoids arbitrary cutoffs by using the full ranked gene list; it detects subtle but coordinated changes.
Convergent findings across multiple databases and methods strengthen biological interpretation.
Always report methods, thresholds, and full results for reproducibility.
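
The ORA points above (hypergeometric test, background universe, FDR correction) can be sketched in a few lines of standard-library Python. This is a minimal illustration with hypothetical function names and made-up gene sets, not the implementation used by any enrichment package. The p-value is the upper tail P(X >= k), where k is the number of DE genes in the set, and the adjustment is Benjamini-Hochberg.

```python
from math import comb


def hypergeom_upper_tail(k, universe_size, set_size, n_de):
    """P(X >= k) when drawing n_de genes from a universe containing set_size set members."""
    total = comb(universe_size, n_de)
    upper = min(set_size, n_de)
    hits = sum(
        comb(set_size, i) * comb(universe_size - set_size, n_de - i)
        for i in range(k, upper + 1)
    )
    return hits / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def ora(de_genes, universe, gene_sets):
    """Return {set_name: (p_value, adjusted_p)}; everything is restricted to the universe."""
    universe = set(universe)
    de = set(de_genes) & universe
    names, pvalues = [], []
    for name, members in gene_sets.items():
        members = set(members) & universe
        k = len(members & de)
        names.append(name)
        pvalues.append(hypergeom_upper_tail(k, len(universe), len(members), len(de)))
    return dict(zip(names, zip(pvalues, benjamini_hochberg(pvalues))))


# Made-up example: 10 testable genes, 3 of them DE, two toy gene sets.
toy_universe = [f"g{i}" for i in range(1, 11)]
toy_de = ["g1", "g2", "g3"]
toy_sets = {"setA": ["g1", "g2", "g8", "g9"], "setB": ["g4", "g5", "g6"]}

for name, (p, q) in ora(toy_de, toy_universe, toy_sets).items():
    print(name, round(p, 4), round(q, 4))
```

Note how the universe enters every term of the test: shrinking or enlarging it changes `universe_size` and therefore every p-value, which is why it must contain exactly the genes that could have been called DE.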