Introduction to RNA-seq: Where are we heading towards in this workshop?


  • RNA-seq is a technique for measuring the amount of RNA expressed in a cell or tissue, in a given state and at a given time.
  • Many choices have to be made when planning an RNA-seq experiment, such as whether to perform poly-A selection or ribosomal depletion, whether to apply a stranded or an unstranded protocol, and whether to sequence the reads in a single-end or paired-end fashion. Each of these choices has consequences for the processing and interpretation of the data.
  • Many approaches exist for quantification of RNA-seq data. Some methods align reads to the genome and count the number of reads overlapping gene loci. Other methods map reads to the transcriptome and use a probabilistic approach to estimate the abundance of each gene or transcript.
  • Information about annotated genes can be accessed via several sources, including Ensembl, UCSC and GENCODE.

Downloading and organizing files


  • RNA-seq requires three main inputs: FASTQ reads, a reference genome (FASTA), and gene annotation (GTF/GFF).
  • In lieu of a reference genome and annotation, transcript sequences (FASTA) can be used for transcript-level quantification.
  • Keeping files compressed and well-organized supports reproducible analysis.
  • Reference files must match in genome build and version.
  • Public data repositories such as GEO and SRA provide raw reads and metadata.
  • Tools like wget and fasterq-dump enable programmatic, reproducible data retrieval.
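
The requirement that reference files match can be checked before any alignment is run. The following is a minimal sketch, assuming plain-text FASTA and GTF streams; the function names and the toy inputs are made up for illustration and are not part of the workshop material. It only compares sequence names, so it can catch a "chr1" versus "1" naming mismatch but not a difference in genome build between files that share the same names.

```python
import io

def fasta_sequence_names(handle):
    """Return the set of sequence names (first word after '>') in a FASTA stream."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names

def gtf_sequence_names(handle):
    """Return the set of sequence names (column 1) used in a GTF stream."""
    names = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        names.add(line.split("\t", 1)[0])
    return names

def compare_references(fasta_handle, gtf_handle):
    """Report sequence names that appear in both files or in only one of them."""
    fasta_names = fasta_sequence_names(fasta_handle)
    gtf_names = gtf_sequence_names(gtf_handle)
    return {
        "shared": sorted(fasta_names & gtf_names),
        "only_in_gtf": sorted(gtf_names - fasta_names),
        "only_in_fasta": sorted(fasta_names - gtf_names),
    }

# Toy inputs (hypothetical): the GTF uses "chr2" where the FASTA uses "2".
toy_fasta = io.StringIO(">1 dna:chromosome\nACGT\n>2 dna:chromosome\nGGCC\n")
toy_gtf = io.StringIO(
    "#!genome-build toy\n"
    "1\ttoy\texon\t1\t4\t.\t+\t.\tgene_id \"g1\";\n"
    "chr2\ttoy\texon\t1\t4\t.\t+\t.\tgene_id \"g2\";\n"
)
report = compare_references(toy_fasta, toy_gtf)
print(report)
```

For compressed files, the same functions can be given a handle opened with `gzip.open(path, "rt")`. Any name listed under `only_in_gtf` is worth investigating, because genes on those sequences cannot receive counts.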

Quality control of RNA-seq reads


  • FastQC and MultiQC provide essential diagnostics for RNA-seq data.
  • Some FastQC warnings are normal for RNA-seq and do not indicate problems.
  • Aligners such as STAR soft-clip adapter sequence at read ends, so trimming is usually unnecessary for alignment-based RNA-seq.
  • QC helps detect outliers before alignment and quantification.
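
To make the idea of detecting outliers before alignment concrete, here is a small sketch that summarizes read quality per sample. It is not how FastQC works internally; it assumes four-line FASTQ records with Phred+33 quality encoding, and the function names, the threshold of 20, and the toy samples are all illustrative assumptions.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:            # end of file
            return
        if not header.strip():    # tolerate blank lines between records
            continue
        seq = handle.readline().rstrip("\n")
        handle.readline()         # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual

def mean_phred(quality_string, offset=33):
    """Mean Phred score of one read, assuming Phred+33 encoding."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)

def summarize_sample(handle, min_mean=20.0):
    """Return read count, mean per-read quality, and fraction of low-quality reads."""
    read_means = [mean_phred(q) for _, _, q in read_fastq(handle) if q]
    n = len(read_means)
    if n == 0:
        return {"n_reads": 0, "mean_quality": None, "frac_low": None}
    low = sum(1 for m in read_means if m < min_mean)
    return {"n_reads": n, "mean_quality": sum(read_means) / n, "frac_low": low / n}

# Toy samples (hypothetical): 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
samples = {
    "sampleA": "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n",
    "sampleB": "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n####\n",
}
summary = {name: summarize_sample(io.StringIO(text)) for name, text in samples.items()}
for name, stats in summary.items():
    print(name, stats)
```

A sample whose summary sits far from the others, such as `sampleB` here, is a candidate for closer inspection in the FastQC and MultiQC reports.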

A. Genome-based quantification (STAR + featureCounts)


  • STAR is a fast splice-aware aligner widely used for RNA-seq.
  • Genome indexing requires FASTA and optionally GTF for improved splice detection.
  • Mapping is efficiently performed using SLURM array jobs.
  • Alignment statistics (from Log.final.out + MultiQC) must be reviewed.
  • Strandedness should be inferred using aligned BAM files.
  • Final BAM files are ready for downstream counting.
  • featureCounts is used to obtain gene-level counts for differential expression.
  • Exons are counted and summed per gene.
  • The output is a simple matrix for downstream statistical analysis.
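
The counting step, in which exons are counted and summed per gene, can be illustrated with a toy model. This is a deliberately simplified sketch and not featureCounts itself: it ignores strand, spliced alignments, and paired-end fragments, and it treats a read as one interval. The annotation, the reads, and the function names are hypothetical. Reads that overlap exons of more than one gene are left unassigned, which mirrors the default featureCounts behaviour of not counting multi-overlapping reads.

```python
from collections import Counter

# Hypothetical toy annotation: gene -> exon intervals (chrom, start, end),
# using 1-based inclusive coordinates as in GTF.
EXONS = {
    "geneA": [("chr1", 100, 200), ("chr1", 300, 400)],
    "geneB": [("chr1", 380, 500)],
}

def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end

def genes_hit(read, exons_by_gene):
    """Return the set of genes with at least one exon overlapping the read."""
    chrom, start, end = read
    hits = set()
    for gene, exons in exons_by_gene.items():
        for ex_chrom, ex_start, ex_end in exons:
            if chrom == ex_chrom and overlaps(start, end, ex_start, ex_end):
                hits.add(gene)
                break  # one overlapping exon is enough for this gene
    return hits

def count_reads(reads, exons_by_gene):
    """Assign each read to a gene only if exactly one gene is hit."""
    counts = Counter({gene: 0 for gene in exons_by_gene})
    unassigned = Counter()
    for read in reads:
        hits = genes_hit(read, exons_by_gene)
        if len(hits) == 1:
            counts[next(iter(hits))] += 1
        elif not hits:
            unassigned["no_feature"] += 1
        else:
            unassigned["ambiguous"] += 1
    return dict(counts), dict(unassigned)

reads = [
    ("chr1", 150, 199),  # first exon of geneA
    ("chr1", 310, 359),  # second exon of geneA
    ("chr1", 390, 439),  # overlaps geneA exon 2 and geneB: ambiguous
    ("chr1", 450, 499),  # geneB only
    ("chr1", 600, 649),  # outside any exon
]
counts, unassigned = count_reads(reads, EXONS)
print(counts, unassigned)
```

Running this per sample and placing the per-gene counts side by side gives the genes-by-samples matrix that the downstream statistical analysis expects.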

B. Transcript-based quantification (Salmon)


  • Salmon performs fast, alignment-free transcript quantification.
  • A transcriptome FASTA is required for building the index.
  • salmon quant quantifies directly from FASTQ files and can optionally correct for sequence-specific and GC bias (--seqBias, --gcBias).
  • Transcript-level outputs include TPM and estimated counts.
  • tximport converts transcript estimates to gene-level counts.
  • Gene-level counts from Salmon are suitable for DESeq2.
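
The relationship between estimated counts, TPM, and gene-level summaries can be sketched in a few lines. This is an illustration under stated assumptions rather than the Salmon or tximport implementation: the transcript names, counts, effective lengths, and the transcript-to-gene map are invented, and the gene-level step is a plain sum, whereas tximport additionally tracks average transcript lengths for use as offsets.

```python
from collections import defaultdict

def tpm(est_counts, eff_lengths):
    """Convert estimated counts to TPM using per-transcript effective lengths."""
    rates = {tx: est_counts[tx] / eff_lengths[tx] for tx in est_counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("all transcripts have zero estimated counts")
    return {tx: rate / total * 1e6 for tx, rate in rates.items()}

def summarize_to_gene(values, tx2gene):
    """Sum transcript-level values (counts or TPM) over each gene."""
    totals = defaultdict(float)
    for tx, value in values.items():
        totals[tx2gene[tx]] += value
    return dict(totals)

# Hypothetical toy quantification for one sample.
est_counts = {"tx1": 100.0, "tx2": 300.0, "tx3": 600.0}
eff_lengths = {"tx1": 1000.0, "tx2": 1500.0, "tx3": 3000.0}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

tx_tpm = tpm(est_counts, eff_lengths)
gene_counts = summarize_to_gene(est_counts, tx2gene)
gene_tpm = summarize_to_gene(tx_tpm, tx2gene)
print(tx_tpm)
print(gene_counts)
print(gene_tpm)
```

In this toy case `tx3` has the most reads but, because it is also the longest transcript, its TPM equals that of `tx2`. This length dependence is why transcript length information matters when moving from transcript estimates to gene-level counts.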

Gene-level QC and differential expression (DESeq2)


  • Counts and sample metadata must be aligned before building DESeq2 objects.
  • Biotype annotation allows restriction to protein coding genes for exploratory QC.
  • Variance-stabilizing transformation, distance heatmaps, and PCA help detect outliers and batch effects.
  • DESeq2 provides a complete pipeline from raw counts to differential expression.
  • Joining DE results with normalized counts and gene annotation yields a useful final table.
  • Volcano plots summarize differential expression using fold change and adjusted p values.
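
One step between raw and normalized counts is the estimation of per-sample size factors. The sketch below shows the median-of-ratios idea that DESeq2 uses by default, written in plain Python for illustration; it is not the DESeq2 code, and the toy count table and function names are assumptions. Genes with a zero count in any sample are skipped because their geometric mean is zero.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping gene -> list of raw counts, one entry per sample.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if any(c == 0 for c in gene_counts):
            continue  # no finite log geometric mean for this gene
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # The median over genes makes the estimate robust to a few strongly changed genes.
    return [math.exp(median(r)) for r in log_ratios]

def normalize(counts, factors):
    """Divide each sample's counts by that sample's size factor."""
    return {g: [c / f for c, f in zip(cs, factors)] for g, cs in counts.items()}

# Hypothetical toy table: sample 2 was sequenced twice as deeply as sample 1.
raw = {
    "g1": [10, 20],
    "g2": [100, 200],
    "g3": [50, 100],
    "g4": [0, 5],   # skipped when estimating size factors
}
factors = size_factors(raw)
normalized = normalize(raw, factors)
print(factors)
print(normalized["g1"])
```

After normalization the two samples agree for `g1`, `g2`, and `g3`, which is the expected outcome when the only difference between samples is sequencing depth.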

B. Differential expression using DESeq2 (Salmon/Kallisto pathway)


  • Salmon/Kallisto output is imported via tximport and loaded with DESeqDataSetFromTximport().
  • The tximport object preserves transcript length information for accurate normalization.
  • Exploratory analysis (PCA, distance heatmaps) should precede differential expression testing.
  • DESeq2 handles the statistical analysis identically to the genome-based workflow.
  • LFC shrinkage improves fold change estimates for low-count genes.
  • Results from transcript-based and genome-based workflows should be broadly concordant.
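
One way to check that the two workflows are broadly concordant is to compare log2 fold changes for the genes they share. The following is a minimal sketch, assuming each workflow's results have been reduced to a gene-to-log2-fold-change mapping; the values and names are invented, and the choice of Pearson correlation plus sign agreement is one reasonable summary among several.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def lfc_concordance(lfc_a, lfc_b):
    """Compare log2 fold changes for genes present in both result sets."""
    shared = sorted(set(lfc_a) & set(lfc_b))
    xs = [lfc_a[g] for g in shared]
    ys = [lfc_b[g] for g in shared]
    same_sign = sum(1 for x, y in zip(xs, ys) if x * y > 0)
    return {
        "n_shared": len(shared),
        "pearson_r": pearson(xs, ys),
        "sign_agreement": same_sign / len(shared),
    }

# Hypothetical log2 fold changes from the two workflows.
genome_based = {"g1": 2.0, "g2": -1.0, "g3": 0.5, "g4": 3.0}
transcript_based = {"g1": 1.8, "g2": -1.2, "g3": -0.1, "g5": 1.0}

concordance = lfc_concordance(genome_based, transcript_based)
print(concordance)
```

Disagreements tend to concentrate in genes with small fold changes, such as `g3` in this toy example, so a sign mismatch near zero is less concerning than one for a strongly changed gene.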

Gene set enrichment analysis


  • Over-representation analysis (ORA) uses the hypergeometric test to ask whether a gene set contains more DE genes than expected by chance.
  • The background universe must include all genes that could have been detected as DE.
  • Multiple testing correction (FDR) is essential when testing thousands of gene sets.
  • GO provides comprehensive but redundant functional annotation; use simplify() to reduce redundancy.
  • KEGG provides curated pathway maps but has limited gene coverage and requires Entrez IDs.
  • MSigDB Hallmark sets offer 50 high-quality, non-redundant biological signatures.
  • Direction-aware analysis (up vs. down separately) often reveals clearer biological signals.
  • GSEA avoids arbitrary cutoffs by using the full ranked gene list; it detects subtle but coordinated changes.
  • Convergent findings across multiple databases and methods strengthen biological interpretation.
  • Always report methods, thresholds, and full results for reproducibility.
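
The over-representation test and the multiple testing correction mentioned above can be written out with the standard library alone. This is a sketch for illustration, not the code used by any enrichment package: the gene identifiers, gene sets, and function names are hypothetical. The p-value is the upper tail of the hypergeometric distribution, and the adjustment follows the Benjamini-Hochberg procedure.

```python
from math import comb

def hypergeom_pvalue(k, n_de, n_set, n_universe):
    """P(X >= k): chance of seeing at least k set members among n_de DE genes."""
    upper = min(n_de, n_set)
    total = comb(n_universe, n_de)
    tail = sum(
        comb(n_set, i) * comb(n_universe - n_set, n_de - i)
        for i in range(k, upper + 1)
    )
    return tail / total

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def ora(de_genes, gene_sets, universe):
    """Run ORA for every gene set, restricting all inputs to the universe."""
    universe = set(universe)
    de = set(de_genes) & universe
    rows = []
    for name, members in gene_sets.items():
        members = set(members) & universe
        k = len(de & members)
        p = hypergeom_pvalue(k, len(de), len(members), len(universe))
        rows.append((name, k, p))
    padj = benjamini_hochberg([p for _, _, p in rows])
    return [(name, k, p, q) for (name, k, p), q in zip(rows, padj)]

# Hypothetical toy data: 10 testable genes, 3 of them DE.
universe = [f"g{i}" for i in range(1, 11)]
de_genes = ["g1", "g2", "g3"]
gene_sets = {
    "setA": ["g1", "g2", "g4", "g5"],
    "setB": ["g6", "g7", "g8"],
}
results = ora(de_genes, gene_sets, universe)
for row in results:
    print(row)
```

The `universe` argument matters as much as the gene set itself: enlarging it with genes that could never have been called DE makes every overlap look more surprising than it is.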