Introduction to RNA-seq
Where are we heading towards in this workshop?


  • RNA-seq is a technique for measuring the amount of RNA expressed in a cell or tissue, in a given state and at a given time.
  • Many choices have to be made when planning an RNA-seq experiment, such as whether to perform poly-A selection or ribosomal depletion, whether to apply a stranded or an unstranded protocol, and whether to sequence the reads in a single-end or paired-end fashion. Each of these choices has consequences for the processing and interpretation of the data.
  • Many approaches exist for quantification of RNA-seq data. Some methods align reads to the genome and count the number of reads overlapping gene loci. Other methods map reads to the transcriptome and use a probabilistic approach to estimate the abundance of each gene or transcript.
  • Information about annotated genes can be accessed via several sources, including Ensembl, UCSC and GENCODE.
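
The align-and-count approach described above can be illustrated with a toy example. This is only a sketch under simplifying assumptions: reads and genes are plain intervals, strand and splicing are ignored, and a read overlapping more than one gene is discarded as ambiguous. The function name, gene IDs, and coordinates are made up for illustration.

```python
from collections import Counter

def count_reads_per_gene(reads, genes):
    """Count reads that overlap exactly one gene locus.

    reads: list of (chrom, start, end) tuples, 1-based inclusive coordinates.
    genes: dict mapping gene ID -> (chrom, start, end).
    Reads overlapping no gene, or more than one gene, are skipped.
    """
    counts = Counter({gene_id: 0 for gene_id in genes})
    for chrom, start, end in reads:
        hits = [
            gene_id
            for gene_id, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start <= g_end and end >= g_start
        ]
        if len(hits) == 1:  # keep unambiguous assignments only
            counts[hits[0]] += 1
    return dict(counts)

# Hypothetical toy data: two overlapping genes and four aligned reads.
genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 450, 900)}
reads = [
    ("chr1", 120, 170),  # inside geneA only
    ("chr1", 460, 510),  # overlaps geneA and geneB, so it is ambiguous
    ("chr1", 600, 650),  # inside geneB only
    ("chr2", 120, 170),  # different chromosome, no gene
]
gene_counts = count_reads_per_gene(reads, genes)
print(gene_counts)
```

Real counting tools additionally take strand, exon structure, and multi-mapping reads into account, which this sketch leaves out on purpose.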

Downloading and organizing files


  • An RNA-seq analysis requires three main inputs: FASTQ reads, a reference genome (FASTA), and a gene annotation (GTF/GFF).
  • In lieu of a reference genome and annotation, transcript sequences (FASTA) can be used for transcript-level quantification.
  • Keeping files compressed and well-organized supports reproducible analysis.
  • Reference files must match in genome build and version.
  • Public data repositories such as GEO and SRA provide raw reads and metadata.
  • Tools like wget and fasterq-dump enable programmatic, reproducible data retrieval.
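
One quick sanity check on whether reference files belong together is to compare the sequence names used in the genome FASTA with those used in the GTF. The sketch below assumes both files have already been opened as line iterables; the function names are illustrative, not part of any tool. Matching names are necessary but not sufficient: they do not prove that both files come from the same genome build.

```python
def fasta_sequence_names(fasta_lines):
    """Return the set of sequence names taken from FASTA header lines."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and len(line.strip()) > 1
    }

def gtf_sequence_names(gtf_lines):
    """Return the set of sequence names in the first column of a GTF."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        names.add(line.split("\t", 1)[0])
    return names

def missing_from_genome(fasta_lines, gtf_lines):
    """List GTF sequence names that have no matching FASTA record."""
    return sorted(gtf_sequence_names(gtf_lines) - fasta_sequence_names(fasta_lines))

# Toy data: the FASTA uses "chr1" while one GTF record uses "1".
fasta = [">chr1 example description\n", "ACGT\n", ">chr2\n", "GGCC\n"]
gtf = [
    "#!example-comment\n",
    'chr1\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\n',
    '1\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g2";\n',
]
unmatched = missing_from_genome(fasta, gtf)
print(unmatched)
```

A non-empty result, such as "1" versus "chr1" naming, is a common sign that the genome and annotation were downloaded from different sources.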

Quality control of RNA-seq reads


  • FastQC and MultiQC provide essential diagnostics for RNA-seq data.
  • Some FastQC warnings are normal for RNA-seq and do not indicate problems.
  • Splice-aware aligners soft-clip adapter sequence at read ends, so trimming is usually unnecessary for alignment-based RNA-seq.
  • QC helps detect outliers before alignment and quantification.
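
To make the per-read quality idea concrete, here is a minimal sketch that computes the mean Phred score of each read in a FASTQ stream. It assumes four-line records and Phred+33 encoding; the function name and the toy reads are illustrative. It is not a replacement for FastQC, which reports many more metrics.

```python
import io

def mean_read_qualities(handle, offset=33):
    """Yield (read_id, mean Phred quality) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().strip()
        if not header:  # end of input (or a blank line) stops the parser
            return
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line
        quality = handle.readline().strip()
        if not quality or len(quality) != len(sequence):
            raise ValueError(f"Malformed FASTQ record: {header}")
        scores = [ord(char) - offset for char in quality]
        yield header[1:].split()[0], sum(scores) / len(scores)

# Toy FASTQ: 'I' encodes Phred 40 and '!' encodes Phred 0 under Phred+33.
toy_fastq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!!II\n")
qualities = dict(mean_read_qualities(toy_fastq))
print(qualities)
```

For compressed files, the same function can be given a handle from `gzip.open(path, "rt")`.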

A. Genome-based quantification (STAR + featureCounts)


  • STAR is a fast splice-aware aligner widely used for RNA-seq.
  • Genome indexing requires FASTA and optionally GTF for improved splice detection.
  • Mapping many samples is efficiently parallelized using SLURM array jobs.
  • Alignment statistics (from Log.final.out + MultiQC) must be reviewed.
  • Strandedness should be inferred using aligned BAM files.
  • Final BAM files are ready for downstream counting.
  • featureCounts is used to obtain gene-level counts for differential expression.
  • Exons are counted and summed per gene.
  • The output is a simple matrix for downstream statistical analysis.
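
The count matrix can be loaded with a few lines of code. The sketch below assumes the usual featureCounts text layout: a leading comment line starting with "#", six annotation columns (Geneid, Chr, Start, End, Strand, Length), then one column per BAM file. It also assumes integer counts, which would not hold if fractional counting was enabled. The function name and toy table are illustrative.

```python
import csv
import io

ANNOTATION_COLUMNS = ["Geneid", "Chr", "Start", "End", "Strand", "Length"]

def read_featurecounts(handle):
    """Return (sample_names, {gene_id: [counts]}) from a featureCounts table."""
    data_lines = (line for line in handle if not line.startswith("#"))
    rows = csv.reader(data_lines, delimiter="\t")
    header = next(rows)
    if header[:6] != ANNOTATION_COLUMNS:
        raise ValueError("Unexpected header; is this featureCounts output?")
    samples = header[6:]
    # Keep only the gene ID and the per-sample count columns.
    counts = {row[0]: [int(value) for value in row[6:]] for row in rows if row}
    return samples, counts

# Toy table with made-up genes and sample names.
toy_table = io.StringIO(
    "# Program:featureCounts; Command:(elided)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsampleA.bam\tsampleB.bam\n"
    "gene1\tchr1\t100\t500\t+\t401\t12\t30\n"
    "gene2\tchr1\t900\t1500\t-\t601\t0\t7\n"
)
samples, counts = read_featurecounts(toy_table)
print(samples)
print(counts)
```

Dropping the annotation columns leaves a genes-by-samples matrix, the shape expected by downstream differential expression tools.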

B. Transcript-based quantification (Salmon)


  • Salmon performs fast, alignment-free transcript quantification.
  • A transcriptome FASTA is required for building the index.
  • salmon quant uses FASTQ files directly with bias correction.
  • Transcript-level outputs include TPM and estimated counts.
  • tximport converts transcript estimates to gene-level counts.
  • Gene-level counts from Salmon are suitable for DESeq2.
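
The relationship between estimated counts, TPM, and gene-level summaries can be sketched as follows. TPM divides each transcript's estimated count by its effective length and rescales so that the values sum to one million. The gene-level step here is a plain sum over transcripts, which is a simplification: tximport can also account for changes in average transcript length. All names and numbers are made up for illustration.

```python
from collections import defaultdict

def tpm(est_counts, eff_lengths):
    """Compute TPM from estimated counts and effective lengths per transcript."""
    rates = {tx: est_counts[tx] / eff_lengths[tx] for tx in est_counts}
    total = sum(rates.values())
    if total == 0:
        return {tx: 0.0 for tx in rates}
    return {tx: rate / total * 1e6 for tx, rate in rates.items()}

def sum_to_gene(values, tx2gene):
    """Sum transcript-level values into gene-level totals."""
    totals = defaultdict(float)
    for tx, value in values.items():
        totals[tx2gene[tx]] += value
    return dict(totals)

# Hypothetical transcripts: tx1 and tx2 belong to g1, tx3 belongs to g2.
est_counts = {"tx1": 100.0, "tx2": 300.0, "tx3": 100.0}
eff_lengths = {"tx1": 1000.0, "tx2": 1500.0, "tx3": 500.0}
tx2gene = {"tx1": "g1", "tx2": "g1", "tx3": "g2"}

tpm_values = tpm(est_counts, eff_lengths)
gene_counts = sum_to_gene(est_counts, tx2gene)
gene_tpm = sum_to_gene(tpm_values, tx2gene)
print(tpm_values)
print(gene_counts)
```

Note that tx3 has the same count as tx1 but a higher TPM, because it is shorter: TPM reflects abundance per unit of transcript length rather than raw read numbers.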

Gene-level QC and differential expression (DESeq2)


  • Counts and sample metadata must be aligned before building DESeq2 objects.
  • Biotype annotation allows restriction to protein coding genes for exploratory QC.
  • A variance-stabilizing transformation, sample distance heatmaps, and PCA help detect outliers and batch effects.
  • DESeq2 provides a complete pipeline from raw counts to differential expression.
  • Joining DE results with normalized counts and gene annotation yields a useful final table.
  • Volcano plots summarize differential expression using fold change and adjusted p values.
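
The logic behind a volcano plot can be sketched in a few lines: each gene is placed at (log2 fold change, -log10 adjusted p value) and labelled by whether it passes both thresholds. The cutoffs below (absolute log2 fold change of at least 1, adjusted p value below 0.05) are illustrative assumptions, not values prescribed by DESeq2, and the function names are made up. Genes with a missing adjusted p value are treated as not significant.

```python
import math

def classify_gene(log2_fold_change, padj, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Label a gene 'up', 'down', or 'ns' (not significant)."""
    for value in (log2_fold_change, padj):
        if value is None or math.isnan(value):
            return "ns"  # missing values cannot be called significant
    if padj < padj_cutoff and abs(log2_fold_change) >= lfc_cutoff:
        return "up" if log2_fold_change > 0 else "down"
    return "ns"

def volcano_y(padj, floor=1e-300):
    """Return -log10(padj), flooring padj to avoid log10(0)."""
    return -math.log10(max(padj, floor))

# Hypothetical results: (gene, log2 fold change, adjusted p value).
results = [
    ("geneA", 2.3, 0.001),
    ("geneB", -1.5, 0.01),
    ("geneC", 0.4, 0.001),  # significant but small effect
    ("geneD", 3.0, None),   # no adjusted p value available
]
labels = {gene: classify_gene(lfc, padj) for gene, lfc, padj in results}
print(labels)
```

Plotting `log2_fold_change` against `volcano_y(padj)` and coloring points by label gives the familiar volcano shape.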

Gene set enrichment analysis


  • ORA requires a universe and a significant gene set.
  • Ensembl IDs must be converted to Entrez for GO and KEGG.
  • clusterProfiler provides GO BP and KEGG ORA.
  • Hallmark sets offer high quality curated pathways for interpretation.
  • Dotplots and barplots are effective for visualizing enrichment results.
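
The statistic behind ORA can be written out directly. Given a universe of N genes, a gene set with K members in that universe, and n significant genes of which k fall in the set, the p value is the hypergeometric upper tail P(X >= k). The sketch below is a from-scratch illustration of that formula, not how any particular package implements it; the function name and numbers are made up. It assumes the gene set has already been restricted to genes in the universe.

```python
from math import comb

def ora_p_value(universe_size, set_size, n_significant, overlap):
    """Upper-tail hypergeometric p value, P(X >= overlap).

    universe_size: number of genes tested (N)
    set_size:      gene-set members present in the universe (K)
    n_significant: number of significant genes (n)
    overlap:       significant genes that belong to the set (k)
    """
    max_overlap = min(set_size, n_significant)
    total = comb(universe_size, n_significant)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, n_significant - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total

# Toy numbers: 10 genes tested, 4 in the pathway, 3 significant, 2 of them in the pathway.
p_value = ora_p_value(universe_size=10, set_size=4, n_significant=3, overlap=2)
print(round(p_value, 4))
```

The choice of universe matters: using all annotated genes instead of the genes actually tested enlarges N and tends to make the p values smaller.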