Introduction to RNA-seq
Where are we heading towards in this workshop?
- RNA-seq is a technique for measuring the amount of RNA expressed in a cell or tissue, in a given state and at a given time.
- Many choices have to be made when planning an RNA-seq experiment, such as whether to perform poly-A selection or ribosomal depletion, whether to apply a stranded or an unstranded protocol, and whether to sequence the reads in a single-end or paired-end fashion. Each of these choices has consequences for the processing and interpretation of the data.
- Many approaches exist for quantification of RNA-seq data. Some methods align reads to the genome and count the number of reads overlapping gene loci. Other methods map reads to the transcriptome and use a probabilistic approach to estimate the abundance of each gene or transcript.
- Information about annotated genes can be accessed via several sources, including Ensembl, UCSC and GENCODE.
Downloading and organizing files
- RNA-seq requires three main inputs: FASTQ reads, a reference genome (FASTA), and gene annotation (GTF/GFF).
- In lieu of a reference genome and annotation, transcript sequences (FASTA) can be used for transcript-level quantification.
- Keeping files compressed and well-organized supports reproducible analysis.
- Reference files must match in genome build and version.
- Public data repositories such as GEO and SRA provide raw reads and metadata.
- Tools like `wget` and `fasterq-dump` enable programmatic, reproducible data retrieval.
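One quick way to catch mismatched reference files is to compare the sequence names used in the genome FASTA with those in the GTF. The following is a minimal sketch, not part of the workshop material: the function names and the toy inputs are hypothetical, and matching names are a necessary but not sufficient sign that build and version agree.

```python
def fasta_seqnames(lines):
    """Collect sequence names from FASTA headers (text after '>' up to the first whitespace)."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def gtf_seqnames(lines):
    """Collect the first column (seqname) of every non-comment, non-empty GTF line."""
    return {
        line.split("\t")[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def missing_from_fasta(fasta_lines, gtf_lines):
    """Return GTF sequence names that have no counterpart in the FASTA."""
    return sorted(gtf_seqnames(gtf_lines) - fasta_seqnames(fasta_lines))


# Toy inputs (hypothetical): the FASTA uses Ensembl-style names ("1"),
# while one GTF record uses a UCSC-style name ("chr1").
fasta = [">1 dna:chromosome", "ACGTACGT", ">MT dna:chromosome", "GGCC"]
gtf = [
    "#!genome-build example",
    'chr1\tsrc\tgene\t1\t8\t.\t+\t.\tgene_id "g1";',
    'MT\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g2";',
]

print(missing_from_fasta(fasta, gtf))  # ['chr1']
```

An empty result only shows that the naming schemes agree; the build and release recorded by the data provider should still be checked.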
Quality control of RNA-seq reads
- FastQC and MultiQC provide essential diagnostics for RNA-seq data.
- Some FastQC warnings are normal for RNA-seq and do not indicate problems.
- Aligners soft-clip adapter sequences, so trimming is usually unnecessary for alignment-based RNA-seq.
- QC helps detect outliers before alignment and quantification.
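To make the idea of a per-base quality diagnostic concrete, the sketch below computes the mean Phred score at each read position from FASTQ text. It is a simplified illustration, not FastQC's implementation: it assumes strict 4-line records and Phred+33 encoding, and the function name is hypothetical.

```python
def mean_quality_per_position(fastq_text, offset=33):
    """Return the mean Phred score at each read position for 4-line FASTQ records."""
    lines = fastq_text.strip().splitlines()
    totals, counts = [], []
    # The quality string is the 4th line of every record.
    for qual in lines[3::4]:
        for pos, char in enumerate(qual.strip()):
            if pos == len(totals):
                # First read that reaches this position: start new accumulators.
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(char) - offset
            counts[pos] += 1
    return [total / count for total, count in zip(totals, counts)]


# Two toy reads; the second one loses quality towards its 3' end.
example = """
@read1
ACGT
+
IIII
@read2
ACGT
+
II5#
"""

print(mean_quality_per_position(example))  # [40.0, 40.0, 30.0, 21.0]
```

A drop in mean quality towards the end of reads, as in this toy input, is the kind of pattern the per-base quality plot makes visible.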
A. Genome-based quantification (STAR + featureCounts)
- STAR is a fast splice-aware aligner widely used for RNA-seq.
- Genome indexing requires a genome FASTA file and, optionally, a GTF annotation for improved splice junction detection.
- Mapping many samples can be run efficiently as SLURM array jobs.
- Alignment statistics (from `Log.final.out` + MultiQC) must be reviewed.
- Strandedness should be inferred using aligned BAM files.
- Final BAM files are ready for downstream counting.
- `featureCounts` is used to obtain gene-level counts for differential expression.
- Exons are counted and summed per gene.
- The output is a simple matrix for downstream statistical analysis.
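The idea of counting over exons and summing per gene can be sketched in a few lines. This is a deliberately simplified illustration, not how featureCounts is implemented: reads are plain intervals, strand and minimum overlap are ignored, and reads overlapping exons of more than one gene are set aside as ambiguous. All names and coordinates are hypothetical.

```python
def count_reads_per_gene(exons, reads):
    """Assign each read to a gene if it overlaps exons of exactly one gene.

    exons: list of (gene_id, chrom, start, end), 1-based inclusive coordinates
    reads: list of (chrom, start, end), 1-based inclusive coordinates
    """
    counts = {gene: 0 for gene, *_ in exons}
    unassigned = {"no_feature": 0, "ambiguous": 0}
    for chrom, r_start, r_end in reads:
        # Genes with at least one exon overlapping the read.
        hits = {
            gene
            for gene, e_chrom, e_start, e_end in exons
            if e_chrom == chrom and r_start <= e_end and e_start <= r_end
        }
        if len(hits) == 1:
            counts[hits.pop()] += 1  # exon hits are summed at the gene level
        elif not hits:
            unassigned["no_feature"] += 1
        else:
            unassigned["ambiguous"] += 1
    return counts, unassigned


exons = [
    ("geneA", "chr1", 100, 200),
    ("geneA", "chr1", 300, 400),
    ("geneB", "chr1", 380, 500),
]
reads = [
    ("chr1", 150, 199),  # inside the first geneA exon
    ("chr1", 190, 310),  # touches both geneA exons, still one gene
    ("chr1", 390, 420),  # overlaps geneA and geneB: ambiguous
    ("chr1", 450, 480),  # geneB only
    ("chr2", 1, 50),     # no annotated exon
]

counts, unassigned = count_reads_per_gene(exons, reads)
print(counts)      # {'geneA': 2, 'geneB': 1}
print(unassigned)  # {'no_feature': 1, 'ambiguous': 1}
```

Running this per sample and placing the per-gene counts side by side gives the genes-by-samples matrix used for downstream statistics.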
B. Transcript-based quantification (Salmon)
- Salmon performs fast, alignment-free transcript quantification.
- A transcriptome FASTA is required for building the index.
- `salmon quant` uses FASTQ files directly with bias correction.
- Transcript-level outputs include TPM and estimated counts.
- `tximport` converts transcript estimates to gene-level counts.
- Gene-level counts from Salmon are suitable for DESeq2.
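The relationship between estimated counts, TPM, and gene-level summaries can be illustrated with a small sketch. It assumes per-transcript estimated counts and effective lengths are already available (Salmon reports these in `quant.sf`), and it summarizes to genes by simple summation. This is only the basic idea: `tximport` offers additional options such as length-based offsets and scaling. The function names and the toy values are hypothetical.

```python
from collections import defaultdict


def tpm_from_counts(est_counts, eff_lengths):
    """Compute TPM from estimated counts and effective transcript lengths."""
    # Reads per base of effective length, then rescale so all values sum to one million.
    rates = {tx: est_counts[tx] / eff_lengths[tx] for tx in est_counts}
    total = sum(rates.values())
    return {tx: rate / total * 1e6 for tx, rate in rates.items()}


def summarize_to_gene(tx_values, tx2gene):
    """Sum transcript-level values over all transcripts of each gene."""
    gene_values = defaultdict(float)
    for tx, value in tx_values.items():
        gene_values[tx2gene[tx]] += value
    return dict(gene_values)


# Toy values (hypothetical transcripts and genes).
est_counts = {"tx1": 100.0, "tx2": 300.0, "tx3": 100.0}
eff_lengths = {"tx1": 1000.0, "tx2": 1500.0, "tx3": 500.0}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

tpm = tpm_from_counts(est_counts, eff_lengths)
gene_tpm = summarize_to_gene(tpm, tx2gene)
gene_counts = summarize_to_gene(est_counts, tx2gene)

for gene in sorted(gene_counts):
    print(gene, round(gene_tpm[gene]), gene_counts[gene])
```

Note that `tx3` has the same TPM as `tx2` despite a third of the counts, because its effective length is also a third as long.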
Gene-level QC and differential expression (DESeq2)
- The columns of the count matrix and the rows of the sample metadata must be in the same order before building DESeq2 objects.
- Biotype annotation allows restriction to protein coding genes for exploratory QC.
- The variance-stabilizing transformation, distance heatmaps, and PCA help detect outliers and batch effects.
- DESeq2 provides a complete pipeline from raw counts to differential expression.
- Joining DE results with normalized counts and gene annotation yields a useful final table.
- Volcano plots summarize differential expression using fold change and adjusted p-values.
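The logic behind the colouring of a volcano plot can be sketched as follows: p-values are adjusted with the Benjamini-Hochberg procedure, and each gene is then labelled by its log2 fold change and adjusted p-value. This is an illustration only. DESeq2 computes adjusted p-values itself (including independent filtering, which can leave some genes without one), and the cutoffs, function names, and toy values below are assumptions.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def classify(log2fc, padj, lfc_cutoff=1.0, alpha=0.05):
    """Label a gene 'up', 'down', or 'ns' (not significant)."""
    # A missing adjusted p-value (None) is treated as not significant.
    if padj is None or padj >= alpha or abs(log2fc) < lfc_cutoff:
        return "ns"
    return "up" if log2fc > 0 else "down"


# Toy results (hypothetical genes and values).
genes = ["g1", "g2", "g3", "g4"]
log2fc = [2.5, -1.8, 0.3, 1.2]
pvalues = [0.005, 0.01, 0.03, 0.2]

padj = bh_adjust(pvalues)
labels = {g: classify(lfc, q) for g, lfc, q in zip(genes, log2fc, padj)}
print(labels)  # {'g1': 'up', 'g2': 'down', 'g3': 'ns', 'g4': 'ns'}
```

In a volcano plot, these labels typically determine point colour, with log2 fold change on the x-axis and the negative log10 of the p-value on the y-axis.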
Gene set enrichment analysis
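- Over-representation analysis (ORA) requires a gene universe (the background of tested genes) and a set of significant genes.
- Ensembl IDs must be converted to Entrez IDs for GO and KEGG.
- clusterProfiler provides ORA for GO Biological Process (BP) terms and KEGG pathways.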
- Hallmark sets offer high-quality curated pathways for interpretation.
- Dotplots and barplots are effective for visualizing enrichment results.
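Why ORA needs both a universe and a significant gene set becomes clear from the underlying test. The sketch below computes a one-sided hypergeometric p-value for a single gene set. It illustrates the statistic only and is not clusterProfiler's code; the function name and gene identifiers are hypothetical, and in practice the p-values would also be adjusted across all tested gene sets.

```python
from math import comb


def ora_pvalue(universe, significant, gene_set):
    """One-sided hypergeometric test for over-representation of one gene set.

    Returns (overlap, p-value), where the p-value is the probability of seeing
    at least `overlap` gene-set members among the significant genes by chance.
    """
    universe = set(universe)
    sig = set(significant) & universe      # significant genes within the universe
    members = set(gene_set) & universe     # gene-set members within the universe
    n_universe, n_members, n_sig = len(universe), len(members), len(sig)
    overlap = len(sig & members)
    # Sum the hypergeometric probabilities for overlap, overlap + 1, ...
    tail = sum(
        comb(n_members, i) * comb(n_universe - n_members, n_sig - i)
        for i in range(overlap, min(n_members, n_sig) + 1)
    )
    return overlap, tail / comb(n_universe, n_sig)


# Toy example: 10 genes tested, 4 of them in the pathway, 3 significant.
universe = [f"g{i}" for i in range(10)]
pathway = ["g0", "g1", "g2", "g3"]
significant = ["g0", "g1", "g9"]

overlap, pvalue = ora_pvalue(universe, significant, pathway)
print(overlap, round(pvalue, 4))  # 2 0.3333
```

Changing the universe changes the p-value even when the overlap stays the same, which is why the background should be the set of genes that were actually tested.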