Skip to content

QC for Genomics

Prerequisites

  • Access to an RCAC cluster (Negishi or Gautschi)
  • Familiarity with the Linux command line (cd, ls, mkdir)
  • A terminal session via Open OnDemand or SSH

Objectives

  • Understand why quality control matters at every stage of genomics analysis
  • Run and interpret FastQC reports for raw sequencing reads
  • Aggregate QC results across samples with MultiQC
  • Trim and filter reads with fastp
  • Know what QC metrics to check after alignment and variant calling

Quality control is the most important step you can take before starting any genomics analysis. The principle is simple: garbage in, garbage out. If your raw sequencing data has problems such as adapter contamination, low-quality bases, or reads from the wrong organism, every downstream result is compromised. The worst part is that these problems often propagate silently. Your pipeline will still produce output files. You will still get variant calls, expression counts, or an assembly. They will just be wrong.

QC is not a one-time checkbox at the beginning of your workflow. You should check quality at every stage:

  • Raw reads: Are the sequences high quality? Is there adapter contamination?
  • After trimming: Did trimming improve quality without losing too much data?
  • After alignment: What fraction of reads mapped? Are there too many duplicates?
  • After variant calling: Are the variant statistics consistent with expectations?
  • After assembly: Is the assembly complete? Are there missing genes?

The cost of skipping QC is real: wasted compute hours processing bad data, false positive results that don’t replicate, and in the worst case, retracted papers. The good news is that QC is fast and easy. Fifteen minutes of checking quality can save you weeks of troubleshooting.

Recommended starting point. Open OnDemand provides a web portal with a file browser, terminal, job submission forms, and interactive apps like JupyterLab and RStudio. No software to install.

  1. Navigate to the OOD portal

    Go to https://gateway.<cluster>.rcac.purdue.edu and log in with your Purdue (BoilerKey) credentials.

    ClusterOOD URL
    Gautschigateway.gautschi.rcac.purdue.edu
    Negishigateway.negishi.rcac.purdue.edu
    Bellgateway.bell.rcac.purdue.edu
    Gilbrethgateway.gilbreth.rcac.purdue.edu
  2. Tour the dashboard

    After login you will see the OOD dashboard with these key sections:

    • Files: Browse, upload, download, and edit files on the cluster
    • Jobs: View active jobs or use the Job Composer to build submission scripts
    • Clusters: Open a shell terminal (equivalent to SSH) directly in the browser
    • Interactive Apps: Launch JupyterLab, RStudio Server, and other GUI applications
  3. Open a terminal

    Click Clusters in the top menu and select the cluster shell access (e.g., Gautschi Shell Access). A terminal opens in your browser. You are now on a login node.

cd $RCAC_SCRATCH
mkdir -p genomics_exchange/session6
cd genomics_exchange/session6

RCAC provides bioinformatics software through the module system. Always start with a clean environment:

module --force purge

We will load specific tool modules as needed in each section below.

FastQC is the standard tool for assessing the quality of sequencing data. It takes FASTQ files as input and produces an HTML report with visualizations of key quality metrics.

Before running FastQC, it helps to understand what a FASTQ file looks like. Each read in a FASTQ file consists of four lines:

@read_name <- Read identifier
ACGTACGTACGT... <- DNA sequence
+ <- Separator
IIIIIHHHGG... <- Quality scores (one character per base)

The quality scores on line 4 are encoded as ASCII characters. Each character represents a Phred quality score: a measure of how confident the sequencer is that it read each base correctly.

Phred ScoreError ProbabilityAccuracy
Q101 in 1090%
Q201 in 10099%
Q301 in 1,00099.9%
Q401 in 10,00099.99%

Modern Illumina sequencers typically produce data with Q30 or higher for most bases. Q20 is generally considered the minimum acceptable quality.

In paired-end sequencing, both ends of each DNA fragment are read, producing two files per sample: R1 (forward read) and R2 (reverse read). This gives you more information about where each fragment came from in the genome and is the most common approach for Illumina sequencing.

  1. Load the FastQC module:

    module load biocontainers fastqc/0.12.1
  2. Create an output directory and run FastQC:

    mkdir -p fastqc_raw
    fastqc -t 4 -o fastqc_raw/ raw_reads/sample1_R1.fastq.gz raw_reads/sample1_R2.fastq.gz
    • -t 4: use 4 threads (one per file, speeds things up)
    • -o fastqc_raw/: write output to this directory
  3. For multiple samples, loop over all files:

    for fq in raw_reads/*.fastq.gz; do
    fastqc -t 4 -o fastqc_raw/ "$fq"
    done
  4. View the HTML report. You can open it through the Open OnDemand file browser (navigate to the file and click on it) or download it to your laptop:

    scp boilerid@negishi.rcac.purdue.edu:$RCAC_SCRATCH/genomics_exchange/session6/fastqc_raw/sample1_R1_fastqc.html .

FastQC evaluates 12 quality modules. Each module gets a pass (green check), warn (orange triangle), or fail (red X) status. The table below explains what each module measures and how to interpret it.

A healthy per-base quality plot looks like this: scores stay in the green zone (Q28+) across the entire read, with only a small dip at the 3’ end.

FastQC per base sequence quality for a clean Illumina library: scores remain at Q39-Q40 across all 150 cycles

ModuleWhat It MeasuresGoodBadCommon False Flags
Basic StatisticsRead count, length, GC%Consistent with expectationsUnexpected read count or lengthRarely triggers false flags
Per Base Sequence QualityPhred scores at each position along the readQ28+ across most positions; slight drop at 3’ end is normalQ < 20 across large portions of the readRarely false; always investigate
Per Tile Sequence QualityQuality variation across flow cell tilesUniform blue heatmapHot spots (yellow/red patches) indicate flow cell problemsRare
Per Sequence Quality ScoresDistribution of average quality per readSingle peak at Q30-Q37Bimodal distribution (two peaks)Rare
Per Base Sequence Content%A, %T, %G, %C at each positionRoughly equal A/T and G/C; flat linesStrong bias throughout the readFirst 10-15 bases often wobble due to random hexamer priming. This is normal
Per Sequence GC ContentDistribution of GC% across all readsSmooth bell curve matching your organismBimodal curve = contamination from another organismRarely false; investigate if unexpected
Per Base N ContentPercentage of N (unknown) bases at each positionNear 0% everywhereSpikes of N calls indicate sequencer problemsRare
Sequence Length DistributionDistribution of read lengthsAll reads same length (pre-trimming)Unexpected variationAfter trimming, variation is expected and normal
Sequence Duplication LevelsHow many reads appear more than onceMost reads unique (low duplication)High duplication in WGS = low library complexityRNA-seq: high duplication is normal (highly expressed genes). Amplicon: high duplication is expected.
Overrepresented SequencesSequences making up > 0.1% of readsNone or very fewAdapter sequences, rRNA contaminationRNA-seq: abundant transcripts trigger this. Not always a problem.
Adapter ContentCumulative adapter contamination at each positionNear 0% throughoutRising curves at read ends = adapter read-throughRarely false; needs trimming if present
Kmer ContentOverrepresented short sequences at specific positionsNo enrichmentEnrichment at specific positions = primer or adapter contaminationFirst few bases often enriched due to priming bias, usually normal

Visual Reference: Good vs Problematic Patterns

Section titled “Visual Reference: Good vs Problematic Patterns”

The cartoons below (by Zandra Selina, CC BY 4.0) show what each module looks like for healthy data and for the most common failure modes. Use them as a quick mental lookup when you open a real report.

FastQC per-base quality patterns: healthy R1/R2, mid-read cycle drop, and a failed R2 pair

R2 is expected to be slightly worse than R1, but a sharp collapse or a mid-read dip points to a sequencing run problem rather than a library issue.

Per-sequence GC content patterns: bimodal indicating mixed libraries, skewed indicating lab artefact, and a clean unimodal distribution

For RNA-seq, mild deviations from a bell curve are normal because expressed transcripts are not GC-neutral. A clearly bimodal curve, however, suggests cross-species contamination.

Adapter content patterns: rising curve from short inserts versus a flat near-zero curve for clean modern data

A rising curve at read ends means inserts are shorter than the read length. This is expected for ancient or degraded DNA, but for modern libraries it indicates short fragments that may benefit from trimming.

Duplication patterns: a heavy tail from low-complexity or overamplified libraries, a periodic pattern from amplicons, and a clean drop-off

High duplication is expected for RNA-seq (highly expressed genes) and amplicon data, but a heavy tail in WGS points to overamplification or low input material.

When you have many samples, opening individual FastQC reports one by one is tedious. MultiQC aggregates results from FastQC (and many other tools) into a single interactive HTML report.

  1. Load MultiQC:

    module load biocontainers multiqc/1.25.1
  2. Run MultiQC on all FastQC results:

    mkdir -p multiqc_raw
    multiqc fastqc_raw/ -o multiqc_raw/
  3. Open multiqc_raw/multiqc_report.html in your browser.

The power of MultiQC is in spotting outliers. When all samples are plotted together:

  • General Statistics table: One row per sample showing total reads, %GC, average quality, and % duplicates. Outlier samples stand out immediately.
  • Per Base Quality plot: All samples overlaid. Five samples might cluster together with good quality while one sample drops early. That is your problem sample.
  • Adapter Content plot: Quickly see which samples have adapter contamination and how severe it is.

The MultiQC status check heatmap gives you a one-glance view of which modules pass, warn, or fail across all samples:

MultiQC FastQC status heatmap with samples on rows and modules on columns; red cells in Per Base Sequence Content and Sequence Duplication Levels are typical for RNA-seq

In this example, every sample fails Per Base Sequence Content and Sequence Duplication Levels. For RNA-seq this is expected: random hexamer priming biases the first ~12 bases, and highly expressed transcripts inflate duplication. The Per Sequence GC Content warnings reflect transcriptome composition rather than a library defect. None of these are reasons to discard data.

fastp is an all-in-one tool for quality filtering, adapter trimming, and read preprocessing. It is fast, handles paired-end data well, and produces its own HTML report.

Adapters are short synthetic DNA sequences added during library preparation. They allow the DNA fragments to attach to the sequencer’s flow cell. Adapters are not part of your organism’s genome. If a DNA fragment is shorter than the read length, the sequencer reads past the insert and into the adapter on the other side. These adapter sequences need to be removed before analysis.

  1. Load fastp:

    module load biocontainers fastp/0.24.0
  2. Run fastp on a paired-end sample:

    mkdir -p trimmed fastp_reports
    fastp \
    -i raw_reads/sample1_R1.fastq.gz \
    -I raw_reads/sample1_R2.fastq.gz \
    -o trimmed/sample1_R1.trimmed.fq.gz \
    -O trimmed/sample1_R2.trimmed.fq.gz \
    --qualified_quality_phred 20 \
    --length_required 50 \
    --detect_adapter_for_pe \
    --thread 4 \
    --html fastp_reports/sample1.html \
    --json fastp_reports/sample1.json
  3. Open fastp_reports/sample1.html in your browser to view the before/after quality report.

ParameterDefaultDescriptionWhen to Change
--qualified_quality_phred15Bases below this quality are “unqualified”Use 20 for standard analysis; 30 for variant calling where accuracy is critical
--unqualified_percent_limit40Drop read if this % of bases are unqualifiedLower for stricter filtering
--length_required15Discard reads shorter than this after trimmingUse 50 for most Illumina data; adjust based on your aligner’s minimum
--detect_adapter_for_peoffAuto-detect adapters via paired-end overlapAlways enable for paired-end data
--adapter_sequenceautoSpecify adapter sequence manuallyOnly if auto-detection fails
--cut_frontoffSliding window trimming from 5’ endEnable for data with 5’ quality issues
--cut_tailoffSliding window trimming from 3’ endEnable for data with 3’ quality dropoff
--cut_window_size4Window size for sliding window trimmingLarger window = more aggressive trimming
--cut_mean_quality20Quality threshold for sliding windowMatch to your --qualified_quality_phred
--trim_poly_gautoTrim poly-G tailsImportant for NovaSeq/NextSeq (two-color chemistry artifacts)
--correctionoffOverlap-based error correction for PE readsEnable for variant calling workflows
--thread1Number of processing threadsUse 4 for good performance; fastp is fast even single-threaded

The fastp report shows:

  1. Summary: Total reads before and after filtering, bases before and after, read lengths
  2. Quality profiles: Side-by-side before/after quality distribution plots
  3. Filtering result: How many reads were removed and why (low quality, too short, too many Ns, adapter contamination)
  4. Adapter detection: Which adapter sequences were found and their contamination rate

After trimming, run FastQC again on the trimmed reads and use MultiQC to compare raw vs. trimmed results side by side.

  1. Run FastQC on trimmed reads:

    mkdir -p fastqc_trimmed
    for fq in trimmed/*.trimmed.fq.gz; do
    fastqc -t 4 -o fastqc_trimmed/ "$fq"
    done
  2. Run MultiQC on everything (raw FastQC + trimmed FastQC + fastp reports):

    mkdir -p multiqc_combined
    multiqc fastqc_raw/ fastqc_trimmed/ fastp_reports/ -o multiqc_combined/
  3. Open multiqc_combined/multiqc_report.html and compare:

    • Adapter content: should be gone in trimmed reads
    • Per base quality: 3’ end quality should improve (low-quality tails removed)
    • Read counts: some reads were lost during filtering (check fastp statistics)

There is no single universal threshold. What “good enough” means depends on your downstream analysis, your organism, and your experimental design. Here are general guidelines:

MetricGoodAcceptableInvestigate
% bases Q30+ (after trim)> 90%80-90%< 80%
% adapter content (after trim)< 1%1-5%> 5%
% reads retained after trimming> 95%85-95%< 85%
Duplication rate (WGS)< 15%15-30%> 30%
GC content distributionUnimodal, matches organismSlight shoulderBimodal

Use this table to decide what action to take based on your QC results:

ScenarioAction
Adapter content > 5%Trim with fastp (--detect_adapter_for_pe). Re-run FastQC to confirm removal.
Mean quality < Q25Trim with fastp quality filtering. If still poor, consider re-sequencing.
Quality drops sharply at 3’ endNormal for Illumina. Light trimming with fastp is sufficient.
GC content bimodalInvestigate contamination. Check for mixed species. Consider decontamination tools (e.g., Kraken2).
High duplication in WGSLow library complexity. May need re-sequencing with more input DNA.
High duplication in RNA-seqNormal: highly expressed genes produce duplicate reads. Not a QC failure.
Overrepresented adapter sequencesTrim with fastp. Confirm removal with FastQC.
Overrepresented biological sequencesCheck what the sequence is (BLAST it). May be rRNA contamination in RNA-seq.
Per base sequence content wobbles at 5’ endNormal: random hexamer priming bias. No action needed.
Very low read count for one sampleCheck library prep. May need re-sequencing. Exclude from analysis if insufficient depth.

Quality control does not end with read trimming. Each stage of your analysis has its own QC metrics.

After aligning reads to a reference genome (with BWA-MEM2, STAR, HISAT2, etc.), check:

module load biocontainers samtools/1.21
Tool / CommandWhat It Shows
samtools flagstat sample.bamTotal reads, mapped reads, properly paired reads, duplicates
samtools stats sample.bamComprehensive alignment statistics including insert size, error rates
picard CollectAlignmentSummaryMetricsMapping rate, mismatch rate, by read group
picard CollectInsertSizeMetricsInsert size distribution (should match library prep)

Key metrics:

  • Mapping rate: > 90% for WGS with good reference; > 70% for RNA-seq
  • Duplicate rate: < 30% typical for WGS; higher for low-input libraries
  • Insert size: should match your library prep target (e.g., 300-500 bp). Bimodal = problem.
  • Properly paired: > 90% expected
MetricExpected ValueWhat It Means
Ti/Tv ratio (WGS, human)2.0-2.1Lower values suggest false positive SNP calls
Ti/Tv ratio (exome, human)2.8-3.0Exonic regions have higher Ti/Tv
Het/Hom ratio (human)1.5-2.0Very high = contamination; very low = inbreeding or sample issue

After alignment with STAR or HISAT2:

  • Read distribution: What fraction maps to exons vs. introns vs. intergenic regions? mRNA-seq should be mostly exonic.
  • Gene body coverage: Plot signal from 5’ to 3’ end of genes. Strong 3’ bias = RNA degradation. Strong 5’ bias = incomplete reverse transcription.
  • Tools: RSeQC, Qualimap, Picard CollectRnaSeqMetrics
  • BUSCO: Checks for expected single-copy orthologs (covered in our HiFi assembly tutorial)
  • QUAST: Basic assembly statistics (N50, total length, number of contigs)

All commands used in this session in one place:

qc_cheatsheet.sh
# --- Setup ---
module --force purge
cd $RCAC_SCRATCH
mkdir -p genomics_exchange/session6
cd genomics_exchange/session6
# --- FastQC on raw reads ---
module load biocontainers fastqc/0.12.1
mkdir -p fastqc_raw
fastqc -t 4 -o fastqc_raw/ raw_reads/*.fastq.gz
# --- MultiQC on raw FastQC results ---
module load biocontainers multiqc/1.25.1
mkdir -p multiqc_raw
multiqc fastqc_raw/ -o multiqc_raw/
# --- Trim with fastp ---
module load biocontainers fastp/0.24.0
mkdir -p trimmed fastp_reports
fastp \
-i raw_reads/sample1_R1.fastq.gz \
-I raw_reads/sample1_R2.fastq.gz \
-o trimmed/sample1_R1.trimmed.fq.gz \
-O trimmed/sample1_R2.trimmed.fq.gz \
--qualified_quality_phred 20 \
--length_required 50 \
--detect_adapter_for_pe \
--thread 4 \
--html fastp_reports/sample1.html \
--json fastp_reports/sample1.json
# --- FastQC on trimmed reads ---
mkdir -p fastqc_trimmed
fastqc -t 4 -o fastqc_trimmed/ trimmed/*.trimmed.fq.gz
# --- Combined MultiQC report ---
mkdir -p multiqc_combined
multiqc fastqc_raw/ fastqc_trimmed/ fastp_reports/ -o multiqc_combined/
# --- Alignment QC (after aligning with BWA-MEM2 or STAR) ---
module load biocontainers samtools/1.21
samtools flagstat aligned/sample1.sorted.bam
samtools stats aligned/sample1.sorted.bam > sample1.stats.txt

Up next: Session 7 (April 21, 2026): Reproducible bioinformatics using Nextflow. We will learn how to wrap QC, alignment, and analysis into reproducible pipelines using Nextflow and nf-core.