Summary and Schedule

Welcome to the Genome Assembly Workshop

This workshop provides a hands-on introduction to long-read genome assembly using HiFiasm and Flye, optimized for the RCAC cluster. You’ll learn best practices for assembly, polishing, and scaffolding with Bionano optical maps, along with strategies for quality assessment and troubleshooting.

Designed for researchers and bioinformaticians, this workshop will equip you with the skills to build high-quality, reproducible genome assemblies on RCAC HPC resources.

Let’s get started!

Setup Instructions

Download files required for the lesson

00h 00m

1. Introduction to Genome Assembly

What is genome assembly, and why is it important?
What sequencing technologies can be used for genome assembly?
What are de novo and reference-guided assemblies?
What challenges arise when generating high-quality assemblies?
What software tools are used for assembling genomes?

00h 30m

2. Assembly Strategies

What factors influence the choice of genome assembly strategy?
How do different assembly methods compare in terms of read length, accuracy, and computational requirements?
What are the key steps in evaluating genome assemblies using BUSCO and QUAST?
How do Bionano OGM and Hi-C sequencing improve genome continuity and organization?

00h 50m

3. Data Quality Control

What is data quality checking and filtering?
Why is it necessary to assess the quality of raw sequencing data?
What are the key steps in filtering long-read sequencing data?
How can visualization tools like NanoPlot help in quality assessment?

01h 50m

4. PacBio HiFi Assembly using HiFiasm

What is HiFiasm, and how does it improve genome assembly using PacBio HiFi reads?
What are the key steps in running HiFiasm for haplotype-resolved assembly?
How does HiFiasm handle haplotype resolution and purging of duplications?
What are the benefits of using HiFiasm for assembling complex and heterozygous genomes?

02h 50m

5. Oxford Nanopore Assembly using Flye

What are the key features of ONT reads?
Why is Flye good for assembling ONT reads?
What are the main steps in the Flye assembly workflow?
How can you evaluate the quality of a Flye assembly?

03h 50m

6. Hybrid Long Read Assembly (optional)

What is hybrid assembly, and how does it combine different sequencing technologies?
How can you perform hybrid assembly using both types of long-read data?
What are the key steps in hybrid assembly, including polishing and scaffolding?
How do you evaluate the quality of a hybrid assembly using bioinformatics tools?

04h 35m

7. Scaffolding using Optical Genome Mapping

What is Bionano optical genome mapping (OGM) and how does it improve genome assembly?
How does Bionano Solve hybrid scaffolding integrate optical maps with sequence assemblies?
What are the key steps involved in running the Bionano Solve pipeline for hybrid scaffolding?
How can you assess the quality of hybrid scaffolds generated by Bionano Solve?

05h 15m

8. Assembly Assessment

Why is evaluating genome assembly quality important?
What tools can be used to assess assembly completeness, accuracy, and structural integrity?
How do you interpret key metrics from assembly evaluation tools?
What are the main steps in evaluating a genome assembly using bioinformatics tools?

06h 05m

Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.

Instructors

Arun Seetharam, Ph.D.: Arun is a lead bioinformatics scientist at Purdue University’s Rosen Center for Advanced Computing. With extensive expertise in comparative genomics, genome assembly, annotation, single-cell genomics, NGS data analysis, metagenomics, proteomics, and metabolomics, Arun supports a diverse range of bioinformatics projects across various organisms, including human model systems.

Schedule

Time	Session
9:00 AM	Introduction to Genome Assembly – Sequencing technologies, assembly concepts, and workshop overview
9:30 AM	Assembly Strategies – Comparing approaches, evaluation metrics, and resource planning
9:50 AM	Data Quality Control – NanoPlot, Filtlong, KMC, and GenomeScope2 for read QC
10:30 AM	Morning Break
10:45 AM	PacBio HiFi Assembly – HiFiasm assembly, purge levels, GFA conversion, and Flye for HiFi
12:00 PM	Lunch Break
1:00 PM	Oxford Nanopore Assembly – Flye assembler, Medaka polishing, and HiFiasm for ONT
1:45 PM	Hybrid Assembly – Combining ONT + HiFi reads with Flye, Bionano scaffolding
2:30 PM	Afternoon Break
2:45 PM	Scaffolding with Optical Genome Mapping – Bionano Solve for HiFiasm and Flye assemblies
3:15 PM	Assembly Evaluation – QUAST, Compleasm, Merqury, Bandage, and comparative analysis
3:50 PM	Wrap-Up & Discussion – Summary, Q&A, and next steps
4:00 PM	Dismissal

What is not covered

Short read assembly
Hi-C scaffolding
Annotation
Comparative analyses

Pre-requisites

Basic knowledge of genomics
Basic knowledge of command line interface
Basic knowledge of bioinformatics tools

Data Sets

To copy the data to your scratch space:

BASH

rsync -avP /depot/workshop/data/genome-assembly/genome-assembly-data "$RCAC_SCRATCH"

The worked-out results folder is also available at /depot/workshop/data/genome-assembly/genome-assembly-data on the training cluster. Use it only if you are unable to finish the exercises during the workshop.
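
After the copy finishes, it can help to confirm that the data actually landed in scratch. The following is a minimal sketch, not part of the official workshop material: the function name summarize_copy is hypothetical, and the target path assumes that rsync (run without a trailing slash on the source) created a genome-assembly-data folder inside $RCAC_SCRATCH.

```shell
#!/usr/bin/env bash
# Sketch: report file count and total size of a copied data directory.
# summarize_copy is a hypothetical helper name, not a workshop-provided tool.
summarize_copy() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir" >&2
        return 1
    fi
    local n
    # Count regular files below the directory (tr strips padding some wc builds add)
    n=$(find "$dir" -type f | wc -l | tr -d ' ')
    echo "files: $n"
    # Human-readable total size of the directory
    du -sh "$dir"
}

# Run the check only where RCAC_SCRATCH is defined (i.e., on the cluster).
# Assumption: the copy created $RCAC_SCRATCH/genome-assembly-data.
if [ -n "${RCAC_SCRATCH:-}" ]; then
    summarize_copy "$RCAC_SCRATCH/genome-assembly-data"
fi
```

Running the same function on the source directory under /depot and comparing the two file counts is a quick way to spot an interrupted transfer.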

Software Setup

SSH key setup for each operating system is described in the sections below.

Windows

Open a terminal and run:

SH

ssh-keygen -b 4096 -t rsa
type .ssh\id_rsa.pub | ssh trainXX@negishi.rcac.purdue.edu "mkdir -p ~/.ssh; cat >> ~/.ssh/authorized_keys"

macOS

Open Terminal and run:

SH

ssh-keygen -b 4096 -t rsa
cat ~/.ssh/id_rsa.pub | ssh trainXX@negishi.rcac.purdue.edu "mkdir -p ~/.ssh; cat >> ~/.ssh/authorized_keys"

Linux

Open a terminal and run:

SH

ssh-keygen -b 4096 -t rsa
cat ~/.ssh/id_rsa.pub | ssh trainXX@negishi.rcac.purdue.edu "mkdir -p ~/.ssh; cat >> ~/.ssh/authorized_keys"
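
Once the key is installed, a quick way to confirm that key-based login works is to attempt a connection with password prompts disabled. This is a minimal sketch for macOS and Linux shells: the function name check_key_login and the TRAIN_LOGIN variable are hypothetical, while BatchMode and ConnectTimeout are standard OpenSSH client options. Replace trainXX with your assigned training account.

```shell
#!/usr/bin/env bash
# Sketch: verify that key-based SSH login works without a password prompt.
# check_key_login is a hypothetical helper name.
check_key_login() {
    # BatchMode=yes makes ssh fail instead of asking for a password,
    # so success here means the key was accepted.
    if ssh -o BatchMode=yes -o ConnectTimeout=10 "$1" true; then
        echo "key login OK: $1"
    else
        echo "key login failed: $1" >&2
        return 1
    fi
}

# Assumption: TRAIN_LOGIN is set by you, for example
#   TRAIN_LOGIN=trainXX@negishi.rcac.purdue.edu
if [ -n "${TRAIN_LOGIN:-}" ]; then
    check_key_login "$TRAIN_LOGIN"
fi
```

On Windows, the same ssh command with the two -o options can be run directly in the terminal; if it returns without asking for a password, the key is in place.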