Summary and Schedule

Welcome to the Genome Annotation Workshop

This workshop provides a hands-on introduction to genome annotation, covering structural and functional annotation techniques optimized for the RCAC clusters. You’ll learn best practices for gene prediction using BRAKER3, Helixer, and EASEL, as well as functional annotation with EnTAP.

Designed for researchers and bioinformaticians, this workshop will equip you with the skills to generate high-quality, reproducible genome annotations, leveraging transcriptomic, proteomic, and homology-based evidence. You’ll also explore strategies for quality assessment and troubleshooting common challenges in annotation workflows.

Let’s get started!

Setup Instructions

Download files required for the lesson

00h 00m

1. Introduction to Genome Annotation

What is genome annotation, and why is it important?
What are the different types of genome annotation?
What challenges make genome annotation difficult?
What data sources are used to improve annotation accuracy?
How does the annotation process fit into genomics research?

00h 20m

2. Annotation Strategies

What are the different strategies used for genome annotation?
How do various methods predict genes, and what are their strengths and limitations?
What role does evidence play in improving gene predictions?
How do deep learning and large models enhance genome annotation accuracy?
What tools are used in this workshop, and how do they fit into different strategies?

00h 32m

3. Annotation Setup

How should files and directories be structured for a genome annotation workflow?
Why is RNA-seq read mapping important for gene prediction?
How does repeat masking improve annotation accuracy?
What preprocessing steps are necessary before running gene prediction tools?

00h 44m

4. Annotation using BRAKER

What is BRAKER3?
What are the different scenarios in which BRAKER3 can be used?
How to run BRAKER3 with different input requirements?
What are the different output files generated by BRAKER3?

00h 56m

5. Annotation using Helixer

How to predict genes using Helixer?
How to download trained models for Helixer?
How to run Helixer on the HPC cluster (Gilbreth)?

01h 08m

6. Annotation using EASEL

What are the steps required to set up and run EASEL on an HPC system?
How is Nextflow used to manage and execute the EASEL workflow?
What configuration files need to be modified before running EASEL?
How do you submit and monitor an EASEL job using Slurm?

01h 20m

7. Functional annotation using EnTAP

What is EnTAP, and how does it improve functional annotation for non-model transcriptomes?
How does EnTAP filter, annotate, and assign functional roles to predicted transcripts?
What databases and evidence sources does EnTAP integrate for annotation?
What are the key steps required to set up and execute EnTAP on an HPC system?

01h 32m

8. Annotation Assessment

How to assess the quality of a genome annotation?
What are the different tools available for assessing the quality of a genome annotation?
How to compare the predicted annotation with the reference annotation?
How to measure the number of raw reads assigned to the features predicted by the annotation?

01h 44m

Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.

Instructors

Arun Seetharam, Ph.D.: Arun is a lead bioinformatics scientist at Purdue University’s Rosen Center for Advanced Computing. With extensive expertise in comparative genomics, genome assembly, annotation, single-cell genomics, NGS data analysis, metagenomics, proteomics, and metabolomics, Arun supports a diverse range of bioinformatics projects across various organisms, including human model systems.
Charles Christoffer, Ph.D.: Charles is a Senior Computational Scientist at Purdue University’s Rosen Center for Advanced Computing. He has a Ph.D. in Computer Science in the area of structural bioinformatics and has extensive experience in protein structure prediction.

Schedule

Time	Session
8:30 AM	Arrival & Setup
9:00 AM	Introduction & Annotation Strategies – Overview of genome annotation, structural vs. functional annotation, key challenges, and selecting the right pipeline
10:30 AM	Break
10:40 AM	Gene Annotation with BRAKER – Running BRAKER for ab initio and RNA-seq-supported annotation, gene model evaluation
12:00 PM	Lunch Break
1:00 PM	Interpreting BRAKER Results & Gene Annotation with Helixer – Reviewing BRAKER outputs, refining predictions, and using Helixer for deep-learning-based annotation
2:50 PM	Break
3:10 PM	Functional Annotation with EnTAP & Annotation Quality Assessment – Assigning gene functions, GO term mapping, evaluating completeness with BUSCO, and benchmarking gene models
4:30 PM	Wrap-Up & Discussion – Troubleshooting, Q&A, and next steps

What is not covered

Gene prediction using MAKER
Evidence-based gene prediction (EviAnn, EVidenceModeler)
Genome assembly
Comparative analyses

Pre-requisites

Basic knowledge of genomics
Basic knowledge of command line interface
Basic knowledge of bioinformatics tools

Data Sets

To copy only the data:

BASH

rsync -avP /depot/workshop/data/annotation_workshop ${RCAC_SCRATCH}/

The completed results folder is available at /depot/workshop/data/annotation_workshop-results on the training cluster. You can copy it to your scratch space using the following command:

BASH

rsync -avP /depot/workshop/data/annotation_workshop-results ${RCAC_SCRATCH}/

Only use this if you are unable to finish the exercises in the workshop.

You only need one directory on the Gilbreth cluster. See below for details.

For Gilbreth Cluster only

BASH

rsync -avP /depot/workshop/data/annotation_workshop/05_helixer ${RCAC_SCRATCH}/
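After any of the rsync commands above, a quick check can confirm that the copy landed where expected. The following is a minimal sketch and not part of the workshop materials: `check_copy` is a hypothetical helper name, and the default path assumes the `annotation_workshop` directory created under `${RCAC_SCRATCH}` by the first command.

```shell
# Hypothetical helper (not part of the workshop): report whether a
# directory exists and how many regular files it contains.
check_copy() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    # tr strips the padding that some wc implementations add
    count=$(find "$dir" -type f | wc -l | tr -d ' ')
    echo "ok: $dir ($count files)"
}

# Run the check only where RCAC_SCRATCH is defined, as on the RCAC clusters.
if [ -n "${RCAC_SCRATCH:-}" ]; then
    check_copy "${RCAC_SCRATCH}/annotation_workshop" || true
fi
```

On Gilbreth, where only one directory is copied, the same helper could be called as `check_copy "${RCAC_SCRATCH}/05_helixer"`.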

Software Setup

Details

SSH key setup for each operating system is described in the sections below. In each command, replace trainXX with your assigned training username.

Windows

Open a terminal and run:

SH

ssh-keygen -b 4096 -t rsa
type .ssh\id_rsa.pub | ssh trainXX@negishi.rcac.purdue.edu "mkdir -p ~/.ssh; cat >> ~/.ssh/authorized_keys"

macOS

Open Terminal and run:

SH

ssh-keygen -b 4096 -t rsa
cat ~/.ssh/id_rsa.pub | ssh trainXX@negishi.rcac.purdue.edu "mkdir -p ~/.ssh; cat >> ~/.ssh/authorized_keys"

Linux

Open a terminal and run:

SH

ssh-keygen -b 4096 -t rsa
cat ~/.ssh/id_rsa.pub | ssh trainXX@negishi.rcac.purdue.edu "mkdir -p ~/.ssh; cat >> ~/.ssh/authorized_keys"
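Once the key is installed, it may help to confirm that login works without a password before the workshop starts. The sketch below is meant for macOS and Linux shells and rests on two assumptions: `test_key_login` is a hypothetical helper, and `WORKSHOP_USER` is a made-up variable standing in for your trainXX account. It relies on the OpenSSH `BatchMode` option, which disables password prompts, so the command fails outright instead of asking for a password.

```shell
# Hypothetical helper: returns 0 only if key-based login succeeds.
# BatchMode=yes disables password prompts; ConnectTimeout bounds the wait.
test_key_login() {
    ssh -o BatchMode=yes -o ConnectTimeout=10 "$1" true
}

# WORKSHOP_USER is an assumed variable; set it to your trainXX account first.
if [ -n "${WORKSHOP_USER:-}" ]; then
    if test_key_login "${WORKSHOP_USER}@negishi.rcac.purdue.edu"; then
        echo "key login OK"
    else
        echo "key login failed; re-run the key setup commands"
    fi
fi
```

For example, running `WORKSHOP_USER=trainXX` (with your own account number) before the block would trigger the check; leaving the variable unset skips it.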