Functional annotation using EnTAP

Last updated on 2026-02-24

Overview

Questions

  • What is EnTAP, and how does it improve functional annotation for non-model transcriptomes?
  • How does EnTAP filter, annotate, and assign functional roles to predicted transcripts?
  • What databases and evidence sources does EnTAP integrate for annotation?
  • What are the key steps required to set up and execute EnTAP on an HPC system?

Objectives

  • Understand how EnTAP improves functional annotation for non-model eukaryotes.
  • Learn how EnTAP processes transcript data through filtering, alignment, and functional assignment.
  • Set up and modify EnTAP configuration files for correct execution.
  • Run EnTAP on an HPC system and interpret the generated annotations.

EnTAP Overview


EnTAP is a bioinformatics pipeline designed to enhance the accuracy, speed, and flexibility of functional annotation for de novo assembled transcriptomes in non-model eukaryotic organisms. It mitigates assembly fragmentation, improves annotation rates, and provides extensive functional insights into transcriptomes. You can provide predicted transcripts in FASTA format and a GFF file containing gene models to run EnTAP.

Callout

Key Features

  • Optimized for non-model eukaryotes – Overcomes challenges of fragmented transcriptome assemblies.
  • Fast & efficient – Runs significantly faster than comparable annotation tools.
  • Customizable – Supports optional filtering and analysis steps for user-specific needs.
  • Comprehensive functional insights – Integrates multiple annotation sources for high-confidence gene assignments.
  • Contaminant detection – Helps remove misleading sequences for cleaner datasets.

How EnTAP Works


  1. Transcriptome Filtering – Identifies true coding sequences (CDS) and removes assembly artifacts:
    • Expression Filtering (optional) – Filters transcripts based on gene expression levels using RSEM.
    • Frame Selection (optional) – Further refines CDS predictions using TransDecoder.
  2. Transcriptome Annotation – Assigns functional information to sequences:
    • Similarity Search – Rapid alignment against user-selected databases using DIAMOND.
    • Contaminant Filtering & Best Hit Selection – Identifies optimal annotations and flags potential contaminants.
    • Orthologous Group Assignment – Assigns translated proteins to gene families using eggNOG/eggnog-mapper, including:
      • Protein Domains (SMART/Pfam)
      • Gene Ontology (GO) Terms
      • KEGG Pathway Annotation
    • InterProScan (optional) – Searches InterPro databases for additional domain, GO, and pathway annotations.
    • Horizontal Gene Transfer Analysis (optional) – Detects potential horizontal gene transfer (HGT) events via DIAMOND.

Running EnTAP


EnTAP is available as a module on the HPC cluster. You can load the module using the following commands:

BASH

ml --force purge
ml biocontainers
ml entap

Prerequisite

First-Time Setup

The first time you run EnTAP, you must set up its databases. This involves downloading files from several databases and can be time-consuming. For this workshop the setup has already been done, so you can skip this step; it is included here for reference.

Callout

Pre-staged Databases

For this workshop, the EnTAP databases have been pre-configured and are available at /depot/itap/datasets/entap_db/ on the training cluster. You can copy the configuration files from there:

BASH

cp /depot/itap/datasets/entap_db/entap_config.ini .
cp /depot/itap/datasets/entap_db/entap_run.params .

If you are running EnTAP on your own data outside this workshop, follow the First-Time Setup instructions in this section instead.

The entap_run.params file should be set up as follows (be sure to select the correct databases for your organism):

INI

out-dir=entap_dbfiles
overwrite=false
resume=false
input=
database=uniprot_sprot,refseq_plant
no-trim=false
threads=1
output-format=1,3,4,7,
fpkm=0.5
align=
single-end=false
frame-selection=false
transdecoder-m=100
transdecoder-no-refine-starts=false
taxon=
qcoverage=50
tcoverage=50
contam=
e-value=1e-05
uninformative=conserved,predicted,unknown,unnamed,hypothetical,putative,unidentified,uncharacterized,uncultured,uninformative,
diamond-sensitivity=very-sensitive
ontology_source=0,
eggnog-contaminant=true
eggnog-dbmem=true
eggnog-sensitivity=more-sensitive
interproscan-db=
hgt-donor=
hgt-recipient=
hgt-gff=
ncbi-api-key=
ncbi-api-enable=true
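
Several keys in this file appear to correspond to the optional steps listed under How EnTAP Works. The sketch below prints which of them are switched on. The mapping from keys to steps is inferred from the key names and is an assumption, and get_param and report_optional_steps are hypothetical helper names that are not part of EnTAP.

```shell
# Hypothetical helpers (not part of EnTAP) for inspecting a key=value params file.

# Print the value of KEY from FILE (prints nothing if the key is absent or empty).
get_param() {
    local file="$1" key="$2"
    # Match the exact key before the first '=' and print everything after it.
    awk -F= -v k="$key" '$1 == k { sub(/^[^=]*=/, ""); print; exit }' "$file"
}

# Report optional steps. ASSUMPTION: a step counts as "on" when its key is
# "true" (frame-selection) or non-empty (align, interproscan-db, hgt-donor).
report_optional_steps() {
    local file="$1"
    local fs align ipr donor
    fs=$(get_param "$file" frame-selection)
    align=$(get_param "$file" align)
    ipr=$(get_param "$file" interproscan-db)
    donor=$(get_param "$file" hgt-donor)
    [ "$fs" = "true" ] && fs=on || fs=off
    [ -n "$align" ] && align=on || align=off
    [ -n "$ipr" ] && ipr=on || ipr=off
    [ -n "$donor" ] && donor=on || donor=off
    echo "frame selection: $fs"
    echo "expression filtering: $align"
    echo "InterProScan: $ipr"
    echo "HGT analysis: $donor"
}

# Only run when the params file is present in the current directory.
if [ -f entap_run.params ]; then
    report_optional_steps entap_run.params
fi
```

With the parameter file shown above, every optional step would be reported as off, because frame-selection is false and the other keys are empty.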

The entap_config.ini file should be set up as follows (be sure to replace <custom_location> in the paths with your desired location):

INI

data-generate=false
data-type=0,
entap-db-bin=<custom_location>/entap_db/bin/entap_database.bin
entap-db-sql=entap_database.db
entap-graph=entap_graphing.py
rsem-calculate-expression=rsem-calculate-expression
rsem-sam-validator=rsem-sam-validator
rsem-prepare-reference=rsem-prepare-reference
convert-sam-for-rsem=convert-sam-for-rsem
transdecoder-long-exe=TransDecoder.LongOrfs
transdecoder-predict-exe=TransDecoder.Predict
diamond-exe=diamond
eggnog-map-exe=emapper.py
eggnog-map-data=<custom_location>/entap_db/databases
eggnog-map-dmnd=<custom_location>/entap_db/bin/eggnog_proteins.dmnd
interproscan-exe=interproscan.sh
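
Replacing the <custom_location> placeholder by hand is error-prone, so a small helper can do it in one pass. This is a minimal sketch: set_custom_location is a hypothetical function name, and the example uses the current directory as the location purely for illustration.

```shell
# Hypothetical helper: replace every <custom_location> placeholder in FILE.
set_custom_location() {
    local file="$1" location="$2"
    # '|' is used as the sed delimiter because paths contain '/'.
    # -i.bak edits in place and keeps the original as FILE.bak.
    sed -i.bak "s|<custom_location>|${location}|g" "$file"
}

# Example: use the current directory as the location (an assumption; pick your own).
if [ -f entap_config.ini ]; then
    set_custom_location entap_config.ini "$PWD"
    grep -n 'entap_db' entap_config.ini || true
fi
```

The helper assumes the chosen location contains no "|" or "&" characters, since sed would interpret them.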

Once both files are in place, run the following command to perform the first-time setup:

BASH

ml --force purge
ml biocontainers
ml entap
EnTAP \
   --config \
   --run-ini ./entap_run.params \
   --entap-ini ./entap_config.ini \
   --threads ${SLURM_CPUS_ON_NODE}

This will download the databases and set up the configuration files for EnTAP.
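
After the setup finishes, it can be worth confirming that the paths named in entap_config.ini actually exist. The sketch below assumes the setup writes the databases to the locations given in that file; check_config_paths is a hypothetical helper, not an EnTAP command.

```shell
# Hypothetical helper: report whether each absolute path in a key=value file exists.
# Returns 0 if all absolute paths exist, 1 otherwise.
check_config_paths() {
    local file="$1" key value missing=0
    # Split each line at the first '='; the trailing test handles a last line
    # that has no newline.
    while IFS='=' read -r key value || [ -n "$key" ]; do
        case "$value" in
            /*)  # only absolute paths are checked; bare command names are skipped
                if [ -e "$value" ]; then
                    echo "ok       $key=$value"
                else
                    echo "MISSING  $key=$value"
                    missing=1
                fi
                ;;
        esac
    done < "$file"
    return "$missing"
}

if [ -f entap_config.ini ]; then
    check_config_paths entap_config.ini || echo "Some configured paths are missing."
fi
```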

Step 1: Prepare files


You will need the following input files:

  • Transcript FASTA file – Contains predicted transcripts in FASTA format.
  • Configuration file – Specifies parameters for EnTAP execution.
    • entap_run.params – Contains runtime parameters for EnTAP.
    • entap_config.ini – Specifies paths to EnTAP binaries and databases. (you can copy the files from /depot/itap/datasets/entap_db/entap_{config.ini,run.params})
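
Before running EnTAP, a quick sanity check of the transcript FASTA file can catch an empty or truncated input. The sketch below counts records by counting header lines; count_fasta_records is a hypothetical helper name, and input_cds.fasta is the example file name used in this lesson.

```shell
# Hypothetical helper: count FASTA records (lines that start with '>').
count_fasta_records() {
    grep -c '^>' "$1"
}

if [ -f input_cds.fasta ]; then
    echo "records: $(count_fasta_records input_cds.fasta)"
fi
```

This number is also a useful reference point later, when you compare how many transcripts were annotated.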

Edit the entap_run.params file to specify the output directory for EnTAP results and the correct input file:

INI

out-dir=entap_out
input=input_cds.fasta
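
The same edit can be scripted, which helps when you run EnTAP on several inputs. This is a minimal sketch using sed; set_param is a hypothetical helper, and it assumes each key already appears exactly once in the file at the start of a line.

```shell
# Hypothetical helper: set KEY to VALUE in a key=value file, keeping FILE.bak.
set_param() {
    local file="$1" key="$2" value="$3"
    # Anchor on '^KEY=' so that, for example, 'input' does not match 'interproscan-db'.
    sed -i.bak "s|^${key}=.*|${key}=${value}|" "$file"
}

if [ -f entap_run.params ]; then
    set_param entap_run.params out-dir entap_out
    set_param entap_run.params input input_cds.fasta
    # Show the two edited lines.
    grep -E '^(out-dir|input)=' entap_run.params
fi
```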

Step 2: Run EnTAP


Run EnTAP using the following command:

BASH

ml --force purge
ml biocontainers
ml entap
EnTAP \
   --run \
   --run-ini ./entap_run.params \
   --entap-ini ./entap_config.ini \
   --threads ${SLURM_CPUS_ON_NODE}
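
The SLURM_CPUS_ON_NODE variable is set by SLURM inside a job, so the command above is meant to run within a job allocation. One way to do that is a batch script. The sketch below writes such a script to a file; the file name entap_job.sh and all resource values (CPUs, wall time, log name) are placeholders, and your cluster may also require an account or partition option.

```shell
# Write a hypothetical SLURM batch script; resource values are placeholders.
# The quoted 'EOF' keeps ${SLURM_CPUS_ON_NODE} unexpanded until the job runs.
cat > entap_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=entap
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --time=24:00:00
#SBATCH --output=entap_%j.log

ml --force purge
ml biocontainers
ml entap

EnTAP \
   --run \
   --run-ini ./entap_run.params \
   --entap-ini ./entap_config.ini \
   --threads "${SLURM_CPUS_ON_NODE}"
EOF

# Check the script for shell syntax errors without running it.
bash -n entap_job.sh
```

You would then submit the script with sbatch entap_job.sh and follow progress in the log file.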

Interpreting Results


EnTAP generates several output files, but the key results will be in the entap_out/final_results directory.
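
A quick way to confirm that a run completed is to check that the expected files are present in that directory. The sketch below uses the file names from the output summary in this lesson; check_final_results is a hypothetical helper, and the exact set of files may differ between EnTAP versions.

```shell
# Hypothetical helper: report which expected result files exist in DIR.
# Returns 0 if all are present, 1 otherwise.
check_final_results() {
    local dir="$1" name missing=0
    for name in annotated.fnn annotated.tsv annotated_gene_ontology_terms.tsv \
                entap_results.tsv unannotated.fnn unannotated.tsv; do
        if [ -e "$dir/$name" ]; then
            echo "found    $name"
        else
            echo "missing  $name"
            missing=1
        fi
    done
    return "$missing"
}

if [ -d entap_out/final_results ]; then
    check_final_results entap_out/final_results || true
fi
```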

EnTAP Output Files Summary

File/Directory                       Description
Final Results (final_results/)
annotated.fnn                        FASTA file of annotated transcripts.
annotated.tsv                        Tab-separated file with functional annotations.
annotated_gene_ontology_terms.tsv    GO terms assigned to annotated transcripts.
entap_results.tsv                    Master summary of all results, including annotations.
unannotated.fnn                      FASTA file of unannotated transcripts.
unannotated.tsv                      List of transcripts that failed annotation.
gene_family/                         Stores eggNOG gene family assignments, including orthologs and functional annotations.
similarity_search/                   Contains results from DIAMOND BLASTX searches against selected databases.
transcriptomes/                      Holds the input transcriptome (CDS) and the processed version after filtering.

Challenge

Exercise 1: Annotation Coverage

After running EnTAP, examine the output files in the final_results/ directory. What percentage of your input transcripts received functional annotations?

Hint: Compare the number of entries in annotated.tsv and unannotated.tsv.

You can count the lines (excluding the header) in each file:

BASH

# Count annotated transcripts
tail -n +2 entap_out/final_results/annotated.tsv | wc -l

# Count unannotated transcripts
tail -n +2 entap_out/final_results/unannotated.tsv | wc -l

The annotation rate is calculated as:

annotated / (annotated + unannotated) * 100

This gives you the percentage of transcripts that received at least one functional annotation from the databases used by EnTAP.
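
The two counts and the formula can be combined into a single helper. This is a minimal sketch that assumes each file has exactly one header line; count_rows and annotation_rate are hypothetical function names.

```shell
# Hypothetical helper: count data rows in a TSV file, skipping one header line.
count_rows() {
    awk 'NR > 1 { n++ } END { print n + 0 }' "$1"
}

# Print annotated / (annotated + unannotated) * 100 with two decimals,
# or "n/a" when both files have no data rows.
annotation_rate() {
    local annotated unannotated
    annotated=$(count_rows "$1")
    unannotated=$(count_rows "$2")
    awk -v a="$annotated" -v u="$unannotated" 'BEGIN {
        total = a + u
        if (total == 0) { print "n/a"; exit }
        printf "%.2f\n", a / total * 100
    }'
}

if [ -d entap_out/final_results ]; then
    annotation_rate entap_out/final_results/annotated.tsv \
                    entap_out/final_results/unannotated.tsv
fi
```

For example, 3 annotated and 1 unannotated transcripts give 3 / (3 + 1) * 100, which the helper prints as 75.00.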

Challenge

Exercise 2: Uninformative Annotations

Look at the uninformative parameter in the entap_run.params file. What does this parameter do and why is it important for functional annotation quality?

The uninformative parameter specifies a list of keywords used to filter out low-quality or non-descriptive annotations. In the default configuration, it includes terms like:

uninformative=conserved,predicted,unknown,unnamed,hypothetical,putative,unidentified,uncharacterized,uncultured,uninformative,

When EnTAP encounters a DIAMOND hit whose description contains any of these keywords, it treats that hit as uninformative and attempts to find a better annotation from other database matches. This matters because many protein database entries carry vague descriptions such as “hypothetical protein” or “predicted protein” that provide no meaningful functional insight. By filtering these out, EnTAP ensures that only informative, descriptive functional assignments are kept in the final results, leading to higher-quality annotations for downstream analysis.
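
The keyword check described above can be illustrated with a few lines of shell. This sketch is not EnTAP's implementation: it only shows substring matching against the default keyword list, the case-insensitive comparison is an assumption, and is_uninformative is a hypothetical function name.

```shell
# Default keyword list copied from entap_run.params.
UNINFORMATIVE="conserved,predicted,unknown,unnamed,hypothetical,putative,unidentified,uncharacterized,uncultured,uninformative,"

# Hypothetical helper: succeed (exit 0) if DESCRIPTION contains any keyword.
# ASSUMPTION: matching is case-insensitive.
is_uninformative() {
    local desc kw
    desc=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    local IFS=','            # split the keyword list on commas
    for kw in $UNINFORMATIVE; do
        [ -z "$kw" ] && continue
        case "$desc" in
            *"$kw"*) return 0 ;;
        esac
    done
    return 1
}

for d in "Hypothetical protein" "ATP synthase subunit beta"; do
    if is_uninformative "$d"; then
        echo "uninformative: $d"
    else
        echo "informative:   $d"
    fi
done
```

Here "Hypothetical protein" is flagged because it contains the keyword hypothetical, while "ATP synthase subunit beta" contains none of the keywords and is kept.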

Key Points

  • EnTAP enhances functional annotation by integrating multiple evidence sources, including homology, protein domains, and gene ontology.
  • Proper setup of configuration files and databases is essential for accurate and efficient EnTAP execution.
  • Running EnTAP involves transcript filtering, similarity searches, and functional annotation through automated workflows.
  • The pipeline provides extensive insights into transcript function, improving downstream biological interpretations.