
Juicer on Negishi cluster

Prerequisites

  • Hi-C paired-end FASTQ files
  • Know your reference genome
  • Know the restriction enzyme used
  • RCAC HPC account (Negishi or Bell)

Objective

Generate .hic contact maps with the Juicer Hi-C processing pipeline on the Negishi cluster.

Juicer is a pipeline for analyzing Hi-C data, including alignment, filtering, deduplication, and generation of .hic contact matrices. On Negishi, Juicer runs using a Singularity container with all required dependencies pre-installed (BWA, SAMtools, Java, etc.).

Pre-built reference genomes are available in:

/depot/itap/datasets/juicer/2.0.1

Currently, hg19 is the only genome available (we will add more genomes upon request). The reference directory contains:

  • Reference FASTA (genome.fa)
  • BWA index files (.bwt, .pac, .ann, .amb, and .sa)
  • Chromosome sizes file (chrom.sizes)
  • Restriction enzyme site positions (e.g., hg19_MboI.txt)
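Before submitting jobs, it can help to confirm that a reference FASTA has its BWA index files beside it. The following is a minimal sketch; the function name check_bwa_index is hypothetical, and it assumes the index files follow the usual BWA naming (the FASTA file name plus .amb, .ann, .bwt, .pac, and .sa).

```shell
# Report whether a FASTA file has all five BWA index files beside it.
# Prints each missing (or empty) file and returns non-zero if any is absent.
check_bwa_index() {
    fasta=$1
    missing=0
    if [ ! -s "$fasta" ]; then
        echo "missing: $fasta"
        missing=1
    fi
    for ext in amb ann bwt pac sa; do
        if [ ! -s "$fasta.$ext" ]; then
            echo "missing: $fasta.$ext"
            missing=1
        fi
    done
    return "$missing"
}

# Example (path taken from the default -z value on this cluster):
#   check_bwa_index /depot/itap/datasets/juicer/2.0.1/references/Homo_sapiens_assembly19.fasta
```

The same check applies to a custom genome passed with -z.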

Juicer is deployed as a biocontainer on the Negishi cluster. To run Juicer, follow these steps.

Organize your data in a directory structure like this:

cd $RCAC_SCRATCH
mkdir -p juicer_run/fastq
cp /path/to/your/fastq/files/* juicer_run/fastq/

Your directory structure should look like this:

  • juicer_run/
    • fastq/
      • HIC003_S2_L001_R1_001.fastq.gz
      • HIC003_S2_L001_R2_001.fastq.gz
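Juicer looks for files matching fastq/*_R*.fastq*, so each _R1 file needs a matching _R2 file. The sketch below checks this before the pipeline is launched; the function name check_fastq_pairs is hypothetical, and it assumes the mate's name differs only in _R1 versus _R2.

```shell
# For every *_R1*.fastq* file in a directory, confirm the matching _R2 file exists.
# Prints the number of complete pairs; returns non-zero if a mate is missing
# or if no pairs are found at all.
check_fastq_pairs() {
    dir=$1
    pairs=0
    status=0
    for r1 in "$dir"/*_R1*.fastq*; do
        [ -e "$r1" ] || continue            # the glob matched nothing
        name=${r1##*/}                      # strip the directory part
        prefix=${name%%_R1*}                # text before the first _R1
        suffix=${name#*_R1}                 # text after the first _R1
        if [ -e "$dir/${prefix}_R2${suffix}" ]; then
            pairs=$((pairs + 1))
        else
            echo "no mate for $name" >&2
            status=1
        fi
    done
    echo "$pairs"
    [ "$pairs" -gt 0 ] && [ "$status" -eq 0 ]
}

# Example:
#   check_fastq_pairs "$RCAC_SCRATCH/juicer_run/fastq"
```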

The Juicer pipeline works by creating a series of batch jobs and submitting them all at once, linked by job dependencies. The main script is very lightweight and must be run on the login node.
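To illustrate the dependency mechanism, the sketch below chains arbitrary batch scripts with sbatch so that each one starts only after the previous job finishes successfully. This is not Juicer's actual submission code; the function name submit_chain and the script names are hypothetical, while --parsable and --dependency=afterok are standard sbatch options.

```shell
# Submit each script so that it waits for the previous job to succeed.
# Prints the ID of the last job submitted.
submit_chain() {
    prev=""
    for script in "$@"; do
        if [ -z "$prev" ]; then
            jobid=$(sbatch --parsable "$script") || return 1
        else
            jobid=$(sbatch --parsable --dependency="afterok:$prev" "$script") || return 1
        fi
        prev=${jobid%%;*}   # --parsable may append ";cluster" to the ID
    done
    echo "$prev"
}

# Example (hypothetical script names):
#   submit_chain split.sh align.sh merge.sh
```

Jobs submitted this way appear in the queue immediately and wait with the reason (Dependency) until their predecessor completes.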

Load the module

module load biocontainers
module load juicer/2.0.1

You can run Juicer using the juicer.sh script.

juicer.sh -d $RCAC_SCRATCH/juicer_run \
-g hg19 \
-A testpbs \
-q testpbs -Q 2:00:00 \
-l testpbs -L 8:00:00

The arguments to the script are:

  • -d: Top-level working directory; must contain fastq/
  • -q: Queue name for alignments (e.g., testpbs)
  • -Q: Walltime for alignments (e.g., 2:00:00)
  • -l: Queue name for the rest of the pipeline (e.g., testpbs)
  • -L: Walltime for the rest of the pipeline (e.g., 8:00:00)
  • -A: Account name (e.g., testpbs)

The other options keep their default values:

  • -g: Genome ID (hg19)
  • -z: Genome FASTA file (${JUICER_DIR}/references/Homo_sapiens_assembly19.fasta)
  • -y: Restriction sites (${JUICER_DIR}/restriction_site/hg19_MboI.txt)
  • -D: Juicer scripts (default: ${JUICER_DIR} or /depot/itap/datasets/juicer/2.0.1)
  • -s: Restriction enzyme (MboI)
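For reference, the sketch below prints the same juicer.sh command with those default values written out, which can serve as a template when swapping in another genome or enzyme. The helper name explicit_juicer_cmd is hypothetical. It assumes JUICER_DIR is set by the juicer module and falls back to the path given above. It only prints the command and does not submit anything.

```shell
# Assumption: JUICER_DIR is normally provided by the juicer module;
# fall back to the documented location if it is not set.
JUICER_DIR=${JUICER_DIR:-/depot/itap/datasets/juicer/2.0.1}

# Print the juicer.sh command with the default options spelled out.
# $1 = top-level directory, $2 = account and queue name (placeholder values).
explicit_juicer_cmd() {
    printf 'juicer.sh -d %s -g hg19 -s MboI' "$1"
    printf ' -z %s/references/Homo_sapiens_assembly19.fasta' "$JUICER_DIR"
    printf ' -y %s/restriction_site/hg19_MboI.txt' "$JUICER_DIR"
    printf ' -D %s' "$JUICER_DIR"
    printf ' -A %s -q %s -Q 2:00:00 -l %s -L 8:00:00\n' "$2" "$2" "$2"
}

explicit_juicer_cmd "${RCAC_SCRATCH:-/path/to/scratch}/juicer_run" testpbs
```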

When this command is run, it will create a series of jobs in the specified queue.

What does the stdout look like for a successful run?
(-: Looking for fastq files...fastq files exist
(-: Aligning files matching /scratch/negishi/aseethar/juicedir/fastq/*_R*.fastq*
in queue testpbs to genome /depot/itap/datasets/juicer/2.0.1/references/Homo_sapiens_assembly19.fasta with no fragment delimited maps.
(-: Created /scratch/negishi/aseethar/juicedir/splits and /scratch/negishi/aseethar/juicedir/aligned.
(-: Starting job to launch other jobs once splitting is complete
(-: Finished adding all jobs... Now is a good time to get that cup of coffee... Last job id 23947729

You can check the status of the submitted jobs with the squeue command:

squeue --me

It looks something like this:

JOBID     NAME                  ST  USER      ACCOUNT  NODES  CPUS  TIME_LIMIT  TIME_LEFT  NODELIST(REASON)
23947729  a1747060292_prep_don  PD  aseethar  testpbs  1      1     20:00:00    20:00:00   (Dependency)
23947728  a1747060292_arrowhea  PD  aseethar  testpbs  1      1     2:00:00     2:00:00    (Dependency)
23947727  a1747060292_hiccups_  PD  aseethar  testpbs  1      1     2:00:00     2:00:00    (Dependency)
23947726  a1747060292_hic30     PD  aseethar  testpbs  1      1     8:00:00     8:00:00    (Dependency)
23947725  a1747060292_hic       PD  aseethar  testpbs  1      1     8:00:00     8:00:00    (Dependency)
23947724  a1747060292_stats30   PD  aseethar  testpbs  1      1     8:00:00     8:00:00    (Dependency)
23947723  a1747060292_stats     PD  aseethar  testpbs  1      1     8:00:00     8:00:00    (Dependency)
23947722  a1747060292_bamrm     PD  aseethar  testpbs  1      8     2:00:00     2:00:00    (Dependency)
23947721  a1747060292_prestats  PD  aseethar  testpbs  1      8     2:00:00     2:00:00    (Dependency)
23947720  a1747060292_merged30  PD  aseethar  testpbs  1      8     2:00:00     2:00:00    (Dependency)
23947719  a1747060292_merged1   PD  aseethar  testpbs  1      8     2:00:00     2:00:00    (Dependency)
23947718  a1747060292_dupcheck  PD  aseethar  testpbs  1      1     2:00:00     2:00:00    (Dependency)
23947717  a1747060292_post_ded  PD  aseethar  testpbs  1      1     1:40:00     1:40:00    (Dependency)
23947716  a1747060292_dedup     PD  aseethar  testpbs  1      1     8:00:00     8:00:00    (Dependency)
23947714  a1747060292_fragmerg  PD  aseethar  testpbs  1      8     8:00:00     8:00:00    (Dependency)
23947713  a1747060292_check     PD  aseethar  testpbs  1      1     2:00:00     2:00:00    (Dependency)
23947712  a1747060292_mergesor  PD  aseethar  testpbs  1      8     8:00:00     8:00:00    (Dependency)
23947711  a1747060292_merge_HI  PD  aseethar  testpbs  1      1     8:00:00     8:00:00    (Dependency)
23947710  a1747060292_merge_HI  PD  aseethar  testpbs  1      1     8:00:00     8:00:00    (Dependency)
23947709  a1747060292_align1_H  PD  aseethar  testpbs  1      8     2:00:00     2:00:00    (Pending)
23947708  a1747060292_HIC003_S  PD  aseethar  testpbs  1      1     2:00:00     2:00:00    (Pending)
23947707  a1747060292_cmd       PD  aseethar  testpbs  1      1     2:00        2:00       (Pending)
23947715  a1747060292_dedup_gu  PD  aseethar  testpbs  1      1     10:00       10:00      (JobHeldUser)
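With this many jobs, a per-state count is often easier to read than the full listing. The sketch below tallies the ST column; the function name count_job_states is hypothetical, and it assumes the column layout shown above (one header line, job state in the third field).

```shell
# Count jobs per state from squeue output read on standard input.
# Assumes a header line followed by rows whose third field is the job state.
count_job_states() {
    awk 'NR > 1 && NF >= 3 { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}

# On the cluster:
#   squeue --me | count_job_states
```

If the local squeue layout differs, `squeue --me --noheader --format=%T | sort | uniq -c` gives a similar summary without depending on column positions.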

The full list of arguments for juicer.sh is:

  • -g genomeID: Genome ID (e.g., hg19, mm10) defined internally or via -z
  • -d topDir: Top-level working directory. Must contain fastq/; creates splits/, aligned/
  • -z reference-genome: Path to genome FASTA file; BWA index files must be in the same directory
  • -p chrom.sizes: Path to chrom.sizes file (can also use genome name like hg38)
  • -y restriction-site-file: File with positions of restriction sites (e.g., from generate_site_positions.py)
  • -D juicerDir: Path to Juicer scripts directory (default: /depot/itap/datasets/juicer/2.0.1)

  • -q queue: SLURM queue for alignment jobs (default: standby)
  • -l long queue: SLURM queue for long jobs such as .hic creation (default: standby)
  • -Q queue time: Time limit for short jobs (e.g., -Q 4:00:00 for 4 hours)
  • -L long queue time: Time limit for long jobs (e.g., -L 168:00:00 for one week)
  • -A account: SLURM account name for job submission

  • -s site: Restriction enzyme (e.g., MboI, HindIII)
  • -a about: Free-text experiment description (enclosed in single quotes)
  • -i sample: Sample name, added to SM: in read group
  • -k library: Library name, added to LB: in read group
  • -b ligation: Ligation junction sequence (used in counting)

  • -t threads: Number of threads for BWA alignment
  • -T threadsHic: Number of threads for .hic file creation
  • -C chunk size: Number of lines per split file (default: 90,000,000; must be a multiple of 4)
  • -w wobble: Wobble distance for deduplication (default: 4)

  • -S stage: Start from a given stage: chimeric, merge, dedup, afterdedup, final, postproc, early

  • -j: Use only exact duplicates during deduplication (disables wobble)
  • -e: Exit early before .hic file creation
  • -f: Include fragment-delimited maps in .hic output
  • -u: Use single-end mode for alignment
  • -m: Process methylation + Hi-C library
  • --assembly: Early exit after deduplication (for 3D-DNA input)
  • --cleanup: Remove intermediate files if pipeline completes
  • --qc_apa: Run APA-based QC
  • --qc: Downsample to 1 kb, skip annotation
  • --in-situ: Limit to 1 kb map resolution (no annotation)
  • -h, --help: Display usage help and exit
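The -C chunk size is given in lines, not reads. A standard FASTQ record occupies 4 lines, which is presumably why the value must be a multiple of 4. The sketch below converts between the two units; the function names are hypothetical helpers, not part of Juicer.

```shell
# Convert a number of reads per split file into a -C value (lines).
reads_to_chunk_lines() {
    echo $(( $1 * 4 ))
}

# Convert a -C value (lines) into reads per split file.
# Rejects values that are not a multiple of 4.
chunk_lines_to_reads() {
    if [ $(( $1 % 4 )) -ne 0 ]; then
        echo "chunk size $1 is not a multiple of 4" >&2
        return 1
    fi
    echo $(( $1 / 4 ))
}

chunk_lines_to_reads 90000000    # default -C value; prints 22500000
```

So the default splits the input into files of 22.5 million reads each; a smaller value produces more, shorter alignment jobs.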

Upon completion, the main output file is:

inter_30.hic

You can visualize this file using Juicebox.

The other output files in aligned/ include:

  • aligned/
    • header
    • inter.hic
    • inter.txt
    • inter_30.hic
    • inter_30.txt
    • inter_30_contact_domains
    • inter_30_hists.m
    • inter_30_loops/
      • fdr_thresholds_10000
      • fdr_thresholds_25000
      • fdr_thresholds_5000
    • inter_hists.m
    • merged1.txt
    • merged30.txt
    • merged_dedup.bam
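Once all jobs have left the queue, a quick check that the contact maps exist and are non-empty can catch a silently failed run. This is a minimal sketch; the function name check_hic_outputs is hypothetical, and it only tests for the two .hic files listed above.

```shell
# Check that the main contact maps exist and are non-empty in aligned/.
# Prints one status line per file; returns non-zero if any is missing or empty.
check_hic_outputs() {
    aligned=$1
    failed=0
    for f in inter.hic inter_30.hic; do
        if [ -s "$aligned/$f" ]; then
            echo "ok: $f"
        else
            echo "missing or empty: $f"
            failed=1
        fi
    done
    return "$failed"
}

# Example:
#   check_hic_outputs "$RCAC_SCRATCH/juicer_run/aligned"
```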