GATK4 Beta On BioHPC

Updated 2017-09-28 David Trudgian

Overview

GATK 4 is the latest version of the popular and powerful Genome Analysis Toolkit from the Broad Institute. It is currently available as a beta version, and is not considered ready for general use:

This project is in a beta development stage, which means it is not yet ready for
general use, documentation is absent or incomplete, and that some features and
syntax may still change substantially before general release. Be sure to read
about the known issues before test driving.

BioHPC has made the latest beta version (4 beta 5) available as a module for users who wish to test GATK4 and begin translating their workflows from previous versions. Compared to GATK3, the new version offers:

  • A combined location for the GATK tools as well as tools previously part of the separate Picard suite.
  • Improved parallelization using the Apache Spark framework.
  • A more liberal open-source license covering the use and distribution of the software.

Parallelization in GATK4

The major point of interest for most users is the improved parallelization of tools available in GATK4. In GATK3, many tools were slow to run on large datasets as they could not easily make effective use of the large number of CPU cores, and large amount of RAM, available on typical modern HPC systems. The tools were parallelized using different strategies, with parameters that needed tuning separately for each tool and the nature of the input data. In many cases best performance could only be achieved by manually splitting datasets, and running the tools on separate slices. This is complex to implement, and the output must be combined as a separate task.

Tools in GATK4 which can benefit from parallel processing have been rewritten to use Apache Spark. Spark is a framework for distributed in-memory computing on large datasets, which runs on clusters and in the cloud. Spark provides ways to load a large dataset so that it is split efficiently between multiple machines and CPUs, with each carrying out a portion of the work.

Although Spark is a powerful tool, it can be complex to use in an HPC environment such as BioHPC's Nucleus cluster. We encourage GATK4 users to begin with simple single-node execution, and to speak to BioHPC before attempting to run multi-node jobs.

Using GATK4 Beta

GATK4 Module

The current beta 5 version of GATK4 is installed as a module on the Nucleus cluster. You can access it in a terminal session with the command:

module add gatk/4.beta.5

Note that gatk/3.7 remains the default version, and is what module add gatk will load. You must explicitly request the GATK4 beta in your module add command.
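As a quick check after loading the module, you can confirm which versions are active in your session:

# Load the GATK4 beta (a plain "module add gatk" loads the default 3.7)
module add gatk/4.beta.5
# Show the modules currently loaded in this session
module list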

Running GATK4 non-Spark tools

GATK tools are run via the gatk-launch command. To see a list of tools you can run gatk-launch --list.

GATK4 tools are classed as either non-Spark or Spark tools. Spark tools have names that end in Spark.
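A quick way to see just the Spark variants is to filter the tool list; the redirect is included because gatk-launch may print the list to stderr rather than stdout:

# Show only the Spark-enabled tools
gatk-launch --list 2>&1 | grep Spark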

Many tools are relatively simple, and do not need heavy parallel processing. These will not have a Spark variant. You can run these tools in a straightforward manner using the gatk-launch command. E.g. to run the PrintReads tool, use a command line as below:

gatk-launch PrintReads -I sorted1.bam -O output.bam
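If you want to run a non-Spark tool as a batch job rather than in an interactive terminal session, a minimal SLURM script might look like the sketch below. The partition and time limit are placeholders to adjust for your own work; the structure mirrors the larger multi-node script later in this page.

#!/bin/bash
#SBATCH --job-name=gatk4_printreads
#SBATCH -N 1                 # non-Spark tools run on a single node
#SBATCH --partition=super    # placeholder - choose a suitable partition
#SBATCH -t 12:00:00          # placeholder time limit

# Load the GATK4 beta module
module add gatk/4.beta.5

# Run a simple non-Spark tool on data in /project or /work
gatk-launch PrintReads -I sorted1.bam -O output.bam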

Running GATK4 Spark Tools - Single Node (LOCAL)

Some GATK4 tools have both non-Spark and Spark variants. Other tools are only available in Spark form. If a tool has both a Spark and a non-Spark version, it is likely that the Spark version will perform better on large datasets. However, you might want to try both versions to make sure.

As an example, the PrintReads tool we ran above has a parallel Spark version called PrintReadsSpark. The non-Spark version is suitable for smaller datasets, but if you are using a large dataset, and will be using other Spark tools in your pipeline you should prefer PrintReadsSpark.

Running a Spark tool on a single BioHPC compute node, or on a workstation/thin-client, is straightforward. The newer betas of GATK4 will set up a local Spark environment to run the Spark tool on a single machine without any extra work. Just call the Spark tool in the same way as the non-Spark version:

gatk-launch PrintReadsSpark -I sorted1.bam -O output.bam

While the tool runs you will see various log messages related to the local Spark cluster that the gatk-launch script creates for the tool to run on. You will also find files or folders named .parts in the working directory. These contain the partitioned input/working data that Spark uses to divide the task.
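By default the local Spark run will use the cores available on the machine. If you need to restrict it, for example on a shared workstation, you may be able to pass a standard Spark local[N] master URL through the arguments after the -- separator. This is a sketch only, assuming the --sparkRunner and --sparkMaster options used in the multi-node example below also accept a local[N] form; check the tool's --help output for your version:

# Hypothetical: limit the local Spark run to 8 worker threads
gatk-launch PrintReadsSpark -I sorted1.bam -O output.bam \
    -- --sparkRunner LOCAL --sparkMaster 'local[8]'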

Note: Many GATK4 Spark tools are incomplete, and/or not verified to produce identical output to their non-Spark counterparts. E.g. PrintReadsSpark cannot handle multiple input files, while PrintReads can. The shortcomings of the Spark tools are being addressed by the GATK developers as they move towards a release version of GATK4. When using the beta version please check carefully that the output is as expected, and validate against the non-Spark tools or GATK3 as appropriate.
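One simple way to validate a Spark tool against its non-Spark counterpart is to run both on the same input and compare summary statistics of the outputs. This is a sketch only; it assumes a samtools module is available on the system you are using:

# Run both variants on the same input
gatk-launch PrintReads -I sorted1.bam -O out_nonspark.bam
gatk-launch PrintReadsSpark -I sorted1.bam -O out_spark.bam

# Compare basic statistics of the two outputs
module add samtools
samtools flagstat out_nonspark.bam
samtools flagstat out_spark.bam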

Running GATK4 Spark Tools - Multiple Nodes (SPARK CLUSTER)

A big draw of Spark as a computing framework is that it makes it possible to run tools across multiple machines. Most of the examples on the GATK4 website refer to running tools on Google Cloud. However, Spark can also be used on an HPC cluster, like the BioHPC Nucleus cluster.

Important: You should not currently use multi-node Spark runs for any sensitive data, due to insecure web control panels. See limitations section below.

Understanding Spark on HPC Clusters

Running Spark tools across multiple nodes is, unfortunately, much more complex than using them on a single machine. Spark is most often used on a dedicated Spark/Hadoop cluster, where many compute nodes are dedicated solely to running Spark and Hadoop jobs. On a general-purpose HPC cluster like Nucleus, we cannot dedicate resources to Spark & Hadoop, which are not commonly used by our community. To do so would increase costs for users and leave hardware idle.

Spark is a clustered system, where every machine that will process jobs needs to run the Spark daemon, configured so the machines can talk to each other to form a cluster. In addition, when working across multiple machines, the GATK4 tools need to work with data stored in HDFS (the Hadoop Distributed File System), rather than directly on /project or /work. HDFS distributes files across the nodes in a way that supports the distributed computing that Spark performs. HDFS is implemented by Hadoop daemons, which also need to run on all nodes.
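For orientation, the basic HDFS file operations used later in the example job script look like the commands below. The file names here are placeholders, and $HADOOP_HOME is set by the myHadoop/mySpark environment described in the following sections:

# Create a directory in HDFS and copy a file in from /project or /work
$HADOOP_HOME/bin/hdfs dfs -mkdir -p /user/input
$HADOOP_HOME/bin/hdfs dfs -put mydata.bam /user/input/
# List files stored in HDFS
$HADOOP_HOME/bin/hdfs dfs -ls /user/input/
# Copy a result file back out of HDFS into the current directory
$HADOOP_HOME/bin/hdfs dfs -get /user/output/result.vcf .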

On a single machine, the gatk-launch script can set up a simple single-machine Spark cluster for us, run the tool, and then shut down the Spark cluster. It cannot do this across multiple machines, since gatk-launch cannot anticipate the different ways that HPC clusters can be configured.

Running a Spark cluster on Nucleus

Running a GATK4 Spark tool on Nucleus means that we have to create a Spark & HDFS cluster across multiple nodes. This must be done under the control of the SLURM scheduler that BioHPC uses, and we also need to shut down the cluster and tidy up when we are finished.

BioHPC uses myHadoop/mySpark to run Hadoop/Spark clusters under SLURM. myHadoop/mySpark is a set of scripts by Glenn Lockwood (formerly of the San Diego Supercomputer Center, now at NERSC) that simplify starting, stopping, and cleaning up a Hadoop/Spark cluster.

mySpark GATK4 Procedure

The general procedure for running GATK4 Spark tools across multiple nodes is as follows:

Write and submit a SLURM sbatch job script which:

  • Allocates multiple compute nodes
  • Uses mySpark to setup a Spark/HDFS cluster configuration
  • Starts the HDFS distributed storage daemons on each allocated node
  • Starts the Spark compute daemons on each allocated node
  • Puts any input data into the HDFS distributed filesystem, from /project or /work
  • Calls the GATK4 spark tools
  • Retrieves any output data from the HDFS distributed filesystem back to /project or /work
  • Shuts down the Spark and HDFS daemons
  • Cleans up temporary files

Example - Running HaplotypeCallerSpark on 4 nodes

The batch script below, when submitted with sbatch, performs the steps above to run the GATK4 HaplotypeCallerSpark tool across 4 compute nodes on the Nucleus cluster.

You can use this script as a template for running multi-node Spark jobs. We recommend, however, that you discuss your work with BioHPC before attempting this. It can be difficult to troubleshoot multi-node Spark jobs, and we will be able to provide guidance. At minimum you will need to:

  • Edit the hdfs commands that place input onto the HDFS filesystem, and retrieve output.
  • Edit the gatk-launch command (adding additional launches for other tools) as needed.

When writing a GATK4 Spark pipeline:

  • Try to perform all steps in a single job, so that you do not have the overhead of starting and stopping Spark between steps.
  • Try to use only Spark tools so you can keep the data in HDFS until you retrieve the final output to the cluster filesystems. Non-Spark tools cannot see the data in the HDFS filesystem.

#!/bin/bash
################################################################################
#  slurm.sbatch - A sample submit script for SLURM that illustrates how to
#    spin up a Hadoop cluster for a map/reduce task using myHadoop
#
#  Glenn K. Lockwood, San Diego Supercomputer Center             February 2014
#
#  D C Trudgian - UTSW BioHPC - Modifications for Nucleus
################################################################################

# How many nodes to use for our Hadoop/SPARK jobs?
#SBATCH -N 4

# Which partition to use?
# It's OK to use super here. Spark can cope with nodes that have different
# numbers of CPUs, and share the work between them fairly
#SBATCH --partition=super

# With a 1 day time limit
#SBATCH -t 24:00:00

# We must increase the maximum number of processes available to the job
# Hadoop and spark will start a lot of procs/threads, and the default limit
# on BioHPC systems is too low.
ulimit -u 65536

# We use the myhadoop/my-spark configuration tools
module add myhadoop/0.30-spark-2.2.0

# We need the GATK module
module add gatk/4.beta.5

# Set a job-specific directory for our hadoop/spark cluster
# configuration.
export HADOOP_CONF_DIR="${PWD}/hadoop-conf.${SLURM_JOBID}"

# Important info to start Hadoop & Spark
# This is a regular expression that converts Nucleus hostnames to
# infiniband interface addresses
IB_REGEXP='s/Nucleus0*/10\.10\.10\./'
# This is the hostname of our master node
MASTER_HOST=${HOSTNAME}
# This is the Infiniband IP of our master node
MASTER_IB_IP=$( echo $HOSTNAME | sed -e "${IB_REGEXP}" )

# Setup the hadoop & spark configuration, mapping hostnames to IB addresses
# for fast communication between nodes
myhadoop-configure.sh -s /tmp/$USER/$SLURM_JOBID -i "${IB_REGEXP}"
# Now load the configuration environment that was created
source $HADOOP_CONF_DIR/spark/spark-env.sh

# Hadoop Startup
# Start Namenode at master (where this script runs)
srun -n1 -N1 --nodelist=${MASTER_HOST} $HADOOP_HOME/bin/hadoop namenode &
# Start Datanode at each node
for node in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
    srun -n1 -N1 --nodelist=${node} $HADOOP_HOME/bin/hadoop datanode &
done
# Start Yarn resourcemanager at master
srun -n1 -N1 --nodelist=${MASTER_HOST} $HADOOP_HOME/bin/yarn resourcemanager &
# Start Yarn nodemanager at each node
for node in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
    srun -n1 -N1 --nodelist=${node} $HADOOP_HOME/bin/yarn nodemanager &
done
# Start Spark master at master
export SPARK_LOCAL_IP=${MASTER_IB_IP}
srun -n1 -N1 --nodelist=${MASTER_HOST} $SPARK_HOME/bin/spark-class org.apache.spark.deploy.master.Master &
# Start Spark workers at nodes
for node in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
  SLAVE_IB_IP=$( echo $node | sed -e "${IB_REGEXP}" )
  srun -n1 -N1 --nodelist=${node} $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker -i ${SLAVE_IB_IP} spark://${MASTER_IB_IP}:7077 &
done

echo "-- WAIT FOR STARTUP --"
sleep 30


# Put the input files into the HDFS so that Spark tools can access them
$HADOOP_HOME/bin/hdfs dfs -mkdir -p /user/input
$HADOOP_HOME/bin/hdfs dfs -mkdir -p /user/output
$HADOOP_HOME/bin/hdfs dfs -put assembly_selfref_v2.2bit /user/input/
$HADOOP_HOME/bin/hdfs dfs -put realigned_reads_step2.bam /user/input/
$HADOOP_HOME/bin/hdfs dfs -ls /user/input/

# Run GATK. Tool arguments come first; everything after the '--'
# separator is passed through to Spark (cluster master and executor resources)
gatk-launch HaplotypeCallerSpark \
    -R hdfs:///user/input/assembly_selfref_v2.2bit \
    --input hdfs:///user/input/realigned_reads_step2.bam \
    --output hdfs:///user/output/assembly_V1.1_withMito_snp_step2.vcf \
    --heterozygosity 0.03 \
    --indel_heterozygosity 0.0025 \
    --output_mode EMIT_ALL_SITES \
    -- --sparkRunner SPARK --sparkMaster spark://$SPARK_MASTER_IP:7077 \
    --verbose --executor-cores=16 --executor-memory=64G

$HADOOP_HOME/bin/hdfs dfs -ls /user/output/
$HADOOP_HOME/bin/hdfs dfs -get /user/output/assembly_V1.1_withMito_snp_step2.vcf

# Run the cleanup script to remove log and temporary mess from the compute nodes
myhadoop-cleanup.sh

Monitoring Jobs

The HDFS and Spark daemons start web-based control panels where you can monitor the status of your jobs. To access these you need to find the name of the first node allocated to your SLURM job, using the squeue command, and then convert that node name into an IP address.

If the first node is Nucleus064 then the IP address is 192.168.54.64: the address always begins 192.168.54., followed by the node number with any leading zeros removed.
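A hypothetical helper for doing this from a terminal is sketched below; replace 12345 with your own job ID. The sed expression mirrors the one used for the Infiniband addresses in the job script above:

# Find the first node allocated to job 12345 and convert its name to
# the 192.168.54.x address used for the web control panels
NODELIST=$(squeue -j 12345 -h -o %N)
FIRST_NODE=$(scontrol show hostnames "$NODELIST" | head -n 1)
echo "$FIRST_NODE" | sed -e 's/Nucleus0*/192.168.54./'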

You can then access the control panels in a web browser at that IP address. By default the Spark master UI listens on port 8080 and the HDFS NameNode UI on port 50070, though the myHadoop/mySpark configuration may adjust these; check the log output of your job for the exact addresses.

Limitations

This solution for running GATK4 Spark tools across multiple compute nodes has some limitations:

  • Insecure web control panels. The Hadoop and Spark web control panels related to your cluster job are open to anyone who knows the IP addresses of the Nucleus cluster nodes running your job. These control panels allow jobs to be stopped and data files to be accessed.

  • HDFS disk space is limited. The HDFS filesystem will be hosted on the local disks of each node. BioHPC is currently considering options to provide a high performance filesystem supporting HDFS, if necessary.
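If you want to see how much HDFS capacity the local disks of your allocated nodes provide, you can query the running cluster from inside the job script, once the daemons have started. A minimal sketch using the standard dfsadmin report:

# Report HDFS capacity and usage across this job's datanodes
$HADOOP_HOME/bin/hdfs dfsadmin -report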