GATK

Genome Analysis Toolkit

How to use

gatk is available as a module on hadoop3.cesga.es

module load gatk

For more information on using modules, see the modules tutorial.

Using SPARK tools

gatk uses HDFS when you launch a SPARK tool, so input and output paths must point to HDFS. Check the HDFS tutorial for instructions.

gatk supports SPARK only for some tools. The names of SPARK tools always end with Spark. Check the tool list here.
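
Because the naming rule is purely mechanical, a wrapper script can check it before deciding whether to append the SPARK arguments. The following is a minimal sketch; the function name is_spark_tool is hypothetical and not part of gatk.

```shell
#!/bin/sh
# Hypothetical helper (not part of gatk): succeeds when the tool name
# ends with "Spark", following the naming rule described above.
is_spark_tool() {
    case "$1" in
        *Spark) return 0 ;;
        *)      return 1 ;;
    esac
}

# Classify two tool names by suffix only; gatk itself is never called.
for tool in HaplotypeCallerSpark HaplotypeCaller; do
    if is_spark_tool "$tool"; then
        echo "$tool: SPARK tool"
    else
        echo "$tool: standard tool"
    fi
done
```

The check only looks at the name; the tool list remains the authoritative source for which tools exist.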

gatk ToolName toolArguments -- --spark-runner SPARK --spark-master yarn additionalSparkArguments

Use additionalSparkArguments when a gatk job needs access to more resources (for example, more memory for the driver or the executors). See the SPARK tutorial for details.
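
As a sketch of how additional arguments might be supplied, the script below assembles a command line with explicit resource settings and prints it instead of running it. It assumes that the arguments after the -- separator are passed on to spark-submit, so that standard spark-submit options such as --num-executors and --executor-cores apply. The build_gatk_cmd helper is hypothetical, and the resource values are illustrative rather than recommended settings for this cluster.

```shell
#!/bin/sh
# Hypothetical helper (not part of gatk): prints a gatk SPARK command line
# without running it, so the arguments can be reviewed first.
# Usage: build_gatk_cmd TOOL "TOOL_ARGS" "EXTRA_SPARK_ARGS"
build_gatk_cmd() {
    tool=$1
    tool_args=$2
    spark_args=$3
    echo "gatk $tool $tool_args -- --spark-runner SPARK --spark-master yarn $spark_args"
}

# Illustrative resource values only; adjust them to the job and the cluster limits.
build_gatk_cmd CalcMetadataSpark \
    "-I hdfs://nameservice1/user/username/input.bam -O hdfs://nameservice1/user/username/statistics.txt" \
    "--num-executors 4 --executor-cores 2 --executor-memory 4g --driver-memory 4g"
```

Printing the command first makes it easy to confirm that the tool arguments come before the -- separator and the SPARK arguments after it.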

Examples

gatk HaplotypeCallerSpark -L 1:1000000-2000000 -R hdfs://nameservice1/user/username/ref.fa -I hdfs://nameservice1/user/username/input.bam -O hdfs://nameservice1/user/username/output.vcf -- --spark-runner SPARK --spark-master yarn --driver-memory 4g --executor-memory 4g
gatk CalcMetadataSpark -I hdfs://nameservice1/user/username/input.bam -O hdfs://nameservice1/user/username/statistics.txt -- --spark-runner SPARK --spark-master yarn

Non-SPARK tools

You can run standard (non-SPARK) gatk tools on hadoop3.cesga.es for testing, but this is not recommended.

For this use case, gatk is also available on FinisTerrae, where it runs faster.

More documentation