gatk is available as a module on hadoop3.cesga.es
module load gatk
For more information on using modules, see the modules tutorial.
gatk reads and writes its data on hdfs when you launch a SPARK tool. Check the hdfs tutorial for instructions.
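As a rough sketch of staging input files before a SPARK run, the snippet below copies two local files to HDFS and prints the URI to pass to gatk. It assumes your HDFS home directory is hdfs://nameservice1/user/<username>/, as in the examples on this page. `hdfs_uri` is a hypothetical helper and the file names are placeholders.

```shell
# Assumption: the HDFS home directory follows the layout used in the examples
# on this page (hdfs://nameservice1/user/<username>/).
HDFS_HOME="hdfs://nameservice1/user/${USER:-username}"

# Hypothetical helper: map a local file name to its URI in the HDFS home directory.
hdfs_uri() {
    echo "$HDFS_HOME/$(basename "$1")"
}

# Copy the inputs only when the hdfs client is available (i.e. on the cluster).
# ref.fa and input.bam are placeholder file names.
if command -v hdfs >/dev/null 2>&1; then
    hdfs dfs -put -f ref.fa input.bam "$HDFS_HOME/"
    hdfs dfs -ls "$HDFS_HOME/"
fi

# Print the URI to use for the -R argument of a SPARK tool.
hdfs_uri ref.fa
```

The printed URIs can then be used for the -R, -I and -O arguments of the SPARK tools.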
gatk supports SPARK only for some tools. The names of SPARK tools always end with Spark. Check the tool list here.
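The naming rule can be checked in a script before deciding how to launch a tool. This is a minimal sketch: `is_spark_tool` is a hypothetical helper that only looks at the name suffix and does not query gatk itself.

```shell
# Sketch: SPARK-enabled tools are recognised by the "Spark" suffix of their name.
is_spark_tool() {
    case "$1" in
        *Spark) return 0 ;;
        *)      return 1 ;;
    esac
}

for tool in HaplotypeCaller HaplotypeCallerSpark CalcMetadataSpark; do
    if is_spark_tool "$tool"; then
        echo "$tool: SPARK tool, can be launched with --spark-runner SPARK"
    else
        echo "$tool: standard tool, no SPARK runner"
    fi
done
```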
gatk ToolName toolArguments -- --spark-runner SPARK --spark-master yarn additionalSparkArguments
additionalSparkArguments can be used if a gatk job needs access to more resources. See the SPARK tutorial for details.
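To illustrate where the additional arguments go, the sketch below assembles a command line with the standard spark-submit resource options --num-executors, --executor-cores, --executor-memory and --driver-memory. The values are illustrative assumptions, not site recommendations, and `gatk_spark_cmd` is a hypothetical helper that only prints the command so it can be reviewed before running.

```shell
# Illustrative resource values (assumptions); adjust them to the job and the cluster limits.
NUM_EXECUTORS=4
EXECUTOR_CORES=2
EXECUTOR_MEMORY=8g
DRIVER_MEMORY=4g

# Hypothetical helper: print a gatk command with the SPARK options placed
# after the "--" separator, following the syntax shown above.
gatk_spark_cmd() {
    echo gatk "$@" -- --spark-runner SPARK --spark-master yarn \
        --num-executors "$NUM_EXECUTORS" \
        --executor-cores "$EXECUTOR_CORES" \
        --executor-memory "$EXECUTOR_MEMORY" \
        --driver-memory "$DRIVER_MEMORY"
}

# Print the full command for a HaplotypeCallerSpark run.
gatk_spark_cmd HaplotypeCallerSpark \
    -R hdfs://nameservice1/user/username/ref.fa \
    -I hdfs://nameservice1/user/username/input.bam \
    -O hdfs://nameservice1/user/username/output.vcf
```

Once the printed command looks right, it can be pasted into the shell and run as is.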
gatk HaplotypeCallerSpark -L 1:1000000-2000000 -R hdfs://nameservice1/user/username/ref.fa -I hdfs://nameservice1/user/username/input.bam -O hdfs://nameservice1/user/username/output.vcf -- --spark-runner SPARK --spark-master yarn --driver-memory 4g --executor-memory 4g
gatk CalcMetadataSpark -I hdfs://nameservice1/user/username/input.bam -O hdfs://nameservice1/user/username/statistics.txt -- --spark-runner SPARK --spark-master yarn
You can use standard (non-SPARK) gatk tools on hadoop3.cesga.es for testing, but it is not recommended.
For this use case, gatk is also available on finisterrae, where standard tools run faster.