Sparklyr

Sparklyr is an R package that provides an interface from R to Apache Spark. It was created in 2016 by the RStudio team and fits into the tidyverse ecosystem, providing a complete dplyr backend for Spark.

It makes Spark's APIs accessible from R, including Spark DataFrames and the MLlib machine learning library.

To use sparklyr on the platform you will need to load the sparklyr module (see Modules: Additional Software):

module load sparklyr

This module includes an Anaconda installation with Python 2.7, R 3.1.5, sparklyr 1.0.5, and all of their dependencies, so to use it you only need to start R and load the package:

R
library(sparklyr)

Note

You can check the list of preinstalled packages by running installed.packages() in the R console.

After that you will need to connect to the Spark cluster; this is done using the spark_connect() function:

sc <- spark_connect(master = "yarn-client", spark_home = Sys.getenv('SPARK_HOME'))
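If you need to tune the resources of your Spark session, spark_connect() also accepts a configuration object created with spark_config(). The following is only an illustrative sketch; the resource values shown are assumptions and should be adjusted to your actual allocation:

# Illustrative configuration sketch; values are examples, not platform defaults
conf <- spark_config()
conf$spark.executor.memory <- "4g"     # memory per executor (example value)
conf$spark.executor.instances <- 4     # number of executors (example value)

sc <- spark_connect(master = "yarn-client",
                    spark_home = Sys.getenv("SPARK_HOME"),
                    config = conf)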

You can then use your Spark connection sc to access any Spark functionality.
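For example, a minimal dplyr-style query run through the connection (the built-in mtcars dataset is used purely as an illustration):

library(dplyr)

# Copy a small local data frame to Spark and query it with dplyr verbs
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()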

Finally, execute:

spark_disconnect(sc)

to disconnect from Spark.

You can also use Sparklyr in a Jupyter Notebook with an R kernel.

Warning

Remember to disconnect from Spark and properly shut down the notebook server before logging out.

Sparklyr is not limited to interactive use; you can also use spark-submit to launch a script as a batch job:

spark-submit --class sparklyr.Shell '/opt/cesga/anaconda/Anaconda2-2018.12-sparklyr/lib/R/library/sparklyr/java/sparklyr-2.4-2.11.jar' 8880 1234 --batch example_sparklyr_script.R
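The contents of example_sparklyr_script.R are up to you; as a hypothetical sketch, such a script can simply repeat the interactive steps shown above and print its results:

# example_sparklyr_script.R (illustrative content, adapt to your own analysis)
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn-client", spark_home = Sys.getenv("SPARK_HOME"))

# Copy a sample dataset to Spark, aggregate it, and print the result
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)
result <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_hp = mean(hp, na.rm = TRUE)) %>%
  collect()
print(result)

spark_disconnect(sc)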

For further information on Sparklyr you can check the Sparklyr getting started tutorial and take a look at the Sparklyr workshop. There is also the official documentation by the RStudio team, including a handy cheatsheet.