sparklyr

R interface for Apache Spark

  • Connect to Spark from R.
  • Complete dplyr backend.
  • Filter and aggregate Spark datasets, then bring them into R for analysis and visualization (see the sketch after this list).
  • Use Spark’s distributed machine learning library from R.
  • Work with data stored on HDFS.
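
A minimal sketch of this workflow, assuming an open connection sc (created as shown in the section below) and the nycflights13 package, which is used here only for its example dataset:

    library(sparklyr)
    library(dplyr)

    # Copy a local data frame to Spark (nycflights13::flights is an example dataset)
    flights_tbl <- copy_to(sc, nycflights13::flights, "flights")

    # Filter and aggregate on the cluster, then collect() the small result into R
    delays <- flights_tbl %>%
      filter(!is.na(dep_delay)) %>%
      group_by(carrier) %>%
      summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
      collect()

    # Fit a model with Spark's distributed machine learning library
    fit <- ml_linear_regression(flights_tbl, dep_delay ~ distance)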

Getting started

  • Activate your VPN and connect to hadoop3.cesga.es.
  • sparklyr is available as a module based on an Anaconda2 distribution. You can load it by typing:
    module load sparklyr
  • The module includes Python 2.7, R 3.1.5, sparklyr 1.0.5, and all their dependencies.
  • Start R by typing R

Connecting R to a Spark session

  • Load the package
     > library(sparklyr)
  • Define a Spark connection
     > sc <- spark_connect(master = "yarn-client", spark_home = Sys.getenv('SPARK_HOME'))
  • Use the sc connection to interact with Spark from R (a complete session sketch follows this list)
  • Close the connection:
     > spark_disconnect(sc)
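
Putting these steps together, a minimal end-to-end session might look like the following (mtcars is a small built-in dataset, used here only for illustration):

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "yarn-client", spark_home = Sys.getenv("SPARK_HOME"))

    # Copy a small local dataset to the cluster and query it with dplyr
    cars_tbl <- copy_to(sc, mtcars, "mtcars")
    cars_tbl %>%
      filter(cyl == 6) %>%
      summarise(avg_mpg = mean(mpg, na.rm = TRUE))

    spark_disconnect(sc)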

More options

  • Use Jupyter Notebooks
     start_jupyter
  • Launch R scripts in batch mode (a sketch of such a script follows the command below)
    spark-submit --class sparklyr.Shell '/opt/cesga/anaconda/Anaconda2-2018.12-sparklyr/lib/R/library/sparklyr/java/sparklyr-2.4-2.11.jar' 8880 1234 --batch example_sparklyr_script.R
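
The contents of example_sparklyr_script.R are not shown here; as a rough sketch, a batch script of this kind could look like the following (the dataset and output file name are illustrative only):

    # example_sparklyr_script.R -- hypothetical contents, for illustration only
    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "yarn-client", spark_home = Sys.getenv("SPARK_HOME"))

    result <- copy_to(sc, mtcars, "mtcars") %>%
      group_by(cyl) %>%
      summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
      collect()

    # In batch mode there is no interactive console, so persist the results
    write.csv(result, "result.csv", row.names = FALSE)

    spark_disconnect(sc)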

Remember to disconnect sessions and properly shut down Jupyter Notebook servers.

More info