sparklyr

R interface for Apache Spark

  • Connect to Spark from R.
  • Complete dplyr backend.
  • Filter and aggregate Spark datasets, then bring them into R for analysis and visualization (see the sketch after this list).
  • Use Spark’s distributed machine learning library from R.
  • Work with data stored on HDFS.
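
A minimal sketch of this workflow, assuming an open connection sc (created as shown in the section below) and the nycflights13 package, which is used here only for its example dataset:

    library(sparklyr)
    library(dplyr)

    # Copy a local data frame to Spark (nycflights13::flights is an example dataset)
    flights_tbl <- copy_to(sc, nycflights13::flights, "flights")

    # Filter and aggregate on the cluster, then collect() the small result into R
    delays <- flights_tbl %>%
      filter(!is.na(dep_delay)) %>%
      group_by(carrier) %>%
      summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
      collect()

    # Fit a model with Spark's distributed machine learning library
    fit <- ml_linear_regression(flights_tbl, dep_delay ~ distance)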

Getting started

  • Activate your VPN and connect to hadoop3.cesga.es.
  • sparklyr is available as a module based on an Anaconda2 distribution. You can load it by typing:
    module load sparklyr
  • The module includes Python 2.7, R 3.1.5, sparklyr 1.0.5, and all their dependencies.
  • Start R by typing R

Connecting R to a Spark session

  • Load the package
     > library(sparklyr)
  • Define a Spark connection
     > sc <- spark_connect(master = "yarn-client", spark_home = Sys.getenv('SPARK_HOME'))
  • Use the sc connection to interact with Spark from R (a complete session sketch follows this list)
  • Close the connection:
     > spark_disconnect(sc)
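
Putting these steps together, a minimal end-to-end session might look like the following (mtcars is a small built-in dataset, used here only for illustration):

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "yarn-client", spark_home = Sys.getenv("SPARK_HOME"))

    # Copy a small local dataset to the cluster and query it with dplyr
    cars_tbl <- copy_to(sc, mtcars, "mtcars")
    cars_tbl %>%
      filter(cyl == 6) %>%
      summarise(avg_mpg = mean(mpg, na.rm = TRUE))

    spark_disconnect(sc)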

More options

  • Use Jupyter Notebooks
     start_jupyter
  • Launch R scripts in batch mode (a sketch of such a script follows the command below)
    spark-submit --class sparklyr.Shell '/opt/cesga/anaconda/Anaconda2-2018.12-sparklyr/lib/R/library/sparklyr/java/sparklyr-2.4-2.11.jar' 8880 1234 --batch example_sparklyr_script.R
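
The contents of example_sparklyr_script.R are not shown here; as a rough sketch, a batch script of this kind could look like the following (the dataset and output file name are illustrative only):

    # example_sparklyr_script.R -- hypothetical contents, for illustration only
    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "yarn-client", spark_home = Sys.getenv("SPARK_HOME"))

    result <- copy_to(sc, mtcars, "mtcars") %>%
      group_by(cyl) %>%
      summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
      collect()

    # In batch mode there is no interactive console, so persist the results
    write.csv(result, "result.csv", row.names = FALSE)

    spark_disconnect(sc)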

Remember to disconnect sessions and properly shut down Jupyter Notebook servers.

More info