Providing quick access to ready-to-use Big Data solutions

Because Big Data doesn't have to be complicated

Javier Cacheiro / Cloudera Certified Developer for Spark & Hadoop / @javicacheiro

Big Data


3Vs of Big Data

Vertical Scaling

Horizontal Scaling

Data Centric

Compute centric: bring the data to the computation

Data centric: bring the computation to the data

MPI Shortcomings

Jonathan Dursi: HPC is dying, and MPI is killing it

  • Wrong level of abstraction for application writers
  • No fault-tolerance

MPI Boilerplate

Framework        Lines   Lines of Boilerplate
MPI + Python       52           20+
Spark + Python     28            2
Chapel             20            1



Hardware Infrastructure

  • 38 nodes: 4 masters + 34 slaves
  • Storage capacity 816TB
  • Aggregated I/O throughput 30GB/s
  • 64GB RAM per node
  • 10GbE connectivity between all nodes

Hardware Master Nodes

  • Model: Lenovo System x3550 M5
  • CPU: 2x Intel Xeon E5-2620 v3 @ 2.40GHz
  • Cores: 12 (2x6)
  • HyperThreading: On (24 threads)
  • Total memory: 64GB
  • Network: 1x10Gbps + 2x1Gbps
  • Disks: 8x 480GB SSD SATA 2.5" MLC G3HS
  • Controller: ServeRAID M5210 1GB Cache FastPath

Hardware Slave Nodes

  • Model: Lenovo System x3650 M5
  • CPU: 2x Intel Xeon E5-2620 v3 @ 2.40GHz
  • Cores: 12 (2x6)
  • HyperThreading: On (24 threads)
  • Total memory: 64GB
  • Network: 1x10Gbps + 2x1Gbps
  • Disks: 12x 2TB NL SATA 6Gbps 3.5" G2HS
  • Controller: N2215 SAS/SATA HBA



Platforms available

  • Hadoop Platform
  • PaaS Platform (beta)

Hadoop Platform

  • Ready to use Hadoop ecosystem
  • Covers most use cases
  • Production ready
  • Fully optimized for Big Data applications

PaaS Platform

  • When you need something outside the Hadoop ecosystem
  • Enables you to deploy custom Big Data clusters
  • Advanced resource planning based on Mesos
  • No virtualization overheads: based on Docker
  • Includes a catalog of ready-to-use products: e.g. Cassandra, MongoDB, PostgreSQL


Front Page

General info

Tools available



Hadoop Platform

Your ready-to-use Hadoop ecosystem


Core Components

  • HDFS: Parallel filesystem
  • YARN: Resource manager


The Hadoop Distributed Filesystem

HDFS Architecture

HDFS Replicas

Upload a file to HDFS

To upload a file from local disk to HDFS:

hdfs dfs -put file.txt file.txt

It will copy the file to /user/username/file.txt in HDFS.

List files

To list files in HDFS:

hdfs dfs -ls

Lists the files in your HDFS home directory, /user/username/

To list files in the root directory:

hdfs dfs -ls /

Working with directories

Create a directory:

hdfs dfs -mkdir /tmp/test

Delete a directory:

hdfs dfs -rm -r -f /tmp/test

Working with files

Read a file:

hdfs dfs -cat file.txt

Download a file from HDFS to local disk:

hdfs dfs -get file.txt

Web File Explorer

You can easily access the HUE File Explorer from the WebUI:


You can easily access the NameNode UI from the WebUI:


Yet Another Resource Negotiator

YARN Architecture

Launching an application

yarn jar application.jar DriverClass input output

List running jobs

yarn application -list

See application logs

yarn logs -applicationId applicationId

Kill an application

yarn application -kill applicationId

Capacity Scheduler Queues





MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
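The model can be sketched in plain Python (an illustrative toy, not the Hadoop API): the map phase emits key-value pairs for every input record, and the reduce phase aggregates all values seen for each key.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group the pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big cluster", "data centric"]
result = reduce_phase(map_phase(lines))
# result == {'big': 2, 'data': 2, 'cluster': 1, 'centric': 1}
```

In a real cluster the map and reduce phases run in parallel on different nodes, and the framework performs the grouping (shuffle) between them.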


Launching a job

To launch a job:

yarn jar job.jar DriverClass input output

List MR jobs

To list running MR jobs:

mapred job -list

Cancelling a job

To cancel a job:

mapred job -kill [jobid]


You can easily monitor your jobs using the YARN UI from the WebUI:

Job History

You can see finished jobs using the MR2 UI from the WebUI:

MapReduce for Java developers

Development Environment Setup

  • For a sample Maven-based project use:
    git clone
  • Import the project in Eclipse using m2e or in Intellij
  • If using an IDE like Eclipse or Intellij it can be useful to:
                  # Download sources and javadoc
                  mvn dependency:sources
                  mvn dependency:resolve -Dclassifier=javadoc
                  # Update the existing Eclipse project
                  mvn eclipse:eclipse
                  # Or if you are using IntelliJ IDEA
                  mvn idea:idea

Maven Basic Usage


Compile your app

mvn compile

Run the tests

mvn test

Package your app

mvn package

Manual Process

If you prefer to compile and package manually:

            javac -classpath $(hadoop classpath) *.java
            jar cvf wordcount.jar *.class

MapReduce Program

Basic components of a program:

  • Driver: management code for the job or sequence of jobs
  • map function of the Mapper
  • reduce function of the Reducer

Driver Code

public class Driver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf);
    job.setJobName("Word Count");
    job.setJarByClass(Driver.class);
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}

Map Code

public class WordMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    for (String field : line.split("\\W+")) {
      if (field.length() > 0) {
        word.set(field);
        context.write(word, one);
      }
    }
  }
}

Reduce Code

public class SumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(
      Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int wordCount = 0;
    for (IntWritable value : values) {
      wordCount += value.get();
    }
    context.write(key, new IntWritable(wordCount));
  }
}

MapReduce Streaming API

Quick how-to

  • We write two separate scripts: a Mapper and a Reducer
  • The Mapper script receives the input file line by line on stdin
  • The stdout of the Mapper and the Reducer must be key-value pairs separated by a tab


yarn jar \
    /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -input input -output output \
    -mapper -reducer \
    -file -file
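As an illustration, a minimal word-count mapper and reducer for the streaming API could look like the sketch below. They are written as functions over iterables so they can be tested; real streaming scripts would loop over sys.stdin and print each emitted line.

```python
def mapper(lines):
    # Map step: emit one "word<TAB>1" record per word
    for line in lines:
        for word in line.strip().split():
            yield "%s\t1" % word

def reducer(lines):
    # Reduce step: Hadoop delivers records sorted by key, so equal
    # words arrive consecutively and can be summed in a single pass
    current, total = None, 0
    for line in lines:
        word, count = line.split("\t")
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)

# Simulate the shuffle (sort by key) that Hadoop performs between phases
mapped = sorted(mapper(["big data", "big cluster"]))
counts = list(reducer(mapped))
# counts == ['big\t2', 'cluster\t1', 'data\t1']
```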


A fast and general engine for large-scale data processing




Language Selection

  • Scala
  • Java
  • Python
  • R


PySpark Basics

  • Based on Anaconda Python distribution
  • Over 720 packages for data preparation, data analysis, data visualization, machine learning and interactive data science

Running pyspark interactively

  • Using Jupyter notebook
  • Running from the command line using ipython:
    PYSPARK_DRIVER_PYTHON=ipython pyspark


          [jlopez@login7 ~]$ PYSPARK_DRIVER_PYTHON=ipython pyspark
          >>> from pyspark.sql import Row
          >>> Person = Row('name', 'surname')
          >>> data = []
          >>> data.append(Person('Joe', 'MacMillan'))
          >>> data.append(Person('Gordon', 'Clark'))
          >>> data.append(Person('Cameron', 'Howe'))
          >>> df = sqlContext.createDataFrame(data)
          >>> df.show()
          +-------+---------+
          |   name|  surname|
          +-------+---------+
          |    Joe|MacMillan|
          | Gordon|    Clark|
          |Cameron|     Howe|
          +-------+---------+


Submit job to queue

Spark Components

spark-submit Python

        # client mode
        spark-submit --master yarn \
          --name testWC input output
        # cluster mode
        spark-submit --master yarn --deploy-mode cluster \
          --name testWC input output

spark-submit Scala/Java

        # client mode
        spark-submit --master yarn --name testWC \
          --class es.cesga.hadoop.Test test.jar \
          input output
        # cluster mode
        spark-submit --master yarn --deploy-mode cluster \
          --name testWC \
          --class es.cesga.hadoop.Test test.jar \
          input output

spark-submit options

--num-executors NUM    Number of executors to launch (Default: 2)
--executor-cores NUM   Number of cores per executor. (Default: 1)
--driver-cores NUM     Number of cores for driver (cluster mode)
--executor-memory MEM  Memory per executor (Default: 1G)
--queue QUEUE_NAME     The YARN queue to submit to (Default: "default")
--proxy-user NAME      User to impersonate


SQL-like interface


Hive offers the possibility to use Hadoop through a SQL-like interface

Hive and Impala use the same SQL dialect: HiveQL

Field delimiter

  • The default field delimiter is Ctrl+A (0x01)
  • It can be changed when creating a table
                ROW FORMAT DELIMITED
                FIELDS TERMINATED BY ':'
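For illustration, here is how a row stored with that default delimiter splits (plain Python; byte 0x01 is Ctrl+A):

```python
# A row serialized with Hive's default field delimiter Ctrl+A (0x01)
row = "Joe\x01MacMillan\x0142"
fields = row.split("\x01")
# fields == ['Joe', 'MacMillan', '42']
```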


Transferring data between Hadoop and relational databases


List tables

        sqoop list-tables \
          --username ${USER} -P \
          --connect jdbc:postgresql://${SERVER}/${DB}

Import one table

        sqoop import \
          --username ${USER} --password ${PASSWORD} \
          --connect jdbc:postgresql://${SERVER}/${DB} \
          --table mytable \
          --target-dir /user/username/mytable \
          --num-mappers 1

Import into Hive

        sqoop import \
          --username ${USER} --password ${PASSWORD} \
          --connect jdbc:postgresql://${SERVER}/${DB} \
          --table mytable \
          --target-dir /user/username/mytable \
          --num-mappers 1 \
          --hive-import
Create only the table structure into Hive

        sqoop create-hive-table \
          --username ${USER} --password ${PASSWORD} \
          --connect jdbc:postgresql://${SERVER}/${DB} \
          --table mytable

Sqoop Export


First create table into PostgreSQL

        sqoop export \
          --username ${USER} --password ${PASSWORD} \
          --connect jdbc:postgresql://${SERVER}/${DB} \
          --table mytable \
          --export-dir /user/username/mytable \
          --input-fields-terminated-by '\001' \
          --num-mappers 1

Direct mode

For MySQL and PostgreSQL you can use direct mode (the --direct option) for faster performance


Improving the performance


Low-level routines for performing common linear algebra operations


Adds support to Python for fast operations with multi-dimensional arrays and matrices

Already configured to use Intel MKL
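For example, a matrix product in NumPy is executed by the underlying BLAS library (Intel MKL on this platform, per the note above):

```python
import numpy as np

# Build two small matrices
a = np.arange(6, dtype=np.float64).reshape(2, 3)   # [[0,1,2],[3,4,5]]
b = np.ones((3, 2))

# The @ operator dispatches the multiplication to the BLAS gemm routine
c = a @ b
# c == [[3., 3.], [12., 12.]]
```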

Storage Formats


How to connect

How to connect: Setup

  1. Configure the Fortigate VPN
  2. Start the VPN

How to connect: CLI

Using a powerful CLI through SSH:


How to connect: WebUI

Using a simple Web User Interface


The VPN software allows you not only to connect to CESGA in a secure way, but also to access internal resources that you cannot reach otherwise.

Installing the Forticlient VPN software

  • Download the VPN software from Official Forticlient download page or from the CESGA Repository
  • For Windows/Mac just launch the installation wizard
  • For recent Linux versions follow the official instructions
  • For old Linux versions check the next slide
  • Enter the following configuration options
  • Gateway:
    Port: 443
    Username: your email address registered at CESGA
    Password: your password in the supercomputers
  • Remember to start the VPN before connecting to

VPN Installation in old Linux versions

  • If your Linux distribution is not supported in the official download page use our CESGA Local Repository
  • Follow these steps:
        unrar e vpn-fortissl.rar
        tar xvzf forticlientsslvpn_linux_4.4.2323.tar.gz
        cd forticlientsslvpn
  • Accept the license agreement presented
  • Run the client:
        ../forticlientsslvpn_cli \
           --server \

Alternative Open Source Linux Client OpenfortiVPN

  • For Linux there is also an alternative open-source client that you can use instead of the official fortivpn client
  • Some Linux distributions such as Ubuntu, Debian, openSUSE or Arch Linux provide openfortivpn packages
  • Check the project's GitHub page for details: openfortivpn
  • It also has a GUI that you can use: openfortigui

How to upload data

How to upload data

Use our dedicated Data Transfer Node:

SCP/SFTP is fine for small transfers

But for large transfers use Globus Online

Our endpoint is cesga#dtn

If you prefer you can use gridftp directly :-)

Expected upload times


Choosing software versions

module avail


We recommend that you use the Anaconda version of Python instead of the OS one

module load anaconda2/2018.12

You can even try Python 3 (not officially supported)

module load anaconda3/2018.12

Build tools

You can load the following build tools

  • maven
  • sbt



The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

Launching Jupyter

  • Connect to
  • Go to your working directory
  • Launch the Jupyter server
  • Point your browser to the address where the notebook is running:
    The Jupyter Notebook is running at:
  • The VPN must be running


Jupyter Terminal

Jupyter Conda






Success Stories

Gaia (UDC)

FilmYou (CITIC)


We are here to help:

Stay up to date subscribing to our Mailing list