BD|CESGA

Providing quick access to ready-to-use Big Data solutions.

Because Big Data doesn't have to be complicated.

Why?

Leverage the power of the new Big Data Infrastructure.

Easy

No need to learn how to deploy complex services: just connect and start using them.

Fast

Access a fully optimized infrastructure for Big Data applications.

Data Transfer

Bring your data over a fast network interconnection or deliver it physically.

Free

The service is free for users from Galician universities and CSIC, just as the HPC service is.

What?

A quick overview of some of the ready-to-use services available.

HDFS

Java-based file system that provides scalable and reliable data storage, designed to span large clusters of commodity servers.

YARN

Allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform, unlocking an entirely new approach to analytics.

MapReduce

Software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
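
To make the programming model more concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases, written in plain Python. It does not use the Hadoop API and does not run on the cluster; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are illustrative assumptions, not part of the service.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework would do."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the list of values for one key into a single total."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere"]
    print(word_count(sample))
```

On a real cluster the map and reduce functions would run on many nodes at once, and the framework would take care of the shuffle and of retrying failed tasks.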

Spark

Fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing.

Hive

Data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.

Sqoop

A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Jupyter

A web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text.

Zeppelin (coming soon)

Web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.

HUE

Web interface for analyzing data with Apache Hadoop. Hue applications let you browse HDFS, manage a Hive metastore, and run Hive and Cloudera Impala queries, HBase and Sqoop commands, Pig scripts, MapReduce jobs, and Oozie workflows.

HBase

Hadoop database, a distributed, scalable, big data store.

Oozie

Workflow scheduler system to manage Apache Hadoop jobs.

Pig

A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.

Storm

A distributed real-time computation system for processing large volumes of high-velocity data.

Kafka

A unified, high-throughput, low-latency platform for handling real-time data feeds.

Flume

A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

Tez

An extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.

ZooKeeper

A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Mahout

Environment for quickly creating scalable, performant machine learning applications.

Slider

An application to deploy existing distributed applications on an Apache Hadoop YARN cluster, monitor them, and make them larger or smaller as desired, even while the application is running.

Falcon

A framework to simplify data pipeline processing and management on Hadoop clusters.

Atlas

Scalable and extensible set of core foundational governance services that enables enterprises to meet their compliance requirements within Hadoop effectively and efficiently, and allows integration with the whole enterprise data ecosystem.

Mesos

A cluster manager that simplifies running applications on a scalable cluster of servers, and the heart of the Mesosphere system.

Marathon

A container orchestration platform for Mesos and DC/OS.

Consul

A tool for discovering and configuring services in our Big Data infrastructure.

Cassandra

A distributed database for managing large amounts of structured data across many commodity servers, while providing highly available service and no single point of failure.

MongoDB

A document-oriented database that is very easy to use.

PostgreSQL

A popular SQL database. Because not everything has to be NoSQL.

GlusterFS

A distributed-replicated file system.

SLURM (coming soon)

A job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters.

CDH

An Apache Hadoop distribution by Cloudera.

How?

How to connect to the service.

The services are accessible only from inside CESGA's VPN. Once connected to the VPN, you can use the WebUI or SSH, depending on the service.

  •   Activate the VPN
  •   Connect to the WebUI
  •   Connect through SSH to hadoop3.cesga.es
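
As a small illustration of the SSH step, the sketch below builds the command to reach the login host named above. It assumes the VPN is already active and that an `ssh` client is installed; the user name `myuser` and the helper `build_ssh_command` are hypothetical placeholders, so substitute your own CESGA account.

```python
import shlex
from typing import List

# Login host taken from the steps above.
HADOOP_LOGIN_HOST = "hadoop3.cesga.es"


def build_ssh_command(username: str, host: str = HADOOP_LOGIN_HOST) -> List[str]:
    """Return the argument list for an interactive SSH session to the host."""
    return ["ssh", f"{username}@{host}"]


if __name__ == "__main__":
    # "myuser" is a placeholder account name, not a real user.
    command = build_ssh_command("myuser")
    # Print the command so it can be copied into a terminal.
    print(shlex.join(command))
    # To open the session from Python instead, one option is:
    #   import subprocess; subprocess.run(command, check=False)
```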

WebUI

A simple Web User Interface that complements the standard SSH CLI.

Go to WebUI!

Hadoop

Your ready-to-use Hadoop ecosystem.

CDH

Based on Cloudera CDH 6.1.1.

Hadoop 3

Includes most of the components in the Hadoop ecosystem.

Production ready

Ready to run production jobs.

Powerful

A fully optimized infrastructure for Big Data applications.

PaaS

When you need something outside the Hadoop ecosystem.

Flexible

Enables you to deploy custom Big Data clusters.

Disk-aware

Advanced disk-aware resource planning.

Docker

No virtualization overheads.

Variety

Includes a catalog of products ready to use: Cassandra, MongoDB, PostgreSQL.

Tutorials

We have prepared some tutorials to get you started using the platform.

User Guide

Workshop

VPN

Spark

PySpark

Sparklyr

HDFS

YARN

MapReduce

Hive

Sqoop

Jupyter

GATK

PaaS

HBase

PostgreSQL

MariaDB

Uchuu

Stay updated

Stay up to date with the status of the infrastructure and get info about new services.

Contact Us

We are here to help.

For any questions, contact us.