.. _hdfs:

HDFS: The Hadoop File System
============================

HDFS is the underlying distributed filesystem on which you will run your applications so that they can take advantage of parallel data processing. HDFS is optimized for large sequential reads, and the best performance is obtained with large files (>1GB).

Files are split into blocks (128MB by default), and each block is replicated across multiple nodes, providing fault tolerance in case of a node failure. By default each block has **3 replicas**, but you can control the number of replicas when you create the file (see the example at the end of this section).

We recommend that you first upload your files to the BD HOME filesystem using the DTN server and then copy them from there to HDFS. See the :ref:`how_to_upload_data` section for more information.

To put a file in HDFS you can run the following command::

    hdfs dfs -put file.txt file.txt

To list files::

    hdfs dfs -ls

To create a directory::

    hdfs dfs -mkdir mydir

To get a file from HDFS to the local filesystem::

    hdfs dfs -get file.txt

Using :ref:`hue` you have a convenient Web UI to explore HDFS; you can access it through the :ref:`webui`.

.. figure:: _static/screenshots/hue-hdfs.png
   :align: center

   Exploring HDFS from HUE.

For further information on how to use HDFS, check the `HDFS Tutorial`_ that we have prepared to get you started, and the `Hadoop Documentation`_ as a reference.

.. _HDFS Tutorial: https://bigdata.cesga.es/tutorials/hdfs.html
.. _Hadoop Documentation: https://archive.cloudera.com/cdh6/6.0.0/docs/hadoop-3.0.0-cdh6.0.0/index.html
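As mentioned above, the replication factor can be set per file. As a minimal sketch (the file names are illustrative), you can set it at upload time with the generic ``-D`` option, or change it afterwards with ``setrep``::

    # upload with 2 replicas instead of the default 3
    hdfs dfs -D dfs.replication=2 -put file.txt file.txt

    # change the replication factor of an existing file (-w waits for completion)
    hdfs dfs -setrep -w 2 file.txt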
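A few other standard ``hdfs dfs`` subcommands are often handy in the workflow described above (the paths are illustrative)::

    # show the size of a file or directory in human-readable units
    hdfs dfs -du -h mydir

    # upload a whole directory recursively
    hdfs dfs -put mydir mydir

    # remove a directory and its contents
    hdfs dfs -rm -r mydir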