
HDFS Architecture


HDFS, or the Hadoop Distributed File System, is a block-structured file system in which each file is divided into blocks of a predetermined size. These blocks are stored across a cluster of one or several machines. HDFS follows a master/slave architecture, where a cluster comprises a single NameNode (the master node) and a number of DataNodes (slave nodes). HDFS is written in Java, so it can be deployed on a broad spectrum of machines that support Java. Though one can run several DataNodes on a single machine, in practice these DataNodes are spread across various machines.
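To make the block-structured idea concrete, here is a minimal sketch (illustrative only, not HDFS source code) of how a file of a given size maps onto fixed-size blocks. The 128 MB value is the default block size in Hadoop 2.x; the function name is our own.

```python
# Conceptual sketch: how a file is divided into fixed-size blocks.
# BLOCK_SIZE and the split logic are illustrative, not HDFS internals.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default in Hadoop 2.x

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of the given size occupies."""
    blocks = []
    remaining = file_size_bytes
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
sizes = split_into_blocks(300 * 1024 * 1024)
print([s // (1024 * 1024) for s in sizes])  # [128, 128, 44]
```

Note that the last block of a file is usually smaller than the block size; HDFS does not pad it out to a full block.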

NameNode:
The NameNode is the master of HDFS that maintains and manages the blocks present on the DataNodes (slave nodes). Think of the NameNode as a Lamborghini in the midst of various other cars: like a Lamborghini, the NameNode is a highly available server that manages the file system namespace and controls access to files by clients. There is just one NameNode in Hadoop 1.x, which makes it the single point of failure in the entire HDFS cluster. The HDFS architecture is built in such a way that user data is never stored on the NameNode.

Functions of a NameNode:
Let’s list out various functions of a NameNode:
1. The NameNode maintains and manages the file system namespace. Any modification to the file system namespace or its properties is tracked by the NameNode.
2. It directs the Datanodes (Slave nodes) to execute the low-level I/O operations.
3. It keeps a record of how the files in HDFS are divided into blocks and in which nodes these blocks are stored; broadly, the NameNode manages the cluster configuration.
4. It maps a file name to a set of blocks and maps a block to the DataNodes where it is located.
5. It records the metadata of all the files stored in the cluster, e.g. the location, the size of the files, permissions, hierarchy, etc.
6. With the help of a transactional log, that is, the EditLog, the NameNode records each and every change that takes place to the file system metadata. For example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog.
7. The NameNode is also responsible for maintaining the replication factor of all the blocks. If there is a change in the replication factor of any of the blocks, the NameNode will record this in the EditLog.
8. The NameNode regularly receives a Heartbeat and a Blockreport from all the DataNodes in the cluster to make sure that the DataNodes are working properly. A Blockreport contains a list of all blocks on a DataNode.
9. In case of a DataNode failure, the NameNode chooses new DataNodes for new replicas, balances disk usage, and manages the communication traffic to the DataNodes.
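Points 4 and 9 above can be sketched as a toy in-memory model of the two mappings the NameNode maintains: file name to block IDs, and block ID to the DataNodes holding replicas. This is a conceptual illustration under our own assumptions (class and method names are invented; real HDFS uses rack-aware replica placement, not the round-robin used here).

```python
# Toy model (not Hadoop code) of the NameNode's two core mappings:
# file name -> block IDs, and block ID -> DataNodes holding replicas.

class ToyNameNode:
    def __init__(self, replication_factor=3):
        self.replication_factor = replication_factor
        self.file_to_blocks = {}   # e.g. "/logs/a.txt" -> ["blk_1", "blk_2"]
        self.block_locations = {}  # e.g. "blk_1" -> {"dn1", "dn2", "dn3"}

    def add_file(self, path, block_ids, datanodes):
        self.file_to_blocks[path] = list(block_ids)
        for i, blk in enumerate(block_ids):
            # Pick replication_factor DataNodes per block (round-robin here;
            # real HDFS uses rack-aware placement).
            chosen = [datanodes[(i + k) % len(datanodes)]
                      for k in range(self.replication_factor)]
            self.block_locations[blk] = set(chosen)

    def locate(self, path):
        """Map a file name to its blocks and each block to its DataNodes."""
        return {blk: self.block_locations[blk]
                for blk in self.file_to_blocks[path]}

nn = ToyNameNode()
nn.add_file("/logs/a.txt", ["blk_1", "blk_2"], ["dn1", "dn2", "dn3", "dn4"])
print(nn.locate("/logs/a.txt"))
```

Because this metadata lives on the NameNode alone, a client always asks the NameNode where a file's blocks are, then reads the block data directly from the DataNodes.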

DataNodes:
DataNodes are the slave nodes in HDFS, just like any average car next to a Lamborghini! Unlike the NameNode, a DataNode runs on commodity hardware, that is, an inexpensive system that is not of high quality or high availability. A DataNode is a block server that stores the data in its local file system, such as ext3 or ext4.

Functions of DataNodes:
Let’s list out various functions of Datanodes:
1. DataNodes serve the low-level read and write requests from the file system's clients.
2. They are also responsible for creating blocks, deleting blocks and replicating the same based on the decisions taken by the NameNode.
3. They regularly send a report on all the blocks present in the cluster to the NameNode.
4. DataNodes also enable pipelining of data.
5. They forward data to other specified DataNodes.
6. DataNodes send heartbeats to the NameNode once every 3 seconds to report the overall health of HDFS.
7. The DataNode stores each block of HDFS data in separate files in its local file system.
8. When the DataNodes start up, they scan through their local file systems, create a list of all HDFS data blocks that correspond to these local files, and send a Blockreport to the NameNode.
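The heartbeat mechanism in point 6 can be sketched as simple bookkeeping on the NameNode side. This is a conceptual illustration, not Hadoop code: the 3-second interval matches HDFS, but the 30-second dead-node timeout is an invented value chosen for readability (the real HDFS default is much longer, around 10.5 minutes).

```python
# Sketch (not Hadoop source) of heartbeat bookkeeping: each DataNode reports
# every 3 seconds; the NameNode marks a node dead if no heartbeat arrives
# within a timeout. The timeout below is illustrative, not the HDFS default.

HEARTBEAT_INTERVAL = 3.0   # seconds, as in HDFS
DEAD_TIMEOUT = 30.0        # illustrative value for this sketch

def dead_datanodes(last_heartbeat, now, timeout=DEAD_TIMEOUT):
    """Return DataNodes whose last heartbeat is older than the timeout."""
    return sorted(dn for dn, t in last_heartbeat.items() if now - t > timeout)

# dn2 last reported 45 seconds ago, so it is considered dead.
last_seen = {"dn1": 99.0, "dn2": 57.0, "dn3": 100.5}
print(dead_datanodes(last_seen, now=102.0))  # ['dn2']
```

Once a DataNode is declared dead, the NameNode schedules new replicas of that node's blocks on other DataNodes, which is how point 9 of the NameNode functions comes into play.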

Secondary NameNode
This is a misnomer!
In the HDFS architecture, the name "Secondary NameNode" gives the impression that it is a substitute for the NameNode. Alas! It is not!
Now, at this point, we know that the NameNode stores vital metadata about all the blocks stored in HDFS. This data is kept not only in main memory but also on disk.
The two associated files are:
1. Fsimage: An image of the file system on starting the NameNode.
2. EditLogs: A series of modifications done to the file system after starting the NameNode.
The Secondary NameNode periodically reads the file system metadata from the NameNode and writes it to disk. It is responsible for combining the EditLogs with the fsimage from the NameNode: it downloads the EditLogs from the NameNode at regular intervals and applies them to its copy of the fsimage. The new fsimage is copied back to the NameNode and is used whenever the NameNode starts the next time.
However, since the Secondary NameNode cannot serve the metadata in place of the NameNode, it is not a substitute for it. So if the NameNode fails, the entire Hadoop HDFS cluster goes down, and any metadata held only in the NameNode's RAM since the last checkpoint is lost. The Secondary NameNode just performs regular checkpoints in HDFS. Just a helper, a checkpoint node!
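The checkpoint described above can be sketched as a merge of the EditLog into the fsimage. This is a conceptual toy under our own assumptions (the fsimage is modeled as a plain dictionary and the EditLog as a list of operations; the real on-disk formats are binary and far richer).

```python
# Conceptual checkpoint (not Hadoop code): the Secondary NameNode merges the
# EditLog into the fsimage so the NameNode can restart from a recent snapshot
# instead of replaying a huge log.

def checkpoint(fsimage, edit_log):
    """Apply each logged namespace operation to a copy of the fsimage."""
    new_image = dict(fsimage)
    for op, path, *args in edit_log:
        if op == "create":
            new_image[path] = args[0]          # e.g. the file's block IDs
        elif op == "delete":
            new_image.pop(path, None)
    return new_image  # becomes the new fsimage; the EditLog can be truncated

fsimage = {"/a.txt": ["blk_1"]}
edits = [("create", "/b.txt", ["blk_2", "blk_3"]), ("delete", "/a.txt")]
print(checkpoint(fsimage, edits))  # {'/b.txt': ['blk_2', 'blk_3']}
```

The design rationale is that appending to the EditLog is cheap at write time, while the periodic merge keeps NameNode restart time bounded: on restart, the NameNode loads the latest fsimage and replays only the short EditLog written since the last checkpoint.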
