Understanding the Hadoop Cluster, Name Node, Secondary Name Node, and Job Tracker in Big Data Hadoop

You have probably heard of the buzz around "Big Data Hadoop". But do you know how it actually works, and what it can do for businesses today? If not, hang on and continue reading.

We have taken a step-by-step approach to help you understand the architecture of a Hadoop cluster, its functionality, and its server infrastructure. So, get ready for a journey into the Hadoop cluster!

Hadoop Server Roles

There are three major categories of machine roles in a Hadoop deployment: Client Machines, Master Nodes, and Slave Nodes.

The Master Nodes oversee the two key functions that make up Hadoop: storing large volumes of data (HDFS) and running parallel computations on that data (MapReduce). The Name Node controls and coordinates the data storage function, while the Job Tracker oversees the parallel processing of data with MapReduce. The Slave Nodes, on the other hand, make up the vast majority of machines and do all the heavy lifting of storing the data and running the computations.

Every Slave runs both a Data Node and a Task Tracker daemon that communicate with, and receive instructions from, their master nodes. The Task Tracker daemon is a slave to the Job Tracker; the Data Node daemon is a slave to the Name Node.

Client Machines have Hadoop installed with all the cluster settings, but are neither Masters nor Slaves. Instead, the role of the Client machine is to load data into the cluster, submit MapReduce jobs describing how that data should be processed, and then retrieve or view the results when the job is finished. In small clusters, a single physical server may take on multiple roles, such as acting as both Job Tracker and Name Node. In medium and large clusters, each role typically runs on its own server machine.

As far as real production clusters are concerned, there is no server virtualization and no hypervisor layer, as these would only add overhead and hurt performance. Hadoop works best on Linux machines, where it interacts directly with the underlying hardware.

In a typical Hadoop cluster architecture, you'll have rack servers populated in racks, each rack connected to a top-of-rack switch, normally with 1 or 2 GE bonded links.

The rack switch has uplinks to another tier of switches that connect all the other racks with uniform bandwidth, forming the cluster. Most of the servers will be Slave Nodes with large local disk storage and a considerable amount of CPU and DRAM. Some of them will be Master Nodes, with a different configuration favoring more CPU and DRAM over local storage.

A Hadoop cluster is of no use until it has data, so the process begins by loading a huge File.txt into the cluster. The main goal is fast parallel processing of huge data sets, and to achieve that you need as many machines as possible working on the data all at once. To that end, the client breaks the data file into small blocks and spreads those blocks across different machines throughout the cluster. The more blocks there are, the more machines can work on the data in parallel. But these machines may be prone to failure, so each block should be stored on multiple machines at once to avert any data loss. Therefore, every block is replicated in the cluster as it is loaded. The standard setting in Hadoop keeps 3 copies of each block in the cluster; you can configure this with the dfs.replication parameter in the hdfs-site.xml file.
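As a sketch, the replication factor mentioned above lives in hdfs-site.xml as a standard property entry (the value 3 shown here is simply Hadoop's default):

```xml
<!-- hdfs-site.xml: number of copies kept for each block (Hadoop's default is 3) -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

Raising the value improves fault tolerance at the cost of disk space; lowering it below 3 is generally only done on test clusters.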

The client machine breaks File.txt into three blocks. For each block, it consults the Name Node and receives a list of three Data Nodes that should hold a copy of that block. The client then writes the block directly to the first Data Node; the receiving Data Node replicates the block to the next Data Node in the list, and this cycle repeats for the remaining blocks. The Name Node is not in the data path; it only provides the map of where data lives and where new data should go in the cluster.
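The splitting and placement described above can be sketched in a toy, single-process simulation. This is not Hadoop code: the block size and the round-robin placement policy below are illustrative assumptions (the real Name Node uses a rack-aware policy), but the shape of the result — each block mapped to a list of distinct Data Nodes — is the same.

```python
# Toy sketch of HDFS-style block splitting and replica placement (NOT real Hadoop code).
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the Hadoop 1.x default block size
REPLICATION = 3                # mirrors the default dfs.replication

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Chop file contents into block-sized chunks, as the client does."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def choose_replica_nodes(block_index: int, datanodes: list, replication: int = REPLICATION):
    """Pick `replication` distinct Data Nodes for one block.
    Round-robin here for simplicity; the real Name Node is rack-aware."""
    return [datanodes[(block_index + r) % len(datanodes)] for r in range(replication)]

datanodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
file_bytes = b"x" * (150 * 1024 * 1024)  # a 150 MB "File.txt"
blocks = split_into_blocks(file_bytes)   # 150 MB / 64 MB -> 3 blocks (64 + 64 + 22 MB)
placements = {i: choose_replica_nodes(i, datanodes) for i in range(len(blocks))}
print(len(blocks))    # 3
print(placements[0])  # ['dn1', 'dn2', 'dn3']
```

Note that the last block is smaller than the others; HDFS likewise only stores the actual bytes of a short final block rather than padding it out.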

The Secondary Name Node contacts the Name Node periodically, by default every hour. The Name Node keeps the filesystem metadata in RAM, so if it crashes, everything in RAM is lost and there is no backup of the filesystem metadata. To guard against this, the Secondary Name Node pulls a copy of the metadata on each contact, merges it into a fresh copy of the filesystem image, and sends it back to the Name Node, keeping a copy for itself. It thus performs a housekeeping role: if the Name Node fails, the copied metadata can be used to rebuild it.
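In Hadoop 1.x, this one-hour checkpoint interval is governed by the fs.checkpoint.period property (in seconds) in core-site.xml; a minimal sketch of overriding it might look like this (the value shown is the default):

```xml
<!-- core-site.xml (Hadoop 1.x): seconds between Secondary Name Node checkpoints -->
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value> <!-- 3600 s = 1 hour -->
</property>
```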

JobTracker & TaskTracker

The JobTracker is the master that creates and runs jobs. The JobTracker, which can run on the Name Node, allocates job tasks to the TaskTrackers running on the Data Nodes. Each TaskTracker runs its tasks and reports their status back to the JobTracker.
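The work the JobTracker farms out follows the MapReduce programming model. Real Hadoop jobs are written in Java and run distributed across the TaskTrackers; the following is only a single-process toy sketch of the map → shuffle → reduce flow, using word count as the classic example (all function names here are illustrative):

```python
# Toy word count illustrating the MapReduce model (single process, NOT real Hadoop).
from collections import defaultdict

def map_phase(line: str):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data hadoop", "hadoop cluster", "big cluster"]
pairs = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(shuffle(pairs)))  # {'big': 2, 'data': 1, 'hadoop': 2, 'cluster': 2}
```

In a real cluster, each TaskTracker would run the map phase on the blocks stored locally on its Data Node, and the framework would shuffle the intermediate pairs across the network to the reducers.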

If you are looking for complete Big Data Hadoop training in Delhi, please visit our Delhi office to get all the details.


