Essential Roles of MapReduce, Hive, HBase and HDFS in the Hadoop System

 

Hadoop is widely discussed these days, yet many IT professionals are still unfamiliar with the key components that make up this open-source software platform. Before we discuss the roles that MapReduce, Hive, HBase and HDFS play in Hadoop, let’s first understand what Hadoop actually is.

At the most fundamental level, Hadoop is an open-source platform designed to store and process big data. The strength of this distributed system lies in its capacity to scale across many commodity servers that do not share disk space or memory.

Hadoop assigns tasks across these servers, also called “worker nodes” or “slave nodes”, and runs them concurrently to use the power of every machine. This is what allows huge amounts of data to be analyzed: breaking the work down across many machines helps bigger jobs finish faster.

Hadoop can be thought of as an ecosystem of many different components that work together to form a single platform. The key components include HDFS, MapReduce, Hive, and HBase. Let’s take a brief look at each of them.

Hadoop Distributed File System (HDFS)

HDFS is the component that enables Hadoop to store and handle big data. It is a scalable file system that distributes and stores data across all the machines in a group of servers called a “Hadoop cluster”. An HDFS cluster is made up of the following parts:

NameNode: It runs on a master node and tracks and directs the storage, keeping the metadata that records which blocks make up each file and where those blocks are stored.

DataNode: It runs on the slave nodes, which form the majority of machines in a cluster. Data files are broken down into blocks, each of which is copied three times by default and stored on DataNodes across the cluster. These copies ensure that the data is protected if something goes wrong or if one server fails.

Client machine: These machines have Hadoop installed on them and are responsible for loading data into the cluster, submitting MapReduce jobs and analyzing the results once a job has finished.
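
To make the block-and-replica idea concrete, here is a minimal Python sketch. It is not HDFS code: the function names, the tiny block size and the round-robin placement are all assumptions made for illustration, and real HDFS uses a rack-aware placement policy decided by the NameNode.

```python
BLOCK_SIZE = 128   # bytes, for illustration only; HDFS blocks are far larger (commonly 128 MB)
REPLICATION = 3    # three copies per block, as described above


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's bytes into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes.

    Round-robin placement is a simplification chosen for this sketch.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)] for r in range(replication)
        ]
    return placement


def surviving_blocks(placement, failed_node):
    """Return the block ids that still have at least one copy after a node fails."""
    return {
        block_id
        for block_id, nodes in placement.items()
        if any(node != failed_node for node in nodes)
    }


blocks = split_into_blocks(bytes(300))
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(placement)
print(surviving_blocks(placement, "dn1"))
```

Because every block lives on three different machines in this sketch, losing any single DataNode still leaves every block readable.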

MapReduce

MapReduce is the system that efficiently processes the large amounts of data that Hadoop stores in HDFS. Based on a programming model first described by Google, its strength lies in its capacity to break a single large data processing job into many smaller tasks. MapReduce jobs are usually written in Java; however, other languages can also be used through Hadoop Streaming, a utility that ships with Hadoop. Once a job has been created, its “map” tasks are spread across multiple nodes and run concurrently. The “reduce” part then combines the results.
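
The map, group and reduce steps can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model, using the classic word-count example; it does not use the Hadoop API, and the function names are chosen for this sketch only.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # In Hadoop, each line (or input split) could be mapped on a different node.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(word_count(["big data", "Big cluster"]))
```

In a real cluster the map calls run on many nodes at once and the shuffle moves data over the network; the logic of each phase stays the same.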

Hive

Hive provides an SQL-like interface to the Hadoop system. Data stored in HBase can also be accessed via Hive. It is of great use to developers who have no experience with the MapReduce framework, because they can write queries that are translated into MapReduce jobs in the Hadoop system.

In other words, Hive is a data warehousing package built on top of Hadoop for analyzing huge amounts of data. It is mainly designed for users who are comfortable with SQL. The best part of Hive is that it abstracts away the complexity of Hadoop, as users don’t need to write MapReduce programs while using it.
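
As a rough illustration of what “translated into MapReduce jobs” means, the Python sketch below pairs a HiveQL-style query with the map and reduce logic a GROUP BY conceptually corresponds to. The table name, column names and functions are hypothetical, and Hive's real query planner is considerably more involved than this.

```python
from collections import defaultdict

# A query a Hive user might write (table and columns are hypothetical):
QUERY = "SELECT dept, SUM(salary) FROM employees GROUP BY dept"


def map_row(row):
    """Map: emit the GROUP BY column as the key and the aggregated column as the value."""
    return row["dept"], row["salary"]


def reduce_group(values):
    """Reduce: apply the aggregate function (SUM here) to one group's values."""
    return sum(values)


def run_group_by_sum(rows):
    """Run the map step, group by key, then reduce each group."""
    groups = defaultdict(list)
    for row in rows:
        key, value = map_row(row)
        groups[key].append(value)
    return {key: reduce_group(values) for key, values in groups.items()}


employees = [
    {"dept": "eng", "salary": 100},
    {"dept": "ops", "salary": 50},
    {"dept": "eng", "salary": 25},
]
print(run_group_by_sum(employees))
```

The user only writes the one-line query; the equivalent of the functions above is generated for them.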

HBase

HBase is a column-oriented database management system (DBMS) built on top of Hadoop that runs on the Hadoop Distributed File System. As with MapReduce, HBase applications are written in Java, and other languages can be used through its Thrift gateway; Thrift is a framework that allows cross-language service development. The main difference between MapReduce and HBase is that HBase is designed to work best with random-access workloads.

For example, suppose you have a regular file that has to be processed from start to finish; here MapReduce works brilliantly. But if you have a huge table, say a petabyte in size, and you have to process a single row at a random location within that table, HBase is the better fit. Another strong feature of HBase is the very low latency it offers.
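
The difference between the two access patterns can be sketched in Python. The class below is a toy stand-in, not the HBase client API: it only assumes, as HBase does, that rows are kept sorted by row key, so a single row can be located without reading the whole table.

```python
import bisect


class SortedRowStore:
    """Toy table that keeps rows sorted by row key (names are illustrative)."""

    def __init__(self, rows):
        items = sorted(rows.items())
        self._keys = [key for key, _ in items]
        self._values = [value for _, value in items]

    def get(self, row_key):
        """Random access: binary search on the sorted keys, O(log n)."""
        i = bisect.bisect_left(self._keys, row_key)
        if i < len(self._keys) and self._keys[i] == row_key:
            return self._values[i]
        return None

    def scan(self, predicate):
        """Batch-style access: visit every row, O(n), as a full MapReduce pass would."""
        return [
            (key, value)
            for key, value in zip(self._keys, self._values)
            if predicate(key, value)
        ]


store = SortedRowStore({
    "row3": {"cf:name": "carol"},
    "row1": {"cf:name": "alice"},
    "row2": {"cf:name": "bob"},
})
print(store.get("row2"))
print(store.scan(lambda key, value: value["cf:name"].startswith("a")))
```

Fetching one row touches only a handful of keys, while the scan reads everything; that gap is what grows with table size.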

However, it is important to note that HBase and MapReduce are not mutually exclusive. You can run them together. For example, MapReduce can run against an HBase table.

Apart from these key components, there are also Pig, Flume, Sqoop, Oozie, and ZooKeeper, which are part of the Hadoop ecosystem as well.

 
