Why Do Social Media Companies Like Facebook, LinkedIn & Twitter Use Hadoop Technology?


Have you ever wondered why social media companies like Facebook, LinkedIn, and Twitter use Hadoop technology? Do you have any idea how big their clusters are? It would be interesting to compare these social media giants by the size of the Hadoop installations they run. The amount of data these companies manage indicates how heavily they have invested in this open-source framework. According to a report, the market for this framework is projected to grow to 16.1 billion dollars by 2020.

Before digging into the reasons why these companies are so keen on Hadoop technology, let’s take a look at the concept of “Big Data”.

What actually is big data?

The term “big data” refers to a dataset that keeps growing until it becomes difficult to manage with traditional database management concepts or tools. The difficulty shows up in data capture, storage, sharing, visualization, and more.

Big data spans three dimensions: Volume, Velocity, and Variety.

Volume: It refers to the size of data, which is very large (often in terabytes and petabytes).

Velocity: It refers to the speed at which data arrives. Data must be used as it streams into the enterprise to maximize its value to the business. Here, timing is very important.

Variety: It goes beyond structured data to include unstructured data of all varieties, such as audio, text, posts, video, log files, and so on.

Real-time handling of such big data differs from one platform to another. For example, Facebook gathers user click-streams from walls (the Facebook wall) by using an Ajax listener that then transmits those events back to the relevant data centers. The information is stored on the Hadoop Distributed File System (HDFS) through Scribe and gathered by PTail.

Unlike Facebook, Twitter uses Hadoop for batch processing and Storm for real-time processing. Storm was designed to carry out fairly complex processing of data as it flows through the stream, before that data is sent to the batch system for further research and analysis.

How Does Hadoop Manage the Bulk of Data from Social Networking Sites?

A client collects unstructured as well as semi-structured data from various sources, including social media feeds, log files, and internal data stores. The data is broken into “parts” that are then loaded into a file system made up of multiple nodes running on commodity hardware. The default file system is called the Hadoop Distributed File System (HDFS). It is capable of storing large volumes of unstructured and semi-structured data, so big data does not need to be arranged into relational rows and columns.
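The splitting step described above can be pictured with a small Python sketch. This is not Hadoop code: the function name split_into_parts, the sample log line, and the tiny 16-byte part size are all assumptions made for illustration (real HDFS blocks are far larger, 128 MB by default in recent versions).

```python
from typing import List


def split_into_parts(data: bytes, part_size: int) -> List[bytes]:
    """Cut raw bytes into consecutive parts of at most part_size bytes."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]


# A made-up log line standing in for unstructured input data.
log_data = b"user=42 action=like post=7; user=9 action=share post=7"

# Hypothetical part size; chosen small so the split is easy to inspect.
parts = split_into_parts(log_data, part_size=16)

for number, part in enumerate(parts):
    print(number, part)
```

Note that the data is cut purely by size, with no regard for rows or columns, which is why no relational layout is needed. Joining the parts back together in order gives the original data.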

Each “part” is replicated several times as it is loaded into the file system, so that if a node fails for any reason, another node carries a copy of the same data. A Name Node acts as a mediator, tracking information such as where particular data resides in the cluster, which nodes are available, and which ones have failed or stopped working.
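To see why replication protects against a node failure, here is a minimal Python sketch of the idea. It is a simplification and not how HDFS actually chooses nodes: the round-robin placement, the node names, and the helper functions place_replicas and live_locations are hypothetical. Only the replication factor of 3 matches the HDFS default.

```python
from typing import Dict, List, Set


def place_replicas(part_ids: List[str], nodes: List[str],
                   replication: int = 3) -> Dict[str, List[str]]:
    """Assign each part to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for index, part_id in enumerate(part_ids):
        placement[part_id] = [
            nodes[(index + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


def live_locations(placement: Dict[str, List[str]],
                   failed_nodes: Set[str]) -> Dict[str, List[str]]:
    """Name Node style lookup: which working nodes still hold each part?"""
    return {
        part_id: [node for node in holders if node not in failed_nodes]
        for part_id, holders in placement.items()
    }


# Hypothetical four-node cluster and four parts.
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_replicas(["part-0", "part-1", "part-2", "part-3"], nodes)

# Simulate one node going down and ask where the data is still available.
after_failure = live_locations(placement, failed_nodes={"node-b"})
```

Even with node-b down, every part is still held by at least two working nodes, so no data is lost.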

Once the cluster is loaded with the data, it’s ready to be examined through the MapReduce framework. The client submits a “Map” job, which is typically a query written in Java, to a node in the cluster called the “Job Tracker”. The Job Tracker refers to the Name Node to identify which data it has to access to fulfill the job and where in the cluster that data is stored. Once this is identified, the Job Tracker sends the query to the applicable nodes. Rather than fetching all the data back to a central location, processing takes place on every node at the same time. This is one of the most important features of Hadoop.
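The idea of sending the query to the data, rather than the data to the query, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cluster layout, the part contents, and the functions assign_tasks and map_part are invented for the example, and the "least busy holder" rule is a stand-in for whatever scheduling a real Job Tracker performs.

```python
from collections import Counter

# Hypothetical Name Node view: which nodes store a copy of each part.
part_locations = {
    "part-0": ["node-a", "node-b"],
    "part-1": ["node-b", "node-c"],
    "part-2": ["node-c", "node-a"],
}

# Hypothetical contents of each part (user actions from a click-stream).
part_contents = {
    "part-0": "like share like",
    "part-1": "share comment",
    "part-2": "like comment comment",
}


def assign_tasks(locations):
    """For each part, pick the least busy node that already stores it."""
    load = Counter()
    assignment = {}
    for part_id, holders in locations.items():
        chosen = min(holders, key=lambda node: load[node])
        assignment[part_id] = chosen
        load[chosen] += 1
    return assignment


def map_part(text):
    """Map function: emit a (word, 1) pair for every word in one part."""
    return [(word, 1) for word in text.split()]


assignment = assign_tasks(part_locations)

# Each node runs the map function only on a part it holds locally.
map_output = {
    part_id: (node, map_part(part_contents[part_id]))
    for part_id, node in assignment.items()
}
```

Every map task lands on a node that already holds the part, so only the small query travels over the network.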

After the nodes have finished their task, they store the results locally. The client then starts a “Reduce” job via the Job Tracker, in which the results of the map phase, held on the individual nodes, are collected to determine the “answer” to the original query and then loaded onto another node in the cluster. The client can access these results and load them into various analytics environments for research and analysis. At this point, the MapReduce job is finished.
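The whole flow can be tried out with the classic word-count example, written here as a plain Python sketch that runs on one machine. It is not the Hadoop API: the function names and sample data are assumptions, and the grouping step in the middle (known as the shuffle) is not described in the text above but is needed so that values for the same key end up together.

```python
from collections import defaultdict


def map_phase(parts):
    """Map: turn every word in every part into a (word, 1) pair."""
    pairs = []
    for text in parts:
        pairs.extend((word, 1) for word in text.split())
    return pairs


def shuffle(pairs):
    """Group the intermediate values by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: add up the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical parts of a click-stream, one string per part.
parts = ["like share like", "share comment", "like comment comment"]

result = reduce_phase(shuffle(map_phase(parts)))
print(result)
```

Here the “answer” to the query is the number of times each action occurs across all parts.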

The data is then ready for further analysis by the experts. Data scientists and other experts can manipulate and evaluate the data with various tools to uncover further insights and patterns, or to create the foundation for user-facing analytic apps. The data can also be replicated and transmitted from Hadoop clusters into data warehouses, relational databases, and other conventional IT systems for further analysis.

Social media giants are unleashing the potential of Hadoop technology. Are you?
