Data security in Big Data Hadoop

 

In this technically advanced era of Big Data, inexpensive storage devices and cost-effective processing power are readily available, and companies are collecting massive volumes of data to derive insights and make accurate decisions. While the focus is on gathering data, keeping all of it in a single location invites security threats and risks that can lead to negative publicity and, in some cases, the loss of customer confidence.

Hadoop is one of the key platforms powering Big Data implementations, which makes its data security a central concern. In this blog, we will discuss data security in Hadoop in detail, but before that let’s start with some quick facts.

Evolution of Hadoop security

When Hadoop was in its initial stage, security wasn’t a matter of concern. In almost every case, it was developed using publicly accessible data sets, and security was not as important as it is now. Now that Hadoop has become mainstream, organizations are feeding Hadoop clusters with large amounts of data from many sources, which creates potential security exposures. The Hadoop developer community has recognized that sturdier security controls are required and has decided to concentrate on data protection, so new security features are being introduced.

While using the fundamental security features that Hadoop offers is of utmost importance, companies cannot take a narrow view; they must take a holistic approach to protecting Hadoop. Hadoop security is a vast area that keeps evolving to meet the demands of a growing market.

Hadoop Big Data Security: A Three-Tier Approach

Hadoop security is multi-layered, and each layer uses a different set of security approaches and techniques.

Data Transfer & Integration Layer

This is the first security layer, and it covers the integration points between the various source systems and the Hadoop ecosystem. There are various methods for ingesting data into Hadoop and distributing it back out to the source systems. Here are the security aspects of some data transfer tools:

  • Apache Flume: It can be used for gathering, aggregating & moving large volumes of data from many sources into HDFS (Hadoop Distributed File System). When multiple users need to write data to HDFS through a Flume agent, they can create proxy users that map to a single principal user.
  • Apache Sqoop: It can be used to transfer data between relational databases and Hadoop, and it provides role-based access as well as execution restrictions with the help of ‘Admin’ & ‘Operator’ roles. It restricts actions such as the import & export of data by end users.
  • External Tools: Extract, Transform & Load (ETL) tools or customized applications can connect to Hadoop data stores, for example, HBase or Hive. These support Lightweight Directory Access Protocol (LDAP), Kerberos and custom pluggable authentication. External applications can access Hadoop by impersonating the connected user with the help of proxy privileges configured in Hadoop.
  • File Transfer: SSH File Transfer Protocol (SFTP) is the preferred option for data transmission. If an FTP server must be used, it is best to use single-user access to the FTP server or proxy user credentials with the needed permissions.
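Several of the tools above rely on proxy users, so it may help to see the idea in a few lines. In Hadoop itself, proxy users are configured in core-site.xml through properties such as hadoop.proxyuser.<name>.hosts and hadoop.proxyuser.<name>.groups. The Python sketch below only models that kind of check; the ProxyRule class, the may_impersonate function and the sample users and hosts are made up for illustration and are not part of any Hadoop API.

```python
from dataclasses import dataclass, field

WILDCARD = "*"


@dataclass
class ProxyRule:
    """Illustrative model of one proxy user's limits (not a Hadoop class)."""
    allowed_groups: set = field(default_factory=set)  # groups whose members may be impersonated
    allowed_hosts: set = field(default_factory=set)   # hosts the request may come from


def may_impersonate(rules, user_groups, proxy_user, end_user, host):
    """Return True if proxy_user may act on behalf of end_user from host."""
    rule = rules.get(proxy_user)
    if rule is None:
        # No rule configured for this service account: deny by default.
        return False
    host_ok = WILDCARD in rule.allowed_hosts or host in rule.allowed_hosts
    groups = user_groups.get(end_user, set())
    group_ok = WILDCARD in rule.allowed_groups or bool(groups & rule.allowed_groups)
    return host_ok and group_ok


# Hypothetical setup: the "flume" service account may impersonate members
# of the "analysts" group, but only from one ingest host.
rules = {
    "flume": ProxyRule(allowed_groups={"analysts"},
                       allowed_hosts={"ingest01.example.com"}),
}
user_groups = {"alice": {"analysts"}, "bob": {"interns"}}

print(may_impersonate(rules, user_groups, "flume", "alice", "ingest01.example.com"))  # True
print(may_impersonate(rules, user_groups, "flume", "bob", "ingest01.example.com"))    # False
print(may_impersonate(rules, user_groups, "flume", "alice", "laptop.example.com"))    # False
```

The point of the design is that only the service account holds real credentials, while each end user's identity is still what gets checked and audited inside the cluster.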

OS Layer - Authorization & Authentication

The Hadoop file system is akin to a POSIX (Portable Operating System Interface) file system: it allows administrators and users to apply file permissions & control read and write access. The link between the base operating system and the Hadoop cluster is yet another layer that needs security. It is really important to consider OS users, group policies and file permissions in this layer when protecting a Hadoop cluster.

To resolve OS-related issues, Hadoop should be configured to run under a user id that isn’t the root user and isn’t part of the root user group. This user acts as a super-user for the Hadoop NameNode and has the rights to start as well as stop Hadoop processes. In the ecosystem, several users, namely ‘mapred’, ‘hdfs’ and ‘yarn’, are created during installation. Usually, a common UNIX group is created to give access to these Hadoop internal users. However, for end users who want to access HDFS, it is better to use proxy users instead of granting group access. To further improve the security of the Hadoop cluster, the security features built into Hadoop must be fully utilized in addition to OS users and file permissions.
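To make the permission model concrete, here is a minimal Python sketch of a POSIX-style access check: the owner's bits apply if the user owns the file, otherwise the group's bits if the user belongs to the file's group, otherwise the bits for everyone else. The function name, the sample users and the mode are assumptions made for illustration, and the sketch leaves out details such as the super-user bypass and ACLs.

```python
READ, WRITE, EXECUTE = 4, 2, 1


def is_allowed(mode, owner, group, user, user_groups, wanted):
    """Check a POSIX-style octal mode (e.g. 0o640) for one user and one access type."""
    if user == owner:
        bits = (mode >> 6) & 7   # owner class
    elif group in user_groups:
        bits = (mode >> 3) & 7   # group class
    else:
        bits = mode & 7          # everyone else
    return (bits & wanted) == wanted


# Hypothetical file: owner 'hdfs', group 'hadoop', mode rw-r-----
mode, owner, group = 0o640, "hdfs", "hadoop"

print(is_allowed(mode, owner, group, "hdfs", {"hadoop"}, WRITE))   # True
print(is_allowed(mode, owner, group, "alice", {"hadoop"}, READ))   # True
print(is_allowed(mode, owner, group, "alice", {"hadoop"}, WRITE))  # False
print(is_allowed(mode, owner, group, "bob", {"interns"}, READ))    # False
```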

Hadoop Integral Security Layer

Hadoop offers many security control features. Further development is expected to improve security for Remote Procedure Call (RPC) connections, Hypertext Transfer Protocol (HTTP) web consoles, delegation tokens, data block control and more.
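As a rough illustration of the idea behind delegation tokens, the Python sketch below signs a small token identifier with a secret key held by the issuing service, then later checks both the signature and the expiry time. It is a simplified model under stated assumptions, not Hadoop's actual token format: the function names, the field layout and the choice of SHA-256 are all made up for this example.

```python
import hashlib
import hmac


def issue_token(master_key, owner, renewer, expires_at):
    """Build an identifier and sign it with the service's secret key."""
    token_id = f"{owner}|{renewer}|{expires_at}".encode()
    authenticator = hmac.new(master_key, token_id, hashlib.sha256).hexdigest()
    return token_id, authenticator


def verify_token(master_key, token_id, authenticator, now):
    """Accept the token only if the signature matches and it has not expired."""
    expected = hmac.new(master_key, token_id, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, authenticator):
        return False  # identifier was altered or signed with another key
    expires_at = int(token_id.decode().rsplit("|", 1)[1])
    return now < expires_at


key = b"service-secret"  # hypothetical key, known only to the issuing service
token_id, auth = issue_token(key, "alice", "yarn", expires_at=1000)

print(verify_token(key, token_id, auth, now=500))               # True
print(verify_token(key, token_id, auth, now=1500))              # False
print(verify_token(key, b"mallory|yarn|1000", auth, now=500))   # False
```

A token like this lets a running job prove it acts for a user without carrying that user's long-term credentials.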

Third-party Hadoop security solutions

Although Hadoop has many built-in security features, gaps remain. This has led vendors to come up with security solutions for Hadoop. Open source solutions include Knox Gateway, Sentry, Intel’s Project Rhino and more, while commercial solutions include IBM's InfoSphere Data Privacy for Hadoop, Dataguise for Hadoop, Zettaset Orchestrator, and Protegrity Big Data Protector.

We can expect new security solutions to be introduced with each improved version of Hadoop.

 

 

 
