How to Run Spark on Hadoop: Some Tips to Consider

Hadoop has been around for a while now, and in that relatively short time it has taken the market by storm. Almost every industry is adopting this big data solution to analyse and process huge amounts of data for better insights and profitability.

More and more customers have started running Spark on Hadoop, and with that come some common problems they may face along the way. These challenges can seem hard to tackle, but we are here to help. Through blog posts like this one we share tips and tricks for getting up and running on Spark quickly and reducing the overall time to value.

Our main focus is Spark running on Hadoop, but one question comes up first: which versions should you use? As both projects continue to evolve, it is important to run relatively recent versions of each to gain the maximum benefit.

For this blog we assume an environment running Spark 1.2.1 on Hadoop 2.4.1.

A standard Hadoop installation includes an edge or gateway node, often called a workbench. Workbenches are used to access the Hadoop command-line tools along with Spark. The Hadoop administrator maintains the software on the workbenches, including the Spark installation. We assume that Spark has already been installed on that node, and on your local laptop for debugging purposes.

Once Spark is up and running alongside Hadoop, you can launch it in any of three modes: local, yarn-client or yarn-cluster.

Local mode: This runs a single Spark shell with every Spark component inside the same JVM (Java Virtual Machine). It is well suited to debugging on your laptop or a workbench. Below is an example of how you would invoke Spark in local mode.

cd $SPARK_HOME

./bin/spark-shell
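
Once the shell comes up it already provides a SparkContext named sc, so a quick sanity check is to type a couple of Scala commands at the prompt. A minimal sketch (the numbers are arbitrary):

// Distribute a small collection across local threads and sum it.
val nums = sc.parallelize(1 to 1000)
val total = nums.reduce(_ + _)   // 500500
println("sum = " + total)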

Yarn-cluster: The Spark driver runs inside the Hadoop cluster as the YARN Application Master and spins up Spark executors inside YARN containers. This lets Spark applications run entirely within the Hadoop cluster, completely decoupled from the workbench. This mode is used only for job submission.

Here is an example:

cd $SPARK_HOME

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --num-executors 3 --driver-memory 1g --executor-memory 2g --executor-cores 1 --queue thequeue $SPARK_HOME/examples/target/spark-examples_*-1.2.1.jar

Note that in the example above, the --queue option specifies the Hadoop YARN queue to which the application is submitted.

Yarn-client: The Spark driver runs on the workbench itself, with the Application Master operating in a reduced role: it only requests resources from YARN so that the Spark executors stay inside the Hadoop cluster in YARN containers. This gives you an interactive, user-friendly environment while the actual work is distributed across the cluster.

Below is an example of invoking Spark in this mode and making sure it picks up the Hadoop LZO codec.

cd $SPARK_HOME

./bin/spark-shell --master yarn --deploy-mode client --queue research --driver-memory 512M --driver-class-path /opt/hadoop/share/hadoop/mapreduce/lib/hadoop-lzo-0.4.18-201409171947.jar
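
Once the shell is up in yarn-client mode, one way to confirm that the LZO codec is really being picked up is to read an LZO-compressed text file from HDFS. A minimal sketch, assuming the cluster's Hadoop configuration registers the LZO codec (and its native libraries) and that the path below is a hypothetical data set of yours:

// Read an LZO-compressed text file; the actual work runs on the
// YARN executors in the cluster, not on the workbench.
// The path is hypothetical -- substitute your own file.
val lines = sc.textFile("hdfs:///user/you/events.lzo")
println("line count: " + lines.count())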

As noted above, the three Spark modes serve different use cases. Local and yarn-client mode both launch a Spark shell in which you can type Scala commands and write code on the fly, which is helpful for initial, exploratory development. What differentiates these two modes is where the computing power comes from: local mode confines Spark to your laptop, while yarn-client mode takes commands from the shell but sends the actual computation to the Spark executors running on nodes within your cluster. At some point, your experimental workflow will coalesce into a Scala program.
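
From inside the shell you can see which of the two modes you are in and how much parallelism you get by default. A small sketch using the sc the shell provides (the exact master string depends on how the shell was launched):

// Typically "local[*]" in local mode and a YARN master in yarn-client mode.
println("master: " + sc.master)
// Default number of partitions for operations such as parallelize;
// in yarn-client mode this reflects the cores granted by YARN.
println("default parallelism: " + sc.defaultParallelism)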

 

You can then build that Scala program into a self-contained Spark application package and launch it on the cluster through spark-submit in yarn-cluster mode.
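
To give an idea of what such a self-contained application looks like, here is a minimal sketch in the spirit of the SparkPi example used earlier. The package, object name and sample count are illustrative assumptions, not a fixed recipe:

package com.example

import org.apache.spark.{SparkConf, SparkContext}

// A tiny self-contained Spark application that estimates Pi by sampling,
// much like the bundled SparkPi example.
object MyPi {
  def main(args: Array[String]): Unit = {
    // No master is hard-coded; spark-submit supplies it
    // (e.g. --master yarn --deploy-mode cluster).
    val conf = new SparkConf().setAppName("MyPi")
    val sc = new SparkContext(conf)

    val samples = 100000
    val inside = sc.parallelize(1 to samples).map { _ =>
      val x = math.random * 2 - 1
      val y = math.random * 2 - 1
      if (x * x + y * y <= 1) 1 else 0
    }.reduce(_ + _)

    println("Pi is roughly " + 4.0 * inside / samples)
    sc.stop()
  }
}

Packaged into a jar (for example with sbt package), it is launched the same way as the SparkPi example above, e.g. ./bin/spark-submit --class com.example.MyPi --master yarn --deploy-mode cluster path/to/your-app.jar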
