Friday, December 15, 2017

Apache Spark Cluster On Local Machines Setup

Before setting up an Apache Spark cluster in your server environment, you might want to test it by setting up a similar configuration on your local machines and playing around with it.


Spark Installation 
Download Apache Spark from the official downloads page, or install it from the command line.
  • Next, verify the Java installation with the command below:

$ java -version
If Java is installed, you should see a response similar to the one below:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
If not, then you have to install it.
  • Next, verify that Scala has been installed:

$ scala -version
That should give a response similar to the message below:
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
If not, then you have to install it.

Continuing with the Apache Spark installation:
  • Extract the tar file

$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz 
  • Move the Spark folder to the desired directory, e.g. /usr/local/spark

$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit
  • Add the following line to the ~/.bashrc file. This adds the location of the Spark binaries to the PATH variable.

export PATH=$PATH:/usr/local/spark/bin
Source the ~/.bashrc file with the following command:
$ source ~/.bashrc
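To confirm that the PATH change took effect, you can check whether the shell now finds the spark-shell binary:

```shell
# Prints the full path of spark-shell if it is on the PATH,
# or a short message if it is not
command -v spark-shell || echo "spark-shell not found on PATH"
```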
  • Verify the successful Spark installation on the desired system:

$ spark-shell
If it is successful, you should see a response similar to the one below:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc

scala>

Every system that will be part of the cluster must have Spark installed, so you have to perform the steps above on all of them.

Setting-Up The Local Cluster
In a Spark standalone cluster, one system is the master and the rest are the slaves:
  • Go to SPARK_HOME/conf and create a file with the name spark-env.sh.

There will be a spark-env.sh.template file in the same folder; it shows you how to declare the various environment variables.
  • Enter the master's IP address in the spark-env.sh file.
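As a minimal sketch, the spark-env.sh file could contain a single line like the one below (192.168.1.100 is a made-up example address; substitute your master system's actual IP). SPARK_MASTER_IP is the variable name used by the Spark 1.x standalone scripts:

```shell
# SPARK_HOME/conf/spark-env.sh
export SPARK_MASTER_IP=192.168.1.100   # example IP address of the master system
```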


  • Open the slaves file in the same folder, i.e. SPARK_HOME/conf; if there is none, create it. Note that the slaves file does not have any extension.
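For illustration, a slaves file for a cluster with two slave systems could look like this (the addresses are made-up examples):

```
# SPARK_HOME/conf/slaves -- one slave IP address (or hostname) per line
192.168.1.101
192.168.1.102
```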

These files must be saved on all systems with the same data, i.e. the master system's IP address entered in the spark-env.sh file on every system, and the IP addresses of all the slave systems entered in the slaves file on every system as well. This is very important.
  • Navigate to the sbin directory inside the Spark folder (/usr/local/spark/sbin) and enter the following command:

$ sudo ./start-all.sh
  • Enter the password on the prompt
  • Go to your browser, enter IP_ADDRESS_OF_YOUR_MASTER_SYSTEM:8080 in the URL bar, and press Enter.
The sudo command is very important, or you will get a permission-denied error message.
Note:
The ./start-all.sh command is preferred to starting the master first and then starting each of the slaves afterwards; the references below show how that route works if you prefer it. I prefer this approach, where one command can start all the systems at once and then stop them afterwards.
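For reference, that step-by-step route looks roughly like this, using the start scripts shipped in SPARK_HOME/sbin (a sketch, assuming you run it on the master system):

```
# Start the master daemon first...
$ sudo ./start-master.sh
# ...then start a worker on every system listed in the slaves file
$ sudo ./start-slaves.sh
# To shut the whole cluster down again:
$ sudo ./stop-all.sh
```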

You might get an access error after running the start-all command, but I have forgotten how I overcame it. If you do get that error, please contact me and I will help resolve it. Enjoy.



Reference
http://paxcel.net/blog/how-to-setup-apache-spark-standalone-cluster-on-multiple-machine/
https://www.tutorialspoint.com/apache_spark/apache_spark_installation.htm
