7 Steps to Set Up a Spark Cluster with 3 RPi 4s
Spark is an open-source cluster computing framework written in Scala and Java. To run Spark in cluster mode, we need a cluster. Thanks to the Raspberry Pi, we can build a decent cluster for under 200 pounds. In this blog post, I will share my experience building a Raspberry Pi cluster with Spark and testing an example in Spark cluster mode. Here are the 7 steps:
- Hardware — RPi Cluster (takes about a day, maybe…)
- Install Spark and its Dependencies (20 mins)
- Setup Password-less ssh Between Nodes (20 mins)
- Setup Master Node (3 mins)
- Setup Worker Nodes (3 mins)
- Start Cluster and Check the Cluster Web UI (1 min)
- Test Example Code and Check the Result (3 mins)
If you already have an RPi cluster and all of the nodes have Ubuntu installed, then setting up a Spark cluster takes no more than 2 hours.
0. Hardware — RPi Cluster
Here is my Raspberry Pi 4 cluster hardware setup. Yours does not have to look like this, but this setup is good for cooling without a doubt. Three of the RPis are the 4GB version and the other one is the 8GB version, and they all have Ubuntu installed.
For the Spark cluster, I used the three 4GB Raspberry Pi 4s, with each node set up as follows.
You can change the hostname by editing the file /etc/hostname
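For example, on Ubuntu you can either edit that file and reboot, or use hostnamectl. The hostnames main77, work76 and work78 used later in this post are just my choices; yours can be anything you like.
sudo hostnamectl set-hostname main77   # run on each node with its own name
# or edit the file directly and reboot afterwards
sudo vim /etc/hostname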
1. Install Spark and its Dependencies
Install Spark and its dependencies on all 3 RPi nodes.
- Install Dependencies
sudo apt install -y default-jre default-jdk scala
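Optionally, a quick sanity check that the dependencies installed correctly is to print their versions (the exact version strings will depend on your Ubuntu release):
java -version    # should report an OpenJDK runtime
scala -version   # should report the Scala version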
- Download Spark and extract it
mkdir download
cd download
wget https://dlcdn.apache.org/spark/spark-3.2.0/spark-3.2.0-bin-hadoop3.2.tgz
tar vxf spark-3.2.0-bin-hadoop3.2.tgz
sudo mv spark-3.2.0-bin-hadoop3.2 /usr/local
- Setup Spark environment
vim ~/.profile
# add following two lines to ~/.profile
export PATH="$PATH:/usr/local/spark-3.2.0-bin-hadoop3.2/bin"
export PATH="$PATH:/usr/local/spark-3.2.0-bin-hadoop3.2/sbin"
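If you want the new PATH in your current shell right away, you can also reload the profile; the reboot in the next step still makes sure every new session picks it up.
source ~/.profile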
- Reboot nodes
sudo reboot
- Test Spark by executing spark-shell
spark-shell
The output will look like the following.
Now you can check the standalone node status by going to http://<node ip address>:4040. However, port 4040 may not be open by default, so you need to open it first:
sudo ufw allow 4040
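To double-check that the firewall rule took effect, ufw can list the active rules:
sudo ufw status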
The web UI will look like the following.
2. Setup Password-less ssh Between Nodes
Spark requires password-less SSH access between the Master and the Workers. To do this, we need to create an SSH key on each of the nodes. Keep in mind, however, that in industry there would be an authentication & authorization (authn & authz) service handling this step; manually adding SSH keys is not realistic for real-world clusters.
Suppose we want node A to access node B without a password. Then, on node A:
- Generate ssh-key
ssh-keygen -o -a 100 -t ed25519 -f ~/.ssh/id_ed25519 -C "<some email address>"
# just accept the defaults by pressing Enter
- Copy ssh-key to target node B
ssh-copy-id <node B username>@<node B ip address>
# for example ssh-copy-id ubuntu@192.168.1.76
Then, on node B, we make sure the permissions on the SSH files are correct:
chmod 600 .ssh/authorized_keys
chmod 700 .ssh
Repeat the above steps on all three nodes and make sure they can access each other without a password.
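A quick way to verify the password-less setup is to run a remote command from each node against the others. Using the IP addresses and the ubuntu username from my setup (yours may differ), each command should print the remote hostname without asking for a password:
ssh ubuntu@192.168.1.76 hostname
ssh ubuntu@192.168.1.78 hostname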
3. Setup Master Node
On the Master node, we do the following steps to set up the master role in the Spark cluster.
- Specify the IP address and port of the master node in spark-env.sh. We can create it by copying spark-env.sh.template:
cd /usr/local/spark-3.2.0-bin-hadoop3.2/conf
cp spark-env.sh.template spark-env.sh
vim spark-env.sh
# add following two lines
export SPARK_MASTER_HOST="192.168.1.77" # master host IP
export SPARK_MASTER_WEBUI_PORT="9999" # master web UI port
- Add the workers' IP addresses to the workers file in the same directory
vim workers
# add following two lines - workers IP
192.168.1.76
192.168.1.78
- Set up the worker nodes' hostname-to-IP mapping
sudo vim /etc/hosts
# add following two lines
192.168.1.76 work76
192.168.1.78 work78
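Before wiring up the workers, you can optionally check the master configuration on its own by starting just the master process and opening the web UI on the port set above. Stop it again afterwards, since start-all.sh in step 5 will launch everything.
/usr/local/spark-3.2.0-bin-hadoop3.2/sbin/start-master.sh
# the master web UI should now be reachable at http://192.168.1.77:9999
/usr/local/spark-3.2.0-bin-hadoop3.2/sbin/stop-master.sh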
4. Setup Worker Nodes
On both worker nodes, we do the following steps to set up the worker role in the cluster.
- Specify the IP address of the master node in spark-env.sh. We can create it by copying spark-env.sh.template:
cd /usr/local/spark-3.2.0-bin-hadoop3.2/conf
cp spark-env.sh.template spark-env.sh
vim spark-env.sh
# add the following line
export SPARK_MASTER_HOST="192.168.1.77" # master host IP
- Specify the IP addresses of the two workers in the workers file in the same directory
vim workers
# add following two lines - workers IP
192.168.1.76
192.168.1.78
- Set up the worker nodes' hostname-to-IP mapping
sudo vim /etc/hosts
# add following two lines
192.168.1.76 work76
192.168.1.78 work78
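Similarly, a worker can optionally be tested on its own by pointing it at the master URL (assuming the master from step 3 is running and listening on the default port 7077). Stop it afterwards, since start-all.sh in the next step will launch the workers over SSH.
/usr/local/spark-3.2.0-bin-hadoop3.2/sbin/start-worker.sh spark://192.168.1.77:7077
# the worker should appear on the master web UI at http://192.168.1.77:9999
/usr/local/spark-3.2.0-bin-hadoop3.2/sbin/stop-worker.sh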
5. Start Cluster and Check the Cluster Web UI
To start the cluster, execute the following command on the master node.
/usr/local/spark-3.2.0-bin-hadoop3.2/sbin/start-all.sh
It will show the following logs
starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark-3.2.0-bin-hadoop3.2/logs/spark-ubuntu-org.apache.spark.deploy.master.Master-1-main77.out
192.168.1.78: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-3.2.0-bin-hadoop3.2/logs/spark-ubuntu-org.apache.spark.deploy.worker.Worker-1-work78.out
192.168.1.76: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark-3.2.0-bin-hadoop3.2/logs/spark-ubuntu-org.apache.spark.deploy.worker.Worker-1-work76.out
Once finished, open a browser and go to http://192.168.1.77:9999. It will show the Spark cluster info.
We can see that there are two workers: we have our RPi Spark cluster!
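At this point you can also attach an interactive spark-shell to the cluster instead of running it standalone as in step 1; it should then show up as a running application on the master web UI. (To shut the whole cluster down later, sbin/stop-all.sh is the counterpart of start-all.sh.)
spark-shell --master spark://192.168.1.77:7077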
6. Test Example Code and Check the Result
Now let’s run an example (calculating Pi) to test our cluster. Go to the master node and execute the following command.
spark-submit --master spark://192.168.1.77:7077 --deploy-mode cluster --class org.apache.spark.examples.SparkPi /usr/local/spark-3.2.0-bin-hadoop3.2/examples/jars/spark-examples_2.12-3.2.0.jar 1000
It will show logs like
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/usr/local/spark-3.2.0-bin-hadoop3.2/jars/spark-unsafe_2.12-3.2.0.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/10/31 17:14:10 INFO SecurityManager: Changing view acls to: ubuntu
21/10/31 17:14:10 INFO SecurityManager: Changing modify acls to: ubuntu
21/10/31 17:14:10 INFO SecurityManager: Changing view acls groups to:
21/10/31 17:14:10 INFO SecurityManager: Changing modify acls groups to:
21/10/31 17:14:10 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); groups with view permissions: Set(); users with modify permissions: Set(ubuntu); groups with modify permissions: Set()
21/10/31 17:14:11 INFO Utils: Successfully started service 'driverClient' on port 35401.
21/10/31 17:14:12 INFO TransportClientFactory: Successfully created connection to /192.168.1.77:7077 after 101 ms (0 ms spent in bootstraps)
21/10/31 17:14:12 INFO ClientEndpoint: ... waiting before polling master for driver state
21/10/31 17:14:12 INFO ClientEndpoint: Driver successfully submitted as driver-20211031171412-0000
21/10/31 17:14:17 INFO ClientEndpoint: State of driver-20211031171412-0000 is RUNNING
21/10/31 17:14:17 INFO ClientEndpoint: Driver running on 192.168.1.76:44169 (worker-20211031170321-192.168.1.76-44169)
21/10/31 17:14:17 INFO ClientEndpoint: spark-submit not configured to wait for completion, exiting spark-submit JVM.
21/10/31 17:14:17 INFO ShutdownHookManager: Shutdown hook called
21/10/31 17:14:17 INFO ShutdownHookManager: Deleting directory /tmp/spark-feeb4553-e220-45e6-ab92-957e9b0d1e29
You can then refresh the browser and check the job status.
The job should finish in about 60 seconds. You can also explore the job details or driver details by clicking the hyperlinks.
The result is in the driver’s stdout, such as
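If you would rather see the result directly in your terminal instead of digging into the driver’s stdout on the web UI, one option is to re-run the same example in the default client deploy mode (simply dropping the --deploy-mode cluster flag), which keeps the driver on the machine where you ran spark-submit:
spark-submit --master spark://192.168.1.77:7077 --class org.apache.spark.examples.SparkPi /usr/local/spark-3.2.0-bin-hadoop3.2/examples/jars/spark-examples_2.12-3.2.0.jar 1000
# near the end of the output you should see a line like: Pi is roughly 3.14...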
Now I have a Spark cluster running on RPis. However, I haven’t decided what to do with it yet. I think I will set up HDFS and run word count as the next step.