Running Spark on YARN with an RPi Cluster
I set up an RPi cluster (3 nodes) with Spark running on it, following this blog. In real projects, Spark is usually embedded in the Hadoop ecosystem to replace Hadoop's native MapReduce, because Spark tends to be faster: it caches and processes data in RAM instead of writing intermediate results to disk. In general, there are three ways to run Spark on the Hadoop ecosystem.
Spark is a computing engine, not a dedicated resource management platform, so in the real world running Spark on YARN is very common.
Now let's add Hadoop to the RPi cluster and configure Spark with YARN and HDFS. You can use xsync to distribute files across the workers (a minimal sketch is shown below).
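xsync is not a standard tool; it is usually a small rsync wrapper kept on your PATH. A minimal sketch, assuming passwordless SSH as the ubuntu user and two workers at 192.168.1.78 and 192.168.1.79 (placeholder addresses, substitute your own):
#!/bin/bash
# xsync: copy a file or directory to the same absolute path on every worker.
# Assumes passwordless SSH is already set up; the worker IPs below are placeholders.
TARGET=$(readlink -f "$1")
for host in 192.168.1.78 192.168.1.79; do
  rsync -av "$TARGET" "ubuntu@${host}:$(dirname "$TARGET")/"
done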
My RPi setup is as follows: one master at 192.168.1.77 and two workers. I use the master's IP address instead of its hostname to keep things simple (no DNS setup on the router required).
1. Install Hadoop on the RPis
- On each RPi node, create the HDFS folders and set their ownership:
sudo mkdir /opt/hadoop
sudo mkdir /opt/hadoop/hdfs
sudo mkdir /opt/hadoop/hdfs/namenode # for master node
sudo mkdir /opt/hadoop/hdfs/datanode # for worker node
sudo chown ubuntu -R /opt/hadoop
- Set up Hadoop on each RPi node:
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
sudo tar -xvzf hadoop-3.3.1.tar.gz -C /usr/local/
- Add Hadoop to the environment by
vim ~/.profile
export JAVA_HOME=$(readlink -f /usr/bin/javac | sed "s:/bin/javac::")
export HADOOP_HOME="/usr/local/hadoop-3.3.1"
export HADOOP_INSTALL=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH="$PATH:$HADOOP_INSTALL/bin"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
- Reboot the nodes so the new environment variables take effect
sudo reboot
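Alternatively, reload the profile in the current shell instead of rebooting:
source ~/.profile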
- Test Hadoop installation by
hadoop version
It should print something like the following in the terminal.
Hadoop 3.3.1
Source code repository https://github.com/apache/hadoop.git -r a3b9c37a397ad4188041dd80621bdeefc46885f2
Compiled by ubuntu on 2021-06-15T05:13Z
Compiled with protoc 3.7.1
From source with checksum 88a4ddb2299aca054416d6b7f81ca55
This command was run using /usr/local/hadoop-3.3.1/share/hadoop/common/hadoop-common-3.3.1.jar
2. Configure Hadoop in Cluster Mode
- On the master node only, add the workers' IP addresses to
$HADOOP_HOME/etc/hadoop/workers
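For example, assuming the two workers are at 192.168.1.78 and 192.168.1.79 (placeholder addresses, substitute your own), the workers file would contain just:
192.168.1.78
192.168.1.79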
- On all the RPi nodes, edit core-site.xml:
vim $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.1.77:9000/</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://192.168.1.77:9000/</value>
</property>
<!--
If you didn't set up DNS on your router, disable the hostname check.
Otherwise the NameNode resolves DataNodes by hostname by default,
which will cause connection failures.
-->
<property>
<name>dfs.namenode.datanode.registration.ip-hostname-check</name>
<value>false</value>
</property>
</configuration>
- Edit hdfs-site.xml:
vim $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hadoop/hdfs/datanode</value>
<final>true</final>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop/hdfs/namenode</value>
<final>true</final>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>192.168.1.77:50070</value>
</property>
<!--
The replication factor should be at most the number of DataNodes,
i.e. the number of workers (2 here).
-->
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
- Edit yarn-site.xml:
vim $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>192.168.1.77:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>192.168.1.77:8035</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>192.168.1.77:8050</value>
</property>
</configuration>
- Add JAVA_HOME to hadoop-env.sh:
vim $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
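If you prefer to edit these config files only on the master, you can push the whole Hadoop configuration directory to the workers with the xsync-style helper sketched earlier (assuming it is on your PATH):
xsync $HADOOP_HOME/etc/hadoop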
- On the master node, format the NameNode and start Hadoop:
hdfs namenode -format
$HADOOP_HOME/sbin/start-all.sh
If there are no errors, we can navigate to the Hadoop Web UI at http://192.168.1.77:50070/ and check the HDFS overview and some Hadoop settings.
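To double-check from the command line, list the running Java daemons and the registered DataNodes (jps ships with the JDK, dfsadmin with Hadoop):
jps                    # master typically shows NameNode, SecondaryNameNode, ResourceManager; workers show DataNode, NodeManager
hdfs dfsadmin -report  # should report 2 live datanodes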
3. Configure Spark
On all 3 nodes, we configure Spark to use YARN and enable the history server.
- Create a directory in HDFS to store the application history data. On the master node, use the HDFS command line:
hdfs dfs -mkdir -p /directory
- Update
/usr/local/spark-3.2.0-bin-hadoop3.2/conf/spark-defaults.conf
to use YARN as the master and enable event logging.
spark.master yarn
spark.eventLog.enabled true
spark.eventLog.dir hdfs://192.168.1.77:9000/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 512m
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
- Update
/usr/local/spark-3.2.0-bin-hadoop3.2/conf/spark-env.sh
to add the Spark history server options and the Hadoop configuration directory.
export SPARK_HISTORY_OPTS="
-Dspark.history.ui.port=18080
-Dspark.history.fs.logDirectory=hdfs://192.168.1.77:9000/directory
-Dspark.history.retainedApplications=30"
export JAVA_HOME=$(readlink -f /usr/bin/javac | sed "s:/bin/javac::")
export HADOOP_CONF_DIR="/usr/local/hadoop-3.3.1/etc/hadoop"
- Start Spark and the history server, then check the YARN Web UI
/usr/local/spark-3.2.0-bin-hadoop3.2/sbin/start-all.sh
/usr/local/spark-3.2.0-bin-hadoop3.2/sbin/start-history-server.sh
If there are no errors, navigate to http://192.168.1.77:8088/, where you can check the cluster summary and node info.
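The Spark history server UI should also be reachable at http://192.168.1.77:18080/ (the port set in spark-env.sh above). You can also confirm the workers registered with YARN from the command line:
yarn node -list        # should list 2 RUNNING NodeManagers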
4. Test
SparkPi
Now let's run SparkPi on YARN. Here we replace the --master option with yarn; the other arguments stay the same.
spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi /usr/local/spark-3.2.0-bin-hadoop3.2/examples/jars/spark-examples_2.12-3.2.0.jar 1000
While it is running (about 3–5 minutes), we can check the YARN Web UI and monitor the application's progress.
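Because we submit with --deploy-mode cluster, the driver runs inside a YARN container, so its stdout (including the Pi estimate) ends up in the application logs rather than in your local terminal. Once the job finishes, you can pull it with yarn logs (substitute the application ID printed by spark-submit or shown in the YARN UI):
yarn logs -applicationId <application_id> | grep "Pi is roughly"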
WordCount
WordCount is the classic MapReduce example (source code). Create a text file and upload it to HDFS:
hdfs dfs -mkdir -p /input
hdfs dfs -put test.txt /input
Check that the file shows up at http://192.168.1.77:50070/explorer.html#/input
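You can also verify the upload from the command line:
hdfs dfs -ls /input
hdfs dfs -cat /input/test.txt | head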
Then run the WordCount example
spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.JavaWordCount /usr/local/spark-3.2.0-bin-hadoop3.2/examples/jars/spark-examples_2.12-3.2.0.jar hdfs://192.168.1.77:9000/input/test.txt
The word count results appear in the driver's application logs (again via the YARN Web UI or yarn logs), for example
2021-11-07 21:14:31,655 INFO cluster.YarnScheduler: Killing all running tasks in stage 1: Stage finished
2021-11-07 21:14:31,663 INFO scheduler.DAGScheduler: Job 0 finished: collect at JavaWordCount.java:53, took 42.811902 s
slinging: 1
House: 2
believe.: 1
worrying: 1
works,: 1
parenting: 1
“I: 1
behind: 1
evidence: 1
end: 4
When you navigate to the job's Web UI, you can check the details of the job.
5. Fin?
Not yet. We have tried YARN, so why not try Airflow next?