Hadoop/YARN User Guide

Supported Hadoop versions: Hadoop >= 2.7 or CDH 5.X. Hadoop 3.X and CDH 6.X have not been tested and are therefore not currently supported.


You can run Analytics Zoo programs on standard Hadoop/YARN clusters without any changes to the cluster (i.e., no need to pre-install Analytics Zoo or any Python libraries in the cluster).

1. Prepare Environment

  • You need to first use conda to prepare the Python environment on the local client machine. Create a conda environment and install all the needed Python libraries in the created conda environment:

    conda create -n zoo python=3.7  # "zoo" is conda environment name, you can use any name you like.
    conda activate zoo
    
    # Use conda or pip to install all the needed Python dependencies in the created conda environment.
    
  • You need to download and install a JDK in the environment and properly set the environment variable JAVA_HOME, which is required by Spark. JDK 8 is highly recommended.

    You may take the following commands as a reference for installing OpenJDK:

    # For Ubuntu
    sudo apt-get install openjdk-8-jre
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
    
    # For CentOS
    su -c "yum install java-1.8.0-openjdk"
    export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64/jre
    
    export PATH=$PATH:$JAVA_HOME/bin
    java -version  # Verify the version of JDK.
    
  • Check the Hadoop setup and configurations of your cluster. Make sure you properly set the environment variable HADOOP_CONF_DIR, which is needed to initialize Spark on YARN:

    export HADOOP_CONF_DIR=the directory of the hadoop and yarn configurations
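
    To sanity-check this setting, you can run a short snippet like the one below (a minimal sketch; core-site.xml and yarn-site.xml are the standard Hadoop/YARN configuration files that Spark on YARN reads):

    import os

    conf_dir = os.environ.get("HADOOP_CONF_DIR")
    assert conf_dir, "HADOOP_CONF_DIR is not set"
    # Spark on YARN reads the cluster addresses from these files.
    for conf_file in ("core-site.xml", "yarn-site.xml"):
        path = os.path.join(conf_dir, conf_file)
        assert os.path.isfile(path), path + " not found"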
    
  • For CDH users

    If your CDH cluster already has Spark installed, CDH's Spark will conflict with the pyspark that pip installs for analytics-zoo in the next section.

    Therefore, before running analytics-zoo applications, you should unset all Spark-related environment variables. You can use env | grep SPARK to find the existing ones.

    Also, on a CDH cluster, HADOOP_CONF_DIR should by default be set to /etc/hadoop/conf.


2. YARN Client Mode

  • Install Analytics Zoo in the created conda environment via pip:

    pip install analytics-zoo
    

    View the Python User Guide for more details.

  • We recommend calling init_orca_context at the very beginning of your code to initialize and run Analytics Zoo on standard Hadoop/YARN clusters in YARN client mode:

    from zoo.orca import init_orca_context
    
    sc = init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2)
    

    By specifying cluster_mode to be “yarn-client”, init_orca_context automatically prepares the runtime Python environment, detects the current Hadoop configuration from HADOOP_CONF_DIR, and initializes the distributed execution engine on the underlying YARN cluster. View Orca Context for more details.

  • You can then simply run your Analytics Zoo program in a Jupyter notebook:

    jupyter notebook --notebook-dir=./ --ip=* --no-browser
    

    or as a normal Python script (e.g. script.py):

    python script.py
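
    For reference, a minimal script.py for YARN client mode might look like the sketch below (the sc.range line is only a placeholder workload; stop_orca_context, also provided by zoo.orca, releases the cluster resources when the program finishes):

    from zoo.orca import init_orca_context, stop_orca_context

    sc = init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2)
    print(sc.range(0, 1000).sum())  # placeholder: replace with your actual workload
    stop_orca_context()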
    

3. YARN Cluster Mode

Follow the steps below if you need to run Analytics Zoo in YARN cluster mode.

  • Download and extract Spark. We recommend Spark 2.4.3. Set the environment variable SPARK_HOME:

    export SPARK_HOME=the root directory where you extract the downloaded Spark package
    
  • Download and extract Analytics Zoo. Make sure the Analytics Zoo package you download is built for a Spark version compatible with yours. Set the environment variable ANALYTICS_ZOO_HOME:

    export ANALYTICS_ZOO_HOME=the root directory where you extract the downloaded Analytics Zoo package
    
  • Pack the current conda environment into environment.tar.gz (you can use any name you like; this step requires the conda-pack tool, which can be installed via pip install conda-pack):

    conda pack -o environment.tar.gz
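
    Optionally, you can verify the archive before submitting (a minimal sketch; it checks that the packed environment contains the bin/python interpreter that the spark-submit command in the last step points to):

    import tarfile

    # conda-pack archives the environment contents at the top level,
    # so the interpreter should appear as bin/python inside the tarball.
    with tarfile.open("environment.tar.gz") as tar:
        names = tar.getnames()
    assert any(n.endswith("bin/python") for n in names), "bin/python missing from archive"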
    
  • You need to write your Analytics Zoo program as a Python script. In the script, you can call init_orca_context and specify cluster_mode to be “spark-submit”:

    from zoo.orca import init_orca_context
    
    sc = init_orca_context(cluster_mode="spark-submit")
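
    The rest of the script is the same as in client mode; a minimal sketch (again, the workload line is only a placeholder):

    from zoo.orca import init_orca_context, stop_orca_context

    # In "spark-submit" mode, the cluster resources are set on the
    # spark-submit command line (next step) rather than in the Python code.
    sc = init_orca_context(cluster_mode="spark-submit")
    print(sc.range(0, 1000).sum())  # placeholder: replace with your actual workload
    stop_orca_context()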
    
  • Use spark-submit to submit your Analytics Zoo program (e.g. script.py):

    PYSPARK_PYTHON=./environment/bin/python ${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh \
        --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
        --master yarn --deploy-mode cluster \
        --executor-memory 10g \
        --driver-memory 10g \
        --executor-cores 8 \
        --num-executors 2 \
        --archives environment.tar.gz#environment \
        script.py
    

    You can adjust the configurations according to your cluster settings. Note that in cluster mode the driver runs inside the YARN cluster, so your program’s output appears in the YARN application logs (e.g. via yarn logs -applicationId <application_id>) rather than in the local console.