Hadoop/YARN User Guide

Supported Hadoop versions: Hadoop >= 2.7 or CDH 5.X. Hadoop 3.X and CDH 6.X have not been tested and are therefore not currently supported.


You can run Analytics Zoo programs on standard Hadoop/YARN clusters without any changes to the cluster (i.e., no need to pre-install Analytics Zoo or any Python libraries in the cluster).

1. Prepare Environment

  • You need to first use conda to prepare the Python environment on the local client machine. Create a conda environment and install all the needed Python libraries in the created conda environment:

    conda create -n zoo python=3.7  # "zoo" is conda environment name, you can use any name you like.
    conda activate zoo
    
    # Use conda or pip to install all the needed Python dependencies in the created conda environment.
    
  • You need to download and install a JDK in the environment and properly set the environment variable JAVA_HOME, which is required by Spark. JDK 8 is highly recommended.

    You may take the following commands as a reference for installing OpenJDK:

    # For Ubuntu
    sudo apt-get install openjdk-8-jre
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
    
    # For CentOS
    su -c "yum install java-1.8.0-openjdk"
    export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64/jre
    
    export PATH=$PATH:$JAVA_HOME/bin
    java -version  # Verify the version of JDK.
    
  • Check the Hadoop setup and configurations of your cluster. Make sure you properly set the environment variable HADOOP_CONF_DIR, which is needed to initialize Spark on YARN:

    export HADOOP_CONF_DIR=the directory of the hadoop and yarn configurations
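
    To sanity-check this setting, you can run a short snippet like the one below (a minimal sketch; core-site.xml and yarn-site.xml are the standard Hadoop/YARN configuration files that Spark on YARN reads):

    import os

    conf_dir = os.environ.get("HADOOP_CONF_DIR")
    assert conf_dir, "HADOOP_CONF_DIR is not set"
    # Spark on YARN reads the cluster addresses from these files.
    for conf_file in ("core-site.xml", "yarn-site.xml"):
        path = os.path.join(conf_dir, conf_file)
        assert os.path.isfile(path), path + " not found"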
    
  • For CDH users

    If your CDH cluster already has Spark installed, CDH's Spark will conflict with the pyspark that pip installs for analytics-zoo in the next section.

    Therefore, before running analytics-zoo applications, you should unset all Spark-related environment variables. You can use env | grep SPARK to find the existing ones.

    Also, on a CDH cluster, HADOOP_CONF_DIR should by default be set to /etc/hadoop/conf.


2. YARN Client Mode

  • Install Analytics Zoo in the created conda environment via pip:

    pip install analytics-zoo
    

    View the Python User Guide for more details.

  • We recommend calling init_orca_context at the very beginning of your code to initialize and run Analytics Zoo on standard Hadoop/YARN clusters in YARN client mode:

    from zoo.orca import init_orca_context
    
    sc = init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2)
    

    By specifying cluster_mode to be “yarn-client”, init_orca_context automatically prepares the runtime Python environment, detects the current Hadoop configuration from HADOOP_CONF_DIR, and initializes the distributed execution engine on the underlying YARN cluster. View Orca Context for more details.

  • You can then simply run your Analytics Zoo program in a Jupyter notebook:

    jupyter notebook --notebook-dir=./ --ip=* --no-browser
    

    or as a normal Python script (e.g. script.py):

    python script.py
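
    For reference, a minimal script.py for YARN client mode might look like the sketch below (the sc.range line is only a placeholder workload; stop_orca_context, also provided by zoo.orca, releases the cluster resources when the program finishes):

    from zoo.orca import init_orca_context, stop_orca_context

    sc = init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2)
    print(sc.range(0, 1000).sum())  # placeholder: replace with your actual workload
    stop_orca_context()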
    

3. YARN Cluster Mode

Follow the steps below if you need to run Analytics Zoo in YARN cluster mode.

  • Download and extract Spark. We recommend Spark 2.4.3. Set the environment variable SPARK_HOME:

    export SPARK_HOME=the root directory where you extract the downloaded Spark package
    
  • Download and extract Analytics Zoo. Make sure the Analytics Zoo package you download is built for a Spark version compatible with yours. Set the environment variable ANALYTICS_ZOO_HOME:

    export ANALYTICS_ZOO_HOME=the root directory where you extract the downloaded Analytics Zoo package
    
  • Pack the current conda environment into environment.tar.gz (you can use any name you like; this step requires the conda-pack tool, which can be installed via pip install conda-pack):

    conda pack -o environment.tar.gz
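
    Optionally, you can verify the archive before submitting (a minimal sketch; it checks that the packed environment contains the bin/python interpreter that the spark-submit command in the last step points to):

    import tarfile

    # conda-pack archives the environment contents at the top level,
    # so the interpreter should appear as bin/python inside the tarball.
    with tarfile.open("environment.tar.gz") as tar:
        names = tar.getnames()
    assert any(n.endswith("bin/python") for n in names), "bin/python missing from archive"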
    
  • You need to write your Analytics Zoo program as a Python script. In the script, you can call init_orca_context and specify cluster_mode to be “spark-submit”:

    from zoo.orca import init_orca_context
    
    sc = init_orca_context(cluster_mode="spark-submit")
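
    The rest of the script is the same as in client mode; a minimal sketch (again, the workload line is only a placeholder):

    from zoo.orca import init_orca_context, stop_orca_context

    # In "spark-submit" mode, the cluster resources are set on the
    # spark-submit command line (next step) rather than in the Python code.
    sc = init_orca_context(cluster_mode="spark-submit")
    print(sc.range(0, 1000).sum())  # placeholder: replace with your actual workload
    stop_orca_context()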
    
  • Use spark-submit to submit your Analytics Zoo program (e.g. script.py):

    PYSPARK_PYTHON=./environment/bin/python ${ANALYTICS_ZOO_HOME}/bin/spark-submit-python-with-zoo.sh \
        --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
        --master yarn --deploy-mode cluster \
        --executor-memory 10g \
        --driver-memory 10g \
        --executor-cores 8 \
        --num-executors 2 \
        --archives environment.tar.gz#environment \
        script.py
    

    You can adjust the configurations according to your cluster settings. Note that in cluster mode the driver runs inside the YARN cluster, so your program’s output appears in the YARN application logs (e.g. via yarn logs -applicationId <application_id>) rather than in the local console.