Python User Guide


1. Install

  • We recommend using conda to prepare the Python environment as follows:

    conda create -n zoo python=3.7  # "zoo" is the conda environment name; you can use any name you like.
    conda activate zoo
    
  • You need to install JDK in the environment, and properly set the environment variable JAVA_HOME. JDK8 is highly recommended.

    You may take the following commands as a reference for installing OpenJDK:

    # For Ubuntu
    sudo apt-get install openjdk-8-jre
    export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
    
    # For CentOS
    su -c "yum install java-1.8.0-openjdk"
    export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64/jre
    
    export PATH=$PATH:$JAVA_HOME/bin
    java -version  # Verify the version of JDK.
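
The same check can also be done from Python, which is handy when the shell and the Python process see different environments. The following is a minimal sketch using only the standard library; the helper name check_java is hypothetical and not part of Analytics Zoo.

```python
import os
import shutil


def check_java(env=None):
    """Return (java_home, java_path) as seen by this process.

    java_home is the value of JAVA_HOME (None if unset or empty).
    java_path is the `java` executable under JAVA_HOME/bin, or the one
    found on PATH, or None if neither exists.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME") or None
    java_path = None
    if java_home:
        candidate = os.path.join(java_home, "bin", "java")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            java_path = candidate
    if java_path is None:
        # Fall back to whatever `java` is reachable through PATH.
        java_path = shutil.which("java", path=env.get("PATH"))
    return java_home, java_path


if __name__ == "__main__":
    home, java = check_java()
    print("JAVA_HOME:", home or "<not set>")
    print("java executable:", java or "<not found>")
```

If JAVA_HOME is reported as not set, or no java executable is found, revisit the export commands above before installing Analytics Zoo.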
    

1.1 Official Release

You can install the latest release version of Analytics Zoo as follows:

pip install analytics-zoo

Note: Installing Analytics Zoo will automatically install bigdl==0.13.0, pyspark==2.4.6, conda-pack==0.3.1 and their dependencies if they haven’t been detected in your conda environment.

1.2 Nightly Build

You can install the latest nightly build of Analytics Zoo as follows:

pip install --pre --upgrade analytics-zoo

Alternatively, you can find the list of the nightly build versions here, and install a specific version as follows:

pip install analytics-zoo==version

Note: If you are using a custom Python Package Index URL, check whether the latest packages have been synced from PyPI. Alternatively, add the option -i https://pypi.python.org/simple to the pip install command to use PyPI as the index URL.


2. Run

Note: Installing Analytics Zoo from pip automatically installs pyspark. To avoid possible conflicts, we highly recommend unsetting the environment variable SPARK_HOME if it is set in your environment.
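
In a shell, unset SPARK_HOME does this. When you cannot control the shell (for example inside a notebook kernel), the variable can be removed from within Python before Analytics Zoo is imported. This is a minimal sketch; the helper name unset_spark_home is hypothetical, and the change only affects the current process and its children.

```python
import os


def unset_spark_home(env=None):
    """Remove SPARK_HOME from the environment mapping; return its old value or None."""
    env = os.environ if env is None else env
    return env.pop("SPARK_HOME", None)


previous = unset_spark_home()
if previous is not None:
    print("Removed SPARK_HOME (was %s)" % previous)

# Import Analytics Zoo only after this point, for example:
# from zoo.orca import init_orca_context
```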

2.1 Interactive Shell

You may test if the installation is successful using the interactive Python shell as follows:

  • Type python in the command line to start a REPL.

  • Try to run the example code below to verify the installation:

    import zoo
    from zoo.orca import init_orca_context
    
    print(zoo.__version__)  # Verify the version of analytics-zoo.
    sc = init_orca_context()  # Initialize Analytics Zoo on the underlying cluster.
    

2.2 Jupyter Notebook

You can start Jupyter Notebook as usual with the following command and run Analytics Zoo programs directly in a notebook:

jupyter notebook --notebook-dir=./ --ip=* --no-browser

2.3 Python Script

You can write Analytics Zoo programs directly in a Python file (e.g. script.py) and run it from the command line as a normal Python program:

python script.py
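
As an illustration, script.py could look like the sketch below. It makes several assumptions: that init_orca_context returns a SparkContext, that the arguments cluster_mode, cores and memory suit your machine (the values here are chosen only for illustration), and that the job itself is a placeholder for your own program.

```python
# script.py: a minimal Analytics Zoo program (illustrative sketch).


def main():
    try:
        from zoo.orca import init_orca_context, stop_orca_context
    except ImportError:
        print("Analytics Zoo is not installed in this environment.")
        return False

    # Local mode with small resources; adjust these values for your setup.
    sc = init_orca_context(cluster_mode="local", cores=2, memory="2g")
    try:
        # Placeholder job: assumes sc is a SparkContext.
        print(sc.parallelize(range(10)).sum())
    finally:
        # Release the resources held by the context, even if the job fails.
        stop_orca_context()
    return True


if __name__ == "__main__":
    main()
```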

3. Python Dependencies

We recommend using conda to manage your Python dependencies. Libraries installed in the current conda environment will be automatically distributed to the cluster when calling init_orca_context. You can also add extra dependencies as .py, .zip and .egg files by specifying the extra_python_lib argument in init_orca_context.
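
The sketch below shows one way to prepare a value for extra_python_lib. It assumes the argument takes a single comma-separated string of file paths; please confirm the exact format in the Orca Context documentation. The helper build_extra_python_lib and the file names are hypothetical.

```python
import os

ALLOWED_SUFFIXES = (".py", ".zip", ".egg")


def build_extra_python_lib(paths):
    """Join dependency paths into one comma-separated string.

    Raises ValueError for any file that is not a .py, .zip or .egg file.
    """
    cleaned = []
    for path in paths:
        suffix = os.path.splitext(path)[1].lower()
        if suffix not in ALLOWED_SUFFIXES:
            raise ValueError("unsupported dependency file: %r" % path)
        cleaned.append(path)
    return ",".join(cleaned)


extra = build_extra_python_lib(["utils.py", "mylib.zip"])  # hypothetical file names
print(extra)

# Assumed usage (the comma-separated format is an assumption):
# from zoo.orca import init_orca_context
# sc = init_orca_context(cluster_mode="local", extra_python_lib=extra)
```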

For more details, please refer to Orca Context.


4. Compatibility

Analytics Zoo has been tested on Python 3.6 and 3.7 with the following library versions:

pyspark==2.4.6
ray==1.2.0
tensorflow==1.15.0 or >2.0
pytorch>=1.5.0
torchvision>=0.6.0
horovod==0.19.2
mxnet>=1.6.0
bayesian-optimization==1.1.0
dask==2.14.0
h5py==2.10.0
numpy==1.18.1
opencv-python==4.2.0.34
pandas==1.0.3
Pillow==7.1.1
protobuf==3.12.0
psutil==5.7.0
py4j==0.10.7
redis==3.4.1
scikit-learn==0.22.2.post1
scipy==1.4.1
tensorboard==1.15.0
tensorboardX>=2.1
tensorflow-datasets==3.2.0
tensorflow-estimator==1.15.1
tensorflow-gan==2.0.0
tensorflow-hub==0.8.0
tensorflow-metadata==0.21.1
tensorflow-probability==0.7.0
Theano==1.0.4
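
To compare an existing environment against this list, a small script such as the following may help. It is only a sketch: the helper names are hypothetical, it checks only a few of the exact (==) pins copied from the list above, and a mismatch does not necessarily mean that Analytics Zoo will fail.

```python
import sys

TESTED_PYTHON = {(3, 6), (3, 7)}

# A few exact pins copied from the list above; extend as needed.
EXACT_PINS = {
    "pyspark": "2.4.6",
    "ray": "1.2.0",
    "numpy": "1.18.1",
    "pandas": "1.0.3",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        from importlib import metadata  # Python 3.8+
    except ImportError:
        import pkg_resources  # shipped with setuptools on Python 3.6/3.7
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {name: (expected, found)} for installed packages that differ from the pin."""
    mismatches = {}
    for name, expected in pins.items():
        found = lookup(name)
        if found is not None and found != expected:
            mismatches[name] = (expected, found)
    return mismatches


def python_is_tested(version_info=sys.version_info):
    """True if the interpreter's major.minor version is one of the tested ones."""
    return tuple(version_info[:2]) in TESTED_PYTHON


if __name__ == "__main__":
    print("Tested Python version:", python_is_tested())
    for name, (expected, found) in find_mismatches(EXACT_PINS).items():
        print("%s: tested with %s, found %s" % (name, expected, found))
```

Packages that are not installed are skipped rather than reported, since not every program needs every library in the list.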

5. Known Issues

  • If you encounter the following error when running pip install analytics-zoo:

ERROR: Could not find a version that satisfies the requirement pypandoc (from versions: none)
ERROR: No matching distribution found for pypandoc
Could not import pypandoc - required to package PySpark
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.8/site-packages/setuptools/installer.py", line 126, in fetch_build_egg
    subprocess.check_call(cmd)
  File "/root/anaconda3/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/root/anaconda3/bin/python', '-m', 'pip', '--disable-pip-version-check', 'wheel', '--no-deps', '-w', '/tmp/tmprefr87ue', '--quiet', 'pypandoc']' returned non-zero exit status 1.

This error is raised while pip installs pyspark, a dependency of Analytics Zoo, into your Python environment. You can fix it by running pip install pypandoc first and then pip install analytics-zoo.