Python User Guide
1. Install
We recommend using conda to prepare the Python environment as follows:
conda create -n zoo python=3.7  # "zoo" is the conda environment name; you can use any name you like.
conda activate zoo
You need to install a JDK in the environment and set the environment variable JAVA_HOME properly. JDK 8 is highly recommended. You may use the following commands as a reference for installing OpenJDK:
# For Ubuntu
sudo apt-get install openjdk-8-jre
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

# For CentOS
su -c "yum install java-1.8.0-openjdk"
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64/jre

export PATH=$PATH:$JAVA_HOME/bin
java -version  # Verify the version of JDK.
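If you prefer to check the Java setup from Python before starting Analytics Zoo, something like the following may help. This is a minimal sketch, assuming a Linux layout where the executable lives at $JAVA_HOME/bin/java; the helper name find_java is hypothetical and not part of Analytics Zoo.

```python
import os
import subprocess


def find_java(env=None):
    """Return the path to the java executable under JAVA_HOME, or None."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return None
    # Assumption: a Linux-style layout with the binary at $JAVA_HOME/bin/java.
    candidate = os.path.join(java_home, "bin", "java")
    # Only accept an existing, executable file.
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None


if __name__ == "__main__":
    java = find_java()
    if java is None:
        print("JAVA_HOME is unset or does not contain bin/java")
    else:
        # `java -version` writes its report to stderr, not stdout.
        result = subprocess.run(
            [java, "-version"], stderr=subprocess.PIPE, universal_newlines=True
        )
        print(result.stderr.strip())
```

The function takes an optional mapping so it can be tried against a plain dictionary instead of the real environment.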
1.1 Official Release
You can install the latest release version of Analytics Zoo as follows:
pip install analytics-zoo
Note: Installing Analytics Zoo will automatically install bigdl==0.13.0, pyspark==2.4.6, conda-pack==0.3.1 and their dependencies if they are not already present in your conda environment.
1.2 Nightly Build
You can install the latest nightly build of Analytics Zoo as follows:
pip install --pre --upgrade analytics-zoo
Alternatively, you can find the list of the nightly build versions here, and install a specific version as follows:
pip install analytics-zoo==version
Note: If you are using a custom Python Package Index URL, you may need to check whether the latest packages have been synced with PyPI. Alternatively, add the option -i https://pypi.python.org/simple to pip install to use PyPI as the index URL.
2. Run
Note: Installing Analytics Zoo from pip will automatically install pyspark. To avoid possible conflicts, we highly recommend unsetting the environment variable SPARK_HOME if it exists in your environment.
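In a shell, unset SPARK_HOME does this. If you cannot control the shell that launches your program, one option is to drop the variable from inside Python before pyspark or zoo is imported. The sketch below is only an illustration: the helper name drop_spark_home is hypothetical, and the change affects only the current Python process and the child processes it starts.

```python
import os


def drop_spark_home(env=None):
    """Remove SPARK_HOME from the mapping and return its previous value (or None)."""
    env = os.environ if env is None else env
    return env.pop("SPARK_HOME", None)


# Call this before importing pyspark or zoo in the same process.
previous = drop_spark_home()
if previous is not None:
    print("Ignoring SPARK_HOME =", previous)
```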
2.1 Interactive Shell
You may test if the installation is successful using the interactive Python shell as follows:
Type python in the command line to start a REPL. Then run the example code below to verify the installation:
import zoo
from zoo.orca import init_orca_context

print(zoo.__version__)  # Verify the version of analytics-zoo.
sc = init_orca_context()  # Initialize analytics-zoo on the underlying cluster.
2.2 Jupyter Notebook
You can start the Jupyter notebook as you normally do using the following command and run Analytics Zoo programs directly in a Jupyter notebook:
jupyter notebook --notebook-dir=./ --ip=* --no-browser
2.3 Python Script
You can write Analytics Zoo programs directly in a Python file (e.g. script.py) and run it from the command line as a normal Python program:
python script.py
3. Python Dependencies
We recommend using conda to manage your Python dependencies. Libraries installed in the current conda environment will be automatically distributed to the cluster when calling init_orca_context. You can also add extra dependencies as .py, .zip and .egg files by specifying the extra_python_lib argument in init_orca_context.
For more details, please refer to Orca Context.
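As a rough illustration of passing extra dependencies, the sketch below validates the file suffixes named above and hands the result to init_orca_context. It assumes that extra_python_lib accepts a comma-separated string of paths; that format is an assumption here, so check the Orca Context documentation before relying on it. The helper join_extra_python_lib and the file names helpers.py and mypkg.zip are hypothetical.

```python
ALLOWED_SUFFIXES = (".py", ".zip", ".egg")


def join_extra_python_lib(paths):
    """Validate file suffixes and join the paths into one comma-separated string."""
    for path in paths:
        if not path.endswith(ALLOWED_SUFFIXES):
            raise ValueError("unsupported extra dependency: " + path)
    return ",".join(paths)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only; replace with real files.
    extra = join_extra_python_lib(["helpers.py", "mypkg.zip"])
    try:
        from zoo.orca import init_orca_context, stop_orca_context
    except ImportError:
        print("analytics-zoo is not installed; would pass extra_python_lib=" + extra)
    else:
        # Assumption: extra_python_lib takes a comma-separated string of paths.
        sc = init_orca_context(extra_python_lib=extra)
        stop_orca_context()
```

Rejecting unsupported suffixes up front keeps a typo in a file name from surfacing later as an import error on the cluster.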
4. Compatibility
Analytics Zoo has been tested on Python 3.6 and 3.7 with the following library versions:
pyspark==2.4.6
ray==1.2.0
tensorflow==1.15.0 or >2.0
pytorch>=1.5.0
torchvision>=0.6.0
horovod==0.19.2
mxnet>=1.6.0
bayesian-optimization==1.1.0
dask==2.14.0
h5py==2.10.0
numpy==1.18.1
opencv-python==4.2.0.34
pandas==1.0.3
Pillow==7.1.1
protobuf==3.12.0
psutil==5.7.0
py4j==0.10.7
redis==3.4.1
scikit-learn==0.22.2.post1
scipy==1.4.1
tensorboard==1.15.0
tensorboardX>=2.1
tensorflow-datasets==3.2.0
tensorflow-estimator==1.15.1
tensorflow-gan==2.0.0
tensorflow-hub==0.8.0
tensorflow-metadata==0.21.1
tensorflow-probability==0.7.0
Theano==1.0.4
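To compare your own environment with these versions, a small script along the following lines may be useful. It is a sketch only: it covers a handful of the exact pins from the list above, skips packages that are not installed, and the helper names installed_version and find_mismatches are hypothetical.

```python
def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        from importlib.metadata import version, PackageNotFoundError  # Python 3.8+
    except ImportError:
        import pkg_resources  # fallback for Python 3.6/3.7 (part of setuptools)
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {name: (expected, found)} for installed packages that differ from their pin."""
    mismatches = {}
    for name, expected in pins.items():
        found = lookup(name)
        # Packages that are not installed are skipped in this sketch.
        if found is not None and found != expected:
            mismatches[name] = (expected, found)
    return mismatches


# A few exact pins copied from the list above.
PINS = {"pyspark": "2.4.6", "ray": "1.2.0", "numpy": "1.18.1", "pandas": "1.0.3"}

if __name__ == "__main__":
    for name, (expected, found) in sorted(find_mismatches(PINS).items()):
        print("{}: tested with {}, found {}".format(name, expected, found))
```

A mismatch does not necessarily mean something will break; it only shows where your environment differs from the tested combination.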
5. Known Issues
If you encounter the following error when running pip install analytics-zoo:
ERROR: Could not find a version that satisfies the requirement pypandoc (from versions: none)
ERROR: No matching distribution found for pypandoc
Could not import pypandoc - required to package PySpark
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.8/site-packages/setuptools/installer.py", line 126, in fetch_build_egg
    subprocess.check_call(cmd)
  File "/root/anaconda3/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/root/anaconda3/bin/python', '-m', 'pip', '--disable-pip-version-check', 'wheel', '--no-deps', '-w', '/tmp/tmprefr87ue', '--quiet', 'pypandoc']' returned non-zero exit status 1.
This error is actually raised when pip installs pyspark in your Python environment. You can fix it by running pip install pypandoc first and then pip install analytics-zoo.