Developer Guide


Analytics Zoo source code is available on GitHub:

git clone https://github.com/intel-analytics/analytics-zoo.git

By default, git clone downloads the development version of Analytics Zoo. If you want a release version, you can use git checkout to switch to the corresponding release tag or branch.
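
For example (the tag name below is only an illustration; run git tag inside the repository to see which releases actually exist):

cd analytics-zoo
git tag                # list the available release tags
git checkout v0.9.0    # switch to a release tag (illustrative version)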

1. Python

1.1 Build

To generate a new whl package for pip install, you can run the following script:

bash analytics-zoo/pyzoo/dev/build.sh linux default false

Arguments:

  • The first argument is the platform to build for. Either ‘linux’ or ‘mac’.

  • The second argument is the analytics-zoo version to build for. ‘default’ means the default version for the current branch. You can also specify a different version if you wish, e.g., ‘0.6.0.dev1’.

  • You can also pass additional Maven profiles and properties to build the package against specific Spark and BigDL versions. For example, if pyspark==2.4.3 is a dependency, add -Dspark.version=2.4.3 -Dbigdl.artifactId=bigdl-SPARK_2.4 -P spark_2.4+ to build Analytics Zoo for Spark 2.4.3, as shown in the example below.
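
Putting it together, a full build for Spark 2.4.3 might look like this (a sketch, assuming build.sh forwards these extra options to Maven as described above):

bash analytics-zoo/pyzoo/dev/build.sh linux default false -Dspark.version=2.4.3 -Dbigdl.artifactId=bigdl-SPARK_2.4 -P spark_2.4+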

After running the above command, you will find a whl file under the folder analytics-zoo/pyzoo/dist/. You can then directly pip install it to your local Python environment:

pip install analytics-zoo/pyzoo/dist/analytics_zoo-VERSION-py2.py3-none-PLATFORM_x86_64.whl

See here for more instructions on running Analytics Zoo after pip install.
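
As a quick sanity check after pip install (a minimal sketch; it simply starts a local SparkContext through Analytics Zoo and prints the Spark version):

python -c "from zoo.common.nncontext import init_nncontext; print(init_nncontext().version)"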

1.2 IDE Setup

Any IDE that supports Python should be able to run Analytics Zoo. PyCharm works well for us.

You need to complete the following preparations before starting the IDE in order to successfully run an Analytics Zoo Python program in it:

  • Build Analytics Zoo; see here for more instructions.

  • Prepare the Spark environment by either setting the SPARK_HOME environment variable or running pip install pyspark. Note that the Spark version should match the one you build Analytics Zoo against.
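
For example (the version and path below are placeholders; match them to the Spark version you built against):

pip install pyspark==2.4.3
# or point SPARK_HOME at an existing Spark installation
export SPARK_HOME=/path/to/spark-2.4.3-bin-hadoop2.7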

  • Set BIGDL_CLASSPATH:

export BIGDL_CLASSPATH=analytics-zoo/dist/lib/analytics-zoo-*-jar-with-dependencies.jar

  • Prepare the BigDL Python environment by either downloading the BigDL source code from GitHub or running pip install bigdl. Note that the BigDL version should match the one you build Analytics Zoo against.

  • Add pyzoo and spark-analytics-zoo.conf to PYTHONPATH:

export PYTHONPATH=analytics-zoo/pyzoo:analytics-zoo/dist/conf/spark-analytics-zoo.conf:$PYTHONPATH

If you download BigDL from GitHub, you also need to add BigDL/pyspark to PYTHONPATH:

export PYTHONPATH=BigDL/pyspark:$PYTHONPATH

The above environment variables should be available when running or debugging code in the IDE.
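
As a quick check outside the IDE (a sketch; it assumes the Spark and BigDL preparation above is also in place and that the paths point at your actual checkout), you can confirm that the zoo package resolves from PYTHONPATH:

python -c "import zoo; print(zoo.__file__)"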

  • In PyCharm, go to Run -> Edit Configurations. In the “Run/Debug Configurations” panel, you can set the above environment variables for your run configuration.

2. Scala

2.1 Build

Maven 3 is needed to build Analytics Zoo; you can download it from the Maven website.

After installing Maven 3, please set the environment variable MAVEN_OPTS as follows:

$ export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"

Build using make-dist.sh

It is highly recommended that you build Analytics Zoo using the make-dist.sh script with Java 8.

You can build Analytics Zoo with the following command:

$ bash make-dist.sh

After that, you can find a dist folder, which contains all the files needed to run an Analytics Zoo program. The files in dist include:

  • dist/lib/analytics-zoo-VERSION-jar-with-dependencies.jar: This jar package contains all dependencies except Spark classes.

  • dist/lib/analytics-zoo-VERSION-python-api.zip: This zip package contains all Python files of Analytics Zoo.

The instructions above build Analytics Zoo with Spark 2.4.3. To build against another Spark version, for example Spark 2.2.0, you can run:

$ bash make-dist.sh -Dspark.version=2.2.0 -Dbigdl.artifactId=bigdl-SPARK_2.2

Build with JDK 11

Spark supports JDK 11 and Scala 2.12 starting from Spark 3.0. You can use -P spark_3.x to build against Spark 3 and Scala 2.12. Additionally, make-dist.sh uses Java 8 by default; to compile with Java 11, you need to specify the build options -Djava.version=11 -Djavac.version=11.

It is recommended to download Oracle JDK 11, which avoids possible incompatibilities with Maven plugins. If you are building from the command line, update PATH and make sure the JAVA_HOME environment variable points to Java 11. If you are building from an IDE, make sure it is configured to run Maven with your current JDK.
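
For example, on the command line (the JDK installation path below is a placeholder; adjust it to where JDK 11 is installed on your system):

$ export JAVA_HOME=/path/to/jdk-11
$ export PATH=$JAVA_HOME/bin:$PATH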

Build with make-dist.sh:

$ bash make-dist.sh -P spark_3.x -Djava.version=11 -Djavac.version=11

2.2 IDE Setup

Analytics Zoo uses Maven to organize the project. You should choose an IDE that supports Maven projects and the Scala language. IntelliJ IDEA works well for us.

In IntelliJ, you can open the Analytics Zoo project root directly, and the IDE will import the project automatically.

We set the scope of Spark-related libraries to provided in the Maven pom.xml. This, however, causes a problem in the IDE: a NoClassDefFoundError is thrown when you run applications. You can easily change the scopes using the all-in-one profile.

  • In IntelliJ, go to View -> Tool Windows -> Maven Projects. Then in the Maven Projects panel, under Profiles, check “all-in-one”.