Chronos User Guide

1. Overview

Chronos is an application framework for building large-scale time series analysis applications.

You can use Chronos to:

  • Forecast time series, either with the AutoTS pipeline (AutoML) or with standalone Forecasters (Section 4)

  • Detect anomalies in time series (Section 5)

  • Process time series data and generate features (Section 6)

2. Install

Install analytics-zoo with the [ray] target, plus some additional dependencies for Chronos.

conda create -n my_env python=3.7
conda activate my_env
pip install analytics-zoo[ray]==0.11.0

Some dependencies are needed by different components in Chronos. Please install them: tensorflow>=1.15.0,<2.0.0, h5py==2.10.0, ray[tune]==1.2.0, pandas, scikit-learn>=0.20.0,<=0.22.0, requests, tsfresh, torch==1.8.1.
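
For example, they can be installed in a single pip command (quoting the specs that contain < or > so the shell does not treat them as redirections):

pip install "tensorflow>=1.15.0,<2.0.0" h5py==2.10.0 "ray[tune]==1.2.0" pandas "scikit-learn>=0.20.0,<=0.22.0" requests tsfresh torch==1.8.1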

3. Initialization

Chronos uses Orca to enable distributed training and AutoML capabilities. Initialize Orca as below, and view Orca Context for more details. Note that the argument init_ray_on_spark must be True for Chronos.

from zoo.orca import init_orca_context

if args.cluster_mode == "local":
    init_orca_context(cluster_mode="local", cores=4, init_ray_on_spark=True)  # run in local mode
elif args.cluster_mode == "k8s":
    init_orca_context(cluster_mode="k8s", num_nodes=2, cores=2, init_ray_on_spark=True)  # run on K8s cluster
elif args.cluster_mode == "yarn":
    init_orca_context(cluster_mode="yarn-client", num_nodes=2, cores=2, init_ray_on_spark=True)  # run on Hadoop YARN cluster

View Quick Start for a more detailed example.


4. Forecasting

Time series forecasting uses historical data to predict future values. There are two ways to do forecasting:

  • Use AutoTS pipeline

  • Use Standalone Forecaster pipeline

4.1 Use AutoTS Pipeline (with AutoML)

You can use the AutoTS package to build a time series forecasting pipeline with AutoML.

The general workflow has two steps:

  • Create an AutoTSTrainer and train it on the input data to produce a TSPipeline (sections 4.1.2 and 4.1.3).

  • Use the TSPipeline for prediction, evaluation, and incremental fitting (section 4.1.4).

View AutoTS notebook example for more details.

4.1.1 Prepare input data

You should prepare the training dataset and an optional validation dataset. Both training and validation data need to be provided as pandas DataFrames. The DataFrame should have at least two columns:

  • The datetime column, which should be in pandas datetime format (you can use pandas.to_datetime to convert a string into a datetime)

  • The target column, which contains the data points at the associated timestamps; these data points will be used to predict future data points.

You may have other input columns for each row as extra features, so the final input data could look like the example below.

datetime    target  extra_feature_1  extra_feature_2
2019-06-06  1.2     1                2
2019-06-07  2.30    2                1
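
As a minimal sketch, a DataFrame matching the example above could be built like this (values are illustrative):

import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime(["2019-06-06", "2019-06-07"]),  # datetime column
    "target": [1.2, 2.3],                                      # data points to forecast
    "extra_feature_1": [1, 2],                                 # optional extra features
    "extra_feature_2": [2, 1],
})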

4.1.2 Create AutoTSTrainer

You can create an AutoTSTrainer as follows (dt_col is the datetime column, target_col is the target column, and extra_features_col specifies the extra feature columns):

from zoo.chronos.autots.forecast import AutoTSTrainer

trainer = AutoTSTrainer(dt_col="datetime",
                        target_col="target",
                        horizon=1,
                        extra_features_col=["extra_feature_1", "extra_feature_2"])

View AutoTSTrainer API Doc for more details.

4.1.3 Train AutoTS pipeline

You can then train on the input data using AutoTSTrainer.fit with AutoML as follows:

from zoo.chronos.config.recipe import SmokeRecipe  # recipe import path; may vary across versions

ts_pipeline = trainer.fit(train_df, validation_df, recipe=SmokeRecipe())

The recipe configures the search space for auto-tuning. View the Recipe API docs for available recipes. After training, fit returns a TSPipeline, which includes not only the model but also the data preprocessing/postprocessing steps.

Appropriate hyperparameters are automatically selected for the models and data processing steps in the pipeline during fitting, and you may use the built-in visualization tool to inspect the training results after training stops.

4.1.4 Use TSPipeline

Use TSPipeline.predict|evaluate|fit for prediction, evaluation, or (incremental) fitting. Note: incremental fitting on TSPipeline just updates the model weights in the standard way; it does not involve AutoML.

ts_pipeline.predict(test_df)
ts_pipeline.evaluate(val_df)
ts_pipeline.fit(new_train_df, new_val_df, epochs=10)

Use TSPipeline.save|load to load or save.

from zoo.chronos.autots.forecast import TSPipeline
loaded_ppl = TSPipeline.load(file)
loaded_ppl.save(another_file)

View TSPipeline API Doc for more details.

Note: init_orca_context is not needed if you just use the trained TSPipeline for inference, evaluation or incremental fitting.


4.2 Use Standalone Forecaster Pipeline

Chronos provides a set of standalone time series forecasters without AutoML support, including deep learning models as well as traditional statistical models.

View the example notebooks for Network Traffic Prediction.

The common process of using a Forecaster is as follows.

f = Forecaster(...)  # e.g. LSTMForecaster, TCNForecaster, ...
f.fit(...)           # train on historical data
f.predict(...)       # forecast future values

Refer to API docs of each Forecaster for detailed usage instructions and examples.

4.2.1 LSTMForecaster

LSTMForecaster wraps a vanilla LSTM model, and is suitable for univariate time series forecasting.

View Network Traffic Prediction notebook and LSTMForecaster API Doc for more details.

4.2.2 Seq2SeqForecaster

Seq2SeqForecaster wraps a sequence-to-sequence model based on LSTM, and is suitable for multivariate & multistep time series forecasting.

View Seq2SeqForecaster API Doc for more details.

4.2.3 TCNForecaster

TCNForecaster wraps a Temporal Convolutional Network (TCN), a neural network that uses a convolutional architecture rather than recurrent layers. It supports multi-step and multivariate forecasting. Causal convolutions enable large-scale parallel computation, giving TCN lower inference time than RNN-based models such as LSTM.

View Network Traffic multivariate multistep Prediction notebook and TCNForecaster API Doc for more details.

4.2.4 MTNetForecaster

MTNetForecaster wraps an MTNet model. The model architecture mostly follows the MTNet paper with slight modifications, and is suitable for multivariate time series forecasting.

View Network Traffic Prediction notebook and MTNetForecaster API Doc for more details.

4.2.5 TCMFForecaster

TCMFForecaster wraps a model architecture that follows the implementation of the DeepGLO paper with slight modifications. It is especially suitable for extremely high-dimensional (up to millions of time series) multivariate forecasting.

View High-dimensional Electricity Data Forecasting example and TCMFForecaster API Doc for more details.

4.2.6 ARIMAForecaster

ARIMAForecaster wraps an ARIMA model and is suitable for univariate time series forecasting. It works best with data that show evidence of non-stationarity in the mean, where an initial differencing step (corresponding to the “I”, or integrated, part of the model) can be applied one or more times to eliminate the non-stationarity of the mean.

View ARIMAForecaster API Doc for more details.

4.2.7 ProphetForecaster

ProphetForecaster wraps the Prophet model (site), an additive model in which non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects; it is suitable for univariate time series forecasting. It works best with time series that have strong seasonal effects and several seasons of historical data, is robust to missing data and shifts in the trend, and typically handles outliers well.

View Stock Prediction notebook and ProphetForecaster API Doc for more details.

5. Anomaly Detection

Anomaly Detection detects abnormal samples in a given time series. Chronos provides a set of unsupervised anomaly detectors.

View the example notebooks for Datacenter AIOps.
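
The detectors share a common usage pattern, similar to the forecasters. A minimal sketch (method names follow the AEDetector/DBScanDetector style; see each detector's API doc for the exact signature):

ad = Detector()                         # e.g. AEDetector, DBScanDetector
ad.fit(y)                               # y: 1-D numpy array of time series values
anomaly_indexes = ad.anomaly_indexes()  # indexes of the detected anomalous samples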

5.1 ThresholdDetector

ThresholdDetector detects anomalies based on a threshold. It can be used to detect anomalies in a given time series (notebook), or together with Forecasters (Section 4) to detect anomalies in newly arriving samples (notebook).

View ThresholdDetector API Doc for more details.

5.2 AEDetector

AEDetector detects anomalies based on the reconstruction error of an autoencoder network.

View anomaly detection notebook and AEDetector API Doc for more details.

5.3 DBScanDetector

DBScanDetector uses the DBSCAN clustering algorithm for anomaly detection.

View anomaly detection notebook and DBScanDetector API Doc for more details.

6. Data Processing and Feature Engineering

Time series data is a special kind of data with its own specific operations. Chronos provides TSDataset as a time series dataset abstraction for data processing (e.g. impute, deduplicate, resample, scale/unscale, roll sampling) and automatic feature engineering (e.g. datetime features, aggregation features). Cascaded (chained) calls are supported for most of the methods. A TSDataset can be initialized from a pandas DataFrame and used directly in AutoTSEstimator. It can also be converted to a pandas DataFrame or numpy ndarray for Forecasters and Anomaly Detectors.

TSDataset is designed for general time series processing, while providing many specific operations for the convenience of different tasks (e.g. forecasting, anomaly detection).

6.1 Basic concepts

A time series can be interpreted as a sequence of real values ordered by timestamp, while a time series dataset can be a combination of one or many time series. A dataset may contain multiple time series since users may collect different time series over the same or different periods of time (e.g. an AIOps dataset may have CPU usage ratio and memory usage ratio data for two servers over a period of time; this dataset contains four time series).

In TSDataset, we provide two possible dimensions for constructing a higher-dimensional time series dataset: the feature dimension and the id dimension.

  • feature dimension: Time series along this dimension might be independent or related. Even when they are related, they are assumed to have different patterns and distributions, and to be collected over the same period of time. For example, the CPU usage ratio and memory usage ratio of the same server over a period of time.

  • id dimension: Time series along this dimension are assumed to have the same patterns and distributions, and might be collected over the same or different periods of time. For example, the CPU usage ratio of two servers over a period of time.

All the preprocessing operations are applied to each independent time series (i.e. on both the feature dimension and the id dimension), while feature scaling is only carried out along the feature dimension.

6.2 Create a TSDataset

Currently TSDataset only supports initialization from a pandas DataFrame through TSDataset.from_pandas. A typical valid time series DataFrame df is shown below.

You can initialize a TSDataset simply by:

# Server id  Datetime         CPU usage   Mem usage
# 0          08:39 2021/7/9   93          24            
# 0          08:40 2021/7/9   91          24              
# 0          08:41 2021/7/9   93          25              
# 0          ...              ...         ...
# 1          08:39 2021/7/9   73          79            
# 1          08:40 2021/7/9   72          80              
# 1          08:41 2021/7/9   79          80              
# 1          ...              ...         ...
from zoo.chronos.data import TSDataset

tsdata = TSDataset.from_pandas(df,
                               dt_col="Datetime",
                               id_col="Server id",
                               target_col=["CPU usage",
                                           "Mem usage"])

target_col is a list of all elements along the feature dimension, while id_col is the identifier that distinguishes the id dimension. dt_col is the datetime column. For extra_feature_col (not shown in this case), you should list those features that you are not interested in for your task (e.g. columns on which you will not perform forecasting or anomaly detection).

If you are building a prototype for your forecasting/anomaly detection task and need to split your dataset into train/valid/test sets, you can use the with_split parameter. TSDataset supports splitting by ratio through val_ratio and test_ratio.
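
A minimal sketch of such a split (the ratios here are illustrative):

tsdata_train, tsdata_valid, tsdata_test = TSDataset.from_pandas(df,
                                                                dt_col="Datetime",
                                                                id_col="Server id",
                                                                target_col=["CPU usage", "Mem usage"],
                                                                with_split=True,
                                                                val_ratio=0.1,
                                                                test_ratio=0.1)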

6.3 Time series dataset preprocessing

TSDataset now supports impute, deduplicate and resample. You may fill missing points with impute (in different modes), remove records that are exactly identical with deduplicate, and change the sampling frequency with resample. A typical cascaded call for preprocessing is:

tsdata.deduplicate().resample(interval="2s").impute()

6.4 Feature scaling

Scaling all features to one distribution is important, especially when we want to train a machine learning/deep learning model. TSDataset supports all the scalers in sklearn through the scale and unscale methods. Since a scaler should not be fit on the validation and test sets, a typical call for scaling operations is:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# scale
for tsdata in [tsdata_train, tsdata_valid, tsdata_test]:
    tsdata.scale(scaler, fit=tsdata is tsdata_train)
# unscale
for tsdata in [tsdata_train, tsdata_valid, tsdata_test]:
    tsdata.unscale()

unscale_numpy is specially designed for forecasters. Users may unscale the output of a forecaster with this method. A typical call is:

x, y = tsdata_test.scale(scaler)\
                  .roll(lookback=..., horizon=...)\
                  .to_numpy()
yhat = forecaster.predict(x)
unscaled_yhat = tsdata_test.unscale_numpy(yhat)
unscaled_y = tsdata_test.unscale_numpy(y)
# calculate metric by unscaled_yhat and unscaled_y

6.5 Feature generation

Besides the historical target data and any extra features provided by users, some additional features can be generated automatically by TSDataset. gen_dt_feature helps users generate 10 datetime-related features (e.g. MONTH, WEEKDAY, …). gen_global_feature and gen_rolling_feature are powered by tsfresh and generate aggregated features (e.g. min, max, …) for each time series or for rolling windows, respectively.
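
For example, datetime features can be added with a single cascaded call (the tsfresh-based methods take additional configuration; see the API docs):

tsdata.gen_dt_feature()  # appends MONTH, WEEKDAY, ... feature columns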

6.6 Roll sampling and other transformation

Roll sampling (or sliding window sampling) is useful when you want to train a supervised deep learning forecasting model. Please refer to the roll API doc for detailed behavior. A typical call of roll is as follows:

# forecaster
x, y = tsdata.roll(lookback=..., horizon=...).to_numpy()
forecaster.fit(x, y)

# anomaly detector on "target" col
x = tsdata.to_pandas()["target"].to_numpy()
anomaly_detector.fit(x)

View TSDataset API Doc for more details.