TSDataset

chronos.data.tsdataset

Time series data is a special data formulation with specific operations. TSDataset is an abstraction of a time series dataset, which provides various data processing operations (e.g. impute, deduplicate, resample, scale/unscale, roll) and feature engineering methods (e.g. datetime features, aggregation features). Cascade calls are supported for most of the methods. A TSDataset can be initialized from a pandas dataframe and converted to a pandas dataframe or numpy ndarray.

class zoo.chronos.data.tsdataset.TSDataset(data, **schema)[source]

Bases: object

TSDataset is an abstraction of a time series dataset. Cascade calls are supported for most of the transform methods.

static from_pandas(df, dt_col, target_col, id_col=None, extra_feature_col=None, with_split=False, val_ratio=0, test_ratio=0.1, largest_look_back=0, largest_horizon=1)[source]

Initialize tsdataset(s) from a pandas dataframe.

Parameters
  • df – a pandas dataframe for your raw time series data.

  • dt_col – a str that indicates the col name of the datetime column in the input dataframe.

  • target_col – a str or list that indicates the col name(s) of the target column(s) in the input dataframe.

  • id_col – (optional) a str that indicates the col name of the dataframe id. If it is not explicitly stated, the data is interpreted as containing only a single id.

  • extra_feature_col – (optional) a str or list that indicates the col name(s) of the extra feature columns that are needed to predict the target column.

  • with_split – (optional) bool, states whether the dataframe needs to be split into train, validation and test sets. The value defaults to False.

  • val_ratio – (optional) float, validation ratio. Only effective when with_split is set to True. The value defaults to 0.

  • test_ratio – (optional) float, test ratio. Only effective when with_split is set to True. The value defaults to 0.1.

  • largest_look_back – (optional) int, the largest length to look back. Only effective when with_split is set to True. The value defaults to 0.

  • largest_horizon – (optional) int, the largest number of steps to look forward. Only effective when with_split is set to True. The value defaults to 1.

Returns

a TSDataset instance when with_split is set to False, or three TSDataset instances (train, validation and test) when with_split is set to True.

Create a tsdataset instance by:

>>> # Here is a df example:
>>> # id        datetime      value   "extra feature 1"   "extra feature 2"
>>> # 00        2019-01-01    1.9     1                   2
>>> # 01        2019-01-01    2.3     0                   9
>>> # 00        2019-01-02    2.4     3                   4
>>> # 01        2019-01-02    2.6     0                   2
>>> tsdataset = TSDataset.from_pandas(df, dt_col="datetime",
>>>                                   target_col="value", id_col="id",
>>>                                   extra_feature_col=["extra feature 1",
>>>                                                      "extra feature 2"])

impute(mode='last', const_num=0)[source]

Impute the tsdataset by filling missing values in each univariate time series distinguished by id_col and feature_col.

Parameters
  • mode

    imputation mode, selected from “last”, “const” or “linear”.

    “last”: impute by propagating the last non-N/A value to the following N/A values. If there is no non-N/A value ahead, 0 is filled instead.

    “const”: impute with a constant value provided by the user.

    “linear”: impute by linear interpolation.

  • const_num – the constant value to fill with, only effective when mode is set to “const”.

Returns

the tsdataset instance.
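
For example, a minimal sketch assuming the tsdataset above contains N/A values:

>>> # fill N/A values by propagating the last valid value
>>> tsdataset.impute(mode="last")
>>> # or fill every N/A with a user-provided constant, e.g. 0.0
>>> tsdataset.impute(mode="const", const_num=0.0)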

deduplicate()[source]

Remove duplicated records which have exactly the same values in every feature_col for each multivariate time series distinguished by id_col.

Returns

the tsdataset instance.
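
Since cascade calls are supported, deduplicate() can be chained with other transforms. A sketch, assuming the tsdataset above:

>>> tsdataset.deduplicate().impute(mode="last")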

resample(interval, start_time=None, end_time=None, merge_mode='mean')[source]

Resample on a new interval for each univariate time series distinguished by id_col and feature_col.

Parameters
  • interval – a pandas offset alias (e.g. “2D”), indicating the time interval of the output dataframe.

  • start_time – the start time of the output dataframe.

  • end_time – the end time of the output dataframe.

  • merge_mode – the mode used to merge values when the current interval is smaller than the output interval. “max”, “min”, “mean” and “sum” are supported for now.

Returns

the tsdataset instance.
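
For example, a hypothetical call that resamples the data above to a 2-day interval, averaging the values that fall into the same interval (“2D” is a pandas offset alias):

>>> tsdataset.resample(interval="2D", merge_mode="mean")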

gen_dt_feature()[source]

Generate datetime features for each row. Currently the following features are generated:

  • “MINUTE”: the minute of the time stamp.

  • “DAY”: the day of the time stamp.

  • “DAYOFYEAR”: the ordinal day of the year of the time stamp.

  • “HOUR”: the hour of the time stamp.

  • “WEEKDAY”: the day of the week of the time stamp, Monday=0, Sunday=6.

  • “WEEKOFYEAR”: the ordinal week of the year of the time stamp.

  • “MONTH”: the month of the time stamp.

  • “IS_AWAKE”: bool value indicating whether the time stamp belongs to awake hours, True for hours between 6 A.M. and 1 A.M.

  • “IS_BUSY_HOURS”: bool value indicating whether the time stamp belongs to busy hours, True for hours between 7 A.M. and 10 A.M. and between 4 P.M. and 8 P.M.

  • “IS_WEEKEND”: bool value indicating whether the time stamp belongs to a weekend, True for Saturdays and Sundays.

Returns

the tsdataset instance.
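
For example, a sketch assuming the tsdataset above; the generated columns are appended to the internal dataframe:

>>> tsdataset.gen_dt_feature()
>>> df = tsdataset.to_pandas()  # df now contains extra columns such as "DAY" and "WEEKDAY"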

gen_global_feature(settings='comprehensive', full_settings=None)[source]

Generate per-time-series features for each time series. This method is implemented based on tsfresh.

TODO: relationship with scale should be figured out.

Parameters
  • settings – str or dict. If a string is set, it must be one of “comprehensive”, “minimal” and “efficient”. If a dict is set, it should follow the instructions for default_fc_parameters in tsfresh. The value defaults to “comprehensive”.

  • full_settings – dict. It should follow the instructions for kind_to_fc_parameters in tsfresh. The value defaults to None.

Returns

the tsdataset instance.
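
A minimal sketch, assuming tsfresh is installed; “minimal” is one of the supported setting strings:

>>> tsdataset.gen_global_feature(settings="minimal")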

gen_rolling_feature(window_size, settings='comprehensive', full_settings=None)[source]

Generate aggregation features for each sample. This method is implemented based on tsfresh.

TODO: relationship with scale should be figured out.

Parameters
  • window_size – int, the rolling window size; features are generated according to the rolling result.

  • settings – str or dict. If a string is set, it must be one of “comprehensive”, “minimal” and “efficient”. If a dict is set, it should follow the instructions for default_fc_parameters in tsfresh. The value defaults to “comprehensive”.

  • full_settings – dict. It should follow the instructions for kind_to_fc_parameters in tsfresh. The value defaults to None.

Returns

the tsdataset instance.
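
A minimal sketch; window_size=5 is an illustrative value:

>>> tsdataset.gen_rolling_feature(window_size=5, settings="minimal")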

roll(lookback, horizon, feature_col=None, target_col=None, id_sensitive=False)[source]

Sampling by rolling for machine learning/deep learning models.

Parameters
  • lookback – int, lookback value.

  • horizon – int or list. If horizon is an int, we will sample horizon steps continuously after the forecasting point. If horizon is a list, we will sample discretely according to the input list. In particular, when horizon is set to 0, ground truth will be generated as None.

  • feature_col – str or list, indicates the feature col name(s). Defaults to None, in which case all available features are taken in rolling.

  • target_col – str or list, indicates the target col name(s). Defaults to None, in which case all targets are taken in rolling. It should be a subset of the target_col you used to initialize the tsdataset.

  • id_sensitive

    bool. If id_sensitive is False, we will roll on each id’s sub-dataframe and fuse the samplings. The shapes of the rolling results will be x: (num_sample, lookback, num_feature_col + num_target_col) and y: (num_sample, horizon, num_target_col), where num_sample is the sum of the sample numbers over all sub-dataframes.

    If id_sensitive is True, we will roll on the wide dataframe whose columns are the cartesian product of id_col and feature_col. The shapes of the rolling results will be x: (num_sample, lookback, new_num_feature_col + new_num_target_col) and y: (num_sample, horizon, new_num_target_col), where num_sample is the sample number of the wide dataframe, new_num_feature_col is the number of ids times the number of feature_col, and new_num_target_col is the number of ids times the number of target_col.

Returns

the tsdataset instance.

roll() can be called by:

>>> # Here is a df example:
>>> # id        datetime      value   "extra feature 1"   "extra feature 2"
>>> # 00        2019-01-01    1.9     1                   2
>>> # 01        2019-01-01    2.3     0                   9
>>> # 00        2019-01-02    2.4     3                   4
>>> # 01        2019-01-02    2.6     0                   2
>>> tsdataset = TSDataset.from_pandas(df, dt_col="datetime",
>>>                                   target_col="value", id_col="id",
>>>                                   extra_feature_col=["extra feature 1",
>>>                                                      "extra feature 2"])
>>> horizon, lookback = 1, 1
>>> tsdataset.roll(lookback=lookback, horizon=horizon, id_sensitive=False)
>>> x, y = tsdataset.to_numpy()
>>> print(x, y) # x = [[[1.9, 1, 2 ]], [[2.3, 0, 9 ]]] y = [[[ 2.4 ]], [[ 2.6 ]]]
>>> print(x.shape, y.shape) # x.shape = (2, 1, 3) y.shape = (2, 1, 1)
>>> tsdataset.roll(lookback=lookback, horizon=horizon, id_sensitive=True)
>>> x, y = tsdataset.to_numpy()
>>> print(x, y) # x = [[[ 1.9, 2.3, 1, 2, 0, 9 ]]] y = [[[ 2.4, 2.6]]]
>>> print(x.shape, y.shape) # x.shape = (1, 1, 6) y.shape = (1, 1, 2)

to_numpy()[source]

Export the rolling result in the form of a tuple of numpy ndarrays (x, y).

Returns

a tuple of two items. Each item is a 3-dim numpy ndarray, cast to float64.
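
Note that roll() should be called before to_numpy(). A sketch with illustrative lookback/horizon values:

>>> tsdataset.roll(lookback=3, horizon=1)
>>> x, y = tsdataset.to_numpy()
>>> # x.shape = (num_sample, 3, num_feature_col + num_target_col)
>>> # y.shape = (num_sample, 1, num_target_col)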

to_pandas()[source]

Export the pandas dataframe.

Returns

the internal dataframe.
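
For example:

>>> df = tsdataset.to_pandas()
>>> df.head()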

scale(scaler, fit=True)[source]

Scale the time series dataset’s feature column and target column.

Parameters
  • scaler – a sklearn scaler instance; StandardScaler, MaxAbsScaler, MinMaxScaler and RobustScaler are supported.

  • fit – whether to fit the scaler. Typically, the value should be set to True for the training set and False for the validation and test sets. The value defaults to True.

Returns

the tsdataset instance.

Assume there is a training set tsdata and a test set tsdata_test. scale() should first be called on the training set with the default fit=True, and then on the test set with the same scaler and fit=False.

>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> tsdata.scale(scaler, fit=True)
>>> tsdata_test.scale(scaler, fit=False)

unscale()[source]

Unscale the time series dataset’s feature column and target column.

Returns

the tsdataset instance.
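
For example, a sketch continuing the scale() example above:

>>> tsdata.unscale()  # the internal dataframe is restored to the original values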

unscale_numpy(data)[source]

Unscale the time series forecaster’s numpy prediction result/ground truth.

Parameters

data – a 3-dim numpy ndarray whose shape should be exactly the same as self.numpy_y.

Returns

the unscaled numpy ndarray.
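
A sketch of a typical evaluation flow; forecaster here is a hypothetical trained model, not part of this API:

>>> x, y = tsdata_test.to_numpy()
>>> pred = forecaster.predict(x)  # hypothetical model call
>>> pred_unscaled = tsdata_test.unscale_numpy(pred)  # unscale predictions
>>> y_unscaled = tsdata_test.unscale_numpy(y)  # unscale ground truth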