TSDataset¶
chronos.data.tsdataset¶
Time series data is a special data formulation with specific operations. TSDataset is an abstraction of a time series dataset that provides various data processing operations (e.g. impute, deduplicate, resample, scale/unscale, roll) and feature engineering methods (e.g. datetime features, aggregation features). Cascade (chained) calls are supported for most of the methods. A TSDataset can be initialized from a pandas dataframe and converted back to a pandas dataframe or a numpy ndarray.
- class zoo.chronos.data.tsdataset.TSDataset(data, **schema)[source]¶
Bases:
object
TSDataset is an abstraction of a time series dataset. Cascade calls are supported for most of the transform methods.
- static from_pandas(df, dt_col, target_col, id_col=None, extra_feature_col=None, with_split=False, val_ratio=0, test_ratio=0.1, largest_look_back=0, largest_horizon=1)[source]¶
Initialize tsdataset(s) from a pandas dataframe.
- Parameters
df – a pandas dataframe for your raw time series data.
dt_col – a str indicating the name of the datetime column in the input dataframe.
target_col – a str or list indicating the name(s) of the target column(s) in the input dataframe.
id_col – (optional) a str indicating the name of the dataframe id column. If it is not explicitly stated, the data is interpreted as containing only a single id.
extra_feature_col – (optional) a str or list indicating the name(s) of the extra feature columns used to help predict the target column.
with_split – (optional) bool, whether to split the dataframe into train, validation and test sets. The value defaults to False.
val_ratio – (optional) float, validation ratio. Only effective when with_split is set to True. The value defaults to 0.
test_ratio – (optional) float, test ratio. Only effective when with_split is set to True. The value defaults to 0.1.
largest_look_back – (optional) int, the largest length to look back. Only effective when with_split is set to True. The value defaults to 0.
largest_horizon – (optional) int, the largest number of steps to look forward. Only effective when with_split is set to True. The value defaults to 1.
- Returns
a TSDataset instance when with_split is set to False, or three TSDataset instances (train, validation and test) when with_split is set to True.
Create a tsdataset instance by:
>>> # Here is a df example:
>>> # id  datetime    value  "extra feature 1"  "extra feature 2"
>>> # 00  2019-01-01  1.9    1                  2
>>> # 01  2019-01-01  2.3    0                  9
>>> # 00  2019-01-02  2.4    3                  4
>>> # 01  2019-01-02  2.6    0                  2
>>> tsdataset = TSDataset.from_pandas(df, dt_col="datetime",
>>>                                   target_col="value", id_col="id",
>>>                                   extra_feature_col=["extra feature 1",
>>>                                                      "extra feature 2"])
- impute(mode='last', const_num=0)[source]¶
Impute the tsdataset by imputing each univariate time series distinguished by id_col and feature_col.
- Parameters
mode –
imputation mode, select from “last”, “const” or “linear”.
”last”: impute by propagating the last non-N/A number to the N/A values that follow it. If there is no non-N/A number ahead, 0 is filled instead.
”const”: impute by a const value input by user.
”linear”: impute by linear interpolation.
const_num – indicates the const number to fill, which is only effective when mode is set to “const”.
- Returns
the tsdataset instance.
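For example, N/A values can be filled with a constant (a minimal usage sketch, assuming tsdata is a TSDataset instance created via from_pandas as above):
>>> tsdata.impute(mode="const", const_num=0)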
- deduplicate()[source]¶
Remove duplicated records which have exactly the same values in each feature_col for each multivariate time series distinguished by id_col.
- Returns
the tsdataset instance.
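Since each transform returns the tsdataset instance itself, deduplicate() can be chained with other methods (a sketch, again assuming a tsdata instance as above):
>>> tsdata.deduplicate().impute(mode="last")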
- resample(interval, start_time=None, end_time=None, merge_mode='mean')[source]¶
Resample on a new interval for each univariate time series distinguished by id_col and feature_col.
- Parameters
interval – a pandas offset alias indicating the time interval of the output dataframe.
start_time – start time of the output dataframe.
end_time – end time of the output dataframe.
merge_mode – if the current interval is smaller than the output interval, the values need to be merged with a mode. “max”, “min”, “mean” and “sum” are supported for now.
- Returns
the tsdataset instance.
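For example, a series can be downsampled to one-day intervals and merged by mean (a sketch; “D” is the pandas offset alias for calendar day):
>>> tsdata.resample(interval="D", merge_mode="mean")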
- gen_dt_feature()[source]¶
Generate datetime features for each row. Currently the following features are generated:
“MINUTE”: The minute of the time stamp.
“DAY”: The day of the time stamp.
“DAYOFYEAR”: The ordinal day of the year of the time stamp.
“HOUR”: The hour of the time stamp.
“WEEKDAY”: The day of the week of the time stamp, Monday=0, Sunday=6.
“WEEKOFYEAR”: The ordinal week of the year of the time stamp.
“MONTH”: The month of the time stamp.
“IS_AWAKE”: Bool value indicating whether the time stamp belongs to awake hours, True for hours between 6 A.M. and 1 A.M.
“IS_BUSY_HOURS”: Bool value indicating whether the time stamp belongs to busy hours, True for hours between 7 A.M. and 10 A.M. and between 4 P.M. and 8 P.M.
“IS_WEEKEND”: Bool value indicating whether the time stamp belongs to weekends, True for Saturdays and Sundays.
- Returns
the tsdataset instance.
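A usage sketch (assuming the tsdata instance from above); after the call, the generated columns such as “HOUR” and “WEEKDAY” are available as additional feature columns:
>>> tsdata.gen_dt_feature()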
- gen_global_feature(settings='comprehensive', full_settings=None)[source]¶
Generate per-time-series features for each time series. This method is implemented with tsfresh.
TODO: relationship with scale should be figured out.
- Parameters
settings – str or dict. If a string is set, it must be one of “comprehensive”, “minimal” and “efficient”. If a dict is set, it should follow the instructions for default_fc_parameters in tsfresh. The value defaults to “comprehensive”.
full_settings – dict. It should follow the instructions for kind_to_fc_parameters in tsfresh. The value defaults to None.
- Returns
the tsdataset instance.
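A sketch of typical usage; the “minimal” preset computes a much smaller tsfresh feature set than the default “comprehensive”:
>>> tsdata.gen_global_feature(settings="minimal")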
- gen_rolling_feature(window_size, settings='comprehensive', full_settings=None)[source]¶
Generate aggregation features for each sample. This method is implemented with tsfresh.
TODO: relationship with scale should be figured out.
- Parameters
window_size – int, the rolling window size over which the features are generated.
settings – str or dict. If a string is set, it must be one of “comprehensive”, “minimal” and “efficient”. If a dict is set, it should follow the instructions for default_fc_parameters in tsfresh. The value defaults to “comprehensive”.
full_settings – dict. It should follow the instructions for kind_to_fc_parameters in tsfresh. The value defaults to None.
- Returns
the tsdataset instance.
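A sketch, assuming an hourly series so that window_size=24 covers one day of history per sample:
>>> tsdata.gen_rolling_feature(window_size=24, settings="minimal")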
- roll(lookback, horizon, feature_col=None, target_col=None, id_sensitive=False)[source]¶
Sampling by rolling for machine learning/deep learning models.
- Parameters
lookback – int, lookback value.
horizon – int or list. If horizon is an int, we will sample horizon steps continuously after the forecasting point. If horizon is a list, we will sample discretely according to the input list. Specially, when horizon is set to 0, ground truth will be generated as None.
feature_col – str or list, indicating the feature col name(s). Defaults to None, in which case all available feature columns are used in rolling.
target_col – str or list, indicating the target col name(s). Defaults to None, in which case all target columns are used in rolling. It should be a subset of the target_col you used to initialize the tsdataset.
id_sensitive –
bool. If id_sensitive is False, we will roll on each id’s sub-dataframe and fuse the samples. The shapes of the rolling result will be x: (num_sample, lookback, num_feature_col + num_target_col) and y: (num_sample, horizon, num_target_col), where num_sample is the sum of the sample numbers of the sub-dataframes.
If id_sensitive is True, we will roll on the wide dataframe whose columns are the cartesian product of id_col and feature_col. The shapes of the rolling result will be x: (num_sample, lookback, new_num_feature_col + new_num_target_col) and y: (num_sample, horizon, new_num_target_col), where num_sample is the sample number of the wide dataframe, new_num_feature_col is the number of ids times the number of feature_col, and new_num_target_col is the number of ids times the number of target_col.
- Returns
the tsdataset instance.
roll() can be called by:
>>> # Here is a df example:
>>> # id  datetime    value  "extra feature 1"  "extra feature 2"
>>> # 00  2019-01-01  1.9    1                  2
>>> # 01  2019-01-01  2.3    0                  9
>>> # 00  2019-01-02  2.4    3                  4
>>> # 01  2019-01-02  2.6    0                  2
>>> tsdataset = TSDataset.from_pandas(df, dt_col="datetime",
>>>                                   target_col="value", id_col="id",
>>>                                   extra_feature_col=["extra feature 1",
>>>                                                      "extra feature 2"])
>>> horizon, lookback = 1, 1
>>> tsdataset.roll(lookback=lookback, horizon=horizon, id_sensitive=False)
>>> x, y = tsdataset.to_numpy()
>>> print(x, y)  # x = [[[1.9, 1, 2]], [[2.3, 0, 9]]]  y = [[[2.4]], [[2.6]]]
>>> print(x.shape, y.shape)  # x.shape = (2, 1, 3)  y.shape = (2, 1, 1)
>>> tsdataset.roll(lookback=lookback, horizon=horizon, id_sensitive=True)
>>> x, y = tsdataset.to_numpy()
>>> print(x, y)  # x = [[[1.9, 2.3, 1, 2, 0, 9]]]  y = [[[2.4, 2.6]]]
>>> print(x.shape, y.shape)  # x.shape = (1, 1, 6)  y.shape = (1, 1, 2)
- to_numpy()[source]¶
Export the rolling result in the form of a tuple of numpy ndarrays (x, y).
- Returns
a tuple of two items. Each item is a 3-dim numpy ndarray cast to float64.
- scale(scaler, fit=True)[source]¶
Scale the time series dataset’s feature column and target column.
- Parameters
scaler – a sklearn scaler instance. StandardScaler, MaxAbsScaler, MinMaxScaler and RobustScaler are supported.
fit – whether to fit the scaler. Typically, the value should be set to True for the training set and False for the validation and test sets. The value defaults to True.
- Returns
the tsdataset instance.
Assume there is a training set tsdata and a test set tsdata_test. scale() should first be called on the training set with the default fit=True, then be called on the test set with the same scaler and fit=False.
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> tsdata.scale(scaler, fit=True)
>>> tsdata_test.scale(scaler, fit=False)