Friesian Feature API

friesian.feature.table

class zoo.friesian.feature.table.Table(df)[source]

Bases: object

compute()[source]

Trigger computation of the Table.

to_spark_df()[source]

Convert the current Table to a Spark DataFrame.

Returns

The converted Spark DataFrame.

size()[source]

Returns the number of rows in this Table.

Returns

The number of rows in the current Table.
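For illustration, a minimal sketch of basic Table usage (the Parquet path here is hypothetical; read_parquet is documented under FeatureTable below):

    from zoo.friesian.feature.table import FeatureTable

    tbl = FeatureTable.read_parquet("data/interactions.parquet")
    print(tbl.size())         # number of rows in the Table
    df = tbl.to_spark_df()    # convert back to a plain Spark DataFrame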

broadcast()[source]

Marks the Table as small enough for use in broadcast joins.

select(*cols)[source]

Select specific columns.

Parameters

cols – str or a list of str that specifies column names. If it is ‘*’, select all the columns.

Returns

A new Table that contains the specified columns.

drop(*cols)[source]

Returns a new Table that drops the specified columns. This is a no-op if the schema doesn’t contain the given column name(s).

Parameters

cols – str or a list of str that specifies the name of the columns to drop.

Returns

A new Table with the specified columns dropped.

fillna(value, columns)[source]

Replace null values.

Parameters
  • value – int, long, float, string, or boolean. Value to replace null values with.

  • columns – list of str, the target columns to be filled. If columns=None and value is int, all columns of integer type will be filled. If columns=None and value is long, float, str or boolean, all columns will be filled.

Returns

A new Table with null values replaced by the specified value.

dropna(columns, how='any', thresh=None)[source]

Drops the rows containing null values in the specified columns.

Parameters
  • columns – str or a list of str that specifies column names. If it is None, it will operate on all columns.

  • how – If how is “any”, then drop rows containing any null values in columns. If how is “all”, then drop rows only if every column in columns is null for that row.

  • thresh – int, if specified, drop rows that have fewer than thresh non-null values. Default is None.

Returns

A new Table that drops the rows containing null values in the specified columns.
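A minimal sketch combining fillna and dropna (column names such as “age”, “user” and “item” are illustrative):

    # Replace nulls in the age column with 0, then drop any rows
    # that still have nulls in the user or item columns.
    tbl = tbl.fillna(0, ["age"])
    tbl = tbl.dropna(["user", "item"], how="any")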

distinct()[source]

Select the distinct rows of the Table.

Returns

A new Table that only contains distinct rows.

filter(condition)[source]

Filters the rows that satisfy condition. For instance, filter(“col_1 == 1”) will keep the rows that have value 1 in column col_1.

Parameters

condition – str that gives the condition for filtering.

Returns

A new Table with filtered rows.
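A minimal sketch chaining select, drop and filter (column names are illustrative); each call returns a new Table, so calls can be chained:

    tbl = tbl.select("user", "item", "label") \
             .drop("label") \
             .filter("user > 0")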

clip(columns, min=None, max=None)[source]

Clips continuous values so that they are within the range [min, max]. For instance, by setting the min value to 0, all negative values in columns will be replaced with 0.

Parameters
  • columns – str or a list of str, the target columns to be clipped.

  • min – numeric, the minimum value to clip values to. Values less than this will be replaced with this value.

  • max – numeric, the maximum value to clip values to. Values greater than this will be replaced with this value.

Returns

A new Table with values less than min replaced by the specified min and values greater than max replaced by the specified max.

log(columns, clipping=True)[source]

Calculates the log of continuous columns.

Parameters
  • columns – str or a list of str, the target columns to calculate log.

  • clipping – boolean, if clipping=True, the negative values in columns will be clipped to 0 and log(x+1) will be calculated. If clipping=False, log(x) will be calculated.

Returns

A new Table with the values in columns replaced by their log values.
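A minimal sketch of clip and log (assuming numeric “rating” and “price” columns):

    # Clamp ratings into [0, 5], then take the log of prices;
    # with clipping=True, negatives are clipped to 0 and log(x + 1) is used.
    tbl = tbl.clip("rating", min=0, max=5)
    tbl = tbl.log("price", clipping=True)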

fill_median(columns)[source]

Replaces null values with the median in the specified numeric columns. Any column to be filled should not contain only null values.

Parameters

columns – str or a list of str that specifies column names. If it is None, it will operate on all numeric columns.

Returns

A new Table that replaces null values with the median in the specified numeric columns.

median(columns)[source]

Returns a new Table that has two columns, column and median, containing the column names and the medians of the specified numeric columns.

Parameters

columns – str or a list of str that specifies column names. If it is None, it will operate on all numeric columns.

Returns

A new Table that contains the medians of the specified columns.

merge_cols(columns, target)[source]

Merge the values of the target columns into a list in a new column. The original columns will be dropped.

Parameters
  • columns – a list of str, the target columns to be merged.

  • target – str, the new column name of the merged column.

Returns

A new Table that replaces columns with a new target column of merged list values.

rename(columns)[source]

Rename columns with new column names.

Parameters

columns – dict. Name pairs. For instance, {‘old_name1’: ‘new_name1’, ‘old_name2’: ‘new_name2’}.

Returns

A new Table with new column names.
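A minimal sketch of merge_cols and rename (the “f1”/“f2”/“f3” column names are illustrative):

    # Merge three feature columns into one list column, then rename it.
    tbl = tbl.merge_cols(["f1", "f2", "f3"], "features")
    tbl = tbl.rename({"features": "feature_list"})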

show(n=20, truncate=True)[source]

Prints the first n rows to the console.

Parameters
  • n – int, the number of rows to show.

  • truncate – If set to True, strings longer than 20 characters will be truncated. If set to a number greater than one, truncates long strings to length truncate and aligns cells right.

get_stats(columns, aggr)[source]

Calculate the statistics of the values over the target column(s).

Parameters

  • columns – str or a list of str that specifies the name(s) of the target column(s). If columns is None, statistics will be calculated for all numeric columns.

  • aggr – str, a list of str, or dict that specifies the aggregate function(s); min/max/avg/sum/count are supported. If aggr is a str or a list of str, it contains the name(s) of the aggregate function(s). If aggr is a dict, the key is the column name, and the value is the aggregate function(s).

Returns

dict, the key is the column name, and the value is the aggregate result(s).
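A minimal sketch of get_stats (column names are illustrative):

    stats = tbl.get_stats(["age", "price"], "avg")   # e.g. {'age': ..., 'price': ...}
    stats = tbl.get_stats("age", ["min", "max"])     # multiple aggregates per column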

min(columns)[source]

Returns a new Table that has two columns, column and min, containing the column names and the minimum values of the specified numeric columns.

Parameters

columns – str or a list of str that specifies column names. If it is None, it will operate on all numeric columns.

Returns

A new Table that contains the minimum values of the specified columns.

max(columns)[source]

Returns a new Table that has two columns, column and max, containing the column names and the maximum values of the specified numeric columns.

Parameters

columns – str or a list of str that specifies column names. If it is None, it will operate on all numeric columns.

Returns

A new Table that contains the maximum values of the specified columns.

to_list(column)[source]

Convert all values of the target column to a list. Only call this if the Table is small enough.

Parameters

column – str, the name of the target column.

Returns

list, contains all values of the target column.

to_dict()[source]

Convert the Table to a dictionary. Only call this if the Table is small enough.

Returns

dict, the key is the column name, and the value is the list containing all values in the corresponding column.

add(columns, value=1)[source]

Increase all values of the target numeric column(s) by a constant value.

Parameters
  • columns – str or a list of str, the target columns to be increased.

  • value – numeric (int/float/double/short/long), the constant value to be added.

Returns

A new Table with updated numeric values on specified columns.

property columns

Get column names of the Table.

Returns

A list of strings that specify column names.

sample(fraction, replace=False, seed=None)[source]

Return a sampled subset of the Table.

Parameters
  • fraction – float, fraction of rows to generate, should be within the range [0, 1].

  • replace – boolean, whether to allow sampling of the same row more than once.

  • seed – seed for sampling.

Returns

A new Table with sampled rows.

ordinal_shuffle_partition()[source]

Shuffle each partition of the Table by adding a random ordinal column for each row and sorting by this ordinal column within each partition.

Returns

A new Table with shuffled partitions.

write_parquet(path, mode='overwrite')[source]

Write the Table to a Parquet file.

Parameters
  • path – str. The path to the Parquet file.

  • mode – str. One of “append”, “overwrite”, “error” or “ignore”. append: Append contents to the existing data. overwrite: Overwrite the existing data. error: Throw an exception if the data already exists. ignore: Silently ignore this operation if data already exists.

cast(columns, dtype)[source]

Cast columns to the specified type.

Parameters
  • columns – str or a list of str that specifies column names. If it is None, then cast all of the columns.

  • dtype – str (“string”, “boolean”, “int”, “long”, “short”, “float”, “double”) that specifies the data type.

Returns

A new Table that casts all of the specified columns to the specified type.
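A minimal sketch of cast (column names are illustrative):

    tbl = tbl.cast(["user", "item"], "long")
    tbl = tbl.cast("label", "int")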

append_column(name, value)[source]

Append a column with a constant value to the Table.

Parameters
  • name – str, the name of the new column.

  • value – The constant column value for the new column.

Returns

A new Table with the appended column.

col(name)[source]

Get the target column of the Table.

class zoo.friesian.feature.table.FeatureTable(df)[source]

Bases: zoo.friesian.feature.table.Table

classmethod read_parquet(paths)[source]

Loads Parquet files as a FeatureTable.

Parameters

paths – str or a list of str. The path(s) to Parquet file(s).

Returns

A FeatureTable for recommendation data.

classmethod read_json(paths, cols=None)[source]

Loads JSON files as a FeatureTable.

Parameters
  • paths – str or a list of str. The path(s) to JSON file(s).

  • cols – a list of str, the columns to select. Default is None, and in this case all columns will be loaded.

Returns

A FeatureTable for recommendation data.

classmethod read_csv(paths, delimiter=',', header=False, names=None, dtype=None)[source]

Loads csv files as a FeatureTable.

Parameters
  • paths – str or a list of str. The path(s) to csv file(s).

  • delimiter – str, delimiter to use for parsing the csv file(s). Default is “,”.

  • header – boolean, whether the first line of the csv file(s) will be treated as the header for column names. Default is False.

  • names – str or a list of str, the column names for the csv file(s). You need to provide this if the header cannot be inferred. If specified, names should have the same length as the number of columns.

  • dtype – str or a list of str or dict, the column data type(s) for the csv file(s). You may need to provide this if you want to change the default inferred types of specified columns. If dtype is a str, then all the columns will be cast to the target dtype. If dtype is a list of str, then it should have the same length as the number of columns and each column will be cast to the corresponding str dtype. If dtype is a dict, then the key should be the column name and the value should be the str dtype to cast the column to.

Returns

A FeatureTable for recommendation data.
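A minimal sketch of read_csv for a headerless file (the path, names and dtype here are hypothetical):

    tbl = FeatureTable.read_csv(
        "data/ratings.csv",
        delimiter=",",
        header=False,
        names=["user", "item", "rating"],
        dtype={"rating": "float"})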

encode_string(columns, indices)[source]

Encode columns with the provided list of StringIndex.

Parameters
  • columns – str or a list of str, the target columns to be encoded.

  • indices – StringIndex or a list of StringIndex, the StringIndex(es) of the target columns. Each StringIndex should have at least two columns: id and the corresponding categorical column. Alternatively, it can be a dict or a list of dicts; in this case, the keys of each dict should be the categorical values in the corresponding column and the values are the target ids to be encoded.

Returns

A new FeatureTable which transforms categorical features into unique integer values with provided StringIndexes.

filter_by_frequency(columns, min_freq=2)[source]

Filter the FeatureTable by the given minimum frequency on the target columns.

Parameters
  • columns – str or a list of str, column names which are considered for filtering.

  • min_freq – int, the minimum frequency. Rows whose values in the given columns occur fewer than min_freq times will be filtered out.

Returns

A new FeatureTable with filtered records.

hash_encode(columns, bins, method='md5')[source]

Hash encode for categorical column(s).

Parameters
  • columns – str or a list of str, the target columns to be encoded. For dense features, you need to cut them into discrete intervals beforehand.

  • bins – int, defines the number of equal-width bins in the range of column(s) values.

  • method – str, a hashlib-supported hashing method, e.g. md5, sha256, etc.

Returns

A new FeatureTable with hash-encoded columns.

cross_hash_encode(columns, bins, cross_col_name=None)[source]

Hash encode for cross column(s).

Parameters
  • columns – a list of str, the categorical columns to be encoded as cross features. For dense features, you need to cut them into discrete intervals beforehand.

  • bins – int, defines the number of equal-width bins in the range of column(s) values.

  • cross_col_name – str, the column name for output cross column. Default is None, and in this case the default cross column name will be ‘crossed_col1_col2’ for [‘col1’, ‘col2’].

Returns

A new FeatureTable with the target cross column.

category_encode(columns, freq_limit=None)[source]

Category encode the given columns.

Parameters
  • columns – str or a list of str, target columns to encode from string to index.

  • freq_limit – int, dict or None. Categories with a count/frequency below freq_limit will be omitted from the encoding. Can be represented as either an integer or a dict. For instance, 15, {‘col_4’: 10, ‘col_5’: 2} etc. Default is None, and in this case all the categories that appear will be encoded.

Returns

A tuple of a new FeatureTable which transforms categorical features into unique integer values, and a list of StringIndex for the mapping.
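A minimal sketch of category_encode (column names are illustrative); note that it returns both the encoded FeatureTable and the StringIndex list:

    tbl, indices = tbl.category_encode(["user", "item"], freq_limit=10)
    # indices is a list of StringIndex, one per encoded column,
    # which can be saved and reused to encode new data consistently.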

one_hot_encode(columns, sizes=None, prefix=None, keep_original_columns=False)[source]

Convert categorical features into one hot encodings. If the features are strings, you should first call category_encode to encode them into indices before one hot encoding. For each input column, a one hot vector will be created, expanding into multiple output columns, with the value of each one hot column being either 0 or 1. Note that you should only use one hot encoding on columns with small dimensions due to memory concerns.

For example, for column ‘x’ with size 5:

Input:

|x|
|1|
|3|
|0|

Output will contain 5 one hot columns:

|prefix_0|prefix_1|prefix_2|prefix_3|prefix_4|
|   0    |   1    |   0    |   0    |   0    |
|   0    |   0    |   0    |   1    |   0    |
|   1    |   0    |   0    |   0    |   0    |

Parameters
  • columns – str or a list of str, the target columns to be encoded.

  • sizes – int or a list of int, the size(s) of the one hot vectors of the column(s). Default is None, and in this case, the sizes will be calculated by the maximum value(s) of the column(s) + 1, namely the one hot vector will cover 0 to the maximum value. You are recommended to provide the sizes if they are known beforehand. If specified, sizes should have the same length as columns.

  • prefix – str or a list of str, the prefix of the one hot columns for the input column(s). Default is None, and in this case, the prefix will be the input column names. If specified, prefix should have the same length as columns. The one hot columns for each input column will have column names: prefix_0, prefix_1, … , prefix_maximum

  • keep_original_columns – boolean, whether to keep the original index column(s) before the one hot encoding. Default is False, and in this case the original column(s) will be replaced by the one hot columns. If True, the one hot columns will be appended to each original column.

Returns

A new FeatureTable which transforms categorical indices into one hot encodings.
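A minimal sketch of one_hot_encode (assuming a “weekday” index column with values 0 to 6):

    # Produces 7 one hot columns named wd_0, wd_1, ..., wd_6.
    tbl = tbl.one_hot_encode("weekday", sizes=7, prefix="wd")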

gen_string_idx(columns, freq_limit=None)[source]

Generate unique index value of categorical features. The resulting index would start from 1 with 0 reserved for unknown features.

Parameters
  • columns – str or a list of str, target columns to generate StringIndex.

  • freq_limit – int, dict or None. Categories with a count/frequency below freq_limit will be omitted from the encoding. Can be represented as either an integer or a dict. For instance, 15, {‘col_4’: 10, ‘col_5’: 2} etc. Default is None, and in this case all the categories that appear will be encoded.

Returns

A StringIndex or a list of StringIndex.
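A minimal sketch pairing gen_string_idx with encode_string (documented above; column names are illustrative):

    idx_list = tbl.gen_string_idx(["user", "item"], freq_limit=5)
    tbl = tbl.encode_string(["user", "item"], idx_list)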

cross_columns(crossed_columns, bucket_sizes)[source]

Cross the given columns and hash them into the specified bucket sizes.

Parameters

  • crossed_columns – a list of column name pairs to be crossed, e.g. [[‘a’, ‘b’], [‘c’, ‘d’]].

  • bucket_sizes – a list of hash bucket sizes for the crossed pairs, e.g. [1000, 300].

Returns

A FeatureTable including the crossed columns (e.g. ‘a_b’, ‘c_d’).

min_max_scale(columns, min=0.0, max=1.0)[source]

Rescale each column individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or rescaling.

Parameters
  • columns – list of column names

  • min – Lower bound after transformation, shared by all columns. 0.0 by default.

  • max – Upper bound after transformation, shared by all columns. 1.0 by default.

Returns

A tuple of the transformed FeatureTable and a dict mapping each column name to its original range, i.e. {c: (originalMin, originalMax) for c in columns}.
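A minimal sketch of min_max_scale (column names are illustrative):

    tbl, ranges = tbl.min_max_scale(["age", "price"], min=0.0, max=1.0)
    # ranges maps each column to its original (min, max), e.g. {'age': (18, 75), ...}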

add_negative_samples(item_size, item_col='item', label_col='label', neg_num=1)[source]

Generate negative item visits for each positive item visit.

Parameters
  • item_size – int, the maximum item id.

  • item_col – str, the name of the item column.

  • label_col – str, the name of the label column.

  • neg_num – int, the number of negative samples to add for each positive record.

Returns

A new FeatureTable with negative samples added.

add_hist_seq(cols, user_col, sort_col='time', min_len=1, max_len=100)[source]

Generate a list of historical item visits for each user.

Parameters
  • cols – a list of str, the columns to be aggregated.

  • user_col – str, the name of the user column.

  • sort_col – str, the column to sort by. Default is ‘time’.

  • min_len – int, the minimal length of a history sequence.

  • max_len – int, the maximal length of a history sequence.

Returns

A new FeatureTable with history sequence columns added.

add_neg_hist_seq(item_size, item_history_col, neg_num)[source]

Generate a list of negative samples for each item in item_history_col.

Parameters
  • item_size – int, the maximum item id.

  • item_history_col – str, this column should contain a list of historical visits.

  • neg_num – int, the number of negative samples to generate for each item in the history.

Returns

A new FeatureTable with negative history sequences added.

mask(mask_cols, seq_len=100)[source]

Add mask columns for the specified columns.

Parameters
  • mask_cols – a list of str, the columns to be masked with 1s and 0s.

  • seq_len – int, the length of the masked columns.

Returns

A new FeatureTable with masked columns.

pad(cols, seq_len=100, mask_cols=None)[source]

Pad and mask columns.

Parameters
  • cols – a list of str, the columns to be padded with 0s.

  • seq_len – int, the length of the padded and masked columns.

  • mask_cols – a list of str, the columns to be masked with 1s and 0s.

Returns

A new FeatureTable with padded (and optionally masked) columns.

apply(in_col, out_col, func, data_type)[source]

Transform a FeatureTable using a Python UDF.

Parameters
  • in_col – str, the name of the column to be transformed.

  • out_col – str, the name of the output column.

  • func – the Python function to apply.

  • data_type – str, the data type of out_col.

Returns

A new FeatureTable with the transformed column.
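A minimal sketch of apply (the bucketing function and column names here are hypothetical):

    # Bucket prices into integer buckets of width 10.
    tbl = tbl.apply("price", "price_bucket",
                    lambda p: int(p // 10), "int")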

join(table, on=None, how=None)[source]

Join a FeatureTable with another FeatureTable. This is a wrapper of Spark DataFrame join.

Parameters
  • table – the FeatureTable to join with.

  • on – str, the column to join on.

  • how – str, the type of join, following Spark DataFrame join (e.g. “inner”, “left”, “outer”).

Returns

A joined FeatureTable.

add_value_features(key_cols, tbl, key, value)[source]

Add features based on key_cols and another key-value Table. For each column in key_cols, a value column is added using the key-value pairs from tbl.

Parameters
  • key_cols – a list of str, the key columns used for looking up values.

  • tbl – a Table with only two columns [key, value].

  • key – str, the name of the key column in tbl.

  • value – str, the name of the value column in tbl.

Returns

A new FeatureTable with the added value columns.

group_by(columns=[], agg='count', join=False)[source]

Group the Table with specified columns and then run aggregation. Optionally join the result with the original Table.

Parameters
  • columns – str or a list of str. Columns to group the Table. If it is an empty list, aggregation is run directly without grouping. Default is [].

  • agg

    str, list or dict. Aggregate functions to be applied to the grouped Table. Default is “count”. Supported aggregate functions are: “max”, “min”, “count”, “sum”, “avg”, “mean”, “sumDistinct”, “stddev”, “stddev_pop”, “variance”, “var_pop”, “skewness”, “kurtosis”, “collect_list”, “collect_set”, “approx_count_distinct”, “first”, “last”. If agg is a str, then agg is the aggregate function and the aggregation is performed on all columns that are not in columns. If agg is a list of str, then agg is a list of aggregate functions and the aggregation is performed on all columns that are not in columns. If agg is a single dict mapping from str to str, then the key is the column to perform aggregation on, and the value is the aggregate function. If agg is a single dict mapping from str to list, then the key is the column to perform aggregation on, and the value is the list of aggregate functions.

    Examples: agg=“sum”, agg=[“last”, “stddev”], agg={“*”: “count”}, agg={“col_1”: “sum”, “col_2”: [“count”, “mean”]}.

  • join – boolean. If join is True, join the aggregation result with original Table.

Returns

A new Table with aggregated column fields.
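A minimal sketch of group_by (column names are illustrative):

    agg_tbl = tbl.group_by("user", agg={"price": ["sum", "mean"]})
    # With join=True, the aggregated columns are joined back onto the original rows.
    joined = tbl.group_by("user", agg="count", join=True)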

split(ratio, seed=None)[source]

Split the FeatureTable into multiple FeatureTables for train, validation and test.

Parameters
  • ratio – a list of portions as weights with which to split the FeatureTable. Weights will be normalized if they don’t sum up to 1.0.

  • seed – The seed for sampling.

Returns

A tuple of FeatureTables split by the given ratio.
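A minimal sketch of split for a train/validation/test split:

    train_tbl, val_tbl, test_tbl = tbl.split([0.8, 0.1, 0.1], seed=42)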

class zoo.friesian.feature.table.StringIndex(df, col_name)[source]

Bases: zoo.friesian.feature.table.Table

classmethod read_parquet(paths, col_name=None)[source]

Loads Parquet files as a StringIndex.

Parameters
  • paths – str or a list of str. The path/paths to Parquet file(s).

  • col_name – str. The column name of the corresponding categorical column. If col_name is None, the file name will be used as col_name.

Returns

A StringIndex.

classmethod from_dict(indices, col_name)[source]

Create the StringIndex from a dict of indices.

Parameters
  • indices – dict. The key is the categorical value and the value is the corresponding index. The key should be a str and the value should be an int.

  • col_name – str. The column name of the categorical column.

Returns

A StringIndex.

to_dict()[source]

Convert the StringIndex to a dict, with the categorical features as keys and indices as values. Note that you may only call this if the StringIndex is small.

Returns

A dict for the mapping from string to index.
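A minimal sketch of from_dict and to_dict (the mapping and column name are illustrative):

    from zoo.friesian.feature.table import StringIndex

    index = StringIndex.from_dict({"apple": 1, "banana": 2}, "fruit")
    mapping = index.to_dict()   # {'apple': 1, 'banana': 2}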

write_parquet(path, mode='overwrite')[source]

Write the StringIndex to Parquet file.

Parameters
  • path – str. The path to the Parquet file. Note that the col_name will be used as basename of the Parquet file.

  • mode – str. One of “append”, “overwrite”, “error” or “ignore”. append: Append the contents of this StringIndex to the existing data. overwrite: Overwrite the existing data. error: Throw an exception if the data already exists. ignore: Silently ignore this operation if the data already exists.