Feature engineering

A critical part of machine learning is feature engineering. BlueCast's pipelines automatically execute only the necessary feature engineering and leave the rest to the end user. However, BlueCast offers some tools that make feature engineering more approachable and faster.

First we import the required modules:

import pandas as pd  # needed for pd.concat further down

from bluecast.preprocessing.feature_types import FeatureTypeDetector
from bluecast.preprocessing.feature_creation import AddRowLevelAggFeatures, GroupLevelAggFeatures

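Next we use FeatureTypeDetector to identify numerical columns. Here TARGET is a variable holding the name of the target column, and train and test are the training and test DataFrames: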

ignore_cols = [TARGET, "id", "CustomerId"]

feat_type_detector = FeatureTypeDetector()
train_data = feat_type_detector.fit_transform_feature_types(train.drop(ignore_cols, axis=1))
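To make the idea of "numerical columns" concrete, here is a minimal pandas-only sketch of dtype-based column selection. It is not how FeatureTypeDetector works internally (its detection logic is not described here); the toy DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
toy = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "Age": [42, 35, 58],
        "Balance": [0.0, 1200.5, 830.0],
        "Geography": ["France", "Spain", "France"],
    }
)

# Drop identifier columns first, then keep only columns with a numeric dtype.
num_columns = (
    toy.drop(columns=["id"]).select_dtypes(include="number").columns.tolist()
)
print(num_columns)
```

Dropping identifiers before the selection matters because an integer id column would otherwise be treated as a numeric feature.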

Next we use AddRowLevelAggFeatures to create row-level features, that is, aggregates computed across the numerical columns of each row. This usually yields a small performance gain.

agg_feat_creator = AddRowLevelAggFeatures()

train_num = agg_feat_creator.add_row_level_agg_features(train.loc[:, feat_type_detector.num_columns])
test_num = agg_feat_creator.add_row_level_agg_features(test.loc[:, feat_type_detector.num_columns])

train_num = train_num.drop(agg_feat_creator.original_features, axis=1)
test_num = test_num.drop(agg_feat_creator.original_features, axis=1)


train = pd.concat([train, train_num], axis=1)
test = pd.concat([test, test_num], axis=1)
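As a rough illustration of what row-level aggregation means, the sketch below computes a few per-row statistics with plain pandas. The choice of aggregations and the `row_*` column names are assumptions made for this example; the features that AddRowLevelAggFeatures actually produces may differ.

```python
import pandas as pd

# Hypothetical numeric data.
toy = pd.DataFrame({"a": [1.0, 4.0], "b": [3.0, 2.0], "c": [5.0, 6.0]})
num_cols = ["a", "b", "c"]

# axis=1 aggregates across the columns of each row.
row_aggs = pd.DataFrame(
    {
        "row_min": toy[num_cols].min(axis=1),
        "row_max": toy[num_cols].max(axis=1),
        "row_mean": toy[num_cols].mean(axis=1),
        "row_sum": toy[num_cols].sum(axis=1),
    },
    index=toy.index,
)

# Append the new features next to the original columns.
toy_enriched = pd.concat([toy, row_aggs], axis=1)
print(toy_enriched)
```

Because each statistic depends only on the row itself, the same transformation can be applied to train and test independently without leaking information between them.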

We can also add information via group-level aggregations with GroupLevelAggFeatures:

group_agg_creator = GroupLevelAggFeatures()

train_num = group_agg_creator.create_groupby_agg_features(
    df = train,
    groupby_columns=["Geography", "Gender", "NumOfProducts"],
    columns_to_agg=feat_type_detector.num_columns, # None = take all
    target_col=None,
    aggregations = None # falls back to some aggs
)

test_num = group_agg_creator.create_groupby_agg_features(
    df = test,
    groupby_columns=["Geography", "Gender", "NumOfProducts"],
    columns_to_agg=feat_type_detector.num_columns, # None = take all
    target_col=TARGET,
    aggregations = None # falls back to some aggs
)

# Join the aggregates computed on train onto both train and test
# (test_num is not merged), so test rows only receive train statistics
train = train.merge(train_num, on=["Geography", "Gender", "NumOfProducts"], how="left")
test = test.merge(train_num, on=["Geography", "Gender", "NumOfProducts"], how="left")

Please note that this will increase the number of features significantly.
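The pattern above (compute group statistics on train, then left-join them onto both sets) can be sketched with plain pandas. This is only an illustration under assumed toy data; the `_by_group` column naming is hypothetical and is not the naming scheme GroupLevelAggFeatures uses.

```python
import pandas as pd

# Hypothetical toy data.
train_toy = pd.DataFrame(
    {
        "Geography": ["France", "France", "Spain"],
        "Age": [30, 50, 40],
        "Balance": [100.0, 300.0, 200.0],
    }
)
test_toy = pd.DataFrame(
    {"Geography": ["Spain", "Germany"], "Age": [25, 60], "Balance": [50.0, 0.0]}
)

group_cols = ["Geography"]
aggs = ["min", "max", "mean"]

# Group statistics are computed on the training data only.
group_stats = train_toy.groupby(group_cols)[["Age", "Balance"]].agg(aggs)

# agg() with a list returns MultiIndex columns; flatten them to plain names.
group_stats.columns = [f"{col}_{agg}_by_group" for col, agg in group_stats.columns]
group_stats = group_stats.reset_index()

# Left joins keep every row; groups unseen in train get NaN.
train_out = train_toy.merge(group_stats, on=group_cols, how="left")
test_out = test_toy.merge(group_stats, on=group_cols, how="left")

# New feature count = aggregated columns x aggregations (2 x 3 = 6 here).
print(train_out.shape[1] - train_toy.shape[1])
```

Two things are worth noticing: the feature count grows multiplicatively with the number of aggregated columns and aggregations, and test rows whose group never appears in train (here "Germany") end up with missing values that the model or a later imputation step has to handle.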

BlueCast also offers a convenience function, add_groupby_agg_feats, that creates group aggregation features in a single call:

from bluecast.preprocessing.feature_creation import add_groupby_agg_feats
import pandas as pd

df = pd.DataFrame(
    {
        "group_col": ["A", "A", "B", "B", "C"],
        "num_col1": [10, 20, 30, 40, 50],
        "num_col2": [1, 2, 3, 4, 5],
        "target_col": [100, 200, 300, 400, 500],
    }
)

# Define parameters for the add_groupby_agg_feats function
groupby_cols = ["group_col"]
to_group_cols = ["num_col1", "num_col2"]
num_col_prefix = "agg"
target_col = "target_col"
aggregations = ["min", "max", "mean", "sum"]

# Call the function
result_df = add_groupby_agg_feats(
    df, groupby_cols, to_group_cols, num_col_prefix, target_col, aggregations
)

This will return the original DataFrame enriched with the desired aggregations.