diff --git a/README.md b/README.md
index e509848..cea4718 100644
--- a/README.md
+++ b/README.md
@@ -4,21 +4,21 @@ A package for dbt which enables standardization of data sets. You can use it to

 The package contains a set of macros that mirror the functionality of the [scikit-learn preprocessing module](https://scikit-learn.org/stable/modules/preprocessing.html). Originally they were developed as part of the 2019 Medium article [Feature Engineering in Snowflake](https://medium.com/omnata/feature-engineering-in-snowflake-4312032e0d53).

-Currently they have been tested in Snowflake, Redshift , BigQuery, and SQL Server. The test case expectations have been built using scikit-learn (see *.py in [integration_tests/data/sql](integration_tests/data/sql)), so you can expect behavioural parity with it.
+Currently they have been tested in Snowflake, Redshift, BigQuery, SQL Server and PostgreSQL 13.2. The test case expectations have been built using scikit-learn (see *.py in [integration_tests/data/sql](integration_tests/data/sql)), so you can expect behavioural parity with it.

 The macros are:

-| scikit-learn function | macro name | Snowflake | BigQuery | Redshift | MSSQL | Example |
-| --- | --- | --- | --- | --- | --- | --- |
-| [KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html#sklearn.preprocessing.KBinsDiscretizer)| k_bins_discretizer | Y | Y | Y | N | ![example](images/k_bins.gif) |
-| [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder)| label_encoder | Y | Y | Y | Y | ![example](images/label_encoder.gif) |
-| [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler) | max_abs_scaler | Y | Y | Y | Y | [![example](images/max_abs_scaler.png)](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#maxabsscaler) |
-| [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler) | min_max_scaler | Y | Y | Y | N | [![example](images/min_max_scaler.png)](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#minmaxscaler) |
-| [Normalizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer) | normalizer | Y | Y | Y | Y | [![example](images/normalizer.png)](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#normalizer) |
-| [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) | one_hot_encoder | Y | Y | Y | Y | ![example](images/one_hot_encoder.gif) |
-| [QuantileTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html#sklearn.preprocessing.QuantileTransformer) | quantile_transformer | Y | Y | N | N | [![example](images/quantile_transformer.png)](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#quantiletransformer-uniform-output) |
-| [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler) | robust_scaler | Y | Y | Y | N | [![example](images/robust_scaler.png)](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#robustscaler) |
-| [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) | standard_scaler | Y | Y | Y | N | [![example](images/standard_scaler.png)](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#standardscaler) |
+| scikit-learn function | macro name | Snowflake | BigQuery | Redshift | MSSQL | PostgreSQL | Example |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+| [KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html#sklearn.preprocessing.KBinsDiscretizer)| k_bins_discretizer | Y | Y | Y | N | Y | ![example](images/k_bins.gif) |
+| [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder)| label_encoder | Y | Y | Y | Y | Y | ![example](images/label_encoder.gif) |
+| [MaxAbsScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler) | max_abs_scaler | Y | Y | Y | Y | Y | [![example](images/max_abs_scaler.png)](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#maxabsscaler) |
+| [MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler) | min_max_scaler | Y | Y | Y | N | Y | [![example](images/min_max_scaler.png)](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#minmaxscaler) |
+| [Normalizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer) | normalizer | Y | Y | Y | Y | Y | [![example](images/normalizer.png)](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#normalizer) |
+| [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) | one_hot_encoder | Y | Y | Y | Y | Y | ![example](images/one_hot_encoder.gif) |
+| [QuantileTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html#sklearn.preprocessing.QuantileTransformer) | quantile_transformer | Y | Y | N | N | Y | [![example](images/quantile_transformer.png)](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#quantiletransformer-uniform-output) |
+| [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler) | robust_scaler | Y | Y | Y | N | Y | [![example](images/robust_scaler.png)](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#robustscaler) |
+| [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) | standard_scaler | Y | Y | Y | N | Y | [![example](images/standard_scaler.png)](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#standardscaler) |

 _\* 2D charts taken from [scikit-learn.org](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html), GIFs are my own_
 ## Installation
@@ -26,7 +26,7 @@ To use this in your dbt project, create or modify packages.yml to include:
 ```
 packages:
   - package: "omnata-labs/dbt_ml_preprocessing"
-    version: [">=1.0.1"]
+    version: [">=1.0.2"]
 ```
 _(replace the revision number with the latest)_

diff --git a/dbt_project.yml b/dbt_project.yml
index 0175002..41f7d63 100644
--- a/dbt_project.yml
+++ b/dbt_project.yml
@@ -1,5 +1,5 @@
 name: 'dbt_ml_preprocessing'
-version: '1.0.1'
+version: '1.0.2'

 require-dbt-version: ">=0.15.1"

diff --git a/integration_tests/macros/equality_with_numeric_tolerance.sql b/integration_tests/macros/equality_with_numeric_tolerance.sql
index 66d9438..b992e6f 100644
--- a/integration_tests/macros/equality_with_numeric_tolerance.sql
+++ b/integration_tests/macros/equality_with_numeric_tolerance.sql
@@ -60,6 +60,10 @@ where percent_difference > {{ percentage_tolerance }}
 {% do return( redshift__test_equality_with_numeric_tolerance(model,compare_model,source_join_column,target_join_column,source_numeric_column_name,target_numeric_column_name,percentage_tolerance,output_all_rows=False)) %}
 {% endmacro %}

+{% macro postgres__test_equality_with_numeric_tolerance(model,compare_model,source_join_column,target_join_column,source_numeric_column_name,target_numeric_column_name,percentage_tolerance,output_all_rows=False) %}
+{% do return( redshift__test_equality_with_numeric_tolerance(model,compare_model,source_join_column,target_join_column,source_numeric_column_name,target_numeric_column_name,percentage_tolerance,output_all_rows=False)) %}
+{% endmacro %}
+
 {% macro snowflake__test_equality_with_numeric_tolerance(model,compare_model,source_join_column,target_join_column,source_numeric_column_name,target_numeric_column_name,percentage_tolerance,output_all_rows=False) %}

 {% set compare_cols_csv = compare_columns | join(', ') %}
diff --git a/integration_tests/macros/quantile_transformer_model_macro.sql b/integration_tests/macros/quantile_transformer_model_macro.sql
index bb4cf1f..6c590fe 100644
--- a/integration_tests/macros/quantile_transformer_model_macro.sql
+++ b/integration_tests/macros/quantile_transformer_model_macro.sql
@@ -5,6 +5,13 @@ with data as (
 select * from data
 {% endmacro %}

+{% macro postgres__quantile_transformer_model_macro() %}
+with data as (
+    {{ dbt_ml_preprocessing.quantile_transformer( ref('data_quantile_transformer') ,'col_to_transform') }}
+)
+select * from data
+{% endmacro %}
+
 -- macro not supported in other databases
 {% macro default__quantile_transformer_model_macro() %}
 select 1 from (select 1) where 1=2 -- empty result set so that test passes
diff --git a/integration_tests/macros/test_quantile_transformer_result_with_tolerance.sql b/integration_tests/macros/test_quantile_transformer_result_with_tolerance.sql
index a7394ec..f382bbd 100644
--- a/integration_tests/macros/test_quantile_transformer_result_with_tolerance.sql
+++ b/integration_tests/macros/test_quantile_transformer_result_with_tolerance.sql
@@ -19,3 +19,8 @@ select 1 from (select 1) where 1=2 -- empty result set so that test passes
 {% macro sqlserver__test_quantile_transformer_result_with_tolerance() %}
 select null as '1' where 1=2 -- empty result set so that test passes
 {% endmacro %}
+
+-- testing macro not supported in postgres
+{% macro postgres__test_quantile_transformer_result_with_tolerance() %}
+select null where 1=2 -- empty result set so that test passes
+{% endmacro %}
\ No newline at end of file
diff --git a/integration_tests/models/sql/schema.yml b/integration_tests/models/sql/schema.yml
index 8dbfa79..30cca2f 100644
--- a/integration_tests/models/sql/schema.yml
+++ b/integration_tests/models/sql/schema.yml
@@ -19,7 +19,7 @@ models:
           target_join_column: id_col
           source_numeric_column_name: col_to_scale_scaled
           target_numeric_column_name: col_to_scale_scaled
-          percentage_tolerance: 0.00000001
+          percentage_tolerance: 0.009

   - name: test_min_max_scaler_with_column_selection
     tests:
@@ -29,7 +29,7 @@ models:
           target_join_column: id_col
           source_numeric_column_name: col_to_scale_scaled
           target_numeric_column_name: col_to_scale_scaled
-          percentage_tolerance: 0.00000001
+          percentage_tolerance: 0.009

   - name: test_k_bins_discretizer_default_bins
     tests:
diff --git a/macros/label_encoder.sql b/macros/label_encoder.sql
index 799c542..2de0f66 100644
--- a/macros/label_encoder.sql
+++ b/macros/label_encoder.sql
@@ -57,3 +57,7 @@ from {{ source_table }}
 {% macro sqlserver__label_encoder(source_table,source_column,include_columns) %}
     {% do return( dbt_ml_preprocessing.redshift__label_encoder(source_table,source_column,include_columns)) %}
 {%- endmacro %}
+
+{% macro postgres__label_encoder(source_table,source_column,include_columns) %}
+    {% do return( dbt_ml_preprocessing.redshift__label_encoder(source_table,source_column,include_columns)) %}
+{% endmacro %}
\ No newline at end of file
diff --git a/macros/quantile_transformer.sql b/macros/quantile_transformer.sql
index d4382db..a1594b5 100644
--- a/macros/quantile_transformer.sql
+++ b/macros/quantile_transformer.sql
@@ -70,4 +70,8 @@ from linear_interpolation_variables
 The `quantile_transformer` macro is only supported on Snowflake and BigQuery at this time. It should work on other DBs, it just requires some rework.
 {% endset %}
 {%- do exceptions.raise_compiler_error(error_message) -%}
+{% endmacro %}
+
+{% macro postgres__quantile_transformer(source_table,source_column,n_quantiles,output_distribution,subsample,include_columns) %}
+    {% do return( dbt_ml_preprocessing.bigquery__quantile_transformer(source_table,source_column,n_quantiles,output_distribution,subsample,include_columns)) %}
 {% endmacro %}
\ No newline at end of file
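
Review note: the adapter-specific macros added in this diff rely on dbt selecting an implementation by its name prefix. The sketch below is a simplified Python model of that lookup, not dbt's actual code. `resolve_macro` and the `defined` set are hypothetical names used only for illustration, and the model assumes just two steps: try `<adapter>__<name>`, then fall back to `default__<name>`. It shows why the prefix has to match the adapter type exactly.

```python
from typing import Iterable, Optional


def resolve_macro(adapter_type: str, macro_name: str, defined: Iterable[str]) -> Optional[str]:
    """Return the first defined implementation for macro_name, or None.

    Simplified model (assumption): try the adapter-specific name first,
    then fall back to the default__ implementation.
    """
    available = set(defined)
    for candidate in (f"{adapter_type}__{macro_name}", f"default__{macro_name}"):
        if candidate in available:
            return candidate
    return None


# Macro names as they might appear in a package (illustrative only).
defined = {
    "bigquery__quantile_transformer",
    "default__quantile_transformer",
    "postgres__label_encoder",
    "redshift__label_encoder",
}

# Exact prefix match: the adapter-specific implementation is picked.
print(resolve_macro("postgres", "label_encoder", defined))

# No postgres__ implementation is defined, so the default is used instead.
print(resolve_macro("postgres", "quantile_transformer", defined))

# A misspelled prefix such as "postgre__" is never tried for the postgres adapter.
misspelled = defined | {"postgre__quantile_transformer"}
print(resolve_macro("postgres", "quantile_transformer", misspelled))
```

Under this model, a macro whose prefix is misspelled is simply never selected, and the call lands on the fallback. For `quantile_transformer`, the fallback visible in this diff raises a compiler error, so a mismatch would show up at compile time rather than as wrong results.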