fix: OneHotEncoder no longer creates duplicate column names #271

zzril · 2023-05-05T09:17:56Z

Closes #201.

Summary of Changes

Changed OneHotEncoder to manually implement the encoding.
(Breaking) Changed the format of newly generated columns to use two underscores as separator. In case of naming conflicts, a hash and a unique ID will be appended to the column name.
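
The naming scheme described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the function `generate_column_names` and its parameters are hypothetical, and it assumes that "a hash and a unique ID" means a `#` character followed by a running counter.

```python
from __future__ import annotations


def generate_column_names(
    values_by_column: dict[str, list[str]],
    existing_names: set[str],
) -> dict[tuple[str, str], str]:
    """Map each (column, value) pair to a unique new column name (hypothetical helper)."""
    taken = set(existing_names)
    mapping: dict[tuple[str, str], str] = {}

    for column, values in values_by_column.items():
        for value in values:
            base_name = f"{column}__{value}"  # two underscores as separator
            name = base_name
            counter = 1
            # Assumed conflict rule: append "#" and a running ID until the name is free.
            while name in taken:
                counter += 1
                name = f"{base_name}#{counter}"
            taken.add(name)
            mapping[(column, value)] = name

    return mapping


# Columns "a" (with value "b__c") and "a__b" (with value "c") would both yield "a__b__c".
names = generate_column_names(
    {"a": ["b__c", "x"], "a__b": ["c"]},
    existing_names={"a", "a__b"},
)
print(names)
```

With a single underscore as separator, such collisions are more likely, because a single underscore is common in ordinary column names and values.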


lars-reimann · 2023-05-05T09:19:37Z

🦙 MegaLinter status: ✅ SUCCESS

| Descriptor | Linter | Files | Fixed | Errors | Elapsed time |
|---|---|---|---|---|---|
| ✅ PYTHON | black | 7 | 0 | 0 | 0.86s |
| ✅ PYTHON | mypy | 7 | | 0 | 1.8s |
| ✅ PYTHON | ruff | 7 | 0 | 0 | 0.05s |
| ✅ REPOSITORY | git_diff | yes | | no | 0.03s |

See detailed report in MegaLinter reports
Set VALIDATE_ALL_CODEBASE: true in mega-linter.yml to validate all sources, not only the diff


codecov · 2023-05-09T11:02:56Z

Codecov Report

Merging #271 (07e6adc) into main (8db5914) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main      #271   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           43        43           
  Lines         1761      1786   +25     
=========================================
+ Hits          1761      1786   +25

| Impacted Files | Coverage Δ |
|---|---|
| src/safeds/data/tabular/containers/_table.py | `100.00% <ø> (ø)` |
| src/safeds/exceptions/__init__.py | `100.00% <ø> (ø)` |
| ...ds/data/tabular/transformation/_one_hot_encoder.py | `100.00% <100.00%> (ø)` |
| src/safeds/exceptions/_data.py | `100.00% <100.00%> (ø)` |


zzril · 2023-05-09T14:16:47Z

Note that this breaks code that depends on the old column renaming schema (single underscore as separator). Not sure if the keyword in the PR message is enough for that.

Also note that this PR does not yet include performance tests.


lars-reimann · 2023-05-09T20:30:12Z

> Note that this breaks code that depends on the old column renaming schema (single underscore as separator). Not sure if the keyword in the PR message is enough for that.

While the version of this library is in the 0.y.z range, we don't need to pay much attention to breaking changes (see this). Still good to mention this as you've done.

lars-reimann · 2023-05-09T20:34:50Z

Once #301 is implemented, we can shorten the implementation of transform and inverse_transform a little. But no need to wait for that.
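
For readers unfamiliar with what a manual `transform` / `inverse_transform` pair involves, here is a minimal sketch on plain dictionaries of lists. It is not the implementation from this PR: `TinyOneHotEncoder` is a made-up class, the exception is a local stand-in that only borrows the name `ValueNotPresentWhenFittedError` from the commit list (the library's real signature may differ), and name-conflict handling is left out for brevity.

```python
from __future__ import annotations


class ValueNotPresentWhenFittedError(Exception):
    """Local stand-in; the library's real exception may take different arguments."""


class TinyOneHotEncoder:
    """Hypothetical encoder working on tables given as {column name: list of values}."""

    def __init__(self) -> None:
        # column -> value -> name of the generated column; None until fitted
        self._new_names: dict[str, dict[str, str]] | None = None

    def fit(self, table: dict[str, list], column_names: list[str]) -> TinyOneHotEncoder:
        self._new_names = {
            # dict.fromkeys keeps the first-seen order of the distinct values
            column: {value: f"{column}__{value}" for value in dict.fromkeys(table[column])}
            for column in column_names
        }
        return self

    def transform(self, table: dict[str, list]) -> dict[str, list]:
        if self._new_names is None:
            raise RuntimeError("The encoder has not been fitted yet.")
        result: dict[str, list] = {}
        for column, values in table.items():
            if column not in self._new_names:
                result[column] = list(values)  # untouched columns are copied over
                continue
            mapping = self._new_names[column]
            for value in values:
                if value not in mapping:
                    raise ValueNotPresentWhenFittedError(f"{value!r} in column {column!r}")
            for value, new_name in mapping.items():
                result[new_name] = [1.0 if v == value else 0.0 for v in values]
        return result

    def inverse_transform(self, table: dict[str, list]) -> dict[str, list]:
        if self._new_names is None:
            raise RuntimeError("The encoder has not been fitted yet.")
        result = {name: list(values) for name, values in table.items()}
        n_rows = len(next(iter(table.values())))
        for column, mapping in self._new_names.items():
            decoded: list = [None] * n_rows
            for value, new_name in mapping.items():
                for row, flag in enumerate(result.pop(new_name)):
                    if flag == 1.0:
                        decoded[row] = value
            result[column] = decoded  # note: the restored column ends up last
        return result


table = {"color": ["red", "blue", "red"], "size": [1, 2, 3]}
encoder = TinyOneHotEncoder().fit(table, ["color"])
encoded = encoder.transform(table)
restored = encoder.inverse_transform(encoded)
print(encoded)
print(restored)
```

Keeping the value-to-column mapping from `fit` is what makes `inverse_transform` possible without parsing the generated column names back apart.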

lars-reimann

Looks great, thanks!

## [0.12.0](v0.11.0...v0.12.0) (2023-05-11)

### Features

* add `learning_rate` to AdaBoost classifier and regressor. ([#251](#251)) ([7f74440](7f74440)), closes [#167](#167)
* add alpha parameter to `lasso_regression` ([#232](#232)) ([b5050b9](b5050b9)), closes [#163](#163)
* add parameter `lasso_ratio` to `ElasticNetRegression` ([#237](#237)) ([4a1a736](4a1a736)), closes [#166](#166)
* Add parameter `number_of_tree` to `RandomForest` classifier and regressor ([#230](#230)) ([414336a](414336a)), closes [#161](#161)
* Added `Table.plot_boxplots` to plot a boxplot for each numerical column in the table ([#254](#254)) ([0203a0c](0203a0c)), closes [#156](#156) [#239](#239)
* Added `Table.plot_histograms` to plot a histogram for each column in the table ([#252](#252)) ([e27d410](e27d410)), closes [#157](#157)
* Added `Table.transform_table` method which returns the transformed Table ([#229](#229)) ([0a9ce72](0a9ce72)), closes [#110](#110)
* Added alpha parameter to `RidgeRegression` ([#231](#231)) ([1ddc948](1ddc948)), closes [#164](#164)
* Added Column#transform ([#270](#270)) ([40fb756](40fb756)), closes [#255](#255)
* Added method `Table.inverse_transform_table` which returns the original table ([#227](#227)) ([846bf23](846bf23)), closes [#111](#111)
* Added parameter `c` to `SupportVectorMachines` ([#267](#267)) ([a88eb8b](a88eb8b)), closes [#169](#169)
* Added parameter `maximum_number_of_learner` and `learner` to `AdaBoost` ([#269](#269)) ([bb5a07e](bb5a07e)), closes [#171](#171) [#173](#173)
* Added parameter `number_of_trees` to `GradientBoosting` ([#268](#268)) ([766f2ff](766f2ff)), closes [#170](#170)
* Allow arguments of type pathlib.Path for file I/O methods ([#228](#228)) ([2b58c82](2b58c82)), closes [#146](#146)
* convert `Schema` to `dict` and format it nicely in a notebook ([#244](#244)) ([ad1cac5](ad1cac5)), closes [#151](#151)
* Convert between Excel file and `Table` ([#233](#233)) ([0d7a998](0d7a998)), closes [#138](#138) [#139](#139)
* convert containers for tabular data to HTML ([#243](#243)) ([683c279](683c279)), closes [#140](#140)
* make `Column` a subclass of `Sequence` ([#245](#245)) ([a35b943](a35b943))
* mark optional hyperparameters as keyword only ([#296](#296)) ([44a41eb](44a41eb)), closes [#278](#278)
* move exceptions back to common package ([#295](#295)) ([a91172c](a91172c)), closes [#177](#177) [#262](#262)
* precision metric for classification ([#272](#272)) ([5adadad](5adadad)), closes [#185](#185)
* Raise error if an untagged table is used instead of a `TaggedTable` ([#234](#234)) ([8eea3dd](8eea3dd)), closes [#192](#192)
* recall and F1-score metrics for classification ([#277](#277)) ([2cf93cc](2cf93cc)), closes [#187](#187) [#186](#186)
* replace prefix `n` with `number_of` ([#250](#250)) ([f4f44a6](f4f44a6)), closes [#171](#171)
* set `alpha` parameter for regularization of `ElasticNetRegression` ([#238](#238)) ([e642d1d](e642d1d)), closes [#165](#165)
* Set `column_names` in `fit` methods of table transformers to be required ([#225](#225)) ([2856296](2856296)), closes [#179](#179)
* set learning rate of Gradient Boosting models ([#253](#253)) ([9ffaf55](9ffaf55)), closes [#168](#168)
* Support vector machine for regression and for classification ([#236](#236)) ([7f6c3bd](7f6c3bd)), closes [#154](#154)
* usable constructor for `Table` ([#294](#294)) ([56a1fc4](56a1fc4)), closes [#266](#266)
* usable constructor for `TaggedTable` ([#299](#299)) ([01c3ad9](01c3ad9)), closes [#293](#293)

### Bug Fixes

* OneHotEncoder no longer creates duplicate column names ([#271](#271)) ([f604666](f604666)), closes [#201](#201)
* selectively ignore one warning instead of all warnings ([#235](#235)) ([3aad07d](3aad07d))

lars-reimann · 2023-05-11T20:03:31Z

🎉 This PR is included in version 0.12.0 🎉

The release is available on:

v0.12.0
GitHub release

Your semantic-release bot 📦🚀

zzril and others added 4 commits April 28, 2023 12:03

add check for detecting duplicate names

40b2065

Refactor column name generation

f57b9d5

Also count the number of occurrences of column names

Integrate counter into column name generation

423480f

Tests run through; column name format not yet as specified (one "_" instead of two)

Create mapping of concrete values to new columns

4ed8cbf

Co-authored-by: ilkajw <123072184+ilkajw@users.noreply.github.com>

zzril linked an issue May 5, 2023 that may be closed by this pull request

OneHotEncoder can accidentally create columns with same name #201

Closed

zzril and others added 4 commits May 5, 2023 13:38

Remove wrapped OneHotEncoder in transform method

7ab806d

Co-authored-by: ilkajw <123072184+ilkajw@users.noreply.github.com>

Make inverse transform raise NotImplementedError

5ac6f6d

Finish transform

f4a9bce

reverse_transform still missing

Finish inverse_transform

2ffeec3

Still need to fix one test which checks the wrapped_encoder. Still need to change single to double underscore and update tests accordingly. Co-authored-by: ilkajw <123072184+ilkajw@users.noreply.github.com>

zzril changed the title ~~201 onehotencoder can accidentally create columns with same name~~ Fix #201: onehotencoder can accidentally create columns with same name May 5, 2023

zzril changed the title ~~Fix #201: onehotencoder can accidentally create columns with same name~~ fix #201: onehotencoder can accidentally create columns with same name May 8, 2023

zzril added 2 commits May 9, 2023 12:35

Fix warnings and code style

2a19b05

Fix test case to not check for dropped attribute

e1be725

zzril changed the title ~~fix #201: onehotencoder can accidentally create columns with same name~~ fix #201: OneHotEncoder no longer creates duplicate column names May 9, 2023

zzril changed the title ~~fix #201: OneHotEncoder no longer creates duplicate column names~~ fix: OneHotEncoder no longer creates duplicate column names May 9, 2023

Merge branch 'main' into 201-onehotencoder-can-accidentally-create-columns-with-same-name

99d8c5b

zzril and others added 10 commits May 9, 2023 13:11

Change OneHotEncoder column format

867d5b0

Changed to double underscore. Adapted tests accordingly.

Fix MegaLinter warnings

68dfe9d

Inline is_fitted method to help MegaLinter

b9d609c

Fix docstrings for OneHotEncoder for new format

a10a179

Fix docstrings in _table.py for new format

3e32857

style: apply automated linter fixes

662a175

Add tests for conflicting column names

36eb22f

Add Exception for value not present when fitted

2e28ecd

Fix imports

9923aa6

Add test for ValueNotPresentWhenFittedError

f2e347d

zzril marked this pull request as ready for review May 9, 2023 14:13

zzril requested a review from a team as a code owner May 9, 2023 14:13

lars-reimann reviewed May 9, 2023

src/safeds/data/tabular/transformation/_one_hot_encoder.py (outdated, resolved)

src/safeds/data/tabular/transformation/_one_hot_encoder.py (outdated, resolved)

Remove unnecessary copies in OneHotEncoder

07e6adc

lars-reimann approved these changes May 10, 2023


lars-reimann merged commit f604666 into main May 10, 2023

lars-reimann deleted the 201-onehotencoder-can-accidentally-create-columns-with-same-name branch May 10, 2023 18:19

lars-reimann added the released Included in a release label May 11, 2023
