Reduce the memory usage of Gaussian Copula training. #233

jalr4ever · 2024-11-06T14:02:44Z

Description

I added a new data transformer, `StatisticDataTransformer`, and applied it in GaussianCopula.

In terms of effectiveness:

Simplified continuous data processing

Fit a single normal distribution per column instead of a mixture of multiple Gaussians.
Smaller output dimension (one column instead of several).

Different discrete data encoding

Use frequency encoding instead of one-hot encoding.
Significantly reduces the output dimension.

More suitable for statistical models

Outputs simpler numerical representations that retain the statistical properties of the data.
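
To make the two ideas above concrete, here is a minimal standard-library sketch. It is not the code from this PR: the function names are hypothetical, and mapping each category to the midpoint of a frequency-sized interval in [0, 1) is an assumption about one common form of frequency encoding. It shows a continuous column standardized with a single mean and standard deviation, and a discrete column encoded as one float per row.

```python
from bisect import bisect_right
from collections import Counter
from statistics import fmean, pstdev


def fit_continuous(values):
    """Summarize a numeric column with a single normal: its mean and std."""
    mean = fmean(values)
    std = pstdev(values) or 1.0  # fall back to 1.0 for constant columns
    return mean, std


def transform_continuous(values, params):
    """Standardize each value; the output is still one column."""
    mean, std = params
    return [(v - mean) / std for v in values]


def inverse_transform_continuous(encoded, params):
    mean, std = params
    return [z * std + mean for z in encoded]


def fit_discrete(values):
    """Give each category a sub-interval of [0, 1) sized by its frequency."""
    counts = Counter(values)
    total = len(values)
    intervals = {}
    start = 0.0
    # Most frequent first; ties are broken by string form for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for category, count in ordered:
        end = start + count / total
        intervals[category] = (start, end)
        start = end
    return intervals


def transform_discrete(values, intervals):
    """Encode each value as the midpoint of its category's interval."""
    return [(intervals[v][0] + intervals[v][1]) / 2 for v in values]


def inverse_transform_discrete(encoded, intervals):
    """Map a float back to the category whose interval contains it."""
    categories = list(intervals)
    upper_bounds = [intervals[c][1] for c in categories]
    result = []
    for x in encoded:
        idx = min(bisect_right(upper_bounds, x), len(categories) - 1)
        result.append(categories[idx])
    return result


# Hypothetical toy columns, only for illustration.
ages = [20.0, 30.0, 40.0]
cities = ["a", "b", "a", "c", "a", "b"]

age_params = fit_continuous(ages)
city_intervals = fit_discrete(cities)
encoded_cities = transform_discrete(cities, city_intervals)

# One float per row, versus one column per category for one-hot encoding.
print(len(set(cities)), "one-hot columns vs 1 frequency-encoded column")
```

In this sketch the encoded width stays at one column however many distinct categories appear, which is the property the description relies on.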

Motivation and Context

See issue #194.
Currently, fitting Gaussian models on high-cardinality discrete data runs into memory issues. The cause was identified as the one-hot encoding shared with CTGAN. For statistical models, this PR uses frequency-based encoding instead, which significantly reduces memory consumption and speeds up fitting.
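
A back-of-the-envelope estimate shows why the encoded width matters. The row count and cardinality below are hypothetical, not taken from issue #194, and the estimate assumes a dense float64 matrix (8 bytes per value).

```python
def dense_bytes(n_rows, n_columns, bytes_per_value=8):
    """Size of a dense numeric matrix, assuming float64 by default."""
    return n_rows * n_columns * bytes_per_value


# Hypothetical dataset shape, chosen only for illustration.
n_rows = 100_000
cardinality = 10_000  # distinct values in one discrete column

one_hot_bytes = dense_bytes(n_rows, cardinality)  # one column per category
frequency_bytes = dense_bytes(n_rows, 1)          # one column in total

print(f"one-hot:   {one_hot_bytes / 1e9:.1f} GB")    # one-hot:   8.0 GB
print(f"frequency: {frequency_bytes / 1e6:.1f} MB")  # frequency: 0.8 MB
```

The ratio between the two equals the column's cardinality, so the saving grows with the number of distinct values.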

How has this been tested?

Performance test cases can take a long time to run.

I used memray to profile this issue: memory usage dropped from over 30 GB to under 3 GB.
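
For a quick local comparison without memray, the standard-library `tracemalloc` module can report peak Python allocations. The sketch below swaps in `tracemalloc` for memray and uses toy pure-Python encoders with hypothetical names. It does not reproduce the measurement above; it only illustrates how a peak-memory comparison between a wide and a narrow encoding might be set up.

```python
import tracemalloc


def peak_bytes(fn, *args):
    """Run fn(*args) and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def one_hot(values):
    """Toy one-hot encoder: one float per category for every row."""
    index = {c: i for i, c in enumerate(sorted(set(values)))}
    rows = []
    for v in values:
        row = [0.0] * len(index)
        row[index[v]] = 1.0
        rows.append(row)
    return rows


def single_column(values):
    """Toy single-column encoder: one float per row."""
    index = {c: i for i, c in enumerate(sorted(set(values)))}
    n = len(index)
    return [(index[v] + 0.5) / n for v in values]


# Hypothetical column: 5,000 rows with 500 distinct values.
values = [i % 500 for i in range(5000)]

wide_peak = peak_bytes(one_hot, values)
narrow_peak = peak_bytes(single_column, values)
print(f"one-hot peak:       {wide_peak / 1e6:.1f} MB")
print(f"single-column peak: {narrow_peak / 1e6:.1f} MB")
```

`tracemalloc` only sees allocations made through Python's allocator, so for native-heavy workloads a profiler such as memray gives a fuller picture.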

Memory test result for fitting the model on the original data from the issue (image not preserved).

Quality evaluation of the original data (first 1,000 rows) against a sample (1,000 rows), computed with sdmetrics:

Generating report ...

(1/2) Evaluating Column Shapes: |██████████| 134/134 [00:00<00:00, 569.54it/s]|
Column Shapes Score: 83.49%

(2/2) Evaluating Column Pair Trends: |██████████| 8911/8911 [03:26<00:00, 43.14it/s]| 
Column Pair Trends Score: 73.19%

Overall Score (Average): 78.34%

             Property     Score
0       Column Shapes  0.834902
1  Column Pair Trends  0.731938
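
As the "Overall Score (Average)" label indicates, the overall figure is the mean of the two property scores. A short check using the values from the table above:

```python
# Property scores copied from the report table above.
scores = {"Column Shapes": 0.834902, "Column Pair Trends": 0.731938}

overall = sum(scores.values()) / len(scores)
print(f"Overall Score (Average): {overall:.2%}")  # Overall Score (Average): 78.34%
```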

Types of changes

Maintenance (no change in code, maintain the project's CI, docs, etc.)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.

sweep-ai · 2024-11-06T14:02:54Z

Hey @jalr4ever, here is an example of how you can ask me to improve this pull request:

@Sweep Add unit tests for the `StatisticDataTransformer` class to verify:
1. Correct transformation of continuous columns using `_fit_continuous` and `_transform_continuous` methods
2. Correct inverse transformation of continuous columns using `_inverse_transform_continuous`
3. Correct transformation of discrete columns using `_fit_discrete` and `_transform_discrete` methods
4. Correct inverse transformation of discrete columns using `_inverse_transform_discrete`
5. Handling of edge cases like empty dataframes, columns with missing values, and columns with single unique value

📖 For more information on how to use Sweep, please read our documentation.

Wh1isper

Awesome!

feat: fix high memory issues in Gaussian copula fitting for high cardinality discrete columns based on frequency encoding.

77040e6

jalr4ever marked this pull request as ready for review November 7, 2024 09:19

jalr4ever requested a review from Wh1isper November 7, 2024 09:19

Wh1isper linked an issue Nov 7, 2024 that may be closed by this pull request

Performance issues with GaussianCopula training on tabular data #194

Closed

Wh1isper approved these changes Nov 7, 2024

jalr4ever merged commit 8eb395b into main Nov 7, 2024
12 checks passed

jalr4ever deleted the jalr4ever-patch-performance-for-gaussian branch November 21, 2024 07:14
