Reduce the memory usage of Gaussian Copula training. #233

Merged on Nov 7, 2024 (1 commit)

@jalr4ever jalr4ever commented Nov 6, 2024

Description

I added a new data transformer and applied it in GaussianCopula.

In terms of effectiveness:

Simplified continuous data processing

  • Use a single normal distribution instead of multiple Gaussian distributions.
  • Smaller output dimension (1D vs multidimensional).
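
As a rough illustration of the single-normal idea above, here is a minimal sketch. It is not the code from this PR: the function names are hypothetical, and it assumes the transform is a plain standardization with the column's mean and standard deviation, which yields exactly one output column per continuous input column.

```python
import numpy as np


def fit_continuous(values):
    """Estimate the parameters of a single normal distribution (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    mean = float(values.mean())
    std = float(values.std())
    # Guard against constant columns, where the standard deviation is zero.
    return mean, (std if std > 0 else 1.0)


def transform_continuous(values, mean, std):
    """Standardize the column; the result always has exactly one output column."""
    return ((np.asarray(values, dtype=float) - mean) / std).reshape(-1, 1)


def inverse_transform_continuous(encoded, mean, std):
    """Map standardized values back to the original scale."""
    return np.asarray(encoded, dtype=float).ravel() * std + mean


ages = np.array([23.0, 35.0, 41.0, 58.0, 62.0])
mean, std = fit_continuous(ages)
encoded = transform_continuous(ages, mean, std)
restored = inverse_transform_continuous(encoded, mean, std)
print(encoded.shape)  # one column, regardless of how the values are distributed
```

A mixture-based transform would typically emit a scalar plus a one-hot mode indicator per column, so the width grows with the number of mixture components; the sketch keeps the width fixed at one.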

Different discrete data encoding

  • Use frequency encoding instead of one-hot encoding.
  • Significantly reduces the output dimension.
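
The difference in output width can be sketched as follows. This is an assumption-laden illustration, not the transformer added in this PR: the helper names are hypothetical, each category is mapped to its relative frequency, and the inverse picks the category with the nearest frequency, which is ambiguous when two categories share the same frequency.

```python
import numpy as np
import pandas as pd


def fit_frequency(column: pd.Series) -> pd.Series:
    """Return each category's relative frequency, indexed by category."""
    return column.value_counts(normalize=True)


def transform_frequency(column: pd.Series, freqs: pd.Series) -> np.ndarray:
    """Replace every category with its frequency; output is a single column."""
    return column.map(freqs).to_numpy(dtype=float).reshape(-1, 1)


def inverse_transform_frequency(encoded, freqs: pd.Series) -> np.ndarray:
    """Map each value back to the category whose frequency is closest."""
    values = np.asarray(encoded, dtype=float).ravel()
    distances = np.abs(values[:, None] - freqs.to_numpy()[None, :])
    return freqs.index.to_numpy()[distances.argmin(axis=1)]


city = pd.Series(["a", "a", "a", "b", "b", "c"])
freqs = fit_frequency(city)
encoded = transform_frequency(city, freqs)
one_hot = pd.get_dummies(city)
decoded = inverse_transform_frequency(encoded, freqs)

print(encoded.shape)  # one column for the whole discrete column
print(one_hot.shape)  # one column per distinct category
```

With one-hot encoding the width equals the number of distinct categories, so a column with thousands of categories produces thousands of output columns; the frequency-encoded column stays at width one.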

More suitable for statistical models

  • Outputs simpler numerical representations that retain the statistical properties of the data.

Motivation and Context

See issue #194.
Training Gaussian models on high-cardinality discrete data currently runs into memory problems. The cause was identified as the one-hot encoding that GaussianCopula shares with CTGAN. This PR uses frequency-based encoding for statistical models instead, which significantly reduces memory consumption and speeds up fitting.
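
To see why one-hot encoding becomes a problem at high cardinality, the following back-of-the-envelope sketch estimates the size of a dense float64 matrix under both encodings. The row count and cardinalities are hypothetical and are not taken from the dataset in the issue.

```python
def dense_matrix_bytes(n_rows: int, n_cols: int, itemsize: int = 8) -> int:
    """Size in bytes of a dense matrix with the given shape (float64 by default)."""
    return n_rows * n_cols * itemsize


# Hypothetical table: 100,000 rows and three high-cardinality discrete columns.
n_rows = 100_000
cardinalities = [5_000, 2_000, 300]

one_hot_cols = sum(cardinalities)  # one output column per distinct category
freq_cols = len(cardinalities)     # one output column per discrete column

one_hot_bytes = dense_matrix_bytes(n_rows, one_hot_cols)
freq_bytes = dense_matrix_bytes(n_rows, freq_cols)

print(f"one-hot:   {one_hot_bytes / 1e9:.2f} GB")
print(f"frequency: {freq_bytes / 1e6:.2f} MB")
```

The encoded width also sets the size of any column-by-column dependence matrix a copula-style model estimates, so a narrower encoding helps beyond the transformed data itself.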

How has this been tested?

Performance test cases can consume a lot of time.

I used memray to profile this issue; memory usage dropped from over 30 GB to under 3 GB.
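
For a quick check without installing memray, a similar measurement can be sketched with the standard library's tracemalloc. This is a stand-in, not the procedure used for the numbers above: tracemalloc only sees allocations made through Python's allocators, and the workload function here is a hypothetical placeholder for the model's fit call.

```python
import tracemalloc


def peak_memory_bytes(func, *args, **kwargs):
    """Run func and return (result, peak traced bytes) as reported by tracemalloc."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def allocate_buffer(n_bytes):
    # Placeholder workload; in practice this would wrap the model fitting step.
    return bytearray(n_bytes)


_, peak = peak_memory_bytes(allocate_buffer, 10_000_000)
print(f"peak traced memory: {peak / 1e6:.1f} MB")
```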

Memory profile of fitting the model on the original data from the issue:
[Screenshot: iShot_2024-11-06_14 33 16]

Quality evaluation comparing the first 1,000 rows of the original data with 1,000 sampled rows, computed with sdmetrics:

Generating report ...

(1/2) Evaluating Column Shapes: |██████████| 134/134 [00:00<00:00, 569.54it/s]|
Column Shapes Score: 83.49%

(2/2) Evaluating Column Pair Trends: |██████████| 8911/8911 [03:26<00:00, 43.14it/s]| 
Column Pair Trends Score: 73.19%

Overall Score (Average): 78.34%

             Property     Score
0       Column Shapes  0.834902
1  Column Pair Trends  0.731938
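
The overall score in the report is the plain average of the two property scores, which can be checked directly from the table:

```python
# Property scores copied from the report table above.
column_shapes = 0.834902
column_pair_trends = 0.731938

# The report labels the overall score as the average of the two properties.
overall = (column_shapes + column_pair_trends) / 2
print(f"{overall:.2%}")  # prints "78.34%"
```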

Types of changes

  • Maintenance (no change in code, maintain the project's CI, docs, etc.)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

…inality discrete columns based on frequency encoding.

sweep-ai bot commented Nov 6, 2024

Hey @jalr4ever, here is an example of how you can ask me to improve this pull request:

@Sweep Add unit tests for the `StatisticDataTransformer` class to verify:
1. Correct transformation of continuous columns using `_fit_continuous` and `_transform_continuous` methods
2. Correct inverse transformation of continuous columns using `_inverse_transform_continuous`
3. Correct transformation of discrete columns using `_fit_discrete` and `_transform_discrete` methods
4. Correct inverse transformation of discrete columns using `_inverse_transform_discrete`
5. Handling of edge cases like empty dataframes, columns with missing values, and columns with single unique value

📖 For more information on how to use Sweep, please read our documentation.

@jalr4ever jalr4ever marked this pull request as ready for review November 7, 2024 09:19
@jalr4ever jalr4ever requested a review from Wh1isper November 7, 2024 09:19
@Wh1isper Wh1isper linked an issue Nov 7, 2024 that may be closed by this pull request

@Wh1isper Wh1isper left a comment

Awesome!

@jalr4ever jalr4ever merged commit 8eb395b into main Nov 7, 2024
12 checks passed
@jalr4ever jalr4ever deleted the jalr4ever-patch-performance-for-gaussian branch November 21, 2024 07:14
Development

Successfully merging this pull request may close these issues.

Performance issues with GaussianCopula training on tabular data