Reduce the memory usage of Gaussian Copula training. #233
Description
I added a new data_transformer and applied it in GaussianCopula. In terms of effectiveness:
Simplified continuous data processing
Different discrete data encoding
More suitable for statistical models
Motivation and Context
See Issue : #194
Currently, training Gaussian Copula models on high-cardinality discrete data runs into memory issues during fitting, which were traced to sharing the One-Hot encoding with CTGAN. For statistical models, this PR proposes frequency-based encoding instead, which significantly reduces memory consumption and speeds up fitting.
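As a minimal sketch of the idea (the function names and API here are illustrative, not the actual transformer added in this PR): each category is assigned a slice of [0, 1] proportional to its frequency, and rows are encoded as the midpoint of their category's slice. A discrete column thus becomes a single float column, instead of one column per category as with One-Hot.

```python
import numpy as np
import pandas as pd

def frequency_encode(column: pd.Series) -> tuple[pd.Series, dict]:
    """Map each category to the midpoint of its cumulative-frequency interval.

    Each category occupies an interval of [0, 1] proportional to its
    frequency; the encoded value is the interval midpoint, so the whole
    column fits in one float column regardless of cardinality.
    """
    freqs = column.value_counts(normalize=True)
    starts = freqs.cumsum() - freqs            # interval start per category
    midpoints = (starts + freqs / 2).to_dict()
    return column.map(midpoints), midpoints

def frequency_decode(encoded: pd.Series, midpoints: dict) -> pd.Series:
    """Invert the encoding by snapping each value to the nearest midpoint."""
    cats = list(midpoints)
    mids = np.array([midpoints[c] for c in cats])
    idx = np.abs(encoded.to_numpy()[:, None] - mids[None, :]).argmin(axis=1)
    return pd.Series([cats[i] for i in idx], index=encoded.index)
```

Because the decoder only needs the midpoint table, the transform is reversible after sampling from the fitted copula.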
How has this been tested?
Performance test cases can consume a lot of time, so I used memray to profile this issue; peak memory dropped from over 30 GB to under 3 GB. This was the memory test result for fitting the model on the issue's original data:

This was a quality evaluation comparing the original data (first 1000 rows) against a 1000-row sample, evaluated with sdmetrics:
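The order of magnitude of the reported savings follows from simple arithmetic: a dense One-Hot matrix costs one float column per category, while frequency encoding always costs exactly one. The row and category counts below are hypothetical, chosen only to illustrate the scale; actual peak memory also depends on other structures allocated during fitting.

```python
def onehot_bytes(n_rows: int, n_categories: int) -> int:
    """Memory for a dense float64 one-hot matrix: one column per category."""
    return n_rows * n_categories * 8

def frequency_bytes(n_rows: int) -> int:
    """Frequency encoding uses a single float64 column, whatever the cardinality."""
    return n_rows * 8

rows, cats = 1_000_000, 5_000          # hypothetical high-cardinality column
print(onehot_bytes(rows, cats) / 2**30)    # ≈ 37 GiB for one-hot
print(frequency_bytes(rows) / 2**30)       # well under 0.01 GiB for frequency encoding
```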
Types of changes
Checklist: