-
Notifications
You must be signed in to change notification settings - Fork 550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regulate positive-negative values in the generated data #232
Conversation
…and add functional test cases.
…valid discrete columns and refactor part of this.
…nerator into jalr4ever-patch-values-polarities
Hey @jalr4ever, here is an example of how you can ask me to improve this pull request: @Sweep Add unit tests for the `PositiveNegativeFilter` class to verify that it correctly identifies and handles positive and negative columns. The test should include scenarios with mixed numeric columns, all-positive columns, and all-negative columns. 📖 For more information on how to use Sweep, please read our documentation. |
I'm not too familiar with the original design of this piece, but it looks OK. cc @MooooCat |
@Wh1isper I have made many attempts, but I don't know why the CI/CD test cases are failing. Can you help me take a look? It works fine on my local machine. 😩 |
It seems the mock data was changed somehow |
@Wh1isper 🫠 Yeah, what should I do? |
I reproduced it locally by logging. It seems to be caused by class variable pollution. For some unknown reason, it is the variable pollution caused by the same field names of 2024-11-07 11:33:12.401 | INFO | sdgx.data_processors.transformers.nan:fit:71 - NonValueTransformer Fitted.
2024-11-07 11:33:12.401 | INFO | sdgx.data_processors.transformers.nan:fit:74 - NonValueTransformer get column list from metadata: ['int_id', 'pos_int', 'neg_int', 'pos_float', 'neg_float', 'mixed_int', 'mixed_float'].
2024-11-07 11:33:12.401 | INFO | sdgx.data_processors.transformers.nan:fit:75 - Metadata int columns: {'int_id', 'pos_int', 'mixed_int', 'neg_int'}.
2024-11-07 11:33:12.401 | INFO | sdgx.data_processors.transformers.nan:fit:76 - Metadata float columns: {'pos_float', 'mixed_float', 'neg_float'}.
2024-11-07 11:33:12.401 | INFO | sdgx.data_processors.transformers.nan:fit:77 - Self int_columns: {'hours-per-week', 'educational-num', 'fnlwgt', 'capital-loss', 'age', 'capital-gain'}.
2024-11-07 11:33:12.402 | INFO | sdgx.data_processors.transformers.nan:fit:93 - NonValueTransformer get int columns: {'pos_int', 'hours-per-week', 'educational-num', 'mixed_int', 'fnlwgt', 'capital-loss', 'neg_int', 'age', 'capital-gain'}.
2024-11-07 11:33:12.402 | INFO | sdgx.data_processors.transformers.nan:fit:102 - NonValueTransformer get float columns: {'pos_float', 'mixed_float', 'neg_float'}.
|
…when fitting a new dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should do clean
in fit
, consider following snip
for data in dataloader:
formater.fit(data)
We can expose clear
as a public method in every subclass and baseclass(remine empty in baseclass), call it when needed. What do you think?
For For your suggestion: I'm not quite sure where to appropriately call this after setting it to public based on your suggestion. My thought is to call it when obtaining processors ( |
…rt `fit()` the processors
Cause we're using mutable objects. class A:
x: list = []
def callx(self):
self.x.append("1") # THIS IS USING A.x!!
a = A()
a.callx()
print(A.x) # [1] We should imp transformer as follow: class A:
x: list
def __init__(self):
self.x = []
def callx(self):
self.x.append("1")
a = A()
a.callx()
print(A.x) # AttributeError: type object 'A' has no attribute 'x'
print(a.x) # [1] |
I recommend checking all inspect and transforms As for the |
Okay. I think we should conclude this PR here. As for the issue with the variable class in the processor, it's a refactoring issue, so let's address it in a new PR. Let's merge this one first; what do you think? |
Acturally it's a bug and no conflict with the original design, can you do it in this pr? If not, we cloud leave it in next pr, but this pr should not include |
Given that we'll be fixing the problem immediately, a temporary failure of the CI is acceptable, in which case I agree to merge this pr first. |
🧐 Okay, The ❌ of CI/CD is annoying. I would have try if |
…e variables avoid variable pollution.
@Wh1isper The approach you mentioned works out. and I have already submitted it. Could you help me review it? Furthermore, I suddenly realized this is a basic question that I couldn't figure out. 😅 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for fixing this!
…m/hitsz-ids/synthetic-data-generator into jalr4ever-patch-values-polarities
Description
I updated the logic of the filter module, fixed the issue where positive and negative values were not correctly applied during sampling, and added a test package
synthesizer
for establishing functional tests related to synthesizers.Additionally, I founded that a series of logical issues during ctgan training; I refactored this part to make its logic clearer and added some comments explaining its background.
Moreover, I believe that some built-in filter can be integrated, such as the range value filter.
Motivation and Context
Related:
How has this been tested?
See
tests/synthesizer/test_ctgan_synthesizer.py
Types of changes
Checklist: