Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regulate positive-negative values in the generated data #232

Merged
merged 12 commits into from
Nov 7, 2024

Conversation

jalr4ever
Copy link
Collaborator

Description

I updated the logic of the filter module, fixed the issue where positive and negative values were not correctly applied during sampling, and added a test package synthesizer for establishing functional tests related to synthesizers.

Additionally, I founded that a series of logical issues during ctgan training; I refactored this part to make its logic clearer and added some comments explaining its background.

Moreover, I believe that some built-in filter can be integrated, such as the range value filter.

Motivation and Context

Related:

How has this been tested?

See tests/synthesizer/test_ctgan_synthesizer.py

Types of changes

  • Maintenance (no change in code, maintain the project's CI, docs, etc.)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

Copy link
Contributor

sweep-ai bot commented Nov 6, 2024

Hey @jalr4ever, here is an example of how you can ask me to improve this pull request:

@Sweep Add unit tests for the `PositiveNegativeFilter` class to verify that it correctly identifies and handles positive and negative columns. The test should include scenarios with mixed numeric columns, all-positive columns, and all-negative columns.

📖 For more information on how to use Sweep, please read our documentation.

@jalr4ever jalr4ever requested a review from Wh1isper November 6, 2024 12:56
@Wh1isper
Copy link
Collaborator

Wh1isper commented Nov 6, 2024

I'm not too familiar with the original design of this piece, but it looks OK.

cc @MooooCat

@Wh1isper Wh1isper requested a review from MooooCat November 6, 2024 13:02
@jalr4ever
Copy link
Collaborator Author

@Wh1isper I have made many attempts, but I don't know why the CI/CD test cases are failing. Can you help me take a look? It works fine on my local machine. 😩

@Wh1isper
Copy link
Collaborator

Wh1isper commented Nov 6, 2024

2024-11-06 13:51:26.531 | INFO | sdgx.synthesizer:fit:292 - Fitting data processors...
2024-11-06 13:51:26.547 | INFO | sdgx.data_processors.transformers.nan:fit:71 - NonValueTransformer Fitted.
2024-11-06 13:51:26.547 | INFO | sdgx.data_processors.transformers.nan:fit:87 - NonValueTransformer get int columns: {'mixed_int', 'hours-per-week', 'educational-num', 'pos_int', 'age', 'neg_int', 'fnlwgt', 'capital-loss', 'capital-gain'}.
2024-11-06 13:51:26.547 | INFO | sdgx.data_processors.transformers.nan:fit:96 - NonValueTransformer get float columns: {'neg_float', 'mixed_float', 'pos_float'}.
2024-11-06 13:51:26.547 | INFO | sdgx.data_processors.transformers.nan:fit:101 - NonValueTransformer get column list from metadata: ['int_id', 'pos_int', 'neg_int', 'pos_float', 'neg_float', 'mixed_int', 'mixed_float'].

It seems the mock data was changed somehow

@jalr4ever
Copy link
Collaborator Author

@Wh1isper 🫠 Yeah, what should I do?

@jalr4ever
Copy link
Collaborator Author

I reproduced it locally by logging. It seems to be caused by class variable pollution.

For some unknown reason, it is the variable pollution caused by the same field names of NonValueTransformer and OutlierTransformer, and further reasons are still being explored.

2024-11-07 11:33:12.401 | INFO     | sdgx.data_processors.transformers.nan:fit:71 - NonValueTransformer Fitted.
2024-11-07 11:33:12.401 | INFO     | sdgx.data_processors.transformers.nan:fit:74 - NonValueTransformer get column list from metadata: ['int_id', 'pos_int', 'neg_int', 'pos_float', 'neg_float', 'mixed_int', 'mixed_float'].
2024-11-07 11:33:12.401 | INFO     | sdgx.data_processors.transformers.nan:fit:75 - Metadata int columns: {'int_id', 'pos_int', 'mixed_int', 'neg_int'}.
2024-11-07 11:33:12.401 | INFO     | sdgx.data_processors.transformers.nan:fit:76 - Metadata float columns: {'pos_float', 'mixed_float', 'neg_float'}.
2024-11-07 11:33:12.401 | INFO     | sdgx.data_processors.transformers.nan:fit:77 - Self int_columns: {'hours-per-week', 'educational-num', 'fnlwgt', 'capital-loss', 'age', 'capital-gain'}.
2024-11-07 11:33:12.402 | INFO     | sdgx.data_processors.transformers.nan:fit:93 - NonValueTransformer get int columns: {'pos_int', 'hours-per-week', 'educational-num', 'mixed_int', 'fnlwgt', 'capital-loss', 'neg_int', 'age', 'capital-gain'}.
2024-11-07 11:33:12.402 | INFO     | sdgx.data_processors.transformers.nan:fit:102 - NonValueTransformer get float columns: {'pos_float', 'mixed_float', 'neg_float'}.

Copy link
Collaborator

@Wh1isper Wh1isper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should do clean in fit, consider following snip

for data in dataloader:
   formater.fit(data)

We can expose clear as a public method in every subclass and baseclass(remine empty in baseclass), call it when needed. What do you think?

@jalr4ever
Copy link
Collaborator Author

@Wh1isper

For _clear(): This is just a temporary solution. In fact, I don't know why this situation occurs; when I run this test case individually, there are no issues, but when running all cases at once, the problem arises. Basically, each processor should be new and there shouldn't be any "old data." I only added this _clear in fit to make it pass the test. Do you have any idea on this?

For your suggestion: I'm not quite sure where to appropriately call this after setting it to public based on your suggestion. My thought is to call it when obtaining processors ( sdgx/synthesizer.py#fit()-line-296 ) when starting the fit. How do you think?

@jalr4ever jalr4ever requested a review from Wh1isper November 7, 2024 06:25
@Wh1isper
Copy link
Collaborator

Wh1isper commented Nov 7, 2024

I don't know why this situation occurs

Cause we're using mutable objects.

class A:
    x: list = []

    def callx(self):
        self.x.append("1") # THIS IS USING A.x!!


a = A()
a.callx()
print(A.x)  # [1]

We should imp transformer as follow:

class A:
    x: list

    def __init__(self):
        self.x = []

    def callx(self):
        self.x.append("1")


a = A()
a.callx()
print(A.x)  # AttributeError: type object 'A' has no attribute 'x'
print(a.x)  # [1]

@Wh1isper
Copy link
Collaborator

Wh1isper commented Nov 7, 2024

I recommend checking all inspect and transforms

As for the clean method, it can be implemented as optional if Synthesizer doesn't use it. Users may be able to implement Synthesizer themselves to use these methods.

@jalr4ever
Copy link
Collaborator Author

@Wh1isper

I recommend checking all inspect and transforms

As for the clean method, it can be implemented as optional if Synthesizer doesn't use it. Users may be able to implement Synthesizer themselves to use these methods.

Okay. I think we should conclude this PR here. As for the issue with the variable class in the processor, it's a refactoring issue, so let's address it in a new PR. Let's merge this one first; what do you think?

@jalr4ever jalr4ever marked this pull request as ready for review November 7, 2024 08:01
@Wh1isper
Copy link
Collaborator

Wh1isper commented Nov 7, 2024

As for the issue with the variable class in the processor, it's a refactoring issue

Acturally it's a bug and no conflict with the original design, can you do it in this pr? If not, we cloud leave it in next pr, but this pr should not include clear func as a temporary measure to get tests passed. To be clear, I agree that clean is not necessary which should be removed(or revertd)

@Wh1isper
Copy link
Collaborator

Wh1isper commented Nov 7, 2024

Given that we'll be fixing the problem immediately, a temporary failure of the CI is acceptable, in which case I agree to merge this pr first.

@jalr4ever
Copy link
Collaborator Author

As for the issue with the variable class in the processor, it's a refactoring issue

Acturally it's a bug and no conflict with the original design, can you do it in this pr? If not, we cloud leave it in next pr, but this pr should not include clear func as a temporary measure to get tests passed. To be clear, I agree that clean is not necessary which should be removed(or revertd)

🧐 Okay, The ❌ of CI/CD is annoying. I would have try if def __init__(self): self.x = [] works.

@jalr4ever
Copy link
Collaborator Author

class A:
   x: list

   def __init__(self):
       self.x = []

   def callx(self):
       self.x.append("1")


a = A()
a.callx()
print(A.x)  # AttributeError: type object 'A' has no attribute 'x'
print(a.x)  # [1]

@Wh1isper The approach you mentioned works out. and I have already submitted it. Could you help me review it?

Furthermore, I suddenly realized this is a basic question that I couldn't figure out. 😅

tests/conftest.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@Wh1isper Wh1isper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for fixing this!

@jalr4ever jalr4ever merged commit 4e4fff1 into main Nov 7, 2024
12 checks passed
@jalr4ever jalr4ever deleted the jalr4ever-patch-values-polarities branch November 21, 2024 07:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants