Regulate positive-negative values in the generated data #232

jalr4ever · 2024-11-06T12:55:53Z

Description

I updated the logic of the filter module, fixed the issue where positive and negative values were not correctly applied during sampling, and added a test package synthesizer for establishing functional tests related to synthesizers.

Additionally, I founded that a series of logical issues during ctgan training; I refactored this part to make its logic clearer and added some comments explaining its background.

Moreover, I believe that some built-in filter can be integrated, such as the range value filter.

Motivation and Context

See tests/synthesizer/test_ctgan_synthesizer.py

Types of changes

Maintenance (no change in code, maintain the project's CI, docs, etc.)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.

…and add functional test cases.

…valid discrete columns and refactor part of this.

…nerator into jalr4ever-patch-values-polarities

sweep-ai · 2024-11-06T12:56:01Z

Hey @jalr4ever, here is an example of how you can ask me to improve this pull request:

@Sweep Add unit tests for the `PositiveNegativeFilter` class to verify that it correctly identifies and handles positive and negative columns. The test should include scenarios with mixed numeric columns, all-positive columns, and all-negative columns.

📖 For more information on how to use Sweep, please read our documentation.

Wh1isper · 2024-11-06T13:02:18Z

I'm not too familiar with the original design of this piece, but it looks OK.

cc @MooooCat

…ed(Success in local)

…led(Success in local)

jalr4ever · 2024-11-06T13:56:14Z

@Wh1isper I have made many attempts, but I don't know why the CI/CD test cases are failing. Can you help me take a look? It works fine on my local machine. 😩

Wh1isper · 2024-11-06T15:24:33Z

2024-11-06 13:51:26.531 | INFO | sdgx.synthesizer:fit:292 - Fitting data processors...
2024-11-06 13:51:26.547 | INFO | sdgx.data_processors.transformers.nan:fit:71 - NonValueTransformer Fitted.
2024-11-06 13:51:26.547 | INFO | sdgx.data_processors.transformers.nan:fit:87 - NonValueTransformer get int columns: {'mixed_int', 'hours-per-week', 'educational-num', 'pos_int', 'age', 'neg_int', 'fnlwgt', 'capital-loss', 'capital-gain'}.
2024-11-06 13:51:26.547 | INFO | sdgx.data_processors.transformers.nan:fit:96 - NonValueTransformer get float columns: {'neg_float', 'mixed_float', 'pos_float'}.
2024-11-06 13:51:26.547 | INFO | sdgx.data_processors.transformers.nan:fit:101 - NonValueTransformer get column list from metadata: ['int_id', 'pos_int', 'neg_int', 'pos_float', 'neg_float', 'mixed_int', 'mixed_float'].

It seems the mock data was changed somehow

jalr4ever · 2024-11-06T15:41:24Z

@Wh1isper 🫠 Yeah, what should I do？

jalr4ever · 2024-11-07T03:49:33Z

I reproduced it locally by logging. It seems to be caused by class variable pollution.

For some unknown reason, it is the variable pollution caused by the same field names of NonValueTransformer and OutlierTransformer, and further reasons are still being explored.

2024-11-07 11:33:12.401 | INFO     | sdgx.data_processors.transformers.nan:fit:71 - NonValueTransformer Fitted.
2024-11-07 11:33:12.401 | INFO     | sdgx.data_processors.transformers.nan:fit:74 - NonValueTransformer get column list from metadata: ['int_id', 'pos_int', 'neg_int', 'pos_float', 'neg_float', 'mixed_int', 'mixed_float'].
2024-11-07 11:33:12.401 | INFO     | sdgx.data_processors.transformers.nan:fit:75 - Metadata int columns: {'int_id', 'pos_int', 'mixed_int', 'neg_int'}.
2024-11-07 11:33:12.401 | INFO     | sdgx.data_processors.transformers.nan:fit:76 - Metadata float columns: {'pos_float', 'mixed_float', 'neg_float'}.
2024-11-07 11:33:12.401 | INFO     | sdgx.data_processors.transformers.nan:fit:77 - Self int_columns: {'hours-per-week', 'educational-num', 'fnlwgt', 'capital-loss', 'age', 'capital-gain'}.
2024-11-07 11:33:12.402 | INFO     | sdgx.data_processors.transformers.nan:fit:93 - NonValueTransformer get int columns: {'pos_int', 'hours-per-week', 'educational-num', 'mixed_int', 'fnlwgt', 'capital-loss', 'neg_int', 'age', 'capital-gain'}.
2024-11-07 11:33:12.402 | INFO     | sdgx.data_processors.transformers.nan:fit:102 - NonValueTransformer get float columns: {'pos_float', 'mixed_float', 'neg_float'}.

…when fitting a new dataset.

Wh1isper

I don't think we should do clean in fit, consider following snip

for data in dataloader:
   formater.fit(data)

We can expose clear as a public method in every subclass and baseclass(remine empty in baseclass), call it when needed. What do you think?

jalr4ever · 2024-11-07T06:23:20Z

@Wh1isper

For _clear(): This is just a temporary solution. In fact, I don't know why this situation occurs; when I run this test case individually, there are no issues, but when running all cases at once, the problem arises. Basically, each processor should be new and there shouldn't be any "old data." I only added this _clear in fit to make it pass the test. Do you have any idea on this?

For your suggestion: I'm not quite sure where to appropriately call this after setting it to public based on your suggestion. My thought is to call it when obtaining processors ( sdgx/synthesizer.py#fit()-line-296 ) when starting the fit. How do you think?

…rt `fit()` the processors

Wh1isper · 2024-11-07T07:09:28Z

I don't know why this situation occurs

Cause we're using mutable objects.

class A:
    x: list = []

    def callx(self):
        self.x.append("1") # THIS IS USING A.x!!


a = A()
a.callx()
print(A.x)  # [1]

We should imp transformer as follow:

class A:
    x: list

    def __init__(self):
        self.x = []

    def callx(self):
        self.x.append("1")


a = A()
a.callx()
print(A.x)  # AttributeError: type object 'A' has no attribute 'x'
print(a.x)  # [1]

Wh1isper · 2024-11-07T07:11:22Z

I recommend checking all inspect and transforms

As for the clean method, it can be implemented as optional if Synthesizer doesn't use it. Users may be able to implement Synthesizer themselves to use these methods.

jalr4ever · 2024-11-07T07:37:24Z

@Wh1isper

I recommend checking all inspect and transforms

As for the clean method, it can be implemented as optional if Synthesizer doesn't use it. Users may be able to implement Synthesizer themselves to use these methods.

Okay. I think we should conclude this PR here. As for the issue with the variable class in the processor, it's a refactoring issue, so let's address it in a new PR. Let's merge this one first; what do you think?

Wh1isper · 2024-11-07T08:18:48Z

As for the issue with the variable class in the processor, it's a refactoring issue

Acturally it's a bug and no conflict with the original design, can you do it in this pr? If not, we cloud leave it in next pr, but this pr should not include clear func as a temporary measure to get tests passed. To be clear, I agree that clean is not necessary which should be removed(or revertd)

Wh1isper · 2024-11-07T08:20:23Z

Given that we'll be fixing the problem immediately, a temporary failure of the CI is acceptable, in which case I agree to merge this pr first.

jalr4ever · 2024-11-07T09:31:32Z

As for the issue with the variable class in the processor, it's a refactoring issue

Acturally it's a bug and no conflict with the original design, can you do it in this pr? If not, we cloud leave it in next pr, but this pr should not include clear func as a temporary measure to get tests passed. To be clear, I agree that clean is not necessary which should be removed(or revertd)

🧐 Okay, The ❌ of CI/CD is annoying. I would have try if def __init__(self): self.x = [] works.

…e variables avoid variable pollution.

jalr4ever · 2024-11-07T10:09:21Z

class A:
   x: list

   def __init__(self):
       self.x = []

   def callx(self):
       self.x.append("1")


a = A()
a.callx()
print(A.x)  # AttributeError: type object 'A' has no attribute 'x'
print(a.x)  # [1]

@Wh1isper The approach you mentioned works out. and I have already submitted it. Could you help me review it?

Furthermore, I suddenly realized this is a basic question that I couldn't figure out. 😅

tests/conftest.py

Wh1isper

LGTM, thanks for fixing this!

tests/test_ctgan_synthesizer.py

…m/hitsz-ids/synthetic-data-generator into jalr4ever-patch-values-polarities

jalr4ever added 3 commits November 6, 2024 20:44

fix: positive-negative issue - correctly register the filter manager …

fa3fa6e

…and add functional test cases.

fix: CTGAN filtering invalid discrete issue - correctly filter out in…

a1b010b

…valid discrete columns and refactor part of this.

Merge branch 'main' of https://github.com/hitsz-ids/synthetic-data-ge…

e789cef

…nerator into jalr4ever-patch-values-polarities

jalr4ever requested a review from Wh1isper November 6, 2024 12:56

Wh1isper requested a review from MooooCat November 6, 2024 13:02

jalr4ever added 3 commits November 6, 2024 21:11

fix: Moved test file to test/ root dir for weird CI/CD testing fail…

d719abf

…ed(Success in local)

fix: Wrong import for CI/CD testing failed(Success in local)

566c2a7

fix: Moved fixture to pytest file itself, try if avoid CI/CD test fai…

a4c31f7

…led(Success in local)

fix: transformer class variable having old values, leading to errors …

1f6e1f5

…when fitting a new dataset.

Wh1isper reviewed Nov 7, 2024

View reviewed changes

jalr4ever requested a review from Wh1isper November 7, 2024 06:25

fix: refactor _clear to public, and invoking before synthesizer sta…

b8482ce

…rt `fit()` the processors

jalr4ever marked this pull request as ready for review November 7, 2024 08:01

fix: remove _clear and turn processors class variables into instanc…

bb4773b

…e variables avoid variable pollution.

Wh1isper reviewed Nov 7, 2024

View reviewed changes

tests/conftest.py Outdated Show resolved Hide resolved

Update tests/conftest.py

0ed6e44

Wh1isper approved these changes Nov 7, 2024

View reviewed changes

Wh1isper reviewed Nov 7, 2024

View reviewed changes

tests/test_ctgan_synthesizer.py Outdated Show resolved Hide resolved

jalr4ever added 2 commits November 7, 2024 18:35

fix: remove unnecessary using data loader

1f79ad9

Merge branch 'jalr4ever-patch-values-polarities' of https://github.co…

fa7c492

…m/hitsz-ids/synthetic-data-generator into jalr4ever-patch-values-polarities

jalr4ever merged commit 4e4fff1 into main Nov 7, 2024
12 checks passed

This was referenced Nov 7, 2024

How to regulate Negative values in the Generated Data #231

Closed

Handling Fixed Column Relationships using FixedCombinationInspector and FixedCombinationTransformer #219

Merged

jalr4ever deleted the jalr4ever-patch-values-polarities branch November 21, 2024 07:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regulate positive-negative values in the generated data #232

Regulate positive-negative values in the generated data #232

jalr4ever commented Nov 6, 2024

sweep-ai bot commented Nov 6, 2024 •

edited

Loading

Wh1isper commented Nov 6, 2024

jalr4ever commented Nov 6, 2024

Wh1isper commented Nov 6, 2024

jalr4ever commented Nov 6, 2024

jalr4ever commented Nov 7, 2024

Wh1isper left a comment

jalr4ever commented Nov 7, 2024

Wh1isper commented Nov 7, 2024 •

edited

Loading

Wh1isper commented Nov 7, 2024

jalr4ever commented Nov 7, 2024

Wh1isper commented Nov 7, 2024

Wh1isper commented Nov 7, 2024

jalr4ever commented Nov 7, 2024

jalr4ever commented Nov 7, 2024

Wh1isper left a comment

Regulate positive-negative values in the generated data #232

Regulate positive-negative values in the generated data #232

Conversation

jalr4ever commented Nov 6, 2024

Description

Motivation and Context

How has this been tested?

Types of changes

Checklist:

sweep-ai bot commented Nov 6, 2024 • edited Loading

Wh1isper commented Nov 6, 2024

jalr4ever commented Nov 6, 2024

Wh1isper commented Nov 6, 2024

jalr4ever commented Nov 6, 2024

jalr4ever commented Nov 7, 2024

Wh1isper left a comment

Choose a reason for hiding this comment

jalr4ever commented Nov 7, 2024

Wh1isper commented Nov 7, 2024 • edited Loading

Wh1isper commented Nov 7, 2024

jalr4ever commented Nov 7, 2024

Wh1isper commented Nov 7, 2024

Wh1isper commented Nov 7, 2024

jalr4ever commented Nov 7, 2024

jalr4ever commented Nov 7, 2024

Wh1isper left a comment

Choose a reason for hiding this comment

sweep-ai bot commented Nov 6, 2024 •

edited

Loading

Wh1isper commented Nov 7, 2024 •

edited

Loading