Enhance Numeric Data Inspection and Introduce Positive/Negative Filtering #217

Merged
merged 19 commits into from
Aug 27, 2024

MooooCat
Contributor

Enhance NumericInspector and Implement PositiveNegativeFilter

Description

This PR enhances the NumericInspector class of the Synthetic Data Generator (SDG) framework and adds a new PositiveNegativeFilter class. NumericInspector now identifies positive and negative numeric columns in addition to detecting int and float columns. PositiveNegativeFilter filters data according to the sign of the values in specified columns, so that the sign constraints on those columns hold in the processed data.

Key changes include:

  • Updated NumericInspector to classify columns as positive or negative based on defined thresholds.
  • Introduced PositiveNegativeFilter to enforce positivity or negativity constraints on specified columns during data processing.
  • Added comprehensive test cases to validate the functionality of the new filter and the updated inspector.
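
The classification step in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the sdgx implementation: the function name `classify_sign_columns`, the `min_ratio` parameter, and the use of plain Python lists instead of a DataFrame are all assumptions made for the example. The PR only says that columns are classified "based on defined thresholds", so the threshold semantics here are a guess.

```python
from typing import Dict, Sequence, Set, Tuple


def classify_sign_columns(
    columns: Dict[str, Sequence[float]],
    min_ratio: float = 1.0,  # hypothetical threshold, not taken from sdgx
) -> Tuple[Set[str], Set[str]]:
    """Return (positive_columns, negative_columns) for the given data.

    A column counts as positive when the share of values >= 0 reaches
    min_ratio, and as negative when the share of values <= 0 does.
    Positive is checked first, so an all-zero column lands there.
    """
    positive: Set[str] = set()
    negative: Set[str] = set()
    for name, values in columns.items():
        n = len(values)
        if n == 0:
            continue  # an empty column gives no evidence either way
        non_negative_share = sum(v >= 0 for v in values) / n
        non_positive_share = sum(v <= 0 for v in values) / n
        if non_negative_share >= min_ratio:
            positive.add(name)
        elif non_positive_share >= min_ratio:
            negative.add(name)
        # otherwise the column is mixed and gets no sign label
    return positive, negative


data = {
    "price": [1.0, 2.5, 0.0],
    "loss": [-3.0, -0.5, -1.0],
    "delta": [-1.0, 2.0, 3.0],  # mixed signs
}
positive_columns, negative_columns = classify_sign_columns(data)
```

With the default `min_ratio` of 1.0, a single value of the wrong sign is enough to leave a column unlabeled; a lower value would tolerate some outliers.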

Motivation and Context

These changes strengthen the data quality checks in the SDG framework. Once positive and negative columns are identified, the generated synthetic data can be required to respect the same sign constraints, which matters for applications such as model training and data sharing. The change addresses the need for more robust data validation and filtering.

How has this been tested?

The changes have been thoroughly tested using a dedicated test suite. The following tests were performed:

  • Unit tests for the updated NumericInspector to ensure correct identification of positive and negative columns.
  • Integration tests for the PositiveNegativeFilter to verify that it correctly filters data based on the positivity and negativity of values in specified columns.
  • Checks on mixed columns, ensuring they remain unchanged during filtering.
  • All tests were executed in a controlled environment using pytest, and all assertions passed successfully.
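
A test of the kind described above might look roughly like the sketch below. It is self-contained and does not use sdgx: `filter_by_sign` is a hypothetical stand-in for PositiveNegativeFilter that drops rows violating the sign constraints, which is one plausible reading of "filter data based on the positivity or negativity of values". The column names and rows are made up for the example.

```python
from typing import Dict, List, Sequence

Row = Dict[str, float]


def filter_by_sign(
    rows: Sequence[Row],
    positive_columns: Sequence[str],
    negative_columns: Sequence[str],
) -> List[Row]:
    """Keep only rows whose values satisfy every sign constraint.

    Hypothetical helper: rows with a negative value in a positive column,
    or a positive value in a negative column, are dropped. Columns that
    appear in neither list are never inspected or modified.
    """
    kept: List[Row] = []
    for row in rows:
        positives_ok = all(row[c] >= 0 for c in positive_columns)
        negatives_ok = all(row[c] <= 0 for c in negative_columns)
        if positives_ok and negatives_ok:
            kept.append(row)
    return kept


def test_mixed_column_is_untouched() -> None:
    rows = [
        {"price": 1.0, "loss": -2.0, "delta": -5.0},
        {"price": -1.0, "loss": -2.0, "delta": 4.0},  # violates price >= 0
        {"price": 3.0, "loss": 0.5, "delta": 7.0},    # violates loss <= 0
        {"price": 2.0, "loss": -1.0, "delta": 9.0},
    ]
    kept = filter_by_sign(rows, ["price"], ["loss"])
    # Only the first and last rows satisfy both constraints.
    assert len(kept) == 2
    # The mixed column keeps its original values, negative ones included.
    assert [r["delta"] for r in kept] == [-5.0, 9.0]


test_mixed_column_is_untouched()
```

Under pytest the final call is unnecessary, since the `test_` prefix is enough for the function to be collected.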

Types of changes

  • Maintenance (no change in code, maintain the project's CI, docs, etc.)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.

@MooooCat MooooCat self-assigned this Aug 22, 2024
@MooooCat MooooCat requested a review from jalr4ever August 22, 2024 03:15
@MooooCat
Contributor Author

@jalr4ever Please help me review this PR.

Collaborator

@jalr4ever jalr4ever left a comment

The suggestions in the PR are optional changes.

sdgx/data_models/inspectors/numeric.py (outdated, resolved)
sdgx/data_models/inspectors/numeric.py (resolved)
@MooooCat
Contributor Author

Modified the code according to the suggestions in the code review, and all unit tests have passed.

@MooooCat MooooCat requested a review from jalr4ever August 27, 2024 01:24
@MooooCat MooooCat merged commit 9a34789 into main Aug 27, 2024
12 checks passed
@MooooCat MooooCat deleted the feature-intro-rule-processor branch August 27, 2024 01:30
@@ -14,69 +14,132 @@ class NumericInspector(Inspector):

This class is a subclass of `Inspector` and is designed to provide methods for inspecting
and analyzing numeric data. It includes methods for detecting int or float data type.

In August 2024, we introduced a new feature that will continue to judge the positivity or
Collaborator

Perhaps we should indicate the PR and release version here, rather than the date?

Contributor Author

Good idea. I also have another branch in development, and I'll make a release after merging that other PR. For various reasons, we haven't released a new version for a long time :(

Collaborator

Never mind, thanks for your work!

Development

Successfully merging this pull request may close these issues.

How to regulate the range of Synthetic Data
3 participants