[SCHEMATIC-214] Wrap pandas functions to support not including None with the NA values argument #1553

Merged: thomasyu888 merged 27 commits into develop from schematic-210-pandas-remove-none on Dec 13, 2024

@BryanFauble BryanFauble commented Nov 22, 2024

Problem:

  1. When a manifest contains the string "None", the pandas read function converts it to a float NaN (not a number). This behavior comes from a change in the pandas 2.0 release: "Added "None" to default na_values in read_csv() (GH 50286)".

Solution:

  1. Take the existing list of default na_values and remove "None" from it. Pass the filtered list back into the pandas function so that it replaces the default na_values.
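
A minimal sketch of the approach described above, not the code merged in this PR. The wrapper name `read_csv_keep_none` and the sample data are hypothetical, and `STR_NA_VALUES` is imported from a private pandas module whose location may change between releases.

```python
import io

import pandas as pd
# Assumption: private pandas module holding the default NA strings.
from pandas._libs.parsers import STR_NA_VALUES

# Copy the defaults and drop "None" so the literal string survives parsing.
# discard() is used because pandas < 2.0 does not include "None" at all.
NA_VALUES_WITHOUT_NONE = set(STR_NA_VALUES)
NA_VALUES_WITHOUT_NONE.discard("None")


def read_csv_keep_none(path_or_buffer, **kwargs):
    """Hypothetical wrapper: read a CSV without treating "None" as missing."""
    # keep_default_na=False makes pandas use only the na_values passed in.
    kwargs.setdefault("na_values", sorted(NA_VALUES_WITHOUT_NONE))
    kwargs.setdefault("keep_default_na", False)
    return pd.read_csv(path_or_buffer, **kwargs)


# Made-up manifest: one literal "None", one empty cell, one "NA".
csv_text = "Patient,Diagnosis\nA,None\nB,\nC,NA\n"
wrapped_df = read_csv_keep_none(io.StringIO(csv_text))
print(wrapped_df["Diagnosis"].tolist())
```

With this wrapper, the empty cell and "NA" are still read as missing values, while "None" stays a string that can be checked against a data model's valid values.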

Testing:

  1. Unit/Integration testing
  2. Tom tested with a broken HTAN manifest and confirmed that this branch processed it without issue.

@thomasyu888 thomasyu888 changed the title [SCHEMATIC-210] Wrap pandas functions to support not including None with the NA values argument [SCHEMATIC-214] Wrap pandas functions to support not including None with the NA values argument Nov 22, 2024
@BryanFauble BryanFauble marked this pull request as ready for review November 25, 2024 16:47
@BryanFauble BryanFauble requested a review from a team as a code owner November 25, 2024 16:47
@thomasyu888 thomasyu888 left a comment

🔥 LGTM! Let's wait for Gianna/Andrew to review this when they're back before merging, in case there are things we aren't thinking of.

@thomasyu888 thomasyu888 requested a review from a team November 25, 2024 21:23
@thomasyu888 thomasyu888 requested a review from GiaJordan December 2, 2024 21:16
@linglp linglp left a comment

The changes here make sense to me. Could we add None as a valid value in the data model here: https://github.com/Sage-Bionetworks/schematic/blob/develop/tests/data/example.model.csv#L9 and then create a manifest containing the string None to better test the changes? Currently, we can't test this with our existing data model because no attribute has "None" as a valid value.

@BryanFauble
I extended an existing test to cover this. Let me know if it covers what you had in mind, @linglp.

6eeacd5

schematic/models/validate_attribute.py (resolved)
tests/data/example_test_nones.model.csv (outdated, resolved)
Update example_test_nones.model.csv component and add new invalid manifest with nones
@GiaJordan GiaJordan self-requested a review December 9, 2024 17:58
@thomasyu888 thomasyu888 left a comment

@andrewelamb, thanks for adding the integration test - please see my comments.

@GiaJordan
I've opened up #1556 to address the data model and component concerns. That PR just updates the existing data models and adds new test manifests. We can revert #1555 or modify the test_nones data models in this branch to what they were before.
@andrewelamb The valid test manifest used in that PR raises no warnings or errors, and the invalid manifest raises only the one expected error message. You can use those for your tests instead of the ones currently being used.
cc: @thomasyu888

@thomasyu888 thomasyu888 left a comment

To aid in the upcoming release due to long testing times, I did this: d0a8e15.

  1. Nit: Thanks for adding this to an existing test module, but please update the docstring to cover the added test functions.
  2. Nit: In my opinion, parametrize works best when every case tests the same condition. For example, if there were many valid manifests to test, I would use it to list all the valid manifest paths. Separate test functions are more readable than combining parametrize with if-else branches to cover two different testing conditions.
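
To illustrate the second nit, here is a hedged sketch rather than the test code from this PR. `validate_manifest`, `VALID_DIAGNOSES`, and the sample rows are hypothetical stand-ins. Only `pytest.mark.parametrize` is a real pytest API.

```python
import pytest

# Hypothetical valid values; "None" is allowed as a literal string.
VALID_DIAGNOSES = {"Healthy", "Cancer", "None"}


def validate_manifest(rows):
    """Hypothetical stand-in validator: one error message per invalid value."""
    return [
        f"'{value}' is not a valid Diagnosis"
        for value in rows
        if value not in VALID_DIAGNOSES
    ]


# Every case checks the same condition: a valid manifest yields no errors.
VALID_MANIFESTS = [["Healthy", "None"], ["Cancer"], ["None", "None"]]


@pytest.mark.parametrize("rows", VALID_MANIFESTS)
def test_valid_manifest_has_no_errors(rows):
    assert validate_manifest(rows) == []


# The error case is a different condition, so it gets its own function
# instead of an if-else branch inside the parametrized test.
def test_invalid_manifest_reports_single_error():
    errors = validate_manifest(["Healthy", "Unknown"])
    assert errors == ["'Unknown' is not a valid Diagnosis"]
```

Each function makes a single kind of assertion, so a failure points directly at the condition that broke.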

Great work everybody!

cc @andrewelamb.

@thomasyu888 thomasyu888 merged commit 70813f1 into develop Dec 13, 2024
8 checks passed
@thomasyu888 thomasyu888 deleted the schematic-210-pandas-remove-none branch December 13, 2024 04:00
andrewelamb added a commit that referenced this pull request Dec 16, 2024
* add new tests

* add unit tests

* ran black

* Update schematic/models/validate_attribute.py

Co-authored-by: BryanFauble <17128019+BryanFauble@users.noreply.github.com>

* added tests

* Update README.md

* Update README.md

* add unit tests

* run black

* Update README.md

* temp commit

* remove old tests

* [FDS-2386] Synapse entity tracking and code concurrency updates (#1505)

* [FDS-2386] Synapse entity tracking and code concurrency updates

* ran black

* Update CODEOWNERS

* updated data model type rules to include error param

* fix validate type attribute to use msg level param

* added error handling

* run black

* create Node class

* set up Node class so that nodes with no displayName fields cause an error on creation

* ran black

* ran mypy

* added new configs for CLI tests

* added new manifests for testing CLI commands

* automate manual CLI tests

* ran black

* Update CODEOWNERS

* Update scan_repo.yml

* Update .github/CODEOWNERS

* Update .github/workflows/scan_repo.yml

* Attach additional telemetry data to OTEL traces (#1519)

* Attach additional telemetry data to OTEL traces

* feat: added tracing for cross manifest validation and file name validation  (#1509)

* add tracing for GX validation

* temp commit

* Updating contribution doc to expect squash and merge (#1534)

* [FDS-2491] Integration tests for Schematic API Test plan (#1512)

Integration tests for Schematic API Test plan

* [FDS-2500] Add Integration Tests for: Manifest Validation (#1516)

* Add Integration Tests for: Manifest Validation

* [FDS-2449] Lock `sphinx` version and update `poetry.lock` (#1530)

Also install `typing-extensions` in the build

* manual test files now being saved in manifests folder

* manual test files now being saved in manifests folder

* remove lines to delete json files that were under git control

* ran black

* add try finally blocks to remove created files

* ran black

* add lines to remove created json files

* Update file annotation store process to require filename be present in order to annotate file

* add lines to remove created json files

* Revert "Update file annotation store process to require filename be present in order to annotate file"

This reverts commit f57c718.

* Don't attempt to annotate the table

* add code in finally blocks to reset config to default values, when tests change them

* complete submit manifest command test

* ran black

* add test for bug case

* update test for table tidyness

* remove unused import

* remove etag column if already present when building temp file view

* catch all exceptions to switch to sequential mode

* update test for updated data

* Revert "update test for updated data"

This reverts commit 255e3c0.

* Revert "catch all exceptions to switch to sequential mode"

This reverts commit 68b0b24.

* catch ValueErrors as well

* Updates for integration test failures (#1537)

* Updates for integration test failures, Config file reset and scope changes

* add todos for removing config resets

* [FDS-2525] Authenticated export of telemetry data (#1527)

* Authenticated export of telemetry data, updating to HTTP otel library

* temp reduce tests

* restore tests

* uncomment tests

* redid how files are deleted, manual tests values are set

* ran black

* [SCHEMATIC-157] Make some dependencies required to avoid `schematic CLI` commands from potentially erroring when doing a pip install (#1540)

* Make otel flash non-optional

* Add dependencies as non-optional

* Include schematic_api for now (#1547)

* update toml version to 24.11.1 (#1548)

* [SCHEMATIC-193] Support exporting telemetry data from GH integration test runs (#1550)

* Support exporting telemetry data from GH run via access token retrieved via oauth2

* [SCHEMATIC-30, SCHEMATIC-200] Add version to click cli / use pathlib.Path module for checking cache size (#1542)

* Add version to click cli

* Add version

* Run black

* Reformat

* Fix

* Update schematic/schemas/data_model_parser.py

* Add test for check_synapse_cache_size

* Reformat

* Fix tests

* Remove unused parameter

* Install all-extras for now

* Make otel flash non-optional

* Update dockerfile

* Add dependencies as non-optional

* Update pyproject toml

* Fix trivy issue

* Add service version

* Run black

* Move all utils.general tests into separate folder

* Use pre-commit

* Add updates to contribution doc

* Fix

* Add service version to log provider

---------

Co-authored-by: BryanFauble <17128019+BryanFauble@users.noreply.github.com>

* [SCHEMATIC-212] Prevent traces from being combined (#1552)

* Set instance id in github CI run, uninstrument flask auto during integration test run

* [SCHEMATIC-163] Catch error when manifest is generated and existing one doesn't have `entityId` (#1551)

* adds error handling

* adds unit tests for _get_file_entityIds

* updates error message

* adds entityid check to parent func

* updates docstring

* [SCHEMATIC-183] Use paths from file view for manifest generation (#1529)

source manifest file paths from synapse fileviews at generation

* [SCHEMATIC-214] Wrap pandas functions to support not including `None` with the NA values argument (#1553)

* Wrap pandas functions to support not including `None` with the NA values argument

* Ignore types

* pylint issues

* ordering of ignore

* Add to integration test to cover none in a manifest

* Add additional test for manifest

* [SCHEMATIC-210] Add attribute to nones data model (#1555)

Update example_test_nones.model.csv component and add new invalid manifest with nones

* first commit

* ran black

* add test for validateModelManifest

* [SCHEMATIC-214] change data model and component (#1556)

* add valid values to Patient attributes

* update data model

* add test manifests

* update test for new model

* update test for new valid value

* change test to use new manifests

* remove unneeded test file

* revert file

* revert file

* change tests to use new manifests

* remove unneeded manifests

* ran black

* add tests back in

* ran black

* revert manifest

* Split up valid and errored test as separate testing functions

* Remove unused import

---------

Co-authored-by: Gianna Jordan <61707471+GiaJordan@users.noreply.github.com>
Co-authored-by: Andrew Lamb <andrewelamb@gmail.com>
Co-authored-by: Thomas Yu <thomas.yu@sagebase.org>

* incremented package version number

* Update publish.yml

* Update test.yml

* Update api_test.yml

* Update pdoc.yml

* Update version.py

* updates publish.yml (#1558) (#1561)

Co-authored-by: Brad Macdonald <52762200+BWMac@users.noreply.github.com>

---------

Co-authored-by: BryanFauble <17128019+BryanFauble@users.noreply.github.com>
Co-authored-by: Jenny V Medina <jenny.medina@sagebase.org>
Co-authored-by: Thomas Yu <thomas.yu@sagebase.org>
Co-authored-by: Lingling <55448354+linglp@users.noreply.github.com>
Co-authored-by: GiaJordan <gianna.jordan@sagebase.org>
Co-authored-by: Brad Macdonald <52762200+BWMac@users.noreply.github.com>
Co-authored-by: Gianna Jordan <61707471+GiaJordan@users.noreply.github.com>