
Validate schemas overlooking the nullability fields #71

Merged · 32 commits into main · Apr 13, 2023
Conversation

ireneisdoomed
Contributor

This PR includes:

  • Changes to the validate_schema Dataset method to compare schemas between dataframes by looking at the field names.
    • The functionality is essentially the same; we just iterate over the column names instead of over the properties of each schema field.
    • There are two checks in place: a) checking that the observed schema does not have any extra field; b) checking that the mandatory fields dictated by the expected schema are present in the observed_schema.
  • Added two unit tests, one for each of the checks described above.
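The two checks above boil down to set differences over field names. A minimal sketch (hypothetical helper name and signature; the PR's validate_schema works on real Spark schemas, but the set logic is the same):

```python
def check_field_names(observed_fields, expected_fields, required_fields):
    """Return (unexpected, missing) field-name sets.

    observed_fields: names present in the DataFrame being validated
    expected_fields: all names allowed by the expected schema
    required_fields: subset of expected names that must be present
    """
    # Check (a): the observed schema must not carry any extra field.
    unexpected = set(observed_fields) - set(expected_fields)
    # Check (b): every mandatory field must appear in the observed schema.
    missing = set(required_fields) - set(observed_fields)
    return unexpected, missing
```

For example, validating `["a", "b", "x"]` against an expected schema `["a", "b", "c"]` with mandatory fields `["a", "c"]` reports `x` as unexpected and `c` as missing.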

Notes:

  • I also had to tweak the pytest_generate_tests function. TIL that when we use pytest's metafunc hook, we are effectively defining a fixture that is applied to every test unless we specify otherwise.
  • Mypy does not check the script that tests the schema. This is because mypy crashes when checking this script, even with the newer version, so I was not able to commit. The script is not raising a typing error; mypy is failing to deserialize something from its cache. I've found an issue in their repo that might be related. Full trace:
Traceback (most recent call last):
  File "/Users/irenelopez/.cache/pre-commit/repomwnp37wt/py_env-python3.8/bin/mypy", line 8, in <module>
    sys.exit(console_entry())
  File "/Users/irenelopez/.cache/pre-commit/repomwnp37wt/py_env-python3.8/lib/python3.8/site-packages/mypy/__main__.py", line 15, in console_entry
    main()
  File "mypy/main.py", line 95, in main
  File "mypy/main.py", line 174, in run_build
  File "mypy/build.py", line 193, in build
  File "mypy/build.py", line 276, in _build
  File "mypy/build.py", line 2903, in dispatch
  File "mypy/build.py", line 3284, in process_graph
  File "mypy/build.py", line 3362, in process_fresh_modules
  File "mypy/build.py", line 2101, in load_tree
  File "mypy/nodes.py", line 397, in deserialize
  File "mypy/nodes.py", line 3689, in deserialize
  File "mypy/nodes.py", line 3630, in deserialize
  File "mypy/nodes.py", line 274, in deserialize
  File "mypy/nodes.py", line 854, in deserialize
  File "mypy/types.py", line 189, in deserialize_type
  File "mypy/types.py", line 2038, in deserialize
KeyError: 'unpack_kwargs'
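On the pytest_generate_tests note above: the hook runs for every collected test, so it has to guard on the fixture name before parametrizing. A hypothetical conftest.py sketch (fixture and case names are illustrative, not the PR's actual ones):

```python
# Illustrative mock-schema cases; the real project parametrizes its own fixtures.
MOCK_SCHEMAS = ["flat_schema", "nested_schema"]


def pytest_generate_tests(metafunc):
    # Without this guard, the hook would try to inject the fixture
    # into every test in the suite, not just the ones that request it.
    if "mock_schema_name" in metafunc.fixturenames:
        metafunc.parametrize("mock_schema_name", MOCK_SCHEMAS)
```

Tests that declare `mock_schema_name` as an argument run once per schema case; all other tests are left untouched.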

@codecov-commenter

codecov-commenter commented Apr 11, 2023

Codecov Report

Merging #71 (26d9aea) into main (59edc6f) will increase coverage by 0.31%.
The diff coverage is 96.15%.


@@            Coverage Diff             @@
##             main      #71      +/-   ##
==========================================
+ Coverage   77.99%   78.31%   +0.31%     
==========================================
  Files          41       41              
  Lines        1127     1139      +12     
==========================================
+ Hits          879      892      +13     
+ Misses        248      247       -1     
Impacted Files                    Coverage Δ
src/otg/dataset/intervals.py       41.93% <ø> (ø)
src/otg/dataset/dataset.py         88.88% <91.66%> (+3.17%) ⬆️
src/otg/common/schemas.py         100.00% <100.00%> (ø)
src/otg/dataset/study_index.py     96.55% <100.00%> (ø)

@ireneisdoomed ireneisdoomed changed the base branch from do_hydra to main April 11, 2023 11:33
Collaborator

@d0choa d0choa left a comment


Do you think it is worth considering the following two cases?

  • Use the StructField.dataType attribute to compare whether the expected and observed types are the same. I believe types are ignored at the moment (e.g. df.schema[0].dataType).
  • While we were comparing the whole StructField before, this PR only compares the names, so all nested information is ignored. If I understand this correctly, you would be ignoring all differences in nested columns. We might need some recursion similar to this chunk of code from Chispa.

Since you have some tests already, perhaps it's useful to add some of these cases and try to do test-driven development.

We can also merge and work separately if this unblocks other work. I'm approving just in case.

@DSuveges
Contributor

I agree with @d0choa: we should only drop the nullability check, not the type check.

@ireneisdoomed
Contributor Author

ireneisdoomed commented Apr 12, 2023

@d0choa @DSuveges Thank you so much for your comments. They were all on point. The new changes include:

  • Types of each field are now validated, not only the names (setting aside only the nullability property).
  • Support for nested validation. This is achieved by flattening the schemas of both dataframes to extract their fields. The function is heavily inspired by the one we use to calculate the metrics.
  • Unit tests for: missing field, extra field, and field of a different type, run on both nested and non-nested mock data.
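The flattening step can be sketched as follows. This is a hypothetical standalone version using plain `(name, dtype)` tuples rather than Spark StructFields, where a struct dtype is itself a list of fields; nullability is deliberately not represented, matching the PR's intent:

```python
def flatten_fields(fields, prefix=""):
    """Yield (dotted_name, dtype) pairs for every field, recursing into structs."""
    flat = []
    for name, dtype in fields:
        full_name = f"{prefix}{name}"
        if isinstance(dtype, list):  # struct type: record it, then recurse
            flat.append((full_name, "struct"))
            flat.extend(flatten_fields(dtype, prefix=f"{full_name}."))
        else:
            flat.append((full_name, dtype))
    return flat
```

Comparing the flattened `(name, dtype)` pairs of the observed and expected schemas then catches missing fields, extra fields, and type mismatches at any nesting depth in one pass.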

@ireneisdoomed
Contributor Author

ireneisdoomed commented Apr 13, 2023

The PR now includes fixes in the business logic that were surfaced by the schema validation, all due to duplicated fields found in the observed_schema. The duplication had two sources:

  • A duplicated "chromosome" key in the V2G generation from the intervals. When joining two tables using aliases, both copies of the join key are kept. I don't know whether this is a PySpark bug, but it is how it behaves. Commit: 26d9aea
  • Duplicated columns in the study table. We now have a process where, to create a studyIndex dataset early on, we need to define columns containing dummy data that will be parsed later in the process. Examples of this were columns like numSamples or summarystatsLocation. When these fields are parsed, they are joined back to the original studyIndex dataset. The problem was that these columns needed to be dropped from the original object before joining. I wish there were a better way of doing this, but I think we have to live with it. @d0choa suggested handling these only in the context of the tests, where these fields are already part of the dataframe because its creation is dictated by the schema. Commits: fe93cf5, dce0e76, dec2cf4
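A duplicated-field check like the one discussed here can be sketched by counting occurrences of each column name (a minimal illustration with a hypothetical helper name, not the PR's exact implementation):

```python
from collections import Counter


def find_duplicated_fields(field_names):
    """Return the sorted list of field names appearing more than once."""
    counts = Counter(field_names)
    return sorted(name for name, n in counts.items() if n > 1)
```

Raising an error when this returns a non-empty list would surface problems like the duplicated "chromosome" key as soon as the schema is validated, rather than later in the pipeline.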

@ireneisdoomed
Contributor Author

Last change: added an ad hoc check for duplicated fields in d0b3489 (the redundancy only happens in the testing context).
@ireneisdoomed ireneisdoomed merged commit 660e5d8 into main Apr 13, 2023
@ireneisdoomed ireneisdoomed deleted the il-schemas branch April 13, 2023 09:40