
Validate schemas overlooking the nullability fields #71

Merged · 32 commits into main · Apr 13, 2023
Conversation

ireneisdoomed
Contributor

This PR includes:

  • Changes to the validate_schema Dataset method to compare schemas between dataframes by looking at the field names.
    • The functionality is essentially the same; we just iterate over the column names instead of over the properties of each schema field.
    • There are two checks in place: a) checking that the observed schema does not have any extra field; b) checking that the mandatory fields dictated by the expected schema are present in the observed_schema.
  • Added two unit tests, one for each of the checks described above.
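The two checks above boil down to set differences over field names. A minimal sketch (hypothetical helper name and signature; the PR's validate_schema works on real Spark schemas, but the set logic is the same):

```python
def check_field_names(observed_fields, expected_fields, required_fields):
    """Return (unexpected, missing) field-name sets.

    observed_fields: names present in the DataFrame being validated
    expected_fields: all names allowed by the expected schema
    required_fields: subset of expected names that must be present
    """
    # Check (a): the observed schema must not carry any extra field.
    unexpected = set(observed_fields) - set(expected_fields)
    # Check (b): every mandatory field must appear in the observed schema.
    missing = set(required_fields) - set(observed_fields)
    return unexpected, missing
```

For example, validating `["a", "b", "x"]` against an expected schema `["a", "b", "c"]` with mandatory fields `["a", "c"]` reports `x` as unexpected and `c` as missing.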

Notes:

  • I also had to tweak the pytest_generate_tests function. TIL that when we use pytest's metafunc hook, we are effectively defining a fixture that is applied to every test unless we specify otherwise.
  • Mypy does not check the script that tests the schema. This is because mypy crashes when checking this script, even with the newer version, so I was not able to commit. The script is not raising a typing error; mypy is failing to deserialize something from its cache. I've found an issue in their repo that might be related. Full trace:
Traceback (most recent call last):
  File "/Users/irenelopez/.cache/pre-commit/repomwnp37wt/py_env-python3.8/bin/mypy", line 8, in <module>
    sys.exit(console_entry())
  File "/Users/irenelopez/.cache/pre-commit/repomwnp37wt/py_env-python3.8/lib/python3.8/site-packages/mypy/__main__.py", line 15, in console_entry
    main()
  File "mypy/main.py", line 95, in main
  File "mypy/main.py", line 174, in run_build
  File "mypy/build.py", line 193, in build
  File "mypy/build.py", line 276, in _build
  File "mypy/build.py", line 2903, in dispatch
  File "mypy/build.py", line 3284, in process_graph
  File "mypy/build.py", line 3362, in process_fresh_modules
  File "mypy/build.py", line 2101, in load_tree
  File "mypy/nodes.py", line 397, in deserialize
  File "mypy/nodes.py", line 3689, in deserialize
  File "mypy/nodes.py", line 3630, in deserialize
  File "mypy/nodes.py", line 274, in deserialize
  File "mypy/nodes.py", line 854, in deserialize
  File "mypy/types.py", line 189, in deserialize_type
  File "mypy/types.py", line 2038, in deserialize
KeyError: 'unpack_kwargs'
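On the pytest_generate_tests note above: the hook runs for every collected test, so it has to guard on the fixture name before parametrizing. A hypothetical conftest.py sketch (fixture and case names are illustrative, not the PR's actual ones):

```python
# Illustrative mock-schema cases; the real project parametrizes its own fixtures.
MOCK_SCHEMAS = ["flat_schema", "nested_schema"]


def pytest_generate_tests(metafunc):
    # Without this guard, the hook would try to inject the fixture
    # into every test in the suite, not just the ones that request it.
    if "mock_schema_name" in metafunc.fixturenames:
        metafunc.parametrize("mock_schema_name", MOCK_SCHEMAS)
```

Tests that declare `mock_schema_name` as an argument run once per schema case; all other tests are left untouched.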

@codecov-commenter

codecov-commenter commented Apr 11, 2023

Codecov Report

Merging #71 (26d9aea) into main (59edc6f) will increase coverage by 0.31%.
The diff coverage is 96.15%.


@@            Coverage Diff             @@
##             main      #71      +/-   ##
==========================================
+ Coverage   77.99%   78.31%   +0.31%     
==========================================
  Files          41       41              
  Lines        1127     1139      +12     
==========================================
+ Hits          879      892      +13     
+ Misses        248      247       -1     
Impacted Files                    Coverage Δ
src/otg/dataset/intervals.py       41.93% <ø> (ø)
src/otg/dataset/dataset.py         88.88% <91.66%> (+3.17%) ⬆️
src/otg/common/schemas.py         100.00% <100.00%> (ø)
src/otg/dataset/study_index.py     96.55% <100.00%> (ø)

@ireneisdoomed ireneisdoomed changed the base branch from do_hydra to main April 11, 2023 11:33
Collaborator

@d0choa d0choa left a comment


Do you think it is worth considering the following two cases?

  • Use the StructField.dataType attribute to compare whether the expected and observed types are the same. I believe types are ignored at the moment (e.g. df.schema[0].dataType).
  • While we were comparing the whole StructField before, this PR only compares the names, so all nested information is ignored. If I understand this correctly, you would be ignoring all differences in nested columns. We might need some recursion similar to this chunk of code from Chispa.

Since you have some tests already, perhaps it's useful to add some of these cases and try to do test-driven development.

We can also merge and work separately if this unblocks other work. I'm approving just in case.

@DSuveges
Contributor

I agree with @d0choa: we should only drop the nullability check, not the type check.

@ireneisdoomed
Contributor Author

ireneisdoomed commented Apr 12, 2023

@d0choa @DSuveges Thank you so much for your comments. They were all on point. The new changes include:

  • Types of each field are now validated, not only the names (setting aside only the nullability property).
  • Support for nested validation. This is achieved by flattening the schemas of both dataframes to extract their fields. The function is heavily inspired by the one we use to calculate the metrics.
  • Unit tests for: missing field, extra field, and field of a different type, run on both nested and non-nested mock data.
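The flattening step can be sketched as follows. This is a hypothetical standalone version using plain `(name, dtype)` tuples rather than Spark StructFields, where a struct dtype is itself a list of fields; nullability is deliberately not represented, matching the PR's intent:

```python
def flatten_fields(fields, prefix=""):
    """Yield (dotted_name, dtype) pairs for every field, recursing into structs."""
    flat = []
    for name, dtype in fields:
        full_name = f"{prefix}{name}"
        if isinstance(dtype, list):  # struct type: record it, then recurse
            flat.append((full_name, "struct"))
            flat.extend(flatten_fields(dtype, prefix=f"{full_name}."))
        else:
            flat.append((full_name, dtype))
    return flat
```

Comparing the flattened `(name, dtype)` pairs of the observed and expected schemas then catches missing fields, extra fields, and type mismatches at any nesting depth in one pass.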

@ireneisdoomed
Contributor Author

ireneisdoomed commented Apr 13, 2023

The PR now includes fixes in the business logic that were surfaced by the schema validation, all due to duplicated fields found in the observed_schema. The duplication had two sources:

  • A duplicated "chromosome" key in the V2G generation from the intervals. When joining two tables using aliases, both copies of the join key are kept. I don't know whether this is a PySpark bug, but it is how it behaves. Commit: 26d9aea
  • Duplicated columns in the study table. We now have a process where, to create a studyIndex dataset early on, we need to define columns containing dummy data that will be parsed later in the process. Examples of this were columns like numSamples or summarystatsLocation. When these fields are parsed, they are joined back to the original studyIndex dataset. The problem was that these columns needed to be dropped from the original object before joining. I wish there were a better way of doing this, but I think we have to live with it. @d0choa suggested handling these only in the context of the tests, where these fields are already part of the dataframe because its creation is dictated by the schema. Commits: fe93cf5, dce0e76, dec2cf4
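A duplicated-field check like the one discussed here can be sketched by counting occurrences of each column name (a minimal illustration with a hypothetical helper name, not the PR's exact implementation):

```python
from collections import Counter


def find_duplicated_fields(field_names):
    """Return the sorted list of field names appearing more than once."""
    counts = Counter(field_names)
    return sorted(name for name, n in counts.items() if n > 1)
```

Raising an error when this returns a non-empty list would surface problems like the duplicated "chromosome" key as soon as the schema is validated, rather than later in the pipeline.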

@ireneisdoomed
Contributor Author

Last change: added an ad hoc check for duplicated fields in d0b3489 (the redundancy only happens in the testing context).
@ireneisdoomed ireneisdoomed merged commit 660e5d8 into main Apr 13, 2023
@ireneisdoomed ireneisdoomed deleted the il-schemas branch April 13, 2023 09:40