-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Validate schemas overlooking the nullability fields #71
Conversation
windows cannot be applied inside where clauses
variant_df does not have split coordinates
Codecov Report
@@ Coverage Diff @@
## main #71 +/- ##
==========================================
+ Coverage 77.99% 78.31% +0.31%
==========================================
Files 41 41
Lines 1127 1139 +12
==========================================
+ Hits 879 892 +13
+ Misses 248 247 -1
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think is worth considering the next 2 cases?
- use
StructField
.dataType
method to compare if the expected and observed types are the same. I believe they are ignored at the moment. (e.g.df.schema[0].dataType
) - While we were comparing the whole
StructField
before, this PR only compares the names. This makes all nested information ignored. If I understand this correctly you would be ignoring all differences in nested columns. We might need some recursivity similar to this chunk of code from Chispa.
Since you have some tests already, perhaps it's useful to add some of these cases and try to do test-driven development.
We can also merge and work separately if this unblocks other work. I'm approving just in case.
I agree with @d0choa, we should only drop the nullability check not the type check. |
@d0choa @DSuveges Thank you so much for your comments. They were all on point. The new changes include:
|
ad95698
to
51ad0aa
Compare
27eeb03
to
480539b
Compare
This reverts commit 6cc9af1.
The PR now includes fixes in the business logic that were raised with the schema validation. All due to duplicated fields found in the observed_schema. Source of duplication was of 2 types:
|
Last change: added an ad hoc check for duplicated fields in d0b3489 |
redundancy only happens in the testing contest
This PR includes:
Notes:
pytest_generate_tests
function. TIL that when we use pytest's metafuncs, we are defining a fixture that will be used in every test, unless specified otherwise.