Add support for existing BigQuery Schema (replace #40) #57
Conversation
Commits:
- Identify data chunks by count and line number
- … key from new schema only differs in case
- …cess inconvertible types
- …or loops to achieve this
- …luding ones starting from an existing schema.
@bxparks This is ready for review. Happy to have your feedback and incorporate anything else that is needed to get this merged. I know this is a big change, but it's a particularly useful one when you are loading data into existing tables.
@abroglesc Thanks for picking this up!
This is fantastic, thanks for doing this. I tell you, no matter how many recursive functions I write and review, they still make my brain hurt trying to figure out if they are doing the right thing. Most of this looks great, I just have some small nits here and there. The major discussion point is inside merge_schema_entry(), where you added some code to handle the situation of a field going from REQUIRED -> NULLABLE due to an existing schema. Please take a look at my notes and proposal, and let me know what you think.
```python
            f'old=({old_status},{full_old_name},{old_mode},{old_type}); '
            f'new=({new_status},{full_new_name},{new_mode},{new_type})')
        return None
    # primitive-types are conservatively deduced NULLABLE. In case we
```
Lines 358-364: This code here is very subtle; it took me a while to figure out what's going on. Let me see if I understand this correctly.
Without an existing schema, the only legal transitions are NULLABLE -> NULLABLE and REPEATED -> REPEATED. When we add an existing schema through --existing_schema_path, we can now have a field going from REQUIRED -> NULLABLE. (The reverse, NULLABLE -> REQUIRED, cannot happen because a REQUIRED field is produced only during flatten_schema(), never during deduce_schema_for_line().)
So now we have 2 cases:
Case 1) REQUIRED -> NULLABLE (filled=True)
Case 2) REQUIRED -> NULLABLE (filled=False)
In Case 1 (filled=True), the obvious thing is to keep the field as REQUIRED (regardless of whether --infer_mode was selected).
In Case 2 (filled=False), we have 2 choices:
Choice 2.1) We can revert the existing REQUIRED into a NULLABLE.
Choice 2.2) We can print out an error message, ignore this field, and continue processing.
I feel like the user should be able to choose 2.1 or 2.2 with a flag, because it's not obvious which behavior is the correct one. And here, it seems like we can overload the meaning of the --infer_mode flag to choose 2.1. It was originally meant to upgrade NULLABLE -> REQUIRED, but I think it makes sense to overload its meaning to allow the reverse.
Putting all this together, I think it's worth pulling this logic into its own section, above Line 357, and I think it would look something like this:
```python
# If the old field is a REQUIRED primitive (which could only have come from
# an existing schema), the new field can be either a NULLABLE(filled) or a
# NULLABLE(unfilled).
if old_mode == 'REQUIRED' and new_mode == 'NULLABLE':
    # If the new field is filled, then retain the REQUIRED.
    if new_schema_entry['filled']:
        new_info['mode'] = old_mode  # REQUIRED
        new_mode = old_mode
    else:
        # The new field is not filled (i.e. an empty or null field).
        # If --infer_mode is active, then we allow the REQUIRED to revert
        # back to NULLABLE.
        if self.infer_mode:
            old_info['mode'] = new_mode  # NULLABLE
            old_mode = new_mode
        else:
            # Leave the old_mode and new_mode unchanged and different.
            # The mismatch will be detected in the code below.
            pass

[...The following code remains unchanged...]

# For all other types...
if old_mode != new_mode:
    self.log_error(...)
    return None
```
Hi @bxparks, this came from @bozzzzo's initial implementation.
I implemented your suggestion, and then, when writing a unit testcase going from REQUIRED --> NULLABLE so I could ensure the errors fire properly, I could never trigger it. After a lot of tracing, this is because when we deduce an existing schema, we assume that its status is 'hard' (since we can't change a type). On line 286:
```python
# new 'soft' does not clobber old 'hard'
if old_status == 'hard' and new_status == 'soft':
    return old_schema_entry
```
Given that we assume all existing-schema entries are 'hard', and that when we find a NULL field we instantiate its schema_entry with a status of 'soft', we will never reach the condition to relax from REQUIRED --> NULLABLE.
Code for instantiating a NULL schema_entry (bigquery-schema-generator/bigquery_schema_generator/generate_schema.py, lines 427 to 436 at 7d1b4ce):
```python
elif value_type == '__null__':
    schema_entry = OrderedDict([
        ('status', 'soft'),
        ('filled', False),
        ('info', OrderedDict([
            ('mode', 'NULLABLE'),
            ('name', sanitized_key),
            ('type', 'STRING'),
        ])),
    ])
```
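To make the dead code path concrete, here is a minimal, illustrative trace of the early return described above. The dict shapes mirror the schema_entry structure shown, but this is a sketch, not the library's actual code path, and the field name 'city' is hypothetical:

```python
# Illustrative only -- not the library's actual code path.
old = {'status': 'hard', 'filled': True,   # deduced from the existing schema
       'info': {'mode': 'REQUIRED', 'name': 'city', 'type': 'STRING'}}
new = {'status': 'soft', 'filled': False,  # produced by a null value in the data
       'info': {'mode': 'NULLABLE', 'name': 'city', 'type': 'STRING'}}

def merge(old_entry, new_entry):
    # new 'soft' does not clobber old 'hard': we return before the
    # REQUIRED -> NULLABLE relaxation further down could ever run.
    if old_entry['status'] == 'hard' and new_entry['status'] == 'soft':
        return old_entry
    raise NotImplementedError('mode/type merging is never reached here')

assert merge(old, new)['info']['mode'] == 'REQUIRED'
```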
Should we move the logic for relaxing a field from REQUIRED --> NULLABLE (or NULLABLE --> REQUIRED) further up in this function, or should we delete this status check and use it only when doing type conversions?
I broke out the mode determination into its own function, and I call it both within this soft/hard check and lower in the code where this comment was initially anchored; it seems to be working. Expect a commit with the change and an updated testcase soon.
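A rough sketch of what that helper might look like, combining the soft/hard discussion with the reviewer's REQUIRED/NULLABLE proposal above. The committed version in bb5745c may differ, and the error-message text here is made up:

```python
def merge_mode(self, old_schema_entry, new_schema_entry, base_path):
    """Return the merged mode, or None if the modes are incompatible."""
    old_mode = old_schema_entry['info']['mode']
    new_mode = new_schema_entry['info']['mode']
    # A REQUIRED old field can only have come from an existing schema.
    if old_mode == 'REQUIRED' and new_mode == 'NULLABLE':
        # A filled NULLABLE field is compatible with REQUIRED.
        if new_schema_entry['filled']:
            return old_mode
        # An unfilled field may relax REQUIRED -> NULLABLE, but only when
        # --infer_mode is active; otherwise fall through to the mismatch
        # error below.
        if self.infer_mode:
            return new_mode
    if old_mode != new_mode:
        self.log_error(
            f'Ignoring field with mismatched mode at {base_path}: '
            f'old={old_mode}; new={new_mode}')
        return None
    return old_mode
```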
This has now been implemented with tests and pushed in bb5745c.
```diff
@@ -216,8 +264,8 @@ def read_schema_section(self):
                 break
             (tag, _) = self.parse_tag_line(line)
             if tag in self.TAG_TOKENS:
-                if tag == 'SCHEMA':
-                    raise Exception("Unexpected SCHEMA tag")
+                if tag in ('DATA', 'ERRORS', 'EXISTING_SCHEMA', 'SCHEMA'):
```
Hmm, it's getting harder to keep track of which tags are allowed in which sections. Originally, the order of the tags was just: DATA, [ERRORS], SCHEMA, END. But now it's DATA, [EXISTING_SCHEMA], [ERRORS], SCHEMA, END. A better way would be to allow these sections to appear in any order. But that's a bit out of scope for this PR. If I get motivated, maybe I'll take a crack at it after merging in this PR... but realistically, it will probably not rise high enough on my priority list with so many other things going on. Too bad. At least this tidbit is recorded here.
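For reference, under the new ordering a testdata record would be laid out roughly like this. Section bodies are elided and the exact tag syntax is whatever the test harness defines, so treat this as a sketch:

```
DATA
... input records ...
EXISTING_SCHEMA     (optional)
... existing schema JSON ...
ERRORS              (optional)
... expected error lines ...
SCHEMA
... expected generated schema ...
END
```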
Created #59 to keep that nice-to-have visible after we merge this PR.
When is this expected to be merged and be in the stable release?
…se for new schema items
Fixed the casing issue. Added a couple of tests for that as well. Just need to address the rest of the comments. Overall, the logic fixes are mostly implemented; just small changes left.
…_schema_entry and merge_schema_entry in deduce_schema_for_line
Hi, just want to check in to get a status on this PR. The sooner we get this in the better, because it takes a fair bit of effort to pull all the context info into my head to allow me to review this code properly. I don't work on this code frequently, so if we wait too long, I start forgetting things. BTW, there's a …
Hi @bxparks, I will try to have this all addressed today. I had a bit of time off recently where I disconnected, and I'm just getting back into the swing of things.
@bxparks At this time I believe I have addressed all of your comments. Let me know if there is anything else we need to get this PR merged :)
… --> FLOAT and INT64 --> Integer
I now truly believe this is ready for a final review after addressing the one final comment I missed.
I replied in one of my embedded comments that this PR has so many commits, including Bozo's commits which are not rebased from develop, that I can no longer follow what it's doing. I added a request in there, asking you to squash-merge all these changes into a single commit, rebased from develop, and then send me another PR, which will make it possible to get a sensible git diff.
```diff
-        new_schema_entry = self.get_schema_entry(key, value)
+        new_schema_entry = self.get_schema_entry(key,
+                                                 value,
+                                                 base_path=base_path)
```
```make
	    --count \
	    --ignore W503 \
	    --show-source \
	    --statistics \
	    --max-line-length=80
	flake8 tests/ \
```
No need to duplicate the entire command, just add tests to the previous flake8 command:
```make
flake8:
	flake8 bigquery_schema_generator tests \
	    --count \
	    --ignore W503 \
	    --show-source \
	    --statistics \
	    --max-line-length=80
```
Small nit: I prefer my directories to not have a trailing /, since it is not part of their name. And trailing slashes are actually meaningful in some programs (e.g. rsync(1)), so I'd rather not get in the habit of using them without explicit reason.
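For context on the rsync(1) remark, a trailing slash on the source directory changes what gets copied:

```sh
rsync -a src  dest   # copies the directory itself, producing dest/src/...
rsync -a src/ dest   # copies only the contents of src into dest/
```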
```diff
-        sanitized_key = self.sanitize_name(key)
+        # We want to use a lowercase version of the key in the schema map so
+        # that we can aggregate keys with slightly different casing together
+        sanitized_key = self.sanitize_name(key).lower()
```
Can you rename sanitized_key in this method to something like canonical_key, to avoid confusion with sanitized_key in get_schema_entry()? Also, update the comment to say something like: "The canonical key is the lower-cased version of the sanitized key so that the case of the field name is preserved when generating the schema."
```python
            new_info['type'] = candidate_type
        return new_schema_entry

    def merge_mode(self, old_schema_entry, new_schema_entry, base_path):
```
Ok, this section is the crux of this PR, but unfortunately I am having an incredibly difficult time reviewing it. The GitHub code review tool is so clunky and difficult compared to what I'm used to. And the git diff output is completely messed up because the previous PR was not rebased off the current develop branch, so it's giving me diffs with lines that don't exist anymore, and I can't make any sense of it.
Request: Can you squash-merge all your commits and Bozo's commits into a single commit, and send me another PR? I don't know if GitHub allows you to push those changes into this PR, or whether it will create another PR. I suspect the latter. This way, all your changes will be rebased off the current state of the code, so I can figure out what actually changed. Right now, I just cannot make any sense of it...
I think what you need to do is create a new branch, then do something like:
```sh
$ git co develop
$ git pull      # make sure that you are synced with the repo
$ git co -b {new_branch}
$ git merge --squash abroglesc-existing-schema
$ git push
```
Then create a PR request on GitHub from {new_branch}.
```python
    # impossible, but more effort than I want to expend right now) to figure
    # out which fields are missing so that we can mark the appropriate
    # schema entries with 'filled=False'.
    if infer_mode and input_format != 'csv':
```
I think this might be a bad idea. I know that you tried to recover some of this functionality in flatten_schema_map(), but I am not sure that's sufficient. The reason is that if --infer_mode is enabled in JSON mode, it will do the wrong thing when JSON lines are missing those fields completely. All of the records will have a filled=True state, so I think the program will incorrectly infer the mode to be REQUIRED. This will break bq load when loading that dataset with this generated schema.
I realize that I was the one who suggested overloading --infer_mode for the REQUIRED->NULLABLE transition, but I'm starting to think that we need a separate flag for this, to avoid enabling --infer_mode for JSON format.
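To illustrate the concern with hypothetical data, consider these two JSON lines:

```json
{"name": "alice", "age": 30}
{"name": "bob"}
```

Every record that mentions age has it filled, so nothing ever records filled=False for age; with --infer_mode active, the generated schema would promote age to REQUIRED, and bq load would then reject the second line.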
Will continue on the next PR I submit with squashed commits, but why would this be bad, given that I added one more check to only apply this in flatten_schema_map() if the input_format is CSV? If it's a JSON input format, it will not elevate these to REQUIRED.
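A sketch of the guard being described, with hypothetical names (the real check lives inside flatten_schema_map() and may be shaped differently):

```python
def inferred_mode(current_mode, filled, infer_mode, input_format):
    # Upgrade NULLABLE -> REQUIRED only for CSV input: a CSV row always
    # contains every column, so filled=True across all rows is meaningful.
    # JSON records can omit fields entirely, so JSON input is left alone.
    if (infer_mode and input_format == 'csv'
            and current_mode == 'NULLABLE' and filled):
        return 'REQUIRED'
    return current_mode

assert inferred_mode('NULLABLE', True, True, 'csv') == 'REQUIRED'
assert inferred_mode('NULLABLE', True, True, 'json') == 'NULLABLE'
```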
Add support for existing BigQuery Schema with a single squashed commit (replace #57)
Superseded by #61, closing.
This takes the code proposed in PR #40 and attempts to address all outstanding items of concern as they were reviewed by @bxparks. #40 is very stale at this time and it does not seem that it will get implemented at any point. I would like to request we close #40 in favor of this PR, but we should use it as a base for the changes I made since you last reviewed that PR.
The changes I have made as requested are:
- … commit id: 8fbe52d
- … commit id: 1e286f1
- … commit id: 1e286f1
- … commit id: 7afdf419a2c4a61c3b356d1b59ad5baaed8cc171
- … commit id: fff6f5b
- … error_logs that are returned. commit id: fff6f5b
- … type_mismatch_callback. This was not properly tested and I didn't feel like maintaining/testing it. commit id: c5a57de
- … commit id: 96ca4ae
- … generate-schema command examples

Summarizing original changes from @bozzzzo.