Add support for existing BigQuery Schema (replace #40) #57
Conversation
Commits:
- Identify data chunks by count and line number
- … key from new schema only differs in case
- …cess inconvertible types
- …or loops to achieve this
- …luding ones starting from an existing schema.
@bxparks This is ready for review. Happy to have your feedback and incorporate anything else that is needed to get this merged. I know this is a big change, but it's a particularly useful one when you are loading data into existing tables.
@abroglesc Thanks for picking this up!
This is fantastic, thanks for doing this. I tell you, no matter how many recursive functions I write and review, they still make my brain hurt trying to figure out if they are doing the right thing. Most of this looks great, I just have some small nits here and there. The major discussion point is inside merge_schema_entry(), where you added some code to handle the situation of a field going from REQUIRED -> NULLABLE due to an existing schema. Please take a look at my notes and proposal, and let me know what you think.
```python
            f'old=({old_status},{full_old_name},{old_mode},{old_type}); '
            f'new=({new_status},{full_new_name},{new_mode},{new_type})')
        return None
    # primitive-types are conservatively deduced NULLABLE. In case we
```
Lines 358-364: This code here is very subtle; it took me a while to figure out what's going on. Let me see if I understand this correctly.
Without an existing schema, the only legal transitions are NULLABLE -> NULLABLE and REPEATED -> REPEATED. When we add an existing schema through --existing_schema_path, we can now have a field going from REQUIRED -> NULLABLE. (The reverse, NULLABLE -> REQUIRED, cannot happen because a REQUIRED field is produced only during flatten_schema(), never during deduce_schema_for_line().)
So now we have 2 cases:
Case 1) REQUIRED -> NULLABLE (filled=True)
Case 2) REQUIRED -> NULLABLE (filled=False)
In Case 1 (filled=True), the obvious thing is to keep the field as REQUIRED (regardless of whether --infer_mode was selected).
In Case 2 (filled=False), we have 2 choices:
Choice 2.1) We can revert the existing REQUIRED into a NULLABLE.
Choice 2.2) We can print out an error message, ignore this field, and continue processing.
I feel like the user should be able to choose 2.1 or 2.2 with a flag, because it's not obvious which behavior is the correct one. And here, it seems like we can overload the meaning of the --infer_mode flag to choose 2.1. It was originally meant to upgrade NULLABLE -> REQUIRED, but I think it makes sense to overload its meaning to allow the reverse.
Putting all this together, I think it's worth pulling this logic into its own section, above Line 357, and I think it would look something like this:
```python
# If the old field is a REQUIRED primitive (which could only have come from
# an existing schema), the new field can be either a NULLABLE(filled) or a
# NULLABLE(unfilled).
if old_mode == 'REQUIRED' and new_mode == 'NULLABLE':
    # If the new field is filled, then retain the REQUIRED.
    if new_schema_entry['filled']:
        new_info['mode'] = old_mode  # REQUIRED
        new_mode = old_mode
    else:
        # The new field is not filled (i.e. an empty or null field).
        # If --infer_mode is active, then we allow the REQUIRED to revert
        # back to NULLABLE.
        if self.infer_mode:
            old_info['mode'] = new_mode  # NULLABLE
            old_mode = new_mode
        else:
            # Leave the old_mode and new_mode unchanged and different.
            # The mismatch will be detected in the code below.
            pass

[...The following code remains unchanged...]

# For all other types...
if old_mode != new_mode:
    self.log_error(...)
    return None
```
Hi @bxparks, this came from @bozzzzo's initial implementation.
I implemented your suggestion, and then, when writing a unit testcase going from REQUIRED --> NULLABLE so I could ensure the errors fire properly, I could never trigger it. After a lot of tracing, this is because when we deduce an existing schema, we assume that its status is 'hard' (since we can't change a type). On line 286:
```python
# new 'soft' does not clobber old 'hard'
if old_status == 'hard' and new_status == 'soft':
    return old_schema_entry
```
Given that we assume all existing-schema entries are 'hard', and that when we find a NULL field we instantiate its schema_entry with a status of 'soft', we will never reach the condition to relax from REQUIRED --> NULLABLE.
Code for instantiating a NULL schema_entry (bigquery-schema-generator/bigquery_schema_generator/generate_schema.py, lines 427 to 436 at 7d1b4ce):
```python
elif value_type == '__null__':
    schema_entry = OrderedDict([
        ('status', 'soft'),
        ('filled', False),
        ('info', OrderedDict([
            ('mode', 'NULLABLE'),
            ('name', sanitized_key),
            ('type', 'STRING'),
        ])),
    ])
```
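To make the dead code path concrete, here is a minimal, illustrative trace of the early return described above. The dict shapes mirror the schema_entry structure shown, but this is a sketch, not the library's actual code path, and the field name 'city' is hypothetical:

```python
# Illustrative only -- not the library's actual code path.
old = {'status': 'hard', 'filled': True,   # deduced from the existing schema
       'info': {'mode': 'REQUIRED', 'name': 'city', 'type': 'STRING'}}
new = {'status': 'soft', 'filled': False,  # produced by a null value in the data
       'info': {'mode': 'NULLABLE', 'name': 'city', 'type': 'STRING'}}

def merge(old_entry, new_entry):
    # new 'soft' does not clobber old 'hard': we return before the
    # REQUIRED -> NULLABLE relaxation further down could ever run.
    if old_entry['status'] == 'hard' and new_entry['status'] == 'soft':
        return old_entry
    raise NotImplementedError('mode/type merging is never reached here')

assert merge(old, new)['info']['mode'] == 'REQUIRED'
```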
Should we move the logic for relaxing a field from REQUIRED --> NULLABLE (or NULLABLE --> REQUIRED) further up in this function, or should we delete this status check and use it only when doing type conversions?
I broke out the mode determination into its own function, and I call it both within this soft/hard check and lower in the code where this comment was initially anchored; it seems to be working. Expect a commit with the change and an updated testcase soon.
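A rough sketch of what that helper might look like, combining the soft/hard discussion with the reviewer's REQUIRED/NULLABLE proposal above. The committed version in bb5745c may differ, and the error-message text here is made up:

```python
def merge_mode(self, old_schema_entry, new_schema_entry, base_path):
    """Return the merged mode, or None if the modes are incompatible."""
    old_mode = old_schema_entry['info']['mode']
    new_mode = new_schema_entry['info']['mode']
    # A REQUIRED old field can only have come from an existing schema.
    if old_mode == 'REQUIRED' and new_mode == 'NULLABLE':
        # A filled NULLABLE field is compatible with REQUIRED.
        if new_schema_entry['filled']:
            return old_mode
        # An unfilled field may relax REQUIRED -> NULLABLE, but only when
        # --infer_mode is active; otherwise fall through to the mismatch
        # error below.
        if self.infer_mode:
            return new_mode
    if old_mode != new_mode:
        self.log_error(
            f'Ignoring field with mismatched mode at {base_path}: '
            f'old={old_mode}; new={new_mode}')
        return None
    return old_mode
```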
This has now been implemented with tests and pushed in bb5745c.
```diff
@@ -216,8 +264,8 @@ def read_schema_section(self):
                 break
             (tag, _) = self.parse_tag_line(line)
             if tag in self.TAG_TOKENS:
-                if tag == 'SCHEMA':
-                    raise Exception("Unexpected SCHEMA tag")
+                if tag in ('DATA', 'ERRORS', 'EXISTING_SCHEMA', 'SCHEMA'):
```
Hmm, it's getting harder to keep track of which tags are allowed in which sections. Originally, the order of the tags was just: DATA, [ERRORS], SCHEMA, END. But now it's DATA, [EXISTING_SCHEMA], [ERRORS], SCHEMA, END. A better way would be to allow these sections to appear in any order. But that's a bit out of scope for this PR. If I get motivated, maybe I'll take a crack at it after merging in this PR... but realistically, it will probably not rise high enough on my priority list with so many other things going on. Too bad. At least this tidbit is recorded here.
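For reference, under the new ordering a testdata record would be laid out roughly like this. Section bodies are elided and the exact tag syntax is whatever the test harness defines, so treat this as a sketch:

```
DATA
... input records ...
EXISTING_SCHEMA     (optional)
... existing schema JSON ...
ERRORS              (optional)
... expected error lines ...
SCHEMA
... expected generated schema ...
END
```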
Created #59 to keep that nice-to-have visible after we merge this PR.
When is this expected to be merged and be in the stable release?
…se for new schema items
Fixed the casing issue. Added a couple of tests for that as well. Just need to address the rest of the comments. Overall, the logic fixes are mostly implemented; just small changes left.
…_schema_entry and merge_schema_entry in deduce_schema_for_line
Hi, just want to check in to get a status on this PR. The sooner we get this in the better, because it takes a fair bit of effort to pull all the context info into my head to allow me to review this code properly. I don't work on this code frequently, so if we wait too long, I start forgetting things. BTW, there's a …
Hi @bxparks, I will try to have this all addressed today. I had a bit of time off recently where I disconnected, and I'm just getting back into the swing of things.
@bxparks At this time I believe I have addressed all of your comments. Let me know if there is anything else we need to get this PR merged :)
… --> FLOAT and INT64 --> Integer
I now truly believe this is ready for a final review after addressing the one final comment I missed.
I replied in one of my embedded comments that this PR has so many commits, including Bozo's commits which are not rebased from develop, that I can no longer follow what it's doing. I added a request in there, asking you to squash-merge all these changes into a single commit, rebased from develop, and then send me another PR, which will make it possible to get a sensible git diff.
```diff
-        new_schema_entry = self.get_schema_entry(key, value)
+        new_schema_entry = self.get_schema_entry(key,
+                                                 value,
+                                                 base_path=base_path)
```
```make
	    --count \
	    --ignore W503 \
	    --show-source \
	    --statistics \
	    --max-line-length=80
	flake8 tests/ \
```
No need to duplicate the entire command, just add tests to the previous flake8 command:
```make
flake8:
	flake8 bigquery_schema_generator tests \
	    --count \
	    --ignore W503 \
	    --show-source \
	    --statistics \
	    --max-line-length=80
```
Small nit: I prefer my directories to not have a trailing /, since it is not part of their name. And trailing slashes are actually meaningful in some programs (e.g. rsync(1)), so I'd rather not get in the habit of using them without explicit reason.
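For context on the rsync(1) remark, a trailing slash on the source directory changes what gets copied:

```sh
rsync -a src  dest   # copies the directory itself, producing dest/src/...
rsync -a src/ dest   # copies only the contents of src into dest/
```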
```diff
-        sanitized_key = self.sanitize_name(key)
+        # We want to use a lowercase version of the key in the schema map so
+        # that we can aggregate keys with slightly different casing together
+        sanitized_key = self.sanitize_name(key).lower()
```
Can you rename sanitized_key in this method to something like canonical_key, to avoid confusion with sanitized_key in get_schema_entry()? Also, update the comment to say something like: "The canonical key is the lower-cased version of the sanitized key so that the case of the field name is preserved when generating the schema."
```python
            new_info['type'] = candidate_type
        return new_schema_entry

    def merge_mode(self, old_schema_entry, new_schema_entry, base_path):
```
Ok, this section is the crux of this PR, but unfortunately I am having an incredibly difficult time reviewing it. The GitHub code review tool is so clunky and difficult compared to what I'm used to. And the git diff output is completely messed up because the previous PR was not rebased off the current develop branch, so it's giving me diffs with lines that don't exist anymore, and I can't make any sense of it.
Request: Can you squash-merge all your commits and Bozo's commits into a single commit, and send me another PR? I don't know if GitHub allows you to push those changes into this PR, or whether it will create another PR. I suspect the latter. This way, all your changes will be rebased off the current state of the code, so I can figure out what actually changed. Right now, I just cannot make any sense of it...
I think what you need to do is create a new branch, then do something like:
```sh
$ git co develop
$ git pull      # make sure that you are synced with the repo
$ git co -b {new_branch}
$ git merge --squash abroglesc-existing-schema
$ git push
```
Then create a PR request on GitHub from {new_branch}.
```python
    # impossible, but more effort than I want to expend right now) to figure
    # out which fields are missing so that we can mark the appropriate
    # schema entries with 'filled=False'.
    if infer_mode and input_format != 'csv':
```
I think this might be a bad idea. I know that you tried to recover some of this functionality in flatten_schema_map(), but I am not sure that's sufficient. The reason is that if --infer_mode is enabled in JSON mode, it will do the wrong thing when JSON lines are missing those fields completely. All of the records will have a filled=True state, so I think the program will incorrectly infer the mode to be REQUIRED. This will break bq load when loading that dataset with this generated schema.
I realize that I was the one who suggested overloading --infer_mode for the REQUIRED->NULLABLE transition, but I'm starting to think that we need a separate flag for this, to avoid enabling --infer_mode for JSON format.
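To illustrate the concern with hypothetical data, consider these two JSON lines:

```json
{"name": "alice", "age": 30}
{"name": "bob"}
```

Every record that mentions age has it filled, so nothing ever records filled=False for age; with --infer_mode active, the generated schema would promote age to REQUIRED, and bq load would then reject the second line.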
Will continue on the next PR I submit with squashed commits, but why would this be bad, given that I added one more check to only apply this in flatten_schema_map() if the input_format is CSV? If it's a JSON input format, it will not elevate these to REQUIRED.
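A sketch of the guard being described, with hypothetical names (the real check lives inside flatten_schema_map() and may be shaped differently):

```python
def inferred_mode(current_mode, filled, infer_mode, input_format):
    # Upgrade NULLABLE -> REQUIRED only for CSV input: a CSV row always
    # contains every column, so filled=True across all rows is meaningful.
    # JSON records can omit fields entirely, so JSON input is left alone.
    if (infer_mode and input_format == 'csv'
            and current_mode == 'NULLABLE' and filled):
        return 'REQUIRED'
    return current_mode

assert inferred_mode('NULLABLE', True, True, 'csv') == 'REQUIRED'
assert inferred_mode('NULLABLE', True, True, 'json') == 'NULLABLE'
```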
Add support for existing BigQuery Schema with a single squashed commit (replace #57)
Superseded by #61, closing.
This takes the code proposed in PR #40 and attempts to address all outstanding items of concern as they were reviewed by @bxparks. #40 is very stale at this time and it does not seem that it will get implemented at any point. I would like to request we close #40 in favor of this PR, but we should use it as a base for the changes I made since you last reviewed that PR.
The changes I have made as requested are:
- … commit id: 8fbe52d
- … commit id: 1e286f1
- … commit id: 1e286f1
- … commit id: 7afdf419a2c4a61c3b356d1b59ad5baaed8cc171
- … commit id: fff6f5b
- … error_logs that are returned. commit id: fff6f5b
- … type_mismatch_callback. This was not properly tested and I didn't feel like maintaining/testing it. commit id: c5a57de
- … commit id: 96ca4ae
- … generate-schema command examples

Summarizing original changes from @bozzzzo.