
Add support for starting from existing BQ schema #40

Closed · wants to merge 10 commits

Conversation

@bozzzzo commented Mar 22, 2020

We're loading a Kafka topic into BQ. It seems prudent to inform SchemaGenerator of the existing BQ table schema before processing new records, as this reduces the chances of misdetection.

@bxparks (Owner) commented Mar 22, 2020

Hi Bozo (assuming that's your real name?),

Thanks for your interest in this project. It looks like you spent some time going through the code to understand how it works. I have some initial feedback.

  1. It's generally a good idea to give the maintainer of a project a note first before sending a PR containing new frameworks, new dependencies, and significant additional code or refactoring of code.

  2. With regards to d7275d7 which adds pytest and tox, I am inclined to reject this. First, it pulls in a bunch of pip packages, some of which are unfamiliar to me, so this commit imposes additional burden on me to spend time to figure those out, and maintain them into the future. This includes 2 new configuration files, in both TOML and INI format? What is the benefit that we gain from adding all those extra dependencies and complexity?

  3. With regards to 590e221 "Add support for starting from exising BQ schema", what is the problem that you are trying to solve with this code? I would have expected some additional examples and test cases, but all I see are some additional error messages? What does "INFORMED" mean?

  4. It looks like you simplified some parts, like the handling of 'sanitize_names' flag, which simplifies the code. It was a submission by a contributor which I probably didn't spend that much time on. I would accept this change if submitted separately.

  5. It looks like you added support for non-legacy SQL types (INT64, FLOAT64, BOOL, STRUCT). I would accept those changes if submitted separately. But only if it was accompanied by additional unit tests.

  6. It looks like you found some bugs in the DataReader.read_chunk() method. I would accept those changes if submitted separately, preferably with additional unit tests.

Keep in mind that I don't actively work in this space (big data and BigQuery) currently, and I haven't looked at this code for many months. It is likely that you understand my code better than me at this moment. I understand that you want to resolve some deficiency with the code, but you need to take the time to explain to me what those problems are. Also keep in mind that whatever changes that I accept, as a maintainer, I have to maintain that code into the future, whereas you can just walk away.

@bozzzzo (Author) commented Mar 24, 2020

Hi Bozo (assuming that's your real name?),

Hi Brian, my real name is Božo, but I'm kind of still not used to websites properly supporting UTF-8 so...

Thanks for your interest in this project. It looks like you spent some time going through the code to understand how it works. I have some initial feedback.

Thanks for taking the time to go through this; I understand I put a lot of stuff into one PR.

  1. It's generally a good idea to give the maintainer of a project a note first before sending a PR containing new frameworks, new dependencies, and significant additional code or refactoring of code.

Point taken.

  2. With regards to d7275d7 which adds pytest and tox, I am inclined to reject this. First, it pulls in a bunch of pip packages, some of which are unfamiliar to me, so this commit imposes additional burden on me to spend time to figure those out, and maintain them into the future. This includes 2 new configuration files, in both TOML and INI format? What is the benefit that we gain from adding all those extra dependencies and complexity?

The short answer is, I mostly wanted to make the tests use pytest. What I like about it is that distinct test cases that are parametrized using @fixture are given unique identifiers, so one can easily re-run just one failing test without needing to hack on the test suite.
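
For illustration only (this is not code from the PR; the fixture name and values here are made up), a parametrized pytest fixture looks roughly like this, and each parameter value becomes its own addressable test case:

```python
import pytest

# Each value in params becomes a separately addressable test case,
# e.g. test_detects_type[42] or test_detects_type[hello].
@pytest.fixture(params=[42, 3.14, "hello", True], ids=str)
def sample_value(request):
    return request.param

def test_detects_type(sample_value):
    # Placeholder assertion; a real test would feed the value through
    # the schema generator and check the deduced BQ type.
    assert sample_value is not None
```

A single failing case can then be re-run by its node ID, e.g. `pytest "test_file.py::test_detects_type[3.14]"`, instead of re-running the whole suite.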

Tox is... something that may be less relevant to SchemaGenerator, as your project is Python 3 only. In short, tox makes it easy to test the project across several versions of Python. It needs tox.ini to work and, for some reason, pyproject.toml. The latter I don't fully understand; it seems to be the new preferred way to replace most of setup.py. I don't know how to use it fully, I just learned to parrot the minimal incantation by searching SO and looking at https://github.com/pypa/sampleproject
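
As a rough sketch (not the configuration from this PR; the environment list is illustrative), a minimal tox.ini looks something like:

```ini
# Minimal tox sketch: run pytest under several Python versions.
[tox]
envlist = py36, py37, py38

[testenv]
deps = pytest
commands = pytest
```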

flake8 is a Python linter with formatting opinions. I disabled it for now, as the cleanup submission would have been excessive even for my threshold :)

  3. With regards to 590e221 "Add support for starting from exising BQ schema", what is the problem that you are trying to solve with this code?

The SchemaGenerator use case seemed to be "I want to import this one file into BQ." What we're doing is importing a Kafka topic into BQ, kind of ELT style. The topic schema changes over time, and I want to catch errors before the records are loaded into BQ, since we can produce much better diagnostics with an early warning if the topic schema was changed in a way that BQ cannot accommodate on its own (yes, the proper solution would be to not make mistakes upstream).

So SchemaGenerator as-is helps with detecting the schema for a slice of Kafka messages, but it has no opinion on the compatibility of a past import's schema with the current one, while it does have such an opinion when an incompatible change happens mid-slice. So I wanted to close this gap and give SchemaGenerator a 'memory' of what it saw before.

I could have pickled the schema_map and done it that way with minimal changes to the schema generator, but it seemed much more robust to be able to work against an existing BQ table and bootstrap the generator from that schema.
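
As a sketch of what bootstrapping from an existing table could look like (this is not the PR's implementation; the table ID is a placeholder, and bq_schema_to_map() refers to the PR's new function), the existing schema can be fetched with the google-cloud-bigquery client and converted into the usual BQ JSON schema format:

```python
from google.cloud import bigquery

def fetch_existing_schema(table_id):
    """Return a table's schema as a list of BQ JSON schema dicts (sketch)."""
    client = bigquery.Client()
    table = client.get_table(table_id)  # e.g. "my-project.my_dataset.my_table"

    def field_to_dict(field):
        entry = {"name": field.name, "type": field.field_type, "mode": field.mode}
        if field.field_type == "RECORD":
            entry["fields"] = [field_to_dict(f) for f in field.fields]
        return entry

    return [field_to_dict(f) for f in table.schema]
```

The resulting list could then be handed to something like bq_schema_to_map() to seed the generator before the next slice of records is processed.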

I would have expected some additional examples and test cases, but all I see are some additional error messages?

There are two additional tests. One tests that a BQ schema can be converted to a schema_map and then flattened again without losing information. The other one...

What does "INFORMED" mean?

... tests that the SchemaGenerator produces the same schema when it starts with the schema already known. What does change in this case is the error messages in edge conditions. I think they are really warning messages, but I didn't want to go adding a warning category.
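
As an illustration of the round-trip test mentioned above (the helper names and map layout here are hypothetical, not the PR's actual schema_map), such a test asserts that converting a schema to a map and flattening it again yields the original schema:

```python
def schema_to_map(schema):
    # Hypothetical: key each column entry by its lower-cased name.
    return {entry["name"].lower(): entry for entry in schema}

def flatten_map(schema_map):
    # Hypothetical inverse: recover the column entries sorted by name.
    return sorted(schema_map.values(), key=lambda e: e["name"].lower())

def test_round_trip():
    schema = [
        {"name": "age", "type": "INTEGER", "mode": "NULLABLE"},
        {"name": "name", "type": "STRING", "mode": "NULLABLE"},
    ]
    assert flatten_map(schema_to_map(schema)) == schema
```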

  4. It looks like you simplified some parts, like the handling of 'sanitize_names' flag, which simplifies the code. It was a submission by a contributor which I probably didn't spend that much time on. I would accept this change if submitted separately.

Yeah, sanitizing was happening too late for the round-trip to work correctly, so I moved it to the source. It may be too simplistic performance-wise, as it now happens for every key of every data record instead of once per key of the produced schema.
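
For context (a simplified sketch, not the project's actual sanitizing code), sanitizing a column name for BQ essentially means replacing characters outside of letters, digits, and underscores, and making sure the name doesn't start with a digit:

```python
import re

def sanitize_name(name):
    """Replace characters BQ column names don't allow with underscores (simplified sketch)."""
    sanitized = re.sub(r"[^a-zA-Z0-9_]", "_", name)
    if sanitized and sanitized[0].isdigit():
        sanitized = "_" + sanitized
    return sanitized
```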

  5. It looks like you added support for non-legacy SQL types (INT64, FLOAT64, BOOL, STRUCT). I would accept those changes if submitted separately. But only if it was accompanied by additional unit tests.

The BQ schema API documentation is vague about whether it actually returns INTEGER or INT64, and since SchemaGenerator only supports INTEGER, I added the translation when converting the BQ schema to a schema_map.
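
The translation itself is just a small mapping from standard SQL type names to their legacy equivalents, roughly like this (a sketch, not necessarily the exact table in the PR):

```python
# Standard SQL type names and their legacy SQL equivalents.
STANDARD_TO_LEGACY = {
    "INT64": "INTEGER",
    "FLOAT64": "FLOAT",
    "BOOL": "BOOLEAN",
    "STRUCT": "RECORD",
}

def to_legacy_type(bq_type):
    """Map a standard SQL type name to the legacy name SchemaGenerator expects."""
    return STANDARD_TO_LEGACY.get(bq_type, bq_type)
```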

  6. It looks like you found some bugs in the DataReader.read_chunk() method. I would accept those changes if submitted separately, preferably with additional unit tests.

OK.

Keep in mind that I don't actively work in this space (big data and BigQuery) currently, and I haven't looked at this code for many months. It is likely that you understand my code better than me at this moment. I understand that you want to resolve some deficiency with the code, but you need to take the time to explain to me what those problems are. Also keep in mind that whatever changes that I accept, as a maintainer, I have to maintain that code into the future, whereas you can just walk away.

I understand. I'll prepare the minimal patch that would be of value to us; that is just a 5-line change, the ability to inject a prebuilt schema_map into the run() method. I can carry the rest on my fork or even in our solution.

@bxparks (Owner) commented Mar 24, 2020

Ok, I can see the value in being able to generate the BQ schema incrementally from a previous BQ schema. There are several parts to this problem:

  1. Being able to reconstitute a schema_map from a schema. I never considered this possibility, so I don't know if there are any subtleties or edge cases. I understand that's what your bq_schema_to_map() function is doing. I think this needs some unit tests.

  2. Incrementally adding a new chunk of data to the reconstituted schema. It's not clear to me how errors are supposed to be communicated to the user.

  3. Exposing this to the command line script. It looks like you added this feature only by using this project as a library. To be consistent, I think this feature should be exposed to the command line, probably through a flag (--existing_schema ?) that reads in the existing schema, then generates the new schema.

  4. Testing. Seems like you would need to extend the structure of the testdata.txt format so that you can test this incremental schema generation (a filled-in example follows after this list). I think it would be something like:

DATA 
{new JSON or CSV data}
EXISTING_SCHEMA
{existing BQ schema}
ERRORS
{errors messages}
SCHEMA
{new expected BQ schema}
END
  5. With regards to pytest and tox, I don't think we should pull these in. With regards to a linter/formatter, I use yapf3 and pylint. But they are not hooked up to a CI, so I don't run them as often as I should.
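
For example (a made-up entry just to show how the proposed sections would fit together; the section keywords and error format would be whatever the implementation settles on), a record that adds one column on top of an existing single-column schema might look like:

```
DATA
{ "name": "joe", "age": 10 }
EXISTING_SCHEMA
[
  { "mode": "NULLABLE", "name": "name", "type": "STRING" }
]
SCHEMA
[
  { "mode": "NULLABLE", "name": "age", "type": "INTEGER" },
  { "mode": "NULLABLE", "name": "name", "type": "STRING" }
]
END
```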

@abroglesc (Contributor) commented
Looking at this, it would be useful to me as well. Is it possible to push to get these changes implemented, or could I help make the necessary adjustments myself to allow this to be merged?

@abroglesc (Contributor) commented
I opened #57 as a continuation of this since I cannot contribute to the branch this PR was opened from but I have an interest in getting this code merged.

Thank you @bozzzzo for all of the original work here, it is really useful to me :)

@bxparks (Owner) commented Dec 5, 2020

Superseded by #61. Closing.

@bxparks closed this Dec 5, 2020
@bxparks mentioned this pull request Dec 5, 2020