
Add support for starting from existing BQ schema #40

Closed · wants to merge 10 commits

Conversation

@bozzzzo commented Mar 22, 2020

We're loading a Kafka topic into BQ. It seems prudent to inform SchemaGenerator of the existing BQ table schema before processing new records, as this reduces the chances of misdetection.

@bxparks (Owner) commented Mar 22, 2020

Hi Bozo (assuming that's your real name?),

Thanks for your interest in this project. It looks like you spent some time going through the code to understand how it works. I have some initial feedback.

  1. It's generally a good idea to give the maintainer of a project a note first before sending a PR containing new frameworks, new dependencies, and significant additional code or refactoring of code.

  2. With regards to d7275d7 which adds pytest and tox, I am inclined to reject this. First, it pulls in a bunch of pip packages, some of which are unfamiliar to me, so this commit imposes additional burden on me to spend time to figure those out, and maintain them into the future. This includes 2 new configuration files, in both TOML and INI format? What is the benefit that we gain from adding all those extra dependencies and complexity?

  3. With regards to 590e221 "Add support for starting from exising BQ schema", what is the problem that you are trying to solve with this code? I would have expected some additional examples and test cases, but all I see are some additional error messages? What does "INFORMED" mean?

  4. It looks like you simplified some parts, like the handling of 'sanitize_names' flag, which simplifies the code. It was a submission by a contributor which I probably didn't spend that much time on. I would accept this change if submitted separately.

  5. It looks like you added support for non-legacy SQL types (INT64, FLOAT64, BOOL, STRUCT). I would accept those changes if submitted separately. But only if it was accompanied by additional unit tests.

  6. It looks like you found some bugs in the DataReader.read_chunk() method. I would accept those changes if submitted separately, preferably with additional unit tests.

Keep in mind that I don't actively work in this space (big data and BigQuery) currently, and I haven't looked at this code for many months. It is likely that you understand my code better than me at this moment. I understand that you want to resolve some deficiency with the code, but you need to take the time to explain to me what those problems are. Also keep in mind that whatever changes that I accept, as a maintainer, I have to maintain that code into the future, whereas you can just walk away.

@bozzzzo (Author) commented Mar 24, 2020

Hi Bozo (assuming that's your real name?),

Hi Brian, my real name is Božo, but I'm kind of still not used to websites properly supporting UTF-8 so...

Thanks for your interest in this project. It looks like you spent some time going through the code to understand how it works. I have some initial feedback.

Thanks for taking the time to go through this; I understand I put a lot of stuff into one PR.

  1. It's generally a good idea to give the maintainer of a project a note first before sending a PR containing new frameworks, new dependencies, and significant additional code or refactoring of code.

Point taken.

  2. With regards to d7275d7 which adds pytest and tox, I am inclined to reject this. First, it pulls in a bunch of pip packages, some of which are unfamiliar to me, so this commit imposes additional burden on me to spend time to figure those out, and maintain them into the future. This includes 2 new configuration files, in both TOML and INI format? What is the benefit that we gain from adding all those extra dependencies and complexity?

The short answer is, I mostly wanted to make the tests use pytest. What I like about it is that distinct test cases that are parametrized using @fixture are given unique identifiers, so one can easily re-run just one failing test without needing to hack on the test suite.
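
For illustration only (this is not code from the PR; the fixture name and values here are made up), a parametrized pytest fixture looks roughly like this, and each parameter value becomes its own addressable test case:

```python
import pytest

# Each value in params becomes a separately addressable test case,
# e.g. test_detects_type[42] or test_detects_type[hello].
@pytest.fixture(params=[42, 3.14, "hello", True], ids=str)
def sample_value(request):
    return request.param

def test_detects_type(sample_value):
    # Placeholder assertion; a real test would feed the value through
    # the schema generator and check the deduced BQ type.
    assert sample_value is not None
```

A single failing case can then be re-run by its node ID, e.g. `pytest "test_file.py::test_detects_type[3.14]"`, instead of re-running the whole suite.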

Tox is... something that may be less relevant to SchemaGenerator, as your project is Python 3 only. In short, tox makes it easy to test the project across several versions of Python. It needs tox.ini to work and, for some reason, pyproject.toml. The latter I don't fully understand; it seems to be the new preferred way to replace most of setup.py. I don't know how to use it fully, I just learned to parrot the minimal incantation by searching SO and looking at https://github.com/pypa/sampleproject
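
As a rough sketch (not the configuration from this PR; the environment list is illustrative), a minimal tox.ini looks something like:

```ini
# Minimal tox sketch: run pytest under several Python versions.
[tox]
envlist = py36, py37, py38

[testenv]
deps = pytest
commands = pytest
```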

flake8 is a Python linter with formatting opinions. I disabled it for now, as the cleanup submission would have been excessive even for my threshold :)

  3. With regards to 590e221 "Add support for starting from exising BQ schema", what is the problem that you are trying to solve with this code?

The SchemaGenerator use case seemed to be "I want to import this one file into BQ." What we're doing is importing a Kafka topic into BQ, kind of ELT style. The topic schema changes over time, and I want to catch errors before the records are loaded into BQ, since we can produce much better diagnostics with an early warning if the topic schema was changed in a way that BQ cannot accommodate on its own (yes, the proper solution would be to not make mistakes upstream).

So SchemaGenerator as-is helps with detecting the schema for a slice of Kafka messages, but it has no opinion on the compatibility of a past import's schema with the current one, while it does have such an opinion when an incompatible change happens mid-slice. So I wanted to close this gap and give SchemaGenerator a 'memory' of what it saw before.

I could have pickled the schema_map and done it that way with minimal changes to the schema generator, but it seemed much more robust to be able to work against an existing BQ table and bootstrap the generator from that schema.
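
As a sketch of what bootstrapping from an existing table could look like (this is not the PR's implementation; the table ID is a placeholder, and bq_schema_to_map() refers to the PR's new function), the existing schema can be fetched with the google-cloud-bigquery client and converted into the usual BQ JSON schema format:

```python
from google.cloud import bigquery

def fetch_existing_schema(table_id):
    """Return a table's schema as a list of BQ JSON schema dicts (sketch)."""
    client = bigquery.Client()
    table = client.get_table(table_id)  # e.g. "my-project.my_dataset.my_table"

    def field_to_dict(field):
        entry = {"name": field.name, "type": field.field_type, "mode": field.mode}
        if field.field_type == "RECORD":
            entry["fields"] = [field_to_dict(f) for f in field.fields]
        return entry

    return [field_to_dict(f) for f in table.schema]
```

The resulting list could then be handed to something like bq_schema_to_map() to seed the generator before the next slice of records is processed.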

I would have expected some additional examples and test cases, but all I see are some additional error messages?

There are two additional tests. One tests that a BQ schema can be converted to a schema_map and then flattened again without losing information. The other one...

What does "INFORMED" mean?

... tests that the SchemaGenerator produces the same schema when it starts with the schema already known. What does change in this case is the error messages in edge conditions. I think they are really warning messages, but I didn't want to go adding a warning category.
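
As an illustration of the round-trip test mentioned above (the helper names and map layout here are hypothetical, not the PR's actual schema_map), such a test asserts that converting a schema to a map and flattening it again yields the original schema:

```python
def schema_to_map(schema):
    # Hypothetical: key each column entry by its lower-cased name.
    return {entry["name"].lower(): entry for entry in schema}

def flatten_map(schema_map):
    # Hypothetical inverse: recover the column entries sorted by name.
    return sorted(schema_map.values(), key=lambda e: e["name"].lower())

def test_round_trip():
    schema = [
        {"name": "age", "type": "INTEGER", "mode": "NULLABLE"},
        {"name": "name", "type": "STRING", "mode": "NULLABLE"},
    ]
    assert flatten_map(schema_to_map(schema)) == schema
```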

  4. It looks like you simplified some parts, like the handling of 'sanitize_names' flag, which simplifies the code. It was a submission by a contributor which I probably didn't spend that much time on. I would accept this change if submitted separately.

Yeah, sanitizing was happening too late for the round-trip to work correctly, so I moved it to the source. It may be too simplistic performance-wise, as it now happens for every key of every data record instead of once per key of the produced schema.
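
For context (a simplified sketch, not the project's actual sanitizing code), sanitizing a column name for BQ essentially means replacing characters outside of letters, digits, and underscores, and making sure the name doesn't start with a digit:

```python
import re

def sanitize_name(name):
    """Replace characters BQ column names don't allow with underscores (simplified sketch)."""
    sanitized = re.sub(r"[^a-zA-Z0-9_]", "_", name)
    if sanitized and sanitized[0].isdigit():
        sanitized = "_" + sanitized
    return sanitized
```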

  5. It looks like you added support for non-legacy SQL types (INT64, FLOAT64, BOOL, STRUCT). I would accept those changes if submitted separately. But only if it was accompanied by additional unit tests.

The BQ schema API documentation is vague about whether it actually returns INTEGER or INT64, and since SchemaGenerator only supports INTEGER, I added the translation when converting the BQ schema to a schema_map.
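
The translation itself is just a small mapping from standard SQL type names to their legacy equivalents, roughly like this (a sketch, not necessarily the exact table in the PR):

```python
# Standard SQL type names and their legacy SQL equivalents.
STANDARD_TO_LEGACY = {
    "INT64": "INTEGER",
    "FLOAT64": "FLOAT",
    "BOOL": "BOOLEAN",
    "STRUCT": "RECORD",
}

def to_legacy_type(bq_type):
    """Map a standard SQL type name to the legacy name SchemaGenerator expects."""
    return STANDARD_TO_LEGACY.get(bq_type, bq_type)
```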

  6. It looks like you found some bugs in the DataReader.read_chunk() method. I would accept those changes if submitted separately, preferably with additional unit tests.

OK.

Keep in mind that I don't actively work in this space (big data and BigQuery) currently, and I haven't looked at this code for many months. It is likely that you understand my code better than me at this moment. I understand that you want to resolve some deficiency with the code, but you need to take the time to explain to me what those problems are. Also keep in mind that whatever changes that I accept, as a maintainer, I have to maintain that code into the future, whereas you can just walk away.

I understand. I'll prepare the minimal patch that would be of value to us; that is just a 5-line change, the ability to inject a prebuilt schema_map into the run() method. I can carry the rest on my fork or even in our solution.

@bxparks (Owner) commented Mar 24, 2020

Ok, I can see the value in being able to generate the BQ schema incrementally from a previous BQ schema. There are several parts to this problem:

  1. Being able to reconstitute a schema_map from a schema. I never considered this possibility, so I don't know if there are any subtleties or edge cases. I understand that's what your bq_schema_to_map() function is doing. I think this needs some unit tests.

  2. Incrementally adding a new chunk of data to the reconstituted schema. It's not clear to me how errors are supposed to be communicated to the user.

  3. Exposing this to the command line script. It looks like you added this feature only by using this project as a library. To be consistent, I think this feature should be exposed to the command line, probably through a flag (--existing_schema ?) that reads in the existing schema, then generates the new schema.

  4. Testing. Seems like you would need to extend the structure of the testdata.txt format so that you can test this incremental schema generation (a filled-in example follows after this list). I think it would be something like:

DATA 
{new JSON or CSV data}
EXISTING_SCHEMA
{existing BQ schema}
ERRORS
{errors messages}
SCHEMA
{new expected BQ schema}
END
  5. With regards to pytest and tox, I don't think we should pull these in. With regards to a linter/formatter, I use yapf3 and pylint. But they are not hooked up to a CI, so I don't run them as often as I should.
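
For example (a made-up entry just to show how the proposed sections would fit together; the section keywords and error format would be whatever the implementation settles on), a record that adds one column on top of an existing single-column schema might look like:

```
DATA
{ "name": "joe", "age": 10 }
EXISTING_SCHEMA
[
  { "mode": "NULLABLE", "name": "name", "type": "STRING" }
]
SCHEMA
[
  { "mode": "NULLABLE", "name": "age", "type": "INTEGER" },
  { "mode": "NULLABLE", "name": "name", "type": "STRING" }
]
END
```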

@abroglesc (Contributor) commented
Looking at this, it would be useful to me as well. Is it possible to push to get these changes implemented, or could I help make the necessary adjustments myself to allow this to be merged?

@abroglesc (Contributor) commented
I opened #57 as a continuation of this since I cannot contribute to the branch this PR was opened from but I have an interest in getting this code merged.

Thank you @bozzzzo for all of the original work here, it is really useful to me :)

@bxparks (Owner) commented Dec 5, 2020

Superseded by #61. Closing.

@bxparks closed this Dec 5, 2020
@bxparks mentioned this pull request Dec 5, 2020