Add support for starting from existing BQ schema #40
Conversation
Hi Bozo (assuming that's your real name?), thanks for your interest in this project. It looks like you spent some time going through the code to understand how it works. I have some initial feedback.
Keep in mind that I don't actively work in this space (big data and BigQuery) currently, and I haven't looked at this code for many months. It is likely that you understand my code better than I do at this moment. I understand that you want to resolve some deficiency with the code, but you need to take the time to explain to me what those problems are. Also keep in mind that whatever changes I accept, I, as the maintainer, have to maintain into the future, whereas you can just walk away.
Hi Brian, my real name is Božo, but I'm still not quite used to websites properly supporting UTF-8, so...
Thanks for taking the time to go through this. I understand I put a lot of stuff in one PR.
point taken.
The short answer is, I mostly wanted to make the tests use pytest. What I like about it is distinct test cases that are parametrized. Tox is something that may be less relevant to SchemaGenerator, since your project is Python 3 only; in short, tox makes it easy to test the project across several versions of Python. This one needs the flake8 Python linter, which has formatting opinions. I disabled it for now, as the cleanup submission was excessive even for my threshold :)
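To illustrate the parametrized style mentioned above: here is a minimal sketch of a parametrized pytest test case. The `infer_type` function and its rules are made up for illustration, not SchemaGenerator's real inference code.

```python
import pytest

def infer_type(value: str) -> str:
    """Toy stand-in for a type-inference routine (not SchemaGenerator's real one)."""
    if value.lower() in ("true", "false"):
        return "BOOLEAN"
    try:
        int(value)
        return "INTEGER"
    except ValueError:
        pass
    try:
        float(value)
        return "FLOAT"
    except ValueError:
        return "STRING"

# Each tuple becomes its own distinct, named test case in pytest's report.
@pytest.mark.parametrize("value,expected", [
    ("1", "INTEGER"),
    ("1.5", "FLOAT"),
    ("true", "BOOLEAN"),
    ("hello", "STRING"),
])
def test_infer_type(value, expected):
    assert infer_type(value) == expected
```

The advantage over a single test with a loop is that each parameter set fails and reports independently.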
The SchemaGenerator use case seemed to be "I want to import this one file into BQ". What we're doing is importing a Kafka topic into BQ, ELT style. The topic schema changes over time. SchemaGenerator as-is helps with detecting the schema for a slice of Kafka messages, but it has no opinion on the compatibility of a past import's schema with the current one, while it does have such an opinion when an incompatible change happens mid-slice. I wanted to close this gap and give SchemaGenerator a 'memory' of what it saw before. I could have pickled the schema_map and done it that way with minimal changes to the schema generator, but it seemed much more robust to be able to work against an existing BQ table and bootstrap the generator from that schema.
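The bootstrapping idea can be sketched as a conversion from a BQ table schema (a list of field dicts, as the tables API returns them) into a schema_map. The real schema_map layout is SchemaGenerator's internal detail; the simplified shape below (lowercased name → `{status, info}`) is an assumption for illustration only.

```python
def bq_schema_to_schema_map(bq_schema):
    """Convert an existing BigQuery schema (list of field dicts) into a
    schema_map-like structure to seed the generator.  Simplified sketch."""
    schema_map = {}
    for field in bq_schema:
        entry = {
            # fields coming from an existing table are treated as authoritative
            "status": "hard",
            "info": {
                "name": field["name"],
                "type": field["type"],
                "mode": field.get("mode", "NULLABLE"),
            },
        }
        if field["type"] == "RECORD":
            # recurse into nested records
            entry["info"]["fields"] = bq_schema_to_schema_map(field["fields"])
        schema_map[field["name"].lower()] = entry
    return schema_map
```

Keying by the lowercased name mirrors the case-insensitivity of BQ column names; the generator can then compare newly detected fields against these seeded entries.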
There are two additional tests. One tests that a BQ schema can be converted to a schema_map and then flattened again without losing information. The other tests that SchemaGenerator produces the same schema when starting with the schema already known. What does change in that case are the error messages in edge conditions. I think they are really warning messages, but I didn't want to go adding a warning category.
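The round-trip property described above can be sketched like this. The converter names (`to_map`, `flatten`) are hypothetical stand-ins, not the PR's actual functions.

```python
def to_map(schema):
    """Index a BQ schema (list of field dicts) by field name."""
    return {f["name"]: f for f in schema}

def flatten(schema_map):
    """Turn the map back into a deterministically ordered field list."""
    return sorted(schema_map.values(), key=lambda f: f["name"])

def test_round_trip():
    schema = [
        {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
        {"name": "name", "type": "STRING", "mode": "NULLABLE"},
    ]
    # converting to a map and flattening back must lose no information
    assert flatten(to_map(schema)) == sorted(schema, key=lambda f: f["name"])
```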
Yeah, sanitizing was happening too late for the round trip to work correctly, so I moved it to the source. It may be too simplistic performance-wise, as it now happens for every key of every record instead of once per key of the produced schema.
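A sketch of what sanitizing at the source looks like. The rules below approximate BigQuery column naming (letters, digits, and underscores only, not starting with a digit); the exact rules SchemaGenerator applies may differ.

```python
import re

_INVALID = re.compile(r"[^a-zA-Z0-9_]")

def sanitize_key(key: str) -> str:
    """Replace characters BQ column names disallow and fix a leading digit."""
    key = _INVALID.sub("_", key)
    if key and key[0].isdigit():
        key = "_" + key
    return key

def sanitize_record(record):
    """Recursively sanitize every key of a (possibly nested) record."""
    if isinstance(record, dict):
        return {sanitize_key(k): sanitize_record(v) for k, v in record.items()}
    if isinstance(record, list):
        return [sanitize_record(v) for v in record]
    return record
```

The per-record cost mentioned above could likely be softened by memoizing `sanitize_key` (e.g. with `functools.lru_cache`), since real streams repeat the same small set of keys.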
The BQ schema API documentation is vague about which values it actually returns, INTEGER or INT64, and since SchemaGenerator only supports INTEGER, I added the translation when converting the BQ schema to a schema_map.
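The translation is a small alias table from BigQuery's Standard SQL type names to the Legacy SQL names the generator uses internally. This is a sketch of the idea, not the PR's exact mapping.

```python
# Standard SQL name -> Legacy SQL name used internally by the generator
_TYPE_ALIASES = {
    "INT64": "INTEGER",
    "FLOAT64": "FLOAT",
    "BOOL": "BOOLEAN",
    "STRUCT": "RECORD",
}

def normalize_bq_type(bq_type: str) -> str:
    """Map a type name from the BQ API to the legacy spelling, pass others through."""
    upper = bq_type.upper()
    return _TYPE_ALIASES.get(upper, upper)
```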
ok.
I understand. I'll prepare the minimal patch that would be of value to us: just a 5-line change adding the ability to inject a prebuilt schema_map into the run() method. I can carry the rest on my fork or even in our solution.
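The shape of that minimal change can be sketched as follows. The class and method bodies here are illustrative stand-ins, not SchemaGenerator's actual API; only the idea of seeding `run()` with an optional prebuilt map is taken from the discussion.

```python
class SeedableGenerator:
    """Toy generator showing an injectable schema_map; not the real class."""

    def __init__(self):
        self.schema_map = {}

    def run(self, records, schema_map=None):
        # the "5-line change": start from the injected map instead of empty
        self.schema_map = schema_map if schema_map is not None else {}
        for record in records:
            self._deduce_record(record)
        return self.schema_map

    def _deduce_record(self, record):
        # trivial stand-in for real type deduction: remember the first type seen
        for key, value in record.items():
            self.schema_map.setdefault(key, type(value).__name__)
```

A caller can then bootstrap from a previously stored schema: `gen.run(new_records, schema_map=previous_map)`.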
Identify data chunks by count and line number
Ok, I can see the value in being able to generate the BQ schema incrementally from a previous BQ schema. There are several parts to this problem:
… key from new schema only differs in case
…cess inconvertible types
Looking at this, it would be useful to me as well. Is it possible to push to get these implemented, or can I help make the necessary adjustments myself to allow this to be merged?
Superseded by #61. Closing.
We're loading a Kafka topic into BQ. It seems prudent to inform SchemaGenerator of the existing BQ table schema before processing new records, as this reduces the chances of misdetection.