
Change dict to json #1028

Merged: 8 commits merged into main from kasiahinkson/list-data-type on Jul 11, 2024

Conversation

@KasiaHinkson
Contributor

The RECORD type requires the shape of the dictionary to be explicit, and it fails when the Parsons table has a dict value. I've tested with New/Mode data, and it loads successfully as the JSON type.
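In sketch form, the change amounts to remapping Python dicts in the Python-to-BigQuery type map (the map name and contents below are illustrative, not necessarily Parsons' internal names):

# Hypothetical type map; this PR's idea is the last entry.
PYTHON_TO_BIGQUERY_TYPES = {
    "str": "STRING",
    "int": "INTEGER",
    "float": "FLOAT",
    "bool": "BOOLEAN",
    "dict": "JSON",  # previously "RECORD", which needs an explicit nested schema
}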

@austinweisgrau
Collaborator

This doesn't seem to work either, at least with what I tried:

Table([{"dict_value": {"a": 1, "b": 2}}]).to_bigquery("aweisgrau.test", if_exists="drop")

yields

google_bigquery ERROR {'reason': 'invalid', 'location': 'gs://bkt-tmc-mem-wfp/856f7b4d-b0b1-4e1a-8a56-d8f9bfb1f091.csv', 'message': 'Error while reading data, error message: syntax error while parsing object key - invalid literal; last read: \'{\'\'; expected string literal; line_number: 2 byte_offset_to_start_of_line: 22 column_index: 0 column_name: "dict_value" column_type: JSON value: "{\\\'a\\\': 1, \\\'b\\\': 2}" File: gs://bkt-tmc-mem-wfp/856f7b4d-b0b1-4e1a-8a56-d8f9bfb1f091.csv'}
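The error message points at the root cause: writing the dict to CSV with Python's default str() produces single-quoted keys, which is not valid JSON. A quick demonstration of the difference:

import json

value = {"a": 1, "b": 2}
print(str(value))         # {'a': 1, 'b': 2}  <- single quotes, not valid JSON
print(json.dumps(value))  # {"a": 1, "b": 2}  <- valid JSON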

@austinweisgrau
Collaborator

Maybe we could look for any columns with dicts and convert them to JSON strings before sending them to BigQuery, while keeping the column type defined as JSON?
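A rough sketch of that suggestion (the helper name is hypothetical, not Parsons API):

import json

def jsonify_dict_columns(tbl):
    # Illustrative helper: rewrite any dict or list cell as a JSON
    # string so BigQuery can parse the CSV cell as JSON.
    for column in tbl.columns:
        tbl.convert_column(
            column,
            lambda v: json.dumps(v) if isinstance(v, (dict, list)) else v,
        )
    # Note: convert_column mutates the table in place, which becomes
    # the concern discussed later in this thread.
    return tbl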

@austinweisgrau
Collaborator

Orrrr, change the way Python dicts are materialized to CSVs to be JSON-compliant, or something.
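A minimal illustration of that idea using the standard library's csv module (not Parsons' actual CSV writer):

import csv
import io
import json

rows = [{"dict_value": {"a": 1, "b": 2}}]
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["dict_value"])
writer.writeheader()
for row in rows:
    # json.dumps (instead of str) makes the cell valid JSON for BigQuery.
    writer.writerow(
        {k: json.dumps(v) if isinstance(v, dict) else v for k, v in row.items()}
    )
print(buf.getvalue())
# dict_value
# "{""a"": 1, ""b"": 2}"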

@KasiaHinkson
Contributor Author

Thank you so much! I had a feeling I was missing something and couldn't make the time to test it more fully, so I super appreciate this. I like the idea of converting dicts to JSON strings.

@austinweisgrau
Collaborator

This works, but the problem with this implementation is that the Table column is changed in a persistent way after the load. That's probably not a great side effect.

tbl = Table([{'dict_value': {'a': 1, 'b': 2}}])

type(tbl[0]['dict_value'])
>> dict

tbl.to_bigquery('aweisgrau.test', if_exists='drop')
type(tbl[0]['dict_value'])
>> str

I think options include:

  • figuring out an elegant way to make a separate, temporary copy of the table just for the BigQuery load, without changing the original table object (see the sketch after this list)
  • requiring that folks do this conversion themselves and define the schema with the new string column as a JSON type if they want this behavior
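A rough sketch of the first option (the wrapper name is hypothetical, and it assumes an in-memory table where a deep copy is safe):

import copy
import json

def to_bigquery_as_json(tbl, *args, **kwargs):
    # Hypothetical wrapper: serialize dict/list cells on a copy of the
    # table so the caller's Table object is left unchanged.
    load_tbl = copy.deepcopy(tbl)  # assumes an in-memory table
    for column in load_tbl.columns:
        load_tbl.convert_column(
            column,
            lambda v: json.dumps(v) if isinstance(v, (dict, list)) else v,
        )
    return load_tbl.to_bigquery(*args, **kwargs)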

@KasiaHinkson
Contributor Author

Yeah, I knew something kind of funky was going on. I think I lean towards your second idea; it's more flexible and explicit.

@KasiaHinkson
Contributor Author

OK, so what I've done is just json.dumps() all the dicts and lists, and then in dbt I'm using PARSE_JSON. That felt easier in this case than specifying the entire table schema. If that feels like a reasonable expectation, then I think we should just take out the dict: RECORD mapping, since that doesn't work, and add some documentation wherever it fits best. What do you think?
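End to end, that workflow might look like this (the dataset and table names are placeholders; PARSE_JSON is BigQuery's SQL function for parsing a JSON-formatted string):

import json

from parsons import Table

tbl = Table([{"dict_value": {"a": 1, "b": 2}}])
# Serialize the dict column to JSON strings; it loads as a STRING column.
tbl.convert_column("dict_value", json.dumps)
tbl.to_bigquery("some_dataset.some_table", if_exists="drop")

# Downstream (e.g. in a dbt model), parse the string back into JSON:
# SELECT PARSE_JSON(dict_value) AS dict_value
# FROM some_dataset.some_table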

@austinweisgrau
Collaborator

Yes, that sounds right to me.

@shaunagm
Collaborator

What's the status of this? From this conversation, it sounds like you want to update the PR to remove "dict": "RECORD" instead of replacing it with "dict": "JSON". Let me know when that's done and I can approve and merge (or feel free to do something else if I've misunderstood, of course!)

@KasiaHinkson
Contributor Author

I actually think our conclusion was to take dict out entirely and add documentation saying that, before running a copy function, users should convert dictionaries to JSON strings and then use PARSE_JSON in BigQuery. I'm not 100% sure where to write this documentation.

@austinweisgrau
Collaborator

Maybe if any column's best type is a dict, we could catch the KeyError and re-raise it with a more verbose description of the situation and how to address it?
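That might look something like this sketch (the type-map name and the message wording are illustrative):

# Hypothetical map; dicts are intentionally absent.
PYTHON_TO_BIGQUERY_TYPES = {"str": "STRING", "int": "INTEGER"}

def get_bigquery_type(python_type_name: str) -> str:
    # Re-raise the KeyError for dicts with actionable guidance.
    try:
        return PYTHON_TO_BIGQUERY_TYPES[python_type_name]
    except KeyError as exc:
        if python_type_name == "dict":
            raise KeyError(
                "BigQuery cannot infer a schema for dict columns. "
                "Convert dict values to JSON strings (e.g. json.dumps) "
                "before loading, then use PARSE_JSON in BigQuery."
            ) from exc
        raise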

@austinweisgrau
Collaborator

Here's my suggestion of what we could do here: #1068

@austinweisgrau
Collaborator

LGTM!

@KasiaHinkson
Contributor Author

Addressed by PR #1068

@austinweisgrau
Collaborator

austinweisgrau commented Jul 11, 2024

Noo @KasiaHinkson, sorry: #1068 merged a change into THIS branch, not into main. This PR is now ready to be merged into main, not deleted.

@austinweisgrau
Collaborator

I'm going to merge, though, since it seems like we're in agreement that it's good to go.

@austinweisgrau merged commit 118d744 into main on Jul 11, 2024
34 checks passed
@austinweisgrau deleted the kasiahinkson/list-data-type branch on July 11, 2024 at 20:12
@KasiaHinkson
Contributor Author

Ohhh yep, my apologies! No power for 2 days turns my brain off, apparently 😆
