Skip bad records instead of throwing an exception? #49
Can you filter out the invalid lines of your JSON input file before processing it through generate_schema.py? I mean, it's a 12-line Python script: (Edit: fixed line counter; it originally started as an error counter)

```python
#!/usr/bin/env python3
import sys
import json

line_number = 0
for line in sys.stdin:
    line_number += 1
    try:
        json.loads(line)
        print(line, end='')
    except json.JSONDecodeError:
        print(f'#{line_number}: invalid JSON', file=sys.stderr)
```

Put this in a script called
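The same filtering logic can be wrapped in a testable function; this is only a sketch (the function name `filter_valid_json` is illustrative, not part of the project):

```python
import json

def filter_valid_json(lines):
    """Split an iterable of text lines into lines that parse as JSON
    and (line_number, line) pairs that fail to parse."""
    valid, invalid = [], []
    for line_number, line in enumerate(lines, start=1):
        try:
            json.loads(line)
            valid.append(line)
        except json.JSONDecodeError:
            invalid.append((line_number, line))
    return valid, invalid

valid, invalid = filter_valid_json(['{"a": 1}', 'not json', '{"b": 2}'])
# valid  == ['{"a": 1}', '{"b": 2}']
# invalid == [(2, 'not json')]
```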
I am hesitant about adding such a flag; I want to keep the tool simple.
The core reason for me would be performance, i.e. not having to traverse and parse the JSON file twice.
Performance is a fair point for large input files. So something like
For JSON files, it seems to me that the easiest/shortest solution would be to catch the exception in the loop that reads the input lines. If the option becomes more generic, i.e. not limited to JSON, the handling may need to live elsewhere.
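The approach discussed here, catching the decode exception inside the read loop and skipping the line only when a flag is set, could look roughly like the following. This is a hypothetical sketch; the function name `deduce_schema_lines` and the parameter name `ignore_invalid_lines` are assumptions for illustration, not the project's actual API:

```python
import json

def deduce_schema_lines(lines, ignore_invalid_lines=False):
    """With the flag set, skip undecodable lines and record an error;
    without it, re-raise on the first invalid line (old behavior).
    Illustrative sketch only, not the project's actual code."""
    records, errors = [], []
    for line_number, line in enumerate(lines, start=1):
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as e:
            if not ignore_invalid_lines:
                raise  # original behavior: fail fast on the first bad line
            errors.append((line_number, str(e)))
    return records, errors
```

This keeps a single pass over the file, which addresses the performance concern about parsing the input twice.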
Implemented --ignore_invalid_lines. You can try out my code by checking out the corresponding branch.
Looks like it works well, thanks!
…lines is not given, throws an exception upon the very first error and stops (#49)
Thanks for testing! The change in behavior was not intended; I was tired yesterday and was not thinking clearly. I fixed it so that it follows the old behavior when --ignore_invalid_lines is missing, in other words, it stops and throws an exception as soon as the first invalid line is encountered.
I pulled the latest version and I can confirm that without the flag, the script behaves as it did originally (throwing an exception); with the flag, it works as expected (skipping the line).
I have a newline-delimited JSON file with a few bad (i.e. undecodable) lines. Currently this results in a JSONDecodeError halting execution. Given that BigQuery can cope with bad records (the --max_bad_records parameter) by skipping them, would it be useful to have a similar option in the schema generator? (This could be useful for e.g. CSV files with missing trailing columns as well.) Concretely, the issue with my JSON file could be resolved by adding an (optional) try/except to

bigquery-schema-generator/bigquery_schema_generator/generate_schema.py
Lines 546 to 552 in a60c38a
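For comparison, BigQuery's --max_bad_records semantics tolerate up to N bad records before failing, rather than skipping an unlimited number. A rough sketch of that behavior (the helper `parse_with_max_bad_records` is hypothetical, not part of generate_schema.py):

```python
import json

def parse_with_max_bad_records(lines, max_bad_records=0):
    """Skip up to max_bad_records undecodable lines; once the budget
    is exceeded, re-raise the decode error. Mimics the spirit of
    BigQuery's --max_bad_records; illustrative only."""
    records = []
    bad = 0
    for line in lines:
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad += 1
            if bad > max_bad_records:
                raise  # bad-record budget exhausted
    return records
```

With max_bad_records=0 (the default), this fails on the first bad line, matching the current behavior.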