-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for CSV files with header in first row #26
Conversation
Your implementation is certainly better, thank you for spending your time on this! What's concerning is that I've just noticed that 'hardness' and 'softness' aren't working (neither with mine nor yours) as it is continuously being reported to be For example, for an input,
It returns,
While I'd expect |
I think you might be misunderstanding the meaning of 'soft' and 'hard'. If you look at
Everything else is resolved to be 'hard', even the |
You're right. It seems to consider it an empty String.
And then resolve it as,
It makes me think that due to nature of CSV files it'd make sense to treat them as,
A dirty show case (only for booleans and integers, and only for CSV:)), in.csv
out.json
If we could have had something a similar logic triggered when a CSV format is chosen. Not sure if you'd like it :) |
…hich can be clobbered by subsequent non-empty values (#26)
You are absolutely right that an empty CSV field must be handled specially. I missed that. It looks like
The first case ( In the second case, I think we just need to infer this to be I added a bunch of unit tests to cover these cases. Let me know if there's something I missed. |
Thank you! In simple words, having an input file such as,
I would conclude that, |
I think you have a small typo in your example. If you want
With this correction,
All of those seem right to me. Column I wonder if we are thinking about "soft" and "hard" in the same way. In the code, I use "soft" to mean the same as "provisional" or "temporary". When the script comes across a value whose type is not certain, it assigns a So columns |
I see! It probably makes sense for large hierarchical JSON files.
As you may think one file could be insufficient to decide on However, if I arbitrarily decide that N files (from technical POV concatenated perhaps) gives me this level of certainty and if Long story short, I wonder if it inspires you further to develop it the way I described it (I could send one more PR too for starters). Frankly, I think a different meaning for |
As far as my last sentence is concerned it could be something like,
Processed with an unspecified flag
And with
Whereas |
Hi, It sounds like you want to figure out if the mode of a column is NULLABLE or REQUIRED. However, the script is not designed to do that. It always assumes that the column is NULLABLE, because the resulting schema will allow the data file to be imported using Let me put this another way. The auto-schema detection algorithm of |
Exactly, apart from supporting CSV file format I'd like it to allow resolving a column mode. |
Ok, that seems like a feature request which is separate from adding support for CSV. Can we agree to merge in this PR to add support for reading a CSV file. Then can you file a feature request for the NULLABLE versus REQUIRED mode detection? In that ticket, can you describe as clearly as possible what the expected behavior you would like? See if you can examine all the edge cases, particularly the distinction between a missing field and an empty field. Your use-case is not something that I have come across myself, so if you can describe the concrete problem that you are trying to solve, that will help me understand your needs. I'm not ruling out adding the feature, it's just that I don't currently understand the pressing need. |
You're right, it would make sense to merge this feature now and work on/discuss mode auto-detection in a dedicated PR. Not every user is as much concerned about the mode as I am so the current PR certainly has a value in it already. I'll come up with a PR the soonest I can, so we could continue our discussions. Thank you for your time! |
…em for JSON, following 'bq load' behavior
…ields in the schema file for CSV input file
I had to make a bunch of more changes to correctly support CSV files:
Will merge this into 'develop' branch and let it simmer there for a short while. |
Support CSV files (inspired by #25). Adds a level of indirection using 2 adapters: