
Add support for CSV files with header in first row #26

Merged
merged 7 commits into develop from csv on Mar 1, 2019
Conversation

bxparks (Owner) commented Feb 26, 2019

Support CSV files (inspired by #25). Adds a level of indirection using 2 adapters:

  • csv.DictReader to read a CSV file with the header in the first row
  • a generator to read the newline-delimited JSON file
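A minimal sketch of what the two adapters might look like (hypothetical function names; the actual implementation may differ):

    import csv
    import json

    def csv_reader(file):
        # Adapter 1: csv.DictReader yields one dict per row, taking the
        # keys from the header in the first row of the file.
        return csv.DictReader(file)

    def json_reader(file):
        # Adapter 2: a generator yielding one dict per line of a
        # newline-delimited JSON file.
        for line in file:
            yield json.loads(line)

Since both adapters yield dicts, the schema-inference loop can consume either format through the same interface.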

korotkevics (Contributor) commented:
> Support CSV files (inspired by #25). Adds a level of indirection using 2 adapters:
>
>   • csv.DictReader to read a CSV file with the header in the first row
>   • a generator to read the newline-delimited JSON file

Your implementation is certainly better, thank you for spending your time on this!

What concerns me is that I've just noticed that 'hardness' and 'softness' aren't working (with either my implementation or yours): every column is consistently reported as hard. I haven't figured out why yet.

For example, given the input,

a,b,c
,ho,hi
3,hu,he

It returns,

{
  "a": {
    "status": "hard",
    "info": {
      "mode": "NULLABLE",
      "name": "a",
      "type": "STRING"
    }
  },
  "b": {
    "status": "hard",
    "info": {
      "mode": "NULLABLE",
      "name": "b",
      "type": "STRING"
    }
  },
  "c": {
    "status": "hard",
    "info": {
      "mode": "NULLABLE",
      "name": "c",
      "type": "STRING"
    }
  }
}

whereas I'd expect a to be QINTEGER and soft.

bxparks (Owner, Author) commented Feb 27, 2019

I think you might be misunderstanding the meaning of 'soft' and 'hard'. If you look at SchemaGenerator.get_schema_entry(), there are 3 ways that an entry can be 'soft':

  • value is null
  • value is an empty array []
  • value is an empty record {}

Everything else is resolved to be 'hard', even the Q* types. For a CSV file, I am almost certain that an empty field is returned as an empty string "" by DictReader, and of course, CSV doesn't support arrays or records.
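In pseudocode, that check amounts to something like this (a paraphrase, not the actual code in SchemaGenerator.get_schema_entry()):

    def is_soft(value):
        # The type cannot be inferred from a null, an empty array, or an
        # empty record, so the entry gets only a provisional status.
        return value is None or value == [] or value == {}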

korotkevics (Contributor) commented:
You're right. It seems to treat it as an empty string:

...
                elif self.FLOAT_MATCHER.match(value):
                    return 'QFLOAT'  # quoted float
                elif value.lower() in ['true', 'false']:
                    return 'QBOOLEAN'  # quoted boolean
                else:
->                  return 'STRING'
...

And then it is resolved as:

...
        else:
            schema_entry = OrderedDict([('status', 'hard'),
                                        ('info', OrderedDict([
                                            ('mode', value_mode),
                                            ('name', key),
                                            ('type', value_type),
                                        ]))])
...

It makes me think that, given the nature of CSV files, it would make sense to treat them as follows (sketched in code right after this list):

  • an empty string equals null
  • any quoted type + STRING = the unquoted type
  • hard + hard -> hard; hard + soft -> soft, with the type inherited from the hard entry; soft + soft -> soft
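A sketch of those merging rules (hypothetical helper names, just to make the proposal concrete):

    def merge_status(status_a, status_b):
        # hard + hard -> hard; any soft participant keeps the result soft
        if status_a == 'hard' and status_b == 'hard':
            return 'hard'
        return 'soft'

    def merge_type(type_a, type_b):
        # quoted type + STRING -> the unquoted type,
        # e.g. QINTEGER + STRING -> INTEGER
        if type_a == 'STRING' and type_b.startswith('Q'):
            return type_b[1:]
        if type_b == 'STRING' and type_a.startswith('Q'):
            return type_a[1:]
        return type_a if type_a == type_b else 'STRING'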

A quick and dirty showcase (only for booleans and integers, and only for CSV :)):
korotkevics@8967bad

in.csv

a,b,c,d,e
,ho,hi,true
3,hu,he,

generate-schema --debugging_map --input_format=csv < in.csv > out.json; cat out.json

out.json

{
  "a": {
    "status": "soft",
    "info": {
      "mode": "NULLABLE",
      "name": "a",
      "type": "QINTEGER"
    }
  },
  "b": {
    "status": "hard",
    "info": {
      "mode": "NULLABLE",
      "name": "b",
      "type": "STRING"
    }
  },
  "c": {
    "status": "hard",
    "info": {
      "mode": "NULLABLE",
      "name": "c",
      "type": "STRING"
    }
  },
  "d": {
    "status": "soft",
    "info": {
      "mode": "NULLABLE",
      "name": "d",
      "type": "QBOOLEAN"
    }
  }
}

It would be nice to have similar logic triggered when the CSV format is chosen. Not sure if you'd like it :)

bxparks (Owner, Author) commented Feb 27, 2019

You are absolutely right that an empty CSV field must be handled specially. I missed that.

It looks like csv.DictReader distinguishes 2 cases:

  1. if the field is missing (the comma itself is missing), then DictReader returns a Python None
  2. if the field is empty (in other words, there exists a comma that delimits the field), then DictReader returns an empty string ""

The first case (None) is handled automatically by the current code.

In the second case, I think we just need to infer it to be a soft STRING. Then the rest of the type-inference logic works as expected.
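The distinction is easy to verify in the Python REPL (a quick demonstration, not code from this repository):

    import csv
    import io

    reader = csv.DictReader(io.StringIO('a,b,c\n1,\n'))
    print(next(reader))
    # {'a': '1', 'b': '', 'c': None} -- 'b' is empty, 'c' is missing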

I added a bunch of unit tests to cover these cases. Let me know if there's something I missed.

korotkevics (Contributor) commented Feb 27, 2019

Thank you!
As far as empty values are concerned, I think we are on the same page now, and your solution again turns out to be more compact :)
However, with regard to soft/hard I have different expectations; maybe it is simply because my take on this is different.
I branched from csv to csv-ext and shared what I meant there. I tweaked it under an if statement, and in the tests I had to use the --keep_nulls flag as well as change an expected FLOAT to an INTEGER (explained further below). Ideally, it shouldn't be tweaked like that, I guess.

In simple words, given an input file such as,

a,b,c,d,e
1,x,true,2.0
2,x,,,4
3,,,,

I would conclude that:

  • a is consistently filled in, always with an INTEGER, therefore it is hard to assume that type
  • b is only sometimes filled in, and when it is, the value is a STRING, therefore it is soft to assume that type
  • c is the same; it is soft to assume it is a BOOLEAN
  • d is always empty, so we'd assume it is a soft STRING by default
  • e is sometimes an INTEGER and sometimes a FLOAT; it is soft to assume it is an INTEGER (the closest approximation). UPDATE: well, I am not sure about this point, we'll have a look at it later.

bxparks (Owner, Author) commented Feb 28, 2019

I think you have a small typo in your example. If you want d to be always empty, I think you need to add an extra comma in Line 2, before the "2.0", like this:

a,b,c,d,e
1,x,true,,2.0
2,x,,,4
3,,,,

With this correction, d is always empty, so you need --keep_nulls to include it in the schema, to get:

[ 
  { 
    "mode": "NULLABLE",
    "name": "a",
    "type": "INTEGER"
  },
  { 
    "mode": "NULLABLE",
    "name": "b",
    "type": "STRING"
  },
  { 
    "mode": "NULLABLE",
    "name": "c",
    "type": "BOOLEAN"
  },
  { 
    "mode": "NULLABLE",
    "name": "d",
    "type": "STRING"
  },
  { 
    "mode": "NULLABLE",
    "name": "e",
    "type": "FLOAT"
  }
]

All of those seem right to me. Column e is inferred to be a FLOAT because "2.0" is a floating point number. bq load also interprets "2.0" as a FLOAT, so I think that's consistent.

I wonder if we are thinking about "soft" and "hard" in the same way. In the code, I use "soft" to mean the same as "provisional" or "temporary". When the script comes across a value whose type is not certain, it assigns a status of "soft" (usually, this is a null value, or an empty string ""). If the script then encounters a subsequent data record with a value whose type is well-defined, then it converts the status to "hard".

So columns b, c or e are not "soft" at the end of the processing, because there was at least one concrete value in those fields that allowed the type to be inferred definitively, i.e. "hard". Column d is the only one that remains "soft" after all lines are parsed, because that column has no concrete value.

korotkevics (Contributor) commented Feb 28, 2019

I see! It probably makes sense for large hierarchical JSON files.
My situation is that I wouldn't skip any such columns in a CSV file; I would only do the following (see the sketch after this list):

  • if a value is always filled in and the type is unambiguous, I'd conclude it is REQUIRED and of that type
  • if a value is empty all the time, I'd conclude it is NULLABLE and a STRING
  • if a value is only sometimes filled in and the type is unambiguous, I'd conclude it is NULLABLE and of that type
  • if a value is sometimes or always filled in and the type is ambiguous, I'd conclude it is the closest matching type, with STRING as the last resort
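A sketch of those four rules applied to a whole column (hypothetical, assuming the per-value types have already been inferred):

    def infer_column(values, types):
        # values: the raw strings of one CSV column;
        # types: the inferred types of its non-empty values
        empties = sum(1 for v in values if v in (None, ''))
        if empties == len(values):
            return ('NULLABLE', 'STRING')  # always empty
        mode = 'REQUIRED' if empties == 0 else 'NULLABLE'
        distinct = set(types)
        if len(distinct) == 1:
            return (mode, distinct.pop())  # unambiguous type
        if distinct <= {'INTEGER', 'FLOAT'}:
            return (mode, 'FLOAT')  # closest matching numeric type
        return (mode, 'STRING')  # STRING as the last resort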

As you might guess, one file could be insufficient to decide on the type (though usually it is sufficient), and it is certainly insufficient to decide on the mode.

However, if I arbitrarily decide that N files (concatenated, perhaps, from a technical point of view) give me that level of certainty, and if soft/hard meant what I thought it did, then I'd be all set :)

Long story short, I wonder if this inspires you to develop it further in the way I described (I could send one more PR for starters, too). Frankly, I think a different meaning of soft/hard could make sense for CSV: there, one doesn't distill a standard out of 'chaos' but rather has a standard from the very beginning and only needs to recognize the cases mentioned above. Then again, it could be a new concept, with a new field next to status, possibly triggered by flags.

korotkevics (Contributor) commented Feb 28, 2019

As far as my last sentence is concerned, it could work something like this:

in.csv

a
1
2
3

Processed with a new (not yet existing) flag --decide_on_mode, defaulting to false, this would give us:

  • in debugging mode:

{
  "a": {
    "status": "hard",
    "is_likely_required": "true",
    "info": {
      "mode": "NULLABLE",
      "name": "a",
      "type": "QINTEGER"
    }
  }
}
  • normally:

[
  {
    "mode": "NULLABLE",
    "name": "a",
    "type": "INTEGER"
  }
]

And with --decide_on_mode set to true,

{
  "a": {
    "status": "hard",
    "is_likely_required": "true",
    "info": {
      "mode": "REQUIRED",
      "name": "a",
      "type": "QINTEGER"
    }
  }
}
  • normally:

[
  {
    "mode": "REQUIRED",
    "name": "a",
    "type": "INTEGER"
  }
]

Here, is_likely_required would be determined in the manner in which I tweaked your take on soft and hard. :)

bxparks (Owner, Author) commented Feb 28, 2019

Hi,
I'm having difficulty understanding the behavior that you want from the script.

It sounds like you want to figure out if the mode of a column is NULLABLE or REQUIRED. However, the script is not designed to do that. It always assumes that the column is NULLABLE, because the resulting schema will allow the data file to be imported using bq load. This is consistent with the auto-schema detection of bq load. Can you explain why it would be useful for the script to determine if the column is REQUIRED?

Let me put this another way. The auto-schema detection algorithm of bq load uses only the first 100 lines of the CSV file. The goal of the bigquery-schema-generator script is to produce a schema that the bq load algorithm would have produced if it had examined all the lines of the file, not just the first 100.
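For comparison, a typical invocation pair looks something like this (the dataset and table names are placeholders):

    generate-schema --input_format=csv < file.csv > file.schema.json
    bq load --source_format=CSV --skip_leading_rows=1 mydataset.mytable file.csv file.schema.json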

korotkevics (Contributor) commented Feb 28, 2019

Exactly: apart from supporting the CSV file format, I'd like it to be able to resolve a column's mode.
My motivation is that if I restrict the mode early, based on a sufficient amount of data, then if I later receive a file with a missing value in a required column, that tells me something went wrong. It may not be universally helpful, but it is the reality I am dealing with, so I would not like to blindly set the mode to NULLABLE :)
bq auto-schema detection clearly has limitations for my purposes as well.
So, deciding on a column mode (say, with --decide_on_mode set to true, letting is_likely_required affect the mode) does not fit your vision for the library, correct?

bxparks (Owner, Author) commented Feb 28, 2019

Ok, that seems like a feature request which is separate from adding support for CSV. Can we agree to merge this PR to add support for reading a CSV file? Then can you file a feature request for the NULLABLE versus REQUIRED mode detection? In that ticket, describe as clearly as possible the behavior you would like. See if you can examine all the edge cases, particularly the distinction between a missing field and an empty field. Your use case is not something that I have come across myself, so if you can describe the concrete problem that you are trying to solve, that will help me understand your needs. I'm not ruling out adding the feature; it's just that I don't currently understand the pressing need.

korotkevics (Contributor) commented:
You're right, it makes sense to merge this feature now and discuss mode auto-detection in a dedicated PR. Not every user is as concerned about the mode as I am, so the current PR certainly has value already. I'll come up with a PR as soon as I can, so we can continue our discussion. Thank you for your time!

bxparks (Owner, Author) commented Mar 1, 2019

I had to make a bunch more changes to correctly support CSV files:

  • the order of fields must be preserved in the schema file for CSV
  • keep_nulls must be set to True automatically for CSV, otherwise the positional mapping from value to column becomes incorrect (see the sketch below)
  • updated README.md and various comments in the code
  • updated the unit tests to preserve the ordering of fields in the schema file
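A sketch of the keep_nulls override (illustrative only; the actual constructor may differ):

    class SchemaGenerator:
        def __init__(self, input_format='json', keep_nulls=False):
            # For CSV, every column must appear in the schema; dropping an
            # always-empty column would shift the positional mapping from
            # values to columns at load time.
            if input_format == 'csv':
                keep_nulls = True
            self.input_format = input_format
            self.keep_nulls = keep_nulls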

Will merge this into 'develop' branch and let it simmer there for a short while.

bxparks merged commit 0a7b33c into develop on Mar 1, 2019
bxparks deleted the csv branch on Mar 1, 2019 at 00:44