
Add support for CSV files with header in first row #26

Merged
merged 7 commits into develop from csv on Mar 1, 2019
Conversation

bxparks (Owner) commented Feb 26, 2019

Support CSV files (inspired by #25). Adds a level of indirection using 2 adapters:

  • csv.DictReader to read a CSV file with the header in the first row
  • a generator to read the newline-delimited JSON file
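A minimal sketch of what the two adapters might look like (hypothetical function names; the actual implementation may differ):

    import csv
    import json

    def csv_reader(file):
        # Adapter 1: csv.DictReader yields one dict per row, taking the
        # keys from the header in the first row of the file.
        return csv.DictReader(file)

    def json_reader(file):
        # Adapter 2: a generator yielding one dict per line of a
        # newline-delimited JSON file.
        for line in file:
            yield json.loads(line)

Since both adapters yield dicts, the schema-inference loop can consume either format through the same interface.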

korotkevics (Contributor) commented:
> Support CSV files (inspired by #25). Adds a level of indirection using 2 adapters:
>
>   • csv.DictReader to read a CSV file with the header in the first row
>   • a generator to read the newline-delimited JSON file

Your implementation is certainly better, thank you for spending your time on this!

What concerns me is that I've just noticed that 'hardness' and 'softness' aren't working (with either my implementation or yours): every column is consistently reported as hard. I haven't figured out why yet.

For example, given the input,

a,b,c
,ho,hi
3,hu,he

It returns,

{
  "a": {
    "status": "hard",
    "info": {
      "mode": "NULLABLE",
      "name": "a",
      "type": "STRING"
    }
  },
  "b": {
    "status": "hard",
    "info": {
      "mode": "NULLABLE",
      "name": "b",
      "type": "STRING"
    }
  },
  "c": {
    "status": "hard",
    "info": {
      "mode": "NULLABLE",
      "name": "c",
      "type": "STRING"
    }
  }
}

whereas I'd expect a to be QINTEGER and soft.

bxparks (Owner, Author) commented Feb 27, 2019

I think you might be misunderstanding the meaning of 'soft' and 'hard'. If you look at SchemaGenerator.get_schema_entry(), there are 3 ways that an entry can be 'soft':

  • value is null
  • value is an empty array []
  • value is an empty record {}

Everything else is resolved to be 'hard', even the Q* types. For a CSV file, I am almost certain that an empty field is returned as an empty string "" by DictReader, and of course, CSV doesn't support arrays or records.
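In pseudocode, that check amounts to something like this (a paraphrase, not the actual code in SchemaGenerator.get_schema_entry()):

    def is_soft(value):
        # The type cannot be inferred from a null, an empty array, or an
        # empty record, so the entry gets only a provisional status.
        return value is None or value == [] or value == {}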

korotkevics (Contributor) commented:
You're right. It seems to treat it as an empty string:

...
                elif self.FLOAT_MATCHER.match(value):
                    return 'QFLOAT'  # quoted float
                elif value.lower() in ['true', 'false']:
                    return 'QBOOLEAN'  # quoted boolean
                else:
->                  return 'STRING'
...

And then it is resolved as:

...
        else:
            schema_entry = OrderedDict([('status', 'hard'),
                                        ('info', OrderedDict([
                                            ('mode', value_mode),
                                            ('name', key),
                                            ('type', value_type),
                                        ]))])
...

It makes me think that, given the nature of CSV files, it would make sense to treat them as follows (sketched in code right after this list):

  • an empty string equals null
  • any quoted type + STRING = the unquoted type
  • hard + hard -> hard; hard + soft -> soft, with the type inherited from the hard entry; soft + soft -> soft
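A sketch of those merging rules (hypothetical helper names, just to make the proposal concrete):

    def merge_status(status_a, status_b):
        # hard + hard -> hard; any soft participant keeps the result soft
        if status_a == 'hard' and status_b == 'hard':
            return 'hard'
        return 'soft'

    def merge_type(type_a, type_b):
        # quoted type + STRING -> the unquoted type,
        # e.g. QINTEGER + STRING -> INTEGER
        if type_a == 'STRING' and type_b.startswith('Q'):
            return type_b[1:]
        if type_b == 'STRING' and type_a.startswith('Q'):
            return type_a[1:]
        return type_a if type_a == type_b else 'STRING'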

A quick and dirty showcase (only for booleans and integers, and only for CSV :)):
korotkevics@8967bad

in.csv

a,b,c,d,e
,ho,hi,true
3,hu,he,

generate-schema --debugging_map --input_format=csv < in.csv > out.json; cat out.json

out.json

{
  "a": {
    "status": "soft",
    "info": {
      "mode": "NULLABLE",
      "name": "a",
      "type": "QINTEGER"
    }
  },
  "b": {
    "status": "hard",
    "info": {
      "mode": "NULLABLE",
      "name": "b",
      "type": "STRING"
    }
  },
  "c": {
    "status": "hard",
    "info": {
      "mode": "NULLABLE",
      "name": "c",
      "type": "STRING"
    }
  },
  "d": {
    "status": "soft",
    "info": {
      "mode": "NULLABLE",
      "name": "d",
      "type": "QBOOLEAN"
    }
  }
}

It would be nice to have similar logic triggered when the CSV format is chosen. Not sure if you'd like it :)

bxparks (Owner, Author) commented Feb 27, 2019

You are absolutely right that an empty CSV field must be handled specially. I missed that.

It looks like csv.DictReader distinguishes 2 cases:

  1. if the field is missing (the comma itself is missing), then DictReader returns a Python None
  2. if the field is empty (in other words, there exists a comma that delimits the field), then DictReader returns an empty string ""

The first case (None) is handled automatically by the current code.

In the second case, I think we just need to infer it to be a soft STRING. Then the rest of the type-inference logic works as expected.
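The distinction is easy to verify in the Python REPL (a quick demonstration, not code from this repository):

    import csv
    import io

    reader = csv.DictReader(io.StringIO('a,b,c\n1,\n'))
    print(next(reader))
    # {'a': '1', 'b': '', 'c': None} -- 'b' is empty, 'c' is missing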

I added a bunch of unit tests to cover these cases. Let me know if there's something I missed.

korotkevics (Contributor) commented Feb 27, 2019

Thank you!
As far as empty values are concerned, I think we are on the same page now, and your solution again turns out to be more compact :)
However, with regard to soft/hard I have different expectations; maybe it is simply because my take on this is different.
I branched from csv to csv-ext and shared what I meant there. I tweaked it under an if statement, and in the tests I had to use the --keep_nulls flag as well as change an expected FLOAT to an INTEGER (explained further below). Ideally, it shouldn't be tweaked like that, I guess.

In simple words, given an input file such as,

a,b,c,d,e
1,x,true,2.0
2,x,,,4
3,,,,

I would conclude that:

  • a is consistently filled in, always with an INTEGER, therefore it is hard to assume that type
  • b is only sometimes filled in, and when it is, the value is a STRING, therefore it is soft to assume that type
  • c is the same; it is soft to assume it is a BOOLEAN
  • d is always empty, so we'd assume it is a soft STRING by default
  • e is sometimes an INTEGER and sometimes a FLOAT; it is soft to assume it is an INTEGER (the closest approximation). UPDATE: well, I am not sure about this point, we'll have a look at it later.

bxparks (Owner, Author) commented Feb 28, 2019

I think you have a small typo in your example. If you want d to be always empty, I think you need to add an extra comma in Line 2, before the "2.0", like this:

a,b,c,d,e
1,x,true,,2.0
2,x,,,4
3,,,,

With this correction, d is always empty, so you need --keep_nulls to include it in the schema, to get:

[ 
  { 
    "mode": "NULLABLE",
    "name": "a",
    "type": "INTEGER"
  },
  { 
    "mode": "NULLABLE",
    "name": "b",
    "type": "STRING"
  },
  { 
    "mode": "NULLABLE",
    "name": "c",
    "type": "BOOLEAN"
  },
  { 
    "mode": "NULLABLE",
    "name": "d",
    "type": "STRING"
  },
  { 
    "mode": "NULLABLE",
    "name": "e",
    "type": "FLOAT"
  }
]

All of those seem right to me. Column e is inferred to be a FLOAT because "2.0" is a floating point number. bq load also interprets "2.0" as a FLOAT, so I think that's consistent.

I wonder if we are thinking about "soft" and "hard" in the same way. In the code, I use "soft" to mean the same as "provisional" or "temporary". When the script comes across a value whose type is not certain, it assigns a status of "soft" (usually, this is a null value, or an empty string ""). If the script then encounters a subsequent data record with a value whose type is well-defined, then it converts the status to "hard".

So columns b, c or e are not "soft" at the end of the processing, because there was at least one concrete value in those fields that allowed the type to be inferred definitively, i.e. "hard". Column d is the only one that remains "soft" after all lines are parsed, because that column has no concrete value.

korotkevics (Contributor) commented Feb 28, 2019

I see! It probably makes sense for large hierarchical JSON files.
My situation is that I wouldn't skip any such columns in a CSV file; I would only do the following (see the sketch after this list):

  • if a value is always filled in and the type is unambiguous, I'd conclude it is REQUIRED and of that type
  • if a value is empty all the time, I'd conclude it is NULLABLE and a STRING
  • if a value is only sometimes filled in and the type is unambiguous, I'd conclude it is NULLABLE and of that type
  • if a value is sometimes or always filled in and the type is ambiguous, I'd conclude it is the closest matching type, with STRING as the last resort
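A sketch of those four rules applied to a whole column (hypothetical, assuming the per-value types have already been inferred):

    def infer_column(values, types):
        # values: the raw strings of one CSV column;
        # types: the inferred types of its non-empty values
        empties = sum(1 for v in values if v in (None, ''))
        if empties == len(values):
            return ('NULLABLE', 'STRING')  # always empty
        mode = 'REQUIRED' if empties == 0 else 'NULLABLE'
        distinct = set(types)
        if len(distinct) == 1:
            return (mode, distinct.pop())  # unambiguous type
        if distinct <= {'INTEGER', 'FLOAT'}:
            return (mode, 'FLOAT')  # closest matching numeric type
        return (mode, 'STRING')  # STRING as the last resort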

As you might guess, one file could be insufficient to decide on the type (though usually it is sufficient), and it is certainly insufficient to decide on the mode.

However, if I arbitrarily decide that N files (concatenated, perhaps, from a technical point of view) give me that level of certainty, and if soft/hard meant what I thought it did, then I'd be all set :)

Long story short, I wonder if this inspires you to develop it further in the way I described (I could send one more PR for starters, too). Frankly, I think a different meaning of soft/hard could make sense for CSV: there, one doesn't distill a standard out of 'chaos' but rather has a standard from the very beginning and only needs to recognize the cases mentioned above. Then again, it could be a new concept, with a new field next to status, possibly triggered by flags.

korotkevics (Contributor) commented Feb 28, 2019

As far as my last sentence is concerned, it could work something like this:

in.csv

a
1
2
3

Processed with a new (not yet existing) flag --decide_on_mode, defaulting to false, this would give us:

  • in debugging mode:

{
  "a": {
    "status": "hard",
    "is_likely_required": "true",
    "info": {
      "mode": "NULLABLE",
      "name": "a",
      "type": "QINTEGER"
    }
  }
}
  • normally:

[
  {
    "mode": "NULLABLE",
    "name": "a",
    "type": "INTEGER"
  }
]

And with --decide_on_mode set to true,

{
  "a": {
    "status": "hard",
    "is_likely_required": "true",
    "info": {
      "mode": "REQUIRED",
      "name": "a",
      "type": "QINTEGER"
    }
  }
}
  • normally:

[
  {
    "mode": "REQUIRED",
    "name": "a",
    "type": "INTEGER"
  }
]

Here, is_likely_required would be determined in the manner in which I tweaked your take on soft and hard. :)

bxparks (Owner, Author) commented Feb 28, 2019

Hi,
I'm having difficulty understanding the behavior that you want from the script.

It sounds like you want to figure out if the mode of a column is NULLABLE or REQUIRED. However, the script is not designed to do that. It always assumes that the column is NULLABLE, because the resulting schema will allow the data file to be imported using bq load. This is consistent with the auto-schema detection of bq load. Can you explain why it would be useful for the script to determine if the column is REQUIRED?

Let me put this another way. The auto-schema detection algorithm of bq load uses only the first 100 lines of the CSV file. The goal of the bigquery-schema-generator script is to produce a schema that the bq load algorithm would have produced if it had examined all the lines of the file, not just the first 100.
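For comparison, a typical invocation pair looks something like this (the dataset and table names are placeholders):

    generate-schema --input_format=csv < file.csv > file.schema.json
    bq load --source_format=CSV --skip_leading_rows=1 mydataset.mytable file.csv file.schema.json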

korotkevics (Contributor) commented Feb 28, 2019

Exactly: apart from supporting the CSV file format, I'd like it to be able to resolve a column's mode.
My motivation is that if I restrict the mode early, based on a sufficient amount of data, then if I later receive a file with a missing value in a required column, that tells me something went wrong. It may not be universally helpful, but it is the reality I am dealing with, so I would not like to blindly set the mode to NULLABLE :)
bq auto-schema detection clearly has limitations for my purposes as well.
So, deciding on a column mode (say, with --decide_on_mode set to true, letting is_likely_required affect the mode) does not fit your vision for the library, correct?

bxparks (Owner, Author) commented Feb 28, 2019

Ok, that seems like a feature request which is separate from adding support for CSV. Can we agree to merge this PR to add support for reading a CSV file? Then can you file a feature request for the NULLABLE versus REQUIRED mode detection? In that ticket, describe as clearly as possible the behavior you would like. See if you can examine all the edge cases, particularly the distinction between a missing field and an empty field. Your use case is not something that I have come across myself, so if you can describe the concrete problem that you are trying to solve, that will help me understand your needs. I'm not ruling out adding the feature; it's just that I don't currently understand the pressing need.

korotkevics (Contributor) commented:
You're right, it makes sense to merge this feature now and discuss mode auto-detection in a dedicated PR. Not every user is as concerned about the mode as I am, so the current PR certainly has value already. I'll come up with a PR as soon as I can, so we can continue our discussion. Thank you for your time!

bxparks (Owner, Author) commented Mar 1, 2019

I had to make a bunch more changes to correctly support CSV files:

  • the order of fields must be preserved in the schema file for CSV
  • keep_nulls must be set to True automatically for CSV, otherwise the positional mapping from value to column becomes incorrect (see the sketch below)
  • updated README.md and various comments in the code
  • updated the unit tests to preserve the ordering of fields in the schema file
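A sketch of the keep_nulls override (illustrative only; the actual constructor may differ):

    class SchemaGenerator:
        def __init__(self, input_format='json', keep_nulls=False):
            # For CSV, every column must appear in the schema; dropping an
            # always-empty column would shift the positional mapping from
            # values to columns at load time.
            if input_format == 'csv':
                keep_nulls = True
            self.input_format = input_format
            self.keep_nulls = keep_nulls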

Will merge this into 'develop' branch and let it simmer there for a short while.

bxparks merged commit 0a7b33c into develop on Mar 1, 2019
bxparks deleted the csv branch on Mar 1, 2019 at 00:44