-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
encoding/csv: skipping of empty rows leads to loss of data in single-column datasets #39119
Comments
As mentioned, a quick look at RFC 4180 does not seem to explicitly describe what the behavior should be for empty lines. |
According to RFC 4180, section 2.1:
Which suggests that this is a bug in the reader code. However, I'm not going to be surprised if fixing this breaks lots of users due to Hyrum's law and we're forced to just document the errant behavior. Even though the package tries to follow RFC 4180, CSV files are one of those formats with many strange variants that do not folllow any specific grammar. |
Interesting that this used to work in go1.3 and stopped working in go1.4, but not because the reader behavior changed, but that the writer behavior changed. In go1.3, it would produce:
but in go1.4, it would produce:
The presence of an explicit |
The change in Go 1.4 is documented, but I guess not in user-visible documentation.
See #7586. |
If we're worried about breaking existing code, we could add a boolean flag, which defaults to the current behaviour and then we could discuss flipping it (and potentially removing it) for Go 2. If I understand the implementation correctly, then one cannot just initialise a Reader struct, one has to go through Something along the lines of this (haven't tested it, just sketching): diff --git a/src/encoding/csv/reader.go b/src/encoding/csv/reader.go
index c40aa506b0..cd2b0ccfc1 100644
--- a/src/encoding/csv/reader.go
+++ b/src/encoding/csv/reader.go
@@ -16,8 +16,8 @@
//
// Carriage returns before newline characters are silently removed.
//
-// Blank lines are ignored. A line with only whitespace characters (excluding
-// the ending newline character) is not considered a blank line.
+// Blank lines are ignored by default. A line with only whitespace characters
+// (excluding the ending newline character) is not considered a blank line.
//
// Fields which start and stop with the quote character " are called
// quoted-fields. The beginning and ending quote are not part of the
@@ -142,6 +142,9 @@ type Reader struct {
// By default, each call to Read returns newly allocated memory owned by the caller.
ReuseRecord bool
+ // If SkipBlankLines is true (default), rows with no data are skipped.
+ SkipBlankLines bool
+
TrailingComma bool // Deprecated: No longer used.
r *bufio.Reader
@@ -169,8 +172,9 @@ type Reader struct {
// NewReader returns a new Reader that reads from r.
func NewReader(r io.Reader) *Reader {
return &Reader{
- Comma: ',',
- r: bufio.NewReader(r),
+ Comma: ',',
+ SkipBlankLines: true,
+ r: bufio.NewReader(r),
}
}
@@ -268,7 +272,7 @@ func (r *Reader) readRecord(dst []string) ([]string, error) {
line = nil
continue // Skip comment lines
}
- if errRead == nil && len(line) == lengthNL(line) {
+ if r.SkipBlankLines && errRead == nil && len(line) == lengthNL(line) {
line = nil
continue // Skip empty lines
} |
Also, if this patch goes through, I'd suggest for Go 2 to have the default reverted - interpret blank lines by default and either add a |
Change https://golang.org/cl/237658 mentions this issue: |
We've pretty much decided that we are not going to add additional knobs to encoding/csv. There are just too many variations in CSV files out there. Trying to handle all of them would lead to a massive proliferation of knobs and an unusable package API. |
That's too bad, because two of the existing knobs are actually going against the "standard" - lazyquotes and comments, while this suggested knob actually tries to bring Go closer to standard compliance (while keeping backward compatibility). Anyway, where can I register interest in changing this in Go 2? No knobs, just changing the default - blank lines are perfectly fine and should be interpreted as data. |
There isn't going to be a big shift to Go 2. There may never be a Go 2, except perhaps for marketing reasons. For a package like this, there may someday be a encoding/csv/v2. Although personally I think that our experience with this package is that it would be better to drop it from the standard library, and encourage a variety of third party packages instead. |
Should the documentation be updated to indicate that RFC 4180 is not fully supported? I just ran into this myself, and reading the RFC
suggests strongly that blank lines should correspond to a single empty field. I realize there's already a note a couple of paragraphs down in the encoding/csv documentation that states blank lines are ignored, but it's not made clear that this behavior contradicts the RFC. |
Change https://golang.org/cl/363001 mentions this issue: |
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
I wrote a CSV with a single column and missing data.
What did you expect to see?
I expected to load the data back, intact.
What did you see instead?
I lost the missing values,
encoding/csv
skipped them as it skips blank lines. In this case, a blank line actually represents data.I'm not sure I understand the rationale behind skipping blank lines. Neither in terms of common practice (why would I have blank lines in my CSVs?) nor in terms of standards (the closest we have is RFC 4180 and I couldn't find anything about blank lines - so I'm not sure if Go follows it).
Here's a reproduction of the problem. I wrote a dataset into a file and was unable to read it back.
The text was updated successfully, but these errors were encountered: